From Tim@mail.powweb.com Fri Nov 1 00:07:10 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 18:07:10 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <20021031235145.C0376F59F@cashew.wolfskeep.com> Message-ID: There is considerable discussion in the original papers as to the usefulness of Spambayes not only for filtering spam, but also for altering the behavior of spammers. This second consideration is actually much more powerful than the first. If spam can be successfully dealt with, in a way that allows evolution as spammers evolve, then eventually spammers will be so restricted that their activities will not be profitable. THIS is the real goal. So, to that end, we must make Spambayes useful to a huge audience. That means the windoze platform (cough hack gag). Direct delivery users are almost invariably the crème de la crème of users. They will know how to wire Spambayes into their world, almost instinctively. But the people that we need to have using this product (because there are SO many of 'em) are the type who might actually have trouble configuring a pop3proxy... a simply braindead installation is required. So... unless we want this to simply be interesting research, we gotta take it to the masses.... 10/31/2002 5:51:45 PM, "T. Alexander Popiel" wrote: >In message: > Tim Stone Four Stones Forum writes: > >>But can it be useful to the masses? The pop3proxy is the right way to go >>in my opinion. > >You folks make me feel like such a fuddy-duddy, still using MH >from a shell account with the mailboxes fetched through the >filesystem, instead of through some network mailbox protocol... >Heck, I don't even have software to access a POP mailbox installed... > >I guess that raises the question: what is our target audience, >and how strictly do we want to cater to them? 
Do we want to >offer support for processing in direct-delivery situations, >even though it's only old-school fuddy-duddies like myself >who use them, anymore? > >- Alex > > > > - Tim www.fourstonesExpressions.com From neale@woozle.org Fri Nov 1 00:14:41 2002 From: neale@woozle.org (Neale Pickett) Date: 31 Oct 2002 16:14:41 -0800 Subject: [Spambayes] Database reduction In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > [cool database trick] The bigger problem, at least for hammie, is that pickling wordinfo instances makes huge strings, the majority of which is redundant information. When pickling a Bayes object, the pickler is smart enough not to repeatedly say "this is a wordinfo object" but rather, I assume, "this is of type 2", only having to name type 2 once. However, hammie pickles each wordinfo individually, keyed by a string. This makes for fast lookups, but giant databases. Tim just mentioned a performance tweak; is this an indicator that now would be a good time to resume trying to reduce hammie's database size? Neale From rmunn@pobox.com Fri Nov 1 00:37:12 2002 From: rmunn@pobox.com (Robin Munn) Date: Thu, 31 Oct 2002 18:37:12 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <20021031235145.C0376F59F@cashew.wolfskeep.com> References: <20021031235145.C0376F59F@cashew.wolfskeep.com> Message-ID: <20021101003712.GA28132@rmunnlfs> ---------------------- multipart/signed attachment On Thu, Oct 31, 2002 at 03:51:45PM -0800, T. Alexander Popiel wrote: > In message: > Tim Stone Four Stones Forum writes: > > >But can it be useful to the masses? The pop3proxy is the right way to go > >in my opinion. > > You folks make me feel like such a fuddy-duddy, still using MH > from a shell account with the mailboxes fetched through the > filesystem, instead of through some network mailbox protocol... > Heck, I don't even have software to access a POP mailbox installed... 
> > I guess that raises the question: what is our target audience, > and how strictly do we want to cater to them? Do we want to > offer support for processing in direct-delivery situations, > even though it's only old-school fuddy-duddies like myself > who use them, anymore? The "itch" that I'm scratching is that I'm tired of seeing all my non-techie friends using inferior technology because the quality open-source solutions are too complicated for them and/or have user-unfriendly interfaces. So I'm inclined to focus on solutions that will cater to the general public's needs first; techies capable of scratching their own itches are going to be a distant second on my priority list, personally. Certainly we should offer support for as many configurations as possible, including direct-delivery situations, but I want to first focus on a solution for the general non-techie public. I agree that pop3proxy is the optimal way to go, but it does require the ISP's cooperation to install. I also want a solution that the end user can install without needing the ISP's cooperation; something that could integrate into, say, Outlook Express and add a "Block this junk mail" button (which adds the message to the spam corpus) to the E-mail reading interface. Taking this kind of approach will lead to more work for us, but would make the project useful sooner for all kinds of users. What would be needed for a user-install-only interface? 1. It must integrate into the user's email client as seamlessly as possible. This means researching the plugin API of Outlook, Eudora, Pegasus Mail, Mozilla, et al. 2. The algorithm and filtering component must also run in the background without any user intervention required after the initial install. This means being able to install as a Windows NT service or into the StartUp folder of Windows 9x. 3. There *MUST* be good documentation. 
We all know the user is going to run the installer program before reading the documentation, but we must include a "How to train your filter to recognize junk mail" document that the installer displays after finishing installation. This means actually writing said documentation. :-) Those are the three things that I think are essential to a version of spambayes that can be installed and used profitably by non-techie end-users. Meanwhile, I'll try to help out with pop3proxy. -- Robin Munn http://www.rmunn.com/ PGP key ID: 0x6AFB6838 50FF 2478 CFFB 081A 8338 54F7 845D ACFD 6AFB 6838 ---------------------- multipart/signed attachment A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021031/2f1215fa/attachment.bin ---------------------- multipart/signed attachment-- From rmunn@pobox.com Fri Nov 1 00:46:00 2002 From: rmunn@pobox.com (Robin Munn) Date: Thu, 31 Oct 2002 18:46:00 -0600 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" Message-ID: <20021101004600.GB28132@rmunnlfs> ---------------------- multipart/signed attachment When we start writing user documentation, I propose using the term "junk mail" instead of "spam" and "non-junk mail" (or some other term) instead of "ham". I believe this will reduce confusion among non-techies who are encountering spam terminology for the first time. They'll have enough new ideas to learn trying to install and run a filter, let's not add jargon to what they have to learn. Other possibilities: "junk email" instead of "spam" "valid email" instead of "ham" "unwanted email" instead of "spam" "wanted email" instead of "ham" [Insert your clever idea here] Comments, anyone? 
-- Robin Munn http://www.rmunn.com/ PGP key ID: 0x6AFB6838 50FF 2478 CFFB 081A 8338 54F7 845D ACFD 6AFB 6838 ---------------------- multipart/signed attachment A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021031/25b0bfc8/attachment.bin ---------------------- multipart/signed attachment-- From guido@python.org Fri Nov 1 00:56:13 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 31 Oct 2002 19:56:13 -0500 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" In-Reply-To: Your message of "Thu, 31 Oct 2002 18:46:00 CST." <20021101004600.GB28132@rmunnlfs> References: <20021101004600.GB28132@rmunnlfs> Message-ID: <200211010056.gA10uDH02868@pcp02138704pcs.reston01.va.comcast.net> > When we start writing user documentation, I propose using the term > "junk mail" instead of "spam" and "non-junk mail" (or some other > term) instead of "ham". I believe this will reduce confusion among > non-techies who are encountering spam terminology for the first > time. They'll have enough new ideas to learn trying to install and > run a filter, let's not add jargon to what they have to learn. If they don't know the word "spam", they don't need a spam filter yet. I agree that we need something better than "ham". Non-spam works for me; "good mail" too. --Guido van Rossum (home page: http://www.python.org/~guido/) From Tim@mail.powweb.com Fri Nov 1 00:56:15 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 18:56:15 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <20021101003712.GA28132@rmunnlfs> Message-ID: Well, a pop3proxy is certainly capable of running on a client machine. See http://www.software.bisswanger.de/en/index.php?seite=smtp for an example of a similar proxy that inserts SMTPAuth into a non-SMTPAuth enabled mailer, such as Opera. 
That said, it would certainly be simpler to plug into the individual mailers in a much more seamless manner. I'm not quite sure that this is even possible with various mailers. If it is, great. If not, then running a proxy process locally is a reasonable solution, and easier to implement in the near term. I found this SMTP proxy by checking the Opera site when my host converted to SMTPAuth. The Opera folks felt like this was easy enough to install(which it was) to put in their faq as the answer to how to use SMTPAuth with their mailer. I think most folks could pull it off... The problem with plugging into mail clients is that the plugin architecture can change over time, which produces an ongoing maintenance effort. There will also be multiple codebases to maintain, as each plugin architecture (if one exists) will be different. There are dozens of different mail clients... consider AOL for example. Can we plug into their mailer? It's used by millions of people... - Tim S. On Thu, Oct 31, 2002 at 03:51:45PM -0800, T. Alexander Popiel wrote: > In message: > Tim Stone Four Stones Forum writes: > > >But can it be useful to the masses? The pop3proxy is the right way to go > >in my opinion. > > You folks make me feel like such a fuddy-duddy, still using MH > from a shell account with the mailboxes fetched through the > filesystem, instead of through some network mailbox protocol... > Heck, I don't even have software to access a POP mailbox installed... > > I guess that raises the question: what is our target audience, > and how strictly do we want to cater to them? Do we want to > offer support for processing in direct-delivery situations, > even though it's only old-school fuddy-duddies like myself > who use them, anymore? The "itch" that I'm scratching is that I'm tired of seeing all my non-techie friends using inferior technology because the quality open-source solutions are too complicated for them and/or have user-unfriendly interfaces. 
So I'm inclined to focus on solutions that will cater to the general public's needs first; techies capable of scratching their own itches are going to be a distant second on my priority list, personally. Certainly we should offer support for as many configurations as possible, including direct-delivery situations, but I want to first focus on a solution for the general non-techie public. I agree that pop3proxy is the optimal way to go, but it does require the ISP's cooperation to install. I also want a solution that the end user can install without needing the ISP's cooperation; something that could integrate into, say, Outlook Express and add a "Block this junk mail" button (which adds the message to the spam corpus) to the E-mail reading interface. Taking this kind of approach will lead to more work for us, but would make the project useful sooner for all kinds of users. What would be needed for a user-install-only interface? 1. It must integrate into the user's email client as seamlessly as possible. This means researching the plugin API of Outlook, Eudora, Pegasus Mail, Mozilla, et al. 2. The algorithm and filtering component must also run in the background without any user intervention required after the initial install. This means being able to install as a Windows NT service or into the StartUp folder of Windows 9x. 3. There *MUST* be good documentation. We all know the user is going to run the installer program before reading the documentation, but we must include a "How to train your filter to recognize junk mail" document that the installer displays after finishing installation. This means actually writing said documentation. :-) Those are the three things that I think are essential to a version of spambayes that can be installed and used profitably by non-techie end-users. Meanwhile, I'll try to help out with pop3proxy. 
-- Robin Munn http://www.rmunn.com/ PGP key ID: 0x6AFB6838 50FF 2478 CFFB 081A 8338 54F7 845D ACFD 6AFB 6838 10/31/2002 6:37:12 PM, Robin Munn wrote: - Tim www.fourstonesExpressions.com From Tim@mail.powweb.com Fri Nov 1 01:08:41 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 19:08:41 -0600 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" Message-ID: Unwanted email is the best idea in your list. This technology is useful for filtering mail from any number of sources, not just spammers. How about mail from an ex-* who won't leave you alone... While perhaps other kinds of mail are not quite as predictable as general spam, they could be filtered nonetheless... The reality is that the average person that will use this technology doesn't make a distinction between what we call spam and mail from their ex-boyfriend. It's all unwanted crap and they want to filter it. - Tim Robin said: When we start writing user documentation, I propose using the term "junk mail" instead of "spam" and "non-junk mail" (or some other term) instead of "ham". I believe this will reduce confusion among non-techies who are encountering spam terminology for the first time. They'll have enough new ideas to learn trying to install and run a filter, let's not add jargon to what they have to learn. Other possibilities: "junk email" instead of "spam" "valid email" instead of "ham" "unwanted email" instead of "spam" "wanted email" instead of "ham" [Insert your clever idea here] Comments, anyone? -- Robin Munn http://www.rmunn.com/ PGP key ID: 0x6AFB6838 50FF 2478 CFFB 081A 8338 54F7 845D ACFD 6AFB 6838 10/31/2002 6:46:00 PM, Robin Munn wrote: - Tim www.fourstonesExpressions.com From vanhorn@whidbey.com Fri Nov 1 01:20:58 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Thu, 31 Oct 2002 17:20:58 -0800 Subject: [Spambayes] Terminology in user documentation: "spam" vs. 
"junk mail" References: <20021101004600.GB28132@rmunnlfs> Message-ID: <3DC1D6FA.343FD9F0@whidbey.com> I think Spam and Ham will be perfectly clear to the users I support. They may be realtors, but they aren't braindead, and a little humor helps in teaching. Van Robin Munn wrote: > When we start writing user documentation, I propose using the term "junk > mail" instead of "spam" and "non-junk mail" (or some other term) instead > of "ham". I believe this will reduce confusion among non-techies who are > encountering spam terminology for the first time. They'll have enough > new ideas to learn trying to install and run a filter, let's not add > jargon to what they have to learn. > > Other possibilities: > > "junk email" instead of "spam" > "valid email" instead of "ham" > > "unwanted email" instead of "spam" > "wanted email" instead of "ham" > > [Insert your clever idea here] > > Comments, anyone? > > -- > Robin Munn > http://www.rmunn.com/ > PGP key ID: 0x6AFB6838 50FF 2478 CFFB 081A 8338 54F7 845D ACFD 6AFB 6838 > > ------------------------------------------------------------------------ > Part 1.2Type: application/pgp-signature -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From mhammond@skippinet.com.au Fri Nov 1 01:25:14 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 1 Nov 2002 12:25:14 +1100 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junkmail" In-Reply-To: Message-ID: > The reality is that the average person that will use this > technology doesn't make a distinction between what we call spam > and mail from their ex-boyfriend. > It's all unwanted crap and they want to filter it. Agreed. 
Most people in my social circle who have heard of spam often think it is a general term for "any mail with a large To list". I've received a number of legit mails starting with eg. "Sorry for the spam, but I thought this too funny not to send to all of you" (it wasn't ). At the end of the day though, this is an issue for the front-ends rather than the engine. The outlook addin tends to use "spam or unwanted email" and "good email". "junk email" seems good too. It wouldn't surprise me to find the word "spam" excised eventually. Front-end issues will probably drive small engine details though - nothing beats real experience with a tool Mark. From guido@python.org Fri Nov 1 01:31:07 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 31 Oct 2002 20:31:07 -0500 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes INTEGRATION.txt,NONE,1.1 In-Reply-To: Your message of "Thu, 31 Oct 2002 17:23:30 PST." References: Message-ID: <200211010131.gA11V8d03085@pcp02138704pcs.reston01.va.comcast.net> [Skip checked in:] > first scribbled notes about integrating Spambayes with different email > packages. Hm, maybe the spambayes website could be brought a bit more up to date too? --Guido van Rossum (home page: http://www.python.org/~guido/) From Tim@mail.powweb.com Fri Nov 1 01:33:01 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 19:33:01 -0600 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junkmail" In-Reply-To: Message-ID: You got it, Mark. Front ends are where this is really gonna happen, and your point is well taken. I filter 'FWD:' stuff, even if it's from someone I know, because it is invariably unwanted. In my Spambayes database, FWD: will have a spam weight of as close to 1 as possible... Can I manually configure that weight? lol So we've got two examples of unwanted email that has not really been represented in the current training corpus: unwanted individual mails (e.g. 
"I love you, please take me back") and forwards of urban legends, thoughts of the day, funny stories I've heard, and the like. It'd be interesting to see how the stats fall out if we were to (somehow) incorporate this kind of email into the current corpus. Another thought is that my definition of unwanted may certainly differ from your definition, even as it pertains to spam. Perhaps you really want to see something from quickinspirations.com (as unbelievable as that might seem to me... ;) Thus, the only really good terminology here is something like 'unwanted email'. - Tim 10/31/2002 7:25:14 PM, "Mark Hammond" wrote: >> The reality is that the average person that will use this >> technology doesn't make a distinction between what we call spam >> and mail from their ex-boyfriend. >> It's all unwanted crap and they want to filter it. > >Agreed. Most people in my social circle who have heard of spam often think >it is a general term for "any mail with a large To list". I've received a >number of legit mails starting with eg. "Sorry for the spam, but I thought >this too funny not to send to all of you" (it wasn't ). > >At the end of the day though, this is an issue for the front-ends rather >than the engine. The outlook addin tends to use "spam or unwanted email" >and "good email". "junk email" seems good too. It wouldn't surprise me to >find the word "spam" excised eventually. > >Front-end issues will probably drive small engine details though - nothing >beats real experience with a tool > >Mark. > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com 
From skip@pobox.com Fri Nov 1 01:25:32 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 31 Oct 2002 19:25:32 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <20021031211819.GC27454@rmunnlfs> References: <20021031211819.GC27454@rmunnlfs> Message-ID: <15809.55308.1945.988931@montanaro.dyndns.org> Robin> I just joined the spambayes mailing list a couple of days ago and Robin> have been trying to skim through the archives. It looks like a Robin> lot of time is being spent on algorithm refining and not as much Robin> time on email client integration or end-user documentation. That's true, largely because that's what the focus of the initial phase of the project was supposed to be. 
Even if it gets no farther than it is today, the process has been highly educational for me, because we have an expert in algorithm design (that'd be Tim) exposing his thought processes and mechanics for the rest of us. That said, I think the classification stuff has gone about as far as it's going to go. Future changes to the tokenizer are also likely to be incremental, so the major changes over the next while will be in email integration. Mark Hammond has done a terrific service for all the pointyhaired folks out there by adding some modules to the system which allow this stuff to work rather elegantly from Outlook (from what I can divine reading the list - I don't use Outlook). A number of other people have integrated it in various ways with other Unix-based mail systems. Jeremy and I both use VM from X/Emacs. We've approached the problem of integration somewhat differently. There's also the pop3proxy script which Richie Hindle wrote as another way of integrating spambayes into a MTA/MUA setup. Neil Schemenauer also has a pair of scripts he uses (look for neil*.py in the spambayes source tree). This has all sort of sporadically been "documented" in the mailing list. I just checked in INTEGRATION.txt to the CVS repository. Consider it a few scribbled notes about integration, based upon my own experience. I'm sure others have much to share as well. Skip From skip@pobox.com Fri Nov 1 01:37:56 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 31 Oct 2002 19:37:56 -0600 Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <20021101003712.GA28132@rmunnlfs> References: <20021031235145.C0376F59F@cashew.wolfskeep.com> <20021101003712.GA28132@rmunnlfs> Message-ID: <15809.56053.158.812689@montanaro.dyndns.org> Robin> The "itch" that I'm scratching is that I'm tired of seeing all my Robin> non-techie friends using inferior technology because the quality Robin> open-source solutions are too complicated for them and/or have Robin> user-unfriendly interfaces. I think Outlook users can eventually be handled by an installer which installs Mark's Outlook modules and Python if necessary. Should be point and shoot. They need not ever know that Python is there. Skip From skip@pobox.com Fri Nov 1 01:41:47 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 31 Oct 2002 19:41:47 -0600 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" In-Reply-To: <200211010056.gA10uDH02868@pcp02138704pcs.reston01.va.comcast.net> References: <20021101004600.GB28132@rmunnlfs> <200211010056.gA10uDH02868@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15809.56283.323513.587530@montanaro.dyndns.org> Guido> I agree that we need something better than "ham". Non-spam works Guido> for me; "good mail" too. I don't think that's necessarily the case. "Ham" has a certain panache. It rolls off the tongue better than anything else I've seen. It distinguishes Spambayes from the herd a bit, and may be a clever little bit of marketing. (I noted before that some people in the SpamAssassin community have picked up the term.) I wouldn't change it until it's demonstrated to be a liability. 
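On the "Database reduction" thread (Neale's message earlier in this digest and Skip's reply that follows), here is a minimal sketch of the two ideas under discussion: a compact state for each word record, and dropping words seen exactly once. The WordInfo class below is a hypothetical stand-in with only spamcount and hamcount fields, not the real Spambayes class, and dump_record, load_record and prune_singletons are illustrative helper names that do not come from the project. The sample words and counts are made up.

```python
import pickle


class WordInfo:
    """Hypothetical stand-in for a per-word record (not the real Spambayes class)."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        # Compact state: a bare tuple rather than the default attribute dict.
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state


def dump_record(info):
    """Pickle only the state tuple, so no class name is stored per record."""
    return pickle.dumps(info.__getstate__())


def load_record(data):
    """Rebuild a WordInfo from a pickled state tuple."""
    info = WordInfo.__new__(WordInfo)
    info.__setstate__(pickle.loads(data))
    return info


def prune_singletons(db):
    """Drop words seen exactly once (ham 1 / spam 0, or spam 1 / ham 0)."""
    return {
        word: info
        for word, info in db.items()
        if (info.spamcount, info.hamcount) not in ((1, 0), (0, 1))
    }


# Toy word database keyed by string, as in the thread; contents are invented.
db = {
    "python": WordInfo(spamcount=0, hamcount=55),
    "mortgage": WordInfo(spamcount=40, hamcount=2),
    "xyzzy": WordInfo(spamcount=1, hamcount=0),
    "plugh": WordInfo(spamcount=0, hamcount=1),
}

pruned = prune_singletons(db)
stored = {word: dump_record(info) for word, info in pruned.items()}
restored = load_record(stored["mortgage"])

print(sorted(pruned))                         # ['mortgage', 'python']
print(restored.spamcount, restored.hamcount)  # 40 2
```

Pickling a WordInfo instance directly would go through the same __getstate__ and __setstate__ hooks, but each individually pickled instance would still carry a reference to its class, which is the per-record overhead Neale describes. Whether either change is worthwhile for hammie's actual database is not something these messages settle.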
Skip From skip@pobox.com Fri Nov 1 01:34:31 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 31 Oct 2002 19:34:31 -0600 Subject: [Spambayes] Database reduction In-Reply-To: References: Message-ID: <15809.55847.349091.23441@montanaro.dyndns.org> Neale> When pickling a Bayes object, the pickler is smart enough not to Neale> repeatedly say "this is a wordinfo object" but rather, I assume, Neale> "this is of type 2", only having to name type 2 once. However, Neale> hammie pickles each wordinfo individually, keyed by a string. Neale> This makes for fast lookups, but giant databases. You can always define your own __getstate__ and __setstate__ methods for the Wordinfo class which processes a more compact form of the object's state. Or am I misunderstanding what you said? Neale> Tim just mentioned a performance tweak; is this an indicator that Neale> now would be a good time to resume trying to reduce hammie's Neale> database size? I reduced the size of my database significantly after my training run by deleting wordinfo where the hamcount was 1 and the spamcount was 0 or vice versa. Skip From skip@pobox.com Fri Nov 1 01:29:14 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 31 Oct 2002 19:29:14 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <20021031235145.C0376F59F@cashew.wolfskeep.com> References: <20021031235145.C0376F59F@cashew.wolfskeep.com> Message-ID: <15809.55530.598489.521312@montanaro.dyndns.org> Alex> I guess that raises the question: what is our target audience, and Alex> how strictly do we want to cater to them? On a sheer numbers basis, your target audience is definitely Outlook and Outlook Express users. The rest of it is just noise. Mark Hammond seems to have taken good care of the Outlook users. Skip From tim.one@comcast.net Fri Nov 1 02:32:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 21:32:04 -0500 Subject: [Spambayes] Terminology in user documentation: "spam" vs. 
"junk mail" In-Reply-To: <15809.56283.323513.587530@montanaro.dyndns.org> Message-ID: [Guido] > I agree that we need something better than "ham". > Non-spam works for me; "good mail" too. [Skip Montanaro] > I don't think that's necessarily the case. "Ham" has a certain > panache. It rolls of the tongue better than anything else I've > seen. It distinguishes Spambayes from the herd a bit, and may > be a clever little bit of marketing. > (I noted before that some people in the SpamAssassin community have > picked up the term.) I wouldn't change it until it's demonstrated > to be a liability. Another data point: I gave a live demo of the Outlook 2000 client last week, to a group of people who were taking a Python+Zope training class at Zope Corp. They laughed out loud at the "spam vs ham" distinction, which surprised me because I've come to think of them as purely technical terms identifying a region in chi-squared probability space. That may intensify suspicions that they were laughing at me instead of with me , but I do think they genuinely enjoyed the word play. The only thing that got a bigger laugh was that a "Want a BIG penis NOW?" spam happened to arrive during the demo. If the choice is between "spam" and "ham", or "spam" and "big penis", I'm weakly in favor of the former. buncha-stuffed-shirts-ly y'rs - tim From tim.one@comcast.net Fri Nov 1 02:38:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 21:38:26 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <15809.55530.598489.521312@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > On a sheer numbers basis, your target audience is definitely Outlook and > Outlook Express users. The rest of it is just noise. This is so sadly true. If Netscape Communicator still survives in some form, that would be a good one too. I have a sister who uses that, and better that someone else try to make her happy. 
> Mark Hammond seems to have taken good care of the Outlook users. Indeed he is, and it is indeed lots of hard work, and Outlook has to be the most difficult email program in the universe to hook up with. Except for Outlook Express, which doesn't appear to offer any programming hooks at *all* (Outlook may supply thousands, to judge by the number of toes that have been broken stumbling into them so far). From Tim@mail.powweb.com Fri Nov 1 02:40:27 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 20:40:27 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <15809.55530.598489.521312@montanaro.dyndns.org> Message-ID: The latest figures that I can find from Microsoft are that Outlook has 57% market share, Notes has 29%, Browser based email is 9%, and the rest is split between cc:Mail, GroupWise, Outlook Express, the Exchange client, and 2% of "Other." So Outlook is certainly the low hanging fruit here, but Notes is a big client as well. The Notes market will be a bit more difficult to reach, because as a product it is even more closed than Outlook... (imagine that...) But I think that Notes users comprise a good sized segment, and I don't believe that there are any effective filters for that product. That said, a POP3 proxy won't work for Notes, because Notes servers are not POP3... it's a replication scheme. So there may not be any hope for the Notes thing. So, I think we should do a nice integration with Outlook (which it seems as though Mark has already handled), and Outlook Express if possible, do a Pop3proxy that can be used in most other circumstances, and leave the special cases to those who are interested enough to do whatever integration they wish... That documentation effort should be enough to keep Robin busy for a while... ;) - Tim 10/31/2002 7:29:14 PM, Skip Montanaro wrote: > > Alex> I guess that raises the question: what is our target audience, and > Alex> how strictly do we want to cater to them? 
> >On a sheer numbers basis, your target audience is definitely Outlook and >Outlook Express users. The rest of it is just noise. Mark Hammond seems to >have taken good care of the Outlook users. > >Skip > > > > - Tim www.fourstonesExpressions.com From mhammond@skippinet.com.au Fri Nov 1 02:52:30 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 1 Nov 2002 13:52:30 +1100 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: > The latest figures that I can find from Microsoft are that > Outlook has 57% market share, Notes has 29%, Browser based email > is 9%, and the rest is split > between cc:Mail, GroupWise, Outlook Express, the Exchange client, > and 2% of "Other." So Outlook is certainly the low hanging fruit > here, but Notes is a big > client as well. This is really for "internet mail", whereas I bet that the figures above are "corporate" users. Most corporate users I have spoken to simply dont have a large spam problem - their work address is rarely publically posted to the Web, and the corporate internet mail gateway tends to have some rudimentary spam filtering anyway. Basically, I would be surprised to find many Lotus Notes users with a spam problem. > The Notes market will be a bit more difficult to > reach, because as a product it is even more closed than > Outlook... (imagine that...) I find that a strange comment given the integration already achieved with Outlook. From an extensibility point-of-view, Outlook is almost as open as I can imagine. Mark. From neale@woozle.org Fri Nov 1 02:56:20 2002 From: neale@woozle.org (Neale Pickett) Date: 31 Oct 2002 18:56:20 -0800 Subject: [Spambayes] non-ascii mail in hammiecli In-Reply-To: <200210311817.57268.tdickenson@geminidataloggers.com> References: <200210311817.57268.tdickenson@geminidataloggers.com> Message-ID: So then, Toby Dickenson is all like: > hammiecli is giving its input an xmlrpc binary wrapper, to avoid marshalling > problems with non-ascii input. 
However hammiesrv wasn't doing the same for > its output. > > diff attached

Rippin', Toby! I've checked in your patch. Thanks!

Neale

From jeremy@alum.mit.edu Fri Nov 1 02:55:32 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Thu, 31 Oct 2002 21:55:32 -0500
Subject: [Spambayes] Fwd: Re: [Design] Contacts (Michael R. Bernstein)
Message-ID: <15809.60708.8788.803101@slothrop.zope.com>

---------------------- multipart/mixed attachment
Interesting effect. I signed up for a couple of new mailing lists (concerning the Kapor PIM project). The discussions on them seem to be very different than the stuff I usually get, and the conclusion is that at least some of it is definitely spam.

It's unfortunate that they get marked as spam instead of unsure. Getting mail in a new subject area means that the classifier will make some wildly wrong guesses until you get enough new training data.

Of the 14 new messages, I see 1 ham, 3 spam, and 10 unsure. I've forwarded one of the high-scoring spams.

Jeremy

Score: 0.998870303948

Clues
-----
*H*                      0.00155925012726
*S*                      0.999299858023
wrote:                   0.00338600451467
michael                  0.0155709342561
book,                    0.0167286245353
except                   0.0196506550218
joe                      0.0266272189349
(in                      0.0302013422819
from:"michael            0.0348837209302
interface                0.0348837209302
thu,                     0.0412844036697
false                    0.0652173913043
(so                      0.0918367346939
foundation               0.0918367346939
machine,                 0.0918367346939
machine.                 0.0918367346939
origin                   0.0918367346939
shared                   0.0918367346939
subject:Design           0.0918367346939
widely                   0.0918367346939
play                     0.140636581012
(ie.                     0.155172413793
(there                   0.155172413793
apps                     0.155172413793
belong                   0.155172413793
chandler                 0.155172413793
claiming                 0.155172413793
count,                   0.155172413793
horses                   0.155172413793
managing,                0.155172413793
scripting                0.155172413793
solved                   0.155172413793
subset                   0.155172413793
tool.                    0.155172413793
trojan                   0.155172413793
header:In-Reply-To:1     0.212519247953
header:Errors-To:1       0.237464007184
can't                    0.321673801617
proto:http               0.662469900861
information              0.662953865426
one                      0.665248119685
header:Return-Path:1     0.665758340634
allow                    0.670346027038
good                     0.671104201444
get                      0.671415294108
last                     0.673179750848
place.                   0.674836986007
are                      0.679601700319
has                      0.681104376636
used                     0.688719773647
way                      0.695441491093
distribution             0.695807170463
wrong                    0.695807170463
users                    0.69606431081
having                   0.696133185508
needs                    0.699732035795
skip:c 10                0.700608177332
other                    0.702538098269
want                     0.708850616487
e-mail                   0.715394926258
running                  0.716462999299
should                   0.718351527065
someone                  0.722025540013
part                     0.723159240531
access                   0.726693881975
also                     0.727229519921
more                     0.740683224491
user                     0.741338432978
share                    0.743372688795
may                      0.745213686825
because                  0.746505250332
data                     0.747224736221
without                  0.749466559635
all                      0.751250287138
verify                   0.752508123366
bit                      0.75613165431
code                     0.756797827917
these                    0.757105434828
sign                     0.760030537028
read                     0.761534404086
address                  0.762088324731
header:Message-Id:1      0.764545352748
open                     0.76530284221
list                     0.76574131017
>from                    0.770761722645
avoid                    0.770761722645
entry                    0.770761722645
only                     0.772148590346
provided                 0.773230674523
first                    0.776410372426
different                0.778505146553
copy                     0.779604640492
further                  0.783390240127
allows                   0.785722995011
here.                    0.785722995011
large                    0.785722995011
building                 0.788593119438
they're                  0.788593119438
then                     0.789698777271
whether                  0.791623214186
those                    0.795849646208
annoying                 0.797771282426
make                     0.800014773042
try                      0.804967405411
saying                   0.807062436029
mailing                  0.814445440614
again.                   0.815142308819
real                     0.815142308819
must                     0.817697541998
applications             0.81854602306
installed                0.81854602306
skip:i 10                0.822492441981
even                     0.823228458985
person                   0.823924036541
call                     0.825417398415
secure                   0.826130452181
skip:p 10                0.826254664449
isn't                    0.826344529636
full                     0.829689461024
once                     0.831240651842
people                   0.832090040931
skip:a 10                0.834052950494
joel                     0.838351988458
requiring                0.838351988458
cannot                   0.843680980631
case                     0.843791821294
results                  0.844279808423
...and                   0.844827586207
hacked                   0.844827586207
techniques,              0.844827586207
sharing                  0.84571021875
capabilities             0.845855006747
outlook                  0.845855006747
same                     0.848300323403
computer                 0.851704776271
every                    0.855780693872
response                 0.861670606819
easy                     0.878526274377
prevent                  0.8924082233
key                      0.893108945959
nothing                  0.898726242139
happen                   0.904700604011
2..                      0.908163265306
trust                    0.908217126704
private                  0.919486477907
computer,                0.923563305955
easily                   0.929335171741
application              0.935918946825
security                 0.939824371865
contacts                 0.948037345934
list,                    0.970027495049
data.                    0.973372781065

---------------------- multipart/mixed attachment
An embedded message was scrubbed...
From: "Michael R. Bernstein"
Subject: Re: [Design] Contacts
Date: 31 Oct 2002 13:46:57 -0800
Size: 5181
Url: http://mail.python.org/pipermail/spambayes/attachments/20021031/022226d6/attachment.txt
---------------------- multipart/mixed attachment--

From tim.one@comcast.net Fri Nov 1 03:01:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 31 Oct 2002 22:01:49 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To:
Message-ID:

[Tim@mail.powweb.com]
> The Notes market will be a bit more difficult to
> reach, because as a product it is even more closed than
> Outlook... (imagine that...)

[Mark Hammond]
> I find that a strange comment given the integration already achieved
> with Outlook. From an extensibility point-of-view, Outlook is almost
> as open as I can imagine.

Indeed, that's part of the *problem* with Outlook, isn't it?
There are so many different ways to hook into it (the Outlook object model, the MAPI substrate, the Collaboration Data Objects layer, ...) I can't even hold them all in my head, and it's never clear which of seemingly dozens of ways to get a thing done may actually work. It's the very definition of poke-and-hope programming. I did some Notes programming in a previous life, and hope never to do so again. That was more a matter of wading through seemingly dozens of ways *not* to get a thing done, hoping to paste enough failure modes together so that the end result almost appeared to work some of the time. BTW, Notes was the purest example of a whole being greater than the sum of its parts I've ever seen: each piece (from email to database) sucked, but the whole was nevertheless very useful for workgroup collaboration. From Tim@mail.powweb.com Fri Nov 1 03:03:37 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 21:03:37 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: Well you're talkin to a Notes user with a HUGE spam problem... hundreds per day. And to make it worse, corporate users often cannot simply change their address to shut off the flow. You may be right about the corporate versus personal user. This research really doesn't specify which market they're talking about. I suspect that the lion's share of personal use is through hotmail, yahoo mail, and other web based mailers, or AOL, earthlink, or other specialized clients, which are beasts of a different nature... can we enable these? So what are we saying here? I think i'm getting (giving) mixed messages about what/who we should be targeting. Outlook is expensive, thus mostly corporations use it. There is a lot of research that suggests that corporations have a huge spam problem. 
Non-web-based personal use may not be the most productive area to enable, but it certainly is a visible segment, and when people at home can deal with spam effectively, they'll take that story to work.... So I propose we enable Outlook, Mozilla (for our OS brethren, which will likely get Netscape, too), have a pop3proxy that can be run locally and easily configured to be used by a number of "noise" mailers, and go from there... - Tim 10/31/2002 8:52:30 PM, "Mark Hammond" wrote: >> The latest figures that I can find from Microsoft are that >> Outlook has 57% market share, Notes has 29%, Browser based email >> is 9%, and the rest is split >> between cc:Mail, GroupWise, Outlook Express, the Exchange client, >> and 2% of "Other." So Outlook is certainly the low hanging fruit >> here, but Notes is a big >> client as well. > >This is really for "internet mail", whereas I bet that the figures above are >"corporate" users. Most corporate users I have spoken to simply dont have a >large spam problem - their work address is rarely publically posted to the >Web, and the corporate internet mail gateway tends to have some rudimentary >spam filtering anyway. > >Basically, I would be surprised to find many Lotus Notes users with a spam >problem. > >> The Notes market will be a bit more difficult to >> reach, because as a product it is even more closed than >> Outlook... (imagine that...) > >I find that a strange comment given the integration already achieved with >Outlook. From an extensibility point-of-view, Outlook is almost as open as >I can imagine. > >Mark. > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com
From tim.one@comcast.net Fri Nov 1 03:20:55 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 22:20:55 -0500 Subject: [Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein) In-Reply-To: <15809.60708.8788.803101@slothrop.zope.com> Message-ID: [Jeremy Hylton] > Interesting effect. I signed up for a couple of new mailing lists > (concerning the Kapor PIM project).
The discussions on them seem to > be very different than the stuff I usually get, and the conclusions it > that at least some of it is definitely spam. > > It's unfortunate that they get marked as spam instead of unsure. It > means that getting mail in a new subject area means that the > classifier will make some wildly wrong guesses until you get enough > new training data. If it gets a high spam score, that simply reflects how you've trained; e.g., for a tech guy to have the word "computer" as high-spamprob word is suspicious all by itself: > computer 0.851704776271 Other oddities: > applications 0.81854602306 > installed 0.81854602306 > full 0.829689461024 > once 0.831240651842 > people 0.832090040931 > results 0.844279808423 > ...and 0.844827586207 > hacked 0.844827586207 > techniques, 0.844827586207 Those four look like hapaxes, to judge from the scores. > computer, 0.923563305955 > application 0.935918946825 > security 0.939824371865 > contacts 0.948037345934 > list, 0.970027495049 > data. 0.973372781065 I can only assume you've only trained it on Shakespeare ham . Even stranger are the words that *don't* show up with high spamprobs for you. I haven't signed up for this list, but my personal classifier scored the attachment very differently: Spam Score: 2.85542e-008 '*H*' 1 '*S*' 1.40346e-010 'wrote:' 0.001868 'subject:: [' 0.0142256 'url:mailman' 0.0151898 'url:listinfo' 0.0182172 'url:lists' 0.0193688 That you didn't have those as low-spamprob words suggests you've trained on almost no mailing-list ham. 'otoh,' 0.0215311 'interface' 0.0310106 '(so' 0.0348837 'scripting' 0.0348837 'thu,' 0.0412844 'url:org' 0.0474607 'solved' 0.050232 'it).' 0.0505618 'x-mailer:ximian evolution 1.0.8' 0.0505618 'header:In-Reply-To:1' 0.0539033 'false' 0.0564499 'header:Errors-To:1' 0.0634285 '>from:' 0.0652174 '[snip]' 0.0652174 'origin' 0.0652174 'pointless' 0.0652174 'protocol' 0.0652174 'share' 0.076162 'subject:] ' 0.0842127 'tool.' 
0.0918367 'challenge' 0.0918367 'machine,' 0.0918367 'api,' 0.0918367 'techniques,' 0.0918367 'subset' 0.0918367 'key,' 0.0918367 'apps' 0.0918367 'except' 0.0982036 '(in' 0.114502 'copy' 0.12034 'problem' 0.12813 'encrypted' 0.132432 'returned' 0.138575 'quite' 0.140119 'foundation' 0.150981 'compromise' 0.155172 'automating' 0.155172 'widely.' 0.155172 'url:osafoundation' 0.155172 'key.' 0.155172 'horses' 0.155172 'header:Received:5' 0.156821 'running' 0.158385 'obviously' 0.160346 'user' 0.168307 'machine.' 0.181282 'interesting' 0.186509 'code' 0.191394 'feature' 0.196331 '(there' 0.197397 'book,' 0.197397 'list' 0.199755 'also' 0.202693 'probably' 0.205907 'environment.' 0.208559 'mine' 0.208559 'michael' 0.213552 'those' 0.215792 'entry' 0.218192 'installed' 0.232266 'requiring' 0.245609 'belong' 0.245609 'distribution' 0.248786 'shared' 0.253444 'but' 0.254219 'data' 0.257012 '>from' 0.262199 'insecure' 0.262199 'open' 0.268785 'think' 0.276856 'wrong' 0.283272 'which' 0.284943 'saying' 0.287973 'source' 0.291711 'his' 0.293522 'application' 0.306776 'should' 0.307599 'address' 0.308855 'machine' 0.309309 'users' 0.321312 'e-mail' 0.32136 'public' 0.322013 'keys' 0.334772 "can't" 0.334823 'were' 0.338587 'using' 0.345285 'needs' 0.34717 'used' 0.348371 'mailing' 0.349775 'having' 0.351608 'header:Message-Id:1' 0.353295 'part' 0.356067 'bit' 0.359997 'skip:s 10' 0.361335 "it's" 0.362434 'anyone' 0.365533 'once' 0.369773 "won't" 0.370267 'avoid' 0.370705 'provided' 0.376177 "isn't" 0.379705 'joel' 0.382591 'widely' 0.382591 'with' 0.384731 'there' 0.390536 'that' 0.394845 'even' 0.600998 'sharing' 0.605368 'key' 0.60538 'header:Return-Path:1' 0.616771 'every' 0.617388 'easy' 0.62261 'full' 0.625898 'large' 0.628002 'trust' 0.673698 'capabilities' 0.674899 'computer,' 0.682833 'secure' 0.689415 'results' 0.693054 'happen' 0.711906 'contacts' 0.716121 'here.' 0.717214 'place.' 0.720453 'easily' 0.750357 'further' 0.766401 'information' 0.778198 'again.' 
0.779556 'securely,' 0.844828 'trusting' 0.844828 'trojan' 0.844828 'hacked' 0.844828 'from:"michael' 0.844828 'claiming' 0.844828 'response' 0.878172 'list,' 0.881299 '2..' 0.886933 'header:Mime-Version:1' 0.889037 'data.' 0.889756 'url:design' 0.908163 '...and' 0.969799 'wealth' 0.97619 And that you didn't get 'wealth' as a high-spamprob word suggests something even weirder. > Of the 14 new messages, I see 1 ham, 3 spam, and 10 unsure. I've > forwarded one of the high scoring spams. Train on them; it will learn what you teach it, and nothing else. From skip@pobox.com Fri Nov 1 03:30:16 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 31 Oct 2002 21:30:16 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: References: Message-ID: <15809.62792.793436.43805@montanaro.dyndns.org> Tim> Well you're talkin to a Notes user with a HUGE spam problem... I take it Notes is responsible for screwing up your return address... ;-) Tim> So I propose we enable Outlook, Mozilla (for our OS brethren, which Tim> will likely get Netscape, too), have a pop3proxy that can be run Tim> locally and easily configured to be used by a number of "noise" Tim> mailers, and go from there... Like all things open source, what gets implemented depends on who has an itch that needs scratching. Outlook was an obvious early choice, not primarily because of its market share, but because Mark Hammond is an expert in the area of Python/Windows integration and Tim happens to use Outlook as his mail reader. Mark couldn't have asked for a better beta tester. Mozilla will happen when someone who uses Mozilla wants/needs it. One thing to consider about mail programs is that outside the realm of people whose software toolchest is dictated to them (generally corporate types), most folks probably find something that works and then stick with it until there is an overwhelming reason to change. 
I have a MacOSX laptop now but still use XEmacs+VM to read mail with a peculiar method of getting mail off my server. It took me a fair amount of time to decide to switch from rmail to VM many years ago. There are lots of reasons for this sort of inertia, but I think it's generally dominated by familiarity with the user interface, (perceived) difficulty converting mailboxes, and fear that new tools won't be as stable. So, go right ahead, solve the Notes integration problem. We're here to lend moral support. ;-) Skip From Tim@mail.powweb.com Fri Nov 1 03:41:05 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Thu, 31 Oct 2002 21:41:05 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <15809.62792.793436.43805@montanaro.dyndns.org> Message-ID: Hehe... point taken. There I go, thinking like a marketer again. Present the requirement to the developers and it'll magically get done... :) As for Notes, that's S.E.P. If my company doesn't care that I take 30 minutes out of my day to sort through my inbox, so be it. My personal mail is my itch... so the pop3proxy is the scratch for me... looking forward to the relief! "Regular email user dude trys out Spambayes. Stay tuned, news at 11..." 10/31/2002 9:30:16 PM, Skip Montanaro wrote: > > Tim> Well you're talkin to a Notes user with a HUGE spam problem... > >I take it Notes is responsible for screwing up your return address... ;-) > > Tim> So I propose we enable Outlook, Mozilla (for our OS brethren, which > Tim> will likely get Netscape, too), have a pop3proxy that can be run > Tim> locally and easily configured to be used by a number of "noise" > Tim> mailers, and go from there... > >Like all things open source, what gets implemented depends on who has an >itch that needs scratching. 
Outlook was an obvious early choice, not >primarily because of its market share, but because Mark Hammond is an expert >in the area of Python/Windows integration and Tim happens to use Outlook as >his mail reader. Mark couldn't have asked for a better beta tester. > >Mozilla will happen when someone who uses Mozilla wants/needs it. > >One thing to consider about mail programs is that outside the realm of >people whose software toolchest is dictated to them (generally corporate >types), most folks probably find something that works and then stick with it >until there is an overwhelming reason to change. I have a MacOSX laptop now >but still use XEmacs+VM to read mail with a peculiar method of getting mail >off my server. It took me a fair amount of time to decide to switch from >rmail to VM many years ago. There are lots of reasons for this sort of >inertia, but I think it's generally dominated by familiarity with the user >interface, (perceived) difficulty converting mailboxes, and fear that new >tools won't be as stable. > >So, go right ahead, solve the Notes integration problem. We're here to lend >moral support. ;-) > >Skip > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From anthony@interlink.com.au Fri Nov 1 04:17:07 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 01 Nov 2002 15:17:07 +1100 Subject: [Spambayes] Fwd: [Spambayes-checkins] spambayes timtest.py,1.29,1.30 Message-ID: <200211010417.gA14H7009458@localhost.localdomain> ---------------------- multipart/mixed attachment Note for anyone with their own test harnesses that aren't checked into the CVS. I've updated timtest and timcv (the only users of the spam/ham keep options that I could find) but if you use your own, you'll need to make a change. Anthony ---------------------- multipart/mixed attachment An embedded message was scrubbed... 
From: "Anthony Baxter" Subject: [Spambayes-checkins] spambayes timtest.py,1.29,1.30 Date: Thu, 31 Oct 2002 20:13:13 -0800 Size: 4422 Url: http://mail.python.org/pipermail/spambayes/attachments/20021101/76527f7a/attachment.txt ---------------------- multipart/mixed attachment-- From anthony@interlink.com.au Fri Nov 1 04:31:11 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 01 Nov 2002 15:31:11 +1100 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" In-Reply-To: <15809.56283.323513.587530@montanaro.dyndns.org> Message-ID: <200211010431.gA14VBT09600@localhost.localdomain> A non-techie data point - my wife got what I meant by 'ham' ("everything that's not spam is ham") and thought it was a good term to use. I'd prefer it to something clumsy like 'wanted email' or suchlike. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Fri Nov 1 04:50:36 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 01 Nov 2002 15:50:36 +1100 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes INTEGRATION.txt,NONE,1.1 In-Reply-To: <200211010131.gA11V8d03085@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <200211010450.gA14oai09774@localhost.localdomain> >>> Guido van Rossum wrote > [Skip checked in:] > > first scribbled notes about integrating Spambayes with different email > > packages. > > Hm, maybe the spambayes website could be brought a bit more up to date > too? I've just chucked an 'Applications' page up there now. People should feel free to add more. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Fri Nov 1 05:35:35 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 01 Nov 2002 16:35:35 +1100 Subject: [Spambayes] package-ifying spambayes. Message-ID: <200211010535.gA15Zat10734@localhost.localdomain> I'd like to think about turning spambayes into something a little more suitable for installing (e.g. 
with setup.py). At the moment we install directly into site-packages, and with module names like 'tokenizer', 'Histogram', 'msgs', 'TestDriver' and 'Options', this is a bit naughty :) This would be my current suggestions for how to organise it, but I'd like other suggestions, too: prefix/lib/python/site-packages/ spambayes/ -- main body of code chi2.py classifier.py hammie.py heapq.py -- if not py > 2.2 Histogram.py mboxutils.py msgs.py Options.py sets.py -- if not py > 2.2 TestDriver.py Tester.py tokenizer.py scripts: the current setup.py installs all of the following as scripts: cmp.py HistToGNU.py neiltrain.py splitndirs.py unheader.py hammiecli.py loosecksum.py rates.py table.py hammie.py mboxcount.py rebal.py timcv.py hammiesrv.py mboxtest.py runtest.sh timtest.py I'd suggest that the only things that should be installed by default as scripts are: hammiecli.py hammiesrv.py hammie.py neiltrain.py neilfilter.py -- but maybe with a better name? unheader.py -- but maybe with a better name? pop3proxy.py -- but maybe with a better name? Anyone else see anything I missed? Anthony From B-Morgan@concentric.net Fri Nov 1 06:30:35 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Thu, 31 Oct 2002 23:30:35 -0700 Subject: [Spambayes] Outlook Plugin Questions In-Reply-To: Message-ID: Where should future questions about the Outlook 2000 plugin be directed? I've got a fairly large set of Rules Wizard rules that separate my incoming mail into folders. Where does the Spambayes plugin fit into this? I think I'd like it to be "first" so that the rules wizard rules just operate on what should be ham only. I've used SpamWeasel and SpamAssassin Pro to help me build my initial spam corpus. Both of these products have added header fields with their "conclusions". Will the presence of these fields have any adverse effect on this code? After following the instructions in the About... text, I've added the Spam field to my inbox display but it contains #ERROR for all messages. 
Is there something else I need to do? Is there a "remove" or "uninstall" procedure? If not, is one needed? Thanks for your continued efforts. I'll be happy to help with whatever I can. Regards, Brad Morgan From tim.one@comcast.net Fri Nov 1 06:56:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 01 Nov 2002 01:56:11 -0500 Subject: [Spambayes] RE: Outlook Plugin Questions In-Reply-To: Message-ID: [Brad Morgan] > Where should future questions about the Outlook 2000 plugin be directed? I think right here for now. Some people will probably want that to go to a different mailing list, but while people are still in the early stages of integrating this code, *I* think it's valuable to get the chance to read about everyone's tribulations and triumphs. There are lots of UI issues that are going to be puzzles for everyone. > I've got a fairly large set of Rules Wizard rules that separate > my incoming mail into folders. Where does the Spambayes plugin fit into > this? At the wrong end at the moment, and possibly forever. > I think I'd like it to be "first" so that the rules wizard rules just > operate on what should be ham only. It may not be technically feasible to go first. Mark is hooking "item appears in folder" events, and those appear to trigger after the Rules Wizard is done. Outlook doesn't have a full object model, and the Rules Wizard appears to be unhookable. You can get much the same effect in the end, though: In the Outlook addin's Define Filters dialog, select *every one* of your destination folders in the "Filter the following folders as msgs arrive" folder selector. It's a multi-selection tree view, so this is easy. The Rules Wizard runs first regardless, but each folder you selected alerts the addin whenever a msg ends up there. The addin can then move or copy (your choice) the msg to an Unsure or Spam folder (as appropriate). > I've used SpamWeasel and SpamAssassin Pro to help me build my initial spam > corpus. 
Both of these products have added header fields with their > "conclusions". Will the presence of these fields have any adverse effect > on this code? By default, we ignore almost all header fields now. So, no. > After following the instructions in the About... text, I've added the Spam > field to my inbox display but it contains #ERROR for all messages. Is > there something else I need to do? I don't know; it's a new one on me; Mark may have a better clue, but I wouldn't count on it at once -- the Outlook API for setting custom fields is a mess, and doesn't appear to work as documented. We've both spent hours today trying to make better sense of it, and this subsystem is likely to change. In the meantime, just get rid of the Spam column if the #ERROR things bother you. The score is probably still there. If you're puzzled by any particular msg, select and click Anti-Spam -> Show spam clues for current msg. > Is there a "remove" or "uninstall" procedure? If not, is one needed? For what? If just the Outlook addin, cd to the Outlook2000 directory and run python addin.py --unregister for now. I suppose something fancier may get added later. > Thanks for your continued efforts. I'll be happy to help with whatever I > can. You're helping already! Trying to use a thing is the best test we can get now, and thanks. From vanhorn@whidbey.com Fri Nov 1 07:42:05 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Thu, 31 Oct 2002 23:42:05 -0800 Subject: [Spambayes] Email client integration -- what's needed? References: Message-ID: <3DC2304C.D634ADDD@whidbey.com> Tim Peters wrote: > [Skip Montanaro] > > On a sheer numbers basis, your target audience is definitely Outlook and > > Outlook Express users. The rest of it is just noise. > > This is so sadly true. If Netscape Communicator still survives in some > form, that would be a good one too. I have a sister who uses that, and > better that someone else try to make her happy. 
Outlook Express clearly has the big numbers among the numb, and Outlook among the pointy haired, but looking over my incoming mail at various accounts I don't think they are the majority yet. Bigger than anything else, but there is a lot of "anything else" out there. Thank God. Van -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From anthony@interlink.com.au Fri Nov 1 07:54:47 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 01 Nov 2002 18:54:47 +1100 Subject: [Spambayes] 'sender' and 'reply-to' tokenising. Message-ID: <200211010754.gA17slK11570@localhost.localdomain> comments in tokenizer.py: # Dang -- I can't use Sender:. If I do, # 'sender:email name:python-list-admin' # becomes the most powerful indicator in the whole database. # # From: # this helps both rates # Reply-To: # my error rates are too low now to tell about this # # one (smalls wins & losses across runs, overall # # not significant), so leaving it out So now we have things like h/s mean/sdev, we get more useful data. I tried enabling tokenization of both 'sender' and 'reply-to' (and both) along with the 'from' line. The left-hand column is the default. 
filename:         from  from+sender  from+sender+replyto  from+replyto
ham:spam:   11192:1826   11192:1826           11192:1826    11192:1826
fp total:            7            6                    7             6
fp %:             0.06         0.05                 0.06          0.05
fn total:            5            4                    5             4
fn %:             0.27         0.22                 0.27          0.22
unsure t:           80           82                   80            81
unsure %:         0.61         0.63                 0.61          0.62
real cost:      $91.00       $80.40               $91.00        $80.20
best cost:      $28.00       $27.20               $28.20        $25.80
h mean:           0.62         1.32                 0.63          1.11
h sdev:           4.27         4.42                 4.19          4.19
s mean:          98.69        98.66                98.68         98.65
s sdev:           7.69         7.86                 7.74          7.92
mean diff:       98.07        97.34                98.05         97.54
k:                8.20         7.93                 8.22          8.05

Summary: 'sender' was an across-the-board lose for me. It knocked out a fp
and a fn, but did considerable damage to both ham mean and sdev, and spam
mean and sdev. 'reply-to' tightened up ham scores, and loosened spam scores
(but not as much). I'd suggest re-enabling reply-to with the following patch:

--- tokenizer.py        31 Oct 2002 15:43:55 -0000      1.59
+++ tokenizer.py        1 Nov 2002 07:51:34 -0000
@@ -1082,10 +1082,9 @@
         #     becomes the most powerful indicator in the whole database.
         #
         # From:         # this helps both rates
-        # Reply-To:     # my error rates are too low now to tell about this
-        #               # one (smalls wins & losses across runs, overall
-        #               # not significant), so leaving it out
-        for field in ('from',):
+        # Reply-To:     # this tightens up ham for me (anthony) and makes spam
+        #               # slightly worse (but the ham improvement is more)
+        for field in ('from', 'reply-to'):
             prefix = field + ':'
             x = msg.get(field, 'none').lower()
             for w in x.split():

Someone else want to repeat this test?

Anthony

From richie@entrian.com Fri Nov 1 09:17:12 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 01 Nov 2002 09:17:12 +0000
Subject: [Spambayes] Re: pop3proxy bug?
In-Reply-To:
References:
Message-ID: <71h4suo5nknl0sifno0q2vql97jaf0hs9b@4ax.com>

> Once I've reproduced the problem on Linux, I'll apply, test and
> commit that fix - thanks.

Done. And all without a Linux box. Three very slooooow cheers for Bochs and
mxCGIPython...
I'm still not sure the pop3proxy self-test works properly on Linux, but I think that's a threading issue in the test code itself - the main program works fine. -- Richie Hindle richie@entrian.com From skip@pobox.com Fri Nov 1 14:33:10 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 1 Nov 2002 08:33:10 -0600 Subject: [Spambayes] package-ifying spambayes. In-Reply-To: <200211010535.gA15Zat10734@localhost.localdomain> References: <200211010535.gA15Zat10734@localhost.localdomain> Message-ID: <15810.37030.424053.826099@montanaro.dyndns.org> Anthony> I'd like to think about turning spambayes into something a Anthony> little more suitable for installing (e.g. with setup.py). At Anthony> the moment we install directly into site-packages, and with Anthony> module names like 'tokenizer', 'Histogram', 'msgs', Anthony> 'TestDriver' and 'Options', this is a bit naughty :) Go for it. My knowledge of distutils is minimal, at best. I seem to recall asking about this awhile ago. Your structure seems fine to me. Anthony> scripts: the current setup.py installs all of the following as Anthony> scripts: Anthony> cmp.py HistToGNU.py neiltrain.py splitndirs.py unheader.py Anthony> hammiecli.py loosecksum.py rates.py table.py Anthony> hammie.py mboxcount.py rebal.py timcv.py Anthony> hammiesrv.py mboxtest.py runtest.sh timtest.py Anthony> I'd suggest that the only things that should be installed by Anthony> default as scripts are: Anthony> hammiecli.py Anthony> hammiesrv.py Anthony> hammie.py Anthony> neiltrain.py Anthony> neilfilter.py -- but maybe with a better name? Anthony> unheader.py -- but maybe with a better name? Anthony> pop3proxy.py -- but maybe with a better name? I take it you're segregating the script population into stuff which is generally useful and stuff which is useful only for testing? Seems reasonable. Again, my lack of distutils experience rears its ugly head. 
I think an install_test target might be useful (as in "python setup.py
install_test") but don't know how to implement it and was too lazy to figure
it out at the time.

Skip

From tdickenson@devmail.geminidataloggers.co.uk Fri Nov 1 14:34:33 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 1 Nov 2002 14:34:33 +0000
Subject: [Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein)
In-Reply-To:
References:
Message-ID: <200211011433.02149.tdickenson@geminidataloggers.com>

On Friday 01 November 2002 3:20 am, Tim Peters wrote:

> e.g., for a tech guy to have the word "computer" as high-spamprob word is
> suspicious all by itself:
>
> computer 0.851704776271

It's a 0.88 for me too, due to "If you want to make money with your computer"
spam.

From skip@pobox.com Fri Nov 1 14:39:50 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 08:39:50 -0600
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
In-Reply-To: <200211010754.gA17slK11570@localhost.localdomain>
References: <200211010754.gA17slK11570@localhost.localdomain>
Message-ID: <15810.37430.792347.935488@montanaro.dyndns.org>

    Anthony> So now we have things like h/s mean/sdev, we get more useful
    Anthony> data. I tried enabling tokenization of both 'sender' and
    Anthony> 'reply-to' (and both) along with the 'from' line.

I have a patch locally to generate to and cc tokens on a per-domain basis
(e.g. "To: skip@mojam.com, james@bond.net" would generate "to:@mojam.com" and
"to:@bond.net"). While this might not be useful for the current crop of
users, I think it will be useful for people who have abandoned email
addresses in the past for whatever reason. *@calendar.com gets nothing but
spam these days, for example, though I don't publicize it any longer (it
still works).

I'm headed out right now but will generate some new results while I wait for
my car to be serviced and get something out later today.
Skip

From agmsmith@rogers.com Fri Nov 1 15:13:46 2002
From: agmsmith@rogers.com (Alexander G. M. Smith)
Date: Fri, 01 Nov 2002 10:13:46 EST (-0500)
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail"
In-Reply-To: <20021101004600.GB28132@rmunnlfs>
Message-ID: <733561455-BeMail@CR593174-A>

Robin Munn wrote:
> Other possibilities:
>   "unwanted email" instead of "spam"
>   "wanted email" instead of "ham"
>
> [Insert your clever idea here]

I use "Genuine" mail for the good stuff in my documentation:
http://members.rogers.com/agmsmith/beos/AGMSBayesianSpam.Documentation/index.html

Also remember to include a reference to the Monty Python skit, otherwise
your documentation won't be complete!

- Alex

From skip@pobox.com Fri Nov 1 16:26:21 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 10:26:21 -0600
Subject: [Spambayes] tokenizing to: and cc:
Message-ID: <15810.43821.983763.629427@montanaro.dyndns.org>

I made a change to tokenizer.py (just locally for now) to tokenize the
domains mentioned in to: and cc: headers:

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.59
diff -c -r1.59 tokenizer.py
*** tokenizer.py        31 Oct 2002 15:43:55 -0000      1.59
--- tokenizer.py        1 Nov 2002 16:17:23 -0000
***************
*** 6,11 ****
--- 6,12 ----
  import email
  import email.Message
  import email.Errors
+ import email.Utils
  import re
  import math
  import time
***************
*** 1098,1104 ****
          for field in ('to', 'cc'):
              count = 0
              for addrs in msg.get_all(field, []):
!                 count += len(addrs.split(','))
              if count > 0:
                  yield '%s:2**%d' % (field, round(log2(count)))
--- 1099,1112 ----
          for field in ('to', 'cc'):
              count = 0
              for addrs in msg.get_all(field, []):
!                 addrs = map(email.Utils.parseaddr, addrs.split(','))
!                 count += len(addrs)
!                 if options.generate_to_domains:
!                     # also generate tokens containing the destination domain
!                     for name,addr in addrs:
!                         yield '%s:@%s' % (field,
!                                           (addr.split("@")[1:] or
!                                            ["local"])[0])
              if count > 0:
                  yield '%s:2**%d' % (field, round(log2(count)))
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.63
diff -c -r1.63 Options.py
*** Options.py  28 Oct 2002 20:19:46 -0000      1.63
--- Options.py  1 Nov 2002 16:19:38 -0000
***************
*** 111,116 ****
--- 111,120 ----
  # spam indicator.
  replace_nonascii_chars: False

+ # If true, generate tokens from the to and cc fields containing the destination
+ # domains, e.g. 'To: skip@pobox.com' would generate to:@pobox.com
+ generate_to_domains: False
+
  [TestDriver]
  # These control various displays in class TestDriver.Driver, and Tester.Test.

***************
*** 323,328 ****
--- 327,333 ----
                    'basic_header_tokenize': boolean_cracker,
                    'basic_header_tokenize_only': boolean_cracker,
                    'basic_header_skip': ('get', lambda s: Set(s.split())),
+                   'generate_to_domains': boolean_cracker,
                    'replace_nonascii_chars': boolean_cracker,
                    },
      'TestDriver': {'nbuckets': int_cracker,

I think this should be turned into a separate pass over the to: and cc:
headers to simplify the logic and move the option test out of the inner
loop.

Here's a summary of the results:

    % python table.py base.txt to.txt
    -> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    ...
    filename:       base        to
    ham:spam:  2000:2000 2000:2000
    fp total:          8         7
    fp %:           0.40      0.35
    fn total:         21        22
    fn %:           1.05      1.10
    unsure t:         95        87
    unsure %:       2.38      2.17
    real cost:   $120.00   $109.40
    best cost:    $79.80    $79.60
    h mean:         0.79      0.79
    h sdev:         7.43      7.43
    s mean:        97.41     97.46
    s sdev:        12.53     12.53
    mean diff:     96.62     96.67
    k:              4.84      4.84

base.ini is

    [TestDriver]
    show_unsure: True

to.ini is

    [Tokenizer]
    generate_to_domains: True

    [TestDriver]
    show_unsure: True

All things considered, I think it did pretty well for me. It dropped the
unsure percentage a bit and spread the ham and spam means a bit further
apart.
As I mentioned earlier, I think this option may be useful for people with
inactive, but still operational, email addresses. Over time, those addresses
will tend to get nothing but spam. (It would thus be important to not train
on messages sent to those addresses before or shortly after during
abandonment.)

Should I rework the patch and check it in?

Skip

From tdickenson@devmail.geminidataloggers.co.uk Fri Nov 1 16:42:49 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 1 Nov 2002 16:42:49 +0000
Subject: [Spambayes] hammie appending headers
Message-ID: <200211011642.49043.tdickenson@devmail.geminidataloggers.co.uk>

Hammie currently appends the X-Hammie-Disposition header. Any existing
X-Hammie-Disposition headers are left intact. I think we should be removing
them, to prevent spammers (or testers ;-) adding headers that confuse
downstream filters.

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.33
diff -c -4 -r1.33 hammie.py
*** hammie.py   27 Oct 2002 22:56:15 -0000      1.33
--- hammie.py   1 Nov 2002 16:39:25 -0000
***************
*** 266,273 ****
--- 266,274 ----
          else:
              disp = "Unsure"
          disp += "; %.2f" % prob
          disp += "; " + self.formatclues(clues)
+         del msg[header]
          msg.add_header(header, disp)
          return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))

      def train(self, msg, is_spam):

From francois.granger@free.fr Fri Nov 1 17:47:10 2002
From: francois.granger@free.fr (François Granger)
Date: Fri, 1 Nov 2002 18:47:10 +0100
Subject: [Spambayes] Recently discovered this work.
Message-ID:

Config: MacOS 9.1 MacPython 2.2.2

I started developping on the idea of bayesian filtering by end of august
after reading the article.
It took me time (spare time) to arrive at the point where I had a set of script with most functionalities needed to do it. I discovered few days ago the work you have done. I guess I can stop my development because It can't compare to yours. Along the time, I discovered two issues. The email package is fragil at decoding Eudora messages with enclosure wether I get them by OSA (similar to COM on windows) or by direct access to the mbox files. I went back to using the rfc822 package instead because it was more robust if less sophisticated. I don't know if this come from Eudora not being conforming to the standards. I downloaded your software and tried to use the tokenizer on my stored mail messages to understand how it was working. I can't make it works even modifying it a little. If anyone is interested, I did a small script to show the issue. If anyone is interested, I can send both the script and a mail message on wich it hangs. As a side note I have two more questions. The current software, as downloaded from SF on Oct 29 seems to be difficult to use on MacOS 9. I would be interrested in having the Pop3 proxy version working. The other way of using such a filter would be to have "plug In" to interract with the various mail clients. I implemented it in my development and have three plugs in for mails stored as file, for Eudora and for Entourage. They are not really nice but the idea is there. What about multilingual situation. On average, I think I get spam splitted like this: 80% is english, 12% is french 5% is spanish and 3% is german. Not counting asian ones wich I easily filter on encoding and strange chars. How this technique would do on such a situation ? I started to develop a language discriminator in order to automatically sort by main language and then use frequency databases for each language. I don't know if this is needed ? -- Le courrier électronique est un moyen de communication. 
Les gens devraient se poser des questions sur les implications politiques
des choix (ou non choix) de leurs outils et technologies.
Pour des courriers propres : http://minilien.com/?IXZneLoID0 -
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim.one@comcast.net Fri Nov 1 20:24:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 15:24:42 -0500
Subject: [Spambayes] Recently discovered this work.
In-Reply-To:
Message-ID:

[François Granger]
> Config: MacOS 9.1 MacPython 2.2.2

I don't have experience with either, but others here do. Keep posting here
until they admit it .

> I started developping on the idea of bayesian filtering by end of
> august after reading the article. It took me time (spare time) to
> arrive at the point where I had a set of script with most
> functionalities needed to do it. I discovered few days ago the work
> you have done. I guess I can stop my development because It can't
> compare to yours.

Unclear: at least yours worked for you!

> Along the time, I discovered two issues.
>
> The email package is fragil at decoding Eudora messages with
> enclosure wether I get them by OSA (similar to COM on windows) or by
> direct access to the mbox files. I went back to using the rfc822
> package instead because it was more robust if less sophisticated. I
> don't know if this come from Eudora not being conforming to the
> standards.

I don't know either; we would need specific examples.

> I downloaded your software and tried to use the tokenizer on my
> stored mail messages to understand how it was working. I can't make
> it works even modifying it a little. If anyone is interested, I did a
> small script to show the issue. If anyone is interested, I can send
> both the script and a mail message on wich it hangs.

Hangs? That's hard to imagine -- there aren't any unbounded loops in the
tokenizer.
It could be that a regexp search is taking a very long time, although I
tried to cut the legs off that possibility too.

> As a side note I have two more questions.
>
> The current software, as downloaded from SF on Oct 29 seems to be
> difficult to use on MacOS 9. I would be interrested in having the
> Pop3 proxy version working. The other way of using such a filter
> would be to have "plug In" to interract with the various mail
> clients. I implemented it in my development and have three plugs in
> for mails stored as file, for Eudora and for Entourage. They are not
> really nice but the idea is there.

Sorry, I didn't find a question in there.

> What about multilingual situation. On average, I think I get spam
> splitted like this: 80% is english, 12% is french 5% is spanish and
> 3% is german. Not counting asian ones wich I easily filter on
> encoding and strange chars.

There appears no need to special-case Asian spam with this code. It
generates a bunch of tokens that are virtually unique to Asian spam, and
they quickly get very high spamprobs upon training. The non-default option

    [Tokenizer]
    replace_nonascii_chars: True

accelerates learning for Asian spam, but at the cost of replacing *all*
high-bit chars.

> How this technique would do on such a situation ?

Can't say: you didn't say how much of your ham (non-spam) is English,
French, Spanish and German. I expect it will work fine, as all those
languages (as opposed to some Asian languages) use whitespace too, and the
tokenizer merely splits on whitespace. This code is *certainly* better than
I am at distinguishing ham from spam in non-English languages, but that's
not saying much. Try it!

> I started to develop a language discriminator in order to
> automatically sort by main language and then use frequency databases
> for each language. I don't know if this is needed ?

I doubt it's necessary, and somewhat doubt it would even be helpful.
The tokenizer has no concept of semantics, it's just crunching strings, and
doesn't know beans about English as opposed to anything else. You may need
more training to get comparable results, or maybe not. Nobody has tested
this yet.

> --
> Le courrier électronique est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies.

My personal email classifier was sure your msg was ham:

    Spam Score: 2.82082e-007
    '*H*' 0.999999
    '*S*' 6.70774e-009

but some of the French words in your sig had high spamprobs:

    'sur' 0.908163
    'les' 0.969799
    'est' 0.973373

This reflects that I personally get a lot more French spam than French ham.
Your classifier is very likely to score these differently, and that's a
great strength of the system for personal use; do note that it has no idea
these words *are* French. It doesn't even know they're words, for that
matter .

From tim.one@comcast.net Fri Nov 1 20:39:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 15:39:01 -0500
Subject: [Spambayes] tokenizing to: and cc:
In-Reply-To: <15810.43821.983763.629427@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> I made a change to tokenizer.py (just locally for now) to tokenize the
> domains mentioned in to: and cc: headers:
>
> ...
>           for field in ('to', 'cc'):
>               count = 0
>               for addrs in msg.get_all(field, []):
> !                 addrs = map(email.Utils.parseaddr, addrs.split(','))
> !                 count += len(addrs)
> !                 if options.generate_to_domains:
> !                     for name,addr in addrs:
> !                         yield '%s:@%s' % (field,
> !                                           (addr.split("@")[1:] or
> !                                            ["local"])[0])
>               if count > 0:
>                   yield '%s:2**%d' % (field, round(log2(count)))
> ...
> I think this should be turned into a separate pass over the to: and cc:
> headers to simplify the logic and move the option test out of the inner
> loop.

The time cost is trivial.
> Here's a summary of the results:
>
>     % python table.py base.txt to.txt
>     -> tested 200 hams & 200 spams against 1800 hams & 1800 spams
>     ...
>     filename:       base        to
>     ham:spam:  2000:2000 2000:2000
>     fp total:          8         7
>     fp %:           0.40      0.35
>     fn total:         21        22
>     fn %:           1.05      1.10
>     unsure t:         95        87
>     unsure %:       2.38      2.17
>     real cost:   $120.00   $109.40
>     best cost:    $79.80    $79.60

This says you could have got more benefit simply by changing your ham_cutoff
and spam_cutoff values. If you had picked the best possible in both cases,
the total difference would have been 1 unsure msg (79.80 - 79.60 = 0.20, the
default "cost" of one unsure). See your "all runs" histograms for more info
about that.

>     h mean:         0.79      0.79
>     h sdev:         7.43      7.43
>     s mean:        97.41     97.46
>     s sdev:        12.53     12.53
>     mean diff:     96.62     96.67
>     k:              4.84      4.84
> ...
> All things considered, I think it did pretty well for me. It dropped the
> unsure percentage a bit

Changing cutoffs can do the same.

> and spread the ham and spam means a bit further apart.

A change of 0.05 relative to 96.62 is insignificant.

> As I mentioned earlier, I think this option may be useful
> for people with inactive, but still operational, email addresses.
> Over time, those addresses will tend to get nothing but spam. (It
> would thus be important to not train on messages sent to those
? addresses before or shortly after during abandonment.)
>
> Should I rework the patch and check it in?

I'm -0, but would become +1 if it really helped someone, or nailed cases
that can't be nailed via a more general gimmick. For example, what if you
were to introduce an option to fully tokenize To: and Cc: addresses instead?
We don't even catch "Undisclosed Recipients" now. We ignore addressees by
default because it becomes a killer-strong clue for bogus reasons when
training with mixed-source corpora (e.g., To: bruceg@whatever is in
thousands & thousands of BruceG's spams).
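
For readers who want to experiment with the per-domain To:/Cc: tokens
discussed in this thread, here is a minimal sketch in modern Python 3. It is
not the patch under review: the function name to_cc_tokens and its keyword
argument are made up for illustration, and it uses email.utils.getaddresses
in place of the patch's split-on-comma plus parseaddr, on the assumption
that display names containing commas should still count as one addressee.
The token shapes ('to:@domain', 'to:2**n', and 'local' for addresses with no
domain) follow the quoted diff.

```python
import math
from email import message_from_string
from email.utils import getaddresses


def to_cc_tokens(msg, generate_to_domains=True):
    """Yield addressee-count and (optionally) per-domain tokens for To:/Cc:.

    Hypothetical helper for illustration only; the real tokenizer is a
    method of a larger class and reads its switch from an options object.
    """
    for field in ('to', 'cc'):
        # getaddresses() parses every header instance into (name, addr)
        # pairs and copes with commas inside quoted display names.
        addrs = getaddresses(msg.get_all(field, []))
        if generate_to_domains:
            for _name, addr in addrs:
                # Same fallback as the quoted diff: no '@' means "local".
                yield '%s:@%s' % (field, (addr.split('@')[1:] or ['local'])[0])
        if addrs:
            # Bucket the addressee count by powers of two.
            yield '%s:2**%d' % (field, round(math.log2(len(addrs))))


msg = message_from_string(
    'To: skip@mojam.com, james@bond.net\n'
    'Cc: "Bond, James" <james@bond.net>\n'
    '\n'
    'body\n'
)
tokens = list(to_cc_tokens(msg))
print(tokens)
```

With the option switched off only the count tokens remain, which is the
behaviour the thread describes as the default.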
From tim.one@comcast.net Fri Nov 1 20:49:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 01 Nov 2002 15:49:46 -0500 Subject: [Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein) In-Reply-To: <200211011433.02149.tdickenson@geminidataloggers.com> Message-ID: [Tim Peters, on Jeremy's poorly-scoring example] > e.g., for a tech guy to have the word "computer" as > high-spamprob word is suspicious all by itself: > >> computer 0.851704776271 [Toby Dickenson] > Its a 0.88 for me too, due to "If you want to make money with > your computer" spam. I believe it. In context, Jeremy had many computer*ish* words scoring with high spamprobs, and many mailing-list lexicalisms not scoring with low spamprobs, and some obvious spam words not scoring with high spamprobs. Jeremy has said in the past that he's inclined to train only on mistakes, and I've raised as many cautions about that as I can. The system was intended from the start to be trained on a random sampling of all your ham and spam. Every time someone has sent me a "surprising msg", my personal classifier has absolutely nailed it in the correct category; I don't think that's because I know a secret way to start Python , but suspect it's because I've made sustained attempts to train my personal classifier on a "random slice of real life" every day (including a representative sampling of duplicates when I get a single ham or spam from multiple sources). This gives it a reality-driven view of the probabilities instead of a mistake-driven view, and also adapts its view of both as time goes on. From francois.granger@free.fr Fri Nov 1 20:51:33 2002 From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger) Date: Fri, 1 Nov 2002 21:51:33 +0100 Subject: [Spambayes] Email client integration -- what's needed? 
Message-ID:

At 18:07 -0600 on 31/10/02, in message Re: [Spambayes] Email client
integration -- what's need, Tim@mail.powweb.com, Stone@mail.powweb.com,
Four Stones Expre wrote:

>but also for altering the behavior of
>spammers. This second consideration is actually much more powerful
>than the first.
>
>So... unless we want this to simply be interesting research, we
>gotta take it to the masses....

I think that this is the real aim: making it so hard for the spammers
that they stop.

For this, I think that we need a server version, not too strict, for the
mail server. It may catch 80 to 90% of the spam. And a client version,
maybe as a pop3 proxy, to remove the remaining spam at the user level.

--
Email is a means of communication. People should ask themselves about
the political implications of the choices (or non-choices) of their
tools and technologies. For clean email:
http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/
http://expita.com/nomime.html

From tim.one@comcast.net Fri Nov 1 21:48:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 16:48:48 -0500
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
In-Reply-To: <200211010754.gA17slK11570@localhost.localdomain>
Message-ID:

[Anthony Baxter]
> comments in tokenizer.py:
>
> # Dang -- I can't use Sender:. If I do,
> # 'sender:email name:python-list-admin'
> # becomes the most powerful indicator in the whole database.
> #
> # From: # this helps both rates
> # Reply-To: # my error rates are too low now to tell about this
> # # one (smalls wins & losses across runs, overall
> # # not significant), so leaving it out
>
> So now we have things like h/s mean/sdev, we get more useful data.

I'm tempted to drop them! mean/sdev were useful under schemes with real
systematic overlap between the population scores, but chi-combining is
so extreme that overlaps simply aren't due to random effects.
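To see why chi-combining is called "extreme" here, the following is a rough Python sketch of Fisher-style chi-squared combining of per-word spam probabilities. It is an illustration of the general technique under my own assumptions, not a copy of the project's classifier code: the names `chi2q` and `chi_combine` are hypothetical, probabilities must lie strictly between 0 and 1, and the underflow handling a real implementation needs for very long messages is left out.

```python
import math


def chi2q(x2, v):
    """Upper-tail probability of a chi-squared variate with v degrees
    of freedom, valid for even v only (closed-form series)."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine word spam probabilities (each strictly in (0, 1)) into
    one score in [0, 1]; 0.5 means the evidence cancels out."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_ham = sum(math.log(p) for p in probs)
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


print(chi_combine([0.99] * 10))               # strongly spammy clues
print(chi_combine([0.01] * 10))               # strongly hammy clues
print(chi_combine([0.99] * 5 + [0.01] * 5))   # conflicting clues
```

With clues that agree, the score is pushed hard against one end of the scale, and only genuinely conflicting evidence lands near the middle. That is the sense in which leftover overlap between ham and spam scores is not a matter of random spread.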
> I tried enabling tokenization of both 'sender' and 'reply-to' (and both) > along with the 'from' line. The left-hand column is the default. > > filename: from from+sender from+sender+replyto > from+replyto > ham:spam: 11192:1826 11192:1826 > 11192:1826 11192:1826 > fp total: 7 6 7 6 > fp %: 0.06 0.05 0.06 0.05 > fn total: 5 4 5 4 > fn %: 0.27 0.22 0.27 0.22 > unsure t: 80 82 80 81 > unsure %: 0.61 0.63 0.61 0.62 > real cost: $91.00 $80.40 $91.00 $80.20 > best cost: $28.00 $27.20 $28.20 $25.80 > h mean: 0.62 1.32 0.63 1.11 > h sdev: 4.27 4.42 4.19 4.19 > s mean: 98.69 98.66 98.68 98.65 > s sdev: 7.69 7.86 7.74 7.92 > mean diff: 98.07 97.34 98.05 97.54 > k: 8.20 7.93 8.22 8.05 > > Summary: 'sender' was an across-the-board lose for me. I disagree: it looks like it had no significant effect either way; indeed, I don't see a solid difference across any of these runs. > It knocked out a fp and a fn, but did considerable damage to both > ham mean and sdev, and spam mean and sdev. This really doesn't matter for chi-combining. The arithmetic mean is supremely sensitive to scores "at the wrong end", and so is the sdev. The percentiles shown at the top of the all-runs histogram displays are much better measures for extreme schemes (the median is barely affected at all by a single exceptionally large or small value; the mean can be affected a lot). It's likely that some of your extreme FN and FP simply got even more extreme across these runs, and just a handful of "bad case" msgs can have a large effect on mean and sdev. Under gary-combining, where the population means were sometimes less than 50 points apart, and scores of both kinds near 50 were relatively common, the k value seemed to have good predictive power (high k <-> gary-combining worked well on the corpus). But under chi-combining, it appears to have none (indeed, in the table above, the two runs with the lowest error rates and lowest "best cost" were the two with the *lowest* k values, not the highest). 
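Two points above can be checked with a few lines of Python. First, the k values in the table are consistent with k = (s mean - h mean) / (h sdev + s sdev); that formula is inferred from the quoted numbers, not stated in the thread. Second, a handful of scores "at the wrong end" moves the mean and sdev a lot while leaving the median alone. The score lists below are made up for illustration, and `k_from_summary` is a hypothetical helper name.

```python
import statistics


def k_from_summary(h_mean, h_sdev, s_mean, s_sdev):
    """Separation statistic as inferred from the quoted table:
    distance between the means in units of the summed sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# First two columns of the table above.
print("%.2f" % k_from_summary(0.62, 4.27, 98.69, 7.69))
print("%.2f" % k_from_summary(1.32, 4.42, 98.66, 7.86))

# Made-up ham scores: 100 tightly clustered, then the same population
# with three of them replaced by extreme false positives.
tight = [0.1] * 100
with_outliers = [0.1] * 97 + [99.0] * 3

for name, scores in (("tight", tight), ("with outliers", with_outliers)):
    print("%-14s mean=%.3f sdev=%.3f median=%.3f" % (
        name,
        statistics.mean(scores),
        statistics.pstdev(scores),
        statistics.median(scores),
    ))
```

Three bad messages out of a hundred are enough to drag the mean up by a factor of thirty and blow up the sdev, while the median does not move, which is why the percentiles are the better summary for chi-combining.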
IOW, for chi-combining, believe the error rates, not the mean/sdev statistics. Across all these runs, a score "in the middle" is about 8 sdevs away from both means, and that's astronomically large when viewed against the tightness of the chi-combining score distributions (extremely clustered near 0.0 for ham and near 1.0 for spam -- look at the percentiles, or look at your histograms, to *see* this). > 'reply-to' tightened up ham scores, and loosened spam scores (but not as > much). I'd suggest re-enabling reply-to with the following patch: > > --- tokenizer.py 31 Oct 2002 15:43:55 -0000 1.59 > +++ tokenizer.py 1 Nov 2002 07:51:34 -0000 > @@ -1082,10 +1082,9 @@ > # becomes the most powerful indicator in the whole database. > # > # From: # this helps both rates > - # Reply-To: # my error rates are too low now to tell > about this > - # # one (smalls wins & losses across runs, overall > - # # not significant), so leaving it out > - for field in ('from',): > + # Reply-To: # this tightens up ham for me (anthony) > and makes spam > + # # slightly worse (but the ham > improvement is more) The comment isn't justified: the increase in k value from 8.20 to 8.22 was too tiny to be significant, and your "best cost" measure actually got worse (but also by an insignificant amount). > + for field in ('from', 'reply-to'): > prefix = field + ':' > x = msg.get(field, 'none').lower() > for w in x.split(): > > Someone else want to repeat this test? I tried it before on my c.l.py test, and your test runs seem to have confirmed the comment the patch removed -- it costs, but neither helps nor hurts the bottom line. Now I just tried it again on a more-general python.org test, and it did manage to nudge 3 marginal false positives (of 5 total) below the line. 
All three were redeemed for a single reason, and I challenge you to think about this and do something better about it than just tokenizing reply-to : In all three cases, the new token that saved the ham's bacon was 'reply-to:none' IOW, there *wasn't* a Reply-To header at all in these three, and that is indeed a mild ham clue. The way things currently work, Reply-To *is* in the default safe_headers list, which feeds into your old "count the mere # of header lines of each given kind" scheme. But if the count is 0, no token is generated for that header line. So, in the end, by default the *presence* of a Reply-To header ends up being a mild spam clue, but the *absence* of a Reply-To header doesn't get noted. The only useful effect of tokenizing Reply-To in my python.org test appeared to be making the absence of a Reply-To header visible to the classifier. So that's a test for you to think about: despite that your results didn't show any real improvement, if you were to just record the absence of a Reply-To header as a positive clue, and judged your test results with the same infectious optimism , would it have done just as good? If so, let's try to generalize that into a cheap and more-general "produce clues about header absence too" gimmick (BTW, Jeremy suggested that long ago, but nobody has followed up on it yet). From tim.one@comcast.net Fri Nov 1 22:05:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 01 Nov 2002 17:05:48 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <15809.55308.1945.988931@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > That's true, largely because that's what the focus of the initial phase of > the project was supposed to be. Even if it gets no farther than it is > today, the process has been highly educational for me, because we have an > expert in algorithm design (that'd be Tim) exposing his thought processes > and mechanics for the rest of us. I'm glad you've found it amusing . 
I'm afraid "think for a second, code for a minute, test for a day; repeat 6 times before you get a small win" is par for the course when trying to push any decent scheme beyond the 80/20 rule (each additional 20% improvement requires 80% of all the effort that went before). > That said, I think the classification stuff has gone about as far as it's > going to go. Me too. The classifier is hack-free now, as clean and uncompromising a realization of the underlying math as anything can be. The assumption of word independence is a limitation of the approach, though. > Future changes to the tokenizer are also likely to be incremental, so > the major changes over the next while will be in email integration. Yup! Thanks to Sean and especially Mark lately, the non-Windows platforms are a month behind on that too. It's a curious thing about Windows: because it is closed-source, the Windows market is homogenous enough that one major effort there can make millions of happy campers. I still hope that the pop3proxy can do that for non-Windows systems too, and that's the only advice I can offer: find a way to use the proxy instead of pursuing "deep integration" with unbounded dozens of quirky twenty-user email clients. From Tim@mail.powweb.com Fri Nov 1 22:11:35 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Fri, 01 Nov 2002 16:11:35 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: I think you're right about the pop3proxy. Outlook is done, let that be, and let the proxy handle the rest. That's what I'm going to try to do shortly here. I've got the Opera mailer on Windoze platform... There's no doubt I can make the proxy work just fine, but I'm not at all sure I can train the classifier. It seems like the training regimen requires spam in the file system, and at least with the Opera mailer, it stuffs mail into a single file with a proprietary format. There is no export function... We'll see how that pans out. 
11/1/2002 4:05:48 PM, Tim Peters wrote: >[Skip Montanaro] >> That's true, largely because that's what the focus of the initial phase of >> the project was supposed to be. Even if it gets no farther than it is >> today, the process has been highly educational for me, because we have an >> expert in algorithm design (that'd be Tim) exposing his thought processes >> and mechanics for the rest of us. > >I'm glad you've found it amusing . I'm afraid "think for a second, >code for a minute, test for a day; repeat 6 times before you get a small >win" is par for the course when trying to push any decent scheme beyond the >80/20 rule (each additional 20% improvement requires 80% of all the effort >that went before). > >> That said, I think the classification stuff has gone about as far as it's >> going to go. > >Me too. The classifier is hack-free now, as clean and uncompromising a >realization of the underlying math as anything can be. The assumption of >word independence is a limitation of the approach, though. > >> Future changes to the tokenizer are also likely to be incremental, so >> the major changes over the next while will be in email integration. > >Yup! Thanks to Sean and especially Mark lately, the non-Windows platforms >are a month behind on that too. It's a curious thing about Windows: >because it is closed-source, the Windows market is homogenous enough that >one major effort there can make millions of happy campers. I still hope >that the pop3proxy can do that for non-Windows systems too, and that's the >only advice I can offer: find a way to use the proxy instead of pursuing >"deep integration" with unbounded dozens of quirky twenty-user email >clients. 
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From jeremy@alum.mit.edu Fri Nov 1 22:18:09 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 1 Nov 2002 17:18:09 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: References: <15809.55308.1945.988931@montanaro.dyndns.org> Message-ID: <15810.64929.812472.459643@slothrop.zope.com> >>>>> "TP" == Tim Peters writes: TP> I still hope that the pop3proxy can do that for non-Windows TP> systems too, and that's the only advice I can offer: find a way TP> to use the proxy instead of pursuing "deep integration" with TP> unbounded dozens of quirky twenty-user email clients. The pop proxy is great for people who use pop, but lots of people don't. Even for people who use pop, the proxy doesn't help with training at all. So I'm afraid it's just a mess for non-Windows users. Jeremy From pje@telecommunity.com Fri Nov 1 22:37:08 2002 From: pje@telecommunity.com (Phillip J. Eby) Date: Fri, 01 Nov 2002 17:37:08 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: References: <15809.55308.1945.988931@montanaro.dyndns.org> Message-ID: <5.1.1.6.0.20021101172008.020d8ec0@telecommunity.com> At 05:05 PM 11/1/02 -0500, Tim Peters wrote: >Yup! Thanks to Sean and especially Mark lately, the non-Windows platforms >are a month behind on that too. It's a curious thing about Windows: >because it is closed-source, the Windows market is homogenous enough that >one major effort there can make millions of happy campers. I still hope >that the pop3proxy can do that for non-Windows systems too, and that's the >only advice I can offer: find a way to use the proxy instead of pursuing >"deep integration" with unbounded dozens of quirky twenty-user email >clients. And perhaps the proxy could include a web GUI to handle its other UI requirements. 
The proxy could keep a history of "recently received" messages, along with their ham/spam/unsure status. It would only permit downloading of ham messages, keeping the rest to itself. Periodically, it would drop in a "unsure and spam summary" message that included the list of unsures followed by the spams, with a link to the web training UI. The UI would let you inspect messages and mark them as ham or spam, and also allow you to go back and mark false negatives as spam, doing the necessary unlearning or relearning as needed. By default, it would train itself on all "sure" messages, ham or spam. This approach would ensure that the "right" training procedure gets followed, while keeping spam from ever entering the mail client. If the installation procedure set up a desktop icon (or local platform equivalent thereof) to launch the Web UI, and set up the POP-proxy/webserver to run continuously or start-on-demand, the result could be "easy enough" for most people. I think the POPFile (http://popfile.sf.net/) people are taking a similar approach to their Bayesian filtering proxy, complete with step-by-step screenshots of how to configure Outlook, Eudora, and Outlook Express to use their proxy. One nice thing about the proxy approach is that a company could easily offer this as a commercial service, that would let you set up your home and office mail clients to pick up from the proxy, so you wouldn't have to train your filter in more than one place. Of course, for such a service to work, it'd probably have to support some kind of SMTP proxying as well, since so many SMTP servers require POP-before-SMTP. Hm. From tim.one@comcast.net Fri Nov 1 22:48:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 01 Nov 2002 17:48:12 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <15810.64929.812472.459643@slothrop.zope.com> Message-ID: [Jeremy Hylton] > The pop proxy is great for people who use pop, but lots of people > don't. Name 362. Ha! 
> Even for people who use pop, the proxy doesn't help with training at all. > So I'm afraid it's just a mess for non-Windows users. I don't know that means it can't be less of a mess, though. For example, I expect we could use a common Training class, which manages a database of opaque message objects and takes care of things like calling appropriate classifier methods at appropriate times, and remembering which messages have been trained on as what. Mark invented a bunch of code like that for the Outlook client, but there's really nothing Outlook-specific about it apart from the all Outlook-specific bits . Those could be factored out, though. A budding system architect could have a lot of fun sorting this out. John Draper seemed to be threatening to at one point, but didn't get much mindshare at the time. It's time now! From seant@iname.com Fri Nov 1 22:55:00 2002 From: seant@iname.com (Sean True) Date: Fri, 1 Nov 2002 17:55:00 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: > Yup! Thanks to Sean and especially Mark lately, the non-Windows platforms > are a month behind on that too. It's a curious thing about Windows: > because it is closed-source, the Windows market is homogenous enough that > one major effort there can make millions of happy campers. I still hope > that the pop3proxy can do that for non-Windows systems too, and that's the > only advice I can offer: find a way to use the proxy instead of pursuing > "deep integration" with unbounded dozens of quirky twenty-user email > clients. > Mark, mostly. I just complain. I'd like to second the pop3proxy architecture as the way forward. If it weren't for the fact that virus scanners weren't already sometimes adding both pop3 and smtp proxies to the mix on Windows, I would have pushed (er, whined) in favor of that architecture even for Outlook. (There is also the problem of how to configure proxies for the case of multiple pop3 accounts). 
You can mangle the host, user, and password together in a horrible looking user name, but it is really a pain. Nonetheless, it has a relatively clean API, and the same architecture could be used to proxy SMTP output traffic, catching messages to ham@wherever and spam@wherever, in order to talk to the training system. The same code then might run pretty happily on both the client and the server, depending on the needs of the installation. -- Sean From Tim@mail.powweb.com Fri Nov 1 23:02:29 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Fri, 01 Nov 2002 17:02:29 -0600 Subject: [Spambayes] Email client integration -- what's needed? Message-ID: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> This proposal has a lot of attractions. Forwarding to ham@ and spam@ would be a bit of a pain at first, but it would work for existing bodies of mail. Training would be MUCH simpler with this method, and would not require some fancy-schmancy installation or configuration glorp. Multiple pop account management is a requirement for sure. I'd say most people that use pop use more than one. Non-pop users? There might be < 362 on non-windoze platform, but count web-mail in, and the vast majority of them are non-pop users... We're not going to be able to help them directly, we'll need to do some server-side enablement somehow. Perhaps a spambayes web-mail system is called for... who knows... not my itch. - Tim 11/1/2002 4:55:00 PM, "Sean True" wrote: >> Yup! Thanks to Sean and especially Mark lately, the non-Windows platforms >> are a month behind on that too. It's a curious thing about Windows: >> because it is closed-source, the Windows market is homogenous enough that >> one major effort there can make millions of happy campers. 
I still hope >> that the pop3proxy can do that for non-Windows systems too, and that's the >> only advice I can offer: find a way to use the proxy instead of pursuing >> "deep integration" with unbounded dozens of quirky twenty-user email >> clients. >> >Mark, mostly. I just complain. > >I'd like to second the pop3proxy architecture as the way forward. If it >weren't for the fact that virus scanners weren't already sometimes adding >both pop3 and smtp proxies to the mix on Windows, >I would have pushed (er, whined) in favor of that architecture even for >Outlook. (There is also the problem of how to configure proxies for the case >of multiple pop3 accounts). You can mangle the host, user, and password >together in a horrible looking user name, but it is really a pain. > >Nonetheless, it has a relatively clean API, and the same architecture could >be used to proxy SMTP output traffic, catching messages to ham@wherever and >spam@wherever, in order to talk to the training >system. The same code then might run pretty happily on both the client and >the server, depending on the needs of the installation. > >-- Sean > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From piersh@friskit.com Fri Nov 1 23:20:58 2002 From: piersh@friskit.com (Piers Haken) Date: Fri, 1 Nov 2002 15:20:58 -0800 Subject: [Spambayes] Outlook plugin errors with Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB92974FC@zeus.sfhq.friskit.com> I'd like to report some problems I'm having with the Outlook plugin. I hope this is the right place. My setup is as follows: - Windows XP SP1 - Outlook XP - Exchange 2000 - python 2.2.2 - win32all-150 - spambayes CVS (currrent) I have 3 message stores: - my main inbox on the exchange server. - a 'Personal Folders' (.pst) file on my local drive for Auto-Archived mail. - my 'Hotmail' inbox. 
I realize that this config may well be untested/unsupported, especially
the fact that my inbox message store is on an Exchange server, but
hopefully this info can be of some use to someone...

Also, this is my first time using python, so I'm sorry if I'm missing
something really simple here.

1) It looks like the plugin is having problems hooking the folder events
for the exchange message store. When I use the 'Filter Rules' dialog to
select my exchange inbox, I get the following exception in PythonWin:

Traceback (most recent call last):
  File "C:\Python22\spam\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 156, in OnButDoSomething
    doer(self)
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 305, in define_filter
    dlg.mgr.addin.FiltersChanged()
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 281, in FiltersChanged
    self.UpdateFolderHooks()
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 289, in UpdateFolderHooks
    FolderItemsEvent)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 309, in _HookFolderEvents
    folder = msgstore_folder.GetOutlookItem()
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 250, in GetOutlookItem
    return self.msgstore.outlook.Session.GetFolderFromID(hex_item_id, hex_store_id)
  File "C:\Python22\lib\site-packages\win32com\gen_py\00062FFF-0000-0000-C000-000000000046x0x9x1\_NameSpace.py", line 48, in GetFolderFromID
    ret = self._oleobj_.InvokeTypes(8456, LCID, 1, (9, 0), ((8, 1), (12, 17)),EntryIDFolder, EntryIDStore)
pywintypes.com_error: (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', 'The operation failed.', None, 0, -2147221241), None)
win32ui: Error in Command Message handler for command ID 1029, Code 0

The error '-2147221241' is defined as:

  CDONTS.h: CdoE_INVALID_ENTRYID = 0x80040107,

I am able to select the inboxes in my hotmail and .pst messages stores
without getting an exception thrown.

2) okay, so I have events hooked on my hotmail and .pst stores.
If I move an unread message into _either_ of these folders, I get the
following exception:

pythoncom error: Python error invoking COM method.

Traceback (most recent call last):
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 123, in OnItemAdd
    msgstore_message = self.manager.message_store.GetMessage(item.EntryID)
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 211, in GetMessage
    mapi_object = self._OpenEntry(message_id)
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 152, in _OpenEntry
    return store.OpenEntry(item_id, iid, flags)
pywintypes.com_error: (-2147221241, 'OLE error 0x80040107', None, None)

I don't think this problem is exchange-related since it happens even if
I completely remove my exchange account from my outlook settings.

3) messages sent from one exchange account to another (ie, never going
over SMTP) have no headers. This may be a problem since the parser can
never infer the sender or any other metadata about the message. It might
be useful to have a special tag that says that the message has no
headers, since such email is very probably ham. Alternatively, some SMTP
headers could be faked up from the various MAPI properties.
4) for some reason, my outlook is prefixing the headers of SMTP mail
with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and
this is causing every SMTP message to throw an exception during parsing
(for example, when doing a 'show clues'):

Traceback (most recent call last):
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 101, in OnClick
    self.handler(*self.args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 192, in ShowClues
    score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 258, in score
    email = msg.GetEmailPackageObject()
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 362, in GetEmailPackageObject
    msg = email.message_from_string(text)
  File "C:\Python22\spam\spambayes\email\__init__.py", line 39, in message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 52, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 46, in parse
    self._parseheaders(root, fp)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 107, in _parseheaders
    raise Errors.HeaderParseError(
email.Errors.HeaderParseError: Not a header, not a continuation: ``Microsoft Mail Internet Headers Version 2.0''

This string is also shown in the 'options' dialog for the message (on
both OutlookXP and Outlook2K) so I think it's something that exchange
server adds to the message, ugh.
Here's a patch that fixes this for me and at least allows me to train on
a full set of messages:

Index: email/Parser.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/email/Parser.py,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 Parser.py
--- email/Parser.py 23 Sep 2002 13:18:55 -0000 1.1.1.1
+++ email/Parser.py 1 Nov 2002 22:17:34 -0000
@@ -101,6 +101,8 @@
             elif lineno == 1 and line.startswith('--'):
                 # allow through duplicate boundary tags.
                 continue
+            elif lineno == 1 and line.startswith('Microsoft Mail Internet Headers Version '):
+                continue
             else:
                 raise Errors.HeaderParseError(
                     "Not a header, not a continuation: ``%s''"%line)

I'd like to get to the bottom of the event hooking problems so I can
actually have this stuff working live on my incoming spam. If anyone has
any hints on how to proceed, I'd be more than glad to hear them.

Thanks.

Piers.

From jeremy@alum.mit.edu Fri Nov 1 23:18:51 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 1 Nov 2002 18:18:51 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To:
References: <15810.64929.812472.459643@slothrop.zope.com>
Message-ID: <15811.3035.967754.435766@slothrop.zope.com>

>>>>> "TP" == Tim Peters writes:

  TP> [Jeremy Hylton]
  >> The pop proxy is great for people who use pop, but lots of people
  >> don't.

  TP> Name 362. Ha!

Guido and at least 361 other people .

  >> Even for people who use pop, the proxy doesn't help with training
  >> at all. So I'm afraid it's just a mess for non-Windows users.

  TP> I don't know that means it can't be less of a mess, though.
For TP> example, I expect we could use a common Training class, which TP> manages a database of opaque message objects and takes care of TP> things like calling appropriate classifier methods at TP> appropriate times, and remembering which messages have been TP> trained on as what. Mark invented a bunch of code like that for TP> the Outlook client, but there's really nothing Outlook-specific TP> about it apart from the all Outlook-specific bits . Those TP> could be factored out, though. I should look at integrating Mark's code and my own training system based on VM folders. See what common code falls out. Jeremy From vanhorn@whidbey.com Fri Nov 1 23:35:23 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Fri, 01 Nov 2002 15:35:23 -0800 Subject: [Spambayes] Email client integration -- what's needed? References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> Message-ID: <3DC30FBB.7D203CF5@whidbey.com> Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote: > This proposal has a lot of attractions. Forwarding to ham@ and spam@ would be a bit of a pain at first, but it would work for existing bodies of mail. Training > would be MUCH simpler with this method, and would not require some fancy-schmancy installation or configuration glorp. In my desired configuration, as a MailScanner plugin like SpamAssassin, I had a thought that I think would work. My larger clients not only use my system for mail, but they also have internal discussion lists. Since users are about a hundred times more likely to report spam than they are to send letters from their girlfriends to ham@, joining ham@ to the discussion lists would be a good source of training material with industry-specific language. It's a particular issue because there is so much spam related to mortgage financing, and most of my users are realtors or loan officers, so dictionary filters are risky. And they tend to use a lot of exclamation marks. 
Van

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations on a
theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------

From mhammond@skippinet.com.au Sat Nov 2 00:27:12 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 2 Nov 2002 11:27:12 +1100
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15811.3035.967754.435766@slothrop.zope.com>
Message-ID:

> TP> trained on as what. Mark invented a bunch of code like that for
> TP> the Outlook client, but there's really nothing Outlook-specific
> TP> about it apart from the all Outlook-specific bits . Those
> TP> could be factored out, though.
>
> I should look at integrating Mark's code and my own training system
> based on VM folders. See what common code falls out.

I tend to filter the Python zen thusly:

% python -c "import this" | grep purity
Although practicality beats purity.

However, I have tried to think a little about what a generic system
would look like. For example, I tried to create a generic "message"
object family:

class MsgStore:
    def Close(self):
    def GetFolderGenerator(self, folder_ids, include_sub):
    def GetFolder(self, folder_id):
    def GetMessage(self, message_id):

class MsgStoreFolder:
    def GetMessageGenerator(self, folder):

class MsgStoreMsg:
    def GetEmailPackageObject(self, strip_mime_headers=True):
        # Return a "read-only" Python email package object
        # "read-only" in that changes will never be reflected to the real store.
        raise NotImplementedError
    def SetField(self, name, value):
        # Abstractly set a user field name/id to a field value.
        # User field is for the user to see - status/internal fields
        # should get their own methods
        raise NotImplementedError
    def GetField(self, name):
        # Abstractly get a user field name/id to a field value.
        raise NotImplementedError
    def Save(self):
        # Save changes after field changes.
        raise NotImplementedError
    def MoveTo(self, folder_id):
        # Move the message to a folder.
        raise NotImplementedError
    def CopyTo(self, folder_id):
        # Copy the message to a folder.
        raise NotImplementedError

The essence of our training code is then:

def train_folder( f, isspam, mgr, progress):
    # fancy progress reporting code omitted
    for message in f.GetMessageGenerator():
        train_message(message, isspam, mgr)

def train_message(msg, is_spam, mgr):
    # Train an individual message.
    # Returns True if newly added (message will be correctly
    # untrained if it was in the wrong category), False if already
    # in the correct category. Catch your own damn exceptions.
    from tokenizer import tokenize
    stream = msg.GetEmailPackageObject()
    tokens = tokenize(stream)
    # Handle we may have already been trained.
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam is None:
        # never previously trained.
        pass
    elif was_spam == is_spam:
        # Already in DB - do nothing (full retrain will wipe msg db)
        # leave now.
        return False
    else:
        mgr.bayes.unlearn(tokens, was_spam, False)
    # OK - setup the new data.
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True

As Tim says, not much Outlook specific here (some - eg, "msg.searchkey"
- but nothing too painful)

Mark.

From jeremy@alum.mit.edu Sat Nov 2 00:29:43 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 1 Nov 2002 19:29:43 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: References: <15811.3035.967754.435766@slothrop.zope.com> Message-ID: <15811.7287.50962.651569@slothrop.zope.com> >>>>> "MH" == Mark Hammond writes: MH> I tend to filter the Python zen thusly: MH> % python -c "import this" | grep purity MH> Although practicality beats purity. :-). MH> However, I have tried to think a little about what a generic MH> system would look like. For example, I tried to create a MH> generic "message" object family: MH> class MsgStoreMsg: MH> def GetEmailPackageObject(self, strip_mime_headers=True): MH> # Return a "read-only" Python email package object MH> # "read-only" in that changes will never be reflected to MH> # the real MH> store. MH> raise NotImplementedError MH> def SetField(self, name, value): MH> # Abstractly set a user field name/id to a field value. MH> # User field is for the user to see - status/internal MH> # fields should get their own methods MH> raise NotImplementedError MH> def GetField(self, name): MH> # Abstractly get a user field name/id to a field value. MH> raise NotImplementedError MH> def Save(self): MH> # Save changes after field changes. MH> raise NotImplementedError MH> def MoveTo(self, folder_id): MH> # Move the message to a folder. MH> raise NotImplementedError MH> def CopyTo(self, folder_id): MH> # Copy the message to a folder. MH> raise NotImplementedError This part of the code doesn't work that well for my mail folders. The code to move messages from folder to folder needs to be written in elisp. I'm not sure how important that is. The training code looks simple enough. 
My version is:

    def update(self):
        """Update classifier from current folder contents."""
        changed1 = self._update(self.hams, False)
        changed2 = self._update(self.spams, True)
        if changed1 or changed2:
            self.classifier.update_probabilities()
        get_transaction().commit()

    def _update(self, folders, is_spam):
        changed = False
        for f in folders:
            added, removed = f.read()
            get_transaction().commit()
            if not (added or removed):
                continue
            changed = True

            # It's important not to commit a transaction until
            # after update_probabilities is called in update().
            # Otherwise some new entries will cause scoring to fail.
            for msg in added.keys():
                self.classifier.learn(tokenize(msg), is_spam, False)
            del added
            get_transaction().commit(1)
            for msg in removed.keys():
                self.classifier.unlearn(tokenize(msg), is_spam, False)
            del removed
            get_transaction().commit(1)
        return changed

The read method scans a folder and returns two sets of messages --
those added and removed since the last time it was read.

Jeremy

From mhammond@skippinet.com.au Sat Nov 2 00:42:38 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 2 Nov 2002 11:42:38 +1100
Subject: [Spambayes] Outlook plugin errors with Exchange
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92974FC@zeus.sfhq.friskit.com>
Message-ID:

> I'd like to report some problems I'm having with the Outlook plugin. I
> hope this is the right place.

Me too. I think it will be until we annoy everyone else sufficiently!

> - spambayes CVS (current)

I'm sure it was at the time

> 1) It looks like the plugin is having problems hooking the folder events
> for the exchange message store. When I use the 'Filter Rules' dialog to
> select my exchange inbox, I get the following exception in PythonWin:

Hmmm. This is a little strange. I would not be surprised to find
Outlook can't hook Exchange folder events, but this is failing before
that - it is failing just getting a regular Outlook "MAPIFolder" object
for that folder. We may be able to get a little further offline.
> 2) okay, so I have events hooked on my hotmail and .pst stores. If I > move an unread message into _either_ of these folders, I get the > following exception: This one is now fixed in CVS. > 3) messages sent from one exchange account to another (ie, never going > over SMTP) have no headers. This may be a problem since the parser can > never infer the sender or any other metadata about the message. It might > be useful to have a special tag that says that the message has no > headers, since such email is very probably ham. Alternatively, some SMTP > headers could be faked up from the various MAPI properties. This is true, and known. We could synthesize some headers in this case. However, no one else involved on the project has this problem, so contributions welcome. You did read the "about" text, right? > 4) for some reason, my outlook is prefixing the headers of SMTP mail > with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and Probably because this mail is coming in via the exchange mail gateway, rather than directly fetched by the Outlook client's internet mail capabilities. > This string is also shown in the 'options' dialog for the message (on > both OutlookXP and Outlook2K) so I think it's something that exchange > server adds to the message, ugh. Here's a patch that fixes this for me > and at least allows me to train on a full set of messages: > > Index: email/Parser.py > =================================================================== > RCS file: /cvsroot/spambayes/spambayes/email/Parser.py,v > retrieving revision 1.1.1.1 > diff -u -r1.1.1.1 Parser.py > --- email/Parser.py 23 Sep 2002 13:18:55 -0000 1.1.1.1 > +++ email/Parser.py 1 Nov 2002 22:17:34 -0000 > @@ -101,6 +101,8 @@ > elif lineno == 1 and line.startswith('--'): > # allow through duplicate boundary tags. 
> continue > + elif lineno == 1 and line.startswith('Microsoft Mail > Internet Headers Version '): > + continue > else: > raise Errors.HeaderParseError( > "Not a header, not a continuation: > ``%s''"%line) I wonder if there is another property on the message that holds the prefix? I might send you some scripts to try Mark. From Tim@mail.powweb.com Sat Nov 2 00:46:33 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Fri, 01 Nov 2002 18:46:33 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <3DC30FBB.7D203CF5@whidbey.com> Message-ID: Ok, so what's rattling around in my head is a set of two proxies: a pop3 proxy and a smtp proxy. the pop3 proxy, running either locally or on the mail server machine, is responsible for classification of email, and delivery as appropriate (tbd). The smtp proxy, again running either locally or on the mail server, is responsible for training. Mail sent to spam@ or ham@ is used by the proxy as training, and isn't actually sent onward. The proxies would simply have to be configurable for what port to listen on, *and* what port to send on. This configurability also handles the case where there are multiple proxies running in the same system. For instance, I already have an SMTP proxy running, that I probably can't live without. The Spambayes proxy would have to listen on a new port and send to the port that my current proxy is listening on... 11/1/2002 5:35:23 PM, "G. Armour Van Horn" wrote: >Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote: > >> This proposal has a lot of attractions. Forwarding to ham@ and spam@ would be a bit of a pain at first, but it would work for existing bodies of mail. Training >> would be MUCH simpler with this method, and would not require some fancy-schmancy installation or configuration glorp. > >In my desired configuration, as a MailScanner plugin like SpamAssassin, I had a thought that I think would work. 
My larger clients not only use my system for mail, >but they also have internal discussion lists. Since users are about a hundred times more likely to report spam than they are to send letters from their girlfriends >to ham@, joining ham@ to the discussion lists would be a good source of training material with industry-specific language. > >It's a particular issue because there is so much spam related to mortgage financing, and most of my users are realtors or loan officers, so dictionary filters are >risky. And they tend to use a lot of exclamation marks. > >Van > >-- >---------------------------------------------------------- >Sign up now for Quotes of the Day, a handful of quotations >on a theme delivered every morning. >Enlightenment! Daily, for free! >mailto:twisted@whidbey.com?subject=Subscribe_QOTD > >For web hosting and maintenance, >visit Van's home page: http://www.domainvanhorn.com/van/ >---------------------------------------------------------- > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From anthony@interlink.com.au Sat Nov 2 04:25:07 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 02 Nov 2002 15:25:07 +1100 Subject: [Spambayes] 'sender' and 'reply-to' tokenising. In-Reply-To: Message-ID: <200211020425.gA24P7R19149@localhost.localdomain> >>> Tim, smacking down my naive attempts at analysing test data: > I'm tempted to drop them! mean/sdev were useful under schemes with real > systematic overlap between the population scores, but chi-combining is so > extreme that overlaps simply aren't due to random effects. So we're back with the problem we had with the Graham method, that it's really really hard to analyse tokenizer changes because of the lack of meaningful test data? Is it worth trying the tests with gary-combining to see if the tokenizer changes actually make things better or worse? 
I don't think we're going to see any "easy big wins" from the tokenizer - but trying to figure out whether incremental changes are positive or negative seems like it's going to be hard if we can only use fp/fn numbers. Anthony, confused. -- Anthony Baxter It's never too late to have a happy childhood. From tim.one@comcast.net Sat Nov 2 04:36:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 01 Nov 2002 23:36:23 -0500 Subject: [Spambayes] Outlook plugin errors with Exchange In-Reply-To: <9891913C5BFE87429D71E37F08210CB92974FC@zeus.sfhq.friskit.com> Message-ID: [Piers Haken] Thank you for the excellent report! > ... > I realize that this config may well be untested/unsupported, especially > the fact that my inbox message store is on an Exchange server, but > hopefully this info can be of some use to someone... There's no intention *not* to support Exchange server, but I've never been near one and I'm not sure anyone else here is near one either. Someone with access to that will have to deal with it. You're elected. > Also, this is my first time using python, so I'm sorry if I'm missing > something really simple here. No, you did a great job of faking it . > ... > 3) messages sent from one exchange account to another (ie, never going > over SMTP) have no headers. This may be a problem since the parser can > never infer the sender or any other metadata about the message. It might > be useful to have a special tag that says that the message has no > headers, since such email is very probably ham. Alternatively, some SMTP > headers could be faked up from the various MAPI properties. By default, the tokenizer code ignores most header fields. It would be good to simulate a few, especially Subject and From. Sticking something like NOHEADERS in the synthesized Subject header would suffice to teach the classifier that NOHEADERS-in-a-Subject-header is a strong ham clue, and there's really no need to get fancier than that. 
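The header synthesis suggested above could look roughly like the following. This is only a minimal sketch under stated assumptions: the function name and its three arguments are hypothetical stand-ins for whatever the MAPI properties of a header-less message actually provide, and none of it is taken from the Outlook plugin's real code.

```python
def synthesize_headers(subject, sender_name, sender_address):
    """Build a minimal header block for a message that arrived without one.

    Hypothetical helper: the three arguments stand in for values read
    from the message's MAPI properties; this is not the plugin's API.
    """
    def clean(value):
        # Collapse all whitespace, including CR/LF, so a property value
        # can never smuggle in an extra header line.
        return " ".join((value or "").split())

    lines = [
        "From: %s <%s>" % (clean(sender_name), clean(sender_address)),
        # The NOHEADERS marker gives the classifier a Subject-line token
        # it can learn to associate with header-less internal mail.
        "Subject: NOHEADERS %s" % clean(subject),
    ]
    # A blank line terminates the header block; the body text follows it.
    return "\n".join(lines) + "\n\n"
```

The result would be prepended to the message body before it is handed to the email package, so the tokenizer sees ordinary Subject and From lines.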
> 4) for some reason, my outlook is prefixing the headers of SMTP mail
> with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and
> this is causing every SMTP message to throw an exception during parsing
> (for example, when doing a 'show clues'):
...
>   File "C:\Python22\spam\spambayes\email\Parser.py", line 107, in
> _parseheaders
>     raise Errors.HeaderParseError(
> email.Errors.HeaderParseError: Not a header, not a continuation:
> ``Microsoft Mail Internet Headers Version 2.0''

That would be an error! The format of header lines is specified by a
public standard, and as the error msg said, that specific line is
neither a valid header line nor a valid continuation of a preceding
header line.

> This string is also shown in the 'options' dialog for the message (on
> both OutlookXP and Outlook2K) so I think it's something that exchange
> server adds to the message, ugh.

Sounds very likely; I haven't seen this.

> Here's a patch that fixes this for me and at least allows me to
> train on a full set of messages:
>
> Index: email/Parser.py

The email pkg is a part of standard Python, and we (speaking as a Python
developer here) won't warp it to accept non-standard headers. If it's
necessary to worm around this in the Outlook client, it should be easy
to do so by fiddling Outlook2000\msgstore.py's _GetMessageText(). For
example, this is untested but almost certainly close to working:

    if headers.startswith("Microsoft Mail"):
        headers = "X-MS-Mail-Gibberish: " + headers

It's enough just to check for the "Microsoft Mail" prefix, as the
embedded space alone makes it an invalid header line. Stuffing a
legitimate header tag at the front should be enough to make the email
pkg's parser happy again.

From popiel@wolfskeep.com Sat Nov 2 04:41:00 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 01 Nov 2002 20:41:00 -0800
Subject: [Spambayes] An alternate use
Message-ID: <20021102044100.7CC18F5AC@cashew.wolfskeep.com>

A couple things have been kicking around in my head, and they've
managed to come together in an interesting configuration and stick,
so I'm going to make a quiet little proposal and see how much
thunder it generates.

First off, the observations:

1. Based on recent reports, spambayes works better when given full
   data about everything that comes through, not just the mistakes.
   This is predicted by the theory, too.

2. spambayes is extremely sensitive to changes in the nature of
   ham, and is moderately likely to classify any new topics/venues
   as spam.

3. spambayes is still a techie toy (though perhaps not for much
   longer). People with a little knowhow are going to have a
   much easier time training it than the average joe.

4. We want a large penetration into the mail-reading populace,
   to better force the spammers to change tactics.

5. Many people read mailing lists. In fact, for high volume
   mail users, mailing lists probably make the majority of
   their incoming mail (or at least their incoming ham).

6. A noticeable amount of spam gets relayed through mailing lists,
   and most personal filters are notoriously bad about passing
   it through because it comes from a whitelisted intermediary.

6. Most mailing lists keep archives of everything sent over the
   list.

7. Most mailing lists are single-topic, and anything off-topic
   is unwanted.

So, what I propose is that we specifically target mailing list
managers (mailman and ecartis being the two obvious first
targets) for spambayes integration. I see two main modes for
this: just adding headers for the less intrusive, and actually
rejecting or forcing moderation for the heavily policed.

Training is easily accomplished by taking the list archives
as a ham corpus and one of the spam collections floating
around as a spam corpus.
Run the classifier over the training data to kick out all the false
positives and false negatives for possible resorting, then retrain.
Only the list owner has to be techie to do this, and list owners are
more likely to be techie than not (they set up a mailing list, after
all). Periodic retraining can be handled in the same way.

In the case of adding headers, we'll want to avoid collisions
with personal use of spambayes, too. I suggest tagging the
X-Spambayes-Disposition header (or whatever we call it) with
some identifier for which classifier generated the rating,
so that multiple X-Spambayes-Disposition lines are distinguishable.
Something like:

X-Spambayes-Disposition: Spam by spambayes@python.org
X-Spambayes-Disposition: Unsure by pennmush@pennmush.org

Personal classifiers could leave off the 'by' section.

Heck, make it so that X-Spambayes-Disposition lines are turned
into words similar to the mailer lines, and then personal
classifiers can use the judgements of list classifiers as clues.

Doing this sort of integration into mailing list managers takes
advantage of some 'weaknesses' of spambayes, and could be of
great benefit to many people beyond just those with the
wherewithal to train and run the filter.

- Alex

From tim.one@comcast.net Sat Nov 2 04:57:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:57:08 -0500
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
In-Reply-To: <200211020425.gA24P7R19149@localhost.localdomain>
Message-ID:

[Tim, praising Anthony's enthusiastic attempts at analysing test data]
> I'm tempted to drop them! mean/sdev were useful under schemes with real
> systematic overlap between the population scores, but chi-combining is
> so extreme that overlaps simply aren't due to random effects.

[Anthony Baxter]
> So we're back with the problem we had with the Graham method, that
> it's really really hard to analyse tokenizer changes because of the
> lack of meaningful test data?

The problem I had with Graham-combining is that the more and better the
training data you had, the more embarrassing its errors became: the
middle ground kept getting smaller, and eventually everything scored as
0.0 or 1.0, whether right or wrong. chi-combining reliably scores
highly ambiguous msgs near 0.5, and its middle ground (a) is very
accurate about when it's confused, and (b) doesn't degenerate as
training data increases.

> Is it worth trying the tests with gary-combining to see if the tokenizer
> changes actually make things better or worse?
>
> I don't think we're going to see any "easy big wins" from the
> tokenizer - but trying to figure out whether incremental changes
> are positive or negative seems like it's going to be hard if
> we can only use fp/fn numbers.

The FP/FN/unsure rates are the only numbers that matter in the end, and
under chi-combining it's *much* easier to stare at mistakes and find
commonalities. Given a reasonable amount of training data, errors almost
never score at 0.0 or 1.0 under chi, which makes it plausible that
tokenizer changes can redeem them. This requires more work but is more
rewarding. For example, it was easy to identify exactly what about
tokenizing Reply-To saved 3 FP in my python.org test, and that suggested
a focused area for further work.

Precisely because there are very likely no big wins remaining, progress
now has to come from thinking about mistakes, finding cheap ways to
avoid them, and then running tests to ensure that new gimmicks don't
hurt anything else. As with the only good effect I found from Reply-To
in my python.org test, I expect most such gimmicks will boil down to
letting the classifier see more of the msg -- but not so much that
highly correlated words lead to extreme mistakes.
There's still a lot of header info we ignore by default, and we still ignore almost everything in almost all HTML tags, and almost everything in almost all non-text/* sections, so there's still plenty of room for small improvements. Looking for something that increases the mean spread by 0.1% when the means are already 16 sdev apart is a waste of time now, though. Looking for something that cuts an FP without hurting FN or unsure is golden. progress-is-harder-now-but-that's-a-sign-of-success-ly y'rs - tim From tim.one@comcast.net Sat Nov 2 05:32:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 02 Nov 2002 00:32:58 -0500 Subject: [Spambayes] An alternate use In-Reply-To: <20021102044100.7CC18F5AC@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > A couple things have been kicking around in my head, and they've > managed to come together in an interesting configuration and stick, > so I'm going to make a quiet little proposal and see how much > thunder it generates. > > > First off, the observations: > > 1. Based on recent reports, spambayes works better when given full > data about everything that comes through, not just the mistakes. > This is predicted by the theory, too. I'd say "representative data" more than "full data". A random slice of real life, consistently applied, should be enough. > 2. spambayes is extremely sensitive to changes in the nature of > ham, and is moderately likely to classify any new topics/venues > as spam. Almost certainly true for a classifier trained mostly by mistakes, ignoring the correctly classified msgs. The latter are needed to transform spamprobs from serendipitous hapaxes into robust indicators. In my own classifier, I trained on *no* msgs from the Spambayes list at first. I left them out on purpose. Recall that I reported on what happened after I had a pretty decent classifier and scored more than 1,000 backed-up spambayes msgs: they were almost all scored as ham, despite not training on the topic at all. 
I expect this is more rule than exception for a properly trained classifier. What it *is* extremely sensitive to is advertising you sign up for. I've been at this thru a full billing cycle now, and marketing msgs from vendors I want to do business with still score as Unsure before training on several msgs from a specific vendor. Spam that uses the same words can keep knocking them back into Unsure territory too. > 3. spambayes is still a techie toy (though perhaps not for much > longer). People with a little knowhow are going to have a > much easier time training it than the average joe. Absolutely. > 4. We want a large penetration into the mail-reading populace, > to better force the spammers to change tactics. Heh. It's still an irony of this project that I've never particularly minded getting 100 spam per day . > 5. Many people read mailing lists. In fact, for high volume > mail users, mailing lists probably make the majority of > their incoming mail (or at least their incoming ham). True here. > 6. A noticable amount of spam gets relayed through mailing lists, > and most personal filters are notoriously bad about passing > it through because it comes from a whitelisted intermediary. Indeed, that's why I still ignore most of the header lines. python.org and Mailman put so many "I touched this!" clues in the headers, and do such a good job of stopping spam already, that if I pay attention to those clues then almost none of the spam they let pass gets caught. > 6. Most mailing lists keep archives of everything sent over the > list. Yup. > 7. Most mailing lists are single-topic, and anything off-topic > is unwanted. Eh -- probably. I started with the mailing-list version of comp.lang.python, and there's a huge amount of traffic there that never mentions Python. The variety of ham on that group is quite amazing. 
But it contains almost no advertising beyond conference announcements,
and I still expect that accounts for the breathtaking results I get on
my c.l.py tests (2 mistakes out of 34,000 msgs, where one "mistake" is
saying that a quote of a full Nigerian-scam spam is itself spam).

> So, what I propose is that we specifically target mailing list
> managers (mailman and ecartis being the two obvious first
> targets) for spambayes integration. I see two main modes for
> this: just adding headers for the less intrusive, and actually
> rejecting or forcing moderation for the heavily policed.

That's actually what started this project: Barry Warsaw is GNU Mailman's
author, and he asked me to look into adapting Graham's scheme for
incorporation into Mailman. Barry has been pretty much missing in action
here since then, but I expect him to take it up again.

> Training is easily accomplished by taking the list archives
> as a ham corpus and one of the spam collections floating
> around as a spam corpus.

That's exactly what I did, and it was anything but easy. Mixed-source
corpora create a world of problems, and Mailman archives in particular
save *all* the Mailman distortions introduced into the headers. Even on
the more general "python.org email" test I've been doing behind the
scenes lately, the headers are polluted by judgments from SpamAssassin,
and goofy little things like python.org's MTA inventing Message-Id lines
out of thin air when one doesn't come across on the wire. There are lots
and lots of traps here.

> Run the classifier over the training data to kick out all the false
> positives and false negatives for possible resorting, then retrain.
> Only the list owner has to be techie to do this, and list owners are
> more likely to be techie than not (they set up a mailing list, after
> all). Periodic retraining can be handled in the same way.
>
> In the case of adding headers, we'll want to avoid collisions
> with personal use of spambayes, too. I suggest tagging the
> X-Spambayes-Disposition header (or whatever we call it) with
> some identifier for which classifier generated the rating,
> so that multiple X-Spambayes-Disposition lines are distinguishable.
> Something like:
>
> X-Spambayes-Disposition: Spam by spambayes@python.org
> X-Spambayes-Disposition: Unsure by pennmush@pennmush.org
>
> Personal classifiers could leave off the 'by' section.
>
> Heck, make it so that X-Spambayes-Disposition lines are turned
> into words similar to the mailer lines, and then personal
> classifiers can use the judgements of list classifiers as clues.

Easy to spoof, and I'm sure spammers would pick up on that quickly.

> Doing this sort of integration into mailing list managers takes
> advantage of some 'weaknesses' of spambayes, and could be of
> great benefit to many people beyond just those with the
> wherewithal to train and run the filter.

That was Barry's idea, yes. I'll leave it to him to resume this battle.

One idea we kicked around was to add a

    If this looks like spam, click here: http://yadda.yadda.yorg/abc?=etc

line at the bottom of each mailing-list msg. An automated system on the
server would collect and organize votes. There's no intention that users
get to vote on what *is* spam; the real point is more devious: a msg
that *nobody* claims is spam almost certainly isn't spam, so it's really
most valuable as a way to identify ham. That is, if nobody claims msg X
is spam within a few days, it's almost certainly the case that X is safe
to add to the ham training. That seems so certain that it could be
automated. Msgs that got "several" spam votes would be brought to the
list admin's attention, for human judgment about whether to classify
them as errors. Automating *that* part gets too close to
censorship-by-vocal-minority for my tastes, so if Barry implemented that
part I'd kill him.

From popiel@wolfskeep.com Sat Nov 2 06:29:39 2002
From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Fri, 01 Nov 2002 22:29:39 -0800 Subject: [Spambayes] An alternate use In-Reply-To: Message from Tim Peters References: Message-ID: <20021102062939.3CD89F5AC@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> >> 1. Based on recent reports, spambayes works better when given full >> data about everything that comes through, not just the mistakes. >> This is predicted by the theory, too. > >I'd say "representative data" more than "full data". A random slice of real >life, consistently applied, should be enough. Granted. >> 4. We want a large penetration into the mail-reading populace, >> to better force the spammers to change tactics. > >Heh. It's still an irony of this project that I've never particularly >minded getting 100 spam per day . Whereas my disgust with getting 70 spam per day (out of about 100 messages total) is one of the major things that prompted me to actually try Graham's algorithm. ;-) >> So, what I propose is that we specifically target mailing list >> managers (mailman and ecartis being the two obvious first >> targets) for spambayes integration. I see two main modes for >> this: just adding headers for the less intrusive, and actually >> rejecting or forcing moderation for the heavily policed. > >That's actually what started this project: Barry Warsaw is GNU Mailman's >author, and he asked me to look into adapting Graham's scheme for >incorporation into Mailman. Barry has been pretty much missing in action >here since then, but I expect him to take it up again. Heh. Glad to hear I'm not the only one thinking like this. I don't claim to have new ideas... recycled ideas are easier. ;-) >> Training is easily accomplished by taking the list archives >> as a ham corpus and one of the spam collections floating >> around as a spam corpus. > >That's exactly what I did, and it was anything but easy. 
Mixed-source >corpora create a world of problems, and Mailmain archives in particular save >*all* the Mailman distortions introduced into the headers. Blech. You're right... I just forgot about the troubles you had. Ecartis is similar with the tainting of the archives. >> In the case of adding headers, we'll want to avoid collisions >> with personal use of spambayes, too. I suggest tagging the >> X-Spambayes-Disposition header (or whatever we call it) with >> some identifier for which classifier generated the rating, >> so that multiple X-Spambayes-Disposition lines are distinguishable. >> Something like: >> >> X-Spambayes-Disposition: Spam by spambayes@python.org >> X-Spambayes-Disposition: Unsure by pennmush@pennmush.org >> >> Personal classifiers could leave off the 'by' section. >> >> Heck, make it so that X-Spambayes-Disposition lines are turned >> into words similar to the mailer lines, and then personal >> classifiers can use the judgements of list classifiers as clues. > >Easy to spoof, and I'm sure spammers would pick up on that quickly. Yes, it would be easy to spoof, unless compared with routing information... but doing that sort of comparison is beyond the sorting rule capabilities of something like Outlook (and Outlook is sadly one of the best GUI tools in that arena). I'm not even sure procmail is up to the task without help from a custom program. On the other hand, we could build the smarts for it into spambayes itself, for use in the personal classifier figuring out when to trust the apparent list classifier... perhaps I'll look into routing analysis for my next algorithmic experiment. >One idea we kicked around was to add a > > If this looks like spam, click here: http://yadda.yadda.yorg/abc?=etc > >line at the bottom of each mailing-list msg. An automated system on the >server would collect and organize votes. 
There's no intention that users >get to vote on what *is* spam, the real point is more devious: a msg that >*nobody* claims is spam almost certainly isn't spam, so it's really most >valuable as a way to identify ham. That is, if nobody claims msg X is spam >within a few days, it's almost certainly the case that X is safe to add to >the ham training. That seems so certain that it could be automated. Msgs >that got "several" spam votes would be brought to the list admin's >attention, for human judgment about whether to classify them as errors. >Automating *that* part gets too close to censorship-by-vocal-minority for my >tastes, so if Barry implemented that part I'd kill him . Interesting, as a ham indicator. Way too corruptible as a spam indicator, I agree. - Alex From piersh@friskit.com Sat Nov 2 12:11:16 2002 From: piersh@friskit.com (Piers Haken) Date: Sat, 2 Nov 2002 04:11:16 -0800 Subject: [Spambayes] Outlook plugin errors with Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB9297502@zeus.sfhq.friskit.com> > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net] > Sent: Friday, November 01, 2002 8:36 PM > To: Piers Haken > Cc: spambayes@python.org > Subject: RE: [Spambayes] Outlook plugin errors with Exchange > > [Piers Haken] > > Thank you for the excellent report! My pleasure. Thanks to you and your team for a great tool. I've been lurking on this list for a while since I saw it mentioned on slashdot with the intention of someday writing an outlook or exchange plugin based on the algorithms you have finessed. But thanks to Mark and co. I don't have to ;-) > > ... > > I realize that this config may well be untested/unsupported, > > especially the fact that my inbox message store is on an Exchange > > server, but hopefully this info can be of some use to someone... > > There's no intention *not* to support Exchange server, but > I've never been near one and I'm not sure anyone else here is > near one either.
Someone with access to that will have to > deal with it. You're elected. Heh, I already took it offline with Mark and he's made a few fixes already and I've sent him a patch that should work around the problem I'm seeing (that is, if it doesn't break everything else in the process). If you want, I can send it to the list. > > Also, this is my first time using python, so I'm sorry if I'm missing > > something really simple here. > > No, you did a great job of faking it . Beginner's luck, I assure you ;-) > > ... > > 3) messages sent from one exchange account to another (ie, never going > > over SMTP) have no headers. This may be a problem since the parser can > > never infer the sender or any other metadata about the message. It > > might be useful to have a special tag that says that the message has > > no headers, since such email is very probably ham. Alternatively, some > > SMTP headers could be faked up from the various MAPI properties. > > By default, the tokenizer code ignores most header fields. > It would be good to simulate a few, especially Subject and > From. Sticking something like NOHEADERS in the synthesized > Subject header would suffice to teach the classifier that > NOHEADERS-in-a-Subject-header is a strong ham clue, and > there's really no need to get fancier than that. My patch tries to fake up the subject, from, to && cc fields. However, there's no easy way to get an smtp address from an X.400 address (there may not even be one), so I just put the display names (eg, "Tim Peters") in those cases. If they're not parsed then it doesn't really matter, I guess the subject is the most important bit. I also added an "X-Exchange-Message: true" header for these messages, so I guess people can add that to their options if they want an extra ham bonus.
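The header faking described above might look roughly like the following. This is only a sketch under stated assumptions, not the actual patch (which was never posted to the list): the function name synthesize_headers and its parameters are hypothetical, and the real code would read its values from MAPI properties rather than taking plain strings. Only the X-Exchange-Message header and the use of display names come from the message above.

```python
def synthesize_headers(subject, sender, to=(), cc=()):
    """Build a minimal header block for a message that has none.

    Hypothetical helper: names and signature are illustrative only.
    Values are display names (e.g. "Tim Peters"), since an X.400
    address may have no SMTP equivalent.
    """
    def clean(value):
        # Collapse all whitespace, including CR/LF, so that a value
        # can never start a new header line by accident.
        return " ".join(str(value).split())

    lines = [
        "Subject: " + clean(subject),
        "From: " + clean(sender),
    ]
    if to:
        lines.append("To: " + ", ".join(clean(name) for name in to))
    if cc:
        lines.append("Cc: " + ", ".join(clean(name) for name in cc))
    # Marker header so the classifier can learn "internal Exchange
    # mail" as a clue, as mentioned in the message above.
    lines.append("X-Exchange-Message: true")
    # A blank line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"
```

The resulting string would be prepended to the message body before it is handed to the tokenizer, so that the parser sees something that looks like ordinary SMTP mail.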
> > 4) for some reason, my outlook is prefixing the headers of SMTP mail > > with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and > > this is causing every SMTP message to throw an exception during > > parsing (for example, when doing a 'show clues'): > ... > > File "C:\Python22\spam\spambayes\email\Parser.py", line 107, in > > _parseheaders > > raise Errors.HeaderParseError( > > email.Errors.HeaderParseError: Not a header, not a continuation: > > ``Microsoft Mail Internet Headers Version 2.0'' > That would be an error! The format of header lines is > specified by a public standard, and as the error msg said, > that specific line is neither a valid header line nor a valid > continuation of a preceding header line. Yeah, I'm not sure why they do this. It's not normally a problem because these headers are never used by any SMTP transport once they've gone through the MTA, but it's still a pain, and it'll probably change in the future, breaking everything... > > This string is also shown in the 'options' dialog for the message (on > > both OutlookXP and Outlook2K) so I think it's something that exchange > > server adds to the message, ugh. > Sounds very likely; I haven't seen this. > > Here's a patch that fixes this for me and at least allows me to train > > on a full set of messages: > > > > Index: email/Parser.py > The email pkg is a part of standard Python, and we (speaking > as a Python developer here) won't warp it to accept > non-standard headers. If it's necessary to worm around this > in the Outlook client, it should be easy to do so by fiddling > Outlook2000\msgstore.py's _GetMessageText(). For example, > this is untested but almost certainly close to working: > > if headers.startswith("Microsoft Mail"): > headers = "X-MS-Mail-Gibberish: " + headers So close to working, in fact, that it works, so that's what I did ;-) Great stuff guys.
Thanks for killing my spam for me. Piers. From rob@hooft.net Sat Nov 2 16:17:10 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 02 Nov 2002 17:17:10 +0100 Subject: [Spambayes] spambayes.org Message-ID: <3DC3FA86.8090305@hooft.net> I just reserved spambayes.org Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sat Nov 2 16:42:56 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 02 Nov 2002 17:42:56 +0100 Subject: [Spambayes] An alternate use References: Message-ID: <3DC40090.5050109@hooft.net> Tim Peters wrote: > [T. Alexander Popiel] > >>7. Most mailing lists are single-topic, and anything off-topic >> is unwanted. > > > Eh -- probably. I started with the mailing-list version of > comp.lang.python, and there's a huge amount of traffic there that never > mentions Python. The variety of ham on that group is quite amazing. But it > contains almost no advertising beyond conference announcements, and I still > expect that accounts for the breathtaking results I get on my c.l.py tests > (2 mistakes out of 34,000 msgs, where one "mistake" is saying that a quote > of a full Nigerian-scam spam is itself spam). > > >>So, what I propose is that we specifically target mailing list >>managers (mailman and ecartis being the two obvious first >>targets) for spambayes integration. I see two main modes for >>this: just adding headers for the less intrusive, and actually >>rejecting or forcing moderation for the heavily policed. > > > That's actually what started this project: Barry Warsaw is GNU Mailman's > author, and he asked me to look into adapting Graham's scheme for > incorporation into Mailman. Barry has been pretty much missing in action > here since then, but I expect him to take it up again. So, we'd have to make mailing lists keep a spam-archive as well? Or do we deliver spambayes with a pre-cooked spam archive to get started with new mailing lists? Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From Tim@mail.powweb.com Sat Nov 2 16:47:21 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 10:47:21 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy Message-ID: Ok, I've got the pop3proxy up and running on my machine. Very simple to get running. I don't have a trained database (the real challenge) at this point, and it's adding the x-hammie-disposition header with value of 'no'. I presume that this means that the classifier thinks this is NOT ham? So if there's no database, then it assumes everything is spam? Or am I reading the meaning of the header backwards? - Tim www.fourstonesExpressions.com From richie@entrian.com Sat Nov 2 18:16:05 2002 From: richie@entrian.com (Richie Hindle) Date: Sat, 02 Nov 2002 18:16:05 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: References: Message-ID: Hi Tim, > adding the x-hammie-disposition header with value of 'no'. 'No' means it thinks it's ham - the header means "Is it spam?" At the moment the header added by pop3proxy.py is always "Yes" or "No" - I'll add the new "Unsure" value when I get the chance. > I don't have a trained database (the real challenge) at this point Use hammie.py to train it - the usage message should tell you everything you need to know, except how to create the mbox files or directories of email message to feed into it. Hopefully your email client will export messages into one of those formats... -- Richie Hindle richie@entrian.com From richie@entrian.com Sat Nov 2 18:31:08 2002 From: richie@entrian.com (Richie Hindle) Date: Sat, 02 Nov 2002 18:31:08 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: References: Message-ID: Hi Tim, > ... make the proxy listen on different ports. I've modified the > code to do that, was a simple mod. Do you want the mod? Yes please! 
-- Richie Hindle richie@entrian.com From tim.one@comcast.net Sat Nov 2 17:57:39 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 02 Nov 2002 12:57:39 -0500 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: [Tim@mail.powweb.com] > Ok, I've got the pop3proxy up and running on my machine. Very > simple to get running. Good! I haven't had time to try it yet, so I won't be much help, but I'm glad it ran easily for you. > I don't have a trained database (the real challenge) The difficulty of bootstrapping a database is generally overstated, and especially by those who haven't yet done it . Train on everything you get for a few days. I predict you'll find it gets most things right after just a dozen msgs of each kind. But it will also make howling mistakes until you've trained on much more than that. Even so, don't take the classifications too seriously at the start, and it should be very helpful quickly. > at this point, and it's adding the x-hammie-disposition header with > value of 'no'. I presume that this means that the classifier thinks > this is NOT ham? More accurately, that the score fell below the value of spam_cutoff you've set, and if you didn't set one yet, the default value of spam_cutoff: 0.90 The relevant code appears to be in pop3proxy BayesProxy.onRetr(): prob = self.bayes.spamprob(tokenizer.tokenize(message)) if prob > options.spam_cutoff: disposition = "Yes" else: disposition = "No " > So if there's no database, then it assumes everything is spam? There's always a database, but at the start it's empty. If there are no words in the database, that's not a special case to the code, the math simply works out to give a score of 0.5 to every msg then (which makes sense: in the absence of any evidence at all, it has no reason to favor any specific conclusion). Whatever you set ham_cutoff and spam_cutoff to be, 0.5 should definitely be in your Unsure category. 
However, it doesn't look like pop3proxy is paying attention to ham_cutoff yet, nor is it currently capable of generating an "I'm lost -- help me!" Unsure disposition. Someone needs to teach it about the middle ground. > Or am I reading the meaning of the header backwards? No, you're reading it right. From richie@entrian.com Sat Nov 2 18:43:49 2002 From: richie@entrian.com (Richie Hindle) Date: Sat, 02 Nov 2002 18:43:49 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: References: Message-ID: Hi Tim, > it doesn't look > like pop3proxy is paying attention to ham_cutoff yet, nor is it currently > capable of generating an "I'm lost -- help me!" Unsure disposition. Someone > needs to teach it about the middle ground. I'll do this. -- Richie Hindle richie@entrian.com From Tim@mail.powweb.com Sat Nov 2 19:51:02 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 13:51:02 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: Ok, so Tim says I'm not reading it backwards, Richie says I am... I think the x-hammie-disposition header should be ham|spam|unsure versus 'yes|no|unsure'. This is much clearer, not much chance for interpretive errors... and furthermore, the header itself should be x-spambayes-disposition, because this says clearly where the header came from... I can make that change, too, if the collective wills it... but if I'm gonna make many changes, it might be reasonable to bring me up to speed on the cvs checkin thing... - Tim 11/2/2002 11:57:39 AM, Tim Peters wrote: >[Tim@mail.powweb.com] >> Ok, I've got the pop3proxy up and running on my machine. Very >> simple to get running. > >Good! I haven't had time to try it yet, so I won't be much help, but I'm >glad it ran easily for you. > >> I don't have a trained database (the real challenge) > >The difficulty of bootstrapping a database is generally overstated, and >especially by those who haven't yet done it . 
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From richie@entrian.com Sat Nov 2 21:02:35 2002 From: richie@entrian.com (Richie Hindle) Date: Sat, 02 Nov 2002 21:02:35 +0000 Subject: [Spambayes] pop3proxy,py now supports 'Unsure' and can run on arbitrary ports Message-ID: Hi all, pop3proxy.py now supports the 'Unsure' value for X-Hammie-Disposition. If you're using pop3proxy and filtering on 'No' values, you'll now get fewer hits because some emails that used to be 'No' will be 'Unsure'. Also, it can now listen on the port of your choice (thanks to Tim Stone) meaning you can run many proxies on the same machine (and also run it as non-root on Unix systems). Finally, it's less anal about correcting for the size of the added header - it no longer adds trailing spaces to the header to make the message up to the size reported by the LIST command. If this breaks anything I'll eat my head. -- Richie Hindle richie@entrian.com From richie@entrian.com Sat Nov 2 21:03:38 2002 From: richie@entrian.com (Richie Hindle) Date: Sat, 02 Nov 2002 21:03:38 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: References: Message-ID: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> Hi Tim, > Ok, so Tim says I'm not reading it backwards, Richie says I am... Some misunderstanding I think - the header means "Is it spam?" But you're right, 'Yes' / 'No' is less clear (unless we rename the header to something that makes it clear) than 'Spam' / 'Ham'. Have we collectively decided that 'Ham' is the official word for non-Spam? Someone pointed out a while ago that it's a little impolite towards Hormel to imply that Spam is the opposite of ham... though that might be a little hyper-sensitive. Someone else (I should check my references but I'm lazy 8-) suggested that our use of the word 'Ham' is a useful USP. I vote for keeping it. 
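For reference, the three-way Spam / Unsure / Ham classification being discussed in this thread can be sketched as below. This is an illustration only, not the code in pop3proxy.py. The spam_cutoff default of 0.90 is the value quoted earlier in the thread; the ham_cutoff default of 0.20 and the function names are assumptions made for the example.

```python
def disposition(prob, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0.0, 1.0] to a three-way label.

    ham_cutoff=0.20 is an assumed value for illustration; only
    spam_cutoff=0.90 is quoted in the thread.
    """
    if prob > spam_cutoff:
        return "Spam"
    if prob < ham_cutoff:
        return "Ham"
    # Everything between the two cutoffs is the middle ground.
    return "Unsure"


def disposition_header(prob, name="X-Hammie-Disposition"):
    """Format the header line a proxy could add to a message."""
    return "%s: %s" % (name, disposition(prob))
```

With an empty database every message scores 0.5, so under these cutoffs an untrained classifier labels everything Unsure rather than guessing.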
I also agree that the header should have a new name - Hammie was the first front-end to the spambayes project, and other front-ends have since inherited the header, which is a bit daft (sorry Neale!). I'd like to drop the techie word 'disposition' as well - how about: X-Spambayes-Judgement: Spam / Unsure / Ham X-Spambayes-Is-Spam: Yes / Unsure / No X-Spambayes-Looks-Like-Spam: Yes / Unsure / No If we're going to change this, we should make sure we get it right first (albeit second) time. That includes deciding whether there are optional extra details that can go into the header, or whether there's an optional extra header to carry those details. I think there *should* be optional extra details, probably in a separate header - it's one of the cool things about SpamAssassin. I vote to drop all extra details from the main header, then decide later on whether there should be an extra header. We ought to sort this out soon, because more and more people are starting to use the software, and we're going to affect them all if/when we rename headers. -- Richie Hindle richie@entrian.com From Tim@mail.powweb.com Sat Nov 2 21:26:57 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 15:26:57 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> Message-ID: <8771YW62YTA8WTA5WSSPHCZUF0HCWVR.3dc44321@riven> Richie, >X-Spambayes-Judgement: Spam / Unsure / Ham >X-Spambayes-Is-Spam: Yes / Unsure / No >X-Spambayes-Looks-Like-Spam: Yes / Unsure / No I vote for the first. It contains the most information: This is a judgement, made by spambayes, that says this email is <_>. This is all that should go in this header, unless there is something we can do to make the header less forgeable by spammers, which I doubt. Further information in other headers might be very useful. It certainly is in spam assassin. What information spambayes might be able to share... 
the stats dudes probably have a better handle on that than me. - Tim www.fourstonesExpressions.com From Tim@mail.powweb.com Sat Nov 2 21:36:53 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 15:36:53 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: >>The difficulty of bootstrapping a database is generally overstated, and >>especially by those who haven't yet done it . Train on everything you >>get for a few days. The problem I have is that my windoze based opera mailer does not store mail in textual format in a separate file system artifact for each email. There is a limited functionality for storing a particular mail in a text file, but I have to do that manually one at a time. Once this is done, then neiltrain.py will work perfectly well, but that's still an enormous amount of work. This will probably be a typical kind of problem. I know netscape and mozilla do much the same thing. I'm going to try to figure out a better way to do it. The idea of an smtp proxy that recognizes forwards to ham@ and spam@ is very attractive... 11/2/2002 1:51:02 PM, Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote: >Ok, so Tim says I'm not reading it backwards, Richie says I am... I think the x-hammie-disposition header should be ham|spam|unsure versus 'yes|no|unsure'. >This is much clearer, not much chance for interpretive errors... and furthermore, the header itself should be x-spambayes-disposition, because this says >clearly where the header came from... I can make that change, too, if the collective wills it... but if I'm gonna make many changes, it might be reasonable to >bring me up to speed on the cvs checkin thing...
- Tim www.fourstonesExpressions.com From vanhorn@whidbey.com Sat Nov 2 21:46:19 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Sat, 02 Nov 2002 13:46:19 -0800 Subject: [Spambayes] x-hammie-disposition in pop3proxy References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> Message-ID: <3DC447AB.D64D6CE6@whidbey.com> Richie Hindle wrote: > X-Spambayes-Judgement: Spam / Unsure / Ham > X-Spambayes-Is-Spam: Yes / Unsure / No > X-Spambayes-Looks-Like-Spam: Yes / Unsure / No I know we have a long tradition of spelling errors behind us, such as dropping an "R" from "referrer" in Apache logs, but I'd hate to start a new one! Please, only one "E" in "judgment." Van -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From guido@python.org Sat Nov 2 22:41:59 2002 From: guido@python.org (Guido van Rossum) Date: Sat, 02 Nov 2002 17:41:59 -0500 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Your message of "Sat, 02 Nov 2002 13:46:19 PST."
<3DC447AB.D64D6CE6@whidbey.com> References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> Message-ID: <200211022241.gA2Mfxq07985@pcp02138704pcs.reston01.va.comcast.net> > > X-Spambayes-Judgement: Spam / Unsure / Ham > > X-Spambayes-Is-Spam: Yes / Unsure / No > > X-Spambayes-Looks-Like-Spam: Yes / Unsure / No > > I know we have a long tradition of spelling errors behind us, such > as dropping an "R" from "referrer" in Apache logs, but I'd hate to > start a new one! Please, only one "E" in "judgment." But it's not a spelling error! --Guido van Rossum (home page: http://www.python.org/~guido/) From Tim@mail.powweb.com Sat Nov 2 22:48:21 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 16:48:21 -0600 Subject: [Spambayes] SMTP proxy Message-ID: Ok, I have a ***very*** rudimentary SMTP proxy for the purpose of training a spambayes database from mailers like the mozilla mailer, which keeps mail in a single file, and gives you very little facility for extracting individual mails. The proxy is based on code from Lee Smithson (http://sourceforge.net/cvs/?group_id=31674), and requires the DNS module (http://sourceforge.net/cvs/? group_id=31674). You run the proxy, naming the port you want it to listen on. Then you point your mailer to localhost:. You're all set at that point. I haven't tested it very much, and it doesn't appear to handle error conditions particularly well (at all). To train, simply forward or redirect a mail to spam@localspambayes.trn or to ham@localspambayes.trn (these addresses are hardcoded at the moment...) The proxy recognizes these addresses and executes a learn and updates probabilities in a database that's named as an argument when the proxy is started up. I'm assuming that update probabilities preserves the existing information in the database... I couldn't tell if this was the case from neiltrain.py. If not, then this stuff won't work yet... 
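The address check at the heart of such a proxy could be sketched as follows. This is not the proxy code described above (which was not posted with the message); the function and table names are hypothetical, and only the two training addresses are taken from the text. It uses the standard library's email.utils.parseaddr to strip angle brackets and display names from the recipient.

```python
from email.utils import parseaddr

# The two hardcoded training addresses named in the message above.
TRAINING_ADDRESSES = {
    "spam@localspambayes.trn": "spam",
    "ham@localspambayes.trn": "ham",
}


def training_label(recipient):
    """Return 'spam' or 'ham' for a training address, else None.

    Hypothetical helper: a proxy would call this on each RCPT TO
    value and relay the message as usual when it returns None.
    """
    _, address = parseaddr(recipient)
    # Compare case-insensitively, since mailers vary in how they
    # capitalise addresses.
    return TRAINING_ADDRESSES.get(address.lower())
```

A message whose recipient maps to a label would be trained on and not relayed; anything else would be passed through to the real SMTP server untouched.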
So, now my question is does the project want this stuff? If so, where should I send it? - Tim www.fourstonesExpressions.com From richie@entrian.com Sat Nov 2 22:59:30 2002 From: richie@entrian.com (Richie Hindle) Date: Sat, 02 Nov 2002 22:59:30 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <3DC447AB.D64D6CE6@whidbey.com> References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> Message-ID: Hi Van, > > X-Spambayes-Judgement: Spam / Unsure / Ham > > I know we have a long tradition of spelling errors behind us, such as dropping > an "R" from "referrer" in Apache logs, but I'd hate to start a new one! Please, > only one "E" in "judgment." Is this a British English vs. American English thing? My Concise Oxford says: judgement n. (also judgment) 1 the critical faculty; discernment (an error of judgement). 2 good sense. [snip] listing 'judgement' first and 'judgment' as the variant. The American Heritage Dictionary lists 'judgment' first and 'judgement' as the variant. Princeton's WordNet agrees with the Oxford. Meriam-Webster agrees with American Heritage. My girlfriend (the ultimate authority on most things) agrees with the Oxford. I really hate to say this, but we should probably use the common American spelling (even if it's wrong 8-) -- Richie Hindle richie@entrian.com From guido@python.org Sat Nov 2 23:01:22 2002 From: guido@python.org (Guido van Rossum) Date: Sat, 02 Nov 2002 18:01:22 -0500 Subject: [Spambayes] Spam at hackers conference Message-ID: <200211022301.gA2N1MJ08093@pcp02138704pcs.reston01.va.comcast.net> At the "Hackers" conference (a cool west coast event by invitation only) there was a session on spam. A few things to note: - The term "ham" is now generally accepted :-) - People are still at the Paul Graham level of Bayesian filtering; I wish I had a blurb about the work done here on chi-square. - Combining different approaches (e.g. 
blacklists, whitelists, Bayesian) seems to make people more comfortable. - The name of Bill Yerazunis was mentioned as someone who has done good spam work. Paul Graham seems to agree: http://www.paulgraham.com/wsy.html ; one idea of his takes groups of 5 words and does various permutations (including leaving out some) and then hashes on the result; very good results apparently. (Maybe the URL abouve has more info?) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Sat Nov 2 23:19:17 2002 From: guido@python.org (Guido van Rossum) Date: Sat, 02 Nov 2002 18:19:17 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Your message of "Fri, 01 Nov 2002 18:18:51 EST." <15811.3035.967754.435766@slothrop.zope.com> References: <15810.64929.812472.459643@slothrop.zope.com> <15811.3035.967754.435766@slothrop.zope.com> Message-ID: <200211022319.gA2NJHt08300@pcp02138704pcs.reston01.va.comcast.net> > >> The pop proxy is great for people who use pop, but lots of people > >> don't. > > TP> Name 362. Ha! > > Guido and at least 361 other people . Um, I get all my mail via pop (and fetchmail). --Guido van Rossum (home page: http://www.python.org/~guido/) From Tim@mail.powweb.com Sat Nov 2 23:18:51 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 17:18:51 -0600 Subject: [Spambayes] Spam at hackers conference Message-ID: I've *always* suspected that spambayes in combination with other technology would present a very powerful anti-spam arsenal. But spambayes by itself is so good, that it may not really require supplemental technology. I say *always* because I've only been in this game for a couple weeks... ;) so what do I REALLY know? - Tim 11/2/2002 5:01:22 PM, Guido van Rossum wrote: >At the "Hackers" conference (a cool west coast event by invitation >only) there was a session on spam. 
A few things to note: > >- The term "ham" is now generally accepted :-) > >- People are still at the Paul Graham level of Bayesian filtering; > I wish I had a blurb about the work done here on chi-square. > >- Combining different approaches (e.g. blacklists, whitelists, > Bayesian) seems to make people more comfortable. > >- The name of Bill Yerazunis was mentioned as someone who has done > good spam work. Paul Graham seems to agree: > http://www.paulgraham.com/wsy.html ; one idea of his takes groups of > 5 words and does various permutations (including leaving out some) > and then hashes on the result; very good results apparently. (Maybe > the URL abouve has more info?) > >--Guido van Rossum (home page: http://www.python.org/~guido/) > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From mhammond@skippinet.com.au Sat Nov 2 23:51:34 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sun, 3 Nov 2002 10:51:34 +1100 Subject: [Spambayes] Spam at hackers conference In-Reply-To: Message-ID: > I've *always* suspected that spambayes in combination with other > technology would present a very powerful anti-spam arsenal. But > spambayes by itself is so > good, that it may not really require supplemental technology. I'm finding that too. My email had 2 different problems - Spam, and attempted worm payload (Klez et al). As soon as I had an Outlook plugin working, I hacked up a trivial worm detector - way before I had the spambayes stuff working. I was very very happy with the results - worm problem almost gone! Then bayes came along. I made real attempts to keep these worms out of my spam corpa, as I thought they would mess up Bayes (eg, they often had "pythonwin" in the subject). But regardless of how careful I am, Bayes *still* defines them as Spam. My worm filter and Bayes are battling over who gets to move the mail message. 
No matter how careful I am about keeping these worms from my Spam folder, Bayes just keeps on knowing they are junk. This mirrors what Tim has been saying - it seems likely that a single classifier, over *all* of your mail (including mailing list etc) will be pretty much all you need. And-client-software-that-doesn't-keep-crapping-out Mark. From vanhorn@whidbey.com Sun Nov 3 00:01:47 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Sat, 02 Nov 2002 16:01:47 -0800 Subject: [Spambayes] x-hammie-disposition in pop3proxy References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> Message-ID: <3DC4676B.F5ED02CE@whidbey.com> Richie, Well, you're right. The COD does list it with that spelling first. The RHUD (Random House Unabridged Dictionary), which is always open right next to my desk, shows "Also, esp. Brit, judgement" after eight definitions and I hadn't read that far. Normally I feel pretty odd, having 18 dictionaries on my shelf, 8 of them English. It's good to be in a group where someone else cares! Van Richie Hindle wrote: > Hi Van, > > > > X-Spambayes-Judgement: Spam / Unsure / Ham > > > > I know we have a long tradition of spelling errors behind us, such as dropping > > an "R" from "referrer" in Apache logs, but I'd hate to start a new one! Please, > > only one "E" in "judgment." > > Is this a British English vs. American English thing? My Concise Oxford > says: > > judgement n. (also judgment) > 1 the critical faculty; discernment (an error of judgement). > 2 good sense. > [snip] > > listing 'judgement' first and 'judgment' as the variant. The American > Heritage Dictionary lists 'judgment' first and 'judgement' as the variant. > Princeton's WordNet agrees with the Oxford. Meriam-Webster agrees with > American Heritage. My girlfriend (the ultimate authority on most things) > agrees with the Oxford. 
> > I really hate to say this, but we should probably use the common American > spelling (even if it's wrong 8-) > > -- > Richie Hindle > richie@entrian.com -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From richie@entrian.com Sun Nov 3 00:07:03 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 00:07:03 +0000 Subject: [Spambayes] SMTP proxy In-Reply-To: References: Message-ID: <0ip8sug6mg229optc7fkebjn10j7tktqu5@4ax.com> Hi Tim, > I have a ***very*** rudimentary SMTP proxy for the purpose of training > a spambayes database [...] does the project want this stuff? If so, > where should I send it? Yay! I'd love to see it - please send your code (either to me or to the list). Your comments raise a couple of questions, but I'll wait until I see the code before asking them. One thing worth mentioning to anyone joining the coding team on this project is the Python coding standard at http://www.python.org/peps/pep-0008.html - if all your code is full of code ( with spaces ( inside parentheses ) ) ( like mine used to be ) then I have a script which can help (written as a result of the other Tim kindly pointing me at the style guide when I first submitted code). -- Richie Hindle richie@entrian.com From mhammond@skippinet.com.au Sun Nov 3 00:20:07 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sun, 3 Nov 2002 11:20:07 +1100 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <3DC4676B.F5ED02CE@whidbey.com> Message-ID: > "Also, esp. Brit, judgement" after eight definitions and I hadn't > read that far. > > Normally I feel pretty odd, having 18 dictionaries on my shelf, 8 > of them English. 
> It's good to be in a group where someone else cares! Well, us colonials all gave up caring what the yanks did to the language long ago . I have one dictionary on my desk; the Macquarie, the official Australian dictionary. It is listed as: judgment=judgement with no further comment on the alternative spellings. But-the-final-authority-is-that-Outlook's-spell-checker-likes-them-both-too Mark. From richie@entrian.com Sun Nov 3 00:21:23 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 00:21:23 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <3DC4676B.F5ED02CE@whidbey.com> References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> <3DC4676B.F5ED02CE@whidbey.com> Message-ID: Hi Van, > It's good to be in a group where someone else cares! Definitely. But aren't most technical groups like that? Being a language geek is a direct consequence of being a geek of any kind, isn't it? 8-) Back to the topic: the spelling 'judgment' looks simply wrong to me (and to my girlfriend, the nineteenth dictionary). I suspect it never gets used in British English. However, Google lists 2,150,000 hits for 'judgement' and 5,800,000 hits for 'judgment', which implies that the latter is in more common use worldwide. The question is, does 'judgement' look as wrong to American eyes as 'judgment' does to British ones? Judging (ha ha) by your initial response, I'd guess it does. (Pardon me if you're not American!) Or maybe the *real* question is, shall we call the header X-Spambayes-Classification? -- Richie Hindle richie@entrian.com From Tim@mail.powweb.com Sun Nov 3 00:30:56 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 18:30:56 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: Well, judgment looks wrong to me... How about x-hammie-classified-as? 
- TimS 11/2/2002 6:21:23 PM, Richie Hindle wrote: >Hi Van, > >> It's good to be in a group where someone else cares! > >Definitely. But aren't most technical groups like that? Being a language >geek is a direct consequence of being a geek of any kind, isn't it? 8-) > >Back to the topic: the spelling 'judgment' looks simply wrong to me (and to >my girlfriend, the nineteenth dictionary). I suspect it never gets used in >British English. However, Google lists 2,150,000 hits for 'judgement' and >5,800,000 hits for 'judgment', which implies that the latter is in more >common use worldwide. The question is, does 'judgement' look as wrong to >American eyes as 'judgment' does to British ones? Judging (ha ha) by your >initial response, I'd guess it does. (Pardon me if you're not American!) > >Or maybe the *real* question is, shall we call the header >X-Spambayes-Classification? > >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From popiel@wolfskeep.com Sun Nov 3 00:30:53 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 02 Nov 2002 16:30:53 -0800 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message from "G. Armour Van Horn" of "Sat, 02 Nov 2002 13:46:19 PST." <3DC447AB.D64D6CE6@whidbey.com> References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> Message-ID: <20021103003053.44B27F49B@cashew.wolfskeep.com> In message: <3DC447AB.D64D6CE6@whidbey.com> "G. Armour Van Horn" writes: >Richie Hindle wrote: > >> X-Spambayes-Judgement: Spam / Unsure / Ham > >I know we have a long tradition of spelling errors behind us, such as >dropping an "R" from "referrer" in Apache logs, but I'd hate to start >a new one! Please, only one "E" in "judgment." Actually, Merriam-Webster lists both forms as valid. 
- Alex From B-Morgan@concentric.net Sun Nov 3 00:31:59 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Sat, 2 Nov 2002 17:31:59 -0700 Subject: [Spambayes] SMTP proxy In-Reply-To: Message-ID: Tim, Please submit the SMTP proxy to the project. I think this is a good interface for training. I'm also following the popfile SourceForge project and they have a usable interface using HTML (on an alternate port) which is a reasonable alternative. I do have a question on the SMTP proxy. Can it be configured to pass everything it doesn't capture on to the "normal" proxy (where "normal" is specified somehow)? If not, how would it be configured in, say, Outlook? Keep up the good work! Regards, Brad From B-Morgan@concentric.net Sun Nov 3 00:36:27 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Sat, 2 Nov 2002 17:36:27 -0700 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: > Or maybe the *real* question is, shall we call the header > X-Spambayes-Classification? "X-Spambayes-Classification: spam, ham, or unsure" makes perfect sense to me. There are enough words in common between British, American, (and Australian) English that we can use. Regards, Brad From popiel@wolfskeep.com Sun Nov 3 00:40:02 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 02 Nov 2002 16:40:02 -0800 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message from Richie Hindle References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> <3DC4676B.F5ED02CE@whidbey.com> Message-ID: <20021103004002.92222F49B@cashew.wolfskeep.com> In message: Richie Hindle writes: >Hi Van, > >> It's good to be in a group where someone else cares! > >Definitely. But aren't most technical groups like that? Being a language >geek is a direct consequence of being a geek of any kind, isn't it? 8-) Not a direct consequence, but there is a high correlation. I've never run chi-square on it, though.
;-) >The question is, does 'judgement' look as wrong to American eyes as >'judgment' does to British ones? Not to this American's eyes... but I tend to go for the British versions of many words, so it might just be an affectation. >Or maybe the *real* question is, shall we call the header >X-Spambayes-Classification? I like that one. - Alex From Tim@mail.powweb.com Sun Nov 3 00:47:38 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 18:47:38 -0600 Subject: [Spambayes] SMTP proxy In-Reply-To: Message-ID: Brad, I've sent the code to Richie, as I don't have checkin privileges (sp?) on the Spambayes project. Gosh, you guys have got me all worried about spelling... To answer your question, yes it passes mail through right now, but there is some configuration related work that really needs to be done before I'd consider it to be ready for primetime. When you run the proxy, you tell it what port to listen on and give it a dns ip address. Normal outgoing mail is processed by doing a dns lookup on the domain in the to: address, grabbing the smtp server name from the dns lookup return (a bit of a mystery to me) and connecting to that server. This is kinda not right imo. I think that the outgoing smtp server name should be specifiable as a startup option, as should the port to send on. That way, you can specify localhost: if you have another proxy running, which will allow you to chain them. In my instance, this is exactly what I have at the moment. So... that'll be coming, soon I hope... - TimS 11/2/2002 6:31:59 PM, "Brad Morgan" wrote: >Tim, > >Please submit the SMTP proxy to the project. I think this is a good >interface for training. I'm also following the popfile Sourceforge project >and they have a useable interface using HTML (on an alternate port) which is >a reasonable alternative. > >I do have a question on the SMTP proxy. 
Can it be configured to pass >everything it doesn't capture on the the "normal" proxy (where "normal" is >specified somehow)? If not, how would it be configured in say, Outlook? > >Keep up the good work! > >Regards, > >Brad > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From Tim@mail.powweb.com Sun Nov 3 00:48:48 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 18:48:48 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <20021103004002.92222F49B@cashew.wolfskeep.com> Message-ID: X-Spambayes-Classification Perfect. - TimS 11/2/2002 6:40:02 PM, "T. Alexander Popiel" wrote: >In message: > Richie Hindle writes: >>Hi Van, >> >>> It's good to be in a group where someone else cares! >> >>Definitely. But aren't most technical groups like that? Being a language >>geek is a direct consequence of being a geek of any kind, isn't it? 8-) > >Not a direct consequence, but there is a high correlation. >I've never run chi-square on it, though. ;-) > >>The question is, does 'judgement' look as wrong to American eyes as >>'judgment' does to British ones? > >Not to this American's eyes... but I tend to go for the British versions >of many words, so it might just be an affectation. > >>Or maybe the *real* question is, shall we call the header >>X-Spambayes-Classification? > >I like that one. 
> >- Alex > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From Tim@mail.powweb.com Sun Nov 3 00:54:29 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sat, 02 Nov 2002 18:54:29 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: <1T985KFB9MK09SRD8PLJG1X05SNSOIG.3dc473c5@riven> Has there been any thought given to additional classifications, beyond ham|unsure|spam? Like, ham|probablyham|unsure|probablyspam|spam, with corresponding cutoffs specified in Options? I don't know if that's interesting to anybody at all... I could see X-Spambayes-Classification: probablyspam being useful as a range of mail that should be checked manually... - TimS 11/2/2002 6:48:48 PM, Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote: >X-Spambayes-Classification > >Perfect. > >- TimS > >11/2/2002 6:40:02 PM, "T. Alexander Popiel" wrote: > >>In message: >> Richie Hindle writes: >>>Hi Van, >>> >>>> It's good to be in a group where someone else cares! >>> >>>Definitely. But aren't most technical groups like that? Being a language >>>geek is a direct consequence of being a geek of any kind, isn't it? 8-) >> >>Not a direct consequence, but there is a high correlation. >>I've never run chi-square on it, though. ;-) >> >>>The question is, does 'judgement' look as wrong to American eyes as >>>'judgment' does to British ones? >> >>Not to this American's eyes... but I tend to go for the British versions >>of many words, so it might just be an affectation. >> >>>Or maybe the *real* question is, shall we call the header >>>X-Spambayes-Classification? >> >>I like that one. 
>> >>- Alex >> >>_______________________________________________ >>Spambayes mailing list >>Spambayes@python.org >>http://mail.python.org/mailman/listinfo/spambayes >> >> >- Tim >www.fourstonesExpressions.com > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From vanhorn@whidbey.com Sun Nov 3 03:12:33 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Sat, 02 Nov 2002 19:12:33 -0800 Subject: [Spambayes] x-hammie-disposition in pop3proxy References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> <3DC4676B.F5ED02CE@whidbey.com> Message-ID: <3DC49421.8EAA85FE@whidbey.com> Richie, I have not found any great correlation between language precision and technical competence, other than that leaders of most technical communities are both highly intelligent and well educated. It's amazing how many times I see "loose" when the obvious meaning is "lose" in online forums. I have a lot of problems with homophones, but lose and loose don't even rhyme. The British spelling of judgement would go unnoticed by the vast majority of my fellow citizens of the distressingly illeducated United States. Of course, adding or dropping vowels really isn't that big a deal in English, a language that can be accurately read with an amazing quantity of missing vowels. As to other Americans, south of us they speak Spanish and wouldn't care, north of us they have so much British spelling to deal with they wouldn't notice. (They might notice if you used colour and labor in the same sentence, I suspect.) As to the final choice of the name, the image of a stern black-robed jurist behind a high podium is a lot more appealing to me than an entymologist with a magnifier or a librarian choosing a Dewey Decimal System code for a book. So I vote for Judgement over Classification. 
As to Outlook, mentioned in the previous message, to my disgust Microsoft Word accepts both spellings when the US English dictionary is loaded. I really need to move to a program that allows correcting the factory dictionary. Van Richie Hindle wrote: > Hi Van, > > > It's good to be in a group where someone else cares! > > Definitely. But aren't most technical groups like that? Being a language > geek is a direct consequence of being a geek of any kind, isn't it? 8-) > > Back to the topic: the spelling 'judgment' looks simply wrong to me (and to > my girlfriend, the nineteenth dictionary). I suspect it never gets used in > British English. However, Google lists 2,150,000 hits for 'judgement' and > 5,800,000 hits for 'judgment', which implies that the latter is in more > common use worldwide. The question is, does 'judgement' look as wrong to > American eyes as 'judgment' does to British ones? Judging (ha ha) by your > initial response, I'd guess it does. (Pardon me if you're not American!) > > Or maybe the *real* question is, shall we call the header > X-Spambayes-Classification? > > -- > Richie Hindle > richie@entrian.com -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From tim.one@comcast.net Sun Nov 3 03:15:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 02 Nov 2002 22:15:43 -0500 Subject: [Spambayes] Spam at hackers conference In-Reply-To: <200211022301.gA2N1MJ08093@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > At the "Hackers" conference (a cool west coast event by invitation > only) there was a session on spam. 
A few things to note: > > - The term "ham" is now generally accepted :-) So where are my royalties ? > - People are still at the Paul Graham level of Bayesian filtering; > I wish I had a blurb about the work done here on chi-square. If you had the source code, there are accurate ('tho bare bones) explanations in Options.py and classifier.py. > - Combining different approaches (e.g. blacklists, whitelists, > Bayesian) seems to make people more comfortable. I doubt a blacklist is going to be worth the bother with this scheme, but a whitelist may be. MarkH taught the Outlook client how to traverse "old" Outlook .pst files, and running my current classifier over the year 2000's email put about 2 dozen personal msgs from rare correspondents into my Unsure category (which is fine), but 3 into the Spam category (which is not fine). OTOH, they're *such* rare correspondents that I never would have thought to whitelist them anyway. Indeed, since it turns out I never responded to these msgs anyway , the world would be exactly the same if I had never gotten them. > - The name of Bill Yerazunis was mentioned as someone who has done > good spam work. Paul Graham seems to agree: > http://www.paulgraham.com/wsy.html ; one idea of his takes groups of > 5 words and does various permutations (including leaving out some) > and then hashes on the result; very good results apparently. (Maybe > the URL abouve has more info?) His project is at: http://crm114.sourceforge.net/ The source code is easier to read than his attempt to explain it. I mentioned this late last week indirectly, in connection with using many-to-one hashing as a database reduction gimmick; CRM114 carries that to extremes, and has to else it would consume gigabytes (or even terabytes). Bill tokenizes a msg. Looks like he splits on whitespace, and preserves case, but that's not "the gimmick". The tokens are all hashed, and the rest of the scheme works with their hash codes. 
So the output of tokenization is a list of N hash codes (where N is the # of tokens in the msg).

Then a sliding 5-word window marches across the list of hashes, one 5-token slice at a time. At *each* of the N-4 window positions, 16 hashes-of-hashes (HOH) are computed, via 16 different hash functions. Each HOH folds in the hash code of the rightmost token in the window, plus one subset of the preceding 4 token hashes; 2**4 is the # of possible subsets, and so that's where the 16 comes from.

This gives 16 numbers at each window position, which are used to index mmap'ed files of one-byte ham and spam counts. Bill simply keeps a running total of the ham and spam counts, and whichever total is higher at the end wins.

We currently compute "exact" word 1-gram stats. All of Bill's stats are fuzzy because of multiple layers of many-to-one mappings. Ignoring that, and also ignoring some glitches at the start and end of the window positions, he's effectively capturing:

+ All word 1-grams.
+ All word 2-grams.
+ All word 3-grams.
+ All word 4-grams.
+ All word 5-grams.
+ All word 4-grams taken from 5-word slices but ignoring a word.
+ All word 3-grams taken from 5-word slices but ignoring 2 words.
+ All word 2-grams taken from 5-word slices but ignoring 3 words.

Example:

    the earnings potential is truly staggering

generates a HOH for each of

    the earnings potential is truly
    the earnings potential    truly
    the earnings           is truly
    the          potential is truly
        earnings potential is truly
    [and 6 more skipping 2 words of the first 4]
    [and 4 more skipping 3 words of the first 4]
                              truly
    [and 16 more for the 5 words starting at "earnings"]

There are lots of unknowns in this scheme. It's clear that it will learn very quickly at the start. It's unclear how it will do over time. Things acting against it are (not saying they can't be wormed around, am saying they will need to be wormed around, whether or not the need has become apparent yet):

1. The spam and ham counts are clamped at one byte, and will eventually saturate.

2. 1,000,000 buckets isn't much over time, given the prolific rate of HOH generation (16*N HOH's for an N-word msg). The scheme clearly will work best if bucket collisions are rare, but throw just a few thousand HOHs at 1,000,000 buckets and collisions are certain.

3. By construction, there's extreme correlation wrt the haminess and spaminess of the generated HOH's. This makes a good combining scheme a puzzle. Graham-combining gets in trouble due to the bogus word-independence assumption even for unigrams. Overlapping bigrams suffer much worse correlation. Overlapping trigrams or higher, forget it. Even chi-combining gets in trouble if the tokenizer produces "too many" highly correlated one-grams.

In my earlier experiments with word bigrams, they did worse than what we're doing now, and there seemed to be solid reasons for that (like "Aahz rocks" is neutral but "Aahz" on its own is a strong c.l.py ham clue). In a later experiment grabbing both bigrams and unigrams, at the time I saw a significant drop in FN rate (along with a large boost in database size). We've since pushed unigrams to the point where I see better error rates than I got in that experiment, but I haven't run that experiment again. It stands to reason that we're missing *some* useful info.

Bill appears to have stats only for his own email; if there's been wider testing, I haven't bumped into results. I'm getting error rates at least as good on my own email, and better on the c.l.py test. Bill needed a lot less training.

Fiddling our codebase to try something like it wouldn't be hard.

From tim.one@comcast.net Sun Nov 3 03:20:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 02 Nov 2002 22:20:28 -0500 Subject: [Spambayes] Why I added src=cid: etc Message-ID:

This is typical of the kind of email I'm getting a lot of lately.
Without mining the HTML, there's almost nothing to look at, not even a word in the Subject line. (Of course, if we weren't throwing the HTML tags away, the classifier would have learned this stuff on its own.)

Spam Score: 0.999492

    '*H*'                                 0.000694711
    '*S*'                                 0.999679
    'header:Received:4'                   0.151395
    'header:Return-Path:1'                0.621969
    'header:Message-ID:1'                 0.787093
    'virus:width=0'                       0.842427
    'message-id:@fwd04.sul.t-online.com'  0.844828
    'virus:height=0'                      0.855208
    'from:email addr:t-online.de'         0.908163
    'from:email name:520018173831-0001'   0.908163
    'virus:src=cid:'                      0.978469
    'virus:

X-Sender: 520018173831-0001@t-dialin.net
Return-Path: 520018173831-0001@t-online.de
X-OriginalArrivalTime: 03 Nov 2002 02:21:58.0083 (UTC) FILETIME=[C8F85130:01C282DF]

From anthony@interlink.com.au Sun Nov 3 04:05:33 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sun, 03 Nov 2002 15:05:33 +1100 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <3DC49421.8EAA85FE@whidbey.com> Message-ID: <200211030405.gA345Xo00791@localhost.localdomain> >>> "G. Armour Van Horn" wrote > The British spelling of judgement would go unnoticed by the vast > majority of my fellow citizens of the distressingly illeducated United > States. Of course, adding or dropping vowels really isn't that big > a deal in English, a language that can be accurately read with an > amazing quantity of missing vowels. So maybe it should be X-SpmBys-Jdgmnt ? From skip@pobox.com Sat Nov 2 14:14:56 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 2 Nov 2002 08:14:56 -0600 Subject: [Spambayes] An alternate use In-Reply-To: <20021102062939.3CD89F5AC@cashew.wolfskeep.com> References: <20021102062939.3CD89F5AC@cashew.wolfskeep.com> Message-ID: <15811.56800.291193.255713@montanaro.dyndns.org> >>> In the case of adding headers, we'll want to avoid collisions with >>> personal use of spambayes, too.
I suggest tagging the >>> X-Spambayes-Disposition header (or whatever we call it) with some >>> identifier for which classifier generated the rating, so that >>> multiple X-Spambayes-Disposition lines are distinguishable. >>> Something like: >>> >>> X-Spambayes-Disposition: Spam by spambayes@python.org >>> X-Spambayes-Disposition: Unsure by pennmush@pennmush.org >>> >>> Personal classifiers could leave off the 'by' section. >>> >>> Heck, make it so that X-Spambayes-Disposition lines are turned into >>> words similar to the mailer lines, and then personal classifiers can >>> use the judgements of list classifiers as clues. >> >> Easy to spoof, and I'm sure spammers would pick up on that quickly. Alex> Yes, it would be easy to spoof, unless compared with routing Alex> information... but doing that sort of comparison is beyond the Alex> sorting rule capabilities of something like Outlook (and Outlook Alex> is sadly one of the best GUI tools in that arena). I'm not even Alex> sure procmail is up to the task without help from a custom Alex> program.

I was using a spoof-proof mechanism from procmail before I disabled SpamAssassin. I inserted my own header using formail:

    :0H
    * ! ^X-SA-Host:
    {
        :0fw
        | spamc | $FORMAIL -a "X-SA-Host: `hostname --fqdn`"
    }

which says, "if there is no X-SA-Host header present, run spamc, add a header and include the fully qualified hostname". If an X-SA-Host header is present it tells me spamc had already been run on this message (I was running SA on two different machines at the time). That way I wasn't relying on SA's own headers to decide whether or not to run it.

Skip

From skip@pobox.com Sat Nov 2 17:30:22 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 2 Nov 2002 11:30:22 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: References: Message-ID: <15812.2990.423954.983990@montanaro.dyndns.org> Tim> Ok, I've got the pop3proxy up and running on my machine. Very Tim> simple to get running.
I don't have a trained database (the real Tim> challenge) at this point, and it's adding the x-hammie-disposition Tim> header with value of 'no'. I presume that this means that the Tim> classifier thinks this is NOT ham? So if there's no database, then Tim> it assumes everything is spam? Or am I reading the meaning of the Tim> header backwards? Correct. "no" means "i think it's ham". "yes" means "i think it's spam". "unsure" means ... "no" and "yes" are interpreted the same as SpamAssassin's use of these words. Perhaps this suggests that we need a different header? SA uses X-Spam-Status: yes which reads in the obvious fashion. I still think we need to leave "X-Spam-*" to the SA folks to avoid ambiguity, but maybe we can use X-Ham-Status: yes to mean "it's ham" X-Ham-Status: no to mean "it's spam" Just a thought. Skip From skip@pobox.com Sat Nov 2 00:04:34 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 1 Nov 2002 18:04:34 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> Message-ID: <15811.5778.454998.906629@montanaro.dyndns.org> >>>>> "Tim" == Tim writes: Tim> This proposal has a lot of attractions. Forwarding to ham@ and Tim> spam@ would be a bit of a pain at first, but it would work for Tim> existing bodies of mail. Training would be MUCH simpler with this Tim> method, and would not require some fancy-schmancy installation or Tim> configuration glorp. The more you ask people to type, the more mistakes they will make. I'm still amazed at how many mistakes I've made, not because I mentally mistook the nature of an email, but because I simply saved it to the wrong file. Skip From skip@pobox.com Sat Nov 2 00:51:40 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 1 Nov 2002 18:51:40 -0600 Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <15811.7287.50962.651569@slothrop.zope.com> References: <15811.3035.967754.435766@slothrop.zope.com> <15811.7287.50962.651569@slothrop.zope.com> Message-ID: <15811.8604.210277.923442@montanaro.dyndns.org> Jeremy> This part of the code doesn't work that well for my mail Jeremy> folders. The code to move messages from folder to folder needs Jeremy> to be written in elisp. I'm not sure how important that is. You could try Pymacs... Skip From skip@pobox.com Sun Nov 3 04:23:47 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 2 Nov 2002 22:23:47 -0600 Subject: [Spambayes] Spam at hackers conference In-Reply-To: References: <200211022301.gA2N1MJ08093@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15812.42195.737956.831082@montanaro.dyndns.org> >> - Combining different approaches (e.g. blacklists, whitelists, >> Bayesian) seems to make people more comfortable. Tim> I doubt a blacklist is going to be worth the bother with this Tim> scheme, but a whitelist may be. I doubt it. There is just too much email spoofing going on to trust any addresses that absolutely. When using SA, I rarely used its whitelist facility, and only for odd email addresses whose automailings it always classified as spam. For instance, I get a bit of mail from American Airlines letting me know when the airfare between Chicago and Albany changes. As you might imagine, it's very spammy looking. The only way I could convince SA to leave it alone was to whitelist it. With Spambayes, it's never a problem. Skip From tim.one@comcast.net Sun Nov 3 05:47:37 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 00:47:37 -0500 Subject: [Spambayes] Spam at hackers conference In-Reply-To: Message-ID: This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment [Tim, sketches the CRM114 algorithm] > ... > Fiddling our codebase to [try] something like it wouldn't be hard. Proof attached. 
Like the docs say, nothing is sacred here, and if that algorithm works better, great, we can go home early . The attached patches classifier.py to do CRM114 HOH generation and scoring by default. The token hash is Python's string hash, which is a better hash than CRM114 uses. The 16 HOH hashes used here are, I believe, identical to the ones CRM114 uses; I grit my teeth at this because they don't appear to be good HOH functions, but let's let that pass. This runs very much slower and requires a lot more memory than what we're using now. OTOH, the memory use is bounded no matter how much training data there is, due to layers of many-to-one hash mappings. If this scheme becomes serious here, recoding the scoring in C would seem necessary for both speed and memory efficiency (I'm already playing obscure speed and memory reduction tricks here, but they don't help enough). The patch doesn't change tokenization at all, although I believe Bill preserves case when tokenizing, doesn't skip either short words or meta-tokenize long words, and doesn't do any of our "fancy" tokenization gimmicks (a note on the project site suggests that he'll start decoding base64, because that's been a problem with the scheme; we are decoding base64, of course). The patch also doesn't clamp counts to 1-byte values, although I doubt that played a role here (later: unclear!). If you try this, set ham_cutoff and spam_cutoff to 0.5 (later: also unclear what to do here). It's just comparing raw counts, and the bigger count wins. The score returned here is S/(S+H) where S is the sum of the ~16*N HOH spamcounts H is the sum of the ~16*N HOH hamcounts So < 0.5 means S was smaller, and > 0.5 means S was larger. 
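The scoring rule just described can be sketched in a few lines. This is only an illustration of the S/(S+H) arithmetic, not the code in the attached patch: the function names, the list-of-counts inputs, and the 0.5 fallback for a message with no evidence at all are assumptions made for the example.

```python
def raw_count_score(spam_counts, ham_counts):
    """Return S / (S + H), where S and H are the summed spam and ham counts."""
    s = sum(spam_counts)
    h = sum(ham_counts)
    if s + h == 0:
        # Assumption for this sketch: no evidence at all scores as 0.5.
        return 0.5
    return s / (s + h)


def classify(score, cutoff=0.5):
    """The bigger raw count wins: above the cutoff is spam, below is ham."""
    if score > cutoff:
        return "spam"
    if score < cutoff:
        return "ham"
    return "tie"


# S = 8 and H = 2, so the score is 0.8 and the spam side wins.
print(classify(raw_count_score([3, 1, 4], [1, 1])))
```

Because nothing here normalizes for how much ham versus spam was trained on, the side with more training data tends to accumulate bigger raw counts, which is worth keeping in mind when reading the results.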
On a python.org email test I was running anyway, the results weren't stellar:

    filename:       base1       crm
    ham:spam:    2741:948  2741:948
    fp total:           5         2
    fp %:            0.18      0.07
    fn total:           2       271
    fn %:            0.21     28.59
    unsure t:          66         0
    unsure %:        1.79      0.00
    real cost:     $65.20   $291.00
    best cost:     $21.40   $177.00
    h mean:          0.84     18.28
    h sdev:          6.21      6.17
    s mean:         98.05     64.63
    s sdev:          9.10     23.53
    mean diff:      97.21     46.35
    k:               6.35      1.56

This isn't a big test, but bloated to 100MB and took so long I killed it once suspecting a hang (it wasn't hung, so I got to start over again). However,

1. If there's a usable middle ground here, setting both cutoffs to 0.5 can't reflect that. Still, the run was done with nbuckets=200, which gave the automated histogram analysis a lot of resolution to play with, and the best-cost crm value was $177.00, 8x worse than the best-cost "before" value (deduced from the same nbuckets value).

2. It occurs to me that *because* it's just scoring by comparing raw counts, it's probably crucial to train on an equal number of ham and spam. That there was 3x as much ham in this test made it much easier to get high raw hamcounts than high raw spamcounts. That may (or may not) explain the bulk of the huge FN rate.

OK, doing a 10-fold cross-validation run across 2000 random ham and 2000 random spam, but the same random sets for "before" and "after":

    filename:      before        crm
    ham:spam:   2000:2000  2000:2000
    fp total:           1       1604
    fp %:            0.05      80.20
    fn total:           0          0
    fn %:            0.00       0.00
    unsure t:          20          0
    unsure %:        0.50       0.00
    real cost:     $14.00  $16040.00
    best cost:      $2.00    $228.00
    h mean:          0.55      53.54
    h sdev:          4.50       5.30
    s mean:         99.91      71.40
    s sdev:          1.64       6.84
    mean diff:      99.36      17.86
    k:              16.18       1.47

Well, that was a disaster. My guess: since virtually all ham contains strong spam words, 0.5 is a lousy value for spam_cutoff.
For crm: -> Ham scores for all runs: 2000 items; mean 53.54; sdev 5.30 -> min 24.6294; median 54.3869; max 74.4693 -> percentiles: 5% 43.7643; 25% 51.1123; 75% 56.8845; 95% 60.2207 -> Spam scores for all runs: 2000 items; mean 71.40; sdev 6.84 -> min 50; median 69.805; max 96.6838 -> percentiles: 5% 63.4695; 25% 66.6684; 75% 74.2597; 95% 86.1775 -> best cost for all runs: $228.00 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.61 & 0.695 -> fp 1; fn 17; unsure ham 63; unsure spam 942 -> fp rate 0.05%; fn rate 0.85%; unsure rate 25.1% The highest-scoring ham is one our unigram scheme would never call spam; since 0.75 is very near the spam 75th-percentile score, you'd have to call about 25% of all spam "unsure" to avoid calling this spam (and, indeed, the automated histogram analysis found its best-cost value at an unsure rate of 25.1%): """ Data/Ham/Set4/128466.txt prob = 0.744693151307 prob('*H*') = 59619 prob('*S*') = 173900 Received: from [80.17.80.215] (helo=veronika.quadrante.com) by mail.python.org with smtp (Exim 3.21 #1) id 16dCZB-0000Op-00 for python-list@python.org; Tue, 19 Feb 2002 10:52:29 -0500 Received: (qmail 29664 invoked by uid 64014); 19 Feb 2002 16:14:13 -0000 Received: from abottoni@quadrante.com by veronika by uid 64011 with qmail-scanner-1.10 (uvscan: v4.1.40/v4121. . Clear:0. Processed in 0.341367 secs); 19 Feb 2002 16:14:13 -0000 Received: from unknown (HELO backup.quadrante.com) (80.17.80.210) by 80.17.80.215 with SMTP; 19 Feb 2002 16:14:13 -0000 Message-Id: <5.1.0.14.0.20020219163858.00a901a8@veronika.quadrante.com> X-Sender: abottoni@veronika.quadrante.com X-Mailer: QUALCOMM Windows Eudora Version 5.1 Date: Tue, 19 Feb 2002 16:56:25 +0100 To: python-list@python.org From: Alessandro Bottoni Subject: Python-based "Portal System"? 
Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; format=flowed Sender: python-list-admin@python.org Errors-To: python-list-admin@python.org X-BeenThere: python-list@python.org X-Mailman-Version: 2.0.8 (101270) Precedence: bulk List-Help: List-Post: List-Subscribe: , List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: Most likely, all of you know a number of open source, pre-built "portal systems", like ezPublish (http://developer.ez.no), PHPNuke (www.phpnuke.org), PostNuke (http://www.postnuke.com), Midgard (http://www.midgard-project.org/) and so on. Does anybody know if exists a Portal System like those, written in Python? Thanks in advance Alessandro Bottoni PS: I know about Zope (http://www.zope.org) and WebWare (http://webware.sourceforge.net), already... """ The 2nd-highest-scoring ham is due to our own Skip, and is at least as mysterious: """ Data/Ham/Set2/146718.txt prob = 0.693046527054 prob('*H*') = 80982 prob('*S*') = 182843 Received: from exim by mail.python.org with spamc (Exim 4.02) id 17DRZM-0005Xn-00 for python-list@python.org; Thu, 30 May 2002 11:10:28 -0400 Received: from 12-248-41-177.client.attbi.com ([12.248.41.177]) by mail.python.org with esmtp (Exim 4.02) id 17DRZM-0005Xg-00 for python-list@python.org; Thu, 30 May 2002 11:10:28 -0400 Received: (from skip@localhost) by 12-248-41-177.client.attbi.com (8.11.6/8.11.6) id g4UFAPD25155; Thu, 30 May 2002 10:10:25 -0500 From: Skip Montanaro MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <15606.16608.916236.657101@12-248-41-177.client.attbi.com> Date: Thu, 30 May 2002 10:10:24 -0500 To: "David LeBlanc" Cc: "Jeff Shannon" , Subject: RE: Crashing IDLE In-Reply-To: References: X-Mailer: VM 6.96 under 21.4 (patch 6) "Common Lisp" XEmacs Lucid Reply-To: skip@pobox.com X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: 
python-list-admin@python.org Errors-To: python-list-admin@python.org X-BeenThere: python-list@python.org X-Mailman-Version: 2.0.11 (101270) Precedence: bulk List-Help: List-Post: List-Subscribe: , List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: David> I would consider that a bug - "pass" should be checking for David> ctrl-c and other events imo. It sure strikes me as a point for David> relinquishing control. It will relinquish control to another thread and sense KeyboardInterrupt. If your app is not threaded though, Tk will never get control so it can process its event queue. That's what fills up. -- Skip Montanaro (skip@pobox.com - http://www.mojam.com/) Boycott Netflix - they spam - http://www.musi-cal.com/~skip/netflix.html """ Perhaps CRM114's one-byte count clamps are needed to prevent insane scores (a form of bias acting against the extreme HOH correlation), or perhaps one of the hash reductions mapped "control" to "big penis", or ... who knows? If someone wants to pursue this (I've seen enough ), it would be a lot more interesting now to download CRM114 and run it the way the author intended. ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: crm.patch Type: application/octet-stream Size: 11203 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021103/5cf1f41f/crm.exe ---------------------- multipart/mixed attachment-- From tim.one@comcast.net Sun Nov 3 07:19:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 02:19:36 -0500 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> Message-ID: [TimS] > Ok, so Tim says I'm not reading it backwards, Richie says I am... That's because I was reading your question backwards. Sorry! You were right the first time: you were reading it backwards. 
From tim.one@comcast.net Sun Nov 3 07:31:39 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 02:31:39 -0500 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <1T985KFB9MK09SRD8PLJG1X05SNSOIG.3dc473c5@riven> Message-ID: [Tim@mail.powweb.com] > Has there been any thought given to additional classifications, > beyond ham|unsure|spam? No; you could get "a score" with 17 decimal digits of precision, about 1 of which is meaningful. > Like, ham|probablyham|unsure|probablyspam|spam, with > corresponding cutoffs specified in Options? I don't know if > that's interesting to anybody at all... > > I could see X-Spambayes-Classification: probablyspam being useful > as a range of mail that should be checked manually... That's what Unsure is for. If you don't check Unsure msgs, you'll be sorry. They split about half-and-half between ham and spam for me, and if the system *could* have made a better jugmint about them, it would have. If you do have the score, we've gotten mixed reports here about whether sorting Unsure msgs by score is helpful. I find that it is in my email, but there are many exceptions (ham closer to high end of the Unsure range, and spam closer to the low end). From rob@hooft.net Sun Nov 3 07:32:01 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 03 Nov 2002 08:32:01 +0100 Subject: [Spambayes] Spambayes Header Format Message-ID: <3DC4D0F1.5000509@hooft.net> Lots of discussion about the Spambayes header, but nobody takes any concrete initiatives. For me, the proposal... X-Spambayes-Classification: {Ham|Unsure|Spam} ...looks very good. But obviously this will break backward compatibility. And since I'm only using hammie.py and procmail, I can only change those parts and test them. To make everything work together again we'd have to make a concerted effort. Better now than later? Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Sun Nov 3 07:44:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 02:44:43 -0500 Subject: [Spambayes] Spam at hackers conference In-Reply-To: Message-ID: [Tim@mail.powweb.com] > I've *always* suspected that spambayes in combination with other > technology would present a very powerful anti-spam arsenal. But > spambayes by itself is so good, that it may not really require > supplemental technology. I say *always* because I've only been in > this game for a couple weeks... ;) so what do I REALLY know? I don't know what to do about opt-in advertising, apart from the obvious: keep an eye out for it in your Spam folder, and train on it as Ham whenever it shows up there. This is effective. Very brief msgs from rare correspondents seem also to be a problem, because lots of spam is also very brief (believe it or not ). python.org has a very specific problem: the various mailing lists have *-request addresses, for adminstrivia. Greg currently whitelists the snot out of those recipients in SpamAssassin, else a significant percentage of that traffic would be considered spam. *This* code appears to be less willing to call it spam than unfiddled SpamAssassin, but it's still the major source of FPs in my python.org mail tests. The kind of FP here has the single word "unsubscribe" or "help" or "confirm 1534232" buried under 10KB of employer-generated HTML disclaimers, or is sent as a reply to a spam or conference announcement the poster found objectionable, quoted in full. Making things worse, "subscribe" and "unsubscribe" are themselves high-spamprob words. The FP rate is still very low even with that, but every non-trivial scheme has non-zero error rates, and that has to be realized. 
From tim.one@comcast.net Sun Nov 3 07:52:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 02:52:18 -0500 Subject: [Spambayes] Spam at hackers conference In-Reply-To: <15812.42195.737956.831082@montanaro.dyndns.org> Message-ID: [Skip Montanaro, on whitelists] > I doubt it. There is just too much email spoofing going on to trust any > addresses that absolutely. When using SA, I rarely used its whitelist > facility, and only for odd email addresses whose automailings it always > classified as spam. For instance, I get a bit of mail from American > Airlines letting me know when the airfare between Chicago and Albany > changes. As you might imagine, it's very spammy looking. Yup, I get the same kind of thing from Expedia (what's up with that? does Microsoft own my soul too ?), and it rated as solid spam before training on it. Now it's solid ham, in part because the specific routes it always tells me about have become recognized as strong ham words. > The only way I could convince SA to leave it alone was to whitelist it. > With Spambayes, it's never a problem. OTOH, I talk here about my sisters, and email from them is often brief, and initially scored as Unsure. I could whitelist them without problem. They're not computer geeks, and one of them has never gotten spam: she has no web or mailing-list presence at all, and uses a small regional ISP. Nobody is going to guess her address, unless they break into the ISP's database. If they do, then maybe a whitelist will dump a spam or two in my Inbox. BFD. OTOH, after training on msgs from my sisters, my classifier also scores them as ham now anyway. I'm having more trouble when esmokes.com changes the brands of cancer sticks on sale this week . 
From jbublitz@nwinternet.com Sun Nov 3 07:47:32 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sat, 02 Nov 2002 23:47:32 -0800 (PST) Subject: [Spambayes] Spam at hackers conference In-Reply-To: <15812.42195.737956.831082@montanaro.dyndns.org> Message-ID: On 03-Nov-02 Skip Montanaro wrote: >> > - Combining different approaches (e.g. blacklists, whitelists, >> > Bayesian) seems to make people more comfortable. >> Tim> I doubt a blacklist is going to be worth the bother with >> this scheme, but a whitelist may be. > I doubt it. There is just too much email spoofing going on to > trust any addresses that absolutely. When using SA, I rarely > used its whitelist facility, and only for odd email addresses > whose automailings it always classified as spam. For instance, > I get a bit of mail from American Airlines letting me know when > the airfare between Chicago and Albany changes. As you might > imagine, it's very spammy looking. The only way I could > convince SA to leave it alone was to whitelist it. With > Spambayes, it's never a problem. You may be correct that from a purely technical point of view Spambayes doesn't really need a whitelist (although the fp rate is still non-zero), but there are some other considerations. From my personal point of view, I spend a lot of money to get certain email sent to me, and missing some email could be very costly ($10 could be off by orders of magnitude in the worst case). For those reasons alone, I want a whitelist. I also recently saw someplace (/.?) an article about a woman suing an ISP who cut off her email because of non-payment. She's suing because she missed an email from a potential employer for a possible high-paying job. If I were in a position similar to that ISP (potential liability), I think "due diligence" would require that I make every effort to make sure valid mail got through - hence a more deterministic method in combination with a statistical method (in combination with review in my case).
A convenient whitelist option seems to me to make it a more attractive package. I'd want whitelisted mail to go into the training database too. Jim From tim.one@comcast.net Sun Nov 3 08:12:00 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 03:12:00 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <15811.7287.50962.651569@slothrop.zope.com> Message-ID: [Jeremy Hylton, on the Folder parts of MarkH's training interface] > ... > This part of the code doesn't work that well for my mail folders. The > code to move messages from folder to folder needs to be written in > elisp. I'm not sure how important that is. Whatever a general-purpose training class may look like, it seems to need two concepts: "a msg", and "a collection of msgs", the latter to remember, e.g., which msgs have been trained as ham, and which as spam. Mark views collections as folders because that's actually how they're set up in the Outlook client, but a "virtual folder" makes sense too. In your case you may have just two folders, Ham and Spam, which exist only in cyberspace, as a way for the training class to keep track of the state of your training. Mark's MoveTo() is then just a way to record the classification a msg should have. > ... > # It's important not to commit a transaction until > # after update_probabilities is called in update(). > # Otherwise some new entries will cause scoring to fail. I'm not sure what that's about, but I probably fixed it late last week. Outlook has lots of threads, and it was possible there for scoring to occur in parallel with training. WordInfo records are now created with the unknown-word spamprob by default instead of with None, so that an attempt to score a brand-new word is effectively ignored instead of raising an exception.
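To make that last point concrete, here is a rough sketch of why a neutral default is safer than None when scoring can overlap with training. It is a simplified stand-in, not the project's actual WordInfo class: the attribute names, the clue_strength helper, and the 0.5 value used for the unknown-word probability are all assumptions for illustration.

```python
UNKNOWN_WORD_PROB = 0.5  # assumed value of the unknown-word spamprob


class WordInfo:
    """Per-token training record (simplified stand-in, not the real class)."""

    def __init__(self, spamprob=UNKNOWN_WORD_PROB):
        self.spamcount = 0
        self.hamcount = 0
        # Defaulting to the unknown-word probability, rather than None,
        # means a record created mid-training can still be scored.
        self.spamprob = spamprob


def clue_strength(record):
    """Distance from neutrality; 0.0 means the token tells us nothing."""
    return abs(record.spamprob - 0.5)


# A brand-new record is neutral, so a scorer would effectively ignore it.
print(clue_strength(WordInfo()))
```

With spamprob=None the same call raises a TypeError, which is the kind of scoring failure the quoted comment was guarding against.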
From tim.one@comcast.net Sun Nov 3 08:49:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 03:49:04 -0500 Subject: [Spambayes] Something to test Message-ID: This little patch arranges to create "noheader:HEADERNAME" tokens for headers in options.safe_headers that *don't* appear in a msg's headers. On my fat c.l.py test it's a small theoretical improvement: best-cost falls from $26.80 to $22.00, by knocking down the score of the second-worst hopeless FP just enough so that redeeming it *could* be traded away for an increase in the Unsure rate. That's not realistic, though (the spam_cutoff value needed to redeem that FP is no longer insane, but is still *unreasonably* high). I'm keener on it because it eliminated a few difficult FPs without changing cutoffs, in three smaller tests on different test data. I haven't run a test where it hurt yet, and it has helped several times. This captures the useful (in my data) part of what Anthony's tokenization of Reply-To accomplished, without needing to tokenize the Reply-To content. The thing that helped me there was that tokenizing Reply-To inadvertently generated a token for the *absence* of a Reply-To header, and that's a ham clue in my data, provided that the classifier can see it. One effect of the patch is to generate a "noheader:reply-to" token when no Reply-To is found in the headers; another is that the lack of an Organization header becomes a spam clue in my data. Sometimes more than one of these cooperate to help push a difficult case to "the right side" of a cutoff.
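Independently of the patch itself, the idea of emitting a token for each *missing* header can be sketched as a standalone function. This is a minimal illustration in present-day Python, not the tokenizer's real code: the function name, taking plain lists of header names instead of a Message object, lowercasing every name, and sorting the "noheader" tokens are all assumptions made to keep the example self-contained.

```python
def header_presence_tokens(header_names, safe_headers):
    """Yield count tokens for headers seen and 'noheader:' tokens for those absent."""
    seen = {}
    for name in header_names:
        name = name.lower()
        seen[name] = seen.get(name, 0) + 1
    # One token per header that is present, carrying how often it occurred.
    for name, count in seen.items():
        yield "header:%s:%d" % (name, count)
    # One token per "safe" header that never showed up in this message.
    for name in sorted(set(safe_headers) - set(seen)):
        yield "noheader:" + name


tokens = list(header_presence_tokens(
    ["Received", "Received", "Subject"],
    {"subject", "reply-to", "organization"},
))
print(tokens)
```

The absence tokens give the classifier something to count even when a header carries no content of its own, which is the effect the message above attributes to the Reply-To and Organization cases.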
Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.60 diff -c -r1.60 tokenizer.py *** tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60 --- tokenizer.py 3 Nov 2002 08:31:44 -0000 *************** *** 1178,1183 **** --- 1178,1185 ---- x2n[x] = x2n.get(x, 0) + 1 for x in x2n.items(): yield "header:%s:%d" % x + for x in options.safe_headers - Set([k.lower() for k in x2n]): + yield "noheader:" + x def tokenize_body(self, msg, maxword=options.skip_max_word_size): """Generate a stream of tokens from an email Message. From richie@entrian.com Sun Nov 3 10:48:21 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 10:48:21 +0000 Subject: [Spambayes] Spambayes Header Format In-Reply-To: <3DC4D0F1.5000509@hooft.net> References: <3DC4D0F1.5000509@hooft.net> Message-ID: <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> Hi Rob, > Lots of discussion about the Spambayes header, but nobody takes any > concrete initiatives. For me, the proposal... > > X-Spambayes-Classification: {Ham|Unsure|Spam} > > ...looks very good. But obviously this will break backward > compatibility. And since I'm only using hammie.py and procmail, I can > only change those parts and test them. To make everything work together > again we'd have to make a concerted effort. Better now than later? I agree. I volunteer to make the edit, and to combine any duplicated code ("if prob > spam_cutoff: disp = 'Yes'" currently appears in at least two places, for instance). Let's give it a couple of days and see whether there are any violent objections or better suggestions, then I'll make the edit. Is that OK with everyone? 
-- Richie Hindle richie@entrian.com From mhammond@skippinet.com.au Sun Nov 3 11:26:04 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sun, 3 Nov 2002 22:26:04 +1100 Subject: [Spambayes] Spambayes Header Format In-Reply-To: <3DC4D0F1.5000509@hooft.net> Message-ID: > Lots of discussion about the Spambayes header, but nobody takes any > concrete initiatives. For me, the proposal... > > X-Spambayes-Classification: {Ham|Unsure|Spam} Something I find cute about "Yes, "No", "Unsure" is that it sorts naturally. And-my-brain-even-processes-it-naturally-ly. Mark. From rob@hooft.net Sun Nov 3 12:18:11 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 03 Nov 2002 13:18:11 +0100 Subject: [Spambayes] Spambayes Header Format References: <3DC4D0F1.5000509@hooft.net> <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> Message-ID: <3DC51403.6030208@hooft.net> Richie Hindle wrote: > Hi Rob, > > >>Lots of discussion about the Spambayes header, but nobody takes any >>concrete initiatives. For me, the proposal... >> >> X-Spambayes-Classification: {Ham|Unsure|Spam} >> >>...looks very good. But obviously this will break backward >>compatibility. And since I'm only using hammie.py and procmail, I can >>only change those parts and test them. To make everything work together >>again we'd have to make a concerted effort. Better now than later? > > > I agree. I volunteer to make the edit, and to combine any duplicated code > ("if prob > spam_cutoff: disp = 'Yes'" currently appears in at least two > places, for instance). Other todo items and ideas: - Make the "Yes, Unsure, No" items into Options, keeping the defaults the same as in the past for a few days. - Add the debugging info optionally as X-Spambayes-Info Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From Tim@mail.powweb.com Sun Nov 3 12:47:42 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sun, 03 Nov 2002 06:47:42 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: I agree.. it was a dumb idea. Hopefully I've exhausted my quota of those... ;) - TimS 11/3/2002 1:31:39 AM, Tim Peters wrote: >[Tim@mail.powweb.com] >> Has there been any thought given to additional classifications, >> beyond ham|unsure|spam? > >No; you could get "a score" with 17 decimal digits of precision, about 1 of >which is meangingful,. > >> Like, ham|probablyham|unsure|probablyspam|spam, with >> corresponding cutoffs specified in Options? I don't know if >> that's interesting to anybody at all... >> >> I could see X-Spambayes-Classification: probablyspam being useful >> as a range of mail that should be checked manually... > >That's what Unsure is for. If you don't check Unsure msgs, you'll be sorry. >They split about half-and-half between ham and spam for me, and if the >system *could* have made a better jugmint about them, it would have. > >If you do have the score, we've gotten mixed reports here about whether >sorting Unsure msgs by score is helpful. I find that it is in my email, but >there are many exceptions (ham closer to high end of the Unsure range, and >spam closer to the low end). > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From Tim@mail.powweb.com Sun Nov 3 13:25:24 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sun, 03 Nov 2002 07:25:24 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven> I agree with the need for a general purpose training class that accepts a single message or collection of messages. 
In addition it should optionally remember existing training, or create a new training database. >a collection of msgs", the latter to remember, >e.g., which msgs have been trained as ham, and which as spam. Remembering is an interesting idea, but what real purpose does it serve aside from making testing easier? - TimS 11/3/2002 2:12:00 AM, Tim Peters wrote: >[Jeremy Hylton, on the Folder parts of MarkH's training interface\ >> ... >> This part of the code doesn't work that well for my mail folders. The >> code to move messages from folder to folder needs to be written in >> elisp. I'm not sure how important that is. > >Whatever a general-purpose training class may look like, it seems to need >two concepts: "a msg", and "a collection of msgs", the latter to remember, >e.g., which msgs have been trained as ham, and which as spam. Mark views >collections as folders because that's actually how they're set up in the >Outlook client, but a "virtual folder" makes sense too. In your case you >may have just two folders, Ham and Spam, which exist only in cyberspace, as >a way for the training class to keep track of the state of your training. >Mark's MoveTo() is then just a way to record the classification a msg should >have. > >> ... >> # It's important not to commit a transaction until >> # after update_probabilities is called in update(). >> # Otherwise some new entries will cause scoring to fail. > >I'm not sure what that's about, but I probably fixed it late last week >(Outlook has lots of threads, and it was possible there for scoring to occur >in parallel with training; WordInfo records are now created with the >unknown-word spamprob by default instead of with None, so that an attempt to >score a brand-new word is effectively ignored instead of raising an >exception). 
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From rob@hooft.net Sun Nov 3 14:00:04 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 03 Nov 2002 15:00:04 +0100 Subject: [Spambayes] Spambayes Header Format References: <3DC4D0F1.5000509@hooft.net> <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> <3DC51403.6030208@hooft.net> Message-ID: <3DC52BE4.6010602@hooft.net> I wrote: > Other todo items and ideas: > - Make the "Yes, Unsure, No" items into Options, keeping the defaults > the same as in the past for a few days. > - Add the debugging info optionally as X-Spambayes-Info I just did some of this, and some other plans I had: * Added options "header_spam_string", "header_unsure_string", "header_ham_string". Defaults are set to "Yes", "Unsure", "No". * Added options header_score_digits and header_score_logarithm. The first is an integer telling hammie in how many digits it should show the score. If the second option is set to "True", scores of 1.00 or 0.00 are augmented by a logarithmic "one-ness" or "zero-ness" score (basically it shows the "number of zeros" or "number of nines" next to the score value). * Added support for a debugging header using the boolean hammie_debug_header option and the string hammie_debug_header_name * Changed hammie.py to use all of the new options Please note that I've tried to make this backward compatible where I thought that was essential (hope my thoughts are exhaustive). If the pop3 proxy and the outlook plugin are adapted to use the same options as hammie, we can change the defaults at any point without breaking the interaction (only procmail recipes and other external clients will need to be adapted). Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tdickenson@devmail.geminidataloggers.co.uk Sun Nov 3 15:27:08 2002 From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Sun, 3 Nov 2002 15:27:08 +0000 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> Message-ID: <200211031527.08809.tdickenson@devmail.geminidataloggers.co.uk> > Forwarding to ham@ and spam@ would > be a bit of a pain at first, but it would work for existing bodies of mail. > Training would be MUCH simpler with this method, and would not require > some fancy-schmancy installation or configuration glorp. Forwarding to spam@ or ham@ has some disadvantages because the forwarding process destroys some information. Most mail clients don't forward headers. From richie@entrian.com Sun Nov 3 16:41:59 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 16:41:59 +0000 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <200211031527.08809.tdickenson@devmail.geminidataloggers.co.uk> References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven> <200211031527.08809.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: Hi Toby, > Forwarding to spam@ or ham@ has some disadvantages because the forwarding > process destroys some information. Most mail clients don't forward headers. The inbound part (pop3proxy, hammie, the Outlook stuff, whatever) could cache the messages, then the SMTP proxy could compare the forwarded messages with the cache (somehow - there'd be no Message-Id to compare) to find the original to train against. You're right - losing headers will make a difference, even with the fairly minimal header tokenising we currently do.
When I added the Unsure classification to pop3proxy, I tested it by forwarding a bunch of spams to myself and they all came out Unsure where they had been Yes before - at first I thought it was a bug, but then a couple of genuine spams rolled in and were classified correctly. -- Richie Hindle richie@entrian.com From richie@entrian.com Sun Nov 3 16:42:14 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 16:42:14 +0000 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: <3DC49421.8EAA85FE@whidbey.com> References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> <3DC4676B.F5ED02CE@whidbey.com> <3DC49421.8EAA85FE@whidbey.com> Message-ID: Hi Van, > As to the final choice of the name, the image of a stern black-robed jurist > behind a high podium is a lot more appealing to me than an entymologist with a > magnifier or a librarian choosing a Dewey Decimal System code for a book. So I > vote for Judgement over Classification. I think 'judgement' is the better word too, but I also think the risk of seeming to have mis-spelled it outweighs the benefits. Plus, the word 'classify' is already strongly associated with what we're doing. -- Richie Hindle richie@entrian.com From popiel@wolfskeep.com Sun Nov 3 17:01:15 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 03 Nov 2002 09:01:15 -0800 Subject: [Spambayes] Spambayes Header Format In-Reply-To: Message from Richie Hindle <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> References: <3DC4D0F1.5000509@hooft.net> <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> Message-ID: <20021103170116.12994F57D@cashew.wolfskeep.com> In message: <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> Richie Hindle writes: >Hi Rob, > >> Lots of discussion about the Spambayes header, but nobody takes any >> concrete initiatives. For me, the proposal... >> >> X-Spambayes-Classification: {Ham|Unsure|Spam} >> >> ...looks very good. But obviously this will break backward >> compatibility. 
And since I'm only using hammie.py and procmail, I can >> only change those parts and test them. To make everything work together >> again we'd have to make a concerted effort. Better now than later? > >I agree. I volunteer to make the edit, and to combine any duplicated code >("if prob > spam_cutoff: disp = 'Yes'" currently appears in at least two >places, for instance). > >Let's give it a couple of days and see whether there are any violent >objections or better suggestions, then I'll make the edit. Is that OK with >everyone? Sounds good to me. I was going to second this header proposal anyway, once I got through the stack of mail waiting for me this morning... - Alex From popiel@wolfskeep.com Sun Nov 3 17:14:08 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 03 Nov 2002 09:14:08 -0800 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message from Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven> References: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven> Message-ID: <20021103171408.3A986F57D@cashew.wolfskeep.com> In message: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven> Tim@mail.powweb.com writes: > >Remembering (training) is an interesting idea, but what real purpose >does it serve aside from making testing easier? Remembering helps in the following scenario: Mail is trained on as it is received, reinforcing whatever judgement spambayes already made. Then, if a mistake is made, the mistaken message is untrained from the remembered category and trained into the new category. Remembering the training type in association with the message itself (instead of inferring it from what folder it's in or some such) makes it simpler to implement incremental training along these lines. Heck, it helps even if the training isn't automatic, because it keeps you from having to train from scratch any time a training error is discovered. 
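A minimal sketch of the remember-then-correct bookkeeping described above, assuming a toy token-count store rather than spambayes' real classifier. The names TinyTrainer, trained_as, and the message ids are hypothetical and introduced only for illustration:

```python
from collections import Counter


class TinyTrainer:
    """Toy stand-in for a trainable classifier; NOT the spambayes API."""

    def __init__(self):
        # Per-category count of messages in which each token appeared.
        self.counts = {"ham": Counter(), "spam": Counter()}
        # The remembered training category for each message id.
        self.trained_as = {}

    def train(self, msg_id, tokens, category):
        previous = self.trained_as.get(msg_id)
        if previous == category:
            return  # already trained this way; nothing to do
        unique = set(tokens)
        if previous is not None:
            # Mistake discovered: untrain from the remembered category.
            self.counts[previous].subtract(unique)
        # Train into the new category and remember it for next time.
        self.counts[category].update(unique)
        self.trained_as[msg_id] = category


trainer = TinyTrainer()
trainer.train("msg-1", ["cheap", "pills"], "ham")    # automatic training, wrong
trainer.train("msg-1", ["cheap", "pills"], "spam")   # user correction
```

Because the category is stored with the message id, the correction touches only that one message's tokens; nothing has to be retrained from scratch.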
- Alex From popiel@wolfskeep.com Sun Nov 3 17:25:11 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 03 Nov 2002 09:25:11 -0800 Subject: [Spambayes] An alternate use In-Reply-To: Message from Skip Montanaro <15811.56800.291193.255713@montanaro.dyndns.org> References: <20021102062939.3CD89F5AC@cashew.wolfskeep.com> <15811.56800.291193.255713@montanaro.dyndns.org> Message-ID: <20021103172511.6984AF57D@cashew.wolfskeep.com> In message: <15811.56800.291193.255713@montanaro.dyndns.org> Skip Montanaro writes: > >I was using a spoof-proof mechanism from procmail before I disabled >SpamAssassin. I inserted my own header using formail:
>
>    :0H
>    * ! ^X-SA-Host:
>    {
>        :0fw
>        | spamc | $FORMAIL -a "X-SA-Host: `hostname --fqdn`"
>    }
>
>which says, "if there is no X-SA-Host header present, run spamc, add a >header and include the fully qualified hostname". If an X-SA-Host header is >present it tells me spamc had already been run on this message (I was >running SA on two different machines at the time). That way I wasn't >relying on SA's own headers to decide whether or not to run it. This is not spoof-proof; it's merely relying on no one else inserting an X-SA-Host header. If any mail comes in with that header already on it, you don't run SpamAssassin. Even if you made the rule pay attention to the hostname in the header, there's nothing preventing someone from inserting a header with the right hostname. The two obvious methods for making it reasonably spoof-proof are comparing with routing information (and making sure that your mail daemon (and all the upstream mail daemons that you trust) reject mail from hosts that lie about their identity), or putting a cryptographic signature in the header (signing the body + whatever classification headers you're trusting because of the signature). Verifying either of these methods is beyond the abilities of most end-user filters.
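A minimal sketch of the second method, assuming a shared-secret HMAC stands in for the cryptographic signature (a keyed MAC, not a public-key signature, so only hosts holding the secret can verify it). SECRET, sign, and verify are hypothetical names for illustration, and how the result would be carried in a header is left open:

```python
import hashlib
import hmac

# Assumption: a secret known only to the trusted filtering hosts.
SECRET = b"site-local-secret"


def sign(body: bytes, classification: str, key: bytes = SECRET) -> str:
    """Return a hex MAC covering the classification value and the body."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    mac.update(classification.encode("utf-8"))
    mac.update(b"\n")  # separator between the two signed parts
    mac.update(body)
    return mac.hexdigest()


def verify(body: bytes, classification: str, signature: str,
           key: bytes = SECRET) -> bool:
    """True only if the signature matches this body and classification."""
    expected = sign(body, classification, key)
    # Constant-time comparison; assumes an ASCII hex signature string.
    return hmac.compare_digest(expected, signature)


sig = sign(b"message body", "spam")
trusted = verify(b"message body", "spam", sig)
```

A filter would trust an existing classification header only when verify() succeeds; a forged header without the secret, or a header copied onto a different body, fails the check.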
- Alex From tim.one@comcast.net Sun Nov 3 18:28:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 13:28:18 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: [Richie Hindle] > ... > You're right - losing headers will make a difference, even with the fairly > minimal header tokenising we currently do. When I added the Unsure > classification to pop3proxy, I tested it by forwarding a bunch of spams to > myself and they all came out Unsure where they had been Yes before - at > first I thought it was a bug, but then a couple of genuine spams rolled in > and were classified correctly. There's indeed a *lot* of info in the headers we look at by default. About a full day of work went into deciding on each one of those, and finding the most helpful way to tokenize each. Alas, most of that work went into discovering which headers didn't improve results, or gave great results for bogus reasons. OTOH, at the start we didn't look at headers *at all* in this project (it took a long time to sort out the problems with headers in mixed-source corpora), so we worked harder than other projects at tokenizing the body in effective ways too. Here's the tokenization generator:

    def tokenize(self, obj):
        msg = self.get_message(obj)

        for tok in self.tokenize_headers(msg):
            yield tok

        for tok in self.tokenize_body(msg):
            yield tok

If we comment out either loop, the classifier will see only the headers or only the body. Here are results from doing that, on the same randomized set of 2000 ham + 2000 spam from my c.l.py test, with ham_cutoff=0.2 and spam_cutoff=0.8, and also using the "generate tokens for the absence of key header lines too" patch I posted in the wee hours.
"before" is looking at both hdrs and body, "hdr" looking only at headers (no bodies), and "body" looking only at bodies (no headers):

    filename:     before       hdr      body
    ham:spam:  2000:2000 2000:2000 2000:2000
    fp total:          1         0         5
    fp %:           0.05      0.00      0.25
    fn total:          0         0         1
    fn %:           0.00      0.00      0.05
    unsure t:         20        29        62
    unsure %:       0.50      0.72      1.55
    real cost:    $14.00     $5.80    $63.40
    best cost:     $2.00     $1.60    $10.40
    h mean:         0.55      0.66      1.68
    h sdev:         4.50      3.46      8.02
    s mean:        99.91     99.40     99.56
    s sdev:         1.64      3.46      4.46
    mean diff:     99.36     98.74     97.88
    k:             16.18     14.27      7.84

A higher spam_cutoff would have helped the body column a lot, but it's clear we're getting an enormous amount of useful info out of the handful of header lines we look at by default; indeed, the hdr column is marginally better than the before column! In the body column, the FN was one of those brief "Paul, it was great to see you today. The proposal will be ready tomorrow. Heidi." spams. The only real spam clues in those are in the headers. The FP are harder to characterize, a mix of conference announcements, one-liner "unsubscribe" thingies, and thoroughly off-topic posts. By default they get redeemed because the headers contain clues that they came from a real person, and weren't posted using spammer software that leaves behind strange capitalization (BTW, "MiME-Version:", with the lowercase i, turned out to be one of the highest-spamprob words in my personal email classifier too -- wasn't unique to BruceG's spam).
Using twice as much test data makes a mildly interesting point:

    filename:     before       hdr      body
    ham:spam:  4000:4000 4000:4000 4000:4000
    fp total:          1         0         4
    fp %:           0.03      0.00      0.10
    fn total:          0         0         1
    fn %:           0.00      0.00      0.03
    unsure t:         28        71       114
    unsure %:       0.35      0.89      1.43
    real cost:    $15.60    $14.20    $63.80
    best cost:     $2.40     $3.80    $20.00
    h mean:         0.36      0.63      1.44
    h sdev:         3.28      3.68      6.89
    s mean:        99.93     99.44     99.64
    s sdev:         1.42      3.40      4.07
    mean diff:     99.57     98.81     98.20
    k:             21.19     13.96      8.96

The h and s means & sdevs in the hdr column barely budge, but in the body column obviously "improve". That suggests there's more variability in the bodies (than in the headers) of both ham and spam. Bottom line: the header info is vital in this scheme for best results, but you could get a useful classifier out of headers alone or bodies alone! From tim.one@comcast.net Sun Nov 3 18:51:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 13:51:48 -0500 Subject: [Spambayes] An alternate use In-Reply-To: <3DC40090.5050109@hooft.net> Message-ID: [Tim] > That's actually what started this project: Barry Warsaw is GNU > Mailman's author, and he asked me to look into adapting Graham's > scheme for incorporation into Mailman. ... [Rob Hooft] > So, we'd have to make mailing lists keep a spam-archive as well? Or do > we deliver spambayes with a pre-cooked spam archive to get started with > new mailing lists? That will remain unclear until someone sets up relevant experiments and people measure results. I'm counting on Barry to drive that. Seeding a mailing-list classifier with ham may also be a puzzle. I suspect, but don't know, that training several times on the initial list introduction post will do well at that -- most lists have "a topic", and a good list intro is bound to mention many words characteristic of that topic. For python.org use, I expect we'll share a single spam corpus across all non-personal email carried by that site.
One of the reasons I keep the default header analysis as platform-independent as I can is so that it won't be a nightmare to *try* to share spam stats. I haven't tried to do this, though. A hint of potential: where w is the WordInfo dict from my fat c.l.py test:

    """
    d = {}
    for k, r in w.iteritems():
        if r.spamprob > 0.95 and r.spamcount + r.hamcount >= 10:
            d[k] = r

    f = file('reduced.pik', 'wb')
    pickle.dump(d, f, 1)
    f.close()
    """

Of the 327,439 words in the full dict, 10,559 pass that rather demanding test for "strong spamness" (high spamprob and not close to being a hapax). Seeding a classifier with those *may* work well, although the probabilities will get recomputed in the new classifier, and it's unclear (to me) how to fiddle the spamcounts and hamcounts in the inherited words so that they don't dominate the first year of a list's life. From tim.one@comcast.net Sun Nov 3 19:27:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 03 Nov 2002 14:27:08 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <20021101003712.GA28132@rmunnlfs> Message-ID: [Robin Munn] > ... > something that could integrate into, say, Outlook Express and add > a "Block this junk mail" button (which adds the message to the spam > corpus) to the E-mail reading interface. AFAIK, Outlook Express has no hooks at all for programmers -- it's a closed end-user app (as opposed to Outlook, which is highly programmable). OTOH, OE's file format is relatively easy to reverse-engineer (again unlike Outlook's), which gives some hope for a separate process to watch what the user does indirectly. I doubt there's any way to filter incoming mail in OE short of having the user (1) redirect to a pop3 proxy, and (2) set up an OE rule to look for a header injected by the proxy. > ... > 1. It must integrate into the user's email client as seamlessly as > possible. This means researching the plugin API of Outlook, Eudora, > Pegasus Mail, Mozilla, et al.
If you're interested in the masses, you could make life easier by restricting this to clients actually used by the masses. > 2. The algorithm and filtering component must also run in the background > without any user intervention required after the initial install. This > means being able to install as a Windows NT service or into the StartUp > folder of Windows 9x. The current Outlook 2000 client runs as an in-process server, via COM. That means it starts up automatically whenever the user starts Outlook, and closes itself down when the user quits Outlook. IOW, services and startup groups aren't required for Outlook integration. They may be for OE, but nobody here has shown a sign of looking at an OE approach (apart from Richie Hindle, who had OE in mind when he wrote pop3proxy -- although this may be news to him). > 3. There *MUST* be good documentation. We all know the user is going to > run the installer program before reading the documentation, but we must > include a "How to train your filter to recognize junk mail" document > that the installer displays after finishing installation. This means > actually writing said documentation. :-) OTOH, the masses don't read docs. In a previous life I worked at a company doing commercial software "for the masses", and doing usability testing for mass use is extremely time-consuming and expensive. The masses don't see what techies see when they look at a UI, they don't read docs, and they do very surprising things. One of my sisters suffered a network outage at work, and, in frustration, picked up her keyboard and slammed it on her desk. The network happened to come back up again then. I won't say any more, apart from noting that she has an ongoing problem with broken keyboards. From richie@entrian.com Sun Nov 3 19:35:56 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 19:35:56 +0000 Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: References: Message-ID: Hi Tim, > There's indeed a *lot* of info in the headers we look at by default. [Snip very interesting experiment] > Bottom line: the header info is vital in this scheme for best results, but > you could get a useful classifier out of headers alone or bodies alone! That last fact could be very useful, but I'm not sure I know how yet. 8-) -- Richie Hindle richie@entrian.com From richie@entrian.com Sun Nov 3 19:44:08 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 03 Nov 2002 19:44:08 +0000 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: References: <20021101003712.GA28132@rmunnlfs> Message-ID: [Tim] > Richie Hindle, who had OE in mind when he wrote pop3proxy -- although this > may be news to him You and your time machine are quite right. > the masses don't read docs. I've yet to test this theory, but this is one reason I'd like to use HTML as the 'GUI toolkit' for the UI of the POP3 proxy. The docs can be tied so closely to the UI that people won't even realise they're reading them... -- Richie Hindle richie@entrian.com From skip@pobox.com Sun Nov 3 21:25:39 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 3 Nov 2002 15:25:39 -0600 Subject: [Spambayes] An alternate use In-Reply-To: <20021103172511.6984AF57D@cashew.wolfskeep.com> References: <20021102062939.3CD89F5AC@cashew.wolfskeep.com> <15811.56800.291193.255713@montanaro.dyndns.org> <20021103172511.6984AF57D@cashew.wolfskeep.com> Message-ID: <15813.37971.977432.564454@montanaro.dyndns.org> >> If an X-SA-Host header is present it tells me spamc had already been >> run on this message ... Alex> This is not spoof-proof; it's merely relying on no one else Alex> inserting an X-SA-Host header. Well, yeah, but I think the odds of some spammer deciding to crack into my mailbox by inserting X-SA-Host are slim. On the other hand, spammers are clearly already faking SpamAssassin headers. 
Skip From Tim@mail.powweb.com Sun Nov 3 22:37:43 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sun, 03 Nov 2002 16:37:43 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: Yeah, forward generally loses headers... My mailer has a redirect function, which sends the entire thing, headers and all... much better for this kind of thing. So this leaves us back at the question of training a database with mailers that do not provide for the export of mail into file system artifacts. Most mailers do only have a forward function, which lops off most of the headers... the smtp could use the mail cached by the pop3proxy, assuming it is running... which makes me believe that perhaps the pop3proxy and smtpproxy should be different threads on the same process. That way, users don't have to have two processes running, and the two sides of the equation can more easily keep themselves in sync. - TimS 11/3/2002 10:41:59 AM, Richie Hindle wrote: >Hi Toby, > >> Forwarding to spam@ or ham@ has some disadvantages because the forwarding >> process destroys some information. Most mail clients dont forward headers. > >The inbound part (pop3proxy, hammie, the Outlook stuff, whatever) could >cache the messages, then the SMTP proxy could compare the forwarded >messages with the cache (somehow - there'd be no Message-Id to compare) to >find the original to train against. > >You're right - losing headers will make a difference, even with the fairly >minimal header tokenising we currently do. When I added the Unsure >classification to pop3proxy, I tested it by forwarding a bunch of spams to >myself and they all came out Unsure where they had been Yes before - at >first I thought it was a bug, but then a couple of genuine spams rolled in >and were classified correctly. 
> >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From skip@pobox.com Sun Nov 3 22:56:29 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 3 Nov 2002 16:56:29 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: References: <3DC447AB.D64D6CE6@whidbey.com> <3DC4676B.F5ED02CE@whidbey.com> Message-ID: <15813.43421.554220.439152@montanaro.dyndns.org> Richie> Or maybe the *real* question is, shall we call the header Richie> X-Spambayes-Classification? I suggested X-Ham-Status. I believe "ham" and "status" are spelled the same in most dialects of English. Skip From piersh@friskit.com Sun Nov 3 23:21:33 2002 From: piersh@friskit.com (Piers Haken) Date: Sun, 3 Nov 2002 15:21:33 -0800 Subject: [Spambayes] Email client integration -- what's needed? Message-ID: <9891913C5BFE87429D71E37F08210CB9297504@zeus.sfhq.friskit.com> it would seem to me that instead of trying to cram these message store capabilities into a protocol that just doesn't support it (POP3), why not use a protocol that does (IMAP4)? i'd suggest writing a simple IMAP 'proxy' (possibly single-user) that retrieves messages from a 'real' mail server via POP3, scans the incoming messages, classifies them, then puts them in the corresponding folders. the IMAP server can then reclassify/retrain on the messages when they are moved between folders (just as the outlook plugin does). piers. -----Original Message----- From: Tim@mail.powweb.com [mailto:Tim@mail.powweb.com] Sent: Sunday, November 03, 2002 2:38 PM To: Spambayes Subject: Re: [Spambayes] Email client integration -- what's needed? Yeah, forward generally loses headers... My mailer has a redirect function, which sends the entire thing, headers and all... much better for this kind of thing. 
So this leaves us back at the question of training a database with mailers that do not provide for the export of mail into file system artifacts. Most mailers do only have a forward function, which lops off most of the headers... the smtp could use the mail cached by the pop3proxy, assuming it is running... which makes me believe that perhaps the pop3proxy and smtpproxy should be different threads on the same process. That way, users don't have to have two processes running, and the two sides of the equation can more easily keep themselves in sync. - TimS 11/3/2002 10:41:59 AM, Richie Hindle wrote: >Hi Toby, > >> Forwarding to spam@ or ham@ has some disadvantages because the forwarding >> process destroys some information. Most mail clients dont forward headers. > >The inbound part (pop3proxy, hammie, the Outlook stuff, whatever) could >cache the messages, then the SMTP proxy could compare the forwarded >messages with the cache (somehow - there'd be no Message-Id to compare) to >find the original to train against. > >You're right - losing headers will make a difference, even with the fairly >minimal header tokenising we currently do. When I added the Unsure >classification to pop3proxy, I tested it by forwarding a bunch of spams to >myself and they all came out Unsure where they had been Yes before - at >first I thought it was a bug, but then a couple of genuine spams rolled in >and were classified correctly.
> >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes From seant@iname.com Mon Nov 4 01:00:35 2002 From: seant@iname.com (Sean True) Date: Sun, 3 Nov 2002 20:00:35 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <9891913C5BFE87429D71E37F08210CB9297504@zeus.sfhq.friskit.com> Message-ID: > it would seem to me that instead of trying to cram these message store > capabilities into a protocol that just doesn't support it (POP3), why > not use a protocol that does (IMAP4)? > > i'd suggest writing a simple IMAP 'proxy' (possibly single-user) that > retrieves messages from a 'real' mail server via POP3, scans the > incoming messages, classifies them, then puts them in the corresponding > folders. the IMAP server can then reclassify/retrain on the messages > when they are moved between folders (just as the outlook plugin does). > I like this idea, with one significant reservation -- I'm now trusting an external IMAP mail store with my mail, all 2 GB of it. That makes me queasy, a little, and underlines the problematic of implementing a mail store, instead of just an MTA. -- Sean From anthony@interlink.com.au Mon Nov 4 01:04:07 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 04 Nov 2002 12:04:07 +1100 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <9891913C5BFE87429D71E37F08210CB9297504@zeus.sfhq.friskit.com> Message-ID: <200211040104.gA4147M06387@localhost.localdomain> >>> "Piers Haken" wrote > it would seem to me that instead of trying to cram these message store > capabilities into a protocol that just doesn't support it (POP3), why > not use a protocol that does (IMAP4)?
> > i'd suggest writing a simple IMAP 'proxy' (possibly single-user) that > retrieves messages from a 'real' mail server via POP3, scans the > incoming messages, classifies them, then puts them in the corresponding > folders. the IMAP server can then reclassify/retrain on the messages > when they are moved between folders (just as the outlook plugin does). The problems with this approach are to do with the complexity of IMAP. It's a heavyweight protocol, and lots of different IMAP clients abuse it in slightly different ways each. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From piersh@friskit.com Mon Nov 4 02:09:51 2002 From: piersh@friskit.com (Piers Haken) Date: Sun, 3 Nov 2002 18:09:51 -0800 Subject: [Spambayes] Email client integration -- what's needed? Message-ID: <9891913C5BFE87429D71E37F08210CB9297505@zeus.sfhq.friskit.com> Might it be possible, in the case where you're already using an IMAP message store, to write an IMAP client which connects to that store with the express purpose of filtering the email? It could somehow watch for messages incoming and being moved between folders and perform the filtering/retraining based on those events. I just see problems trying to use POP3, which is essentially a message transfer protocol being used to perform functions which should be applied to a message store. The alternative is to write handlers for all different kinds of message stores... Piers. > -----Original Message----- > From: Sean True [mailto:seant@iname.com] > Sent: Sunday, November 03, 2002 5:01 PM > To: Piers Haken; Tim@mail.powweb.com; Spambayes > Subject: RE: [Spambayes] Email client integration -- what's needed? > > > > it would seem to me that instead of trying to cram these > message store > > capabilities into a protocol that just doesn't support it > (POP3), why > > not use a protocol that does (IMAP4)?
> > > > i'd suggest writing a simple IMAP 'proxy' (possibly > single-user) that > > retrieves messages from a 'real' mail server via POP3, scans the > > incoming messages, classifies them, then puts them in the > > corresponding folders. the IMAP server can then > reclassify/retrain on > > the messages when they are moved between folders (just as > the outlook > > plugin does). > > > > I like this idea, with one significant reservation -- I'm now > trusting an external IMAP mail store with my mail, all 2 GB > of it. That makes me queasy, a little, and underlines the > problematic of implementing a mail store, instead of just an MTA. > > -- Sean > > From Tim@mail.powweb.com Mon Nov 4 03:04:44 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sun, 03 Nov 2002 21:04:44 -0600 Subject: [Spambayes] Need some clarification on the training database Message-ID: <8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc@riven> Ok, so help me out... neiltrain does its thing then writes to a cdb. But hammie appears to expect that the training database be a pickle. It's not adding up, and when I start the pop3proxy pointing at the database I trained using the code I lifted from neiltrain.py, it pukes (of course)... - TimS From Tim@mail.powweb.com Mon Nov 4 03:51:45 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Sun, 03 Nov 2002 21:51:45 -0600 Subject: [Spambayes] Need some clarification on the training database In-Reply-To: <8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc@riven> Message-ID: Never mind... I figured it out. Ok, so I have the pop3proxy looking at the training database that my smtpproxy creates when I send it spam and ham via redirect, which preserves headers in my mailer. Unfortunately, it also adds my headers, so I'm training spambayes to recognize mail from myself to myself as spam or ham, depending on how many of each I send...
;) But this isn't a good general solution, because most people use mailers that only offer forward, which may strip off many of the original headers. It would be a simple thing to make the pop3proxy store incoming mails, but this is kinda silly because then the mail may be being stored in three places: the server, the mailer, and the pop3proxy. Another solution might be for the proxy to package up the headers and add them to the mail as an attachment, that the smtpproxy could recognize... All that to say "I'll live to fight tomorrow" - TimS 11/3/2002 9:04:44 PM, Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote: >Ok, so help me out... neiltrain does its thing then writes to a cdb. But hammie appears to expect that the training database be a pickle. It's not adding up, and >when I start the pop3proxy pointing at the database I trained using the code I lifted from neiltrain.py, it pukes (of course)... > >- TimS > > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From B-Morgan@concentric.net Mon Nov 4 05:17:58 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Sun, 3 Nov 2002 22:17:58 -0700 Subject: [Spambayes] Need some clarification on the training database In-Reply-To: Message-ID: > But this isn't a good general solution, because most people use mailers that only offer > forward, which may strip off many of the original headers. It would be a > simple thing to make the pop3proxy store incoming mails, but this is kinda silly because > then the mail may be being stored in three places: the server, the > mailer, and the pop3proxy. Another solution might be for the proxy to package up the > headers and add them to the mail as an attachment, that the smtpproxy > could recognize... The copy that the pop3proxy keeps doesn't need to be kept long term. Only long enough to be told it got something wrong. 
After a while, it can clear its cache under the assumption that no news is good news. Look at the popfile interface (another Sourceforge project). They save the messages and use an HTTP interface to interact with the proxy. Not a bad solution. Regards, Brad From tim.one@comcast.net Mon Nov 4 06:21:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 01:21:45 -0500 Subject: [Spambayes] Need some clarification on the training database In-Reply-To: Message-ID: ]Tim@mail.powweb.com\ > Never mind... I figured it out. Ok, so I have the pop3proxy > looking at the training database that my smtpproxy creates when I > send it spam and ham via redirect, which preserves headers in my > mailer. Unfortunately, it also adds my headers, so I'm training > spambayes to recognize mail from myself to myself as spam or ham, > depending on how many of each i send... ;) At least that part shouldn't matter much. The spamprob guesses are based on the percentages of spam and ham in which a word appears. If a particular header clue about you appears in 100% of the forwarded spam, and 100% of the forwarded ham, it will get spamprob 0.5 regardless of the absolute numbers involved. That is, it will be 100% neutral. This can still distort scores, but I expect in a very minor way, and via pulling them toward Unsure rather than toward either side. From rob@hooft.net Mon Nov 4 06:26:21 2002 From: rob@hooft.net (Rob Hooft) Date: Mon, 04 Nov 2002 07:26:21 +0100 Subject: [Spambayes] counterweight: it really works! Message-ID: <3DC6130D.40508@hooft.net> Hmmm. I trained hammie on my private account yesterday night (already running about a week at work), and found this in my spam folder this morning: ====================== Subject: *** We want to finance/buy your business..Pres. please !
Mime-Version: 1.0 Content-type: text/plain; charset="iso-8859-1" Message-Id: <20021104021425.D097A77809@temoleh.chem.uu.nl> Date: Mon, 4 Nov 2002 03:14:25 +0100 (CET) X-Spam-Status: No, hits=0.1 required=5.0 tests=PLING version=2.31 X-Spam-Level: X-Hammie-Disposition: Yes; 1.00 (8) ====================== Just to remind everyone that this software really works! Its spambayes score deviates from 1.0 only by about 10**-8, but SA didn't see much. The setup I had to make for my private account is really quite involved. My private mail domain arrives on a workstation with ample space, but without a pop or imap server and without an adequate backup policy for storing E-mail (it used to be forwarded by a postfix virtual map). On the other hand, I read it from an IMAP server that doesn't have python2 installed, nor do I have enough space there for a hammie.db. I ended up with a .procmailrc on the workstation that ends by forwarding the non-spam messages to the IMAP server using "sendmail -i"... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From anthony@interlink.com.au Mon Nov 4 06:27:47 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 04 Nov 2002 17:27:47 +1100 Subject: [Spambayes] Something to test In-Reply-To: Message-ID: <200211040627.gA46Rm108104@localhost.localdomain> >>> Tim Peters wrote > This little patch arranges to create "noheader:HEADERNAME" tokens for > headers in options.safe_headers that *don't* appear in a msg's headers. On > my fat c.l.py test it's a small theoretical improvement: best-cost falls > from $26.80 to $22.00, by knocking down the score of the second-worst > hopeless FP just enough so that redeeming it *could* be traded away for an > increase in the Unsure rate. That's not realistic, though (the spam_cutoff > value needed to redeem that FP is no longer insane, but is still > *unreasonably* high). 
>

filename:       before      after
ham:spam:   11192:1826 11192:1826
fp total:            0          1
fp %:             0.00       0.01
fn total:            7          8
fn %:             0.38       0.44
unsure t:          106        107
unsure %:         0.81       0.82
real cost:      $28.20     $39.40
best cost:      $28.20     $30.40
h mean:           0.63       0.42
h sdev:           4.19       4.19
s mean:          98.68      98.63
s sdev:           7.74       7.95
mean diff:       98.05      98.21
k:                8.22       8.09

The additional fp was a mail-out from Nettwerk (that I've signed up for, but which are _incredibly_ spammy) that went from 0.956 to 0.964, where my spam cutoff is 0.96. The noheader: errors-to was the killer clue that pushed it over the edge.

The spam situation is considerably worse. The additional false negative was something that went from 0.467 to 0.431 (ham_cutoff 0.45). The damage came from

    prob('noheader:mime-version') = 0.245329

(It was a very short spam)

One fn went from 0.27 to 0.029, due to:

    prob('noheader:subject') = 0.0042591
    prob('noheader:to') = 0.0652536
    prob('noheader:mime-version') = 0.245329

It made pretty much all of my fn's at least slightly worse, if not much worse.

For what it's worth the "Iron Citadel" comp.lang.python spam is currently showing up as a 0.0057 ham, prob('*H*')=1, prob('*S*')=0.0115174

This is far and away the worst spam I've seen for some time.

--
Anthony Baxter
It's never too late to have a happy childhood.

From tim.one@comcast.net Mon Nov 4 06:37:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 01:37:49 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To:
Message-ID:

[Sean True]
> I like this idea, with one significant reservation -- I'm now trusting an
> external IMAP mail store with my mail, all 2 GB of it.

You realize that Outlook has a hard 2GB limit on .pst files, right?
They keep fixing this in service packs whenever a new version of Outlook comes out, most recently for Outlook 2002:

    http://support.microsoft.com/default.aspx?scid=KB;EN-US;q304863

"The fix" this time around is to display

    Task 'Microsoft Exchange Server - Receiving' reported error (0x8004060C): 'Unknown Error 0x8004060C'

instead of silently refusing to accept new email(!) when 2GB is approached. For earlier versions of Outlook, MS now makes a tool available that truncates a too-big .pst file, with no hope of recovering the lost data. The good news is that they say you can at least start Outlook again then.

> That makes me queasy, a little, and underlines the problematic of
> implementing a mail store, instead of just an MTA.

I expect that Python's "large file" support on Windows is more reliable than Outlook's (well, the latter is guaranteed to fail in nasty ways, but that's not what I mean by reliable).

From anthony@interlink.com.au Mon Nov 4 06:40:17 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 04 Nov 2002 17:40:17 +1100
Subject: [Spambayes] Re: [Spambayes-checkins] website background.ht,1.1,1.2
In-Reply-To:
Message-ID: <200211040640.gA46eIe08260@localhost.localdomain>

JFYI - I'd like corrections and updates to this. I'm attempting to channel Tim (always an error-prone task) and I've undoubtedly got stuff wrong.

>>> "Anthony Baxter" wrote
> Update of /cvsroot/spambayes/website
> In directory usw-pr-cvs1:/tmp/cvs-serv16178
>
> Modified Files:
> background.ht
> Log Message:
> A bit of a potted history here. I probably have a bunch of things here
> that need to be cleaned up and made more obvious, but hey, it's a start.
> > > Index: background.ht > =================================================================== > RCS file: /cvsroot/spambayes/website/background.ht,v > retrieving revision 1.1 > retrieving revision 1.2 > diff -C2 -d -r1.1 -r1.2 > *** background.ht 19 Sep 2002 23:39:24 -0000 1.1 > --- background.ht 4 Nov 2002 06:39:42 -0000 1.2 > *************** > *** 15,18 **** > --- 15,67 ---- >

more links? mail anthony at interlink.com.au

> > +

Overall Approach

> + Please note that I (Anthony) am writing this based on memory and > + limited understanding of some of the subtler points of the maths. Gentle > + corrections are welcome, or even encouraged. > +

Tokenizing

> +

The architecture of the spambayes system has a couple of distinct > + parts. The first, and most obvious, is the tokenizer. This takes > + a mail message and breaks it up into a series of tokens. At the moment > + it splits words out of the text parts of a message, there's a variety > + of header tokenization that goes on as well. The code in tokenizer.py > + and the comments in the Tokenizer section of Options.py contain more > + information about various approaches to tokenizing.

> + > +

Combining and Scoring

> +

The next part of the system is the scoring and combining part. This > + is where the hairy mathematics and statistics come in.

> +

Initially we started with Paul Graham's original combining scheme - > + this has a number of "magic numbers" and "fuzz factors" built into it. > + The Graham combining scheme has a number of problems, aside from the > + magic in the internal fudge factors - it tends to produce scores of > + either 1 or 0, and there's a very small middle ground in between - it > + doesn't often claim to be "unsure", and gets it wrong because of this. > + There's a number of discussions back and forth between Tim Peters and > + Gary Robinson on this subject in the mailing list archives - I'll try > + and put links to the relevant threads at some point.
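
For reference, a minimal sketch of the product rule from Paul Graham's "A Plan for Spam", stripped of the clamping and the selection of the most extreme clues (that is, without the magic numbers mentioned above). The function name is illustrative only. It shows why the scores pile up near 0 or 1: a small imbalance in the clues is enough to push the result to an extreme.

```python
from math import prod

def graham_combine(probs):
    """Combine per-word spam probabilities with Graham's product rule."""
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)

# Five spammy and five hammy clues cancel out exactly.
balanced = graham_combine([0.9] * 5 + [0.1] * 5)
# Swap a single hammy clue for a spammy one and the score jumps.
tilted = graham_combine([0.9] * 6 + [0.1] * 4)
print(round(balanced, 3), round(tilted, 3))
```

With ten clues the balanced case comes out at about 0.5, while the six-to-four case already lands at about 0.988, leaving almost no room for an "unsure" verdict.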

> +

Gary produced a number of alternative approaches to combining and > + scoring word probabilities. The initial one, after much back and forth > + in the mailing list, is in the code today as 'gary_combining'. A couple > + of other approaches, using the Central Limit Theorem, were also tried. > + They produced interesting output - but histograms of the ham and spam > + distributions had a disturbingly large overlap in the middle. There was > + also an issue with incremental training and untraining of messages that > + made it harder to use in the "real world". These two central limit > + approaches were dropped after Tim, Gary and Rob Hooft produced a combining > + scheme using chi-squared probabilities. This is now the default combining > + scheme.

> +

The chi-squared approach produces two numbers - a "ham probability" ("*H*") > + and a "spam probability" ("*S*"). A typical spam will have a high *S* > + and low *H*, while a ham will have high *H* and low *S*. In the case where > + the message looks entirely unlike anything the system's been trained on, > + you can end up with a low *H* and low *S* - this is the code saying "I don't > + know what this message is". So at the end of the processing, you end up > + with three possible results - "Spam", "Ham", or "Unsure". It's possible to > + tweak the high and low cutoffs for the Unsure window - this trades off > + unsure messages vs possible false positives or negatives.
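
The *H*/*S* pairs and final scores quoted elsewhere in this thread (for example *H* 0.999899, *S* 0.0886315, score 0.0443661) are consistent with a final score of (S - H + 1) / 2. A small sketch under that assumption; the function names are made up here, the cutoffs of 0.20 and 0.80 are the ones Tim Peters mentions using for his own mail rather than fixed defaults, and the handling of a score exactly on a cutoff is a guess.

```python
def final_score(h, s):
    """Fold the ham (*H*) and spam (*S*) indicators into one score in [0, 1]."""
    return (s - h + 1.0) / 2.0

def classify(score, ham_cutoff=0.20, spam_cutoff=0.80):
    """Map a score onto the three possible results (boundary handling assumed)."""
    if score < ham_cutoff:
        return "Ham"
    if score > spam_cutoff:
        return "Spam"
    return "Unsure"

# (H, S) pairs: a real pair from this thread, a clear spam,
# a message unlike anything seen (both low), and one that looks like both.
for h, s in [(0.999899, 0.0886315), (0.01, 0.99), (0.05, 0.05), (0.95, 0.95)]:
    score = final_score(h, s)
    print(h, s, round(score, 4), classify(score))
```

Both the low/low and the high/high cases land at 0.5, which is why either kind of message ends up in the Unsure window.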

> + > +

Training

> +

TBD

> + >

Mailing list archives

>

There's a lot of background on what's been tried available from > > > > _______________________________________________ > Spambayes-checkins mailing list > Spambayes-checkins@python.org > http://mail.python.org/mailman/listinfo/spambayes-checkins > -- Anthony Baxter It's never too late to have a happy childhood. From rob@hooft.net Mon Nov 4 07:09:39 2002 From: rob@hooft.net (Rob Hooft) Date: Mon, 04 Nov 2002 08:09:39 +0100 Subject: [Spambayes] Re: [Spambayes-checkins] website background.ht,1.1,1.2 References: <200211040640.gA46eIe08260@localhost.localdomain> Message-ID: <3DC61D33.20602@hooft.net> Anthony Baxter wrote: >>+ and a "spam probability" ("*S*"). A typical spam will have a high *S* >>+ and low *H*, while a ham will have high *H* and low *S*. In the case where >>+ the message looks entirely unlike anything the system's been trained on, >>+ you can end up with a low *H* and low *S* - this is the code saying "I don't >>+ know what this message is". Some messages can even have both a high *H* and a high *S*, telling you basically that the message looks very much like ham, but also very much like spam. In this case spambayes is also unsure where the message should be classified, and the final score will be near 0.5. >>+ So at the end of the processing, you end up >>+ with three possible results - "Spam", "Ham", or "Unsure". It's possible to >>+ tweak the high and low cutoffs for the Unsure window - this trades off >>+ unsure messages vs possible false positives or negatives. -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Mon Nov 4 07:37:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 02:37:32 -0500 Subject: [Spambayes] Something to test In-Reply-To: <200211040627.gA46Rm108104@localhost.localdomain> Message-ID: Just a quickie (I realized I have to sleep sometime ): [Anthony Baxter] > ... 
> > For what it's worth the "Iron Citadel" comp.lang.python spam is > currently showing up as a 0.0057 ham, prob('*H*')=1, prob('*S*')=0.0115174 > This is far and away the worst spam I've seen for some time. It's great spam. I had to read quite a bit of it before I realized it was spam. If the "real purpose" of this kind of project is to alter spammers' behavior, then that's a glimpse of the future: soft-sell spam with lots of detail talk and almost no blatant advertising. It scores 0.33 for me today (H=0.69 and S=0.35), but the still-low S value got "so high" only because of hapaxes (words unique in this msg). My lowest-scoring spam to date *still* scores under 0.05 (even after training on it): """ Spam Score: 0.0443661 '*H*' 0.999899 '*S*' 0.0886315 'url:python-list' 0.0111408 python.org 'url:mailman' 0.0116596 python.org 'url:python' 0.0143819 python.org 'url:listinfo' 0.0150747 python.org 'header:X-Complaints-to:1' 0.0185989 orignal 'url:org' 0.0444069 python.org 'header:Errors-to:1' 0.0452392 python.org 'header:Organization:1' 0.0737601 original 'header:Return-path:1' 0.0757934 python.org 'header:Message-id:1' 0.0792662 original 'url:mail' 0.0882956 python.org 'header:MIME-version:1' 0.0986049 original 'header:Reply-to:1' 0.15973 original 'header:Received:4' 0.162073 original 'subject:new' 0.327577 original 'url:com' 0.60446 original 'from:email addr:infonie.fr' 0.844828 hapaxes from here 'from:email name:bmcc' 0.844828 'message-id:@infonie.fr' 0.844828 'paix.' 0.844828 'url:keyrouz' 0.844828 'voix' 0.844828 'x-mailer:mozilla 4.7 (macintosh; i; ppc)' 0.844828 to here 'peace.' 0.908163 original Message Stream: Return-path: Path: news.baymountain.com!uunet!ash.uu.net!dfw.uu.net!sac.uu.net!lore.csc.com!nnt p.abs.net!news.maxwell.syr.edu!newsfeed.icl.net!newsfeed.fjserv.net!news.tel e.dk!news.tele.dk!small.news.tele.dk!news-fra1.dfn.de!newsfeed.hanau.net!fr. 
clara.net!heighliner.fr.clara.net!news.tiscali.fr!not-for-mail Received: from bright01.icomcast.net (bright01-qfe0.icomcast.net [172.20.4.8]) by msgstore01.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H4V0021NFMT35@msgstore01.icomcast.net> for tim.one@ims-ms-daemon (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 19:20:53 -0500 (EST) Received: from mtain03 (bright-LB.icomcast.net [172.20.3.155])for <@msgstore01.icomcast.net:tim.one@comcast.net>; Thu, 31 Oct 2002 19:21:13 -0500 (EST) Received: from mail.python.org (mail.python.org [12.155.117.29]) by mtain03.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H4V009DCFN3ZE@mtain03.icomcast.net> for tim.one@comcast.net (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 19:21:03 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) id 187PYc-0005rG-00; Thu, 31 Oct 2002 19:21:02 -0500 X-Trace: news2adm.tiscali.fr. 
1036109620 28050 172.29.129.3 (1 Nov 2002 00:13:40 GMT) Date: Fri, 01 Nov 2002 01:17:34 +0100 From: bmcc@infonie.fr Subject: new Sender: python-list-admin@python.org To: python-list@python.org Errors-to: python-list-admin@python.org Reply-to: bmcc@infonie.fr Message-id: <3DC1C81F.66B97F91@infonie.fr> Organization: Guest of TISCALI - FRANCE X-Complaints-to: abuse@libertysurf.fr MIME-version: 1.0 X-Mailer: Mozilla 4.7 (Macintosh; I; PPC) Content-type: text/plain; charset=us-ascii Content-transfer-encoding: 7bit NNTP-posting-date: Fri, 1 Nov 2002 00:13:40 +0000 (UTC) X-Accept-Language: en Precedence: bulk X-BeenThere: python-list@python.org X-NNTP-Posting-Host: dyn-212-232-61-200.ppp.tiscali.fr Newsgroups: comp.lang.python Lines: 3 NNTP-posting-host: news3adm.tiscali.fr X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: General discussion list for the Python programming language Xref: news.baymountain.com comp.lang.python:187745 The voice of peace. La voix de la paix. http://www.keyrouz.com -- http://mail.python.org/mailman/listinfo/python-list """ I've currently got 1,947 spam in my personal classifier. 2 of them score as ham (that was one of 'em; the other was similar, and also got huge boosts from having come via python.org), and 16 score as Unsure (ham cutoff at 0.20, spam cutoff at 0.80). It's delightful that "peace." is a high spamprob word, eh ? From tim.one@comcast.net Mon Nov 4 07:59:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 02:59:26 -0500 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: [Tim] > ... > If you do have the score, we've gotten mixed reports here about whether > sorting Unsure msgs by score is helpful. I find that it is in my > email, but there are many exceptions (ham closer to high end of the > Unsure range, and spam closer to the low end). 
I forgot something there: sorting by score is *extremely* helpful in a different context: after a batch of training, I score the ham and spam training sets themselves, then sort them by score (Outlook is very good at this -- sorted displays, and grouped displays, on arbitrary columns, is built in). Misclassified msgs reliably end up at "the wrong end" of the display, and that makes recovery easy.

From anthony@interlink.com.au Mon Nov 4 09:50:30 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 04 Nov 2002 20:50:30 +1100
Subject: [Spambayes] A couple of small tokenizer experiments.
Message-ID: <200211040950.gA49oU809201@localhost.localdomain>

First experiment was to make the URL tokenizer look for the string 'mailman' in the URL. If it was found, simply push the clue "url: Mailman URL" onto the clue-pile. This was an attempt to remove the many many related clues that get bolted onto the occasional spam that makes it past Greg to the python.org mailservers. It's something of a violation of "stupid beats smart", but I'd noticed that the mailman footer from spam via mailman lists was always providing a bunch of clues that were making life harder.

--- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60
+++ tokenizer.py 4 Nov 2002 06:59:37 -0000
@@ -931,6 +931,11 @@
             new_text.append(text[i : start])
             new_text.append(' ')
 
+            if guts.find('mailman') != -1:
+                pushclue("url: Mailman URL")
+                i = end
+                break
+
             pushclue("proto:" + proto)
             # Lose the trailing punctuation for casual embedding, like:
             #     The code is at http://mystuff.org/here?  Didn't resolve.

This produced an improvement in unsure and in fn, but made a couple of high-scoring hams a bit worse.
Nothing that can't be fixed by tweaking the spam-cutoff number:

filename:   before-nomailma after-nomailman
ham:spam:        11192:1826      11192:1826
fp total:                 0               2
fp %:                  0.00            0.02
fn total:                 7               5
fn %:                  0.38            0.27
unsure t:               108             104
unsure %:              0.83            0.80
real cost:           $28.60          $45.80
best cost:           $28.00          $27.60
h mean:                0.62            0.67
h sdev:                4.27            4.52
s mean:               98.69           98.89
s sdev:                7.69            6.96
mean diff:            98.07           98.22
k:                     8.20            8.56

from the tails of the files, before gave
-> achieved at ham & spam cutoffs 0.48 & 0.96
-> fp 0; fn 8; unsure ham 30; unsure spam 70
-> fp rate 0%; fn rate 0.438%; unsure rate 0.768%

after gave best effort of
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.98
-> fp 0; fn 6; unsure ham 31; unsure spam 77
-> fp rate 0%; fn rate 0.329%; unsure rate 0.83%

The top end of the spam histogram is enlightening (I think):

before:
90    3 *
91    2 *
92    8 *
93    2 *
94   11 *
95    1 *
96    6 *
97   12 *
98   18 *
99 1712 ************************************************************

after:
90    0
91    5 *
92    5 *
93    5 *
94    5 *
95    7 *
96    5 *
97    7 *
98   21 *
99 1722 ************************************************************

So this really did some nice work for a kinda ugly hack :)

Next I tried tokenizing the To: line. I parsed it properly, then decoded the real name and split the words. I also added a token for the RHS and LHS of the email @ sign.
--- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60
+++ tokenizer.py 4 Nov 2002 09:26:12 -0000
@@ -5,6 +5,8 @@
 
 import email
 import email.Message
+import email.Header
+import email.Utils
 import email.Errors
 import re
 import math
@@ -1099,6 +1110,17 @@
             count = 0
             for addrs in msg.get_all(field, []):
                 count += len(addrs.split(','))
+                for rname,ename in email.Utils.getaddresses([addrs]):
+                    if rname:
+                        d = email.Header.decode_header(rname)[0]
+                        rname,rcharset = d
+                        for w in rname.split():
+                            yield field+'realname: '+w
+                        if rcharset is not None:
+                            yield field+'charset: '+rcharset
+                    if ename:
+                        for w in ename.split('@'):
+                            yield field+'email: '+w
             if count > 0:
                 yield '%s:2**%d' % (field, round(log2(count)))

filename:   after-nomailman   after-tocctok
ham:spam:        11192:1826      11192:1826
fp total:                 2               2
fp %:                  0.02            0.02
fn total:                 5               6
fn %:                  0.27            0.33
unsure t:               104              83
unsure %:              0.80            0.64
real cost:           $45.80          $42.60
best cost:           $27.60          $25.40
h mean:                0.67            0.60
h sdev:                4.52            4.19
s mean:               98.89           98.96
s sdev:                6.96            6.94
mean diff:            98.22           98.36
k:                     8.56            8.84

The "best cost" data, before:
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.98
-> fp 0; fn 6; unsure ham 31; unsure spam 77
-> fp rate 0%; fn rate 0.329%; unsure rate 0.83%

and after:
-> best cost for all runs: $25.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.47 & 0.98
-> fp 0; fn 6; unsure ham 27; unsure spam 70

This, to me, seems a clear win - in particular, I see stuff like:

    prob('toemail: connect.com.au') = 0.998359
    prob('toemail: arb') = 0.998992

(this is an old old email address that gets mostly spam now).

The final test was to decode the Subject header if it's encoded, and tokenize the decoded text rather than the encoded form.

--- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60
+++ tokenizer.py 4 Nov 2002 09:45:25 -0000
@@ -1071,6 +1078,10 @@
         # especially significant in this context.  Experiment showed a small
         # but real benefit to keeping case intact in this specific context.
         x = msg.get('subject', '')
+        # Subject decoding.
+        x, subjcharset = email.Header.decode_header(x)[0]
+        if subjcharset is not None:
+            yield 'subjectcharset:' + subjcharset
         for w in subject_word_re.findall(x):
             for t in tokenize_word(w):
                 yield 'subject:' + t

filename:    after-tocctok2 after-subjdecode
ham:spam:        11192:1826       11192:1826
fp total:                 2                1
fp %:                  0.02             0.01
fn total:                 6                6
fn %:                  0.33             0.33
unsure t:                83               87
unsure %:              0.64             0.67
real cost:           $42.60           $33.40
best cost:           $25.40           $24.00
h mean:                0.60             0.59
h sdev:                4.19             4.18
s mean:               98.96            98.92
s sdev:                6.94             7.05
mean diff:            98.36            98.33
k:                     8.84             8.76

Tails of the runs: before:
-> best cost for all runs: $25.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.47 & 0.98
-> fp 0; fn 6; unsure ham 27; unsure spam 70
-> fp rate 0%; fn rate 0.329%; unsure rate 0.745%

after:
-> best cost for all runs: $24.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.47 & 0.97
-> fp 0; fn 6; unsure ham 27; unsure spam 63
-> fp rate 0%; fn rate 0.329%; unsure rate 0.691%

Remember that the best before these 3 patches was:
-> fp 0; fn 8; unsure ham 30; unsure spam 70
-> fp rate 0%; fn rate 0.438%; unsure rate 0.768%

So this (to me) seems a bunch of definite wins.
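
The cost figures in these tables can be reproduced from the counts. A small sketch, assuming the weights printed in the run tails ($10.00 per fp, $1.00 per fn, $0.20 per unsure); the function name is invented for illustration and is not taken from the test drivers.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted cost of a test run, in dollars."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# "real cost" after the Subject-decoding patch: 1 fp, 6 fn, 87 unsure
real_after = total_cost(1, 6, 87)
# "best cost" at cutoffs 0.47 & 0.97: 0 fp, 6 fn, 27 + 63 unsure
best_after = total_cost(0, 6, 27 + 63)
print("$%.2f $%.2f" % (real_after, best_after))
```

Because one fp costs as much as fifty unsures, the "best cost" cutoffs tend to trade every fp away for a wider Unsure window.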
But Tim is free to disagree :)

My remaining 6 fns are:

a Brazilian spam-ish thing: (*H* 0.633859 *S* 0.20342 = 0.28478)
-----------------
>From angel@rjnet.com.br Sat Sep 28 09:35:45 2002
Return-Path:
Received: from localhost (localhost.localdomain [127.0.0.1]) by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g8RNZhh05864 for ; Sat, 28 Sep 2002 09:35:44 +1000
Received: from mail.interlink.com.au [203.9.111.130] by localhost with POP3 (fetchmail-5.9.0) for anthony@localhost (single-drop); Sat, 28 Sep 2002 09:35:44 +1000 (EST)
Received: from mediterraneo.rjnet.com.br (root@[200.152.115.30]) by valdez.interlink.com.au (8.11.6/8.11.2) with ESMTP id g8RNZJc28230 for ; Sat, 28 Sep 2002 09:35:20 +1000
Received: from locutus.rjnet.com.br (root@locutus.rjnet.com.br [200.222.31.10]) by mediterraneo.rjnet.com.br (8.11.4/8.11.4) with ESMTP id g8RNNc801901; Fri, 27 Sep 2002 20:23:38 -0300
Received: from localhost ([200.222.39.21]) by locutus.rjnet.com.br (8.11.2/8.11.2) with ESMTP id g8RMqEN00464; Fri, 27 Sep 2002 19:52:14 -0300
Date: Fri, 27 Sep 2002 19:52:14 -0300
From: Liliane Andrade Angel
Message-Id: <200209272252.g8RMqEN00464@locutus.rjnet.com.br>

DATA
-----------------
I plan to try something like tokenizing the oldest three received lines (to hopefully avoid the previous issues with mail.python.org blowing numbers to hell) to see if that will help this one.

The "iron citadel" python-list spam (*H* 0.999999, *S* 0.038123 = 0.01906)

A base64d MP3 spam sent via zope-dev (*H* 0.993904, *S* 0.187868 = 0.0969820429397) which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and also the various mailman type clues (although that's better with the first patch, above)

Someone spamming Linux CDs via a list at 4thought (*H* 1, *S* 0.207177 = 0.103588442478)

A short porn spam sent via python-list (*H* 0.817004, *S* 0.618399 = 0.400697521022)

A weird German spam for some sort of expert systems (in English).
(*H* 0.997132, *S* 0.84965 = 0.426259133645) From bkc@murkworks.com Mon Nov 4 15:07:15 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 04 Nov 2002 10:07:15 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: References: <20021101003712.GA28132@rmunnlfs> Message-ID: <3DC64611.30897.3CB37091@localhost> On 3 Nov 2002 at 14:27, Tim Peters wrote: > AFAIK, Outlook Express has no hooks at all for programmers -- it's a closed How about this? http://msdn.microsoft.com/library/en-us/mapi/html/_mapi1book_using_message_filtering_to_manage_messages.asp I think this is new information released under the DOJ settlement. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From msergeant@startechgroup.co.uk Mon Nov 4 15:17:42 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 04 Nov 2002 15:17:42 +0000 Subject: [Spambayes] counterweight: it really works! References: <3DC6130D.40508@hooft.net> Message-ID: <3DC68F96.6070809@startechgroup.co.uk> Rob Hooft said the following on 04/11/02 06:26: > Hmmm. I trained hammie on my private account yesterday night (already > running about a week at work), and found this in my spam folder this > morning: > > ====================== > Subject: *** We want to finance/buy your business..Pres. please ! > Mime-Version: 1.0 > Content-type: text/plain; charset="iso-8859-1" > Message-Id: <20021104021425.D097A77809@temoleh.chem.uu.nl> > Date: Mon, 4 Nov 2002 03:14:25 +0100 (CET) > X-Spam-Status: No, hits=0.1 required=5.0 > tests=PLING > version=2.31 > X-Spam-Level: > X-Hammie-Disposition: Yes; 1.00 (8) > ====================== > > Just to remind everyone that this software really works! Its spambayes > score deviates from 1.0 only by about 10**-8, but SA didn't see much. Please don't compare to 4 months old SpamAssassin's. Upgrade if you want to compare. Thanks. Matt. 
From msergeant@startechgroup.co.uk Mon Nov 4 15:40:42 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 04 Nov 2002 15:40:42 +0000 Subject: [Spambayes] Why I added src=cid: etc References: Message-ID: <3DC694FA.7000905@startechgroup.co.uk> Tim Peters said the following on 03/11/02 03:20: > This is typical of the kind of email I'm getting a lot of lately. Without > mining the HTML, there's almost nothing to look at, not even a word in the > Subject line. (Of course, if we weren't throwing the HTML tags away, the > classifier would have learned this stuff on its own.) It's a virus though. Why don't you just get a gateway scanner (like the one I wrote [1] for qpsmtpd [2] which plugs into qmail and bounces viruses with a 5xx return code) which uses clamav[3]? It's optimised for catching viruses, so you can focus on just catching spam (lets face it, the techniques are slightly different). [1] http://use.perl.org/~Matts/journal/ # down at the moment so I can't find the specific journal entry - but it was fairly recently and is obvious because it's about 50 lines of perl [2] http://www.develooper.com/code/qpsmtpd/ [3] http://clamav.elektrapro.com/ I'm down from about 20 viruses a day (because my address ends up in a lot of web caches) to zero. And I'm very happy about it ;-) From tim@fourstonesforum.com Sat Nov 2 18:24:08 2002 From: tim@fourstonesforum.com (Tim Stone Four Stones Forum) Date: Sat, 02 Nov 2002 12:24:08 -0600 Subject: [Spambayes] x-hammie-disposition in pop3proxy In-Reply-To: Message-ID: Kewl, Richie. Ok, so the next thing is I have to run three of these things. I can do that if I can make the proxy listen on different ports. I've modified the code to do that, was a simple mod. Do you want the mod? 11/2/2002 12:16:05 PM, Richie Hindle wrote: >Hi Tim, > >> adding the x-hammie-disposition header with value of 'no'. > >'No' means it thinks it's ham - the header means "Is it spam?" 
At the >moment the header added by pop3proxy.py is always "Yes" or "No" - I'll add >the new "Unsure" value when I get the chance. > >> I don't have a trained database (the real challenge) at this point > >Use hammie.py to train it - the usage message should tell you everything >you need to know, except how to create the mbox files or directories of >email message to feed into it. Hopefully your email client will export >messages into one of those formats... > >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Mon Nov 4 16:17:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 11:17:56 -0500 Subject: [Spambayes] Why I added src=cid: etc In-Reply-To: <3DC694FA.7000905@startechgroup.co.uk> Message-ID: [Tim] > This is typical of the kind of email I'm getting a lot of > lately. Without> mining the HTML, there's almost nothing to > look at, not even a word in the Subject line. (Of course, if we > weren't throwing the HTML tags away, the classifier would have > learned this stuff on its own.) [Matt Sergeant] > It's a virus though. Why don't you just get a gateway scanner (like the > one I wrote [1] for qpsmtpd [2] which plugs into qmail and bounces > viruses with a 5xx return code) which uses clamav[3]? Because , like *most* of the world, I'm just running "the email stuff" that came with my Windows box here. Not one user in a thousand knows beans beyond that. > It's optimised for catching viruses, so you can focus on just catching > spam (lets face it, the techniques are slightly different). Yes. Greg Ward and Neil Schemenauer here have each written their own virus detectors too, and Greg's stops essentially all viruses from getting beyond python.org. 
The ones I'm getting come from other accounts, but somewhere along the line the actual virus payload has been stripped out, leaving just the little HTML trigger. I wouldn't recommend this project's code for virus/worm detection, although anecdotal reports here (not controlled experiments) have been that it works for that purpose too. From tim.one@comcast.net Mon Nov 4 16:32:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 11:32:23 -0500 Subject: [Spambayes] counterweight: it really works! In-Reply-To: <3DC68F96.6070809@startechgroup.co.uk> Message-ID: [Matt Sergeant, to Rob Hooft] > Please don't compare to 4 months old SpamAssassin's. Upgrade if you want > to compare. Thanks. I expect Rob is typical of single-user SpamAssassin clients, though: they download it once, and watch it deteriorate. I've seen many other reports of that too. That makes it a fine comparison for people "like that". This codebase doesn't need upgrading, but does need ongoing training on a user's own email. Given that, I dare say it appears to work at least as well for spam detection as an up-to-date SA (can't say about my personal email, as I don't run SA here; on python.org's list email, I know it works at least as well, as we've run controlled tests on that -- but I don't know how often GregW upgrades the SA running at python.org). 
From papaDoc@videotron.ca Mon Nov 4 16:38:44 2002 From: papaDoc@videotron.ca (papaDoc) Date: Mon, 04 Nov 2002 11:38:44 -0500 Subject: [Spambayes] x-hammie-disposition in pop3proxy References: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com> <200211022241.gA2Mfxq07985@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DC6A294.7020400@videotron.ca> Hi, >>>X-Spambayes-Judgement: Spam / Unsure / Ham >>>X-Spambayes-Is-Spam: Yes / Unsure / No >>>X-Spambayes-Looks-Like-Spam: Yes / Unsure / No >>> >>> >>I know we have a long tradition of spelling errors behind us, such >>as dropping an "R" from "referrer" in Apache logs, but I'd hate to >>start a new one! Please, only one "E" in "judgment." >> >> > >But it's not a spelling error! > In French you should remove the "d" if you want no spelling error ;-) papaDoc From Tim@mail.powweb.com Mon Nov 4 16:39:54 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Mon, 04 Nov 2002 10:39:54 -0600 Subject: [Spambayes] counterweight: it really works! Message-ID: The thing that SB has that SA doesn't is the ongoing ability to train a database according to the USER'S definition of spam. SA has some configurability, but who actually does that? Who wants to download updates? SB lets me say "This is spam. Learn from this" or "This is ham..." I'm not going back. :) - TimS 11/4/2002 10:32:23 AM, Tim Peters wrote: >[Matt Sergeant, to Rob Hooft] >> Please don't compare to 4 months old SpamAssassin's. Upgrade if you want >> to compare. Thanks. > >I expect Rob is typical of single-user SpamAssassin clients, though: they >download it once, and watch it deteriorate. I've seen many other reports of >that too. That makes it a fine comparison for people "like that". This >codebase doesn't need upgrading, but does need ongoing training on a user's >own email. 
Given that, I dare say it appears to work at least as well for >spam detection as an up-to-date SA (can't say about my personal email, as I >don't run SA here; on python.org's list email, I know it works at least as >well, as we've run controlled tests on that -- but I don't know how often >GregW upgrades the SA running at python.org). > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From msergeant@startechgroup.co.uk Mon Nov 4 16:37:15 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 04 Nov 2002 16:37:15 +0000 Subject: [Spambayes] counterweight: it really works! References: Message-ID: <3DC6A23B.5000904@startechgroup.co.uk> Tim Peters said the following on 04/11/02 16:32: > [Matt Sergeant, to Rob Hooft] > >>Please don't compare to 4 months old SpamAssassin's. Upgrade if you want >>to compare. Thanks. > > > I expect Rob is typical of single-user SpamAssassin clients, though: they > download it once, and watch it deteriorate. I've seen many other reports of > that too. That makes it a fine comparison for people "like that". This > codebase doesn't need upgrading, but does need ongoing training on a user's > own email. Given that, I dare say it appears to work at least as well for > spam detection as an up-to-date SA (can't say about my personal email, as I > don't run SA here; on python.org's list email, I know it works at least as > well, as we've run controlled tests on that -- but I don't know how often > GregW upgrades the SA running at python.org). It's the same though. SA needs constant training too - it just happens to occur somewhere other than on your own box, and involve some human intervention (though luckily not from the user). Matt. 
From msergeant@startechgroup.co.uk Mon Nov 4 16:35:36 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 04 Nov 2002 16:35:36 +0000 Subject: [Spambayes] Why I added src=cid: etc References: Message-ID: <3DC6A1D8.6040507@startechgroup.co.uk> Tim Peters said the following on 04/11/02 16:17: > [Tim] > >>This is typical of the kind of email I'm getting a lot of >>lately. Without mining the HTML, there's almost nothing to >>look at, not even a word in the Subject line. (Of course, if we >>weren't throwing the HTML tags away, the classifier would have >>learned this stuff on its own.) > > [Matt Sergeant] > >>It's a virus though. Why don't you just get a gateway scanner (like the >>one I wrote [1] for qpsmtpd [2] which plugs into qmail and bounces >>viruses with a 5xx return code) which uses clamav[3]? > > Because , like *most* of the world, I'm just running "the email stuff" > that came with my Windows box here. Not one user in a thousand knows beans > beyond that. Ah Windows eh. I didn't realise anyone still used that. ;-) >>It's optimised for catching viruses, so you can focus on just catching >>spam (let's face it, the techniques are slightly different). > > Yes. Greg Ward and Neil Schemenauer here have each written their own virus > detectors too, and Greg's stops essentially all viruses from getting beyond > python.org. The ones I'm getting come from other accounts, but somewhere > along the line the actual virus payload has been stripped out, leaving just > the little HTML trigger. > > I wouldn't recommend this project's code for virus/worm detection, although > anecdotal reports here (not controlled experiments) have been that it works > for that purpose too. Yeah, I've got some neat results just from classifying file extensions. The double extension ones are especially good ;-) Matt.
From tim.one@comcast.net Mon Nov 4 16:48:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 11:48:46 -0500 Subject: [Spambayes] Why I added src=cid: etc In-Reply-To: <3DC6A1D8.6040507@startechgroup.co.uk> Message-ID: [Matt Sergeant, on virus/worm detection] > Yeah, I've got some neat results just from classifying file extensions. > The double extension ones are especially good ;-) GregW's is a bit of Perl that scans for file extensions, and my work account does some double-extension detection. They're very effective but Draconian. For example, I maintain the Python Windows distribution, and the former prevents users from sending me .exe files directly; the latter prevented a coworker two weeks ago from sending error-log files because he named them "xyz.good.log" and "xyz.bad.log". It seems a very curious thing to me that email admins seem generally happy to accept false positives when it comes to suspected virus/worm stuff. Then again, it's not a very surprising thing . From msergeant@startechgroup.co.uk Mon Nov 4 16:46:03 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 04 Nov 2002 16:46:03 +0000 Subject: [Spambayes] counterweight: it really works! References: Message-ID: <3DC6A44B.4070509@startechgroup.co.uk> Tim@mail.powweb.com said the following on 04/11/02 16:39: > The thing that SB has that SA doesn't is the ongoing ability to train a database according to the USER'S definition of spam. SA has some configurability, but > who actually does that? Who wants to download updates? SB lets me say "This is spam. Learn from this" or "This is ham..." I'm not going back. :) FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't officially released yet, but then neither is spambayes [grin]) using the Robinson algorithm. 
I'm hoping to get the chi-squared algorithm in there too, but I had some trouble with it producing weird results for me (I tried to post something to this list about it but it vanished into the ether, so I'll try again shortly). Ultimately I think what people will find is that statistical classifiers are a good part of an overall strategy, but not necessarily the end of the story in spam detection (which is a shame). SpamAssassin is a pretty mature product these days, with some really neat technology going on in there like the auto-whitelist, and it's really great that we can all learn from each other this way. Matt. From tim.one@comcast.net Mon Nov 4 16:56:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 11:56:36 -0500 Subject: [Spambayes] counterweight: it really works! In-Reply-To: Message-ID: [TimS] > The thing that SB has that SA doesn't is the ongoing ability to > train a database according to the USER'S definition of spam. > SA has some configurability, but who actually does that? Who wants > to download updates? Email administrators do both, and *because* the SB code needs to learn about ham as well as spam, and opt-in marketing email is so user-specific, it remains a puzzle how to use this code for, e.g., an email admin serving 1,000 accounts. The SB code appears quite capable of handling python.org's mailing list traffic with a lot less bother and resource consumption than SA requires, but I still don't think it will work well if we fold in the personal email carried by python.org too. > SB lets me say "This is spam. Learn from this" or "This is ham..." I'm > not going back. :) Provided we can make ongoing training easy enough, I expect single-user installations will enjoy training SB more than downloading SA, and that it will work better for them (e.g., there are *some* kinds of spam I want to see, and training my classifier to accept that kind was, like everything else, just a matter of putting examples in my ham folder).
From Tim@mail.powweb.com Mon Nov 4 17:09:14 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Mon, 04 Nov 2002 11:09:14 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: It occurs to me that there are a couple of issues floating around here regarding integration with email clients, and they're related, although the relationship hasn't been brought forward yet. Issue #1 Using proxys (POP3 and SMTP) for integration into the mailers that the masses use. Issue #2 Putting a user interface onto the pop3proxy So here goes: I've been running an SMTP proxy that recognizes mail being sent to special addresses and does a train using the message, rather than send it to the actual SMTP server. This works very well, though it requires a rather arduous training regimen to get it all started. ::sigh:: Richie's pop3proxy then picks up the training to classify incoming mail, which is filtered into spam by my mailer's standard filtering mechanism. There are two problems with this approach. First, it's manual, message-by-message, which means *re-training* is kinda out of the question. The second is that the 'Forward' or 'Redirect' function of most mailers strips at least some of the headers in which valuable clues can be found. So... some have proposed that we make the pop3proxy so that it caches incoming mail. This cache could be used by the smtpproxy to recover the original headers, using a unique id that the pop3proxy embedded in the mail somewhere. Or... we could give the pop3proxy a user interface that allows users to select mail in the cache to do training on. This approach eliminates the need for the smtpproxy in the first place, and allows a corpus to be built up by the proxy for retraining purposes. While the ui for a caching pop3 proxy might be a bit of a challenge, I think this approach bears some examination. 
Arguments for: * Simpler overall system * Allows the building of an easily usable corpus for average mail users like me * Headers are maintained exactly as they were received, before a mailer has the chance to get in and mess 'em up Arguments against: * A new user interface that is not a normal part of a user's everyday existence * Now documentation will have to include "User's Guide" as well as "Install Guide" * Some ongoing cache maintenance... expiry of cached messages, etc. Other considerations? P.S. How's this, Skip? - TimS From neale@woozle.org Mon Nov 4 17:58:06 2002 From: neale@woozle.org (Neale Pickett) Date: 04 Nov 2002 09:58:06 -0800 Subject: [Spambayes] Database reduction In-Reply-To: <15809.55847.349091.23441@montanaro.dyndns.org> References: <15809.55847.349091.23441@montanaro.dyndns.org> Message-ID: So then, Skip Montanaro is all like: > Neale> When pickling a Bayes object, the pickler is smart enough not to > Neale> repeatedly say "this is a wordinfo object" but rather, I assume, > Neale> "this is of type 2", only having to name type 2 once. However, > Neale> hammie pickles each wordinfo individually, keyed by a string. > Neale> This makes for fast lookups, but giant databases. > > You can always define your own __getstate__ and __setstate__ methods for the > Wordinfo class which processes a more compact form of the object's state. > Or am I misunderstanding what you said? Perhaps a picture would be worth 1K words: >>> import classifier >>> w = classifier.WordInfo('aoeu', 2) >>> import pickle >>> w WordInfo"('aoeu', 0, 0, 0, 2)" >>> pickle.dumps(w, 1) 'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__builtin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02tq\x05bq\x06.' In case it isn't obvious yet, here's the problem: >>> len(pickle.dumps(w, 1)) 102 >>> len(`w`) 30 So, at least for hammie, you can get a 66% reduction in database size by *not* pickling WordInfo types. 
Tim calls this "administrative pickle bloat", which is the coolest jargon term I've heard all year. As I understand it, things which pickle the Bayes object avoid this overhead from some pickler optimizations along the lines of "if we've already seen this type, just give it a number and stop referring to it by name." Thus, I suppose the proper way to get this reduction in hammie would be to extend the pickler to recognize WordInfo types, right? If so, I'll add that code in. Neale From tim.one@comcast.net Mon Nov 4 18:53:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 13:53:26 -0500 Subject: [Spambayes] Database reduction In-Reply-To: Message-ID: [Neale Pickett] > Perhaps a picture would be worth 1K words: > > >>> import classifier > >>> w = classifier.WordInfo('aoeu', 2) > >>> import pickle > >>> w > WordInfo"('aoeu', 0, 0, 0, 2)" > >>> pickle.dumps(w, 1) > > 'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__b > uiltin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02 > tq\x05bq\x06.' > > In case it isn't obvious yet, here's the problem: > > >>> len(pickle.dumps(w, 1)) > 102 > >>> len(`w`) > 30 OTOH, >>> cPickle.dumps(w.__getstate__(), 1) '(U\x04aoeuq\x01K\x00K\x00K\x00K\x02t.' >>> len(_) 19 >>> which is shorter than your string repr. This isn't typical because 2 is an absurd spamprob (it's > 1, and is an int instead of a double); the savings would be greater with a real spamprob (which will consume about 19 bytes in a string repr, but about 8 in a pickle). > So, at least for hammie, you can get a 66% reduction in database size > by *not* pickling WordInfo types. Tim calls this "administrative pickle > bloat", which is the coolest jargon term I've heard all year. Glad you liked it . If you pickle the states instead, you'll save a lot of space. The state is a plain tuple. On the other end, you have to construct a WordInfo object and pass the unpickled tuple to its __setstate__ method. 
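A minimal sketch of pickling the states instead, written for current Python 3 rather than the Python 2 used in this thread. The WordInfo class here is a hypothetical stand-in whose field names are assumed from the repr shown earlier, and dump_state and load_state are made-up helper names, not part of hammie or classifier:

```python
import pickle


class WordInfo:
    # Hypothetical stand-in for classifier.WordInfo; field names are assumed.
    def __init__(self, atime, spamprob=0.5):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __getstate__(self):
        # The state is a plain tuple, so it pickles without any class name.
        return (self.atime, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.atime, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state


def dump_state(info):
    """Pickle only the state tuple, not the class bookkeeping."""
    return pickle.dumps(info.__getstate__(), 1)


def load_state(data):
    """Rebuild a WordInfo from a pickled state tuple."""
    info = WordInfo.__new__(WordInfo)  # skip __init__; the state fills every field
    info.__setstate__(pickle.loads(data))
    return info


w = WordInfo("aoeu", 0.3)
compact = dump_state(w)
restored = load_state(compact)
```

Storing the compact bytes under each word key would keep per-word lookups while leaving the class name out of every record.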
> As I understand it, things which pickle the Bayes object avoid this
> overhead from some pickler optimizations along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name."

Yes, but a Pickler does this automatically. You're using convenience
functions, which is why you get no savings. Here's pickle.dumps():

    def dumps(object, bin = 0):
        file = StringIO()
        Pickler(file, bin).dump(object)
        return file.getvalue()

It creates a brand new Pickler every time you call dumps, so nothing can be
remembered from one call to the next. Avoiding that is clumsy in this
context, but possible:

>>> f = StringIO.StringIO()
>>> p = cPickle.Pickler(f, 1)
>>> p.dump(w)
>>> f.getvalue()
'ccopy_reg\n_reconstructor\nq\x01(cclassifier\nWordInfo\nq\x02c__builtin__\nobject\nq\x03NtRq\x04(U\x04abdeq\x05K\x00K\x00K\x00G?\xd3333333tb.'
>>> f.truncate(0)
>>> p.dump(w)
>>> f.getvalue()
'h\x04.'
>>>

In this case, by reusing the Pickler, the second time dumping w created a
3-byte pickle: the Pickler maintains its own internal dict remembering
everything it pickled in the past. This can be a real data burden of its
own, though. See the docs for ways to clear a Pickler's dict (called the
pickle "memo" in the docs).

I'd avoid all that and pickle the states, but that's just me.

From Tim@mail.powweb.com Mon Nov 4 19:08:48 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 13:08:48 -0600
Subject: [Spambayes] My first results with pop3proxy and smtpproxy
Message-ID:

I've trained using the smtpproxy and a few dozen spams that I hadn't deleted and hadn't been contaminated by SA before I got involved with spambayes (basically SA mistakes). Even given the small size of the corpus, it is doing an amazingly great job classifying inbound mail. It even correctly classified one of those "here's another funny story" infernal mails that gets forwarded three hundred times, and I hadn't trained it on anything like that.
I have to say that a corpus of thousands really isn't turning out to be a necessity for spambayes to be useful to me. One other observation... my strong tendency *IS* to train this thing only when it makes a mistake. Skip et al. have warned beaucoup times about not doing this... train on a reasonable smattering of both, even if they're correctly classified, and train often. **BUT** if this is my tendency and I understand the system, then this will likely be a real problem when the masses get started using it. How to ensure that mistakes-only training isn't the norm? Beats me. But we've either gotta figure out how to make sure that the teeming masses don't make this error, or we've gotta figure out how to make the system tolerate this error reasonably well. - Tim www.fourstonesExpressions.com From guido@python.org Mon Nov 4 19:27:08 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 14:27:08 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development Message-ID: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> This smells like a clever spam, disguised as a zope-announce message I sent. SA scored it -2.8: X-Spam-Status: No, hits=-2.8 required=5.0 tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,QUOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE Wonder if SB will do any better... --Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Mon, 04 Nov 2002 03:18:02 +0000 From: "Lindsey Carter" To: guido@python.org Subject: Re: [Zope-Annce] New zope.org development Hey, thanks for writing me back here. You are the only guy so far, but I only answered 3 ads, you don't have much competition in our area! I'm sure you got a few emails, but I hope you take the time to get to know me better. What are you doing this weekend? Maybe we can get together for coffee or ice cream? =o) I bet you are waiting to see my pics? I will get them to you.
I tried attaching them in this email but they wouldn't fit so click here http://www.my-homepages.net/lindseyspage/index.html I'd rather have picked out my best pics and emailed them to you, but it's probably better that you see all these. I hope I'm not too shocking to you? There is more about me on my homepage and the pics really make for an interesting conversation (my specialty). If you like what you see I'll be waiting to hear from you, I hope you approve, xoxo Lindsey PS. pistachio ice cream is my fav. >From: Guido van Rossum >To: zope-announce@lists.zope.org >CC: zope3-dev@zope.org, zope@zope.org, zope-web@zope.org >Subject: [Zope-Annce] New zope.org development >Date: Thu, 31 Oct 2002 17:10:16 -0500 > >There have been complaints about zope.org. The complaints >include things like "the design is ugly", "the navigation is >difficult", "it's too slow", and "search results are not >useful." > >For a long time there have been plans to convert the site to >the current version of Zope using CMF, and various false >starts have been made, but so far the site is still running >software that's best described "FrakenZope 2.3"... > >I opened my big mouth, and now I'm responsible for fixing >this. :-) > >Actually, a new plan was already in place, and all I have to >do is coach its execution. Zope Corporation has retained a >highly skilled Zope developer, Sidnei da Silva, to do the >work. The advantage over Zope Corporation developers doing >this (as has been tried in the past) is that Sidnei isn't >likely to be pre-empted by higher-priority customer work: >for him, this *is* customer work, the customer being Zope >Corporation. At the same time, because Sidnei is being >paid, the expectation is that the plan will be carried out >at a steady pace, as opposed to simply "letting the >community sort it out." 
> >Sidnei's plan includes the following pieces: > >- use Zope 2.6 with CMF 1.3 for the new site > >- use a new skin design, the winner of the zope.org contest > >- use the new ZCTextIndex search engine > >- migrate all existing users and as much content as practical > to the new site > >(There's more, but we'd be getting into detail territory.) > >The project goals include minimizing the amount of new code >and content to be created, in order to minimize the risks of >failure. We also strive to make future maintenance of the >site simpler, both at the sysadmin level (process and >resource control) and at the webmaster level (content >control). All these goals are designed to make sure that >the site can be kept current once the upgrade is in place. > >Time-wise, Sidnei expects a preview version of the new site >to go up as new.zope.org within a month, and the final >version to go live (replacing the old www.zope.org) within >2-3 months after that. Sidnei expects to be asking >community help with some tasks; he'll post about this >himself. > >In addition, we're also hoping to hire Sidnei as part-time >webmaster for zope.org, starting next week. Having a steady >webmaster will help the site stay accurate and up-to-date. > >--Guido van Rossum (home page: http://www.python.org/~guido/) > >_______________________________________________ >Zope-Announce maillist - Zope-Announce@zope.org >http://lists.zope.org/mailman/listinfo/zope-announce > > Zope-Announce for Announcements only - no discussions > >(Related lists - > Users: http://lists.zope.org/mailman/listinfo/zope > Developers: http://lists.zope.org/mailman/listinfo/zope-dev ) _________________________________________________________________ Unlimited Internet access -- and 2 months free! Try MSN.
http://resourcecenter.msn.com/access/plans/2monthsfree.asp ------- End of Forwarded Message From jeremy@alum.mit.edu Mon Nov 4 19:32:51 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Mon, 4 Nov 2002 14:32:51 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15814.52067.105184.561839@slothrop.zope.com> >>>>> "GvR" == Guido van Rossum writes: GvR> This smells like a clever spam, disguised as a zope-announce GvR> message I sent. SA scored it -2.8: GvR> X-Spam-Status: No, hits=-2.8 required=5.0 GvR> tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,QUOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE GvR> Wonder if SB will do any better... I got the same spam and SB was sure it was ham. It was responding to my ZODB announcement. As a result the mail contained a bunch of good ham indicators from my original announcement. As I recall, the fact that it not only contained my announcement but had some of the words quoted really nailed it. That is, ">release" is a better ham indicator than "release". Jeremy From tim.one@comcast.net Mon Nov 4 19:39:31 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 14:39:31 -0500 Subject: [Spambayes] My first results with pop3proxy and smtpproxy In-Reply-To: Message-ID: [Tim@mail.powweb.com] > I've trained using the smtpproxy and a few dozen spams that I hadn't > deleted and hadn't been contaminated by SA before I got involved with > spambayes (basically SA mistakes). You don't have to worry about that: by default, the tokenizer ignores all header lines SA may have had anything to do with, so it doesn't matter whether SA has added headers or not. > Even given the small size of the corpus, it is doing an amazingly great > job classifying inbound mail. Good!
> It even correctly classified one of those "here's another funny > story" infernal mails that gets forwarded three hundred times, and I > hadn't trained it on anything like that. Ya, but that would be ham to somebody else. Train accordingly . > I have to say that a corpus of thousands really isn't turning out to > be a necessity for spambayes to be useful to me. Indeed, old tests show that it is, on average, *useful* after training on a single ham and a single spam: it gets significantly more right than wrong after that much. So long as *none* of your ham looks like advertising or random chatter, a few hundred of each may be fine for you. Fraction-of-a-percent error rate improvements are important for high-volume uses (like python.org, which handles more email in a day than most people get in a year). > One other observation... my strong tendency *IS* to train this thing > only when it makes a mistake. That's a UI problem. A good UI would deduce what's ham and spam by watching what you do to your email, and train on a random sampling of it. The Outlook client may be the only one making real progress in that direction so far. > Skip et al. have warned beaucoup times about not doing this... That would be me. > train on a reasonable smattering of both, even if they're correctly > classified, and train often. The things I call ham would shock you . > **BUT** if this is my tendency and I understand the system, then this > will likely be a real problem when the masses get started using it. > How to ensure that mistakes-only training isn't the norm? Beats me. > But we've either gotta figure out how to make sure that the teeming > masses don't make this error, or we've gotta figure out how to make > the system tolerate this error reasonably well.
It can't tolerate it -- it can only learn what it's been taught, and reliance on hapaxes is both vital over the short term and brittle over the long term; ongoing training is needed to prevent hapaxes from becoming a liability over time. *Most* spam is dead easy to recognize, though, as is most ham. The errors occur in atypical cases. From guido@python.org Mon Nov 4 19:42:38 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 14:42:38 -0500 Subject: [Spambayes] deployment for mailman lists Message-ID: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> I just realized that the deployment parameters for mailman lists are entirely different than for individual users. This may be obvious already, but I don't recall reading it here. - Mailing lists have a tendency to have a clear focus, which is recorded in the list archives. This makes for near-ideal training, unless in the past a lot of spam made it into the archives (they should be manually checked first). - Integration into Mailman means that there's only one setup to be concerned about, rather than the gazillions of different ways ordinary users receive their email. - The person who administers the list can be assumed to be a little bit more clueful than an ordinary user. - An obvious default policy with tunable parameters presents itself: ham goes to the list, spam is dropped (or bounced), and unsure goes into the moderator's queue. (Of course, having this integrated into Mailman also gives Mailman a leg up against the competition.) 
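The default policy in the last bullet above can be sketched as a small routing function. This is only an illustration under assumptions: the score is taken to be a spam probability between 0 and 1, the two cutoffs are hypothetical tunable values, and the function and action names are made up rather than taken from Mailman or Spambayes:

```python
def route(score, ham_cutoff=0.2, spam_cutoff=0.9, spam_action="discard"):
    """Map a spam score in [0, 1] to a list action.

    The cutoffs and action names are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "deliver"      # ham goes to the list
    if score >= spam_cutoff:
        return spam_action    # spam is dropped (or bounced, or held)
    return "hold"             # unsure goes into the moderator's queue


actions = [route(s) for s in (0.01, 0.5, 0.99)]
```

Passing spam_action="hold" would send spam to the moderator's queue too, for lists where losing a false positive is not acceptable.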
--Guido van Rossum (home page: http://www.python.org/~guido/) From jeremy@alum.mit.edu Mon Nov 4 19:43:12 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Mon, 4 Nov 2002 14:43:12 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: <15814.52067.105184.561839@slothrop.zope.com> References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> <15814.52067.105184.561839@slothrop.zope.com> Message-ID: <15814.52688.115970.206304@slothrop.zope.com> On the other hand, the message you forwarded got scored 0.494 with both *H* and *S* > 0.98. I'm quite puzzled, though, about how my training data is getting used. I looked back at the spam that came through (now in my spam training set) and see that it got scored 0.000. It now gets scored 1.000, but for reasons that don't really make sense to me. Here's a snippet of the spamish word from the detailed scoring: >available 0.844827586207 >cc: 0.844827586207 >corp. 0.844827586207 >dickenson 0.844827586207 >persistence 0.844827586207 >pure 0.844827586207 >released 0.844827586207 >source 0.844827586207 >unexpected 0.844827586207 >windows 0.844827586207 >zeo 0.844827586207 >zodb 0.844827586207 >zope 0.844827586207 [zope-annce] 0.844827586207 approve, 0.844827586207 area! 0.844827586207 behavior.) 0.844827586207 beta 0.844827586207 btrees 0.844827586207 compiler, 0.844827586207 conflict 0.844827586207 cream? 0.844827586207 email addr:zope.org, 0.844827586207 emails, 0.844827586207 fav. 0.844827586207 from:"lindsey 0.844827586207 from:carter" 0.844827586207 from:email name: References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15814.53303.926055.735822@montanaro.dyndns.org> Guido> - An obvious default policy with tunable parameters presents Guido> itself: ham goes to the list, spam is dropped (or bounced), and Guido> unsure goes into the moderator's queue. 
I would argue that spam should by default go into the moderator's queue as well. The default should never be to drop or bounce a message. Either way, you run the risk that legitimate mail gets lost. Skip From piersh@friskit.com Mon Nov 4 20:12:20 2002 From: piersh@friskit.com (Piers Haken) Date: Mon, 4 Nov 2002 12:12:20 -0800 Subject: [Spambayes] My first results with pop3proxy and smtpproxy Message-ID: <9891913C5BFE87429D71E37F08210CB9297506@zeus.sfhq.friskit.com> > > One other observation... my strong tendency *IS* to train this thing > > only when it makes a mistake. > > That's a UI problem. A good UI would deduce what's ham and > spam by watching what you do to your email, and train on a > random sampling of it. The Outlook client may be the only > one making real progress in that direction so far. The outlook plugin positively rocks in this respect. That's kinda why I was suggesting taking the IMAP route since you'd easily (!?) be able to correct classification errors using whichever IMAP-enabled client UI you prefer (outlook, OE, mozilla, opera, the list goes on...) and it would not be something new that users would have to learn. As an aside: I've been using spambayes with great success since we got the outlook plugin to play nicely with exchange. As you probably know, exchange is a centralized message store, and as such, you can have multiple clients connected to the same store at the same time. I generally have up to 3 copies of outlook running at any one time. Two at work, one at home. I run spambayes at home. This morning, when I got to work, I saw that SB had marked some ham as 'unsure', so I moved it back into my inbox using an outlook client on a machine where SB was NOT running. I then connected to my home machine (the one running SB) and noticed that, even though I had moved the message on a different machine, the SB plugin had noticed the move and retrained the database. Very cool. Piers.
From tim.one@comcast.net Mon Nov 4 20:07:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 15:07:23 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > This smells like a clever spam, Click on the link to Lindsey's webpage if you have lingering doubts. > disguised as a zope-announce message I sent. SA scored it -2.8: > > X-Spam-Status: No, hits=-2.8 required=5.0 > tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,Q > UOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE > > Wonder if SB will do any better... Absolutely: SB never gives negative scores . Barry once floated the idea of trying to strip quoted text in the tokenizer, but nobody (AFAIK) tried that. Short of something like that, I expect the best you can hope for is that this will end up in your Unsure category. I believe that QUOTED_EMAIL_TEXT means SA gave it a *ham* boost for containing quoted email. The "Re:" in the subject line is a clue of that sort for SB too, along with various words starting w/ ">". From tim.one@comcast.net Mon Nov 4 20:13:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 15:13:30 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > I just realized that the deployment parameters for mailman lists are > entirely different than for individual users. This may be obvious > already, but I don't recall reading it here. You really don't read much of this list <0.9 wink>. > - Mailing lists have a tendency to have a clear focus, which is > recorded in the list archives. This makes for near-ideal training, > unless in the past a lot of spam made it into the archives (they > should be manually checked first). Yes & no. 
Zope lists have a clear focus, but c.l.py is all over the map, from Alex Martelli discussing the right kind of water to use when preparing pasta, to debates about Microsoft's place in the world. You really can't imagine what a sprawling zoo c.l.py is until you've stared at 20,000 randomly selected msgs for months. But what they all have in common is ALMOST NO ADVERTISING. I believe that makes them much easier than personal email, barring the one-word subscribe etc thingies attached to mountains of employer-generated disclaimers. > - Integration into Mailman means that there's only one setup to be > concerned about, rather than the gazillions of different ways > ordinary users receive their email. Yup. > - The person who administers the list can be assumed to be a little > bit more clueful than an ordinary user. Ditto. > - An obvious default policy with tunable parameters presents itself: > ham goes to the list, spam is dropped (or bounced), and unsure goes > into the moderator's queue. Yup. But since there *is* a non-zero FP rate, and always will be, dropping email is probably not a politically acceptable option (even if the FP rate is lower than the chance of a mail-transport screwup losing the mail). > (Of course, having this integrated into Mailman also gives Mailman a > leg up against the competition.) I believe that's not lost on Barry either . From Tim@mail.powweb.com Mon Nov 4 20:19:53 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Mon, 04 Nov 2002 14:19:53 -0600 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: Message-ID: If we were to run a similar bayesian analysis of the pages that spam links point to, and used that information as another set of clues for classification, would that have made a difference in this instance, and in general? By that I mean, once a mail has been classified as spam, we could look at the pages that the page points to and make a similar wordlist type classification.
This classification could be used in Unsure instances by looking at the pages the mail points to and then applying the webpage wordlist bayes classification to it. If it's a probable spam-pointed-to-page, then the mail is probably spam... at least that could weigh (heavily) into the statistics for the words in the mail itself.... - TimS 11/4/2002 2:07:23 PM, Tim Peters wrote: >[Guido] >> This smells like a clever spam, > >Click on the link to Lindsey's webpage if you have lingering doubts. > >> disguised as a zope-announce message I sent. SA scored it -2.8: >> >> X-Spam-Status: No, hits=-2.8 required=5.0 >> tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,Q >> UOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE >> >> Wonder if SB will do any better... > >Absolutely: SB never gives negative scores . > >Barry once floated the idea of trying to strip quoted text in the tokenizer, >but nobody (AFAIK) tried that. Short of something like that, I expect the >best you can hope for is that this will end up in your Unsure category. I >believe that QUOTED_EMAIL_TEXT means SA gave it a *ham* boost for containing >quoted email. The "Re:" in the subject line is a clue of that sort for SB >too, along with various words starting w/ ">". > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Mon Nov 4 20:24:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 15:24:48 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.orgdevelopment In-Reply-To: <15814.52688.115970.206304@slothrop.zope.com> Message-ID: [Jeremy Hylton] > On the other hand, the message you forwarded got scored 0.494 with > both *H* and *S* > 0.98. I'm quite puzzled, though, about how my > training data is getting used. 
I looked back at the spam that came > through (now in my spam training set) and see that it got scored > 0.000. It now gets scored 1.000, but for reasons that don't really > make sense to me. An endless string of hapaxes. This is what mistake-based training can be *expected* to do over time: swing wildly from near 0 to near 1 (or vice versa). > Here's a snippet of the spamish word from the detailed scoring: > > >available 0.844827586207 Every word with that spamprob is a hapax (unique to this msg). The Bayesian-adjusted spamprob for a word is (s*x + n*p) / (s+n) where, for a spam hapax, p=1.0 and n=1. s and x are taken from Options.py unless you've overridden them; the defaults are s=0.45 and x=0.5. Plug those all in and you get >>> (.45 * 0.5 + 1 * 1.0) / (.45 + 1) 0.84482758620689669 >>> for a spam hapax. > >cc: 0.844827586207 > >corp. 0.844827586207 > >dickenson 0.844827586207 > >persistence 0.844827586207 > >pure 0.844827586207 > >released 0.844827586207 > >source 0.844827586207 > >unexpected 0.844827586207 > >windows 0.844827586207 > >zeo 0.844827586207 > >zodb 0.844827586207 > >zope 0.844827586207 > [zope-annce] 0.844827586207 > approve, 0.844827586207 > area! 0.844827586207 > behavior.) 0.844827586207 > beta 0.844827586207 > btrees 0.844827586207 > compiler, 0.844827586207 > conflict 0.844827586207 > cream? 0.844827586207 > email addr:zope.org, 0.844827586207 > emails, 0.844827586207 > fav. 0.844827586207 > from:"lindsey 0.844827586207 > from:carter" 0.844827586207 > from:email name:do' 0.908163 'url:index' 0.911483 'shocking' 0.921667 'ice' 0.934783 'ads,' 0.958716 'emails,' 0.965116 'part-time' 0.97619 'area!' 0.987106 'pics' 0.991159 > It seems like I'm still doing something wrong with pspam and training > but I don't know what. The odd thing is that I tend to get good > results, Most spam is easy to detect even from hapaxes. That's what makes mistake-based training tempting, I'm afraid. > the osaf lists aside. What's an osaf list?
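The hapax arithmetic in the message above can be wrapped in a small function. This is only a sketch: the formula and the defaults s=0.45, x=0.5 come from the message, but the function name `adjusted_spamprob` is made up here and is not claimed to match anything in the Spambayes source. The third case assumes a word seen once in spam and once in ham, with equally sized ham and spam training sets, so that p=0.5.

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Bayesian-adjusted spamprob: (s*x + n*p) / (s + n).

    p -- raw spam probability estimated from the training counts
    n -- number of training messages the word was seen in
    s -- strength given to the prior (0.45 is the default quoted above)
    x -- assumed spamprob of a never-seen word (default 0.5)
    """
    return (s * x + n * p) / (s + n)


# A word seen in exactly one spam and nothing else (a spam hapax).
spam_hapax = adjusted_spamprob(p=1.0, n=1)

# The mirror image: a word seen in exactly one ham.
ham_hapax = adjusted_spamprob(p=0.0, n=1)

# Assumption: seen once in each class, equal-sized training sets, so p = 0.5.
balanced = adjusted_spamprob(p=0.5, n=2)

print(spam_hapax, ham_hapax, balanced)
```

With n=1 the prior still pulls the estimate noticeably toward x, which is why a spam hapax scores about 0.845 rather than 1.0, and why one extra training message can move such a word a long way.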
From guido@python.org Mon Nov 4 20:29:14 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 15:29:14 -0500 Subject: [Spambayes] counterweight: it really works! In-Reply-To: Your message of "Mon, 04 Nov 2002 16:46:03 GMT." <3DC6A44B.4070509@startechgroup.co.uk> References: <3DC6A44B.4070509@startechgroup.co.uk> Message-ID: <200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net> > FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't > officially released yet, but then neither is spambayes [grin]) using the > Robinson algorithm. I'm hoping to get the chi-squared algorithm in there > too, but I had some trouble with it producing weird results for me (I > tried to post something to this list about it but it vanished into the > ether, so I'll try again shortly). Cool! What do you do for training of your Robinson classifier? --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Nov 4 20:50:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 15:50:34 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: <15814.53303.926055.735822@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I would argue that spam should by default go into the moderator's > queue as well. The default should never be to drop or bounce a > message. Either way, you run the risk that legitimate mail gets lost. That will never fly at python.org: there's too much spam coming in for anyone to deal with (or so I've been told -- I believe I get more spam at home, but only an infinitesimal percentage of the virus email python.org gets). Indeed, Greg bounces lots of spam at SMTP *connect* time, without analyzing it any deeper than seeing that it uses a character set that's out of favor. A Reject response should make its way back to the sender then. If not, ya, the email is lost. Put a specific price on that, because it's a tradeoff, and "never" is too costly in other ways to tolerate.
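The policy being argued over in this thread (ham goes to the list, unsure goes to the moderator, spam is either dropped or also held) can be written down as a small function with tunable parameters. This is a hypothetical sketch, not Mailman or Spambayes code: the name `disposition`, the action strings, and the cutoff values 0.2 and 0.9 are all assumptions chosen for illustration.

```python
def disposition(score, ham_cutoff=0.2, spam_cutoff=0.9, spam_action="discard"):
    """Map a spam score in [0.0, 1.0] to a list-server action.

    ham_cutoff, spam_cutoff -- hypothetical tunable thresholds
    spam_action             -- what to do with spam: "discard" (the
                               convenience default) or "hold" (send it to
                               the moderator's queue, the conservative one)
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < ham_cutoff:
        return "deliver"      # ham goes to the list
    if score >= spam_cutoff:
        return spam_action    # spam: dropped or held, per configuration
    return "hold"             # unsure goes into the moderator's queue


for score in (0.0, 0.494, 1.0):
    print(score, disposition(score), disposition(score, spam_action="hold"))
```

The whole disagreement is about the default value of the one parameter `spam_action`; either choice leaves the other available to a list admin who wants it.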
From jeremy@alum.mit.edu Mon Nov 4 20:44:50 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Mon, 4 Nov 2002 15:44:50 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.orgdevelopment In-Reply-To: References: <15814.52688.115970.206304@slothrop.zope.com> Message-ID: <15814.56386.841563.8206@slothrop.zope.com> I've been training, of late, on a growing sample of my incoming email. At the moment just a few hundred of each ham and spam. It has done moderately well. Apparently the Carter spam used to trigger on words in the old archives I was using -- and the new smaller training database just doesn't have many occurrences of those words. The osaf lists are for Kapor et al.'s new PIM. I've got 24 messages from those lists in my ham training set, but it hasn't been enough to get the scores reliably below 0.1. Jeremy From guido@python.org Mon Nov 4 20:44:35 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 15:44:35 -0500 Subject: [Spambayes] Database reduction In-Reply-To: Your message of "04 Nov 2002 09:58:06 PST." References: <15809.55847.349091.23441@montanaro.dyndns.org> Message-ID: <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net> > In case it isn't obvious yet, here's the problem: > > >>> len(pickle.dumps(w, 1)) > 102 > >>> len(`w`) > 30 > > So, at least for hammie, you can get a 66% reduction in database size > by *not* pickling WordInfo types. Tim calls this "administrative pickle > bloat", which is the coolest jargon term I've heard all year. > > As I understand it, things which pickle the Bayes object avoid this > overhead from some pickler optimizations along the lines of "if we've > already seen this type, just give it a number and stop referring to it > by name." Thus, I suppose the proper way to get this reduction in > hammie would be to extend the pickler to recognize WordInfo types, > right? If so, I'll add that code in. 
I'm aware that pickling new-style class instances is inefficient, due to the gross hack employed. I'll try to find time to do something about this in Python 2.3. You could also experiment with adding a custom __reduce__ method and/or custom __getstate__ and __setstate__ methods. Or pickle tuples instead of WordInfo instances. Or make WordInfo a classic class (classic class instances are pickled more efficiently). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon Nov 4 20:58:38 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 15:58:38 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: Your message of "Mon, 04 Nov 2002 14:32:51 EST." <15814.52067.105184.561839@slothrop.zope.com> References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> <15814.52067.105184.561839@slothrop.zope.com> Message-ID: <200211042058.gA4Kwc922052@pcp02138704pcs.reston01.va.comcast.net> > I got the same spam and SB was sure it was ham. It was responding to > me ZODB announcement. As a result the mail contained a bunch of good > ham indicators from my original announcement. As I recall, the fact > that it not only contained my announcement but had some of the words > quoted really nailed it. That is, ">release" is a better ham > indicator than "release". Yup. At least one spammer got clever... :-( --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon Nov 4 21:01:18 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 16:01:18 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: Your message of "Mon, 04 Nov 2002 14:43:12 EST." 
<15814.52688.115970.206304@slothrop.zope.com> References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net> <15814.52067.105184.561839@slothrop.zope.com> <15814.52688.115970.206304@slothrop.zope.com> Message-ID: <200211042101.gA4L1IY22088@pcp02138704pcs.reston01.va.comcast.net> > On the other hand, the message you forwarded got scored 0.494 with > both *H* and *S* > 0.98. I'm quite puzzled, though, about how my > training data is getting used. I looked back at the spam that came > through (now in my spam training set) and see that it got scored > 0.000. It now gets scored 1.000, but for reasons that don't really > make sense to me. > > Here's a snippet of the spamish word from the detailed scoring: > > >available 0.844827586207 > >cc: 0.844827586207 > >corp. 0.844827586207 > >dickenson 0.844827586207 > >persistence 0.844827586207 > >pure 0.844827586207 > >released 0.844827586207 > >source 0.844827586207 > >unexpected 0.844827586207 > >windows 0.844827586207 > >zeo 0.844827586207 > >zodb 0.844827586207 > >zope 0.844827586207 > [zope-annce] 0.844827586207 > approve, 0.844827586207 > area! 0.844827586207 > behavior.) 0.844827586207 > beta 0.844827586207 > btrees 0.844827586207 > compiler, 0.844827586207 > conflict 0.844827586207 > cream? 0.844827586207 > email addr:zope.org, 0.844827586207 > emails, 0.844827586207 > fav. 0.844827586207 > from:"lindsey 0.844827586207 > from:carter" 0.844827586207 > from:email name: References: <15814.53303.926055.735822@montanaro.dyndns.org> Message-ID: <15814.57503.637984.11424@montanaro.dyndns.org> >>>>> "Tim" == Tim Peters writes: Tim> [Skip Montanaro] >> I would argue that spam should by default go into the moderator's >> queue as well. The default should never be to drop or bounce a >> message. Either way, you run the risk that legitimate mail gets >> lost. 
Tim> That will never fly at python.org: there's too much spam coming in Tim> for anyone to deal with (or so I've been told -- I believe I get Tim> more spam at home, but only an infinitesimal percentage of the Tim> virus email python.org gets). It's fine to give moderators the ability to twiddle these settings. The person managing the mailing list can check the "delete spam" box. All I'm saying is that the default should not be to delete anything. Skip From guido@python.org Mon Nov 4 21:06:41 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 16:06:41 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: Your message of "Mon, 04 Nov 2002 13:53:27 CST." <15814.53303.926055.735822@montanaro.dyndns.org> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> Message-ID: <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> > Guido> - An obvious default policy with tunable parameters presents > Guido> itself: ham goes to the list, spam is dropped (or > Guido> bounced), and unsure goes into the moderator's queue. > > I would argue that spam should by default go into the moderator's > queue as well. The default should never be to drop or bounce a > message. Either way, you run the risk that legitimate mail gets > lost. For most mailing lists, I disagree. It's not like you're going to miss an important message from your boss or from a potential customer or employer when a false positive is bounced from the dangerous-hobbies-involving-jello list. Given the amount of spam that most lists get, and the clumsiness (I believe Barry agrees with this assessment :-) of the Mailman moderation API, putting all spam in the moderation queue by default would be a bad idea. I agree that it should be possible to configure it this way if you really want, but I don't think it should be the default. 
--Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Nov 4 21:06:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 16:06:53 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.orgdevelopment In-Reply-To: <15814.56386.841563.8206@slothrop.zope.com> Message-ID: [Jeremy Hylton] > I've been training, of late, on a growing sample of my incoming > email. Good -- I knew I could browbeat you into that . > At the moment just a few hundred of each ham and spam. It has done > moderately well. Apparently the Carter spam used to trigger on > words in the old archives I was using -- and the new smaller training > database just doesn't have many occurrences of those words. Expiring words over time is something that should be done with ongoing training too ("database pruning"). There's been no progress on that, though. > The osaf lists are for Kapor et al.'s new PIM. I've got 24 messages > from those lists in my ham training set, but it hasn't been enough to > get the scores reliably below 0.1. With just a few hundred training msgs, that's very surprising to me, and especially since the one example I've seen scored very solidly as ham under my classifier (which had not been trained on any of these things). Could there be a persistence glitch such that training isn't "taking hold"? I just looked, and noticed that _remove_msg() didn't do the self.wordinfo[word] = record bit at the end which may be needed to tell a persistent DB that the content of *record* changed. Then untraining a msg would screw things up, by decrementing the nspam or nham count but not reducing the word counts to match. I'll check in a fix for that now. Maybe there are other places "like that". 
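The write-back problem described above is easy to reproduce with any store that hands back a fresh copy of each record. The sketch below uses the standard-library `shelve` module as a stand-in persistent dictionary. It is not the hammie or classifier code: `untrain_spam` is a made-up miniature of what `_remove_msg()` is described as doing, and records are plain dicts rather than WordInfo objects.

```python
import os
import shelve
import tempfile


def untrain_spam(wordinfo, words, write_back):
    """Decrement the spam count of each word (hypothetical mini _remove_msg)."""
    for word in words:
        record = wordinfo[word]       # a shelf returns a newly unpickled copy
        record["spamcount"] -= 1      # this changes only the in-memory copy
        if write_back:
            wordinfo[word] = record   # tell the persistent store it changed


def count_after_untrain(write_back):
    """Train one word, untrain it, reopen the store, report the stored count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wordinfo")
        with shelve.open(path) as db:
            db["zope"] = {"spamcount": 1, "hamcount": 0}
            untrain_spam(db, ["zope"], write_back)
        with shelve.open(path) as db:  # reopen to see what was really saved
            return db["zope"]["spamcount"]


lost = count_after_untrain(write_back=False)
kept = count_after_untrain(write_back=True)
print(lost, kept)
```

Without the assignment the stored count stays at 1 even though the message was untrained; with it the count drops to 0. An in-memory dict hides the bug because there the fetched record is the stored record.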
From jeremy@alum.mit.edu Mon Nov 4 21:08:09 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Mon, 4 Nov 2002 16:08:09 -0500 Subject: [Spambayes] Database reduction In-Reply-To: <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net> References: <15809.55847.349091.23441@montanaro.dyndns.org> <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15814.57785.966040.687158@slothrop.zope.com> I'd find it convenient if Bayes was a classic class, too, so that I can more easily use ExtensionClass-based persistence. Jeremy From jeremy@alum.mit.edu Mon Nov 4 21:10:25 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Mon, 4 Nov 2002 16:10:25 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.orgdevelopment In-Reply-To: References: <15814.56386.841563.8206@slothrop.zope.com> Message-ID: <15814.57921.614104.945895@slothrop.zope.com> >>>>> "TP" == Tim Peters writes: TP> I just looked, and noticed that _remove_msg() didn't do the TP> self.wordinfo[word] = record TP> bit at the end which may be needed to tell a persistent DB that TP> the content of *record* changed. Then untraining a msg would TP> screw things up, by decrementing the nspam or nham count but not TP> reducing the word counts to match. I'll check in a fix for that TP> now. Maybe there are other places "like that". Actually, the pspam code ended up making WordInfo objects back into independent persistent objects just so that I don't have to worry about these sorts of issues. So this is not the problem now (although it may have been a week or two ago). Jeremy From guido@python.org Mon Nov 4 21:13:23 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 16:13:23 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: Your message of "Mon, 04 Nov 2002 15:13:30 EST." References: Message-ID: <200211042113.gA4LDNM22178@pcp02138704pcs.reston01.va.comcast.net> > Yup. 
But since there *is* a non-zero FP rate, and always will be, > dropping email is probably not a politically acceptable option (even > if the FP rate is lower than the chance of a mail-transport screwup > losing the mail). That should be up to the list admin though. The list admin (or his sysadmin) has the choice to install other software that drops or rejects spam anyway (as we do for python.org). --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Nov 4 21:13:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 16:13:43 -0500 Subject: [Spambayes] Database reduction In-Reply-To: <15814.57785.966040.687158@slothrop.zope.com> Message-ID: [Jeremy Hylton] > I'd find it convenient if Bayes was a classic class, too, so that I > can more easily use ExtensionClass-based persistence. Fine by me -- check it in. I only want to keep WordInfo instances lightweight (via __slots__). From bkc@murkworks.com Mon Nov 4 21:20:15 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 04 Nov 2002 16:20:15 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: References: Message-ID: <3DC69D7B.16642.3E08EE54@localhost> On 4 Nov 2002 at 14:19, Tim@mail.powweb.com, Stone@mail.powweb.co wrote: > If we were to run a similar bayesian analysis of the pages that spam > links point to, and used that information as another set of clues for I would not do this. Spammers could use a webbug-like technique to validate your email address this way. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Mon Nov 4 21:26:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 16:26:17 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.orgdevelopment In-Reply-To: <200211042101.gA4L1IY22088@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > ...
> At least the last 4 are probably unique to this particular spam, so > you must've trained on it. Read my reply -- *all* the words here were hapaxes. No exceptions. > That should explain why it's now considered spam. Unfortunately you've > also made zope-announce posts look more spammy! :-( As soon as he trains on just one ham from zope-announce, the spamprob will fall to 0.5. Scoring relying on hapaxes is brittle, despite the instant gratification it supplies; the correct cure is to train over a random sampling of all your email regularly, and whether or not it's been correctly classified. I got a dozen stronger-than-hapax spam clues out of your email example (all from the spam part of it), because I keep training even on spam that scores 1.0 and ham that scores 0.0; this moves spamprobs out of the brittle hapax range into a reflection of what email *really* looks like. From Tim@mail.powweb.com Mon Nov 4 21:27:15 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Mon, 04 Nov 2002 15:27:15 -0600 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development In-Reply-To: <3DC69D7B.16642.3E08EE54@localhost> Message-ID: <97YT946ZRM1VNH05NH3X86H95NRMIH.3dc6e633@riven> For sure, there are a number of considerations that might make such a proposal impractical. But my question was more theoretical in nature. So... practicalities aside, would an analysis of this nature be useful? - TimS 11/4/2002 3:20:15 PM, "Brad Clements" wrote: >On 4 Nov 2002 at 14:19, Tim@mail.powweb.com, Stone@mail.powweb.co wrote: > > >> If we were to run a similar bayesian analysis of the pages that spam >> links point to, and used that information as another set of clues for > >I would not do this. Spammers could use a webbug-like technique to validate your >email address this way. 
> >Brad Clements, bkc@murkworks.com (315)268-1000 >http://www.murkworks.com (315)268-9812 Fax >AOL-IM: BKClements > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From skip@pobox.com Mon Nov 4 21:36:00 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 4 Nov 2002 15:36:00 -0600 Subject: [Spambayes] deployment for mailman lists In-Reply-To: <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15814.59456.141076.98902@montanaro.dyndns.org> Guido> For most mailing lists, I disagree. It's not like you're going Guido> to miss an important message from your boss or from a potential Guido> customer or employer when a false positive is bounced from the Guido> dangerous-hobbies-involving-jello list. Perhaps not, but Mailman and Spambayes could hardly get worse PR than if valid messages began to simply disappear. We all know there are any of a number of reasons why ham can get misclassified. All I'm saying is make the default setting for new groups in this yet-to-be Mailman+Spambayes tool be to forward spam to the moderator. Mailman can say to the moderator, "This message looks like spam. If you would rather I delete such messages, here's how you do it, and here are the implications." If mail just disappears, there is no place to hang that little warning message. I don't understand why this seems to be such a difficult point to make. The readers of this list are so obviously far from the normal user and/or list moderator that our personal experience as people who read and moderate technical mailing lists just doesn't apply. I manage a very active non-technical mailing list using Mailman. 
Most of the people wouldn't know a Python script if it bit 'em in the ass. The other people who help me moderate the list are substantially less computer-savvy than I am. Trust me on this. They wouldn't know how to disable the "delete spam" feature if they were to somehow figure out why mail was disappearing. Guido> Given the amount of spam that most lists get, and the clumsiness Guido> (I believe Barry agrees with this assessment :-) of the Mailman Guido> moderation API, putting all spam in the moderation queue by Guido> default would be a bad idea. I agree that it should be possible Guido> to configure it this way if you really want, but I don't think it Guido> should be the default. That is yet another argument for not deleting mail (and probably an argument for fixing the moderation interface). If you save spam you can tell them precisely where in the moderation interface to go to make the change. If the interface is poor, it may well be hard for the moderator to figure out where to go to stop the bleeding. Skip From tim.one@comcast.net Mon Nov 4 21:37:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 16:37:42 -0500 Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.orgdevelopment In-Reply-To: <97YT946ZRM1VNH05NH3X86H95NRMIH.3dc6e633@riven> Message-ID: [Tim@mail.powweb.com, on chasing URLs] > For sure, there are a number of considerations that might make such a > proposal impractical. But my question was more theoretical in nature. > So... practicalities aside, would an analysis of this nature be > useful? Maybe, but it's hard to ignore the practicalities. BTW, I expect a URL that doesn't resolve would be a great spam clue -- lots of spam sites get shut down within hours. From pje@telecommunity.com Mon Nov 4 21:51:22 2002 From: pje@telecommunity.com (Phillip J. 
Eby) Date: Mon, 04 Nov 2002 16:51:22 -0500 Subject: [Spambayes] Database reduction In-Reply-To: <15814.57785.966040.687158@slothrop.zope.com> References: <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net> <15809.55847.349091.23441@montanaro.dyndns.org> <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <5.1.0.14.0.20021104165053.02620be0@mail.telecommunity.com> At 04:08 PM 11/4/02 -0500, Jeremy Hylton wrote: >I'd find it convenient if Bayes was a classic class, too, so that I >can more easily use ExtensionClass-based persistence. You mean you're not using ZODB 4 yet? For shame. :) From guido@python.org Mon Nov 4 22:01:52 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 17:01:52 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: Your message of "Mon, 04 Nov 2002 15:36:00 CST." <15814.59456.141076.98902@montanaro.dyndns.org> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <15814.59456.141076.98902@montanaro.dyndns.org> Message-ID: <200211042201.gA4M1qe22562@pcp02138704pcs.reston01.va.comcast.net> From mhammond@skippinet.com.au Mon Nov 4 22:14:15 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 5 Nov 2002 09:14:15 +1100 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <3DC64611.30897.3CB37091@localhost> Message-ID: > > AFAIK, Outlook Express has no hooks at all for programmers -- > it's a closed > > > How about this? > > http://msdn.microsoft.com/library/en-us/mapi/html/_mapi1book_using > _message_filtering_to_manage_messages.asp > > I think this is new information released under the DOJ settlement. This is simply MAPI documentation, and what the existing Outlook plugin is using. 
(Actually, for the "new message" hook we are using the Outlook model rather than the documentation you pointed at, but it won't be long until we move to the MAPI system, I bet). I don't think the info is DOJ related - my July 2000 MSDN CD has the same article. Unfortunately, I see nothing here that indicates this works for Outlook Express. If we can use MAPI with Outlook Express, the plugin should not be hard to port at all. Mark. From guido@python.org Mon Nov 4 22:11:47 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 17:11:47 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: Your message of "Mon, 04 Nov 2002 15:36:00 CST." <15814.59456.141076.98902@montanaro.dyndns.org> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <15814.59456.141076.98902@montanaro.dyndns.org> Message-ID: <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net> > Guido> For most mailing lists, I disagree. It's not like you're > Guido> going to miss an important message from your boss or from > Guido> a potential customer or employer when a false positive is > Guido> bounced from the dangerous-hobbies-involving-jello list. [Skip] > Perhaps not, but Mailman and Spambayes could hardly get worse PR > than if valid messages began to simply disappear. We all know there > are any of a number of reasons why ham can get misclassified. All > I'm saying is make the default setting for new groups in this > yet-to-be Mailman+Spambayes tool be to forward spam to the > moderator. Mailman can say to the moderator, "This message looks > like spam. If you would rather I delete such messages, here's how > you do it, and here are the implications." If mail just disappears, > there is no place to hang that little warning message. > > I don't understand why this seems to be such a difficult point to > make.
The readers of this list are so obviously far from the normal > user and/or list moderator that our personal experience as people > who read and moderate technical mailing lists just doesn't apply. I > manage a very active non-technical mailing list using Mailman. Most > of the people wouldn't know a Python script if it bit 'em in the > ass. The other people who help me moderate the list are > substantially less computer-savvy than I am. Trust me on this. > They wouldn't know how to disable the "delete spam" feature if they > were to somehow figure out why mail was disappearing. But the key is that *you* are the list's main administrator and in charge of the initial setup. So *you* should set it up to minimize your pain (which includes constant worries about lost mail due to false positives in the spam filter). I believe that while Mailman is relatively easy to set up, it requires (at least) typical mail admin skills, and a mail admin already has in his/her head ideas about the cost of lost mail. You seem to have been burned by this, and as a consequence I believe you're on the conservative side. As long as the consequences are clear when a list admin chooses to enable spam filtering, I think the default should be for convenience, not for liability. > Guido> Given the amount of spam that most lists get, and the > Guido> clumsiness (I believe Barry agrees with this assessment > Guido> :-) of the Mailman moderation API, putting all spam in > Guido> the moderation queue by default would be a bad idea. I > Guido> agree that it should be possible to configure it this way > Guido> if you really want, but I don't think it should be the > Guido> default. > > That is yet another argument for not deleting mail (and probably an > argument for fixing the moderation interface). If you save spam you > can tell them precisely where in the moderation interface to go to > make the change. 
If the interface is poor, it may well be hard for > the moderator to figure out where to go to stop the bleeding. There's no way you can design a web moderation interface to deal well with manually moderating 200 spams per day. IMO if you show *all* spam in the moderation interface, the kind of non-techie moderator that you describe is *more* likely to make mistakes (rejecting ham or approving spam) than in the default that I propose. You've made this same (or a very similar) point many times, and while I agree with you that it's bad to delete spam in many setups, I strongly disagree in this case. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Nov 4 22:17:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 17:17:01 -0500 Subject: [Spambayes] Proposing to drop retain_pure_html_tags Message-ID: AFAIK, nobody enables retain_pure_html_tags anymore. In the very early days of the project, it was the only choice. If it's enabled, it's virtually the same as saying "any msg whatsoever using HTML or XML is spam, even if it's just a plain-text msg discussing HTML examples". That was *almost* appropriate for the early c.l.py tests, because HTML msgs are so hated on tech mailing lists. The algorithms have since improved to the point where it does more harm than good even on my python.org tests, so I can't imagine a good use for it anymore. There's still a world of info we're missing inside HTML decorations, but retaining all of it will never work (the presence or absence of assorted HTML decorations violates the word-independence assumption to an extreme; it's a bogus assumption anyway, but there's no hope of recovering from the abuse HTML tags heap on it). From bkc@murkworks.com Mon Nov 4 22:26:37 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 04 Nov 2002 17:26:37 -0500 Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: References: <3DC64611.30897.3CB37091@localhost> Message-ID: <3DC6AD09.3250.3E45B0C3@localhost> On 5 Nov 2002 at 9:14, Mark Hammond wrote: > > http://msdn.microsoft.com/library/en-us/mapi/html/_mapi1book_using > > _message_filtering_to_manage_messages.asp > > > > I think this is new information released under the DOJ settlement. > I don't think the info is DOJ related - my July 2000 MSDN CD has the same > article. > On the DOJ info page at Microsoft, which I had a really hard time finding. It listed "Outlook Express APIs" and that URL pointed to these pages. Hmm. Who knows. :-( Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From neale@woozle.org Mon Nov 4 22:36:37 2002 From: neale@woozle.org (Neale Pickett) Date: 04 Nov 2002 14:36:37 -0800 Subject: [Spambayes] Database reduction In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > OTOH, > > >>> cPickle.dumps(w.__getstate__(), 1) > '(U\x04aoeuq\x01K\x00K\x00K\x00K\x02t.' > >>> len(_) > 19 > >>> > > which is shorter than your string repr. This isn't typical because 2 > is an absurd spamprob (it's > 1, and is an int instead of a double); > the savings would be greater with a real spamprob (which will consume > about 19 bytes in a string repr, but about 8 in a pickle). Right. I had some code in hammie to pickle the tuple instead of the object itself, but I thought it was a pretty gnarly kludge at the time. In any case, some variation on this seems obviously the right way to go. > [ Tim magic regarding pickle hacks ] > I'd avoid all that and pickle the states, but that's just me. I'm inclined to agree with you. If I do this, though, we have to all agree on a convention: if you need to modify a wordinfo object, you *must* write it back to the dictionary. Otherwise hammie will never know it changed. I was bitten by this a few times at first, and I haven't played with the code enough to know if any of this has crept back in. 
Would it be out of line to alter WordInfo to be immutable, to encourage folks to write it back to the dictionary? Neale From skip@pobox.com Mon Nov 4 22:38:27 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 4 Nov 2002 16:38:27 -0600 Subject: [Spambayes] deployment for mailman lists In-Reply-To: <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <15814.59456.141076.98902@montanaro.dyndns.org> <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15814.63203.981010.604877@montanaro.dyndns.org> Guido> But the key is that *you* are the list's main administrator and Guido> in charge of the initial setup. So *you* should set it up to Guido> minimize your pain (which includes constant worries about lost Guido> mail due to false positives in the spam filter). Correct, but regardless of my abilities in this particular case, the *default* for new mailing lists - those created by ~mailman/bin/newlist - should be to not delete the spam. The administrator of the site has to run that. The moderator of the list (who generally won't have shell access to the machine running Mailman) will then get her chance to go through and fiddle the bits. Guido> I believe that while Mailman is relatively easy to set up, it Guido> requires (at least) typical mail admin skills, and a mail admin Guido> already has in his/her head ideas about the cost of lost mail. Guido> You seem to have been burned by this, and as a consequence I Guido> believe you're on the conservative side. As long as the Guido> consequences are clear when a list admin chooses to enable spam Guido> filtering, I think the default should be for convenience, not for Guido> liability. 
It has nothing to do with getting burned, I just have relevant current experience dealing with less technical lists. There are tons of non-technical folks out there running Mailman-managed mailing lists. Consider that many hosting companies like Hostway make this available to their customers. Every other mail-handling tool I've ever seen (sendmail, fetchmail, procmail, etc) goes to great lengths to avoid losing mail. Why shouldn't Mailman? Guido> There's no way you can design a web moderation interface to deal Guido> well with manually moderating 200 spams per day. IMO if you show Guido> *all* spam in the moderation interface, the kind of non-techie Guido> moderator that you describe is *more* likely to make mistakes Guido> (rejecting ham or approving spam) than in the default that I Guido> propose. I'm not saying that you have to design an interface to deal with moderating 200 spams a day. I'm also not saying it's a one-time-only setting. Still, by making the default for held spam messages be "discard" instead of "defer", Mailman could make it a one-click operation to delete all 200 with one "Submit All Data" click from the moderation interface. I haven't used Mailman 2.1 yet, but I think that was something Barry had hoped to make a configuration option as well. Guido> You've made this same (or a very similar) point many times, and Guido> while I agree with you that it's bad to delete spam in many Guido> setups, I strongly disagree in this case. Only because you seem to continually misunderstand what I'm saying. I am *only* saying it's bad to delete spam by default when the list is first created. Let the list moderator decide, "I can't handle all this crap, please delete it for me". I see two scenarios: 1. An existing mailing list is converted to a new Mailman+Spambayes setup. 
The moderator is either (a) thankful that all the spam which had previously shown up on the list is now somewhere he can deal with it, or (b) he was already doing something to deflect most/all the spam, so doesn't see much of it in the moderation interface. 2. A brand new mailing list is setup with Mailman+Spambayes. As a new list, it should not be getting 200 spams per day. The moderator will have time to figure out how to change the settings on the list to delete spam instead of hold it. I just don't understand why you have a hard time understanding that out-of-the-box Mailman+Spambayes should not delete spam. It's a one-click change for Greg or Barry, or whoever controls python-list. Why not err on the side of caution? Skip From skip@pobox.com Mon Nov 4 22:39:49 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 4 Nov 2002 16:39:49 -0600 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: <3DC6AD09.3250.3E45B0C3@localhost> References: <3DC64611.30897.3CB37091@localhost> <3DC6AD09.3250.3E45B0C3@localhost> Message-ID: <15814.63285.147666.504828@montanaro.dyndns.org> Mark> I don't think the info is DOJ related - my July 2000 MSDN CD has Mark> the same article. Brad> On the DOJ info page at Microsoft, which I had a really hard time Brad> finding. It listed "Outlook Express APIs" and that URL pointed to Brad> these pages. Maybe they were just trying to convince DOJ they were complying with the proposed settlement. Skip From tim.one@comcast.net Mon Nov 4 22:41:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 17:41:48 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: [Mark Hammond] > ... > Unfortunately, I see nothing here that indicates this works for Outlook > Express. If we can use MAPI with Outlook Express, the plugin > should not be hard to port at all. 
Noting that, according to http://support.microsoft.com/default.aspx?scid=KB;EN-US;q192119 at least OE4 physically replaced the system Mapi32.dll with its own version that spoke only Simple MAPI, when OE was selected as the default email program. As a result, all "real" MAPI and CDO apps failed. Things may have improved since then (OE is at version 6 now, I believe), but OE4 looks hopeless for this reason. Rummaging around, I get the *impression* that they've solved the conflicting DLL problem, but that OE is still restricted to Simple MAPI. From guido@python.org Mon Nov 4 22:54:24 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 04 Nov 2002 17:54:24 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: Your message of "Mon, 04 Nov 2002 16:38:27 CST." <15814.63203.981010.604877@montanaro.dyndns.org> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <15814.59456.141076.98902@montanaro.dyndns.org> <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net> <15814.63203.981010.604877@montanaro.dyndns.org> Message-ID: <200211042254.gA4MsOc22974@pcp02138704pcs.reston01.va.comcast.net> > Guido> But the key is that *you* are the list's main > Guido> administrator and in charge of the initial setup. So > Guido> *you* should set it up to minimize your pain (which > Guido> includes constant worries about lost mail due to false > Guido> positives in the spam filter). > > Correct, but regardless of my abilities in this particular case, the > *default* for new mailing lists - those created by > ~mailman/bin/newlist - should be to not delete the spam. The > administrator of the site has to run that. The moderator of the > list (who generally won't have shell access to the machine running > Mailman) will then get her chance to go through and fiddle the bits. 
The default should be not to enable spambayes filtering at all, since there's no way to set up the training data to begin with.

> Guido> I believe that while Mailman is relatively easy to set
> Guido> up, it requires (at least) typical mail admin skills, and
> Guido> a mail admin already has in his/her head ideas about the
> Guido> cost of lost mail. You seem to have been burned by this,
> Guido> and as a consequence I believe you're on the conservative
> Guido> side. As long as the consequences are clear when a list
> Guido> admin chooses to enable spam filtering, I think the
> Guido> default should be for convenience, not for liability.
>
> It has nothing to do with getting burned, I just have relevant
> current experience dealing with less technical lists. There are
> tons of non-technical folks out there running Mailman-managed
> mailing lists. Consider that many hosting companies like Hostway
> make this available to their customers. Every other mail-handling
> tool I've ever seen (sendmail, fetchmail, procmail, etc) goes to
> great lengths to avoid losing mail. Why shouldn't Mailman?

See above. Enabling spam filtering should be an explicit step. The UI should clarify the consequences and show the configuration settings. But the default configuration settings *once spam filtering is enabled* should be to bounce (not drop) spam scoring higher than the top of the "uncertain" region. Example UI:

    [ ] Enable Bayesian spam filtering [help link]
    [ 95 ] Spam cutoff score
    [ 5 ] Ham cutoff score
    Disposition for messages scoring at least spam cutoff:
        (x) Bounce ( ) Discard ( ) Moderate
    Disposition for messages scoring between ham and spam cutoff:
        ( ) Moderate (x) Approve

> Guido> There's no way you can design a web moderation interface
> Guido> to deal well with manually moderating 200 spams per day.
> Guido> IMO if you show *all* spam in the moderation interface, > Guido> the kind of non-techie moderator that you describe is > Guido> *more* likely to make mistakes (rejecting ham or > Guido> approving spam) than in the default that I propose. > > I'm not saying that you have to design an interface to deal with > moderating 200 spams a day. I'm also not saying it's a > one-time-only setting. Still, by making the default for held spam > messages be "discard" instead of "defer", Mailman could make it a > one-click operation to delete all 200 with one "Submit All Data" > click from the moderation interface. I haven't used Mailman 2.1 > yet, but I think that was something Barry had hoped to make a > configuration option as well. And that's exactly what I fear -- mixing the spam and unsure messages in a single moderation queue will increase mistakes. > Guido> You've made this same (or a very similar) point many > Guido> times, and while I agree with you that it's bad to delete > Guido> spam in many setups, I strongly disagree in this case. > > Only because you seem to continually misunderstand what I'm saying. > I am *only* saying it's bad to delete spam by default when the list > is first created. Let the list moderator decide, "I can't handle > all this crap, please delete it for me". OK, then we agree. I say spam filtering shouldn't be enabled at all when the list is created -- after all you have no ham training data! > I see two scenarios: > > 1. An existing mailing list is converted to a new > Mailman+Spambayes setup. The moderator is either (a) > thankful that all the spam which had previously shown up on > the list is now somewhere he can deal with it, or (b) he was > already doing something to deflect most/all the spam, so > doesn't see much of it in the moderation interface. Depends on whether whatever he was doing before can be ported to the MM setup. > 2. A brand new mailing list is setup with Mailman+Spambayes. 
As > a new list, it should not be getting 200 spams per day. The > moderator will have time to figure out how to change the > settings on the list to delete spam instead of hold it. > > I just don't understand why you have a hard time understanding that > out-of-the-box Mailman+Spambayes should not delete spam. It's a > one-click change for Greg or Barry, or whoever controls python-list. > Why not err on the side of caution? It was all a big misunderstanding. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Nov 4 23:18:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 18:18:34 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: <200211042254.gA4MsOc22974@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > ... > The default should be not to enable spambayes filtering at all, since > there's no way to set up the training data to begin with. Well, we don't know that yet. Work on seeding classifiers has been minimal so far. I agree it shouldn't be enabled by default regardless. From tim.one@comcast.net Mon Nov 4 23:49:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 18:49:16 -0500 Subject: [Spambayes] Database reduction In-Reply-To: Message-ID: [Neale Pickett] > Right. I had some code in hammie to pickle the tuple instead of the > object itself, but I thought it was a pretty gnarly kludge at the time. > In any case, some variation on this seems obviously the right way to go. If you use __getstate__() to get the tuple, there's nothing objectionable about it: it's the *purpose* of __getstate__/__setstate__ to get/set state into/from tuples. Objectionable would be to access the fields directly yourself by name, since they may change over time. 
There's a problem here, though, in that only the Bayes class saves a PICKLE_VERSION identifier in its pickles; changes in WordInfo structure can't be transparent to old databases unless WordInfo pickles contain a version identifier too.

>> I'd avoid all that and pickle the states, but that's just me.

> I'm inclined to agree with you. If I do this, though, we have to all
> agree on a convention: if you need to modify a wordinfo object, you
> *must* write it back to the dictionary. Otherwise hammie will never
> know it changed. I was bitten by this a few times at first, and I
> haven't played with the code enough to know if any of this has crept
> back in.

I fixed one of those today. The database still isn't getting updated with the new word atimes during scoring, but I've ignored that because nobody has made any use of atimes yet.

I have to say it's painful to do these redundant stores -- it generally doubles the number of dict operations, and that's a speed drag. However, compared to I/O and tokenization times, it appears to be a minor drag at worst.

> Would it be out of line to alter WordInfo to be immutable, to encourage
> folks to write it back to the dictionary?

I've done enough bending backwards for a subsystem I don't use. There are only a handful of places these structs mutate.

From tim.one@comcast.net Tue Nov 5 00:14:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 19:14:17 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To:
Message-ID:

[Richie Hindle]
> ...
> I've yet to test this theory, but this is one reason I'd like to use HTML
> as the 'GUI toolkit' for the UI of the POP3 proxy. The docs can
> be tied so closely to the UI that people won't even realise they're
> reading them...

More, my sisters are fluent with their browsers. One's favorite editor is still Notepad, although she knows her way around Word far better than I do (I'll pit my Notepad skills against anyone's, though).
So, based on my sibling experience, the two UIs that work are those with no buttons to push, and those with too many to push to even know where to start. Let that be your guiding principle.

From tim.one@comcast.net Tue Nov 5 00:21:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 19:21:41 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven>
Message-ID:

[Tim]
> a collection of msgs", the latter to remember,
> e.g., which msgs have been trained as ham, and which as spam.

[Tim@mail.powweb.com]
> Remembering is an interesting idea, but what real purpose does it
> serve aside from making testing easier?

Not for testing. Say a user discovers they made a mistake, and moves a misclassified spam into their ham folder. The action needed then is two-fold: *un*train the msg as spam, *re*train on it as ham. That's too many manual steps for a user to keep track of. If a training class remembers what it's done with each msg, though, you need merely inform it that msg X has moved from thither to yon, and it can deduce everything needed from that. Likewise if the user drags a msg from their unsure folder into a ham folder, or into a spam folder, etc.

Life might be easier this way if a client supported attaching metadata to msgs too (Outlook supports rich facilities for doing so; others are probably restricted to injecting more headers).

From Tim@mail.powweb.com Tue Nov 5 00:34:27 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 18:34:27 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To:
Message-ID:

So exactly how does one 'untrain' given a particular message?

11/4/2002 6:21:41 PM, Tim Peters wrote:

>[Tim]
>> a collection of msgs", the latter to remember,
>> e.g., which msgs have been trained as ham, and which as spam.
>
>[Tim@mail.powweb.com]
>> Remembering is an interesting idea, but what real purpose does it
>> serve aside from making testing easier?
>
>Not for testing. Say a user discovers they made a mistake, and moves a
>misclassified spam into their ham folder. The action needed then is
>two-fold: *un*train the msg as spam, *re*train on it as ham. That's too
>many manual steps for a user to keep track of. If a training class
>remembers what it's done with each msg, though, you need merely inform it
>that msg X has moved from thither to yon, and it can deduce everything
>needed from that. Likewise if the user drags a msg from their unsure folder
>into a ham folder, or into a spam folder, etc.
>
>Life might be easier this way if a client supported attaching metadata to
>msgs too (Outlook supports rich facilities for doing so; others are probably
>restricted to injecting more headers).
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com

From tim.one@comcast.net Tue Nov 5 00:48:40 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 19:48:40 -0500
Subject: [Spambayes] Something to test
In-Reply-To: <200211040627.gA46Rm108104@localhost.localdomain>
Message-ID:

[Tim]
> This little patch arranges to create "noheader:HEADERNAME" tokens for
> headers in options.safe_headers that *don't* appear in a msg's headers.

This has been checked in now, disabled by default, under bool option name record_header_absence.

[Anthony Baxter]

Thanks for testing!

> filename:       before        after
> ham:spam:   11192:1826   11192:1826
> fp total:            0            1
> fp %:             0.00         0.01
> fn total:            7            8
> fn %:             0.38         0.44
> unsure t:          106          107
> unsure %:         0.81         0.82
> real cost:      $28.20       $39.40
> best cost:      $28.20       $30.40
> h mean:           0.63         0.42
> h sdev:           4.19         4.19
> s mean:          98.68        98.63
> s sdev:           7.74         7.95
> mean diff:       98.05        98.21
> k:                8.22         8.09

Wow -- it cut your ham mean by a third.
> The additional fp was a mail-out from Nettwerk (that I've signed up > for, but which are _incredibly_ spammy) that went from 0.956 to 0.964, > where my spam cutoff is 0.96. The noheader: errors-to was the killer > clue that pushed it over the edge. The spam situation is considerably > worse. The additional false negative was something that went from 0.467 > to 0.431 (ham_cutoff 0.45). The damage came from > prob('noheader:mime-version') = 0.245329 > (It was a very short spam) So, in all, it nudged two marginal msgs over the edge, but in the wrong directions. So I disabled it by default. It helps python.org tests, though, so it's an option now. > One fn went from 0.27 to 0.029, due to: > prob('noheader:subject') = 0.0042591 > prob('noheader:to') = 0.0652536 Those are bizarre. From where do you get ham lacking Subject and To headers? In my personal classifier, #h #s spamprob 'noheader:to' 10 95 0.884678455795 'noheader:subject' 2 16 0.858858950186 Is there some systematic reason for why you've got lots of ham without key header lines? Your noheader:subject spamprob in particular is astonishingly low. > prob('noheader:mime-version') = 0.245329 > > It made pretty much all of my fn's at least slightly worse, if not > much worse. The lack of common headers in your ham is the mystery to me. Try to figure out why that is? For example, perhaps you have some systematic source of ham creating headers the email pkg can't parse. In that case we fall back to the raw body text, and don't get any header info at all. But in that case, we should learn *why* the email pkg is blowing up, and worm around it. For the same reason your FN got worse, your FN would get better if these things had the high spamprobs they were expected to have (and do have, in all my tests; nobody else has reported on this experiment, alas). 
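
For anyone wanting to picture what the "noheader:HEADERNAME" tokens discussed above look like, here is a minimal sketch. It is not the spambayes tokenizer code: SAFE_HEADERS is a made-up stand-in for options.safe_headers, absent_header_tokens() is a hypothetical helper name, and the only things taken from the thread are the token format and the idea of emitting a token for each safe header that is missing from a msg.

```python
import email

# Hypothetical stand-in for options.safe_headers (assumption: the real
# option holds a longer list of header names).
SAFE_HEADERS = ("subject", "to", "mime-version", "errors-to")


def absent_header_tokens(msg, safe_headers=SAFE_HEADERS):
    """Yield a 'noheader:NAME' token for each safe header missing from msg."""
    # Header names are case-insensitive, so compare in lowercase.
    present = {name.lower() for name in msg.keys()}
    for name in safe_headers:
        if name.lower() not in present:
            yield "noheader:" + name.lower()


# A very short message with only From and Subject headers.
raw = "From: someone@example.com\nSubject: hello\n\nshort body\n"
msg = email.message_from_string(raw)
tokens = list(absent_header_tokens(msg))
print(tokens)
```

A msg like the one above, lacking To, MIME-Version and Errors-To, picks up three extra tokens, which is how a very short spam or an oddly formed ham can get nudged across a cutoff in either direction.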
From tim.one@comcast.net Tue Nov 5 01:01:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 20:01:20 -0500 Subject: [Spambayes] Another great spam Message-ID: It's a trend! This was a heavy spam day on my home box, with about 150 nailed by the classifier. But this got scored as rock solid ham. poets-should-be-shot-just-like-mimes-ly y'rs - tim Spam Score: 0.00738292 '*H*' 0.985376 '*S*' 0.000141353 'issue.' 0.0155709 "i'd" 0.0199122 'section,' 0.0302013 'publications' 0.0348837 'page.' 0.0352476 'left-hand' 0.0505618 'bay' 0.0652174 'god,' 0.0652174 'header:Organization:1' 0.0740827 'there,' 0.074749 '(in' 0.083982 'bay,' 0.0918367 'bay.' 0.0918367 'homepage,' 0.0918367 'issue' 0.120995 'scroll' 0.143265 'bar).' 0.155172 'listing.' 0.155172 'poem' 0.155172 'subject:Announcement' 0.155172 'leaves' 0.17751 'archive' 0.207696 'know' 0.213762 'page' 0.219117 'yours,' 0.245748 'which' 0.250988 'subject:/' 0.263804 'thought' 0.265127 'since' 0.273836 'released' 0.277373 'contents' 0.281415 'web' 0.290549 'some' 0.305297 'hello,' 0.306255 'almost' 0.312699 'hope' 0.316514 'two' 0.327412 'link' 0.331387 'wishes' 0.334614 'once' 0.369855 'about' 0.37169 'already' 0.372141 'wanted' 0.374675 'that' 0.374984 'doing' 0.381985 'let' 0.385857 'however,' 0.386094 '2002' 0.388204 'menu' 0.388208 'changing' 0.389442 'actually' 0.398668 'url:com' 0.608106 'gone!' 0.612957 'header:MIME-Version:1' 0.613927 'give' 0.616814 'bottom' 0.620525 'here.' 0.633211 'care,' 0.650199 'way.' 
0.654826 'content' 0.65579 'thank' 0.661727 'noheader:errors-to' 0.663115 'header:Return-Path:1' 0.681022 'best' 0.688081 'powerful' 0.702059 'year' 0.734858 'address:' 0.742064 'click' 0.749785 'publication' 0.775658 'interest' 0.780263 'header:Received:3' 0.78214 'online' 0.785239 'color' 0.81036 'amen' 0.844828 'from:email name: To: Subject: John Amen/A Publication Announcement Date: MON, 04 NOV 2002 18:28:14 -0400 MIME-Version: 1.0 Reply-To: john@johnamen.com Message-Id: <200211041830453.SM00816@CAMPAIGN> X-RBL-Warning: SPAMHEADERS: This E-mail has headers consistent with spam [4000020e]. X-Note: This E-mail was scanned by Declude JunkMail (www.declude.com) for spam. Organization: WEBPRO International Return-Path: john@johnamen.com X-OriginalArrivalTime: 04 Nov 2002 23:33:03.0155 (UTC) FILETIME=[84E88830:01C2845A] Hello, I hope you are doing well and enjoying the autumn. The leaves are already changing color here. God, another year almost gone! I wanted to let you know about three publications in which my poems are currently appearing: Two poems in Thunder Sandwich. Web address: http://www.thundersandwich.com Once you get to the homepage, you'll be flashed to a contents page. Two of my poems are included in the poetry section. One poem in Sidereality. Web address: http://www.sidereality.com Once at the homepage, you can click on "contents" (in the left-hand menu bar). My poem is in the poetry section. One poem in Poetry Bay. A new issue of this publication has actually been released since the one in which my poem appeared; however, I thought I'd give you an archive link. Web address: http://www.poetrybay.com Once there, you click on "Poetry Bay Online Magazine." That will take you to the current issue of Poetry Bay, which includes some powerful poems. If you scroll to the bottom of the page and click on "Summer 2002" in the "Prior Versions" section, you'll link to the Summer 2002 issue. My poem is included in the content listing. 
I thank you for your interest and hope that these poems move you in some way. Take care, and best wishes for the winter! Yours, John Amen From tim.one@comcast.net Tue Nov 5 01:06:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 20:06:58 -0500 Subject: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: [Tim@mail.powweb.com] > So exactly how does one 'untrain' given a particular message? Exactly the same way you trained that msg, except that instead of calling the classifier's learn() method, you call its unlearn() method. The arglists are identical, and exactly the same stuff should be passed in both cases. From vanhorn@whidbey.com Tue Nov 5 02:59:23 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Mon, 04 Nov 2002 18:59:23 -0800 Subject: [Spambayes] deployment for mailman lists References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DC7340B.953FACD9@whidbey.com> Guido van Rossum wrote: > > Guido> - An obvious default policy with tunable parameters presents > > Guido> itself: ham goes to the list, spam is dropped (or > > Guido> bounced), and unsure goes into the moderator's queue. > > > > I would argue that spam should by default go into the moderator's > > queue as well. The default should never be to drop or bounce a > > message. Either way, you run the risk that legitimate mail gets > > lost. > > For most mailing lists, I disagree. It's not like you're going to > miss an important message from your boss or from a potential customer > or employer when a false positive is bounced from the > dangerous-hobbies-involving-jello list. > > Given the amount of spam that most lists get, and the clumsiness (I > believe Barry agrees with this assessment :-) of the Mailman > moderation API, putting all spam in the moderation queue by default > would be a bad idea. 
I agree that it should be possible to configure > it this way if you really want, but I don't think it should be the > default. Although I find the 2.1b interface to be rather clumsier than 2.0.13 was, I would certainly want the default to allow for moderation. About half the lists I run are commercial, announcement lists for employees. It's not that you risk missing an important message from a potential employer, which should be barred, but from the current employer, who is paying for the list. Also, as I had suggested earlier, I would want the list output to be fed into ham@. As I understand it, it would be desirable to forward the spam to the spam@ address to keep at least some training on the spam side going on. With the combination of hammie running in front and manual moderation, a dedicated hammie for a single mailing list would be spectacular over time. Van -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From tim.one@comcast.net Tue Nov 5 04:10:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 23:10:32 -0500 Subject: FW: RE: RE: [Spambayes] Email client integration -- what's needed? Message-ID: Tim, all msgs to you bounce, with Recipient address: Tim@mail.powweb.com Reason: Remote SMTP server has rejected address Diagnostic code: smtp;550 : User unknown Remote system: dns;mail.powweb.com (TCP|24.153.64.230|22677| 63.251.213.34|25) (mail.powweb.com ESMTP Postfix) -----Original Message----- From: Tim Peters [mailto:tim.one@comcast.net] Sent: Monday, November 04, 2002 11:02 PM To: Tim@mail.powweb.com Subject: RE: RE: RE: [Spambayes] Email client integration -- what's needed? > Ok, I'm totally with ya now. 
Is anyone working on a general purpose > training class? Not seriously as such, although the Outlook client has steps in that direction. I expect people think the retraining steps are too trivial to factor out, but I think that's a mistake: while the general class should indeed end up being simple, there are subtleties that should be captured once and for all, and the very existence of a training class will help the next person figure out how to proceed with the next client. The current Tester and TestDriver classes (esp. the former) have that flavor too: their existence has driven the creation of concrete test drivers, and supplied just enough commonality so that post-test analysis tools have been relatively easy to write. > If not, I can take a crack at it... The smtpproxy is kinda broken > without it, because while it can train, it will need some kind of > remembering in order to be able to untrain... I think you're in a good position, then. When the client allows tight integration, it seems hard to abstract things enough for easy reusability; with a nightmare client <0.7 wink>, it should be easier to picture an ideal. From Tim@mail.powweb.com Tue Nov 5 04:19:01 2002 From: Tim@mail.powweb.com (Tim@mail.powweb.com) Date: Mon, 04 Nov 2002 22:19:01 -0600 Subject: FW: RE: RE: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: Tim, I don't know why it's doing that. I'm sending from tim@fourstonesExpressions.com... but it comes out on the list with the invalid address... Lemme do some more investigation... but it's probably my stinkin host... forever screwing me up. Certainly a training class would have helped me get my head around the training side of things. It's not really a trivial abstraction... Ok, I'll take a crack at the training class, then... got some ideas, but could use a few suggestions on some remembering stuff... All we can really ever count on having is a few basic headers and the message body. 
Somehow from whatever we have, we need to create a key that will be used to find a saved message. I could hash the entire message, or use a checksum... ideas? 11/4/2002 10:10:32 PM, Tim Peters wrote: >Tim, all msgs to you bounce, with > > Recipient address: Tim@mail.powweb.com > Reason: Remote SMTP server has rejected address > Diagnostic code: smtp;550 : User unknown > Remote system: dns;mail.powweb.com (TCP|24.153.64.230|22677| > 63.251.213.34|25) (mail.powweb.com ESMTP Postfix) > >-----Original Message----- >From: Tim Peters [mailto:tim.one@comcast.net] >Sent: Monday, November 04, 2002 11:02 PM >To: Tim@mail.powweb.com >Subject: RE: RE: RE: [Spambayes] Email client integration -- what's >needed? > > >> Ok, I'm totally with ya now. Is anyone working on a general purpose >> training class? > >Not seriously as such, although the Outlook client has steps in that >direction. I expect people think the retraining steps are too trivial to >factor out, but I think that's a mistake: while the general class should >indeed end up being simple, there are subtleties that should be captured >once and for all, and the very existence of a training class will help the >next person figure out how to proceed with the next client. The current >Tester and TestDriver classes (esp. the former) have that flavor too: their >existence has driven the creation of concrete test drivers, and supplied >just enough commonality so that post-test analysis tools have been >relatively easy to write. > >> If not, I can take a crack at it... The smtpproxy is kinda broken >> without it, because while it can train, it will need some kind of >> remembering in order to be able to untrain... > >I think you're in a good position, then. When the client allows tight >integration, it seems hard to abstract things enough for easy reusability; >with a nightmare client <0.7 wink>, it should be easier to picture an ideal. 
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Tue Nov 5 04:59:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 04 Nov 2002 23:59:26 -0500 Subject: FW: RE: RE: [Spambayes] Email client integration -- what's needed? In-Reply-To: Message-ID: [TimS] > Certainly a training class would have helped me get my head around the > training side of things. It's not really a trivial abstraction... It will be if done right -- it's making it concrete that will be non-trivial. > Ok, I'll take a crack at the training class, then... got some ideas, > but could use a few suggestions on some remembering stuff... All we > can really ever count on having is a few basic headers and the message > body. Somehow from whatever we have, we need to create a key that > will be used to find a saved message. I could hash the entire > message, or use a checksum... ideas? I don't think the training class should know anything concrete about msgs. Instead it should work with opaque message objects. Off the top of my head, msgs should support: + An arbitrary but consistent total ordering (so that they're usable as keys in B-Tree based persistent databases), and hashability (so that they're usable as keys in a dict). + A method to return a human-comprehensible name (perhaps an access path relative to the client's folder hierarchy -- but the training class shouldn't care). Note that if these names are required to be unique strings, that can be exploited to give a consistent total ordering, and hashability (just compare or hash the string names). + A method to deliver a token stream, suitable for passing to the classifier. I expect it would be most convenient to make msgs iterable, so they can be passed directly as-is to tokenize(). 
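A minimal sketch of the opaque-message interface listed above, written in present-day Python. The names `Msg`, `name()` and `StringMsg` are hypothetical and not taken from the spambayes codebase; the whitespace split only stands in for a real tokenizer.

```python
from abc import ABC, abstractmethod
from functools import total_ordering


@total_ordering
class Msg(ABC):
    """Hypothetical opaque-message protocol a trainer could rely on."""

    @abstractmethod
    def name(self):
        """Return a unique, human-comprehensible name for this message."""

    @abstractmethod
    def __iter__(self):
        """Yield the tokens to hand to the classifier."""

    # Unique string names give a consistent total ordering and hashability.
    def __eq__(self, other):
        if not isinstance(other, Msg):
            return NotImplemented
        return self.name() == other.name()

    def __lt__(self, other):
        if not isinstance(other, Msg):
            return NotImplemented
        return self.name() < other.name()

    def __hash__(self):
        return hash(self.name())


class StringMsg(Msg):
    """Toy client-side implementation: a folder-style name plus raw text."""

    def __init__(self, name, text):
        self._name = name
        self._text = text

    def name(self):
        return self._name

    def __iter__(self):
        # Stand-in for a real tokenizer: lowercase, split on whitespace.
        return iter(self._text.lower().split())


a = StringMsg("Inbox/0001", "Hello World")
b = StringMsg("Inbox/0002", "Buy now")
trained = {a: "ham", b: "spam"}  # messages are usable as dict keys
```

Because equality and hashing go through the name alone, a trainer could look up what it remembered about a message without ever inspecting headers or body itself.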
The existing msgs.Msg class does part of this stuff, but is a concrete class, and geared toward testing. A training class needs to specify a Msg interface (protocol, abstract base class, however you like to think of these things), and clients need to supply classes or factory functions that implement that interface (protocol, whatever). Right? This is just OO design: identify the objects and actors in the domain, and model them with classes. The client will have to supply concrete versions that implement the interfaces the trainer requires. The trick is to define the trainer in such a way that it requires exactly enough to get its job done, and clients have to implement at least that much (but may implement more). From tim.one@comcast.net Tue Nov 5 06:05:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 05 Nov 2002 01:05:41 -0500 Subject: [Spambayes] More mildly clever spam In-Reply-To: Message-ID: I won't show the whole thing here. It scored 0.62 for me (H=0.75, S=0.99), so was Unsure, but looking at it was baffling: Our highly successful 24 year old multi-national company gives you an exclusive business that's guaranteed to make you an extra weekly etc etc Despite the obvious spamicity, the clue list had words like sniper, emacs and distros. Turns out they were only visible in reverse video, thanks to HTML trickery: """ Anyone, regardless of background, education or experience can easily
make money with BT Online ©. We provide everything you need.
You can start making a Guaranteed extra income just 5 minutes from now!

Finally, I've always found that the RPMs Nvidia supplies don't put all the files in all the right places. I strongly recommend using the binary tarballs for the Nvidia kernel and GLX driver instead. It's incredibly easy; you just unpack them wherever you please, bust out a root shell and run make from the top level directories. It's actually easier and faster than RPM, and even better, it always works. Just make sure that the statements Load "glx" and Driver "nvidia" appear in etc/X11/XF86Config under Section "Module" and Section "Device" respectively before you re-boot (or make sure you know how to use emacs or vi, and make sure you know the path to your XF86Config file -- different distros put it in different places.)
"If you can check your email, you can make $$ with BT Online ©" """ Etc on both sides. There are snippets of news stories about the East Coast snipers, tech postings, and business stories, spread evenly throughout the msg. The white-on-white text is actually used to space out spam paragraphs! I expect that the worst this gimmickery can do with our code is knock a spam into Unsure territory. Indeed, despite that there was a lot more hidden ham than visible spam in this msg, it had 33 words with spamprobs above 0.90, and it's darned hard to hit that many words with spamprobs below 0.10 by luck. This one was particularly lucky in including sniper news, since I live in the snipers' target area, and have lots of ham about that from friends & relatives over the last month. I say "lucky" instead of clever here because tim_one@msn.com was just one of 22 tim_xyz@msn.com addresses in the To and Cc lines. What's amazing me now is how very few spam I get that try to play tricks at all! From rob@hooft.net Tue Nov 5 07:09:32 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Tue, 05 Nov 2002 08:09:32 +0100 Subject: [Spambayes] counterweight: it really works! References: <3DC6130D.40508@hooft.net> <3DC68F96.6070809@startechgroup.co.uk> Message-ID: <3DC76EAC.4000807@hooft.net> Matt Sergeant wrote: > Rob Hooft said the following on 04/11/02 06:26: >> Just to remind everyone that this software really works! Its spambayes >> score deviates from 1.0 only by about 10**-8, but SA didn't see much > Please don't compare to 4 months old SpamAssassin's. Upgrade if you want > to compare. Thanks. OK, sorry, maybe I should have updated. OTOH, that is part of the problem, isn't it? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Tue Nov 5 07:14:16 2002 From: rob@hooft.net (Rob W.W. 
Hooft) Date: Tue, 05 Nov 2002 08:14:16 +0100 Subject: [Spambayes] Why I added src=cid: etc References: <3DC6A1D8.6040507@startechgroup.co.uk> Message-ID: <3DC76FC8.5010109@hooft.net> Matt Sergeant wrote: [on viruses] > > Yeah, I've got some neat results just from classifying file extensions. > The double extension ones are especially good ;-) > > Matt. 2-line virusscanner in /etc/postfix/body_checks: /^(Content-(Type|Disposition):.*|[[:space:]]*(file)?)name=("[^"]*|[^[:space:]]*)\.(exe|com|scr|pif|bat|lnk|dll|vbs|js)/ REJECT /^Content-Type:[[:space:]]*audio\// REJECT -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Tue Nov 5 08:44:56 2002 From: richie@entrian.com (richie@entrian.com) Date: Tue, 05 Nov 2002 08:44:56 +0000 Subject: [Spambayes] HTML user interface for spambayes Message-ID: Hi all, Lots happening in the past couple of days! To make sure we're not duplicating effort, I'd like to let people know that I'm partway through writing an HTML-based user interface for the spambayes database and the POP3 proxy. I should commit an initial version in the next couple of days. At the moment, it gives you: o A 'Word query' form where you can get information from the database about a specific word o Training by uploading message files to it (one at once at the moment, but I'll add support for mbox files) o Training by pasting an email into a form o The status of the pop3proxy - how many emails classified and so on, plus a shutdown button. I had a look at POPFile, and their HTML user interface lets you (re)classify recently-proxied messages very easily. This is a great idea, and along with Tim's SMTP proxy stuff should make the process of (re)classification nice and simple. One question: can we still untrain a message? The code is still there, but I have it in my head that some of Gary's ideas prevented untraining from working, and that was why the CV tests needed to retrain from scratch for each pass... 
or am I getting this totally wrong? I hope I am, because reclassifying (through the web or through SMTP) will need to use the untraining stuff. Also on my list is to commit Tim Stone's SMTP proxy code, possibly after integrating it with the pop3proxy (but I need to discuss that with you, Tim, after looking in more detail at the code, hopefully tonight). -- Richie Hindle richie@entrian.com From rob@hooft.net Tue Nov 5 10:22:57 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Tue, 05 Nov 2002 11:22:57 +0100 Subject: [Spambayes] HTML user interface for spambayes References: Message-ID: <3DC79C01.4090303@hooft.net> richie@entrian.com wrote: > One question: can we still untrain a message? The code is still there, > but I have it in my head that some of Gary's ideas prevented untraining > from working, and that was why the CV tests needed to retrain from > scratch for each pass... or am I getting this totally wrong? I hope I > am, because reclassifying (through the web or through SMTP) will need > to use the untraining stuff. All the CV stuff that could not untrain has been removed from the code; with the score methods that are in there now, untrain will work fine. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From msergeant@startechgroup.co.uk Tue Nov 5 10:20:58 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 05 Nov 2002 10:20:58 +0000 Subject: [Spambayes] counterweight: it really works! References: <3DC6A44B.4070509@startechgroup.co.uk> <200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DC79B8A.9000409@startechgroup.co.uk> Guido van Rossum said the following on 04/11/02 20:29: >>FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't >>officially released yet, but then neither is spambayes [grin]) using the >>Robinson algorithm. 
I'm hoping to get the chi-squared algorithm in there >>too, but I had some trouble with it producing weird results for me (I >>tried to post something to this list about it but it vanished into the >>ether, so I'll try again shortly). > > > Cool! What do you do for training of your Robinson classifier? At the moment it's manually trained (same way as spambayes), but we're looking into auto-training - feeding back current SA results into the corpus. That's a double-edged sword of course, but it'll be interesting research. Matt. From msergeant@startechgroup.co.uk Tue Nov 5 10:27:40 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 05 Nov 2002 10:27:40 +0000 Subject: [Spambayes] counterweight: it really works! References: <3DC6130D.40508@hooft.net> <3DC68F96.6070809@startechgroup.co.uk> <3DC76EAC.4000807@hooft.net> Message-ID: <3DC79D1C.5040403@startechgroup.co.uk> Rob W.W. Hooft said the following on 05/11/02 07:09: > Matt Sergeant wrote: > >>Rob Hooft said the following on 04/11/02 06:26: > > >>>Just to remind everyone that this software really works! Its spambayes >>>score deviates from 1.0 only by about 10**-8, but SA didn't see much > > >>Please don't compare to 4 months old SpamAssassin's. Upgrade if you want >>to compare. Thanks. > > OK, sorry, maybe I should have updated. OTOH, that is part of the > problem, isn't it? As I explained in another email, both spambayes (and other statistical solutions) and SpamAssassin need constant updating. It's just that spambayes is slightly easier as the manual intervention you have to do isn't far off your regular reading of email (whereas spamassassin requires you to drop into a console and type: perl -MCPAN -e 'install Mail::SpamAssassin'). Of course you could put the latter in a cron job, but most sensible people wouldn't trust it. BTW: I'm not suggesting that SA would have caught that particular spam - it probably wouldn't have, I just hate to see invalid comparisons. Matt.
From msergeant@startechgroup.co.uk Tue Nov 5 10:29:04 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 05 Nov 2002 10:29:04 +0000 Subject: [Spambayes] Why I added src=cid: etc References: <3DC6A1D8.6040507@startechgroup.co.uk> <3DC76FC8.5010109@hooft.net> Message-ID: <3DC79D70.3030309@startechgroup.co.uk> Rob W.W. Hooft said the following on 05/11/02 07:14: > Matt Sergeant wrote: > [on viruses] > >>Yeah, I've got some neat results just from classifying file extensions. >>The double extension ones are especially good ;-) >> >>Matt. > > > 2-line virusscanner in /etc/postfix/body_checks: > > /^(Content-(Type|Disposition):.*|[[:space:]]*(file)?)name=("[^"]*|[^[:space:]]*)\.(exe|com|scr|pif|bat|lnk|dll|vbs|js)/ > REJECT > /^Content-Type:[[:space:]]*audio\// REJECT Never REJECT on file extension. Only ever ACCEPT! This is the same rule as firewalling - never close off insecure ports, only open the ones you know are secure and/or needed. Matt. From sjoerd@acm.org Tue Nov 5 10:42:42 2002 From: sjoerd@acm.org (Sjoerd Mullender) Date: Tue, 05 Nov 2002 11:42:42 +0100 Subject: [Spambayes] Something to test In-Reply-To: References: Message-ID: <200211051042.gA5AggI07324@indus.ins.cwi.nl> On Sun, Nov 3 2002 Tim Peters wrote: > Index: tokenizer.py > =================================================================== > RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v > retrieving revision 1.60 > diff -c -r1.60 tokenizer.py > *** tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60 > --- tokenizer.py 3 Nov 2002 08:31:44 -0000 > *************** > *** 1178,1183 **** > --- 1178,1185 ---- > x2n[x] = x2n.get(x, 0) + 1 > for x in x2n.items(): > yield "header:%s:%d" % x > + for x in options.safe_headers - Set([k.lower() for k in x2n]): > + yield "noheader:" + x > > def tokenize_body(self, msg, maxword=options.skip_max_word_size): > """Generate a stream of tokens from an email Message. 
Here are my results:

filename:         cv1s        cv2s
ham:spam:   11850:3360  11850:3360
fp total:            3           3
fp %:             0.03        0.03
fn total:            4           4
fn %:             0.12        0.12
unsure t:          103         100
unsure %:         0.68        0.66
real cost:      $54.60      $54.00
best cost:      $26.60      $25.80
h mean:           0.20        0.19
h sdev:           3.15        3.15
s mean:          99.29       99.28
s sdev:           5.94        5.95
mean diff:       99.09       99.09
k:               10.90       10.89

The difference between the two runs: 3 unsure messages got nailed correctly, so it's a marginal improvement. -- Sjoerd Mullender From bkc@murkworks.com Tue Nov 5 15:07:54 2002 From: bkc@murkworks.com (Brad Clements) Date: Tue, 05 Nov 2002 10:07:54 -0500 Subject: [Spambayes] More mildly clever spam In-Reply-To: References: Message-ID: <3DC797B2.10341.41DA65C8@localhost> On 5 Nov 2002 at 1:05, Tim Peters wrote: > I say "lucky" instead > of clever here because tim_one@msn.com was just one of 22 tim_xyz@msn.com > addresses in the To and Cc lines. Is the tokenizer still not parsing/counting recipients? > What's amazing me now is how very few spam I get that try to play tricks at > all! Someone, somewhere will put this into an automated spam tool, just wait. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From guido@python.org Tue Nov 5 15:23:40 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 05 Nov 2002 10:23:40 -0500 Subject: [Spambayes] counterweight: it really works! In-Reply-To: Your message of "Tue, 05 Nov 2002 10:20:58 GMT." <3DC79B8A.9000409@startechgroup.co.uk> References: <3DC6A44B.4070509@startechgroup.co.uk> <200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net> <3DC79B8A.9000409@startechgroup.co.uk> Message-ID: <200211051523.gA5FNec19110@odiug.zope.com> > >>FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't > >>officially released yet, but then neither is spambayes [grin]) using the > >>Robinson algorithm.
I'm hoping to get the chi-squared algorithm in there > >>too, but I had some trouble with it producing weird results for me (I > >>tried to post something to this list about it but it vanished into the > >>ether, so I'll try again shortly). > > > > > > Cool! What do you do for training of your Robinson classifier? > > At the moment it's manually trained (same way as spambayes), but we're > looking into auto-training - feeding back current SA results into the > corpus. That's a double-edged sword of course, but it'll be interesting > research. With the manual training, do you distribute it pre-trained on a standard set of email? Or do you let the installer train it? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue Nov 5 15:41:29 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 05 Nov 2002 10:41:29 -0500 Subject: [Spambayes] deployment for mailman lists In-Reply-To: Your message of "Mon, 04 Nov 2002 18:59:23 PST." <3DC7340B.953FACD9@whidbey.com> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <3DC7340B.953FACD9@whidbey.com> Message-ID: <200211051541.gA5FfTs19274@odiug.zope.com> > Although I find the 2.1b interface to be rather clumsier than 2.0.13 > was, Please provide Barry with details! > I would certainly want the default to allow for moderation. About > half the lists I run are commercial, announcement lists for > employees. It's not that you risk missing an important message from > a potential employer, which should be barred, but from the current > employer, who is paying for the list. If they are closed for posting, you shouldn't need to turn on additional spam filtering anyway. Even if they aren't, I would find it strange that such "internal" lists would get much spam -- if they're internal, why do their addresses appear on the web?
--Guido van Rossum (home page: http://www.python.org/~guido/) From msergeant@startechgroup.co.uk Tue Nov 5 15:40:52 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 05 Nov 2002 15:40:52 +0000 Subject: [Spambayes] counterweight: it really works! References: <3DC6A44B.4070509@startechgroup.co.uk> <200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net> <3DC79B8A.9000409@startechgroup.co.uk> <200211051523.gA5FNec19110@odiug.zope.com> Message-ID: <3DC7E684.3090008@startechgroup.co.uk> Guido van Rossum said the following on 05/11/02 15:23: > With the manual training, do you distribute it pre-trained on a > standard set of email? Or do you let the installer train it? We're planning to ship it pre-trained. Otherwise we lose some plug-and-playness. Matt. From Paul.Moore@atosorigin.com Tue Nov 5 15:53:57 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Tue, 5 Nov 2002 15:53:57 -0000 Subject: [Spambayes] Outlook addin - initial impressions Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D86@UKDCX001.uk.int.atosorigin.com> (I hope this works, I'm posting from my work a/c, rather than the home one that's subscribed to the list...) I just grabbed the latest spambayes from CVS and set it up on my Outlook installation. Unfortunately, I haven't been saving spam, but what the heck, I thought, I'll train it on what I've got (320 good mails in my inbox, and a measly 34 spam received today). Amazingly, even with this tiny training set, the results are pretty good already. I've only received a few more spams so far today (the US-based spammers haven't woken up yet, I guess) but they've all been caught correctly. One thing is not right, though. In the dialog for managing the filter, the option to automatically run the filters is greyed out (as is the "Advanced" button, but I took that as being "no advanced options yet"). So I can't set automatic filtering on, and I have to manually filter my mail.
FWIW, I do have a number of rules wizard entries which run to filter out my mailing list traffic - I understand that the rules wizard runs first and so only mails left by that will get checked by Spambayes, but that's OK. I checked the trace output, and saw
-----------------------------------------------
Collecting Python Trace Output...
Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\default_bayes_database.pck'
Loaded message database from 'C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000default_message_database.pck'
Bayes database initialized with 34 spam and 320 good messages
Setting image to delete_as_spam.bmp
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
SpamAddin - OnAddInsUpdate None
SpamAddin - OnStartupComplete None
Traceback (most recent call last):
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 88, in OnInitDialog
    self.UpdateControlStatus()
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 131, in updateControlStatus
    self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
UnboundLocalError: local variable 'filter_status' referenced before assignment
win32ui: OnInitDialog() virtual handler (>) raised an exception
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Traceback (most recent call last):
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 157, in OnButDoSomething
    self.UpdateControlStatus()
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 131, in UpdateControlStatus
    self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
UnboundLocalError: local
variable 'filter_status' referenced before assignment
win32ui: Error in Command Message handler for command ID 1029, Code 0
C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\about.html
-----------------------------------------------
I'm guessing that the UnboundLocalError is not right... But even fixing that (by setting filter_status to "" at the top of the routine) didn't enable the "enable filtering" button. I can't see an obvious answer to this. I'll try to find out a bit more, but I thought I may as well let the list know, in case it's an easy problem for someone familiar with the code... Anyway, thanks for the tool - I'm very impressed, even given that there are still some rough edges to smooth out. Paul. From Paul.Moore@atosorigin.com Tue Nov 5 16:28:56 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Tue, 5 Nov 2002 16:28:56 -0000 Subject: [Spambayes] Re: Outlook addin - initial impressions Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D88@UKDCX001.uk.int.atosorigin.com> > (I hope this works, I'm posting from my work a/c, rather > than the home one that's subscribed to the list...) I see it did - thanks, Mr. Moderator, I've subscribed now :-) And sorry for the horrible formatting... Looking again at the notes, I wonder - is the problem of the "filter" button not being enabled what is referred to as "Filtering an Exchange Server public store appears to not work."? I was planning on filtering my inbox, which is indeed on Exchange... Assuming this is the issue, then can I offer myself as a guinea pig? Is there anything useful I can do (test scripts I can run, output I can report) to help diagnose the problem? [OK, so "provide a patch to fix the problem" is probably more help, but I've not got that far yet :-)] Paul.
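The UnboundLocalError in the trace quoted in this thread is the usual symptom of a local name that is assigned on only some code paths. A minimal sketch of that failure and of the workaround described above (binding the name at the top of the routine) follows; the function names and status strings are hypothetical and are not the actual ManagerDialog.py code.

```python
def broken_status(enabled):
    # filter_status is bound only when enabled is true...
    if enabled:
        filter_status = "Filtering is enabled"
    # ...so this line raises UnboundLocalError when enabled is false.
    return filter_status


def fixed_status(enabled):
    # Bind a default first so every path leaves the name defined.
    filter_status = ""
    if enabled:
        filter_status = "Filtering is enabled"
    return filter_status


try:
    broken_status(False)
    raised = False
except UnboundLocalError:
    raised = True
```

As the message above notes, this only silences the exception; it does not by itself explain why the filtering option stays disabled.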
From tim.one@comcast.net Tue Nov 5 17:04:54 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 05 Nov 2002 12:04:54 -0500 Subject: [Spambayes] More mildly clever spam In-Reply-To: <3DC797B2.10341.41DA65C8@localhost> Message-ID: [Tim] >> I say "lucky" instead of clever here because tim_one@msn.com was >> just one of 22 tim_xyz@msn.com addresses in the To and Cc lines. [Brad Clements] > Is the tokenizer still not parsing/counting recipients? It is counting To and Cc recipients now, and the spam in question did get two strikes against it for "fat" recipient lists. It probably wouldn't have helped to tokenize the recipients. >> What's amazing me now is how very few spam I get that try to >> play tricks at all! > Someone, somewhere will put this into an automated spam tool, just wait. It looked automated to me already. It's another case where, if we weren't throwing HTML tags away, the classifier would pick up on the HTML tricks (like size=1 and color=white) by itself. If it becomes a popular dodge, not blinding the classifier to those would help; and/or the tokenizer could try to figure which text "is invisible", and not let the classifier see that stuff. It's like this trick was deep . From tim.one@comcast.net Tue Nov 5 17:27:40 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 05 Nov 2002 12:27:40 -0500 Subject: [Spambayes] HTML user interface for spambayes In-Reply-To: Message-ID: [richie@entrian.com] > ... > One question: can we still untrain a message? Yes, and I hope forever more: all of the problematic "third training pass" combining schemes were removed from the codebase. Learning and unlearning can be mixed freely under the combining schemes remaining. From bkc@murkworks.com Tue Nov 5 17:39:22 2002 From: bkc@murkworks.com (Brad Clements) Date: Tue, 05 Nov 2002 12:39:22 -0500 Subject: [Spambayes] Capitol Steps spam song.. OT? Message-ID: <3DC7BB31.17489.42651270@localhost> Sorry if this is OT. 
Anyone else hear the Capitol steps halloween special this year.. They had a spam song.. wasn't too bad. http://www.capsteps.com/radio/ Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From vanhorn@whidbey.com Tue Nov 5 17:42:47 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Tue, 05 Nov 2002 09:42:47 -0800 Subject: [Spambayes] deployment for mailman lists References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <200211051541.gA5FfTs19274@odiug.zope.com> Message-ID: <3DC80317.B38952C4@whidbey.com> Guido, While the membership functions are more powerful, and probably a real boon to those with really large lists, spreading those functions over three pages is clumsier for dealing with small lists like mine. The old interface for clearing deferred posts had everything on one page, now you need to jump to a second page. I see what bigger lists gain, and am not particularly complaining. The lists do not get "a lot" of spam, but because they are in folks address books the address does get spread around. Also, some of the lists here are definitely public. All of the lists here are unmoderated but accept messages only from members. But spam and viruses do get submitted and require manual intervention. Unfortunately, many of the posts that require attention are "non member" submissions, most of which are valid. I would welcome a way to eliminate the spam and virus entries and reduce the number of trips to the admin interface. Van Guido van Rossum wrote: > > Although I find the 2.1b interface to be rather clumsier than 2.0.13 > > was, > > Please provide Barry with details! > > > I would certainly want the default to allow for moderation. About > > half the lists I run are commercial, announcement lists for > > employees. 
It's not that you risk missing an important message from > > a potential employer, which should be barred, but from the current > > employer, who is paying for the list. > > If they are closed for posting, you shouldn't need to turn on > additional spam filtering anyway. Even if they aren't, I would find > it strange that such "internal" lists would get much spam -- if > they're internal, why do their addresses appear on the web? > > --Guido van Rossum (home page: http://www.python.org/~guido/) -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From skip@pobox.com Tue Nov 5 18:11:53 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 5 Nov 2002 12:11:53 -0600 Subject: [Spambayes] I thought bogus message structure problem was solved... Message-ID: <15816.2537.764299.386817@montanaro.dyndns.org> ---------------------- multipart/mixed attachment I just saw a message with no hammie header. Looking at my procmail log file I saw this traceback info: Tue Nov 5 11:49:47 2002 Traceback (most recent call last): File "/Users/skip/local/bin/hammie.py", line 488, in ?
    main()
  File "/Users/skip/local/bin/hammie.py", line 472, in main
    filtered = h.filter(msg)
  File "/Users/skip/local/bin/hammie.py", line 269, in filter
    msg = email.message_from_string(msg)
  File "/Users/skip/local/lib/python2.3/email/__init__.py", line 52, in message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "/Users/skip/local/lib/python2.3/email/Parser.py", line 75, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/Users/skip/local/lib/python2.3/email/Parser.py", line 64, in parse
    self._parsebody(root, fp)
  File "/Users/skip/local/lib/python2.3/email/Parser.py", line 228, in _parsebody
    msgobj = self.parsestr(part)
  File "/Users/skip/local/lib/python2.3/email/Parser.py", line 75, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/Users/skip/local/lib/python2.3/email/Parser.py", line 62, in parse
    self._parseheaders(root, fp)
  File "/Users/skip/local/lib/python2.3/email/Parser.py", line 128, in _parseheaders
    raise Errors.HeaderParseError(
email.Errors.HeaderParseError: Not a header, not a continuation: ``It’s Easier to Shop Online!''
procmail: Program failure (1) of "/Users/skip/local/bin/hammie.py"
procmail: Rescue of unfiltered data succeeded

The message structure is clearly bogus (attached for completeness). I thought someone had fixed this problem, but it appears it was only in other contexts. Looking around for ParseError I see that in a couple instances MessageParseError (base for HeaderParseError) is trapped, as in this snippet from mboxutils.py:

    def _factory(fp):
        # Helper for getmbox
        try:
            return email.message_from_file(fp)
        except email.Errors.MessageParseError:
            return ''

However, it seems like we ought to be able to come up with a better fallback action than returning an empty string when classifying messages.
Is there a way to simply treat the entire body as plain text even though the Content-Type header says otherwise? Skip ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: OfficeManager Subject: F_R_E_E Shipping! Printer Ink Sale! Details Inside! Date: Tue, 5 Nov 2002 12:44:16 -0500 Size: 225 Url: http://mail.python.org/pipermail/spambayes/attachments/20021105/a129a481/attachment.txt ---------------------- multipart/mixed attachment-- From jeremy@alum.mit.edu Tue Nov 5 18:22:48 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Tue, 5 Nov 2002 13:22:48 -0500 Subject: [Spambayes] I thought bogus message structure problem was solved... In-Reply-To: <15816.2537.764299.386817@montanaro.dyndns.org> References: <15816.2537.764299.386817@montanaro.dyndns.org> Message-ID: <15816.3192.941960.767496@slothrop.zope.com> I've had similar problems with some digests for python bugs & patches lists. Barry says to try again with the latest version of the email package. He thinks it is fixed. Jeremy From fazal@majid.fm Tue Nov 5 18:29:51 2002 From: fazal@majid.fm (Fazal Majid) Date: Tue, 5 Nov 2002 10:29:51 -0800 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" Message-ID: > Guido van Rossum guido@python.org > Fri Nov 1 00:56:13 2002 > > If they don't know the word "spam", they don't need a spam filter yet. > > I agree that we need something better than "ham". Non-spam works for > me; "good mail" too. > > --Guido van Rossum (home page: http://www.python.org/~guido/) While revulsion for spam may be universal, Jews, Muslims, Hindus and Buddhists (combined, over 50% of the world's population) would not necessarily think of "ham" as something desirable. -- Fazal Majid Mail: fazal@majid.fm 1111 Jones St. 
Apt #1 Voice: +1 415 359 0918 San Francisco, CA 94109 PCS: +1 818 231 2144 USA http://www.majid.info/ From tim@fourstonesExpressions.com Tue Nov 5 18:35:06 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 05 Nov 2002 12:35:06 -0600 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" In-Reply-To: Message-ID: <71SMZLILFSOGAJDRO5193JEUPUS4W4.3dc80f5a@riven> Good point, Fazal. - TimS 11/5/2002 12:29:51 PM, "Fazal Majid" wrote: >> Guido van Rossum guido@python.org >> Fri Nov 1 00:56:13 2002 >> >> If they don't know the word "spam", they don't need a spam filter yet. >> >> I agree that we need something better than "ham". Non-spam works for >> me; "good mail" too. >> >> --Guido van Rossum (home page: http://www.python.org/~guido/) > >While revulsion for spam may be universal, Jews, Muslims, Hindus and >Buddhists (combined, over 50% of the world's population) would not >necessarily think of "ham" as something desirable. > >-- >Fazal Majid Mail: fazal@majid.fm >1111 Jones St. 
Apt #1 Voice: +1 415 359 0918 >San Francisco, CA 94109 PCS: +1 818 231 2144 >USA http://www.majid.info/ > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From skip@pobox.com Tue Nov 5 18:56:33 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 5 Nov 2002 12:56:33 -0600 Subject: [Spambayes] deployment for mailman lists In-Reply-To: <200211051541.gA5FfTs19274@odiug.zope.com> References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net> <15814.53303.926055.735822@montanaro.dyndns.org> <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net> <3DC7340B.953FACD9@whidbey.com> <200211051541.gA5FfTs19274@odiug.zope.com> Message-ID: <15816.5217.915721.290970@montanaro.dyndns.org> Guido> Even if they aren't, I would find it strange that such "internal" Guido> lists would get much spam -- if they're internal, why do their Guido> addresses appear on the web? Everything leaks eventually, for lots of reasons. Skip From tim.one@comcast.net Tue Nov 5 19:32:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 05 Nov 2002 14:32:04 -0500 Subject: [Spambayes] I thought bogus message structure problem was solved... Message-ID: <64b6162160.6216064b61@icomcast.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment [Skip Montanaro] > I just saw a message with no hammie header. Looking at my > procmail log file I saw this traceback info: > ... > File "/Users/skip/local/bin/hammie.py", line 269, in filter > msg = email.message_from_string(msg) How about refactoring the code? tokenizer.Tokenizer.get_message() wrestles with all known problems getting an email object out of a string, and all code should use it. > ... > However, it seems like we ought to be able to come up with a > better fallback action than returning an empty string when > classifying messages. 
> > Is there a way to simply treat the entire body as plain text even > though the Content-Type header says otherwise? See the method referenced above. ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: OfficeManager Subject: F_R_E_E Shipping! Printer Ink Sale! Details Inside! Date: Tue, 05 Nov 2002 12:44:16 -0500 Size: 258 Url: http://mail.python.org/pipermail/spambayes/attachments/20021105/eeadc60e/attachment.txt ---------------------- multipart/mixed attachment _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes ---------------------- multipart/mixed attachment-- From tim.one@comcast.net Tue Nov 5 19:39:05 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 05 Nov 2002 14:39:05 -0500 Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail" Message-ID: <5df3b5f93f.5f93f5df3b@icomcast.net> [Fazal Majid] > While revulsion for spam may be universal, Jews, Muslims, Hindus and > Buddhists (combined, over 50% of the world's population) would not > necessarily think of "ham" as something desirable. They would if they ever ate spam . If the Hormel Corporation and over half the world's population can't tolerate good-natured wordplay, then (a) I believe it, and (b) that's life. From db3l@fitlinxx.com Tue Nov 5 19:42:27 2002 From: db3l@fitlinxx.com (David Bolen) Date: 05 Nov 2002 14:42:27 -0500 Subject: [Spambayes] Re: Outlook addin - initial impressions References: <16E1010E4581B049ABC51D4975CEDB885E2D88@UKDCX001.uk.int.atosorigin.com> Message-ID: "Moore, Paul" writes: > Looking again at the notes, I wonder - is the problem of the > "filter" button not being enabled what is referred to as "Filtering > an Exchange Server public store appears to not work."? I was planning > on filtering my inbox, which is indeed on Exchange... 
I think it's more a buglet when first setting up filters and not specifying an action to perform. If you leave the filter set to Untouched (the default) and don't select folders this will happen just due to how the code tries to build the string for displaying on the main manager dialog. A workaround is just to pick an action (Copy/Move) and target folders on the filter - e.g., use the suggested setup of having Spam and Possible Spam folders to move the mail into. It's not having folders selected that is bypassing the logic to enable the checkbox. The other fix is to patch the code to set filter_status to something useful to display, and also set ok_to_enable to True, so the checkbox for filtering gets enabled. In this case you'll get some later complaints about the Untouched setting when it applies the filter but all will work properly without moving/copying the message. There were definitely problems doing ID lookups on an Exchange server public store, but as of the changes in CVS from 11/2 or so it seems to be working properly at least for me - I've been filtering my Exchange server inbox since yesterday as well as using Exchange server folders for both Spam and Possible Spam folders, with no burps yet. -- David From mhammond@skippinet.com.au Tue Nov 5 21:58:00 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 6 Nov 2002 08:58:00 +1100 Subject: [Spambayes] Outlook addin - initial impressions In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2D86@UKDCX001.uk.int.atosorigin.com> Message-ID: > One thing is not right, though. In the dialog for managing the > filter, the option to automatically run the filters is greyed out > (as is the "Advanced" button, but I took that as being "no > advanced options yet"). So I can't set automatic filtering on, > and I have to manually filter my mail. 
FWIW, I do have a number > of rules wizard entries which run to filter out my mailing list > traffic - I understand that the rules wizard runs first and so > only mails left by that will get checked by Spambayes, but that's > OK. I checked the trace output, and saw > rDialog.py", line 131, in updateControlStatus > self.SetDlgItemText(IDC_FILTER_STATUS, filter_status) > UnboundLocalError: local variable 'filter_status' referenced > before assignment > But even fixing that (by setting filter_status to "" at the top > of the routine) didn't enable the "enable filtering" button. I The problem would be that the code detected filtering can not be enabled, but neglected to set the status to indicate why. A quick check of the code shows there are 2 ways this can happen - if no "Spam" or "Uncertain" folders are setup in the training dialog. But if this is not the case, we should work out what branch is being taken that wants filtering to be disabled. I checked in a patch that catches what I found - please see if it fixes your problem. > Anyway, thanks for the tool - I'm very impressed, even given that > there are still some rough edges to smooth out. Thanks! Hopefully some of the people using this pre-alpha tool will be willing to pick up the sand-paper Mark. From lists@morpheus.demon.co.uk Tue Nov 5 21:50:03 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Tue, 05 Nov 2002 21:50:03 +0000 Subject: [Spambayes] Outlook addin - initial impressions References: <16E1010E4581B049ABC51D4975CEDB885E2D86@UKDCX001.uk.int.atosorigin.com> Message-ID: "Moore, Paul" writes: > (I hope this works, I'm posting from my work a/c, rather than the home > one that's subscribed to the list...) That one got through (thanks, Mr. Moderator!) although I must apologise for the dreadful formatting... But the followup which I posted after subscribing doesn't seem to have :-( But I've got more information since then... > One thing is not right, though. 
In the dialog for managing the filter, > the option to automatically run the filters is greyed out Specifically, in the add-in manager dialog, the "Enable filtering" checkbox is greyed out, so I can't enable filtering. The DB has 82 good and 27 spam, so it's "big enough" to allow filtering to be enabled (according to the code) [... pause while Paul fiddles ...] Ah! I hadn't wanted to move "Possible Spam", so I'd left that option as "Leave untouched". The code doesn't enable the "enable" checkbox in that case. But if I change the "Possible Spam" action to move possible spam to "Inbox", and then reset it to "Untouched", the enable checkbox is enabled! So it's not as bad as I thought, but I guess it is a minor bug. Paul. -- This signature intentionally left blank From richie@entrian.com Tue Nov 5 22:25:41 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 05 Nov 2002 22:25:41 +0000 Subject: [Spambayes] HTML user interface for spambayes In-Reply-To: References: Message-ID: [Richie] > One question: can we still untrain a message? [Tim and Rob] > Yes Good. A POPFile-style HTML retraining interface is feasible then - that goes from my 'maybe' list to my 'ooo yes please' list. -- Richie Hindle richie@entrian.com From richie@entrian.com Tue Nov 5 22:22:31 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 05 Nov 2002 22:22:31 +0000 Subject: [Spambayes] HTML user interface for spambayes In-Reply-To: References: Message-ID: <1vagsug6n353cqupqd9ie47s2nl62lq2gj@4ax.com> Hi, The HTML-based user interface is now committed. Quoting myself: > it gives you: > > o A 'Word query' form where you can get information from the database > about a specific word > o Training by uploading message files to it (one at once at the moment, > but I'll add support for mbox files) > o Training by pasting an email into a form > o The status of the pop3proxy - how many emails classified and so on, > plus a shutdown button. 
"pop3proxy -b" (plus the existing options) will launch a web browser containing the user interface (after loading the database, which is a bit annoying on a slow machine). Training saves the pickle (if you're using one) every time, which is also a pain on slow machines - I'll sort that out RSN. I'm planning to add an HTML (re)training interface a la POPFile as well - it keeps a cache of last (configurably) few messages and lets you (re)classify them from your browser. When time permits... My plan of building the documentation and the UI together has not yet come to fruition - I was too interested in making the UI work than on designing it. Apart from the viking helmet - that was essential. 8-) -- Richie Hindle richie@entrian.com From richie@entrian.com Wed Nov 6 00:05:07 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 06 Nov 2002 00:05:07 +0000 Subject: [Spambayes] Spambayes Header Format In-Reply-To: <3DC52BE4.6010602@hooft.net> References: <3DC4D0F1.5000509@hooft.net> <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> <3DC51403.6030208@hooft.net> <3DC52BE4.6010602@hooft.net> Message-ID: Rob moved the names and values of the X-Hammie-Disposition header into the Options - pop3proxy now respects this as well as hammie. The defaults are still the old "X-Hammie-Disposition" system. pop3proxy doesn't put in any of the extra header pieces; just plain old Yes/No/Unsure. Are there any other pieces that have any of this hard-coded? If not, should we switch to X-Spambayes-Classification: Spam/Ham/Unsure? A few people have come up with objections or alternatives, but more have backed this idea than any other (including leaving things as they are - or is there a silent majority in favour of that option?). 
-- Richie Hindle richie@entrian.com From skip@pobox.com Wed Nov 6 01:31:53 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 5 Nov 2002 19:31:53 -0600 Subject: [Spambayes] Spambayes Header Format In-Reply-To: References: <3DC4D0F1.5000509@hooft.net> <3DC51403.6030208@hooft.net> <3DC52BE4.6010602@hooft.net> Message-ID: <15816.28937.523690.35204@montanaro.dyndns.org> Richie> If not, should we switch to X-Spambayes-Classification: Richie> Spam/Ham/Unsure? I still like "X-Ham-Status: {yes,no,unsure}" but never saw any responses pro or con to the idea, which I mentioned a couple times. Did those messages ever turn up on the list? I think it reads well, uses the catchy "ham" marketing term I like so well, provides a nice counter to SpamAssassin's X-Spam-Status and doesn't exhibit any spelling differences across English dialects. The only potential problem is the religious dietary issue, which I believe Tim addressed in an earlier message. Skip From skip@pobox.com Wed Nov 6 01:46:44 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 5 Nov 2002 19:46:44 -0600 Subject: [Spambayes] I thought bogus message structure problem was solved... In-Reply-To: <64b6162160.6216064b61@icomcast.net> References: <64b6162160.6216064b61@icomcast.net> Message-ID: <15816.29828.224103.989829@montanaro.dyndns.org> Tim> How about refactoring the code? tokenizer.Tokenizer.get_message() Tim> wrestles with all known problems getting an email object out of a Tim> string, and all code should use it. Okay, that's in process. Thx for the pointer. Question: mboxcount used to keep track of good and bad messages. Using the approach found in the above get_message() means no messages appear to be unparseable anymore without a little extra dance after the call. Does that matter to anyone? Skip From skip@pobox.com Wed Nov 6 01:50:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 5 Nov 2002 19:50:20 -0600 Subject: [Spambayes] I thought bogus message structure problem was solved... 
In-Reply-To: <15816.29828.224103.989829@montanaro.dyndns.org> References: <64b6162160.6216064b61@icomcast.net> <15816.29828.224103.989829@montanaro.dyndns.org> Message-ID: <15816.30044.478724.248988@montanaro.dyndns.org> Skip> Question: mboxcount used to keep track of good and bad messages. Skip> Using the approach found in the above get_message() means no Skip> messages appear to be unparseable anymore without a little extra Skip> dance after the call. Does that matter to anyone? Never mind. I figure if a message object is returned with both "to" and "cc" fields == None, then it was unparseable. Skip From skip@pobox.com Wed Nov 6 02:18:54 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 5 Nov 2002 20:18:54 -0600 Subject: [Spambayes] mboxutils.get_message() Message-ID: <15816.31758.56856.328120@montanaro.dyndns.org> For those of you who don't read spambayes-checkins, I got rid of all the places which try to generate email.Message.Message objects and moved tokenizer.Tokenizer.get_message() to mboxutils.get_message(). The latter is now the best factory function to use. It accepts Message objects, strings and file-like objects as inputs. If it encounters a message parsing error, it just creates a new mail message and stuffs the current input's message body in it as plain text then returns that. I replaced _factory() functions in several files with get_message() as well. Skip From Paul.Moore@atosorigin.com Wed Nov 6 09:12:33 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 6 Nov 2002 09:12:33 -0000 Subject: [Spambayes] Re: Outlook addin - initial impressions Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D89@UKDCX001.uk.int.atosorigin.com> My messages seem to be passing each other in the ether... From: Moore, Paul > Looking again at the notes, I wonder - is the problem of the > "filter" button not being enabled what is referred to as "Filtering > an Exchange Server public store appears to not work."? 
I was planning > on filtering my inbox, which is indeed on Exchange... As I said in another post, this turned out to be a minor UI issue. Just tested again at work, and filtering an Exchange inbox does indeed work fine. > Assuming this is the issue, then can I offer myself as a guinea pig? This offer still stands, though. I'm happy to help if I can. Paul. _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes From Paul.Moore@atosorigin.com Wed Nov 6 10:08:58 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 6 Nov 2002 10:08:58 -0000 Subject: [Spambayes] Outlook plugin - training Message-ID: <16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com> When the Outlook plugin filters mails, it classifies them as either spam or potential spam, and can put them in appropriate folders. In the spam/potential spam folders, there is a "Recover from Spam" button available, and in other folders there is a "Delete as spam" button. These buttons add the message to the training database as well as taking the appropriate action. One thing I don't see, however, is a means of confirming the classifier's decisions as correct. A "yes, that is spam" button for the spam folder, and a "yes, that's ham" button in non-spam folders. As I'm starting from a very small message base, I worry that correct classifications are still somewhat based on "luck", and training based on correct decisions would help to increase both my and the classifier's confidence level. I can do this by regular retraining, but that has 2 disadvantages: it's much clumsier than simply clicking on a "clever boy!" button, and it relies on me not deleting messages until I do a training run. Much of the ham I get is "read and forget", so I'd rather delete immediately. When I get a chance to dive into the code, I'll see how hard this would be to implement. Paul. 
From piersh@friskit.com Wed Nov 6 11:35:25 2002 From: piersh@friskit.com (Piers Haken) Date: Wed, 6 Nov 2002 03:35:25 -0800 Subject: [Spambayes] Outlook plugin - training Message-ID: <9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com> I don't believe you need this. I think that the classifier automatically trains on messages as they arrive (or at least on messages that it's sure about). You only need to retrain if it has made a mistake, or if it's unsure. Piers. > -----Original Message----- > From: Moore, Paul [mailto:Paul.Moore@atosorigin.com] > Sent: Wednesday, November 06, 2002 2:09 AM > To: Spambayes (E-mail) > Subject: [Spambayes] Outlook plugin - training > > > When the Outlook plugin filters mails, it classifies them as > either spam or potential spam, and can put them in > appropriate folders. > > In the spam/potential spam folders, there is a "Recover from > Spam" button available, and in other folders there is a > "Delete as spam" button. These buttons add the message to the > training database as well as taking the appropriate action. > > One thing I don't see, however, is a means of confirming the > classifier's decisions as correct. A "yes, that is spam" > button for the spam folder, and a "yes, that's ham" button in > non-spam folders. > > As I'm starting from a very small message base, I worry that > correct classifications are still somewhat based on "luck", > and training based on correct decisions would help to > increase both my and the classifier's confidence level. > > I can do this by regular retraining, but that has 2 > disadvantages: it's much clumsier than simply clicking on a > "clever boy!" button, and it relies on me not deleting > messages until I do a training run. Much of the ham I get is > "read and forget", so I'd rather delete immediately. > > When I get a chance to dive into the code, I'll see how hard > this would be to implement. > > Paul. 
> > _______________________________________________ > Spambayes mailing list > Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes > From Paul.Moore@atosorigin.com Wed Nov 6 12:00:56 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 6 Nov 2002 12:00:56 -0000 Subject: [Spambayes] Outlook plugin - training Message-ID: <16E1010E4581B049ABC51D4975CEDB88619928@UKDCX001.uk.int.atosorigin.com> From: Piers Haken [mailto:piersh@friskit.com] > I don't believe you need this. I think that the classifier > automatically trains on messages as they arrive (or at least on > messages that it's sure about). You only need to retrain if it > has made a mistake, or if it's unsure. If it does, it doesn't update the "Database has XXX good and XXX spam" information in the manager dialog (at least not in all cases) - I just got a correctly classified spam, and the reported number of spams in the database hadn't changed. But I'd be happy if it does train on what it classifies - then all I do is correct any mistakes. What does it do about "Potential Spam"? Train as if it were spam, and then correct its assumption when I move it back to the inbox? That probably makes most sense, given that the "Potential Spam" folder gets a "Recover from Spam" button, rather than a "Delete as spam" one. Paul. PS In case it's not obvious, I'll summarise what I've learnt once I get to grips with all of this. Hopefully, a summary would be useful for the documentation... From Paul.Moore@atosorigin.com Wed Nov 6 13:23:45 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 6 Nov 2002 13:23:45 -0000 Subject: [Spambayes] Outlook plugin - training Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D92@UKDCX001.uk.int.atosorigin.com> From: Moore, Paul > What does it do about "Potential Spam"? Train as if it were spam, > and then correct its assumption when I move it back to the inbox? 
> That probably makes most sense, given that the "Potential Spam" > folder gets a "Recover from Spam" button, rather than a "Delete > as spam" one. Actually, I'm not sure I like "Potential Spam" being treated as spam until confirmed as OK. I have Rules Wizard rules which sort E-Mail traffic out into folders. I'm entirely happy with the behaviour I understand to be the case - rules processed before the plugin - as I don't get spam on list addresses, so I'm OK with list traffic being totally excluded from the spam process. But I've had a couple of list messages end up in "Potential Spam". Either the rules wizard is missing them (possible, I never had much confidence in it :-() or the plugin is interfering somehow. I think I may switch off the "potential spam" bit, and just filter out known spam, and classify my Inbox by hand. I'll leave it a bit longer before deciding, though. Paul From richie@entrian.com Wed Nov 6 16:56:29 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 06 Nov 2002 16:56:29 +0000 Subject: [Spambayes] Spambayes Header Format In-Reply-To: <15816.28937.523690.35204@montanaro.dyndns.org> References: <3DC4D0F1.5000509@hooft.net> <3DC51403.6030208@hooft.net> <3DC52BE4.6010602@hooft.net> <15816.28937.523690.35204@montanaro.dyndns.org> Message-ID: Hi Skip, > I still like "X-Ham-Status: {yes,no,unsure}" but never saw any responses pro > or con to the idea, which I mentioned a couple times. Did those messages > ever turn up on the list? Yes, I saw them, but didn't have a strong opinion on it so I kept quiet. I prefer "X-Spambayes-Classification: Spam" to "X-Ham-Status: No" for a couple of reasons: the former contains the word 'Spam' in at least one place, or two for spams. It contains the word 'Spambayes', which immediately tells you the name of the software that added the header (or at least gives you something to tell Google you Feel Lucky about). 
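
To make the header format under discussion concrete, here is a minimal Python sketch of tagging a message with the proposed X-Spambayes-Classification header. It is an illustration only, not the hammie or pop3proxy code: the function names and the 0.2 / 0.9 cutoffs are assumptions made up for this example, and the header name is a parameter only because the thread says the names and values now live in the Options.

```python
import email

# Hypothetical defaults: the thread proposes the header name and the three
# values, but the cutoffs below are assumed purely for illustration.
HEADER_NAME = "X-Spambayes-Classification"
HAM_CUTOFF = 0.2
SPAM_CUTOFF = 0.9


def classify_label(prob, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'Ham', 'Unsure' or 'Spam'."""
    if prob < ham_cutoff:
        return "Ham"
    if prob > spam_cutoff:
        return "Spam"
    return "Unsure"


def tag_message(raw_text, prob, header_name=HEADER_NAME):
    """Return the message text with exactly one classification header."""
    msg = email.message_from_string(raw_text)
    # Deleting a header that is absent is a no-op, so this safely drops
    # any stale copy (for example one forged by the sender).
    del msg[header_name]
    msg[header_name] = classify_label(prob)
    return msg.as_string()


if __name__ == "__main__":
    sample = "From: someone@example.com\nSubject: hello\n\nJust a test.\n"
    print(tag_message(sample, 0.97))
```

Swapping in "X-Ham-Status" with yes/no/unsure values would only change the header name and the label strings, which is why the choice between the two is mostly about readability.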
I like the 'Ham' word as well (notwithstanding the non-pig-eaters, who I doubt will be offended), but not enough to want it in the name of the header itself - using it as a header value is enough to make it a visible USP. All MHO. -- Richie Hindle richie@entrian.com From richie@entrian.com Wed Nov 6 18:05:14 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 06 Nov 2002 18:05:14 +0000 Subject: [Spambayes] Spambayes Header Format In-Reply-To: References: <3DC4D0F1.5000509@hooft.net> <3DC51403.6030208@hooft.net> <3DC52BE4.6010602@hooft.net> <15816.28937.523690.35204@montanaro.dyndns.org> Message-ID: I said: > I prefer "X-Spambayes-Classification: Spam" to "X-Ham-Status: No" for a > couple of reasons: the former contains the word 'Spam' in at least one > place, or two for spams. Sorry, I should have said _why_ I think that's an advantage - it immediately tells you that the header has something to do with spam, rather than being just another random email header. If you haven't heard of our cool 'Ham' term, the header is meaningless. (OK, why are you looking at the header if you've never heard of Spambayes? Maybe your giant multinational corporation is making us all rich by running Enterprise Spambayes Double Pro Plus Premium Edition while we laze about on our private island. Or is that not how Open Source works? 8-) -- Richie Hindle richie@entrian.com From tim@fourstonesExpressions.com Wed Nov 6 18:16:32 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 06 Nov 2002 12:16:32 -0600 Subject: [Spambayes] Spambayes Header Format In-Reply-To: Message-ID: 11/6/2002 12:05:14 PM, Richie Hindle wrote: > >I said: >> I prefer "X-Spambayes-Classification: Spam" to "X-Ham-Status: No" for a >> couple of reasons: the former contains the word 'Spam' in at least one >> place, or two for spams. 
> >Sorry, I should have said _why_ I think that's an advantage - it >immediately tells you that the header has something to do with spam, rather >than being just another random email header. If you haven't heard of our >cool 'Ham' term, the header is meaningless. The thing with including Spambayes in the header is that if someone hasn't heard of it, and does a google on the term, they'll get something meaningful. On the other hand, if they do a google on ham... I vote for "X-Spambayes-Classification: Spam" - TimS > >(OK, why are you looking at the header if you've never heard of Spambayes? >Maybe your giant multinational corporation is making us all rich by running >Enterprise Spambayes Double Pro Plus Premium Edition while we laze about on >our private island. Or is that not how Open Source works? 8-) > >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From papaDoc@videotron.ca Wed Nov 6 18:55:55 2002 From: papaDoc@videotron.ca (papaDoc) Date: Wed, 06 Nov 2002 13:55:55 -0500 Subject: [Spambayes] Spambayes Header Format References: Message-ID: <3DC965BB.80207@videotron.ca> Hi, I also prefer: X-Spambayes-Classification: "Ham|Unsure|Spam" to X-Ham-Status: "Yes|No|Unsure". It is less confusing: since X-Ham-Status is not a question, the answer should not be "yes|no" but a status like "processed_and_found_to_be_ham" or "unsure_need_more_training". papaDoc From tim.one@comcast.net Wed Nov 6 19:27:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 06 Nov 2002 14:27:41 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2D92@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul, on the Outlook2K client] > Actually, I'm not sure I like "Potential Spam" being treated as spam > until confirmed as OK. It doesn't. 
"Potential Spam" really means "unsure" -- it would be as accurate to call it "Potential Ham", but neither is as accurate as Unsure. The system knows it doesn't know what to call msgs in this category, and the client doesn't automatically train on Unsure msgs (unless you *manually* drag one into your Spam folder, or into one of your Ham folders). > I have Rules Wizard rules which sort E-Mail traffic out into folders. > I'm entirely happy with the behavious I understand to be the case - rules > processed before the plugin - as I don't get spam on list addresses, so > I'm OK with list traffic being totally excluded from the spam process. The Define Filters dialog has a multi-selection folder control, so you can tell the client to watch any number of folders (you're not limited to the Inbox alone; add the destination folders of your other Outlook rules if you want email coming into those watched too). The interaction with Outlook's Rules Wizard (RW) remains unclear. The RW's internal workings appear undocumented, and there appears no way to hook into it. I've definitely seen the addin's filtering rules trigger *while* the RW was still running, and in some cases that can lead to the addin's filtering looking at a msg more than once. For example, the addin's filter may trigger when a msg first arrives in the Inbox, and then a second time on the same msg when the RW moves it into a different folder that the addin's filter is also watching. In this case the client suffers an internal exception, as the entry ID Outlook told it to use for the first trigger gets invalidated by the move. It works OK in the end, but "something isn't quite right" about it. > But I've had a couple of list messages end up in "Potential Spam". > Either the rules wizard is missing them (possible, I never had much > confidence in it :-() or the plugin is interfering somehow. Sorry, can't say without a concrete example to stare at. 
I haven't seen the addin make any mistakes here, although it's common to get baffled about exactly why the RW does what it does. > I think I may switch off the "potential spam" bit, and just filter > out known spam, and classify my Inbox by hand. I'll leave it a bit > longer before deciding, though. You'll be happier if you keep an Unsure folder. For me, about 1% of my email ends up there, about half-and-half ham vs spam, and my Inbox is virtually spam-free (while my Spam folder is pure spam now -- about 100 per day). Another: Note that this is pre-alpha software, and you should definitely keep persistent Ham and Spam folders for training, as updating the code may invalidate your database(s), or introduce tokenization and/or scoring and/or configuration changes that render your database(s) worse than useless. IOW, you should stay prepared to retrain from scratch. I set up a distinct .pst file to hold Ham and Spam examples for this purpose, to keep from cluttering my primary msg store. The folder controls in the addin (unlike several in Outlook itself!) allow selecting multiple folders from multiple msg stores too, and my Spam folder is actually in this other .pst file. From tim.one@comcast.net Wed Nov 6 19:36:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 06 Nov 2002 14:36:04 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > ... > I can do this by regular retraining, but that has 2 disadvantages: > it's much clumsier than simply clicking on a "clever boy!" button, and > it relies on me not deleting messages until I do a training run. Much > of the ham I get is "read and forget", so I'd rather delete > immediately. > > When I get a chance to dive into the code, I'll see how hard this > would be to implement. Automatic training needs lots of work. 
The Outlook client has gotten smarter than anything else about this so far, but at the moment it's basically automating "mistake based" training, which I think will prove to be a Bad Idea over time. Ideal is to train regularly on a random sample of all msgs, whether or not correctly classified (I fake this by hand for now). That presents some UI and algorithmic challenges. It will also create a database size problem: without a strategy for pruning useless words, the database will grow without bounds (an intuition that at a certain non-fantastic size, "all words" will have been seen is incorrect for computer-based indexing apps, and especially for email -- unique words keep appearing and keep bloating the beast). There's been no research done here yet on how to prune a database over time without damaging accuracy. From just@letterror.com Wed Nov 6 19:55:28 2002 From: just@letterror.com (Just van Rossum) Date: Wed, 6 Nov 2002 20:55:28 +0100 Subject: [Spambayes] Upgrade problem Message-ID: Hi there, First off: I started playing with spambayes last sunday, and it's been a blast so far. I'm using pop3proxy.py, love the brand new web interface. However, I did a cvs up today, and unpickling the database stopped working, as classifier.Bayes became a classic class. After some twiddling I managed to repair it, but now I get AssertionErrors during training: [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix Training ham (mymail/good.mbox.fix): 4 Traceback (most recent call last): File "./hammie.py", line 483, in ? main() File "./hammie.py", line 460, in main h.update_probabilities() File "./hammie.py", line 336, in update_probabilities self.bayes.update_probabilities() File "classifier.py", line 327, in update_probabilities assert hamcount <= nham AssertionError Is my db screwed or is it repairable? 
Just From tim@fourstonesExpressions.com Wed Nov 6 20:02:53 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 06 Nov 2002 14:02:53 -0600 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: Lemme answer before Tim gets to ya... This is why you keep a corpus. This is pre-alpha code, and anything that anyone does at any time can screw the world up. You should simply delete your database and retrain it. If you don't have a corpus, go ahead and make one now... - TimS 11/6/2002 1:55:28 PM, Just van Rossum wrote: >Hi there, > >First off: I started playing with spambayes last sunday, and it's been a blast >so far. I'm using pop3proxy.py, love the brand new web interface. > >However, I did a cvs up today, and unpickling the database stopped working, as >classifier.Bayes became a classic class. After some twiddling I managed to >repair it, but now I get AssertionErrors during training: > > [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix > Training ham (mymail/good.mbox.fix): > 4 > Traceback (most recent call last): > File "./hammie.py", line 483, in ? > main() > File "./hammie.py", line 460, in main > h.update_probabilities() > File "./hammie.py", line 336, in update_probabilities > self.bayes.update_probabilities() > File "classifier.py", line 327, in update_probabilities > assert hamcount <= nham > AssertionError > >Is my db screwed or is it repairable? > >Just > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From just@letterror.com Wed Nov 6 20:12:44 2002 From: just@letterror.com (Just van Rossum) Date: Wed, 6 Nov 2002 21:12:44 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: Tim Stone - Four Stones Expressions wrote: > Lemme answer before Tim gets to ya... > > This is why you keep a corpus. 
This is pre-alpha code, and anything that > anyone does at any time can screw the world up. You should simply delete your > database and retrain it. If you don't have a corpus, go ahead and make one > now... Okelydokely! Hey, it already works so well, why not call it "beta"? Just From python-spambayes@discworld.dyndns.org Wed Nov 6 20:16:09 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Wed, 6 Nov 2002 14:16:09 -0600 Subject: [Spambayes] Outlook plugin - training In-Reply-To: ; from tim.one@comcast.net on Wed, Nov 06, 2002 at 02:36:04PM -0500 References: <16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com> Message-ID: <20021106141609.B31428@discworld.dyndns.org> Tim Peters wrote: > > It will also create a database size problem: without a strategy for pruning > useless words, the database will grow without bounds (an intuition that at a > certain non-fantastic size, "all words" will have been seen is incorrect for > computer-based indexing apps, and especially for email -- unique words keep > appearing and keep bloating the beast). Did you actually find this? I found the growth tailed off dramatically after not too long. I no longer have the exact numbers, but database growth for me tailed off almost to nothing after I had trained on something like 1500 messages. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From just@letterror.com Wed Nov 6 21:42:27 2002 From: just@letterror.com (Just van Rossum) Date: Wed, 6 Nov 2002 22:42:27 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: Tim Stone - Four Stones Expressions wrote: > This is why you keep a corpus. This is pre-alpha code, and anything that > anyone does at any time can screw the world up. You should simply delete your > database and retrain it. 
If you don't have a corpus, go ahead and make one > now... Alright, this triggered a feature request in me, which resulted in some hacking activity . The patch below appends training messages to one of two mbox files ('_pop3proxyspam.mbox' or '_pop3proxyham.mbox' respectively), making it easier to later rebuild the database from scratch, while still being able to train ad hoc with the web interface of pop3proxy.py. Good idea? Just Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.10 diff -c -r1.10 pop3proxy.py *** pop3proxy.py 5 Nov 2002 22:18:56 -0000 1.10 --- pop3proxy.py 6 Nov 2002 21:37:03 -0000 *************** *** 608,615 **** raise SystemExit def onUpload(self, params): ! message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') self.bayes.learn(tokenizer.tokenize(message), isSpam, True) self.push("""

Trained on your message. Saving database...

""") self.push(" ") # Flush... must find out how to do this properly... --- 608,626 ---- raise SystemExit def onUpload(self, params): ! message = params.get('file') or params.get('text') isSpam = (params['which'] == 'spam') + # Append the message to a file, to make it easier to rebuild + # the database later. + message = message.replace('\r\n', '\n').replace('\r', '\n') + if isSpam: + f = open("_pop3proxyspam.mbox", "a") + else: + f = open("_pop3proxyham.mbox", "a") + f.write("From ???@???\n") # fake From line (XXX good enough?) + f.write(message) + f.write("\n") + f.close() self.bayes.learn(tokenizer.tokenize(message), isSpam, True) self.push("""

Trained on your message. Saving database...

""") self.push(" ") # Flush... must find out how to do this properly... From mhammond@skippinet.com.au Wed Nov 6 22:09:04 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 7 Nov 2002 09:09:04 +1100 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com> Message-ID: [Piers responding to Paul] > I don't believe you need this. I think that the classifier automatically > trains on messages as they arrive (or at least on messages that it's > sure about). You only need to retrain if it has made a mistake, or if > it's unsure. As Tim says, we really only do "mistake" training - nothing is trained as it comes in, only scored. Manually moving messages (via the button or d&d) is the only thing that triggers an incremental re-train. The key limitation of this scheme, as Tim also alludes to, is that this never correctly classifies ham. However, I actually see this incremental training more as a "get smarter now" than a "just get smarter" technique - ie, a user sees a mis-classified Spam, by re-training they are increasing the chances that the next similar mail will be handled correctly. Instant feedback, especially while the user is getting started. ie, it is indeed "mistake based training", but that may still prove useful in addition to ongoing training. I can't help thinking that we are somehow underestimating our own tool here. As is common when people first use this tool, spam is generally found in the ham set and vice-versa. Because of this, I know that my Inbox is spam free (but less sure about my other "ham" folders). I'm also sure that my Spam folder has no ham. This should remain true while I continue to use the tool. So surely we can exploit this somehow. Off the top of my head: * Assume we don't trust the last 2 days of mail (as the user may not yet have sorted them). Anything in the "good" and "spam" folders older than this can be assumed correctly classified, and able to be trained on. 
* A process could go through all ham and spam trained on, and score each message. Any "suspect" messages are presented in a list (much like the Outlook "Find Message" result list). The user can indicate that the message is correct (and the system will remember, never asking about this message again) or is indeed incorrectly classified. If incorrect, it will be moved, and incrementally trained as per now. (I can also picture a whitelist kicking in here; if incorrect, offer to add user to whitelist. If user in the whitelist, assume ham thereby meaning mail from this person can never again be spam) I can picture this working in the background, and simply indicating to the user that there are "conflicts" to be resolved at their leisure. Further, I imagine that as we build better training data for each message store, the number of "conflicts" actually found would generally be zero - ie, the system would find that all 2 day and older mail correctly classifies. While the above is more a brain-fart than a reasoned design, I agree that staying out of your face is important for widespread use. Mark. From anthony@interlink.com.au Wed Nov 6 22:19:40 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 07 Nov 2002 09:19:40 +1100 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message-ID: <200211062219.gA6MJe502959@localhost.localdomain> >>> Tim Peters wrote > Automatic training needs lots of work. The Outlook client has gotten > smarter than anything else about this so far, but at the moment it's > basically automating "mistake based" training, which I think will prove to > be a Bad Idea over time. > > Ideal is to train regularly on a random sample of all msgs, whether or not > correctly classified (I fake this by hand for now). That presents some UI > and algorithmic challenges. 
Note that "random sample" is not as trivial as all that, either - if you have a very high ham:spam ratio in your training DB, your accuracy will suffer (see the tests from Alex, myself and others). An easy example of this is those of us who are on a bunch of higher volume python.org lists - Greg's sterling work there means that very little spam gets through there. As spambayes takes over the world, this could be a larger problem. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From lists@morpheus.demon.co.uk Wed Nov 6 22:38:46 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Wed, 06 Nov 2002 22:38:46 +0000 Subject: [Spambayes] Outlook plugin - training References: <16E1010E4581B049ABC51D4975CEDB885E2D92@UKDCX001.uk.int.atosorigin.com> Message-ID: Tim Peters writes: > [Moore, Paul, on the Outlook2K client] >> Actually, I'm not sure I like "Potential Spam" being treated as >> spam until confirmed as OK. > > It doesn't. "Potential Spam" really means "unsure" -- it would be > as accurate to call it "Potential Ham", but neither is as accurate > as Unsure. The system knows it doesn't know what to call msgs in > this category, and the client doesn't automatically train on Unsure > msgs (unless you *manually* drag one into your Spam folder, or into > one of your Ham folders). That sounds like the best option. But it makes me wonder - what is a "Spam" folder, and what is a "Ham" folder, in this context? My best guess is that we're looking at the folders defined in the training dialog. I'm having difficulty following the addin code, but that feels logical (I've never seen an Outlook addin before, so I'm struggling with "lots of code, can't see the flow" problems ATM...) >> I have Rules Wizard rules which sort E-Mail traffic out into >> folders. 
I'm entirely happy with the behavious I understand to be >> the case - rules processed before the plugin - as I don't get spam >> on list addresses, so I'm OK with list traffic being totally >> excluded from the spam process. > > The Define Filters dialog has a multi-selection folder control, so > you can tell the client to watch any number of folders (you're not > limited to the Inbox alone; add the destination folders of your > other Outlook rules if you want email coming into those watched > too). I'm not entirely sure I do. As I said, anything moved by the rules wizard is list traffic, and as such is (a) non-spam (so no need to check it) and (b) not at all typical of personal mail. My intuition says that including list traffic will tend to dilute the clues which distinguish personal mail and spam. Of course, I know that the classifier *really* works by magic, and so my intuition is useless :-) > The interaction with Outlook's Rules Wizard (RW) remains unclear. > The RW's internal workings appear undocumented, and there appears no > way to hook into it. I've definitely seen the addin's filtering > rules trigger *while* the RW was still running, and in some cases > that can lead to the addin's filtering looking at a msg more than > once. For example, the addin's filter may trigger when a msg first > arrives in the Inbox, and then a second time on the same msg when > the RW moves it into a different folder that the addin's filter is > also watching. In this case the client suffers an internal > exception, as the entry ID Outlook told it to use for the first > trigger gets invalidated by the move. It works OK in the end, but > "something isn't quite right" about it. Ooh, that's even worse than I thought (and also entirely consistent with what I've come to expect from Outlook :-() >> I think I may switch off the "potential spam" bit, and just filter >> out known spam, and classify my Inbox by hand. I'll leave it a bit >> longer before deciding, though. 
> > You'll be happier if you keep an Unsure folder. For me, about 1% of > my email ends up there, about half-and-half ham vs spam, and my > Inbox is virtually spam-free (while my Spam folder is pure spam now > -- about 100 per day). You could easily be right on this. It's not so much that I don't want an Unsure folder, as that I don't know how best to manage it. My instinctive reaction is that I want "Spam" and "Not Spam" buttons, and then I read or delete the message in situ. Using the act of moving the message to indicate the status feels wrong. But maybe, in the light of what you said above (about watching multiple folders), I need to rethink this - for "normal mail" folders at least, if not for list traffic. OK, I'll try thinking in terms of 4 categories of folder - ham, spam, unsure, and "list traffic". In real terms, "list traffic" is no different than unsure, other than in that the addin will never put mail into the "list traffic" folders. I think that fits what I'm after, and doesn't stray too far from the "expected model". I even think that (if it works) I can write the logic up well enough to serve as the basis for some documentation :-) > Another: Note that this is pre-alpha software, and you should definitely > keep persistent Ham and Spam folders for training, as updating the code may > invalidate your database(s), or introduce tokenization and/or scoring and/or > configuration changes that render your database(s) worse than useless. IOW, > you should stay prepared to retrain from scratch. I set up a distinct .pst > file to hold Ham and Spam examples for this purpose, to keep from cluttering > my primary msg store. The folder controls in the addin (unlike several in > Outlook itself!) allow selecting multiple folders from multiple msg stores > too, and my Spam folder is actually in this other .pst file. Oh, I agree. I'm keeping spam now, so that I have a good training set of spam. I already keep loads of ham, so I don't feel the need to keep any more. 
But I do delete a particular *type* of message - the one-liners from Accommodation Services about cars with their lights left on, fire alarm tests, and the like. I'd rather not bother retaining these - just read and hit the delete button. OK, maybe I could code up a "move to ham archive" button which I could put next to the delete button. Maybe that's worth doing. It's back to that "how does the classifier know?" question again :-) Paul. -- This signature intentionally left blank From lists@morpheus.demon.co.uk Wed Nov 6 23:31:35 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Wed, 06 Nov 2002 23:31:35 +0000 Subject: [Spambayes] Outlook plugin - training References: <9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com> Message-ID: "Mark Hammond" writes: > ie, it is indeed "mistake based training", but that may still prove useful > in addition to ongoing training. From a newcomer's point of view, I think a key point is that "mistake based training" is easy to understand. I also believe that "confirmation based training" (my "clever boy!" button for specifically affirming that the classifier's magic gave the right answer) is easy to understand. More than that, a new user *expects* to need to do something like this, as the initial impression is one of amazement at the accuracy of the classifier. But such a gadget will fall into disuse as the user starts to expect the classifier to be right - so it probably doesn't have enough long-term value to be worth providing. Batch training (keeping ham and spam, and pumping it into the classifier in a regular training run) feels highly unnatural. My instinct is to *delete* spam - keeping it feels wrong. > I can't help thinking that we are somehow underestimating our own tool here. Coming at it from cold, I can confirm that the effect feels like pure magic. I trained on what I thought was a uselessly small corpus (I had *no* historical spam, so I retrieved the day's batch from the wastebin and used that).
The results have been so good that I can already, 2 days later, feel myself tending to "trust" the classifier, and forgetting about training issues. But unlike Mark, my instinct is that this is not such a good thing (solely from a training point of view). If people get such good results on inadequate training, they won't work at it enough, so the need is to make good training so easy and automatic that the tendency to forget to bother is offset. It's too late to think this through right now. I'll ponder some more in the morning... Paul. -- This signature intentionally left blank From tim.one@comcast.net Thu Nov 7 02:16:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 06 Nov 2002 21:16:25 -0500 Subject: [Spambayes] My first non-personal personal false positive Message-ID: I think this is ham. It just squeaked over my 0.80 ham cutoff, and suffers because I get a hell of a lot more spam in Spanish. BTW, if anyone can really read what he's asking, feel encouraged to reply! Spam Score: 0.800253 '*H*' 0.397658 '*S*' 0.998163 'python' 0.000386299 'header:Return-path:1' 0.073398 'header:Message-id:1' 0.0778812 'espa?ol' 0.0918367 'header:MIME-version:1' 0.09891 'header:Received:6' 0.1493 'programar' 0.155172 'msn' 0.193048 'url:msn' 0.301957 'web' 0.304265 'noheader:reply-to' 0.387101 'url:com' 0.61757 'noheader:errors-to' 0.67097 'url:g' 0.79048 'from:skip:= 40' 0.811451 'mediante' 0.844828 'pagina' 0.844828 'tiene' 0.844828 'from:email addr:hotmail.com>' 0.867843 'clic' 0.908163 'muy' 0.908163 'pero' 0.908163 'saber' 0.908163 'con' 0.922673 'bien' 0.934783 'eso' 0.934783 'hola' 0.934783 'que' 0.945895 'aqu?' 0.949438 'les' 0.973373 'por' 0.991803 Message Stream: Return-path: Received: from bright07.
(bright07-qfe0.icomcast.net [172.20.4.162]) by msgstore01.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H5600EHCOJTG1@msgstore01.icomcast.net>; Wed, 06 Nov 2002 21:07:05 -0500 (EST) Received: from mtain03 (bright-LB.icomcast.net [172.20.3.155]) by bright07. (8.11.6/8.11.6) with ESMTP id gA727TG20948; Wed, 06 Nov 2002 21:07:29 -0500 (EST) Received: from mail.python.org (mail.python.org [12.155.117.29]) by mtain03.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H5600FCNOK1PT@mtain03.icomcast.net>; Wed, 06 Nov 2002 21:07:13 -0500 (EST) Received: from f226.pav1.hotmail.com ([64.4.31.226] helo=hotmail.com) by mail.python.org with esmtp (Exim 4.05) id 189c4g-00038I-00 for webmaster@python.org; Wed, 06 Nov 2002 21:07:14 -0500 Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Wed, 06 Nov 2002 18:05:40 -0800 Received: from 168.243.104.248 by pv1fd.pav1.hotmail.msn.com with HTTP; Thu, 07 Nov 2002 02:05:39 +0000 (GMT) Date: Thu, 07 Nov 2002 02:05:39 +0000 From: =?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?= Subject: PETICION X-Originating-IP: [168.243.104.248] To: webmaster@python.org Message-id: MIME-version: 1.0 Content-type: text/html; charset=iso-8859-1 Content-transfer-encoding: 8BIT X-Spam-Status: No, hits=1.5 required=5.0 tests=FROM_BIGISP,SPAM_PHRASE_00_01 X-Spam-Level: * X-OriginalArrivalTime: 07 Nov 2002 02:05:40.0148 (UTC) FILETIME=[2BBA4740:01C28602]
hola a todos, bueno me encontraba navegando por la web y me tope con su pagina ya que a mi me encantaria aprender a programar pero tengo un problema yo no puedo Ingles muy bien por eso quiero saber si tiene el manual de PYTHON en español si me ayudan les estare eternamente agradecido cuidense mucho...


Charla con tus amigos en línea mediante MSN Messenger: Haz clic aquí From matt@mondoinfo.com Thu Nov 7 02:47:22 2002 From: matt@mondoinfo.com (Matthew Dixon Cowles) Date: Wed, 6 Nov 2002 20:47:22 -0600 (CST) Subject: [Spambayes] My first non-personal personal false positive In-Reply-To: References: Message-ID: <1036636892.95.852@sake.mondoinfo.com> > hola a todos, bueno me encontraba navegando por la web y me tope > con su pagina ya que a mi me encantaria aprender a programar pero > tengo un problema yo no puedo Ingles muy bien por eso quiero saber > si tiene el manual de PYTHON en español si me ayudan les estare > eternamente agradecido cuidense mucho... He's asking where he can find a Python manual in Spanish. I'd send a reply to him but his address seems to have gone astray from the message. The From header From: =?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?= decodes to only a name. Regards, Matt From tim.one@comcast.net Thu Nov 7 03:03:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 06 Nov 2002 22:03:50 -0500 Subject: [Spambayes] My first non-personal personal false positive In-Reply-To: <1036636892.95.852@sake.mondoinfo.com> Message-ID: [Matthew Dixon Cowles] > He's asking where he can find a Python manual in Spanish. I'd send a > reply to him but his address seems to have gone astray from the > message. The From header > > From: =?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?= > > decodes to only a name. Thanks! I sent this to him: """ Apesadumbrado, no hablo español: http://www.python.org/doc/NonEnglish.html#spanish """ I expect Babelfish gave me an absurd translation for "Sorry,", but I have no shame . I copied Python-Help so someone else can take it from there.
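
The remark that the From header "decodes to only a name" can be checked with the standard library. The following is a minimal sketch, assuming a modern Python 3 interpreter (not what was in use when this thread was written); the helper name `decode_display_name` is made up for illustration. The header value is the one quoted above, with its quoted-printable escaping undone.

```python
from email.header import decode_header

# Encoded-word form of the From header quoted in the message above.
RAW_FROM = "=?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?="


def decode_display_name(value):
    """Decode RFC 2047 encoded words into a plain string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus their declared charset.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # Unencoded text comes back as str with charset None.
            parts.append(chunk)
    return "".join(parts)


name = decode_display_name(RAW_FROM)
print(name)
# No "@" anywhere in the decoded text: the header carries a display
# name but no address to reply to.
print("@" in name)
```

The decoded text is a personal name in Latin-1 with no address part, which is consistent with the observation that the sender's address went astray.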
From tim.one@comcast.net Thu Nov 7 03:11:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 06 Nov 2002 22:11:50 -0500 Subject: [Spambayes] New scam Message-ID: This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment I haven't seen this one before. The classifier nailed it, of course. This jerk even set up a bad web site to "confirm" the claims: http://www.delottonetherlands.net That's worth looking at just for the photo of "our new computer room were [sic] the computer balloting is carried out". always-suspected-guido-had-a-night-job-ly y'rs - tim ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: DIRECTOR OF PROMOTIONS Subject: CONGRATULATIONS! OUR LUCKY WINNER. Date: Tue, 05 Nov 2002 21:49:34 -0500 Size: 3712 Url: http://mail.python.org/pipermail/spambayes/attachments/20021106/a9727231/attachment.txt ---------------------- multipart/mixed attachment-- From francois.granger@free.fr Thu Nov 7 08:56:13 2002 From: francois.granger@free.fr (François Granger) Date: Thu, 07 Nov 2002 09:56:13 +0100 Subject: [Spambayes] My first non-personal personal false positive In-Reply-To: Message-ID: on 7/11/02 3:16, Tim Peters at tim.one@comcast.net wrote: > 'mediante', 'pagina', 'tiene', 'clic', 'muy', 'pero', 'saber', 'con', 'bien', 'eso', 'hola', 'que', 'aqu?', 'les', 'por' Here are the most probable English equivalents of the Spanish words. > 'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with', 'good', 'this', 'Hi', 'that', 'here', 'the', 'for' This illustrates the need for properly balanced training sets and re-raises the question of language discrimination. At least prior language discrimination would allow for a different database for each language or for a systematic "unsure" flag for untrained languages.
If you put my messages in a Ham training set, you will flag French spams as ham because of my French sig ;-) All these words should rate around 0.5 since they are among the most common ones in this language. -- Le courrier est un moyen de communication. Les gens devraient se poser des questions sur les implications politiques des choix (ou non choix) de leurs outils et technologies. Pour des courriers propres : -- From francois.granger@free.fr Thu Nov 7 08:57:45 2002 From: francois.granger@free.fr (François Granger) Date: Thu, 07 Nov 2002 09:57:45 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: on 6/11/02 20:55, Just van Rossum at just@letterror.com wrote: > First off: I started playing with spambayes last sunday, and it's been a blast > so far. I'm using pop3proxy.py, love the brand new web interface. Did you install it on MacOS9 or MacOSX ? -- Le courrier est un moyen de communication. Les gens devraient se poser des questions sur les implications politiques des choix (ou non choix) de leurs outils et technologies. Pour des courriers propres : -- From just@letterror.com Thu Nov 7 09:05:59 2002 From: just@letterror.com (Just van Rossum) Date: Thu, 7 Nov 2002 10:05:59 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: François Granger wrote: > Did you install it on MacOS9 or MacOSX ? OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with 2.2, so you can't use the Python shipped with 10.2. (In theory it might work under OS9, but I've never had much luck with sockets in MacPython 2.x, but you could try. It uses asyncore and not threading, so that's hopeful for 9.) Just PS: the web interface of pop3proxy.py is pretty good and useful, the only downside is that it saves the database after each training, which makes it hard to train with a few messages: after each message you have to wait (up to 10 seconds on my machine with my database) before you can continue.
Maybe an explicit "Save database" button is an idea? From francois.granger@free.fr Thu Nov 7 11:00:47 2002 From: francois.granger@free.fr (François Granger) Date: Thu, 07 Nov 2002 12:00:47 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: on 7/11/02 10:05, Just van Rossum at just@letterror.com wrote: > François Granger wrote: > >> Did you install it on MacOS9 or MacOSX ? > > OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with > 2.2, so you can't use the Python shipped with 10.2. (In theory it might work > under OS9, but I've never had much luck with sockets in MacPython 2.x, but you > could try. It uses asyncore and not threading, so that's hopeful for 9.) I got as far as having it running with MacOS9.1 and Python 2.2.1. The Web server works and the proxy answers a telnet on 127.0.0.1:110. I think I don't get the idea of the settings for the proxy. I give spambayes my pop3 server name, then change the account in my mail reader to have it connect to 127.0.0.1 as a pop3 server. And nothing happens. > after each message you have to wait (up to 10 > seconds on my machine with my database) before you can continue. Maybe an > explicit "Save database" button is an idea? With the -d parameter, you can use an anydbm instead of Pickle. With some hack it can probably use gdbm as the anydbm db. -- Le courrier est un moyen de communication. Les gens devraient se poser des questions sur les implications politiques des choix (ou non choix) de leurs outils et technologies. Pour des courriers propres : -- From Paul.Moore@atosorigin.com Thu Nov 7 11:01:14 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Thu, 7 Nov 2002 11:01:14 -0000 Subject: [Spambayes] Outlook plugin - training Message-ID: <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com> > It's too late to think this through right now. I'll ponder some more > in the morning... Some post-ponder musings...
I'm assuming (based on a message I recall seeing recently) that it's possible to "correct" training - ie, if I train the classifier that a specific message is spam, I can later say "no it isn't, it's ham". Assuming that this is so, is it not reasonable to train dynamically on an "assume I got it right" basis? In other words, whenever the addin filters a message as ham or spam, automatically train on that basis as well. Then, if the user sees a mistake, he corrects it, which automatically retrains the classifier (manually deleting as spam or moving a message already does this). This will keep the database right up to date, and all the user has to do is correct any bad decisions the classifier makes (which he should be doing anyway). I've ignored database growth issues, but other than that, is there any other problem with this approach? Paul. From just@letterror.com Thu Nov 7 11:11:35 2002 From: just@letterror.com (Just van Rossum) Date: Thu, 7 Nov 2002 12:11:35 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: François Granger wrote: > > after each message you have to wait (up to 10 > > seconds on my machine with my database) before you can continue. Maybe an > > explicit "Save database" button is an idea? > > With the -d parameter, you can use an anydbm instead of Pickle. With some > hack it can probably use gdbm as the anydbm db. Right, that's the obvious solution. Thanks. Just From just@letterror.com Thu Nov 7 14:21:21 2002 From: just@letterror.com (Just van Rossum) Date: Thu, 7 Nov 2002 15:21:21 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: François Granger wrote: > > after each message you have to wait (up to 10 > > seconds on my machine with my database) before you can continue. Maybe an > > explicit "Save database" button is an idea? > > With the -d parameter, you can use an anydbm instead of Pickle. With some > hack it can probably use gdbm as the anydbm db. Ok, so I did it.
With my current setup anydbm uses dbhash/bsddb, and training (on a single message) performance seems _worse_ than with the pickle (about 20 seconds now, around 10 with pickle). Don't know whether the training itself is slower or updating the database. Training with my entire corpus took many times longer as well. Not to mention that the database is now 20 megs instead of 5... Would gdbm be expected to work faster? (I currently don't even have it.)

Just

From msergeant@startechgroup.co.uk Thu Nov 7 14:21:11 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 07 Nov 2002 14:21:11 +0000
Subject: [Spambayes] Chi-squared perl port problems
Message-ID: <3DCA76D7.4070404@startechgroup.co.uk>

[Moderators - sent this from the wrong address. Please kill that mail]

OK, I've tried to convert your chi-squared stuff to Perl, but for some reason it's producing bizarre results. It always scores low. And I have no idea why, because I thought I'd copied the code pretty much verbatim (albeit adding in a few $'s and {}'s ;-)

First of all, here's the token scores an email in question gets:

    received-ip:207.230.250.119 => 1.00000
    AWESOME => 1.00000
    BUKAKKE => 1.00000
    GALLERIES! => 1.00000
    received-ip:218.53.86.224 => 1.00000
    orgasmic => 1.00000
    jism => 1.00000
    barrages => 1.00000
    href=http://205.197.95.39/users/belinda/bukkakehouse/index.html => 1.00000
    href=http://205.197.95.39/remove.php => 1.00000
    border=20 => 1.00000
    bukakke => 1.00000
    from: => 1.00000
    color=#FFCC33 => 1.00000
    color=#FFCC99 => 1.00000
    from:"Carol" => 1.00000
    size=+3 => 0.91967
    content-type:text/html => 0.90484
    size=+4 => 0.89516
    bgcolor=#000000 => 0.89012
    size=+2 => 0.88557
    instant => 0.79354
    align=center => 0.78635
    access! => 0.77353
    color=#FFFFFF => 0.77167
    remove => 0.76071
    width=600 => 0.75813
    now! => 0.75306
    click => 0.66231
    bukkake => 0.59412
    20123 => 0.59412
    faces! => 0.59412
    color=#FF6600 => 0.55386
    here => 0.46364
    face=verdana => 0.44700
    yourself => 0.43374
    action. => 0.35701
    bordercolor=Black => 0.32793
    blow => 0.28531
    stop => 0.25280
    japanese => 0.14122
    drenched => 0.10872
    facial => 0.07652
    please => 0.07193
    drinking => 0.06229
    house => 0.04293

The resulting score my chi-squared code gives this is 0.331736284189509 - which to me is obviously incorrect (if you pass it through Paul Graham's method it scores 1.0).

So here's the code I'm using:

    if (1) {
        # Chi-Squared method. Produces mostly boolean $result
        # but with a grey area.
        my ($H, $S);
        my ($Hexp, $Sexp);
        $H = $S = 1.0;
        $Hexp = $Sexp = 0;
        my $num_clues = @sorted;
        foreach my $row (@sorted) {
            $S *= 1.0 - $row->[PROB];
            $H *= $row->[PROB];
            if ($S < 1e-200) {
                my $e;
                ($S, $e) = frexp($S);
                $Sexp += $e;
            }
            if ($H < 1e-200) {
                my $e;
                ($H, $e) = frexp($H);
                $Hexp += $e;
            }
        }
        $S = log($S) + $Sexp + LN2;
        $H = log($H) + $Hexp + LN2;
        if ($num_clues) {
            $S = 1.0 - chi2q(-2.0 * $S, 2 * $num_clues);
            $H = 1.0 - chi2q(-2.0 * $H, 2 * $num_clues);
            $result = (($S - $H) + 1.0) / 2.0;
        }
        else {
            $result = 0.5;
        }
    }

And here's the chi2q routine, if that's relevant:

    # Chi-squared function
    sub chi2q {
        my ($x2, $v) = @_;
        die "v must be even in chi2q(x2, v)" if $v & 1;
        my $m = $x2 / 2.0;
        my ($sum, $term);
        $sum = $term = exp(0 - $m);
        for my $i (1 .. ($v >> 2)) {
            $term *= $m / $i;
            $sum += $term;
        }
        return $sum < 1.0 ? $sum : 1.0;
    }

I also added some debugging output so that I could see the three stages of S and H (after the loop, after the log(), and after the chi2q bit). Here's the output from that:

    S1=1e-10; H1=1.25335384490988e-12
    S2=-22.3327037493805; H2=-26.7120509011492
    S3=0.389722189708954; H3=0.726249621329936

If you can help me at all, I would *really* appreciate it, as I honestly can't see where your code and mine differs.

Thanks!

Matt.
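
For anyone checking a port against Python, here is a minimal sketch of chi-squared combining, written for illustration rather than copied from the project. The function names (chi2Q, chi2_score) and the layout are assumptions, and the sketch assumes every probability is strictly between 0 and 1, since log(0) would otherwise fail.

```python
import math

LN2 = math.log(2)


def chi2Q(x2, v):
    """Return prob(chisq >= x2) for v degrees of freedom; v must be even."""
    assert v % 2 == 0, "v must be even"
    m = x2 / 2.0
    total = term = math.exp(-m)
    # Add terms i = 1 .. v/2 - 1 of the series exp(-m) * m**i / i!
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi2_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    S = H = 1.0
    Sexp = Hexp = 0
    for p in probs:
        S *= 1.0 - p
        H *= p
        # Pull the binary exponent out before the product underflows.
        if S < 1e-200:
            S, e = math.frexp(S)
            Sexp += e
        if H < 1e-200:
            H, e = math.frexp(H)
            Hexp += e
    n = len(probs)
    if n == 0:
        return 0.5
    # ln(mantissa * 2**exp) == ln(mantissa) + exp * ln(2)
    lnS = math.log(S) + Sexp * LN2
    lnH = math.log(H) + Hexp * LN2
    S = 1.0 - chi2Q(-2.0 * lnS, 2 * n)
    H = 1.0 - chi2Q(-2.0 * lnH, 2 * n)
    return ((S - H) + 1.0) / 2.0
```

Feeding the same short probability vector to both versions and comparing the values after the loop, after the log step, and after chi2Q should show where a port diverges.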
From sjoerd@acm.org Thu Nov 7 14:34:45 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Thu, 07 Nov 2002 15:34:45 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To:
References:
Message-ID: <200211071434.gA7EYjZ28924@indus.ins.cwi.nl>

On Thu, Nov 7 2002 Just van Rossum wrote:

> François Granger wrote:
>
> > > after each message you have to wait (up to 10
> > > seconds on my machine with my database) before you can continue. Maybe an
> > > explicit "Save database" button is an idea?
> >
> > With the -d parameter, you can use an anydbm instead of Pickle. With some
> > hack it can probably use gdbm as the anydbm db.
>
> Ok, so I did it. With my current setup anydbm uses dbhash/bsddb, and training
> (on a single message) performance seems _worse_ than with the pickle (about 20
> seconds now, around 10 with pickle). Don't know whether the training itself is
> slower or updating the database. Training with my entire corpus took many times
> longer as well. Not to mention that the database is now 20 megs instead of 5...
> Would gdbm be expected to work faster? (I currently don't even have it.)

The problem with training is that the update_probabilities() method which is called at the end goes through the whole database and updates just about every word. So the whole database is touched and needs to be written to disk.

-- Sjoerd Mullender

From skip@pobox.com Thu Nov 7 14:42:08 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 7 Nov 2002 08:42:08 -0600
Subject: [Spambayes] Chi-squared perl port problems
In-Reply-To: <3DCA76D7.4070404@startechgroup.co.uk>
References: <3DCA76D7.4070404@startechgroup.co.uk>
Message-ID: <15818.31680.223575.90177@montanaro.dyndns.org>

    Matt> OK, I've tried to convert your chi-squared stuff to Perl, but for
    Matt> some reason it's producing bizarre results.
I think

    $S = log($S) + $Sexp + LN2;
    $H = log($H) + $Hexp + LN2;

should be

    $S = log($S) + $Sexp * LN2;
    $H = log($H) + $Hexp * LN2;

Skip

From popiel@wolfskeep.com Thu Nov 7 15:01:13 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 07:01:13 -0800
Subject: [Spambayes] Upgrade problem
In-Reply-To: Message from Sjoerd Mullender <200211071434.gA7EYjZ28924@indus.ins.cwi.nl>
References: <200211071434.gA7EYjZ28924@indus.ins.cwi.nl>
Message-ID: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>

In message: <200211071434.gA7EYjZ28924@indus.ins.cwi.nl>
Sjoerd Mullender writes:
>
>The problem with training is that the update_probabilities() method
>which is called at the end goes through the whole database and updates
>just about every word. So the whole database is touched and needs to
>be written to disk.

Why don't we just store the counts, and only compute the probabilities when we need to reference them? Yes, it is more efficient for bulk testing to only compute the probabilities once, but it's definitely a lose for incremental training.

Unless there's good arguments against, I'll make a patch for this in the next day or two.

- Alex

From msergeant@startechgroup.co.uk Thu Nov 7 15:03:43 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 07 Nov 2002 15:03:43 +0000
Subject: [Spambayes] Chi-squared perl port problems
References: <3DCA76D7.4070404@startechgroup.co.uk> <15818.31680.223575.90177@montanaro.dyndns.org>
Message-ID: <3DCA80CF.4040101@startechgroup.co.uk>

Skip Montanaro said the following on 07/11/02 14:42:
>
> Matt> OK, I've tried to convert your chi-squared stuff to Perl, but for
> Matt> some reason it's producing bizarre results.
>
> I think
>
>     $S = log($S) + $Sexp + LN2;
>     $H = log($H) + $Hexp + LN2;
>
> should be
>
>     $S = log($S) + $Sexp * LN2;
>     $H = log($H) + $Hexp * LN2;

Thanks. That was one difference. However I still get odd results.
Here's another set of tokens, which scores 1.0 under graham, but 0.03-ish with my chi-squared code:

    2161361384acrd-zgwm => 1.00000
    FREE?! => 1.00000
    Schoolgirl => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/spacer.gif => 1.00000
    received-by:mail2.studiocev.com => 1.00000
    amateuryouth.com => 1.00000
    bgcolor=#525D94 => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/index_09.jpg => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/index_07.jpg => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/index_05.jpg => 1.00000
    received-ip:216.136.138.4 => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/index_04.jpg => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/index_03.jpg => 1.00000
    src=http://www.studiocev.com/stop/images/amateuryouth/images/index_01.jpg => 1.00000
    skip:21613 19 => 1.00000
    from: => 1.00000
    height=188 => 1.00000
    href=http://amateuryouth.com/enter.html => 1.00000
    href=http://www.studiocev.com/unsubscribe.html => 1.00000
    width=345 => 1.00000
    width=376 => 1.00000
    height=73 => 0.97424
    height=141 => 0.96062
    size=+1 => 0.95901
    height=62 => 0.95276
    free!! => 0.91652
    content-type:text/html => 0.90486
    rowspan=4 => 0.89331
    width=298 => 0.88645
    size=5 => 0.88634
    width=375 => 0.84656
    align=center => 0.78637
    color=#FFFFFF => 0.77172
    remove => 0.76079
    width=30 => 0.74293
    width=153 => 0.73757
    rowspan=2 => 0.73211
    height=47 => 0.71467
    color=WHITE => 0.70927
    border=0 => 0.69548
    target=_blank => 0.69225
    width=47 => 0.69081
    width=1 => 0.67389
    cellpadding=0 => 0.65914
    colspan=6 => 0.65661
    cellspacing=0 => 0.65439
    colspan=5 => 0.64894
    width=15 => 0.64689
    width=130 => 0.64543
    height=1 => 0.63505
    colspan=2 => 0.61052
    from:Foreman" => 0.59412
    face=Verdana => 0.58404
    sites => 0.57298
    enter => 0.52994
    colspan=4 => 0.52631
    colspan=3 => 0.52302
    absolutely => 0.51730
    width=77 => 0.49915
    here => 0.46375
    yourself => 0.43398
    offer => 0.33481
    face=arial => 0.33373
    http-equiv=Content-Type => 0.27761
    years => 0.27318
    models => 0.25215
    least => 0.24571
    time => 0.22573
    this => 0.21638
    within => 0.19113
    content=text/html; charset=iso-8859-1 => 0.16949
    from => 0.12558
    come => 0.10411
    listed => 0.07929
    please => 0.07193
    service => 0.02372
    from:"Susan => 0.02048
    limited => 0.01139

And here's the S and H values at each stage this time:

    S1=1e-10; H1=1.36351472450952e-22
    S2=-23.0258509299405; H2=-50.3468063235703
    S3=0.00083341211626875; H3=0.941170965180294

Every single email I throw at this gives me a high H and a low S. I'm really not sure what I'm doing wrong here...

Matt.

From tim.one@comcast.net Thu Nov 7 15:28:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 10:28:18 -0500
Subject: [Spambayes] Chi-squared perl port problems
Message-ID:

[Matt Sergeant]
> for my $i (1 .. ($v >> 2)) {

Watch out for that too -- shifting right by 2 isn't dividing by 2, so you're systematically getting results (probably way) too small out of chi2Q. The right range here is 1 .. ($v/2-1).

Suggestion: use chi2.py's showscore() function to show the internal Python details on small artificial prob vectors.
Then you can check intermediates one-by-one against the Perl version.

From guido@python.org Thu Nov 7 15:36:37 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 07 Nov 2002 10:36:37 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: Your message of "Thu, 07 Nov 2002 10:05:59 +0100."
References:
Message-ID: <200211071536.gA7FabZ27507@odiug.zope.com>

> OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't
> work with 2.2, so you can't use the Python shipped with 10.2.

Long ago, we settled for Python 2.2 (some people wanted 2.1, but that was unbearable). If you see violations of 2.2 compatibility, please supply patches (we'll also gladly give you checkin permission).

(If it makes a difference, I'd prefer aiming for 2.2 compatibility over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS 10.2. Unless it gets too ugly.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Thu Nov 7 15:40:15 2002
From: just@letterror.com (Just van Rossum)
Date: Thu, 7 Nov 2002 16:40:15 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <200211071536.gA7FabZ27507@odiug.zope.com>
Message-ID:

Guido van Rossum wrote:

> Long ago, we settled for Python 2.2 (some people wanted 2.1, but that
> was unbearable). If you see violations of 2.2 compatibility, please
> supply patches (we'll also gladly give you checkin permission).
>
> (If it makes a difference, I'd prefer aiming for 2.2 compatibility
> over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS
> 10.2. Unless it gets too ugly.)

The docs say 2.2.1 and that's correct: the code is littered with True and False. Those are the only 2.2.1-isms I've seen. But a patch would nevertheless be quite big.

Just

From just@letterror.com Thu Nov 7 15:47:22 2002
From: just@letterror.com (Just van Rossum)
Date: Thu, 7 Nov 2002 16:47:22 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>
Message-ID:

T. Alexander Popiel wrote:

> Why don't we just store the counts, and only compute the probabilities
> when we need to reference them? Yes, it is more efficient for bulk
> testing to only compute the probabilities once, but it's definitely
> a lose for incremental training.
>
> Unless there's good arguments against, I'll make a patch for this
> in the next day or two.

+1 (I assume you mean implementing a caching scheme, right?)

Just

From just@letterror.com Thu Nov 7 16:01:12 2002
From: just@letterror.com (Just van Rossum)
Date: Thu, 7 Nov 2002 17:01:12 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

Just van Rossum wrote:

> Guido van Rossum wrote:
>
> > Long ago, we settled for Python 2.2 (some people wanted 2.1, but that
> > was unbearable). If you see violations of 2.2 compatibility, please
> > supply patches (we'll also gladly give you checkin permission).
> >
> > (If it makes a difference, I'd prefer aiming for 2.2 compatibility
> > over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS
> > 10.2. Unless it gets too ugly.)
>
> The docs say 2.2.1 and that's correct: the code is littered with True
> and False. Those are the only 2.2.1-isms I've seen. But a patch would
> nevertheless be quite big.

I just did a quick test with 2.2 (adding True and False to __builtins__ ;-), and the only other 2.2.1-ism is bool(), which is only used in Options.py. After fixing that everything seems to work just fine.

I'd be happy to add this

    try:
        True, False
    except NameError:
        True, False = 1, 0

to a bunch of files, and patch the docs.

Your call. My sf username is "jvr" ;-)

Just

From bfallik@attbi.com Thu Nov 7 19:00:32 2002
From: bfallik@attbi.com (Brian Fallik)
Date: Thu, 7 Nov 2002 14:00:32 -0500
Subject: [Spambayes] FW: I finally found you!
Message-ID: <009601c2868f$f36eded0$0302a8c0@disaster>

I recently received this email message, which I believe is very clever SPAM.
In fact, it took me a few readings to actually figure it out (I admit I was initially excited about JennyB). I checked out the base URL without the form data (http://www.5050dating.com) and became suspicious because it is a dating service. Then I performed a search for jenny on their site and didn't find any girls matching that name. However I did find about 5 other guys, all recently registered, who had posted comments about looking for Jenny B. D'oh.

The message is very generic, except for the reference to college. The mistake is that I graduated from Cornell several years ago, and anyone who knew me from high school would know that. I've concluded that 5050 dating got their information from my email address, which is pretty obvious.

My question to the group is: how would the Bayesian filter handle a message like this, which can even trick humans?

I'm not a member of this list so please CC me on any replies.

Thanks,
brian

-----Original Message-----
From: Jenny B [mailto:jenny14296@hotmail.com]
Sent: Wednesday, November 06, 2002 6:23 PM
To: baf11@cornell.edu
Subject: I finally found you!

Hey you,

I haven't seen to you in sooooo long. I guess I was just a little shy then and I don't remember if you would remember me. But I ran into some of the guys we went to high school with and they said you were going to Cornell now. Good for you.

Well, the reason I'm writing is because I've kinda always had a crush on you. I wanted to see if there was any way I could get a second chance of getting to know you better.

Anyway, a bunch of my girlfriends and I just got on 5050 Dating <--just click on that to meet up. You've got to come up there. It is pretty wild & there are a few things I've got to tell you. Who knows, maybe we'll hit it off this time and next time you come in town we can get together and I can show you a good time.

I look forward to catching up.

See ya soon!
Jenny:)

--------------------------------------------------------------------------------
Add photos to your e-mail with MSN 8. Get 2 months FREE*.

From jeremy@alum.mit.edu Thu Nov 7 19:28:06 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Thu, 7 Nov 2002 14:28:06 -0500
Subject: [Spambayes] FW: I finally found you!
In-Reply-To: <009601c2868f$f36eded0$0302a8c0@disaster>
References: <009601c2868f$f36eded0$0302a8c0@disaster>
Message-ID: <15818.48838.108249.714677@slothrop.zope.com>

Hey you,

I don't think we've exchanged email before. Did we know each other in college? I hope you're not too shy to answer; don't worry about your girlfriends getting a copy of this message. Would you feel better if you hadn't told us you were initially excited about JennyB? (Let's see how that gets scored :-).

The answer to your question depends on the ham and spam you've trained with. My classifier was sure that your message, including the quoted spam, was ham. It was unsure about the spam by itself (although it didn't have properly formed headers to tokenize). Here's the detailed scoring information.

Jeremy

Score: 0.336511997458

Clues
-----
*H*                      0.532493458876
*S*                      0.205517453791
anyway,                  0.0196506550218
hey                      0.0302013422819
bunch                    0.0348837209302
soon!                    0.0505617977528
kinda                    0.0652173913043
catching                 0.0918367346939
dating                   0.0918367346939
guess                    0.116621141434
i've                     0.179689607795
maybe                    0.201283517411
i'm                      0.213137445659
haven't                  0.232979675235
things                   0.32151534693
were                     0.329670122312
could                    0.333164025881
we'll                    0.349417503362
but                      0.353848452351
would                    0.362072927452
going                    0.374658912657
got                      0.378612196264
content-type:text/plain  0.387559318404
me.                      0.391376634892
tell                     0.603151379913
well,                    0.606935963897
you've                   0.608393669222
because                  0.610456054038
getting                  0.618613009077
seen                     0.645415129199
hit                      0.6617111985
high                     0.668283205336
always                   0.67912287871
forward                  0.683019483647
now.                     0.705543778678
you.                     0.719307394428
you,                     0.740951557525
show                     0.750270607144
off                      0.75066835312
town                     0.763455632778
wild                     0.763455632778
subject:you              0.775359384491
who                      0.781981579908
better.                  0.823070962358
girlfriends              0.844827586207
shy                      0.844827586207
subject:found            0.844827586207
subject:!                0.849127345648
click                    0.863834813092
message-id:invalid       0.908163265306

From skip@pobox.com Thu Nov 7 19:34:01 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 7 Nov 2002 13:34:01 -0600
Subject: [Spambayes] FW: I finally found you!
In-Reply-To: <009601c2868f$f36eded0$0302a8c0@disaster>
References: <009601c2868f$f36eded0$0302a8c0@disaster>
Message-ID: <15818.49193.777463.850934@montanaro.dyndns.org>

    Brian> My question to the group is: how would the Bayesian filter handle
    Brian> a message like this, which can even trick humans?

It all depends on what data you've trained on. It's hard for us to get a good read on this particular message because you left out most of the headers, which are often good sources of clues.

--
Skip Montanaro - skip@pobox.com
http://www.mojam.com/
http://www.musi-cal.com/

From tim.one@comcast.net Thu Nov 7 19:44:19 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 14:44:19 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

[Just van Rossum]
> ...
> However, I did a cvs up today, and unpickling the database
> stopped working, as classifier.Bayes became a classic class. After
> some twiddling I managed to repair it, but now I get AssertionErrors
> during training:

I suppose it would have worked to restore the inheritance from object long enough to open the old pickle, then copy the contents into an instance of the changed class and pickle that.

> [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix
> Training ham (mymail/good.mbox.fix):
> 4
> Traceback (most recent call last):
>   File "./hammie.py", line 483, in ?
>     main()
>   File "./hammie.py", line 460, in main
>     h.update_probabilities()
>   File "./hammie.py", line 336, in update_probabilities
>     self.bayes.update_probabilities()
>   File "classifier.py", line 327, in update_probabilities
>     assert hamcount <= nham
> AssertionError
>
> Is my db screwed or is it repairable?

It's obviously screwed, and whether it's repairable depends on exactly what "some twiddling" meant. I'm sure you've built a new one from scratch by now, though!

From tim.one@comcast.net Thu Nov 7 19:54:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 14:54:17 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

[Just van Rossum]
> Alright, this triggered a feature request in me, which resulted in some
> hacking activity. The patch below appends training messages to
> one of two mbox files ('_pop3proxyspam.mbox' or '_pop3proxyham.mbox'
> respectively), making it easier to later rebuild the database from
> scratch, while still being able to train ad hoc with the web interface
> of pop3proxy.py. Good idea?

Yes, and it's another reason to create a dedicated "training class" module, so that various clients can at least share an *interface* for doing such stuff (and so that new clients don't have to reinvent these concepts from scratch each time around).

From tim.one@comcast.net Thu Nov 7 20:19:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 15:19:30 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> Why don't we just store the counts, and only compute the probabilities
> when we need to reference them? Yes, it is more efficient for bulk
> testing to only compute the probabilities once, but it's definitely
> a lose for incremental training.

Unqualified judgments are always wrong. I often get email in batches of 200, and scoring speed is important to me -- much more so than training speed.
It will be even more so at python.org, where training probably won't occur more often than once a week, but scoring is ongoing around the clock. Note that for purposes of scoring, the *counts* needn't be saved at all now, and a scoring-only database can exploit that (and this project's neiltrain.py already does).

> Unless there's good arguments against, I'll make a patch for this
> in the next day or two.

When one size doesn't fit all, think instead about subclasses, different methods, additional arguments, and/or instance attributes. It's also nice that the current code separates probability estimation algorithms from probability combination algorithms.

From tim.one@comcast.net Thu Nov 7 20:23:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 15:23:32 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

[Just van Rossum]
> ...
> I'd be happy to add this
>
>     try:
>         True, False
>     except NameError:
>         True, False = 1, 0
>
> to a bunch of files, and patch the docs.
>
> Your call. My sf username is "jvr" ;-)

It's fine by me, and you have commit privileges now.

From richie@entrian.com Thu Nov 7 20:45:37 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 07 Nov 2002 20:45:37 +0000
Subject: [Spambayes] Upgrade problem
In-Reply-To:
References: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>
Message-ID: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com>

[Tim]
> Note that for purposes of scoring, the *counts* needn't be saved at all now

A quick note in case someone decides to remove the counts from the database: the HTML front end has a "Word query" feature which will tell you the information in the database for a given word - it's interesting to see how many more times the word 'Viagra' appears in ham than in spam. I mean the other way round.
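
The store-the-counts idea discussed in this thread can be sketched roughly as follows. This is an illustration, not the project's classifier: the class name CountingClassifier and its methods are hypothetical, and the prior constants 0.45 and 0.5 are assumed values (they do reproduce clue scores such as 0.844827586207 quoted earlier in this thread). Only counts are stored; a spamprob is computed when asked for and memoised until the next training call, because changing nham or nspam shifts every word's probability.

```python
class CountingClassifier:
    """Hypothetical sketch: store per-word counts, derive spamprobs lazily."""

    # Assumed prior settings for words with little or no evidence.
    UNKNOWN_WORD_STRENGTH = 0.45
    UNKNOWN_WORD_PROB = 0.5

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.counts = {}   # word -> [hamcount, spamcount]
        self._cache = {}   # word -> spamprob, valid until the next learn()

    def learn(self, words, is_spam):
        """Train on one message; only the message's own words are touched."""
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        index = 1 if is_spam else 0
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[index] += 1
        # The totals changed, so every cached probability is stale.
        self._cache.clear()

    def spamprob(self, word):
        """Return the word's spam probability, computing it on first use."""
        if word in self._cache:
            return self._cache[word]
        hamcount, spamcount = self.counts.get(word, (0, 0))
        n = hamcount + spamcount
        s = self.UNKNOWN_WORD_STRENGTH
        x = self.UNKNOWN_WORD_PROB
        if n == 0:
            prob = x
        else:
            # Ratios rather than raw counts, so corpus sizes cancel out.
            hamratio = hamcount / self.nham if self.nham else 0.0
            spamratio = spamcount / self.nspam if self.nspam else 0.0
            prob = spamratio / (hamratio + spamratio)
            # Pull the estimate toward the prior when evidence is thin.
            prob = (s * x + n * prob) / (s + n)
        self._cache[word] = prob
        return prob
```

Training then rewrites only the records for the words in the trained message; the cost moves to the first lookup of each word after training, which is the trade-off between incremental training and bulk scoring raised above.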
-- Richie Hindle richie@entrian.com

From richie@entrian.com Thu Nov 7 20:53:06 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 07 Nov 2002 20:53:06 +0000
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To:
References:
Message-ID:

[Tim Peters]
> it's another reason to create a dedicated "training class" module,
> so that various clients can at least share an *interface* for doing such
> stuff

Tim Stone and I have made a start on this (or rather Tim has and I've poked my nose in) - I mention it because he's away until the weekend and we wouldn't want anyone to duplicate the work.

It may be too early to talk details (and slightly rude in Tim's absence - my apologies!) but here's the email I sent to Tim outlining how I thought it might work. I was thinking more about generic Message and Corpus classes than specifically about training. Laughing and pointing should be directed towards me rather than Tim.

-------------------------------------------------------------------------

[Tim S]
> We would include methods in Corpus to add a message to, remove a message from,
> move from one to another, with the appropriate untraining/retraining built in.
> We *could* have a method that, given a message substance (headers and body)
> would find an existing message in a corpus that matched it (somehow). We
> would include metadata with the corpus that tells us whether it's a
> spam/ham/untrained corpus, so the retraining can be done. We could even
> include a fourth type of corpus (cache) with methods to use expiry data in the
> message metadata to remove old cache messages...

This is excellent stuff. A Corpus contains Messages. CacheCorpus is a subclass of Corpus that adds the concept of expiry, and contains CachedMessages (CachedMessage being a subclass of Message) that know about their own expiry details (time of creation, size, time of last use, whatever it depends on). That's very neat.
A Corpus wouldn't know how to create Message objects, nor would a Message object know how to create itself - classes *derived from* them would know how to do that. For instance (totally untested code, probably full of typos) -

    class Message:
        def __init__(self, messageText):
            """Pass in the text of the message, headers and body."""
            # etc.

        def name(self):
            """Returns a name for this message which is unique within
            its corpus."""
            raise NotImplementedError

    class FileMessage(Message):
        """A Message representing an email stored in a file on disk."""
        def __init__(self, pathname):
            self.pathname = pathname
            messageFile = open(self.pathname)
            messageText = messageFile.read()
            Message.__init__(self, messageText)
            messageFile.close()

        def name(self):
            return self.pathname

...so the Message class dictates that all Messages must have a name unique to their corpus, but doesn't dictate how that name is determined. Concrete Message-derived classes fill in that detail. [I may be putting too much into the base class by demanding that the text of the message be given to the constructor - that precludes making FileMessage lazy, and only read the file when it needs to.]

'Corpus' works the same way; again, the details may be naive, but this is the general idea:

    import fnmatch
    import os

    class Corpus:
        """A collection of Message objects."""
        def __getitem__(self, messageName):
            """Makes Corpus act like a dictionary: a la corpus[messageName]"""
            raise NotImplementedError

    class DirectoryCorpus(Corpus):
        """Represents a corpus of messages stored as individual files in a
        directory. Example: corpus = DirectoryCorpus('mydir', '*.msg')"""
        def __init__(self, directoryPathname, globPattern):
            self.directoryPathname = directoryPathname
            self.globPattern = globPattern
            self.messageCache = {}  # The messages we've read from disk so far.

        def __getitem__(self, messageName):
            try:
                return self.messageCache[messageName]
            except KeyError:
                if not fnmatch.fnmatch(messageName, self.globPattern):
                    raise KeyError("Message name doesn't match naming pattern")
                pathname = os.path.join(self.directoryPathname, messageName)
                message = FileMessage(pathname)  # May raise IOError - let it.
                self.messageCache[messageName] = message
                return message

Here I've implemented the laziness idea by only reading the file when it's asked for. Maybe the message cache should go in Corpus - that would be useful for *all* Corpus implementations. You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

> move [Messages] from one [Corpus] to another, with the appropriate
> untraining/retraining built in.

Yes - this could work using observer objects registered with Corpus objects:

    class CorpusObserver:
        """Derive your class from this and call corpus.addObserver to be
        informed when things happen to a corpus."""
        def onAddMessage(self, corpus, message):
            """Called when a message is added to a corpus."""
            pass  # Not NotImplementedError, so that people don't have to
                  # implement *all* the event functions of CorpusObserver.

    class Corpus:
        def __init__(self):
            self.observers = []     # List of CorpusObservers to inform of events
            self.messageCache = {}  # Messages held so far, keyed by name

        def addObserver(self, observer):
            self.observers.append(observer)

        def addMessage(self, message):
            """External code adds messages by calling this - for example, in
            an OutlookCorpus it would be called as a result of the user
            dragging a message into the folder."""
            self.messageCache[message.name()] = message
            for observer in self.observers:
                observer.onAddMessage(self, message)

    class AutoTrainer(CorpusObserver):
        """Trains the given classifier when messages are added or removed
        from the given Ham/Spam corpuses."""
        def __init__(self, bayes, hamCorpus, spamCorpus):
            self.bayes = bayes
            self.hamCorpus = hamCorpus
            self.spamCorpus = spamCorpus
            hamCorpus.addObserver(self)
            spamCorpus.addObserver(self)

        def onAddMessage(self, corpus, message):
            if corpus == self.spamCorpus:
                self.bayes.learn(tokenize(message), True)
            else:
                assert corpus == self.hamCorpus, "Unknown corpus"
                self.bayes.learn(tokenize(message), False)

...and likewise for removeMessage, onRemoveMessage and unlearn.

> I'm going to be travelling for the rest of the week, and may not be able to
> connect, so you may not hear from me till Friday sometime...

OK. Hopefully this will get to you before you leave, and give you plenty to think about. You might want to run it past Tim Peters, 'cos he's *far* better at this kind of thing than I am (though he's also busy). I think this is the sort of thing he has in mind.

Most of the *new* code that's needed is defining the abstract concepts and their interfaces, rather than writing code that actually *does* anything - it's building a framework. Once the framework is there, most of the code needed to implement the functionality should already be in the project - code to hook into Outlook, to train on a message, to parse mbox files, and so on. It just needs hooking into the framework.
The mark of a good framework is when you write a tiny little class (like AutoTrainer above for instance) that contains hardly any code but adds a major new feature (in this case, automatic training when moving messages around in Outlook).

-------------------------------------------------------------------------

-- Richie Hindle richie@entrian.com

From tim.one@comcast.net Thu Nov 7 21:00:21 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 16:00:21 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <20021106141609.B31428@discworld.dyndns.org>
Message-ID:

[Tim]
> It will also create a database size problem: without a strategy for
> pruning useless words, the database will grow without bounds

[Charles Cazabon]
> Did you actually find this?

Yes.

> I found the growth tailed off dramatically after not too long.

That too -- the second derivative is negative from the start, but the first remains positive. "It's like" log that way, growing ever more slowly, but inexorably.

> I no longer have the exact numbers, but database growth for
> me tailed off almost to nothing after I had trained on something like
> 1500 messages.

When I run my c.l.py test, 10 classifiers are built each training on about 30,000 msgs. The classifier pickles hug 18MB each then.

My classifier at work has been trained on about 1,100 msgs, and its classifier pickle is about 2MB.

My classifier at home has been trained on about 3,000 msgs, and its classifier pickle is about 4MB.

That last one is from memory, so when I get home I'll make up a different number so that the three points exactly fit a log curve.

Nobody has used this system long enough under a high enough daily load yet to get frantic about database bloat, but the people who have run very large tests must all be aware that it's inevitable (without pruning). I've already noticed the increase in startup time on my home box, due to loading a bigger pickle every day.
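
As a quick arithmetic check on the three size figures above (one of which is quoted from memory), the sketch below computes the slope between successive points on a log-log scale. The helper name loglog_slopes is made up for this example. A slope below 1 means the pickle grows more slowly than the number of trained messages, while a positive slope means it has not stopped growing.

```python
import math

# (messages trained on, approximate pickle size in MB), as quoted above.
points = [(1100, 2.0), (3000, 4.0), (30000, 18.0)]


def loglog_slopes(points):
    """Return the log-log slope between each pair of neighbouring points."""
    slopes = []
    for (n0, size0), (n1, size1) in zip(points, points[1:]):
        slopes.append(math.log(size1 / size0) / math.log(n1 / n0))
    return slopes


slopes = loglog_slopes(points)

# Size per thousand trained messages, to show the per-message cost falling.
mb_per_thousand = [size / (n / 1000.0) for n, size in points]
```

On these three figures both slopes come out between 0 and 1, and the megabytes per thousand messages fall at each step, which matches the "ever more slowly, but inexorably" description.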
From tim@zope.com Thu Nov 7 22:11:22 2002 From: tim@zope.com (Tim Peters) Date: Thu, 7 Nov 2002 17:11:22 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <200211062219.gA6MJe502959@localhost.localdomain> Message-ID: [Anthony Baxter] > Note that "random sample" is not as trivial as all that, either - if > you have a very high ham:spam ratio in your training DB, your accuracy > will suffer (see the tests from Alex, myself and others). I still need to try to make sense of those tests. A real complication is that more than one thing changes when trying to test ratios: it's not just the ratio that changes, it's the absolute number of each trained on too. For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham and 10000 spam. The ratios are identical. Do we expect the error rates to be identical too? I don't, but haven't tried it. I expect the latter would do better than the former, despite the identical ratios, simply because more msgs allow better spamprob estimates. Something missing in "the ratio tests" is a rationale (even an after-the-fact one) for believing there's some aspect of the system that's sensitive to the ratio. The combining method certainly is not, and the spamprob estimation (update_probabilities()) deliberately works with percentages instead of raw counts so that the ham::spam training ratio has no direct effect on the spamprobs calculated. > An easy example of this is those of us who are on a bunch of higher > volume python.org lists - Greg's sterling work there means that very > little spam gets through there. The total # of spam training msgs does limit how high a spamprob can get, and the total # of ham training msgs limits how low. The *suspicion* I had running my large c.l.py test is that it wasn't the ratio that mattered so much as the absolute number, and that the error rates didn't "settle down" to the 4th digit until I got near 10,000 spam total. 
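
A small sketch of both points, under stated assumptions: the by-counting
estimate uses per-class percentages, followed by the adjustment toward the
unknown-word probability, weighted by how many messages the word was seen
in. The function name, the strength value 0.45 and the sample counts are
illustrative assumptions; only the 0.5 default for the unknown-word
probability comes from this list.

```python
def spamprob(hamcount, spamcount, nham, nspam,
             unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Estimate a word's spamprob from message counts.

    hamcount/spamcount: messages of each class containing the word.
    nham/nspam: total messages of each class trained on.
    """
    n = hamcount + spamcount
    if n == 0:
        return unknown_word_prob  # never seen: no evidence at all
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (spamratio + hamratio)  # percentages, not raw counts
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * prob) / (s + n)


# A word present in 10% of ham and 50% of spam, at three training mixes.
small_5_to_1 = spamprob(500, 500, 5000, 1000)
large_5_to_1 = spamprob(5000, 5000, 50000, 10000)
even_1_to_1 = spamprob(100, 500, 1000, 1000)

# A word present in every spam and no ham: the ceiling depends on nspam.
ceiling_10 = spamprob(0, 10, 100, 10)
ceiling_10000 = spamprob(0, 10000, 100, 10000)
```

The first three values agree to about three decimal places whatever the
ham:spam mix, while the ceiling creeps toward 1.0 only as the absolute
number of spam trained on grows.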
> As spambayes takes over the world, this could be a larger problem. Despite all the above , when faking "random sample" by hand in my personal classifiers, I see I've *ended up* aiming for about an equal number of each in my training data. That works well too (for me, and anecdotally -- these aren't controlled experiments). From tim@zope.com Thu Nov 7 22:35:56 2002 From: tim@zope.com (Tim Peters) Date: Thu, 7 Nov 2002 17:35:56 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message-ID: [Paul Moore] > That sounds like the best option. But it makes me wonder - what is a > "Spam" folder, and what is a "Ham" folder, in this context? My best > guess is that we're looking at the folders defined in the training > dialog. Right, that's what I meant. > I'm having difficulty following the addin code, but that feels > logical (I've never seen an Outlook addin before, so I'm struggling > with "lots of code, can't see the flow" problems ATM...) It's a GUI app: all the interesting things happen by magic via callbacks and hooks, and tracing the connections between what the user sees and "pieces of code" is puzzling. MS MAPI is also a massive, low-level API. Add prints to the code and they'll be displayed in PythonWin's trace window; that can help. I'm more lost than not in it myself! >> The Define Filters dialog has a multi-selection folder control, >> ... > I'm not entirely sure I do. As I said, anything moved by the rules > wizard is list traffic, and as such is (a) non-spam (so no need to > check it) and (b) not at all typical of personal mail. My intuition > says that including list traffic will tend to dilute the clues which > distinguish personal mail and spam. Don't worry about it before you try it. 
I suggest trying it because I'm not sure it's possible to *stop* the system now from scoring all incoming msgs (the "new msg in Inbox" filter appears to trigger for every one, regardless of whether the RW decides to move it; after that it may just be a race between the RW and the addin deciding where to move each). > Of course, I know that the classifier *really* works by magic, and > so my intuition is useless :-) It's more that unless you know exactly how the math works, your intuition is simply baseless here, carried over from some other experience. Do *you* have trouble distinguishing personal and work email from spam? There you go, and you can't even compute inverse chi-squared probabilities to 14 significant digits on demand in your head . > .. > You could easily be right on this. It's not so much that I don't want > an Unsure folder, as that I don't know how best to manage it. What's to manage? I get about 600 emails per day, and about 1% end up in Unsure (about 6 -- actually less than that, lately; the system is learning). Looking at 6 msgs is no burden. I often find that msgs that end up here are neither ham *nor* spam to me, and then don't train on them at all. Jeremy Hylton said the same today about his experience -- we're both glad we see them instead of calling them spam, and we're both glad they didn't clutter our Inbox. It's peculiar that there are msgs that are subjectively neither ham nor spam (I wasn't expecting this!), and it's downright spooky that the Unsure folder tends to collect them. > My instinctive reaction is that I want "Spam" and "Not Spam" buttons, > and then I read or delete the message in situ. MarkH has since implemented this in the Unsure folder. > Using the act of moving the message to indicate the status feels wrong. > > But maybe, in the light of what you said above (about watching > multiple folders), I need to rethink this - for "normal mail" folders > at least, if not for list traffic. 
> > OK, I'll try thinking in terms of 4 categories of folder - ham, spam, > unsure, and "list traffic". I still think you're making life too complicated. Is list traffic spam? If so, call it spam. If not, call it ham. > ... > the delete button. Maybe that's worth doing. It's back to that "how > does the classifier know?" question again :-) It knows what you teach it, of course . From just@letterror.com Thu Nov 7 23:12:45 2002 From: just@letterror.com (Just van Rossum) Date: Fri, 8 Nov 2002 00:12:45 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: Tim Peters wrote: > [T. Alexander Popiel] > > Why don't we just store the counts, and only compute the probabilities > > when we need to reference them? Yes, it is more efficient for bulk > > testing to only compute the probabilities once, but it's definitely > > a lose for incremental training. > > Unqualified judgments are always wrong . I often get email in batches > of 200, and scoring speed is important to me -- much more so than training > speed. It will be even more so at python.org, where training probably won't > occur more often than once a week, but scoring is ongoing around the clock. I think it can be done with almost no extra overhead with a caching scheme. This assumes (probably wrongly ) that the cache stays in memory between runs. Something like this perhaps: *** classifier.py Thu Nov 7 23:03:07 2002 --- classifier.py.hack Fri Nov 8 00:04:05 2002 *************** *** 456,459 **** --- 456,460 ---- wordinfoget = self.wordinfo.get + spamprobget = self.spamprobcache.get now = time.time() for word in Set(wordstream): *************** *** 463,467 **** else: record.atime = now ! prob = record.spamprob distance = abs(prob - 0.5) if distance >= mindist: --- 464,470 ---- else: record.atime = now ! prob = spamprobget(word) ! if prob is None: ! 
prob = self.calcspamprob(word, record) distance = abs(prob - 0.5) if distance >= mindist: Just From popiel@wolfskeep.com Fri Nov 8 00:06:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 07 Nov 2002 16:06:27 -0800 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message from "Tim Peters" References: Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[Anthony Baxter] >> Note that "random sample" is not as trivial as all that, either - if >> you have a very high ham:spam ratio in your training DB, your accuracy >> will suffer (see the tests from Alex, myself and others). > >I still need to try to make sense of those tests. A real complication is >that more than one thing changes when trying to test ratios: it's not just >the ratio that changes, it's the absolute number of each trained on too. True. >For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham >and 10000 spam. The ratios are identical. Do we expect the error rates to >be identical too? I don't, but haven't tried it. I have tried this, and the effects of ratio were diminished as the training set size increased. For details, see http://www.wolfskeep.com/~popiel/spambayes/ratio2 . The tests were done with gary-combining, not chi-square, so I really ought to rerun them. >I expect the latter would do better than the former, despite the identical >ratios, simply because more msgs allow better spamprob estimates. It depended on what the ratio in question was... for 1:4 ham:spam, increased training set size hurt instead of helped, in the ranges that I was able to test. For 1:1, increased training helped instead of hurt. >Something missing in "the ratio tests" is a rationale (even an >after-the-fact one) for believing there's some aspect of the system that's >sensitive to the ratio. 
The combining method certainly is not, and the
>spamprob estimation (update_probabilities()) deliberately works with
>percentages instead of raw counts so that the ham::spam training ratio
>has no direct effect on the spamprobs calculated.

Eh, I have a perfectly good rationale for believing that something is
sensitive to the ratio: the tests I've run show such a sensitivity.
What's missing is a theory on _why_ there's a sensitivity. ;-)

I don't think the following theory is perfectly phrased, but it seems
plausible to me:

Perhaps the number of topics discussed in ham is greater than that in
spam. Thus, the average percentage of ham messages containing a
particular significant ham word is systematically lower than the average
probability of a particular significant spam word appearing in spam
messages. As the training set size increases, the percentage difference
becomes more consistent and pronounced. Since we're then combining the
percentages, we systematically skew slightly due to the differing
averages.

Changing the ratio of ham to spam has the effect of changing the number
of topics discussed, particularly when the training set size is small and
random chance can exclude all instances of a given topic. Balancing the
number of topics removes the skew in the probabilities. As training set
size increases, adjusting the ratio has less effect, because it has less
likelihood of eliminating topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low. The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.
I suspect that by the time the corpora got that large, adjusting the training ratio wouldn't make a lick of difference if the corpora were sampled randomly to achieve the given ratio. There would just be too little chance of excluding a topic from the samples. Systematically excluding a topic might produce equivalent results to my ratio tests. - Alex From richie@entrian.com Fri Nov 8 00:17:25 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 08 Nov 2002 00:17:25 +0000 Subject: [Spambayes] SMTP proxy questions Message-ID: [Me] > Also on my list is to commit Tim Stone's SMTP proxy code, possibly after > integrating it with the pop3proxy (but I need to discuss that with you, > Tim, after looking in more detail at the code, hopefully tonight). I've discussed this with Tim S, and he's going off the SMTP proxy idea while I'm still broadly in favour of it. What do people think - do non-Outlook users want to forward messages to 'spam' and 'ham' to train the system, or use an HTML UI? The most difficult problem for retraining-by-forwarding is matching the forwarded message to one from the cache, after Outlook Express has stripped the headers, top-quoted the users .sig, converted it to HTML and added fifteen macro viruses. Any ideas? Can the tokeniser help? Or perhaps there's another way. The only other option I'd thought of was to add two hyperlinks to the end of the message, "This is spam" and "This is ham" (in ways that would work for both HTML and plain-text messages, in both HTML and plain-text email clients). They'd link to the HTML interface and tell it the cache ID of the message. Adding content to emails is way more intrusive (and difficult) than adding headers. But no more intrusive than the .sig that mailman adds. 
-- Richie Hindle richie@entrian.com From anthony@interlink.com.au Fri Nov 8 00:30:09 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 08 Nov 2002 11:30:09 +1100 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: <200211080030.gA80UAf11390@localhost.localdomain> > I've discussed this with Tim S, and he's going off the SMTP proxy idea > while I'm still broadly in favour of it. What do people think - do > non-Outlook users want to forward messages to 'spam' and 'ham' to train the > system, or use an HTML UI? I'd have to say I don't like the idea. There's too many potential places where it can all go horribly horribly pear-shaped, and too many rat-holes that the various email clients can screw up with. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From jbublitz@nwinternet.com Fri Nov 8 01:15:29 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Thu, 07 Nov 2002 17:15:29 -0800 (PST) Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: On 08-Nov-02 Richie Hindle wrote: > Or perhaps there's another way. The only other option I'd > thought of was to add two hyperlinks to the end of the message, > "This is spam" and "This is ham" (in ways that would work for > both HTML and plain-text messages, in both HTML and plain-text > email clients). They'd link to the HTML interface and tell it > the cache ID of the message. Adding content to emails is way > more intrusive (and difficult) than adding headers. But no more > intrusive than the .sig that mailman adds. What about adding a MIME object to the msg with the Spambayes info (text/spambayes?) - or will forwarding lose that info too? The email module should be able to do this. Jim From tim.one@comcast.net Fri Nov 8 04:07:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 07 Nov 2002 23:07:18 -0500 Subject: [Spambayes] Proposing to drop retain_pure_html_tags In-Reply-To: Message-ID: FYI, that option is gone now. 
From tim.one@comcast.net Fri Nov 8 04:29:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:29:17 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To:
Message-ID:

The original names made more sense when we had half a dozen competing
schemes.

    Current                           Proposed
    -------                           --------
    robinson_probability_x            unknown_word_prob
    robinson_probability_s            unknown_word_strength
    robinson_minimum_prob_strength    minimum_prob_strength

Note: unknown_word_prob is what the Bayesian prob adjustment moves toward,
more strongly the less evidence backs up a counting spamprob estimate (the
fewer the msgs a word has been seen in, the more the adjustment pushes the
spamprob toward unknown_word_prob; for a word that's never been seen
before, this reduces to unknown_word_prob exactly).

We've always set it to 0.5 by default, and previous tests never showed
benefit from changing that. We've gotten better since then, though, and
it's possible to deduce "a more correct" value. For example, take the mean
of all the by-counting spamprobs in your database, across words that have
appeared in at least 10 msgs (so that there's reason to have *some*
confidence in the by-counting guess). That's then an estimate of the
spamprob a new word will eventually get over time. Across 3 databases I
tried this on, it turned out to be a little over 0.5, from 0.513 (my home
personal classifier) to 0.540 (fat c.l.py test).

If someone has time for a controlled experiment, run the attached code to
find this guess for one of your databases; then if it differs from 0.5,
try a before-and-after test just changing that much. If there's any
promise here, update_probabilities() could easily be changed to compute
and use this automatically.

"""
import cPickle as pickle

f = file('fat.pik', 'rb')   # your database pickle goes here
c = pickle.load(f)
f.close()

w = c.wordinfo

def guessx():
    # Guard against an empty class so the ratios below can't divide by 0.
    nham = float(c.nham or 1.0)
    nspam = float(c.nspam or 1.0)
    n = 0
    probsum = 0.0
    for rec in w.itervalues():
        # Only trust words seen in at least 10 msgs.
        if rec.hamcount + rec.spamcount >= 10:
            hamratio = rec.hamcount / nham
            spamratio = rec.spamcount / nspam
            prob = spamratio / (spamratio + hamratio)
            probsum += prob
            n += 1
    # Number of words used, and their mean by-counting spamprob.
    print n, probsum / n

guessx()
"""

From mhammond@skippinet.com.au Fri Nov 8 04:48:54 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 8 Nov 2002 15:48:54 +1100
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To:
Message-ID:

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts.

I think that the classes I posted a while ago suffer from the exact
reverse problem as your idea. My idea was to make a "message store" that
is largely independent of training. I believe the problem with your design
is that it deals with the training at the expense of the message store.

Obviously, but worth mentioning, is that there are competing interests
here. My focus is towards clients, and specifically the Outlook one (if
there were more clients I would be happy to think of them too). A lot of
the focus of this group is towards admins rather than individuals (which
is just fine!)

But it seems the current thinking is of a corpus as being a fairly static,
well-controlled set of messages used almost purely for training purposes.
For client programs, this may not be practical. The corpus is a more
dynamic set of messages - and worse, actually *is* the user's set of
messages rather than a collection of message copies. For example, "moving"
a message in a corpus may actually mean moving the message in the user's
real inbox. This may or may not be what is intended - a corpus "move"
operation is more about changing a message's classification than it is
about physically moving pieces of mail around.
> A Corpus wouldn't know how to create Message objects, nor would a Message > object know how to create itself - classes *derived from* them would know > how to do that. For instance (totally untested code, probably full of > typos) - > > class Message: Jeremy and I both posted real code, so starting with something that takes that into consideration would be good. > I may be putting too much > into the base class by demanding that the text of the message be given to > the constructor - that precludes making FileMessage lazy, and > only read the > file when it needs to.] It also defeats the abstract nature of the class. > 'Corpus' works the same way; again, the details may be naive, but this is > the general idea: I'm hoping I don't sound grumpy, but again, the few systems that already exist for this engine are the best ones to use to discover the naivety early > You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an > IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on. I can't quite imagine that at the moment, as per my comments at the top. Off the top of my head, I believe we need: * An abstract "message id" * A message classification database, as discussed before - basically just a dictionary, keyed by ID, holding either "spam" or "ham". * A "corpus" becomes just an enumerator of message IDs for bulk/batch training. It has no move etc operations. * A "message store" is capable of returning a message object given its ID. * The training API simply takes message objects and updates the probability and message databases. At that level, we really don't need much else - no folders or any other grouping of messages. I'm really not too sure there is much value in adding higher-level concepts such as folders or message store "move" operations - certainly not at the outset, where there are too many competing requirements. 
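
The five pieces listed above might fit together roughly as below. This is
a sketch under assumptions, not an agreed design: every class and function
name is hypothetical, message IDs are plain strings, messages are plain
text, and the classifier is a toy that only counts tokens. Consulting the
classification database first is what lets train() skip a message it has
already seen with the same label, and undo the old training before
applying a correction.

```python
class CountingClassifier:
    """Hypothetical stand-in for the classifier: per-token [ham, spam] counts."""
    def __init__(self):
        self.counts = {}

    def learn(self, tokens, is_spam):
        for tok in set(tokens):
            self.counts.setdefault(tok, [0, 0])[int(is_spam)] += 1

    def unlearn(self, tokens, is_spam):
        for tok in set(tokens):
            self.counts[tok][int(is_spam)] -= 1


class DictMessageStore:
    """A "message store": returns a message given its abstract ID."""
    def __init__(self, messages):
        self._messages = dict(messages)

    def get_message(self, msg_id):
        return self._messages[msg_id]


def train(classifier, classifications, store, msg_id, label):
    """Train one message; classifications maps ID -> "ham" or "spam".

    Returns True if the databases changed, False if the message was
    already trained with this label.
    """
    assert label in ("ham", "spam")
    previous = classifications.get(msg_id)
    if previous == label:
        return False
    tokens = store.get_message(msg_id).lower().split()
    if previous is not None:
        # Correcting an earlier decision: undo it before learning anew.
        classifier.unlearn(tokens, previous == "spam")
    classifier.learn(tokens, label == "spam")
    classifications[msg_id] = label
    return True


def train_corpus(classifier, classifications, store, corpus, label):
    """A "corpus" is just an iterable of message IDs for batch training."""
    return sum(train(classifier, classifications, store, msg_id, label)
               for msg_id in corpus)


store = DictMessageStore({"id-1": "meeting at noon", "id-2": "cheap pills"})
classifier = CountingClassifier()
classifications = {}

trained = train_corpus(classifier, classifications, store,
                       ["id-1", "id-2"], "ham")
corrected = train(classifier, classifications, store, "id-2", "spam")
repeated = train(classifier, classifications, store, "id-2", "spam")
```

Nothing here knows about folders or moving mail; a client only has to
supply IDs and a way to fetch the message behind each one.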
> Yes - this could work using observer objects registered with Corpus > objects: This could work, but may be too simple to be necessary. If the process of re-training a message in the Outlook GUI becomes: def RetrainMessageAsSpam(): # Outlook specific code to get an ID. message = message_store.GetMessage(id) if not classifier.IsSpam(message): classifier.train(message, is_spam=True) And not a whole lot else, it doesn't seem worth it. Unfortunately, the decision to perform the retrain is the complex, but client specific part. Is this a newly delivered message? Did the user manually move the message somewhere? Did the user click one of our buttons? Is the user deleting old ham that we want to train on before it dies forever? Outlook does this via examining what Outlook event we are seeing, and looking at meta-data we possibly previously attached to the message. I'm not sure this can be encapsulated well at the moment without adding all our meta-data etc baggage to the base classes. > Most of the *new* code that's needed is defining the abstract concepts and > their interfaces, rather than writing code that actually *does* anything - > it's building a framework. *cough* ummm... This is doomed to failure. Code *must* do something to be taken seriously. At the very least, I would expect to see the existing test driver framework running against these "abstract concepts" > Once the framework is there, most of the code needed to implement the > functionality should already be in the project - code to hook > into Outlook, > to train on a message, to parse mbox files, and so on. It just needs > hooking into the framework. See above . Mark. From tim.one@comcast.net Fri Nov 8 04:50:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 07 Nov 2002 23:50:42 -0500 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: [Richie Hindle] > ... 
> The most difficult problem for retraining-by-forwarding is matching the > forwarded message to one from the cache, after Outlook Express > has stripped the headers, top-quoted the users .sig, converted it > to HTML and added fifteen macro viruses. Any ideas? If user can be convinced to forward as an *attachment*, those problems go away, at least in OE. You can create a new msg there, select any number of msgs, drag them to the msg as a group, and OE will create an attachment for each one. Unlike Outlook, OE appears to save the original stuff that came in over the wire (we're finding it's a real hoot in the OL client to try to guess what the original MIME structure may have been). > Can the tokeniser help? If you put in a token unique to each msg, sure . Perhaps the "loose checksum" program Skip checked in could be useful for this. From tim.one@comcast.net Fri Nov 8 05:06:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 00:06:43 -0500 Subject: [Spambayes] Upgrade problem In-Reply-To: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com> Message-ID: [Richie Hindle] > A quick note in case someone decides to remove the counts from the > database: Neil Schemenauer already does, in his CDB code (neil*.py). It's a lean scoring-only database, mapping tokens to *just* spamprobs. If he went on to store them as scaled ints, he could almost certainly reduce this to 2 bytes of prob info per token, and possibly even just 1. > the HTML front end has a "Word query" feature which will tell you the > information in the database for a given word - it's interesting to see > how many more times the word 'Viagra' appears in ham than in spam. I > mean the other way round. What a geek . From tim.one@comcast.net Fri Nov 8 05:48:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 00:48:25 -0500 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: [Just van Rossum] > I think it can be done with almost no extra overhead with a > caching scheme. 
This assumes (probably wrongly ) that > the cache stays in memory between runs. > Something like this perhaps: > > *** classifier.py Thu Nov 7 23:03:07 2002 > --- classifier.py.hack Fri Nov 8 00:04:05 2002 > *************** > *** 456,459 **** > --- 456,460 ---- > > wordinfoget = self.wordinfo.get > + spamprobget = self.spamprobcache.get > now = time.time() > for word in Set(wordstream): > *************** > *** 463,467 **** > else: > record.atime = now > ! prob = record.spamprob > distance = abs(prob - 0.5) > if distance >= mindist: > --- 464,470 ---- > else: > record.atime = now > ! prob = spamprobget(word) > ! if prob is None: > ! prob = self.calcspamprob(word, record) > distance = abs(prob - 0.5) > if distance >= mindist: Sorry, I don't know what this is trying to accomplish. Like, what is self.spamprobcache? There's no such thing now, and the patch doesn't appear to create one (i.e., this code doesn't run). Whatever it's supposed to be, why isn't spamprobcache.get *itself* responsible for returning a spamprob, instead of making its caller deal with two cases? If the answer is "it's supposed to be a dict, so .get ain't that smart", then the memory burden for a long-running scorer process will zoom, negating one of the benefits people attached to "real databases" thought they were buying in return for giant files and slothful performance . Life would be easier if databaseheads trained all they liked as often as they liked, but refrained from calling update_probabilities() until the end of the day (or other "quiet time"). The idea that the model should be updated after every msg trained on is an extreme. From tim.one@comcast.net Fri Nov 8 06:23:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 01:23:13 -0500 Subject: [Spambayes] Corpus module (was: Upgrade problem) In-Reply-To: Message-ID: [Richie Hindle, cogitates about Messages and their Corpus(ora)] That's the ticket! Backing off to a more fundamental level looks useful to me too. 
We never even straightened that much out for testing purposes (msgs.py isn't general enough; for some custom test drivers (never checked in), I couldn't even reuse the MsgStream class for my *own* directory structures). I disagree with Mark's > If the process of re-training a message in the Outlook GUI becomes: > > def RetrainMessageAsSpam(): > # Outlook specific code to get an ID. > message = message_store.GetMessage(id) > if not classifier.IsSpam(message): > classifier.train(message, is_spam=True) > > And not a whole lot else, it doesn't seem worth it. because it illustrates the point : it doesn't look like a correct re-training method (although it may be, depending on assumptions about where "id" comes from, and what assorted classifier methods do), and while a correct method shouldn't be hard, in the absence of a class dedicated to doing the simple common things that *can* be done in a common way, everyone will keep screwing it up in their own client code. > ... > You might want to run it past Tim Peters, 'cos he's *far* better at this > kind of thing than I am (though he's also busy). I have to do more Python and Zope work now, so have to guard my time on *this* project more jealously than I have. MarkH and SeanT and JeremyH all have ideas here too, and I trust you'll sort them out as a harmonious family bent on world domination. As a general strategy, the first person to check code in usually wins . > ... > The mark of a good framework is when you write a tiny little class (like > AutoTrainer above for instance) that contains hardly any code but adds a > major new feature (in this case, automatic training when moving messages > around in Outlook). The client-specific code to hook and track msg movement in Outlook is relatively massive, so everything else appears a drop in the bucket to Mark. 
Nevertheless, if a usable framework for capturing the *common* part of this stuff were available, removing the 5 lines of code quoted above would help (the Outlook client, and all others). From B-Morgan@concentric.net Fri Nov 8 06:25:30 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Thu, 7 Nov 2002 23:25:30 -0700 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: As I see it, having pop3proxy keep copies of the messages and using an HTML UI for training has the least amount of dependancy on the email client's forwarding capabilities (or lack thereof). I have a severe aversion to opening spam that will probably carry over to unsure messages, so having a link added to the message body may not do me much good. I will, however, go to an HTML UI and examine a message if that UI doesn't "execute" the HTML. I don't want to see pretty, raw data is good enough for me to decide. I hate to keep mentioning a "rival" project , but popfile's UI seems pretty close to what I think would work best here. Regards, Brad -----Original Message----- From: spambayes-bounces@python.org [mailto:spambayes-bounces@python.org]On Behalf Of Richie Hindle Sent: Thursday, November 07, 2002 5:17 PM To: spambayes@python.org Subject: [Spambayes] SMTP proxy questions [Me] > Also on my list is to commit Tim Stone's SMTP proxy code, possibly after > integrating it with the pop3proxy (but I need to discuss that with you, > Tim, after looking in more detail at the code, hopefully tonight). I've discussed this with Tim S, and he's going off the SMTP proxy idea while I'm still broadly in favour of it. What do people think - do non-Outlook users want to forward messages to 'spam' and 'ham' to train the system, or use an HTML UI? The most difficult problem for retraining-by-forwarding is matching the forwarded message to one from the cache, after Outlook Express has stripped the headers, top-quoted the users .sig, converted it to HTML and added fifteen macro viruses. Any ideas? 
Can the tokeniser help? Or perhaps there's another way. The only other option I'd thought of was to add two hyperlinks to the end of the message, "This is spam" and "This is ham" (in ways that would work for both HTML and plain-text messages, in both HTML and plain-text email clients). They'd link to the HTML interface and tell it the cache ID of the message. Adding content to emails is way more intrusive (and difficult) than adding headers. But no more intrusive than the .sig that mailman adds. -- Richie Hindle richie@entrian.com _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes From tim.one@comcast.net Fri Nov 8 06:46:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 01:46:14 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > ... > I'm assuming (based on a message I recall seeing recently) that it's > possible to "correct" training - ie, if I train the classifier that a > specific message is spam, I can later say "no it isn't, it's ham". That's right, and at the level of classifier.py it's a two-step process: unlearn() as spam, then learn() as ham. It actually doesn't matter which order those are done in, but I won't admit to that . > Assuming that this is so, is it not reasonable to train dynamically > on an "assume I got it right" basis? Depending on context, it *may* be. > In other words, whenever the addin filters a message as ham or spam, > automatically train on that basis as well. Then, if the user sees a > mistake, he corrects it, which automatically retrains the classifier > (manually deleting as spam or moving a message already does this). Assuming a conscientious user, and a client that knows enough about what the user is doing, that should work fine. 
> This will keep the database right up to date, and all the user has to > do is correct any bad decisions the classifier makes (which he should > be doing anyway). > > I've ignored database growth issues, but other than that, is there any > other problem with this approach? Doubtless hundreds, but why quibble . A misclassified msg will have bad effects at once if the training gets reflected into the probabilities at once, so it gets less appealing the less zealous the user is about correcting mistakes right away. That can be mitigated by doing the day's training into a distinct dict, or not calling update_probabilities() in a single dict, until "the end of the day", when the user has (presumably) corrected all the day's mistakes they're going to correct. But if the model updating is going to be delayed anyway, then it makes as much sense to delay doing any training on "the day's" msgs until the end of the day. Determining what "the end of the day" means is a puzzle then too. For example, maybe I left my email client running and went on a week-long vacation. I'm not going to look over 700 presumed spam when I get back, I'll just delete it. But if ham was in there, I've now let it train in the wrong direction, and that will hurt. In other contexts, the scheme doesn't get off the ground. For example, for python.org use, nobody is going to review msgs claimed to be spam. A system feeding on its own judgments is going to reinforce its own mistakes too, so the "conscientious, timely, reviewing human" bit is important. From tim.one@comcast.net Fri Nov 8 07:20:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 02:20:18 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message-ID: [Mark Hammond] > ... > The key limitation of this scheme, as Tim also alludes to, is that this > never correctly classifies ham. 
However, I actually see this > incremental training more as a "get smarter now" than a "just get > smarter" technique - ie, a user sees a mis-classified Spam, by re- > training they are increasing the chances that the next similar mail > will be handled correctly. Instant feedback, especially while the user > is getting started. > > ie, it is indeed "mistake based training", but that may still prove > useful in addition to ongoing training. I sure agree it's *very* useful at the start, and expect it will continue to be useful over time. > I can't help thinking that we are somehow underestimating our own > tool here. I'm going to try an experiment: I'm going to wipe my home database and start over from scratch, training first on one ham and one spam, then only on mistakes and unsures. This should be fun . > As is common when people first use this tool, spam is generally > found in the ham set and vice-versa. Because of this, I know that my > Inbox is spam free (but less sure about my other "ham" folders). I'm > also sure that my Spam folder has no ham. This should remain true > while continue to use the tool. How do you know your Spam folder has no ham? I know mine doesn't because I routinely score it, sort on the score, and stare at "the wrong end". I find ham there as often as not, *usually* apparently due to mousing error when dragging a training ham into the Ham folder and overshooting the mark. > So surely we can exploit this somehow. Off the top of my head: > * Assume we don't trust the last 2 days of mail (as the user may not > yet have sorted them). Anything in the "good" and "spam" folders older > than this can be assumed correctly classified, and able to be trained > on. Provided the user has already done a decent amount of training, then as Paul Moore suggested it could even work to trust ham-vs-spam decisions immediately, and let user corrections undo those as needed. 
A well-trained system should be pretty robust against a few misclassifications over the short term. > * A process could go through all ham and spam trained on, and score each > message. Any "suspect" messages are presented in a list (much like the > Outlook "Find Message" result list). The user can indicate that the > message is correct (and the system will remember, never asking about > this message again) or is indeed incorrectly classified. If incorrect, > it will be moved, and incrementally trained as per now. (I can also > picture a whitelist kicking in here; if incorrect, offer to add user to > whitelist. If user in the whitelist, assume ham thereby meaning mail > from this person can never again be spam) Tell us about the mistakes *you* see. I feel like we're designing a solution to a hypothetical problem otherwise. The only "mistake" I routinely see is that my cigarettes-via-web advertising keeps getting knocked back into Unsure territory. That doesn't bother me enough to do anything about it, but if it bothers you enough then, yes, a whitelist would solve that one. > I can picture this working in the background, and simply indicating to > the user that there are "conflicts" to be resolved at their leisure. Or maybe we could just move those back to the Unsure folder. The user should already know what to do about things in Unsure, so it's nothing new to them. Moving a msg out of Unsure could be taken as a positive sign that the user has classified such a msg once and for all (well, until they move it again, anyway). > Further, I imagine that as we build better training data for each > message store, the number of "conflicts" actually found would > generally be zero - ie, the system would find that all 2 day and > older mail correctly classifies. I expect that's true. 
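
The "conflicts" idea discussed above (rescore everything already trained on, and flag whatever no longer agrees with its label) could be sketched roughly as follows. This is only an illustration under stated assumptions: `find_conflicts`, the `(message, is_spam, received)` tuples and the `score` callable are hypothetical names, not the Outlook addin's or classifier.py's actual API, and treating anything outside its own cutoff band as "suspect" is a guess at what was meant. The 0.20 and 0.90 defaults are the ham_cutoff and spam_cutoff values quoted elsewhere in this thread. Mail younger than two days is skipped, following the suggestion that the user may not have sorted it yet.

```python
from datetime import datetime, timedelta


def find_conflicts(trained, score, now, ham_cutoff=0.20, spam_cutoff=0.90,
                   min_age=timedelta(days=2)):
    """Return trained messages whose current score disagrees with their label.

    trained: iterable of (message, is_spam, received) tuples (hypothetical shape).
    score:   callable mapping a message to a spam probability in [0, 1].
    now:     datetime used to decide which messages are old enough to trust.
    """
    conflicts = []
    for message, is_spam, received in trained:
        if now - received < min_age:
            continue  # too recent: the user may not have sorted it yet
        s = score(message)
        # Assumption: "suspect" means trained ham that does not score as ham,
        # or trained spam that does not score as spam.
        if (not is_spam and s > ham_cutoff) or (is_spam and s < spam_cutoff):
            conflicts.append((message, is_spam, s))
    return conflicts
```

Whatever comes back could be shown in a list for the user to confirm, or simply moved back to the Unsure folder as suggested above.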
From just@letterror.com Fri Nov 8 07:54:04 2002 From: just@letterror.com (Just van Rossum) Date: Fri, 8 Nov 2002 08:54:04 +0100 Subject: [Spambayes] Upgrade problem In-Reply-To: Message-ID: Tim Peters wrote: > [Just van Rossum] > > I think it can be done with almost no extra overhead with a > > caching scheme. This assumes (probably wrongly ) that > > the cache stays in memory between runs. > > Something like this perhaps: > > > > *** classifier.py Thu Nov 7 23:03:07 2002 > > --- classifier.py.hack Fri Nov 8 00:04:05 2002 > > *************** > > *** 456,459 **** > > --- 456,460 ---- > > > > wordinfoget = self.wordinfo.get > > + spamprobget = self.spamprobcache.get > > now = time.time() > > for word in Set(wordstream): > > *************** > > *** 463,467 **** > > else: > > record.atime = now > > ! prob = record.spamprob > > distance = abs(prob - 0.5) > > if distance >= mindist: > > --- 464,470 ---- > > else: > > record.atime = now > > ! prob = spamprobget(word) > > ! if prob is None: > > ! prob = self.calcspamprob(word, record) > > distance = abs(prob - 0.5) > > if distance >= mindist: > > Sorry, I don't know what this is trying to accomplish. Like, what is > self.spamprobcache? There's no such thing now, and the patch doesn't appear > to create one (i.e., this code doesn't run). Tim, don't be such a programmer . But ok, I promise I'll never post pseudocode as a patch again... > Whatever it's supposed to be, > why isn't spamprobcache.get *itself* responsible for returning a spamprob, > instead of making its caller deal with two cases? I thought I was doing your performance needs a favor . > If the answer is "it's > supposed to be a dict, so .get ain't that smart", That's the answer. > then the memory burden for > a long-running scorer process will zoom, negating one of the benefits people > attached to "real databases" thought they were buying in return for giant > files and slothful performance . Right. 
If a float takes up 20 bytes in memory (just a guess, no time to look), then for a database of 100000 words (that's roughly the size of my personal db) the memory burden is 100000 * (8 + 20), almost three megs. Just in case the higher memory usage is not an issue, there's a simpler approach: don't store spamprob in the db, but call bayes.update_probabilities() on startup. update_probabilities() takes about 2 seconds on my lowly 400Mhz PPC on my db (hm, that's using pickle, so will be a lot more when using a db :-( ). You can tell I'm thinking mostly about long running processes... I guess you're right, one size doesn't fit all. One last idea for this morning: how about splitting the db in a training db (storing hamcount and spamcount) and a classifying db (storing only spamprob)? > Life would be easier if databaseheads trained all they liked as often as > they liked, but refrained from calling update_probabilities() until the end > of the day (or other "quiet time"). The idea that the model should be > updated after every msg trained on is an extreme. Good points. Just From richie@entrian.com Fri Nov 8 08:06:33 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 08 Nov 2002 08:06:33 +0000 Subject: [Spambayes] Upgrade problem In-Reply-To: References: Message-ID: [Just] > the web interface of pop3proxy.py is pretty good and useful, the only > downside is that it saves the database after each training That's now fixed (at least partly) along with some other bits: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch. o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. 
o It now copes with POP3 servers that don't issue a welcome command.

o The training form now appears in the training results, so you can train on another message without having to go back to the Home page.

-- 
Richie Hindle
richie@entrian.com

From tim.one@comcast.net Fri Nov 8 09:15:24 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 04:15:24 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To:
Message-ID:

[Tim]
> ...
> I'm going to try an experiment: I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures. This should be fun.

It is! The msg from me I'm replying to here scored 94 (solid spam). I've now got 5 ham and 5 spam in my training set, most of the new ones from Unsures. The latest spam was a blatant false negative, from Hapax City:

'*H*'                 0.998601
'*S*'                 8.60833e-005
'can'                 0.0652174
'have'                0.0652174
"don't"               0.0918367
'never'               0.0918367
'number'              0.0918367
'one'                 0.0918367
'what'                0.0918367
'"the'                0.155172     ham hapaxes from here
'able'                0.155172
'about'               0.155172
'against'             0.155172
'also'                0.155172
'any'                 0.155172
'anything'            0.155172
'back'                0.155172
'because'             0.155172
'been'                0.155172
'check'               0.155172
'even'                0.155172
'find'                0.155172
'found'               0.155172
'heard'               0.155172
'how'                 0.155172
'into'                0.155172
"it's"                0.155172
'more'                0.155172
'needed'              0.155172
'other'               0.155172
'out'                 0.155172
'own'                 0.155172
'people'              0.155172
'skip:a 10'           0.155172
'skip:i 10'           0.155172
'special'             0.155172
'subject:.'           0.155172
'subject:: '          0.155172
'their'               0.155172
'them.'               0.155172
'they'                0.155172
'those'               0.155172
'time'                0.155172
'time.'               0.155172
'unsubscribe'         0.155172
'until'               0.155172
'useful'              0.155172
'using'               0.155172     to here
'and'                 0.275281
'for'                 0.275281
'subject: '           0.275281
'you'                 0.275281
'from'                0.355072
'not'                 0.355072
'off'                 0.355072
'our'                 0.355072
'when'                0.355072
'new'                 0.644928
'see'                 0.644928
'url:gif'             0.724719
'url:www'             0.724719
'call'                0.844828     spam hapaxes from here
'contact'             0.844828
'credit'              0.844828
'email.'              0.844828
'every'               0.844828
'further'             0.844828
'header:Received:2'   0.844828
'made'                0.844828
'more!'               0.844828
'most'                0.844828
'now'                 0.844828
'plus,'               0.844828
'receive'             0.844828
'search'              0.844828
'skip:1 10'           0.844828
'url:jpg'             0.844828     to here
'email'               0.908163

I think I've established that 5+5 isn't enough for great results. However, 80% of its decisions have been correct so far!

From tdickenson@devmail.geminidataloggers.co.uk Fri Nov 8 10:52:32 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 8 Nov 2002 10:52:32 +0000
Subject: [Spambayes] Re: unsupervised training
In-Reply-To:
References:
Message-ID: <200211081052.32567.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 08 November 2002 7:20 am, Tim Peters wrote:

> Provided the user has already done a decent amount of training, then as
> Paul Moore suggested it could even work to trust ham-vs-spam decisions
> immediately, and let user corrections undo those as needed. A well-trained
> system should be pretty robust against a few misclassifications over the
> short term.

For the last two weeks I have been using a setup that uses this type of unsupervised training.

I have a procmail filter that sends a copy of all incoming ham and spam to two separate mailboxes. These mailboxes are used for overnight batch training, then deleted. Messages marked 'Unsure' do not take part in this automatic training.

I perform separate filtering for spam and 'unsure' in my mua. So far I am manually inspecting the unsure folder, and manually adding them to the appropriate training mailboxes.
Initially about 3% of mails were 'unsure', but this has dropped to less than 1% after 2 weeks.

Starting next week I plan to change the mua filtering to treat 'unsure' the same as 'ham', and stop all manual training. It will be interesting to see if the training remains stable.

From bkc@murkworks.com Fri Nov 8 14:51:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Fri, 08 Nov 2002 09:51:15 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID: <3DCB8912.18340.2FB5F81@localhost>

On 8 Nov 2002 at 0:17, Richie Hindle wrote:

> Or perhaps there's another way. The only other option I'd thought of was
> to add two hyperlinks to the end of the message, "This is spam" and "This
> is ham" (in ways that would work for both HTML and plain-text messages, in
> both HTML and plain-text email clients). They'd link to the HTML interface
> and tell it the cache ID of the message. Adding content to emails is way
> more intrusive (and difficult) than adding headers. But no more intrusive
> than the .sig that mailman adds.

If you do this, what's to keep spammers from also adding similar looking URLs? A busy person might not notice any difference, could click through and confirm their email address...

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements

From barry@python.org Fri Nov 8 15:04:56 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 10:04:56 -0500
Subject: [Spambayes] SMTP proxy questions
References:
Message-ID: <15819.53912.407893.819241@gargle.gargle.HOWL>

>>>>> "JB" == Jim Bublitz writes:

    JB> What about adding a MIME object to the msg with the Spambayes
    JB> info (text/spambayes?) - or will forwarding lose that info
    JB> too? The email module should be able to do this.
Of course that would have to be text/x-spambayes :) -Barry From randy.diffenderfer@eds.com Fri Nov 8 17:21:25 2002 From: randy.diffenderfer@eds.com (Diffenderfer, Randy) Date: Fri, 8 Nov 2002 12:21:25 -0500 Subject: [Spambayes] SMTP proxy questions Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC295@usahm014.exmi01.exch.eds.com> |>>>>> "JB" == Jim Bublitz writes: | | JB> What about adding a MIME object to the msg with the Spambayes | JB> info (text/spambayes?) - or will forwarding lose that info | JB> too? The email module should be able to do this. | |Of course that would have to be text/x-spambayes :) | |-Barry While a fair portion of messages may very well be MIME compliant, this wouldn't work without some serious munging around for non-MIME messages, as well as being very problematic for the many deformed MIME (read very NON compliant :-) ) messages floating around out there! Just an observation... From jbublitz@nwinternet.com Fri Nov 8 17:10:33 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Fri, 08 Nov 2002 09:10:33 -0800 (PST) Subject: [Spambayes] SMTP proxy questions In-Reply-To: <15819.53912.407893.819241@gargle.gargle.HOWL> Message-ID: On 08-Nov-02 Barry A. Warsaw wrote: > >>>>>> "JB" == Jim Bublitz writes: > > JB> What about adding a MIME object to the msg with the > Spambayes > JB> info (text/spambayes?) - or will forwarding lose that > info > JB> too? The email module should be able to do this. > > Of course that would have to be text/x-spambayes :) Well - there's application/ms-excel or some such. Isn't spambayes just as good? :) Point taken. Jim From barry@python.org Fri Nov 8 17:33:53 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 8 Nov 2002 12:33:53 -0500 Subject: [Spambayes] SMTP proxy questions References: <15819.53912.407893.819241@gargle.gargle.HOWL> Message-ID: <15819.62849.101901.822699@gargle.gargle.HOWL> >>>>> "JB" == Jim Bublitz writes: JB> Well - JB> there's application/ms-excel or some such. 
Isn't spambayes JB> just as good? :) It depends on whether you hold the IETF and IANA in as high regard as Microsoft does . http://www.iana.org/assignments/media-types/ -Barry From lists@morpheus.demon.co.uk Fri Nov 8 21:07:45 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 08 Nov 2002 21:07:45 +0000 Subject: [Spambayes] Outlook plugin - training References: Message-ID: "Tim Peters" writes: [About the plugin code...] > I'm more lost than not in it myself! That makes me feel better :-) [About bothering with leaving list traffic out] > Don't worry about it before you try it. I suggest trying it because I'm not > sure it's possible to *stop* the system now from scoring all incoming msgs > (the "new msg in Inbox" filter appears to trigger for every one, regardless > of whether the RW decides to move it; after that it may just be a race > between the RW and the addin deciding where to move each). OK, I've switched over. I now have one Spam folder, one Potential Spam folder, and the rest are Ham (actually, some historic archive folders I've left out, but that's just because I never use them any more). We'll see how it goes. >> Of course, I know that the classifier *really* works by magic, and >> so my intuition is useless :-) > > It's more that unless you know exactly how the math works, your intuition is > simply baseless here, carried over from some other experience. Do *you* > have trouble distinguishing personal and work email from spam? There you > go, and you can't even compute inverse chi-squared probabilities to 14 > significant digits on demand in your head . How do *you* know I can't compute inverse chi-squared probabilities in my head? Oh, hang on - you wanted me to get the right answer, didn't you? :-) > What's to manage? I get about 600 emails per day, and about 1% end > up in Unsure (about 6 -- actually less than that, lately; the system > is learning). My ratio is still a lot worse than that. 
But as I say, my training corpus is still quite small.

But you're right - managing a few mails isn't hard. It's just that the overall results are *so* much better than the old home-grown solution I used that I became instantly spoiled :-)

Seriously, I've said this before, but what you guys have developed here is *phenomenally* good. I've reached the point where I look forward to getting spam, just because I enjoy so much seeing it automatically appear in the spam folder :-)

>> My instinctive reaction is that I want "Spam" and "Not Spam" buttons,
>> and then I read or delete the message in situ.
>
> MarkH has since implemented this in the Unsure folder.

Time for a CVS update, I guess...

> I still think you're making life too complicated. Is list traffic
> spam? If so, call it spam. If not, call it ham.

Sounds sensible. I think that all the troubles I've had in the past trying to manage spam have left me with an instinctive feeling that the problem is complicated. This leads to looking for complicated solutions. But you're right. The spam/ham distinction itself is a simple yes/no, so the setup should be, too.

But permit me to drag my feet a little, as I throw away all my cherished preconceptions :-)

More seriously, I'm putting this point into my spambayes notes folder. I suspect it's something a lot of new users will have to get used to.

Thanks for the comments,
Paul.
-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk Fri Nov 8 21:12:17 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:12:17 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
Message-ID:

I've noticed a couple of strange effects with the Outlook plugin used against an Exchange server. The main one is that when I start up the client in the morning, there are a lot of overnight messages in my inbox. They don't seem to get filtered.
I suspect this is to do with Outlook not firing the "new mail" event on stuff that's in the Exchange store when the client starts up. But I'll need to test this. Unfortunately, the Exchange server is at work, and I can only do any serious hacking on this at home, so I'm running a batch cycle (code at home, take into work, try out, take bugs home, and repeat). So it'll take me a while to make any progress. I'll report back when I get more details. Paul (Off to look at Outlook events in MSDN, and to write a simple "log the events and see what is going on" plugin to test with) -- This signature intentionally left blank From mhammond@skippinet.com.au Fri Nov 8 21:52:20 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sat, 9 Nov 2002 08:52:20 +1100 Subject: [Spambayes] Outlook plugin plus Exchange In-Reply-To: Message-ID: > I've noticed a couple of strange effects with the Outlook plugin used > against an Exchange server. The main one is that when I start up the > client in the morning, there are a lot of overnight messages in my > inbox. They don't seem to get filtered. I suspect this is to do with > Outlook not firing the "new mail" event on stuff that's in the > Exchange store when the client starts up. But I'll need to test this. I am working on code that optionally processes "missed" messages at startup. It looks like I can list all unread, unscored mail in my 1000+ item inbox very quickly, so this should be feasible. > Paul (Off to look at Outlook events in MSDN, and to write a simple > "log the events and see what is going on" plugin to test with) Check out the Outlook plugin in the win32com\demos directory - probably a good place to start. Or if anyone gets lots of KLEZ mail via Outlook, I have a plugin that does a decent job at killing them. Mark. 
From francois.granger@free.fr Fri Nov 8 23:25:51 2002
From: francois.granger@free.fr (François Granger)
Date: Sat, 9 Nov 2002 00:25:51 +0100
Subject: [Spambayes] pop3proxy
Message-ID:

Thanks to Richie Hindle, it now works on MacOS 9.

Excellent job!

-- 
Email is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies. For clean email:
http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/
http://expita.com/nomime.html

From tim.one@comcast.net Fri Nov 8 23:33:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 18:33:50 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To:
Message-ID:

[Tim]
> ...
> I'm going to try an experiment: I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures. This should be fun.
> ...

After enduring the first round of gross mistakes, when I got up today I did this:

    while some ham in my inbox scores above 0.20 (my ham_cutoff):
        pick the highest-scoring ham in the inbox
        add it to the ham training set
        train on it
        rescore the inbox

These are false positives and unsures the classifier would have had if these msgs had come in after I started the experiment. There were about 700 msgs in the inbox.

Other than that, I've left it mistake-driven and unsure-driven on live incoming email. Spam that's correctly classified simply gets deleted (no training on it), ditto ham. It's been a light spam day, but hundreds of msgs have come in since then and I haven't seen a mistake or unsure in about 5 hours, although plenty of ham gets near ham_cutoff and plenty of spam near spam_cutoff.

Total training data now: just 45 ham and 20 spam.
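
The train-the-worst-then-rescore loop described above can be written out as a small runnable sketch. Everything here is illustrative: `ToyClassifier` is a made-up stand-in that scores a message as the average per-word spam ratio (unknown words count as 0.5), which is not how the real classifier combines evidence, and `train_on_worst_ham` is a hypothetical helper name. Only the 0.20 ham_cutoff and the shape of the loop come from the description.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier; not the Spambayes API."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, words, is_spam):
        # Count each distinct word once per message.
        (self.spam if is_spam else self.ham).update(set(words))

    def score(self, words):
        # Average of per-word spam ratios; unseen words are neutral (0.5).
        probs = []
        for w in set(words):
            h, s = self.ham[w], self.spam[w]
            probs.append(0.5 if h + s == 0 else s / (h + s))
        return sum(probs) / len(probs) if probs else 0.5


def train_on_worst_ham(clf, inbox_ham, ham_cutoff=0.20):
    """Train on the highest-scoring ham, rescore, repeat until none exceed the cutoff."""
    remaining = list(inbox_ham)
    trained = []
    while True:
        # Rescore everything not yet trained on.
        bad = [(clf.score(m), i) for i, m in enumerate(remaining)
               if clf.score(m) > ham_cutoff]
        if not bad:
            break
        _, worst = max(bad)              # highest-scoring ham still over the cutoff
        msg = remaining.pop(worst)
        clf.learn(msg, is_spam=False)    # add it to the ham training set
        trained.append(msg)
    return trained
```

Each pass removes one message from `remaining`, so the loop always terminates, and only the messages that were actually needed end up in the training set.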
Scores remain grossly hapax-driven, but that's actually enough to classify most of my email correctly: a small number of subjects and senders and mailing lists overwhelmingly dominate my ham mix, and one email account accounts for the vast bulk of my spam.

Removing the hapaxes from the database dropped the # of words from 5500 to about 1700. Rescoring the inbox with this reduced database then pushed about 5% of the msgs back into Unsure. So (no surprise here) hapaxes are vital with little training data. That also means that as soon as one of those words shows up in the other kind of email, it changes from a strong clue to neutral, *provided that* I actually train on the new email. I'm not training now unless there's a mistake/unsure, so the hapaxes remain strong clues (even when they point in the wrong direction).

BTW, when there are mistakes/unsures, I'm not training on all of them: as I did when I got up, I train the worst example then rescore, one at a time, until no mistakes/unsures remain.

I'm never going to get sub-0.1% error rates this way, but if this is the best it ever got, I'd be quite happy with it for my personal email. Something to ponder? If so, you can get away with a very small database, and while hapaxes must not be removed blindly in this extreme scheme, using the atime field could (I suspect) be very effective in slashing the already-small database size (lots of hapaxes will never be seen again even if you train on everything; the WordInfo atime field tells you when a word was last used at all).

From rob@hooft.net Fri Nov 8 23:49:59 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 00:49:59 +0100
Subject: [Spambayes] Outlook plugin - training
References:
Message-ID: <3DCC4DA7.80401@hooft.net>

Tim Peters wrote:
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder?
If so, you can get away with a very small database, > and while hapaxes must not be removed blindly in this extreme scheme, using > the atime field could (I suspect) be very effective in slashing the > already-small database size (lots of hapaxes will never be seen again even > if you train on everything; the WordInfo atime field tells you when a word > was last used at all). Tim, This seems to imply that you're still playing with the idea that hapaxes could be "slashed" from the database when using the "old" train-on-all procedure. I don't see how that can ever work, as all words pass through the hapax stage at some point. Or do you mean to slash "old" hapaxes only? And what is "old"? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim@fourstonesExpressions.com Sat Nov 9 00:55:07 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 08 Nov 2002 18:55:07 -0600 Subject: [Spambayes] Persisting a pickled bayes database Message-ID: I can see the nice createbayes function in hammie, but I don't see any persistence function anywhere. I do see several places where code to write a pickled bayes database is hard coded, and I understand the PersistentBayes thing. I might be missing something... I've been using a simple class to handle creating and persisting my bayes databases. I *think* this stuff should probably go somewhere, but beats me where... classifier? doesn't make much sense there... hammie? Any ideas anybody? Here's the class... 
(kinda a dumb name ;))

    # imports implied by the code below
    import pickle
    import hammie

    class BayesHelper:
        '''helper class for bayes databases'''

        def __init__(self, db_name, useDB):
            ''' constructor '''
            self.db_name = db_name
            self.useDB = useDB
            self.bayes = hammie.createbayes(db_name, useDB)

        # no __del__() method, because we don't *necessarily* want to persist

        def persist(self):
            '''store the bayes database'''
            if not self.useDB:
                fp = open(self.db_name, 'wb')
                pickle.dump(self.bayes, fp, 1)
                fp.close()

- TimS

From tim.one@comcast.net Sat Nov 9 18:35:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 13:35:43 -0500
Subject: [Spambayes] Persisting a pickled bayes database
In-Reply-To:
Message-ID:

[Tim Stone]
> I can see the nice createbayes function in hammie, but I don't see any
> persistence function anywhere. I do see several places where code
> to write a pickled bayes database is hard coded, and I understand the
> PersistentBayes thing. I might be missing something...

Just experience with idiomatic Python persistence. The persistence was all in DBDict.__init__'s:

    self.hash = anydbm.open(dbname, 'c')

The tradition in Python is that "a persistent database" supplies an interface much like a Python dict, but persists almost purely by magic. For example, here's a brief Python session:

    C:\Code\python\PCbuild>python
    Python 2.3a0 (#29, Nov 8 2002, 10:51:55) [MSC 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import anydbm
    >>> d = anydbm.open('example.dat', 'n')
    >>> d['an'] = 'example'
    >>> # and quit Python at this point

Then in another session:

    >>> import anydbm
    >>> d = anydbm.open('example.dat')
    >>> print d
    >>> print d.keys()
    ['an']
    >>> print d['an']
    example
    >>>

Note that anydbm used bsddb as the underlying database mechanism on my box. It may use some other database mechanism on some other box (it depends on what it finds available). I could have used bsddb directly instead, of course, but then my code would require that bsddb be available. anydbm uses whatever it can scrounge up.

Subclassing the builtin dict type can give a similar "by magic" facility; e.g., here's temp.py:

"""
import cPickle as pickle
import os

class PDict(dict):
    def __init__(self, fname):
        self.fname = fname
        if os.path.exists(fname):
            f = file(fname, 'rb')
            guts = pickle.load(f)
            f.close()
            self.update(guts)
        self.is_open = True

    def close(self):
        if self.is_open:
            f = file(self.fname, 'wb')
            pickle.dump(self, f, 1)
            f.close()
            self.is_open = False

    def __del__(self):
        self.close()
"""

That just adds a few methods to a regular dict, arranging to dump its value to a pickle when .close() is called or when it becomes unreachable. It's intended that .close() be called explicitly, though (by-magic shutdown semantics are never something to bet your life on).

Then in one Python session:

    >>> from temp import PDict
    >>> d = PDict('example.pck')
    >>> d['another'] = 'example'

and in another:

    >>> from temp import PDict
    >>> d = PDict('example.pck')
    >>> d
    {'another': 'example'}
    >>>

In your example helper class, you decided you don't necessarily want to persist. That may or may not be a useful ability, but "the usual" simple Python database facilities don't give you a choice about that: they commit changes to disk *as* mutations occur. In DB terms, they view each mutation as a transaction.

The ZODB-based stuff Jeremy is doing is different that way: changes to a ZODB db have to be explicitly committed. That's what the get_transaction().commit() lines in the pspam directory are doing. ZODB is much more of "a real database" than these other gimmicks, by which I mean it has an explicit and pretty rich transactional model and API.
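
For readers on a current Python, a rough modern counterpart of the anydbm session above is the standard library's `shelve` module, which keeps pickled values in a dbm file behind a dict-like interface. This is a sketch assuming Python 3.4 or later (the thread itself predates Python 3); `roundtrip` and the file name are made up for illustration. The `with` blocks close the database explicitly rather than relying on by-magic shutdown.

```python
import os
import shelve
import tempfile


def roundtrip(path):
    """Write one key, close the shelf, reopen it, and return what was stored."""
    # First "session": open (creating if needed), mutate, close on block exit.
    with shelve.open(path) as db:
        db["an"] = "example"
    # Second "session": reopen the same file and read the data back.
    with shelve.open(path) as db:
        return dict(db)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "example")
    print(roundtrip(target))
```

As with anydbm, which dbm backend ends up on disk depends on what the platform has available.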
From tim.one@comcast.net Sat Nov 9 20:00:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 15:00:42 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCC4DA7.80401@hooft.net>
Message-ID:

[Tim]
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder? If so, you can get away with a very small
> database, and while hapaxes must not be removed blindly in this extreme
> scheme, using the atime field could (I suspect) be very effective in
> slashing the already-small database size (lots of hapaxes will never be
> seen again even if you train on everything; the WordInfo atime field
> tells you when a word was last used at all).

BTW, I'm still doing this experiment, and my total training data is up to 45 ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP and FN are both rare now, and the Unsure rate is about 5% overall and visibly falling. The Unsure spam are more surprising than the Unsure ham, but that may be more psychological than real. For example, it took about 24 hours before I got my first Nigerian spam, and it was shocking to see it score at the low end of the Unsure range.

Looking at the internals is scary. I have entire folders that are called ham seemingly because the mailing list they come from has a few lexical conventions unique to it, and the hapaxes from the single training msg from that list save almost all of that list's msgs from Unsure status.
In the msg of Rob's I'm replying to, these are all ham hapaxes:

    'database' 0.155172
    'database,' 0.155172
    'ever' 0.155172
    'idea' 0.155172
    'quite' 0.155172
    'scheme,' 0.155172
    'seen' 0.155172
    'subject:Outlook' 0.155172
    'subject:Spambayes' 0.155172
    'subject:plugin' 0.155172
    'subject:training' 0.155172
    'tells' 0.155172
    'words' 0.155172

and they slug it out with these spam hapaxes:

    'away' 0.844828
    'effective' 0.844828
    'field' 0.844828
    'mean' 0.844828
    'word' 0.844828

That 'word' is a strong spam clue but 'words' a strong ham clue should tell
us something about how robust this is .

[Rob Hooft]
> This seems to imply that you're still playing with the idea that hapaxes
> could be "slashed" from the database when using the "old" train-on-all
> procedure. I don't see how that can ever work, as all words pass through
> the hapax stage at some point. Or do you mean to slash "old" hapaxes
> only?

Well, training has no effect on scoring until update_probabilities() is
called, and in a batch-training context I mean hapax from
update_probabilities's POV. Of course hamcounts or spamcounts for new words
start life at 1, but when doing batch training I don't mean to look at the
counts until the probabilities are updated. At that point, a hapax is a
word that was seen in only one msg from the entire batch of new msgs.

Here's a quick test, based on unpublished general python.org email (we can't
publish the ham because it includes some personal email; GregW was working
on making the spam collection available, but I haven't heard about that in a
week; ditto his very large python.org virus collection).

In each case, it trains on 2,741 ham and 948 spam, then predicts the same
numbers of each. The "all" column includes hapaxes (wrt counts at the *end*
of training). The gt1 column threw away words at the end of training where
spamcount+hamcount <= 1; i.e., it retained only words that appeared more
than once, the non-hapaxes.
The gt2 column retained only words that appeared more than twice; and so on.
ham_cutoff was 0.20 here, and spam_cutoff 0.90.

filename:         all      gt1      gt2      gt3      gt4      gt5      gt6
ham:spam:    2741:948 2741:948 2741:948 2741:948 2741:948 2741:948 2741:948
fp total:           1        0        1        0        0        0        0
fp %:            0.04     0.00     0.04     0.00     0.00     0.00     0.00
fn total:           2        2        2        1        2        3        4
fn %:            0.21     0.21     0.21     0.11     0.21     0.32     0.42
unsure t:          81       87       89       82       98       96      100
unsure %:        2.20     2.36     2.41     2.22     2.66     2.60     2.71
real cost:     $28.20   $19.40   $29.80   $17.40   $21.60   $22.20   $24.00
best cost:     $22.20   $17.60   $20.00   $15.40   $16.80   $17.40   $22.40
h mean:          0.81     0.86     0.87     0.72     0.67     0.64     0.65
h sdev:          6.05     6.18     6.17     5.42     5.13     4.94     5.11
s mean:         98.00    97.66    97.54    97.38    97.03    96.62    96.52
s sdev:          9.26    10.22    10.37    10.62    11.19    12.49    12.61
mean diff:      97.19    96.80    96.67    96.66    96.36    95.98    95.87
k:               6.35     5.90     5.84     6.03     5.90     5.51     5.41

# retained words:
                74327    36437    23877    16143    12798    10719     9157

So while hapaxes are vital with very little training data, even with "just"
about 4K training msgs they didn't buy anything in this test, and neither
did words that appeared only two or three times, and it doesn't appear to be
touchy (all of these columns show excellent results!).

> And what is "old"?

That remains a good question, and a good answer may differ between personal
email and bulk email applications. A problem I see coming up in my personal
email is that some correspondents only show up once a year, and the hapaxes
they generate remain valuable clues, but only once a year. General
python.org email doesn't appear to suffer anything like that (so long as
personal email is kept out of the python.org mix).

From rob@hooft.net Sat Nov 9 22:24:52 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 23:24:52 +0100
Subject: [Spambayes] Outlook plugin - training
References:
Message-ID: <3DCD8B34.6040903@hooft.net>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment
Tim Peters wrote:
> [Tim]
>>I'm never going to get sub-0.1% error rates this way, but if this is the
>>best it ever got, I'd be quite happy with it for my personal email.

> BTW, I'm still doing this experiment, and my total training data is up to 45
> ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP
> and FN are both rare now, and the Unsure rate is about 5% overall and
> visibly falling.

I just added a testdriver to CVS that simulates your behaviour as I
understand it: It will train on the first 30 messages, plus on all
misclassified and all unsure messages. It is called "weaktest.py", and uses
the good-old-Data/{Sp|H}am hierarchy.

I think we should test its performance at different Options settings. It
may not even be very realistic to train on fp's, as I think in my private
E-mail I won't even check the spam folder very thoroughly at all.

Anyway, a default run for me now gives:

100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
fp: Data/Ham/Set2/n05250.txt score:0.9312
3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
fp: Data/Ham/Set1/n01590.txt score:0.9092
5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2 fn: 2
Total cost: $89.20

(This is on 3 out of my 10 test directories).

Interesting to note so far:

* The "Total cost" is much higher than for train-on-all schemes, but
  it is only due to Unsures; fp and fn are still small.

* The database growth doesn't decay with time after a while; it can
  be described as:
      nwords = 9200 + 1.6 * nmessages
  or alternatively:
      nwords = 5700 + 40 * ntrained
  ..as can be seen in the attached png's

* The training set is almost balanced, even though I scored many
  more ham than spam

* The unsure rate drops over time:
      0- 1000: 11.2% (minus 3.0% to be fair)
   1000- 2000: 5.4%
   2000- 3000: 4.3%
   3000- 4000: 4.0%
   4000- 5000: 3.9%
   5000- 6000: 3.3%

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words1.png
Type: image/png
Size: 12191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words1.png
---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words2.png
Type: image/png
Size: 12807 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words2.png
---------------------- multipart/mixed attachment--

From rob@hooft.net Sat Nov 9 23:46:02 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 00:46:02 +0100
Subject: [Spambayes] More experiments with weaktest.py
Message-ID: <3DCD9E3A.4040809@hooft.net>

These were results of weaktest with default parameters:

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2 fn: 2
Total cost: $89.20

If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical
(spam_cutoff is 90 by default):

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 442 (6.8%)
Trained on 292 ham and 152 spam
fp: 2 fn: 0
Total cost: $108.40

So the database grows by 30% but it didn't help my cost. The training
set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to
the default 20:

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 304 (4.6%)
Trained on 213 ham and 101 spam
fp: 7 fn: 3
Total cost: $133.80

This reduces the database by only 10%, but at very high fp cost. Same
2:1 unbalance in the training set.

Back to the default 20:90 then, and set the minimum_prob_strength to 0.0:

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 933 (14.3%)
Trained on 497 ham and 437 spam
fp: 0 fn: 1
Total cost: $187.60

OK, so that didn't work either. How about setting it to 0.2?

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 304 (4.6%)
Trained on 134 ham and 177 spam
fp: 2 fn: 5
Total cost: $85.80

Hm. That is slightly better. Funny, we are suddenly training on more
spam than ham....
Back to 0.1 anyway ---the differences are too small---
and set robinson_probability_x = 0.3 (default is 0.5):

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 602 (9.2%)
Trained on 54 ham and 616 spam
fp: 1 fn: 67
Total cost: $197.40

Very interesting: this changes the training ratio to 1:12, at huge cost!
(less than one in three spams was recognized solidly as such). Wonder
what this could do if changed together with the cutoff....

Let's move it back to 0.5, and try "robinson_probability_s = 0.3":

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 348 (5.3%)
Trained on 237 ham and 120 spam
fp: 7 fn: 2
Total cost: $141.60

Ouf. I am back with the defaults, but I'd still like to do an automated
optimization of everything simultaneously. Might try that.

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From trebor@animeigo.com Sun Nov 10 00:32:36 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Sat, 9 Nov 2002 19:32:36 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To:
References:
Message-ID:

Hello everyone,

Just a quick note to introduce myself; I ran the session at that
Hacker's conference that Guido mentioned, and passed on the
suggestion of checking out Bill Y's combinatorial approach.

I've been playing with rules-based techniques for almost a year (see
http://www.madoverlord.com/projects/told.t for details) and toying
with bayesian systems for only the last couple of months, on and
off. So no expert in that regard; I have mostly replicated the early
work you guys have done (skimmed the archive today).

I'm particularly impressed with the chi-square work, it looks very
interesting (but more stats for my poor stats-challenged mind to work
on; not to mention that now I'm going to have to get around to
cramming python in there with all the other languages that have
accumulated over the years...).
Also, it's nice the way you're testing a lot of variants, I've been crossing things off my "try this" list all afternoon. Couple of comments (bear in mind, I haven't grabbed the source yet, and only skimmed the archive, so if this repeats things you've already tried, sorry). This is just stuff that's been in my mind recently, plus stuff stimulated by my skim. * The great headers debate; suggest you put both machine and human readable opinions in the header, eg: X-SpamBayes-Rating: 9 (Very Spammy) X-SpamBayes-Rating: 7 (Somewhat Spammy) X-SpamBayes-Rating: 5 (Unsure) X-SpamBayes-Rating: 3 (Probably Innocent) X-SpamBayes-Rating: 0 (The Finest Ham) The reason being, many mailreaders can use a finer discriminant than (yes,no,beats me) in ranking spam. A common strategy (which I like myself) is to start an email at neutral priority and bump it up and down based on various triggers, whitelists, whatever, then sort the inbox by the final priority. A cute hack I used in TOLD was to output the result like this: X-SpamBayes-Rating: 0123456789 (Very Spammy) X-SpamBayes-Rating: 012345 (Unsure) This permits a mailreader with limited filtering tools (like Eudora) to classify multiple results with a single rule (such as "if an X-SpamBayes-Rating header contains the string 12345678, set priority to double-low", which catches both 8 and 9 rated emails). BTW, being pedantic, "rating" is a better word to use, it is more precisely what the discriminator is doing, is the same in all flavors of english, and is shorter. "Score" might be even better. ;^) * Hashing to a 32-bit token is very fast, saves a ton of memory, and the number of collisions using the python hash (I appealed for hash functions on the hackers-l and Guido was kind enough to send me the source) is low. About 1100 collisions out of 3.3 million unique tokens on a training set I was using. CRC32, of all things, is actually slightly better, but only by a hair. 
So this kind of hashing probably won't have much
effect on the statistical results.

* Bill Y's byte bucket system has a lot of problems, but there are
probably some data reduction techniques that would work well. One
that occurred to me on the way back from Hackers would be simply to
keep a 1-byte count of ham/spam hits for each token, and when the ham
or spam count is about to wrap, cut each count in half, rounding up
the other value; ie:

    // increment ham count for bucket i
    // apologies, my pseudocode is a bizarre mixture of
    // now-unknown languages

    if (ham[i] == 255)
    {
        // about to wrap: halve both counts, rounding spam up
        ham[i] = 128;
        spam[i] = (spam[i]/2) + (spam[i]%2);
    }
    else
        ham[i]++;

The nice thing about this is that it would bias in favor of more
recent email; things would "age". But note this means when building
the original database you have to feed it ham and spam in small
chunks, or use a greater resolution before cramming it into
individual bytes.

* I was playing a week or two back with 1 and 2 token groups, and
found that a useful technique was, for each new token, to only
consider the most deviant result. So if the individual word was .99
spam, and the two word phrase was .95, it would only consider the .99
result. This would probably help with Bill Y's combinatorial scheme.
Dunno if you've tried this; it prevents a particularly spammy or
hammy sequence from dominating the results (I was only considering
the 16 or so most deviant results in my bayesian calc, at least on my
corpus, more than that didn't really help).

* My personal bias (as I think Guido mentioned) is for a multifaceted
approach, using Bayesian, rules-based (attacking things that bayesian
isn't good at, like looking for obfuscated url structures), DNSBL,
and whitelisting heuristics to generate an overall ranking. So a
hammy mail from a guy in your address book would bubble up to highest
priority, whereas something spammy from him would stay neutral.
There's lots of room for cooperation between the various approaches and multiple agents means its less likely that a spam will get by. In particular, whitelisting heuristics can almost eliminate false positives. * Finally, if anyone needs more spam, I get over 300 a day (I've been around a while!) and have a cleaned corpus of over 130MB of spam and foreign email. Also, given all the legit web-marketing email I get because of the url registration work I've done, I've got tons of the spammiest ham you could imagine. Best R -- ----------------------------------------------------------------------- http://madoverlord.com/ World Domination - a fun family activity! From tim.one@comcast.net Sun Nov 10 06:55:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 01:55:59 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Robert Woodhead] > Hello everyone, Hi! > Just a quick note to introduce myself; I ran the session at that > Hacker's conference that Guido mentioned, and passed on the > suggestion of checking out Bill Y's combinatorial approach. You can find test results for that in the list archives. Bottom line is that it did worse than what we're doing now, to such an extent that I'm the only one who appeared to try it (my reports weren't encouraging). I may have misimplemented the idea, but I don't think so. The results were in line with earlier experiments we tried on gimmicks that systematically generate highly correlated "words". Such things appear to learn a lot faster than word unigrams, but we've always found (so far) that unigrams soon enough overcome that, and then go on to win. What we're missing is any practical approach to a scheme that can suck out phrases without identifying them by hand first, and without generating highly correlated phrases (overlapping word n-grams are highly correlated, of course, and Bill carries that to extremes). 
Something I didn't report on is later experiments using chi-combining instead of Bill's "add up the raw counts". chi-combining worked better. I know Bill has gone on to do a more Bayesian-like combination method, but I expect that to do worse than what we've got now for the same reasons we gave up on Paul Graham's combining scheme, but more so: the word independence assumption is bogus, and feeding the Bayesian calcs highly correlated words grossly over- or under- estimates the true probability as a result. In the end you get a scheme that claims certainty even when it's dead wrong, and although it's not dead wrong often, it is dead wrong at a non-zero rate. Revealing: fiddle our chi2.py to use whatever combining scheme Bill is using now, and feed it vectors of *random* probabilities. Most of the code needed for that, and to display a histogram of results, is already there. Try it with Graham's combining scheme and you'll find that scores are almost always very near 0 or very near 1 even when the inputs are random and uniformly distributed. I expect that can only get worse by doing "chain rule" calcs on probs that are highly correlated to begin with. The internal chi-combining S and H scores are uniformly distributed given random inputs, so chi-combining doesn't infer certainty by chance any more often than it "should" infer certainty by chance. That appears to be what makes it far more robust against embarrassing mistakes, and it reliably pumps out a score near 0.5 given a highly ambiguous input msg (many strong ham and many strong spam clues -- we call that "cancellation disease" here, and chi-combining doesn't infer certainty when it happens; all other schemes did, and didn't do better than chance when it happened). > I've been playing with rules-based techniques for almost a year (see > http://www.madoverlord.com/projects/told.t for details) and toying > with bayesian systems for only the last couple of months, on and > off. 
So no expert in that regard; I have mostly replicated the early > work you guys have done (skimmed the archive today). > > I'm particularly impressed with the chi-square work, it looks very > interesting (but more stats for my poor stats-challenged mind to work > on; So copy and paste . > not to mention that now I'm going to have to get around to > cramming python in there with all the other languages that have > accumulated over the years...). In return, you can throw twelve other languages out <0.7 wink>. > Also, it's nice the way you're testing a lot of variants, I've been > crossing things off my "try this" list all afternoon. Testing has pretty much run out of steam here, though. My error rates are so low now I couldn't measure an improvement in a convincing way even if one were to be made, and the same is true of a few others here too. We appear to be fresh out of big algorithmic wins, so are pushing on to wrestling with deployment issues. BTW, download the source code and read the comments in tokenizer.py: the results of many early experiments are given there in comment blocks. > Couple of comments (bear in mind, I haven't grabbed the source yet, > and only skimmed the archive, so if this repeats things you've > already tried, sorry). This is just stuff that's been in my mind > recently, plus stuff stimulated by my skim. > > * The great headers debate; suggest you put both machine and human > readable opinions in the header, eg: > > X-SpamBayes-Rating: 9 (Very Spammy) > X-SpamBayes-Rating: 7 (Somewhat Spammy) > X-SpamBayes-Rating: 5 (Unsure) > X-SpamBayes-Rating: 3 (Probably Innocent) > X-SpamBayes-Rating: 0 (The Finest Ham) > > The reason being, many mailreaders can use a finer discriminant than > (yes,no,beats me) in ranking spam. A common strategy (which I like > myself) is to start an email at neutral priority and bump it up and > down based on various triggers, whitelists, whatever, then sort the > inbox by the final priority. 
Spoken like someone who worked on a rule-based system . We have three categories: Ham, Unsure, and Spam, and I haven't seen anything to make me believe that a finer distinction than that can be quantitatively justified (but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's what I mean by "can't measure an improvement anymore", and a finer-grained scheme isn't going to touch those 2 mistakes; one of them is formally ham because it was sent by a real person, but consists of a one-line comment followed by a quote of an entire Nigerian scam spam -- nothing useful is ever going to *call* that one ham, and it scores as spam *almost* as solidly as an original Nigerian spam). > A cute hack I used in TOLD was to output the result like this: > > X-SpamBayes-Rating: 0123456789 (Very Spammy) > X-SpamBayes-Rating: 012345 (Unsure) > > This permits a mailreader with limited filtering tools (like Eudora) > to classify multiple results with a single rule (such as "if an > X-SpamBayes-Rating header contains the string 12345678, set priority > to double-low", which catches both 8 and 9 rated emails). > > BTW, being pedantic, "rating" is a better word to use, it is more > precisely what the discriminator is doing, is the same in all flavors > of english, and is shorter. "Score" might be even better. ;^) "Score" is my favorite, but isn't catching on. I believe the word "ham" for "not spam" was my invention, and since that one caught on big, I'm not fighting to the death for any others . > * Hashing to a 32-bit token is very fast, saves a ton of memory, > and the number of collisions using the python hash (I appealed for hash > functions on the hackers-l and Guido was kind enough to send me the > source) is low. About 1100 collisions out of 3.3 million unique > tokens on a training set I was using. That's significantly better than you could expect from a truly random hash function, so is fishy. 
Tossing 3.3M balls into 2**32 buckets at random should leave
3298733 buckets occupied on average, with an sdev of 35.58 buckets. Getting
1100 collisions is about 4.7 sdevs fewer than the random mean.

> CRC32, of all things, is actually slightly better,

With sparse occupancy (3.3e6 out of 4.3e9 buckets is sparse) they may be
comparable. PythonLabs ran large-scale statistical tests a few years ago on
this. The Python string hash produced 32-bit numbers indistinguishable from
random (on the #-of-collision basis) as far as we pushed it; crc32 broke
down *very* badly as occupancy increased, with collision rates hundreds of
sdevs worse than random. So I can't recommend crc32 for general string
hashing (and the Python docs indeed warn about this now), but can recommend
Python's string hash. By coincidence, it turns out that Python's string
hash is very similar to what later became "the standard" Fowler-Noll-Vo
string hash, which may be the most widely tested "seems as good as random"
fast string hash now:

    http://www.isthe.com/chongo/tech/comp/fnv/

> but only by a hair. So this kind of hashing probably won't have much
> effect on the statistical results.

Since we're sticking to unigrams, we don't have an insane database burden.
We also (by default) limit ourselves to looking at no more than 150 words
per msg. So I'm not sure saving some bytes of string storage is "worth it"
for us, and it's very nice that we can get back the exact list of words that
went into computing a score later. A pile of hash codes wouldn't give the
same loving effect .

> * Bill Y's byte bucket system has a lot of problems, but there are
> probably some data reduction techniques that would work well.
One > that occurred to me on the way back from Hackers would be simply to > keep a 1-byte count of ham/spam hits for each token, and when the ham > or spam count is about to wrap, cut each count in half, rounding up > the other value; ie: > > // increment ham count for bucket i > // apologies, my pseudocode is a bizarre mixture of > // now-unknown languages > > if (ham[i]=255) > { > ham[i]=128; > spam[i]=(spam[i]/2)+(spam[i]%2) > } > else > ham[i]++; > > The nice thing about this is that it would bias in favor of more > recent email; things would "age". But note this means when building > the original database you have to feed it ham and spam in small > chunks, or use a greater resolution before cramming it into > individual bytes. Except I didn't get good enough results from his approach to justify pursuing it here, even leaving the hash codes at the full 32 bits. When I went on to squash them to fit in a million buckets, a few false positives popped up that were just too bad to bear (two can be found in the list archives): ham that was so obviously ham that no system that called them spam would be acceptable to most people. > * I was playing a week or two back with 1 and 2 token groups, and > found that a useful technique was, for each new token, to only > consider the most deviant result. So if the individual word was .99 > spam, and the two word phrase was .95, it would only consider the .99 > result. This would probably help with Bill Y's combinatorial scheme. It could be a viable approach to the problem mentioned above: a scheme to suck out more than one word that doesn't systematically generate mounds of nearly redundant (highly correlated) clues. We're clearly missing info by never looking at bigrams (or beyond) now, and that continues to bother me (even if it doesn't seem to be bothering the error rates ). 
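One possible reading of the "most deviant result" idea, as a minimal Python
sketch. Everything named here (strength, most_deviant_clues, the toy lookup
table and its probabilities) is hypothetical and is not part of the
spambayes code base or of TOLD; unknown words are assumed to get a neutral
0.5. At each token position the unigram and the bigram ending there compete,
and only the one farther from 0.5 is kept, so a single position never
contributes two overlapping clues.

```python
def strength(prob):
    """Distance of a spam probability from the neutral value 0.5."""
    return abs(prob - 0.5)


def most_deviant_clues(tokens, spamprob):
    """Return one (clue, prob) pair per token position.

    For each position, the candidates are the word itself and the
    two-word phrase ending at that word; the candidate whose
    probability deviates most from 0.5 wins.
    """
    clues = []
    prev = None
    for tok in tokens:
        candidates = [(tok, spamprob(tok))]
        if prev is not None:
            bigram = prev + " " + tok
            candidates.append((bigram, spamprob(bigram)))
        clues.append(max(candidates, key=lambda c: strength(c[1])))
        prev = tok
    return clues


# Toy probabilities, made up for illustration only.
TOY_PROBS = {"money": 0.99, "free money": 0.95}


def toy_lookup(clue):
    # Assumption: anything not in the table is neutral.
    return TOY_PROBS.get(clue, 0.5)


if __name__ == "__main__":
    for clue, prob in most_deviant_clues(["free", "money"], toy_lookup):
        print(clue, prob)
```

With the toy table, the word at .99 beats the phrase at .95, matching the
example in the quoted text; whether this actually reduces correlation on
real mail is untested here.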
> Dunno if you've tried this; it prevents a particularly spammy or
> hammy sequence from dominating the results (I was only considering
> the 16 or so most deviant results in my bayesian calc, at least on my
> corpus, more than that didn't really help).

There's too much I don't know about everything you're doing to say much
about that. *All* the biases in Graham's original scheme eventually went
away in this project, and things like clamping the spamprobs into [.01,
0.99] turned out to make it systematically useless to try to use more than
16 words under Graham-combining (it just caused more "cancellation disease",
and so caused more wildly wrong mistakes). We use 150 now, but IIRC we
generally stopped seeing strong benefits after hitting about 40. That 40
was better than 16 very much relied on removing all the biases, though (no
"ham boosts", no prob clamping, no minimum word count, no giving unknown
words spamprobs above 0.5 to favor ham, no doubling the ham count when
computing a spam prob, etc).

> * My personal bias (as I think Guido mentioned) is for a multifaceted
> approach, using Bayesian, rules-based (attacking things that bayesian
> isn't good at, like looking for obfuscated url structures), DNSBL,
> and whitelisting heuristics to generate an overall ranking. So a
> hammy mail from a guy in your address book would bubble up to highest
> priority, whereas something spammy from him would stay neutral.

I'm not sure we really need it. For example, *lots* of spam has been
discussed on this mailing list, so much so that the python.org email admin
had to castrate SpamAssassin for msgs to this list address else it kept
blocking ordinary list traffic. My personal email classifier never calls
anything here spam, though, nor does it call the originals of the spams
posted here ham.

I do worry a little about obfuscated HTML.
We strip almost all HTML tags by default for a reason I've harped on enough : all HTML decorations have very high spamprobs, and counting more than one of them as "a clue" fools almost every combining scheme into believing the msg containing them is spam (if you know a msg contains both
and

, it's not really more likely to be spam than if you just know it contains
!). So we blind the classifier to HTML decorations now. But a spam I
forwarded here a week or so ago exploited that: the spam was interleaved
with size=1 white-on-white news stories and tech mailing list postings. The
classifier *did* see those, but didn't see the HTML decorations hiding them.
This was a cancellation-disease-by-construction kind of msg, and
chi-combining scored it near 0.5 as a result (solidly Unsure). It's the
only spam of that kind I've seen so far; if it becomes a popular technique,
we'll have to take more HTML blinders off the classifier.

> There's lots of room for cooperation between the various approaches
> and multiple agents means it's less likely that a spam will get by.
> In particular, whitelisting heuristics can almost eliminate false
> positives.

I'll let you know if I ever see one . Seriously, one of the apps I've
especially got in mind is filtering the high-volume mailing lists on
python.org. The only kind of FP I see there now in tests is administrative
requests to *-request addresses, which typically consist of a one word
"subscribe" or "unsubscribe" (themselves words with high spamprobs!),
followed by 6KB of employer-generated HTML disclaimers, and/or a forwarded
spam or conference announcement the sender didn't like. There's still a
very low FP rate even on those, but text analysis simply can't be expected
to nail them every time. Under SpamAssassin, those recipient addresses are
given strong ham boosts by the python.org email admin.

> * Finally, if anyone needs more spam, I get over 300 a day (I've been
> around a while!) and have a cleaned corpus of over 130MB of spam and
> foreign email. Also, given all the legit web-marketing email I get
> because of the url registration work I've done, I've got tons of the
> spammiest ham you could imagine.

Wasn't Paul Graham collecting corpora?
Yup, still is: http://www.paulgraham.com/spamarchives.html Getting vast quantities of spam isn't a problem anymore, but getting vast quantities of ham is. Since your spammy ham is presumably business-related, I assume you can't share it. Or can you? Mixing spam and ham from different sources also causes worlds of problems (indeed, we still (by default) ignore most of the header lines partly for that reason, else the system gets great results for bogus reasons). From tim.one@comcast.net Sun Nov 10 07:27:38 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 02:27:38 -0500 Subject: [Spambayes] More experiments with weaktest.py In-Reply-To: <3DCD9E3A.4040809@hooft.net> Message-ID: [Rob Hooft] > These were results of weaktest with default parameters: Very interesting! I'll have to try that too. Note that in my live email experiment here, I'm (except for the very start) also scoring/training msgs in (with small lapses) the order they arrive. It's been reported before that this helps; although I still haven't run a controlled experiment on that, my *impression* is that it does help. > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 336 (5.1%) > Trained on 178 ham and 162 spam > fp: 2 fn: 2 > Total cost: $89.20 > > If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical > (spam_cutoff is 90 by default): The asymmetry is intentional: most people hate FP more than FN, so by default I made it harder for a thing to get called spam. In test after test we've also seen that spam has a tighter score distribution than ham, which is a more formal justification for setting the spam cutoff closer to its endpoint than the ham cutoff. Setting ham_cutoff as low as 10 is for the truly paranoid <0.9 wink>. 
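
The "Total cost" lines in the weaktest output quoted in this thread are consistent with a simple weighted sum: $10 per false positive, $1 per false negative, and $0.20 per unsure. Those weights are inferred from the reported totals rather than read from weaktest.py, so treat this as a sketch under that assumption; the function name is made up for illustration.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted sum of mistakes; the default weights are inferred, not official."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


# Counts from the default run quoted above: fp=2, fn=2, 336 unsure.
print(f"${total_cost(2, 2, 336):.2f}")
```

With these weights a single false positive costs as much as 50 unsures, which is why a lower error count can still come with a higher total when the unsure count grows.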
> Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 442 (6.8%) > Trained on 292 ham and 152 spam > fp: 2 fn: 0 > Total cost: $108.40 > > So the database grows by 30% but it didn't help my cost. The training > set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to > the default 20: > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 304 (4.6%) > Trained on 213 ham and 101 spam > fp: 7 fn: 3 > Total cost: $133.80 > > This reduces the database by only 10%, but at very high fp cost. Same > 2:1 unbalance in the training set. > Back to the default 20:90 then, and set the minimum_prob_strength to 0.0: > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 933 (14.3%) > Trained on 497 ham and 437 spam > fp: 0 fn: 1 > Total cost: $187.60 > > OK, so that didn't work either. How about setting it to 0.2? > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 304 (4.6%) > Trained on 134 ham and 177 spam > fp: 2 fn: 5 > Total cost: $85.80 > > Hm. That is slightly better. Funny, we are suddenly training on more > spam than ham.... Back to 0.1 anyway ---the differences are too small--- > and set robinson_probability_x = 0.3 (default is 0.5): > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 602 (9.2%) > Trained on 54 ham and 616 spam > fp: 1 fn: 67 > Total cost: $197.40 > > Very interesting: this changes the training ratio to 1:12, at huge cost! > (less than one in three spams was recognized solidly as such). Note that in calculations I reported a day or two ago, the measured mean of spamprobs across 3 different corpora was > 0.5, but not by a lot. .3 moves it outside the range minimum_prob_strength ignores, so now every "new word" is instantly taken as a ham clue, where before all new words were ignored by default. 
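
A rough sketch of why moving robinson_probability_x matters, assuming the classifier applies Robinson's adjustment f(w) = (s*x + n*p(w)) / (s + n) and ignores clues closer than minimum_prob_strength to 0.5. The value of s below is an arbitrary placeholder, the ">=" comparison is an assumption, and the helper names are invented for illustration.

```python
def adjusted_prob(p, n, s, x):
    """Robinson's adjustment: blend the observed spamprob p (word seen n
    times) with the prior x, where s is the strength given to the prior."""
    return (s * x + n * p) / (s + n)


def is_used(prob, minimum_prob_strength=0.1):
    """A clue is counted only if it is far enough from the neutral 0.5."""
    return abs(prob - 0.5) >= minimum_prob_strength


S = 0.45  # placeholder strength, not necessarily the project default
for x in (0.5, 0.3):
    # A never-seen word has n == 0, so its adjusted probability is just x.
    f = adjusted_prob(p=0.0, n=0, s=S, x=x)
    print(x, is_used(f))
```

With x at 0.5 an unseen word sits exactly on the neutral point and is dropped; with x at 0.3 it clears the 0.1 threshold and is counted as a ham clue.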
So that it grossly inflated the FN rate isn't surprising; everything that will *eventually* become a hapax is initially taken to be a ham clue, even when it's never been seen before. > Wonder what this could do if changed together with the cutoff.... > Lets move it back to 0.5, and try "robinson_probability_s = 0.3": > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 348 (5.3%) > Trained on 237 ham and 120 spam > fp: 7 fn: 2 > Total cost: $141.60 > > Ouf. I hope you're at least gaining some respect for how much work went into picking the defaults . > I am back with the defaults, but I'd still like to do an automated > optimization of everything simultaneously. Might try that. Now *that* could be a useful system regardless of scheme. I've tended to do hill-climbing across one dimension at a time, occasionally moving batches of params random amounts at once (to see whether that kicks it out of a stubborn local minimum). From tim.one@comcast.net Sun Nov 10 07:52:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 02:52:42 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <3DCD8B34.6040903@hooft.net> Message-ID: [Rob Hooft] > I just added a testdriver to CVS that simulates your behaviour as I > understand it: It will train on the first 30 messages, I trained on 1 of each at the start. If I were to do it over, I'd start with an empty database . > plus on all misclassified and all unsure messages. Since I'm doing this real-time on my live email, I've been training "on the worst" (farthest away from correct) msg that arrives in a batch, then rescoring all the ones that arrived in the batch, then training the worst remaining, ... until all new ham is below ham_cutoff and all new spam above spam_cutoff. I don't know that it matters, just being clear(er). 
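
The worst-at-a-time batch procedure just described could be sketched as follows. This is only an illustration: `score` and `train` are hypothetical callables standing in for whatever classifier is in use, scores are assumed to run from 0.0 (ham) to 1.0 (spam), and the cutoffs are the 0.20/0.90 defaults mentioned in this thread.

```python
def train_worst_first(batch, score, train, ham_cutoff=0.20, spam_cutoff=0.90):
    """batch is a list of (msg, is_spam) pairs. Repeatedly train on the
    message scored farthest from its correct end, rescoring in between.
    Returns the (msg, is_spam) pairs that were trained on, in order."""
    pending = list(batch)
    trained = []
    while True:
        # Rescore everything not yet trained on. The error is the distance
        # from the correct end of the scale (0.0 for ham, 1.0 for spam).
        wrong = []
        for i, (msg, is_spam) in enumerate(pending):
            s = score(msg)
            ok = s > spam_cutoff if is_spam else s < ham_cutoff
            if not ok:
                wrong.append(((1.0 - s) if is_spam else s, i))
        if not wrong:
            return trained
        _, worst = max(wrong, key=lambda pair: pair[0])
        msg, is_spam = pending.pop(worst)
        train(msg, is_spam)
        trained.append((msg, is_spam))
```

The loop always terminates because each pass removes one message from the pending list.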
As things turned out, this worst-at-a-time training never managed to push
one of the remaining mistakes/unsures into the correct category, *except*
for cases where I got more than one copy of a spam from different accounts
at the same time. Then it always pushed the copies into scoring near 1.0,
since the hapaxes in the training copy are abundant.

> It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am
> hierarchy.
>
> I think we should test its performance at different Options settings.
>
> It may not even be very realistic to train on fp's, as I think in my
> private E-mail I won't even check the spam folder very thoroughly at all.

But I will (and do), and my primary interest here is to see how bad things
can get if a user takes mistake-based training to an extreme. Despite that
it's heavily hapax-driven, it appears to do very well when judged by error
rate. I've been doing it long enough now, though, that it doesn't do so
well subjectively: the Unsures are too often bizarre. For example, I sent a
long reply here to Robert Woodhead, and the copy I got back showed up as
Unsure, with H=1 and S=0.66. There were a lot of accidental spam hapaxes in
that msg! Training on it as ham then eliminated about 30 spam hapaxes
(they're now neutral, having been seen in one ham and one spam each). So
it's no different from my POV than the cases where people have sent me
"surprising msgs" in the past, and my carefully trained slice-of-life
classifier (regularly trained on a sampling of correctly classified msgs
too) at the time had no trouble nailing them as ham or spam, with lots of
non-hapax evidence to back it up. IOW, I'm still sticking to what I guessed
before I started this: mistake-driven training will appear to work well
over the short term, but it's brittle, and is brittle because of its
reliance on hapaxes.

> Anyway, a default run for me now gives:
>
>  100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
>  200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
>  300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
>  400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
>  500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
>  600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
>  700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
>  800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
>  900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
> 1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
> 1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
> 1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
> 1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
> 1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
> 1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
> 1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
> 1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
> 1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
> 1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
> 2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
> 2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
> 2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
> 2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
> 2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
> 2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
> 2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
> 2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
> 2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
> 2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
> 3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
> fp: Data/Ham/Set2/n05250.txt score:0.9312
> 3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
> 3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
> 3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
> 3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
> 3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
> 3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
> 3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
> 3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
> 3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
> 4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
> 4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
> 4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
> 4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
> 4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
> 4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
> 4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
> 4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
> 4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
> 4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
> fp: Data/Ham/Set1/n01590.txt score:0.9092
> 5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
> 5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
> 5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
> 5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
> 5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
> 5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
> 5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
> 5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
> 5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
> 5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
> 6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
> 6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
> 6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
> 6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
> 6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
> 6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 336 (5.1%)
> Trained on 178 ham and 162 spam
> fp: 2 fn: 2
> Total cost: $89.20
>
> (This is on 3 out of my 10 test directories).
>
> Interesting to note so far:
> * The "Total cost" is much higher than for train-on-all schemes,
>   but it is only due to Unsures; fp and fn are still small.

That matches my experience too, although I started with 1 ham and 1 spam
and had high FP and FN rates over the first few hours.

> * The database growth doesn't decay with time after a while;
>   it can be described as:
>       nwords = 9200 + 1.6 * nmessages
>   or alternatively:
>       nwords = 5700 + 40 * ntrained
>   ..as can be seen in the attached png's

I expect that's mostly because there are still (relatively) few total msgs
trained on.

> * The training set is almost balanced, even though I scored
>   many more ham than spam

Curiously, same here! I get about 500 ham and 100 spam per day, but my
training database now has 47 ham and 41 spam. It does well, except when it
sucks.

> * The unsure rate drops over time:

I haven't measured that, but it's clearly been so here too (as I said
before).

>      0- 1000: 11.2% (minus 3.0% to be fair)
>   1000- 2000: 5.4%
>   2000- 3000: 4.3%
>   3000- 4000: 4.0%
>   4000- 5000: 3.9%
>   5000- 6000: 3.3%

Proving what I've always suspected: over time, all msgs are repetitions of
ones you've seen before <0.9 wink>.

From tim.one@comcast.net Sun Nov 10 08:36:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 03:36:10 -0500
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To:
Message-ID:

[Tim, asks for help on a Spanish Unsure]

[François Granger]
> Here are the most probable English equivalents of the Spanish words.
> 'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with',
> 'good', 'this', 'Hi', 'that', 'here', 'the', 'for'
>
> This illustrates the need for properly balanced training sets and
> re-raises the question of language discrimination.

It really doesn't raise it for me: this was in my personal email, and
since I couldn't read the msg anyway, it may as well have been spam.
I get way too much email to bother more than 2 seconds with something I
can't read. I only looked at this one because I'm paying heavy attention to
everything the automatic classifier calls spam. If I weren't using this
system, I would have thrown out that msg at once.

If I were someone who got any quantity of Spanish ham, the system would
have scored it as ham. As is, the only Spanish I get is in Spanish spam, so
the system correctly judged it for my personal email mix.

> At least prior language discrimination would allow for a different
> database for each language

Whether that would improve results is a testable hypothesis; I've already
said I doubt it would be helpful, and have no motivation to try such an
experiment myself.

> or for a systematic "unsure" flag for not trained languages.

But I *do* train on Spanish -- and Russian, and Turkish, and Chinese, and
Japanese, and German, and French, and Polish (at least): in my email mix,
they're all used in spam, aren't used in my ham, and are spam to me because
they're unreadable by me.

> If you put my messages in a Ham training set, you will flag French spams
> as ham because of my French sig ;-)

Nope, the system isn't that stupid (or, rather, it is). What it will do is
knock down the spamprobs of those words. Despite that I've got French spam
in my training data, your msg here -- including the French sig -- got a
solid ham score, with H=1 (to six significant digits) and S=1.1e-11. The
strongest spam word in fact came from your sig, spamprob('est')=0.84. It
didn't matter, because I could actually read most of what you wrote, and it
wasn't trying to sell me Viagra.

> All these words should rate around 0.5 since they are among the
> most common ones in this language.

If I got any French ham, they would rate around 0.5, but for my personal
email it's Just Fine that they're considered spam words.
It wouldn't be OK for python.org use, but python.org gets a non-trivial
amount of non-English ham, so it trains there accordingly.

> Le courrier est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies. Pour des courriers propres :
> --

Indeed.

From rob@hooft.net Sun Nov 10 11:09:28 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 12:09:28 +0100
Subject: [Spambayes] Introducing myself
References:
Message-ID: <3DCE3E68.2060101@hooft.net>

Robert Woodhead wrote:

> * My personal bias (as I think Guido mentioned) is for a multifaceted
> approach, using Bayesian, rules-based (attacking things that bayesian
> isn't good at, like looking for obfuscated url structures), DNSBL, and
> whitelisting heuristics to generate an overall ranking. So a hammy mail
> from a guy in your address book would bubble up to highest priority,
> whereas something spammy from him would stay neutral. There's lots of
> room for cooperation between the various approaches and multiple agents
> means it's less likely that a spam will get by. In particular,
> whitelisting heuristics can almost eliminate false positives.

I think our very good experience with the bayesian classifier would
"forbid" the use of whitelisting. Once a whitelisted feature "leaks" into
the spam community, it will be useless. But there is a bayesian solution to
it: Make the tokenizer recognize the feature that you want to whitelist or
blacklist, and emit a new token to that effect.

    From:               --> Will have a low spamprob
    url:numeric-host    --> Will have a high spamprob

We're already doing something like that for a number of the SpamAssassin
tests (e.g. mime-type tokens). This approach still uses a purely bayesian
classifier, and it will follow reality automatically.

I'd like to note that a lot of what you were saying and what was in Tim's
response (and mine here) is only valid in a train-on-all scheme. i.e.
like we've been using until a week ago....

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Nov 10 12:11:46 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:11:46 +0100
Subject: [Spambayes] More experiments with weaktest.py
References:
Message-ID: <3DCE4D02.6060907@hooft.net>

Tim Peters wrote:

> [Rob Hooft]
>
>>These were results of weaktest with default parameters:
>
>
> Very interesting! I'll have to try that too. Note that in my live email
> experiment here, I'm (except for the very start) also scoring/training msgs
> in (with small lapses) the order they arrive. It's been reported before
> that this helps; although I still haven't run a controlled experiment on
> that, my *impression* is that it does help.

I toyed with the idea, but that would involve parsing all messages once
before starting, and sorting them on date. Putting them in a set to
"randomize" the order is much easier, so I was lazy.

> Setting ham_cutoff as low as 10 is for the
> truly paranoid <0.9 wink>.

Very much so. For my "production" systems, I have ham_cutoff at 40...

> I hope you're at least gaining some respect for how much work went into
> picking the defaults.

I was just arriving when it happened. But that was on a completely
different classifier, so I'm still convinced these need to be thoroughly
tested.

>>I am back with the defaults, but I'd still like to do an automated
>>optimization of everything simultaneously. Might try that.

> Now *that* could be a useful system regardless of scheme. I've tended to do
> hill-climbing across one dimension at a time, occasionally moving batches of
> params random amounts at once (to see whether that kicks it out of a
> stubborn local minimum).

Hm. That sounds so enthusiastic that I just might commit what I have gone
through this night. Some more info:

 * No, I have not used a "Simulated Annealing" or "Threshold Accepting" yet.
   Please keep in mind that each step in the optimization takes between
   3 minutes (1 set on my home PC) and 15 minutes (10 sets on my work PC).
   This would be way too costly. Just minimization it will be.

 * I tried to use "Simplex optimization" (let a multidimensional triangle
   walk through phase space) on the "Total cost" parameter. This was simply
   disastrous. Phase space consists of plateau regions that are exactly
   flat, joined by huge ridges. Think about that one spam that goes from a
   0.11 to a 0.09 score: it will add $9.80 in one bang to the cost. This
   field is impossible to optimize.

 * I designed a new "Flex cost" field. That one does away with the
   "unsure cost". The cost of a message is 0.0 at its own cutoff, and
   increases linearly towards its "false" cost at the other cutoff, and
   increases further to the other end. Hm. Unreadable. A table:

      Score    Spam with this    Ham with this
               score costs       score costs
      0.00     $ 1.29            $ 0.00
      0.20     $ 1.00            $ 0.00
      0.55     $ 0.50            $ 5.00
      0.90     $ 0.00            $10.00
      1.00     $ 0.00            $11.43

   This field is much more smooth than the total cost field, so I was
   hoping that pure minimization will do. Obviously, the flex cost is much,
   much higher than the total cost because unsures are so much more
   expensive. The flex cost field will also be less sensitive to the
   {sp|h}am_cutoff parameters than the total cost field, because there are
   no sudden cost jumps.

 * Results are not great; I need to experiment more before reporting on
   them.

 * I just committed:
     weaktest.py: introduction of the flexcost measure
     optimize.py: simplex optimization (needs Numeric python; sorry)
     weakloop.py: run weaktest.py repeatedly under simplex optimization

Regards,

Rob Hooft

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Nov 10 12:28:44 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:28:44 +0100
Subject: [Spambayes] Outlook plugin - training
References:
Message-ID: <3DCE50FC.3050005@hooft.net>

This is a multi-part message in MIME format.
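
The "Flex cost" table from the message above can be reproduced with a small piecewise-linear function. This is a reconstruction from the table only, assuming the 0.20/0.90 cutoffs and a "false" cost of $10 for ham and $1 for spam; it is not taken from weaktest.py, and the function name is invented.

```python
def flex_cost(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90,
              fp_cost=10.00, fn_cost=1.00):
    """Zero at the message's own cutoff, rising linearly so that it reaches
    the full 'false' cost at the opposite cutoff, and continuing beyond."""
    width = spam_cutoff - ham_cutoff
    if is_spam:
        return max(0.0, (spam_cutoff - score) / width) * fn_cost
    return max(0.0, (score - ham_cutoff) / width) * fp_cost


for s in (0.00, 0.20, 0.55, 0.90, 1.00):
    print(f"{s:.2f}  spam ${flex_cost(s, True):5.2f}  ham ${flex_cost(s, False):5.2f}")
```

Because the function has no steps, a message drifting across a cutoff changes the total only slightly, which is the smoothness property the simplex search needs.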
---------------------- multipart/mixed attachment
Tim Peters wrote:

> [Rob Hooft]
>
>>I just added a testdriver to CVS that simulates your behaviour as I
>>understand it: It will train on the first 30 messages,
>
>
> I trained on 1 of each at the start. If I were to do it over, I'd start
> with an empty database.

This is easy enough to change, but I left it at 30 for now.

> Since I'm doing this real-time on my live email, I've been training "on the
> worst" (farthest away from correct) msg that arrives in a batch, then
> rescoring all the ones that arrived in the batch, then training the worst
> remaining, ... until all new ham is below ham_cutoff and all new spam above
> spam_cutoff. I don't know that it matters, just being clear(er). As things
> turned out, this worst-at-a-time training never managed to push one of the
> remaining mistakes/unsures into the correct category, *except* for cases
> where I got more than one copy of a spam from different accounts at the same
> time. Then it always pushed the copies into scoring near 1.0, since the
> hapaxes in the training copy are abundant.

But I'm doing exactly the same, except that my batch size is always 1 ;-)

>>It may not even be very realistic to train on fp's, as I think in my
>>private E-mail I won't even check the spam folder very thoroughly at all.

> But I will (and do), and my primary interest here is to see how bad things
> can get if a user takes mistake-based training to an extreme. Despite that
> it's heavily hapax-driven, it appears to do very well when judged by error
> rate.

Hm. There are so few fp/fn's relative to unsures (at least after 30
messages initial training), that it wouldn't matter much (I think).

>> * The database growth doesn't decay with time after a while;
>>   it can be described as:
>>       nwords = 9200 + 1.6 * nmessages
>>   or alternatively:
>>       nwords = 5700 + 40 * ntrained
>>   ..as can be seen in the attached png's
>
>
> I expect that's mostly because there are still (relatively) few total msgs
> trained on.

Hm, it is more like a sqrt after more messages. See attached image which
has a sqrt X axis. The fit fits the data even at the lowest end.

Regards,

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words3.png
Type: image/png
Size: 13675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021110/b5905d0f/words3.png
---------------------- multipart/mixed attachment--

From lists@morpheus.demon.co.uk Sun Nov 10 14:31:30 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 10 Nov 2002 14:31:30 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
References:
Message-ID:

"Mark Hammond" writes:

> I am working on code that optionally processes "missed" messages at startup.
> It looks like I can list all unread, unscored mail in my 1000+ item inbox
> very quickly, so this should be feasible.

That sounds like the best option. I haven't had a chance to check Exchange
yet, but with an IMAP store there are no "New mail" events triggered when I
start Outlook with new mail in the IMAP inbox. I'd expect Exchange to be
the same. (I didn't write a new addin, the spambayes addin does log when it
gets a NewMail event, which I can see via win32traceutil...)

I'll be interested to see the code, in any case, as when I tried to list
unread mail for another project, I couldn't get it to be fast :-(

Paul.
-- This signature intentionally left blank From trebor@animeigo.com Sun Nov 10 21:59:28 2002 From: trebor@animeigo.com (Robert Woodhead) Date: Sun, 10 Nov 2002 16:59:28 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: [my apologies if some of the suggestions/comments below have been previously discussed, I'm still getting up to speed on the list] > > I'm particularly impressed with the chi-square work, it looks very >> interesting (but more stats for my poor stats-challenged mind to work >> on; > >So copy and paste . Heh, call me old fashioned, but I actually like to know how things work, rather than relying on black magic. ;^) > > not to mention that now I'm going to have to get around to >> cramming python in there with all the other languages that have >> accumulated over the years...). > >In return, you can throw twelve other languages out <0.7 wink>. Why would I ever want to do that? You never know when you'll need to be able to remember PL/C, JPL, APL, TUTOR, etc., etc., etc. Though I pray I never have to remember NOVA MOBOL ("Language of Kings") ;^) >Testing has pretty much run out of steam here, though. My error rates are >so low now I couldn't measure an improvement in a convincing way even if one >were to be made, and the same is true of a few others here too. We appear >to be fresh out of big algorithmic wins, so are pushing on to wrestling with >deployment issues. Indeed. And you also have to start worrying about the metagame; assuming your system goes into widespread deployment, what will the intelligent spammer (oxymoron) responses be? >BTW, download the source code and read the comments in tokenizer.py: the >results of many early experiments are given there in comment blocks. Will be doing this over the next day or so. >Spoken like someone who worked on a rule-based system . 
We have three >categories: Ham, Unsure, and Spam, and I haven't seen anything to make me >believe that a finer distinction than that can be quantitatively justified >(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's >what I mean by "can't measure an improvement anymore", and a finer-grained >scheme isn't going to touch those 2 mistakes; one of them is formally ham >because it was sent by a real person, but consists of a one-line comment >followed by a quote of an entire Nigerian scam spam -- nothing useful is >ever going to *call* that one ham, and it scores as spam *almost* as solidly >as an original Nigerian spam). Ah, but there are more considerations. First, many people's training sets may not be as distinct as yours, so the results might be more blurry. Second, future versions of the software might end up including other recognizers in the mix (for example, DNSBL, url heuristics, whitelists, stamping systems, etc), so adding a bit of flexibility at the start doesn't cost you anything, but could end up saving everyone a lot of work down the road. Since most existing mailreader filter schemes are relatively primitive, more than 10 levels of discrimination isn't going to be all that useful. But only 3 would seem to be to be too few. In a 1-9 scheme, the current 3 levels would map to (say), 2,5,8. It's just a syntactic difference, but it gives you precious wiggle room. >"Score" is my favorite, but isn't catching on. I believe the word "ham" for >"not spam" was my invention, and since that one caught on big, I'm not >fighting to the death for any others . Hey, why quit when you're on a roll? > >> * Hashing to a 32-bit token is very fast, saves a ton of memory, >> and the number of collisions using the python hash (I appealed for hash >> functions on the hackers-l and Guido was kind enough to send me the >> source) is low. About 1100 collisions out of 3.3 million unique >> tokens on a training set I was using. 
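
As a sanity check on that collision count, the random-hashing baseline for 3.3 million tokens in 2**32 buckets can be estimated as below. The sketch assumes a uniformly random hash and uses a Poisson approximation for the standard deviation; the function name is invented for illustration.

```python
import math


def expected_occupied(n_items, n_buckets):
    """Expected number of non-empty buckets after hashing n_items uniformly
    at random: N * (1 - (1 - 1/N)**n), evaluated in a numerically stable way."""
    return -n_buckets * math.expm1(n_items * math.log1p(-1.0 / n_buckets))


n, buckets = 3_300_000, 2**32
occupied = expected_occupied(n, buckets)
collisions = n - occupied        # expected collisions for a random hash
sdev = math.sqrt(collisions)     # Poisson approximation (an assumption)
z = (collisions - 1100) / sdev   # how far 1100 collisions sits below the mean
print(round(occupied), round(collisions), round(sdev, 1), round(z, 1))
```

If the token count in the experiment was really lower than 3.3 million, the expected number of collisions shrinks roughly with the square of the count, so the comparison is sensitive to that figure.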
> >That's significantly better than you could expect from a truly random hash >function, so is fishy. Tossing 3.3M balls into 2**32 buckets at random >should leave 3298733 buckets occupied on average, with an sdev of 35.58 >buckets. Getting 1100 collisions is about 4.7 sdevs fewer than the random >mean. I may have gotten the # of tokens wrong. Currently my test runs are using 3.3M tokens but it may have been fewer when I was doing the hash tests. Maybe 2.3-2.4M tokens at that time? Anyway, thanks for the info about the relative merits of CRC32 and the Python hash; I'd been told CRC32 was bad and so was really surprised when it was marginally better. >Since we're sticking to unigrams, we don't have an insane database burden. >We also (by default) limit ourselves to looking at no more than 150 words >per msg. So I'm not sure saving some bytes of string storage is "worth it" >for us, and it's very nice that we can get back the exact list of words that >went into computing a score later. A pile of hash codes wouldn't give the >same loving effect . Well, unless I'm missing something, you've got to keep track of every token you've ever seen, and you've got to look up every token you encounter to determine if it's significant enough to consider in the final calc. If so, assuming the final calc isn't exponential, reducing the lookup time/resources can be a big win performance-wise. Note that since you have the text of the token before you hash it, you can keep that around for significant tokens and display it later. The only reason to hash is for speed of access to the probability data. The cost of the hashing is the inevitable collisions, which blur the probabilities for colliding tokens. >Except I didn't get good enough results from his approach to justify >pursuing it here, even leaving the hash codes at the full 32 bits. 
When I >went on to squash them to fit in a million buckets, a few false positives >popped up that were just too bad to bear (two can be found in the list >archives): ham that was so obviously ham that no system that called them >spam would be acceptable to most people. I wasn't commenting on the phrase system, or even hashing, but rather on data reduction to reduce the memory footprint required of the statistical tables (ie: using 1 byte frequency counts vs. 4 byte ones). Also, a cautionary note: just because the current system doesn't generate any horrible false positives on your corpii doesn't mean it won't do so on Joe Schmoe's. Or my slightly smelly ham. > > * I was playing a week or two back with 1 and 2 token groups, and >> found that a useful technique was, for each new token, to only >> consider the most deviant result. So if the individual word was .99 >> spam, and the two word phrase was .95, it would only consider the .99 >> result. This would probably help with Bill Y's combinatorial scheme. > >It could be a viable approach to the problem mentioned above: a scheme to >suck out more than one word that doesn't systematically generate mounds of >nearly redundant (highly correlated) clues. We're clearly missing info by >never looking at bigrams (or beyond) now, and that continues to bother me >(even if it doesn't seem to be bothering the error rates ). Right; and, related to the metagame, you've got to consider responses by the spammers. The initial attempt to defeat these kind of recognizers is going to try and exploit cancellation disease, probably by having a spammy preamble and a very hammy postscript. So one possible approach would be to gradually degrade the significance of a token the further along in the email it is (both during training and recognition). But of course, then you'll have to watch for html email that loads the front of the message with invisible ham. So a parser that spits out only the tokens a human is going to see is indicated. 
> > * My personal bias (as I think Guido mentioned) is for a multifaceted >> approach, using Bayesian, rules-based (attacking things that bayesian >> isn't good at, like looking for obfuscated url structures), DNSBL, >> and whitelisting heuristics to generate an overall ranking. So a >> hammy mail from a guy in your address book would bubble up to highest >> priority, whereas something spammy from him would stay neutral. > >I'm not sure we really need it. For example, *lots* of spam has been >discussed on this mailing list, so much so that the python.org email admin >had to castrate SpamAssassin for msgs to this list address else it kept >blocking ordinary list traffic. My personal email classifier never calls >anything here spam, though, nor does it call the originals of the spams >posted here ham. Beware the One True Path. There is strength in diversity. Or, as the noted philosopher D. Vader put it, "Don't be too proud of this technological terror you have created." As you will recall, those rebel scum managed to craft a nasty false positive. > >I do worry a little about obsfuscated HTML. We strip almost all HTML tags >by default for a reason I've harped on enough : all HTML decorations >have very high spamprobs, and counting more than one of them as "a clue" >fools almost every combining scheme into believing the msg containing them >is spam (if you know a msg contains both
and

, it's not really more >likely to be spam than if you just know it contains
!). So we blind the >classifier to HTML decorations now. > >But a spam I forwarded here a week or so ago exploited that: the spam was >interleaved with size=1 white-on-white news stories and tech mailing list >postings. The classifier *did* see those, but didn't see the HTML >decorations hiding them. This was a cancellation-disease-by-construction >kind of msg, and chi-combining scored it near 0.5 as a result (solidly >Unsure). It's the only spam of that kind I've seen so far; if it becomes a >popular techinque, we'll have to take more HTML blinders off the classifier. That's a classic example of metagaming. Seems to me, the strength of the spambayes recognizer is in recognizing the semantics (the spammy meaning of the message), not the syntactics. So train it only on what a human would see reading the message. Have another recognizer (either rules-based, bayesian, whatever works) that deals with the syntactics, and picks up on the html decoration tricks. In other words, one that looks at what the message says, and another that looks at how it is presented. This will prevent that particular kind of simple cancellation attacks. And that wraps back to the "more responses" suggestion above. How do you rate a hammy message with spammy html ornaments? Might not "a little hammy" be a better response than "beat's me, boss!"? > >> There's lots of room for cooperation between the various approaches >> and multiple agents means its less likely that a spam will get by. >> In particular, whitelisting heuristics can almost eliminate false >> positives. > >I'll let you know if I ever see one . You will. And it will be the one email that you really, really needed to read. Murphy's Law guarantees that it will happen. In fact, it typically happens (in my painful personal experience) soon after you make comments like the above. >Getting vast quantities of spam isn't a problem anymore, but getting vast >quantities of ham is. 
Since your spammy ham is presumably business-related, >I assume you can't share it. Or can you? Probably not. Unless I could process them and just give you the tokens and frequencies in some useable format. I'll see what I can do next week, gotta get python up and running along on my Mac. Also gotta get the battlebot finished or my kids will hurt me. > Mixing spam and ham from >different sources also causes worlds of problems (indeed, we still (by >default) ignore most of the header lines partly for that reason, else the >system gets great results for bogus reasons). I do the same, I'm currently just looking at the subject line. At 12:09 PM +0100 11/10/02, Rob Hooft wrote: >I think our very good experience with the bayesian classifier would >"forbid" to use whitelisting. Once a whitelisted feature "leaks" >into the spam community, it will be useless. Not if the whitelist heuristics are based on the individual user's environment, as opposed to global features. >But there is a bayesian solution to it: Make the tokenizer recognize >the feature that you want to whitelist or blacklist, and emit a new >token to that effect. > > From: --> Will have a low spamprob > url:numeric-host --> Will have a high spamprob While this is a useful approach, there is (IMHO) a need for users to be able to override, or at least modulate, the bayesian results in certain circumstances. The classic example would be your boss forwarding a 419 scam to you with the comment "Looks good, I'm going to invest in this, what do you think?". The spamminess might overwhelm the low spamprob From: A (paranoid) user needs to be able to tell the system "I don't care how spammy an email looks, if it's got this feature, I've got to at least glance at it with the Mk.1 Eyeball Recognition System". 
Note that this doesn't mean that it should be declared "clean as the driven snow", just "might not be a pile of decomposing lunchmeat" Yeah, this means that every spam going into Microsoft will eventually be from "billg@microsoft.com", but the consequences of this might be interesting. Or at least, amusing. best,R -- Woodhead's Law: "The further you are from your server, the more likely it is to crash." From tim.one@comcast.net Mon Nov 11 00:59:05 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 19:59:05 -0500 Subject: [Spambayes] More experiments with weaktest.py In-Reply-To: <3DCE4D02.6060907@hooft.net> Message-ID: [Tim, notes that his mistake-only training works in the order msgs come in] [Rob Hooft] > I toyed with the idea, but that would involve parsing all messages once > before starting, and sorting them on date. Putting them in a set to > "randomize" the order is much easier, so I was lazy. That's fine. For purposes of comparing this against previous tests, I expect it's even good, since they were randomized too. > ... > Hm. That sounds so enthousiastic that I just might commit what I have > gone through this night. You did, and I thank you! Note that there were already three Simplex pkgs linked from http://www.python.org/topics/scicomp/numbercrunching.html but I know how much fun it is write such stuff again . > Some more info: > > * No, I have not used a "Simulated Annealing" or "Threshold Accepting" > yet. Please keep in mind that each step in the optimization takes > between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my > work PC). This would be way too costly. Just minimization it will be. Understood. > * I tried to use "Simplex optimization" (let a multidimensional > triangle walk through phase space) on the "Total cost" parameter. > This was simply disastrous. Phase space consists of plateau regions > that are exactly flat, joined by huge ridges. 
Think about that one > spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one > bang to the cost. This field is impossible to optimize. Yes, it's a sum of step functions in the end, and at every point "the derivative" is either 0 or infinite, depending on where you are and which direction you look. Making a new "smooth" cost measure was thoroughly appropriate: > * I designed a new "Flex cost" field. That one does away with the > "unsure cost". The cost of a message is 0.0 at its own cutoff, and > increases linearly towards its "false" cost at the other cutoff, > and increases further to the other end. Hm. Unreadable. The code is clear enough, though. What I didn't understand is why each term in the flexcost is divided by the difference between the (fixed per run) cutoff levels: / (SPC - HC). That seems to systematically penalize, e.g., ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8). In the limit, if someone wanted a binary classifier (ham_cutoff == spam_cutoff), any mistake would be charged an infinite penalty. > A table: > > Score Spam with this Ham with this > score costs score costs > 0.00 $ 1.29 $ 0.00 It's hard to see where that comes from. Assuming ham_cutoff is 0.2 and spam_cutoff 0.9, and so a spam scoring 0.0 works out to $1 * (.9-0.0)/(.9-.2) ? > 0.20 $ 1.00 $ 0.00 > 0.55 $ 0.50 $ 5.00 > 0.90 $ 0.00 $10.00 > 1.00 $ 0.00 $11.43 > > This field is much more smooth than the total cost field, so I was > hoping that pure minimization will do. Obviously, the flex cost is > much, much higher than the total cost because unsures are so much > more expensive. The flex cost field will also be less sensitive to > the {sp|h}am_cutoff parameters than the total cost field, because > there are no sudden cost jumps. Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a DivideByZero exception, so it's sure sensitive there . 
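The flex cost table quoted above can be reproduced with a few lines of Python. This is a sketch reconstructed from the table and from the formula guessed at in the reply, assuming ham_cutoff=0.2, spam_cutoff=0.9, a $1 cost per false negative and a $10 cost per false positive. The function name `flex_cost` is made up here, and this is not the code in weaktest.py.

```python
def flex_cost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9,
              fn_cost=1.0, fp_cost=10.0):
    """Linear 'flex' cost of one message (hypothetical reconstruction).

    A spam costs nothing at or above spam_cutoff and fn_cost at ham_cutoff;
    a ham costs nothing at or below ham_cutoff and fp_cost at spam_cutoff.
    Beyond the far cutoff the cost keeps growing linearly.
    """
    # This is the "/ (SPC - HC)" term discussed above. In this sketch every
    # call raises ZeroDivisionError when the two cutoffs are equal.
    width = spam_cutoff - ham_cutoff
    if is_spam:
        return fn_cost * max(0.0, spam_cutoff - score) / width
    return fp_cost * max(0.0, score - ham_cutoff) / width


for score in (0.00, 0.20, 0.55, 0.90, 1.00):
    print("%.2f  spam $%5.2f  ham $%5.2f"
          % (score, flex_cost(score, True), flex_cost(score, False)))
```

With these assumed costs the output matches the quoted table, including the $1.29 and $11.43 entries, which supports the reading $1 * (.9 - 0.0) / (.9 - .2) for a spam scoring 0.0.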
I suspect it might work better if the "/(SPC-HC)" business were simply removed? > * Results are not great I need to experiment more before reporting > on them. > * I just committed: > weaktest.py: introduction of the flexcost measure > optimize.py: simplex optimization (needs Numeric python; sorry) > weakloop.py: run weaktest.py repeatedly under simplex optimization I've been running weakloop.py over two sets of my c.l.py data while typing this. That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total msgs. It's been thru the whole business about 25 times now. At the start, Trained on 88 ham and 66 spam fp: 0 fn: 0 Total cost: $30.80 Flex cost: $212.3120 x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31 It's having a hard time doing better than that. The best so far seems to be Trained on 82 ham and 66 spam fp: 0 fn: 0 Total cost: $29.60 Flex cost: $200.0924 x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09 which is so close to the starting point that it's hard to believe it's finding something "real". It *does* seem to be in a nasty local minimum, though, as the next attempt was: Trained on 118 ham and 69 spam fp: 1 fn: 0 Total cost: $47.20 Flex cost: $344.7334 x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73 I'm afraid it looks like it's eventually going to converge on the most delicate possible settings that barely manage to avoid that 1 FP. From tim.one@comcast.net Mon Nov 11 01:17:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 20:17:46 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <3DCE50FC.3050005@hooft.net> Message-ID: [Tim] >> ... my primary interest here is to see how bad things can get if >> a user takes mistake-based training to an extreme. Despite that >> it's heavily hapax-driven, it appears to do very well when judged by >> error rate. [Rob Hooft] > Hm. There are so little fp/fn's relative to unsures (at least after 30 > messages initial training), that it wouldn't matter much (I think). 
As I tried to explain later, the psychological impact of the Unsures isn't attractive, though -- they remain bizarre to human eyes. When I got up today, I got 6 new Unsure spam: human growth hormone, gay porn, life insurance, mortgage rates, a msg that made no sense (empty except for a Yahoo auto-generated sig), and Genuine Leather Jackets. It's not picking up on general "this is advertising" clues, or even on general "this is gay porn" clues. Indeed, "XXX" is still a hapax! This particular HGH spam will never get through again, because training it found 80(!) hapaxes unique to it. It's not going to do much to stop other HGH spam, though -- this one was especially chatty, and added words like 'forget', 'hair', 'lose', 'lost' and 'anywhere' to the collection of (what are now, after training on it) spam hapaxes -- just as previous HGH spam trained on didn't stop this one. To my eyes, I had already told it about HGH spam, and I'm irked that it showed me another one. Ditto gay porn, ditto life insurance, etc. [on database growth as a function of # of msgs] > Hm, it is more like a sqrt after more messages. See attached image which > has a sqrt X axis. The fit fits the data even at the lowest end. Cool! That was a dramatic graph indeed. Soon there will be no mysteries remaining . From tim.one@comcast.net Mon Nov 11 02:00:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 21:00:20 -0500 Subject: [Spambayes] Proposing to rename some fundamental options In-Reply-To: Message-ID: [Tim] > The original names made more sense when we had half a dozen competing > schemes. > > Current Proposed > ------- -------- > robinson_probability_x unknown_word_prob > robinson_probability_s unknown_word_strength > robinson_minimum_prob_strength minimum_prob_strength This renaming has been done. It should have no effect on pickles or databases (i.e., no need to retrain). 
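For readers wondering what the three renamed options control, here is a small Python sketch following Gary Robinson's adjustment as it is usually described. The function names are invented for illustration, and the default values (0.5, 0.45, 0.1) are assumed to correspond to the x, s and p values printed in the weakloop.py output earlier in this archive; they are not read from the project's options file.

```python
def adjusted_spamprob(raw_prob, n, unknown_word_prob=0.5,
                      unknown_word_strength=0.45):
    """Blend a word's observed spam ratio with a prior belief.

    raw_prob -- fraction-based spam estimate for the word (0..1)
    n        -- number of training messages the word appeared in
    A word never seen (n == 0) gets exactly unknown_word_prob; the more
    often it is seen, the less the prior matters.
    """
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * raw_prob) / (s + n)


def is_strong_clue(prob, minimum_prob_strength=0.1):
    """Words whose spamprob is too close to 0.5 are ignored when scoring."""
    return abs(prob - 0.5) >= minimum_prob_strength


# A hapax seen once, only in spam, is pulled well back from 1.0:
print(round(adjusted_spamprob(1.0, 1), 4))
# The same evidence repeated 100 times is trusted much more:
print(round(adjusted_spamprob(1.0, 100), 4))
```

Under these assumptions a spam-only hapax lands around 0.84 rather than 1.0, which is one way to see why mistake-based training on hapaxes blocks a specific message without generalising much.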
From anthony@interlink.com.au Mon Nov 11 02:22:26 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 11 Nov 2002 13:22:26 +1100 Subject: [Spambayes] helping push the ham score for "nigeria" higher. Message-ID: <200211110222.gAB2MQB11817@localhost.localdomain> apologies for the marginal relevance, but it entertained me :) http://news.bbc.co.uk/1/hi/world/africa/2423283.stm "I am writing to you in the hope that you are under god and well. My naming is Professor Isoun Turner, and I am having hope you can assist. We are having a communications sattelite worth $15 millon US dollars that needs to be launched, but we need to find an international launch pad" From tim.one@comcast.net Mon Nov 11 05:42:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 00:42:51 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Robert Woodhead] > ... > Heh, call me old fashioned, but I actually like to know how things > work, rather than relying on black magic. ;^) You'll like this code, then! We hate "mystery knobs", and everything has a purpose. A purpose may not make sense, but at least it has one. > ... > Indeed. And you also have to start worrying about the metagame; > assuming your system goes into widespread deployment, what will the > intelligent spammer (oxymoron) responses be? I expect to get rich by selling spammer software to defeat this latest round of classifiers, so it's not that I can't tell you what their responses will be, it's that I don't want to reveal trade secrets . Indeed, if there are technically savvy spammers, they're subscribed to this list (and others like it). > ... > Ah, but there are more considerations. First, many people's training > sets may not be as distinct as yours, so the results might be more > blurry. 
Of all the things this project has done I find lacking in other projects, this is the part I think gives this project its clearest advantage: we have a statistically sound testing framework, more than one person testing on more than one corpus, people are beat up for running sloppy tests, and major algorithm improvements have been vetted by many here on their own data, and publicly reported results.. Winners survived and losers got purged from the codebase, and no single test corpus ruled that. Even for people with a single test corpus, the testing framework slices-and-dices it into multiple runs, so that results specific to a quirk of one subset can't be mistaken for "the truth". The project's TESTING.txt talks more about this. My tech mailing-list data turned out to be easier than most peoples', seemingly because almost all forms of advertising, and of HTML, are despised on tech mailing lists. But I've got other, harder test data too, and at least one person here (hi, Anthony!) has a flatly horrid corpus. > Second, future versions of the software might end up including other > recognizers in the mix (for example, DNSBL, url heuristics, whitelists, > stamping systems, etc), so adding a bit of flexibility at the start > doesn't cost you anything, but could end up saving everyone a lot of > work down the road. We'll define a stable API for accessing this system. If people want to combine it with other systems, that's fine, and Python excels at playing nice with other systems. If someone wants to add, e.g., a DNSBL gimmick to *this* codebase, they should write a new module to do so. I don't want fundamentally different approaches mixed into one module, let alone one function. > Since most existing mailreader filter schemes are relatively primitive, > more than 10 levels of discrimination isn't going to be all that useful. > But only 3 would seem to be to be too few. In a 1-9 scheme, the > current 3 levels would map to (say), 2,5,8. 
Let me clarify: I don't object to defining a billion levels, the problem is that I've seen no evidence that the algorithm in use here *can* provide more than 3 meaningful levels. chi-combining usually gives extreme scores. The median spam score is (to 6 significant digits) 1.0; the median ham score is on the order of 1e-10. The difference between, e.g., 1e-20 and 1e-5 appears meaningless, despite that it's 15 orders of magnitude. When chi doesn't give an extreme score, it tends to give one near 0.5, and which side of 0.5 it lies on doesn't appear to have strong correlation with whether a thing is ham or spam. The system is saying "I'm lost!" then, and it is. In effect, it's a 1-bit classifier but with a very useful middle ground. That it only gives about 1 bit of info follows from that the underlying math is a statistical accept/reject test (a two-outcome decision). Well, it's actually two accept/reject tests under the covers (one for ham, one for spam), and that's where the middle ground comes from (they both accept or both reject). If we were to call our middle ground 5, what good would that do anyone else? It doesn't mean we judge the odds of a msg being spam at 1 in 2. It means we have no idea. It certainly doesn't mean what, e.g., a 5 coming out of SpamAssassin means. "Unsure" means what it says. If, in the future, a new and better algorithm comes along with 6 meaningful digits, then I expect a new X- header would be defined to report it. > It's just a syntactic difference, but it gives you precious wiggle room. I'll leave more on this to people adding headers (the client I'm using doesn't use headers, but does attach integer score (in 0-100) metadata to msgs). [on hash collisions] > ... > I may have gotten the # of tokens wrong. Currently my test runs are > using 3.3M tokens but it may have been fewer when I was doing the > hash tests. Maybe 2.3-2.4M tokens at that time? 
Anyway, thanks for > the info about the relative merits of CRC32 and the Python hash; I'd > been told CRC32 was bad and so was really surprised when it was > marginally better. Hard to say. Neither CRC32 nor Python's string hash make any effort toward being "crytographically secure", and Python's string hash is in fact and deliberately "better than random" in some common cases: >>> hash('x1') 739453787 >>> hash('x2') 739453784 >>> hash('x3') 739453785 >>> hash('x4') 739453790 >>> That is, it's very regular in a way that most often yields fewer 32-bit collisions than a truly random hash function would yield when fed input strings with regularities. That eventually breaks down if you throw enough strings at it -- but it doesn't get "worse than random" then either, so far as it's ever been pushed. > ... > Well, unless I'm missing something, you've got to keep track of every > token you've ever seen, So far we have, but there's slow-motion work in progress on database pruning. > and you've got to look up every token you encounter to determine if > it's significant enough to consider in the final calc. Yes, and that will probably always be true. > If so, assuming the final calc isn't exponential, reducing the lookup > time/resources can be a big win performance-wise. I don't believe so. When using a Python dict as "the database", the time for scoring a msg is minor compared to the time taken by parsing and tokenization, and especially compared to the time just to get the msg *into* the system (whether that's file I/O, or socket I/O, or some email pkg's programming API, or whatever -- that part is the bottleneck when using a dict; when not using a dict, database access time may become a burden, and most databases in use here require string keys even if you're working with ints -- the database user has to convert the hash code to a string! Other databases (like ZODB) could use ints directly as keys, but they're rare.). 
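The balls-in-buckets figures quoted earlier in this thread (3298733 occupied buckets on average, an sdev of about 35.58, and 1100 collisions being roughly 4.7 sdevs low) can be checked with a short calculation. The sketch below assumes 3.3M tokens and a 32-bit hash, and treats the collision count as roughly Poisson, so that its standard deviation is the square root of its mean. That approximation is a choice made for this example, not necessarily how the original figures were derived.

```python
import math


def occupancy_stats(n_tokens, n_buckets):
    """Expected occupied buckets, expected collisions, approximate sdev.

    Exact mean: m * (1 - (1 - 1/m)**n), evaluated with log1p/expm1 so the
    tiny per-bucket probability does not get lost to rounding.
    """
    occupied = -n_buckets * math.expm1(n_tokens * math.log1p(-1.0 / n_buckets))
    collisions = n_tokens - occupied   # tokens that hit a used bucket
    sdev = math.sqrt(collisions)       # Poisson approximation (assumption)
    return occupied, collisions, sdev


occupied, collisions, sdev = occupancy_stats(3_300_000, 2**32)
z = (collisions - 1100) / sdev
print("occupied   %.1f" % occupied)
print("collisions %.1f" % collisions)
print("sdev       %.2f" % sdev)
print("1100 collisions is %.1f sdevs below the mean" % z)
```

Rerunning with 2.4M tokens instead gives an expected collision count well under 1100, so whether the observed figure looks "better than random" depends heavily on which token count was actually in use for the hash tests.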
> Note that since you have the text of the token before you hash it, > you can keep that around for significant tokens and display it later. Good point! I had overlooked that indeed. > The only reason to hash is for speed of access to the probability > data. Feel free to experiment; as above, I don't have reason to suspect that switching to hash codes would speed anything here, except for Jeremy's ZODB database (which could switch to using an IOBTree, which is zippier than an OOBTree). > The cost of the hashing is the inevitable collisions, which > blur the probabilities for colliding tokens. Another cost is obscuring the code. > ... > I wasn't commenting on the phrase system, or even hashing, but rather > on data reduction to reduce the memory footprint required of the > statistical tables (ie: using 1 byte frequency counts vs. 4 byte > ones). Ours are actually unbounded, but I don't have any problem with the memory footprint now. Others do. It seems more fruitful at this point to concentrate on ways to reduce the # of tokens, rather than the size burden per token. BTW, see the neil*.py files for how one person here builds a lean scoring-only CDB database -- you can store things any way you like, provided that the database access function is fiddled to convert to what the classifier expects to use. I don't believe such conversion is a significant time burden, but I haven't run the CDB variant and so haven't timed it (Neil, do you have gripes about memory or time? Spit 'em out.). > Also, a cautionary note: just because the current system doesn't > generate any horrible false positives on your corpii doesn't mean it > won't do so on Joe Schmoe's. Or my slightly smelly ham. Sure, but I'm a realist: any non-trivial scheme has a non-zero FP rate. That's life. What users choose to do about that isn't for this project to dictate. It is our responsibility to say up-front that there will be false positives, and we do say so. > ... 
> Right; and, related to the metagame, you've got to consider responses > by the spammers. The initial attempt to defeat these kind of > recognizers is going to try and exploit cancellation disease, > probably by having a spammy preamble and a very hammy postscript. They can't really defeat this scheme that way. At best they can hope to push msgs into Unsure territory. What constitutes "very hammy" is a function of each user's database here, and no generic blob of text is going to score high for hamminess everywhere. The spam in question happened to include a news story about the DC-area snipers, and that was very hammy for *me* because I live in that area and many friends and relatives had corresponded about the snipers (including forwarding the text of that very news story, as if we were suffering a news blackout here ). Even so, the message ended up as Unsure for me, not as Ham. That's to the credit of chi-combining, which is very good about knowing when it's confused. > So one possible approach would be to gradually degrade the > significance of a token the further along in the email it is (both > during training and recognition). I think there is reason to believe that spammers have to get your attention early. OTOH, many pieces of incriminating evidence also live at the end of spams ("this is not spam!" blurbs, the explanation that you got this because you're on an opt-in list run by one of their "partners", references to various state and federal bills, the "unsubscribe me" URL slash address harverster, etc). The white-on-white spam I mentioned before had hammish stuff at the start, and at the end, and between each pair of paragraphs. > But of course, then you'll have to watch for html email that loads the > front of the message with invisible ham. So a parser that spits out > only the tokens a human is going to see is indicated. Yup. Guido suggested that at the start, but that level of HTML analysis gets a lot more expensive too. We'll see. 
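As an aside on the chi-combining credited above with knowing when it is confused: the "two accept/reject tests" description can be sketched in Python as follows. This is written from the general description of the method (Fisher-style chi-squared combining run once for the spam hypothesis and once for the ham hypothesis), not copied from the project's classifier; the names `chi2q` and `chi_combine` are chosen for this example.

```python
import math


def chi2q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)   # guard against rounding slightly above 1


def chi_combine(probs):
    """Combine per-word spamprobs (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    sum_ln_one_minus_p = sum(math.log(1.0 - p) for p in probs)
    sum_ln_p = sum(math.log(p) for p in probs)
    # S is near 1 when the words look collectively spammy,
    # H is near 1 when they look collectively hammy.
    S = 1.0 - chi2q(-2.0 * sum_ln_one_minus_p, 2 * n)
    H = 1.0 - chi2q(-2.0 * sum_ln_p, 2 * n)
    return (S - H + 1.0) / 2.0


print(chi_combine([0.99] * 20))                 # strongly spammy clues
print(chi_combine([0.01] * 20))                 # strongly hammy clues
print(chi_combine([0.99] * 10 + [0.01] * 10))   # clues that cancel out
```

When strong clues point both ways, S and H are both close to 1 and the score collapses to about 0.5, which is the Unsure middle ground described earlier in this message rather than a claim that the odds are 1 in 2.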
BTW, on large tests this system scores about 80 msgs/second on my box, including everything (system time, training, I/O, parsing, tokenizing, scoring, reporting, recording, and analyzing results -- this is # of msgs divided by elapsed wall-clock time). We could afford to get slower, if necessary. > ... > Beware the One True Path. There is strength in diversity. Let a thousand classifiers bloom. If someone here wants to volunteer the effort to try a different approach, that's always been welcome. But the results have been so good sticking to one basic approach that I don't see that happening. We ended up doing one thing exceedingly well, and that's a contribution to diversity too, of a kind you may be undervaluing . > Or, as the noted philosopher D. Vader put it, "Don't be too proud of > this technological terror you have created." As you will recall, > those rebel scum managed to craft a nasty false positive. I don't view an FP as being as costly as needing to build a new Death Star. For goodness sake, this is email we're talking about -- anyone trusting a truly critical msg to email is dreaming to begin with. > ... > That's a classic example of metagaming. Seems to me, the strength of > the spambayes recognizer is in recognizing the semantics (the spammy > meaning of the message), not the syntactics. Well, it's got no semantic knowledge at all. It doesn't even know which language a msg is written in, let alone what it means, and has no concept of "word" beyond "stuff that appears between whitespace". It's very much focused on purely local lexical structure. > So train it only on what a human would see reading the message. We get a lot of value out of mining a handful of header lines. We also get a lot of value out of tokenizing embedded "invisible" URLs. The theme here is that we tokenize "what works", and that's driven by measured error rates; philosophy doesn't enter into that part. 
> Have another recognizer (either rules-based, bayesian, whatever works) > that deals with the syntactics, and picks up on the html decoration > tricks. In other words, one that looks at what the message says, and > another that looks at how it is presented. This will prevent that > particular kind of simple cancellation attacks. A rule-based system seems more effective to me too against that particular gimmick. Also against viruses. > And that wraps back to the "more responses" suggestion above. How do > you rate a hammy message with spammy html ornaments? Might not "a > little hammy" be a better response than "beat's me, boss!"? I have no real idea, but fear that presuming "yes" is presuming a lot of intelligence that systems parsing this header won't actually have. The fancier the rating scheme the fancier they have to be too. In the end, the user has to decide what to do about everything that's not called ham, no matter how many or few the non-ham categories. As a user myself, I've got no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats me". That already gives two categories I have to check, and that's enough. I do find it useful that my client can sort on the score metadata, and there are proposals here too to add fancier header lines beyond the basic spam/ham/unsure one. [on FPs] > You will. Of course I will. > And it will be the one email that you really, really needed to read. It doesn't matter -- I review all my spam. Other people won't, and so it goes. > Murphy's Law guarantees that it will happen. In fact, it typically > happens (in my painful personal experience) soon after you make > comments like the above. You realize you're overselling badly here, right ? > ... > I do the same, I'm currently just looking at the subject line. Look at tokenize_headers() in tokenizer.py for a number of other corpus-independent header lines that proved useful to tokenize. 
Surprising but true: we can get a very good classifier by looking at this handful of header lines alone. Or by looking at the body alone. Looking at both takes longer . > ... > While this is a useful approach, there is (IMHO) a need for users to > be able to override, or at least modulate, the bayesian results in > certain circumstances. The classic example would be your boss > forwarding a 419 scam to you with the comment "Looks good, I'm going > to invest in this, what do you think?". The spamminess might > overwhelm the low spamprob From: This is akin to my "entire Nigerian scam quote" FP, and it's all but certain that the spam content would overwhelm the brief "from the boss" clues. OTOH, if my boss didn't wait for my reply and went ahead and invested anyway, the subsequent financial disgrace would open the door for me to take his job. After all, he relied on me for advice, so who more logical to succeed him? two-winners-and-only-one-loser-ly y'rs - tim From popiel@wolfskeep.com Mon Nov 11 06:11:25 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 10 Nov 2002 22:11:25 -0800 Subject: [Spambayes] More experiments with weaktest.py In-Reply-To: Message from Tim Peters References: Message-ID: <20021111061126.211B5F4CD@cashew.wolfskeep.com> In message: Tim Peters writes: > >I've been running weakloop.py over two sets of my c.l.py data while typing I've now run weakloop.py over three sets of my private data; that's 3*200 ham and 3*200 spam, for a total of 1200 messages. 
The best few it came up with were: Trained on 39 ham and 61 spam fp: 4 fn: 3 Total cost: $61.60 Flex cost: $189.7713 x=0.5040 p=0.1040 s=0.4400 sc=0.902 hc=0.204 189.77 Trained on 38 ham and 61 spam fp: 4 fn: 2 Total cost: $60.60 Flex cost: $189.9767 x=0.5060 p=0.1060 s=0.4300 sc=0.903 hc=0.206 189.98 Trained on 37 ham and 61 spam fp: 4 fn: 2 Total cost: $60.40 Flex cost: $189.2842 x=0.5054 p=0.0980 s=0.4436 sc=0.905 hc=0.209 189.28 Trained on 37 ham and 61 spam fp: 4 fn: 2 Total cost: $60.40 Flex cost: $189.8255 x=0.5033 p=0.0981 s=0.4456 sc=0.903 hc=0.206 189.83 Trained on 37 ham and 61 spam fp: 4 fn: 2 Total cost: $60.40 Flex cost: $189.8260 x=0.5026 p=0.1000 s=0.4458 sc=0.902 hc=0.207 189.83 There were a few where it trained on a couple more or less ham and spam... but I had to go hunting for them. I find it quite interesting that my ham:spam training ratio here (about 2:3, about where all my ratio tests have been pointing as a sweet spot) is significantly different than that reported by others (which has been much closer to 1:1 or favoring more ham than spam). I guess my corpus really is unusual. FWIW, I'm running it again with all 10 of my sets (4000 messages total) overnight. - Alex From popiel@wolfskeep.com Fri Nov 8 00:06:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 07 Nov 2002 16:06:27 -0800 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message from "Tim Peters" References: Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[Anthony Baxter] >> Note that "random sample" is not as trivial as all that, either - if >> you have a very high ham:spam ratio in your training DB, your accuracy >> will suffer (see the tests from Alex, myself and others). > >I still need to try to make sense of those tests. A real complication is >that more than one thing changes when trying to test ratios: it's not just >the ratio that changes, it's the absolute number of each trained on too. True. 
>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham >and 10000 spam. The ratios are identical. Do we expect the error rates to >be identical too? I don't, but haven't tried it. I have tried this, and the effects of ratio were diminished as the training set size increased. For details, see http://www.wolfskeep.com/~popiel/spambayes/ratio2 . The tests were done with gary-combining, not chi-square, so I really ought to rerun them. >I expect the latter would do better than the former, despite the identical >ratios, simply because more msgs allow better spamprob estimates. It depended on what the ratio in question was... for 1:4 ham:spam, increased training set size hurt instead of helped, in the ranges that I was able to test. For 1:1, increased training helped instead of hurt. >Something missing in "the ratio tests" is a rationale (even an >after-the-fact one) for believing there's some aspect of the system that's >sensitive to the ratio. The combining method certainly is not, and the >spamprob estimation (update_probabilities()) deliberately works with >percentages instead of raw counts so that the ham::spam training ratio >has no direct effect on the spamprobs calculated. Eh, I have a perfectly good rationale for believing that something is sensitive to the ratio: the tests I've run show such a sensitivity. What's missing is a theory on _why_ there's a sensitivity. ;-) I don't think the following theory is perfectly phrased, but it seems plausible to me: Perhaps the number of topics discussed in ham is greater than that in spam. Thus, the average percentage of ham messages containing a particular significant ham word is systematically lower than the average percentage of spam messages containing a particular significant spam word. As the training set size increases, the percentage difference becomes more consistent and pronounced. Since we're then combining the percentages, we systematically skew slightly due to the differing averages.
Changing the ratio of ham to spam has the effect of changing the number of topics discussed, particularly when the training set size is small and random chance can exclude all instances of a given topic. Balancing the number of topics removes the skew in the probabilities. As training set size increases, adjusting the ratio has less effect, because it has less likelihood of eliminating topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low. The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.

I suspect that by the time the corpora got that large, adjusting the training ratio wouldn't make a lick of difference if the corpora were sampled randomly to achieve the given ratio. There would just be too little chance of excluding a topic from the samples. Systematically excluding a topic might produce equivalent results to my ratio tests.

- Alex

From richie@entrian.com Fri Nov 8 00:17:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 00:17:25 +0000
Subject: [Spambayes] SMTP proxy questions
Message-ID:

[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea while I'm still broadly in favour of it. What do people think - do non-Outlook users want to forward messages to 'spam' and 'ham' to train the system, or use an HTML UI?
The most difficult problem for retraining-by-forwarding is matching the forwarded message to one from the cache, after Outlook Express has stripped the headers, top-quoted the user's .sig, converted it to HTML and added fifteen macro viruses. Any ideas? Can the tokeniser help?

Or perhaps there's another way. The only other option I'd thought of was to add two hyperlinks to the end of the message, "This is spam" and "This is ham" (in ways that would work for both HTML and plain-text messages, in both HTML and plain-text email clients). They'd link to the HTML interface and tell it the cache ID of the message. Adding content to emails is way more intrusive (and difficult) than adding headers. But no more intrusive than the .sig that mailman adds.

--
Richie Hindle
richie@entrian.com

From anthony@interlink.com.au Fri Nov 8 00:30:09 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 08 Nov 2002 11:30:09 +1100
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID: <200211080030.gA80UAf11390@localhost.localdomain>

> I've discussed this with Tim S, and he's going off the SMTP proxy idea
> while I'm still broadly in favour of it. What do people think - do
> non-Outlook users want to forward messages to 'spam' and 'ham' to train the
> system, or use an HTML UI?

I'd have to say I don't like the idea. There's too many potential places where it can all go horribly horribly pear-shaped, and too many rat-holes that the various email clients can screw up with.

Anthony

--
Anthony Baxter
It's never too late to have a happy childhood.

From jbublitz@nwinternet.com Fri Nov 8 01:15:29 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Thu, 07 Nov 2002 17:15:29 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID:

On 08-Nov-02 Richie Hindle wrote:
> Or perhaps there's another way.
The only other option I'd
> thought of was to add two hyperlinks to the end of the message,
> "This is spam" and "This is ham" (in ways that would work for
> both HTML and plain-text messages, in both HTML and plain-text
> email clients). They'd link to the HTML interface and tell it
> the cache ID of the message. Adding content to emails is way
> more intrusive (and difficult) than adding headers. But no more
> intrusive than the .sig that mailman adds.

What about adding a MIME object to the msg with the Spambayes info (text/spambayes?) - or will forwarding lose that info too? The email module should be able to do this.

Jim

From tim.one@comcast.net Fri Nov 8 04:07:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:07:18 -0500
Subject: [Spambayes] Proposing to drop retain_pure_html_tags
In-Reply-To:
Message-ID:

FYI, that option is gone now.

From tim.one@comcast.net Fri Nov 8 04:29:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:29:17 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To:
Message-ID:

The original names made more sense when we had half a dozen competing schemes.

Current                         Proposed
-------                         --------
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength

Note: unknown_word_prob is what the Bayesian prob adjustment moves toward, more strongly the less evidence backs up a counting spamprob estimate (the fewer the msgs a word has been seen in, the more the adjustment pushes the spamprob toward unknown_word_prob; for a word that's never been seen before, this reduces to unknown_word_prob exactly). We've always set it to 0.5 by default, and previous tests never showed benefit from changing that. We've gotten better since then, though, and it's possible to deduce "a more correct" value.
For example, take the mean of all the by-counting spamprobs in your database, across words that have appeared in at least 10 msgs (so that there's reason to have *some* confidence in the by-counting guess). That's then an estimate of the spamprob a new word will eventually get over time. Across 3 databases I tried this on, it turned out to be a little over 0.5, from 0.513 (my home personal classifier) to 0.540 (fat c.l.py test).

If someone has time for a controlled experiment, run the attached code to find this guess for one of your databases; then if it differs from 0.5, try a before-and-after test just changing that much. If there's any promise here, update_probabilities() could easily be changed to compute and use this automatically.

"""
import cPickle as pickle

f = file('fat.pik', 'rb')   # your database pickle goes here
c = pickle.load(f)
f.close()

w = c.wordinfo

def guessx():
    nham = float(c.nham or 1.0)
    nspam = float(c.nspam or 1.0)
    n = 0
    probsum = 0.0
    for rec in w.itervalues():
        if rec.hamcount + rec.spamcount >= 10:
            hamratio = rec.hamcount / nham
            spamratio = rec.spamcount / nspam
            prob = spamratio / (spamratio + hamratio)
            probsum += prob
            n += 1
    print n, probsum / n

guessx()
"""

From mhammond@skippinet.com.au Fri Nov 8 04:48:54 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 8 Nov 2002 15:48:54 +1100
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To:
Message-ID:

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts .

I think that the classes I posted a while ago suffer from the exact reverse problem as your idea. My idea was to make a "message store" that is largely independent of training. I believe the problem with your design is that it deals with the training at the expense of the message store.

Obviously, but worth mentioning, is that there are competing interests here.
My focus is towards clients, and specifically the Outlook one (if there were more clients I would be happy to think of them too ). A lot of the focus of this group is towards admins rather than individuals (which is just fine!)

But it seems the current thinking is of a corpus as being a fairly static, well-controlled set of messages used almost purely for training purposes. For client programs, this may not be practical. The corpus is a more dynamic set of messages - and worse, actually *is* the user's set of messages rather than a collection of message copies. For example, "moving" a message in a corpus may actually mean moving the message in the user's real inbox. This may or may not be what is intended - a corpus "move" operation is more about changing a message's classification than it is about physically moving pieces of mail around.

> A Corpus wouldn't know how to create Message objects, nor would a Message
> object know how to create itself - classes *derived from* them would know
> how to do that. For instance (totally untested code, probably full of
> typos) -
>
> class Message:

Jeremy and I both posted real code, so starting with something that takes that into consideration would be good.

> I may be putting too much
> into the base class by demanding that the text of the message be given to
> the constructor - that precludes making FileMessage lazy, and
> only read the
> file when it needs to.]

It also defeats the abstract nature of the class.

> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already exist for this engine are the best ones to use to discover the naivety early

> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.
Off the top of my head, I believe we need:

* An abstract "message id"
* A message classification database, as discussed before - basically just a dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch training. It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability and message databases.

At that level, we really don't need much else - no folders or any other grouping of messages. I'm really not too sure there is much value in adding higher-level concepts such as folders or message store "move" operations - certainly not at the outset, where there are too many competing requirements.

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but may be too simple to be necessary. If the process of re-training a message in the Outlook GUI becomes:

    def RetrainMessageAsSpam():
        # Outlook specific code to get an ID.
        message = message_store.GetMessage(id)
        if not classifier.IsSpam(message):
            classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it.

Unfortunately, the decision to perform the retrain is the complex, but client specific part. Is this a newly delivered message? Did the user manually move the message somewhere? Did the user click one of our buttons? Is the user deleting old ham that we want to train on before it dies forever? Outlook does this via examining what Outlook event we are seeing, and looking at meta-data we possibly previously attached to the message. I'm not sure this can be encapsulated well at the moment without adding all our meta-data etc baggage to the base classes.

> Most of the *new* code that's needed is defining the abstract concepts and
> their interfaces, rather than writing code that actually *does* anything -
> it's building a framework.

*cough* ummm...
This is doomed to failure. Code *must* do something to be taken seriously. At the very least, I would expect to see the existing test driver framework running against these "abstract concepts" > Once the framework is there, most of the code needed to implement the > functionality should already be in the project - code to hook > into Outlook, > to train on a message, to parse mbox files, and so on. It just needs > hooking into the framework. See above . Mark. From tim.one@comcast.net Fri Nov 8 04:50:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 07 Nov 2002 23:50:42 -0500 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: [Richie Hindle] > ... > The most difficult problem for retraining-by-forwarding is matching the > forwarded message to one from the cache, after Outlook Express > has stripped the headers, top-quoted the users .sig, converted it > to HTML and added fifteen macro viruses. Any ideas? If user can be convinced to forward as an *attachment*, those problems go away, at least in OE. You can create a new msg there, select any number of msgs, drag them to the msg as a group, and OE will create an attachment for each one. Unlike Outlook, OE appears to save the original stuff that came in over the wire (we're finding it's a real hoot in the OL client to try to guess what the original MIME structure may have been). > Can the tokeniser help? If you put in a token unique to each msg, sure . Perhaps the "loose checksum" program Skip checked in could be useful for this. From tim.one@comcast.net Fri Nov 8 05:06:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 00:06:43 -0500 Subject: [Spambayes] Upgrade problem In-Reply-To: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com> Message-ID: [Richie Hindle] > A quick note in case someone decides to remove the counts from the > database: Neil Schemenauer already does, in his CDB code (neil*.py). It's a lean scoring-only database, mapping tokens to *just* spamprobs. 
If he went on to store them as scaled ints, he could almost certainly reduce this to 2 bytes of prob info per token, and possibly even just 1.

> the HTML front end has a "Word query" feature which will tell you the
> information in the database for a given word - it's interesting to see
> how many more times the word 'Viagra' appears in ham than in spam. I
> mean the other way round.

What a geek .

From tim.one@comcast.net Fri Nov 8 05:48:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:48:25 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

[Just van Rossum]
> I think it can be done with almost no extra overhead with a
> caching scheme. This assumes (probably wrongly ) that
> the cache stays in memory between runs.
> Something like this perhaps:
>
> *** classifier.py Thu Nov 7 23:03:07 2002
> --- classifier.py.hack Fri Nov 8 00:04:05 2002
> ***************
> *** 456,459 ****
> --- 456,460 ----
>
>           wordinfoget = self.wordinfo.get
> +         spamprobget = self.spamprobcache.get
>           now = time.time()
>           for word in Set(wordstream):
> ***************
> *** 463,467 ****
>               else:
>                   record.atime = now
> !                 prob = record.spamprob
>               distance = abs(prob - 0.5)
>               if distance >= mindist:
> --- 464,470 ----
>               else:
>                   record.atime = now
> !                 prob = spamprobget(word)
> !                 if prob is None:
> !                     prob = self.calcspamprob(word, record)
>               distance = abs(prob - 0.5)
>               if distance >= mindist:

Sorry, I don't know what this is trying to accomplish. Like, what is self.spamprobcache? There's no such thing now, and the patch doesn't appear to create one (i.e., this code doesn't run). Whatever it's supposed to be, why isn't spamprobcache.get *itself* responsible for returning a spamprob, instead of making its caller deal with two cases?
If the answer is "it's supposed to be a dict, so .get ain't that smart", then the memory burden for a long-running scorer process will zoom, negating one of the benefits people attached to "real databases" thought they were buying in return for giant files and slothful performance . Life would be easier if databaseheads trained all they liked as often as they liked, but refrained from calling update_probabilities() until the end of the day (or other "quiet time"). The idea that the model should be updated after every msg trained on is an extreme. From tim.one@comcast.net Fri Nov 8 06:23:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 01:23:13 -0500 Subject: [Spambayes] Corpus module (was: Upgrade problem) In-Reply-To: Message-ID: [Richie Hindle, cogitates about Messages and their Corpus(ora)] That's the ticket! Backing off to a more fundamental level looks useful to me too. We never even straightened that much out for testing purposes (msgs.py isn't general enough; for some custom test drivers (never checked in), I couldn't even reuse the MsgStream class for my *own* directory structures). I disagree with Mark's > If the process of re-training a message in the Outlook GUI becomes: > > def RetrainMessageAsSpam(): > # Outlook specific code to get an ID. > message = message_store.GetMessage(id) > if not classifier.IsSpam(message): > classifier.train(message, is_spam=True) > > And not a whole lot else, it doesn't seem worth it. because it illustrates the point : it doesn't look like a correct re-training method (although it may be, depending on assumptions about where "id" comes from, and what assorted classifier methods do), and while a correct method shouldn't be hard, in the absence of a class dedicated to doing the simple common things that *can* be done in a common way, everyone will keep screwing it up in their own client code. > ... 
> You might want to run it past Tim Peters, 'cos he's *far* better at this
> kind of thing than I am (though he's also busy).

I have to do more Python and Zope work now, so have to guard my time on *this* project more jealously than I have. MarkH and SeanT and JeremyH all have ideas here too, and I trust you'll sort them out as a harmonious family bent on world domination. As a general strategy, the first person to check code in usually wins .

> ...
> The mark of a good framework is when you write a tiny little class (like
> AutoTrainer above for instance) that contains hardly any code but adds a
> major new feature (in this case, automatic training when moving messages
> around in Outlook).

The client-specific code to hook and track msg movement in Outlook is relatively massive, so everything else appears a drop in the bucket to Mark. Nevertheless, if a usable framework for capturing the *common* part of this stuff were available, removing the 5 lines of code quoted above would help (the Outlook client, and all others).

From B-Morgan@concentric.net Fri Nov 8 06:25:30 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Thu, 7 Nov 2002 23:25:30 -0700
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID:

As I see it, having pop3proxy keep copies of the messages and using an HTML UI for training has the least amount of dependency on the email client's forwarding capabilities (or lack thereof).

I have a severe aversion to opening spam that will probably carry over to unsure messages, so having a link added to the message body may not do me much good. I will, however, go to an HTML UI and examine a message if that UI doesn't "execute" the HTML. I don't want to see pretty; raw data is good enough for me to decide.

I hate to keep mentioning a "rival" project , but popfile's UI seems pretty close to what I think would work best here.
Regards,
Brad

From tim.one@comcast.net Fri Nov 8 06:46:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:46:14 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com>
Message-ID:

[Moore, Paul]
> ...
> I'm assuming (based on a message I recall seeing recently) that it's > possible to "correct" training - ie, if I train the classifier that a > specific message is spam, I can later say "no it isn't, it's ham". That's right, and at the level of classifier.py it's a two-step process: unlearn() as spam, then learn() as ham. It actually doesn't matter which order those are done in, but I won't admit to that . > Assuming that this is so, is it not reasonable to train dynamically > on an "assume I got it right" basis? Depending on context, it *may* be. > In other words, whenever the addin filters a message as ham or spam, > automatically train on that basis as well. Then, if the user sees a > mistake, he corrects it, which automatically retrains the classifier > (manually deleting as spam or moving a message already does this). Assuming a conscientious user, and a client that knows enough about what the user is doing, that should work fine. > This will keep the database right up to date, and all the user has to > do is correct any bad decisions the classifier makes (which he should > be doing anyway). > > I've ignored database growth issues, but other than that, is there any > other problem with this approach? Doubtless hundreds, but why quibble . A misclassified msg will have bad effects at once if the training gets reflected into the probabilities at once, so it gets less appealing the less zealous the user is about correcting mistakes right away. That can be mitigated by doing the day's training into a distinct dict, or not calling update_probabilities() in a single dict, until "the end of the day", when the user has (presumably) corrected all the day's mistakes they're going to correct. But if the model updating is going to be delayed anyway, then it makes as much sense to delay doing any training on "the day's" msgs until the end of the day. Determining what "the end of the day" means is a puzzle then too. 
For example, maybe I left my email client running and went on a week-long vacation. I'm not going to look over 700 presumed spam when I get back, I'll just delete it. But if ham was in there, I've now let it train in the wrong direction, and that will hurt. In other contexts, the scheme doesn't get off the ground. For example, for python.org use, nobody is going to review msgs claimed to be spam. A system feeding on its own judgments is going to reinforce its own mistakes too, so the "conscientious, timely, reviewing human" bit is important. From tim.one@comcast.net Fri Nov 8 07:20:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 02:20:18 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message-ID: [Mark Hammond] > ... > The key limitation of this scheme, as Tim also alludes to, is that this > never correctly classifies ham. However, I actually see this > incremental training more as a "get smarter now" than a "just get > smarter" technique - ie, a user sees a mis-classified Spam, by re- > training they are increasing the chances that the next similar mail > will be handled correctly. Instant feedback, especially while the user > is getting started. > > ie, it is indeed "mistake based training", but that may still prove > useful in addition to ongoing training. I sure agree it's *very* useful at the start, and expect it will continue to be useful over time. > I can't help thinking that we are somehow underestimating our own > tool here. I'm going to try an experiment: I'm going to wipe my home database and start over from scratch, training first on one ham and one spam, then only on mistakes and unsures. This should be fun . > As is common when people first use this tool, spam is generally > found in the ham set and vice-versa. Because of this, I know that my > Inbox is spam free (but less sure about my other "ham" folders). I'm > also sure that my Spam folder has no ham. 
This should remain true > while continue to use the tool. How do you know your Spam folder has no ham? I know mine doesn't because I routinely score it, sort on the score, and stare at "the wrong end". I find ham there as often as not, *usually* apparently due to mousing error when dragging a training ham into the Ham folder and overshooting the mark. > So surely we can exploit this somehow. Off the top of my head: > * Assume we don't trust the last 2 days of mail (as the user may not > yet have sorted them). Anything in the "good" and "spam" folders older > than this can be assumed correctly classified, and able to be trained > on. Provided the user has already done a decent amount of training, then as Paul Moore suggested it could even work to trust ham-vs-spam decisions immediately, and let user corrections undo those as needed. A well-trained system should be pretty robust against a few misclassifications over the short term. > * A process could go through all ham and spam trained on, and score each > message. Any "suspect" messages are presented in a list (much like the > Outlook "Find Message" result list). The user can indicate that the > message is correct (and the system will remember, never asking about > this message again) or is indeed incorrectly classified. If incorrect, > it will be moved, and incrementally trained as per now. (I can also > picture a whitelist kicking in here; if incorrect, offer to add user to > whitelist. If user in the whitelist, assume ham thereby meaning mail > from this person can never again be spam) Tell us about the mistakes *you* see. I feel like we're designing a solution to a hypothetical problem otherwise. The only "mistake" I routinely see is that my cigarettes-via-web advertising keeps getting knocked back into Unsure territory. That doesn't bother me enough to do anything about it, but if it bothers you enough then, yes, a whitelist would solve that one. 
> I can picture this working in the background, and simply indicating to
> the user that there are "conflicts" to be resolved at their leisure.

Or maybe we could just move those back to the Unsure folder. The user should already know what to do about things in Unsure, so it's nothing new to them. Moving a msg out of Unsure could be taken as a positive sign that the user has classified such a msg once and for all (well, until they move it again, anyway).

> Further, I imagine that as we build better training data for each
> message store, the number of "conflicts" actually found would
> generally be zero - ie, the system would find that all 2 day and
> older mail correctly classifies.

I expect that's true.

From just@letterror.com Fri Nov 8 07:54:04 2002
From: just@letterror.com (Just van Rossum)
Date: Fri, 8 Nov 2002 08:54:04 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

Tim Peters wrote:

> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme. This assumes (probably wrongly ) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py Thu Nov 7 23:03:07 2002
> > --- classifier.py.hack Fri Nov 8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> >           wordinfoget = self.wordinfo.get
> > +         spamprobget = self.spamprobcache.get
> >           now = time.time()
> >           for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> >               else:
> >                   record.atime = now
> > !                 prob = record.spamprob
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> > --- 464,470 ----
> >               else:
> >                   record.atime = now
> > !                 prob = spamprobget(word)
> > !                 if prob is None:
> > !                     prob = self.calcspamprob(word, record)
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
>
> Sorry, I don't know what this is trying to accomplish. Like, what is
> self.spamprobcache?
There's no such thing now, and the patch doesn't appear > to create one (i.e., this code doesn't run). Tim, don't be such a programmer . But ok, I promise I'll never post pseudocode as a patch again... > Whatever it's supposed to be, > why isn't spamprobcache.get *itself* responsible for returning a spamprob, > instead of making its caller deal with two cases? I thought I was doing your performance needs a favor . > If the answer is "it's > supposed to be a dict, so .get ain't that smart", That's the answer. > then the memory burden for > a long-running scorer process will zoom, negating one of the benefits people > attached to "real databases" thought they were buying in return for giant > files and slothful performance . Right. If a float takes up 20 bytes in memory (just a guess, no time to look), then for a database of 100000 words (that's roughly the size of my personal db) the memory burden is 100000 * (8 + 20), almost three megs. Just in case the higher memory usage is not an issue, there's a simpler approach: don't store spamprob in the db, but call bayes.update_probabilities() on startup. update_probabilities() takes about 2 seconds on my lowly 400Mhz PPC on my db (hm, that's using pickle, so will be a lot more when using a db :-( ). You can tell I'm thinking mostly about long running processes... I guess you're right, one size doesn't fit all. One last idea for this morning: how about splitting the db in a training db (storing hamcount and spamcount) and a classifying db (storing only spamprob)? > Life would be easier if databaseheads trained all they liked as often as > they liked, but refrained from calling update_probabilities() until the end > of the day (or other "quiet time"). The idea that the model should be > updated after every msg trained on is an extreme. Good points. 
Just

From richie@entrian.com Fri Nov 8 08:06:33 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 08:06:33 +0000
Subject: [Spambayes] Upgrade problem
In-Reply-To:
References:
Message-ID:

[Just]
> the web interface of pop3proxy.py is pretty good and useful, the only
> downside is that it saves the database after each training

That's now fixed (at least partly) along with some other bits:

 o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later.

 o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch.

 o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed.

 o It now copes with POP3 servers that don't issue a welcome command.

 o The training form now appears in the training results, so you can train on another message without having to go back to the Home page.

--
Richie Hindle
richie@entrian.com

From tim.one@comcast.net Fri Nov 8 09:15:24 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 04:15:24 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To:
Message-ID:

[Tim]
> ...
> I'm going to try an experiment: I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures. This should be fun .

It is! The msg from me I'm replying to here scored 94 (solid spam). I've now got 5 ham and 5 spam in my training set, most of the new ones from Unsures.
The latest spam was a blatant false negative, from Hapax City: '*H*' 0.998601 '*S*' 8.60833e-005 'can' 0.0652174 'have' 0.0652174 "don't" 0.0918367 'never' 0.0918367 'number' 0.0918367 'one' 0.0918367 'what' 0.0918367 '"the' 0.155172 ham hapaxes from here 'able' 0.155172 'about' 0.155172 'against' 0.155172 'also' 0.155172 'any' 0.155172 'anything' 0.155172 'back' 0.155172 'because' 0.155172 'been' 0.155172 'check' 0.155172 'even' 0.155172 'find' 0.155172 'found' 0.155172 'heard' 0.155172 'how' 0.155172 'into' 0.155172 "it's" 0.155172 'more' 0.155172 'needed' 0.155172 'other' 0.155172 'out' 0.155172 'own' 0.155172 'people' 0.155172 'skip:a 10' 0.155172 'skip:i 10' 0.155172 'special' 0.155172 'subject:.' 0.155172 'subject:: ' 0.155172 'their' 0.155172 'them.' 0.155172 'they' 0.155172 'those' 0.155172 'time' 0.155172 'time.' 0.155172 'unsubscribe' 0.155172 'until' 0.155172 'useful' 0.155172 'using' 0.155172 to here 'and' 0.275281 'for' 0.275281 'subject: ' 0.275281 'you' 0.275281 'from' 0.355072 'not' 0.355072 'off' 0.355072 'our' 0.355072 'when' 0.355072 'new' 0.644928 'see' 0.644928 'url:gif' 0.724719 'url:www' 0.724719 'call' 0.844828 spam hapaxes from here 'contact' 0.844828 'credit' 0.844828 'email.' 0.844828 'every' 0.844828 'further' 0.844828 'header:Received:2' 0.844828 'made' 0.844828 'more!' 0.844828 'most' 0.844828 'now' 0.844828 'plus,' 0.844828 'receive' 0.844828 'search' 0.844828 'skip:1 10' 0.844828 'url:jpg' 0.844828 to here 'email' 0.908163 I think I've established that 5+5 isn't enough for great results . However, 80% of its decisions have been correct so far! 
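
The clue values in the listing above can be reproduced by hand. This is a sketch under stated assumptions: the adjustment is taken to have the form (s*x + n*p) / (s + n), with x = 0.5 and s = 0.45 (the value of s is not given in this thread; it is assumed here because it reproduces the quoted figures), and equal ham and spam totals of 5 each.

```python
def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Count-based estimate, pulled toward x when evidence is thin.

    s and x are assumed values; see the note above.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)   # by-counting estimate
    n = hamcount + spamcount                 # msgs the word was seen in
    return (s * x + n * p) / (s + n)


NHAM = NSPAM = 5
print(round(spamprob(1, 0, NHAM, NSPAM), 6))   # ham hapax:     0.155172
print(round(spamprob(0, 1, NHAM, NSPAM), 6))   # spam hapax:    0.844828
print(round(spamprob(3, 1, NHAM, NSPAM), 6))   # 3 ham, 1 spam: 0.275281
print(round(spamprob(2, 1, NHAM, NSPAM), 6))   # 2 ham, 1 spam: 0.355072
```

A word seen once can never score more extreme than 0.155 or 0.845 under these settings, which is why a message made up almost entirely of hapaxes is decided by how many of them point each way.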
From tdickenson@devmail.geminidataloggers.co.uk Fri Nov 8 10:52:32 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 8 Nov 2002 10:52:32 +0000
Subject: [Spambayes] Re: unsupervised training
In-Reply-To:
References:
Message-ID: <200211081052.32567.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 08 November 2002 7:20 am, Tim Peters wrote:

> Provided the user has already done a decent amount of training, then as
> Paul Moore suggested it could even work to trust ham-vs-spam decisions
> immediately, and let user corrections undo those as needed. A well-trained
> system should be pretty robust against a few misclassifications over the
> short term.

For the last two weeks I have been using a setup that uses this type of unsupervised training.

I have a procmail filter that sends a copy of all incoming ham and spam to two separate mailboxes. These mailboxes are used for overnight batch training, then deleted. Messages marked 'Unsure' do not take part in this automatic training.

I perform separate filtering for spam and 'unsure' in my mua. So far I am manually inspecting the unsure folder, and manually adding them to the appropriate training mailboxes. Initially about 3% of mails were 'unsure', but this has dropped to less than 1% after 2 weeks.

Starting next week I plan to change the mua filtering to treat 'unsure' the same as 'ham', and stop all manual training. It will be interesting to see if the training remains stable.

From bkc@murkworks.com Fri Nov 8 14:51:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Fri, 08 Nov 2002 09:51:15 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID: <3DCB8912.18340.2FB5F81@localhost>

On 8 Nov 2002 at 0:17, Richie Hindle wrote:

> Or perhaps there's another way.
The only other option I'd thought of was > to add two hyperlinks to the end of the message, "This is spam" and "This > is ham" (in ways that would work for both HTML and plain-text messages, in > both HTML and plain-text email clients). They'd link to the HTML interface > and tell it the cache ID of the message. Adding content to emails is way > more intrusive (and difficult) than adding headers. But no more intrusive > than the .sig that mailman adds. If you do this, what's to keep spammers from also adding similar looking URLs? A busy person might not notice any difference, could click through and confirm their email address... Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From barry@python.org Fri Nov 8 15:04:56 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 8 Nov 2002 10:04:56 -0500 Subject: [Spambayes] SMTP proxy questions References: Message-ID: <15819.53912.407893.819241@gargle.gargle.HOWL> >>>>> "JB" == Jim Bublitz writes: JB> What about adding a MIME object to the msg with the Spambayes JB> info (text/spambayes?) - or will forwarding lose that info JB> too? The email module should be able to do this. Of course that would have to be text/x-spambayes :) -Barry From randy.diffenderfer@eds.com Fri Nov 8 17:21:25 2002 From: randy.diffenderfer@eds.com (Diffenderfer, Randy) Date: Fri, 8 Nov 2002 12:21:25 -0500 Subject: [Spambayes] SMTP proxy questions Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC295@usahm014.exmi01.exch.eds.com> |>>>>> "JB" == Jim Bublitz writes: | | JB> What about adding a MIME object to the msg with the Spambayes | JB> info (text/spambayes?) - or will forwarding lose that info | JB> too? The email module should be able to do this. 
| |Of course that would have to be text/x-spambayes :) | |-Barry While a fair portion of messages may very well be MIME compliant, this wouldn't work without some serious munging around for non-MIME messages, as well as being very problematic for the many deformed MIME (read very NON compliant :-) ) messages floating around out there! Just an observation... From jbublitz@nwinternet.com Fri Nov 8 17:10:33 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Fri, 08 Nov 2002 09:10:33 -0800 (PST) Subject: [Spambayes] SMTP proxy questions In-Reply-To: <15819.53912.407893.819241@gargle.gargle.HOWL> Message-ID: On 08-Nov-02 Barry A. Warsaw wrote: > >>>>>> "JB" == Jim Bublitz writes: > > JB> What about adding a MIME object to the msg with the > Spambayes > JB> info (text/spambayes?) - or will forwarding lose that > info > JB> too? The email module should be able to do this. > > Of course that would have to be text/x-spambayes :) Well - there's application/ms-excel or some such. Isn't spambayes just as good? :) Point taken. Jim From barry@python.org Fri Nov 8 17:33:53 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 8 Nov 2002 12:33:53 -0500 Subject: [Spambayes] SMTP proxy questions References: <15819.53912.407893.819241@gargle.gargle.HOWL> Message-ID: <15819.62849.101901.822699@gargle.gargle.HOWL> >>>>> "JB" == Jim Bublitz writes: JB> Well - JB> there's application/ms-excel or some such. Isn't spambayes JB> just as good? :) It depends on whether you hold the IETF and IANA in as high regard as Microsoft does . http://www.iana.org/assignments/media-types/ -Barry From lists@morpheus.demon.co.uk Fri Nov 8 21:07:45 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 08 Nov 2002 21:07:45 +0000 Subject: [Spambayes] Outlook plugin - training References: Message-ID: "Tim Peters" writes: [About the plugin code...] > I'm more lost than not in it myself! 
That makes me feel better :-) [About bothering with leaving list traffic out] > Don't worry about it before you try it. I suggest trying it because I'm not > sure it's possible to *stop* the system now from scoring all incoming msgs > (the "new msg in Inbox" filter appears to trigger for every one, regardless > of whether the RW decides to move it; after that it may just be a race > between the RW and the addin deciding where to move each). OK, I've switched over. I now have one Spam folder, one Potential Spam folder, and the rest are Ham (actually, some historic archive folders I've left out, but that's just because I never use them any more). We'll see how it goes. >> Of course, I know that the classifier *really* works by magic, and >> so my intuition is useless :-) > > It's more that unless you know exactly how the math works, your intuition is > simply baseless here, carried over from some other experience. Do *you* > have trouble distinguishing personal and work email from spam? There you > go, and you can't even compute inverse chi-squared probabilities to 14 > significant digits on demand in your head . How do *you* know I can't compute inverse chi-squared probabilities in my head? Oh, hang on - you wanted me to get the right answer, didn't you? :-) > What's to manage? I get about 600 emails per day, and about 1% end > up in Unsure (about 6 -- actually less than that, lately; the system > is learning). My ratio is still a lot worse than that. But as I say, my training corpus is still quite small. But you're right - managing a few mails isn't hard. It's just that the overall results are *so* much better than the old home-grown soution I used that I became instantly spoiled :-) Seriously, I've said this before, but what you guys have developed here is *phenomenally* good. 
I've reached the point where I look forward to getting spam, just because I enjoy so much seeing it automatically appear in the spam folder :-) >> My instinctive reaction is that I want "Spam" and "Not Spam" buttons, >> and then I read or delete the message in situ. > > MarkH has since implemented this in the Unsure folder. Time for a CVS update, I guess... > I still think you're making life too complicated. Is list traffic > spam? If so, call it spam. If not, call it ham. Sounds sensible. I think that all the troubles I've had in the past trying to manage spam have left me with an instinctive feeling that the problem is complicated. This leads to looking for complicated solutions. But you're right. The spam/ham distinction itself is a simple yes/no, so the setup should be, too. But permit me to drag my feet a little, as I throw away all my cherished preconceptions :-) More seriously, I'm putting this point into my spambayes notes folder. I suspect it's something a lot of new users will have to get used to. Thanks for the comments, Paul. -- This signature intentionally left blank From lists@morpheus.demon.co.uk Fri Nov 8 21:12:17 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 08 Nov 2002 21:12:17 +0000 Subject: [Spambayes] Outlook plugin plus Exchange Message-ID: I've noticed a couple of strange effects with the Outlook plugin used against an Exchange server. The main one is that when I start up the client in the morning, there are a lot of overnight messages in my inbox. They don't seem to get filtered. I suspect this is to do with Outlook not firing the "new mail" event on stuff that's in the Exchange store when the client starts up. But I'll need to test this. Unfortunately, the Exchange server is at work, and I can only do any serious hacking on this at home, so I'm running a batch cycle (code at home, take into work, try out, take bugs home, and repeat). So it'll take me a while to make any progress. I'll report back when I get more details. 
Paul (Off to look at Outlook events in MSDN, and to write a simple "log the events and see what is going on" plugin to test with)
--
This signature intentionally left blank

From mhammond@skippinet.com.au Fri Nov 8 21:52:20 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 9 Nov 2002 08:52:20 +1100
Subject: [Spambayes] Outlook plugin plus Exchange
In-Reply-To:
Message-ID:

> I've noticed a couple of strange effects with the Outlook plugin used
> against an Exchange server. The main one is that when I start up the
> client in the morning, there are a lot of overnight messages in my
> inbox. They don't seem to get filtered. I suspect this is to do with
> Outlook not firing the "new mail" event on stuff that's in the
> Exchange store when the client starts up. But I'll need to test this.

I am working on code that optionally processes "missed" messages at startup. It looks like I can list all unread, unscored mail in my 1000+ item inbox very quickly, so this should be feasible.

> Paul (Off to look at Outlook events in MSDN, and to write a simple
> "log the events and see what is going on" plugin to test with)

Check out the Outlook plugin in the win32com\demos directory - probably a good place to start. Or if anyone gets lots of KLEZ mail via Outlook, I have a plugin that does a decent job at killing them.

Mark.

From francois.granger@free.fr Fri Nov 8 23:25:51 2002
From: francois.granger@free.fr (François Granger)
Date: Sat, 9 Nov 2002 00:25:51 +0100
Subject: [Spambayes] pop3proxy
Message-ID:

Thanks to Richie Hindle, it now works on MacOS 9. Excellent job!

--
Email is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim.one@comcast.net Fri Nov 8 23:33:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 18:33:50 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To:
Message-ID:

[Tim]
> ...
> I'm going to try an experiment: I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures. This should be fun .
> ...

After enduring the first round of gross mistakes, when I got up today I did this:

    while some ham in my inbox scores above 0.20 (my ham_cutoff):
        pick the highest-scoring ham in the inbox
        add it to the ham training set
        train on it
        rescore the inbox

These are false positives and unsures the classifier would have had if these msgs had come in after I started the experiment. There were about 700 msgs in the inbox.

Other than that, I've left it mistake-driven and unsure-driven on live incoming email. Spam that's correctly classified simply gets deleted (no training on it), ditto ham.

It's been a light spam day, but hundreds of msgs have come in since then and I haven't seen a mistake or unsure in about 5 hours, although plenty of ham gets near ham_cutoff and plenty of spam near spam_cutoff.

Total training data now: just 45 ham and 20 spam. Scores remain grossly hapax-driven, but that's actually enough to classify most of my email correctly: a small number of subjects and senders and mailing lists overwhelmingly dominate my ham mix, and one email account accounts for the vast bulk of my spam.

Removing the hapaxes from the database dropped the # of words from 5500 to about 1700. Rescoring the inbox with this reduced database then pushed about 5% of the msgs back into Unsure. So (no surprise here) hapaxes are vital with little training data.
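
The worst-first loop described earlier in this message, as a runnable sketch. The score and train callables and the toy vocabulary scorer are made up for illustration; only the 0.20 cutoff comes from the text.

```python
def train_worst_first(inbox, score, train, ham_cutoff=0.20):
    """Train on the highest-scoring ham, rescore, repeat until all pass."""
    trained = []
    while True:
        pending = [m for m in inbox if m not in trained]
        offenders = [m for m in pending if score(m) > ham_cutoff]
        if not offenders:
            return trained
        worst = max(offenders, key=score)   # ties: first one wins
        train(worst)
        trained.append(worst)


# Toy scorer: fraction of a message's words never seen in trained ham.
known_ham_words = set()


def toy_score(msg):
    words = msg.split()
    return sum(w not in known_ham_words for w in words) / len(words)


def toy_train(msg):
    known_ham_words.update(msg.split())


inbox = ["lunch on friday", "lunch friday", "friday"]
picked = train_worst_first(inbox, toy_score, toy_train)
print(picked)
```

In the toy run, training on the single worst message is enough to bring the other two under the cutoff, which is the effect the one-at-a-time approach is after: a small training set.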
That also means that as soon as one of those words shows up in the other kind of email, it changes from a strong clue to neutral, *provided that* I actually train on the new email. I'm not training now unless there's a mistake/unsure, so the hapaxes remain strong clues (even when they point in the wrong direction).

BTW, when there are mistakes/unsures, I'm not training on all of them: as I did when I got up, I train the worst example then rescore, one at a time, until no mistakes/unsures remain.

I'm never going to get sub-0.1% error rates this way, but if this is the best it ever got, I'd be quite happy with it for my personal email. Something to ponder? If so, you can get away with a very small database, and while hapaxes must not be removed blindly in this extreme scheme, using the atime field could (I suspect) be very effective in slashing the already-small database size (lots of hapaxes will never be seen again even if you train on everything; the WordInfo atime field tells you when a word was last used at all).

From rob@hooft.net Fri Nov 8 23:49:59 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 00:49:59 +0100
Subject: [Spambayes] Outlook plugin - training
References:
Message-ID: <3DCC4DA7.80401@hooft.net>

Tim Peters wrote:

> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder? If so, you can get away with a very small database,
> and while hapaxes must not be removed blindly in this extreme scheme, using
> the atime field could (I suspect) be very effective in slashing the
> already-small database size (lots of hapaxes will never be seen again even
> if you train on everything; the WordInfo atime field tells you when a word
> was last used at all).

Tim,

This seems to imply that you're still playing with the idea that hapaxes could be "slashed" from the database when using the "old" train-on-all procedure.
I don't see how that can ever work, as all words pass through the hapax stage at some point. Or do you mean to slash "old" hapaxes only? And what is "old"?

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim@fourstonesExpressions.com Sat Nov 9 00:55:07 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 08 Nov 2002 18:55:07 -0600
Subject: [Spambayes] Persisting a pickled bayes database
Message-ID:

I can see the nice createbayes function in hammie, but I don't see any persistence function anywhere. I do see several places where code to write a pickled bayes database is hard coded, and I understand the PersistentBayes thing. I might be missing something...

I've been using a simple class to handle creating and persisting my bayes databases. I *think* this stuff should probably go somewhere, but beats me where... classifier? doesn't make much sense there... hammie? Any ideas anybody?

Here's the class... (kinda a dumb name ;))

    import pickle

    import hammie

    class BayesHelper:
        '''helper class for bayes databases'''

        def __init__(self, db_name, useDB):
            ''' constructor '''
            self.db_name = db_name
            self.useDB = useDB
            self.bayes = hammie.createbayes(db_name, useDB)

        # no __del__() method, because we don't *necessarily* want to persist

        def persist(self):
            '''store the bayes database'''
            if not self.useDB:
                fp = open(self.db_name, 'wb')
                pickle.dump(self.bayes, fp, 1)
                fp.close()

- TimS

From popiel@wolfskeep.com Fri Nov 8 00:06:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 16:06:27 -0800
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: Message from "Tim Peters"
References:
Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com>

In message: "Tim Peters" writes:
>[Anthony Baxter]
>> Note that "random sample" is not as trivial as all that, either - if
>> you have a very high ham:spam ratio in your training DB, your accuracy
>> will suffer (see the tests from Alex, myself and others).
> >I still need to try to make sense of those tests. A real complication is >that more than one thing changes when trying to test ratios: it's not just >the ratio that changes, it's the absolute number of each trained on too. True. >For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham >and 10000 spam. The ratios are identical. Do we expect the error rates to >be identical too? I don't, but haven't tried it. I have tried this, and the effects of ratio were diminished as the training set size increased. For details, see http://www.wolfskeep.com/~popiel/spambayes/ratio2 . The tests were done with gary-combining, not chi-square, so I really ought to rerun them. >I expect the latter would do better than the former, despite the identical >ratios, simply because more msgs allow better spamprob estimates. It depended on what the ratio in question was... for 1:4 ham:spam, increased training set size hurt instead of helped, in the ranges that I was able to test. For 1:1, increased training helped instead of hurt. >Something missing in "the ratio tests" is a rationale (even an >after-the-fact one) for believing there's some aspect of the system that's >sensitive to the ratio. The combining method certainly is not, and the >spamprob estimation (update_probabilities()) deliberately works with >percentages instead of raw counts so that the ham::spam training ratio >has no direct effect on the spamprobs calculated. Eh, I have a perfectly good rationale for believing that something is sensitive the the ratio: the tests I've run show such a sensitivity. What's missing is a theory on _why_ there's a sensitivity. ;-) I don't think the following theory is perfectly phrased, but it seems plausible to me: Perhaps the number of topics discussed in ham is greater than that in spam. 
Thus, the average percentage of ham messages containing a particular significant ham word is systematically lower than the average probability of a particular significant spam word appearing in spam messages. As the training set size increases, the percentage difference becomes more consistent and pronounced. Since we're then combining the percentages, we systematically skew slightly due to the differing averages. Changing the ratio of ham to spam has the effect of changing the number of topics discussed, particularly when the training set size is small and random chance can exclude all instances of a given topic. Balancing the number of topics removes the skew in the probabilities. As training set size increases, adjusting the ratio has less effect, because it has less likelyhood of eliminating topics of discussion. I think that would account for my data. >The total # of spam training msgs does limit how high a spamprob can get, >and the total # of ham training msgs limits how low. The *suspicion* I had >running my large c.l.py test is that it wasn't the ratio that mattered so >much as the absolute number, and that the error rates didn't "settle down" >to the 4th digit until I got near 10,000 spam total. I suspect that by the time the corpora got that large, adjusting the training ratio wouldn't make a lick of difference if the corpora were sampled randomly to achieve the given ratio. There would just be too little chance of excluding a topic from the samples. Systematically excluding a topic might produce equivalent results to my ratio tests. - Alex From richie@entrian.com Fri Nov 8 00:17:25 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 08 Nov 2002 00:17:25 +0000 Subject: [Spambayes] SMTP proxy questions Message-ID: [Me] > Also on my list is to commit Tim Stone's SMTP proxy code, possibly after > integrating it with the pop3proxy (but I need to discuss that with you, > Tim, after looking in more detail at the code, hopefully tonight). 
I've discussed this with Tim S, and he's going off the SMTP proxy idea while I'm still broadly in favour of it. What do people think - do non-Outlook users want to forward messages to 'spam' and 'ham' to train the system, or use an HTML UI? The most difficult problem for retraining-by-forwarding is matching the forwarded message to one from the cache, after Outlook Express has stripped the headers, top-quoted the users .sig, converted it to HTML and added fifteen macro viruses. Any ideas? Can the tokeniser help? Or perhaps there's another way. The only other option I'd thought of was to add two hyperlinks to the end of the message, "This is spam" and "This is ham" (in ways that would work for both HTML and plain-text messages, in both HTML and plain-text email clients). They'd link to the HTML interface and tell it the cache ID of the message. Adding content to emails is way more intrusive (and difficult) than adding headers. But no more intrusive than the .sig that mailman adds. -- Richie Hindle richie@entrian.com From anthony@interlink.com.au Fri Nov 8 00:30:09 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 08 Nov 2002 11:30:09 +1100 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: <200211080030.gA80UAf11390@localhost.localdomain> > I've discussed this with Tim S, and he's going off the SMTP proxy idea > while I'm still broadly in favour of it. What do people think - do > non-Outlook users want to forward messages to 'spam' and 'ham' to train the > system, or use an HTML UI? I'd have to say I don't like the idea. There's too many potential places where it can all go horribly horribly pear-shaped, and too many rat-holes that the various email clients can screw up with. Anthony -- Anthony Baxter It's never too late to have a happy childhood. 
From jbublitz@nwinternet.com Fri Nov 8 01:15:29 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Thu, 07 Nov 2002 17:15:29 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID:

On 08-Nov-02 Richie Hindle wrote:
> Or perhaps there's another way. The only other option I'd
> thought of was to add two hyperlinks to the end of the message,
> "This is spam" and "This is ham" (in ways that would work for
> both HTML and plain-text messages, in both HTML and plain-text
> email clients). They'd link to the HTML interface and tell it
> the cache ID of the message. Adding content to emails is way
> more intrusive (and difficult) than adding headers. But no more
> intrusive than the .sig that mailman adds.

What about adding a MIME object to the msg with the Spambayes info (text/spambayes?) - or will forwarding lose that info too? The email module should be able to do this.

Jim

From tim.one@comcast.net Fri Nov 8 04:07:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:07:18 -0500
Subject: [Spambayes] Proposing to drop retain_pure_html_tags
In-Reply-To:
Message-ID:

FYI, that option is gone now.

From tim.one@comcast.net Fri Nov 8 04:29:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:29:17 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To:
Message-ID:

The original names made more sense when we had half a dozen competing schemes.

    Current                           Proposed
    -------                           --------
    robinson_probability_x            unknown_word_prob
    robinson_probability_s            unknown_word_strength
    robinson_minimum_prob_strength    minimum_prob_strength

Note: unknown_word_prob is what the Bayesian prob adjustment moves toward, more strongly the less evidence backs up a counting spamprob estimate (the fewer the msgs a word has been seen in, the more the adjustment pushes the spamprob toward unknown_word_prob; for a word that's never been seen before, this reduces to unknown_word_prob exactly).
We've always set it to 0.5 by default, and previous tests never showed benefit from changing that. We've gotten better since then, though, and it's possible to deduce "a more correct" value. For example, take the mean of all the by-counting spamprobs in your database, across words that have appeared in at least 10 msgs (so that there's reason to have *some* confidence in the by-counting guess). That's then an estimate of the spamprob a new word will eventually get over time. Across 3 databases I tried this on, it turned out to be a little over 0.5, from 0.513 (my home personal classifier) to 0.540 (fat c.l.py test).

If someone has time for a controlled experiment, run the attached code to find this guess for one of your databases; then if it differs from 0.5, try a before-and-after test just changing that much. If there's any promise here, update_probabilities() could easily be changed to compute and use this automatically.

"""
import cPickle as pickle

f = file('fat.pik', 'rb')  # your database pickle goes here
c = pickle.load(f)
f.close()
w = c.wordinfo

def guessx():
    nham = float(c.nham or 1.0)
    nspam = float(c.nspam or 1.0)
    n = 0
    probsum = 0.0
    for rec in w.itervalues():
        if rec.hamcount + rec.spamcount >= 10:
            hamratio = rec.hamcount / nham
            spamratio = rec.spamcount / nspam
            prob = spamratio / (spamratio + hamratio)
            probsum += prob
            n += 1
    print n, probsum / n

guessx()
"""

From mhammond@skippinet.com.au Fri Nov 8 04:48:54 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 8 Nov 2002 15:48:54 +1100
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To:
Message-ID:

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts . I think that the classes I posted a while ago suffer from the exact reverse problem as your idea. My idea was to make a "message store" that is largely independent of training.
I believe the problem with your design is that it deals with the training at the expense of the message store. Obviously, but worth mentioning, is that there are competing interests here. My focus is towards clients, and specifically the outlook one (if there were more clients I would be happy to think of them too ). Alot of the focus of this group is towards admins rather than individuals (which is just fine!) But it seems the current thinking is of a corpus as being a fairly static, well-controlled set of messages used almost purely for training purposes. For client programs, this may not be practical. The corpus is a more dynamic set of messages - and worse, actually *is* the user's set of messages rather than a collection of message copies. For example, "moving" a message in a corpus may actually mean moving the message in the user's real inbox. This may or may not be what is intended - a corpus "move" operation is more about changing a message's classification than it is about physically moving pieces of mail around. > A Corpus wouldn't know how to create Message objects, nor would a Message > object know how to create itself - classes *derived from* them would know > how to do that. For instance (totally untested code, probably full of > typos) - > > class Message: Jeremy and I both posted real code, so starting with something that takes that into consideration would be good. > I may be putting too much > into the base class by demanding that the text of the message be given to > the constructor - that precludes making FileMessage lazy, and > only read the > file when it needs to.] It also defeats the abstract nature of the class. 
> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already exist for this engine are the best ones to use to discover the naivety early

> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:

* An abstract "message id"
* A message classification database, as discussed before - basically just a dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch training. It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability and message databases.

At that level, we really don't need much else - no folders or any other grouping of messages. I'm really not too sure there is much value in adding higher-level concepts such as folders or message store "move" operations - certainly not at the outset, where there are too many competing requirements.

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but may be too simple to be necessary. If the process of re-training a message in the Outlook GUI becomes:

    def RetrainMessageAsSpam():
        # Outlook specific code to get an ID.
        message = message_store.GetMessage(id)
        if not classifier.IsSpam(message):
            classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it.

Unfortunately, the decision to perform the retrain is the complex, but client specific part. Is this a newly delivered message? Did the user manually move the message somewhere? Did the user click one of our buttons? Is the user deleting old ham that we want to train on before it dies forever?
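
A rough sketch of how the five pieces listed earlier in this message might fit together. Every name here (DictMessageStore, StubClassifier, train_corpus) is hypothetical, messages are plain strings, and the classifier is a stub that only records what it was asked to train on.

```python
class DictMessageStore:
    """Message store: returns a message object (here, text) given its ID."""

    def __init__(self, messages):
        self._messages = dict(messages)

    def get_message(self, msg_id):
        return self._messages[msg_id]


class StubClassifier:
    """Stand-in for the training API; records calls instead of learning."""

    def __init__(self):
        self.trained = []

    def train(self, message, is_spam):
        self.trained.append((message, is_spam))


def train_corpus(corpus, store, classifications, classifier):
    """corpus is any iterable of message IDs: no folders, no move operations."""
    for msg_id in corpus:
        label = classifications.get(msg_id)   # "spam", "ham", or None
        if label is None:
            continue                          # not classified yet: skip it
        message = store.get_message(msg_id)
        classifier.train(message, is_spam=(label == "spam"))


store = DictMessageStore({"id1": "buy now", "id2": "lunch?", "id3": "hello"})
classifications = {"id1": "spam", "id2": "ham"}   # classification database
clf = StubClassifier()
train_corpus(["id1", "id2", "id3"], store, classifications, clf)
print(clf.trained)
```

The client-specific decision about *when* to retrain stays outside all of this, which matches the point being made: the shared layer can be this thin only because that decision is left to each client.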
Outlook does this via examining what Outlook event we are seeing, and looking at meta-data we possibly previously attached to the message. I'm not sure this can be encapsulated well at the moment without adding all our meta-data etc baggage to the base classes. > Most of the *new* code that's needed is defining the abstract concepts and > their interfaces, rather than writing code that actually *does* anything - > it's building a framework. *cough* ummm... This is doomed to failure. Code *must* do something to be taken seriously. At the very least, I would expect to see the existing test driver framework running against these "abstract concepts" > Once the framework is there, most of the code needed to implement the > functionality should already be in the project - code to hook > into Outlook, > to train on a message, to parse mbox files, and so on. It just needs > hooking into the framework. See above . Mark. From tim.one@comcast.net Fri Nov 8 04:50:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 07 Nov 2002 23:50:42 -0500 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: [Richie Hindle] > ... > The most difficult problem for retraining-by-forwarding is matching the > forwarded message to one from the cache, after Outlook Express > has stripped the headers, top-quoted the users .sig, converted it > to HTML and added fifteen macro viruses. Any ideas? If user can be convinced to forward as an *attachment*, those problems go away, at least in OE. You can create a new msg there, select any number of msgs, drag them to the msg as a group, and OE will create an attachment for each one. Unlike Outlook, OE appears to save the original stuff that came in over the wire (we're finding it's a real hoot in the OL client to try to guess what the original MIME structure may have been). > Can the tokeniser help? If you put in a token unique to each msg, sure . Perhaps the "loose checksum" program Skip checked in could be useful for this. 
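
One way a per-message token could survive forwarding is a "loose" fingerprint that ignores the things mail clients tend to change. The sketch below is an assumption made for illustration, not a description of the checksum program mentioned above: it lowercases the body, drops everything except letters and digits, and hashes what is left.

```python
import hashlib
import re


def loose_fingerprint(body):
    """Hash only letters and digits, so quoting and reflow don't matter."""
    normalized = re.sub(r"[^a-z0-9]", "", body.lower())
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


original = "Get rich quick!\nCall 555-0100 today."
forwarded = "> Get rich quick!\n> Call 555-0100\n> today.\n"
different = "Get rich slowly. Call 555-0100 today."

print(loose_fingerprint(original) == loose_fingerprint(forwarded))
print(loose_fingerprint(original) == loose_fingerprint(different))
```

This tolerates quote markers, rewrapping and case changes, but not added text: a top-quoted .sig or an HTML conversion that injects words would still change the hash, so it only covers part of the problem raised in this thread.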
From tim.one@comcast.net Fri Nov 8 05:06:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:06:43 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com>
Message-ID:

[Richie Hindle]
> A quick note in case someone decides to remove the counts from the
> database:

Neil Schemenauer already does, in his CDB code (neil*.py). It's a lean scoring-only database, mapping tokens to *just* spamprobs. If he went on to store them as scaled ints, he could almost certainly reduce this to 2 bytes of prob info per token, and possibly even just 1.

> the HTML front end has a "Word query" feature which will tell you the
> information in the database for a given word - it's interesting to see
> how many more times the word 'Viagra' appears in ham than in spam. I
> mean the other way round.

What a geek.

From tim.one@comcast.net Fri Nov 8 05:48:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:48:25 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

[Just van Rossum]
> I think it can be done with almost no extra overhead with a
> caching scheme. This assumes (probably wrongly) that
> the cache stays in memory between runs.
> Something like this perhaps:
>
> *** classifier.py Thu Nov 7 23:03:07 2002
> --- classifier.py.hack Fri Nov 8 00:04:05 2002
> ***************
> *** 456,459 ****
> --- 456,460 ----
>
>           wordinfoget = self.wordinfo.get
> +         spamprobget = self.spamprobcache.get
>           now = time.time()
>           for word in Set(wordstream):
> ***************
> *** 463,467 ****
>               else:
>                   record.atime = now
> !                 prob = record.spamprob
>               distance = abs(prob - 0.5)
>               if distance >= mindist:
> --- 464,470 ----
>               else:
>                   record.atime = now
> !                 prob = spamprobget(word)
> !                 if prob is None:
> !                     prob = self.calcspamprob(word, record)
>               distance = abs(prob - 0.5)
>               if distance >= mindist:

Sorry, I don't know what this is trying to accomplish. Like, what is self.spamprobcache?
There's no such thing now, and the patch doesn't appear to create one (i.e., this code doesn't run). Whatever it's supposed to be, why isn't spamprobcache.get *itself* responsible for returning a spamprob, instead of making its caller deal with two cases? If the answer is "it's supposed to be a dict, so .get ain't that smart", then the memory burden for a long-running scorer process will zoom, negating one of the benefits people attached to "real databases" thought they were buying in return for giant files and slothful performance.

Life would be easier if databaseheads trained all they liked as often as they liked, but refrained from calling update_probabilities() until the end of the day (or other "quiet time"). The idea that the model should be updated after every msg trained on is an extreme.

From tim.one@comcast.net Fri Nov 8 06:23:13 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:23:13 -0500
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To:
Message-ID:

[Richie Hindle, cogitates about Messages and their Corpus(ora)]

That's the ticket! Backing off to a more fundamental level looks useful to me too. We never even straightened that much out for testing purposes (msgs.py isn't general enough; for some custom test drivers (never checked in), I couldn't even reuse the MsgStream class for my *own* directory structures).

I disagree with Mark's

> If the process of re-training a message in the Outlook GUI becomes:
>
>     def RetrainMessageAsSpam():
>         # Outlook specific code to get an ID.
>         message = message_store.GetMessage(id)
>         if not classifier.IsSpam(message):
>             classifier.train(message, is_spam=True)
>
> And not a whole lot else, it doesn't seem worth it.
because it illustrates the point : it doesn't look like a correct re-training method (although it may be, depending on assumptions about where "id" comes from, and what assorted classifier methods do), and while a correct method shouldn't be hard, in the absence of a class dedicated to doing the simple common things that *can* be done in a common way, everyone will keep screwing it up in their own client code. > ... > You might want to run it past Tim Peters, 'cos he's *far* better at this > kind of thing than I am (though he's also busy). I have to do more Python and Zope work now, so have to guard my time on *this* project more jealously than I have. MarkH and SeanT and JeremyH all have ideas here too, and I trust you'll sort them out as a harmonious family bent on world domination. As a general strategy, the first person to check code in usually wins . > ... > The mark of a good framework is when you write a tiny little class (like > AutoTrainer above for instance) that contains hardly any code but adds a > major new feature (in this case, automatic training when moving messages > around in Outlook). The client-specific code to hook and track msg movement in Outlook is relatively massive, so everything else appears a drop in the bucket to Mark. Nevertheless, if a usable framework for capturing the *common* part of this stuff were available, removing the 5 lines of code quoted above would help (the Outlook client, and all others). From B-Morgan@concentric.net Fri Nov 8 06:25:30 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Thu, 7 Nov 2002 23:25:30 -0700 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: As I see it, having pop3proxy keep copies of the messages and using an HTML UI for training has the least amount of dependancy on the email client's forwarding capabilities (or lack thereof). 
I have a severe aversion to opening spam that will probably carry over to unsure messages, so having a link added to the message body may not do me much good. I will, however, go to an HTML UI and examine a message if that UI doesn't "execute" the HTML. I don't want to see pretty, raw data is good enough for me to decide. I hate to keep mentioning a "rival" project , but popfile's UI seems pretty close to what I think would work best here. Regards, Brad -----Original Message----- From: spambayes-bounces@python.org [mailto:spambayes-bounces@python.org]On Behalf Of Richie Hindle Sent: Thursday, November 07, 2002 5:17 PM To: spambayes@python.org Subject: [Spambayes] SMTP proxy questions [Me] > Also on my list is to commit Tim Stone's SMTP proxy code, possibly after > integrating it with the pop3proxy (but I need to discuss that with you, > Tim, after looking in more detail at the code, hopefully tonight). I've discussed this with Tim S, and he's going off the SMTP proxy idea while I'm still broadly in favour of it. What do people think - do non-Outlook users want to forward messages to 'spam' and 'ham' to train the system, or use an HTML UI? The most difficult problem for retraining-by-forwarding is matching the forwarded message to one from the cache, after Outlook Express has stripped the headers, top-quoted the users .sig, converted it to HTML and added fifteen macro viruses. Any ideas? Can the tokeniser help? Or perhaps there's another way. The only other option I'd thought of was to add two hyperlinks to the end of the message, "This is spam" and "This is ham" (in ways that would work for both HTML and plain-text messages, in both HTML and plain-text email clients). They'd link to the HTML interface and tell it the cache ID of the message. Adding content to emails is way more intrusive (and difficult) than adding headers. But no more intrusive than the .sig that mailman adds. 
-- Richie Hindle richie@entrian.com _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes From tim.one@comcast.net Fri Nov 8 06:46:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 01:46:14 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > ... > I'm assuming (based on a message I recall seeing recently) that it's > possible to "correct" training - ie, if I train the classifier that a > specific message is spam, I can later say "no it isn't, it's ham". That's right, and at the level of classifier.py it's a two-step process: unlearn() as spam, then learn() as ham. It actually doesn't matter which order those are done in, but I won't admit to that . > Assuming that this is so, is it not reasonable to train dynamically > on an "assume I got it right" basis? Depending on context, it *may* be. > In other words, whenever the addin filters a message as ham or spam, > automatically train on that basis as well. Then, if the user sees a > mistake, he corrects it, which automatically retrains the classifier > (manually deleting as spam or moving a message already does this). Assuming a conscientious user, and a client that knows enough about what the user is doing, that should work fine. > This will keep the database right up to date, and all the user has to > do is correct any bad decisions the classifier makes (which he should > be doing anyway). > > I've ignored database growth issues, but other than that, is there any > other problem with this approach? Doubtless hundreds, but why quibble . A misclassified msg will have bad effects at once if the training gets reflected into the probabilities at once, so it gets less appealing the less zealous the user is about correcting mistakes right away. 
That can be mitigated by doing the day's training into a distinct dict, or not calling update_probabilities() in a single dict, until "the end of the day", when the user has (presumably) corrected all the day's mistakes they're going to correct. But if the model updating is going to be delayed anyway, then it makes as much sense to delay doing any training on "the day's" msgs until the end of the day. Determining what "the end of the day" means is a puzzle then too. For example, maybe I left my email client running and went on a week-long vacation. I'm not going to look over 700 presumed spam when I get back, I'll just delete it. But if ham was in there, I've now let it train in the wrong direction, and that will hurt. In other contexts, the scheme doesn't get off the ground. For example, for python.org use, nobody is going to review msgs claimed to be spam. A system feeding on its own judgments is going to reinforce its own mistakes too, so the "conscientious, timely, reviewing human" bit is important. From tim.one@comcast.net Fri Nov 8 07:20:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 02:20:18 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message-ID: [Mark Hammond] > ... > The key limitation of this scheme, as Tim also alludes to, is that this > never correctly classifies ham. However, I actually see this > incremental training more as a "get smarter now" than a "just get > smarter" technique - ie, a user sees a mis-classified Spam, by re- > training they are increasing the chances that the next similar mail > will be handled correctly. Instant feedback, especially while the user > is getting started. > > ie, it is indeed "mistake based training", but that may still prove > useful in addition to ongoing training. I sure agree it's *very* useful at the start, and expect it will continue to be useful over time. > I can't help thinking that we are somehow underestimating our own > tool here. 
I'm going to try an experiment: I'm going to wipe my home database and start over from scratch, training first on one ham and one spam, then only on mistakes and unsures. This should be fun . > As is common when people first use this tool, spam is generally > found in the ham set and vice-versa. Because of this, I know that my > Inbox is spam free (but less sure about my other "ham" folders). I'm > also sure that my Spam folder has no ham. This should remain true > while continue to use the tool. How do you know your Spam folder has no ham? I know mine doesn't because I routinely score it, sort on the score, and stare at "the wrong end". I find ham there as often as not, *usually* apparently due to mousing error when dragging a training ham into the Ham folder and overshooting the mark. > So surely we can exploit this somehow. Off the top of my head: > * Assume we don't trust the last 2 days of mail (as the user may not > yet have sorted them). Anything in the "good" and "spam" folders older > than this can be assumed correctly classified, and able to be trained > on. Provided the user has already done a decent amount of training, then as Paul Moore suggested it could even work to trust ham-vs-spam decisions immediately, and let user corrections undo those as needed. A well-trained system should be pretty robust against a few misclassifications over the short term. > * A process could go through all ham and spam trained on, and score each > message. Any "suspect" messages are presented in a list (much like the > Outlook "Find Message" result list). The user can indicate that the > message is correct (and the system will remember, never asking about > this message again) or is indeed incorrectly classified. If incorrect, > it will be moved, and incrementally trained as per now. (I can also > picture a whitelist kicking in here; if incorrect, offer to add user to > whitelist. 
If user in the whitelist, assume ham thereby meaning mail
> from this person can never again be spam)

Tell us about the mistakes *you* see. I feel like we're designing a solution to a hypothetical problem otherwise. The only "mistake" I routinely see is that my cigarettes-via-web advertising keeps getting knocked back into Unsure territory. That doesn't bother me enough to do anything about it, but if it bothers you enough then, yes, a whitelist would solve that one.

> I can picture this working in the background, and simply indicating to
> the user that there are "conflicts" to be resolved at their leisure.

Or maybe we could just move those back to the Unsure folder. The user should already know what to do about things in Unsure, so it's nothing new to them. Moving a msg out of Unsure could be taken as a positive sign that the user has classified such a msg once and for all (well, until they move it again, anyway).

> Further, I imagine that as we build better training data for each
> message store, the number of "conflicts" actually found would
> generally be zero - ie, the system would find that all 2 day and
> older mail correctly classifies.

I expect that's true.

From just@letterror.com Fri Nov 8 07:54:04 2002
From: just@letterror.com (Just van Rossum)
Date: Fri, 8 Nov 2002 08:54:04 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To:
Message-ID:

Tim Peters wrote:

> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme. This assumes (probably wrongly) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py Thu Nov 7 23:03:07 2002
> > --- classifier.py.hack Fri Nov 8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> >           wordinfoget = self.wordinfo.get
> > +         spamprobget = self.spamprobcache.get
> >           now = time.time()
> >           for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> >               else:
> >                   record.atime = now
> > !                 prob = record.spamprob
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> > --- 464,470 ----
> >               else:
> >                   record.atime = now
> > !                 prob = spamprobget(word)
> > !                 if prob is None:
> > !                     prob = self.calcspamprob(word, record)
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
>
> Sorry, I don't know what this is trying to accomplish. Like, what is
> self.spamprobcache? There's no such thing now, and the patch doesn't appear
> to create one (i.e., this code doesn't run).

Tim, don't be such a programmer. But ok, I promise I'll never post pseudocode as a patch again...

> Whatever it's supposed to be,
> why isn't spamprobcache.get *itself* responsible for returning a spamprob,
> instead of making its caller deal with two cases?

I thought I was doing your performance needs a favor.

> If the answer is "it's
> supposed to be a dict, so .get ain't that smart",

That's the answer.

> then the memory burden for
> a long-running scorer process will zoom, negating one of the benefits people
> attached to "real databases" thought they were buying in return for giant
> files and slothful performance.

Right. If a float takes up 20 bytes in memory (just a guess, no time to look), then for a database of 100000 words (that's roughly the size of my personal db) the memory burden is 100000 * (8 + 20), almost three megs.

Just in case the higher memory usage is not an issue, there's a simpler approach: don't store spamprob in the db, but call bayes.update_probabilities() on startup. update_probabilities() takes about 2 seconds on my lowly 400Mhz PPC on my db (hm, that's using pickle, so will be a lot more when using a db :-( ). You can tell I'm thinking mostly about long running processes... I guess you're right, one size doesn't fit all.

One last idea for this morning: how about splitting the db in a training db (storing hamcount and spamcount) and a classifying db (storing only spamprob)?
> Life would be easier if databaseheads trained all they liked as often as > they liked, but refrained from calling update_probabilities() until the end > of the day (or other "quiet time"). The idea that the model should be > updated after every msg trained on is an extreme. Good points. Just From richie@entrian.com Fri Nov 8 08:06:33 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 08 Nov 2002 08:06:33 +0000 Subject: [Spambayes] Upgrade problem In-Reply-To: References: Message-ID: [Just] > the web interface of pop3proxy.py is pretty good and useful, the only > downside is that it saves the database after each training That's now fixed (at least partly) along with some other bits: o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later. o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch. o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed. o It now copes with POP3 servers that don't issue a welcome command. o The training form now appears in the training results, so you can train on another message without having to go back to the Home page. -- Richie Hindle richie@entrian.com From tim.one@comcast.net Fri Nov 8 09:15:24 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 08 Nov 2002 04:15:24 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message-ID: [Tim] > ... > I'm going to try an experiment: I'm going to wipe my home database and > start over from scratch, training first on one ham and one spam, then > only on mistakes and unsures. This should be fun . It is! The msg from me I'm replying to here scored 94 (solid spam). I've now got 5 ham and 5 spam in my training set, most of the new ones from Unsures. 
The latest spam was a blatant false negative, from Hapax City:

'*H*'                0.998601
'*S*'                8.60833e-005
'can'                0.0652174
'have'               0.0652174
"don't"              0.0918367
'never'              0.0918367
'number'             0.0918367
'one'                0.0918367
'what'               0.0918367
'"the'               0.155172    ham hapaxes from here
'able'               0.155172
'about'              0.155172
'against'            0.155172
'also'               0.155172
'any'                0.155172
'anything'           0.155172
'back'               0.155172
'because'            0.155172
'been'               0.155172
'check'              0.155172
'even'               0.155172
'find'               0.155172
'found'              0.155172
'heard'              0.155172
'how'                0.155172
'into'               0.155172
"it's"               0.155172
'more'               0.155172
'needed'             0.155172
'other'              0.155172
'out'                0.155172
'own'                0.155172
'people'             0.155172
'skip:a 10'          0.155172
'skip:i 10'          0.155172
'special'            0.155172
'subject:.'          0.155172
'subject:: '         0.155172
'their'              0.155172
'them.'              0.155172
'they'               0.155172
'those'              0.155172
'time'               0.155172
'time.'              0.155172
'unsubscribe'        0.155172
'until'              0.155172
'useful'             0.155172
'using'              0.155172    to here
'and'                0.275281
'for'                0.275281
'subject: '          0.275281
'you'                0.275281
'from'               0.355072
'not'                0.355072
'off'                0.355072
'our'                0.355072
'when'               0.355072
'new'                0.644928
'see'                0.644928
'url:gif'            0.724719
'url:www'            0.724719
'call'               0.844828    spam hapaxes from here
'contact'            0.844828
'credit'             0.844828
'email.'             0.844828
'every'              0.844828
'further'            0.844828
'header:Received:2'  0.844828
'made'               0.844828
'more!'              0.844828
'most'               0.844828
'now'                0.844828
'plus,'              0.844828
'receive'            0.844828
'search'             0.844828
'skip:1 10'          0.844828
'url:jpg'            0.844828    to here
'email'              0.908163

I think I've established that 5+5 isn't enough for great results. However, 80% of its decisions have been correct so far!
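The per-word scores in the listing above can be reproduced by hand, which shows why hapaxes cluster at 0.155172 and 0.844828. The following is a minimal sketch, not the classifier's actual code: it assumes Robinson's Bayesian adjustment f = (s*x + n*p) / (s + n) with an unknown-word strength s = 0.45 and unknown-word probability x = 0.5. Those two values are inferred from the numbers in the listing rather than quoted from the message, and the function name and per-word counts are likewise assumptions chosen to match. The 5 ham and 5 spam training totals come from the message. It assumes Python 3 (true division).

```python
def spamprob(hamcount, spamcount, nham=5, nspam=5, s=0.45, x=0.5):
    """Adjusted spam probability for one word (assumed formula, see above)."""
    hamratio = hamcount / nham        # fraction of ham messages with the word
    spamratio = spamcount / nspam     # fraction of spam messages with the word
    p = spamratio / (hamratio + spamratio)   # raw estimate from the counts
    n = hamcount + spamcount                 # how much evidence backs p
    # Pull p toward x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


ham_hapax = spamprob(1, 0)    # seen once, in ham only:  about 0.155172
spam_hapax = spamprob(0, 1)   # seen once, in spam only: about 0.844828
twice_ham = spamprob(2, 0)    # about 0.0918367, e.g. 'never'
thrice_ham = spamprob(3, 0)   # about 0.0652174, e.g. 'can'
mixed = spamprob(3, 1)        # about 0.275281, consistent with 'and'
twice_spam = spamprob(0, 2)   # about 0.908163, e.g. 'email'
```

With so little training data a single sighting already moves a word a long way from 0.5, so a spam that happens to contain many ham hapaxes scores as ham.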
From tdickenson@devmail.geminidataloggers.co.uk Fri Nov 8 10:52:32 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 8 Nov 2002 10:52:32 +0000
Subject: [Spambayes] Re: unsupervised training
In-Reply-To:
References:
Message-ID: <200211081052.32567.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 08 November 2002 7:20 am, Tim Peters wrote:

> Provided the user has already done a decent amount of training, then as
> Paul Moore suggested it could even work to trust ham-vs-spam decisions
> immediately, and let user corrections undo those as needed. A well-trained
> system should be pretty robust against a few misclassifications over the
> short term.

For the last two weeks I have been using a setup that uses this type of unsupervised training.

I have a procmail filter that sends a copy of all incoming ham and spam to two separate mailboxes. These mailboxes are used for overnight batch training, then deleted. Messages marked 'Unsure' do not take part in this automatic training.

I perform separate filtering for spam and 'unsure' in my mua. So far I am manually inspecting the unsure folder, and manually adding them to the appropriate training mailboxes. Initially about 3% of mails were 'unsure', but this has dropped to less than 1% after 2 weeks.

Starting next week I plan to change the mua filtering to treat 'unsure' the same as 'ham', and stop all manual training. It will be interesting to see if the training remains stable.

From bkc@murkworks.com Fri Nov 8 14:51:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Fri, 08 Nov 2002 09:51:15 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To:
Message-ID: <3DCB8912.18340.2FB5F81@localhost>

On 8 Nov 2002 at 0:17, Richie Hindle wrote:

> Or perhaps there's another way.
The only other option I'd thought of was > to add two hyperlinks to the end of the message, "This is spam" and "This > is ham" (in ways that would work for both HTML and plain-text messages, in > both HTML and plain-text email clients). They'd link to the HTML interface > and tell it the cache ID of the message. Adding content to emails is way > more intrusive (and difficult) than adding headers. But no more intrusive > than the .sig that mailman adds. If you do this, what's to keep spammers from also adding similar looking URLs? A busy person might not notice any difference, could click through and confirm their email address... Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From barry@python.org Fri Nov 8 15:04:56 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 8 Nov 2002 10:04:56 -0500 Subject: [Spambayes] SMTP proxy questions References: Message-ID: <15819.53912.407893.819241@gargle.gargle.HOWL> >>>>> "JB" == Jim Bublitz writes: JB> What about adding a MIME object to the msg with the Spambayes JB> info (text/spambayes?) - or will forwarding lose that info JB> too? The email module should be able to do this. Of course that would have to be text/x-spambayes :) -Barry From randy.diffenderfer@eds.com Fri Nov 8 17:21:25 2002 From: randy.diffenderfer@eds.com (Diffenderfer, Randy) Date: Fri, 8 Nov 2002 12:21:25 -0500 Subject: [Spambayes] SMTP proxy questions Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC295@usahm014.exmi01.exch.eds.com> |>>>>> "JB" == Jim Bublitz writes: | | JB> What about adding a MIME object to the msg with the Spambayes | JB> info (text/spambayes?) - or will forwarding lose that info | JB> too? The email module should be able to do this. 
| |Of course that would have to be text/x-spambayes :) | |-Barry While a fair portion of messages may very well be MIME compliant, this wouldn't work without some serious munging around for non-MIME messages, as well as being very problematic for the many deformed MIME (read very NON compliant :-) ) messages floating around out there! Just an observation... From jbublitz@nwinternet.com Fri Nov 8 17:10:33 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Fri, 08 Nov 2002 09:10:33 -0800 (PST) Subject: [Spambayes] SMTP proxy questions In-Reply-To: <15819.53912.407893.819241@gargle.gargle.HOWL> Message-ID: On 08-Nov-02 Barry A. Warsaw wrote: > >>>>>> "JB" == Jim Bublitz writes: > > JB> What about adding a MIME object to the msg with the > Spambayes > JB> info (text/spambayes?) - or will forwarding lose that > info > JB> too? The email module should be able to do this. > > Of course that would have to be text/x-spambayes :) Well - there's application/ms-excel or some such. Isn't spambayes just as good? :) Point taken. Jim From barry@python.org Fri Nov 8 17:33:53 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 8 Nov 2002 12:33:53 -0500 Subject: [Spambayes] SMTP proxy questions References: <15819.53912.407893.819241@gargle.gargle.HOWL> Message-ID: <15819.62849.101901.822699@gargle.gargle.HOWL> >>>>> "JB" == Jim Bublitz writes: JB> Well - JB> there's application/ms-excel or some such. Isn't spambayes JB> just as good? :) It depends on whether you hold the IETF and IANA in as high regard as Microsoft does . http://www.iana.org/assignments/media-types/ -Barry From lists@morpheus.demon.co.uk Fri Nov 8 21:07:45 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 08 Nov 2002 21:07:45 +0000 Subject: [Spambayes] Outlook plugin - training References: Message-ID: "Tim Peters" writes: [About the plugin code...] > I'm more lost than not in it myself! 
That makes me feel better :-) [About bothering with leaving list traffic out] > Don't worry about it before you try it. I suggest trying it because I'm not > sure it's possible to *stop* the system now from scoring all incoming msgs > (the "new msg in Inbox" filter appears to trigger for every one, regardless > of whether the RW decides to move it; after that it may just be a race > between the RW and the addin deciding where to move each). OK, I've switched over. I now have one Spam folder, one Potential Spam folder, and the rest are Ham (actually, some historic archive folders I've left out, but that's just because I never use them any more). We'll see how it goes. >> Of course, I know that the classifier *really* works by magic, and >> so my intuition is useless :-) > > It's more that unless you know exactly how the math works, your intuition is > simply baseless here, carried over from some other experience. Do *you* > have trouble distinguishing personal and work email from spam? There you > go, and you can't even compute inverse chi-squared probabilities to 14 > significant digits on demand in your head . How do *you* know I can't compute inverse chi-squared probabilities in my head? Oh, hang on - you wanted me to get the right answer, didn't you? :-) > What's to manage? I get about 600 emails per day, and about 1% end > up in Unsure (about 6 -- actually less than that, lately; the system > is learning). My ratio is still a lot worse than that. But as I say, my training corpus is still quite small. But you're right - managing a few mails isn't hard. It's just that the overall results are *so* much better than the old home-grown soution I used that I became instantly spoiled :-) Seriously, I've said this before, but what you guys have developed here is *phenomenally* good. 
I've reached the point where I look forward to getting spam, just because I enjoy so much seeing it automatically appear in the spam folder :-) >> My instinctive reaction is that I want "Spam" and "Not Spam" buttons, >> and then I read or delete the message in situ. > > MarkH has since implemented this in the Unsure folder. Time for a CVS update, I guess... > I still think you're making life too complicated. Is list traffic > spam? If so, call it spam. If not, call it ham. Sounds sensible. I think that all the troubles I've had in the past trying to manage spam have left me with an instinctive feeling that the problem is complicated. This leads to looking for complicated solutions. But you're right. The spam/ham distinction itself is a simple yes/no, so the setup should be, too. But permit me to drag my feet a little, as I throw away all my cherished preconceptions :-) More seriously, I'm putting this point into my spambayes notes folder. I suspect it's something a lot of new users will have to get used to. Thanks for the comments, Paul. -- This signature intentionally left blank From lists@morpheus.demon.co.uk Fri Nov 8 21:12:17 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 08 Nov 2002 21:12:17 +0000 Subject: [Spambayes] Outlook plugin plus Exchange Message-ID: I've noticed a couple of strange effects with the Outlook plugin used against an Exchange server. The main one is that when I start up the client in the morning, there are a lot of overnight messages in my inbox. They don't seem to get filtered. I suspect this is to do with Outlook not firing the "new mail" event on stuff that's in the Exchange store when the client starts up. But I'll need to test this. Unfortunately, the Exchange server is at work, and I can only do any serious hacking on this at home, so I'm running a batch cycle (code at home, take into work, try out, take bugs home, and repeat). So it'll take me a while to make any progress. I'll report back when I get more details. 
Paul (Off to look at Outlook events in MSDN, and to write a simple "log the events and see what is going on" plugin to test with) -- This signature intentionally left blank From mhammond@skippinet.com.au Fri Nov 8 21:52:20 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Sat, 9 Nov 2002 08:52:20 +1100 Subject: [Spambayes] Outlook plugin plus Exchange In-Reply-To: Message-ID: > I've noticed a couple of strange effects with the Outlook plugin used > against an Exchange server. The main one is that when I start up the > client in the morning, there are a lot of overnight messages in my > inbox. They don't seem to get filtered. I suspect this is to do with > Outlook not firing the "new mail" event on stuff that's in the > Exchange store when the client starts up. But I'll need to test this. I am working on code that optionally processes "missed" messages at startup. It looks like I can list all unread, unscored mail in my 1000+ item inbox very quickly, so this should be feasible. > Paul (Off to look at Outlook events in MSDN, and to write a simple > "log the events and see what is going on" plugin to test with) Check out the Outlook plugin in the win32com\demos directory - probably a good place to start. Or if anyone gets lots of KLEZ mail via Outlook, I have a plugin that does a decent job at killing them. Mark. From tim@fourstonesExpressions.com Sat Nov 9 00:55:07 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 08 Nov 2002 18:55:07 -0600 Subject: [Spambayes] Persisting a pickled bayes database Message-ID: I can see the nice createbayes function in hammie, but I don't see any persistence function anywhere. I do see several places where code to write a pickled bayes database is hard coded, and I understand the PersistentBayes thing. I might be missing something... I've been using a simple class to handle creating and persisting my bayes databases. I *think* this stuff should probably go somewhere, but beats me where... 
classifier? doesn't make much sense there... hammie? Any ideas anybody?

Here's the class... (kinda a dumb name ;))

import pickle
import hammie    # for createbayes()

class BayesHelper:
    '''helper class for bayes databases'''

    def __init__(self, db_name, useDB):
        ''' constructor '''
        self.db_name = db_name
        self.useDB = useDB
        self.bayes = hammie.createbayes(db_name, useDB)

    # no __del__() method, because we don't *necessarily* want to persist

    def persist(self):
        '''store the bayes database'''
        # only the pickle flavor needs an explicit write
        if not self.useDB:
            fp = open(self.db_name, 'wb')
            pickle.dump(self.bayes, fp, 1)
            fp.close()

- TimS

From tim.one@comcast.net Sat Nov 9 18:35:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 13:35:43 -0500
Subject: [Spambayes] Persisting a pickled bayes database
In-Reply-To:
Message-ID:

[Tim Stone]
> I can see the nice createbayes function in hammie, but I don't see any
> persistence function anywhere. I do see several places where code
> to write a pickled bayes database is hard coded, and I understand the
> PersistentBayes thing. I might be missing something...

Just experience with idiomatic Python persistence. The persistence was all in DBDict.__init__'s:

    self.hash = anydbm.open(dbname, 'c')

The tradition in Python is that "a persistent database" supplies an interface much like a Python dict, but persists almost purely by magic. For example, here's a brief Python session:

C:\Code\python\PCbuild>python
Python 2.3a0 (#29, Nov 8 2002, 10:51:55) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import anydbm
>>> d = anydbm.open('example.dat', 'n')
>>> d['an'] = 'example'
>>> # and quit Python at this point

Then in another session:

>>> import anydbm
>>> d = anydbm.open('example.dat')
>>> print d
>>> print d.keys()
['an']
>>> print d['an']
example
>>>

Note that anydbm used bsddb as the underlying database mechanism on my box. It may use some other database mechanism on some other box (it depends on what it finds available).
I could have used bsddb directly instead, of course, but then my code would require that bsddb be available. anydbm uses whatever it can scrounge up.

Subclassing the builtin dict type can give a similar "by magic" facility; e.g., here's temp.py:

"""
import cPickle as pickle
import os

class PDict(dict):
    def __init__(self, fname):
        self.fname = fname
        if os.path.exists(fname):
            f = file(fname, 'rb')
            guts = pickle.load(f)
            f.close()
            self.update(guts)
        self.is_open = True

    def close(self):
        if self.is_open:
            f = file(self.fname, 'wb')
            pickle.dump(self, f, 1)
            f.close()
            self.is_open = False

    def __del__(self):
        self.close()
"""

That just adds a few methods to a regular dict, arranging to dump its value to a pickle when .close() is called or when it becomes unreachable. It's intended that .close() be called explicitly, though (by-magic shutdown semantics are never something to bet your life on).

Then in one Python session:

>>> from temp import PDict
>>> d = PDict('example.pck')
>>> d['another'] = 'example'

and in another:

>>> from temp import PDict
>>> d = PDict('example.pck')
>>> d
{'another': 'example'}
>>>

In your example helper class, you decided you don't necessarily want to persist. That may or may not be a useful ability, but "the usual" simple Python database facilities don't give you a choice about that: they commit changes to disk *as* mutations occur. In DB terms, they view each mutation as a transaction.

The ZODB-based stuff Jeremy is doing is different that way: changes to a ZODB db have to be explicitly committed. That's what the get_transaction().commit() lines in the pspam directory are doing. ZODB is much more of "a real database" than these other gimmicks, by which I mean it has an explicit and pretty rich transactional model and API.
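
For comparison, the same dict-like persistence tradition can be sketched with the standard library's shelve module, which stores pickled values in whatever dbm-style database is available. This is only an illustration written for Python 3 (where anydbm is spelled dbm); the helper names remember() and recall() and the file name are made up for the example and are not part of Spambayes.

```python
import os
import shelve
import tempfile


def remember(path, key, value):
    """Store one value under a string key; closing the shelf flushes it."""
    with shelve.open(path) as db:
        db[key] = value


def recall(path, key):
    """Reopen the shelf (as a later session would) and read the value back."""
    with shelve.open(path) as db:
        return db[key]


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example")
    remember(path, "another", "example")
    print(recall(path, "another"))
```

As with anydbm, there is no separate "save" step to remember here, which is exactly the property that leaves no room for a "maybe persist" option.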
From tim.one@comcast.net Sat Nov 9 20:00:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 15:00:42 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCC4DA7.80401@hooft.net>
Message-ID:

[Tim]
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder? If so, you can get away with a very small
> database, and while hapaxes must not be removed blindly in this extreme
> scheme, using the atime field could (I suspect) be very effective in
> slashing the already-small database size (lots of hapaxes will never be
> seen again even if you train on everything; the WordInfo atime field
> tells you when a word was last used at all).

BTW, I'm still doing this experiment, and my total training data is up to 45 ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP and FN are both rare now, and the Unsure rate is about 5% overall and visibly falling. The Unsure spam are more surprising than the Unsure ham, but that may be more psychological than real. For example, it took about 24 hours before I got my first Nigerian spam, and it was shocking to see it score at the low end of the Unsure range.

Looking at the internals is scary. I have entire folders that are called ham seemingly because the mailing list they come from has a few lexical conventions unique to it, and the hapaxes from the single training msg from that list save almost all of that list's msgs from Unsure status.
In the msg of Rob's I'm replying to, these are all ham hapaxes: 'database' 0.155172 'database,' 0.155172 'ever' 0.155172 'idea' 0.155172 'quite' 0.155172 'scheme,' 0.155172 'seen' 0.155172 'subject:Outlook' 0.155172 'subject:Spambayes' 0.155172 'subject:plugin' 0.155172 'subject:training' 0.155172 'tells' 0.155172 'words' 0.155172 and they slug it out with these spam hapaxes: 'away' 0.844828 'effective' 0.844828 'field' 0.844828 'mean' 0.844828 'word' 0.844828 That 'word' is a strong spam clue but 'words' a strong ham clue should tell us something about how robust this is . [Rob Hooft] > This seems to imply that you're still playing with the idea that hapaxes > could be "slashed" from the database when using the "old" train-on-all > procedure. I don't see how that can ever work, as all words pass through > the hapax stage at some point. Or do you mean to slash "old" hapaxes > only? Well, training has no effect on scoring until update_probabilities() is called, and in a batch-training context I mean hapax from update_probabilities's POV. Of course hamcounts or spamcounts for new words start life at 1, but when doing batch training I don't mean to look at the counts until the probabilities are updated. At that point, a hapax is a word that was seen in only one msg from the entire batch of new msgs. Here's a quick test, based on unpublished general python.org email (we can't publish the ham because it includes some personal email; GregW was working on making the spam collection available, but I haven't heard about that in a week; ditto his very large python.org virus collection). In each case, it trains on 2,741 ham and 948 spam, then predicts the same numbers of each. The "all" column includes hapaxes (wrt counts at the *end* of training). The gt1 column threw away words at the end of training where spamcount+hamcount <= 1; i.e., it retained only words that appeared more than once, the non-hapaxes. 
The gt2 column retained only words that appeared more than twice; and so on. ham_cutoff was 0.20 here, and spam_cutoff 0.90.

filename:        all      gt1      gt2      gt3      gt4      gt5      gt6
ham:spam:   2741:948 2741:948 2741:948 2741:948 2741:948 2741:948 2741:948
fp total:          1        0        1        0        0        0        0
fp %:           0.04     0.00     0.04     0.00     0.00     0.00     0.00
fn total:          2        2        2        1        2        3        4
fn %:           0.21     0.21     0.21     0.11     0.21     0.32     0.42
unsure t:         81       87       89       82       98       96      100
unsure %:       2.20     2.36     2.41     2.22     2.66     2.60     2.71
real cost:    $28.20   $19.40   $29.80   $17.40   $21.60   $22.20   $24.00
best cost:    $22.20   $17.60   $20.00   $15.40   $16.80   $17.40   $22.40
h mean:         0.81     0.86     0.87     0.72     0.67     0.64     0.65
h sdev:         6.05     6.18     6.17     5.42     5.13     4.94     5.11
s mean:        98.00    97.66    97.54    97.38    97.03    96.62    96.52
s sdev:         9.26    10.22    10.37    10.62    11.19    12.49    12.61
mean diff:     97.19    96.80    96.67    96.66    96.36    95.98    95.87
k:              6.35     5.90     5.84     6.03     5.90     5.51     5.41

# retained words:
               74327    36437    23877    16143    12798    10719     9157

So while hapaxes are vital with very little training data, even with "just" about 4K training msgs they didn't buy anything in this test, and neither did words that appeared only two or three times, and it doesn't appear to be touchy (all of these columns show excellent results!).

> And what is "old"?

That remains a good question, and a good answer may differ between personal email and bulk email applications. A problem I see coming up in my personal email is that some correspondents only show up once a year, and the hapaxes they generate remain valuable clues, but only once a year. General python.org email doesn't appear to suffer anything like that (so long as personal email is kept out of the python.org mix).

From rob@hooft.net Sat Nov 9 22:24:52 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 23:24:52 +0100
Subject: [Spambayes] Outlook plugin - training
References:
Message-ID: <3DCD8B34.6040903@hooft.net>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment Tim Peters wrote: > [Tim] >>I'm never going to get sub-0.1% error rates this way, but if this is the >>best it ever got, I'd be quite happy with it for my personal email. > BTW, I'm still doing this experiment, and my total training data is up to 45 > ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP > are FN are both rare now, and the Unsure rate is about 5% overall and > visibly falling. I just added a testdriver to CVS that simulates your behaviour as I understand it: It will train on the first 30 messages, plus on all misclassified and all unsure messages. It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am hierarchy. I think we should test its performance at different Options settings. It may not even be very realistic to training on fp's, as I think in my private E-mail I won't even check the spam folder very thoroughly at all. Anyway, a default run for me now gives: 100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47 200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60 300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67 400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73 500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83 600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87 700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98 800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106 900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108 1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112 1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119 1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125 1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133 1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139 1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145 1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149 1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152 1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158 1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161 2000 trained:96H+70S wrds:12637 fp:0 fn:0 
unsure:166 2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170 2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175 2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178 2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182 2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189 2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191 2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196 2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202 2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204 3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209 fp: Data/Ham/Set2/n05250.txt score:0.9312 3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214 3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218 3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222 3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224 3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225 3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229 3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235 3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240 3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243 4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249 4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256 4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257 4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263 4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267 4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272 4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276 4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279 4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280 4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285 fp: Data/Ham/Set1/n01590.txt score:0.9092 5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288 5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291 5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294 5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296 5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303 5500 trained:160H+147S wrds:18230 fp:2 fn:1 
unsure:304 5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307 5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310 5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312 5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316 6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321 6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322 6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323 6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328 6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333 6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335 Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 336 (5.1%) Trained on 178 ham and 162 spam fp: 2 fn: 2 Total cost: $89.20 (This is on 3 out of my 10 test directories). Interesting to note so far: * The "Total cost" is much higher than for train-on-all schemes, but it is only due to Unsures; fp and fn are still small. * The database growth doesn't decay with time after a while; it can be described as: nwords = 9200 + 1.6 * nmessages or alternatively: nwords = 5700 + 40 * ntrained ..as can be seen in the attached png's * The training set is almost balanced, even though I scored many more ham than spam * The unsure rate drops over time: 0- 1000: 11.2% (minus 3.0% to be fair) 1000- 2000: 5.4% 2000- 3000: 4.3% 3000- 4000: 4.0% 4000- 5000: 3.9% 5000- 6000: 3.3% Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: words1.png Type: image/png Size: 12191 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words1-0001.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: words2.png Type: image/png Size: 12807 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words2-0001.png ---------------------- multipart/mixed attachment-- From rob@hooft.net Sat Nov 9 23:46:02 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 10 Nov 2002 00:46:02 +0100 Subject: [Spambayes] More experiments with weaktest.py Message-ID: <3DCD9E3A.4040809@hooft.net> These were results of weaktest with default parameters: Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 336 (5.1%) Trained on 178 ham and 162 spam fp: 2 fn: 2 Total cost: $89.20 If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical (spam_cutoff is 90 by default): Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 442 (6.8%) Trained on 292 ham and 152 spam fp: 2 fn: 0 Total cost: $108.40 So the database grows by 30% but it didn't help my cost. The training set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to the default 20: Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 304 (4.6%) Trained on 213 ham and 101 spam fp: 7 fn: 3 Total cost: $133.80 This reduces the database by only 10%, but at very high fp cost. Same 2:1 unbalance in the training set. Back to the default 20:90 then, and set the minimum_prob_strength to 0.0: Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 933 (14.3%) Trained on 497 ham and 437 spam fp: 0 fn: 1 Total cost: $187.60 OK, so that didn't work either. How about setting it to 0.2? Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 304 (4.6%) Trained on 134 ham and 177 spam fp: 2 fn: 5 Total cost: $85.80 Hm. That is slightly better. Funny, we are suddenly training on more spam than ham.... 
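
The "Total cost" figures in these runs can be reproduced with a little arithmetic. The sketch below assumes weights of $10 per false positive, $1 per false negative and $0.20 per unsure message; those weights are inferred from the totals reported in this message, not read from weaktest.py, and the function name is made up for the example.

```python
# Assumed weights, in cents, inferred from the totals reported in the thread.
FP_CENTS = 1000      # a false positive costs $10.00
FN_CENTS = 100       # a false negative costs $1.00
UNSURE_CENTS = 20    # an unsure message costs $0.20


def total_cost(fp, fn, unsure):
    """Return the dollar cost of a run; integer cents avoid rounding drift."""
    cents = fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS
    return cents / 100


# Default run: fp 2, fn 2, 336 unsure.
print("$%.2f" % total_cost(2, 2, 336))   # $89.20
# minimum_prob_strength = 0.2 run: fp 2, fn 5, 304 unsure.
print("$%.2f" % total_cost(2, 5, 304))   # $85.80
```

With these weights the cost is dominated by the Unsure count unless false positives creep in, which matches the pattern in the runs above.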
Back to 0.1 anyway ---the differences are too small--- and set robinson_probability_x = 0.3 (default is 0.5): Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 602 (9.2%) Trained on 54 ham and 616 spam fp: 1 fn: 67 Total cost: $197.40 Very interesting: this changes the training ratio to 1:12, at huge cost! (less than one in three spams was recognized solidly as such). Wonder what this could do if changed together with the cutoff.... Lets move it back to 0.5, and try "robinson_probability_s = 0.3": Total messages 6540 (4800 ham and 1740 spam) Total unsure (including 30 startup messages): 348 (5.3%) Trained on 237 ham and 120 spam fp: 7 fn: 2 Total cost: $141.60 Ouf. I am back with the defaults, but I'd still like to do an automated optimization of everything simultaneously. Might try that. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From trebor@animeigo.com Sun Nov 10 00:32:36 2002 From: trebor@animeigo.com (Robert Woodhead) Date: Sat, 9 Nov 2002 19:32:36 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: Hello everyone, Just a quick note to introduce myself; I ran the session at that Hacker's conference that Guido mentioned, and passed on the suggestion of checking out Bill Y's combinatorial approach. I've been playing with rules-based techniques for almost a year (see http://www.madoverlord.com/projects/told.t for details) and toying with bayesian systems for only the last couple of months, on and off. So no expert in that regard; I have mostly replicated the early work you guys have done (skimmed the archive today). I'm particularly impressed with the chi-square work, it looks very interesting (but more stats for my poor stats-challenged mind to work on; not to mention that now I'm going to have to get around to cramming python in there with all the other languages that have accumulated over the years...). 
Also, it's nice the way you're testing a lot of variants, I've been crossing things off my "try this" list all afternoon. Couple of comments (bear in mind, I haven't grabbed the source yet, and only skimmed the archive, so if this repeats things you've already tried, sorry). This is just stuff that's been in my mind recently, plus stuff stimulated by my skim. * The great headers debate; suggest you put both machine and human readable opinions in the header, eg: X-SpamBayes-Rating: 9 (Very Spammy) X-SpamBayes-Rating: 7 (Somewhat Spammy) X-SpamBayes-Rating: 5 (Unsure) X-SpamBayes-Rating: 3 (Probably Innocent) X-SpamBayes-Rating: 0 (The Finest Ham) The reason being, many mailreaders can use a finer discriminant than (yes,no,beats me) in ranking spam. A common strategy (which I like myself) is to start an email at neutral priority and bump it up and down based on various triggers, whitelists, whatever, then sort the inbox by the final priority. A cute hack I used in TOLD was to output the result like this: X-SpamBayes-Rating: 0123456789 (Very Spammy) X-SpamBayes-Rating: 012345 (Unsure) This permits a mailreader with limited filtering tools (like Eudora) to classify multiple results with a single rule (such as "if an X-SpamBayes-Rating header contains the string 12345678, set priority to double-low", which catches both 8 and 9 rated emails). BTW, being pedantic, "rating" is a better word to use, it is more precisely what the discriminator is doing, is the same in all flavors of english, and is shorter. "Score" might be even better. ;^) * Hashing to a 32-bit token is very fast, saves a ton of memory, and the number of collisions using the python hash (I appealed for hash functions on the hackers-l and Guido was kind enough to send me the source) is low. About 1100 collisions out of 3.3 million unique tokens on a training set I was using. CRC32, of all things, is actually slightly better, but only by a hair. 
So this kind of hashing probably won't have much effect on the statistical results.

* Bill Y's byte bucket system has a lot of problems, but there are probably some data reduction techniques that would work well. One that occurred to me on the way back from Hackers would be simply to keep a 1-byte count of ham/spam hits for each token, and when the ham or spam count is about to wrap, cut each count in half, rounding up the other value; i.e.:

// increment ham count for bucket i
// apologies, my pseudocode is a bizarre mixture of
// now-unknown languages

if (ham[i] == 255)
{
    // about to wrap: halve both counts, rounding up
    ham[i] = 128;
    spam[i] = (spam[i]/2) + (spam[i]%2);
}
else
    ham[i]++;

The nice thing about this is that it would bias in favor of more recent email; things would "age". But note this means when building the original database you have to feed it ham and spam in small chunks, or use a greater resolution before cramming it into individual bytes.

* I was playing a week or two back with 1 and 2 token groups, and found that a useful technique was, for each new token, to only consider the most deviant result. So if the individual word was .99 spam, and the two word phrase was .95, it would only consider the .99 result. This would probably help with Bill Y's combinatorial scheme. Dunno if you've tried this; it prevents a particularly spammy or hammy sequence from dominating the results (I was only considering the 16 or so most deviant results in my bayesian calc, at least on my corpus, more than that didn't really help).

* My personal bias (as I think Guido mentioned) is for a multifaceted approach, using Bayesian, rules-based (attacking things that bayesian isn't good at, like looking for obfuscated url structures), DNSBL, and whitelisting heuristics to generate an overall ranking. So a hammy mail from a guy in your address book would bubble up to highest priority, whereas something spammy from him would stay neutral.
There's lots of room for cooperation between the various approaches and multiple agents means its less likely that a spam will get by. In particular, whitelisting heuristics can almost eliminate false positives. * Finally, if anyone needs more spam, I get over 300 a day (I've been around a while!) and have a cleaned corpus of over 130MB of spam and foreign email. Also, given all the legit web-marketing email I get because of the url registration work I've done, I've got tons of the spammiest ham you could imagine. Best R -- ----------------------------------------------------------------------- http://madoverlord.com/ World Domination - a fun family activity! From tim.one@comcast.net Sun Nov 10 06:55:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 01:55:59 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Robert Woodhead] > Hello everyone, Hi! > Just a quick note to introduce myself; I ran the session at that > Hacker's conference that Guido mentioned, and passed on the > suggestion of checking out Bill Y's combinatorial approach. You can find test results for that in the list archives. Bottom line is that it did worse than what we're doing now, to such an extent that I'm the only one who appeared to try it (my reports weren't encouraging). I may have misimplemented the idea, but I don't think so. The results were in line with earlier experiments we tried on gimmicks that systematically generate highly correlated "words". Such things appear to learn a lot faster than word unigrams, but we've always found (so far) that unigrams soon enough overcome that, and then go on to win. What we're missing is any practical approach to a scheme that can suck out phrases without identifying them by hand first, and without generating highly correlated phrases (overlapping word n-grams are highly correlated, of course, and Bill carries that to extremes). 
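
For anyone who wants to try the random-input comparison of combining schemes discussed in this message, here is a minimal sketch. It is not copied from chi2.py: the function names are made up, graham_combine() is the plain product rule prod(p) / (prod(p) + prod(1-p)), and chi_combine() is chi-combining as I understand it (Fisher's method applied once to the spam evidence and once to the ham evidence).

```python
import math
import random


def chi2Q(x2, v):
    """Probability that a chi-squared variate with even v d.f. exceeds x2."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def graham_combine(probs):
    """Product-rule combining, computed via logs to avoid underflow."""
    diff = sum(math.log(1.0 - p) - math.log(p) for p in probs)
    if diff > 700.0:          # exp() would overflow; the score is ~0 anyway
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def chi_combine(probs):
    """Return (S, H, score), where score = (S - H + 1) / 2."""
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S, H, (S - H + 1.0) / 2.0


def extreme_fraction(scores, eps=0.01):
    """Fraction of scores within eps of 0 or 1 ("certain" verdicts)."""
    return sum(1 for s in scores if s < eps or s > 1.0 - eps) / len(scores)


def random_input_experiment(trials=2000, nwords=50, seed=0):
    """Feed both schemes vectors of uniformly random probabilities."""
    rng = random.Random(seed)
    graham, chi_s = [], []
    for _ in range(trials):
        probs = [min(max(rng.random(), 1e-12), 1.0 - 1e-12)
                 for _ in range(nwords)]
        graham.append(graham_combine(probs))
        chi_s.append(chi_combine(probs)[0])
    return extreme_fraction(graham), extreme_fraction(chi_s)


g, c = random_input_experiment()
print("Graham scores near 0 or 1:  %.1f%%" % (100 * g))
print("chi S values near 0 or 1:   %.1f%%" % (100 * c))

# "Cancellation disease": many strong clues on both sides.
ambiguous = [0.99] * 21 + [0.01] * 20
print("Graham: %.3f   chi: %.3f" % (graham_combine(ambiguous),
                                    chi_combine(ambiguous)[2]))
```

On random inputs the product rule should claim near-certainty most of the time, while the chi S value lands in the outer 2% only about 2% of the time, which is what a uniformly distributed statistic ought to do.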
Something I didn't report on is later experiments using chi-combining instead of Bill's "add up the raw counts". chi-combining worked better. I know Bill has gone on to do a more Bayesian-like combination method, but I expect that to do worse than what we've got now for the same reasons we gave up on Paul Graham's combining scheme, but more so: the word independence assumption is bogus, and feeding the Bayesian calcs highly correlated words grossly over- or under- estimates the true probability as a result. In the end you get a scheme that claims certainty even when it's dead wrong, and although it's not dead wrong often, it is dead wrong at a non-zero rate. Revealing: fiddle our chi2.py to use whatever combining scheme Bill is using now, and feed it vectors of *random* probabilities. Most of the code needed for that, and to display a histogram of results, is already there. Try it with Graham's combining scheme and you'll find that scores are almost always very near 0 or very near 1 even when the inputs are random and uniformly distributed. I expect that can only get worse by doing "chain rule" calcs on probs that are highly correlated to begin with. The internal chi-combining S and H scores are uniformly distributed given random inputs, so chi-combining doesn't infer certainty by chance any more often than it "should" infer certainty by chance. That appears to be what makes it far more robust against embarrassing mistakes, and it reliably pumps out a score near 0.5 given a highly ambiguous input msg (many strong ham and many strong spam clues -- we call that "cancellation disease" here, and chi-combining doesn't infer certainty when it happens; all other schemes did, and didn't do better than chance when it happened). > I've been playing with rules-based techniques for almost a year (see > http://www.madoverlord.com/projects/told.t for details) and toying > with bayesian systems for only the last couple of months, on and > off. 
So no expert in that regard; I have mostly replicated the early > work you guys have done (skimmed the archive today). > > I'm particularly impressed with the chi-square work, it looks very > interesting (but more stats for my poor stats-challenged mind to work > on; So copy and paste . > not to mention that now I'm going to have to get around to > cramming python in there with all the other languages that have > accumulated over the years...). In return, you can throw twelve other languages out <0.7 wink>. > Also, it's nice the way you're testing a lot of variants, I've been > crossing things off my "try this" list all afternoon. Testing has pretty much run out of steam here, though. My error rates are so low now I couldn't measure an improvement in a convincing way even if one were to be made, and the same is true of a few others here too. We appear to be fresh out of big algorithmic wins, so are pushing on to wrestling with deployment issues. BTW, download the source code and read the comments in tokenizer.py: the results of many early experiments are given there in comment blocks. > Couple of comments (bear in mind, I haven't grabbed the source yet, > and only skimmed the archive, so if this repeats things you've > already tried, sorry). This is just stuff that's been in my mind > recently, plus stuff stimulated by my skim. > > * The great headers debate; suggest you put both machine and human > readable opinions in the header, eg: > > X-SpamBayes-Rating: 9 (Very Spammy) > X-SpamBayes-Rating: 7 (Somewhat Spammy) > X-SpamBayes-Rating: 5 (Unsure) > X-SpamBayes-Rating: 3 (Probably Innocent) > X-SpamBayes-Rating: 0 (The Finest Ham) > > The reason being, many mailreaders can use a finer discriminant than > (yes,no,beats me) in ranking spam. A common strategy (which I like > myself) is to start an email at neutral priority and bump it up and > down based on various triggers, whitelists, whatever, then sort the > inbox by the final priority. 
Spoken like someone who worked on a rule-based system . We have three categories: Ham, Unsure, and Spam, and I haven't seen anything to make me believe that a finer distinction than that can be quantitatively justified (but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's what I mean by "can't measure an improvement anymore", and a finer-grained scheme isn't going to touch those 2 mistakes; one of them is formally ham because it was sent by a real person, but consists of a one-line comment followed by a quote of an entire Nigerian scam spam -- nothing useful is ever going to *call* that one ham, and it scores as spam *almost* as solidly as an original Nigerian spam). > A cute hack I used in TOLD was to output the result like this: > > X-SpamBayes-Rating: 0123456789 (Very Spammy) > X-SpamBayes-Rating: 012345 (Unsure) > > This permits a mailreader with limited filtering tools (like Eudora) > to classify multiple results with a single rule (such as "if an > X-SpamBayes-Rating header contains the string 12345678, set priority > to double-low", which catches both 8 and 9 rated emails). > > BTW, being pedantic, "rating" is a better word to use, it is more > precisely what the discriminator is doing, is the same in all flavors > of english, and is shorter. "Score" might be even better. ;^) "Score" is my favorite, but isn't catching on. I believe the word "ham" for "not spam" was my invention, and since that one caught on big, I'm not fighting to the death for any others . > * Hashing to a 32-bit token is very fast, saves a ton of memory, > and the number of collisions using the python hash (I appealed for hash > functions on the hackers-l and Guido was kind enough to send me the > source) is low. About 1100 collisions out of 3.3 million unique > tokens on a training set I was using. That's significantly better than you could expect from a truly random hash function, so is fishy. 
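
The occupancy a truly random 32-bit hash would give for the 3.3 million tokens quoted above can be estimated with the usual balls-in-bins (Poisson-style) approximations. This is a sketch under that assumption, not the exact combinatorial formula, and the function name is made up for the example.

```python
import math


def occupancy_stats(n_items, n_buckets):
    """Approximate mean and sdev of the number of occupied buckets."""
    x = n_items / n_buckets                  # average load per bucket
    p_empty = math.exp(-x)                   # chance a given bucket is empty
    expected_occupied = -n_buckets * math.expm1(-x)
    # Large-table approximation to the variance of the empty-bucket count,
    # which is also the variance of the occupied-bucket count.
    variance = n_buckets * p_empty * (1.0 - (1.0 + x) * p_empty)
    return expected_occupied, math.sqrt(variance)


n = 3_300_000
occupied, sdev = occupancy_stats(n, 2 ** 32)
expected_collisions = n - occupied
z = (expected_collisions - 1100) / sdev

print("expected occupied buckets: %.0f" % occupied)
print("sdev: %.2f" % sdev)
print("expected collisions: %.0f" % expected_collisions)
print("1100 collisions is %.1f sdevs below the mean" % z)
```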
Tossing 3.3M balls into 2**32 buckets at random should leave 3298733 buckets occupied on average, with an sdev of 35.58 buckets. That is about 1267 collisions, so getting 1100 collisions is about 4.7 sdevs fewer than the random mean.

> CRC32, of all things, is actually slightly better,

With sparse occupancy (3.3e6 out of 4.3e9 buckets is sparse) they may be comparable. PythonLabs ran large-scale statistical tests a few years ago on this. The Python string hash produced 32-bit numbers indistinguishable from random (on the #-of-collision basis) as far as we pushed it; crc32 broke down *very* badly as occupancy increased, with collision rates hundreds of sdevs worse than random. So I can't recommend crc32 for general string hashing (and the Python docs indeed warn about this now), but can recommend Python's string hash. By coincidence, it turns out that Python's string hash is very similar to what later became "the standard" Fowler-Noll-Vo string hash, which may be the most widely tested "seems as good as random" fast string hash now:

    http://www.isthe.com/chongo/tech/comp/fnv/

> but only by a hair. So this kind of hashing probably won't have much
> effect on the statistical results.

Since we're sticking to unigrams, we don't have an insane database burden. We also (by default) limit ourselves to looking at no more than 150 words per msg. So I'm not sure saving some bytes of string storage is "worth it" for us, and it's very nice that we can get back the exact list of words that went into computing a score later. A pile of hash codes wouldn't give the same loving effect .

> * Bill Y's byte bucket system has a lot of problems, but there are
> probably some data reduction techniques that would work well.
One > that occurred to me on the way back from Hackers would be simply to > keep a 1-byte count of ham/spam hits for each token, and when the ham > or spam count is about to wrap, cut each count in half, rounding up > the other value; ie: > > // increment ham count for bucket i > // apologies, my pseudocode is a bizarre mixture of > // now-unknown languages > > if (ham[i]=255) > { > ham[i]=128; > spam[i]=(spam[i]/2)+(spam[i]%2) > } > else > ham[i]++; > > The nice thing about this is that it would bias in favor of more > recent email; things would "age". But note this means when building > the original database you have to feed it ham and spam in small > chunks, or use a greater resolution before cramming it into > individual bytes. Except I didn't get good enough results from his approach to justify pursuing it here, even leaving the hash codes at the full 32 bits. When I went on to squash them to fit in a million buckets, a few false positives popped up that were just too bad to bear (two can be found in the list archives): ham that was so obviously ham that no system that called them spam would be acceptable to most people. > * I was playing a week or two back with 1 and 2 token groups, and > found that a useful technique was, for each new token, to only > consider the most deviant result. So if the individual word was .99 > spam, and the two word phrase was .95, it would only consider the .99 > result. This would probably help with Bill Y's combinatorial scheme. It could be a viable approach to the problem mentioned above: a scheme to suck out more than one word that doesn't systematically generate mounds of nearly redundant (highly correlated) clues. We're clearly missing info by never looking at bigrams (or beyond) now, and that continues to bother me (even if it doesn't seem to be bothering the error rates ). 
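A minimal sketch of the "only consider the most deviant result" idea quoted above, assuming each token already carries a spam probability in [0, 1]. The function name and the tie-breaking rule are assumptions made for illustration; this is not code from the project.

```python
def most_deviant(word_prob, phrase_prob):
    """Return whichever spamprob lies farther from the neutral 0.5.

    Ties go to the single-word probability (an arbitrary choice made
    for this sketch).
    """
    if abs(phrase_prob - 0.5) > abs(word_prob - 0.5):
        return phrase_prob
    return word_prob


# The example from the quoted text: the word scores .99 and the
# two-word phrase scores .95, so only the .99 clue is kept.
clue = most_deviant(0.99, 0.95)
```

Keeping one clue per position is what stops a word and the phrase containing it from being counted as two nearly redundant pieces of evidence.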
> Dunno if you've tried this; it prevents a particularly spammy or > hammy sequence from dominating the results (I was only considering > the 16 or so most deviant results in my bayesian calc, at least on my > corpus, more than that didn't really help). There's too much I don't know about everything you're doing to say much about that. *All* the biases in Graham's original scheme eventually went away in this project, and things like clamping the spamprobs into [.01, 0.99] turned out to make it systematically useless to try to use more than 16 words under Graham-combining (it just caused more "cancellation disease", and so caused more wildly wrong mistakes). We use 150 now, but IIRC we generally stopped seeing strong benefits after hitting about 40. That 40 was better than 16 very much relied on removing all the biases, though (no "ham boosts", no prob clamping, no minimum word count, no giving unknown words spamprobs above 0.5 to favor ham, no doubling the ham count when computing a spam prob, etc). > * My personal bias (as I think Guido mentioned) is for a multifaceted > approach, using Bayesian, rules-based (attacking things that bayesian > isn't good at, like looking for obfuscated url structures), DNSBL, > and whitelisting heuristics to generate an overall ranking. So a > hammy mail from a guy in your address book would bubble up to highest > priority, whereas something spammy from him would stay neutral. I'm not sure we really need it. For example, *lots* of spam has been discussed on this mailing list, so much so that the python.org email admin had to castrate SpamAssassin for msgs to this list address else it kept blocking ordinary list traffic. My personal email classifier never calls anything here spam, though, nor does it call the originals of the spams posted here ham. I do worry a little about obsfuscated HTML. 
We strip almost all HTML tags by default for a reason I've harped on enough : all HTML decorations have very high spamprobs, and counting more than one of them as "a clue" fools almost every combining scheme into believing the msg containing them is spam (if you know a msg contains both
and

, it's not really more likely to be spam than if you just know it contains
!). So we blind the classifier to HTML decorations now. But a spam I forwarded here a week or so ago exploited that: the spam was interleaved with size=1 white-on-white news stories and tech mailing list postings. The classifier *did* see those, but didn't see the HTML decorations hiding them. This was a cancellation-disease-by-construction kind of msg, and chi-combining scored it near 0.5 as a result (solidly Unsure). It's the only spam of that kind I've seen so far; if it becomes a popular techinque, we'll have to take more HTML blinders off the classifier. > There's lots of room for cooperation between the various approaches > and multiple agents means its less likely that a spam will get by. > In particular, whitelisting heuristics can almost eliminate false > positives. I'll let you know if I ever see one . Seriously, one of the apps I've especially got in mind is filtering the high-volume mailing lists on python.org. The only kind of FP I see there now in tests is adminstrative requests to *-request addresses, which typically consist of a one word "subscribe" or "unsubscribe" (themselves words with high spamprobs!), followed by 6KB of employer-generated HTML disclaimers, and/or a forwarded spam or conference announcement the sender didn't like. There's still a very low FP rate even on those, but text analysis simply can't be expected to nail them every time. Under SpamAssassin, those recipient addresses are given strong ham boosts by the python.org email admin. > * Finally, if anyone needs more spam, I get over 300 a day (I've been > around a while!) and have a cleaned corpus of over 130MB of spam and > foreign email. Also, given all the legit web-marketing email I get > because of the url registration work I've done, I've got tons of the > spammiest ham you could imagine. Wasn't Paul Graham collecting corpora? 
Yup, still is: http://www.paulgraham.com/spamarchives.html Getting vast quantities of spam isn't a problem anymore, but getting vast quantities of ham is. Since your spammy ham is presumably business-related, I assume you can't share it. Or can you? Mixing spam and ham from different sources also causes worlds of problems (indeed, we still (by default) ignore most of the header lines partly for that reason, else the system gets great results for bogus reasons). From tim.one@comcast.net Sun Nov 10 07:27:38 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 02:27:38 -0500 Subject: [Spambayes] More experiments with weaktest.py In-Reply-To: <3DCD9E3A.4040809@hooft.net> Message-ID: [Rob Hooft] > These were results of weaktest with default parameters: Very interesting! I'll have to try that too. Note that in my live email experiment here, I'm (except for the very start) also scoring/training msgs in (with small lapses) the order they arrive. It's been reported before that this helps; although I still haven't run a controlled experiment on that, my *impression* is that it does help. > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 336 (5.1%) > Trained on 178 ham and 162 spam > fp: 2 fn: 2 > Total cost: $89.20 > > If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical > (spam_cutoff is 90 by default): The asymmetry is intentional: most people hate FP more than FN, so by default I made it harder for a thing to get called spam. In test after test we've also seen that spam has a tighter score distribution than ham, which is a more formal justification for setting the spam cutoff closer to its endpoint than the ham cutoff. Setting ham_cutoff as low as 10 is for the truly paranoid <0.9 wink>. 
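A minimal sketch of the three-way split with asymmetric cutoffs discussed above. It assumes scores run from 0.0 to 1.0, so the "20" and "90" in the quoted text become 0.20 and 0.90; the function name and the treatment of scores exactly on a cutoff are assumptions, not taken from the project source.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0, 1] to 'ham', 'unsure' or 'spam'.

    Scores exactly on a cutoff are treated as 'unsure' here; that
    boundary choice is an assumption of this sketch.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# With the defaults, the spam band (0.90 to 1.0) is narrower than the
# ham band (0.0 to 0.20), so it is harder for a message to be called spam.
labels = [classify(s) for s in (0.05, 0.15, 0.50, 0.95)]
```

Lowering ham_cutoff to 0.10 moves a score such as 0.15 from "ham" to "unsure", which is why the Unsure count rises in the quoted runs that follow.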
> Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 442 (6.8%) > Trained on 292 ham and 152 spam > fp: 2 fn: 0 > Total cost: $108.40 > > So the database grows by 30% but it didn't help my cost. The training > set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to > the default 20: > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 304 (4.6%) > Trained on 213 ham and 101 spam > fp: 7 fn: 3 > Total cost: $133.80 > > This reduces the database by only 10%, but at very high fp cost. Same > 2:1 unbalance in the training set. > Back to the default 20:90 then, and set the minimum_prob_strength to 0.0: > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 933 (14.3%) > Trained on 497 ham and 437 spam > fp: 0 fn: 1 > Total cost: $187.60 > > OK, so that didn't work either. How about setting it to 0.2? > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 304 (4.6%) > Trained on 134 ham and 177 spam > fp: 2 fn: 5 > Total cost: $85.80 > > Hm. That is slightly better. Funny, we are suddenly training on more > spam than ham.... Back to 0.1 anyway ---the differences are too small--- > and set robinson_probability_x = 0.3 (default is 0.5): > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 602 (9.2%) > Trained on 54 ham and 616 spam > fp: 1 fn: 67 > Total cost: $197.40 > > Very interesting: this changes the training ratio to 1:12, at huge cost! > (less than one in three spams was recognized solidly as such). Note that in calculations I reported a day or two ago, the measured mean of spamprobs across 3 different corpora was > 0.5, but not by a lot. .3 moves it outside the range minimum_prob_strength ignores, so now every "new word" is instantly taken as a ham clue, where before all new words were ignored by default. 
So that it grossly inflated the FN rate isn't surprising; everything that will *eventually* become a hapax is initially taken to be a ham clue, even when it's never been seen before. > Wonder what this could do if changed together with the cutoff.... > Lets move it back to 0.5, and try "robinson_probability_s = 0.3": > > Total messages 6540 (4800 ham and 1740 spam) > Total unsure (including 30 startup messages): 348 (5.3%) > Trained on 237 ham and 120 spam > fp: 7 fn: 2 > Total cost: $141.60 > > Ouf. I hope you're at least gaining some respect for how much work went into picking the defaults . > I am back with the defaults, but I'd still like to do an automated > optimization of everything simultaneously. Might try that. Now *that* could be a useful system regardless of scheme. I've tended to do hill-climbing across one dimension at a time, occasionally moving batches of params random amounts at once (to see whether that kicks it out of a stubborn local minimum). From tim.one@comcast.net Sun Nov 10 07:52:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 02:52:42 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <3DCD8B34.6040903@hooft.net> Message-ID: [Rob Hooft] > I just added a testdriver to CVS that simulates your behaviour as I > understand it: It will train on the first 30 messages, I trained on 1 of each at the start. If I were to do it over, I'd start with an empty database . > plus on all misclassified and all unsure messages. Since I'm doing this real-time on my live email, I've been training "on the worst" (farthest away from correct) msg that arrives in a batch, then rescoring all the ones that arrived in the batch, then training the worst remaining, ... until all new ham is below ham_cutoff and all new spam above spam_cutoff. I don't know that it matters, just being clear(er). 
As things turned out, this worst-at-a-time training never managed to push one of the remaining mistakes/unsures into the correct category, *except* for cases where I got more than one copy of a spam from different accounts at the same time. Then it always pushed the copies into scoring near 1.0, since the hapaxes in the training copy are abundant.

> It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am
> hierarchy.
>
> I think we should test its performance at different Options settings.
>
> It may not even be very realistic to train on fp's, as I think in my
> private E-mail I won't even check the spam folder very thoroughly at all.

But I will (and do), and my primary interest here is to see how bad things can get if a user takes mistake-based training to an extreme. Despite that it's heavily hapax-driven, it appears to do very well when judged by error rate. I've been doing it long enough now, though, that it doesn't do so well subjectively: the Unsures are too often bizarre. For example, I sent a long reply here to Robert Woodhead, and the copy I got back showed up as Unsure, with H=1 and S=0.66. There were a lot of accidental spam hapaxes in that msg! Training on it as ham then eliminated about 30 spam hapaxes (they're now neutral, having been seen in one ham and one spam each). So it's no different from my POV than the cases where people have sent me "surprising msgs" in the past, and my carefully trained slice-of-life classifier (regularly trained on a sampling of correctly classified msgs too) at the time had no trouble nailing them as ham or spam, with lots of non-hapax evidence to back it up.

IOW, I'm still sticking to what I guessed before I started this: mistake-driven training will appear to work well over the short term, but it's brittle, and is brittle because of its reliance on hapaxes.
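A minimal sketch of the "train on the worst, rescore, repeat" loop described above. The `score` and `train` callables are hypothetical hooks standing in for a real classifier, "worst" is read here as farthest from the ideal score (0.0 for ham, 1.0 for spam), and each message is trained at most once so the loop always ends. All of that is an interpretation, not the procedure as actually run.

```python
def train_worst_first(batch, score, train, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train on the worst-scoring message, rescore, and repeat.

    batch: list of (msg, is_spam) pairs whose true class is known.
    score: callable msg -> float in [0, 1] (hypothetical hook).
    train: callable (msg, is_spam) -> None (hypothetical hook).
    Returns the messages trained on, in training order.
    """
    pending = list(batch)
    trained = []
    while pending:
        # Rescore everything still pending after the last training step.
        scored = [(score(msg), msg, is_spam) for msg, is_spam in pending]
        # Keep only messages not yet solidly on the right side of their
        # cutoff; badness is the distance from the ideal score.
        wrong = [
            ((1.0 - s) if is_spam else s, msg, is_spam)
            for s, msg, is_spam in scored
            if (s <= spam_cutoff if is_spam else s >= ham_cutoff)
        ]
        if not wrong:
            break  # all ham below ham_cutoff, all spam above spam_cutoff
        _, msg, is_spam = max(wrong, key=lambda item: item[0])
        train(msg, is_spam)
        trained.append(msg)
        pending.remove((msg, is_spam))
    return trained


# Toy stand-in classifier: it only remembers what it was trained on, so
# training on one message never moves the score of another.
memory = {}
prior = {"cheap pills": 0.70, "meeting at 3": 0.40, "hi mom": 0.05}


def toy_score(msg):
    return memory.get(msg, prior[msg])


def toy_train(msg, is_spam):
    memory[msg] = 1.0 if is_spam else 0.0


batch = [("cheap pills", True), ("meeting at 3", False), ("hi mom", False)]
order = train_worst_first(batch, toy_score, toy_train)
```

In the toy run the ham at 0.40 is farther from its ideal than the spam at 0.70, so it is trained first; the ham at 0.05 is already below ham_cutoff and is never trained on.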
> Anyway, a default run for me now gives:
>
>   100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
>   200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
>   300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
>   400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
>   500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
>   600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
>   700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
>   800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
>   900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
>  1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
>  1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
>  1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
>  1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
>  1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
>  1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
>  1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
>  1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
>  1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
>  1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
>  2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
>  2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
>  2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
>  2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
>  2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
>  2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
>  2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
>  2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
>  2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
>  2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
>  3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
> fp: Data/Ham/Set2/n05250.txt score:0.9312
>  3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
>  3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
>  3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
>  3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
>  3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
>  3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
>  3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
>  3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
>  3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
>  4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
>  4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
>  4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
>  4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
>  4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
>  4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
>  4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
>  4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
>  4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
>  4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
> fp: Data/Ham/Set1/n01590.txt score:0.9092
>  5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
>  5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
>  5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
>  5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
>  5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
>  5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
>  5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
>  5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
>  5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
>  5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
>  6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
>  6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
>  6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
>  6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
>  6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
>  6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 336 (5.1%)
> Trained on 178 ham and 162 spam
> fp: 2 fn: 2
> Total cost: $89.20
>
> (This is on 3 out of my 10 test directories).
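A worked check of the "Total cost" figure in the run above. The per-error weights ($10.00 per fp, $1.00 per fn, $0.20 per unsure) are not stated in the message; they are inferred here because they reproduce the reported totals, so treat them as an assumption. The function name is made up for this sketch.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted sum of mistakes; the weights are inferred, not quoted."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


# Default run: fp 2, fn 2, 336 unsure -> 20.00 + 2.00 + 67.20
cost = total_cost(fp=2, fn=2, unsure=336)
```

With these weights nearly all of the $89.20 comes from the Unsures ($67.20), which matches the remark that fp and fn stay small while the cost is driven by Unsure messages.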
>
> Interesting to note so far:
>   * The "Total cost" is much higher than for train-on-all schemes,
>     but it is only due to Unsures; fp and fn are still small.

That matches my experience too, although I started with 1 ham and 1 spam and had high FP and FN rates over the first few hours.

>   * The database growth doesn't decay with time after a while;
>     it can be described as:
>         nwords = 9200 + 1.6 * nmessages
>     or alternatively:
>         nwords = 5700 + 40 * ntrained
>     ..as can be seen in the attached png's

I expect that's mostly because there are still (relatively) few total msgs trained on.

>   * The training set is almost balanced, even though I scored
>     many more ham than spam

Curiously, same here! I get about 500 ham and 100 spam per day, but my training database now has 47 ham and 41 spam. It does well, except when it sucks .

>   * The unsure rate drops over time:

I haven't measured that, but it's clearly been so here too (as I said before).

>        0- 1000: 11.2% (minus 3.0% to be fair)
>     1000- 2000:  5.4%
>     2000- 3000:  4.3%
>     3000- 4000:  4.0%
>     4000- 5000:  3.9%
>     5000- 6000:  3.3%

Proving what I've always suspected: over time, all msgs are repetitions of ones you've seen before <0.9 wink>.

From tim.one@comcast.net Sun Nov 10 08:36:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 03:36:10 -0500
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To:
Message-ID:

[Tim, asks for help on a Spanish Unsure]

[François Granger]
> Here are the most probable English equivalents of the Spanish words.
> 'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with',
> 'good', 'this', 'Hi', 'that', 'here', 'the', 'for'
>
> This illustrates the need for properly balanced training sets and
> re-raises the question of language discrimination.

It really doesn't raise it for me: this was in my personal email, and since I couldn't read the msg anyway, it may as well have been spam.
I get way too much email to bother more than 2 seconds with something I can't read. I only looked at this one because I'm paying heavy attention to everything the automatic classifier calls spam. If I weren't using this system, I would have thrown out that msg at once.

If I were someone who got any quantity of Spanish ham, the system would have scored it as ham. As is, the only Spanish I get is in Spanish spam, so the system correctly judged it for my personal email mix.

> At least prior language discrimination would allow for a different
> database for each language

Whether that would improve results is a testable hypothesis; I've already said I doubt it would be helpful, and have no motivation to try such an experiment myself.

> or for a systematic "unsure" flag for not trained languages.

But I *do* train on Spanish -- and Russian, and Turkish, and Chinese, and Japanese, and German, and French, and Polish (at least): in my email mix, they're all used in spam, aren't used in my ham, and are spam to me because they're unreadable by me.

> If you put my messages in a Ham training set, you will flag French spams
> as ham because of my French sig ;-)

Nope, the system isn't that stupid (or, rather, it is ). What it will do is knock down the spamprobs of those words. Despite that I've got French spam in my training data, your msg here -- including the French sig -- got a solid ham score, with H=1 (to six significant digits) and S=1.1e-11. The strongest spam word in fact came from your sig, spamprob('est')=0.84. It didn't matter, because I could actually read most of what you wrote, and it wasn't trying to sell me Viagra .

> All these words should rate around 0.5 since they are among the
> most common ones in this language.

If I got any French ham, they would rate around 0.5, but for my personal email it's Just Fine that they're considered spam words.
It wouldn't be OK for python.org use, but python.org gets a non-trivial amount of non-English ham, so it trains there accordingly.

> Le courrier est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies. Pour des courriers propres :
> --

Indeed .

From rob@hooft.net Sun Nov 10 11:09:28 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 12:09:28 +0100
Subject: [Spambayes] Introducing myself
References:
Message-ID: <3DCE3E68.2060101@hooft.net>

Robert Woodhead wrote:

> * My personal bias (as I think Guido mentioned) is for a multifaceted
> approach, using Bayesian, rules-based (attacking things that bayesian
> isn't good at, like looking for obfuscated url structures), DNSBL, and
> whitelisting heuristics to generate an overall ranking. So a hammy mail
> from a guy in your address book would bubble up to highest priority,
> whereas something spammy from him would stay neutral. There's lots of
> room for cooperation between the various approaches and multiple agents
> means it's less likely that a spam will get by. In particular,
> whitelisting heuristics can almost eliminate false positives.

I think our very good experience with the bayesian classifier would "forbid" using whitelisting. Once a whitelisted feature "leaks" into the spam community, it will be useless. But there is a bayesian solution to it: Make the tokenizer recognize the feature that you want to whitelist or blacklist, and emit a new token to that effect.

    From: --> Will have a low spamprob
    url:numeric-host --> Will have a high spamprob

We're already doing something like that for a number of the SpamAssassin tests (e.g. mime-type tokens). This approach still uses a purely bayesian classifier, and it will follow reality automatically.

I'd like to note that a lot of what you were saying and what was in Tim's response (and mine here) is only valid in a train-on-all scheme. i.e.
like we've been using until a week ago.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Nov 10 12:11:46 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 10 Nov 2002 13:11:46 +0100 Subject: [Spambayes] More experiments with weaktest.py References: Message-ID: <3DCE4D02.6060907@hooft.net> Tim Peters wrote: > [Rob Hooft] > >>These were results of weaktest with default parameters: > > > Very interesting! I'll have to try that too. Note that in my live email > experiment here, I'm (except for the very start) also scoring/training msgs > in (with small lapses) the order they arrive. It's been reported before > that this helps; although I still haven't run a controlled experiment on > that, my *impression* is that it does help. I toyed with the idea, but that would involve parsing all messages once before starting, and sorting them on date. Putting them in a set to "randomize" the order is much easier, so I was lazy. > Setting ham_cutoff as low as 10 is for the > truly paranoid <0.9 wink>. Very much so. For my "production" systems, I have ham_cutoff at 40... > I hope you're at least gaining some respect for how much work went into > picking the defaults . I was just arriving when it happened. But that was on a completely different classifier, so I'm still convinced these need to be thoroughly tested. >>I am back with the defaults, but I'd still like to do an automated >>optimization of everything simultaneously. Might try that. > Now *that* could be a useful system regardless of scheme. I've tended to do > hill-climbing across one dimension at a time, occasionally moving batches of > params random amounts at once (to see whether that kicks it out of a > stubborn local minimum). Hm. That sounds so enthousiastic that I just might commit what I have gone through this night. Some more info: * No, I have not used a "Simulated Annealing" or "Threshold Accepting" yet. 
  Please keep in mind that each step in the optimization takes between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my work PC). This would be way too costly. Just minimization it will be.

* I tried to use "Simplex optimization" (let a multidimensional triangle walk through phase space) on the "Total cost" parameter. This was simply disastrous. Phase space consists of plateau regions that are exactly flat, joined by huge ridges. Think about that one spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one bang to the cost. This field is impossible to optimize.

* I designed a new "Flex cost" field. That one does away with the "unsure cost". The cost of a message is 0.0 at its own cutoff, and increases linearly towards its "false" cost at the other cutoff, and increases further to the other end. Hm. Unreadable. A table:

      Score    Spam with this    Ham with this
               score costs       score costs
      0.00     $ 1.29            $ 0.00
      0.20     $ 1.00            $ 0.00
      0.55     $ 0.50            $ 5.00
      0.90     $ 0.00            $10.00
      1.00     $ 0.00            $11.43

  This field is much smoother than the total cost field, so I was hoping that pure minimization will do. Obviously, the flex cost is much, much higher than the total cost because unsures are so much more expensive. The flex cost field will also be less sensitive to the {sp|h}am_cutoff parameters than the total cost field, because there are no sudden cost jumps.

* Results are not great; I need to experiment more before reporting on them.

* I just committed:
    weaktest.py: introduction of the flexcost measure
    optimize.py: simplex optimization (needs Numeric python; sorry)
    weakloop.py: run weaktest.py repeatedly under simplex optimization

Regards,

Rob Hooft

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Nov 10 12:28:44 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:28:44 +0100
Subject: [Spambayes] Outlook plugin - training
References:
Message-ID: <3DCE50FC.3050005@hooft.net>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment Tim Peters wrote: > [Rob Hooft] > >>I just added a testdriver to CVS that simulates your behaviour as I >>understand it: It will train on the first 30 messages, > > > I trained on 1 of each at the start. If I were to do it over, I'd start > with an empty database . This is easy enough to change, but I left it at 30 for now. > Since I'm doing this real-time on my live email, I've been training "on the > worst" (farthest away from correct) msg that arrives in a batch, then > rescoring all the ones that arrived in the batch, then training the worst > remaining, ... until all new ham is below ham_cutoff and all new spam above > spam_cutoff. I don't know that it matters, just being clear(er). As things > turned out, this worst-at-a-time training never managed to push one of the > remaining mistakes/unsures into the correct category, *except* for cases > where I got more than one copy of a spam from different accounts at the same > time. Then it always pushed the copies into scoring near 1.0, since the > hapaxes in the training copy are abundant. But I'm doing exactly the same, except that my batch size is always 1 ;-) >>It may not even be very realistic to training on fp's, as I think in my >>private E-mail I won't even check the spam folder very thoroughly at all. > But I will (and do), and my primary interest here is to see how bad things > can get if a user takes mistake-based training to an extreme. Despite that > it's heavily hapax-driven, it appears to do very well when judged by error > rate. Hm. There are so little fp/fn's relative to unsures (at least after 30 messages initial training), that it wouldn't matter much (I think). 
>> * The database growth doesn't decay with time after a while;
>>   it can be described as:
>>       nwords = 9200 + 1.6 * nmessages
>>   or alternatively:
>>       nwords = 5700 + 40 * ntrained
>>   ..as can be seen in the attached png's
>
>
> I expect that's mostly because there are still (relatively) few total msgs
> trained on.

Hm, it is more like a sqrt after more messages. See attached image which has a sqrt X axis. The fit fits the data even at the lowest end.

Regards,

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words3.png
Type: image/png
Size: 13675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021110/b5905d0f/words3-0001.png

---------------------- multipart/mixed attachment--

From lists@morpheus.demon.co.uk Sun Nov 10 14:31:30 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 10 Nov 2002 14:31:30 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
References:
Message-ID:

"Mark Hammond" writes:

> I am working on code that optionally processes "missed" messages at startup.
> It looks like I can list all unread, unscored mail in my 1000+ item inbox
> very quickly, so this should be feasible.

That sounds like the best option. I haven't had a chance to check Exchange yet, but with an IMAP store there are no "New mail" events triggered when I start Outlook with new mail in the IMAP inbox. I'd expect Exchange to be the same. (I didn't write a new addin, the spambayes addin does log when it gets a NewMail event, which I can see via win32traceutil...)

I'll be interested to see the code, in any case, as when I tried to list unread mail for another project, I couldn't get it to be fast :-(

Paul.
-- This signature intentionally left blank From trebor@animeigo.com Sun Nov 10 21:59:28 2002 From: trebor@animeigo.com (Robert Woodhead) Date: Sun, 10 Nov 2002 16:59:28 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: [my apologies if some of the suggestions/comments below have been previously discussed, I'm still getting up to speed on the list] > > I'm particularly impressed with the chi-square work, it looks very >> interesting (but more stats for my poor stats-challenged mind to work >> on; > >So copy and paste . Heh, call me old fashioned, but I actually like to know how things work, rather than relying on black magic. ;^) > > not to mention that now I'm going to have to get around to >> cramming python in there with all the other languages that have >> accumulated over the years...). > >In return, you can throw twelve other languages out <0.7 wink>. Why would I ever want to do that? You never know when you'll need to be able to remember PL/C, JPL, APL, TUTOR, etc., etc., etc. Though I pray I never have to remember NOVA MOBOL ("Language of Kings") ;^) >Testing has pretty much run out of steam here, though. My error rates are >so low now I couldn't measure an improvement in a convincing way even if one >were to be made, and the same is true of a few others here too. We appear >to be fresh out of big algorithmic wins, so are pushing on to wrestling with >deployment issues. Indeed. And you also have to start worrying about the metagame; assuming your system goes into widespread deployment, what will the intelligent spammer (oxymoron) responses be? >BTW, download the source code and read the comments in tokenizer.py: the >results of many early experiments are given there in comment blocks. Will be doing this over the next day or so. >Spoken like someone who worked on a rule-based system . 
We have three >categories: Ham, Unsure, and Spam, and I haven't seen anything to make me >believe that a finer distinction than that can be quantitatively justified >(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's >what I mean by "can't measure an improvement anymore", and a finer-grained >scheme isn't going to touch those 2 mistakes; one of them is formally ham >because it was sent by a real person, but consists of a one-line comment >followed by a quote of an entire Nigerian scam spam -- nothing useful is >ever going to *call* that one ham, and it scores as spam *almost* as solidly >as an original Nigerian spam). Ah, but there are more considerations. First, many people's training sets may not be as distinct as yours, so the results might be more blurry. Second, future versions of the software might end up including other recognizers in the mix (for example, DNSBL, url heuristics, whitelists, stamping systems, etc), so adding a bit of flexibility at the start doesn't cost you anything, but could end up saving everyone a lot of work down the road. Since most existing mailreader filter schemes are relatively primitive, more than 10 levels of discrimination isn't going to be all that useful. But only 3 would seem to be to be too few. In a 1-9 scheme, the current 3 levels would map to (say), 2,5,8. It's just a syntactic difference, but it gives you precious wiggle room. >"Score" is my favorite, but isn't catching on. I believe the word "ham" for >"not spam" was my invention, and since that one caught on big, I'm not >fighting to the death for any others . Hey, why quit when you're on a roll? > >> * Hashing to a 32-bit token is very fast, saves a ton of memory, >> and the number of collisions using the python hash (I appealed for hash >> functions on the hackers-l and Guido was kind enough to send me the >> source) is low. About 1100 collisions out of 3.3 million unique >> tokens on a training set I was using. 
> >That's significantly better than you could expect from a truly random hash >function, so is fishy. Tossing 3.3M balls into 2**32 buckets at random >should leave 3298733 buckets occupied on average, with an sdev of 35.58 >buckets. Getting 1100 collisions is about 4.7 sdevs fewer than the random >mean. I may have gotten the # of tokens wrong. Currently my test runs are using 3.3M tokens but it may have been fewer when I was doing the hash tests. Maybe 2.3-2.4M tokens at that time? Anyway, thanks for the info about the relative merits of CRC32 and the Python hash; I'd been told CRC32 was bad and so was really surprised when it was marginally better. >Since we're sticking to unigrams, we don't have an insane database burden. >We also (by default) limit ourselves to looking at no more than 150 words >per msg. So I'm not sure saving some bytes of string storage is "worth it" >for us, and it's very nice that we can get back the exact list of words that >went into computing a score later. A pile of hash codes wouldn't give the >same loving effect . Well, unless I'm missing something, you've got to keep track of every token you've ever seen, and you've got to look up every token you encounter to determine if it's significant enough to consider in the final calc. If so, assuming the final calc isn't exponential, reducing the lookup time/resources can be a big win performance-wise. Note that since you have the text of the token before you hash it, you can keep that around for significant tokens and display it later. The only reason to hash is for speed of access to the probability data. The cost of the hashing is the inevitable collisions, which blur the probabilities for colliding tokens. >Except I didn't get good enough results from his approach to justify >pursuing it here, even leaving the hash codes at the full 32 bits. 
When I >went on to squash them to fit in a million buckets, a few false positives >popped up that were just too bad to bear (two can be found in the list >archives): ham that was so obviously ham that no system that called them >spam would be acceptable to most people. I wasn't commenting on the phrase system, or even hashing, but rather on data reduction to reduce the memory footprint required of the statistical tables (ie: using 1 byte frequency counts vs. 4 byte ones). Also, a cautionary note: just because the current system doesn't generate any horrible false positives on your corpii doesn't mean it won't do so on Joe Schmoe's. Or my slightly smelly ham. > > * I was playing a week or two back with 1 and 2 token groups, and >> found that a useful technique was, for each new token, to only >> consider the most deviant result. So if the individual word was .99 >> spam, and the two word phrase was .95, it would only consider the .99 >> result. This would probably help with Bill Y's combinatorial scheme. > >It could be a viable approach to the problem mentioned above: a scheme to >suck out more than one word that doesn't systematically generate mounds of >nearly redundant (highly correlated) clues. We're clearly missing info by >never looking at bigrams (or beyond) now, and that continues to bother me >(even if it doesn't seem to be bothering the error rates ). Right; and, related to the metagame, you've got to consider responses by the spammers. The initial attempt to defeat these kind of recognizers is going to try and exploit cancellation disease, probably by having a spammy preamble and a very hammy postscript. So one possible approach would be to gradually degrade the significance of a token the further along in the email it is (both during training and recognition). But of course, then you'll have to watch for html email that loads the front of the message with invisible ham. So a parser that spits out only the tokens a human is going to see is indicated. 
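The "most deviant result" idea from earlier in this message (for each new token, compare the single-word probability with the two-word-phrase probability and keep only the one further from neutral) can be sketched as follows. This is a minimal illustration, not code from the project: `select_clues` and `most_deviant` are made-up names, and `spamprob` is assumed to be a plain dict mapping a word or a space-joined two-word phrase to a spam probability.

```python
def most_deviant(unigram_prob, bigram_prob):
    """Return whichever probability lies further from the neutral 0.5."""
    if bigram_prob is None:
        return unigram_prob
    if abs(bigram_prob - 0.5) > abs(unigram_prob - 0.5):
        return bigram_prob
    return unigram_prob


def select_clues(tokens, spamprob):
    """Emit one clue per token position.

    `spamprob` is a hypothetical dict: word or "word1 word2" -> probability.
    Tokens and phrases missing from the dict contribute nothing.
    """
    clues = []
    prev = None
    for tok in tokens:
        uni = spamprob.get(tok)
        bi = spamprob.get(prev + " " + tok) if prev is not None else None
        prev = tok
        if uni is None and bi is None:
            continue  # never seen in training: no clue at this position
        if uni is None:
            clues.append(bi)
        else:
            clues.append(most_deviant(uni, bi))
    return clues


probs = {"hello": 0.2, "free": 0.99, "money": 0.6, "free money": 0.95}
print(select_clues(["hello", "free", "money"], probs))
```

Because only one clue survives per position, a word and the phrase it belongs to are never both counted, which is the point of the technique: it avoids feeding the combiner pairs of highly correlated clues.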
> > * My personal bias (as I think Guido mentioned) is for a multifaceted >> approach, using Bayesian, rules-based (attacking things that bayesian >> isn't good at, like looking for obfuscated url structures), DNSBL, >> and whitelisting heuristics to generate an overall ranking. So a >> hammy mail from a guy in your address book would bubble up to highest >> priority, whereas something spammy from him would stay neutral. > >I'm not sure we really need it. For example, *lots* of spam has been >discussed on this mailing list, so much so that the python.org email admin >had to castrate SpamAssassin for msgs to this list address else it kept >blocking ordinary list traffic. My personal email classifier never calls >anything here spam, though, nor does it call the originals of the spams >posted here ham. Beware the One True Path. There is strength in diversity. Or, as the noted philosopher D. Vader put it, "Don't be too proud of this technological terror you have created." As you will recall, those rebel scum managed to craft a nasty false positive. > >I do worry a little about obsfuscated HTML. We strip almost all HTML tags >by default for a reason I've harped on enough : all HTML decorations >have very high spamprobs, and counting more than one of them as "a clue" >fools almost every combining scheme into believing the msg containing them >is spam (if you know a msg contains both
[HTML tag] and [HTML tag], it's not really more >likely to be spam than if you just know it contains [HTML tag]
!). So we blind the >classifier to HTML decorations now. > >But a spam I forwarded here a week or so ago exploited that: the spam was >interleaved with size=1 white-on-white news stories and tech mailing list >postings. The classifier *did* see those, but didn't see the HTML >decorations hiding them. This was a cancellation-disease-by-construction >kind of msg, and chi-combining scored it near 0.5 as a result (solidly >Unsure). It's the only spam of that kind I've seen so far; if it becomes a >popular techinque, we'll have to take more HTML blinders off the classifier. That's a classic example of metagaming. Seems to me, the strength of the spambayes recognizer is in recognizing the semantics (the spammy meaning of the message), not the syntactics. So train it only on what a human would see reading the message. Have another recognizer (either rules-based, bayesian, whatever works) that deals with the syntactics, and picks up on the html decoration tricks. In other words, one that looks at what the message says, and another that looks at how it is presented. This will prevent that particular kind of simple cancellation attacks. And that wraps back to the "more responses" suggestion above. How do you rate a hammy message with spammy html ornaments? Might not "a little hammy" be a better response than "beat's me, boss!"? > >> There's lots of room for cooperation between the various approaches >> and multiple agents means its less likely that a spam will get by. >> In particular, whitelisting heuristics can almost eliminate false >> positives. > >I'll let you know if I ever see one . You will. And it will be the one email that you really, really needed to read. Murphy's Law guarantees that it will happen. In fact, it typically happens (in my painful personal experience) soon after you make comments like the above. >Getting vast quantities of spam isn't a problem anymore, but getting vast >quantities of ham is. 
Since your spammy ham is presumably business-related, >I assume you can't share it. Or can you? Probably not. Unless I could process them and just give you the tokens and frequencies in some useable format. I'll see what I can do next week, gotta get python up and running along on my Mac. Also gotta get the battlebot finished or my kids will hurt me. > Mixing spam and ham from >different sources also causes worlds of problems (indeed, we still (by >default) ignore most of the header lines partly for that reason, else the >system gets great results for bogus reasons). I do the same, I'm currently just looking at the subject line. At 12:09 PM +0100 11/10/02, Rob Hooft wrote: >I think our very good experience with the bayesian classifier would >"forbid" to use whitelisting. Once a whitelisted feature "leaks" >into the spam community, it will be useless. Not if the whitelist heuristics are based on the individual user's environment, as opposed to global features. >But there is a bayesian solution to it: Make the tokenizer recognize >the feature that you want to whitelist or blacklist, and emit a new >token to that effect. > > From: --> Will have a low spamprob > url:numeric-host --> Will have a high spamprob While this is a useful approach, there is (IMHO) a need for users to be able to override, or at least modulate, the bayesian results in certain circumstances. The classic example would be your boss forwarding a 419 scam to you with the comment "Looks good, I'm going to invest in this, what do you think?". The spamminess might overwhelm the low spamprob From: A (paranoid) user needs to be able to tell the system "I don't care how spammy an email looks, if it's got this feature, I've got to at least glance at it with the Mk.1 Eyeball Recognition System". 
Note that this doesn't mean that it should be declared "clean as the driven snow", just "might not be a pile of decomposing lunchmeat" Yeah, this means that every spam going into Microsoft will eventually be from "billg@microsoft.com", but the consequences of this might be interesting. Or at least, amusing. best,R -- Woodhead's Law: "The further you are from your server, the more likely it is to crash." From tim.one@comcast.net Mon Nov 11 00:59:05 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 19:59:05 -0500 Subject: [Spambayes] More experiments with weaktest.py In-Reply-To: <3DCE4D02.6060907@hooft.net> Message-ID: [Tim, notes that his mistake-only training works in the order msgs come in] [Rob Hooft] > I toyed with the idea, but that would involve parsing all messages once > before starting, and sorting them on date. Putting them in a set to > "randomize" the order is much easier, so I was lazy. That's fine. For purposes of comparing this against previous tests, I expect it's even good, since they were randomized too. > ... > Hm. That sounds so enthousiastic that I just might commit what I have > gone through this night. You did, and I thank you! Note that there were already three Simplex pkgs linked from http://www.python.org/topics/scicomp/numbercrunching.html but I know how much fun it is write such stuff again . > Some more info: > > * No, I have not used a "Simulated Annealing" or "Threshold Accepting" > yet. Please keep in mind that each step in the optimization takes > between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my > work PC). This would be way too costly. Just minimization it will be. Understood. > * I tried to use "Simplex optimization" (let a multidimensional > triangle walk through phase space) on the "Total cost" parameter. > This was simply disastrous. Phase space consists of plateau regions > that are exactly flat, joined by huge ridges. 
Think about that one
>   spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
>   bang to the cost. This field is impossible to optimize.

Yes, it's a sum of step functions in the end, and at every point "the
derivative" is either 0 or infinite, depending on where you are and which
direction you look. Making a new "smooth" cost measure was thoroughly
appropriate:

> * I designed a new "Flex cost" field. That one does away with the
>   "unsure cost". The cost of a message is 0.0 at its own cutoff, and
>   increases linearly towards its "false" cost at the other cutoff,
>   and increases further to the other end. Hm. Unreadable.

The code is clear enough, though. What I didn't understand is why each
term in the flexcost is divided by the difference between the (fixed per
run) cutoff levels: / (SPC - HC). That seems to systematically penalize,
e.g., ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and
spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8).
In the limit, if someone wanted a binary classifier (ham_cutoff ==
spam_cutoff), any mistake would be charged an infinite penalty.

> A table:
>
>     Score    Spam with this    Ham with this
>              score costs       score costs
>     0.00       $ 1.29            $ 0.00

It's hard to see where that comes from. Assuming ham_cutoff is 0.2 and
spam_cutoff 0.9, and so a spam scoring 0.0 works out to
$1 * (.9-0.0)/(.9-.2) ?

>     0.20       $ 1.00            $ 0.00
>     0.55       $ 0.50            $ 5.00
>     0.90       $ 0.00            $10.00
>     1.00       $ 0.00            $11.43
>
>   This field is much more smooth than the total cost field, so I was
>   hoping that pure minimization will do. Obviously, the flex cost is
>   much, much higher than the total cost because unsures are so much
>   more expensive. The flex cost field will also be less sensitive to
>   the {sp|h}am_cutoff parameters than the total cost field, because
>   there are no sudden cost jumps.

Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a
DivideByZero exception, so it's sure sensitive there .
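The flex cost can be reconstructed from the quoted table and the guessed formula above. The sketch below is reverse-engineered from those numbers only and is not the code in weaktest.py: the function name, the $1 false-negative and $10 false-positive costs, and the 0.2/0.9 cutoffs are assumptions read off the table.

```python
def flex_cost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9,
              fp_cost=10.0, fn_cost=1.0):
    """Linear cost: 0 at the message's own cutoff, the full "false" cost
    at the other cutoff, and still growing beyond it.

    All defaults are assumptions inferred from the quoted table.
    """
    width = spam_cutoff - ham_cutoff
    if width <= 0:
        # The degenerate binary-classifier case: the division blows up.
        raise ValueError("spam_cutoff must be greater than ham_cutoff")
    if is_spam:
        # A spam costs more the further it scores below spam_cutoff.
        return fn_cost * max(0.0, spam_cutoff - score) / width
    # A ham costs more the further it scores above ham_cutoff.
    return fp_cost * max(0.0, score - ham_cutoff) / width


for s in (0.00, 0.20, 0.55, 0.90, 1.00):
    print("%.2f  spam $%5.2f  ham $%5.2f"
          % (s, flex_cost(s, True), flex_cost(s, False)))
```

With these assumed parameters the loop reproduces the five rows of the table, and the `width` divisor makes the penalty for narrow cutoff gaps easy to see: halving the gap doubles every term.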
I suspect it might work better if the "/(SPC-HC)" business were simply removed? > * Results are not great I need to experiment more before reporting > on them. > * I just committed: > weaktest.py: introduction of the flexcost measure > optimize.py: simplex optimization (needs Numeric python; sorry) > weakloop.py: run weaktest.py repeatedly under simplex optimization I've been running weakloop.py over two sets of my c.l.py data while typing this. That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total msgs. It's been thru the whole business about 25 times now. At the start, Trained on 88 ham and 66 spam fp: 0 fn: 0 Total cost: $30.80 Flex cost: $212.3120 x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31 It's having a hard time doing better than that. The best so far seems to be Trained on 82 ham and 66 spam fp: 0 fn: 0 Total cost: $29.60 Flex cost: $200.0924 x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09 which is so close to the starting point that it's hard to believe it's finding something "real". It *does* seem to be in a nasty local minimum, though, as the next attempt was: Trained on 118 ham and 69 spam fp: 1 fn: 0 Total cost: $47.20 Flex cost: $344.7334 x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73 I'm afraid it looks like it's eventually going to converge on the most delicate possible settings that barely manage to avoid that 1 FP. From tim.one@comcast.net Mon Nov 11 01:17:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 10 Nov 2002 20:17:46 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <3DCE50FC.3050005@hooft.net> Message-ID: [Tim] >> ... my primary interest here is to see how bad things can get if >> a user takes mistake-based training to an extreme. Despite that >> it's heavily hapax-driven, it appears to do very well when judged by >> error rate. [Rob Hooft] > Hm. There are so little fp/fn's relative to unsures (at least after 30 > messages initial training), that it wouldn't matter much (I think). 
As I tried to explain later, the psychological impact of the Unsures isn't
attractive, though -- they remain bizarre to human eyes. When I got up
today, I got 6 new Unsure spam: human growth hormone, gay porn, life
insurance, mortgage rates, a msg that made no sense (empty except for a
Yahoo auto-generated sig), and Genuine Leather Jackets. It's not picking up
on general "this is advertising" clues, or even on general "this is gay
porn" clues. Indeed, "XXX" is still a hapax!

This particular HGH spam will never get through again, because training on
it found 80(!) hapaxes unique to it. It's not going to do much to stop
other HGH spam, though -- this one was especially chatty, and added words
like 'forget', 'hair', 'lose', 'lost' and 'anywhere' to the collection of
(what are now, after training on it) spam hapaxes -- just as previous HGH
spam trained on didn't stop this one. To my eyes, I had already told it
about HGH spam, and I'm irked that it showed me another one. Ditto gay
porn, ditto life insurance, etc.

[on database growth as a function of # of msgs]
> Hm, it is more like a sqrt after more messages. See attached image which
> has a sqrt X axis. The fit fits the data even at the lowest end.

Cool! That was a dramatic graph indeed. Soon there will be no mysteries
remaining .

From tim.one@comcast.net Mon Nov 11 02:00:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 21:00:20 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To: Message-ID:

[Tim]
> The original names made more sense when we had half a dozen competing
> schemes.
>
>     Current                           Proposed
>     -------                           --------
>     robinson_probability_x            unknown_word_prob
>     robinson_probability_s            unknown_word_strength
>     robinson_minimum_prob_strength    minimum_prob_strength

This renaming has been done. It should have no effect on pickles or
databases (i.e., no need to retrain).
From anthony@interlink.com.au Mon Nov 11 02:22:26 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 11 Nov 2002 13:22:26 +1100 Subject: [Spambayes] helping push the ham score for "nigeria" higher. Message-ID: <200211110222.gAB2MQB11817@localhost.localdomain> apologies for the marginal relevance, but it entertained me :) http://news.bbc.co.uk/1/hi/world/africa/2423283.stm "I am writing to you in the hope that you are under god and well. My naming is Professor Isoun Turner, and I am having hope you can assist. We are having a communications sattelite worth $15 millon US dollars that needs to be launched, but we need to find an international launch pad" From tim.one@comcast.net Mon Nov 11 05:42:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 00:42:51 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Robert Woodhead] > ... > Heh, call me old fashioned, but I actually like to know how things > work, rather than relying on black magic. ;^) You'll like this code, then! We hate "mystery knobs", and everything has a purpose. A purpose may not make sense, but at least it has one. > ... > Indeed. And you also have to start worrying about the metagame; > assuming your system goes into widespread deployment, what will the > intelligent spammer (oxymoron) responses be? I expect to get rich by selling spammer software to defeat this latest round of classifiers, so it's not that I can't tell you what their responses will be, it's that I don't want to reveal trade secrets . Indeed, if there are technically savvy spammers, they're subscribed to this list (and others like it). > ... > Ah, but there are more considerations. First, many people's training > sets may not be as distinct as yours, so the results might be more > blurry. 
Of all the things this project has done I find lacking in other projects, this is the part I think gives this project its clearest advantage: we have a statistically sound testing framework, more than one person testing on more than one corpus, people are beat up for running sloppy tests, and major algorithm improvements have been vetted by many here on their own data, and publicly reported results.. Winners survived and losers got purged from the codebase, and no single test corpus ruled that. Even for people with a single test corpus, the testing framework slices-and-dices it into multiple runs, so that results specific to a quirk of one subset can't be mistaken for "the truth". The project's TESTING.txt talks more about this. My tech mailing-list data turned out to be easier than most peoples', seemingly because almost all forms of advertising, and of HTML, are despised on tech mailing lists. But I've got other, harder test data too, and at least one person here (hi, Anthony!) has a flatly horrid corpus. > Second, future versions of the software might end up including other > recognizers in the mix (for example, DNSBL, url heuristics, whitelists, > stamping systems, etc), so adding a bit of flexibility at the start > doesn't cost you anything, but could end up saving everyone a lot of > work down the road. We'll define a stable API for accessing this system. If people want to combine it with other systems, that's fine, and Python excels at playing nice with other systems. If someone wants to add, e.g., a DNSBL gimmick to *this* codebase, they should write a new module to do so. I don't want fundamentally different approaches mixed into one module, let alone one function. > Since most existing mailreader filter schemes are relatively primitive, > more than 10 levels of discrimination isn't going to be all that useful. > But only 3 would seem to be to be too few. In a 1-9 scheme, the > current 3 levels would map to (say), 2,5,8. 
Let me clarify: I don't object to defining a billion levels, the problem is that I've seen no evidence that the algorithm in use here *can* provide more than 3 meaningful levels. chi-combining usually gives extreme scores. The median spam score is (to 6 significant digits) 1.0; the median ham score is on the order of 1e-10. The difference between, e.g., 1e-20 and 1e-5 appears meaningless, despite that it's 15 orders of magnitude. When chi doesn't give an extreme score, it tends to give one near 0.5, and which side of 0.5 it lies on doesn't appear to have strong correlation with whether a thing is ham or spam. The system is saying "I'm lost!" then, and it is. In effect, it's a 1-bit classifier but with a very useful middle ground. That it only gives about 1 bit of info follows from that the underlying math is a statistical accept/reject test (a two-outcome decision). Well, it's actually two accept/reject tests under the covers (one for ham, one for spam), and that's where the middle ground comes from (they both accept or both reject). If we were to call our middle ground 5, what good would that do anyone else? It doesn't mean we judge the odds of a msg being spam at 1 in 2. It means we have no idea. It certainly doesn't mean what, e.g., a 5 coming out of SpamAssassin means. "Unsure" means what it says. If, in the future, a new and better algorithm comes along with 6 meaningful digits, then I expect a new X- header would be defined to report it. > It's just a syntactic difference, but it gives you precious wiggle room. I'll leave more on this to people adding headers (the client I'm using doesn't use headers, but does attach integer score (in 0-100) metadata to msgs). [on hash collisions] > ... > I may have gotten the # of tokens wrong. Currently my test runs are > using 3.3M tokens but it may have been fewer when I was doing the > hash tests. Maybe 2.3-2.4M tokens at that time? 
Anyway, thanks for
> the info about the relative merits of CRC32 and the Python hash; I'd
> been told CRC32 was bad and so was really surprised when it was
> marginally better.

Hard to say. Neither CRC32 nor Python's string hash makes any effort toward
being "cryptographically secure", and Python's string hash is in fact and
deliberately "better than random" in some common cases:

    >>> hash('x1')
    739453787
    >>> hash('x2')
    739453784
    >>> hash('x3')
    739453785
    >>> hash('x4')
    739453790
    >>>

That is, it's very regular in a way that most often yields fewer 32-bit
collisions than a truly random hash function would yield when fed input
strings with regularities. That eventually breaks down if you throw enough
strings at it -- but it doesn't get "worse than random" then either, so far
as it's ever been pushed.

> ...
> Well, unless I'm missing something, you've got to keep track of every
> token you've ever seen,

So far we have, but there's slow-motion work in progress on database
pruning.

> and you've got to look up every token you encounter to determine if
> it's significant enough to consider in the final calc.

Yes, and that will probably always be true.

> If so, assuming the final calc isn't exponential, reducing the lookup
> time/resources can be a big win performance-wise.

I don't believe so. When using a Python dict as "the database", the time
for scoring a msg is minor compared to the time taken by parsing and
tokenization, and especially compared to the time just to get the msg
*into* the system (whether that's file I/O, or socket I/O, or some email
pkg's programming API, or whatever -- that part is the bottleneck when
using a dict; when not using a dict, database access time may become a
burden, and most databases in use here require string keys even if you're
working with ints -- the database user has to convert the hash code to a
string! Other databases (like ZODB) could use ints directly as keys, but
they're rare.).
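The truly-random-hash baseline quoted earlier in the thread (3.3M tokens tossed into 2**32 buckets, 1100 observed collisions) can be rechecked with a few lines. This is only a sketch: the function name is made up, and the standard deviation uses a Poisson approximation (sdev is roughly the square root of the expected collision count), which is an assumption of this sketch and lands near, not exactly on, the quoted 35.58.

```python
import math


def random_hash_baseline(n_tokens, n_buckets=2 ** 32):
    """Expected occupied buckets, expected collisions, and an approximate
    standard deviation for a truly random hash.

    The sdev is a Poisson approximation, not the exact occupancy variance.
    """
    # log of the probability that a given bucket stays empty
    log_p_empty = n_tokens * math.log1p(-1.0 / n_buckets)
    # expm1 keeps precision when the occupied fraction is tiny
    expected_occupied = n_buckets * -math.expm1(log_p_empty)
    expected_collisions = n_tokens - expected_occupied
    approx_sdev = math.sqrt(expected_collisions)
    return expected_occupied, expected_collisions, approx_sdev


occupied, collisions, sdev = random_hash_baseline(3300000)
z = (collisions - 1100) / sdev
print("occupied ~ %.1f, collisions ~ %.1f, sdev ~ %.2f, z ~ %.2f"
      % (occupied, collisions, sdev, z))
```

Rounded, the occupied-bucket count agrees with the 3298733 figure, and 1100 collisions comes out roughly 4.7 standard deviations below the random mean, matching the earlier estimate.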
> Note that since you have the text of the token before you hash it, > you can keep that around for significant tokens and display it later. Good point! I had overlooked that indeed. > The only reason to hash is for speed of access to the probability > data. Feel free to experiment; as above, I don't have reason to suspect that switching to hash codes would speed anything here, except for Jeremy's ZODB database (which could switch to using an IOBTree, which is zippier than an OOBTree). > The cost of the hashing is the inevitable collisions, which > blur the probabilities for colliding tokens. Another cost is obscuring the code. > ... > I wasn't commenting on the phrase system, or even hashing, but rather > on data reduction to reduce the memory footprint required of the > statistical tables (ie: using 1 byte frequency counts vs. 4 byte > ones). Ours are actually unbounded, but I don't have any problem with the memory footprint now. Others do. It seems more fruitful at this point to concentrate on ways to reduce the # of tokens, rather than the size burden per token. BTW, see the neil*.py files for how one person here builds a lean scoring-only CDB database -- you can store things any way you like, provided that the database access function is fiddled to convert to what the classifier expects to use. I don't believe such conversion is a significant time burden, but I haven't run the CDB variant and so haven't timed it (Neil, do you have gripes about memory or time? Spit 'em out.). > Also, a cautionary note: just because the current system doesn't > generate any horrible false positives on your corpii doesn't mean it > won't do so on Joe Schmoe's. Or my slightly smelly ham. Sure, but I'm a realist: any non-trivial scheme has a non-zero FP rate. That's life. What users choose to do about that isn't for this project to dictate. It is our responsibility to say up-front that there will be false positives, and we do say so. > ... 
> Right; and, related to the metagame, you've got to consider responses > by the spammers. The initial attempt to defeat these kind of > recognizers is going to try and exploit cancellation disease, > probably by having a spammy preamble and a very hammy postscript. They can't really defeat this scheme that way. At best they can hope to push msgs into Unsure territory. What constitutes "very hammy" is a function of each user's database here, and no generic blob of text is going to score high for hamminess everywhere. The spam in question happened to include a news story about the DC-area snipers, and that was very hammy for *me* because I live in that area and many friends and relatives had corresponded about the snipers (including forwarding the text of that very news story, as if we were suffering a news blackout here ). Even so, the message ended up as Unsure for me, not as Ham. That's to the credit of chi-combining, which is very good about knowing when it's confused. > So one possible approach would be to gradually degrade the > significance of a token the further along in the email it is (both > during training and recognition). I think there is reason to believe that spammers have to get your attention early. OTOH, many pieces of incriminating evidence also live at the end of spams ("this is not spam!" blurbs, the explanation that you got this because you're on an opt-in list run by one of their "partners", references to various state and federal bills, the "unsubscribe me" URL slash address harverster, etc). The white-on-white spam I mentioned before had hammish stuff at the start, and at the end, and between each pair of paragraphs. > But of course, then you'll have to watch for html email that loads the > front of the message with invisible ham. So a parser that spits out > only the tokens a human is going to see is indicated. Yup. Guido suggested that at the start, but that level of HTML analysis gets a lot more expensive too. We'll see. 
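The behaviour of chi-combining described in this message (extreme scores when the clues agree, something near 0.5 when hammy and spammy clues cancel) follows from running two chi-squared accept/reject tests and combining them. The sketch below is a simplified illustration under stated assumptions: `chi_combine` is a made-up name, the clue probabilities are passed in directly, and clue selection and the underflow protection a real classifier needs for long messages are left out.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-clue spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Fisher's method, applied once for spamminess and once for hamminess.
    s_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    h_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(s_stat, 2 * n)   # near 1 when the msg looks spammy
    H = 1.0 - chi2Q(h_stat, 2 * n)   # near 1 when the msg looks hammy
    return (S - H + 1.0) / 2.0


print(chi_combine([0.99] * 20))                 # all clues spammy
print(chi_combine([0.01] * 20))                 # all clues hammy
print(chi_combine([0.99] * 10 + [0.01] * 10))   # clues cancel
```

In the third call both tests fire at once, S and H are nearly equal, and the score lands at about 0.5. That is the "Unsure" outcome a cancellation attack can reach at best, rather than a ham score.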
BTW, on large tests this system scores about 80 msgs/second on my box, including everything (system time, training, I/O, parsing, tokenizing, scoring, reporting, recording, and analyzing results -- this is # of msgs divided by elapsed wall-clock time). We could afford to get slower, if necessary. > ... > Beware the One True Path. There is strength in diversity. Let a thousand classifiers bloom. If someone here wants to volunteer the effort to try a different approach, that's always been welcome. But the results have been so good sticking to one basic approach that I don't see that happening. We ended up doing one thing exceedingly well, and that's a contribution to diversity too, of a kind you may be undervaluing . > Or, as the noted philosopher D. Vader put it, "Don't be too proud of > this technological terror you have created." As you will recall, > those rebel scum managed to craft a nasty false positive. I don't view an FP as being as costly as needing to build a new Death Star. For goodness sake, this is email we're talking about -- anyone trusting a truly critical msg to email is dreaming to begin with. > ... > That's a classic example of metagaming. Seems to me, the strength of > the spambayes recognizer is in recognizing the semantics (the spammy > meaning of the message), not the syntactics. Well, it's got no semantic knowledge at all. It doesn't even know which language a msg is written in, let alone what it means, and has no concept of "word" beyond "stuff that appears between whitespace". It's very much focused on purely local lexical structure. > So train it only on what a human would see reading the message. We get a lot of value out of mining a handful of header lines. We also get a lot of value out of tokenizing embedded "invisible" URLs. The theme here is that we tokenize "what works", and that's driven by measured error rates; philosophy doesn't enter into that part. 
> Have another recognizer (either rules-based, bayesian, whatever works) > that deals with the syntactics, and picks up on the html decoration > tricks. In other words, one that looks at what the message says, and > another that looks at how it is presented. This will prevent that > particular kind of simple cancellation attacks. A rule-based system seems more effective to me too against that particular gimmick. Also against viruses. > And that wraps back to the "more responses" suggestion above. How do > you rate a hammy message with spammy html ornaments? Might not "a > little hammy" be a better response than "beat's me, boss!"? I have no real idea, but fear that presuming "yes" is presuming a lot of intelligence that systems parsing this header won't actually have. The fancier the rating scheme the fancier they have to be too. In the end, the user has to decide what to do about everything that's not called ham, no matter how many or few the non-ham categories. As a user myself, I've got no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats me". That already gives two categories I have to check, and that's enough. I do find it useful that my client can sort on the score metadata, and there are proposals here too to add fancier header lines beyond the basic spam/ham/unsure one. [on FPs] > You will. Of course I will. > And it will be the one email that you really, really needed to read. It doesn't matter -- I review all my spam. Other people won't, and so it goes. > Murphy's Law guarantees that it will happen. In fact, it typically > happens (in my painful personal experience) soon after you make > comments like the above. You realize you're overselling badly here, right ? > ... > I do the same, I'm currently just looking at the subject line. Look at tokenize_headers() in tokenizer.py for a number of other corpus-independent header lines that proved useful to tokenize. 
Surprising but true: we can get a very good classifier by looking at this handful of header lines alone. Or by looking at the body alone. Looking at both takes longer . > ... > While this is a useful approach, there is (IMHO) a need for users to > be able to override, or at least modulate, the bayesian results in > certain circumstances. The classic example would be your boss > forwarding a 419 scam to you with the comment "Looks good, I'm going > to invest in this, what do you think?". The spamminess might > overwhelm the low spamprob From: This is akin to my "entire Nigerian scam quote" FP, and it's all but certain that the spam content would overwhelm the brief "from the boss" clues. OTOH, if my boss didn't wait for my reply and went ahead and invested anyway, the subsequent financial disgrace would open the door for me to take his job. After all, he relied on me for advice, so who more logical to succeed him? two-winners-and-only-one-loser-ly y'rs - tim From popiel@wolfskeep.com Mon Nov 11 06:11:25 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 10 Nov 2002 22:11:25 -0800 Subject: [Spambayes] More experiments with weaktest.py In-Reply-To: Message from Tim Peters References: Message-ID: <20021111061126.211B5F4CD@cashew.wolfskeep.com> In message: Tim Peters writes: > >I've been running weakloop.py over two sets of my c.l.py data while typing I've now run weakloop.py over three sets of my private data; that's 3*200 ham and 3*200 spam, for a total of 1200 messages. 
The best few it came up with were: Trained on 39 ham and 61 spam fp: 4 fn: 3 Total cost: $61.60 Flex cost: $189.7713 x=0.5040 p=0.1040 s=0.4400 sc=0.902 hc=0.204 189.77 Trained on 38 ham and 61 spam fp: 4 fn: 2 Total cost: $60.60 Flex cost: $189.9767 x=0.5060 p=0.1060 s=0.4300 sc=0.903 hc=0.206 189.98 Trained on 37 ham and 61 spam fp: 4 fn: 2 Total cost: $60.40 Flex cost: $189.2842 x=0.5054 p=0.0980 s=0.4436 sc=0.905 hc=0.209 189.28 Trained on 37 ham and 61 spam fp: 4 fn: 2 Total cost: $60.40 Flex cost: $189.8255 x=0.5033 p=0.0981 s=0.4456 sc=0.903 hc=0.206 189.83 Trained on 37 ham and 61 spam fp: 4 fn: 2 Total cost: $60.40 Flex cost: $189.8260 x=0.5026 p=0.1000 s=0.4458 sc=0.902 hc=0.207 189.83 There were a few where it trained on a couple more or less ham and spam... but I had to go hunting for them. I find it quite interesting that my ham:spam training ratio here (about 2:3, about where all my ratio tests have been pointing as a sweet spot) is significantly different than that reported by others (which has been much closer to 1:1 or favoring more ham than spam). I guess my corpus really is unusual. FWIW, I'm running it again with all 10 of my sets (4000 messages total) overnight. - Alex From popiel@wolfskeep.com Fri Nov 8 00:06:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 07 Nov 2002 16:06:27 -0800 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message from "Tim Peters" References: Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[Anthony Baxter] >> Note that "random sample" is not as trivial as all that, either - if >> you have a very high ham:spam ratio in your training DB, your accuracy >> will suffer (see the tests from Alex, myself and others). > >I still need to try to make sense of those tests. A real complication is >that more than one thing changes when trying to test ratios: it's not just >the ratio that changes, it's the absolute number of each trained on too. True. 
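The "Total cost" figures in the weakloop.py output above are consistent with charging $10 per false positive, $1 per false negative and $0.20 per unsure, provided every trained-on message is counted as exactly one of those three. These weights are inferred from the numbers posted in this thread, not read from weaktest.py, and the function name total_cost is hypothetical. A rough check against the first result (39 ham + 61 spam trained, fp 4, fn 3):

```python
def total_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Sum of per-message charges; the default weights are inferred, not official."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

trained = 39 + 61           # "Trained on 39 ham and 61 spam"
fp, fn = 4, 3               # "fp: 4 fn: 3"
unsure = trained - fp - fn  # assumption: everything else trained on was an unsure
cost = total_cost(fp, fn, unsure)
print("unsure=%d cost=$%.2f" % (unsure, cost))  # unsure=93 cost=$61.60
```

If the inference is right, almost all of the total cost in these runs comes from unsures and false positives, with false negatives contributing very little.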
>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham >and 10000 spam. The ratios are identical. Do we expect the error rates to >be identical too? I don't, but haven't tried it. I have tried this, and the effects of ratio were diminished as the training set size increased. For details, see http://www.wolfskeep.com/~popiel/spambayes/ratio2 . The tests were done with gary-combining, not chi-square, so I really ought to rerun them. >I expect the latter would do better than the former, despite the identical >ratios, simply because more msgs allow better spamprob estimates. It depended on what the ratio in question was... for 1:4 ham:spam, increased training set size hurt instead of helped, in the ranges that I was able to test. For 1:1, increased training helped instead of hurt. >Something missing in "the ratio tests" is a rationale (even an >after-the-fact one) for believing there's some aspect of the system that's >sensitive to the ratio. The combining method certainly is not, and the >spamprob estimation (update_probabilities()) deliberately works with >percentages instead of raw counts so that the ham::spam training ratio >has no direct effect on the spamprobs calculated. Eh, I have a perfectly good rationale for believing that something is sensitive to the ratio: the tests I've run show such a sensitivity. What's missing is a theory on _why_ there's a sensitivity. ;-) I don't think the following theory is perfectly phrased, but it seems plausible to me: Perhaps the number of topics discussed in ham is greater than that in spam. Thus, the average percentage of ham messages containing a particular significant ham word is systematically lower than the average probability of a particular significant spam word appearing in spam messages. As the training set size increases, the percentage difference becomes more consistent and pronounced. Since we're then combining the percentages, we systematically skew slightly due to the differing averages.
Changing the ratio of ham to spam has the effect of changing the number of topics discussed, particularly when the training set size is small and random chance can exclude all instances of a given topic. Balancing the number of topics removes the skew in the probabilities. As training set size increases, adjusting the ratio has less effect, because it has less likelihood of eliminating topics of discussion. I think that would account for my data. >The total # of spam training msgs does limit how high a spamprob can get, >and the total # of ham training msgs limits how low. The *suspicion* I had >running my large c.l.py test is that it wasn't the ratio that mattered so >much as the absolute number, and that the error rates didn't "settle down" >to the 4th digit until I got near 10,000 spam total. I suspect that by the time the corpora got that large, adjusting the training ratio wouldn't make a lick of difference if the corpora were sampled randomly to achieve the given ratio. There would just be too little chance of excluding a topic from the samples. Systematically excluding a topic might produce equivalent results to my ratio tests. - Alex From richie@entrian.com Fri Nov 8 00:17:25 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 08 Nov 2002 00:17:25 +0000 Subject: [Spambayes] SMTP proxy questions Message-ID: [Me] > Also on my list is to commit Tim Stone's SMTP proxy code, possibly after > integrating it with the pop3proxy (but I need to discuss that with you, > Tim, after looking in more detail at the code, hopefully tonight). I've discussed this with Tim S, and he's going off the SMTP proxy idea while I'm still broadly in favour of it. What do people think - do non-Outlook users want to forward messages to 'spam' and 'ham' to train the system, or use an HTML UI?
The most difficult problem for retraining-by-forwarding is matching the forwarded message to one from the cache, after Outlook Express has stripped the headers, top-quoted the user's .sig, converted it to HTML and added fifteen macro viruses. Any ideas? Can the tokeniser help? Or perhaps there's another way. The only other option I'd thought of was to add two hyperlinks to the end of the message, "This is spam" and "This is ham" (in ways that would work for both HTML and plain-text messages, in both HTML and plain-text email clients). They'd link to the HTML interface and tell it the cache ID of the message. Adding content to emails is way more intrusive (and difficult) than adding headers. But no more intrusive than the .sig that mailman adds. -- Richie Hindle richie@entrian.com From anthony@interlink.com.au Fri Nov 8 00:30:09 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 08 Nov 2002 11:30:09 +1100 Subject: [Spambayes] SMTP proxy questions In-Reply-To: Message-ID: <200211080030.gA80UAf11390@localhost.localdomain> > I've discussed this with Tim S, and he's going off the SMTP proxy idea > while I'm still broadly in favour of it. What do people think - do > non-Outlook users want to forward messages to 'spam' and 'ham' to train the > system, or use an HTML UI? I'd have to say I don't like the idea. There's too many potential places where it can all go horribly horribly pear-shaped, and too many rat-holes that the various email clients can screw up with. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From rob@hooft.net Mon Nov 11 09:12:57 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Mon, 11 Nov 2002 10:12:57 +0100 Subject: [Spambayes] More experiments with weaktest.py References: Message-ID: <3DCF7499.6030705@hooft.net> Tim Peters wrote: > [Rob Hooft] >>... >>Hm. That sounds so enthusiastic that I just might commit what I have >>gone through this night. > > > You did, and I thank you!
Note that there were already three Simplex pkgs > linked from > > http://www.python.org/topics/scicomp/numbercrunching.html > > but I know how much fun it is to write such stuff again . Yeah, but on the other hand, all those people didn't have access to my module when they wrote theirs, because it wasn't publicized ;-) [Let me add that my optimize code dates from late 1997] >> * I designed a new "Flex cost" field. That one does away with the >> "unsure cost". The cost of a message is 0.0 at its own cutoff, and >> increases linearly towards its "false" cost at the other cutoff, >> and increases further to the other end. Hm. Unreadable. > > > The code is clear enough, though. What I didn't understand is why each term > in the flexcost is divided by the difference between the (fixed per run) > cutoff levels: / (SPC - HC). That seems to systematically penalize, e.g., > ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and > spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8). > In the limit, if someone wanted a binary classifier (ham_cutoff == > spam_cutoff), any mistake would be charged an infinite penalty. You're right. > > >>A table: >> >> Score Spam with this Ham with this >> score costs score costs >> 0.00 $ 1.29 $ 0.00 > > > It's hard to see where that comes from. Assuming ham_cutoff is 0.2 and > spam_cutoff 0.9, and so a spam scoring 0.0 works out to $1 * > (.9-0.0)/(.9-.2) ? Yes. > >> 0.20 $ 1.00 $ 0.00 >> 0.55 $ 0.50 $ 5.00 >> 0.90 $ 0.00 $10.00 >> 1.00 $ 0.00 $11.43 But you're right that it would be better to make:

  Score   Spam with this   Ham with this
          score costs      score costs
  0.00    $ 1.00           $ 0.00
  0.20    $ 1.00           $ 0.00
  0.55    $ 0.50           $ 5.00
  0.90    $ 0.00           $10.00
  1.00    $ 0.00           $10.00

i.e. both functions consist of 3 linear segments rather than 2. > Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a > DivideByZero exception, so it's sure sensitive there .
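The revised three-segment table above can be reproduced with a small function. This is only a sketch of the table, not the code that was committed: the name flex_cost, the default cutoffs (0.2 and 0.9, taken from the worked example), the $10 and $1 "false" costs read off the table, and the decision to raise an error when the cutoffs coincide are all assumptions.

```python
def flex_cost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9,
              fp_cost=10.0, fn_cost=1.0):
    """Cost of one message: 0 at its own cutoff, rising linearly to the
    full "false" cost at the other cutoff, and flat beyond either end."""
    span = spam_cutoff - ham_cutoff
    if span <= 0:
        # Assumption: refuse degenerate cutoffs instead of dividing by zero.
        raise ValueError("spam_cutoff must be greater than ham_cutoff")
    if is_spam:
        # Spam costs more the further it scores below spam_cutoff.
        fraction = (spam_cutoff - score) / span
        full = fn_cost
    else:
        # Ham costs more the further it scores above ham_cutoff.
        fraction = (score - ham_cutoff) / span
        full = fp_cost
    # Clamping to [0, 1] gives the three linear segments.
    return full * min(1.0, max(0.0, fraction))

for s in (0.00, 0.20, 0.55, 0.90, 1.00):
    print("%.2f  spam $%5.2f  ham $%5.2f"
          % (s, flex_cost(s, True), flex_cost(s, False)))
```

Dropping the clamp gives back the two-segment version, where a spam scoring 0.00 costs more than the nominal $1.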
I suspect it > might work better if the "/(SPC-HC)" business were simply removed? That would no longer satisfy the constraints I put in. > I've been running weakloop.py over two sets of my c.l.py data while typing > this. That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total > msgs. It's been thru the whole business about 25 times now. At the start, > > Trained on 88 ham and 66 spam > fp: 0 fn: 0 > Total cost: $30.80 > Flex cost: $212.3120 > x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31 > > It's having a hard time doing better than that. The best so far seems to be > > Trained on 82 ham and 66 spam > fp: 0 fn: 0 > Total cost: $29.60 > Flex cost: $200.0924 > x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09 > > which is so close to the starting point that it's hard to believe it's > finding something "real". It *does* seem to be in a nasty local minimum, > though, as the next attempt was: > > Trained on 118 ham and 69 spam > fp: 1 fn: 0 > Total cost: $47.20 > Flex cost: $344.7334 > x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73 > > I'm afraid it looks like it's eventually going to converge on the most > delicate possible settings that barely manage to avoid that 1 FP. This is exactly what I found so far, even with my complete data set. It is too delicate to work. Now this could be due to 2 things: 1. The flexcost is still causing lots of false minima 2. The weaktest is causing lots of false minima I suspect the latter, because it contains lots of "yes/no" decisions that may tumble the other way with minimal changes in the parameters. My conclusion is to stop this, and try the optimization on something like timtest.py but with the flexcost as target function. Or maybe change weaktest such that it trains on all messages in the process. That would simulate the "optimal" strategy of a user that has to start from nothing. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From msergeant@startechgroup.co.uk Mon Nov 11 09:49:38 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 11 Nov 2002 09:49:38 +0000 Subject: [Spambayes] Introducing myself References: Message-ID: <3DCF7D32.4090209@startechgroup.co.uk> Robert Woodhead said the following on 10/11/02 00:32: > * My personal bias (as I think Guido mentioned) is for a multifaceted > approach, using Bayesian, rules-based (attacking things that bayesian > isn't good at, like looking for obfuscated url structures), DNSBL, > and whitelisting heuristics to generate an overall ranking. So a > hammy mail from a guy in your address book would bubble up to highest > priority, whereas something spammy from him would stay neutral. > There's lots of room for cooperation between the various approaches > and multiple agents means its less likely that a spam will get by. > In particular, whitelisting heuristics can almost eliminate false > positives. That's the approach SpamAssassin now takes, fwiw (including the bayesian stuff). All done in 2.50 CVS. > * Finally, if anyone needs more spam, I get over 300 a day (I've been > around a while!) and have a cleaned corpus of over 130MB of spam and > foreign email. Also, given all the legit web-marketing email I get > because of the url registration work I've done, I've got tons of the > spammiest ham you could imagine. I'm always looking for more corpuses. Stick the data on an FTP/HTTP server somewhere (password protect if you need to). Or contact me privately if that's not possible. Matt. From papaDoc@videotron.ca Mon Nov 11 13:03:40 2002 From: papaDoc@videotron.ca (papaDoc) Date: Mon, 11 Nov 2002 08:03:40 -0500 Subject: [Spambayes] Outlook plugin - training References: Message-ID: <3DCFAAAC.4020807@videotron.ca> Hi, Can someone define what is an hapaxe ! 
>Scores remain grossly hapax-driven, but that's actually enough to classify >most of my email correctly: a small number of subjects and senders and >mailing lists overwhelmingly dominate my ham mix, and one email account >accounts for the vast bulk of my spam. Removing the hapaxes from the >database dropped the # of words from 5500 to about 1700. Rescoring the >inbox with this reduced database then pushed about 5% of the msgs back into >Unsure. > >So (no surprise here) hapaxes are vital with little training data. That >also means that as soon as one of those words shows up in the other kind of >email, it changes from a strong clue to neutral, *provided that* I actually >train on the new email. I'm not training now unless there's a >mistake/unsure, so the hapaxes remain strong clues (even when they point in >the wrong direction). BTW, when there are mistakes/unsures, I'm not >training on all of them: as I did when I got up, I train the worst example >then rescore, one at a time, until no mistakes/unsures remain. > > papaDoc P.S. Someday I will contribute to the code but first I need to learn python. > > From bkc@murkworks.com Mon Nov 11 13:22:53 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 11 Nov 2002 08:22:53 -0500 Subject: [Spambayes] Exchange integration Message-ID: <3DCF67F7.16091.91EB9C8@localhost> Just musing here hoping someone can jump in with good comments. -- I'm thinking about running spambayes inside Exchange 5.5 (or Exchange 2000). At first, I thought I'd use the Event service, but in 5.5 it's async and MS even says "don't use this to filter all your messages". In Exchange 2000 apparently there's a synchronous event service, but I don't have Exchange 2000. So it looks like I need to create some kind of MAPI hook or preprocessor or mailbox assistant.. I'm not sure which. Anyone know? And, can I do this all in Python via COM or do I need some "real" C to hook in? Finally, why does MS make it so hard to find the info you want?
Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From Paul.Moore@atosorigin.com Mon Nov 11 14:24:24 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Mon, 11 Nov 2002 14:24:24 -0000 Subject: [Spambayes] Some more experiences with the Outlook plugin Message-ID: <16E1010E4581B049ABC51D4975CEDB88619933@UKDCX001.uk.int.atosorigin.com> I've now had the Outlook plugin running for about a week, and I'm starting to get a feel for using it. The following is my "user interface" experience. It's a slightly unrealistic combination of "what I actually did" and "what I realised afterwards I should have done", but it is what I would use as notes telling a new user how to set the system up, and as such it picks up on a few interesting issues: 1. To start with, configure the plugin to define one "Spam" folder and one "Unsure" folder, and define all other folders as "Ham". [1] 2. Train the classifier on whatever you have available. This will usually be massively overbalanced in favour of ham (few people collect their spam) but it *will* make a start. [2] 3. Run with this for a while, incrementally training on mistakes and unsures. Keep all of the spam! 4. Periodically, retrain the full database on all the collected ham and spam. Notes: [1] I got this wrong at the start - the key point to stress here is that *everything* that isn't spam is ham - by definition. Trying to "help" the classifier by telling it to ignore messages which you "know" are ham is actually detrimental - if you know, let the classifier find out! [2] I'm getting pretty good results now (but see below), with 5661 ham and 303 spam, but even with under 100 spam (admittedly with less ham, as I made the "exclude some ham" mistake) I was getting visible benefits. Other points: * The collection I end up with is still biased - there are a lot of ham messages which I just read and delete, and they are probably somehow "similar". 
While I could retain these, this would require a much more significant change to my way of working. * Results still seem to be pretty much hapax based (if I understand the term and its usage). Looking at the clues for a message often shows some pretty bizarre tokens showing up as *either* sort of clue. (One message showed 'yet' as a ham clue with a probability of 0.000877364!) * Following on from this, I also see Tim's behaviour of surprising unsure cases (or worse, false negatives!). Worst case recently was a message which scored as solid ham. I trained on it as "Spam", and rescored it. It still scored 5 - solid ham. My immediate reaction was "But I just *told* you it's spam!". I know that isn't how the classifier works, but even so it was unsettling. FWIW, I attach the spam clues for this one (I don't know if they make any sense in isolation, but it can't hurt...) * I don't know how long it will be before I start grudging the use of disk space to store spam. At that point, the nasty question of whether I keep it, or risk being unable to recreate my database, becomes important. I need to look at how to get some more information out of the classifier, to try to understand how much of the good results I see are down to luck (hapaxes, I guess - which makes me think of "happy accidents" rather than its real meaning...) and hence is fragile, and how much is actually solid. Can anyone point me at the right part of the code to read to find this? Paul. ------------------------------- Clues for that message I mentioned Spam Score: 0.0531684 '*H*' 1 '*S*' 0.106337 '(and' 0.00044603 'looking' 0.000489716 'added' 0.000613999 'work,' 0.00120032 'group' 0.00138504 'saying' 0.00196592 'possible,' 0.00254381 'up,' 0.00260266 'said,' 0.00306331 'thing.' 0.00372208 'first.' 0.00420954 'but,' 0.00530035 'posting' 0.00585176 'number.' 0.00600801 'exist.' 0.00617284 'enough.' 0.00634697 'mind.'
0.0065312 'skip:= 70' 0.00738916 'negative' 0.00764007 'links,' 0.00764007 'month.' 0.00884086 'info,' 0.0100223 'value,' 0.0104895 'tends' 0.0110024 'hook' 0.0110024 'them?' 0.0136778 'continues' 0.0145631 'large,' 0.0167286 'invite' 0.0196507 'to:2**1' 0.0234065 'experiences' 0.0238095 'submitting' 0.0238095 'answering' 0.0266272 'cost,' 0.0266272 'listen' 0.0302013 'chuck' 0.0412844 'this.' 0.0478427 'club' 0.0505618 'there:' 0.0505618 'agree' 0.0621736 'confirm' 0.0689037 'to:' 0.069921 'kind' 0.0736897 'but' 0.079886 'resident' 0.0918367 'intended' 0.0939539 'might' 0.104747 'create' 0.113245 'otherwise' 0.116335 'soon' 0.117769 "there's" 0.120957 'actually' 0.121474 'had' 0.123378 'skip:u 10' 0.134221 'having' 0.134288 'done' 0.135374 'there' 0.142693 'still' 0.147177 'doing' 0.150927 'going' 0.151564 'sweet' 0.155172 'ads,' 0.155172 'insult' 0.155172 'subject:COMPUTER' 0.155172 'does' 0.158656 'they' 0.16113 'need' 0.163861 'week,' 0.164396 'blank' 0.168753 'pass' 0.173864 'thing' 0.182002 'also' 0.193781 'work' 0.194529 'based' 0.199869 "don't" 0.209846 'same' 0.211326 'different' 0.214242 'just' 0.215541 'expect' 0.215606 'result' 0.217003 'them' 0.219267 'can' 0.220053 "that's" 0.221224 'meaning' 0.221593 'have' 0.22283 'put' 0.228286 'after' 0.23066 'each' 0.231785 'then' 0.237896 'check' 0.240837 "what's" 0.241936 'it.' 0.24197 'been' 0.241975 'most' 0.246523 "we'll" 0.250462 'opportunity' 0.754514 'luck,' 0.765135 'e-mail' 0.768274 'p.s.' 0.769646 'computer,' 0.770262 'address' 0.771157 'wish' 0.776725 'increase' 0.781917 '"this' 0.793163 'spam' 0.794529 'unknowingly' 0.798255 'continuing' 0.801283 'line.' 0.805659 'header:Return-Path:1' 0.80863 'url:com' 0.82533 'member,' 0.826136 'incredibly' 0.826136 'offering' 0.826485 'removal' 0.837909 'membership.' 0.844828 'home-based' 0.844828 'cash.'
0.844828 'list,' 0.847218 'washington' 0.851451 'compliance' 0.862362 '10-12' 0.86708 'site,' 0.872105 'sincerely,' 0.873162 'intelligence' 0.877181 'emails' 0.879528 'money' 0.895718 'companies' 0.898816 'hearing' 0.900058 'money.' 0.900837 'opportunity,' 0.908163 'reward' 0.908163 'classified' 0.913944 'skip:& 10' 0.914039 'obligation' 0.925726 'internet.' 0.92719 'header:Received:7' 0.933081 'ability.' 0.934783 'screening' 0.934783 '"this' 0.949438 'header:MiME-Version:1' 0.949438 'consumer' 0.950295 'e-mail,' 0.955625 'income' 0.955625 'residents' 0.958716 'washington,' 0.958716 'marketing' 0.959139 'opt-in' 0.966259 'subject:your' 0.973253 'click' 0.974006 '"remove"' 0.985437 From popiel@wolfskeep.com Mon Nov 11 14:50:13 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 11 Nov 2002 06:50:13 -0800 Subject: [Spambayes] Outlook plugin - training In-Reply-To: Message from papaDoc of "Mon, 11 Nov 2002 08:03:40 EST." <3DCFAAAC.4020807@videotron.ca> References: <3DCFAAAC.4020807@videotron.ca> Message-ID: <20021111145013.5AEBFF58B@cashew.wolfskeep.com> In message: <3DCFAAAC.4020807@videotron.ca> papaDoc writes: > >Can someone define what is an hapaxe ! Sure. Merriam Webster says: hapax legomenon: noun: 1. a word or form occurring only once in a document or corpus plural: hapax legomena >P.S. Someday I will contribute to the code but first I need to learn python. There's a lot of ways to contribute (testing, documentation, etc.) without knowing the language, if you're interested... - Alex From popiel@wolfskeep.com Mon Nov 11 14:55:11 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 11 Nov 2002 06:55:11 -0800 Subject: [Spambayes] Exchange integration In-Reply-To: Message from "Brad Clements" of "Mon, 11 Nov 2002 08:22:53 EST."
<3DCF67F7.16091.91EB9C8@localhost> References: <3DCF67F7.16091.91EB9C8@localhost> Message-ID: <20021111145511.5DC84F58B@cashew.wolfskeep.com> In message: <3DCF67F7.16091.91EB9C8@localhost> "Brad Clements" writes: > >Finally, why does MS make it so hard to find the info you want? A bit off topic ;-), but they just have a _LOT_ of information, much of it written by people trained to dumb-down the tech so that it's acceptable to the masses. The good stuff (for techies like us) is drowned out in a sea of end-user docs, and the indexing tools don't know how to rate the stuff by technical thoroughness. MS isn't _trying_ to make stuff hard to find... it's just that by trying to make it accessible for _everyone_, they make it difficult for anyone to find stuff at an appropriate level. - Alex (not normally an MS apologist...) From tim.one@comcast.net Mon Nov 11 16:48:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 11:48:27 -0500 Subject: [Spambayes] Outlook plugin - training In-Reply-To: <3DCD8B34.6040903@hooft.net> Message-ID: [Rob Hooft] > ... > It may not even be very realistic to training on fp's, as I think in my > private E-mail I won't even check the spam folder very thoroughly at all. FYI, here's my base weaktest run: Total messages 6800 (4000 ham and 2800 spam) Total unsure (including 30 startup messages): 124 (1.8%) Trained on 57 ham and 68 spam fp: 1 fn: 0 Total cost: $34.80 Flex cost: $193.3770 Here's the same thing, but even weaker, fiddling the code *not* to train on false positives (so the only ham ever trained on is however much appeared in the first 30 startup msgs, and later Unsure ham): Total messages 6800 (4000 ham and 2800 spam) Total unsure (including 30 startup messages): 123 (1.8%) Trained on 57 ham and 66 spam fp: 1 fn: 0 Total cost: $34.60 Flex cost: $199.3106 And one more time, not only not training on FP, but starting with an empty database (no startup msgs). 
Total messages 6800 (4000 ham and 2800 spam) Total unsure (NO startup messages): 123 (1.8%) Trained on 57 ham and 67 spam fp: 4 fn: 1 Total cost: $65.60 Flex cost: $174.5831 All four FP were among the first 30. Since even my sisters could be talked into training on 10 msgs at the start: Total messages 6800 (4000 ham and 2800 spam) Total unsure (10 startup messages): 115 (1.7%) Trained on 50 ham and 66 spam fp: 0 fn: 1 Total cost: $24.00 Flex cost: $124.9315 Now for another extreme: after 10 startup msgs, the system trains itself on its own decisions, except that: 1. Unsures are correctly classified by the user. 2. False negatives are correctly classified by the user. But false positives are trained on *as spam*, assuming the user never looks at their spam folder. That takes a long time to run, because update_probabilities() is called after every msg. After 2,100 msgs, 2100 trained:1181H+919S wrds:59659 fp:0 fn:0 unsure:26 and the unsures are growing very slowly now (at 1400 msgs there were 25 unsures). So one more twist: as above (train on self-decisions, but spam below spam_cutoff is corrected by the user, and FP gets trained on as spam), but only update probabilities for each of the first 50 msgs, and every 50th msg thereafter: at 2,100 msgs, it was up to 29 unsure. At the end, Total messages 6800 (4000 ham and 2800 spam) Total unsure (10 startup messages): 48 (0.7%) Trained on 4000 ham and 2800 spam fp: 0 fn: 0 Total cost: $9.60 Flex cost: $104.3355 It would have been more interesting had there been an FP, eh? One conclusion is that, so far as error rates go, on this data it doesn't much matter how training is done, but by any cost measure lots of training is better than little (due to unsures). From nas@python.ca Mon Nov 11 17:33:40 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 11 Nov 2002 09:33:40 -0800 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: <20021111173340.GA22411@glacier.arctrix.com> Tim Peters wrote: > [...] 
I haven't run the CDB variant and so haven't timed it > (Neil, do you have gripes about memory or time? Spit 'em out.). It works fine for me on the old 200 Mhz machine I use for a mail server. I retrain very rarely so I don't care if it takes a bit extra time to rebuild the DB. Neil From tim.one@comcast.net Mon Nov 11 17:37:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 12:37:27 -0500 Subject: [Spambayes] Exchange integration In-Reply-To: <3DCF67F7.16091.91EB9C8@localhost> Message-ID: [Brad Clements] > I'm thinking about running spambayes inside Exchange 5.5 (or > Exchange 2000). > > At first, I thought I'd use the Event service, but in 5.5 it's > async and MS even says "don't use this to filter all your messages". > > In Exchange 2000 apparently there's a synchronous event service, > but I don't have Exchange 2000. I've got no version of Exchange at all, and neither does Mark Hammond, so you're pretty much on your own wrt the people here. Jump in . > So it looks like I need to create some kind of MAPI hook or > preprocessor or mailbox assistant.. I'm not sure which. > > Anyone know? And, can I do this all in Python via COM or do I > need some "real C to hook in? Study the project's Outlook2000 directory. There's a quite sophisticated Outlook 2000 addin there, written in Python + MarkH's win32 extensions. It uses a mix of MAPI and the Outlook object model. It used to use CDO too, but I think Mark found ways to get rid of all that (CDO isn't installed by default for IMO Outlook installs, so it required the user to dig out their Office CD and install CDO first). > Finally, why does MS make it so hard to find the info you want? They don't -- they only make it hard to find the info *you* want. You must have done something to piss off Bill. Beyond that, MAPI is a massive and excruciatingly low-level API. The MSDN SDK MAPI docs are extensive, and so are web resources trying to make sense of it all (e.g., expect to spend a lot of time staring at ). 
From trebor@animeigo.com Mon Nov 11 18:10:19 2002 From: trebor@animeigo.com (Robert Woodhead) Date: Mon, 11 Nov 2002 13:10:19 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: > > If so, assuming the final calc isn't exponential, reducing the lookup >> time/resources can be a big win performance-wise. > >I don't believe so. When using a Python dict as "the database", the time >for scoring a msg is minor compared to the time taken by parsing and >tokenization, and especially compared to the time just to get the msg *into* >the system (whether that's file I/O, or socket I/O, or some email pkg's >programming API, or whatever -- that part is the bottleneck when using a >dict; when not using a dict, database access time may become a burden, and >most databases in use here require string keys even if you're working with >ints -- the database user has to convert the hash code to a string! Other >databases (like ZODB) could use ints directly as keys, but they're rare.). Oh, I'd roll my own, probably using an in-memory hash table scheme. If you're hashing to a nice, randomly distributed 32-bit key, you'd effectively take the database out of the equation. I think most of the reason I lean this way is that I'm thinking about actual implementations (as opposed to testing), and with bayesian, you want to do this as close to each individual user as possible (right in the mailreader, via a plugin). It seems to me that you're at the point where testing the effects of data reduction techniques would be fruitful. Once I get up and running on the code (just paid the tithe to O'Reilly) I'll test it out. One thing that occurred to me: now that you have something that seems to work pretty well, have you considered backtracking on particular features to see how much they contribute; for example, going to a trivial state machine parser to spit out tokens? 
> >> Note that since you have the text of the token before you hash it, >> you can keep that around for significant tokens and display it later. > >Good point! I had overlooked that indeed. Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!") have lots of tricks. We don't so much write code as remember it and retype it. > > The cost of the hashing is the inevitable collisions, which >> blur the probabilities for colliding tokens. > >Another cost is obscuring the code. Not really; it doesn't really matter what the format of a token coming out of the parser is, does it? You might need an extra data structure to take care of the hashed token/string token correspondences but you only need touch that at the end of the parser and in the diagnostic output. >They can't really defeat this scheme that way. At best they can hope to >push msgs into Unsure territory. That is good enough, because it means the human has to look at it. Which is what spammers want to have happen. > What constitutes "very hammy" is a >function of each user's database here, and no generic blob of text is going >to score high for hamminess everywhere. True; then it becomes a game of finding generic messages that are likely to evaluate as hammy enough to the average recognizer. And the meta-response is to send out multiple emails with differently tuned slices of ham. I hereby, btw, coin the term "Dagwood" (or perhaps it should be Wooddag?) to mean an email containing artfully sliced amounts of ham, spam, and html condiments. ;^) > > So one possible approach would be to gradually degrade the >> significance of a token the further along in the email it is (both >> during training and recognition). > >I think there is reason to believe that spammers have to get your attention >early. OTOH, many pieces of incriminating evidence also live at the end of >spams ("this is not spam!" 
blurbs, the explanation that you got this because >you're on an opt-in list run by one of their "partners", references to >various state and federal bills, the "unsubscribe me" URL slash address >harverster, etc). Might have to be a U-shaped function then. Or it may turn out that ignoring the stuff at the end doesn't cost much but reduces false positives on new (legit) mailing lists. I'm just throwing out ideas for possible tests. >Yup. Guido suggested that at the start, but that level of HTML analysis >gets a lot more expensive too. We'll see. Well, what you'd need is a hacked HTML renderer that output sets that look like (token,size,color,background) and ignored words that were too small or hard to read. > >BTW, on large tests this system scores about 80 msgs/second on my box, >including everything (system time, training, I/O, parsing, tokenizing, >scoring, reporting, recording, and analyzing results -- this is # of msgs >divided by elapsed wall-clock time). We could afford to get slower, if >necessary. And the machines will get faster. Eventually. > > Beware the One True Path. There is strength in diversity. > >Let a thousand classifiers bloom. If someone here wants to volunteer the >effort to try a different approach, that's always been welcome. But the >results have been so good sticking to one basic approach that I don't see >that happening. We ended up doing one thing exceedingly well, and that's a >contribution to diversity too, of a kind you may be undervaluing . I was somewhat teasing you. > >> Or, as the noted philosopher D. Vader put it, "Don't be too proud of >> this technological terror you have created." As you will recall, >> those rebel scum managed to craft a nasty false positive. > >I don't view an FP as being as costly as needing to build a new Death Star. >For goodness sake, this is email we're talking about -- anyone trusting a >truly critical msg to email is dreaming to begin with. Unfortunately, in the real world, this happens all too often. 
Keep in mind that the readers of this list are not the typical users of the resulting software techniques. >Well, it's got no semantic knowledge at all. It doesn't even know which >language a msg is written in, let alone what it means, and has no concept of >"word" beyond "stuff that appears between whitespace". It's very much >focused on purely local lexical structure. OK, I was being fuzzy in my use of semantics and syntactics. Mea Culpa. > > So train it only on what a human would see reading the message. > >We get a lot of value out of mining a handful of header lines. We also get >a lot of value out of tokenizing embedded "invisible" URLs. The theme here >is that we tokenize "what works", and that's driven by measured error rates; >philosophy doesn't enter into that part. Well, I'm thinking of the metagame. What are the spammer responses to a truly effective bayesian filter? Obviously, remove those features that are typical of spam. What features cannot be removed without making the spam useless as a commercial message? The actual words visible to the reader. This is what led me to decide, in my testing, to use a simple parser that extracted alphanumerics with a few permitted interior punctuation characters (like . and '), and which handled tokens with interior comments properly. An interesting test would be to train the system, then run a test with a parser that only outputs the simple tokens (simulating a spammer response) and see how well it does. >I have no real idea, but fear that presuming "yes" is presuming a lot of >intelligence that systems parsing this header won't actually have. The >fancier the rating scheme the fancier they have to be too. In the end, the >user has to decide what to do about everything that's not called ham, no >matter how many or few the non-ham categories. As a user myself, I've got >no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats >me". 
That already gives two categories I have to check, and that's enough. >I do find it useful that my client can sort on the score metadata, and there >are proposals here too to add fancier header lines beyond the basic >spam/ham/unsure one. Fair enough. Optional fancier header lines would do the job as well. > > Murphy's Law guarantees that it will happen. In fact, it typically >> happens (in my painful personal experience) soon after you make >> comments like the above. > >You realize you're overselling badly here, right ? If anything, the opposite. ! >This is akin to my "entire Nigerian scam quote" FP, and it's all but certain >that the spam content would overwhelm the brief "from the boss" clues. >OTOH, if my boss didn't wait for my reply and went ahead and invested >anyway, the subsequent financial disgrace would open the door for me to take >his job. After all, he relied on me for advice, so who more logical to >succeed him? Unfortunately, he invested your pension money. Ooops. ;^) R -- Woodhead's Law: "The further you are from your server, the more likely it is to crash." From trebor@animeigo.com Mon Nov 11 18:10:12 2002 From: trebor@animeigo.com (Robert Woodhead) Date: Mon, 11 Nov 2002 13:10:12 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: <3DCF7D32.4090209@startechgroup.co.uk> References: <3DCF7D32.4090209@startechgroup.co.uk> Message-ID: >I'm always looking for more corpuses. Stick the data on an FTP/HTTP >server somewhere (password protect if you need to). Or contact me >privately if that's not possible. Should be up by the time you read this, a 30M zipped file containing a Macintosh Eudora Mailbox of mixed english and foreign spam. Represents the last few months of receipts, but nothing more current than a couple of weeks ago. http://www.madoverlord.com/data/spam.zip Let me know if you have troubles grabbing it. -- Robert Woodhead, Webslave & Mad Overlord http://selfpromotion.com/ Located in the Hurricane capitol of the US, Wilmington NC. 
Lucky me! From db3l@fitlinxx.com Mon Nov 11 22:28:01 2002 From: db3l@fitlinxx.com (David Bolen) Date: 11 Nov 2002 17:28:01 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange References: Message-ID: Paul Moore writes: > That sounds like the best option. I haven't had a chance to check > Exchange yet, but with an IMAP store there are no "New mail" events > triggered when I start Outlook with new mail in the IMAP inbox. I'd > expect Exchange to be the same. (...) I'm based on an Exchange server, and yes, the behavior is the same - no events fire. I think Mark's in-progress approach of scanning for unread and unscanned messages on startup is reasonable. I'm not quite sure how the Outlook processes client rules on startup, but it does have the "feeling" that it simply starts an execution of the rules against Inbox, so the spambayes addin would just be following along. -- David From tim@fourstonesExpressions.com Mon Nov 11 22:35:29 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 11 Nov 2002 16:35:29 -0600 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: 11/11/2002 4:28:01 PM, David Bolen wrote: >Paul Moore writes: > >> That sounds like the best option. I haven't had a chance to check >> Exchange yet, but with an IMAP store there are no "New mail" events >> triggered when I start Outlook with new mail in the IMAP inbox. I'd >> expect Exchange to be the same. (...) > >I'm based on an Exchange server, and yes, the behavior is the same - >no events fire. I think Mark's in-progress approach of scanning for >unread and unscanned messages on startup is reasonable. I'm not quite >sure how the Outlook processes client rules on startup, but it does >have the "feeling" that it simply starts an execution of the rules >against Inbox, so the spambayes addin would just be following along. 
The whole problem I see with this is that µ$0pht could and most likely will screw all these machinations up with the next release of Outlook or Exchange... They have this great history of not caring if their api changes, or system behavior changes, are backward compatible. If we're having this level of difficulty now, get ready... :( - TimS > >-- David > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Mon Nov 11 23:24:09 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 18:24:09 -0500 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: <200211040950.gA49oU809201@localhost.localdomain> Message-ID: [Anthony Baxter Sent: Monday, November 04, 2002 4:51 AM ] > First experiment was to make the URL tokenizer look for the string > 'mailman' in the URL. If it was found, simple push the clue "url: > Mailman URL" onto the clue-pile. This was an attempt to remove the > many many related clues that get bolted onto the occasional spam that > makes it past Greg to the python.org mailservers. It's something of a > violation of "stupid beats smart", but I'd noticed that the mailman > footer from spam via mailman lists was always providing a bunch of > clues that were making life harder. Indeed they do. > > --- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60 > +++ tokenizer.py 4 Nov 2002 06:59:37 -0000 > @@ -931,6 +931,11 @@ > new_text.append(text[i : start]) > new_text.append(' ') > > + if guts.find('mailman') != -1: > + pushclue("url: Mailman URL") > + i = end > + break Can you try this again replacing "break" with "continue"? I can't believe you intended break here -- it means that the first time we see a Mailman URL in a msg, we stop looking for embedded URLs period. Spam could easily exploit that. 
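To make the break-versus-continue point concrete, here is a minimal sketch. It is not the project's tokenizer: the regex, the function name `url_clues`, and the `skip_rest_on_mailman` flag are all hypothetical stand-ins for the real URL-cracking loop, and the example URLs are made up.

```python
import re

# Hypothetical, simplified stand-in for the tokenizer's URL loop; the regex
# and every name below are illustrative, not the project's actual code.
URL_RE = re.compile(r"https?://\S+")


def url_clues(text, skip_rest_on_mailman):
    """Return the URL clues found in text, one entry per embedded URL."""
    clues = []
    for match in URL_RE.finditer(text):
        guts = match.group(0)
        if "mailman" in guts:
            clues.append("url: Mailman URL")
            if skip_rest_on_mailman:
                break      # the patch as posted: stop scanning entirely
            continue       # the suggested fix: skip only this one URL
        clues.append("url:" + guts.lower())
    return clues


# A Mailman footer URL followed by a spammy one.
msg = ("http://mail.python.org/mailman/listinfo/spambayes "
       "http://spam.example.com/buy-now")

print(url_clues(msg, skip_rest_on_mailman=True))
print(url_clues(msg, skip_rest_on_mailman=False))
```

With break, the second URL never produces a clue, so a spammer who leads with a Mailman-looking URL hides everything after it; with continue, the spammy URL is still tokenized.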
>> ham:spam: 11192:1826 >> 11192:1826 You realize you've got a very high ratio of ham to spam, right? > ... > Next I tried tokenizing the To: line. I parsed it properly, then > decoded the real name and split the words. I also added a token for > the RHS and LHS of the email @ sign. We don't tokenize To: now because it gives good results for bad reasons on mixed-source corpora. It would be good to have an option to tokenize it. It appears that your code also tokenized Cc:; also fine. I would rather see the code added to the loop currently cracking "from" lines: for field in ('from',): so that we tokenize all address thingies in a uniform way. The option would control the list of field names looped over there (default just from:, optionally also to: and cc:). > ... > The final test was to decode the Subject header if it's encoded, and > tokenize that, rather than in encoded. > > --- tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60 > +++ tokenizer.py 4 Nov 2002 09:45:25 -0000 > @@ -1071,6 +1078,10 @@ > # especially significant in this context. Experiment > showed a small > # but real benefit to keeping case intact in this > specific context. > x = msg.get('subject', '') > + # Subject decoding. > + x, subjcharset = email.Header.decode_header(x)[0] Why is this tokenizing only "the first" piece of the Subject line? > + if subjcharset is not None: > + yield 'subjectcharset:' + subjcharset > for w in subject_word_re.findall(x): > for t in tokenize_word(w): > yield 'subject:' + t I changed this to loop over all the Subject parts, and saw some minor good effects on marginal msgs, so I'll check this one in without further ado. It wasn't much of a win for you either, but it's cheap so why not. In my personal email "subjectcharset:unknown" shows up a lot for some reason (but only in spam). > My remaining 6 fns are: > > a brazilian spam-ish thing: (*H* 0.633859 *S* 0.20342 = 0.28478) > ...
> ----------------- > Received: from localhost (localhost.localdomain [127.0.0.1]) > by localhost.localdomain (8.11.6/8.11.6) with ESMTP id > g8RNZhh05864 > for ; Sat, 28 Sep 2002 09:35:44 +1000 > Received: from mail.interlink.com.au [203.9.111.130] > by localhost with POP3 (fetchmail-5.9.0) > for anthony@localhost (single-drop); Sat, 28 Sep 2002 > 09:35:44 +1000 (ES > T) > Received: from mediterraneo.rjnet.com.br (root@[200.152.115.30]) > by valdez.interlink.com.au (8.11.6/8.11.2) with ESMTP id > g8RNZJc28230 > for ; Sat, 28 Sep 2002 09:35:20 +1000 > Received: from locutus.rjnet.com.br (root@locutus.rjnet.com.br > [200.222.31.10]) > by mediterraneo.rjnet.com.br (8.11.4/8.11.4) with ESMTP > id g8RNNc801901; > Fri, 27 Sep 2002 20:23:38 -0300 > Received: from localhost ([200.222.39.21]) > by locutus.rjnet.com.br (8.11.2/8.11.2) with ESMTP id > g8RMqEN00464; > Fri, 27 Sep 2002 19:52:14 -0300 > DATA > ----------------- > I plan to try something like tokenizing the oldest three received > lines (to hopefully avoid the previous issues with mail.python.org > blowing numbers to hell) to see if that will help this one. Did you try that yet? I'm not replying in a timely fashion because I'm not interested, it's just because I'm 244 msgs behind on this mailing list alone now . > The "iron citadel" python-list spam > (*H* 0.999999, *S* 0.038123 = 0.01906) DAMNED good spam! > A base64d MP3 spam sent via zope-dev > (*H* 0.993904, *S* 0.187868 = 0.0969820429397) > which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and > also the various mailman type clues (although that's better with the > first patch, above) > > Someone spamming Linux CDs via a list at 4thought > (*H* 1, *S* 0.207177 = 0.103588442478) > > A short porn spam sent via python-list > (*H* 0.817004, *S* 0.618399 = 0.400697521022) > > A wierd german spam for some sort of expert systems (in english). 
> (*H* 0.997132, *S* 0.84965 = 0.426259133645) It's Weird that you have cutoffs arranged such that a number near .40 isn't Unsure for you. That may (or may not) be related to the lopsidedness of your data (> 6 ham per spam). From spambayes@djl.freeuk.com Mon Nov 11 23:54:27 2002 From: spambayes@djl.freeuk.com (David Leftley) Date: Mon, 11 Nov 2002 23:54:27 +0000 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: References: Message-ID: <76f0tu4o9e3a9p6ictc29kvv1u6bhict22@4ax.com> On 11 Nov 2002 17:28:01 -0500, David Bolen wrote: >I'm based on an Exchange server, and yes, the behavior is the same - >no events fire. I think Mark's in-progress approach of scanning for >unread and unscanned messages on startup is reasonable. With Exchange, though, it's not just on startup that the plugin doesn't notice new messages. I've only been playing with it for a couple of days, so I'm still not exactly sure in which circumstances it fails, but here's what I observed from today's mail: External e-mail was in every case processed immediately on arrival by the plugin. Internal e-mail (i.e. sent through Exchange) is never picked up immediately by the plugin. - In some cases these messages were classified when the next e-mail (whether external or internal) arrived. When this happened, the first message was also (annoyingly) marked as unread when it was classified. - in another case, the plugin classified an e-mail when I went to my Calendar and opened the details of an existing (from before I installed the plugin) meeting. - I replied to a couple of e-mails before further messages came in. The plugin never got around to classifying those messages. And, while I'm reporting the quirks of the Outlook plugin, I have 3 messages (out of my spam corpus of c. 2000) that the plugin refuses to classify. 
If I attempt to score the contents of a folder containing one of these messages, scoring simply stops at that point - the progress bar disappears, and the remaining messages are left unscored. Apart from those little things, though, this software rocks! Keep up the good work, guys! David. From tim.one@comcast.net Tue Nov 12 00:21:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 19:21:20 -0500 Subject: [Spambayes] Some more experiences with the Outlook plugin In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619933@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] You might want to play along with "the other" training strategy we're trying: last week I wiped my database and started over from scratch, training it *only* on mistakes and unsures. It's been thru a few thousand msgs since then, but so far I've trained it on only 51 ham and 55 spam. The Unsures are weird, but the Unsure rate is falling, and it makes very few outright mistakes now (BTW, I have ham_cutoff at 20 and spam_cutoff at 80 in the Outlook client). > ... > 1. To start with, configure the plugin to define one "Spam" folder and > one "Unsure" folder, and define all other folders as "Ham". [1] > [1] I got this wrong at the start - the key point to stress here is > that *everything* that isn't spam is ham - by definition. Trying > to "help" the classifier by telling it to ignore messages which > you "know" are ham is actually detrimental - if you know, let the > classifier find out! We don't have a way to train on a random sample now, and that's going to be a killer for some people (e.g., Sean True has 2 gigabytes of ham). > 2. Train the classifier on whatever you have available. This will > usually be massively overbalanced in favour of ham (few people > collect their spam) but it *will* make a start. 
[2] > [2] I'm getting pretty good results now (but see below), with 5661 > ham and 303 spam, but even with under 100 spam (admittedly with > less ham, as I made the "exclude some ham" mistake) I was getting > visible benefits. My guess is that you'd do better by striving for no more than a 3:1 imbalance in either direction. There are reasons to despise the "purely mistake-based training" described at the top, but it seems to keep the training sets in rough balance very naturally. > 3. Run with this for a while, incrementally training on mistakes and > unsures. Training on those is vital no matter what else you do. > Keep all of the spam! I'm afraid that one won't fly over time, except for researchers. And people boldly using unstable pre-alpha code. > 4. Periodically, retrain the full database on all the collected ham > and spam. That shouldn't be necessary when the code is complete and stable. > Notes: > > > > Other points: > > * The collection I end up with is still biased - there are a lot of > ham messages which I just read and delete, and they are probably > somehow "similar". While I could retain these, this would require > a much more significant change to my way of working. Keep working the way you like! The client should eventually be able to deduce what's ham by watching you throw away things without first calling them spam. > * Results still seem to be pretty much hapax based (if I understand the > term and its usage). Looking at the clues for a message often shows > some pretty bizarre tokens showing up as *either* sort of clue. (One > message showed 'yet' as a ham clue with a probability of 0.000877364!) Hapax means a word that appeared only once in your entire training corpus.
In the list you gave below, there are very few hapaxes (I recognize them from the probabilities; I should probably add code to the client to display the raw counts too): > 'sweet' 0.155172 these 4 appeared in one ham > 'ads,' 0.155172 > 'insult' 0.155172 > 'subject:COMPUTER' 0.155172 > 'membership.' 0.844828 these 3 appeared in one spam, > 'home-based' 0.844828 presumably itself since you > 'cash.' 0.844828 said you trained on it > * Following on from this, I also see Tim's behaviour of surprising > unsure cases (or worse, false negatives!). I expect for a very different reason, though: your 18:1 ham:spam imbalance. This implies words can get spamprobs much closer to 0 than they can get to 1. There's just not enough spam to *justify* spamprobs closer to 1 than there is enough ham to justify spamprobs closer to 0. Let's look at the 3 most extreme words on both ends of your listing: > '(and' 0.00044603 > 'looking' 0.000489716 > 'added' 0.000613999 > 'subject:your' 0.973253 > 'click' 0.974006 > '"remove"' 0.985437 '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1, and that makes the accidental appearance of a ham word in spam much more powerful than the systematic appearance of a spam word in spam. If you only had 300 ham in your training set, it would be much harder for a word to get a very low spamprob; contrarily, if you had 5500 spam in your training, it would be much easier for a word to get a very high spamprob. As is, your strong ham words are much more powerful than your strong spam words, and almost *must* be. Anthony Baxter here routinely runs with a ridiculous ham:spam ratio too, but you're even way beyond him (his is about 6:1). This brings out effects I've never seen before. > Worst case recently was a message which scored as solid ham. I > trained on it as "Spam", and rescored it. It still scored 5 - solid > ham. That's because you're *not* hapax-driven. If you were, the score would have shot up to 100 (maybe 99). 
All ham contains spam words, and my guess is you've got so much more ham than spam that it's drowning out the spam. That's picturesque but inaccurate. A more accurate speculation was given above. > My immediate reaction was "But I just *told* you it's spam!". I know > that isn't how the classifier works, but even so it was unsettling. > FWIW, I attach the spam clues for this one (I don't know if they make > any sense in isolation, but it can't hurt...) No more than what I copied above. If you like, send me the original (as an attachment), and I'll score it under my well-trained classifier (the one I parked last week when starting the mistake-only training experiment). That one was trained on about 2 thousand recent spam. If that works better for me than for you, then I'd like to try another experiment, shipping you just the stronger-than-hapax spam words from that classifier, along with a bit of code you can run to *merge* that into your own classifier. That would be an experiment in "seeding" a classifier, something we haven't gotten a good start on here yet. > * I don't know how long it will be before I start grudging the use of > disk space to store spam. At that point, the nasty question of > whether I keep it, or risk being unable to recreate my database, > becomes important. At 300 measly spam saved, I should remind you that a gigabyte of disk space costs less than the value of your time worrying about it. > I need to look at how to get some more information out of the > classifier, to try to understand how much of the good results I see > are down to luck (hapaxes, I guess - which makes me think of "happy > accidents" rather than its real meaning...) Cool! When hapaxes work, they *are* happy accidents! I like it. > and hence is fragile, and how much is actually solid. Can anyone point > me at the right part of the code to read to find this? classifier.py contains all the code for probability estimation and scoring.
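For readers who want to check the arithmetic behind the spamprobs quoted in this message, here is a rough sketch. It assumes the adjustment has the form (s*x + n*p) / (s + n) with s = 0.45 and x = 0.5; neither the formula nor those two values is stated in this thread. They are my assumption, picked because they reproduce the hapax figures 0.155172 and 0.844828 shown earlier. The function name and layout are illustrative only and are not taken from classifier.py.

```python
# Sketch only: S and X are ASSUMED values for the adjustment's "strength"
# and "unknown word" prior. They are not stated anywhere in this thread.
S = 0.45
X = 0.5


def spamprob(hamcount, spamcount, nham, nspam):
    """Adjusted spam probability for one token (illustrative formula)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)   # raw estimate from the counts
    n = hamcount + spamcount                 # amount of evidence for the token
    return (S * X + n * p) / (S + n)


# Training set sizes reported in the quoted message: 5661 ham, 303 spam.
NHAM, NSPAM = 5661, 303

# Hapaxes: a token seen in exactly one ham, and one seen in exactly one spam.
print(round(spamprob(1, 0, NHAM, NSPAM), 6))
print(round(spamprob(0, 1, NHAM, NSPAM), 6))

# The most extreme scores reachable: a token in every ham, or in every spam.
lowest = spamprob(NHAM, 0, NHAM, NSPAM)
highest = spamprob(0, NSPAM, NHAM, NSPAM)
print(lowest, 1 - highest)

# The "33 times closer" remark, from the two spamprobs quoted in the listing.
ratio = (1 - 0.985437) / 0.00044603
print(round(ratio))
```

Under these assumptions the two hapax lines print 0.155172 and 0.844828, matching the listing, and the ratio rounds to 33. The lowest reachable score sits far closer to 0 than the highest does to 1, purely because there are about 18 times as many ham as spam to back it up.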
From tim.one@comcast.net Tue Nov 12 00:30:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 19:30:01 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <76f0tu4o9e3a9p6ictc29kvv1u6bhict22@4ax.com> Message-ID: [David Leftley] > ... > And, while I'm reporting the quirks of the Outlook plugin, I have 3 > messages (out of my spam corpus of c. 2000) that the plugin refuses to > classify. If I attempt to score the contents of a folder containing > one of these messages, scoring simply stops at that point - the > progress bar disappears, and the remaining messages are left unscored. Next time that happens, bring up PythonWin and do Tools -> Trace Collector Debugging Tool. That will pop up a window showing diagnostic msgs and tracebacks produced by the Outlook client. You'll probably find something "interesting" near the end. Note that nobody who has done work on the client has any form of Exchange running, so diagnosis may not lead to a cure. Still, can't fix what nobody understands, so it will be a start. > Apart from those little things, though, this software rocks! Keep up > the good work, guys! Tell Redmond -- if they paid Mark to bust his balls on this, I bet he'd grow a new pair . From tim@fourstonesExpressions.com Tue Nov 12 00:49:52 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 11 Nov 2002 18:49:52 -0600 Subject: [Spambayes] Re: RE: [Spambayes-checkins] website docs.ht,1.3,1.4 In-Reply-To: Message-ID: 11/11/2002 6:40:44 PM, Tim Peters wrote: >> !

hapax, hapax legomenon
a word or form occuring only once in a >> ! document or corpus. (plural is hapax legomena) >> > >Ya, but even I'm not that anal -- I usually say hapaxes. hapaxora would be >a hoot too . Hapax driven, alternate defn: Typical mode of intra-gender communication, as in: "Husband: Beer" "Wife: No" "Husband: Now" "Wife: NOT!!!" "Husband: Please?" "Wife: Dreamer" "Husband: " "Wife: Idiot" "Husband: What?" "Wife: LISTEN!" - TimS etc. etc. > > >_______________________________________________ >Spambayes-checkins mailing list >Spambayes-checkins@python.org >http://mail.python.org/mailman/listinfo/spambayes-checkins > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Tue Nov 12 01:27:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 20:27:03 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Robert Woodhead] > ... > It seems to me that you're at the point where testing the effects of > data reduction techniques would be fruitful. Bootstrapping a classifier, connecting to a gazillion quirky email clients, and testing training strategies are all current high priorities. Saving memory wouldn't buy me anything in the Outlook client I'm using, or in the high-volume python.org application. But, as I said, other people are keener on that, and I expect that reducing the sheer number of tokens is a more effective approach (in part because it ties into effective training strategies over time -- the database will just keep growing (albeit at a slackening pace) without active pruning, and whether a token takes one byte or 50). > Once I get up and running on the code (just paid the tithe to O'Reilly) > I'll test it out. It's all yours . > One thing that occurred to me: now that you have something that seems > to work pretty well, have you considered backtracking on particular > features to see how much they contribute; for example, going to a > trivial state machine parser to spit out tokens? 
In theory, all prior decisions should be revisited after every change. I haven't done anything like that lately, though, in part because no previous "let's revisit this!" experiment ever paid off. Note that the bulk of the body tokenizer couldn't be simpler: 1. Convert to lowercase. 2. Split on whitespace. Well, we *could* skip #1, but previous experiments found that it didn't give better error rates but did increase the database size. It did change the *kinds* of errors, though, and in particular conference announcements had a hard time getting thru when case was preserved (they're trying to sell you a conference, and often SCREAM ABOUT IT). > ... > Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!") They had 6 or 9 when I was a lad, depending on how you set the control bit for the Univac 1108's 36-bit words. > have lots of tricks. We don't so much write code as remember it and > retype it. You don't want to bet on who's older here. > ... > Not really; it doesn't really matter what the format of a token > coming out of the parser is, does it? The classifier is happy with any immutable and hashable Python object, i.e. anything that can be used as a Python dict key. But people grafting various databases onto this have stronger requirements, and they're not always clear. As I mentioned last time, most "lightweight" databases require string keys, so any switch away from strings would break those systems. It's pre-alpha code, but still I'm not keen to rock anyone's boat unless there's a clear win in return. > ... > True; then it becomes a game of finding generic messages that are > likely to evaluate as hammy enough to the average recognizer. And > the meta-response is to send out multiple emails with differently > tuned slices of ham. They can try. Spam doesn't need to be stopped, though, it merely has to be made more costly to send than it brings back.
Last week Jeremy and Guido here both reported a *very* effective technique: spam was sent to them as replies to mailing-list postings (not this mailing list) they had made, including a full quote of the msg they had posted. That was guaranteed to have lots of ham words for them, and the Subject line was the expected "Re:" followed by their own subject line. I doubt they're going to get a response rate high enough to be able to afford this scheme over time, at least not on tech mailing lists. We'll see; if they can, it's going to be hard to beat. > I hereby, btw, coin the term "Dagwood" (or perhaps it should be > Wooddag?) to mean an email containing artfully sliced amounts of ham, > spam, and html condiments. ;^) Cool! Dagwood it is. > ... > Well, what you'd need is a hacked HTML renderer that output sets that > look like (token,size,color,background) and ignored words that were > too small or hard to read. Sure. I expect the quickest path would be to feed the source thru a text-only browser, and stare at its output. That seems mondo expensive, though. >> For goodness sake, this is email we're talking about -- anyone >> trusting a truly critical msg to email is dreaming to begin with. > Unfortunately, in the real world, this happens all too often. Keep > in mind that the readers of this list are not the typical users of > the resulting software techniques. I do, but it's still not my problem <0.5 wink>. All non-trivial systems have non-zero FP rates, and that's a fact of life. You're keen on whitelists, but they wouldn't do a thing to stop any of the false positives I've seen, and so on; a multitude of schemes may reduce the overall error rates if they're combined intelligently, but they're not going to reach an error rate of 0. Not even with human review (as has become obvious to everyone who's run a good system over their supposedly clean ham and spam collections). At some point, learning that Santa Claus isn't actually a white man is a part of growing up.
show-me-an-isp-that-guarantees-email-delivery-and-we'll-get- rich-shorting-its-stock-ly y'rs - tim From tim@fourstonesExpressions.com Tue Nov 12 01:32:18 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 11 Nov 2002 19:32:18 -0600 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: 11/11/2002 7:27:03 PM, Tim Peters wrote: >[Robert Woodhead] >> ... >> It seems to me that you're at the point where testing the effects of >> data reduction techniques would be fruitful. > >Bootstrapping a classifier, connecting to a gazillion quirky email clients, >and testing training strategies are all current high priorities. Saving >memory wouldn't buy me anything in the Outlook client I'm using, or in the >high-volume python.org application. But, as I said, other people are keener >on that, and I expect that reducing the sheer number of tokens is a more >effective approach (in part because it ties into effective training >strategies over time -- the database will just keep growing (albeit at a >slackening pace) without active pruning, and whether a token takes one byte >or 50). > >> Once I get up and running on the code (just paid the tithe to O'Reilly) >> I'll test it out. > >It's all yours . > >> One thing that occurred to me: now that you have something that seems >> to work pretty well, have you considered backtracking on particular >> features to see how much they contribute; for example, going to a >> trivial state machine parser to spit out tokens? > >In theory, all prior decisions should be revisited after every change. I >haven't done anything like that lately, though, in part because no previous >"let's revisit this!" experiment ever paid off. > >Note that the bulk of the body tokenizer couldn't be simpler: > >1. Convert to lowercase. >2. Split on whitespace. 
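A minimal sketch of the two steps quoted above, assuming plain-text input. The function name tokenize_body and its parameters are hypothetical, not the project's actual tokenizer, which does more than this. The three-character cutoff is taken from a later reply in this thread, and the fold_case switch is only there to illustrate the case-preserving experiment that was described.

```python
def tokenize_body(text, min_len=3, fold_case=True):
    """Yield whitespace-separated words, optionally lowercased.

    Words shorter than min_len characters are dropped.
    """
    if fold_case:
        text = text.lower()          # step 1: convert to lowercase
    for word in text.split():        # step 2: split on whitespace
        if len(word) >= min_len:
            yield word


folded = list(tokenize_body("Buy NOW at our CONFERENCE"))
preserved = list(tokenize_body("Buy NOW at our CONFERENCE", fold_case=False))
spaced = list(tokenize_body("c o n v e r t i n g wor ds"))
```

With case folded, "NOW" and "now" become the same token, so the database holds fewer distinct entries; with case preserved they stay separate. Text broken up into one- and two-letter pieces mostly falls below the length cutoff.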
This makes me wonder what happens if someone spams you with various devices like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of w h i t e s p a c e - TimS > >Well, we *could* skip #1, but previous experiments found that it didn't give >better error rates but did increase the database size. It did change the >*kinds* of errors, though, and in particular conference announcements had a >hard time getting thru when case was preserved (they're trying to sell you a >conference, and often SCREAM ABOUT IT). > >> ... >> Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!") > >They had 6 or 9 when I was a lad, depending on how you set the control bit >for the Univac 1108's 36-bit words. > >> have lots of tricks. We don't so much write code as remember it and >> retype it. > >You don't want to bet on who's older here. > >> ... >> Not really; it doesn't really matter what the format of a token >> coming out of the parser is, does it? > >The classifier is happy with any immutable and hashable Python object, i.e. >anything that can be used as a Python dict key. But people grafting various >databases onto this have stronger requirements, and they're not always >clear. As I mentioned last time, most "lightweight" databases require >string keys, so any switch away from strings would break those systems. >It's pre-alpha code, but still I'm not keen to rock anyone's boat unless >there's a clear win in return. > >> ... >> True; then it becomes a game of finding generic messages that are >> likely to evaluate as hammy enough to the average recognizer. And >> the meta-response is to send out multiple emails with differently >> tuned slices of ham. > >They can try. Spam doesn't need to be stopped, though, it merely has to be >made more costly to send than it brings back.
> >Last week Jeremy and Guido here both reported a *very* effective technique: >spam was sent to them as replies to mailing-list postings (not this mailing >list) they had made, including a full quote of the msg they had >posted. That was guaranteed to have lots of ham words for them, and the >Subject line was the expected "Re:" followed by their own subject line. > >I doubt they're going to get a response rate high enough to be able to >afford this scheme over time, at least not on tech mailing lists. We'll >see; if they can, it's going to be hard to beat. > >> I hereby, btw, coin the term "Dagwood" (or perhaps it should be >> Wooddag?) to mean an email containing artfully sliced amounts of ham, >> spam, and html condiments. ;^) > >Cool! Dagwood it is. > >> ... >> Well, what you'd need is a hacked HTML renderer that output sets that >> look like (token,size,color,background) and ignored words that were >> too small or hard to read. > >Sure. I expect the quickest path would be to feed the source thru a >text-only browser, and stare at its output. That seems mondo expensive, >though. > >>> For goodness sake, this is email we're talking about -- anyone >>> trusting a truly critical msg to email is dreaming to begin with. > >> Unfortunately, in the real world, this happens all too often. Keep >> in mind that the readers of this list are not the typical users of >> the resulting software techniques. > >I do, but it's still not my problem <0.5 wink>. All non-trivial systems >have non-zero FP rates, and that's a fact of life. You're keen on >whitelists, but they wouldn't do a thing to stop any of the false positives >I've seen, and so on; a multitude of schemes may reduce the overall error >rates if they're combined intelligently, but they're not going to reach an >error rate of 0. Not even with human review (as has become obvious to >everyone who's run a good system over their supposedly clean ham and spam >collections).
At some point, learning that Santa Claus isn't actually a >white man is a part of growing up. > >show-me-an-isp-that-guarantees-email-delivery-and-we'll-get- > rich-shorting-its-stock-ly y'rs - tim > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From anthony@interlink.com.au Tue Nov 12 01:36:28 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 12:36:28 +1100 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: Message-ID: <200211120136.gAC1aTs09777@localhost.localdomain> >>> Tim Peters > Can you try this again replacing "break" with "continue"? I can't believe > you intended break here -- it means that the first time we see a Mailman URL > in a msg, we stop looking for embedded URLs period. Spam could easily > exploit that. Woopsie. I knew that :) > >> ham:spam: 11192:1826 > >> 11192:1826 > > You realize you've got a very high ratio of ham to spam, right? *nod* It's my full personal test corpus. There's another 600 spam that haven't been dropped in. I'm re-running tests at the moment with smaller amounts. > We don't tokenize To: now because it gives good results for bad reasons on > mixed-source corpora. It would be good to have an option to tokenize it. > It appears that your code also tokenized Cc:; also fine. I would rather see > the code added to the loop currently cracking "from" lines: I've done this now, and am testing it before checking it in. > Why is this tokenizing only "the first" piece of the Subject line? Thinko. > I changed this to loop over all the Subject parts, and saw some minor good > effects on marginal msgs, so I'll check this one in without further ado. It > wasn't much of a win for you either, but it's cheap so why not. In my > personal email "subjectcharset:unknown" shows up a lot for some reason (but > only in spam). Hm.
Dunno about that - Barry might know under what circumstances the email package gives 'unknown' as a charset. I can't see how that could happen. > > I plan to try something like tokenizing the oldest three received > > lines (to hopefully avoid the previous issues with mail.python.org > > blowing numbers to hell) to see if that will help this one. > Did you try that yet? I'm not replying in a timely fashion because I'm not > interested, it's just because I'm 244 msgs behind on this mailing list alone > now. Not yet, no. It's on the stack. > > A base64d MP3 spam sent via zope-dev > > (*H* 0.993904, *S* 0.187868 = 0.0969820429397) > > which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and > > also the various mailman type clues (although that's better with the > > first patch, above) I'm going to try a patch to try and strip out mailing list [titles] at some point, too. Anthony From tim.one@comcast.net Tue Nov 12 01:53:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 20:53:07 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Tim Stone] > This makes me wonder what happens if someone spams you with > various devices > like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of > w > h > i > t > e > s > p > a > c > e Most of that would be invisible to us, as we ignore "words" with fewer than 3 characters, so they'd get judged mostly on the header lines, and it's not easy for spam to get by those even in isolation. But spammers won't *do* that regardless. There's A Reason they use giant fonts and bright colors: the harder a msg is to read, the lower the response rate, and they're not immune to economics. A better strategy is to just have HTML pointing to a .gif or .jpg out on the web. They can make that as gaudy as they like and the classifier won't see any of it.
This seems quite common in Asian spam now, but Guido speculated (and I think he's right) that this is more because the Asians are fighting intractable character-set issues. I'm seeing more of it now in English spam too, but it's still rare. For whatever reasons, this system hasn't had any trouble learning to call such stuff spam (I expect that the special tokenizing of URLs we do is helping a lot). From tim@fourstonesExpressions.com Tue Nov 12 02:03:38 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 11 Nov 2002 20:03:38 -0600 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: Gotcha. You dudes are on top of things... ;) Wanna do some ocr stuff on referenced jpgs and gifs? ;;;) I know I know... bad idea for any of a thousand reasons... - TimS 11/11/2002 7:53:07 PM, Tim Peters wrote: >[Tim Stone] >> This makes me wonder what happens if someone spams you with >> various devices >> like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of >> w >> h >> i >> t >> e >> s >> p >> a >> c >> e > >Most of that would be invisible to us, as we ignore "words" with fewer than >3 characters, so they'd get judged mostly on the header lines, and it's not >easy for spam to get by those even in isolation. > >But spammers won't *do* that regardless. There's A Reason they use giant >fonts and bright colors: the harder a msg is to read, the lower the >response rate, and they're not immune to economics. > >A better strategy is to just have HTML pointing to a .gif or .jpg out on the >web. They can make that as gaudy as they like and the classifier won't see >any of it. This seems quite common in Asian spam now, but Guido speculated >(and I think he's right) that this is more because the Asians are fighting >intractable character-set issues. I'm seeing more of it now in English spam >too, but it's still rare.
For whatever reasons, this system hasn't had any >trouble learning to call such stuff spam (I expect that the special >tokenizing of URLs we do is helping a lot). > > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Tue Nov 12 02:06:05 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 11 Nov 2002 21:06:05 -0500 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: <200211120136.gAC1aTs09777@localhost.localdomain> Message-ID: Quickie: >> In personal email "subjectcharset:unknown" shows up a lot for some >> reason (but only in spam). > Hm. Dunno about that - Barry might know under what circumstances > the email package gives 'unknown' as a charset. I can't see how that > could happen. Easy: it's my personal email, and the string UNKNOWN is what *Outlook* delivers. I think it actually says UNKNOWN as it came in off the wire! I get my share of Subject: =?Big5?B?pc7BecHIpGq/+g==?= thingies but I also get monsters like these: Subject: =?UNKNOWN?Q?=1B$B!z%-%c%s%Z!=3C%s=3CB=3B=5CCf!*!*=1B=28 B1=1B$B%/?==?UNKNOWN?Q?%j%C%/!w=1B=28B15=1B$B1=5F!A=1B=28 B25=1B$B1=5F!z=1B?==?UNKNOWN?Q?=28B?= That one came in to webmaster@python.org on Friday. Perhaps they've learned that Greg will reject a msg just for using an unloved charset, but I doubt it. In fact, I see that 'subjectcharset:unknown' is now the single strongest spam word in my entire mistake-driven (and tiny) training corpus: 'subjectcharset:unknown' 0.934783 From trebor@animeigo.com Tue Nov 12 02:16:13 2002 From: trebor@animeigo.com (Robert Woodhead) Date: Mon, 11 Nov 2002 21:16:13 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: >Bootstrapping a classifier, connecting to a gazillion quirky email clients, >and testing training strategies are all current high priorities. Saving >memory wouldn't buy me anything in the Outlook client I'm using, or in the >high-volume python.org application.
But, as I said, other people are keener >on that, and I expect that reducing the sheer number of tokens is a more >effective approach (in part because it ties into effective training >strategies over time -- the database will just keep growing (albeit at a >slackening pace) without active pruning, and whether a token takes one byte >or 50). My hunch, based on things I've done in the past, is that as the total volume of mail increases, the rate of increase in the number of unique tokens will approach a limit (that being, the number of distinct individual words in the language, though foreign unicode gibberish will have an effect). When I was doing single word analysis on a quarter-gig of ham and spam I was seeing, IIRC, about 300,000 distinct tokens (including the aforementioned gibberish). It will be interesting to see the results of some data reduction on the accuracy of the recogniser. My WAG is that even some serious hashing (down to, say, 20 bit tokens) won't have much effect on accuracy because most of the collisions will be between low frequency, insignificant tokens. >In theory, all prior decisions should be revisited after every change. I >haven't done anything like that lately, though, in part because no previous >"let's revisit this!" experiment ever paid off. Well, usually the time to check by chopping out particular components is when you've got it running so well that adding things doesn't help you. >They had 6 or 9 when I was a lad, depending on how you set the control bit >for the Univac 1108's 36-bit words. You had use of a Univac? You lucky, lucky bastard! I had to use a CARDIAC, and share the eraser used to wipe out the core. And had to walk 5 miles, uphill, in the snow, barefoot, to do that! ;^) >You don't want to bet on who'e older here . Old Fartdom is not measured in chronological years; it is an existential state of being. I became one the day I heard a young programmer complain that half a gigabyte of ram simply wasn't enough memory! 
;^) >The classifier is happy with any immutable and hashable Python object, i.e. >anything that can be used as a Python dict key. But people grafting various >databases onto this have stronger requirements, and they're not always >clear. As I mentioned last time, most "lightweight" databases require >string keys, so any switch away from strings would break those systems. >It's pre-alpha code, but still I'm not keen to rock anyone's boat unless >there's a clear win in return. Point taken; my point (maybe not expressed clearly) was that if you go to a hashing/data reduction scheme, then you just keep the entire thing in memory. Or you graft a mock db interface onto the data structure for compatibility during testing (which is probably what I'll try). > > True; then it becomes a game of finding generic messages that are >> likely to evaluate as hammy enough to the average recognizer. And >> the meta-response is to send out multiple emails with differently >> tuned slices of ham. > >They can try. Spam doesn't need to be stopped, though, it merely has to be >made more costly to send than it brings back. Note that the multiple emails can be madlib'd. They have access to more processor and bandwidth over time as well, alas. >Last week Jeremy and Guido here both reported a *very* effective technique: >spam was sent to them as replies to mailing-list postings (not this mailing >list ) they had made, including a full quote of the msg they had >posted. That was guaranteed to have lots of ham words for them, and the >Subject line was the expected "Re:" followed by their own subject line. Ouch, that's evil. Maybe the solution for that is to look at the message and the quotation individually? But that can be metagamed too. > >> I hereby, btw, coin the term "Dagwood" (or perhaps it should be >> Wooddag?) to mean an email containing artfully sliced amounts of ham, >> spam, and html condiments. ;^) > >Cool! Dagwood it is. What, we're agreeing on something?! 
I must be doing something wrong! Wait a minute, you agreed with me. What's wrong with you? A fever perhaps? ;^) At 8:33 PM -0500 11/11/02, Tim Stone wrote: >This makes me wonder what happens if someone spams you with various devices >like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of >w >h >i >t >e >s >p >a >c >e I'd say that, af ter be ing forc ed to rea d it, if he was sell ing t yle nol or ibu pr of en I' d proba bl y b u y som e! (just not from him) R From piersh@friskit.com Tue Nov 12 02:41:37 2002 From: piersh@friskit.com (Piers Haken) Date: Mon, 11 Nov 2002 18:41:37 -0800 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB9183A08@zeus.sfhq.friskit.com> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment I see the same thing on a few messages in my corpus. I believe it's something weird to do with the way outlook splits out the MIME headers. Attached is a dump of the exception. Piers. > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net] > Sent: Monday, November 11, 2002 4:30 PM > To: David Leftley > Cc: spambayes@python.org > Subject: RE: [Spambayes] Re: Outlook plugin plus Exchange > > > [David Leftley] > > ... > > And, while I'm reporting the quirks of the Outlook plugin, I have 3 > > messages (out of my spam corpus of c. 2000) that the plugin refuses to > > classify. If I attempt to score the contents of a folder containing > > one of these messages, scoring simply stops at that point - the > > progress bar disappears, and the remaining messages are left unscored. > > Next time that happens, bring up PythonWin and do Tools -> > Trace Collector Debugging Tool. That will pop up a window > showing diagnostic msgs and tracebacks produced by the > Outlook client. You'll probably find something "interesting" > near the end.
Note that nobody who has done work on the=20 > client has any form of Exchange running, so diagnosis may not=20 > lead to a cure. Still, can't fix what nobody understands, so=20 > it will be a start. >=20 > > Apart from those little things, though, this software=20 > rocks! Keep up=20 > > the good work, guys! >=20 > Tell Redmond -- if they paid Mark to bust his balls on this,=20 > I bet he'd grow a new pair . >=20 >=20 > _______________________________________________ > Spambayes mailing list > Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes >=20 ---------------------- multipart/mixed attachment RkFJTEVEIHRvIGNyZWF0ZSBlbWFpbC5tZXNzYWdlIGZyb206ICAnWC1NUy1NYWlsLUdpYmJlcmlz aDogTWljcm9zb2Z0IE1haWwgSW50ZXJuZXQgSGVhZGVycyBWZXJzaW9uIDIuMFxyXG5SZWNlaXZl ZDogZnJvbSBpbmV0LW1haWw3Lm9yYWNsZS5jb20gKFsyMDkuMjQ2LjEwLjE3MV0pIGJ5IHpldXMu c2ZocS5mcmlza2l0LmNvbSB3aXRoIE1pY3Jvc29mdCBTTVRQU1ZDKDUuMC4yMTk1LjQ0NTMpO1xy XG5cdCBTYXQsIDEzIEFwciAyMDAyIDAzOjE5OjAxIC0wNzAwXHJcblJlY2VpdmVkOiBmcm9tIGJs YXN0ZXItc210cC5vcmFjbGUuY29tIChlYmxhc3QwMS5vcmFjbGVlYmxhc3QuY29tIFsxNDguODcu OS4xMV0pXHJcblx0YnkgaW5ldC1tYWlsNy5vcmFjbGUuY29tIChTd2l0Y2gtMi4yLjEvU3dpdGNo LTIuMi4wKSB3aXRoIEVTTVRQIGlkIGczREE4R1YzMDA2NVxyXG5cdGZvciBQSUVSU0hARlJJU0tJ VC5DT007IFNhdCwgMTMgQXByIDIwMDIgMDM6MDg6MTYgLTA3MDBcclxuRGF0ZTogU2F0LCAxMyBB cHIgMjAwMiAwMzowODoxNiAtMDcwMFxyXG5NZXNzYWdlLUlkOiA8MjAwMjA0MTMxMDA4LmczREE4 R1YzMDA2NUBpbmV0LW1haWw3Lm9yYWNsZS5jb20+XHJcblN1YmplY3Q6IE9yYWNsZSBVbml2ZXJz aXR5IGlTZW1pbmFyc1xyXG5Gcm9tOiBPcmFjbGUgQ29ycG9yYXRpb248cmVwbGllc0BvcmFjbGVl Ymxhc3QuY29tPlxyXG5UbzogUElFUlNIQEZSSVNLSVQuQ09NXHJcblJlcGx5LVRvOiByZXBsaWVz QG9yYWNsZWVibGFzdC5jb21cclxuQ29udGVudC1UcmFuc2Zlci1FbmNvZGluZzogOGJpdFxyXG5N SU1FLVZlcnNpb246IDEuMFxyXG5Db250ZW50LVR5cGU6IG11bHRpcGFydC9hbHRlcm5hdGl2ZTtc clxuICAgIGJvdW5kYXJ5PSJuZXh0X3BhcnRfb2ZfbWVzc2FnZSJcclxuUmV0dXJuLVBhdGg6IHJl cGxpZXNAb3JhY2xlZWJsYXN0LmNvbVxyXG5YLU9yaWdpbmFsQXJyaXZhbFRpbWU6IDEzIEFwciAy 
MDAyIDEwOjE5OjAxLjA5MzggKFVUQykgRklMRVRJTUU9W0ExRDhGOTIwOjAxQzFFMkQ0XVxyXG5c clxuLS1uZXh0X3BhcnRfb2ZfbWVzc2FnZVxyXG5vZl9tZXNzYWdlXHJcbmdlXHJcblxyXG4tLW5l eHRfcGFydF9vZl9tZXNzYWdlXHJcbkNvbnRlbnQtVHlwZTogdGV4dC9odG1sXHJcblxyXG5cblxy XG5cclxuPGh0bWw+XHJcbjxoZWFkPlxyXG48TUVUQSBIVFRQLUVRVUlWPSJDb250ZW50LVR5cGUi IENPTlRFTlQ9InRleHQvaHRtbDsgY2hhcnNldD1pc28tODg1OS0xIj5cclxuXHJcbjx0aXRsZT5y ZW1pbmRlcjwvdGl0bGU+XHJcblxyXG48c3R5bGUgdHlwZT0idGV4dC9jc3MiPlxyXG5cdHAgeyAg Zm9udC1mYW1pbHk6IFZlcmRhbmEsIEFyaWFsLCBIZWx2ZXRpY2EsIHNhbnMtc2VyaWY7IGZvbnQt c2l6ZTogMTFweDsgZm9udC1zdHlsZTogbm9ybWFsOyBsaW5lLWhlaWdodDogMThweDsgZm9udC13 ZWlnaHQ6IG5vcm1hbH1cclxuXHR1bCB7ICBmb250LWZhbWlseTogVmVyZGFuYSwgQXJpYWwsIEhl bHZldGljYSwgc2Fucy1zZXJpZjsgZm9udC1zaXplOiAxMXB4OyBmb250LXN0eWxlOiBub3JtYWw7 IGxpbmUtaGVpZ2h0OiAxOHB4OyBmb250LXdlaWdodDogbm9ybWFsOyBsaXN0LXN0eWxlLXR5cGU6 IGRpc2N9XHJcblx0Lm5vYm9sZHR4dCB7ICBmb250OiBub3JtYWwgMTFweC8xNHB4IFZlcmRhbmEs IEFyaWFsLCBIZWx2ZXRpY2EsIHNhbnMtc2VyaWZ9XHJcblx0LmxpdmVidG50eHQgeyAgZm9udC1m YW1pbHk6IFZlcmRhbmEsIEFyaWFsLCBIZWx2ZXRpY2EsIHNhbnMtc2VyaWY7IGZvbnQtc2l6ZTog MTFweDsgZm9udC1zdHlsZTogYm9sZDsgbGluZS1oZWlnaHQ6IDE0cHg7IGZvbnQtd2VpZ2h0OiBu b3JtYWw7IGNvbG9yOiAjRkZGRkZGfVxyXG5cdGIgeyAgZm9udDogYm9sZCAxMXB4LzE0cHggVmVy ZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZiB9XHJcblx0aSB7ICBmb250LWZhbWls eTogVmVyZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZjsgZm9udC1zaXplOiAxMXB4 OyBmb250LXN0eWxlOiBpdGFsaWM7IGxpbmUtaGVpZ2h0OiAxOHB4OyBmb250LXdlaWdodDogbm9y bWFsfVxyXG5cdGEgeyAgZm9udC1mYW1pbHk6IFZlcmRhbmEsIEFyaWFsLCBIZWx2ZXRpY2EsIHNh bnMtc2VyaWY7IGZvbnQtc2l6ZTogMTFweDsgZm9udC1zdHlsZTogYm9sZDsgbGluZS1oZWlnaHQ6 IDE4cHg7IGZvbnQtd2VpZ2h0OiBub3JtYWw7IGNvbG9yOiAjRkYwMDAwfVxyXG5cdC50aXRsZSB7 ICBmb250LWZhbWlseTogVmVyZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZjsgZm9u dC1zaXplOiAxNHB4OyBmb250LXN0eWxlOiBib2xkOyBsaW5lLWhlaWdodDogMjBweDsgZm9udC13 ZWlnaHQ6IGJvbGR9XHJcblx0LnN1YnRpdGxlIHsgIGZvbnQtZmFtaWx5OiBWZXJkYW5hLCBBcmlh 
bCwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6IDEycHg7IGZvbnQtc3R5bGU6IGJv bGQ7IGxpbmUtaGVpZ2h0OiAxOHB4OyBmb250LXdlaWdodDogYm9sZH1cclxuXHQuYm90bGluayB7 ICBmb250LWZhbWlseTogVmVyZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZjsgZm9u dC1zaXplOiAxMXB4OyBsaW5lLWhlaWdodDogMTRweDsgZm9udC13ZWlnaHQ6IGJvbGQ7IGNvbG9y OiAjRkZGRkZGfVxyXG48L3N0eWxlPlxyXG48L2hlYWQ+XHJcbjxib2R5IGJnY29sb3I9IiNGRkZG RkYiPlxyXG5cclxuXHJcbjwhLS0gT3JhY2xlIExvZ28gLS0+XHJcbjxUQUJMRSB3aWR0aD0iNTAw IiBib3JkZXI9IjAiIGNlbGxwYWRkaW5nPSIwIiBjZWxsc3BhY2luZz0iMCI+XHJcbiAgPHRyPiBc clxuICAgIDx0ZCBoZWlnaHQ9IjQwIiBiZ2NvbG9yPSIjRkYwMDAwIiB2YWxpZ249InRvcCIgd2lk dGg9IjUwMCI+PGltZyBzcmM9Imh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFy cy9FVFMtb3JhY2xlTG9nby5naWYiIHdpZHRoPSI1MDAiIGhlaWdodD0iNDAiPjwvdGQ+XHJcbiAg PC90cj5cclxuPC9UQUJMRT5cclxuXHJcblxyXG48IS0tICoqQkVHSU4gIEZsYXNoIE1lZGlhICBI RVJFKiogLS0+XHJcbjxUQUJMRSB3aWR0aD0iNTAwIiBib3JkZXI9IjAiIGNlbGxzcGFjaW5nPSIw IiBjZWxscGFkZGluZz0iMCI+XHJcbiAgPHRyPiBcclxuICAgIDx0ZCBjb2xzcGFuPSI1IiBhbGln bj0iY2VudGVyIiBoZWlnaHQ9IjEwMCIgd2lkdGg9IjUwMCIgYmdjb2xvcj0iI0ZGMDAwMCI+IFxy XG4gICAgICA8aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJz LzAyMDU5OW91X2VtMS5naWYiIHdpZHRoPSI1MDAiIGhlaWdodD0iMTAwIj4gPC90ZD5cclxuICA8 L3RyPlxyXG48L1RBQkxFPlxyXG48IS0tIEVORCBGbGFzaCBNZWRpYSAtLT5cclxuXHJcblxyXG48 IS0tIGRlbGluZWF0b3IgLS0+XHJcbjxUQUJMRSB3aWR0aD0iNTAwIiBib3JkZXI9IjAiIGNlbGxz cGFjaW5nPSIwIiBjZWxscGFkZGluZz0iMCI+XHJcbiAgPHRyPiBcclxuICAgIDx0ZCBhbGlnbj0i cmlnaHQiIGhlaWdodD0iMTAiIGJnY29sb3I9IiNGRjAwMDAiIHZhbGlnbj0idG9wIiB3aWR0aD0i NTAwIj48aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VU Uy1zcGFjZXIuZ2lmIiB3aWR0aD0iMSIgaGVpZ2h0PSIxMCI+PC90ZD5cclxuICA8L3RyPlxyXG48 L1RBQkxFPlxyXG5cclxuXHJcbjwhLS0gKipCRUdJTiAgQm9keSBDb250ZW50ICAqKk1ha2UgY2hh bmdlcyB0byBjb3B5IEhFUkUqKiAtLT5cclxuPFRBQkxFIHdpZHRoPSI1MDAiIGJvcmRlcj0iMCIg Y2VsbHBhZGRpbmc9IjAiIGNlbGxzcGFjaW5nPSIwIj5cclxuICA8dHI+IFxyXG4gICAgPHRkIHJv 
d3NwYW49IjMiIGJnY29sb3I9InJlZCI+PGltZyBzcmM9Imh0dHA6Ly93d3cub3JhY2xlLmNvbS9z dGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZiIgd2lkdGg9IjEiIGhlaWdodD0iMSI+PC90 ZD5cclxuICAgIDx0ZCByb3dzcGFuPSIzIiB2YWxpZ249InRvcCI+PGltZyBzcmM9Imh0dHA6Ly93 d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZiIgd2lkdGg9IjMw IiBoZWlnaHQ9IjEiPjwvdGQ+XHJcbiAgICA8dGQgYWxpZ249ImxlZnQiIHZhbGlnbj0idG9wIj48 aW1nIG5hbWU9InRpdGxlSGVhZCIgc3JjPSJodHRwOi8vd3d3Lm9yYWNsZS5jb20vc3RhcnQvb3Vf c2VtaW5hcnMvcmVtaW5kX3RpdGxlaGVhZC5naWYiIHdpZHRoPSIzMzciIGhlaWdodD0iNTAiIGJv cmRlcj0iMCI+PC90ZD5cclxuXHQ8dGQgcm93c3Bhbj0iMiIgdmFsaWduPSJ0b3AiPjxpbWcgc3Jj PSJodHRwOi8vd3d3Lm9yYWNsZS5jb20vc3RhcnQvb3Vfc2VtaW5hcnMvRVRTLXNwYWNlci5naWYi IHdpZHRoPSIxMCIgaGVpZ2h0PSIxMDAlIj48L3RkPlxyXG5cdDx0ZCByb3dzcGFuPSIyIiBhbGln bj0icmlnaHQiIHZhbGlnbj0idG9wIj48YnI+XHJcbiAgICAgPGEgaHJlZj0iaHR0cDovL3d3dy5v cmFjbGUuY29tL2dvLz8mU3JjPTEyOTYxOTcmQWN0PTgiPjxpbWcgc3JjPSJodHRwOi8vd3d3Lm9y YWNsZS5jb20vc3RhcnQvb3Vfc2VtaW5hcnMvcmVtaW5kLWNhbGxidXR0b24uZ2lmIiB3aWR0aD0i ODEiIGhlaWdodD0iMTAyIiBib3JkZXI9IjAiPjwvYT48L3RkPlxyXG4gICAgPHRkIHJvd3NwYW49 IjMiIHZhbGlnbj0idG9wIj48aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291 X3NlbWluYXJzL0VUUy1zcGFjZXIuZ2lmIiB3aWR0aD0iMjUiIGhlaWdodD0iMSI+PC90ZD5cclxu ICAgIDx0ZCByb3dzcGFuPSIzIiBiZ2NvbG9yPSJyZWQiPjxpbWcgc3JjPSJodHRwOi8vd3d3Lm9y YWNsZS5jb20vc3RhcnQvb3Vfc2VtaW5hcnMvRVRTLXNwYWNlci5naWYiIHdpZHRoPSIxIiBoZWln aHQ9IjEiPjwvdGQ+XHJcbiAgPC90cj5cclxuICA8dHI+IFxyXG4gICAgPHRkIGhlaWdodD0iMTAw JSIgdmFsaWduPSJ0b3AiIHdpZHRoPSIzMzciPiBcclxuICAgICAgPHA+PGI+RG9uJiMxNDY7dCBN aXNzIE5leHQgV2VlayYjMTQ2O3MgaVNlbWluYXJzIGZyb20gT3JhY2xlIFVuaXZlcnNpdHkuPC9i PiBcclxuICAgICAgPC9wPlxyXG4gICAgICA8cD4gRG9uJiMxNDY7dCBmb3JnZXQgYWJvdXQgdGhl IEZSRUUgaVNlbWluYXJzIGFuZCBsaXZlIGNoYXQgY29taW5nIHVwIG5leHQgXHJcbiAgICAgICAg d2VlayBmcm9tIE9yYWNsZSBVbml2ZXJzaXR5ISBXaXRoIHRoZSBsYXRlc3QgaW5mb3JtYXRpb24g b24gT3JhY2xlIGNlcnRpZmljYXRpb24gXHJcbiAgICAgICAgYW5kIHRlY2hub2xvZ3ksIHRoZXNl 
IGZpdmUgZXZlbnRzIHByb3ZpZGUgdW5pcXVlIGtub3dsZWRnZSBhbmQgZ3VpZGVkIFxyXG4gICAg ICAgIHRyYWluaW5nIHVuYXZhaWxhYmxlIGFueXdoZXJlIGVsc2UuIEVhY2ggc2VtaW5hciBpbmNs dWRlcyBhIDE1LW1pbnV0ZSBcclxuICAgICAgICBtaW5pLWxlc3NvbiBhbmQgUSZhbXA7QSBzZXNz aW9uIHdpdGggYW4gT3JhY2xlIFVuaXZlcnNpdHkgaW5zdHJ1Y3Rvci48L3A+XHJcblxyXG4gICAg ICA8cD4gSWYgeW91IGhhdmVuJiMxNDY7dCBhbHJlYWR5LCA8YSBocmVmPSJodHRwOi8vd3d3Lm9y YWNsZS5jb20vZ28vPyZTcmM9MTI5NjE5NyZBY3Q9OCI+Y2xpY2sgaGVyZTwvYT4gdG8gcmVnaXN0 ZXIgXHJcbiAgICAgICAgZm9yIHRoZSBmdWxsIHdlZWsgb2YgZXZlbnRzLjwvcD5cclxuICAgICAg PHA+IDxpPk1vbmRheSwgQXByaWwgMTUsIDIwMDIgJiMxNTA7IDEwOjAwIGEubS4gUERUPC9pPjxi cj5cclxuICAgICAgICA8Yj5PcmFjbGU5PGk+aTwvaT4gJiMxNTA7IFRyYWluaW5nIGZvciBDZXJ0 aWZpY2F0aW9uLjwvYj4gS2ljayBvZmYgeW91ciBcclxuICAgICAgICB3ZWVrIHdpdGggYSAzMC1t aW51dGUgc2Vzc2lvbiBvbiB0aGUgY29tcG9uZW50cywgdmFsdWUsIGFuZCBzdGVwcyBpbiB0aGUg XHJcbiAgICAgICAgT3JhY2xlIENlcnRpZmljYXRpb24gcHJvY2Vzcy48L3A+XHJcbiAgICAgIDxw PiA8aT5UdWVzZGF5LCBBcHJpbCAxNiwgMjAwMiAmIzE1MDsgODowMCBhLm0uIFBEVDwvaT48YnI+ XHJcbiAgICAgICAgPGI+T3JhY2xlOTxpPmk8L2k+IE5ldyBGZWF0dXJlcy48L2I+IFVwZGF0ZSB5 b3VyIGtub3dsZWRnZSBhbmQgaG9uZSB5b3VyIFxyXG4gICAgICAgIHNraWxscyBvbiB0aGUgbGF0 ZXN0IGZlYXR1cmVzIGFuZCBvcHRpb25zIGZvdW5kIGluIE9yYWNsZTk8aT5pPC9pPiB3aXRoIFxy XG4gICAgICAgIGFkdmljZSBmcm9tIG91ciB0cmFpbmluZyBleHBlcnRzLiA8YnI+XHJcbiAgICAg IDwvcD5cclxuICAgICAgPHA+PGk+V2VkbmVzZGF5LCBBcHJpbCAxNywgMjAwMiAmIzE1MDsgMTA6 MDAgYS5tLiBQRFQ8L2k+PGJyPlxyXG4gICAgICAgIDxiPk9yYWNsZTk8aT5pPC9pPiBTZWN1cml0 eSBUcmFpbmluZyBmb3IgQ2VydGlmaWNhdGlvbi48L2I+ICBMZWFybiBob3cgeW91IGNhbiBtZWV0 IHlvdXIgYnVzaW5lc3MgbmVlZHMgaW4gXHJcbiAgICB0aGUgcmFwaWRseSBjaGFuZ2luZyB3b3Js ZCBvZiBoaWdoLXRlY2ggc2VjdXJpdHkuIDxicj5cclxuICAgICAgPC9wPlxyXG4gICAgICA8cD48 aT5UaHVyc2RheSwgQXByaWwgMTgsIDIwMDIgJiMxNTA7IDEyOjAwIHAubS4gUERUPC9pPjxicj5c clxuICAgICAgICA8Yj5QZXJmb3JtYW5jZSBUdW5pbmcgJiMxNTA7IFRpcHMgZnJvbSB0aGUgRXhw ZXJ0cy48L2I+IExlYXJuIGJlc3QgcHJhY3RpY2VzIFxyXG4gICAgICAgIGFuZCB0ZWNobmljYWwg 
dGlwcyBvbiBob3cgdG8gZ2V0IHRoZSBtb3N0IGZyb20gT3JhY2xlJiMxNDY7cyBEYXRhYmFzZSBc clxuICAgICAgICBzb2x1dGlvbi48YnI+XHJcbiAgICAgIDwvcD5cclxuICAgICAgPHA+PGk+RnJp ZGF5LCBBcHJpbCAxOSwgMjAwMiAmIzE1MDsgMTI6MDAgcC5tLiBQRFQ8L2k+PGJyPlxyXG4gICAg ICAgIDxiPkNlcnRpZmljYXRpb246IEFuIE9wZW4gRm9ydW0gd2l0aCBPcmFjbGUgQ2VydGlmaWNh dGlvbiBQcm9ncmFtIEd1cnVzLCBcclxuICAgICAgICBNaWtlIFNlcnBlIGFuZCBKaW0gRGlJYW5u aS48L2I+IEludGVyYWN0IGRpcmVjdGx5IHdpdGggT3JhY2xlIENlcnRpZmljYXRpb24gXHJcbiAg ICAgICAgUHJvZ3JhbSBleHBlcnRzIHRvIGFuc3dlciBhbnkgcmVtYWluaW5nIHF1ZXN0aW9ucyB5 b3UgbWF5IGhhdmUgYWJvdXQgdGhlIFxyXG4gICAgICAgIHByb2dyYW0sIGN1cnJpY3VsdW0sIG9y IHRyYWluaW5nIG9wdGlvbnMuIDxicj48L3A+XHJcbiAgICAgIDxwPjxhIGhyZWY9Imh0dHA6Ly93 d3cub3JhY2xlLmNvbS9nby8/JlNyYz0xMjk2MTk3JkFjdD04Ij5DbGljayBoZXJlPC9hPiB0byBy ZWdpc3Rlci48L3A+XHJcbiAgICAgIDwvdGQ+XHJcbiAgPC90cj5cclxuICA8dHI+XHJcblx0PHRk IGNvbHNwYW49IjMiIGhpZWdodD0iMTAwJSIgdmFsaWduPSJ0b3AiIGFsaWduPSJsZWZ0XHJcblx0 Ij48aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VUUy1z cGFjZXIuZ2lmIiB3aWR0aD0iNDQzIiBoZWlnaHQ9IjIwIj48L3RkPlxyXG4gIDwvdHI+XHJcbjwv VEFCTEU+XHJcbjwhLS0gRU5EIGJvZHkgY29udGVudCAtLT5cclxuXHJcblxyXG48IS0tICoqQkVH SU4gIEJvdHRvbSBMaW5rICAtIFx4ZWNjYWxsLXRvLWFjdGlvblx4ZWUgSEVSRSoqIC0tPlxyXG48 VEFCTEUgd2lkdGg9IjUwMCIgYm9yZGVyPSIwIiBjZWxscGFkZGluZz0iMCIgY2VsbHNwYWNpbmc9 IjAiPlxyXG4gIDx0cj4gXHJcbiAgICA8dGQgaGVpZ2h0PSIzMCIgYmdjb2xvcj0iI0ZGMDAwMCIg dmFsaWduPSJtaWRkbGUiIHdpZHRoPSI1MDAiIGFsaWduPSJjZW50ZXIiPjxhIGhyZWY9Imh0dHA6 Ly93d3cub3JhY2xlLmNvbS9nby8/JlNyYz0xMjk2MTk3JkFjdD04IiBjbGFzcz0iYm90bGluayI+ Q2xpY2sgXHJcbiAgICAgIGhlcmUgdG8gdmlldyB5b3VyIEZSRUUgaVNlbWluYXJzLjwvYT48L3Rk PlxyXG4gIDwvdHI+XHJcbjwvVEFCTEU+XHJcbjwhLS0gRU5EIEJvdHRvbSBMaW5rIC0tPlxyXG5c clxuPC9ib2R5PlxyXG48L2h0bWw+XHJcbjxwPjxmb250IGZhY2U9IkFyaWFsLCBoZWx2ZXRpY2Ei IHNpemU9IjEiPlxyXG48YnI+VG8gYmUgcmVtb3ZlZCBmcm9tIE9yYWNsZVwncyBtYWlsaW5nIGxp c3RzLCBzZW5kIGFuIGVtYWlsIHRvOiBcclxuPGJyPjxhIGhyZWY9Im1haWx0bzp1bnN1YnNjcmli 
ZUBvcmFjbGVlYmxhc3QuY29tP3N1YmplY3Q9UkVNT1ZFIE9GIE9SQUNMRSBNQUlMSU5HIExJU1Qg MTI5ODM5MCZib2R5PVJFTU9WRSBQSUVSU0hARlJJU0tJVC5DT00gIj51bnN1YnNjcmliZUBvcmFj bGVlYmxhc3QuY29tPC9hPiBcclxuPGJyPndpdGggdGhlIGZvbGxvd2luZyBpbiB0aGUgbWVzc2Fn ZSBib2R5OiBcclxuPGJyPlJFTU9WRSBQSUVSU0hARlJJU0tJVC5DT01cclxuPGJyPlNUT1AgXHJc bjxwPlxyXG5bMTI3NTM3My81LzEwNzU0NzAxMl0gXHJcbjwvZm9udD5cclxuPGltZyBzcmM9Imh0 dHA6Ly93d3cub3JhY2xlLmNvbS9lbG9nL3RyYWNrdXJsP2RpPTEyOTgzOTAmc2kxPTEwNzU0NzAx MiIgYm9yZGVyPTA+IFxyXG5cclxuXHJcblxyXG5cclxuXHJcblxyXG5cclxuXHJcblxyXG5cbiAg PGh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtb3JhY2xlTG9nby5n aWY+IFx0XHJcbiAgPGh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy8wMjA1 OTlvdV9lbTEuZ2lmPiBcdFxyXG4gIDxodHRwOi8vd3d3Lm9yYWNsZS5jb20vc3RhcnQvb3Vfc2Vt aW5hcnMvRVRTLXNwYWNlci5naWY+IFx0XHJcbiAgPGh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFy dC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZj4gXHQgIDxodHRwOi8vd3d3Lm9yYWNsZS5jb20v c3RhcnQvb3Vfc2VtaW5hcnMvRVRTLXNwYWNlci5naWY+IFx0ICA8aHR0cDovL3d3dy5vcmFjbGUu Y29tL3N0YXJ0L291X3NlbWluYXJzL3JlbWluZF90aXRsZWhlYWQuZ2lmPiBcdCAgPGh0dHA6Ly93 d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZj4gXHRcclxuIDxo dHRwOi8vd3d3Lm9yYWNsZS5jb20vZ28vPyZTcmM9MTI5NjE5NyZBY3Q9OD4gXHQgICA8aHR0cDov L3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VUUy1zcGFjZXIuZ2lmPiBcdCAgPGh0 dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZj4gXHRc clxuXHJcbkRvblx4OTJ0IE1pc3MgTmV4dCBXZWVrXHg5MnMgaVNlbWluYXJzIGZyb20gT3JhY2xl IFVuaXZlcnNpdHkuIFxyXG5cclxuRG9uXHg5MnQgZm9yZ2V0IGFib3V0IHRoZSBGUkVFIGlTZW1p bmFycyBhbmQgbGl2ZSBjaGF0IGNvbWluZyB1cCBuZXh0IHdlZWsgZnJvbSBPcmFjbGUgVW5pdmVy c2l0eSEgV2l0aCB0aGUgbGF0ZXN0IGluZm9ybWF0aW9uIG9uIE9yYWNsZSBjZXJ0aWZpY2F0aW9u IGFuZCB0ZWNobm9sb2d5LCB0aGVzZSBmaXZlIGV2ZW50cyBwcm92aWRlIHVuaXF1ZSBrbm93bGVk Z2UgYW5kIGd1aWRlZCB0cmFpbmluZyB1bmF2YWlsYWJsZSBhbnl3aGVyZSBlbHNlLiBFYWNoIHNl bWluYXIgaW5jbHVkZXMgYSAxNS1taW51dGUgbWluaS1sZXNzb24gYW5kIFEmQSBzZXNzaW9uIHdp 
dGggYW4gT3JhY2xlIFVuaXZlcnNpdHkgaW5zdHJ1Y3Rvci5cclxuXHJcbklmIHlvdSBoYXZlblx4 OTJ0IGFscmVhZHksIGNsaWNrIGhlcmUgPGh0dHA6Ly93d3cub3JhY2xlLmNvbS9nby8/JlNyYz0x Mjk2MTk3JkFjdD04PiAgdG8gcmVnaXN0ZXIgZm9yIHRoZSBmdWxsIHdlZWsgb2YgZXZlbnRzLlxy XG5cclxuTW9uZGF5LCBBcHJpbCAxNSwgMjAwMiBceDk2IDEwOjAwIGEubS4gUERUXHJcbk9yYWNs ZTlpIFx4OTYgVHJhaW5pbmcgZm9yIENlcnRpZmljYXRpb24uIEtpY2sgb2ZmIHlvdXIgd2VlayB3 aXRoIGEgMzAtbWludXRlIHNlc3Npb24gb24gdGhlIGNvbXBvbmVudHMsIHZhbHVlLCBhbmQgc3Rl cHMgaW4gdGhlIE9yYWNsZSBDZXJ0aWZpY2F0aW9uIHByb2Nlc3MuXHJcblxyXG5UdWVzZGF5LCBB cHJpbCAxNiwgMjAwMiBceDk2IDg6MDAgYS5tLiBQRFRcclxuT3JhY2xlOWkgTmV3IEZlYXR1cmVz LiBVcGRhdGUgeW91ciBrbm93bGVkZ2UgYW5kIGhvbmUgeW91ciBza2lsbHMgb24gdGhlIGxhdGVz dCBmZWF0dXJlcyBhbmQgb3B0aW9ucyBmb3VuZCBpbiBPcmFjbGU5aSB3aXRoIGFkdmljZSBmcm9t IG91ciB0cmFpbmluZyBleHBlcnRzLiBcclxuXHJcblxyXG5XZWRuZXNkYXksIEFwcmlsIDE3LCAy MDAyIFx4OTYgMTA6MDAgYS5tLiBQRFRcclxuT3JhY2xlOWkgU2VjdXJpdHkgVHJhaW5pbmcgZm9y IENlcnRpZmljYXRpb24uIExlYXJuIGhvdyB5b3UgY2FuIG1lZXQgeW91ciBidXNpbmVzcyBuZWVk cyBpbiB0aGUgcmFwaWRseSBjaGFuZ2luZyB3b3JsZCBvZiBoaWdoLXRlY2ggc2VjdXJpdHkuIFxy XG5cclxuXHJcblRodXJzZGF5LCBBcHJpbCAxOCwgMjAwMiBceDk2IDEyOjAwIHAubS4gUERUXHJc blBlcmZvcm1hbmNlIFR1bmluZyBceDk2IFRpcHMgZnJvbSB0aGUgRXhwZXJ0cy4gTGVhcm4gYmVz dCBwcmFjdGljZXMgYW5kIHRlY2huaWNhbCB0aXBzIG9uIGhvdyB0byBnZXQgdGhlIG1vc3QgZnJv bSBPcmFjbGVceDkycyBEYXRhYmFzZSBzb2x1dGlvbi5cclxuXHJcblxyXG5GcmlkYXksIEFwcmls IDE5LCAyMDAyIFx4OTYgMTI6MDAgcC5tLiBQRFRcclxuQ2VydGlmaWNhdGlvbjogQW4gT3BlbiBG b3J1bSB3aXRoIE9yYWNsZSBDZXJ0aWZpY2F0aW9uIFByb2dyYW0gR3VydXMsIE1pa2UgU2VycGUg YW5kIEppbSBEaUlhbm5pLiBJbnRlcmFjdCBkaXJlY3RseSB3aXRoIE9yYWNsZSBDZXJ0aWZpY2F0 aW9uIFByb2dyYW0gZXhwZXJ0cyB0byBhbnN3ZXIgYW55IHJlbWFpbmluZyBxdWVzdGlvbnMgeW91 IG1heSBoYXZlIGFib3V0IHRoZSBwcm9ncmFtLCBjdXJyaWN1bHVtLCBvciB0cmFpbmluZyBvcHRp b25zLiBcclxuXHJcblxyXG5DbGljayBoZXJlIDxodHRwOi8vd3d3Lm9yYWNsZS5jb20vZ28vPyZT cmM9MTI5NjE5NyZBY3Q9OD4gIHRvIHJlZ2lzdGVyLlxyXG5cclxuICA8aHR0cDovL3d3dy5vcmFj 
bGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VUUy1zcGFjZXIuZ2lmPiBcdFxyXG5DbGljayAgPGh0 dHA6Ly93d3cub3JhY2xlLmNvbS9nby8/JlNyYz0xMjk2MTk3JkFjdD04PiBoZXJlIHRvIHZpZXcg eW91ciBGUkVFIGlTZW1pbmFycy5cdCBcclxuXHJcblxyXG5UbyBiZSByZW1vdmVkIGZyb20gT3Jh Y2xlXCdzIG1haWxpbmcgbGlzdHMsIHNlbmQgYW4gZW1haWwgdG86IFxyXG51bnN1YnNjcmliZUBv cmFjbGVlYmxhc3QuY29tIDxtYWlsdG86dW5zdWJzY3JpYmVAb3JhY2xlZWJsYXN0LmNvbT9zdWJq ZWN0PVJFTU9WRSBPRiBPUkFDTEUgTUFJTElORyBMSVNUIDEyOTgzOTAmYm9keT1SRU1PVkUgUElF UlNIQEZSSVNLSVQuQ09NPiAgXHJcbndpdGggdGhlIGZvbGxvd2luZyBpbiB0aGUgbWVzc2FnZSBi b2R5OiBcclxuUkVNT1ZFIFBJRVJTSEBGUklTS0lULkNPTSBcclxuU1RPUCBcclxuXHJcblxyXG5b MTI3NTM3My81LzEwNzU0NzAxMl0gICA8aHR0cDovL3d3dy5vcmFjbGUuY29tL2Vsb2cvdHJhY2t1 cmw/ZGk9MTI5ODM5MCZzaTE9MTA3NTQ3MDEyPiBcclxuXHJcbicKRXJyb3IgdHJhaW5pbmcgbWVz c2FnZSAnPE1BUElNc2dTdG9yZU1zZywgKHJlYWQpIGlkPTAwMDAwMDAwMzhBMUJCMTAwNUU1MTAx QUExQkIwODAwMkIyQTU2QzIwMDAwNDU0RDUzNEQ0NDQyMkU0NDRDNEMwMDAwMDAwMDAwMDAwMDAw MUI1NUZBMjBBQTY2MTFDRDlCQzgwMEFBMDAyRkM0NUEwQzAwMDAwMDVBNDU1NTUzMDAyRjZGM0Q0 NjcyNjk3MzZCNjk3NDIwNDk2RTYzMkUyRjZGNzUzRDQ2Njk3MjczNzQyMDQxNjQ2RDY5NkU2OTcz NzQ3MjYxNzQ2OTc2NjUyMDQ3NzI2Rjc1NzAyRjYzNkUzRDUyNjU2MzY5NzA2OTY1NkU3NDczMkY2 MzZFM0Q3MDY5NjU3MjczNjgwMC9FRjAwMDAwMDE5ODI2MkMwQUE2NjExQ0Q5QkM4MDBBQTAwMkZD NDVBMDYwMDAxMDAwMTAwMDAwMDAwMjk1QTI1MDEwMDAwMDAwMDJCRjBCND4nClRyYWNlYmFjayAo bW9zdCByZWNlbnQgY2FsbCBsYXN0KToKICBGaWxlICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXll c1xPdXRsb29rMjAwMFx0cmFpbi5weSIsIGxpbmUgNjcsIGluIHRyYWluX2ZvbGRlcgogICAgaWYg dHJhaW5fbWVzc2FnZShtZXNzYWdlLCBpc3NwYW0sIG1ncik6CiAgRmlsZSAiQzpcUHl0aG9uMjJc c3BhbVxzcGFtYmF5ZXNcT3V0bG9vazIwMDBcdHJhaW4ucHkiLCBsaW5lIDM2LCBpbiB0cmFpbl9t ZXNzYWdlCiAgICBzdHJlYW0gPSBtc2cuR2V0RW1haWxQYWNrYWdlT2JqZWN0KCkKICBGaWxlICJD OlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xPdXRsb29rMjAwMFxtc2dzdG9yZS5weSIsIGxpbmUg NDMxLCBpbiBHZXRFbWFpbFBhY2thZ2VPYmplY3QKICAgIG1zZyA9IGVtYWlsLm1lc3NhZ2VfZnJv bV9zdHJpbmcodGV4dCkKICBGaWxlICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xlbWFpbFxf 
X2luaXRfXy5weSIsIGxpbmUgMzksIGluIG1lc3NhZ2VfZnJvbV9zdHJpbmcKICAgIHJldHVybiBQ YXJzZXIoX2NsYXNzLCBzdHJpY3Q9c3RyaWN0KS5wYXJzZXN0cihzKQogIEZpbGUgIkM6XFB5dGhv bjIyXHNwYW1cc3BhbWJheWVzXGVtYWlsXFBhcnNlci5weSIsIGxpbmUgNTIsIGluIHBhcnNlc3Ry CiAgICByZXR1cm4gc2VsZi5wYXJzZShTdHJpbmdJTyh0ZXh0KSwgaGVhZGVyc29ubHk9aGVhZGVy c29ubHkpCiAgRmlsZSAiQzpcUHl0aG9uMjJcc3BhbVxzcGFtYmF5ZXNcZW1haWxcUGFyc2VyLnB5 IiwgbGluZSA0OCwgaW4gcGFyc2UKICAgIHNlbGYuX3BhcnNlYm9keShyb290LCBmcCkKICBGaWxl ICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xlbWFpbFxQYXJzZXIucHkiLCBsaW5lIDIwNiwg aW4gX3BhcnNlYm9keQogICAgbXNnb2JqID0gc2VsZi5wYXJzZXN0cihwYXJ0KQogIEZpbGUgIkM6 XFB5dGhvbjIyXHNwYW1cc3BhbWJheWVzXGVtYWlsXFBhcnNlci5weSIsIGxpbmUgNTIsIGluIHBh cnNlc3RyCiAgICByZXR1cm4gc2VsZi5wYXJzZShTdHJpbmdJTyh0ZXh0KSwgaGVhZGVyc29ubHk9 aGVhZGVyc29ubHkpCiAgRmlsZSAiQzpcUHl0aG9uMjJcc3BhbVxzcGFtYmF5ZXNcZW1haWxcUGFy c2VyLnB5IiwgbGluZSA0NiwgaW4gcGFyc2UKICAgIHNlbGYuX3BhcnNlaGVhZGVycyhyb290LCBm cCkKICBGaWxlICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xlbWFpbFxQYXJzZXIucHkiLCBs aW5lIDEwNSwgaW4gX3BhcnNlaGVhZGVycwogICAgcmFpc2UgRXJyb3JzLkhlYWRlclBhcnNlRXJy b3IoCkhlYWRlclBhcnNlRXJyb3I6IE5vdCBhIGhlYWRlciwgbm90IGEgY29udGludWF0aW9uOiBg YG9mX21lc3NhZ2UnJwo= ---------------------- multipart/mixed attachment-- From anthony@interlink.com.au Tue Nov 12 02:33:33 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 13:33:33 +1100 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: <200211120233.gAC2XXJ10069@localhost.localdomain> > >Last week Jeremy and Guido here both reported a *very* effective technique: > >spam was sent to them as replies to mailing-list postings (not this mailing > >list ) they had made, including a full quote of the msg they had > >posted. That was guaranteed to have lots of ham words for them, and the > >Subject line was the expected "Re:" followed by their own subject line. > > Ouch, that's evil. Maybe the solution for that is to look at the > message and the quotation individually? 
But that can be metagamed > too. This is still making the spammers work a lot harder, though, so it's not really a bad thing. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Tue Nov 12 02:41:33 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 13:41:33 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <9891913C5BFE87429D71E37F08210CB9183A08@zeus.sfhq.friskit.com> Message-ID: <200211120241.gAC2fYW10142@localhost.localdomain> >>> "Piers Haken" wrote > I see the same thing on a few messages in my corpus. I believe it's > something weird to do with the way outlook splits out the MIME headers. > > Attached is a dump of the exception. Here's a chunk of the message that's causing the problem: [snip] Content-Transfer-Encoding: 8bit MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="next_part_of_message" Return-Path: replies@oracleeblast.com X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4] --next_part_of_message of_message ge --next_part_of_message Content-Type: text/html [snip] This is utter bollocks :) The question is whether it's Oracle that's bollocksed it up, or Outbreak. Not a lot that could/should be done here - I guess in _theory_ email could do something where it tries to parse each multipart bit individually, and return the bits that work, but this seems like it's way too much work. I'm curious why the plugin doesn't fall back to raw message text in this case? And does outbreak display this message correctly? Anthony From popiel@wolfskeep.com Tue Nov 12 02:47:57 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Mon, 11 Nov 2002 18:47:57 -0800 Subject: [Spambayes] Introducing myself In-Reply-To: Message from Robert Woodhead References: Message-ID: <20021112024757.B199AF58B@cashew.wolfskeep.com> In message: Robert Woodhead writes: > >My hunch, based on things I've done in the past, is that as the total >volume of mail increases, the rate of increase in the number of >unique tokens will approach a limit (that being, the number of >distinct individual words in the language, though foreign unicode >gibberish will have an effect). When I was doing single word >analysis on a quarter-gig of ham and spam I was seeing, IIRC, about >300,000 distinct tokens (including the aforementioned gibberish). Rob Hooft recently (yesterday, that is) did a nice analysis and graph of database growth based on message count. He found it scaled almost linearly with the sqrt of the number of messages... but he only went up to a total of about 22000 messages, which is likely only about a fifth of a gig. >It will be interesting to see the results of some data reduction on >the accuracy of the recogniser. My WAG is that even some serious >hashing (down to, say, 20 bit tokens) won't have much effect on >accuracy because most of the collisions will be between low >frequency, insignificant tokens. Tim Peters did some hashing experiments back on 3 Nov; he posted these results: OK, doing a 10-fold cross-validation run across 2000 random ham and 2000 random spam, but the same random sets for "before" and "after":

filename:      before        crm
ham:spam:   2000:2000  2000:2000
fp total:           1       1604
fp %:            0.05      80.20
fn total:           0          0
fn %:            0.00       0.00
unsure t:          20          0
unsure %:        0.50       0.00
real cost:     $14.00  $16040.00
best cost:      $2.00    $228.00
h mean:          0.55      53.54
h sdev:          4.50       5.30
s mean:         99.91      71.40
s sdev:          1.64       6.84
mean diff:      99.36      17.86
k:              16.18       1.47

Granted, he was doing more complex word combinations with this, too, and a different combining technique, but it really doesn't look promising.
- Alex From piersh@friskit.com Tue Nov 12 03:21:25 2002 From: piersh@friskit.com (Piers Haken) Date: Mon, 11 Nov 2002 19:21:25 -0800 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB929750C@zeus.sfhq.friskit.com> Yup, Outlook displays it properly. I have a feeling that it's oracle's mess, but that outlook just ignores the invalid MIME-part headers -- maybe spambayes can do the same. Maybe if someone else has received this message from oracle they can shed some more light on this. The problem is multiplied by the fact that outlook includes the MIME-part headers and boundaries with the regular headers, but separates the body parts and attachments. I don't think there's any way to get the original, unseparated message from the API. The Outlook UI shows the headers as: Microsoft Mail Internet Headers Version 2.0 Received: from inet-mail7.oracle.com ([209.246.10.171]) by zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.4453); Sat, 13 Apr 2002 03:19:01 -0700 Received: from blaster-smtp.oracle.com (eblast01.oracleeblast.com [148.87.9.11]) by inet-mail7.oracle.com (Switch-2.2.1/Switch-2.2.0) with ESMTP id g3DA8GV30065 for PIERSH@FRISKIT.COM; Sat, 13 Apr 2002 03:08:16 -0700 Date: Sat, 13 Apr 2002 03:08:16 -0700 Message-Id: <200204131008.g3DA8GV30065@inet-mail7.oracle.com> Subject: Oracle University iSeminars From: Oracle Corporation To: PIERSH@FRISKIT.COM Reply-To: replies@oracleeblast.com Content-Transfer-Encoding: 8bit MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="next_part_of_message" Return-Path: replies@oracleeblast.com X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4] --next_part_of_message of_message ge --next_part_of_message Content-Type: text/html Piers.
> -----Original Message----- > From: Anthony Baxter [mailto:anthony@interlink.com.au] > Sent: Monday, November 11, 2002 6:42 PM > To: Piers Haken > Cc: Tim Peters; David Leftley; spambayes@python.org > Subject: Re: [Spambayes] Re: Outlook plugin plus Exchange > > > > >>> "Piers Haken" wrote > > I see the same thing on a few messages in my corpus. I believe it's > > something weird to do with the way outlook splits out the MIME > > headers. > > > > Attached is a dump of the exception. > > Here's a chunk of the message that's causing the problem: [snip] > Content-Transfer-Encoding: 8bit > MIME-Version: 1.0 > Content-Type: multipart/alternative; > boundary="next_part_of_message" > Return-Path: replies@oracleeblast.com > X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) > FILETIME=[A1D8F920:01C1E2D4] > > --next_part_of_message > of_message > ge > > --next_part_of_message > Content-Type: text/html > > [snip] > > This is utter bollocks :) The question is whether it's Oracle > that's bollocksed it up, or Outbreak. > > Not a lot that could/should be done here - I guess in > _theory_ email could do something where it tries to parse > each multipart bit individually, and return the bits that > work, but this seems like it's way too much work. > > I'm curious why the plugin doesn't fall back to raw message > text in this case? And does outbreak display this message correctly? > > Anthony > > From mhammond@skippinet.com.au Tue Nov 12 04:44:11 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 12 Nov 2002 15:44:11 +1100 Subject: [Spambayes] Some more experiences with the Outlook plugin In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619933@UKDCX001.uk.int.atosorigin.com> Message-ID: > I've now had the Outlook plugin running for about a week, and I'm > starting to get a feel for using it. The following is my "user > interface" experience.
It's a slightly unrealistic combination of > "what I actually did" and "what I realised afterwards I should have > done", but it is what I would use as notes telling a new user how > to set the system up, and as such it picks up on a few interesting > issues: > > 1. To start with, configure the plugin to define one "Spam" folder and > one "Unsure" folder, and define all other folders as "Ham". [1] Tim gives a great explanation of why this is not really possible - some people simply have too much ham, while even for others, the relative ratios are important. > * Following on from this, I also see Tim's behaviour of surprising > unsure cases (or worse, false negatives!). Worst case recently was a > message which scored as solid ham. I trained on it as "Spam", and > rescored it. It still scored 5 - solid ham. My immediate reaction was > "But I just *told* you it's spam!". I know that isn't how the classifier > works, but even so it was unsettling. FWIW, I attach the spam clues for > this one (I don't know if they make any sense in isolation, but it can't > hurt...) This too was my experience. For a while, I did training over a huge ham corpus, and spam is still less than 1000 messages. I had around 15:1 ham:spam. I too trained new ham and spam, and was disappointed to see the score remain almost identical. Re-training on just my inbox yields far far better results - roughly 3:1 ham:spam. Tim's idea of: > In the list you gave below, there are very few hapaxes (I recognize > them from the probabilities; I should probably add code to the client > to display the raw counts too): certainly would be useful. Without the maths background, I find it interesting to ignorantly speculate on these ratios. Tim's analysis: > '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1, > and that makes the accidental appearance of a ham word in spam much > more powerful than the systematic appearance of a spam word in spam.
makes me wonder why the classifier can't exploit the ham:spam ratio to give weighted results. Or from another POV, what would happen if we artificially boosted the ratio by training on each spam multiple times? I speculate due to my experience with these large ratios, and the fact that *every* one of these mails came through my Inbox. Many messages are from python.org's mailman - thus, the *true* ratio of ham:spam through my mail account is much higher than the ham:spam ratio left once the mailing list traffic is removed. Even though the total spam is the same, the system will score better or worse depending on the amount of ham I throw at it. It isn't intuitive to me why this need be so. Mark. From mhammond@skippinet.com.au Tue Nov 12 04:48:17 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 12 Nov 2002 15:48:17 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <200211120241.gAC2fYW10142@localhost.localdomain> Message-ID: [Anthony] > This is utter bollocks :) The question is whether it's Oracle > that's bollocksed it up, or Outbreak. > > Not a lot that could/should be done here - I guess in _theory_ > email could do something where it tries to parse each multipart > bit individually, and return the bits that work, but this seems > like it's way too much work. I believe the email package should give some consideration to the real world here. While creating well-formed messages is clearly mandatory, it is very frustrating when something exists in the real world, is clearly invalid, but everything else in the world has no trouble with it. eg, HTML parsing, when your parser fails on pages that every browser displays perfectly. I didn't create the page, but I can see it, and want to parse it. > I'm curious why the plugin doesn't fall back to raw message text in > this case? And ditto for every other application in the world that may try and use the email package on such an invalid message? 
While I accept that we will fix the plugin to handle this case, it does seem a shame to not be able to get *anything* out of the email package when your mail client itself is quite happy with the message. Eg, how much smarts do we move back into the plugin? Do we try and recover any headers at all? etc. I am *not* trying to say "outlook is broken, so the email package should handle it" - but simply something along the lines of "if most mailers could handle it, we should too". Outlook *is* broken, and I certainly don't want the email package to worm around all our problems - but I'm not convinced the problem above (or indeed most header related errors from this package) are outlook specific. Mark. From jeremy@alum.mit.edu Tue Nov 12 04:51:00 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Mon, 11 Nov 2002 23:51:00 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: <200211120233.gAC2XXJ10069@localhost.localdomain> References: <200211120233.gAC2XXJ10069@localhost.localdomain> Message-ID: <15824.34996.357483.745111@slothrop.zope.com> >>>>> "AB" == Anthony Baxter writes: >> Ouch, that's evil. Maybe the solution for that is to look at the >> message and the quotation individually? But that can be >> metagamed too. AB> This is still making the spammers work a lot harder, though, so AB> it's not really a bad thing. I'd wager that Tim is working much harder than any of the spammers. Jeremy From mhammond@skippinet.com.au Tue Nov 12 05:10:50 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 12 Nov 2002 16:10:50 +1100 Subject: [Spambayes] Exchange integration In-Reply-To: <3DCF67F7.16091.91EB9C8@localhost> Message-ID: > At first, I thought I'd use the Event service, but in 5.5 it's > async and MS even says "don't > use this to filter all your messages". 
It appears the best way is to hook the message spooler, as per http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mapi/html/_ mapi1book_using_message_filtering_to_manage_messages.asp and I believe that this can be configured to run on the client or an exchange server. > So it looks like I need to create some kind of MAPI hook or > preprocessor or mailbox > assistant.. I'm not sure which. > > Anyone know? And, can I do this all in Python via COM or do I > need some "real C to > hook in? Python's MAPI support doesn't extend to this yet, but I would be happy to help make it so. eeek - except I also find in Q224362: SAMPLE: Hook.exe MAPI Spooler Hook Provider Sample (C++) (http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q224362) """ Other Notes Note that Hook Providers, including this one, will not work when using the Microsoft Exchange Transport Provider. This is a result of Exchange's tightly-coupled store and transport (that is, they bypass the MAPI spooler). If you use Exchange's POP/SMTP/IMAP abilities, the spooler hook will function just fine """ KB article Q190413 (http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q190413) discusses this a little more in the context of public exchange folders, but still leaves me somewhat confused. The documentation for IMAPIAdviseSink certainly implies it can be used for all kinds of new mail notifications. We may be forced to "suck it and see" :( Mark. From tim.one@comcast.net Tue Nov 12 05:55:19 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 00:55:19 -0500 Subject: [Spambayes] Some more experiences with the Outlook plugin In-Reply-To: Message-ID: [Tim] >> In the list you gave below, there are very few hapaxes (I recognize >> them from the probabilities; I should probably add code to the client >> to display the raw counts too): [MarkH] > certainly would be useful. That's been checked in now. 
> Without the maths background, I find it interesting to ignorantly > speculate on these ratios. Tim's analysis: >> '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1, >> and that makes the accidental appearance of a ham word in spam much >> more powerful than the systematic appearance of a spam word in spam. > makes me wonder why the classifier can't exploit the ham:spam > ratio to give weighted results. I think it's already doing the best it can here. It's like I've met a thousand Americans and 2 Australians, so from all I've *seen* I have to conclude you're all beer-swilling, Ducati-riding, chain-smoking pigs. But that's really not enough evidence for me to *marry* an Australian, just enough to think highly of 'em . > Or from another POV, what would happen if we artificially > boosted the ratio by training on each spam multiple times? Nobody knows. The "by-counting" spamprob estimate wouldn't change at all: that's already computed by ratios instead of by absolute counts. If a word appears in 3 of 4 spam, it gets exactly the same by-counting estimate as a word that appears in 15,000,000 of 20,000,000 spam. The difference would be solely in how much the Bayesian adjustment pushed the by-counting estimate towards 0.5: the greater the total number of msgs a word has been seen in, the more willing the Bayesian adjustment is to leave the by-counting estimate alone. Much the same effect *could* be gotten via reducing option unknown_word_strength instead. That also makes the Bayesian adjustment more willing to take the by-counting estimate at face value. Most of the people who helped pick a good default value for unknown_word_strength didn't have a strong imbalance in ham:spam. Maybe you need a lower value, but I expect it's much better for such people *not* to train on so much ham. Training on small random samples, plus mistakes and unsures, may well be a better approach. 
If you've been following the latest experiments, it turns out you can get very good results with a tiny fraction of the msgs people *have* been training on. My personal classifier right now has been trained on only about 100 msgs total, close to 1:1 ham:spam. This has weaknesses too, but not nearly as bad as I guessed in advance (it doesn't seem *any* more prone to making flat-out mistakes, but the Unsures are hilarious). > I speculate due to my experience with these large ratios, and the > fact that *every* one of these mails came through my Inbox. Many > messages are from python.org's mailman - thus, the *true* ratio of > ham:spam through my mail account is much higher than the ham:spam > ratio left once the mailing list traffic is removed. Even though > the total spam is the same, the system will score better or worse > depending on the amount of ham I throw at it. It isn't intuitive > to me why this need be so. Only because if it has a lot more ham than spam, it has much more reason to be confident about hamprobs than spamprobs. I suppose the Bayesian adjustment could be fiddled so that it didn't "believe" it *could* be more confident about either class than is justified by the class for which it has the least amount of evidence. I'm not exactly sure of the details, but it's intuitively clear to me so will be obvious when I wake up. That would prevent the strange result in the example, but: 1. Training on the spam again still wouldn't do you much good, because if the ratio was 18:1 before training, it would still be close to 18:1 after training, so it still wouldn't have much reason to "believe" the new spamprobs. and 2. It would make most of the ham you trained on essentially a waste of time and space: by construction, it wouldn't believe the ham stats any more than it believed the spam stats. We know a lot more at this point about how the system behaves if you don't have a strong imbalance.
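A rough sketch of the two quantities discussed above may help make the point concrete. It is an illustration only: the function names are made up for this example, the adjustment is written in the usual Robinson-style form (assumed here, not quoted from the classifier source), and the values 0.45 for unknown_word_strength and 0.5 for the unknown-word probability are assumptions as well.

```python
def by_counting(spamcount, hamcount, nspam, nham):
    """Ratio-based spamprob estimate: only the fractions matter, not the totals."""
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    total = spamratio + hamratio
    # A word never seen in either class carries no evidence at all.
    return spamratio / total if total else 0.5


def adjusted(spamcount, hamcount, nspam, nham,
             strength=0.45, unknown_prob=0.5):
    """Pull the by-counting estimate towards unknown_prob.

    strength plays the role of unknown_word_strength (assumed value);
    n is the number of messages the word has been seen in.
    """
    n = spamcount + hamcount
    p = by_counting(spamcount, hamcount, nspam, nham)
    return (strength * unknown_prob + n * p) / (strength + n)


if __name__ == "__main__":
    # Same fractions, wildly different totals: identical by-counting estimate.
    print(by_counting(3, 1, 4, 4))
    print(by_counting(15_000_000, 5_000_000, 20_000_000, 20_000_000))
    # Training on every spam 10 times leaves the ratio alone but raises n,
    # so the adjustment trusts the by-counting estimate more.
    print(adjusted(3, 1, 4, 4), adjusted(30, 1, 40, 4))
    # Lowering the strength has much the same effect without retraining.
    print(adjusted(3, 1, 4, 4, strength=0.1))
```

Under these assumptions, repeating each spam in training and lowering the strength both move the adjusted value closer to the by-counting estimate, which is the equivalence described above.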
From tim.one@comcast.net Tue Nov 12 06:25:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 01:25:02 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <9891913C5BFE87429D71E37F08210CB929750C@zeus.sfhq.friskit.com> Message-ID: [Piers Haken] > Yup, Outlook displays it properly. Meaning it shows you the HTML part, as rendered HTML, I bet. > I have a feeling that it's oracle's mess, Not from what you showed below. It's not hard to find the end of the headers! The first blank line ends them. That Outlook is showing you stuff beyond that in its view of the headers says it didn't suck out the headers properly to begin with. > but that outlook just ignores the invalid MIME-part headers By this point Outlook isn't looking *at all* at the part that's damaged (and probably by it). It's just sucking out the PR_BODY_HTML property from the msg and rendering it, and the value of that property contains no MIME armor at all, just HTML stuff. > -- maybe spambayes can do the same. I keep telling people never to call email.message_from_string() directly, but they don't listen. The tokenizer's way of getting an email message from a string would have at least recovered the message body in this case, but would have lost the headers entirely (they're crap -- what can you do?). > The problem is multiplied by the fact that outlook includes the MIME-part > headers and boundaries with the regular headers, The Outlook client actually deletes those from the headers, because: > but separates the body parts and attachments. I don't think there's > any way to get the original, unseparated message from the API. That's right, there isn't. Outlook's basic structure appears to predate MIME catching on, and the MIME support very much appears hacked in after it was too late for a change in worldview. It's a mess that way, if you want to (as we do) get MIME back out.
The Outlook client right now "loses" all attachments, and even loses the msg body if the msg has been digitally signed (because it turns out Outlook does Yet Another Entirely Different Thing for signed msgs, leaving the two "normal" body properties empty and stuffing the body *plus* the signature into Yet Another property). > The Outlook UI shows the headers as: By this do you mean View -> Options -> Internet headers? Microsoft Mail Internet Headers Version 2.0 Received: from inet-mail7.oracle.com ([209.246.10.171]) by zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.4453); Sat, 13 Apr 2002 03:19:01 -0700 Received: from blaster-smtp.oracle.com (eblast01.oracleeblast.com [148.87.9.11]) by inet-mail7.oracle.com (Switch-2.2.1/Switch-2.2.0) with ESMTP id g3DA8GV30065 for PIERSH@FRISKIT.COM; Sat, 13 Apr 2002 03:08:16 -0700 Date: Sat, 13 Apr 2002 03:08:16 -0700 Message-Id: <200204131008.g3DA8GV30065@inet-mail7.oracle.com> Subject: Oracle University iSeminars From: Oracle Corporation To: PIERSH@FRISKIT.COM Reply-To: replies@oracleeblast.com Content-Transfer-Encoding: 8bit MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="next_part_of_message" Return-Path: replies@oracleeblast.com X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4] --next_part_of_message of_message ge --next_part_of_message Content-Type: text/html There's no way blank lines can be part of the headers, so I don't believe Oracle screwed this up. They really are blank, too, as the traceback you sent earlier showed this at the tail end of the headers: \r\n --next_part_of_message\r\n of_message\r\n ge\r\n \r\n --next_part_of_message\r\n Content-Type: text/html\r\n \r\n \n and *our* code put in the lone oddball \n after the end of what Outlook told us were the original headers. If that's common damage, I can worm around it. 
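One way to worm around this kind of damage can be sketched as follows, assuming the damage always takes the form shown above: everything from the first blank line of the "header" text onward is spillover and can be dropped before the text is handed to the email package. The helper names trim_header_block and rebuild_message are hypothetical and are not part of the plugin; the sample headers are abbreviated from the ones quoted in this thread.

```python
import re
from email import message_from_string


def trim_header_block(raw_headers):
    """Keep only the text before the first blank line.

    Hypothetical helper: assumes anything after the first blank line in the
    header text reported by Outlook is stray MIME-part debris.
    """
    text = raw_headers.replace("\r\n", "\n")
    head = re.split(r"\n[ \t]*\n", text, maxsplit=1)[0]
    # Exactly one blank line must separate headers from body.
    return head.rstrip("\n") + "\n\n"


def rebuild_message(raw_headers, body):
    """Glue the cleaned headers to the body and parse the result."""
    return message_from_string(trim_header_block(raw_headers) + body)


if __name__ == "__main__":
    damaged = (
        "Subject: Oracle University iSeminars\r\n"
        "MIME-Version: 1.0\r\n"
        "Return-Path: replies@oracleeblast.com\r\n"
        "\r\n"
        "--next_part_of_message\r\n"
        "of_message\r\n"
        "ge\r\n"
        "\r\n"
        "--next_part_of_message\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
    )
    msg = rebuild_message(damaged, "<html>hi</html>")
    print(msg["Subject"])
    print(msg.get_payload())
```

This keeps the genuine headers and discards the spillover rather than losing the headers entirely; whether that is the right trade-off depends on how common this particular damage turns out to be.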
From tim.one@comcast.net Tue Nov 12 06:33:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 01:33:59 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: [Mark Hammond] > I believe the email package should give some consideration to the > real world here. It tries to, starting in 2.2.2: http://www.python.org/doc/current/lib/node383.html The Parser class defaults to non-strict now, but as the docs say this doesn't mean MessageParseErrors are never raised; some ill- formatted messages just can't be parsed I'm sure Barry would be willing to entertain this specific case as a bug report. In theory, he reads this list, so should be shamed enough to do that himself . > While creating well-formed messages is clearly mandatory, > it is very frustrating when something exists in the real world, is > clearly invalid, but everything else in the world has no trouble > with it. In this specific case, it looks like Outlook created damaged headers itself, and Outlook doesn't care because *it* never *looks* at the headers again. It already sucked out the HTML and stored it in a property, and that's the only part it looks at again; the Subject and From etc are also tucked away in other properties. So as far as Outlook is concerned, PR_TRANSPORT_MESSAGE_HEADERS is passive trash. At least that's my guess . > ... > happy with the message. Eg, how much smarts do we move back into the > plugin? Do we try and recover any headers at all? etc. In this case I will; the form of the damage is clear and easily wormed around, provided you know what you're looking for in advance. From mhammond@skippinet.com.au Tue Nov 12 06:44:30 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 12 Nov 2002 17:44:30 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: Something that confuses me completely here is: * Outlook shows headers with blank lines, appearing to royally screw things up. 
* Our Outlook client simply appends the body(s) to the headers as a simple string. * We pass this re-constituted string back into the email package, and it too seems to screw up the header parsing! ie, Outlook shows the headers as: """ ... X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4] --next_part_of_message of_message ... """ And the traceback from the email package shows: "C:\Python22\spam\spambayes\email\Parser.py", line 105, in _parseheaders raise Errors.HeaderParseError( HeaderParseError: Not a header, not a continuation: ``of_message'' Which seems very strange to me. Why is the email package complaining about the "of_message" line, rather than itself stopping header parsing after that blank? (Recall that the email package does not see the "Content-Type:" header, as we remove that before sending it in.) I assume I am simply missing how messages are parsed. Mark. From piersh@friskit.com Tue Nov 12 07:02:51 2002 From: piersh@friskit.com (Piers Haken) Date: Mon, 11 Nov 2002 23:02:51 -0800 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB929750D@zeus.sfhq.friskit.com> > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net] > Sent: Monday, November 11, 2002 10:25 PM > To: Piers Haken > Cc: David Leftley; spambayes@python.org > Subject: RE: [Spambayes] Re: Outlook plugin plus Exchange > > > [Piers Haken] > > Yup, Outlook displays it properly. > > Meaning it shows you the HTML part, as rendered HTML, I bet. Yup. > > I have a feeling that it's Oracle's mess, > > Not from what you showed below. It's not hard to find the > end of the headers! The first blank line ends them. That > Outlook is showing you stuff beyond that in its view of the > headers says it didn't suck out the headers properly to begin with. I'm not sure that's the case. Outlook _always_ shows the MIME headers below the SMTP headers in its 'internet headers' UI.
For example, here are the 'headers' from another message which does render correctly and that spambayes does parse correctly: Microsoft Mail Internet Headers Version 2.0 Received: from sccrmhc02.attbi.com ([204.127.202.62]) by zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.5329); Mon, 11 Nov 2002 11:22:27 -0800 Received: from Computer ([12.236.244.49]) by sccrmhc02.attbi.com (InterMail vM.4.01.03.27 201-229-121-127-20010626) with SMTP id <20021111191007.KEOD5251.sccrmhc02.attbi.com@Computer>; Mon, 11 Nov 2002 19:10:07 +0000 From: "Rebecca Whitworth" To: "Piers Haken" Cc: "Traci and Stephen Green" Subject: the green's car Date: Mon, 11 Nov 2002 11:15:54 -0800 Message-ID: MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_000_002F_01C28973.B386B770" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Return-Path: lesanctuaire@earthlink.net X-OriginalArrivalTime: 11 Nov 2002 19:22:27.0328 (UTC) FILETIME=[ABBFA800:01C289B7] ------=_NextPart_000_002F_01C28973.B386B770 Content-Type: multipart/alternative; boundary="----=_NextPart_001_0030_01C28973.B386B770" ------=_NextPart_001_0030_01C28973.B386B770 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 8bit ------=_NextPart_001_0030_01C28973.B386B770 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable ------=_NextPart_001_0030_01C28973.B386B770-- ------=_NextPart_000_002F_01C28973.B386B770 Content-Type: image/jpeg; name="image001.jpg" Content-Transfer-Encoding: base64 Content-ID: ------=_NextPart_000_002F_01C28973.B386B770-- As you can see it's just showing everything but the contents of the MIME parts. I don't think there's any suggestion that these are _just_ the SMTP headers, but the outlook plugin is treating them as such.
Maybe the outlook plugin should trim the non-SMTP parts from these 'headers' before passing them to the classifier?? > > but that outlook just ignores the invalid MIME-part headers > > By this point Outlook isn't looking *at all* at the part > that's damaged (and probably by it). It's just sucking out > the PR_BODY_HTML property from the msg and rendering it, and > the value of that property contains no MIME armor at all, > just HTML stuff. > > > -- maybe spambayes can do the same. > > I keep telling people never to call > email.message_from_string() directly, but they don't listen. > The tokenizer's way of getting an email message from > a string would have at least recovered the message body in > this case, but would have lost the headers entirely (they're > crap -- what can you do?). > > > The problem is multiplied by the fact that outlook includes the > > MIME-part headers and boundaries with the regular headers, > > The Outlook client actually deletes those from the headers, because: > > > but separates the body parts and attachments. I don't think there's > > any way to get the original, unseparated message from the API. > > That's right, there isn't. Outlook's basic structure appears > to predate MIME catching on, and the MIME support very much > appears hacked in after it was too late for a change in > worldview. It's a mess that way, if you want to (as we do) > get MIME back out. The Outlook client right now "loses" all > attachments, and even loses the msg body if the msg has been > digitally signed (because it turns out Outlook does Yet > Another Entirely Different Thing for signed msgs, leaving the > two "normal" body properties empty and stuffing the body > *plus* the signature into Yet Another property).
Yeah, it's a mess, but I don't think that the classifier should assume that the message has SMTP headers at all, since many other MTAs exist (exchange, notes, etc...) Outlook wasn't designed with MIME in mind since exchange doesn't use MIME. > > The Outlook UI shows the headers as: > > By this do you mean View -> Options -> Internet headers? Yup. From tim.one@comcast.net Tue Nov 12 06:53:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 01:53:33 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: Message-ID: [Tim Stone] > Gotcha. You dudes are on top of things... ;) It's more that they're on top of us, and won't get off. > Wanna do some ocr stuff on referenced jpgs and gifs? ;;;) I > know I know... bad idea for any of a thousand reasons... It's a fine idea, if this kind of stuff becomes a problem we can't address in a cheaper way. So far, though, this kind of stuff hasn't had any luck fooling this system. For example, "jpg" and "gif" appearing in URL components have high spamprobs, and if the msg just consists of pointing at images, those high-spamprob gif and jpg tokens become a major part of the msg's total token count, and kill it. Assuming the headers didn't kill it already. From tim.one@comcast.net Tue Nov 12 07:12:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 02:12:27 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: [Mark Hammond] > Something that confuses me completely here is: > > * Outlook shows headers with blank lines, appearing to royally > screw things up. Yes. > * Our Outlook client simply appends the body(s) to the headers as a > simple string. Ditto. > * We pass this re-constituted string back into the email package, > and it too seems to screw up the header parsing! Ditto again. You're on a roll, Mark. > ie, Outlook shows the headers as: > > """ > ...
> X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) > FILETIME=[A1D8F920:01C1E2D4] > > --next_part_of_message > of_message > ... > """ > > And the traceback from the email package shows: > > "C:\Python22\spam\spambayes\email\Parser.py", line 105, in _parseheaders > raise Errors.HeaderParseError( > HeaderParseError: Not a header, not a continuation: ``of_message'' This won't make sense to you just yet , but look at the full traceback instead: Traceback (most recent call last): File "C:\Python22\spam\spambayes\Outlook2000\train.py", line 67, in train_folder if train_message(message, isspam, mgr): File "C:\Python22\spam\spambayes\Outlook2000\train.py", line 36, in train_message stream = msg.GetEmailPackageObject() File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 431, in GetEmailPackageObject msg = email.message_from_string(text) File "C:\Python22\spam\spambayes\email\__init__.py", line 39, in message_from_string return Parser(_class, strict=strict).parsestr(s) File "C:\Python22\spam\spambayes\email\Parser.py", line 52, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "C:\Python22\spam\spambayes\email\Parser.py", line 48, in parse self._parsebody(root, fp) File "C:\Python22\spam\spambayes\email\Parser.py", line 206, in _parsebody msgobj = self.parsestr(part) File "C:\Python22\spam\spambayes\email\Parser.py", line 52, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "C:\Python22\spam\spambayes\email\Parser.py", line 46, in parse self._parseheaders(root, fp) File "C:\Python22\spam\spambayes\email\Parser.py", line 105, in _parseheaders raise Errors.HeaderParseError( HeaderParseError: Not a header, not a continuation: ``of_message'' It's descending *into* the body when the error occurs, and at that point it's really talking about the MIME-section headers, not the message headers, starting with > --next_part_of_message > of_message as a distinct section. > Which seems very strange to me. 
Why is the email package > complaining about the "of_message" line, rather than itself stopping > header parsing after that blank? My guess is that it *did* stop after the first blank line, so far as the *message* headers were concerned. At this point it's looking at the headers in the individual MIME sections. I realize this still doesn't make sense to you, but it will very soon: > (Recall that the email package does not see the "Content-Type:" > header, as we remove that before sending it in.) That's what confused me at first too, but it isn't true here: we don't remove the Content-Type header until *after* email.message_from_string() returns a message. We never got that far in this case. > I assume I am simply missing how messages are parsed. Maybe, but it's irrelevant. By the time I'm stripping the MIME headers in the Outlook client, it's too late to do any good. I don't know how to do better, though (with minor effort) -- it's really a job for Barry. We've been saved so far because the email parser *is* lax by default, and doesn't complain about missing MIME armor. It does complain about MIME armor that makes no sense, though, and I've never seen that happen in any of my email. If we managed to get Content-Type out of the Outlook headers before calling message_from_string, there's no problem with this msg (I tried that -- it works -- but I removed Content-Type by hand with an editor, which isn't terribly scalable). From anthony@interlink.com.au Tue Nov 12 07:13:56 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 18:13:56 +1100 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: Message-ID: <200211120713.gAC7Du512404@localhost.localdomain> >>> Tim Peters wrote > We don't tokenize To: now because it gives good results for bad reasons on > mixed-source corpora. It would be good to have an option to tokenize it. > It appears that your code also tokenized Cc:; also fine.
I would rather see > the code added to the loop currently cracking "from" lines: > > for field in ('from',): > > so that we tokenize all address thingies in a uniform way. The option would > control the list of field names looped over there (default just from:, > optionally also to: and cc:). I've added this now. For me, tokenising just the 'from' line with the new 'address_headers' option gives (vs the old code): (all tests with 4 sets of 1200H/400S) filename: old_from new_from ham:spam: 4800:1600 4800:1600 fp total: 1 1 fp %: 0.02 0.02 fn total: 12 11 fn %: 0.75 0.69 unsure t: 86 88 unsure %: 1.34 1.38 real cost: $39.20 $38.60 best cost: $31.80 $32.40 h mean: 0.36 0.36 h sdev: 4.04 4.05 s mean: 98.25 98.25 s sdev: 8.93 8.99 mean diff: 97.89 97.89 k: 7.55 7.51 The old code's best cost was: -> achieved at ham & spam cutoffs 0.24 & 0.99 -> fp 0; fn 3; unsure ham 26; unsure spam 118 -> fp rate 0%; fn rate 0.188%; unsure rate 2.25% The new code's best cost was: -> largest ham & spam cutoffs 0.26 & 0.99 -> fp 0; fn 4; unsure ham 24; unsure spam 118 -> fp rate 0%; fn rate 0.25%; unsure rate 2.22% The one additional fn was a spam that was dragged from 0.35 to 0.21 because it came from 'update@localhost.net' - the 'update' was a strong spam clue. Where it gets more interesting is when I also tokenize to and cc: filename: new_from new_fromtocc ham:spam: 4800:1600 4800:1600 fp total: 1 1 fp %: 0.02 0.02 fn total: 4 5 fn %: 0.25 0.31 unsure t: 121 104 unsure %: 1.89 1.62 real cost: $38.20 $35.80 best cost: $32.40 $28.00 h mean: 0.36 0.31 h sdev: 4.05 3.80 s mean: 98.25 98.42 s sdev: 8.99 8.77 mean diff: 97.89 98.11 k: 7.51 7.81 We go from: -> largest ham & spam cutoffs 0.26 & 0.99 -> fp 0; fn 4; unsure ham 24; unsure spam 118 -> fp rate 0%; fn rate 0.25%; unsure rate 2.22% to -> largest ham & spam cutoffs 0.22 & 0.99 -> fp 0; fn 3; unsure ham 25; unsure spam 100 -> fp rate 0%; fn rate 0.188%; unsure rate 1.95% That's a total of 142->125 unsures. 
I'll accept that :) Just to make sure, ran with a different seed. filename: new_from2 new_fromtocc2 ham:spam: 4800:1600 4800:1600 fp total: 0 0 fp %: 0.00 0.00 fn total: 6 6 fn %: 0.38 0.38 unsure t: 110 97 unsure %: 1.72 1.52 real cost: $28.00 $25.40 best cost: $23.00 $19.20 h mean: 0.45 0.39 h sdev: 4.72 4.48 s mean: 98.44 98.56 s sdev: 8.82 8.62 mean diff: 97.99 98.17 k: 7.24 7.49 went from: -> largest ham & spam cutoffs 0.28 & 0.94 -> fp 0; fn 6; unsure ham 23; unsure spam 62 -> fp rate 0%; fn rate 0.375%; unsure rate 1.33% to -> largest ham & spam cutoffs 0.24 & 0.93 -> fp 0; fn 4; unsure ham 25; unsure spam 51 -> fp rate 0%; fn rate 0.25%; unsure rate 1.19% toemail:python.org and toemail:zope.org both show up in my 'best discriminators' list as _very_ strong ham clues (not suprising, given the mailing lists I'm on). My old/uncommon email addresses generally show up as strong strong spam clues (eg prob('toemail:arb') = 0.999356) Next, I tried it against a chunk of my horrible corpus - 4 (out of 10) sets of 1200H/400S (out of 3500H/1800S in each set) filename: info_from info_fromtocc ham:spam: 4800:1600 4800:1600 fp total: 6 7 fp %: 0.12 0.15 fn total: 4 4 fn %: 0.25 0.25 unsure t: 208 179 unsure %: 3.25 2.80 real cost: $105.60 $109.80 best cost: $78.00 $66.40 h mean: 3.05 2.63 h sdev: 10.88 10.12 s mean: 99.17 99.12 s sdev: 6.65 6.99 mean diff: 96.12 96.49 k: 5.48 5.64 That's -> achieved at ham & spam cutoffs 0.62 & 0.99 -> fp 5; fn 11; unsure ham 44; unsure spam 41 -> fp rate 0.104%; fn rate 0.688%; unsure rate 1.33% going to -> achieved at ham & spam cutoffs 0.62 & 0.99 -> fp 4; fn 12; unsure ham 36; unsure spam 36 -> fp rate 0.0833%; fn rate 0.75%; unsure rate 1.12% Anyway, the option's checked in and there, so go play. I'll run a full test of the horror corpus overnight... 
Anthony From tim.one@comcast.net Tue Nov 12 07:31:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 02:31:49 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <9891913C5BFE87429D71E37F08210CB929750D@zeus.sfhq.friskit.com> Message-ID: [Piers Haken] > I'm not sure that's the case. Outlook _always_ shows the MIME headers > below the SMTP headers in its 'internet headers' UI. My Outlook never does. Really! Never. I've trained on thousands of HTML spam here, and surely would have noticed this problem if it had ever popped up. So, there's some difference between our Outlooks. Which version are you using? I'm using Outlook 2000 SR-1, build 9.0.0.4201, Internet Mail Only configuration, and get email solely thru remote POP3 accounts. I'm *guessing* you've got yours configured for Corporate/Workgroup, which is said to change lots of stuff in undocumented ways. > For example, heres the 'headers' from another message which does render > correctly and that spambayes does parse correctly: Microsoft Mail Internet Headers Version 2.0 Received: from sccrmhc02.attbi.com ([204.127.202.62]) by zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.5329); Mon, 11 Nov 2002 11:22:27 -0800 Received: from Computer ([12.236.244.49]) by sccrmhc02.attbi.com (InterMail vM.4.01.03.27 201-229-121-127-20010626) with SMTP id <20021111191007.KEOD5251.sccrmhc02.attbi.com@Computer>; Mon, 11 Nov 2002 19:10:07 +0000 From: "Rebecca Whitworth" To: "Piers Haken" Cc: "Traci and Stephen Green" Subject: the green's car Date: Mon, 11 Nov 2002 11:15:54 -0800 Message-ID: MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_000_002F_01C28973.B386B770" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Return-Path: lesanctuaire@earthlink.net X-OriginalArrivalTime: 11 Nov 2002 19:22:27.0328 (UTC) 
FILETIME=[ABBFA800:01C289B7] ------=_NextPart_000_002F_01C28973.B386B770 Content-Type: multipart/alternative; boundary="----=_NextPart_001_0030_01C28973.B386B770" ------=_NextPart_001_0030_01C28973.B386B770 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 8bit ------=_NextPart_001_0030_01C28973.B386B770 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable ------=_NextPart_001_0030_01C28973.B386B770-- ------=_NextPart_000_002F_01C28973.B386B770 Content-Type: image/jpeg; name="image001.jpg" Content-Transfer-Encoding: base64 Content-ID: ------=_NextPart_000_002F_01C28973.B386B770-- > As you can see it's just showing everything but the contents of > the MIME parts. Yes, I see. I've never seen anything like that before, though. > I don't think there's any suggestion that these are _just_ the SMTP > headers, but the outlook plugin is treating them as such. Sure, because that's what the people who wrote the client have *seen* in their Outlooks. It's not like what MS does here is documented . > Maybe the outlook plugin should trim the non-SMTP parts from these > 'headers' before passing them to the classifier?? Looks like we don't have any choice about that now. > Yeah, it's a mess, but I don't think that the classifier should assume > that the message has SMTP headers at all, since many other MTA's exist > (exchange, notes, etc...) We don't. The value of the MAPI PR_TRANSPORT_MESSAGE_HEADERS property is magnificently ill-defined, to the point of complete uselessness, so this is purely poke-and-hope programming. The only other case we've *seen* before this is the one where PR_TRANSPORT_MESSAGE_HEADERS has no value, in which case there's code to try to *synthesize* some bare-bones headers out of the PR_SUBJECT, PR_DISPLAY_NAME, PR_DISPLAY_TO, and PR_DISPLAY_CC properties. Now some other case has popped up -- so it goes. > Outlook wasn't designed with MIME in mind ... I had figured that one out already . 
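[Editor's sketch, not part of the thread] The fallback mentioned above, synthesizing bare-bones headers when PR_TRANSPORT_MESSAGE_HEADERS has no value, might be sketched as follows. The function name, its arguments, and the mapping of each property to a header name (in particular PR_DISPLAY_NAME to From:) are assumptions made for illustration; no MAPI calls are shown, and the plugin's real code may differ.

```python
def synthesize_headers(subject, display_name, display_to, display_cc):
    """Build a minimal header block from already-fetched property values.

    The arguments stand in for the PR_SUBJECT, PR_DISPLAY_NAME,
    PR_DISPLAY_TO and PR_DISPLAY_CC values; which header each one
    feeds is an assumption of this sketch.
    """
    fields = [
        ("From", display_name),
        ("To", display_to),
        ("Cc", display_cc),
        ("Subject", subject),
    ]
    lines = []
    for name, value in fields:
        if value:
            # Collapse embedded line breaks so one property yields one line.
            lines.append("%s: %s" % (name, " ".join(value.split())))
    # A blank line terminates the header block.
    return "\n".join(lines) + "\n\n"


bare = synthesize_headers("the green's car", "Rebecca Whitworth",
                          "Piers Haken", "")
```

Empty properties are simply skipped, so the result is always a well-formed (if sparse) header block that the email package can parse.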
From anthony@interlink.com.au Tue Nov 12 07:37:32 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 18:37:32 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: <200211120737.gAC7bXX12617@localhost.localdomain> >>> Tim Peters wrote > [Mark Hammond] > > I believe the email package should give some consideration to the > > real world here. > > It tries to, starting in 2.2.2: > > http://www.python.org/doc/current/lib/node383.html > > The Parser class defaults to non-strict now, but as the docs say > > this doesn't mean MessageParseErrors are never raised; some ill- > formatted messages just can't be parsed > > I'm sure Barry would be willing to entertain this specific case as a bug > report. In theory, he reads this list, so should be shamed enough to do > that himself . Yah. The non-strict mode was initially my fault, because I wanted to be able to parse bad MIME. In this case, you're hitting a broken MIME subsection. That 'of_message' is nothing like a header at all - if the broken MIME subsection is supposed to be parsed, there should be a newline between the boundary and the subsection. The section of code in question has this comment: # Normal, non-continuation header. BAW: this should check to make # sure it's a legal header, e.g. doesn't contain spaces. Also, we # should expose the header matching algorithm in the API, and # allow for a non-strict parsing mode (that ignores the line # instead of raising the exception). Here's an (untested :) patch. Depending on how you want to handle these sorts of errors, uncomment either the 'break' or the 'continue' line. --- Parser.py 23 Sep 2002 13:18:55 -0000 1.1.1.1 +++ Parser.py 12 Nov 2002 07:34:40 -0000 @@ -98,9 +98,15 @@ if self._strict: raise Errors.HeaderParseError( "Not a header, not a continuation: ``%s''"%line) - elif lineno == 1 and line.startswith('--'): - # allow through duplicate boundary tags. 
- continue + elif lineno == 1: + if line.startswith('--'): + # allow through duplicate boundary tags. + continue + else: + # hack hack hack. We saw a non header. Either: + #continue # to ignore it silently. + # or + break # to treat the rest of the headers as body else: raise Errors.HeaderParseError( "Not a header, not a continuation: ``%s''"%line) I'm not comfortable that this should go into the core distribution of the email package - but the above comment about exposing the header matching API is a good one. I'll think about how to do this. Anthony From piersh@friskit.com Tue Nov 12 07:51:48 2002 From: piersh@friskit.com (Piers Haken) Date: Mon, 11 Nov 2002 23:51:48 -0800 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB929750E@zeus.sfhq.friskit.com> > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net] > Sent: Monday, November 11, 2002 11:32 PM > To: Piers Haken > Cc: David Leftley; spambayes@python.org > Subject: RE: [Spambayes] Re: Outlook plugin plus Exchange > > > [Piers Haken] > > I'm not sure that's the case. Outlook _always_ shows the MIME headers > > below the SMTP headers in its 'internet headers' UI. > > My Outlook never does. Really! Never. I've trained on > thousands of HTML spam here, and surely would have noticed > this problem if it had ever popped up. > > So, there's some difference between our Outlooks. Which > version are you using? I'm using Outlook 2000 SR-1, build > 9.0.0.4201, Internet Mail Only configuration, and get email > solely thru remote POP3 accounts. I'm > *guessing* you've got yours configured for > Corporate/Workgroup, which is said to change lots of stuff in > undocumented ways. Yeah, all my internet email comes via SMTP to my exchange server. It might be the corporate/workgroup setting that's making the difference, or it could be the exchange SMTP MTA. Ugh. Piers.
From piersh@friskit.com Tue Nov 12 07:55:57 2002 From: piersh@friskit.com (Piers Haken) Date: Mon, 11 Nov 2002 23:55:57 -0800 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <9891913C5BFE87429D71E37F08210CB9183A0C@zeus.sfhq.friskit.com> Yeah, the problem is twofold: not only are the MIME headers broken, but even if they weren't the content of that MIME header would be empty since the body parts are appended (by the outlook plugin) after the MIME headers: msgstore.py, line ~400: return "%s\n%s\n%s" % (headers, html, body) Piers. > -----Original Message----- > From: Anthony Baxter [mailto:anthony@interlink.com.au] > Sent: Monday, November 11, 2002 11:38 PM > To: Tim Peters > Cc: Mark Hammond; Piers Haken; Barry A. Warsaw; spambayes@python.org > Subject: Re: [Spambayes] Re: Outlook plugin plus Exchange > > > > >>> Tim Peters wrote > > [Mark Hammond] > > > I believe the email package should give some consideration to the > > > real world here. > > > > It tries to, starting in 2.2.2: > > > > http://www.python.org/doc/current/lib/node383.html > > > > The Parser class defaults to non-strict now, but as the docs say > > > > this doesn't mean MessageParseErrors are never raised; some ill- > > formatted messages just can't be parsed > > > > I'm sure Barry would be willing to entertain this specific case as a > > bug report. In theory, he reads this list, so should be shamed enough > > to do that himself. > > Yah. The non-strict mode was initially my fault, because I > wanted to be able to parse bad MIME. In this case, you're > hitting a broken MIME subsection. That 'of_message' is > nothing like a header at all - if the broken MIME subsection > is supposed to be parsed, there should be a newline between > the boundary and the subsection. > > The section of code in question has this comment: > > # Normal, non-continuation header.
BAW: this should check to make > # sure it's a legal header, e.g. doesn't contain spaces. Also, we > # should expose the header matching algorithm in the API, and > # allow for a non-strict parsing mode (that ignores the line > # instead of raising the exception). > > Here's an (untested :) patch. Depending on how you want to handle these > sorts of errors, uncomment either the 'break' or the 'continue' line. > > > --- Parser.py 23 Sep 2002 13:18:55 -0000 1.1.1.1 > +++ Parser.py 12 Nov 2002 07:34:40 -0000 > @@ -98,9 +98,15 @@ > if self._strict: > raise Errors.HeaderParseError( > "Not a header, not a continuation: ``%s''"%line) > - elif lineno == 1 and line.startswith('--'): > - # allow through duplicate boundary tags. > - continue > + elif lineno == 1: > + if line.startswith('--'): > + # allow through duplicate boundary tags. > + continue > + else: > + # hack hack hack. We saw a non header. Either: > + #continue # to ignore it silently. > + # or > + break # to treat the rest of the headers as body > else: > raise Errors.HeaderParseError( > "Not a header, not a continuation: ``%s''"%line) > > I'm not comfortable that this should go into the core > distribution of the email package - but the above comment > about exposing the header matching API is a good one. I'll > think about how to do this.
> > Anthony > From anthony@interlink.com.au Tue Nov 12 07:49:59 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 18:49:59 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <9891913C5BFE87429D71E37F08210CB9183A0C@zeus.sfhq.friskit.com> Message-ID: <200211120749.gAC7nxn12707@localhost.localdomain> >>> "Piers Haken" wrote > > Yeah, the problem is twofold: not only are the MIME headers broken, but > even if they weren't the content of that MIME header would be empty > since the body parts are appended (by the outlook plugin) after the MIME > headers: > > msgstore.py, line ~400: > return "%s\n%s\n%s" % (headers, html, body) No idea on that one - it's in the Outlook plugin code - I'm not going near that one. Mark-touched-it-last-he-gets-to-fix-it, Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Tue Nov 12 07:52:19 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 18:52:19 +1100 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: Message-ID: <200211120752.gAC7qJv12783@localhost.localdomain> >>> Tim Peters wrote > > First experiment was to make the URL tokenizer look for the string > > 'mailman' in the URL. If it was found, simply push the clue "url: > > Mailman URL" onto the clue-pile. This was an attempt to remove the > Can you try this again replacing "break" with "continue"? I can't believe > you intended break here -- it means that the first time we see a Mailman URL > in a msg, we stop looking for embedded URLs period. Spam could easily > exploit that.
--- tokenizer.py 12 Nov 2002 06:21:38 -0000 1.66 +++ tokenizer.py 12 Nov 2002 07:23:30 -0000 @@ -944,6 +944,11 @@ new_text.append(text[i : start]) new_text.append(' ') + if guts.find('mailman') != -1: + pushclue("url: Mailman URL") + i = end + continue + pushclue("proto:" + proto) # Lose the trailing punctuation for casual embedding, like: # The code is at http://mystuff.org/here? Didn't resolve. filename: new_fromtocc2 new_mailman2 ham:spam: 4800:1600 4800:1600 fp total: 0 0 fp %: 0.00 0.00 fn total: 6 5 fn %: 0.38 0.31 unsure t: 97 95 unsure %: 1.52 1.48 real cost: $25.40 $24.00 best cost: $19.20 $18.20 h mean: 0.39 0.42 h sdev: 4.48 4.59 s mean: 98.56 98.68 s sdev: 8.62 8.17 mean diff: 98.17 98.26 k: 7.49 7.70 before: -> largest ham & spam cutoffs 0.24 & 0.93 -> fp 0; fn 4; unsure ham 25; unsure spam 51 -> fp rate 0%; fn rate 0.25%; unsure rate 1.19% after: -> largest ham & spam cutoffs 0.24 & 0.94 -> fp 0; fn 3; unsure ham 27; unsure spam 49 -> fp rate 0%; fn rate 0.188%; unsure rate 1.19% It replaces a chunk of closely correlated ham clues, which has the expected result of pushing both ham and spam up slightly. This (for me) rescues one fn at the expense of a couple of extra unsure hams. This looks like a YMMV one. It's (for me) a marginal win. Anthony From Paul.Moore@atosorigin.com Tue Nov 12 09:39:27 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Tue, 12 Nov 2002 09:39:27 -0000 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DBC@UKDCX001.uk.int.atosorigin.com> From: Tim Stone - Four Stones Expressions > The whole problem I see with this is that µ$0pht could and most > likely will screw all these machinations up with the next release of > Outlook or Exchange... They have this great history of not caring > if their api changes, or system behavior changes, are backward > compatible. If we're having this level of difficulty now, get > ready...
:( But if we stick to a pretty trivial "On startup" hook which scans all new mail, along with a "New mail arrived" hook which filters mail as it arrives, then (1) we're covered, and (2) we aren't doing anything sufficiently complex that it's *likely* to get broken. (Not that MS can't break anything - Outlook.NET probably won't even have a COM addin interface...) Paul. From tim@fourstonesExpressions.com Tue Nov 12 09:45:26 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 12 Nov 2002 03:45:26 -0600 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DBC@UKDCX001.uk.int.atosorigin.com> Message-ID: 11/12/2002 3:39:27 AM, "Moore, Paul" wrote: >From: Tim Stone - Four Stones Expressions >> The whole problem I see with this is that µ$0pht could and most >> likely will screw all these machinations up with the next release of >> Outlook or Exchange... They have this great history of not caring >> if their api changes, or system behavior changes, are backward >> compatible. If we're having this level of difficulty now, get >> ready... :( > >But if we stick to a pretty trivial "On startup" hook which scans all >new mail, along with a "New mail arrived" hook which filters mail as >it arrives, then (1) we're covered, and (2) we aren't doing anything >sufficiently complex that it's *likely* to get broken. (Not that MS >can't break anything - Outlook.NET probably won't even have a COM >addin interface...) Yup.... 'xactly what I was thinkin. We'll have to maintain at least two versions of the plugin for some time to come, if not ad-infinitum. - TimS > >Paul. > > - Tim www.fourstonesExpressions.com From lists@webcrunchers.com Tue Nov 12 10:39:39 2002 From: lists@webcrunchers.com (John D.) Date: Tue, 12 Nov 2002 02:39:39 -0800 Subject: [Spambayes] How do I get the latest CVS? 
Message-ID:

I FINALLY set up a system I can use to start using some of the SpamBayes
work, but (sigh) I don't think I have access.

I'm trying to get the SpamBayes project from SourceForge. But I need a
password to get into it. I tried Anonymous, but that didn't work.
I only need read access to it.

CrunchBox# cvs -q get -P spambayes
The authenticity of host 'cvs.spambayes.sourceforge.net (216.136.171.202)' can't be established.
DSA key fingerprint is 02:ab:7c:aa:49:ed:0b:a8:50:13:10:c2:3e:92:0f:42.
Are you sure you want to continue connecting (yes/no)? yes\
Warning: Permanently added 'cvs.spambayes.sourceforge.net,216.136.171.202' (DSA) to the list of known hosts.
anonymous@cvs.spambayes.sourceforge.net's password:
Permission denied, please try again.

How do I get it?

John

From mhammond@skippinet.com.au Tue Nov 12 10:53:55 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 21:53:55 +1100
Subject: [Spambayes] How do I get the latest CVS?
In-Reply-To:
Message-ID:

> I'm trying to get the SpamBayes project from SourceForge. But I
> need a password to get into it. I tried Anonymous, but that
> didn't work.

As per the CVS instructions (http://sourceforge.net/cvs/?group_id=61702) the
anonymous password is empty - just press the enter key.

Mark.

From mhammond@skippinet.com.au Tue Nov 12 11:00:06 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 22:00:06 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To:
Message-ID:

[Tim Stone quoting Paul Moore]
> >sufficiently complex that it's *likely* to get broken. (Not that MS
> >can't break anything - Outlook.NET probably won't even have a COM
> >addin interface...)
>
> Yup.... 'xactly what I was thinkin. We'll have to maintain at least two
> versions of the plugin for some time to come, if not ad-infinitum.

This is unlikely for some time I believe.
MS don't piss-off the people required to generate their revenue (ie, corporates etc,) and as a rule go to huge pains to make software backward compatible. Windows is Windows for this reason, not because anyone wants it this way . We may end up with optimizations for later versions, but that is a different question. Mark. From anthony@interlink.com.au Tue Nov 12 11:01:56 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 22:01:56 +1100 Subject: [Spambayes] How do I get the latest CVS? In-Reply-To: Message-ID: <200211121101.gACB1uh02523@localhost.localdomain> >>> "Mark Hammond" wrote > > I'm trying to get the SpamBayes project from SourceForge. But I > > need a password to get into it. I tried Anonymous, but that > > didn't work. > > As per the CVS instructions (http://sourceforge.net/cvs/?group_id=61702) the > anonynous password is empty - just press the enter key. You should also be using pserver, not ext. Finally, you should use ssh protocol v1, not v2, for SF, if you are a developer. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From mhammond@skippinet.com.au Tue Nov 12 11:18:32 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 12 Nov 2002 22:18:32 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <200211120749.gAC7nxn12707@localhost.localdomain> Message-ID: [Anthony] > >>> "Piers Haken" wrote > > > > Yeha, the problem is twofolow: not only are the MIME headers broken, but > > even if they weren't the content of that MIME header would be empty > > since the body parts are appended (by the outlook plugin) after the MIME > > headers: > > > > Msgstory.py, line ~400: > > return "%s\n%s\n%s" % (headers, html, body) > > No idea on that one - it's in the Outlook plugin code - I'm not going > near that one. > > Mark-touched-it-last-he-gets-to-fix-it, Oh, if only it were that easy . 
The good news is that however this is fixed, it will also lend itself to a fix for the multipart/signed message abomination we are lumped with and I am partially stalled on. And I hereby retract any aspersions I cast upon the wonderful email package <0.1 wink> Mark. From anthony@interlink.com.au Tue Nov 12 11:28:33 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 12 Nov 2002 22:28:33 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: <200211121128.gACBSXR02787@localhost.localdomain> >>> "Mark Hammond" wrote > Oh, if only it were that easy . The good news is that however this is > fixed, it will also lend itself to a fix for the multipart/signed message > abomination we are lumped with and I am partially stalled on. Not familiar with that particular one - details? > And I hereby retract any aspersions I cast upon the wonderful email package > <0.1 wink> There's at least one "known to be broken" mail message out there - stuff from some mailer called Entourage. multipart/alternative nested inside a multipart/mixed, and both have the same boundary tag. Making this work with the current email package is extremely non-trivial. Go on, guess the vendor. Go on, I dare you. Anthony. From guido@python.org Tue Nov 12 12:55:17 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 12 Nov 2002 07:55:17 -0500 Subject: [Spambayes] Impersonation Message-ID: <200211121255.gACCtHb22270@pcp02138704pcs.reston01.va.comcast.net> More and more spam is impersonating real people rather than making up phoney AOL addresses. I just received a bounce quoting the following spam: > Message-ID: <000059425b35$0000481d$00007669@evertythingmail.net> > From: guido@python.org > Reply-To: Bob123@hotmail.com > To: clevelandindians@flash.net > Subject: 1hi29895 > Date: Tue, 12 Nov 2002 15:42:40 -0500 > MIME-Version: 1.0 > Content-Type: text/plain; > charset="iso-8859-1" > > g45, > > Do you own a Business? > > Do you need a Web Site? Let Us Design > it. 
> > We can Design a Web Site that's perfect for you. > > feedback@cdsymas.com h6 Surely this is copied from successful viruses. My wife startled me this weekend by telling me she couldn't run the virus protection program I had sent her. Of course, I had done no such thing. It was a Klez virus variant that disguises itself as a virus protection tool. The computer of a mutual friend must have been infected. :-( This shows how careful we have to be with whitelists... --Guido van Rossum (home page: http://www.python.org/~guido/) From msergeant@startechgroup.co.uk Tue Nov 12 13:02:57 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 12 Nov 2002 13:02:57 +0000 Subject: [Spambayes] Impersonation References: <200211121255.gACCtHb22270@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DD0FC01.6010206@startechgroup.co.uk> Guido van Rossum said the following on 12/11/02 12:55: > More and more spam is impersonating real people rather than making up > phoney AOL addresses. I just received a bounce quoting the following > spam: > > >>Message-ID: <000059425b35$0000481d$00007669@evertythingmail.net> >>From: guido@python.org >>Reply-To: Bob123@hotmail.com >>To: clevelandindians@flash.net >>Subject: 1hi29895 >>Date: Tue, 12 Nov 2002 15:42:40 -0500 >>MIME-Version: 1.0 >>Content-Type: text/plain; >> charset="iso-8859-1" >> >>g45, >> >>Do you own a Business? >> >> Do you need a Web Site? Let Us Design >>it. >> >>We can Design a Web Site that's perfect for you. >> >>feedback@cdsymas.com h6 > > Surely this is copied from successful viruses. Spam was doing this first, fwiw. Matt. 
From bkc@murkworks.com Tue Nov 12 13:46:42 2002 From: bkc@murkworks.com (Brad Clements) Date: Tue, 12 Nov 2002 08:46:42 -0500 Subject: [Spambayes] Introducing myself In-Reply-To: References: Message-ID: <3DD0BF07.5980.E5AE4C2@localhost> On 11 Nov 2002 at 21:16, Robert Woodhead wrote: > My hunch, based on things I've done in the past, is that as the total > volume of mail increases, the rate of increase in the number of > analysis on a quarter-gig of ham and spam I was seeing, IIRC, about > 300,000 distinct tokens (including the aforementioned gibberish). My training/testing set of ... 13,000 messages resulted in pickles with 320,000 words. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From Paul.Moore@atosorigin.com Tue Nov 12 15:09:09 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Tue, 12 Nov 2002 15:09:09 -0000 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com> From: Mark Hammond [mailto:mhammond@skippinet.com.au] > I believe the email package should give some consideration to the > real world here. While creating well-formed messages is clearly > mandatory, it is very frustrating when something exists in the > real world, is clearly invalid, but everything else in the world > has no trouble with it. Was the problem with duff headers, or invalid MIME sections in the body? If the latter, there is an option in the email package to not parse the body - instead of email.message_from_string(data), you can use email.Parser.Parser().parsestr(data, 1). IIRC, this treats the body as a single uninterpreted "payload", rather than as structured MIME parts. If the problem's with the headers, this won't help, though... Paul. 
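Paul's suggestion of skipping body parsing can be tried in isolation. What follows is only a sketch under stated assumptions: it uses a current Python 3, where the 2002-era spelling email.Parser.Parser().parsestr(data, 1) is written email.parser.Parser().parsestr(data, headersonly=True). The sample message and the helper name parse_both_ways are made up for illustration and do not come from the thread.

```python
from email.parser import Parser

# Made-up sample message (an assumption, not the mail discussed in the thread).
RAW = (
    "From: sender@example.com\n"
    "To: rcpt@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="next_part"\n'
    "\n"
    "--next_part\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--next_part--\n"
)


def parse_both_ways(raw):
    """Hypothetical helper: parse the same text fully and headers-only."""
    # Full parse: the body is split into MIME parts.
    full = Parser().parsestr(raw)
    # Headers-only parse: the body is kept as one uninterpreted string.
    headers_only = Parser().parsestr(raw, headersonly=True)
    return full, headers_only


full, headers_only = parse_both_ways(RAW)
print(full.is_multipart(), type(full.get_payload()).__name__)
print(headers_only.is_multipart(), type(headers_only.get_payload()).__name__)
print(headers_only["Subject"])
```

In the headers-only case the headers remain available as usual, while the MIME boundaries simply stay inside the payload string, so a broken body cannot trip up the parser. As Paul notes, this does nothing for broken headers.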
From Paul.Moore@atosorigin.com Tue Nov 12 15:26:11 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Tue, 12 Nov 2002 15:26:11 -0000 Subject: [Spambayes] Re: Outlook plugin plus Exchange Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DBF@UKDCX001.uk.int.atosorigin.com> From: Tim Peters [mailto:tim.one@comcast.net] >> As you can see it's just showing everything but the contents of >> the MIME parts. > > Yes, I see. I've never seen anything like that before, though. I see that too (on an Exchange server, ie Corporate/Workgroup configuration). It looks like it's an Exchange-specific thing. Paul. From tim.one@comcast.net Wed Nov 13 01:04:31 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 20:04:31 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: [Mark Hammond] > Oh, if only it were that easy . The good news is that > however this is fixed, it will also lend itself to a fix for the > multipart/signed message abomination we are lumped with and I am > partially stalled on. Good news -- I managed to fix it without making the multipart/signed silliness even one iota easier. OK, "fix" is a strong claim. I cut off the head, boltied it to my dashboard, and buried the torso in a different state. > And I hereby retract any aspersions I cast upon the wonderful > email package <0.1 wink> It should do better on this one -- patches accepted . From tim.one@comcast.net Wed Nov 13 01:29:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 20:29:51 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > Was the problem with duff headers, or invalid MIME sections in the > body? The latter. > If the latter, there is an option in the email package to not > parse the body - instead of email.message_from_string(data), you can > use email.Parser.Parser().parsestr(data, 1). 
IIRC, this treats the
> body as a single uninterpreted "payload", rather than as structured
> MIME parts.

I'm not sure it helps in this case, and I don't understand what it's doing.
Feeding the original msg (reconstructed from the diagnostic output) into
parsestr(data, True) yields a Message object m with a pretty bizarre string
representation:

>>> print m.as_string()
X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from inet-mail7.oracle.com ([209.246.10.171]) by zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.4453); Sat, 13 Apr 2002 03:19:01 -0700
Received: from blaster-smtp.oracle.com (eblast01.oracleeblast.com [148.87.9.11])g3DA8GV30065 for PIERSH@FRISKIT.COM; Sat, 13 Apr 2002 03:08:16 -0700
Date: Sat, 13 Apr 2002 03:08:16 -0700
Message-Id: <200204131008.g3DA8GV30065@inet-mail7.oracle.com>
Subject: Oracle University iSeminars
From: Oracle Corporation
To: PIERSH@FRISKIT.COM
Reply-To: replies@oracleeblast.com
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="next_part_of_message"
Return-Path: replies@oracleeblast.com
X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4]

--next_part_of_message

--next_part_of_message--

>>>

The mystery there is why those MIME boundaries show up after "the real"
headers. They're not reflected in the header count:

>>> len(m)
14
>>>

and they're not payload, preamble, or epilogue either:

>>> `m.get_payload()`
'None'
>>> m.preamble
>>> m.epilogue
>>>

The type may or may not be expected:

>>> m.get_type()
'multipart/alternative'
>>>

The reason I think it *may* be unexpected is that I thought it was a Message
invariant that the type is multipart if and only if the payload is a list.
m doesn't think it's multipart despite its type:

>>> m.is_multipart()
0
>>>

but in *that* case the docs say the payload is a string (which None is not).

Barry, is this a sane Message object, or an insane one?
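The invariant Tim describes (multipart type if and only if the payload is a list, otherwise a string payload) can be checked mechanically. The following is a minimal sketch, assuming Python 3's email.message.Message rather than the 2002 email package. payload_problems is a hypothetical helper, not part of the email package, and the "insane" object is built by hand to mimic the state shown above instead of being produced by the parser.

```python
from email.message import Message


def payload_problems(msg):
    """Hypothetical helper: list the ways msg breaks the payload invariant."""
    problems = []
    payload = msg.get_payload()
    # A message whose declared type is multipart/* should carry a list of parts.
    if msg.get_content_maintype() == "multipart" and not isinstance(payload, list):
        problems.append("declared multipart, but payload is not a list")
    # A message that is not a list of parts should carry a string payload.
    if not isinstance(payload, (list, str)):
        problems.append("payload is neither a list nor a string")
    return problems


# Hand-built stand-in for the object in Tim's session: multipart type, payload None.
insane = Message()
insane["Content-Type"] = 'multipart/alternative; boundary="next_part_of_message"'

# An ordinary single-part message for comparison (defaults to text/plain).
sane = Message()
sane.set_payload("hello")

print(payload_problems(insane))
print(payload_problems(sane))
```

A check like this could run on each parsed message before tokenizing, so that odd objects are reported instead of silently yielding no body tokens.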
From tim.one@comcast.net Wed Nov 13 02:09:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 12 Nov 2002 21:09:28 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: <200211121128.gACBSXR02787@localhost.localdomain> Message-ID: [Mark Hammond] > Oh, if only it were that easy . The good news is that > however this is fixed, it will also lend itself to a fix for the > multipart/signed message abomination we are lumped with and I am > partially stalled on. [Anthony Baxter] > Not familiar with that particular one - details? You don't want to know -- it's an Outlook thing. Outlook doesn't speak MIME natively. Incoming msgs are broken apart and sprayed into any number of "properties", which latter have nothing to do with MIME. Usually. The plain text part of a msg is usually stored in one property, and the HTML part in another. But in the case of a multipart/signed msg both those "normal body parts" are empty, and the msg body is hiding in yet another property that's usually otherwise empty. But it's not just the msg body hiding there then, it's also some disconnected MIME armor which also includes the signature part. Then it starts to get messy . From tim@fourstonesExpressions.com Wed Nov 13 02:13:14 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 12 Nov 2002 20:13:14 -0600 Subject: [Spambayes] Corpus modules Message-ID: I've been working with Richie Hindle to create modules that are useful for managing corpora for his pop3proxy. There is a Corpus class, a Message class, and a MessageFactory class, with subclasses that add persistence into a file system as text or gzip files in subdirectories. There's a Trainer class that observes Corpus instances and untrains/trains a bayes database as messages are moved between them. (Corpus is defined simply as a collection of messages). 
I've also got a BayesHelper class, that adds persistence to a Bayes object, that is imported from classifier (Bayes) or hammie(PersistentBayes). Assuming that I can get these things checked in sometime soon, they may be useful outside of the pop3proxy. I see some overlap with Messages and the msgs.py module. Also, the BayesHelper thing really doesn't belong in the Corpus.py module. So there's the context of my question(s) ;) Now for the questions. Hammie has interesting PersistentBayes and DB_Dict classes, with some helper functions for bayes object creation. It seems to me that a more cogent class hierarchy is called for, with Bayes being the abstract class, PersistentBayes being an abstract subclass, and subclasses of that for particular persistence mechanisms, like PickleBayes, ZODBBayes, DBDictBayes, etc. etc. It doesn't make a lot of sense to me to have the Bayes class in classifier and the PersistentBayes class in hammie... It would seem much more consistent to me to have a Bayes.py module, with all the bayes database classes. There might be a lot of momentum behind the hammie.py module, perhaps too much to change directions now, but hammie doesn't tell me much about what this module is really for, and when I look in it, I don't see much coherence either. The current scheme that I have in Corpus is to have a trainer object that knows about its Bayes object, and trains it in response observed message movement events. This is mainly a hack. It would be better for these bayes objects to be able to be the Corpus observers, and forget about this artificial Trainer object. Right now, my Message objects are fairly dumb. They simply wrap entire messages, which are used for training. It seems as if the training methods on Bayes create objects from msgs.py which have a lot more smarts in them, like 'gimme the headers', 'gimme the body', 'gimme a wordstream', etc. 
However, my Message objects have some attributes that are specifically useful for the pop3proxy handling of incoming pop3 mail, specifically persistence. Should these two classes be merged, could the msgs.py objects become more useful for the pop3proxy, or could my Message class become more broadly useful? It seems that the current msgs classes are useful for test training, and for deep within the bowels of the training algorithms, but would not be too useful for the pop3proxy... So I guess in summary, I propose that we create a Bayes.py module with guts from the current classifier and hammie modules, and make a Message class that's broadly useful, both for corpus management and for training... It's my itch, so I'm willing to scratch it, but what do the rest of you think? Musings of a latecomer to the party... - TimS www.fourstonesExpressions.com From lists@webcrunchers.com Wed Nov 13 05:29:36 2002 From: lists@webcrunchers.com (John D.) Date: Tue, 12 Nov 2002 21:29:36 -0800 Subject: [Spambayes] CVS Access.... Message-ID: Still trying to get the CVS Library... I use these following commands.... # export CVSROOT=anonymous@cvs.spambayes.sourceforge.net:/cvsroot/ # cd /usr # cvs -q get -P spambayes It then prompts me for a password. I try "anonymous" for password, but it still won't allow me access. If these are the wrong options for the "cvs" command, would someone please enlighten me with the right ones? I assume the "cvs" command gets it, and puts it into the "/usr/spambayes" directory. If I'm wrong, I would like to know how. John From barry@wooz.org Wed Nov 13 05:29:53 2002 From: barry@wooz.org (Barry A. Warsaw) Date: Wed, 13 Nov 2002 00:29:53 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange References: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com> Message-ID: <15825.58193.998340.448311@gargle.gargle.HOWL> >>>>> "TP" == Tim Peters writes: TP> but in *that* case the docs say the payload is a string (which TP> None is not). 
    TP> Barry, is this a sane Message object, or an insane one?

It's insane, but it may not entirely be your fault . I think
Parser.parse() should call root.set_payload('') when headersonly is
True. That ensures the Message invariant will hold.

I'll add a test and check a patch into the Python 2.3 cvs. I'll also
update the docs to make it clear that the file pointer is left at the
first body line when parsing only the headers. (I'm not sure what to
do with any firstbodyline that might get passed from _parseheaders()
though).

-Barry

From tim.one@comcast.net Wed Nov 13 05:57:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 00:57:50 -0500
Subject: [Spambayes] CVS Access....
In-Reply-To:
Message-ID:

[John D.]
> Still trying to get the CVS Library... I use these following
> commands....
>
> # export CVSROOT=anonymous@cvs.spambayes.sourceforge.net:/cvsroot/
> # cd /usr
> # cvs -q get -P spambayes
>
> It then prompts me for a password. I try "anonymous" for
> password, but it still won't allow me access.

John, you got two replies earlier today. Have you seen them?

    http://mail.python.org/pipermail/spambayes/2002-November/001807.html
    http://mail.python.org/pipermail/spambayes/2002-November/001809.html

They're just repeating the instructions at

    http://sourceforge.net/cvs/?group_id=61702

for "Anonymous CVS Access". The instructions work, but you have to do what
they say . As they say, there is no password, so don't give one. Just
hit Enter at the prompt, without typing anything. But also use the cvs
commands in the instructions -- there's no need to guess about anything here
(except that the "modulename" variable in the instructions is indeed
spambayes, as you've guessed already).

From barry@python.org Wed Nov 13 05:44:52 2002
From: barry@python.org (Barry A.
Warsaw) Date: Wed, 13 Nov 2002 00:44:52 -0500 Subject: [Spambayes] Re: Outlook plugin plus Exchange References: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com> <15825.58193.998340.448311@gargle.gargle.HOWL> Message-ID: <15825.59092.274040.453726@gargle.gargle.HOWL> >>>>> "BAW" == Barry A Warsaw writes: BAW> It's insane, but it may not entirely be your fault . I BAW> think Parser.parse() should call root.set_payload('') when BAW> headersonly is True. That ensures the Message invariant will BAW> hold. Hmm, that wasn't as easy as I though. I'll sleep on it. -Barry From tim.one@comcast.net Wed Nov 13 06:44:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 13 Nov 2002 01:44:02 -0500 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: <200211120713.gAC7Du512404@localhost.localdomain> Message-ID: [Anthony Baxter, tokenizing mail-address headers] > I've added this now. For me, tokenising just the 'from' line > with the new 'address_headers' option gives (vs the old code): > > (all tests with 4 sets of 1200H/400S) > > filename: old_from > new_from > ham:spam: 4800:1600 > 4800:1600 > fp total: 1 1 > fp %: 0.02 0.02 > fn total: 12 11 > fn %: 0.75 0.69 > unsure t: 86 88 > unsure %: 1.34 1.38 > real cost: $39.20 $38.60 > best cost: $31.80 $32.40 > h mean: 0.36 0.36 > h sdev: 4.04 4.05 > s mean: 98.25 98.25 > s sdev: 8.93 8.99 > mean diff: 97.89 97.89 > k: 7.55 7.51 > > The old code's best cost was: > -> achieved at ham & spam cutoffs 0.24 & 0.99 > -> fp 0; fn 3; unsure ham 26; unsure spam 118 > -> fp rate 0%; fn rate 0.188%; unsure rate 2.25% > > The new code's best cost was: > -> largest ham & spam cutoffs 0.26 & 0.99 > -> fp 0; fn 4; unsure ham 24; unsure spam 118 > -> fp rate 0%; fn rate 0.25%; unsure rate 2.22% > > The one additional fn was a spam that was dragged from 0.35 to > 0.21 because it came from 'update@localhost.net' - the 'update' > was a strong spam clue. 
Well, regardless of reason, the best cost got worse, and it did on my c.l.py test too, but also by a trivial amount. I fiddled the tokenization of this field until it did better again, so please make sure I didn't screw you too badly . Something that helped: it now generates log-count "no real name" metatokens too for address headers without real-name parts. 'from:no real name:2**0' 0.933186 became one of the 40 most-frequent discriminators in my c.l.py data then, and is a strong spam clue. The good news is that it raised my lowest-scoring spam from near 0.20 to over 0.27, so at ham_cutoff=0.20 (which I'm using on the c.l.py test), I have no spam close to being called ham anymore. The bad news is that it gave me another FP, but it's one of those useless msgs I don't care about (a two-word "confirm 12345" msg from a first-time poster sent to a wrong address, using a free email acct that inserted advertising at the bottom of the msg -- it's always been on the edge). > Where it gets more interesting is when I also tokenize to and cc: I would hope so . > filename: new_from > new_fromtocc > ham:spam: 4800:1600 > 4800:1600 > fp total: 1 1 > fp %: 0.02 0.02 > fn total: 4 5 > fn %: 0.25 0.31 > unsure t: 121 104 > unsure %: 1.89 1.62 > real cost: $38.20 $35.80 > best cost: $32.40 $28.00 > h mean: 0.36 0.31 > h sdev: 4.05 3.80 > s mean: 98.25 98.42 > s sdev: 8.99 8.77 > mean diff: 97.89 98.11 > k: 7.51 7.81 > > > We go from: > -> largest ham & spam cutoffs 0.26 & 0.99 > -> fp 0; fn 4; unsure ham 24; unsure spam 118 > -> fp rate 0%; fn rate 0.25%; unsure rate 2.22% > > to > -> largest ham & spam cutoffs 0.22 & 0.99 > -> fp 0; fn 3; unsure ham 25; unsure spam 100 > -> fp rate 0%; fn rate 0.188%; unsure rate 1.95% > > That's a total of 142->125 unsures. I'll accept that :) Yup, it's a small win. I can't use it my c.l.py test, but should be able to on the general python.org corpus (plus, of course, my own email). > Just to make sure, ran with a different seed. ... 
[and another small win] ... BTW, you should make sure the seeds aren't close together. For example, using seed 123 one time, and 124 the next, will give a lot of msg overlap. > toemail:python.org and toemail:zope.org both show up in > my 'best discriminators' list as _very_ strong ham clues > (not suprising, given the mailing lists I'm on). Well, that's also going to make the spam that slips thru that much harder to catch. Of course, after Greg deploys this system, there won't be any more spam slipping thru . > My old/uncommon email addresses generally show up as strong strong > spam clues (eg prob('toemail:arb') = 0.999356) Cool! From mhammond@skippinet.com.au Wed Nov 13 06:49:44 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 13 Nov 2002 17:49:44 +1100 Subject: [Spambayes] Some more experiences with the Outlook plugin In-Reply-To: Message-ID: > [Paul Moore] > > * Following on from this, I also see Tim's behaviour of surprising > > unsure cases (or worse, false negatives!). Worst case recently was a > > message which scored as solid ham. I trained on it as "Spam", and > > rescored it. It still scored 5 - solid ham. > > [Mark Hammond] > > This too was my experience. For a while, I did training over a huge > > ham corpus, and spam is still less than 1000 messages. I had around > > 15:1 ham:spam. I too trained new ham and spam, and was dissappointed > > to see the score remain almost identical. > Almost identical or exactly identical? Almost exactly identical . I can't recall for sure, and wasn't actually playing with bayes - just sorting through mail before trying to do something productive. I'll get back to playing with this stuff, but only after I get back to the client itself . > still heavily hapax-driven teensy classifier, the auto-rescore feature of > the Outlook client never seemed to change my scores either, and for a > hapax-driven classifier that's bizarre. 
It turns out that was because it > actually didn't change scores: the probabilities didn't get updated after > training on the reclassified msg, so "the new score" was in fact exactly > equal to "the old score". I just checked in a fix for that (unique to the > Outlook client). Hrm - I could have sworn I saw the scores change in quite a few cases. But as I said, this is hardly a controlled environment. You should see my desk . And to compound things, I am seeing messages I don't understand from my "delete as spam/recover from spam" functions - I suspect they are broken as I see "already trained as spam" when, eg, training a new unsure as spam. Quickly eyeballed the code and it looks OK - haven't debugged yet. > So it would be good to retain the old database for concurrent scoring > purposes until the new one is ready to use, or it would be good to delay > scoring msgs until training is complete. I've refrained from "doing > something" about this because it seems like it would be easy to do after > some mechanism is in place for scanning for unrated msgs at startup (i.e., > folder events could be disabled for the duration of from-scratch training, > then re-enabled after, and the scan-for-unrated machinery kicked > into action > again). Well, threads wouldn't be a bad infrastructure to use . Extended MAPI is documented as being thread-safe (which, of course, may just result in serialization). I understand that we still have the same issue with a full re-train, so I only mention this to ask if now is also a good time to implement our own locks or threading strategy. In the very least, it couldn't hurt to spin off the pickle loading, and anywhere else people complain we hurt (eg, future bulk deletes or moves, etc) A simple queue would even suffice - not much needs to be synchronous at the moment, if anything (assuming asynch usually means "almost synch" ). 
Asynch message filtering wont fly in the lower-level message hooking functions we are eyeballing though :( Outlook itself certainly does plenty in the background (and currently shows 14 threads for me). Eg, the unread message counts in the folder view and "Outlook Shortcuts" panes can take a few seconds before they show up (during which time Outlook is running just fine) Mark. From tim.one@comcast.net Wed Nov 13 05:45:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 13 Nov 2002 00:45:30 -0500 Subject: [Spambayes] Some more experiences with the Outlook plugin In-Reply-To: Message-ID: [Paul Moore] > * Following on from this, I also see Tim's behaviour of surprising > unsure cases (or worse, false negatives!). Worst case recently was a > message which scored as solid ham. I trained on it as "Spam", and > rescored it. It still scored 5 - solid ham. [Mark Hammond] > This too was my experience. For a while, I did training over a huge > ham corpus, and spam is still less than 1000 messages. I had around > 15:1 ham:spam. I too trained new ham and spam, and was dissappointed > to see the score remain almost identical. Almost identical or exactly identical? I wasn't looking over your shoulders, so it's hard to guess . I've been noticing that, in my still heavily hapax-driven teensy classifier, the auto-rescore feature of the Outlook client never seemed to change my scores either, and for a hapax-driven classifier that's bizarre. It turns out that was because it actually didn't change scores: the probabilities didn't get updated after training on the reclassified msg, so "the new score" was in fact exactly equal to "the old score". I just checked in a fix for that (unique to the Outlook client). BTW, another buglet here looks harder to fix: if you do a retrain from scratch in the client, all email that comes in *while* training is in progress gets scored at exactly 50. 
That's because the database being built isn't useful until it's done being built, but is used for scoring during the rebuild process. It won't blow up, but every word has unknown_word_prob before .update_probabilities() gets called at the end. So it would be good to retain the old database for concurrent scoring purposes until the new one is ready to use, or it would be good to delay scoring msgs until training is complete. I've refrained from "doing something" about this because it seems like it would be easy to do after some mechanism is in place for scanning for unrated msgs at startup (i.e., folder events could be disabled for the duration of from-scratch training, then re-enabled after, and the scan-for-unrated machinery kicked into action again). From rob@hooft.net Wed Nov 13 07:51:14 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Wed, 13 Nov 2002 08:51:14 +0100 Subject: [Spambayes] Outlook plugin - training References: Message-ID: <3DD20472.5080103@hooft.net> Tim Peters wrote: > Now for another extreme: after 10 startup msgs, the system trains itself on > its own decisions, except that: > > 1. Unsures are correctly classified by the user. > 2. False negatives are correctly classified by the user. > > But false positives are trained on *as spam*, assuming the user never looks > at their spam folder. That takes a long time to run, because > update_probabilities() is called after every msg. After 2,100 msgs, > > 2100 trained:1181H+919S wrds:59659 fp:0 fn:0 unsure:26 > > and the unsures are growing very slowly now (at 1400 msgs there were 25 > unsures). Now THIS is the way I'd like to go! I think this is approximately the minimum effort we can expect from lazy users (like myself). Sometimes, a fp might actually be corrected by the user at some point, but testing it the way you did should be giving the minimal possible performance of a minimal-impact system that would not require much training to begin with. 
There is one catch: what if the first 10 messages are all ham or all spam? Shouldn't we require at least a few of each? How would this work to start on a mailing list? I guess we could deliver spambayes with 5 "representative recent spam" (or a URL where they can be found). The mailing list would moderate the first few messages to the list, and then the filter will kick in. If a message is "spam", it can be returned to the sender, saying that the message has been judged inappropriate by the filter based on wording. "ham" can be posted without moderator approval. And all "unsure" messages are held for approval. The approval interface could have a separate "Spam" classification, but that is not really necessary: anything "inappropriate" can go in the spam corpus. For "fn"s, the archives should have the options to delete a message as spam. For now my MUA is so badly integrated that I have yet to train a second time.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From francois.granger@free.fr Wed Nov 13 13:13:16 2002 From: francois.granger@free.fr (Fran=?ISO-8859-1?B?5w==?=ois Granger) Date: Wed, 13 Nov 2002 14:13:16 +0100 Subject: [Spambayes] Corpus modules In-Reply-To: Message-ID: on 13/11/02 3:13, Tim Stone - Four Stones Expressions at tim@fourstonesExpressions.com wrote: > Hammie has interesting PersistentBayes and DB_Dict classes, with some helper > functions for bayes object creation. It seems to me that a more cogent class > hierarchy is called for, with Bayes being the abstract class, PersistentBayes > being an abstract subclass, and subclasses of that for particular persistence > mechanisms, like PickleBayes, ZODBBayes, DBDictBayes, etc. etc. I was thinking of hacking the DB mechanisme to split the load between two databases (using anydbm) to reduce access to each one and to make them more accessible from outside. The scoring module needs only the second one. The training module would update both. 
I suspected that a major redesign was underway. Here is the proposed split.

{'word': ['ltime',     # when this record was last modified
          'spamcount', # of spams in which this word appears
          'hamcount',  # of hams in which this word appears
          ]
}
{'word': ['atime',     # when this record was last used by scoring(*)
          'killcount', # of times this made it to spamprob()'s nbest
          'spamprob',  # prob(spam | msg contains this word)
          ]
}

A 'dirty' flag could be added to the first so that a batch update of the
second would recalculate only the dirty records.

--
Mail is a means of communication. People should ask themselves about the
political implications of the choices (or non-choices) of their tools and
technologies. For clean mail:
--

From anthony@interlink.com.au Wed Nov 13 13:24:57 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 14 Nov 2002 00:24:57 +1100
Subject: [Spambayes] A couple of small tokenizer experiments.
In-Reply-To:
Message-ID: <200211131324.gADDOvu16292@localhost.localdomain>

>>> Tim Peters wrote
> Well, regardless of reason, the best cost got worse, and it did on my c.l.py
> test too, but also by a trivial amount. I fiddled the tokenization of this
> field until it did better again, so please make sure I didn't screw you too
> badly .

Seems fine. In this case, the trivial amount worse was kinda necessary
(imho) to allow us to get a whole lot of other cheap wins.

> Something that helped: it now generates log-count "no real name" metatokens
> too for address headers without real-name parts.
> 'from:no real name:2**0' 0.933186

I'll give this a go, see how it helps me.

> BTW, you should make sure the seeds aren't close together. For example,
> using seed 123 one time, and 124 the next, will give a lot of msg overlap.

I think I tend to use 12345 and 23456 - should be far enough apart.
> > toemail:python.org and toemail:zope.org both show up in > > my 'best discriminators' list as _very_ strong ham clues > > (not suprising, given the mailing lists I'm on). > > Well, that's also going to make the spam that slips thru that much harder to > catch. Of course, after Greg deploys this system, there won't be any more > spam slipping thru . That's the theory, yes. Of course, if Greg doesn't deploy this, then all the sophisticated new techniques that spammers will be forced to try will leave poor old spamassassin terribly confused, and the amount of spam getting through it will fix the solution for us :) Anthony From anthony@interlink.com.au Wed Nov 13 13:26:52 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 14 Nov 2002 00:26:52 +1100 Subject: [Spambayes] Re: Outlook plugin plus Exchange In-Reply-To: Message-ID: <200211131326.gADDQqQ16335@localhost.localdomain> >>> Tim Peters wrote > You don't want to know -- it's an Outlook thing. Outlook doesn't speak MIME > natively. [snip tale of horror] Then it starts to get messy . And you run this mailer why, again? For fun? Or some sort of masochistic desire to see email messages mangled beyond all belief? if-I-want-my-email-mangled-I'll-run-Lotus-Notes-thanks, Anthony From popiel@wolfskeep.com Wed Nov 13 16:59:50 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 13 Nov 2002 08:59:50 -0800 Subject: [Spambayes] Corpus modules In-Reply-To: Message from Fran=?ISO-8859-1?B?5w==?=ois Granger References: Message-ID: <20021113165950.477C9F53E@cashew.wolfskeep.com> In message: writes: > >I was thinking of hacking the DB mechanisme to split the load between two >databases (using anydbm) to reduce access to each one and to make them more >accessible from outside. The scoring module needs only the second one. The >training module would update both. I suspected that a major redesign was >underway. Here the proposed split. 
>{'word': ['ltime',     # when this record was last modified
>          'spamcount', # of spams in which this word appears
>          'hamcount',  # of hams in which this word appears
>          ]
>}
>{'word': ['atime',     # when this record was last used by scoring(*)
>          'killcount', # of times this made it to spamprob()'s nbest
>          'spamprob',  # prob(spam | msg contains this word)
>          ]
>}
>
>A 'dirty' flag could be added to the first so that a batch update of the
>second would recalculate only the dirty records.

I am in the process of doing a very similar split, although I've (for
my private stuff) made a few simplifications:

1. I don't keep track of modification and access times. Nothing
   references them, and I'm more in favor of the aging methods which
   keep the actual wordlists for messages around until the message
   as a whole is slated for untraining.

2. I don't keep track of killcounts. Again, nothing references them,
   and I really don't care which clues are being used a lot.

Also, when a training (or untraining) event occurs, I completely
trash the second database. This is warranted in most cases, since
the number of spam and/or ham has changed, and thus (almost) all
the spamprobs are invalidated. This saves us from needing a dirty
flag.

As I score messages, I fetch spamprobs from the second database,
and if they aren't there, I compute them based on the first database.
(If the words aren't in the first database either, then just use
the unknown word probability and don't bother storing in the second
database.)

Initial tests show a 4% speed hit on large batch training and
testing. On the other hand, it speeds up the 'score one, train one'
runs immensely.

I've got a few bugs yet, and it's rather intrusive... which is why
I haven't checked it in.
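
A minimal sketch of the scheme described above, not the unchecked-in code itself. Everything named here is hypothetical: the class `LazySpamProbs`, the two in-memory dicts standing in for the two databases, and the smoothing formula, which is assumed to be a Robinson-style adjustment with an unknown-word probability of 0.5 and strength of 0.45. Untraining is left out to keep it short.

```python
class LazySpamProbs:
    """Counts in one mapping, derived spamprobs in a throwaway cache."""

    def __init__(self, unknown_word_prob=0.5, unknown_word_strength=0.45):
        self.counts = {}   # word -> [spamcount, hamcount]; the "first database"
        self.probs = {}    # word -> spamprob; the "second database"
        self.nspam = 0
        self.nham = 0
        self.unknown_word_prob = unknown_word_prob          # assumed default
        self.unknown_word_strength = unknown_word_strength  # assumed default

    def train(self, words, is_spam):
        """Record one message, then discard every cached spamprob."""
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[0 if is_spam else 1] += 1
        # The message totals changed, so (almost) every cached value is stale.
        # Dropping the whole cache avoids touching each record individually.
        self.probs.clear()

    def spamprob(self, word):
        """Return the cached spamprob, computing and caching it on demand."""
        if word in self.probs:
            return self.probs[word]
        record = self.counts.get(word)
        if record is None:
            # Never-seen word: use the default and do not store it.
            return self.unknown_word_prob
        spamcount, hamcount = record
        spamratio = spamcount / self.nspam if self.nspam else 0.0
        hamratio = hamcount / self.nham if self.nham else 0.0
        prob = spamratio / (spamratio + hamratio)
        # Assumed smoothing: pull low-evidence words toward the default.
        n = spamcount + hamcount
        s = self.unknown_word_strength
        prob = (s * self.unknown_word_prob + n * prob) / (s + n)
        self.probs[word] = prob
        return prob
```

With this shape, a single retrain costs one cache reset plus recomputation for only the words that later messages actually contain.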
- Alex From piersh@friskit.com Wed Nov 13 17:50:47 2002 From: piersh@friskit.com (Piers Haken) Date: Wed, 13 Nov 2002 09:50:47 -0800 Subject: [Spambayes] Corpus modules Message-ID: <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com> > -----Original Message----- > From: T. Alexander Popiel [mailto:popiel@wolfskeep.com]=20 > > Also, when a training (or untraining) event occurs, I=20 > completely trash the second database. This is warranted in=20 > most cases, since the number of spam and/or ham has changed,=20 > and thus (almost) all the spamprobs are invalidated. This=20 > saves us from needing a dirty flag. Ouch, isn't this overly expensive for retraining a single message? Piers. From popiel@wolfskeep.com Wed Nov 13 17:46:24 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 13 Nov 2002 09:46:24 -0800 Subject: [Spambayes] Corpus modules In-Reply-To: Message from "Piers Haken" <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com> References: <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com> Message-ID: <20021113174625.055ACF53E@cashew.wolfskeep.com> In message: <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com> "Piers Haken" writes: > >> -----Original Message----- >> From: T. Alexander Popiel [mailto:popiel@wolfskeep.com]=20 >> >> Also, when a training (or untraining) event occurs, I=20 >> completely trash the second database. This is warranted in=20 >> most cases, since the number of spam and/or ham has changed,=20 >> and thus (almost) all the spamprobs are invalidated. This=20 >> saves us from needing a dirty flag. > >Ouch, isn't this overly expensive for retraining a single message? No, not really. That's the whole point; throwing away the entire database is a lot cheaper than touching every record individually, which is what update_probabilities does. I then compute the spamprobs on demand, instead of doing all of them regardless of if they're used. 
If you don't throw away the old spamprobs in some form when you (re)train a message, then you're getting invalid results from the scoring mechanism. The mechanism I outlined achieves correctness in the face of dynamically changing training data with less than a 5% speed penalty, worst case. - Alex From tim@fourstonesExpressions.com Wed Nov 13 23:35:06 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 13 Nov 2002 17:35:06 -0600 Subject: [Spambayes] Bayes Training Message-ID: <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> It occurs to me that perhaps *outgoing* mail might be a source of ham training. With the presence of the smtp proxy, we *could* train the database on mail that a user sends, presuming that mail that looks like mail that a person sends is unlikely to be spam... - Tim www.fourstonesExpressions.com From popiel@wolfskeep.com Wed Nov 13 23:50:59 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 13 Nov 2002 15:50:59 -0800 Subject: [Spambayes] Bayes Training In-Reply-To: Message from Tim Stone - Four Stones Expressions <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> References: <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> Message-ID: <20021113235059.C4B55F53E@cashew.wolfskeep.com> In message: <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> writes: >It occurs to me that perhaps *outgoing* mail might be a source of ham >training. With the presence of the smtp proxy, we *could* train the database >on mail that a user sends, presuming that mail that looks like mail that a >person sends is unlikely to be spam... Not so good, if we're parsing From addresses... one common spammer tactic is to make the mail appear to be coming from yourself. Training on a lot of data coming from the user would eliminate that as a spam clue... In any case, given the ham:spam ratios recently bandied about, I don't think there's really a problem finding sufficient ham from other sources. 
;-) - Alex From tim@fourstonesExpressions.com Thu Nov 14 00:06:33 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 13 Nov 2002 18:06:33 -0600 Subject: [Spambayes] Bayes Training In-Reply-To: <20021113235059.C4B55F53E@cashew.wolfskeep.com> Message-ID: <95YLG95KGXVLKI1TC0EDFBS4XVSJH.3dd2e909@riven> 11/13/2002 5:50:59 PM, "T. Alexander Popiel" wrote: >In message: <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> > writes: >>It occurs to me that perhaps *outgoing* mail might be a source of ham >>training. With the presence of the smtp proxy, we *could* train the database >>on mail that a user sends, presuming that mail that looks like mail that a >>person sends is unlikely to be spam... > >Not so good, if we're parsing From addresses... one common spammer >tactic is to make the mail appear to be coming from yourself. >Training on a lot of data coming from the user would eliminate >that as a spam clue... > Yeah, parsing on from: would be a problem, but the smtpproxy could easily strip the from header out, or all the headers for that matter, before sending it for training. It seems very likely to me that the words I use in my mail are those that I would tend to want my database to weigh in the favor of ham... >In any case, given the ham:spam ratios recently bandied about, >I don't think there's really a problem finding sufficient ham >from other sources. ;-) I'm not completely convinced that the ham:spam that we're discussing are reflective of the average email user. I think people commonly experience 1:15 or 1:20 ratios... perhaps even more... we've been discussing much lower ratios if I recall correctly... 
- TimS > >- Alex > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Thu Nov 14 01:23:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 13 Nov 2002 20:23:27 -0500 Subject: [Spambayes] Outlook users should update Message-ID: I just checked in what should fix the last of a few related training bugs in the Outlook client. Incremental training (Train Now without selecting "Rebuild entire database") is much faster (msgs that have already been trained on are skipped over at light speed now). The default options have changed to exploit Anthony Baxter's new tokenization code for From, To, Cc, Sender, and Reply-to headers (Bad Idea if you're using mixed-source data, but, I think, if you're using Outlook you've got single-source data pretty much by definition). You don't have to retrain your database from scratch, but I recommend that you do. From anthony@interlink.com.au Thu Nov 14 02:25:16 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 14 Nov 2002 13:25:16 +1100 Subject: [Spambayes] A couple of small tokenizer experiments. In-Reply-To: Message-ID: <200211140225.gAE2PGk25367@localhost.localdomain> >>> Tim Peters wrote > Something that helped: it now generates log-count "no real name" metatokens > too for address headers without real-name parts. > > 'from:no real name:2**0' 0.933186 I saw 'from:no real name:2**0' 0.683287 'reply-to:no real name:2**0' 0.873138 in the horror corpus. > Yup, it's a small win. I can't use it my c.l.py test, but should be able to > on the general python.org corpus (plus, of course, my own email). 
On the nasty corpus,

filename:   shout_from shout_fromccetc
ham:spam:    5000:2500       5000:2500
fp total:           10               7
fp %:             0.20            0.14
fn total:            5               5
fn %:             0.20            0.20
unsure t:          297             257
unsure %:         3.96            3.43
real cost:     $164.40         $126.40
best cost:      $99.80          $76.60
h mean:           4.12            3.53
h sdev:          12.63           11.53
s mean:          99.49           99.47
s sdev:           5.33            5.46
mean diff:       95.37           95.94
k:                5.31            5.65

Goes from:

-> best cost for all runs: $99.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.71 & 0.99
->     fp 7; fn 15; unsure ham 37; unsure spam 37
->     fp rate 0.14%; fn rate 0.6%; unsure rate 0.987%

to:

-> best cost for all runs: $76.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.69 & 0.99
->     fp 5; fn 14; unsure ham 29; unsure spam 34
->     fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
-> largest ham & spam cutoffs 0.7 & 0.99
->     fp 5; fn 14; unsure ham 29; unsure spam 34
->     fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%

From IMarvinTPA@bigfoot.com Thu Nov 14 02:46:58 2002
From: IMarvinTPA@bigfoot.com (IMarvinTPA)
Date: Wed, 13 Nov 2002 21:46:58 -0500
Subject: [Spambayes] Spam DB
Message-ID: <001c01c28b88$19c2b8c0$767ba8c0@Destruction>

Hi,

I was looking over the project and read that you haven't found many spam
databases. I have nearly 2100 messages in my spam folder in Outlook
Express. (I know I can't use OE with SpamBayes.) If you have any interest
in it, I can provide it to you. These messages are from 10/99 to today.
The general rule I used to put mail into this folder (and I review it) is
my e-mail address not being in either To or CC.

Thanks,
Andy Bay aka IMarvinTPA
ICQ:1432002
My Homepage: http://imarvintpa.selfhost.com/
INTP http://www.keirsey.com/
Your personality Andy Bay, is comprised of Evokateur, Sage, and Evokateur
styles. (http://www.ansir.com/)
"Wedgies rush in where angels fear to smurf."

From stuart@bmsi.com Thu Nov 14 02:30:56 2002
From: stuart@bmsi.com (Stuart D.
Gathman) Date: Wed, 13 Nov 2002 21:30:56 -0500 (EST) Subject: [Spambayes] Milter wrinkles Message-ID: I am looking for ways to integrate bayesian filtering of some kind with the Python Milter: http://www.bmsi.com/python/milter.html First, there is the difficulty of statistics being preferrably user specific. Is this a show stopper for this kind of filtering at the milter level? How could the system get feedback from the users? Is this simply an inappropriate thing to do at this level? Second, a milter would like to hang up on spammers as soon as possible. This is why a blacklist of spam domains is valuabl - although it only stops a small percentage, they are stopped immediately before many resources are used. I had the thought that the bayesian analysis could be applied to the headers only. Then, email with very spammy headers could be rejected without bothering with the body. I'll have to experiment with how effective this is. -- Stuart D. Gathman Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154 "Confutatis maledictis, flamis acribus addictis" - background song for a Microsoft sponsored "Where do you want to go from here?" commercial. From anthony@interlink.com.au Thu Nov 14 03:19:56 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 14 Nov 2002 14:19:56 +1100 Subject: [Spambayes] Milter wrinkles In-Reply-To: Message-ID: <200211140319.gAE3JvF25731@localhost.localdomain> >>> "Stuart D. Gathman" wrote > First, there is the difficulty of statistics being preferrably user > specific. Is this a show stopper for this kind of filtering at the milter > level? How could the system get feedback from the users? Is this simply > an inappropriate thing to do at this level? It depends on how closely coupled your user's interests are. 
You will need to do ham training on representative emails from all users - otherwise you could end up, say, with one of your users being interested in playing the bass guitar, and suddenly all of your users would be getting spam from those steenking bass guitar manufacturers. I guess it would be possible to have separate databases for each user, and use that when you get the RCPT TO: header. > Second, a milter would like to hang up on spammers as soon as possible. > This is why a blacklist of spam domains is valuabl - although it only > stops a small percentage, they are stopped immediately before many > resources are used. No-one's really done that much work on this yet. I think GregW has the python.org mailer set up so it grabs the entire message, checks it with spamassassin, and if it's completely spammy, it produces an error and drops the message. Greg can probably fill in more details here. > I had the thought that the bayesian analysis could be applied to the > headers only. Then, email with very spammy headers could be rejected > without bothering with the body. I'll have to experiment with how > effective this is. There's been some work in this area, but not an enormous amount. If you're letting them get past the DATA SMTP command, you may as well pull down the entire message rather than just the headers. Anthony From richie@entrian.com Thu Nov 14 09:45:59 2002 From: richie@entrian.com (richie@entrian.com) Date: Thu, 14 Nov 2002 09:45:59 +0000 Subject: [Spambayes] Spam DB In-Reply-To: <001c01c28b88$19c2b8c0$767ba8c0@Destruction> Message-ID: Hi Andy, > (I know I can't use OE with SpamBayes.) You can't use the Outlook pieces, but you can use the POP3 proxy. You configure OE to collect mail from localhost rather than your POP3 server, configure the proxy to connect to your POP3 proxy (see the [pop3proxy] section of Options.py for what to put into your bayescustomize.ini) and it then adds an X-Hammie-Disposition: Yes|No|Unsure header to each email. 
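
A minimal sketch of what a client-side rule keyed on that header amounts to, assuming the header value starts with one of the three words given above. The function name `folder_for` and the folder names are hypothetical; only the standard `email` module is used.

```python
from email import message_from_string


def folder_for(raw_message):
    """Pick a destination folder from the X-Hammie-Disposition header."""
    msg = message_from_string(raw_message)
    value = (msg.get("X-Hammie-Disposition") or "").strip().lower()
    if value.startswith("yes"):
        return "Spam"
    if value.startswith("unsure"):
        return "Unsure"
    # "No", or no header at all (mail that bypassed the proxy).
    return "Inbox"
```

Treating a missing header as ordinary mail is a design choice here: mail that never passed through the proxy is left alone rather than hidden.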
OE can then be set up to filter on that header. You can train it via a web interface (http://localhost:8880/ by default) or on the command line using hammie. The web interface only allows you to train it one message at a time at the moment, so you're probably best using hammie. Sorry there isn't any better documentation, but it's still early days! -- Richie Hindle richie@entrian.com From Alexander@Leidinger.net Thu Nov 14 11:25:11 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Thu, 14 Nov 2002 12:25:11 +0100 Subject: [Spambayes] Milter wrinkles In-Reply-To: <200211140319.gAE3JvF25731@localhost.localdomain> References: <200211140319.gAE3JvF25731@localhost.localdomain> Message-ID: <20021114122511.5b5e3f15.Alexander@Leidinger.net> On Thu, 14 Nov 2002 14:19:56 +1100 Anthony Baxter wrote: > > >>> "Stuart D. Gathman" wrote > > First, there is the difficulty of statistics being preferrably user > > specific. Is this a show stopper for this kind of filtering at the milter > > level? How could the system get feedback from the users? Is this simply > > an inappropriate thing to do at this level? > > It depends on how closely coupled your user's interests are. You will > need to do ham training on representative emails from all users - otherwise > you could end up, say, with one of your users being interested in playing > the bass guitar, and suddenly all of your users would be getting spam from > those steenking bass guitar manufacturers. I guess it would be possible to > have separate databases for each user, and use that when you get the > RCPT TO: header. While we are at it: - global database (may be not existent for a given setup) - per domain database (--"--) - per user database (--"--) - per "+-feature" database (--"--)... but I don't realy think this item may be usefull So root could add some default spams (e.g. no porn spam please), and also add some per domain defaults (no asian mails for one domain, but not for the other). 
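
One possible reading of the layered databases listed above, sketched as a most-specific-wins lookup. This is an assumption: merging counts across levels would be another reading. The names `candidate_keys` and `pick_database`, and the use of a plain dict as a registry of databases, are hypothetical.

```python
def candidate_keys(recipient):
    """List lookup keys for a recipient, most specific first."""
    local, _, domain = recipient.lower().partition("@")
    user, plus, _detail = local.partition("+")
    keys = []
    if plus:
        # The "+-feature" level, e.g. alice+lists@example.org
        keys.append(("address", local + "@" + domain))
    keys.append(("user", user + "@" + domain))
    keys.append(("domain", domain))
    keys.append(("global", ""))
    return keys


def pick_database(recipient, databases):
    """Return the most specific database that exists, or None."""
    for key in candidate_keys(recipient):
        if key in databases:
            return databases[key]
    return None
```

Any level may be absent in a given setup; the lookup simply falls through to the next one.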
With these features a per-user database may not be needed in some cases.

Bye,
Alexander.

--
Reboot America.
http://www.Leidinger.net    Alexander @ Leidinger.net
GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7

From fgranger@teleprosoft.com Thu Nov 14 09:47:06 2002
From: fgranger@teleprosoft.com (François Granger)
Date: Thu, 14 Nov 2002 10:47:06 +0100
Subject: [Spambayes] Mail with problem
Message-ID:

The enclosed file contains a mail which, when received or trained through
pop3proxy, gives me the following error:
(MacOS 9.1, 24 MB memory for Python 2.2.1)

When receiving:

Loading database... Done.
BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.
error: uncaptured python exception, closing channel <__main__.ServerLineReader connected at 0x6c9f9f0> (exceptions.RuntimeError:maximum recursion limit exceeded [HD:Python 2.2.1:Lib:asyncore.py|poll|95] [HD:Python 2.2.1:Lib:asyncore.py|handle_read_event|392] [HD:Python 2.2.1:Lib:asynchat.py|handle_read|130] [HD:Dev:spambayes:pop3proxy.py|found_terminator|192] [HD:Dev:spambayes:pop3proxy.py|onServerLine|260] [HD:Dev:spambayes:pop3proxy.py|onResponse|315] [HD:Dev:spambayes:pop3proxy.py|onTransaction|413] [HD:Dev:spambayes:pop3proxy.py|onRetr|460] [HD:Dev:spambayes:classifier.py|chi2_spamprob|234] [HD:Dev:spambayes:classifier.py|_getclues|459] [HD:Dev:spambayes:sets.py|__init__|374] [HD:Dev:spambayes:sets.py|_update|333] [HD:Dev:spambayes:tokenizer.py|tokenize|1008] [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])

When training on it:

Loading database... Done.
BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.
error: uncaptured python exception, closing channel <__main__.UserInterface connected at 0x6b93910> (exceptions.RuntimeError:maximum recursion limit exceeded [HD:Python 2.2.1:Lib:asyncore.py|poll|95] [HD:Python 2.2.1:Lib:asyncore.py|handle_read_event|392] [HD:Python 2.2.1:Lib:asynchat.py|handle_read|112] [HD:Dev:spambayes:pop3proxy.py|found_terminator|670] [HD:Dev:spambayes:pop3proxy.py|onRequest|695] [HD:Dev:spambayes:pop3proxy.py|onTrain|786] [HD:Dev:spambayes:classifier.py|learn|296] [HD:Dev:spambayes:classifier.py|_add_msg|411] [HD:Dev:spambayes:sets.py|__init__|374] [HD:Dev:spambayes:sets.py|_update|333] [HD:Dev:spambayes:tokenizer.py|tokenize|1008] [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])

Salutations, Francois Granger
--
fgranger@teleprosoft.com - tel: +33 1 41 88 48 00 - Fax: + 33 1 41 88 48 48

-------------- next part --------------
[binary attachment omitted: zip archive containing art.txt]

From Paul.Moore@atosorigin.com Thu Nov 14 15:51:49 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 14 Nov 2002 15:51:49 -0000
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com>

Here's a first cut at filtering any new unread messages when Outlook
starts up (important for Exchange or IMAP users).

It works, at least for my limited testing, but there are a couple of
points I think need looking at. So far, I've tested it manually - the
crunch comes tomorrow morning, when I'll have received my overnight
dose of spam :-)

First, the msg.unread property doesn't seem to be set right - the
diagnostic output shows my code having filtered all of my inbox, not
just the unread messages. This, if true, is obviously a bug. (On the
other hand, it may be a bug in my code - I don't understand that stuff
too well...) I can't see a simple way of checking this.

Maybe I should only filter unread, unscored messages - the attached
code doesn't do this, as I would have to wait for new mail to arrive
if I were skipping scored messages. I'll add that once the basic thing
is working.

Second, an efficiency point - I'm going through the whole inbox via
GetMessageGenerator(). In my case, my inbox has 365 messages, with
only 4 unread. I was going to use the MAPI Find/FindNext methods, but
the msgstore code doesn't expose them. Would it be worth having a
method for just scanning unread messages (it could be used in the
filter dialog, too)?

Anyway, any comments would be appreciated.

Paul.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: addin.patch
Type: application/octet-stream
Size: 1832 bytes
Desc: addin.patch
Url : http://mail.python.org/pipermail/spambayes/attachments/20021114/f19b3b41/addin.exe

From tim.one@comcast.net Thu Nov 14 16:12:23 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 11:12:23 -0500
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com>
Message-ID:

[Moore, Paul]
> Here's a first cut at filtering any new unread messages when Outlook
> starts up (important for Exchange or IMAP users).

Note that Mark Hammond already checked in something toward the same end.

> ...
> First, the msg.unread property doesn't seem to be set right - the
> diagnostic output shows my code having filtered all of my inbox, not
> just the unread messages. This, if true, is obviously a bug. (On the
> other hand, it may be a bug in my code - I don't understand that stuff
> too well...) I can't see a simple way of checking this.

Me neither.

> Maybe I should only filter unread, unscored messages - the attached
> code doesn't do this, as I would have to wait for new mail to arrive
> if I were skipping scored messages. I'll add that once the basic thing
> is working.
>
> Second, an efficiency point - I'm going through the whole inbox via
> GetMessageGenerator(). In my case, my inbox has 365 messages, with
> only 4 unread.

Mark added code to display time consumed by the startup scan. On my lean
Work mailbox, it's like so:

Processing 0 missed spam in folder 'Inbox' took 1.86141ms
Processing 0 missed spam in folder 'Zope' took 6.21308ms
Processing 0 missed spam in folder 'Bayes' took 1.91225ms
Processing 0 missed spam in folder 'Checkins' took 0.590019ms

The Zope folder there had over 400 pending msgs, and 6ms to note >400
previously scored msgs is nothing.
> I was going to use the MAPI Find/FindNext methods, but the msgstore code > doesn't expose them. Would it be worth having a method for just scanning > unread messages (it could be used in the filter dialog, too)? See the new GetNewUnscoredMessageGenerator() (in msgstore.py). Mark uses a method now that sucks out 70 unread and unscored msgs per MAPI call. From Paul.Moore@atosorigin.com Thu Nov 14 16:31:54 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Thu, 14 Nov 2002 16:31:54 -0000 Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DCD@UKDCX001.uk.int.atosorigin.com> From: Tim Peters [mailto:tim.one@comcast.net] > [Moore, Paul] >> Here's a first cut at filtering any new unread messages when Outlook >> starts up (important for Exchange or IMAP users). > > Note that Mark Hammond already checked in something toward the same end. Drat. Must have been since last night, since I did a CVS checkout then. (checks) yes it was. Oh, well, it was a learning exercise. Paul. From sjoerd@acm.org Thu Nov 14 16:37:06 2002 From: sjoerd@acm.org (Sjoerd Mullender) Date: Thu, 14 Nov 2002 17:37:06 +0100 Subject: [Spambayes] updating email package Message-ID: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> Does anybody mind if I update the email package with the current version from the Python CVS? I noticed that I received emails that can't be properly parsed by the version that's in spambayes (it raises an exception, resulting in the fallback behavior) but that can be parsed by the version in the Python CVS. -- Sjoerd Mullender From tim.one@comcast.net Thu Nov 14 16:40:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 11:40:42 -0500 Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DCD@UKDCX001.uk.int.atosorigin.com> Message-ID: >> Note that Mark Hammond already checked in something toward the same end.
[Moore, Paul] > Drat. Must have been since last night, since I did a CVS checkout then. > (checks) yes it was. Well, Mark lives in Australia (if you can call that living ), so "last night" means something perverse to him. Anyone developing code for this project should subscribe to the checkins mailing list too: http://mail.python.org/mailman/listinfo/spambayes-checkins Then you'll be the first on your continent to get breaking news. From tim.one@comcast.net Thu Nov 14 16:47:35 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 11:47:35 -0500 Subject: [Spambayes] updating email package In-Reply-To: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> Message-ID: [Sjoerd Mullender] > Does anybody mind if I update the email package with the current > verion from the Python CVS? Yes, that's a big -1. Barry intends to delete the email pkg from this project. It doesn't belong here. People working w/ current CVS will then get the latest without effort. People using older versions of Python will need to hammer out a scheme for installing the standalone email pkg, at http://mimelib.sf.net/ > I noticed that I received emails that can't be properly parsed by the > version that's in spambayes (it raises an exception, resulting in the > fallback behavior) but that can be parsed by the version in the Python > CVS. I believe it . From skip@pobox.com Thu Nov 14 16:49:28 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 14 Nov 2002 10:49:28 -0600 Subject: [Spambayes] read-only DBDict in hammie? Message-ID: <15827.54296.548473.905264@montanaro.dyndns.org> I'd like to share the anydbm file between several accounts on my machine. Before I fiddle hammie.py so it opens the file in read-only mode, is there any reason when classifying (not training) it actually needs to update the file? 
There's a __del__ method in PersistentBayes which does this:

    def __del__(self):
        #super.__del__(self)
        self.save_state()

    def save_state(self):
        self.wordinfo[self.statekey] = (self.nham, self.nspam)

When classifying there's no reason that nham or nspam would change, right? Skip From jeremy@alum.mit.edu Thu Nov 14 16:56:59 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Thu, 14 Nov 2002 11:56:59 -0500 Subject: [Spambayes] updating email package In-Reply-To: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> Message-ID: <15827.54747.182021.325839@slothrop.zope.com> It would be *much* better to remove the email package from spambayes and let users install a copy of whatever version they want in their python site-packages. I am using CVS python to run my spam filter, and I have to manually delete the email package every time I do a CVS update. (Well, technically, I deleted it once and replaced it with an empty directory named email. So I get a bunch of complaints every time I do a CVS update about the directory being in the way.) I can't think of any good reason to include an email package with spambayes. It only leads to complexity for people who have a version of email installed on their pythonpath. Jeremy From barry@python.org Thu Nov 14 16:59:19 2002 From: barry@python.org (Barry A. Warsaw) Date: Thu, 14 Nov 2002 11:59:19 -0500 Subject: [Spambayes] updating email package References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> Message-ID: <15827.54887.396770.384358@gargle.gargle.HOWL> >>>>> "SM" == Sjoerd Mullender writes: SM> Does anybody mind if I update the email package with the SM> current verion from the Python CVS? You might want the version from mimelib.sf.net, which is at an officially stable release of 2.4.3. I'm working on a 2.5 release, which while it passes all the tests, still needs some tweaking before it's ready to go out.
-Barry From barry@python.org Thu Nov 14 17:07:26 2002 From: barry@python.org (Barry A. Warsaw) Date: Thu, 14 Nov 2002 12:07:26 -0500 Subject: [Spambayes] updating email package References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> Message-ID: <15827.55374.656571.993448@gargle.gargle.HOWL> >>>>> "TP" == Tim Peters writes: TP> Yes, that's a big -1. Barry intends to delete the email pkg TP> from this project. I'm deleting it now. -Barry From tim.one@comcast.net Thu Nov 14 17:16:47 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 12:16:47 -0500 Subject: [Spambayes] read-only DBDict in hammie? In-Reply-To: <15827.54296.548473.905264@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I'd like to share the anydbm file between several accounts on my machine. > Before I fiddle hammie.py so it opens the file in read-only mode, is there > any reason when classifying (not training) it actually needs to update the > file? Scoring currently updates .killcount and .atime members in WordInfo records. If you're not using them for anything, you don't care. > ... > When classifying there's no reason that nham or nspam would change, right? Correct. From tim.one@comcast.net Thu Nov 14 17:25:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 12:25:53 -0500 Subject: [Spambayes] Milter wrinkles In-Reply-To: <200211140319.gAE3JvF25731@localhost.localdomain> Message-ID: [Anthony Baxter] > ... > No-one's really done that much work on this yet. I think GregW has the > python.org mailer set up so it grabs the entire message, checks it with > spamassassin, and if it's completely spammy, it produces an error and > drops the message. Greg can probably fill in more details here. He's probably tired of that by now . python.org does lots of stuff, including rejecting msgs based on some smoking-gun header scanning. 
In particular, email is rejected at once if the headers indicate it uses a character set that's out of favor, or passed through a country that's out of favor. If it was email to a tech mailing list, he probably could (but doesn't) reject email just for having multipart/* or text/html type. From tim@fourstonesExpressions.com Thu Nov 14 17:46:57 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 14 Nov 2002 11:46:57 -0600 Subject: [Spambayes] read-only DBDict in hammie? In-Reply-To: Message-ID: 11/14/2002 11:16:47 AM, Tim Peters wrote: >[Skip Montanaro] >> I'd like to share the anydbm file between several accounts on my machine. >> Before I fiddle hammie.py so it opens the file in read-only mode, is there >> any reason when classifying (not training) it actually needs to update the >> file? I'm using the DBDict class in hammie for doing training with the pop3proxy. Can we make a read-only option, rather than making it always open for read? On a related note, should DBDict actually have it's own module, rather than be part of hammie? - TimS > >Scoring currently updates .killcount and .atime members in WordInfo records. >If you're not using them for anything, you don't care. > >> ... >> When classifying there's no reason that nham or nspam would change, right? > >Correct. 
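The read-only option being asked for here can be sketched roughly as follows. This is a hypothetical illustration, not hammie's actual DBDict: the class name DBDictSketch, the mode argument's default, and the sample key and counts are all assumptions, and the standard dbm module stands in for the anydbm module of the time.

```python
import dbm
import os
import pickle
import tempfile


class DBDictSketch:
    """Pickle-backed dict over a dbm file, with an explicit open mode."""

    def __init__(self, path, mode="c"):
        # 'c' creates the file if needed and allows writes; 'r' is read-only.
        self.mode = mode
        self.hash = dbm.open(path, mode)

    def __getitem__(self, key):
        return pickle.loads(self.hash[key.encode("utf-8")])

    def __setitem__(self, key, val):
        if self.mode == "r":
            raise TypeError("database was opened read-only")
        self.hash[key.encode("utf-8")] = pickle.dumps(val, 1)

    def close(self):
        self.hash.close()


path = os.path.join(tempfile.mkdtemp(), "hammie_sketch")

trainer = DBDictSketch(path, "c")      # training: read/write
trainer["viagra"] = (1, 25)            # illustrative (ham, spam) counts
trainer.close()

scorer = DBDictSketch(path, "r")       # classifying: read-only
counts = scorer["viagra"]
try:
    scorer["viagra"] = (0, 0)
    write_refused = True if False else False
except TypeError:
    write_refused = True
scorer.close()
```

As noted in the quoted reply, scoring updates per-word bookkeeping (.killcount and .atime), so a scorer opened this way would simply not record those.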
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From grobinson@transpose.com Thu Nov 14 18:50:39 2002 From: grobinson@transpose.com (Gary Robinson) Date: Thu, 14 Nov 2002 13:50:39 -0500 Subject: [Spambayes] I thought this was an interesting spam article Message-ID: http://maccentral.macworld.com/news/0211/14.spam.php --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim@fourstonesExpressions.com Thu Nov 14 19:08:00 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 14 Nov 2002 13:08:00 -0600 Subject: Fwd: Re: [Spambayes] I thought this was an interesting spam article Message-ID: <52NK72MIYVNIIFID3XGD4YJFWUUOF0OK.3dd3f490@riven> The article originated at Computerworld, and their publication of the article at http://computerworld.com/softwaretopics/software/groupware/story/0,10801,75737,00.html has several very interesting sidebars, particularly the one named "The Other Side" -TimS 11/14/2002 12:50:39 PM, Gary Robinson wrote: > >http://maccentral.macworld.com/news/0211/14.spam.php > >--Gary > > >-- >Gary Robinson >CEO >Transpose, LLC >grobinson@transpose.com >207-942-3463 >http://www.emergentmusic.com >http://radio.weblogs.com/0101454 > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com -------- End of forwarded message -------- - Tim www.fourstonesExpressions.com From tim.one@comcast.net Thu Nov 14 19:24:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 14:24:58 -0500 Subject: [Spambayes] Mail with problem In-Reply-To: Message-ID: [Francois Granger] > The enclosed file contains a mail wich when received or trained throught > pop3prowy
give me the following error: > > (MacOS 9.1 24 Mo memory for Python 2.2.1) > ... > [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254]) Looks like the regular expression engine runs out of (C) stack space while trying to find HTML tags to strip. I don't know enough about Macs to suggest something specific, but in general you have to do whatever it takes to convince the OS to give the program more stack space to work with. Short of that, reducing the instances of "2048" in html_re in tokenizer.py should make the problem go away, but since C stack space limits are platform-specific, it's impossible to say how small "is safe" for you without simply trying it over and over until the error goes away. From tim@fourstonesExpressions.com Thu Nov 14 18:21:54 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 14 Nov 2002 12:21:54 -0600 Subject: [Spambayes] Mozilla.org using bayesian spam filtering Message-ID: Anybody know anything about this? Doesn't look like our technology... http://www.mozilla.org/mailnews/spam.html - Tim www.fourstonesExpressions.com From popiel@wolfskeep.com Thu Nov 14 19:39:32 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 14 Nov 2002 11:39:32 -0800 Subject: [Spambayes] updating email package In-Reply-To: Message from barry@python.org (Barry A. Warsaw) <15827.55374.656571.993448@gargle.gargle.HOWL> References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> <15827.55374.656571.993448@gargle.gargle.HOWL> Message-ID: <20021114193932.7E180F58A@cashew.wolfskeep.com> In message: <15827.55374.656571.993448@gargle.gargle.HOWL> barry@python.org (Barry A. Warsaw) writes: > >>>>>> "TP" == Tim Peters writes: > > TP> Yes, that's a big -1. Barry intends to delete the email pkg > TP> from this project. > >I'm deleting it now. Would you mind putting some basic instructions about manual installation of the email package on the spambayes website for the python-package-management-impaired among us?
(I tried looking at the mimelib.sf.net website, but it doesn't explain how to get the package into the search path...) - Alex From tim@fourstonesExpressions.com Thu Nov 14 19:48:36 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 14 Nov 2002 13:48:36 -0600 Subject: [Spambayes] Mail with problem In-Reply-To: Message-ID: Depending on what kind of regex engine python has (NFA or DFA) and on how the html parsing regex is implemented relative to its engine, it can take an enormous amount of memory. For example, with an NFA and a regex that uses alternation in certain ways, the stack can grow exponentially. We may want to take a hard look at tokenizer's html parsing regex. I looked at it briefly yesterday, but didn't pay much attention. Tim, do you know if the python regex is NFA or DFA? If it's NFA, is there a DFA engine we can plug in? - TimS 11/14/2002 1:24:58 PM, Tim Peters wrote: >[Francois Granger] >> The enclosed file contains a mail wich when received or trained throught >> pop3prowy give me the following error: >> >> (MacOS 9.1 24 Mo memory for Python 2.2.1) >> ... >> [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254]) > >Looks like the regular expression engine runs out of (C) stack space while >trying to find HTML tags to strip. I don't know enough about Macs to >suggest something specific, but in general you have to do whatever it takes >to convince he OS to give the program more stack space to work with. > >Short of that, reducing the instances of "2048" in html_re in tokenizer.py >should make the problem go away, but since C stack space limits are >platform-specific, it's impossible to say how small "is safe" for you >without simply trying it over and over until the error goes away. 
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From barry@python.org Thu Nov 14 19:58:56 2002 From: barry@python.org (Barry A. Warsaw) Date: Thu, 14 Nov 2002 14:58:56 -0500 Subject: [Spambayes] updating email package References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl> <15827.55374.656571.993448@gargle.gargle.HOWL> <20021114193932.7E180F58A@cashew.wolfskeep.com> Message-ID: <15828.128.561764.174214@gargle.gargle.HOWL> >>>>> "TAP" == T Alexander Popiel writes: TAP> Would you mind putting some basic instructions about manual TAP> installation of the email package on the spambayes website TAP> for the python-package-management-impaired among us? I made the change to the developer.ht file but couldn't push out the .html files, probably due to the i'm-in-too-many-sf-groups permissions problem. Can someone else regen and push them out? TAP> (I tried looking at the mimelib.sf.net website, but it TAP> doesn't explain how to get the package into the search TAP> path...) You should be able to just unpack the tarball, and then follow the directions in the README file. -Barry From tim.one@comcast.net Thu Nov 14 20:01:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 15:01:50 -0500 Subject: [Spambayes] Mail with problem In-Reply-To: Message-ID: [Tim Stone] > Depending on what kind of regex engine python has (NFA or DFA) > and on how the html parsing regex is implemented relative to its > engine, it can take an enormous amount of memory. That isn't the problem here. There are no "runaway" regexps. The problem is that minimal matching is implemented via recursive call, one level per character matched. That's a long-standing problem. All minimal matches in the tokenizer regexps are bounded, but it's impossible to guess an upper limit that's "safe" across every half-a-brain C implementation in existence. 
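A bounded minimal match of the kind described above can be sketched like this. The pattern below is an assumed illustration and is not the actual html_re from tokenizer.py; make_tag_stripper and max_tag_len are hypothetical names, with the 2048 default borrowed only from the figure mentioned earlier in the thread.

```python
import re


def make_tag_stripper(max_tag_len=2048):
    """Build a function that replaces HTML-like tags with a space."""
    # '<', one character that rules out plain text such as 'a < b',
    # then as few characters as possible (capped at max_tag_len), then '>'.
    tag_re = re.compile(r"<[^\s<>].{0,%d}?>" % max_tag_len, re.DOTALL)

    def strip_tags(text):
        return tag_re.sub(" ", text)

    return strip_tags


strip_tags = make_tag_stripper()
cleaned = strip_tags("<b>hi</b> a < b")

# With a tiny cap, an over-long "tag" is simply left alone.
strict = make_tag_stripper(max_tag_len=2)
untouched = strict("<abcdef>")
```

Lowering max_tag_len corresponds to the "reduce the instances of 2048" workaround: the cap limits how far a single minimal match can extend before the engine gives up on that candidate tag.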
> For example, with an NFA and a regex that uses alternation in certain ways, > the stack can grow exponentially. Yes, but that's not the case here. > We may want to take a hard look at tokenizer's html parsing > regex. I looked at it briefly yesterday, but didn't pay much attention. > > Tim, do you know if the python regex is NFA or DFA? NFA. > If it's NFA, is there a DFA engine we can plug in? No. With great pain, the regexp in question could be rewritten to avoid minimal matches. I'd rather the OP convince his OS to let him use some of the dozens of megabytes sitting idle on his machine . From tim.one@comcast.net Thu Nov 14 20:33:47 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 14 Nov 2002 15:33:47 -0500 Subject: [Spambayes] Mail with problem In-Reply-To: <2Z1V1W06C8B9FCUWQ62XYZYPNOKHE.3dd403e4@riven> Message-ID: [Tim Stone] > Point taken. Makes me wonder, though, if we might not have a > problem like this when this starts getting used by regular folks, like >with the proxy... The OP attached the email in question to his msg. It tokenized fine on my box at the time (Win2K), and if I don't hear about it causing problems on other boxes either, then I'll assume it's Just Another Glitch specific to Mac OS 9. > I suppose the reason we're not using python's html parser is > performance...? Flyswatters versus dynamite mostly. We're not doing anything with HTML except throwing it away. Half-assed regexps can do a fine job of this, are very robust against ill-formed HTML too, against damaged email that intended to call itself text/html but forgot to, etc. If we need to do fancier things with HTML, then a real parser becomes correspondingly more attractive. From skip@pobox.com Thu Nov 14 22:04:29 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 14 Nov 2002 16:04:29 -0600 Subject: [Spambayes] read-only DBDict in hammie? 
In-Reply-To: References: Message-ID: <15828.7661.200538.997623@montanaro.dyndns.org> >>> I'd like to share the anydbm file between several accounts on my >>> machine. Before I fiddle hammie.py so it opens the file in >>> read-only mode, is there any reason when classifying (not training) >>> it actually needs to update the file? Tim> I'm using the DBDict class in hammie for doing training with the Tim> pop3proxy. Can we make a read-only option, rather than making it Tim> always open for read? Yes, that was my intent. When you run hammie with the -g or -s flags it's opened for writing, but opened for reading otherwise. There is a new mode argument to the DBDict constructor. I suppose I should have defaulted it to 'c'. Skip From mhammond@skippinet.com.au Thu Nov 14 22:20:25 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 15 Nov 2002 09:20:25 +1100 Subject: [Spambayes] Mozilla.org using bayesian spam filtering In-Reply-To: Message-ID: > Anybody know anything about this? Doesn't look like our technology... > > http://www.mozilla.org/mailnews/spam.html > It's not - but I bet quite a few of our worthless Aussie dollars that someone with Mozilla and Python knowledge, starting now with PyXPCOM and SpamBayes, would have a system much much better, and "finished" way before theirs <0.1 wink>. Still-vainly-wishing-mozilla-and-pyxpcom-take-over-the-world ly, Mark. From neale@woozle.org Fri Nov 15 07:00:44 2002 From: neale@woozle.org (Neale Pickett) Date: 14 Nov 2002 23:00:44 -0800 Subject: [Spambayes] Optimization to DBDict (Was: read-only DBDict in hammie?) In-Reply-To: References: Message-ID: I have sitting here on my hard drive some changes to DBDict which make for much smaller databases by introducing an optimization for WordInfo classes (getting rid of Administrative Pickle Bloat). However, if I submit this, everyone's hammie database will slowly be rewritten to the new format, so I want to solicit feedback first. 
Here are the two new methods:

    def __getitem__(self, key):
        v = self.hash[key]
        if v[0] == 'W':
            val = pickle.loads(v[1:])
            # We could be sneaky, like pickle.Unpickler.load_inst,
            # but I think that's overly confusing.
            obj = classifier.WordInfo(0)
            obj.__setstate__(val)
            return obj
        else:
            return pickle.loads(v)

    def __setitem__(self, key, val):
        if isinstance(val, classifier.WordInfo):
            val = val.__getstate__()
            v = 'W' + pickle.dumps(val, 1)
        else:
            v = pickle.dumps(val, 1)
        self.hash[key] = v

Note that this makes the assumption that if a "W" pickle type is ever added to Python's pickler, it won't be pickled in a DBDict. Otherwise, you're in for trouble. If someone knows of a better way to do this, please step forward before I submit it and hammie starts to rewrite everyone's database. So then, Tim Stone - Four Stones Expressions is all like: > On a related note, should DBDict actually have its own module, rather > than be part of hammie? You know what's funny is after I wrote DBDict I discovered python's shelve module, which does the same thing. I should probably rewrite DBDict to wrap the shelve class, but shelve is so minimal, maybe shelve should be rewritten to incorporate the DBDict class <0.2 wink> (gah, I'm doing the wink thing now). From sjoerd@acm.org Fri Nov 15 08:40:21 2002 From: sjoerd@acm.org (Sjoerd Mullender) Date: Fri, 15 Nov 2002 09:40:21 +0100 Subject: [Spambayes] updating email package In-Reply-To: References: Message-ID: <20021115084026.5635C74C3B@indus.ins.cwi.nl> On Thu, Nov 14 2002 Tim Peters wrote: > [Sjoerd Mullender] > > Does anybody mind if I update the email package with the current > > version from the Python CVS? > > Yes, that's a big -1. Barry intends to delete the email pkg from this > project. That's fine with me too.
-- Sjoerd Mullender From Paul.Moore@atosorigin.com Fri Nov 15 09:23:44 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Fri, 15 Nov 2002 09:23:44 -0000 Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup Message-ID: <16E1010E4581B049ABC51D4975CEDB8861993B@UKDCX001.uk.int.atosorigin.com> From: Tim Peters [mailto:tim.one@comcast.net] > [Moore, Paul] >> Here's a first cut at filtering any new unread messages when Outlook >> starts up (important for Exchange or IMAP users). > > Note that Mark Hammond already checked in something toward the same end. I got Mark's changes last night (or in the morning, if you're Australian...) Tried them today, but got some strange results. The first time I started Outlook, nothing happened, but that's because I'd lost my filter definitions (probably my fault, I think I know why this happened). Then, when I fixed that and restarted, I got the trace output saying that "0 missed messages" had been handled. Looking at the code for GetNewUnscoredMessageGenerator, it seems to scan the folder for messages with Unread = True and no "Spam" field. There were 40 or so unread messages, so I can only assume that they had a Spam field. I don't know enough about MAPI to diagnose this much further - I do, however, have the "Spam" field displayed in my Inbox - could that be enough to cause the field to be automatically created? On the other hand, manually filtering with the "unread only" and "only messages not already filtered" checkboxes set on, *did* filter the messages. But that uses a different method at the moment, checking message.unread and message.GetField(mgr.config.field_score_name). So it would filter a message with a defined, but empty, "Spam" property. I think that GetNewUnscoredMessageGenerator needs to allow for messages with an existing but empty "Spam" field.
I *think* this means you want to extend the restriction from

    AND
        PROPERTY = UNREAD (UNREAD, True)
        NOT EXISTS "Spam"

to

    AND
        PROPERTY = UNREAD (UNREAD, True)
        OR
            NOT EXISTS "Spam"
            PROPERTY = "Spam" ("Spam", PT_NULL)

(excuse weird pseudo-code description). But I know Mark will be better able to make the appropriate fix :-) Paul. From francois.granger@free.fr Fri Nov 15 09:53:36 2002 From: francois.granger@free.fr (François Granger) Date: Fri, 15 Nov 2002 10:53:36 +0100 Subject: [Spambayes] Another software in the field Message-ID: Discovered today: http://cristal.inria.fr/~xleroy/software.html (page in english) -- Email is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies. For clean mail: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html From just@letterror.com Fri Nov 15 09:57:53 2002 From: just@letterror.com (Just van Rossum) Date: Fri, 15 Nov 2002 10:57:53 +0100 Subject: [Spambayes] Another software in the field In-Reply-To: Message-ID: François Granger wrote: > http://cristal.inria.fr/~xleroy/software.html I don't think we have to fear much from procmail-only solutions... How close are "we" to an alpha (or even beta ;-) release? I think spambayes could get some great publicity, but we need to be quick. The topic is very hot. Just From anthony@interlink.com.au Fri Nov 15 11:00:02 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 15 Nov 2002 22:00:02 +1100 Subject: [Spambayes] Another software in the field In-Reply-To: Message-ID: <200211151100.gAFB02h16080@localhost.localdomain> >>> François Granger wrote > Discovered today: > > http://cristal.inria.fr/~xleroy/software.html > (page in english) http://spambayes.sourceforge.net/related.html Got that one already.
:) It's another implementation of the Graham algorithm. -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Fri Nov 15 11:01:44 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 15 Nov 2002 22:01:44 +1100 Subject: [Spambayes] Another software in the field In-Reply-To: Message-ID: <200211151101.gAFB1iX16108@localhost.localdomain> >>> Just van Rossum wrote > How close are "we" to an alpha (or even beta ;-) release? I think spambayes > could get some great publicity, but we need to be quick. The topic is very > hot. One thing that surprises me is that there's a seemingly endless list of projects all implementing Graham's approach exactly as he originally described it - almost no-one else is doing the basic testing and research that this sort of approach would seem to cry out for. From msergeant@startechgroup.co.uk Fri Nov 15 11:05:49 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Fri, 15 Nov 2002 11:05:49 +0000 Subject: [Spambayes] Another software in the field References: <200211151101.gAFB1iX16108@localhost.localdomain> Message-ID: <3DD4D50D.3090307@startechgroup.co.uk> Anthony Baxter said the following on 15/11/02 11:01: >>>>Just van Rossum wrote >> >>How close are "we" to an alpha (or even beta ;-) release? I think spambayes >>could get some great publicity, but we need to be quick. The topic is very >>hot. > > > One thing that surprises me is that there's a seemingly endless list of > projects all implementing Graham's approach exactly as he originally > described it - almost no-one else is doing the basic testing and > research that this sort of approach would seem to cry out for. Why would anyone else want to, when you guys are doing such an amazing job of it? ;-) Matt.
From skip@pobox.com Fri Nov 15 11:11:36 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 15 Nov 2002 05:11:36 -0600 Subject: [Spambayes] Another software in the field In-Reply-To: References: Message-ID: <15828.54888.833862.811209@montanaro.dyndns.org> >> http://cristal.inria.fr/~xleroy/software.html Just> I don't think we have to fear much from procmail-only solutions... There are a few of us who are quite happy with procmail-only solutions. Some of us even use Macs. ;-) Skip From just@letterror.com Fri Nov 15 11:14:22 2002 From: just@letterror.com (Just van Rossum) Date: Fri, 15 Nov 2002 12:14:22 +0100 Subject: [Spambayes] Another software in the field In-Reply-To: <15828.54888.833862.811209@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Just> I don't think we have to fear much from procmail-only solutions... > > There are a few of us who are quite happy with procmail-only solutions. > Some of us even use Macs. ;-) And I'm completely happy with a pop3-only solution... The great thing about spambayes is that we're _both_ happy ;-) Just From mwh@python.net Fri Nov 15 11:44:41 2002 From: mwh@python.net (Michael Hudson) Date: 15 Nov 2002 11:44:41 +0000 Subject: [Spambayes] Re: Another software in the field References: <200211151101.gAFB1iX16108@localhost.localdomain> <3DD4D50D.3090307@startechgroup.co.uk> Message-ID: <2mheejgiqe.fsf@starship.python.net> Matt Sergeant writes: > Anthony Baxter said the following on 15/11/02 11:01: > >>>>Just van Rossum wrote > >> > >>How close are "we" to an alpha (or even beta ;-) release? I think spambayes > >>could get some great publicity, but we need to be quick. The topic is very > >>hot. > > > > > > One thing that surprises me is that there's a seemingly endless list of > > projects all implementing Graham's approach exactly as he originally > > described it - almost no-one else is doing the basic testing and > > research that this sort of approach would seem to cry out for.
> > Why would anyone else want to, when you guys are doing such an amazing > job of it? ;-) But isn't that the point? The people are implementing Graham's original algorithm, not the soupa-doupa wizzy one that lives in spambayes... Cheers, M. -- I don't remember any dirty green trousers. -- Ian Jackson, ucam.chat From mhammond@skippinet.com.au Fri Nov 15 12:14:47 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 15 Nov 2002 23:14:47 +1100 Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861993B@UKDCX001.uk.int.atosorigin.com> Message-ID: > Looking at the code for GetNewUnscoredMessageGenerator, it seems to > scan the folder for messages with Unread = True and no "Spam" field. > There were 40 or so unread messages, so I can only assume that they > had a Spam field. See if you can convince the Outlook2000/sandbox/dump_props.py script to produce anything useful. It should tell you if such a field exists (and plenty of other things!) > I don't know enough about MAPI to diagnose this much further - I do, > however, have the "Spam" field displayed in my Inbox - could that be > enough to cause the field to be automatically created? It shouldn't. Indeed, having a blank score would indicate there is no such field on the message. Getting the field automatically created in the folder is a different problem we are yet to solve, but that is different. > But I know Mark will be better able to make the appropriate fix :-) That may be right, but I would prefer to see some output from dump_props first for you. delete_outlook_field.py can also be useful - it can be used to remove the Spam field from all messages in a folder. This is indeed how I did most of my testing. 
Eg, I run:

    F:\...>delete_outlook_field.py -d --no-outlook Spam
    Processing folder Inbox
    Deleting field Spam
    Deleted 1257 field instances via MAPI
    Could not find property to delete in the folder

Then when I restart outlook, I see:

    Processing 16 missed spam in folder 'Inbox' took 349.376ms

As I do indeed have 16 unread mail in my inbox :( These 16 messages are the only ones now showing the Spam field. Further, I have indeed seen this work for me, for real - this was to scratch a personal itch for when Outlook crashes due to a buggy PGP plugin a client insists I use :( I will also forward the test script I used to come up with the fastest technique I could for the usual case of zero missed messages. Mark. From msergeant@startechgroup.co.uk Fri Nov 15 12:38:48 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Fri, 15 Nov 2002 12:38:48 +0000 Subject: [Spambayes] Re: Another software in the field References: <200211151101.gAFB1iX16108@localhost.localdomain> <3DD4D50D.3090307@startechgroup.co.uk> <2mheejgiqe.fsf@starship.python.net> Message-ID: <3DD4EAD8.7090703@startechgroup.co.uk> Michael Hudson said the following on 15/11/02 11:44: > But isn't that the point? The people are implementing Graham's > original algorithm, not the soupa-doupa wizzy one that lives in > spambayes... Well some of them are. SpamAssassin implemented Gary's algorithm, and I'm working on the chi-squared one (I think it's working, but my DB is trained with count == word count, not mail count). Matt. From anthony@interlink.com.au Fri Nov 15 14:39:41 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 16 Nov 2002 01:39:41 +1100 Subject: [Spambayes] Another software in the field In-Reply-To: <3DD4D50D.3090307@startechgroup.co.uk> Message-ID: <200211151439.gAFEdgP17487@localhost.localdomain> >>> Matt Sergeant wrote > Why would anyone else want to, when you guys are doing such an amazing > job of it? ;-) Ooo.
Flattery :) I just hope that the various projects that are implementing the straight out Graham algorithm are planning to replace it with the more optimal code once Tim's finished perfecting it for us all. From msergeant@startechgroup.co.uk Fri Nov 15 14:44:45 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Fri, 15 Nov 2002 14:44:45 +0000 Subject: [Spambayes] Another software in the field References: <200211151439.gAFEdgP17487@localhost.localdomain> Message-ID: <3DD5085D.2070408@startechgroup.co.uk> Anthony Baxter said the following on 15/11/02 14:39: >>>>Matt Sergeant wrote >> >>Why would anyone else want to, when you guys are doing such an amazing >>job of it? ;-) > > > Ooo. Flattery :) > > I just hope that the various projects that are implementing the straight > out Graham algorithm are planning to replace it with the more optimal code > once Tim's finished perfecting it for us all. If I discover that chi-squared is indeed better than PG's method (I'd have to rebuild a live database, which is understandably a pain), then I definitely will be, and I'll be giving the code to SpamAssassin too. I suspect a lot of these projects that sprang up after PG's article will fall by the wayside as developers realise they don't really want to maintain them. But that's just evolution :) From Paul.Moore@atosorigin.com Fri Nov 15 14:54:07 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Fri, 15 Nov 2002 14:54:07 -0000 Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DD3@UKDCX001.uk.int.atosorigin.com> From: Mark Hammond [mailto:mhammond@skippinet.com.au] >> Looking at the code for GetNewUnscoredMessageGenerator, it seems >> to scan the folder for messages with Unread = True and no "Spam" >> field. There were 40 or so unread messages, so I can only assume >> that they had a Spam field. > >See if you can convince the Outlook2000/sandbox/dump_props.py script >to produce anything useful. 
It should tell you if such a field exists
>(and plenty of other things!)

The script doesn't work as it stands (COM error "Unknown Trust Provider",
which I suspect is due to our Active Directory setup somehow) but I hacked
it to just look at the Inbox.

I did this, and got a colleague to send me a mail (while I did *not* have
Outlook running). dump_props shows no Spam property. Started Outlook, it
wasn't filtered, and there was still no Spam property. Filtered manually,
and the Spam property appeared as expected...

For what it's worth, I attach the output of dump_props.py from before I
started Exchange, after I started but did nothing else, and after I
manually filtered the Inbox. I also include the trace output from the
addin after the startup filtering happened.

Paul.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AfterManualFilter
Type: application/octet-stream
Size: 12406 bytes
Desc: AfterManualFilter
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/6e43caaf/AfterManualFilter.exe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BeforeExchangeStarted
Type: application/octet-stream
Size: 11632 bytes
Desc: BeforeExchangeStarted
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/6e43caaf/BeforeExchangeStarted.exe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AfterExchangeStarted Type: application/octet-stream Size: 11632 bytes Desc: AfterExchangeStarted Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/6e43caaf/AfterExchangeStarted.exe From jm@jmason.org Fri Nov 15 12:45:15 2002 From: jm@jmason.org (Justin Mason) Date: Fri, 15 Nov 2002 12:45:15 +0000 Subject: [Spambayes] Another software in the field In-Reply-To: Message from Matt Sergeant of "Fri, 15 Nov 2002 11:05:49 GMT." <3DD4D50D.3090307@startechgroup.co.uk> Message-ID: <20021115124520.3E85D16F17@jmason.org> Matt Sergeant said: > > One thing that suprises me is that there's a seemingly endless list of > > projects all implementing Graham's approach exactly as he originally > > described it - almost no-one else is doing the basic testing and > > research that this sort of approach would seem to cry out for. > Why would anyone else want to, when you guys are doing such an amazing > job of it? ;-) Hi all, Well, I've just started for SpamAssassin -- I'm gradually reinventing the wheel I think. For example, I've just found that including hapaxes improves the middle ground very well, which I think is something you guys did a long time ago ;) But here's one thing I've noticed which might be useful for you guys. In SpamAssassin recently, we've been meditating on Message-Ids; particularly Outlook-format ones, like: <002901c28c22$3e8cb260$0201a8c0@gorm> now, I've figured out this is composed of TIMESTAMP is the top 4 bytes of the FILETIME struct on windows, which we can validate in SpamAssassin using perl code. not a runner for spambayes, unfortunately. However, SENDERID is a constant value which never changes for an Outlook or Exchange installation, as far as I can see -- so you want to make sure your tokenizer will parse message-ids, and will return that as one token. 
It will gain valuable probabilities for those tricky spammers who are getting good at sending legit-looking text and headers ;) No matter what hostnames they use, unless they reinstall Outlook (as far as I know) that should not change. Quick question BTW -- I've been trying to keep our bayes-testing stats close to yours, so we can compare portably. But there's one thing I've run into. As far as I can see, in your 10-fold cross-validation suite, you train using 1 fold and test against 9 -- whereas the published lit (or at least Ion's papers) seems to suggest that 10FCV works better trained against 9 and tested against 1. Is there a reason you chose this? PS: about time I posted here, I've been lurking and reading for weeks ;) --j. From neale@woozle.org Fri Nov 15 16:25:08 2002 From: neale@woozle.org (Neale Pickett) Date: 15 Nov 2002 08:25:08 -0800 Subject: [Spambayes] Another software in the field In-Reply-To: References: Message-ID: So then, François Granger is all like: > Disovered today: > > http://cristal.inria.fr/~xleroy/software.html > (page in english) That's SpamOracle. It's been around a while--it's what I began using immediately after writing my own pythonic spam filter and immediately before signing on with the spambayes effort :) What's cool about SpamOracle (aside from it being written in OCaml) is that it uses a lexical analyzer to parse up email. It's really fast! At least, fast for parsing messages. The original X-Hammie-Disposition header was a blatant rip from the header SpamOracle uses, too. 
Shameless, Neale From jm@jmason.org Fri Nov 15 15:36:25 2002 From: jm@jmason.org (Justin Mason) Date: Fri, 15 Nov 2002 15:36:25 +0000 Subject: [Spambayes] Another software in the field In-Reply-To: Message from Anthony Baxter <200211151439.gAFEdgP17487@localhost.localdomain> Message-ID: <20021115153630.15A5016F17@jmason.org> Anthony Baxter said: > I just hope that the various projects that are implementing the straight > out Graham algorithm are planning to replace it with the more optimal code > once Tim's finished perfecting it for us all. Actually, I reckon most of the projects will be happy to say "works for me", as PG did himself in the first place, at least until someone comes along and points out spambayes' much higher efficiency rating in a public forum like /. ;) --j. From popiel@wolfskeep.com Fri Nov 15 16:50:10 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 15 Nov 2002 08:50:10 -0800 Subject: [Spambayes] Re: Another software in the field In-Reply-To: Message from Michael Hudson of "15 Nov 2002 11:44:41 GMT." <2mheejgiqe.fsf@starship.python.net> References: <200211151101.gAFB1iX16108@localhost.localdomain> <3DD4D50D.3090307@startechgroup.co.uk> <2mheejgiqe.fsf@starship.python.net> Message-ID: <20021115165010.DBCC7F54C@cashew.wolfskeep.com> In message: <2mheejgiqe.fsf@starship.python.net> Michael Hudson writes: >Matt Sergeant writes: > >> Anthony Baxter said the following on 15/11/02 11:01: >> > >> > One thing that suprises me is that there's a seemingly endless list of >> > projects all implementing Graham's approach exactly as he originally >> > described it - almost no-one else is doing the basic testing and >> > research that this sort of approach would seem to cry out for. >> >> Why would anyone else want to, when you guys are doing such an amazing >> job of it? ;-) > >But isn't that the point? The people are implementing Graham's >original algorithm, not the soupa-doupa wizzy one that lives in >spambayes... 
None of us have written a nice, concise, and easily understood _English_
description of the algorithm we're actually using. Further, that
nonexistent concise description hasn't been slashdotted. Until that
happens, spambayes will be a footnote.

Personally, I think the classifier algorithm is mature enough now to
support such a description. (Yes, I know Gary has written essays on
part of it, but they're not comprehensive, and I don't think anything
has been written about our use of chi-square.) The classifier
_implementation_ is still fluctuating a fair amount as people consider
splitting the database, removing access counts, etc.

The tokenizer algorithm we've got is both not mature enough to support
such a description, and also not amenable to concise description. On
the other hand, I suspect that in a screed about the classifier, we
could get away with a very brief description of the basics (words
counted once per message, split-on-whitespace, ignore most headers,
etc.) and leave the fine tuning as a reference to the tokenizer
implementation.

Unfortunately, I'm not a good technical writer. If I were, I'd try
my hand at writing up a description of the classifier, and then post
it somewhere with a lot of bandwidth. As it is, I can only whine
about it not existing.

- Alex

From tim.one@comcast.net Fri Nov 15 17:21:19 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 12:21:19 -0500
Subject: [Spambayes] Another software in the field
In-Reply-To: <20021115124520.3E85D16F17@jmason.org>
Message-ID:

[Justin Mason]
> Well, I've just started for SpamAssassin -- I'm gradually reinventing
> the wheel I think. For example, I've just found that including hapaxes
> improves the middle ground very well, which I think is something you
> guys did a long time ago ;)

Ya, ignoring hapaxes is a form of bias, and we eventually found that all
forms of bias hurt.

> But here's one thing I've noticed which might be useful for you guys.
> In SpamAssassin recently, we've been meditating on Message-Ids; > particularly Outlook-format ones, like: > > <002901c28c22$3e8cb260$0201a8c0@gorm> Hmm. I use Outlook 2000, and my last post had: Message-id: OTOH, a recent one from Paul Moore had: Message-id: <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com> and from Mark Hammond: Message-id: and from Sean True: Message-id: These are all (I believe) Outlook users. No $ in sight! I believe Paul is alone in this group in using an Exchange server instead of straight SMTP. > now, I've figured out this is composed of > > > > TIMESTAMP is the top 4 bytes of the FILETIME struct on windows, which > we can validate in SpamAssassin using perl code. What does "validate" mean in this context? > not a runner for spambayes, unfortunately. Post the Perl code and I bet it will be easy to do in Python too. I'm not sure what you mean otherwise; for example, a FILETIME is conceptually a 64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the most-significant 4 bytes of that int, or the first 4 bytes in storage order (which happen to be the least-significant 4 bytes of the big int). > However, SENDERID is a constant value which never changes for an > Outlook or Exchange installation, as far as I can see -- so you want > to make sure your tokenizer will parse message-ids, and will return > that as one token. > > It will gain valuable probabilities for those tricky spammers > who are getting good at sending legit-looking text and headers ;) > No matter what hostnames they use, unless they reinstall Outlook > (as far as I know) that should not change. That would indeed be a great clue! > Quick question BTW -- I've been trying to keep our bayes-testing stats > close to yours, so we can compare portably. But there's one thing I've > run into. 
As far as I can see, in your 10-fold cross-validation suite,
> you train using 1 fold and test against 9

That's backwards, although it's tricky: for speed, timcv.py:

+ Trains on sets 2-10.
+ Predicts against set 1.
+ Incrementally trains set 1 (leaving the classifier trained on 1-10).
+ Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
+ Predicts against set 2.
+ Incrementally trains set 2 (leaving 1-10 trained again).
+ Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
+ Predicts against set 3.
+ Incrementally trains set 3 (leaving 1-10 trained again).

and so on. This has huge performance benefits, in both instruction count
and cache locality, versus running timcv.py with option
build_each_classifier_from_scratch enabled.

> -- whereas the published lit (or at least Ion's papers) seems to
> suggest that 10FCV works better trained against 9 and tested against 1.

Right.

> Is there a reason you chose this?

I was looking for a new hobby after I stopped beating my wife .
timtest.py is an NxN grid driver, running N**2-N tests each training on 1
and predicting against N-1. That's a good way to get lots of hard test runs
if you have lots of data. timcv.py is vanilla cross-validation, running N
tests each training on N-1 and predicting against 1. README.txt and
TESTING.txt say more about all this.

> PS: about time I posted here, I've been lurking and reading for weeks ;)

Poor man -- I'm glad you uncloaked! Did the Outlook Message-Ids fit a
pattern you've seen? I'm keen to pursue that.
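
The train/untrain schedule listed above can be sketched in a few lines of Python. This is only an illustration of the ordering, not the timcv.py source: `CountingClassifier`, `incremental_cv`, and the `predict` callback are hypothetical names, and the toy classifier merely records which folds it currently "knows" instead of learning real messages.

```python
class CountingClassifier:
    """Toy stand-in (hypothetical) for a classifier that can learn and unlearn."""

    def __init__(self):
        self.trained = set()  # ids of the folds currently trained on

    def learn(self, fold_id):
        self.trained.add(fold_id)

    def unlearn(self, fold_id):
        self.trained.remove(fold_id)


def incremental_cv(n_folds, classifier, predict):
    """Run n-fold cross-validation with one full pass plus incremental updates."""
    # One full training pass on every fold except the first.
    for fold in range(2, n_folds + 1):
        classifier.learn(fold)
    for fold in range(1, n_folds + 1):
        if fold > 1:
            classifier.learn(fold - 1)  # briefly trained on all folds
            classifier.unlearn(fold)    # hold out the fold under test
        predict(classifier, fold)


# Record what the classifier was trained on at each prediction step.
log = []
incremental_cv(10, CountingClassifier(),
               lambda c, fold: log.append((fold, sorted(c.trained))))
```

Each fold is learned and unlearned roughly once after the initial pass, instead of rebuilding nine folds of training data for every one of the ten runs, which is where the speed-up described above comes from.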
From tim.one@comcast.net Fri Nov 15 17:29:55 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 12:29:55 -0500 Subject: [Spambayes] Another software in the field In-Reply-To: <200211151101.gAFB1iX16108@localhost.localdomain> Message-ID: [Anthony Baxter] > One thing that suprises me is that there's a seemingly endless list of > projects all implementing Graham's approach exactly as he originally > described it - almost no-one else is doing the basic testing and > research that this sort of approach would seem to cry out for. Well, coding is fun, and the basic approach works so well at once that there's instant gratification. Sound statistical testing is (as the people here who've played along will surely testify!) tedious, time-consuming, trap-laden, and ego-deflating (it's not really *fun* when the data tells you your pet idea sucked, at least not before you've developed a healthy taste for humiliation ). From jm@jmason.org Fri Nov 15 18:06:24 2002 From: jm@jmason.org (Justin Mason) Date: Fri, 15 Nov 2002 18:06:24 +0000 Subject: [Spambayes] Another software in the field In-Reply-To: Message from Tim Peters Message-ID: <20021115180629.996C316F16@jmason.org> Tim Peters said: > Hmm. I use Outlook 2000, and my last post had: > Message-id: ... > These are all (I believe) Outlook users. No $ in sight! I believe Paul is > alone in this group in using an Exchange server instead of straight SMTP. Hmm, we thought they were Exchange-format ids; looks like O2K now uses that format. (thinks) maybe it's just Outlook Express does the $ id format -- but the important point is that it's frequently spoofed in spam (about 29% of my spam load, for example). So it becomes a great spam indicator. In fact, as Outlook users migrate *away* from that format, it gets better ;) BTW the O2K format IDs have not been spoofed yet, as far as I can see, so they would be a good ham sign, if the tokenizer could recognise them. 
as far as I know they always match /^<[A-Z]{28}\.\S+\@\S+>$/ . > What does "validate" mean in this context? compute what the value *should* be and compare. > Post the Perl code and I bet it will be easy to do in Python too. I'm not > sure what you mean otherwise; for example, a FILETIME is conceptually a > 64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the > most-significant 4 bytes of that int, or the first 4 bytes in storage order > (which happen to be the least-significant 4 bytes of the big int). most significant. Perl code is at the end of the mail... > [ten-pass] > That's backwards, although it's tricky: for speed, timcv.py: > + Train on sets 2-10. > + Predicts against set 1. > + Incrementally trains set 1 (leaving the classifier trained on 1-10). > + Incrementally *untrains* set 2 (leaving 1 + 3-10 trained). > + Predicts against set 2. > + Incrementailly trains set 2 (leaving 1-10 trained again). > + Incrementally untrains set 3 (leaving 1-2 + 4-10 trained). > + Predicts against set 3. > + Incrementailly trains set 3 (levaing 1-10 trained again). > and so on. This has huge performance benefits, in both instruction count > and cache locality, versus running timcv.py with option > build_each_classifier_from_scratch enabled. OK -- I must have misread it. so timcv.py *is* training on 9 sets each time. good. > I was looking for a new hobby after I stopped beating my wife . > timtest.py is an NxN grid driver, running N**2-N tests each training on 1 > and predicting against N-1. That's a good way to get lots of hard test runs > if you have lots of data. timcv.py is vanilla cross-validation, running N > tests each training on N-1 and predicting against 1. README.txt and > TESTING.txt say more about all this. bloody hell, timtest.py must take years to run ;) sounds interesting. BTW I hadn't read TESTING.txt (for some reason) -- I like the bigrams story. > Poor man -- I'm glad you uncloaked! 
Did the Outlook Message-Ids fit a
> pattern you've seen? I'm keen to pursue that.

yep, see above ;)

BTW here's the perl code. it's cut and pasted from current
Mail::SpamAssassin::EvalTests, so it won't run as-is, but it should be
pretty easy to grok...

# valid Outlookish Message-Ids contain the top word of the system time
# when the message was sent!
# We can verify this, by decoding the Date header, extracting
# the time token from the Message-Id, and comparing them.
#
sub check_outlook_timestamp_token {
  my ($self) = @_;
  local ($_);

  my $id = $self->get ('Message-Id');
  return 0 unless ($id =~ /^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}\@/);

  my $timetoken = hex($1);

  # convert UNIX time_t to Windows FILETIME. From MSDN:
  #
  #   LONGLONG ll = Int32x32To64(t, 10000000) + 116444736000000000;
  #   pft->dwLowDateTime = (DWORD) ll;
  #   pft->dwHighDateTime = ll >>32;
  #
  # IOW, ((tt * a) + b) / c = id .
  # Now to avoid using any kind of LONGLONG data type, we do this:
  #   => tt * (a/c) + (b/c) = id
  #   let x = (a/c) = 0.0023283064365387
  #   let y = (b/c) = 27111902.8329849
  #
  my $x = 0.0023283064365387;
  my $y = 27111902.8329849;

  # quite generous, but we just want to be in the right ballpark, so we
  # can handle mostly-correct values OK, but catch random strings.
  my $fudge = 200;

  $_ = $self->get ('Date');
  $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
  my $expected = int (($_ * $x) + $y);
  my $diff = $timetoken - $expected;
  dbg("time token found: $timetoken expected (from Date): $expected: $diff");
  if (abs ($diff) < $fudge) { return 0; }

  # also try last date in Received header, Date could have been rewritten
  $_ = $self->get ('Received');
  /(\s.?\d+ \S\S\S \d+ \d+:\d+:\d+ \S+).*?$/;
  dbg("last date in Received: $1");
  $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
  $expected = int (($_ * $x) + $y);
  $diff = $timetoken - $expected;
  dbg("time token found: $timetoken expected (from Received): $expected: $diff");
  if (abs ($diff) < $fudge) { return 0; }

  return 1;
}

# parse an RFC822 date into a time_t
sub _parse_rfc822_date {
  my ($self, $date) = @_;
  local ($_);
  my ($yyyy, $mmm, $dd, $hh, $mm, $ss, $mon, $tzoff);

  # make it a bit easier to match
  $_ = " $date "; s/, */ /gs; s/\s+/ /gs;

  # now match it in parts. Date part first:
  if (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4}) / /i) {
    $dd = $1; $mon = $2; $yyyy = $3;
  } elsif (s/ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +(\d+) \d+:\d+:\d+ (\d{4}) / /i) {
    $dd = $2; $mon = $1; $yyyy = $3;
  } elsif (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{2,3}) / /i) {
    $dd = $1; $mon = $2; $yyyy = $3;
  } else {
    dbg ("time cannot be parsed: $date");
    return undef;
  }

  # handle two and three digit dates as specified by RFC 2822
  if (defined $yyyy) {
    if (length($yyyy) == 2 && $yyyy < 50) {
      $yyyy += 2000;
    }
    elsif (length($yyyy) != 4) {
      # three digit years and two digit years with values between 50 and 99
      $yyyy += 1900;
    }
  }

  # hh:mm:ss
  if (s/ ([\d\s]\d):(\d\d)(:(\d\d))? / /) {
    $hh = $1; $mm = $2; $ss = $4 || 0;
  }

  # numeric timezones
  if (s/ ([-+]\d{4}) / /) {
    $tzoff = $1;
  }
  # all other timezones are considered equivalent to "-0000"
  $tzoff ||= '-0000';

  if (!defined $mmm && defined $mon) {
    my @months = qw(jan feb mar apr may jun jul aug sep oct nov dec);
    $mon = lc($mon);
    my $i; for ($i = 0; $i < 12; $i++) {
      if ($mon eq $months[$i]) { $mmm = $i+1; last; }
    }
  }

  $hh ||= 0; $mm ||= 0; $ss ||= 0; $dd ||= 0; $mmm ||= 0; $yyyy ||= 0;

  my $time;
  eval {                # could croak
    $time = timegm ($ss, $mm, $hh, $dd, $mmm-1, $yyyy);
  };

  if ($@) {
    dbg ("time cannot be parsed: $date, $yyyy-$mmm-$dd $hh:$mm:$ss");
    return undef;
  }

  if ($tzoff =~ /([-+])(\d\d)(\d\d)$/)  # convert to seconds difference
  {
    $tzoff = (($2 * 60) + $3) * 60;
    if ($1 eq '-') {
      $time += $tzoff;
    } else {
      $time -= $tzoff;
    }
  }

  return $time;
}

From popiel@wolfskeep.com Fri Nov 15 18:20:39 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 15 Nov 2002 10:20:39 -0800
Subject: [Spambayes] Another software in the field
In-Reply-To: Message from Tim Peters
References:
Message-ID: <20021115182039.BD3F3F54C@cashew.wolfskeep.com>

In message: Tim Peters writes:
>
>Poor man -- I'm glad you uncloaked! Did the Outlook Message-Ids fit a
>pattern you've seen? I'm keen to pursue that.

If you're keen on message ids, then one idea I've had (with no time
to implement, alas) is to compare the message id domain with the
sequence in the received headers, to detect when message ids are
generated late in the delivery sequence.

In more detail:

Most received headers these days are of the (rfc 821 dictated) form:

  Received: from ([^ ]*).* by ([^ ]*).*;(.*)

where \1 is the prior MTA, \2 is the current MTA, and \3 is the time
of transfer. Reading all the received headers, you can get a chain
of MTAs as the delivery sequence...
as an example:

  Received: from mail.python.org (mail.python.org [12.155.117.29])
      by cashew.wolfskeep.com (Postfix) with ESMTP id 97FAFF54C
      for ; Fri, 15 Nov 2002 09:44:19 -0800 (PST)
  Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
      by mail.python.org with esmtp (Exim 4.05)
      id 18CkXd-00065D-01; Fri, 15 Nov 2002 12:46:05 -0500
  Received: from smtp.comcast.net ([24.153.64.2])
      by mail.python.org with esmtp (Exim 4.05)
      id 18CkAN-0007r1-00
      for spambayes@python.org; Fri, 15 Nov 2002 12:22:03 -0500
  Received: from cj569191b
      (pcp736393pcs.reston01.va.comcast.net [68.48.241.201])
      by mtaout03.icomcast.net
      (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
      spambayes@python.org; Fri, 15 Nov 2002 12:21:16 -0500 (EST)

yields the sequence:

  cj569191b -> mtaout03.icomcast.net -> smtp.comcast.net ->
  mail.python.org -> localhost.localdomain -> mail.python.org ->
  mail.python.org -> cashew.wolfskeep.com

Remove references to localhost.localdomain or localhost, then
compress identical neighbors to yield:

  cj569191b -> mtaout03.icomcast.net -> smtp.comcast.net ->
  mail.python.org -> cashew.wolfskeep.com

Now, look at the message id:

  Message-id:

Extracting just the domain name from that, we get:

  comcast.net

Now, compare the domain from the message id to the domains in the
received list, yielding the number of hierarchy levels matched:

  0 -> 1 -> 2 -> 0 -> 0

Find the first occurrence of the best match, and generate a token:

  message-id-generation:skipped 2

If the received parser were a little smarter about parsing iPlanet
received lines, it would have "pcp736393pcs.reston01.va.comcast.net"
instead of "cj569191b" as the first element in the sequence, and
the match list would have been 2 -> 1 -> 2 -> 0 -> 0, yielding:

  message-id-generation:skipped 0

I suspect that high skipped numbers would be a strong spam indicator,
showing where message ids were omitted in the sent mail and/or received
headers naively forged to prevent backtracking.
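
For readers who want to experiment, the steps above might be sketched in Python roughly as follows. This is an untested-on-real-mail illustration, not code from spambayes: the function names `delivery_chain`, `levels_matched`, and `message_id_token` are hypothetical, the regex is the naive one quoted above (so it has the same iPlanet blind spot), and the Message-Id domain is passed in directly rather than parsed out of the header.

```python
import re

# Naive pattern from the description: group 1 = prior MTA, group 2 = current MTA.
RECEIVED_RE = re.compile(r"from ([^ ]*).* by ([^ ]*).*;(.*)", re.DOTALL)
LOCAL_NAMES = ("localhost", "localhost.localdomain")


def delivery_chain(received_headers):
    """Return the MTA chain, oldest hop first.

    received_headers holds Received: values in header order (newest first).
    """
    chain = []
    for value in reversed(received_headers):
        m = RECEIVED_RE.search(value)
        if m:
            chain.extend([m.group(1).lower(), m.group(2).lower()])
    # Drop localhost references, then compress identical neighbours.
    chain = [host for host in chain if host not in LOCAL_NAMES]
    compressed = []
    for host in chain:
        if not compressed or compressed[-1] != host:
            compressed.append(host)
    return compressed


def levels_matched(host, domain):
    """Count the trailing domain labels that host and domain share."""
    count = 0
    for a, b in zip(reversed(host.split(".")), reversed(domain.split("."))):
        if a != b:
            break
        count += 1
    return count


def message_id_token(received_headers, message_id_domain):
    """Build the 'skipped N' token from the first occurrence of the best match."""
    chain = delivery_chain(received_headers)
    if not chain:
        return None
    scores = [levels_matched(host, message_id_domain.lower()) for host in chain]
    return "message-id-generation:skipped %d" % scores.index(max(scores))
```

One open question the sketch does not settle: when no hop matches at all, every score is 0 and the token comes out as "skipped 0", the same as a perfect first-hop match. A real tokenizer would probably want to distinguish those two cases.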
Unfortunately, I haven't had time to implement and test this... - Alex From tinacoruth@concentric.net Fri Nov 15 01:19:42 2002 From: tinacoruth@concentric.net (polaner) Date: Fri, 15 Nov 2002 01:19:42 +0000 Subject: [Spambayes] Bullet proof bulk email friendly hosting & cheap mass email campaigns. Message-ID: <1897911063vsdped|hvCs|wkrq1ruj@prodigy.com> We are the marketing specialists www.host4bulk.com that provide cheap bullet proof bulk email friendly hosting for your website ($400 for one month of bullet proof hosting) and cheap bulk email campaigns ($200 for 1 million emails sent) As you may already know, many web hosting companies have Terms of Service (TOS) or Acceptable Use Policies (AUP) against the delivery of emails advertising or promoting your web site. If your web site host receives complaints or discovers that your web site has been advertised in email broadcasts, they may disconnect your account and shut down your web site. Our mission is to solve your problem and provide you with bulk email friendly hosting. You don't have to worry about your website being closed again. Adult and gambling sites welcomed. No set up fee. You may advertise your website by using your own resources or using 3rd party's service. However we can do all the advertising for your business. You just sit, relax and see how your income grows constantly. We guarantee the lowest prices on the web for our web hosting and bulk email campaigns. We only ask $200 us dollars for 1 million emails sent with your ad. We don't use duplicate emails. Our email base is up to date and it is updated weekly. Our current email data base contains over 50.000.000 emails sorted by various parameters to meet your specific needs. No competitors may offer this price. The lowest price you can find on the net is well over $500 for 1 million Don't make the mistake of bulk emailing directly to your website without bulletproof web hosting. 
Your web host will close your account and shut your site down in no time! No matter how long you have been with them, how much you are paying them, or how beautiful your site is. There are companies charging thousands for bulletproof web hosting and they can't keep you up and running like we can. If you host with us, your site will NOT BE SHUT DOWN due to complaints! Bulk email campaign together with bullet proof hosting will bring your business to success. Just imagine how many people will learn about your business or product at a really low price. Bulk email is considered to be the most effective way to advertise on the net. It is hundreds times effective than banner, solo ad and other campaigns. Once people use our service they always come back for more. We can always provide websites that use bulk email campaigns with our new reliable way to accept credit cards on the net without the need to open merchant account. You can start accepting credit card payments in second. It is totally free. Visit our website at http://www.host4bulk.com for more information and to order your bulk email hosting or/and email campaign. From tim.one@comcast.net Fri Nov 15 19:23:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 14:23:13 -0500 Subject: [Spambayes] Just for fun Message-ID: There's a msg to this list being held for moderator approval. I'm going to let it thru. The subject is Bullet proof bulk email friendly hosting & cheap mass email This should be a good test of whether your classifier thinks *everything* sent to this list is ham . From jeremy@alum.mit.edu Fri Nov 15 19:29:01 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 15 Nov 2002 14:29:01 -0500 Subject: [Spambayes] Just for fun In-Reply-To: References: Message-ID: <15829.19197.286799.636365@slothrop.zope.com> Here's how that message scored for me: Score: 0.998364543478 Clues ----- *H* 0.003146604587 *S* 0.999875691542 How dull. 
It's too hard to find good spam these days,
where good spam is defined as clever enough to get through a decent
spam filter.

Jeremy

From python-spambayes@discworld.dyndns.org Fri Nov 15 19:34:16 2002
From: python-spambayes@discworld.dyndns.org (Charles Cazabon)
Date: Fri, 15 Nov 2002 13:34:16 -0600
Subject: [Spambayes] Just for fun
In-Reply-To: <15829.19197.286799.636365@slothrop.zope.com>; from jeremy@alum.mit.edu on Fri, Nov 15, 2002 at 02:29:01PM -0500
References: <15829.19197.286799.636365@slothrop.zope.com>
Message-ID: <20021115133416.A32235@discworld.dyndns.org>

Jeremy Hylton wrote:
>
> How dull. It's too hard to find good spam these days,
> where good spam is defined as clever enough to get through a decent
> spam filter.

Remember that this project /is/ the first instance of a decent spam filter :),
so we can hardly blame the spammers for being a little behind.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From tim.one@comcast.net Fri Nov 15 20:20:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 15:20:55 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <15829.19197.286799.636365@slothrop.zope.com>
Message-ID:

[Jeremy]
> Here's how that message scored for me:
> Score: 0.998364543478
>
> Clues
> -----
> *H* 0.003146604587
> *S* 0.999875691542

Good!
On my tiny still-hapax-driven purely-mistake-based at-home classifier
(which is up to 79 each of ham and spam trained on) it fared much worse:

Spam Score: 0.303213

word                                  spamprob   #ham  #spam
'*H*'                                 0.496939      -      -
'*S*'                                 0.103364      -      -

The spambayes/python.org clues were strong (reflecting how many mailing-list
ham had to be redeemed from Unsure status over the last 2 weeks, and spam FN
leaking thru python.org):

'spambayes'                           0.0918367     2      0
'email name:spambayes'                0.0918367     2      0
'url:spambayes'                       0.0918367     2      0
'subject:Spambayes'                   0.0918367     2      0
'sender:addr:spambayes-bounces'       0.0918367     2      0
'url:mailman'                         0.114051     16      2
'url:python'                          0.120635     15      2
'sender:addr:python.org'              0.12996      20      3
'url:org'                             0.145463     23      4
'email addr:python.org'               0.145904     12      2
'url:listinfo'                        0.170558     19      4
'to:addr:python.org'                  0.178246     18      4

OTOH, the strongest spam words were stronger, but overall the hapaxes "in
the middle" favored ham:

ham hapaxes
'solo'                                0.155172      1      0
'x-mailer:microsoft outlook express 5.50.4807.1700'
                                      0.155172      1      0
'sorted'                              0.155172      1      0
'proof'                               0.155172      1      0
'parameters'                          0.155172      1      0
'however'                             0.155172      1      0
'host'                                0.155172      1      0
'considered'                          0.155172      1      0
'ask'                                 0.155172      1      0
'account.'                            0.155172      1      0

high-spamprob words
'advertise'                           0.908163      0      2
'price.'                              0.908163      0      2
'lowest'                              0.908163      0      2
'base'                                0.908163      0      2
'$500'                                0.908163      0      2
'bulk'                                0.934783      0      3
'service.'                            0.934783      0      3
'effective'                           0.934783      0      3
'subject: & '                         0.934783      0      3
'free.'                               0.958716      0      5
'income'                              0.965116      0      6

When I restored my "non-insane training" at-home classifier from 2 weeks
ago, it was nailed as spam. The hapax-driven guy hasn't changed character
for me since the first few days: rarely makes outright mistakes, but the
Unsures remain surprising.

> How dull. It's too hard to find good spam these days, where good
> spam is defined as clever enough to get through a decent spam filter.
One of the articles recently referenced here quoted a self-identified spammer who said he can make $1000 in a day by sending out 400,000 spam, and getting (just) 30 people to sign up for the porn sites he's advertising. I don't think he cares if spam filters can catch his stuff with 100% recall: his customers won't run spam filters, because they *want* porn spam. That's where we get rich, since this system can easily be trained to accept porn spam, but block human growth hormone spam. OTOH, no matter which way you cut all that, there's no incentive for porn spammers to change behavior. To the contrary, a system like this in wide use would help them reach their market more effectively. so-let's-apply-for-porn-funding-ly y'rs - tim From rob@hooft.net Fri Nov 15 20:30:40 2002 From: rob@hooft.net (Rob Hooft) Date: Fri, 15 Nov 2002 21:30:40 +0100 Subject: [Spambayes] Just for fun References: Message-ID: <3DD55970.4030500@hooft.net> Tim Peters wrote: > [Jeremy] > >>Here's how that message scored for me: >> Score: 0.998364543478 >> >> Clues >> ----- >> *H* 0.003146604587 >> *S* 0.999875691542 > > > Good! On my tiny still-hapax-driven purely-mistake-based at-home classifier > (which is up 79 each of ham and spam trained on) it fared much worse: > > Spam Score: 0.303213 > > word spamprob #ham #spam > '*H*' 0.496939 - - > '*S*' 0.103364 - - So that makes 80 in your classifier.... For my one-time-only private-email few-weeks-old too-lazy-to-retrain classifier, this message did 0.88 due to the spambayes ham clues.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From dereks@itsite.com Fri Nov 15 17:35:52 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 15 Nov 2002 12:35:52 -0500 (EST) Subject: [Spambayes] Just for fun In-Reply-To: <20021115133416.A32235@discworld.dyndns.org> Message-ID: > > How dull. It's to hard to find good spam these days, > > where good spam is defined as clever enough to get through a decent > > spam filter. 
Listening to this, one would think spam is a problem of the past! > Remember that this project /is/ the first instance of a decent spam filter :), > so we can hardly blame the spammers for being a little behind. Let's not forget that SpamBayes only works for individuals or workgroups who have the same definition of "ham". It doesn't help much in enterprise-level settings with tens of thousands of users, since the ham of such a large and varied group of people would dilute the definition of spam too much to be useful. I bet that playing the numbers game one could "show" that the helpdesk and maintenance costs of supporting a Python installation plus a per-person ham training procedure would be more expensive (for a Uni or Mega-Corp.) than just living with spam. (Pure conjecture on my part, but it is easily imagined.) There's another Python-based spam filter that might work better for SMTP server-wide deployment, called "Active Spam Killer", or ASK. http://www.paganini.net/ask/ Its shtick is that it maintains a whitelist of people who may email you. When an email from a new sender comes in, it holds the email for you, sends the person a simple confirmation message (to which they simply hit Reply;Send), and then that person is added to your whitelist and their original message is sent to you (and they are never ASKed again). There's also some very practical regex stuff, some migration tools, and an ignorelist and blacklist (for situations like http://www.psychoexgirlfriend.com/). It's currently targeted at individuals but if one thinks of this as an "E-mail Firewall", where only users who actually reply to messages are allowed to send messages to your company or campus, then this might work out well. Whether or not that's a desirable campus policy is up for debate, but I know that I want it for my small company.
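The hold-and-confirm flow described above can be sketched roughly as follows. This is only a toy model and not ASK's actual code: the class name, the method names, the return strings, and the in-memory storage are all assumptions made for illustration.

```python
from collections import defaultdict


class ChallengeResponseGate:
    """Toy whitelist with sender confirmation (all names are hypothetical)."""

    def __init__(self):
        self.whitelist = set()         # senders delivered without question
        self.blacklist = set()         # senders always rejected
        self.held = defaultdict(list)  # sender -> messages awaiting confirmation

    def receive(self, sender, message):
        """Return 'reject', 'deliver', 'challenge', or 'hold' for a message."""
        if sender in self.blacklist:
            return "reject"
        if sender in self.whitelist:
            return "deliver"
        first_contact = sender not in self.held
        self.held[sender].append(message)
        # Only the first message from an unknown sender triggers a challenge;
        # later ones are queued silently until the sender confirms.
        return "challenge" if first_contact else "hold"

    def confirm(self, sender):
        """The sender replied to the challenge: whitelist them, release mail."""
        self.whitelist.add(sender)
        return self.held.pop(sender, [])
```

A mail server would call receive() for each incoming message and confirm() when a reply to the challenge arrives; what to do with held mail that is never confirmed is left open here.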
I bring it up on this list to (a) remind you guys that stopping spam at the server is far more efficient than stopping it at each user's INBOX, and (b) because I wanted to show a completely different spam filtering technology that doesn't depend on content at all. I would love to see a Python-based product that integrates well with Postfix/etc. and lets me pick and choose enterprise-wide spam filtering methods in whatever order I want. Pure "ASK"-like behaviour could be configured for large enterprise installations, but small workgroups would only "ASK" the senders for a confirmation if their first email looked too much like a spam (according to SpamBayes). Add in a Python port of SpamAssassin's methods and I think then you'd have a serious tool for stopping spam at the server. In short, I don't think filtering at the per-user level is a solution to the spam problem, because it only saves the time of the individual to identify and delete the spam. It does not save the cost of delivery, which is mostly on the receiving infrastructure. --Derek From tim.one@comcast.net Fri Nov 15 20:57:22 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 15:57:22 -0500 Subject: [Spambayes] Just for fun In-Reply-To: <3DD55970.4030500@hooft.net> Message-ID: [Rob Hooft] > So that makes 80 in your classifier.... I left this one out of the training -- the only reason it made it to the Spambayes list is because I approved it. SpamAssassin knew darned well it was spam, but SpamAssassin has been castrated for msgs sent to this list; it was held for approval just because this list has a members-only posting policy. BTW, that's the only "outside" spam I've seen mailed to this list so far! On the radio yesterday, a news story reported that the US Federal Trade Commission did a test, wherein they posted some email addresses on the web, then timed how long it took for them to receive their first spam: 8 minutes. Surely they can speed that up . 
> For my one-time-only private-email few-weeks-old too-lazy-to-retrain > classifier, this message did 0.88 due to the spambayes ham clues.... Good! I get hundreds of Mailman emails every day thru python.org, so the python.org + Mailman clues are (I expect) much stronger in my little database. It *still* didn't get near my ham_cutoff, though, which is mildly encouraging (granting that this training strategy is insane). From tim.one@comcast.net Fri Nov 15 21:22:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 16:22:02 -0500 Subject: [Spambayes] Just for fun In-Reply-To: Message-ID: [Derek Simkowiak] > Let's not forget that SpamBayes only works for individuals or > workgroups who have the same definitation of "ham". "workgroups" is too small. The general python.org tests show that it's also very effective for tech mailing lists, even collections of tech mailing lists, and even very high-volume collections of tech mailing lits. > It doesn't help much in enterprise-level settings with tens of > thousands of users, The python.org tech mailing lists serve tens of thousands. > since the ham of such a large and varied group of people would dilute > the definition of spam too much to be useful. That's always been my belief, but it hasn't been tested properly here so remains speculation. Matt Sargeant earlier reported clearly worse error rates from his more-classic Bayesian classifier when expanded to "many" users, but they weren't so high as to suggest uselessness. A lot depends on what you *do* with suspected spam. If it's bounced back to the user, it seems about the same as bouncing back a whitelist nag. > I bet that playing the numbers game one could "show" that the > helpdesk and maintenance costs of supporting a Python installation > plus a per-person ham training procedure would be more expensive > (for a Uni or Mega-Corp.) than just living with spam. (Pure > conjecture on my part, but it is easily imagined.) So is the converse . 
> There's another Python-based spam filter that might work better > for SMTP server-wide deployment, called "Active Spam Killer", or ASK. > > http://www.paganini.net/ask/ > > It's schtick is that it maintains a whitelist of people who may > email you. When an email from a new sender comes in, it holds the email > for you, sends the person a simple confirmation messages (to which they > simply hit Reply;Send), A deployment of *this* system could do the same, yes? Challenge-response is applicable to any system with a reject rule. > and then that person is added to your whitelist and their original > messages is sent to you (and they are never ASKed again). Think about this for "the enterprise". I doubt my employer would go for this: sales leads are sacred, and *anything* you do to make it harder for a potential customer to contact you is Major Sin. For this reason, no spam is blocked by my employer. Suspected spam merely gets a tilde prepended to the Subject line. Customer contacts *were* bounced when they did try to block spam. Bothering a customer with a whitelist nag is an approximation to asking that customer to do business elsewhere. For example, I can say that if one of my sisters got a whitelist nag asking them to reply, and it had a bunch of "funny numbers" in the subject line, they'd be afraid even to *read* it -- they'd delete it at once, fearing it was a virus or address harvester ("it looks funny, and it didn't come from Timmy, so I bet some hacker intercepted my email to xyz.com and is trying to trick me into replying"). > There's also some very practical regex stuff, some migration > tools, and an ignorelist and blacklist (for situations like > http://www.psychoexgirlfriend.com/). > > It's currently targeted at individuals but if one thinks of this > as an "E-mail Firewall", where only users who actually reply to > messages are allowed to send messages to your company or campus, then > this might work out well. 
Whether or not that's a desirable campus > policy is up for debate, but I know that I want it for my small > company. Until you learn that a potential customer did business with Zope Corp instead because we didn't nag them . > I bring it up on this list to (a) remind you guys that stopping > spam at the server is far more efficient than stopping it at each user's > INBOX, If the server were 100% reliable in determining what is and isn't spam, that would certainly be true. The costs associated with false positives are often judged to be very high, though, and asking the sender for confirmation is a cost too. For tech mailing lists I think it's an easily borne cost, but not for companies doing business with the public. > and (b) because I wanted to show a completely different spam > filtering technology that doesn't depend on content at all. As a whitelist gimmicks go, it appears to be a very good one. Whether whitelists are appropriate depends on the intended use. > I would love to see a Python-based product that integrates well > with Postfix/etc. and lets me pick and choose enterprise-wide spam > filtering methods in whatever order I want. Pure "ASK"-like behaviour > could be configured for large enterprise installations, but small > workgroups would only "ASK" the senders for a confirmation if their > first email looked too much like a spam (according to SpamBayes). Add > in a Python port of SpamAssassin's methods and I think then you'd have > a serious tool for stopping spam at the server. Regardless of scheme, I urge running tests and measuring error rates, else it's just whistling in the dark. "More technology" without such guidance is more likely to hurt than help (unless you think a typically overburdened admin is going to understand the interactions among half a dozen distinct systems perfectly out of the box). 
> In short, I don't think filtering at the per-user level is a > solution to the spam problem, because it only saves the time of the > individual to identify and delete the spam. From my POV, that *is* my "spam problem". > It does not save the cost of delivery, which is mostly on the > receiving infrastructure. Now that you mention it, funding would be helpful . From piersh@friskit.com Fri Nov 15 21:52:35 2002 From: piersh@friskit.com (Piers Haken) Date: Fri, 15 Nov 2002 13:52:35 -0800 Subject: [Spambayes] fix for Outlook 'Spam' field Message-ID: <9891913C5BFE87429D71E37F08210CB9297513@zeus.sfhq.friskit.com> First off: I've never been able to get the 'Spam' field in Outlook to work well on my system. It may be something to do with the fact that I'm using Exchange, but I always found that some messages had a rating, some didn't and invariably the number in the field didn't match the 'show spam clues' number. So here's a patch. It does a couple of things: 1) firstly it changes the Class of the 'Spam' field to olPercent, which I believe is much more appropriate than olCombination. The problem with olCombination is that you have to manually change the field type in Outlook in order to get anything to show up. With olPercent, the column shows up with a nice '%' sign which makes it more obvious what the number actually means. 2) secondly it adds a checkbox 'Update spam scores' to the training dialog. Checking this box causes the trainer to update the spam field for ALL messages in your training folders (in a second pass, if necessary). This means that ALL messages in your inbox have an entry in that field, not just those that arrived since you installed the plugin. This was a huge win for me since it allowed me to sort by the spam field and throw away about 20 spams from my inbox that I had missed during my initial manual pruning.
The only issue here is that in order for this to work right, you'll have to manually delete your existing spam fields, restart Outlook and then 'rescore'. Also, Mark, you never committed that patch I sent you that fixed the CompareIDs bug in the FolderSelector dialog. Was there a problem with it? Piers. -------------- next part -------------- A non-text attachment was scrubbed... Name: spam_field.patch Type: application/octet-stream Size: 16150 bytes Desc: spam_field.patch Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/70cf95b9/spam_field.exe From neale@woozle.org Fri Nov 15 21:40:55 2002 From: neale@woozle.org (Neale Pickett) Date: 15 Nov 2002 13:40:55 -0800 Subject: [Spambayes] Just for fun In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > There's a msg to this list being held for moderator approval. I'm going to > let it thru. The subject is > > Bullet proof bulk email friendly hosting & cheap mass email > > This should be a good test of whether your classifier thinks *everything* > sent to this list is ham . Uhhh, has anyone else noticed that mail to this list has SpamAssassin headers in it? I took SA out of my mail path months ago but just noticed the headers when checking out that spam's headers. Mightn't SA's score in the message headers bias the results of a later SpamBayes run? From rob@hooft.net Fri Nov 15 21:53:47 2002 From: rob@hooft.net (Rob Hooft) Date: Fri, 15 Nov 2002 22:53:47 +0100 Subject: [Spambayes] Better optimization loop Message-ID: <3DD56CEB.7050406@hooft.net> I've been playing a bit more with the weakloop concept. As Tim reported earlier, there is no chance that the "weak" training can be optimized this way. There are just too many binary choices in the training, resulting in a very bad optimization field. The "train automatically" mode that Tim proposed and that is much more stable runs way too slowly to work as a step in an optimization. So: I'm back at timcv.py.
I removed weakloop.py from the CVS, and added a new 'simplexloop.py' that takes a single option: '-c commandline'. The command line will then be repeatedly executed with different bayescustomize.ini values, optimizing the cost that is reported as the third word of the last line of the output. Obviously, I needed to change the output of timcv.py to report the flexcost, and that I did by introducing a generic CostCounter class which is in its own module. I am currently running: python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \ --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out But I'm so curious about other peoples results that I've already committed this before letting it run to completion. During the small test runs I did make, I learned that even this cost function has very sharp edges. I think this is caused by very often occurring wordprobs that are either used or not used by a small step in 'min_prob_strength' or one of the other parameters. I think this is harmless if the training sets are large enough. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Fri Nov 15 22:04:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 17:04:49 -0500 Subject: [Spambayes] Just for fun In-Reply-To: Message-ID: [Neale Pickett] > Uhhh, has anyone else noticed that mail to this list has SpamAssassin > headers in it? Yes; all email going thru python.org does. > I took SA out of my mail path months ago but just noticed the headers > when checking out that spam's headers. > > Mightn't SA's score in the message headers bais the results of a later > SpamBayes run? We ignore most header lines by default; unless you've done something to change that, the classifier is blind to SA's headers (because the tokenizer doesn't look at them). If you enable count_all_header_lines, or add SA's keywords to the safe_header_lines list, then the fact that they exist will be recorded (but nothing about their content). 
If you enable basic_header_tokenize and don't exclude SA's headers via the basic_header_skip regexp list, then SA's headers will be fully tokenized. Those are the only things you could do to make SA's headers visible, short of changing the tokenizer source code. By default, count_all_header_lines is false, SA's keywords are not in safe_header_lines, and basic_header_tokenize is false. From noreply@sourceforge.net Fri Nov 15 21:47:48 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Fri, 15 Nov 2002 13:47:48 -0800 Subject: [Spambayes] [ spambayes-Patches-639122 ] hammie: ignore emails older than n days Message-ID: Patches item #639122, was opened at 2002-11-15 15:47 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639122&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jason Hildebrand (jdhildeb) Assigned to: Nobody/Anonymous (nobody) Summary: hammie: ignore emails older than n days Initial Comment: Since your documentation stresses the importance of training using only relatively recent emails, I thought a good way to do this would be to have hammie do it for me. So I added a new configuration option: [Hammie] # when training, hammie will ignore messages older than this number of days. # i.e. set to 365 to ignore messages older than one year. # Set to 0 to disable any filtering by date. ignore_old_messages: 0 The patch also modifies Hammie to output the number of messages it read/ignored for each mail file it processes. This option might also prove useful for doing incremental training (i.e. set up cron to train once a week, and set ignore_old_messages to 7). 
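A minimal sketch of the kind of age check such an option implies, written against the modern Python standard library rather than taken from the submitted patch. The function name is_too_old and the decision to keep messages with a missing or unparsable Date header are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def is_too_old(msg, max_age_days, now=None):
    """Return True if msg's Date header is more than max_age_days old.

    max_age_days == 0 disables filtering, as the option's comment describes.
    Messages with a missing or unparsable Date are kept (an assumption).
    """
    if max_age_days <= 0:
        return False
    now = now or datetime.now(timezone.utc)
    raw = msg.get("Date")
    if raw is None:
        return False
    try:
        sent = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return False
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > timedelta(days=max_age_days)
```

A training loop would skip any message for which is_too_old(msg, ignore_old_messages) is true and count it as ignored.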
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639122&group_id=61702 From jason@peaceworks.ca Fri Nov 15 22:17:28 2002 From: jason@peaceworks.ca (Jason Hildebrand) Date: 15 Nov 2002 16:17:28 -0600 Subject: [Spambayes] introduction + date filtering for hammie Message-ID: <1037398648.10211.29.camel@trotzdem.raglan.org> Hi all, I'm new to this list. I played with content-based spam-filtering a few years ago in Perl, and after coming across Gary Robinson's article (and Graham's) was excited enough to implement both of these approaches in Python. I was interested in using an approach with multiple metrics, which would include Bayesian calculations as well as other ad-hoc measurements (e.g. the percentage of sentences ending with exclamation marks). I took these inputs and fed them into a back-propagating neural network (BPNN) using a Python module I found on the web. My hope was that the neural network would find the optimum "weights" to use for combining the multiple inputs into a single output, and would also determine the optimum cutoff-point between ham/spam, so that no "tweaking" would be required. My initial tests (training on 100-500 emails) showed the neural network approach (using Robinson as one of the metrics) was somewhat better than either the Graham or the Robinson approach without the BPNN. However, when I started training on larger corpuses (I've been collecting spam since 1998), its accuracy degraded. I did some more reading on the limitations of BPNNs (namely overtraining), and this result made sense. So now I've ended up here. :) I'm still getting up-to-speed on the spambayes code. So far, I have one improvement to offer: Since your documentation stresses the importance of training using only relatively recent emails, I thought a good way to do this would be to have hammie filter out old messages for me.
So I added a new configuration option: [Hammie] # when training, hammie will ignore messages older than this number of days. # i.e. set to 365 to ignore messages older than one year. # Set to 0 to disable any filtering by date. ignore_old_messages: 0 I also modified Hammie to output the number of messages it read/ignored for each mail file it processes. This option might also prove useful for doing incremental training (i.e. set up cron to train once a week, and set ignore_old_messages to 7). Caveat: this won't catch spams whose dates are deliberately set in the past, such as January 1, 1970 (I've seen a few). I've uploaded the patch to the sourceforge project page; hopefully someone has time to take a look at it. -- Jason D. Hildebrand jason@peaceworks.ca From francois.granger@free.fr Fri Nov 15 23:32:52 2002 From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger) Date: Sat, 16 Nov 2002 00:32:52 +0100 Subject: [Spambayes] Just for fun In-Reply-To: <3DD55970.4030500@hooft.net> References: <3DD55970.4030500@hooft.net> Message-ID: At 21:30 +0100 15/11/02, in message Re: [Spambayes] Just for fun, Rob Hooft wrote: >Tim Peters wrote: >>[Jeremy] >> >>>Here's how that message scored for me: >>> Score: 0.998364543478 >>> >>> Clues >>> ----- >>> *H* 0.003146604587 >>> *S* 0.999875691542 >> >> >>Good! On my tiny still-hapax-driven purely-mistake-based at-home classifier >>(which is up 79 each of ham and spam trained on) it fared much worse: >> >>Spam Score: 0.303213 >> >>word spamprob #ham #spam >>'*H*' 0.496939 - - >>'*S*' 0.103364 - - > >So that makes 80 in your classifier.... > >For my one-time-only private-email few-weeks-old too-lazy-to-retrain >classifier, this message did 0.88 due to the spambayes ham clues.... 
In my database, trained very badly with only some incoming ham and spam, it did: Spam probability: 0.65072156 Clues: *H* 0.33760104 *S* 0.63904417 content-type:text/plain 0.21812081 x-mailer:none 0.34220907 noheader:date 0.84482759 noheader:to 0.84482759 -- Email is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies. For clean emails: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html From lists@morpheus.demon.co.uk Fri Nov 15 22:41:22 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 15 Nov 2002 22:41:22 +0000 Subject: [Spambayes] Training on individual messages Message-ID: I'm looking at a Gnus interface to Spambayes (I'm at home now, so I've got rid of Outlook for the weekend :-)) The main issue is training, and in particular individual-message training. I've added an option to hammie to train on a single message, read from stdin. This allows me to implement a "this is ham/spam" action without needing a temporary file. I wonder, though - is this the right thing to do? Should Hammie be growing more and more options (at the back of my mind is the possibility of an "unlearn" option, needed if a message gets misclassified) or should these sorts of things be split out into separate utilities? There have been some messages recently about some form of "Corpus" class - is that going to address any of this? Paul.
-- This signature intentionally left blank From tim.one@comcast.net Sat Nov 16 00:48:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 19:48:53 -0500 Subject: [Spambayes] Just for fun In-Reply-To: Message-ID: [François Granger] > In my database, trained very badly with only some incoming ham and > spam, it did: > > Spam probability: 0.65072156 > > Clues: > > *H* 0.33760104 > *S* 0.63904417 > content-type:text/plain 0.21812081 > x-mailer:none 0.34220907 > noheader:date 0.84482759 > noheader:to 0.84482759 Something went wrong there. That's what you'd get for an entirely empty file. The email pkg makes up text/plain by default, and the other three are "negative clues" generated when we *don't* see a thing in the headers. This msg did have a To header, and did have a Date header ... it also had an X-Mailer header. From tim.one@comcast.net Sat Nov 16 04:14:44 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 23:14:44 -0500 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus Message-ID: Robert Woodhead mentioned an idea for using both unigrams and bigrams that might help, with a twist to avoid generating highly correlated clues. Gary Robinson was independently thinking along the same lines, and offline sketched a more fleshed-out similar scheme for doing this with unigrams, bigrams and trigrams. I implemented the latter but in a somewhat "purer" form. A patch for classifier.py is attached. Now I don't have any data that can show improvements, so whether this might help beats me. It wasn't a disaster for me, which is saying something, since previous ideas along these lines were clearly steps backward (as measured by error rates). So I need someone who's *not* getting great results now to try it (Anthony? Skip?). Big caution: this is a memory hog. I don't have enough RAM to run my full c.l.py test, or even half of it.
Here's from a small-subset 10-fold CV run:

filename:      before       tri
ham:spam:   3000:3000 3000:3000
fp total:           0         0
fp %:            0.00      0.00
fn total:           0         0
fn %:            0.00      0.00
unsure t:          26        42
unsure %:        0.43      0.70
real cost:      $5.20     $8.40
best cost:      $0.00     $0.00
h mean:          0.37      0.50
h sdev:          3.07      3.77
s mean:         99.92     99.87
s sdev:          1.49      2.06
mean diff:      99.55     99.37
k:              21.83     17.04

Judging from the error rates, it's got nothing going for it or against it. Why it *might* help: while "Python" is a very strong ham word in my tests, "Python Video" is a porn vendor, and this scheme should reliably know the difference. Etc. My data isn't hard enough for it to matter. If this really helps someone, then a number of things follow: cut it off with bigrams instead; boost it to 4-grams instead; if more than bigrams are needed for it to help, buy into some hashing scheme to make the database burden finite again. As I saw before with pure bigrams, conference announcements once again move into high-scoring territory, but not nearly so bad.
For example, the OSCON 2000 announcement got penalized for

prob('electronic mail') = 0.969799
prob('and companies') = 0.973373
prob('last name:') = 0.973373
prob('the completion') = 0.973373
prob('individuals who') = 0.976644
prob('cutting edge') = 0.978469
prob('fax the') = 0.978469
prob('target audience') = 0.978469
prob('the subject line') = 0.980893
prob('send all') = 0.981928
prob('not accepted.') = 0.991493
prob('with marketing') = 0.992611
prob('your email') = 0.996391
prob('will receive') = 0.997

but also got helped by

prob('note that the') = 0.0145631
prob('the call') = 0.0167286
prob('the tutorial') = 0.0167286
prob('problems that') = 0.0302013
prob('the open source') = 0.0348837
prob('tutorial and') = 0.0412844
prob('sent via') = 0.0608351
prob('other open') = 0.0652174
prob('proposals for') = 0.0652174
prob('text with') = 0.0652174
prob('the convention') = 0.0652174
prob('with open') = 0.0652174
prob('and open') = 0.0918367
prob('convention the') = 0.0918367
prob('for programmers,') = 0.0918367
prob('itself and the') = 0.0918367
prob('open source software') = 0.0918367
prob('source software') = 0.0918367
prob('that leads') = 0.0918367
prob('wide variety') = 0.0918367

In the end, it was highly ambiguous, with

prob = 0.500000084396
prob('*H*') = 1
prob('*S*') = 1

From tim.one@comcast.net Sat Nov 16 04:48:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 15 Nov 2002 23:48:14 -0500 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: Message-ID: [Tim] > ... > I implemented the latter but in a somewhat "purer" form. A patch for > classifier.py is attached. Sorry about that -- it's attached to this. -------------- next part -------------- A non-text attachment was scrubbed...
Name: tri.patch Type: application/octet-stream Size: 3880 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/25c41b69/tri.exe From rob@hooft.net Sat Nov 16 05:34:29 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 16 Nov 2002 06:34:29 +0100 Subject: [Spambayes] Better optimization loop References: <3DD56CEB.7050406@hooft.net> Message-ID: <3DD5D8E5.1020708@hooft.net> Rob Hooft wrote: > > I am currently running: > > python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \ > --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out > ....and it crashed at:

x=0.5698 p=0.0592 s=0.5238 sc=0.922 hc=0.031 65023.78
x=0.5899 p=0.0498 s=0.5485 sc=0.925 hc=-0.000 64856.12
x=0.6177 p=0.0300 s=0.5853 sc=0.934 hc=-0.065 60950.20
x=0.6002 p=0.0488 s=0.5679 sc=0.930 hc=-0.054 61612.61
x=0.6336 p=0.0320 s=0.5783 sc=0.939 hc=-0.107 58698.02
x=0.6799 p=0.0088 s=0.6141 sc=0.953 hc=-0.212 53820.35
x=0.6485 p=0.0253 s=0.6067 sc=0.953 hc=-0.164 56007.05
x=0.6802 p=0.0158 s=0.6332 sc=0.955 hc=-0.219 53535.36
x=0.7354 p=-0.0059 s=0.6879 sc=0.971 hc=-0.344 48832.47
x=0.7339 p=-0.0318 s=0.7062 sc=0.974 hc=-0.358 48426.74
x=0.8114 p=-0.0849 s=0.8000 sc=1.000 hc=-0.548 43010.31
x=0.7970 p=-0.0595 s=0.7497 sc=0.995 hc=-0.479 44800.35
x=0.8512 p=-0.0765 s=0.7981 sc=1.014 hc=-0.634 40904.48
x=0.9680 p=-0.1297 s=0.9045 sc=1.055 hc=-0.919 33847.97
x=0.9481 p=-0.1338 s=0.8958 sc=1.037 hc=-0.837 35574.58

i.e. it is reducing the cost by pulling ham_cutoff and spam_cutoff infinitely far apart, but was stopped by a negative log.... I will think of a better flex cost, and start over... Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sat Nov 16 05:46:19 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 16 Nov 2002 06:46:19 +0100 Subject: [Spambayes] Better optimization loop References: <3DD56CEB.7050406@hooft.net> <3DD5D8E5.1020708@hooft.net> Message-ID: <3DD5DBAB.4050101@hooft.net> Rob Hooft wrote: > Rob Hooft wrote: > >> >> I am currently running: >> >> python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \ >> --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out >> > > ....and it crashed at: > > x=0.9680 p=-0.1297 s=0.9045 sc=1.055 hc=-0.919 33847.97 > x=0.9481 p=-0.1338 s=0.8958 sc=1.037 hc=-0.837 35574.58 It was caused by an infinitely silly bug in the costcount that made it into a proper function to maximize instead of minimize. I shouldn't do things in the middle of the night (nor at 6:30 in the morning, but I'm trying to fix the pain I may have caused...) Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From niltsiar@neo.rr.com Sat Nov 16 06:11:24 2002 From: niltsiar@neo.rr.com (Todd Mokros) Date: 16 Nov 2002 01:11:24 -0500 Subject: [Spambayes] small vulnerability patch Message-ID: <1037427084.31134.17.camel@localhost> here's a small patch to fix a small header vulnerability. If a piece of spam spoofs the header added by hammie, then procmail recipes could match on the spoofed header. This deletes the hammie header before filtering. 
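The idea can be sketched with the modern standard-library email package. This is an illustration rather than the patch itself: the helper name tag_message and the header name used in the example are assumptions.

```python
from email import message_from_string


def tag_message(msg, header_name, verdict):
    """Replace any pre-existing copies of header_name with our own verdict.

    Deleting first means a header forged by the sender can never be the one
    a downstream procmail recipe matches on.
    """
    # del on a Message removes every occurrence and is a no-op if absent.
    del msg[header_name]
    msg[header_name] = verdict
    return msg


# Example: the incoming message already carries a forged "No" verdict.
spoofed = message_from_string(
    "X-Hammie-Disposition: No\nSubject: buy now\n\nbody text\n"
)
tagged = tag_message(spoofed, "X-Hammie-Disposition", "Yes; 0.99")
```

After the call, the message carries exactly one copy of the header, holding the filter's own verdict.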
--- ../../cvs-tracking/spambayes/hammie.py 2002-11-14 17:00:15.000000000 -0500 +++ hammie.py 2002-11-16 00:44:50.000000000 -0500 @@ -272,6 +272,8 @@ """ msg = mboxutils.get_message(msg) + if msg.has_key(header): + del msg[header] prob, clues = self._scoremsg(msg, True) if prob < ham_cutoff: disp = options.header_ham_string -- Todd Mokros From tim.one@comcast.net Sat Nov 16 06:36:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 16 Nov 2002 01:36:04 -0500 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: Message-ID: [Tim] > ... > Skip?). Big caution: this is a memory hog. I don't have enough > RAM to run my full c.l.py test, or even half of it. So the new patch attached plays hash games to slash it. Changing MASK to boost it may help; it's set for 256K max hash codes as-is. On my full c.l.py test (which has over 330K distinct words, so squashing into 256K hash codes necessarily conflates many words): filename: cv tri ham:spam: 20000:14000 20000:14000 fp total: 3 0 fp %: 0.01 0.00 fn total: 0 1 fn %: 0.00 0.01 unsure t: 103 586 unsure %: 0.30 1.72 real cost: $50.60 $118.20 best cost: $21.40 $32.40 h mean: 0.24 1.69 h sdev: 2.76 5.70 s mean: 99.93 99.68 s sdev: 1.59 3.37 mean diff: 99.69 97.99 k: 22.92 10.80 The Unsure rate zoomed. I'm not sure why. The lowest-scoring spam was absurd, a giant multi-level marketing spam written in German: prob = 0.0580526384697 prob('*H*') = 1 prob('*S*') = 0.116105 prob('haben sie schon') = 0.00185261 prob('gegeben finanziell') = 0.00405771 prob('... ich habe') = 0.00413223 prob('die power') = 0.00418173 prob('skip:d 10 wurde mir') = 0.00464396 prob('und adresse die') = 0.00530035 prob('skip:a 10 passierte') = 0.00570342 prob('#6".') = 0.0065312 prob('ein produkt,') = 0.00715421 prob('weiteren schwung') = 0.00764007 prob('sie bei') = 0.00790861 prob('zealand ich') = 0.00872423 prob('beste') = 0.00884086 prob('sich ein fenster') = 0.00920245 prob('100 bestellungen (oder') = 0.00959488 etc. 
Of course it's never seen most of those phrases at all in ham, but hash codes don't know that. The full quote of the Nigerian-scam spam fell from off-the-charts spam to middling Unsure. Again hash collisions must account for it: Data/Ham/Set5/74506.txt prob = 0.580354361406 prob('*H*') = 0.839291 prob('*S*') = 1 prob('report the existence') = 0.00238221 ... prob('identified the amount') = 0.00455005 prob('country. please note') = 0.00693374 prob('numbers your reply') = 0.00715421 prob('process. because the') = 0.00959488 ... prob('duties, have') = 0.0328367 prob('foreign partner.') = 0.0503757 prob('solicit your strict') = 0.0724398 prob('which chairman') = 0.0757576 prob('skip:w 10 "abass kabiru"') = 0.0812396 ... prob('25% for') = 0.0907928 prob('subject:: subject: ') = 0.0937339 prob('would use') = 0.102003 prob('for skip:m 10 intend') = 0.103881 prob('complex,') = 0.107769 prob('matter trust') = 0.108386 prob('more details this') = 0.123444 prob('subject: ( subject:)') = 0.127565 prob('present authorities, they') = 0.12886 ... prob('housing federal secretariat') = 0.207914 So like previous gimmicks using hash codes, the mistakes are unfathomable to human eyes, although you're unlikely to see any unless you've got a lot of training and testing data (in which case wild mistakes become more certain the more you've got). When the hash space is too small (as it surely was in this test), what *would* have been mild-prob hapaxes get associated with strong-probability phrases by accident. Aha! "On average" you can expect those accidents to cancel out, but chi-combining tends to Unsure in the presence of cancellation. I bet that explains the bulk of the Unsure rate boost. Sometimes the accidents will pile up in one direction or the other, though, likely accounting for the examples above (especially the German example, where the hash code of virtually every phrase is an accident). -------------- next part -------------- A non-text attachment was scrubbed... 
Name: tri2.patch Type: application/octet-stream Size: 5256 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021116/dd2c93b0/tri2.exe From noreply@sourceforge.net Sat Nov 16 12:32:50 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 16 Nov 2002 04:32:50 -0800 Subject: [Spambayes] [ spambayes-Patches-639310 ] fix for outlook 'spam' field Message-ID: Patches item #639310, was opened at 2002-11-16 04:32 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Piers Haken (piersh) Assigned to: Nobody/Anonymous (nobody) Summary: fix for outlook 'spam' field Initial Comment: 1) firstly it changes the Class of the 'Spam' field to olPercent, which I believe is much more appropriate than olCombination. The problem with olCombination is that you have to manually change the field type in outlook in order to get anything to show up. With olPercent, the column shows up with a nice '%' sign which makes it more obvious what the number actually means. 2) secondly it adds a checkbox 'Update spam scores' to the training dialog. Checking this box causes the trainer to update the spam field for ALL messages in your training folders (in a second pass, if necessary). This means that ALL messages in your inbox have an entry in that field, not just those that arrived since you installed the plugin. This was a huge win for me since it allowed me to sort by the spam field and throw away about 20 spams from my inbox that I had missed during my initial manual pruning. The only issue here is that in order for this to work right, you'll have to manually delete your existing spam fields, restart outlook and then 'rescore'. 
----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702

From noreply@sourceforge.net Sat Nov 16 12:35:19 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sat, 16 Nov 2002 04:35:19 -0800
Subject: [Spambayes] [ spambayes-Patches-639312 ] fix for outlook CompareEntryIDs bug
Message-ID:

Patches item #639312, was opened at 2002-11-16 04:35
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Piers Haken (piersh)
Assigned to: Nobody/Anonymous (nobody)
Summary: fix for outlook CompareEntryIDs bug

Initial Comment:
This patch reenables the CompareEntryIDs for comparing folder IDs. It passes both the MAPI Session and the Outlook Session into the dialog, one for retrieving the exchange-compatible IDs and the other for comparing them.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702

From richie@entrian.com Sat Nov 16 18:05:32 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 16 Nov 2002 18:05:32 +0000
Subject: [Spambayes] Training on individual messages
In-Reply-To:
References:
Message-ID:

Hi Paul,

> I wonder, though - is this the right thing to do? Should Hammie be
> growing more and more options (at the back of my mind is the
> possibility of an "unlearn" option, needed if a message gets
> misclassified) or should these sorts of things be split out into
> separate utilities?

They should be in a shared module IMHO, and you're right about this:

> There's been some messages recently about some form of "Corpus" class
> - is that going to address any of this?
Yes - Tim Stone's Corpus class, which he's just committed, encapsulates a corpus of emails, and lets you set up automatic training when adding/removing/moving messages. So for instance, you create a Spam corpus, attach a Trainer object to it, and call addMessage - that adds the message to the corpus, and trains on that message as Spam. Removing the message untrains it. pop3proxy.py is now using this for a web-based training interface, which I'm hoping to commit in the next couple of days. -- Richie Hindle richie@entrian.com From skip@pobox.com Sat Nov 16 21:15:53 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 16 Nov 2002 15:15:53 -0600 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: References: Message-ID: <15830.46473.661765.562628@montanaro.dyndns.org> Tim> So I need someone who's *not* getting great results now to try it Tim> (Anthony? Skip?). Big caution: this is a memory hog. I don't Tim> have enough RAM to run my full c.l.py test, or even half of it. Tim> Here's from a small-subset 10-fold CV run: Here's some data for you (thank goodness for RAM!). "base" is CVS (no mods). "tri" is CVS plus your first (no hash tricks) patch. The "tri" run consumed about 51 minutes of CPU on my Powerbook and pretty much ran in memory the entire time. filename: base tri ham:spam: 10439:6134 10439:6134 fp total: 24 25 fp %: 0.23 0.24 fn total: 71 55 fn %: 1.16 0.90 unsure t: 284 315 unsure %: 1.71 1.90 real cost: $367.80 $368.00 best cost: $312.40 $324.40 h mean: 0.68 0.71 h sdev: 6.56 6.73 s mean: 97.55 97.51 s sdev: 12.70 12.35 mean diff: 96.87 96.80 k: 5.03 5.07 I haven't looked at any raw output. Let me know if you want the raw numbers. 
Skip

From tim.one@comcast.net Sat Nov 16 22:17:40 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 16 Nov 2002 17:17:40 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <15830.46473.661765.562628@montanaro.dyndns.org>
Message-ID:

I ran my fat c.l.py test w/ the hash space clamped at 256K buckets. That was clearly a bad idea for that test, since there are about 330K unique unigrams in that corpus (let alone bigrams and trigrams).

cv below is the current all-default result on that test data, excepting for

    [Tokenizer]
    replace_nonascii_chars: True
    record_header_absence: True

The # of unsures is lower than I reported before: by staring at the unsures, I found 10 entirely empty (0 bytes) files in my spam corpus. Those got replaced with random spam from the reservoir (the empty msgs had scored as unsure). All other runs here are on the same data.

tri19 is the hashed trigram gimmick with the hash space boosted to 512K (19 bits of hash code). Contrary to expectations, the Unsure rate actually increased over the run with 256K buckets. But it still appeared to be due to unlucky hash collisions.

So tri20 boosted the # of hash buckets to a million. That still didn't help.

At that point I switched body tokenization strategy: I've long speculated that split-on-whitespace helped us over alphanumeric-run tokenization because s-o-w captures a *little* contextual information from the punctuation, and because it generates highly correlated clues in a way that *helps* (like "Python" and "Python?" count as distinct words). But if we're getting context and helpful correlation from bigrams and trigrams too, it seems plausible that the punctuation context gets in the way. So tri20a is with a million hash buckets, but tokenizing via re.findall with [\w$\-\x80-\xff]+ instead of s-o-w. Alas, overall its "best cost" was even worse than tri19's. s-o-w still rules.

So tri21 went back to s-o-w, but boosted the # of hash buckets to 2 million.
This finally started moving "in the right direction" again, but still loses to the original unhashed "exact" unigram scheme. Since I probably have more than a million unique unigrams + bigrams + trigrams (viewed as text strings) in this data, 2 million hash buckets is certainly *not* excessive. I expect it would do better with a lot more. But, even with the hash trickery, at 2M buckets I'm again pushing the limit of my RAM on the fat test (which trains on more than 30,000 msgs per run). So pushing this more would require a different database structure. So far the results aren't good enough to make me keen to pursue it. filename: cv tri19 tri20 tri20a tri21 ham:spam: 20000:14000 20000:14000 20000:14000 20000:14000 20000:14000 fp total: 3 0 0 0 0 fp %: 0.01 0.00 0.00 0.00 0.00 fn total: 0 7 8 3 2 fn %: 0.00 0.05 0.06 0.02 0.01 unsure t: 91 926 1128 1133 854 unsure %: 0.27 2.72 3.32 3.33 2.51 real cost: $48.20 $192.20 $233.60 $229.60 $172.80 best cost: $17.80 $36.60 $39.60 $51.00 $38.60 h mean: 0.24 0.30 0.20 0.27 0.38 h sdev: 2.73 2.25 1.93 2.38 3.00 s mean: 99.95 97.44 96.70 96.94 97.89 s sdev: 1.40 10.17 11.61 10.95 9.12 mean diff: 99.71 97.14 96.50 96.67 97.51 k: 24.14 7.82 7.13 7.25 8.05 The FN under all hashed schemes are mostly long spam in foreign languages, and *which* of those are judged ham varies across runs (changing the # of hash buckets, and/or the tokenization strategy, changes the set of accidental hash collisions). Because they're long they generate lots of hash codes; because they're foreign languages, the hash codes hit accidental matches; do that often enough and you're bound to get something that looks like solid ham. In tri21, the lowest-scoring FN was at 0.01, and happened to be a long spam in what looks like Polish. Non-hashing schemes are immune to this (brand new words are ignored, and the header clues dominate the score, which is usually enough to nail it as spam). The increase in Unsures appears to be almost entirely due to spam. 
Here's the ham score distro (in tri21) near 50:

47.0     0
47.5     0
48.0     1 *
48.5     2 *
49.0     0
49.5     1 *
50.0     1 *

and no ham scored higher than that. The spam score distro near 50:

47.0     2 *
47.5     3 *
48.0     2 *
48.5     3 *
49.0     7 *
49.5   104 *
50.0   343 **
50.5    40 *
51.0    29 *
51.5    20 *
52.0     7 *
52.5    18 *
53.0    18 *

I don't know why that is (well, yes, it's a huge increase in "cancellation disease" in spams, but I don't know *why* there's a huge cancellation disease increase for spam but not for ham).

The quote of the Nigerian scam spam was the highest-scoring ham, scoring exactly 0.5, with H=1 and S=1. The H=1 appeared mostly due to extremely strong ham clues in the headers, the strongest being:

prob('header:Subject:1 noheader:received noheader:x-abuse-info') = 4.88234e-005

Unfortunately, it's impossible to say whether that's "real" or was just a hash accident. It's pretty clear that this "ham clue" was an accident:

prob('7597133 federal ministry') = 0.0505618

and this even more so:

prob('housing (fmwh) nigeria.') = 0.0238095

The chance of this crap decreases as the # of hash buckets increases, but increases the more training data you've got too.
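
To make the collision argument concrete, here is a minimal sketch in present-day Python of squashing unigrams, bigrams and trigrams into 2**bits hash buckets. It is not the code from the tri/tri2 patches: the patch's hash function isn't shown in this thread, so zlib.crc32 stands in for it, and the names ngrams, bucket, count_collisions and expected_collisions are made up for illustration.

```python
import zlib

def ngrams(tokens, max_n=3):
    """Return all unigrams, then bigrams, ... up to max_n-grams, as strings."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def bucket(feature, bits):
    """Map a feature string to one of 2**bits buckets.

    Assumption: CRC-32 is only a stand-in for whatever hash the patch used.
    """
    mask = (1 << bits) - 1
    return zlib.crc32(feature.encode("utf-8")) & mask

def count_collisions(features, bits):
    """Count distinct features that land in an already-occupied bucket."""
    seen = set()
    collisions = 0
    for f in set(features):
        b = bucket(f, bits)
        if b in seen:
            collisions += 1
        seen.add(b)
    return collisions

def expected_collisions(n_features, bits):
    """Expected collision count if the hash spread features uniformly."""
    m = float(1 << bits)
    occupied = m * (1.0 - (1.0 - 1.0 / m) ** n_features)
    return n_features - occupied

if __name__ == "__main__":
    feats = ngrams("solicit your strict confidence in this matter".split())
    for bits in (4, 8, 18):
        print(bits, count_collisions(feats, bits))
```

Whenever there are more distinct features than buckets, at least (features - buckets) of them must share a bucket with something else, whatever hash is used. That is the situation described above for roughly 330K distinct words in 256K buckets, and it only gets worse once bigrams and trigrams are added to the mix.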
better-the-devil-you-know?-ly y'rs - tim From tim.one@comcast.net Sat Nov 16 23:51:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 16 Nov 2002 18:51:20 -0500 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: Message-ID: One more in this round; bi21 is 2M hash buckets but limited to unigrams and bigrams (no trigrams): filename: cv tri21 bi21 ham:spam: 20000:14000 20000:14000 20000:14000 fp total: 3 0 0 fp %: 0.01 0.00 0.00 fn total: 0 2 0 fn %: 0.00 0.01 0.00 unsure t: 91 854 300 unsure %: 0.27 2.51 0.88 real cost: $48.20 $172.80 $60.00 best cost: $17.80 $38.60 $23.20 h mean: 0.24 0.38 0.25 h sdev: 2.73 3.00 2.61 s mean: 99.95 97.89 99.41 s sdev: 1.40 9.12 4.67 mean diff: 99.71 97.51 99.16 k: 24.14 8.05 13.62 My guess is that it's more likely bigrams benefited from suffering fewer unfortunate hash collisions than that they're actually generating better raw info. The "missing test" here is exact bigrams (no hash convolutions). I'll try that later; may not have enough RAM for that, but should. From vanhorn@whidbey.com Sun Nov 17 00:22:47 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Sat, 16 Nov 2002 16:22:47 -0800 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus References: Message-ID: <3DD6E157.2CF9450F@whidbey.com> I've been meaning to ask, what do "real cost" and "best cost" actually mean? I've seen you guys "spend" several million dollars while testing, and if it "costs" that much to test for spam in this way, I'm going to have a heck of a time marking it up and selling it to customers! 
Van Tim Peters wrote: > One more in this round; bi21 is 2M hash buckets but limited to unigrams and > bigrams (no trigrams): > > filename: cv tri21 bi21 > ham:spam: 20000:14000 20000:14000 > 20000:14000 > fp total: 3 0 0 > fp %: 0.01 0.00 0.00 > fn total: 0 2 0 > fn %: 0.00 0.01 0.00 > unsure t: 91 854 300 > unsure %: 0.27 2.51 0.88 > real cost: $48.20 $172.80 $60.00 > best cost: $17.80 $38.60 $23.20 > h mean: 0.24 0.38 0.25 > h sdev: 2.73 3.00 2.61 > s mean: 99.95 97.89 99.41 > s sdev: 1.40 9.12 4.67 > mean diff: 99.71 97.51 99.16 > k: 24.14 8.05 13.62 > > My guess is that it's more likely bigrams benefited from suffering fewer > unfortunate hash collisions than that they're actually generating better raw > info. > > The "missing test" here is exact bigrams (no hash convolutions). I'll try > that later; may not have enough RAM for that, but should. > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From tim.one@comcast.net Sun Nov 17 00:44:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 16 Nov 2002 19:44:16 -0500 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: <3DD6E157.2CF9450F@whidbey.com> Message-ID: [G. Armour Van Horn] > I've been meaning to ask, what do "real cost" and "best cost" > actually mean? 
In Options.py:

    # After the display of a ham+spam histogram pair, you can get a listing
    # of all the cutoff values (coinciding with histogram bucket boundaries)
    # that minimize
    #
    #     best_cutoff_fp_weight * (# false positives) +
    #     best_cutoff_fn_weight * (# false negatives) +
    #     best_cutoff_unsure_weight * (# unsure msgs)
    #
    # This displays two cutoffs: hamc and spamc, where
    #
    #     0.0 <= hamc <= spamc <= 1.0
    #
    # The idea is that if something scores < hamc, it's called ham; if
    # something scores >= spamc, it's called spam; and everything else is
    # called 'I'm not sure' -- the middle ground.
    #
    # Note: You may wish to increase nbuckets, to give this scheme more
    # cutoff values to analyze.
    compute_best_cutoffs_from_histograms: True
    best_cutoff_fp_weight: 10.00
    best_cutoff_fn_weight: 1.00
    best_cutoff_unsure_weight: 0.20

So by default, an FP is charged $10, an FN $1, and an unsure $0.20. The best cost is the lowest cost you could possibly have gotten by choosing ham and spam cutoffs with perfect knowledge of how things would turn out. The real cost is how things actually turned out, using the ham and spam cutoffs you supplied in advance.

> I've seen you guys "spend" several million dollars while testing,
> and if it "costs" that much to test for spam in this way, I'm going
> to have a heck of a time marking it up and selling it to customers!

Relax; they're Canadian dollars.

From mhammond@skippinet.com.au Sun Nov 17 03:38:10 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 17 Nov 2002 14:38:10 +1100
Subject: [Spambayes] Too much information
Message-ID:

From the spam that Tim let through - it came in at 0.8 - not too bad given the strong ham indications from mailman.

But now the project is telling me things I really don't want to know - like I am being undervalued. From this spam's hints:

word      spamprob
'$200'    0.0918367
...
'$500'    0.977856

Obviously the spammers are aiming too high - hit me with an offer of $200, and I seem to like it - but offer me $500, and I turn my nose up.

Ripped-off-or-what? ly,

Mark.

From neale@woozle.org Sun Nov 17 03:49:26 2002
From: neale@woozle.org (Neale Pickett)
Date: 16 Nov 2002 19:49:26 -0800
Subject: [Spambayes] hammie's dbm file has changed
Message-ID:

I just want to make sure everyone is aware that hammie.py's dbm file format has changed now. I sent a message out about it two days ago and didn't get any responses, so it's in now.

From skip@pobox.com Sun Nov 17 03:50:24 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sat, 16 Nov 2002 21:50:24 -0600
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes hammiefilter.py,NONE,1.1 README.txt,1.42,1.43 hammie.py,1.38,1.39 mboxutils.py,1.6,1.7
In-Reply-To:
References:
Message-ID: <15831.4608.910147.312797@montanaro.dyndns.org>

    Neale> * hammie.py can now take messages on stdin, but it's ugly. If
    Neale> you want to do this, you should look at hammiefilter.py

I'm not sure I get this. I use hammie.py as a filter from my procmailrc file already. What new feature did you add? The ability to train on a message on stdin?

Skip

From neale@woozle.org Sun Nov 17 03:55:39 2002
From: neale@woozle.org (Neale Pickett)
Date: 16 Nov 2002 19:55:39 -0800
Subject: [Spambayes] A kinder, gentler hammie
Message-ID:

I've checked in hammiefilter.py, which I've been using for a few days now. The idea is to make the user interface (that means command-line options) as lightweight as possible. Now the setup for a procmail-based solution is even easier:

    $ hammiefilter.py -n
    Created new database in hammie.db

Then you add a procmail recipe like this:

    :0 fw
    | hammiefilter.py

And in your MUA, pipe a message to it with the -s option for spam, or the -g option for ham.
I think I'd like to have hammiefilter check for a "~/.hammierc" file in addition to a bayescustomize.ini file, and also set the persistent_storage_file option to "~/.hammie.db" unless it's overridden. With those two changes, we'd have something supremely easy to drop into your mail setup. Almost as easy as SpamAssassin (except, of course, that you don't have to keep retraining SpamAssassin).

Neale

From tim.one@comcast.net Sun Nov 17 05:19:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 17 Nov 2002 00:19:55 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To:
Message-ID:

[Tim]
> ...
> The "missing test" here is exact bigrams (no hash convolutions). I'll
> try that later; may not have enough RAM for that, but should.

I didn't, but by cutting 6,000 ham out of my test data managed to complete the test in < 3 hours with lots of disk thrashing. cv is baseline, bi21 the bigram gimmick w/ 2**21 hash buckets, and bix the exact bigram run:

filename:            cv         bi21          bix
ham:spam:   20000:14000  20000:14000  14000:14000
fp total:             3            0            0
fp %:              0.01         0.00         0.00
fn total:             0            0            0
fn %:              0.00         0.00         0.00
unsure t:            91          300           98
unsure %:          0.27         0.88         0.35
real cost:       $48.20       $60.00       $19.60
best cost:       $17.80       $23.20        $6.80
h mean:            0.24         0.25         0.25
h sdev:            2.73         2.61         2.82
s mean:           99.95        99.41        99.91
s sdev:            1.40         4.67         1.94
mean diff:        99.71        99.16        99.66
k:                24.14        13.62        20.94

There are simply no surprises in the bix output; under the covers it's beautiful:

-> Ham scores for all runs: 14000 items; mean 0.25; sdev 2.82
-> min 0; median 3.88578e-014; max 69.8223
-> percentiles: 5% 0; 25% 0; 75% 7.29026e-009; 95% 0.00817516

3 ham scored between 50 and 50.5; 1 ham scored 69.8; all other ham scored below 50.

-> Spam scores for all runs: 14000 items; mean 99.91; sdev 1.94
-> min 30.4227; median 100; max 100
-> percentiles: 5% 100; 25% 100; 75% 100; 95% 100

2 spam scored between 49.5 and 50.5; 1 spam scored 30.4; all other spam scored above 50.
The best-cost cutoffs reflect this sharp separation:

-> best cost for all runs: $6.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 9 cutoff pairs
-> smallest ham & spam cutoffs 0.495 & 0.7
-> fp 0; fn 1; unsure ham 10; unsure spam 19
-> fp rate 0%; fn rate 0.00714%; unsure rate 0.104%
-> largest ham & spam cutoffs 0.495 & 0.74
-> fp 0; fn 1; unsure ham 10; unsure spam 19
-> fp rate 0%; fn rate 0.00714%; unsure rate 0.104%

The highest-scoring ham is the only reason for the high suggested spam cutoff (else .51 would have worked fine and eliminated almost all the unsures), and was the 2nd-worst in the baseline test: poor Vickie Mills suffering from her obnoxious employer-generated sig. Bigrams helped her, but it's unclear why(!). Her three strongest ham clues:

prob('noheader:return-path noheader:abuse-reports-to') = 0.000364471
prob('subject:Python') = 0.00116829
prob('header:From:1 header:MIME-Version:1') = 0.00129855

Now the order in which noheader metatokens get generated is an accident inherited from the order in which a Set happens to enumerate its elements. What I fear here is that I've stumbled into a too-good systematic difference between c.l.py headers and BruceG's spam headers: not in the *set* of header lines they contain (I'm ignoring all headers that might matter), but something much subtler having to do with how the set of all header lines in existence may affect dict iteration order for the few headers I actually look at. Bigrams involving header lines are striking in the test output; e.g., this one is a strong spam clue, and is almost as mysterious:

prob('header:Subject:1 noheader:errors-to') = 0.96321

If we pursue this approach, this will take much thought.

The lowest-scoring spam was the one with a uuencoded text body we throw away unlooked-at.
A header bigram boosted its spam score in a clearly helpful way: prob('from:addr:hotmail.com from:no real name:2**0') = 0.985839 Indeed, something sent from hotmail without a real name reeks to heaven of spam, much more so than hotmail alone or no-real-name alone. OTOH, check out the 6 strongest ham clues for the same spam: prob('header:X-Complaints-To:1') = 4.00324e-005 prob('header:From:1 header:Date:1') = 0.000279524 prob('header:Subject:1 noheader:received') = 0.000382997 prob('noheader:x-face noheader:return-path') = 0.00045433 prob('header:Organization:1 header:Message-ID:1') = 0.00137332 prob('noheader:in-reply-to noheader:reply-to') = 0.0110694 The X-Complaints-To header has always helped this guy a lot, but why the 5 new header-bigram combos here are hammish remains a mystery. The memory burden of this run is also a mystery. I played with mixing unigrams and bigrams before, and recall the c.l.py test topping out at about 120MB. This run was over 256MB (hence the massive swapping on my 256MB box), and it wasn't even a full run. A difference is that, in my previous runs, the *tokenizer* generated the unigrams and bigrams, and only for the body. In this run the classifier generated them, and header tokens got into the mix too. I suppose header bigrams are large (as strings), and that there are a lot of them -- heck, for lots of msgs, even just the text of the headers I look at here is larger than the msg bodies. So while this scheme may have real promise, mysteries and practical problems abound. I'm out of time for looking at this. If someone wants to pursue it, I'll attach the classifier I used. Based on everything I've done here, two suggestions: 1. Every time we've tried them, hash schemes have been unsatisfying, due to the human-incomprehensible mistakes they make. You're unlikely to see mistakes on a small run, though -- this is a percentage game that *eventually* loses big. 2. 
I've got some evidence to believe that exact bigrams may help, but saw nothing in the exact trigram runs to suggest they buy anything over that. Trigrams helped on-topic c.l.py conference announcements, but they also hurt them, and *that* class of problem msg is already solidly Unsure under the unigram scheme. -------------- next part -------------- # An implementation of a Bayes-like spam classifier. # # Paul Graham's original description: # # http://www.paulgraham.com/spam.html # # A highly fiddled version of that can be retrieved from our CVS repository, # via tag Last-Graham. This made many demonstrated improvements in error # rates over Paul's original description. # # This code implements Gary Robinson's suggestions, the core of which are # well explained on his webpage: # # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html # # This is theoretically cleaner, and in testing has performed at least as # well as our highly tuned Graham scheme did, often slightly better, and # sometimes much better. It also has "a middle ground", which people like: # the scores under Paul's scheme were almost always very near 0 or very near # 1, whether or not the classification was correct. The false positives # and false negatives under Gary's basic scheme (use_gary_combining) generally # score in a narrow range around the corpus's best spam_cutoff value. # However, it doesn't appear possible to guess the best spam_cutoff value in # advance, and it's touchy. # # The chi-combining scheme used by default here gets closer to the theoretical # basis of Gary's combining scheme, and does give extreme scores, but also # has a very useful middle ground (small # of msgs spread across a large range # of scores, and good cutoff values aren't touchy). # # This implementation is due to Tim Peters et alia. 
from __future__ import generators import math import time from sets import Set from Options import options from chi2 import chi2Q try: True, False except NameError: # Maintain compatibility with Python 2.2 True, False = 1, 0 LN2 = math.log(2) # used frequently by chi-combining PICKLE_VERSION = 1 class WordInfo(object): __slots__ = ('atime', # when this record was last used by scoring(*) 'spamcount', # # of spams in which this word appears 'hamcount', # # of hams in which this word appears 'killcount', # # of times this made it to spamprob()'s nbest 'spamprob', # prob(spam | msg contains this word) ) # Invariant: For use in a classifier database, at least one of # spamcount and hamcount must be non-zero. # # (*)atime is the last access time, a UTC time.time() value. It's the # most recent time this word was used by scoring (i.e., by spamprob(), # not by training via learn()); or, if the word has never been used by # scoring, the time the word record was created (i.e., by learn()). # One good criterion for identifying junk (word records that have no # value) is to delete words that haven't been used for a long time. # Perhaps they were typos, or unique identifiers, or relevant to a # once-hot topic or scam that's fallen out of favor. Whatever, if # a word is no longer being used, it's just wasting space. def __init__(self, atime, spamprob=options.unknown_word_prob): self.atime = atime self.spamcount = self.hamcount = self.killcount = 0 self.spamprob = spamprob def __repr__(self): return "WordInfo%r" % repr((self.atime, self.spamcount, self.hamcount, self.killcount, self.spamprob)) def __getstate__(self): return (self.atime, self.spamcount, self.hamcount, self.killcount, self.spamprob) def __setstate__(self, t): (self.atime, self.spamcount, self.hamcount, self.killcount, self.spamprob) = t class Bayes: # Defining __slots__ here made Jeremy's life needlessly difficult when # trying to hook this all up to ZODB as a persistent object. 
There's # no space benefit worth getting from slots in this class; slots were # used solely to help catch errors earlier, when this code was changing # rapidly. #__slots__ = ('wordinfo', # map word to WordInfo record # 'nspam', # number of spam messages learn() has seen # 'nham', # number of non-spam messages learn() has seen # ) # allow a subclass to use a different class for WordInfo WordInfoClass = WordInfo def __init__(self): self.wordinfo = {} self.nspam = self.nham = 0 def __getstate__(self): return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham def __setstate__(self, t): if t[0] != PICKLE_VERSION: raise ValueError("Can't unpickle -- version %s unknown" % t[0]) self.wordinfo, self.nspam, self.nham = t[1:] # spamprob() implementations. One of the following is aliased to # spamprob, depending on option settings. def gary_spamprob(self, wordstream, evidence=False): """Return best-guess probability that wordstream is spam. wordstream is an iterable object producing words. The return value is a float in [0.0, 1.0]. If optional arg evidence is True, the return value is a pair probability, evidence where evidence is a list of (word, probability) pairs. """ from math import frexp # This combination method is due to Gary Robinson; see # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html # The real P = this P times 2**Pexp. Likewise for Q. We're # simulating unbounded dynamic float range by hand. If this pans # out, *maybe* we should store logarithms in the database instead # and just add them here. But I like keeping raw counts in the # database (they're easy to understand, manipulate and combine), # and there's no evidence that this simulation is a significant # expense. 
P = Q = 1.0 Pexp = Qexp = 0 clues = self._getclues(wordstream) for prob, word, record in clues: if record is not None: # else wordinfo doesn't know about it record.killcount += 1 P *= 1.0 - prob Q *= prob if P < 1e-200: # move back into range P, e = frexp(P) Pexp += e if Q < 1e-200: # move back into range Q, e = frexp(Q) Qexp += e P, e = frexp(P) Pexp += e Q, e = frexp(Q) Qexp += e num_clues = len(clues) if num_clues: #P = 1.0 - P**(1./num_clues) #Q = 1.0 - Q**(1./num_clues) # # (x*2**e)**n = x**n * 2**(e*n) n = 1.0 / num_clues P = 1.0 - P**n * 2.0**(Pexp * n) Q = 1.0 - Q**n * 2.0**(Qexp * n) # (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives # ((P-Q)/(P+Q)+1)/2 = # ((P-Q+P+Q)/(P+Q))/2 = # (2*P/(P+Q))/2 = # P/(P+Q) prob = P/(P+Q) else: prob = 0.5 if evidence: clues = [(w, p) for p, w, r in clues] clues.sort(lambda a, b: cmp(a[1], b[1])) return prob, clues else: return prob if options.use_gary_combining: spamprob = gary_spamprob # Across vectors of length n, containing random uniformly-distributed # probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution # with 2*n degrees of freedom. This has been proven (in some # appropriate sense) to be the most sensitive possible test for # rejecting the hypothesis that a vector of probabilities is uniformly # distributed. Gary Robinson's original scheme was monotonic *with* # this test, but skipped the details. Turns out that getting closer # to the theoretical roots gives a much sharper classification, with # a very small (in # of msgs), but also very broad (in range of scores), # "middle ground", where most of the mistakes live. In particular, # this scheme seems immune to all forms of "cancellation disease": if # there are many strong ham *and* spam clues, this reliably scores # close to 0.5. Most other schemes are extremely certain then -- and # often wrong. def chi2_spamprob(self, wordstream, evidence=False): """Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words. The return value is a float in [0.0, 1.0]. If optional arg evidence is True, the return value is a pair probability, evidence where evidence is a list of (word, probability) pairs. """ from math import frexp, log as ln # We compute two chi-squared statistics, one for ham and one for # spam. The sum-of-the-logs business is more sensitive to probs # near 0 than to probs near 1, so the spam measure uses 1-p (so # that high-spamprob words have greatest effect), and the ham # measure uses p directly (so that lo-spamprob words have greatest # effect). # # For optimization, sum-of-logs == log-of-product, and f.p. # multiplication is a lot cheaper than calling ln(). It's easy # to underflow to 0.0, though, so we simulate unbounded dynamic # range via frexp. The real product H = this H * 2**Hexp, and # likewise the real product S = this S * 2**Sexp. H = S = 1.0 Hexp = Sexp = 0 clues = self._getclues(wordstream) for prob, word, record in clues: if record is not None: # else wordinfo doesn't know about it record.killcount += 1 S *= 1.0 - prob H *= prob if S < 1e-200: # prevent underflow S, e = frexp(S) Sexp += e if H < 1e-200: # prevent underflow H, e = frexp(H) Hexp += e # Compute the natural log of the product = sum of the logs: # ln(x * 2**i) = ln(x) + i * ln(2). S = ln(S) + Sexp * LN2 H = ln(H) + Hexp * LN2 n = len(clues) if n: S = 1.0 - chi2Q(-2.0 * S, 2*n) H = 1.0 - chi2Q(-2.0 * H, 2*n) # How to combine these into a single spam score? We originally # used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H). A # systematic problem is that we could end up being near-certain # a thing was (for example) spam, even if S was small, provided # that H was much smaller. # Rob Hooft stared at these problems and invented the measure # we use now, the simpler S-H, scaled into [0., 1.]. 
prob = (S-H + 1.0) / 2.0 else: prob = 0.5 if evidence: clues = [(w, p) for p, w, r in clues] clues.sort(lambda a, b: cmp(a[1], b[1])) clues.insert(0, ('*S*', S)) clues.insert(0, ('*H*', H)) return prob, clues else: return prob if options.use_chi_squared_combining: spamprob = chi2_spamprob def learn(self, wordstream, is_spam, update_probabilities=True): """Teach the classifier by example. wordstream is a word stream representing a message. If is_spam is True, you're telling the classifier this message is definitely spam, else that it's definitely not spam. If optional arg update_probabilities is False (the default is True), don't update word probabilities. Updating them is expensive, and if you're going to pass many messages to learn(), it's more efficient to pass False here and call update_probabilities() once when you're done -- or to call learn() with update_probabilities=True when passing the last new example. The important thing is that the probabilities get updated before calling spamprob() again. """ self._add_msg(wordstream, is_spam) if update_probabilities: self.update_probabilities() def unlearn(self, wordstream, is_spam, update_probabilities=True): """In case of pilot error, call unlearn ASAP after screwing up. Pass the same arguments you passed to learn(). """ self._remove_msg(wordstream, is_spam) if update_probabilities: self.update_probabilities() def update_probabilities(self): """Update the word probabilities in the spam database. This computes a new probability for every word in the database, so can be expensive. learn() and unlearn() update the probabilities each time by default. They have an optional argument that allows skipping this step when feeding in many messages, and in that case you should call update_probabilities() after feeding the last message and before calling spamprob().
""" nham = float(self.nham or 1) nspam = float(self.nspam or 1) S = options.unknown_word_strength StimesX = S * options.unknown_word_prob for word, record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). # This is the Graham calculation, but stripped of biases, and # stripped of clamping into 0.01 thru 0.99. The Bayesian # adjustment following keeps them in a sane range, and one # that naturally grows the more evidence there is to back up # a probability. hamcount = record.hamcount assert hamcount <= nham hamratio = hamcount / nham spamcount = record.spamcount assert spamcount <= nspam spamratio = spamcount / nspam prob = spamratio / (hamratio + spamratio) # Now do Robinson's Bayesian adjustment. # # s*x + n*p(w) # f(w) = -------------- # s + n # # I find this easier to reason about like so (equivalent when # s != 0): # # x - p # p + ------- # 1 + n/s # # IOW, it moves p a fraction of the distance from p to x, and # less so the larger n is, or the smaller s is. n = hamcount + spamcount prob = (StimesX + n * prob) / (S + n) if record.spamprob != prob: record.spamprob = prob # The next seemingly pointless line appears to be a hack # to allow a persistent db to realize the record has changed. self.wordinfo[word] = record def clearjunk(self, oldesttime): """Forget useless wordinfo records. This can shrink the database size. A record for a word will be retained only if the word was accessed at or after oldesttime. """ wordinfo = self.wordinfo tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime] for w in tonuke: del wordinfo[w] # NOTE: Graham's scheme had a strange asymmetry: when a word appeared # n>1 times in a single message, training added n to the word's hamcount # or spamcount, but predicting scored words only once. Tests showed # that adding only 1 in training, or scoring more than once when # predicting, hurt under the Graham scheme.
# This isn't so under Robinson's scheme, though: results improve # if training also counts a word only once. The mean ham score decreases # significantly and consistently, ham score variance decreases likewise, # mean spam score decreases (but less than mean ham score, so the spread # increases), and spam score variance increases. # I (Tim) speculate that adding n times under the Graham scheme helped # because it acted against the various ham biases, giving frequently # repeated spam words (like "Viagra") a quick ramp-up in spamprob; else, # adding only once in training, a word like that was simply ignored until # it appeared in 5 distinct training hams. Without the ham-favoring # biases, though, and never ignoring words, counting n times introduces # a subtle and unhelpful bias. # There does appear to be some useful info in how many times a word # appears in a msg, but distorting spamprob doesn't appear a correct way # to exploit it. def _add_msg(self, wordstream, is_spam): if is_spam: self.nspam += 1 else: self.nham += 1 wordinfo = self.wordinfo wordinfoget = wordinfo.get now = time.time() for word in self._get_all_tokens(wordstream): record = wordinfoget(word) if record is None: record = self.WordInfoClass(now) if is_spam: record.spamcount += 1 else: record.hamcount += 1 # Needed to tell a persistent DB that the content changed. 
wordinfo[word] = record def _remove_msg(self, wordstream, is_spam): if is_spam: if self.nspam <= 0: raise ValueError("spam count would go negative!") self.nspam -= 1 else: if self.nham <= 0: raise ValueError("non-spam count would go negative!") self.nham -= 1 wordinfo = self.wordinfo wordinfoget = wordinfo.get for word in self._get_all_tokens(wordstream): record = wordinfoget(word) if record is not None: if is_spam: if record.spamcount > 0: record.spamcount -= 1 else: if record.hamcount > 0: record.hamcount -= 1 if record.hamcount == 0 == record.spamcount: del wordinfo[word] else: # Needed to tell a persistent DB that the content changed. wordinfo[word] = record def _getclues(self, wordstream): mindist = options.minimum_prob_strength unknown = options.unknown_word_prob rawclues = [] pushclue = rawclues.append wordinfoget = self.wordinfo.get now = time.time() w1 = 'BOM' pos = 0 for w2 in self._wrap_wordstream(wordstream): pos += 1 first2 = w1 + " " + w2 endpos = pos for word in w1, first2: endpos += 1 record = wordinfoget(word) if record is None: prob = unknown else: record.atime = now prob = record.spamprob distance = abs(prob - 0.5) if distance >= mindist: pushclue((-distance, prob, word, record, pos, endpos)) w1 = w2 rawclues.sort() clues = [] pushclue = clues.append atmost = options.max_discriminators wordseen = {} posseen = [0] * (pos + 4) for junk, prob, word, record, pos, endpos in rawclues: if word in wordseen: continue skip = 0 for i in range(pos, endpos): if posseen[i]: skip = 1 break if skip: continue pushclue((prob, word, record)) wordseen[word] = 1 for i in range(pos, endpos): posseen[i] = 1 if len(clues) >= atmost: break # Return (prob, word, record). 
return clues def _wrap_wordstream(self, wordstream): for w in wordstream: yield w yield "EOM" def _get_all_tokens(self, wordstream): seen = {} w1 = 'BOM' for w2 in self._wrap_wordstream(wordstream): first2 = w1 + " " + w2 for word in w1, first2: if word not in seen: seen[word] = 1 yield word w1 = w2 From tim.one@comcast.net Sun Nov 17 06:52:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 17 Nov 2002 01:52:42 -0500 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: Message-ID: One more test result, from single-source general python.org email. This is a 2-fold (not 10-fold) cross validation run, so in each part it trained on half the data and predicted against the other half. org is baseline CVS, orgbix is exact bigram: filename: org orgbix ham:spam: 5482:1896 5482:1896 fp total: 2 2 fp %: 0.04 0.04 fn total: 17 13 fn %: 0.90 0.69 unsure t: 206 235 unsure %: 2.79 3.19 real cost: $78.20 $80.00 best cost: $70.00 $65.40 h mean: 0.58 0.53 h sdev: 4.96 4.83 s mean: 95.74 95.28 s sdev: 14.73 14.61 mean diff: 95.16 94.75 k: 4.83 4.87 The FP were the same under both runs. One is a one-word administrivia request ("subscribe") buried in a veritable mountain of employer-generated HTML disclaimers. The other FP is a mystery (I remember corresponding about it with GregW and we decided it was ham, but it wasn't obvious). Both schemes have trouble with long, chatty spam; I've generally found this a problem until training on thousands of spams. Each FN under orgbix was also an FN under unigrams. Bigrams pushed four brief FN into Unsure territory, but didn't nail them. From rob@hooft.net Sun Nov 17 08:07:59 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 17 Nov 2002 09:07:59 +0100 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus References: Message-ID: <3DD74E5F.9060606@hooft.net> Tim Peters wrote: > I ran my fat c.l.py test w/ the hash space clamped at 256K buckets. 
That > was clearly a bad idea for that test, since there are about 330K unique > unigrams in that corpus (let alone bigrams and trigrams). I collected some unigram statistics yesterday: training hammie on my 2x10 sets in the corpus one by one, and after each 1600ham+580spam set run a program that reports the would-be collisions using the python 32 bit hash function: Set1 : 109280 Set2 : 183560 Set3 : 227699 (2 clashes) Set4 : 277253 (3 clashes) Set5 : 329662 (5) Set6 : 362847 (7) Set7 : 394585 (12) Set8 : 422898 (12) Set9 : 448767 (16) Set10: 481393 (22) clash: [('1156', 0.027), ('607.80', 0.142)] clash: [('url:2516', 0.838), ('>beautiful', 0.142)] clash: [("erhc's", 0.964), ('27.7-0.144', 0.142)] clash: [('19271', 0.0841), ('richtig', 0.722)] clash: [('geleefd.', 0.142), ('20:10:05', 0.142)] clash: [('#000000', 0.905), ('from:name:jean richelle', 0.142)] clash: [('*lunit,', 0.142), ('.2635', 0.084)] clash: [("aminggs'", 0.142), ('m"f\'^', 0.142)] clash: [('02-6203-3010', 0.838), ('arona,', 0.838)] clash: [('dislin.graf.', 0.142), ('(inquires', 0.905)] clash: [('/9?!o_(jz?\\`', 0.142), ('arnhemse', 0.084)] clash: [('1075,1079', 0.142), ('from:name:c31', 0.838)] clash: [('1096377', 0.142), ('url:baoding', 0.838)] clash: [('url:bible', 0.565), ('scis', 0.084)] clash: [('334.8', 0.0596), ('\xc0\xd6\xbd\xc0\xb4\xcf\xb4\xd9.*', 0.905)] clash: [('d8/apex', 0.142), ('3\xb8\xb89\xc3\xb5\xbf\xf8\xc0\xbb', 0.838)] clash: [('subject:!!! ', 0.905), ('from:addr:lll2002', 0.838)] clash: [('constitutes', 0.848), ('roast)', 0.142)] clash: [('>madison,', 0.017), ('subject:dison', 0.059)] clash: [('(powerpc)', 0.142), ('url:table', 0.849)] clash: [('subject:Complaint', 0.142), ('-24.727', 0.142)] clash: [('om=-96.953', 0.142), ('line-with', 0.142)] The experienced spambayeser can see that I didn't use the standard parameters, this is because I did run an optimization using simplexloop in the background at the same time. 
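[Editorial sketch, not part of Rob's message.] The clash counts listed above can be compared with a birthday-problem estimate. Assuming the hash spreads keys uniformly over m buckets (an idealisation; nothing in the thread claims Python's string hash behaves exactly this way), the expected number of colliding key pairs among n distinct keys is n*(n-1)/(2*m). The helper name expected_colliding_pairs is hypothetical and is not part of hammie or clash.py.

```python
def expected_colliding_pairs(n_keys, hash_bits):
    """Expected number of key pairs sharing a hash value.

    Assumes hashes are uniform over 2**hash_bits values, which is
    an idealisation of any real hash function.
    """
    buckets = 2.0 ** hash_bits
    return n_keys * (n_keys - 1) / (2.0 * buckets)


# Set10 in the list above held 481393 distinct unigrams.
for bits in (32, 28, 24, 20, 18):
    print("%2d bits: about %.0f colliding pairs"
          % (bits, expected_colliding_pairs(481393, bits)))
```

At 32 bits the estimate is roughly 27 pairs, the same order as the 22 clashes reported for Set10, and every bit removed from the hash doubles it.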
Here, the number of hash collisions is still fairly low, but subtract bits, and see it explode..... Another thing that I learned from this is that the number of distinct words in this test does not increase with the sqrt of the number of messages. Here is clash.py: ----- from hammie import DBDict from Options import options d=DBDict(options.persistent_storage_file,'r',('saved state',)) h={} n=0 for k in d.iterkeys(): n += 1 #print k,type(d[k]) hs=hash(k) if h.has_key(hs): h[hs].append((k,d[k].spamprob)) print "clash: ",h[hs] else: h[hs]=[(k,d[k].spamprob)] print n ----- Regards, Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Nov 17 08:28:26 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 17 Nov 2002 09:28:26 +0100 Subject: [Spambayes] Better optimization loop References: <3DD56CEB.7050406@hooft.net> <3DD5D8E5.1020708@hooft.net> <3DD5DBAB.4050101@hooft.net> Message-ID: <3DD7532A.8020906@hooft.net> Further changes in the optimization (not yet checked in, but I assume everyone is running trigrams now...) I decided that we have a perfect way to optimize the ham and spam cutoff values in timcv already, so that I can remove these from the simplex optimization. To that end I added a "delayed" flexcost to the CostCounter module that can use the optimal cutoffs calculated at the end of timcv.py. That leaves only three variables to optimize using simplex. I then ran one optimization on my complete (16000+5800) corpus.
The result is that it is fighting very hard to remove fp's while introducing lots of unsure messages: At the start: -> all runs false positives: 15 -> all runs false negatives: 7 -> all runs unsure: 189 Standard Cost: $194.80 Flex Cost: $607.41 Delayed-Standard Cost: $98.80 Delayed-Flex Cost: $310.05 x=0.4990 p=0.1002 s=0.4537 310.05 And near the end: -> all runs false positives: 5 -> all runs false negatives: 6 -> all runs unsure: 342 -> all runs false positive %: 0.03125 -> all runs false negative %: 0.103448275862 -> all runs unsure %: 1.56880733945 -> all runs cost: $124.40 Standard Cost: $124.40 Flex Cost: $589.16 Delayed-Standard Cost: $98.60 Delayed-Flex Cost: $212.28 x=0.3515 p=0.2861 s=0.2467 212.28 At this stage it actually managed to get the delayed standard cost lower by $0.20 (it has been higher than the starting value during much of the optimization). The Delayed-Flex cost is lowered by about 30%. But look at the hugely different parameters it had to use! Can someone else run with these parameters and confirm that this is an extreme that is only warranted by my particular corpses? Please note that to get a delayed flex cost that is this much lower actually means that in the unsure area there is "50% more order" than before the optimization! At some point Tim (was it you?) has reported that in other optimization techniques it has proven to be very bad to "focus" on the persistent and hopeless fp/fn messages. I fear this might bother me here. I just started another optimization run, but lowered the cost of a fp from $10 to $2, and introduced another cost function that I called flex**2 cost because it changes the cost function for an unsure message from a linear function to a square function. Oops, two changes at the same time; but it takes such a long time to run.... More in 24 hours? Regards, Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Sun Nov 17 14:37:52 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 17 Nov 2002 14:37:52 +0000 Subject: [Spambayes] A kinder, gentler hammie In-Reply-To: References: Message-ID: <0c9ftu81m52pcuem5j1kdrafdjo5qdea47@4ax.com> Hi Neale, > I think I'd like to have hammiefilter check for a "~/.hammierc" file > in addition to a bayescustomize.ini file, and also set the > persistent_storage_file option to "~/.hammie.db" unless it's > overridden. Be careful of other platforms - those filenames are meaningless on Windows (though I guess people aren't running procmail on Windows, or are they?). I wish I could be more constructive, but there's no equivalent of $HOME that works on all versions of Windows (SHGetSpecialFolderPath may help, but there's no interface to that in vanilla Python). If you do change the default database location, please make sure you announce it, and draw people's attention to the fact that pop3proxy uses the same defaults! -- Richie Hindle richie@entrian.com From lists@morpheus.demon.co.uk Sun Nov 17 14:57:35 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Sun, 17 Nov 2002 14:57:35 +0000 Subject: [Spambayes] A kinder, gentler hammie References: Message-ID: Neale Pickett writes: > I've checked in hammiefilter.py, which I've been using for a few days > now. The idea is to make the user interface (that means command-line > options) as lightweight as possible. Now the setup for a procmail-based > solution is even easier: This duplicated something I was doing just today - adding an option to Hammie to train on a single message. Your interface is much better than mine, though. > I think I'd like to have hammiefilter check for a "~/.hammierc" file > in addition to a bayescustomize.ini file, and also set the > persistent_storage_file option to "~/.hammie.db" unless it's > overridden. 
With those two changes, we'd have something supremely easy > to drop into your mail setup. Almost as easy as SpamAssassin (except, > of course, that you don't have to keep retraining SpamAssassin). On Windows (which I use) "~" isn't handled by the OS. Applications which use it often set the HOME environment variable properly, though, so it *can* make sense. To make this work involves passing filenames through os.path.expanduser(). Would anyone object to a patch which added a call to this function wherever it made sense? It should make no difference to non-Windows systems. On Windows, for cases where HOME is set, it will do "the right thing" When HOME is not set, ~ will change to mean C:\, which is a change in behaviour, but I suspect not one that will cause problems. I'd have to rely on others to check on platforms other than Windows, as I have no access to any other OSes. Comments? Paul. -- This signature intentionally left blank From lists@morpheus.demon.co.uk Sun Nov 17 16:50:56 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Sun, 17 Nov 2002 16:50:56 +0000 Subject: [Spambayes] popget.py - Gnus mail fetcher Message-ID: I use Gnus on Windows for my mailreader at home, with Hamster (a local mail/news server) as my POP3 mailbox. I could use pop3proxy for spam detection, but it has a couple of problems: 1. As I already have a local POP server, I have to use a non-standard port 2. I have to remember to start the program, or package it up as a service, or otherwise automate the startup and hide the console window. Rather than this, I wrote a small program (attached) to grab the contents of a POP mailbox into a local file, scoring messages as it goes. I can then put the following in my .gnus.el file to run my mail through spambayes. It's equally usable on Unix, for people who have reasons for not wanting to use pop3proxy there. 
--- .gnus.el snippet --- ;; Popget program from spambayes setup (setq popget "C:\\Data\\spambayes\\spambayes-test\\popget.bat") ;; Get mail via POP3 (setq mail-sources '((pop :server "localhost" :user "XXXXXX" :password "XXXXXX" :program (concat popget " -u %u/%p -P %P -s %s -f %t")))) ------------------------ This works beautifully, and combined with XEmacs mail splitting, I have a nice spam detection facility. With a little bit of work (not hard) I can also use the new hammiefilter.py to add training keystrokes, and I have quite a nice setup. One other thing I'm going to add is a way to display the spam clues for the current message in a buffer. When I'm done, I'll package up the code in a form that can be included in the project as a sample Gnus setup. Paul. -------------- next part -------------- #!/usr/bin/env python # A program to get and classify the contents of a POP3 mailbox. """Usage: %(program)s [options] Where: -h Show usage and exit -s SERVER The server from which to get the mail (default localhost) -P PORT The port to use (default 110) -u USER/PASS The username and password of the POP3 account -f FILE The file to save messages in (defaults to stdout) -k Keep messages on the POP server (default is to delete them) -p FILE use file as the persistent store. loads data from this file if it exists, and saves data to this file at the end. Default: %(DEFAULTDB)s -d use the DBM store instead of cPickle. The file is larger and creating it is slower, but checking against it is much faster, especially for large word databases. 
Default: %(USEDB)s -D the reverse of -d: use the cPickle instead of DBM """ import os import sys import getopt import poplib import socket import hammie from Options import options try: True, False except NameError: # Maintain compatibility with Python 2.2 True, False = 1, 0 # For usage(); referenced by docstring above program = sys.argv[0] DEFAULTDB = options.persistent_storage_file class Config: def __init__(self): self.server = 'localhost' self.port = 110 self.user = None self.password = None self.DB = DEFAULTDB self.use_db = options.persistent_use_database self.filename = "" self.file = sys.stdout self.delete = True def createhammie(self): bayes = hammie.createbayes(self.DB, self.use_db) self.hammie = hammie.Hammie(bayes) FROMLINE = "From popget@spambayes.org Sat Jan 31 00:00:00 2000" def getmail(conf): pop = poplib.POP3(conf.server, conf.port) if conf.user: pop.user(conf.user) if conf.password: pop.pass_(conf.password) n, size = pop.stat() num = 1 while num <= n: rsp, lines, size = pop.retr(num) msg = "\n".join(lines) print >>conf.file, FROMLINE print >>conf.file, conf.hammie.filter(msg) print >>conf.file if conf.delete: pop.dele(num) num += 1 pop.quit() def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def main(): """Main program - parse options and run.""" try: opts, args = getopt.getopt(sys.argv[1:], "hs:P:u:f:kp:dD") except getopt.error, msg: usage(2, msg) if not opts: usage(2, "No options given") conf = Config() for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-s': conf.server = arg elif opt == '-P': try: conf.port = int(arg) except ValueError: usage(2, "Port must be a number ('%s' given)" % arg) elif opt == '-u': try: conf.user, conf.password = arg.split("/",1) except ValueError: usage(2, "-u option is USER/PASS ('%s' given)" % arg) elif opt == '-k': conf.delete = False elif opt == '-f': # Need to expand ~ 
conf.filename = os.path.expanduser(arg) conf.file = open(conf.filename, "w") elif opt == '-p': conf.DB = arg elif opt == '-d': conf.use_db = True elif opt == '-D': conf.use_db = False conf.createhammie() try: getmail(conf) except (poplib.error_proto, socket.error), e: print >> sys.stderr, "POP protocol error %s" % e sys.exit(1) if __name__=="__main__": main() -------------- next part -------------- -- This signature intentionally left blank From tim.one@comcast.net Sun Nov 17 19:38:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 17 Nov 2002 14:38:20 -0500 Subject: [Spambayes] Testers needed with unbalanced spam::ham training data In-Reply-To: Message-ID: If you have a strong imbalance between the # of ham and # of spam in your training data (or even if you don't but can spare the effort), please do a before-and-after test, where after adds the new option: [Classifier] experimental_ham_spam_imbalance_adjustment: True I expect this option to go away and become the default, but it needs testing first before I'll do that. My c.l.py test has minor imbalance, and enabling this option doesn't really matter on it: filename: cv imbal ham:spam: 20000:14000 20000:14000 fp total: 3 3 fp %: 0.01 0.01 fn total: 0 0 fn %: 0.00 0.00 unsure t: 91 95 unsure %: 0.27 0.28 real cost: $48.20 $49.00 best cost: $17.80 $17.80 h mean: 0.24 0.25 h sdev: 2.73 2.79 s mean: 99.95 99.96 s sdev: 1.40 1.32 mean diff: 99.71 99.71 k: 24.14 24.26 Since I have more ham than spam, the effect of the option is to "believe" the hamcounts less than it used to, so that spamprobs have a harder time getting close to 0. 
That in turn makes everything a little spammier than it used to be, so all the effects on the statistics are expected: ham and spam means both go up a little, ham sdev increases a little because strong ham words aren't as strong as they were, spam sdev decreases because strong spam words are stronger than they were, and a few edgecase hams drifted into Unsure territory because they're judged to be a little spammier than they were. A *possible* effect this data doesn't suffer is an increase in FP rate, which would again be due to everything looking a little spammier (I'm not being accurate here! it's really due to everything looking less hammy, but the distinction is too subtle to belabor ). Likewise some FN may be redeemed (but weren't in this test, since it had no FN to begin with). All these effects will be stronger the larger the imbalance in your ham::spam ratio. Oops! Looks like SourceForge is down -- I haven't been able to check in the changes yet. Keep trying until they show up . From tim.one@comcast.net Sun Nov 17 19:50:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 17 Nov 2002 14:50:56 -0500 Subject: [Spambayes] Testers needed with unbalanced spam::ham training data In-Reply-To: Message-ID: [Tim] > ... > Oops! Looks like SourceForge is down -- I haven't been able to > check in the changes yet. Keep trying until they show up . Fudge -- that may be a long time from now. From SF email last week: On 2002-11-17 (Sunday), project CVS services, project shell services, project web services (including all VHOSTs), and project database services will be offline for a period of up to twelve hours, starting at 10:00 Pacific (GMT-8). Project web services will be restored first, but will be brought up initially with read-only access to project group directory space. Static web content will be served correctly during this time period, but application-driven and database-dependent content and CGI scripts will not function correctly. 
Issues encountered during this time period SHOULD NOT be reported to SourceForge.net; they are an expected side-effect of this outage. The 12-hour outage is only 2 hours old at this time if my calculations are correct, but math is tricky . Without CVS access, I can't even give you a patch here! OK, if you're eager to test, unzip and drop in the classifier.py and Options.py from the attached. -------------- next part -------------- A non-text attachment was scrubbed... Name: spambayes.zip Type: application/x-zip-compressed Size: 19542 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021117/d9d688e6/spambayes.bin From neale@woozle.org Sun Nov 17 21:12:38 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Nov 2002 13:12:38 -0800 Subject: [Spambayes] A kinder, gentler hammie In-Reply-To: <0c9ftu81m52pcuem5j1kdrafdjo5qdea47@4ax.com> References: <0c9ftu81m52pcuem5j1kdrafdjo5qdea47@4ax.com> Message-ID: So then, Richie Hindle is all like: > If you do change the default database location, please make sure you > announce it, and draw people's attention to the fact that pop3proxy uses > the same defaults! Hey Richie. Sorry I wasn't clearer about this--I wouldn't want to change the default, I'd just want hammiefilter to: 1. Read in the default 2. Set the database type 3. Try to read in bayescustomize.ini (which will probably fail) 4. Read in ~/.spambayesrc or something like it So, no modifications to anything but hammiefilter.py. There's too much other stuff expecting it the way it currently works, I think. Neale From neale@woozle.org Sun Nov 17 21:22:47 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Nov 2002 13:22:47 -0800 Subject: [Spambayes] A kinder, gentler hammie In-Reply-To: References: Message-ID: So then, Paul Moore is all like: > On Windows (which I use) "~" isn't handled by the OS. Applications > which use it often set the HOME environment variable properly, though, > so it *can* make sense. 
To make this work involves passing filenames > through os.path.expanduser(). Hey cool! The audience for hammiefilter suddenly got larger. How would you use something like this on a Windows box? Can some MTAs run messages through an external filter one at a time? I thought only we Unix wonks were able to do that ;) I suppose what we could do is try a number of pathnames for the ini file, and use the first one that works. If there's some reasonable way to figure out what to use for a "home directory" on Windows or Mac, without relying on non-standard environment variables, it would look there. Of course, it could just rely on the BAYESCUSTOMIZE environment variable like it does now. But then you'd have to wrap hammiefilter to set the variable before running. I'm doing that currently, but I think a wrapper around a wrapper around a driver is pretty ugly. ;) I'll check in some code that works in a Unix environment. Take a look at it and let me know what would make sense to try in a Windows environment (or just submit a change if you can do that). Neale From neale@woozle.org Sun Nov 17 21:43:10 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Nov 2002 13:43:10 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes hammiefilter.py,NONE,1.1 README.txt,1.42,1.43 hammie.py,1.38,1.39 mboxutils.py,1.6,1.7 In-Reply-To: <15831.4608.910147.312797@montanaro.dyndns.org> References: <15831.4608.910147.312797@montanaro.dyndns.org> Message-ID: So then, Skip Montanaro is all like: > Neale> * hammie.py can now take messages on stdin, but it's ugly. If > Neale> you want to do this, you should look at hammiefilter.py > > I'm not sure I get this. I use hammie.py as a filter from my procmailrc > file already. What new feature did you add? The ability to train on a > message on stdin? That's right. You can now do hammie.py -s - or hammie.py -g - to train on a single message. But immediately after I wrote this code, I decided what we really needed was a cleaner front-end.
I think a lot of people are using hammie.py on a large existing corpus, either for training or testing. hammiefilter is my attempt at something you'd use as a callout from something else. Neale From mhammond@skippinet.com.au Sun Nov 17 22:38:21 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Mon, 18 Nov 2002 09:38:21 +1100 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right Message-ID: I got a strange looking mail today. All HTML. My brain was sure it was spam. Bayes scored it a pretty solid ham (0.003). So I re-eyeballed the mail (this *was* before the first coffee of the day, mind you), and was again sure it was spam. Was about to hit the "Spam Clues" to see what the story was, and marvelling how how they knew I was "Mark" (most spammers don't). Then I realized it actually *wasn't* spam, but a personally addressed mail. No idea why they picked me though. So - pre Bayes, I am *sure* this mail would have hit the bit-bucket. My brain was sure it was spam way past the threshold where I would have deleted it. [Interesting side-point: In the dollar values for bayes, we should assign a value to the frailty of the brain based on the size or ratios in the corpus. For example, if 50% of mail coming to my inbox is spam, I bet my brain makes far more FP errors than if it were only 1%] The text version of the mail is below. It was HTML, gray background, blue writing - big brain spam-clues And-I'm-yet-to-see-a-bayes-FP ly, Mark. --- From: "Special Imaging Services" To: Subject: Python... >>Special Imaging Services >>Secure Messaging Zone >>Encryption Status: OFF Hello there, Mark, We specialise in military standard image enhancement and digital, craniofacial reconstruction. Our default programming environment is Prolog. How might one integrate Python into the mix? 
Warmest regards, --- From richie@entrian.com Sun Nov 17 23:07:25 2002 From: richie@entrian.com (Richie Hindle) Date: Sun, 17 Nov 2002 23:07:25 +0000 Subject: [Spambayes] Testers needed with unbalanced spam::ham training data In-Reply-To: References: Message-ID: > [Classifier] > experimental_ham_spam_imbalance_adjustment: True Four runs, with and without experimental_ham_spam_imbalance_adjustment, and with a 10:1 ham:spam imbalance either way: lowham[_adj]: timcv.py -n10 --ham=20 --spam=200 -s1 lowspam[_adj]: timcv.py -n10 --ham=200 --spam=20 -s1 filename: lowham lowham_adj ham:spam: 200:2000 200:2000 fp total: 15 2 fp %: 7.50 1.00 fn total: 1 1 fn %: 0.05 0.05 unsure t: 37 42 unsure %: 1.68 1.91 real cost: $158.40 $29.40 best cost: $67.20 $26.40 h mean: 17.41 8.38 h sdev: 31.13 20.20 s mean: 99.90 99.66 s sdev: 2.47 3.35 mean diff: 82.49 91.28 k: 2.46 3.88 filename: lowspam lowspam_adj ham:spam: 2000:200 2000:200 fp total: 0 1 fp %: 0.00 0.05 fn total: 10 1 fn %: 5.00 0.50 unsure t: 35 72 unsure %: 1.59 3.27 real cost: $17.00 $25.40 best cost: $10.80 $7.00 h mean: 0.18 1.61 h sdev: 2.08 7.13 s mean: 89.39 96.69 s sdev: 23.92 10.59 mean diff: 89.21 95.08 k: 3.43 5.37 The introduced fp in lowspam_adj is a very spammy HTML email from an ISP - it's always showed up as an fp in my corpus. -- Richie Hindle richie@entrian.com From tim.one@comcast.net Sun Nov 17 23:31:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 17 Nov 2002 18:31:26 -0500 Subject: [Spambayes] Testers needed with unbalanced spam::ham training data In-Reply-To: Message-ID: [Richie Hindle, trying [Classifier] experimental_ham_spam_imbalance_adjustment: True ] Thank you! 
> Four runs, with and without > experimental_ham_spam_imbalance_adjustment, and > with a 10:1 ham:spam imbalance either way: > > lowham[_adj]: timcv.py -n10 --ham=20 --spam=200 -s1 > lowspam[_adj]: timcv.py -n10 --ham=200 --spam=20 -s1 > > filename: lowham lowham_adj > ham:spam: 200:2000 > 200:2000 > fp total: 15 2 > fp %: 7.50 1.00 > fn total: 1 1 > fn %: 0.05 0.05 > unsure t: 37 42 > unsure %: 1.68 1.91 > real cost: $158.40 $29.40 > best cost: $67.20 $26.40 > h mean: 17.41 8.38 > h sdev: 31.13 20.20 > s mean: 99.90 99.66 > s sdev: 2.47 3.35 > mean diff: 82.49 91.28 > k: 2.46 3.88 So the effect of the adjustment is to make everything less spammy: both means decrease, ham sdev decreases, spam sdev increases, FP get redeemed, and FN get more likely but less so than Unsures get more likely. The spread is small enough that the bottom-line increase in k is important, and everything works as hoped here. > filename: lowspam lowspam_adj > ham:spam: 2000:200 > 2000:200 > fp total: 0 1 > fp %: 0.00 0.05 > fn total: 10 1 > fn %: 5.00 0.50 > unsure t: 35 72 > unsure %: 1.59 3.27 > real cost: $17.00 $25.40 > best cost: $10.80 $7.00 > h mean: 0.18 1.61 > h sdev: 2.08 7.13 > s mean: 89.39 96.69 > s sdev: 23.92 10.59 > mean diff: 89.21 95.08 > k: 3.43 5.37 Now the effect is to make everything less hammy, so mirror image: both means increase, ham sdev increases, spam sdev decreases, FN get redeemed, and FP get more likely but less so than Unsures get more likely. So again everything worked as hoped, and the bottom-line increase in k is again a Good Thing. Great! That's all I could have hoped for. If you hoped for more, you were being unrealistic . Curious: both before and after, you got better results training on a lot more ham than spam than the reverse. Most previous reports have been the opposite (in my own tests, I haven't noted a reliable trend in either direction there). 
> The introduced fp in lowspam_adj is a very spammy HTML email from > an ISP - it's always showed up as an fp in my corpus. Since the after "best cost" was under $10, it's certain that the post-run histogram analysis found cutoffs where you would have gotten no FP. Whether those are cutoffs you'd be comfortable with I can't say. From rjdsnet@yahoo.com Thu Nov 7 19:22:34 2002 From: rjdsnet@yahoo.com (Ranieri J D Severiano) Date: Thu, 7 Nov 2002 17:22:34 -0200 Subject: [Spambayes] hammie's dbm file has changed Message-ID: <20021107192234.GA974@uyrapuru> Hi, my last CVS syncronization has generated the attached CVS/Entries. These are the upgrades: hammie.py: 1.29 -> 1.35 -> 1.38 hammiesrc.py: 1.9 -> 1.10 When I execute any program which try to access the pickle-DB, I get this exception: ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail Traceback (most recent call last): File "./hammie.py", line 497, in ? main() File "./hammie.py", line 459, in main bayes = createbayes(pck, usedb, mode) File "./hammie.py", line 401, in createbayes bayes = pickle.load(fp) File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor obj = base.__new__(cls, state) TypeError: ('object.__new__(X): X is not a type object (class)', , (, , None)) ranieri@uyrapuru:spambayes$ python2.2 Python 2.2.1 (#1, Sep 7 2002, 14:34:30) [GCC 2.95.4 20011002 (Debian prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from cPickle import load >>> f = open('hammie.db') >>> o = load(f) Traceback (most recent call last): File "", line 1, in ? File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor obj = base.__new__(cls, state) TypeError: ('object.__new__(X): X is not a type object (class)', , (, , None)) >>> ranieri@uyrapuru:spambayes$ I believe pickle-DB file format has changed too. 
Thanks, Ranieri > > From: "Neale Pickett" > Date: Sun, 17 Nov 2002 03:49:26 > Subject: [Spambayes] hammie's dbm file has changed > > I just want to make sure everyone is aware that hammie.py's dbm file > format has changed now. I sent a message out about it two days ago and > didn't get any responses, so it's in now. > -------------- next part -------------- /.cvsignore/1.3/Fri Sep 20 15:24:54 2002// /HistToGNU.py/1.7/Fri Oct 4 03:01:29 2002// /LICENSE.txt/1.1/Sun Sep 22 04:59:54 2002// /TESTING.txt/1.1/Thu Sep 5 20:55:02 2002// /cdb.py/1.4/Mon Sep 23 21:20:10 2002// /cleanarch/1.1/Thu Sep 5 16:16:43 2002// /cmp.py/1.17/Thu Sep 26 03:20:51 2002// /fpfn.py/1.1/Wed Sep 25 01:01:49 2002// /heapq.py/1.1/Sun Sep 22 06:58:36 2002// /loosecksum.py/1.3/Mon Sep 23 21:20:10 2002// /neilfilter.py/1.4/Wed Oct 2 16:05:27 2002// /rates.py/1.7/Wed Sep 25 02:22:15 2002// D/Outlook2000//// D/email//// /hammiecli.py/1.2/Sun Oct 27 05:13:54 2002// /runtest.sh/1.9/Mon Nov 4 01:10:38 2002// /setup.py/1.9/Mon Nov 4 01:10:38 2002// /timtest.py/1.30/Mon Nov 4 01:10:39 2002// /unheader.py/1.8/Mon Nov 4 01:10:44 2002// /Histogram.py/1.7/Wed Nov 6 20:23:44 2002// /INTEGRATION.txt/1.2/Wed Nov 6 20:23:44 2002// /Options.py/1.70/Wed Nov 6 20:23:47 2002// /README.txt/1.42/Wed Nov 6 20:23:48 2002// /TestDriver.py/1.28/Wed Nov 6 20:23:49 2002// /Tester.py/1.8/Wed Nov 6 20:23:49 2002// /chi2.py/1.8/Wed Nov 6 20:23:50 2002// /classifier.py/1.50/Wed Nov 6 20:23:52 2002// /hammie.py/1.38/Result of merge// /hammiesrv.py/1.10/Result of merge// /mboxcount.py/1.3/Wed Nov 6 20:23:53 2002// /mboxtest.py/1.10/Wed Nov 6 20:23:54 2002// /mboxutils.py/1.6/Wed Nov 6 20:23:54 2002// /msgs.py/1.6/Wed Nov 6 20:23:54 2002// /neiltrain.py/1.4/Wed Nov 6 20:23:55 2002// /optimize.py/1.2/Sun Nov 10 19:59:22 2002// /pop3proxy.py/1.15/Wed Nov 6 20:23:59 2002// /rebal.py/1.9/Wed Nov 6 20:24:00 2002// /sets.py/1.2/Wed Nov 6 20:24:01 2002// /split.py/1.2/Wed Nov 6 20:24:01 2002// /splitn.py/1.4/Wed Nov 6 20:24:01 2002// 
/splitndirs.py/1.7/Wed Nov 6 20:24:01 2002// /table.py/1.5/Wed Nov 6 20:24:02 2002// /timcv.py/1.12/Wed Nov 6 20:24:02 2002// /tokenizer.py/1.68/Wed Nov 6 20:24:07 2002// /weakloop.py/1.2/Mon Nov 11 01:59:06 2002// /weaktest.py/1.3/Sun Nov 10 19:59:22 2002// From popiel@wolfskeep.com Mon Nov 18 02:24:57 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 17 Nov 2002 18:24:57 -0800 Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus In-Reply-To: Message from Tim Peters References: Message-ID: <20021118022457.CC988F58D@cashew.wolfskeep.com> In message: Tim Peters writes: > >[Tim] >> ... >> The "missing test" here is exact bigrams (no hash convolutions). I'll >> try that later; may not have enough RAM for that, but should. I haven't been able to do a big run of this, but here's my results: filename: org orgbix ham:spam: 1000:1000 1000:1000 fp total: 3 2 fp %: 0.30 0.20 fn total: 10 7 fn %: 1.00 0.70 unsure t: 27 28 unsure %: 1.35 1.40 real cost: $45.40 $32.60 best cost: $24.00 $24.20 h mean: 0.43 0.50 h sdev: 5.64 5.95 s mean: 97.94 98.28 s sdev: 11.59 10.45 mean diff: 97.51 97.78 k: 5.66 5.96 This is from a five-fold cross validation run. Looks very nice. - Alex From neale@woozle.org Mon Nov 18 02:48:46 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Nov 2002 18:48:46 -0800 Subject: [Spambayes] hammie's dbm file has changed In-Reply-To: <20021107192234.GA974@uyrapuru> References: <20021107192234.GA974@uyrapuru> Message-ID: So then, Ranieri J D Severiano is all like: > ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail > Traceback (most recent call last): > File "./hammie.py", line 497, in ? 
> main() > File "./hammie.py", line 459, in main > bayes = createbayes(pck, usedb, mode) > File "./hammie.py", line 401, in createbayes > bayes = pickle.load(fp) > File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor > obj = base.__new__(cls, state) > TypeError: ('object.__new__(X): X is not a type object (class)', , (, , None)) Yikes! I'm pretty sure I didn't change anything that would affect the way pickles are stored (they don't use PersistentGrahamBayes or the DBDict classes), but it sure does look like *something* has changed for you. Unfortunately, SF CVS is down for the day, so I can't check to see what's changed between those versions. > >>> from cPickle import load > >>> f = open('hammie.db') > >>> o = load(f) Could you try the same thing, importing load from pickle instead? It will give a better traceback. I don't know enough yet about how the pickling works to be able to diagnose this without some more information first. Neale From neale@woozle.org Mon Nov 18 02:51:06 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Nov 2002 18:51:06 -0800 Subject: [Spambayes] small vulnerability patch In-Reply-To: <1037427084.31134.17.camel@localhost> References: <1037427084.31134.17.camel@localhost> Message-ID: So then, Todd Mokros is all like: > here's a small patch to fix a small header vulnerability. If a piece of > spam spoofs the header added by hammie, then procmail recipes could > match on the spoofed header. This deletes the hammie header before > filtering. Good catch, Todd! 
I'll check this into CVS as soon as it comes back up and I'm in front of a computer :) Thanks Neale > > > --- ../../cvs-tracking/spambayes/hammie.py 2002-11-14 > 17:00:15.000000000 -0500 > +++ hammie.py 2002-11-16 00:44:50.000000000 -0500 > @@ -272,6 +272,8 @@ > """ > > msg = mboxutils.get_message(msg) > + if msg.has_key(header): > + del msg[header] > prob, clues = self._scoremsg(msg, True) > if prob < ham_cutoff: > disp = options.header_ham_string > > > -- > Todd Mokros > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes From tim@fourstonesExpressions.com Mon Nov 18 02:52:49 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun, 17 Nov 2002 20:52:49 -0600 Subject: [Spambayes] hammie's dbm file has changed In-Reply-To: Message-ID: Here's a diff between hammie 1.38 and 1.39 cvs diff -r 1.38 -r 1.39 hammie.py (in directory C:\sourceforge\spambayes\) Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.38 retrieving revision 1.39 diff -r1.38 -r1.39 13c13 < Can be specified more than once. --- > Can be specified more than once, or use - for stdin. 16c16 < Can be specified more than once. --- > Can be specified more than once, or use - for stdin. 42a43 > import types 112,113c113,120 < if self.hash.has_key(key): < return pickle.loads(self.hash[key]) --- > v = self.hash[key] > if v[0] == 'W': > val = pickle.loads(v[1:]) > # We could be sneaky, like pickle.Unpickler.load_inst, > # but I think that's overly confusing. 
> obj = classifier.WordInfo(0) > obj.__setstate__(val) > return obj 115c122 < raise KeyError(key) --- > return pickle.loads(v) 118c125,129 < v = pickle.dumps(val, 1) --- > if isinstance(val, classifier.WordInfo): > val = val.__getstate__() > v = 'W' + pickle.dumps(val, 1) > else: > v = pickle.dumps(val, 1) *****CVS exited normally with code 1***** - TimS 11/17/2002 8:48:46 PM, Neale Pickett wrote: >So then, Ranieri J D Severiano is all like: > >> ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail >> Traceback (most recent call last): >> File "./hammie.py", line 497, in ? >> main() >> File "./hammie.py", line 459, in main >> bayes = createbayes(pck, usedb, mode) >> File "./hammie.py", line 401, in createbayes >> bayes = pickle.load(fp) >> File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor >> obj = base.__new__(cls, state) >> TypeError: ('object.__new__(X): X is not a type object (class)', , (, , None)) > >Yikes! > >I'm pretty sure I didn't change anything that would affect the way >pickles are stored (they don't use PersistentGrahamBayes or the DBDict >classes), but it sure does look like *something* has changed for you. > >Unfortunately, SF CVS is down for the day, so I can't check to see >what's changed between those versions. > >> >>> from cPickle import load >> >>> f = open('hammie.db') >> >>> o = load(f) > >Could you try the same thing, importing load from pickle instead? It >will give a better traceback. I don't know enough yet about how the >pickling works to be able to diagnose this without some more information >first. 
> >Neale > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From neale@woozle.org Mon Nov 18 03:16:45 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Nov 2002 19:16:45 -0800 Subject: [Spambayes] hammie's dbm file has changed In-Reply-To: References: Message-ID: So then, Tim Stone - Four Stones Expressions is all like: > Here's a diff between hammie 1.38 and 1.39 Ah, I see SF is back up. Thanks, Tim :) Ranieri, I can't find anything which would affect the pickling in the diff between revisions 1.29 and 1.39 of hammie.py. Maybe a traceback from the pickle module will offer some more clues as to what's gone wrong. Neale From tim@fourstonesExpressions.com Mon Nov 18 04:44:42 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun, 17 Nov 2002 22:44:42 -0600 Subject: [Spambayes] hammie's dbm file has changed In-Reply-To: Message-ID: <05SNRPSM08A5E0VREDHCZYTVRSQKFC7.3dd8703a@riven> I've just finished updating the DBDictBayes class to implement the read/store semantic. I looked at the DBDict class, but I decided making it do that would be a bit more of a problem than making one if it's delegators do it... More comments below... - TimS 11/17/2002 10:35:33 PM, Neale Pickett wrote: >So then, Tim Stone - Four Stones Expressions is all like: > >> My Bayes module stuff has a store() method that is to store the >> wordinfo. This is a requirement for Richie's pop3proxy. Right now >> with DBDictBayes it's pretty much of a noop, only adding nham and >> nspam to the persistent dictionary. Can we alter dbdict, or make a >> subclass, that accomodates this behavior? > >Hmm. I just updated and got your new Bayes.py file. I like! This >looks like what hammiefilter.py should be using, instead of hammie's >DBDict. Feel free to move that out and into your Bayes class, it looks >like that's where it belongs. 
We could move the DBDIct class to the Bayes module, or to its own little module... it really is a more generally useful class. >Just make sure everything else still >works :) You'll probably have to modify hammie, hammiesrv, and >hammiefilter. Maybe hammiefilter should be renamed to just filter, if >it's not going to use hammie.py anymore. > >Would that solve your problem? If I understand it correctly, it >should. I think the hammie module ought to be split up into separate >pieces anyway. My problem with updating hammie is that I'm not too well equipped to test the mods... I can certainly take a look at it, and give it a spin, but I doubt that I can test all the scenarios on my simple wynd0ze snoozer. ;) > >Mind if we take this discussion onto the list? I'm sure Richie will >have some good input on the subject. > >Neale > > > > - Tim www.fourstonesExpressions.com From mhammond@skippinet.com.au Thu Nov 14 07:13:51 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 14 Nov 2002 18:13:51 +1100 Subject: [Spambayes] Outlook users should update In-Reply-To: Message-ID: And I just checked in a few changes too. Of most note is that the plugin should correctly filter all "unread, unscored" messages in your watch folders at startup. Works for me - let me know if it does for you too Mark. From tim.one@comcast.net Mon Nov 18 06:12:19 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 01:12:19 -0500 Subject: [Spambayes] hammie's dbm file has changed In-Reply-To: Message-ID: [Ranieri J D Severiano ] > ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail > Traceback (most recent call last): > File "./hammie.py", line 497, in ? 
> main() > File "./hammie.py", line 459, in main > bayes = createbayes(pck, usedb, mode) > File "./hammie.py", line 401, in createbayes > bayes = pickle.load(fp) > File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor > obj = base.__new__(cls, state) > TypeError: ('object.__new__(X): X is not a type object > (class)', , ( classifier.Bayes at 0x82103b4>, , None)) This is what happens if you try to load a Bayes pickle created before classifier.Bayes changed from a new-style class to an old-style class. You're best off retraining from scratch. If you're desperate to retrieve the old data, you can change class Bayes: back to class Bayes(object): and the pickle should load again. From tim.one@comcast.net Mon Nov 18 06:15:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 01:15:25 -0500 Subject: [Spambayes] hammie's dbm file has changed In-Reply-To: Message-ID: [Neale Pickett] > Ah, I see SF is back up. Thanks, Tim :) You're welcome. Feel emboldened, for my next act I'm thinking of making the last digit of the calendar year change in, oh, about a month and a half. > Ranieri, I can't find anything which would affect the pickling in the > diff between revisions 1.29 and 1.39 of hammie.py. Maybe a traceback > from the pickle module will offer some more clues as to what's gone > wrong. It's got nothing to do with hammie -- Bayes changed from a new-style to an old-style class to make Jeremy's ZODB life easier, and a pickle of an old-style class instance can't be loaded after the class has changed in this way. 
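Tim's diagnosis above is that a pickle of a whole class instance is tied to how that class is defined when the pickle is loaded. The 'W'-prefixed records in the hammie.py 1.38 to 1.39 diff quoted earlier take a different route for per-word entries: they store only the object's state and rebuild the instance by hand. What follows is a minimal sketch of that idea in modern Python 3, not the project's code. The WordInfo class here is a hypothetical stand-in with simplified fields, and encode() and decode() are illustrative names that do not appear in hammie.py.

```python
import pickle


class WordInfo:
    """Hypothetical stand-in for classifier.WordInfo (fields simplified)."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        # Only plain data leaves the object; no class reference is stored.
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state


def encode(val):
    """Serialize val; WordInfo records become b'W' + pickled state tuple."""
    if isinstance(val, WordInfo):
        return b"W" + pickle.dumps(val.__getstate__(), 1)
    return pickle.dumps(val, 1)


def decode(raw):
    """Inverse of encode(): rebuild a WordInfo from a 'W'-tagged record."""
    if raw[:1] == b"W":
        obj = WordInfo()
        obj.__setstate__(pickle.loads(raw[1:]))
        return obj
    return pickle.loads(raw)
```

Because a tagged record holds only a tuple, it is shorter than a full instance pickle and does not name the class at all. This sketch covers only the per-word records; it would not by itself rescue a pickle of the whole Bayes object like the one in the traceback above.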
From tim.one@comcast.net Mon Nov 18 06:48:54 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 01:48:54 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <3DD74E5F.9060606@hooft.net>
Message-ID:

[Rob Hooft]
> I collected some unigram statistics yesterday: training hammie on my
> 2x10 sets in the corpus one by one, and after each 1600ham+580spam set
> run a program that reports the would-be collisions using the python 32
> bit hash function:
>
> Set1 : 109280
> Set2 : 183560
> Set3 : 227699 (2 clashes)
> Set4 : 277253 (3 clashes)
> Set5 : 329662 (5)
> Set6 : 362847 (7)
> Set7 : 394585 (12)
> Set8 : 422898 (12)
> Set9 : 448767 (16)
> Set10: 481393 (22)

I'm assuming the "big numbers" there are the number of distinct tokens,
rather than the number of distinct 32-bit hash codes. If so, they're all
a little better than could be expected from a truly random 32-bit hash
code.

BTW, after tossing N balls into M buckets, the expected # of occupied
buckets has

    mean      M - M*(1-1/M)**N
    variance  M*(M-1)*(1-2/M)**N + M*(1-1/M)**N - M**2*(1-1/M)**(2*N)

Unfortunately, those expressions are numerically intractable using
double precision when M and N get large. The exact distribution is
intractable even in theory; Knuth gives an iterative algorithm for
computing it given specific M and N, which takes time super-linear in N.
I have software left over for this stuff from Python's years-ago
experiments.

> ...
> Here, the number of hash collisions is still fairly low, but subtract
> bits, and see it explode.....

Oh yes! The biggest pickle I've got sitting around has 327,439 tokens.
Using the last 20 bits of the hash code means using 2**20 ~= a million
buckets, and the mean number of collisions then is expected to be
46193.8, with sdev 174.5 (nb: it's not a normal distribution).
Actually doing this gave

    46,184 collisions using the low 20 bits of hash()
    46,481 collisions using the low 20 bits of binascii.crc32()

So either way is "random enough" for this range of numbers.

> Another thing that I learned from this, is that the number of distinct
> words with this test does not increase with the sqrt of the number of
> messages.

Perhaps not, but we're going to pretend that it is anyway because that's
such a pretty & quotable result.

From tim.one@comcast.net Mon Nov 18 06:57:12 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 01:57:12 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <20021118022457.CC988F58D@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel, tries "exact bigrams"]
> I haven't been able to do a big run of this, but here's my
> results:

Thank you!

> filename:        org    orgbix
> ham:spam:  1000:1000 1000:1000
> fp total:          3         2
> fp %:           0.30      0.20
> fn total:         10         7
> fn %:           1.00      0.70
> unsure t:         27        28
> unsure %:       1.35      1.40
> real cost:    $45.40    $32.60
> best cost:    $24.00    $24.20
> h mean:         0.43      0.50
> h sdev:         5.64      5.95
> s mean:        97.94     98.28
> s sdev:        11.59     10.45
> mean diff:     97.51     97.78
> k:              5.66      5.96
>
> This is from a five-fold cross validation run. Looks very nice.

Yet the "best cost" measure increased; add that to the list of
mysteries.

I'd be keener about it if it were clearer how to make the time and
database burdens reasonable. A less anal way of searching for the
strongest unigrams and bigrams would probably take care of time (Gary
suggested something cheaper to begin with, but that could miss some
high-strength bigrams in favor of lower-value unigrams, and I wanted
more to see the ultimate potential here). The database bloat is
jaw-dropping, though, and I'm still unsure why that is. Hash codes are
right out, IMO -- the goofy mistakes they lead to are intolerable.
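The balls-into-buckets formulas in Tim's hash-collision message above can be evaluated directly at the sizes he quotes. The sketch below is a plain transcription of those two expressions into Python 3, taking the number of collisions to be N minus the number of occupied buckets (an interpretation, though it appears to match the figures in the message). The function names are made up for illustration. As the message warns, direct evaluation in double precision loses accuracy once M and N get large, so this is only meant for sizes around 2**20 buckets.

```python
import math


def occupancy_stats(m, n):
    """Mean and variance of the number of occupied buckets after
    tossing n balls uniformly at random into m buckets."""
    q = (1.0 - 1.0 / m) ** n  # chance that one given bucket stays empty
    r = (1.0 - 2.0 / m) ** n  # chance that two given buckets both stay empty
    mean = m - m * q
    variance = m * (m - 1) * r + m * q - m * m * q * q
    return mean, variance


def collision_stats(m, n):
    """Expected number of collisions (n minus occupied buckets) and its sdev."""
    mean_occupied, variance = occupancy_stats(m, n)
    # Clamp at zero in case rounding pushes a tiny variance negative.
    return n - mean_occupied, math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    # 327,439 tokens hashed into 2**20 buckets, as in the message above.
    print(collision_stats(2 ** 20, 327439))
```

Run as a script, this prints a (mean, sdev) pair that can be set beside the 46193.8 and 174.5 quoted in the message, and beside the measured 46,184 and 46,481.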
From tim.one@comcast.net Mon Nov 18 07:17:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 02:17:38 -0500
Subject: [Spambayes] Outlook users should update
In-Reply-To:
Message-ID:

[Mark Hammond]
> And I just checked in a few changes too. Of most note is that the
> plugin should correctly filter all "unread, unscored" messages in your
> watch folders at startup. Works for me - let me know if it does
> for you too

I never had this problem, & am happy to report that I still don't.

One other change: the new (and I hope temporary)
experimental_ham_spam_imbalance_adjustment option is enabled by default
in the Outlook client now (this is specific to the Outlook client: it's
disabled by default for everyone else). It won't do you any good (or
harm ...) until you convince the client to update probabilities, though.
You do *not* need to retrain your database from scratch. You just need
to convince it to call the classifier's update_probabilities() method
once. The easiest way to do that may be to drag a ham to your spam
folder, then select that ham in your spam folder, and click "Recover
from spam".

From Paul.Moore@atosorigin.com Mon Nov 18 10:00:48 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 10:00:48 -0000
Subject: [Spambayes] Just for fun
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861993D@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]
> Good! On my tiny still-hapax-driven purely-mistake-based at-home
> classifier (which is up 79 each of ham and spam trained on) it
> fared much worse:

Mine got some interesting results...

The DB is trained on 366 good, 496 spam, which came mostly from
collected spam over a week or so, plus the contents of my Inbox, and
then training on mistakes (not many). My Inbox is, in some senses, a
*lousy* source of ham, as it's mainly stuff I couldn't find a better
home for.
So it is 99% internal mail (ie, from Exchange rather than Internet mail) and probably comprises a spammier-than-average slice of my ham. But if I train on all my ham (across multiple folders) I get a massive ham:spam imbalance. (When I next get to CVS update, I'll try Tim's new tweak to compensate for imbalances). I'm not good at interpreting this stuff yet, but it came out as solidly unsure, with some interesting features. The 'sender:no real name:2**0' as a solid ham clue is almost certainly due to Exchange (basically because Exchange doesn't do real headers, I expect) - I see most internet headers as good spam clues, which is mildly worrying, although hasn't caused any real issues yet. The obvious implication is that getting a really good training corpus is *hard*. Probably beyond the means of the average user. But as a lousy corpus still gives good results, it's hard to decide whether or not to care. Here's the clues. Spam Score: 0.349681 word spamprob #ham #spam '*H*' 0.998703 - - '*S*' 0.698066 - - 'sender:no real name:2**0' 0.00884086 25 0 'subject:[' 0.0155709 14 0 'url:mailman' 0.0167286 13 0 'url:listinfo' 0.0180723 12 0 'specific' 0.0196507 11 0 'is.' 0.0238095 9 0 'url:python' 0.0266272 8 0 'to:addr:python.org' 0.0412844 5 0 'them,' 0.0505618 4 0 'sender:addr:python.org' 0.0505618 4 0 'problem' 0.0521891 44 3 'url:org' 0.0567176 28 2 'know,' 0.0652174 3 0 'email addr:python.org' 0.0652174 3 0 'skip:_ 40' 0.0652174 3 0 'delivery' 0.0676112 13 1 'updated' 0.0676112 13 1 "can't" 0.0683657 43 4 'running' 0.0727202 12 1 'set' 0.0789344 54 6 'date' 0.0912609 24 3 'mission' 0.0918367 2 0 'sorted' 0.0918367 2 0 'host' 0.104237 8 1 'base' 0.116911 7 1 'various' 0.121676 12 2 'using' 0.125907 73 14 'content-type:text/plain' 0.128685 326 65 'however' 0.133102 6 1 'back' 0.145642 40 9 'ask' 0.149462 22 5 'site.' 0.154513 5 1 'solve' 0.154513 5 1 'contains' 0.154992 9 2 'net.' 
0.155172 1 0 'url:spambayes' 0.155172 1 0 'sender:addr:spambayes-bounces' 0.155172 1 0 'spambayes' 0.155172 1 0 'weekly.' 0.155172 1 0 'subject:email' 0.155172 1 0 'shut' 0.155172 1 0 'second.' 0.155172 1 0 'policies' 0.155172 1 0 'parameters' 0.155172 1 0 'emails.' 0.155172 1 0 'email name:spambayes' 0.155172 1 0 'duplicate' 0.155172 1 0 'together' 0.170569 8 2 'current' 0.175793 25 7 'closed' 0.184169 4 1 'paying' 0.184169 4 1 'data' 0.184776 17 5 'there' 0.184986 95 29 'meet' 0.189638 7 2 'close' 0.189638 7 2 'need' 0.190325 98 31 'site' 0.190699 32 10 'being' 0.192172 41 13 'directly' 0.204069 15 5 'they' 0.206035 66 23 'may' 0.206235 100 35 'use' 0.208203 96 34 'been' 0.223143 93 36 'have' 0.229481 221 89 'header:Received:9' 0.238618 24 10 'just' 0.251571 97 44 'like' 0.253328 70 32 'product' 0.261037 15 7 'not' 0.263917 200 97 'can' 0.263933 165 80 'reply-to:none' 0.266838 343 169 'noheader:reply-to' 0.266838 343 169 'will' 0.268784 175 87 'only' 0.27345 67 34 'that' 0.284503 223 120 'come' 0.28675 26 14 'for' 0.287229 253 138 'against' 0.292477 11 6 'down' 0.294767 25 14 'once' 0.294767 25 14 'new' 0.299542 90 52 'campaign' 0.299577 2 1 'reliable' 0.299577 2 1 'find' 0.304047 56 33 'already' 0.30613 22 13 'service' 0.308224 40 24 'well' 0.308669 30 18 'see' 0.308851 63 38 'way' 0.310625 33 20 'many' 0.311285 28 17 "don't" 0.317645 89 56 'again.' 0.683667 4 12 'subject:.' 0.697304 22 69 'card' 0.698247 5 16 'low' 0.70042 4 13 'totally' 0.703898 3 10 'header:Errors-To:1' 0.718457 21 73 'header:Date:1' 0.720317 142 496 'header:From:1' 0.720317 142 496 'us,' 0.72912 4 15 'header:Return-Path:1' 0.737732 130 496 'to:2**0' 0.74404 123 485 'proto:http' 0.772652 75 346 'to:no real name:2**0' 0.775127 90 421 'net' 0.776394 2 10 'price' 0.776394 2 10 'sites' 0.776394 2 10 'success.' 
0.796678 1 6 'visit' 0.797988 15 81 'url:www' 0.805641 49 276 'matter' 0.81794 2 13 'url:com' 0.818306 50 306 'effective' 0.819813 1 7 'marketing' 0.831585 3 21 'companies' 0.838229 1 8 'price.' 0.844828 0 1 'subject:Bullet' 0.844828 0 1 'time!' 0.844828 0 1 'relax' 0.844828 0 1 'proof' 0.844828 0 1 'from:addr:concentric.net' 0.844828 0 1 'friendly' 0.844828 0 1 'campaigns.' 0.844828 0 1 'bullet' 0.844828 0 1 'beautiful' 0.844828 0 1 '$200' 0.844828 0 1 'credit' 0.871816 3 29 'emails' 0.875534 3 30 'income' 0.878287 2 21 'offer' 0.890749 7 79 'merchant' 0.908163 0 2 'complaints' 0.908163 0 2 'cheap' 0.908163 0 2 'header:Mime-Version:1' 0.923952 16 266 'url:mail' 0.935447 3 62 'adult' 0.958716 0 5 'lowest' 0.958716 0 5 'advertise' 0.969799 0 7 'prices' 0.973373 0 8 'gambling' 0.973373 0 8 '$500' 0.97619 0 9 'hundreds' 0.980349 0 11 'dollars' 0.983271 0 13 'guarantee' 0.983271 0 13 'thousands' 0.984429 0 14 'million' 0.990405 0 23 'bulk' 0.990405 0 23 'advertising' 0.990405 0 23 'advertised' 0.995627 0 51 'websites' 0.995942 0 55 Paul. From msergeant@startechgroup.co.uk Mon Nov 18 10:40:25 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Mon, 18 Nov 2002 10:40:25 +0000 Subject: [Spambayes] Just for fun References: Message-ID: <3DD8C399.8000305@startechgroup.co.uk> Derek Simkowiak said the following on 15/11/02 17:35: >>Remember that this project /is/ the first instance of a decent spam filter :), >>so we can hardly blame the spammers for being a little behind. > > Let's not forget that SpamBayes only works for individuals or > workgroups who have the same definitation of "ham". It doesn't help much > in enterprise-level settings with tens of thousands of users, since the > ham of such a large and varied group of people would dilute the definition > of spam too much to be useful. I think you're over-exhagerating. It most certainly *does* help, and it helps a lot. For a large diverse group of users statistical analysis is still about 90% correct. 
It's not the 99.9% correct you get for an individual's mailbox, but as part of the bigger picture (involving statistics, rules, DNSBLs, etc.) it's a huge help.

> I bet that playing the numbers game one could "show" that the
> helpdesk and maintenance costs of supporting a Python installation plus a
> per-person ham training procedure would be more expensive (for a Uni or
> Mega-Corp.) than just living with spam. (Pure conjecture on my part, but
> it is easily imagined.)

Depends how you calculate the cost of spam. For me it's an interruption, and for my work (which involves intense periods of coding, maths, and reading) an interruption means I have to start again a lot of the time. For a highly paid programmer that cost could be about £20. Per spam. And ignoring my email is often not an option: I have to support a spam solution for over a million users.

> There's another Python-based spam filter that might work better
> for SMTP server-wide deployment, called "Active Spam Killer", or ASK.
>
> http://www.paganini.net/ask/
>
> Its shtick is that it maintains a whitelist of people who may
> email you. When an email from a new sender comes in, it holds the email
> for you, sends the person a simple confirmation message (to which they
> simply hit Reply;Send), and then that person is added to your whitelist
> and their original message is sent to you (and they are never ASKed
> again). There's also some very practical regex stuff, some migration
> tools, and an ignorelist and blacklist (for situations like
> http://www.psychoexgirlfriend.com/).

This is the same as TMDA. I have evidence for you that it doesn't work. Case in point being direct from me: someone mailed me asking a technical question about one of my perl modules. I mailed him back a response, on my own free time. I got a TMDA bounce saying I had to confirm that I was a real person. Well frankly, sod that. I never replied. I never used his web page to confirm.
I just ignored it and I'm sure he never got the reply to his question. Now imagine extending that to corporations, where people would be even less inclined to add themselves to somebody's whitelist. TMDA doesn't scale. Matt. From Paul.Moore@atosorigin.com Mon Nov 18 11:17:50 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Mon, 18 Nov 2002 11:17:50 -0000 Subject: [Spambayes] A kinder, gentler hammie Message-ID: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com> From: Neale Pickett [mailto:neale@woozle.org] > So then, Paul Moore is all like: > > > On Windows (which I use) "~" isn't handled by the OS. Applications > > which use it often set the HOME environment variable properly, > > though, so it *can* make sense. To make this work involves passing > > filenames through os.path.expanduser(). > > Hey cool! The audience for hammiefilter suddenly got larger. How > would you use something like this on a Windows box? Can some MTAs > run messages through an external filter one at a time? I thought > only we Unix wonks were able do that ;) At home, I use an application called Hamster, which is a local NNTP/POP/IMAP server and a NNTP/POP client, downloading from my ISP and serving the stuff up locally. Like leafnode on Unix, but for POP as well as NNTP. It can run filters on each mail as it comes through, but I don't do that... I use Gnus as a client, and for that I use popget.py (which I posted earlier) to grab mail from Hamster and add the appropriate header. It's slightly more convenient for me than pop3proxy, but has no training interface. I'd use hammiefilter for training - set a command up in Gnus to pipe the current message out to hammiefilter as spam or ham, as appropriate. > I suppose what we could do is try a number of pathnames for the ini > file, and use the first one that works. 
If there's some reasonable
> way to figure out what to use for a "home directory" on Windows or
> Mac, without relying on non-standard environment variables, it would
> look there.

Sounds about right. Basically, using relative pathnames in a GUI environment is error-prone, because you can't be sure what the current directory is. Windows' normal solution is to use the registry to hold absolute filenames, but that tends to be messy and generate registry bloat (as well as being very naive-user-unfriendly).

[Thinking about this, isn't it a problem for Unix as well? What's the current directory for a procmail filter?]

The standard environment variables which *can* be used for this sort of thing are

1. HOMEDRIVE and HOMEPATH - %HOMEDRIVE%%HOMEPATH% is basically the equivalent of Unix's $HOME. But for nearly all cases, these end up being C:\, which to my mind is a bad default.
2. USERPROFILE - %USERPROFILE% is a user-specific directory suitable for config information. But by default it's a directory with spaces in the name, which can be awkward for some purposes. It's also hard to navigate to in Windows explorer, which makes files stored there a little "hidden".

Also, many Unix ports (like XEmacs/Gnus) expect the user to set HOME, so that they can work just like the cosy Unix environment they are used to. (Python sort of supports this, with os.path.expanduser()). I personally don't set a default HOME, but set HOME within each application that expects it (via wrappers or startup scripts or whatever).

I think "try a number of pathnames" is a sensible approach. I'd suggest:

%HOME%\bayescustomize.ini -- will normally fail, as HOME is not set, but helps Unix compatibility for people who care, as well as offering an "application-specific" answer for people like me who use HOME that way
%USERPROFILE%\bayescustomize.ini -- the expected answer for people sophisticated enough to want to customise the application via an INI file.
bayescustomize.ini -- as a final fallback, for commandline use if nothing else.

Personally, I think that having an extra stage, where the INI file is looked for relative to the application files, would be good, too. (I.e., look in the same directory as sys.argv[0]). But opinions on this are often divided.

> Of course, it could just rely on the BAYESCUSTOMIZE environment
> variable like it does now. But then you'd have to wrap hammiefilter
> to set the variable before running. I'm doing that currently, but I
> think a wrapper around a wrapper around a driver is pretty ugly. ;)

I could use that in my Gnus setup. But what the heck, it would be nice if it worked in a way that other people could use as well :-)

> I'll check in some code that works in a Unix environment. Take a
> look at it and let me know what would make sense to try in a Windows
> environment (or just submit a change if you can do that).

Will do. I'll let you know, as I don't have commit privs. (I'll send you a patch file).

Paul.

From sjoerd@acm.org Mon Nov 18 11:29:32 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Mon, 18 Nov 2002 12:29:32 +0100
Subject: [Spambayes] Testers needed with unbalanced spam::ham training data
In-Reply-To:
References:
Message-ID: <20021118112937.7A3CD74B08@indus.ins.cwi.nl>

On Sun, Nov 17 2002 Tim Peters wrote:

> If you have a strong imbalance between the # of ham and # of spam in your
> training data (or even if you don't but can spare the effort), please do a
> before-and-after test, where after adds the new option:
>
> [Classifier]
> experimental_ham_spam_imbalance_adjustment: True
>
> I expect this option to go away and become the default, but it needs testing
> first before I'll do that.
It doesn't look like a win for me:

cv1 is all default, cv2 is with experimental_ham_spam_imbalance_adjustment: True

filename:          cv1         cv2
ham:spam:   14600:4000  14600:4000
fp total:            8          16
fp %:             0.05        0.11
fn total:            3           3
fn %:             0.07        0.07
unsure t:           97         108
unsure %:         0.52        0.58
real cost:     $102.40     $184.60
best cost:      $43.60     $137.80
h mean:           0.24        0.40
h sdev:           3.64        4.80
s mean:          99.44       99.65
s sdev:           5.00        3.92
mean diff:       99.20       99.25
k:               11.48       11.38

-- Sjoerd Mullender

From francois.granger@free.fr Mon Nov 18 14:19:35 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 18 Nov 2002 15:19:35 +0100
Subject: [Spambayes] Classify issue with pop3proxy
Message-ID:

If I cut and paste a message into the box, it gets classified. If I open it through the [file...] button, I get the following result:

===========================
Spam probability: 0.52423052

Clues:
*H*                      0.58188342
*S*                      0.63034446
x-mailer:none            0.21414650
content-type:text/plain  0.24312113
message-id:invalid       0.93478261
===========================

My guess is that this is a MacOS line-ending issue. But this works for training both ways.

The difference I see is line 774 in onTrain, which is not in onClassify. I suggest adding it at line 793. Tested here, it works.

From this morning's CVS, pop3proxy.py line 763 et seq.:

[HTML markup inside the string literals was scrubbed by the archive.]

    def onTrain(self, params):
        """Train on an uploaded or pasted message."""
        # Upload or paste? Spam or ham?
        message = params.get('file') or params.get('text')
        isSpam = (params['which'] == 'Train as Spam')

        # Append the message to a file, to make it easier to rebuild
        # the database later. This is a temporary implementation -
        # it should keep a Corpus (from Tim Stone's forthcoming message
        # management module) to manage a cache of messages. It needs
        # to keep them for the HTML retraining interface anyway.
        message = message.replace('\r\n', '\n').replace('\r', '\n') #<====
        if isSpam:
            f = open("_pop3proxyspam.mbox", "a")
        else:
            f = open("_pop3proxyham.mbox", "a")
        f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
        f.write(message)
        f.write("\n\n")
        f.close()

        # Train on the message.
        tokens = tokenizer.tokenize(message)
        self.bayes.learn(tokens, isSpam, True)
        self.push("OK. Return Home or train another:")
        self.push(self.pageSection % ('Train another', self.train))

    def onClassify(self, params):
        """Classify an uploaded or pasted message."""
        message = params.get('file') or params.get('text')
        tokens = tokenizer.tokenize(message) #<====
        prob, clues = self.bayes.spamprob(tokens, evidence=True)
        self.push("Spam probability: %.8f" % prob)
        self.push("")
        self.push("\n")
        self.push("Clues:")
        for w, p in clues:
            self.push("%s%.8f\n" % (w, p))
        self.push("")
        self.push("Return Home or classify another:")
        self.push(self.pageSection % ('Classify another', self.classify))

--
Mail is a means of communication. People should think about the political implications of the choices (or non-choices) they make in their tools and technologies.
For clean mail:
--

From Paul.Moore@atosorigin.com Mon Nov 18 15:57:40 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 15:57:40 -0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DDB@UKDCX001.uk.int.atosorigin.com>

Um, is it just me, or does hammiefilter not save the database if you're using a pickle?

A patch is attached. It doesn't feel "clean" (it would be nice if PersistentBayes covered both pickles and DB files, as well as any other cases which may appear) but it'll do for now. Maybe I'll look at the wider issue when I have some time...

Paul.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hammiefilter.patch
Type: application/octet-stream
Size: 1070 bytes
Desc: hammiefilter.patch
Url : http://mail.python.org/pipermail/spambayes/attachments/20021118/ca757103/hammiefilter.exe

From tim@fourstonesExpressions.com Mon Nov 18 16:03:04 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 18 Nov 2002 10:03:04 -0600
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
Message-ID:

It's not you... you have to manually save it for the moment.

    fp = open(, 'wb')
    pickle.dump(, fp, 1)
    fp.close()

We're beginning to work on making hammie* use Bayes.PersistentBayes to take care of this kind of stuff for you, but we're not there yet.

- TimS

11/18/2002 9:57:40 AM, "Moore, Paul" wrote:

>Um, is it just me, or does hammiefilter not save the database if
>you're using a pickle?
>
>A patch is attached.
It doesn't feel "clean" (it would be nice if >PersistentBayes covered both pickles and DB files, as well as any >other cases which may appear) but it'll do for now. Maybe I'll look at >the wider issue when I have some time... > >Paul. > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > - Tim www.fourstonesExpressions.com From tim@fourstonesExpressions.com Mon Nov 18 16:13:00 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 18 Nov 2002 10:13:00 -0600 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DDC@UKDCX001.uk.int.atosorigin.com> Message-ID: I think applying your patch will break more things than it'll fix at the moment. There's a lot of in-memory training that is done for testing purposes that's never saved, at least that's what I believe. The Bayes.* classes have a load/store semantic, which is fully implemented for PickledBayes. We can *probably* simply change hammie.createbayes to create a PickledBayes object instead of a Bayes object, and everything will work nicely. I can't do that till the end of this week, though. You might check with Neale and see what he thinks. BTW, the Bayes.* classes do not auto-save, but they do offer a store() method that you can call which makes storing the stuff easy. - TimS 11/18/2002 10:05:59 AM, "Moore, Paul" wrote: >From: Tim Stone - Four Stones Expressions > >> It's not you... you have to manually save it for the moment. >> >> fp = open(, 'wb') >> pickle.dump(, fp, 1) >> fp.close() >> >> We're beginning to work on making hammie* use Bayes.PersistentBayes >> to take care of this kind of stuff for you, but we're not there yet. > >Thanks. If work on this already under way, I'll keep out of the way :-) >Unless there's anything I can do to help? > >Paul. > >PS Is my patch worth applying as an interim measure? 
> > - Tim www.fourstonesExpressions.com From neale@woozle.org Mon Nov 18 16:39:43 2002 From: neale@woozle.org (Neale Pickett) Date: 18 Nov 2002 08:39:43 -0800 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DDB@UKDCX001.uk.int.atosorigin.com> References: <16E1010E4581B049ABC51D4975CEDB885E2DDB@UKDCX001.uk.int.atosorigin.com> Message-ID: So then, "Moore, Paul" is all like: > Um, is it just me, or does hammiefilter not save the database if > you're using a pickle? Ah, no, it wouldn't do that. As Tim Stone says, a clean solution is pending. In the meantime, though, I'm curious about how you're using hammiefilter. Loading up the entire pickle is painfully slow compared to the dbm method, and as hammiefilter is made specifically to run once per message and then go away, the pickle is a particularly bad fit. Are you running hammiefilter from procmail? How big is your pickle? Neale From tim@fourstonesExpressions.com Mon Nov 18 16:43:06 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 18 Nov 2002 10:43:06 -0600 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: Message-ID: <98FBOLZVB8HGBGB1XNON2YZYIE6YT.3dd9189a@riven> On a related subject, has there been any work done to persist the training into a ZODB? DBM has it's own set of limitations: e.g. very large database file. If there is ZODB work, how would I get ahold of that stuff? I don't see anything like that in cvs anywhere, or maybe I'm just missing it. - TimS 11/18/2002 10:39:43 AM, Neale Pickett wrote: >So then, "Moore, Paul" is all like: > >> Um, is it just me, or does hammiefilter not save the database if >> you're using a pickle? > >Ah, no, it wouldn't do that. As Tim Stone says, a clean solution is >pending. > >In the meantime, though, I'm curious about how you're using >hammiefilter. 
Loading up the entire pickle is painfully slow compared
>to the dbm method, and as hammiefilter is made specifically to run once
>per message and then go away, the pickle is a particularly bad fit.
>
>Are you running hammiefilter from procmail? How big is your pickle?
>
>Neale
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>

- Tim
www.fourstonesExpressions.com

From Paul.Moore@atosorigin.com Mon Nov 18 16:54:38 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 16:54:38 -0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619940@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:neale@woozle.org]
> So then, "Moore, Paul" is all like:
>> Um, is it just me, or does hammiefilter not save the database if
>> you're using a pickle?
> Ah, no, it wouldn't do that. As Tim Stone says, a clean solution is
> pending.
> In the meantime, though, I'm curious about how you're using
> hammiefilter. Loading up the entire pickle is painfully slow compared
> to the dbm method, and as hammiefilter is made specifically to run once
> per message and then go away, the pickle is a particularly bad fit.
> Are you running hammiefilter from procmail? How big is your pickle?

I'm not using hammiefilter in its "filter" mode at all. I was planning on using it for single-message incremental training ("Train as spam/ham") by piping the message to "hammiefilter -[gs]". I guess that "hammie -[gs] -" is just as good for this usage (well, better - it works!)

I realise that this area is currently in a state of flux. I don't have a problem with changing things as I go. It's just a case of "what's best right now?"

As for why I'm using pickles, it's simply because that's the default. I don't have enough feel for things (or a large enough base of messages) to have a problem either way. (My popget.py goes via hammie.py, and so loads the pickle once per scan through the POP mailbox. Performance for this has been fine so far, but I've only just started, so don't read too much into that...)

This would be much easier if I wasn't working in "batch mode" - mild-mannered (as if!) Exchange/Outlook user by day, masked Gnus/POP3 user by night :-)

Paul.

From neale@woozle.org Mon Nov 18 16:54:56 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 08:54:56 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To:
References:
Message-ID:

So then, Tim Stone - Four Stones Expressions is all like:

> I think applying your patch will break more things than it'll fix at
> the moment. There's a lot of in-memory training that is done for
> testing purposes that's never saved, at least that's what I believe.

I'm not sure what you're thinking of here, but regardless of whether or not there's wasted effort, unless the pickle is written out again, any training is useless. I've put a variation of Paul's patch in, and as soon as SF CVS comes back up again, I'll check it all in.

> The Bayes.* classes have a load/store semantic, which is fully
> implemented for PickledBayes. We can *probably* simply change
> hammie.createbayes to create a PickledBayes object instead of a Bayes
> object, and everything will work nicely. I can't do that till the end
> of this week, though. You might check with Neale and see what he
> thinks.

Neale thinks this is the right way to do it. If the Bayes.* classes write out their state on destruction, we can treat them all the same. That's easy enough, just have them call self.store() in the __del__ method. Neale will try and see if he can check in something that does that. Neale's having problems with SF CVS today, though.
Yours truly, Me From tim.one@comcast.net Mon Nov 18 16:56:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 11:56:12 -0500 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: <98FBOLZVB8HGBGB1XNON2YZYIE6YT.3dd9189a@riven> Message-ID: [Tim Stone] > On a related subject, has there been any work done to persist the > training into a ZODB? DBM has it's own set of limitations: e.g. very > large database file. If there is ZODB work, how would I get ahold of > that stuff? I don't see anything like that in cvs anywhere, or maybe > I'm just missing it. The project's pspam directory contains Jeremy Hylton's work on integrating the classifier with ZODB and ZEO. From neale@woozle.org Mon Nov 18 17:09:19 2002 From: neale@woozle.org (Neale Pickett) Date: 18 Nov 2002 09:09:19 -0800 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619940@UKDCX001.uk.int.atosorigin.com> References: <16E1010E4581B049ABC51D4975CEDB88619940@UKDCX001.uk.int.atosorigin.com> Message-ID: So then, "Moore, Paul" is all like: > I realise that this area is currently in a state of flux. I don't have a > problem with changing things as I go. It's just a case of "what's best > right now?" CVS is back up, so the answer is now "hammiefilter". Wait, no, it's down again. The answer is now "hammie". Your setup is nearly identical to mine. Here is some elisp to bind "B h" to "train as ham and move to another folder" and "B s" to "train as spam and move to the spam folder". Drop it into .gnus. 
(defun pipe-message (command)
  (interactive "sCommand: ")
  (save-window-excursion
    (gnus-summary-show-article 'raw)
    (gnus-summary-select-article-buffer)
    (shell-command-on-region (point-min) (point-max) command))
  (gnus-summary-show-article))

(defun spam ()
  (interactive)
  (pipe-message "/home/neale/bin/hammie -s")
  (gnus-summary-move-article 1 "spam"))

(defun notspam ()
  (interactive)
  (pipe-message "/home/neale/bin/hammie -g")
  (gnus-summary-move-article 1))

(add-hook 'gnus-sum-load-hook
          (lambda nil
            (define-key gnus-summary-mode-map [(B) (h)] 'notspam)
            (define-key gnus-summary-mode-map [(B) (s)] 'spam)))

> As for why I'm using pickles, it's simply because that's the default.

Ah, that's what I figured. So the new hammiefilter is going to change the default to the dbm method, but *only* for hammiefilter. That sucks, cause now there's this weird disparity between the two. Maybe we should consider changing the default in Options.py...

> This would be much easier if I wasn't working in "batch mode" - mild-
> mannered (as if!) Exchange/Outlook user by day, masked Gnus/POP3 user
> by night :-)

Everyone has their own dirty little secret, but please spare us the details of what you do in those phone booths ;)

Considering getting a cell phone now,

Neale

From neale@woozle.org Mon Nov 18 17:18:37 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 09:18:37 -0800
Subject: [Spambayes] Just for fun
In-Reply-To: <3DD8C399.8000305@startechgroup.co.uk>
References: <3DD8C399.8000305@startechgroup.co.uk>
Message-ID:

So then, Matt Sergeant is all like:

> This is the same as TMDA. I have evidence for you that it doesn't
> work. Case in point being direct from me: someone mailed me asking a
> technical question about one of my perl modules. I mailed him back a
> response, on my own free time. I got a TMDA bounce saying I had to
> confirm that I was a real person. Well frankly, sod that. I never
> replied. I never used his web page to confirm.
I just ignored it and I'm > sure he never got the reply to his question. I seem to recall about five or so years back, some guy asked a question on comp.lang.python, and his email address was something like "bob@nospam.sittingduck.com". Then some guy named "Guido" replied to the question with a very good answer, but the reply bounced because of the "nospam." part. The "Guido" chap declared that he wasn't inclined to jump through hoops for the privilege of answering the guy's question. That guy never got his question answered either, and I don't imagine Guido has changed his stance since then. It may seem like a good idea at first, but you ignore mail at your peril. Other people are busy enough with their own spam problems, don't make them responsible for yours too. Neale From tim.one@comcast.net Mon Nov 18 17:28:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 12:28:32 -0500 Subject: [Spambayes] Just for fun In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861993D@UKDCX001.uk.int.atosorigin.com> Message-ID: [Moore, Paul] > ... > I'm not good at interpreting this stuff yet, but it came out as > solidly unsure, with some interesting features. The 'sender:no real > name:2**0' as a solid ham clue is almost certainly due to Exchange > (basically because Exchange doesn't do real headers, I expect) If there is no Sender header, no token is generated. You get 'sender:no real name:2**0' only if there *is* a Sender header (and it doesn't contain a real name). The Outlook client's _GetFakeHeaders() doesn't synthesize a Sender header, either. So that token must come from internet mail. It may be a ham clue for you because some mailing lists create a Sender field without a real name. For example, the mailing-list version of comp.lang.python adds this to its headers: Sender: python-list-admin@python.org So that makes 'sender:no real name:2**0' a ham clue for me too. That's fine! In my corpus, it is a ham indicator. 
> - I see most internet headers as good spam clues, which is mildly > worrying, although hasn't caused any real issues yet. If your spam comes from the internet, it's appropriate . > The obvious implication is that getting a really good training corpus > is *hard*. Probably beyond the means of the average user. The best possible training corpus is the email they actually get, correctly classified. If they know their own judgment about ham vs spam, all the rest should happen by magic. It's still hard for clients to do that, though. From richie@entrian.com Mon Nov 18 18:01:48 2002 From: richie@entrian.com (Richie Hindle) Date: Mon, 18 Nov 2002 18:01:48 +0000 Subject: [Spambayes] A kinder, gentler hammie In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com> References: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com> Message-ID: <079itu84n9lil7sqae4j9gge1sgppps34h@4ax.com> Hi Paul, > The standard [Windows] environment variables which *can* be used for > this sort of thing are > > 1. HOMEDRIVE and HOMEPATH - %HOMEDRIVE%%HOMEPATH% is basically the > equivalent of Unix's $HOME. But for nearly all cases, these end > up being C:\, which to my mind is a bad default. > 2. USERPROFILE - %USERPROFILE% is a user-specific directory suitable > for config information. But by default it's a directory with spaces > in the name, which can be awkward for some purposes. It's also hard > to navigate to in Windows explorer, which makes files stored there > a little "hidden". Not true on 98: C:\WIN98>set PROMPT=$p$g winbootdir=C:\WIN98 COMSPEC=C:\COMMAND.COM TMPDIR=c:\win98\temp TEMP=C:\win98\temp TMP=c:\win98\temp HOME=D:\BIN\richie PATH=[paranoia] QTDIR=C:\qt TMAKEPATH=C:\qt\tmake\lib\win32-msvc windir=C:\WIN98 and the only reason 'HOME' is there is that I manually added it - possibly for the sake of Cygwin, but certainly for something that your typical Windows user won't have. 
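
A minimal Python sketch of the "try a number of pathnames" lookup discussed in this thread, assuming the order %HOME%, then %USERPROFILE%, then the current directory. The names candidate_ini_paths and find_ini are hypothetical and not part of Spambayes, and the sketch targets current Python rather than the 2002 codebase.

```python
import os


def candidate_ini_paths(environ=None, filename="bayescustomize.ini"):
    """Return candidate config paths, most specific first.

    Hypothetical helper: the order (HOME, USERPROFILE, current
    directory) follows the suggestion made in the thread.
    """
    if environ is None:
        environ = os.environ
    candidates = []
    for var in ("HOME", "USERPROFILE"):
        base = environ.get(var)
        if base:  # skip variables that are unset or empty
            candidates.append(os.path.join(base, filename))
    # Final fallback: a relative path, resolved against whatever the
    # current directory happens to be when the file is opened.
    candidates.append(filename)
    return candidates


def find_ini(environ=None, exists=os.path.exists):
    """Return the first candidate that exists, or None if none do."""
    for path in candidate_ini_paths(environ):
        if exists(path):
            return path
    return None
```

Note that the last candidate still depends on the current directory, so this sketch does not answer the question of a fallback that always works; it only makes the search order explicit.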
Having said that, I agree with this:

> I think "try a number of pathnames" is a sensible approach.

...but is there a fallback that *always* works? I'm not sure whether there is - is argv[0] guaranteed to work, even in frozen / py2exe'd / Installer'd / cx_Frozen / Squeezed / etc. applications?

--
Richie Hindle
richie@entrian.com

From richie@entrian.com Mon Nov 18 18:02:07 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 18:02:07 +0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To:
References:
Message-ID: <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>

Hi Neale,

> Neale thinks this is the right way to do it. If the Bayes.* classes
> write out their state on destruction, we can treat them all the same.
> That's easy enough, just have them call self.store() in the __del__
> method.

Richie thinks this is a bad move. Here's a minor rant I sent to Tim Stone when he did exactly this in his Bayes module:

--------------------------------------------------------------------------
PersistentBayes.__del__() calls store() - this seems like a bad thing for three reasons. One is that I might not want to save my changes to the database - pop3proxy has explicit "Save & Shutdown" and "Shutdown" buttons to give the user control over whether the database is saved or not (to let you do speculative training and discard the results, for instance). [This is the least important of the three reasons. Four, four reasons!] Also, the pop3proxy self-test uses an in-memory bayes instance that it never wants to write to disk.

Secondly, it's unpredictable when __del__ will be called, or even *whether* it will be called - this:

class A:
    def __del__(self):
        print "A.__del__"

class B:
    def __del__(self):
        print "B.__del__"

a = A()
b = B()
a.b = b
b.a = a
print "Exiting..."

won't call either __del__ method in the current CPython implementation.
Thirdly, if users of PersistentBayes explicitly call store() - which seems like the right thing to do - the database will be written out twice. [And that can take *a long time*.]

[snip]

I've found another reason why PersistentBayes.__del__() is a bad thing - self.db_name isn't set in the case where a PickledBayes is created using a filename that doesn't exist (which is done by the pop3proxy self-test) - that was leading to exceptions being thrown from __del__, which is a notoriously hard problem to track down.
--------------------------------------------------------------------------

I'd much rather have an explicit store() method and document the fact that storage may be pre-empted by certain implementations. Relying on __del__ is nasty.

--
Richie Hindle
richie@entrian.com

From neale@woozle.org Mon Nov 18 18:44:18 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 10:44:18 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>
References: <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>
Message-ID:

So then, Richie Hindle is all like:

> Richie thinks this is a bad move. Here's a minor rant I sent to Tim Stone
> when he did exactly this in his Bayes module:

Neale thinks Richie makes some good points here. My original reason for wanting to have the DB flush itself on deletion was something to do with exceptions while training large corpora. I think it's time for that hack to go now. There's nothing wrong with explicitly calling store() on shutdown; as you say, it's cleaner and more predictable.

So let's agree to do that. Rather, I'll cave to what's right and modify my dodgy code to do what yours is already doing :)

I think it may finally be time to give hammie a big makeover--it should just provide the Hammie class, and not be executable. I'll ponder this and post a big diff to see what you all think.
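
A minimal sketch of the explicit-store() pattern agreed on above, assuming a pickle-backed database. PickledStore and train are hypothetical stand-ins and not the real Bayes.* classes; the dirty flag and the save switch are illustrative assumptions, and the code targets current Python rather than the 2002 codebase.

```python
import os
import pickle
import tempfile


class PickledStore:
    """Hypothetical stand-in for a Bayes-like object with load/store."""

    def __init__(self, path):
        self.path = path
        self.counts = {}
        self.dirty = False  # True once there are unsaved changes

    def load(self):
        # A missing file simply means an empty database.
        if os.path.exists(self.path):
            with open(self.path, "rb") as fp:
                self.counts = pickle.load(fp)

    def learn(self, token):
        self.counts[token] = self.counts.get(token, 0) + 1
        self.dirty = True

    def store(self):
        # Write only when something changed, so a second explicit
        # store() call does not rewrite the whole file.
        if not self.dirty:
            return False
        with open(self.path, "wb") as fp:
            pickle.dump(self.counts, fp, 1)
        self.dirty = False
        return True


def train(path, tokens, save=True):
    """Train on tokens, saving at one explicit, predictable point."""
    db = PickledStore(path)
    db.load()
    try:
        for token in tokens:
            db.learn(token)
    finally:
        # No reliance on __del__: the caller decides whether to save.
        if save:
            db.store()
    return db
```

The try/finally keeps whatever was trained before an exception, which matches the original motivation for flushing on deletion, while save=False covers speculative training that should be discarded.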
Neale From richie@entrian.com Mon Nov 18 19:06:18 2002 From: richie@entrian.com (Richie Hindle) Date: Mon, 18 Nov 2002 19:06:18 +0000 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: References: <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com> Message-ID: <63eituclg186qi547gha5vruvpmq5fq1rt@4ax.com> Hi Neale, > I think it may finally be time to give hammie a big makeover--it should > just provide the Hammie class, and not be executable. You might be right. Especially given that Hammie can be used remotely via XML-RPC, I wonder whether Tim Stone's Bayes class and Hammie should be rolled into one, and any client (including pop3proxy) that currently uses classifier.Bayes or Bayes.XXXBayes should used the new class(es) - that would unify the API across all the clients, and make that API available remotely for (almost) free. We could even document it... 8-) -- Richie Hindle richie@entrian.com From richie@entrian.com Mon Nov 18 19:17:43 2002 From: richie@entrian.com (Richie Hindle) Date: Mon, 18 Nov 2002 19:17:43 +0000 Subject: [Spambayes] Classify issue with pop3proxy In-Reply-To: References: Message-ID: Hi François, > The difference I see is line 774 in onTrain wich is not > in onClassify. I sugest adding it at line 793. Thanks for that - I've checked in your patch. Could you (or someone else on a Mac) check that it works? Many thanks. -- Richie Hindle richie@entrian.com From richie@entrian.com Mon Nov 18 19:18:57 2002 From: richie@entrian.com (Richie Hindle) Date: Mon, 18 Nov 2002 19:18:57 +0000 Subject: [Spambayes] New web training interface for pop3proxy Message-ID: <63bitu8kflib5at5tplk79qgm4deo4bohp@4ax.com> Hi, I've just checked in a new web training interface for pop3proxy. It keeps a cache of all the messages that it's proxied (using Tim Stone's Corpus modules), and presents a web page with these untrained messages on, one page per day's messages. You check a Ham/Spam/Discard box next to each one and submit them for training. 
It also keeps trained messages but there's no interface for *re*training yet - that and automatic training will come soon, along with cache expiry. I've put up a mockup at http://entrian.com/review2.html - none of the buttons or links there works, but you can see what it looks like. What I want to do soon is auto-train on 'sure' spams and hams, and split the training interface into 'Review hams', 'Review spams' and 'Review unsure'. Or something. I probably need to look at the way the Outlook stuff does this. One consequence of this is that pop3proxy will create three subdirectories under its working directory in which to keep its caches: pop3proxy-spam-cache, pop3proxy-ham-cache and pop3proxy-unknown-cache. In the somewhat unlikely event that you already have directories with these names (!) you can configure them in bayescustomize.ini. I've also fixed some problems that François was having on the Mac, whereby it was falling over trying to re-open the log file, and uploading of messages to classify wasn't working. -- Richie Hindle richie@entrian.com From dereks@itsite.com Mon Nov 18 17:11:21 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Mon, 18 Nov 2002 12:11:21 -0500 (EST) Subject: [Spambayes] Just for fun In-Reply-To: Message-ID: > "bob@nospam.sittingduck.com". Then some guy named "Guido" replied to Some guy named "Guido" on comp.lang.python? What a coincidence! I think it would be best if we took the discussion of the merits of ASK off this list. I only wanted to mention it to get people thinking and give my perspective on INBOX-based filtering... not to start a long-running discussion of something that is not SpamBayes. > the "nospam." part. The "Guido" chap declared that he wasn't inclined > to jump through hoops for the privilege of answering the guy's question. For the record: I never suggested that ASK should be used for addresses where you EXPECT to get unsolicited emails. 
Published addresses like "sales@foo.com", "info@foo.com", and any email address used on a list or newsgroup would of course be bad candidates for ASK. But for any email address where you do not expect unsolicited emails (spam or not), I think it would be a reasonable protection. If somebody NEEDS to get a message through to you, then asking them to hit "Reply;Send" one time is not unreasonable (in my opinion). If they won't take the time and effort to do that then I don't really want their message... which is exactly why this technique works for spam. (Note that all friends, family, biz associates, etc. are automatically added by pointing the tool at your pre-existing INBOX.) (I only wish I could get this system for my home telephone!) > Other people are busy enough with their own spam problems, don't make > them responsible for yours too. I never suggested that anyone be made responsible for anything. --Derek From skip@pobox.com Mon Nov 18 20:41:17 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 18 Nov 2002 14:41:17 -0600 Subject: [Spambayes] New web training interface for pop3proxy Message-ID: <15833.20589.376685.686723@montanaro.dyndns.org> Richie> I've put up a mockup at http://entrian.com/review2.html... Some suggestions: * I think you need a 'defer' choice in addition to discard/ham/spam. I may well want to train on some obvious ones right now, but don't have the time to investigate others which will require some thought. * It would be nice if the subject was 'hot' so you can click on it and view the entire message in a new window. * Given that the time to classify a message is pretty cheap, it would also be nice if your interface preset the radio buttons based on an initial classification of each message. This suggests you need an 'unsure' radio button as well. 
Skip

From tim.one@comcast.net Mon Nov 18 21:09:37 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 16:09:37 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <63bitu8kflib5at5tplk79qgm4deo4bohp@4ax.com>
Message-ID:

[Richie Hindle]
> ...
> What I want to do soon is auto-train on 'sure' spams and hams, and
> split the training interface into 'Review hams', 'Review spams' and
> 'Review unsure'. Or something. I probably need to look at the way the
> Outlook stuff does this.

The Outlook client doesn't train by magic on anything correctly classified (yet). Ham folders display a "Delete as Spam" button, Spam folders a "Recover from Spam" button, and the Unsure folder has both. Msgs are trained by magic only when selected and you click one of those buttons, or when you drag a msg from one kind of folder to another.

In part this simply reflects our inability so far to decide on "the best" way to train. Manual training in the Outlook client sucks up entire folders.

From richie@entrian.com Mon Nov 18 22:52:04 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 22:52:04 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <15833.20589.376685.686723@montanaro.dyndns.org>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
Message-ID: <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>

[Skip]
> * I think you need a 'defer' choice in addition to discard/ham/spam. I
> may well want to train on some obvious ones right now, but don't have
> the time to investigate others which will require some thought.

Good idea, yes.

> * It would be nice if the subject was 'hot' so you can click on it and
> view the entire message in a new window.

Already on the to-do list - see pop3proxy.py. 8-)

> * Given that the time to classify a message is pretty cheap, it would
> also be nice if your interface preset the radio buttons based on an
> initial classification of each message.
> This suggests you need an
> 'unsure' radio button as well.

I did think of that, then I thought that I was far more likely to make a mistake just scanning down the list thinking "yeah, yeah, yeah, looks ok" than actually having to click something for each message. I also thought of highlighting the classification decisions in some other way, like colouring the rows, but decided against that for the same reason. I think this whole issue will go away in version two - see below.

[Tim]
> The Outlook client doesn't train by magic on anything correctly classified
> (yet). Ham folders display a "Delete as Spam" button, Spam folders a
> "Recover from Spam" button, and the Unsure folder has both. Msgs are
> trained by magic only when selected and you click one of those buttons, or
> when you drag a msg from one kind of folder to another.

I've been thinking that the next version of the web interface would work the same way - rather than a single page of untrained messages, you'd get three pages for ham-judged, spam-judged and unsure.

There might need to be an initial training period where this didn't happen. Or maybe not - I ran this at work today "from scratch" with an empty database, training as I went, and in about 30 messages I had lots of unsures, no fps and only about three fns. Being only one working day it had a large ham bias - we'll see what happens tomorrow after I train it on the night's spam. I reckon it'll do well.

By presenting the messages as three pre-judged lists, am I contradicting my own statement that the messages shouldn't show up as prejudged in the current 'unclassified' list? 8-) I don't think so, because spotting a ham in a bunch of spams, or vice versa, is much easier than spotting whether any of a whole mixture of messages is misclassified.
-- Richie Hindle richie@entrian.com From lists@morpheus.demon.co.uk Mon Nov 18 23:15:44 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Mon, 18 Nov 2002 23:15:44 +0000 Subject: [Spambayes] Just for fun References: <16E1010E4581B049ABC51D4975CEDB8861993D@UKDCX001.uk.int.atosorigin.com> Message-ID: Tim Peters writes: >> - I see most internet headers as good spam clues, which is mildly >> worrying, although hasn't caused any real issues yet. > > If your spam comes from the internet, it's appropriate . A good chunk of ham comes from the Internet, too, but that chunk isn't available in my training set. It could be (to an extent) but see below. >> The obvious implication is that getting a really good training corpus >> is *hard*. Probably beyond the means of the average user. > > The best possible training corpus is the email they actually get, correctly > classified. If they know their own judgment about ham vs spam, all the rest > should happen by magic. It's still hard for clients to do that, though. Agreed (on both points - it's the best and it's hard). In practice, I'm not completely comfortable with the approach of starting from nothing and training only on new mail [1]. But collecting a truly representative corpus isn't easy. The overhead of religiously collecting and manually classifying all mail for a reasonable period is prohibitive, and any attempt to just grab existing filed mails will always introduce bias [2]. I'm really just trying to get to grips with what can be done to ease the "entry cost" of the system. Paul. [1] It works (pretty much any training method works remarkably well) but as has been reported here before, unsures are surprising. And worse than that, in my experience, is the fact that training on an error or unsure and then rescoring it can show it still as unsure. This is *very* offputting - you just told the system it is spam, how come the system ignored you? 
(I know the answer, but it's almost impossible to make it feel like reasonable behaviour). [2] The main forms of bias I see with my mail are on the one hand, massive imbalance in numbers, because I keep all sorts of ancient junk whereas I (used to) delete spam instantly. On the other hand, taking just my inbox excludes almost all ham which originated from the internet (as a simple example). Tomorrow, I'm hoping to try your new option to compensate for imbalance. Let's hit it with a truly massive ratio and see how it goes! -- This signature intentionally left blank From tim.one@comcast.net Mon Nov 18 23:29:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 18:29:07 -0500 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right In-Reply-To: Message-ID: [Mark Hammond, humbled by amazing ham-sniffing powers] I suppose this would be a good time to confess that I seed each database with "craniofacial reconstruction" as killer-strong ham clues? > ... > The text version of the mail is below. It was HTML, gray background, > blue writing - big brain spam-clues Fortunately or not, because the tokenizer strips HTML decorations, the classifier is blind to info about colors, font styles, and font sizes. I usually don't mind because there are so many other spammy things about spam, but it's still an abstract nag. > And-I'm-yet-to-see-a-bayes-FP ly, You will. I confess that I zip thru my Spam folder faster each day, though -- there's never ham in it anymore. BTW, I gave up on my mistake-driven classifier experiment. I kept getting several porn spam as Unsure every day, and got tired of digging thru it. Now I'm training on each spam that doesn't score 100, and each ham that doesn't score 0. Amazingly, that's added a hell of a lot more spam than ham to the training data -- now up to 99 ham and 149 spam. Porn spam no longer rates as Unsure, and I'm happier. Perhaps that's just due to the drop in forced stimulation, though . 
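The two training regimes Tim contrasts above can be written down as small predicates. This is a rough sketch under stated assumptions: "mistake-driven" is taken to mean training only on messages that were misclassified or landed in Unsure, scores are probabilities in [0, 1] that a client displays as rounded percentages, and the cutoff defaults are illustrative rather than taken from this thread. Both function names are hypothetical.

```python
def mistake_driven(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only if the message was misjudged or scored as Unsure.

    Assumes score >= spam_cutoff means "spam" and score < ham_cutoff
    means "ham"; anything in between is Unsure.
    """
    if is_spam:
        return score < spam_cutoff
    return score >= ham_cutoff


def train_on_nonextreme(score, is_spam):
    """Train on any spam not displayed as 100 and any ham not displayed as 0."""
    shown = int(round(score * 100))  # assumed display rounding
    if is_spam:
        return shown != 100
    return shown != 0


# A spam that is correctly classified, but not at the extreme:
borderline_spam = 0.95
trained_by_mistakes = mistake_driven(borderline_spam, is_spam=True)
trained_by_nonextreme = train_on_nonextreme(borderline_spam, is_spam=True)
```

Under these assumptions a spam scoring 0.95 is skipped by the mistake-driven rule but picked up by the second rule, which is one way the second regime ends up adding more messages to the training data.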
From tim@fourstonesExpressions.com Tue Nov 19 00:10:47 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon, 18 Nov 2002 18:10:47 -0600 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: <63eituclg186qi547gha5vruvpmq5fq1rt@4ax.com> Message-ID: I think we've got some real potential for a great little api here. I do have some questions about the data storage. We've agreed that an explicit store is the way we want to go, which I think is correct. However, dbm really doesn't support this. I fooled with a couple ideas (hacks) to make DBDict behave in a load/store fashion, and the best thing I can come up with is to actually make a working copy of the dbm file, which is then used for the session. When store() is called, the original is replaced with the working copy. There are some difficulties with this approach. If store is never called, then there is no guaranteed way to clean up the working copy. Replacing the original with the working copy may be a bit difficult, because dbm doesn't support a close method... SOOOOO... Tim Stone's question is: "Should I go ahead and do that?" - TimS 11/18/2002 1:06:18 PM, Richie Hindle wrote: >Hi Neale, > >> I think it may finally be time to give hammie a big makeover--it should >> just provide the Hammie class, and not be executable. > >You might be right. Especially given that Hammie can be used remotely via >XML-RPC, I wonder whether Tim Stone's Bayes class and Hammie should be >rolled into one, and any client (including pop3proxy) that currently uses >classifier.Bayes or Bayes.XXXBayes should used the new class(es) - that >would unify the API across all the clients, and make that API available >remotely for (almost) free. We could even document it... 
8-) > >-- >Richie Hindle >richie@entrian.com > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From neale@woozle.org Tue Nov 19 00:35:55 2002 From: neale@woozle.org (Neale Pickett) Date: 18 Nov 2002 16:35:55 -0800 Subject: [Spambayes] proposed changes to hammie & co. Message-ID: okay, here's the big diff I was talking about. This would take all hammie functionality out of hammie. So there would need to be yet another hammie*.py file, a front-end to this new hammie class which acts like the all-singing, all-dancing program that hammie is currently. This moves everything but the Hammie class out of hammie.py. DBDict goes into its own module, which you could take out and use elsewhere if you wanted. PersistentBayes goes away, replaced by a the DBDictBayes class in Bayes.py. I haven't had time to implement the rest of the stuff yet, but that would be what'd go into the new front-end. So the happy hammie family would then stand at: hammie.py |-- hammiefilter.py |-- pop3proxy.py |-- hammiesrv.py \-- hammie-new-front-end.py This change appears to work fine with hammiefilter and pop3proxy. But it's a pretty big change, so I'd like to hear what at least Richie and Tim Stone think before I commit anything. Neale ? Outlook2000 ? diff ? email ? hammiebatch.py Index: Bayes.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v retrieving revision 1.5 diff -u -r1.5 Bayes.py --- Bayes.py 18 Nov 2002 13:04:20 -0000 1.5 +++ Bayes.py 19 Nov 2002 00:24:57 -0000 @@ -56,11 +56,10 @@ all the spambayes contributors." 
import Corpus -from classifier import Bayes +import classifier from Options import options -from hammie import DBDict # hammie only for DBDict, which should - # probably really be somewhere else import cPickle as pickle +import dbdict import errno import copy import anydbm @@ -69,7 +68,7 @@ NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training UPDATEPROBS = True # Probabilities will be autoupdated with training -class PersistentBayes(Bayes): +class PersistentBayes(classifier.Bayes): '''Persistent Bayes database object''' def __init__(self, db_name): @@ -169,12 +168,49 @@ self.wordinfo, self.nspam, self.nham = t[1:] +class WIDict(dbdict.DBDict): + """DBDict optimized for holding lots of WordInfo objects. + + Normally, the pickler can figure out that you're pickling the same + type thing over and over, and will just tag the type with a new + byte, thus reducing Administrative Pickle Bloat(R). Since the + DBDict continually creates new picklers, however, nothing ever gets + the chance to do this optimization. + + The WIDict class forces this optimization by stealing the + (currently) unused 'W' pickle type for WordInfo objects. This + results in about a 50% reduction in database size. + + """ + + def __getitem__(self, key): + v = self.hash[key] + if v[0] == 'W': + val = pickle.loads(v[1:]) + # We could be sneaky, like pickle.Unpickler.load_inst, + # but I think that's overly confusing. 
+ obj = classifier.WordInfo(0) + obj.__setstate__(val) + return obj + else: + return pickle.loads(v) + + def __setitem__(self, key, val): + if isinstance(val, classifier.WordInfo): + val = val.__getstate__() + v = 'W' + pickle.dumps(val, 1) + else: + v = pickle.dumps(val, 1) + self.hash[key] = v + + class DBDictBayes(PersistentBayes): '''Bayes object persisted in a hammie.DB_Dict''' - def __init__(self, db_name): + def __init__(self, db_name, mode='c'): '''Constructor(database name)''' + self.mode = mode self.db_name = db_name self.statekey = "saved state" @@ -186,7 +222,8 @@ if Corpus.Verbose: print 'Loading state from',self.db_name,'DB_Dict' - self.wordinfo = DBDict(self.db_name, 'c') + self.wordinfo = WIDict(self.db_name, self.mode, + iterskip=[self.statekey]) if self.wordinfo.has_key(self.statekey): @@ -216,7 +253,7 @@ def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS): '''Constructor(Bayes, \ - Corpus.SPAM|Corpus.HAM), updprobs(True|False)''' + Corpus.SPAM|Corpus.HAM), updprobs(True|False)''' self.bayes = bayes self.trainertype = trainertype @@ -286,4 +323,4 @@ if __name__ == '__main__': - print >>sys.stderr, __doc__ \ No newline at end of file + print >>sys.stderr, __doc__ Index: dbdict.py =================================================================== RCS file: dbdict.py diff -N dbdict.py --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ dbdict.py 19 Nov 2002 00:24:57 -0000 @@ -0,0 +1,92 @@ +#! /usr/bin/env python + +from __future__ import generators +import dbhash +try: + import cPickle as pickle +except ImportError: + import pickle + +class DBDict: + """Database Dictionary. + + This wraps a dbhash database to make it look even more like a + dictionary, much like the built-in shelf class. The difference is + that a DBDict supports all dict methods. + + Call it with the database. Optionally, you can specify a list of + keys to skip when iterating. This only affects iterators; things + like .keys() still list everything. 
For instance: + + >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo')) + >>> d['skipme'] = 'booga' + >>> d['countme'] = 'wakka' + >>> print d.keys() + ['skipme', 'countme'] + >>> for k in d.iterkeys(): + ... print k + countme + + """ + + def __init__(self, dbname, mode, iterskip=()): + self.hash = dbhash.open(dbname, mode) + self.iterskip = iterskip + + def __getitem__(self, key): + return pickle.loads(self.hash[key]) + + def __setitem__(self, key, val): + self.hash[key] = pickle.dumps(val, 1) + + def __delitem__(self, key, val): + del(self.hash[key]) + + def __iter__(self, fn=None): + k = self.hash.first() + while k != None: + key = k[0] + val = self.__getitem__(key) + if key not in self.iterskip: + if fn: + yield fn((key, val)) + else: + yield (key, val) + try: + k = self.hash.next() + except KeyError: + break + + def __contains__(self, name): + return self.has_key(name) + + def __getattr__(self, name): + # Pass the buck + return getattr(self.hash, name) + + def get(self, key, dfl=None): + if self.has_key(key): + return self[key] + else: + return dfl + + def iteritems(self): + return self.__iter__() + + def iterkeys(self): + return self.__iter__(lambda k: k[0]) + + def itervalues(self): + return self.__iter__(lambda k: k[1]) + +open = DBDict + +def _test(): + import doctest + import dbdict + + doctest.testmod(dbdict) + +if __name__ == '__main__': + _test() + Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.40 diff -u -r1.40 hammie.py --- hammie.py 18 Nov 2002 18:13:54 -0000 1.40 +++ hammie.py 19 Nov 2002 00:24:57 -0000 @@ -1,57 +1,11 @@ #! /usr/bin/env python -# A driver for the classifier module and Tim's tokenizer that you can -# call from procmail. - -"""Usage: %(program)s [options] - -Where: - -h - show usage and exit - -g PATH - mbox or directory of known good messages (non-spam) to train on. - Can be specified more than once, or use - for stdin. 
- -s PATH - mbox or directory of known spam messages to train on. - Can be specified more than once, or use - for stdin. - -u PATH - mbox of unknown messages. A ham/spam decision is reported for each. - Can be specified more than once. - -r - reverse the meaning of the check (report ham instead of spam). - Only meaningful with the -u option. - -p FILE - use file as the persistent store. loads data from this file if it - exists, and saves data to this file at the end. - Default: %(DEFAULTDB)s - -d - use the DBM store instead of cPickle. The file is larger and - creating it is slower, but checking against it is much faster, - especially for large word databases. Default: %(USEDB)s - -D - the reverse of -d: use the cPickle instead of DBM - -f - run as a filter: read a single message from stdin, add an - %(DISPHEADER)s header, and write it to stdout. If you want to - run from procmail, this is your option. -""" - -from __future__ import generators - -import sys -import os -import types -import getopt -import mailbox -import glob -import email -import errno -import anydbm -import cPickle as pickle +import dbdict import mboxutils -import classifier +import Bayes from Options import options +from tokenizer import tokenize try: True, False @@ -60,166 +14,14 @@ True, False = 1, 0 -program = sys.argv[0] # For usage(); referenced by docstring above - -# Name of the header to add in filter mode -DISPHEADER = options.hammie_header_name -DEBUGHEADER = options.hammie_debug_header_name -DODEBUG = options.hammie_debug_header - -# Default database name -DEFAULTDB = options.persistent_storage_file - -# Probability at which a message is considered spam -SPAM_THRESHOLD = options.spam_cutoff -HAM_THRESHOLD = options.ham_cutoff - -# Probability limit for a clue to be added to the DISPHEADER -SHOWCLUE = options.clue_mailheader_cutoff - -# Use a database? 
If False, use a pickle -USEDB = options.persistent_use_database - -# Tim's tokenizer kicks far more booty than anything I would have -# written. Score one for analysis ;) -from tokenizer import tokenize - -class DBDict: - - """Database Dictionary. - - This wraps an anydbm to make it look even more like a dictionary. - - Call it with the name of your database file. Optionally, you can - specify a list of keys to skip when iterating. This only affects - iterators; things like .keys() still list everything. For instance: - - >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo')) - >>> d['skipme'] = 'booga' - >>> d['countme'] = 'wakka' - >>> print d.keys() - ['skipme', 'countme'] - >>> for k in d.iterkeys(): - ... print k - countme - - """ - - def __init__(self, dbname, mode, iterskip=()): - self.hash = anydbm.open(dbname, mode) - self.iterskip = iterskip - - def __getitem__(self, key): - v = self.hash[key] - if v[0] == 'W': - val = pickle.loads(v[1:]) - # We could be sneaky, like pickle.Unpickler.load_inst, - # but I think that's overly confusing. 
- obj = classifier.WordInfo(0) - obj.__setstate__(val) - return obj - else: - return pickle.loads(v) - - def __setitem__(self, key, val): - if isinstance(val, classifier.WordInfo): - val = val.__getstate__() - v = 'W' + pickle.dumps(val, 1) - else: - v = pickle.dumps(val, 1) - self.hash[key] = v - - def __delitem__(self, key, val): - del(self.hash[key]) - - def __iter__(self, fn=None): - k = self.hash.first() - while k != None: - key = k[0] - val = self.__getitem__(key) - if key not in self.iterskip: - if fn: - yield fn((key, val)) - else: - yield (key, val) - try: - k = self.hash.next() - except KeyError: - break - - def __contains__(self, name): - return self.has_key(name) - - def __getattr__(self, name): - # Pass the buck - return getattr(self.hash, name) - - def get(self, key, dfl=None): - if self.has_key(key): - return self[key] - else: - return dfl - - def iteritems(self): - return self.__iter__() - - def iterkeys(self): - return self.__iter__(lambda k: k[0]) - - def itervalues(self): - return self.__iter__(lambda k: k[1]) - - -class PersistentBayes(classifier.Bayes): - - """A persistent Bayes classifier. - - This is just like classifier.Bayes, except that the dictionary is a - database. You take less disk this way and you can pretend it's - persistent. The tradeoffs vs. a pickle are: 1. it's slower - training, but faster checking, and 2. it needs less memory to run, - but takes more space on the hard drive. +class Hammie: + """A spambayes mail filter. - On destruction, an instantiation of this class will write its state - to a special key. When you instantiate a new one, it will attempt - to read these values out of that key again, so you can pick up where - you left off. + This implements the basic functionality needed to score, filter, or + train. """ - # XXX: Would it be even faster to remember (in a list) which keys - # had been modified, and only recalculate those keys? 
No sense in - # going over the entire word database if only 100 words are - # affected. - - # XXX: Another idea: cache stuff in memory. But by then maybe we - # should just use ZODB. - - def __init__(self, dbname, mode): - classifier.Bayes.__init__(self) - self.statekey = "saved state" - self.wordinfo = DBDict(dbname, mode, (self.statekey,)) - self.dbmode = mode - - self.restore_state() - - def __del__(self): - #super.__del__(self) - self.save_state() - - def save_state(self): - if self.dbmode != 'r': - self.wordinfo[self.statekey] = (self.nham, self.nspam) - - def restore_state(self): - if self.wordinfo.has_key(self.statekey): - self.nham, self.nspam = self.wordinfo[self.statekey] - - -class Hammie: - - """A spambayes mail filter""" - def __init__(self, bayes): self.bayes = bayes @@ -262,9 +64,9 @@ import traceback traceback.print_exc() - def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD, - ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER, - debug=DODEBUG): + def filter(self, msg, header=None, spam_cutoff=None, + ham_cutoff=None, debugheader=None, + debug=None): """Score (judge) a message and add a disposition header. msg can be a string, a file object, or a Message object. @@ -282,6 +84,17 @@ """ + if header == None: + header = options.hammie_header_name + if spam_cutoff == None: + spam_cutoff = options.spam_cutoff + if ham_cutoff == None: + ham_cutoff = options.ham_cutoff + if debugheader == None: + debugheader = options.hammie_debug_header_name + if debug == None: + debug = options.hammie_debug_header + msg = mboxutils.get_message(msg) try: del msg[header] @@ -348,163 +161,47 @@ self.train(msg, True) - def update_probabilities(self): + def update_probabilities(self, store=True): """Update probability values. You would want to call this after a training session. It's pretty slow, so if you have a lot of messages to train, wait until you're all done before calling this. 
+ Unless store is false, the peristent store will be written after + updating probabilities. + """ self.bayes.update_probabilities() + if store: + self.store() + def store(self): + """Write out the persistent store. + + This makes sure the persistent store reflects what is currently + in memory. You would want to do this after a write and before + exiting. + + """ + + self.bayes.store() + + +def open(filename, usedb=True, mode='r'): + """Open a file, returning a Hammie instance. + + If usedb is False, open as a pickle instead of a DBDict. mode is + + used as the flag to open DBDict objects. 'c' for read-write (create + if needed), 'r' for read-only, 'w' for read-write. + + """ -def train(hammie, msgs, is_spam): - """Train bayes with all messages from a mailbox.""" - mbox = mboxutils.getmbox(msgs) - i = 0 - for msg in mbox: - i += 1 - # XXX: Is the \r a Unixism? I seem to recall it working in DOS - # back in the day. Maybe it's a line-printer-ism ;) - sys.stdout.write("\r%6d" % i) - sys.stdout.flush() - hammie.train(msg, is_spam) - print - -def score(hammie, msgs, reverse=0): - """Score (judge) all messages from a mailbox.""" - # XXX The reporting needs work! - mbox = mboxutils.getmbox(msgs) - i = 0 - spams = hams = 0 - for msg in mbox: - i += 1 - prob, clues = hammie.score(msg, True) - if hasattr(msg, '_mh_msgno'): - msgno = msg._mh_msgno - else: - msgno = i - isspam = (prob >= SPAM_THRESHOLD) - if isspam: - spams += 1 - if not reverse: - print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), - print hammie.formatclues(clues) - else: - hams += 1 - if reverse: - print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), - print hammie.formatclues(clues) - return (spams, hams) - -def createbayes(pck=DEFAULTDB, usedb=False, mode='r'): - """Create a Bayes instance for the given pickle (which - doesn't have to exist). 
Create a PersistentBayes if - usedb is True.""" if usedb: - bayes = PersistentBayes(pck, mode) + b = Bayes.DBDictBayes(filename, mode) else: - bayes = None - try: - fp = open(pck, 'rb') - except IOError, e: - if e.errno <> errno.ENOENT: raise - else: - bayes = pickle.load(fp) - fp.close() - if bayes is None: - bayes = classifier.Bayes() - return bayes - -def usage(code, msg=''): - """Print usage message and sys.exit(code).""" - if msg: - print >> sys.stderr, msg - print >> sys.stderr - print >> sys.stderr, __doc__ % globals() - sys.exit(code) - -def main(): - """Main program; parse options and go.""" - try: - opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r') - except getopt.error, msg: - usage(2, msg) - - if not opts: - usage(2, "No options given") - - pck = DEFAULTDB - good = [] - spam = [] - unknown = [] - reverse = 0 - do_filter = False - usedb = USEDB - mode = 'r' - for opt, arg in opts: - if opt == '-h': - usage(0) - elif opt == '-g': - good.append(arg) - mode = 'c' - elif opt == '-s': - spam.append(arg) - mode = 'c' - elif opt == '-p': - pck = arg - elif opt == "-d": - usedb = True - elif opt == "-D": - usedb = False - elif opt == "-f": - do_filter = True - elif opt == '-u': - unknown.append(arg) - elif opt == '-r': - reverse = 1 - if args: - usage(2, "Positional arguments not allowed") - - save = False - - bayes = createbayes(pck, usedb, mode) - h = Hammie(bayes) - - for g in good: - print "Training ham (%s):" % g - train(h, g, False) - save = True - - for s in spam: - print "Training spam (%s):" % s - train(h, s, True) - save = True - - if save: - h.update_probabilities() - if not usedb and pck: - fp = open(pck, 'wb') - pickle.dump(bayes, fp, 1) - fp.close() - - if do_filter: - msg = sys.stdin.read() - filtered = h.filter(msg) - sys.stdout.write(filtered) - - if unknown: - (spams, hams) = (0, 0) - for u in unknown: - if len(unknown) > 1: - print "Scoring", u - s, g = score(h, u, reverse) - spams += s - hams += g - print "Total %d spam, %d ham" % 
(spams, hams) - + b = Bayes.PickledBayes(filename) + return Hammie(b) -if __name__ == "__main__": - main() Index: hammiefilter.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v retrieving revision 1.2 diff -u -r1.2 hammiefilter.py --- hammiefilter.py 18 Nov 2002 18:14:04 -0000 1.2 +++ hammiefilter.py 19 Nov 2002 00:24:57 -0000 @@ -51,43 +51,37 @@ print >> sys.stderr, __doc__ % globals() sys.exit(code) -def jar_pickle(h): - if not options.persistent_use_database: - import pickle - fp = open(options.persistent_storage_file, 'wb') - pickle.dump(h.bayes, fp, 1) - fp.close() - - -def hammie_open(mode): - b = hammie.createbayes(options.persistent_storage_file, - options.persistent_use_database, - mode) - return hammie.Hammie(b) - def newdb(): - h = hammie_open('n') - jar_pickle(h) + h = hammie.open(options.persistent_storage_file, + options.persistent_use_database, + 'n') + h.store() print "Created new database in", options.persistent_storage_file def filter(): - h = hammie_open('r') + h = hammie.open(options.persistent_storage_file, + options.persistent_use_database, + 'r') msg = sys.stdin.read() print h.filter(msg) def train_ham(): - h = hammie_open('w') + h = hammie.open(options.persistent_storage_file, + options.persistent_use_database, + 'w') msg = sys.stdin.read() h.train_ham(msg) h.update_probabilities() - jar_pickle(h) + h.store() def train_spam(): - h = hammie_open('w') + h = hammie.open(options.persistent_storage_file, + options.persistent_use_database, + 'w') msg = sys.stdin.read() h.train_spam(msg) h.update_probabilities() - jar_pickle(h) + h.store() def main(): action = filter From tim.one@comcast.net Tue Nov 19 02:54:09 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 21:54:09 -0500 Subject: [Spambayes] Better optimization loop In-Reply-To: <3DD7532A.8020906@hooft.net> Message-ID: [Rob Hooft, simplifying simplex] > ... 
> I decided that we have a perfect way to optimize the ham and spam > cutoff values in timcv already, so that I can remove these from the > simplex optimization. Good observation! That should help. simplex isn't fast in the best of cases, and in this case ... > To that goal I added a "delayed" flexcost to the CostCounter module > that can use the optimal cutoffs calculated at the end of timcv.py. Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99 and spam_cutoff of 0.995 to get rid of "impossible" FP. > And there are only three variables left to optimize using simplex > > I then ran one optimization on my complete (16000+5800) corpus. The > result is that it is fighting very hard to remove fp's while > introducing lots of unsure messages: > > At the start: > > -> all runs false positives: 15 > -> all runs false negatives: 7 > -> all runs unsure: 189 > Standard Cost: $194.80 > Flex Cost: $607.41 > Delayed-Standard Cost: $98.80 > Delayed-Flex Cost: $310.05 > x=0.4990 p=0.1002 s=0.4537 310.05 > > And near the end: > > -> all runs false positives: 5 > -> all runs false negatives: 6 > -> all runs unsure: 342 > -> all runs false positive %: 0.03125 > -> all runs false negative %: 0.103448275862 > -> all runs unsure %: 1.56880733945 > -> all runs cost: $124.40 > Standard Cost: $124.40 > Flex Cost: $589.16 > Delayed-Standard Cost: $98.60 > Delayed-Flex Cost: $212.28 > x=0.3515 p=0.2861 s=0.2467 212.28 > > At this stage it actually managed to get the delayed standard cost > lower by $0.20 (it has been higher than the starting value during much > of the optimization). The Delayed-Flex cost is lowered by about 30%. > But look at the hugely different parameters it had to use! Can someone > else run with these parameters and confirm that this is an extreme > that is only warranted by my particular corpses? I can try . 
Here's a 10-fold CV with 6K random ham and 6K random spam from my c.l.py test data; baseline on the left, while the right has

    [Classifier]
    unknown_word_prob: 0.3515
    minimum_prob_strength: 0.2861
    unknown_word_strength: 0.2467

    filename:       base      simp
    ham:spam:  6000:6000 6000:6000
    fp total:          2         1
    fp %:           0.03      0.02
    fn total:          0         0
    fn %:           0.00      0.00
    unsure t:         46       101
    unsure %:       0.38      0.84
    real cost:    $29.20    $30.20
    best cost:    $12.80    $11.80
    h mean:         0.42      0.71
    h sdev:         3.65      4.81
    s mean:        99.96     99.89
    s sdev:         1.21      1.94
    mean diff:     99.54     99.18
    k:             20.48     14.69

It did a little better here too. The best-cost analyses show that it's also nuking FP at the expense of unsures:

base:

    -> best cost for all runs: $12.80
    -> achieved at 2 cutoff pairs
    -> smallest ham & spam cutoffs 0.52 & 0.95
    ->     fp 1; fn 1; unsure ham 2; unsure spam 7
    ->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
    -> largest ham & spam cutoffs 0.525 & 0.95
    ->     fp 1; fn 1; unsure ham 2; unsure spam 7
    ->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%

simp:

    -> best cost for all runs: $12.80
    -> best cost for all runs: $11.80
    -> achieved at ham & spam cutoffs 0.495 & 0.995
    ->     fp 0; fn 0; unsure ham 10; unsure spam 49
    ->     fp rate 0%; fn rate 0%; unsure rate 0.492%

> Please note that to get a delayed flex cost that is this much lower
> actually means that in the unsure area there is "50% more order" than
> before the optimization!
>
> At some point Tim (was it you?) has reported that in other optimization
> techniques it has proven to be very bad to "focus" on the persistent
> and hopeless fp/fn messages. I fear this might bother me here.

Ya, I reported that from a paper wrestling with boosting, but it's a common observation. Even in simple settings! Say you're doing a least-squares linear regression on this data:

    x    f(x)
    -    ----
    1     1.9
    2     4.1
    3     5.9
    4   -10.0
    5    10.1
    6    12.1
    7    13.8

If you throw out (4, -10), you get an excellent fit to everything that remains. If you leave it in, you still get "an answer", but it's not a good fit to anything.
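To make the outlier effect concrete, here is a minimal sketch in modern Python 3 (not the Python used elsewhere in this thread, and nothing in it is part of spambayes; fit_line, residuals and report are made-up helper names). It fits a least-squares line to the seven points above, once with and once without (4, -10), and prints the largest and smallest absolute residual for each fit.

```python
# Least-squares line fit, with and without the outlier (4, -10.0).
# Plain Python 3, standard library only. The helper names are
# illustrative assumptions, not spambayes code.

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(points, slope, intercept):
    """Return observed minus fitted value for each point."""
    return [y - (slope * x + intercept) for x, y in points]


DATA = [(1, 1.9), (2, 4.1), (3, 5.9), (4, -10.0),
        (5, 10.1), (6, 12.1), (7, 13.8)]
CLEAN = [p for p in DATA if p != (4, -10.0)]


def report(label, points):
    """Print the fitted line and how far it misses the points."""
    slope, intercept = fit_line(points)
    misses = [abs(r) for r in residuals(points, slope, intercept)]
    print("%-16s slope=%.3f intercept=%.3f worst miss=%.3f best miss=%.3f"
          % (label, slope, intercept, max(misses), min(misses)))


report("all points", DATA)
report("outlier dropped", CLEAN)
```

Because the outlier sits exactly at the mean of x, it leaves the slope alone and only drags the intercept down, so with it included the line misses every point by more than 2, while the fit without it stays within 0.2 of each remaining point.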
A 6th-degree polynomial fits all the data perfectly, but the resulting snaky curve is almost certainly a terrible fit to the population from which this sample was taken. A few spam and ham are just unlike their brethren, but from what I've seen of those, no mechanical gimmick is going to classify them correctly. Give up and be happy . > I just started another optimization run, but lowered the cost of a fp > from $10 to $2, and introduced another cost function that I called > flex**2 cost because it changes the cost function for an unsure message > from a linear function to a square function. Oops, two changes at the > same time; but it takes such a long time to run.... When I try a new thing, I usually start with several runs but on *much* less data per run. If at least 3 of 5 show the effect I was hoping for, I may push on; but if 3 of 5 don't, I either give up on it, or change the rules to 4 of 7 (if I'm really in love with the idea ). it's-almost-impossible-not-to-cheat-sometimes-ly y'rs - tim From msurface@myvine.com Tue Nov 19 02:50:00 2002 From: msurface@myvine.com (Mitchell Surface) Date: Mon, 18 Nov 2002 21:50:00 -0500 Subject: [Spambayes] Training questions Message-ID: <20021119025000.GA17060@brewer.fwn.fortwayne.com> I've been lurking here for a while and I finally decided to give this a try. I've read the docs and how to do the initial training seems pretty clear as does setting up a procmail recipe to handle the filtered messages. I do have a couple of questions that I don't remember seeing. Once past the initial training, how do you train on additional ham and spam? Does hammie.py just append new data to what's already there? If so, how can you untrain a misclassified message? Thanks for all the work everybody's put in to this. -- Mitchell Surface N9OSL Fort Wayne, IN USA The Bible is not my book, and Christianity is not my religion. I could never give assent to the long, complicated statements of Christian dogma. 
-- Abraham Lincoln -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20021118/3fa9dfba/attachment.bin

From tim.one@comcast.net Tue Nov 19 04:21:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 18 Nov 2002 23:21:36 -0500 Subject: [Spambayes] RE: chi-combining In-Reply-To: Message-ID:

In an offline thread with Greg Louis (who's working on bogofilter), I tried an experiment using just the S, then just the H, components of our spamprob calculation. We currently return (1+S-H)/2. The "justs" result here just returns S, the "justh" just returns 1-H. justs is a comparative disaster, but the more I stare at it, the more I think justh did surprisingly well:

    filename:       base     justs     justh
    ham:spam:  6000:6000 6000:6000 6000:6000
    fp total:          2         8         2
    fp %:           0.03      0.13      0.03
    fn total:          0         0         4
    fn %:           0.00      0.00      0.07
    unsure t:         40        59         6
    unsure %:       0.33      0.49      0.05
    real cost:    $28.00    $91.80    $25.20
    best cost:     $4.00    $22.40     $6.60
    h mean:         0.38      0.69      0.08
    h sdev:         3.53      5.81      2.18
    s mean:        99.96     99.99     99.92
    s sdev:         1.41      0.45      2.58
    mean diff:     99.58     99.30     99.84
    k:             20.16     15.86     20.97

Similar results were obtained from another trial on different 6K samples from my c.l.py test data.

If you hate FP a lot, and would rather suffer a few FN in return for skipping lots of unsures, justh looks like it may be a viable strategy. Despite that H is less sensitive to high-spamprob words than to low-spamprob words (and S the reverse), at least on this data spam still scores very high under H.

If you want to try this, in chi2_spamprob replace

    prob = (S-H + 1.0) / 2.0

with

    prob = 1.0 - H

From rob@hooft.net Tue Nov 19 10:27:22 2002 From: rob@hooft.net (Rob W.W.
Hooft) Date: Tue, 19 Nov 2002 11:27:22 +0100 Subject: [Spambayes] RE: chi-combining References: Message-ID: <3DDA120A.8070606@hooft.net> Tim Peters wrote: > In an offline thread with Greg Louis (who's working on bogofilter), I tried > an experiment using just the S, then just the H, components of our spamprob > calculation. We currently return (1+S-H)/2. The "justs" result here just > returns S, the "justh" just returns 1-H. justs is a comparative disaster, > but the more I stare at it, the more I think justh did surprisingly well: Try your "invisible ham" spam with this. I'm sure it will score rock-solid ham. By using "justh" you're basically telling spammers that you're not sensitive to spam words, as long as there is enough of the message that looks like ham! The two cases where this makes a difference are H=1 S=1 : this is the case I just described: A message that looks like both ham and spam would be unsure before, but will now result in a Ham score. H=0 S=0 : A message that doesn't look like anything seen before used to result in an unsure, but will now result in a "Spam" disposition. I suspect that 1-H is easier to counter for the ephemeral "smart spammer" than (1+S-H)/2. It is another form of cancellation disease. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tdickenson@devmail.geminidataloggers.co.uk Tue Nov 19 11:13:58 2002 From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Tue, 19 Nov 2002 11:13:58 +0000 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right In-Reply-To: References: Message-ID: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk> On Monday 18 November 2002 11:29 pm, Tim Peters wrote: > BTW, I gave up on my mistake-driven classifier experiment. I kept getting > several porn spam as Unsure every day, and got tired of digging thru it. > Now I'm training on each spam that doesn't score 100, and each ham that > doesn't score 0.
Amazingly, that's added a hell of a lot more spam than > ham to the training data -- now up to 99 ham and 149 spam. Porn spam no > longer rates as Unsure, and I'm happier. Perhaps that's just due to the > drop in forced stimulation, though . Why exclude spams that score 100 from training? Even these really spammy spams might contain clues that would help to classify other more marginal spam. From francois.granger@free.fr Tue Nov 19 11:51:56 2002 From: francois.granger@free.fr (François Granger) Date: Tue, 19 Nov 2002 12:51:56 +0100 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> Message-ID: on 18/11/02 23:52, Richie Hindle at richie@entrian.com wrote: > By presenting the messages as three pre-judged lists, am I contradicting my > own statement that the messages shouldn't show up as prejudged in the > current 'unclassified' list? 8-) I don't think so, because spotting a ham > in a bunch of spams, or vice versa, is much easier than spotting whether > any of a whole mixture of messages is misclassified. What if you show the raw spamprob number close to the buttons? It would give a clue as to what the system found the message to be. -- Le courrier est un moyen de communication. Les gens devraient se poser des questions sur les implications politiques des choix (ou non choix) de leurs outils et technologies.
Pour des courriers propres : -- From skip@pobox.com Tue Nov 19 13:30:42 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 19 Nov 2002 07:30:42 -0600 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> Message-ID: <15834.15618.465106.671756@montanaro.dyndns.org> >> * Given that the time to classify a message is pretty cheap, it would >> also be nice if your interface preset the radio buttons based on an >> initial classification of each message. This suggests you need an >> 'unsure' radio button as well. Richie> I did think of that, then I thought that I was far more likely Richie> to make a mistake just scanning down the list thinking "yeah, Richie> yeah, yeah, looks ok" than actually having to click something Richie> for each message. That suggests to me that you need to group messages together based upon their initial classification. That way, instead of a haphazard arrangement of button settings: they are clumped: D H S U * * * * * * * * perhaps with a
between sections. By clumping things together like this I think it makes it easier to detect an outlier within the group. (The background color should probably alternate between light grey and white to help direct the eyes from the subject to the proper radio button when changes are needed. I say this without ever having seen or used pop3proxy, and can't recall from your mockup if you already do this or not.) Richie> I've been thinking that the next version of the web interface Richie> would work the same way - rather than a single page of untrained Richie> messages, you'd get three pages for ham-judged, spam-judged and Richie> unsure. Sure, same idea. A spam slipped through into my python mailbox yesterday. Stood out like a sore thumb. -- Skip Montanaro - skip@pobox.com http://www.mojam.com/ http://www.musi-cal.com/ From jm@jmason.org Tue Nov 19 14:14:11 2002 From: jm@jmason.org (Justin Mason) Date: Tue, 19 Nov 2002 14:14:11 +0000 Subject: [Spambayes] Another software in the field In-Reply-To: Message from "T. Alexander Popiel" <20021115182039.BD3F3F54C@cashew.wolfskeep.com> Message-ID: <20021119141417.6FC4B16F16@jmason.org> (a bit late in replying! I suffered from inbox overload ;) T. Alexander Popiel said: > If the received parser were a little smarter about parsing iPlanet > received lines, it would have "pcp736393pcs.reston01.va.comcast.net" > instead of "cj569191b" as the first element in the sequence, and > the match list would have been 2 -> 1 -> 2 -> 0 -> 0, yielding: > > message-id-generation:skipped 0 > > I suspect that high skipped numbers would be a strong spam indicator, > showing where message ids were omitted in the sent mail and/or received > headers naively forged to prevent backtracking. It would be interesting to test this; we do something similar in SpamAssassin to find possibly-forged hostnames in the Received headers, and we do try to figure out where in the Received chain the Message-id was added.
Two problems we've seen: - some totally-legit senders, especially auto-generated mails, have a bad habit of leaving out the Message-Id until it gets to *your* MX. Annoying, but allowed by the RFCs. This test would have to figure this out in some way; maybe by adding the sender's hostname or domain to the token, so the legit folks gain ham hits, but spammers remain as 1-spam 0-ham hapaxes? - some senders use e.g. hostname "mylittlecompany.com" on their desktop machine or home LAN, then connect via a commodity-DSL connection, resulting in a reverse-lookup of "dsl43-234.bigisp.net". In other words, the rDNS does not match what the sender wishes it did ;) Not a problem in this case, but worth noting when talking about Received-header parsing. --j. From Paul.Moore@atosorigin.com Tue Nov 19 15:58:03 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Tue, 19 Nov 2002 15:58:03 -0000 Subject: [Spambayes] Training Message-ID: <16E1010E4581B049ABC51D4975CEDB88619943@UKDCX001.uk.int.atosorigin.com> One thing I've just noticed using the Outlook client (although I think it's a feature of the algorithm, rather than specific to the client). A couple of hams ended up in my "Unsure" folder. No problem, I trained on them as ham. But they hit my inbox with a spam score *still* in the 25-35 region. So if I refilter, they pop up in my unsure folder again. Nothing I can do will make the messages score as ham. Refiling based on spam scores is a rare operation (I only noticed this because I forgot to tick the "only unscored messages" checkbox in the filter dialog), but the behaviour is annoying, as well as being unnerving. 
I don't think there's anything which can be done at the algorithm level (the algorithm is effectively saying "OK, I know you're saying it's ham, but it still looks pretty odd to me...") but at the client/user interface level, maybe there should be an extra property "Trained", which says that this message has been specifically confirmed as ham or spam, so that it won't get filtered. I'm not sure how, or if, this would translate to other types of client. Paul. From neale@woozle.org Tue Nov 19 17:13:51 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 09:13:51 -0800 Subject: [Spambayes] Hammiefilter doesn't write out the pickle In-Reply-To: References: Message-ID: So then, Tim Stone - Four Stones Expressions is all like: > I think we've got some real potential for a great little api here. I > do have some questions about the data storage. We've agreed that an > explicit store is the way we want to go, which I think is correct. > However, dbm really doesn't support this. I fooled with a couple > ideas (hacks) to make DBDict behave in a load/store fashion, and the > best thing I can come up with is to actually make a working copy of > the dbm file, which is then used for the session. When store() is > called, the original is replaced with the working copy. There are > some difficulties with this approach. If store is never called, then > there is no guaranteed way to clean up the working copy. Replacing > the original with the working copy may be a bit difficult, because dbm > doesn't support a close method... Yeah. I ran into the same problem yesterday. As I thought about it, I realized this must have been why I implemented the __del__ method of DBDict. The problem, really, with DBDict is that there is this meta-information it has to store (nham, nspam). If individual db entries are updated but the meta-info isn't, your database is corrupt, game over. That problem manifests itself in two ways: 1. 
You need to be very careful about when you hit ^C when running hammie 2. The pop3proxy's "store" method doesn't really do anything But couldn't this be adequately explained by merely stating that the DBDict method stores things instantaneously? If we're careful to always update nham and nspam *before* writing any new wordinfo, then the worst you can do would be start training, then hit ^C right away--equivalent to training on an empty message. And people running the pop3proxy would have to be aware that the way the proxy is working is always in sync with what's on the disk. I don't see either of these as a huge problem. So we need to write out nham and nspam before writing out the new WordInfo counts. I don't think it'd be much of a penalty to do this before every message in a batch training run, and of course for the pickle method it's no difference at all whether you add one before or after training on a message. Neale From neale@woozle.org Tue Nov 19 17:26:02 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 09:26:02 -0800 Subject: [Spambayes] Training questions In-Reply-To: <20021119025000.GA17060@brewer.fwn.fortwayne.com> References: <20021119025000.GA17060@brewer.fwn.fortwayne.com> Message-ID: So then, Mitchell Surface is all like: > I've been lurking here for a while and I finally decided to give this > a try. I've read the docs and how to do the initial training seems > pretty clear as does setting up a procmail recipe to handle the > filtered messages. Thanks for the report! It's always good to hear that someone can actually *use* the blasted thing :) > I do have a couple of questions that I don't remember seeing. Once > past the initial training, how do you train on additional ham and > spam? Does hammie.py just append new data to what's already there? If > so, how can you untrain a misclassified message? You'll want to run "hammie.py -g" on ham, and "hammie.py -s" on spam. 
That will tokenize the new messages you give it, and increment the frequency counts for those tokens in your database. In a sense, it is appending the new data to what's already there. hammie.py currently doesn't have a way to untrain messages. But I'll add that in the next generation hammie! Thanks for pointing that out! Neale From neale@woozle.org Tue Nov 19 17:46:27 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 09:46:27 -0800 Subject: [Spambayes] hammie, pop3proxy, and persistent_use_database Message-ID: It seems like we're getting a fair amount of people using hammie who just want it to filter their mail. These folks, I am guessing, are just accepting the default values for things, assuming those must be a good place to start. Unfortunately, if you're running hammie out of procmail, the pickle method is going to start to get really slow as your training set gets larger. As fast as the pickler is, it's still having to slurp in the entire file every time you run it. I'm talking several orders of magnitude here. On the other hand, pop3proxy probably works best when using a pickle, since it starts up once and can score many emails. hammiesrv works similarly, but I don't think anyone is using that :) So, what would you say to moving the persistent_use_database option into per-service configuration? Specifically: Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.72 diff -u -r1.72 Options.py --- Options.py 18 Nov 2002 19:14:48 -0000 1.72 +++ Options.py 19 Nov 2002 17:44:16 -0000 @@ -348,10 +348,11 @@ # The default database path used by hammie persistent_storage_file: hammie.db -# hammie can use either a database (quick to score one message) or a pickle -# (quick to train on huge amounts of messages). Set this to True to use a -# database by default. 
-persistent_use_database: False +[hammiefilter] +# hammiefilter can use either a database (quick to score one message) or +# a pickle (quick to train on huge amounts of messages). Set this to +# True to use a database by default. +hammiefilter_persistent_use_database: False [pop3proxy] # pop3proxy settings - pop3proxy also respects the options in the # Hammie @@ -366,6 +367,7 @@ pop3proxy_spam_cache: pop3proxy-spam-cache pop3proxy_ham_cache: pop3proxy-ham-cache pop3proxy_unknown_cache: pop3proxy-unknown-cache +pop3proxy_persistent_use_database: False [html_ui] html_ui_port: 8880 @@ -440,6 +442,8 @@ 'hammie_debug_header': boolean_cracker, 'hammie_debug_header_name': string_cracker, }, + 'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker, + }, 'pop3proxy': {'pop3proxy_server_name': string_cracker, 'pop3proxy_server_port': int_cracker, 'pop3proxy_port': int_cracker, @@ -448,6 +452,7 @@ 'pop3proxy_spam_cache': string_cracker, 'pop3proxy_ham_cache': string_cracker, 'pop3proxy_unknown_cache': string_cracker, + 'pop3proxy_persistent_use_database': string_cracker, }, 'html_ui': {'html_ui_port': int_cracker, 'html_ui_launch_browser': boolean_cracker, From rob@hooft.net Tue Nov 19 17:47:02 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 19 Nov 2002 18:47:02 +0100 Subject: [Spambayes] Better optimization loop References: Message-ID: <3DDA7916.5010102@hooft.net> Tim Peters wrote: > [Rob Hooft, simplifying simplex] > >>... >>I decided that we have a perfect way to optimize the ham and spam >>cutoff values in timcv already, so that I can remove these from the >>simplex optimization. > > > Good observation! That should help. simplex isn't fast in the best of > cases, and in this case ... Anyone that has a faster optimization algorithm lying around is welcome to replace my Simplex code. >>To that goal I added a "delayed" flexcost to the CostCounter module >>that can use the optimal cutoffs calculated at the end of timcv.py. 
> > Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99 > and spam_cutoff of 0.995 to get rid of "impossible" FP. They are in any case better than any other alternative I could think of. But if you disagree, you can change the order in which the CostCounter.default() builds up the cost counters; the optimization always uses the last one. > It did a little better here too. The best-cost analyses show that it's also > nuking FP at the expense of unsures: > > base: > > -> best cost for all runs: $12.80 > -> achieved at 2 cutoff pairs > -> smallest ham & spam cutoffs 0.52 & 0.95 > -> fp 1; fn 1; unsure ham 2; unsure spam 7 > -> fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075% > -> largest ham & spam cutoffs 0.525 & 0.95 > -> fp 1; fn 1; unsure ham 2; unsure spam 7 > -> fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075% > > simp: > > -> best cost for all runs: $12.80 > -> best cost for all runs: $11.80 > -> achieved at ham & spam cutoffs 0.495 & 0.995 > -> fp 0; fn 0; unsure ham 10; unsure spam 49 > -> fp rate 0%; fn rate 0%; unsure rate 0.492% Very similar to my case. I'm seriously thinking about removing the "hopeless" and "almost hopeless" messages from my corpses. I agree with the bayesian statistics that they can't be correctly classified. > Ya, I reported that from a paper wrestling with boosting, but it's a common > observation. Even in simple settings! Say you're doing a least-squares > linear regression on this data: > > x f(x) > - ---- > 1 1.9 > 2 4.1 > 3 5.9 > 4 -10.0 > 5 10.1 > 6 12.1 > 7 13.8 > > If you throw out (4, -10), you get an excellent fit to everything that > remains. If you leave it in, you still get "an answer", but it's not a good > fit to anything. Press et al. report about a "robust fit", which is not a least squares but a least absolute deviates fit. It is insensitive to outliers. Is there an analog idea for us? > When I try a new thing, I usually start with several runs but on *much* less > data per run. 
If at least 3 of 5 show the effect I was hoping for, I may > push on; but if 3 of 5 don't, I either give up on it, or change the rules to > 4 of 7 (if I'm really in love with the idea ). These optimizations are very sensitive to step-functions, so I need lots of data to run them. With a small data set it will stop wherever you start it. Further results I obtained: My idea of running with an fp cost of $2 and a square cost function didn't work. It doesn't optimize to a consistent position. Increasing the cost of an fp back to $10 and running with the same square function did do a reasonable job; it optimized to: [Classifier] unknown_word_prob = 0.520415 minimum_prob_strength = 0.315104 unknown_word_strength = 0.215393 So the unknown_word_prob is now back to 0.5 again! I just committed my changes to the optimization code; any hints on improvements are welcome. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tdickenson@devmail.geminidataloggers.co.uk Tue Nov 19 18:07:54 2002 From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Tue, 19 Nov 2002 18:07:54 +0000 Subject: [Spambayes] hammie, pop3proxy, and persistent_use_database In-Reply-To: References: Message-ID: <200211191807.54767.tdickenson@devmail.geminidataloggers.co.uk> On Tuesday 19 November 2002 5:46 pm, Neale Pickett wrote: > hammiesrv works > similarly, but I don't think anyone is using that :) I'm using hammiesrv out of procmail, but I would be eager to change if I was the only one.
From neale@woozle.org Tue Nov 19 18:48:31 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 10:48:31 -0800 Subject: [Spambayes] hammie, pop3proxy, and persistent_use_database In-Reply-To: <200211191807.54767.tdickenson@devmail.geminidataloggers.co.uk> References: <200211191807.54767.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: So then, Toby Dickenson is all like: > On Tuesday 19 November 2002 5:46 pm, Neale Pickett wrote: > > > hammiesrv works > > similarly, but I don't think anyone is using that :) > > Im using hammiesrv out of procmail, but I would be eager to change if I was > the only one. Oh, neat! Well there's no need to remove it, although I think when the number of file names starting with "hammie" exceeds ten, I may get kicked off the project ;) I'm assuming it's working for you; is that a well-placed assumption? Neale From lists@morpheus.demon.co.uk Tue Nov 19 19:19:01 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Tue, 19 Nov 2002 19:19:01 +0000 Subject: [Spambayes] Offtopic - getting bounce messages for spam Message-ID: Sorry, this is offtopic, but I'm hoping that the concentration of spam experts on this group may be able to help me. I've just started receiving undeliverable message reports for spam, sent to people I've never heard of. It looks to me like someone is managing to impersonate me when they send spam out. I'm fairly sure I'm not running an open relay (is there a way of checking for certain?), so I guess someone is spoofing headers or something. I've heard of this sort of thing before, but never experienced this myself. Two questions, really: a) Is this something I should worry about (am I likely to end up on blacklists or the like)? b) What can I do about it in any case? Once again, apologies for this being offtopic, but I don't want to just glibly ignore it, and I wasn't sure where else to ask... Paul. 
-- This signature intentionally left blank From rob@hooft.net Tue Nov 19 19:35:58 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 19 Nov 2002 20:35:58 +0100 Subject: [Spambayes] Offtopic - getting bounce messages for spam References: Message-ID: <3DDA929E.1060003@hooft.net> Paul Moore wrote: > Sorry, this is offtopic, but I'm hoping that the concentration of spam > experts on this group may be able to help me. > > I've just started receiving undeliverable message reports for spam, > sent to people I've never heard of. Same here. Happened to me twice this week. I'm also worried about being flooded by this kind of messages.... Do I have to train my filter to consider this as spam? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From skip@pobox.com Tue Nov 19 19:42:25 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 19 Nov 2002 13:42:25 -0600 Subject: [Spambayes] Offtopic - getting bounce messages for spam In-Reply-To: References: Message-ID: <15834.37921.335242.13553@montanaro.dyndns.org> Paul> I've just started receiving undeliverable message reports for Paul> spam, sent to people I've never heard of. I awoke this morning to 350+ such messages in my unsure mailbox. All but a couple were bounces, like you said, for email addresses I didn't know. I scanned a few to see what they were, trained on a few, then modified an Emacs macro I use for bulk deletion of this sort of stuff. Having it sniff for "wipe out your credit card debt" and "yahoo.co.jp.webhosting_hotpicks" seemed to catch all of them. Paul> a) Is this something I should worry about (am I likely to end up Paul> on blacklists or the like)? I don't know. As far as I know that hasn't happened to mail.mojam.com. Paul> b) What can I do about it in any case? Besides train on enough of them so they are reliably caught as spam, I suspect there's little you can do. 
Skip From skip@pobox.com Tue Nov 19 19:58:55 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 19 Nov 2002 13:58:55 -0600 Subject: [Spambayes] CipherTrust? Message-ID: <15834.38911.435802.71321@montanaro.dyndns.org> This ad came at the head of my eWEEK Security mailing: CHOKING ON SPAM? Stop spam! -- Learn the TOP 10 Techniques To Control Spam. Spam used to be annoying, now it is a critical business problem. Reclaim your mail server. PROTECT YOUR EMAIL SYSTEM against spam and other threats before they reach your mail server(s). FREE White Paper shows you how! http://eletters1.ziffdavis.com/cgi-bin10/flo?y=eSyU0EWaTF0E4J0sQU0Ax Any idea what they do and how they do it? The link was to an information signup page. I didn't feel like asking for even more mail so I didn't submit. -- Skip Montanaro - skip@pobox.com http://www.mojam.com/ http://www.musi-cal.com/ From tim.one@comcast.net Tue Nov 19 19:59:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 19 Nov 2002 14:59:41 -0500 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right In-Reply-To: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: [Tim] > BTW, I gave up on my mistake-driven classifier experiment. I > kept getting several porn spam as Unsure every day, and got tired > of digging thru it. Now I'm training on each spam that doesn't > score 100, and each ham that doesn't score 0. ... [Toby Dickenson] > Why exclude spams that score 100 from training? Even these really spammy > spams might contain clues that would help to classify other more marginal > spam. Absolutely, but that's a different experiment. I've already done "proper" training and know it works great for me. These are experiments in doing silly training. A vast majority of spam scores 100 (on the Outlook client's 0..100 integer scale), and a vast majority of ham scores 0. 
Training on everything that doesn't score at an extreme is a less-extreme variant of mistake-based training, which is what real people, left to their own devices, are almost certainly going to do. I'm trying to get a feel for what the system does then. Purely mistake-based training with reasonable cutoff values turned out to work very well wrt the FN and FP rates, but not so well wrt the Unsure rate, and the Unsures remained surprising the entire time I tried it. While it wasn't prone to outright mistakes after the first day, the Unsures remained irritatingly obvious (to human eyes) after two weeks. Training on the 83 and 96 etc spam too appears to be fixing that rapidly. Curiously, I'm finding much less non-0 ham than non-100 spam (my training ratio on new msgs has gone from about 1:1 spam:ham (purely mistake-based) to about 11:1 spam to ham (training on all non-extremes)). From skip@pobox.com Tue Nov 19 20:09:32 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 19 Nov 2002 14:09:32 -0600 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right In-Reply-To: References: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: <15834.39548.393074.819295@montanaro.dyndns.org> Tim> [Toby Dickenson] >> Why exclude spams that score 100 from training? Even these really >> spammy spams might contain clues that would help to classify other >> more marginal spam. Tim> Absolutely, but that's a different experiment. I've already done Tim> "proper" training and know it works great for me. These are Tim> experiments in doing silly training. If you're taking notes on this in various files in CVS I wouldn't call it "silly training". How about "realistic training"?
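The training policies compared in this thread can be sketched as one small decision function. This is only an illustration in present-day Python, not code from the Spambayes tree: the name `should_train`, the strategy labels, and the cutoff values 20 and 90 are assumptions made for the example, and "mistakes" is read here as "any message that did not land on the correct side of its cutoff", which counts Unsures as mistakes.

```python
def should_train(score, is_spam, strategy, ham_cutoff=20, spam_cutoff=90):
    """Decide whether to train on a message scored on a 0..100 integer scale.

    All names and default cutoffs are hypothetical; they only illustrate
    the policies described in the thread.
    """
    if strategy == "all":
        # "Proper" training: every message is fed to the classifier.
        return True
    if strategy == "non_extreme":
        # Train on each spam that doesn't score 100
        # and each ham that doesn't score 0.
        return score != (100 if is_spam else 0)
    if strategy == "mistakes":
        # Train only when the message missed its own side of the cutoffs
        # (assumption: an Unsure counts as a miss).
        if is_spam:
            return score < spam_cutoff
        return score > ham_cutoff
    raise ValueError("unknown strategy: %r" % (strategy,))
```

Under these assumptions a spam scoring 96 is skipped by the "mistakes" policy but trained on by "non_extreme", which is the difference the 83-and-96 remark above is about.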
Skip From tim.one@comcast.net Tue Nov 19 20:29:22 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 19 Nov 2002 15:29:22 -0500 Subject: [Spambayes] Offtopic - getting bounce messages for spam In-Reply-To: Message-ID: [Paul Moore] > Sorry, this is offtopic, but I'm hoping that the concentration of spam > experts on this group may be able to help me. > > I've just started receiving undeliverable message reports for spam, > sent to people I've never heard of. It's even more fun when real people write to you demanding to be taken off your porn (whatever) list. > It looks to me like someone is managing to impersonate me when they > send spam out. Stare at the headers: it's usually a very shallow impersonation. For example, the Received headers are likely to point back to machines you've never heard of -- or even countries. > I'm fairly sure I'm not running an open relay (is there a way of > checking for certain?), Turn your machine off for a week and see if it stops . > so I guess someone is spoofing headers or something. I've > heard of this sort of thing before, but never experienced > this myself. > > Two questions, really: > > a) Is this something I should worry about (am I likely to end up on > blacklists or the like)? Not on a well-run blacklist. This kind of spoofing is common. > b) What can I do about it in any case? Nothing that I know of, unless you want to pee away hours digging thru the headers for clues. By the time you find the perpetrators (if ever), they will have moved on. 
your-email-address-is-just-a-string-of-characters-ly y'rs - tim From lists@morpheus.demon.co.uk Tue Nov 19 21:30:33 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Tue, 19 Nov 2002 21:30:33 +0000 Subject: [Spambayes] A kinder, gentler hammie References: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com> <079itu84n9lil7sqae4j9gge1sgppps34h@4ax.com> Message-ID: Richie Hindle writes: > Hi Paul, > >> The standard [Windows] environment variables which *can* be used for >> this sort of thing are >> >> 1. HOMEDRIVE and HOMEPATH - %HOMEDRIVE%%HOMEPATH% is basically the >> equivalent of Unix's $HOME. But for nearly all cases, these end >> up being C:\, which to my mind is a bad default. >> 2. USERPROFILE - %USERPROFILE% is a user-specific directory >> suitable for config information. But by default it's a directory >> with spaces in the name, which can be awkward for some >> purposes. It's also hard to navigate to in Windows explorer, >> which makes files stored there a little "hidden". > > Not true on 98: *sigh*. I forgot about Win98. > Having said that, I agree with this: > >> I think "try a number of pathnames" is a sensible approach. > > ...but is there a fallback that *always* works? I'm not sure > whether there is - is argv[0] guaranteed to work, even in frozen / > py2exe'd / Installer'd / cx_Frozen / Squeezed / etc. applications? I think there probably isn't. After all, you can't even guarantee that argv[0] is on a writable medium. :-( Paul. 
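The "try a number of pathnames" approach can be sketched as follows. This is a hedged illustration in present-day Python, not anything from the Spambayes code: the function names are hypothetical, the order of the candidates is an assumption, and `HOME` is added only as an assumed non-Windows candidate. As noted above, there may be no fallback that always works, so the sketch returns None when nothing is writable.

```python
import os
import tempfile


def candidate_dirs(environ=None, argv0=None):
    """Return config-directory candidates, most preferred first (assumed order)."""
    env = os.environ if environ is None else environ
    candidates = []
    if env.get("USERPROFILE"):
        candidates.append(env["USERPROFILE"])
    if env.get("HOMEDRIVE") and env.get("HOMEPATH"):
        candidates.append(env["HOMEDRIVE"] + env["HOMEPATH"])
    if env.get("HOME"):
        candidates.append(env["HOME"])
    if argv0:
        # Last resort: the directory the program was started from.
        candidates.append(os.path.dirname(os.path.abspath(argv0)))
    return candidates


def first_writable_dir(candidates):
    """Return the first candidate that exists and accepts a new file, else None."""
    for path in candidates:
        if not os.path.isdir(path):
            continue
        try:
            # Probe by really creating a file instead of trusting permission bits.
            fd, name = tempfile.mkstemp(dir=path)
        except OSError:
            continue
        os.close(fd)
        os.remove(name)
        return path
    return None
```

A caller would pass `candidate_dirs(argv0=sys.argv[0])` to `first_writable_dir` and treat a None result as "ask the user for a location".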
-- This signature intentionally left blank From mhammond@skippinet.com.au Tue Nov 19 21:59:46 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 20 Nov 2002 08:59:46 +1100 Subject: [Spambayes] Training In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619943@UKDCX001.uk.int.atosorigin.com> Message-ID: [Paul] > Refiling based on spam scores is a rare operation (I only noticed > this because I forgot to tick the "only unscored messages" checkbox > in the filter dialog), but the behaviour is annoying, as well as being > unnerving. I don't think there's anything which can be done at the > algorithm level (the algorithm is effectively saying "OK, I know > you're saying it's ham, but it still looks pretty odd to me...") but > at the client/user interface level, maybe there should be an extra > property "Trained", which says that this message has been specifically > confirmed as ham or spam, so that it won't get filtered. I'm not sure > how, or if, this would translate to other types of client. To be honest, I am less worried about "re-filtering" as that should be very rare. My concern is almost identical though - the *next* email that looks the same. Let's say I subscribe to a weekly newsletter. This week's comes in, gets marked as unsure, so I train. Next week's comes in - again, it trains as unsure. Repeat ad nauseam. I saw this a real lot when I had a high ham:spam imbalance - training had no obvious effect. I am still hoping to try Tim's new adjustment, but I wonder if somehow similar maths could be exploited. For example, manually training a message could be seen as "intense training", whereas a normal train is - well - normal. The point of manual training is that the system got it wrong, and the user wants to see the error stop. "normal" training is just giving the system fairly "general" instructions.
The only reason I mention this is because last time I mentioned something that demonstrated my ignorance, Tim promptly replied confirming it, then subsequently made the change anyway. Mark. From tim@fourstonesExpressions.com Tue Nov 19 21:56:52 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 19 Nov 2002 15:56:52 -0600 Subject: [Spambayes] proposed changes to hammie & co. Message-ID: Neale, I'm ok with these changes. I have more to make, but go ahead and make these alterations. Particularly, I've got a dbdict class that supports load/store, so we don't have to worry about training that blows up before we save nham and nspam. I think we should think about where the WordInfo class goes... I'm not sure I like having mode on the dbdict constructor, although I understand why you have it. No harm done, as it defaults anyway. I think we should take Bayes out of classifier and put it in Bayes.py. I like widict as a class, but it could be abstracted another notch by simply specifying the class to instantiate when you find a 'W' in the pickle, as an operand on the constructor. I'll wait for your checkin, and do some more work on the dbdict module to add my load/store stuff... Travelling, and taking a break from meeting... more when I get to the hotel. - TimS 11/18/2002 6:35:55 PM, Neale Pickett wrote: >okay, here's the big diff I was talking about. This would take all >hammie functionality out of hammie. So there would need to be yet >another hammie*.py file, a front-end to this new hammie class which acts >like the all-singing, all-dancing program that hammie is currently. > >This moves everything but the Hammie class out of hammie.py. DBDict >goes into its own module, which you could take out and use elsewhere if >you wanted. PersistentBayes goes away, replaced by the DBDictBayes >class in Bayes.py. I haven't had time to implement the rest of the >stuff yet, but that would be what'd go into the new front-end.
> >So the happy hammie family would then stand at: > > hammie.py > |-- hammiefilter.py > |-- pop3proxy.py > |-- hammiesrv.py > \-- hammie-new-front-end.py > >This change appears to work fine with hammiefilter and pop3proxy. But >it's a pretty big change, so I'd like to hear what at least Richie and >Tim Stone think before I commit anything. > >Neale > >? Outlook2000 >? diff >? email >? hammiebatch.py >Index: Bayes.py >=================================================================== >RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v >retrieving revision 1.5 >diff -u -r1.5 Bayes.py >--- Bayes.py 18 Nov 2002 13:04:20 -0000 1.5 >+++ Bayes.py 19 Nov 2002 00:24:57 -0000 >@@ -56,11 +56,10 @@ > all the spambayes contributors." > > import Corpus >-from classifier import Bayes >+import classifier > from Options import options >-from hammie import DBDict # hammie only for DBDict, which should >- # probably really be somewhere else > import cPickle as pickle >+import dbdict > import errno > import copy > import anydbm >@@ -69,7 +68,7 @@ > NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training > UPDATEPROBS = True # Probabilities will be autoupdated with training > >-class PersistentBayes(Bayes): >+class PersistentBayes(classifier.Bayes): > '''Persistent Bayes database object''' > > def __init__(self, db_name): >@@ -169,12 +168,49 @@ > self.wordinfo, self.nspam, self.nham = t[1:] > > >+class WIDict(dbdict.DBDict): >+ """DBDict optimized for holding lots of WordInfo objects. >+ >+ Normally, the pickler can figure out that you're pickling the same >+ type thing over and over, and will just tag the type with a new >+ byte, thus reducing Administrative Pickle Bloat(R). Since the >+ DBDict continually creates new picklers, however, nothing ever gets >+ the chance to do this optimization. >+ >+ The WIDict class forces this optimization by stealing the >+ (currently) unused 'W' pickle type for WordInfo objects. 
This >+ results in about a 50% reduction in database size. >+ >+ """ >+ >+ def __getitem__(self, key): >+ v = self.hash[key] >+ if v[0] == 'W': >+ val = pickle.loads(v[1:]) >+ # We could be sneaky, like pickle.Unpickler.load_inst, >+ # but I think that's overly confusing. >+ obj = classifier.WordInfo(0) >+ obj.__setstate__(val) >+ return obj >+ else: >+ return pickle.loads(v) >+ >+ def __setitem__(self, key, val): >+ if isinstance(val, classifier.WordInfo): >+ val = val.__getstate__() >+ v = 'W' + pickle.dumps(val, 1) >+ else: >+ v = pickle.dumps(val, 1) >+ self.hash[key] = v >+ >+ > class DBDictBayes(PersistentBayes): > '''Bayes object persisted in a hammie.DB_Dict''' > >- def __init__(self, db_name): >+ def __init__(self, db_name, mode='c'): > '''Constructor(database name)''' > >+ self.mode = mode > self.db_name = db_name > self.statekey = "saved state" > >@@ -186,7 +222,8 @@ > if Corpus.Verbose: > print 'Loading state from',self.db_name,'DB_Dict' > >- self.wordinfo = DBDict(self.db_name, 'c') >+ self.wordinfo = WIDict(self.db_name, self.mode, >+ iterskip=[self.statekey]) > > if self.wordinfo.has_key(self.statekey): > >@@ -216,7 +253,7 @@ > > def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS): > '''Constructor(Bayes, \ >- Corpus.SPAM|Corpus.HAM), updprobs(True|False)''' >+ Corpus.SPAM|Corpus.HAM), updprobs(True|False)''' > > self.bayes = bayes > self.trainertype = trainertype >@@ -286,4 +323,4 @@ > > > if __name__ == '__main__': >- print >>sys.stderr, __doc__ >\ No newline at end of file >+ print >>sys.stderr, __doc__ >Index: dbdict.py >=================================================================== >RCS file: dbdict.py >diff -N dbdict.py >--- /dev/null 1 Jan 1970 00:00:00 -0000 >+++ dbdict.py 19 Nov 2002 00:24:57 -0000 >@@ -0,0 +1,92 @@ >+#! /usr/bin/env python >+ >+from __future__ import generators >+import dbhash >+try: >+ import cPickle as pickle >+except ImportError: >+ import pickle >+ >+class DBDict: >+ """Database Dictionary. 
>+ >+ This wraps a dbhash database to make it look even more like a >+ dictionary, much like the built-in shelf class. The difference is >+ that a DBDict supports all dict methods. >+ >+ Call it with the database. Optionally, you can specify a list of >+ keys to skip when iterating. This only affects iterators; things >+ like .keys() still list everything. For instance: >+ >+ >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo')) >+ >>> d['skipme'] = 'booga' >+ >>> d['countme'] = 'wakka' >+ >>> print d.keys() >+ ['skipme', 'countme'] >+ >>> for k in d.iterkeys(): >+ ... print k >+ countme >+ >+ """ >+ >+ def __init__(self, dbname, mode, iterskip=()): >+ self.hash = dbhash.open(dbname, mode) >+ self.iterskip = iterskip >+ >+ def __getitem__(self, key): >+ return pickle.loads(self.hash[key]) >+ >+ def __setitem__(self, key, val): >+ self.hash[key] = pickle.dumps(val, 1) >+ >+ def __delitem__(self, key, val): >+ del(self.hash[key]) >+ >+ def __iter__(self, fn=None): >+ k = self.hash.first() >+ while k != None: >+ key = k[0] >+ val = self.__getitem__(key) >+ if key not in self.iterskip: >+ if fn: >+ yield fn((key, val)) >+ else: >+ yield (key, val) >+ try: >+ k = self.hash.next() >+ except KeyError: >+ break >+ >+ def __contains__(self, name): >+ return self.has_key(name) >+ >+ def __getattr__(self, name): >+ # Pass the buck >+ return getattr(self.hash, name) >+ >+ def get(self, key, dfl=None): >+ if self.has_key(key): >+ return self[key] >+ else: >+ return dfl >+ >+ def iteritems(self): >+ return self.__iter__() >+ >+ def iterkeys(self): >+ return self.__iter__(lambda k: k[0]) >+ >+ def itervalues(self): >+ return self.__iter__(lambda k: k[1]) >+ >+open = DBDict >+ >+def _test(): >+ import doctest >+ import dbdict >+ >+ doctest.testmod(dbdict) >+ >+if __name__ == '__main__': >+ _test() >+ >Index: hammie.py >=================================================================== >RCS file: /cvsroot/spambayes/spambayes/hammie.py,v >retrieving revision 1.40 >diff -u 
-r1.40 hammie.py >--- hammie.py 18 Nov 2002 18:13:54 -0000 1.40 >+++ hammie.py 19 Nov 2002 00:24:57 -0000 >@@ -1,57 +1,11 @@ > #! /usr/bin/env python > >-# A driver for the classifier module and Tim's tokenizer that you can >-# call from procmail. >- >-"""Usage: %(program)s [options] >- >-Where: >- -h >- show usage and exit >- -g PATH >- mbox or directory of known good messages (non-spam) to train on. >- Can be specified more than once, or use - for stdin. >- -s PATH >- mbox or directory of known spam messages to train on. >- Can be specified more than once, or use - for stdin. >- -u PATH >- mbox of unknown messages. A ham/spam decision is reported for each. >- Can be specified more than once. >- -r >- reverse the meaning of the check (report ham instead of spam). >- Only meaningful with the -u option. >- -p FILE >- use file as the persistent store. loads data from this file if it >- exists, and saves data to this file at the end. >- Default: %(DEFAULTDB)s >- -d >- use the DBM store instead of cPickle. The file is larger and >- creating it is slower, but checking against it is much faster, >- especially for large word databases. Default: %(USEDB)s >- -D >- the reverse of -d: use the cPickle instead of DBM >- -f >- run as a filter: read a single message from stdin, add an >- %(DISPHEADER)s header, and write it to stdout. If you want to >- run from procmail, this is your option. 
>-""" >- >-from __future__ import generators >- >-import sys >-import os >-import types >-import getopt >-import mailbox >-import glob >-import email >-import errno >-import anydbm >-import cPickle as pickle > >+import dbdict > import mboxutils >-import classifier >+import Bayes > from Options import options >+from tokenizer import tokenize > > try: > True, False >@@ -60,166 +14,14 @@ > True, False = 1, 0 > > >-program = sys.argv[0] # For usage(); referenced by docstring above >- >-# Name of the header to add in filter mode >-DISPHEADER = options.hammie_header_name >-DEBUGHEADER = options.hammie_debug_header_name >-DODEBUG = options.hammie_debug_header >- >-# Default database name >-DEFAULTDB = options.persistent_storage_file >- >-# Probability at which a message is considered spam >-SPAM_THRESHOLD = options.spam_cutoff >-HAM_THRESHOLD = options.ham_cutoff >- >-# Probability limit for a clue to be added to the DISPHEADER >-SHOWCLUE = options.clue_mailheader_cutoff >- >-# Use a database? If False, use a pickle >-USEDB = options.persistent_use_database >- >-# Tim's tokenizer kicks far more booty than anything I would have >-# written. Score one for analysis ;) >-from tokenizer import tokenize >- >-class DBDict: >- >- """Database Dictionary. >- >- This wraps an anydbm to make it look even more like a dictionary. >- >- Call it with the name of your database file. Optionally, you can >- specify a list of keys to skip when iterating. This only affects >- iterators; things like .keys() still list everything. For instance: >- >- >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo')) >- >>> d['skipme'] = 'booga' >- >>> d['countme'] = 'wakka' >- >>> print d.keys() >- ['skipme', 'countme'] >- >>> for k in d.iterkeys(): >- ... 
print k >- countme >- >- """ >- >- def __init__(self, dbname, mode, iterskip=()): >- self.hash = anydbm.open(dbname, mode) >- self.iterskip = iterskip >- >- def __getitem__(self, key): >- v = self.hash[key] >- if v[0] == 'W': >- val = pickle.loads(v[1:]) >- # We could be sneaky, like pickle.Unpickler.load_inst, >- # but I think that's overly confusing. >- obj = classifier.WordInfo(0) >- obj.__setstate__(val) >- return obj >- else: >- return pickle.loads(v) >- >- def __setitem__(self, key, val): >- if isinstance(val, classifier.WordInfo): >- val = val.__getstate__() >- v = 'W' + pickle.dumps(val, 1) >- else: >- v = pickle.dumps(val, 1) >- self.hash[key] = v >- >- def __delitem__(self, key, val): >- del(self.hash[key]) >- >- def __iter__(self, fn=None): >- k = self.hash.first() >- while k != None: >- key = k[0] >- val = self.__getitem__(key) >- if key not in self.iterskip: >- if fn: >- yield fn((key, val)) >- else: >- yield (key, val) >- try: >- k = self.hash.next() >- except KeyError: >- break >- >- def __contains__(self, name): >- return self.has_key(name) >- >- def __getattr__(self, name): >- # Pass the buck >- return getattr(self.hash, name) >- >- def get(self, key, dfl=None): >- if self.has_key(key): >- return self[key] >- else: >- return dfl >- >- def iteritems(self): >- return self.__iter__() >- >- def iterkeys(self): >- return self.__iter__(lambda k: k[0]) >- >- def itervalues(self): >- return self.__iter__(lambda k: k[1]) >- >- >-class PersistentBayes(classifier.Bayes): >- >- """A persistent Bayes classifier. >- >- This is just like classifier.Bayes, except that the dictionary is a >- database. You take less disk this way and you can pretend it's >- persistent. The tradeoffs vs. a pickle are: 1. it's slower >- training, but faster checking, and 2. it needs less memory to run, >- but takes more space on the hard drive. >+class Hammie: >+ """A spambayes mail filter. > >- On destruction, an instantiation of this class will write its state >- to a special key. 
When you instantiate a new one, it will attempt >- to read these values out of that key again, so you can pick up where >- you left off. >+ This implements the basic functionality needed to score, filter, or >+ train. > > """ > >- # XXX: Would it be even faster to remember (in a list) which keys >- # had been modified, and only recalculate those keys? No sense in >- # going over the entire word database if only 100 words are >- # affected. >- >- # XXX: Another idea: cache stuff in memory. But by then maybe we >- # should just use ZODB. >- >- def __init__(self, dbname, mode): >- classifier.Bayes.__init__(self) >- self.statekey = "saved state" >- self.wordinfo = DBDict(dbname, mode, (self.statekey,)) >- self.dbmode = mode >- >- self.restore_state() >- >- def __del__(self): >- #super.__del__(self) >- self.save_state() >- >- def save_state(self): >- if self.dbmode != 'r': >- self.wordinfo[self.statekey] = (self.nham, self.nspam) >- >- def restore_state(self): >- if self.wordinfo.has_key(self.statekey): >- self.nham, self.nspam = self.wordinfo[self.statekey] >- >- >-class Hammie: >- >- """A spambayes mail filter""" >- > def __init__(self, bayes): > self.bayes = bayes > >@@ -262,9 +64,9 @@ > import traceback > traceback.print_exc() > >- def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD, >- ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER, >- debug=DODEBUG): >+ def filter(self, msg, header=None, spam_cutoff=None, >+ ham_cutoff=None, debugheader=None, >+ debug=None): > """Score (judge) a message and add a disposition header. > > msg can be a string, a file object, or a Message object. 
>@@ -282,6 +84,17 @@ > > """ > >+ if header == None: >+ header = options.hammie_header_name >+ if spam_cutoff == None: >+ spam_cutoff = options.spam_cutoff >+ if ham_cutoff == None: >+ ham_cutoff = options.ham_cutoff >+ if debugheader == None: >+ debugheader = options.hammie_debug_header_name >+ if debug == None: >+ debug = options.hammie_debug_header >+ > msg = mboxutils.get_message(msg) > try: > del msg[header] >@@ -348,163 +161,47 @@ > > self.train(msg, True) > >- def update_probabilities(self): >+ def update_probabilities(self, store=True): > """Update probability values. > > You would want to call this after a training session. It's > pretty slow, so if you have a lot of messages to train, wait > until you're all done before calling this. > >+ Unless store is false, the peristent store will be written after >+ updating probabilities. >+ > """ > > self.bayes.update_probabilities() >+ if store: >+ self.store() > >+ def store(self): >+ """Write out the persistent store. >+ >+ This makes sure the persistent store reflects what is currently >+ in memory. You would want to do this after a write and before >+ exiting. >+ >+ """ >+ >+ self.bayes.store() >+ >+ >+def open(filename, usedb=True, mode='r'): >+ """Open a file, returning a Hammie instance. >+ >+ If usedb is False, open as a pickle instead of a DBDict. mode is >+ >+ used as the flag to open DBDict objects. 'c' for read-write (create >+ if needed), 'r' for read-only, 'w' for read-write. >+ >+ """ > >-def train(hammie, msgs, is_spam): >- """Train bayes with all messages from a mailbox.""" >- mbox = mboxutils.getmbox(msgs) >- i = 0 >- for msg in mbox: >- i += 1 >- # XXX: Is the \r a Unixism? I seem to recall it working in DOS >- # back in the day. Maybe it's a line-printer-ism ;) >- sys.stdout.write("\r%6d" % i) >- sys.stdout.flush() >- hammie.train(msg, is_spam) >- print >- >-def score(hammie, msgs, reverse=0): >- """Score (judge) all messages from a mailbox.""" >- # XXX The reporting needs work! 
>- mbox = mboxutils.getmbox(msgs) >- i = 0 >- spams = hams = 0 >- for msg in mbox: >- i += 1 >- prob, clues = hammie.score(msg, True) >- if hasattr(msg, '_mh_msgno'): >- msgno = msg._mh_msgno >- else: >- msgno = i >- isspam = (prob >= SPAM_THRESHOLD) >- if isspam: >- spams += 1 >- if not reverse: >- print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), >- print hammie.formatclues(clues) >- else: >- hams += 1 >- if reverse: >- print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), >- print hammie.formatclues(clues) >- return (spams, hams) >- >-def createbayes(pck=DEFAULTDB, usedb=False, mode='r'): >- """Create a Bayes instance for the given pickle (which >- doesn't have to exist). Create a PersistentBayes if >- usedb is True.""" > if usedb: >- bayes = PersistentBayes(pck, mode) >+ b = Bayes.DBDictBayes(filename, mode) > else: >- bayes = None >- try: >- fp = open(pck, 'rb') >- except IOError, e: >- if e.errno <> errno.ENOENT: raise >- else: >- bayes = pickle.load(fp) >- fp.close() >- if bayes is None: >- bayes = classifier.Bayes() >- return bayes >- >-def usage(code, msg=''): >- """Print usage message and sys.exit(code).""" >- if msg: >- print >> sys.stderr, msg >- print >> sys.stderr >- print >> sys.stderr, __doc__ % globals() >- sys.exit(code) >- >-def main(): >- """Main program; parse options and go.""" >- try: >- opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r') >- except getopt.error, msg: >- usage(2, msg) >- >- if not opts: >- usage(2, "No options given") >- >- pck = DEFAULTDB >- good = [] >- spam = [] >- unknown = [] >- reverse = 0 >- do_filter = False >- usedb = USEDB >- mode = 'r' >- for opt, arg in opts: >- if opt == '-h': >- usage(0) >- elif opt == '-g': >- good.append(arg) >- mode = 'c' >- elif opt == '-s': >- spam.append(arg) >- mode = 'c' >- elif opt == '-p': >- pck = arg >- elif opt == "-d": >- usedb = True >- elif opt == "-D": >- usedb = False >- elif opt == "-f": >- do_filter = True >- elif opt == '-u': >- unknown.append(arg) 
>- elif opt == '-r': >- reverse = 1 >- if args: >- usage(2, "Positional arguments not allowed") >- >- save = False >- >- bayes = createbayes(pck, usedb, mode) >- h = Hammie(bayes) >- >- for g in good: >- print "Training ham (%s):" % g >- train(h, g, False) >- save = True >- >- for s in spam: >- print "Training spam (%s):" % s >- train(h, s, True) >- save = True >- >- if save: >- h.update_probabilities() >- if not usedb and pck: >- fp = open(pck, 'wb') >- pickle.dump(bayes, fp, 1) >- fp.close() >- >- if do_filter: >- msg = sys.stdin.read() >- filtered = h.filter(msg) >- sys.stdout.write(filtered) >- >- if unknown: >- (spams, hams) = (0, 0) >- for u in unknown: >- if len(unknown) > 1: >- print "Scoring", u >- s, g = score(h, u, reverse) >- spams += s >- hams += g >- print "Total %d spam, %d ham" % (spams, hams) >- >+ b = Bayes.PickledBayes(filename) >+ return Hammie(b) > >-if __name__ == "__main__": >- main() >Index: hammiefilter.py >=================================================================== >RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v >retrieving revision 1.2 >diff -u -r1.2 hammiefilter.py >--- hammiefilter.py 18 Nov 2002 18:14:04 -0000 1.2 >+++ hammiefilter.py 19 Nov 2002 00:24:57 -0000 >@@ -51,43 +51,37 @@ > print >> sys.stderr, __doc__ % globals() > sys.exit(code) > >-def jar_pickle(h): >- if not options.persistent_use_database: >- import pickle >- fp = open(options.persistent_storage_file, 'wb') >- pickle.dump(h.bayes, fp, 1) >- fp.close() >- >- >-def hammie_open(mode): >- b = hammie.createbayes(options.persistent_storage_file, >- options.persistent_use_database, >- mode) >- return hammie.Hammie(b) >- > def newdb(): >- h = hammie_open('n') >- jar_pickle(h) >+ h = hammie.open(options.persistent_storage_file, >+ options.persistent_use_database, >+ 'n') >+ h.store() > print "Created new database in", options.persistent_storage_file > > def filter(): >- h = hammie_open('r') >+ h = hammie.open(options.persistent_storage_file, >+ 
options.persistent_use_database, >+ 'r') > msg = sys.stdin.read() > print h.filter(msg) > > def train_ham(): >- h = hammie_open('w') >+ h = hammie.open(options.persistent_storage_file, >+ options.persistent_use_database, >+ 'w') > msg = sys.stdin.read() > h.train_ham(msg) > h.update_probabilities() >- jar_pickle(h) >+ h.store() > > def train_spam(): >- h = hammie_open('w') >+ h = hammie.open(options.persistent_storage_file, >+ options.persistent_use_database, >+ 'w') > msg = sys.stdin.read() > h.train_spam(msg) > h.update_probabilities() >- jar_pickle(h) >+ h.store() > > def main(): > action = filter > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From bkc@murkworks.com Tue Nov 19 22:17:34 2002 From: bkc@murkworks.com (Brad Clements) Date: Tue, 19 Nov 2002 17:17:34 -0500 Subject: [Spambayes] re: ciphertrust Message-ID: <3DDA711E.25949.343B3AFE@localhost> http://www.ciphertrust.com/ironmail/anti-spam.htm Is this it? I can't really understand it through all the marketing speak. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From francois.granger@free.fr Tue Nov 19 22:25:32 2002 From: francois.granger@free.fr (François Granger) Date: Tue, 19 Nov 2002 23:25:32 +0100 Subject: [Spambayes] Training In-Reply-To: References: Message-ID: At 8:59 +1100 20/11/02, in message RE: [Spambayes] Training, Mark Hammond wrote: >The only reason I mention this is because last time I mentioned something >that demonstrated my ignorance, Tim promptly replied confirming it, then >subsequently made the change anyway. So, I am not the only one! ;-) -- Email is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies.
For clean emails: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html From skip@pobox.com Tue Nov 19 22:33:53 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 19 Nov 2002 16:33:53 -0600 Subject: [Spambayes] re: ciphertrust In-Reply-To: <3DDA711E.25949.343B3AFE@localhost> References: <3DDA711E.25949.343B3AFE@localhost> Message-ID: <15834.48209.376350.996825@montanaro.dyndns.org> Brad> http://www.ciphertrust.com/ironmail/anti-spam.htm Brad> Is this it? I can't really understand it through all the marketing Brad> speak. I suspect if there's any content it will be in the white paper which you need to register to get. Must be an expensive solution if they won't tell you anything about it without getting your vital (sales) statistics. Skip From tim.one@comcast.net Tue Nov 19 22:39:37 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 19 Nov 2002 17:39:37 -0500 Subject: [Spambayes] Training In-Reply-To: Message-ID: [Mark Hammond] > ... > My concern is almost identical though - the *next* email that looks the > same. Let's say I subscribe to a weekly newsletter. This week's comes in, > gets marked as unsure, so I train. Next week's comes in - again, it trains > as unsure. Repeat ad nauseam. > > I saw this a real lot when I had a high ham:spam imbalance - > training had no obvious effect. Confounding this, though, there were glitches in the Outlook client back then that prevented retraining and/or rescoring from working as intended. > I am still hoping to try Tim's new adjustment, Note that it's already enabled in the Outlook client (but not in the general codebase yet) -- the first time you do anything that recomputes the probabilities, it will kick in with full force. That's actually going to make the described problem worse: when you have a lot more ham than spam, the effect of the adjustment is to make everything "less hammy" than it was.
This should help a lot when training on spam, but makes training on ham *less* effective than it was. In effect, it's saying that training on new ham is much less valuable than training on new spam, because you already have way more of the former. > but I wonder if somehow similar maths could be exploited. For example, > manually training a message could be seen as "intense training", whereas a > normal train is - well - normal. The point of manual training is that the > system got it wrong, and the user wants to see the error stop. "normal" > training is just giving the system fairly "general" instructions. You could feed a msg into training more than once as ham (or spam). The classifier doesn't know the difference between training on a single msg N times, and training on N different msgs. We could even feed the msg in, in a loop, until the score went out of Unsure territory. That would be novel -- picture the effects on the system if I were to do this with my Nigerian-scam quote. Brrr! But no matter how we cut this, so long as there's more of one kind of data than the other, the class with the lesser amount of data is the one that limits potential accuracy. > The only reason I mention this is because last time I mentioned something > that demonstrated my ignorance, Tim promptly replied confirming it, then > subsequently made the change anyway. Familiar patterns are such a comfort to us all. From rob@hooft.net Tue Nov 19 22:41:14 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 19 Nov 2002 23:41:14 +0100 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right References: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk> <15834.39548.393074.819295@montanaro.dyndns.org> Message-ID: <3DDABE0A.9090409@hooft.net> Skip Montanaro wrote: > Tim> [Toby Dickenson] > >> Why exclude spams that score 100 from training? Even these really > >> spammy spams might contain clues that would help to classify other > >> more marginal spam.
> > Tim> Absolutely, but that's a different experiment. I've already done > Tim> "proper" training and know it works great for me. These are > Tim> experiments in doing silly training. > > If you're taking notes on this in various files in CVS I wouldn't call it > "silly training". How about "realistic training"? Why realistic? Minimalistic? I've seen my favorite being discussed, but I'd like to see more statistics on it: only train on all ham/spam messages automatically without any user interaction after an initial training phase of minimally 10-30 messages. This should automatically adapt to gradual changes. If this would really work, it would be my realistic variant... Integration into the MUA could only make it better. Hm. I just adapted weaktest to be a bit more flexible, such that all these strategies can be tested. There are four new flags to the weaktest.py program: -d : selects the "decisionmaker"; i.e. the strategy used to decide whether a message is trained on. There is a choice between: all : train on all messages allbut0and100 : train on all spam < 0.995 and ham >0.005 unsureandfalses : train on Unsure and fp/fn only unsureonly : train on Unsure only. -u : selects the "update strategy". always : updates counts after every trained message sometimes : trains every 10th -m int : uses the first "int" messages for training only (default 10) -v : increases verbosity. I'm open to ideas (and results). Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From bkc@murkworks.com Tue Nov 19 22:45:43 2002 From: bkc@murkworks.com (Brad Clements) Date: Tue, 19 Nov 2002 17:45:43 -0500 Subject: [Spambayes] re: ciphertrust In-Reply-To: <15834.48209.376350.996825@montanaro.dyndns.org> References: <3DDA711E.25949.343B3AFE@localhost> Message-ID: <3DDA77B8.19706.345501F9@localhost> On 19 Nov 2002 at 16:33, Skip Montanaro wrote: > I suspect if there's any content it will be in the white paper which you > need to register to get. 
Must be an expensive solution if they won't tell > you anything about it without getting your vital (sales) statistics. I've registered for it. Probably can't pass it on but I'll summarize. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From noreply@sourceforge.net Tue Nov 19 22:43:30 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 19 Nov 2002 14:43:30 -0800 Subject: [Spambayes] [ spambayes-Patches-639312 ] fix for outlook CompareEntryIDs bug Message-ID: Patches item #639312, was opened at 2002-11-16 23:35 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Piers Haken (piersh) >Assigned to: Mark Hammond (mhammond) Summary: fix for outlook CompareEntryIDs bug Initial Comment: This patch reenables the CompareEntryIDs for comparing folder IDs. It passes both the MAPI Session and the Outlook Session into the dialog, one for retrieving the exchange-compatible IDs and the other for comparing them.
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702 From noreply@sourceforge.net Tue Nov 19 22:44:07 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 19 Nov 2002 14:44:07 -0800 Subject: [Spambayes] [ spambayes-Patches-639310 ] fix for outlook 'spam' field Message-ID: Patches item #639310, was opened at 2002-11-16 23:32 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Piers Haken (piersh) >Assigned to: Mark Hammond (mhammond) Summary: fix for outlook 'spam' field Initial Comment: 1) firstly it changes the Class of the 'Spam' field to olPercent, which I believe is much more appropriate than olCombination. The problem with olCombination is that you have to manually change the field type in outlook in order to get anything to show up. With olPercent, the column shows up with a nice '%' sign which makes it more obvious what the number actually means. 2) secondly it adds a checkbox 'Update spam scores' to the training dialog. Checking this box causes the trainer to update the spam field for ALL messages in your training folders (in a second pass, if necessary). This means that ALL messages in your inbox have an entry in that field, not just those that arrived since you installed the plugin. This was a huge win for me since it allowed me to sort by the spam field and throw away about 20 spams from my inbox that I had missed during my initial manual pruning. The only issue here is that in order for this to work right, you'll have to manually delete your existing spam fields, restart outlook and then 'rescore'. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702 From lists@morpheus.demon.co.uk Tue Nov 19 23:18:30 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Tue, 19 Nov 2002 23:18:30 +0000 Subject: [Spambayes] Training References: <16E1010E4581B049ABC51D4975CEDB88619943@UKDCX001.uk.int.atosorigin.com> Message-ID: "Mark Hammond" writes: > My concern is almost identical though - the *next* email that looks the > same. Let's say I subscribe to a weekly newsletter. This week's comes in, > gets marked as unsure, so I train. Next week's comes in - again, it trains > as unsure. Repeat ad nauseam. Good point. That would be *really* annoying after a while. > I saw this a real lot when I had a high ham:spam imbalance - training had no > obvious effect. This happened to me today, with Tim's new adjustment switched on, with a 10:1 ham:spam imbalance. IIRC, Tim's change means that with this sort of imbalance, ham clues will only have 10% of their normal effect, so saying "This is ham" will be pretty much ignored :-( > I am still hoping to try Tim's new adjustment, but I wonder if > somehow similar maths could be exploited. For example, manually > training a message could be seen as "intense training", whereas a > normal train is - well - normal. The point of manual training is > that the system got it wrong, and the user wants to see the error > stop. "normal" training is just giving the system fairly "general" > instructions. I'm not sure. All training is basically saying "these specific messages *are* ham/spam". Whether this is done in bulk, or on an individual basis, shouldn't matter. A naive view says that therefore trained messages will score 0/100 "by definition". But the maths doesn't work like that, and nothing is going to make it.
But I think it's a reasonable assumption that any messages which have been explicitly trained will no longer hit the "unsure" range. I just can't see a way of making even that assumption be true. Paul. -- This signature intentionally left blank From neale@woozle.org Tue Nov 19 23:39:45 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 15:39:45 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: References: Message-ID: So then, Tim Stone - Four Stones Expressions is all like: > I'll wait for your checkin, and do some more work on the dbdict module > to add my load/store stuff... Okay Tim, I'll tell you what. I'm going to create a branch and check in everything I've got. I'm branching because what I have right now breaks some existing functionality. In the branch, we can play around with moving things out of the classifier, moving options, etc. When we get something that we think is stable, and everyone else okays it, we can merge it all back in to HEAD. I've called the branch "hammie-playground". To get to it, just $ cvs update -r hammie-playground The branch need not be around for a long time, just long enough to work out all these changes. Neale From skip@pobox.com Tue Nov 19 23:40:08 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 19 Nov 2002 17:40:08 -0600 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right In-Reply-To: <3DDABE0A.9090409@hooft.net> References: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk> <15834.39548.393074.819295@montanaro.dyndns.org> <3DDABE0A.9090409@hooft.net> Message-ID: <15834.52184.806714.753216@montanaro.dyndns.org> Tim> Absolutely, but that's a different experiment. I've already done Tim> "proper" training and know it works great for me. These are Tim> experiments in doing silly training. Skip> If you're taking notes on this in various files in CVS I wouldn't Skip> call it "silly training". How about "realistic training"? Rob> Why realistic? 
Minimalistic? Realistic in the sense that the sort of training Tim is trying now probably mimics what you can expect from average users over time. You can't expect people to always train on everything. Even with a slick user interface that won't be much better than just hitting "delete" for each spam. You have to assume people are going to be gung-ho at the beginning, then taper off when either performance gets good enough or the novelty wears off. One stop on the way to not training at all is to only train on FPs, FNs and unsures. Maybe "real world" is a better term than "realistic". Skip From richie@entrian.com Tue Nov 19 23:59:26 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 19 Nov 2002 23:59:26 +0000 Subject: [Spambayes] Training from scratch Message-ID: I started a new database from scratch yesterday morning at work, and trained it via the web interface as the messages arrived. Courtesy of the shiny new pop3graph.py (as yet uncommitted), this is how it behaved over the first 36 hours: [ASCII plot not recoverable from the archive. Legend: "." = number of messages over time, "*" = number of correctly classified messages over time. The vertical axis is labelled 99 at the top of the "." curve and 74 at the top of the "*" curve.] (that should really plot the derivative of the second line as well, but you can see that it very quickly got close to parallel with the total number). This is utterly unscientific I know, but very encouraging. Not one of the misclassifications was an FP! Though that's probably down to the fact that most of the early messages I trained it on were hams. This could be worth bearing in mind when thinking about training strategies (if I'm right) - since FPs are more damaging than FNs, maybe people should be encouraged (forced?)
to train on a bunch of hams before any spams. -- Richie Hindle richie@entrian.com From richie@entrian.com Tue Nov 19 23:59:53 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 19 Nov 2002 23:59:53 +0000 Subject: [Spambayes] Offtopic - getting bounce messages for spam In-Reply-To: References: Message-ID: <5liltuk2j7o5uc0l85s97cdq1890jvjgsi@4ax.com> Hi Paul, > I've just started receiving undeliverable message reports for spam, > sent to people I've never heard of. [...] > What can I do about it in any case? This happened to me last year. I received 28,000 bounce emails in a period of about two weeks. I wrote a web-based POP3 gateway that lets you batch-delete messages using regular expressions: http://entrian.com/cgi-bin/pop3.py This is very dangerous - you can wipe all your emails by abusing it, or simply by misunderstanding it. It also passes your POP3 password in plain text across the internet, if that worries you. And it may be buggy. And if anyone can think of any other reasons why people shouldn't use it, please post them. But when your email account is rendered completely useless by all this, it will be a godsend. 8-) -- Richie Hindle richie@entrian.com From neale@woozle.org Wed Nov 20 00:02:58 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 16:02:58 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: References: Message-ID: I just realized that I failed to respond to your points :) So then, Tim Stone - Four Stones Expressions is all like: > Neale, I'm ok with these changes. I have more to make, but go ahead > and make these alterations. Particularly, I've got a dbdict class > that supports load/store, so we don't have to worry about training > that blows up before we save nham and nspam. I'm curious about how you're doing this. I briefly had a DBDict which cached anything you tried to write to it, but it didn't seem like an improvement so I dropped it, figuring ZODB was probably a better solution. 
> I think we should think about where WordInfo class goes... That's rather unorthodox. Why? > I'm not sure I like having mode on the dbdict constructor, although I > understand why you have it. No harm done, as it defaults anyway. I'm not sure I like it either, but I didn't know where else to put it. If you think of a better solution, feel free to change it. > I think we should take Bayes out of classifier and put it in Bayes.py Now that's downright heretical! ;) It makes sense, I think, Bayes.py being where all the Bayes stuff hangs out. But if you take WordInfo out of classifier, and you take Bayes out of classifier, all you'll have left is two constants. Maybe you just want to rename classifier.py. I wonder what the other Tim thinks about this idea... > I like widict as a class, but it could be abstracted another notch by > simply specifying the class to instantiate when you find a 'w' in the > pickle, as an operand on the constructor. I'm leaning heavily toward dictching WIDict and subclassing Pickler/Unpickler; I think that's the Right Thing. It will be slower running, but maybe not significantly so. I'll run some trials when I get home. Neale From tim@fourstonesExpressions.com Wed Nov 20 03:39:01 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 19 Nov 2002 21:39:01 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: So then, Neale Pickett is all like: 11/19/2002 5:39:45 PM, Neale Pickett wrote: >So then, Tim Stone - Four Stones Expressions is all like: > >> I'll wait for your checkin, and do some more work on the dbdict module >> to add my load/store stuff... > >Okay Tim, I'll tell you what. I'm going to create a branch and check in >everything I've got. I'm branching because what I have right now breaks >some existing functionality. Ok, I've got the branch right now... I'll make my little tweaks. You'll see how I do the load/store stuff with the dbm in LSDBDict(DBDict). 
Basically, keeps a working file... I have a few other tweaks to the Corpus stuff that are really unrelated to this work, more to do with Richie's needs. I'll put them in the playground as well, just for consistency's sake. > >In the branch, we can play around with moving things out of the >classifier, moving options, etc. When we get something that we think is >stable, and everyone else okays it, we can merge it all back in to HEAD. > >I've called the branch "hammie-playground". To get to it, just > > $ cvs update -r hammie-playground > >The branch need not be around for a long time, just long enough to work >out all these changes. > >> I think we should think about where WordInfo class goes... >> I think we should take Bayes out of classifier and put it in Bayes.py > >That's rather unorthodox. Why? >Now that's downright heretical! ;) It makes sense, I think, Bayes.py >being where all the Bayes stuff hangs out. But if you take WordInfo out >of classifier, and you take Bayes out of classifier, all you'll have >left is two constants. Maybe you just want to rename classifier.py. I >wonder what the other Tim thinks about this idea... > Yeah, the more I think about it, the more I realize my issue is that classifier kinda doesn't tell me what's in there. WordInfo and Bayes superclass... Doesn't really matter to me, but would make more sense to me from a packaging point of view to simply have one file to distribute rather than two... >I'm leaning heavily toward dictching WIDict and subclassing >Pickler/Unpickler; I think that's the Right Thing. It will be slower >running, but maybe not significantly so. I'll run some trials when I >get home. I don't see WIDict in the playground, so I assume you've ditched it already? But I don't see a pickle subclass either... am I missing something. Haven't tried running anything yet, so maybe it will become obvious to me when I do... 
>Neale > > - Tim www.fourstonesExpressions.com From tim_one@email.msn.com Wed Nov 20 03:43:00 2002 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 19 Nov 2002 22:43:00 -0500 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: [Tim Stone] >> I think we should take Bayes out of classifier and put it in Bayes.py [Neale Pickett] > Now that's downright heretical! ;) It makes sense, I think, Bayes.py > being where all the Bayes stuff hangs out. But if you take WordInfo out > of classifier, and you take Bayes out of classifier, all you'll have > left is two constants. Maybe you just want to rename classifier.py. I > wonder what the other Tim thinks about this idea... Heresy is fine, but I don't understand what the goal of this is. Jeremy previously added WordInfoClass = WordInfo to the Bayes class so that subclasses (of Bayes) could specify the kind of WordInfo structure they want to use. The methods in Bayes never call WordInfo directly, they always invoke self.WordInfoClass(). Bayes doesn't care, so long as whatever the WordInfoClass() factory returns supports the attributes the classifier accesses. Subclassing is a clean & correct way to provide variants. It's unfortunate that Bayes also became an old-style class for a different reason, as subclassing is much more efficient with new-style classes. Then again, classifier methods aren't called that often, so it's hard to get excited about that. Taking the classifier class out of classifer.py doesn't make sense to me on the face of it, but maybe it would if I understood the goal. From tim_one@email.msn.com Wed Nov 20 03:52:26 2002 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 19 Nov 2002 22:52:26 -0500 Subject: [Spambayes] Training from scratch In-Reply-To: Message-ID: [Richie Hindle] > ... > This could be worth bearing in mind when thinking about training > strategies (if I'm right) - since FPs are more damaging than FNs, maybe > people should be encouraged (forced?) 
to train on a bunch of hams before > any spams. If they don't, just about everything will come out as spam (every word trained on will have a by-counting spamprob of 1.0, and the Bayesian adjustment will move that closer to 0.5 but not to less than 0.5). The Outlook client doesn't allow you to train before specifying at least one ham and one spam folder. That doesn't stop a determined idiot from specifying empty folders, though. From tim_one@email.msn.com Wed Nov 20 04:01:40 2002 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 19 Nov 2002 23:01:40 -0500 Subject: [Spambayes] More back-patting - my brain's first FP where bayes got it right In-Reply-To: <15834.52184.806714.753216@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > You have to assume people are going to be gung-ho at the beginning, > then taper off when either performance gets good enough or the novelty > wears off. One stop on the way to not training at all is to only train > on FPs, FNs and unsures. > > Maybe "real world" is a better term than "realistic". Failing to account for human behavior would be a failing of the client, then. For this to work superbly, the client is going to have to train on msgs without the *user*'s guidance. I ran a quick experiment on that earlier (the classifier training on its own decisions, simply assuming they were correct), even to the extent of training on false positives as spam assuming the user doesn't look at their spam folder at all after a while. The results were indeed superb, but it so happened there were no false positives during the test run (and I haven't had time to continue with more of those tests, alas). I doubt by-hand training will work except for geeks; they'll end up doing mistake-based training, after an initial flurry of training on 5-year-old ham <0.9 wink>.
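A note on the "Bayesian adjustment" in the "Training from scratch" reply above: the behaviour described there (a word seen only in spam gets a by-counting spamprob of 1.0, and the adjustment pulls it toward 0.5 without crossing it) can be sketched as follows. This is a minimal sketch, not the project's classifier code. It assumes the adjustment has the Robinson-style form (s*x + n*p)/(s + n), with x = unknown_word_prob = 0.5 and s = unknown_word_strength = 0.45 (the default quoted elsewhere in this thread), and it leaves out the ham/spam imbalance adjustment entirely. Both function names are hypothetical.

```python
def by_counting_spamprob(spamcount, hamcount, nspam, nham):
    """Spam probability of a word from raw counts alone (hypothetical helper).

    spamcount/hamcount: number of trained spams/hams containing the word.
    nspam/nham: total number of trained spams/hams.
    """
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    if spamratio + hamratio == 0.0:
        raise ValueError("word has never been seen in training")
    return spamratio / (hamratio + spamratio)


def adjusted_spamprob(prob, n, unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Assumed Robinson-style adjustment: blend the by-counting estimate
    with the prior guess x, giving the prior a weight of s "virtual" sightings.

    n is the number of trained messages the word appeared in.
    """
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * prob) / (s + n)


if __name__ == "__main__":
    # Spam-only training: 10 spams, no ham, word seen in 3 of the spams.
    raw = by_counting_spamprob(spamcount=3, hamcount=0, nspam=10, nham=0)
    adj = adjusted_spamprob(raw, n=3)
    print("by-counting:", raw)   # every trained word looks maximally spammy
    print("adjusted:   ", adj)   # pulled toward 0.5, but still above it
```

Because the adjusted value is a weighted average of the by-counting estimate and 0.5, it always lies between the two, which is why training on spam alone cannot produce any word with a spamprob below 0.5.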
From tim_one@email.msn.com Wed Nov 20 04:14:05 2002 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 19 Nov 2002 23:14:05 -0500 Subject: [Spambayes] Training In-Reply-To: Message-ID: [Paul Moore] > ... > This happened to me today, with Tim's new adjustment switched on, with > a 10:1 ham:spam imbalance. IIRC, Tim's change means that with this > sort of imbalance, ham clues will only have 10% of their normal > effect, so saying "This is ham" will be pretty much ignored :-( It affects only the Bayesian adjustment to the by-counting spamprob estimates, and the adjustment isn't a linear function, so 10:1 -> 10% isn't what happens. For what really happens, study update_probabilities . The effect of the Bayesian adjustment is *always* to move a by-counting estimate closer to 0.5 (unknown_word_prob). It can never increase the distance of a by-counting estimate from 0.5. So even if the Bayesian adjustment weren't done at all, a hamprob can only get as low as the data says it should get, and that's purely a matter of how often the word has been seen in trained ham and trained spam. Doing better than that would require major psychic powers. > I'm not sure. All training is basically saying "these specific > messages *are* ham/spam". Whether this is done in bulk, or on an > individual basis, shouldn't matter. A naive view says that therefore > trained messages will score 0/100 "by definition". But the maths > doesn't work like that, and nothing is going to make it. You could train on a message over and over and over ... again, until the score became arbitrarily close to 0 or 100. It would probably ruin the classifier for most other msgs, though. > But I think it's a reasonable assumption that any messages which have > been explicitly trained will no longer hit the "unsure" range. I just > can't see a way of making even that assumption be true. 
I have an FP that's an entire Nigerian scam msg, prefaced by a one-line comment saying something like "Jeez, here's another Nigerian wire scam -- this has been around for 20 years". Think about it . From tim_one@email.msn.com Wed Nov 20 04:18:12 2002 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 19 Nov 2002 23:18:12 -0500 Subject: [Spambayes] CipherTrust? In-Reply-To: <15834.38911.435802.71321@montanaro.dyndns.org> Message-ID: > http://eletters1.ziffdavis.com/cgi-bin10/flo?y=eSyU0EWaTF0E4J0sQU0Ax Rummage around on the site it points to: http://www.ciphertrust.com/ironmail/anti-spam.htm I like the "surgical precision in spam-blocking to eliminate false positives" bit. Screw these percentage error rates! From now on we're surgically precise (not to mention anatomically correct). From tim@fourstonesExpressions.com Wed Nov 20 04:39:19 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 19 Nov 2002 22:39:19 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: 11/19/2002 9:43:00 PM, "Tim Peters" wrote: >[Tim Stone] >>> I think we should take Bayes out of classifier and put it in Bayes.py > >[Neale Pickett] >> Now that's downright heretical! ;) It makes sense, I think, Bayes.py >> being where all the Bayes stuff hangs out. But if you take WordInfo out >> of classifier, and you take Bayes out of classifier, all you'll have >> left is two constants. Maybe you just want to rename classifier.py. I >> wonder what the other Tim thinks about this idea... > >Heresy is fine, but I don't understand what the goal of this is. See below > >Jeremy previously added > > WordInfoClass = WordInfo > >to the Bayes class so that subclasses (of Bayes) could specify the kind of >WordInfo structure they want to use. The methods in Bayes never call >WordInfo directly, they always invoke self.WordInfoClass(). 
Bayes doesn't >care, so long as whatever the WordInfoClass() factory returns supports the >attributes the classifier accesses. Looks like we've got an impedance mismatch here. The WIDict class that Neale made always assumes WordInfo. We'll have to fix that. If Neale subclasses Pickle to do this, it'll still need to know what class to instantiate. Interesting. > >Subclassing is a clean & correct way to provide variants. It's unfortunate >that Bayes also became an old-style class for a different reason, as >subclassing is much more efficient with new-style classes. Then again, >classifier methods aren't called that often, so it's hard to get excited >about that. > >Taking the classifier class out of classifer.py doesn't make sense to me on >the face of it, but maybe it would if I understood the goal. I don't have strong feelings about it... we could just as easily put all the stuff that's in Bayes.py into classifier.py. One file is better than two, at least in this instance. > > > - Tim www.fourstonesExpressions.com From neale@woozle.org Wed Nov 20 04:42:36 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 20:42:36 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: References: Message-ID: So then, "Tim Peters" is all like: > Heresy is fine, but I don't understand what the goal of this is. > > [snip] > > Taking the classifier class out of classifer.py doesn't make sense to > me on the face of it, but maybe it would if I understood the goal. Right now we have only one classifier, a Bayesian classifier, so when Tim Stone consolidated all the PersistentBayes classes into a Bayes class, it seemed (to me) like all the things called "Bayes" should be found there. Having gotten home and ingested some cabbage pie, I think the classifier module is fine, and we should instead rename the Bayes module to something like "Persistent". 
In the future, concievably, there may be another non-Bayesian classifier that we will still want to wrap with our cool persistence classes. So the misnomer is the Bayes module, not the classifier module. I think I'll have humble pie for dessert ;) Neale From tim_one@email.msn.com Wed Nov 20 05:01:51 2002 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 20 Nov 2002 00:01:51 -0500 Subject: [Spambayes] Better optimization loop In-Reply-To: <3DDA7916.5010102@hooft.net> Message-ID: [Tim] >> Good observation! That should help. simplex isn't fast in the best of >> cases, and in this case ... [Rob Hooft] > Anyone that has a faster optimization algorithm lying around is welcome > to replace my Simplex code. Twasn't a criticism, just an observation about downhill Simplex, in anyone's implementation. Multidimensional optimization is a darned hard problem, and this approach is at least pretty robust. >>> To that goal I added a "delayed" flexcost to the CostCounter module >>> that can use the optimal cutoffs calculated at the end of timcv.py. >> Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of >> 0.99 and spam_cutoff of 0.995 to get rid of "impossible" FP. > They are in any case better than any other alternative I could think of. > But if you disagree, you can change the order in which the > CostCounter.default() builds up the cost counters; the optimization > always uses the last one. I don't disagree. The point was that the "optimal cutoffs" are *also* working like mad to accommodate outliers at the expense of everything else. So long as FP are viewed as an approximation to the end of the world, all attempts to optimize settings are going to focus on them. > ... > Very similar to my case. I'm seriously thinking about removing the > "hopeless" and "almost hopeless" messages from my corpses. I agree with > the bayesian statistics that they can't be correctly classified. Whether it's a good idea to remove them depends on the goal . 
I keep mine in my test data so that the error rates reflect real life. But there are about 10 ham in my c.l.py data I simply don't care about, and it doesn't bother me a bit if they pop back into my FP set (indeed, the last few rounds of changes boosted my c.l.py total from 1 FP to 3 FP -- BFD! FP Happen, and the last few round of changes had helpful effects on almost everything else). In that sense, it's wholly unrealistic (but perhaps pragmatically necessary) to say that each FP (and FN, and Unsure) has exactly the same cost as every other. Some FP simply don't matter, while others matter a lot. Likewise, I find some kinds of spam much more irritating than others, and although my c.l.py data has no FN remaining, there are about 50 spam there I really enjoy so I'd like to penalize the system for not letting me see them . > ... > Press et al. report about a "robust fit", which is not a least squares > but a least absolute deviates fit. It is insensitive to outliers. > Is there an analog idea for us? I don't know, but am not sanguine: there's a specific cost function we're trying to minimize, and despite that it's unrealistic it's better than nothing. Introducing this cost measure was a real help! Trying to squeeze the last penny out of it probably isn't, though -- it's not that good a model of reality. It does *generally* help us by saying FP are worse than FN are worse than Unsure, and attaching a concrete figure of merit to that aggregate judgment, but I don't take that number as more than an indicator where "a lot smaller is better". Small changes in it don't bother or cheer me. > ... > Further results I obtained: My idea of running with an fp cost of $2 and > a square cost function didn't work. It doesn't optimize to a consistent > position. 
Increasing the cost of an fp back to $10 and running with the > same square function did do a reasonable job, it optimized to: > > [Classifier] > unknown_word_prob = 0.520415 > minimum_prob_strength = 0.315104 > unknown_word_strength = 0.215393 > > So the unknown_word_prob is now back to 0.5 again! More, I bet 0.52 is closer to the true unknown-word probability in your data (take all the words that have appeared at least, say, 5 times, and average their spamprobs; that's about the best guess we can make for the spamprob of a word we see for the first time; in the three corpora I measured this on, 0.52 was the smallest empirical value I saw). The other two act to look only at very extreme words, and to keep words extreme longer in the face of contrary evidence (a hapax is strong enough to survive minimum_prob_strength of 0.3 even with s at the default 0.45; they're even more extreme at s 0.22). Guessing "the true spamprob" may have room for improvement. OTOH, if you have more ham than spam, then x=0.52 is acting to make things "less hammy", and a benefit may come from that. In that case, enabling the new ham/spam imbalance adjustment option may help even more. From tim@fourstonesExpressions.com Wed Nov 20 05:06:56 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 19 Nov 2002 23:06:56 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: 11/19/2002 10:42:36 PM, Neale Pickett wrote: >So then, "Tim Peters" is all like: > >> Heresy is fine, but I don't understand what the goal of this is. >> >> [snip] >> >> Taking the classifier class out of classifer.py doesn't make sense to >> me on the face of it, but maybe it would if I understood the goal. > >Right now we have only one classifier, a Bayesian classifier, so when >Tim Stone consolidated all the PersistentBayes classes into a Bayes >class, it seemed (to me) like all the things called "Bayes" should be >found there. 
> >Having gotten home and ingested some cabbage pie, I think the classifier >module is fine, and we should instead rename the Bayes module to >something like "Persistent". In the future, concievably, there may be >another non-Bayesian classifier that we will still want to wrap with our >cool persistence classes. So the misnomer is the Bayes module, not the >classifier module. How about PersistentClassifier? > >I think I'll have humble pie for dessert ;) > >Neale > > - Tim www.fourstonesExpressions.com From neale@woozle.org Wed Nov 20 05:18:45 2002 From: neale@woozle.org (Neale Pickett) Date: 19 Nov 2002 21:18:45 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: References: Message-ID: So then, Tim Stone - Four Stones Expressions is all like: > How about PersistentClassifier? Yech. Since the things are kinda doing what the standard shelve module does, and we keep calling them "stores", how about "store"? From tim@fourstonesExpressions.com Wed Nov 20 05:28:55 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 19 Nov 2002 23:28:55 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: 11/19/2002 11:09:35 PM, Neale Pickett wrote: >So then, Tim Stone - Four Stones Expressions is all like: > >> Neale, I just checked in dbdict and Bayes. Lemme know what you think. > >Okay, so you're just copying the file and then renaming it later. It >looks like you're trying to wrap the dbm file with a transactional >model. Copying isn't an atomic operation though, so locking will be a >problem. See below... I don't think Richie is so much after transactionality as 'forget it' mode. I agree that locking is a problem. I don't like the implementation too much... I experimented with keeping an in-memory cache, but that gets hard to manage memory consumption. These bayes databases might get kinda large... So I figured I'd let the dbm implementation manage memory. 
If it's too stupid to do a good job, then we (someone) should fix that. Perhaps in the long run, ZODB is the final answer. But pickles in particular are so portable... dbm files are so fast... different strokes for different folks, I guess. > >I still don't understand why a DBDict needs load/store. It'd be so much >easier just have store() call self.db.sync() and make load() a noop. Is >there something out there which depends on the disk version being >different from the memory version? As nearly as I can tell, the dbm implementations vary on when they write stuff to persistent storage. Sync only offers the guarantee that the memory and persistent versions match. Richie has presented the requirement that the dictionary be able to forget what has happened... > >> Also, I tried pop3proxy with the playground branch, and it doesn't >> work. It looks like we got a back level of Options.py. I'm not sure >> how to get it up to snuff... > >There was a thinko in pop3proxy, but now I'm getting a weird >AssertionError. Is this something with ther Persistence classes, maybe? >It looks like nspam isn't getting udpated: > >Traceback (most recent call last): > File "/usr/lib/python2.3/threading.py", line 410, in __bootstrap > self.run() > File "/usr/lib/python2.3/threading.py", line 398, in run > apply(self.__target, self.__args, self.__kwargs) > File "./pop3proxy.py", line 1306, in runProxy > state.bayes.learn(tokenizer.tokenize(spam1), True) > File "classifier.py", line 298, in learn > self.update_probabilities() > File "classifier.py", line 345, in update_probabilities > assert spamcount <= nspam >AssertionError > >Workin' on it. > > - Tim www.fourstonesExpressions.com From tim@fourstonesExpressions.com Wed Nov 20 05:32:19 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue, 19 Nov 2002 23:32:19 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: persistent might be better. 
This gives us:

    class PersistentBayes(classifier.Bayes):
    class DBDictBayes(PersistentBayes):

    bayes = persistent.DBDictBayes('mydb')

11/19/2002 11:18:45 PM, Neale Pickett wrote: >So then, Tim Stone - Four Stones Expressions is all like: > >> How about PersistentClassifier? > >Yech. Since the things are kinda doing what the standard shelve module >does, and we keep calling them "stores", how about "store"? > > > - Tim www.fourstonesExpressions.com From Paul.Moore@atosorigin.com Wed Nov 20 09:04:18 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 20 Nov 2002 09:04:18 -0000 Subject: [Spambayes] Outlook weirdness Message-ID: <16E1010E4581B049ABC51D4975CEDB88619944@UKDCX001.uk.int.atosorigin.com> This morning I started Outlook. I hadn't upgraded spambayes - it's the same version as yesterday. But the training data was completely gone! The manager dialog said that there was no training data, and filters were disabled. But the pickle was there, and yesterday everything was working fine. And even stranger, while I was watching a message came in and got filed in "Unsure". The only thing I can think of which may be relevant is that after I had shutdown Outlook cleanly (or so it looked) last night, when I shut my machine down, I got a message saying Outlook was not responding and was being closed down. Looks like some form of rogue instance of Outlook... Whether that had an effect, I don't know. I'm sure all of these strange effects I get are related to my using Exchange with Active Directory as my server, but I've no idea how to diagnose them :-( Paul. From mhammond@skippinet.com.au Wed Nov 20 09:43:32 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 20 Nov 2002 20:43:32 +1100 Subject: [Spambayes] Training from scratch In-Reply-To: Message-ID: > The Outlook client doesn't allow you to train before specifying > at least one > ham and one spam folder. That doesn't stop a determined idiot from > specifying empty folders, though.
It doesn't let you enable filtering until there are at least 5 ham and 5 spam in the database though. Not sure why I bothered with that - you should never underestimate the determination of an idiot Mark. From msergeant@startechgroup.co.uk Wed Nov 20 10:22:35 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Wed, 20 Nov 2002 10:22:35 +0000 Subject: [Spambayes] Offtopic - getting bounce messages for spam References: Message-ID: <3DDB626B.90703@startechgroup.co.uk> Paul Moore said the following on 19/11/02 19:19: > Sorry, this is offtopic, but I'm hoping that the concentration of spam > experts on this group may be able to help me. > > I've just started receiving undeliverable message reports for spam, > sent to people I've never heard of. It looks to me like someone is > managing to impersonate me when they send spam out. I'm fairly sure > I'm not running an open relay (is there a way of checking for > certain?), so I guess someone is spoofing headers or something. I've > heard of this sort of thing before, but never experienced this myself. It's called a Joe-Job. I get these *all* the time. See the Spam-L FAQ (google will find it for you) for details. > Two questions, really: > > a) Is this something I should worry about (am I likely to end up on > blacklists or the like)? > b) What can I do about it in any case? There's little you can do except try and detect it and dump the mails, unless you want to spend the effort finding which relay it came through/from and trying to get the IP added to various DNSBL's. Though the problem is you have to know what DNSBL's the mail server that the bounce is coming from uses. Generally I ignore them - they tend to last a day or so before moving on to annoy someone else. Matt. From richie@entrian.com Wed Nov 20 10:28:52 2002 From: richie@entrian.com (richie@entrian.com) Date: Wed, 20 Nov 2002 10:28:52 +0000 Subject: [Spambayes] Re: proposed changes to hammie & co. 
Message-ID: [Tim Stone] > Richie has presented the requirement that the > dictionary be able to forget what has happened... This isn't a huge requirement - it's nice that the pop3proxy's test code doesn't write anything to the disk, but that's now been achieved by losing __del__. I mentioned that people might want to do speculative training and not save the results, but that can always be achieved by specifying a temporary DB name on the command line. The ability to forget is a nice-to- have, not a requirement. Quoting myself: "I'd much rather have an explicit store() method and document the fact that storage may be pre-empted by certain implementations." -- Richie Hindle richie@entrian.com From msergeant@startechgroup.co.uk Wed Nov 20 10:25:54 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Wed, 20 Nov 2002 10:25:54 +0000 Subject: [Spambayes] re: ciphertrust References: <3DDA711E.25949.343B3AFE@localhost> Message-ID: <3DDB6332.3040406@startechgroup.co.uk> Brad Clements said the following on 19/11/02 22:17: > http://www.ciphertrust.com/ironmail/anti-spam.htm > > > Is this it? I can't really understand it through all the marketing speak. I met with these guys a few weeks ago. Basically it's a custom rule set. A bit like SpamAssassin. They use customer feedback to expand their ruleset. They also use Razor and I think some DNSBL's. Let me know offline if you want any more info. Matt. From piersh@friskit.com Wed Nov 20 10:57:26 2002 From: piersh@friskit.com (Piers Haken) Date: Wed, 20 Nov 2002 02:57:26 -0800 Subject: [Spambayes] Outlook weirdness Message-ID: <9891913C5BFE87429D71E37F08210CB9297516@zeus.sfhq.friskit.com> I have seen this a couple of times, too. I have noticed (by watching PythonWin) that Outlook can take some time to actually shutdown, while saving the db, after the UI has been closed. I have 3500:2050 ham:spam. Piers. 
> -----Original Message----- > From: Moore, Paul [mailto:Paul.Moore@atosorigin.com] > Sent: Wednesday, November 20, 2002 1:04 AM > To: Spambayes (E-mail) > Subject: [Spambayes] Outlook weirdness > > > This morning I started Outlook. I hadn't upgraded spambayes - > it's the same version as yesterday. But the training data was > completely gone! The manager dialog said that there was no > training data, and filters were disabled. > > But the pickle was there, and yesterday everything was > working fine. And even stranger, while I was watching a > message came in and got filed in "Unsure". > > The only thing I can think of which may be relevant is that > after I had shutdown Outlook cleanly (or so it looked) last > night, when I shut my machine down, I got a message saying > Outlook was not responding and was being closed down. Looks > like some form of rogue instance of Outlook... Whether that > had an effect, I don't know. > > I'm sure all of these strange effects I get are related to my > using Exchange with Active Directory as my server, but I've > no idea how to diagnose them :-( > > Paul. > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes > From rob@hooft.net Wed Nov 20 12:17:01 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Wed, 20 Nov 2002 13:17:01 +0100 Subject: [Spambayes] Better optimization loop References: Message-ID: <3DDB7D3D.1020306@hooft.net> Tim Peters wrote: > [Tim] >>>Good observation! That should help. simplex isn't fast in the best of >>>cases, and in this case ... > [Rob Hooft] >>Anyone that has a faster optimization algorithm lying around is welcome >>to replace my Simplex code. [Tim] > Twasn't a criticism, just an observation about downhill Simplex, in anyone's > implementation. Multidimensional optimization is a darned hard problem, and > this approach is at least pretty robust.
It wasn't anger, it was a genuine invitation.... ;-) I'm running these tests, and they are taking daaaayysss, so I really welcome anyone that has alternatives. One alternative I thought of is to keep the wordcounts lying around, and call the update function only once before starting scoring. But I'm not sure I would be the best person to try that (read: I'm sure someone else can do that 10x faster than I can). Another speedup I could use is a version of Bayes that calculates the spamprob from the numbers on demand instead of calculating it for all words every time. This pays off for all cases where the training batch is very small (~1 message). Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Wed Nov 20 12:46:20 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 12:46:20 +0000 Subject: [Spambayes] pop3proxy now supports multiple POP3 accounts In-Reply-To: References: Message-ID: [Tim Stone] > One thing to think about... I'm going to be running more than one of these, > because I have several pop3 accounts. Would it be possible to make pop3proxy > configurable to proxy more than one pop3 account? I'd like to share the same > bayes database between them all... This is now done. It creates a listening port for each account [I don't like the popular idea of munging the POP3 username to include the hostname, because it complicates the proxy - simple is good]. See Options.py for the new ini-file settings - you give it a list of POP3 servers and a corresponding list of local ports to listen on. The old settings still work but are deprecated and give a warning. It passes this test:

    [pop3proxy]
    # Evil self-proxying test.
    pop3proxy_servers: localhost:8110, localhost:111
    pop3proxy_ports: 111, 110

where localhost:8110 is a local test POP3 server a la "pop3proxy.py -t". This makes it run two proxies, one pointing at the other.
The messages come back classified twice (because it doesn't yet strip existing X-Hammie-Disposition headers - must fix that 8-) And this from a single-threaded program. Sometimes asyncore winds me up, but sometimes I really have to take my hat off to it. -- Richie Hindle richie@entrian.com From richie@entrian.com Wed Nov 20 12:46:16 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 12:46:16 +0000 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <15834.15618.465106.671756@montanaro.dyndns.org> References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> Message-ID: Hi Skip, > That suggests to me that you need to group messages together based upon > their initial classification. That way [...] they are clumped: > > D H S U > * > * > * > * > * > * Excellent plan. I've just committed this, with a heading between each clump. For those not using the proxy, there's a mockup at http://www.entrian.com/review3.html > The background color should probably alternate between light grey and white Implemented in the very first version using the time machine - thanks for the suggestion! 8-) -- Richie Hindle richie@entrian.com From seant@iname.com Wed Nov 20 12:46:23 2002 From: seant@iname.com (Sean True) Date: Wed, 20 Nov 2002 07:46:23 -0500 Subject: [Spambayes] Outlook weirdness In-Reply-To: <9891913C5BFE87429D71E37F08210CB9297516@zeus.sfhq.friskit.com> Message-ID: > > I have seen this a couple of times, too. I have noticed (by watching > PythonWin) that Outlook can take some time to actually shutdown, while > saving the db, after the UI has been closed. I have 3500:2050 ham:spam. I have a pretty persistent Outlook shutdown problem. I have a 6K Spam, 7K Ham training set, and an Outlook that commonly uses 40-50MB of memory. Often when I close Outlook, it will stay in memory. (Leaving an icon in the task bar, too). 
When I "restart" it, meaning, I think, restart the UI, I get no Spam manager icons, even though the addin is still running cheerfully and filtering. For a while I thought this happened only after the addin had thrown an exception, but that does not appear to be the case. I am suspicious of interactions with my virus scanner (Mcaffee), but have had problems even with Mcaffee disabled. I haven't been aggressive enough to try uninstalling Mcaffee -- I'd rather give up the addin, much as I like it! > The only thing I can think of which may be relevant is that > after I had shutdown Outlook cleanly (or so it looked) last > night, when I shut my machine down, I got a message saying > Outlook was not responding and was being closed down. Looks > like some form of rogue instance of Outlook... Whether that > had an effect, I don't know. > If that happens again, take a look at the task manager, and see if it is really gone. If it's not, I find that terminating the task with prejudice from task manager usually kills it. At the expense of a very long start up as Outlook reconstructs the mailbox database (I think). > I'm sure all of these strange effects I get are related to my > using Exchange with Active Directory as my server, but I've > no idea how to diagnose them :-( I run in a pure internet environment (pop3 servers only). Outlook is a very difficult and unforgiving platform to work with (thanks, Mark!) even without Exchange in the picture. -- Sean From jeremy@alum.mit.edu Wed Nov 20 12:52:22 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Wed, 20 Nov 2002 07:52:22 -0500 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: References: Message-ID: <15835.34182.266661.912333@slothrop.zope.com> >>>>> "TS" == Tim Stone <- Four Stones Expressions > writes: TS> Perhaps in the long run, ZODB is the final answer. But pickles TS> in particular are so portable... dbm files are so TS> fast... different strokes for different folks, I guess. 
ZODB uses pickles, so is about as portable. Don't know how its speed compares to dbm files, but would expect it's roughly comparable. What are your concerns other than portability and as-fast-as-dbm-files? The advantage of ZODB is that it takes about two dozen lines of code to make the classifier persistent. One likely concern is that your users have to install ZODB or you have to package it for them. >> I still don't understand why a DBDict needs load/store. It'd be >> so much easier just have store() call self.db.sync() and make >> load() a noop. Is there something out there which depends on the >> disk version being different from the memory version? TS> As nearly as I can tell, the dbm implementations vary on when TS> they write stuff to persistent storage. Sync only offers the TS> guarantee that the memory and persistent versions match. Richie TS> has presented the requirement that the dictionary be able to TS> forget what has happened... Another advantage of ZODB is that it's transactional, which makes it possible to forget what has happened. It also makes it possible for multiple processes to share the database in a sane way.
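The "forget what has happened" behaviour discussed in this thread can be illustrated without ZODB. What follows is only a rough, stdlib-only sketch: the class name ForgetfulStore and its store()/forget() methods are hypothetical (they are not part of spambayes, dbdict or ZODB), a plain dict stands in for the dbm file, and none of ZODB's multi-process guarantees are provided. The idea is simply to buffer changes in memory and write them through only on an explicit store().

```python
class ForgetfulStore:
    """Dict-like wrapper that buffers writes until store() is called.

    Hypothetical illustration only: 'backing' stands in for a dbm file
    or any other mutable mapping that persists data.
    """

    _DELETED = object()  # marker for keys deleted since the last store()

    def __init__(self, backing):
        self.backing = backing   # committed state
        self.pending = {}        # uncommitted changes

    def __getitem__(self, key):
        if key in self.pending:
            value = self.pending[key]
            if value is self._DELETED:
                raise KeyError(key)
            return value
        return self.backing[key]

    def __setitem__(self, key, value):
        self.pending[key] = value

    def __delitem__(self, key):
        self[key]  # raises KeyError if the key is not currently visible
        self.pending[key] = self._DELETED

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def store(self):
        """Write all pending changes through to the backing mapping."""
        for key, value in self.pending.items():
            if value is self._DELETED:
                if key in self.backing:
                    del self.backing[key]
            else:
                self.backing[key] = value
        self.pending.clear()

    def forget(self):
        """Discard everything changed since the last store()."""
        self.pending.clear()


committed = {"viagra": (0, 12)}   # word -> (hamcount, spamcount)
db = ForgetfulStore(committed)

db["viagra"] = (0, 13)            # speculative training
db["python"] = (7, 0)
db.forget()                       # the committed data is untouched
after_forget = dict(committed)

db["python"] = (7, 0)
db.store()                        # now the change is written through
after_store = dict(committed)
```

The trade-off is the one already raised in the thread: the pending dict lives in memory, so a very large speculative training run costs memory until store() or forget() is called.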
Jeremy From fgranger@teleprosoft.com Wed Nov 20 11:58:41 2002 From: fgranger@teleprosoft.com (François Granger) Date: Wed, 20 Nov 2002 12:58:41 +0100 Subject: [Spambayes] Another soft for the collection Message-ID: I did not see it on the web page: http://spambayes.sourceforge.net/related.html I got it from: http://db.tidbits.com/getbits.acgi?tbart=06994 The site of the product http://www.c-command.com/spamsieve/ Salutations, Francois Granger -- fgranger@teleprosoft.com - tel: +33 1 41 88 48 00 - Fax: + 33 1 41 88 48 48 From jeremy@alum.mit.edu Wed Nov 20 13:13:26 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Wed, 20 Nov 2002 08:13:26 -0500 Subject: [Spambayes] pop3proxy now supports multiple POP3 accounts In-Reply-To: References: Message-ID: <15835.35446.435871.609607@slothrop.zope.com> >>>>> "RH" == Richie Hindle writes: RH> This is now done. It creates a listening port for each account RH> [I don't like the popular idea of munging the POP3 username to RH> include the hostname, because it complicates the proxy - simple RH> is good]. Oh, I see you're aware of the approach. It's a trivial amount of code in the proxy. You're already using asyncore so you can't really be worried about complexity . It's much easier for a user to understand what's going on when the pop client's configuration has some mention of the name of the real pop server. I started off using two different servers on two different ports, but my client never gave me any indication of where the mail came from. With the new change, the status line tells me what server it's getting mail from. The implementation is this short. It's mostly parsing the user name into name, host, port.

    def read_user(self):
        # XXX This could be cleaned up a bit.
        line = self.rfile.readline()
        if line == "":
            return False
        parts = line.split()
        if parts[0] != "USER":
            self.wfile.write("-ERR Invalid command; must specify USER first")
            return False
        user = parts[1]
        # The user name has the form name@host or name@host:port.
        i = user.rfind("@")
        username = user[:i]
        server = user[i+1:]
        i = server.find(":")
        if i == -1:
            server = server, 110
        else:
            port = int(server[i+1:])
            server = server[:i], port
        zLOG.LOG("POP3", zLOG.INFO, "Got connect for %s" % repr(server))
        self.connect_pop(server)
        self.pop_wfile.write("USER %s\r\n" % username)
        resp = self.pop_rfile.readline()
        # As long as the server responds OK, just swallow this response.
        if resp.startswith("+OK"):
            return True
        else:
            return False

Jeremy From Paul.Moore@atosorigin.com Wed Nov 20 13:15:50 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 20 Nov 2002 13:15:50 -0000 Subject: [Spambayes] Outlook weirdness Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DEC@UKDCX001.uk.int.atosorigin.com> From: Sean True [mailto:seant@iname.com] > I have a pretty persistent Outlook shutdown problem. I have a > 6K Spam, 7K Ham training set, and an Outlook that commonly uses > 40-50MB of memory. Often when I close Outlook, it will stay in > memory. (Leaving an icon in the task bar, too). When I "restart" it, > meaning, I think, restart the UI, I get no Spam manager icons, even > though the addin is still running cheerfully and filtering. Outlook has two exit options "Exit" and "Exit and log off". I've never quite understood the difference, but I wonder if what you're seeing is related - Outlook finishing, closing down the UI, but then remaining for ages in the background while the addin saves the pickles and tidies up. This would also tie in with the high memory footprint, as the pickle method keeps the database in memory. Would it be worth trying the DBM format for the database?
I think this would give faster startup/shutdown times, and lower memory consumption, at the expense of on-disk database size and slower filtering (although I doubt that this difference would be an issue). Unfortunately, the addin doesn't hook into the main code's persistence structure (as far as I can see) so switching formats isn't as simple as just changing the INI file. I'll look into it and give it a try at some point. Paul. From richie@entrian.com Wed Nov 20 13:50:29 2002 From: richie@entrian.com (richie@entrian.com) Date: Wed, 20 Nov 2002 13:50:29 +0000 Subject: [Spambayes] pop3proxy now supports multiple POP3 accounts In-Reply-To: <15835.35446.435871.609607@slothrop.zope.com> Message-ID: Hi Jeremy, > Oh, I see you're aware of the approach. It's a trivial amount of > code in the proxy. It's not so much that - I've since realised that it simply won't work with some email clients. As your code says: if parts[0] != "USER": self.wfile.write("-ERR Invalid command; must specify USER first") but that's not always obeyed. For instance, RFC 2449 adds extensions to POP3, including the CAPA command: > The POP3 CAPA command returns a list of capabilities supported > by the POP3 server. It is available in both the AUTHORIZATION > and TRANSACTION states. meaning that the first command given by the client might be CAPA. This gets you into a chicken-and-egg situation whereby you need to proxy the CAPA command but you don't know which server to connect to because you haven't seen the USER command yet. I've seen this in the real world - François Granger's client sends CAPA. This is also why pop3proxy will proxy unknown commands. > You're already using asyncore so you can't really be worried > about complexity . 
(-8 .helps which, demand on backwards work to brain my rewired I've -- Richie Hindle richie@entrian.com From papaDoc@videotron.ca Wed Nov 20 14:09:34 2002 From: papaDoc@videotron.ca (papaDoc) Date: Wed, 20 Nov 2002 09:09:34 -0500 Subject: [Spambayes] New pop3proxy options Message-ID: <3DDB979E.8050604@videotron.ca> Hi, This is my first contribution ;-) This is a patch for Options.py:

363,364c363,364
< pop3proxy_servers: ""
< pop3proxy_ports: ""
---
> pop3proxy_servers:
> pop3proxy_ports:

By the way, I'm trying to use pop3proxy with Mozilla 1.1. I'm creating a new account which points to localhost:110. I can retrieve the messages and they are scored, but I can't display the body; I see only the Subject line. (When you click on the subject line, the body should be displayed in the subwindow below, in the mail tools of Mozilla.) But when I look in the Inbox, they are there with their bodies ????? From bkc@murkworks.com Wed Nov 20 14:47:35 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 20 Nov 2002 09:47:35 -0500 Subject: [Spambayes] re: ciphertrust In-Reply-To: <3DDB6332.3040406@startechgroup.co.uk> Message-ID: <3DDB5925.32059.37C5A44E@localhost> On 20 Nov 2002 at 10:25, Matt Sergeant wrote: > I met with these guys a few weeks ago. Basically it's a custom rule set. A > bit like SpamAssassin. They use customer feedback to expand their ruleset. > They also use Razor and I think some DNSBL's. > > Let me know offline if you want any more info. So much for heuristics.. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Wed Nov 20 14:53:18 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 20 Nov 2002 09:53:18 -0500 Subject: [Spambayes] Another soft for the collection In-Reply-To: Message-ID: <3DDB5A7C.2661.37CADF75@localhost> On 20 Nov 2002 at 12:58, François Granger wrote: > The site of the product > > http://www.c-command.com/spamsieve/ In their screen shot, under Corpus.
It shows 3778 unused words. Huh? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From seant@iname.com Wed Nov 20 14:53:13 2002 From: seant@iname.com (Sean True) Date: Wed, 20 Nov 2002 09:53:13 -0500 Subject: [Spambayes] Outlook weirdness In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DEC@UKDCX001.uk.int.atosorigin.com> Message-ID: > [PAUL] Outlook has two exit options "Exit" and "Exit and log off". I've never > quite understood the difference, but I wonder if what you're seeing is > related - Outlook finishing, closing down the UI, but then remaining > for ages in the background while the addin saves the pickles and > tidies up. This would also tie in with the high memory footprint, as > the pickle method keeps the database in memory. I believe that the "Exit and log off" option is Exchange specific. I don't get that. If no messages are trained, the database won't be dirty, and doesn't get written. So, that's not the likely culprit. > > Would it be worth trying the DBM format for the database? I think > this would give faster startup/shutdown times, and lower memory > consumption, at the expense of on-disk database size and slower > filtering (although I doubt that this difference would be an issue). > Slower *training* would be an issue, however. > Unfortunately, the addin doesn't hook into the main code's persistence > structure (as far as I can see) so switching formats isn't as simple > as just changing the INI file. I'll look into it and give it a try at > some point. Brave guy. Me, I'm a coward. There are old coders, and there are bold coders -- but there are no old, bold coders. 
-- Sean From B-Morgan@concentric.net Wed Nov 20 15:08:46 2002 From: B-Morgan@concentric.net (Brad Morgan) Date: Wed, 20 Nov 2002 08:08:46 -0700 Subject: [Spambayes] Outlook weirdness In-Reply-To: Message-ID: Outlook (Internet-only) has been occasionally "hanging around" on me long before I tried any sort of spam filtering. Sometimes it hangs around collecting messages, sometimes it just prevents another version from starting up. I haven't found any pattern to the unsuccessful shutdowns and Microsoft certianly hasn't either for all the patches they've put out. IMO, its their bug plain and simple. Regards, Brad -----Original Message----- From: spambayes-bounces@python.org [mailto:spambayes-bounces@python.org]On Behalf Of Sean True Sent: Wednesday, November 20, 2002 7:53 AM To: Moore, Paul; Sean True; Piers Haken; Spambayes (E-mail) Subject: RE: [Spambayes] Outlook weirdness > [PAUL] Outlook has two exit options "Exit" and "Exit and log off". I've never > quite understood the difference, but I wonder if what you're seeing is > related - Outlook finishing, closing down the UI, but then remaining > for ages in the background while the addin saves the pickles and > tidies up. This would also tie in with the high memory footprint, as > the pickle method keeps the database in memory. I believe that the "Exit and log off" option is Exchange specific. I don't get that. If no messages are trained, the database won't be dirty, and doesn't get written. So, that's not the likely culprit. > > Would it be worth trying the DBM format for the database? I think > this would give faster startup/shutdown times, and lower memory > consumption, at the expense of on-disk database size and slower > filtering (although I doubt that this difference would be an issue). > Slower *training* would be an issue, however. > Unfortunately, the addin doesn't hook into the main code's persistence > structure (as far as I can see) so switching formats isn't as simple > as just changing the INI file. 
I'll look into it and give it a try at > some point. Brave guy. Me, I'm a coward. There are old coders, and there are bold coders -- but there are no old, bold coders. -- Sean From msergeant@startechgroup.co.uk Wed Nov 20 15:28:21 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Wed, 20 Nov 2002 15:28:21 +0000 Subject: [Spambayes] re: ciphertrust References: <3DDB5925.32059.37C5A44E@localhost> Message-ID: <3DDBAA15.6000108@startechgroup.co.uk> Brad Clements said the following on 20/11/02 14:47: > On 20 Nov 2002 at 10:25, Matt Sergeant wrote: > >>I met with these guys a few weeks ago. Basically it's a custom rule set. A >>bit like SpamAssassin. They use customer feedback to expand their ruleset. >>They also use Razor and I think some DNSBL's. >> >>Let me know offline if you want any more info. > > So much for heuristics.. Rules == heuristics. Or are you using a different dictionary to me? :-) From Paul.Moore@atosorigin.com Wed Nov 20 16:05:54 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Wed, 20 Nov 2002 16:05:54 -0000 Subject: [Spambayes] Outlook weirdness Message-ID: <16E1010E4581B049ABC51D4975CEDB88619948@UKDCX001.uk.int.atosorigin.com> From: Sean True [mailto:seant@iname.com] >> Would it be worth trying the DBM format for the database? I think >> this would give faster startup/shutdown times, and lower memory >> consumption, at the expense of on-disk database size and slower >> filtering (although I doubt that this difference would be an issue). >> > Slower *training* would be an issue, however. I can't imagine the training getting much slower than it is at the moment for me :-( The pickle isn't being dumped to disk when I hit "Delete as spam", but the operation is taking over a second. No idea why... This is with 700 spam and 7000 ham (or so) in the DB, giving a 7M pickle.
Outlook's using 36M of RAM. Maybe it's just Outlook being slow... Paul. From seant@iname.com Wed Nov 20 16:15:11 2002 From: seant@iname.com (Sean True) Date: Wed, 20 Nov 2002 11:15:11 -0500 Subject: [Spambayes] Outlook weirdness In-Reply-To: Message-ID: > Outlook (Internet-only) has been occasionally "hanging around" on me long > before I tried any sort of spam filtering. Sometimes it hangs around > collecting messages, sometimes it just prevents another version from > starting up. > > I haven't found any pattern to the unsuccessful shutdowns and Microsoft > certianly hasn't either for all the patches they've put out. > IMO, its their > bug plain and simple. > Most bugs are their fault. In this case, however, having the addin UI go away is annoying, and most other products appear to avoid it. On the long list of things before this is not alpha code. -- Sean From skip@pobox.com Wed Nov 20 17:15:45 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 20 Nov 2002 11:15:45 -0600 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> Message-ID: <15835.49985.259869.725900@montanaro.dyndns.org> Richie> I've just committed this, with a heading between each clump. Richie> For those not using the proxy, there's a mockup at Richie> http://www.entrian.com/review3.html Looks very nice. Heck, I may have to actually figure out how to use this. Of course, I need figure out how to throw in an ssh tunnel and point fetchmail at the proxy... 
Skip From richie@entrian.com Wed Nov 20 17:23:02 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 17:23:02 +0000 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <15835.49985.259869.725900@montanaro.dyndns.org> References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> <15835.49985.259869.725900@montanaro.dyndns.org> Message-ID: Hi Skip, > Richie> http://www.entrian.com/review3.html > > Looks very nice. Heck, I may have to actually figure out how to use this. > Of course, I need figure out how to throw in an ssh tunnel and point > fetchmail at the proxy... Should you feel the urge to write a HOW-TO while you do that, I won't stop you. 8-) -- Richie Hindle richie@entrian.com From skip@pobox.com Wed Nov 20 17:27:22 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 20 Nov 2002 11:27:22 -0600 Subject: [Spambayes] Another soft for the collection In-Reply-To: <3DDB5A7C.2661.37CADF75@localhost> References: <3DDB5A7C.2661.37CADF75@localhost> Message-ID: <15835.50682.656616.985380@montanaro.dyndns.org> >> http://www.c-command.com/spamsieve/ Brad> In their screen shot, under Corpus. Brad> It shows 3778 unused words. Huh? Maybe it ignores hapaxes? S From db3l@fitlinxx.com Wed Nov 20 17:37:01 2002 From: db3l@fitlinxx.com (David Bolen) Date: 20 Nov 2002 12:37:01 -0500 Subject: [Spambayes] Re: Outlook users should update References: Message-ID: "Mark Hammond" writes: > And I just checked in a few changes too. Of most note is that the plugin > should correctly filter all "unread, unscored" messages in your watch > folders at startup. Works for me - let me know if it does for you too > For what it's worth, unfortunately it doesn't seem to work in my environment with an Exchange server (at least not in my case), nor did the prior version I was running from CVS. 
I just get "Processing 0 missed spam in folder 'Inbox' took 7.46883ms" at
startup. I realize that I'm in the minority of Outlook users around here
as an Exchange corporate user. :-)

Oh, and while I'm commenting about things without contributing a fix, I
may as well mention that since some recent training changes (somewhere
around a pull from CVS I did on the 14th, when it added stuff to identify
messages with no body and what not), some messages fail to train with a
traceback (in the trace window) like (XXX is a really long hex id):

    Error training message ''
    Traceback (most recent call last):
      File ".\spambayes\Outlook2000\train.py", line 67, in train_folder
        if train_message(message, isspam, mgr):
      File ".\spambayes\Outlook2000\train.py", line 42, in train_message
        stream = msg.GetEmailPackageObject()
      File ".\spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
        text = self._GetMessageText()
      File ".\spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
        0) # any # of results is fine
    com_error: (-2147221246, 'Invalid window handle', None, None)

It's just the occasional message - 3 out of 3000 in my last training - and
it's reproducible on a given message, but there doesn't seem to be obvious
commonality - at least at the MUA level - to the messages.

I hesitated to comment before now in part since I wasn't sure of the
expected state of the earlier (~11/14) fetch I had done from CVS, and
since "real work" suddenly hit hard last week and I felt guilty about not
debugging further (I just manually filter at startup right now and ignore
the occasional training failure). But if there is any specific data you
might want me to acquire I'd be happy to see what I could do. I do intend
to poke more deeply when I get a chance.

-- David

From db3l@fitlinxx.com Wed Nov 20 17:53:29 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 20 Nov 2002 12:53:29 -0500
Subject: [Spambayes] Re: Outlook weirdness
References: <9891913C5BFE87429D71E37F08210CB9297516@zeus.sfhq.friskit.com>
Message-ID:

"Sean True" writes:

> I have a pretty persistent Outlook shutdown problem. I have a 6K Spam, 7K
> Ham training set,
> and an Outlook that commonly uses 40-50MB of memory. Often when I close
> Outlook, it will stay
> in memory. (Leaving an icon in the task bar, too). When I "restart" it,
> meaning, I think,
> restart the UI, I get no Spam manager icons, even though the addin is still
> running cheerfully
> and filtering.

Usually this is due to the addin generating an exception previously, or to
Outlook for some reason thinking it failed and not wanting to reload it.
Although the fact that you say it's still filtering is definitely at odds
with that idea - unless it never really unloaded previously.

I've found it very useful to just keep a trace window (using the
win32traceutil module) running all the time. Technically you can pick up
existing trace messages from before you started tracing as long as the
addin is still running, but I normally just leave the trace task running
at all times. It lets me validate that the addin is shutting down when I
think it is and starting up when I think it is, or check for issues when
something seems amiss.

Note that due to COM interaction, if anything with respect to the Outlook
process remains in memory, the addin will too. But you can tell that in
the trace window since there won't be any disconnect indication.

-- David From DavidA@ActiveState.com Wed Nov 20 18:08:37 2002 From: DavidA@ActiveState.com (David Ascher) Date: Wed, 20 Nov 2002 10:08:37 -0800 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <15833.20589.376685.686723@montanaro.dyndns.org> References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> Message-ID: <3DDBCFA5.8070607@ActiveState.com> Richie Hindle wrote: > Excellent plan. I've just committed this, with a heading between each > clump. For those not using the proxy, there's a mockup at > http://www.entrian.com/review3.html Looks good! Suggestions: - (if necessary) learn to love JavaScript and provide keyboard navigation, so that users can do "down,down,down,h,down,down,down,s,..." If you want an example of how this feels before you bother, you can get VPM from us (ActiveState), which ships with Komodo Professional (you can get a trial for free). The JS is probably easy to find as well. - Make 'hovertips' that display the first few lines of the body (stripped of html and whitespace), to aid in classification for when the headers aren't enough. If that's too hard, make a link on each message that shows a popup with the contents of the email. What happens after you click on Train? Does it go to the next day, or just refresh the current page? --david From richie@entrian.com Wed Nov 20 20:01:45 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 20:01:45 +0000 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <3DDBCFA5.8070607@ActiveState.com> References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> <3DDBCFA5.8070607@ActiveState.com> Message-ID: Hi David, > Looks good! Thanks! > (if necessary) learn to love JavaScript and provide keyboard navigation Good plan. 
My only concern is that it might break on some browsers, but I can always limit it. I'll add it to the to-do list. > If you want an > example of how this feels before you bother, you can get VPM from us > (ActiveState), which ships with Komodo Professional (you can get a trial for > free). The JS is probably easy to find as well. Thanks - I'll have a look. > Make 'hovertips' that display the first few lines of the body (stripped of > html and whitespace), to aid in classification for when the headers aren't > enough. If that's too hard, make a link on each message that shows a popup with > the contents of the email. Linking to the message is already on the to-do list, but I like the hovertip idea as well. > What happens after you click on Train? Does it go to the next day, or just > refresh the current page? It refreshes the current page if you deferred any messages, otherwise it goes to the next or previous page. -- Richie Hindle richie@entrian.com From francois.granger@free.fr Wed Nov 20 20:11:06 2002 From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger) Date: Wed, 20 Nov 2002 21:11:06 +0100 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: <15835.34182.266661.912333@slothrop.zope.com> References: <15835.34182.266661.912333@slothrop.zope.com> Message-ID: At 7:52 -0500 20/11/02, in message Re: [Spambayes] proposed changes to hammie & co., Jeremy Hylton wrote: > >One likely concern is that your users have to install ZODB or you >have to package it for them. This is a real issue for "normal" user...... -- Le courrier électronique est un moyen de communication. Les gens devraient se poser des questions sur les implications politiques des choix (ou non choix) de leurs outils et technologies. 
Pour des courriers propres : http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html From rob@hooft.net Wed Nov 20 20:05:11 2002 From: rob@hooft.net (Rob Hooft) Date: Wed, 20 Nov 2002 21:05:11 +0100 Subject: [Spambayes] Outlook weirdness References: <16E1010E4581B049ABC51D4975CEDB88619948@UKDCX001.uk.int.atosorigin.com> Message-ID: <3DDBEAF7.7020709@hooft.net> Moore, Paul wrote: > From: Sean True [mailto:seant@iname.com] > >>>Would it be worth trying the DBM format for the database? I think >>>this would give faster startup/shutdown times, and lower memory >>>consumption, at the expense of on-disk database size and slower >>>filtering (although I doubt that this difference would be an issue). >>> >> >>Slower *training* would be an issue, however. > > > I can't imagine the training getting much slower than it is at the > moment for me :-( The pickle isn't being dumped to disk when I hit > "Delete as spam", but the operation is taking over a second. No > idea why... Isn't that the update_spamprob? It is updating ~300k spam probabilities, where you are going to use only a few every time. The current Bayes is optimized for training on hundreds of messages at a time, and then scoring hundreds. For "training one, scoring one" it would be more efficient to delay the calculation of the spam probs until they are needed. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Wed Nov 20 20:47:43 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 20:47:43 +0000 Subject: [Spambayes] New pop3proxy options In-Reply-To: <3DDB979E.8050604@videotron.ca> References: <3DDB979E.8050604@videotron.ca> Message-ID: <5ksntu0ruug9br0m26g77q8pfqluaaaora@4ax.com> Hi papaDoc, > This is a patch for the Options.py Thanks - not sure why those double-quotes were there, but it all seems to work without them. I'll check in the patch when I next do a check-in. 
> By the way I'm trying to use pop3proxy with Mozilla 1.1. [...] > I can't display the body I see only the Subject line. I'm not sure I understand this. Are you talking about the web training interface, http://localhost:8880/review ? You can't (yet) view the message bodies in there - clicking on the Subject line *shouldn't* do anything yet. > But when I look in the Inbox there are their with their body ????? So in your email client, as opposed to your web browser, you do see the whole message? It sounds like it's working as intended - have I misunderstood something? -- Richie Hindle richie@entrian.com From papaDoc@videotron.ca Wed Nov 20 21:00:34 2002 From: papaDoc@videotron.ca (papaDoc) Date: Wed, 20 Nov 2002 16:00:34 -0500 Subject: [Spambayes] New pop3proxy options References: <3DDB979E.8050604@videotron.ca> <5ksntu0ruug9br0m26g77q8pfqluaaaora@4ax.com> Message-ID: <3DDBF7F2.10502@videotron.ca> Hi Richie, >>By the way I'm trying to use pop3proxy with Mozilla 1.1. [...] >>I can't display the body I see only the Subject line. >> > >I'm not sure I understand this. Are you talking about the web training >interface, http://localhost:8880/review ? You can't (yet) view the message >bodies in there - clicking on the Subject line *shouldn't* do anything yet. > > > >>But when I look in the Inbox there are their with their body ????? >> >> > >So in your email client, as opposed to your web browser, you do see the >whole message? It sounds like it's working as intended - have I >misunderstood something? > > In http://localhost:8880/review as expected I only see the Subject (until someone makes linking to the real mail working ;-) ) What I was trying to say is: Using the mail tools provided with Mozilla 1.1 I see only the Subject, Sender, Date when one mail is selected. Nothing in the "body" window. When I go into the directory where the mail is saved by Mozilla c:/something/ and look into the file Inbox everything look OK. 
Mozilla retrieves the mail from localhost:110 and pop3proxy from
pop.my_prodived.com:110.

When Mozilla retrieves the mail directly from pop.my_prodived.com:110 I
have no problem.

papaDoc

From neale@woozle.org Wed Nov 20 21:16:25 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 13:16:25 -0800
Subject: [Spambayes] Better optimization loop
In-Reply-To: <3DDB7D3D.1020306@hooft.net>
References: <3DDB7D3D.1020306@hooft.net>
Message-ID:

So then, "Rob W.W. Hooft" is all like:

> Another speedup I could use is a version of Bayes that calculates the
> spamprob from the numbers on demand instead of calculating them for
> all words every time. This pays off for all cases where the training
> batch is very small (~1 message).

Funny you should bring that up, Rob, because I happen to be working on
exactly that.

The only way I could think to do it was to pass in a new option to
Bayes.learn() and Bayes.unlearn(). I've therefore removed the
update_probabilities option and replaced it with
update_word_probabilities. My thinking here is that asking things to run
Bayes.update_probabilities() when they need it isn't too much of a burden
(most of them call it explicitly anyway), but learn() and unlearn() are
the *only* places that individual word rescoring can happen.

The changed methods become:

    def learn(self, wordstream, is_spam, update_word_probabilities=True):
        self._add_msg(wordstream, is_spam, update_word_probabilities)

    def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
        self._remove_msg(wordstream, is_spam, update_word_probabilities)

    def _add_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

    def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

And inside the for loop in _add_msg() and _remove_msg() is this:

            if update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                # Needed to tell a persistent DB that the content
                # changed.
                wordinfo[word] = record

I'll check all this in to the hammie-playground branch as soon as I can be
sure it doesn't break anything. If we all think it's kosher, I'll merge it
into HEAD.

Neale

From richie@entrian.com Wed Nov 20 21:25:28 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 21:25:28 +0000
Subject: [Spambayes] New pop3proxy options
In-Reply-To: <3DDBF7F2.10502@videotron.ca>
References: <3DDB979E.8050604@videotron.ca> <5ksntu0ruug9br0m26g77q8pfqluaaaora@4ax.com> <3DDBF7F2.10502@videotron.ca>
Message-ID:

> Using the mail tools provided with Mozilla 1.1 I see only the
> Subject, Sender, Date when one mail is selected. Nothing in the
> "body" window.
>
> When I go into the directory where the mail is saved by Mozilla
> c:/something/ and look into the file Inbox everything look OK.

Ah, OK, I see! Thanks for that - the proxy must be changing the messages
in some way that the Mozilla tools don't like. I'll look into it.

--
Richie Hindle
richie@entrian.com

From rob@hooft.net Wed Nov 20 21:28:51 2002
From: rob@hooft.net (Rob Hooft)
Date: Wed, 20 Nov 2002 22:28:51 +0100
Subject: [Spambayes] Better optimization loop
References: <3DDB7D3D.1020306@hooft.net>
Message-ID: <3DDBFE93.4060600@hooft.net>

Neale Pickett wrote:
> So then, "Rob W.W. Hooft" is all like:
>
>
>>Another speedup I could use is a version of Bayes that calculates the
>>spamprob from the numbers on demand instead of calculating them for
>>all words every time. This pays off for all cases where the training
>>batch is very small (~1 message).

> And inside the for loop in _add_msg() and _remove_msg() is this:
>
>             if update_word_probabilities:
>                 self.update_word_probability(word, record)
>             else:
>                 # Needed to tell a persistent DB that the content
>                 # changed.
>                 wordinfo[word] = record

I was thinking along different lines: when the train size and the score
size are both approximately 1 message, we can forget about the word
probabilities altogether.
Just don't store them anywhere anymore, and calculate the individual word probabilities from the raw counts while scoring. This will not only save time because lots of words that enter the database will "never" be used again (hapaxes...), but it should also shrink the database size. If it is too slow then we can make a cache out of a dictionary mapping raw count tuples to probabilities to speed it up. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Wed Nov 20 21:30:38 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 21:30:38 +0000 Subject: [Spambayes] Documentation... Message-ID: This may be premature, but as part of helping John Draper set up the spambayes software I've made a start on some user documentation. It could go on the website, or maybe in with the source code - I'm not sure we're ready to give the impression that this stuff is ready for "normal people" to use yet. This stuff refers to the current, unpackaged sources - if we ever package it up, the documentation will be very different. But I'm guessing that's a long way off, and in the meantime we'll all be asked by friends and project newcomers to explain how it all fits together and how to get it up and running - this is an attempt to let us say "Here, read this!" when that happens. It tries to target both technical and non-technical users (though for some fairly high values of "non-technical") and may well fall between two stools as a result. I'll check it in either with the sources or the website depending on feedback. If anyone spots glaring omissions, factual inaccuracies or downright rudeness, either let me know or edit it after I check it in - I'm not claiming any editorial rights! It's also somewhat biased towards the POP3 proxy and the web interface (for obvious reasons 8-) and lacks any detail on the Outlook plugin because I'm not one of those lucky Outlook users... 
this is a not-at-all-veiled plea for contributors or users who know about
the lacking areas to step forward and write some words!

--------------------------------------------------------------------------

> First some concepts:

 o 'Ham' is the opposite of 'Spam'. 8-)

 o At no point does any part of Spambayes delete emails. All it does is
   classify them by adding a header that tells you whether they look like
   spam or not. It's then up to you to use your email software to do
   something in response to that header (the Outlook plug-in does some of
   the work for you).

 o The header that the software adds is called X-Hammie-Disposition
   (mostly for historical reasons, and you can customise it) and has a
   value of Yes, No or Unsure.

> There are six main components to the Spambayes system:

 o A database. Loosely speaking, this is a collection of words and
   associated spam and ham probabilities. The database says "If a message
   contains the word 'Viagra' then there's a 98% chance that it's spam,
   and a 2% chance that it's ham." This database is created by training -
   you give it messages, tell it whether those messages are ham or spam,
   and it adjusts its probabilities accordingly. How to train it is
   covered below. By default it lives in a file called "hammie.db".

 o The tokeniser/classifier. This is the core engine of the system. The
   tokenizer splits emails into tokens (words, roughly speaking), and the
   classifier looks at those tokens to determine whether the message looks
   like spam or not. You don't use the tokeniser/classifier directly - it
   powers the other parts of the system.

 o The POP3 proxy. This sits between your email client (Eudora, Outlook
   Express, etc) and your email server, and adds the classification header
   to emails as you download them.
   A typical user's email setup looks like this:

    +-----------------+                              +-------------+
    | Outlook Express |     Internet or intranet     |             |
    |  (or similar)   | <--------------------------> | POP3 server |
    |                 |                              |             |
    +-----------------+                              +-------------+

   The POP3 server runs either at your ISP for internet mail, or somewhere
   on your internal network for corporate mail. The POP3 proxy sits in
   the middle and adds the classification header as you retrieve your
   email:

    +-----------------+        +------------+        +-------------+
    | Outlook Express |        | Spambayes  |        |             |
    |  (or similar)   | <----> | POP3 proxy | <----> | POP3 server |
    |                 |        |            |        |             |
    +-----------------+        +------------+        +-------------+

   So where you currently have your email client configured to talk to,
   say, "pop3.my-isp.com", you instead configure the *proxy* to talk to
   "pop3.my-isp.com" and configure your email client to talk to the proxy.
   The POP3 proxy can live on your PC, or on the same machine as the POP3
   server, or on a different machine entirely; it really doesn't matter.
   If, say, it's living on your PC, you'd configure your email client to
   talk to "localhost".

 o The web interface. This is a server that runs alongside the POP3 proxy
   and lets you control it through the web. You can upload emails to it
   for training or classification, query the probabilities database ("How
   many of my emails really *do* contain the word Viagra?") and most
   importantly, train it on the emails you've received. When you start
   using the system, unless you train it using the Hammie script, it will
   classify most things as Unsure, and often make mistakes. But it keeps
   copies of all the emails it's seen, and through the web interface you
   can train it by going through a list of all the emails you've received
   and checking a Ham/Spam box next to each one. After training on a few
   messages (say 20 spams and 20 hams), you'll find that it's getting it
   right most of the time.
   The web training interface automatically checks the Ham/Spam boxes
   according to what it thinks, so all you need to do is correct the odd
   mistake - it's very quick and easy.

 o The Outlook plug-in. For Outlook 2000 users (not Outlook Express) this
   lets you manage the whole thing from within Outlook. You set up a Ham
   folder and a Spam folder, and train it simply by dragging messages into
   those folders. Alternatively there are buttons to do the same thing.
   And it integrates into Outlook's filtering system to make it easy to
   file all the suspected spam into its own folder, for instance.

 o The Hammie script. This does three jobs: command-line training,
   procmail filtering, and XML-RPC. To train on a whole collection of
   messages, stored either as mbox files or as collections of message
   files in a directory, you run "hammie.py -g ham -s spam", where 'ham'
   is the mbox file or directory containing ham, and 'spam' is the mbox
   file or directory containing spam. Procmail is a unix-based email
   filtering system - to use Hammie as a procmail filter, run it as
   "hammie.py -f" from a procmail rule. It will read a message from its
   input, add the header, and write it to its output. Hammie can also run
   as an XML-RPC server, so that a programmer can write code that uses a
   remote server to classify emails programmatically - see hammiesrv.py.

> Where things live:

The Hammie script is called hammie.py. The POP3 proxy and the web
interface live in pop3proxy.py. The Outlook plug-in lives in the
Outlook2000 subdirectory - see the README.txt in that directory for more
information on that. As well as these components, there's also a whole
pile of utility scripts, test harnesses and so on - see README.txt and
TESTING.txt in the spambayes distribution for more information.

> Configuration:

The system is configured through a file called "bayescustomize.ini".
In here you can configure the name and type of your database, the POP3
server(s) you want to proxy to, the ports you want the proxy and the web
interface to run on, and so on. You can also control details like how
sure you want the system to be that a message really is spam before it
marks it as such.

The default values for all the options, and the documentation for them,
all live in Options.py. To change an option, create a bayescustomize.ini
and add the option to that - don't edit Options.py.

> Requirements:

To run the software, you need Python 2.2 or above. You also need version
2.4.3 or above of the Python "email" package. If you're running the CVS
version of Python (known as 2.3a0) then you already have this. If not,
you can download it from http://mimelib.sf.net and install it - unpack the
archive, cd to the email-2.4.3 directory and type "python setup.py
install". This will install it into your Python site-packages directory.
You'll also need to move aside the standard "email" library - go to your
Python "Lib" directory and rename "email" to "email_old".

> Setup on unix (Windows/Mac users can ignore this bit):

On a unix machine, unless you're running as root (which we strongly advise
you don't!) you can't run the proxy on port 110. Besides, you quite
possibly already have a POP3 server running on that port. You need to run
it on an unprivileged port, say 1110. You do this by adding the line

    pop3proxy_ports: 1110

to bayescustomize.ini - all will become clear in the next section. Where
we talk about port 110, you use port 1110.

> Minimal setup for using the POP3 proxy and web interface:

The minimum you need to do to get started is create a bayescustomize.ini
containing the following:

    [pop3proxy]
    pop3proxy_servers: pop3.my-isp.com

where "pop3.my-isp.com" is wherever you currently have your email client
configured to collect mail from. You can now run the proxy by running
"python pop3proxy.py".
This will print some status messages, which should include:

    BayesProxyListener listening on port 110.
    UserInterfaceListener listening on port 8880.

What that means is that the POP3 proxy is ready for your email client to
connect to it (110 is the standard port number for POP3 - you can use a
different one by adding a line to bayescustomize.ini - see Options.py) and
that the web interface is ready for your browser to connect to it. The
address of the web interface is http://localhost:8880/ (or if you're
running it on a different machine, replace 'localhost' with the name of
the machine). You can have a look at the web interface now, but it won't
be very interesting because the system is untrained and has seen no
messages yet.

> Reading emails and training the classifier:

You now need to configure your email client to talk to the proxy instead
of the real email server. Change your equivalent of "pop3.my-isp.com" to
"localhost" (or to the name of the machine you're running the proxy on) in
your email client's setup.

Hit "Get new email" and look at the headers of the emails (send yourself
an email if you don't have any!) - there should be an X-Hammie-Disposition
header there. It probably says "Unsure", because you haven't done any
training yet. You should be able to create a mail folder called
"Suspected spam" and set up a filtering rule that puts emails with an
"X-Hammie-Disposition: Yes" header into that folder. (Eventually we
should publish instructions on how to do this in all the popular email
clients).

You can now train the system through the web interface - follow the
"Review messages" link and you'll see a list of the emails that the system
has seen so far. Check the appropriate boxes and hit Train. The messages
disappear (eventually you'll be able to get back to them, for instance to
correct any training mistakes) and if you go back to the home page you'll
see that the "Total emails trained" has increased.

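For mail clients that can't filter on arbitrary headers, the "Suspected
spam" rule described above can also be scripted. What follows is only a
minimal sketch, assuming a current Python 3 and its standard "email"
package rather than the 2.2-era setup in the Requirements section. The
function name and the "Unsure" and "Inbox" folder names are made up for
illustration, and the prefix match is there in case the header carries
anything after the Yes/No/Unsure word (an assumption; the text above only
promises those three values).

```python
from email import message_from_string


def folder_for(raw_message, header="X-Hammie-Disposition"):
    """Pick a folder name from the classification header (illustrative only)."""
    msg = message_from_string(raw_message)
    # A missing header is treated like "No": the message stays in the inbox.
    disposition = (msg.get(header) or "").strip().lower()
    if disposition.startswith("yes"):
        return "Suspected spam"
    if disposition.startswith("unsure"):
        return "Unsure"      # hypothetical folder for manual review
    return "Inbox"           # hypothetical default folder


if __name__ == "__main__":
    sample = "Subject: cheap pills\nX-Hammie-Disposition: Yes\n\nBuy now!\n"
    print(folder_for(sample))
```

Filing like this is separate from training: checking boxes and hitting
Train in the web interface is still what teaches the classifier.
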
Once you've done this on a few spams and a few hams, you'll find that the
X-Hammie-Disposition header is getting it right most of the time. The
more you train it the more accurate it gets. There's no need to train it
on every message you receive, but you should train on a few spams and a
few hams on a regular basis. You should also try to train it on about the
same number of spams as hams.

You can train it on lots of messages in one go using the Hammie script, as
explained above.

--------------------------------------------------------------------------

--
Richie Hindle
richie@entrian.com

From mhammond@skippinet.com.au Wed Nov 20 21:52:12 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 21 Nov 2002 08:52:12 +1100
Subject: [Spambayes] Outlook weirdness
In-Reply-To:
Message-ID:

> Outlook (Internet-only) has been occasionally "hanging around" on me long
> before I tried any sort of spam filtering. Sometimes it hangs around
> collecting messages, sometimes it just prevents another version from
> starting up.
>
> I haven't found any pattern to the unsuccessful shutdowns and Microsoft
> certainly hasn't either for all the patches they've put out.
> IMO, it's their
> bug plain and simple.

This is my experience too. File->Exit *does* have more luck sometimes,
whereas just closing the window may not work as expected. Add stuff like
CE Inbox synchronization tools and the various addins people install, and
it could be anything.

For example, if our addin keeps a reference to a certain COM object, then
we end up with a COM circular reference, and Outlook never shuts down.
Hard to call COM circular references a bug any more than they are in
pre-GC Python. I have seen this problem enough times in the past though
that I am fairly confident that we never cause it.

Mark.

From guido@python.org Wed Nov 20 21:49:20 2002 From: guido@python.org (Guido van Rossum) Date: Wed, 20 Nov 2002 16:49:20 -0500 Subject: [Spambayes] LJ article Message-ID: <200211202149.gAKLnLw28459@pcp02138704pcs.reston01.va.comcast.net> A while ago I promised Linux Journal an article about Spambayes. I got as far as setting up an outline when I found out I have no time to write it. Neither does Tim, who co-volunteered with me. Gary Robinson is still volunteering to write the sections about the math, but most of the article is intended to be not very mathematical, and he can't write that. So... Maybe there's someone here who is interested in writing this article? Fame and fortune for you and for SpamBayes! (Plus, I think LJ pays its authors.) If you're interested, write to Don Marti for details. DON'T WRITE ME! --Guido van Rossum (home page: http://www.python.org/~guido/) From richie@entrian.com Wed Nov 20 22:48:55 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 22:48:55 +0000 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: <3DDBCFA5.8070607@ActiveState.com> References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> <3DDBCFA5.8070607@ActiveState.com> Message-ID: [David Ascher] > learn to love JavaScript and provide keyboard navigation I looked at this, and the keyboard navigation is pretty good already: Tab Tab Tab Left Tab Tab Right Right, etc. One keystroke to move from message to message, possibly multiple keystrokes on an arrow key rather than a single 'h' or 's', but you don't need to move your hands to do it. I've left it on the to-do list, but I've persuaded myself that it's unnecessary. Other people's browser's mileage may vary. > Make 'hovertips' that display the first few lines of the body This is done. 
The code to strip HTML content uses a regular expression from
tokenizer.py which is commented "Cheap-ass gimmick", so I'm interested to
see how well people find it works! (Apologies to Tim - it seems to work
extremely well.) Rest assured it's safe from HTML content leaking into
the web interface - the worst that will happen is that you'll see HTML
source in the hovertip.

--
Richie Hindle
richie@entrian.com

From lists@morpheus.demon.co.uk Wed Nov 20 22:38:55 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Wed, 20 Nov 2002 22:38:55 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> <3DDBCFA5.8070607@ActiveState.com>
Message-ID:

Richie Hindle writes:

>> What happens after you click on Train? Does it go to the next day, or just
>> refresh the current page?
>
> It refreshes the current page if you deferred any messages, otherwise it
> goes to the next or previous page.

It's locking up for me. There are no messages in the command prompt
window - is there any way to get it to produce trace messages (looking at
the code, the answer seems to be "no"...)?

Do I need to rebuild the database after upgrading? I didn't, and the user
interface said "Total emails trained: Spam: 0 Ham: 0". This doesn't tally
with reality - I'd trained on a batch of messages (using hammie.py) before
starting (BTW, adding a bulk training interface might be nice, although
using hammie.py seems to work OK in practice).

Paul.

-- This signature intentionally left blank From richie@entrian.com Wed Nov 20 23:26:49 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 20 Nov 2002 23:26:49 +0000 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> <3DDBCFA5.8070607@ActiveState.com> Message-ID: Hi Paul, > It's locking up for me. Urk. That's bad (and new - no-one's reported that before). And this happened when you hit the Train button, yes? > There are no messages in the command prompt > window - is there any way to get it to produce trace messages (looking > at the code, the answer seems to be "no"...)? It's "no". 8-) Like I say, no-one's reported it locking before, and I've never seen it. You usually get a traceback when something goes wrong. So your console says something like: Loading database... Done. BayesProxyListener listening on port 110. UserInterfaceListener listening on port 8880. and nothing else, and the process is still running, but you can't get a page served to your browser? What error message do you get from the browser? If it's one of those pointless IE error pages, could you try telnetting to port 8880 and saying "GET / HTTP/1.0"? Can you even connect with telnet? How about port 110? > Do I need to rebuild the database after upgrading? I didn't, and the > user interface said "Total emails trained: Spam: 0 Ham: 0". This > doesn't tally with reality - I'd trained on a batch of messages (using > hammie.py) before starting You shouldn't need to do anything, and even if you did I'd expect it to give an error rather than quietly failing. Are you using a pickle or a DBM? In fact, could you send me your bayescustomize.ini and/or details of the command line you're using (off-list if you prefer)? 
> (BTW, adding a bulk training interface > might be nice, although using hammie.py seems to work OK in practice). I agree - training by uploading an mbox file is on the list. -- Richie Hindle richie@entrian.com From popiel@wolfskeep.com Wed Nov 20 23:31:28 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 20 Nov 2002 15:31:28 -0800 Subject: [Spambayes] Split DB and no update_probabilities Message-ID: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com> I've got a patch to split the word database into two pieces: one with the ham and spam counts, and one with the spamprobs. At the same time, I got rid of the timestamps and killcounts, both of which are rarely used (and can be added back in by subclasses if they're really wanted). With this patch, the spamprobs can be cheaply discarded en mass, leading to a friendlier interface for incremental training. To go along with this, update_probabilities is eliminated, and the probabilities are generated on demand and merely cached in the spamprob database. Unfortunately, this patch breaks all preexisting databases. It wouldn't be too hard to write a bit of code to take an old pickle and create a count database from it, but I'm too lazy to do that at the moment. (The spamprob database would, of course, take care of itself.) This patch also likely breaks most client code, since update_probabilities no longer exists, and the learn and unlearn methods don't (optionally) take a boolean to control whether update_probabilities is called. I could have left in a noop update_probabilities and ignored the optional arguments to learn and unlearn... but this is alpha code, and a clean break from the old ways is probably better in the long run. Since this patch does break stuff rather severely, I'm just including it in this email for people to look at and play with, instead of checking it in and thereby forcing it down people's throats. Enjoy. 
- Alex Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.30 diff -c -r1.30 TestDriver.py *** TestDriver.py 19 Nov 2002 17:43:27 -0000 1.30 --- TestDriver.py 20 Nov 2002 23:18:17 -0000 *************** *** 304,325 **** prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) - if options.show_best_discriminators > 0: - print - print " best discriminators:" - stats = [(-1, None)] * options.show_best_discriminators - smallest_killcount = -1 - for w, r in c.wordinfo.iteritems(): - if r.killcount > smallest_killcount: - heapreplace(stats, (r.killcount, w)) - smallest_killcount = stats[0][0] - stats.sort() - for count, w in stats: - if count < 0: - continue - r = c.wordinfo[w] - print " %r %d %g" % (w, r.killcount, r.spamprob) - if options.show_histograms: printhist("this pair:", local_ham_hist, local_spam_hist) self.trained_ham_hist += local_ham_hist --- 304,309 ---- Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.8 diff -c -r1.8 Tester.py *** Tester.py 7 Nov 2002 22:30:04 -0000 1.8 --- Tester.py 20 Nov 2002 23:18:18 -0000 *************** *** 59,69 **** learn = self.classifier.learn if hamstream is not None: for example in hamstream: ! learn(example, False, False) if spamstream is not None: for example in spamstream: ! learn(example, True, False) ! self.classifier.update_probabilities() # Untrain the classifier on streams of ham and spam. Updates # probabilities before returning, and resets test results. --- 59,68 ---- learn = self.classifier.learn if hamstream is not None: for example in hamstream: ! learn(example, False) if spamstream is not None: for example in spamstream: ! learn(example, True) # Untrain the classifier on streams of ham and spam. Updates # probabilities before returning, and resets test results. 
*************** *** 72,82 **** unlearn = self.classifier.unlearn if hamstream is not None: for example in hamstream: ! unlearn(example, False, False) if spamstream is not None: for example in spamstream: ! unlearn(example, True, False) ! self.classifier.update_probabilities() # Run prediction on each sample in stream. You're swearing that stream # is entirely composed of spam (is_spam True), or of ham (is_spam False). --- 71,80 ---- unlearn = self.classifier.unlearn if hamstream is not None: for example in hamstream: ! unlearn(example, False) if spamstream is not None: for example in spamstream: ! unlearn(example, True) # Run prediction on each sample in stream. You're swearing that stream # is entirely composed of spam (is_spam True), or of ham (is_spam False). Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.53 diff -c -r1.53 classifier.py *** classifier.py 18 Nov 2002 18:23:09 -0000 1.53 --- classifier.py 20 Nov 2002 23:18:18 -0000 *************** *** 46,91 **** LN2 = math.log(2) # used frequently by chi-combining ! PICKLE_VERSION = 1 ! class WordInfo(object): ! __slots__ = ('atime', # when this record was last used by scoring(*) ! 'spamcount', # # of spams in which this word appears ! 'hamcount', # # of hams in which this word appears ! 'killcount', # # of times this made it to spamprob()'s nbest ! 'spamprob', # prob(spam | msg contains this word) ! ) ! ! # Invariant: For use in a classifier database, at least one of ! # spamcount and hamcount must be non-zero. ! # ! # (*)atime is the last access time, a UTC time.time() value. It's the ! # most recent time this word was used by scoring (i.e., by spamprob(), ! # not by training via learn()); or, if the word has never been used by ! # scoring, the time the word record was created (i.e., by learn()). ! # One good criterion for identifying junk (word records that have no ! 
# value) is to delete words that haven't been used for a long time. ! # Perhaps they were typos, or unique identifiers, or relevant to a ! # once-hot topic or scam that's fallen out of favor. Whatever, if ! # a word is no longer being used, it's just wasting space. ! ! def __init__(self, atime, spamprob=options.unknown_word_prob): ! self.atime = atime ! self.spamcount = self.hamcount = self.killcount = 0 self.spamprob = spamprob def __repr__(self): ! return "WordInfo%r" % repr((self.atime, self.spamcount, ! self.hamcount, self.killcount, ! self.spamprob)) def __getstate__(self): ! return (self.atime, self.spamcount, self.hamcount, self.killcount, ! self.spamprob) def __setstate__(self, t): ! (self.atime, self.spamcount, self.hamcount, self.killcount, ! self.spamprob) = t class Bayes: # Defining __slots__ here made Jeremy's life needlessly difficult when --- 46,82 ---- LN2 = math.log(2) # used frequently by chi-combining ! PICKLE_VERSION = 2 ! class CountInfo(object): ! __slots__ = ('spamcount', 'hamcount') ! ! def __init__(self): ! self.spamcount = self.hamcount = 0 ! ! def __repr__(self): ! return "CountInfo%r" % repr((self.spamcount, self.hamcount)) ! ! def __getstate__(self): ! return (self.spamcount, self.hamcount) ! ! def __setstate__(self, t): ! (self.spamcount, self.hamcount) = t ! ! class ProbInfo(object): ! __slots__ = ('spamprob') ! ! def __init__(self, spamprob=options.unknown_word_prob): self.spamprob = spamprob def __repr__(self): ! return "ProbInfo%r" % repr((self.spamprob,)) def __getstate__(self): ! return (self.spamprob,) def __setstate__(self, t): ! (self.spamprob,) = t class Bayes: # Defining __slots__ here made Jeremy's life needlessly difficult when *************** *** 100,118 **** # ) # allow a subclass to use a different class for WordInfo ! WordInfoClass = WordInfo def __init__(self): ! self.wordinfo = {} self.nspam = self.nham = 0 def __getstate__(self): ! 
return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham def __setstate__(self, t): if t[0] != PICKLE_VERSION: raise ValueError("Can't unpickle -- version %s unknown" % t[0]) ! self.wordinfo, self.nspam, self.nham = t[1:] # spamprob() implementations. One of the following is aliased to # spamprob, depending on option settings. --- 91,112 ---- # ) # allow a subclass to use a different class for WordInfo ! CountInfoClass = CountInfo ! ProbInfoClass = ProbInfo def __init__(self): ! self.countinfo = {} ! self.probinfo = {} self.nspam = self.nham = 0 def __getstate__(self): ! return (PICKLE_VERSION, self.countinfo, self.probinfo, ! self.nspam, self.nham) def __setstate__(self, t): if t[0] != PICKLE_VERSION: raise ValueError("Can't unpickle -- version %s unknown" % t[0]) ! self.countinfo, self.probinfo, self.nspam, self.nham = t[1:] # spamprob() implementations. One of the following is aliased to # spamprob, depending on option settings. *************** *** 143,151 **** P = Q = 1.0 Pexp = Qexp = 0 clues = self._getclues(wordstream) ! for prob, word, record in clues: ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 P *= 1.0 - prob Q *= prob if P < 1e-200: # move back into range --- 137,143 ---- P = Q = 1.0 Pexp = Qexp = 0 clues = self._getclues(wordstream) ! for prob, word in clues: P *= 1.0 - prob Q *= prob if P < 1e-200: # move back into range *************** *** 232,240 **** Hexp = Sexp = 0 clues = self._getclues(wordstream) ! for prob, word, record in clues: ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 S *= 1.0 - prob H *= prob if S < 1e-200: # prevent underflow --- 224,230 ---- Hexp = Sexp = 0 clues = self._getclues(wordstream) ! for prob, word in clues: S *= 1.0 - prob H *= prob if S < 1e-200: # prevent underflow *************** *** 277,283 **** if options.use_chi_squared_combining: spamprob = chi2_spamprob ! 
def learn(self, wordstream, is_spam, update_probabilities=True): """Teach the classifier by example. wordstream is a word stream representing a message. If is_spam is --- 267,273 ---- if options.use_chi_squared_combining: spamprob = chi2_spamprob ! def learn(self, wordstream, is_spam): """Teach the classifier by example. wordstream is a word stream representing a message. If is_spam is *************** *** 294,323 **** """ self._add_msg(wordstream, is_spam) - if update_probabilities: - self.update_probabilities() ! def unlearn(self, wordstream, is_spam, update_probabilities=True): """In case of pilot error, call unlearn ASAP after screwing up. Pass the same arguments you passed to learn(). """ self._remove_msg(wordstream, is_spam) - if update_probabilities: - self.update_probabilities() - - def update_probabilities(self): - """Update the word probabilities in the spam database. - - This computes a new probability for every word in the database, - so can be expensive. learn() and unlearn() update the probabilities - each time by default. Thay have an optional argument that allows - to skip this step when feeding in many messages, and in that case - you should call update_probabilities() after feeding the last - message and before calling spamprob(). - """ nham = float(self.nham or 1) nspam = float(self.nspam or 1) --- 284,300 ---- """ self._add_msg(wordstream, is_spam) ! def unlearn(self, wordstream, is_spam): """In case of pilot error, call unlearn ASAP after screwing up. Pass the same arguments you passed to learn(). """ self._remove_msg(wordstream, is_spam) + # Compute the probability reflected by a set of counts. + def _compute_probability(self, record): nham = float(self.nham or 1) nspam = float(self.nspam or 1) *************** *** 330,406 **** S = options.unknown_word_strength StimesX = S * options.unknown_word_prob ! for word, record in self.wordinfo.iteritems(): ! # Compute p(word) = prob(msg is spam | msg contains word). ! 
# This is the Graham calculation, but stripped of biases, and ! # stripped of clamping into 0.01 thru 0.99. The Bayesian ! # adjustment following keeps them in a sane range, and one ! # that naturally grows the more evidence there is to back up ! # a probability. ! hamcount = record.hamcount ! assert hamcount <= nham ! hamratio = hamcount / nham ! ! spamcount = record.spamcount ! assert spamcount <= nspam ! spamratio = spamcount / nspam ! ! prob = spamratio / (hamratio + spamratio) ! # Now do Robinson's Bayesian adjustment. ! # ! # s*x + n*p(w) ! # f(w) = -------------- ! # s + n ! # ! # I find this easier to reason about like so (equivalent when ! # s != 0): ! # ! # x - p ! # p + ------- ! # 1 + n/s ! # ! # IOW, it moves p a fraction of the distance from p to x, and ! # less so the larger n is, or the smaller s is. ! # Experimental: ! # Picking a good value for n is interesting: how much empirical ! # evidence do we really have? If nham == nspam, ! # hamcount + spamcount makes a lot of sense, and the code here ! # does that by default. ! # But if, e.g., nham is much larger than nspam, p(w) can get a ! # lot closer to 0.0 than it can get to 1.0. That in turn makes ! # strong ham words (high hamcount) much stronger than strong ! # spam words (high spamcount), and that makes the accidental ! # appearance of a strong ham word in spam much more damaging than ! # the accidental appearance of a strong spam word in ham. ! # So we don't give hamcount full credit when nham > nspam (or ! # spamcount when nspam > nham): instead we knock hamcount down ! # to what it would have been had nham been equal to nspam. IOW, ! # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW, ! # we don't "believe" any count to an extent more than ! # min(nspam, nham) justifies. ! ! n = hamcount * spam2ham + spamcount * ham2spam ! prob = (StimesX + n * prob) / (S + n) ! ! if record.spamprob != prob: ! record.spamprob = prob ! # The next seemingly pointless line appears to be a hack ! 
# to allow a persistent db to realize the record has changed. ! self.wordinfo[word] = record ! ! def clearjunk(self, oldesttime): ! """Forget useless wordinfo records. This can shrink the database size. ! ! A record for a word will be retained only if the word was accessed ! at or after oldesttime. ! """ ! wordinfo = self.wordinfo ! tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime] ! for w in tonuke: ! del wordinfo[w] # NOTE: Graham's scheme had a strange asymmetry: when a word appeared # n>1 times in a single message, training added n to the word's hamcount --- 307,373 ---- S = options.unknown_word_strength StimesX = S * options.unknown_word_prob ! # Compute p(word) = prob(msg is spam | msg contains word). ! # This is the Graham calculation, but stripped of biases, and ! # stripped of clamping into 0.01 thru 0.99. The Bayesian ! # adjustment following keeps them in a sane range, and one ! # that naturally grows the more evidence there is to back up ! # a probability. ! hamcount = record.hamcount ! assert hamcount <= nham ! hamratio = hamcount / nham ! ! spamcount = record.spamcount ! assert spamcount <= nspam ! spamratio = spamcount / nspam ! prob = spamratio / (hamratio + spamratio) ! # Now do Robinson's Bayesian adjustment. ! # ! # s*x + n*p(w) ! # f(w) = -------------- ! # s + n ! # ! # I find this easier to reason about like so (equivalent when ! # s != 0): ! # ! # x - p ! # p + ------- ! # 1 + n/s ! # ! # IOW, it moves p a fraction of the distance from p to x, and ! # less so the larger n is, or the smaller s is. ! # Experimental: ! # Picking a good value for n is interesting: how much empirical ! # evidence do we really have? If nham == nspam, ! # hamcount + spamcount makes a lot of sense, and the code here ! # does that by default. ! # But if, e.g., nham is much larger than nspam, p(w) can get a ! # lot closer to 0.0 than it can get to 1.0. That in turn makes ! # strong ham words (high hamcount) much stronger than strong ! 
# spam words (high spamcount), and that makes the accidental ! # appearance of a strong ham word in spam much more damaging than ! # the accidental appearance of a strong spam word in ham. ! # So we don't give hamcount full credit when nham > nspam (or ! # spamcount when nspam > nham): instead we knock hamcount down ! # to what it would have been had nham been equal to nspam. IOW, ! # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW, ! # we don't "believe" any count to an extent more than ! # min(nspam, nham) justifies. ! ! n = hamcount * spam2ham + spamcount * ham2spam ! prob = (StimesX + n * prob) / (S + n) ! ! return prob ! ! # Forget all the cached probability information. ! # This is usually done in the process of learning or unlearning, ! # since changing nham and nspam changes the probability of nearly ! # every word in the database. ! def _wipe_probinfo(self): ! self.probinfo = {} # NOTE: Graham's scheme had a strange asymmetry: when a word appeared # n>1 times in a single message, training added n to the word's hamcount *************** *** 427,447 **** self.nspam += 1 else: self.nham += 1 ! wordinfo = self.wordinfo ! wordinfoget = wordinfo.get ! now = time.time() for word in Set(wordstream): ! record = wordinfoget(word) if record is None: ! record = self.WordInfoClass(now) if is_spam: record.spamcount += 1 else: record.hamcount += 1 # Needed to tell a persistent DB that the content changed. ! wordinfo[word] = record def _remove_msg(self, wordstream, is_spam): if is_spam: --- 394,414 ---- self.nspam += 1 else: self.nham += 1 + self._wipe_probinfo() ! countinfo = self.countinfo ! countinfoget = countinfo.get for word in Set(wordstream): ! record = countinfoget(word) if record is None: ! record = self.CountInfoClass() if is_spam: record.spamcount += 1 else: record.hamcount += 1 # Needed to tell a persistent DB that the content changed. ! 
countinfo[word] = record def _remove_msg(self, wordstream, is_spam): if is_spam: *************** *** 452,462 **** if self.nham <= 0: raise ValueError("non-spam count would go negative!") self.nham -= 1 ! wordinfo = self.wordinfo ! wordinfoget = wordinfo.get for word in Set(wordstream): ! record = wordinfoget(word) if record is not None: if is_spam: if record.spamcount > 0: --- 419,430 ---- if self.nham <= 0: raise ValueError("non-spam count would go negative!") self.nham -= 1 + self._wipe_probinfo() ! countinfo = self.countinfo ! countinfoget = countinfo.get for word in Set(wordstream): ! record = countinfoget(word) if record is not None: if is_spam: if record.spamcount > 0: *************** *** 465,474 **** if record.hamcount > 0: record.hamcount -= 1 if record.hamcount == 0 == record.spamcount: ! del wordinfo[word] else: # Needed to tell a persistent DB that the content changed. ! wordinfo[word] = record def _getclues(self, wordstream): mindist = options.minimum_prob_strength --- 433,454 ---- if record.hamcount > 0: record.hamcount -= 1 if record.hamcount == 0 == record.spamcount: ! del countinfo[word] else: # Needed to tell a persistent DB that the content changed. ! countinfo[word] = record ! ! # Handle the generation and caching of the spamprob values. ! def _getprobability(self, word): ! record = self.probinfo.get(word) ! if record is None: ! counts = self.countinfo.get(word) ! if counts is None: ! return options.unknown_word_prob ! record = self.ProbInfoClass() ! record.spamprob = self._compute_probability(counts) ! self.probinfo[word] = record ! return record.spamprob def _getclues(self, wordstream): mindist = options.minimum_prob_strength *************** *** 477,497 **** clues = [] # (distance, prob, word, record) tuples pushclue = clues.append ! wordinfoget = self.wordinfo.get ! now = time.time() for word in Set(wordstream): ! record = wordinfoget(word) ! if record is None: ! prob = unknown ! else: ! record.atime = now ! 
prob = record.spamprob distance = abs(prob - 0.5) if distance >= mindist: ! pushclue((distance, prob, word, record)) clues.sort() if len(clues) > options.max_discriminators: del clues[0 : -options.max_discriminators] ! # Return (prob, word, record). return [t[1:] for t in clues] --- 457,471 ---- clues = [] # (distance, prob, word, record) tuples pushclue = clues.append ! probget = self._getprobability for word in Set(wordstream): ! prob = probget(word) distance = abs(prob - 0.5) if distance >= mindist: ! pushclue((distance, prob, word)) clues.sort() if len(clues) > options.max_discriminators: del clues[0 : -options.max_discriminators] ! # Return (prob, word). return [t[1:] for t in clues] From popiel@wolfskeep.com Wed Nov 20 23:33:45 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 20 Nov 2002 15:33:45 -0800 Subject: [Spambayes] Better optimization loop In-Reply-To: Message from Neale Pickett of "20 Nov 2002 13:16:25 PST." References: <3DDB7D3D.1020306@hooft.net> Message-ID: <20021120233345.75B1DF5A7@cashew.wolfskeep.com> In message: Neale Pickett writes: >So then, "Rob W.W. Hooft" is all like: > >> Another speedup I could use is a version of Bayes that calculates the >> spamprob from the numbers on demand instead of calculating them for >> all words everytime. This pays of for all cases where the training >> batch is very small (~1 message). > >Funny you should bring that up, Rob, because I happen to be working on >exactly that. The only way I could think to do it was to pass in a new >option to Bayes.learn() and Bayes.unlearn(). Argh. I was working on it, too... hence the patch I just sent out. Oh, well... no big deal. It looks like our implementations are significantly different, though. Might be worth looking at both and seeing which is better. 
- Alex From neale@woozle.org Thu Nov 21 00:13:44 2002 From: neale@woozle.org (Neale Pickett) Date: 20 Nov 2002 16:13:44 -0800 Subject: [Spambayes] Better optimization loop In-Reply-To: <20021120233345.75B1DF5A7@cashew.wolfskeep.com> References: <3DDB7D3D.1020306@hooft.net> <20021120233345.75B1DF5A7@cashew.wolfskeep.com> Message-ID: So then, "T. Alexander Popiel" is all like: > Argh. I was working on it, too... hence the patch I just sent out. > Oh, well... no big deal. It looks like our implementations are > significantly different, though. Might be worth looking at both > and seeing which is better. I think what you did is a little closer to what Rob suggested to me in response. It sounds like a pretty good idea to me. What I've been doing in my idle time for the past few hours is playing around with having the WordInfo class compute its own probability. I did this by defining two new methods:

    def probability(self):
        if not self.spamprob:
            self.update_probability()
        return self.spamprob

    def update_probability(self, nham, nspam):
        [basically the same code as Bayes.update_probabilities]

My idea was that you'd have to score the probability for each word whenever you use it first, but after that the probability is cached. Long-running things like the pop proxy will get the benefit of the cached probabilities, and short-lived things like hammiefilter get much faster training, and only slightly slower scoring. At least, that's what I expect. I haven't tested this yet. From popiel@wolfskeep.com Thu Nov 21 01:31:50 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 20 Nov 2002 17:31:50 -0800 Subject: [Spambayes] Better optimization loop In-Reply-To: Message from Neale Pickett of "20 Nov 2002 16:13:44 PST."
References: <3DDB7D3D.1020306@hooft.net> <20021120233345.75B1DF5A7@cashew.wolfskeep.com> Message-ID: <20021121013150.925D0F5A7@cashew.wolfskeep.com> In message: Neale Pickett writes: > >What I've been doing in my idle time for the past few hours is playing >around with having the WordInfo class compute its own probability. [snip] >My idea was that you'd have to score the probability for each word >whenever you use it first, but after that the probability is cached. >Long-running things like the pop proxy will get the benefit of the >cached probabilities, and short-lived things like hammiefilter get much >faster training, and only slightly slower scoring. At least, that's >what I expect. I haven't tested this yet. What this seems to lack is a good (cheap) way to invalidate the cache. Since changing the amount of training data affects the Bayesian adjustment to the probability for just about every word in the database, being able to invalidate the cache is important. (Yes, I know I keep harping on this, but a lot of the ideas circulating on this topic seem to ignore it.) FWIW, I did a small time test on the patch I posted... and it seems to run marginally faster than the original code in the classic timcv setting. I think that getting rid of tracking the timestamps (and making the change non-optional, unlike the first buggy version I mentioned about a week ago) offset the added work of checking multiple places on a cache miss. Of course, it'll be much faster than dealing with update_probabilities in the fine-grained train-a-few, classify-a-few, train-a-few-again setting... but I haven't actually tested that. I need to do that.
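To make the invalidation point above concrete, here is a minimal sketch in the spirit of the posted patch, not the patch itself. TinyClassifier and its attribute names are made up for illustration, the two constants are assumed stand-ins for options.unknown_word_prob and options.unknown_word_strength, and n is simplified to spamcount + hamcount rather than the balanced count the real code uses. It tries to show why the whole cache has to go: training on a message that never mentions "cheap" still moves the score for "cheap".

```python
# Assumed stand-ins for options.unknown_word_prob / unknown_word_strength.
UNKNOWN_PROB = 0.5
UNKNOWN_STRENGTH = 0.45


class TinyClassifier(object):
    """Illustrative only: counts are kept, probabilities are a throwaway cache."""

    def __init__(self):
        self.counts = {}   # word -> [spamcount, hamcount]
        self.probs = {}    # word -> cached spamprob
        self.nspam = self.nham = 0

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        # nspam/nham changed, so nearly every cached probability is stale.
        # Dropping the whole dict is the cheap way to invalidate.
        self.probs = {}
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[0 if is_spam else 1] += 1

    def spamprob(self, word):
        prob = self.probs.get(word)
        if prob is None:
            record = self.counts.get(word)
            if record is None:
                return UNKNOWN_PROB          # never seen in training
            prob = self.probs[word] = self._compute(record)
        return prob

    def _compute(self, record):
        spamcount, hamcount = record
        spamratio = spamcount / float(self.nspam or 1)
        hamratio = hamcount / float(self.nham or 1)
        p = spamratio / (hamratio + spamratio)
        n = spamcount + hamcount             # simplified, see lead-in
        return (UNKNOWN_STRENGTH * UNKNOWN_PROB + n * p) / (UNKNOWN_STRENGTH + n)


c = TinyClassifier()
c.learn(["cheap", "pills"], True)
c.learn(["cheap", "meeting"], False)
before = c.spamprob("cheap")
c.learn(["lunch"], False)       # "cheap" does not appear in this message
after = c.spamprob("cheap")     # recomputed because the cache was dropped
print(before, after)
```

If the cache were kept across that last learn() call, "cheap" would go on being scored with the old nham, which is the failure mode being harped on above.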
- Alex From gustav@morpheus.demon.co.uk Wed Nov 20 22:57:56 2002 From: gustav@morpheus.demon.co.uk (Paul Moore) Date: Wed, 20 Nov 2002 22:57:56 +0000 Subject: [Spambayes] New web training interface for pop3proxy References: <15833.20589.376685.686723@montanaro.dyndns.org> <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com> <15834.15618.465106.671756@montanaro.dyndns.org> <3DDBCFA5.8070607@ActiveState.com> <65ur27eo.fsf@morpheus.demon.co.uk> Message-ID: Paul Moore writes: > Do I need to rebuild the database after upgrading? I didn't, and the > user interface said "Total emails trained: Spam: 0 Ham: 0". This > doesn't tally with reality - I'd trained on a batch of messages (using > hammie.py) before starting (BTW, adding a bulk training interface > might be nice, although using hammie.py seems to work OK in practice). Rebuilding seems to have fixed things, although I don't know quite why. One thing, when I went to the review screen and trained on the messages there, I got an exception:

>pop3proxy.py -l 8110 -d -b localhost
Loading database... Done.
BayesProxyListener listening on port 8110.
UserInterfaceListener listening on port 8880.
error: uncaptured python exception, closing channel <__main__.UserInterface connected at 0x8e5a20> (exceptions.AttributeError:'tuple' object has no attribute 'hamcount' [C:\Python22\lib\asyncore.py|poll|99] [C:\Python22\lib\asyncore.py|handle_read_event|394] [C:\Python22\lib\asynchat.py|handle_read|112] [C:\Applications\Spambayes\pop3proxy.py|found_terminator|720] [C:\Applications\Spambayes\pop3proxy.py|onRequest|745] [C:\Applications\Spambayes\pop3proxy.py|onReview|980] [C:\Applications\Spambayes\classifier.py|update_probabilities|340])

Sorry, no time to diagnose further just now. Paul.
-- This signature intentionally left blank From tim.one@comcast.net Thu Nov 21 02:12:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 20 Nov 2002 21:12:01 -0500 Subject: [Spambayes] Outlook weirdness In-Reply-To: Message-ID: [Brad Morgan] > Outlook (Internet-only) has been occasionally "hanging around" on me > long before I tried any sort of spam filtering. Sometimes it hangs > around collecting messages, sometimes it just prevents another > version from starting up. Same here, of course. About a year ago, it did this every time for a solid week, and lost all my toolbar customizations each time. Sometimes it also lost my rules, ditto my view customizations. Then it went away. Then it came back. Etc. I haven't noticed any increase in frequency or severity since I started using the addin. > I haven't found any pattern to the unsuccessful shutdowns and > Microsoft certainly hasn't either for all the patches they've put out. I've noticed that it happens only after I bring up Outlook. > IMO, it's their bug plain and simple. It seems that way. We "should be" more paranoid, though. For example, spin off a thread to rewrite the pickle to disk whenever it gets dirty, and rename the last N pickle files instead of overwriting them. MS doesn't write 5 copies of the registry for fun either <0.9 wink>. A persistent DB may eventually make more sense too. I know! Let's store the words and spamprobs in the registry. From tim@fourstonesExpressions.com Thu Nov 21 02:37:55 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed, 20 Nov 2002 20:37:55 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: Neale, where are we on the playground stuff? We're getting out of sync with pop3proxy... I suppose cvs will be able to merge to a point. I'm ok with making load() a noop and store() = sync() in the dbdict class. We could then do away with lsdbdict. This is much cleaner. Sans objections, I'll do that by tomorrow night.
Once that's done, what's left? - Tims 11/19/2002 11:18:45 PM, Neale Pickett wrote: >So then, Tim Stone - Four Stones Expressions is all like: > >> How about PersistentClassifier? > >Yech. Since the things are kinda doing what the standard shelve module >does, and we keep calling them "stores", how about "store"? > > > - Tim www.fourstonesExpressions.com From tim.one@comcast.net Thu Nov 21 03:15:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 20 Nov 2002 22:15:50 -0500 Subject: [Spambayes] Just visiting Message-ID: Making major progress on Zope3 has become a top priority for my (and Guido's, and Jeremy's, and Barry's) employer, to an extent that precludes Zope Corp work on projects that don't directly contribute to that. This project doesn't have beans to do with Zope, so my involvement here has become a strictly "spare time" thing. Since good news always comes in threes, the first alpha of Python 2.3 is due out at the end of the year, and large parts of what I wanted to accomplish there have also become "spare time". In most of my spare time since learning this, I've been waiting for the third piece of good news. I won't vanish, but I have to cut waaaaaay back on this list. I'm proud of what we've all accomplished here, and would hate to see it die -- especially while it still sucks too much for my sisters to use. In case you're wondering, I approve of switching the default WordInfo thingie to compute probs on demand, and will take the remaining minute of today's spare time to remind that 2.2's properties would still allow for using .spamprob notation. Like

>>> class Whatever(object):
...     def _implementation_of_x_read_attr(self):
...         return 2.0 * 4
...     x = property(_implementation_of_x_read_attr)
...
>>> w = Whatever()
>>> w.x
8.0
>>>

s/x/spamprob/ and set your imagination free.
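Doing that substitution by hand, one possible shape for it looks like the sketch below. This is only an illustration under assumptions: Totals is a hypothetical holder for nspam and nham, the record keeps a reference to it, and the probability is the plain spam/ham ratio with the Bayesian adjustment left out to keep it short. Reading .spamprob computes and caches the value, and `del record.spamprob` throws the cached value away.

```python
class Totals(object):
    """Hypothetical holder for the message totals."""

    def __init__(self, nspam=0, nham=0):
        self.nspam = nspam
        self.nham = nham


class WordInfo(object):
    """Sketch only: counts are stored, spamprob is a computed property."""

    def __init__(self, totals, spamcount=0, hamcount=0):
        self.totals = totals
        self.spamcount = spamcount
        self.hamcount = hamcount
        self._cached = None

    def _get_spamprob(self):
        if self._cached is None:
            if self.spamcount == 0 and self.hamcount == 0:
                return 0.5                   # no evidence either way
            spamratio = self.spamcount / float(self.totals.nspam or 1)
            hamratio = self.hamcount / float(self.totals.nham or 1)
            # Plain ratio only; the Bayesian adjustment is omitted here.
            self._cached = spamratio / (hamratio + spamratio)
        return self._cached

    def _del_spamprob(self):
        self._cached = None                  # forget the cached value

    # No setter is given, so assigning to .spamprob raises AttributeError.
    spamprob = property(_get_spamprob, None, _del_spamprob,
                        "Spam probability, computed on first read.")


totals = Totals(nspam=4, nham=4)
record = WordInfo(totals, spamcount=3, hamcount=1)
first = record.spamprob          # computed now, then cached
totals.nham = 8
stale = record.spamprob          # still the cached value
del record.spamprob
fresh = record.spamprob          # recomputed against the new totals
```

Client code keeps writing record.spamprob as before; whether anything is stored behind that name becomes an implementation detail.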
.spamprob can do anything, and even different things when getting, setting, or deleting:

>>> print property.__doc__
property(fget=None, fset=None, fdel=None, doc=None) -> property attribute

fget is a function to be used for getting an attribute value, and likewise
fset is a function for setting, and fdel a function for del'ing, an
attribute. Typical use is to define a managed attribute x:
class C(object):
    def getx(self): return self.__x
    def setx(self, value): self.__x = value
    def delx(self): del self.__x
    x = property(getx, setx, delx, "I'm the 'x' property.")
>>>

From rjdsnet@yahoo.com Wed Nov 20 15:41:16 2002 From: rjdsnet@yahoo.com (Ranieri J D Severiano) Date: Wed, 20 Nov 2002 13:41:16 -0200 Subject: [Spambayes] hammiefilter and hammiecli improvements Message-ID: <20021120154116.GA1052@uyrapuru> Hi, What you think about add the "-d" option to hammiefilter.py ?
-----------------
def main():
    action = filter
    opts, args = getopt.getopt(sys.argv[1:], 'hngsd')   ### HERE
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-g':
            action = train_ham
        elif opt == '-s':
            action = train_spam
        elif opt == "-n":
            action = newdb
        elif opt == '-d':        ###\
            global USEDB         ### - AND HERE
            USEDB = True         ###/
    action()
-----------------
My other suggestion is to fix the print statement of hammiecli.py :
-----------------
def main():
    msg = sys.stdin.read()
    try:
        x = xmlrpclib.ServerProxy(RPCBASE)
        m = xmlrpclib.Binary(msg)
        out = x.filter(m)
        print out.data   ### HERE
    except:
        if __debug__:
            import traceback
            traceback.print_exc()
        print msg
-----------------
Now, you can get the message and pass it to procmail or another filter. Thanks, Ranieri J D Severiano From neale@woozle.org Thu Nov 21 04:04:26 2002 From: neale@woozle.org (Neale Pickett) Date: 20 Nov 2002 20:04:26 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: References: Message-ID: So then, Tim Stone - Four Stones Expressions is all like: > Neale, where are we on the playground stuff?
We're getting out of sync with > pop3proxy... I suppose cvs will be able to merge to a point. I'm currently entwined with mucking the heck out of WordInfo. I've got a neato scheme based on Alex's patch and comments where the WordInfo classes still compute their own probabilities, but also keep a revision number which is compared against a MetaInfo class. The neato thing here, at least from the perspective of DBDict, is that all the meta information is now bundled up in a handy object. > I'm ok with making load() a noop and store() = sync() in the dbdict > class. We could then do away with lsdbdict. This is much cleaner. > Sans objections, I'll do that by tomorrow night. Yeah, go for it. I knew I'd wear you down ;) > Once that's done, what's left? Nothing, really, we just have to present a summary of all the changes we've made, get sign-off, and then I'll merge back into head. I'd also like to take the hedge trimmers to the options class, option names like "hammiefilter_persistent_storage_file" are a little long, methinks. But that can wait. Neale From neale@woozle.org Thu Nov 21 04:38:43 2002 From: neale@woozle.org (Neale Pickett) Date: 20 Nov 2002 20:38:43 -0800 Subject: [Spambayes] hammiefilter and hammiecli improvements In-Reply-To: <20021120154116.GA1052@uyrapuru> References: <20021120154116.GA1052@uyrapuru> Message-ID: So then, Ranieri J D Severiano is all like: > Hi, > What you think about add the "-d" option to hammiefilter.py ? Hi Ranieri. Good eye! I think, at least with hammiefilter, you shouldn't even have an option--USEDB=True should be the only thing you can do. Pickling/unpickling is just too slow for something which is supposed to run quickly and then exit. I'm going to do this soon; I just haven't figured out yet how to do this without messing up other things like pop3proxy. > My other suggestion is to fix the print statement of hammiecli.py : > print out.data ### HERE Yes, that is an important detail, isn't it? I've checked in a fix. Thanks! 
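For anyone reading along, here is a minimal sketch of the behaviour the hammiecli.py fix is after: the XML-RPC call hands back a Binary wrapper, so it is the wrapper's .data that has to reach stdout, and on any failure the original message is passed through untouched. This is not the checked-in code. It is written for a current Python (xmlrpc.client rather than xmlrpclib), the RPCBASE address is a placeholder, and filter_message is a name invented for the illustration.

```python
import sys
import xmlrpc.client

# Placeholder address; the real value of RPCBASE is not given in the thread.
RPCBASE = "http://localhost:65000"


def filter_message(msg, proxy):
    """Return the filtered message bytes, or the original bytes on failure.

    `proxy` is anything with a filter() method that accepts and returns an
    xmlrpc.client.Binary, such as a ServerProxy pointed at a hammie server.
    """
    try:
        out = proxy.filter(xmlrpc.client.Binary(msg))
        # The reply is a Binary wrapper; the message itself lives in .data.
        return out.data
    except Exception:
        # Never lose mail: fall back to the unfiltered message.
        return msg


def main():
    # Read the message from stdin and write the result to stdout as bytes,
    # so the output can be piped on to procmail or another filter.
    msg = sys.stdin.buffer.read()
    proxy = xmlrpc.client.ServerProxy(RPCBASE)
    sys.stdout.buffer.write(filter_message(msg, proxy))
```

Calling main() from a procmail recipe would then behave like the snippet quoted above, assuming a server that exposes a filter method is listening at RPCBASE.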
Neale From popiel@wolfskeep.com Thu Nov 21 05:36:28 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 20 Nov 2002 21:36:28 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message from Neale Pickett of "20 Nov 2002 20:04:26 PST." References: Message-ID: <20021121053628.DD924F5A7@cashew.wolfskeep.com> In message: Neale Pickett writes: > >I'm currently entwined with mucking the heck out of WordInfo. I've got >a neato scheme based on Alex's patch and comments where the WordInfo >classes still compute their own probabilities, but also keep a revision >number which is compared against a MetaInfo class. Eww, do we gotta? I thought I was trying to make the DB smaller. ;-) But yes, revision-stamping everything will work just as well as my patch... though I suspect it'll be slower, just 'cause it's digging through more data. Not that speed is a big issue for anything other than bulk testing. >The neato thing here, at least from the perspective of DBDict, is that >all the meta information is now bundled up in a handy object. This is unalloyed good. - Alex (opinionated) From neale@woozle.org Thu Nov 21 06:08:00 2002 From: neale@woozle.org (Neale Pickett) Date: 20 Nov 2002 22:08:00 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: <20021121053628.DD924F5A7@cashew.wolfskeep.com> References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> Message-ID: So then, "T. Alexander Popiel" is all like: > In message: > Neale Pickett writes: > > > >I'm currently entwined with mucking the heck out of WordInfo. I've got > >a neato scheme based on Alex's patch and comments where the WordInfo > >classes still compute their own probabilities, but also keep a revision > >number which is compared against a MetaInfo class. > > Eww, do we gotta? I thought I was trying to make the DB smaller. ;-) Ah, but the only thing *stored* is (spamcount, hamcount). The probability is calculated the first time you ask for it. 
If you don't update nspam or nham, the next time you ask for it it gives the cached value. So the database is small, but you still get the in-memory probability caching if you're using a pickle or ZODB. But now that words are computing their own probabilities, the Bayes class no longer does anything Bayesian. I guess it's time to rename that class to Classifier. > >The neato thing here, at least from the perspective of DBDict, is > >that all the meta information is now bundled up in a handy object. > > This is unalloyed good. "unalloyed" is a superb word, Alex. It reminds me that I should be studying for the GRE instead of hacking spam classifier code :) Neale From skip@pobox.com Thu Nov 21 06:32:01 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 21 Nov 2002 00:32:01 -0600 Subject: [Spambayes] Is it safe to get back in the water? Message-ID: <15836.32225.108330.152806@montanaro.dyndns.org> I've been too timid to attempt an update from cvs followed by an install for the last several days. The changes have been flying too fast for me to keep up with. Assuming my hammie.db file (the anydbm version) hasn't been modified in a week, can I assume enough breakage has occurred that I will have to retrain from scratch? What about hammie.py? I currently execute hammie.py -f -d -p $HOME/hammie.db from my procmailrc file. Does a simple hammiefilter.py do something similar? How do I tell it what .db file to use? thx, Skip From skip@pobox.com Thu Nov 21 07:09:04 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 21 Nov 2002 01:09:04 -0600 Subject: [Spambayes] Joe-job update Message-ID: <15836.34448.30442.967802@montanaro.dyndns.org> This really has nothing to do with spambayes. I'm just bringing personal closure to the joe-job topic. 
I kept getting flurries of bounces and they seemed to all have one thing in common - the spammer was coming from some host in the client.attbi.com domain:

    Received: from mx.boston.juno.com (12-254-180-157.client.attbi.com [12.254.180.157])
        by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id gAL3osVN024610
        for ; Wed, 20 Nov 2002 21:55:35 -0600

Funny thing, I'm an AT&T broadband customer, and in order to allow my laptop at home to push mail through mail.mojam.com I had once upon a time added a RELAY line to /etc/mail/access:

    client.attbi.com RELAY

Damn! It looks like my mail exchanger was being badly abused. I also had

    montanaro.dyndns.org RELAY

(the hostname of my laptop, which maps to whatever IP I happen to be at for the moment), but it turns out this did no good. Once I removed the client.attbi.com line I discovered I couldn't route mail through mail.mojam.com:

    MAIL From:
    250 2.1.0 ... Sender ok
    RCPT To:
    550 5.7.1 ... Relaying denied

After a moment's thought, the solution turned out to be simple. Root's cron now runs a little fix-access script that adjusts /etc/mail/access to allow relaying from precisely the IP address I happen to be at for the moment:

    ip=`host montanaro.dyndns.org | sed -e 's/.* //'`
    cd /etc/mail
    echo "$ip RELAY" > access.tmp
    cat /etc/mail/access.base >> access.tmp
    mv access.tmp access
    make

This should help stem the tide of my own joe-job flurries. I don't know if it will help anyone else, but I pass this solution along just in case. Skip From neale@woozle.org Thu Nov 21 07:12:01 2002 From: neale@woozle.org (Neale Pickett) Date: 20 Nov 2002 23:12:01 -0800 Subject: [Spambayes] Is it safe to get back in the water?
In-Reply-To: <15836.32225.108330.152806@montanaro.dyndns.org> References: <15836.32225.108330.152806@montanaro.dyndns.org> Message-ID: So then, Skip Montanaro is all like: > Assuming my hammie.db file (the anydbm version) hasn't been modified > in a week, can I assume enough breakage has occurred that I will have > to retrain from scratch? It *should* still work in the HEAD branch. Back it up first though :) The only change that will affect what you're doing is an optimization to how hammie stores WordInfo objects. So you should see the size of your database drop by about 50% the next time you train it. > What about hammie.py? I currently execute > > hammie.py -f -d -p $HOME/hammie.db > > from my procmailrc file. Does a simple > > hammiefilter.py > > do something similar? How do I tell it what .db file to use? If hammie.py is working for you, don't use hammiefilter.py yet. Eventually, hammiefilter.py will be what you want, but I'll make a big loud announcements before I check in anything that'll require you to alter anything. The heaps of checkins you've seen me make recently have all been in a branch. The idea is that we (well, just Tim Stone and I so far) will get everything straightened out over there before we merge back in. At that point we should have a pretty good idea about what's going to be messed up ;) Neale From bkc@murkworks.com Thu Nov 21 00:00:42 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 20 Nov 2002 19:00:42 -0500 Subject: [Spambayes] LJ article In-Reply-To: <200211202149.gAKLnLw28459@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DDBDBDA.29999.A120C12@localhost> On 20 Nov 2002 at 16:49, Guido van Rossum wrote: > If you're interested, write to Don Marti for > details. DON'T WRITE ME! Perhaps you should also post on this list if you write to Don .. ? -- This message was sent with an unlicensed evaluation version of Novell NetMail. Please see http://www.netmail.com/ for details. 
From vanhorn@whidbey.com Thu Nov 21 09:44:44 2002 From: vanhorn@whidbey.com (G. Armour Van Horn) Date: Thu, 21 Nov 2002 01:44:44 -0800 Subject: [Spambayes] LJ article References: <3DDBDBDA.29999.A120C12@localhost> Message-ID: <3DDCAB0C.34D3C928@whidbey.com> Brad Clements wrote: > On 20 Nov 2002 at 16:49, Guido van Rossum wrote: > > > If you're interested, write to Don Marti for > > details. DON'T WRITE ME! > > Perhaps you should also post on this list if you write to Don .. ? > > -- Okay, here's what I sent to Don: Don, I haven't done a lot of published writing lately, but I used to write a lot for Byte and a flock of lesser-known magazines (most of which are dead and gone now) and I think I still know how it's done. I have been reading the bulk of the Spambayes mailing list for several weeks anyway (there were just under a hundred subscribers when I joined, if that gives you a time frame). Please let me know what the time frame, word count, deadline, and compensation are and I can be more specific about what I think I could do and how I would approach it. Van -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- From francois.granger@free.fr Thu Nov 21 09:55:07 2002 From: francois.granger@free.fr (François Granger) Date: Thu, 21 Nov 2002 10:55:07 +0100 Subject: [Spambayes] Hourra for pop3proxy ! Message-ID: I am pleased to announce that the current version works like a charm. It is plug & play. First try with two pop server, everything was working as advertised. The training interface is really good. It just need some cosmetic improvements. I'll try to come with some patches for this.
-- Mail is a means of communication. People should ask themselves about the political implications of the choices (or non-choices) of their tools and technologies. For clean mail: -- From sjoerd@acm.org Thu Nov 21 10:08:25 2002 From: sjoerd@acm.org (Sjoerd Mullender) Date: Thu, 21 Nov 2002 11:08:25 +0100 Subject: [Spambayes] Split DB and no update_probabilities In-Reply-To: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com> References: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com> Message-ID: <20021121100825.AFBD974B08@indus.ins.cwi.nl> On Wed, Nov 20 2002 "T. Alexander Popiel" wrote: > I've got a patch to split the word database into two pieces: > one with the ham and spam counts, and one with the spamprobs. > At the same time, I got rid of the timestamps and killcounts, > both of which are rarely used (and can be added back in by > subclasses if they're really wanted). I have one suggestion: in _wipe_probinfo use

    self.probinfo.clear()

instead of

    self.probinfo = {}

That way it is easier to replace the implementation of self.probinfo by something other than a dictionary. -- Sjoerd Mullender From Paul.Moore@atosorigin.com Thu Nov 21 10:11:48 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Thu, 21 Nov 2002 10:11:48 -0000 Subject: [Spambayes] New web training interface for pop3proxy Message-ID: <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com> (This is from memory, as it happened on my home setup and I'm at work now, so I apologise if it's a bit vague). From: Richie Hindle [mailto:richie@entrian.com] > > It's locking up for me. > > Urk. That's bad (and new - no-one's reported that before). And this > happened when you hit the Train button, yes? Yes. > > > There are no messages in the command prompt window - is there any > > way to get it to produce trace messages (looking at the code, the > > answer seems to be "no"...)? > > It's "no".
8-) Like I say, no-one's reported it locking before, and > I've never seen it. You usually get a traceback when something goes > wrong. So your console says something like: > > Loading database... Done BayesProxyListener listening on port 110 . > UserInterfaceListener listening on port 8880 . > > and nothing else, and the process is still running, but you can't > get a page served to your browser? What error message do you get > from the browser? If it's one of those pointless IE error pages, > could you try telnetting to port 8880 and saying "GET / HTTP/1.0"? > Can you even connect with telnet? How about port 110? Exactly as you describe. The browser (IE6) just sits there, doing nothing, with the training interface page still up. The progress bar at the bottom of the screen is (hardly? not?) moving (can't recall for sure if it changed, but it certainly didn't move far in the few minutes I left it...) Basically the sort of behaviour I get when IE is waiting for a response that never comes. I can't do a telnet at the moment to check. Unfortunately, I may not even be able to do one tonight, as I tried rebuilding the database to see if that helped, and it did... I can't recall if I kept the old database, or overwrote it :-( > > Do I need to rebuild the database after upgrading? I didn't, > > and the user interface said "Total emails trained: Spam: 0 Ham: > > 0". This doesn't tally with reality - I'd trained on a batch of > > messages (using hammie.py) before starting > > You shouldn't need to do anything, and even if you did I'd expect it > to give an error rather than quietly failing. Are you using a pickle > or a DBM? In fact, could you send me your bayescustomize.ini and/or > details of the command line you're using (off-list if you prefer)? It was a dbm file. The command line was pop3proxy.py -d -l 8110 localhost (proxying a local POP server on port 110 with the proxy on port 8110 using a DBM file). Working directory was the directory of the program. 
No bayescustomize.ini file. Sorry if this is a bit vague - it all happened late last night, after a pretty hectic evening, so I wasn't in the best mood for debugging :-) Paul. From mhammond@skippinet.com.au Thu Nov 21 12:24:51 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 21 Nov 2002 23:24:51 +1100 Subject: [Spambayes] New export utility for Outlook users. Message-ID: I just checked in Outlook2000/export.py. This is a tool for exporting the folders currently defined in the addin to a standard SpamBayes directory structure. The idea is that you can export your messages to text files, and run the standard tests over your data, just like all the big-boys do See the instructions in the script for more details. Mark. From tim@fourstonesExpressions.com Thu Nov 21 16:51:22 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 21 Nov 2002 10:51:22 -0600 Subject: [Spambayes] http://www.spamarchive.org/ Message-ID: <6Z546JFPOFAYXD82DJ64C7OLTSKF.3ddd0f0a@riven> New anti-spam site just launched at http://www.spamarchive.org/ Anybody know who this is? - Tim www.fourstonesExpressions.com From popiel@wolfskeep.com Thu Nov 21 16:59:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 21 Nov 2002 08:59:27 -0800 Subject: [Spambayes] Split DB and no update_probabilities In-Reply-To: Message from Sjoerd Mullender <20021121100825.AFBD974B08@indus.ins.cwi.nl> References: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com> <20021121100825.AFBD974B08@indus.ins.cwi.nl> Message-ID: <20021121165927.CFE52F5AC@cashew.wolfskeep.com> In message: <20021121100825.AFBD974B08@indus.ins.cwi.nl> Sjoerd Mullender writes: >On Wed, Nov 20 2002 "T. Alexander Popiel" wrote: > >> I've got a patch to split the word database into two pieces: >> one with the ham and spam counts, and one with the spamprobs. 
>> At the same time, I got rid of the timestamps and killcounts, >> both of which are rarely used (and can be added back in by >> subclasses if they're really wanted). > >I have one suggestion: in _wipe_probinfo use > self.probinfo.clear() >instead of > self.probinfo = {} >That way it is easier to replace the implementation of self.probinfo >by something other than a dictionary. Thanks. This is just a bit of my python ignorance showing through. The reason I had broken _wipe_probinfo out into its own method was to allow easier replacement of self.probinfo (I normally despise one-statement methods), but your suggestion makes it even cleaner. - Alex From msergeant@startechgroup.co.uk Thu Nov 21 17:00:45 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Thu, 21 Nov 2002 17:00:45 +0000 Subject: [Spambayes] http://www.spamarchive.org/ References: <6Z546JFPOFAYXD82DJ64C7OLTSKF.3ddd0f0a@riven> Message-ID: <3DDD113D.6000603@startechgroup.co.uk> Tim Stone - Four Stones Expressions said the following on 21/11/02 16:51: > New anti-spam site just launched at http://www.spamarchive.org/ > > Anybody know who this is? It's IronMail, the guys someone asked about yesterday (or was it Tuesday?). Not sure why a big funded company would need to do this, and hide the fact that it's IronMail doing it (not actively or anything, but I'm not sure why they didn't announce it that way - perhaps they got caught on the hop). Check the whois records if you're in doubt. From popiel@wolfskeep.com Thu Nov 21 17:18:00 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 21 Nov 2002 09:18:00 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message from Neale Pickett of "20 Nov 2002 22:08:00 PST." References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> Message-ID: <20021121171801.03720F5AC@cashew.wolfskeep.com> In message: Neale Pickett writes: >So then, "T. 
Alexander Popiel" is all like: > >> In message: >> Neale Pickett writes: >> > >> >I'm currently entwined with mucking the heck out of WordInfo. I've got >> >a neato scheme based on Alex's patch and comments where the WordInfo >> >classes still compute their own probabilities, but also keep a revision >> >number which is compared against a MetaInfo class. >> >> Eww, do we gotta? I thought I was trying to make the DB smaller. ;-) > >Ah, but the only thing *stored* is (spamcount, hamcount). The >probability is calculated the first time you ask for it. If you don't >update nspam or nham, the next time you ask for it it gives the cached >value. So the database is small, but you still get the in-memory >probability caching if you're using a pickle or ZODB. Sounds like there is no caching benefit for one-message-per-invocation situations like running out of procmail, then. Ouch. Unless I'm mistaken, by the time that the probability is being computed in your scheme, the identity of the word has been lost, and thus the probability can't be stored in a secondary database like I had written, either. I suppose that there's enough performance penalties in the procmail scenario (python startup, options loading, other various overhead) that computing all the probabilities from counts is small change. - Alex (overly critical) From neale@woozle.org Thu Nov 21 22:53:17 2002 From: neale@woozle.org (Neale Pickett) Date: 21 Nov 2002 14:53:17 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: <20021121171801.03720F5AC@cashew.wolfskeep.com> References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> <20021121171801.03720F5AC@cashew.wolfskeep.com> Message-ID: So then, "T. Alexander Popiel" is all like: > Sounds like there is no caching benefit for one-message-per-invocation > situations like running out of procmail, then. Ouch. 
Unless I'm > mistaken, by the time that the probability is being computed in your > scheme, the identity of the word has been lost, and thus the > probability can't be stored in a secondary database like I had > written, either. I suppose that there's enough performance penalties > in the procmail scenario (python startup, options loading, other > various overhead) that computing all the probabilities from counts is > small change. Well it's easy enough to test. I modified my WordInfo to store the probability in the store. First off, this makes the database twice as big (floats take more space to pickle than ints), and adds about a second to scoring a batch of 200 messages. So there is some overhead involved. I trained 200 messages with both methods (storing prob and calculating it each time). Here were the times in the trials:

         storing     not
    1     9.308     9.298
    2     9.573     9.328
    3     9.292     9.307
    4     9.290     9.288
    5     9.306     9.466

I don't think there's a clear winner here. Given that we get similar times, but a database half the size, I'm still inclined to go with not storing probabilities. Neale From popiel@wolfskeep.com Thu Nov 21 23:06:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 21 Nov 2002 15:06:27 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message from Neale Pickett of "21 Nov 2002 14:53:17 PST." References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> <20021121171801.03720F5AC@cashew.wolfskeep.com> Message-ID: <20021121230627.9A841F5AC@cashew.wolfskeep.com> In message: Neale Pickett writes: > >I don't think there's a clear winner here. Given that we get similar >times, but a database half the size, I'm still inclined to go with not >storing probabilities. Sounds good. (Not only would the probability have to be stored, but the revision stamp would have to be stored, for even more DB bloat. Blech.)
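The scheme settled on in this thread (store only the two counts, compute the probability on demand, and reuse it in memory until the training totals change) might look roughly like the sketch below. It is an illustration, not the branch's code: the class layout, the method names record_message and stored_state, the computations counter, and the smoothing constants 0.45 and 0.5 are all assumptions made for the example. spamprob is exposed through a property, along the lines Tim Peters suggested.

```python
class MetaInfo:
    """Training totals shared by every word, plus a revision stamp."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.revision = 0

    def record_message(self, is_spam):
        # Any change to the totals invalidates every cached probability,
        # so bump the revision each time a message is trained.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        self.revision += 1


class WordInfo:
    # Assumed smoothing constants: weight of the prior, and the prior itself.
    UNKNOWN_STRENGTH = 0.45
    UNKNOWN_PROB = 0.5

    def __init__(self, meta, spamcount=0, hamcount=0):
        self.meta = meta
        self.spamcount = spamcount
        self.hamcount = hamcount
        self._prob = None            # in-memory cache only, never stored
        self._prob_revision = None   # revision the cache was computed at
        self.computations = 0        # instrumentation for the demo below

    def stored_state(self):
        # Only the counts go into the database; no float, no revision stamp.
        return (self.spamcount, self.hamcount)

    def _compute(self):
        self.computations += 1
        n = self.spamcount + self.hamcount
        if n == 0:
            return self.UNKNOWN_PROB
        spamratio = self.spamcount / max(self.meta.nspam, 1)
        hamratio = self.hamcount / max(self.meta.nham, 1)
        p = spamratio / (spamratio + hamratio)
        s, x = self.UNKNOWN_STRENGTH, self.UNKNOWN_PROB
        # Pull the raw estimate toward the prior when evidence is thin.
        return (s * x + n * p) / (s + n)

    def _get_spamprob(self):
        # Recompute only if training has happened since the last lookup.
        if self._prob_revision != self.meta.revision:
            self._prob = self._compute()
            self._prob_revision = self.meta.revision
        return self._prob

    spamprob = property(_get_spamprob)


meta = MetaInfo()
meta.record_message(True)
meta.record_message(False)
word = WordInfo(meta, spamcount=1, hamcount=0)
first = word.spamprob    # computed
second = word.spamprob   # served from the in-memory cache
```

In a one-message-per-invocation setup such as procmail the cache never gets a second hit, which is the cost Alex points out; the timings Neale posted suggest that cost is small.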
- Alex From hupp@cs.wisc.edu Fri Nov 22 00:27:13 2002 From: hupp@cs.wisc.edu (Adam Hupp) Date: Thu, 21 Nov 2002 18:27:13 -0600 Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt macros Message-ID: <20021122002713.GC29009@upl.cs.wisc.edu> I've been working on a store for spambayes that uses the Numeric python extension. It's substantially faster than PersistentBayes and the database is about half the size. A comparison, training on 992 messages:

    PersistentBayes:
    training: 220s
    update_prob: 3.2s
    score 1 msg: .45s
    score 6156 msgs: 58s

    NumericBayes:
    training: 14s
    update_prob: 0.10s
    score 1 msg: .59s
    score 6156 msgs: 49s

There are no modifications to classifier.Bayes, it just uses a new WordInfo class with properties. I also modified hammiefilter to do untraining, retraining, and training on filter results. For example:

    hammiefilter.py --filter --train

The incoming message is scored and filtered. If the result is not "Unsure" the classifier will be trained on it.

    hammiefilter.py --reverse --good --train

The incoming message has previously been incorrectly marked as ham. --reverse will untrain the classifier and --train will retrain it on the message as spam. With these tools it's straightforward to setup macros in mutt to manage false negatives/positives and classify "Unsure" messages. The modified files can be found at: http://www.upl.cs.wisc.edu/~hupp/spambayes.tar.gz hammiefilter requires Optik and the NumericBayes store requires Numeric and MaskedArray (an optional part of Numeric). -Adam From tim@fourstonesExpressions.com Fri Nov 22 00:42:57 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu, 21 Nov 2002 18:42:57 -0600 Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt macros In-Reply-To: <20021122002713.GC29009@upl.cs.wisc.edu> Message-ID: <4YA0NKEBJE4165ICXURO2B087374D.3ddd7d91@riven> Sounds really good, Adam.
Neale Pickett and I have been working on this kind of stuff in a branch named hammie-playground. There have been some substantial changes made there, that'll be merged into the main thread soon. You might want to check there and see how your changes would fit in... I really like your results. Size and speed have been consistent challenges for us. - TimS 11/21/2002 6:27:13 PM, Adam Hupp wrote: > >I've been working on a store for spambayes that uses the Numeric >python extension. It's substantially faster than PersistentBayes and >the database is about half the size. A comparison, training on 992 messages: > >PersistentBayes: >training: 220s >update_prob: 3.2s >score 1 msg: .45s >score 6156 msgs: 58s > >NumericBayes: >training: 14s >update_prob: 0.10s >score 1 msg: .59s >score 6156 msgs: 49s > >There are no modifications to classifier.Bayes, it just uses a new >WordInfo class with properties. > >I also modified hammiefilter to do untraining, retraining, and >training on filter results. For example: > >hammiefilter.py --filter --train > >The incoming message is scored and filtered. If the result is not >"Unsure" the classifier will be trained on it. > > >hammiefilter.py --reverse --good --train > >The incoming message has previously been incorrectly marked as ham. >--reverse will untrain the classifier and --train will retrain it on >the message as spam. > >With these tools it's straightforward to setup macros in mutt to >manage false negatives/positives and classify "Unsure" messages. > >The modified files can be found at: > >http://www.upl.cs.wisc.edu/~hupp/spambayes.tar.gz > >hammiefilter requires Optik and the NumericBayes store requires >Numeric and MaskedArray (and optional part of Numeric). 
> >-Adam > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From neale@woozle.org Fri Nov 22 06:10:08 2002 From: neale@woozle.org (Neale Pickett) Date: 21 Nov 2002 22:10:08 -0800 Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt macros In-Reply-To: <20021122002713.GC29009@upl.cs.wisc.edu> References: <20021122002713.GC29009@upl.cs.wisc.edu> Message-ID: So then, Adam Hupp is all like: > PersistentBayes: > training: 220s > update_prob: 3.2s > score 1 msg: .45s > score 6156 msgs: 58s > > NumericBayes: > training: 14s > update_prob: 0.10s > score 1 msg: .59s > score 6156 msgs: 49s Holy cow! That's impressive! I'm no NumPy expert but it looks like you're taking advantage of some sort of "do this on all elements of an array" function -- what the Cray guys used to call vectorization. I imagine NumPy can optimize that sort of loop much better than straight CPython and you'd get speeds close to that of compiled C. This is a totally killer idea, except that we just decided to move probability computation out to individual WordInfo objects! The thinking was--and testing seems to bear this out--that when most transactions are small incremental updates and single message scoring (instead of batches of messages), it's faster to compute individual word probabilities as they're needed, since it saves a ton of I/O and perhaps a lot of needless computation. On the other hand, this could be of tremendous benefit to long-lived processes like the pop3proxy and the Outlook plugin, which want to keep the whole database around in memory. Adam, would it be possible to abstract the bayesian part of the algorithm (the part done in update_probabilities) so that it could be called either with a NumPy vector operation, or in a one-at-a-time fashion by individual WordInfo classes? 
If you can think of a way to do this, we can throw this in. Even if you can't think of a way to do it, I think it might be worth it to have two implementations of the same algorithm just for this 15x speedup. > I also modified hammiefilter to do untraining, retraining, and > training on filter results. For example: > > hammiefilter.py --filter --train > > The incoming message is scored and filtered. If the result is not > "Unsure" the classifier will be trained on it. > > > hammiefilter.py --reverse --good --train > > The incoming message has previously been incorrectly marked as ham. > --reverse will untrain the classifier and --train will retrain it on > the message as spam. > > With these tools it's straightforward to setup macros in mutt to > manage false negatives/positives and classify "Unsure" messages. That's good stuff. I'll have to check the list archives because I know the issue of auto-training has been discussed and probably beaten into the ground by now. But first I want to get my branch merged in so everybody else can witness my dementia ;) Neale From neale@woozle.org Fri Nov 22 06:33:31 2002 From: neale@woozle.org (Neale Pickett) Date: 21 Nov 2002 22:33:31 -0800 Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt macros In-Reply-To: References: <20021122002713.GC29009@upl.cs.wisc.edu> Message-ID: So then, Neale Pickett is all like: > This is a totally killer idea, except that we just decided to move > probability computation out to individual WordInfo objects! I fired that off a little prematurely. When I say "we" here, I actually mean "Tim Stone, T. Alexander Popiel, and myself". Although Tim Peters has hinted that he thinks this is a good idea, we (Tim S and I) need to get the nod from a few key people before this and the myriad other changes in our (Tim S and I) little branch are checked in. We (royal) hope that we (diminutive) have found this message enlightening. 
Weale From rob@hooft.net Fri Nov 22 08:58:06 2002 From: rob@hooft.net (Rob Hooft) Date: Fri, 22 Nov 2002 09:58:06 +0100 Subject: [Spambayes] proposed changes to hammie & co. References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> <20021121171801.03720F5AC@cashew.wolfskeep.com> Message-ID: <3DDDF19E.3000305@hooft.net> T. Alexander Popiel wrote: > In message: > Neale Pickett writes: > >>So then, "T. Alexander Popiel" is all like: >> >> >>>In message: >>> Neale Pickett writes: >>> >>>>I'm currently entwined with mucking the heck out of WordInfo. I've got >>>>a neato scheme based on Alex's patch and comments where the WordInfo >>>>classes still compute their own probabilities, but also keep a revision >>>>number which is compared against a MetaInfo class. >>> >>>Eww, do we gotta? I thought I was trying to make the DB smaller. ;-) >> >>Ah, but the only thing *stored* is (spamcount, hamcount). The >>probability is calculated the first time you ask for it. If you don't >>update nspam or nham, the next time you ask for it it gives the cached >>value. So the database is small, but you still get the in-memory >>probability caching if you're using a pickle or ZODB. > > > Sounds like there is no caching benefit for one-message-per-invocation > situations like running out of procmail, then. Is this calculation for the few words in one message really time-determining? There is another way of caching: Make a dictionary that maps count-tuples to spam probabilities. (1,0) -> 0.155 (0,1) -> 0.844 etc. I definitely wouldn't move the calculation into the wordinfo class. It is a different task, so it "should" (design) be a separate class.... Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From Paul.Moore@atosorigin.com Fri Nov 22 09:33:02 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Fri, 22 Nov 2002 09:33:02 -0000 Subject: [Spambayes] Outlook addin crash Message-ID: <16E1010E4581B049ABC51D4975CEDB8861994F@UKDCX001.uk.int.atosorigin.com> Just got the following in the Outlook addin. No idea what caused it, but the "Exception in thread xxxx" messages are probably the relevant bits (I spent a while trying to get the "Filter Now" button to work before I thought of starting traceutil). The "Invalid window handle" message makes me think of a race condition where Outlook hasn't opened a window by the time the addin needs it... But the addin's UI is there (the extra buttons, and clicking them starts the dialogs OK). Paul.

>\Python22\Lib\site-packages\win32\lib\win32traceutil.py
Collecting Python Trace Output...
Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Applications\Spambayes\Outlook2000\default_bayes_database.pck'
Loaded message database from 'C:\Applications\Spambayes\Outlook2000\default_message_database.pck'
Bayes database initialized with 769 spam and 7166 good messages
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Processing 0 missed spam in folder 'Inbox' took 18.9974ms
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0) # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0) # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0) # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0) # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0) # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FolderSelector.py", line 359, in _DoUpdateStatus
    self.SetDlgItemText(IDC_STATUS1, status_string)
  File "C:\Python22\lib\site-packages\Pythonwin\pywin\mfc\object.py", line 23, in __getattr__
    raise win32ui.error, "The MFC object has died."
win32ui: The MFC object has died.
Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FolderSelector.py", line 359, in _DoUpdateStatus
    self.SetDlgItemText(IDC_STATUS1, status_string)
  File "C:\Python22\lib\site-packages\Pythonwin\pywin\mfc\object.py", line 23, in __getattr__
    raise win32ui.error, "The MFC object has died."
win32ui: The MFC object has died.
Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 365, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 135, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 402, in __getattr__
    if d is not None: return getattr(d, attr)
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 368, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr)
AttributeError: '' object has no attribute 'Folders'
win32ui: Error in Command Message handler for command ID 1100, Code 0
Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 365, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 135, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 402, in __getattr__
    if d is not None: return getattr(d, attr)
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 368, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr)
AttributeError: '' object has no attribute 'Folders'
win32ui: Error in Command Message handler for command ID 1100, Code 0

From Paul.Moore@atosorigin.com Fri Nov 22 10:04:05 2002 From: Paul.Moore@atosorigin.com (Moore, Paul) Date: Fri, 22 Nov 2002 10:04:05 -0000 Subject: [Spambayes] Outlook addin crash Message-ID: <16E1010E4581B049ABC51D4975CEDB88619950@UKDCX001.uk.int.atosorigin.com> From: Moore, Paul > Just got the following in the Outlook addin. No idea what caused it, > but the "Exception in thread xxxx" messages are probably the relevant > bits (I spent a while trying to get the "Filter Now" button to work > before I thought of starting traceutil). Got it, by a bit of binary search my new messages... I had an appointment confirmation in my inbox, which was causing the addin to crash. I'd suggest that anything other than a mail item be ignored when filtering. Paul. From tim@fourstonesExpressions.com Fri Nov 22 15:07:38 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 09:07:38 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: <3DDDF19E.3000305@hooft.net> Message-ID: 11/22/2002 2:58:06 AM, Rob Hooft wrote: >T. Alexander Popiel wrote: >> In message: >> Neale Pickett writes: >> >>>So then, "T. Alexander Popiel" is all like: >>> >>> >>>>In message: >>>> Neale Pickett writes: >>>> >>>>>I'm currently entwined with mucking the heck out of WordInfo.
I've got >>>>>a neato scheme based on Alex's patch and comments where the WordInfo >>>>>classes still compute their own probabilities, but also keep a revision >>>>>number which is compared against a MetaInfo class. >>>> >>>>Eww, do we gotta? I thought I was trying to make the DB smaller. ;-) >>> >>>Ah, but the only thing *stored* is (spamcount, hamcount). The >>>probability is calculated the first time you ask for it. If you don't >>>update nspam or nham, the next time you ask for it it gives the cached >>>value. So the database is small, but you still get the in-memory >>>probability caching if you're using a pickle or ZODB. >> >> >> Sounds like there is no caching benefit for one-message-per-invocation >> situations like running out of procmail, then. > >Is this calculation for the few words in one message really >time-determining? There is another way of caching: Make a dictionary >that maps count-tuples to spam probabilities. > > (1,0) -> 0.155 > (0,1) -> 0.844 >etc. > Yeah, this is an interesting idea. Caching is the right way to do this, not pre-calculating, because the number of possible tuples is combinatorially large and open ended. But... once you've calculated for a given tuple, you shouldn't have to do it again. The tuple:prob cache *could* be persistent, but I doubt there's much to be gained by that. - TimS >I definitely wouldn't move the calculation into the wordinfo class. It >is a different task, so it "should" (design) be a separate class.... > >Rob

*****module probability*****
# assuming probcache is defined somewhere in some initialization
class ProbabilityCache:
    def __init__(self):
        self.probcache = {}   # nham -> {nspam -> prob}

    def prob(self, nham, nspam):
        try:
            prob = self.probcache[nham][nspam]
        except KeyError:
            prob = calcprob(nham, nspam)
            # setdefault creates the inner dict the first time nham is seen
            self.probcache.setdefault(nham, {})[nspam] = prob
        return prob

def calcprob(nham, nspam):
    # code moved here from _update_probability in WordInfo class
    raise NotImplementedError   # placeholder, the real body is not shown here
***************************

....or something of that nature.
Maybe Adam Huff's NumPy vectorization stuff might play well into something like this. Incidentally, in the test below a dictionary of dictionaries has faster lookup than a dictionary keyed by a constructed tuple.

import time

# test 1: dictionary of dictionaries
x = {}
for i in range(500):
    x[i] = {}
    for j in range(500):
        x[i][j] = 1

t1s = time.time()
for k in range(5):
    for i in range(500):
        for j in range(500):
            a = x[i][j]
t1e = time.time()

# test 2: dictionary keyed by a constructed tuple
x = {}
for i in range(500):
    for j in range(500):
        x[(i, j)] = 1

t2s = time.time()
for k in range(5):
    for i in range(500):
        for j in range(500):
            a = x[(i, j)]
t2e = time.time()

print('test 1 time = %s' % (t1e - t1s))
print('test 2 time = %s' % (t2e - t2s))

*****
Four executions:
test 1 time = 3.41499996185
test 2 time = 4.41600000858
test 1 time = 3.375
test 2 time = 4.28600001335
test 1 time = 3.41500008106
test 2 time = 4.18599998951
test 1 time = 3.46500003338
test 2 time = 4.23699998856

- TimS > > >-- >Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From tim@fourstonesExpressions.com Fri Nov 22 16:42:03 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 10:42:03 -0600 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message-ID: Well, I've gone and done it... I've touched classifier code. Either my name is now mud, or I really am a part of the community... lol I added result caching to the _update_probability method in WordInfo (in hammie-playground branch). I suspect that this will save a lot of time, maybe commensurate with what Adam Huff demonstrated. I don't have a large enough corpus to really benchmark this, though, and you'll definitely want to take a good look to make sure I haven't goofed anything up. I certainly didn't change any calculations... On a related note...
There ought to be some safeguard against division by zero in the hamratio and spamratio calculations. The system shouldn't blow up with a /0 exception, but just peacefully assume some default and go about its business. That's because it's possible that this could be run when only spam has been trained on (for example). Some (regular everyday) user may very well make this mistake, which is most likely to occur immediately after installation. A blow up this early will probably just result in them not using it, assuming that it doesn't work. I'd have fixed it, but I have no idea what the peaceful default should be... - TimS www.fourstonesExpressions.com From neale@woozle.org Fri Nov 22 18:43:40 2002 From: neale@woozle.org (Neale Pickett) Date: 22 Nov 2002 10:43:40 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: <3DDDF19E.3000305@hooft.net> References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> <20021121171801.03720F5AC@cashew.wolfskeep.com> <3DDDF19E.3000305@hooft.net> Message-ID: So then, Rob Hooft is all like: > Is this calculation for the few words in one message really > time-determining? There is another way of caching: Make a dictionary > that maps count-tuples to spam probabilities. > > (1,0) -> 0.155 > (0,1) -> 0.844 > etc. Hmm! I did a small test against 200 spam, 200 ham, to see what tuple frequency is like. I got 21833 unique words, but only 869 unique values for (spamcount, hamcount). 
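A tally like the one just described could be produced along the following lines. This is only a sketch: the `wordinfo` mapping from word to (spamcount, hamcount), the helper name `tuple_frequencies`, and the sample data are hypothetical stand-ins for a trained database, not the script that produced the numbers above.

```python
from collections import Counter


def tuple_frequencies(wordinfo):
    """Count how many words share each (spamcount, hamcount) pair."""
    return Counter(tuple(counts) for counts in wordinfo.values())


# Hypothetical trained counts: word -> (spamcount, hamcount)
wordinfo = {
    "free": (5, 0),
    "offer": (5, 0),
    "click": (4, 0),
    "python": (0, 5),
    "meeting": (0, 5),
    "the": (10, 10),
}

freq = tuple_frequencies(wordinfo)
print("unique words:", len(wordinfo))
print("unique count pairs:", len(freq))
# Most common pairs first, as in a "most frequently occurring" listing.
for pair, nwords in freq.most_common():
    print(pair, nwords)
```

The gap between the two totals is what makes a cache keyed on the count pair attractive: the fewer distinct pairs there are, the fewer probabilities ever need to be computed.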
I also got gnuplot to animate out a cool spinning 3D graph of it just as my boss walked by :) The 20 most frequently-occurring (spamcount, hamcount) tuples were:

(15, 0)      57
(18, 0)      57
(19, 0)      62
(10, 5)      65
(0, 20)      79
(4, 10)      98
(5, 10)      99
(9, 5)      113
(14, 0)     137
(0, 15)     153
(13, 0)     162
(8, 0)      288
(4, 5)      303
(10, 0)     317
(5, 5)      334
(9, 0)      611
(0, 10)     659
(4, 0)     4814
(5, 0)     4979
(0, 5)     6045

The 20 most infrequently-occurring were:

(0, 130)      1
(0, 135)      1
(0, 140)      1
(0, 155)      1
(0, 165)      1
(0, 175)      1
(0, 250)      1
(0, 285)      1
(0, 310)      1
(0, 725)      1
(0, 75)       1
(10, 30)      1
(10, 40)      1
(10, 85)      1
(100, 40)     1
(101, 115)    1
(101, 20)     1
(101, 25)     1
(102, 115)    1
(102, 20)     1

A graph of frequencies looks a lot like a hyperbola. The more I think about this caching scheme, the more I like it. It deals well with the fact that most of the words only occur a few times, saves memory, and it will speed up pickles *and* databases. It's going in to the playground branch. > I definitely wouldn't move the calculation into the wordinfo class. It > is a different task, so it "should" (design) be a separate class.... Using this scheme, the calculation has to go back into the Bayes (or Classifier) class. WordInfo only stores counters now. Neale From neale@woozle.org Fri Nov 22 18:51:38 2002 From: neale@woozle.org (Neale Pickett) Date: 22 Nov 2002 10:51:38 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7 In-Reply-To: <20021122182258.5CCA9F580@cashew.wolfskeep.com> References: <20021122182258.5CCA9F580@cashew.wolfskeep.com> Message-ID: So then, "T. Alexander Popiel" is all like: > In message: > "Tim Stone" writes: > >Update of /cvsroot/spambayes/spambayes > >In directory sc8-pr-cvs1:/tmp/cvs-serv400 > > > >Modified Files: > > Tag: hammie-playground > > classifier.py > >Log Message: > >Added probability calculation result caching.
No benchmark available to see > >how much, if any, performance gain is achieved, but it seems like it could > >be significant, particularly in training large corpora, or with long running > >processes. > > You need to nuke the probcache when meta.revision changes. :-) > > Also, wouldn't the cache implemented by this patch be more > efficient if it indexed by hamcount and spamcount (both > integers) instead of hamratio and spamratio (both floats)? I should think so. What do you think of this idea: probcache is kept as a property of Classifier. Make a classifier.probability(self, word) method which looks up that word's (spamcount, hamcount) tuple in probcache. If it's not there, compute it and add it. Whenever Classifier.learn or Classifier.unlearn are called, probcache is blown away. This will effectively cache probabilities on demand, and make sure they are current. No need for a revision anymore. Sound good? From popiel@wolfskeep.com Fri Nov 22 18:49:58 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 22 Nov 2002 10:49:58 -0800 Subject: [Spambayes] proposed changes to hammie & co. In-Reply-To: Message from Rob Hooft of "Fri, 22 Nov 2002 09:58:06 +0100." <3DDDF19E.3000305@hooft.net> References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> <20021121171801.03720F5AC@cashew.wolfskeep.com> <3DDDF19E.3000305@hooft.net> Message-ID: <20021122184958.7B9C4F598@cashew.wolfskeep.com> In message: <3DDDF19E.3000305@hooft.net> Rob Hooft writes: > >Is this [spamprob] calculation for the few words in one message >really time-determining? No, which I went on to admit in the stuff you snipped. ;-) >There is another way of caching: Make a dictionary >that maps count-tuples to spam probabilities. > > (1,0) -> 0.155 > (0,1) -> 0.844 >etc. I'm not sure this is better; it would definitely have a higher cache hit rate, but the lookups are significantly more expensive (fetch the wordinfo, extract the counts, then fetch the probability). Something to measure... 
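For concreteness, the count-tuple cache under discussion might be sketched as follows, with the cache thrown away whenever training changes nspam or nham. Everything here is an assumption for illustration rather than the actual classifier.py code: the class name `CachingClassifier`, the `wordinfo` layout, the method names, the guard against zero message counts, and the values S = 0.45 and X = 0.5. The formula is the usual ratio computation followed by the S/X adjustment.

```python
class CachingClassifier:
    """Toy classifier: stores only counts, caches probabilities by count pair."""

    S = 0.45  # assumed strength given to the prior
    X = 0.5   # assumed probability for a word with no evidence

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.wordinfo = {}   # word -> [spamcount, hamcount]
        self.probcache = {}  # (spamcount, hamcount) -> probability

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            counts = self.wordinfo.setdefault(word, [0, 0])
            counts[0 if is_spam else 1] += 1
        # nspam or nham changed, so every cached value is stale.
        self.probcache.clear()

    def compute_probability(self, spamcount, hamcount):
        # Kept apart from the cache logic so a subclass can swap the formula.
        spamratio = spamcount / float(self.nspam or 1)
        hamratio = hamcount / float(self.nham or 1)
        if spamratio + hamratio == 0:
            return self.X  # no evidence either way
        prob = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        return (self.S * self.X + n * prob) / (self.S + n)

    def probability(self, word):
        spamcount, hamcount = self.wordinfo.get(word, (0, 0))
        key = (spamcount, hamcount)
        try:
            return self.probcache[key]
        except KeyError:
            prob = self.compute_probability(spamcount, hamcount)
            self.probcache[key] = prob
            return prob
```

Under these assumed values, after training on one spam and one ham, a word seen only in the spam scores about 0.845 and a word seen only in the ham about 0.155. Whether the extra lookup (word to counts, then counts to probability) pays for itself is still something to measure.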
>I definitely wouldn't move the calculation into the wordinfo class. It >is a different task, so it "should" (design) be a separate class.... I moderately agree, but OOP folks tend to have an aversion to pure data classes (as I think WordInfo should be). ;-) - Alex From popiel@wolfskeep.com Fri Nov 22 19:16:02 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 22 Nov 2002 11:16:02 -0800 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7 In-Reply-To: Message from Neale Pickett of "22 Nov 2002 10:51:38 PST." References: <20021122182258.5CCA9F580@cashew.wolfskeep.com> Message-ID: <20021122191603.1FFF4F598@cashew.wolfskeep.com> In message: Neale Pickett writes: > >What do you think of this idea: > >probcache is kept as a property of Classifier. Make a >classifier.probability(self, word) method which looks up that word's >(spamcount, hamcount) tuple in probcache. If it's not there, compute it >and add it. Whenever Classifier.learn or Classifier.unlearn are called, >probcache is blown away. > >This will effectively cache probabilities on demand, and make sure they >are current. No need for a revision anymore. > >Sound good? Sounds good to me. If you split the probability computation itself into a separate method from the cache management stuff, then it makes it easier to subclass to replace just the counts->probability formula. - Alex From tim@fourstonesExpressions.com Fri Nov 22 20:26:54 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 14:26:54 -0600 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7 Message-ID: <31HB07872UNK821UJFYVUOE9JFQL75UO.3dde930e@riven> 11/22/2002 1:16:02 PM, "T. Alexander Popiel" wrote: >In message: > Neale Pickett writes: >> >>What do you think of this idea: >> >>probcache is kept as a property of Classifier. 
Make a >>classifier.probability(self, word) method which looks up that word's >>(spamcount, hamcount) tuple in probcache. If it's not there, compute it >>and add it. Whenever Classifier.learn or Classifier.unlearn are called, >>probcache is blown away. >> >>This will effectively cache probabilities on demand, and make sure they >>are current. No need for a revision anymore. >> >>Sound good? > >Sounds good to me. If you split the probability computation itself >into a separate method from the cache management stuff, then it makes >it easier to subclass to replace just the counts->probability formula. >From my careful and time consuming examination of the code , it appeared to me that meta revision only changed when nham or nspam changed. Therefore, caching on the ratios rather than nham and nspam allowed the cache to be pertinent all the time. Nuking a cache is expensive... As for indexing on an integer vs a float. Both are immutable types, so you're really indexing on an object reference, not the value. I think python is smart enough to realize this, and not waste the time hashing on the value in this instance... correct me if I'm wrong. - TimS > >- Alex > > - Tim www.fourstonesExpressions.com From popiel@wolfskeep.com Fri Nov 22 20:50:55 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 22 Nov 2002 12:50:55 -0800 Subject: [Spambayes] Re: caching stuff In-Reply-To: Message from "T. Alexander Popiel" <20021122203837.4D2CDF580@cashew.wolfskeep.com> References: <20021122203837.4D2CDF580@cashew.wolfskeep.com> Message-ID: <20021122205055.A66B2F580@cashew.wolfskeep.com> In message: writes: > >From my careful and time consuming examination of the code , it >appeared to me that meta revision only changed when nham or nspam changed. >Therefore, caching on the ratios rather than nham and nspam allowed the >cache to be pertinent all the time. Nuking a cache is expensive... 
Unfortunately, preserving the cache when nham or nspam changes is bad, because the Bayesian adjustment changes, even if the ham and spam ratios don't. :-( Nuking a cache in toto is a lot less expensive than individually invalidating or updating records (which was update_probabilities' downfall). Either is a lot less expensive than giving the wrong answer. >As for indexing on an integer vs a float. Both are immutable types, so >you're really indexing on an object reference, not the value. Eh, I don't think so... but I don't know enough Python internals to be sure. (Sure, they are immutable types, but I strongly doubt that they're hashed as objects; that would imply that all references to a float value 3.0 were references to the same object... which means some sort of search for the 3.0 object when you added 2.5 and 0.5... which would be a severe performance loss. It seems far more likely that they're hashed by value instead (even if that value is currently boxed in an object).) Does anyone with more Python mojo have a definitive answer? Guido? - Alex From tim@fourstonesExpressions.com Fri Nov 22 21:19:11 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 15:19:11 -0600 Subject: [Spambayes] Re: caching stuff In-Reply-To: <20021122205055.A66B2F580@cashew.wolfskeep.com> Message-ID: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> 11/22/2002 2:50:55 PM, "T. Alexander Popiel" wrote: >In message: > writes: >> >>From my careful and time consuming examination of the code , it >>appeared to me that meta revision only changed when nham or nspam changed. >>Therefore, caching on the ratios rather than nham and nspam allowed the >>cache to be pertinent all the time. Nuking a cache is expensive... > >Unfortunately, preserving the cache when nham or nspam changes is bad, >because the Bayesian adjustment changes, even if the ham and spam >ratios don't.
:-( > >Nuking a cache in toto is a lot less expensive than individually >invalidating or updating records (which was update_probabilities' >downfall). Either is a lot less expensive than giving the wrong >answer. Well, if the Bayesian prob changes even if the ham and spam ratios don't, then of course the caching scheme is bad. But I certainly don't see that in the code that I changed. Maybe I'm looking in the wrong place... - TimS > >>As for indexing on an integer vs a float. Both are immutable types, so >>you're really indexing on an object reference, not the value. > >Eh, I don't think so... but I don't know enough Python internals to >be sure. (Sure, they are immutable types, but I strongly doubt that >they're hashed as objects; that would imply that all references to >a float value 3.0 were references to the same object... which means >some sort of search for the 3.0 object when you added 2.5 and 0.5... >which would be a severe performance loss. It seems far more likely >that they're hashed by value instead (even if that value is currently >boxed in an object).) > >Does anyone with more Python mojo have a definitive answer? Guido? > >- Alex > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From lists@morpheus.demon.co.uk Fri Nov 22 21:19:07 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Fri, 22 Nov 2002 21:19:07 +0000 Subject: [Spambayes] Outlook addin crash References: <16E1010E4581B049ABC51D4975CEDB88619950@UKDCX001.uk.int.atosorigin.com> Message-ID: "Moore, Paul" writes: > From: Moore, Paul >> Just got the following in the Outlook addin. No idea what caused it, >> but the "Exception in thread xxxx" messages are probably the relevant >> bits (I spent a while trying to get the "Filter Now" button to work >> before I thought of starting traceutil). > > Got it, by a bit of binary search my new messages...
> > I had an appointment confirmation in my inbox, which was causing the addin > to crash. I'd suggest that anything other than a mail item be ignored > when filtering. Hmm, the fix isn't obvious. In manager.py, BayesManager.score() is where it goes wrong - msg.GetEmailPackageObject() fails. I suspect that the correct solution is to default the score to 0 (ham) for non-mail objects (on the basis that spammers don't send appointments...). However, we're too late at this point - the place where we're interacting with the MAPI object is in msgstore.py, _GetMessageText. We could check PR_MESSAGE_CLASS there for IPM.Note, but what do we do with non-notes? The best I can think of is to define an exception class, NonMessageException, and raise it in _GetMessageText. We can then catch it in score() and handle it appropriately there. But I'm not 100% convinced that this isn't an abuse of exceptions... But I can't think of a better answer short of a fairly major restructuring. Paul. -- This signature intentionally left blank From rob@hooft.net Fri Nov 22 21:40:19 2002 From: rob@hooft.net (Rob Hooft) Date: Fri, 22 Nov 2002 22:40:19 +0100 Subject: [Spambayes] proposed changes to hammie & co. References: <20021121053628.DD924F5A7@cashew.wolfskeep.com> <20021121171801.03720F5AC@cashew.wolfskeep.com> <3DDDF19E.3000305@hooft.net> <20021122184958.7B9C4F598@cashew.wolfskeep.com> Message-ID: <3DDEA443.80300@hooft.net> T. Alexander Popiel wrote: > In message: <3DDDF19E.3000305@hooft.net> > Rob Hooft writes: > >>I definitely wouldn't move the calculation into the wordinfo class. It >>is a different task, so it "should" (design) be a separate class.... > > I moderately agree, but OOP folks tend to have an aversion > to pure data classes (as I think WordInfo should be). ;-) It doesn't have to be a pure data class. 
If you want to do pure OOP, indeed the WordInfo class should hide its implementation detail:

class WordInfo:
    def __init__(self,probcalc,...):
        self.probcalc=probcalc
    def spamprob(self):
        return self.probcalc(self.hamcount,self.spamcount)

class CachingProbCalc(ProbCalc):
    # The caching calculator
    def __call__(self,hamcount,spamcount):
        ....

class Bayes:
    ....
        pc=ProbCalc()
        ....
        wi=WordInfo(pc)

I like object composition (as you can see as well from CostCounter.py). Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From popiel@wolfskeep.com Fri Nov 22 21:51:16 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 22 Nov 2002 13:51:16 -0800 Subject: [Spambayes] Re: caching stuff In-Reply-To: Message from Tim Stone - Four Stones Expressions <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> References: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> Message-ID: <20021122215116.BA3E0F580@cashew.wolfskeep.com> In message: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> writes: > >Well, if the Bayesian prob changes even if the ham and spam ratios don't, then >of course the caching scheme is bad. But I certainly don't see that in the >code that I changed. Maybe I'm looking in the wrong place... In the probability computation (which I'm reading from update_probabilities in an old image):

    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    prob = (StimesX + n * prob) / (S + n)

Here we see that prob is based on both the ratios and the raw counts; thus, it's also based on nham & nspam (because to get the same non-zero ratio, you'd have to have a different raw count). There's normally a hulking huge comment in the middle of the code snippet above - that may be making it harder to spot. - Alex From tim@fourstonesExpressions.com Fri Nov 22 22:13:01 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 16:13:01 -0600 Subject: [Spambayes] If this doesn't motivate us...
Message-ID: http://www.freep.com/money/tech/mwend22_20021122.htm ARGH!!! - TimS www.fourstonesExpressions.com From tim@fourstonesExpressions.com Fri Nov 22 23:02:42 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 17:02:42 -0600 Subject: [Spambayes] Re: caching stuff In-Reply-To: <20021122215116.BA3E0F580@cashew.wolfskeep.com> Message-ID: 11/22/2002 3:51:16 PM, "T. Alexander Popiel" wrote: >In message: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> > writes: >> >>Well, if the baseian prob changes even if the ham and spam ratios don't, then >>of course the caching scheme is bad. But I certainly don't see that in the >>code that I changed. Maybe I'm looking in the wrong place... > >In the probability computation (which I'm reading from >update_probabilities in an old image): > > prob = spamratio / (hamratio + spamratio) > n = hamcount + spamcount > prob = (StimesX + n * prob) / (S + n) > > >Here we see that prob is based on both the ratios and the >raw counts; thus, they're also based on nham & nspam >(because to get the same non-zero ratio, you'd have to >have a different raw count). I get it now... the larger the raw counts, the more weight is given to this word... So my cache mechanism is fatally flawed. - TimS > >There's normally a hulking huge comment in the middle of >the code snippet above - that may be making it harder to >spot. > >- Alex > > > - Tim www.fourstonesExpressions.com From tim@fourstonesExpressions.com Fri Nov 22 23:46:49 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 17:46:49 -0600 Subject: [Spambayes] Re: caching stuff In-Reply-To: <20021122232400.B57C9F580@cashew.wolfskeep.com> Message-ID: 11/22/2002 5:24:00 PM, "T. Alexander Popiel" wrote: >In message: > Tim Stone - Four Stones Expressions > writes: >>11/22/2002 3:51:16 PM, "T. 
Alexander Popiel" wrote:
>>
>>>In message: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven>
>>> writes:
>>>
>>> prob = spamratio / (hamratio + spamratio)
>>> n = hamcount + spamcount
>>> prob = (StimesX + n * prob) / (S + n)
>
>>But I think what you're saying is that it's possible to come up with the
>>same ratio with different raw numbers, like 2:5 and 4:10. The ratio is
>>the same, but the prob is different?
>
>Exactly! Let's work it through with hamratios of 2:5 and 4:10, spamcount
>always 0:
>
>The initial version of prob = 0 / (.4 + 0) remains the same for both.
>However, the value of n is 2 in the first case and 4 in the second case.
>That means that the adjustment prob = (StimesX + n * prob) / (S + n)
>is different; StimesX / (S + 2) in one case, and StimesX / (S + 4) in
>the second. Given the default S and X, that's .0918 and .0506. A fairly
>significant difference.
>
>Does that help?

I humbly thank my teachers for their patience. :) Check the new code I'm checking into hammie-playground. I think this is more what you're looking for. Rob gave us some good food for thought, huh?

- TimS

>
>- Alex
>
>

- Tim
www.fourstonesExpressions.com

From neale@woozle.org Sat Nov 23 00:04:21 2002
From: neale@woozle.org (Neale Pickett)
Date: 22 Nov 2002 16:04:21 -0800
Subject: [Spambayes] anyone going to the spam conference?
Message-ID:

So, anyone planning on going to Paul Graham's spam conference in January in Cambridge? I just got the nod from $FIRM to attend. An all-expenses paid trip to New England in winter! Woo!

I'd like to present a talk on behalf of the spambayes project. If anyone else was planning on doing this, please let me know. There are certainly folks who know more than I about how our classifier works, but nobody from our project shows up on the speakers list.

If nobody else steps up to the plate, I'm going to have to ask a lot of dumb questions about the thing. That should be a pretty good motivator! <0.9 wink>

Neale

From lists@webcrunchers.com Sat Nov 23 00:27:27 2002
From: lists@webcrunchers.com (John D.)
Date: Fri, 22 Nov 2002 16:27:27 -0800
Subject: [Spambayes] Hourra for pop3proxy !
In-Reply-To:
Message-ID:

Francis writes:

>It is plug & play. First try with two pop server, everything was working as
>advertised. The training interface is really good. It just need some
>cosmetic improvements. I'll try to come with some patches for this.

I'm already talking with Richie about me adding a 'Spam management' system. Using an enhanced web GUI, but allowing a number of "controls" for the classifier, tokenizer, and other spambayes tweakage. Other things I want to include are spam management functions. Making it easy to report spam, send to uce@ftc.gov, etc.

I want to start defining the GUI as early as next weekend. If anyone along these lines wants additional features, please let me know.

So far planned:

* Simplifying reporting to spamcop
* Database for "frequent offenders"
* Tracking tools for tracking their origin.
* Testing validity of "opt out" addresses and links
* Easy to use tracking and data collection on spammers.
* References to the top anti-spammer links, and interfacing data to these other anti-spammers.
* Establish a test connection to the spammer's opt-in pop server to see if that address is valid.

The database will be using PostgreSQL and PyGreSQL modules. Will work with Apache server, using the Python ready "cgi" modules.

Will interface with the pop3proxy, and would pass the spam from the proxy GUI to the database.

In tracking down spammers, I'm finding it really difficult to keep track of the spammers, because I have to spend time identifying previous spams they did, and looking through a large list. Now, I want to use the database to search for the spam to see if it occurred before, so I can tell if it's a "repeat offender".
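A rough sketch of how such a repeat-offender lookup might work, offered only as an illustration and not as the planned SMS design. It assumes a spam is identified by a digest of its whitespace-normalized body, and it uses Python's built-in sqlite3 in place of the PostgreSQL/PyGreSQL setup mentioned above so that it runs without a server. The table layout and the names fingerprint, open_db, and record_spam are made up for this example.

```python
import hashlib
import sqlite3


def fingerprint(raw_message):
    """Return a digest of the message body, ignoring case and whitespace."""
    text = raw_message.replace("\r\n", "\n")
    # Headers end at the first blank line; fall back to the whole text
    # if the message has no header/body separator.
    head, sep, body = text.partition("\n\n")
    if not sep:
        body = head
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_db(path=":memory:"):
    """Open (and if needed create) the hypothetical spam-tracking table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS spam ("
        " digest TEXT PRIMARY KEY,"
        " times_seen INTEGER NOT NULL,"
        " notes TEXT NOT NULL DEFAULT '')"
    )
    return db


def record_spam(db, raw_message, notes=""):
    """Record one sighting; return how many times this spam has been seen."""
    digest = fingerprint(raw_message)
    row = db.execute(
        "SELECT times_seen FROM spam WHERE digest = ?", (digest,)
    ).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO spam (digest, times_seen, notes) VALUES (?, 1, ?)",
            (digest, notes),
        )
        seen = 1
    else:
        seen = row[0] + 1
        db.execute(
            "UPDATE spam SET times_seen = ? WHERE digest = ?", (seen, digest)
        )
    db.commit()
    return seen
```

A return value above 1 from record_spam would flag a repeat. An exact digest only catches spam whose body is identical after normalization; anything the sender varies per recipient would need a fuzzier comparison.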
Now, I want to be able to click on a spam message, then click a query button to see if I got this spam before. Most of the spam I get that are repeat offenders are identical, so it should be easy to look them up. Once I find a particular spam, I'll have notes I took on them, reminding me of my last correspondence with them.

Even though I spend a lot of time tracking them down, they STILL continue to spam me, even though I would talk to the owner. I can then collect this data, and use it for prosecution of "repeat violators". I want to make it easy for EVERYONE to do this.

I'll call this module "SMS" (Spam management system) for lack of a better name.

I guess I would just put it in a separate directory in CVS when I'm ready to release it.

John

From lists@webcrunchers.com Sat Nov 23 00:45:57 2002
From: lists@webcrunchers.com (John D.)
Date: Fri, 22 Nov 2002 16:45:57 -0800
Subject: [Spambayes] If this doesn't motivate us...
In-Reply-To:
Message-ID:

Tim writes:

>http://www.freep.com/money/tech/mwend22_20021122.htm
>
>ARGH!!!

It quotes "Police promptly raided the business and confiscated Ralsky's servers. Although they were returned a few days later, Ralsky now tries to cover his tracks better, so opponents won't know what companies and servers he's using".

I didn't know it was possible to forge the IP address. I would be most interested in seeing how that's done.

Just so you know, my contact in China and I have been very influential in getting his Chinese servers shut down. I have a reliable contact in China who I call upon that "opens a lot of doors" that no American can do from here.

He then translates things in Chinese for me, then makes phone calls to the internet providers, and speaks to them in Chinese terms.

It turns out the Chinese are getting sick of this, and are soon planning to crack down on Foreign spammers using their networks within China...
John From tim@fourstonesExpressions.com Sat Nov 23 00:56:06 2002 From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri, 22 Nov 2002 18:56:06 -0600 Subject: [Spambayes] If this doesn't motivate us... Message-ID: <424YZVSOVP1YFCB91WKG762ULKCA1T98.3dded226@riven> 11/22/2002 6:45:57 PM, "John D." wrote: >Tim writes: > >>http://www.freep.com/money/tech/mwend22_20021122.htm >> >>ARGH!!! > >It quotes "Police promptly raided the business and confiscated Ralsky's servers. Although they were returned a few days later, Ralsky now tries to cover his tracks better, so opponents won't know what companies and servers he's using". > >I didn't know it was possible to forge the IP address. I would be most interested in seeing how that's done. Forging an ip address is very very simple. When you create a connection, you tell tcp what ip address you want it to say you're using. Our friend Neale Pickett has written a great little treatment of sockets, that's at http://www.woozle.org/~neale/papers/sockets.html. The problem with backtracing ip addresses is well documented. When you spoof an address, you typically pick an address that you'd like people to think you are, like a yahoo address or something like that. Then, if anybody gets mad at anybody, Yahoo gets the blame. The solution is for our router/switch/gear manufacturing friends to make it impossible for people to spoof addresses, by looking at outgoing packets and rejecting those that don't match the ip address that the router knows it's coming from. But for some reason, this isn't done, though DDOS and other spoof attacks are starting to raise an outcry that something be done. > >Just so you know, my contact in China and I have been very influencial in getting his Chinese servers shut down. I have a reliable contact in China who I call upon that "opens a lot of doors" that no American can do from here. You go, dude! 
> >He then translates things in Chinese for me, then makes phone calls to the internet providers, and speaks to them in Chinese terms. > >It turns out the Chinese are getting sick of this, and are soon planning to crack down on Foreign spammers using their networks within China... That's great news. > >John > > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > - Tim www.fourstonesExpressions.com From dereks@itsite.com Sat Nov 23 03:45:12 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 22 Nov 2002 22:45:12 -0500 (EST) Subject: [Spambayes] If this doesn't motivate us... In-Reply-To: <424YZVSOVP1YFCB91WKG762ULKCA1T98.3dded226@riven> Message-ID: > address that the router knows it's coming from. But for some reason, this > isn't done, though DDOS and other spoof attacks are starting to raise an > outcry that something be done. Rules from history: 1. Spamming will continue until it's not profitable. 2. Routers won't come with anti-spoof features until corporations are willing to pay extra money for them. > >It turns out the Chinese are getting sick of this, and are soon planning to > crack down on Foreign spammers using their networks within China... [...] > That's great news. Not to people concerned about human rights and freedom of speech. --Derek From lists@morpheus.demon.co.uk Sat Nov 23 15:49:16 2002 From: lists@morpheus.demon.co.uk (Paul Moore) Date: Sat, 23 Nov 2002 15:49:16 +0000 Subject: [Spambayes] New web training interface for pop3proxy References: <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com> Message-ID: "Moore, Paul" writes: > (This is from memory, as it happened on my home setup and I'm at > work now, so I apologise if it's a bit vague). It's just happened again. So I can diagnose a bit better... >> It's "no". 8-) Like I say, no-one's reported it locking before, and >> I've never seen it. 
You usually get a traceback when something goes >> wrong. So your console says something like: >> >> Loading database... Done BayesProxyListener listening on port 110 . >> UserInterfaceListener listening on port 8880 . Console screenshot: Loading database... Done. BayesProxyListener listening on port 8110. UserInterfaceListener listening on port 8880. >> and nothing else, and the process is still running, but you can't >> get a page served to your browser? What error message do you get >> from the browser? If it's one of those pointless IE error pages, >> could you try telnetting to port 8880 and saying "GET / HTTP/1.0"? >> Can you even connect with telnet? How about port 110? The browser shows "Training..." and nothing more. The status bar shows "Opening page http://localhost:8880/review..." and the progress bar is part way across and stuck. Python is running at 90%+ CPU. Looks like it's in a loop somewhere. > I can't do a telnet at the moment to check. Telnet isn't responding. The thing's almost certainly in a loop. > It was a dbm file. > > The command line was pop3proxy.py -d -l 8110 localhost > > (proxying a local POP server on port 110 with the proxy on port 8110 > using a DBM file). Working directory was the directory of the program. > No bayescustomize.ini file. The UI showed the database as having 0 ham and 0 spam, but it was doing this yesterday, and everything worked fine then. Looks like some sort of database corruption, but a subtle one... OK, I ran it with Corpus.Verbose = True, and it seems to be locking up just after printing "training with" in Bayes.Trainer.train Further checking... it's in self._add_msg(wordstream, is_spam) in classifier.Bayes. Best I can locate, it's locking up trying to store a None in self.wordinfo. Specifically, # Needed to tell a persistent DB that the content changed. 
    wordinfo[word] = record

locks up with record = None (and word = electronics, but I doubt that's relevant :-))

I can post my hammie.db, but it's 1.4M (360K zipped) so I won't bother unless someone thinks it's going to help significantly... (BTW, some sort of dumper of a spambayes database file might be helpful in diagnosing problems like this - at least a structure validator. I don't know how possible this is, and as this area is changing rapidly right now, I'll just put it on the TODO list for the moment.)

Paul.
--
This signature intentionally left blank

From francois.granger@free.fr Sat Nov 23 18:59:58 2002
From: francois.granger@free.fr (François Granger)
Date: Sat, 23 Nov 2002 19:59:58 +0100
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To:
References: <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
Message-ID:

At 15:49 +0000 23/11/02, in message Re: [Spambayes] New web training interface for pop3prox, Paul Moore wrote:

>I can post my hammie.db, but it's 1.4M (360K zipped) so I won't bother
>unless someone thinks it's going to help significantly... (BTW, some
>sort of dumper of a spambayes database file might be helpful in
>diagnosing problems like this - at least a structure validator. I
>don't know how possible this is, and as this area is changing rapidly
>right now, I'll just put it on the TODO list for the moment.

The enclosed script did it for the recent pickle format. I'll retest it and improve if necessary.

It is rather crude at the moment, but by changing format = csv to format = asctab on line 145, you get either pseudo-CSV or ASCII tab-separated values.

--
Email is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html
-------------- next part --------------
Skipped content of type multipart/appledouble

From noreply@sourceforge.net Sat Nov 23 14:00:57 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sat, 23 Nov 2002 06:00:57 -0800
Subject: [Spambayes] [ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder
Message-ID:

Bugs item #642740, was opened at 2002-11-24 01:00
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: "Recover from Spam" wrong folder

Initial Comment:
Outlook addin: Selecting "Recover From Spam" recovers the selected message to the Inbox folder - which is not necessarily where it came from. The filterer will need to save the folder it came from before we can do this.

From tim.one@comcast.net Sat Nov 23 20:27:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 23 Nov 2002 15:27:30 -0500
Subject: [Spambayes] If this doesn't motivate us...
In-Reply-To:
Message-ID:

[John D.]
> ...
> I didn't know it was possible to forge the IP address. I would
> be most interested in seeing how that's done.

It's just bits put together by software. Some systems make it easier than others. There's a huge and ongoing flap about Windows XP Home edition, which is the first consumer MS OS said to make it quite easy.
Once you get the spoofing working, here's how to do really nasty stuff : http://rr.sans.org/threats/intro_spoofing.php From tim.one@comcast.net Sat Nov 23 21:49:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 23 Nov 2002 16:49:10 -0500 Subject: [Spambayes] Outlook weirdness In-Reply-To: Message-ID: [Sean True, on using a database] > ... > Slower *training* would be an issue, however. For bulk training, but one-at-a-time training would be much faster (no need for update_probabilities() at the end, which computes a new value for every word in the database). Bulk training could be taught to use a new classifier based on an in-memory dict. When that's done, the in-memory dict's ham and spam counts would be added into the persistent DB (rewriting only those WordInfo records corresponding to words that appeared in the bulk training data), and then the in-memory dict could be thrown away. From tim.one@comcast.net Sat Nov 23 21:54:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 23 Nov 2002 16:54:15 -0500 Subject: [Spambayes] Documentation... In-Reply-To: Message-ID: [Richie Hindle] > This may be premature, but as part of helping John Draper set up the > spambayes software I've made a start on some user documentation. It > could go on the website, or maybe in with the source code - I'm not > sure we're ready to give the impression that this stuff is ready for > "normal people" to use yet. First check it into the project, so other people can help update it too, and so it doesn't get lost. These docs are a great beginning! From tim.one@comcast.net Sat Nov 23 22:10:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 23 Nov 2002 17:10:27 -0500 Subject: [Spambayes] New web training interface for pop3proxy In-Reply-To: Message-ID: [David Ascher] > Make 'hovertips' that display the first few lines of the body [Richie Hindle] > This is done. 
The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!

It works very well except when it doesn't. The chief damned-whether-you-do-or-don't problem: I've seen several msgs with HTML style sheets and/or HTML comments exceeding 2K characters. The 2K limit in the minimal matches serves two purposes:

1. Prevent the C stack from blowing up in the regexp engine. But François Granger reported a C stack blowup anyway on Mac OS 9, and I still have no clue how small a limit would prevent that on his box.

2. Prevent it from consuming an arbitrary amount of text in case we matched a "begin long construct" character sequence by accident. It's *unlikely* that random test contains

(Apologies to Tim - it seems to work extremely well.)

Yes, when it works at all. Fixing it in all cases requires doing real HTML parsing, and that's expensive, so the current "cheap-ass gimmick" is accurate.

> Rest assured it's safe from HTML content leaking into the web
> interface - the worst that will happen is that you'll see HTML source
> in the hovertip.

A giant section near the start seems the most likely
>glitch here. Are you using this regexp *from* Python, or from Javascript?
>I have half a mind to replace the comment and style nuking with an
>iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
>crack_urls(), which only use regexps to help find the right places to poke
>at -- they can't blow the C stack). But if you're doing this from
>Javascript, that wouldn't help you.

A giant