From ta-meyer at ihug.co.nz Mon Feb 2 00:39:20 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Feb 2 00:39:34 2004
Subject: [spambayes-dev] -d/-D command line options
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz>

One other thing that I think would be good to sort out before the next release (we're still doing one, right?) is getting the -d/-D command line options consistent. We currently have:

sb_xmlrpcserver : pickle: -p   dbm: -p -d
sb_server       : pickle: -D   dbm: -d
sb_notesfilter  : pickle: -d   dbm: -D
sb_mboxtrain    : pickle: -D   dbm: -d
sb_imapfilter   : pickle: -d   dbm: -D
sb_filter       : pickle: -D   dbm: -d
sb_dbexpimp     : pickle: -d   dbm: -D

So, ignoring the odd xmlrpcserver (the old hammie.py was like this, IIRC), we have 3 that use -D for pickle, and 3 that use -d for pickle, it seems. The -D for pickle ones are more widely used, though.

We also had a suggestion, though, that -d for dbm and -p for pickle would make more sense:

So, what do people vote for?

1. -D for pickle, -d for dbm
2. -d for pickle, -D for dbm
3. -p for pickle, -d for dbm
4. The inconsistent status quo.

[This would close: ]

=Tony Meyer

From skip at pobox.com Mon Feb 2 07:23:00 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Feb 2 07:23:21 2004
Subject: [spambayes-dev] -d/-D command line options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz>
Message-ID: <16414.16676.71944.947722@montanaro.dyndns.org>

Tony> So, what do people vote for?
Tony> 1. -D for pickle, -d for dbm
Tony> 2. -d for pickle, -D for dbm
Tony> 3. -p for pickle, -d for dbm
Tony> 4. The inconsistent status quo.

+1 for #3 assuming it won't be too hard to reclaim -d and -p for this purpose.

As long as we'll be making an incompatible change to the selection of pickles and databases, I'd like to see about getting the default (pickle or database file) done right for each application.
Long-running applications which handle both the training and scoring tasks should probably use pickles by default. Other applications should use database files by default. That will probably mean changes to the config file to split Storage:persistent_use_database into several application-specific options. Skip From popiel at wolfskeep.com Mon Feb 2 14:07:31 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Feb 2 14:07:37 2004 Subject: [spambayes-dev] -d/-D command line options In-Reply-To: Message from "Tony Meyer" of "Mon, 02 Feb 2004 18:39:20 +1300." <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz> Message-ID: <20040202190731.662932DE8B@cashew.wolfskeep.com> In message: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz> "Tony Meyer" writes: >3. -p for pickle, -d for dbm - Alex From tim at fourstonesExpressions.com Mon Feb 2 16:27:45 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Mon Feb 2 16:27:51 2004 Subject: [spambayes-dev] -d/-D command line options In-Reply-To: <20040202190731.662932DE8B@cashew.wolfskeep.com> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz> <20040202190731.662932DE8B@cashew.wolfskeep.com> Message-ID: On Mon, 02 Feb 2004 11:07:31 -0800, T. Alexander Popiel wrote: > In message: > <1ED4ECF91CDED24C8D012BCF2B034F13026F2A75@its-xchg4.massey.ac.nz> > "Tony Meyer" writes: > >> 3. -p for pickle, -d for dbm This one is the least confusing... +1 from me. However... -p and -d should be mutually exclusive, I suppose we should implement and edit for that? -- Vous exprimer; Expr?sese; Esprimi te stesso; Express yourself! 
Tim Stone See my photography at www.fourstonesExpressions.com From mhammond at skippinet.com.au Tue Feb 3 07:23:46 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Feb 3 07:24:02 2004 Subject: [spambayes-dev] New release Message-ID: Just so everyone knows where I am at... I've been slightly distracted by distutils and releasing win32all (and pretending to have a life, and having trouble getting motivated to work at all, but that is a different story ). The details aren't particularly interesting, but instead of doing another "WISE" based release, I'm going to jump straight to distutils, and therefore force myself to face the issues now rather than get 1/2 way through yet another thing. So, my plan was always to try and release a new win32all, *then* release a new spambayes binary. This way, anyone else can re-create the spambayes binary fairly easily, and for a change, public releases will synchronize *and* be easily reproducible. This is still my personal plan, but has no particular benefit. Therefore, should someone forge ahead and prepare a new source-code release, I will jump ahead of where I am, and do a corresponding binary release. Not that I am pushing for this, but incase someone is waiting for me, then I'm pretty ready to go. Unfortunately (ok ok, fortunately) I have no idea what the source-release process is, nor have any intention of learning now Mark. From tim.one at comcast.net Tue Feb 3 11:07:40 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Feb 3 11:07:41 2004 Subject: [spambayes-dev] FW: [Spambayes] Standards for Filtering Message-ID: Over on the spambayes list, Yakov Shafranovich sent this, and later asked: Is there anyone who is willing to participate in a mailing list on filtering standards? Yakov, I'm forwarding this to the spambayes-dev@python.org list, because that's where the SpamBayes developers congregate. 
If you're going to send info about the mailing list (which is welcome!), please use mailto:spambayes-dev@python.org

-----Original Message-----
From: Yakov Shafranovich
Sent: Tuesday, February 03, 2004 8:54 AM
To: spambayes@python.org
Subject: [Spambayes] Standards for Filtering

Daniel Quinlan of SpamAssassin suggested that I should contact you. I co-chair the Anti-Spam Research Group (ASRG) of the IRTF [asrg.sp.am] together with John Levine. The ASRG does pre-standards work and research for the IETF. We have been thinking about different ways that standards can help the filtering community and what we currently have on the table is standard headers and dynamic filtering updates (like anti-virus programs do). We also have a subgroup for filtering work which is currently being formed (see http://asrg.sp.am/subgroups/filtering.shtml).

What I am wondering, is whether the filtering community can benefit from standards and if filtering folks are willing to discuss such standards, and cooperate with each other.

Yakov
-------
Yakov Shafranovich / asrg shaftek.org
SolidMatrix Technologies, Inc. / research solidmatrix.com
"I ate your Web page. / Forgive me. It was juicy / And tart on my tongue." (MIT's 404 Message)
-------

From nas-spambayes at python.ca Tue Feb 3 11:45:01 2004
From: nas-spambayes at python.ca (Neil Schemenauer)
Date: Tue Feb 3 11:40:23 2004
Subject: [spambayes-dev] Re: Standards for Filtering
In-Reply-To:
References:
Message-ID: <20040203164501.GA13619@mems-exchange.org>

Yakov Shafranovich:
> We have been thinking about different ways that standards can help
> the filtering community and what we currently have on the table is
> standard headers and dynamic filtering updates (like anti-virus
> programs do).

It's probably more drastic than what you are intending, but I think SMTP should stop being a "store and forward" protocol.
Instead, the connection should remain open until the message is delivered to the final destination (at least as near as possible) or a failure occurs. I think the only change to the SMTP protocol necessary is to disallow multiple envelope recipient addresses. Changes to SMTP servers could be made in a backwards compatible way. Separate bounce messages do not work anymore. I think almost everyone can agree on that now. However, the current SMTP protocol/implementation needs to send separate bounces otherwise it is unreliable. SMTP was designed for an age when links between sites were very unreliable. In that age, moving a message closer to its destination hop by hop made sense. Now it is no longer necessary. Neil From research at solidmatrix.com Tue Feb 3 13:53:52 2004 From: research at solidmatrix.com (Yakov Shafranovich) Date: Tue Feb 3 13:54:08 2004 Subject: [spambayes-dev] Re: FW: [Spambayes] Standards for Filtering In-Reply-To: References: Message-ID: <401FEE40.4080104@solidmatrix.com> Tim, Thanks for forwarding the message. The mailing list is up but not public just yet, since we haven't gathered enough interested parties. Info on the subgroup is here: http://asrg.sp.am/subgroups/filtering.shtml If anyone wants to be added to the list, just email me directly - research@solidmatrix.com or asrg@shaftek.org, and I will add you to the mailing list. It should go public within the next week or two. Yakov Tim Peters wrote: > Over on the spambayes list, Yakov Shafranovich sent this, and later asked: > > Is there anyone who is willing to participate in a mailing list on > filtering standards? > > Yakov, I'm forwarding this to the spambayes-dev@python.org list, because > that's where the SpamBayes developers congregate. 
If you're going to send > info about the mailing list (which is welcome!), please use > > mailto:spambayes-dev@python.org > > > -----Original Message----- > From: Yakov Shafranovich > Sent: Tuesday, February 03, 2004 8:54 AM > To: spambayes@python.org > Subject: [Spambayes] Standards for Filtering > > > Daniel Quinlan of SpamAssassin suggested that I should contact you. I > co-chair the Anti-Spam Research Group (ASRG) of the IRTF [asrg.sp.am] > together with John Levine. The ASRG does pre-standards work and research > for the IETF. We have been thinking about different ways that standards > can help the filtering community and what we currently have on the table > is standard headers and dynamic filtering updates (like anti-virus > programs do). We also have a subgroup for filtering work which is > currently being formed (see http://asrg.sp.am/subgroups/filtering.shtml). > > What I am wondering, is whether the filtering community can benefit from > standards and if filtering folks are willing to discuss such standards, > and cooperate with each other. > > Yakov > ------- Yakov Shafranovich / asrg shaftek.org SolidMatrix Technologies, Inc. / research solidmatrix.com "Why are both drug addicts and computer aficionados both called users?" (Clifford Stoll) ------- From atom at suspicious.org Wed Feb 4 01:19:57 2004 From: atom at suspicious.org (Atom 'Smasher') Date: Wed Feb 4 01:21:32 2004 Subject: [spambayes-dev] sb_mailsort.py problem Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 using: 1.0a7 & sb_mailsort.py i got a spam that caused sb_mailsort.py to fart. i attached a copy, but i'm not sure if it'll go through, so i'll make it available here: http://smasher.suspicious.org/tmp/sb_problem.tgz for a few days/weeks. sb_problem.tgz md5: 3d8ee9c1e9c5baf683b55ebc131a10f3 i'm assuming that the non-ascii stuff caused the problem. i removed all identifying information from the email, which didn't change the way sb_mailsort.py choked on it. 
i'm not subscribed to this list, so please Cc me on any replies. footnote: this is on a box running FreeBSD 4.7-RELEASE and python2.2.3 i tried upgrading to newer python when i first installed SB but that caused some bizarre problems and i kept the older version. thanks... ...atom _______________________________________________ PGP key - http://smasher.suspicious.org/pgp.txt 3EBE 2810 30AE 601D 54B2 4A90 9C28 0BBF 3D7D 41E3 ------------------------------------------------- "Simply stated, there is no doubt that Saddam Hussein now has weapons of mass destruction." -- Dick Cheney, 26 August 2002 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (FreeBSD) iD8DBQFAII8RnCgLvz19QeMRAuaDAJ4ici981Wb+P4qapavrl1xLEb8mYQCggRt2 RZmETdnoXBssV4F00uWOI+Y= =vSU3 -----END PGP SIGNATURE----- -------------- next part -------------- A non-text attachment was scrubbed... Name: sb_problem.tgz Type: application/octet-stream Size: 8319 bytes Desc: Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040204/a4a61f01/sb_problem.obj From tameyer at ihug.co.nz Wed Feb 4 03:16:50 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Feb 4 03:17:15 2004 Subject: [spambayes-dev] New release In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304E71325@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A7F@its-xchg4.massey.ac.nz> [Mark] > Therefore, should > someone forge ahead and prepare a new source-code release, I > will jump ahead of where I am, and do a corresponding binary release. > > Not that I am pushing for this, but incase someone is waiting > for me, then I'm pretty ready to go. Unfortunately (ok ok, > fortunately) I have no idea what the source-release process > is, nor have any intention of learning now For the most part, it's documented in README-DEVEL.TXT, but I'm happy to put together a source-code release, if no-one else is clammering to do it (I waited all day for someone else to pipe up ). 
If we put out release candidates tomorrow then people could try them over the weekend and we could release Monday. So, tomorrow (unless the mail when I get in suggests otherwise), I'll: 1. Sort out the -d/-D/-p stuff. It's looking like -p/-d is the winner (4-0). 2. Do another update to WHAT_IS_NEW.TXT and CHANGELOG.txt. 3. Have another look through the sb_server binary documentation. 4. Build a release candidate and post a link here. Mark - any chance that you want to revamp the Version.py stuff before we put things together? If you change Version.py and the plug-in, I can do the sb_server/imapfilter/pop3dnd scripts. What are we calling this, anyway? 1.0a8? 1.0b1? BTW, the binary release is meant to be built with OL2K, yes? (That means that I should remember not to do it :) =Tony Meyer From kennypitt at hotmail.com Wed Feb 4 11:00:44 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Feb 4 11:01:40 2004 Subject: [spambayes-dev] New release In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A7F@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: > Mark - any chance that you want to revamp the Version.py stuff before > we put things together? Sorry, I guess I signed up for this but I've been swamped lately with upcoming releases for stuff they actually pay me for. I have some thoughts on this, but it involves sharing some of the version information across apps in a way that doesn't quite fit with how the version dictionaries are accessed right now. I don't think those changes will make it in for this release. In the meantime, I just checked in some basic cleanup to Version.py to remove extra version text from the Description fields and drop the unused "SMTP Proxy" section since it is now part of sb_server. I also updated the "Hammie" section to "sb_filter", and added some code to the sb_filter.py script to print the version info when displaying usage. 
Those responsible for the various apps will need to update the version numbers and release dates as you see fit. -- Kenny Pitt From nas-spambayes at python.ca Wed Feb 4 15:07:43 2004 From: nas-spambayes at python.ca (Neil Schemenauer) Date: Wed Feb 4 14:58:29 2004 Subject: [spambayes-dev] sb_mailsort.py problem In-Reply-To: References: Message-ID: <20040204200743.GA19333@mems-exchange.org> On Wed, Feb 04, 2004 at 01:19:57AM -0500, Atom 'Smasher' wrote: > i got a spam that caused sb_mailsort.py to fart. i attached a copy, but > i'm not sure if it'll go through, so i'll make it available here: > http://smasher.suspicious.org/tmp/sb_problem.tgz > for a few days/weeks. I've filed a bug report and attached the files from sb_problem.tgz to it. http://sourceforge.net/tracker/index.php?func=detail&aid=890691&group_id=61702&atid=498103 Cheers, Neil From mhammond at skippinet.com.au Wed Feb 4 19:03:00 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Feb 4 19:03:18 2004 Subject: [spambayes-dev] New release In-Reply-To: Message-ID: <005801c3eb7b$6d3991a0$0200a8c0@eden> > Tony Meyer wrote: > > Mark - any chance that you want to revamp the Version.py > stuff before > > we put things together? > > Sorry, I guess I signed up for this but I've been swamped lately with > upcoming releases for stuff they actually pay me for. > > I have some thoughts on this, but it involves sharing some of the > version information across apps in a way that doesn't quite > fit with how > the version dictionaries are accessed right now. I don't think those > changes will make it in for this release. > > In the meantime, I just checked in some basic cleanup to Version.py to Sounds good to me. Let's leave things for this release, and have a quick discussion about your ideas all ready for the next one. There isn't really anything in the current scheme I am attached to, so I doubt it will be controversial :) How about we call this 0.9, the next 1.0b1 (and b2 etc as needed), then 1.0? Mark. 
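
For the -p/-d scheme that won the vote in the command line options thread, with the two options treated as mutually exclusive as Tim Stone suggested, the option handling could look roughly like the sketch below. It is only an illustration, not code from any SpamBayes script: the function name `parse_storage_options` is hypothetical, and it assumes that both options take a file name as their argument.

```python
import getopt


def parse_storage_options(argv):
    """Return (storage_kind, path), or None if neither option was given.

    Hypothetical helper: -p FILE selects a pickle, -d FILE selects a dbm
    database, and asking for both raises ValueError.
    """
    opts, _args = getopt.getopt(argv, "p:d:")
    chosen = None
    for flag, value in opts:
        kind = {"-p": "pickle", "-d": "dbm"}[flag]
        if chosen is not None and chosen[0] != kind:
            raise ValueError("-p and -d are mutually exclusive")
        chosen = (kind, value)
    return chosen
```

A script would call this with `sys.argv[1:]` and print its usage message when `ValueError` is raised.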
From tameyer at ihug.co.nz Thu Feb 5 03:27:39 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 03:27:57 2004 Subject: [spambayes-dev] -d/-D command line options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304E70F0F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8B@its-xchg4.massey.ac.nz> > As long as we'll be making an incompatible change to the > selection of pickles and databases, I'd like to see about > getting the default (pickle or database file) done right for > each application. Long-running applications which handle > both the training and scoring tasks should probably use > pickles by default. Other applications should use database > files by default. That will probably mean changes to the > config file to split Storage:persistent_use_database into > several application-specific options. I'm not opposed to having application-specific options, but do we definitely want to set pickle as the default for some? Tim's comments indicate that pickle isn't appropriate for Outlook, which probably means that it's not for sb_server. Imapfilter is a tricky one, since it can be run as a separate process at various intervals, or it can be run once and process at various intervals. With the former, a pickle would be bad, with the latter, probably good. I wish we knew if the changes in this release are enough to remove the majority of db errors. We should definitely get a few. =Tony Meyer From tameyer at ihug.co.nz Thu Feb 5 03:40:03 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 03:40:18 2004 Subject: [spambayes-dev] Preventing FAQ 3.13 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304BA3EAC@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8C@its-xchg4.massey.ac.nz> [Eli] > Kenny Pitt wrote, as have many before: > > > This is covered by FAQ 3.13: > > > Would it be possible to detect the error state resulting from > deleting the spam folder from within the plugin? 
Having a dialog > that pops up describing the problem and possible solutions (and perhaps > a few do-it-for-me buttons?) might cut down a fair bit of the support > questions that come through. At the risk of incurring the wrath of Mark , I've done the smallest possible thing to help out these people. Now, if the Spam/Unsure folder is deleted, rather than quietly failing to move the mail, it reports the error to the user, suggesting that this is what they may have done, and what to do about it. This doesn't prevent them doing it, or give them do-it-for-me buttons, but it's something. It also doesn't help those that have the Spam/Unsure folder in the Deleted Items folder, but haven't emptied it. =Tony Meyer From tameyer at ihug.co.nz Thu Feb 5 05:00:16 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 05:00:32 2004 Subject: [spambayes-dev] New release [Release candidate attached] In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304E71771@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8D@its-xchg4.massey.ac.nz> > How about we call this 0.9, the next 1.0b1 (and b2 etc as > needed), then 1.0? Sounds good to me. A release candidate for the source can be found here: (zip) (tgz) If people could give them a spin, that'd be great. (Especially the linux folk, since I've stuffed up the line endings before <0.5 wink>). =Tony Meyer From skip at pobox.com Thu Feb 5 10:08:45 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Feb 5 10:08:54 2004 Subject: [spambayes-dev] Re: spambayes/scripts ... In-Reply-To: References: Message-ID: <16418.23677.367758.16851@montanaro.dyndns.org> Tony> Let's call this 1.0a9, because 7+1 is a bad number, and that makes Tony> us match the Outlook plug-in. I must admit, I don't understand what makes 7+1 a bad number. Inside joke perhaps? 
Skip From kennypitt at hotmail.com Thu Feb 5 10:10:30 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 10:11:36 2004 Subject: [spambayes-dev] New release [Release candidate attached] In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8D@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> How about we call this 0.9, the next 1.0b1 (and b2 etc as >> needed), then 1.0? > > Sounds good to me. A release candidate for the source can be found > here: > > (zip) > (tgz) I noticed that these archives contain some version number changes that aren't checked into CVS. Is that intentional? -- Kenny Pitt From kennypitt at hotmail.com Thu Feb 5 10:28:26 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 10:29:30 2004 Subject: [spambayes-dev] New release In-Reply-To: <005801c3eb7b$6d3991a0$0200a8c0@eden> Message-ID: Mark Hammond wrote: >> In the meantime, I just checked in some basic cleanup to Version.py >> to > > Sounds good to me. Let's leave things for this release, and have a > quick discussion about your ideas all ready for the next one. There > isn't really anything in the current scheme I am attached to, so I > doubt it will be controversial :) POP3 Proxy, IMAP Filter, and IMAP Server have a separate InterfaceVersion. Most users won't differentiate between an app and its user interface, so do we really need a separate version for the UI? For apps that check for the latest version we only compare Version (or BinaryVersion), so we would need to bump that up anyway if we wanted users to be notified of the update. If noone is opposed, this is another quick cleanup we could make before the release that would probably simplify things for the future. -- Kenny Pitt From kennypitt at hotmail.com Thu Feb 5 10:31:31 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 10:32:26 2004 Subject: [spambayes-dev] Re: spambayes/scripts ... 
In-Reply-To: <16418.23677.367758.16851@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Tony> Let's call this 1.0a9, because 7+1 is a bad number, and that makes > Tony> us match the Outlook plug-in. > > I must admit, I don't understand what makes 7+1 a bad number. Inside > joke perhaps? Oh, good, I'm not the only one who missed that one! <0.9 wink> -- Kenny Pitt From david.booth at smhc.org Thu Feb 5 12:05:17 2004 From: david.booth at smhc.org (Booth, David) Date: Thu Feb 5 13:38:05 2004 Subject: [spambayes-dev] Suggestion for Spambayes Message-ID: I have been creating an extensive junk senders list within the Outlook Junk senders options. The junk senders text file located in the users C:\Documents and Settings\User Name\Application Data\Microsoft\Outlook\Junk Senders.txt and Adult Content Senders.txt could additionally be used for training the system. At least for those of us who are trying in vain through the Outlook junk senders option. David R. Booth, MCSE, CNA Director, Information Technology Southwest Mental Health Center 8535 Tom Slick Drive San Antonio, Texas 78229 Voice: (210) 582-6426 Fax: (210) 616-0417 www.smhc.org -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040205/044ba7d6/attachment.html From kennypitt at hotmail.com Thu Feb 5 14:11:49 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 14:12:55 2004 Subject: [spambayes-dev] Suggestion for Spambayes In-Reply-To: Message-ID: SpamBayes already extracts equivalent sender information for every message that you train. Adding the information from the Adult Content Senders and Junk Senders lists would be difficult to do statistically. The basic (aka over-simplified) concept behind the SpamBayes filter is the percentage of ham and spam messages that contain a particular token, and the Outlook sender lists have no information about the number of messages that contained those addresses. 
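
To make the point about missing counts concrete, here is a deliberately over-simplified sketch of a per-token spam ratio computed from trained messages. It is not the actual SpamBayes classifier; the function name `token_spam_ratio` and the representation of each message as a set of tokens are assumptions made for illustration.

```python
def token_spam_ratio(token, spam_msgs, ham_msgs):
    """Rough spam ratio for one token, from 0.0 (hammy) to 1.0 (spammy).

    spam_msgs and ham_msgs are lists of token sets, one set per trained
    message (an assumed representation, not the SpamBayes storage format).
    """

    def fraction_containing(msgs):
        # Share of the trained messages that contain the token.
        if not msgs:
            return 0.0
        return sum(1 for tokens in msgs if token in tokens) / len(msgs)

    spam_frac = fraction_containing(spam_msgs)
    ham_frac = fraction_containing(ham_msgs)
    if spam_frac + ham_frac == 0:
        return 0.5  # token never seen in training: no evidence either way
    return spam_frac / (spam_frac + ham_frac)
```

Both fractions need per-message counts. A sender list supplies only the addresses, so neither fraction can be computed from it.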
The reason that blocked sender lists don't usually have much effect is that spammers always forge the sender address, and it is usually randomized. The chances of receiving 2 spam messages with the exact same sending address are very small. For the same reason, the sender address is rarely a significant spam clue for SpamBayes. On the other hand, it is often an excellent clue for detecting good messages from people you correspond with often. -- Kenny Pitt _____ From: spambayes-dev-bounces@python.org [mailto:spambayes-dev-bounces@python.org] On Behalf Of Booth, David Sent: Thursday, February 05, 2004 12:05 PM To: spambayes-dev@python.org Subject: [spambayes-dev] Suggestion for Spambayes I have been creating an extensive junk senders list within the Outlook Junk senders options. The junk senders text file located in the users C:\Documents and Settings\User Name\Application Data\Microsoft\Outlook\Junk Senders.txt and Adult Content Senders.txt could additionally be used for training the system. At least for those of us who are trying in vain through the Outlook junk senders option. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040205/624fe8f4/attachment.html From tim at fourstonesExpressions.com Thu Feb 5 15:32:39 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Feb 5 15:32:51 2004 Subject: [spambayes-dev] Fwd: [Spambayes] Spambayes bug In-Reply-To: References: Message-ID: A problem that I don't know the answer to on the spambayes public list... anyone got a guess at what's going on here? ------- Forwarded message ------- From: Dee Dee To: spambayes@python.org Subject: [Spambayes] Spambayes bug Date: Thu, 5 Feb 2004 11:32:35 -0800 > Well shucks, no one responded to my problem and I've since seen similar > posts and it seems there isn't an answer to this bug. 
(Spam folder > suddenly > deletes and when trying to recreate it, configuration wizard doesn't > work). > I have deleted spaybayes and reinstalled, did the toolbar thing, > nothing. I > don't know what else to try so I guess it's good bye to Spambayes. I will > miss it, it worked GREAT. I would be more than happy to pay for it if > there > was a way to make it work again. Thanks for putting it out there. > > > > By the way, and not complaining here, I know this is all free and very > generous of you to even offer it, but on your website, and put in my > info to > go to the subscriber list, I get an error page. > -- Vous exprimer; Expr?sese; Esprimi te stesso; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com -------------- next part -------------- A non-text attachment was scrubbed... Name: attachment9827.dat Type: application/octet-stream Size: 180 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040205/ac1e8460/attachment9827.obj From tim.one at comcast.net Thu Feb 5 15:55:59 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Feb 5 15:55:56 2004 Subject: [spambayes-dev] Fwd: [Spambayes] Spambayes bug In-Reply-To: Message-ID: [Tim Stone] > A problem that I don't know the answer to on the spambayes public > list... anyone got a guess at what's going on here? Going on where? Dee Dee talks about more than one problem. Which one are you asking about? If this is about the last paragraph, get more info from her. Is she clicking on the Visit Subscriber List button? If so, did she enter the email address and password for her spambayes list subscription first? What did the error page say? If it said """ Bug in Mailman version 2.1.4 We're sorry, we hit a bug! If you would like to help us identify the problem, please email a copy of this page to the webmaster for this site with a description of what happened. Thanks! 
Traceback: Traceback (most recent call last): File "/usr/local/mailman-2.1/scripts/driver", line 87, in run_main main() File "/usr/local/mailman-2.1/Mailman/Cgi/roster.py", line 85, in main password, addr) File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 219, in WebAuthenticate for ac in authcontexts: File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 299, in CheckCookie for user in [Utils.UnobscureEmail(u) for u in usernames]: File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 309, in __checkone # combination. File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 104, in AuthContextInfo raise TypeError, 'No user supplied for AuthUser context' File "/usr/local/mailman-2.1/Mailman/OldStyleMemberships.py", line 102, in getMemberPassword raise Errors.NotAMemberError, member NotAMemberError: as@lik.bak """ then she (a) didn't enter her info correctly; and (b) she hit a Mailman bug as a result. OTOH, if it said """ Error Spambayes roster authentication failed. """ then she (a) didn't enter her info correctly; and (b) she didn't hit a Mailman bug as a result . I'll note that the address deedeemurry@greatbigisland.com isn't subscribed to the spambayes list, so if that's the address she entered, she can't succeed. > ------- Forwarded message ------- > From: Dee Dee > To: spambayes@python.org > Subject: [Spambayes] Spambayes bug > Date: Thu, 5 Feb 2004 11:32:35 -0800 > >> Well shucks, no one responded to my problem and I've since seen >> similar posts and it seems there isn't an answer to this bug. (Spam >> folder suddenly deletes and when trying to recreate it, >> configuration wizard doesn't work). I have deleted spaybayes and >> reinstalled, did the toolbar thing, nothing. I don't know what else >> to try so I guess it's good bye to Spambayes. I will miss it, it >> worked GREAT. I would be more than happy to pay for it if there was >> a way to make it work again. Thanks for putting it out there. 
>>
>>
>>
>> By the way, and not complaining here, I know this is all free and
>> very generous of you to even offer it, but on your website, and put
>> in my info to go to the subscriber list, I get an error page.

From kennypitt at hotmail.com Thu Feb 5 16:01:26 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Feb 5 16:02:26 2004
Subject: [spambayes-dev] Fwd: [Spambayes] Spambayes bug
In-Reply-To:
Message-ID:

Tim Stone wrote:
> A problem that I don't know the answer to on the spambayes public
> list... anyone got a guess at what's going on here?
>
> ------- Forwarded message -------
> From: Dee Dee
> To: spambayes@python.org
> Subject: [Spambayes] Spambayes bug
> Date: Thu, 5 Feb 2004 11:32:35 -0800

I had already responded to this before your message made it to the list. This appears to be a simple case of user accidentally deleting the spam folder as addressed almost daily by FAQ 3.13. The problem with the Config Wizard not showing up was a known problem that Mark fixed a few months ago.

--
Kenny Pitt

From tim at fourstonesExpressions.com Thu Feb 5 16:03:36 2004
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Thu Feb 5 16:03:43 2004
Subject: [spambayes-dev] Fwd: [Spambayes] Spambayes bug
In-Reply-To:
References:
Message-ID:

On Thu, 5 Feb 2004 15:55:59 -0500, Tim Peters wrote:

> [Tim Stone]
>> A problem that I don't know the answer to on the spambayes public
>> list... anyone got a guess at what's going on here?
>
> Going on where? Dee Dee talks about more than one problem. Which one are
> you asking about?

I think the one she really wants to solve, and the one I don't know anything about, is about her spam folder being deleted and when she tries to recreate it, the config wizard "doesn't work." Seems like not much evidence, but I thought perhaps someone in dev might recognize the symptom...

>>> Well shucks, no one responded to my problem and I've since seen
>>> similar posts and it seems there isn't an answer to this bug.
(Spam >>> folder suddenly deletes and when trying to recreate it, >>> configuration wizard doesn't work). I have deleted spaybayes and >>> reinstalled, did the toolbar thing, nothing. I don't know what else >>> to try so I guess it's good bye to Spambayes. I will miss it, it >>> worked GREAT. I would be more than happy to pay for it if there was >>> a way to make it work again. Thanks for putting it out there. >>> >>> >>> >>> By the way, and not complaining here, I know this is all free and >>> very generous of you to even offer it, but on your website, and put >>> in my info to go to the subscriber list, I get an error page. > -- Vous exprimer; Expr?sese; Esprimi te stesso; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com From kennypitt at hotmail.com Thu Feb 5 16:32:31 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 16:33:24 2004 Subject: [spambayes-dev] Capitalization of SpamBayes Message-ID: What is the preferred capitalization of our name: "SpamBayes" or "Spambayes"? It's used very inconsistently throughout the docs and even the web UI's, and I'd like to get that cleaned up. -- Kenny Pitt From barry at python.org Thu Feb 5 16:41:12 2004 From: barry at python.org (Barry Warsaw) Date: Thu Feb 5 16:41:29 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: References: Message-ID: <1076017271.5643.263.camel@anthem> On Thu, 2004-02-05 at 16:32, Kenny Pitt wrote: > What is the preferred capitalization of our name: "SpamBayes" or > "Spambayes"? It's used very inconsistently throughout the docs and even > the web UI's, and I'd like to get that cleaned up. I personally don't like capword spellings like this (hence "Mailman" not as commonly mispelled "MailMan"), so I'd prefer "Spambayes". Just one guy's opinion, though. 
-Barry From barry at wooz.org Thu Feb 5 16:12:51 2004 From: barry at wooz.org (Barry Warsaw) Date: Thu Feb 5 16:50:22 2004 Subject: [spambayes-dev] Fwd: [Spambayes] Spambayes bug In-Reply-To: References: Message-ID: <1076015571.5643.256.camel@anthem> On Thu, 2004-02-05 at 15:55, Tim Peters wrote: > Bug in Mailman version 2.1.4 > > We're sorry, we hit a bug! > If you would like to help us identify the problem, please email a copy of > this page to the webmaster for this site with a description of what > happened. Thanks! > > Traceback: > > Traceback (most recent call last): > File "/usr/local/mailman-2.1/scripts/driver", line 87, in run_main > main() > File "/usr/local/mailman-2.1/Mailman/Cgi/roster.py", line 85, in main > password, addr) > File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 219, in > WebAuthenticate > for ac in authcontexts: > File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 299, in > CheckCookie > for user in [Utils.UnobscureEmail(u) for u in usernames]: > File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 309, in > __checkone > # combination. > File "/usr/local/mailman-2.1/Mailman/SecurityManager.py", line 104, in > AuthContextInfo > raise TypeError, 'No user supplied for AuthUser context' > File "/usr/local/mailman-2.1/Mailman/OldStyleMemberships.py", line 102, in > getMemberPassword > raise Errors.NotAMemberError, member > NotAMemberError: as@lik.bak > """ > > then she (a) didn't enter her info correctly; and (b) she hit a Mailman bug > as a result. Yep, known, and fixed in cvs but not yet pushed out to the live site. Tell as@lik.bak to either kill their mail.python.org cookies or restart their browser (in which case their session cookie will expire). 
-Barry From tim at fourstonesExpressions.com Thu Feb 5 17:00:04 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Feb 5 17:00:30 2004 Subject: [spambayes-dev] Fwd: [Spambayes] Spambayes bug In-Reply-To: References: Message-ID: On Thu, 5 Feb 2004 16:01:26 -0500, Kenny Pitt wrote: > I had already responded to this before your message made it to the list. Good deal. Thanks! -- Exprimez vous; Expr?sese; Esprimi te stesso; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com From skip at pobox.com Thu Feb 5 17:06:10 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Feb 5 17:06:20 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: <1076017271.5643.263.camel@anthem> References: <1076017271.5643.263.camel@anthem> Message-ID: <16418.48722.828537.613497@montanaro.dyndns.org> Barry> On Thu, 2004-02-05 at 16:32, Kenny Pitt wrote: >> What is the preferred capitalization of our name: "SpamBayes" or >> "Spambayes"? Barry> ... I'd prefer "Spambayes". +1. Skip From tim.one at comcast.net Thu Feb 5 17:07:49 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Feb 5 17:07:49 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: <1076017271.5643.263.camel@anthem> Message-ID: [Kenny Pitt] >> What is the preferred capitalization of our name: "SpamBayes" or >> "Spambayes"? It's used very inconsistently throughout the docs and >> even the web UI's, and I'd like to get that cleaned up. [Barry Warsaw] > I personally don't like capword spellings like this (hence "Mailman" > not as commonly mispelled "MailMan"), so I'd prefer "Spambayes". > Just one guy's opinion, though. I always spelled it "spambayes", because that's the fastest to type. Most people seem to spell it SpamBayes, so I've tried to play along for consistency's sake. If we have to pick just one, I'd *like* to be with Barry: Spambayes looks less atrocious than SpamBayes. But Bayes is a proper noun, unlike the "man" Barry inherited. 
SpamFisher would be a more accurate name at this point (we've got more to do with a theorem of Fisher's than of Bayes's!). Let me know if I can clear anything else up . From popiel at wolfskeep.com Thu Feb 5 17:26:30 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Feb 5 17:26:36 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: Message from "Kenny Pitt" of "Thu, 05 Feb 2004 16:32:31 EST." References: Message-ID: <20040205222630.179FE2DF17@cashew.wolfskeep.com> In message: "Kenny Pitt" writes: >What is the preferred capitalization of our name: "SpamBayes" or >"Spambayes"? It's used very inconsistently throughout the docs and even >the web UI's, and I'd like to get that cleaned up. Why, it's "spambayes", all lower, even at the start of a sentence, of course! ;-) Any other monkeys you need wrenched? - Alex From tim at fourstonesExpressions.com Thu Feb 5 17:33:48 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Feb 5 17:33:56 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: <20040205222630.179FE2DF17@cashew.wolfskeep.com> References: <20040205222630.179FE2DF17@cashew.wolfskeep.com> Message-ID: On Thu, 05 Feb 2004 14:26:30 -0800, T. Alexander Popiel wrote: > In message: > "Kenny Pitt" writes: > >> What is the preferred capitalization of our name: "SpamBayes" or >> "Spambayes"? It's used very inconsistently throughout the docs and even I'm thinking Sp4mb4y3s... But we could always change its name, after all... SpamDestructifier SpamAnimosityAttenuator HamHighlighter Oh-so-creatively-your's... -- Exprimez vous!; Expr?sese; Esprimi te stesso; Express yourself! 
Tim Stone See my photography at www.fourstonesExpressions.com From kennypitt at hotmail.com Thu Feb 5 17:37:01 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 17:37:58 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: <16418.48722.828537.613497@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Barry> On Thu, 2004-02-05 at 16:32, Kenny Pitt wrote: > >> What is the preferred capitalization of our name: "SpamBayes" or > >> "Spambayes"? > > Barry> ... I'd prefer "Spambayes". > > +1. I guess us Windows guys are just used to mixed case. I was leaning towards "SpamBayes" myself. -- Kenny Pitt From kennypitt at hotmail.com Thu Feb 5 17:55:51 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 5 17:56:49 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: Message-ID: Tim Stone wrote: > On Thu, 05 Feb 2004 14:26:30 -0800, T. Alexander Popiel > wrote: > >> In message: >> "Kenny Pitt" writes: >> >>> What is the preferred capitalization of our name: "SpamBayes" or >>> "Spambayes"? It's used very inconsistently throughout the docs and >>> even > > I'm thinking Sp4mb4y3s... > > But we could always change its name, after all... > > SpamDestructifier > SpamAnimosityAttenuator > HamHighlighter > > Oh-so-creatively-your's... OK, now I think I'm beginning to understand why it's so inconsistent! -- Kenny Pitt From rmalayter at bai.org Thu Feb 5 18:01:04 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Thu Feb 5 18:01:09 2004 Subject: [spambayes-dev] Capitalization of SpamBayes Message-ID: <792DE28E91F6EA42B4663AE761C41C2A01A75DB7@cliff.bai.org> [Kenny Pitt] > I guess us Windows guys are just used to mixed case. I was leaning > towards "SpamBayes" myself. +1 from me. I've always liked the so-called "CaliforniaCase" (mixed case without spaces) for computer stuff, be it C++/VB/Java variable names, SQL table names, whatever. 
It's easy to separate the different parts of the name visually, and you don't have to resort to using underscores or other separator characters that might not work in some other instance. (Irritatingly, every language, OS, and protocol seems to have different rules in this regard. Some allow underscores, others only dashes, etc. But every platform I've ever seen accepts CaliforniaCase as valid identifier syntax, even if it's not a case-sensitive language.) But since I've contributed little besides bug reports, I probably shouldn't get a vote... Regards, Ryan From barry at python.org Thu Feb 5 17:46:05 2004 From: barry at python.org (Barry Warsaw) Date: Thu Feb 5 18:04:58 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: References: <20040205222630.179FE2DF17@cashew.wolfskeep.com> Message-ID: <1076021164.5643.276.camel@anthem> On Thu, 2004-02-05 at 17:33, Tim Stone wrote: > SpamDestructifier > SpamAnimosityAttenuator > HamHighlighter It's: Timco's New and Improved Despamificationilator 3000!!!!! -Barry From tameyer at ihug.co.nz Thu Feb 5 20:11:46 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 20:12:04 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC3557@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8E@its-xchg4.massey.ac.nz> > What is the preferred capitalization of our name: "SpamBayes" > or "Spambayes"? It's used very inconsistently throughout the > docs and even the web UI's, and I'd like to get that cleaned up. I used to use Spambayes, but then IIRC saw a check-in message consistifying :) to SpamBayes so have tried to use that since. The logo that the Outlook plug-in (and the readme for the binary sb_server) uses has SpamBayes, so if it ends up as something different, so should that. My personal preference is for spambayes. SpamBayes is worse to type than Spambayes, but, as Tim said, it is the dude's name, after all.
I don't really care which one of those it is. Maybe spam_bayes? Or SpamFisher? Or just FisherBayes, in case Hormel does get really picky? <0.9 wink>. =Tony Meyer From tim at fourstonesExpressions.com Thu Feb 5 20:35:20 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Feb 5 20:35:26 2004 Subject: [spambayes-dev] Capitalization of SpamBayes In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8E@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8E@its-xchg4.massey.ac.nz> Message-ID: On Fri, 6 Feb 2004 14:11:46 +1300, Tony Meyer wrote: > My personal preference is for spambayes. SpamBayes is worse to type than > Spambayes, but, as Tim said, it is the dude's name, after all. I think Tommy Bayes, upon close examination of our code, would ask us to change the name so as to not give the impression that he had anything to do with it... -- Exprimez vous!; Expr?sese; Esprimi te stesso; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com From tameyer at ihug.co.nz Thu Feb 5 20:51:57 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 20:52:14 2004 Subject: [spambayes-dev] Re: spambayes/scripts ... In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC346D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A91@its-xchg4.massey.ac.nz> > > I must admit, I don't understand what makes 7+1 a bad > > number. Inside joke perhaps? > > Oh, good, I'm not the only one who missed that one! <0.9 wink> Sorry, maybe I'm just too young or geeky <0.5 wink>. It's a Discworld reference: (it's the 14th entry). It was much more a thing in the early Discworld books than in the recent stuff. 
Ignoring the bad joke, Mark's suggestion to use 1.09 did seem to make sense, to try and synchronise the binary/source stuff, but if anyone cares enough to check in a change back to 1.08, I'm not going to care :) =Tony From tameyer at ihug.co.nz Thu Feb 5 20:56:10 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 20:56:27 2004 Subject: [spambayes-dev] New release In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC3468@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046778F3@its-xchg4.massey.ac.nz> > POP3 Proxy, IMAP Filter, and IMAP Server have a separate > InterfaceVersion. Most users won't differentiate between an > app and its user interface, so do we really need a separate > version for the UI? [...] > If noone is opposed, this is another quick cleanup we could > make before the release that would probably simplify things > for the future. Yes, this is definitely something that should be done, especially since the vast majority of the web interface code is shared among them anyway. I had made this change locally, but didn't ever commit it since I wanted to sort out the rest, but couldn't figure out how :) IMAP Server's interface version can be dumped; it doesn't really use the web interface at all anymore (just for configuration, and even that will go). I had originally made a new 'application', which was the interface, and both sb_server and sb_imapfilter referred to that. It does seem that the interface may change significantly without sb_server or sb_imapfilter doing so (in fact, sb_server hardly changes at all anymore, but the interface changes lots). But you can just kill it if you think that's better. 
=Tony Meyer From tameyer at ihug.co.nz Thu Feb 5 20:58:43 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 5 20:58:59 2004 Subject: [spambayes-dev] New release [Release candidate attached] In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC3458@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046778F4@its-xchg4.massey.ac.nz> > I noticed that these archives contain some version number > changes that aren't checked into CVS. Is that intentional? Well spotted :) The answer is kinda. I bumped the __init__.py version, as per the instructions, and figured that the Version.py ones would need to be as well, before an actual release. I didn't check anything in, though, because I figured it'd be better to wait until the stuff is sorted :) The ones I bumped are the ones that I personally think have changed since 1.0a7 (although I'm 50/50 about the engine itself, but I think the experimental/deprecated options make it so). =Tony Meyer From tim.one at comcast.net Fri Feb 6 00:28:35 2004 From: tim.one at comcast.net (Tim Peters) Date: Fri Feb 6 00:28:33 2004 Subject: [spambayes-dev] -d/-D command line options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8B@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > I'm not opposed to having application-specific options, but do we > definitely want to set pickle as the default for some? So long as this is alpha software, I want to inflict bsddb3 on as many users as possible -- it seems that's the only hope we have left of getting enough clues about the corruption problems to solve them. OTOH, if nobody can make time to rework the code to use bsddb3 in a transactional way, and provide for recovery, this will never get beyond alpha software so long as bsddb3 is the default for any of our apps. ... > I wish we knew if the changes in this release are enough to remove the > majority of db errors. We should definitely get a few. We'll learn the most about that if we don't monkey with the defaults now. 
Alpha users are supposed to help debug software, even if no Outlook users realize it <0.5 wink>. From kennypitt at hotmail.com Fri Feb 6 10:36:16 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 6 10:37:10 2004 Subject: [spambayes-dev] New release In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046778F3@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: > It does seem that the interface may change significantly without > sb_server or sb_imapfilter doing so (in fact, sb_server hardly > changes at all anymore, but the interface changes lots). But you can > just kill it if you think that's better. My thought here was that even if it's only the UI that has changed, to the average user the "application" has still changed. The pop3proxy_tray wrapper for sb_server supports the "check for latest version" function, but it only checks the sb_server version and not the interface version. If we only increment the interface version, the latest version check won't detect that the app has been updated. End result: whenever we increment interface version we have to increment the sb_server version as well, so we lose any value that might have been gained by knowing that the interface has changed but sb_server hasn't. There have also been some bug fixes and additional options added to the tokenizer/classifier which would affect the operation of sb_server because it depends on those pieces. As developers we like to know that only the engine has changed, but to an end user it's still a new version of sb_server. Since these are the sorts of issues that are driving my thinking on the versioning, I guess now is as good a time as any to try to present what I've come up with so far. This will be an off-the-cuff description of the scheme, so I'm sure it will ramble a lot (especially since I tend to do that anyway <0.5 wink>). Let the discussion begin, if anyone is determined enough to read through all this. 
Whenever we've made enough changes to be worth a new release, something in there probably affects every app in one way or another. It's also a little difficult to keep everything straight as users report issues when different apps have different versioning schemes. My proposal is that all apps share the source release version as their primary version number. The shared release version would consist of the following parts: * A major/minor version number (1.0) * A release number that would increment for each release. If alpha9 is our ninth release then the release number would be 9, but it would increment to 10 for beta1. The release number could be reset to 1 when we move on to 1.1 development (although it wouldn't have to be). The purpose of this is to give us a three-part version number major.minor.release that is always increasing. * A string representation of the version ("1.0a9" or "1.0b1"). The binary major.minor.release version would be used for version check comparisons, but this string representation is what would be visible to the user. * A release date ("Feb 2004") In addition to the shared version info, the engine and each application would have a separate "revision number" that we would increment during development to track changes specific to that app. At a minimum, the revision number would be incremented before each release if the app has been updated. We could also choose to increment it each time we make a significant change to the app between releases, which might make it easier to track the state for users running from CVS source. Does anyone think we should keep a revision date as well to show when the revision number was last incremented? For the complete version number of each app, the revision number could be added to the string representation of the release version to produce something like "1.0a9-004" for sb_server (actual format is, of course, up for discussion). 
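[Editor's illustration, not part of the original message.] The scheme proposed so far could be sketched roughly as follows. This is only a sketch under stated assumptions: the class names ReleaseVersion and AppVersion, the key() and full_string() helpers, and the "unreleased" date are hypothetical and do not come from Version.py; the "-004" format is the one floated above as being up for discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseVersion:
    """Shared release info (hypothetical layout, not the real Version.py)."""
    major: int
    minor: int
    release: int   # always increasing: alpha9 -> 9, beta1 -> 10
    display: str   # user-visible string, e.g. "1.0a9"
    date: str      # release date, e.g. "Feb 2004"

    def key(self):
        # Numeric tuple used for "is there a newer version?" comparisons.
        return (self.major, self.minor, self.release)


@dataclass(frozen=True)
class AppVersion:
    """Per-application revision layered on top of the shared release."""
    shared: ReleaseVersion
    revision: int

    def full_string(self):
        # Display string plus a zero-padded revision number.
        return "%s-%03d" % (self.shared.display, self.revision)


alpha9 = ReleaseVersion(1, 0, 9, "1.0a9", "Feb 2004")
beta1 = ReleaseVersion(1, 0, 10, "1.0b1", "unreleased")  # date is a placeholder

sb_server_a9 = AppVersion(alpha9, 4)
sb_server_b1 = AppVersion(beta1, 4)
```

Comparing the numeric key rather than the display string is the point of the separate release number: a string such as "1.0a10" would sort before "1.0a9", while the tuple (1, 0, 10) correctly sorts after (1, 0, 9). An unchanged revision across two releases (004 in both cases here) signals that the app itself did not change.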
The revision number could also be combined with the binary release version to produce a standard 4-part version number for use on Windows binaries (major.minor.release.revision). As developers, we know that we can always look at the revision number to see that sb_server itself did not change between "1.0a9-004" and "1.0b1-004". We should be able to use the same version numbering for binary versions instead of keeping a separate version number. If we need to release an updated binary between source releases then we can increment the app's revision number to indicate that, but I'm not sure what version number to put on the Windows installer release if it contains more than one app. I'd also like to have Version.py do the check for binary or source version when building the version description string rather than each caller doing the check and selecting a different description format string. OK, I think that's more than enough to fuel the fire for now, so have at it! -- Kenny Pitt From skip at pobox.com Fri Feb 6 11:53:41 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 6 11:53:52 2004 Subject: [spambayes-dev] train to exhaustion? Message-ID: <16419.50837.685474.162845@montanaro.dyndns.org> Did anyone see Gary Robinson's blog (and related pages) about train-to-exhaustion? Justin Mason posted a reference on the spambayes list. Does one of the incremental training regimens implement it under a different name? Skip From tim.one at comcast.net Fri Feb 6 12:39:12 2004 From: tim.one at comcast.net (Tim Peters) Date: Fri Feb 6 12:39:09 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: <16419.50837.685474.162845@montanaro.dyndns.org> Message-ID: [Skip] > Did anyone see Gary Robinson's blog (and related pages) about > train-to-exhaustion? Justin Mason posted a reference on the > spambayes list. Yup, and I read it then. > Does one of the incremental training regimens implement it under a > different name? 
Don't think so, although the fpfnunsure regime seems to correspond closely to one *pass* of TTE. TTE is like running fpfnunsure repeatedly, starting each pass with the trained database from the end of the last pass (and starting with an empty training database), until results stop improving. The TTE description had a wrinkle, alternating between ham and spam, so it appears to assume you have an equal number of each. On each pass you look at *all* messages (even the ones already trained on); whether you allow it to train again on a message that's already been trained is a choice. From barry at python.org Fri Feb 6 17:20:09 2004 From: barry at python.org (Barry Warsaw) Date: Fri Feb 6 17:30:18 2004 Subject: [spambayes-dev] Re: -d/-D command line options In-Reply-To: References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8B@its-xchg4.massey.ac.nz> Message-ID: Tim Peters wrote: > > So long as this is alpha software, I want to inflict bsddb3 on as many users > as possible -- it seems that's the only hope we have left of getting enough > clues about the corruption problems to solve them. OTOH, if nobody can make > time to rework the code to use bsddb3 in a transactional way, and provide > for recovery, this will never get beyond alpha software so long as bsddb3 is > the default for any of our apps. Let's say I went insane and decided to see if I could bang a transactional bsddb3 implementation together. Where should I hook it in? What interface should I try to support? -Barry From kennypitt at hotmail.com Fri Feb 6 17:48:18 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 6 17:49:13 2004 Subject: [spambayes-dev] RE: [Spambayes] Spam block In-Reply-To: <16420.3967.74757.764544@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Kenny> The open-source "YahooPOPs!" 
program will allow you to read your > Kenny> Yahoo mail via a POP3 interface, and you can then point the > Kenny> SpamBayes sb_server POP3 proxy filter at the YahooPOPs POP3 > Kenny> interface to filter your mail. > > Kenny> http://yahoopops.sourceforge.net > > Cool... Perhaps this belongs in the FAQ? Done. Can someone rebuild and update the website? -- Kenny Pitt From skip at pobox.com Fri Feb 6 17:58:24 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 6 17:58:32 2004 Subject: [spambayes-dev] Re: -d/-D command line options In-Reply-To: References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8B@its-xchg4.massey.ac.nz> Message-ID: <16420.7184.237886.249166@montanaro.dyndns.org> Barry> Let's say I went insane and decided to see if I could bang a Barry> transactional bsddb3 implementation together. Where should I Barry> hook it in? What interface should I try to support? Start with the DBDictClassifier class in spambayes.storage. Hooking it in is easy. Edit storage._storage_types and replace DBDictClassifier with your class or add another entry to it keyed by something like "tdbm". Note that there is a ZODBClassifier on the todo list in that module should you be extra motivated. Skip From skip at pobox.com Fri Feb 6 18:04:07 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 6 18:04:15 2004 Subject: [spambayes-dev] Re: -d/-D command line options In-Reply-To: References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A8B@its-xchg4.massey.ac.nz> Message-ID: <16420.7527.335514.382958@montanaro.dyndns.org> > Start with the DBDictClassifier class in spambayes.storage. Hooking > it in is easy. Edit storage._storage_types and replace > DBDictClassifier with your class or add another entry to it keyed by > something like "tdbm". I forgot - if you add another entry in _storage_types you'll have to twiddle storage.database_type() as well. 
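[Editor's illustration, not part of the original message.] The hook-in described here amounts to adding a name-to-class entry to a registry and teaching the lookup about the new key. The sketch below shows only that general pattern with stand-in classes. The real layout of storage._storage_types and the behaviour of storage.database_type() are not shown in this thread, so every name below (DBDictClassifierStandIn, TransactionalClassifier, open_storage) is hypothetical.

```python
class DBDictClassifierStandIn:
    """Stand-in for the existing dbm-backed classifier (hypothetical)."""

    def __init__(self, path):
        self.path = path


class TransactionalClassifier(DBDictClassifierStandIn):
    """Hypothetical transactional variant that would wrap each
    per-message training update in a transaction."""


# Hypothetical registry mapping a storage-type name to a class.
_storage_types = {"dbm": DBDictClassifierStandIn}

# Adding a new backend is then a one-line registration under a new key.
_storage_types["tdbm"] = TransactionalClassifier


def open_storage(path, kind):
    """Look up the class registered for `kind` and instantiate it."""
    try:
        cls = _storage_types[kind]
    except KeyError:
        raise ValueError("unknown storage type: %r" % (kind,))
    return cls(path)
```

With this shape, callers select the backend by name, for example open_storage("hammie.db", "tdbm"), and an unregistered name fails loudly instead of silently falling back to another backend.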
Skip From skip at pobox.com Fri Feb 6 18:04:35 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 6 18:04:42 2004 Subject: [spambayes-dev] contrib/tte.py Message-ID: <16420.7555.681983.164247@montanaro.dyndns.org> I just checked in contrib/tte.py. Here's the doc string: """ Train to exhaustion: train repeatedly on a pile of ham and spam until everything scores properly. usage %(prog)s [ -h ] -g file -s file [ -d file | -p file ] [ -m N ] -h - print this documentation and exit. -g file - take ham from file -s file - take spam from file -d file - use a database-based classifier named file -p file - use a pickle-based classifier named file -m N - train on at most N messages (nham == N/2 and nspam == N/2) See Gary Robinson's blog: http://www.garyrobinson.net/2004/02/spam_filtering_.html """ I have an unsure pile at the moment with 580 or so messages in it. I'm going to see how it does with that, varying the maximum number of messages I tte on. Skip From kennypitt at hotmail.com Fri Feb 6 18:27:04 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 6 18:28:01 2004 Subject: [spambayes-dev] Re: -d/-D command line options In-Reply-To: <16420.7184.237886.249166@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Barry> Let's say I went insane and decided to see if I could bang a > Barry> transactional bsddb3 implementation together. Where should I > Barry> hook it in? What interface should I try to support? > > Start with the DBDictClassifier class in spambayes.storage. Hooking > it in is easy. Edit storage._storage_types and replace > DBDictClassifier with your class or add another entry to it keyed by > something like "tdbm". Without taking time to look at the code, it seems true transactional support might need a little more than that. Doesn't DBDictClassifier just provide methods to update individual token counts? I would think the correct transactional approach would be: 1. Start a transaction before training from a single message 2. 
Attempt to train all tokens in the message 3. If any token update fails, rollback all count updates from that message 4. If all tokens succeed, update the trained message count 5. If everything was successful, commit the transaction -- Kenny Pitt From skip at pobox.com Fri Feb 6 18:42:46 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 6 18:42:55 2004 Subject: [spambayes-dev] Re: -d/-D command line options In-Reply-To: References: <16420.7184.237886.249166@montanaro.dyndns.org> Message-ID: <16420.9846.485247.821461@montanaro.dyndns.org> Kenny> Without taking time to look at the code, it seems true Kenny> transactional support might need a little more than that. Kenny> Doesn't DBDictClassifier just provide methods to update Kenny> individual token counts? Yeah, but it should still be the best starting point for investigation. Skip From tameyer at ihug.co.nz Sat Feb 7 21:34:55 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Feb 7 21:35:14 2004 Subject: [spambayes-dev] RE: [Spambayes] Spam block In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC3859@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677903@its-xchg4.massey.ac.nz> [Skip] > Cool... Perhaps this belongs in the FAQ? [Kenny] > Done. Can someone rebuild and update the website? Done. =Tony Meyer From scott at pcsincnet.com Sun Feb 8 12:48:06 2004 From: scott at pcsincnet.com (scott@pcsincnet.com) Date: Sun Feb 8 12:48:32 2004 Subject: [spambayes-dev] Time window Message-ID: <000801c3ee6b$b5a051c0$0a03a8c0@hp> We recieve most of our spam after hours. Could a time window help the SpamBayes scoring system ? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040208/8ccff5c0/attachment.html From skip at pobox.com Sun Feb 8 16:29:52 2004 From: skip at pobox.com (Skip Montanaro) Date: Sun Feb 8 16:30:13 2004 Subject: [spambayes-dev] Time window In-Reply-To: <000801c3ee6b$b5a051c0$0a03a8c0@hp> References: <000801c3ee6b$b5a051c0$0a03a8c0@hp> Message-ID: <16422.43600.359685.249095@montanaro.dyndns.org> scott> We recieve most of our spam after hours. Could a time window help scott> the SpamBayes scoring system ? Yes, in theory. I ran some tests a fair while ago which were pretty inconclusive though. Try running some tests with the x-generate_time_buckets option set. There's also x-extract_dow for examining the day of the week. Skip From seant at webreply.com Mon Feb 9 19:16:22 2004 From: seant at webreply.com (Sean True) Date: Mon Feb 9 19:16:14 2004 Subject: [spambayes-dev] Small patch to Outlook2000/dialogs/FolderSelector.py Message-ID: <200402100016.i1A0G6T7021079@mailhub.wr2.com> Integrating this with some other code appears to reveal a bug in the code that keeps the node labels in memory. The object is returned from Pack...Item() in order to keep it the scope of the caller, but it appears that that is not enough: symptom was that the labels got splattered by other data unless they are kept in scope longer. I've got developer status, but my CVS access is extremely inconsistent, so I will toss this out and see if anyone else likes it well enough to check it in. 
-- Sean --- ..\..\..\spambayes-latest-cvs\Outlook2000\dialogs\FolderSelector.py Tue Dec 30 11:26:32 2003 +++ FolderSelector.py Mon Feb 9 19:06:59 2004 @@ -276,6 +276,8 @@ ): FolderSelector_Parent.__init__(self, parent, manager.dialog_parser, "IDD_FOLDER_SELECTOR") assert not single_select or selected_ids is None or len(selected_ids)<=1 + # List of things to keep in scope for a while + self.extras = [] self.single_select = single_select self.next_item_id = 1 self.item_map = {} @@ -343,6 +345,8 @@ bitmapSel, cItems, item_id)) + # Keep the buffered string info in scope + self.extras = self.extras + extras if verbose: print "Inserting item", repr(insert_buf), "-", hitem = win32gui.SendMessage(self.list, commctrl.TVM_INSERTITEM, ============== Sean True From mhammond at skippinet.com.au Mon Feb 9 23:23:51 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Feb 9 23:24:08 2004 Subject: [spambayes-dev] Release 1.0a9 (0.9) Message-ID: <3fe901c3ef8d$b07ec0f0$0200a8c0@eden> Hi all, Tony (thanks!) and I have put together version "1.0a9 (0.9)" We went for this "dual" version number so it is clearly an upgrade for both Outlook users and source-code users. The release includes the source archives, and the 'combined binary installer'. Please see the 'Files' page at https://sourceforge.net/project/showfiles.php?group_id=61702 - I would appreciate any comments on the release notes, change-log, and a few success reports of the tarballs. I only just made the release active, but have not yet sent a release notice, nor announced it anywhere else. Assuming no showstoppers, I intend doing this tomorrow(ish). Of course, anyone else is free to make the announcements. I will also set the 'Outlook' package to hidden, to help remove confusion. Let me know what you think! Mark. 
From tameyer at ihug.co.nz Tue Feb 10 16:40:06 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Feb 10 16:40:27 2004 Subject: [spambayes-dev] Release 1.0a9 (0.9) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC403F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677929@its-xchg4.massey.ac.nz> > Please see the 'Files' page at > https://sourceforge.net/project/showfiles.php?group_id=61702 > - I would appreciate any comments on the > release notes, change-log, and a few success reports of the tarballs. I think it's worth me putting new versions of the source there before the public release that have the typo in dbexpimp's docstring fixed, otherwise this will confuse people lots. Any other requests before I put it up? =Tony Meyer From kennypitt at hotmail.com Wed Feb 11 09:13:38 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Feb 11 09:14:34 2004 Subject: [spambayes-dev] Outlook Addin files on SourceForge Message-ID: I have a suspicion that many Outlook users who look at the SourceForge Files page will see that "Version 0.8" is still the latest file in the "Outlook Addin" section and think that the addin has not been updated. Would it make sense to list the "spambayes-1.0a9.exe" installer under both the "spambayes" and "Outlook Addin" sections? -- Kenny Pitt From sethg at GoodmanAssociates.com Wed Feb 11 11:17:28 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Wed Feb 11 11:17:33 2004 Subject: [spambayes-dev] Outlook Addin files on SourceForge In-Reply-To: Message-ID: > [Kenny Pitt] > Would it make sense to list the "spambayes-1.0a9.exe" installer under > both the "spambayes" and "Outlook Addin" sections? Yes, or possibly deprecate the Outlook add-in section with an explanatory note if it is now redundant. -- Seth Goodman From kennypitt at hotmail.com Wed Feb 11 14:58:19 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Feb 11 14:59:14 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer?
Message-ID: I pointed this user at the new 1.0a9 (0.9) installer because he was getting the "unable to register" error with 0.81. He sent back the following report. Perrin Jean-Marc wrote: > I just download and install the spambayes-1.0a9 > > Installation is all right, but now, when I launch Outlook, nothing's > appears, Outlook has no change at all ? > > It seems to me that the oultlook addin is not installed and enabled I unregistered my source version and ran the installer myself, and I got the same result. The SpamBayes toolbar was still present from my source version, but all the buttons were inactive. SpamBayes appeared in the COM Add-ins list, but was unchecked and reported something about "an error occurred during load". I did a regsvr32 on the installed outlook_addin.dll and SpamBayes worked again after restarting Outlook. It looks like we may have a problem in the installer with registering the addin properly. My initial guess is that the Outlook Add-in registry entries are getting set, but the COM object itself isn't getting registered. I'm heading off to take a look at this now, but wanted to give everyone a heads up in case someone wants to pull (or hide, or whatever) the installer from SourceForge until we can sort this out. -- Kenny Pitt From kennypitt at hotmail.com Wed Feb 11 15:16:35 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Feb 11 15:17:31 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: Message-ID: Kenny Pitt wrote: > ... I did a regsvr32 on > the installed outlook_addin.dll and SpamBayes worked again after > restarting Outlook. > > It looks like we may have a problem in the installer with registering > the addin properly. My initial guess is that the Outlook Add-in > registry entries are getting set, but the COM object itself isn't > getting registered. 
OK, I've taken a look and it appears that registering the addin using the outlook_addin_register executable during install does two things differently than registering the addin DLL directly with regsvr32:
* outlook_addin_register sets the DLL path in InprocServer32 to "pythoncom23.dll", while regsvr32 sets it to "\bin\outlook_addin.dll"
* outlook_addin_register does not create the PythonCOMPath key
-- Kenny Pitt From tameyer at ihug.co.nz Wed Feb 11 18:52:45 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Feb 11 18:53:16 2004 Subject: [spambayes-dev] Outlook Addin files on SourceForge In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC43B3@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467793F@its-xchg4.massey.ac.nz> [Kenny Pitt] > Would it make sense to list the "spambayes-1.0a9.exe" installer under > both the "spambayes" and "Outlook Addin" sections? [Seth Goodman] > Yes, or possibly deprecate the Outlook add-in section with an > explanatory note if it is now redundant. The plan is to set the Outlook Addin package to "Hidden", which means that it won't be visible at all, so they won't be able to select it. One hopes that they are then clever enough to look at the most recent release and get the installer from there. If we end up getting "where has the installer gone?" questions, we could always add a note to the page explaining the situation. I suspect that the majority of people will download it directly from the link on the windows.html page, or in the announcement email, anyway. =Tony Meyer From tameyer at ihug.co.nz Wed Feb 11 19:23:38 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Feb 11 19:24:00 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13050DCEE7@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677940@its-xchg4.massey.ac.nz> > I unregistered my source version and ran the installer myself, and I > got the same result.
What version of Outlook/Windows are you using? I tried it and it worked fine, using Outlook XP SP2, with Windows XP Pro SP1. The OP was Outlook 2k with WinNT (French, but I doubt that makes a difference, or that you are using French). > I'm heading off to take a look at this now, but wanted to give > everyone a heads up in case someone wants to pull (or hide, or > whatever) the installer from SourceForge until we can sort this out. Only those reading the -dev list know that the 1.0a9 (0.9) release is there (and those that have stumbled across it while looking at the sourceforge site, I suppose), so it should be ok to leave it active rather than hidden. This way at least those here can give it a go without going through convoluted loops to get hold of it. > OK, I've taken a look and it appears that registering > the addin using the outlook_addin_register executable during install > does two things differently than registering the addin DLL directly > with regsvr32: > * outlook_addin_register sets the DLL path in InprocServer32 to > "pythoncom23.dll", while regsvr32 sets it to > "\bin\outlook_addin.dll" I get the installer setting it to "\bin\outlook_addin.dll". > * outlook_addin_register does not create the PythonCOMPath key The installer does this for me, too. =Tony Meyer From tameyer at ihug.co.nz Thu Feb 12 02:27:38 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 12 02:28:00 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304FC37AD@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467794A@its-xchg4.massey.ac.nz> [Skip] > Did anyone see Gary Robinson's blog (and related pages) about > train-to-exhaustion? Justin Mason posted a reference on > the spambayes list. Like Tim, I read it then, and then heard someone (Bill Yerazunis?) mention it while I was watching the 2004 MIT Spam Conference webcasts. 
[Skip] > Does one of the incremental training regimens implement it under a > different name? > > Don't think so, although the fpfnunsure regime seems to > correspond closely to one *pass* of TTE. TTE is like running > fpfnunsure repeatedly, starting each pass with the trained > database from the end of the last pass (and starting with an > empty training database), until results stop improving. By "results stop improving", do you think that the intention is that the same number of messages are misclassified, or that the scores stop getting better? (ie. if one message was still a false-positive, but moved from 0.8 to 0.7, is that improving?). I've written up a regime to do this with the incremental.py setup, or at least I hope so :) It's damn slow, though. I can't get it to run at any speed that's any good unless I only use a very recent portion (like 2 days) of mail for the retesting. With my data, and this setup (allowing mail to be trained more than once, and using the latest 2 days of mail), I found it gave better results than fpfnunsure, but still not as good as nonedge (apart from very early on, when all sorts of weird things happen with all the regimes, and I think is an artefact of that mail). has graphs, a bit more in the way of write-up, and also some output from Skip's tte.py script. =Tony Meyer From kennypitt at hotmail.com Thu Feb 12 08:54:07 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 12 08:55:02 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677940@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> I unregistered my source version and ran the installer myself, and I >> got the same result. > > What version of Outlook/Windows are you using? I tried it and it > worked fine, using Outlook XP SP2, with Windows XP Pro SP1. The OP > was Outlook 2k with WinNT (French, but I doubt that makes a > difference, or that you are using French). 
I'm on Win2K Server SP4 (US English) with Outlook 2003. Outlook version probably doesn't matter because I'm just sitting in the dist\bin directory registering and unregistering and then looking at the registry with regedit. > I get the installer setting it to > "\bin\outlook_addin.dll". > >> * outlook_addin_register does not create the PythonCOMPath key > > The installer does this for me, too. I must admit that I used my locally-built copy of the installer which *should* be the same but isn't necessarily identical. I'll download the actual installer from SourceForge and try again. -- Kenny Pitt From kennypitt at hotmail.com Thu Feb 12 09:15:04 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 12 09:16:00 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130467794A@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: > By "results stop improving", do you think that the intention is that > the same number of messages are misclassified, or that the scores > stop getting better? (ie. if one message was still a false-positive, > but moved from 0.8 to 0.7, is that improving?). Gary's original blog entry defines train-to-exhaustion as the following: "Training to exhaustion" is repeating training on error, with the same message corpus, until no errors remain. The "until no errors remain" says to me that you *want* to keep iterating until that false-positive is correctly classified. I would think, then, that you would keep going as long as the score indicates that you are getting closer to correct classification. Where I'm a bit unclear is what to do if repeated training on that last remaining false positive starts causing other messages to be misclassified. I wonder what would happen if you took an "incorrectness score" that was the average of the distance from perfect classification over all messages, and stop if that average ever increases? In any case, this is a very computationally intensive process. 
It seems like it would be a good approach for initial training over a starting corpus, but maybe not well suited to ongoing incremental training. -- Kenny Pitt From kennypitt at hotmail.com Thu Feb 12 10:10:33 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 12 10:11:28 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: Message-ID: Kenny Pitt wrote: > Tony Meyer wrote: >> I get the installer setting it to >> "\bin\outlook_addin.dll". >> >>> * outlook_addin_register does not create the PythonCOMPath key >> >> The installer does this for me, too. > > I must admit that I used my locally-built copy of the installer which > *should* be the same but isn't necessarily identical. I'll download > the actual installer from SourceForge and try again. Interesting. I used the SF installer and it did register as outlook_addin.dll. I didn't get the PythonCOMPath key, but the addin worked fine in Outlook all the same. I looked at the version number on python23.dll and noticed that the SF installer was built using the Python 2.3.2 release. I've updated to 2.3.3 so my local version was built with that. I'd be interested to know what results others get when rebuilding with 2.3.3. As it is, I can't begin to guess whether this is due to differences in 2.3.3, or to some other configuration difference unique to my system. -- Kenny Pitt From tameyer at ihug.co.nz Thu Feb 12 16:12:53 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 12 16:13:18 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13050DD110@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467794B@its-xchg4.massey.ac.nz> > Interesting. I used the SF installer and it did register as > outlook_addin.dll. I didn't get the PythonCOMPath key, but > the addin worked fine in Outlook all the same. 
> > I looked at the version number on python23.dll and noticed > that the SF installer was built using the Python 2.3.2 > release. I've updated to 2.3.3 so my local version was built > with that. I'd be interested to know what results others get > when rebuilding with 2.3.3. I'm using Python 2.3.3, so I built a local copy of the installer and tried it and it worked here, too, so I guess that's not it. > As it is, I can't begin to guess > whether this is due to differences in 2.3.3, or to some other > configuration difference unique to my system. It must be some other difference, I suppose, although this still doesn't explain the problem that the OP had, since he was presumably using the one from sf. Hmm... If Mark doesn't find time to make the release announcement today, I will - if people find that it fails, then we'll just have to act really quickly and get a 1.0a9.1 (0.91) release out. Hopefully, people should find that this one is much better. =Tony Meyer From skip at pobox.com Thu Feb 12 17:51:37 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Feb 12 17:51:50 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: References: <1ED4ECF91CDED24C8D012BCF2B034F130467794A@its-xchg4.massey.ac.nz> Message-ID: <16428.889.749153.651608@montanaro.dyndns.org> Kenny> Tony Meyer wrote: >> By "results stop improving", do you think that the intention is that >> the same number of messages are misclassified, or that the scores >> stop getting better? (ie. if one message was still a false-positive, >> but moved from 0.8 to 0.7, is that improving?). Kenny> Gary's original blog entry defines train-to-exhaustion as the Kenny> following: Kenny> "Training to exhaustion" is repeating training on error, with the Kenny> same message corpus, until no errors remain. Kenny> The "until no errors remain" says to me that you *want* to keep Kenny> iterating until that false-positive is correctly classified. That's how I interpreted it as well when I wrote tte.py.
With my current training database (roughly 700 total messages, evenly split between hams and spams) it takes five passes through the database (two to three minutes) to correctly classify all messages. Each pass is faster than its predecessor because it trains on fewer messages. Kenny> I would think, then, that you would keep going as long as the Kenny> score indicates that you are getting closer to correct Kenny> classification. And stop once all ham score at or below the ham_cutoff and all spam score at or above the spam_cutoff. Kenny> Where I'm a bit unclear is what to do if repeated training on Kenny> that last remaining false positive starts causing other messages Kenny> to be misclassified. I think you keep at it. The tte.py script scores each message on each pass, ignoring the results for that message on previous passes. If it scores out of the zone on this pass it is trained. It doesn't matter if it was in the zone on an earlier pass. I look at that sort of thing this way. I have some hams and some spams with significant enough numbers of tokens in common. By repeatedly training on those messages we discount the value of those shared tokens and increase the value of each message's unique tokens. Kenny> I wonder what would happen if you took an "incorrectness score" Kenny> that was the average of the distance from perfect classification Kenny> over all messages, and stop if that average ever increases? I don't understand what you're suggesting. What is "perfect classification over all messages"? Skip From popiel at wolfskeep.com Thu Feb 12 18:32:57 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Feb 12 18:33:02 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: Message from Skip Montanaro of "Thu, 12 Feb 2004 16:51:37 CST."
<16428.889.749153.651608@montanaro.dyndns.org> References: <1ED4ECF91CDED24C8D012BCF2B034F130467794A@its-xchg4.massey.ac.nz> <16428.889.749153.651608@montanaro.dyndns.org> Message-ID: <20040212233257.B36982DE4D@cashew.wolfskeep.com> In message: <16428.889.749153.651608@montanaro.dyndns.org> Skip Montanaro writes: > > Kenny> I would think, then, that you would keep going as long as the > Kenny> score indicates that you are getting closer to correct > Kenny> classification. > >And stop once all ham score at or below the ham_cutoff and all spam score at >or above the spam_cutoff. Of course, this process is not guaranteed to ever complete; consider the case where you have two messages with identical token lists (perhaps in different orders?) and one is marked as ham and the other is marked as spam. At best, you could get them both classified as unsure. - Alex From mhammond at skippinet.com.au Thu Feb 12 22:37:52 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Feb 12 22:38:20 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: Message-ID: <16fb01c3f1e2$c33e0370$0200a8c0@eden> > Kenny Pitt wrote: > > Tony Meyer wrote: > >> I get the installer setting it to > >> "\bin\outlook_addin.dll". > >> > >>> * outlook_addin_register does not create the PythonCOMPath key > >> > >> The installer does this for me, too. > > > > I must admit that I used my locally-built copy of the > installer which > > *should* be the same but isn't necessarily identical. I'll download > > the actual installer from SourceForge and try again. > > Interesting. I used the SF installer and it did register as > outlook_addin.dll. I didn't get the PythonCOMPath key, but the addin > worked fine in Outlook all the same. The PythonCOMPath key isn't needed for frozen programs, and indeed isn't written for frozen programs. This is in win32com\server\register.py. 
The code that knows how to find the classes without PythonCOMPath is in py2exe's boot_com_servers.py > I looked at the version number on python23.dll and noticed that the SF > installer was built using the Python 2.3.2 release. I've updated to > 2.3.3 so my local version was built with that. I'd be interested to > know what results others get when rebuilding with 2.3.3. As it is, I > can't begin to guess whether this is due to differences in > 2.3.3, or to > some other configuration difference unique to my system. Oops - I meant to use 2.3 - I'm actually building from a Python CVS tree tagged with 2.3, but clearly haven't updated for a while :) I'm updating now, so I'll see if there are any issues. However, it sounds more like your issues were caused by either py2exe or win32all being out of date. So does this mean the binary is all ready to go, and we can announce it? Mark. From kennypitt at hotmail.com Fri Feb 13 08:57:16 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 13 08:58:27 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: <16fb01c3f1e2$c33e0370$0200a8c0@eden> Message-ID: Mark Hammond wrote: >> Kenny Pitt wrote: >> As it is, I can't begin to guess whether this is due to differences >> in 2.3.3, or to some other configuration difference unique to my >> system. > > Oops - I meant to use 2.3 - I'm actually building from a Python CVS > tree tagged with 2.3, but clearly haven't updated for a while :) I'm > updating now, so I'll see if there are any issues. However, it > sounds more like your issues were caused by either py2exe or win32all > being out of date. win32all is build 163, but I suspect py2exe which I built locally from latest CVS. I just downloaded the official 0.5.0 release of py2exe and I'll try again using that. > So does this mean the binary is all ready to go, and we can announce > it? 
I agree with Tony's assessment that it appears to work better for more people and that we should put it out there and see what happens. -- Kenny Pitt From kennypitt at hotmail.com Fri Feb 13 09:13:29 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 13 09:14:24 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: <16428.889.749153.651608@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > Kenny> I wonder what would happen if you took an "incorrectness score" > Kenny> that was the average of the distance from perfect classification > Kenny> over all messages, and stop if that average ever increases? > > I don't understand what you're suggesting. What is "perfect > classification over all messages"? I was afraid that statement wouldn't be quite clear. <0.5 wink> I was thinking of perfect classification of a spam being an exact 1.0 score, and perfect classification of a ham being an exact 0.0 score. If a ham scores as 0.01, then its distance from perfect is 0.01. If a spam scores as 0.99, then its distance from perfect is also 0.01. The "incorrectness score" I was considering would take the total of these distances for all messages as you score them in a single round, and divide by the total number of messages to get the average distance. What I was wondering was whether or not going through a round where this average distance was greater than or equal to the previous round would be a good indicator that more iterations would not improve the overall accuracy any further. The intent is to have some kind of guard condition to prevent the concern that Alex mentioned of getting caught in an infinite iteration loop. -- Kenny Pitt From skip at pobox.com Fri Feb 13 10:15:01 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 13 10:15:13 2004 Subject: [spambayes-dev] train to exhaustion? 
In-Reply-To: References: <16428.889.749153.651608@montanaro.dyndns.org> Message-ID: <16428.59893.363506.342830@montanaro.dyndns.org> Kenny> What I was wondering was whether or not going through a round Kenny> where this average distance was greater than or equal to the Kenny> previous round would be a good indicator that more iterations Kenny> would not improve the overall accuracy any further. The intent Kenny> is to have some kind of guard condition to prevent the concern Kenny> that Alex mentioned of getting caught in an infinite iteration Kenny> loop. I think you could probably approximate that closely enough by requiring that the number of misses drops from round to round. (A "miss" in this case is a message that doesn't score within its proper zone.) Skip From kennypitt at hotmail.com Fri Feb 13 11:51:24 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 13 11:52:30 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: <16428.59893.363506.342830@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > I think you could probably approximate that closely enough by > requiring that the number of misses drops from round to round. (A > "miss" in this case is a message that doesn't score within its proper > zone.) In most cases that's probably true. However, here's an example from one of the test runs of tte.py that Tony posted to his web site:
round: 3, msgs: 1312, ham misses: 2, spam misses: 0
round: 4, msgs: 1312, ham misses: 0, spam misses: 2
round: 5, msgs: 1312, ham misses: 0, spam misses: 1
round: 6, msgs: 1312, ham misses: 0, spam misses: 0
The total number of misses did not decrease between rounds 3 and 4, but further rounds did reduce the misses to zero. I guess you could correct for that by stopping if the total misses increases or if both ham misses and spam misses stay the same, but that doesn't feel quite right either.
If nothing else, it fails to account for Tony's original question: "if one message was still a false-positive, but moved from 0.8 to 0.7, is that improving?" -- Kenny Pitt From skip at pobox.com Fri Feb 13 12:12:36 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Feb 13 12:12:45 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: References: <16428.59893.363506.342830@montanaro.dyndns.org> Message-ID: <16429.1412.756636.311864@montanaro.dyndns.org> Kenny> In most cases that's probably true. However, here's an example Kenny> from one of the test runs of tte.py that Tony posted to his web Kenny> site:
Kenny> round: 3, msgs: 1312, ham misses: 2, spam misses: 0
Kenny> round: 4, msgs: 1312, ham misses: 0, spam misses: 2
Kenny> round: 5, msgs: 1312, ham misses: 0, spam misses: 1
Kenny> round: 6, msgs: 1312, ham misses: 0, spam misses: 0
Kenny> The total number of misses did not decrease between rounds 3 and Kenny> 4, but further rounds did reduce the misses to zero. Understood. I'm sure there are ways around that, like save the total misses from the last N rounds and exit if they increase or don't decrease within M rounds (M < N). Kenny> If nothing else, it fails to account for Tony's original Kenny> question: "if one message was still a false-positive, but moved Kenny> from 0.8 to 0.7, is that improving?" That's not how I interpreted the description on Gary's blog. Either it moves into the desired zone or it doesn't. I've been using my tte.py script for a few days now and haven't noticed this as a practical problem. I suspect we're worrying about a problem that won't arise. Maybe add a maxrounds flag? Skip From rmalayter at bai.org Fri Feb 13 12:25:23 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Fri Feb 13 12:25:26 2004 Subject: [spambayes-dev] problems with 0.9 plug-in & installation Message-ID: <792DE28E91F6EA42B4663AE761C41C2A01E19C2E@cliff.bai.org> 1) It doesn't install over 0.8 cleanly.
After closing all Outlook processes, running the 1.0a9 installer, and reloading Outlook, version 0.8 was still the active plug-in. I had to remove 0.8 and then install 0.9. Incidentally, removing 0.8 doesn't remove the spambayes toolbar correctly. 2) The "check for newer version" menu selection still reports 0.8 is installed, and that there is no newer version available. The spamBayes manager screen, however, reports v0.9 is active. Everything else in the upgrade seemed okay. Ryan Malayter Sr. Network & Database Administrator Bank Administration Institute Chicago, Illinois, USA PGP Key: http://www.malayter.com/pgp-public.txt ::::::::::::::::::::::::::::::: Only the mediocre are at their best all the time. From sethg at GoodmanAssociates.com Fri Feb 13 13:07:42 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Fri Feb 13 13:07:45 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: Message-ID: > [Kenny Pitt] > The total number of misses did not decrease between rounds 3 and 4, but > further rounds did reduce the misses to zero. This is undoubtedly more than you want to read, but here goes. Kenny's observation is expected. What you've implemented, in signal processing terms, is an adaptive estimator for the order of and value of a parameter set. You know the results with no noise (all correct classifications are known a priori), and you are trying to come up with both the number of parameters and their values that describes the data "best" according to some scalar cost function. While adaptive estimators vary wildly in how they decide what the next iteration is, all of the ones that I've seen have something in common: they don't approach convergence monotonically. For the same reason that they have this behavior, they also are not guaranteed to converge to the global minimum cost.
Think of the cost function as a surface (it is multi-dimensional, but I can't visualize anything beyond three) that has numerous dips but one dip is deeper than the rest and you will probably never find that one. This means:
1) there will be bumps in the generally decreasing cost function as you iterate
2) there is no proven way to distinguish a bump in the cost function from a local or global minimum
3) there is no way to tell if a minimum in the cost function is local or global (unless you can prove that there is a lower bound and show that you've achieved it - good luck)
4) there is no formal proof for most algorithms that they converge at all; people just test to get a sense for robustness; algorithms that take smaller steps per iteration tend to converge better, though slower, and tend to get stuck in local minima more easily; pick your poison
That being said, everyone who uses these methods comes up with criteria, sometimes one that gives the appearance of mathematical validity and sometimes a wild-ass heuristic, to tell them when to stop iterating. I wouldn't trust the result of a single iteration to tell me that I've found one of the many minima in the cost function. Since you are iteratively estimating both the number of parameters you need as well as their values, you are hopping around a surface in steps that may be wider than the minima, so you should expect significant bumps in the convergence curve. In the case that you have a threshold cost value that you say is good enough, that's as easy as it gets. If you can't achieve that required cost threshold, you may want to go a number of iterations beyond the point where you think you've found a minimum to make sure that you're not looking at a "speed bump" in the middle of a long descent. How hard you try to find out if you're stuck in a local minimum could be based on how good your estimator currently is, but that's up to you.
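As a rough illustration of the stopping rules discussed in this thread (a cap on the number of rounds, Kenny's average-distance "incorrectness score", and running a few rounds past an apparent minimum before giving up), here is a minimal sketch. It is not the tte.py script from contrib and does not use any SpamBayes API: ToyClassifier, train_to_exhaustion, mean_distance, max_rounds and patience are hypothetical names made up for this example, and the default cutoffs of 0.20 and 0.90 are assumed values.

```python
class ToyClassifier:
    """Hypothetical stand-in for a real classifier; not SpamBayes code."""

    def __init__(self):
        self.spam = {}   # token -> times trained as spam
        self.ham = {}    # token -> times trained as ham

    def score(self, tokens):
        # Average a smoothed per-token spam probability; unseen tokens give 0.5.
        probs = [(self.spam.get(t, 0) + 0.5) /
                 (self.spam.get(t, 0) + self.ham.get(t, 0) + 1.0)
                 for t in tokens]
        return sum(probs) / len(probs)

    def train(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1


def mean_distance(scored):
    """Average distance from a perfect score (0.0 for ham, 1.0 for spam).

    scored is a list of (is_spam, score) pairs.
    """
    if not scored:
        return 0.0
    return sum((1.0 - s) if is_spam else s for is_spam, s in scored) / len(scored)


def train_to_exhaustion(messages, score, train,
                        ham_cutoff=0.20, spam_cutoff=0.90,
                        max_rounds=10, patience=3):
    """Repeat train-on-error over the same corpus until a stop rule fires.

    messages is a list of (msg, is_spam) pairs. Stops when a round has no
    misses, when max_rounds is reached, or when the miss count has not
    improved on its best value for `patience` consecutive rounds.
    Returns a list of (round, misses, mean_distance) tuples.
    """
    history = []
    best = None
    stale = 0
    for rnd in range(1, max_rounds + 1):
        misses = 0
        scored = []
        for msg, is_spam in messages:
            s = score(msg)   # score first, ignoring results from earlier rounds
            scored.append((is_spam, s))
            in_zone = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not in_zone:
                misses += 1
                train(msg, is_spam)   # train only on messages outside their zone
        history.append((rnd, misses, mean_distance(scored)))
        if misses == 0:
            break
        if best is None or misses < best:
            best, stale = misses, 0
        else:
            stale += 1   # tolerate a few "speed bumps" before giving up
            if stale >= patience:
                break
    return history


if __name__ == "__main__":
    clf = ToyClassifier()
    corpus = [(("cheap", "pills"), True), (("lunch", "meeting"), False)]
    for rnd, misses, dist in train_to_exhaustion(corpus, clf.score, clf.train,
                                                 spam_cutoff=0.8):
        print("round: %d, misses: %d, mean distance: %.3f" % (rnd, misses, dist))
```

The mean distance is only recorded here, not used to stop; swapping it in as the cost function would mean comparing it, rather than the miss count, against the best value seen so far. Which of the two works better is a matter for testing.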
> [Kenny Pitt] > I guess you could correct for that by stopping if the total misses > increases or if both ham misses and spam misses stay the same, but that > doesn't feel quite right either. If nothing else, it fails to account > for Tony's original question: "if one message was still a > false-positive, but moved from 0.8 to 0.7, is that improving?" There's no definitive answer. The choice of the cost function is completely up to you, and has a lot to do with the quality of the resulting estimator. Though continuous cost functions, like total distance (or mean-squared distance) from perfect classification are intellectually satisfying (I personally prefer these), it is possible that a discrete cost function, like the number of mis-classifications, will perform better. The only answer is in testing. -- Seth Goodman From kennypitt at hotmail.com Fri Feb 13 13:14:18 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 13 13:15:27 2004 Subject: [spambayes-dev] train to exhaustion? In-Reply-To: <16429.1412.756636.311864@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > I've been using my tte.py script for a few days now and haven't > noticed this as a practical problem. I suspect we're worrying about > a problem that won't arise. Maybe add a maxrounds flag? I agree. I ran a couple of tests with tte.py on my timcv test sets and it always completed in just a few rounds. A maxrounds flag sounds like a +1, probably as a configurable option. -- Kenny Pitt From sethg at GoodmanAssociates.com Fri Feb 13 13:36:43 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Fri Feb 13 13:36:45 2004 Subject: [spambayes-dev] bad web link Message-ID: On the Windows page for the new version, http://spambayes.sourceforge.net/windows.html, the link to release notes near the top of the page takes you to the release notes for version 0.8 rather than the new version. 
--
Seth Goodman

From kennypitt at hotmail.com Fri Feb 13 14:19:10 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Feb 13 14:20:14 2004
Subject: [spambayes-dev] bad web link
In-Reply-To:
Message-ID:

Seth Goodman wrote:
> On the Windows page for the new version,
> http://spambayes.sourceforge.net/windows.html, the link to release
> notes near the top of the page takes you to the release notes for
> version 0.8 rather than the new version.

So it is. I'll fix it in CVS and see if I can get someone to push the update to the website. Thanks for the report.

--
Kenny Pitt

From skip at pobox.com Fri Feb 13 14:48:39 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Feb 13 14:48:47 2004
Subject: [spambayes-dev] train to exhaustion?
In-Reply-To:
References: <16429.1412.756636.311864@montanaro.dyndns.org>
Message-ID: <16429.10775.237081.706200@montanaro.dyndns.org>

    Skip> Maybe add a maxrounds flag?

    Kenny> A maxrounds flag sounds like a +1, probably as a configurable
    Kenny> option.

Added. I haven't messed with any new config parser options. If tte.py leaps out of the contrib directory that would be something to consider though.

Skip

From skip at pobox.com Fri Feb 13 14:50:48 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Feb 13 14:51:59 2004
Subject: [spambayes-dev] bad web link
In-Reply-To:
References:
Message-ID: <16429.10904.707941.983486@montanaro.dyndns.org>

    >> ... the link to release notes near the top of the page takes you to
    >> the release notes for version 0.8 rather than the new version.

    Kenny> So it is. I'll fix it in CVS and see if I can get someone to
    Kenny> push the update to the website. Thanks for the report.

Done.

Skip

From sethg at GoodmanAssociates.com Fri Feb 13 16:23:34 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Fri Feb 13 16:23:38 2004
Subject: [spambayes-dev] observations on 0.9 plug-in
Message-ID:

I just un-installed the 0.81 binary and installed the 0.9 binary.
Here are a couple of observations:

1) As someone else reported, asking Spambayes to check for the latest version reports that I am currently running 0.81. The Spambayes Manager correctly reports that the installed version is 0.9.

2) The log file message saying that the "failure to create toolbar" message is normal is a big improvement. It would be better if the FAILURE message wasn't there at all, but if that was easy, it would already be fixed.

3) The log file has a message that it is watching the spam folder for incremental training, though I have that option unchecked. It does not train when I move a message into that folder, so the option works correctly. There is just an incorrect log file message.

4) Viewing message tokens is still a bit annoying due to two problems, probably both hard to do anything about:

   a) when closing the message, it asks if you want to save changes even though none were made

   b) after closing, the message changes from unread to read, so I have to manually restore the state to unread (the FAQ does say no one knows how to fix this)

--
Seth Goodman

From sethg at GoodmanAssociates.com Fri Feb 13 16:37:33 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Fri Feb 13 16:40:04 2004
Subject: [spambayes-dev] observations on 0.9 plug-in
In-Reply-To:
Message-ID:

> 1) As someone else reported, asking Spambayes to check for the latest
> version reports that I am currently running 0.81. The Spambayes Manager
> correctly reports that the installed version is 0.9.

Got that one totally wrong. Checking for new version says the latest version available is 0.81, Sept. 2003, no new updates available. It does not report what version I am running.

--
Seth Goodman

From mhammond at skippinet.com.au Fri Feb 13 19:49:52 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Feb 13 19:50:15 2004
Subject: [spambayes-dev] observations on 0.9 plug-in
In-Reply-To:
Message-ID: <156301c3f294$769195c0$0200a8c0@eden>

> Got that one totally wrong.
> Checking for new version says the latest
> version available is 0.81, Sept. 2003, no new updates
> available. It does
> not report what version I am running.

Yeah, that is the intent, including not telling the user what version they are running. It seemed too confusing to have all these version numbers - the point is to check if a new version is available, not to tell you what you are running :)

I have updated the website with the new version info - it should now report the latest is 0.9.

Mark.

From tameyer at ihug.co.nz Sat Feb 14 20:42:11 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sat Feb 14 23:24:32 2004
Subject: [spambayes-dev] problems with 0.9 plug-in & installation
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13050DD39C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AAA@its-xchg4.massey.ac.nz>

> 1) It doesn't install over 0.8 cleanly. After closing all
> outlook processes, and running the 1.0a9 installer, and
> reloading outlook, version 0.8 was still the active plug-in.
> I had to remove 0.8 and then install 0.9.

I think (Mark will probably correct this if I'm wrong) that this is a consequence of changing the installation method (installer->py2exe, or maybe the installer program instead of Inno's regserver). Anyway, the instructions in the notes do tell you to do this:

"""
* If you have an existing version of the Outlook addin installed, please
  uninstall it via Control Panel->Add/Remove Programs. This will not
  remove your training or configuration information.
"""

> Incidentally,
> removing 0.8 doesn't remove the spambayes toolbar correctly.

This is a (very old!) known bug.

=Tony Meyer

From Bolt at telus.net Tue Feb 17 19:24:40 2004
From: Bolt at telus.net (Bolt)
Date: Tue Feb 17 19:24:44 2004
Subject: [spambayes-dev] Suggestion
Message-ID:

I love how SpamBayes works, but I have a suggestion..............
When I find SPAM that was not caught by SpamBayes, it would be great if there was an option that I could set so that when I click on the "Delete as Spam" button, it actually moved the item to the Deleted Items folder instead of the Spam folder (where I have to go and manually delete the items anyway - I suspect that this is a MS Outlook issue).

Anyway, keep up the great work.

Bolt

From tameyer at ihug.co.nz Tue Feb 17 21:28:11 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Feb 17 21:29:19 2004
Subject: [spambayes-dev] Problem with 1.0a9 Windows installer?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13050DD263@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ABE@its-xchg4.massey.ac.nz>

[Kenny]
> OK, I've taken a look and it appears that registering
> the addin using the outlook_addin_register executable
> during install does two things differently than
> registering the addin DLL directly with regsvr32:
> * outlook_addin_register sets the DLL path in InprocServer32
> to "pythoncom23.dll", while regsvr32 sets it to
> "\bin\outlook_addin.dll"

Did you ever figure out what was causing this? I have an Excel plug-in that uses the same registration trickery as the Outlook plug-in's addin.py (guess where the code came from), and I ran into this problem today; i.e. pythoncom23.dll was the DLL path, rather than my excel_addin.dll. Weirdly, it was working before this, and I can't think of what code I changed that could affect this.

In any case (and to bring it back to spambayes), I put a "pythoncom.frozen = True" line in after the "sys.frozendllhandle = ..." line (at the end of addin.py), and that fixed it. I presume that would fix it for you, too. I'm not sure why pythoncom.frozen wasn't already True, though, since it was running in a binary (py2exe 0.5.0, win32 200).
Maybe Mark knows :)

=Tony Meyer

From danieleloff at hotmail.com Tue Feb 17 22:35:29 2004
From: danieleloff at hotmail.com (Daniel Eloff)
Date: Tue Feb 17 22:35:33 2004
Subject: [spambayes-dev] Understanding Classifier Code
Message-ID:

I've been looking at ways of increasing the speed at which the classifier runs. I think that increasing the speed of the classifier would be one of the things required to make spam-bayes computationally worthwhile to run on a gateway for several thousand users. (Others include keeping all word records in memory, and writing the really tight parts in C or assembler.) No arguments yet please, I just need some help to understand what the classifier is doing and why. I'll paste the code and interject when I don't understand why something is happening that way. (I don't like to modify code without understanding it fully; you're asking for trouble if you do that.) I really appreciate any help you can give me on this, the more the merrier. And if my team manages to get spambayes working well on our server we'll be sure to share the modifications with you.

        H = S = 1.0
        Hexp = Sexp = 0

        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            S *= 1.0 - prob
            H *= prob
            if S < 1e-200:  # prevent underflow
                S, e = frexp(S)
                Sexp += e
            if H < 1e-200:  # prevent underflow
                H, e = frexp(H)
                Hexp += e

Tell me, why a separate spam/ham score at this point?

        # Compute the natural log of the product = sum of the logs:
        # ln(x * 2**i) = ln(x) + i * ln(2).
        S = ln(S) + Sexp * LN2
        H = ln(H) + Hexp * LN2

Okay, I can see from the comment that this is equivalent to ln(x * 2**i). But why take the logarithm of the final spam/ham prob?

        n = len(clues)
        if n:
            S = 1.0 - chi2Q(-2.0 * S, 2*n)
            H = 1.0 - chi2Q(-2.0 * H, 2*n)

Why multiply the score by -2? Why double n?

    def chi2Q(x2, v, exp=_math.exp, min=min):
        """Return prob(chisq >= x2, with v degrees of freedom).

        v must be even.
        """
        assert v & 1 == 0
        # XXX If x2 is very large, exp(-m) will underflow to 0.
        m = x2 / 2.0
        sum = term = exp(-m)
        for i in range(1, v//2):
            term *= m / i
            sum += term

What's going on here? Why do we take the exp() of -x2/2?

Why did we multiply x2 by 2 if we divide it by 2 again anyway?

What is the loop doing and why?

        # With small x2 and large v, accumulated roundoff error, plus error in
        # the platform exp(), can cause this to spill a few ULP above 1.0. For
        # example, chi2Q(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
        # point. Returning a value even a teensy bit over 1.0 is no good.
        return min(sum, 1.0)

Thanks again!

From kennypitt at hotmail.com Wed Feb 18 09:21:05 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Feb 18 09:22:07 2004
Subject: [spambayes-dev] Problem with 1.0a9 Windows installer?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ABE@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> [Kenny]
>> OK, I've taken a look and it appears that registering
>> the addin using the outlook_addin_register executable
>> during install does two things differently than
>> registering the addin DLL directly with regsvr32:
>> * outlook_addin_register sets the DLL path in InprocServer32
>> to "pythoncom23.dll", while regsvr32 sets it to
>> "\bin\outlook_addin.dll"
>
> Did you ever figure out what was causing this? I have an Excel
> plug-in that uses the same registration trickery as the Outlook
> plug-in's addin.py (guess where the code came from), and I ran
> into this problem today;

No, I rebuilt my installer using the released version of py2exe but then I got tied up and didn't get around to testing it. I'll go give it a try right now, and if I still have the original problem then I'll try your fix and see if it makes a difference for me.
--
Kenny Pitt

From kennypitt at hotmail.com Wed Feb 18 10:06:44 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Feb 18 10:07:49 2004
Subject: [spambayes-dev] Problem with 1.0a9 Windows installer?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ABE@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> In any case (and to bring it back to spambayes), I put a
> "pythoncom.frozen = True" line in after the "sys.frozendllhandle =
> ..." line (at the end of addin.py), and that fixed it. I presume
> that would fix it for you, too. I'm not sure why pythoncom.frozen
> wasn't already True, though, since it was running in a binary (py2exe
> 0.5.0, win32 200). Maybe Mark knows :)

Yep, this was the fix that I needed to make it work for me. I also solved the mystery of why pythoncom.frozen isn't set, although I'm still clueless why it worked in some cases and not in others.

Here's the reason why pythoncom.frozen isn't set: In the py2exe sources, there is only one place that pythoncom.frozen is set and that is with the following lines in boot_com_servers.py:

    import pythoncom
    if not hasattr(sys, "frozen"): # standard exes have none.
        sys.frozen = pythoncom.frozen = 1
    else: # com DLLs already have sys.frozen set to 'dll'
        pythoncom.frozen = sys.frozen

But py2exe uses boot_com_servers.py only for COM DLLs and EXEs specified in the com_server= list. outlook_addin_register.exe is just a standard Windows executable, so py2exe uses boot_common.py instead, which makes no reference to pythoncom at all.

--
Kenny Pitt

From danieleloff at hotmail.com Wed Feb 18 11:55:04 2004
From: danieleloff at hotmail.com (Daniel Eloff)
Date: Wed Feb 18 11:55:08 2004
Subject: [spambayes-dev] Understanding Classifier Code
Message-ID:

I've been looking at ways of increasing the speed at which the classifier runs. I think that increasing the speed of the classifier would be one of the things required to make spam-bayes computationally worthwhile to run on a gateway for several thousand users. (Others include keeping all word records in memory, and writing the really tight parts in C or assembler.) No arguments yet please, I just need some help to understand what the classifier is doing and why. I'll paste the code and interject when I don't understand why something is happening that way. (I don't like to modify code without understanding it fully; you're asking for trouble if you do that.) I really appreciate any help you can give me on this, the more the merrier. And if my team manages to get spambayes working well on our server we'll be sure to share the modifications with you.

        H = S = 1.0
        Hexp = Sexp = 0

        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            S *= 1.0 - prob
            H *= prob
            if S < 1e-200:  # prevent underflow
                S, e = frexp(S)
                Sexp += e
            if H < 1e-200:  # prevent underflow
                H, e = frexp(H)
                Hexp += e

Tell me, why a separate spam/ham score at this point?

        # Compute the natural log of the product = sum of the logs:
        # ln(x * 2**i) = ln(x) + i * ln(2).
        S = ln(S) + Sexp * LN2
        H = ln(H) + Hexp * LN2

Okay, I can see from the comment that this is equivalent to ln(x * 2**i). But why take the logarithm of the final spam/ham prob?

        n = len(clues)
        if n:
            S = 1.0 - chi2Q(-2.0 * S, 2*n)
            H = 1.0 - chi2Q(-2.0 * H, 2*n)

Why multiply the score by -2? Why double n?

    def chi2Q(x2, v, exp=_math.exp, min=min):
        """Return prob(chisq >= x2, with v degrees of freedom).

        v must be even.
        """
        assert v & 1 == 0
        # XXX If x2 is very large, exp(-m) will underflow to 0.
        m = x2 / 2.0
        sum = term = exp(-m)
        for i in range(1, v//2):
            term *= m / i
            sum += term

What's going on here? Why do we take the exp() of -x2/2?

Why did we multiply x2 by 2 if we divide it by 2 again anyway?

What is the loop doing and why?

        # With small x2 and large v, accumulated roundoff error, plus error in
        # the platform exp(), can cause this to spill a few ULP above 1.0. For
        # example, chi2Q(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
        # point. Returning a value even a teensy bit over 1.0 is no good.
        return min(sum, 1.0)

Thanks again!

From antoine.trux at nokia.com Wed Feb 18 12:08:31 2004
From: antoine.trux at nokia.com (antoine.trux@nokia.com)
Date: Wed Feb 18 12:08:58 2004
Subject: [spambayes-dev] getting rid of the "new mail" icon
Message-ID: <4EAA30E8E17684458E24408ADC13A4B50108522B@esebe006.ntc.nokia.com>

Hi,

I just installed the SpamBayes plug-in for Outlook 2000. SpamBayes seems superior in all respects to the previous spam filter I was using (Spammunition, see www.upserve.com), except for one important detail: I was able to configure Spammunition so that it would switch off the "new mail" (envelope shaped) icon after detecting a spam, but I could not manage to do it with SpamBayes.

According to http://spambayes.sourceforge.net/faq.html#how-can-i-get-rid-of-the-envelope-tray-icon-for-spam , this functionality would be very hard to implement. This FAQ item says: "This means that even if you have set SpamBayes to mark spam as read, the envelope tray icon will not vanish."

However, the manual describes option spam_mark_as_read as follows (C:\Program Files\SpamBayes\docs\outlook\docs\configuration.html): "Determines if spam messages are marked as 'Read' as they are filtered. This can be set to 'True' if the new-mail icon bothers you when the only new items are spam."

This does not seem to work for me. Consider this scenario:

1) My "Inbox" folder is empty (yes, absolutely empty). No "new mail" icon.

2) A spam arrives and is detected by SpamBayes. The "new mail" icon is on.

As I wrote above, it is possible to configure Spammunition so that the "new mail" icon is switched off in this same scenario.
Now, this same FAQ item (http://spambayes.sourceforge.net/faq.html#how-can-i-get-rid-of-the-envelope-tray-icon-for-spam) says: "Although there is code available that provides a method to delete this icon, it doesn't let us determine whether there is other unread mail as well, which means that we do not know whether we should delete the icon or not."

So you prefer not to switch off the icon if you are not sure. It would be most valuable, however, if SpamBayes could switch off the "new mail" icon each and every time a spam message is detected, or at least provide an option to work that way. This is because I much prefer:

1) not to be constantly distracted by the "new mail" icon

than:

2) having to check my Inbox every time the "new mail" icon is switched on (which happens every 5th minute or so for me, because I get about 250 spams a day).

With usage scenario 1, I can work this way:

- When I hear the new message's sound, I have a look at the "new mail" icon:
  - Most of the time, the icon is off (because the vast majority of the messages I get are spams and are detected by SpamBayes).
  - Sometimes, the icon is on. I then know that a ham has arrived (or a spam that SpamBayes does not detect), so I can interrupt my work.
- After being absent and returning to my desk, I immediately check whether I have new hams even if the "new mail" icon is off, because I know it could have been switched off by SpamBayes after detecting a spam.

Can you confirm that it is not currently possible to configure SpamBayes for usage scenario 1?

Antoine

From nas-spambayes at python.ca Wed Feb 18 12:52:51 2004
From: nas-spambayes at python.ca (Neil Schemenauer)
Date: Wed Feb 18 12:52:58 2004
Subject: [spambayes-dev] Understanding Classifier Code
In-Reply-To:
References:
Message-ID: <20040218175251.GA9732@mems-exchange.org>

On Tue, Feb 17, 2004 at 07:35:29PM -0800, Daniel Eloff wrote:
> I've been looking at ways of increasing the speed at which the
> classifier runs.

Don't forget about Amdahl's law.
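[Editorial sketch] For readers who have not met it: Amdahl's law says that if a fraction p of the run time is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). A small Python illustration follows; the 10% figure is made up for the example and is not a measurement of SpamBayes, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time gets `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Assumption for illustration only: the classifier arithmetic is 10% of the
# total time per message and gets rewritten to run 10 times faster.
print(round(amdahl_speedup(0.10, 10), 3))   # -> 1.099
```

In other words, under that assumed split, a tenfold faster classifier would make the whole pipeline less than 10% faster, which is why the share of time spent in each stage matters before optimizing any one of them.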
Have you profiled Spambayes and found where most of the time is being spent? My guess is that tokenization is expensive. However, I wouldn't start any optimization effort myself until profiling since guesses are often wrong.

Neil

From theller at python.net Wed Feb 18 13:04:10 2004
From: theller at python.net (Thomas Heller)
Date: Wed Feb 18 13:04:06 2004
Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer?
References: <1ED4ECF91CDED24C8D012BCF2B034F13050DD263@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2ABE@its-xchg4.massey.ac.nz>
Message-ID: <3c98fen9.fsf@python.net>

"Tony Meyer" writes:

> [Kenny]
>> OK, I've taken a look and it appears that registering
>> the addin using the outlook_addin_register executable
>> during install does two things differently than
>> registering the addin DLL directly with regsvr32:
>> * outlook_addin_register sets the DLL path in InprocServer32
>> to "pythoncom23.dll", while regsvr32 sets it to
>> "\bin\outlook_addin.dll"
>
> Did you ever figure out what was causing this? I have an Excel plug-in that
> uses the same registration trickery as the Outlook plug-in's addin.py (guess
> where the code came from), and I ran into this problem today; i.e.
> pythoncom23.dll was the DLL path, rather than my excel_addin.dll. Weirdly,
> it was working before this, and I can't think of what code I changed that
> could affect this.
>
> In any case (and to bring it back to spambayes), I put a
> "pythoncom.frozen = True" line in after the "sys.frozendllhandle = ..." line
> (at the end of addin.py), and that fixed it. I presume that would fix it
> for you, too. I'm not sure why pythoncom.frozen wasn't already True,
> though, since it was running in a binary (py2exe 0.5.0, win32 200). Maybe
> Mark knows :)

I wasn't aware of the hackery Mark does at the end of the script - use a frozen exe, pretend it is a frozen dll, to trick win32com.server.register into registering it as a dll server.
From what I have found in the code, setting pythoncom.frozen = True seems to be safe. It would be interesting to know the values of sys.frozen and pythoncom.frozen (if any) before.

I wanted to make a bugfix release of py2exe - although this fixes different bugs (I assume). If there's a problem with COM registration I would like to wait until this is resolved.

Thomas

From tim.one at comcast.net Wed Feb 18 15:36:09 2004
From: tim.one at comcast.net (Tim Peters)
Date: Wed Feb 18 15:36:28 2004
Subject: [spambayes-dev] Understanding Classifier Code
In-Reply-To:
Message-ID:

[Daniel Eloff]
> I've been looking at ways of increasing the speed at which the
> classifier runs.
...

You should profile first, of course -- you're probably looking at stuff now that doesn't much matter (the arithmetic done per feature is typically minor compared to database overheads, parsing, and I/O costs).

The classifier implements equation 3 from Gary Robinson's article:

    http://www.linuxjournal.com/article.php?sid=6467

Or, in English (from a comment in classifier.py):

    # Across vectors of length n, containing random uniformly-distributed
    # probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
    # with 2*n degrees of freedom.

It's more complicated than *just* that because the implementation lives within the constraints of current floating-point hardware.

> Tell me, why a separate spam/ham score at this point?

This is explained in Gary's article. Fisher's test is more sensitive to probabilities near 0 than to those near 1; H reflects the probabilities around 0.5, giving another measure more sensitive to ham features. Having two measures in the end allows the classifier to know when it's confused.

> But why take the logarithm of the final spam/ham prob?

Fisher's theorem is about the distribution of the sum of logs. For efficiency, the code transforms that into a product, followed by just one application of log:

    ln(x) + ln(y) + ln(z) + ... = ln(x*y*z*...)

> Why multiply the score by -2? Why double n?

Both follow directly from the equation.

> def chi2Q(x2, v, exp=_math.exp, min=min):
>     """Return prob(chisq >= x2, with v degrees of freedom).
> ...
>
> What's going on here? Why do we take the exp() of -x2/2?
>
> Why did we multiply x2 by 2 if we divide it by 2 again anyway?
>
> What is the loop doing and why?

It's an implementation of what the comment says: the probability that a random variable following the chi-squared distribution with v degrees of freedom is at least as large as the given value x2. Any number of statistics texts can lead you to this kind of program for computing it; I happened to use a particularly simple series expansion taken from Abramowitz & Stegun, applicable only when the # of degrees of freedom is even. We know that this is the case in this application, because Fisher's theorem always feeds in twice the number of features (so is always even).

The arithmetic in classifier.py is trivial. But it could be worth optimizing chi2Q, via (e.g.) a constant-time polynomial approximation; the chi2Q here is much more accurate than the classifier needs. OTOH, chi2Q isn't that expensive either (just a few simple arithmetic operations per feature).

From kennypitt at hotmail.com Wed Feb 18 16:58:38 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Feb 18 16:59:40 2004
Subject: [spambayes-dev] getting rid of the "new mail" icon
In-Reply-To: <4EAA30E8E17684458E24408ADC13A4B50108522B@esebe006.ntc.nokia.com>
Message-ID:

antoine.trux@nokia.com wrote:
> So you prefer not to switch off the icon if you are not sure. It
> would be most valuable, however, if SpamBayes could switch off the
> "new mail" icon each and every time a spam message is detected, or at
> least provide an option to work that way.
>
> With usage scenario 1, I can work this way:
> - When I hear the new message's sound, I have a look at the "new
> mail" icon:
> - Most of the time, the icon is off (because the vast majority of
> the messages I get are spams and are detected by SpamBayes).
> - Sometimes, the icon is on. I then know that a ham has arrived (or
> a spam that SpamBayes does not detect), so I can interrupt my work.
> - After being absent and returning to my desk, I immediately check
> whether I have new hams even if the "new mail" icon is off, because I
> know it could have been switched off by SpamBayes after detecting a
> spam.

There are many other complications to this that might throw a monkey wrench into your scenario, such as receiving both a ham and a spam in the same batch of messages or using the background filtering option that would delay SpamBayes' processing of a spam message.

> Can you confirm that it is not currently possible to configure
> SpamBayes for usage scenario 1?

I can confirm that this is not currently possible.

--
Kenny Pitt

From kennypitt at hotmail.com Wed Feb 18 17:11:45 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Feb 18 17:12:46 2004
Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer?
In-Reply-To: <3c98fen9.fsf@python.net>
Message-ID:

Thomas Heller wrote:
> I wasn't aware of the hackery Mark does at the end of the script -
> use a frozen exe, pretend it is a frozen dll, to trick
> win32com.server.register into registering it as dll server.
>
> From what I have found in the code, setting pythoncom.frozen = True
> seems to be safe. It would be interesting to know the values of
> sys.frozen and pythoncom.frozen (if any) before.

I printed pythoncom.frozen before I made the change, and it was 0/False. I didn't print sys.frozen, though. I'll check that if I get a chance.

The hackery is a result of an apparent problem with the Inno installer during uninstall.
When Inno tried to unregister the COM DLL using the usual LoadLibrary/DllUnregisterServer method, it apparently didn't release the DLL properly and then failed to delete some of the files. The outlook_addin_register stuff was needed so that the DLL could be registered and unregistered without actually loading the DLL.

--
Kenny Pitt

From miguel at vargas.com Wed Feb 18 17:21:39 2004
From: miguel at vargas.com (Miguel)
Date: Wed Feb 18 17:22:37 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
Message-ID: <4033E573.9070509@vargas.com>

Sorry for the semi-offtopic post, but any help will be very appreciated.

Mozilla's mail apps currently use Paul Graham's original algorithm with some basic tokenizing techniques. This situation could use some improvement, so now there is an effort to copy some ideas from Spambayes into Mozilla.

I wrote a Mozilla patch that tries to port the chi2-combining techniques from classifier.py into Mozilla's C++. My testing is showing huge improvements in the fn rates, but a big deterioration in the fp rates. For example, in a test with a 3,741 email corpus we got:

    original   - fn:206 fp:0
    chi2 patch - fn:63  fp:11

My question is, did you guys notice a similar increase in fp rates when you originally switched from Graham to chi2? If not, then I'll assume that I made a mistake in porting classifier.py.

Many thanks,
Miguel

PS. If anyone is interested in what Mozilla is doing, you can look here:

    http://bugzilla.mozilla.org/show_bug.cgi?id=181534
    http://bugzilla.mozilla.org/show_bug.cgi?id=230093
    http://bugzilla.mozilla.org/show_bug.cgi?id=231873

Here is the core of my C++ port if anyone wants to take a look. You'll notice that I included the "experimental_ham_spam_imbalance_adjustment", could this be my problem?
    double spam2ham = dmin(nbad/ngood, 1.0);
    double ham2spam = dmin(ngood/nbad, 1.0);

    /** This section comes from probability(self, record) and
        _getclues(self, wordstream) **/
    for (i = 0; i < count; ++i) {
        Token& token = tokens[i]; // tokens is an array of Token, elements of a Token
                                  // include both token.mProbability and token.mDistance
        const char* word = token.mWord;
        Token* t = mGoodTokens.get(word);
        double hamcount = ((t != NULL) ? t->mCount : 0);
        t = mBadTokens.get(word);
        double spamcount = ((t != NULL) ? t->mCount : 0);
        prob = (spamcount / nbad) / ( hamcount / ngood + spamcount / nbad);
        double n = hamcount * spam2ham + spamcount * ham2spam;
        prob = (0.225 + n * prob) / (.45 + n);
        double distance = abs(prob - 0.5);
        if (distance >= .1) {
            goodclues++;
            token.mDistance = distance;
            token.mProbability = prob;
        } else {
            token.mDistance = -1; //ignore clue
        }
    }

    // sort the array by the token distances
    PRUint32 first, last = count;
    if (count > 150) {
        first = count - 150;
        // This function sorts the array by token.mDistance
        NS_QuickSort(tokens, count, sizeof(Token), compareTokens, NULL);
    } else {
        first = 0;
    }

    /** This section comes from chi2_spamprob(self, wordstream, evidence=False) **/
    double H = 1.0, S = 1.0, Hexp = 0, Sexp = 0;
    goodclues=0;
    int e;
    for (i = first; i < last; ++i) {
        if (tokens[i].mDistance != -1) {
            goodclues++;
            double value = tokens[i].mProbability;
            S *= (1.0 - value);
            H *= value;
            if ( S < 1e-200 ) {
                S = frexp(S, &e);
                Sexp += M_E;
            }
            if ( H < 1e-200 ) {
                H = frexp(H, &e);
                Hexp +=M_E;
            }
        }
    }
    S = log(S) + Sexp * M_LN2;
    H = log(H) + Hexp * M_LN2;
    if (goodclues>0) {
        S = 1.0 - chi2Q(-2.0 * S, 2 * goodclues);
        H = 1.0 - chi2Q(-2.0 * H, 2 * goodclues);
        prob = (S-H +1.0) / 2.0;
    } else {
        prob = 0.5;
    }
    PRBool isJunk = (prob >= 0.90); //hardcoded at .9

------------------------------------
Here's the chi2Q function:

    static double chi2Q (double x2, double v)
    {
        PRUint32 i;
        double m = x2 / 2.0;
        double sum = exp(-m);
        double term = exp(-m);
        for (i=1;i<=floor(v/2);i++) {
            term *= m / i;
            sum += term;
        }
        return dmin(sum,1.0);
    }

From nas-spambayes at python.ca Wed Feb 18 17:40:56 2004
From: nas-spambayes at python.ca (Neil Schemenauer)
Date: Wed Feb 18 17:41:35 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4033E573.9070509@vargas.com>
References: <4033E573.9070509@vargas.com>
Message-ID: <20040218224055.GA10780@mems-exchange.org>

On Wed, Feb 18, 2004 at 05:21:39PM -0500, Miguel wrote:
> Sorry for the semi-offtopic post, but any help will be very
> appreciated.

I think it's on topic.

> My testing is showing huge improvements in the fn rates, but a big
> deterioration in the fp rates. For example, in a test with a 3,741
> email corpus we got:
> original - fn:206 fp:0
> chi2 patch - fn:63 fp:11

It would be helpful to have score distribution data. That would tell you if the rates could be improved by using different cutoffs. Also, it might give clues as to where the problem is. You can see some example plots here:

    http://spambayes.sourceforge.net/background.html

Cheers,

Neil

From adam.walker at rbwconsulting.com Wed Feb 18 19:07:42 2004
From: adam.walker at rbwconsulting.com (Adam Walker)
Date: Wed Feb 18 19:07:57 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4033E573.9070509@vargas.com>
References: <4033E573.9070509@vargas.com>
Message-ID: <4033FE4E.7080704@rbwconsulting.com>

Usage suggests that "experimental_ham_spam_imbalance_adjustment" produced bad results for most people. It is no longer used by most people.

Miguel wrote:
>
> Here is the core of my C++ port if anyone wants to take a look.
> You'll notice that I included the
> "experimental_ham_spam_imbalance_adjustment", could this be my problem?

From tameyer at ihug.co.nz Wed Feb 18 19:30:20 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Feb 18 19:31:01 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305255EAC@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467799A@its-xchg4.massey.ac.nz>

[Miguel]
> Here is the core of my C++ port if anyone wants to take a look.
> You'll notice that I included the
> "experimental_ham_spam_imbalance_adjustment", could this be
> my problem?

[Adam]
> Usage suggests that "experimental_ham_spam_imbalance_adjustment"
> produced bad results for most people. It is no longer used by
> most people.

And in fact the code doesn't even exist in current CVS or in 1.0a9 (0.9),
even though the option is there (but deprecated). So only people using
out-of-date spambayes might be using it.

Note that this could only affect your results if you did in fact have an
imbalance - you didn't say how the corpus was split.

=Tony Meyer

From tameyer at ihug.co.nz Wed Feb 18 20:26:43 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Feb 18 20:27:52 2004
Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305255E4E@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AC3@its-xchg4.massey.ac.nz>

> From what I have found in the code, setting pythoncom.frozen = True
> seems to be safe. It would be interesting to know the values
> of sys.frozen and pythoncom.frozen (if any) before.

sys.frozen is 1, pythoncom.frozen is 0 (here, and for Kenny, because his
print wouldn't have executed if bool(sys.frozen) wasn't True).

> I wanted to make a bugfix release of py2exe - although this
> fixes different bugs (I assume). If there's a problem with
> COM registration I would like to wait until this is resolved.

I think this is pretty specific to SpamBayes (and those like me that copy
it).
Surely there aren't many people around who are using this "pretend the exe is a dll" hack. Without that, all seems to be ok. OTOH, there have been three recent reports of this problem: """ Traceback (most recent call last): File "addin.pyc", line 1191, in OnConnection File "manager.pyc", line 908, in GetManager File "manager.pyc", line 344, in __init__ File "manager.pyc", line 492, in LocateDataDirectory File "win32com\shell\shell.pyc", line 9, in ? File "win32com\shell\shell.pyc", line 7, in __load ImportError: DLL load failed: A device attached to the system is not functioning. """ (Install goes fine, opening Outlook results in this). This looks like it is some sort of COM registration problem, but I don't know if it's with SpamBayes, win32com, or py2exe (I don't suppose your other bugfixes are for this? ). One of the reporters has the same Windows version & Outlook version as me, and it works fine here, so it doesn't appear to be a result of that. I do have lots of python dlls scattered about the place, of course. =Tony Meyer From tameyer at ihug.co.nz Wed Feb 18 20:30:24 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Feb 18 20:31:02 2004 Subject: [spambayes-dev] Problem with 1.0a9 Windows installer? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305255C6A@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AC4@its-xchg4.massey.ac.nz> > Yep, this was the fix that I needed to make it work for me. > I also solved the mystery of why pythoncom.frozen isn't set, > although I'm still clueless why it worked in some cases and > not in others. Yeah, I wish I knew that, too >:( > Here's the reason why pythoncom.frozen isn't set: [...] I think this is correct, though. Since we're the one pretending to be a dll, we should be the one to set pythoncom.frozen, too. No doubt Mark will soon weigh in with his wisdom, and then we'll all understand . 
=Tony Meyer

From miguel at vargas.com Wed Feb 18 21:37:35 2004
From: miguel at vargas.com (Miguel Vargas)
Date: Wed Feb 18 21:37:09 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130467799A@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130467799A@its-xchg4.massey.ac.nz>
Message-ID: <4034216F.7010801@vargas.com>

I'll take that out and re-test; it looked harmless when I put it in, but
I should've known you guys hadn't turned it on for a reason.

The corpus I used has 2,793 hams and 948 hams. We're using SpamAssassin's
public corpus for testing.

It's kind of tough to test since we don't have the nice cross-validation
tools that you have, and things aren't very flexible so we have to get
the emails into a POP3 server and let the app download the messages.

Basically what I've been doing is splitting SpamAssassin's corpus in half,
training on one set and getting results on the other. It takes a while,
but it seems to work; the results have consistently shown an increase in
the fp rate.

Anyways, I'll let you know the results.

thanks

From tim.one at comcast.net Wed Feb 18 22:53:22 2004
From: tim.one at comcast.net (Tim Peters)
Date: Wed Feb 18 22:53:29 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4033E573.9070509@vargas.com>
Message-ID:

[Miguel]

Note that spambayes has an "unsure" concept. I don't know how you decide
what's ham and what's spam, but things scoring near 0.5 in this system are
neither.

> ...
> /** This section comes from chi2_spamprob(self, wordstream,
> evidence=False) **/
> double H = 1.0, S = 1.0, Hexp = 0, Sexp = 0;
> goodclues=0;
> int e;
> for (i = first; i < last; ++i) {
>     if (tokens[i].mDistance != -1) {
>         goodclues++;
>         double value = tokens[i].mProbability;
>         S *= (1.0 - value);
>         H *= value;
>         if ( S < 1e-200 ) {
>             S = frexp(S, &e);
>             Sexp += M_E;

I don't know what M_E expands to, but unless it expands to

    e

then this part is way off. Hexp and Sexp should be ints. Read the original
spambayes comments to see *why* e must be added to Sexp here (and I don't
mean 2.71828... by "e", I mean the exponent stuffed into your int variable
named "e" by the frexp() call).

>         }
>         if ( H < 1e-200 ) {
>             H = frexp(H, &e);
>             Hexp +=M_E;

As above.

> Here's the chi2Q function:
> static double chi2Q (double x2, double v) {

v should be int (OK), or unsigned int (better).

>     PRUint32 i;
>     double m = x2 / 2.0;
>     double sum = exp(-m);
>     double term = exp(-m);

exp() is expensive, so don't call exp(-m) twice; e.g.,

    double sum = exp(-m);
    double term = sum;

instead.

>     for (i=1;i<=floor(v/2);i++) {

If v is int or unsigned int, you'll also get to skip the relatively
expensive floor() call on each loop trip. You should put in the original
code's assert that v is even (this algorithm is dead wrong if v is odd).

More seriously, this loop goes around once too often; it should be

    for (i = 1; i < v/2; ++i) {

(if i<=j, the Python range(i, j) contains j-i elements, starting at i and
ending with j-1).

From tameyer at ihug.co.nz Wed Feb 18 23:26:55 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Feb 18 23:27:28 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305255F08@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AC5@its-xchg4.massey.ac.nz>

> The corpus I used has 2,793 hams and 948 hams.

I presume one of those is spam, or you've got amazing results. If your
testing setup allows it, have a go with 948 of each, and see what that
does.

> We're using SpamAssassin's public corpus for testing.
In case it helps, last time I ran timcv with my copy of the SA corpus, split into 5 groups, I got: -> tested 655 hams & 2389 spams against 2624 hams & 9557 spams filename: sa_base ham:spam: 3279:11946 fp total: 7 fp %: 0.21 fn total: 85 fn %: 0.71 unsure t: 359 unsure %: 2.36 real cost: $226.80 best cost: $187.00 h mean: 0.81 h sdev: 6.74 s mean: 98.20 s sdev: 10.32 mean diff: 97.39 k: 5.71 (This is with all defaults). If you've got the corpus lying around in one-text-file-per-email format, then the easiest way to test would be to install Python and SpamBayes, and run timcv.py over the corpus and see if the results you get look something like the Mozilla ones (I suppose you could do -n2 to simulate splitting the corpus in half, rather than more sets as is common here). > It's kind of tough to test since we don't have the nice > cross-validation tools that you have You could write some, of course . =Tony Meyer From barry at python.org Wed Feb 18 23:43:39 2004 From: barry at python.org (Barry Warsaw) Date: Wed Feb 18 23:43:48 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: <4033E573.9070509@vargas.com> References: <4033E573.9070509@vargas.com> Message-ID: <1077165819.4430.27.camel@anthem> On Wed, 2004-02-18 at 17:21, Miguel wrote: > Sorry for the semi-offtopic post, but any help will be very > apreciated. > > Mozilla's mail apps currently use Paul Graham's original algorithm > with some basic tokenizing techniques. This > situation could use some improvement, so now there is an effort to > copy some ideas from Spambayes into Mozilla. I'd really love it if Moz/Thunderbird's spam filtering were pluggable. Evolution's is, and although the framework is pretty inefficient (all hams get filtered twice) it's very effective and usable. 
-Barry From tameyer at ihug.co.nz Thu Feb 19 00:59:20 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 19 00:59:47 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305255F4F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046779A5@its-xchg4.massey.ac.nz> > I'd really love it if Moz/Thunderbird's spam filtering were > pluggable. +1 (ok, so this is the wrong list for this...) =Tony Meyer From antoine.trux at nokia.com Thu Feb 19 02:08:27 2004 From: antoine.trux at nokia.com (antoine.trux@nokia.com) Date: Thu Feb 19 02:08:44 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon Message-ID: <4EAA30E8E17684458E24408ADC13A4B50287D224@esebe006.ntc.nokia.com> > > Can you confirm that it is not currently possible to configure > > SpamBayes for usage scenario 1? > > I can confirm that this not currently possible. Well, then I am left with the following options: - Switch back to Spammunition (which does switch off the new mail icon). - Use a commercial anti-spam filter (if I can find one that switches off the new mail icon; suggestions welcome). - Change my email address. I am thrilled by none of these possibilities. I would have liked to use SpamBayes, but given the amount of spam I get, this is just not possible. Antoine From kennypitt at hotmail.com Thu Feb 19 10:26:23 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 19 10:27:31 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon In-Reply-To: <4EAA30E8E17684458E24408ADC13A4B50287D224@esebe006.ntc.nokia.com> Message-ID: antoine.trux@nokia.com wrote: >>> Can you confirm that it is not currently possible to configure >>> SpamBayes for usage scenario 1? >> >> I can confirm that this not currently possible. > > Well, then I am left with the following options: > - Switch back to Spammunition (which does switch off the new mail > icon). 
> - Use a commercial anti-spam filter (if I can find one that switches > off the new mail icon; suggestions welcome). > - Change my email address. > > I am thrilled by none of these possibilities. I would have liked to > use SpamBayes, but given the amount of spam I get, this is just not > possible. Well, there are several other possibilities. SpamBayes is open source, so you can modify it as much as you want to suit your needs. I wrote the code to implement finding and removing the mail icon from within Python, and I'll gladly send you a copy if you'd like to try to incorporate it. There is also a patch (#858925) on SourceForge to add a notification sound to SpamBayes that could be used to replace Outlook's new mail icon and notification sound. The patch allows you to define 3 different sounds to differentiate spam, ham, or unsure. After processing a batch of mail, it will play the sound representing the "hammiest" message it saw (ham -> unsure -> spam). Here's a link to the patch: http://sourceforge.net/tracker/index.php?func=detail&aid=858925&group_id =61702&atid=498105 -- Kenny Pitt From tim.one at comcast.net Thu Feb 19 10:31:11 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Feb 19 10:31:13 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon In-Reply-To: <4EAA30E8E17684458E24408ADC13A4B50287D224@esebe006.ntc.nokia.com> Message-ID: [Antoine] >>> Can you confirm that it is not currently possible to configure >>> SpamBayes for usage scenario 1? [Kenny Pitt] >> I can confirm that this not currently possible. [Antoine] > Well, then I am left with the following options: > - Switch back to Spammunition (which does switch off the new mail > icon). > - Use a commercial anti-spam filter (if I can find one that > switches off the new mail icon; suggestions welcome). > - Change my email address. I can think of a few others: - Ignore the new-mail icon. That's what I do. 
In fact, I started ignoring it long before SpamBayes existed -- its behavior never made useful sense to me. - Devote a good part of your short-term future to figuring out how to write code to make this goofy icon do what you want. - Pay someone else to devote part of their life to doing that. > I am thrilled by none of these possibilities. I would have liked to > use SpamBayes, but given the amount of spam I get, this is just not > possible. I bet I get more spam than you get . From antoine.trux at nokia.com Thu Feb 19 10:44:55 2004 From: antoine.trux at nokia.com (antoine.trux@nokia.com) Date: Thu Feb 19 10:45:29 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon Message-ID: <4EAA30E8E17684458E24408ADC13A4B50287D22F@esebe006.ntc.nokia.com> > - Ignore the new-mail icon. That's what I do. In fact, I started > ignoring it long before SpamBayes existed -- its behavior never > made useful sense to me. "Tell us what you need, we will explain why you don't need it." From engelhardt at kleinmichel.com Thu Feb 19 10:46:22 2004 From: engelhardt at kleinmichel.com (Joachim Engelhardt) Date: Thu Feb 19 10:46:29 2004 Subject: [spambayes-dev] A Question about SpamBayes Message-ID: <000001c3f6ff$86fc2230$6501a8c0@JOACHIM> Hi there, first of all, SpamBayes is GREAT!!! I have it running now since two or three weeks and I love it. Fantastic work. However, I am missing one tiny little feature, which you are addressing under 4.3 and 6.5 in your FAQ section: Return/bounce/forward spam back to the sender. In FAQ 4.3 it sounds like that this is somehow possible. But, I have no clue what are you talking about in there... By the way, I am running the Outlook (2000) plugin under Windows 200 Pro - no Exchange server. So is it possible now or not? In FAQ 6.5 it is stated that I can't bounce spam back to the sender since most sender addresses are fake anyway. I am in full agreement with you on that and go along with this statement totally. 
However, there are always messages that get filtered out by SpamBayes that are not spam and are legitimate. Therefore, I am always browsing over the Junk mail folder before deleting all the spam - making sure I am not deleting an important message. Now, if I could autoreply to all messages in the Junk E-Mail folder and attach a short message to it, at least the non-spam senders would be notified automatically that their email was considered spam and has not been read but deleted. This way they could try to resend or rephrase. The return address from non-spammers should be a good one - and the autoreplies to spammers with fake return addresses end up in limbo. Do you think this makes any sense and are you considering implementing something like that into SpamBayes? Thanks, Joe From miguel at vargas.com Thu Feb 19 11:24:57 2004 From: miguel at vargas.com (Miguel) Date: Thu Feb 19 11:30:00 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: References: Message-ID: <4034E359.50401@vargas.com> Tim Peters wrote: > [Miguel] > > Note that spambayes has an "unsure" concept. I don't know how you decide > what's ham and what's spam, but things scoring near 0.5 in this system are > neither. Adding the "unsure" category to Mozilla is going to be expensive resource-wise, so we've decided to put it off for now. We're going to lump the "unsures" in with the hams. I'm thinking that we'll set the default cutoff at 0.9. > I don't know what M_E expands to, but unless it expands to > > e > > then this part is way off. Hexp and Sexp should be ints. Read the original > spambayes comments to see *why* e must be added to Sexp here (and I don't > mean 2.71828... by "e", I mean the exponent stuffed into your int variable > named "e" by the frexp() call). Oops, I misunderstood "e" to be the constant from the math module. >>Here's the chi2Q funcition: >>static double chi2Q (double x2, double v) { > > > v should be int (OK), or unsigned int (better). 
> > >> PRUint32 i; >> double m = x2 / 2.0; >> double sum = exp(-m); >> double term = exp(-m); > > > exp() is expensive, so don't call exp(-m) twice; e.g., > > double sum = exp(-m); > double term = sum; > > instead. Good suggestions >> for (i=1;i<=floor(v/2);i++) { > > > If v is int or unsigned int, you'll also get to skip the relatively > expensive floor() call on each loop trip. You should put in the original > code's assert that v is even (this algorithm is dead wrong if v is odd). I don't understand how making v an int will make it skip the floor function. Also, I don't understand what the assert does, what does the function return if v is odd? > More seriously, this loop goes around once too often; it should be > > for (i = 1; i < v/2; ++i) { > > (if i<=j, the Python range(i, j) contains j-i elements, starting at i and > ending with j-1). Good catch! I really apreciate you auditing my code, thanks! From miguel at vargas.com Thu Feb 19 11:49:24 2004 From: miguel at vargas.com (Miguel) Date: Thu Feb 19 11:49:37 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AC5@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AC5@its-xchg4.massey.ac.nz> Message-ID: <4034E914.6060307@vargas.com> >>The corpus I used has 2,793 hams and 948 hams. > > > I presume one of those is spam, or you've got amazing results . If > your testing setup allows it, have a go with 948 of each, and see what that > does. That was 948 spams. I'll do some tests with equal number of spams and hams. > If you've got the corpus lying around in one-text-file-per-email format, > then the easiest way to test would be to install Python and SpamBayes, and > run timcv.py over the corpus and see if the results you get look something > like the Mozilla ones (I suppose you could do -n2 to simulate splitting the > corpus in half, rather than more sets as is common here). 
The problem is that the tokenizers are different, so it's not possible to
compare the results since the classifiers are fed different tokens.

>>It's kind of tough to test since we don't have the nice
>>cross-validation tools that you have
>
>
> You could write some, of course.

I suppose I could, or I could let you guys do the testing and then copy
your results into Mozilla ;-)

From gbrown at alumni.caltech.edu Thu Feb 19 12:23:18 2004
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Thu Feb 19 12:31:36 2004
Subject: [spambayes-dev] getting rid of the "new mail" icon
In-Reply-To:
Message-ID: <01f101c3f70d$11449a50$0601000a@Glenn>

Wouldn't this problem be much simpler if spambayes added its own icon
instead of trying to tweak Outlook's broken one? A simple implementation
would be

    Ham detected: add icon
    icon clicked: activate Outlook and remove icon.

--Glenn

P.S.: the "Recover from Spam" Smiley would look good in the tool tray. :)

From tim.one at comcast.net Thu Feb 19 12:32:46 2004
From: tim.one at comcast.net (Tim Peters)
Date: Thu Feb 19 12:32:47 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4034E359.50401@vargas.com>
Message-ID:

[Miguel]
> Adding the "unsure" category to Mozilla is going to be expensive
> resource-wise, so we've decided to put it off for now.
> We're going to lump the "unsures" in with the hams. I'm thinking
> that we'll set the default cutoff at 0.9.

That's OK. "Unsure" turned out to be a valuable concept for most people,
so don't put it off forever.

>>> Here's the chi2Q function:
>>> static double chi2Q (double x2, double v) {

>> v should be int (OK), or unsigned int (better).

>>> for (i=1;i<=floor(v/2);i++) {

>> If v is int or unsigned int, you'll also get to skip the relatively
>> expensive floor() call on each loop trip. You should put in the
>> original code's assert that v is even (this algorithm is dead wrong
>> if v is odd).

> I don't understand how making v an int will make it skip the floor
> function.

v must be an even integer >= 0, therefore there's no need to compute
floors; plain v/2 is always exactly correct when v is even.

> Also, I don't understand what the assert does,

Well, assert() is a standard C macro. If you do

    assert((v & 1) == 0);

then, provided you haven't compiled with NDEBUG #define'd, your program
should die if you ever pass an odd integer for v.

> what does the function return if v is odd?

It shouldn't return anything then: the program should die! It's as
senseless to use this function for odd v as it is, e.g., to try to
dereference a NULL pointer. If the rest of your code is correct, it will
never try to call this function with odd v. An assert() is a way to catch
this error if the rest of the code isn't correct. As the comments in the
original code said: v must be even. That's a precondition for using chi2Q;
an assert would catch violations of the precondition; the assert() should
never fail; if it does, the code *calling* chi2Q is in error.

From kennypitt at hotmail.com Thu Feb 19 12:52:43 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Feb 19 12:53:46 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4034E359.50401@vargas.com>
Message-ID:

Miguel wrote:
> Tim Peters wrote:
>>> for (i=1;i<=floor(v/2);i++) {
>>
>>
>> If v is int or unsigned int, you'll also get to skip the relatively
>> expensive floor() call on each loop trip. You should put in the
>> original code's assert that v is even (this algorithm is dead wrong
>> if v is odd).
> I don't understand how making v an int will make it skip the floor
> function. Also, I don't understand what the assert does, what does
> the function return if v is odd?

floor(x) takes a floating point value x and finds the largest integer that
is <= x. If x is >= 0, this is equivalent to truncating the fractional
part. If v above is an int then C++ will use integer arithmetic when
computing v/2, which will always truncate the fractional part.
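Putting together the points raised in this thread (an integral v, the evenness assert, a single exp() call, the loop bound i < v/2, and adding the frexp() exponent rather than M_E), a corrected version might look roughly like the sketch below. It is only an illustration in plain standard C++, not the Mozilla or SpamBayes source: the name chi2SpamProb and the std::vector interface are assumptions, and the caller is assumed to have already picked the clue probabilities, each strictly between 0 and 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Probability that a chi-squared variate with v degrees of freedom is >= x2.
// Precondition: v is even.
double chi2Q(double x2, unsigned int v) {
    assert((v & 1) == 0);        // dies in debug builds if v is odd
    double m = x2 / 2.0;
    double term = std::exp(-m);  // exp() called once
    double sum = term;
    for (unsigned int i = 1; i < v / 2; ++i) {  // integer v/2, no floor()
        term *= m / i;
        sum += term;
    }
    return std::min(sum, 1.0);   // guard against rounding pushing sum past 1
}

// Chi-squared combining of the selected clue probabilities.
// Returns a score in [0, 1]; 0.5 when there are no clues.
double chi2SpamProb(const std::vector<double>& probs) {
    double H = 1.0, S = 1.0;
    int Hexp = 0, Sexp = 0;      // binary exponents split off by frexp()
    for (double p : probs) {
        S *= 1.0 - p;
        H *= p;
        int e;
        // When a product gets tiny, keep only the mantissa and remember
        // the exponent, so the running product never underflows to 0.
        if (S < 1e-200) { S = std::frexp(S, &e); Sexp += e; }
        if (H < 1e-200) { H = std::frexp(H, &e); Hexp += e; }
    }
    unsigned int n = static_cast<unsigned int>(probs.size());
    if (n == 0)
        return 0.5;
    // ln(x * 2**k) = ln(x) + k * ln(2)
    const double LN2 = std::log(2.0);
    S = std::log(S) + Sexp * LN2;
    H = std::log(H) + Hexp * LN2;
    S = 1.0 - chi2Q(-2.0 * S, 2 * n);
    H = 1.0 - chi2Q(-2.0 * H, 2 * n);
    return (S - H + 1.0) / 2.0;
}
```

Because the degrees of freedom are always passed as 2 * n, the assert can only fire if some other caller misuses chi2Q. A score from chi2SpamProb would then be compared against whatever cutoff the application uses, such as the 0.9 mentioned in this thread.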
The floor() function is therefore unnecessary and can be removed from the code. -- Kenny Pitt From kennypitt at hotmail.com Thu Feb 19 13:02:31 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 19 13:03:31 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon In-Reply-To: <01f101c3f70d$11449a50$0601000a@Glenn> Message-ID: Glenn Brown wrote: > Wouldn't this problem be much simpler if spambayes added its own icon > instead of trying to tweak Outlook's broken one? A simple > implementation would be > Ham detected: add icon > icon clicked: activate Outlook and remove icon Seems simple, doesn't it? But what about the following? - Open Outlook without clicking the SpamBayes tray icon and read all the mail. How do we detect that we should remove the tray icon? - What about messages that are processed by Outlook rules with background filtering enabled, and are never seen by SpamBayes? - How about messages received in folders that SpamBayes isn't configured to filter? That said, this approach could be reasonably effective depending on how you use Outlook. InBoxer (which is based on SpamBayes) does something like this. I guess that's what configuration options are for. The problem is, I don't think any of the volunteers here are interested enough in having this feature to sign up for the effort to implement it. If someone wants to write up a patch for it, I'm sure it would be given serious consideration for including in a future release. -- Kenny Pitt From tim.one at comcast.net Thu Feb 19 13:23:13 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Feb 19 13:23:14 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon In-Reply-To: <4EAA30E8E17684458E24408ADC13A4B50287D22F@esebe006.ntc.nokia.com> Message-ID: [Tim] >> - Ignore the new-mail icon. That's what I do. In fact, I started >> ignoring it long before SpamBayes existed -- its behavior never >> made useful sense to me. 
[Antoine]
> "Tell us what you need, we will explain why you don't need it."

I'm not telling you what you need, I'm telling you what works for me. Take
it or leave it. What you get from this project is what other people
contribute to it. Since I've got no use for this icon (or anything like
it), I'm not going to contribute my time to trying to improve it. If you
absolutely have to have it, then you need to do it yourself, motivate
someone else to do it for you, or, indeed, use something else.

From sethg at GoodmanAssociates.com Thu Feb 19 14:25:43 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Thu Feb 19 14:25:47 2004
Subject: [spambayes-dev] A Question about SpamBayes
In-Reply-To: <000001c3f6ff$86fc2230$6501a8c0@JOACHIM>
Message-ID:

> [Joachim Engelhardt]
> The return
> address from non-spammers should be a good one - and the autoreplies to
> spammers with fake return addresses end up in limbo.

Unfortunately, that is not the case. Many of the return addresses on spam
are legal addresses belonging to someone else that the spammer has forged.
This is known as a "joe-job". It is meant to make it look like some
innocent domain sent out the spam. If you've ever been the victim of a
joe-job, you'd know that you can receive hundreds or even thousands of
bounces and complaints each day, which will overwhelm your mail system, or
at least your ability to deal with the volume of mail. Since the bounces
and complaints are "coming from everywhere", there is really no way to
block them and filtering would be very difficult.

Because of this, sending messages to the supposed originators of the spam
will often wind up punishing some innocent third party. BTW, spammers
often forge addresses that have submitted abuse reports, so you may be
punishing the very people who are trying to fight spammers!
-- Seth Goodman From sethg at GoodmanAssociates.com Thu Feb 19 14:32:04 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu Feb 19 14:32:05 2004 Subject: [spambayes-dev] A Question about SpamBayes In-Reply-To: Message-ID: > [Joachim Engelhardt] > The return > address from non-spammers should be a good one - and the autoreplies to > spammers with fake return addresses end up in limbo. The other problem that I forgot to mention is that for the cases where the return address is a non-deliverable address, you will get a DSN (bounce message) right back. I don't think you want that, either. -- Seth Goodman From tameyer at ihug.co.nz Thu Feb 19 17:03:18 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 19 17:10:10 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305255F9F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046779A7@its-xchg4.massey.ac.nz> > - Use a commercial anti-spam filter (if I can find one that > switches off the new mail icon; suggestions welcome). [...] > I am thrilled by none of these possibilities. I would have > liked to use SpamBayes, but given the amount of spam I get, > this is just not possible. You could look into InBoxer (there's a link on our 'related' page). It's the SpamBayes code (or close enough), so you'll get the same results. I believe they have done *something* to address this, although I'm not sure exactly what. In any case, since you'd be giving them money, you'd be able to ask them for features, if they don't do what you want already. =Tony Meyer From rmalayter at bai.org Thu Feb 19 19:00:50 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Thu Feb 19 19:00:57 2004 Subject: [spambayes-dev] a useful pre-filter for auto-training bayesian systems? Message-ID: <792DE28E91F6EA42B4663AE761C41C2A01E19DCE@cliff.bai.org> I found this interesting. 
Using the social network as a first step, a bunch of "definite ham" and "definite spam" messages are listed. These can be used to train a Bayesian filter which then filters the rest of the unsures automatically. Very little user intervention would therefore be required for training, and it cuts in half the number of messages that must be filtered by the much-more-expensive statistical filter. http://www.arxiv.org/abs/cond-mat/0402143 One could even imagine users securely posting their email addressee's "white lists" by posting SHA-1 hashes instead of actually email addresses to some public forum. (This would have to be salted, of course). This could create a meta-social-network. They don't seem to address the issue of a spam that has a forged address from your own social network, though, which might trip up this whole social network process. Ryan Malayter Sr. Network & Database Administrator Bank Administration Institute Chicago, Illinois, USA PGP Key: http://www.malayter.com/pgp-public.txt ::::::::::::::::::::::::::::::: I am prepared to meet my Maker. Whether my Maker is prepared for the great ordeal of meeting me is another matter. -Sir Winston S. Churchill From jm at jmason.org Thu Feb 19 20:21:44 2004 From: jm at jmason.org (Justin Mason) Date: Thu Feb 19 20:21:54 2004 Subject: [spambayes-dev] a useful pre-filter for auto-training bayesian systems? In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A01E19DCE@cliff.bai.org> Message-ID: <20040220012146.848D117003@jmason.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ryan Malayter writes: > I found this interesting. Using the social network as a first step, a > bunch of "definite ham" and "definite spam" messages are listed. These > can be used to train a Bayesian filter which then filters the rest of > the unsures automatically. Very little user intervention would therefore > be required for training, and it cuts in half the number of messages > that must be filtered by the much-more-expensive statistical filter. 
> > http://www.arxiv.org/abs/cond-mat/0402143 > > One could even imagine users securely posting their email addressee's > "white lists" by posting SHA-1 hashes instead of actually email > addresses to some public forum. (This would have to be salted, of > course). This could create a meta-social-network. > > They don't seem to address the issue of a spam that has a forged address > from your own social network, though, which might trip up this whole > social network process. Yeah -- spam with forged From of your address. That has historically been how spammers get around address-book-based whitelisting, because everyone usually has 1 or more of their own addrs in the address book. - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) Comment: Exmh CVS iD8DBQFANWEoQTcbUG5Y7woRAr2uAKDC4sevuJ87uYk6zPlb6aWOik7xXgCfdn7n /TFi3tpsMnGxI38K4cpTmUA= =v88V -----END PGP SIGNATURE----- From ta-meyer at ihug.co.nz Thu Feb 19 22:53:06 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 19 22:53:32 2004 Subject: [spambayes-dev] Automatically generated bug reports from sb_server Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ACB@its-xchg4.massey.ac.nz> Surprisingly (to me, at least), it appears that people are actually going to use the auto-generate-a-bug-report feature in sb_server. I've fixed the formatting problem, but I'm also wondering if other changes should be made to make them more effective. Should the user have to enter a subject themselves? I worry that there will be too many "Problem with POP3 Proxy" messages, and it'll be confusing to keep track of threads. (Obviously they can change the subject before posting, but there's no compulsion, or even suggestion, to do so). Would it be better to just open up a link to the sourceforge bug tracking system (they're in a web browser, after all)? Or offer both, but suggest one more strongly? Any other comments, based on the ones that have arrived so far? 
=Tony Meyer

From kennypitt at hotmail.com  Fri Feb 20 09:31:46 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Feb 20 09:32:46 2004
Subject: [spambayes-dev] Automatically generated bug reports from sb_server
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ACB@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> Should the user have to enter a subject themselves? I worry that
> there will be too many "Problem with POP3 Proxy" messages, and it'll
> be confusing to keep track of threads.
We could set the subject to something like "POP3 Proxy Problem: "
That *might* encourage people to change the subject, and we'll be able
to see very quickly if people aren't paying any attention to it.

> Would it be better to just open up a link to the sourceforge bug
> tracking system (they're in a web browser, after all)? Or offer
> both, but suggest one more strongly?

One of the problems we have consistently is people not providing enough
information about their problem. IIRC, one of the purposes for doing the
bug report feature was to give users a template that would encourage
them to provide the necessary info. I don't think pointing them directly
at SourceForge would do that unless there is some way to automatically
add some template headings when opening the page.

Is there any possibility of providing them a form to fill in through the
sb_server UI and then submitting that info directly as a SourceForge bug
instead of emailing it to the list? I think most of us subscribe to the
spambayes-bugs list anyway, so we would still get e-mail notification.

> Any other comments, based on the ones that have arrived so far?

Sorry, I haven't looked closely enough at the feature or the reports
received so far to provide any more input. I do bring up the proxy from
time to time, and I'll try to remember to look at this more.

--
Kenny Pitt

From miguel at vargas.com  Fri Feb 20 18:13:50 2004
From: miguel at vargas.com (Miguel)
Date: Fri Feb 20 18:14:04 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To:
References:
Message-ID: <403694AE.6020303@vargas.com>

OK, I made all the suggested changes and re-tested. The fn rate dropped
by half, which is amazing considering that it was already about half of
the original. Unfortunately, the fp rate did not improve and might have
even gone up a bit.

To try to pinpoint my problem I've been trying to debug into
classifier.py and feed it some numbers.
Unfortunately I don't know my way around the Python debugger very well
so I haven't been able to pull this off. Is there a kind Python soul in
here that could help me with this?

Feed these numbers into classifier.py and see if you get the same results

ngood = 861, nbad = 759
spam score = 0.809734
token 1: hamcount = 13 spamcount = 103, prob=0.898333
token 2: hamcount = 44 spamcount = 99, prob=0.717812
token 3: hamcount = 22 spamcount = 96, prob=0.830673
token 4: hamcount = 0 spamcount = 0, discarded
token 5: hamcount = 5802 spamcount = 4680, discarded
token 6: hamcount = 0 spamcount = 0, discarded
token 7: hamcount = 1 spamcount = 3, prob=0.745295
token 8: hamcount = 513 spamcount = 1353, prob=0.749430
token 9: hamcount = 0 spamcount = 1, prob=0.844828
token 10: hamcount = 2440 spamcount = 908, prob=0.296862
token 11: hamcount = 1079 spamcount = 901, discarded
token 12: hamcount = 320 spamcount = 305, discarded
token 13: hamcount = 1 spamcount = 0, prob=0.155172
token 14: hamcount = 1 spamcount = 1, discarded
token 15: hamcount = 0 spamcount = 0, discarded
token 16: hamcount = 2986 spamcount = 6224, prob=0.702770
token 17: hamcount = 2272 spamcount = 852, prob=0.298469
token 18: hamcount = 3774 spamcount = 2822, discarded
token 19: hamcount = 1878 spamcount = 1929, discarded
token 20: hamcount = 23 spamcount = 17, discarded
token 21: hamcount = 25 spamcount = 15, discarded
token 22: hamcount = 4374 spamcount = 4524, discarded
token 23: hamcount = 231 spamcount = 215, discarded
token 24: hamcount = 0 spamcount = 0, discarded
token 25: hamcount = 32 spamcount = 120, prob=0.808753
token 26: hamcount = 1075 spamcount = 1231, discarded
token 27: hamcount = 995 spamcount = 628, discarded
token 28: hamcount = 2 spamcount = 0, prob=0.091837
token 29: hamcount = 0 spamcount = 0, discarded
token 30: hamcount = 915 spamcount = 514, prob=0.389251
token 31: hamcount = 0 spamcount = 0, discarded
token 32: hamcount = 6051 spamcount = 5895, discarded
token 33: hamcount = 6 spamcount = 30, prob=0.845796
token 34: hamcount = 2409 spamcount = 1251, prob=0.370725
token 35: hamcount = 324 spamcount = 620, prob=0.684528
token 36: hamcount = 791 spamcount = 735, discarded
token 37: hamcount = 355 spamcount = 1660, prob=0.841306
token 38: hamcount = 7425 spamcount = 4191, prob=0.390359
token 39: hamcount = 311 spamcount = 734, prob=0.727963
token 40: hamcount = 0 spamcount = 0, discarded
token 41: hamcount = 1029 spamcount = 934, discarded
token 42: hamcount = 0 spamcount = 0, discarded
token 43: hamcount = 0 spamcount = 0, discarded
token 44: hamcount = 548 spamcount = 735, prob=0.603372
token 45: hamcount = 3 spamcount = 0, prob=0.065217
token 46: hamcount = 217 spamcount = 132, discarded
token 47: hamcount = 106 spamcount = 58, prob=0.383304
token 48: hamcount = 0 spamcount = 0, discarded
token 49: hamcount = 0 spamcount = 3, prob=0.934783
token 50: hamcount = 5 spamcount = 1, prob=0.206905
token 51: hamcount = 28 spamcount = 234, prob=0.903889
token 52: hamcount = 6939 spamcount = 5645, discarded
token 53: hamcount = 135 spamcount = 502, prob=0.808147
token 54: hamcount = 0 spamcount = 0, discarded
token 55: hamcount = 323 spamcount = 501, prob=0.637544
token 56: hamcount = 10034 spamcount = 9013, discarded
token 57: hamcount = 7 spamcount = 2, prob=0.256930
token 58: hamcount = 0 spamcount = 0, discarded
token 59: hamcount = 10 spamcount = 6, discarded
token 60: hamcount = 21 spamcount = 28, prob=0.601065
token 61: hamcount = 736 spamcount = 1119, prob=0.632955
token 62: hamcount = 784 spamcount = 3206, prob=0.822622
token 63: hamcount = 2428 spamcount = 16895, prob=0.887550
token 64: hamcount = 0 spamcount = 0, discarded
token 65: hamcount = 689 spamcount = 272, prob=0.309399
token 66: hamcount = 80 spamcount = 341, prob=0.828279
token 67: hamcount = 3190 spamcount = 3140, discarded
token 68: hamcount = 0 spamcount = 0, discarded
token 69: hamcount = 116 spamcount = 631, prob=0.860326
token 70: hamcount = 477 spamcount = 442, discarded
token 71: hamcount = 2538 spamcount = 2167, discarded
token 72: hamcount = 30 spamcount = 43, prob=0.618456
token 73: hamcount = 26 spamcount = 7, prob=0.237537
token 74: hamcount = 0 spamcount = 0, discarded
token 75: hamcount = 22 spamcount = 50, prob=0.719156
token 76: hamcount = 210 spamcount = 15, prob=0.075803
token 77: hamcount = 15 spamcount = 172, prob=0.927581
token 78: hamcount = 1557 spamcount = 2412, prob=0.637313
token 79: hamcount = 23 spamcount = 2, prob=0.097039
token 80: hamcount = 1343 spamcount = 326, prob=0.215985
token 81: hamcount = 750 spamcount = 881, discarded
token 82: hamcount = 846 spamcount = 1160, prob=0.608651
token 83: hamcount = 48 spamcount = 36, discarded
token 84: hamcount = 55 spamcount = 47, discarded
token 85: hamcount = 27 spamcount = 28, discarded
token 86: hamcount = 1585 spamcount = 641, prob=0.314526
token 87: hamcount = 114 spamcount = 36, prob=0.264453
token 88: hamcount = 896 spamcount = 724, discarded
token 89: hamcount = 0 spamcount = 0, discarded
token 90: hamcount = 0 spamcount = 0, discarded
token 91: hamcount = 92 spamcount = 76, discarded
token 92: hamcount = 11 spamcount = 178, prob=0.947273
token 93: hamcount = 62 spamcount = 79, discarded
token 94: hamcount = 1937 spamcount = 1618, discarded
token 95: hamcount = 16 spamcount = 44, prob=0.755341
token 96: hamcount = 606 spamcount = 620, discarded
token 97: hamcount = 552 spamcount = 265, prob=0.352659
token 98: hamcount = 473 spamcount = 302, discarded
token 99: hamcount = 3002 spamcount = 5390, prob=0.670692
token 100: hamcount = 0 spamcount = 0, discarded
token 101: hamcount = 8 spamcount = 3, prob=0.306362
token 102: hamcount = 13 spamcount = 2, prob=0.158824
token 103: hamcount = 732 spamcount = 826, discarded
token 104: hamcount = 474 spamcount = 3408, prob=0.890738
token 105: hamcount = 0 spamcount = 0, discarded
token 106: hamcount = 0 spamcount = 0, discarded
token 107: hamcount = 1191 spamcount = 5958, prob=0.850161
token 108: hamcount = 337 spamcount = 3033, prob=0.910735
token 109: hamcount = 258 spamcount = 222, discarded
token 110: hamcount = 1 spamcount = 1, discarded
token 111: hamcount = 300 spamcount = 865, prob=0.765750
token 112: hamcount = 526 spamcount = 1327, prob=0.740998
token 113: hamcount = 7 spamcount = 25, prob=0.797846
token 114: hamcount = 956 spamcount = 1957, prob=0.698961
token 115: hamcount = 2 spamcount = 0, prob=0.091837
token 116: hamcount = 0 spamcount = 0, discarded
token 117: hamcount = 863 spamcount = 1208, prob=0.613559
token 118: hamcount = 469 spamcount = 282, discarded
token 119: hamcount = 230 spamcount = 57, prob=0.219879
token 120: hamcount = 0 spamcount = 0, discarded
token 121: hamcount = 32 spamcount = 33, discarded
token 122: hamcount = 736 spamcount = 840, discarded
token 123: hamcount = 1 spamcount = 0, prob=0.155172
token 124: hamcount = 41 spamcount = 14, prob=0.280994
token 125: hamcount = 3 spamcount = 2, discarded
token 126: hamcount = 10 spamcount = 46, prob=0.836477
token 127: hamcount = 12 spamcount = 115, prob=0.914295
token 128: hamcount = 48 spamcount = 28, prob=0.398815
token 129: hamcount = 388 spamcount = 308, discarded
token 130: hamcount = 2466 spamcount = 1232, prob=0.361746
token 131: hamcount = 0 spamcount = 0, discarded
token 132: hamcount = 61 spamcount = 473, prob=0.897584
token 133: hamcount = 3 spamcount = 11, prob=0.796645
token 134: hamcount = 0 spamcount = 0, discarded
token 135: hamcount = 471 spamcount = 561, discarded
token 136: hamcount = 7218 spamcount = 8674, discarded
token 137: hamcount = 43 spamcount = 34, discarded
token 138: hamcount = 21 spamcount = 29, prob=0.609385
token 139: hamcount = 8 spamcount = 39, prob=0.843574
token 140: hamcount = 0 spamcount = 0, discarded
token 141: hamcount = 27 spamcount = 32, discarded
token 142: hamcount = 23 spamcount = 3, prob=0.135206
token 143: hamcount = 5895 spamcount = 4656, discarded
token 144: hamcount = 11 spamcount = 2, prob=0.181994
token 145: hamcount = 1718 spamcount = 2754, prob=0.645181
token 146: hamcount = 32 spamcount = 4, prob=0.128828
token 147: hamcount = 721 spamcount = 388, prob=0.379109
token 148: hamcount = 45 spamcount = 125, prob=0.758415
token 149: hamcount = 767 spamcount = 1010, discarded
token 150: hamcount = 319 spamcount = 338, discarded
token 151: hamcount = 1071 spamcount = 1628, prob=0.632918
token 152: hamcount = 36 spamcount = 109, prob=0.773655
token 153: hamcount = 188 spamcount = 160, discarded
token 154: hamcount = 0 spamcount = 0, discarded
token 155: hamcount = 87 spamcount = 276, prob=0.782200
token 156: hamcount = 16 spamcount = 3, prob=0.182902
token 157: hamcount = 0 spamcount = 0, discarded
token 158: hamcount = 1106 spamcount = 510, prob=0.343484
token 159: hamcount = 861 spamcount = 759, discarded
token 160: hamcount = 354 spamcount = 326, discarded
token 161: hamcount = 154 spamcount = 791, prob=0.853346
token 162: hamcount = 5 spamcount = 6, discarded
token 163: hamcount = 8 spamcount = 20, prob=0.735524
token 164: hamcount = 543 spamcount = 1595, prob=0.769110
token 165: hamcount = 180 spamcount = 1293, prob=0.890575
token 166: hamcount = 0 spamcount = 0, discarded
token 167: hamcount = 1730 spamcount = 5246, prob=0.774751
token 168: hamcount = 87 spamcount = 19, prob=0.199825
token 169: hamcount = 22 spamcount = 1, prob=0.057689
token 170: hamcount = 35 spamcount = 225, prob=0.878753
token 171: hamcount = 0 spamcount = 0, discarded
token 172: hamcount = 475 spamcount = 495, discarded
token 173: hamcount = 192 spamcount = 86, prob=0.337182
token 174: hamcount = 1723 spamcount = 1518, discarded
token 175: hamcount = 2990 spamcount = 1730, prob=0.396273
token 176: hamcount = 539 spamcount = 4562, prob=0.905636
token 177: hamcount = 0 spamcount = 0, discarded
token 178: hamcount = 1156 spamcount = 1529, prob=0.600049
token 179: hamcount = 0 spamcount = 0, discarded
token 180: hamcount = 3 spamcount = 0, prob=0.065217
token 181: hamcount = 0 spamcount = 0, discarded
token 182: hamcount = 4 spamcount = 2, prob=0.371550
token 183: hamcount = 670 spamcount = 882, discarded
token 184: hamcount = 873 spamcount = 687, discarded
token 185: hamcount = 1 spamcount = 3, prob=0.745295
token 186: hamcount = 5678 spamcount = 8736, prob=0.635741
token 187: hamcount = 4 spamcount = 12, prob=0.765425
token 188: hamcount = 1 spamcount = 0, prob=0.155172
token 189: hamcount = 81 spamcount = 76, discarded
token 190: hamcount = 355 spamcount = 427, discarded
token 191: hamcount = 1149 spamcount = 1266, discarded
token 192: hamcount = 2034 spamcount = 611, prob=0.254198
token 193: hamcount = 110 spamcount = 16, prob=0.142908
token 194: hamcount = 0 spamcount = 0, discarded
token 195: hamcount = 1 spamcount = 0, prob=0.155172
token 196: hamcount = 806 spamcount = 710, discarded
token 197: hamcount = 0 spamcount = 0, discarded
token 198: hamcount = 24 spamcount = 10, prob=0.323296
token 199: hamcount = 55 spamcount = 85, prob=0.636341
token 200: hamcount = 585 spamcount = 342, prob=0.398791
token 201: hamcount = 3 spamcount = 0, prob=0.065217
token 202: hamcount = 28 spamcount = 37, discarded
token 203: hamcount = 1 spamcount = 4, prob=0.793041
token 204: hamcount = 2375 spamcount = 1331, prob=0.388667
token 205: hamcount = 233 spamcount = 43, prob=0.173642
token 206: hamcount = 0 spamcount = 0, discarded
token 207: hamcount = 24 spamcount = 57, prob=0.728036
token 208: hamcount = 28 spamcount = 23, discarded
token 209: hamcount = 4 spamcount = 2, prob=0.371550
token 210: hamcount = 87 spamcount = 98, discarded
token 211: hamcount = 30 spamcount = 222, prob=0.892853
token 212: hamcount = 1167 spamcount = 1133, discarded
token 213: hamcount = 303 spamcount = 168, prob=0.386223
token 214: hamcount = 384 spamcount = 557, prob=0.621935
token 215: hamcount = 10 spamcount = 0, prob=0.021531
token 216: hamcount = 1722 spamcount = 1738, discarded
token 217: hamcount = 3000 spamcount = 2582, discarded
token 218: hamcount = 134 spamcount = 449, prob=0.791487
token 219: hamcount = 138 spamcount = 982, prob=0.889617
token 220: hamcount = 4561 spamcount = 5959, discarded
token 221: hamcount = 966 spamcount = 4736, prob=0.847570
token 222: hamcount = 641 spamcount = 419, discarded
token 223: hamcount = 204 spamcount = 27, prob=0.131259
token 224: hamcount = 13681 spamcount = 10641, discarded
token 225: hamcount = 60 spamcount = 436, prob=0.891457
token 226: hamcount = 862 spamcount = 759, discarded
token 227: hamcount = 56 spamcount = 141, prob=0.740131
token 228: hamcount = 8 spamcount = 5, discarded
token 229: hamcount = 872 spamcount = 758, discarded
token 230: hamcount = 4986 spamcount = 3565, discarded
token 231: hamcount = 8932 spamcount = 8166, discarded
token 232: hamcount = 1090 spamcount = 823, discarded
token 233: hamcount = 29 spamcount = 16, prob=0.386083
token 234: hamcount = 1457 spamcount = 1461, discarded
token 235: hamcount = 472 spamcount = 1564, prob=0.789802
token 236: hamcount = 4052 spamcount = 2179, prob=0.378901
token 237: hamcount = 2325 spamcount = 7625, prob=0.788136
token 238: hamcount = 10 spamcount = 42, prob=0.823721
token 239: hamcount = 11 spamcount = 0, prob=0.019651
token 240: hamcount = 90 spamcount = 21, prob=0.210466
token 241: hamcount = 0 spamcount = 0, discarded
token 242: hamcount = 516 spamcount = 828, prob=0.645379

From miguel at vargas.com  Fri Feb 20 23:29:29 2004
From: miguel at vargas.com (Miguel Vargas)
Date: Fri Feb 20 23:28:47 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To:
References:
Message-ID: <4036DEA9.1070204@vargas.com>

After playing with the Pythonwin debugger some more I don't think it's
possible to do what I was trying to do. I was trying to step into
classifier.py and then modify the variables at run time. Is there a
debugger that will let me do this? If not, how could I test these
numbers? I was thinking maybe I could load the values directly into the
training database, is there an easy way of doing that?
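One way to spot-check individual entries of the token dump without a debugger is to recompute them in a few lines of Python. The sketch below is not taken from classifier.py. It assumes the per-token formula used in the C++ port posted in this thread (the constants 0.225 and 0.45, read here as an unknown-word strength of 0.45 times an unknown-word probability of 0.5, and the 0.1 cutoff around 0.5). The names token_prob and is_clue are made up for illustration, treating a never-seen token as 0.5 is an assumption, and Python 3 division is assumed.

```python
NGOOD, NBAD = 861, 759   # ham and spam message counts quoted in the post
S, X = 0.45, 0.5         # assumed: unknown-word strength and probability
MIN_DISTANCE = 0.1       # clues closer than this to 0.5 are discarded


def token_prob(hamcount, spamcount, ngood=NGOOD, nbad=NBAD):
    """Adjusted spam probability for one token (hypothetical helper)."""
    n = hamcount + spamcount
    if n == 0:
        # Assumption: a token never seen in training gets the neutral value.
        return X
    hamratio = hamcount / ngood
    spamratio = spamcount / nbad
    prob = spamratio / (hamratio + spamratio)
    # Pull the raw ratio toward X when there is little evidence (small n).
    return (S * X + n * prob) / (S + n)


def is_clue(prob):
    """True if the token is far enough from 0.5 to be used as a clue."""
    return abs(prob - 0.5) >= MIN_DISTANCE


# Recompute a few rows of the dump: (token number, hamcount, spamcount).
for number, ham, spam in [(1, 13, 103), (9, 0, 1), (13, 1, 0), (14, 1, 1)]:
    p = token_prob(ham, spam)
    print(number, round(p, 6), "kept" if is_clue(p) else "discarded")
```

Comparing the printed values against the corresponding "prob=" and "discarded" entries in the dump shows whether the port and this formula agree on the per-token step, which narrows any remaining difference down to the combining step.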
From miguel at vargas.com  Fri Feb 20 23:49:10 2004
From: miguel at vargas.com (Miguel Vargas)
Date: Fri Feb 20 23:48:27 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4036DEA9.1070204@vargas.com>
References: <4036DEA9.1070204@vargas.com>
Message-ID: <4036E346.2000108@vargas.com>

By the way, here's the latest incarnation of my code. I'm still
somewhat confused about the floor() function in chi2Q. I understand
why it's not needed, so why is it in the SpamBayes code?

/** This section comes from probability(self, record) and
    _getclues(self, wordstream)**/
for (i = 0; i < count; ++i) {
    Token& token = tokens[i];
    // tokens is an array of Token, elements of a Token
    // include both token.mProbability and token.mDistance
    const char* word = token.mWord;
    Token* t = mGoodTokens.get(word);
    double hamcount = ((t != NULL) ? t->mCount : 0);
    t = mBadTokens.get(word);
    double spamcount = ((t != NULL) ? t->mCount : 0);
    prob = (spamcount / nbad) / ( hamcount / ngood + spamcount / nbad);
    double n = hamcount + spamcount;
    prob = (0.225 + n * prob) / (.45 + n);
    double distance = abs(prob - 0.5);
    if (distance >= .1) {
        goodclues++;
        token.mDistance = distance;
        token.mProbability = prob;
    } else {
        token.mDistance = -1; //ignore clue
    }
}

// sort the array by the token distances
PRUint32 first, last = count;
if (count > 150) {
    first = count - 150;
    // This function sorts the array by token.mDistance
    NS_QuickSort(tokens, count, sizeof(Token), compareTokens, NULL);
} else {
    first = 0;
}

/** This section comes from
    chi2_spamprob(self, wordstream, evidence=False) **/
double H = 1.0, S = 1.0;
PRUint32 Hexp = 0, Sexp = 0;
goodclues=0;
int e;

for (i = first; i < last; ++i) {
    if (tokens[i].mDistance != -1) {
        goodclues++;
        double value = tokens[i].mProbability;
        S *= (1.0 - value);
        H *= value;
        if ( S < 1e-200 ) {
            S = frexp(S, &e);
            Sexp += e;
        }
        if ( H < 1e-200 ) {
            H = frexp(H, &e);
            Hexp += e;
        }
    }
}

S = log(S) + Sexp * M_LN2;
H = log(H) + Hexp * M_LN2;

if (goodclues>0) {
    S = 1.0 - chi2Q(-2.0 * S, 2 * goodclues);
    H = 1.0 - chi2Q(-2.0 * H, 2 * goodclues);
    prob = (S-H +1.0) / 2.0;
} else {
    prob = 0.5;
}
PRBool isJunk = (prob >= 0.90);
------------------------------------
Here's the chi2Q function:
double chi2Q (double x2, PRUint32 v) {
    PRUint32 i;
    double m = x2 / 2.0;
    double sum = exp(-m);
    double term = sum;

    NS_ASSERTION(!(v & 1), "chi2Q called with odd value");

    for (i=1 ; i<=v/2 ; ++i) {
        term *= m / i;
        sum += term;
    }
    return dmin(sum,1.0);
}

From tim.one at comcast.net  Sat Feb 21 00:05:53 2004
From: tim.one at comcast.net (Tim Peters)
Date: Sat Feb 21 00:05:54 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4036E346.2000108@vargas.com>
Message-ID:

[Miguel Vargas]
> By the way, here's the latest incarnation of my code. I'm still
> somewhat confused about the floor() function in chi2Q. I understand
> why it's not needed, so why is it in the SpamBayes code?

There's no call to floor() in the entire SpamBayes codebase (let alone in
chi2Q). What makes you think there is?

...

> ------------------------------------
> Here's the chi2Q function:
> double chi2Q (double x2, PRUint32 v) {
>     PRUint32 i;
>     double m = x2 / 2.0;
>     double sum = exp(-m);
>     double term = sum;
>
>     NS_ASSERTION(!(v & 1), "chi2Q called with odd value");
>
>     for (i=1 ; i<=v/2 ; ++i) {

As covered before, this loop is going around once too often.

Sorry, no time for more now.

From miguel at vargas.com  Sat Feb 21 00:31:21 2004
From: miguel at vargas.com (Miguel Vargas)
Date: Sat Feb 21 00:30:39 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To:
References:
Message-ID: <4036ED29.3070306@vargas.com>

Tim Peters wrote:
> [Miguel Vargas]
>
>>By the way, here's the latest incarnation of my code. I'm still
>>somewhat confused about the floor() function in chi2Q. I understand
>>why it's not needed, so why is it in the SpamBayes code?
>
>
> There's no call to floor() in the entire SpamBayes codebase (let alone in
> chi2Q). What makes you think there is?

In chi2Q you have

    for i in range(1, v//2):

I thought // was floor division, that's why I had written floor(v/2).

>
> ...
>
>
>>------------------------------------
>>Here's the chi2Q function:
>>double chi2Q (double x2, PRUint32 v) {
>>    PRUint32 i;
>>    double m = x2 / 2.0;
>>    double sum = exp(-m);
>>    double term = sum;
>>
>>    NS_ASSERTION(!(v & 1), "chi2Q called with odd value");
>>
>>    for (i=1 ; i<=v/2 ; ++i) {
>
>
> As covered before, this loop is going around once too often. Sorry, no time
> for more now.
>

I don't know how I missed that one, I'm off to test again...

From tim.one at comcast.net  Sat Feb 21 12:19:10 2004
From: tim.one at comcast.net (Tim Peters)
Date: Sat Feb 21 12:19:09 2004
Subject: [spambayes-dev] Mozilla SpamBayes "porting"
In-Reply-To: <4036ED29.3070306@vargas.com>
Message-ID:

[Miguel Vargas]
> In chi2Q you have
>
>     for i in range(1, v//2):
>
> I thought // was floor division, that's why I had written floor(v/2).

Ah. // in Python is primarily integer division (as opposed to /, which
is float division). It so happens that integer division in Python
returns the floor of the mathematical quotient, and as an integer
(unlike the floor() function, which maps float to float), but that's
secondary. In any case, this secondary distinction makes no difference
when the mathematical quotient is exactly representable as an integer,
and v divided by 2 is exactly representable as an integer when v is an
even integer.

IOW, an experienced Pythoneer sees "v//2" and first reads it as "OK, v
is an integer and we want an integer result too". Since Python doesn't
have type declarations, Pythoneers can't guess the intended semantics by
looking for v's declaration, so Python uses different operator symbols
for integer and float division. That way the intent is immediately clear
(as it will be for you too the *next* time you see // in Python code).
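The two points in this exchange (what // does, and how many times the loop runs) can be made concrete with a short Python sketch. The only line confirmed by the thread is `for i in range(1, v//2):`; the rest of the function is reconstructed from the C++ port above and should be read as an approximation of chi2Q, not a copy of the SpamBayes source. Python 3 semantics are assumed for `/`.

```python
import math


def chi2Q(x2, v):
    """Sketch of prob(chisq >= x2) with v degrees of freedom; v must be even."""
    assert v % 2 == 0, "chi2Q called with odd value"
    m = x2 / 2.0
    total = term = math.exp(-m)
    # range(1, v // 2) yields 1 .. v//2 - 1, so the loop body runs
    # v//2 - 1 times; the upper bound itself is excluded.
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


# // gives an int for int operands, / gives a float.
half_int = 6 // 2      # 3
half_float = 6 / 2     # 3.0

# Iteration counts for v = 6: the Python loop versus a C-style
# "for (i = 1; i <= v/2; ++i)" loop, which includes the upper bound.
v = 6
python_iterations = len(range(1, v // 2))        # 2
c_port_iterations = len(range(1, v // 2 + 1))    # 3
print(python_iterations, c_port_iterations)
```

In a C or C++ port the matching bound would be `i < v/2`, since integer division of an even unsigned value by 2 already yields the same result as `v//2`.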
From ta-meyer at ihug.co.nz Sat Feb 21 21:20:45 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Sat Feb 21 21:21:12 2004 Subject: [spambayes-dev] Automatically generated bug reports from sb_server In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130525646C@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046779BD@its-xchg4.massey.ac.nz> > We could set the subject to something like > "POP3 Proxy Problem: " > That *might* encourage people to change the subject, and we'll be able > to see very quickly if people aren't paying any attention to it. After reading mail today, I'm no longer hesitant - the identical subject messages have got to go. The code makes a (rudimentary) check to make sure that the user has entered in what's necessary (so we don't get lots of "...My problem is [DESCRIBE YOUR PROBLEM HERE]" messages), so I'll get it to confirm that they've done something with the subject as well. > One of the problems we have consistently is people not > providing enough information about their problem. > IIRC, one of the purposes for doing > the bug report feature was to give users a template that > would encourage them to provide the necessary info. I don't > think pointing them directly at SourceForge would do that > unless there is some way to > automatically add some template headings when opening the page. Ah, good point - I hadn't thought about that. > Is there any possibility of providing them a form to fill in > through the sb_server UI and then submitting that info directly as a > SourceForge bug instead of emailing it to the list? I think most of us > subscribe to the spambayes-bugs list anyway, so we would still get > e-mail notification. Probably not, because you have to be logged in to sourceforge to submit a bug (well, we can turn that off in the admin settings I think, but it's presumably there for a reason). 
I'll forget about that idea :) > Sorry, I haven't looked closely enough at the feature or the reports > received so far to provide any more input. No worries :) > I do bring up the proxy from > time to time, and I'll try to remember to look at this more. Thanks :) (Note that if you ever want to test the reporting, you can change the address it sends to in ui.html (if you have resourcepackage to regenerate ui_html.py), so it (for example) sends to you, rather than the list.) =Tony Meyer From randy.kondor at matrikon.com Sun Feb 22 02:06:08 2004 From: randy.kondor at matrikon.com (Randy Kondor) Date: Sun Feb 22 02:06:27 2004 Subject: [spambayes-dev] RE: [PSF-Board] Spambayes Message-ID: Folks, can you please contact me about making some enhancements to Spambayes? I can provide funding for this work. Your product is awesome!!! Regards, Randy Matrikon - We make connections! Watch the multimedia OPC Tutorial at http://www.matrikon.com/tutorial O-----------------------------------------------------------------O | Randy Kondor Phone: 780-945-4035 | | OPC Product Manager Fax: 780-448-9191 | | Matrikon Email: randy.kondor@matrikon.com | | 10405 Jasper Ave, Suite 1800 WEB: http://www.matrikon.com | | Edmonton, Alberta, Canada, T5J 3N4 | O-----------------------------------------------------------------O Attend Matrikon's Annual User Group May 2-5, 2004 Matrikon Valued Partners Event - MVP 2004 Driving Performance, Partners for Results. -----Original Message----- From: Tim Peters [mailto:tim.one@comcast.net] Sent: Friday, February 20, 2004 11:56 AM To: Randy Kondor Cc: psf@python.org Subject: RE: [PSF-Board] Spambayes [Randy Kondor] > I am interested in speaking with you about funds for Spambayes. > Please call me at 780-945-4035. Randy, the SpamBayes developers chose to give copyright to the PSF, and to release the code under the PSF license, but the PSF has no other connection with that project. For example, the PSF doesn't fund it, or manage it. 
You can reach the SpamBayes developers directly via: mailto:spambayes-dev@python.org All SpamBayes developers are currently unpaid volunteers, but there's no prohibition against them doing work for pay. There's also no prohibition against using the SpamBayes code in a commercial project (and, for example, at least http://www.inboxer.com/ has done so). **************************************************************************** * READER BEWARE: Unencrypted, unsigned Internet e-mail is inherently insecure. Internet messages may be corrupted, incomplete, misdirected or may incorrectly identify the sender. Therefore, nothing in this message or attachments may be considered legally binding. THIS MESSAGE IS ONLY INTENDED FOR THE USE OF THE INDIVIDUAL OR ENTITY TO WHICH IT IS ADDRESSED AND MAY BE PRIVILEGED. If you are not the intended recipient or their authorized agent, you may not forward or copy this information and must delete or destroy all copies of this message and attachments received. If you have received this communication in error, please notify Matrikon Inc. by telephone at (780) 448-1010. **************************************************************************** *
From Jean-Marc.Valin at USherbrooke.ca Sun Feb 22 03:18:34 2004 From: Jean-Marc.Valin at USherbrooke.ca (Jean-Marc Valin) Date: Sun Feb 22 03:18:37 2004 Subject: [spambayes-dev] Some samples that fool spambayes Message-ID: <1077437914.4096.48.camel@idefix.homelinux.org> Hi, In the last few days, I've been hit with spam that got past spambayes. This spam, which seems to come from the same spammer, always gets through because the spammer includes lots of "hammie" words at the end. I thought it would be interesting to add these messages to your test setup because they're good examples of spam actively trying to defeat a filter. I've posted the mbox file here: http://www.speex.org/~jm/mbox Jean-Marc P.S. I'm not on the list, so please CC me on your reply. -- Jean-Marc Valin, M.Sc.A., ing. jr. LABORIUS (http://www.gel.usherb.ca/laborius) Université de Sherbrooke, Québec, Canada
From tim.one at comcast.net Sun Feb 22 12:21:58 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Feb 22 12:22:06 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: <403694AE.6020303@vargas.com> Message-ID: [Miguel] > OK, I made all the suggested changes and re-tested.
The fn rate > dropped by half, which is amazing considering that it > was already about half of the original. Unfortunately, the fp > rate did not improve and might have even gone up a bit. > > To try to pinpoint my problem I've been trying to debug into > classifier.py and feed it some numbers. Unfortunately I > don't know my way around the python debugger very well so I haven't > been able to pull this off. > > Is there a kind Python soul in here that could help me with this? > Feed these numbers into classifier.py and see if you get the same > results > > ngood = 861, nbad = 759 > > spam score = 0.809734 > > token 1: hamcount = 13 spamcount = 103, prob=0.898333 ... > token 242: hamcount = 516 spamcount = 828, prob=0.645379
chi2.py has a showscore() function, which displays details about the chi combining calculation; e.g.,
>>> showscore([.1, .1, .1, .1])
P(chisq >= 0.842884 | v= 8) = 0.999059
P(chisq >= 18.4207 | v= 8) = 0.0182845
spam prob 0.000940523891325
ham prob 0.981715504484
(S-H+1)/2 0.00961250970379
>>> showscore([.9, .9, .9, .9])
P(chisq >= 18.4207 | v= 8) = 0.0182845
P(chisq >= 0.842884 | v= 8) = 0.999059
spam prob 0.981715504484
ham prob 0.000940523891325
(S-H+1)/2 0.990387490296
>>> showscore([.1, .1, .9, .9])
P(chisq >= 9.63178 | v= 8) = 0.291827
P(chisq >= 9.63178 | v= 8) = 0.291827
spam prob 0.708173451976
ham prob 0.708173451976
(S-H+1)/2 0.5
>>>
Sticking your email msg into a string called 'data', then running this Python snippet:
"""
import re
parse = re.compile(r'prob=([\d.]+)')
probs = [float(prob) for prob in parse.findall(data)]
print "found", len(probs), "probs"
print "first", probs[0], "last", probs[-1]
import sys
sys.path.insert(0, '/code/spambayes') # season to taste
from spambayes.chi2 import showscore
showscore(probs)
"""
printed this:
found 131 probs
first 0.898333 last 0.645379
P(chisq >= 271.809 | v=262) = 0.325528
P(chisq >= 220.223 | v=262) = 0.971459
spam prob 0.674472123467
ham prob 0.0285410060406
(S-H+1)/2 0.822965558713
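For readers following along without a SpamBayes checkout, the chi-squared combining that showscore() reports can be sketched in a few lines of modern Python. This is a simplified illustration, not the project's chi2.py: the name chi_combine is made up here, and the real code guards against floating-point underflow on long token lists, which this sketch does not.

```python
import math

def chi2Q(x2, v):
    """Return P(chisq >= x2) for v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    # The series has v/2 terms in all: the initial exp(-m) plus one per
    # loop pass, so the loop stops at v//2 - 1. Running it one step
    # further adds an extra term.
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Hypothetical helper name; sums logs directly, so it assumes every
    probability is strictly between 0 and 1.
    """
    n = len(probs)
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)  # "spam prob" in showscore()
    H = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)  # "ham prob" in showscore()
    return (S - H + 1.0) / 2.0

for sample in ([.1, .1, .1, .1], [.9, .9, .9, .9], [.1, .1, .9, .9]):
    print(sample, chi_combine(sample))
```

For the three sample lists the scores should land close to the (S-H+1)/2 values in the interpreter session quoted above.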
So the code in this project would have given a higher spamprob (0.822...) than your code got (0.809...). This could very well be due to the off-by-one error in your chi2Q function. Indeed, if I change chi2.py's chi2Q's loop to
for i in range(1, v//2 + 1):
then the output changes to
found 131 probs
first 0.898333 last 0.645379
P(chisq >= 271.809 | v=262) = 0.357377
P(chisq >= 220.223 | v=262) = 0.976845
spam prob 0.642623035989
ham prob 0.0231547641273
(S-H+1)/2 0.809734135931
which is an excellent match to what you reported. You can verify that the chi-squared values we actually compute are correct by, e.g., using one of the interactive chi-squared calculators on the web. For example, http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html
From sam at s-j-t.co.uk Sun Feb 22 13:17:55 2004 From: sam at s-j-t.co.uk (Sam Thorne) Date: Sun Feb 22 13:18:10 2004 Subject: [spambayes-dev] Mac OS X package Message-ID: <6F78EA16-6563-11D8-9994-003065DA26BA@s-j-t.co.uk> Hello all, New to the list, and pretty new to spambayes. I've put together a package for easy installation of spambayes on Mac OS X. Posted it on the wiki; http://www.entrian.com/sbwiki/MacOSXPackage Generally wondering a few things though; in the install I've created, I put everything into a spambayes folder in the Library; this is the 'nice' place to install things on Mac OS X so they are self-contained and easy to find/uninstall. The pop proxy works fine here, with the folder structure being: /Library/SpamBayes/(contents of Scripts folder from source + data files) /Library/SpamBayes/spambayes/ (contents of the spambayes folder from source) However, the utilities (such as which_db etc) still look for the libs in /System/Library/Python.framework/... and don't find them (obviously).
In the end I left them out, as the point of the package really was just to create a very easy point 'n' click install, so I doubt many will use the utilities, but I'm still wondering why everything else checks the current directory first and then elsewhere for its modules etc. when the testtools and utilities don't...? And also, am I breaching the license by leaving them out? Anyway, hopefully that's all ok but I'd appreciate any feedback on the pkg. -- Sam So long, and thanks for all the fish. From miguel at vargas.com Sun Feb 22 22:46:10 2004 From: miguel at vargas.com (Miguel Vargas) Date: Sun Feb 22 22:45:53 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: References: Message-ID: <40397782.8080607@vargas.com> Tim Peters wrote: > which is an excellent match to what you reported. You can verify that the Great. I just confirmed that when I fixed my off-by-one error I got the correct value (0.822...). This points to a problem in the section where I calculate the probability per token. So then I noticed the 2 assertions from the probability function that I left out from my code assert hamcount <= nham assert spamcount <= nspam That is when I realized that we are counting the tokens differently. It looks like SpamBayes only counts a token once per message no matter how many times it appears. Mozilla counts every instance of a token, so hamcount can easily be greater than nham; that is evident in the email I sent before >> ngood = 861, nbad = 759 ... >> token 5: hamcount = 5802 spamcount = 4680 I'm off to patch Mozilla... From tim.one at comcast.net Sun Feb 22 23:13:21 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Feb 22 23:13:26 2004 Subject: [spambayes-dev] Mozilla SpamBayes "porting" In-Reply-To: <40397782.8080607@vargas.com> Message-ID: [Miguel Vargas] > Great. I just confirmed that when I fixed my off-by-one error I got > the correct value (0.822...). Cool!
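The counting difference Miguel describes can be shown with a toy example. The sketch below is only an illustration under made-up names (train, once_per_message) and made-up data; it is not the classifier.py training code. It shows why the assertion hamcount <= nham holds when a token is counted at most once per message, and why it can fail when every instance is counted.

```python
from collections import Counter

def train(counts, messages, once_per_message=True):
    """Add token counts for a list of tokenized messages to `counts`.

    Hypothetical helper for illustration only.
    """
    for tokens in messages:
        # set() collapses repeats, so a message adds at most 1 per token.
        counts.update(set(tokens) if once_per_message else tokens)
    return counts

# Two made-up ham messages; "hello" appears three times in total.
ham = [["hello", "hello", "meeting"], ["hello", "lunch"]]
nham = len(ham)

per_message = train(Counter(), ham, once_per_message=True)
per_instance = train(Counter(), ham, once_per_message=False)

print("once per message:", per_message["hello"], "<= nham:", per_message["hello"] <= nham)
print("every instance:  ", per_instance["hello"], "<= nham:", per_instance["hello"] <= nham)
```

With per-message counting a token's count can never exceed the number of trained messages, which is the invariant the two assertions check.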
> This points to a problem in the section where I calculate the > probability per token. So then I noticed the 2 assertions from the > probability function that I left out from my code > > assert hamcount <= nham > assert spamcount <= nspam > > That is when I realized that we are counting the tokens differently. > It looks like SpamBayes only counts a token once per message no > matter how many times it appears. There's a comment block about this in classifier.py, before the _add_msg() method. Graham's scheme was schizophrenic, counting duplicates more than once during training, but only once during scoring. See the comment for more on that. > Mozilla counts every instance of a token, If it's still following Graham's scheme in this respect, I expect Mozilla's scheme also differs between training and scoring. > so hamcount can easily be greater than nham; that is evident in the > email I sent before >>> ngood = 861, nbad = 759 >>> ... >>> token 5: hamcount = 5802 spamcount = 4680 Then it's clear that a token could be counted more than once during training (as in Graham's scheme), but that alone is not enough to say whether scoring does or doesn't weed out duplicates. The current spambayes algorithms weed out duplicates during training and scoring. From tim.one at comcast.net Sun Feb 22 23:19:02 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Feb 22 23:19:08 2004 Subject: [spambayes-dev] FW: [Spambayes] Delete As Spam - Doesn't always work Message-ID: Does this log file ring bells with anyone? I haven't seen errors coming out of GetHTMLFromRTFProperty before. Geoff, you may have intended to attach the email message you couldn't delete as spam, but no such attachment arrived -- just the log file. [fwd'ed with permission] [Geoff Campbell] > Hi Tim - > > Attached is the latest log file (only one I could find), and also the > email message that I couldn't "Delete as Spam".
In trying to solve > the problem, I upgraded to the latest SpamBayes version - that may > have wiped out the log of when the message originally came in. > > Anyway, thanks for you help. > > Geoff Campbell > > > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net] > Sent: Thursday, February 19, 2004 1:59 PM > To: geoff@controlg.com > Cc: spambayes@python.org > Subject: RE: [Spambayes] Delete As Spam - Doesn't always work > > > [Geoff Campbell] >> Sometimes (very rarely) when I select an obvious spam message and >> "click" on "Delete As Spam", nothing happens. It seems that the >> message is "SpamBayes proof"! This happens once every several days, >> but I can't see a pattern. Using Outlook 2002 (SP-1). Would be >> happy > >> to send you a representative email. > That would be good. There's probably a helpful (to us ) message > in your SpamBayes log file whenever this happens. A real problem is > that Outlook destroys the exact structure of incoming email, so you > may not actually be able to give anyone else an email that reproduces > the problem. When "nothing happens" in an otherwise-working > SpamBayes, the usual cause is that the email is so badly formed > (violates so many standards about how email *should* be constructed) > that the SpamBayes email parser gives up trying to make any sense of > it. If that's what's happening, you will find helpful (to us) > information in your SpamBayes log file. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: spambayes1.log Type: application/octet-stream Size: 22906 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040222/518937e4/spambayes1-0001.obj From geoff at controlg.com Mon Feb 23 11:27:39 2004 From: geoff at controlg.com (Geoff Campbell) Date: Mon Feb 23 11:27:59 2004 Subject: [spambayes-dev] RE: [Spambayes] Delete As Spam - Doesn't always work In-Reply-To: Message-ID: <021a01c3fa29$f7c853d0$2e00a8c0@Dell2400> Tim - Another try to forward - I'm forwarding the "sent" item (which shows the two attachments). Maybe it's more of an "outlook" problem! Thanks - Geoff -----Original Message----- From: Tim Peters [mailto:tim.one@comcast.net] Sent: Sunday, February 22, 2004 9:19 PM To: spambayes-dev@python.org Cc: geoff@controlg.com Subject: FW: [Spambayes] Delete As Spam - Doesn't always work Does this log file ring bells with anyone? I haven't seen errors coming out of GetHTMLFromRTFProperty before. Geoff, you may have intended to attach the email message you couldn't delete as spam, but no such attachment arrived -- just the log file. [fwd'ed with permission] [Geoff Campbell] > Hi Tim - > > Attached is the latest log file (only one I could find), and also the > email message that I couldn't "Delete as Spam". In trying to solve > the problem, I upgraded to the latest SpamBayes version - that may > have wiped out the log of when the message originally came in. > > Anyway, thanks for you help. > > Geoff Campbell > > > -----Original Message----- > From: Tim Peters [mailto:tim.one@comcast.net] > Sent: Thursday, February 19, 2004 1:59 PM > To: geoff@controlg.com > Cc: spambayes@python.org > Subject: RE: [Spambayes] Delete As Spam - Doesn't always work > > > [Geoff Campbell] >> Sometimes (very rarely) when I select an obvious spam message and >> "click" on "Delete As Spam", nothing happens. It seems that the >> message is "SpamBayes proof"! This happens once every several days, >> but I can't see a pattern. 
Using Outlook 2002 (SP-1). Would be >> happy > >> to send you a representative email. > That would be good. There's probably a helpful (to us ) message > in your SpamBayes log file whenever this happens. A real problem is > that Outlook destroys the exact structure of incoming email, so you > may not actually be able to give anyone else an email that reproduces > the problem. When "nothing happens" in an otherwise-working SpamBayes, > the usual cause is that the email is so badly formed (violates so many > standards about how email *should* be constructed) that the SpamBayes > email parser gives up trying to make any sense of it. If that's > what's happening, you will find helpful (to us) information in your > SpamBayes log file. -------------- next part -------------- An embedded message was scrubbed... From: "Geoff Campbell" Subject: RE: [Spambayes] Delete As Spam - Doesn't always work Date: Thu, 19 Feb 2004 15:07:54 -0700 Size: 28584 Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20040223/55292dfd/attachment-0001.mht From skip at pobox.com Mon Feb 23 12:27:27 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Feb 23 12:27:37 2004 Subject: [spambayes-dev] inboxer ad Message-ID: <16442.14335.97668.115205@montanaro.dyndns.org> Hey folks, I just noticed an Inboxer ad on SF. Maybe this is old hat (I was on vacation last week). Thought I'd pass it along though. Skip From kennypitt at hotmail.com Mon Feb 23 15:43:30 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Feb 23 15:47:40 2004 Subject: [spambayes-dev] RE: [Spambayes] Delete As Spam - Doesn't always work In-Reply-To: <021a01c3fa29$f7c853d0$2e00a8c0@Dell2400> Message-ID: [Geoff Campbell] > Another try to forward - I'm forwarding the "sent" item (which shows > the two attachments). > > Maybe it's more of an "outlook" problem! > > [Tim Peters] >> Does this log file ring bells with anyone? I haven't seen errors >> coming out of GetHTMLFromRTFProperty before. 
>> >> Geoff, you may have intended to attach the email message you couldn't >> delete as spam, but no such attachment arrived -- just the log file. I still don't see the original message, but I have a suspicion that you are correct in guessing that this is some sort of Outlook problem. The error number shown in the spambayes1.log maps to MAPI_E_CORRUPT_DATA, and it seems that it can only surface as a Python exception if it occurs in the low-level MAPI function call to WrapCompressedRTFStream (inside win32all). From what I can gather, the only way that this could happen is if something is corrupted in the message data of that particular message. I'm afraid I can't offer any ideas as to what might cause that, though. From the log, it appears that SpamBayes tries repeatedly to process the message but fails each time. We may need to add some additional error handling to prevent that. It might also be good to log a more descriptive message instead of just a traceback. -- Kenny Pitt From steve at chamber.org.hk Mon Feb 23 21:10:20 2004 From: steve at chamber.org.hk (Stephen Luk) Date: Mon Feb 23 21:04:08 2004 Subject: [spambayes-dev] Urgent, please help Message-ID: I have to say Spambayes is a great product, and I've been using it happily. There's one problem that I have found: SpamBayes filters my inbox emails before my Outlook (2003) client rules are applied. Is there a way to check new emails with my own Outlook rules before the SpamBayes filter applies? Thanks for your help steve -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040224/f36998bf/attachment.html From antoine.trux at nokia.com Tue Feb 24 03:42:48 2004 From: antoine.trux at nokia.com (antoine.trux@nokia.com) Date: Tue Feb 24 03:43:17 2004 Subject: [spambayes-dev] getting rid of the "new mail" icon Message-ID: <4EAA30E8E17684458E24408ADC13A4B501085235@esebe006.ntc.nokia.com> Tony, > You could look into InBoxer (there's a link on our 'related' > page). It's > the SpamBayes code (or close enough), so you'll get the same > results. I > believe they have done *something* to address this, although > I'm not sure > exactly what. In any case, since you'd be giving them money, > you'd be able > to ask them for features, if they don't do what you want already. Thank you very much for pointing me at InBoxer. I actually placed my order yesterday. Yes, they did something to address this: they have their own icon (this is actually the solution that Glenn Brown suggested in a message that he posted last Thursday). Antoine From tameyer at ihug.co.nz Wed Feb 25 01:33:02 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Feb 25 01:33:39 2004 Subject: [spambayes-dev] Mac OS X package In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1305256A64@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AD9@its-xchg4.massey.ac.nz> > New to the list, and pretty new to spambayes. I've put together a > package for easy installation of spambayes on Mac OS X. > Posted it on the wiki; http://www.entrian.com/sbwiki/MacOSXPackage Cool, thanks! > However, the utilities (such as which_db etc) still look for the libs > in /System/Library/Python.framework/... and don't find them > (obviously). They should look for them wherever the config file says they are. A question is where the config file should default to, though. On Windows, a bayescustomize.ini file is created in the "Application Data" directory, so everything defaults to being relative to that.
There has been some discussion (although it's not the case yet) of making Linux default to creating a .spambayesrc file in ~, or maybe ~/.spambayes. What's the "correct" place with OS X? (Note that this includes the bayescustomize.ini file (prefs), as well as the two databases, and the cache directories (user data), so it could be quite big). Does anyone know if there's a way to get the path to this, like the win32all function that provides the "Application Data" directory? > the utilities, but I'm still wondering why everything else checks the > current directory first and then elsewhere for it's modules etc. when > the testtools and utilities don't...? The way it works is that all the scripts look for a bayescustomize.ini file - they look in an environment variable (BAYESCUSTOMIZE), in ~ (for .spambayesrc) and in the current working directory. If that's found, then all default file locations are relative to that. If it's not found then (unless win32 as above) the current working directory is used as a default location, and everything is relative to that. I suspect that in your testing, you had a bayescustomize.ini file in your scripts directory, and not elsewhere, which would explain the differing behaviour. > And also, am I breaching the license by leaving them out? The license pretty much lets you do whatever you want with the code, although IANAL. =Tony Meyer --- Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy. 
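The lookup Tony describes can be sketched roughly as follows. This is an illustration only, not the project's option-loading code: the function names are hypothetical, the search order simply follows the order in which the message lists the locations (an assumption), and the Windows "Application Data" case is left out.

```python
import os

def find_config_file(environ=None, home="~", cwd=None):
    """Return the first existing config file, or None.

    Hypothetical helper: checks the BAYESCUSTOMIZE environment variable,
    then ~/.spambayesrc, then bayescustomize.ini in the working directory.
    The order is an assumption taken from the prose, not from the source.
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    candidates = []
    env_value = environ.get("BAYESCUSTOMIZE")
    if env_value:
        candidates.append(env_value)
    # expanduser() turns "~" into the user's home directory.
    candidates.append(os.path.join(os.path.expanduser(home), ".spambayesrc"))
    candidates.append(os.path.join(cwd, "bayescustomize.ini"))
    for path in candidates:
        if os.path.isfile(path):
            return os.path.abspath(path)
    return None

def default_data_dir(config_path, cwd=None):
    """Data files default to the config file's directory, else the cwd."""
    if config_path is not None:
        return os.path.dirname(config_path)
    return os.getcwd() if cwd is None else cwd

config = find_config_file()
print("config file:", config)
print("data files default to:", default_data_dir(config))
```

Run from a scripts directory that contains a bayescustomize.ini, this resolves everything relative to that directory, which would match the behaviour Sam saw in his testing.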
From kennypitt at hotmail.com Wed Feb 25 09:30:22 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Feb 25 09:31:24 2004 Subject: [spambayes-dev] Mac OS X package In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AD9@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> However, the utilities (such as which_db etc) still look for the libs >> in /System/Library/Python.framework/... and don't find them >> (obviously). > > They should look for them wherever the config file says they are. A > question is where the config file should default, to, though. On > Windows, a bayescustomize.ini file is created in the "Application > Data" directory, so everything defaults to being relative to that. > There has been some discussion (although it's not the case yet) of > making Linux default to creating a .spambayesrc file in ~, or maybe > ~/.spambayes. > > What's the "correct" place with OS X? OS X is a Unix-derivative, so I think "~" should refer to the user's home directory just like on Linux. Skip would know for sure. >> the utilities, but I'm still wondering why everything else checks the >> current directory first and then elsewhere for it's modules etc. when >> the testtools and utilities don't...? I read this issue a little differently than Tony, I think. It sounded to me like you were having trouble with Python finding the imported modules, so I'll respond to that since Tony has already done a great job describing the configuration and data file stuff. Every Python installation has a default location where it searches for all the standard library modules that come with the Python distribution. Sounds like for you it is "/System/Library/Python.framework/". One way to make sure that all SpamBayes apps can find the SB library modules is to install the "spambayes" folder and all of its contents under this default library path. This is what the setup.py script in the root of the SpamBayes distribution does if you run "python setup.py install". 
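A quick way to see which location Python would actually import a package from is to ask the import machinery directly. The sketch below uses modern Python 3 and the standard importlib.util.find_spec call; the helper name locate_package is made up for illustration, and whether "spambayes" is found depends entirely on the local installation.

```python
import importlib.util
import sys

def locate_package(name):
    """Return the file Python would import top-level `name` from, or None.

    Hypothetical helper for illustration; for a regular package the
    result is the path of its __init__.py.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin

# Where (if anywhere) the spambayes library modules would come from.
print("spambayes:", locate_package("spambayes"))

# The directories searched, in order.
for entry in sys.path:
    print("search path entry:", entry)
```

If the first line prints None, the package is not on the search path, which is the situation Sam's utilities ran into.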
There is also an environment variable, PYTHONPATH, that you can set to a list of additional directories to search for library modules. This is most often used during development so that apps can be tested from the source tree before they are installed into the default path. Another way that is used by some of the SpamBayes scripts is to directly manipulate the "sys.path" variable at the start of the Python script. Python doesn't default to looking for modules in the current directory, but some of the scripts force this by adding the current working directory to sys.path when they run. This method is somewhat fragile and relies on two things: you have to maintain the exact directory structure from the SpamBayes distribution, and you have to run the script with your working directory set to the directory containing the script file. You will also find that not all of the scripts handle this the same way, especially those in the utilities and testtools directories that are used mostly for development and testing. -- Kenny Pitt From skip at pobox.com Wed Feb 25 10:56:44 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Feb 25 10:56:56 2004 Subject: [spambayes-dev] Mac OS X package In-Reply-To: References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AD9@its-xchg4.massey.ac.nz> Message-ID: <16444.50620.102628.113027@montanaro.dyndns.org> >> There has been some discussion (although it's not the case yet) of >> making Linux default to creating a .spambayesrc file in ~, or maybe >> ~/.spambayes. >> >> What's the "correct" place with OS X? Kenny> OS X is a Unix-derivative, so I think "~" should refer to the Kenny> user's home directory just like on Linux. Skip would know for Kenny> sure. Yes, as long as you call os.path.expanduser() on the string. 
Skip From juntunen at well.com Wed Feb 25 20:44:23 2004 From: juntunen at well.com (Thomas Juntunen) Date: Wed Feb 25 20:44:08 2004 Subject: [spambayes-dev] Re: Mac OS X package In-Reply-To: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 02/22/04, Sam Thorne imposed order on a stream of electrons to say: >New to the list, and pretty new to spambayes. I've put together a >package for easy installation of spambayes on Mac OS X. Just for completeness, I thought I'd mention the MacPython PackageManager contains a spambayes package (version 1.0a7, IIRC) that will install automatically and try to satisfy any dependencies. The maintainers may be interested in what you're doing. MacPython: http://homepages.cwi.nl/~jack/macpython/ To get the spambayes package, you need to use Bob Ippolito's extended package database: http://undefined.org/python/pimp/ HTH, Thomas Juntunen -----BEGIN PGP SIGNATURE----- Version: PGP SDK 3.0 iQA/AwUBQD1BZtFoei/9T3YdEQJSdQCePfQZgcwQ89KKHx+Y0XJffEiBrtoAnR1Y dQHK2mEmJNDPdNt7gpIRfJfB =pEhI -----END PGP SIGNATURE----- From ta-meyer at ihug.co.nz Wed Feb 25 23:10:05 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Feb 25 23:10:44 2004 Subject: [spambayes-dev] "Bayesian Dobly" Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046779FC@its-xchg4.massey.ac.nz> Has anyone else read the "Bayesian Dobly: Noise Reduction for Statistical Analysis" paper by Jonathan Zdziarski of nuclearelephant.com's DSPAM? (it was /.'d recently). The idea, basically, is to remove some tokens from the stream before classifying if they look like "noise" -i.e. junk words, or word salad. This is done by comparing the strength (distance from 0.5, for us) of words compared to their neighbours. Apparently it's quite successful for them. It doesn't appear that it would be all that difficult to implement a version of this for SpamBayes, so I thought I'd give it a go, unless someone here is going to tell me that it's not a good idea (or that they've already done it). 
(Obviously, if I do, I'll post the patch & results from testing here). =Tony Meyer From tim.one at comcast.net Wed Feb 25 23:53:58 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Feb 25 23:53:56 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046779FC@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > Has anyone else read the "Bayesian Dobly: Noise Reduction for > Statistical Analysis" paper by Jonathan Zdziarski of > nuclearelephant.com's DSPAM? (it was /.'d recently). Yup, but I thought the exposition was too confused to be worth the effort of figuring out what it was really saying. > The idea, basically, is to remove some tokens from the stream before > classifying if they look like "noise" -i.e. junk words, or word > salad. This is done by comparing the strength (distance from 0.5, > for us) of words compared to their neighbours. Apparently it's quite > successful for them. > > It doesn't appear that it would be all that difficult to implement a > version of this for SpamBayes, so I thought I'd give it a go, unless > someone here is going to tell me that it's not a good idea (or that > they've already done it). Go for it -- we certainly need a dozen new arbitrary parameters to fiddle with . I haven't found that "word salad" attacks have any luck against my personal classifier, so I'm not sure it *could* do me any good. What's killing me now is virus bounces: since I decided to save and classify every email I get as "ham" or "spam", I've found that I just can't decide about lots of those, and call them ham one day but spam the next. As a result, new ones tend to score near 0.5. I used to leave them Unsure, and delete them unclassified. I was happier then. But if I'm still getting word salad spam, it's scoring so near 1.0 I never look into it ... yup! 
The most recent spam I got has deprivation quorum stratum ruth brett thunderstorm hungarian sidelong lamentation puritanic convolution dolphin actuarial ferromagnet lump dockside angora filmmake mat gallstone snyder spacesuit tale bujumbura operand effectual mckinney chrysler airtight coulter compulsive vaudois platypus dinghy diminutive scala at the end. Most of those were ignored because they'd never been seen before, while actuarial, thunderstorm, bujumbura and platypus had only been seen in spam before. So to the extent that word salad was considered at all here, it helped nail the thing as spam. (Not that it *needed* that either, though -- the sales pitch was plenty spammy on its own.) From tim at fourstonesExpressions.com Thu Feb 26 08:58:40 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Feb 26 08:58:46 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: References: Message-ID: On Wed, 25 Feb 2004 23:53:58 -0500, Tim Peters wrote: >> It doesn't appear that it would be all that difficult to implement a >> version of this for SpamBayes, so I thought I'd give it a go, unless >> someone here is going to tell me that it's not a good idea (or that >> they've already done it). > > Go for it -- we certainly need a dozen new arbitrary parameters to fiddle > with . I haven't found that "word salad" attacks have any luck > against my personal classifier, so I'm not sure it *could* do me any > good. I have about 180:180 ham/spam trained in my classifier right now, and word salad type spams are pegged at almost 1 in all cases. I think the strategy of not training salad is what I've done, though not by explicit design. I can't see that a "dolby" (I presume that's what you meant?) would affect my results in the least.
It's an interesting idea, though, and I can see how it might affect other applications of the technology, such as webpage bayesian filtering, where the amount of noise is likely to be much higher, and hapaxes would tend to drive things toward unsure much more strongly. It seems like an interesting experiment to perform... +1 from me for experimenting (it's your time after all... ) What I've been thinking about lately is making a simple pop3 client (basically ripping the proxy-ness out of the pop3proxy), to look at a pop3 account and perform any number of classifications on it. The problem I'm trying to solve is to find a way to respond to the myriad requests we get on the spambayes list for answers to the same question. I could create classifiers to recognize characteristic questions (e.g. I upgraded outlook and now spambayes doesn't work) and send an automatic response that includes something like "outlook is stupid and disables some plugins during upgrade. reenable it (url here) and see if that helps. If not, then go to our problem reporting page (url here) and submit a problem report." In this problem space, the noise is likely to be much higher, and the dolby thing might improve the quality of the classifications. But at any rate, if I can do this, it might prove quite liberating, as it seems that most of the questions on the public list fall into a few well defined categories. -- Exprimez vous!; Exprésese; Esprimi te stesso; Express yourself! Tim Stone See my photography at www.fourstonesExpressions.com From kennypitt at hotmail.com Thu Feb 26 09:21:49 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Feb 26 09:22:52 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: Message-ID: Tim Peters wrote: > Most of those were ignored because they'd never been seen > before, while actuarial, thunderstorm, bujumbura and platypus had only > been seen in spam before.
So to the extent that word salad was > considered at all here, it helped nail the thing as spam. (Not that > it *needed* that either, though -- the sales pitch was plenty spammy > on its own.) I've had very similar results. My suspicion is that if (and it's a big *if*) word salad is going to have any negative effect at all, it would be a possible long-term reduction in classifier accuracy. That's something that's going to be very difficult to test. I doubt that anyone who rebuilds their training database from scratch on a semi-regular basis will ever see any effect from it, though, unless the spammers do a much better job of selecting the words they use. -- Kenny Pitt From sam at s-j-t.co.uk Thu Feb 26 13:21:38 2004 From: sam at s-j-t.co.uk (Sam Thorne) Date: Thu Feb 26 13:21:43 2004 Subject: [spambayes-dev] Mac OS X package In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AD9@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AD9@its-xchg4.massey.ac.nz> Message-ID: <9E001868-6888-11D8-9DC8-003065DA26BA@s-j-t.co.uk> Ok, after a bit talking to the maintainer of the package list Thomas Juntunen pointed me to, and some exploration of my own I've discovered a few things. One thing to clarify, when I say package, I mean an installer pkg for Mac OS X Installer; not a python package/module (I hadn't realised there were python packages and we might be talking at cross-purposes). > Just for completeness, I thought I'd mention the MacPython > PackageManager contains a spambayes package (version 1.0a7, IIRC) that > will install automatically and try to satisfy any dependencies. The > maintainers may be interested in what you're doing. 
> > MacPython: http://homepages.cwi.nl/~jack/macpython/ > > To get the spambayes package, you need to use Bob Ippolito's extended > package database: http://undefined.org/python/pimp/ > > > HTH, > Thomas Juntunen Briefly, Mac OS X has a layout where /System/Library is for system level installs, frameworks etc., /Library is for global user level installs, and finally each user has ~/Library for personal installs. First off, the layout of Python on Mac OS X (10.3) seems insane to me. The framework for python is installed in /System/Library, which is fine for system level things. However, instead of modifying Python's sys.path to include a user level install point in /Library (which would be the standard place), or just installing in /Library in the first place, they've symlinked site-packages in the lib directory in the /System/Library install point to /Library/Python/2.3/, so sys.path only contains references to /System/Library. I don't understand why this has been done when a .pth file could just have been used to map to a user level install dir in /Library for the search path; it just seems to make a confusing file structure. > I read this issue a little differently than Tony, I think. It sounded > to me like you were having trouble with Python finding the imported > modules, so I'll respond to that since Tony has already done a great > job > describing the configuration and data file stuff. > > Every Python installation has a default location where it searches for > all the standard library modules that come with the Python > distribution. > Sounds like for you it is "/System/Library/Python.framework/". One way > to make sure that all SpamBayes apps can find the SB library modules is > to install the "spambayes" folder and all of its contents under this > default library path. This is what the setup.py script in the root of > the SpamBayes distribution does if you run "python setup.py install".
Anyway, I think what I'll do is keep the spambayes install where it is (/Library/SpamBayes) and put a .pth file into the system python install so the modules are accessible by everything else (and then the utilities should work). > They should look for them wherever the config file says they are. A > question is where the config file should default, to, though. On > Windows, a > bayescustomize.ini file is created in the "Application Data" > directory, so > everything defaults to being relative to that. There has been some > discussion (although it's not the case yet) of making Linux default to > creating a .spambayesrc file in ~, or maybe ~/.spambayes. > > What's the "correct" place with OS X? (Note that this includes the > bayescustomize.ini file (prefs), as well as the two databases, and the > cache > directories (user data), so it could be quite big). Does anyone know > if > there's a way to get the path to this, like the win32all function that > provides the "Application Data" directory? As for the config file, at the moment it defaults to /Library/SpamBayes, which is the install point for all user config files as well. This was a work-around for the moment as I'm not sure how to get the data files for each user to be recognised by spambayes depending on who is logged in. spambayes is running as daemon as root, so how would the different user config files be loaded? As for the actual install location, you could use ~/.spambayes or ~/.spambayesrc. But it would probably be considered 'nicer' to put them somewhere easily accessible by non-commandline users though, so ~/Library/SpamBayes would probably be a good place. (p.s. Mac OS X is _not_ case sensitive like other unices, so don't mind my odd naming too much :0) -- Sam So long, and thanks for all the fish. 
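The .pth approach Sam describes above can be sketched roughly as follows. This is a minimal illustration, not SpamBayes code: the helper name write_pth and the file name spambayes.pth are made up for the example, and temporary directories stand in for the real site-packages and /Library/SpamBayes locations. It relies only on the standard site module, which appends each existing directory listed in a .pth file to sys.path.

```python
import os
import site
import sys
import tempfile


def write_pth(site_dir, target_dir, name="spambayes.pth"):
    """Write a .pth file in site_dir that points at target_dir.

    Hypothetical helper for illustration; 'spambayes.pth' is an
    assumed file name, not something SpamBayes ships.
    """
    pth_path = os.path.join(site_dir, name)
    with open(pth_path, "w") as handle:
        # Each non-comment line of a .pth file names one directory.
        handle.write(target_dir + "\n")
    return pth_path


# Stand-ins for the real locations (for example the system site-packages
# directory and /Library/SpamBayes); both are temporary directories here.
fake_site_packages = os.path.realpath(tempfile.mkdtemp())
fake_install_dir = os.path.realpath(tempfile.mkdtemp())

pth_file = write_pth(fake_site_packages, fake_install_dir)

# site.addsitedir() processes the .pth files found in a directory the same
# way Python does for site-packages at interpreter startup.
site.addsitedir(fake_site_packages)

on_path = fake_install_dir in sys.path
print("install dir on sys.path:", on_path)
```

On a real system the .pth file would go into the interpreter's own site-packages directory, and the directory it names must already exist, since the site module skips entries that do not.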
From sethg at GoodmanAssociates.com Thu Feb 26 16:22:13 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu Feb 26 16:24:28 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: Message-ID: > [Kenny Pitt] > > I've had very similar results. My suspicion is that if (and > it's a big > *if*) word salad is going to have any negative effect at all, it would > be a possible long-term reduction in classifier accuracy. That's > something that's going to be very difficult to test. I doubt that > anyone who rebuilds their training database from scratch on a > semi-regular basis will ever see any effect from it, though, > unless the > spammers do a much better job of selecting the words they use. As another data point, my database has no problem identifying the salad messages as spam. Anecdotally, I *think* they don't score as high (no data whatsoever, just a vague impression), but that could easily be wrong. Spammers could do a better job of selecting salad words, perhaps, as most of them turn out to be hapaxes for my database. This is a tough nut for them to crack because everyone's hammy vocabulary is different. I think that will protect us in the end. However, if anyone could distill a subset of hammy words that were hammy to at least a large number of people, then we'd have some trouble. Doing that is not a small project and might not even be possible, but as their delivery rates decline, they may try it. As far as the "Dolby" approach goes (I think Dolby Labs would cringe at this, but Hormel got nabbed so why not Dolby?), it's interesting as a contrast to bigrams. With bigrams, we look for the "strongest" word pairs, considering different tilings of the word stream. It doesn't care about the individual strengths of adjacent words, except indirectly when deciding on the best tiling. The "Dolby" approach looks for word pairs that have the most opposite classifications and doesn't care about the strength of the word pair as a token. 
This puts a high value on word order, which is information not considered in bigrams. My intuition says it wouldn't help, but like they say, that and a nickel will get you on the subway (obviously a long time ago in a universe far, far away). -- Seth Goodman From jepler at unpythonic.net Thu Feb 26 17:34:53 2004 From: jepler at unpythonic.net (Jeff Epler) Date: Thu Feb 26 17:35:15 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: References: Message-ID: <20040226223453.GB25096@unpythonic.net> On Thu, Feb 26, 2004 at 03:22:13PM -0600, Seth Goodman wrote: > I think Dolby Labs would cringe at > this, but Hormel got nabbed so why not Dolby? The trick is called "Dobly", not "Dolby". I don't think "Dobly" is a trademarked word. Jeff From tameyer at ihug.co.nz Thu Feb 26 18:38:23 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Feb 26 18:39:39 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130536291F@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ADD@its-xchg4.massey.ac.nz> > On Thu, Feb 26, 2004 at 03:22:13PM -0600, Seth Goodman wrote: > > I think Dolby Labs would cringe at > > this, but Hormel got nabbed so why not Dolby? > > The trick is called "Dobly", not "Dolby". I don't think > "Dobly" is a trademarked word. Is that why it's named that? Like most other people here (it seems) I thought that it was misspelt, but it was consistent through the paper, so I used that (although when reading it in my head I 'said' "Dolby"). Maybe it's a new Harry Potter house elf . It doesn't seem to make much sense to use "Dobly" if "Dolby" is what they mean. I don't see (but IANAL) anything wrong with writing a paper that explains how "Dolby"-style techniques were used in email classification, assuming that it had (TM) and "Dolby is a registered trademark of ..." in all the right places. As long as they weren't going to use it to try and sell stuff. 
=Tony Meyer From sethg at GoodmanAssociates.com Thu Feb 26 19:22:22 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu Feb 26 19:22:24 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2ADD@its-xchg4.massey.ac.nz> Message-ID: Yup, I assumed it was a typo, too. I wonder if Dolby(TM) has become ubiquitous enough in common parlance to lose their trademark status, like Kleenex and Xerox? Monty Python's famous skit probably did a lot to jeopardize Hormel's trademark on Spam(TM), but they seem to have held on to it. But "Dobly"? Ick. -- Seth Goodman From jm at jmason.org Thu Feb 26 19:39:31 2004 From: jm at jmason.org (Justin Mason) Date: Thu Feb 26 19:39:40 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: Message-ID: <20040227003932.DCEFE590026@radish.jmason.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Seth Goodman writes: > Yup, I assumed it was a typo, too. I wonder if Dolby(TM) has become > ubiquitous enough in common parlance to lose their trademark status, > like Kleenex and Xerox? Monty Python's famous skit probably did a lot > to jeopardize Hormel's trademark on Spam(TM), but they seem to have held > on to it. But "Dobly"? Ick. Guys -- you both need to re-watch _Spinal Tap_ ;) - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFAPpHDQTcbUG5Y7woRAnXnAJ9+o32T+BRWPsqS+7l2Zka+YZNIVwCdFWgH b5K5Ppl1eHMibbWBu50ZVpA= =IEHs -----END PGP SIGNATURE----- From moraes at sbcglobal.net Thu Feb 26 19:43:52 2004 From: moraes at sbcglobal.net (Mark Moraes) Date: Thu Feb 26 19:42:36 2004 Subject: [spambayes-dev] patch to improve statistics from spambayes Message-ID: <016e01c3fcca$cc1a87f0$0d01a8c0@ashoka> Hi. While I'm generally very happy with Spambayes, I was a bit confused by the statistics, which didn't seem to add up. I'm using Spambayes 1.0a9, the web page says SpamBayes POP3 Proxy Version 0.4 (February 2004) on Windows 2000 SP4. 
I'm using POP3 interface (tried a couple of different mail agents, including a command-line POP3 fetch, OE and Mozilla mail -- I see similar results, these are with the command line fetch). I have 'Lookup message in cache' set to yes, Notate to: unsure, Classify subject spam. I suppress caching of bulk ham. After a POP3 fetch, the Statistics page says: SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%) spam and 63 (5%) unsure. 324 messages were manually classified as good (0 were false positives). 379 messages were manually classified as spam (33 were false negatives). 6 unsure messages were manually identified as good, and 52 as spam. ** 1. 6 unsure good + 52 unsure spam adds up to 58. But the processed line says 63? It's not clear how many messages were manually reviewed/trained. ** 2. It's not clear that manually classified as good helps figure out what was accurately classified as good, because that includes ham, spam and unsures that were so classified. Ditto for spam. It's not clear how the 324 manually classified as good relate to the 754 good, and the 379 manually classified as spam relate to the 333 spam? And as a result, it's hard to estimate accuracy. ** 3. After using the Review web page to train and mark all 4 unsure as spam, 2 ham as spam and leaving all spam as-is (yay!), I see: SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%) spam and 63 (5%) unsure. 333 messages were manually classified as good (0 were false positives). 414 messages were manually classified as spam (35 were false negatives). 6 unsure messages were manually identified as good, and 56 as spam. The false positive count is clearly a bug, since I just classified 2 ham as spam, and I know I've done that often. But I've never had to classify spam as ham. Looks like fp & fn are inverted. 
The enclosed patch fixes that inversion, adds a few counters to tell which ham was manually identified as spam and vice versa, as well as total ham/spam/manually reviewed, so one can calculate percentages. (The calculation is conservative; false positives/manually-reviewed ham, or false negatives/manually-reviewed spam, so that unreviewed messages don't skew the percentages.) Also trimmed the statements somewhat to avoid over-long lines. (removed some verbs :-) Before the enclosed patch, Stats.py produces: SpamBayes has processed 1223 messages - 827 (68%) good, 333 (27%) spam and 63 (5%) unsure. 346 messages were manually classified as good (0 were false positives). 414 messages were manually classified as spam (35 were false negatives). 6 unsure messages were manually identified as good, and 56 as spam. With the patch, Stats.py produces: Classified 1223 messages - 827 (68%) ham, 333 (27%) spam and 63 (5%) unsure. Manually trained 760 messages: 340 of 375 ham messages manually confirmed (35 false positives 4.2%). 323 of 323 spam messages manually confirmed (0 false negatives 0.0%). Of 62 unsure messages, 6 (9.7%) manually identified as ham, 56 (90.3%) as spam. I find this much more useful -- hope you agree. Regards, Mark. -------------- next part -------------- A non-text attachment was scrubbed... Name: DIFF-spambayes10a9-stats Type: application/octet-stream Size: 6661 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040226/e059599b/DIFF-spambayes10a9-stats-0001.obj From sethg at GoodmanAssociates.com Thu Feb 26 20:30:32 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu Feb 26 20:31:06 2004 Subject: [spambayes-dev] "Bayesian Dobly" In-Reply-To: <20040227003932.DCEFE590026@radish.jmason.org> Message-ID: > Guys -- > > you both need to re-watch _Spinal Tap_ ;) > > - --j. I better admit right up front that I've never seen it. Huge hole in my cultural experience that needs to be filled, and soon! What can I say?
-- Seth Goodman From mhammond at keypoint.com.au Thu Feb 26 21:36:31 2004 From: mhammond at keypoint.com.au (Mark Hammond) Date: Thu Feb 26 21:36:51 2004 Subject: [spambayes-dev] word salad, bad bounces In-Reply-To: Message-ID: <002101c3fcda$82e28820$0200a8c0@eden> [Tim] > Go for it -- we certainly need a dozen new arbitrary > parameters to fiddle > with . I haven't found that "word salad" attacks have any luck > against my personal classifier, so I'm not sure it *could* do > me any good. Ditto for me. I think it has been mentioned that "word salad" attacks are only noticed when they do happen to have hammy words for a particular person, which makes the person believe they are 'effective', just because they never saw the 500 spams that also tried it. Assuming bayes would see the message as spammy without word salad (as Tim mentioned was normally the case), doesn't this mean that word salad can only help the message get scored in a non-spam category, even if only for a fraction of a percent of the training data out there? This means it is still a win for the spammer. > What's killing me now is virus bounces: since I decided to save and > classify every email I get as "ham" or "spam", I've found > that I just can't > decide about lots of those, and call them ham one day but > spam the next. As > a result, new ones tend to score near 0.5. I used to leave > them Unsure, and > delete them unclassified. I was happier then. Ahhh - but this is really what I had to reply to. This remains my email problem. Mine is probably a little bit worse, as anything@skippinet.com.au comes to me. Training on all of these seemed to actually hurt my false-positive rate - the words used in the various bounce messages are generally slightly hammy, meaning repeated training was necessary to 'bump up' the spam score of the common words, at the expense of others. At least that was my impression when looking at the clues. 
So unfortunately, my general solution now is a hacked up Python script that tries to detect these bounces, and nuke them. Unfortunately, it is fairly braindead, and is much more aggressive when the 'to' address is not my email address - something I expect you don't need to deal with. This was one reason I would love a generic 'mail filter' mechanism written in Python. If we tracked addresses and subject lines that the user sent, we could probably do a reasonable job of telling a "good-bounce" from a "bad-bounce", but this is starting to get off-topic :) Mark [Just back from a refreshing 6 day tour 3000km up to Newcastle and back on the bike] From mhammond at keypoint.com.au Thu Feb 26 22:13:38 2004 From: mhammond at keypoint.com.au (Mark Hammond) Date: Thu Feb 26 22:13:58 2004 Subject: [spambayes-dev] RE: [Spambayes] Delete As Spam - Doesn't alwayswork In-Reply-To: Message-ID: <002201c3fcdf$b2561b80$0200a8c0@eden> > I still don't see the original message, but I have a > suspicion that you > are correct in guessing that this is some sort of Outlook problem. > > The error number shown in the spambayes1.log maps to > MAPI_E_CORRUPT_DATA, and it appears that it can only appear > as a Python > exception if it occurs in the low-level MAPI function call to > WrapCompressedRTFStream (inside win32all). From what I can > gather, the > only way that this could happen is if something is corrupted in the > message data of that particular message. I'm afraid I can't offer any > ideas as to what might cause that, though. I think all of that is 100% correct. > From the log, it appears that SpamBayes tries repeatedly to > process the > message but fails each time. I'm guessing that these entries are the result of a few different occurrences of the same failure - i.e.: 1) Mail comes in. SpamBayes tries to score it, but fails.
Ends up remaining in inbox, with log entry written (this log appears to have the timer enabled, as that appears in the first traceback). 2) As mail is left in the inbox, user tries 'delete as spam'. This fails, in the same way (1) failed, so this doesn't work either, and writes a traceback. 3) Mail is *still* in the inbox - god damn stupid computers - user goes back to (2) a few times, gives up in disgust :) > We may need to add some additional error > handling to prevent that. It might also be good to log a more > descriptive message instead of just a traceback. Yes, but we have to be careful how we do it. The error at (1) can never display a modal error dialog (what if the mail came in overnight?). The error at (2) should not display a dialog box for *every* error it sees - the user may have 100 messages selected. A single error dialog may be appropriate. At the end of the day though, this error *is* special, in that it is simply a failure to get the HTML. We should handle this better. However, it is a good reminder that random MAPI failures could occur at any place, so the better error handling above is still needed, especially as we get more users. Below is what I think the fix for this specific error is. Should I also nuke the win32all warning, and just let the AttributeError percolate up? Binary users will never see it. Mark.
Index: msgstore.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v
retrieving revision 1.84
diff -u -r1.84 msgstore.py
--- msgstore.py 27 Feb 2004 02:57:40 -0000 1.84
+++ msgstore.py 27 Feb 2004 03:03:11 -0000
@@ -447,22 +447,22 @@
         try:
             rtf_stream = mapi_object.OpenProperty(prop_tag,
                                                   pythoncom.IID_IStream, 0, 0)
+            try:
+                html_stream = mapi.WrapCompressedRTFStream(rtf_stream, 0)
+            except AttributeError:
+                if not _have_complained_about_missing_rtf:
+                    print "*" * 50
+                    print "Sorry, but you need to update to a new win32all (158 or "
+                    print "later), so we correctly get the HTML from messages."
+                    print "See http://starship.python.net/crew/mhammond/win32"
+                    print "*" * 50
+                    _have_complained_about_missing_rtf = True
+                return ""
+            html = mapi.RTFStreamToHTML(html_stream)
         except pythoncom.com_error, details:
             if not IsNotFoundCOMException(details):
                 print "ERROR getting RTF body", details
             return ""
-        try:
-            html_stream = mapi.WrapCompressedRTFStream(rtf_stream, 0)
-        except AttributeError:
-            if not _have_complained_about_missing_rtf:
-                print "*" * 50
-                print "Sorry, but you need to update to a new win32all (158 or "
-                print "later), so we correctly get the HTML from messages."
-                print "See http://starship.python.net/crew/mhammond/win32"
-                print "*" * 50
-                _have_complained_about_missing_rtf = True
-            return ""
-        html = mapi.RTFStreamToHTML(html_stream)
         # html may be None if not RTF originally from HTML, but here we
         # always want a string
         return html or ''
From mhammond at keypoint.com.au Fri Feb 27 00:20:11 2004 From: mhammond at keypoint.com.au (Mark Hammond) Date: Fri Feb 27 00:20:36 2004 Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer? In-Reply-To: Message-ID: <002401c3fcf1$62129a10$0200a8c0@eden> Nice analysis Kenny. Just to clarify: > The hackery is a result of an apparent problem with the Inno installer > during uninstall.
When Inno tried to unregister the COM DLL using the > usual LoadLibrary/DllUnregisterServer method, it apparently didn't > release the DLL properly and then failed to delete some of the files. To be fair to Inno, it is more Python's fault. The problem is more that doing a LoadLibrary(), executing Python code, then doing a FreeLibrary() doesn't release every DLL. The inno guy mailed me back suggesting later versions may spawn a separate process for this unregistration to get around this and similar issues. Given that hackery, I think it fair to say the error is in the hackery, rather than in py2exe, and that py2exe is fine the way it is. Mark. From mhammond at keypoint.com.au Fri Feb 27 00:21:39 2004 From: mhammond at keypoint.com.au (Mark Hammond) Date: Fri Feb 27 00:21:59 2004 Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2AC3@its-xchg4.massey.ac.nz> Message-ID: <002501c3fcf1$9491b200$0200a8c0@eden> [Tony] > OTOH, there have been three recent reports of this problem: > """ > Traceback (most recent call last): > File "addin.pyc", line 1191, in OnConnection > File "manager.pyc", line 908, in GetManager > File "manager.pyc", line 344, in __init__ > File "manager.pyc", line 492, in LocateDataDirectory > File "win32com\shell\shell.pyc", line 9, in ? > File "win32com\shell\shell.pyc", line 7, in __load > ImportError: DLL load failed: A device attached to the system is not > functioning. > """ That generally means their shell32.dll windows file is out of date - I would like to know the specific symbol in question though - maybe you can ask a reporter to zip up their shell32.dll? Mark. 
From kennypitt at hotmail.com Fri Feb 27 10:47:06 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 27 10:48:12 2004 Subject: [spambayes-dev] patch to improve statistics from spambayes In-Reply-To: <016e01c3fcca$cc1a87f0$0d01a8c0@ashoka> Message-ID: Mark Moraes wrote: > While I'm generally very happy with Spambayes, I was a bit > confused by the statistics, which didn't seem to add up. I think there are some good ideas here, but looks like some misunderstandings as well. I'll see if I can clear those up a little. I've often wondered if we couldn't produce some more useful statistics, so maybe this a good start to a discussion. > 6 unsure good + 52 unsure spam adds up to 58. But the processed > line says 63? It's not clear how many messages were manually > reviewed/trained. This should indicate that there were 5 unsures that were not trained. I considered adding "and 5 were untrained" to the stats line. > After using the Review web page to train and mark all 4 unsure as > spam, 2 ham as spam and leaving all spam as-is (yay!), I see: > > SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%) > spam and 63 (5%) unsure. 333 messages were manually classified as > good (0 were false positives). 414 messages were manually classified > as spam (35 were false negatives). 6 unsure messages were manually > identified as good, and 56 as spam. > > The false positive count is clearly a bug, since I just classified > 2 ham as spam, and I know I've done that often. But I've never > had to classify spam as ham. Looks like fp & fn are inverted. A "positive" means that the message was classified as spam, and a "negative" means that it was classified as ham. A "false positive", then, is a message that was classified as spam when it should have been ham and a "false negative" is a message that was classified as ham when it should have been spam. Unsures are not counted. 
If you've never had to reclassify something from spam to ham then you've never had a false positive, and the 2 messages that you had to reclassify as spam were false negatives because they weren't detected. It looks to me like the original statistics are correct here. > The enclosed patch fixes that inversion, adds a few counters > to tell which ham was manually identifed as spam and vice > versa, as well as total ham/spam/manually reviewed, so > one can calculate percentages. Not sure why more counters are necessary. We already count the number of false negatives (fn) which are hams that were trained as spam, the number of unsures that were trained as spam (trn_unsure_spam), and the total number trained as spam (trn_spam). The number of messages that were correctly classified as spam and were also trained on is then (trn_spam - trn_unsure_spam - fn). The same can be done to calculate the ham side. > ... (The calculation is conservative; > false positives/manually-reviewed ham, or false > negatives/manually-reviewed spam, > so that unreviewed messages don't skew the percentages) Taking percentages only out of trained messages tells you something about your training regimen, but nothing about the accuracy of the filter. Filter accuracy is the percent of messages that were correctly classified the first time compared to all messages received. The correct calculation for accuracy should be: total_correct = (cls_spam - fp) + (cls_ham - fn) acc = 100.0 * (total_correct / total) Knowing the percent incorrectly classified is useful as well. Unsures play into accuracy in an unusual way because some people consider them "mistakes" and some don't. Showing the % correct, the % incorrect, and the % unsure accounts for that. > With the patch, Stats.py produces: > Classified 1223 messages - 827 (68%) ham, 333 (27%) spam and 63 (5%) > unsure. > Manually trained 760 messages: > 340 of 375 ham messages manually confirmed (35 false positives 4.2%). 
> 323 of 323 spam messages manually confirmed (0 false negatives 0.0%). > Of 62 unsure messages, 6 (9.7%) manually identified as ham, 56 > (90.3%) as spam. > > I find this much more useful -- hope you agree. I think it's a good start (with the exception of reversing the definitions of false positives and false negatives ). Here's what I've come up with for comparison (I've been playing with something similar in the Outlook stats): """ SpamBayes has classified a total of 1223 messages: 827 ham (67.6% of total) 333 spam (27.2% of total) 63 unsure (5.2% of total) 1125 messages were classified correctly (92.0% of total) 35 messages were classified incorrectly (2.9% of total) 0 false positives (0.0% of total) 35 false negatives (2.9% of total) 6 unsures trained as ham (9.5% of unsures) 56 unsures trained as spam (88.9% of unsures) 1 unsure was not trained (1.6% of unsures) A total of 760 messages have been trained: 346 ham (98.3% ham, 1.7% unsure, 0.0% false positives) 414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives) """ -- Kenny Pitt From kennypitt at hotmail.com Fri Feb 27 10:57:06 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 27 10:58:06 2004 Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer? In-Reply-To: <002401c3fcf1$62129a10$0200a8c0@eden> Message-ID: Mark Hammond wrote: >> The hackery is a result of an apparent problem with the Inno >> installer during uninstall. When Inno tried to unregister the COM >> DLL using the usual LoadLibrary/DllUnregisterServer method, it >> apparently didn't release the DLL properly and then failed to delete >> some of the files. > > To be fair to Inno, it is more Python's fault. The problem is more > that doing a LoadLibrary(), executing Python code, then doing a > FreeLibrary() doesn't release every DLL. Yeah, I discovered that later when I tried this out using a simple NSIS install script (for those who don't know, it's another open-source alternative to Inno) and got the same result. 
-- Kenny Pitt From kennypitt at hotmail.com Fri Feb 27 11:02:25 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 27 11:03:26 2004 Subject: [spambayes-dev] RE: [Spambayes] Delete As Spam - Doesn't alwayswork In-Reply-To: <002201c3fcdf$b2561b80$0200a8c0@eden> Message-ID: Mark Hammond wrote: > Below is what I think the fix for this specific error is. Should I > also nuke the win32all warning, and just let the AttributeError > perculate up? Binary users will never see it. > > [snip diff] Looks good, +1 here. Since binary users won't see it, I don't know that the win32all warning is hurting anything. On the other hand, I think the updated builds have been out long enough that if you feel like removing it then go for it. -- Kenny Pitt From theller at python.net Fri Feb 27 11:50:14 2004 From: theller at python.net (Thomas Heller) Date: Fri Feb 27 12:01:31 2004 Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer? References: <002401c3fcf1$62129a10$0200a8c0@eden> Message-ID: <3c8wlb5l.fsf@python.net> "Kenny Pitt" writes: > Mark Hammond wrote: >>> The hackery is a result of an apparent problem with the Inno >>> installer during uninstall. When Inno tried to unregister the COM >>> DLL using the usual LoadLibrary/DllUnregisterServer method, it >>> apparently didn't release the DLL properly and then failed to delete >>> some of the files. >> >> To be fair to Inno, it is more Python's fault. The problem is more >> that doing a LoadLibrary(), executing Python code, then doing a >> FreeLibrary() doesn't release every DLL. > > Yeah, I discovered that later when I tried this out using a simple NSIS > install script (for those who don't know, it's another open-source > alternative to Inno) and got the same result. So, neither innosetup nor nsis can remove in-use files (after a reboot)? Isn't that a problem? 
Thomas From kennypitt at hotmail.com Fri Feb 27 13:15:47 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Feb 27 13:16:48 2004 Subject: [spambayes-dev] Re: Problem with 1.0a9 Windows installer? In-Reply-To: <3c8wlb5l.fsf@python.net> Message-ID: Thomas Heller wrote: >>>> The hackery is a result of an apparent problem with the Inno >>>> installer during uninstall. When Inno tried to unregister the COM >>>> DLL using the usual LoadLibrary/DllUnregisterServer method, it >>>> apparently didn't release the DLL properly and then failed to >>>> delete some of the files. >>> >>> To be fair to Inno, it is more Python's fault. The problem is more >>> that doing a LoadLibrary(), executing Python code, then doing a >>> FreeLibrary() doesn't release every DLL. >> >> Yeah, I discovered that later when I tried this out using a simple >> NSIS install script (for those who don't know, it's another >> open-source alternative to Inno) and got the same result. > > So, neither innosetup nor nsis can remove in-use files (after a > reboot)? Isn't that a problem? I don't think that's the issue. I'm not sure if Inno supports it or not, but in NSIS I can set my deletes up to require reboot for in-use files if I explicitly tell it that's what I want. The problem is that the DLL isn't actually "in use" because no other apps have it loaded, so there wouldn't normally be a reason to need this. A C++ COM DLL in the same circumstance would not require a reboot. We're the ones that caused the DLL to be in use by trying to unregister it, so forcing the user to do a reboot just for that seems a bit harsh. The "register-with-an-EXE" hack gets the job done with less hassle to the user, we just needed to tweak the hack to make it work right. Of course, I wouldn't be opposed if you have some neat idea up your sleeve to change py2exe so that everything just magically gets released after the FreeLibrary call . 
No-pressure-we-love-py2exe-anyway-ly yours,

--
Kenny Pitt

From sethg at GoodmanAssociates.com Fri Feb 27 14:32:52 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Fri Feb 27 14:33:00 2004
Subject: [spambayes-dev] patch to improve statistics from spambayes
In-Reply-To:
Message-ID:

> [Kenny Pitt]
> """
> SpamBayes has classified a total of 1223 messages:
> 827 ham (67.6% of total)
> 333 spam (27.2% of total)
> 63 unsure (5.2% of total)
>
> 1125 messages were classified correctly (92.0% of total)
> 35 messages were classified incorrectly (2.9% of total)
> 0 false positives (0.0% of total)
> 35 false negatives (2.9% of total)
>
> 6 unsures trained as ham (9.5% of unsures)
> 56 unsures trained as spam (88.9% of unsures)
> 1 unsure was not trained (1.6% of unsures)
>
> A total of 760 messages have been trained:
> 346 ham (98.3% ham, 1.7% unsure, 0.0% false positives)
> 414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives)
> """

That looks very useful, concise and complete.

--
Seth Goodman

From moraes at sbcglobal.net Sun Feb 29 02:14:07 2004
From: moraes at sbcglobal.net (Mark Moraes)
Date: Sun Feb 29 02:12:55 2004
Subject: [spambayes-dev] patch to improve statistics from spambayes [rev2]
References:
Message-ID: <005001c3fe93$a6735610$0a01a8c0@ashoka>

Based on Kenny Pitt's suggestions, I revised my statistics patch (enclosed is the revised patch relative to 1.0a9, now that I understand the definition of false positive :-).

An assumption of this form of calculation that's worth noting is that unreviewed/untrained messages must have been classified correctly (presumably the user would otherwise have trained those messages). Seems reasonable enough to me, but worth keeping in mind. (Also, anyone who cares about looking at the statistics presumably cares enough to review/train often!)

Regards,
Mark.

Kenny Pitt wrote:
> Mark Moraes wrote:
> > ...
(The calculation is conservative;
> > false positives/manually-reviewed ham, or false
> > negatives/manually-reviewed spam,
> > so that unreviewed messages don't skew the percentages)
>
> Taking percentages only out of trained messages tells you something
> about your training regimen, but nothing about the accuracy of the
> filter. Filter accuracy is the percent of messages that were correctly
> classified the first time compared to all messages received. The
> correct calculation for accuracy should be:
>
> total_correct = (cls_spam - fp) + (cls_ham - fn)
> acc = 100.0 * (total_correct / total)
> SpamBayes has classified a total of 1223 messages:
> 827 ham (67.6% of total)
> 333 spam (27.2% of total)
> 63 unsure (5.2% of total)
>
> 1125 messages were classified correctly (92.0% of total)
> 35 messages were classified incorrectly (2.9% of total)
> 0 false positives (0.0% of total)
> 35 false negatives (2.9% of total)

---
Sample of current output:

SpamBayes has classified a total of 1671 messages:
1139 ham (68.2% of total)
452 spam (27.0% of total)
80 unsure (4.8% of total)

1555 classified correctly (93.1% of total)
36 classified incorrectly (2.2% of total)
0 incorrectly identified as spam (false positive 0.0% of the total)
36 incorrectly identified as ham (false negative 2.2% of the total)

6 unsures trained as ham (7.5% of unsures)
73 unsures trained as spam (91.3% of unsures)
1 unsure was not trained (1.3% of unsures)

A total of 943 messages have been trained:
393 ham (98.5% ham, 1.5% unsure, 0.0% false positives)
550 spam (80.2% spam, 0.0% unsure, 6.5% false negatives)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DIFF-spambayes10a9-stats-rev2
Type: application/octet-stream
Size: 7989 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040228/4afd20ab/DIFF-spambayes10a9-stats-rev2.obj
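
For anyone who wants to check the arithmetic, here is a minimal sketch of the accuracy calculation Kenny describes above, run against the counts in the sample output. It is not the code in the attached patch: the function name `summarize` and its argument list are made up for illustration, and only `cls_spam`, `cls_ham`, `fp`, `fn`, `total_correct` and `acc` come from Kenny's message. It assumes unsures count as neither correct nor incorrect, which is what the sample totals imply (1555 + 36 + 80 = 1671).

```python
def summarize(cls_ham, cls_spam, cls_unsure, fp, fn):
    """Hypothetical helper: return (correct, incorrect, acc, err).

    cls_* are the counts the filter assigned to each class; fp and fn
    are the false positives and false negatives found when training.
    """
    total = cls_ham + cls_spam + cls_unsure
    if total == 0:
        raise ValueError("no messages have been classified")
    # Classified spam minus false positives, plus classified ham minus
    # false negatives, as in Kenny's formula.
    total_correct = (cls_spam - fp) + (cls_ham - fn)
    total_incorrect = fp + fn
    # Multiply by 100.0 before dividing so the result is a float.
    acc = 100.0 * total_correct / total
    err = 100.0 * total_incorrect / total
    return total_correct, total_incorrect, acc, err


correct, incorrect, acc, err = summarize(1139, 452, 80, fp=0, fn=36)
print("%d classified correctly (%.1f%% of total)" % (correct, acc))
print("%d classified incorrectly (%.1f%% of total)" % (incorrect, err))
```

One thing to watch if the formula is copied literally: written as `100.0 * (total_correct / total)`, the parenthesised division is done first, and with two ints under classic (pre-true-division) Python semantics it truncates to 0. Multiplying by 100.0 first, as in the sketch, avoids that.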