From ta-meyer at ihug.co.nz Sun Jan 2 06:13:58 2005
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Sun Jan 2 06:14:37 2005
Subject: [spambayes-dev] SpamBayes i18n
Message-ID:

An update on the SpamBayes i18n progress:

1. I've checked in the changes to the main spambayes code to work with gettext. I haven't extensively checked this, so there may be bits that need work. I haven't made any change to the loading of the translation manager, so it will currently still look for an outlook_addin.mo file. (As an aside: I think maybe one large messages.po file may be best - as time progresses the overlap between the web interface and the Outlook plug-in grows, and so most messages need to be translated for both, even though that's a lot of work).

2. I've written up a "how to translate" section in README-DEVEL.txt. I think everything in there is correct, but I haven't actually done any translation work, so there may be errors. It would be great if people interested in doing translations could work through the steps outlined and let me know if there are problems/mistakes. (The link will work once anonymous CVS catches up).

Thanks again to everyone that is willing to help with this effort!

=Tony.Meyer

From theller at python.net Mon Jan 3 09:57:45 2005
From: theller at python.net (Thomas Heller)
Date: Mon Jan 3 09:56:27 2005
Subject: [spambayes-dev] sb_imapfilter fix
Message-ID:

Just guesswork:

Index: sb_imapfilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/scripts/sb_imapfilter.py,v
retrieving revision 1.50
diff -c -r1.50 sb_imapfilter.py
*** sb_imapfilter.py 23 Dec 2004 18:14:32 -0000 1.50
--- sb_imapfilter.py 3 Jan 2005 08:55:26 -0000
***************
*** 1087,1093 ****
  imap = IMAPSession(server, port, imapDebug, doExpunge)

  # Load stats manager.
! stats = Stats(options, message_db)

  httpServer = UserInterfaceServer(options["html_ui", "port"])
  httpServer.register(IMAPUserInterface(classifier, imap, pwd,
--- 1087,1093 ----
  imap = IMAPSession(server, port, imapDebug, doExpunge)

  # Load stats manager.
! stats = Stats.Stats(options, message_db)

  httpServer = UserInterfaceServer(options["html_ui", "port"])
  httpServer.register(IMAPUserInterface(classifier, imap, pwd,

From nviry at kerberos.fr Mon Jan 3 23:31:32 2005
From: nviry at kerberos.fr (nviry@kerberos.fr)
Date: Mon Jan 3 23:31:35 2005
Subject: [spambayes-dev] Translation
Message-ID: <3248.81.56.10.70.1104791492.squirrel@www.kerberos.fr>

Hi,

sorry to spam you with this message. It contains the .rc (Outlook plugin) translated into French.

I'm new to the CVS tools, so I think I need a password to write the file directly to the server; this is why I am sending you the file directly.

I used vi to translate the file. HTML file will be next.

As I'm not an expert in .rc files, please double check the following lines:
> LANGUAGE LANG_FRENCH, SUBLANG_FRENCH_FR
(2 times)

Lines like this one were not translated; do they need translation?
> CONTROL "Folder names...\nLine 2

Thanks, next post in a few days

Nicolas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dialogs.rc
Type: application/octet-stream
Size: 33539 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20050103/5f0e2bc5/dialogs-0001.obj

From tameyer at ihug.co.nz Tue Jan 4 01:16:29 2005
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jan 4 01:17:41 2005
Subject: [spambayes-dev] sb_imapfilter fix
In-Reply-To:
Message-ID:

> Just guesswork:
>
> Index: sb_imapfilter.py
[...]
> ! stats = Stats(options, message_db)
[...]
> ! stats = Stats.Stats(options, message_db)

Thanks; fixed - and I've added a test to test_sb_imapfilter.py to check that the web interface works.
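The one-line change in this patch looks like the usual module-versus-class mix-up: when a name such as `Stats` refers to a module, the class inside it has to be reached as `Stats.Stats`. A minimal sketch of the failure and the fix follows. It uses a stand-in module built on the fly; the `_Stats` class body and its attributes are illustrative assumptions, not the real spambayes code.

```python
import types

# Stand-in for a module called Stats that contains a class called Stats.
# (Hypothetical: built here only so the example is self-contained.)
Stats = types.ModuleType("Stats")


class _Stats:
    def __init__(self, options, message_db):
        self.options = options
        self.message_db = message_db


Stats.Stats = _Stats


def call_module():
    """Calling the module object itself raises TypeError."""
    try:
        Stats({}, {})
    except TypeError as exc:
        return str(exc)
    return None


def call_class():
    """Reaching the class through the module works."""
    return Stats.Stats({}, {})


error = call_module()
stats = call_class()
print(error)
print(type(stats).__name__)
```

The first call fails because a module object is not callable; the second is the form the patch switches to.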
=Tony.Meyer From market at cc.wwu.edu Tue Jan 4 02:01:42 2005 From: market at cc.wwu.edu (TJ Olney) Date: Tue Jan 4 02:01:45 2005 Subject: [spambayes-dev] sb_mboxtrain.py trashes some pine mailboxes interpreting them as only one message Message-ID: <41D9EAF6.6080907@cc.wwu.edu> Then leaving only the first message behind. Is this a known problem? Since it worked fine with the first few -g mailboxes I tried, I was pretty confident, but when I tried it on a couple of huge mailing list subscription files, it choked and left behind only the first message. This is unix and pine 4.61 uw-imapd. It appears that this happens when an older form of email is in the folder that might not have the appropriate headers. like this: > ^?^? (Fwd) Re: (Fwd) Re: Relationships Gone Sour: Divorce? > > From @uga.cc.uga.edu:owner-crm-l@EMUVM1.CC.EMORY.EDU T It then deletes all the rest of the mailbox. My fault for using folders with such old messages in them, but I thought you should know. I'm really looking forward to having this working! TJ Olney From hatukanezumi at users.sourceforge.net Tue Jan 4 02:41:25 2005 From: hatukanezumi at users.sourceforge.net (Hatuka*nezumi) Date: Tue Jan 4 02:41:34 2005 Subject: [spambayes-dev] Some problems about i18n In-Reply-To: References: Message-ID: <20050104104125.7ee787f3.hatukanezumi@users.sourceforge.net> On Wed, 22 Dec 2004 15:01:54 +1300 "Tony Meyer" wrote: > I had hoped that Hatuka Nezumi would have responded to the earlier message, > but I haven't heard anything from him for a while (busy, perhaps). He is > leading the i18n process for SpamBayes (I'm helping and doing the checking > in). Sorry for no response. I'm in new-year ('shogatsu' in japanese) vacation till 6 January. I'll go back next week. Problems for Japanese/CJK: 1. Recommended charset of Japanese e-mail message is ISO-2022-JP (cf. RFC1468). This charset isn't suitable for XML/XHTML parser and isn't compatible with Windows ANSI codepage (CP932 for Japanese). 2. 
ISO-2022-* aren't suitable for spambayes tokenizer also. 3. More than one charsets may be used for messages of one language (e.g. ISO-8859-*, UTF-8 and UTF-7 for West-Latin. ISO-2022-JP, Shift_JIS, EUC-JP, UTF-8 and UTF-7 for Japanese). 4. In some East-asian languages (Japanese or Chinese), words are not space-separated then they won't be effectively tokenized. Patch #824651 try to solve these problems. For current i18n works, problem 1. should be solved at least. I am planning to provide sub-patches related to each problems (except problem 4.), converting message headers/bodies to suitable charset for tokenizer (Unicode), web interface (e.g. UTF-8) and Outlook plug-in (mbcs). This solution also will provide really i18n'ized message handling. Note that this solution can require bind_textdomain_codeset function for overlapping gettext catalog of web interface and Outlook plug-in. But I'm not familiar with this function... --- nezumi From tameyer at ihug.co.nz Tue Jan 4 02:59:16 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 4 02:59:52 2005 Subject: [spambayes-dev] Translation In-Reply-To: Message-ID: > sorry to spam you with this message. This is definitely not spam! > It contains the .rc (outlook plugin) translated into french. Great; thanks! > I'm new to the CVS tools so I think I need a password to > write directly the file on the server, this is why I send > you the file directly. Yes, only the developers can make changes to CVS. Sending patches/files here is fine, or alternatively, you could submit a patch via the sourceforge system: > I used vi to translate the file. > HTML file will be next. Great :) It all looks fine, although I had to make a few size changes where the French was larger than the English. I've checked this in, so CVS users ought to be able to set their desired language to fr_FR and have the dialogs (mostly) in French! 
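For the gettext side of the i18n work discussed in this thread, picking a language with a safe fallback to the untranslated strings could look roughly like the sketch below. The domain name `messages` follows the single messages.po idea floated earlier; the helper name and the locale directory are hypothetical, and this is not the actual SpamBayes translation-manager code.

```python
import gettext
import tempfile


def load_translation(language, localedir, domain="messages"):
    """Return a translations object for `language`.

    With fallback=True, gettext returns a NullTranslations object (which
    passes strings through unchanged) when no catalog is found, instead of
    raising an error.
    """
    return gettext.translation(domain, localedir=localedir,
                               languages=[language], fallback=True)


# An empty directory stands in for a locale tree with no fr_FR catalog yet.
empty_localedir = tempfile.mkdtemp()
trans = load_translation("fr_FR", empty_localedir)
label = trans.gettext("Delete")
print(label)
```

Once a compiled fr_FR catalog exists under the locale directory, the same call would return the translated strings without any change to the calling code.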
> As I'm not an expert in .rc files, please double check the
> following lines:
> > LANGUAGE LANG_FRENCH, SUBLANG_FRENCH_FR (2 times)

It didn't matter for SpamBayes, but VC++ chokes on those, for some reason. I've left them as the English versions, since we don't use them.

> Lines like this one were not translated; do they need translation?
> > CONTROL "Folder names...\nLine 2

Nope - you're right in leaving those (well, it wouldn't matter if they were). They're just placeholders for dynamically generated text.

Thanks heaps for the contribution!

=Tony.Meyer

From kenny.pitt at gmail.com Wed Jan 5 16:11:22 2005
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Wed Jan 5 16:11:27 2005
Subject: [spambayes-dev] Training problem in latest CVS
Message-ID: <41dc039b.23bd3560.3e0a.0ee2@smtp.gmail.com>

I trained an Unsure message this morning and was surprised that the score didn't seem to change after it was moved back to my Inbox. I looked in the log file and found the following interesting lines:

"""
Bayes database initialized with 366 spam and 218 good messages
...
Moving and spam training message 'Who's Winning? ' -
Training on message 'Who's Winning? ' in 'Personal Folders/Possible Spam - already was trained as spam
Saving bayes database with 366 spam and 218 good messages
...
Recovering to folder 'Inbox' and ham training message 'Delta cuts U.S. air fares up to 50% ' -
Training on message 'Delta cuts U.S. air fares up to 50% ' in 'Personal Folders/Possible Spam - already was trained as good
Saving bayes database with 366 spam and 218 good messages
"""

Both of these were just-received messages that were classified as Unsure, but notice that SpamBayes thinks they had already been trained. It looks like a bug may have snuck in with how we detect the training status of a message.

I'll look into it when I get a chance, but I'm hoping Tony will know what to do since he has done a lot of work with the message info database lately.
--
Kenny Pitt

From tameyer at ihug.co.nz Fri Jan 7 04:35:07 2005
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri Jan 7 04:35:37 2005
Subject: [spambayes-dev] Training problem in latest CVS
In-Reply-To: <41dc039b.23bd3560.3e0a.0ee2@smtp.gmail.com>
Message-ID:

> I trained an Unsure message this morning and was surprised that the score
> didn't seem to change after it was moved back to my Inbox.

I noticed this a couple of days ago, too, but didn't have the time to look into it just then.

[...]
> I'll look into it when I get a chance, but I'm hoping Tony will know what to
> do since he has done a lot of work with the message info database lately.

Feel free to leave it if you want - I have a reasonable idea of what is going wrong (or "how I broke things") and what needs to be fixed. I'd do it now, but the Outlook installation on my main machine is broken and unusable, and I'm still trying to figure out how to fix it :(. Once that's done (I have hopes that it will be today, but if not, then Monday is the next time I'll get a chance) I'll look into this.

=Tony.Meyer

From nviry at kerberos.fr Sun Jan 9 18:57:08 2005
From: nviry at kerberos.fr (nviry@kerberos.fr)
Date: Sun Jan 9 18:57:12 2005
Subject: [spambayes-dev] Translation - HTML file
Message-ID: <3129.81.56.10.70.1105293428.squirrel@www.kerberos.fr>

Hi,

here is the html file attached.

I can translate the web site now. I've wgeted the web pages but they seem to have been generated. Are the pages in the cvs directory? If not, where can I find them?

Nicolas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20050109/b77e439c/ui-0001.html

From tameyer at ihug.co.nz Mon Jan 10 03:34:07 2005
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Jan 10 03:34:54 2005
Subject: [spambayes-dev] Translation - HTML file
In-Reply-To:
Message-ID:

> here is the html file attached.

Great - thanks! I've checked this in.
> I can translate the web site now. I've wgeted the web pages
> but they seem to have been generated. Are the pages in the
> cvs directory? If not, where can I find them?

Yes, they are in CVS. There is a "website" module that you can check out which has the source in it. (They are .ht files, apart from the FAQ, which is .txt).

There hasn't been a translation of the SpamBayes website before. I guess the translation can just be a mirror in a (e.g.) fr directory? So something like:

And then have links to the translations on the main (English) page at ?

If that sounds like the right way to go, then I can make the necessary adjustments to the scripts that generate the html and copy it to the live site. (Whatever happens, if you just provide the translations, then the rest of us can figure out where to put them).

Thanks again for the contribution!

=Tony.Meyer

From lenneis at wu-wien.ac.at Mon Jan 10 19:20:30 2005
From: lenneis at wu-wien.ac.at (Joerg Lenneis)
Date: Mon Jan 10 19:31:34 2005
Subject: [spambayes-dev] sb_mailsort.py status
Message-ID:

Dear all,

I started using Spambayes only last week and I am very impressed so far. This is my first attempt at spam filtering. I finally gave up: my mail address has been around and used for ages, so without filtering I get an insane amount of spam. I feared a not insignificant number of false positives, but so far things have worked very well, with no false positives at all.

I use sb_mailsort.py for training and filtering (use a CDB database for probabilities, sort into one of two Maildirs depending on whether a message is above the spam threshold or not) because it gives me a failproof way of updating the probabilities and delivering mail, even over NFS. I am not concerned about the additional overhead of the database being reconstructed from scratch on every training session.

I have noticed from CVS that sb_mailsort.py is somewhat dated now, with the last update about 5 months ago. There are a couple of things that might be useful, like being able to set the spam threshold via the command line. The Maildir algorithm could also be adapted somewhat to conform more closely to the specification.

Are patches to do this welcome here, or alternatively, is the original author still interested in continuing work on sb_mailsort.py?

best regards,

--
Joerg Lenneis
email: lenneis@wu-wien.ac.at

From nas at arctrix.com Mon Jan 10 21:00:58 2005
From: nas at arctrix.com (Neil Schemenauer)
Date: Mon Jan 10 21:01:02 2005
Subject: [spambayes-dev] sb_mailsort.py status
In-Reply-To:
References:
Message-ID: <20050110200057.GB585@mems-exchange.org>

On Mon, Jan 10, 2005 at 07:20:30PM +0100, Joerg Lenneis wrote:
> I have noticed from CVS that sb_mailsort.py is somewhat dated now,
> with the last update about 5 months ago.

That's because it's such high quality code. ;-)

> There are a couple of things that might be useful, like being able
> to set the spam threshold via the command line. The Maildir
> algorithm could also be adapted somewhat to conform more closely
> to the specification.
>
> Are patches to do this welcome here, or alternatively, is the original
> author still interested in continuing work on sb_mailsort.py?

Patches are definitely welcome. I'm extremely busy at the moment but I will be glad to review changes.

I actually don't use sb_mailsort.py anymore. The problem was that I was receiving so much crap that I could no longer review the spam folder. I now have a SMTP reverse proxy that uses Spambayes and a CDB database. Bouncing high scoring spam is necessary for me because I can't review it all.
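As a rough illustration of the two requests in this thread (a threshold settable on the command line, and Maildir delivery that follows the specification), the sketch below uses the standard-library `mailbox.Maildir` class, which writes a new message into `tmp` and then moves it into `new`. The `--spam-cutoff` option, the default value, and the function names are hypothetical and are not taken from sb_mailsort.py.

```python
import argparse
import mailbox
import os
import tempfile


def parse_args(argv):
    """Parse a hypothetical command line with a settable spam threshold."""
    parser = argparse.ArgumentParser(description="mailsort-style filter sketch")
    parser.add_argument("--spam-cutoff", type=float, default=0.9,
                        help="scores above this value go to the spam Maildir")
    return parser.parse_args(argv)


def deliver(raw_message, score, ham_dir, spam_dir, spam_cutoff):
    """Deliver the message to one of two Maildirs depending on its score."""
    target = spam_dir if score > spam_cutoff else ham_dir
    # create=True makes the tmp/new/cur layout if it does not exist yet.
    box = mailbox.Maildir(target, create=True)
    box.add(raw_message)
    return target


args = parse_args(["--spam-cutoff", "0.8"])
root = tempfile.mkdtemp()
ham_dir = os.path.join(root, "ham")
spam_dir = os.path.join(root, "spam")
chosen = deliver("Subject: test\n\nbody\n", 0.95, ham_dir, spam_dir,
                 args.spam_cutoff)
print(chosen)
```

The score itself would come from the classifier; here it is passed in as a plain number so the delivery logic can be read on its own.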
Neil From sethg at GoodmanAssociates.com Mon Jan 10 21:38:54 2005 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Mon Jan 10 21:38:55 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: <20050110200057.GB585@mems-exchange.org> Message-ID: > From: Neil Schemenauer > Sent: Monday, January 10, 2005 2:01 PM <...> > I now have a SMTP reverse proxy that uses Spambayes and a > CDB database. Bouncing high scoring spam is necessary for me > because I can't review it all. That's great! I hope when you say "bounce" you actually mean reject at the end of data. Rejecting unwanted messages is the way to go, because even in the rare event of a false positive, the sender gets a DSN and there's no backscatter. If that's what you are doing, please ignore the following, which is for those who still send DSN's for spam. Bouncing spam after acceptance is a real problem, even though false positives would still get a DSN. The problem is that in the majority of spam, both the MAIL FROM: and the From: addresses are forged. Sending a bounce just abuses innocent third parties, in addition to giving the spammer a second chance to get their payload delivered. -- Seth Goodman From tameyer at ihug.co.nz Mon Jan 10 23:20:12 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Jan 10 23:20:50 2005 Subject: [spambayes-dev] Training problem in latest CVS In-Reply-To: Message-ID: [Kenny Pitt] > I trained an Unsure message this morning and was surprised > that the score didn't seem to change after it was moved back > to my Inbox. Ok, this ought to be fixed now. Apologies for the delay - fixing my Outlook install (dead profile, corrupted mapi32.dll) took longer than expected. =Tony.Meyer From popiel at wolfskeep.com Mon Jan 10 23:35:39 2005 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Jan 10 23:35:42 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: Message from "Seth Goodman" of "Mon, 10 Jan 2005 14:38:54 CST." 
References:
Message-ID: <20050110223539.49FD12DDB6@cashew.wolfskeep.com>

In message: "Seth Goodman" writes:
>
>Bouncing spam after acceptance is a real problem, even though false
>positives would still get a DSN. The problem is that in the majority of
>spam, both the MAIL FROM: and the From: addresses are forged. Sending a
>bounce just abuses innocent third parties, in addition to giving the spammer
>a second chance to get their payload delivered.

Unfortunately, bouncing spam after acceptance is increasingly unavoidable for anyone who has a backup MX host as insurance against their primary host being down. Many spammers are targeting the secondary MX instead of the primary MX... and a secondary MX sufficiently isolated from the primary to actually be useful as a failover is likely to just accept, queue, and relay. When the primary then rejects the message from the secondary, the secondary is stuck trying to deliver a DSN.

I'm currently working on a hack to postfix such that locally-generated DSN messages cannot get deferred (the RFC says that you must generate a DSN, but doesn't say that you have to try to deliver it more than once). This will at least prevent my secondary MX from crumbling under the load of bouncing spam sent to nonexistent addresses on my primary. (These spam DSNs frequently end up deferred because the purported source either doesn't exist or issues a 400-series response to trying to deliver the DSN... and the retries of these deferrals for 4 days is what pushes my secondary over the edge.)

- Alex, peeved at having to hack his mail server because of the spammers

PS. No, I'm not willing to not have a secondary MX. My primary does crash occasionally, though (thankfully) not as much as it used to before I replaced the motherboard.
From kenny.pitt at gmail.com Mon Jan 10 23:43:14 2005 From: kenny.pitt at gmail.com (Kenny Pitt) Date: Mon Jan 10 23:43:17 2005 Subject: [spambayes-dev] Training problem in latest CVS In-Reply-To: Message-ID: <41e30503.12405188.0671.130f@smtp.gmail.com> Tony Meyer wrote: > [Kenny Pitt] >> I trained an Unsure message this morning and was surprised >> that the score didn't seem to change after it was moved back >> to my Inbox. > > Ok, this ought to be fixed now. Apologies for the delay - fixing my > Outlook install (dead profile, corrupted mapi32.dll) took longer than > expected. Works for me. Thanks, Tony. If anyone else verifies this fix, just make sure you don't try it on a message that you had problems with before the fix. The damage is already done. -- Kenny Pitt From nas at arctrix.com Mon Jan 10 23:44:42 2005 From: nas at arctrix.com (Neil Schemenauer) Date: Mon Jan 10 23:44:45 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: <20050110223539.49FD12DDB6@cashew.wolfskeep.com> References: <20050110223539.49FD12DDB6@cashew.wolfskeep.com> Message-ID: <20050110224442.GB1491@mems-exchange.org> On Mon, Jan 10, 2005 at 02:35:39PM -0800, T. Alexander Popiel wrote: > When the primary then rejects the message from the secondary, the > secondary is stuck trying to deliver a DSN. You really should not generate DSNs, IMHO. They will very likely be sent to forged From addresses. In that case, they are as bad as spam. > PS. No, I'm not willing to not have a secondary MX. My primary does > crash occasionally, though (thankfully) not as much as it used to > before I replaced the motherboard. I don't see how that makes a secondary MX necessary. The sending servers have outgoing queues and they will retry. 
Neil From sethg at GoodmanAssociates.com Tue Jan 11 01:43:21 2005 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Tue Jan 11 01:43:22 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: <20050110223539.49FD12DDB6@cashew.wolfskeep.com> Message-ID: > From: T. Alexander Popiel [mailto:popiel@wolfskeep.com] > Sent: Monday, January 10, 2005 4:36 PM <...> > Unfortunately, bouncing spam after acceptance is increasingly unavoidable > for anyone who has a backup MX host as insurance against their primary > host being down. Anyone who runs a secondary MX with less security than the primary, i.e. no list of real mailboxes, no FCrDNS, etc., might as well drop their anti-spam measures on the primary, because: > Many spammers are targetting the secondary MX instead of the primary > MX... as would anyone who wants to deliver a message that you don't want to accept and therefore seeks out the MX with the weakest security. > and a secondary MX sufficiently isolated from the primary to actually > be useful as a failover is likelyly to just accept, queue, and relay. It is perfectly reasonable to establish a secondary in another facility that still has the same list of real mailboxes and the same incoming policy as the primary. This might rule out some providers of backup MX services, but not all. If you're a large operation, you should have complete control over your secondary. If you're a small operation, you might want to rethink if it is really necessary to have a secondary MX if it is not possible to clone the setup of your primary. Hardware and network connections are more reliable than they used to be and senders will queue your mail upon temporary failures. > When the primary then rejects the message from the > secondary, the secondary is stuck trying to deliver a DSN. This is exactly the situation you never want to be in. A spammer should get the same rejection at your secondary that they get at your primary. 
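The point about identical policy on both hosts can be made concrete with a small sketch: if the primary and the secondary consult the same list of real mailboxes while the SMTP session is still open, an unknown recipient is refused with a 5xx reply at either host, and no DSN ever has to be generated afterwards. The function name, the addresses, and the reply texts below are illustrative assumptions, not the configuration of any particular mail server.

```python
def rcpt_response(address, valid_mailboxes):
    """Return an SMTP reply for a RCPT TO address, decided during the session."""
    if address.lower() in valid_mailboxes:
        return 250, "2.1.5 Recipient OK"
    # Refusing here leaves the bounce to the connecting client, so this
    # host never has to send a notification to a possibly forged sender.
    return 550, "5.1.1 No such user here"


# The same (hypothetical) mailbox list is shared by both MX hosts.
VALID_MAILBOXES = frozenset({"alice@example.org", "postmaster@example.org"})

primary_reply = rcpt_response("nobody@example.org", VALID_MAILBOXES)
secondary_reply = rcpt_response("nobody@example.org", VALID_MAILBOXES)
known_reply = rcpt_response("Alice@example.org", VALID_MAILBOXES)
print(primary_reply, secondary_reply, known_reply)
```

How the list reaches the secondary (and how a given mail server is told to consult it) is the part this sketch leaves open.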
<...> > (These spam DSNs frequently end up deferred because the purported > source either doesn't exist or issues a 400-series response to > trying to deliver the DSN... and the retries of these deferrals for > 4 days is what pushes my secondary over the edge.) Another possibility is that the systems to whom you are sending bogus DSN's are teergrubing you (forcing you to keep a socket and process alive for a long time) as punishment for abuse. Check your logs to see if these 4xx transactions are taking a very long time. Operating any MX in store and forward mode and sending out DSN's to return addresses on spam that you haven't confirmed are good during the SMTP session can easily turn you into a spam reflector. Even if the envelope return addresses on spam are valid, they are likely to be joe-job forgeries, so you still don't want to send DSN's in response to spam. -- Seth Goodman From sethg at GoodmanAssociates.com Tue Jan 11 02:41:17 2005 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Tue Jan 11 02:41:16 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: <20050110223539.49FD12DDB6@cashew.wolfskeep.com> Message-ID: > From: T. Alexander Popiel [mailto:popiel@wolfskeep.com] > Sent: Monday, January 10, 2005 4:36 PM <...> > PS. No, I'm not willing to not have a secondary MX. My primary does > crash occasionally, though (thankfully) not as much as it used to > before I replaced the motherboard. If you can't clone your primary setup onto your secondary and you can't live without a secondary, here's another possibility. Only accept mail at the secondary when the primary is down. This should greatly limit the damage, since your primary will rarely be down. If you refuse a connection at your secondary MX and they don't retry at your primary, you can be pretty sure it wasn't real mail. -- Seth Goodman From popiel at wolfskeep.com Tue Jan 11 04:29:27 2005 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Tue Jan 11 04:29:31 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: Message from Neil Schemenauer of "Mon, 10 Jan 2005 17:44:42 EST." <20050110224442.GB1491@mems-exchange.org> References: <20050110223539.49FD12DDB6@cashew.wolfskeep.com> <20050110224442.GB1491@mems-exchange.org> Message-ID: <20050111032927.9079B2DDF3@cashew.wolfskeep.com> In message: <20050110224442.GB1491@mems-exchange.org> Neil Schemenauer writes: >On Mon, Jan 10, 2005 at 02:35:39PM -0800, T. Alexander Popiel wrote: >> When the primary then rejects the message from the secondary, the >> secondary is stuck trying to deliver a DSN. > >You really should not generate DSNs, IMHO. They will very likely be >sent to forged From addresses. In that case, they are as bad as >spam. RFC 2821 requires DSNs if a site has accepted a message that is subsequently discovered to be undeliverable: # If an SMTP server has accepted the task of relaying the mail and # later finds that the destination is incorrect or that the mail cannot # be delivered for some other reason, then it MUST construct an # "undeliverable mail" notification message and send it to the # originator of the undeliverable mail (as indicated by the reverse- # path). Formats specified for non-delivery reports by other standards # (see, for example, [24, 25]) SHOULD be used if possible. Personally, I'm not willing to allow other people's anti-social behavior to induce me to violate clearly specified standards. >> PS. No, I'm not willing to not have a secondary MX. My primary does >> crash occasionally, though (thankfully) not as much as it used to >> before I replaced the motherboard. > >I don't see how that makes a secondary MX necessary. The sending >servers have outgoing queues and they will retry. Not all senders have outgoing queues (in particular, some mail clients insist on trying to send mail direct to the destination, and have no facility to queue until the destination is available). 
Moreover, there are times when my machine has been down for several days with hardware failures, and while I can control the queue expiry on my secondary MX (setting it to 30 days or so, when I know I'm going to be down a while), I cannot control the expire times on any sender's queues. - Alex From popiel at wolfskeep.com Tue Jan 11 04:48:34 2005 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 11 04:48:36 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: Message from "Seth Goodman" of "Mon, 10 Jan 2005 18:43:21 CST." References: Message-ID: <20050111034834.DF6CA2DDB6@cashew.wolfskeep.com> In message: "Seth Goodman" writes: >> From: T. Alexander Popiel [mailto:popiel@wolfskeep.com] >> Sent: Monday, January 10, 2005 4:36 PM > ><...> > >> Unfortunately, bouncing spam after acceptance is increasingly unavoidable >> for anyone who has a backup MX host as insurance against their primary >> host being down. [...] >It is perfectly reasonable to establish a secondary in another facility that >still has the same list of real mailboxes and the same incoming policy as >the primary. This might rule out some providers of backup MX services, but >not all. If you're a large operation, you should have complete control over >your secondary. If you're a small operation, you might want to rethink if >it is really necessary to have a secondary MX if it is not possible to clone >the setup of your primary. Hardware and network connections are more >reliable than they used to be and senders will queue your mail upon >temporary failures. I'm an extremely small operation... this is my home box, maintained in my spare time, and I trade secondary MX services with friends on multiple continents. Yes, I could export my list of valid addresses to said friends, but it would still be a hack to the mail server to obey that list for relay (unless I'm missing something in the postfix docs, which is entirely possible). 
>> When the primary then rejects the message from the >> secondary, the secondary is stuck trying to deliver a DSN. > >This is exactly the situation you never want to be in. A spammer should get >the same rejection at your secondary that they get at your primary. In a world of robust communication between primary and secondary, yes... but that requires much more investment in the infrastructure than I've had opportunity to make. >> (These spam DSNs frequently end up deferred because the purported >> source either doesn't exist or issues a 400-series response to >> trying to deliver the DSN... and the retries of these deferrals for >> 4 days is what pushes my secondary over the edge.) > >Another possibility is that the systems to whom you are sending bogus DSN's >are teergrubing you (forcing you to keep a socket and process alive for a >long time) as punishment for abuse. 1. The DSNs are not bogus; it's the messages that they're in response to that were bogus. 2. Since it's disk space that's the problem and not CPU time, teergrubing is not an issue. (My home DSL link (or that of my secondary) isn't fat enough for it to be worth teergrubing me, anyway.) 3. It's unforunate when obeying the RFCs is considered abuse. - Alex From popiel at wolfskeep.com Tue Jan 11 04:51:33 2005 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 11 04:51:35 2005 Subject: [spambayes-dev] sb_mailsort.py status In-Reply-To: Message from "Seth Goodman" of "Mon, 10 Jan 2005 19:41:17 CST." References: Message-ID: <20050111035133.2F5C32DDB6@cashew.wolfskeep.com> In message: "Seth Goodman" writes: >> From: T. Alexander Popiel [mailto:popiel@wolfskeep.com] >> Sent: Monday, January 10, 2005 4:36 PM > ><...> > >> PS. No, I'm not willing to not have a secondary MX. My primary does >> crash occasionally, though (thankfully) not as much as it used to >> before I replaced the motherboard. 
> >If you can't clone your primary setup onto your secondary and you can't live >without a secondary, here's another possibility. Only accept mail at the >secondary when the primary is down. This should greatly limit the damage, >since your primary will rarely be down. If you refuse a connection at your >secondary MX and they don't retry at your primary, you can be pretty sure it >wasn't real mail. ... Or that there was another routing foulup in Sprint's Seattle hub. Having some parts of the net able to reach me but not others happens about once a quarter for an hour or two. - Alex From tameyer at ihug.co.nz Tue Jan 11 23:20:09 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 11 23:48:50 2005 Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/Outlook2000/dialogs/resourcesdialogs.h, 1.25, 1.26 dialogs.rc, 1.50, 1.51 In-Reply-To: Message-ID: > Make all resources use the same sub-language. Due to editing > by developers from different countries, we had developed a mixture > of English (United States) and English (Australia) resources. > With apologies to Tony and Mark, I'm in the US so I standardized > on English (United States). No need to apologise to me - I don't think any NZer was ever in favour of something Australian . =Tony.Meyer From luciagomes8475z at hotmail.com Fri Jan 14 23:33:44 2005 From: luciagomes8475z at hotmail.com (Lucia Gomes) Date: Fri Jan 14 23:33:46 2005 Subject: [spambayes-dev] listagem de e-mails Message-ID: <20050114223343.C95191E4010@bag.python.org> Mais Emails, venda online de listas de email, fazemos mala direta e propaganda de sua empresa ou neg?cio para milh?es de emails. 
We have email lists: Direct Mail, Email Registries, Email Lists, Mailing Lists, Millions of Emails, Email Sending Programs, Email Bombers, Email Extractors, Segmented Email Lists, Segmented Emails, Bulk Emails, E-mails http://www.estacion.de/maladireta From theller at python.net Tue Jan 18 17:00:58 2005 From: theller at python.net (Thomas Heller) Date: Tue Jan 18 16:59:42 2005 Subject: [spambayes-dev] Re: Cannot find saved message References: <8y91sszb.fsf@python.net> Message-ID: Thomas Heller writes: > "Tony Meyer" writes: > >>>From time to time, I'm getting this traceback, in the sb_imapfilter: >> [...] >>> File "sb_imapfilter.py", line 559, in Save >>> raise BadIMAPResponseError("Cannot find saved message", "") >>> BadIMAPResponseError: The command 'Cannot find saved message' >>> failed to give an OK response. >> [...] >>> Does anyone have a solution to this, before I examine this further? >> >> Not a solution, but there is the material in here: >> >> [ 1023797 ] Imapfilter fails: 'Cannot find saved message' >> >> >> I haven't managed to figure this one out yet, sorry. (If you have the >> time to, that would be great!). I believe the problem comes from the >> way imapfilter now waits for an EXISTS message from the IMAP server >> before trying to find the new message (this is to try and overcome a >> problem the old version had with servers that wouldn't immediately >> find new messages). >> >> However, if you're getting as far as 559, then an EXISTS response has >> been received, but the newly created message isn't found anyway. >> (Maybe a different message arrived, but the one we created isn't >> available? That would be weird).
>> >> Running with -i4 ought to give enough detail of the IMAP4 conversation >> that you can see why it's failing. If you don't have time to look at >> it, if you could attach your -i4 output to the tracker (removing your >> username/password details) and remind me to get to this quickly, I'll >> try and do that. > > Maybe related, maybe not - running with -i4 seems (?) to cure the > problem. At least it has not yet happened again. The bad news is - it didn't. The problem remains. But the good news is: running sb_imapfilter (from CVS) with Python2.4 instead of 2.3 really fixed the problem. Thomas From tameyer at ihug.co.nz Tue Jan 18 23:20:23 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 18 23:21:46 2005 Subject: [spambayes-dev] Re: Cannot find saved message In-Reply-To: Message-ID: > >>>From time to time, I'm getting this traceback, in the > sb_imapfilter: > >> [...] > >>> File "sb_imapfilter.py", line 559, in Save > >>> raise BadIMAPResponseError("Cannot find saved message", "") > >>> BadIMAPResponseError: The command 'Cannot find saved > message' failed > >>> to give an OK response. > >> [...] > >>> Does anyone have a solution to this, before I examine > this further? [...] > The bad news is - it didn't. The problem remains. > > But the good news is: running sb_imapfilter (from CVS) with > Python2.4 instead of 2.3 really fixed the problem. Were you using sb_imapfilter not from CVS before? (i.e. was the fix using Python 2.4, or using Python 2.4 *and* CVS imapfilter?). The main thing that I can think of that changes with 2.4 is using email 3.0, which means that there's no need for the unparseable message handling. If the fix was just changing to 2.4, then maybe the problem is just occurring with malformed messages?
That would certainly be a good place for me to start looking, at least :) =Tony.Meyer From ta-meyer at ihug.co.nz Wed Jan 19 00:20:13 2005 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 19 00:20:55 2005 Subject: [spambayes-dev] More stupid beats smart timcv.py results Message-ID: Results for a couple of timcv.py tests that I've done recently are here: The former was in response to a request to tokenize the Received-SPF headers. I don't have a great deal of mail with those headers (and looking at the specs, it's not clear whether they are still meant to be used). Hardly anything changed, anyway, so it doesn't seem worth doing anything with them at the moment. The latter was prompted by a comment in JGC's latest newsletter (though I'm sure I've seen this somewhere before, too). To avoid deliberate misspellings and the so-called 'cambridge effect' you replace each token with (or add alongside it) a new token made up of the letters in the original token sorted into a constant order (e.g. alphabetical). So "god" becomes "dgo", but so does "dog". I tried both replacing the original token and adding a new one, and tried making the change in the headers, in the body, and both. In the good cases FPs weren't really affected, but FNs always increased, as did unsures, so that, together with the effect of making the database harder to read, makes this a bad idea, it seems. Anyway, just FYI :) =Tony.Meyer From skip at pobox.com Wed Jan 19 02:23:13 2005 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 19 03:00:14 2005 Subject: [spambayes-dev] More stupid beats smart timcv.py results In-Reply-To: References: Message-ID: <16877.46721.450856.19583@montanaro.dyndns.org> Tony> The latter was prompted by a comment in JGC's latest newsletter Tony> (though I'm sure I've seen this somewhere before, too). Who's JGC? Has anyone tried de-l33t-ing words that contain numbers?
http://www.bbc.co.uk/dna/h2g2/A787917 Skip From ta-meyer at ihug.co.nz Wed Jan 19 03:06:23 2005 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 19 03:07:14 2005 Subject: [spambayes-dev] More stupid beats smart timcv.py results In-Reply-To: Message-ID: > Who's JGC? Sorry. John Graham-Cumming of POPfile (). > Has anyone tried de-l33t-ing words that contain numbers? > > http://www.bbc.co.uk/dna/h2g2/A787917 Not me. =Tony.Meyer From tim.peters at gmail.com Wed Jan 19 03:13:57 2005 From: tim.peters at gmail.com (Tim Peters) Date: Wed Jan 19 03:13:59 2005 Subject: [spambayes-dev] More stupid beats smart timcv.py results In-Reply-To: References: Message-ID: <1f7befae05011818136de1b412@mail.gmail.com> [Tony Meyer] > Results for a couple of timcv.py tests that I've done recently are > here: It's sure nice to see someone is still testing ideas! It would be even nicer if we could find a good one . > > > > The former was in response to a request to tokenize the > Received-SPF headers. I don't have a great deal of mail with > those headers (and looking at the specs, it's not clear whether > they are still meant to be used). Hardly anything changed, > anyway, so it doesn't seem worth doing anything with them > at the moment. Indeed, I had to stare hard to find any difference at all. > The latter was prompted by a comment in JGC's latest > newsletter (though I'm sure I've seen this somewhere before, > too). To avoid deliberate misspellings and the so-called > 'cambridge effect' you replace each (or generate a new) token > that is made up of the letters in the original token sorted into a > constant order (e.g. alphabetical). So "god" becomes "dgo", > but so does "dog". > > I tried both replacing the original token and adding a new one, > and tried making the change in the headers, in the body, and > both. 
In the good cases FPs weren't really effected, but FNs > always increased, as did unsures, so that with the effect of > making the database harder to read, makes this a bad > idea it seems. Yup. I see very little Camridbge Unvierstiy obfuscation, so I wouldn't expect this to help. In effect, replacing tokens with a canonicalized form is a limited kind of hashing (mapping multiple tokens to one), and the only kind of deliberate token-confusion that ever won in tests was the "skip:" gimmick for very long tokens. In the cases where you added the canonicalized form (in addition to retaining the original form), it may have a bad interaction with the bigram option (which I believe you use), destroying the natural bigrams. It would be clearer to turn bigrams off in that case. But I wouldn't expect it to help anyway. From tameyer at ihug.co.nz Wed Jan 19 03:31:00 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 19 03:31:24 2005 Subject: [spambayes-dev] More stupid beats smart timcv.py results In-Reply-To: Message-ID: [Tim Peters] > It's sure nice to see someone is still testing ideas! I was mother-in-law-sitting for a few days, so leaving the machine running tests was an easy option :) > It would be even nicer if we could find a good one . It would be easier if the with-defaults results weren't so good, of course :) If I find time to really try something out I suppose I ought to start by staring at the mistakes that running with defaults is generating. > Indeed, I had to stare hard to find any difference at all. Me too :) I wondered if maybe I didn't have any of the things, so grepped through the mail, but there were some (mostly from the same domains, which probably means that there were clues being harvested already). [...] > In the cases where you added the canonicalized form (in addition to > retaining the original form), it may have a bad interaction with the > bigram option (which I believe you use), destroying the natural > bigrams. 
It would be clearer to turn bigrams off in that case. I ran them all with bigrams off, although I do have it on with the classifier I actually use. > But I wouldn't expect it to help anyway. You'd be right :) =Tony.Meyer From adrian at apsistemas.info Wed Jan 19 22:52:31 2005 From: adrian at apsistemas.info (Adrian Perello Marin) Date: Wed Jan 19 22:54:13 2005 Subject: [spambayes-dev] Dialogs.rc TRANSLATION In-Reply-To: Message-ID: <006401c4fe71$32ca6120$4501a8c0@samsungx10> Please can you tell me if the original dialogs.rc must to be translated only the part of English (U.S.) resources or the English (Australia) resources must to be translated too ?? Thanks. From kenny.pitt at gmail.com Wed Jan 19 23:49:12 2005 From: kenny.pitt at gmail.com (Kenny Pitt) Date: Wed Jan 19 23:49:22 2005 Subject: [spambayes-dev] Dialogs.rc TRANSLATION In-Reply-To: <006401c4fe71$32ca6120$4501a8c0@samsungx10> Message-ID: <41eee3ef.5cfa5707.1694.00a7@smtp.gmail.com> Adrian Perello Marin wrote: > Please can you tell me if the original dialogs.rc must to be > translated only the part of > > English (U.S.) resources or the English (Australia) resources must to > be translated too ?? The latest CVS version of dialogs.rc should contain only English (U.S.) resources. I checked in an update a week or two ago to change all the Australia resources back to U.S. -- Kenny Pitt From tameyer at ihug.co.nz Thu Jan 20 02:44:21 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 20 02:44:56 2005 Subject: [spambayes-dev] RE: [Spambayes] Trained two times as much spam as ham In-Reply-To: Message-ID: [Skip Montanaro] > It might be useful to codify some of these ideas into a tool > the user can run to reduce training dataset sizes without > necessarily committing to the train-to-exhaustion concept. I think it would be a good idea if we had a spambayes.training module that contained various training code like this, code to do tte/nonedge/etc and so forth. 
contrib/tte.py (and maybe other new contrib/ or utilities/ scripts) could just be the getopt stuff and then a few lines of code calling the appropriate functions in spambayes.training, and the other scripts could make use of the same code (I'd also like Outlook, sb_server and sb_imapfilter to have a slightly higher abstraction for training to allow for flexibility in what's done). =Tony.Meyer From sethg at GoodmanAssociates.com Thu Jan 20 03:44:43 2005 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Thu Jan 20 03:44:42 2005 Subject: [spambayes-dev] Dialogs.rc TRANSLATION In-Reply-To: <41eee3ef.5cfa5707.1694.00a7@smtp.gmail.com> Message-ID: > From: Kenny Pitt > Sent: Wednesday, January 19, 2005 4:49 PM <...> > The latest CVS version of dialogs.rc should contain only English (U.S.) > resources. I checked in an update a week or two ago to change all the > Australia resources back to U.S. Thanks for doing the translation. -- Seth Goodman From skip at pobox.com Fri Jan 21 03:25:38 2005 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 21 04:00:16 2005 Subject: [spambayes-dev] minor csv module problem Message-ID: <16880.26658.474204.913311@montanaro.dyndns.org> In my message training I train into a pickle (faster at that point), then use sb_dbexpimp to dump it to a csv file. For use by sb_bnfilter I then convert that to a Berkeley db file. (The csv file also serves as a convenient debug/interchange format.) The Python csv module is used both to write and read the csv file. Unfortunately, it seems to have a bug. It generates this line: "subject: \r",0,1\r (\r subbing for the real CR), which it later refuses to read because it thinks there is a newline inside the string. This is a long-standing bug as far as I can tell. I can reproduce it with Python 2.3 and 2.4, though it is fixed in the latest CVS, probably as a side-effect of the recent changes to the csv module.
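A minimal sketch of the round trip described above, assuming a current Python 3 `csv` module rather than the 2.3/2.4 versions under discussion. `round_trip` is a hypothetical helper name, and the row simply mirrors the line quoted in the report. It is meant only as a quick way to check whether a given interpreter still trips over a field that ends in a carriage return.

```python
import csv
import io


def round_trip(row):
    """Write one row with csv.writer, then read it back with csv.reader."""
    buf = io.StringIO(newline="")  # newline="" leaves an embedded \r untranslated
    csv.writer(buf).writerow(row)
    buf.seek(0)
    return next(csv.reader(buf))


# A token ending in a carriage return, as in the report above.
row = ["subject: \r", "0", "1"]
restored = round_trip(row)
print(restored == row)  # True when the field survives the round trip
```

Counts are kept as strings here because `csv.reader` always returns strings, which keeps the comparison honest.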
I imagine we'll get the csv problem fixed (hopefully by the 2.3.5 release), but that doesn't help SpamBayes in the short term, so I think a workaround is in order. The problem is a token generated that ends with a \r character. One spam's subject is: '=?iso-2022-jp?B?k36LeILdgs2DRYNug0WDbiAgICAgICAgICAN?=' After decoding by email.Header.decode_header we have '\x93~\x8bx\x82\xdd\x82\xcd\x83E\x83n\x83E\x83n \r' The tokenizer generates this token as part of its output: 'subject: \r' Perhaps we could replace '\r' with ' ' in the subject before tokenizing without losing much/any accuracy. I don't believe we can get whitespace in body tokens. Skip From tameyer at ihug.co.nz Fri Jan 21 04:08:43 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Fri Jan 21 04:09:19 2005 Subject: [spambayes-dev] minor csv module problem In-Reply-To: Message-ID: > Perhaps we could replace '\r' with ' ' in the subject before > tokenizing without losing much/any accuracy. I don't believe > we can get whitespace in body tokens. +1. (I presume that this is a nicer solution than having our own csv subclass that has the problem fixed?) =Tony.Meyer From skip at pobox.com Fri Jan 21 05:43:01 2005 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 21 06:00:12 2005 Subject: [spambayes-dev] minor csv module problem In-Reply-To: References: Message-ID: <16880.34901.565759.516302@montanaro.dyndns.org> >> Perhaps we could replace '\r' with ' ' in the subject before >> tokenizing without losing much/any accuracy. I don't believe we can >> get whitespace in body tokens. Tony> +1. Tony> (I presume that this is a nicer solution than having our own csv Tony> subclass that has the problem fixed?) Well, given that the bug is in the underlying _csv extension module, I suspect so. ;-) Checked in as tokenizer.py 1.34. 
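The workaround agreed on above amounts to a single substitution applied to the subject before it is tokenized. A minimal sketch follows; `clean_subject` is a hypothetical helper, and this is not the change actually checked in as tokenizer.py 1.34, which the thread does not show.

```python
def clean_subject(subject):
    """Replace carriage returns with spaces so no token can end in CR."""
    return subject.replace("\r", " ")


# Tail of the decoded subject quoted above: a trailing blank, then a CR.
decoded = "\x83n\x83E\x83n \r"
print(repr(clean_subject(decoded)))
```

Replacing rather than stripping keeps the token boundaries where they were, which matches the suggestion of swapping '\r' for ' '.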
Skip From skip at pobox.com Fri Jan 21 21:12:13 2005 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 21 21:43:41 2005 Subject: [spambayes-dev] "approximately" the same size Message-ID: <16881.25117.750274.132042@montanaro.dyndns.org> When we tell people not to let their ham/spam imbalance get too bad, we are referring to the number of messages trained. There is another way to look at this imbalance though: number of tokens generated from each stream. For me, ham messages are much larger on average than spam messages. Consequently, for roughly the same number of tokens to come from each stream, I need more spams than hams. Is there some way to tell how this might affect scoring? Is it relevant to the scoring? ATM, I have nearly three times as many spams as hams in my training set:

    % egrep '^From ' newham.old | wc -l
    93
    % egrep '^From ' newspam.old | wc -l
    267

but the hams contribute approximately the same number of unique tokens as the spams:

    >>> from spambayes import mboxutils, tokenizer
    >>> hs = set()
    >>> ss = set()
    >>> for msg in mboxutils.getmbox("newham.old"):
    ...     hs |= set(tokenizer.tokenize(msg))
    ...
    >>> for msg in mboxutils.getmbox("newspam.old"):
    ...     ss |= set(tokenizer.tokenize(msg))
    ...
    >>> len(hs)
    20360
    >>> len(ss)
    24734

Most tokens are unique to one set or the other:

    >>> len(ss & hs)
    5205
    >>> len(ss - hs)
    19529
    >>> len(hs - ss)
    15155

Skip From kenny.pitt at gmail.com Mon Jan 24 15:58:58 2005 From: kenny.pitt at gmail.com (Kenny Pitt) Date: Mon Jan 24 15:59:02 2005 Subject: [spambayes-dev] "approximately" the same size In-Reply-To: <16881.25117.750274.132042@montanaro.dyndns.org> Message-ID: <41f50d33.17f5d111.78b9.0051@smtp.gmail.com> Skip Montanaro wrote: > When we tell people not to let their ham/spam imbalance get too bad, > we are referring to the number of messages trained. There is another > way to look at this imbalance though: number of tokens generated from > each stream.
For me, ham messages are much larger on average than > spam messages. Consequently, for roughly the same number of tokens to > come from each stream, I need more spams than hams. Is there some > way to tell how this might affect scoring? Is it relevant to the > scoring? Mathematically, the total number of tokens should have no effect on the probabilities. We only count a token once per message, and we divide the number of messages that have contained the token by the total number of messages. The total number of tokens never figures into the calculation at all. It would be interesting to know, though, if this type of imbalance might skew the selection of the significant tokens that figure into the calculation of the final score. If there are significantly more ham tokens in the training, is it more likely that the 150 significant tokens chosen will also have a higher percentage of ham tokens? -- Kenny Pitt From skip at pobox.com Mon Jan 24 16:29:02 2005 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 24 17:00:55 2005 Subject: [spambayes-dev] "approximately" the same size In-Reply-To: <41f50d33.17f5d111.78b9.0051@smtp.gmail.com> References: <16881.25117.750274.132042@montanaro.dyndns.org> <41f50d33.17f5d111.78b9.0051@smtp.gmail.com> Message-ID: <16885.5182.803670.71266@montanaro.dyndns.org> Kenny> Mathematically, the total number of tokens should have no effect Kenny> on the probabilities. We only count a token once per message, Kenny> and we divide the number of messages that have contained the Kenny> token by the total number of messages. The total number of Kenny> tokens never figures into the calculation at all. Still, it seems to me the number of unique tokens seen (and the overlap between those seen in ham and those in spam) must have some effect on the effectiveness of the algorithm. The more disjoint the set of tokens appearing in hams and spams are the easier it should be to distinguish ham from spam. 
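Kenny's description of the counting (each token counted once per message, then divided by the number of messages) can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the SpamBayes classifier code: `message_fractions` is a hypothetical helper, the data is made up, and it leaves out whatever further adjustments the real classifier applies.

```python
from collections import Counter


def message_fractions(messages):
    """Map each token to the fraction of messages that contain it."""
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))  # set(): a token counts once per message
    total = len(messages)
    return {tok: n / total for tok, n in counts.items()}


# Toy data: the long message repeats "meeting", but it still counts once.
ham = [["meeting", "meeting", "meeting", "agenda"], ["lunch"]]
print(message_fractions(ham)["meeting"])
```

The long message contributes more distinct tokens to the table, but it does not raise the fraction recorded for any single token, which is the point being made about message size.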
If there are 1000 tokens that appear in ham and 100 tokens that appear in spam, is it more likely that the intersection of the two approximates the set of spam tokens? Kenny> It would be interesting to know, though, if this type of Kenny> imbalance might skew the selection of the significant tokens that Kenny> figure into the calculation of the final score. If there are Kenny> significantly more ham tokens in the training, is it more likely Kenny> that the 150 significant tokens chosen will also have a higher Kenny> percentage of ham tokens? That's sort of what I was thinking (though my thought was not as well-formed). So, getting back to the original problem. Assume I have tried hard to maintain a nearly 1:1 ham:spam ratio. Given that most hams are much larger than most spams, there will be many more tokens found in hams than tokens found in spams. Most tokens seen in spams will have been seen in some hams, thus lessening their effectiveness. A corollary thought: Given H and S, the sets of ham and spam tokens, respectively, what would be the effect of simply deleting their intersection from the database? Skip From t-meyer at ihug.co.nz Tue Jan 25 23:05:08 2005 From: t-meyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 25 23:05:31 2005 Subject: [spambayes-dev] 1.0.2 and 1.1a1 Message-ID: If no-one objects, I'd like to put 1.0.2 out tomorrow (all going well). There are a couple of web interface bugs that are particularly annoying (changing config problem and display when the config path contains one of '<>&'), plus various minor fixes. It should also work with Python 2.4, and (this is the main change for Outlook users) the binary will be built with 2.4. At this point, I'm not particularly interested in continuing work for a 1.0.3 release (although if 1.1 takes a long time, then maybe that will change), so this would be the last in the 1.0.x line. If you have any changes you want in 1.0.2, please let me know so I can hold off the build.
Otherwise I'll put it together later today and put a release up on sourceforge and announce it here so that maybe one or two people can give it a go (there are so few changes that it ought to be reasonably safe) before a proper announcement tomorrow. After that, I'd like to try and get a 1.1a1 out the door so that people can try it out (there are heaps of changes - checkins that date from May last year!). I had hoped to get this out by the end of the month, but that's rapidly approaching...my rough plan is this:

1.1a1: End of January (31st maybe, since that's a holiday here)
1.1a2: End of February
1.1b1: Mid March (assuming both alphas go well)
1.1rc1: Start of April
1.1: Early April

Are there any things not done yet that people would like to see in 1.1? If I recall the process correctly, when 1.1b1 goes out we consider the trunk frozen (apart from bugfixes) and once 1.1 is out we cut a branch for it and unfreeze the trunk.

Things I'm planning on doing before 1.1a1, if I can find the time:

* I'm not 100% sure that the ZODBClassifier and (particularly) ZEOClassifier storage classes are working exactly as they should.

Things I'm planning on doing before 1.1a2, if I can find the time:

* Finishing up getting at least basic unit test scripts done for all of the spambayes package.
* For the binary, updating the installer script to include sb_imapfilter and sb_pop3dnd and a few minor changes that have been suggested by people.
* It would be great to have at least one translation completely done. We currently have most of French and some of Spanish. All I can do is check the stuff in, of course.

Any others? Or any suggestions for changes to the proposed schedule?
=Tony.Meyer From tameyer at ihug.co.nz Wed Jan 26 03:28:40 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 26 03:28:45 2005 Subject: [spambayes-dev] More stupid beats smart timcv.py results In-Reply-To: Message-ID: [Tony Meyer, last week] > The latter was prompted by a comment in JGC's latest > newsletter (though I'm sure I've seen this somewhere before, > too). To avoid deliberate misspellings and the so-called > 'cambridge effect' you replace each (or generate a new) token > that is made up of the letters in the original token sorted > into a constant order (e.g. alphabetical). So "god" becomes > "dgo", but so does "dog". At the MIT Spam Conference John mentioned (offhand, regarding something else) that POPFile does this just for words that are longer than 6 characters. Since I already had the stuff at hand, I gave this a go, in case the poor results were just from those short words. Compared to all-defaults, fp and fn were unchanged and unsure rose 0.03%. So the verdict is unchanged. (I can post cmp.py or table.py results if anyone is interested, but there's nothing really interesting here). =Tony.Meyer From tameyer at ihug.co.nz Wed Jan 26 05:22:54 2005 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 26 05:23:00 2005 Subject: [spambayes-dev] 1.0.2 and 1.1a1 In-Reply-To: Message-ID: [Tony Meyer] > I'll put it together > later today and put a release up on sourceforge and announce > it here so that maybe one or two people can give it a go > (there are so few changes that it ought to be reasonably > safe) before a proper announcement tomorrow. I've done this. It seems unlikely that anyone is really after anything extra to go into 1.0.2, but if there is, speak up and I'll redo the builds tomorrow (NZ time). Otherwise, please feel free to download the source or binary and give it a spin. I've got some more testing of it myself to do tomorrow, and barring any problems cropping up, I'll put out an announcement probably tomorrow afternoon. 
=Tony.Meyer From kenny.pitt at gmail.com Wed Jan 26 16:05:12 2005 From: kenny.pitt at gmail.com (Kenny Pitt) Date: Wed Jan 26 16:05:29 2005 Subject: [spambayes-dev] 1.0.2 and 1.1a1 In-Reply-To: Message-ID: <41f7b1aa.5556a87b.7801.0383@smtp.gmail.com> Tony Meyer wrote: > After that, I'd like to try and get a 1.1a1 out the door so that > people can try it out (there are heaps of changes - checkins that > date from May last year!). I had hoped to get this out by the end of > the month, but that's rapidly approaching...my rough plan is this: > > 1.1a1: End of January (31st maybe, since that's a holiday here) > 1.1a2: End of February > 1.1b1: Mid March (assuming both alphas go well) > 1.1rc1: Start of April > 1.1: Early April > > Are there any things not done yet that people would like to see in > 1.1? Just a couple of minor ones. Unfortunately, I'm on a tight deadline at work until at least the end of January. > Things I'm planning on doing before 1.1a1, if I can find the time: I'd like to get a tab added to the Manager for configuring the notification sounds. It may be just edit boxes for the filenames at first (no browse), but I think we need something if general users are going to be able to take advantage of it. > Things I'm planning on doing before 1.1a2, if I can find the time: Given that we haven't been able to solve the bsddb corruption/run-recovery problem, I wanted to try something with automatic backups to at least provide a recovery mechanism. My idea was to detect if the database is still good when shutting down, and if so make a backup copy of the .db file. When starting up, if the training data is corrupt then try to restore this "known good" backup and then attempt the open again. > * It would be great to have at least one translation completely > done. We currently have most of French and some of Spanish. All I > can do is check the stuff in, of course. 
Speaking of translations, have you been able to build a binary since the translation support was added? Whenever I try to run setup_all.py, I get the following error: """ running py2exe running build_py error: package directory 'spambayes\resources' does not exist """ -- Kenny Pitt From t-meyer at ihug.co.nz Thu Jan 27 03:55:15 2005 From: t-meyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 27 03:56:20 2005 Subject: [spambayes-dev] 1.0.2 and 1.1a1 In-Reply-To: Message-ID: As most will probably have seen, I've done the 1.0.2 release. I smoke-tested the source (I've run it lots of times recently, and there are few changes) and did a bit of testing with the binary (turns out using Python 2.4 wasn't as simple as expected). [Tony Meyer] >> Are there any things not done yet that people would like to see in >> 1.1? [Kenny Pitt] > Just a couple of minor ones. Unfortunately, I'm on a tight > deadline at work until at least the end of January. Even with my rough plan there's a month between a1 and a2, so plenty of time to add more :) I think it'd be good to get something out though, since the 'deeper' changes can be tested already and people can give us feedback about various changes. > I'd like to get a tab added to the Manager for configuring > the notification sounds. It may be just edit boxes for the > filenames at first (no browse), but I think we need something > if general users are going to be able to take advantage of it. +1. Browse shouldn't be too hard - you should be able to just call CreateFileDialog, right? I agree that this would be nice before a1, so that people get a feel for the interface. I can hold off until it's done (I'm fairly busy with stuff at the moment, too). > Given that we haven't been able to solve the bsddb > corruption/run-recovery problem, I've completely given up and am happy in my ZODB.FileStorage world <0.1 wink>. 
I'd like to include the necessary ZODB stuff in a a1 build so that people can give ZODB a go if they would like, assuming that it doesn't bloat the installer too much. However, there still isn't a ZODB release for Python 2.4, so I'd have to build it myself, which I'm not too sure about. Maybe one is coming soon. (Tim would know, I expect). > I wanted to try something > with automatic backups to at least provide a recovery > mechanism. My idea was to detect if the database is still > good when shutting down, How? Back when I was playing with things (and I think Richie found this too, but it's a long time back) I found that you could corrupt the database and the RUN_RECOVERY error wouldn't be triggered until several accesses later. > and if so make a backup copy of the > .db file. When starting up, if the training data is corrupt > then try to restore this "known good" backup and then attempt > the open again. Making a backup could be worth a go, though. Maybe this ought to be an (Outlook experimental) option? It will almost double the amount of disk space that the plug-in requires. > Speaking of translations, have you been able to build a > binary since the translation support was added? Whenever I > try to run setup_all.py, I get the following error: > > """ > running py2exe > running build_py > error: package directory 'spambayes\resources' does not exist """ That's unrelated to the translation stuff. It's a distutils change, IIRC - I checked in a fix to the 1.0.x branch, but never got around to checking it into HEAD. I'll do that now. =Tony.Meyer From kenny.pitt at gmail.com Thu Jan 27 17:40:39 2005 From: kenny.pitt at gmail.com (Kenny Pitt) Date: Thu Jan 27 17:40:53 2005 Subject: [spambayes-dev] 1.0.2 and 1.1a1 In-Reply-To: Message-ID: <41f9198c.367236f1.373e.0927@smtp.gmail.com> Tony Meyer wrote: >> I'd like to get a tab added to the Manager for configuring >> the notification sounds. 
It may be just edit boxes for the >> filenames at first (no browse), but I think we need something >> if general users are going to be able to take advantage of it. > > +1. Browse shouldn't be too hard - you should be able to just call > CreateFileDialog, right? Unfortunately, that's kind of the sticking point. CreateFileDialog comes from win32ui, not win32gui, and win32ui requires the MFC dlls. After all the work that was going on around the time I joined the project to eliminate the need for win32ui, I hate to add it all back just for one options page. >> Given that we haven't been able to solve the bsddb >> corruption/run-recovery problem, > > I've completely given up and am happy in my ZODB.FileStorage world > <0.1 wink>. I'd like to include the necessary ZODB stuff in a a1 > build so that people can give ZODB a go if they would like, assuming > that it doesn't bloat the installer too much. I played around a little with a SQLite classifier also, which is very lightweight to install. It was pretty slow when working with the 2.8 version because of the way we do our commits, but I'd like to give it another go with the 3.0 version of SQLite to see if the situation has improved. Maybe I can get this in for 1.1a2 so that it could get some testing. Right now, I don't know if it would really be any more stable than bsd or not. >> I wanted to try something >> with automatic backups to at least provide a recovery >> mechanism. My idea was to detect if the database is still >> good when shutting down, > > How? Back when I was playing with things (and I think Richie found > this too, but it's a long time back) I found that you could corrupt > the database and the RUN_RECOVERY error wouldn't be triggered until > several accesses later. That could definitely be a problem. It's still just an idea at this point, so figuring out how to make it work is still a ways off. 
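The backup idea discussed in this exchange (copy the database aside at shutdown if it still looks good, restore that copy at startup if the open fails) could be sketched roughly as below. Everything here is an assumption for illustration: `is_good` and `open_db` are hypothetical callables supplied by the caller, the `.bak` suffix is made up, and, as noted in the thread, reliably detecting bsddb corruption is exactly the unsolved part.

```python
import os
import shutil


def backup_if_good(db_path, is_good):
    """On shutdown: copy the database aside only if it still checks out."""
    if is_good(db_path):
        shutil.copy2(db_path, db_path + ".bak")
        return True
    return False


def open_with_recovery(db_path, open_db):
    """On startup: try to open; on failure restore the backup and retry once."""
    try:
        return open_db(db_path)
    except Exception:
        backup = db_path + ".bak"
        if not os.path.exists(backup):
            raise  # nothing to fall back on, so report the original error
        shutil.copy2(backup, db_path)
        return open_db(db_path)
```

Keeping the check and the open as parameters leaves the storage backend out of the sketch, so the same shape would apply to a bsddb, ZODB or SQLite store.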
My idea was to check only when shutting down, so unless the corruption was caused by the shutdown itself we would probably be OK. It should, at the very least, significantly reduce the window for total loss of training data.

>> and if so make a backup copy of the
>> .db file. When starting up, if the training data is corrupt
>> then try to restore this "known good" backup and then attempt
>> the open again.
>
> Making a backup could be worth a go, though. Maybe this ought to be
> an (Outlook experimental) option? It will almost double the amount
> of disk space that the plug-in requires.

Yeah, I definitely want to make it optional. It not only increases the disk space, but also increases the time it takes to shut down.

--
Kenny Pitt

From tameyer at ihug.co.nz Fri Jan 28 03:37:44 2005
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri Jan 28 03:37:48 2005
Subject: [spambayes-dev] 1.0.2 and 1.1a1
In-Reply-To:
Message-ID:

[Tony Meyer]
>> Browse shouldn't be too hard - you should be able to just call
>> CreateFileDialog, right?

[Kenny Pitt]
> Unfortunately, that's kind of the sticking point.
> CreateFileDialog comes from win32ui, not win32gui, and
> win32ui requires the MFC dlls.

Ah - I didn't realise that CreateFileDialog was in win32ui and not win32gui. Does this mean all common dialogs need win32ui, or can we just call them (CFileDialog) explicitly ourselves? (With something like win32api.LoadLibrary).

> I played around a little with a SQLite classifier also [...]
> Right now, I don't know if it would really be any
> more stable than bsd or not.

That's the trouble - I never get any bsddb corruption problems anymore, so I have no idea if I would trigger problems with other storage methods either. It would be good if we could have a range available at least in the 1.1 alphas anyway. Maybe in 1.2 the storage method could even be exposed via the GUI (in Advanced options).
=Tony.Meyer

From TBrigham at venocoinc.com Fri Jan 28 00:11:31 2005
From: TBrigham at venocoinc.com (TBrigham@venocoinc.com)
Date: Fri Jan 28 03:44:01 2005
Subject: [spambayes-dev] Multiple Windows User Profiles
Message-ID: <6AC011B1CB7FD411BC78001083FC582602183627@venocosrv.venocoinc.com>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: javaacro.gif
Type: image/gif
Size: 44668 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20050127/28d2620c/javaacro-0001.gif

From t-meyer at ihug.co.nz Fri Jan 28 04:10:46 2005
From: t-meyer at ihug.co.nz (Tony Meyer)
Date: Fri Jan 28 04:10:51 2005
Subject: [spambayes-dev] 1.0.2 and 1.1a1
In-Reply-To:
Message-ID:

[Kenny Pitt]
>> I'd like to get a tab added to the Manager for configuring the
>> notification sounds. [...]

[Tony Meyer]
> I can hold off until it's done (I'm fairly busy with stuff at the
> moment, too).

BTW, since I used up the time I would have spent doing the remaining things that I personally want done for 1.1a1 putting together 1.0.3, it's fairly likely that I won't get to 1.1a1 until the end of next week.

=Tony.Meyer

From kenny.pitt at gmail.com Fri Jan 28 17:39:03 2005
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Fri Jan 28 17:39:10 2005
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/windows spambayes.iss, 1.18, 1.19
In-Reply-To:
Message-ID: <41fa6aaa.7f1eb6f4.388b.0d46@smtp.gmail.com>

Tony Meyer wrote:
> Modified Files:
> spambayes.iss
>
> + UsagePage := CreateInputOptionPage(UserPage.ID,
> +   'Personal Information', 'How will you use My Program?',
> +   'Please specify how you would like to use My Program, then click Next.',
> +   True, False);
> + UsagePage.Add('Light mode (no ads, limited functionality)');
> + UsagePage.Add('Sponsored mode (with ads, full functionality)');
> + UsagePage.Add('Paid mode (no ads, full functionality)');

Dare I even ask what this stuff is all about?
BTW, py2exe doesn't put MSVCR71.dll in the dist/bin by default so the InnoSetup script won't compile initially. Is copying MSVCR71.DLL to py2exe/dist/bin just a manual step that needs to be done before running Inno? If so, we may want to include that in README-DEVEL.txt.

--
Kenny Pitt

From kenny.pitt at gmail.com Fri Jan 28 17:39:03 2005
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Fri Jan 28 17:39:11 2005
Subject: [spambayes-dev] 1.0.2 and 1.1a1
In-Reply-To:
Message-ID: <41fa6aac.2be75911.388b.0d47@smtp.gmail.com>

Tony Meyer wrote:
> [Tony Meyer]
>>> Browse shouldn't be too hard - you should be able to just call
>>> CreateFileDialog, right?
>
> [Kenny Pitt]
>> Unfortunately, that's kind of the sticking point.
>> CreateFileDialog comes from win32ui, not win32gui, and
>> win32ui requires the MFC dlls.
>
> Ah - I didn't realise that CreateFileDialog was in win32ui and not
> win32gui. Does this mean all common dialogs need win32ui, or can we
> just call them (CFileDialog) explicitly ourselves? (With something
> like win32api.LoadLibrary).

There's a win32gui.GetOpenFileName() method that provides a lower-level interface to the open dialog. It requires you to manually construct the Win32 API OPENFILENAME structure and pass it in as a string, and I haven't had time to figure out how to do that yet. If Mark is still listening to this list, maybe he will chime in with a pointer to some more info on this.

--
Kenny Pitt

From tvarnedoe at earthlink.net Fri Jan 28 18:55:43 2005
From: tvarnedoe at earthlink.net (T Varnedoe)
Date: Fri Jan 28 18:55:47 2005
Subject: [spambayes-dev] Spambayes single installation for multiple mail clients
Message-ID: <20050128175546.3A18F1E4004@bag.python.org>

Question: Is there a way to install a single instance of Spambayes and have it work with 2 mail clients installed on the same computer? I.e. Outlook Express v2003 (Corporate email) client and Outlook 2003 (Personal email) clients.

Thanks in advance for your time and efforts on my behalf.
Best Regards
Tom V
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20050128/106a73fe/attachment.html

From kenny.pitt at gmail.com Fri Jan 28 19:09:56 2005
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Fri Jan 28 19:10:00 2005
Subject: [spambayes-dev] Spambayes single installation for multiple mail clients
In-Reply-To: <20050128175546.3A18F1E4004@bag.python.org>
Message-ID: <41fa7ff5.3b11a490.7801.192e@smtp.gmail.com>

Outlook Express and Outlook process mail in entirely different ways. Although the same core SpamBayes classification engine is used, different SpamBayes applications are needed to get access to the incoming mail and provide training and configuration.

It is possible to configure both of the SpamBayes applications to use the same training database, but this is not necessarily a good idea. The Outlook Addin and the POP3 or IMAP applications for Outlook Express have slightly different views of the incoming mail which will have an effect on the spam clues that are generated by each. This could potentially have a negative impact on your accuracy. You would also need to make sure that you never run both SpamBayes applications at the same time because it can cause your training data to become corrupted if two applications try to update the same database file at the same time.

--
Kenny Pitt

_____

From: spambayes-dev-bounces@python.org [mailto:spambayes-dev-bounces@python.org] On Behalf Of T Varnedoe
Sent: Friday, January 28, 2005 12:56 PM
To: spambayes-dev@python.org
Subject: [spambayes-dev] Spambayes single installation for multiple mail clients

Question: Is there a way to install a single instance of Spambayes and have it work with 2 mail clients installed on the same computer? I.e. Outlook Express v2003 (Corporate email) client and Outlook 2003 (Personal email) clients.

Thanks in advance for your time and efforts on my behalf.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20050128/180d5cff/attachment.htm

From tameyer at ihug.co.nz Sun Jan 30 23:15:52 2005
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Jan 30 23:15:56 2005
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/windows spambayes.iss, 1.18, 1.19
In-Reply-To:
Message-ID:

[Tony Meyer]
>> Modified Files:
>> spambayes.iss
>>
>> + UsagePage := CreateInputOptionPage(UserPage.ID,
>> +   'Personal Information', 'How will you use My Program?',
>> +   'Please specify how you would like to use My Program, then click Next.',
>> +   True, False);
>> + UsagePage.Add('Light mode (no ads, limited functionality)');
>> + UsagePage.Add('Sponsored mode (with ads, full functionality)');
>> + UsagePage.Add('Paid mode (no ads, full functionality)');

[Kenny Pitt]
> Dare I even ask what this stuff is all about?

Oops. As you've guessed, I had decided to try and update spambayes.iss for Inno 5.x, and accidentally checked that stuff in. I'll revert it soonish.

> BTW, py2exe doesn't put MSVCR71.dll in the dist/bin by
> default so the InnoSetup script won't compile initially. Is
> copying MSVCR71.DLL to py2exe/dist/bin just a manual step
> that needs to be done before running Inno? If so, we may
> want to include that in README-DEVEL.txt.

I'm still trying to get my head around what to do with msvcr71.dll. Thomas Heller on the py2exe-users list said that he thinks that (in his IANAL opinion) you need a license to redistribute msvcr71.dll. If that's the case, then we can't include it with SpamBayes (I'll do a 1.0.4 with Python 2.3, and 1.1a1 can be built with 2.3 as well), AFAICT.

If it is legit to include it, then we need to figure out where to source it. Either it's not distributed with Python or the Python install puts it in windows\system32 (I haven't had a chance to check).
(Thomas's (again, IANAL) opinion was that it was legit for Python to redistribute the dll. There was some discussion of this a while back on python-dev, I believe). If Python does install it, then we can just source it from wherever it gets put (the setup_all.py script can do this). If Python doesn't install it, but we are going to, then a manual copy is probably the only option, and we'll just have to update README-DEVEL.txt.

Using Python 2.4 has turned out to be a right PITA, really, and I wish I had just stuck with 2.3. It is tempting to give up using 2.4, and just include email 3.0 instead (IIRC that's a reasonably simple option), since that's the primary reason for using 2.4.

=Tony.Meyer

From tameyer at ihug.co.nz Sun Jan 30 23:34:17 2005
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Jan 30 23:34:22 2005
Subject: [spambayes-dev] 1.0.2 and 1.1a1
In-Reply-To:
Message-ID:

> There's a win32gui.GetOpenFileName() method that provides a
> lower-level interface to the open dialog. It requires you to
> manually construct the Win32 API OPENFILENAME structure and
> pass it in as a string, and I haven't had time to figure out
> how to do that yet.

There's an example here:

It's for Windows CE, I think, but is probably more-or-less the same :)

=Tony.Meyer
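
For reference, the "back up on shutdown, restore on startup" idea from the 1.0.2 and 1.1a1 thread could look roughly like the sketch below. It is only an illustration in present-day Python, not SpamBayes code: backup_on_shutdown, restore_on_startup, the ".bak" suffix and the is_healthy callback are all hypothetical names. As noted in the thread, a reliable health check for a bsddb file is the unsolved part, so the caller has to supply one.

```python
import os
import shutil


def backup_on_shutdown(db_path, is_healthy):
    """Copy db_path to db_path + '.bak', but only if it passes the check.

    is_healthy is a caller-supplied (hypothetical) function that takes a
    path and returns True when the database looks usable.
    """
    if not is_healthy(db_path):
        # Never overwrite the last known-good backup with a bad file.
        return False
    tmp_path = db_path + ".bak.tmp"
    shutil.copy2(db_path, tmp_path)
    # Swap the finished copy into place so a crash part-way through the
    # copy cannot leave a truncated backup behind.
    os.replace(tmp_path, db_path + ".bak")
    return True


def restore_on_startup(db_path, is_healthy):
    """Return 'ok', 'restored' or 'missing' before the real open is tried."""
    if os.path.exists(db_path) and is_healthy(db_path):
        return "ok"
    backup_path = db_path + ".bak"
    if os.path.exists(backup_path):
        # Put the known-good copy back; the caller then attempts the
        # open again, as described in the thread.
        shutil.copy2(backup_path, db_path)
        return "restored"
    return "missing"
```

The backup roughly doubles the disk space used and adds a file copy to shutdown, which is why the thread suggests making it an optional (experimental) setting.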