From jm at jmason.org Sat Mar 1 13:45:52 2003 From: jm at jmason.org (Justin Mason) Date: Sat Mar 1 08:46:05 2003 Subject: [Spambayes] Graph results In-Reply-To: Message from "T. Alexander Popiel" <20030301051552.4DB592DE8C@cashew.wolfskeep.com> Message-ID: <20030301134557.15F1216F16@jmason.org> Alexander -- nice work! Thanks for investigating this... > 2. Spambayes continues to improve for a couple months, > but I'm starting to see an increase in errors after > about 4-5 months. I don't know why this is; it might > be because spam is mutating, or it might be because > my definition of spam has been mutating. Spam has definitely been mutating heavily in the last 4 months. > Anyway, the next thing for me to really look at is the effect > of aging... As in expiration of tokens? I thought SB didn't use that? Or do you mean validity of trained results from >3 months ago... --j. From mhammond at skippinet.com.au Sun Mar 2 02:29:26 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sat Mar 1 10:30:28 2003 Subject: [Spambayes] "delete as spam" gives error in Outlook XP In-Reply-To: <20030301065159.47330.qmail@web41305.mail.yahoo.com> Message-ID: This is a known bug - please define a folder for filtering your spam mail to - even if filtering is not enabled. Mark. > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of Chris Lopes > Sent: Saturday, 1 March 2003 5:52 PM > To: spambayes@python.org > Subject: [Spambayes] "delete as spam" gives error in Outlook XP > > > Hello, > > I am running Outlook 2002 SP-2 on Windows XP Pro SP1. > I have spambayes 1.0a2 installed, along with python.org's python > 2.2.2 with win32all-150 > installed. > In order to install the add-in for outlook, I just ran addin.py > from spambayes' outlook2000 > directory. The plugin installed fine, and I was able to train > spambayes on a set of both spam and > non-spam emails just fine. > > However, "Delete As Spam" does not work. 
It gives the following > error visible from > PythonWin's Trace Collector Debugging Tool when I click "Delete As Spam": > pythoncom error: Python error invoking COM method. > Traceback (most recent call last): > File "D:\Python22\lib\site-packages\win32com\server\policy.py", > line 275, in _Invoke_ > return self._invoke_(dispid, lcid, wFlags, args) > File "D:\Python22\lib\site-packages\win32com\server\policy.py", > line 280, in _invoke_ > return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, > None, None) > File "D:\Python22\lib\site-packages\win32com\server\policy.py", > line 510, in _invokeex_ > return apply(func, args) > File "D:\spambayes-1.0a2\Outlook2000\addin.py", line 305, in OnClick > spam_folder = msgstore.GetFolder(spam_folder_id) > File "D:\spambayes-1.0a2\Outlook2000\msgstore.py", line 223, in > GetFolder > folder_id = self.NormalizeID(folder_id) > File "D:\spambayes-1.0a2\Outlook2000\msgstore.py", line 185, in > NormalizeID > assert type(item_id) in [type(''), type(u'')], "What kind of > ID is '%r'?" % (item_id,) > exceptions.AssertionError: What kind of ID is 'None'? > > Please help > > > __________________________________________________ > Do you Yahoo!? > Yahoo! Tax Center - forms, calculators, tips, more > http://taxes.yahoo.com/ > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes From popiel at wolfskeep.com Sat Mar 1 07:47:19 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Mar 1 10:47:25 2003 Subject: [Spambayes] Graph results In-Reply-To: Message from jm@jmason.org (Justin Mason) of "Sat, 01 Mar 2003 13:45:52 GMT." <20030301134557.15F1216F16@jmason.org> References: <20030301134557.15F1216F16@jmason.org> Message-ID: <20030301154719.EDE3A2DEB4@cashew.wolfskeep.com> In message: <20030301134557.15F1216F16@jmason.org> jm@jmason.org (Justin Mason) writes: > >Alexander -- nice work! Thanks for investigating this... Heh. 
It's just a way to use up even more CPU-hours, in the same spirit as was prevalent last October... ;-) >> 2. Spambayes continues to improve for a couple months, >> but I'm starting to see an increase in errors after >> about 4-5 months. I don't know why this is; it might >> be because spam is mutating, or it might be because >> my definition of spam has been mutating. > >Spam has definitely been mutating heavily in the last 4 months. Oh, definitely. However, since the test runs were training throughout the data period, one would hope that they'd have picked up on the mutations without a loss of accuracy. (Of course, some of the mutations have been to include features that SB doesn't recognize at all (s p a c e d o u t w o r d s), which could well be the source of the trouble.) I'm just worried that having too much information about past forms of spam may be interfering with recognition of current spam (through the auspices of spam probability deflation due to the probabilities being based on fraction of known spams containing any feature... so as more spams are known with differing features, the probability for any given feature decreases). Hence my interest in aging. >> Anyway, the next thing for me to really look at is the effect >> of aging... > >As in expiration of tokens? I thought SB didn't use that? >Or do you mean validity of trained results from >3 months ago... Standard SB doesn't, you're right. On the other hand, my personal installation (not what I ran tests with!) expires messages after 120 days. I'm curious to see if this is actually the boon I suspect it is. - Alex From popiel at wolfskeep.com Sat Mar 1 09:12:46 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Sat Mar 1 12:12:49 2003 Subject: [Spambayes] Graphs on my website Message-ID: <20030301171246.64E992DEB4@cashew.wolfskeep.com> Those who want to see my pretty graphs without waiting for the moderator approval of my .png-laden posting can go to http://www.wolfskeep.com/~popiel/spambayes/incremental to see all the pretty pictures (along with a bunch of the raw and semi-cooked data files). - Alex From skip at pobox.com Sat Mar 1 12:05:09 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 1 13:05:13 2003 Subject: [Spambayes] Graphs on my website In-Reply-To: <20030301171246.64E992DEB4@cashew.wolfskeep.com> References: <20030301171246.64E992DEB4@cashew.wolfskeep.com> Message-ID: <15968.63061.877833.567556@montanaro.dyndns.org> Alex, After reading your note and looking at the graphs on your website I have a couple questions: 1. For the dense among us can you define "perfect" and "corrected" training? 2. Did you adjust your spam/ham cutoffs from the default? 3. Do you have any measure of how the unsure stuff broke down between ham and spam? Thx, Skip From popiel at wolfskeep.com Sat Mar 1 10:23:43 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Mar 1 13:23:47 2003 Subject: [Spambayes] Graphs on my website In-Reply-To: Message from Skip Montanaro <15968.63061.877833.567556@montanaro.dyndns.org> References: <20030301171246.64E992DEB4@cashew.wolfskeep.com> <15968.63061.877833.567556@montanaro.dyndns.org> Message-ID: <20030301182343.C01F22DEB4@cashew.wolfskeep.com> In message: <15968.63061.877833.567556@montanaro.dyndns.org> Skip Montanaro writes: > >After reading your note and looking at the graphs on your website I have a >couple questions: > > 1. For the dense among us can you define "perfect" and "corrected" > training? Perfect trains immediately after scoring with the _actual_ classification. 
Corrected trains immediately after scoring with the _guessed_
classification, then fixes everything to _actual_ at the end of the day.
(This was interesting to me because it somewhat closely models my actual
usage, given my nightly retrains.)

> 2. Did you adjust your spam/ham cutoffs from the default?

No.

> 3. Do you have any measure of how the unsure stuff broke down between ham
>    and spam?

In the raw output, yes, though I didn't graph it. Some rough cumulative
averages:

perfect:    42 ham unsure and 290 spam unsure
corrected:  55 ham unsure and 330 spam unsure
fpfnunsure: 80 ham unsure and 1200 spam unsure

- Alex

From bill at parducci.net Sat Mar 1 11:51:15 2003
From: bill at parducci.net (bill parducci)
Date: Sat Mar 1 14:51:19 2003
Subject: [Spambayes] train on demand
Message-ID: <3E610F33.8090507@parducci.net>

not wanting to leave mail laying around for a day whilst i wait for the
daily mboxtrain.py cron job to fire off i came up with the following scheme
for being able to initiate retraining via e-mail:

1. modification for .procmailrc, inserting this above the recipe that
initiates hammiefilter.py:

:0
* ^Subject:.*mboxtrain.[MyKeyCode]
{
  :0
  * ^From.*[MyEmailAddress]
  |${HOME}/retrain.sh
}

2. spiff up the shell script (retrain.sh) that calls mboxtrain.py to send
back a note telling me that the retraining is done and to output the
information to a log file that can be read later (would have included it in
the note, but the way that mboxtrain.py outputs the message counts it makes
for a very unwieldy message).
#!/bin/sh
mailhome="${HOME}/mail"
user=`basename ${HOME}`
inbox="/var/spool/mail/$user"
xhost=`hostname`
xdomain=`dnsdomainname`
/opt/spambayes/mboxtrain.py -d /home/$user/.hammiedb -s $mailhome/spam -g $inbox -g $mailhome/foo -g $mailhome/bar -g $mailhome/blah -g $mailhome/oink >${HOME}/retrain.out
/usr/sbin/sendmail -f devnull@$xhost.$xdomain $user <

Bugs item #695632, was opened at 2003-03-01 16:48
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Richard Scott (rich1)
Assigned to: Nobody/Anonymous (nobody)
Summary: MySQL Digest Causes Spambayes to Crash

Initial Comment:
The main mysql e-mail list (digest version) and the mysql bugs e-mail list
(digest version) always cause Spambayes to crash. It appears that the error
occurs in Generator.py. Here is the output:

Training ham (/home/richard/Mail/inbox):
Reading as MH mailbox
/home/richard/Mail/inbox/2
/home/richard/Mail/inbox/5
/home/richard/Mail/inbox/6
/home/richard/Mail/inbox/724
/home/richard/Mail/inbox/29
/home/richard/Mail/inbox/751
Traceback (most recent call last):
  File "/home/richard/spambayes/mboxtrain.py", line 278, in ?
main() File "/home/richard/spambayes/mboxtrain.py", line 265, in main train(h, g, False, force) File "/home/richard/spambayes/mboxtrain.py", line 207, in train mhdir_train(h, path, is_spam, force) File "/home/richard/spambayes/mboxtrain.py", line 190, in mhdir_train f.write(msg.as_string()) File "/usr/lib/python2.2/site-packages/email/Message.py", line 107, in as_string g.flatten(self, unixfrom=unixfrom) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten self._write(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write self._dispatch(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch meth(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 243, in _handle_multipart g.flatten(part, unixfrom=False) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten self._write(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write self._dispatch(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch meth(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 212, in _handle_text raise TypeError, 'string payload expected: %s' % type(payload) TypeError: string payload expected: ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702 From tim_one at email.msn.com Sat Mar 1 15:23:42 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sat Mar 1 15:24:36 2003 Subject: [Spambayes] Graphs on my website In-Reply-To: <20030301171246.64E992DEB4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel > Those who want to see my pretty graphs without waiting > for the moderator approval of my .png-laden posting I approved it around midnight, so anyone who hasn't gotten it yet probably isn't going to. It was held for approval merely due to sheer size. 
After approving it, it bounced back from a number of mailing-list recipients because braindead "virus detection" gimmicks thought it was a virus. A typical bounce report complained that you were trying to hide the real nature of the attachments by giving them two extensions ("whatever.mtv.png"). Software . From klassa at nc.rr.com Sun Mar 2 09:42:45 2003 From: klassa at nc.rr.com (klassa@nc.rr.com) Date: Sun Mar 2 09:42:35 2003 Subject: [Spambayes] Outlook plugin doesn't want to filter while on dialup Message-ID: <9504.1046616165@qwop.com> I'm visiting my folks, and am stuck with dialup. Oddly, the Outlook plugin doesn't seem to want to filter. Before I left, while on broadband, life was good. Here, every piece of spam gets through untouched. Did the Outlook plugin notice the crappy connection speed :-) and punt? Confused, John From tim.one at comcast.net Sun Mar 2 12:40:10 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Mar 2 12:40:40 2003 Subject: [Spambayes] Outlook plugin doesn't want to filter while on dialup In-Reply-To: <9504.1046616165@qwop.com> Message-ID: [klassa@nc.rr.com] > I'm visiting my folks, and am stuck with dialup. Oddly, the Outlook > plugin doesn't seem to want to filter. Before I left, while on broadband, > life was good. Here, every piece of spam gets through untouched. > > Did the Outlook plugin notice the crappy connection speed :-) and punt? It shouldn't matter, and I routinely run Outlook + spambayes via cable modem and via dialup on the same machine without trouble. IOW, I bet that when you get back to a broadband connection, it still won't work, that somehow it's turned itself off, or can't get started. Open PythonWin and then Tools -> Trace Collector Debugging Tool before starting Outlook and see if any interesting msgs appear in PythonWin's Python Trace Collector window. No msgs at all count as "interesting" too . 
From klassa at nc.rr.com Sun Mar 2 20:26:13 2003 From: klassa at nc.rr.com (klassa@nc.rr.com) Date: Sun Mar 2 20:26:00 2003 Subject: [Spambayes] Outlook plugin doesn't want to filter while on dialup In-Reply-To: Your message of "Sun, 02 Mar 2003 12:40:10 EST." Message-ID: <10310.1046654773@qwop.com> >>>>> On Sun, 2 Mar 2003, "Tim" == Tim Peters wrote: Tim> It shouldn't matter, and I routinely run Outlook + spambayes via Tim> cable modem and via dialup on the same machine without trouble. Tim> IOW, I bet that when you get back to a broadband connection, it Tim> still won't work, that somehow it's turned itself off, or can't get Tim> started. Open PythonWin and then Tim> Tools -> Trace Collector Debugging Tool Tim> before starting Outlook and see if any interesting msgs appear in Tim> PythonWin's Python Trace Collector window. No msgs at all count as Tim> "interesting" too . Output enclosed, below. This was with no mail to process, of course, but everything looks fine. What I'm noticing (now that I'm back at home, in the land of broadband... as God intended it to be :-)) is that I'm suddenly getting more false negatives. That is, SB *does* appear to be filtering, but more spam is getting through than got through just a couple of days ago. Significantly more. I can't imagine that spam changed that much just overnight. :-) I'll keep an eye on this... Weird. Thanks for the reply! 
John Outlook Spam Addin module loading SpamAddin - Connecting to Outlook Loaded bayes database from 'd:\Program Files\SpamBayes\Outlook2000\default_bayes_database.pck' Loaded message database from 'd:\Program Files\SpamBayes\Outlook2000\default_message_database.pck' Bayes database initialized with 259 spam and 598 good messages AntiSpam: Watching for new messages in folder Inbox AntiSpam: Watching for new messages in folder Spam Processing 0 missed spam in folder 'Inbox' took 0.600076ms From Paul.Moore at atosorigin.com Mon Mar 3 09:01:07 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Mon Mar 3 04:02:31 2003 Subject: [Spambayes] This message crashed Spambayes... Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D946@UKDCX001.uk.int.atosorigin.com> The attached message caused an error when being processed by the Outlook plugin, which stopped it processing the rest of my inbox. Unfortunately, I've no idea if attaching an email from Outlook will result in something readable from any other mailer (at least, in terms of diagnosing an issue like this!) If it doesn't, let me know what to do to diagnose the problem... Paul. 
PS I didn't raise a SF bug report for now, as when I saved the message as text from Outlook, it *definitely* lost any useful header info :-( Traceback info: Exception in thread Thread-5: Traceback (most recent call last): File "C:\Python22\Lib\threading.py", line 408, in __bootstrap self.run() File "C:\Python22\Lib\threading.py", line 396, in run apply(self.__target, self.__args, self.__kwargs) File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target self._DoProcess() File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 375, in _DoProcess self.filterer(self.mgr, self.progress) File "C:\Applications\Spambayes\Outlook2000\filter.py", line 88, in filterer this_dispositions = filter_folder(f, mgr, progress) File "C:\Applications\Spambayes\Outlook2000\filter.py", line 68, in filter_folder disposition = filter_message(message, mgr, all_actions) File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message prob = mgr.score(msg) File "C:\Applications\Spambayes\Outlook2000\manager.py", line 388, in score result = self.bayes.spamprob(bayes_tokenize(email), evidence) File "C:\Applications\Spambayes\spambayes\classifier.py", line 217, in chi2_spamprob clues = self._getclues(wordstream) File "C:\Applications\Spambayes\spambayes\classifier.py", line 436, in _getclues for word in Set(wordstream): File "C:\Applications\Spambayes\spambayes\compatsets.py", line 374, in __init__ self._update(iterable) File "C:\Applications\Spambayes\spambayes\compatsets.py", line 333, in _update for element in it: File "C:\Applications\Spambayes\spambayes\tokenizer.py", line 1052, in tokenize for tok in self.tokenize_headers(msg): File "C:\Applications\Spambayes\spambayes\tokenizer.py", line 1106, in tokenize_headers for x, subjcharset in email.Header.decode_header(x): File "C:\Python22\Lib\email\Header.py", line 92, in decode_header dec = email.base64MIME.decode(encoded) File "C:\Python22\Lib\email\base64MIME.py", line 
179, in decode dec = a2b_base64(s) Error: Incorrect padding << (??) ???? ??1? ? ??? ??? ???>> -------------- next part -------------- An embedded message was scrubbed... From: Subject: (??) ???? ??1? ? ??? ??? ??? Date: Sun, 2 Mar 2003 13:54:19 -0000 Size: 1689 Url: http://mail.python.org/pipermail/spambayes/attachments/20030303/52d154bd/attachment.eml From mhammond at skippinet.com.au Mon Mar 3 21:13:43 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 3 05:14:20 2003 Subject: [Spambayes] editing the project HTML Message-ID: The docs at http://spambayes.sourceforge.net/applications.html need a minor edit. I am an administrator of the group at sourceforge, but I can't work out how to edit this page. All clues gratefully accepted :) Mark. From sjoerd at acm.org Mon Mar 3 11:27:45 2003 From: sjoerd at acm.org (Sjoerd Mullender) Date: Mon Mar 3 05:27:49 2003 Subject: [Spambayes] This message crashed Spambayes... In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D946@UKDCX001.uk.int.atosorigin.com> References: <16E1010E4581B049ABC51D4975CEDB880113D946@UKDCX001.uk.int.atosorigin.com> Message-ID: <20030303102745.3F9F474EB0@indus.ins.cwi.nl> On Mon, Mar 3 2003 "Moore, Paul" wrote: > The attached message caused an error when being processed by the Outlook = > plugin, which stopped it processing the rest of my inbox. Unfortunately, = > I've no idea if attaching an email from Outlook will result in something = > readable from any other mailer (at least, in terms of diagnosing an = > issue like this!) If it doesn't, let me know what to do to diagnose the = > problem... > > Paul. > > PS I didn't raise a SF bug report for now, as when I saved the message = > as text from Outlook, it *definitely* lost any useful header info :-( I did file an SF bug report after I got a similar crash for a message that I received and after I investigated where it went wrong. See bug #696458. 
-- Sjoerd Mullender

From Paul.Moore at atosorigin.com Mon Mar 3 10:38:04 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Mon Mar 3 05:39:37 2003
Subject: [Spambayes] This message crashed Spambayes...
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D94A@UKDCX001.uk.int.atosorigin.com>

From: Sjoerd Mullender [mailto:sjoerd@acm.org]
> I did file an SF bug report after I got a similar crash for
> a message that I received and after I investigated where it
> went wrong. See bug #696458.

Ah. Thanks - this looks like it's the same issue as I saw.

Paul.

From anthony at interlink.com.au Mon Mar 3 23:05:05 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Mar 3 07:05:13 2003
Subject: [Spambayes] editing the project HTML
In-Reply-To:
Message-ID: <200303031205.h23C56211301@localhost.localdomain>

>>> "Mark Hammond" wrote
> The docs at http://spambayes.sourceforge.net/applications.html need a minor
> edit. I am an administrator of the group at sourceforge, but I can't work
> out how to edit this page. All clues gratefully accepted :)

check out the "website" repository.

From wsy at merl.com Mon Mar 3 07:55:40 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Mon Mar 3 07:55:48 2003
Subject: [Spambayes] editing the project HTML
In-Reply-To:
References:
Message-ID: <200303031255.h23CteX10043@localhost.localdomain>

   From: "Mark Hammond"

   The docs at http://spambayes.sourceforge.net/applications.html need a minor
   edit. I am an administrator of the group at sourceforge, but I can't work
   out how to edit this page. All clues gratefully accepted :)

The way I do it on CRM114 is to log directly into sourceforge via ssh
and use an editor on the offending HTML. In your case, log in like:

   ssh -l hammond spambayes.sourceforge.net

and then cd over to the spambayes HTML directory:

   cd /home/groups/s/sp/spambayes/htdocs

and then invoke the editor of your choice.
-Bill Yerazunis From noreply at sourceforge.net Mon Mar 3 02:12:50 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 3 10:22:06 2003 Subject: [Spambayes] [ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in subject Message-ID: Bugs item #696458, was opened at 2003-03-03 11:12 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Sjoerd Mullender (sjoerd) Assigned to: Nobody/Anonymous (nobody) Summary: crash in tokenizer due to bad base64 in subject Initial Comment: I got a crash in the tokenizer in the line where it does x = msg.get('subject', '') for x, subjcharset in email.Header.decode_header(x): The reason is, the subject of this particular message is Subject: *****SPAM***** =?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?= which gives a binascii.Error: Incorrect padding from binascii.a2b_base64. I am running an up-to-date spambayes and python (i.e. both fresh from CVS). 
Here is a (partial) stack trace:
  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py", line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

From noreply at sourceforge.net Mon Mar 3 02:39:39 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 3 10:22:07 2003
Subject: [Spambayes] [ spambayes-Bugs-696476 ] Manual filtering in outlook fails
Message-ID:

Bugs item #696476, was opened at 2003-03-03 11:39
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in outlook fails

Initial Comment:
When I try to run "filter now" from the outlook plugin - I get the
following trace:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes-1.0a2\Outlook2000\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "c:\Programfiler\_UTIL\spambayes-1.0a2\Outlook2000\dialogs\FilterDialog.py", line 365, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub)
  File "c:\Programfiler\_UTIL\spambayes-1.0a2\Outlook2000\manager.py", line 156, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\client\__init__.py", line 402, in __getattr__
    if
d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 OS: windows XP home Spambayes version: 1.0a2 outlook version: 2000 sp3 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702 From noreply at sourceforge.net Mon Mar 3 08:30:26 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 3 11:23:56 2003 Subject: [Spambayes] [ spambayes-Bugs-696671 ] server error attempting to review Message-ID: Bugs item #696671, was opened at 2003-03-03 16:30 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jeremy Hylton (jhylton) Assigned to: Nobody/Anonymous (nobody) Summary: server error attempting to review Initial Comment: 500 Server error Traceback (most recent call last): File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "/usr/local/bin/pop3proxy.py", line 930, in onReview messageInfo = self._makeMessageInfo(message) File "/usr/local/bin/pop3proxy.py", line 825, in _makeMessageInfo messageInfo.bodySummary = self._trimHeader(text, 200) File "/usr/local/bin/pop3proxy.py", line 623, in _trimHeader sections = email.Header.decode_header(field) File "/usr/local/lib/python2.3/email/Header.py", line 92, in decode_header dec = email.base64MIME.decode(encoded) File "/usr/local/lib/python2.3/email/base64MIME.py", line 179, in decode dec = a2b_base64(s) Error: Incorrect padding 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702 From noreply at sourceforge.net Mon Mar 3 08:41:04 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 3 11:50:17 2003 Subject: [Spambayes] [ spambayes-Bugs-696671 ] server error attempting to review Message-ID: Bugs item #696671, was opened at 2003-03-03 17:30 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702 Category: pop3proxy Group: None >Status: Closed >Resolution: Duplicate Priority: 5 Submitted By: Jeremy Hylton (jhylton) Assigned to: Nobody/Anonymous (nobody) Summary: server error attempting to review Initial Comment: 500 Server error Traceback (most recent call last): File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "/usr/local/bin/pop3proxy.py", line 930, in onReview messageInfo = self._makeMessageInfo(message) File "/usr/local/bin/pop3proxy.py", line 825, in _makeMessageInfo messageInfo.bodySummary = self._trimHeader(text, 200) File "/usr/local/bin/pop3proxy.py", line 623, in _trimHeader sections = email.Header.decode_header(field) File "/usr/local/lib/python2.3/email/Header.py", line 92, in decode_header dec = email.base64MIME.decode(encoded) File "/usr/local/lib/python2.3/email/base64MIME.py", line 179, in decode dec = a2b_base64(s) Error: Incorrect padding ---------------------------------------------------------------------- >Comment By: Sjoerd Mullender (sjoerd) Date: 2003-03-03 17:41 Message: Logged In: YES user_id=43607 Closing as duplicate of bug #696458. 
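
For anyone following the "Incorrect padding" reports above, here is a
minimal sketch of the "extra indirection" idea, written for modern Python 3
rather than the Python 2.2/2.3 email.Header module shown in the tracebacks.
The name safe_decode_header and its decoder parameter are hypothetical, not
part of SpamBayes or the email package. The fallback returns a one-element
list of (text, charset) pairs instead of the bare string, on the assumption
that callers loop over pairs as the tokenizer does.

```python
import binascii
from email.errors import HeaderParseError
from email.header import decode_header


def safe_decode_header(value, decoder=decode_header):
    """Decode an RFC 2047 header, falling back to the raw value on error.

    `decoder` is a hypothetical hook (defaulting to the standard library's
    decode_header) so the fallback path can be exercised in isolation.
    """
    try:
        return decoder(value)
    except (binascii.Error, HeaderParseError, ValueError):
        # Keep the same shape as decode_header's result: a list of
        # (text, charset) pairs, so "for text, charset in ..." still works.
        return [(value, None)]


if __name__ == "__main__":
    for text, charset in safe_decode_header("plain subject"):
        print(repr(text), charset)
```

Catching a short list of exception types rather than using a bare except
keeps unrelated bugs visible. Whether a given malformed subject still raises
depends on the Python version, so the wrapper is a safety net rather than a
reproduction of the original crash.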
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702 From noreply at sourceforge.net Mon Mar 3 08:44:15 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 3 11:50:18 2003 Subject: [Spambayes] [ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in subject Message-ID: Bugs item #696458, was opened at 2003-03-03 11:12 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Sjoerd Mullender (sjoerd) Assigned to: Nobody/Anonymous (nobody) Summary: crash in tokenizer due to bad base64 in subject Initial Comment: I got a crash in the tokenizer in the line where it does x = msg.get('subject', '') for x, subjcharset in email.Header.decode_header(x): The reason is, the subject of this particular message is Subject: *****SPAM***** =?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?= which gives a binascii.Error: Incorrect padding from binascii.a2b_base64. I am running an up-to-date spambayes and python (i.e. both fresh from CVS). 
Here is a (partial) stack trace:
  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py", line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding

----------------------------------------------------------------------

>Comment By: Sjoerd Mullender (sjoerd)
Date: 2003-03-03 17:44

Message:
Logged In: YES
user_id=43607

It seems to me that all calls to email.Header.decode_header should be
protected with try/except, or decode_header itself should protect itself
with a try/except. A third possibility is to add an extra indirection
through a function that does basically:

def decode_header(x):
    try:
        return email.Header.decode_header(x)
    except:
        return x

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

From noreply at sourceforge.net Mon Mar 3 09:30:00 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 3 12:22:57 2003
Subject: [Spambayes] [ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in subject
Message-ID:

Bugs item #696458, was opened at 2003-03-03 04:12
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Sjoerd Mullender (sjoerd)
Assigned to: Nobody/Anonymous (nobody)
Summary: crash in tokenizer due to bad base64 in subject

Initial Comment:
I got a crash in the tokenizer in the line where it does

x = msg.get('subject', '')
for x, subjcharset in
email.Header.decode_header(x): The reason is, the subject of this particular message is Subject: *****SPAM***** =?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?= which gives a binascii.Error: Incorrect padding from binascii.a2b_base64. I am running an up-to-date spambayes and python (i.e. both fresh from CVS). Here is a (parial) stack trace: File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1052, in tokenize for tok in self.tokenize_headers(msg): File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1106, in tokenize_headers for x, subjcharset in email.Header.decode_header(x): File "/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py", line 92, in decode_header dec = email.base64MIME.decode(encoded) File "/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py", line 179, in decode dec = a2b_base64(s) binascii.Error: Incorrect padding ---------------------------------------------------------------------- >Comment By: Skip Montanaro (montanaro) Date: 2003-03-03 11:30 Message: Logged In: YES user_id=44345 Casual observation for anyone reporting spambayes bugs which involve the email package - You should also check/report such errors on the http://mimelib.sourceforge.net/ project, which is where the email gurus hang out. ---------------------------------------------------------------------- Comment By: Sjoerd Mullender (sjoerd) Date: 2003-03-03 10:44 Message: Logged In: YES user_id=43607 It seems to me that all calls to email.Header.decode_header should be protected with try/except, or decode_header itself should protect itself with a try/except. 
A third possibility is to add an extra indirection through a function that does basically: def decode_header(x): try: return email.Header.decode_header(x) except: return x ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702 From piersh at friskit.com Mon Mar 3 09:41:41 2003 From: piersh at friskit.com (Piers Haken) Date: Mon Mar 3 12:40:38 2003 Subject: [Spambayes] Error during outlook plugin startup Message-ID: <9891913C5BFE87429D71E37F08210CB92C7504@zeus.sfhq.friskit.com> I just updated from CVS and I'm now getting the following error on startup. Can anyone tell me what's up? Piers. Outlook Spam Addin module loading SpamAddin - Connecting to Outlook Traceback (most recent call last): File "C:\Python22\lib\site-packages\win32com\universal.py", line 150, in dispatch retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None) File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_ return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider) File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_ return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider) File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_ return apply(func, args) File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 615, in OnConnection self.manager = manager.GetManager(application) File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 472, in GetManager _mgr = BayesManager(outlook=outlook, verbose=verbose) File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 142, in __init__ self.MigrateDataDirectory() File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 200, in MigrateDataDirectory self._MigrateFile("default_bayes_database.pck") File 
"C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in _MigrateFile shutil.move(src, dest) exceptions.AttributeError: 'module' object has no attribute 'move' From mike at plokta.com Mon Mar 3 20:38:52 2003 From: mike at plokta.com (Mike Scott) Date: Mon Mar 3 15:38:55 2003 Subject: [Spambayes] Server error when training in POP3proxy Message-ID: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com> After using it successfully for a couple of weeks, POP3proxy is throwing the following error when I try to review emails for training in the web browser interface. The rest of the web browser interface, and POP3proxy, seems to be working OK. Does anyone who knows more than me about POP3proxy have any ideas for how to diagnose or fix it? I've just pulled the most recent update from CVS, which hasn't helped. I'm on Mac OS X 10.2.4 running Python 2.2.2, in case it's relevant. 500 Server error Traceback (most recent call last): File "spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "pop3proxy.py", line 1003, in onReview messageInfo = self._makeMessageInfo(message) File "pop3proxy.py", line 856, in _makeMessageInfo messageInfo.bodySummary = self._trimHeader(text, 200) File "pop3proxy.py", line 648, in _trimHeader sections = email.Header.decode_header(field) File "/sw/src/root-python22-2.2.2-2/sw/lib/python2.2/email/Header.py", line 92, in decode_header File "/sw/src/root-python22-2.2.2-2/sw/lib/python2.2/email/base64MIME.py", line 179, in decode Error: Incorrect padding -- Mike Scott mike@plokta.com From piersh at friskit.com Mon Mar 3 13:01:55 2003 From: piersh at friskit.com (Piers Haken) Date: Mon Mar 3 16:00:50 2003 Subject: [Spambayes] Error during outlook plugin startup Message-ID: <9891913C5BFE87429D71E37F08210CB92C7505@zeus.sfhq.friskit.com> Okay, I worked around this problem by deleting my pickles and starting from scratch (it didn't need to do the migration) but I believe this is still a problem. 
I'm using python2.2.2 and win32all-152. Piers. > -----Original Message----- > From: Piers Haken > Sent: Monday, March 03, 2003 9:42 AM > To: Spambayes > Subject: [Spambayes] Error during outlook plugin startup > > > I just updated from CVS and I'm now getting the following > error on startup. Can anyone tell me what's up? > > Piers. > > Outlook Spam Addin module loading > SpamAddin - Connecting to Outlook > Traceback (most recent call last): > File "C:\Python22\lib\site-packages\win32com\universal.py", > line 150, in dispatch > retVal = ob._InvokeEx_(meth.dispid, 0, > pythoncom.DISPATCH_METHOD, args, None, None) > File > "C:\Python22\lib\site-packages\win32com\server\policy.py", > line 322, in _InvokeEx_ > return self._invokeex_(dispid, lcid, wFlags, args, kwargs, > serviceProvider) > File > "C:\Python22\lib\site-packages\win32com\server\policy.py", > line 562, in _invokeex_ > return DesignatedWrapPolicy._invokeex_( self, dispid, > lcid, wFlags, args, kwArgs, serviceProvider) > File > "C:\Python22\lib\site-packages\win32com\server\policy.py", > line 510, in _invokeex_ > return apply(func, args) > File "C:\Python22\spam\spambayes\Outlook2000\addin.py", > line 615, in OnConnection > self.manager = manager.GetManager(application) > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", > line 472, in GetManager > _mgr = BayesManager(outlook=outlook, verbose=verbose) > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", > line 142, in __init__ > self.MigrateDataDirectory() > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", > line 200, in MigrateDataDirectory > self._MigrateFile("default_bayes_database.pck") > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", > line 211, in _MigrateFile > shutil.move(src, dest) > exceptions.AttributeError: 'module' object has no attribute 'move' > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes > From mhammond at 
skippinet.com.au Tue Mar 4 09:46:39 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 3 17:47:19 2003 Subject: [Spambayes] Error during outlook plugin startup In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C7504@zeus.sfhq.friskit.com> Message-ID: > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in > _MigrateFile > shutil.move(src, dest) > exceptions.AttributeError: 'module' object has no attribute 'move' Damn - it seems Python 2.2 doesn't have shutil.move. I will replace it with win32api.MoveFileEx(). Mark. From mhammond at skippinet.com.au Tue Mar 4 10:22:21 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 3 18:23:24 2003 Subject: [Spambayes] Error during outlook plugin startup In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C7504@zeus.sfhq.friskit.com> Message-ID: I have checked in a fix for this. Mark. > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of Piers Haken > Sent: Tuesday, 4 March 2003 4:42 AM > To: Spambayes > Subject: [Spambayes] Error during outlook plugin startup > > > I just updated from CVS and I'm now getting the following error on > startup. Can anyone tell me what's up? > > Piers. 
> > Outlook Spam Addin module loading > SpamAddin - Connecting to Outlook > Traceback (most recent call last): > File "C:\Python22\lib\site-packages\win32com\universal.py", line 150, > in dispatch > retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, > args, None, None) > File "C:\Python22\lib\site-packages\win32com\server\policy.py", line > 322, in _InvokeEx_ > return self._invokeex_(dispid, lcid, wFlags, args, kwargs, > serviceProvider) > File "C:\Python22\lib\site-packages\win32com\server\policy.py", line > 562, in _invokeex_ > return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, > args, kwArgs, serviceProvider) > File "C:\Python22\lib\site-packages\win32com\server\policy.py", line > 510, in _invokeex_ > return apply(func, args) > File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 615, in > OnConnection > self.manager = manager.GetManager(application) > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 472, in > GetManager > _mgr = BayesManager(outlook=outlook, verbose=verbose) > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 142, in > __init__ > self.MigrateDataDirectory() > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 200, in > MigrateDataDirectory > self._MigrateFile("default_bayes_database.pck") > File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in > _MigrateFile > shutil.move(src, dest) > exceptions.AttributeError: 'module' object has no attribute 'move' > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes From mhammond at skippinet.com.au Tue Mar 4 10:25:26 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 3 18:33:43 2003 Subject: [Spambayes] Missing HTML payload Message-ID: The following mail got past SpamBayes. 
Looking at the clues, it appears that spambayes was missing the HTML body of the message (which *does* render almost correctly in Outlook). I instrumented the "show clues" feature to show *all* message tokens found in the body. As you can see at the very end, the entire body was stripped. I am guessing that we barf on: there") ('hi there', []) >>> tokenizer.crack_html_comment("hi there IE and Mozilla both render "hi there". SpamBayes will miss the "there". Thus, spambayes can miss most of the message payload even though the user sees it all. Attaching a patch which creates a new option, ignore_unterminated_html_comments: True, which correctly handles this case. If set to False, you get the old behaviour. If no one can see a reason to keep the existing behaviour, then this can be dropped as an option. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702 From noreply at sourceforge.net Mon Mar 3 18:09:17 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 3 21:00:10 2003 Subject: [Spambayes] [ spambayes-Bugs-696995 ] Invalid HTML comments are not ignored Message-ID: Bugs item #696995, was opened at 2003-03-03 20:55 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Nobody/Anonymous (nobody) Summary: Invalid HTML comments are not ignored Initial Comment: Incorrectly terminated HTML comments are ignored by SpamBayes, but most clients handle this gracefully. For both of the following: hi there IE and Mozilla both render "hi there". SpamBayes will miss the "there". Thus, spambayes can miss most of the message payload even though the user sees it all. 
Attaching a patch which creates a new option, ignore_unterminated_html_comments: True, which correctly handles this case. If set to False, you get the old behaviour. If no one can see a reason to keep the existing behaviour, then this can be dropped as an option. ---------------------------------------------------------------------- >Comment By: Tim Peters (tim_one) Date: 2003-03-03 21:09 Message: Logged In: YES user_id=31435 I suggest the one-line change to analyze() I posted to the mailing list instead -- there's no real value I can see in the current behavior of throwing away everything after an unmatched open-block construct, and it wasn't intentional behavior. If an open-block construct isn't matched by a close-block construct, all in all it's more reasonable to act as if the open-block construct hadn't been recognized as one at all. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702 From mhammond at skippinet.com.au Tue Mar 4 13:02:16 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 3 21:02:48 2003 Subject: [Spambayes] Missing HTML payload In-Reply-To: Message-ID: Thanks for the replies! > > Outlook actually shows this entire tag (ie, literally " there IE and Mozilla both render "hi there". SpamBayes will miss the "there". Thus, spambayes can miss most of the message payload even though the user sees it all. Attaching a patch which creates a new option, ignore_unterminated_html_comments: True, which correctly handles this case. If set to False, you get the old behaviour. If no one can see a reason to keep the existing behaviour, then this can be dropped as an option. ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-04 13:17 Message: Logged In: YES user_id=14198 Tim's fix (plus a couple of comments) checked in. 
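Editorial sketch of the idea Tim describes in this thread: if an opening "<!--" is never matched by a closing "-->", treat the opener as if it had not been recognized at all, so the text after it is still seen by the tokenizer. This is not the change that was checked in; strip_html_comments and its regular expression are hypothetical names used only for illustration, and the sample input is assumed.

```python
import re

# Hypothetical helper, NOT the actual SpamBayes tokenizer code.
# A comment is only recognized when "<!--" is later closed by "-->".
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text):
    """Replace well-formed HTML comments with a space.

    An unmatched "<!--" simply fails to match the pattern, so it and
    everything after it stay in the text instead of being discarded.
    """
    return _COMMENT_RE.sub(" ", text)


if __name__ == "__main__":
    print(strip_html_comments("hi <!-- note --> there").split())
    print(strip_html_comments("hi <!-- there"))
```

With a rule like this, a damaged comment stays in the token stream, which fits the argument made elsewhere in the thread that damaged HTML is itself a useful clue.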
---------------------------------------------------------------------- Comment By: Tim Peters (tim_one) Date: 2003-03-04 13:09 Message: Logged In: YES user_id=31435 I suggest the one-line change to analyze() I posted to the mailing list instead -- there's no real value I can see in the current behavior of throwing away everything after an unmatched open-block construct, and it wasn't intentional behavior. If an open-block construct isn't matched by a close-block construct, all in all it's more reasonable to act as if the open-block construct hadn't been recognized as one at all. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702 From tim.one at comcast.net Mon Mar 3 21:21:17 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Mar 3 21:21:51 2003 Subject: [Spambayes] Missing HTML payload In-Reply-To: Message-ID: [Mark Hammond] > Thanks for the replies! I guess you didn't get my bill . > Interestingly, Outlook shows the text, but IE and Mozilla do not. All 3 > show the text *after* the unmatched comment, but only Outlook shows the > comment itself. I don't want to think about the implications of that > . > > I made an alternative patch in that bug I pointed to, which completely > strips the invalid comment. From purely an Outlook POV, your patch is > probably better (as your patch better reflects what we see), but from the > "correctness" POV, maybe mine is (as it better reflects what most HTML > clients see) My belief is that non-spam HTML mail moves in the direction of using HTML correctly, so that damaged HTML is itself a spam indicator. Unlike Paul Graham , I have sisters, and they love sending HTML mail. It's fun for them and they do some beautiful stuff with it. So, all along, I've been much less willing to penalize HTML than other projects of this ilk (only computer geeks have bugs up their butts about using HTML in email). 
The flip side is that if damaged HTML is a symptom of spam, damaged HTML should be penalized, and *not* stripping the damaged stuff will create a mountain of characteristic clues. Senders of ham can avoid those penalties by sending well-formed HTML. > It does seem that no option is required whatever way we go. I'd agree even if we didn't have too many options. From skip at pobox.com Mon Mar 3 22:33:39 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 3 23:33:44 2003 Subject: [Spambayes] binascii.Error Message-ID: <15972.11427.981401.997736@montanaro.dyndns.org> A couple people recently reported binascii.Error being raised by pop3proxy, etc. Sjoerd Mullender filed a bug report on SF as well. I just checked in a change to spambayes/tokenizer.py which seems to fix the problem. Please give the latest CVS version a try and let me know if you still experience the problem. As an added bonus, a new token, "charset:invalid" gets generated when binascii barfs. More clues for the guys in the white hats. 
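Editorial sketch of the kind of fix described above: catch the decoding failure and emit a marker token instead of crashing. This is not the code that was checked into tokenizer.py. The name safe_decode_header and its decode parameter are hypothetical, and the sketch assumes a modern Python, where the module is email.header and a failed base64 decode may surface as HeaderParseError rather than binascii.Error, so both are caught.

```python
import binascii
from email.errors import HeaderParseError
from email.header import decode_header


def safe_decode_header(value, decode=decode_header):
    """Return (parts, extra_tokens) for a raw header value.

    parts is the list of (text, charset) pairs from the decoder.
    If decoding fails, parts is empty and extra_tokens carries a
    'charset:invalid' marker, mirroring the token named in the thread.
    The decode parameter is a hypothetical hook for testing.
    """
    try:
        return list(decode(value)), []
    except (binascii.Error, HeaderParseError):
        return [], ["charset:invalid"]


if __name__ == "__main__":
    print(safe_decode_header("plain subject"))
```

Catching only the two decoding exceptions, rather than using a bare except, keeps unrelated errors visible.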
Skip From noreply at sourceforge.net Mon Mar 3 20:41:03 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 00:04:48 2003 Subject: [Spambayes] [ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in subject Message-ID: Bugs item #696458, was opened at 2003-03-03 04:12 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702 Category: None Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Sjoerd Mullender (sjoerd) >Assigned to: Skip Montanaro (montanaro) Summary: crash in tokenizer due to bad base64 in subject Initial Comment: I got a crash in the tokenizer in the line where it does x = msg.get('subject', '') for x, subjcharset in email.Header.decode_header(x): The reason is, the subject of this particular message is Subject: *****SPAM***** =?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?= which gives a binascii.Error: Incorrect padding from binascii.a2b_base64. I am running an up-to-date spambayes and python (i.e. both fresh from CVS). Here is a (partial) stack trace: File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1052, in tokenize for tok in self.tokenize_headers(msg): File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1106, in tokenize_headers for x, subjcharset in email.Header.decode_header(x): File "/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py", line 92, in decode_header dec = email.base64MIME.decode(encoded) File "/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py", line 179, in decode dec = a2b_base64(s) binascii.Error: Incorrect padding ---------------------------------------------------------------------- >Comment By: Skip Montanaro (montanaro) Date: 2003-03-03 22:41 Message: Logged In: YES user_id=44345 Still not clear what the best course of action is at the email package level. I solved it here by catching the binascii exception and tossing in a 'charset:invalid' token.
It solved the problem here. Sjoerd, let me know if it's still a problem for you, but I think this should worm around it. S ---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2003-03-03 11:30 Message: Logged In: YES user_id=44345 Casual observation for anyone reporting spambayes bugs which involve the email package - You should also check/report such errors on the http://mimelib.sourceforge.net/ project, which is where the email gurus hang out. ---------------------------------------------------------------------- Comment By: Sjoerd Mullender (sjoerd) Date: 2003-03-03 10:44 Message: Logged In: YES user_id=43607 It seems to me that all calls to email.Header.decode_header should be protected with try/except, or decode_header itself should protect itself with a try/except. A third possibility is to add an extra indirection through a function that does basically: def decode_header(x): try: return email.Header.decode_header(x) except: return x ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702 From spambayes at rodland.no Tue Mar 4 09:51:21 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Tue Mar 4 03:51:26 2003 Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder Message-ID: I am wondering if I'm doing something wrong here. I just checked out the last copy of CVS. you have fixed bugs: [ 642740 ] "Recover from Spam" wrong folder and [ 696476 ] Manual filtering in outlook fails however both of these (still) fails. I've completely deleted my old installations. I've unregistered, and then re-registered addin.py. 
I've checked that I've got the last versions of both addin.py & manager.py: [Fredrik@FMR_WIN Outlook2000]$ spcvs status addin.py =================================================================== File: addin.py Status: Up-to-date Working revision: 1.50 Repository revision: 1.50 /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v Sticky Tag: (none) Sticky Date: (none) Sticky Options: (none) [Fredrik@FMR_WIN Outlook2000]$ spcvs status manager.py =================================================================== File: manager.py Status: Up-to-date Working revision: 1.51 Repository revision: 1.51 /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v Sticky Tag: (none) Sticky Date: (none) Sticky Options: (none) Am I missing something here? I've already posted a bug similar to 696476 (that is bug #697120). I'll be happy to re-post a bug similar to 642740. F -- Fredrik Rødland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From spambayes at rodland.no Tue Mar 4 09:56:59 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Tue Mar 4 03:57:05 2003 Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder In-Reply-To: Message-ID: > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of Fredrik Rodland > Sent: 4. mars 2003 09:51 > To: Spambayes > Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder > > I've already posted a bug similar to 696476 (that is bug > #697120). I'll be > happy to re-post a bug similar to 642740. what's the preferred of: A. re-opening a bug B. posting a new bug (with a link/comment to the old) when something (still) does not work when the bug is closed?
Fredrik -- Fredrik Rødland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From mhammond at skippinet.com.au Tue Mar 4 21:26:38 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Mar 4 05:27:41 2003 Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder In-Reply-To: Message-ID: > what's the preferred of: > > A. re-opening a bug > B. posting a new bug (with a link/comment to the old) My preference is A, assuming that the bug is still "warm", or not actually fixed as was the case here. If an identical bug appears in the future as a regression due to some other change, then it should be a new bug. Still-wishing-we-had-bugzilla ly, Mark. From frodland at aston.no Tue Mar 4 09:50:52 2003 From: frodland at aston.no (Fredrik Rodland) Date: Tue Mar 4 09:55:21 2003 Subject: [Spambayes] reg [ 642740 ] "Recover from Spam" wrong folder Message-ID: I am wondering if I'm doing something wrong here. I just checked out the last copy of CVS. you have fixed bugs: [ 642740 ] "Recover from Spam" wrong folder and [ 696476 ] Manual filtering in outlook fails however both of these (still) fails. I've completely deleted my old installations. I've unregistered, and then re-registered addin.py.
I've checked that I've got the last versions of both addin.py & manager.py: [Fredrik@FMR_WIN Outlook2000]$ spcvs status addin.py =================================================================== File: addin.py Status: Up-to-date Working revision: 1.50 Repository revision: 1.50 /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v Sticky Tag: (none) Sticky Date: (none) Sticky Options: (none) [Fredrik@FMR_WIN Outlook2000]$ spcvs status manager.py =================================================================== File: manager.py Status: Up-to-date Working revision: 1.51 Repository revision: 1.51 /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v Sticky Tag: (none) Sticky Date: (none) Sticky Options: (none) Am I missing something here? I've already posted a bug similar to 696476 (that is bug #697120). I'll be happy to re-post a bug similar to 642740. F -- Fredrik Rødland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From noreply at sourceforge.net Mon Mar 3 22:15:28 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:34 2003 Subject: [Spambayes] [ spambayes-Bugs-696476 ] Manual filtering in outlook fails Message-ID: Bugs item #696476, was opened at 2003-03-03 21:39 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702 Category: Outlook Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering in outlook fails Initial Comment: When I try to run "filter now" from the outlook plugin - I get the following trace: Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes-1.0a2\Outlook2000\dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes-1.0a2\Outlook2000\dialogs\FilterDialog.py", line 365, in StartProcess
self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes-1.0a2 \Outlook2000\manager.py", line 156, in EnsureOutlookFieldsForFolder folders = item.Folders File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 402, in __getattr__ if d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 OS: windows XP home Spambayes version: 1.0a2 outlook version: 2000 sp3 ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-04 17:15 Message: Logged In: YES user_id=14198 /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v <-- manager.py new revision: 1.51; previous revision: 1.50 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702 From noreply at sourceforge.net Tue Mar 4 00:24:30 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:35 2003 Subject: [Spambayes] [ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails Message-ID: Bugs item #697120, was opened at 2003-03-04 09:24 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering in Outlook (still) fails Initial Comment: also see bug #696476 which is very similar to this one (but has status: closed). When trying to filter manually in outlook, I get this error. 
I've tried to filter multiple folders, both with and wiothout the "include subfolder-checkbox" set, and also ensured that there was a message in the folder I trie3d to filter. Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 290, in EnsureOutlookFieldsForFolder folders = item.Folders File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 402, in __getattr__ if d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 From noreply at sourceforge.net Tue Mar 4 02:11:31 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:37 2003 Subject: [Spambayes] [ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails Message-ID: Bugs item #697120, was opened at 2003-03-04 09:24 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering in Outlook (still) fails Initial Comment: also see bug #696476 which is very similar to this one (but has status: closed). 
When trying to filter manually in outlook, I get this error. I've tried to filter multiple folders, both with and wiothout the "include subfolder-checkbox" set, and also ensured that there was a message in the folder I trie3d to filter. Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 290, in EnsureOutlookFieldsForFolder folders = item.Folders File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 402, in __getattr__ if d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- >Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 11:11 Message: Logged In: YES user_id=724871 I've tested this some more. It seems like I was wrong in my initial bug-report. everything seems to be working fine if "include subfolder" is UNCHECKED. The filtering then both handles empty and non-empty folders. However if the "include subfolder" is CHECKED, the filtering fails - also if all folders filtered contain mails. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 From noreply at sourceforge.net Tue Mar 4 02:33:06 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:39 2003 Subject: [Spambayes] [ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails Message-ID: Bugs item #697120, was opened at 2003-03-04 19:24 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 Category: Outlook Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering in Outlook (still) fails Initial Comment: also see bug #696476 which is very similar to this one (but has status: closed). When trying to filter manually in outlook, I get this error. I've tried to filter multiple folders, both with and without the "include subfolder-checkbox" set, and also ensured that there was a message in the folder I tried to filter.
Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 290, in EnsureOutlookFieldsForFolder folders = item.Folders File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 402, in __getattr__ if d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-04 21:33 Message: Logged In: YES user_id=14198 OK, finally fixed: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v <-- manager.py new revision: 1.52; previous revision: 1.51 I was tricked by the original traceback, which had an appointment item. My previous checkin made sure *that* couldn't happen again Note that if you comment in the bug that it still fails, I will simply re-open the old bug, rather than creating a new one. Do that if this fix doesn't work :( ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 21:11 Message: Logged In: YES user_id=724871 I've tested this some more. It seems like I was wrong in my initial bug-report. everything seems to be working fine if "include subfolder" is UNCHECKED. The filtering then both handles empty and non-empty folders. 
However if the "include subfolder" is CHECKED, the filtering fails - also if all folders filtered contain mails. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 From noreply at sourceforge.net Tue Mar 4 02:43:29 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:42 2003 Subject: [Spambayes] [ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder Message-ID: Bugs item #642740, was opened at 2002-11-24 01:00 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 Category: None Group: None >Status: Open >Resolution: Works For Me Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) >Summary: "Recover from Spam" wrong folder Initial Comment: Outlook addin: Selecting "Recover From Spam" recovers the selected message to the Inbox folder - which is not necessarily where came from. The filterer will need to save the folder it came from before we can do this. ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-04 21:43 Message: Logged In: YES user_id=14198 Can you post an example of something that fails? Note that a remaining potential problem is out of our control: occasionally the "Inbox" will see a message before the builtin rules. In this case, we filter it from the Inbox, not from where the Outlook rule would have moved it. Thus, when we recover, we see the inbox as the source. Note that I also fixed another bug related to this - previously, simply scoring a message would store that folder name as the "source" of the message. Thus, if you had previously viewed the clues for a message once in the wrong folder, the correct source folder would have been lost. So please ensure you are testing with mail received since I said I fixed this. 
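The scoring bug Mark describes above can be illustrated with a small sketch. It is not the SpamBayes code: the `Message` class and both functions are hypothetical names, and the policy shown (only a filter run records the source folder, and only when none is stored yet) is an assumption about one way to get the behaviour he describes.

```python
class Message:
    """Hypothetical stand-in for a stored message (not the msgstore API)."""

    def __init__(self, subject):
        self.subject = subject
        self.source_folder = None  # folder the message was first filtered from


def score_message(message, current_folder):
    """Scoring (e.g. viewing the clues) must not touch the recorded source."""
    return message.source_folder


def filter_message(message, current_folder):
    """Assumed policy: remember the folder only on the first filter run."""
    if message.source_folder is None:
        message.source_folder = current_folder
    return message.source_folder


msg = Message("Network Shutdown")
filter_message(msg, "Inbox/maillister/locker")  # first filter run records the source
score_message(msg, "Spam")                      # later scoring elsewhere changes nothing
print(msg.source_folder)
```

With a rule like this, "Recover from Spam" would move the message back to the folder recorded by the first filter run, provided that run happened in the folder the user expects.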
---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-02-04 17:23 Message: Logged In: YES user_id=14198 /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v <-- addin.py new revision: 1.48; previous revision: 1.47 /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v <-- filter.py new revision: 1.16; previous revision: 1.15 /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v <-- msgstore.py new revision: 1.39; previous revision: 1.38 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 From noreply at sourceforge.net Tue Mar 4 02:45:47 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:50 2003 Subject: [Spambayes] [ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails Message-ID: Bugs item #697120, was opened at 2003-03-04 09:24 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 Category: Outlook Group: None >Status: Open Resolution: Fixed Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering in Outlook (still) fails Initial Comment: also see bug #696476 which is very similar to this one (but has status: closed). When trying to filter manually in outlook, I get this error. I've tried to filter multiple folders, both with and without the "include subfolder-checkbox" set, and also ensured that there was a message in the folder I tried to filter.
Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 290, in EnsureOutlookFieldsForFolder folders = item.Folders File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 402, in __getattr__ if d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- >Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 11:45 Message: Logged In: YES user_id=724871 Well - i still get an error - bug reopened: Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 293, in EnsureOutlookFieldsForFolder self.EnsureOutlookFieldsForFolder(folder.EntryID, True) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 245, in EnsureOutlookFieldsForFolder msgstore_folder = self.message_store.GetFolder(folder_id) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \msgstore.py", line 232, in GetFolder folder_id = self.NormalizeID(folder_id) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \msgstore.py", 
line 186, in NormalizeID assert False, "We expect fully qualified IDs" AssertionError: We expect fully qualified IDs win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-04 11:33 Message: Logged In: YES user_id=14198 OK, finally fixed: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v <-- manager.py new revision: 1.52; previous revision: 1.51 I was tricked by the original traceback, which had an appointment item. My previous checkin made sure *that* couldn't happen again Note that if you comment in the bug that it still fails, I will simply re-open the old bug, rather than creating a new one. Do that if this fix doesn't work :( ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 11:11 Message: Logged In: YES user_id=724871 I've tested this some more. It seems like I was wrong in my initial bug-report. everything seems to be working fine if "include subfolder" is UNCHECKED. The filtering then both handles empty and non-empty folders. However if the "include subfolder" is CHECKED, the filtering fails - also if all folders filtered contain mails. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 From noreply at sourceforge.net Tue Mar 4 02:52:49 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:57 2003 Subject: [Spambayes] [ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails Message-ID: Bugs item #697120, was opened at 2003-03-04 19:24 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 Category: Outlook Group: None >Status: Closed Resolution: Fixed Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering in Outlook (still) fails Initial Comment: also see bug #696476 which is very similar to this one (but has status: closed). When trying to filter manually in outlook, I get this error. I've tried to filter multiple folders, both with and without the "include subfolder-checkbox" set, and also ensured that there was a message in the folder I tried to filter.
Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 290, in EnsureOutlookFieldsForFolder folders = item.Folders File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 402, in __getattr__ if d is not None: return getattr(d, attr) File "C:\PROGRA~1\_DEV\Python22\lib\site- packages\win32com\client\__init__.py", line 368, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr) AttributeError: '' object has no attribute 'Folders' win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-04 21:52 Message: Logged In: YES user_id=14198 OK - dare ya to re-open it again /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v <-- manager.py new revision: 1.53; previous revision: 1.52 ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 21:45 Message: Logged In: YES user_id=724871 Well - i still get an error - bug reopened: Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\AsyncDialog.py", line 98, in OnStart self.StartProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \dialogs\FilterDialog.py", line 366, in StartProcess self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \manager.py", line 293, in EnsureOutlookFieldsForFolder self.EnsureOutlookFieldsForFolder(folder.EntryID, True) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 
\manager.py", line 245, in EnsureOutlookFieldsForFolder msgstore_folder = self.message_store.GetFolder(folder_id) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \msgstore.py", line 232, in GetFolder folder_id = self.NormalizeID(folder_id) File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000 \msgstore.py", line 186, in NormalizeID assert False, "We expect fully qualified IDs" AssertionError: We expect fully qualified IDs win32ui: Error in Command Message handler for command ID 1100, Code 0 ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-04 21:33 Message: Logged In: YES user_id=14198 OK, finally fixed: /cvsroot/spambayes/spambayes/Outlook2000/manager.py,v <-- manager.py new revision: 1.52; previous revision: 1.51 I was tricked by the original traceback, which had an appointment item. My previous checkin made sure *that* couldn't happen again Note that if you comment in the bug that it still fails, I will simply re-open the old bug, rather than creating a new one. Do that if this fix doesn't work :( ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 21:11 Message: Logged In: YES user_id=724871 I've tested this some more. It seems like I was wrong in my initial bug-report. everything seems to be working fine if "include subfolder" is UNCHECKED. The filtering then both handles empty and non-empty folders. However if the "include subfolder" is CHECKED, the filtering fails - also if all folders filtered contain mails. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702 From noreply at sourceforge.net Tue Mar 4 03:03:34 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 09:55:59 2003 Subject: [Spambayes] [ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder Message-ID: Bugs item #642740, was opened at 2002-11-23 15:00 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 Category: None Group: None Status: Open Resolution: Works For Me Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: "Recover from Spam" wrong folder Initial Comment: Outlook addin: Selecting "Recover From Spam" recovers the selected message to the Inbox folder - which is not necessarily where came from. The filterer will need to save the folder it came from before we can do this. ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 12:03 Message: Logged In: YES user_id=724871 OK - i've tested some more. this seems to work sometimes, and sometimes not. It may be related to the other bug you're referring to, but I'll try to walk through an example. - I've got a message in a folder (inbox/maillister/locker). The message was filtered by outlooks rules to this folder this morning - i.e. I've never viewed either the message or the clues from any other folder. - I run a manual filter on this folder (which returns with 1 good msg as expected) - WILL THIS FORGET THE FOLDER OF THIS MSG? - I press the "delete as spam" button, and the message appears in my SPAM-folder. - I enter my spam-folder and press the "recover from spam"- button.
- the message appears in my INBOX The message was ORIGINALLY (this morning local time) filtered using the 1.0.a2 version of spambayes, while I now use the latest CVS-version. the following appears in the trace-collector: Deleting and spam training message '[Lockergnome Penguin Shell] Network Shutdown' - trained as spam Recovering to folder 'Inbox' and ham training message '[Lockergnome Penguin Shell] Network Shutdown' - trained as ham If you add some more debug, I'll be happy to run some tests on this msg. Is there anyway to check whether this message actually ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-04 11:43 Message: Logged In: YES user_id=14198 Can you post an example of something that fails? Note that a remaining potential problem is out of our control: occasionally the "Inbox" will see a message before the builtin rules. In this case, we filter it from the Inbox, not from where the Outlook rule would have moved it. Thus, when we recover, we see the inbox as the source. Note that I also fixed another bug related to this - previously, simply scoring a message would store that folder name as the "source" of the message. Thus, if you had previously viewed the clues for a message once in the wrong folder, the correct source folder would have been lost. So please ensure you are testing with mail received since I said I fixed this. 
---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-02-04 07:23 Message: Logged In: YES user_id=14198 /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v <-- addin.py new revision: 1.48; previous revision: 1.47 /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v <-- filter.py new revision: 1.16; previous revision: 1.15 /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v <-- msgstore.py new revision: 1.39; previous revision: 1.38 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 From jeremy at zope.com Tue Mar 4 10:29:57 2003 From: jeremy at zope.com (Jeremy Hylton) Date: Tue Mar 4 10:30:45 2003 Subject: [Spambayes] Server error when training in POP3proxy In-Reply-To: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com> References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com> Message-ID: <1046791797.1953.6.camel@slothrop.zope.com> On Mon, 2003-03-03 at 15:38, Mike Scott wrote: > After using it successfully for a couple of weeks, POP3proxy is > throwing the following error when I try to review emails for training > in the web browser interface. The rest of the web browser interface, > and POP3proxy, seems to be working OK. Does anyone who knows more than > me about POP3proxy have any ideas for how to diagnose or fix it? I've > just pulled the most recent update from CVS, which hasn't helped. I'm > on Mac OS X 10.2.4 running Python 2.2.2, in case it's relevant. I saw the same problem and filed a spambayes bug report. The funny thing is, I wrote a script to scan the unknown cache and found three messages that caused the problem. The messages were all generated from a sourceforge bug report for python. The bug report was that some MIME text caused the email package to barf -- and the bug report included an example of the input that caused the problem.
The bug report was carefully crafted to cause any tool that used the email package to fail. I think the right solution is not just to fix the email package, but to make pop3proxy more robust. It should expect that the email package may fail unexpectedly. In those cases, it should not fail catastrophically. Jeremy From skip at pobox.com Tue Mar 4 09:36:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Mar 4 10:47:29 2003 Subject: [Spambayes] Server error when training in POP3proxy In-Reply-To: <1046791797.1953.6.camel@slothrop.zope.com> References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com> <1046791797.1953.6.camel@slothrop.zope.com> Message-ID: <15972.51219.834328.821065@montanaro.dyndns.org> >> After using it successfully for a couple of weeks, POP3proxy is >> throwing the following error when I try to review emails for training >> in the web browser interface.... Jeremy> I think the right solution is not just to fix the email package, Jeremy> but to make pop3proxy more robust.... I checked in a change to tokenizer.py yesterday evening which should robustify things a bit. Please "cvs up" and give it a whirl. Skip From jeremy at zope.com Tue Mar 4 13:23:00 2003 From: jeremy at zope.com (Jeremy Hylton) Date: Tue Mar 4 13:23:40 2003 Subject: [Spambayes] Server error when training in POP3proxy In-Reply-To: <15972.51219.834328.821065@montanaro.dyndns.org> References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com> <1046791797.1953.6.camel@slothrop.zope.com> <15972.51219.834328.821065@montanaro.dyndns.org> Message-ID: <1046802180.2030.23.camel@slothrop.zope.com> On Tue, 2003-03-04 at 10:36, Skip Montanaro wrote: > >> After using it successfully for a couple of weeks, POP3proxy is > >> throwing the following error when I try to review emails for training > >> in the web browser interface.... > > Jeremy> I think the right solution is not just to fix the email package, > Jeremy> but to make pop3proxy more robust.... 
> > I checked in a change to tokenizer.py yesterday evening which should > robustify things a bit. Please "cvs up" and give it a whirl. I'm looking at the checkin comment for tokenizer, and I think it won't work. If you look at the traceback we provided, it shows that the tokenizer isn't involved. The proxy is calling email.Header.decode_header() directly. On the failure in question, it isn't even calling it on a header :-). Jeremy From piersh at friskit.com Tue Mar 4 10:54:21 2003 From: piersh at friskit.com (Piers Haken) Date: Tue Mar 4 13:53:14 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <9891913C5BFE87429D71E37F08210CB9297588@zeus.sfhq.friskit.com> I'm seeing some weird behavior sometimes when the outlook plugin filters spam. Sometimes the spam that ends up in my spam folder has a spam field value of '0%' even though the 'show clues' feature shows the correct value. Looking through the trace output I'm seeing a bunch of assertion failures like this: pythoncom error: Python error invoking COM method. 
Traceback (most recent call last): File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_ return self._invoke_(dispid, lcid, wFlags, args) File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_ return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None) File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 601, in _invokeex_ return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider) File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 541, in _invokeex_ return apply(func, args) File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 184, in OnItemAdd msgstore_message = self.manager.message_store.GetMessage(item) File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 258, in GetMessage message_id = self.NormalizeID(message_id) File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 185, in NormalizeID assert type(item_id) in [type(''), type(u'')], "What kind of ID is '%r'?" % (item_id,) exceptions.AssertionError: What kind of ID is ''? I'm not sure what's going on here, has anyone else seen this before? Piers. From skip at pobox.com Tue Mar 4 14:09:02 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Mar 4 15:09:25 2003 Subject: [Spambayes] Server error when training in POP3proxy In-Reply-To: <1046802180.2030.23.camel@slothrop.zope.com> References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com> <1046791797.1953.6.camel@slothrop.zope.com> <15972.51219.834328.821065@montanaro.dyndns.org> <1046802180.2030.23.camel@slothrop.zope.com> Message-ID: <15973.2014.368579.363028@montanaro.dyndns.org> >> I checked in a change to tokenizer.py yesterday evening which should >> robustify things a bit. Please "cvs up" and give it a whirl. Jeremy> I'm looking at the checkin comment for tokenizer, and I think it Jeremy> won't work. 
If you look at the traceback we provided, it shows Jeremy> that the tokenizer isn't involved. The proxy is calling Jeremy> email.Header.decode_header() directly. On the failure in Jeremy> question, it isn't even calling it on a header :-). I was working off the traceback I got which wasn't from pop3proxy. In my checkin comment I wrote: These two may not be the only places requiring a change. Anywhere email.Header.decode_header() is called - particularly when passed a subject or email address - should probably be guarded. I don't regularly run pop3proxy, so couldn't easily check any changes I'd make to that code. Still, the try/except structure should be similar. Skip From noreply at sourceforge.net Tue Mar 4 16:39:36 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 4 20:33:58 2003 Subject: [Spambayes] [ spambayes-Bugs-693423 ] email message generates error in pop3proxy.py Message-ID: Bugs item #693423, was opened at 2003-02-25 23:02 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702 Category: pop3proxy Group: None >Status: Open Resolution: None Priority: 5 Submitted By: David Shaw (dshaw) Assigned to: Tim Stone (timstone4) Summary: email message generates error in pop3proxy.py Initial Comment: Hi all, A friend of mine had a cache file in his "unknown" folder that caused the "review" web page in pop3proxy.py to generate the following traceback: Traceback (most recent call last): File "spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "pop3proxy.py", line 929, in onReview judgement = judgement.split(';')[0].strip() File "pop3proxy.py", line 815, in _makeMessageInfo print type(text) AttributeError: 'list' object has no attribute 'replace' He sent me the offending message, and I replicated the problem: msg = open("/Users/dshaw/Desktop/crash_spam.txt", "r") message = mbox.get_message(msg) part = typed_subpart_iterator(message, 'text', 'plain').next() 
text = part.get_payload() >>> text [] So, instead of text, the payload is a list containing a single email message instance. Here are the objects' respective payloads: >>> message._payload [, , , , , , , , , , , , , ] ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-04 18:39 Message: Logged In: YES user_id=645698 I just checked in a fix for this problem. I have no ability to actually test it, though. Please try your test case again and let me know the outcome. ---------------------------------------------------------------------- Comment By: David Shaw (dshaw) Date: 2003-02-28 10:34 Message: Logged In: YES user_id=244639 Seems to be fixed! Thanks. ---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-02-27 22:29 Message: Logged In: YES user_id=645698 I just checked in a fix for this problem. I have no ability to actually test it, though. Please try your test case again and let me know the outcome. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702 From niek at haunter.student.utwente.nl Wed Mar 5 10:04:03 2003 From: niek at haunter.student.utwente.nl (Niek Bergboer) Date: Wed Mar 5 04:04:08 2003 Subject: [Spambayes] Graphs on my website Message-ID: <20030305090403.GB30529@haunter.student.utwente.nl> On Sat, Mar 01, 2003 at 09:12:46AM -0800, T. Alexander Popiel wrote: > Those who want to see my pretty graphs without waiting > for the moderator approval of my .png-laden posting > can go to http://www.wolfskeep.com/~popiel/spambayes/incremental > to see all the pretty pictures (along with a bunch of the > raw and semi-cooked data files). Looks very nice indeed, and the results seem to be good (fn and fp ~10^-2). 
For the other examples on your site, for which you use a parameter to check its effect on the performance (e.g. the ham:spam ratio, or the training set size), it would be nice to generate a ROC-curve: In a ROC-curve (Receiver Operating Characteristic curve), you plot the correct positive rate (y-axis) against the false positive rate (x-axis). The points on the curve are given by using e.g. different spam:ham ratios. A ROC-curve doesn't necessarily provide more information, but it is a rather standard way to present results in (more or less) binary classification. The term ROC originates from RADAR detection results, AFAIK. A problem that needs to be addressed in making ROC-curves for spambayes is how to handle unsures: disregarding them completely in the ROC curve seems reasonable, but then one probably also needs a correct.pos.rate vs. unsures rate curve. > - Alex Just my 2 Eurocents... Niek -- Max Brod: "Gibt es denn gar keine Hoffnung?" Franz Kafka: "Aber ja! Es gibt unendlich viel Hoffnung. Nur nicht fuer uns." PGP public key at http://www.bergboer.net From Paul.Moore at atosorigin.com Wed Mar 5 09:45:54 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Wed Mar 5 04:47:20 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D955@UKDCX001.uk.int.atosorigin.com> From: Piers Haken [mailto:piersh@friskit.com] > I'm seeing some weird behavior sometimes when the outlook plugin filters > spam. Sometimes the spam that ends up in my spam folder has a spam field > value of '0%' even though the 'show clues' feature shows the correct > value. [...] > I'm not sure what's going on here, has anyone else seen this before? Yes, I see it fairly often, and it has been reported before (to the list, but possibly not on SF). IIRC, Mark thought it was a timing issue between when the message arrived and when the plugin fired. But that's about as much as I know... Paul.
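A minimal sketch of the ROC idea from Niek's message above, assuming each message has a spam score between 0 and 1 and a known true label. The swept parameter here is the spam cutoff rather than the ham:spam ratio Niek mentions, unsures are not modelled, and the scores and labels are made-up illustration data, not SpamBayes results.

```python
def roc_points(scores, labels, cutoffs):
    """Return one (false_positive_rate, true_positive_rate) pair per cutoff.

    labels holds True for spam and False for ham; a message is called
    spam when its score is >= the cutoff.
    """
    n_spam = sum(1 for is_spam in labels if is_spam)
    n_ham = len(labels) - n_spam
    points = []
    for cutoff in cutoffs:
        flagged = [is_spam for s, is_spam in zip(scores, labels) if s >= cutoff]
        true_pos = sum(1 for is_spam in flagged if is_spam)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / n_ham, true_pos / n_spam))
    return points


# Illustration data only: three spam and three ham messages.
scores = [0.99, 0.95, 0.60, 0.40, 0.10, 0.02]
labels = [True, True, True, False, False, False]

points = roc_points(scores, labels, cutoffs=[0.9, 0.5, 0.3])
for fpr, tpr in points:
    print("fp rate %.2f  correct positive rate %.2f" % (fpr, tpr))
```

Plotting the pairs with the false positive rate on the x-axis gives the curve; a separate plot of correct positive rate against unsure rate would cover the unsures question raised in the message.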
From mhammond at skippinet.com.au Wed Mar 5 21:19:44 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 5 05:20:18 2003 Subject: [Spambayes] Outlook plugin error In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D955@UKDCX001.uk.int.atosorigin.com> Message-ID: > From: Piers Haken [mailto:piersh@friskit.com] > > I'm seeing some weird behavior sometimes when the outlook plugin filters > > spam. Sometimes the spam that ends up in my spam folder has a spam field > > value of '0%' even though the 'show clues' feature shows the correct > > value. > > [...] > > > I'm not sure what's going on here, has anyone else seen this before? > > Yes, I see it fairly often, and it has been reported before (to the list, > but possibly not on SF). IIRC, Mark thought it was a timing issue between > when the message arrived and when the plugin fired. But that's about as > much as I know... I never see this. The timing issue I was thinking of would account for a *blank* spam score, but not a zero score. A zero implies that the scoring worked correctly, but did indeed return zero. If you disable filtering, you should see all new mail arrive with a blank score, rather than zero. Please tell me if this is not true. If it *is* true, then I guess we can add some additional trace statements to see what is going on. Mark. From Paul.Moore at atosorigin.com Wed Mar 5 10:31:04 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Wed Mar 5 05:32:26 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D958@UKDCX001.uk.int.atosorigin.com> From: Mark Hammond [mailto:mhammond@skippinet.com.au] > If you disable filtering, you should see all new mail arrive with a > blank score, rather than zero. Yes, that's right. 
Paul From noreply at sourceforge.net Wed Mar 5 05:09:38 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 08:20:31 2003 Subject: [Spambayes] [ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort for uiPort Message-ID: Patches item #697970, was opened at 2003-03-05 14:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy didn't use addressAndPort for uiPort Initial Comment: pop3proxy doesn't accept the hostname:portno notation for the -l (i.e. uiPort) flag. I did'nt like everybody on our LAN being able to read my mail using a webbrowser, so I wrote the attached path, this allows -l localhost:8880 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 From noreply at sourceforge.net Wed Mar 5 05:10:48 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 08:20:32 2003 Subject: [Spambayes] [ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort for uiPort Message-ID: Patches item #697970, was opened at 2003-03-05 14:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy didn't use addressAndPort for uiPort Initial Comment: pop3proxy doesn't accept the hostname:portno notation for the -l (i.e. uiPort) flag. 
I didn't like everybody on our LAN being able to read my mail using a webbrowser, so I wrote the attached patch; this allows -l localhost:8880 ---------------------------------------------------------------------- >Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 14:10 Message: Logged In: YES user_id=311771 Forgot the checkmark, as usual. arrggg. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 From roz at one.net Wed Mar 5 01:14:24 2003 From: roz at one.net (J. Solomon Kostelnik) Date: Wed Mar 5 08:21:03 2003 Subject: [Spambayes] Great Job! Message-ID: <1046844863.4850.5.camel@jsk.one.net> Just wanted to say "great job" on the software so far. After training on only about 10-15 emails, it successfully caught ALL spam, and only accidentally got a few "hams." With each successive training run, it gets better. I really am impressed. One suggestion: document (if it exists), or add a run-time flag to run a certain .ini file on startup of the pop3proxy script. I'd like to add pop3proxy.py to my rc.local file, but I need to be able to tell it where to look for the .ini file. If this exists, please just point me to the docs where it says. Thanks again and keep up the great work! -- Solomon aka JSK333 http://w3.one.net/~roz/ "Come to me, all you who labor and are heavily burdened, and I will give you rest. Take my yoke upon you, and learn from me, for I am gentle and lowly in heart; and you will find rest for your souls" 
--Jesus Christ, Son of God; Matthew 11:28-29 PGP Public Key Available: http://w3.one.net/~roz/jsk333.asc From noreply at sourceforge.net Wed Mar 5 05:36:04 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 08:38:20 2003 Subject: [Spambayes] [ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort for uiPort Message-ID: Patches item #697970, was opened at 2003-03-05 14:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy didn't use addressAndPort for uiPort Initial Comment: pop3proxy doesn't accept the hostname:portno notation for the -l (i.e. uiPort) flag. I did'nt like everybody on our LAN being able to read my mail using a webbrowser, so I wrote the attached path, this allows -l localhost:8880 ---------------------------------------------------------------------- >Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 14:36 Message: Logged In: YES user_id=311771 ahem. Make that ".. this allows -u localhost:8880", of course ---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 14:10 Message: Logged In: YES user_id=311771 Forgot the checkmark, as usual. arrggg. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 From wsy at merl.com Wed Mar 5 08:52:19 2003 From: wsy at merl.com (Bill Yerazunis) Date: Wed Mar 5 08:52:23 2003 Subject: [Spambayes] Graphs on my website In-Reply-To: <20030305090403.GB30529@haunter.student.utwente.nl> (niek@haunter.student.utwente.nl) References: <20030305090403.GB30529@haunter.student.utwente.nl> Message-ID: <200303051352.h25DqJh20230@localhost.localdomain> From: niek@haunter.student.utwente.nl (Niek Bergboer) In a ROC-curve (Receiver Operating Characteristic curve), you plot the correct positive rate (y-axis) against the false positive rate (x-axis). The points on the curve are given by using e.g. different spam:ham ratio's. A ROC-curve doesn't necessarily provide more information, but it is a rather standard way to present results in (more or less) binary classification. The term ROC originates from RADAR detection results, AFAIK. A problem that needs to be addressed in making ROC-curves for spambayes is how to handle unsures: disregarding them completely in the ROC curve seems reasonable, but then one probably also needs a correct.pos.rate vs. unsures rate curve. The ROC curves I've seen are all plots of correct% v incorrect% with the parameterization variable being some controllable threshold that's an input to the system; the closer the "knee" in the curve comes to the origin, the better the discrimination, and the parameter value(s) at the point of closest approach are the optimal operating parameters . In the case of SpamBayes, where there's a distinct "third class", I'd suggest _three_ curves: Ham v. Unsure Unsure v. Spam Ham v. Spam This would plot the confusion on all three axes, and make it clear that you can drive the third one (ham v. spam) really close to the origin (which is good) by expanding the size of the Unsure class. 
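The two kinds of curve discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names roc_points and three_way_rates, the sample scores, and the cutoff values are all made up here and are not part of spambayes; the code is modern Python rather than the Python 2.2 of this thread. Scores are taken to be spam probabilities in [0, 1].

```python
def roc_points(scores, is_spam, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold.

    A message counts as "called spam" when its score is >= the threshold.
    """
    n_spam = sum(1 for spam in is_spam if spam)
    n_ham = len(is_spam) - n_spam
    points = []
    for t in thresholds:
        tp = sum(1 for s, spam in zip(scores, is_spam) if spam and s >= t)
        fp = sum(1 for s, spam in zip(scores, is_spam) if not spam and s >= t)
        points.append((fp / n_ham, tp / n_spam))
    return points


def three_way_rates(scores, is_spam, ham_cutoff, spam_cutoff):
    """Fractions of all messages that end up unsure, false positive, false negative.

    Scores in [ham_cutoff, spam_cutoff) are treated as "unsure".
    """
    n = len(scores)
    unsure = false_pos = false_neg = 0
    for s, spam in zip(scores, is_spam):
        if ham_cutoff <= s < spam_cutoff:
            unsure += 1
        elif s >= spam_cutoff and not spam:
            false_pos += 1
        elif s < ham_cutoff and spam:
            false_neg += 1
    return {"unsure": unsure / n, "false_pos": false_pos / n, "false_neg": false_neg / n}


# Hypothetical scores: four ham messages followed by four spam messages.
scores = [0.05, 0.10, 0.40, 0.95, 0.60, 0.90, 0.99, 0.15]
is_spam = [False, False, False, False, True, True, True, True]

narrow = three_way_rates(scores, is_spam, ham_cutoff=0.2, spam_cutoff=0.9)
wide = three_way_rates(scores, is_spam, ham_cutoff=0.1, spam_cutoff=0.96)
```

On this toy data the wide cutoffs remove every ham/spam confusion, at the cost of most messages landing in unsure, which is the trade-off a ham v. spam curve alone would hide.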
-Bill Yerazunis ( CRM114 spy :-) ) From skip at pobox.com Wed Mar 5 08:35:36 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 5 09:35:45 2003 Subject: [Spambayes] Great Job! In-Reply-To: <1046844863.4850.5.camel@jsk.one.net> References: <1046844863.4850.5.camel@jsk.one.net> Message-ID: <15974.2872.617304.198800@montanaro.dyndns.org> Solomon> I really am impressed. As are we all. Solomon> One suggestion: document (if it exists), or add a run-time flag Solomon> to run a certain .ini file on startup of the pop3proxy script. You can set your BAYESCUSTOMIZE environment variable to (on Unix) a colon separated list of ini files which will be loaded, in order. Skip From noreply at sourceforge.net Wed Mar 5 06:38:38 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 11:32:15 2003 Subject: [Spambayes] [ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort for uiPort Message-ID: Patches item #697970, was opened at 2003-03-05 13:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy didn't use addressAndPort for uiPort Initial Comment: pop3proxy doesn't accept the hostname:portno notation for the -l (i.e. uiPort) flag. I did'nt like everybody on our LAN being able to read my mail using a webbrowser, so I wrote the attached path, this allows -l localhost:8880 ---------------------------------------------------------------------- >Comment By: Richie Hindle (richiehindle) Date: 2003-03-05 14:38 Message: Logged In: YES user_id=85414 Unless I'm misunderstanding something, this is exactly what the html_ui_allow_remote_connections setting is for...? Thanks for the patch anyway - there's nothing wrong with being able to specify the address that way whether html_ui_allow_remote_connections solves your problem or not. 
---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 13:36 Message: Logged In: YES user_id=311771 ahem. Make that ".. this allows -u localhost:8880", of course ---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 13:10 Message: Logged In: YES user_id=311771 Forgot the checkmark, as usual. arrggg. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 From noreply at sourceforge.net Wed Mar 5 07:25:45 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 11:32:16 2003 Subject: [Spambayes] [ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort for uiPort Message-ID: Patches item #697970, was opened at 2003-03-05 07:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy didn't use addressAndPort for uiPort Initial Comment: pop3proxy doesn't accept the hostname:portno notation for the -l (i.e. uiPort) flag. I did'nt like everybody on our LAN being able to read my mail using a webbrowser, so I wrote the attached path, this allows -l localhost:8880 ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-05 09:25 Message: Logged In: YES user_id=645698 You bring up a very good point, Wolfgang. Your patch plugs one hole, but someone can still access your mail via http://:8880 (or whatever port you happen to be listening on). This is a problem, and I think the solution is to implement http auth... 
We can't just reject connections that don't originate from localhost, because someone really might want to use another computer to access the pop3proxy ui. ---------------------------------------------------------------------- Comment By: Richie Hindle (richiehindle) Date: 2003-03-05 08:38 Message: Logged In: YES user_id=85414 Unless I'm misunderstanding something, this is exactly what the html_ui_allow_remote_connections setting is for...? Thanks for the patch anyway - there's nothing wrong with being able to specify the address that way whether html_ui_allow_remote_connections solves your problem or not. ---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 07:36 Message: Logged In: YES user_id=311771 ahem. Make that ".. this allows -u localhost:8880", of course ---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 07:10 Message: Logged In: YES user_id=311771 Forgot the checkmark, as usual. arrggg. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 From noreply at sourceforge.net Wed Mar 5 07:41:15 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 11:32:18 2003 Subject: [Spambayes] [ spambayes-Feature Requests-698036 ] pop3proxy security Message-ID: Feature Requests item #698036, was opened at 2003-03-05 09:41 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 Category: pop3proxy Group: None Status: Open Priority: 5 Submitted By: Tim Stone (timstone4) Assigned to: Tim Stone (timstone4) Summary: pop3proxy security Initial Comment: Currently, there is no security on the pop3proxy, so anyone can access the user interface from any computer, given a web browser and knowledge of the ip address and port. 
Even if you didn't know the port, figuring it out wouldn't necessarily be difficult. This allows several operations that could be security problems, including reading at least the first couple hundred characters of each mail body. It would seem that the correct solution is to implement a challenge/authentication on the pop3proxy http server. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 From noreply at sourceforge.net Wed Mar 5 08:48:14 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 11:45:09 2003 Subject: [Spambayes] [ spambayes-Feature Requests-698036 ] pop3proxy security Message-ID: Feature Requests item #698036, was opened at 2003-03-05 09:41 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 Category: pop3proxy Group: None Status: Open Priority: 5 Submitted By: Tim Stone (timstone4) Assigned to: Tim Stone (timstone4) Summary: pop3proxy security Initial Comment: Currently, there is no security on the pop3proxy, so anyone can access the user interface from any computer, given a web browser and knowledge of the ip address and port. Even if you didn't know the port, figuring it out wouldn't necessarily be difficult. This allows several operations that could be security problems, including reading at least the first couple hundred characters of each mail body. It would seem that the correct solution is to implement a challenge/authentication on the pop3proxy http server. ---------------------------------------------------------------------- >Comment By: Skip Montanaro (montanaro) Date: 2003-03-05 10:48 Message: Logged In: YES user_id=44345 I don't think this is a problem. Just tell the webserver to listen on "localhost" or "127.0.0.1", or maybe even "". Connections from remote hosts won't be accepted. 
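A minimal sketch of the loopback-only listener suggested above, using the standard socket module in modern Python. The helper name make_listener is hypothetical and is not how pop3proxy itself is written. One caveat worth checking before relying on the empty-string variant: in Python's socket module an empty host string means "all interfaces", so only "localhost" or "127.0.0.1" restricts the listener to local connections.

```python
import socket


def make_listener(host="127.0.0.1", port=0):
    """Open a TCP listener bound to one specific address.

    Binding to 127.0.0.1 means only processes on this machine can connect.
    port=0 asks the operating system to pick any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock


listener = make_listener()
bound_host, bound_port = listener.getsockname()
listener.close()
```

This covers the single-machine case only; it does nothing for someone who wants remote access with authentication.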
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 From neale at woozle.org Wed Mar 5 09:36:04 2003 From: neale at woozle.org (Neale Pickett) Date: Wed Mar 5 12:36:01 2003 Subject: [Spambayes] Adding a message database In-Reply-To: ("Mark Hammond"'s message of "Thu, 27 Feb 2003 09:11:55 +1100") References: Message-ID: Hi everybody. I just got my Internet service restored. Boy howdy is the phone company ever responsive <2.0 wink>. "Mark Hammond" writes: > I simply want a memory of how a specific message was trained, for the > following reasons: > > * Accidental attempt to train the same message, in the same way, multiple > times. > * Accidental attempt to train the same message as ham and spam. So, this is a rockin' idea and I'd be glad to rewrite mboxtrain/hammiefilter to use it once it's implemented. Neale From piersh at friskit.com Wed Mar 5 09:42:13 2003 From: piersh at friskit.com (Piers Haken) Date: Wed Mar 5 12:41:06 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <9891913C5BFE87429D71E37F08210CB92C7515@zeus.sfhq.friskit.com> Paul, are you using any of: 1) oulook XP 2) hotmail plugin for (1) 3) exchange server ? I'm wondering if the problem has anything to do with the fact that the spam field is set before the message is moved. Piers. -----Original Message----- From: Moore, Paul [mailto:Paul.Moore@atosorigin.com] Sent: Wednesday, March 05, 2003 1:46 AM To: Piers Haken; Spambayes Subject: RE: [Spambayes] Outlook plugin error From: Piers Haken [mailto:piersh@friskit.com] > I'm seeing some weird behavior sometimes when the outlook plugin > filters spam. Sometimes the spam that ends up in my spam folder has a > spam field value of '0%' even though the 'show clues' feature shows > the correct value. [...] > I'm not sure what's going on here, has anyone else seen this before? 
Yes, I see it fairly often, and it has been reported before (to the list, but possibly not on SF). IIRC, Mark thought it was a timing issue between when the message arrived and when the plugin fired. But that's about as much as I know... Paul. From noreply at sourceforge.net Wed Mar 5 09:35:33 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 13:13:40 2003 Subject: [Spambayes] [ spambayes-Feature Requests-698036 ] pop3proxy security Message-ID: Feature Requests item #698036, was opened at 2003-03-05 15:41 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 Category: pop3proxy Group: None Status: Open Priority: 5 Submitted By: Tim Stone (timstone4) Assigned to: Tim Stone (timstone4) Summary: pop3proxy security Initial Comment: Currently, there is no security on the pop3proxy, so anyone can access the user interface from any computer, given a web browser and knowledge of the ip address and port. Even if you didn't know the port, figuring it out wouldn't necessarily be difficult. This allows several operations that could be security problems, including reading at least the first couple hundred characters of each mail body. It would seem that the correct solution is to implement a challenge/authentication on the pop3proxy http server. ---------------------------------------------------------------------- >Comment By: Richie Hindle (richiehindle) Date: 2003-03-05 17:35 Message: Logged In: YES user_id=85414 [Tim Stone] > Currently, there is no security on the pop3proxy Not true - you can use the html_ui_allow_remote_connections setting to reject connections from anywhere other than the local machine. This is a bit draconian - as you say, we should have a better solution - but it's not as bad as you make out. 
---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2003-03-05 16:48 Message: Logged In: YES user_id=44345 I don't think this is a problem. Just tell the webserver to listen on "localhost" or "127.0.0.1", or maybe even "". Connections from remote hosts won't be accepted. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 From noreply at sourceforge.net Wed Mar 5 09:40:02 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 13:13:41 2003 Subject: [Spambayes] [ spambayes-Feature Requests-698036 ] pop3proxy security Message-ID: Feature Requests item #698036, was opened at 2003-03-05 09:41 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 Category: pop3proxy Group: None Status: Open Priority: 5 Submitted By: Tim Stone (timstone4) Assigned to: Tim Stone (timstone4) Summary: pop3proxy security Initial Comment: Currently, there is no security on the pop3proxy, so anyone can access the user interface from any computer, given a web browser and knowledge of the ip address and port. Even if you didn't know the port, figuring it out wouldn't necessarily be difficult. This allows several operations that could be security problems, including reading at least the first couple hundred characters of each mail body. It would seem that the correct solution is to implement a challenge/authentication on the pop3proxy http server. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-05 11:40 Message: Logged In: YES user_id=645698 Ya, the problem here is that I might want to allow remote connections, but I certainly don't want just anybody to be able to connect. Skip's suggestion doesn't help here. 
---------------------------------------------------------------------- Comment By: Richie Hindle (richiehindle) Date: 2003-03-05 11:35 Message: Logged In: YES user_id=85414 [Tim Stone] > Currently, there is no security on the pop3proxy Not true - you can use the html_ui_allow_remote_connections setting to reject connections from anywhere other than the local machine. This is a bit draconian - as you say, we should have a better solution - but it's not as bad as you make out. ---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2003-03-05 10:48 Message: Logged In: YES user_id=44345 I don't think this is a problem. Just tell the webserver to listen on "localhost" or "127.0.0.1", or maybe even "". Connections from remote hosts won't be accepted. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702 From noreply at sourceforge.net Wed Mar 5 09:40:19 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 5 13:13:43 2003 Subject: [Spambayes] [ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort for uiPort Message-ID: Patches item #697970, was opened at 2003-03-05 14:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Wolfgang Strobl (strobl) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy didn't use addressAndPort for uiPort Initial Comment: pop3proxy doesn't accept the hostname:portno notation for the -l (i.e. uiPort) flag. 
I didn't like everybody on our LAN being able to read my mail using a webbrowser, so I wrote the attached patch; this allows -l localhost:8880 ---------------------------------------------------------------------- >Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 18:40 Message: Logged In: YES user_id=311771 Richie, thanks for the hint, I didn't know about the new html_ui_allow_remote_connections option, because I didn't read through docs and sources again after doing a new checkout. Using the Option parsing helper functions was simply done by looking for symmetry. Tim: assuming that localhost is resolved locally to 127.0.0.1, AFAIK only local processes using the loopback interface can connect to the port, when something listens on localhost:. That's exactly what I need, when everything (mail client, browser, pop3proxy) runs on the very same machine. ---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-03-05 16:25 Message: Logged In: YES user_id=645698 You bring up a very good point, Wolfgang. Your patch plugs one hole, but someone can still access your mail via http://:8880 (or whatever port you happen to be listening on). This is a problem, and I think the solution is to implement http auth... We can't just reject connections that don't originate from localhost, because someone really might want to use another computer to access the pop3proxy ui. ---------------------------------------------------------------------- Comment By: Richie Hindle (richiehindle) Date: 2003-03-05 15:38 Message: Logged In: YES user_id=85414 Unless I'm misunderstanding something, this is exactly what the html_ui_allow_remote_connections setting is for...? Thanks for the patch anyway - there's nothing wrong with being able to specify the address that way whether html_ui_allow_remote_connections solves your problem or not. 
---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 14:36 Message: Logged In: YES user_id=311771 ahem. Make that ".. this allows -u localhost:8880", of course ---------------------------------------------------------------------- Comment By: Wolfgang Strobl (strobl) Date: 2003-03-05 14:10 Message: Logged In: YES user_id=311771 Forgot the checkmark, as usual. arrggg. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702 From N7DR at arrisi.com Wed Mar 5 14:11:02 2003 From: N7DR at arrisi.com (D. R. Evans) Date: Wed Mar 5 16:11:17 2003 Subject: [Spambayes] pop3proxy crashes Message-ID: <3E660576.15567.1F786E44@localhost> I made the mistake of rebooting my Linux box.... Following the reboot, pop3proxy.py now dumps the following to the screen whenever I try to run it: Loading database... Traceback (most recent call last): File "./pop3proxy.py", line 1577, in ? run() File "./pop3proxy.py", line 1551, in run state.createWorkers() File "./pop3proxy.py", line 1161, in createWorkers self.bayes = storage.DBDictClassifier(filename) File "./spambayes/storage.py", line 140, in __init__ self.load() File "./spambayes/storage.py", line 152, in load t = self.db[self.statekey] File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__ return Unpickler(f).load() EOFError It worked fine (for about three weeks) until the reboot. I'm probably forgetting to do something obvious (I hope). 
Doc -------------------------------------------------------------- Phone: +1 303 494 0394 Mobile: +1 720 839 8462 Fax: +1 781 240 0527 -------------------------------------------------------------- From dave at nullcube.com Thu Mar 6 08:06:12 2003 From: dave at nullcube.com (Dave Harrison) Date: Wed Mar 5 16:11:39 2003 Subject: [Spambayes] encountered error while processing spam folder Message-ID: <20030305210612.GA5950@dave@alana.ucc.usyd.edu.au> Hey, I've been using spambayes for a few days and at first it worked fine. But recently I have been getting the following error when I try to train it on my spam folder. I'm assuming it might have to do with an email with a mangled header. But I'm having trouble tracking down which exact email it is. Is there a way I can track down the offending email to forward on to the devel team to help assess this error? Cheers Dave Training spam (/home/dave/.mail/spam): Reading as Unix mbox Traceback (most recent call last): File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 278, in ? 
main() File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 270, in main train(h, s, True, force) File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 203, in train mbox_train(h, path, is_spam, force) File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 139, in mbox_train if msg_train(h, msg, is_spam, force): File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 71, in msg_train h.train(msg, is_spam) File "/home/dave/spambayes-1.0a2/hammie.py", line 150, in train spambayes.hammiebulk.main() File "./spambayes/classifier.py", line 270, in learn File "./spambayes/classifier.py", line 391, in _add_msg File "./spambayes/compatsets.py", line 374, in __init__ File "./spambayes/compatsets.py", line 333, in _update File "./spambayes/tokenizer.py", line 1052, in tokenize File "./spambayes/tokenizer.py", line 1106, in tokenize_headers File "/usr/local/lib/python2.2/email/Header.py", line 92, in decode_header dec = email.base64MIME.decode(encoded) File "/usr/local/lib/python2.2/email/base64MIME.py", line 179, in decode dec = a2b_base64(s) binascii.Error: Incorrect padding From skip at pobox.com Wed Mar 5 15:21:31 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 5 16:21:40 2003 Subject: [Spambayes] encountered error while processing spam folder In-Reply-To: <20030305210612.GA5950@dave@alana.ucc.usyd.edu.au> References: <20030305210612.GA5950@dave@alana.ucc.usyd.edu.au> Message-ID: <15974.27227.241247.403310@montanaro.dyndns.org> Dave> Hey, Ive been using spambayes for a few days and at first it Dave> worked fine. But recently I have been getting the following error Dave> when I try to train it on my spam folder.... Fixed in CVS. ;-) Skip From mhammond at skippinet.com.au Thu Mar 6 08:35:32 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 5 16:36:34 2003 Subject: [Spambayes] locale and ConfigParser Message-ID: I recently received a mail regarding SpamBayes refusing to work: > Possible reasons: > > Outlook 2002, Dutch version. ... 
> File "C:\Python22\lib\ConfigParser.py", line 306, in getfloat > return self.__get(section, float, option) > File "C:\Python22\lib\ConfigParser.py", line 300, in __get > return conv(self.get(section, option)) > exceptions.ValueError: invalid literal for float(): 0.20 Adding the following anywhere before the file is parsed: > import locale > locale.setlocale(locale.LC_NUMERIC, "en") corrects the problem. However, it is unclear to me what the ramifications of this would be. Anyone have a clue what we should do about this? Those-bloody-dutch ly, Mark. From mhammond at skippinet.com.au Thu Mar 6 09:33:08 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 5 17:33:51 2003 Subject: [Spambayes] Outlook plugin error In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C7515@zeus.sfhq.friskit.com> Message-ID: > Paul, are you using any of: > 1) oulook XP > 2) hotmail plugin for (1) > 3) exchange server > > ? > > I'm wondering if the problem has anything to do with the fact that the > spam field is set before the message is moved. Further, when you see this behaviour, can you immediately check the Pythonwin debug window for a message? Each message processed should have a message that indicates its spam disposition - the first thing I need to know is if such mails fire this debug trace. Mark. From roz at one.net Wed Mar 5 17:39:29 2003 From: roz at one.net (J. Solomon Kostelnik) Date: Wed Mar 5 17:40:54 2003 Subject: [Spambayes] POP3proxy.py error Message-ID: <1046903969.3119.1.camel@jsk.one.net> This script had been working fine for the last several days. I have changed nothing in the setup. Today when I attempt to load pop3proxy.py from my spambayes directory, I get the following output: Loading database... Traceback (most recent call last): File "/usr/bin/pop3proxy.py", line 1651, in ? 
run() File "/usr/bin/pop3proxy.py", line 1619, in run state.createWorkers() File "/usr/bin/pop3proxy.py", line 1307, in createWorkers self.bayes = storage.DBDictClassifier(self.databaseFilename) File "/usr/lib/python2.2/site-packages/spambayes/storage.py", line 140, in __init__ self.load() File "/usr/lib/python2.2/site-packages/spambayes/storage.py", line 148, in load self.dbm = dbmstorage.open(self.db_name, self.mode) File "/usr/lib/python2.2/site-packages/spambayes/dbmstorage.py", line 54, in open return f(*args) File "/usr/lib/python2.2/site-packages/spambayes/dbmstorage.py", line 36, in open_best return f(*args) File "/usr/lib/python2.2/site-packages/spambayes/dbmstorage.py", line 22, in open_gdbm return gdbm.open(*args) gdbm.error: (11, 'Resource temporarily unavailable') ----- What is happening here? Solomon From bill at parducci.net Wed Mar 5 15:47:49 2003 From: bill at parducci.net (bill parducci) Date: Wed Mar 5 18:47:53 2003 Subject: [Spambayes] statistical comparison of enviroment? Message-ID: <3E668CA5.3050203@parducci.net> first off, FWIW i am really amazed at the level of work that has gone into just the consideration of tokenization strategies. having struggled against the spam onslaught for the last 2 years armed solely with procmail i can really appreciate the work that has been done here! (after 200+ recipes i asked myself if there wasn't a better way... and found you guys... now i *know* there is. kudos to the group, this is some great work! obeisance complete, off to the topic at hand :o) i have been reading through the code/documentation looking at not just the token process, but considering the data that is subject to statistical analysis as well. i might have missed this, but has anyone considered including environmental factors into the spam vs. ham analysis? a couple of things come to mind right off the bat, but i am sure more could be found: 1. time of day (would require some real granularity tweaking) 2. 
size of header / size message / header:message ratio 3. attachment count (MIME count) / MIME count:message size ratio 4. [space|tab|\n]:[visible char] ratio etc... i think that if it hasn't already been done, it would be interesting to see if statistically comparing the *phyiscal* attributes of the messages would have an effect on the accuracy of the decision. currently--and i freely admit to being a lamer in undergrad stats--i think that this information is only considered implicitly. b From mhammond at skippinet.com.au Thu Mar 6 12:35:55 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 5 20:36:56 2003 Subject: [Spambayes] Adding a message database In-Reply-To: Message-ID: > > I simply want a memory of how a specific message was trained, for the > > following reasons: > > > > * Accidental attempt to train the same message, in the same > way, multiple > > times. > > * Accidental attempt to train the same message as ham and spam. > > So, this is a rockin' idea and I'd be glad to rewrite > mboxtrain/hammiefilter to use it once it's implemented. OK - while I am here... ;) It seems to me that sub-classing classifier to change storage semantics is wrong. IMO, this should use delegation. sub-classing of classifier should be used should the classification sheme want overriding, not the storage requirements. This wouldn't be too hard to do - _setwordinfo() etc just delegate to a self.storage - and would make some sense to do as part of a "message database". If there a compelling reason for it being the way it is? Mark. From popiel at wolfskeep.com Wed Mar 5 17:59:16 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Mar 5 20:59:21 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: Message from bill parducci of "Wed, 05 Mar 2003 15:47:49 PST." 
<3E668CA5.3050203@parducci.net> References: <3E668CA5.3050203@parducci.net> Message-ID: <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> In message: <3E668CA5.3050203@parducci.net> bill parducci writes: > >i might have missed this, but has anyone considered including >environmental factors into the spam vs. ham analysis? a couple >of things come to mind right off the bat, but i am sure more >could be found: > >1. time of day (would require some real granularity tweaking) This was tried, with 10 minute intervals; testing on two separate corpora (that of the guy who came up with the patch and my own) showed that the effect was inconsequential. The largest result was the observation that both ham and spam tend to slacken a bit in the middle of the night. >2. size of header / size message / header:message ratio > >3. attachment count (MIME count) / MIME count:message size ratio > >4. [space|tab|\n]:[visible char] ratio All of these have been mentioned in the past, but no one to my knowledge has actually tested them. Please feel free to code up something to turn these ideas into tokens... then they can be tested, and if they're useful then they'll likely be incorporated. Testing of new tokens like this has dropped off since about last October... spambayes is already good enough for just about everyone to be happy. My recent tests on training methods seem to show that accuracy has been dropping off for the last twho months, though, so it may be time to revisit this problem... - Alex From bill at parducci.net Wed Mar 5 18:26:30 2003 From: bill at parducci.net (bill parducci) Date: Wed Mar 5 21:26:33 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> Message-ID: <3E66B1D6.90308@parducci.net> > Please feel free to code up something to turn these ideas into > tokens... 
then they can be tested, and if they're useful then > they'll likely be incorporated. ok. in the interest of time saving (i've not programmed in python before), how about i [tabular] list what i find and let the statistas in the group decide if there is significance? i have a pile of spam and ham that i can wade through (unless there is a standardized sample that is preferable). b From skip at pobox.com Wed Mar 5 20:39:25 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 5 21:39:28 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> Message-ID: <15974.46301.300535.819582@montanaro.dyndns.org> >> 1. time of day (would require some real granularity tweaking) >> 2. size of header / size message / header:message ratio >> 3. attachment count (MIME count) / MIME count:message size ratio >> 4. [space|tab|\n]:[visible char] ratio (Just thinking out loud.) One of the problems we have generating new improvements is the system is so good now that improvements of any kind tend to be microscopic, and thus extremely hard to measure. Still, the more ways you can get the tool to tell you "this smells like spam", the harder it will be for spammers to defeat it. Accordingly, when considering potential improvements (improved tokenizing tricks, for example), perhaps what we should be doing is disabling much of the current capability and then testing a new change against such a "crippled" system. Making it more concrete, suppose we split tokenizing into two groups, "natural" tokens and "synthetic" tokens. Natural tokens would be what you get with basic whitespace splitting, nothing more. 
Synthetic tokens would be stuff like tokenizing this subject and generating subject:[Spambayes] subject:statistical subject:comparison subject:of subject:environment By reducing the effectiveness of the system for testing, I think we'd have a better idea how effective a new idea might be. What I don't know is how to measure the independence of two different "improvements". (The more independent two improvements are, the harder it seems it would be for a spammer to hit two birds with one stone when trying to defeat spambayes.) Suppose for the sake of argument that this base system I talk about is 80% effective at properly distinguishing ham from spam. Suppose improvement A takes that to 83% and applied independently to the base system, improvement B takes that to 85%. How do you tell how independent A and B are from one another? Skip From bill at parducci.net Wed Mar 5 19:23:27 2003 From: bill at parducci.net (bill parducci) Date: Wed Mar 5 22:23:31 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: <15974.46301.300535.819582@montanaro.dyndns.org> References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> <15974.46301.300535.819582@montanaro.dyndns.org> Message-ID: <3E66BF2F.6080200@parducci.net> Skip Montanaro wrote: > By reducing the effectiveness of the system for testing, I think we'd have a > better idea how effective a new idea might be. What I don't know is how to > measure the independence of two different "improvements". (The more > independent two improvements are, the harder it seems it would be for a > spammer to hit two birds with one stone when trying to defeat spambayes.) > Suppose for the sake of argument that this base system I talk about is 80% > effective at properly distinguishing ham from spam. Suppose improvement A > takes that to 83% and applied independently to the base system, improvement > B takes that to 85%. How do you tell how independent A and B are from one > another? 
how about you measure each of the methodologies individually (at least those that have relevance; it seems that time is not one such approach), then look for those that are most complementary? for example, suppose you had a simple matrix with message_id along the vertical axis and methodology across the horizontal axis (plus one entry for 'true nature' of message) and then checked to see which combination of methodologies was the most accurate? of course, there may be some level of combinatorial explosion in doing it this way, but it would speak to the independence issue wouldn't it? b From popiel at wolfskeep.com Wed Mar 5 20:03:36 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Mar 5 23:03:40 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: Message from bill parducci of "Wed, 05 Mar 2003 18:26:30 PST." <3E66B1D6.90308@parducci.net> References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> <3E66B1D6.90308@parducci.net> Message-ID: <20030306040336.77E4E2DEA4@cashew.wolfskeep.com> In message: <3E66B1D6.90308@parducci.net> bill parducci writes: >> Please feel free to code up something to turn these ideas into >> tokens... then they can be tested, and if they're useful then >> they'll likely be incorporated. > >ok. in the interest of time saving (i've not programmed in python >before), how about i [tabular] list what i find and let the statistas >in the group decide if there is significance? i have a pile of spam >and ham that i can wade through (unless there is a standardized sample >that is preferable). We've actually got a pretty good testing infrastructure set up; for tokenization tests, I personally use timcv.py with each of the tokenization options and then feed the output of the runs into table.py. This produces some nice tabularizations that you may notice in the mailing list archives.
Using your own ham and spam is standard procedure here; most people are touchy about giving their ham away due to privacy concerns. If some new option looks good, then multiple people try it out on their different corpora, and if it still looks good after that, then it gets included. Don't worry about not having coded in python before. I hadn't done much in python before this project either, and people haven't been screaming about how ugly my code is, yet... - Alex From popiel at wolfskeep.com Wed Mar 5 20:22:00 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Mar 5 23:22:02 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: Message from Skip Montanaro <15974.46301.300535.819582@montanaro.dyndns.org> References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> <15974.46301.300535.819582@montanaro.dyndns.org> Message-ID: <20030306042200.08E482DEA4@cashew.wolfskeep.com> In message: <15974.46301.300535.819582@montanaro.dyndns.org> Skip Montanaro writes: > >Accordingly, when considering potential improvements (improved tokenizing >tricks, for example), perhaps what we should be doing is disabling much of >the current capability and then testing a new change against such a >"crippled" system. This seems like a reasonable strategy. There's already options to control some of the header parsing; I suspect more options could be put in to disable various other aspects of the tokenizer. I'm not sure how much the folks who are just trying to use the system will like all the extra options, though... >What I don't know is how to measure the independence of two different >"improvements". The simple solution for that seems to me to be doing four runs, with each combination of the two options on and off. If the two are independent, then the run with both on should be better than the run with either on, and the run with neither on should be worse than both. 
If it's really independent, then there should be a nice mathematical relation between the improvements from none to either and from either to both... but I'm forgetting what that math is at the moment, and I doubt than anything is perfectly independent anyway. >Suppose for the sake of argument that this base system I talk about is 80% >effective at properly distinguishing ham from spam. Suppose improvement A >takes that to 83% and applied independently to the base system, improvement >B takes that to 85%. How do you tell how independent A and B are from one >another? By doing a run with both A and B, and seeing if it was at about 87%. >(The more independent two improvements are, the harder it seems it would >be for a spammer to hit two birds with one stone when trying to defeat >spambayes.) Aye. The problem, of course, is that we could start making spambayes so tricked-out that it'd be as slow as SpamAssassin. ;-) - Alex From tim at fourstonesExpressions.com Wed Mar 5 23:08:22 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 6 00:11:06 2003 Subject: [Spambayes] Adding a message database In-Reply-To: Message-ID: 3/5/2003 7:35:55 PM, "Mark Hammond" wrote: >It seems to me that sub-classing classifier to change storage semantics is >wrong. IMO, this should use delegation. sub-classing of classifier should >be used should the classification sheme want overriding, not the storage >requirements. Yes, I agree with this. I think the same kind of argument applies to the message id database thing. Assuming that there is a classifier subclass to manage message ids seems wrong. And while I am here... ;) assuming that classifier will be subclassed as some kind of persistent classifier seems wrong to me, too. > >This wouldn't be too hard to do - _setwordinfo() etc just delegate to a >self.storage - and would make some sense to do as part of a "message >database". I wonder if delegate is the right pattern here. Perhaps observer? 
> >If there a compelling reason for it being the way it is? Nope. So... let's consider a strawman like this:

    import email.Message
    from time import time

    class NeverTrainedError(Exception):
        """ Raised for a message id with no training record
        (not defined in the original strawman) """

    class Classifier:
        def __init__(self, wi):
            self.wordinfo = wi()

    class WordInfo:
        """ In memory wordinfo class """

    class PersistentWordInfo(WordInfo):
        """ Implements persistence as dbdict, let's forget pickles."""

    class Message:
        """ Message abstraction """
        def __init__(self, id):
            """ All messages have an id """
            if id is None:
                self.id = time()    # make up some arbitrary id
            else:
                self.id = id

        def setPayload(self, payload):
            """ payload is delivered to an email.Message object """
            self.msg = email.Message.Message()
            self.msg.set_payload(payload)

        # have appropriate delegators to the Message object

    class FileMessage(Message):
        """ Message stored in a file system """

    class MboxMessage(Message):
        """ Message stored in an mbox """

    # Perhaps other Message classes for various mechanisms, like
    # Outlook, Lotus, etc.

    class MessageSet:
        """ Iterable set of Message objects """

    class FileMessageSet(MessageSet):
        """ Set of Messages in the file system """

    class MboxMessageSet(MessageSet):
        """ Set of Messages in an mbox """

    # Perhaps other MessageSet classes for various mechanisms, like
    # Outlook, Lotus, etc.

    class Trainer:
        def __init__(self, wordinfo, idDb):
            """ Trains.  Some methods in this class will come from
            current classifier class. """
            self.wordinfo = wordinfo
            self.idDb = idDb

        def learn(self, msg, isSpam):
            """ unlearns if need be, then learns a message. """
            try:
                mstat = self.idDb.isSpam(msg)
            except NeverTrainedError:
                pass
            else:
                if isSpam != mstat:
                    # undo the earlier training under its old label
                    self.unlearn(msg, mstat)
            self.wordinfo.learn(msg, isSpam)    # you get the idea

        def unlearn(self, msg, isSpam):
            """ unlearn previous training """
            self.wordinfo.unlearn(msg, isSpam)

    class MessageIdDb:
        """ Maintains a persistent set of message ids and how they've
        been trained"""
        def __init__(self, dbname):
            """ Assumes a particular persistence mechanism (pickle,
            bsddb, whatever)"""
            self.dbname = dbname
            # do something to load

        def rememberSpam(self, id):
            pass

        def rememberHam(self, id):
            pass

        def isSpam(self, id):
            pass

        # Iterable?

Rip away, dudes... :) c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim_one at email.msn.com Thu Mar 6 01:01:12 2003 From: tim_one at email.msn.com (Tim Peters) Date: Thu Mar 6 01:01:50 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: <15974.46301.300535.819582@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > Suppose improvement A takes that to 83% and applied independently to the > base system, improvement B takes that to 85%. How do you tell how > independent A and B are from one another? It's a well studied area, and any std work on experimental design will cover it. Picture an analogy: spam == disease, and various kinds of clues are various drugs claimed to cure the disease (or test procedure claimed to identify the disease). A proper experimental design can quantify which drugs work and how well, which combinations are better than the sum of their parts, and which worse. This is a messy combinatorial problem, though, and real-life experiments rarely try to tackle more than a few drugs at a time. Then again, despite the howling of the perturbed, few people actually die from a spam that leaks thru. If I had time, I'd rather investigate Adaboost (mentioned several times here long ago) as a means to combine various kinds of clues as if they were each classifiers on their own.
Adaboost is a general approach to combining multiple classifiers so that the combined classifier is better than any of its parts, provided only (roughly speaking) that each classifier going into it does better than chance. For example, we've seen here that a header-only classifier can do very well, and so can a classifier that looks only at msg bodies. The *best* way to combine those two may very well not be simply lumping them together as equals. I ran experiments on a classifier that looked only at Subject lines, and reported here that it had error rates down around 5% all by itself. Etc: there are lots of little classifiers you *could* build out of our code base. Chi-combining gives each kind of clue (token) equal weight, and there's no reason to believe that's optimal. Gary Robinson once suggested a variant on the geometric-mean approaches that weighted tokens differently by giving each an exponent derived from its spamprob (instead of giving each one exponent 1/n, where n is the # of tokens). I couldn't make time to pursue that then. In a sense, Adaboost is a way of weighting a collection of classifiers where the data *tells* you good weights to use, instead of dreaming up an a priori weighting scheme. Lots of "learning" algorithms do a similar thing, but Adaboost enjoys a long list of provably good performance and convergence properties. OTOH, if you come up with a better scheme, my original 35K collection of test msgs can't demonstrate it (spambayes already does a perfect-as-it-can-be job on it). OTOH, lots of marginal decisions were based on that specific collection, and I'm sure some of them would have been decided differently if anyone else had spent 20 hours a day for two months dreaming up tests on their test corpus.
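A minimal sketch of the weighting idea in the message above, assuming each "little classifier" is just a function that takes a message and returns True for spam. This is textbook discrete AdaBoost over a fixed pool of classifiers, not code from the spambayes code base. The names adaboost and predict_spam, the three toy classifiers, and the dict keys they read are all hypothetical and exist only for the illustration.

```python
import math


def adaboost(classifiers, examples, rounds=3):
    """Discrete AdaBoost over a fixed pool of classifiers.

    classifiers: callables taking a message, returning True for spam.
    examples: list of (message, is_spam) training pairs.
    Returns a list of (vote_weight, classifier) pairs.
    """
    n = len(examples)
    sample_weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Find the classifier with the lowest weighted error.
        best, best_err = None, None
        for clf in classifiers:
            err = sum(w for w, (msg, label) in zip(sample_weights, examples)
                      if clf(msg) != label)
            if best_err is None or err < best_err:
                best, best_err = clf, err
        if best_err >= 0.5:
            break  # nothing left that beats chance on the current weights
        err = max(best_err, 1e-10)  # avoid dividing by zero below
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, best))
        if best_err == 0:
            break  # a perfect classifier leaves nothing to re-weight
        # Raise the weight of messages this classifier got wrong,
        # lower the weight of the ones it got right, then normalize.
        sample_weights = [
            w * math.exp(alpha if best(msg) != label else -alpha)
            for w, (msg, label) in zip(sample_weights, examples)]
        total = sum(sample_weights)
        sample_weights = [w / total for w in sample_weights]
    return ensemble


def predict_spam(ensemble, msg):
    """Weighted vote: each classifier votes +1 (spam) or -1 (ham)."""
    score = sum(alpha * (1 if clf(msg) else -1) for alpha, clf in ensemble)
    return score > 0


# Hypothetical stand-ins for Subject-only, header-only and body-only
# classifiers; each just reads a precomputed verdict from a toy message.
def by_subject(msg):
    return msg["subject_spammy"]


def by_headers(msg):
    return msg["headers_spammy"]


def by_body(msg):
    return msg["body_spammy"]


def make(subject, headers, body):
    return {"subject_spammy": subject,
            "headers_spammy": headers,
            "body_spammy": body}


# Three spam and three ham; each classifier misses a different spam.
examples = [
    (make(True, True, False), True),
    (make(True, False, True), True),
    (make(False, True, True), True),
    (make(False, False, False), False),
    (make(False, False, False), False),
    (make(False, False, False), False),
]
pool = [by_subject, by_headers, by_body]
ensemble = adaboost(pool, examples, rounds=3)
```

On this toy data each classifier alone gets one of the six messages wrong, while the weighted vote returned by adaboost gets all six right. The vote weights come out of the training data rather than being fixed in advance, which is the contrast with giving every clue equal weight.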
From jean-marc.valin at hermes.usherb.ca Thu Mar 6 00:44:31 2003 From: jean-marc.valin at hermes.usherb.ca (Jean-Marc Valin) Date: Thu Mar 6 01:04:15 2003 Subject: [Spambayes] mboxtrain.py crashes Message-ID: <1046929470.1829.20.camel@idefix.homelinux.org> Skipped content of type multipart/mixed-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 241 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20030306/8248f39a/attachment-0001.bin From skip at pobox.com Thu Mar 6 00:42:15 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 6 01:42:19 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: <20030306042200.08E482DEA4@cashew.wolfskeep.com> References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> <15974.46301.300535.819582@montanaro.dyndns.org> <20030306042200.08E482DEA4@cashew.wolfskeep.com> Message-ID: <15974.60871.524835.326773@montanaro.dyndns.org> Alex> I suspect more options could be put in to disable various other Alex> aspects of the tokenizer. I'm not sure how much the folks who are Alex> just trying to use the system will like all the extra options, Alex> though... I was thinking along the lines of one extra option which could collectively disable all but the most basic features. It would default to False so normal users would have to explicitly enable it (and might even get a warning displayed if it was enabled). >> (The more independent two improvements are, the harder it seems it >> would be for a spammer to hit two birds with one stone when trying to >> defeat spambayes.) Alex> Aye. The problem, of course, is that we could start making Alex> spambayes so tricked-out that it'd be as slow as SpamAssassin. ;-) Not necessarily. If A and B prove to not be independent, we dump one and keep the other. 
In some situations, spambayes may actually perform fewer tricks, thus speeding it up. Skip From Paul.Moore at atosorigin.com Thu Mar 6 09:05:47 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Thu Mar 6 04:07:14 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D959@UKDCX001.uk.int.atosorigin.com> From: Piers Haken [mailto:piersh@friskit.com] > Paul, are you using any of: > 1) oulook XP > 2) hotmail plugin for (1) > 3) exchange server Yes, Exchange Server > I'm wondering if the problem has anything to do with the fact that the > spam field is set before the message is moved. I'm not sure I see how, but I've no reason to think you're wrong, either. I always assumed that it was somehow related to the fact that mails arrive asynchronously, and could therefore arrive when the plugin "wasn't ready" somehow. That implies (a) that some form of locking or queueing mechanism is needed, and (b) that it's going to be bloody hard to diagnose or test :-) But this is pure speculation on my part... Paul. From rob at hooft.net Thu Mar 6 11:13:04 2003 From: rob at hooft.net (Rob W. W. Hooft) Date: Thu Mar 6 05:13:09 2003 Subject: [Spambayes] statistical comparison of enviroment? References: <3E668CA5.3050203@parducci.net> <20030306015916.5BEF62DEA4@cashew.wolfskeep.com> <15974.46301.300535.819582@montanaro.dyndns.org> Message-ID: <3E671F30.9080101@hooft.net> Skip Montanaro wrote: > Suppose improvement A > takes that to 83% and applied independently to the base system, improvement > B takes that to 85%. How do you tell how independent A and B are from one > another? Separate from all the good suggestions already made to help this, I would say that a little information entropy would do wonders. Say we have one token that occurs in 25 out of 100 messages, regardless of whether they are ham or spam. And another one that does also hit 25 out of the same 100 messages. 

            present  absent
    token1     25      75
    token2     25      75

In this case, both tokens have an information entropy (S) of:

    S = 0.25*log_e(1/0.25) + 0.75*log_e(1/0.75) = 0.56 nat

Combining the two tokens can give different possibilities, among which:

                    token1
    token2     present  absent
    present       25       0       S = 0.56 nat
    absent         0      75

                    token1
    token2     present  absent
    present        9      16       S = 1.11 nat
    absent        16      59

                    token1
    token2     present  absent
    present        0      25       S = 1.04 nat
    absent        25      50

This way it is possible to see how many "bits" of information are obtained from one token individually, or by combining tokens. In general, combining tokens will give less than the sum of their individual contributions. How much less is a quantitative measure of the correlation of the tokens. Of course this does not make any prediction as to the suitability of each token to characterize a message as spam. Someone with better background in information theory can probably combine the information entropy with the suitability in a proper way. In any case, if the two tokens under study are correlated as in the first combination (25/0/0/75), they are equally suited for spam classification. Regards, Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim at fourstonesExpressions.com Thu Mar 6 06:29:18 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 6 07:29:23 2003 Subject: [Spambayes] mboxtrain.py crashes In-Reply-To: <1046929470.1829.20.camel@idefix.homelinux.org> Message-ID: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst> Jean-Marc, please report this as a bug so we can track it. You can do that at http://sourceforge.net/projects/spambayes/ Otherwise, your report will get lost in the mailing list noise. Thanks. 3/5/2003 11:44:31 PM, Jean-Marc Valin wrote: >Hi, > >I'm trying to train a spam database and I'm experiencing crashes with >mboxtrain.py. I'm attaching three mbox's (simplified to their offending >e-mail) that produce the crash.
This happens with both CVS and the last >nightly build (tried both python 2.2 and 2.3a2). The message printed is: > >Traceback (most recent call last): > File "mboxtrain.py", line 284, in ? > main() > File "mboxtrain.py", line 271, in main > train(h, g, False, force) > File "mboxtrain.py", line 209, in train > mbox_train(h, path, is_spam, force) > File "mboxtrain.py", line 140, in mbox_train > for msg in mbox: > File "/opt//lib/python2.3/mailbox.py", line 35, in next > return self.factory(_Subfile(self.fp, start, stop)) > File "/software/spambayes/spambayes/mboxutils.py", line 116, in >get_message > msg = email.message_from_string(obj) > File "/opt//lib/python2.3/email/__init__.py", line 52, in >message_from_string > return Parser(_class, strict=strict).parsestr(s) > File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr > return self.parse(StringIO(text), headersonly=headersonly) > File "/opt//lib/python2.3/email/Parser.py", line 64, in parse > self._parsebody(root, fp, firstbodyline) > File "/opt//lib/python2.3/email/Parser.py", line 239, in _parsebody > msgobj = self.parsestr(part) > File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr > return self.parse(StringIO(text), headersonly=headersonly) > File "/opt//lib/python2.3/email/Parser.py", line 64, in parse > self._parsebody(root, fp, firstbodyline) > File "/opt//lib/python2.3/email/Parser.py", line 146, in _parsebody > boundary = container.get_boundary() > File "/opt//lib/python2.3/email/Message.py", line 701, in get_boundary > boundary = self.get_param('boundary', missing) > File "/opt//lib/python2.3/email/Message.py", line 566, in get_param > for k, v in self._get_params_preserve(failobj, header): > File "/opt//lib/python2.3/email/Message.py", line 516, in >_get_params_preserve params = Utils.decode_params(params) > File "/opt//lib/python2.3/email/Utils.py", line 337, in decode_params > charset, language, value = decode_rfc2231(EMPTYSTRING.join(value)) > File 
"/opt//lib/python2.3/email/Utils.py", line 283, in decode_rfc2231 > charset, language, s = s.split("'", 2) >ValueError: unpack list of wrong size > > Jean-Marc > >-- >Jean-Marc Valin, M.Sc.A. >LABORIUS (http://www.gel.usherb.ca/laborius) >Universit? de Sherbrooke, Qu?bec, Canada > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From jm at jmason.org Thu Mar 6 12:38:13 2003 From: jm at jmason.org (Justin Mason) Date: Thu Mar 6 08:32:30 2003 Subject: [Spambayes] statistical comparison of enviroment? In-Reply-To: Message from "T. Alexander Popiel" <20030306042200.08E482DEA4@cashew.wolfskeep.com> Message-ID: <20030306123818.73B6016F1B@jmason.org> T. Alexander Popiel said: > Aye. The problem, of course, is that we could start making spambayes > so tricked-out that it'd be as slow as SpamAssassin. ;-) Hey! ;) --j. From noreply at sourceforge.net Thu Mar 6 06:24:36 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 09:56:44 2003 Subject: [Spambayes] [ spambayes-Feature Requests-690928 ] turn off saving messages in popproxy Message-ID: Feature Requests item #690928, was opened at 2003-02-21 16:00 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=690928&group_id=61702 Category: pop3proxy Group: None >Status: Closed Priority: 5 Submitted By: Carl Nygard (cnygard) Assigned to: Tim Stone (timstone4) Summary: turn off saving messages in popproxy Initial Comment: It would be nice to be able to turn off saving message for training, and just let the settings chug. I'm guessing that the messages will just pile up if I don't go in and at least discard the messages every day. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-06 08:24 Message: Logged In: YES user_id=645698 Option has been added. 
---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-02-21 21:28 Message: Logged In: YES user_id=645698 Messages are auto-deleted after 7 days, by default. This is not well documented, however. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=690928&group_id=61702 From N7DR at arrisi.com Thu Mar 6 07:47:26 2003 From: N7DR at arrisi.com (D. R. Evans) Date: Thu Mar 6 10:02:20 2003 Subject: [Spambayes] mboxtrain.py crashes In-Reply-To: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst> References: <1046929470.1829.20.camel@idefix.homelinux.org> Message-ID: <3E66FD0E.5572.233F95DE@localhost> On 6 Mar 2003 at 6:29, Tim Stone - Four Stones Expre wrote: > Jean-Marc, please report this as a bug so we can track it. You can do > that at http://sourceforge.net/projects/spambayes/ Otherwise, your > report will get lost in the mailing list noise. Thanks. > So I assume that I should do the same with my notice yesterday about pop3proxy.py crashes. I'll file a bug report later today. I already miss my spambayes :-) Doc -------------------------------------------------------------- Phone: +1 303 494 0394 Mobile: +1 720 839 8462 Fax: +1 781 240 0527 -------------------------------------------------------------- From MMARTINEZ at intranet.reeusda.gov Thu Mar 6 10:33:35 2003 From: MMARTINEZ at intranet.reeusda.gov (Martinez, Michael - CSREES/ISTM) Date: Thu Mar 6 10:32:51 2003 Subject: [Spambayes] Integration with qmail? Message-ID: I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers would be appreciated. 
Thanks, Michael Martinez CSREES/ISTM/USDA From tim at fourstonesExpressions.com Thu Mar 6 09:50:07 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 6 10:50:13 2003 Subject: [Spambayes] pop3proxy crashes In-Reply-To: <3E660576.15567.1F786E44@localhost> Message-ID: Nearly as I can tell, your training database has been corrupted. I'm not quite sure how this happened, but from what I see in the code, there is likely no recovery at this point. When you submit a bug report, go ahead and attach your training database. 3/5/2003 3:11:02 PM, "D. R. Evans" wrote: >I made the mistake of rebooting my Linux box.... > >Following the reboot, pop3proxy.py now dumps the following to the >screen whenever I try to run it: > >Loading database... >Traceback (most recent call last): > File "./pop3proxy.py", line 1577, in ? > run() > File "./pop3proxy.py", line 1551, in run > state.createWorkers() > File "./pop3proxy.py", line 1161, in createWorkers > self.bayes = storage.DBDictClassifier(filename) > File "./spambayes/storage.py", line 140, in __init__ > self.load() > File "./spambayes/storage.py", line 152, in load > t = self.db[self.statekey] > File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__ > return Unpickler(f).load() >EOFError > >It worked fine (for about three weeks) until the reboot. I'm probably >forgetting to do something obvious (I hope). 
> > Doc >-------------------------------------------------------------- >Phone: +1 303 494 0394 >Mobile: +1 720 839 8462 >Fax: +1 781 240 0527 >-------------------------------------------------------------- > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From noreply at sourceforge.net Thu Mar 6 08:09:03 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 11:10:02 2003 Subject: [Spambayes] [ spambayes-Bugs-698796 ] mboxtrain.py crashes on some mbox data Message-ID: Bugs item #698796, was opened at 2003-03-06 11:09 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698796&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jean-Marc Valin (jmvalin) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain.py crashes on some mbox data Initial Comment: I'm trying to train a spam database and I'm experiencing crashes with mboxtrain.py. I'm attaching three mbox's (simplified to their offending e-mail) that produce the crash. This happens with both CVS and the last nightly build (tried both python 2.2 and 2.3a2). The message printed is: Traceback (most recent call last): File "mboxtrain.py", line 284, in ? 
main() File "mboxtrain.py", line 271, in main train(h, g, False, force) File "mboxtrain.py", line 209, in train mbox_train(h, path, is_spam, force) File "mboxtrain.py", line 140, in mbox_train for msg in mbox: File "/opt//lib/python2.3/mailbox.py", line 35, in next return self.factory(_Subfile(self.fp, start, stop)) File "/software/spambayes/spambayes/mboxutils.py", line 116, in get_message msg = email.message_from_string(obj) File "/opt//lib/python2.3/email/__init__.py", line 52, in message_from_string return Parser(_class, strict=strict).parsestr(s) File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "/opt//lib/python2.3/email/Parser.py", line 64, in parse self._parsebody(root, fp, firstbodyline) File "/opt//lib/python2.3/email/Parser.py", line 239, in _parsebody msgobj = self.parsestr(part) File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "/opt//lib/python2.3/email/Parser.py", line 64, in parse self._parsebody(root, fp, firstbodyline) File "/opt//lib/python2.3/email/Parser.py", line 146, in _parsebody boundary = container.get_boundary() File "/opt//lib/python2.3/email/Message.py", line 701, in get_boundary boundary = self.get_param('boundary', missing) File "/opt//lib/python2.3/email/Message.py", line 566, in get_param for k, v in self._get_params_preserve(failobj, header): File "/opt//lib/python2.3/email/Message.py", line 516, in _get_params_preserve params = Utils.decode_params(params) File "/opt//lib/python2.3/email/Utils.py", line 337, in decode_params charset, language, value = decode_rfc2231(EMPTYSTRING.join(value)) File "/opt//lib/python2.3/email/Utils.py", line 283, in decode_rfc2231 charset, language, s = s.split("'", 2) ValueError: unpack list of wrong size ---------------------------------------------------------------------- You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698796&group_id=61702 From bill at parducci.net Thu Mar 6 08:02:10 2003 From: bill at parducci.net (bill parducci) Date: Thu Mar 6 11:12:22 2003 Subject: [Spambayes] Integration with qmail? In-Reply-To: References: Message-ID: <3E677102.7000607@parducci.net> once you have procmail setup to work with qmail HAMMIE.txt (in the tarball) will walk you through the install process. if you don't have procmail setup here are a couple of places you may want to start: http://www.flounder.net/qmail/qmail-howto.html (#10) http://www.ornl.gov/cts/archives/mailing-lists/qmail/1998/07/msg00350.html b Martinez, Michael - CSREES/ISTM wrote: > I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers > would be appreciated. > > Thanks, > > Michael Martinez > CSREES/ISTM/USDA From skip at pobox.com Thu Mar 6 10:18:33 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 6 11:18:47 2003 Subject: [Spambayes] mboxtrain.py crashes In-Reply-To: <3E66FD0E.5572.233F95DE@localhost> References: <1046929470.1829.20.camel@idefix.homelinux.org> <3E66FD0E.5572.233F95DE@localhost> Message-ID: <15975.29913.77912.36528@montanaro.dyndns.org> Doc> So I assume that I should do the same with my notice yesterday Doc> about pop3proxy.py crashes. Yes. If it's the header parsing problem which Jeremy recently fixed, I'll close it right out, but if not, it helps to have a chit in the system so it doesn't get lost. Skip From piersh at friskit.com Thu Mar 6 09:02:58 2003 From: piersh at friskit.com (Piers Haken) Date: Thu Mar 6 12:01:45 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <9891913C5BFE87429D71E37F08210CB92C7519@zeus.sfhq.friskit.com> I can't find any correlation between the assert and the incorrect field setting. They may well be unrelated. Do you know what is a 'win32com.gen_py.None.MailItem'? Piers. 
> -----Original Message----- > From: Mark Hammond [mailto:mhammond@skippinet.com.au] > Sent: Wednesday, March 05, 2003 2:33 PM > To: Piers Haken; Moore, Paul; Spambayes > Subject: RE: [Spambayes] Outlook plugin error > > > > Paul, are you using any of: > > 1) oulook XP > > 2) hotmail plugin for (1) > > 3) exchange server > > > > ? > > > > I'm wondering if the problem has anything to do with the > fact that the > > spam field is set before the message is moved. > > Further, when you see this behaviour, can you immediately > check the Pythonwin debug window for a message? Each message > processed should have a message that indicates its spam > disposition - the first thing I need to know is if such mails > fire this debug trace. > > Mark. > > From piersh at friskit.com Thu Mar 6 09:19:12 2003 From: piersh at friskit.com (Piers Haken) Date: Thu Mar 6 12:38:44 2003 Subject: [Spambayes] Outlook plugin error Message-ID: <9891913C5BFE87429D71E37F08210CB92C751A@zeus.sfhq.friskit.com> Okay, I'm wondering: under what circumstances would a message NOT have an "EntryID"? Piers. > -----Original Message----- > From: Piers Haken > Sent: Thursday, March 06, 2003 9:03 AM > To: Mark Hammond; Moore, Paul; Spambayes > Subject: RE: [Spambayes] Outlook plugin error > > > I can't find any correlation between the assert and the > incorrect field setting. They may well be unrelated. > > Do you know what is a 'win32com.gen_py.None.MailItem'? > > Piers. > > > -----Original Message----- > > From: Mark Hammond [mailto:mhammond@skippinet.com.au] > > Sent: Wednesday, March 05, 2003 2:33 PM > > To: Piers Haken; Moore, Paul; Spambayes > > Subject: RE: [Spambayes] Outlook plugin error > > > > > > > Paul, are you using any of: > > > 1) oulook XP > > > 2) hotmail plugin for (1) > > > 3) exchange server > > > > > > ? > > > > > > I'm wondering if the problem has anything to do with the > > fact that the > > > spam field is set before the message is moved. 
> > > > Further, when you see this behaviour, can you immediately > > check the Pythonwin debug window for a message? Each message > > processed should have a message that indicates its spam > > disposition - the first thing I need to know is if such mails > > fire this debug trace. > > > > Mark. > > > > > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes > From neale at woozle.org Thu Mar 6 09:48:40 2003 From: neale at woozle.org (Neale Pickett) Date: Thu Mar 6 12:48:48 2003 Subject: [Spambayes] Integration with qmail? In-Reply-To: ("Martinez, Michael - CSREES/ISTM"'s message of "Thu, 6 Mar 2003 10:33:35 -0500") References: Message-ID: "Martinez, Michael - CSREES/ISTM" writes: > I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers > would be appreciated. Wow, déjà vu! I would refer you to http://mail.python.org/pipermail/spambayes/2003-February/003322.html and the messages following it, for starters. I'm still looking at ways to do this, but not at a staggering pace. Any ideas are still appreciated :) Neale From neale at woozle.org Thu Mar 6 09:52:33 2003 From: neale at woozle.org (Neale Pickett) Date: Thu Mar 6 12:52:37 2003 Subject: [Spambayes] mboxtrain.py crashes In-Reply-To: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst> (Tim Stone - Four Stones Expressions's message of "Thu, 06 Mar 2003 06:29:18 -0600") References: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst> Message-ID: Tim Stone - Four Stones Expressions writes: > Jean-Marc Valin wrote: > >> File "/opt//lib/python2.3/email/Utils.py", line 283, in decode_rfc2231 >> charset, language, s = s.split("'", 2) >> ValueError: unpack list of wrong size > > Jean-Marc, please report this as a bug so we can track it. You can do > that at http://sourceforge.net/projects/spambayes/ Otherwise, your > report will get lost in the mailing list noise. Thanks. Right.
But just for the record, it looks an awful lot like another instance of the email package not handling really fouled-up messages gracefully. So the fix may be a long time coming. In the meantime, since that message is probably spam, you can most likely just delete it and mboxtrain will continue to work. Actually, I guess mboxtrain could be a little more error-resistant. I'll add that to the todo list. Neale From noreply at sourceforge.net Thu Mar 6 09:26:28 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 13:12:03 2003 Subject: [Spambayes] [ spambayes-Bugs-698852 ] can't classify messages Message-ID: Bugs item #698852, was opened at 2003-03-06 17:26 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jeremy Hylton (jhylton) Assigned to: Nobody/Anonymous (nobody) Summary: can't classify messages Initial Comment: Traceback (most recent call last): File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "/usr/local/bin/pop3proxy.py", line 1064, in onClassify for word, wordProb in clues: NameError: global name 'clues' is not defined I don't know when the code broke, but it's been like this for a long time. There is no binding for clues anywhere. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702 From nas at python.ca Thu Mar 6 10:27:43 2003 From: nas at python.ca (Neil Schemenauer) Date: Thu Mar 6 13:18:13 2003 Subject: [Spambayes] Integration with qmail? In-Reply-To: References: Message-ID: <20030306182743.GA10575@glacier.arctrix.com> Martinez, Michael - CSREES/ISTM wrote: > I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers > would be appreciated. 
I've got some code to do this. I just need to make it available. Perhaps this weekend (if MGS2 doesn't get the best of me :-). Neil From noreply at sourceforge.net Thu Mar 6 10:34:32 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 13:41:22 2003 Subject: [Spambayes] [ spambayes-Bugs-698852 ] can't classify messages Message-ID: Bugs item #698852, was opened at 2003-03-06 11:26 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jeremy Hylton (jhylton) >Assigned to: Tim Stone (timstone4) Summary: can't classify messages Initial Comment: Traceback (most recent call last): File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "/usr/local/bin/pop3proxy.py", line 1064, in onClassify for word, wordProb in clues: NameError: global name 'clues' is not defined I don't know when the code broke, but it's been like this for a long time. There is no binding for clues anywhere. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-06 12:34 Message: Logged In: YES user_id=645698 Wow. You're right about the long time thing. Apparently this isn't something that anybody does on a regular basis... There's no classification code anywhere in the function! 
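[Editor's sketch] The traceback in this report shows onClassify looping over `clues` without ever binding it. The following is only an illustration of that missing step, not the fix that was checked in: `ToyClassifier`, its `spamprob` method, and `on_classify` are hypothetical stand-ins written for this example, and the averaging is deliberately naive.

```python
class ToyClassifier:
    """Hypothetical stand-in for a trained classifier (not the SpamBayes API)."""

    def __init__(self, word_probs, unknown_prob=0.5):
        self.word_probs = word_probs
        self.unknown_prob = unknown_prob

    def spamprob(self, tokens):
        """Return (score, clues); clues is a list of (word, probability) pairs."""
        clues = [(word, self.word_probs.get(word, self.unknown_prob))
                 for word in sorted(set(tokens))]
        if not clues:
            return self.unknown_prob, clues
        # Naive average of the word probabilities, for illustration only.
        score = sum(prob for _, prob in clues) / len(clues)
        return score, clues


def on_classify(classifier, text):
    """Build the rows that a 'show clues' style page could display."""
    # The step missing in the traceback: bind `clues` before looping over it.
    score, clues = classifier.spamprob(text.lower().split())
    rows = ["%-12s %.3f" % (word, word_prob) for word, word_prob in clues]
    return score, rows
```

With `clues` bound by the classification call, the `for word, wordProb in clues` loop from the traceback has something to iterate over.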
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702 From noreply at sourceforge.net Thu Mar 6 10:54:02 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 13:44:44 2003 Subject: [Spambayes] [ spambayes-Bugs-698852 ] can't classify messages Message-ID: Bugs item #698852, was opened at 2003-03-06 11:26 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702 Category: pop3proxy Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: Jeremy Hylton (jhylton) Assigned to: Tim Stone (timstone4) Summary: can't classify messages Initial Comment: Traceback (most recent call last): File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "/usr/local/bin/pop3proxy.py", line 1064, in onClassify for word, wordProb in clues: NameError: global name 'clues' is not defined I don't know when the code broke, but it's been like this for a long time. There is no binding for clues anywhere. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-06 12:54 Message: Logged In: YES user_id=645698 Fixed ---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-03-06 12:34 Message: Logged In: YES user_id=645698 Wow. You're right about the long time thing. Apparently this isn't something that anybody does on a regular basis... There's no classification code anywhere in the function! 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702 From grobinson at transpose.com Thu Mar 6 14:21:33 2003 From: grobinson at transpose.com (Gary Robinson) Date: Thu Mar 6 14:21:29 2003 Subject: [Spambayes] Best tweak values Message-ID: Hi, On the wiki that is pointed to in my LJ article (http://spamland.org/jsp/Wiki?GarySpamArticle), I would like to mention the paramaters that have worked best in spambayes. s and x? f(w) values associated with the middle excluded words? optimal spam/ham cutoff? Thanks to anyone who can help-- --Gary -- [http://ThisURLEnablesEmailToGetThroughOverzealousSpamFilters.org] Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.transpose.com http://radio.weblogs.com/0101454 From N7DR at arrisi.com Thu Mar 6 14:12:43 2003 From: N7DR at arrisi.com (D. R. Evans) Date: Thu Mar 6 16:12:57 2003 Subject: [Spambayes] pop3proxy crashes In-Reply-To: References: <3E660576.15567.1F786E44@localhost> Message-ID: <3E67575B.3086.24A05058@localhost> On 6 Mar 2003 at 9:50, Tim Stone - Four Stones Expre wrote: > Nearly as I can tell, your training database has been corrupted. I'm > not quite sure how this happened, but from what I see in the code, there > is likely no recovery at this point. When you submit a bug report, go > ahead and attach your training database. > Which file is that? (he asks, hoping that its not the 45MB hammie.db.dat file...) 
Doc -------------------------------------------------------------- Phone: +1 303 494 0394 Mobile: +1 720 839 8462 Fax: +1 781 240 0527 -------------------------------------------------------------- From skip at pobox.com Thu Mar 6 15:21:38 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 6 16:21:50 2003 Subject: [Spambayes] pop3proxy crashes In-Reply-To: <3E67575B.3086.24A05058@localhost> References: <3E660576.15567.1F786E44@localhost> <3E67575B.3086.24A05058@localhost> Message-ID: <15975.48098.666495.579958@montanaro.dyndns.org> Doc> Which file is that? (he asks, hoping that its not the 45MB Doc> hammie.db.dat file...) yeah, hammie.db.*. Just zip them up (there should be .dir and maybe .bak files as well) and attach them. They'll probably compress pretty well. Skip From T.A.Meyer at massey.ac.nz Fri Mar 7 11:02:36 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 6 17:03:15 2003 Subject: [Spambayes] statistical comparison of enviroment? Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318D561@its-xchg4.massey.ac.nz> > Alex> Aye. The problem, of course, is that we could start making > Alex> spambayes so tricked-out that it'd be as slow as SpamAssassin. ;-) > > Not necessarily. If A and B prove to not be independent, we > dump one and > keep the other. In some situations, spambayes may actually > perform fewer tricks, thus speeding it up. I must say that this is one of the things that I think spambayes has really got right. TimP's insistance on only including the best option (via deathmatches :), and on not including anything unless testing proved that it helped, has, IMO, kept spambayes nice and neat. (Which is not to say that more options shouldn't be examined - at least if they're in the archives, then if they are ever needed, the work is already done). 
=Tony Meyer From T.A.Meyer at massey.ac.nz Fri Mar 7 11:06:53 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 6 17:07:30 2003 Subject: [Spambayes] statistical comparison of enviroment? Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318D562@its-xchg4.massey.ac.nz> > Testing of new tokens like this has dropped off since about > last October... spambayes is already good enough for just > about everyone to be happy. My recent tests on training > methods seem to show that accuracy has been dropping off for > the last twho months, though, so it may be time to revisit > this problem... I'm (slowly) wading through the archives (interesting reading, but *long*), and have reached about this point. It does seem that the majority of the testing was done on certain collections of spam (along with lots of different ham). I wonder whether things got tuned a little too closely to that, and now that the spam is a little different, some options might need to be relooked at (rather than just retraining). Once I'm done with the archives (and then the options stuff), I'll try and set up a testing system so that I can work on that. I'm personally most interested in the effects of aging, the ham:spam ratio (with the current code), and how long spambayes takes to become effective, so I'll concentrate on those. =Tony Meyer From T.A.Meyer at massey.ac.nz Fri Mar 7 11:08:44 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 6 17:09:18 2003 Subject: [Spambayes] statistical comparison of enviroment? Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318D563@its-xchg4.massey.ac.nz> > ok. in the interest of time saving (i've not programmed in > python before), how about i [tabular] list what i find and > let the statistas in the group decide if there is > significance? If you want anything in particular coded, feel free to post a feature request on SF and if no-one else gets to it, I'll give it a go (the implementation; I'd probably leave most of the testing to you/others). 
> (unless there is a standardized sample that is preferable). Personally, I think the more standardised samples are avoided, the better. Otherwise, we're just building a spam filter that recognises a particular collection of spam. =Tony Meyer From noreply at sourceforge.net Thu Mar 6 14:21:05 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 17:55:22 2003 Subject: [Spambayes] [ spambayes-Bugs-693423 ] email message generates error in pop3proxy.py Message-ID: Bugs item #693423, was opened at 2003-02-25 23:02 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702 Category: pop3proxy Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: David Shaw (dshaw) Assigned to: Tim Stone (timstone4) Summary: email message generates error in pop3proxy.py Initial Comment: Hi all, A friend of mine had a cache file in his "unknown" folder that caused the "review" web page in pop3proxy.py to generate the following traceback: Traceback (most recent call last): File "spambayes/Dibbler.py", line 398, in found_terminator getattr(plugin, name)(**params) File "pop3proxy.py", line 929, in onReview judgement = judgement.split(';')[0].strip() File "pop3proxy.py", line 815, in _makeMessageInfo print type(text) AttributeError: 'list' object has no attribute 'replace' He sent me the offending message, and I replicated the problem: msg = open("/Users/dshaw/Desktop/crash_spam.txt", "r") message = mbox.get_message(msg) part = typed_subpart_iterator(message, 'text', 'plain').next() text = part.get_payload() >>> text [] So, instead of text, the payload is a list containing a single email message instance. Here are the objects' respective payloads: >>> message._payload [, , , , , , , , , , , , , ] ---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-03-04 18:39 Message: Logged In: YES user_id=645698 I just checked in a fix for this problem. 
I have no ability to actually test it, though. Please try your test case again and let me know the outcome. ---------------------------------------------------------------------- Comment By: David Shaw (dshaw) Date: 2003-02-28 10:34 Message: Logged In: YES user_id=244639 Seems to be fixed! Thanks. ---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-02-27 22:29 Message: Logged In: YES user_id=645698 I just checked in a fix for this problem. I have no ability to actually test it, though. Please try your test case again and let me know the outcome. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702 From mhammond at skippinet.com.au Fri Mar 7 10:01:21 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Mar 6 18:02:02 2003 Subject: [Spambayes] Outlook plugin error In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D959@UKDCX001.uk.int.atosorigin.com> Message-ID: > I always assumed that it was somehow related to the fact that mails arrive > asynchronously, and could therefore arrive when the plugin "wasn't ready" > somehow. I have no idea how outlook send its events - but our plugin is called with one event for each message that arrives. We process this event synchronousy - ie, the event handler does not return until the message has been processed by us. Thus, from our POV, we are always ready. I have reason to suspect that Outlook delivers these events synchronously on the main Outlook GUI thread, but have no proof or documentary evidence. Occasionaly, I have reason to believe they do come on different threads. Occasionally, I have reason to believe I should check But I see no evidence that there is conflict. If a message is moved underneath us, we get a MAPI_E_NOT_FOUND error (as the entryid changes). 
If something else changes the object underneath us, we get a MAPI_E_OBJECT_CHANGED error which we can handle and retry. We currently *don't* have retry code in place, but we have never seen MAPI_E_OBJECT_CHANGED (that would currently dump an exception to the debug window, and leave the message unscored rather than zero) The most-important-by-far thing I need to know is if a trace message, such as: > Message 'RE: It was nice to see at Amazon today...' had a Spam classification of 'No' appears for these messages with a spam score of zero which "show clues" shows as non-zero. Just don't forget that "show clues" reporting 5.38458e-015 is really reporting zero Mark. From N7DR at arrisi.com Thu Mar 6 16:06:07 2003 From: N7DR at arrisi.com (D. R. Evans) Date: Thu Mar 6 18:06:14 2003 Subject: [Spambayes] pop3proxy crashes In-Reply-To: <15975.48098.666495.579958@montanaro.dyndns.org> References: <3E67575B.3086.24A05058@localhost> Message-ID: <3E6771EF.7385.3183D8@localhost> On 6 Mar 2003 at 15:21, Skip Montanaro wrote: > Doc> Which file is that? (he asks, hoping that its not the 45MB Doc> > hammie.db.dat file...) > > yeah, hammie.db.*. Just zip them up (there should be .dir and maybe > .bak files as well) and attach them. They'll probably compress pretty > well. > I get the message from sourceforge: Could Not Attach File to Item: ArtifactFile: File must be > 20 bytes and < 256000 bytes in length Item Successfully Created which sort-of-suggests that it made an entry in the bug database but would not include the ZIPped database file (which ended up being about 2MB after a maximum-compression ZIP). 
Doc -------------------------------------------------------------- Phone: +1 303 494 0394 Mobile: +1 720 839 8462 Fax: +1 781 240 0527 -------------------------------------------------------------- From noreply at sourceforge.net Thu Mar 6 15:11:03 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 18:19:58 2003 Subject: [Spambayes] [ spambayes-Bugs-699063 ] pop3proxy.py crashes Message-ID: Bugs item #699063, was opened at 2003-03-06 16:11 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: D. R. Evans (n7dr) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy.py crashes Initial Comment: pop3proxy.py worked fine for a couple of weeks. I then rebooted my Linux box (Mandrake 8.1), and since then pop3proxy.py produces the following output on the console: Loading database... Traceback (most recent call last): File "./pop3proxy.py", line 1577, in ? run() File "./pop3proxy.py", line 1551, in run state.createWorkers() File "./pop3proxy.py", line 1161, in createWorkers self.bayes = storage.DBDictClassifier(filename) File "./spambayes/storage.py", line 140, in __init__ self.load() File "./spambayes/storage.py", line 152, in load t = self.db[self.statekey] File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__ return Unpickler(f).load() EOFError The database files are attached. 
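[Editor's sketch] The EOFError above is raised while shelve unpickles the value stored under one key. A minimal sketch of how such a file could be inspected follows, assuming a current Python 3 (where `dbm.whichdb` and `shelve` live in the standard library) rather than the Python 2.2 in the report, where the equivalent check was the separate `whichdb` module. The function name `probe_key` is hypothetical.

```python
import dbm
import pickle
import shelve


def probe_key(db_path, key):
    """Report which dbm backend a shelve file uses and whether `key` unpickles.

    Returns (backend, status), where backend is whatever dbm.whichdb() reports
    and status is "ok", "missing" or "corrupt".
    """
    backend = dbm.whichdb(db_path)
    try:
        # Open read-only so that probing never modifies the file.
        with shelve.open(db_path, flag="r") as db:
            db[key]
    except KeyError:
        return backend, "missing"
    except (EOFError, pickle.UnpicklingError):
        # Same failure mode as the traceback: the stored pickle is truncated.
        return backend, "corrupt"
    return backend, "ok"
```

For the case in this thread the call would look something like `probe_key("hammie.db", "saved state")`; both the file name and the key name are taken from the discussion here, not checked against the code.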
Doc ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702 From tim at fourstonesExpressions.com Thu Mar 6 17:22:18 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 6 18:22:25 2003 Subject: [Spambayes] pop3proxy crashes In-Reply-To: <3E6771EF.7385.3183D8@localhost> Message-ID: <76PJRL87221SRPMKITR054ZVP98C9MK.3e67d82a@myst> Go ahead and reply to this mail with the file attached, then... 3/6/2003 5:06:07 PM, "D. R. Evans" wrote: >On 6 Mar 2003 at 15:21, Skip Montanaro wrote: > >> Doc> Which file is that? (he asks, hoping that its not the 45MB Doc> >> hammie.db.dat file...) >> >> yeah, hammie.db.*. Just zip them up (there should be .dir and maybe >> .bak files as well) and attach them. They'll probably compress pretty >> well. >> > >I get the message from sourceforge: > Could Not Attach File to Item: ArtifactFile: File must be > 20 bytes >and < 256000 bytes in length Item Successfully Created > >which sort-of-suggests that it made an entry in the bug database but >would not include the ZIPped database file (which ended up being about >2MB after a maximum-compression ZIP). 
> > Doc >-------------------------------------------------------------- >Phone: +1 303 494 0394 >Mobile: +1 720 839 8462 >Fax: +1 781 240 0527 >-------------------------------------------------------------- > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From noreply at sourceforge.net Thu Mar 6 15:51:13 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 18:43:51 2003 Subject: [Spambayes] [ spambayes-Bugs-695142 ] Email does not render subject in the "Review" Page Message-ID: Bugs item #695142, was opened at 2003-02-28 10:40 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: David Shaw (dshaw) Assigned to: Tim Stone (timstone4) >Summary: Email does not render subject in the "Review" Page Initial Comment: I received the attached email. When I go to the "review" web page of pop3proxy.py, all it shows is: Messages classified as Unsure: From: (none) (none) It acts as though the message has no "from" or "subject", even though they exist. The user is not given any way to classify this message other than to click on the first "(none)" and read the raw message to determine its contents. I will attach the message below. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-06 17:51 Message: Logged In: YES user_id=645698 This is another email package parsing 'error' caused by a malformed header in the attached email. The content-type header has an embedded /r/n, which causes the email package to barf and discard all the headers. IMO, the email package is being used in Spambayes in ways that it was never intended for. 
Malformed mail is gonna be the death of us, and the email package just doesn't seem to handle it very well. I'm gonna leave this bug open, but there's virtually nothing that can be done to make things better, at least not AFAIK. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702 From noreply at sourceforge.net Thu Mar 6 15:51:54 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 18:43:53 2003 Subject: [Spambayes] [ spambayes-Bugs-673388 ] pop3proxy storage Message-ID: Bugs item #673388, was opened at 2003-01-23 16:02 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=673388&group_id=61702 Category: pop3proxy Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: François Granger (fgranger) Assigned to: Nobody/Anonymous (nobody) Summary: pop3proxy storage Initial Comment: I had a look in the pop3proxy folders, and I found thes strange files. They miss header and maybe part of the message. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-06 17:51 Message: Logged In: YES user_id=645698 Cannot recreate. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=673388&group_id=61702 From tim at fourstonesExpressions.com Thu Mar 6 21:07:26 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 6 22:07:32 2003 Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes In-Reply-To: <3E67575B.3086.24A05058@localhost> Message-ID: 3/6/2003 3:12:43 PM, "D. R. Evans" wrote: >On 6 Mar 2003 at 9:50, Tim Stone - Four Stones Expre wrote: > >> Nearly as I can tell, your training database has been corrupted. 
I'm >> not quite sure how this happened, but from what I see in the code, there >> is likely no recovery at this point. When you submit a bug report, go >> ahead and attach your training database. The database is definitely corrupted. This is the first time I've seen this. The 'saved state' key in the database (where spamcount and hamcount are maintained) has a corrupt value, that kills the unpickler. There are >88,000 words in this database, and apparently the machine was rebooted without a proper shutdown. This is bad. D.R. I need you to do a couple things: If you have the spam and ham saved in an mbox or something, then you can simply delete the database files and retrain from scratch. This would be the best alternative. If this isn't the case, if you can remember, or figure out some way, how many spams and hams were trained into this database, I can recover it for you. Even a rough estimate will likely do. And... can you tell me, if you know, what dbm module is in use? Maybe someone can give us a few lines of python you can run that will tell us that info. It's too late for me to bring it to mind... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From noreply at sourceforge.net Thu Mar 6 19:56:14 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 6 22:52:15 2003 Subject: [Spambayes] [ spambayes-Bugs-699174 ] mboxtrain only trains on cur in maildir Message-ID: Bugs item #699174, was opened at 2003-03-06 21:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Matthew Cowles (mdcowles) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain only trains on cur in maildir Initial Comment: When training on a maildir, mboxtrain trains only on the messages in the subirectory cur. It ignores messages in the subdirectory new. 
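[Editor's sketch] A minimal illustration of visiting both maildir subdirectories instead of only cur. This is not the patch mentioned in the report; the function name `maildir_message_paths` and its parameters are hypothetical.

```python
import os


def maildir_message_paths(maildir, subdirs=("cur", "new")):
    """Yield the path of every message file under the given maildir subdirectories."""
    for sub in subdirs:
        directory = os.path.join(maildir, sub)
        if not os.path.isdir(directory):
            continue  # tolerate an incomplete maildir
        for name in sorted(os.listdir(directory)):
            if name.startswith("."):
                continue  # skip dotfiles, which are not messages
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                yield path
```

A training loop could then iterate over `maildir_message_paths(path)` rather than listing cur alone.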
Since new is for messages that haven't been seen, I think it's worth looking there since at least some spam will have been filed unseen. I'll upload a patch that makes it train on both.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702

From mhammond at skippinet.com.au Fri Mar 7 21:58:56 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Mar 7 06:00:00 2003
Subject: [Spambayes] FW: Mhammond, Intelligent antispam IER software
Message-ID: <002101c2e498$8cd2eab0$530f8490@eden>

I had to share this irony :)

I received this spam, selling anti-spam software! I was a little disappointed that spambayes scored it as only a "maybe". So I checked the clues - the top 6 ham clues were:

word          spamprob    #ham  #spam
'*H*'         0.0438937      -      -
'*S*'         0.78226        -      -
'manually'    0.0184302     35      1
'mapi'        0.0266272      8      0
'keyword'     0.0302013      7      0
'source,'     0.0302013      7      0
'inbox'       0.0401784     26      2
'algorithm'   0.0652174      3      0

So sadly, a cruel irony is that spambayes let me down here - by knowing that I work on anti-spam software, it scored this anti-spam spam as ham.

Even-funnier-is-that-I-am-slammed ly,

Mark.

-----Original Message-----
From: vgarner6570@winning.com [mailto:vgarner6570@winning.com]On Behalf Of eagleclaw3449@lawyer.com
Sent: None
To: Mhammond
Subject: Mhammond, Intelligent antispam IER software

TheVeryBest - Software Downloads
Top-Rank Software Download Site on the Internet
Internet->Email->Spam Remedy v1.5 PRO

Spam Remedy (3.17MB)

Description:

The powerful, effective and intelligent anti-spam tool. It automatically cleans spam messages out of your mailbox before you receive or read them.

Features:

Automatically Blocking Spam
Spam Remedy automatically checks your mail boxes and filters unwanted, dangerous, or offensive mail messages to save your time from manually detecting and organizing mail messages.
Effectively Spam Detecting A complex Aritificial Intelligence algorithm has been used in Spam Remedy product to detecting legitimate mail messages and spam messages,the technique has more precision than other filter-based and keyword-based anti-spam technologies. Be Sure You Get Your Right Mail Messages Spam Remedy doesn't confirm a spam message by a single keyword in mail content. It examines the entire message - source, headers and mail content to confirm whether it is a spam message. Supports Multiple Email Types and Almost All Email Clients Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and MAPI email accounts,Directly works with almost all email clients(Outlook Express, Becky Mail,Foxmail,Outlook, The bat!, Eudora etc.), espacially includes support for web-based Hotmail/MSN email clients. Nothing you need to change to your email clients! Easy to use - You don't need to set any complex filter rules, just add your email accounts to Spam Remedy and then it works. Friends List and Rejecting List With Friends List and Rejecting List,you have the chance to decide who are never blocked or directly treat their mail messages as spam. Keep your inbox clean Spam Remedy places all intercepted spam messages to its interval mail database so that your inbox remains uncluttered and free of spam.If for some reason a legitimate email is flagged as spam, you can easily recover in multiple ways. Editor's Rating: Copyright ?2002-2003 DarkSoft Group All Rights Reserved. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: winmail.dat Type: application/ms-tnef Size: 1276 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/53ad725e/winmail.bin From tim at fourstonesExpressions.com Fri Mar 7 07:05:33 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 08:05:38 2003 Subject: [Spambayes] FW: Mhammond, Intelligent antispam IER software In-Reply-To: <002101c2e498$8cd2eab0$530f8490@eden> Message-ID: Gary Robinson and I just yesterday had a conversation about sending spam to advertise wecanstopspam.org... LOL!!! We decided that it was just too low a stoop. 3/7/2003 4:58:56 AM, "Mark Hammond" wrote: >I had to share this irony :) > >I received this spam, selling anti-spam software! I was a little >dissapointed that spambayes scored it as only a "maybe". So I checked the >clues - the top 6 ham clues were: > >word spamprob #ham #spam >'*H*' 0.0438937 - - >'*S*' 0.78226 - - >'manually' 0.0184302 35 1 >'mapi' 0.0266272 8 0 >'keyword' 0.0302013 7 0 >'source,' 0.0302013 7 0 >'inbox' 0.0401784 26 2 >'algorithm' 0.0652174 3 0 > >So sadly, a cruel irony is that spambayes let me down here - by knowing that >I work on anti-spam software, it scoreed this anti-spam spam as ham. > >Even-funnier-is-that-I-am-slammed ly, > >Mark. > >-----Original Message----- >From: vgarner6570@winning.com [mailto:vgarner6570@winning.com]On Behalf Of >eagleclaw3449@lawyer.com >Sent: None >To: Mhammond >Subject: Mhammond, Intelligent antispam IER software > > >TheVeryBest - Software Downloads > Top-Rank Software Download Site on the Internet >Internet->Email->Spam Remedy v1.5 PRO > >Spam Remedy (3.17MB) > > > >Description: > >The powerful, effective and intelligent anti-spam tool. >It automatically cleans spam messages out of your mailbox before you receive >or read them. 
> >Features: > >Automatically Blocking Spam >Spam Remedy automatically checks your mail boxes and filters unwanted, >dangerous, or offensive mail messages to save your time from manually >detecting and organizing mail messages. >Effectively Spam Detecting >A complex Aritificial Intelligence algorithm has been used in Spam Remedy >product to detecting legitimate mail messages and spam messages,the >technique has more precision than other filter-based and keyword-based >anti-spam technologies. >Be Sure You Get Your Right Mail Messages >Spam Remedy doesn't confirm a spam message by a single keyword in mail >content. It examines the entire message - source, headers and mail content >to confirm whether it is a spam message. >Supports Multiple Email Types and Almost All Email Clients >Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and MAPI email >accounts,Directly works with almost all email clients(Outlook Express, Becky >Mail,Foxmail,Outlook, The bat!, Eudora etc.), espacially includes support >for web-based Hotmail/MSN email clients. Nothing you need to change to your >email clients! >Easy to use - You don't need to set any complex filter rules, just add your >email accounts to Spam Remedy and then it works. >Friends List and Rejecting List >With Friends List and Rejecting List,you have the chance to decide who are >never blocked or directly treat their mail messages as spam. >Keep your inbox clean >Spam Remedy places all intercepted spam messages to its interval mail >database so that your inbox remains uncluttered and free of spam.If for some >reason a legitimate email is flagged as spam, you can easily recover in >multiple ways. > >Editor's Rating: > > >Copyright ?2002-2003 DarkSoft Group All Rights Reserved. 
> c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From skip at pobox.com Fri Mar 7 10:59:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 11:59:13 2003 Subject: [Spambayes] full o' spaces Message-ID: <15976.53209.395058.683195@montanaro.dyndns.org> I just received a message (attached) in which every word in the body was space-separated. There were thus no clues at all in the body and the clues in the header weren't enough to pull it out of the unsure classification. I'm working on a tokenizer patch. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: diploma.msg Type: application/octet-stream Size: 2365 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/685a6654/diploma.obj From tim at fourstonesExpressions.com Fri Mar 7 11:01:28 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 12:01:34 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <15976.53209.395058.683195@montanaro.dyndns.org> Message-ID: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst> Ya, I noticed that same thing yesterday. Maybe an "excessive whitespace" clue, or "many single character words" clue, or something like that? 3/7/2003 10:59:05 AM, Skip Montanaro wrote: >I just received a message (attached) in which every word in the body was >space-separated. There were thus no clues at all in the body and the clues >in the header weren't enough to pull it out of the unsure classification. >I'm working on a tokenizer patch. 
> >Skip > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From python-spambayes at discworld.dyndns.org Fri Mar 7 11:19:02 2003 From: python-spambayes at discworld.dyndns.org (Charles Cazabon) Date: Fri Mar 7 12:16:37 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>; from tim@fourstonesExpressions.com on Fri, Mar 07, 2003 at 11:01:28AM -0600 References: <15976.53209.395058.683195@montanaro.dyndns.org> <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst> Message-ID: <20030307111902.A12956@discworld.dyndns.org> Tim Stone - Four Stones Expressions wrote: > Ya, I noticed that same thing yesterday. Maybe an "excessive whitespace" > clue, or "many single character words" clue, or something like that? Ratio of number of spaces to number of non-spaces in the body, perhaps? Add a metatoken if this exceeds 0.25 or something like that. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From piersh at friskit.com Fri Mar 7 09:51:49 2003 From: piersh at friskit.com (Piers Haken) Date: Fri Mar 7 12:50:32 2003 Subject: [Spambayes] Improved comparison of classifier changes? Message-ID: <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com> (This came to me in a dream. No, really...) When comparing two different classifier/tokenizer strategies, instead of just comparing the numbers of false negatives and positives, how about comparing some function (product, sum, average, some-more-appropriate-statistical-function?) of the spam probability of all messages in each classification (spam, ham, false-positive, false-negative)? This might give a slightly better indication of not just the numbers of messages that were classified correctly/incorrectly, but of how sure the classifier was when it made those decisions. 
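One way to make this idea concrete, offered as a minimal sketch and not as the project's actual test harness: summarize each class's scores by mean and standard deviation, then measure how far apart the two distributions sit relative to their spread. The function names are hypothetical, scores are assumed to be on a 0-100 scale, and using the population standard deviation is an arbitrary choice.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Return (mean, population standard deviation) for a list of scores."""
    return mean(scores), pstdev(scores)


def separation(ham_scores, spam_scores):
    """Distance between the class means, in units of their combined spread.

    Larger values mean the classifier was more decisive: ham and spam
    scores sit far apart and each cluster is tight.
    """
    h_mean, h_sdev = summarize(ham_scores)
    s_mean, s_sdev = summarize(spam_scores)
    spread = h_sdev + s_sdev
    if spread == 0:
        # Degenerate case (assumption): every score in each class is identical.
        return float("inf") if s_mean != h_mean else 0.0
    return (s_mean - h_mean) / spread
```

Two runs with the same false-positive and false-negative counts can then still be told apart: the run with the larger separation made its decisions with more margin. The "k" row in the table.py output quoted later in this thread appears numerically consistent with this ratio, though that reading is inferred from the numbers alone.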
.. or was I just dreaming...? Piers. From tim at fourstonesExpressions.com Fri Mar 7 11:59:19 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 12:59:23 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307111902.A12956@discworld.dyndns.org> Message-ID: 3/7/2003 11:19:02 AM, Charles Cazabon wrote: >Tim Stone - Four Stones Expressions wrote: >> Ya, I noticed that same thing yesterday. Maybe an "excessive whitespace" >> clue, or "many single character words" clue, or something like that? > >Ratio of number of spaces to number of non-spaces in the body, perhaps? Add a >metatoken if this exceeds 0.25 or something like that. Any threshold we use for anything like this has to be configurable. Otherwise the spammers will simply make sure they don't exceed the threshold... In normal (english) language usage, there is probably a relatively well understood distribution of unigrams, bigrams, trigrams, and longer words. Any 'severe' departure from this distribution could be a very good spam clue. For example, I could use the following to defeat a whitespace and unigram counting scheme: Bu y m ore st u ff t h an yo u EVE R tho ug ht you c ou l d h and le. It's a bit harder to read than regular text, but the human brain is amazingly adaptive to stuff like this. This kind of trickery is likely to be one avenue that spammers try to heavily use to defeat us. (the other being malformation of mail, imo). 
Oh, and btw, don't believe for a second that spammers don't subscribe to this list :) c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From nas at python.ca Fri Mar 7 10:14:02 2003 From: nas at python.ca (Neil Schemenauer) Date: Fri Mar 7 13:04:28 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <15976.53209.395058.683195@montanaro.dyndns.org> References: <15976.53209.395058.683195@montanaro.dyndns.org> Message-ID: <20030307181402.GA13499@glacier.arctrix.com> Skip Montanaro wrote: > I just received a message (attached) in which every word in the body was > space-separated. I wouldn't worry about it too much. It doesn't look like an effective spam to me. I gave up reading it after the first line. I don't think the bozos who respond to spam would make any more of an effort to read it. > I'm working on a tokenizer patch. Perhaps we should be careful about adding stuff unless we can show a statistically significant improvement in error rates given real test data. That said, it seems logical that it would be better if short words were not completely discarded by the tokenizer. Perhaps it would be enough to remember the ratio of dropped words to generated tokens. Something like: 'shortratio:2**%d' % log2(nshort / ntokens) As you can tell, I love logarithms (as any true engineer should). :-) Alternatively, perhaps we could just drop the lower limit on token length. Neil From tim at fourstonesExpressions.com Fri Mar 7 12:13:22 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 13:13:27 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307181402.GA13499@glacier.arctrix.com> Message-ID: 3/7/2003 12:14:02 PM, Neil Schemenauer wrote: >Skip Montanaro wrote: >> I just received a message (attached) in which every word in the body was >> space-separated. > >I wouldn't worry about it too much. It doesn't look like an effective >spam to me. I gave up reading it after the first line. 
I don't think >the bozos who respond to spam would make any more of an effort to read >it. The fallacy here is that you're assuming that spammers will simply give up. They won't. And a set of eyeballs looking at a mail, even if they stop reading after the first line, is better than no eyeballs. So they'll keep trying things to defeat the algorithms, especially if their response rates are dropping. > >> I'm working on a tokenizer patch. > >Perhaps we should be careful about adding stuff unless we can show a >statistically significant improvement in error rates given real test >data. This strategy, which has been employed by the spambayes team up to this point, is very useful for research, but is quite reactive. We're exiting the research phase of this project, and entering a product phase. Reactive strategy is not appropriate for products (e.g. Microsoft security). We must be proactive, and kill ideas before they become widespread in the spammer community. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From popiel at wolfskeep.com Fri Mar 7 10:29:04 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Mar 7 13:29:09 2003 Subject: [Spambayes] Improved comparison of classifier changes? In-Reply-To: Message from "Piers Haken" <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com> References: <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com> Message-ID: <20030307182904.B81A82DDC7@cashew.wolfskeep.com> In message: <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com> "Piers Haken" writes: >(This came to me in a dream. No, really...) > >When comparing two different classifier/tokenizer strategies, instead of >just comparing the numbers of false negatives and positives, how about >comparing some function (product, sum, average, >some-more-appropriate-statistical-function?) of the spam probability of >all messages in each classification (spam, ham, false-positive, >false-negative)? 
This might give a slightly better indication of not
>just the numbers of messages that were classified correctly/incorrectly,
>but of how sure the classifier was when it made those decisions.
>
>.. or was I just dreaming...?

Here's sample output from table.py:

filename:        rcb       rcB       rCb       rCB       Rcb       RcB       RCb       RCB
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:          3         3         3         3         3         3         3         3
fp %:           0.15      0.15      0.15      0.15      0.15      0.15      0.15      0.15
fn total:         12        14        16        14        12        12        12        12
fn %:           0.60      0.70      0.80      0.70      0.60      0.60      0.60      0.60
unsure t:         53        37        50        39        40        31        37        32
unsure %:       1.32      0.93      1.25      0.97      1.00      0.78      0.93      0.80
real cost:    $52.60    $51.40    $56.00    $51.80    $50.00    $48.20    $49.40    $48.40
best cost:    $48.20    $45.20    $49.20    $45.60    $37.20    $38.80    $40.60    $38.60
h mean:         0.40      0.32      0.35      0.32      0.31      0.30      0.29      0.29
h sdev:         5.39      4.71      5.12      4.68      4.55      4.47      4.47      4.43
s mean:        98.45     98.68     98.35     98.68     98.75     98.85     98.72     98.85
s sdev:         9.76      9.57     10.46      9.58      9.08      9.06      9.37      9.11
mean diff:     98.05     98.36     98.00     98.36     98.44     98.55     98.43     98.56
k:              6.47      6.89      6.29      6.90      7.22      7.28      7.11      7.28

So yes, when using the test harness and associated tools, we do compare more than just the fp and fn counts. We also look at percentages, a weighted cost function, the best possible cost achievable just by moving the ham and spam cutoffs, and the mean scores, their separation, and their standard deviations.

We just haven't done much tokenizer testing lately, so these reports aren't obvious in the recent archives.

- Alex

From bill at parducci.net Fri Mar 7 11:21:06 2003
From: bill at parducci.net (bill parducci)
Date: Fri Mar 7 14:21:11 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15976.53209.395058.683195@montanaro.dyndns.org>
References: <15976.53209.395058.683195@montanaro.dyndns.org>
Message-ID: <3E68F122.5040502@parducci.net>

welcome to MEME mail!
:o) You i have been working on some ideas on how to attack this off an on for the last few months, but it is very difficult because [the]{mind}(is)quite|g00d`at~separating+the\message'fr0m_the^TEXT. it is this work that prompted my initial query into what is being done with tokenization on this list. if it would help, i can send/post a few sample messages that i have been using to test my work. i have also come with a crude mechainsm for trying to work around it. hasn't been tested and needs a lot of work (it is written in vb). anyway, if anyone is interested i can show what i have come up with so far. b Skip Montanaro wrote: > I just received a message (attached) in which every word in the body was > space-separated. There were thus no clues at all in the body and the clues > in the header weren't enough to pull it out of the unsure classification. > I'm working on a tokenizer patch. > > Skip 1 From tim.one at comcast.net Fri Mar 7 14:22:51 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Mar 7 14:23:28 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307181402.GA13499@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > ... > That said, it seems logical that it would be better if short words were > not completely discarded by the tokenizer. Perhaps it would be enough > to remember the ratio of dropped words to generated tokens. Something > like: > > 'shortratio:2**%d' % log2(nshort / ntokens) > > As you can tell, I love logarithms (as any true engineer should). :-) I've mentioned before that the metatoken (number of bytes)/(number of words) was a very strong indicator in early tests. An unusually high ratio of bytes to words was a very strong spam indicator; spam with the interspersed whitespace gimmick would have an unusually low ratio. I didn't check in the code, though, because it made no difference in error rates at the time. 
But a single token doesn't carry much weight, and any gimmick that reduces response rate (including those that make text harder to read) probably won't last long. > Alternatively, perhaps we could just drop the lower limit on token > length. Experiments were run on that, and they hurt. See "How big should 'a word' be?" in tokenizer.py. Note that we have a configurable limit for the upper end of how big a word can be. The evidence in favor of adding it was (at best) weak. From bill at parducci.net Fri Mar 7 11:24:00 2003 From: bill at parducci.net (bill parducci) Date: Fri Mar 7 14:24:04 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst> References: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst> Message-ID: <3E68F1D0.7060806@parducci.net> or... ratio of non 'a-z|A-Z|0-9' vs. 'a-z|A-Z|0-9'? he says (with physical attribute analysis on the brain :o) b Tim Stone - Four Stones Expressions wrote: > Ya, I noticed that same thing yesterday. Maybe an "excessive whitespace" > clue, or "many single character words" clue, or something like that? > > 3/7/2003 10:59:05 AM, Skip Montanaro wrote: From nas at python.ca Fri Mar 7 11:42:34 2003 From: nas at python.ca (Neil Schemenauer) Date: Fri Mar 7 14:33:02 2003 Subject: [Spambayes] full o' spaces In-Reply-To: References: <20030307181402.GA13499@glacier.arctrix.com> Message-ID: <20030307194234.GA13770@glacier.arctrix.com> Tim Stone - Four Stones Expressions wrote: > The fallacy here is that you're assuming that spammers will simply give up. > They won't. And a set of eyeballs looking at a mail, even if they stop > reading after the first line, is better than no eyeballs. I have to respectfully disagree. Spammers _need_ people to respond to their spam. If a filter avoidance trick kills the response rate they will stop using it. There is no point in bloating spambayes with every failed trick they try. That's why I suggested testing with a real corpus. 
If a trick is common enough that detecting it signficantly affects the error rate then fine, add code for it. Otherwise, forget about and keep spambayes lean and mean. > So they'll keep trying things to defeat the algorithms, especially if > their response rates are dropping. Sure. However, they will only continue using a trick if it defeats filters _and_ gets an acceptable response rate. > This strategy, which has been employed by the spambayes team up to this point, > is very useful for research, but is quite reactive. We're exiting the > research phase of this project, and entering a product phase. Reactive > strategy is not appropriate for products (e.g. Microsoft security). I disagree. We should not abandon the rigorous, testing based strategy that got SB to its current state. Adding more code every time a spammer comes up with a new trick is completely reactionary and will eventually destroy the code base. I'm mystified as to how you can call such an approach proactive. > We must be proactive, and kill ideas before they become widespread in > the spammer community. We don't need to worry about spammers' ideas that will be killed by other forces. Perhaps it comes down to a question of objectives. If your objective is to keep spam out of your mailbox then trying to detect all spam, effective or not, makes sense. My objective is to destroy the spam business. One way to do that is to have a widely deployable filter that blocks spam that would make spammers money. Honestly, for me to hit delete for a few spam messages in my inbox is not a big deal. It is the fact that these people are wasting millions of people's time. Neil From tim at fourstonesExpressions.com Fri Mar 7 13:49:13 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 14:49:19 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307194234.GA13770@glacier.arctrix.com> Message-ID: This is a great discussion. 
We really should hash this out to everyone's satisfaction. 3/7/2003 1:42:34 PM, Neil Schemenauer wrote: >Tim Stone - Four Stones Expressions wrote: >> The fallacy here is that you're assuming that spammers will simply give up. >> They won't. And a set of eyeballs looking at a mail, even if they stop >> reading after the first line, is better than no eyeballs. > >I have to respectfully disagree. Spammers _need_ people to respond to >their spam. If a filter avoidance trick kills the response rate they >will stop using it. There is no point in bloating spambayes with every >failed trick they try. This really wasn't what I was suggesting. Rather, when we find a significant hole through which effective spam can squirt, we should plug it, rather than wait to see if any spammers find that same hole. > That's why I suggested testing with a real >corpus. If a trick is common enough that detecting it signficantly >affects the error rate then fine, add code for it. Otherwise, forget >about and keep spambayes lean and mean. > >> So they'll keep trying things to defeat the algorithms, especially if >> their response rates are dropping. > >Sure. However, they will only continue using a trick if it defeats >filters _and_ gets an acceptable response rate. If it defeats the filters then the response rate, however dismal, will be better than for spam that doesn't defeat the filters. > >> This strategy, which has been employed by the spambayes team up to this point, >> is very useful for research, but is quite reactive. We're exiting the >> research phase of this project, and entering a product phase. Reactive >> strategy is not appropriate for products (e.g. Microsoft security). > >I disagree. We should not abandon the rigorous, testing based strategy >that got SB to its current state. Absolutely. Rigorous testing is not the issue at all, in my mind. 
> Adding more code every time a spammer >comes up with a new trick is completely reactionary and will eventually >destroy the code base. I'm mystified as to how you can call such an >approach proactive. Again, I was suggesting that we find the holes before they do. I think that we should begin to think like spammers, not like people trying to defeat spammers. If we were on the other side, what would we do? Gosh, I can think of things, simple things. And if I can find something that actually crashes the tokenizer, all the better. I'll look at the code, more closely than most on this team ever will. I'll find the holes, and blast away. My goal? Not to get spam into mailboxes, but to destroy the anti-spam community. Make people give up hope that this problem really is/can be solved. That's the way to make you and me go away. Simply make it so people don't believe in us. > >> We must be proactive, and kill ideas before they become widespread in >> the spammer community. > >We don't need to worry about spammers' ideas that will be killed by >other forces. Perhaps it comes down to a question of objectives. If >your objective is to keep spam out of your mailbox then trying to detect >all spam, effective or not, makes sense. My objective is to destroy the >spam business. The two objectives are identical. > One way to do that is to have a widely deployable filter >that blocks spam that would make spammers money. Honestly, for me to >hit delete for a few spam messages in my inbox is not a big deal. It is >the fact that these people are wasting millions of people's time. 
> > Neil > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From python-spambayes at discworld.dyndns.org Fri Mar 7 14:02:47 2003 From: python-spambayes at discworld.dyndns.org (Charles Cazabon) Date: Fri Mar 7 15:00:23 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307194234.GA13770@glacier.arctrix.com>; from nas@python.ca on Fri, Mar 07, 2003 at 11:42:34AM -0800 References: <20030307181402.GA13499@glacier.arctrix.com> <20030307194234.GA13770@glacier.arctrix.com> Message-ID: <20030307140247.A16563@discworld.dyndns.org> Neil Schemenauer wrote: > > This strategy, which has been employed by the spambayes team up to this > > point, is very useful for research, but is quite reactive. We're exiting > > the research phase of this project, and entering a product phase. > > Reactive strategy is not appropriate for products (e.g. Microsoft > > security). > > I disagree. We should not abandon the rigorous, testing based strategy that > got SB to its current state. Adding more code every time a spammer comes up > with a new trick is completely reactionary and will eventually destroy the > code base. I'm mystified as to how you can call such an approach proactive. Hear, hear. Don't turn SpamBayes into a convoluted, hocus-pocus collection of ad-hoc rules a la SpamAssasin. Keep testing; if a technique doesn't measurably improve the result, toss it. 
Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From skip at pobox.com Fri Mar 7 14:06:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar 7 15:06:27 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
References: <15976.53209.395058.683195@montanaro.dyndns.org> <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
Message-ID: <15976.64431.573736.976718@montanaro.dyndns.org>

    Tim> Ya, I noticed that same thing yesterday. Maybe an "excessive
    Tim> whitespace" clue, or "many single character words" clue, or
    Tim> something like that?

I tried the ratio of spaces to the total number of characters in the message body, but that is inconclusive:

    >>> db = shelve.open("../hammie.db", "r")
    >>> for k in db.keys():
    ...     if k.startswith("space ratio"):
    ...         print k, db[k]
    ...
    space ratio: 0.0 (1240, 399)
    space ratio: 0.1 (3950, 6603)
    space ratio: 0.2 (1405, 4562)
    space ratio: 0.3 (289, 231)
    space ratio: 0.4 (85, 51)
    space ratio: 0.5 (15, 16)
    space ratio: 0.6 (2, 2)
    space ratio: 0.8 (3, 0)

(Maybe I should be ignoring whitespace at the beginning of lines?)

The diploma message has a space ration of right around 0.5. I haven't looked at other messages yet to see what the other messages with similar ratios looked like. Maybe the ratio of single-character words to the total number of words would be better.
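The two measures discussed here could be computed roughly as follows. This is a sketch under stated assumptions and not the patch being tested: the function names are hypothetical, the token text simply mirrors the "space ratio: 0.N" keys shown above, and bucketing by rounding to one decimal place (as opposed to truncating) is a guess.

```python
def space_ratio(body):
    """Fraction of characters in body that are spaces (0.0 for empty input)."""
    if not body:
        return 0.0
    return body.count(" ") / float(len(body))


def space_ratio_token(body):
    """Bucket the ratio to one decimal place so it can act as a metatoken.

    Rounding here is an assumption; the original bucketing rule is not shown.
    """
    return "space ratio: %.1f" % space_ratio(body)


def single_char_word_ratio(body):
    """Fraction of whitespace-separated words that are one character long."""
    words = body.split()
    if not words:
        return 0.0
    singles = sum(1 for w in words if len(w) == 1)
    return singles / float(len(words))
```

A body written as "D i p l o m a" lands in the 0.5 space-ratio bucket, the same bucket that ordinary messages also reach, while its single-character word ratio is 1.0, which is why the second measure may discriminate better.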
Skip From tim at fourstonesExpressions.com Fri Mar 7 14:17:22 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 15:17:28 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <15976.64431.573736.976718@montanaro.dyndns.org> Message-ID: <07IG71RP63WR1YTNYSZW828NB8WKH.3e68fe52@myst> 3/7/2003 2:06:07 PM, Skip Montanaro wrote: >The diploma message has a space ration I see you suffer from the same spelling disorder as I... I always write ration instead of ratio... lol > of right around 0.5. I haven't >looked at other messages yet to see what the other messages with similar >ratios looked like. Maybe the ratio of single-character words to the total >number of words would be better. Can you look at percentage of unigrams, bigrams, trigrams, and ngrams? > >Skip > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From skip at pobox.com Fri Mar 7 14:25:06 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 15:25:20 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307194234.GA13770@glacier.arctrix.com> References: <20030307181402.GA13499@glacier.arctrix.com> <20030307194234.GA13770@glacier.arctrix.com> Message-ID: <15977.34.990208.430806@montanaro.dyndns.org> Neil> Tim Stone - Four Stones Expressions wrote: >> The fallacy here is that you're assuming that spammers will simply >> give up. They won't. And a set of eyeballs looking at a mail, even >> if they stop reading after the first line, is better than no >> eyeballs. Neil> I have to respectfully disagree. Spammers _need_ people to Neil> respond to their spam. If a filter avoidance trick kills the Neil> response rate they will stop using it. There is no point in Neil> bloating spambayes with every failed trick they try. That's why I Neil> suggested testing with a real corpus. If a trick is common enough Yes, my corpus is currently 11,000+ hams and 7,000+ spams. My first try failed, but I think I know why. 
In addition several people have suggested some other things to try. >> We must be proactive, and kill ideas before they become widespread in >> the spammer community. Neil> We don't need to worry about spammers' ideas that will be killed Neil> by other forces. Precisely. This particular message landed right in the middle of the unsure. Training on it didn't affect its later classification much. That suggests that to swing that message into the spam region, one or more new techniques need to be developed which highlight an attribute of that message. Neil> My objective is to destroy the spam business. One way to do that Neil> is to have a widely deployable filter that blocks spam that would Neil> make spammers money. Honestly, for me to hit delete for a few Neil> spam messages in my inbox is not a big deal. It is the fact that Neil> these people are wasting millions of people's time. Correct, but as we all know, the spammers learn and we have no way of directly measuring our effectiveness at destroying their business. All we can measure directly is how effective we are at segregating their messages into spam folders. It appears that this simple technique is sufficient to move most spams into the unsure category (and thus viewed). Skip From tim at fourstonesExpressions.com Fri Mar 7 14:29:33 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 15:29:38 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <07IG71RP63WR1YTNYSZW828NB8WKH.3e68fe52@myst> Message-ID: >Can you look at percentage of unigrams, bigrams, trigrams, and ngrams? I'm thinking that, for English anyway, nu < nb < nt < nn is the rule. If this rule is violated, then that's a spam indicator. I sure don't know if that's the case with other languages, though... 
c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From nas at python.ca Fri Mar 7 12:40:02 2003 From: nas at python.ca (Neil Schemenauer) Date: Fri Mar 7 15:30:28 2003 Subject: [Spambayes] full o' spaces In-Reply-To: References: <20030307194234.GA13770@glacier.arctrix.com> Message-ID: <20030307204002.GD13770@glacier.arctrix.com> Tim Stone - Four Stones Expressions wrote: > Rather, when we find a significant hole through which effective spam > can squirt, we should plug it, rather than wait to see if any spammers > find that same hole. I agree (with emphases on the word "effective"). If spammers don't care about effectiveness than it will be extremely difficult to block their messages. > If it defeats the filters then the response rate, however dismal, will be > better than for spam that doesn't defeat the filters. Nope. If it costs them more money to send than what they make back it will not be better. Sending spam, however cheap, costs money. Therefore, at some non-zero response rate it becomes unprofitable to send it. > Again, I was suggesting that we find the holes before they do. Why? > And if I can find something that actually crashes the tokenizer, all > the better. That's a different kettle of fish, I think. Whatever the filter does, it should not crash or lose email, no matter what the spammer does. I'm all for that kind of improvement. > Not to get spam into mailboxes, but to destroy the anti-spam community. Yikes, don't hurt me. I think you meant the the anti-anti-spam community. :-) Personally, I'm content with letting the anti-anti-spam community do what they will. If they come up with something the spam community adopts then I think we can deal with it. For example, the "HTML comments inside words" trick must be effective since I'm seeing it fairly often now. It's really a no brainer, since if the MTA understands HTML there is no visable difference in the message. 
Luckily SB already deals with this trick in a more general way. > Make people give up hope that this problem really is/can be solved. > That's the way to make you and me go away. Simply make it so people > don't believe in us. I'm having a little trouble parsing that. I think you are saying that if the filter doesn't achieve the objective of keeping spam out of people's mailboxes then people will not adopt it. That's true, but I think the average person is fairly tolerant of FNs, as long as the FP rate is very low. I think FNs annoy spam filter hackers more than regular people. > >We don't need to worry about spammers' ideas that will be killed by > >other forces. Perhaps it comes down to a question of objectives. If > >your objective is to keep spam out of your mailbox then trying to detect > >all spam, effective or not, makes sense. My objective is to destroy the > >spam business. > > The two objectives are identical. Nope. Blocking all spam achieves both objectives while blocking only effective spam achieves only the second. Since effective spam is a subset of all spam it could be easier to block. Neil From nas at python.ca Fri Mar 7 12:44:24 2003 From: nas at python.ca (Neil Schemenauer) Date: Fri Mar 7 15:34:48 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <15976.64431.573736.976718@montanaro.dyndns.org> References: <15976.53209.395058.683195@montanaro.dyndns.org> <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst> <15976.64431.573736.976718@montanaro.dyndns.org> Message-ID: <20030307204424.GE13770@glacier.arctrix.com> Skip Montanaro wrote: > Maybe the ratio of single-character words to the total number of words > would be better. I like Tim's suggestion of bytes/tokens. Could you give that a try? 
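
A bytes/tokens clue of the kind suggested here might look like the sketch below. The helper name, the whitespace-based word split, the integer rounding, and the use of character count as a stand-in for byte count are all assumptions; the token spelling simply mirrors the 'bytes/words: 2' clue that appears in debug output elsewhere in this thread.

```python
def bytes_per_word_token(body):
    """Return a 'bytes/words: N' meta-token for a message body, or None.

    Hypothetical helper: len(body) counts characters, which is used here
    as an approximation of the byte count.
    """
    words = body.split()
    if not words:
        return None
    # Average number of characters "spent" per whitespace-separated word,
    # including the whitespace itself.
    ratio = len(body) / float(len(words))
    return "bytes/words: %d" % round(ratio)
```

Text spaced out one letter at a time comes out at about 2, while ordinary prose comes out noticeably higher.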
Neil From python-spambayes at discworld.dyndns.org Fri Mar 7 14:39:02 2003 From: python-spambayes at discworld.dyndns.org (Charles Cazabon) Date: Fri Mar 7 15:36:37 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <15977.34.990208.430806@montanaro.dyndns.org>; from skip@pobox.com on Fri, Mar 07, 2003 at 02:25:06PM -0600 References: <20030307181402.GA13499@glacier.arctrix.com> <20030307194234.GA13770@glacier.arctrix.com> <15977.34.990208.430806@montanaro.dyndns.org> Message-ID: <20030307143902.A16967@discworld.dyndns.org> Skip Montanaro wrote: > > > We don't need to worry about spammers' ideas that will be killed by other > > forces. > > Precisely. This particular message landed right in the middle of the > unsure. Training on it didn't affect its later classification much. That > suggests that to swing that message into the spam region, one or more new > techniques need to be developed which highlight an attribute of that > message. As more spammers use the technique, it automatically becomes a better indicator of spamminess. You don't really need to manually twiddle knobs. At most, adding a metatoken might help, but as Tim has pushed for all along, if that doesn't make a measurable difference don't do it. 
Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From nas at python.ca Fri Mar 7 12:55:40 2003
From: nas at python.ca (Neil Schemenauer)
Date: Fri Mar 7 15:46:05 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15977.34.990208.430806@montanaro.dyndns.org>
References: <20030307181402.GA13499@glacier.arctrix.com>
 <20030307194234.GA13770@glacier.arctrix.com>
 <15977.34.990208.430806@montanaro.dyndns.org>
Message-ID: <20030307205540.GF13770@glacier.arctrix.com>

Skip Montanaro wrote:
> Correct, but as we all know, the spammers learn and we have no way of
> directly measuring our effectiveness at destroying their business. All we
> can measure directly is how effective we are at segregating their messages
> into spam folders.

I think we can indirectly determine that by what techniques become
popular. I suppose a quickly changing set of techniques could be
interpreted as a sign of effective filters. Based on that, I would say
we are starting to get somewhere but the war is not over by a long shot.

Neil

From pje at telecommunity.com Fri Mar 7 16:01:18 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Fri Mar 7 16:01:06 2003
Subject: [Spambayes] full o' spaces
In-Reply-To:
References: <07IG71RP63WR1YTNYSZW828NB8WKH.3e68fe52@myst>
Message-ID: <5.1.1.6.0.20030307155256.00ab5020@telecommunity.com>

At 02:29 PM 3/7/03 -0600, Tim Stone - Four Stones Expressions wrote:
> >Can you look at percentage of unigrams, bigrams, trigrams, and ngrams?
>
>I'm thinking that, for English anyway, nu < nb < nt < nn is the rule. If this
>rule is violated, then that's a spam indicator. I sure don't know if that's
>the case with other languages, though...

There may be a simple way to deal with the entire range of possible
"character noise" techniques, be it whitespace, letter->number
substitution, etc. What if we simply create a meta-token which is driven by
the ratio of recognized to unrecognized (non-meta) tokens?

In this way, the more noise a spammer adds to their message, the greater
the probability that the message will be considered "noisy spam". Repeats
of the same message after training would result in the message being
"recognized spam"; repeats before training would be spotted by their being
"noisy".

The natural spammer countermove to this is that they'll have to add lots of
boilerplate "hammy" English text to bump themselves back into the "unsure"
range, and/or begin adding noise only to highly spammy words. I already
get tons of spam about "seks" and "r4pe" and similar things.

I'm not sure what to do about these countermoves, but at least it puts us
back on level ground with the spammers again. I'm afraid that adding "bulk
noise" like whitespace and punctuation to messages would be a too-easily
automated anti-bayes move for spammers to adopt in general.
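
A minimal sketch of such a meta-token, under stated assumptions: the helper name and token spelling are invented for illustration, `known_tokens` stands for whatever set of tokens the trained database already holds, and the value used is the fraction of unrecognized tokens rather than a literal recognized-to-unrecognized ratio (which would divide by zero for a message with no unknown tokens).

```python
def noise_ratio_token(tokens, known_tokens):
    """Return a 'noise ratio: X.X' meta-token, or None for an empty message.

    tokens       -- the ordinary (non-meta) tokens of one message
    known_tokens -- container of tokens already seen in training
                    (hypothetical; any object supporting 'in' works)
    """
    tokens = list(tokens)
    if not tokens:
        return None
    # Count tokens the trained database has never seen.
    unknown = sum(1 for tok in tokens if tok not in known_tokens)
    return "noise ratio: %.1f" % (unknown / float(len(tokens)))
```

The resulting token would then be trained and scored like any other clue, so a heavily obfuscated message carries a high "noise ratio" bucket even when none of its individual words are known.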
From skip at pobox.com Fri Mar 7 15:11:56 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 16:12:04 2003 Subject: [Spambayes] ok, i'm confused Message-ID: <15977.2844.223581.728734@montanaro.dyndns.org> Here are the original X-Spambayes headers for the full-o'-spaces message: X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; 'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90 X-Spambayes-Classification: unsure; 0.46 After my latest tweak to the tokenizer (ratio of spaces to total number of characters, after deleting leading and trailing whitespace on each line) and complete retraining (11k+ ham 7k+ spam), I get: X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; 'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90 X-Spambayes-Classification: spam; 0.95 I've done nothing to adjust the values displayed in the X-Spambayes-Debug header, so all generated tokens should be displayed, and as you can see, all displayed tokens are the same, before and after. My space ratio token isn't displayed (if I insert a print before the relevant yield statement I see it has a value of 'space ratio: 0.9'). Why is the message now classified as spam when before is was solidly in the middle of unsure? Skip From tim at fourstonesExpressions.com Fri Mar 7 15:46:40 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 16:46:46 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307204002.GD13770@glacier.arctrix.com> Message-ID: 3/7/2003 2:40:02 PM, Neil Schemenauer wrote: >Tim Stone - Four Stones Expressions wrote: >> Rather, when we find a significant hole through which effective spam >> can squirt, we should plug it, rather than wait to see if any spammers >> find that same hole. 
>
>I agree (with emphases on the word "effective"). If spammers don't care
>about effectiveness than it will be extremely difficult to block their
>messages.
>
>> If it defeats the filters then the response rate, however dismal, will be
>> better than for spam that doesn't defeat the filters.
>
>Nope. If it costs them more money to send than what they make back it
>will not be better. Sending spam, however cheap, costs money.
>Therefore, at some non-zero response rate it becomes unprofitable to
>send it.

The above statement has nothing to do with the statement above it.

>
>> Again, I was suggesting that we find the holes before they do.
>
>Why?

I suppose you're satisfied with Microsoft's approach to security. Let's just
wait until some flood of spam makes it through our users' filters. We'll then
make a patch and post it. Very few will install it. In the meantime, users
will conclude that our stuff doesn't work very well, and we've lost.

>> Not to get spam into mailboxes, but to destroy the anti-spam community.
>
>Yikes, don't hurt me. I think you meant the the anti-anti-spam
>community. :-)

Ya... heh

Reminds me of a political cartoon during the days the ABM treaty was being
negotiated. There were Ballistic Missiles, Anti-Ballistic Missiles, AABMs,
AAABMs, etc. etc...

> Personally, I'm content with letting the anti-anti-spam
>community do what they will. If they come up with something the spam
>community adopts then I think we can deal with it.
>
>For example, the "HTML comments inside words" trick must be effective
>since I'm seeing it fairly often now. It's really a no brainer, since
>if the MTA understands HTML there is no visable difference in the
>message. Luckily SB already deals with this trick in a more general
>way.

My point exactly. Thank you for your tacit, though obviously accidental,
agreement!

>
>> Make people give up hope that this problem really is/can be solved.
>> That's the way to make you and me go away.
Simply make it so people >> don't believe in us. > >I'm having a little trouble parsing that. I think you are saying that >if the filter doesn't achieve the objective of keeping spam out of >people's mailboxes then people will not adopt it. That's true, but I >think the average person is fairly tolerant of FNs, as long as the FP >rate is very low. I think FNs annoy spam filter hackers more than >regular people. You parsed it correctly. Tolerant of FN is one thing, tolerant of a LOT of FN is quite another. Essentially, what we have with no filtering is all FN. I'm surprised at how annoyed I get when it misses only one. Especially if that one is particularly offensive and I think it *should* have caught it. But I'm working on this stuff, and so my tolerance is much higher than most. It takes very little to convince the teeming masses that something is not worth the trouble it takes to install it and keep it going, and that trouble is considerable for spambayes. Thus, at some (surprisingly low) threshold of FN, users will conclude that this stuff isn't worth the bother. Maybe filtering technology really can't evolve as quickly as spam can. I hope that's not the case. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim at fourstonesExpressions.com Fri Mar 7 16:07:51 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 17:07:57 2003 Subject: [Spambayes] ok, i'm confused In-Reply-To: <15977.2844.223581.728734@montanaro.dyndns.org> Message-ID: <6Z083XDA83431VFB5YA0RN1V5ZKJ6.3e691837@myst> Doesn't spambayes use the top 20 clues or so? Debug doesn't print all the clues, and the combiner doesn't use them all, either, IIRC. Maybe debug just isn't printing out everything that's being used? Strange. On the other hand, maybe this explains some of the FP and FN rate increases that have been being reported as of late... 
3/7/2003 3:11:56 PM, Skip Montanaro wrote: >Here are the original X-Spambayes headers for the full-o'-spaces message: > > X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; > 'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62; > 'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90 > X-Spambayes-Classification: unsure; 0.46 > >After my latest tweak to the tokenizer (ratio of spaces to total number of >characters, after deleting leading and trailing whitespace on each line) and >complete retraining (11k+ ham 7k+ spam), I get: > > X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; > 'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; > 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; > 'header:Received:3': 0.90 > X-Spambayes-Classification: spam; 0.95 > >I've done nothing to adjust the values displayed in the X-Spambayes-Debug >header, so all generated tokens should be displayed, and as you can see, all >displayed tokens are the same, before and after. My space ratio token isn't >displayed (if I insert a print before the relevant yield statement I see it >has a value of 'space ratio: 0.9'). Why is the message now classified as >spam when before is was solidly in the middle of unsure? > >Skip > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim.one at comcast.net Fri Mar 7 17:09:15 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Mar 7 17:09:52 2003 Subject: [Spambayes] ok, i'm confused In-Reply-To: <15977.2844.223581.728734@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Here are the original X-Spambayes headers for the full-o'-spaces message: > > X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; > ... 
> X-Spambayes-Classification: unsure; 0.46 > > After my latest tweak to the tokenizer (ratio of spaces to total number of > characters, after deleting leading and trailing whitespace on > each line) and complete retraining (11k+ ham 7k+ spam), I get: > > X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; > ... > X-Spambayes-Classification: spam; 0.95 > > I've done nothing to adjust the values displayed in the X-Spambayes-Debug > header, so all generated tokens should be displayed, and as you > can see, all displayed tokens are the same, before and after. I removed that part, in order to make an internal inconsistency clearer: the overall score is prob = (S-H + 1.0) / 2.0 and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~= 0.47. > ... > Why is the message now classified as spam when before is was solidly in > the middle of unsure? A sharper question is how (0.47-0.56 + 1.0) / 2.0 came out to be 0.95. Answer that, and you'll know everything . From tim at fourstonesExpressions.com Fri Mar 7 17:00:47 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 7 18:00:54 2003 Subject: [Spambayes] full o' spaces Message-ID: 3/7/2003 4:54:23 PM, Francois Granger wrote: >Word length seams to be a parameter with some "bracketed" values for >western european languages. Some food for thought here (four pages >pdf document): > >http://arxiv.org/pdf/cs.CL/0102026 Very interesting. Perhaps we should employ a variation of this algorithm... perhaps a simple average of word length, with high/low thresholds beyond which spam is indicated... 
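
The "simple average of word length, with high/low thresholds" idea could be sketched as below. The helper name, the token spellings, and especially the two threshold values are placeholders chosen for illustration; they are not taken from the linked paper or from any corpus measurement, and would need tuning against real ham and spam.

```python
def word_length_token(body, low=2.0, high=10.0):
    """Return a meta-token when the average word length is out of range.

    low/high are placeholder thresholds (assumptions, not measured values).
    Returns None when the average looks ordinary or the body is empty.
    """
    words = body.split()
    if not words:
        return None
    # Mean length of the whitespace-separated words.
    average = sum(len(word) for word in words) / float(len(words))
    if average < low:
        return "avg word length: low"
    if average > high:
        return "avg word length: high"
    return None
```

Emitting a token only for out-of-range messages keeps the clue from adding anything to messages with ordinary word lengths.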
c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From skip at pobox.com Fri Mar 7 17:19:25 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 18:19:33 2003 Subject: [Spambayes] ok, i'm confused In-Reply-To: References: <15977.2844.223581.728734@montanaro.dyndns.org> Message-ID: <15977.10493.531680.816324@montanaro.dyndns.org> Tim> I removed that part, in order to make an internal inconsistency Tim> clearer: the overall score is Tim> prob = (S-H + 1.0) / 2.0 Tim> and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~= Tim> 0.47. Problem solved. The message had already been run through spambayes once, so it already had X-Spambayes-Classification and X-Spambayes-Debug headers. The second time I ran it through hammiefilter manually I forgot to set BAYESCUSTOMIZE, so it didn't add a new debug header. It did, however, replace the original classification header with the new one. (Maybe all X-Spambayes headers should be deleted by default?) Here's what the Spambayes headers for that message look like now: X-Spambayes-Classification: spam; 1.00 X-Spambayes-Debug: '*H*': 0.00; '*S*': 1.00; 'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.34; 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.66; 'to:addr:bugs': 0.73; 'skip:1 10': 0.76; 'bytes/words: 2': 0.84; 'cc:addr:bugsmoke': 0.84; 'cc:addr:bugsmom16': 0.84; 'cc:addr:bugsmom_1982': 0.84; 'from:addr:diplomas.org': 0.84; 'from:addr:learning': 0.84; 'from:name:marie': 0.84; 'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84; 'to:addr:moi.com': 0.84; 'pfxlen:2': 0.87; 'cc:no real name:2**2': 0.87; 'cc:addr:mojam.com': 0.89; 'cc:addr:yahoo.com': 0.89; 'header:Received:3': 0.90; 'cc:addr:msn.com': 0.96; 'cc:addr:gateway.net': 0.97; 'cc:addr:bugs': 0.99 Note there are many more clues than before as well: X-Spambayes-Classification: unsure; 0.46 X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; 'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62; 
'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90 The original time it was run was against the spambayes sw and database I have on the Mojam web server (something I didn't notice originally either). I think either the database or the software there is getting a bit out-of-date. Note the lack of cc:addr headers which put this squarely in the spam domain. At this point, I'm going to hold off on the bytes/words ratio stuff. If anyone wants to play around with it, I'll be happy to send you a context diff for tokenize.py. Skip From skip at pobox.com Fri Mar 7 17:37:43 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 18:37:50 2003 Subject: [Spambayes] Eliminating duplicates from mbox file Message-ID: <15977.11591.575821.556483@montanaro.dyndns.org> While retraining today I flubbed at one point and wound up with a bunch of duplicates in my training sets. I wrote the attached script to eliminate the duplicates. I have a few questions: 1. Is this worth checking into the contrib directory? 2. Why did I have to subclass mailbox.PortableUnixMailbox? It looks on the surface like mailbox.PortableUnixMailbox ought to work as-is (it has both __iter__() and next()), but if I use it directly without subclassing I get this: Traceback (most recent call last): File "singular.py", line 32, in ? main() File "singular.py", line 18, in main for msg in mbox: TypeError: iteration over non-sequence (BTW, I get the same error if I iterate over the mbox file using mboxutils.getmbox.) 3. Is there a better way to emit the unique messages that doesn't require me to manually escape leading "From " sequences? Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/octet-stream Size: 722 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/b3068bf5/attachment.obj From popiel at wolfskeep.com Fri Mar 7 16:52:49 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel)
Date: Fri Mar 7 19:52:53 2003
Subject: [Spambayes] statistical comparison of enviroment?
In-Reply-To: Message from bill parducci of "Fri, 07 Mar 2003 14:50:53 PST." <3E69224D.8010103@parducci.net>
References: <3E668CA5.3050203@parducci.net>
 <20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
 <3E66B1D6.90308@parducci.net>
 <20030306040336.77E4E2DEA4@cashew.wolfskeep.com>
 <3E69224D.8010103@parducci.net>
Message-ID: <20030308005249.A465D2DDC7@cashew.wolfskeep.com>

In message: <3E69224D.8010103@parducci.net>
            bill parducci writes:
>
>T. Alexander Popiel wrote:
>> We've actually got a pretty good testing infrastructure set up;
>> for tokenization tests, I personally use timcv.py with each of the
>> tokenization options and then feed the output of the runs into
>> table.py. This produces some nice tabularizations that you may
>> notice in the mailing list archives.
>
>by any chance do you have an example of how this is initiated? (fyi: it
>seems that there is an issue with the command line 'help' option.)

Argh. You're running into the same problem I did originally, due to
the testing stuff being in a subdir and the spambayes stuff not being
on your python path. This is perhaps one of the most annoying bits
about the system.

I just checked in a fix to timcv.py which appropriately mangles the
python path before trying to import the spambayes stuff. I don't think
this will break anybody... if it does, please tell me the proper way
to mangle the python path for an unprivileged user. Remember, I'm a
relative python newbie, too.

As to more general instructions:

1. Set up your corpora in subdirectories named Data/Ham/reservoir
   and Data/Spam/reservoir, with one message per file. The
   splitndirs.py under utilities may be of help here if you're starting
   from mboxes, or es2hs.py under testtools if you're starting from
   an MH setup like mine.

2. If you're going to do any incremental testing, sort and group
   the corpora with sort+group.py.

3. Decide how many sets you want for your cross-validation.
   Personally, I use 5. Then use either rebal.py (from the utilities)
   or mksets.py (from testtools) to populate the sets, depending on
   whether or not you chose to sort+group... mksets.py doesn't like
   filenames not in the special format for incremental testing.

4. Set up an .ini file with whatever options you want to use as
   baseline. Set the BAYESCUSTOMIZE environment variable to that
   .ini file, then run timcv.py and capture the output.

5. Set up another .ini file with whatever options you want to test.
   Set the BAYESCUSTOMIZE environment variable to that .ini file,
   then run timcv.py again and capture the output to a different file.

6. Run table.py on the two output files from timcv.py. Mail the
   results to the list. :-)

Enjoy.

- Alex

From bill at parducci.net Fri Mar 7 17:19:01 2003
From: bill at parducci.net (bill parducci)
Date: Fri Mar 7 20:19:05 2003
Subject: [Spambayes] full o' spaces
In-Reply-To:
References:
Message-ID: <3E694505.40208@parducci.net>

i know that fixed length delimiting has been tried, but i wonder how well
it would work for something like this if all the non 'a-zA-Z0-9' chars
were removed first (basically creating 1 'superword' per region). it would
seem to speak to a number of issues like:

s p a c e s i n p l a c e s
l.o.w..p.r.o.f.i.l.e,,c,h,a,r,s
and_low_profile_chars
CamelCaseTyping
(bracketing){and}[bracketing]
(a)(n)(d) (b)(r)(a)(c)(k)(e)(t)(i)(n)(g)
fence|posting|!fence!posting

this is the direction of thinking that i started down when i was first
confronted with this because of the power of wetware to absorb a MEME; it led
me to many hours of fruitless delimiter selection examination. this is not
at all to say that this will be the case here but as new ideas are bandied
about, i posit that it is a good idea to make sure that previously
discarded methodologies be reexamined periodically.
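
One reading of the "strip everything outside a-zA-Z0-9, then delimit at a fixed length" idea is sketched below. The helper name and the chunk size of 5 are assumptions, and treating the whole input as a single region is a simplification of the "1 'superword' per region" wording.

```python
import re

# Anything that is not an ASCII letter or digit is treated as noise.
_NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")


def superword_chunks(text, size=5):
    """Squeeze out non-alphanumerics, then cut into fixed-length pieces.

    Hypothetical helper: 'size' is an arbitrary illustrative default.
    """
    squeezed = _NON_ALNUM.sub("", text)
    return [squeezed[i:i + size] for i in range(0, len(squeezed), size)]
```

With this, "l.o.w..p.r.o.f.i.l.e" and "low_profile" produce the same chunks, which is the property the examples above are after.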
b

From tim.one at comcast.net Fri Mar 7 20:20:27 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Mar 7 20:21:03 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15977.11591.575821.556483@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> While retraining today I flubbed at one point and wound up with a bunch of
> duplicates in my training sets. I wrote the attached script to eliminate
> the duplicates. I have a few questions:
>
> 1. Is this worth checking into the contrib directory?

Not for Outlook users .

> 2. Why did I have to subclass mailbox.PortableUnixMailbox?

You shouldn't have to, and you shouldn't have to check for "msg is None"
either. Note that some of the earliest scripts in the codebase don't do
either. For example, from split.py:

    mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
    for msg in mbox:
        if random.random() < percent:
            outfp = bin1out
        else:
            outfp = bin2out
        astext = str(msg)
        assert astext.endswith('\n')
        outfp.write(astext)

> ...
> 3. Is there a better way to emit the unique messages that doesn't
> require me to manually escape leading "From " sequences?

Looks to me like the email pkg (at least the one in Python CVS) already does
the ">From" bit within msg bodies. The *leading* "From " isn't supposed to
be escaped -- "From " at the start of a line within a body is supposed to be
escaped precisely so that an unescaped "From " at the start of a line is
recognized as the start of a new msg.

From popiel at wolfskeep.com Fri Mar 7 18:05:24 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar 7 21:23:37 2003
Subject: [Spambayes] Bytes/words ratio
Message-ID: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>

Skip's bytes/words metatoken seems to be a bust.

-> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
-> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
-> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
-> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams

filename:   baseline.out   skiptok.out
ham:spam:   10258:19188    10258:19188
fp total:        16             16
fp %:             0.16           0.16
fn total:        52             52
fn %:             0.27           0.27
unsure t:       296            303
unsure %:         1.01           1.03
real cost:     $271.20        $272.60
best cost:     $252.40        $254.80
h mean:           0.40           0.39
h sdev:           5.35           5.31
s mean:          99.21          99.19
s sdev:           6.79           6.85
mean diff:       98.81          98.80
k:                8.14           8.12

Not much to say; all it did was make a few more things unsure
by spreading out the spam a bit more. Blah.

- Alex

From spambayes_discussion at cklowe.com Sat Mar 8 02:43:11 2003
From: spambayes_discussion at cklowe.com (Chris Lowe)
Date: Fri Mar 7 21:43:14 2003
Subject: [Spambayes] Outlook Express integration
Message-ID: <00cf01c2e51c$75810570$8f526451@blueeyes>

Hello

I'm a newbie in all sorts of ways, so please forgive me for being crass.

I've managed to get Spambayes working with Outlook Express, but it isn't
pretty. Details are here:

http://www.apt202.net/cgi-bin/wiki.pl?SpamBayesOutlookExpress

Basically I've changed the hammie_header_name to 'To', so OE can filter on
it. A few minor mods to pop3proxy.py were required because there's usually
another 'To' header present.

I personally think the HTML interface is OK for training, but I can see the
obvious attraction of an integrated solution as offered by the Outlook plug-in.
The technique also seems to work OK with Netscape, but then again netscape can cope OK with 'X-Spambayes-Classification' as a custom header. Would you be so kind as to offer some suggestions on how I could improve this? Cheers, Chris Lowe From tim.one at comcast.net Fri Mar 7 22:57:38 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri Mar 7 22:58:10 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030307140247.A16563@discworld.dyndns.org> Message-ID: [Neil Schemenauer] >> I disagree. We should not abandon the rigorous, testing based >> strategy that got SB to its current state. Adding more code every >> time a spammer comes up with a new trick is completely reactionary >> and will eventually destroy the code base. [Charles Cazabon] > Hear, hear. Don't turn SpamBayes into a convoluted, hocus-pocus > collection of ad-hoc rules a la SpamAssasin. Indeed, I'd rather keep it a convoluted, hocus-pocus collection of tokenization gimmicks <0.9 wink>. Really, I doubt SpamAssassin has anything more bizarre than our "skip:" tokens, and I kept the latter because taking them out hurt results. I've never been sure why -- and I was never able to find a way of summarizing thrown-out "too-long tokens" that did as well, either. There's magic enough to go around. Also ego deflaters! I'm still convinced that preserving case should help, and also looking at (at least) bigrams -- unfortunately, the data didn't agree. It may in the future, though, if spam gets more sophisticated. > Keep testing; if a technique doesn't measurably improve the result, toss it. At the time I got yanked from this project, I was looking to remove code rather than add more. There are too many tokenization options already, and it isn't clear that some of them do anyone any good anymore. The gary_combining classifier scheme should also go away. 

From tim at fourstonesExpressions.com Fri Mar 7 22:01:42 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar 7 23:01:50 2003
Subject: [Spambayes] full o' spaces
In-Reply-To:
Message-ID: <6ZA6HGVUIC6ZBA85APKQNJETQVQQP83.3e696b26@myst>

3/7/2003 9:57:38 PM, Tim Peters wrote:

> There's magic enough to go around. Also ego deflaters! I'm still
>convinced that preserving case should help, and also looking at (at least)
>bigrams -- unfortunately, the data didn't agree. It may in the future,
>though, if spam gets more sophisticated.

The war will indeed be very interesting ;)

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From skip at pobox.com Fri Mar 7 22:28:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar 7 23:28:25 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To:
References: <15977.11591.575821.556483@montanaro.dyndns.org>
Message-ID: <15977.29030.890609.602417@montanaro.dyndns.org>

>> 2. Why did I have to subclass mailbox.PortableUnixMailbox?

Tim> You shouldn't have to, and you shouldn't have to check for "msg is
Tim> None" either. Note that some of the earliest scripts in the
Tim> codebase don't do either. For example, from split.py:

    mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
    for msg in mbox:
        if random.random() < percent:
            outfp = bin1out
        ...

Yeah, I know. That's how I originally wrote it. Without the test against None it just went into an infloop.

>> 3. Is there a better way to emit the unique messages that doesn't
>> require me to manually escape leading "From " sequences?

Tim> Looks to me like the email pkg (at least the one in Python CVS)
Tim> already does the ">From" bit within msg bodies.

I figured it must have. Must be something other than the .as_string() method though. It clearly doesn't escape "\nFrom " as "\n>From ".

Tim> The *leading* "From " isn't supposed to be escaped --

Correct.
Tim> "From " at the start of a line within a body is supposed to be Tim> escaped precisely so that an unescaped "From " at the start of a Tim> line is recognized as the start of a new msg. I guess I was really asking if there's something better than .as_string() to call when I want to emit a message. I don't see anything obvious in the online docs though. Skip From skip at pobox.com Fri Mar 7 22:32:35 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 23:32:38 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com> References: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com> Message-ID: <15977.29283.908599.21234@montanaro.dyndns.org> Alex> Skip's bytes/words metatoken seems to be a bust. I take (mild) exception to that. It was TimP's idea. Perhaps I implemented it wrong. ;-) Also, note that Tim indicated it helped in his early testing. Skip From skip at pobox.com Fri Mar 7 22:36:38 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 7 23:36:40 2003 Subject: [Spambayes] Eliminating duplicates from mbox file In-Reply-To: <15977.29030.890609.602417@montanaro.dyndns.org> References: <15977.11591.575821.556483@montanaro.dyndns.org> <15977.29030.890609.602417@montanaro.dyndns.org> Message-ID: <15977.29526.301050.465780@montanaro.dyndns.org> >>> 2. Why did I have to subclass mailbox.PortableUnixMailbox? Tim> You shouldn't have to, and you shouldn't have to check for "msg is Tim> None" either. Note that some of the earliest scripts in the Tim> codebase don't do either. For example, from split.py: Skip> mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message) Skip> for msg in mbox: Skip> if random.random() < percent: Skip> outfp = bin1out Skip> ... Skip> Yeah, I know. That's how I originally wrote it. Without the test Skip> against None it just went into an infloop. Yuck, badly worded. I should have said something like Yeah, I know. That's how I originally wrote it. 
After subclassing PortableUnixMailbox to get the "for msg in mbox:" to succeed, without the test against None in the loop it just went into an infloop.

Skip

From tim.one at comcast.net Sat Mar 8 00:01:05 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar 8 00:01:37 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15977.29526.301050.465780@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> Yuck, badly worded. I should have said something like
>
> Yeah, I know. That's how I originally wrote it. After subclassing
> PortableUnixMailbox to get the "for msg in mbox:" to succeed, without
> the test against None in the loop it just went into an infloop.

Except that you shouldn't have needed to subclass, just as the sample code I showed didn't need to subclass. That's where the problem lies. After you subclassed it, the None problem was probably due to the subclassing (indeed, it clearly was due to the subclassing: you had your subclass __iter__ return self, and self.next() can return None then; the mailbox.PortableUnixMailbox.__iter__ returns iter(self.next, None), which cannot return None).

To get anywhere else with this and without benefit of telepathy, you should create a self-contained small test case and make sure you're using a self-consistent set of factory-standard software. The problem is why you needed to subclass to begin with: as you originally noted, mailbox.PortableUnixMailbox already supplied __iter__, so it makes no sense that you had to supply your own. Something else is wrong.

From tim.one at comcast.net Sat Mar 8 00:05:01 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar 8 00:05:39 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <15977.29283.908599.21234@montanaro.dyndns.org>
Message-ID:

[Alex]
> Skip's bytes/words metatoken seems to be a bust.

[Skip]
> I take (mild) exception to that. It was TimP's idea. Perhaps I
> implemented it wrong.
;-) Also, note that Tim indicated it helped in his
> early testing.

Nope, I said it was a strong spam indicator, but that it made no difference to error rates. That's the same outcome Alex just reported (I didn't see a significant difference in his before-and-after results; no change in FP or FN, and (just) a few msgs tipped into Unsure).

Another example may help to clarify: in just about anyone's test data, "
" would be a very strong spam indicator, if the tokenizer produced it. I expect that adding it into the mix would boost the FP rate, though -- at least for those of us with sisters . From tim.one at comcast.net Sat Mar 8 00:08:14 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 00:08:53 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <6ZA6HGVUIC6ZBA85APKQNJETQVQQP83.3e696b26@myst> Message-ID: [Tim Stone] > The war will indeed be very interesting ;) Starting when <0.9 wink>? From tim.one at comcast.net Sat Mar 8 00:20:51 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 00:21:30 2003 Subject: [Spambayes] Eliminating duplicates from mbox file In-Reply-To: <15977.29030.890609.602417@montanaro.dyndns.org> Message-ID: > Tim> Looks to me like the email pkg (at least the one in Python CVS) > Tim> already does the ">From" bit within msg bodies. [Skip Montanaro] > I figured it must have. Must be something other than the .as_string() > method though. It clearly doesn't escape "\nFrom " as "\n>From ". Stick some prints in the code. In the _handle_text() method, see whether this block is getting executed (it should be): if self._mangle_from_: payload = fcre.sub('>From ', payload) If it isn't, trace it back from there. > ... > I guess I was really asking if there's something better than > .as_string() to call when I want to emit a message. I don't see anything > obvious in the online docs though. I think Barry usully uses str(msg), which is equivalent to msg.as_string(unixfrom=True) Either way, it leads pretty directly to the _mangle_from code quoted above. From popiel at wolfskeep.com Fri Mar 7 21:32:54 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Sat Mar 8 00:32:59 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: Message from Skip Montanaro <15977.29283.908599.21234@montanaro.dyndns.org> References: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com> <15977.29283.908599.21234@montanaro.dyndns.org> Message-ID: <20030308053254.CCEC52DDC7@cashew.wolfskeep.com> In message: <15977.29283.908599.21234@montanaro.dyndns.org> Skip Montanaro writes: > > Alex> Skip's bytes/words metatoken seems to be a bust. > >I take (mild) exception to that. It was TimP's idea. Perhaps I implemented >it wrong. ;-) Also, note that Tim indicated it helped in his early testing. Aye, you're right. I should have said that it seems to be a bust for my corpus. My apologies. Does anybody else have a decently sized corpus (I believe we were using a minimum of 2000 each of spam and ham for the last shootout) who's willing to test this goodie? - Alex From mike at plokta.com Sat Mar 8 08:29:42 2003 From: mike at plokta.com (Mike Scott) Date: Sat Mar 8 03:29:43 2003 Subject: [Spambayes] Headers and pop3proxy Message-ID: <1C0EA690-5140-11D7-BE5F-000393DB4B0C@plokta.com> Is there an easy way (perhaps a parameter in bayescustomize.ini) to get pop3proxy to add a header giving the spam probability score, as well as the one classifying the message as ham/unsure/spam? This would make it easier to fine-tune the min and max scores to get email classified correctly -- I get no false negatives at all, and not much in the unsure category, but I get a few false positives. So I need to increase the spam cutoff (currently at 0.95), but I don't know how much. 
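
One way to explore the cutoff question offline, sketched under assumptions: suppose the per-message scores were already available as (score, is_spam) pairs. The `samples` data and the `count_errors` helper below are made up for illustration and are not part of pop3proxy. Counting errors at several candidate cutoffs shows what raising the spam cutoff would trade away.

```python
def count_errors(scored, spam_cutoff):
    """Count hams at or above the cutoff and spams that fall below it.

    scored is an iterable of (score, is_spam) pairs, scores in [0.0, 1.0].
    Returns (false_positives, spams_below_cutoff).
    """
    scored = list(scored)
    false_pos = sum(1 for score, is_spam in scored
                    if not is_spam and score >= spam_cutoff)
    below = sum(1 for score, is_spam in scored
                if is_spam and score < spam_cutoff)
    return false_pos, below


# Hypothetical scores, invented purely for illustration.
samples = [(0.99, True), (0.97, False), (0.96, True), (0.10, False), (1.00, True)]

for cutoff in (0.95, 0.98, 0.99):
    print(cutoff, count_errors(samples, cutoff))
```

Spams that fall below the cutoff are not necessarily false negatives; depending on the ham cutoff they may simply land in the unsure category.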
-- Mike Scott mike@plokta.com From anthony at interlink.com.au Sat Mar 8 19:45:00 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Sat Mar 8 03:45:28 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: <200303080845.h288j0C16756@localhost.localdomain> >>> Tim Stone replying to Neil Schemenauer > > Adding more code every time a spammer > >comes up with a new trick is completely reactionary and will eventually > >destroy the code base. I'm mystified as to how you can call such an > >approach proactive. > > Again, I was suggesting that we find the holes before they do. I think > that we should begin to think like spammers, not like people trying to > defeat spammers. If we were on the other side, what would we do? Gosh, > I can think of things, simple things. And if I can find something > that actually crashes the tokenizer, all the better. I'll look at the > code, more closely than most on this team ever will. I'll find the > holes, and blast away. My goal? Not to get spam into mailboxes, but to > destroy the anti-spam community. Make people give up hope that this > problem really is/can be solved. That's the way to make you and me go > away. Simply make it so people don't believe in us. We're not talking about something that crashes the tokenizer. We're talking about a new spam technique that's been seen in a very small number of live spams. I've not yet seen one of these, and I get an absolute shiteload of spam every day. Note also that a lot of people run spamassassin, and it's absolute death on this technique (called "gappy text", from memory). The chances of this technique surviving very long is very small. We can sit here for days, weeks and months and think of ways to defeat the existing classifier. We have done that, in the past. But a change that is not tested and shown to improve existing results, does _not_ belong in the code base. It goes against _everything_ that has made this project successful. 
Sure - if you find a way to actually crash the tokeniser, then the fix should go in. But "what if"ing serves no use, and may make things worse. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony at interlink.com.au Sat Mar 8 19:51:22 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Sat Mar 8 03:51:50 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: <200303080851.h288pMv16800@localhost.localdomain> >>> Tim Peters wrote > At the time I got yanked from this project, I was looking to remove code > rather than add more. There are too many tokenization options already, and > it isn't clear that some of them do anyone any good anymore. The > gary_combining classifier scheme should also go away. I was wondering about that last time I was trying to get some new graphs for the SB website. Does anyone have any real objections to this going away? If not, I'll kill it all on monday (I'll put a Last_Gary tag on the version before the code removal). Anthony -- Anthony Baxter It's never too late to have a happy childhood. From skip at pobox.com Sat Mar 8 07:32:52 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 8 08:32:54 2003 Subject: [Spambayes] Eliminating duplicates from mbox file In-Reply-To: References: <15977.29030.890609.602417@montanaro.dyndns.org> Message-ID: <15977.61700.812466.667050@montanaro.dyndns.org> Tim> Stick some prints in the code. In the _handle_text() method, see Tim> whether this block is getting executed (it should be): Tim> if self._mangle_from_: Tim> payload = fcre.sub('>From ', payload) Okay, I'll give that a try. The reason I stuck in the replace() call was that what it told me the number of messages was (len(d), where d is the dict using md5 checksums as keys) differed from what "egrep '^From ' out" told me after it had generated the output file (there were four more "^From " lines than the number of messages in the dict). Once I added the replace() call, they agreed. 
Given that, I think there's a bug without inserting prints. (I had planned to submit a bug report today.) Skip From skip at pobox.com Sat Mar 8 08:36:48 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 8 09:36:50 2003 Subject: [Spambayes] Eliminating duplicates from mbox file Message-ID: <15978.0.395098.109027@montanaro.dyndns.org> >> 2. Why did I have to subclass mailbox.PortableUnixMailbox? Tim> You shouldn't have to... *sigh* I come before the bar asking humbly for forgiveness... I was doing all this from my ~/tmp directory, which, lo and behold, had a version of mailbox.py dating from September 2001. The _Mailbox class had next() but not __iter__. Who knows what other semantic differences existed. Sorry for the wasted bandwidth. Skip From tim at fourstonesExpressions.com Sat Mar 8 09:10:33 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 8 10:10:40 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <200303080851.h288pMv16800@localhost.localdomain> Message-ID: <5LLGA6NM65TPWZV214285M1U622.3e6a07e9@myst> 3/8/2003 2:51:22 AM, Anthony Baxter wrote: > >>>> Tim Peters wrote >> At the time I got yanked from this project, I was looking to remove code >> rather than add more. There are too many tokenization options already, and >> it isn't clear that some of them do anyone any good anymore. The >> gary_combining classifier scheme should also go away. > >I was wondering about that last time I was trying to get some new graphs >for the SB website. Does anyone have any real objections to this going away? >If not, I'll kill it all on monday (I'll put a Last_Gary tag on the version >before the code removal). I think we should get rid of any related options, too: use_gary_combining and use_chi_squared_combining. Perhaps this would be a good time to make experimental_ham_spam_imbalance_adjustment permanent? > >Anthony >-- >Anthony Baxter >It's never too late to have a happy childhood. 
> > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim at fourstonesExpressions.com Sat Mar 8 09:25:21 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 8 10:25:27 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <200303080845.h288j0C16756@localhost.localdomain> Message-ID: <97746185C05Z4WKJO17232KGFCZWOM.3e6a0b61@myst> 3/8/2003 2:45:00 AM, Anthony Baxter wrote: >We can sit here for days, weeks and months and think of ways to defeat >the existing classifier. We have done that, in the past. But a change that >is not tested and shown to improve existing results, does _not_ belong >in the code base. It goes against _everything_ that has made this project >successful. Ok, so let me summarize what I think our discussion has boiled down to. 1. We will not make changes that regress our results on existing spam. 2. We will engage in ongoing analysis of spam, keeping our testing corpora up to date as best we can. When significant (we have yet to define significant) amounts of FN start happening, we will adjust the tokenizer appropriately. Point 1 is a given. There seems to be considerable inertia in the project toward using point 2 as an ongoing strategy. I can live with it, because there's tremendous value in what we're doing, and it really does work. I just have to say, though, that from a marketing viewpoint (believe it or not, I was a marketer in a former life), this strategy can potentially shoot us in the foot, because we aren't the ones finding problems, spammers are, and I think this could cause our users to lose faith in our product. "I trained this stuff as spam, and this thing STILL doesn't catch it." If that happens to a user more than a few times, the conclusion will be that it doesn't work. 
I'm telling you, it doesn't take but one bad article in a ZD publication, and it's all over with for us. Ok, I'm off my soapbox. This has been a great discussion. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From N7DR at arrisi.com Sat Mar 8 08:31:03 2003 From: N7DR at arrisi.com (D. R. Evans) Date: Sat Mar 8 10:31:09 2003 Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes In-Reply-To: References: <3E67575B.3086.24A05058@localhost> Message-ID: <3E69AA47.15581.8DD9511@localhost> On 6 Mar 2003 at 21:07, Tim Stone - Four Stones Expre wrote: > 3/6/2003 3:12:43 PM, "D. R. Evans" wrote: > > >On 6 Mar 2003 at 9:50, Tim Stone - Four Stones Expre wrote: > > > >> Nearly as I can tell, your training database has been corrupted. I'm > >> not quite sure how this happened, but from what I see in the code, > >> there is likely no recovery at this point. When you submit a bug > >> report, go ahead and attach your training database. > > The database is definitely corrupted. This is the first time I've seen > this. The 'saved state' key in the database (where spamcount and > hamcount are maintained) has a corrupt value, that kills the unpickler. > > There are >88,000 words in this database, and apparently the machine was > rebooted without a proper shutdown. This is bad. > I plugged my handspring into a USB port to do a sync (as usual) and the machine completely froze (not as usual). Dead. Could no longer even reach it from other machines on the network. So I had to power down. However, I do note that I was NOT doing any spambayes-related operations at the time (unless pop3proxy goes off and does things in the background, which I don't think it does). > D.R. I need you to do a couple things: > > If you have the spam and ham saved in an mbox or something, then you can > simply delete the database files and retrain from scratch. This would > be the best alternative. 
If this isn't the case, if you can remember, > or figure out some way, how many spams and hams were trained into this > database, I can recover it for you. Even a rough estimate will likely > do. > I'll just reinstall and start all over again. Not a problem. Almost certainly much easier than having you try to reconstruct the database. > And... can you tell me, if you know, what dbm module is in use? Maybe > someone can give us a few lines of python you can run that will tell us > that info. It's too late for me to bring it to mind... > If someone can post how to find that out, I'll gladly run it. Doc > c'est moi - TimS > http://www.fourstonesExpressions.com > http://wecanstopspam.org > > -------------------------------------------------------------- Phone: +1 303 494 0394 Mobile: +1 720 839 8462 Fax: +1 781 240 0527 -------------------------------------------------------------- From nas at python.ca Sat Mar 8 08:43:48 2003 From: nas at python.ca (Neil Schemenauer) Date: Sat Mar 8 11:34:11 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com> References: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com> Message-ID: <20030308164347.GA16439@glacier.arctrix.com> T. Alexander Popiel wrote: > Skip's bytes/words metatoken seems to be a bust. I'll take the blame. I think neither Skip nor Tim explicitly said it was a good idea. Thanks for testing. Neil From skip at pobox.com Sat Mar 8 11:29:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 8 12:30:01 2003 Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes In-Reply-To: <3E69AA47.15581.8DD9511@localhost> References: <3E67575B.3086.24A05058@localhost> <3E69AA47.15581.8DD9511@localhost> Message-ID: <15978.10381.283357.950737@montanaro.dyndns.org> >> There are >88,000 words in this database, and apparently the machine >> was rebooted without a proper shutdown. This is bad. 
Doc> I plugged my handspring into a USB port to do a sync (as usual) and
Doc> the machine completely froze (not as usual). Dead. Could no longer
Doc> even reach it from other machines on the network. So I had to power
Doc> down.

Doc> However, I do note that I was NOT doing any spambayes-related
Doc> operations at the time (unless pop3proxy goes off and does things
Doc> in the background, which I don't think it does).

If pop3proxy was running, even if it wasn't analyzing any messages at that instant, it probably had the database open. For performance reasons, the BerkeleyDB library does a fair amount of caching. It is quite possible the database was in an invalid state at the time your machine froze.

All may not be lost however. Did your BerkeleyDB package come with a db_recover command? If so, it may be able to repair the damage.

For those who haven't investigated all the mysteries of the BerkeleyDB package, it comes with a number of command-line programs which manipulate the database in various ways:

    db_archive     db_deadlock  db_load      db_recover  db_upgrade
    db_checkpoint  db_dump      db_printlog  db_stat     db_verify

You can read all about them at

http://www.sleepycat.com/docs/utility/index.html

Does anyone know if the Windows distribution of Python comes with these utilities? If not, it probably should. db_dump, db_load, db_upgrade, db_verify and db_recover are particularly useful.

Skip

From francois.granger at free.fr Sat Mar 8 19:17:09 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sat Mar 8 13:17:16 2003
Subject: [Spambayes] Another issue with the email package
Message-ID:

Today I got a mail with a "return-space" in the subject field. It was not tagged at all, and I can't find it in the cache directories. I have a copy of it in my Eudora mailbox, but this is not of much help.

Here is a copy and paste of the headers around this:

[...]
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.2) Gecko/20021120 Netscape/7.01 X-Accept-Language: en-us, en MIME-Version: 1.0 To: francis.bebey@free.fr Subject: demande de renseignements (sans photo attach? ) Content-Type: text/plain; format=flowed Content-Transfer-Encoding: 8bit Bonjour, Je voudrais des informations sur la disponibilite de la chanson "je vous aime zaime zaime", je ne le trouve ici en Belgique pas dans les [...] -- http://fgranger.net1.nerim.net:8000/cgi-bin/pyblosxom.cgi From stephena at hiwaay.net Sat Mar 8 11:34:58 2003 From: stephena at hiwaay.net (Stephen Anderson) Date: Sat Mar 8 14:35:38 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: References: <15977.29283.908599.21234@montanaro.dyndns.org> Message-ID: <3E69D562.5861.18EF176D@localhost> On 8 Mar 2003 at 0:05, Tim Peters wrote: > Another example may help to clarify: in just about anyone's test data, > "
" would be a very strong spam indicator, if the tokenizer produced > it. I expect that adding it into the mix would boost the FP rate, > though -- at least for those of us with sisters . Okay Tim, I just can't take it anymore. My curiosity has gotten the best of me. Would you please ask your sisters to email me a sample of one of their very pretty HTML emails you keep referring to. I have a sister too, but her HTML emails are almost indistinguishable in presentation from that of a plain-text one. So, can you help me with my burning question: Just what does pretty email look like? Geeky regards, Steve From tim at fourstonesExpressions.com Sat Mar 8 13:53:27 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 8 14:53:37 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: <3E69D562.5861.18EF176D@localhost> Message-ID: 3/8/2003 1:34:58 PM, "Stephen Anderson" wrote: >Okay Tim, I just can't take it anymore. My curiosity has gotten the best of me. Would you >please ask your sisters to email me a sample of one of their very pretty HTML emails you >keep referring to. Woah... This ain't no matchmaking mailing list... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From lists at morpheus.demon.co.uk Sat Mar 8 21:18:55 2003 From: lists at morpheus.demon.co.uk (Paul Moore) Date: Sat Mar 8 16:21:26 2003 Subject: [Spambayes] full o' spaces References: <200303080845.h288j0C16756@localhost.localdomain> <97746185C05Z4WKJO17232KGFCZWOM.3e6a0b61@myst> Message-ID: Tim Stone - Four Stones Expressions writes: > 3/8/2003 2:45:00 AM, Anthony Baxter wrote: > >>We can sit here for days, weeks and months and think of ways to defeat >>the existing classifier. We have done that, in the past. But a change that >>is not tested and shown to improve existing results, does _not_ belong >>in the code base. It goes against _everything_ that has made this project >>successful. 
> > Ok, so let me summarize what I think our discussion has boiled down to. > > 1. We will not make changes that regress our results on existing spam. > > 2. We will engage in ongoing analysis of spam, keeping our testing corpora up > to date as best we can. When significant (we have yet to define significant) > amounts of FN start happening, we will adjust the tokenizer appropriately. > > Point 1 is a given. There seems to be considerable inertia in the project > toward using point 2 as an ongoing strategy. I can live with it, because > there's tremendous value in what we're doing, and it really does work. I just > have to say, though, that from a marketing viewpoint (believe it or not, I was > a marketer in a former life), this strategy can potentially shoot us in the > foot, because we aren't the ones finding problems, spammers are, and I think > this could cause our users to lose faith in our product. "I trained this > stuff as spam, and this thing STILL doesn't catch it." If that happens to a > user more than a few times, the conclusion will be that it doesn't work. I'm > telling you, it doesn't take but one bad article in a ZD publication, and it's > all over with for us. > > Ok, I'm off my soapbox. This has been a great discussion. Can I borrow that box for a moment? Thanks... :-) The key point, for me, is that spambayes is the only anti-spam tool I have ever used that made a real dent in my spam problem. And the dent it made was pretty much total. While I still get unsures, and even the occasional FN, in reality I don't have a spam problem any more. I don't know why spambayes is so good, but the single most distinctive aspect of the project is the rigorous analysis of results, and ruthless refusal to include techniques which don't pull their weight. When I mention spambayes to friends, my "marketing" approach is, basically: 1. It works. Really well. 2. It learns what you consider spam, and acts on that. 3. 
It's been tested on thousands of spam, with error rates so low as to be negligible. 4. You do need to maintain it - a little ongoing training helps (but it's not a major task, and if you don't bother, you're still going to get very impressive results) 5. Er. But it's a bit rough around the edges still. I'll help you install it, if you like. Notice (5). That's what is killing us right now with real people (me, I'm a figment of your imagination: be very afraid ). Anything else is minor. Your point (2) means that we can claim that we know it works - we've tested it (my point (3)). Pre-emptive attempts to address possible new spam tricks loses that - you can't *prove* the effectiveness of a new technique if you don't have corpora with evidence of that technique to test against. I view the benefit of being able to show proof that the program works as greater than the risk of being branded reactive. Oh, and by the way - you use Microsoft's security strategy to demonstrate that a reactive approach is bad. But that's FUD. Another business that is (as far as the general public is aware) totally reactive is the anti-virus business. If you liken the spambayes approach to an anti-virus strategy, it suddenly looks much better :-) OK, who wants the box next? Paul. -- This signature intentionally left blank From tim at fourstonesExpressions.com Sat Mar 8 16:38:08 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 8 17:38:18 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: 3/8/2003 3:18:55 PM, Paul Moore wrote: >> >> Ok, I'm off my soapbox. This has been a great discussion. > >Can I borrow that box for a moment? Thanks... :-) I yield the floor. >1. It works. Really well. >2. It learns what you consider spam, and acts on that. >3. It's been tested on thousands of spam, with error rates so low as > to be negligible. >4. 
You do need to maintain it - a little ongoing training helps (but > it's not a major task, and if you don't bother, you're still going > to get very impressive results) >5. Er. But it's a bit rough around the edges still. I'll help you > install it, if you like. > >Notice (5). That's what is killing us right now with real people (me, >I'm a figment of your imagination: be very afraid ). Anything >else is minor. Absolutely. >If you liken the spambayes >approach to an anti-virus strategy, it suddenly looks much better :-) Hmmm... interesting analog, but it only goes so far. Viruses would be a vastly smaller threat had microsoft engaged in the strategy that I'm arguing for. Trojans, worms, etc... the face of the online world would be considerably different had they invested in building fundamentally secure systems... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From noreply at sourceforge.net Sat Mar 8 16:53:30 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Sat Mar 8 19:50:16 2003 Subject: [Spambayes] [ spambayes-Bugs-700165 ] MoveFileEx doesn't exist on Win98 Message-ID: Bugs item #700165, was opened at 2003-03-08 19:53 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Tim Peters (tim_one) Assigned to: Mark Hammond (mhammond) Summary: MoveFileEx doesn't exist on Win98 Initial Comment: After a CVS up, Outlook craps out on Win98SE now in BayesManager._MigrateFile. File "C:\Code\spambayes\Outlook2000\manager.py", line 213, in _MigrateFile win32con.MOVEFILE_COPY_ALLOWED) pywintypes.error: (120, 'MoveFileEx', 'This function is only valid in Win32 mode.') which really seems to mean that MoveFileEx isn't supported at or before Win98. 
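
A generic way to cope with a platform call that is missing, sketched here under assumptions: this is not the patch that was checked in for this bug. The name `migrate_file` and the `primary_move` parameter are hypothetical, and `primary_move` merely stands in for whatever platform-specific call is preferred. If that call raises, the sketch falls back to a plain copy followed by a delete, which needs nothing beyond the standard library.

```python
import os
import shutil


def migrate_file(src, dst, primary_move=os.replace):
    """Move src to dst, falling back to copy-and-delete if the move fails.

    primary_move is a stand-in for a platform-specific call; the exception
    types caught here are an assumption and would need to match whatever
    that call actually raises.
    """
    try:
        primary_move(src, dst)
    except (OSError, NotImplementedError):
        # Fallback path: copy the bytes, then remove the original.
        shutil.copyfile(src, dst)
        os.remove(src)
```

The trade-off is that the fallback is not atomic: a crash between the copy and the delete leaves both files in place, which is usually preferable to losing the data.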
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702 From noreply at sourceforge.net Sat Mar 8 17:06:44 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Sat Mar 8 19:58:48 2003 Subject: [Spambayes] [ spambayes-Bugs-700165 ] MoveFileEx doesn't exist on Win98 Message-ID: Bugs item #700165, was opened at 2003-03-08 19:53 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Tim Peters (tim_one) Assigned to: Mark Hammond (mhammond) Summary: MoveFileEx doesn't exist on Win98 Initial Comment: After a CVS up, Outlook craps out on Win98SE now in BayesManager._MigrateFile. File "C:\Code\spambayes\Outlook2000\manager.py", line 213, in _MigrateFile win32con.MOVEFILE_COPY_ALLOWED) pywintypes.error: (120, 'MoveFileEx', 'This function is only valid in Win32 mode.') which really seems to mean that MoveFileEx isn't supported at or before Win98. ---------------------------------------------------------------------- >Comment By: Tim Peters (tim_one) Date: 2003-03-08 20:06 Message: Logged In: YES user_id=31435 I checked in a patch to Outlook2000/manager.py, rev1.54, which worked for me on Win98. If you're happy with this, just close the bug. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702 From popiel at wolfskeep.com Sat Mar 8 17:23:14 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Sat Mar 8 20:23:18 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message from Tim Stone - Four Stones Expressions References: Message-ID: <20030309012314.73C832DE92@cashew.wolfskeep.com> In message: writes: >3/8/2003 3:18:55 PM, Paul Moore wrote: > >>> >>> Ok, I'm off my soapbox. This has been a great discussion. >> >>Can I borrow that box for a moment? Thanks... :-) > >I yield the floor. Okay, I'll grab the box for a moment... >>If you liken the spambayes >>approach to an anti-virus strategy, it suddenly looks much better :-) > >Hmmm... interesting analog, but it only goes so far. Viruses would be a >vastly smaller threat had microsoft engaged in the strategy that I'm arguing >for. Trojans, worms, etc... the face of the online world would be >considerably different had they invested in building fundamentally secure >systems... To build a fundamentally secure system, though, we'd be replacing SMTP with something that actively prevented impersonation and forgery, as well as possibly providing a provable audit trail back to original sender, along with their identity. We're not coming even close to that... so I think that the anti-virus analogy is quite appropriate. We're layering a band-aid on top of a fundamentally insecure system, and patching any leaks as we hear about them. Microsoft is not to blame for all the worms and trojans. Microsoft is merely the juiciest target at the moment. Do recall that the first worm to make headline news (the Morris worm back in 1988) targetted VAX and Sun 3 systems through sendmail vulnerabilities. I could rant for a while that it is human nature to build weak systems and again human nature to abuse such systems... but that's not a particularly useful thread for the spambayes list. 
- Alex From tim at fourstonesExpressions.com Sat Mar 8 19:35:35 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 8 20:35:43 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030309012314.73C832DE92@cashew.wolfskeep.com> Message-ID: 3/8/2003 7:23:14 PM, "T. Alexander Popiel" wrote: >In message: > writes: >>3/8/2003 3:18:55 PM, Paul Moore wrote: >> >>>> >>>> Ok, I'm off my soapbox. This has been a great discussion. >>> >>>Can I borrow that box for a moment? Thanks... :-) >> >>I yield the floor. > >Okay, I'll grab the box for a moment... > >>>If you liken the spambayes >>>approach to an anti-virus strategy, it suddenly looks much better :-) >> >>Hmmm... interesting analog, but it only goes so far. Viruses would be a >>vastly smaller threat had microsoft engaged in the strategy that I'm arguing >>for. Trojans, worms, etc... the face of the online world would be >>considerably different had they invested in building fundamentally secure >>systems... > >To build a fundamentally secure system, though, we'd be replacing >SMTP with something that actively prevented impersonation and >forgery, as well as possibly providing a provable audit trail back >to original sender, along with their identity. We're not coming >even close to that... so I think that the anti-virus analogy is >quite appropriate. We're layering a band-aid on top of a >fundamentally insecure system, and patching any leaks as we hear >about them. All good, interesting points, but we're not talking about building a secure system here. We're just thinking about a couple of alternative going forward strategies for our project. One alternative is to actively try to find ways that spammers can get through our filter and plug those holes before the spammers find them. The other is to wait until a significant amount of spam is pouring through the hole, then plug the hole in a much more testable, provable manner. 
The first has the strength of potentially keeping users happier, but the weakness of not having a strong corpus of evolved spam to test against, so the effectiveness of changes to the tokenizer is not necessarily provable. The second has the strength of provability, and the weakness of our software potentially appearing to be deficient. This strategy, which we seem to be converging on , bears resemblance (imo) to microsoft's "wait till a hacker trashes the webserver, figure out how they did it, and post a patch" strategy. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim.one at comcast.net Sat Mar 8 21:56:07 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 21:56:39 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <97746185C05Z4WKJO17232KGFCZWOM.3e6a0b61@myst> Message-ID: [Tim Stone] > Ok, so let me summarize what I think our discussion has boiled down to. > > 1. We will not make changes that regress our results on existing spam. There are two error rates, and an unsure rate, and they're all important. I'm afraid that when someone sees a spam and suggests a gimmick to nail it, they forget that it's also going to penalize some ham, and affect the unsure rate too. It's just human nature to fixate on potential benefits and discount potential costs. The point of statistical testing is to look at all the effects. A change that's a pure win on all counts has become exceedingly hard to come up with. > 2. We will engage in ongoing analysis of spam, keeping our > testing corpora up to date as best we can. When significant (we have yet to > define significant) amounts of FN start happening, we will adjust the > tokenizer appropriately. Or bad trends in FP or Unsure, and provided someone can dream up a gimmick that addresses the problem du jour without damaging the things they're *not* thinking about more than helping the thing they are thinking about. > Point 1 is a given. 
There seems to be considerable inertia in > the project toward using point 2 as an ongoing strategy. I watch my spam, ham and unsures closely, and check in a change whenever there's an identifiable screwup. For example, that's how the treatment of embedded nonsense HTML tags got repaired a while ago, and very recently is how unclosed HTML start-comment tags stopped being a problem. I'm not seeing any loss of effectiveness in my own email, though, and it's true I don't spend any time dreaming up ways to defeat the system. So long as spam uses the language and artifacts of advertising, and the tokenizer sees those, it will be damned hard to get spam thru reliably -- and it will be hard to get solicited commercial email thru too (it's still the case that the first time or two I get a desired email from a given online business, it rates Unsure or even as Spam -- it depends on how obnoxious it is). Exceptions raised by the email pkg now appear to be the easiest approach to hiding msg content from this particular system, and if I were a spammer that's what I'd concentrate on. Python allows very easy ways to catch exceptions, though, so it's not something I'm frightened of -- we've added alternative processing paths for email exceptions before, and we can add more. There's a systematic spambayes codebase problem, though, in that people call the email pkg parsing functions directly, and that prevents centralizing workarounds for pkg weaknesses that get discovered. > I can live with it, because there's tremendous value in what we're doing, > and it really does work. I just have to say, though, that from a marketing > viewpoint (believe it or not, I was a marketer in a former life), this > strategy can potentially shoot us in the foot, because we aren't the ones > finding problems, spammers are, I've seen no evidence that they're finding anything to exploit here, and doubt this particular project is popular enough for them to target. 
Most spam damaged enough to make the email pkg complain appears to me to be due to spammer incompetence, or to bugs in the software they're using to generate the spam. If you want to see something break, give it to a 2-year old <0.9 wink>. At the moment, I have a grand total of one spam from my personal email that still breaks the system (causes an email BoundaryError exception that the Outlook client doesn't protect itself against), and that's it, out of tens of thousands. I got that email last December, and haven't gotten another like; I conclude it's evidence of a spammer who didn't know what they were doing. I confess I haven't fixed this bug, since it turned out to be a one-shot thing and there are so many other things demanding my time. Fixing a bug I don't expect to see again just doesn't rate high enough to get done. > and I think this could cause our users to lose faith in our product. "I > trained this stuff as spam, and this thing STILL doesn't catch it." That irritation can occur even when the system is working perfectly, alas. The flip side is that the lack of special cases to *force* classification as one thing or another also makes it impossible to attack such a subsystem: "preponderance of evidence" is the only way to get a score out of the system. > If that happens to a user more than a few times, the conclusion will be > that it doesn't work. I'm telling you, it doesn't take but one bad article > in a ZD publication, and it's all over with for us. OTOH, one good article in a ZD publication would kill us with newbie support requests too <0.5 wink>. From tim.one at comcast.net Sat Mar 8 21:56:08 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 21:56:43 2003 Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes In-Reply-To: <15978.10381.283357.950737@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... 
> For those who haven't investigated all the mysteries of the BerkeleyDB > package, it comes with a number of command-line programs which manipulate > the database in various ways: > > db_archive db_deadlock db_load db_recover db_upgrade > db_checkpoint db_dump db_printlog db_stat db_verify > > You can read all about them at > > http://www.sleepycat.com/docs/utility/index.html > > Does anyone know if the Windows distribution of Python comes with these > utilities? It doesn't. > If not, it probably should. db_dump, db_load, db_upgrade > db_verify and db_recover are particularly useful. Enhancing the Windows installer is a "spare time" thing for me now, and I don't have any. IOW, fine by me, but I won't be doing the work. From tim.one at comcast.net Sat Mar 8 21:57:52 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 21:58:23 2003 Subject: [Spambayes] Eliminating duplicates from mbox file In-Reply-To: <15978.0.395098.109027@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I come before the bar asking humbly for forgiveness... I was > doing all this from my ~/tmp directory, which, lo and behold, had a > version of mailbox.py dating from September 2001. The _Mailbox class > had next() but not __iter__. Who knows what other semantic differences > existed. Not me. Maybe this relates to your problems with From lines too? > Sorry for the wasted bandwidth. I haven't trained on msgs from you as spam, so don't sweat it . From tim.one at comcast.net Sat Mar 8 22:05:55 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 22:06:25 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <200303080851.h288pMv16800@localhost.localdomain> Message-ID: [Tim Peters] >> At the time I got yanked from this project, I was looking to remove code >> rather than add more. There are too many tokenization options >> already, and it isn't clear that some of them do anyone any good >> anymore. The gary_combining classifier scheme should also go away. 
[Anthony Baxter] > I was wondering about that last time I was trying to get some new graphs > for the SB website. Does anyone have any real objections to this > going away? Last person I knew was using it was Sean True, but that was last year. I flipped between gary_ and chi_ combining a lot myself last year too, until gaining more confidence in the latter. > If not, I'll kill it all on monday (I'll put a Last_Gary tag on > the version before the code removal). Bless you! As TimS said, we should also nuke the options specific to it. From tim.one at comcast.net Sat Mar 8 22:15:10 2003 From: tim.one at comcast.net (Tim Peters) Date: Sat Mar 8 22:15:40 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <5LLGA6NM65TPWZV214285M1U622.3e6a07e9@myst> Message-ID: [Tim Stone] > ... > I think we should get rid of any related options, too: > use_gary_combining and use_chi_squared_combining. Agreed. > Perhaps this would be a good time to make > experimental_ham_spam_imbalance_adjustment permanent? There haven't been enough test reports on that one to decide. It's True by default in the Outlook client, but still appears to be False by default everywhere else. There are bad visible effects either way (if it's off and you get a large ratio imbalance, it's too easy for a msg to score incorrectly as belonging to the more popular category; if it's on and you get a large ratio imbalance, training on another example from the more popular category has little effect, exacerbating (for example) the "but I trained on it and it's *still* called ham!" irritation). From nas at python.ca Sat Mar 8 20:14:47 2003 From: nas at python.ca (Neil Schemenauer) Date: Sat Mar 8 23:05:11 2003 Subject: [Spambayes] full o' spaces In-Reply-To: References: <5LLGA6NM65TPWZV214285M1U622.3e6a07e9@myst> Message-ID: <20030309041447.GA17672@glacier.arctrix.com> Tim Peters wrote: > > Perhaps this would be a good time to make > > experimental_ham_spam_imbalance_adjustment permanent? 
> > There haven't been enough test reports on that one to decide. How do I test it? Neil From tim at fourstonesExpressions.com Sat Mar 8 22:47:05 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 8 23:47:17 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: 3/8/2003 9:15:10 PM, Tim Peters wrote: >There haven't been enough test reports on that one to decide. It's True by >default in the Outlook client, but still appears to be False by default >everywhere else. There are bad visible effects either way (if it's off and >you get a large ratio imbalance, it's too easy for a msg to score >incorrectly as belonging to the more popular category; if it's on and you >get a large ratio imbalance, training on another example from the more >popular category has little effect, exacerbating (for example) the "but I >trained on it and it's *still* called ham!" irritation). Rats. I thought it was True by default. All this time I've been using it thinking it was on... ok, so if I turn it on now, what would I expect? I have a huge ham/spam imbalance in my notes sb database, and have been a bit disappointed by the classifier... > > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From bill at parducci.net Sat Mar 8 22:13:35 2003 From: bill at parducci.net (bill parducci) Date: Sun Mar 9 01:13:39 2003 Subject: [Spambayes] spaced out spam Message-ID: <3E6ADB8F.7070709@parducci.net> well it looks like spacemania is catching on... b -------- Original Message -------- Subject: none Date: Fri, 7 Mar 2003 02:40:14 GMT From: mcgough U N I V E R S I T Y D I P L O M A S O b t a i n a p r o s p e r o u s f u t u r e , m o n e y e a r n i n g p o w e r , a n d t h e a d m i r a t i o n o f a l l . D i p l o m a s f r o m p r e s t i g i o u s , n o n - a c c r e d i t e d u n i v e r s i t i e s b a s e d o n y o u r p r e s e n t k n o w l e d g e a n d l i f e e x p e r i e n c e . 
N o r e q u i r e d t e s t s, c l a s s e s , b o o k s , o r i n t e r v i e w s . B a c h e l o r s , m a s t e r s , M B A , a n d d o c t o r a t e ( P h D ) d i p l o m a s a v a i l a b l e i n t h e f i e l d o f y o u r c h o i c e . N o o n e i s t u r n e d d o w n . C o n f i d e n t i a l i t y a s s u r e d . C A L L N O W t o r e c e i v e y o u r d i p l o m a w i t h i n d a y s ! ! ! 1-817-740-5673 C a l l 2 4 h o u r s a d a y , 7 d a y s a w e e k , i n c l u d i n g S u n d a y s a n d h o l i d a y s . From tim_one at email.msn.com Sun Mar 9 01:54:10 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sun Mar 9 01:54:51 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030309041447.GA17672@glacier.arctrix.com> Message-ID: [Neil Schemenauer, asking about experimental_ham_spam_imbalance_adjustment] > How do I test it? One run with True, and another with False. If you have the same # of ham and spam in your training data, it shouldn't make any difference. If you have an imbalance, it will, and then the question is which setting gives better results. I'm not keen on people who don't already have an imbalance artificially creating one, though -- for example, I think mistake-based manual training is likely to create imbalance, and that's likely to have different characteristics than imbalance forced via picking random subsets. From tim_one at email.msn.com Sun Mar 9 02:14:22 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sun Mar 9 02:15:00 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <20030309012314.73C832DE92@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > Do recall that the first worm to make headline news (the Morris worm > back in 1988) targetted VAX and Sun 3 systems through sendmail > vulnerabilities. It's curious that current sendmail holes were the hottest security topic this week, 15 years later, and that the holes were created by "security code". Makes me glad I sleep with a loaded gun . 
From tim_one at email.msn.com Sun Mar 9 02:13:56 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sun Mar 9 02:15:13 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: [Tim Peters] > There haven't been enough test reports on that one to decide. > It's True by default in the Outlook client, but still appears to be > False by default everywhere else. There are bad visible effects > either way (if it's off and you get a large ratio imbalance, it's too > easy for a msg to score incorrectly as belonging to the more popular > category; if it's on and you get a large ratio imbalance, training on > another example from the more popular category has little effect, > exacerbating (for example) the "but I trained on it and it's *still* > called ham!" irritation). [Tim Stone] > Rats. I thought it was True by default. It is if you're using the Outlook client. > All this time I've been using it thinking it was on... ok, so if I > turn it on now, what would I expect? Did you read the paragraph you quoted? I've written several small essays on the topic here, and think the parenthetical comments above are a decent summary. > I have a huge ham/spam imbalance in my notes sb database, Striving for balance is likely a better idea. > and have been a bit disappointed by the classifier... I'm short on telepathy tonight. Perhaps the *way* in which you're disappointed is related to the comments above? For example, if you have much more ham than spam and have a too-high FN rate, or you have much more spam than ham and have a too-high FP rate, then the comments are directly applicable. From tim_one at email.msn.com Sun Mar 9 02:22:32 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sun Mar 9 02:23:10 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: <20030308164347.GA16439@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > I'll take the blame. I think neither Skip nor Tim explicitly said it > was a good idea. Thanks for testing. 
Testing is always a good thing, but I don't get the umbrage and blame thing
here: *most* ideas turn out to add no value -- and always have, and likely
always will.  Bytes/word didn't help last time I tried 'em either, and that
idea was better than *most* because it didn't hurt either <0.1 wink>.

From tim_one at email.msn.com Sun Mar 9 02:45:44 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar 9 02:46:27 2003
Subject: [Spambayes] full o' spaces
In-Reply-To:
Message-ID:

[Tim Stone]
> ...
> One alternative is to actively try to find ways that spammers can get
> through our filter and plug those holes before the spammers find them.

Instead of arguing about this more, how about we try it once?

I'll note that we have no defense against the "white on white" HTML hiding
trick, but also that that trick hasn't been effective against my personal
classifier (the one spam of that kind I've seen rate solidly Unsure for me
lucked into hiding a news story about the DC-area snipers, after I had
trained on many msgs from friends and relatives also corresponding about
that topic at the time).

Hiding *all* the text in a .gif or .jpg on the Web merely linked to within
the email seemed like a very good trick at the start, but seems ineffective
now too -- there's nothing in the body then to offset spammish clues in the
headers.

Jeremy and Guido were both recipients of cunning spam this system couldn't
stop: the spam took the form of replies to msgs they posted to public
mailing lists, reproducing their original subject line and quotes from the
bodies of their msgs.  This guaranteed they contained lots of words that
were hammy to them, and also fooled the content-based whitelist boosts
python.org added to its SpamAssassin installation.  That's the cleverest
attack I've seen, but it happened last year and I haven't heard of it
happening again.
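
For the letter-spaced text that gave this thread its name, one experiment of the "try it once" kind could be a pass that rejoins spaced-out words before tokenizing. What follows is only a minimal sketch of that idea, not anything in the spambayes tokenizer: the function name, the regular expression, and the threshold of four characters are all assumptions made for illustration, and whether such a pass helps or hurts would still have to be shown by testing.

```python
import re

# A run of at least four single characters, each separated from the next by
# exactly one space, e.g. "V I A G R A".  The threshold of four is an
# arbitrary choice for this sketch.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){3,}\S(?!\S)")


def collapse_spaced_words(text):
    """Join runs of space-separated single characters into one token.

    Hypothetical helper: ordinary words are left alone because they contain
    adjacent non-space characters and so never form a run.
    """
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
```

A run only ends where the single-space pattern breaks, so if a message separates words with one space as well, whole phrases collapse into a single token. Short legitimate sequences such as "a 1 2 3" are also joined, which is one reason the effect on ham would need measuring.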
From tim_one at email.msn.com Sun Mar 9 03:14:21 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sun Mar 9 03:15:00 2003 Subject: [Spambayes] spaced out spam Message-ID: [bill parducci] > well it looks like spacemania is catching on... > ... This is actually the same spam, word for word & space for space, that started the "full o' spaces" thread, here: http://mail.python.org/pipermail/spambayes/2003-March/003806.html Skip later reported that running an up-to-date classifier nailed it as spam despite the absence of body clues: http://mail.python.org/pipermail/spambayes/2003-March/003834.html I think that last report was also a bit suspicious, though, as the clue listing appeared to contain hapaxes unique to the msg being scored (suggesting that the msg had already been trained on as spam); e.g., 'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84; > Subject: none > Date: Fri, 7 Mar 2003 02:40:14 GMT > From: mcgough > > > U N I V E R S I T Y D I P L O M A S > > O b t a i n a p r o s p e r o u s f u t u r e , m o n e y e a > r n i n g p o w e r , a n d > t h e a d m i r a t i o n o f a l l . From tim at fourstonesExpressions.com Sun Mar 9 07:37:32 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun Mar 9 08:37:38 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: 3/9/2003 1:13:56 AM, "Tim Peters" wrote: >I'm short on telepathy tonight. Perhaps the *way* in which you're >disappointed is related to the comments above? For example, if you have >much more ham than spam and have a too-high FN rate, or you have much more >spam than ham and have a too-high FP rate, then the comments are directly >applicable. Can I plead nocturnal insanity? Maybe it was all the housecleaning fluid fumes... Ok, I train on virtually every piece of mail that comes into my notes inbox. the ratio is about 10:1 spam:ham. I currently have about 600 spam trained into the database. I still get maybe 10%-15% unsure, invariably on spam. 
I virtually never have a FP. Maybe I just need to adjust the spam cutoff... Mainly thinking out loud, and bemoaning the fact that I've annoyed my namesake. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From wsy at merl.com Sun Mar 9 08:48:53 2003 From: wsy at merl.com (Bill Yerazunis) Date: Sun Mar 9 08:49:29 2003 Subject: [Spambayes] spaced out spam In-Reply-To: <3E6ADB8F.7070709@parducci.net> (message from bill parducci on Sat, 08 Mar 2003 22:13:35 -0800) References: <3E6ADB8F.7070709@parducci.net> Message-ID: <200303091348.h29DmrC15652@localhost.localdomain> From: bill parducci well it looks like spacemania is catching on... Subject: none Date: Fri, 7 Mar 2003 02:40:14 GMT From: mcgough U N I V E R S I T Y D I P L O M A S Nothing we haven't seen before, with hypertextus interruptus. SBPH feature generation has no trouble with this, as the features of the wildcarded phrase: BUY YOUR GENUINE VIAGRA ONLINE NOW are just as significant and unique as: V I A G R A (in fact, probably the latter is moreso, as nobody I know would likely use those particular letters spaced that way. If they were going to say "viagra" they'd just say it. -Bill Yerazunis From skip at pobox.com Sun Mar 9 08:25:45 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Mar 9 09:25:50 2003 Subject: [Spambayes] full o' spaces In-Reply-To: References: Message-ID: <15979.20201.258754.620902@montanaro.dyndns.org> Tim> Jeremy and Guido were both recipients of cunning spam this system Tim> couldn't stop: the spam took the form of replies to msgs they Tim> posted to public mailing lists, reproducing their original subject Tim> line and a quotes from the bodies of their msgs.... That's the Tim> cleverest attack I've seen, but it happened last year and I haven't Tim> heard of it happening again. Perhaps the cost to create such spam outweighs the potential benefit. You have to maintain a fair amount of information about the people you want to spam. 
In addition, it's not at all obvious that the people who post to public mailing lists and newsgroups: * cover the list of candidate spam recipients very well, or * that they are the sorts of people who would be scammed by bigger manhood or MLM come-ons. Maybe it was just a test by a spammer which returned a negative result and was thus abandoned. Skip From skip at pobox.com Sun Mar 9 08:28:55 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Mar 9 09:30:23 2003 Subject: [Spambayes] spaced out spam In-Reply-To: References: Message-ID: <15979.20391.307592.911910@montanaro.dyndns.org> Tim> Skip later reported that running an up-to-date classifier nailed it Tim> as spam despite the absence of body clues: Tim> http://mail.python.org/pipermail/spambayes/2003-March/003834.html Tim> I think that last report was also a bit suspicious, though, as the Tim> clue listing appeared to contain hapaxes unique to the msg being Tim> scored (suggesting that the msg had already been trained on as Tim> spam); e.g., Tim> 'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84; Well, yes. It was reported as unsure. As is my normal practice, I saved it to my spam collection and trained on it later. That doesn't negate all the other clues which were originally missing. I believe I explained that I was mixing apples and oranges, comparing the debug header info generated on one (out-of-date) machine with the classification header generated on my (much more up-to-date) laptop. Skip From skip at pobox.com Sun Mar 9 08:38:23 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Mar 9 09:38:24 2003 Subject: [Spambayes] full o' spaces In-Reply-To: References: Message-ID: <15979.20959.993297.864816@montanaro.dyndns.org> Tim> Ok, I train on virtually every piece of mail that comes into my Tim> notes inbox. the ratio is about 10:1 spam:ham. I currently have Tim> about 600 spam trained into the database. I still get maybe Tim> 10%-15% unsure, invariably on spam. I virtually never have a FP. 
Tim> Maybe I just need to adjust the spam cutoff... Mainly thinking out
Tim> loud, and bemoaning the fact that I've annoyed my namesake.

Tim, I know your Notes environment may not allow this, but I do a couple
things to minimize the number of duplicate postings that ever get
considered.  At the very start of my .procmailrc file I remove messages with
a message-id I've seen recently:

    # make sure we don't get two copies of the same message
    :0 Wh: msgid.lock
    | $FORMAIL -D 16384 $HOME/tmp/msgid.cache

Later, after a message has been determined to be spam, I run my loose
checksum script and dump the message if it looks the same as a previous
spam:

    :0
    * ^X-Spambayes-Classification: spam
    {
        ### this recipe gobbles items with matching body checksums (taken
        ### loosely to try and avoid obvious tricks)
        :0 W: cksum.lock
        | $PYCKSUM -v $HOME/tmp/cksum.cache

        :0:
        $SPAM
    }

If I didn't take these steps I'm sure I'd get more spam (and probably see
more mistakes).

Since building my initial large training set, I have generally only trained
on mistakes and unsures.  Accordingly, I have about 12,000 saved hams and
7,000 saved spams.  If the code changes I retrain completely, but generally
only retrain on new messages.

I think either of these techniques (message-id caching and loose checksums)
could be incorporated into pop3proxy without much effort.

Maybe you could use something like the script I posted the other day to
remove duplicates from your collection and bring your spam:ham ratio into
something closer to 1:1.

Skip

From bill at parducci.net Sun Mar 9 06:58:03 2003
From: bill at parducci.net (bill parducci)
Date: Sun Mar 9 09:58:07 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15979.20959.993297.864816@montanaro.dyndns.org>
References: <15979.20959.993297.864816@montanaro.dyndns.org>
Message-ID: <3E6B567B.7080503@parducci.net>

Skip Montanaro wrote:
[...]
> Maybe you could use something like the script I posted the other day to
> remove duplicates from your collection and bring your spam:ham ratio into
> something closer to 1:1.

is there a query that can be run to see what the current ratio of trained
messages is?

thanks

b

From skip at pobox.com Sun Mar 9 09:51:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar 9 10:52:13 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <3E6B567B.7080503@parducci.net>
References: <15979.20959.993297.864816@montanaro.dyndns.org> <3E6B567B.7080503@parducci.net>
Message-ID: <15979.25357.466002.698085@montanaro.dyndns.org>

>> Maybe you could use something like the script I posted the other day
>> to remove duplicates from your collection and bring your spam:ham
>> ratio into something closer to 1:1.

bill> is there a query that can be run to see what the current ratio of
bill> trained messages is?

I use mbox-formatted files, so it's fairly easy on Unix-like systems:

    % egrep '^From ' newham.clean.save | wc -l
    11870
    % egrep '^From ' newspam.clean.save | wc -l
    6994

Skip

From nas at python.ca Sun Mar 9 09:58:16 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sun Mar 9 12:48:37 2003
Subject: [Spambayes] full o' spaces
In-Reply-To:
References: <20030309041447.GA17672@glacier.arctrix.com>
Message-ID: <20030309175816.GA19182@glacier.arctrix.com>

Tim Peters wrote:
> One run with True, and another with False. If you have the same # of ham
> and spam in your training data, it shouldn't make any difference.

Okay, I tested with a natural imbalance. Looks like it doesn't hurt or help
me.
out/unbalanced-bases.txt -> out/unbalanced-adjusts.txt
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> tested 547 hams & 389 spams against 2188 hams & 1556 spams

false positive percentages
    0.731  0.731  tied
    0.366  0.366  tied
    0.183  0.548  lost  +199.45%
    0.183  0.183  tied
    0.183  0.183  tied

won   0 times
tied  4 times
lost  1 times

total unique fp went from 9 to 11 lost +22.22%
mean fp % went from 0.329067641682 to 0.402193784278 lost +22.22%

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.257  0.257  tied

won   0 times
tied  5 times
lost  0 times

total unique fn went from 1 to 1 tied
mean fn % went from 0.051413881748 to 0.051413881748 tied

ham mean                     ham sdev
   2.61    2.94  +12.64%       11.66   12.40   +6.35%
   2.66    2.94  +10.53%       11.20   11.87   +5.98%
   2.42    2.71  +11.98%       11.25   12.20   +8.44%
   1.78    2.00  +12.36%        9.05    9.81   +8.40%
   1.92    2.15  +11.98%        9.00    9.68   +7.56%

ham mean and sdev for all runs
   2.28    2.55  +11.84%       10.50   11.26   +7.24%

spam mean                    spam sdev
  99.56   99.63   +0.07%        3.29    2.50  -24.01%
  99.22   99.30   +0.08%        5.03    4.68   -6.96%
  99.63   99.68   +0.05%        2.82    2.55   -9.57%
  99.46   99.55   +0.09%        3.96    3.20  -19.19%
  99.17   99.22   +0.05%        6.41    6.12   -4.52%

spam mean and sdev for all runs
  99.41   99.48   +0.07%        4.50    4.06   -9.78%

ham/spam mean difference: 97.13 96.93 -0.20

547 389

From nas at python.ca Sun Mar 9 12:08:09 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sun Mar 9 14:58:30 2003
Subject: [Spambayes] better Received header tokens
Message-ID:
<20030309200808.GA19398@glacier.arctrix.com>

I wasted some time today trying to improve the mine_received_headers
option. The goal was to generate fewer, more useful tokens. Also, I
wanted to be resistant to received header forgery. For the sake of
posterity, here's what I came up with:

    # Fragment from inside the tokenizer's generator; needs "import re".
    ippat = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    received_re = re.compile(r"from .*\b(%s)[)\]].*\b"
                             r"by (\S+)\s+([^;]*)" % ippat, re.M|re.S)

    hops = 0
    network = None
    for hdr in msg.get_all("received", []):
        m = received_re.search(hdr)
        if m:
            ip = m.group(1)
            # Keep only the first two octets, i.e. the /16 network.
            n = '.'.join(ip.split('.')[:2])
            if n != network:
                # A new network counts as another hop.
                hops += 1
                network = n
                yield 'received:%d:%s' % (hops, network)
    yield 'received:%d' % hops

I expected this to do better than the current code. Testing shows
otherwise. Perhaps using a more specific or more general network
(instead of /16) would help.

Neil

From dan at tobias.name Sun Mar 9 14:39:53 2003
From: dan at tobias.name (Daniel R. Tobias)
Date: Sun Mar 9 15:03:09 2003
Subject: [Spambayes] SpamBayes
Message-ID: <3E6B9889.4030207@tobias.name>

I'm just trying out the SpamBayes proxy software now; seems like a very
good idea. However, I've had some problems with the program tending to
have a total nervous breakdown any time its data structure gets in any
way different from what it expects, like the database index getting
corrupted due to a system crash, or some of the inbound messages getting
deleted by a virus scanning program during processing.

It seems you haven't programmed any sort of graceful recovery when any
data file becomes missing or corrupted, but just crash the script
altogether. Once, the proxy wouldn't even start at all due to some data
error and I had to wipe its data files out entirely and start over.
Other times (like when the virus scanner kills a message in between it
being downloaded and being reviewed for ham/spam training purposes), a
message will show in the list of messages to review, but when you try to
do anything with it, the program crashes.
This process needs improving to reach a level of robustness needed to use in a production environment rather than just for testing and experimentation purposes. -- == Dan == Dan's Web Tips: http://webtips.dan.info/ Dan's Domain Site: http://domains.dan.info/ From noreply at sourceforge.net Sun Mar 9 00:33:31 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Sun Mar 9 15:03:15 2003 Subject: [Spambayes] [ spambayes-Bugs-700165 ] MoveFileEx doesn't exist on Win98 Message-ID: Bugs item #700165, was opened at 2003-03-09 11:53 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702 Category: Outlook Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Tim Peters (tim_one) Assigned to: Mark Hammond (mhammond) Summary: MoveFileEx doesn't exist on Win98 Initial Comment: After a CVS up, Outlook craps out on Win98SE now in BayesManager._MigrateFile. File "C:\Code\spambayes\Outlook2000\manager.py", line 213, in _MigrateFile win32con.MOVEFILE_COPY_ALLOWED) pywintypes.error: (120, 'MoveFileEx', 'This function is only valid in Win32 mode.') which really seems to mean that MoveFileEx isn't supported at or before Win98. ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-09 19:33 Message: Logged In: YES user_id=14198 Thanks! ---------------------------------------------------------------------- Comment By: Tim Peters (tim_one) Date: 2003-03-09 12:06 Message: Logged In: YES user_id=31435 I checked in a patch to Outlook2000/manager.py, rev1.54, which worked for me on Win98. If you're happy with this, just close the bug. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702 From tim.one at comcast.net Sun Mar 9 15:44:21 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Mar 9 15:45:12 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: [Tim Stone] > Can I plead nocturnal insanity? Maybe it was all the housecleaning fluid > fumes... Only if you mainlined them . > Ok, I train on virtually every piece of mail that comes into my > notes inbox. the ratio is about 10:1 spam:ham. I currently have about > 600 spam trained into the database. Implying that you've trained on a total of about 60 ham? If so, that's very light training (for this system). > I still get maybe 10%-15% unsure, invariably on spam. I virtually > never have a FP. Peculiar! Try turning on the experimental imbalance adjustment just to see what happens. I don't expect it will help, but I wouldn't have expected the outcome you're getting either. > Maybe I just need to adjust the spam cutoff... Can't guess from here. > Mainly thinking out loud, and bemoaning the fact that I've annoyed my > namesake. Na, acting irritated is just fun for Tims . From tim.one at comcast.net Sun Mar 9 15:49:32 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Mar 9 15:50:04 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <15979.20201.258754.620902@montanaro.dyndns.org> Message-ID: [Tim] > Jeremy and Guido were both recipients of cunning spam this system > couldn't stop [as replies to their postings] [Skip Montanaro] > ... > Maybe it was just a test by a spammer which returned a negative result > and was thus abandoned. That's what I figured -- any form of targeting spam adds expense. Targeting posters to tech mailing lists has got to be close to a zero-response approach. 
From nas at python.ca Sun Mar 9 13:00:11 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sun Mar 9 15:50:30 2003
Subject: [Spambayes] Integration with qmail?
In-Reply-To:
References:
Message-ID: <20030309210011.GA19599@glacier.arctrix.com>

Martinez, Michael - CSREES/ISTM wrote:
> I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers
> would be appreciated.

See http://arctrix.com/nas/qmail/spambayes/ . The code is still a bit
rough and the instructions were hastily written. The cool part about
the system is that it should be suitable for deployment at the mail
server level. Users don't need to do anything and should not have to
worry about legitimate email being rejected. Obviously it does not
perform quite as well as a personal filter but it is much better than
no filter at all.

Neil

From tim.one at comcast.net Sun Mar 9 16:14:55 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar 9 16:15:28 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030309175816.GA19182@glacier.arctrix.com>
Message-ID:

[Neil Schemenauer, tests the experimental imbalance adjustment]
> Okay, I tested with a natural imbalance. Looks like it doesn't hurt or
> help me.
>
> out/unbalanced-bases.txt -> out/unbalanced-adjusts.txt
> -> tested 547 hams & 389 spams against 2188 hams & 1556 spams
> ...

This is a very mild imbalance, so I don't expect much change. The option
was introduced when people reported imbalance ratios close to 20; yours is
under 1.5. Since you have more ham than spam, without the adjustment the
spamprob of a ham word can get closer to 0 than the spamprob of a spam word
can get to 1, effectively giving ham words more strength than spam words.
The effect of the adjustment is to make ham words "less hammy", which
should tend to reduce FN and increase FP. The larger the imbalance ratio,
the more pronounced these effects should be.
> false positive percentages
>     0.731  0.731  tied
>     0.366  0.366  tied
>     0.183  0.548  lost  +199.45%
>     0.183  0.183  tied
>     0.183  0.183  tied
>
> won   0 times
> tied  4 times
> lost  1 times
>
> total unique fp went from 9 to 11 lost   +22.22%
> mean fp % went from 0.329067641682 to 0.402193784278 lost   +22.22%
>
> false negative percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.257  0.257  tied
>
> won   0 times
> tied  5 times
> lost  0 times
>
> total unique fn went from 1 to 1 tied
> mean fn % went from 0.051413881748 to 0.051413881748 tied
>
> ham mean                     ham sdev
>    2.61    2.94  +12.64%       11.66   12.40   +6.35%
>    2.66    2.94  +10.53%       11.20   11.87   +5.98%
>    2.42    2.71  +11.98%       11.25   12.20   +8.44%
>    1.78    2.00  +12.36%        9.05    9.81   +8.40%
>    1.92    2.15  +11.98%        9.00    9.68   +7.56%
>
> ham mean and sdev for all runs
>    2.28    2.55  +11.84%       10.50   11.26   +7.24%
>
> spam mean                    spam sdev
>   99.56   99.63   +0.07%        3.29    2.50  -24.01%
>   99.22   99.30   +0.08%        5.03    4.68   -6.96%
>   99.63   99.68   +0.05%        2.82    2.55   -9.57%
>   99.46   99.55   +0.09%        3.96    3.20  -19.19%
>   99.17   99.22   +0.05%        6.41    6.12   -4.52%
>
> spam mean and sdev for all runs
>   99.41   99.48   +0.07%        4.50    4.06   -9.78%

Since words look "less hammy" after the adjustment, an increase in both
means is expected, and the appearance of ham words in spam doesn't yank
down the spam scores as much so a decrease in spam sdev is also expected.
OTOH, the ham words in ham are also less hammy after adjustment, so ham
scores are expected to spread more (-> increase in ham sdev).

So the changes were all qualitatively expected, and overall didn't make a
real bottom-line difference. Imbalance this mild isn't what the gimmick
was aiming at, though -- it was aimed at stopping disastrous embarrassments
for people with extreme training ratios.

Thank you for trying it!
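
The imbalance adjustment discussed in the message above can be sketched roughly as follows. This is a simplified illustration and not necessarily the project's exact classifier code: the function name `spamprob`, the prior strength `S = 0.45`, and the prior probability `X = 0.5` are assumptions made for the example. The idea shown is that the count for the over-represented class is scaled down before it is used as the weight of evidence, so a ham-only word ends up a little further from 0 when ham outnumbers spam.

```python
def spamprob(hamcount, spamcount, nham, nspam, adjust=False, S=0.45, X=0.5):
    """Estimate one token's spam probability (illustrative sketch only).

    hamcount/spamcount: trained messages of each kind containing the token.
    nham/nspam: total trained messages of each kind.
    S, X: assumed prior strength and prior probability for unknown words.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    if hamratio + spamratio == 0:
        return X  # never-seen token: fall back to the prior
    prob = spamratio / (hamratio + spamratio)

    if adjust:
        # Scale down the evidence count of whichever class is larger.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = hamcount * spam2ham + spamcount * ham2spam

    # Blend the observed probability with the prior, weighted by n.
    return (S * X + n * prob) / (S + n)


# A word seen in 100 hams and no spams, using the training counts
# from the test run quoted above (2188 hams, 1556 spams).
plain = spamprob(100, 0, nham=2188, nspam=1556)
adjusted = spamprob(100, 0, nham=2188, nspam=1556, adjust=True)
print(plain, adjusted)
```

With more ham than spam, only ham evidence is discounted in this sketch; a spam-only word scores the same with or without the adjustment, which matches the "less hammy" description.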
From tim at fourstonesExpressions.com Sun Mar 9 15:23:02 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun Mar 9 16:23:10 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: 3/9/2003 2:44:21 PM, Tim Peters wrote: >Implying that you've trained on a total of about 60 ham? If so, that's very >light training (for this system). Yeah... but that's the ratio I get. My notes inbox is not heavily used for legitimate mail, but the mail that IS there is extremely important. > >> I still get maybe 10%-15% unsure, invariably on spam. I virtually >> never have a FP. > >Peculiar! Try turning on the experimental imbalance adjustment just to see >what happens. I don't expect it will help, but I wouldn't have expected the >outcome you're getting either. I'm going to play with this one, and with the spamcutoff as well. I'm also going to do some clues investigation, which will be a bit of a trick because there's no place in notes that I can store a 'header'... More at 11... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim.one at comcast.net Sun Mar 9 16:30:42 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Mar 9 16:31:14 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: <3E69D562.5861.18EF176D@localhost> Message-ID: [Stephen Anderson] > Okay Tim, I just can't take it anymore. My curiosity has gotten > the best of me. Would you please ask your sisters to email me a > sample of one of their very pretty HTML emails you keep referring > to. Nope -- if they sent email to people they didn't grow up with, they'd get a spam problem. They have no presence in CyberSpace -- not even google can find them . > I have a sister too, but her HTML emails are almost indistinguishable > in presentation from that of a plain-text one. So, can you help me > with my burning question: Just what does pretty email look like? 
It helps if your sister is an artist, can use image and sound manipulation
programs, and doesn't pay much attention to copyright notices on web sites.
One of my sisters even taught herself (just) enough about Java to reuse
Java applets, specifying different parameters.

A pretty email is a coordinated combination of sound and color or images,
and sometimes animation. At the guts-of-the-HTML level, it has a lot in
common with fancy spam. At the human level, though, it's pleasing (or even
poignant) instead of obnoxious. I don't know how to automate telling the
difference.

When MSN first started, there used to be a lot of that on their proprietary
newsgroups too. Dialup speed pretty much killed it (along with MSN's
attempts to sell proprietary "rich" content).

From skip at pobox.com Sun Mar 9 15:35:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar 9 16:35:08 2003
Subject: [Spambayes] SpamBayes
In-Reply-To: <3E6B9889.4030207@tobias.name>
References: <3E6B9889.4030207@tobias.name>
Message-ID: <15979.45963.260615.349902@montanaro.dyndns.org>

    Dan> It seems you haven't programmed any sort of graceful recovery when
    Dan> any data file becomes missing or corrupted, but just crash the
    Dan> script altogether.

In the face of a corrupt database, all I think we could do is toss out the
database and work with no clues. Every message would wind up "unsure" with
a score of 0.50. Is that acceptable to you?

    Dan> Once, the proxy wouldn't even start at all due to some data error
    Dan> and I had to wipe its data files out entirely and start over.

I think that's about all it could do automatically.

    Dan> Other times (like when the virus scanner kills a message in between
    Dan> it being downloaded and being reviewed for ham/spam training
    Dan> purposes), a message will show in the list of messages to review,
    Dan> but when you try to do anything with it, the program crashes.

I think this could be more easily recovered from.
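
A minimal sketch of the "toss out the database and work with no clues" fallback described above, assuming a pickle-backed token store. The name `load_or_reset` and the `.corrupt` suffix are hypothetical and are not part of SpamBayes; the real storage layer may work quite differently.

```python
import os
import pickle


def load_or_reset(path):
    """Return the stored token dict, or an empty dict if the file is unusable.

    A missing file is treated as a fresh start. An unreadable file is moved
    aside (not deleted) so it can still be inspected later.
    """
    try:
        with open(path, "rb") as f:
            data = pickle.load(f)
        if not isinstance(data, dict):
            raise ValueError("unexpected database contents")
        return data
    except FileNotFoundError:
        return {}
    except Exception:
        # Unpickling damaged data can fail in many different ways, so any
        # error here is treated as "corrupt": keep a copy and start over.
        os.replace(path, path + ".corrupt")
        return {}
```

Starting from an empty store means every message scores as unsure until the user retrains, which is the trade-off Skip describes.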
Dan> This process needs improving to reach a level of robustness needed Dan> to use in a production environment rather than just for testing and Dan> experimentation purposes. Granted. Skip From lists at morpheus.demon.co.uk Sun Mar 9 21:41:43 2003 From: lists at morpheus.demon.co.uk (Paul Moore) Date: Sun Mar 9 16:50:05 2003 Subject: [Spambayes] full o' spaces References: <20030309175816.GA19182@glacier.arctrix.com> Message-ID: Tim Peters writes: > This is a very mild imbalance, so I don't expect much change. The option > was introduced when people reported imbalance ratios close to 20; yours is > under 1.5. I have a spam:ham imbalance of 10:1 in my current database. However, my available corpus is pretty small - under 200 ham and 3500 spam (I've been retaining spams for a while now, but my saved ham is basically just mails I found interesting enough to archive). And that corpus is what I trained on, in any case. So I don't have any test data. And I don't really understand how to run tests with what I do have :-( I'd happily run some tests on this database, if you could give me some details on how to go about it. Paul. -- This signature intentionally left blank From tim at fourstonesExpressions.com Sun Mar 9 16:04:07 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun Mar 9 17:04:13 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: Message-ID: 3/9/2003 3:30:42 PM, Tim Peters wrote: >[Stephen Anderson] >> Okay Tim, I just can't take it anymore. My curiosity has gotten >> the best of me. Would you please ask your sisters to email me a >> sample of one of their very pretty HTML emails you keep referring >> to. > >Nope -- if they sent email to people they didn't grow up with, they'd get a >spam problem. They have no presence in CyberSpace -- not even google can >find them . Saaaaaaaayyyy.... so all this stuff about needing to be easy enough for your sisters was just so much smoke? 
c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From T.A.Meyer at massey.ac.nz Mon Mar 10 12:16:24 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar 9 18:17:06 2003
Subject: [Spambayes] Outlook Express integration
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD7A@its-xchg4.massey.ac.nz>

> I've managed to get Spambayes working with Outlook Express,
> but it isn't pretty.

Well, nor is OE :)

> Basically I've changed the hammie_header_name to 'To', so OE
> can filter on
> it. A few minor mods to pop3proxy.py were required because
> there's usually another 'To' header present.

I note from the wiki that you're using a2. Spambayes can now add
classification information in the subject line (add
"pop3proxy_notate_subject: True" to your config file). This should avoid
having to use the 'To' header, with all the inherent problems.

> I personally think the HTML interface is OK for training, but
> I can see the obvious attraction of an integrated solution as
> offered by the Outlook plug-in.

There are two (main) problems. One is that integration into clients is a
*lot* of work, and there are a *lot* of clients around. The other is that
OE is such a limited client that just about any integration is either
impossible, or even more work.

> Would you be so kind as to offer some suggestions on how I
> could improve this?

Sure:
1. Get the latest CVS. (I'm thinking that it's time for a3, anyway,
   especially once the gary_ stuff is gone).
2. Try using pop3proxy_notate_subject - you'll have to rewrite your
   rules, but it should work better.
3. Try using the smtpproxy (see below).
4. Send your comments & ideas back to the list :)

SMTPProxy (maybe something should be added to the docs?)
--------------------------------------------------------
This is an alternative method for training, which really needs evaluation.
Setup is just like pop3proxy - go to http://localhost:8880 and in the
options put your normal smtp server(s) and port(s).
In OE, set the outgoing SMTP server to localhost. You can now train by forwarding/bouncing mail to special addresses - these default to spambayes_ham@localhost and spambayes_spam@localhost, but you can set them to whatever you like (smtpproxy_ham_address and smtpproxy_spam_address in your config file). No need to regularly go to the config web page at all. Note that the smtpproxy is, by default, not active - you'll need to configure it. Note also, that since OE doesn't treat headers nicely, you'll need to set pop3proxy_add_mailid_to to "body". Finally, to start the smtpproxy, add "-s" when you start pop3proxy (i.e. "pop3proxy -s", if you don't use any other options). Hope this helps :) =Tony Meyer From T.A.Meyer at massey.ac.nz Mon Mar 10 12:19:06 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Sun Mar 9 18:19:42 2003 Subject: [Spambayes] Headers and pop3proxy Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C892@its-xchg4.massey.ac.nz> > Is there an easy way (perhaps a parameter in > bayescustomize.ini) to get pop3proxy to add a header giving the > spam probability score, as well as > the one classifying the message as ham/unsure/spam? As far as I can tell, no. This would be very simple to add, though. Quick poll from the list: do I provide this as a patch, or check it in? =Tony Meyer From T.A.Meyer at massey.ac.nz Mon Mar 10 12:26:32 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Sun Mar 9 18:27:06 2003 Subject: [Spambayes] full o' spaces Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C893@its-xchg4.massey.ac.nz> > this is not at all to say > that this will be the case here but as new ideas are bandied > about, i posit that it is a good idea to make sure that > previously discarded methodologies be reexamined periodically. I would absolutely agree with this. 
To grab the box for a minute and add my 2c to the discussion about being
reactive or proactive:

I think that we should be as proactive as possible in trying to find new
ways to tag mail that distinguish spam & ham - like the bytes/word count,
and so on. But I don't think these should be checked in, unless they do
demonstrate that they make a difference. The important thing is to code
them, and test them, and note those tests & code so that later on, (when,
for example, white space spam is really common), we can be as quickly
reactive as possible, just grabbing code from the archive, re-testing it
and deploying it.

Along with this, it would be great if every now and then, some of these
rejected ideas were retested against the current code and current
ham/spam. Plus, of course, testing the odd idea that is in the code that
might not still need to be there.

Just my thoughts...

=Tony Meyer

From mhammond at skippinet.com.au Mon Mar 10 10:30:20 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Mar 9 18:31:27 2003
Subject: [Spambayes] Headers and pop3proxy
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C892@its-xchg4.massey.ac.nz>
Message-ID:

> Quick poll from the list: do I provide this as a patch, or check it in?

If you consider it "safe", the impact would be restricted to the single
application, and you are currently actively maintaining that application,
then go for it!

Mark.

From mhammond at skippinet.com.au Mon Mar 10 10:42:43 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Mar 9 18:43:48 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID:

Here are my current results on the imbalance option. Interestingly, my
initial "-n2" results looked better than my "-n 5" results below.

FWIW, Outlook users should remember there is an "export.py" script in the
addin directory.
This will export your ham and spam into the "spambayes\testdata\Data"
directory, which is the default place the test tools look. Just run this
from the command line.

And for everyone else, once you have a "Data" directory, running the tests
means:

* Create testtools\bayescustomize.ini with the options you want to test
* run "timtest.py -n 2 > result1.txt"
* run "rates result1.txt" - this creates "result1s.txt"
* Repeat the above, changing the options, and redirecting to "result2.txt"
  and getting "result2s.txt" as final output.
* Run "cmp.py result1s.txt result2s.txt"

Well - if it *doesn't* mean that, then you can ignore my results too .

My results below are for "-n 5".

Mark.

\temp\imbalance_falses.txt -> \temp\imbalance_trues.txt
-> tested 412 hams & 1004 spams against 429 hams & 1019 spams
-> tested 440 hams & 1076 spams against 429 hams & 1019 spams
-> tested 397 hams & 1054 spams against 429 hams & 1019 spams
-> tested 477 hams & 1056 spams against 429 hams & 1019 spams
-> tested 429 hams & 1019 spams against 412 hams & 1004 spams
-> tested 440 hams & 1076 spams against 412 hams & 1004 spams
-> tested 397 hams & 1054 spams against 412 hams & 1004 spams
-> tested 477 hams & 1056 spams against 412 hams & 1004 spams
-> tested 429 hams & 1019 spams against 440 hams & 1076 spams
-> tested 412 hams & 1004 spams against 440 hams & 1076 spams
-> tested 397 hams & 1054 spams against 440 hams & 1076 spams
-> tested 477 hams & 1056 spams against 440 hams & 1076 spams
-> tested 429 hams & 1019 spams against 397 hams & 1054 spams
-> tested 412 hams & 1004 spams against 397 hams & 1054 spams
-> tested 440 hams & 1076 spams against 397 hams & 1054 spams
-> tested 477 hams & 1056 spams against 397 hams & 1054 spams
-> tested 429 hams & 1019 spams against 477 hams & 1056 spams
-> tested 412 hams & 1004 spams against 477 hams & 1056 spams
-> tested 440 hams & 1076 spams against 477 hams & 1056 spams
-> tested 397 hams & 1054 spams against 477 hams & 1056 spams
->
tested 412 hams & 1004 spams against 429 hams & 1019 spams -> tested 440 hams & 1076 spams against 429 hams & 1019 spams -> tested 397 hams & 1054 spams against 429 hams & 1019 spams -> tested 477 hams & 1056 spams against 429 hams & 1019 spams -> tested 429 hams & 1019 spams against 412 hams & 1004 spams -> tested 440 hams & 1076 spams against 412 hams & 1004 spams -> tested 397 hams & 1054 spams against 412 hams & 1004 spams -> tested 477 hams & 1056 spams against 412 hams & 1004 spams -> tested 429 hams & 1019 spams against 440 hams & 1076 spams -> tested 412 hams & 1004 spams against 440 hams & 1076 spams -> tested 397 hams & 1054 spams against 440 hams & 1076 spams -> tested 477 hams & 1056 spams against 440 hams & 1076 spams -> tested 429 hams & 1019 spams against 397 hams & 1054 spams -> tested 412 hams & 1004 spams against 397 hams & 1054 spams -> tested 440 hams & 1076 spams against 397 hams & 1054 spams -> tested 477 hams & 1056 spams against 397 hams & 1054 spams -> tested 429 hams & 1019 spams against 477 hams & 1056 spams -> tested 412 hams & 1004 spams against 477 hams & 1056 spams -> tested 440 hams & 1076 spams against 477 hams & 1056 spams -> tested 397 hams & 1054 spams against 477 hams & 1056 spams false positive percentages 1.699 1.214 won -28.55% 0.909 0.682 won -24.97% 1.008 0.756 won -25.00% 0.210 0.210 tied 0.932 0.699 won -25.00% 0.682 0.227 won -66.72% 1.008 0.504 won -50.00% 0.000 0.000 tied 0.466 0.233 won -50.00% 0.243 0.243 tied 1.259 0.504 won -59.97% 0.210 0.000 won -100.00% 0.699 0.466 won -33.33% 1.456 0.728 won -50.00% 1.818 1.591 won -12.49% 0.839 0.210 won -74.97% 0.466 0.233 won -50.00% 0.728 0.485 won -33.38% 0.455 0.227 won -50.11% 1.259 0.756 won -39.95% won 17 times tied 3 times lost 0 times total unique fp went from 40 to 26 won -35.00% mean fp % went from 0.817290959648 to 0.49835280855 won -39.02% false negative percentages 0.398 0.498 lost +25.13% 0.093 0.186 lost +100.00% 0.380 0.474 lost +24.74% 0.189 0.189 tied 0.294 
0.294 tied 0.000 0.372 lost +(was 0) 0.190 0.285 lost +50.00% 0.379 0.568 lost +49.87% 0.491 0.883 lost +79.84% 0.896 1.195 lost +33.37% 0.664 1.139 lost +71.54% 0.189 0.379 lost +100.53% 0.294 0.393 lost +33.67% 0.498 0.697 lost +39.96% 0.093 0.093 tied 0.189 0.379 lost +100.53% 0.687 1.374 lost +100.00% 1.195 1.295 lost +8.37% 0.651 0.929 lost +42.70% 0.474 0.664 lost +40.08% won 0 times tied 3 times lost 17 times total unique fn went from 44 to 66 lost +50.00% mean fn % went from 0.412283315133 to 0.614303438288 lost +49.00% ham mean ham sdev 3.82 2.97 -22.25% 14.77 12.72 -13.88% 3.31 2.42 -26.89% 13.21 10.90 -17.49% 3.57 2.69 -24.65% 13.66 11.26 -17.57% 4.26 3.20 -24.88% 15.98 13.14 -17.77% 3.37 2.65 -21.36% 14.07 12.03 -14.50% ham mean and sdev for all runs 3.67 2.79 -23.98% 14.38 12.05 -16.20% spam mean spam sdev 98.10 96.94 -1.18% 8.44 10.32 +22.27% 97.83 96.47 -1.39% 9.10 11.49 +26.26% 97.63 96.24 -1.42% 10.29 12.59 +22.35% 98.13 96.83 -1.32% 8.23 10.58 +28.55% 96.93 95.49 -1.49% 11.94 14.13 +18.34% spam mean and sdev for all runs 97.72 96.40 -1.35% 9.71 11.91 +22.66% ham/spam mean difference: 94.05 93.61 -0.44 From tim at fourstonesExpressions.com Sun Mar 9 17:47:48 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sun Mar 9 18:47:53 2003 Subject: [Spambayes] Headers and pop3proxy In-Reply-To: Message-ID: 3/9/2003 5:30:20 PM, "Mark Hammond" wrote: >> Quick poll from the list: do I provide this as a patch, or check it in? > Check it in, dude! 
c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From T.A.Meyer at massey.ac.nz Mon Mar 10 12:55:44 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar 9 18:57:33 2003
Subject: [Spambayes] Headers and pop3proxy
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD7B@its-xchg4.massey.ac.nz>

> > Is there an easy way (perhaps a parameter in
> > bayescustomize.ini) to get pop3proxy to add a header giving the
> > spam probability score, as well as
> > the one classifying the message as ham/unsure/spam?

Ok, new answer: yes (with the latest CVS). Set pop3proxy_include_prob to
"True" in your config file.

Note: This is such a simple patch that I can't see how it would break
anything, *but*, I have not tested anything apart from that it is off by
default (so no change for everyone), and that it works if turned on with
my rather vanilla system.

It currently changes the header from "X-Spambayes-Classification: Spam"
(or whatever) to "X-Spambayes-Classification: Spam, .953246327" (or
whatever). If people would like it in a separate header, or formatted (to
2 decimal places for example), let me know.

=Tony Meyer

From francois.granger at free.fr Mon Mar 10 01:15:41 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sun Mar 9 19:15:47 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C893@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C893@its-xchg4.massey.ac.nz>
Message-ID:

At 12:26 +1300 10/03/2003, in message RE: [Spambayes] full o' spaces,
Meyer, Tony wrote:
> > this is not at all to say
>> that this will be the case here but as new ideas are bandied
>> about, i posit that it is a good idea to make sure that
>> previously discarded methodologies be reexamined periodically.
>
>I would absolutely agree with this.
To grab the box for a minute
>and add my 2c to the discussion about being reactive or proactive:
>
> But I don't think these should be checked in, unless they do
>demonstrate that they make a difference. The important thing is to
>code them, and test them, and note those tests & code so that later
>on, (when, for example, white space spam is really common), we can
>be as quickly reactive as possible, just grabbing code from the
>archive, re-testing it and deploying it.

This brings up the idea of creating a kind of "plugin" concept for adding
or removing rules?

--
Hofstadter's Law : It always takes longer than you expect, even when you
take into account Hofstadter's Law.

From T.A.Meyer at massey.ac.nz Mon Mar 10 13:18:34 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar 9 19:19:15 2003
Subject: [Spambayes] full o' spaces
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C897@its-xchg4.massey.ac.nz>

[Tony]
> > But I don't think these should be checked in, unless they do
> >demonstrate that they make a difference. The important thing is to
> >code them, and test them, and note those tests & code so that later
> >on, (when, for example, white space spam is really common), we can
> >be as quickly reactive as possible, just grabbing code from the
> >archive, re-testing it and deploying it.

[Francois]
> This brings up the idea of creating a kind of "plugin" concept for adding
> or removing rules?

Oooh. I hadn't thought of that, but I do like it. Not as a release type
tool, but definitely as a debug type one. I wonder how this could be done
in a simple, non-bloat kind of way.

=Tony Meyer

From popiel at wolfskeep.com Sun Mar 9 18:29:08 2003
From: popiel at wolfskeep.com (T.
Alexander Popiel) Date: Sun Mar 9 21:29:14 2003 Subject: [Spambayes] Re: [Spambayes-checkins] spambayes/testtools timtest.py,1.2,1.3 In-Reply-To: Message from "Tony Meyer" References: Message-ID: <20030310022908.0883D2DE80@cashew.wolfskeep.com> In message: "Tony Meyer" writes: > >Modified Files: > timtest.py >Log Message: >Mangle path for those without spambayes in pythonpath, like Alex's >mod of testcv. Heh. Glad people thought it was a good idea. ;-) - Alex From tim.one at comcast.net Sun Mar 9 21:31:43 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Mar 9 21:32:20 2003 Subject: [Spambayes] Bytes/words ratio In-Reply-To: Message-ID: [Tim Stone] > Saaaaaaaayyyy.... so all this stuff about needing to be easy > enough for your sisters was just so much smoke? Heh -- you're confusing me with the "ease of use" people. The only effect this project will have on my siblings is in whether their email gets unjustly blocked by someone *else* as spam. Toward avoiding that outcome, I don't want to penalize HTML mail just for using HTML. I expect it's not possible to make this (or any other visible) system easy enough for them to *use*-- themselves --with Outlook Express. If they had spam problems (which they don't), I'd urge them to switch to Outlook and use Mark's spiffy addin. I'm pretty sure they could use that one (one sister on her own, the other with some long-distance phone coaching). From popiel at wolfskeep.com Sun Mar 9 18:37:09 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sun Mar 9 21:37:13 2003 Subject: [Spambayes] better Received header tokens In-Reply-To: Message from Neil Schemenauer <20030309200808.GA19398@glacier.arctrix.com> References: <20030309200808.GA19398@glacier.arctrix.com> Message-ID: <20030310023710.F11BA2DE80@cashew.wolfskeep.com> In message: <20030309200808.GA19398@glacier.arctrix.com> Neil Schemenauer writes: >I wasted some time today trying to improve the mine_received_headers >option. 
The goal was to generate fewer, more useful tokens. Also,
>I wanted to be resistant to received header forgery.
[...]
>I expected this to do better than the current code. Testing shows
>otherwise. Perhaps using a more specific or more general network
>(instead of /16) would help.

Something that has occurred to me recently: how many tokens does it
take to significantly change the scores? Most of the recent tokenizing
experiments have been adding between one and a handful of tokens, or
even reducing token count. Perhaps our problem is not that the
identification methods we're coming up with are bad (heck, Tim did
indicate that the bytes/word token _was_ a strong indicator... I
didn't look at the values for the token itself), but rather that
these new methods of identification are getting drowned out in the
noise.

Perhaps we should figure out some way to give metatokens extra weight
in the combining calculations? I'm afraid that I don't have a strong
enough math background to know how to do this.

Alternately, we could drop the limit on the number of tokens looked at
from 150 back down to around 20...

- Alex

From tim.one at comcast.net Sun Mar 9 22:13:40 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar 9 22:14:16 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <20030310023710.F11BA2DE80@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> Something that has occurred to me recently: how many tokens does it
> take to significantly change the scores? Most of the recent tokenizing
> experiments have been adding between one and a handful of tokens, or
> even reducing token count. Perhaps our problem is not that the
> identification methods we're coming up with are bad (heck, Tim did
> indicate that the bytes/word token _was_ a strong indicator... I
> didn't look at the values for the token itself), but rather that
> these new methods of identification are getting drowned out in the
> noise.

Oddly, I doubt it matters.
The median ham score is near 0, and the median spam score near 100, so most messages are very solidly at one end. When a new token is added, it's not going to have any substantial effect on those, it's going to affect Unsures, and msgs near the Unsure cutoffs. One token is enough to swing a msg near a boundary to the other side. Note that strong indicators aren't necessarily *good* indicators, either: if they're strongly correlated with other strong indicators, a bad decision is easy to get. That's why we strip HTML decorations, for example. For another, about the only spam I see rate unsure anymore is stuff that leaks thru SpamAssassin via python.org. spambayes *usually* wouldn't have any trouble with such spam on its own, but there are a dozen header clues all effectively saying "this came from python.org" then, and those are all strong ham clues (thanks to SpamAssassin's usual effectiveness). However, they're really all the same clue, and the system has no way to realize that; treating them as a dozen distinct clues gives them way more credence than they deserve. From T.A.Meyer at massey.ac.nz Mon Mar 10 16:21:09 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Sun Mar 9 22:21:57 2003 Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C898@its-xchg4.massey.ac.nz> Hmm...I created the data set, and then did these: > * Create testtools\bayescustomize.ini with the options you > want to test > * run "timtest.py -n 2 > result1.txt" > * run "rates result1.txt" - this creates "result1s.txt" > * Repeat the above, changing the options, and redirecting to > "result2.txt" and getting "result2s.txt" as final output. But when I did this: > * Run "cmp.py result1s.txt result2s.txt" cmp.py gave me lots of errors, because the lines were not what was expected. My results docs started with a copy of the options, so I dumped those, but then it had trouble with everything else as well. 
The docs do have nice histograms, but cmp.py doesn't give me what it gave Mark :) Advice, please? Thanks, Tony Meyer From tim.one at comcast.net Sun Mar 9 22:39:54 2003 From: tim.one at comcast.net (Tim Peters) Date: Sun Mar 9 22:40:32 2003 Subject: [Spambayes] FW: Mhammond, Intelligent antispam IER software In-Reply-To: <002101c2e498$8cd2eab0$530f8490@eden> Message-ID: [Mark Hammond] > I had to share this irony :) > > I received this spam, selling anti-spam software! I was a little > dissapointed that spambayes scored it as only a "maybe". Whereas when I got the same spam, I was disappointed to see it scored as spam! I like checking out the competition . > So I checked the clues - the top 6 ham clues were: > > word spamprob #ham #spam > '*H*' 0.0438937 - - > '*S*' 0.78226 - - Those two lines imply the overall score was in the high 80s -- do you have your spam cutoff set to 90? (Mine is at 80, BTW -- but then I still look at every new spam every day, and have no fear of FP) -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 1028 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20030309/e7fb97d7/winmail.bin From skip at pobox.com Sun Mar 9 21:51:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Mar 9 22:51:40 2003 Subject: [Spambayes] better Received header tokens In-Reply-To: <20030310023710.F11BA2DE80@cashew.wolfskeep.com> References: <20030309200808.GA19398@glacier.arctrix.com> <20030310023710.F11BA2DE80@cashew.wolfskeep.com> Message-ID: <15980.3010.940085.926858@montanaro.dyndns.org> Alex> Alternately, we could drop the limit on the number of tokens Alex> looked at from 150 back down to around 20... I look at all those tokens as many different ways for a message to exonerate or incriminate itself. 
If the various meta-tokens provide five (just to pick a number out of thin air) more-or-less independent ways to say, "this looks like spam", it's less likely that a spammer will successfully figure out how to circumvent all five schemes. The only positive effect I can imagine is improved performance of the classifier, which would generally be drowned out by either Python startup costs or networking overhead. Skip From skip at pobox.com Sun Mar 9 21:55:41 2003 From: skip at pobox.com (Skip Montanaro) Date: Sun Mar 9 22:55:45 2003 Subject: [Spambayes] better Received header tokens In-Reply-To: References: <20030310023710.F11BA2DE80@cashew.wolfskeep.com> Message-ID: <15980.3261.634515.784710@montanaro.dyndns.org> Tim> For another, about the only spam I see rate unsure anymore is stuff Tim> that leaks thru SpamAssassin via python.org. spambayes *usually* Tim> wouldn't have any trouble with such spam on its own, but there are Tim> a dozen header clues all effectively saying "this came from Tim> python.org" .... That's correct when considering the rather narrow Python email universe, but I suspect most people live in a somewhat more diverse electronic world than that, so the python.org effect won't be quite as strong in the normal case. Skip From anthony at interlink.com.au Mon Mar 10 16:12:12 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Mon Mar 10 00:13:06 2003 Subject: [Spambayes] full o' spaces In-Reply-To: Message-ID: <200303100512.h2A5CCn08173@localhost.localdomain> >>> Paul Moore wrote > 5. Er. But it's a bit rough around the edges still. I'll help you > install it, if you like. > > Notice (5). That's what is killing us right now with real people (me, > I'm a figment of your imagination: be very afraid ). Anything > else is minor. Speaking of which, what happened to that alpha-2 release? I've got Wednesday off, and can work on it then... -- Anthony Baxter It's never too late to have a happy childhood. 
From anthony at interlink.com.au Mon Mar 10 16:19:37 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Mon Mar 10 00:20:27 2003 Subject: [Spambayes] full o' spaces In-Reply-To: <200303100512.h2A5CCn08173@localhost.localdomain> Message-ID: <200303100519.h2A5JbR08230@localhost.localdomain> >>> Anthony Baxter wrote > > >>> Paul Moore wrote > > 5. Er. But it's a bit rough around the edges still. I'll help you > > install it, if you like. > > > > Notice (5). That's what is killing us right now with real people (me, > > I'm a figment of your imagination: be very afraid ). Anything > > else is minor. > > Speaking of which, what happened to that alpha-2 release? > > I've got Wednesday off, and can work on it then... Hm. I didn't look closely enough - it's there, but the website's not been updated. Are we at a point where another release is useful, or should I update the website to point to the current -a2 release? -- Anthony Baxter It's never too late to have a happy childhood. From nas at python.ca Sun Mar 9 21:39:06 2003 From: nas at python.ca (Neil Schemenauer) Date: Mon Mar 10 00:29:27 2003 Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C898@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C898@its-xchg4.massey.ac.nz> Message-ID: <20030310053906.GA20786@glacier.arctrix.com> Meyer, Tony wrote: > cmp.py gave me lots of errors, because the lines were not what was > expected. I'm guessing you ran "rates.py test > tests". rates.py creates its own output file and writes something a little different to stdout. cmp.py can't understand the stdout data. 
Neil

From Paul.Moore at atosorigin.com Mon Mar 10 09:08:05 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Mon Mar 10 04:09:30 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619A0A@UKDCX001.uk.int.atosorigin.com>

[Spam marked with a score of 0%, but with clear spam status in the clues]

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
>> I'm wondering if the problem has anything to do with the fact that the
>> spam field is set before the message is moved.
> Further, when you see this behaviour, can you immediately check the
> Pythonwin debug window for a message? Each message processed should have a
> message that indicates its spam disposition - the first thing I need to know
> is if such mails fire this debug trace.

Actually, the only time I've seen it happen since that message is when the plugin is doing "catchup" when I first start Outlook in the morning, and it goes and processes a load of messages (400-odd this morning) from the server. I checked the traceutil output, and the message with a 0% score is not in there.

The message has a spam property of 0%, and the clues show 100% spam:

Spam Score: 1
word          spamprob   #ham  #spam
'*H*'         0             -      -
'*S*'         1             -      -
'damages'     0.019311    566      8
'austin,'     0.184822     13      2
'related'     0.205474    263     50
'such'        0.221215    879    184
'these'       0.321268   1464    511
'skip:c 10'   0.339304   1975    748
'for'         0.347324   6260   2457
'makes'       0.347573    410    161
'skip:f 10'   0.359639    681    282
'box'         0.37294     504    221
'list'        0.374491   2498   1103
[rest omitted...]

Paul.

From joe at rockymountains.net Sun Mar 9 20:59:08 2003
From: joe at rockymountains.net (Joseph Conrad)
Date: Mon Mar 10 07:45:01 2003
Subject: [Spambayes] Confused
Message-ID: <3E6C0D8C.4030004@rockymountains.net>

Spambayes,

I have a system running the pop3proxy, it's amazingly accurate with very little training. I would like to integrate it into our postfix SMTP server as an incoming filter. It looks simple enough, I already run virus scanning.
The thing I am not getting is hammiesrv, when I try to run it I get:

AttributeError: 'module' object has no attribute 'DEFAULTDB'

I have looked at the documentation, there's really not much more than a quick mention of hammiesrv. I'm not at all familiar with python, but I suspect that if someone could drop me a hint I could take it from there.

Thanks,
Joseph Conrad

From tim at fourstonesExpressions.com Mon Mar 10 08:11:40 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 09:11:46 2003
Subject: [Spambayes] SpamBayes
In-Reply-To: <15979.45963.260615.349902@montanaro.dyndns.org>
Message-ID:

3/9/2003 3:35:07 PM, Skip Montanaro wrote:
>
> Dan> This process needs improving to reach a level of robustness needed
> Dan> to use in a production environment rather than just for testing and
> Dan> experimentation purposes.
> D.R.Evans had a database corruption similar to this a while back.

This is going to be an ongoing problem. I believe we should append records to a log file each time a message is trained. The spamcount and hamcount (at least) should be logged, so it can be recovered. Perhaps even the tokens being trained should be logged, but that might make the log quite large...

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From skip at pobox.com Mon Mar 10 10:27:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 10 11:28:00 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <15980.48391.655560.225683@montanaro.dyndns.org>

I've got a few people here at Northwestern set up with Spambayes now. Classification is being done by me on the server, not by the users on their desktops. I was just chatting with a couple of the admins here who commented that SpamAssassin's X-Spam-Level header is nice because you can tell users to just add or delete a star from their Eudora filter to fine-tune the break between spam and ham.
That might be a bit weird with Spambayes since it's a three-state system, but I think it might be useful to add an X-Spambayes-Level header where the number of stars is equal to int(score*10). I control the ham and spam cutoffs, and thus the inclusion of the words "ham", "unsure" and "spam", but this would make it easy for people to filter on a score basis in their mail client. Sort of a fine-tuning knob.

or-a-fake-thermostat-ly, y'rs,

Skip

From tim at fourstonesExpressions.com Mon Mar 10 11:04:58 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 12:05:04 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15980.48391.655560.225683@montanaro.dyndns.org>
Message-ID:

3/10/2003 10:27:51 AM, Skip Montanaro wrote:
>I've got a few people here at Northwestern set up with Spambayes now.
>Classification is being done by me on the server, not by the users on their
>desktops. I was just chatting with a couple of the admins here who
>commented that SpamAssassin's X-Spam-Level header is nice because you can
>tell users to just add or delete a star from their Eudora filter to
>fine-tune the break between spam and ham.

Funny, I was just thinking about the same thing today. There was a request for the pop3proxy to do this a couple months back. Never made it as a feature request, but I remember it. Seems like a reasonable thing to do.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From tim.one at comcast.net Mon Mar 10 12:15:11 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 12:15:53 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <15980.3261.634515.784710@montanaro.dyndns.org>
Message-ID:

[Tim]
> For another, about the only spam I see rate unsure anymore is stuff
> that leaks thru SpamAssassin via python.org.
spambayes *usually*
> wouldn't have any trouble with such spam on its own, but there are
> a dozen header clues all effectively saying "this came from
> python.org" ....

[Skip Montanaro]
> That's correct when considering the rather narrow Python email
> universe, but I suspect most people live in a somewhat more diverse
> electronic world than that, so the python.org effect won't be quite as
> strong in the normal case.

It was an example of harmful correlation, by way of illustrating why a strong indicator isn't necessarily a desirable indicator. This particular example applies pretty directly to any source from which a user rarely (but not never) gets spam, and leaves clues about itself.

From wsy at merl.com Mon Mar 10 12:32:21 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Mon Mar 10 12:32:55 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15980.48391.655560.225683@montanaro.dyndns.org> (message from Skip Montanaro on Mon, 10 Mar 2003 10:27:51 -0600)
References: <15980.48391.655560.225683@montanaro.dyndns.org>
Message-ID: <200303101732.h2AHWLL19489@localhost.localdomain>

   From: Skip Montanaro

   Classification is being done by me on the server, not by the users on their desktops. I was just chatting with a couple of the admins here who commented that SpamAssassin's X-Spam-Level header is nice because you can tell users to just add or delete a star from their Eudora filter to fine-tune the break between spam and ham.

   That might be a bit weird with Spambayes since it's a three-state system, but I think it might be useful to add an X-Spambayes-Level header where the number of stars is equal to int(score*10). I control the ham and spam cutoffs, and thus the inclusion of the words "ham", "unsure" and "spam", but this would make it easy for people to filter on a score basis in their mail client. Sort of a fine-tuning knob.
   or-a-fake-thermostat-ly, y'rs,

I've also had multiple requests for a continuous output match parameter in CRM114, so I settled on this:

    pR = - (log (Pspam) - log (Pnonspam))

This goes from roughly +350 to -350, and (nicely) the uncertains and errors all seem to group around +/- 100. 90%+ of the messages come out either > 200 or < -200, so it's an effective human-understood representation.

I know the CAMRAM people wanted it pretty badly; expect them to start using it soon.

(it's called pR for the same reason pH is called pH - it's the negative log of the ratios of the match probabilities, just like pH is the negative log of the ion ratios.)

-Bill Yerazunis

From skip at pobox.com Mon Mar 10 11:47:20 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 10 12:47:29 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To:
References: <15980.3261.634515.784710@montanaro.dyndns.org>
Message-ID: <15980.53160.87499.28570@montanaro.dyndns.org>

Tim> It was an example of harmful correlation, by way of illustrating
Tim> why a strong indicator isn't necessarily a desirable indicator.
Tim> This particular example applies pretty directly to any source from
Tim> which a user rarely (but not never) gets spam, and leaves clues
Tim> about itself.

True enough. I'm sure there are lots of such correlations. But if a person's incoming mail isn't dominated by one source, such harmful correlations will have less impact on the final score of any given message, right?

As an example, I just grep'd my ham collection for the Sender field, squashed case, sorted and uniq'd, then sorted again.
The tail end looked like

     150 sender: folkmusic-admin@grassyhill.org
     221 sender: zope-admin@zope.org
     255 sender: folk music presenters
     450 sender: spambayes-bounces@python.org
     550 sender: python-checkins-admin@python.org
     555 sender: owner-6pack@autox.team.net
     688 sender: python-dev-admin@python.org
     821 sender: spamassassin-talk-admin@lists.sourceforge.net
    1387 sender: cedu-admin@manatee.mojam.com
    3091 sender: python-list-admin@python.org

This is out of 9609 Sender headers (just under 12,000 hams). If I remember comments you've made on this topic in the past, I expect your Sender: headers to be more strongly dominated by Python-related messages than this.

Just the presence of a Sender header irregardless of where it came from seems to be a pretty strong ham clue (something spammers could/do exploit?). My roughly 7,000 spams only have 759 Sender headers. I haven't experimented with adding it to Options.options.address_headers, but your comment in tokenizer.py suggests this probably wouldn't be too wise.

Skip

From phil.west at gtri.gatech.edu Mon Mar 10 13:41:00 2003
From: phil.west at gtri.gatech.edu (Phil West)
Date: Mon Mar 10 13:55:04 2003
Subject: [Spambayes] Problem installing SpamBayes-Outlook on outlook xp [Unable to register spambayes_addin.dll]
Message-ID: <462E202877E3D54AADAF076E175B60A91ADF69@mail.elsys-exchange.elsys>

Hi:

I'm running Outlook 2002 on a win2k pro machine, when I start my python 2.2 IDE it sez: PythonWin 2.2.1 (#34, Apr 9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32.

When I run the SpamBayes-Outlook-Setup.exe program, it encounters the error:

As one would expect, neither retry nor Ignore yield a working installation. Any pointers on how to resolve this would be appreciated.

Thanks,
Phil

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: image/bmp Size: 346006 bytes Desc: Outlook.bmp Url : http://mail.python.org/pipermail/spambayes/attachments/20030310/b31b0bac/attachment-0001.bin From tim at fourstonesExpressions.com Mon Mar 10 13:03:55 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 10 14:04:04 2003 Subject: [Spambayes] Problem installing SpamBayes-Outlook on outlook xp [Unable to register spambayes_addin.dll] In-Reply-To: <462E202877E3D54AADAF076E175B60A91ADF69@mail.elsys-exchange.elsys> Message-ID: <9771A5YWUSNI65POCAHFPN51YVFEC0IF.3e6ce19b@myst> 3/10/2003 12:41:00 PM, "Phil West" wrote: >Hi: >I'm running Outlook 2002 on a win2k pro machine, when I start my python >2.2 IDE it sez: PythonWin 2.2.1 (#34, Apr 9 2002, 19:34:33) [MSC 32 bit >(Intel)] on win32. > > When I run the SpamBayes-Outlook-Setup.exe program, it encounters the >error: Are we missing something here? I don't see an error. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From bill at parducci.net Mon Mar 10 11:11:42 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 10 14:12:00 2003 Subject: [Spambayes] single message test question Message-ID: <3E6CE36E.5040903@parducci.net> would someone be so kind as to instruct me in what the most straightforward way to test my current filter against a single message would be? i have a note that scored very high in spamminess and i would like to know why. (i have the note isolated into a single mbox file at the moment.) 
thanks b From tim.one at comcast.net Mon Mar 10 14:22:15 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Mar 10 14:23:02 2003 Subject: [Spambayes] Problem installing SpamBayes-Outlook on outlook xp[Unable to register spambayes_addin.dll] In-Reply-To: <9771A5YWUSNI65POCAHFPN51YVFEC0IF.3e6ce19b@myst> Message-ID: [Phil West] > I'm running Outlook 2002 on a win2k pro machine, when I start my python > 2.2 IDE it sez: PythonWin 2.2.1 (#34, Apr 9 2002, 19:34:33) [MSC 32 bit > (Intel)] on win32. > > When I run the SpamBayes-Outlook-Setup.exe program, it encounters the > error: [Tim Stone] > Are we missing something here? I don't see an error. There was a giant .bmp file attached, and God only knows what will happen to that. It was a Windows error box; the only interesting part said Unable to register the DLL/OCX: DllRegisterServer failed; code 0x00000000. The error code is frightening . From tim at fourstonesExpressions.com Mon Mar 10 13:26:03 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 10 14:26:08 2003 Subject: [Spambayes] single message test question In-Reply-To: <3E6CE36E.5040903@parducci.net> Message-ID: 3/10/2003 1:11:42 PM, bill parducci wrote: >would someone be so kind as to instruct me in what the most straightforward way to test my current filter against a single message would be? i have a note that scored very high in spamminess and i would like to know why. (i have the note isolated into a single mbox file at the moment.) For me, the easiest way is to bring up the pop3proxy, with the -u option, and use the cut-and-paste entry field to classify the message. > >thanks > >b > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From popiel at wolfskeep.com Mon Mar 10 11:53:31 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Mon Mar 10 14:53:36 2003 Subject: [Spambayes] single message test question In-Reply-To: Message from bill parducci of "Mon, 10 Mar 2003 11:11:42 PST." <3E6CE36E.5040903@parducci.net> References: <3E6CE36E.5040903@parducci.net> Message-ID: <20030310195331.9D22D2DDD7@cashew.wolfskeep.com> In message: <3E6CE36E.5040903@parducci.net> bill parducci writes: >would someone be so kind as to instruct me in what the most >straightforward way to test my current filter against a single >message would be? i have a note that scored very high in spamminess >and i would like to know why. (i have the note isolated into a >single mbox file at the moment.) My approach would be to create a config file with hammie_debug_header set to true, then set the BAYESCUSTOMIZE environment variable to that config file, then run the message through hammiefilter. Actually, I have hammie_debug_header turned on in my default config file, so all I have to do is look at all the headers for the message (I normally don't display the debug header as I'm reading mail). - Alex From mhammond at skippinet.com.au Tue Mar 11 08:43:12 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 10 16:44:14 2003 Subject: [Spambayes] New bsddb3 for Python 2.2 Message-ID: There have been some reports of problems running the spambayes Outlook plugin on Python 2.2 with bsddb3. It turns out that the bsddb3 release itself was bad. The bsddb3 maintainers have released a new version of the binary (from the usual place - http://sourceforge.net/project/showfiles.php?group_id=13900). If you install this version of bsddb3, spambayes should work fine (and fast!). This bsddb module is also built using the same database version as the Python 2.3 bsddb module, so our database can be freely used between stock Python 2.3, and Python 2.2+bsddb3. Remember that there is no pickle->db migration code in Outlook - you are probably going to need to do a full re-train if you install bsddb3. 
If the startup/shutdown times are annoying you, it is well worth it though.

Mark.

From tim at fourstonesExpressions.com Mon Mar 10 17:08:58 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 18:09:05 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To:
Message-ID: <1WPNVTEDIFYXPOK53WRTPZVJFQODBDA.3e6d1b0a@myst>

3/10/2003 3:43:12 PM, "Mark Hammond" wrote:
>Remember that there is no pickle->db migration code in Outlook - you are
>probably going to need to do a full re-train if you install bsddb3. If the
>startup/shutdown times are annoying you, it is well worth it though.

dbExpImp.py can be used to migrate from pickle to db.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From T.A.Meyer at massey.ac.nz Tue Mar 11 12:45:34 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 18:46:41 2003
Subject: [Spambayes] full o' spaces
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89A@its-xchg4.massey.ac.nz>

> Are we at a point where another release is useful, or should I update
> the website to point to the current -a2 release?

I think we are definitely at the point where another release is useful. Browsing through the check-ins list, there have been quite a few significant improvements* since a2.

=Tony Meyer

* Not in the way of improving rates, really, but in fixing bugs and adding features.

From mhammond at skippinet.com.au Tue Mar 11 10:45:43 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar 10 18:46:49 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To: <1WPNVTEDIFYXPOK53WRTPZVJFQODBDA.3e6d1b0a@myst>
Message-ID:

> dbExpImp.py can be used to migrate from pickle to db.

I'm sure it can, but I'm also fairly certain that simply running it won't do the right thing for outlook :) If someone wants to work out the exact command to use, that would be great.

Mark.
From tim at fourstonesExpressions.com Mon Mar 10 18:15:31 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 19:15:40 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To:
Message-ID:

3/10/2003 5:45:43 PM, "Mark Hammond" wrote:
>
>> dbExpImp.py can be used to migrate from pickle to db.
>
>I'm sure it can, but I'm also fairly certain that simply running it won't do
>the right thing for outlook :) If someone wants to work out the exact
>command to use, that would be great.

Well, it certainly doesn't understand any of the other databases... :( forgot about those. But it can change the main wordinfo database. If you can send me an example of the other databases, I'd be happy to fix it to manage those too...

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From tim at fourstonesExpressions.com Mon Mar 10 18:29:38 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 19:29:48 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To:
Message-ID: <53UQA6GIGBIDTR42EDWVJE74DC7321.3e6d2df2@myst>

3/10/2003 5:45:43 PM, "Mark Hammond" wrote:
>
>> dbExpImp.py can be used to migrate from pickle to db.
>
>I'm sure it can, but I'm also fairly certain that simply running it won't do
>the right thing for outlook :) If someone wants to work out the exact
>command to use, that would be great.
>

Ok, so here I am, replying to the same message twice... braindeath is a terrible thing. The commands would be:

    dbExpImp.py -e -d mypickledwordinfo -f mypickledwordinfo.export
    dbExpImp.py -i -D mybsddbwordinfo -f mypickledwordinfo.export

or...
    dbExpImp.py -h

for that and several more scenarios

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From tim.one at comcast.net Mon Mar 10 21:28:28 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 21:29:08 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <15980.53160.87499.28570@montanaro.dyndns.org>
Message-ID:

[Tim]
> It was an example of harmful correlation, by way of illustrating
> why a strong indicator isn't necessarily a desirable indicator.
> This particular example applies pretty directly to any
> source from which a user rarely (but not never) gets spam, and
> leaves clues about itself.

[Skip Montanaro]
> True enough. I'm sure there are lots of such correlations. But if a
> person's incoming mail isn't dominated by one source, such harmful
> correlations will have less impact on the final score of any
> given message, right?

Strictly less, yes, but it's a second-order distinction and would have trouble being *significantly* less. Say you have H total ham and S total spam, and that a particular token appears in h ham and s spam. The unadjusted spamprob for that token is then

       s/S
    ---------
    s/S + h/H

which can be rearranged as

        H
    ----------
    H + (h/s)S

The magnitudes of h and s don't matter to the result, nor even the magnitudes of h and s relative to H and S -- all that matters is the ratio of h to s. So it makes no difference at this level whether the token appears in 99% of your training data, or in 0.0001% of it: if it appears in (say) 20 times more ham msgs than spam msgs, the first-order spamprob guess is the same whether that's a total of 20 msgs or 20 million.

Or, IOW, if 1% of my python.org mail is spam, and 1% of my guysnamedtim.com mail is spam, and 1% of my friendsofskip.org mail is spam, a clue unique to any of those sources gets the same first-order spamprob, and regardless of what percentages of my total email derive from these sources.
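
A minimal sketch of the arithmetic above, assuming Python 3 and made-up corpus sizes. The function names are hypothetical and this is not the spambayes classifier code; it only checks that the two forms of the unadjusted spamprob agree and that the result depends on the ratio of h to s, not on their magnitudes:

```python
def unadjusted_spamprob(h, s, H, S):
    """First-order spamprob guess: (s/S) / (s/S + h/H).

    h, s: number of ham / spam messages containing the token.
    H, S: total number of ham / spam messages trained on.
    """
    spam_ratio = s / S
    ham_ratio = h / H
    return spam_ratio / (spam_ratio + ham_ratio)


def rearranged_spamprob(h, s, H, S):
    """The same quantity rewritten as H / (H + (h/s)*S)."""
    return H / (H + (h / s) * S)


if __name__ == "__main__":
    # Illustrative totals only (an assumption, not measured data).
    H, S = 12000, 7000
    # In both cases the token shows up in 20 times more ham than spam.
    rare = unadjusted_spamprob(20, 1, H, S)
    common = unadjusted_spamprob(2000, 100, H, S)
    print(rare, common, rearranged_spamprob(20, 1, H, S))
```

All three printed values should agree up to floating-point rounding, whether the token was seen in 21 messages or 2100.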
The Bayesian adjustment goes on to fiddle the guess, taking *some* measure of the magnitude of h+s into account, but as h+s increases it has a smaller and smaller effect. If I only have one msg total from guysnamedtim.com, the adjustment is large, but unknown_word_strength is under 0.5 by default and we approach the by-counting spamprob guess quickly as h+s increases.

> As an example, I just grep'd my ham collection for the
> Sender field, squashed case, sorted and uniq'd, then sorted again. The
> tail end looked like
>
>      150 sender: folkmusic-admin@grassyhill.org
>      221 sender: zope-admin@zope.org
>      255 sender: folk music presenters
>      450 sender: spambayes-bounces@python.org
>      550 sender: python-checkins-admin@python.org
>      555 sender: owner-6pack@autox.team.net
>      688 sender: python-dev-admin@python.org
>      821 sender: spamassassin-talk-admin@lists.sourceforge.net
>     1387 sender: cedu-admin@manatee.mojam.com
>     3091 sender: python-list-admin@python.org
>
> This is out of 9609 Sender headers (just under 12,000 hams). If
> I remember comments you've made on this topic in the past, I expect
> your Sender: headers to be more strongly dominated by Python-related
> messages than this.

They are, but, as above, that has a minor effect on spamprobs. What's worse about python.org mail is that there are so *many* tokens unique to it, and they're (equally) strong ham clues. Of course there are two sides to the story: while that makes it easy for spam from python.org to rate unsure, it also virtually guarantees that ham from python.org never rates unsure.

> Just the presence of a Sender header irregardless of where it came from
> seems to be a pretty strong ham clue (something spammers could/do
> exploit?).
> My roughly 7,000 spams only have 759 Sender headers.

Then they're not very consistent in exploiting it.

> I haven't experimented with adding it to Options.options.address_headers,
> but your comment in tokenizer.py suggests this probably wouldn't be too
> wise.
It's on by default in the Outlook client. It's deadly for research on mixed-source corpora, but for live email I expect it to help. This wasn't formally tested, though, and should be. I can testify from experience that it's not deadly in real-life Outlook use.

From T.A.Meyer at massey.ac.nz Tue Mar 11 15:42:00 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 21:42:49 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>

Ok, there are now the following headers available in pop3proxy:

    X-Spambayes-Classification: {ham | spam | unsure}
    X-Spambayes-Spam-Probability: (message score)
    X-Spambayes-Level: (thermostat, one * = 10%)
    X-Spambayes-Evidence: (list of clues, like hammie's debug)
    X-Spambayes-MailId: (unique id for the message)

Apart from Classification, all of these are off by default. The rest can be turned on via the configuration page in the ui, or via the following options in a config file:

    pop3proxy_include_prob: {True | False}
    pop3proxy_include_thermostat: {True | False}
    pop3proxy_include_evidence: {True | False}
    pop3proxy_add_mailid_to: {"" | "header" | "body" | "header body" | "body header"}

You can, of course, change any of the header names - look in Options.py for the details.

As when I committed the prob header, I've done limited testing here. Nothing changes as far as I can tell until you change the default settings, so those that don't want these options should find nothing different. Each header seems to add what it should. I didn't really know what other tests to do! Let me/the list know if something isn't right.
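
For readers who want to see what the thermostat amounts to, here is a minimal sketch in Python 3, assuming the score is a probability between 0.0 and 1.0 and following the "one * = 10%" / int(score*10) rule from this thread. The function `level_header` is hypothetical and is not the pop3proxy implementation:

```python
def level_header(score, name="X-Spambayes-Level"):
    """Build a thermostat-style header line: one '*' per full 10% of score.

    score is assumed to be a spam probability in the range 0.0 to 1.0;
    the header name defaults to the one mentioned in the thread.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    stars = "*" * int(score * 10)
    return "%s: %s" % (name, stars)


if __name__ == "__main__":
    for score in (0.05, 0.42, 0.87, 1.0):
        print(level_header(score))
```

With a header like this, a mail client rule that matches a run of eight stars would catch every message scoring 0.8 or higher, and adding or removing a star in the rule moves that cutoff by 10%.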

Enjoy :)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Tue Mar 11 15:50:39 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 21:51:37 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89E@its-xchg4.massey.ac.nz>

> Meyer, Tony wrote:
> > cmp.py gave me lots of errors, because the lines were not what was
> > expected.

[Neil]
> I'm guessing you ran "rates.py test > tests".  rates.py creates its own
> output file and writes something a little different to stdout.  cmp.py
> can't understand the stdout data.

Ah, this is exactly what I did.  I should have read Mark's instructions
somewhat more closely.

Thanks for the help, and apologies for the stupidity ;)

Cheers,
Tony

From T.A.Meyer at massey.ac.nz  Tue Mar 11 15:54:44 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 21:55:41 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89F@its-xchg4.massey.ac.nz>

> Here are my current results on the imbalance option.

And here are mine.

imbalance_false4s.txt -> imbalance_true4s.txt
-> tested 372 hams & 48 spams against 983 hams & 155 spams
-> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> tested 321 hams & 51 spams against 1034 hams & 152 spams
-> tested 372 hams & 48 spams against 983 hams & 155 spams
-> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> tested 321 hams & 51 spams against 1034 hams & 152 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  4 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.250  6.250  tied
    0.000  0.000  tied
    6.250  6.250  tied
    3.922  3.922  tied

won   0 times
tied  4 times
lost  0 times

total unique fn went from 8 to 8 tied
mean fn % went from 4.10539215686 to 4.10539215686 tied

ham mean                     ham sdev
   0.39    0.39   +0.00%        3.46    3.46   +0.00%
   0.09    0.09   +0.00%        0.91    0.91   +0.00%
   0.65    0.65   +0.00%        4.57    4.57   +0.00%
   1.40    1.40   +0.00%        7.93    7.93   +0.00%

ham mean and sdev for all runs
   0.62    0.62   +0.00%        4.87    4.87   +0.00%

spam mean                    spam sdev
  87.62   87.62   +0.00%       28.34   28.34   +0.00%
  90.83   90.83   +0.00%       18.01   18.01   +0.00%
  91.17   91.17   +0.00%       25.61   25.61   +0.00%
  85.65   85.65   +0.00%       25.97   25.97   +0.00%

spam mean and sdev for all runs
  88.85   88.85   +0.00%       24.68   24.68   +0.00%

ham/spam mean difference: 88.23 88.23 +0.00

My ham:spam ratio is about 7:1 (Mark's was about 1:2.5).  Forgive the newbie
question, but does this mean that:
(a) for my corpus, the option makes no difference at all?
(b) I haven't tested with a big enough corpus?
(c) I did something wrong ;)

Thanks,
Tony Meyer

From tim.one at comcast.net  Mon Mar 10 22:10:09 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 22:11:04 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89F@its-xchg4.massey.ac.nz>
Message-ID:

[Meyer, Tony]
> imbalance_false4s.txt -> imbalance_true4s.txt
> -> tested 372 hams & 48 spams against 983 hams & 155 spams
> -> tested 333 hams & 56 spams against 1022 hams & 147 spams
> -> tested 329 hams & 48 spams against 1026 hams & 155 spams
> -> tested 321 hams & 51 spams against 1034 hams & 152 spams
> -> tested 372 hams & 48 spams against 983 hams & 155 spams
> -> tested 333 hams & 56 spams against 1022 hams & 147 spams
> -> tested 329 hams & 48 spams against 1026 hams & 155 spams
> -> tested 321 hams & 51 spams against 1034 hams & 152 spams
>
> false positive percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>
> won   0 times
> tied  4 times
> lost  0 times
>
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
>
> false negative percentages
>     6.250  6.250  tied
>     0.000  0.000  tied
>     6.250  6.250  tied
>     3.922  3.922  tied
>
> won   0 times
> tied  4 times
> lost  0 times
>
> total unique fn went from 8 to 8 tied
> mean fn % went from 4.10539215686 to 4.10539215686 tied
>
> ham mean                     ham sdev
>    0.39    0.39   +0.00%        3.46    3.46   +0.00%
>    0.09    0.09   +0.00%        0.91    0.91   +0.00%
>    0.65    0.65   +0.00%        4.57    4.57   +0.00%
>    1.40    1.40   +0.00%        7.93    7.93   +0.00%
>
> ham mean and sdev for all runs
>    0.62    0.62   +0.00%        4.87    4.87   +0.00%
>
> spam mean                    spam sdev
>   87.62   87.62   +0.00%       28.34   28.34   +0.00%
>   90.83   90.83   +0.00%       18.01   18.01   +0.00%
>   91.17   91.17   +0.00%       25.61   25.61   +0.00%
>   85.65   85.65   +0.00%       25.97   25.97   +0.00%
>
> spam mean and sdev for all runs
>   88.85   88.85   +0.00%       24.68   24.68   +0.00%
>
> ham/spam mean difference: 88.23 88.23 +0.00
>
> My ham:spam ratio is about 7:1 (Mark's was about 1:2.5).
> Forgive the newbie question, but does this mean that:
> (a) for my corpus, the options makes no difference at all?
> (b) I haven't tested with a big enough corpus?
> (c) I did something wrong ;)

(d) Something went wrong somewhere.  The listings of means and sdevs are
supremely sensitive to even the tiniest changes:  I've never seen them all
zero unless the classifiers and tokenizers going into them were actually
identical.

Given that you have more ham than spam, the expected effect of enabling the
option is to decrease your FN rate (which, at 4%, is high), and possibly
increase your FP rate (which is 0).

From tony-bayes at lownds.com  Mon Mar 10 21:25:47 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Tue Mar 11 00:26:18 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
Message-ID:

Hi,

How about putting these in a separate namespace?  I have been writing a GUI
for spambayes using PyObjC and it could benefit from more regular option
names here.

header_spam_probability: {True|False}
header_level: {True|False}
header_evidence: {True|False}
header_mailid: {True|False}
pop3proxy_mailid_notate_body: {True|False}
pop3proxy_classification_notate_to: {True|False}

Ok, the last two options aren't very regularly named, but then again, they
do irregular things.

There may be some bugs lurking, I'm now getting
"X-Spambayes-Classification: ham" in the body of my emails.  Also, this bit
of code around line 163 in pop3proxy.py doesn't account for the extra
possible headers.

# HEADER_EXAMPLE is the longest possible header - the length of this one
# is added to the size of each message.
HEADER_FORMAT = '%s: %%s\r\n' % options.hammie_header_name
HEADER_EXAMPLE = '%s: xxxxxxxxxxxxxxxxxxxx\r\n' % options.hammie_header_name

BTW, I'm pretty excited about the mailid stuff you have done.
Being able to correct a single message without seeing all of my mail again
will be great.

-Tony Lownds

At 3:42 PM +1300 3/11/03, Meyer, Tony wrote:
>Ok, there's now the following headers available in pop3proxy:
>
>X-Spambayes-Classification: {ham | spam | unsure}
>X-Spambayes-Spam-Probability: (message score)
>X-Spambayes-Level: (thermostat, one * = 10%)
>X-Spambayes-Evidence: (list of clues, like hammie's debug)
>X-Spambayes-MailId: (unique id for the message)
>
>Apart from Classification, all of these are off by default.  The
>rest can be turned on via the configuration page in the ui, or via
>the following options in a config file:
>pop3proxy_include_prob: {True | False}
>pop3proxy_include_thermostat: {True | False}
>pop3proxy_include_evidence: {True | False}
>pop3proxy_add_mailid_to: {"" | "header" | "body" | "header body" |
>"body header"}
>
>You can, of course, change any of the header names - look in
>Options.py for the details.
>
>As when I committed the prob header, I've done limited testing here.
>Nothing changes as far as I can tell until you change the default
>settings, so those that don't want these options should find nothing
>different.  Each header seems to add what it should.  I didn't
>really know what other tests to do!  Let me/the list know if
>something isn't right.
>
>Enjoy :)
>
>=Tony Meyer
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes

From noreply at sourceforge.net  Tue Mar 11 00:48:17 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar 11 07:48:33 2003
Subject: [Spambayes] [ spambayes-Bugs-701413 ] dbExpImp.py fails (python 2.2, win XP)
Message-ID:

Bugs item #701413, was opened at 2003-03-11 09:48
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Nobody/Anonymous (nobody)
Summary: dbExpImp.py fails (python 2.2, win XP)

Initial Comment:
dbExpImp.py fails to run, and exits with the following error:

C:\Programfiler\_UTIL\spambayes_cvs\spambayes>C:\PROGRA~1\_DEV\Python22\python.exe dbExpImp.py
  File "dbExpImp.py", line 98
    from __future__ import generators
SyntaxError: from __future__ imports must occur at the beginning of the file

I tried to move the import-statements on top, as indicated by the
error-msg, and this seemed to work.

I.E. MOVE:
####################################
from __future__ import generators

import spambayes.storage
from spambayes.Options import options
import sys, os, getopt, errno, re
import urllib
####################################

OVER:
####################################
try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
####################################

I've never written anything in Python, so I have no clue as to what this
really means.
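
The rule behind this error can be checked without touching dbExpImp.py. The sketch below is illustrative only: the two source strings are made-up stand-ins, not the actual file contents. It compiles one module body where the `__future__` import comes first and one where an ordinary statement precedes it.

```python
# A __future__ import may be preceded only by the module docstring,
# comments, blank lines and other __future__ imports.
GOOD = (
    '"""A module docstring is allowed before it."""\n'
    "from __future__ import generators\n"
    "import sys\n"
)

# Any ordinary statement before it triggers the SyntaxError from the report.
BAD = (
    "import sys\n"
    "from __future__ import generators\n"
)


def compiles(source):
    """Return True if the source compiles as a module, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(GOOD), compiles(BAD))
```

This is why moving the import block above the `try: True, False` compatibility shim makes the script run again.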

os: win XP HOME (norwegian)
python 2.2
bsddb3: 4.1.4
spambayes: latest CVS as of 2003-03-11

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

From noreply at sourceforge.net  Tue Mar 11 05:52:01 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar 11 08:50:37 2003
Subject: [Spambayes] [ spambayes-Bugs-701413 ] dbExpImp.py fails (python 2.2, win XP)
Message-ID:

Bugs item #701413, was opened at 2003-03-11 02:48
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

Category: None
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
>Assigned to: Tim Stone (timstone4)
Summary: dbExpImp.py fails (python 2.2, win XP)

Initial Comment:
dbExpImp.py fails to run, and exits with the following error:

C:\Programfiler\_UTIL\spambayes_cvs\spambayes>C:\PROGRA~1\_DEV\Python22\python.exe dbExpImp.py
  File "dbExpImp.py", line 98
    from __future__ import generators
SyntaxError: from __future__ imports must occur at the beginning of the file

I tried to move the import-statements on top, as indicated by the
error-msg, and this seemed to work.

I.E. MOVE:
####################################
from __future__ import generators

import spambayes.storage
from spambayes.Options import options
import sys, os, getopt, errno, re
import urllib
####################################

OVER:
####################################
try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
####################################

I've never written anything in Python, so I have no clue as to what this
really means.

os: win XP HOME (norwegian)
python 2.2
bsddb3: 4.1.4
spambayes: latest CVS as of 2003-03-11

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-11 07:52

Message:
Logged In: YES
user_id=645698

Fixed.  Wish they were all this easy.  Now what dummy put
the generators import there anyway?

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

From skip at pobox.com  Tue Mar 11 08:00:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 11 09:00:50 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
Message-ID: <15981.60411.170824.504685@montanaro.dyndns.org>

    Tony> Ok, there's now the following headers available in pop3proxy:

    Tony> X-Spambayes-Classification: {ham | spam | unsure}
    Tony> X-Spambayes-Spam-Probability: (message score)
    Tony> X-Spambayes-Level: (thermostat, one * = 10%)
    Tony> X-Spambayes-Evidence: (list of clues, like hammie's debug)
    Tony> X-Spambayes-MailId: (unique id for the message)

Perhaps adding/deleting headers should be controlled by their own section in
the options file and a headers module should be written, so all apps which
tweak headers can say something like:

    from spambayes import headers
    ...
    headers.add_spambayes_headers(msg, ...)
    ...

and not have to worry further about specific headers.

On a related note, it seems to me that if a spambayes tool is going to
delete one of the headers (in case the message has been classified
previously or spammers try to exploit them), then all of them should be
deleted:

    from spambayes import headers
    ...
    headers.delete_spambayes_headers(msg)
    ...
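
One way such a helper pair might look, built on the standard `email` package, is sketched here. No `spambayes.headers` module existed at the time of this thread, so the function bodies, the `classification` and `prob` parameters, and the probability format are all assumptions made for illustration.

```python
from email.message import Message

# Header names as listed earlier in the thread.
SPAMBAYES_HEADERS = (
    "X-Spambayes-Classification",
    "X-Spambayes-Spam-Probability",
    "X-Spambayes-Level",
    "X-Spambayes-Evidence",
    "X-Spambayes-MailId",
)


def delete_spambayes_headers(msg):
    """Remove every Spambayes header, whether forged or left by an earlier run."""
    for name in SPAMBAYES_HEADERS:
        # Deleting a header that is absent is a no-op for email.message.Message.
        del msg[name]


def add_spambayes_headers(msg, classification, prob=None):
    """Hypothetical helper: replace any existing headers with fresh ones."""
    delete_spambayes_headers(msg)
    msg["X-Spambayes-Classification"] = classification
    if prob is not None:
        # The four-decimal format is an assumption, not the project's choice.
        msg["X-Spambayes-Spam-Probability"] = "%.4f" % prob


msg = Message()
msg["Subject"] = "cheap pills"
msg["X-Spambayes-Classification"] = "ham"  # forged by the sender
add_spambayes_headers(msg, "spam", 0.9987)
print(msg.get_all("X-Spambayes-Classification"))
```

Deleting the whole set before adding anything means a spammer's pre-inserted "ham" header cannot survive alongside the real one.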

Skip

From T.A.Meyer at massey.ac.nz  Tue Mar 11 17:45:42 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 09:08:07 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD7F@its-xchg4.massey.ac.nz>

> (d) Something went wrong somewhere.  The listings of means
> and sdevs are supremely sensitive to even the tiniest changes:
> I've never seen them all zero unless the classifiers and tokenizers
> going into them were actually identical.

Which was the case here.  The mistake was that timtest wasn't finding the
new config, so it was running the same test twice and comparing the two
runs.  Not surprisingly, option=false did the same as option=false :)

Thanks for the help :)

> Given that you have more ham than spam, the expected effect
> of enabling the option is to decrease your FN rate (which,
> at 4%, is high), and possibly increase your FP rate (which is 0).

Which is what happened.  From 4% to 1% and from 0% to 0.2%.  The 3 fp's were
(1) a "you're almost ready to start using" email from habeas.com (this does
better in my personal set since I check for the habeas headers), (2) an
announcement from mtnsms.com about their new smspop service, and (3) a
"thank you for installing" message from Real.

I think this says that for me, it's a loss.  All three of these
(particularly the first two) were important at the time, and I would not
have wanted to wade through the spam folder for them.  I would much rather
put up with the fn's.
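
The headline rates in this message (roughly 4% to 1% for false negatives and 0% to 0.2% for false positives) can be recomputed from per-run counts. A minimal sketch: the run sizes are the "tested N hams & M spams" figures from the cmp.py listing for this test, while the per-run error counts are inferred from the reported percentages (for example 6.250% of 48 spams is 3) rather than stated in the post.

```python
# Messages tested in each of the four runs (from the cmp.py listing).
hams_tested = [372, 333, 329, 321]
spams_tested = [48, 56, 48, 51]

# Per-run error counts, inferred from the reported percentages.
fp_off, fp_on = [0, 0, 0, 0], [0, 0, 1, 2]
fn_off, fn_on = [3, 0, 3, 2], [1, 0, 1, 0]


def mean_rate(errors, totals):
    """Mean of the per-run error percentages, i.e. the 'mean fp %' style figure."""
    rates = [100.0 * e / t for e, t in zip(errors, totals)]
    return sum(rates) / len(rates)


print("mean fn %% went from %.4f to %.4f" % (
    mean_rate(fn_off, spams_tested), mean_rate(fn_on, spams_tested)))
print("mean fp %% went from %.4f to %.4f" % (
    mean_rate(fp_off, hams_tested), mean_rate(fp_on, hams_tested)))
```

With only about 50 spams per run, a single message moves the false negative rate by roughly two percentage points, which is worth keeping in mind when reading the won/lost tallies.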

Here are (hopefully) correct results:

imbal_falses.txt -> imbal_trues.txt
-> tested 372 hams & 48 spams against 983 hams & 155 spams
-> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> tested 321 hams & 51 spams against 1034 hams & 152 spams
-> tested 372 hams & 48 spams against 983 hams & 155 spams
-> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> tested 321 hams & 51 spams against 1034 hams & 152 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.304  lost  +(was 0)
    0.000  0.623  lost  +(was 0)

won   0 times
tied  2 times
lost  2 times

total unique fp went from 0 to 3 lost  +(was 0)
mean fp % went from 0.0 to 0.231751081821 lost  +(was 0)

false negative percentages
    6.250  2.083  won    -66.67%
    0.000  0.000  tied
    6.250  2.083  won    -66.67%
    3.922  0.000  won   -100.00%

won   3 times
tied  1 times
lost  0 times

total unique fn went from 8 to 2 won    -75.00%
mean fn % went from 4.10539215686 to 1.04166666667 won    -74.63%

ham mean                     ham sdev
   0.39    1.45  +271.79%        3.46    6.76   +95.38%
   0.09    1.30 +1344.44%        0.91    6.05  +564.84%
   0.65    2.56  +293.85%        4.57    9.96  +117.94%
   1.40    3.37  +140.71%        7.93   14.06   +77.30%

ham mean and sdev for all runs
   0.62    2.14  +245.16%        4.87    9.65   +98.15%

spam mean                    spam sdev
  87.62   94.09    +7.38%       28.34   16.45   -41.95%
  90.83   99.06    +9.06%       18.01    3.61   -79.96%
  91.17   94.81    +3.99%       25.61   17.83   -30.38%
  85.65   94.52   +10.36%       25.97   14.35   -44.74%

spam mean and sdev for all runs
  88.85   95.74    +7.75%       24.68   14.10   -42.87%

ham/spam mean difference: 88.23 93.60 +5.37

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Tue Mar 11 19:07:27 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 09:15:08 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A5@its-xchg4.massey.ac.nz>

[Bill Yerazunis]
> I've also had multiple requests for a continuous output match
> parameter in CRM114, so I settled on this:
>
>    pR = - (log (Pspam) - log (Pnonspam))
>
> This goes from roughly +350 to -350, and (nicely) the uncertains
> and errors all seem to group around +/- 100.

Curious, and (sort of) able to now run tests (thanks Tim & Mark), I changed
the "prob = (S-H + 1.0) / 2.0" equation in classifier.py to use this method.
I had to also fiddle with 0's since log(0) isn't nice (how does CRM114 do
this?), plus I moved it from -350 to +350 to 0-1.

Surprisingly I got good (well, perfect, actually) results.  Is this just my
teeny-weeny sets?  A fluke?  *Another* mistake on my part?

The change I made was to replace line 245 ("prob = (S-H + 1.0) / 2.0") of
classifier.py with:
"""
from math import log
if H == 0:
    H = 0.00000001
if S == 0:
    S = 0.00000001
prob = ((-(log(S) - log(H)))/350) + 0.5
"""

pr_falses.txt -> pr_trues.txt
-> tested 333 hams & 56 spams against 372 hams & 48 spams
-> tested 329 hams & 48 spams against 372 hams & 48 spams
-> tested 321 hams & 51 spams against 372 hams & 48 spams
-> tested 372 hams & 48 spams against 333 hams & 56 spams
-> tested 329 hams & 48 spams against 333 hams & 56 spams
-> tested 321 hams & 51 spams against 333 hams & 56 spams
-> tested 372 hams & 48 spams against 329 hams & 48 spams
-> tested 333 hams & 56 spams against 329 hams & 48 spams
-> tested 321 hams & 51 spams against 329 hams & 48 spams
-> tested 372 hams & 48 spams against 321 hams & 51 spams
-> tested 333 hams & 56 spams against 321 hams & 51 spams
-> tested 329 hams & 48 spams against 321 hams & 51 spams
-> tested 333 hams & 56 spams against 372 hams & 48 spams
-> tested 329 hams & 48 spams against 372 hams & 48 spams
-> tested 321 hams & 51 spams against 372 hams & 48 spams
-> tested 372 hams & 48 spams against 333 hams & 56 spams
-> tested 329 hams & 48 spams against 333 hams & 56 spams
-> tested 321 hams & 51 spams against 333 hams & 56 spams
-> tested 372 hams & 48 spams against 329 hams & 48 spams
-> tested 333 hams & 56 spams against 329 hams & 48 spams
-> tested 321 hams & 51 spams against 329 hams & 48 spams
-> tested 372 hams & 48 spams against 321 hams & 51 spams
-> tested 333 hams & 56 spams against 321 hams & 51 spams
-> tested 329 hams & 48 spams against 321 hams & 51 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.312  0.000  won   -100.00%
    0.000  0.000  tied
    0.304  0.000  won   -100.00%
    0.935  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.623  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   4 times
tied  8 times
lost  0 times

total unique fp went from 4 to 0 won   -100.00%
mean fp % went from 0.181092520524 to 0.0 won   -100.00%

false negative percentages
    0.000  0.000  tied
    2.083  0.000  won   -100.00%
    0.000  0.000  tied
    2.083  0.000  won   -100.00%
    2.083  0.000  won   -100.00%
    0.000  0.000  tied
    2.083  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    6.250  0.000  won   -100.00%
    0.000  0.000  tied
    4.167  0.000  won   -100.00%

won   6 times
tied  6 times
lost  0 times

total unique fn went from 5 to 0 won   -100.00%
mean fn % went from 1.5625 to 0.0 won   -100.00%

ham mean                     ham sdev
   3.64   55.82 +1433.52%       11.61    3.14   -72.95%
   3.68   55.64 +1411.96%       12.69    3.18   -74.94%
   2.84   55.75 +1863.03%       10.59    3.09   -70.82%
   2.08   56.10 +2597.12%        7.78    3.12   -59.90%

ham mean and sdev for all runs
   3.05   55.83 +1730.49%       10.83    3.14   -71.01%

spam mean                    spam sdev
  92.59   45.50   -50.86%       17.72    3.41   -80.76%
  94.02   44.72   -52.44%       16.04    3.48   -78.30%
  93.46   45.01   -51.84%       16.94    3.44   -79.69%
  87.89   45.01   -48.79%       22.86    3.88   -83.03%

spam mean and sdev for all runs
  91.98   45.07   -51.00%       18.75    3.57   -80.96%

ham/spam mean difference: 88.93 -10.76 -99.69

Comments?

=Tony Meyer

From anthony at interlink.com.au  Wed Mar 12 01:22:47 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Mar 11 09:23:11 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A5@its-xchg4.massey.ac.nz>
Message-ID: <200303111422.h2BEMlX27103@localhost.localdomain>

>>> "Meyer, Tony" wrote
> Curious, and (sort of) able to now run tests (thanks Tim & Mark), I
> changed the "prob = (S-H + 1.0) / 2.0" equation in classifier.py to
> use this method.  I had to also fiddle with 0's since log(0) isn't nice
> (how does CRM114 do this?), plus I moved it from -350to+350 to 0-1.
> Surprisingly I got good (well, perfect, actually) results.  Is this
> just my tiny-weeny sets?  A fluke?  *Another* mistake on my part?

Um, I'd say "mistake".  Look at the numbers.  Your ham mean has gone
from around 3 to around 55, while the spam mean's gone from around
92 to around 45.  So you've moved everything solidly into the "unsure"
bucket.  This, of course, will remove your FN/FP numbers.  But then,
dumping your email directly into the unsure folder without running
spambayes will do that, too.

Worse yet, your spam is scoring, on average, less than your ham!  Oops.

Anthony

> ham mean and sdev for all runs
>    3.05   55.83 +1730.49%       10.83    3.14   -71.01%
>
> spam mean and sdev for all runs
>   91.98   45.07   -51.00%       18.75    3.57   -80.96%

From tim at fourstonesExpressions.com  Tue Mar 11 11:14:21 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 12:14:29 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15981.60411.170824.504685@montanaro.dyndns.org>
Message-ID:

>Perhaps adding/deleting headers should be controlled by their own section in
>the options file and a headers module should be written

This is a great idea.  I'll take this one on.  I'll fix pop3proxy and
notesfilter.  I suppose hammiefilter will need to be adjusted.  I'm not sure
how interesting this will be to the outlook code.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From skip at pobox.com  Tue Mar 11 11:25:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 11 12:26:02 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To:
References: <15981.60411.170824.504685@montanaro.dyndns.org>
Message-ID: <15982.7198.16444.97558@montanaro.dyndns.org>

    >> Perhaps adding/deleting headers should be controlled by their own
    >> section in the options file and a headers module should be written

    Tim> This is a great idea.  I'll take this one on.  I'll fix pop3proxy
    Tim> and notesfilter.  I suppose hammiefilter will need to be adjusted.

I'll twiddle hammiefilter.

    Tim> I'm not sure how interesting this will be to the outlook code.

If it's checking various header options, they will need changing if the
names are changed.

Skip

From kjellqvist at nordkalak.se  Tue Mar 11 21:04:55 2003
From: kjellqvist at nordkalak.se (Göran Källqvist)
Date: Tue Mar 11 15:28:17 2003
Subject: [Spambayes] Crash after upgrading KDE
Message-ID: <200303112104.55945.kjellqvist@nordkalak.se>

Hi!

I've just upgraded to KDE 3.1.  Have used spambayes with KDE 3.04 for several
weeks without problem, and it still starts OK:

>gorank@triathlon:~/spambayes> /usr/bin/pop3proxy.py
>Loading database... Done.
>Listener on port 1110 is proxying m1.970.telia.com:110
>User interface url is http://localhost:8880/

But when I try to fetch my mail (with Kmail 1.5) I get the following error:

>error: uncaptured python exception, closing channel
><__main__.ServerLineReader connected at 0x883bed4> (exceptions.EOFError:
>[/usr/lib/python2.2/asyncore.py|poll|94]
>[/usr/lib/python2.2/asyncore.py|handle_read_event|391]
>[/usr/lib/python2.2/asynchat.py|handle_read|130]
>[/usr/bin/pop3proxy.py|found_terminator|200]
>[/usr/bin/pop3proxy.py|onServerLine|268]
>[/usr/bin/pop3proxy.py|onResponse|342]
>[/usr/bin/pop3proxy.py|onTransaction|438]
>[/usr/bin/pop3proxy.py|onRetr|485]
>[/usr/lib/python2.2/site-packages/spambayes/classifier.py|chi2_spamprob|217]
>[/usr/lib/python2.2/site-packages/spambayes/classifier.py|_getclues|437]
>[/usr/lib/python2.2/site-packages/spambayes/storage.py|_wordinfoget|192]
>[/usr/lib/python2.2/shelve.py|get|66]
>[/usr/lib/python2.2/shelve.py|__getitem__|71])

The web interface is still working.  I'm running spambayes on linux 2.4.18.

Anyone seen a similar problem?  And the solution?

Greetings
Göran Källqvist

From T.A.Meyer at massey.ac.nz  Wed Mar 12 10:07:10 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 16:11:41 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A6@its-xchg4.massey.ac.nz>

> Perhaps adding/deleting headers should be controlled by their
> own section in the options file and a headers module should be written

Definitely +1 here.  Anything that simplifies the options, or modularises
them is good, IMO.  I look forward to seeing it when Tim's done :)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 12 10:11:56 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 16:12:44 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>

> How about putting these in a separate namespace?  I have been writing
> a GUI for spambayes using PyObjC and it could benefit from more
> regular option names here.

Does Skip's proposal sound ok?

> header_spam_probability: {True|False}
> header_level: {True|False}
> header_evidence: {True|False}
> header_mailid: {True|False}
> pop3proxy_mailid_notate_body: {True|False}
> pop3proxy_classification_notate_to: {True|False}

Personally, I prefer the current method of mailid.  Originally it was like
this (for perhaps 24 hours), but there are already way too many options.  So
I dropped the T/F add_to_body option and changed the add to a string.
Shouldn't matter to developers, and end-users should have a nice GUI hiding
it all anyway.

> There may be some bugs lurking, I'm now getting
> "X-Spambayes-Classification: ham" in the body of my emails.

I will check this ASAP.

> Also,
> this bit of code around line 163 in pop3proxy.py doesn't account for
> the extra possible headers.

Drat.  Good spotting.  Will fix this too.

> BTW, I'm pretty excited about the mailid stuff you have done.  Being
> able to correct a single message without seeing all of my mail again
> will be great.

We aim to please ;)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 12 10:40:39 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 16:41:35 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD80@its-xchg4.massey.ac.nz>

> Um, I'd say "mistake".  Look at the numbers.  Your ham mean has gone
> from around 3 to around 55, while the spam mean's gone from around
> 92 to around 45.  So you've moved everything solidly into the "unsure"
> bucket.

I realised that I'd stuffed it up just after I went home.  Too much rushing
at the end of the day.
Looking at:

>    pR = - (log (Pspam) - log (Pnonspam))
> This goes from roughly +350 to -350, and (nicely) the uncertains
> and errors all seem to group around +/- 100.

I should have been more careful, since obviously a Pspam and Pnonspam
ranging from 0->1 will not end up with many scores near 350, unless there
are some *very* accurate floating point numbers.

Apologies for the foolishness.

=Tony Meyer

From tim.one at comcast.net  Tue Mar 11 16:41:36 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 11 16:42:13 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15982.7198.16444.97558@montanaro.dyndns.org>
Message-ID:

[TimS]
> I'm not sure how interesting this will be to the outlook code.

Not to worry -- it shouldn't affect the Outlook client one way or the other.
That stores the spam score as a kind of metadata ("custom field") on the
message object; it doesn't alter the headers.

From tony-bayes at lownds.com  Tue Mar 11 16:59:22 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Tue Mar 11 19:59:24 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
Message-ID:

At 10:11 AM +1300 3/12/03, Meyer, Tony wrote:
>Does Skip's proposal sound ok?

Yes, sounds like a good idea.

>Personally, I prefer the current method of mailid.  Originally it
>was like this (for perhaps 24 hours), but there are already way too
>many options.  So I dropped the T/F add_to_body option and changed
>the add to a string.  Shouldn't matter to developers, and end-users
>should have a nice GUI hiding it all anyway.

I see what you mean.  Maybe someone will want mailid to appear at the
front of the e-mail body, or maybe in the subject, or....

>> BTW, I'm pretty excited about the mailid stuff you have done.  Being
>> able to correct a single message without seeing all of my mail again
>> will be great.
>
>We aim to please ;)
>

Great stuff!

-Tony

From tony-bayes at lownds.com  Tue Mar 11 17:18:50 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Tue Mar 11 20:18:53 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
Message-ID:

>> There may be some bugs lurking, I'm now getting
>> "X-Spambayes-Classification: ham" in the body of my emails.
>
>I will check this ASAP.
>

This fixes it.

--- pop3proxy.py	11 Mar 2003 02:48:29 -0000	1.65
+++ pop3proxy.py	12 Mar 2003 01:10:36 -0000
@@ -503,7 +503,7 @@
         headers, body = re.split(r'\n\r?\n', messageText, 1)
         messageName = state.getNewMessageName()
-        headers += '\r\n%s: %s\r\n' % (options.hammie_header_name,
+        headers += '\n%s: %s\r\n' % (options.hammie_header_name,
                                        disposition)
         if command == 'RETR' and not state.isTest:
             if options.pop3proxy_add_mailid_to.find("header") != -1:

>> Also,
>> this bit of code around line 163 in pop3proxy.py doesn't account for
>> the extra possible headers.
>
>Drat.  Good spotting.  Will fix this too.

I think this bug is deeper :)

-Tony

From bill at parducci.net  Tue Mar 11 17:29:26 2003
From: bill at parducci.net (bill parducci)
Date: Tue Mar 11 20:29:33 2003
Subject: [Spambayes] weighting question
Message-ID: <3E6E8D76.4060206@parducci.net>

is there currently a way to weight the smtp envelope information (in
particular, 'mail from') independently from the payload of the message?

the reason i ask is that if someone that i work with forwards an obvious
spam note to me with little preamble (e.g. 'note the use of underscores')
the remaining content of the message forces it right into the spam bucket.
i have two other cases where retraining doesn't seem to be improving false
positives as well.

thanks

b

p.s. yes, i know that 'mail from' isn't a reliable authentication assertion.
:o)

From tim.one at comcast.net  Tue Mar 11 20:33:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 11 20:34:04 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A5@its-xchg4.massey.ac.nz>
Message-ID:

[Meyer, Tony]
> ...
> The change I made was to replace line 245 ("prob = (S-H + 1.0) /
> 2.0") of classifier.py with:
> """
> from math import log
> if H == 0:
>     H = 0.00000001
> if S == 0:
>     S = 0.00000001
> prob = ((-(log(S) - log(H)))/350) + 0.5
> """

Apart from the technical glitches you bumped into, there's a reason we don't
want to combine H and S via any expression of this form.  Because the
difference of logs is the log of the quotient, and the negation of a log is
the log of the reciprocal, the heart of this expression is log(H/S), and
it's the H/S part that's undesirable.

If, say, H is 0.99, and S is 0.0099, H/S is 100 and there's no problem with
concluding that we're sure the msg is ham.

But suppose H is .0001 and S is .000001.  Then H/S is also 100, but it's
plain nuts to be exactly as sure that the msg is ham:  H on its own says the
system thinks there's virtually no chance the msg looks like what it's been
taught about ham, and the low S says the same about what it's been taught
about spam:  it doesn't look like either, so Unsure is the "proper"
response.  If the system *had* to guess one or the other, then ham is the
best guess it can make, but H on its own says the system doesn't believe
that guess.  (Note that in pH calculations, small magnitudes don't "say"
anything significant -- a factor of 100 is equally significant in that
domain no matter how small the input magnitudes.)

Rob Hooft crafted the simple combining formula we use to give a high
combined score in the first example and a solid Unsure in the second
example.  We used a different expression involving a ratio before that, and
examples of the second kind are exactly where it screwed up.  Don't want to
do that again.
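
The two examples above can be run through both combining rules to see the difference. This is a small sketch using only the formulas quoted in the thread, `log(H/S)` and `(S-H + 1.0) / 2.0`; the function names are made up for illustration.

```python
import math


def ratio_score(H, S):
    """The heart of the pR-style expression: depends only on the ratio H/S."""
    return math.log(H / S)


def combined(H, S):
    """The combining formula quoted from classifier.py: (S - H + 1) / 2."""
    return (S - H + 1.0) / 2.0


cases = [
    ("looks like ham", 0.99, 0.0099),
    ("looks like neither", 0.0001, 0.000001),
]

for label, H, S in cases:
    print("%-20s log(H/S) = %.3f   (S-H+1)/2 = %.4f"
          % (label, ratio_score(H, S), combined(H, S)))
```

Both cases have H/S = 100, so the ratio-based score cannot tell them apart. The combined score puts the first case close to 0 (confidently ham on the 0-to-1 spam scale) and the second almost exactly at 0.5, which lands in Unsure.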
BTW, and IIRC, cmp.py never got updated to deal sensibly with unsures. If that's right, it shouldn't be used except when spam_cutoff == ham_cutoff. Then you've got a two-outcome classifier (no unsures), and cmp.py won't "forget" any msgs. From skip at pobox.com Tue Mar 11 19:41:54 2003 From: skip at pobox.com (Skip Montanaro) Date: Tue Mar 11 20:42:21 2003 Subject: [Spambayes] weighting question In-Reply-To: <3E6E8D76.4060206@parducci.net> References: <3E6E8D76.4060206@parducci.net> Message-ID: <15982.36962.234270.381858@montanaro.dyndns.org> bill> is there currently a way to weight the smtp envelope information bill> (in particular, 'mail from') independently from the payload of the bill> message? You can set this in your Options.py file: [Tokenizer] address_headers: from to cc You can add other headers (Sender is used by the Outlook plugin) which contain addresses too. Skip From bill at parducci.net Tue Mar 11 18:00:55 2003 From: bill at parducci.net (bill parducci) Date: Tue Mar 11 21:00:59 2003 Subject: [Spambayes] weighting question In-Reply-To: <15982.36962.234270.381858@montanaro.dyndns.org> References: <3E6E8D76.4060206@parducci.net> <15982.36962.234270.381858@montanaro.dyndns.org> Message-ID: <3E6E94D7.7040106@parducci.net> Skip Montanaro wrote: > You can set this in your Options.py file: > > [Tokenizer] > address_headers: from to cc [Tokenizer] address_headers: from is the default value on my system, which makes me think that spambayes is already considering the 'mail from' information. (unless another flag needs to be set to enable this: basic_header_tokenize?) if that is the case then i would think retraining with the original in the spam mbox and the forwarded version in an ham mbox should score the sender (forwarder) very strongly HAM, right? (of course, to do this i would have to hand hack the mbox file, not having the original spam.) if so, then the question becomes is that enough to qualify subsequent messages from sender as ham? 
and if it isn't, then i am back full circle to wanting to be able to weight it separately from the message payload! :o) thanks b From T.A.Meyer at massey.ac.nz Wed Mar 12 15:52:38 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 21:54:04 2003 Subject: [Spambayes] Perhaps a level header would be useful? Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8AD@its-xchg4.massey.ac.nz> > Apart from the technical glitches you bumped into, there's a > reason we don't > want to combine H and S via any expression of this form. [technical explanation cut] Thanks for that Tim. It's been a few years since I've done maths...I was playing around with number (a non-broken version), and came to much the same conclusion myself, but without the nice theory. > BTW, and IIRC, cmp.py never got updated to deal sensibly with > unsures. If > that's right, it shouldn't be used except when spam_cutoff == > ham_cutoff. > Then you've got a two-outcome classifier (no unsures), and > cmp.py won't "forget" any msgs. I think this is still the case. If there is going to be a minor increase in testing again, which is the better option, to have ham_cutoff==spam_cutoff, or to update to reveal unsure info? (I suspect the latter). Thanks again. [Must think more before posting. Must think more before posting. Must think...] =Tony Meyer From joel at prettyhipprogramming.com Tue Mar 11 21:57:15 2003 From: joel at prettyhipprogramming.com (Joel Ricker) Date: Tue Mar 11 22:00:04 2003 Subject: [Spambayes] [OT] Converting Outlook MSGs to mbox Message-ID: <000201c2e843$17244120$c9e03942@nc.rr.com> Hi all, I wanted to know if anyone had any experience with converting Outlook's e-mail message format into mbox or other format. My problem is that I've tried using the Outlook Spambayes add-in but it didn't quite work. It's more than likely my installation of Outlook rather than the add-in. My Outlook install flakes out from time to time. 
The plugin installed ok and I was able to define my corpus and it started to working. However, the next time I brought up Outlook, the plug-in was gone and it won't let me reinstall for some reason. So for stability's sake and plus as a solution in case I ever decide to use a different e-mail program, I decided to try the pop3proxy script. I've got two large folders with my (corpii?) but Outlook appears to use some proprietary storage of some sort for the e-mails. I can save the messages as text, as long as I do them one at a time and it doesn't store all of the message headers, only From, To, Subject, and Date. Has anybody seen a converter for this MSG format? Thanks Joel From T.A.Meyer at massey.ac.nz Wed Mar 12 16:09:39 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 22:10:20 2003 Subject: [Spambayes] Perhaps a level header would be useful? Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8AE@its-xchg4.massey.ac.nz> > > > There may be some bugs lurking, I'm now getting > >> "X-Spambayes-Classification: ham" in the body of my emails. > > > >I will check this ASAP. > This fixes it. [patch] Thanks. My cvs access is a bit spotty today (actually I think it's my network access in general), but it should hopefully go through soon. This was me not reading the regex closely enough when I updated things. =Tony Meyer From T.A.Meyer at massey.ac.nz Wed Mar 12 16:44:15 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 22:44:53 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD81@its-xchg4.massey.ac.nz> Is Richie around at the moment? I get the feeling he would be most help here. TimS maybe? An issue that Tony Lownds brought up is that pop3proxy currently has HEADER_EXAMPLE, which is used in response to a pop3 STAT or LIST command to calculate the new size of the message, in case the mailer needs to know, and asks. With the new headers, this is a problem. 
Level and MailId are easy enough, but evidence (i.e. hammie_debug) could be just about any size.

What's the collective answer? I do recall from previous messages that Richie was originally much more careful about making things the right size ("No " and "Yes", for example), and then IIRC decided to give this up and fix it if anyone broke. Do any mailers use STAT or LIST for something important like allocating a certain amount of memory?

Advice appreciated :)

Along similar lines, HEADER_FORMAT used to define the header format, which is now hard coded. Should the decision be to wipe HEADER_FORMAT out, or to have a HEADER_FORMAT for each header? (This could go into the header module that TimS is building). It stopped being used at r1.36, without any comments in the checkin about why (it was a big checkin, though).

=Tony Meyer

From tim at fourstonesExpressions.com Tue Mar 11 21:52:13 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 22:53:26 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD81@its-xchg4.massey.ac.nz>
Message-ID:

3/11/2003 9:44:15 PM, "Meyer, Tony" wrote:

>An issue that Tony Lownds brought up is that pop3proxy currently has HEADER_EXAMPLE, which is used in response to a pop3 STAT or LIST command to calculate the new size of the message, in case the mailer needs to know, and asks.
>
>With the new headers, this is a problem. Level and MailId are easy enough, but evidence (i.e. hammie_debug) could be just about any size.
>
>What's the collective answer? I do recall from previous messages that Richie was originally much more careful about making things the right size ("No " and "Yes", for example), and then IIRC decided to give this up and fix it if anyone broke.

I was noticing this very thing today as I started preparing to do that header module thing.
This is a problem, because AFAIK mailers expect the pop3proxy to give them a buffer size when they do a list, or stat. One idea here is to add the headers willy-nilly, then determine the length of the resulting header text. Another would be to place an upper boundary on how much text we will add to the headers, report a header that size, and make sure we never exceed that size. If we do, we could drop headers in some sort of priority order until we're under the limit. I like the first idea better, but I'm not sure it works with STAT. You *could* do a test with all those mailers you have installed and see if any of them *use* stat... If you set options.verbose = True, pop3proxy produces a log of all the interactions it proxys... >Along similar lines, HEADER_FORMAT used to define the header format, which is now hard coded. Should the decision be to wipe HEADER_FORMAT out, or to have a HEADER_FORMAT for each header? (This could go into the header module that TimS is building). It stoped being used at r1.36, without any comments in the checkin about why (it was a big checkin, though). I don't see that there's any value to this field... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From T.A.Meyer at massey.ac.nz Wed Mar 12 17:00:41 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 23:01:21 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD82@its-xchg4.massey.ac.nz> It seems to me that saying that a message is bigger than it actually is is *not* a problem, but the reverse would be (if, for example, memory was set aside for it). So X-Spambayes-MailID is easy, X-Spambayes-Level is easy, X-Spambayes-Prob is easy, and X-Spambayes-Classification is easy. X-Spambayes-Evidence is the tricky one. > One idea here is to add the headers willy-nilly, then > determine the length of the resulting header text. I'm not sure I get what you are suggesting here. 
> Another would be to place an upper boundary on how > much text we will add to the headers, report a header that > size, and make sure we never exceed that size. If we do, > we could drop headers in some sort of priority order > until we're under the limit. I guess we could limit the number of words in the evidence, and if more are present, just not include them (or include a "...", or "too many words! go to the web ui!" message). > You *could* do a test with all those mailers you have > installed and see if any of them *use* stat... If you set > options.verbose = True, pop3proxy produces a log of all > the interactions it proxys... I supposed I could, at that. [HEADER_FORMAT] > I don't see that there's any value to this field... Nor do I. +1 to deleting it, then. It's certainly been a long time since it was used in pop3proxy. =Tony Meyer From tim at fourstonesExpressions.com Tue Mar 11 22:12:55 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Mar 11 23:13:01 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD82@its-xchg4.massey.ac.nz> Message-ID: 3/11/2003 10:00:41 PM, "Meyer, Tony" wrote: >It seems to me that saying that a message is bigger than it actually is is *not* a problem, but the reverse would be (if, for example, memory was set aside for it). So X-Spambayes-MailID is easy, X-Spambayes-Level is easy, X- Spambayes-Prob is easy, and X-Spambayes-Classification is easy. X-Spambayes- Evidence is the tricky one. > >> One idea here is to add the headers willy-nilly, then >> determine the length of the resulting header text. > >I'm not sure I get what you are suggesting here. Yeah, I'm suggesting simply adding the headers to the message, then reporting how big the resulting message is. It's a bit of a hack, but it'll be accurate. Upon further rumination, though, it won't work with STAT, cause you don't have the message to add headers to. 
So STAT is gonna have to make an estimate. But upon further further rumination, I seriously doubt that mailers actually use STAT to allocate buffer space, for example. That doesn't make much sense to me. Probably more to simply put a mail size in the mailer ui, or to determine if it's larger than some threshold value set in the mailer configuration as the maximum size of a mail to download, etc. etc. etc. >I guess we could limit the number of words in the evidence, and if more are present, just not include them (or include a "...", or "too many words! go to the web ui!" message). That works, too. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim at fourstonesExpressions.com Tue Mar 11 22:21:46 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Mar 11 23:21:53 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B2@its-xchg4.massey.ac.nz> Message-ID: 3/11/2003 10:16:01 PM, "Meyer, Tony" wrote: > >I'll go through and see if the mailers I have installed use STAT or LIST. But I won't get time until tomorrow (NZ time) to get to this. I'll update the list when I've got to it. Well, they certainly use LIST, and I'm relatively certain they use STAT. My Opera 6.05 mailer uses 'em both. Here's the start of a recent pop3proxy log: OK Cubic Circle's v1.31 1998/05/13 POP3 ready <4715000052b56e3e@mail.powweb.com> USER timstone +OK timstone selected PASS f04g0t +OK Congratulations! STAT +OK 1 196987 UIDL +OK But remember to DELETE messages REGULARLY 1 2e3009d532010300 . LIST 1 +OK 51 196937 RETR 1 +OK 196937 octets c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim.one at comcast.net Tue Mar 11 23:24:49 2003 From: tim.one at comcast.net (Tim Peters) Date: Tue Mar 11 23:25:27 2003 Subject: [Spambayes] Perhaps a level header would be useful? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8AD@its-xchg4.massey.ac.nz> Message-ID: [Tim] > IIRC, cmp.py never got updated to deal sensibly with > unsures. If that's right, it shouldn't be used except when spam_cutoff > == ham_cutoff. Then you've got a two-outcome classifier (no unsures), > and cmp.py won't "forget" any msgs. [Meyer, Tony] > I think this is still the case. If there is going to be a minor > increase in testing again, which is the better option, to have > ham_cutoff==spam_cutoff, or to update to reveal unsure info? (I > suspect the latter). It depends on what you're trying to accomplish, of course . Updating cmp.py is a project, because it never intended to deal with unsures, and they don't fit well with its very detailed analysis of FP and FN. Note that the less-exhaustive table.py *does* deal with unsures already, and with automating cutoff analysis (based on your histogram option settings). After Alex invented table.py, I rarely used cmp.py again except to zero in on changes with very small effects. Using table.py, you can skip the rates.py step(s) too (table.py works directly with the output files produced by timtest.py (if you must) or timcv.py (preferred)). > Thanks again. [Must think more before posting. Must think more before > posting. Must think...] You're doing fine! Thinking is overrated , and if I can't remember why we did something one way instead of another, we should probably throw it out and start that part over again. From T.A.Meyer at massey.ac.nz Wed Mar 12 17:24:08 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 23:26:44 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz> > Well, they certainly use LIST, and I'm relatively certain > they use STAT. My > Opera 6.05 mailer uses 'em both. Here's the start of a > recent pop3proxy log: Good (now I don't need to test :). 
We've got no way of knowing what the mailers do with this information, really (apart from nice open source ones ;). So is it: (a) put limits on the size of our headers (b) no limits, and if someone reports a bug, then we reconsider things :) =Tony Meyer From skybow at hotkey.net.au Wed Mar 12 15:29:58 2003 From: skybow at hotkey.net.au (Geoff Moyle) Date: Tue Mar 11 23:27:20 2003 Subject: [Spambayes] Spambayes installation problem Message-ID: Cannot get spambayes to install using drive H: Laptop win 2000 installs to outlook 2000 ok (c:) main machine windows 2000 does not appear on outlook. installation appears to go ok Geoff Moyle Knowledge Engineer --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.461 / Virus Database: 260 - Release Date: 10/03/2003 From tim at fourstonesExpressions.com Tue Mar 11 22:29:40 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Mar 11 23:29:55 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz> Message-ID: 3/11/2003 10:24:08 PM, "Meyer, Tony" wrote: > >Good (now I don't need to test :). Why does simply figuring things out not occur to me earlier? Sometimes I'm just a stoopidhead. > >So is it: >(a) put limits on the size of our headers >(b) no limits, and if someone reports a bug, then we reconsider things :) I vote (b) :) Long live user testing! c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From T.A.Meyer at massey.ac.nz Wed Mar 12 17:31:47 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 23:32:51 2003 Subject: [Spambayes] Spambayes installation problem Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B4@its-xchg4.massey.ac.nz> > Cannot get spambayes to install using drive H: > Laptop win 2000 installs to outlook 2000 ok (c:) > main machine windows 2000 does not appear on outlook. 
> installation appears to go ok I can't think why the drive would matter. I've installed the outlook addin from C: and D: (both partitions on a single drive), E: (another drive), and H: (a network drive) - with Outlook on D:. What exactly goes wrong? What do you mean by "windows 2000 does not appear on outlook"? Nothing appears when you open Outlook? What version of spambayes are you using? The latest CVS? Alpha1? Alpha2? The Outlook plugin installer? =Tony Meyer From T.A.Meyer at massey.ac.nz Wed Mar 12 17:33:45 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 11 23:34:26 2003 Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B5@its-xchg4.massey.ac.nz> > >So is it: > >(a) put limits on the size of our headers > >(b) no limits, and if someone reports a bug, then we > reconsider things :) > > I vote (b) :) Long live user testing! +1 for me too. (Why else is this alpha software? ;) Unless anyone complains, I'll only make one little change - I'll fix it so that an approximate size of the level, prob and mailid headers are added (if those options are checked), but ignore any effect of enabling the evidence header. It's all off by default, anyway. =Tony Meyer From popiel at wolfskeep.com Tue Mar 11 21:59:28 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel)
Date: Wed Mar 12 00:59:34 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: Message from "Meyer, Tony" <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
Message-ID: <20030312055928.983EE2DEA0@cashew.wolfskeep.com>

In message: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
"Meyer, Tony" writes:
>
>So is it:
>(a) put limits on the size of our headers
>(b) no limits, and if someone reports a bug, then we reconsider things
>:)

Or (c) when we get a STAT or LIST or something which requires reporting the size of the message, we could fetch the message and analyze it and report the proper size after headers have been added...

Of course, I'm not volunteering to code it, and I have no idea whether that would make the proxy too slow/expensive for people on dialups...

- Alex

From tim at fourstonesExpressions.com Wed Mar 12 07:07:03 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 12 08:07:09 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: <20030312055928.983EE2DEA0@cashew.wolfskeep.com>
Message-ID:

3/11/2003 11:59:28 PM, "T. Alexander Popiel" wrote:

>In message: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
> "Meyer, Tony" writes:
>>
>>So is it:
>>(a) put limits on the size of our headers
>>(b) no limits, and if someone reports a bug, then we reconsider things
>:)
>
>Or (c) when we get a STAT or LIST or something which requires
>reporting the size of the message, we could fetch the message
>and analyze it and report the proper size after headers have
>been added...
>
>Of course, I'm not volunteering to code it, and I have no
>idea whether that would make the proxy too slow/expensive
>for people on dialups...

One of the points of STAT is to not have to fetch mails that are excessively large unless the user wishes it.
Fetching the mail on a stat really violates the protocol. > >- Alex > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From mhammond at skippinet.com.au Thu Mar 13 00:26:32 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 12 08:27:35 2003 Subject: [Spambayes] Windows service version of pop3proxy available Message-ID: I couldn't resist :) See the new "windows" directory, and read the comments in pop3proxy_service.py. Windows 2000/XP only - no Win9x support. My intention is to create a single Windows installer for both pop3proxy and the Outlook plugin. IMO, a "background" version of pop3proxy for Win9x would be good (so we can call it a "service" on all Windows versions). Let me know if you are interested in helping. Mark. From T.A.Meyer at massey.ac.nz Thu Mar 13 11:13:03 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 12 17:17:20 2003 Subject: [Spambayes] Windows service version of pop3proxy available Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B7@its-xchg4.massey.ac.nz> > I couldn't resist :) See the new "windows" directory, and > read the comments > in pop3proxy_service.py. Windows 2000/XP only - no Win9x support. If there's now a windows directory for all windows specific stuff, does this mean that the Outlook directory will move into that? Given that the plugin (apparently) works for Outlook 2k2, the directory could be renamed, anyway. Outlook isn't available on any other platforms, right? (I believe MS have said that the mac Outlook is being dropped and exchange support being built into entourage). Just a thought... =Tony Meyer From T.A.Meyer at massey.ac.nz Thu Mar 13 11:19:41 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 12 17:20:19 2003 Subject: [Spambayes] Spambayes installation problem Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B8@its-xchg4.massey.ac.nz> > using the windows installer from mark's web site. not sure of > version .. > downloaded it yesterday. 
is installed in program files
> directory but does not appear on outlook.

Hopefully Mark will chip in, because he'll have more of an idea about this than I will. (Mark: can any trace information be obtained with the installer version? do people have to have Python installed to be able to get this?)

Suggestions:

* This is definitely *Outlook*, and not *Outlook Express*, right? (I just have to check...)

* Are you displaying the "Standard" toolbar? This is where the plugin buttons will appear.

* Uninstall the plugin. Reset the toolbars in Outlook. Reinstall the toolbars.

* Choose "Customize current view" in the inbox. Look at the "fields defined in this folder" section. If there is a "Spam" column there, add it to the view. If it is there and scores appear for new mail, then it's working, but not showing you the GUI.

* Do you have any other Outlook plugins installed? If so, what?

Hope this helps.

=Tony Meyer

From mhammond at skippinet.com.au Thu Mar 13 09:44:20 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 12 17:55:11 2003
Subject: [Spambayes] Windows service version of pop3proxy available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B7@its-xchg4.massey.ac.nz>
Message-ID:

> If there's now a windows directory for all windows specific
> stuff, does this mean that the Outlook directory will move into
> that? Given that the plugin (apparently) works for Outlook 2k2,
> the directory could be renamed, anyway. Outlook isn't available
> on any other platforms, right? (I believe MS have said that the
> mac Outlook is being dropped and exchange support being built
> into entourage).

I thought of the top-level "windows" directory being a kind of "helper", but not containing complete applications. Eg, pop3proxy_service just hooks on the back of pop3proxy, but I don't think it makes sense to have in the top-level directory. Things like the installer script etc also make sense here.
The outlook plugin is a large, stand-alone application, and IMO should be in its own directory. I don't really mind if this was moved to *under* the windows directory, but I see no real need. I'd still much rather see the top-level directory cleaned up even more, with pop3proxy and hammie getting their own, application specific directories. In this case, pop3proxy_service in the pop3proxy directory makes more sense, and this "windows" directory could be replaced with one simply for the installer. Mark. From T.A.Meyer at massey.ac.nz Thu Mar 13 12:14:56 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 12 18:16:35 2003 Subject: [Spambayes] Windows service version of pop3proxy available Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B9@its-xchg4.massey.ac.nz> > I thought of the top-level "windows" directory being a kind > of "helper", but not containing complete applications. Ah, I see. > The outlook plugin is a large, stand-alone application, and > IMO should be in its own directory. I don't really mind > if this was moved to *under* the windows directory, but > I see no real need. Fair enough, I get what you mean now. Any thoughts about renaming it to Outlook, rather than Outlook2000? Not worth the bother? > I'd still much rather see the top-level directory cleaned up > even more, with pop3proxy and hammie getting their own, > application specific directories. So would I, but I've read the debates about this in the past, and I'm staying clear ;) =Tony Meyer From T.A.Meyer at massey.ac.nz Thu Mar 13 13:22:21 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 12 19:23:48 2003 Subject: [Spambayes] Spambayes installation problem Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C2@its-xchg4.massey.ac.nz> > Its definitely outlook > yep std toolbar > yes reinstalled > yes I do have another plugin installed which is the avg virus > checker. this > is the only diff between machines so I am assuming that this > is probably the problem. 
Well, I installed version 6.0 (build 645) of the free AVG virus checker, then the plugin, and it seems ok for me. If you have a different version of the checker, then it still might be that. Mark: FYI the AVG virus checker does integrate with Outlook - a button appears on the standard toolbar. > Any advice Two things: * Wait for Mark to come up with the answer ;) Seriously, while I know a little about the Outlook plugin, I know almost nothing about the installer, and he knows everything ;) about both. * See if the CVS version works for you. This is much more complicated, though - you'll need Python installed, and to do an anonymous checkout of the source. Sorry I haven't been more use (yet!). =Tony Meyer --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.459 / Virus Database: 258 - Release Date: 25/02/2003 From mhammond at skippinet.com.au Thu Mar 13 11:34:45 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 12 19:35:31 2003 Subject: [Spambayes] Spambayes installation problem In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C2@its-xchg4.massey.ac.nz> Message-ID: Sorry I haven't replied yet. I have no clue :( The existing plugin still redirects all messages to Pythonwin, so to see any debug output you will need Python+win32all. I think I will hack together a simple "redirect log to file" when run from the installer. I'm currently upgrading the HTML so that users who stumble on it and aren't real geeks can get it going easily. Mark. > -----Original Message----- > From: Meyer, Tony [mailto:T.A.Meyer@massey.ac.nz] > Sent: Thursday, 13 March 2003 11:22 AM > To: Geoff Moyle > Cc: Mark Hammond; spambayes@python.org > Subject: RE: [Spambayes] Spambayes installation problem > > > > Its definitely outlook > > yep std toolbar > > yes reinstalled > > yes I do have another plugin installed which is the avg virus > > checker. 
this > > is the only diff between machines so I am assuming that this > > is probably the problem. > > Well, I installed version 6.0 (build 645) of the free AVG virus > checker, then the plugin, and it seems ok for me. If you have a > different version of the checker, then it still might be that. > > Mark: FYI the AVG virus checker does integrate with Outlook - a > button appears on the standard toolbar. > > > Any advice > > Two things: > * Wait for Mark to come up with the answer ;) Seriously, while I > know a little about the Outlook plugin, I know almost nothing > about the installer, and he knows everything ;) about both. > > * See if the CVS version works for you. This is much more > complicated, though - you'll need Python installed, and to do an > anonymous checkout of the source. > > Sorry I haven't been more use (yet!). > > =Tony Meyer > > --- > Outgoing mail is certified Virus Free. > Checked by AVG anti-virus system (http://www.grisoft.com). > Version: 6.0.459 / Virus Database: 258 - Release Date: 25/02/2003 > From T.A.Meyer at massey.ac.nz Thu Mar 13 16:23:40 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 12 22:26:53 2003 Subject: [Spambayes] Spambayes installation problem Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C6@its-xchg4.massey.ac.nz> > I have no clue :( The existing > plugin still redirects all messages to Pythonwin, so to see > any debug output you will > need Python+win32all. Geoff: if you're willing, then installing Python and win32all would mean that we could look at whatever error the plugin is throwing up (assuming that it is!). The longer I run with the thing installed, the more I suspect that it is the virus program. > I think I will hack together a simple "redirect log to file" > when run from the installer. Sounds like a good plan. =Tony Meyer --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). 
Version: 6.0.459 / Virus Database: 258 - Release Date: 25/02/2003

From T.A.Meyer at massey.ac.nz Thu Mar 13 17:29:28 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 23:30:07 2003
Subject: [Spambayes] UpdatableConfigParser
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD87@its-xchg4.massey.ac.nz>

Those that were paying attention will recall a discussion a couple of weeks back about config files, particularly updating them. I've just committed a new module - UpdatableConfigParser. This extends ConfigParser so that config files can be updated (retaining whitespace and comments), rather than simply rewritten. It should work fine with multiple config files, like ConfigParser, although there are issues to consider when doing so.

Only those using OptionConfig to change their options should notice any difference at all. For everyone else the functions are either almost identical to ConfigParser, or are the ConfigParser functions. Those that do use OptionConfig will now be able to retain comments and whitespace in their ini files. The Outlook plugin *might* also someday use this module.

I've tried to test this as thoroughly as possible, but no doubt as soon as I commit it, there will be an error. I'll try to get this fixed ASAP. Anyone wanting to do more testing with multiple files (there are so many possibilities!) is very welcome to do so (OptionConfig only works with one file, so this will not actually affect anyone currently using it).

The __doc__ has a lot more information. I would hope this would be useful to any UI that allows modification of config files (are there any apart from OptionConfig at the moment?).
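
For readers unfamiliar with the idea of updating a config file in place, the following is a rough Python sketch of the general technique. It is not the UpdatableConfigParser code and does not use its API; `update_option` and `OPTION_RE` are hypothetical names, and the sample file contents are illustrative (only `[Tokenizer]` / `address_headers` appear earlier in this thread). The point is that only the matching option line is rewritten, while comments, blank lines and other options are copied through untouched.

```python
import re

# An option line: a name not starting with whitespace or a comment marker,
# followed by ':' or '=' (the delimiters ConfigParser accepts by default).
OPTION_RE = re.compile(r"([^\s:=#;][^:=]*?)([ \t]*[:=][ \t]*)")

def update_option(text, section, option, value):
    """Return `text` with `option` in `[section]` set to `value`.

    Hypothetical helper for illustration only. If the option is not
    present in that section, the text is returned unchanged.
    """
    out = []
    current = None
    for line in text.splitlines(keepends=True):
        header = re.match(r"\[([^\]]+)\]", line)
        if header:
            current = header.group(1)
        elif current == section:
            m = OPTION_RE.match(line)
            if m and m.group(1).lower() == option.lower():
                # Keep the original name, delimiter and line ending.
                eol = line[len(line.rstrip("\r\n")):]
                line = m.group(1) + m.group(2) + str(value) + eol
        out.append(line)
    return "".join(out)

sample = (
    "# my settings\n"
    "[Tokenizer]\n"
    "; which headers to mine for addresses\n"
    "address_headers: from\n"
    "\n"
    "[pop3proxy]\n"
    "verbose = False\n"
)
print(update_option(sample, "Tokenizer", "address_headers", "from to cc"))
```

A full implementation would also have to handle continuation lines, options that are missing from the file, and several files layered on top of each other, which is where the multiple-file issues mentioned above come in.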
=Tony Meyer From T.A.Meyer at massey.ac.nz Thu Mar 13 17:45:12 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 12 23:45:53 2003 Subject: [Spambayes] Storing Options Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> Ignoring the fact that it's scattered throughout the code base, does anyone like the current method of getting options? What I personally do not like (in order of dislike): * That sections are ignored, leading to names like pop3proxy_servers. * Updating the options object does not update the underlying ConfigParser (now UpdatableConfigParser ;) object, so a write() (or update()) will not write the updated values. * Having all the defaults in Options.py, rather than a much simpler default config file (IIRC the reason for folding the file in was so that it didn't matter which directory you were running from, but the envar should take care of that, yes?) I know I'm not completely alone here, but I'd like to know if there are lots of people (or even a few of the right people ;) that like it as it is. If people (a) don't care, or (b) also don't like it, then I'll try and come up with a better scheme (and present it before making any changes!). =Tony Meyer From popiel at wolfskeep.com Wed Mar 12 21:34:31 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Mar 13 00:34:34 2003 Subject: [Spambayes] Storing Options In-Reply-To: Message from "Meyer, Tony" <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> Message-ID: <20030313053431.4B4962DE8A@cashew.wolfskeep.com> In message: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> "Meyer, Tony" writes: [talking about handling options] >If people (a) don't care, or (b) also don't like it, then I'll try >and come up with a better scheme (and present it before making any >changes!). 
+1 As a person who juggles many different options sets for doing testing, I would ask that one of the design constraints be to make such juggling reasonably easy. - Alex From anthony at interlink.com.au Thu Mar 13 21:39:53 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Thu Mar 13 05:40:45 2003 Subject: [Spambayes] wanted: malformed email messages. Message-ID: <200303131040.h2DAdrq18384@localhost.localdomain> If you've got spam that breaks python's email parser in some way, don't just gripe - send it to me. I'm going to make a fairly serious go at seeing what I can do to make the email parser more robust, and also make sure it notes what it had to do in order to get the message to parse (these notes will almost certainly be very good clues). Please don't just forward them, unless you're sure your mailer can do _correct_ message/rfc822 encapsulation. If you're not sure, mail a tarball or zipfile containing the message(s). Thanks, Anthony From mhammond at skippinet.com.au Thu Mar 13 21:51:00 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Mar 13 05:51:42 2003 Subject: [Spambayes] Storing Options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> Message-ID: [Tony] ... > I know I'm not completely alone here, but I'd like to know if > there are lots of people (or even a few of the right people ;) > that like it as it is. If people (a) don't care, or (b) also > don't like it, then I'll try and come up with a better scheme > (and present it before making any changes!). I'm certainly +1 on the concept. I think you should go for it! We are still alpha, so we can get away with lots. Now is better, too - the longer we go, the harder it gets. I've already discovered that splitting the database from an inheritance model is much harder than it looks - largely because there is already a kind of decay in the code - some __getstate__, some explicit, some pickling, some zodb, etc. 
It seems the math is done - if Tim can't measure improvement, I'm sure as hell that I can't <0.0 wink>. So the more architecture stuff we can do now, the better, and the more future users of this technology can benefit. Mark. From spambayes at rodland.no Thu Mar 13 12:16:20 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Thu Mar 13 06:16:33 2003 Subject: [Spambayes] mails that fail when filtered in outlook In-Reply-To: Message-ID: I've got a couple of emails (already fetched by outlook) which make the filtering fail. I'm unsure how to report this as a bug. If I forward the mails, the errors seem to go away - probably outlook "fixes" the email so that the errors disappear. There seems to be a different error for these two mails - both listed at the end of this mail. I've no idea as to what makes the first one fail. The other one seems to have a '\r\n' included in the subject. I guess this is not good, but it shouldn't make the plugin fail, should it? Also - if manual filtering is started, and one e-mail fails, the rest of the filtering seems to be skipped. Couldn't the filtering continue, skipping the message which failed? I'd appreciate any comments on these. I'll be happy to post some or all of these as bugs, but - as I said - I'm unsure how to include a message for reproducing the errors. ERROR 1: Error getting property from stream (-2147221233, 'OLE error 0x8004010f', None, None) pythoncom error: Python error invoking COM method.
Traceback (most recent call last): File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_ return self._invoke_(dispid, lcid, wFlags, args) File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_ return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None) File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py", line 541, in _invokeex_ return apply(func, args) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 160, in OnClick self.handler(*self.args) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 225, in ShowClues score, clues = mgr.score(msgstore_message, evidence=True) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line 439, in score email = msg.GetEmailPackageObject() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 639, in GetEmailPackageObject text = self._GetMessageText() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 582, in _GetMessageText assert msg.is_multipart() exceptions.AssertionError: ERROR 2: FAILED to create email.message from: 'X-Exchange-Message: true\nSubject: RE: Les p\xe5 www.aftenposten.no: \r\nSinte brennkopper herjer\r\n\nTo: reidar.rodland@hydro.com\nCC: Mona R\xf8dland\n\n\nDet var nok "bare" - som legen sa - myggstikk. Det som er litt skitt er at martin, siste natta p\xe5 Korsika ble igjen angrepet noe s\xe5 til de grader. Vi kom ut av tellinga etter \xe5 ha passert 60 stikk p\xe5 hode og hender. Det ser ikke godt ut, og er det sikkert heller ikke. Han klorer seg til blods i \xf8ret, og skirker en del... 
Mona er ogs\xe5 lettere angrepet, mens jeg stort sett har sluppet ganske billig fra det....\r\n\r\nVi driver med myggreduserende tiltak om dagen:\r\n-Eurax\r\n-sitronkonsentrat\r\n-ting i stikkkontakter som skal hodle dem borte\r\n-lemmer for vinduer\r\n-myggnett\r\n-etc....\r\n\r\ndet er ikke greit det er med myggen; men ellers er vi ved meget godt mot etter at vi er tilbake i Grasse. Det var nok ganske smart \xe5 v\xe6re her noen dager p\xe5 forh\xe5nd, for p\xe5 mange m\xe5ter f\xf8ltes som \xe5 komme hjem i stedet for bare til nok et "nytt sted med nye rutiner \xe5 etablere". Vi slapper av. I morgen skal vi p\xe5 marked i Pleymenade!\r\n\r\n\r\n\r\nF\r\n\r\n\r\n--\r\nFredrik R\xf8dland ASTON Technology Phone: +47 23 28 40 17\r\nTechnical Architect Stocknet Fax : +47 910 73 621\r\nFredrik.Rodland@aston.no http://www.aston.no Mob : +47 992 19 817\r\n \r\n\r\n> -----Original Message-----\r\n> From: reidar.rodland@hydro.com [mailto:reidar.rodland@hydro.com]\r\n> Sent: 12. september 2002 08:43\r\n> To: frodland@aston.no\r\n> Subject: Les p\xe5 www.aftenposten.no: Sinte brennkopper herjer \r\n> \r\n> \r\n> Dette er et tips som reidar.rodland@hydro.com har sendt fra \r\n> Aftenposten Nettutgaven.\r\n> \r\n> \r\n> Sinte brennkopper herjer\r\n> \r\n> \r\n> En brennkoppeepidemi er i ferd med \xe5 n\xe5 alle deler av landet. \r\n> \xc5rets brennkopper er langt mer aggressive enn vanlig og mye \r\n> vanskeligere \xe5 behandle.\r\n> \r\n> Les mer her: \r\n> http://www.aftenposten.no/forbruker/helse/article.jhtml?articleID=397867\r\n > \r\n> -------------------------------------------------\r\n> Beskjed fra reidar.rodland@hydro.com:\r\n> Ref les varicelles. Til info!. Brennkopper starter gjerne med \r\n> vannkopper og s\xe5 g\xe5r det betennelse i s\xe5rene.\r\n> http://www.aftenposten.no \r\n> P\r\n> -------------------------------------------------\r\n> ' pythoncom error: Python error invoking COM method. 
Traceback (most recent call last): File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_ return self._invoke_(dispid, lcid, wFlags, args) File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_ return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None) File "C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py", line 541, in _invokeex_ return apply(func, args) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 160, in OnClick self.handler(*self.args) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 225, in ShowClues score, clues = mgr.score(msgstore_message, evidence=True) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line 439, in score email = msg.GetEmailPackageObject() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 641, in GetEmailPackageObject msg = email.message_from_string(text) File "C:\PROGRA~1\_DEV\Python22\lib\email\__init__.py", line 52, in message_from_string return Parser(_class, strict=strict).parsestr(s) File "C:\PROGRA~1\_DEV\Python22\lib\email\Parser.py", line 75, in parsestr return self.parse(StringIO(text), headersonly=headersonly) File "C:\PROGRA~1\_DEV\Python22\lib\email\Parser.py", line 62, in parse self._parseheaders(root, fp) File "C:\PROGRA~1\_DEV\Python22\lib\email\Parser.py", line 128, in _parseheaders raise Errors.HeaderParseError( email.Errors.HeaderParseError: Not a header, not a continuation: ``Sinte brennkopper herjer'' F -- Fredrik R?dland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From spambayes at djl.freeuk.com Thu Mar 13 11:22:57 2003 From: spambayes at djl.freeuk.com (David Leftley) Date: Thu Mar 13 06:23:04 2003 Subject: [Spambayes] wanted: malformed email messages. 
In-Reply-To: <200303131040.h2DAdrq18384@localhost.localdomain> References: <200303131040.h2DAdrq18384@localhost.localdomain> Message-ID: On Thu, 13 Mar 2003 21:39:53 +1100, Anthony Baxter wrote: > >If you've got spam that breaks python's email parser in some way, don't >just gripe - send it to me. I'm going to make a fairly serious go at >seeing what I can do to make the email parser more robust, and also make >sure it notes what it had to do in order to get the message to parse >(these notes will almost certainly be very good clues). I was just about to send some messages that the Outlook plugin was choking on. These messages have malformed headers, either with unexpected lines of Base64 or header lines broken across several lines. But having just upgraded to version 2.5b1 of the email package, all the dodgy messages I have received to date are now processed without errors. The only further improvement I would like to see regarding these messages is to try and decode the Base64 in the headers rather than just discarding it - currently I have a few spam messages with very low scores, presumably because spambayes has thrown away all the clues in the body of the message. David. From spambayes at djl.freeuk.com Thu Mar 13 11:29:59 2003 From: spambayes at djl.freeuk.com (David Leftley) Date: Thu Mar 13 06:30:04 2003 Subject: [Spambayes] mails that fail when filtered in outlook In-Reply-To: References: Message-ID: On Thu, 13 Mar 2003 12:16:20 +0100, "Fredrik Rodland" wrote: >I've got a couple of emails (allready fetched by outlook) which make the >filterering fail. >I've no idea as to what makes the first one fail. The other one seems to >have a '\r\n' included in the subject. I guess this is not good, but it >shouldn't make the plugin fail, should it? There seem to be some big improvements in the handling of malformed headers in the latest python email package. 
I was getting an error similar to the second of yours until I upgraded to version 2.5b1, from http://sourceforge.net/project/showfiles.php?group_id=25568 > >also - If manual filtering is started, and one e-mail fails, the rest of the >filetering seems to be skipped. couldn't the filtering continue, skipping >the message which failed? Yes, this is something I would like to see as well. It can sometimes be tricky to work out which of the 2000 messages in the spam corpus is causing filtering to fail! David. From spambayes at rodland.no Thu Mar 13 13:05:38 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Thu Mar 13 07:05:45 2003 Subject: [Spambayes] wanted: malformed email messages. In-Reply-To: Message-ID: On Thu, 13 Mar 2003 21:39:53 +1100, Anthony Baxter wrote: > >If you've got spam that breaks python's email parser in some way, don't >just gripe - send it to me. I'm going to make a fairly serious go at >seeing what I can do to make the email parser more robust, and also make >sure it notes what it had to do in order to get the message to parse >(these notes will almost certainly be very good clues). I'd love to - but as I wrote in my other post - outlook (which is the MUA I use at the moment ) fixes these messages, so that they don't fail anymore. does anybody have any tips on how to save/send a message with all of it's origianl content from outlook (2000)? F -- Fredrik Rodland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From spambayes at rodland.no Thu Mar 13 13:09:14 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Thu Mar 13 07:09:21 2003 Subject: FW: [Spambayes] mails that fail when filtered in outlook Message-ID: > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of David Leftley > Sent: 13. 
mars 2003 12:30 > To: spambayes@python.org > Subject: Re: [Spambayes] mails that fail when filtered in outlook > > > Yes, this is something I would like to see as well. It can sometimes > be tricky to work out which of the 2000 messages in the spam corpus is > causing filtering to fail! Exactly. I had to split the mails I wanted to filter (a total of 2700) into smaller portions to narrow down to the email that actually failed. Actually it was 2 of them - both described in my original post. F -- Fredrik Rodland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From spambayes at rodland.no Thu Mar 13 13:18:22 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Thu Mar 13 07:18:29 2003 Subject: [Spambayes] mails that fail when filtered in outlook In-Reply-To: Message-ID: > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of David Leftley > Sent: 13. mars 2003 12:30 > To: spambayes@python.org > Subject: Re: [Spambayes] mails that fail when filtered in outlook > > > On Thu, 13 Mar 2003 12:16:20 +0100, "Fredrik Rodland" > wrote: > > >I've no idea as to what makes the first one fail. The other one seems to > >have a '\r\n' included in the subject. I guess this is not good, but it > >shouldn't make the plugin fail, should it? > > There seem to be some big improvements in the handling of malformed > headers in the latest python email package. I was getting an error > similar to the second of yours until I upgraded to version 2.5b1, from > http://sourceforge.net/project/showfiles.php?group_id=25568 Thanx - I followed your advice & upgraded the email package to 2.5b1 - this fixed the message which had \r\n in the subject. However, the other one (with the assertion error) still fails with the same error.
F -- Fredrik Rodland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From noreply at sourceforge.net Wed Mar 12 21:32:26 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 13 07:46:22 2003 Subject: [Spambayes] [ spambayes-Bugs-702758 ] When manually filtering the results are not right. Message-ID: Bugs item #702758, was opened at 2003-03-13 18:32 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Tony Meyer (anadelonbrin) Assigned to: Mark Hammond (mhammond) Summary: When manually filtering the results are not right. Initial Comment: When doing a manual filter (via the filter dialog), the results displayed (found x ham, x spam, x unsure) are for the last folder filtered only, not the total over all folders, as one would expect. This is because in filter.py the update() function of the dictionary is used, and the docs have this as a[x] = b[x], not a[x] += b[x], which is what would be wanted here. Unless this is changed in a later version of Python, then this should really be fixed. 
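The difference between overwriting and accumulating can be seen in a small sketch. It is not the code from filter.py; the helper name add_counts and the key names are assumptions made for illustration, written for current Python.

```python
def add_counts(total, new):
    """Add each count in `new` onto `total` in place, instead of overwriting.

    Hypothetical helper for illustration only, not taken from filter.py.
    """
    for key, count in new.items():
        total[key] = total.get(key, 0) + count
    return total


# Assumed per-folder results; the real key names may differ.
folder_a = {"ham": 10, "spam": 3, "unsure": 1}
folder_b = {"ham": 2, "spam": 7, "unsure": 0}

overwritten = dict(folder_a)
overwritten.update(folder_b)   # a[x] = b[x]: only folder_b's numbers survive

summed = add_counts(dict(folder_a), folder_b)   # a[x] += b[x]: running totals
```

With update() the dialog would report only the last folder's numbers; add_counts keeps a running total across folders.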
I might get to it :) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702 From noreply at sourceforge.net Thu Mar 13 04:36:16 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 13 07:46:28 2003 Subject: [Spambayes] [ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder Message-ID: Bugs item #642740, was opened at 2002-11-23 15:00 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 Category: None Group: None Status: Open Resolution: Works For Me Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: "Recover from Spam" wrong folder Initial Comment: Outlook addin: Selecting "Recover From Spam" recovers the selected message to the Inbox folder - which is not necessarily where came from. The filterer will need to save the folder it came from before we can do this. ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-13 13:36 Message: Logged In: YES user_id=724871 I haven't seen this after I entered my previous comment. I gues I was working on an old message, as I mentioned... I guess you could close this bug... ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 12:03 Message: Logged In: YES user_id=724871 OK - i've tested some more. this seems to work sometimes, and sometimes not. It may be related to the other bug you're refering to, but I'll try to walk thorugh an example. - I've got a message in a folder (inbox/maillister/locker). The message was filtered by outlooks rules to this folder this morning - i.e. I've never viewed neither the message or the clues from any other folder. 
- I run a manual filter on this folder (which returns with 1 good msg as expected) - WILL THIS FORGET THE FOLDER OF THIS MSG? - I press the "delete as spam" button, and the message appears in my SPAM-folder. - I enter my spam-folder and press the "recover from spam"- button. - the message appears in my INBOX The message was ORIGINALLY (this morning local time) filtered using the 1.0.a2 version of spambayes, while I now use the latest CVS-version. the following appears in the trace-collector: Deleting and spam training message '[Lockergnome Penguin Shell] Network Shutdown' - trained as spam Recovering to folder 'Inbox' and ham training message '[Lockergnome Penguin Shell] Network Shutdown' - trained as ham If you add some more debug, I'll be happy to run some tests on this msg. Is there anyway to check whether this message actually ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-04 11:43 Message: Logged In: YES user_id=14198 Can you post an example of something that fails? Note that a remaining potential problem is out of our control: occasionally the "Inbox" will see a message before the builtin rules. In this case, we filter it from the Inbox, not from where the Outlook rule would have moved it. Thus, when we recover, we see the inbox as the source. Note that I also fixed another bug related to this - previously, simply scoring a message would store that folder name as the "source" of the message. Thus, if you had previously viewed the clues for a message once in the wrong folder, the correct source folder would have been lost. So please ensure you are testing with mail received since I said I fixed this. 
---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-02-04 07:23 Message: Logged In: YES user_id=14198 /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v <-- addin.py new revision: 1.48; previous revision: 1.47 /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v <-- filter.py new revision: 1.16; previous revision: 1.15 /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v <-- msgstore.py new revision: 1.39; previous revision: 1.38 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 From noreply at sourceforge.net Thu Mar 13 04:38:40 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 13 07:46:36 2003 Subject: [Spambayes] [ spambayes-Bugs-702920 ] Manual filtering (Outlook) fails if one message fails Message-ID: Bugs item #702920, was opened at 2003-03-13 13:38 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering (Outlook) fails if one message fails Initial Comment: I've posted this question on the mailing list, and having had (at least) one positive response, I'm entering it here: If manual filtering is started, and one e-mail fails, the rest of the filtering seems to be skipped. Couldn't the filtering of the remaining messages continue, skipping the message which failed?
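A rough sketch of the requested behaviour, assuming nothing about how the Outlook plugin is actually structured: the names filter_all and fake_score are hypothetical, and the code is written for current Python. Each message is scored inside its own try block, so one bad message is recorded and the loop carries on.

```python
def filter_all(messages, score):
    """Score every message; collect failures instead of aborting the run.

    Hypothetical sketch, not the plugin's code. `score` is any callable
    that returns a score for a message or raises on a message it cannot handle.
    """
    results, failures = [], []
    for msg in messages:
        try:
            results.append((msg, score(msg)))
        except Exception as exc:
            # Remember which message failed and why, then keep going.
            failures.append((msg, exc))
    return results, failures


def fake_score(msg):
    """Stand-in scorer that rejects one kind of malformed message."""
    if "\r\n" in msg:
        raise ValueError("malformed subject")
    return 0.5


results, failures = filter_all(["good one", "bad\r\nsubject", "another"], fake_score)
```

Returning the failures alongside the results also answers the related complaint in the thread: the user can see which message failed without splitting the folder into smaller batches.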
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 From tim at fourstonesExpressions.com Thu Mar 13 07:07:20 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 08:07:26 2003 Subject: [Spambayes] wanted: malformed email messages. In-Reply-To: <200303131040.h2DAdrq18384@localhost.localdomain> Message-ID: Anthony, I've been working on the Parser myself for a couple days. I've attached my version of it. I have to tell you that I think the parser is fairly poorly written. I haven't done any of the formal regression tests on it as of yet. There is a mail attached to spambayes bug #695142 that has a malformed continuation header (text starts in column 1). c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org -------------- next part -------------- A non-text attachment was scrubbed... Name: Parser.py Type: application/octet-stream Size: 12630 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes/attachments/20030313/c758eeab/Parser.obj From skip at pobox.com Thu Mar 13 07:14:46 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 13 08:15:12 2003 Subject: [Spambayes] UpdatableConfigParser In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD87@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1318CD87@its-xchg4.massey.ac.nz> Message-ID: <15984.33862.819218.177867@montanaro.dyndns.org> Tony> Those that were paying attention will recall a discussion a couple Tony> of weeks back about config files, paticularly updating them. Hmmm... What applications modify config files? That usually seems to me to be the province of special config file editors or humans armed with text editors. You're not proposing that applications like pop3proxy should modify them are you? 
Skip From skip at pobox.com Thu Mar 13 07:15:50 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 13 08:16:04 2003 Subject: [Spambayes] Storing Options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> Message-ID: <15984.33926.350804.451009@montanaro.dyndns.org> Tony> I know I'm not completely alone here, but I'd like to know if Tony> there are lots of people (or even a few of the right people ;) Tony> that like it as it is. I like it the way it is. I'd prefer to fiddle my options with a text editor. Skip From tim at fourstonesExpressions.com Thu Mar 13 07:30:30 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 08:30:36 2003 Subject: [Spambayes] UpdatableConfigParser In-Reply-To: <15984.33862.819218.177867@montanaro.dyndns.org> Message-ID: I'm reasonably sure that there is code in several places that modifies specific options temporarily, counting on the fact that those modifications are not permanent. options.verbose modification is one of those things that gets twiddled every now and then. I suppose persisting option changes should be explicit. 3/13/2003 7:14:46 AM, Skip Montanaro wrote: > > Tony> Those that were paying attention will recall a discussion a couple > Tony> of weeks back about config files, paticularly updating them. > >Hmmm... What applications modify config files? That usually seems to me to >be the province of special config file editors We have one of those. > or humans armed with text editors. We have one of those. > You're not proposing that applications like pop3proxy should >modify them are you? It already does. 
c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim at fourstonesExpressions.com Thu Mar 13 07:31:25 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 08:31:33 2003 Subject: [Spambayes] Storing Options In-Reply-To: <15984.33926.350804.451009@montanaro.dyndns.org> Message-ID: 3/13/2003 7:15:50 AM, Skip Montanaro wrote: > Tony> I know I'm not completely alone here, but I'd like to know if > Tony> there are lots of people (or even a few of the right people ;) > Tony> that like it as it is. > >I like it the way it is. I'd prefer to fiddle my options with a text >editor. Nothing that Tony is proposing precludes this as a possibility. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From tim at fourstonesExpressions.com Thu Mar 13 07:33:09 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 08:33:13 2003 Subject: [Spambayes] Storing Options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> Message-ID: <41RLMINJ98C72197WT54SQPK3VUTMGEC.3e708895@myst> 3/12/2003 10:45:12 PM, "Meyer, Tony" wrote: >Ignoring the fact that it's scattered throughout the code base, does anyone like the current method of getting options? > >What I personally do not like (in order of dislike): >* That sections are ignored, leading to names like pop3proxy_servers. > >* Updating the options object does not update the underlying ConfigParser (now UpdatableConfigParser ;) object, so a write() (or update()) will not write the updated values. > >* Having all the defaults in Options.py, rather than a much simpler default config file (IIRC the reason for folding the file in was so that it didn't matter which directory you were running from, but the envar should take care of that, yes?) 
The only thing I really like about Options.py is the cracker, which returns an object of the correct type given the stringness, numberness, or booleanness of the option. > >I know I'm not completely alone here, but I'd like to know if there are lots of people (or even a few of the right people ;) that like it as it is. If people (a) don't care, or (b) also don't like it, then I'll try and come up with a better scheme (and present it before making any changes!). +1 for me. > >=Tony Meyer > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From skip at pobox.com Thu Mar 13 08:37:09 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 13 09:37:20 2003 Subject: [Spambayes] UpdatableConfigParser In-Reply-To: References: <15984.33862.819218.177867@montanaro.dyndns.org> Message-ID: <15984.38805.111595.613581@montanaro.dyndns.org> >> You're not proposing that applications like pop3proxy should modify >> them are you? Tim> It already does. I meant modify them and save those modifications to the underlying config file. I realize that the options get suitably modified at runtime. I'm concerned that if I set the verbose flag on the command line that my config file will get modified so that verbose is then the default. I definitely don't want that. Skip From tim at fourstonesExpressions.com Thu Mar 13 08:44:51 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 09:44:57 2003 Subject: [Spambayes] UpdatableConfigParser In-Reply-To: <15984.38805.111595.613581@montanaro.dyndns.org> Message-ID: <43C8Q1ZRPC7VQGC3DA1YDB73WT975Z.3e709963@myst> 3/13/2003 8:37:09 AM, Skip Montanaro wrote: > > >> You're not proposing that applications like pop3proxy should modify > >> them are you? > > Tim> It already does. 
> >I meant modify them and save those modifications to the underlying config >file. I realize that the options get suitably modified at runtime. I'm >concerned that if I set the verbose flag on the command line that my config >file will get modified so that verbose is then the default. I definitely >don't want that. For sure on that one. The pop3proxy has an Option Configuration page, where options that pertain to the proxy can be manipulated by the user. Those manipulations actually do modify the ini file. > >Skip > > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From bill at parducci.net Thu Mar 13 08:24:15 2003 From: bill at parducci.net (bill parducci) Date: Thu Mar 13 11:24:18 2003 Subject: [Spambayes] training issues Message-ID: <3E70B0AF.5080303@parducci.net> i receive a couple of newsletters and [travel] updates that i cannot get trained properly for the life of me. every time one of them comes in it is classified as spam (high 90s not uncommon) and dumped into my spam folder. i moved the message into my inbox and fired off mboxtrain each time this happens. in looking at the note afterwards i see X-Spambayes-Trained: ham in the header. however, the next time that a similar message arrives, it is dumped into spam. my guess is that the weighting of the content (e.g. state department travel warnings bear a tremendous degree of similarity with the scams from nigeria if you just look at the occurrences of 'low freq' words) overcomes the effect of the training (which i am guessing acts by raising the header information to high ham probabilities as a result of much of the other information being previously trained as spam). the bottom line is that i am not sure how to correct for this. suggestions? 
thanks b From trebor at animeigo.com Thu Mar 13 10:20:16 2003 From: trebor at animeigo.com (Robert Woodhead) Date: Thu Mar 13 11:25:11 2003 Subject: [Spambayes] Email Certificates of Approval In-Reply-To: References: Message-ID: Guys, Been toying with a new, complementary idea for spam reduction. Wanted to pass it by you before unleashing it on the unsuspecting masses. http://www.madoverlord.com/Projects/SPAMIDEA.t Comments much appreciated, of course. Best R Crossposted; spambayes & spam-l -- Woodhead's Law: "The further you are from your server, the more likely it is to crash." From db3l at fitlinxx.com Thu Mar 13 14:25:09 2003 From: db3l at fitlinxx.com (David Bolen) Date: Thu Mar 13 14:25:14 2003 Subject: [Spambayes] Re: wanted: malformed email messages. References: <200303131040.h2DAdrq18384@localhost.localdomain> Message-ID: David Leftley writes: (...) > But having just upgraded to version 2.5b1 of the email package, all > the dodgy messages I have received to date are now processed without > errors. (...) I had similar behavior - I started getting a large rash of messages that would fail to parse due to bad continuation lines (often containing HTML comments or some such noise in the headers). In my case I actually switched to Python 2.3a2 for the add-in (which looks like it has 2.5a1 of the e-mail package) and all the parsing problems went away. So at the very least, I think we would want to stress the need to be using a very current email package, since for me in the span of a few days I went from having an occasional such message to having a good percentage each day (must have been a new format some spam-bot is using or something). In the context of the Outlook plugin, it also made me think that it might be nice if the plugin didn't abort on an individual message failure, but kept working on any remaining messages so as to at least process as many as possible. 
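A minimal sketch of that last idea, written in present-day Python rather than the 2.2-era code the plugin uses. The names `process_all` and `handle` are hypothetical and not part of the Outlook add-in; the point is only that a failure on one message is recorded and the loop moves on to the rest.

```python
def process_all(messages, handle):
    """Apply handle() to every message, collecting failures instead of aborting.

    Returns (results, failures), where failures is a list of
    (message, exception) pairs for the messages that could not be processed.
    """
    results = []
    failures = []
    for msg in messages:
        try:
            results.append(handle(msg))
        except Exception as exc:  # deliberately broad: one bad message must not stop the run
            failures.append((msg, exc))
    return results, failures
```

The failures list could then be reported once at the end of the run, so the user still learns which messages were skipped and why.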
-- David From tim at fourstonesExpressions.com Thu Mar 13 15:12:39 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 16:12:45 2003 Subject: [Spambayes] Are we ready for alpha 3? Message-ID: Give me some votes and I'll release alpha 3 tonight, if the votes are aye c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From mhammond at skippinet.com.au Fri Mar 14 08:11:54 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Mar 13 16:12:52 2003 Subject: [Spambayes] Storing Options In-Reply-To: <15984.33926.350804.451009@montanaro.dyndns.org> Message-ID: [Skip] > Tony> I know I'm not completely alone here, but I'd like to know if > Tony> there are lots of people (or even a few of the right people ;) > Tony> that like it as it is. > > I like it the way it is. I'd prefer to fiddle my options with a text > editor. My understanding is that an updatable options class would allow the pop3proxy configuration page to save its options back to a file. I don't think there is any suggestion that we try and get clever by "remembering" options implicitly. I like this idea for Outlook - I would prefer to have the options maintained by the Outlook GUI be stored back in a text based options file. Mark. From noreply at sourceforge.net Thu Mar 13 13:28:14 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 13 16:22:28 2003 Subject: [Spambayes] [ spambayes-Bugs-699063 ] pop3proxy.py crashes Message-ID: Bugs item #699063, was opened at 2003-03-06 17:11 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702 Category: pop3proxy Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: D. R. Evans (n7dr) >Assigned to: Tim Stone (timstone4) Summary: pop3proxy.py crashes Initial Comment: pop3proxy.py worked fine for a couple of weeks. 
I then rebooted my Linux box (Mandrake 8.1), and since then pop3proxy.py produces the following output on the console: Loading database... Traceback (most recent call last): File "./pop3proxy.py", line 1577, in ? run() File "./pop3proxy.py", line 1551, in run state.createWorkers() File "./pop3proxy.py", line 1161, in createWorkers self.bayes = storage.DBDictClassifier(filename) File "./spambayes/storage.py", line 140, in __init__ self.load() File "./spambayes/storage.py", line 152, in load t = self.db[self.statekey] File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__ return Unpickler(f).load() EOFError The database files are attached. Doc ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-13 15:28 Message: Logged In: YES user_id=645698 We currently have no way of recovering from this kind of error should it occur. We believe, however, that the defect is actually a bsddb defect that has been corrected in a subsequent release of bsddb. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702 From noreply at sourceforge.net Thu Mar 13 13:29:02 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 13 16:22:35 2003 Subject: [Spambayes] [ spambayes-Bugs-699174 ] mboxtrain only trains on cur in maildir Message-ID: Bugs item #699174, was opened at 2003-03-06 21:56 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702 Category: None Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: Matthew Cowles (mdcowles) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain only trains on cur in maildir Initial Comment: When training on a maildir, mboxtrain trains only on the messages in the subdirectory cur. It ignores messages in the subdirectory new.
Since new is for messages that haven't been seen, I think it's worth looking there since at least some spam will have been filed unseen. I'll upload a patch that makes it train on both. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-13 15:29 Message: Logged In: YES user_id=645698 This is a feature request. If this remains as a requirement, please resubmit as such. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702 From tim at fourstonesExpressions.com Thu Mar 13 15:29:06 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 16:29:12 2003 Subject: [Spambayes] Federal Trade Commission Workshop on Spam Message-ID: <1VTRKG87YU87FCD8UPB6RY1T2XHP.3e70f822@myst> Well, maybe the feds are starting to wake up... April 30 to May 2, the FTC will be having a workshop on the spam problem. Anybody in that general vicinity? http://www.ftc.gov/bcp/workshops/spam/index.html c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From skip at pobox.com Thu Mar 13 15:48:52 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 13 16:49:02 2003 Subject: [Spambayes] Federal Trade Commission Workshop on Spam In-Reply-To: <1VTRKG87YU87FCD8UPB6RY1T2XHP.3e70f822@myst> References: <1VTRKG87YU87FCD8UPB6RY1T2XHP.3e70f822@myst> Message-ID: <15984.64708.540798.130761@montanaro.dyndns.org> Tim> Well, maybe the feds are starting to wake up... April 30 to May 2, Tim> the FTC will be having a workshop on the spam problem. Anybody in Tim> that general vicinity? Well, PythonLabs is in that general vicinity. I suspect none of them could spare all three days though. 
Skip From trebor at animeigo.com Thu Mar 13 20:02:36 2003 From: trebor at animeigo.com (Robert Woodhead) Date: Thu Mar 13 20:03:08 2003 Subject: [Spambayes] Email Certificates of Approval Message-ID: Forgot to post this to the list At 2:48 PM -0500 3/13/03, Eric S. Johansson wrote: >Robert Woodhead wrote: >>Guys, >> >>Been toying with a new, complementary idea for spam reduction. >>Wanted to pass it by you before unleashing it on the unsuspecting >>masses. >> >>http://www.madoverlord.com/Projects/SPAMIDEA.t >> >>Comments much appreciated, of course. > >several major problems with this proposal. It fails if: > >a registrar fails to list a spammer as a spammer Well, SSL fails if a registrar doesn't do his job and issues bogus certs. At some point, you have to trust someone. >CRL reporting latency is too great Not really that much of an issue. Remember, this is just another data point. If someone starts broadly spamming using a cert, enough users will note it to get the word out before most of the recipients grab it from their mailserver. >virus lifts certificates from various machines >the implementation follows all of the usual security human factors >failures (i.e. passphrases etc.) Yes, but you're going to have these problems with any system. Heck, a virus could grab your mailserver password and the evil spammers could reconfigure you as a relay. At a certain point, you have to just say "this is good enough", and you can get there. > >it also fails because it doesn't allow truly anonymous speech and >opens the door for elected and non elected governance controlling >your ability to e-mail. First of all, one could create anonymous certificates. But the flip side is, users may decide to give less weight to an email from an anonymous source. That's their choice. 99.999999999% of all email is not anonymous. And note that anonymous remailers could get certs and certify that the source of the email (one of their users) is not a spammer. 
Finally, all such a scheme is really saying is "if you are willing to certify who you are, your recipients will be more willing to trust that what you are sending is not spam". Bluntly, the anonymity of email -- or more precisely, the ease of obscuring the origin of email -- is one of the major flaws in the current email system design that makes spam so easy to inflict on us all. > >simple proof of work stamps get around all of these problems and >still put a big burden on the spammer. Proof of work is an interesting idea. But if it is worth enough, custom silicon can easily give 100x or even 1000x the throughput of a general-purpose processor. And also keep in mind that there are legitimate emailers who need to send out a lot of email. At 12:05 PM -0800 3/13/03, T. Alexander Popiel wrote: >1. A single fee of $50 per registration would not be sufficient to > support the registrar; there are ongoing costs which only an > up-front fee cannot address, unless certificates expire... which > is horrible for the reputation aspects of the system. The > registrar would have to be a subscription service, much like the > DNS registrars... but likely with higher costs because they > wouldn't be able to securely delegate authority (or perhaps > they could securely delegate... but people wouldn't believe it > was secure, so wouldn't trust it). I doubt people will want to > pay $50 every couple months just to have a reputation. It's unclear at present what the costs might be, but this is a valid concern. I'm not thinking of something that has to be frequently renewed (unless it gets revoked). Also, most cert owners would be mailserver operators anyway. > >2. The registrar could be infiltrated, bribed, or otherwise compromised. > Not helpful. There would be no provable protections against such, > so the registrar would have to be a trusted party (in the negative > connotation of "you trust them because there's no way to verify > their veracity"). 
I think that keeping my own database would be > preferable. So can DNS registrars. So can SSL registrars. But note that such a compromise will be immediately obvious, so there is a great incentive for the registrar to play fair. > >3. Many people pay good money to be jerks. That's pretty much the > definition of email marketing... the spamhouses charge a pretty > penny for running one of the blast-o-grams. An additional $50 > per blast-o-gram for a new reputation token is minor compared > to the $1k-$10k+ per mailing... Point well taken. But note that I was talking about people buying certs to use to trash the reps of others. As for the blastogram operators, after a while they'll find they can't buy certs from the registrar anymore. > >4. Adding a message signature to the Received headers (which is > effectively what you're doing) would be a wonderful thing... but > there's no need to centralize the signature keys. Even if each > mail handler had their own privately kept & guarded keys, it'd > help tracking immensely. True. > >5. If people are foolish enough to not scan their tagged-as-spam > mailboxes for important things like their boss's name as sender, > then they deserve to have the company go belly-up. ;-) More > generally, completely ignoring the tagged-as-spam stuff is > dangerous and dumb, because _NO_ system is going to be perfect. > Sorting the messages by sender makes it fairly easy and quick to > dispose of them. I agree, but even the vigilant screw up occasionally. I scan the sender/title of all of my spam (it gets filtered to the bottom of the inbox), but even with whitelisting, an occasional email from a legit sender (who has never emailed before, and is clueless enough to not put a nice descriptive subject line) gets by. > >6. 
I also think you're overstating the reaction speed of such a system; > if a spammer has a new certificate for each mailing (or each day), > then most people will not have read the message (and registered it > as spam) at the time when other people's mailers or procmail scripts > need to classify it... classification always happens before reading, > and classification is usually immediately after receipt, while > reading is delayed some arbitrary amount. With enough users, this problem goes away. > >7. If you put in something saying that certificates under a certain age > (or with only a few votes) are suspect, then you unfairly penalize > new (or casual) email users, while merely inconveniencing the > spammers (who then have to pre-buy and age their certificates, and/or > ballot-stuff them). Hadn't thought to do that. > >8. Ballot-stuffing would be a major problem, and if done at a reasonable > rate it'd be nearly impossible to detect. (How do you distinguish > between you and forty clone machines ballot-stuffing at about 20 > votes per day vs. someone who regularly communicates via email to > everyone in his workplace or on a local mailing list?) Ballot stuffing is an issue, but consider that to stuff more than one ballot on a particular email, you'd have to have multiple certificates. Unless someone puts together a gang of like-minded dipwads, all stuffing, the "popiel is a lousy scumbag" votes are going to get overwhelmed by the "popiel is a nice guy" votes. THere's probably some cute things that can be done to detect stuffing. Appreciate the comments, keep them coming. R -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. 
From noreply at sourceforge.net Thu Mar 13 14:57:53 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 13 20:12:18 2003 Subject: [Spambayes] [ spambayes-Feature Requests-703283 ] mboxtrain only trains on cur in maildir Message-ID: Feature Requests item #703283, was opened at 2003-03-13 16:57 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Matthew Cowles (mdcowles) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain only trains on cur in maildir Initial Comment: When training on a maildir, mboxtrain trains only on the messages in the subdirectory cur. It ignores messages in the subdirectory new. Since new is for messages that haven't been seen, I think it's worth looking there since at least some spam will have been filed unseen. This is the same as bug 699174 which Tim Stone closed saying, "This is a feature request. If this remains as a requirement, please resubmit as such." The patch attached to that bug report fixes the behavior which I still consider a bug. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702 From tim.one at comcast.net Thu Mar 13 20:26:50 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Mar 13 20:28:20 2003 Subject: [Spambayes] Federal Trade Commission Workshop on Spam In-Reply-To: <15984.64708.540798.130761@montanaro.dyndns.org> Message-ID: [TimS] > Well, maybe the feds are starting to wake up... April 30 > to May 2, the FTC will be having a workshop on the spam problem. > Anybody in that general vicinity? [SkipM] > Well, PythonLabs is in that general vicinity. I suspect none of > them could spare all three days though. I doubt our employer would agree to one hour -- we're not in the spam business.
I suppose that can be read more than one way, some more obviously true than others . From T.A.Meyer at massey.ac.nz Fri Mar 14 15:13:39 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 13 21:14:29 2003 Subject: [Spambayes] UpdatableConfigParser Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD89@its-xchg4.massey.ac.nz> Oh well, it's the thought that counts ;) For those that don't read the check-ins, I've reverted Options.py and OptionConfig.py to ConfigParser, not UpdatableConfigParser. Two main problems: * ConfigParser has changed in more recent Python (I'll take a look at the new version and see how exactly). * Without starting a debate, there are issues about hooking into 'private' attributes. I've left UpdatableConfigParser.py there, although it's not imported by any module. I'll tinker with it and maybe get it so that it's acceptable :) I still stand by the idea ;) I would have got to this faster, but you people that live on the wrong side of the world found the problems when I was asleep :) Those that cvs-up'd since my update (21 hours ago) should do so again. =Tony Meyer From T.A.Meyer at massey.ac.nz Fri Mar 14 15:15:25 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 13 21:16:47 2003 Subject: [Spambayes] Are we ready for alpha 3? Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8D0@its-xchg4.massey.ac.nz> > Give me some votes and I'll release alpha 3 tonight, if the > votes are aye +1 as long as it takes Options.py and OptionConfig.py after I dropped UpdatableConfigParser. Don't forget to update the website to note that a3 is there. =Tony Meyer From T.A.Meyer at massey.ac.nz Fri Mar 14 15:51:50 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 13 21:52:59 2003 Subject: [Spambayes] Storing Options Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD8B@its-xchg4.massey.ac.nz> I don't think I was very clear. Let me try again: [Skip] > Hmmm... What applications modify config files? 
pop3proxy (via OptionConfig) and the Outlook plugin. These are the only two applications with a ui at the moment, aren't they? So they'd be the only ones that do it. > That usually seems to me to be the province of special > config file editors or humans armed with text > editors. You're not proposing that applications > like pop3proxy should modify them are you? Those applications aimed more at end user type people will have some sort of capability to change options, and will need to be able to store these somehow, so the applications will edit them. This doesn't remove the ability to manually edit them - and in some applications (those that use hammie, for example), this (hand-edit) would probably always be the only option. [TimS] > I'm reasonably sure that there is code in several places that > modifies specific options temporarily [...] > I suppose persisting option changes should be explicit. I'm not proposing changes of that magnitude! In operation, I would think nothing much would change. The config file(s) would be (and are) changed only when the user clicks the save button in the web ui config page, or clicks OK in the Outlook manager dialog, or ... [Mark] > My understanding is that an updatable options class would allow the > pop3proxy configuration page to save its options back to a > file. Which is exactly what it does now (although not as nicely as it could). This idea is to improve how this is done, behind the scenes (before we get to beta and it's too late!). =Tony Meyer From anthony at interlink.com.au Fri Mar 14 13:55:17 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Thu Mar 13 21:55:58 2003 Subject: [Spambayes] Are we ready for alpha 3? In-Reply-To: Message-ID: <200303140255.h2E2tHU12723@localhost.localdomain> >>> Tim Stone - Four Stones Expressions wrote > Give me some votes and I'll release alpha 3 tonight, if the votes are aye > There needs to be documentation for people upgrading from earlier versions.
The website should be updated when the release is made, as should PyPI. Anthony From tim.one at comcast.net Thu Mar 13 22:28:29 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Mar 13 22:29:04 2003 Subject: [Spambayes] wanted: malformed email messages. In-Reply-To: Message-ID: [Fredrik Rodland] > I'd love to - but as I wrote in my other post - outlook (which is > the MUA I use at the moment) fixes these messages, so that they don't > fail anymore. > > does anybody have any tips on how to save/send a message with all of its > original content from outlook (2000)? Outlook doesn't store the original content, so it's not possible. Just look at the code in the Outlook2000 directory of this project to see all the pain it takes to partially reconstruct the original! Outlook simply wasn't designed with current Internet email standards in mind, and scatters the message it gets into a large number of fields and properties that seem originally designed for a proprietary MS email format. Some things can't be recovered at all (e.g., the original MIME armor is *almost* always lost). From T.A.Meyer at massey.ac.nz Fri Mar 14 16:29:57 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 13 22:31:17 2003 Subject: [Spambayes] RE: [Spambayes-checkins] spambayes/spambayes Options.py,1.22,1.23 Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8D7@its-xchg4.massey.ac.nz> > [Tony - please do something with your mailer to keep lines under 80 > columns] I'm not sure I can. I use Outlook with Exchange. "Internet email" is set to wrap at 74 chars, but there isn't a setting AFAIK to wrap mail sent through Exchange. If anyone else knows differently, please let me know. I'll try and remember to hard wrap lines myself :(. My check-in messages are also not wrapped - these are generated by TortoiseCVS. I've posted a request to wrap them, but you open-source, who knows when/if it will get done ;) I'll try to remember to hard wrap these too.
=Tony Meyer From tim at fourstonesExpressions.com Thu Mar 13 21:33:30 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 13 22:33:37 2003 Subject: [Spambayes] Are we ready for alpha 3? In-Reply-To: <200303140255.h2E2tHU12723@localhost.localdomain> Message-ID: 3/13/2003 8:55:17 PM, Anthony Baxter wrote: > >>>> Tim Stone - Four Stones Expressions wrote >> Give me some votes and I'll release alpha 3 tonight, if the votes are aye >> > >There needs to be documentation for people upgrading from earlier versions. Good point. Won't be tonight... ;) c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From trebor at animeigo.com Thu Mar 13 20:03:23 2003 From: trebor at animeigo.com (Robert Woodhead) Date: Thu Mar 13 22:40:46 2003 Subject: [Spambayes] Email Certificates of Approval In-Reply-To: <3E711F51.10105@harvee.org> References: <3E70E094.9060902@harvee.org> <3E711F51.10105@harvee.org> Message-ID: At 7:16 PM -0500 3/13/03, Eric S. Johansson wrote: >I hope you realize that as I play Devils advocate, there is >absolutely no animosity towards you or your idea. This "think like >a spammer" role-playing was an essential part of the process in >making camram (a sender pays antispam system) more robust. Oh of course. I do that all the time myself. If I think you're being a dickhead, you'll be the second to know. ;^) >trusting someone is the fundamental leverage point a con artist counts on. Actually, this is not entirely correct. Con artists depend on the greed of the marks. >Now one thing to consider is the reputation damage the industry >would have given a sufficiently motivated set of spammers. They can >keep setting up certificate authorities faster than you can knock >them down. If they stayed legit long enough, they can burn a whole >bunch of people and get them to cease trusting certificates and >certificate authorities. > >reputation capital such a wonderful thing... 
Obviously, there would have to be a central registrar who is responsible for certifying subregistrars. We have that already with DNS. And clearly, the amount of vetting required to become a subregistrar would be significant. >how will they get the word out? Are you envisioning some form of >peer-to-peer reporting structure? Not quite clear on this, but it will involve some central servers keeping track of the votes. > How do you deal with false reports? Imagine someone collecting >certificates and then a network of people report them as spammers. No, a certificate holder can only vote that a particular email from another cert holder is spam, he would have to register his vote within a reasonable period of time, and his vote (and the yea votes) would decay over time. So to get tagged as a spammer you would have to get voted against by a significant fraction of the electorate (those receiving emails from you) in a short period of time [there would have to be some provision that if A emails B, B can't forward to C and both B and C vote it to be spam]. This provides a form of traffic analysis. Spammers email to a lot of people over a short period of time. Hammers email to a much smaller number of people -- almost all known to them -- over longer periods. The only reasonable targets for a smear campaign are large legit bulk emailers (say, amazon.com) and mailing lists. > Instantly, you can cause loss of e-mail access to a large number of >people. Also, what about indirect reputation trashing. Someone gets >a certificate with your name and identity. Obviously, it won't >match your certificate but most people won't know that. The cert isn't intended to identify you, though in most cases it can. It's used to tell you "emailer 23734282932823732732929 is regarded as a spammer". Rarely will end-users bother to, or need, to find out that 23734282932823732732929 is dipwad27@hotmail.com. 
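The decaying-vote idea a few paragraphs up (votes must be recent to count, and both the spam votes and the yea votes fade over time) could be sketched roughly as follows. This is only an illustration: the exponential half-life, the 24-hour figure, and the names `vote_weight` and `spam_fraction` are assumptions, not anything the proposal specifies.

```python
HALF_LIFE_HOURS = 24.0  # assumed value; the proposal does not name a decay rate


def vote_weight(age_hours, half_life=HALF_LIFE_HOURS):
    """Weight of a vote cast age_hours ago; halves every half_life hours."""
    return 0.5 ** (age_hours / half_life)


def spam_fraction(votes, now_hours):
    """Decay-weighted share of spam votes for one certificate.

    votes is a list of (timestamp_hours, is_spam) pairs. Returns a value
    between 0.0 (all recent votes say ham) and 1.0 (all say spam).
    """
    spam = ham = 0.0
    for cast_at, is_spam in votes:
        weight = vote_weight(now_hours - cast_at)
        if is_spam:
            spam += weight
        else:
            ham += weight
    total = spam + ham
    return spam / total if total else 0.0
```

Under these assumptions a burst of recent spam votes outweighs the same number of older yea votes, and a bad reputation fades on its own once the spam votes stop arriving.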
This isn't supposed to be a replacement for other systems of spam detection, just another data point used in deciding what to do. The more orthogonal detection methods we have, the harder it will be to spam. What this does is give you an estimate of the reputation of someone you've never heard of before. > >On the latency issue, a spammer can get out an awful lot of e-mail >in a small number of hours. The distribution of a certificate >revocation notice worldwide will need to be under 10 minutes in >order for it to be only moderately effective. I suggest you do the >math of propagation and figure out how far and how fast spam can go >in only a few minutes. You're missing something. From the standpoint of an enduser, it doesn't matter how fast the spam gets to his mailserver. All that matters is how long it is between the start of the spam run and the time his mailreader downloads the email from the mailserver and checks its reputation. For most email users, this averages several hours, enough time for the earlybirds who check their email every 5 minutes to vote on the reputation. >The problem with certificates and this kind of identity theft is >that it directly affects your reputation. You can be barred from >ever having access to e-mail again based on this form of identity >theft. You could potentially even be barred from accessing the >Internet ever. How do you repudiate something that's supposedly >something you can't repudiate? After all, it's your electronic >identity. I don't know about you but I want to deny that I ever >wrote some of my e-mail. ;-) No, not at all. Worst case, you buy another certificate. Or even, if reputations decay, just wait a couple of days and you'll have a decent rep again. Consider the horrible case, a worm that goes around stealing certificates and giving them to spammers. What happens? The reputation system becomes unreliable for a few days until the apps get patched.
If we're clever in the implementation, in such a case, the cert holders could get a fresh cert at no charge if they wanted. >I agree with most of which you say. I think that certificate based >or, more correctly stated, identity based e-mail antispam filters >can be made to work if you make them decentralized and based on who >you know. The trouble comes when you try to send e-mail to someone >that you don't know. Which is the problem I'm addressing. > If you assume web of trust, then you have a "six degrees from >Kevin Bacon" type problem as you try to find someone you know who's >willing to introduce you to someone who knows the person you're >trying to get in touch with. Right. It's an issue for new users. > >but you still haven't dealt with the issue of elected and unelected >governance and their influence on your ability to generate e-mail. Nobody is stopping you from emailing to your heart's content. Your readers are merely making a recommendation to new people you might want to email as to what kind of guy you are. Nobody is being forced to stop emailing. Nobody is being forced to not read an email. It's just a suggestion. No more, or less, important than "your bayesian filter thinks this is spammy" or "your dnsbl says this comes from a known spam source". End users can be stupid and trash emails based on the recommendation. But they can also do that based on what their spamfilter says. >Now how does this apply to legitimate bulk e-mail? All bulk e-mail >should be opt-in. Therefore, after you have established a >relationship with a bulk e-mail source, they are now defined as >"friend" and sign their messages to you. Otherwise, they can just >sit there and generate stamps. It's an interesting approach. >actually, if you want identity based systems to control spam, you >have to have a and identity associated with every e-mail account. >At a $50 price point, it ain't going to happen. 
*I* wouldn't spend >$50 on a certificate when I know they cost pennies to generate. >Given a sufficient high price point, you create an opportunity for >folks to come in and trash the reputation capital of the entire >system. The $ is really for running the reputation database. If that can effectively be distributed, then that problem goes away. >Actually, this raises an important point. Why should I spend money >to clean up somebody else's mess. Certificate based systems such as >you propose further increase the receiver pays nature of e-mail. I >pay when Spam comes in, I pay to keep Spam out. No. Only senders who want a cert pay. You can receive email and use the system without a cert (but you don't get a vote). >>So can DNS registrars. So can SSL registrars. But note that such >>a compromise will be immediately obvious, so there is a great >>incentive for the registrar to play fair. > >how does it become immediately obvious? Because certs that should quickly get tagged as belonging to spammers won't. > Will there be worldwide bulletins on CNN? Will the Attorney >General's of 15 states lead a SWAT team into some small tropical >country to shut down the naughty registrar? And if the registrar >has managed to accrue a few hundred thousand customers who are >legitimate? What happens to them? How do you get people to change >their certificates when the user interface to add them is so >painfully horrible that they won't use them in the first place? That's an implementation issue. But note that while you may have resellers, there needs to be a central registrar who has the database of certs (like the DNS root servers). So that's the point of compromise. >>Point well taken. But note that I was talking about people buying >>certs to use to trash the reps of others. As for the blastogram >>operators, after a while they'll find they can't buy certs from the >>registrar anymore. > >so they will form their own. And have to go through background checks. And put up a bond. 
Spammers won't do this. >the trick is knowing when the transition happens and even experts >screwed up some of the time. How can you ever hope to get someone >who is uninterested to give that level of attention to detail? You don't. They will tend to freeride on the power-users who will form the voting elite. The whole point, as I've said before, is to have another orthogonal detection system, reducing false positives for the clueless, who are most likely to get bitten by them. >like I said. You need extremely fast propagation of information, >reliable dissemination points and reliable connections to those >dissemination points. I still believe it's 10 minutes propagation >worldwide with full redundancy on all connections. See above. The rest-stop on the mailserver before the POP session gives you the time. > >also, what happens if someone can get to the dissemination points? >Does all the e-mail get held up and what do they get all of their >e-mail regardless? Sure they do. They just don't get the benefit of an opinion. It fails gracefully. > >And who is going to pay for all this infrastructure? Could it be >the receiver of the e-mail? The Spammer isn't paying. The cert purchasers. >you can deal with one form of ballot stuffing from the certificate >identity which prevents multiple votes by the same certificate on >the same topic. Then you can also use source IP address to see if >you get a lot of certificates from the same address. Unfortunately, >this test would fail for organizations with address translation >gateways. Note that if you're sending to someone who has a cert, you could encode that info into the header line, so that only the holder of the cert (the recipient) could vote against you. This would also, btw, have the nice feature of having a robust X-Original-Recipient field in the email, great for detecting bounces from clueless isps who return bizarre bounce messages. 
R -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. From T.A.Meyer at massey.ac.nz Fri Mar 14 16:53:41 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 13 22:55:19 2003 Subject: [Spambayes] RE: [Spambayes-checkins] spambayes/spambayes Options.py,1.22,1.23 Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD8D@its-xchg4.massey.ac.nz> > See above - that the ConfigParser didn't expose this interface is > probably just an indication that no-one had needed to do this before. > There's been a lot of changes in it for Python2.3, so it seems like > you're not the first person to run into this. Indeed, almost as an aside (so it seems) this was done. If only I'd checked the Python CVS first... > If the 2.3 ConfigParser class is better, there's nothing > saying we can't include it in the package (we already do this with > the sets and heapq module). Well, my UpdatableConfigParser still adds functionality - most particularly, it lets OptionConfig.py (and 'one day soon' Outlook) update config files without stripping comments. It will work as is with Python 2.2.2, but it has the deplorable ;) hooks into the private attributes. It should work exactly the same with the CVS Python (without the hooks, changing __sections and __read to _sections and _read). So, do we: (a) include the latest ConfigParser, so that the code can be all the same? (b) have a version check that does the ugly hooking if we're pre 2.3, and otherwise is nice? 
(c) get Tony to give up on this ;) =Tony Meyer From T.A.Meyer at massey.ac.nz Fri Mar 14 17:26:47 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 13 23:27:34 2003 Subject: [Spambayes] Spambayes installation problem Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD90@its-xchg4.massey.ac.nz> [I'm throwing this back to the list in the hope that someone will have a good idea] To summarise: Geoff is unable to get the Outlook plugin running. He's using the installer that Mark put together. It goes through and creates the C:\Program Files\Outlook Plugin\ directory (whatever it is called), and it also seems to register the COM plugin since it appears in Outlook's list of such plugins. However, when opening Outlook, the GUI doesn't appear. The trace window shows nothing (not even the "loading" lines). Geoff does have a couple of other plugins that might be causing the problem. One is a virus checker called AVG, which adds a button to the toolbar and adds text to messages. I installed the free version of this (6.0), but the spambayes plugin still worked for me. Geoff might be using a different version of AVG, however. He also has a synchronisation plugin. IIRC Mark did say that the installer would fail if there was already a COM plugin that was written in Python. Is there any chance that either of these might be? [Geoff] > It is in the COM add-ins but its checkbox is not ticked. Ticking and > reloading makes no difference. It should definitely be ticked. Just to check, after you tick it, and close & reopen Outlook, are you looking at a mail folder (like the Inbox) and not something else (like Outlook Today)? The items won't appear until you do. [Geoff] > However there is a synchronisation add-in which I believe is > from a palm pde Ticked, or unticked? Unless someone on the list has ideas, the only one I have left is that Geoff progresses past the nice package Mark put together and gets the CVS version.
=Tony Meyer From popiel at wolfskeep.com Thu Mar 13 21:23:56 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Mar 14 00:24:00 2003 Subject: [Spambayes] Email Certificates of Approval In-Reply-To: Message from Robert Woodhead References: <3E70E094.9060902@harvee.org> <3E711F51.10105@harvee.org> Message-ID: <20030314052356.1A7112DE88@cashew.wolfskeep.com> In message: Robert Woodhead writes: >> >>On the latency issue [...] > >You're missing something. From the standpoint of an enduser, it >doesn't matter how fast the spam gets to his mailserver. All that >matters is how long it is between the start of the spam run and the >time his mailreader downloads the email from the mailserver and >checks its reputation. For most email users, this averages several >hours, enough time for the earlybirds who check their email every 5 >minutes to vote on the reputation. So... the people who form the basis for the judgements of the system (those that check their email every 5 minutes) are exactly those people who get no benefit from it (because there hasn't yet been enough input to form a good judgement). Sounds like there's no incentive to participate and actually make the system work. It also doesn't do a bloody thing for those of us who get their mail delivered realtime to the *nix mailserver with procmail segregating it into MH mailboxes (or similar). Yeah, I know it's horribly anachronistic to actually have a login account on the mailserver and not use POP or IMAP... but it's far easier to grep through 30000 message mailboxes that way. I suppose there's not many of us classic users left, though. - Alex From spambayes at rodland.no Fri Mar 14 08:58:19 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Fri Mar 14 03:00:37 2003 Subject: [Spambayes] Re: wanted: malformed email messages. 
In-Reply-To: Message-ID: > David Leftley writes: > > In the context of the Outlook plugin, it also made me think that it > might be nice if the plugin didn't abort on an individual message > failure, but kept working on any remaining messages so as to at least > process as many as possible. I've posted bug #702920 which addresses this problem. It could be argued that this should be a feature request, though.... Fredrik -- Fredrik Rodland Technical Architect, Stocknet, Oslo, Norway Stocknet: http://www.stocknet.com phone: +47 23 28 40 17 Private: http://rodland.no phone: +47 99 21 98 17 From trebor at animeigo.com Fri Mar 14 07:01:47 2003 From: trebor at animeigo.com (Robert Woodhead) Date: Fri Mar 14 07:22:50 2003 Subject: [Spambayes] Email Certificates of Approval In-Reply-To: <20030314052356.1A7112DE88@cashew.wolfskeep.com> References: <3E70E094.9060902@harvee.org> <3E711F51.10105@harvee.org> <20030314052356.1A7112DE88@cashew.wolfskeep.com> Message-ID: At 9:23 PM -0800 3/13/03, T. Alexander Popiel wrote: >In message: > Robert Woodhead writes: >>> >>>On the latency issue [...] >> >>You're missing something. From the standpoint of an enduser, it >>doesn't matter how fast the spam gets to his mailserver. All that >>matters is how long it is between the start of the spam run and the >>time his mailreader downloads the email from the mailserver and >>checks its reputation. For most email users, this averages several >>hours, enough time for the earlybirds who check their email every 5 >>minutes to vote on the reputation. > >So... the people who form the basis for the judgements of the >system (those that check their email every 5 minutes) are exactly >those people who get no benefit from it (because there hasn't >yet been enough input to form a good judgement). Sounds like >there's no incentive to participate and actually make the system >work. Not quite, it's a probabilistic thing. 
Someone who checks their email every 5 minutes is more likely to look at it before an opinion has been formed, but it is not a sure thing. It all depends on whether they were early or late in the spam run, for example. Again, keep in mind this is not intended to be a be-all-end-all method. It is intended to be part of a suite of methods used to make life hard for the spammer. I'll repeat the mantra: orthogonality. > >It also doesn't do a bloody thing for those of us who get their >mail delivered realtime to the *nix mailserver with procmail >segregating it into MH mailboxes (or similar). Yeah, I know it's >horribly anachronistic to actually have a login account on the >mailserver and not use POP or IMAP... but it's far easier to >grep through 30000 message mailboxes that way. I suppose there's >not many of us classic users left, though. True. You neanderthals will simply have to suffer. ;^) -- =========================================================== Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/ =========================================================== http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters. From noreply at sourceforge.net Fri Mar 14 10:02:41 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri Mar 14 12:54:40 2003 Subject: [Spambayes] [ spambayes-Bugs-695142 ] Email does not render subject in the "Review" Page Message-ID: Bugs item #695142, was opened at 2003-02-28 10:40 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: David Shaw (dshaw) Assigned to: Tim Stone (timstone4) >Summary: Email does not render subject in the "Review" Page Initial Comment: I received the attached email. 
When I go to the "review" web page of pop3proxy.py, all it shows is: Messages classified as Unsure: From: (none) (none) It acts as though the message has no "from" or "subject", even though they exist. The user is not given any way to classify this message other than to click on the first "(none)" and read the raw message to determine its contents. I will attach the message below. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-14 12:02 Message: Logged In: YES user_id=645698 We are now actively engaged in improving the email package parser, which should resolve these malformation related errors. ---------------------------------------------------------------------- Comment By: Tim Stone (timstone4) Date: 2003-03-06 17:51 Message: Logged In: YES user_id=645698 This is another email package parsing 'error' caused by a malformed header in the attached email. The content-type header has an embedded /r/n, which causes the email package to barf and discard all the headers. IMO, the email package is being used in Spambayes in ways that it was never intended for. Malformed mail is gonna be the death of us, and the email package just doesn't seem to handle it very well. I'm gonna leave this bug open, but there's virtually nothing that can be done to make things better, at least not AFAIK. 
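A minimal sketch of one possible fallback for the case described above, where a malformed header makes the parsed message come back with no headers at all. It is not what the email package or pop3proxy.py does; `fallback_headers` and the sample message are hypothetical, and the code is written for current Python rather than the 2.2 interpreter discussed in this thread. It scans the raw header block line by line and simply ignores any line it cannot interpret, so the review page could still show a "From" and a "Subject".

```python
import re


def fallback_headers(raw_text, wanted=("from", "subject")):
    """Pull a few headers out of raw message text without a full parse.

    Hypothetical helper: lines that do not look like "Name: value" (for
    example the stray half of a broken Content-Type header) are skipped
    instead of aborting the whole header block.
    """
    # The header block ends at the first blank line.
    head = re.split(r"\r?\n\r?\n", raw_text, maxsplit=1)[0]
    found = {}
    current = None  # wanted header currently being continued, if any
    for line in head.splitlines():
        if line[:1] in (" ", "\t"):
            # Folded continuation line: keep it only for a wanted header.
            if current is not None:
                found[current] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
            current = key
        else:
            current = None
    return found


# Made-up message whose Content-Type header is broken across two lines.
raw = (
    "From: a@example.com\r\n"
    "Content-Type: text/plain;\r\n"
    "charset=us-ascii\r\n"
    "Subject: hello\r\n"
    "\r\n"
    "body text\r\n"
)
headers = fallback_headers(raw)
```

A real fix would more likely live in the parser itself, as the later comment on this bug suggests; the sketch only shows that the two display fields can be recovered independently of the broken header.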
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702 From noreply at sourceforge.net Fri Mar 14 15:38:22 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri Mar 14 18:28:26 2003 Subject: [Spambayes] [ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder Message-ID: Bugs item #642740, was opened at 2002-11-24 01:00 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 Category: None Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) >Summary: "Recover from Spam" wrong folder Initial Comment: Outlook addin: Selecting "Recover From Spam" recovers the selected message to the Inbox folder - which is not necessarily where came from. The filterer will need to save the folder it came from before we can do this. ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-13 23:36 Message: Logged In: YES user_id=724871 I haven't seen this after I entered my previous comment. I gues I was working on an old message, as I mentioned... I guess you could close this bug... ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-04 22:03 Message: Logged In: YES user_id=724871 OK - i've tested some more. this seems to work sometimes, and sometimes not. It may be related to the other bug you're refering to, but I'll try to walk thorugh an example. - I've got a message in a folder (inbox/maillister/locker). The message was filtered by outlooks rules to this folder this morning - i.e. I've never viewed neither the message or the clues from any other folder. - I run a manual filter on this folder (which returns with 1 good msg as expected) - WILL THIS FORGET THE FOLDER OF THIS MSG? 
- I press the "delete as spam" button, and the message appears in my SPAM-folder. - I enter my spam-folder and press the "recover from spam"- button. - the message appears in my INBOX The message was ORIGINALLY (this morning local time) filtered using the 1.0.a2 version of spambayes, while I now use the latest CVS-version. the following appears in the trace-collector: Deleting and spam training message '[Lockergnome Penguin Shell] Network Shutdown' - trained as spam Recovering to folder 'Inbox' and ham training message '[Lockergnome Penguin Shell] Network Shutdown' - trained as ham If you add some more debug, I'll be happy to run some tests on this msg. Is there anyway to check whether this message actually ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-04 21:43 Message: Logged In: YES user_id=14198 Can you post an example of something that fails? Note that a remaining potential problem is out of our control: occasionally the "Inbox" will see a message before the builtin rules. In this case, we filter it from the Inbox, not from where the Outlook rule would have moved it. Thus, when we recover, we see the inbox as the source. Note that I also fixed another bug related to this - previously, simply scoring a message would store that folder name as the "source" of the message. Thus, if you had previously viewed the clues for a message once in the wrong folder, the correct source folder would have been lost. So please ensure you are testing with mail received since I said I fixed this. 
---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-02-04 17:23 Message: Logged In: YES user_id=14198 /cvsroot/spambayes/spambayes/Outlook2000/addin.py,v <-- addin.py new revision: 1.48; previous revision: 1.47 /cvsroot/spambayes/spambayes/Outlook2000/filter.py,v <-- filter.py new revision: 1.16; previous revision: 1.15 /cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v <-- msgstore.py new revision: 1.39; previous revision: 1.38 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702 From noreply at sourceforge.net Fri Mar 14 15:39:44 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri Mar 14 18:28:31 2003 Subject: [Spambayes] [ spambayes-Bugs-702920 ] Manual filtering (Outlook) fails if one message fails Message-ID: Bugs item #702920, was opened at 2003-03-13 23:38 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering (Outlook) fails if one message fails Initial Comment: I've posted this question on the mailing list, and with (at least) one positive response, I enter it here: If manual filtering is started, and one e-mail fails, the rest of the filtering seems to be skipped. Couldn't the filtering of the remaining messages continue, skipping the message which failed? ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-15 10:39 Message: Logged In: YES user_id=14198 Can you please post a traceback?
(and sorry if I missed it on the list) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 From noreply at sourceforge.net Sat Mar 15 09:03:07 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Sat Mar 15 18:27:10 2003 Subject: [Spambayes] [ spambayes-Patches-704188 ] non-interactive hammie Message-ID: Patches item #704188, was opened at 2003-03-15 17:03 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=704188&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Toby Dickenson (htrd) Assigned to: Nobody/Anonymous (nobody) Summary: non-interactive hammie Initial Comment: When hammie is training, it displays a message counter to stdout when processing every message in the mailbox. I have recently updated the training phase of my procmail integration to run under cron, and this verbose output is unwelcome. This attached patch causes hammie to only update the counter for every message if stdout is a tty. If it is not (such as when run under cron) it only displays the final total at the end of processing a mailbox. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=704188&group_id=61702 From dan at tobias.name Sat Mar 15 19:45:21 2003 From: dan at tobias.name (Daniel R. Tobias) Date: Sat Mar 15 22:35:14 2003 Subject: [Spambayes] It crapped out yet again... Message-ID: <3E73C921.6040204@tobias.name> Once again, my ham/spam database seems to have gone belly-up, and I can't get the proxy to start up. This is the error I get: Traceback (most recent call last): File "C:\Program Files\spambayes-1.0a2\pop3proxy.py", line 1577, in ? 
run() File "C:\Program Files\spambayes-1.0a2\pop3proxy.py", line 1551, in run state.createWorkers() File "C:\Program Files\spambayes-1.0a2\pop3proxy.py", line 1161, in createWorkers self.bayes = storage.DBDictClassifier(filename) File "C:\Program Files\spambayes-1.0a2\spambayes\storage.py", line 140, in __init__ self.load() File "C:\Program Files\spambayes-1.0a2\spambayes\storage.py", line 152, in load t = self.db[self.statekey] File "C:\Python22\lib\shelve.py", line 71, in __getitem__ return Unpickler(f).load() EOFError -- == Dan == Dan's Web Tips: http://webtips.dan.info/ Dan's Domain Site: http://domains.dan.info/ From lists at webcrunchers.com Sun Mar 16 18:00:50 2003 From: lists at webcrunchers.com (John D.) Date: Sun Mar 16 22:20:37 2003 Subject: [Spambayes] Other pop3proxy options In-Reply-To: <039101c2d847$3b1eff40$a100a8c0@zlichstein> Message-ID: Hi, been away from the list for a while, but want to comments on some of the earlier postings from a long time ago.... So excuse the lateness of this posting. >I would like to extend the options for how disposition is identified by the pop3proxy implementation. In particular, I would like the option of > >A. X-Spambayes-Classification: as now >B. To: XXXXX as is in CVS now >C. Subject line munging to append > >Is there any reason that was not included? (beside the obvious potential for a spammer to slip in a workaround) I use Outlook Express, and obviously can't use the arbitrary header technique - and am most interested in adding a [***SPAM***] header so that I can correctly bucketize those messages - but leave [***UNSURE***] in my primary box, and not molest ham messages at all. > >Is there any reason not to do this? Would you accept it if I did? Is there any reason why you aren't using the email module Parser API to crack the headers? I have found a certain number of messages are not parsed correctly by the re that you are using. 
They show up as From: (none) Subj: (none) in the UI - but I haven't determined why just yet (though I can see that some part of the message is getting stuck with the header by your re.split(r'\n\r?\n', messageText, 1) expression. So - the reason why we are changing this, is to accommodate Outlook users who can't filter on the "X-Spambayes-Classification"? John From lists at webcrunchers.com Sun Mar 16 18:43:47 2003 From: lists at webcrunchers.com (John D.) Date: Sun Mar 16 22:20:44 2003 Subject: [Spambayes] Use of email package In-Reply-To: <15955.48207.421755.891103@gargle.gargle.HOWL> References: Message-ID: Barry writes.... >>>>>> "TS" == Tim Stone writes: > > TS> We've got to either seriously harden our code so it knows what > TS> to do when the email package raises an exception, or consider > TS> not using the email package. I think I'll be reworking > TS> pop3proxy so that it no longer uses the email package for > TS> anything. The Corpus stuff currently has most (all?) the > TS> function that is needed by pop3proxy anyway. > >Let me take this opportunity to elaborate on the architecture of the >email package. There was a deliberate separation between the >representation of email messages and the parsing of flat text to that >object model (and in generating flat text from the object model, but >that may not be relevant). > >Thus, it was designed with an eye toward the use of application >specific parsers, and it may well be that the default parsers (both >the strict and the lax parsers) may not be appropriate for an >application that tends to see intentionally ill-formed messages. My >suggestion would be to write a parser that can handle the really bad >messages, then use the default lax parser for most things, and fall >back to the "adaptive parser" for the really horrendous messages. > >Then donate that parser back to Python. I've already spent a lot of time developing my system using the "email" package and the classic "Message" classes. 
I'm also aware of the bugs in the email.Parser, especially when it comes to parsing MIME type messages, in particular the KlezH viruses I keep getting, which in most cases GAGS my mail processing system. Right now, I skip processing these messages, and leave them on the POP server, and manually deal with them. I'm hoping we can still use these packages, because we already spent a lot of time using them, but let's just try and fix the Parser to work right. I'm still using a very much earlier version of the SpamBayes project, and I know I need to catch up, but was planning to hold off on doing that until I can get another OpenBSD box on our co-location rack, which we plan to earmark for specific SpamBayes development. On top of that, I'm also working on our SMS (Spam Management System) under Open Source, where we plan to "Collect" spam into a SQL database, with the idea of developing a spam processing system. This involves building the Database, then as spam comes in, to PROCESS it so we can keep track of REPEAT spam, and be able to do really cool things to allow sending the spam to SpamCop, FTC, etc. It's also going to test the opt out mechanisms of the spam and further classify it in order to identify the really bad ones. Each database entry allows one to take specific notes on the spam, to allow for easy tracing of the spammer and locating them through "whois" lookups on the sites they hawk in the spam. I've already got some pretty solid code to extract URLs and opt out addresses, and other routines to test the validity of the opt out URLs. So the idea is to be able to instantly look up specific spams I report to the Authorities to verify if gateways are still open, or bring up notes on pending investigations against spammers, and also bring up the Whois contact info on the domains... I'm doing this manually right now, but want to get this automation working soon.
I already got about 10 spammers' domains shut down because their Whois is bogus, so it would automatically link to the domain name issuers' complaint form pages, keeping track of the "ticket numbers" allowing me to easily follow up my complaints to ensure they revoke the spammer's domain name, or put in accurate Whois info so that their contact info is accurate. I have all of this almost working on my LOCAL box on my LAN, and hopefully within a few weeks, want to bring up a "spamcruncher.com" server box with a web site, PostGres, Python, and the SMS libraries and CGIs that drive the web based GUI, and set up a few Alpha and Beta testers. On it would be a pop3proxy, SMTP Proxy, Database, Spambayes, etc. Would then be looking for anyone wanting to participate in our SMS development. Any comments? Forward them to "crunch@shopip.com" as I use this mail address specifically for my Mailing lists, and I download all my list mail every week. John From anthony at interlink.com.au Mon Mar 17 14:30:27 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Sun Mar 16 22:31:44 2003 Subject: [Spambayes] Use of email package In-Reply-To: Message-ID: <200303170330.h2H3USq15462@localhost.localdomain> [John, if possible please keep your email messages to less than 80 columns per line - thanks!] >>> "John D." wrote > I've already spent a lot of time developing my system using the > "email" package and the classic "Message" classes. > > I'm also aware of the bugs in the email.Parser, especially when it > comes to parsing MIME type messages, in particular the KlezH viruses I > keep getting, which in most cases GAGS my mail processing system. If this is still broken under email 2.5b1, can you send me a complete sample of the broken message? Thanks, Anthony -- Anthony Baxter It's never too late to have a happy childhood. From lists at webcrunchers.com Sun Mar 16 19:34:39 2003 From: lists at webcrunchers.com (John D.)
Date: Sun Mar 16 22:34:42 2003 Subject: [Spambayes] On Merging System wide corpuses with specific User's Corpuses. Message-ID: Has there been any thought or discussion about the idea of "Merging" system wide Spam Corpuses with "Local" ones? For instance, as was discussed earlier, people are not willing to submit their personal mail to the "ham" corpus (at least not all of it), but in instances where a domain has multiple users, I think it would be nice in the training phase to mark an item to put into a "system wide" pool of spam or ham, or put it into a "local" corpus specific to a particular user. But when classifying it, treat the corpus as a "single" file. John From lists at webcrunchers.com Sun Mar 16 20:21:11 2003 From: lists at webcrunchers.com (John D.) Date: Sun Mar 16 23:21:46 2003 Subject: [Spambayes] wanted: malformed email messages. In-Reply-To: References: <200303131040.h2DAdrq18384@localhost.localdomain> <200303131040.h2DAdrq18384@localhost.localdomain> Message-ID: I have some malformed messages that the Parser fails to resolve. Most are KlezH Virus attachments that fail to put an additional space between the sections as the RFC requires. Do you want these as well? John From noreply at sourceforge.net Sun Mar 16 17:50:39 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 09:48:53 2003 Subject: [Spambayes] [ spambayes-Bugs-702758 ] When manually filtering the results are not right. Message-ID: Bugs item #702758, was opened at 2003-03-13 18:32 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702 Category: Outlook Group: None >Status: Closed Resolution: None Priority: 5 Submitted By: Tony Meyer (anadelonbrin) Assigned to: Mark Hammond (mhammond) Summary: When manually filtering the results are not right.
Initial Comment: When doing a manual filter (via the filter dialog), the results displayed (found x ham, x spam, x unsure) are for the last folder filtered only, not the total over all folders, as one would expect. This is because filter.py uses the dictionary's update() method, which the docs define as a[x] = b[x], not a[x] += b[x], which is what would be wanted here. Unless this changes in a later version of Python, it should really be fixed. I might get to it :) ---------------------------------------------------------------------- >Comment By: Tony Meyer (anadelonbrin) Date: 2003-03-17 13:50 Message: Logged In: YES user_id=552329 v1.9 of filter.py fixes this (well, works for me). Thanks Mark. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702 From noreply at sourceforge.net Mon Mar 17 03:06:19 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 09:49:00 2003 Subject: [Spambayes] [ spambayes-Bugs-702920 ] Manual filtering (Outlook) stops if one message fails Message-ID: Bugs item #702920, was opened at 2003-03-13 13:38 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) >Summary: Manual filtering (Outlook) stops if one message fails Initial Comment: I've posted this question on the mailing list, and with (at least) one positive response, I enter it here: If manual filtering is started, and one e-mail fails, the rest of the filtering seems to be skipped. Couldn't the filtering of the remaining messages continue, skipping the message which failed?
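The behaviour requested in this report could be sketched roughly as follows. This is only an illustration in current Python, not the plugin's filter.py: `filter_folders`, `demo_scorer`, and the plain dict-of-lists folder model are all hypothetical stand-ins for the real Outlook objects. Each message is scored inside its own try/except, so one failure is recorded and skipped rather than ending the run, and the counts are accumulated by addition across folders.

```python
from collections import Counter


def filter_folders(folders, score_message):
    """Filter every message in every folder, skipping messages that fail.

    `folders` maps a folder name to a list of messages and `score_message`
    returns "ham", "spam" or "unsure". Both are assumptions made for this
    sketch, not the add-in's real interfaces.
    """
    totals = Counter()   # running totals over all folders (added, not replaced)
    failures = []        # (folder name, message, exception) for each skipped message
    for name, messages in folders.items():
        for message in messages:
            try:
                disposition = score_message(message)
            except Exception as exc:
                # One bad message must not stop the remaining ones.
                failures.append((name, message, exc))
                continue
            totals[disposition] += 1
    return totals, failures


def demo_scorer(message):
    """Toy scorer for the example: None stands in for a malformed message."""
    if message is None:
        raise ValueError("malformed message")
    return "spam" if "viagra" in message else "ham"


totals, failures = filter_folders(
    {"Inbox": ["hello", None, "buy viagra"], "Lists": ["weekly digest"]},
    demo_scorer,
)
```

Collecting the failures instead of discarding them keeps the tracebacks available for reporting once the run has finished.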
---------------------------------------------------------------------- >Comment By: Fredrik Rodland (fmmr) Date: 2003-03-17 12:06 Message: Logged In: YES user_id=724871 I (slightly) changed the summary. I've included one traceback. However, I've run into several different ones in the past when filtering manually, and all seem to stop the actual filter process. What I want is for the filtering process to continue with the remaining messages even if one message fails. There have also been several other comments on this subject on the list. The actual traceback, as requested: Error getting property from stream (-2147221233, 'OLE error 0x8004010f', None, None) Exception in thread Thread-2: Traceback (most recent call last): File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 408, in __bootstrap self.run() File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 396, in run apply(self.__target, self.__args, self.__kwargs) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target self._DoProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\FilterDialog.py", line 375, in _DoProcess self.filterer(self.mgr, self.progress) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 100, in filterer this_dispositions = filter_folder(f, mgr, progress) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 80, in filter_folder disposition = filter_message(message, mgr, all_actions) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 15, in filter_message prob = mgr.score(msg) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line 439, in score email = msg.GetEmailPackageObject() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 639, in GetEmailPackageObject text = self._GetMessageText() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 582, in _GetMessageText assert msg.is_multipart() AssertionError
---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-15 00:39 Message: Logged In: YES user_id=14198 Can you please post a traceback? (and sorry if I missed it on the list) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 From noreply at sourceforge.net Mon Mar 17 03:14:44 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 09:49:06 2003 Subject: [Spambayes] [ spambayes-Bugs-704921 ] "Train now" (outlook) fails Message-ID: Bugs item #704921, was opened at 2003-03-17 12:14 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: "Train now" (outlook) fails Initial Comment: I updated to the latest CVS version, which has the option of re-scoring messages after training. However, when clicking on "train now" in the main plugin dialog, the following traceback is caught. The training dialog seems "dead" and does not react to the "train now" button.
traceback: Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 70, in OnInitDialog self.UpdateStatus() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 103, in UpdateStatus if self.config.rescore: AttributeError: _ConfigurationContainer instance has no attribute 'rescore' win32ui: OnInitDialog() virtual handler (>) raised an exception ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702 From acunningham at rsasecurity.com Mon Mar 17 15:11:05 2003 From: acunningham at rsasecurity.com (Cunningham, Andy) Date: Mon Mar 17 10:05:57 2003 Subject: [Spambayes] Outlook 2002 Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F16F3@exuk01> Hi There. I just installed spambayes with python 2.2, latest email package, and win32app build 152. I'm using Outlook 2002 on Windows 2000 Professional. Looking through the archives of this mailing list, it seems like this should work, but I can't get any of the Folder lists to display - I just get an empty dialog box. The debug trace shows the following error when I try to bring up a folder list: Traceback (most recent call last): File "C:\andyc\Install\spambayes\spambayes-1.0a2\Outlook2000\dialogs\FolderSelector.py", line 310, in OnInitDialog tree = BuildFolderTreeMAPI(self.manager.message_store.session) File "C:\andyc\Install\spambayes\spambayes-1.0a2\Outlook2000\dialogs\FolderSelector.py", line 93, in BuildFolderTreeMAPI msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, None) win32ui: OnInitDialog() virtual handler (>) raised an exception Can anyone help?
-- Andy Cunningham Senior IS Consultant RSA Security UK Ltd From David.Vaughan at trizetto.com Mon Mar 17 13:49:15 2003 From: David.Vaughan at trizetto.com (Vaughan, David) Date: Mon Mar 17 16:15:25 2003 Subject: [Spambayes] setup Message-ID: I did not have python so I set it up on Win2k for the first time. I also have spambayes-1.0a2 but know how to neither setup.py build nor setup.py install . Kindly point me in the right direction. I am hoping to use spambayes with my netscape email account VaughanDA@Netscape.net . Any information you happen to have on how to connect netscape mail client to the netscape mail server would be appreciated. Today, I just use http to use netscape mail. I'd prefer to use the netscape mail client but have never set it up and don't quite know how. Thank you for your help. I look forward to your response. DVaughan19@SprintPCS.com M:(678) 478-5983 David.Vaughan@TriZetto.com W:(770) 225-3057 (877) 751-6025 From mhammond at skippinet.com.au Tue Mar 18 08:47:51 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 17 17:06:44 2003 Subject: [Spambayes] Outlook 2002 In-Reply-To: <418A63CAEBF2D4118A1A00508BB1A0B8029F16F3@exuk01> Message-ID: > msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | > pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, None) The error code for this is MAPI_E_CORRUPT_STORE, which doesn't sound good! I have checked in a change so that any errors when walking the folder tree are ignored. However, this same error is going to happen, so that part of your folder tree will *not* appear in the dialog. Hopefully only a small part of your tree is corrupt, so the folders you want will still be there - you will have to try it and see. Mark. 
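The behaviour Mark describes, ignoring errors while walking the folder tree so that only the unreadable part goes missing, can be sketched as below. This is an illustration only and does not use the MAPI API: StoreError, open_children and the dict-based folders are hypothetical stand-ins for the COM error and the real store objects.

```python
class StoreError(Exception):
    """Hypothetical stand-in for the COM error raised on a corrupt store."""


def open_children(folder):
    """Hypothetical accessor: return child folders, or raise on a corrupt store."""
    if folder.get("corrupt"):
        raise StoreError("MAPI_E_CORRUPT_STORE")
    return folder.get("children", [])


def build_folder_tree(folder):
    """Return (name, subtrees), or None if this folder cannot be read."""
    try:
        children = open_children(folder)
    except StoreError as exc:
        # The unreadable folder and everything below it is left out.
        print("ignoring %s: %s" % (folder["name"], exc))
        return None
    subtrees = []
    for child in children:
        subtree = build_folder_tree(child)
        if subtree is not None:
            subtrees.append(subtree)
    return (folder["name"], subtrees)


# A made-up mailbox in which one branch is corrupt.
root = {
    "name": "Mailbox",
    "children": [
        {"name": "Inbox"},
        {"name": "Archive", "corrupt": True, "children": [{"name": "Old"}]},
        {"name": "Spam"},
    ],
}
tree = build_folder_tree(root)
print(tree)
```

The trade-off is the one noted in the message: the dialog comes up, but folders under the corrupt branch ("Old" in this made-up tree) silently disappear from it.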
From tim at fourstonesExpressions.com Mon Mar 17 16:26:50 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 17 17:26:58 2003 Subject: [Spambayes] setup In-Reply-To: Message-ID: <21MKZX53IFPRM1T8598VPSO3YOIEVR.3e764baa@myst> Ok, David, the first step for you is going to be to set up the netscape mailer. This is completely independent of spambayes (at this point). I googled on 'netscape pop3 setup' and turned up a number of pages where there are instructions on how to do this. The first was at http://documentation.ascinet.com/www/print.asp?CourseNumber=1040. Once you get that set up and working, then drop us a line, and we'll get you going on getting spambayes set up. In the meantime, be sure you look closely at http://spambayes.sourceforge.net/, our homepage. There is considerable information there on how to set up spambayes, including setting up and running the pop3proxy, which is what you'll need. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From T.A.Meyer at massey.ac.nz Tue Mar 18 11:13:55 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Mon Mar 17 18:26:54 2003 Subject: [Spambayes] Other pop3proxy options Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8F1@its-xchg4.massey.ac.nz> > >I would like to extend the options for how disposition is > identified by the pop3proxy implementation. In particular, I > would like the option of > >A. X-Spambayes-Classification: as now > >B. To: XXXXX as is in CVS now > >C. Subject line munging to append [...] > So - the reason why we are changing this, is to accommodate > Outlook users who can't filter on the "X-Spambayes-Classification"? This wasn't so much a change as an addition. The default behaviour is still to just add the classification header and nothing else. If you want to, however, you can munge the To: or Subject: lines as well.
This was added to accommodate users of Outlook *Express* in particular (Outlook has better spambayes integration than just about anything), but also any other 'thin' clients. =Tony Meyer From noreply at sourceforge.net Mon Mar 17 16:17:33 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 19:36:59 2003 Subject: [Spambayes] [ spambayes-Bugs-704921 ] "Train now" (outlook) fails Message-ID: Bugs item #704921, was opened at 2003-03-17 23:14 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) >Summary: "Train now" (outlook) fails Initial Comment: I updated to the latest CVS version, which has the option of re-scoring messages after training. However, when clicking on "train now" in the main plugin dialog, the following traceback is caught. The training dialog seems "dead" and does not react to the "train now" button. traceback: Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 70, in OnInitDialog self.UpdateStatus() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 103, in UpdateStatus if self.config.rescore: AttributeError: _ConfigurationContainer instance has no attribute 'rescore' win32ui: OnInitDialog() virtual handler (>) raised an exception ---------------------------------------------------------------------- >Comment By: Tony Meyer (anadelonbrin) Date: 2003-03-18 12:17 Message: Logged In: YES user_id=552329 r1.7 of config.py should fix this bug. Please test if this works for you.
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702 From noreply at sourceforge.net Mon Mar 17 19:37:45 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 22:37:21 2003 Subject: [Spambayes] [ spambayes-Bugs-704921 ] "Train now" (outlook) fails Message-ID: Bugs item #704921, was opened at 2003-03-17 22:14 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702 Category: Outlook Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) >Summary: "Train now" (outlook) fails Initial Comment: I updated to the latest CVS version, which has the option of re-scoring messages after training. However, when clicking on "train now" in the main plugin dialog, the following traceback is caught. The training dialog seems "dead" and does not react to the "train now" button. traceback: Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 70, in OnInitDialog self.UpdateStatus() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 103, in UpdateStatus if self.config.rescore: AttributeError: _ConfigurationContainer instance has no attribute 'rescore' win32ui: OnInitDialog() virtual handler (>) raised an exception ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-18 14:37 Message: Logged In: YES user_id=14198 Fixed in filter.py, rev 1.20. The "dead dialog" problem seems a little deeper than this, and affects all dialogs - I will open a new bug.
---------------------------------------------------------------------- Comment By: Tony Meyer (anadelonbrin) Date: 2003-03-18 11:17 Message: Logged In: YES user_id=552329 r1.7 of config.py should fix this bug. Please test if this works for you. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702 From noreply at sourceforge.net Mon Mar 17 19:39:47 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 22:37:28 2003 Subject: [Spambayes] [ spambayes-Bugs-705378 ] Cancelled "full train" leaves bad database. Message-ID: Bugs item #705378, was opened at 2003-03-18 14:39 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705378&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Cancelled "full train" leaves bad database. Initial Comment: If you go to the training dialog, select "rebuild entire database", start the train, then cancel it, the database is left in a useless state. We should probably train to a new database, then move it over once complete. 
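The suggestion above, training to a new database and moving it over once complete, might look roughly like the following. This is a sketch under assumptions, not the plugin's code: rebuild_database and its train callable are hypothetical, and pickle is used only as a convenient placeholder for however the real database is stored.

```python
import os
import pickle
import tempfile


def rebuild_database(db_path, train):
    """Train into a temporary file, then swap it in only if training finishes."""
    directory = os.path.dirname(os.path.abspath(db_path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            pickle.dump(train(), tmp)
        # Single rename over the old file once the new one is fully written.
        os.replace(tmp_path, db_path)
    except BaseException:
        # Cancelled or failed: discard the partial file, keep the old database.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, cancelling part-way through (modelled here as an exception raised by train) leaves the existing database untouched. One caveat: on Windows the final rename can fail if another process still has the old file open, so the caller would need to close the database first.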
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705378&group_id=61702 From noreply at sourceforge.net Mon Mar 17 19:43:10 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 17 22:37:36 2003 Subject: [Spambayes] [ spambayes-Bugs-705379 ] Outlook dialogs sometimes become unresponsive Message-ID: Bugs item #705379, was opened at 2003-03-18 14:43 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705379&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Outlook dialogs sometimes become unresponsive Initial Comment: The training and filtering dialogs sometimes become unresponsive during filtering/training. They shouldn't, as hoops are jumped through to keep the UI and worker in separate threads. Further, it only seems to happen on "large" folders - eg, I can provoke it on my Inbox, but not on smaller folders. I'm guessing I am breaking some bullshit COM/Outlook threading rule. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705379&group_id=61702 From mhammond at skippinet.com.au Tue Mar 18 19:03:08 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Mar 18 03:04:12 2003 Subject: [Spambayes] New Outlook binary available Message-ID: I have made a new Outlook installer binary on my starship page - http://starship.python.net/crew/mhammond/spambayes/ (Should I be putting these on the main spambayes page, even though they aren't official releases? I'm happy to!) This version fixes a lot of problems in the first version - both problems that exist in the source-code version, and installer-specific problems.
We have better docs aimed more at the first time user, output is redirected to a log file, the apply() warnings have gone, etc. If you are running the old version, please uninstall and try the new one - the uninstall will not delete your databases. Thanks, Mark. From spambayes at djl.freeuk.com Tue Mar 18 15:06:17 2003 From: spambayes at djl.freeuk.com (David Leftley) Date: Tue Mar 18 10:06:24 2003 Subject: [Spambayes] Why was this e-mail's body ignored? Message-ID: <3vce7vclqjg0f9q0b93e3058b9n9eum4mr@4ax.com> I was surprised to see the message below appear towards the lower end of my "possible spam" range - but looking at the breakdown of how the message was classified, it turns out that spambayes is ignoring the entire message body. What is it about this message that makes spambayes think it has no relevant content? Is it simply that we don't try to handle multipart/alternative messages? David. >Return-path: >Delivery-date: Tue, 18 Mar 2003 14:51:57 +0000 >Received: from hypnos.uk.clara.net ([213.253.16.103]) > by chaos.uk.clara.net with esmtp (Exim 4.12) > id 18vIRZ-0002zI-00; Tue, 18 Mar 2003 14:51:57 +0000 >Received: from [200.86.159.217] (helo=213.253.16.103) > by hypnos.uk.clara.net with smtp (Exim 3.33 #2) > id 18vINe-000O5C-00; Tue, 18 Mar 2003 14:47:55 +0000 >Received: from 1v5.tbom9.net [45.100.189.59] by 213.253.16.103; Tue, 18 Mar 2003 18:39:48 +0400 >Message-ID: >From: "Leigh Skaggs" >To: >Subject: Who said money won't get you laid? >Date: Tue, 18 Mar 03 18:39:48 GMT >X-Priority: 3 >X-MSMail-Priority: Normal >X-Mailer: The Bat! (v1.52f) Business >MIME-Version: 1.0 >Content-Type: multipart/alternative; > boundary="1._1C_DFD.9.C4AFD" >X-RBL-Warning: (bl.spamcop.net) Blocked - see http://spamcop.net/bl.shtml?200.86.159.217 >X-UIDL: 1047999119.11657.chaos.uk.clara.net >X-RCPT: djl >Status: U >X-Spambayes-Classification: unsure >X-Spambayes-Spam-Probability: 0.235233369482 > >This is a multi-part message in MIME format. 
> >--1._1C_DFD.9.C4AFD >Content-Type: text/plain >Content-Transfer-Encoding: quoted-printable > >Money isn't everything...or so you were told right? >Well we bet you that it is! Take a look at these girls >who would do absolutely ANYTHING to win over a self made billionaire! > >http://www.hotxxxpass.net/pass2/ > >These hardworking girls think it's their lucky day because Max >the billionaire suitor has fallen upon them and their wonderful >talents! Watch them show themselves off to impress Max! > >You won't believe what these girls will do to get a piece of Max's pie! > >http://www.hotxxxpass.net/pass2/ >jpbbvp hn > mf > rcjoz >--1._1C_DFD.9.C4AFD-- > > Spam clues for this message: *H* 0.0310695580167 *S* 0.968984419484 subject:? 0.155172413793 content-type:multipart/alternative 0.706000895656 subject:' 0.738095238095 subject:get 0.844827586207 subject:money 0.844827586207 subject:you 0.934782608696 to:2**2 0.969798657718 From MMARTINEZ at intranet.reeusda.gov Tue Mar 18 10:34:21 2003 From: MMARTINEZ at intranet.reeusda.gov (Martinez, Michael - CSREES/ISTM) Date: Tue Mar 18 10:48:07 2003 Subject: [Spambayes] Suggestion: interface to qmail/qmail-scanner smtp gateway Message-ID: It'd be great if you could write a small, lightweight interface to qmail-scanner/qmail. Something like what "spamd/spamc" is for "SpamAssassin." I would be running spambayes on my smtp gateway right now, except that no one has written the interface. Martinez, Michael CSREES/ISTM/USDA (202) 720-6223 From acunningham at rsasecurity.com Tue Mar 18 17:16:18 2003 From: acunningham at rsasecurity.com (Cunningham, Andy) Date: Tue Mar 18 12:11:26 2003 Subject: [Spambayes] Outlook 2002 Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F1709@exuk01> Mark Thanks for the fast response. I actually managed to beat one of the nightly builds (2003-03-17) into working after I'd sent this, but I will update to your new code tomorrow and see if that works as well. 
AndyC -----Original Message----- From: Mark Hammond [mailto:mhammond@skippinet.com.au] Sent: 17 March 2003 21:48 To: Cunningham, Andy; spambayes@python.org Subject: RE: [Spambayes] Outlook 2002 > msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | > pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, > None) The error code for this is MAPI_E_CORRUPT_STORE, which doesn't sound good! I have checked in a change so that any errors when walking the folder tree are ignored. However, this same error is going to happen, so that part of your folder tree will *not* appear in the dialog. Hopefully only a small part of your tree is corrupt, so the folders you want will still be there - you will have to try it and see. Mark. From noreply at sourceforge.net Tue Mar 18 08:24:36 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 18 12:27:07 2003 Subject: [Spambayes] [ spambayes-Bugs-695632 ] MySQL Digest Causes Spambayes to Crash Message-ID: Bugs item #695632, was opened at 2003-03-01 10:48 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Richard Scott (rich1) Assigned to: Nobody/Anonymous (nobody) Summary: MySQL Digest Causes Spambayes to Crash Initial Comment: The main mysql e-mail list (digest version) and the mysql bugs e-mail list (digest version) always cause Spambayes to crash. It appears that the error occurs in Generator.py. Here is the output: Training ham (/home/richard/Mail/inbox): Reading as MH mailbox /home/richard/Mail/inbox/2 /home/richard/Mail/inbox/5 /home/richard/Mail/inbox/6 /home/richard/Mail/inbox/724 /home/richard/Mail/inbox/29 /home/richard/Mail/inbox/751 Traceback (most recent call last): File "/home/richard/spambayes/mboxtrain.py", line 278, in ? 
main() File "/home/richard/spambayes/mboxtrain.py", line 265, in main train(h, g, False, force) File "/home/richard/spambayes/mboxtrain.py", line 207, in train mhdir_train(h, path, is_spam, force) File "/home/richard/spambayes/mboxtrain.py", line 190, in mhdir_train f.write(msg.as_string()) File "/usr/lib/python2.2/site-packages/email/Message.py", line 107, in as_string g.flatten(self, unixfrom=unixfrom) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten self._write(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write self._dispatch(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch meth(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 243, in _handle_multipart g.flatten(part, unixfrom=False) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten self._write(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write self._dispatch(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch meth(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 212, in _handle_text raise TypeError, 'string payload expected: %s' % type(payload) TypeError: string payload expected: ---------------------------------------------------------------------- Comment By: Chuck Bearden (cfbearden) Date: 2003-03-18 10:24 Message: Logged In: YES user_id=499555 I am experiencing the same problem with the axkit digest and also with the monthly log files for a LISTSERV list that I run. Perhaps it's the presence of so many From: To: Subject: Received: etc. lines within one email message. I can fix this problem for myself by inserting a procmail recipe for the digests before the spambayes recipes, but I'm not sure how well that approach will scale to the 100+ folks I'd like to deploy this for. 
Also, it could cause problems for POP proxy users, since I don't see how they can prevent their digest traffic from being considered by the spambayes filters on the proxy. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702 From noreply at sourceforge.net Tue Mar 18 10:48:56 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Tue Mar 18 14:00:32 2003 Subject: [Spambayes] [ spambayes-Bugs-695632 ] MySQL Digest Causes Spambayes to Crash Message-ID: Bugs item #695632, was opened at 2003-03-01 10:48 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Richard Scott (rich1) Assigned to: Nobody/Anonymous (nobody) Summary: MySQL Digest Causes Spambayes to Crash Initial Comment: The main mysql e-mail list (digest version) and the mysql bugs e-mail list (digest version) always cause Spambayes to crash. It appears that the error occurs in Generator.py. Here is the output: Training ham (/home/richard/Mail/inbox): Reading as MH mailbox /home/richard/Mail/inbox/2 /home/richard/Mail/inbox/5 /home/richard/Mail/inbox/6 /home/richard/Mail/inbox/724 /home/richard/Mail/inbox/29 /home/richard/Mail/inbox/751 Traceback (most recent call last): File "/home/richard/spambayes/mboxtrain.py", line 278, in ?
main() File "/home/richard/spambayes/mboxtrain.py", line 265, in main train(h, g, False, force) File "/home/richard/spambayes/mboxtrain.py", line 207, in train mhdir_train(h, path, is_spam, force) File "/home/richard/spambayes/mboxtrain.py", line 190, in mhdir_train f.write(msg.as_string()) File "/usr/lib/python2.2/site-packages/email/Message.py", line 107, in as_string g.flatten(self, unixfrom=unixfrom) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten self._write(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write self._dispatch(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch meth(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 243, in _handle_multipart g.flatten(part, unixfrom=False) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten self._write(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write self._dispatch(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch meth(msg) File "/usr/lib/python2.2/site-packages/email/Generator.py", line 212, in _handle_text raise TypeError, 'string payload expected: %s' % type(payload) TypeError: string payload expected: ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-18 12:48 Message: Logged In: YES user_id=645698 I believe this problem has been resolved for pop3proxy. ---------------------------------------------------------------------- Comment By: Chuck Bearden (cfbearden) Date: 2003-03-18 10:24 Message: Logged In: YES user_id=499555 I am experiencing the same problem with the axkit digest and also with the monthly log files for a LISTSERV list that I run. Perhaps it's the presence of so many From: To: Subject: Received: etc. lines within one email message. 
I can fix this problem for myself by inserting a procmail recipe for the digests before the spambayes recipes, but I'm not sure how well that approach will scale to the 100+ folks I'd like to deploy this for. Also, it could cause problems for POP proxy users, since I don't see how they can prevent their digest traffic from being considered by the spambayes filters on the proxy. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702 From tim at fourstonesExpressions.com Tue Mar 18 13:36:34 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Tue Mar 18 14:36:39 2003 Subject: [Spambayes] setup In-Reply-To: Message-ID: <2VWNUQ09SP959809QLPK72DCD0XJD.3e777542@myst> 3/18/2003 1:29:15 PM, "Vaughan, David" wrote: > > It's not supposed to be this hard :-) > > I'll keep trying but presently am unable to set up POP3. I get the >message "Connection to server imap.mail.netcenter.com timed out." but can >not find in the Netscape 7.02 preferences where to set the server name. pop3proxy does not support imap servers at this time. For that matter, there isn't any imap support in spambayes at this point in time... :( c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org From ducky at webfoot.com Tue Mar 18 15:43:52 2003 From: ducky at webfoot.com (Kaitlin Duck Sherwood) Date: Tue Mar 18 18:38:34 2003 Subject: [Spambayes] OT: email developer posting at OSAF Message-ID: (Hi, sorry for being a bit off-topic, but this seems like an outstanding place to look for people who know about open source projects, email, and python.) We at the Open Source Applications Foundation are looking for an experienced and self-motivated person to join our development team in the San Francisco area.
The ideal person would have a knowledge of e-mail protocols and standards as well as experience producing end-user software. User interface experience is very valuable. We have just posted this job to http://www.osafoundation.org/employment.htm Interested parties can send information to jobs@osafoundation.org. (Note that this is my "home" email account, not my OSAF account, so don't reply to this account.) From T.A.Meyer at massey.ac.nz Wed Mar 19 12:14:34 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 18 19:15:16 2003 Subject: [Spambayes] New Outlook binary available Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8FD@its-xchg4.massey.ac.nz> > I have made a new Outlook installer binary on my starship page - > http://starship.python.net/crew/mhammond/spambayes/ (Should > I be putting these on the main spambayes page, even though they > aren't official releases? I'm happy to!) IMO, yes (I don't see how they are any less official than the alpha1 and alpha2 that have gone out). =Tony Meyer From T.A.Meyer at massey.ac.nz Wed Mar 19 12:38:44 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Tue Mar 18 19:39:19 2003 Subject: [Spambayes] setup Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C900@its-xchg4.massey.ac.nz> > pop3proxy does not support imap servers at this time. For > that matter, there isn't any imap support in spambayes > at this point in time... :( How hard would this be to address? I've read all the past messages about how complex IMAP is, but to just hook into spambayes like pop3proxy does. (Obviously the ui - which probably will be a separate module someday anyway - could be reused). Does anyone have an IMAP account that would be willing to work on this? Apart from the impossible case of webmail, it does seem like IMAP is the last big group of users that can't use spambayes. 
=Tony Meyer From david at theresistance.net Tue Mar 18 21:00:49 2003 From: david at theresistance.net (David Shaw) Date: Tue Mar 18 21:02:21 2003 Subject: [Spambayes] Outlook 2002 In-Reply-To: <418A63CAEBF2D4118A1A00508BB1A0B8029F1709@exuk01> Message-ID: <9B8C1314-59AE-11D7-8825-000393582EF6@theresistance.net> I got the spam below today. Spambayes said it was ham. I trained it as spam. I reclassified. Spambayes still said it was ham. I had to classify it as spam 6 times before it would recognize it as such. I think this list makes spam about antispam software get by 100% of the time (this list comprises over half of my daily ham). ---- From stiffypop17291@mindless.com Tue Mar 18 14:48:05 2003 Return-Path: Received: from ZZ (67.97.202.131) by theresistance.net with SMTP (Eudora Internet Mail Server 3.0.3) for ; Tue, 18 Mar 2003 14:33:31 -0500 To: David From: stiffypop17291@mindless.com Reply-To: tallstranger00897@engineer.com Sender: jeffrey_dunlap3581@paris.com X-Mailer: OutLook Express IMO, 59 Subject: David, Intelligent antispam IER software MIME-Version: 1.0 Content-type: text/html Content-Transfer-Encoding: 8bit Date: Tue, 18 Mar 2003 14:33:31 -0500 Message-ID: <1164106485-1165210066@theresistance.net> Spam Remedy
TheVeryBest - Software Downloads
 Top-Rank Software Download Site on the Internet 
Internet->Email->Spam Remedy v1.5 PRO

Spam Remedy        (3.17MB)

Description:

The powerful, effective and intelligent anti-spam tool.
It automatically cleans spam messages out of your mailbox before you receive or read them.


Features:
  • Automatically Blocking Spam
    Spam Remedy automatically checks your mail boxes and filters unwanted, dangerous, or offensive mail messages to save your time from manually detecting and organizing mail messages.
  • Effectively Spam Detecting
    A complex Aritificial Intelligence algorithm has been used in Spam Remedy product to detecting legitimate mail messages and spam messages,the technique has more precision than other filter-based and keyword-based anti-spam technologies.
  • Be Sure You Get Your Right Mail Messages
    Spam Remedy doesn't confirm a spam message by a single keyword in mail content. It examines the entire message - source, headers and mail content to confirm whether it is a spam message.
  • Supports Multiple Email Types and Almost All Email Clients
    Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and MAPI email accounts,Directly works with almost all email clients(Outlook Express, Becky Mail,Foxmail,Outlook, The bat!, Eudora etc.), espacially includes support for web-based Hotmail/MSN email clients. Nothing you need to change to your email clients!
  • Easy to use  - You don't need to set any complex filter rules, just add your email accounts to Spam Remedy and then it works.
  • Friends List and Rejecting List
    With Friends List and Rejecting List,you have the chance to decide who are never blocked or directly treat their mail messages as spam.
  • Keep your inbox clean
    Spam Remedy places all intercepted spam messages to its interval mail database so that your inbox remains uncluttered and free of spam.If for some reason a legitimate email is flagged as spam, you can easily recover in multiple ways.

    Editor's Rating:
Copyright ?2002-2003 DarkSoft Group  All Rights Reserved.
From mhammond at skippinet.com.au Wed Mar 19 13:09:53 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Mar 18 21:10:59 2003 Subject: [Spambayes] Outlook 2002 In-Reply-To: <9B8C1314-59AE-11D7-8825-000393582EF6@theresistance.net> Message-ID: Can you mail me the "spam clues" for one such message? Although the behaviour you describe is possibly correct, I would like to make sure we are seeing all the payload etc. Mark. > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of David Shaw > Sent: Wednesday, 19 March 2003 1:01 PM > To: spambayes@python.org > Subject: Re: [Spambayes] Outlook 2002 > > > I got the spam below today. Spambayes said it was ham. I trained it > as spam. I reclassified. Spambayes still said it was ham. I had to > classify it as spam 6 times before it would recognize it as such. I > think this list makes spam about antispam software get by 100% of the > time (this list comprises over half of my daily ham). > > ---- > From stiffypop17291@mindless.com Tue Mar 18 14:48:05 2003 > Return-Path: > Received: from ZZ (67.97.202.131) by theresistance.net with SMTP (Eudora > Internet Mail Server 3.0.3) for ; > Tue, 18 Mar 2003 14:33:31 -0500 > To: David > From: stiffypop17291@mindless.com > Reply-To: tallstranger00897@engineer.com > Sender: jeffrey_dunlap3581@paris.com > X-Mailer: OutLook Express IMO, 59 > Subject: David, Intelligent antispam IER software > MIME-Version: 1.0 > Content-type: text/html > Content-Transfer-Encoding: 8bit > Date: Tue, 18 Mar 2003 14:33:31 -0500 > Message-ID: <1164106485-1165210066@theresistance.net> > > Spam Remedy > > >
> TheVeryBest - Software Downloads
> Top-Rank Software Download Site on the Internet
>
> Internet-> <http://www.Siliconeparadise.com/remedy/index.html?Utw2EJz3u7>
> Email-> <http://www.Siliconeparadise.com/remedy/index.html?hWT14FrUkz>
> Spam Remedy v1.5 PRO <http://www.Siliconeparadise.com/remedy/index.html?atGiTvEygc>
>
> [image: http://siliconeparadise.com/ads/logo.gif] Spam Remedy
> [image: http://siliconeparadise.com/ads/buy.gif] <http://www.Siliconeparadise.com/remedy/index.html?20voRZ0tTk>
> [image: http://siliconeparadise.com/ads/download.gif] <http://www.Siliconeparadise.com/remedy/index.html?g07NR2iaLv> (3.17MB)
> [image: http://siliconeparadise.com/ads/screen.gif] <http://www.Siliconeparadise.com/remedy/index.html?zLD6WmqqLx>
>
> Description:
> The powerful, effective and intelligent anti-spam tool.
> It automatically cleans spam messages out of your mailbox
> before you receive or read them.
>
> Features:
>
> Copyright ?2002-2003 DarkSoft <http://www.Siliconeparadise.com/remedy/index.html?29KRcCFDrp>
> Group  All Rights Reserved.
> > > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes > From david at theresistance.net Tue Mar 18 22:19:41 2003 From: david at theresistance.net (David Shaw) Date: Tue Mar 18 22:21:12 2003 Subject: [Spambayes] Outlook 2002 In-Reply-To: Message-ID: <9FE05FE8-59B9-11D7-8825-000393582EF6@theresistance.net> I think this is what you want. I restored my backup of the hammie.db to before I trained on the message in question, and here's what I got: *H* 1.0 *S* 3.33066907388e-16 headers 0.00257289879931 spam. 0.00310559006211 content. 0.00517836593786 inbox 0.00585175552666 source, 0.00634696755994 remains 0.00850661625709 recover 0.00884086444008 technique 0.00884086444008 mailbox 0.00959488272921 boxes 0.0104895104895 keyword 0.0115681233933 anti-spam 0.0115681233933 eudora 0.0115681233933 description: 0.0136778115502 clients. 0.0155709342561 web-based 0.0167286245353 algorithm 0.0167286245353 express, 0.0167286245353 features: 0.0196506550218 rules, 0.0215311004785 spam 0.0223177887819 detecting 0.0238095238095 precision 0.0266272189349 etc.), 0.0266272189349 reason 0.0289726436155 offensive 0.0348837209302 examines 0.0348837209302 subject:software 0.0348837209302 complex 0.038637702312 spam, 0.0392551056175 dangerous, 0.0412844036697 pop3, 0.0412844036697 messages 0.0430328965312 filter 0.0465043869016 subject:antispam 0.0505617977528 subject:Intelligent 0.0505617977528 subject:IER 0.0505617977528 remedy 0.0505617977528 doesn't 0.0585953340706 message. 
0.0640451247568 editor's 0.0652173913043 intercepted 0.0652173913043 hotmail/msn 0.0652173913043 top-rank 0.0652173913043 espacially 0.0652173913043 ?2002-2003 0.0652173913043 v1.5 0.0652173913043 url:buy 0.0652173913043 unwanted, 0.0652173913043 uncluttered 0.0652173913043 spam.if 0.0652173913043 rejecting 0.0652173913043 rating: 0.0652173913043 messages,the 0.0652173913043 mapi 0.0652173913043 imap4 0.0652173913043 hotmail/msn, 0.0652173913043 filter-based 0.0652173913043 (3.17mb) 0.0652173913043 bat!, 0.0652173913043 becky 0.0652173913043 clients! 0.0652173913043 checks 0.0710347118419 filters 0.0803222637998 clients 0.0855282287063 supports 0.0871172200062 aritificial 0.0918367346939 theverybest 0.0918367346939 darksoft 0.0918367346939 sure 0.0992330392591 almost 0.102559626457 works. 0.111519301848 clean 0.120768412967 them. 0.121151958915 single 0.122015711876 works 0.125061861969 add 0.126036514129 tool. 0.132129948073 skip:m 20 0.138677557916 list 0.141137401539 url:download 0.152090210781 content 0.152386830957 multiple 0.152995044376 manually 0.153866187624 set 0.155020414893 url:screen 0.155172413793 url:remedy 0.155172413793 url:4horse 0.155172413793 -> 0.155172413793 database 0.161594300822 mail 0.166700111476 blocked 0.18061971205 used 0.184085816436 entire 0.185547009874 keep 0.192563678333 whether 0.194129600717 support 0.197733550434 directly 0.201527083185 powerful, 0.211649404564 then 0.220440381137 its 0.226466533203 change 0.229694796147 intelligent 0.23151590252 ways. 
0.236278160711 some 0.242962413214 types 0.243161455013 read 0.247014415533 nothing 0.257912566324 confirm 0.273987887728 places 0.273987887728 use 0.275639337316 blocking 0.278210951417 need 0.283254780348 accounts 0.287308467645 group 0.288249835829 don't 0.295123702457 software 0.299203612621 treat 0.299715643894 has 0.299716644058 pro 0.305490905645 before 0.309520509917 message 0.315550066847 skip:k 10 0.320136815288 flagged 0.321924580422 download 0.323406630897 right 0.324779887317 that 0.333199816514 easy 0.333347052849 than 0.335342627679 easily 0.339718352497 been 0.345234674073 url:siliconeparadise 0.655538429202 url:www 0.65776955901 free 0.712682333876 from:no real name:2**0 0.729567209811 url:html 0.774652055141 header:Received:1 0.795755497715 copyright 0.796002970816 x-mailer:outlook express imo, 59 0.832612041334 receive 0.834103482178 rights 0.856909611825 reserved. 0.887970350732 url:logo 0.888597344878 url:index 0.89243705346 subject:David 0.916407352467 virus: >> >> >> >> >> >> >> >> >> >> >>
>> Copyright ?2002-2003 DarkSoft <http://www.Siliconeparadise.com/remedy/index.html?29KRcCFDrp>
>> Group  All Rights Reserved.
>>
>

From tim at fourstonesExpressions.com Tue Mar 18 21:32:49 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 18 22:32:59 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <9B8C1314-59AE-11D7-8825-000393582EF6@theresistance.net>
Message-ID:

3/18/2003 8:00:49 PM, David Shaw wrote:

> I think this list makes spam about antispam software get by 100% of the
>time (this list comprises over half of my daily ham).

Hmmmm... interesting.  Perhaps we should put whitelist rules in the system

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't

From T.A.Meyer at massey.ac.nz Wed Mar 19 15:56:09 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 22:56:59 2003
Subject: [Spambayes] Outlook 2002
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90A@its-xchg4.massey.ac.nz>

> (this list comprises over half of my daily ham).

> spam. 0.00310559006211
> anti-spam 0.0115681233933
> spam 0.0223177887819
> spam, 0.0392551056175
> subject:antispam 0.0505617977528
> spam.if 0.0652173913043

Perhaps you should train less on this list and more on the remaining ham
you have?  More ham isn't necessarily better, if the ham contains a lot
of spam clues (as I understand it).  Presumably these six clues (even the
odd 'spam.if' clue) resulted from training on this list.

> espacially 0.0652173913043
> aritificial 0.0918367346939

It seems quite strange to me that these two misspelled words score so
low.  Do you get a lot of ham that has poorly spelled words?

> ?2002-2003 0.0652173913043

This also seems strange; do you really have a lot of ham with this sort
of copyright info?

> v1.5 0.0652173913043

Or this version number?

> url:buy 0.0652173913043

Or email with 'buy' in an embedded URL?

> theverybest 0.0918367346939
> darksoft 0.0918367346939

These seem even stranger.  I didn't read the email, but Darksoft is the
manufacturer, right?  Any idea what ham contributed such low scores to
these words?

It almost looks to me like you have a similar email in your ham
somewhere that was mistrained - I don't see how a lot of these clues
could result from this list.

=Tony Meyer

From T.A.Meyer at massey.ac.nz Wed Mar 19 16:00:59 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 23:01:36 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90C@its-xchg4.massey.ac.nz>

[Geoff's problems installing the Outlook plugin]
> The new file installs correctly ... we have lift off ...
> thanks for your help

No worries, I'm glad it works now - and it's not that I actually did
anything in the end!

I've crossed this to the list & Mark so that we know that it works now,
and so if anyone has anything similar they know to try the new installer.

=Tony Meyer

From T.A.Meyer at massey.ac.nz Wed Mar 19 16:03:59 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 23:04:34 2003
Subject: [Spambayes] Beta status checklist
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90D@its-xchg4.massey.ac.nz>

With a new version of the Outlook plugin released, and TimS close to
finishing up alpha3, I was wondering how close things were to beta.

I was wondering if we could come up with a list of 'to do's that the
consensus agreed needed to be implemented/fixed before we would consider
that spambayes was ready for a first beta release.

So, to start off:
* Much better documentation for the SMTP proxy training option.

Anyone care to add to the list?

=Tony Meyer

From tim at fourstonesExpressions.com Tue Mar 18 22:09:32 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 18 23:09:39 2003
Subject: [Spambayes] Beta status checklist
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90D@its-xchg4.massey.ac.nz>
Message-ID:

>So, to start off:
>* Much better documentation for the SMTP proxy training option.
* Incorporation of integration.txt (and probably other text files) into the
website, and maybe a review of the mailing list for faq type information
* Installation with some level of migration from previous release.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't

From tim.one at comcast.net Tue Mar 18 23:10:18 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 18 23:11:44 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90A@its-xchg4.massey.ac.nz>
Message-ID:

[Meyer, Tony]
> ...
> It almost looks to me like you have a similar email in your ham
> somewhere that was mistrained -

Or many.  The oddball clues have spamprobs too low to be due to hapaxes.
Other oddities:

    messages,the 0.0652173913043
    spam.if 0.0652173913043
    theverybest 0.0918367346939

From tim_one at email.msn.com Tue Mar 18 23:53:19 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Tue Mar 18 23:55:22 2003
Subject: [Spambayes] New Outlook binary available
In-Reply-To:
Message-ID:

[Mark Hammond]
> I have made a new Outlook installer binary on my starship page -
> http://starship.python.net/crew/mhammond/spambayes/ (Should I be putting
> these on the main spambayes page, even though they aren't official
> releases?  I'm happy to!)

+1, if it increases visibility and/or distribution, and I expect it does
both to make the installer available from both.
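
[Editor's sketch] Tim Peters' remark above, that the oddball clues are too low to be hapaxes, can be checked with a little arithmetic. This is a minimal sketch that assumes the classifier applies Robinson's unknown-word adjustment with a strength of 0.45 and a prior of 0.5. Those two constants, the function name, and the argument names are assumptions for illustration and are not stated anywhere in this thread. Under that assumption a token seen in exactly one ham and no spam scores about 0.155, and the lower scores in David's clue list correspond to tokens seen in two, three, or four hams and never in spam.

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Robinson-adjusted spam probability for one token.

    s (strength given to the prior) and x (prior for an unseen token)
    are ASSUMED defaults, not values quoted in the thread.
    """
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    total = hamratio + spamratio
    # Raw estimate from the training counts alone.
    raw = spamratio / total if total else x
    # Blend the raw estimate with the prior, weighted by how often
    # the token has actually been seen.
    n = spamcount + hamcount
    return (s * x + n * raw) / (s + n)


if __name__ == "__main__":
    # Tokens seen only in ham: the result does not depend on nspam/nham
    # as long as both are positive, so 100/100 is an arbitrary choice.
    for hamcount in (1, 2, 3, 4):
        print(hamcount, spamprob(0, hamcount, nspam=100, nham=100))
```

If these assumptions hold, the 0.155172413793 entries in David's list (url:remedy, url:screen) would be the true hapaxes, while the 0.0918, 0.0652 and 0.0506 entries would each have been trained as ham two, three, and four times respectively, which fits the "or many" reading.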
From acunningham at rsasecurity.com Wed Mar 19 10:22:14 2003
From: acunningham at rsasecurity.com (Cunningham, Andy)
Date: Wed Mar 19 05:17:02 2003
Subject: [Spambayes] Beta status checklist (or this turning into new feature requests?)
Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F1722@exuk01>

I'd add the Outlook Binary installer as a part of the release.  I think
this is going to make a huge difference to take-up within the Windows
world - it will be a prerequisite for any kind of corporate use....

What are people's thoughts on some kind of predefined training database
(like ham/spam terms that appear in more than x% of submitted training
databases)?

The other feature that I personally would like to see is the ability to
send an NDR when the message is identified as definitely being spam -
what are people's thoughts on this?

AndyC

-----Original Message-----
From: Tim Stone - Four Stones Expressions
[mailto:tim@fourstonesExpressions.com]
Sent: 19 March 2003 04:10
To: Spambayes; Meyer, Tony
Subject: Re: [Spambayes] Beta status checklist

>So, to start off:
>* Much better documentation for the SMTP proxy training option.
* Incorporation of integration.txt (and probably other text files) into the
website, and maybe a review of the mailing list for faq type information
* Installation with some level of migration from previous release.
c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes From tim at fourstonesExpressions.com Wed Mar 19 06:53:32 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 19 07:53:42 2003 Subject: [Spambayes] Beta status checklist In-Reply-To: Message-ID: 3/18/2003 10:09:32 PM, Tim Stone - Four Stones Expressions wrote: >>So, to start off: >>* Much better documentation for the SMTP proxy training option. >* Incorporation of integration.txt (and probably other text files) into the >website, and maybe a review of the mailing list for faq type information >* Installation with some level of migration from previous release. * Prerequisite checking for email and dbm modules (at least) > >c'est moi - TimS >http://www.fourstonesExpressions.com >http://wecanstopspam.org > >There are 10 kinds of people in the world: > those who understand binary, > and those who don't > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't From noreply at sourceforge.net Wed Mar 19 02:03:48 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 19 08:14:29 2003 Subject: [Spambayes] [ spambayes-Bugs-706170 ] Execute test suite fails in Outlook Message-ID: Bugs item #706170, was opened at 2003-03-19 11:03 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) 
Assigned to: Mark Hammond (mhammond)
Summary: Execute test suite fails in Outlook

Initial Comment:
The test suite fails in Outlook.  I've retrained messages from a spam and
a ham folder.  I think this may be related to moving the database files
from the spambayes folder to the default docs folders in Windows a couple
of weeks ago.

The following traceback is shown in PythonWin:

Executing automated tests...
Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 308, in Tester
    tester.test(manager)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 306, in test
    TestSpamFilter(driver)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 173, in TestSpamFilter
    msg, words = driver.CreateTestMessageInFolder(SPAM, driver.folder_watch)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 132, in CreateTestMessageInFolder
    msg, words = self.CreateTestMessage(spam_status)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 145, in CreateTestMessage
    words.update(FindTopWords(self.manager.bayes, 50, True))
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 64, in FindTopWords
    for word, info in extractor(bayes):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 46, in DBExtractor
    key = bayes.dbm.next()[0]
  File "C:\PROGRA~1\_DEV\Python22\Lib\site-packages\bsddb3\__init__.py", line 122, in next
    rv = self.dbc.next()
DBNotFoundError: (-30991, 'DB_NOTFOUND: No matching key/data pair found')
Tests FAILED.  Sorry about that.  If I were you, I would do a full re-train ASAP
Please delete any test messages from your Spam, Unsure or Inbox folders first.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702

From ilumb at platform.com Wed Mar 19 11:09:29 2003
From: ilumb at platform.com (Ian Lumb)
Date: Wed Mar 19 11:10:46 2003
Subject: [Spambayes] Beta status checklist (or this turning into new feature requests?)
Message-ID: <4AB0624F069DAD4E90F18B13A818EEFE076690@catoexm04.noam.corp.platform.com>

Add the Outlook binary installer as part of the release?  Absolutely!

Pre-defined training db?  Not sure.  Why?  Many mail servers already use
spam-filtering technologies.  Currently, client-side spambayes
complements what exists on the server-side.

BTW, are there plans to develop server-side spambayes?  (Apologies if
this is a FAQ.)  I know that it can eclipse what we are currently using
on our Exchange server :-)

-Ian

-----Original Message-----
From: Cunningham, Andy [mailto:acunningham@rsasecurity.com]
Sent: Wednesday, March 19, 2003 5:22 AM
To: 'tim@fourstonesExpressions.com'; Spambayes; Meyer, Tony
Subject: RE: [Spambayes] Beta status checklist (or this turning into new feature requests?)

I'd add the Outlook Binary installer as a part of the release.  I think
this is going to make a huge difference to take-up within the Windows
world - it will be a prerequisite for any kind of corporate use....

What are people's thoughts on some kind of predefined training database
(like ham/spam terms that appear in more than x% of submitted training
databases)?

The other feature that I personally would like to see is the ability to
send an NDR when the message is identified as definitely being spam -
what are people's thoughts on this?
AndyC

-----Original Message-----
From: Tim Stone - Four Stones Expressions
[mailto:tim@fourstonesExpressions.com]
Sent: 19 March 2003 04:10
To: Spambayes; Meyer, Tony
Subject: Re: [Spambayes] Beta status checklist

>So, to start off:
>* Much better documentation for the SMTP proxy training option.
* Incorporation of integration.txt (and probably other text files) into the
website, and maybe a review of the mailing list for faq type information
* Installation with some level of migration from previous release.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't

From db3l at fitlinxx.com Wed Mar 19 11:32:37 2003
From: db3l at fitlinxx.com (David Bolen)
Date: Wed Mar 19 12:20:24 2003
Subject: [Spambayes] Outlook addin delay in updating and resetting unread
Message-ID:

I know this has come up in the past on this list, but I just installed
the most recent installer-based version of the Outlook add-in on a
co-worker's machine, and am seeing the problem pretty consistently,
whereas before it was more hit or miss (or so I thought).

Originally I thought it happened more there than on my own machine, but
after paying more attention, I now seem to be able to reproduce this
consistently on my machine too - not sure if this is easier with the
installer version (which I just switched to trying, having been using
the source release up to now).

The behavior is that when a new message arrives, it shows as new in
Outlook (this is Outlook with a corporate Exchange server), but the spam
column is not filled in (the behavior occurs with or without display of
the column, but it's easy to see with the column present).  The log file
shows that the message has been classified so it certainly seems to be
an Outlook issue.  We've waited for over a minute with no change.

If you interact with Outlook in various ways (switch to a different
folder and back, often just opening the message or creating a reply) the
field will update, but at the same time the message gets re-marked as
unread even if you had just opened it and marked it read.  This of
course is annoying to the user because they're currently reading the
message but it will still show as unread when they are done.

Interestingly enough, whatever latency or update problem exists is
always behind by one message - if a new message arrives, the prior
message will have its Spam field updated.  If the newly arrived message
is filtered by the addin, it does move or take whatever operation, so
again the filter appears to be running.

It seems clear that this is some latency issue with Outlook updating
the status of a message - it's not clear if resetting the read bit is
because the delayed status includes an explicit unread bit, or if
Outlook is just refreshing the status of the message as of the delayed
update.

I'm going to switch back to the source version to see if it's just as
reproducible there (or if I just got used to it without realizing it)
to play around a little, but was wondering if anyone had any other ideas
or knew of any workarounds for my co-worker in the meantime?

Thanks.
-- David From Paul.Moore at atosorigin.com Wed Mar 19 17:33:35 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Wed Mar 19 12:34:59 2003 Subject: [Spambayes] Outlook addin delay in updating and resetting unread Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D992@UKDCX001.uk.int.atosorigin.com> From: David Bolen [mailto:db3l@fitlinxx.com] > It seems clear that this is some latency issue with Outlook > updating the status of a message - it's not clear if resetting > the read bit is because the delayed status includes an explicit > unread bit, or if outlook is just refreshing the status of the > message as of the delayed update. I've seen this problem as well, also on an Exchange server. I'm quite a way behind on releases (haven't updated from CVS for a few weeks), so it's not a new thing. For me, it's always been pretty random (as far as I've been able to tell) so I've never been able to offer much to go on. But yes, it's irritating. I've not heard anyone hitting this except on Exchange, so maybe it's an issue with how Outlook interacts with Exchange rather than a pure Outlook issue...? Paul. From db3l at fitlinxx.com Wed Mar 19 13:05:41 2003 From: db3l at fitlinxx.com (David Bolen) Date: Wed Mar 19 13:05:47 2003 Subject: [Spambayes] Re: Outlook addin delay in updating and resetting unread References: <16E1010E4581B049ABC51D4975CEDB880113D992@UKDCX001.uk.int.atosorigin.com> Message-ID: "Moore, Paul" writes: > I've not heard anyone hitting this except on Exchange, so maybe > it's an issue with how Outlook interacts with Exchange rather than > a pure Outlook issue...? Could be - I only work with an Exchange server. I've determined that a workaround that seems solid at this point is to execute another SaveChanges() call when changing the Spam property. At this point I'm testing it via another msg.Save() call up in the filter module. 
I'm not sure why, but since that results in two saves during filtering (one after the spam property is updated and another after the mail folder information as part of the actions) the display is always updated immediately. It's probably a bit more overhead, but I'm not sure how much and if it fixes the issue, I'm willing to spend the extra API call (which presumably may result in another round trip to the server). I've only been able to test on my machine so far since I'm having trouble exactly replicating the binary installer package, but it's working for me. -- David From noreply at sourceforge.net Wed Mar 19 12:46:25 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 19 15:39:13 2003 Subject: [Spambayes] [ spambayes-Bugs-706520 ] assert fails in classifier Message-ID: Bugs item #706520, was opened at 2003-03-19 12:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Adam Glass (adamglass) Assigned to: Nobody/Anonymous (nobody) Summary: assert fails in classifier Initial Comment: This morning, I noticed that my emails no longer had a X-Spambayes-Classification header, so I looked through my procmail logs, and sure enough, hammiefilter.py is giving a traceback when an assertion fails. This happens on all messages now; it is not specific to a single message, or intermittent. Therefore, I suspect my .hammiedb is corrupted... I can supply it to anyone who would like to investigate it for debugging purposes. I am using Spambayes 1.0a2, installed on a system with Python 2.2.1, with the new version of the email library (as per the install docs.) Please contact me if you require any further details. 
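
[Editor's sketch] The traceback that follows ends in `assert spamcount <= nspam`, that is, a token claims to have been seen in more spam messages than were ever trained. A minimal sketch of that invariant as a standalone check is below. The data layout (a plain dict mapping token to a (spamcount, hamcount) pair) and the function name are hypothetical and chosen only for illustration; this is not the real .hammiedb format or a spambayes API.

```python
def find_inconsistent_tokens(nspam, nham, wordinfo):
    """Return tokens whose counts exceed the number of trained messages.

    wordinfo is a HYPOTHETICAL in-memory stand-in for the word database:
    {token: (spamcount, hamcount)}.
    """
    bad = []
    for token, (spamcount, hamcount) in wordinfo.items():
        # Same condition the failing assertion guards, plus its ham twin.
        if spamcount > nspam or hamcount > nham:
            bad.append(token)
    return sorted(bad)


if __name__ == "__main__":
    wordinfo = {"free": (12, 1), "python": (0, 40), "remedy": (3, 0)}
    # Counters intact: no token exceeds the totals.
    print(find_inconsistent_tokens(20, 50, wordinfo))
    # Counters lost (reset to zero): every token now violates the invariant.
    print(find_inconsistent_tokens(0, 0, wordinfo))
```

A check of this shape would flag every token at once if the message totals were reset while the per-token counts survived, which would match an assertion that fires on all messages rather than on one.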
Example of how to generate the error follows, along with traceback:

adam$ /usr/local/bin/hammiefilter.py -f -d $HOME/.hammiedb < example
Traceback (most recent call last):
  File "/usr/local/bin/hammiefilter.py", line 179, in ?
    main()
  File "/usr/local/bin/hammiefilter.py", line 175, in main
    action(msg)
  File "/usr/local/bin/hammiefilter.py", line 113, in filter
    return h.filter(msg)
  File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 108, in filter
    prob, clues = self._scoremsg(msg, True)
  File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 217, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 441, in _getclues
    prob = self.probability(record)
  File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 304, in probability
    assert spamcount <= nspam
AssertionError

----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702

From tim at fourstonesExpressions.com Wed Mar 19 14:48:35 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 15:48:41 2003
Subject: [Spambayes] Beta status checklist
In-Reply-To:
Message-ID: <53HESNHEAOJKIKFVT3294ZWC8SN1YHE.3e78d7a3@myst>

3/19/2003 6:53:32 AM, Tim Stone - Four Stones Expressions wrote:

>3/18/2003 10:09:32 PM, Tim Stone - Four Stones Expressions
> wrote:
>
>>>So, to start off:
>>>* Much better documentation for the SMTP proxy training option.
>>* Incorporation of integration.txt (and probably other text files) into the
>>website, and maybe a review of the mailing list for faq type information
>>* Installation with some level of migration from previous release.
>* Prerequisite checking for email and dbm modules (at least)
* Some kind of recovery from wordinfo database corruption (nham and nspam
are lost on an increasingly frequent basis)  Bug 706520 *MUST* be fixed.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't

From Phil.Cox at SystemExperts.com Wed Mar 19 14:01:53 2003
From: Phil.Cox at SystemExperts.com (Phil Cox)
Date: Wed Mar 19 17:20:52 2003
Subject: [Spambayes] Not getting the icon in the tool bar
Message-ID: <000201c2ee63$26bf9d20$0500000a@jiloa.com>

The application seems to be working, but I don't get the icon in the
toolbar to configure it.  Any thoughts?

Here is my log file:

SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Documents and Settings\pcc\Application Data\SpamBayes\default_bayes_database.db'
Loaded message database from 'C:\Documents and Settings\pcc\Application Data\SpamBayes\default_message_database.db'
Bayes database initialized with 0 spam and 0 good messages
Loaded databases in 2.70174ms

Phil

From bplist at www.wormy.org Wed Mar 19 19:00:21 2003
From: bplist at www.wormy.org (BP List)
Date: Wed Mar 19 17:42:53 2003
Subject: [Spambayes] mboxtrain.py error
Message-ID:

I created the database file with "hammiefilter.py -n".  It seems that
every mailbox file I run mboxtrain.py on results in an error similar to
this:

www:/home/bryan# ./mboxtrain.py -d /home/bryan/.hammiedb -g /home/bryan/mail/Mailbox -s /home/bryan/mail/SPAM
Training ham (/home/bryan/mail/Mailbox):
  Reading as Unix mbox
Traceback (most recent call last):
  File "./mboxtrain.py", line 284, in ?
    main()
  File "./mboxtrain.py", line 271, in main
    train(h, g, False, force)
  File "./mboxtrain.py", line 209, in train
    mbox_train(h, path, is_spam, force)
  File "./mboxtrain.py", line 166, in mbox_train
    fcntl.lockf(f, fcntl.LOCK_UN)
IOError: [Errno 16] Device or resource busy

I have tried this as root and as the user.

I assume that there is really nothing wrong with mboxtrain.py, but I
don't have the faintest idea where to start looking.  I've tried several
mailbox files all with the same result.  I am also sure that no one had
the mailbox open.

I've just installed all the latest supporting applications that are
listed in the spambayes documentation.  Please let me know if you need
any specific details.  Thanks in advance!

--
Bryan

From T.A.Meyer at massey.ac.nz Thu Mar 20 10:49:27 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 19 17:50:46 2003
Subject: [Spambayes] Not getting the icon in the tool bar
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9B@its-xchg4.massey.ac.nz>

> The application seems to be working, but I don't get the icon in the
> toolbar to configure it.  Any thoughts?

Well, there's nothing in the log file, so you're right, it does seem to
be working.  I would suggest that you try:
* making sure that you're in the inbox, and not in something like
  "Outlook Today"
* ensuring that the standard toolbar is visible
* resetting the standard toolbar (right-click on it, choose customize,
  and then reset) and restarting Outlook.

Which version of the plugin are you using?
(a) The most recent (002) installer (binary) version from Mark's website?
(b) The older (001) installer (binary) version from Mark's website?
    (this is known to have this sort of bug, so if so, please get the
    newer version)
(c) The most recent CVS source
(d) Old CVS source.

=Tony Meyer

From T.A.Meyer at massey.ac.nz Thu Mar 20 10:52:53 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 19 17:57:50 2003
Subject: [Spambayes] Beta status checklist (not new feature requests!)
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C915@its-xchg4.massey.ac.nz>

> Add the Outlook binary installer as part of the release?  Absolutely!

I've discussed this with TimS off-list, but what's the general consensus
here?
(History-wise, I believe that alpha1 didn't have the plugin, but alpha2 does, and alpha3 will). I *don't* think that the Outlook plugin should be part of a beta release. The installer that Mark's created does a much better job, I think, and only installs those bits that the plugin needs, not pop3proxy and all the rest. IMO, a (potential) user should download *either* the Outlook installer, *or* a beta release of everything else. Thoughts? (Mark?) =Tony Meyer From T.A.Meyer at massey.ac.nz Thu Mar 20 11:03:51 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 19 18:04:25 2003 Subject: [Spambayes] Beta status checklist (or this turning into new feature requests?) Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9C@its-xchg4.massey.ac.nz> [Andy Cunningham] > What are people's thoughts on some kind of predefined > training database > (like ham/spam terms that appear in more than x% of submitted training > databases)? This has had plenty of discussion previously; I'd suggest people flick through the archives if they haven't read them already. To add my 2c, since I haven't previously, I would say that this is a *bad idea*, unless a system is developed to expunge this pre-defined set at some point after the user has collected their own data. You only have to train a single message and the system will do better than a coin-toss, so there's no need to have pre-defined stuff - it catches onto the sort of messages that would be in a pre-defined db so quickly I don't think there's any point. People that don't want to train would be better off with SpamAssassin, or something like that. =Tony Meyer From T.A.Meyer at massey.ac.nz Thu Mar 20 11:06:33 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Wed Mar 19 18:07:08 2003 Subject: [Spambayes] Beta status checklist (or this turning into newfeature requests?) Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C917@its-xchg4.massey.ac.nz> > BTW, are there plans to develop server-side spambayes? 
> (Apologies if this is a FAQ.) I know that it can eclipse what > we are currently using on our Exchange server :-) This is a FAQ, and does lend weight to TimS's suggestion that we need a list of answers for FAQs :) I have no experience with using spambayes in a server type situation, but from reading the messages on the list, I believe that this can be done now, to a certain extent. The real question is how you want training to be done - does the admin do it? Does everyone contribute? Do you want users to have a shared definition of spam, or individual? =Tony Meyer From B-Morgan at concentric.net Wed Mar 19 16:10:20 2003 From: B-Morgan at concentric.net (Brad Morgan) Date: Wed Mar 19 18:10:55 2003 Subject: [Spambayes] Beta status checklist (not new feature requests!) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C915@its-xchg4.massey.ac.nz> Message-ID: > IMO, a (potential) user should download *either* the Outlook installer, *or* > a beta release of everything else. > Thoughts? (Mark?) > =Tony Meyer This sounds reasonable to me. Regards, Brad Morgan From mhammond at skippinet.com.au Thu Mar 20 10:22:28 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 19 18:23:04 2003 Subject: [Spambayes] Beta status checklist (not new feature requests!) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C915@its-xchg4.massey.ac.nz> Message-ID: > > Add the Outlook binary installer as part of the release? Absolutely! > > I've discussed this with TimS off-list, but what's the general > consensus here? > > (History-wise, I believe that alpha1 didn't have the plugin, but > alpha2 does, and alpha3 will). > > I *don't* think that the Outlook plugin should be part of a beta > release. The installer that Mark's created does a much better > job, I think, and only installs those bits that the plugin needs, > not pop3proxy and all the rest. > > IMO, a (potential) user should download *either* the Outlook > installer, *or* a beta release of everything else. 
Sounds fine to me - except I would raise the bar a little - why not make a pop3proxy *binary* release for Windows too - then the problem becomes moot - on Windows you get a binary. I realize time is an issue, so this strategy sounds OK for beta2, but maybe we could aim for a beta3 with binaries before v1. Mark. From skip at pobox.com Wed Mar 19 17:43:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 19 18:44:06 2003 Subject: [Spambayes] Beta status checklist (or this turning into new feature requests?) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9C@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9C@its-xchg4.massey.ac.nz> Message-ID: <15993.181.61574.726@montanaro.dyndns.org> Tony> [Andy Cunningham] >> What are people's thoughts on some kind of predefined training >> database (like ham/spam terms that appear in more than x% of >> submitted training databases)? Tony> To add my 2c, since I haven't previously, I would say that this is Tony> a *bad idea*, unless a system is developed to expunge this Tony> pre-defined set at some point after the user has collected their Tony> own data. My 2c... I am currently manually training for a couple of other people here at Northwestern. I do want to get more victims^H^H^H^H^H^H^H early adopters, but it generally seems to be working pretty well. I had to encourage them a little to send me ham which was correctly classified (spam is no problem, I have fountains full of the stuff). I think they seemed to expect the system to work properly from the get-go and were only sending me stuff that was misclassified or which wound up marked unsure. Accordingly, they were a bit confused at a few of the mistakes it made. At the moment I have just 152 hams and 135 spams in the training database. Things seem to be working okay though I haven't been tracking it in any formal sense, just in the sense that they aren't complaining.
;-) Skip From N7DR at arrisi.com Wed Mar 19 16:59:21 2003 From: N7DR at arrisi.com (D. R. Evans) Date: Wed Mar 19 18:59:26 2003 Subject: [Spambayes] database corruption Message-ID: <3E78A1E9.29639.DCE8D7@localhost> Just to let folk know that the database corruption that I reported and filed a while back has happened again (#699063). I live in Colorado and some of you may know that we just had a massive storm. The power here was intermittent for a couple of hours, and as a result the Linux box running pop3proxy.py went down a couple of times due to loss of power. When everything came back up and seemed stable, I restarted pop3proxy.py and was unable to restart pop3proxy.py because of database corruption. As before, there was no mail activity going on at the time of the crash. Tim suspected in the resolution of the bug report that switching to a newer version of bsddb would fix the problem. I'm not in a position to do that at the moment (maybe I'll try again after Mandrake 9.1 is released), so I will have to try to switch to some other Bayesian filtering system instead, at least for a while :-( Doc -------------------------------------------------------------- Phone: +1 303 494 0394 Mobile: +1 720 839 8462 Fax: +1 781 240 0527 -------------------------------------------------------------- From mhammond at skippinet.com.au Thu Mar 20 11:11:15 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 19 19:12:21 2003 Subject: [Spambayes] database corruption In-Reply-To: <3E78A1E9.29639.DCE8D7@localhost> Message-ID: > Just to let folk know that the database corruption that I reported and > filed a while back has happened again (#699063). How about we do a db sync after we perform a train? This shouldn't be too painful, won't affect scoring performance, and should always leave the DB consistent.
Only drawback I see is that after a huge retrain, my fast machine takes a number of seconds to save the DB - OTOH, paying this penalty *during* the retrain operation is more appealing than paying it at shutdown anyway. Mark. From tim at fourstonesExpressions.com Wed Mar 19 18:18:29 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 19 19:18:35 2003 Subject: [Spambayes] database corruption In-Reply-To: Message-ID: <96H5JIA8HFKJB73V3VQL73PL4XHEOJ.3e7908d5@myst> 3/19/2003 6:11:15 PM, "Mark Hammond" wrote: >> Just to let folk know that the database corruption that I reported and >> filed a while back has happened again (#699063). > >How about we do a db sync after we perform a train? This shouldn't be too >painful, won't affect scoring performance, and should always leave the DB >consistent. Only drawback I see is that after a huge retrain, my fast >machine takes a number of seconds to save the DB - OTOH, paying this penalty >*during* the retrain operation is more appealing than paying it at shutdown >anyway. I thought that the proxy does this already, but a cursory inspection of the code suggests that it isn't there. I'll check in a fix. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim at fourstonesExpressions.com Wed Mar 19 18:22:26 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 19 19:22:33 2003 Subject: [Spambayes] database corruption In-Reply-To: <96H5JIA8HFKJB73V3VQL73PL4XHEOJ.3e7908d5@myst> Message-ID: <04A5F1YHGFBZU04NL952VFDDAYWLJXV.3e7909c2@myst> 3/19/2003 6:18:29 PM, Tim Stone - Four Stones Expressions wrote: >3/19/2003 6:11:15 PM, "Mark Hammond" wrote: > >>> Just to let folk know that the database corruption that I reported and >>> filed a while back has happened again (#699063).
>> >>How about we do a db sync after we perform a train? This shouldn't be too >>painful, won't affect scoring performance, and should always leave the DB >>consistent. Only drawback I see is that after a huge retrain, my fast >>machine takes a number of seconds to save the DB - OTOH, paying this penalty >>*during* the retrain operation is moer appealing than paying it at shutdown >>anyway. > >I thought that the proxy does this already, but a cursory inspection of the >code doesn't look like that's there. I'll check in a fix. Well, after a closer look, it really does. The DBDictClassifier implementation does a db.sync() as well... > >c'est moi - TimS >http://www.fourstonesExpressions.com >http://wecanstopspam.org > >There are 10 kinds of people in the world: > those who understand binary, > and those who don't. > > > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From mhammond at skippinet.com.au Thu Mar 20 12:35:32 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 19 20:36:27 2003 Subject: [Spambayes] database corruption In-Reply-To: <04A5F1YHGFBZU04NL952VFDDAYWLJXV.3e7909c2@myst> Message-ID: > >I thought that the proxy does this already, but a cursory > inspection of the > >code doesn't look like that's there. I'll check in a fix. > > Well, after a closer look, it really does. The DBDictClassifier > implementation does a db.sync() as well... It does a db.sync() during a store, but that is all I can see. It does not sync after an individual train, which is what I was suggesting. Mark. 
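The "sync after every train" idea discussed above can be sketched roughly as follows. This is only an illustration, not SpamBayes code: the class name SyncingWordStore and its methods are made up for the example, and the standard-library shelve module stands in for whatever dbm backend the classifier really uses. Note that a sync only asks the backend to flush what it has buffered; whether that is enough to survive a power loss depends on the backend and the filesystem.

```python
import os
import shelve
import tempfile


class SyncingWordStore:
    """Hypothetical stand-in for a classifier's word database (not SpamBayes code)."""

    def __init__(self, path):
        # shelve.open creates the database file if it does not exist yet.
        self.db = shelve.open(path)

    def train(self, tokens, is_spam):
        # Update the (spam_count, ham_count) pair stored for each token.
        for token in tokens:
            spam, ham = self.db.get(token, (0, 0))
            if is_spam:
                spam += 1
            else:
                ham += 1
            self.db[token] = (spam, ham)
        # Flush immediately, so the cost is paid during training
        # rather than all at once at shutdown.
        self.db.sync()

    def close(self):
        self.db.close()


def demo():
    # Train twice, close, then reopen the file to see what was persisted.
    path = os.path.join(tempfile.mkdtemp(), "words")
    store = SyncingWordStore(path)
    store.train(["viagra", "free"], is_spam=True)
    store.train(["free", "meeting"], is_spam=False)
    store.close()
    reopened = shelve.open(path)
    counts = dict(reopened)
    reopened.close()
    return counts
```

The trade-off is the one Mark describes: a large retrain pays the save cost once per call to train() instead of once at exit.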
From tim at fourstonesExpressions.com Wed Mar 19 20:06:57 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 19 21:07:07 2003 Subject: [Spambayes] database corruption In-Reply-To: Message-ID: <52NIUSHCA9SOSQUP5ZZVOMA9IH52PLZW.3e792241@myst> 3/19/2003 7:35:32 PM, "Mark Hammond" wrote: >It does a db.sync() during a store, but that is all I can see. It does not >sync after an individual train, which is what I was suggesting. The pop3proxy initiates the store after a train, at line 945. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From mhammond at skippinet.com.au Thu Mar 20 13:21:29 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 19 21:22:40 2003 Subject: [Spambayes] database corruption In-Reply-To: <52NIUSHCA9SOSQUP5ZZVOMA9IH52PLZW.3e792241@myst> Message-ID: > >It does a db.sync() during a store, but that is all I can see. > It does not > >sync after an individual train, which is what I was suggesting. > > The pop3proxy initiates the store after a train, at line 945. Interesting. Then I wonder how this problem could occur. Presumably the original poster was not performing a train operation as the machine went down (certainly not *every* time this has happened). So assuming that a sync() was done at least a few seconds ago, what could cause the database to get into a corrupt state? How would the file ever change after the last train had completed? Mark. From tim at fourstonesExpressions.com Wed Mar 19 20:27:00 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 19 21:27:07 2003 Subject: [Spambayes] database corruption In-Reply-To: Message-ID: 3/19/2003 8:21:29 PM, "Mark Hammond" wrote: >Interesting. Then I wonder how this problem could occur.
Presumably the >original poster was not performing a train operation as the machine went >down (certainly not *every* time this has happened). So assuming that a >synch() was done at least a few seconds ago, what could cause the database >to get into a corrupt state? How would the file ever change after the last >train had completed? The only answer I can come up with is that there is a bug in whatever dbm implementation that D.R.Evans (and others) are currently using. Is there a way to determine what dbm implementation gets used by these guys? c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From noreply at sourceforge.net Wed Mar 19 16:31:47 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 19 22:30:39 2003 Subject: [Spambayes] [ spambayes-Bugs-706170 ] Execute test suite fails in Outlook Message-ID: Bugs item #706170, was opened at 2003-03-19 21:03 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Execute test suite fails in Outlook Initial Comment: The test suite fails in outlook. I've retrained messages from a spam and a ham folder. I think this may be related to moving the database-files from the spambayes to the default docs-folders in windows a couple of weeks ago. the following traceback is shown in PythonWin: Executing automated tests... 
Traceback (most recent call last): File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\addin.py", line 308, in Tester tester.test(manager) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\tester.py", line 306, in test TestSpamFilter(driver) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\tester.py", line 173, in TestSpamFilter msg, words = driver.CreateTestMessageInFolder (SPAM, driver.folder_watch) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\tester.py", line 132, in CreateTestMessageInFolder msg, words = self.CreateTestMessage(spam_status) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\tester.py", line 145, in CreateTestMessage words.update(FindTopWords(self.manager.bayes, 50, True)) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\tester.py", line 64, in FindTopWords for word, info in extractor(bayes): File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\ Outlook2000\tester.py", line 46, in DBExtractor key = bayes.dbm.next()[0] File "C:\PROGRA~1\_DEV\Python22\Lib\site- packages\bsddb3\__init__.py", line 122, in next rv = self.dbc.next() DBNotFoundError: (-30991, 'DB_NOTFOUND: No matching key/data pair found') Tests FAILED. Sorry about that. If I were you, I would do a full re-train ASAP Please delete any test messages from your Spam, Unsure or Inbox folders first. ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-20 11:31 Message: Logged In: YES user_id=14198 This seems a bsddb3 problem. The code in question: try: key = bayes.dbm.next()[0] except bsddb.error: already attempts to catch this error. Further, the docs for DBNotFoundError state that it derives from bsddb.error, meaning my except statement should work. I will try and get to using my Python 2.2 version for the plugin to fix this. 
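When an except clause "should" catch an error but does not, one thing worth checking is whether the class that was actually raised really derives from the class named in the handler. The sketch below shows that check with stand-in classes only; PackageAError, PackageBError and NotFoundError are invented for the illustration and are not the real bsddb or bsddb3 classes. Whether two separate class hierarchies are involved in this report is an assumption, not something the traceback establishes.

```python
class PackageAError(Exception):
    """Stand-in for the base error class exported by one database package."""


class PackageBError(Exception):
    """Stand-in for a same-purpose base class defined by a different package."""


class NotFoundError(PackageBError):
    """Stand-in for an 'end of cursor' error derived from PackageBError."""


def is_caught_by(exc_class, handler_class):
    """Return True if 'except handler_class:' would catch exc_class."""
    return issubclass(exc_class, handler_class)


def first_key(cursor_next, handler_class):
    """Return the key from cursor_next(), or None if handler_class is raised."""
    try:
        return cursor_next()[0]
    except handler_class:
        return None


def exhausted_cursor():
    """Simulate a cursor that has run off the end of the database."""
    raise NotFoundError("DB_NOTFOUND: No matching key/data pair found")
```

With these stand-ins, a handler for PackageBError swallows the error, while a handler for PackageAError lets it escape, which would look just like an except statement that "should work" but does not.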
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702 From noreply at sourceforge.net Wed Mar 19 16:32:34 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 19 22:30:46 2003 Subject: [Spambayes] [ spambayes-Bugs-702920 ] Manual filtering (Outlook) stops if one message fails Message-ID: Bugs item #702920, was opened at 2003-03-13 23:38 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 Category: Outlook Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Fredrik Rodland (fmmr) Assigned to: Mark Hammond (mhammond) Summary: Manual filtering (Outlook) stops if one message fails Initial Comment: I've posted this question on the mailing list, and with (at least) one positive feedback, I enter it here: If manual filtering is started, and one e-mail fails, the rest of the filtering seems to be skipped. Couldn't the filtering of the remaining messages continue, skipping the message which failed? ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-20 11:32 Message: Logged In: YES user_id=14198 Checked the fix in for this a couple of days ago. ---------------------------------------------------------------------- Comment By: Fredrik Rodland (fmmr) Date: 2003-03-17 22:06 Message: Logged In: YES user_id=724871 I (slightly) changed the summary. I've included one traceback. However I've run into several different ones in the past when filtering manually, and all seem to stop the actual filter-process. What I want/wish is that the filtering process continues with the remaining messages even if one message fails. There have also been several other comments on this subject on the list.
the actual traceback as requested: Error getting property from stream (-2147221233, 'OLE error 0x8004010f', None, None) Exception in thread Thread-2: Traceback (most recent call last): File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 408, in __bootstrap self.run() File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 396, in run apply(self.__target, self.__args, self.__kwargs) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\dialogs\AsyncDialog.py", line 115, in thread_target self._DoProcess() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\dialogs\FilterDialog.py", line 375, in _DoProcess self.filterer(self.mgr, self.progress) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\filter.py", line 100, in filterer this_dispositions = filter_folder(f, mgr, progress) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\filter.py", line 80, in filter_folder disposition = filter_message(message, mgr, all_actions) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\filter.py", line 15, in filter_message prob = mgr.score(msg) File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\manager.py", line 439, in score email = msg.GetEmailPackageObject() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\msgstore.py", line 639, in GetEmailPackageObject text = self._GetMessageText() File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlo ok2000\msgstore.py", line 582, in _GetMessageText assert msg.is_multipart() AssertionError ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-15 10:39 Message: Logged In: YES user_id=14198 Can you please post a traceback? 
(and sorry if I missed it on the list) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702 From noreply at sourceforge.net Wed Mar 19 16:33:37 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed Mar 19 22:30:53 2003 Subject: [Spambayes] [ spambayes-Bugs-677842 ] COM error on access denied Message-ID: Bugs item #677842, was opened at 2003-01-31 10:21 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=677842&group_id=61702 Category: Outlook Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Tony Meyer (anadelonbrin) Assigned to: Mark Hammond (mhammond) Summary: COM error on access denied Initial Comment: Some folders (public ones in particular) may not allow the user access to create the spam field. This also seems to cause an 'access denied' com error later on. An example traceback is below. Warning: failed to create the Outlook user-property in folder 'MCN Newsletter' (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', "You don't have appropriate permission to perform this operation.", None, 0, -2147024891), None) This is probably because the code has recently been changed, but it will have no effect on the filtering or scoring. AntiSpam: Watching for new messages in folder MCN Newsletter AntiSpam: Watching for new messages in folder Inbox AntiSpam: Watching for new messages in folder Spam Error processing missed messages! 
Traceback (most recent call last): File "D:\CVS Modules\spambayes\Outlook2000 \addin.py", line 610, in OnConnection self.ProcessMissedMessages() File "D:\CVS Modules\spambayes\Outlook2000 \addin.py", line 884, in ProcessMissedMessages File "D:\CVS Modules\spambayes\Outlook2000 \addin.py", line 129, in ProcessMessage if msgstore_message.GetField (manager.config.field_score_name) is not None: File "D:\CVS Modules\spambayes\Outlook2000 \msgstore.py", line 651, in GetField prop = self.mapi_object.GetIDsFromNames(props, 0) [0] com_error: (-2147024891, 'Access is denied.', None, None) ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-20 11:33 Message: Logged In: YES user_id=14198 This has been fixed a while ago too - it was the same problem that caused Hotmail messages to fail. Please reopen if you have problems. ---------------------------------------------------------------------- Comment By: Tony Meyer (anadelonbrin) Date: 2003-02-05 06:03 Message: Logged In: YES user_id=552329 Of course I don't need to wait until mail arrives, I can 'filter now'...sigh (it is early yet, I'm not really awake). I made the change and tried to filter a folder without write- access. This is what I got: Warning: failed to create the Outlook user-property in folder 'MCN Newsletter' (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', "You don't have appropriate permission to perform this operation.", None, 0, -2147024891), None) This is probably because the code has recently been changed, but it will have no effect on the filtering or scoring. 
Exception in thread Thread-1: Traceback (most recent call last): File "D:\Python22\Lib\threading.py", line 408, in __bootstrap self.run() File "D:\Python22\Lib\threading.py", line 396, in run apply(self.__target, self.__args, self.__kwargs) File "D:\CVS Modules\spambayes\Outlook2000 \dialogs\AsyncDialog.py", line 115, in thread_target self._DoProcess() File "D:\CVS Modules\spambayes\Outlook2000 \dialogs\FilterDialog.py", line 375, in _DoProcess self.filterer(self.mgr, self.progress) File "D:\CVS Modules\spambayes\Outlook2000\filter.py", line 85, in filterer this_dispositions = filter_folder(f, mgr, progress) File "D:\CVS Modules\spambayes\Outlook2000\filter.py", line 65, in filter_folder disposition = filter_message(message, mgr, all_actions) File "D:\CVS Modules\spambayes\Outlook2000\filter.py", line 15, in filter_message prob = mgr.score(msg) File "D:\CVS Modules\spambayes\Outlook2000 \manager.py", line 384, in score email = msg.GetEmailPackageObject() File "D:\CVS Modules\spambayes\Outlook2000 \msgstore.py", line 595, in GetEmailPackageObject text = self._GetMessageText() File "D:\CVS Modules\spambayes\Outlook2000 \msgstore.py", line 472, in _GetMessageText hr, data = self.mapi_object.GetProps(prop_ids,0) com_error: (-2147024891, 'Access is denied.', None, None) The more I think about it, the more I am of the opinion that filtering (and scoring) should not be allowed unless the user has write access to the folder. This would be simple enough to implement I presume (somewhere in folderselector.py, a check to see that access is available when the user selects a folder). This would also leave someone else to do public folder testing, since I don't have write access to any :) Apologies again for the multiple messages - like I said, it's early :) ---------------------------------------------------------------------- Comment By: Tony Meyer (anadelonbrin) Date: 2003-02-05 05:56 Message: Logged In: YES user_id=552329 ack. 
my stupid browser (because of my stupid actions) resent my comment many times. my apologies. ---------------------------------------------------------------------- Comment By: Tony Meyer (anadelonbrin) Date: 2003-02-05 05:50 Message: Logged In: YES user_id=552329 Hi Mark I don't really want to do anything with public folders! But there was a message (from Neale from memory) about a user having trouble so I tried playing round with them and got this problem. I would want to filter a public folder that I didn't have write access to so that I could see/rank the spam scores I guess. Although the worthwhileness (is that a word? ;) of this does seem a bit dubious. Maybe the 'solution' is to disallow all filtering on folders without write access? I'll have a go repoducing the exception with the change in code and let you know how it goes. I'll have to wait until about midday (NZ) for any mail to arrive in the public folder.
---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-02-04 23:01 Message: Logged In: YES user_id=14198 Hi Tony, I didn't realize you were an antipode ;) I'm wondering why you want to filter public folders that you don't have write access to? Or is the point that you can *move* the message, just can't save fields? Interestingly, your exception points at: if msgstore_message.GetField(manager.config.field_score_name) is not None: which implies that this error is actually on the *following* message, not the one that is actually failing. This does make sense, as we pass mapi.MAPI_DEFERRED_ERRORS to all mapi functions. I'm wondering if you can easily repro this exception? If so, I would be interested to see what changing msgstore.py, line 666 (eeek!!!) in current CVS from: self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE | USE_DEFERRED_ERRORS) to: self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE) has on this exception, and if indeed the exception is now raised from the "save" operation rather than a following one.
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=677842&group_id=61702 From skip at pobox.com Wed Mar 19 22:29:26 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 19 23:29:31 2003 Subject: [Spambayes] database corruption In-Reply-To: <3E78A1E9.29639.DCE8D7@localhost> References: <3E78A1E9.29639.DCE8D7@localhost> Message-ID: <15993.17318.29523.61558@montanaro.dyndns.org> Doc> When everything came back up and seemed stable, I restarted Doc> pop3proxy.py and was unable to restart pop3proxy.py because of Doc> database corruption. As before, there was no mail activity going on Doc> at the time of the crash. What version of Berkeley DB are you using? Try this command: rpm -qa | egrep '^(lib)?db' It might report something like db1-devel-1.85-6mdk libdb3.2-devel-3.2.9-2mdk db1-1.85-6mdk libdbtcl3.2-3.2.9-2mdk db2-2.4.14-3mdk libdb3.2-3.2.9-2mdk Note that I don't have anything like "libdb3.2-utils-3.2.9-2mdk". If you don't but have your Mandrake CD around, install that RPM. That will give you a bunch of commands which begin with "db_". Try running db_recover on your corrupt database file and see if it fixes the problem. What does ldd say about the version of libdb linked into your bsddb.so file? Try something like this: % ldd /usr/local/lib/python2.2/lib-dynload/bsddb.so libdb-3.2.so => /lib/libdb-3.2.so (0x4001c000) libc.so.6 => /lib/libc.so.6 (0x400a3000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x80000000) You want the -utils RPM which corresponds to the libdb version bsddb.so was linked against. On Mandrake systems you can install multiple versions of libdb simultaneously. 
Skip From mhammond at skippinet.com.au Thu Mar 20 15:53:52 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed Mar 19 23:54:49 2003 Subject: [Spambayes] database corruption In-Reply-To: <15993.17318.29523.61558@montanaro.dyndns.org> Message-ID: bsddb.db.version() tells us the version too. Mark. From T.A.Meyer at massey.ac.nz Thu Mar 20 17:19:10 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 20 00:19:49 2003 Subject: [Spambayes] Beta status checklist (not new feature requests!) Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C920@its-xchg4.massey.ac.nz> > > IMO, a (potential) user should download *either* the Outlook > > installer, *or* a beta release of everything else. > > Sounds fine to me - except I would raise the bar a little - > why not make a pop3propxy *binary* release for Windows too - > then the problem becomes moot- on Windows you get a binary. I notice that this is also listed in the "short term plans" in the readme in the windows directory. Does this mean that you are working on it, or that someone else should? > I realize time is an issue, so this strategy sounds OK for > beta2, but maybe we could aim for a beta3 with binaries before v1. Well, TimS is only doing _alpha_ 3 at the moment, unless the list of prereq's for beta 1 ends up really short, so there should be time. But otherwise, yes. =Tony Meyer From T.A.Meyer at massey.ac.nz Thu Mar 20 19:09:40 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 20 02:10:45 2003 Subject: [Spambayes] Storing Options Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> Ok, here's a draft proposal for changes to create a new way of storing options. I'm not going to implement any of this unless there is a consensus that it's a good thing, so don't panic. There are four main changes, each outlined below. They can be implemented separately, but together makes most sense. I would not change: * The search path for options (i.e. 
defaults, then envar, then current/home directory). * Storing the defaults inside a .py file rather than having a 'bayes.ini' file (reading the archives, the reasons behind this make sense). So, here's what I do propose. Note that these are significant changes and would require changes (improvements ;) all over the code. The user would notice nothing, however. When you get a chance, please read through these and comment. 1. Change from using getattr to get This means using 'options["pop3proxy_servers"]' rather than 'options.pop3proxy_servers'. This avoids possible conflicts with existing OptionsClass attribute names, and allows #2 and (more easily) #3. 2. Use the section data. This means using 'options[("pop3proxy", "servers")]'. For backwards compatibility, 'options["pop3proxy_servers"]' would return the value of any option named "pop3proxy_servers", whichever section it was in. This is tidier, and allows neat things later on (like maybe only loading option sections that are relevant). For the most part it is already set up this way, it's just that Options currently throws away all the section information. 3. Setting values propagates through to ConfigParser This means that 'options.pop3proxy_add_evidence_header = True' (with #1 & #2, 'options[("pop3proxy", "add_evidence_header")] = True') would not just change the Options object, but also the ConfigParser object that it inherits from. This *does not* mean that any files would be changed, but *does* mean that they could be updated on demand, via the write() function (or via the update() function in UpdatableConfigParser). 4. Detailed options.
Each option has the following attributes: * a name * a nice name * a default value * explanation text * either a tuple or a regex of allowed values * the current value * whether it should be restored on a 'return to defaults' command Two simple examples: "pop3proxy_servers", "Servers", "", "These are the servers that will be proxied blah blah...", r"\w", "pop.example.com", False "add_evidence_header", "Clues Header", True, "This option adds a header with the spam clues blah blah blah", (True, False), False, True These would be accessed as follows: nice name: options.display_name(sect, opt) default: options.default(sect, opt) - these would also be the values of all options prior to loading any config file explanation text: options.doc(sect, opt) allowed values: options.valid_input(sect, opt) current value: either via options[(sect, opt)], or via options.get(sect, opt) restore on revert: options.no_restore(sect, opt) Also provided would be options.is_valid(sect, opt, value) which would return True iff the value was valid for that option. OR options[(sect, opt)] / options.get(sect, opt) returns an Option object that has these things. This is nicer, but is more work to just get the current value, which is what is wanted most of the time. From T.A.Meyer at massey.ac.nz Thu Mar 20 19:25:58 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 20 02:26:33 2003 Subject: [Spambayes] Beta status checklist Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C92A@its-xchg4.massey.ac.nz> > * Some kind of recovery from wordinfo database corruption If the database itself is corrupt, is there really anything we can do, other than point them towards the db recovery tools? (Unless we expect people to hold onto ham & spam to retrain on). I would suggest that if the db is dead, all we can do is rename it (for recovery purposes) and create a new, empty db. 
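The "rename it and start again" suggestion above could look something like the following. This is a minimal sketch, not Spambayes code: set_aside_database and the ".corrupt-" suffix are invented for illustration, and it assumes the new, empty database gets created by whatever normally opens the file once the old one is out of the way.

```python
import os
import time

def set_aside_database(path):
    """Rename a suspect database file so it can be examined later.

    Hypothetical helper. Returns the name the file was moved to; creating
    the fresh database is left to the code that normally opens `path`.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = "%s.corrupt-%s" % (path, stamp)
    counter = 1
    # Never overwrite an earlier backup made in the same second.
    while os.path.exists(backup):
        backup = "%s.corrupt-%s-%d" % (path, stamp, counter)
        counter += 1
    os.rename(path, backup)
    return backup
```

Keeping the old file around means the db recovery tools can still be tried on it afterwards.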
> (nham and nspam are lost on an increasingly frequent basis) It seems (via a grep for 'nham' or 'nspam') like the only things that use nham and nspam are: * testing code (the user wouldn't be using this) * experimental_ham_spam_imbalance (off by default) If this is correct, does it really matter if nham and/or nspam are incorrect? (Not that the bugs shouldn't be traced down, however). =Tony Meyer From acunningham at rsasecurity.com Thu Mar 20 09:11:07 2003 From: acunningham at rsasecurity.com (Cunningham, Andy) Date: Thu Mar 20 04:06:03 2003 Subject: [Spambayes] Beta status checklist (or this turning into newfe ature requests?) Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F172C@exuk01> I wonder if something like the following would work: 1) Each user has a private spam database stored on the server. 2) A scheduled task will compute some kind of "average" of the databases. This might include some kind of threshold (e.g., if more than 20% of users say it's spam), or just straight averaging. 3) Systems admins can train directly on the system database to provide feedback as to whether the system is, in fact, spam or not. I would also have some kind of whitelist/blacklist built in 4) incoming mail is checked against both the user and the system database, and scored against each, to get two scores a user score (u-score) and a system score (s-score). Then you can apply both scores: u-score > 90% or s-score > 90% ==> spam u-score < 15% or s-score < 15% ==> ham I guess it would take some kind of analysis to determine the best averaging process. This means that you end up with hundreds of people training the database. AndyC -----Original Message----- From: Meyer, Tony [mailto:T.A.Meyer@massey.ac.nz] Sent: 19 March 2003 23:07 To: Spambayes Subject: RE: [Spambayes] Beta status checklist (or this turning into newfeature requests?) > BTW, are there plans to develop server-side spambayes? > (Apologies if this is a FAQ.) 
I know that it can eclipse what > we are currently using on our Exchange server :-) This is a FAQ, and does lend weight to TimS's suggestion that we need a list of answers for FAQs :) I have no experience with using spambayes in a server type situation, but from reading the messages on the list, I believe that this can be done now, to a certain extent. The real question is how you want training to be done - does the admin do it? Does everyone contribute? Do you want users to have a shared definition of spam, or individual? =Tony Meyer _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes From acunningham at rsasecurity.com Thu Mar 20 09:36:12 2003 From: acunningham at rsasecurity.com (Cunningham, Andy) Date: Thu Mar 20 04:31:02 2003 Subject: [Spambayes] Outlook 2002 Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F172D@exuk01> Mark I tried out your change - in fact, I tried out several variants of newer code, and in all of them I now seem to be getting a different error. This is based on the latest CVS build checked out at around 9AM GMT this morning, though the same thing happens in the a2 release as well, now that I have removed the source of the error (I traced the problem below to a moved .pst file which hadn't been modified in outlook - so continuing the folder tree walk on that error is probably a Good Thing.) 
Traceback (most recent call last): File "C:\andyc\Install\spambayes\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 97, in OnInitDialog self.UpdateControlStatus() File "C:\andyc\Install\spambayes\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 143, in UpdateControlStatus watch_names = self.mgr.FormatFolderNames( File "C:\andyc\Install\spambayes\spambayes\Outlook2000\manager.py", line 222, in FormatFolderNames folder = self.message_store.GetFolder(eid) File "C:\andyc\Install\spambayes\spambayes\Outlook2000\msgstore.py", line 242, in GetFolder folder_id = self.NormalizeID(folder_id) File "C:\andyc\Install\spambayes\spambayes\Outlook2000\msgstore.py", line 195, in NormalizeID assert False, "We expect fully qualified IDs - second branch" AssertionError: We expect fully qualified IDs - second branch win32ui: OnInitDialog() virtual handler (>) raised an exception SpamAddin - Disconnecting from Outlook Bayes database is not dirty - not writing Addin terminating: 1 COM client and 2 COM servers exist. The " - second branch" comment is where I modified the two identical assert statements in NormaliseID so that I could tell which one was getting triggered. This is the second instance. Commenting out the assertion seems to allow everything to work properly, though I don't understand the code well enough to ensure that I'm not storing problems up for later..... AndyC -----Original Message----- From: Mark Hammond [mailto:mhammond@skippinet.com.au] Sent: 17 March 2003 21:48 To: Cunningham, Andy; spambayes@python.org Subject: RE: [Spambayes] Outlook 2002 > msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL | > pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, > None) The error code for this is MAPI_E_CORRUPT_STORE, which doesn't sound good! I have checked in a change so that any errors when walking the folder tree are ignored. However, this same error is going to happen, so that part of your folder tree will *not* appear in the dialog. 
Hopefully only a small part of your tree is corrupt, so the folders you want will still be there - you will have to try it and see. Mark. From popiel at wolfskeep.com Thu Mar 20 07:52:05 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Mar 20 10:52:08 2003 Subject: [Spambayes] Storing Options In-Reply-To: Message from "Meyer, Tony" <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> Message-ID: <20030320155205.428B62DE9E@cashew.wolfskeep.com> In message: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> "Meyer, Tony" writes: >Ok, here's a draft proposal for changes to create a new way of storing >options. I'm not going to implement any of this unless there is a >consensus that it's a good thing, so don't panic. Looked good to me. - Alex From tim at fourstonesExpressions.com Thu Mar 20 10:30:44 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Thu Mar 20 11:30:49 2003 Subject: [Spambayes] Storing Options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> Message-ID: <2X05SOHG95URMGMGB794SRWTBAB9S96.3e79ecb4@myst> +1 from me. 3/20/2003 1:09:40 AM, "Meyer, Tony" wrote: >Ok, here's a draft proposal for changes to create a new way of storing >options. I'm not going to implement any of this unless there is a >consensus that it's a good thing, so don't panic. > >There are four main changes, each outlined below. They can be >implemented separately, but together makes most sense. > >I would not change: >* The search path for options (i.e. defaults, then envar, then >current/home directory). >* Storing the defaults inside a .py file rather than having a >'bayes.ini' file (reading the archives, the reasons behind this make >sense > >So, here's what I do propose. Note that these are significant changes >and would require changes (improvements ;) all over the code. 
The user >would notice nothing, however. When you get a chance, please read >through these and comment. > >1. Change from using getattr to get > >This means using 'options["pop3proxy_servers"]' rather than >'options.pop3proxy_servers'. This avoid the possible problems with >conflicts with existing OptionsClass attribute names, and allows #2 and >(more easily) #3. > >2. Use the section data. > >This means using 'options[("pop3proxy", "servers")]'. For backwards >compatability 'options["pop3proxy_servers"]' would return the value of >any option named "pop3proxy_servers", whichever section it was in. > >This is tidier, and allows neat things later on (like maybe only >loading option sections that are relevant). For the most part it is >already set up this way, it's just that Options currently throws away >all the section information. > >3. Setting values propagates through to ConfigParser > >This means that 'options.pop3proxy_add_evidence_header = True' (with >#1 & #2, 'options[("pop3proxy", "add_evidence header")] = True') >would not just change the Options object, but also the ConfigParser >object that it inherits from. > >This *does not* mean that that any files would be changed, but *does* >mean that they could be updated on demand, via the write() function - >or via the update() function in UpdatableConfigParser). > >4. Detailed options. 
> >Each option has the following attributes: >* a name >* a nice name >* a default value >* explanation text >* either a tuple or a regex of allowed values >* the current value >* whether it should be restored on a 'return to defaults' command > >Two simple examples: > >"pop3proxy_servers", "Servers", "", "These are the servers that will be >proxied blah blah...", r"\w", "pop.example.com", False > >"add_evidence_header", "Clues Header", True, "This option adds a header >with the spam clues blah blah blah", (True, False), False, True > >These would be accessed as follows: >nice name: options.display_name(sect, opt) >default: options.default(sect, opt) - these would also be the values of >all options prior to loading any config file >explanation text: options.doc(sect, opt) >allowed values: options.valid_input(sect, opt) >current value: either via options[(sect, opt)], or via >options.get(sect, opt) >restore on revert: options.no_restore(sect, opt) > >Also provided would be options.is_valid(sect, opt, value) which would >return True iff the value was valid for that option. > >OR options[(sect, opt)] / options.get(sect, opt) returns an Option >object that has these things. This is nicer, but is more work to just >get the current value, which is what is wanted most of the time. > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim.one at comcast.net Thu Mar 20 16:54:38 2003 From: tim.one at comcast.net (Tim Peters) Date: Thu Mar 20 16:59:31 2003 Subject: [Spambayes] Beta status checklist In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C92A@its-xchg4.massey.ac.nz> Message-ID: > ... 
> It seems (via a grep for 'nham' or 'nspam') like the only things that > use nham and nspam are: > > * testing code (the user wouldn't be using this) > * experimental_ham_spam_imbalance (off by default) > > If this is correct, Nope, they enter into every probability calculation, via Classifier.probability(). More, they have to. I expect a real bug got hacked over instead of solved at the time these int() calls got added to classifier.add_msg(): if is_spam: self.nspam = int(self.nspam) + 1 # account for string nspam else: self.nham = int(self.nham) + 1 # account for string nham That is, the database was hosed if these things were ever strings, or someone hacked around a bad database integration in the wrong place. Note that it's easy to show that nham and nspam must be ints, provided that only methods of Classifier muck with a Classifier's instance variables. Under the same assumption, no word's hamcount can exceed nham, or its spamcount nspam. From noreply at sourceforge.net Thu Mar 20 17:34:14 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 20 20:40:43 2003 Subject: [Spambayes] [ spambayes-Feature Requests-703283 ] mboxtrain only trains on cur in maildir Message-ID: Feature Requests item #703283, was opened at 2003-03-13 16:57 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Matthew Cowles (mdcowles) >Assigned to: Tim Stone (timstone4) Summary: mboxtrain only trains on cur in maildir Initial Comment: When training on a maildir, mboxtrain trains only on the messages in the subirectory cur. It ignores messages in the subdirectory new. Since new is for messages that haven't been seen, I think it's worth looking there since at least some spam will have been filed unseen. This is the same as bug 699174 which Tim Stone closed saying, "This is a feature request. If this remains as a requirement, please resubmit as such." 
The patch attached to that bug report fixes the behavior, which I still consider a bug. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702 From noreply at sourceforge.net Thu Mar 20 17:47:53 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 20 20:40:50 2003 Subject: [Spambayes] [ spambayes-Feature Requests-703283 ] mboxtrain only trains on cur in maildir Message-ID: Feature Requests item #703283, was opened at 2003-03-13 16:57 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702 Category: None Group: None >Status: Closed Priority: 5 Submitted By: Matthew Cowles (mdcowles) Assigned to: Tim Stone (timstone4) Summary: mboxtrain only trains on cur in maildir Initial Comment: When training on a maildir, mboxtrain trains only on the messages in the subdirectory cur. It ignores messages in the subdirectory new. Since new is for messages that haven't been seen, I think it's worth looking there since at least some spam will have been filed unseen. This is the same as bug 699174 which Tim Stone closed saying, "This is a feature request. If this remains as a requirement, please resubmit as such." The patch attached to that bug report fixes the behavior, which I still consider a bug. ---------------------------------------------------------------------- >Comment By: Tim Stone (timstone4) Date: 2003-03-20 19:47 Message: Logged In: YES user_id=645698 Added -n option to train mail in "new". This leaves the current behavior of training only "cur" unaltered.
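For readers unfamiliar with the maildir layout, the behavior described in the comment above (train on "cur" by default, include "new" only on request) might be sketched like this. It is not mboxtrain's actual code: maildir_message_paths and its include_new parameter are made-up names that merely mirror what the -n option is said to do.

```python
import os

def maildir_message_paths(maildir, include_new=False):
    """List message files under a maildir's cur (and optionally new) subdirectory.

    Hypothetical helper; include_new stands in for the -n option
    mentioned in the tracker comment.
    """
    subdirs = ["cur"]
    if include_new:
        subdirs.append("new")
    paths = []
    for sub in subdirs:
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue  # tolerate an incomplete maildir
        for name in sorted(os.listdir(folder)):
            full = os.path.join(folder, name)
            if os.path.isfile(full):
                paths.append(full)
    return paths
```

Leaving include_new off by default keeps the existing "cur only" behavior, which matches the stated intent of the change.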
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702 From noreply at sourceforge.net Thu Mar 20 17:49:02 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu Mar 20 20:40:57 2003 Subject: [Spambayes] [ spambayes-Feature Requests-695059 ] wildcard support for mboxtrain Message-ID: Feature Requests item #695059, was opened at 2003-02-28 07:54 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=695059&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: bill parducci (humantypo) >Assigned to: Tim Stone (timstone4) Summary: wildcard support for mboxtrain Initial Comment: i have about 40 folders that i use to keep track of numerous e-mail lists, projects, scraps of digital dementia, etc. it would be very helpful if mboxtrain would accept wildcards for mail folder identification. yes, i could have 40 command line params, but that adds a YAM (Yet Another Maintenance) task to make sure that the folders match the command line parameters. what would really be useful is if mboxtrain would keep track of folders that it has read in that session already. that way one could use the following syntax: mboxtrain -d [db] -s [dir]/spam -g [dir]/* and not have the ham process read the spam folder (since it is likely that there will be only 1 spam folder and multiple ham folders). i suppose you could just hard code the ham flag parser to ignore folders named 'spam' but that would kinda be horky... anyway, i think this would help in the move towards more 'set & forget' operation.
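The request above (expand wildcards for the ham folders, but skip anything already named as a spam folder or already read in the same session) can be sketched as follows. This is only an illustration of the idea, not how mboxtrain works or would necessarily implement it; expand_ham_folders is a hypothetical name.

```python
import glob
import os

def expand_ham_folders(ham_patterns, spam_folders):
    """Expand ham wildcard patterns, skipping spam folders and repeats.

    Hypothetical helper illustrating the feature request: a folder given
    with -s is never also trained as ham, and no folder is listed twice.
    """
    seen = {os.path.abspath(p) for p in spam_folders}
    ham = []
    for pattern in ham_patterns:
        for path in sorted(glob.glob(pattern)):
            full = os.path.abspath(path)
            if full in seen:
                continue  # spam folder, or already matched earlier
            seen.add(full)
            ham.append(path)
    return ham
```

Tracking folders by absolute path avoids hard-coding a rule such as "ignore folders named 'spam'", which the submitter also wanted to avoid.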
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=695059&group_id=61702 From T.A.Meyer at massey.ac.nz Fri Mar 21 15:30:34 2003 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Thu Mar 20 22:31:16 2003 Subject: [Spambayes] Beta status checklist Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C933@its-xchg4.massey.ac.nz> > Nope, they enter into every probability calculation, via > Classifier.probability(). More, they have to. I don't know how I missed that. I was even looking at that section of the code; I remember reading those lines. Go figure. > I expect a real bug got hacked over instead of solved at the > time these int() calls got added to classifier.add_msg(): [...] > That is, the database was hosed if these things were ever strings, or > someone hacked around a bad database integration in the wrong place. Really we need to solve the problem that's causing the incorrect counts, rather than try to restore 'corrupt' dbs. What we need, of course, is someone who regularly sees this problem so that we can track it down. Anyone out there? =Tony Meyer From tdickenson at devmail.geminidataloggers.co.uk Fri Mar 21 09:33:12 2003 From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Fri Mar 21 04:33:16 2003 Subject: [Spambayes] [ spambayes-Feature Requests-695059 ] wildcard support for mboxtrain In-Reply-To: References: Message-ID: <200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk> On Friday 21 March 2003 1:49 am, SourceForge.net wrote: It sounds like you are aiming in the same direction as me. > it would be very helpful if mboxtrain would accept > wildcards for mail folder identification. yes, i could > have 40 command line params, but that adds a YAM (Yet > Another Maintenance) task to make sure that the folders > match the command line parameters.
I am currently using a script that extracts all my mail folder names from a kmail configuration file, then builds up a long hammie command line and executes it. (I'm happy to contribute this if anyone is interested.) This is working well for me. Every day I perform a full train using hammie, not mboxtrain's incremental approach. This means I can use the mail reader to expire old messages, and have them removed from the spambayes database. > and not have the ham process read the spam folder >(since it is likely that there will be only 1 spam > folder and multiple ham folders). I started with one folder, but am now using two. Filters put new spam in a spam folder, and at the end of the week I review it for hams, and move all the spams into a spam/archive folder. > i suppose you could > just hard code the ham flag parser to ignore folders > named 'spam' but that would kinda be horky... I assume that any folder named spam and its subfolders contain spam. From Paul.Moore at atosorigin.com Fri Mar 21 13:13:04 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Fri Mar 21 08:13:35 2003 Subject: [Spambayes] Getting a mbox file from Outlook Express Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com> I'm trying to get a friend set up with Spambayes using Outlook Express. To get some initial training sorted, it would be nice to get a mbox file of some of his existing messages which he could train on. But I can't find a way of getting OE to save a mbox file. Is there a way? Any OE victims around here...? Thanks, Paul. From tim at fourstonesExpressions.com Fri Mar 21 07:32:49 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 21 08:32:56 2003 Subject: [Spambayes] Getting a mbox file from Outlook Express In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com> Message-ID: >But I can't find a way of getting OE to save a mbox >file. Is there a way?
You're sore out of luck on that one, dude. Outlook-Express-Victim-ly yours - TimS c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim at fourstonesExpressions.com Fri Mar 21 08:45:49 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 21 09:45:55 2003 Subject: [Spambayes] Getting a mbox file from Outlook Express In-Reply-To: Message-ID: <43F0MG1VVSTR95LJMJQ072IHCB98WT.3e7b259d@myst> 3/21/2003 7:32:49 AM, Tim Stone - Four Stones Expressions wrote: >>But I can't find a way of getting OE to save a mbox >>file. Is there a way? > >You're sore out of luck on that one, dude. Well, it appears as if I've spoken a bit too soon on this one. I did some digging, and found a program called MailNavigator (http://www.mailnavigator.com/mailnavigator.html), that can read OE mailboxes and export them as an mbox. I've downloaded it, tried it, and it works. When you start it up, do File->Load External Mailbox... Point the browser window at the OE inbox.dbx file, normally in Documents and Settings \currentuser\Local Settings\Application Data\Identities\{bunchaglorp} \Microsoft\Outlook Express. You should see your inbox (or whatever folder you loaded) contents in MailNavigator. Then do Message->Select All, then Message->Save As... pick a file name and location, and select file type RFC822-text file... et voila, you have an mbox! c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From noreply at sourceforge.net Fri Mar 21 05:35:52 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri Mar 21 09:48:15 2003 Subject: [Spambayes] [ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows doesn't work... 
Message-ID: Bugs item #707491, was opened at 2003-03-21 13:35 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: Paul Moore (pmoore) Assigned to: Nobody/Anonymous (nobody) Summary: Pop3 proxy service code for Windows doesn't work... Initial Comment: The pop3proxy_service.py program doesn't seem to work with Python 2.2.2. The problem is that a main program doesn't have a __file__ variable defined. (This works in Python 2.3, which I guess is why this got missed...) I've attached a "quick fix" patch, which uses a helper module "findme.py". ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 From noreply at sourceforge.net Fri Mar 21 05:36:47 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri Mar 21 09:48:22 2003 Subject: [Spambayes] [ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows doesn't work... Message-ID: Bugs item #707491, was opened at 2003-03-21 13:35 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: Paul Moore (pmoore) >Assigned to: Mark Hammond (mhammond) Summary: Pop3 proxy service code for Windows doesn't work... Initial Comment: The pop3proxy_service.py program doesn't seem to work with Python 2.2.2. The problem is that a main program doesn't have a __file__ variable defined. (This works in Python 2.3, which I guess is why this got missed...) I've attached a "quick fix" patch, which uses a helper module "findme.py". 
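The attached patch did not come through, so the contents of findme.py are not known here. For illustration only, the sketch below shows a different and commonly used fallback for the same symptom (a main program with no __file__): use __file__ when it exists, otherwise fall back to sys.argv[0]. The name script_directory is hypothetical, and this is not the fix proposed in the bug report.

```python
import os
import sys

def script_directory(namespace):
    """Best-effort directory of the running script.

    `namespace` is normally globals(). This is NOT the findme.py approach
    from the bug report (that patch is not available); it swaps in a
    sys.argv[0] fallback for the case where __file__ is undefined.
    """
    filename = namespace.get("__file__") or sys.argv[0]
    return os.path.dirname(os.path.abspath(filename))
```

A caller would use script_directory(globals()) from the main program.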
---------------------------------------------------------------------- >Comment By: Paul Moore (pmoore) Date: 2003-03-21 13:36 Message: Logged In: YES user_id=113328 File attachment didn't work :-( ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 From tim at fourstonesExpressions.com Fri Mar 21 08:53:17 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 21 09:53:23 2003 Subject: [Spambayes] Getting a mbox file from Outlook Express In-Reply-To: <43F0MG1VVSTR95LJMJQ072IHCB98WT.3e7b259d@myst> Message-ID: 3/21/2003 8:45:49 AM, Tim Stone - Four Stones Expressions wrote: >3/21/2003 7:32:49 AM, Tim Stone - Four Stones Expressions > wrote: > >>>But I can't find a way of getting OE to save a mbox >>>file. Is there a way? >> >>You're sore out of luck on that one, dude. > >Well, it appears as if I've spoken a bit too soon on this one. I did some >digging More digging. There's a sourceforge project called mbx2mbox, at http://mbx2mbox.sourceforge.net/ I haven't tried this, but it looks like it will do what you want, as well. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From Paul.Moore at atosorigin.com Fri Mar 21 15:07:37 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Fri Mar 21 10:08:09 2003 Subject: [Spambayes] Getting a mbox file from Outlook Express Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D99F@UKDCX001.uk.int.atosorigin.com> From: Tim Stone - Four Stones Expressions >> Well, it appears as if I've spoken a bit too soon on this one. >> I did some digging [...] > More digging. Thanks for these! I'll pass the info on... Paul From popiel at wolfskeep.com Fri Mar 21 08:24:30 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Fri Mar 21 11:24:34 2003 Subject: [Spambayes] [ spambayes-Feature Requests-695059 ] wildcard support for mboxtrain In-Reply-To: Message from Toby Dickenson <200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk> References: <200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: <20030321162430.C6D952DE2F@cashew.wolfskeep.com> In message: <200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk> Toby Dickenson writes : >On Friday 21 March 2003 1:49 am, SourceForge.net wrote: > >It sounds like you are aiming the same direction as me. > >> it would be very helpful if mboxtrain would accept >> wildcards for mail folder identification. [...] >I am currently using a script that extracts all my mail folder names from a >kmail configuration file, then builds up a long hammie command line and >executes it. (Im happy to contribute this if anyone is interested) > >This is working well for me. My approach to this problem is that I make two copies of every mail; one copy goes into an 'everything' folder, and the other copy gets delivered into 'inbox' or 'newspam' as appropriate. As I review spam (or find it as false negatives), I move it into a 'spam' folder. For training, ham = everything - spam - newspam. Naming three folders doesn't seem to be a big deal, whereas naming all the innumerable folders that my inbox gets sorted into would be. The code to do this is checked in under contrib as bulktrain.sh and bulkgraph.py, and described in BULK.txt. >Every day I perform a full train unsing hammie, not mboxtrains incremental >approach. This means I can use the mail reader to expire old messages, and >have them removed from the spambayes database. I just ignore everything more than 120 days old, personally... and that's just to keep the database around 20 meg. Tests show that it hurts accuracy by less than 1%. Of course, ignoring everything over a week old hurts less than 5%... 
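The "ham = everything - spam - newspam" rule above, combined with the 120-day cutoff, amounts to a set difference plus an age filter. The following is a minimal sketch of that idea only; it is not taken from bulktrain.sh or bulkgraph.py, and select_ham, its parameters, and the use of message ids as keys are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def select_ham(everything, spam, newspam, now, max_age_days=120):
    """Return ids of messages to train as ham.

    Hypothetical helper: `everything` maps message id -> received datetime,
    `spam` and `newspam` are collections of ids. Ham is whatever is in
    `everything` but in neither spam folder and not older than the cutoff.
    """
    cutoff = now - timedelta(days=max_age_days)
    excluded = set(spam) | set(newspam)
    return sorted(
        msg_id
        for msg_id, received in everything.items()
        if msg_id not in excluded and received >= cutoff
    )

everything = {
    "a": datetime(2003, 3, 20),
    "b": datetime(2003, 3, 1),
    "c": datetime(2002, 10, 1),   # older than 120 days
    "d": datetime(2003, 3, 19),
}
print(select_ham(everything, spam={"b"}, newspam={"d"}, now=datetime(2003, 3, 21)))
```

Only the ham side is shown; the same age cutoff would presumably apply when picking spam to train on.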
- Alex From skip at pobox.com Fri Mar 21 15:58:35 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 21 16:58:44 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks Message-ID: <15995.35595.805586.553059@montanaro.dyndns.org> Has anybody thought about how any of the Spambayes tools would perform in the face of disk quotas or full disk partitions? Here at Northwestern they are going to start supporting IMAP (against their better wishes, but the customer is always right). Because they have roughly 30,000 active email accounts and IMAP allows (requires?) mail to be stored on the server, they are going to institute disk quotas on the mail servers for the first time. Procmail+SpamAssassin seems to be breaking in some situations and SA is (incorrectly, I believe) getting egg on its face as a result. I'd like to make sure Spambayes has these various problems addressed. Skip From popiel at wolfskeep.com Fri Mar 21 17:34:19 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Mar 21 20:34:24 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks In-Reply-To: Message from Skip Montanaro <15995.35595.805586.553059@montanaro.dyndns.org> References: <15995.35595.805586.553059@montanaro.dyndns.org> Message-ID: <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> In message: <15995.35595.805586.553059@montanaro.dyndns.org> Skip Montanaro writes: > >Has anybody thought about how any of the Spambayes tools would perform in >the face of disk quotas or full disk partitions? Very poorly. I think that'd send it straight into DB corruption. In general, spambayes is likely to require a bit more disk space than any fixed-pattern classifier like SpamAssassin... my database is about 20 megs, for instance. I don't think that SpamAssassin requires more than a few K of personal storage, unless you turn on its bayesian stuff...
- Alex From tshumway at jdiworks.net Fri Mar 21 18:19:30 2003 From: tshumway at jdiworks.net (Terrel Shumway) Date: Fri Mar 21 21:14:38 2003 Subject: [Spambayes] Binaries for MSwin In-Reply-To: References: Message-ID: <200303211819.30500.tshumway@jdiworks.net> On Wednesday 19 March 2003 15:22, Mark Hammond wrote: > > > Add the Outlook binary installer as part of the release? Absolutely! > > Sounds fine to me - except I would raise the bar a little - why not make a > pop3propxy *binary* release for Windows too - then the problem becomes > moot- on Windows you get a binary. one more reason to publish binaries for mswin: ZoneAlarm. popfile, written in perl, forces the average[1] user to allow all perl programs to access the internet -- a gaping hole in your firewall. (I consider this a defect in ZoneAlarm's design, but I don't think it is going away anytime soon.) --- [1] a sophisticated user could create a private copy of perl.exe and call it popfile.exe From tim at fourstonesExpressions.com Fri Mar 21 22:23:43 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Fri Mar 21 23:23:51 2003 Subject: [Spambayes] Binaries for MSwin In-Reply-To: <200303211819.30500.tshumway@jdiworks.net> Message-ID: 3/21/2003 8:19:30 PM, Terrel Shumway wrote: >one more reason to publish binaries for mswin: ZoneAlarm. >popfile, written in perl, forces the average[1] user to allow all perl >programs to access the internet -- a gaping hole in your firewall. Excellent point. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. 
From skip at pobox.com Sat Mar 22 00:22:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 22 01:22:20 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks In-Reply-To: <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> References: <15995.35595.805586.553059@montanaro.dyndns.org> <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> Message-ID: <15996.275.455989.47435@montanaro.dyndns.org> >> Has anybody thought about how any of the Spambayes tools would >> perform in the face of disk quotas or full disk partitions? Alex> Very poorly. I think that'd send it straight into DB corruption. I'm less concerned with database corruption than loss of email. For stuff like hammiefilter, the database is opened read-only anyway. Skip From popiel at wolfskeep.com Fri Mar 21 22:28:44 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat Mar 22 01:28:48 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks In-Reply-To: Message from Skip Montanaro <15996.275.455989.47435@montanaro.dyndns.org> References: <15995.35595.805586.553059@montanaro.dyndns.org> <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> <15996.275.455989.47435@montanaro.dyndns.org> Message-ID: <20030322062844.D04932DE2F@cashew.wolfskeep.com> In message: <15996.275.455989.47435@montanaro.dyndns.org> Skip Montanaro writes: > > >> Has anybody thought about how any of the Spambayes tools would > >> perform in the face of disk quotas or full disk partitions? > > Alex> Very poorly. I think that'd send it straight into DB corruption. > >I'm less concerned with database corruption than loss of email. For stuff >like hammiefilter, the database is opened read-only anyway. Eh, in that case, it's not spambayes's problem. Mail delivery is outside the scope of a classifier. At most, pop3proxy's private cache would be affected... but I don't think that's how you'd be using the system in a server-based environment. 
- Alex

From skip at pobox.com Sat Mar 22 00:42:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar 22 01:42:30 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks
In-Reply-To: <20030322062844.D04932DE2F@cashew.wolfskeep.com>
References: <15995.35595.805586.553059@montanaro.dyndns.org> <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> <15996.275.455989.47435@montanaro.dyndns.org> <20030322062844.D04932DE2F@cashew.wolfskeep.com>
Message-ID: <15996.1488.414014.793768@montanaro.dyndns.org>

>> I'm less concerned with database corruption than loss of email. For
>> stuff like hammiefilter, the database is opened read-only anyway.

Alex> Eh, in that case, it's not spambayes's problem. Mail delivery is
Alex> outside the scope of a classifier. At most, pop3proxy's private
Alex> cache would be affected... but I don't think that's how you'd be
Alex> using the system in a server-based environment.

I agree, but hammiefilter (for example) has to respond appropriately (no
tracebacks, proper exit code so callers like procmail can do the right
thing) if it encounters an IOError. Similarly, pop3proxy has to not lose
messages if it finds it can't write the message to the disk.

Skip

From spambayes at djl.freeuk.com Sat Mar 22 11:43:44 2003
From: spambayes at djl.freeuk.com (David Leftley)
Date: Sat Mar 22 06:43:48 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
Message-ID:

On Fri, 21 Mar 2003 13:13:04 -0000, "Moore, Paul" wrote:

>I'm trying to get a friend set up with Spambayes using Outlook
>Express. To get some initial training sorted, it would be nice to
>get a mbox file of some of his existing messages which he could
>train on. But I can't find a way of getting OE to save a mbox
>file. Is there a way? Any OE victims around here...?
>
Possibly the simplest way to approach this is to install a copy of
Eudora, and tell it to import the messages from OE. I believe Eudora
uses standard mbox files for its storage.

David.

From francois.granger at free.fr Sat Mar 22 14:20:54 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sat Mar 22 08:21:02 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To:
References: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
Message-ID:

At 11:43 +0000 22/03/2003, in message Re: [Spambayes] Getting a mbox
file from Outlook Expres, David Leftley wrote:
>On Fri, 21 Mar 2003 13:13:04 -0000, "Moore, Paul"
> wrote:
>
>>I'm trying to get a friend set up with Spambayes using Outlook
>>Express. To get some initial training sorted, it would be nice to
>>get a mbox file of some of his existing messages which he could
>>train on. But I can't find a way of getting OE to save a mbox
>>file. Is there a way? Any OE victims around here...?
>>
>Possibly the simplest way to approach this is to install a copy of
>Eudora, and tell it to import the messages from OE. I believe Eudora
>uses standard mbox files for its storage.

Not exactly standard, because it extracts the enclosures.

--
Hofstadter's Law :
It always takes longer than you expect, even when you take into
account Hofstadter's Law.
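
Whichever route produces the file, it may be worth checking that it really parses as a standard mbox before training on it. The following is a minimal sketch using Python's standard `mailbox` module; the helper names `write_sample_mbox` and `summarize_mbox` are made up for illustration and are not part of Spambayes, and the sketch does nothing about attachments that an exporting client may have stripped out.

```python
import mailbox
import os
import tempfile
from email.message import Message


def write_sample_mbox(path, subjects):
    """Create a small mbox file for demonstration (hypothetical helper)."""
    box = mailbox.mbox(path)
    try:
        for subject in subjects:
            msg = Message()
            msg["From"] = "sender@example.com"
            msg["Subject"] = subject
            msg.set_payload("sample body\n")
            box.add(msg)
        box.flush()
    finally:
        box.close()


def summarize_mbox(path):
    """Return (message_count, subjects) for an existing mbox file."""
    # create=False makes a missing file an error instead of silently
    # creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        subjects = [msg.get("Subject", "") for msg in box]
        return len(subjects), subjects
    finally:
        box.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "sample.mbox")
    write_sample_mbox(demo_path, ["first message", "second message"])
    print(summarize_mbox(demo_path))
```

If the count is far below the number of messages the mail client shows, the export probably did not produce a plain mbox.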
From bill at parducci.net Sat Mar 22 06:44:35 2003 From: bill at parducci.net (bill parducci) Date: Sat Mar 22 09:44:40 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks In-Reply-To: <15996.1488.414014.793768@montanaro.dyndns.org> References: <15995.35595.805586.553059@montanaro.dyndns.org> <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> <15996.275.455989.47435@montanaro.dyndns.org> <20030322062844.D04932DE2F@cashew.wolfskeep.com> <15996.1488.414014.793768@montanaro.dyndns.org> Message-ID: <3E7C76D3.9000206@parducci.net> this issue, in combination with some of the manual processes posted to the list to maintain db size and relevancy has made me wonder if spambayes shouldn't incorporate the ability to FIFO token/training info. it seems that the most straightforward way to do this would be to time stamp each entry into the db and then have a configurable param indicating how long the db should keep information before pruning it (ostensibly during the training process). this would fundamentally increase the size of the db in order to store this info, but should make it much more predictable in terms of size. given the results of some of the notes that i have seen on the list, it seems that mail more than a couple of months old doesn't add to the accuracy of the system (and in some cases can decrease it) so i don't see this as a detriment to the system's behavior (as long as the data life span is reasonable). just thinking out loud, but this seems like a move forward in creating a 'set & forget' system. 
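
The timestamp-and-prune idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `AgingTokenStore` is a hypothetical in-memory class, not the Spambayes database layout, the 120-day default is just an example value, and the sketch ignores the per-message ham/spam totals that a real classifier would also have to adjust when it discards old training data.

```python
import time

SECONDS_PER_DAY = 86400


class AgingTokenStore:
    """Hypothetical token store that remembers when each token was last trained."""

    def __init__(self, max_age_days=120):
        self.max_age = max_age_days * SECONDS_PER_DAY
        # token -> [spam_count, ham_count, last_trained_timestamp]
        self.tokens = {}

    def train(self, tokens, is_spam, now=None):
        """Count each distinct token once and refresh its timestamp."""
        now = time.time() if now is None else now
        for tok in set(tokens):
            record = self.tokens.setdefault(tok, [0, 0, now])
            record[0 if is_spam else 1] += 1
            record[2] = now

    def prune(self, now=None):
        """Drop tokens not trained within max_age; return how many were dropped."""
        now = time.time() if now is None else now
        cutoff = now - self.max_age
        stale = [tok for tok, rec in self.tokens.items() if rec[2] < cutoff]
        for tok in stale:
            del self.tokens[tok]
        return len(stale)
```

Calling `prune()` at the end of each training run would match the "during the training process" suggestion. The price is one extra timestamp per token, which is the size increase mentioned above.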
b From skip at pobox.com Sat Mar 22 09:52:33 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 22 10:53:30 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks In-Reply-To: <3E7C76D3.9000206@parducci.net> References: <15995.35595.805586.553059@montanaro.dyndns.org> <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> <15996.275.455989.47435@montanaro.dyndns.org> <20030322062844.D04932DE2F@cashew.wolfskeep.com> <15996.1488.414014.793768@montanaro.dyndns.org> <3E7C76D3.9000206@parducci.net> Message-ID: <15996.34497.75006.235906@montanaro.dyndns.org> bill> this issue, in combination with some of the manual processes bill> posted to the list to maintain db size and relevancy has made me bill> wonder if spambayes shouldn't incorporate the ability to FIFO bill> token/training info. This is also not what I'm worried about. While we need to provide means to manage the size of the database, that is essentially an offline activity. I'm worried simply about the situation where a mail message arrives and there's no disk space left to process it properly. You really can't control the way the database file size grows. Since it's implementing a hash, once the key density gets too high, it expands the database dramatically and shuffles things all around. In between these striking leaps in size, the database grows little, if at all, for each new key added. Let me restate the problem: I just don't want Spambayes to be accused, rightly or wrongly, of losing mail because a disk quota was exceeded or a disk partition filled up. Everything else is merely an inconvenience. Lost mail can't be recovered. What motivated this was an (incorrect, in my opinion) assumption by a sys admin where I work that because there was a failure in a mail setup using procmail and SpamAssassin when the disk quota was exceeded that it was obviously a SpamAssassin problem. 
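
One way to read the requirement restated above is that a filter should fail open: if classification hits an I/O error, hand the original message back untouched and report the failure through the exit status rather than a traceback. The sketch below assumes a hypothetical `classify` callable and a hypothetical `safe_filter` wrapper; it is not how hammiefilter is written. It also assumes the raw message starts with its first header line (no mbox "From " line), and the choice of exit code 75 is an assumption that would have to match how the delivery agent is configured.

```python
EX_OK = 0
EX_TEMPFAIL = 75  # sysexits.h value for "temporary failure"


def safe_filter(raw_message, classify):
    """Return (output_text, exit_code, error_text).

    classify is any callable that takes the raw message text and returns
    a verdict string; it is a stand-in for the real classifier.
    """
    try:
        verdict = classify(raw_message)
    except (OSError, EOFError) as exc:
        # Full disk, exceeded quota, truncated database, and so on:
        # pass the message through unmodified so nothing is lost.
        return raw_message, EX_TEMPFAIL, "classifier unavailable: %s" % exc
    header = "X-Spambayes-Classification: %s\n" % verdict
    return header + raw_message, EX_OK, None
```

A command-line wrapper would write `output_text` to stdout, send `error_text` to a log instead of the message stream, and pass `exit_code` to `sys.exit()`.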
Skip From wsy at merl.com Sat Mar 22 06:20:47 2003 From: wsy at merl.com (Bill Yerazunis) Date: Sat Mar 22 12:21:37 2003 Subject: [Spambayes] Binaries for MSwin Message-ID: <200303221120.h2MBKlQ01327@localhost.localdomain> From: Terrel Shumway > Sounds fine to me - except I would raise the bar a little - why not make a > pop3propxy *binary* release for Windows too - then the problem becomes > moot- on Windows you get a binary. one more reason to publish binaries for mswin: ZoneAlarm. popfile, written in perl, forces the average[1] user to allow all perl programs to access the internet -- a gaping hole in your firewall. (I consider this a defect in ZoneAlarm's design, but I don't think it is going away anytime soon.) --- [1] a sophisticated user could create a private copy of perl.exe and call it popfile.exe Or a sophisticated _installer_ program could make that copy (or symlink) of perl.exe itself, name it popfile.exe, and all would be well. -Bill Y. From bill at parducci.net Sat Mar 22 10:27:45 2003 From: bill at parducci.net (bill parducci) Date: Sat Mar 22 13:27:49 2003 Subject: [Spambayes] filtering in the face of disk quotas or full disks In-Reply-To: <15996.34497.75006.235906@montanaro.dyndns.org> References: <15995.35595.805586.553059@montanaro.dyndns.org> <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com> <15996.275.455989.47435@montanaro.dyndns.org> <20030322062844.D04932DE2F@cashew.wolfskeep.com> <15996.1488.414014.793768@montanaro.dyndns.org> <3E7C76D3.9000206@parducci.net> <15996.34497.75006.235906@montanaro.dyndns.org> Message-ID: <3E7CAB21.3000604@parducci.net> Skip Montanaro wrote: > This is also not what I'm worried about. While we need to provide means to > manage the size of the database, that is essentially an offline activity. > I'm worried simply about the situation where a mail message arrives and > there's no disk space left to process it properly. 
ok, but to date, this is a *manual* 'offline activity' involving any
number of homegrown solutions to resolve. while this is operationally
acceptable to advanced users such as those that mind this list, i
believe that it is impractical for the vast majority of those who could
benefit from this solution (but are unable/unwilling to keep multiple
copies of mail in numerous files, etc.)

> You really can't control the way the database file size grows. Since it's
> implementing a hash, once the key density gets too high, it expands the
> database dramatically and shuffles things all around. In between these
> striking leaps in size, the database grows little, if at all, for each new
> key added.

perhaps using the current hash architecture, but if you have the ability
to maintain the size of the input pool (possibly via a secondary data
store that handles raw tokens), then it seems illogical that the size of
the db cannot be managed within reason.

> Let me restate the problem: I just don't want Spambayes to be accused,
> rightly or wrongly, of losing mail because a disk quota was exceeded or a
> disk partition filled up. Everything else is merely an inconvenience. Lost
> mail can't be recovered. What motivated this was an (incorrect, in my
> opinion) assumption by a sys admin where I work that because there was a
> failure in a mail setup using procmail and SpamAssassin when the disk quota
> was exceeded that it was obviously a SpamAssassin problem.

good luck preventing misplaced accusations! :o)

b

From noreply at sourceforge.net Sat Mar 22 23:35:31 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sun Mar 23 03:31:20 2003
Subject: [Spambayes] [ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows doesn't work...
Message-ID: Bugs item #707491, was opened at 2003-03-22 00:35 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 Category: pop3proxy Group: None Status: Open Resolution: None Priority: 5 Submitted By: Paul Moore (pmoore) Assigned to: Mark Hammond (mhammond) Summary: Pop3 proxy service code for Windows doesn't work... Initial Comment: The pop3proxy_service.py program doesn't seem to work with Python 2.2.2. The problem is that a main program doesn't have a __file__ variable defined. (This works in Python 2.3, which I guess is why this got missed...) I've attached a "quick fix" patch, which uses a helper module "findme.py". ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-23 18:35 Message: Logged In: YES user_id=14198 Fixed in r1.3 - thanks. ---------------------------------------------------------------------- Comment By: Paul Moore (pmoore) Date: 2003-03-22 00:36 Message: Logged In: YES user_id=113328 File attachment didn't work :-( ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 From noreply at sourceforge.net Sat Mar 22 23:35:48 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Sun Mar 23 03:31:28 2003 Subject: [Spambayes] [ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows doesn't work... Message-ID: Bugs item #707491, was opened at 2003-03-22 00:35 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 Category: pop3proxy Group: None >Status: Closed >Resolution: Fixed Priority: 5 Submitted By: Paul Moore (pmoore) Assigned to: Mark Hammond (mhammond) Summary: Pop3 proxy service code for Windows doesn't work... Initial Comment: The pop3proxy_service.py program doesn't seem to work with Python 2.2.2. 
The problem is that a main program doesn't have a __file__ variable defined. (This works in Python 2.3, which I guess is why this got missed...) I've attached a "quick fix" patch, which uses a helper module "findme.py". ---------------------------------------------------------------------- Comment By: Mark Hammond (mhammond) Date: 2003-03-23 18:35 Message: Logged In: YES user_id=14198 Fixed in r1.3 - thanks. ---------------------------------------------------------------------- Comment By: Paul Moore (pmoore) Date: 2003-03-22 00:36 Message: Logged In: YES user_id=113328 File attachment didn't work :-( ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702 From Eugeny.Sattler at RU.NESTLE.com Mon Mar 24 17:25:23 2003 From: Eugeny.Sattler at RU.NESTLE.com (Eugeny.Sattler@RU.NESTLE.com) Date: Mon Mar 24 09:53:05 2003 Subject: [Spambayes] SpamBayes and Outlook 2000 Message-ID: <5D7D85C4DFC1D411BD8700B0D07810E00174A272@KUFMXS04> Hi, I would like to try your Outlook 2000 add-in. Pls tell me, is it for POP3 connection only or suitable also for MS Exchange Server 5.5 environment ? Thanks. -- Eugeny From mhammond at skippinet.com.au Tue Mar 25 08:04:13 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Mar 24 16:04:57 2003 Subject: [Spambayes] SpamBayes and Outlook 2000 In-Reply-To: <5D7D85C4DFC1D411BD8700B0D07810E00174A272@KUFMXS04> Message-ID: > Hi, > I would like to try your Outlook 2000 add-in. > Pls tell me, is it for POP3 connection only or suitable also for > MS Exchange > Server 5.5 environment ? > Thanks. It is suitable for both. Regards, Mark. From skip at pobox.com Mon Mar 24 15:16:20 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 24 16:16:30 2003 Subject: [Spambayes] __del__ in DBDictClassifier? 
Message-ID: <15999.30116.549922.124871@montanaro.dyndns.org> Is there some reason the storage.DBDictClassifier class doesn't implement a __del__ method which calls store()? If not, I'm going to add one. Skip From noreply at sourceforge.net Mon Mar 24 14:19:35 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 24 17:19:39 2003 Subject: [Spambayes] [ spambayes-Bugs-709051 ] Error loading configuration should not be fatal Message-ID: Bugs item #709051, was opened at 2003-03-25 09:19 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Error loading configuration should not be fatal Initial Comment: There was a report of this error using the second binary release: SpamAddin - Connecting to Outlook pythoncom error: Failed to call the universal dispatcher Traceback (most recent call last): File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_ File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_ File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_ File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager File "E:\src\spambayes\Outlook2000\manager.py", line 152, in __init__ File "E:\src\spambayes\Outlook2000\manager.py", line 355, in LoadConfig exceptions.EOFError: While there is another problem that caused this error, we should not die completely loading the config pickle should it get screwed up. 
However, as this means spambayes will be unconfigured, we do need a scheme to let the user know this (as we do in the few other places where we disable spambayes due to config errors) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702 From noreply at sourceforge.net Mon Mar 24 14:56:42 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 24 17:46:04 2003 Subject: [Spambayes] [ spambayes-Bugs-709051 ] Error loading configuration should not be fatal Message-ID: Bugs item #709051, was opened at 2003-03-25 09:19 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Error loading configuration should not be fatal Initial Comment: There was a report of this error using the second binary release: SpamAddin - Connecting to Outlook pythoncom error: Failed to call the universal dispatcher Traceback (most recent call last): File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_ File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_ File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_ File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager File "E:\src\spambayes\Outlook2000\manager.py", line 152, in __init__ File "E:\src\spambayes\Outlook2000\manager.py", line 355, in LoadConfig exceptions.EOFError: While there is another problem that caused this error, we should not die completely loading the config pickle should it get screwed up. 
However, as this means spambayes will be unconfigured, we do need a scheme to let the user know this (as we do in the few other places where we disable spambayes due to config errors) ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-03-25 09:56 Message: Logged In: YES user_id=14198 The reporter just let me know that the problem was caused by about 20 power failures over short period. So I don't think we can cure the cause here, just the symptoms. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702 From tim at fourstonesExpressions.com Mon Mar 24 19:34:18 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 24 20:34:25 2003 Subject: [Spambayes] __del__ in DBDictClassifier? In-Reply-To: <15999.30116.549922.124871@montanaro.dyndns.org> Message-ID: 3/24/2003 3:16:20 PM, Skip Montanaro wrote: > >Is there some reason the storage.DBDictClassifier class doesn't implement a >__del__ method which calls store()? If not, I'm going to add one. Yup. There is no guarantee that the __del__ method is called, so we (Richie and I) felt like rather than give the impression that store would always be called, it would be better to make it explicit. You know.. the old "dumb beats smart" thing. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From skip at pobox.com Mon Mar 24 22:22:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 24 23:22:34 2003 Subject: [Spambayes] __del__ in DBDictClassifier? 
In-Reply-To:
References: <15999.30116.549922.124871@montanaro.dyndns.org>
Message-ID: <15999.55680.895881.768181@montanaro.dyndns.org>

>> Is there some reason the storage.DBDictClassifier class doesn't
>> implement a __del__ method which calls store()?

Tim> Yup. There is no guarantee that the __del__ method is called,

You're suggesting that there's a good chance a DBDictClassifier instance
will be involved in a cycle? Looking at the code briefly I didn't see any
instance attributes which looked like they would refer to other objects
which would (possibly indirectly) refer back to the instance. It's a common
Python idiom to call an object's close() method in its __del__ method.

Skip

From tim at fourstonesExpressions.com Tue Mar 25 07:46:45 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 25 08:46:57 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
In-Reply-To: <15999.55680.895881.768181@montanaro.dyndns.org>
Message-ID:

3/24/2003 10:22:24 PM, Skip Montanaro wrote:

>You're suggesting that there's a good chance a DBDictClassifier instance
>will be involved in a cycle? Looking at the code briefly I didn't see any
>instance attributes which looked like they would refer to other objects
>which would (possibly indirectly) refer back to the instance. It's a common
>Python idiom to call an object's close() method in its __del__ method.

Quoting your mail of 11/14/2002:

From: Skip Montanaro
Date: Thu, 14 Nov 2002 10:49:28 -0600
To: spambayes@python.org
Subject: [Spambayes] read-only DBDict in hammie?

I'd like to share the anydbm file between several accounts on my machine.
Before I fiddle hammie.py so it opens the file in read-only mode, is there
any reason when classifying (not training) it actually needs to update the
file?
There's a __del__ method in PersistentBayes which does this:

    def __del__(self):
        #super.__del__(self)
        self.save_state()

    def save_state(self):
        self.wordinfo[self.statekey] = (self.nham, self.nspam)

When classifying there's no reason that nham or nspam would change, right?

Skip

Quoting an exchange between Neale and Richie dated 11/18/2002:

From: Richie Hindle
To: Neale Pickett
Subject: Re: [Spambayes] Hammiefilter doesn't write out the pickle
Date: Mon, 18 Nov 2002 18:02:07 +0000
Cc: spambayes@python.org

Hi Neale,

> Neale thinks this is the right way to do it. If the Bayes.* classes
> write out their state on destruction, we can treat them all the same.
> That's easy enough, just have them call self.store() in the __del__
> method.

Richie thinks this is a bad move. Here's a minor rant I sent to Tim Stone
when he did exactly this in his Bayes module:

--------------------------------------------------------------------------

PersistentBayes.__del__() calls store() - this seems like a bad thing for
three reasons. One is that I might not want to save my changes to the
database - pop3proxy has an explicit "Save & Shutdown" and "Shutdown"
buttons to give the user control over whether the database is saved or
not (to let you do speculative training and discard the results, for
instance). [This is the least important of the three reasons. Four, four
reasons!] Also, the pop3proxy self-test uses an in-memory bayes instance
that it never wants to write to disk.

Secondly, it's unpredictable when __del__ will be called, or even *whether*
it will be called - this:

    class A:
        def __del__(self):
            print "A.__del__"

    class B:
        def __del__(self):
            print "B.__del__"

    a = A()
    b = B()
    a.b = b
    b.a = a
    print "Exiting..."

won't call either __del__ method in the current CPython implementation.

Thirdly, if users of PersistentBayes explicitly call store() - which seems
like the right thing to do - the database will be written out twice. [And
that can take *a long time*.]
[snip]

I've found another reason why PersistentBayes.__del__() is a bad thing -
self.db_name isn't set in the case where a PickledBayes is created using
a filename that doesn't exist (which is done by the pop3proxy self-test) -
that was leading to exceptions being thrown from __del__, which is a
notoriously hard problem to track down.

--------------------------------------------------------------------------

I'd much rather have an explicit store() method and document the fact that
storage may be pre-empted by certain implementations. Relying on __del__
is nasty.

--
Richie Hindle
richie@entrian.com

As you can tell, I had coded the __del__ originally, and it was removed for
the objections that you and Richie raised.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
those who understand binary,
and those who don't.

From skip at pobox.com Tue Mar 25 08:12:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 25 09:12:33 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
In-Reply-To:
References: <15999.55680.895881.768181@montanaro.dyndns.org>
Message-ID: <16000.25542.670078.393940@montanaro.dyndns.org>

Tim> Richie thinks this is a bad move. Here's a minor rant I sent to
Tim> Tim Stone when he did exactly this in his Bayes module:

Tim> ... - pop3proxy has an explicit "Save & Shutdown" and "Shutdown"
Tim> buttons to give the user control over whether the database is saved
Tim> or not ...

Good enough for me.

Skip

From skip at pobox.com Wed Mar 26 09:19:54 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 26 10:20:02 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
Message-ID: <16001.50458.808956.759671@montanaro.dyndns.org>

I've been asked to take a look at installing Spambayes for a user in one of
the departments. She's running Win2k/XP and uses Eudora as her email
client. Sounds like I will need to install Python+pop3proxy for her.
I seem to recall something odd about Eudora and different POP servers. Is that only when using multiple POP servers? Thanks, Skip From tim at fourstonesExpressions.com Wed Mar 26 09:56:51 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 26 10:56:57 2003 Subject: [Spambayes] Win 2k/XP + Eudora? In-Reply-To: <16001.50458.808956.759671@montanaro.dyndns.org> Message-ID: 3/26/2003 9:19:54 AM, Skip Montanaro wrote: > Is that only when using multiple POP servers? Yup. Apparently Eudora can only access one pop server. Papadoc checked in an html document about configuring various pop3 clients, but I can't seem to find it at the moment. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim at fourstonesExpressions.com Wed Mar 26 10:07:36 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Wed Mar 26 11:07:40 2003 Subject: [Spambayes] Win 2k/XP + Eudora? In-Reply-To: <3E81CFCE.30605@videotron.ca> Message-ID: <1Z97LKLE0B6A6OC8KG62GFQMGAJGUP.3e81d048@myst> 3/26/2003 10:05:34 AM, papaDoc wrote: >Hi Tim, > >The document was not checked in since I don't have check in access >but the document is attached to one of the old mail of this list. > Ah... no wonder I can't find it! I have the old mail. Thanks for the tip. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From skip at pobox.com Wed Mar 26 10:45:03 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 26 11:45:12 2003 Subject: [Spambayes] Win 2k/XP + Eudora? 
In-Reply-To: <1Z97LKLE0B6A6OC8KG62GFQMGAJGUP.3e81d048@myst>
References: <3E81CFCE.30605@videotron.ca> <1Z97LKLE0B6A6OC8KG62GFQMGAJGUP.3e81d048@myst>
Message-ID: <16001.55567.948358.769105@montanaro.dyndns.org>

>> The document was not checked in since I don't have check in access
>> but the document is attached to one of the old mail of this list.

Tim> Ah... no wonder I can't find it! I have the old mail. Thanks for
Tim> the tip.

Tim,

If you can check it in, please do. Otherwise, forward it to me and I'll see
that it gets stitched into the spambayes website.

Thx,

Skip

From tony-bayes at lownds.com Wed Mar 26 09:15:46 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Wed Mar 26 12:37:08 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <16001.50458.808956.759671@montanaro.dyndns.org>
References: <16001.50458.808956.759671@montanaro.dyndns.org>
Message-ID:

At 9:19 AM -0600 3/26/03, Skip Montanaro wrote:
>I've been asked to take a look at installing Spambayes for a user in one of
>the departments. She's running Win2k/XP and uses Eudora as her email
>client. Sounds like I will need to install Python+pop3proxy for her. I
>seem to recall something odd about Eudora and different POP servers. Is
>that only when using multiple POP servers?
>

Eudora can't use a different port for different accounts, they all
have to use port 110. With a plugin, a port other than 110 can be
used - but it is still used across accounts.

-Tony

From francois.granger at free.fr Wed Mar 26 19:14:01 2003
From: francois.granger at free.fr (Francois Granger)
Date: Wed Mar 26 13:14:07 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To:
References:
Message-ID:

At 09:56 -0600 on 26/03/2003, in message Re: [Spambayes] Win 2k/XP +
Eudora?, Tim Stone - Four Stones Expressions wrote:
>3/26/2003 9:19:54 AM, Skip Montanaro wrote:
>
>> Is that only when using multiple POP servers?
>
>Yup. Apparently Eudora can only access one pop server.

Eudora can access multiple POP servers.
But all must have the same port number.

Somebody (I don't remember who and can't find the message in the archive)
gave a trick for Mac OS X, also usable on other Unixes, which creates
multiple localhost addresses.

======= in a shell script
sudo ifconfig lo0 inet 127.0.0.2 add
sudo ifconfig lo0 inet 127.0.0.3 add
sudo ifconfig lo0 inet 127.0.0.4 add

======= in bayescustomize.ini
[pop3proxy]
pop3proxy_servers = pop.nerim.net:110, pop.free.fr:110, altern.org:110, pop.laposte.net:110
pop3proxy_ports = 127.0.0.1:110, 127.0.0.2:110, 127.0.0.3:110, 127.0.0.4:110

=======

Might this be portable to W2000?

Ref: http://mail.python.org/pipermail/spambayes/2003-January/002659.html

--
Hofstadter's Law : It always takes longer than you expect, even when you
take into account Hofstadter's Law.

From tim at fourstonesExpressions.com Wed Mar 26 13:09:25 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 26 14:09:33 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To:
Message-ID: <76IESRTO9676YJIQNTOJ3YTR1TKVR.3e81fae5@myst>

3/26/2003 12:14:01 PM, Francois Granger wrote:

>
>======= in a shell script
>sudo ifconfig lo0 inet 127.0.0.2 add
>sudo ifconfig lo0 inet 127.0.0.3 add
>sudo ifconfig lo0 inet 127.0.0.4 add
>
>======= in bayescustomize.ini
>[pop3proxy]
>pop3proxy_servers = pop.nerim.net:110, pop.free.fr:110,
>altern.org:110, pop.laposte.net:110
>pop3proxy_ports = 127.0.0.1:110, 127.0.0.2:110,127.0.0.3:110, 127.0.0.4:110
>
>=======
>
>This may be portable to W2000 ?
>

The alternate IP addresses that you created in your shell script can be
added in the c:\winnt\system32\drivers\etc\hosts file. Simply add lines
like:

127.0.0.1 localhost    (this line should already be there...)
127.0.0.2 localhost2
127.0.0.3 localhost3
127.0.0.4 localhost4

and then use the same trick in bayescustomize.ini.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
those who understand binary, and those who don't.
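
Before pointing pop3proxy at the extra loopback addresses, it may be worth
checking that the OS will actually let a program listen on them. The sketch
below is only an illustration of that check: can_bind is a made-up helper
(not part of Spambayes), and it binds to an ephemeral port (0) rather than
110 so that it does not need administrator rights.

```python
import socket


def can_bind(host, port=0):
    """Return True if a TCP socket can be bound to (host, port).

    Hypothetical helper for illustration only; port 0 asks the OS for
    any free port, so no special privileges are needed.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError:
        # Typically "cannot assign requested address" when the loopback
        # alias has not been configured on this machine.
        return False
    finally:
        s.close()


if __name__ == "__main__":
    for host in ("127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"):
        print(host, "ok" if can_bind(host) else "not available")
```

If an address shows up as not available, the ifconfig aliases (or the
platform's equivalent) presumably still need to be set up; a hosts entry by
itself only gives the address a name.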
From skip at pobox.com Wed Mar 26 14:19:10 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 26 15:19:17 2003 Subject: [Spambayes] Any ideas about this one? Message-ID: <16002.2878.68928.814803@montanaro.dyndns.org> The message at http://manatee.mojam.com/~skip/junk.msg scored squarely in the ham zone for me, mostly because the scoring was swamped by all those normally good address clues (aahz, aleax cosc.canterbury.ac.nz, etc). I could obviously remove "to" from my address_headers option. I tried doing that, which moved it up near 0.5, however I noticed no skip: tokens were generated: X-Spambayes-Classification: unsure; 0.46 X-Spambayes-Debug: '*H*': 0.89; '*S*': 0.80; 'x-mailer:microsoft outlook imo, build 9.0.2416 (9.0.2911.0)': 0.01; 'subject:pack': 0.09; 'subject:: ': 0.19; 'header:Message-ID:1': 0.35; 'subject:Watch': 0.75; 'content-type:application/x-msdownload': 0.97; 'filename:fname piece:exe': 0.97 Is that related to the structure of the message (causing the attachment to be skipped altogether)? Skip P.S. I couldn't send the message itself to the list because the virus detector rejected it, hence the URL above. Should we allow stuff like that to squeeze through to this list? S From tim.one at comcast.net Wed Mar 26 17:04:08 2003 From: tim.one at comcast.net (Tim Peters) Date: Wed Mar 26 17:06:23 2003 Subject: [Spambayes] Any ideas about this one? In-Reply-To: <16002.2878.68928.814803@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > The message at > > http://manatee.mojam.com/~skip/junk.msg > > scored squarely in the ham zone for me, mostly because the scoring was > swamped by all those normally good address clues (aahz, aleax > cosc.canterbury.ac.nz, etc). I could obviously remove "to" from my > address_headers option. I tried doing that, which moved it up near 0.5, > however I noticed no skip: tokens were generated: > > ... > > Is that related to the structure of the message (causing the > attachment to be skipped altogether)? 
I think so -- the MIME type was application/x-msdownload, and the tokenizer doesn't even bother to decode non- text/* portions. > ... > P.S. I couldn't send the message itself to the list because the virus > detector rejected it, hence the URL above. Should we allow stuff > like that to squeeze through to this list? It would have been held for moderator approval regardless, due to sheer size, and I would have rejected it (people on this list should be able to find quarter-meg examples of viruses on their own ). The salient points in this message were the headers, + a comment of the form "and the body is a quarter megabyte of base64". From francois.granger at free.fr Wed Mar 26 23:35:57 2003 From: francois.granger at free.fr (Francois Granger) Date: Wed Mar 26 17:36:04 2003 Subject: [Spambayes] Any ideas about this one? In-Reply-To: <16002.2878.68928.814803@montanaro.dyndns.org> References: <16002.2878.68928.814803@montanaro.dyndns.org> Message-ID: At 14:19 -0600 on 26/03/2003, in message [Spambayes] Any ideas about this one?, Skip Montanaro wrote: >The message at > > http://manatee.mojam.com/~skip/junk.msg > >scored squarely in the ham zone for me, mostly because the scoring was >swamped by all those normally good address clues (aahz, aleax >cosc.canterbury.ac.nz, etc). I could obviously remove "to" from my >address_headers option. I tried doing that, which moved it up near 0.5, >however I noticed no skip: tokens were generated: > > X-Spambayes-Classification: unsure; 0.46 > X-Spambayes-Debug: '*H*': 0.89; '*S*': 0.80; > 'x-mailer:microsoft outlook imo, build 9.0.2416 > (9.0.2911.0)': 0.01; 'subject:pack': 0.09; 'subject:: ': 0.19; > 'header:Message-ID:1': 0.35; 'subject:Watch': 0.75; > 'content-type:application/x-msdownload': 0.97; > 'filename:fname piece:exe': 0.97 > >Is that related to the structure of the message (causing the attachment to >be skipped altogether)? Not easy to classify... 
My database "thinks": Spam probability: 0.810594681692 Clues: *H* 0.313039016579 *S* 0.934228379963 header:Received:5 0.0854354380187 subject:: 0.110737860364 subject:. 0.744834167131 header:Importance:1 0.781318555354 to:2**6 0.844827586207 subject:this 0.898823641021 subject:Watch 0.983271375465 well, funny ! -- Hofstadter's Law : It always takes longer than you expect, even when you take into account Hofstadter's Law. From skip at pobox.com Wed Mar 26 18:26:34 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed Mar 26 19:26:38 2003 Subject: [Spambayes] Any ideas about this one? In-Reply-To: References: <16002.2878.68928.814803@montanaro.dyndns.org> Message-ID: <16002.17722.470645.635722@montanaro.dyndns.org> Francois> to:2**6 0.844827586207 Odd, I don't see that at all in my clues. As long as someone's database is snagging that message, I won't worry about it, though I am kind of curious about the missing to:2**6 clue in the debug results. Skip From popiel at wolfskeep.com Wed Mar 26 19:29:31 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Mar 26 22:29:36 2003 Subject: [Spambayes] Any ideas about this one? In-Reply-To: Message from Skip Montanaro <16002.17722.470645.635722@montanaro.dyndns.org> References: <16002.2878.68928.814803@montanaro.dyndns.org> <16002.17722.470645.635722@montanaro.dyndns.org> Message-ID: <20030327032931.9EEE92DDC7@cashew.wolfskeep.com> In message: <16002.17722.470645.635722@montanaro.dyndns.org> Skip Montanaro writes: > Francois> to:2**6 0.844827586207 > >Odd, I don't see that at all in my clues. > >As long as someone's database is snagging that message, I won't worry about >it, though I am kind of curious about the missing to:2**6 clue in the debug >results. It probably was in the midrange zone to be ignored (.4 to .6 by default). - Alex From skip at pobox.com Thu Mar 27 08:03:44 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu Mar 27 09:03:50 2003 Subject: [Spambayes] Any ideas about this one? 
In-Reply-To: <20030327032931.9EEE92DDC7@cashew.wolfskeep.com>
References: <16002.2878.68928.814803@montanaro.dyndns.org> <16002.17722.470645.635722@montanaro.dyndns.org> <20030327032931.9EEE92DDC7@cashew.wolfskeep.com>
Message-ID: <16003.1216.922317.235294@montanaro.dyndns.org>

    >> As long as someone's database is snagging that message, I won't worry
    >> about it, though I am kind of curious about the missing to:2**6 clue
    >> in the debug results.

    Alex> It probably was in the midrange zone to be ignored (.4 to .6 by
    Alex> default).

The default is 0.5 (meaning show everything):

    # The range of clues that are added to the "debug" header in the E-mail
    # All clues that have their probability smaller than this number, or
    # larger than one minus this number are added to the header such that
    # you can see why spambayes thinks this is ham/spam or why it is
    # unsure. The default is to show all clues, but you can reduce that by
    # setting showclue to a lower value, such as 0.1
    clue_mailheader_cutoff: 0.5

and I didn't change that, so everything should be shown. Just for
completeness, here's my options file, in case I'm missing something:

    [Hammie]
    hammie_debug_header: True

    [Tokenizer]
    summarize_email_prefixes: True
    summarize_email_suffixes: True
    address_headers: from

    [Categorization]
    ham_cutoff: 0.20
    spam_cutoff: 0.88

    [hammiefilter]
    hammiefilter_persistent_storage_file: ~/hammie.db

    [globals]
    dbm_type: dbhash

Skip

From francois.granger at free.fr Thu Mar 27 15:12:44 2003
From: francois.granger at free.fr (Francois Granger)
Date: Thu Mar 27 09:12:50 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <16003.1216.922317.235294@montanaro.dyndns.org>
References: <16002.2878.68928.814803@montanaro.dyndns.org> <16002.17722.470645.635722@montanaro.dyndns.org> <20030327032931.9EEE92DDC7@cashew.wolfskeep.com> <16003.1216.922317.235294@montanaro.dyndns.org>
Message-ID:

At 08:03 -0600 on 27/03/2003, in message Re: [Spambayes] Any ideas about
this one?, Skip Montanaro wrote:

>completeness, here's my options file, in case I'm missing something:
>
> [Hammie]
> hammie_debug_header: True
>
> [Tokenizer]
> summarize_email_prefixes: True
> summarize_email_suffixes: True
> address_headers: from
>
> [Categorization]
> ham_cutoff: 0.20
> spam_cutoff: 0.88
>
> [hammiefilter]
> hammiefilter_persistent_storage_file: ~/hammie.db
>
> [globals]
> dbm_type: dbhash

To be able to compare, here is mine:

[Categorization]
ham_cutoff = 0.10
spam_cutoff = 0.95

[pop3proxy]
pop3proxy_persistent_storage_file = hammie.db
pop3proxy_servers = pop.nerim.net:110, pop.free.fr:110, altern.org:110, pop.laposte.net
pop3proxy_ports = 127.0.0.1:110, 127.0.0.2:110,127.0.0.3:110, 127.0.0.4:110

--
Hofstadter's Law : It always takes longer than you expect, even when you
take into account Hofstadter's Law.

From skip at pobox.com Thu Mar 27 21:10:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 27 22:11:06 2003
Subject: [Spambayes] Non-email use of the spambayes project
Message-ID: <16003.48447.482874.642781@montanaro.dyndns.org>

I've successfully applied the Spambayes code (http://spambayes.sf.net/) to a
non-email application today and thought I'd pass the concept along to
others. Many of you on c.l.py probably are aware of the Spambayes project
which relies on user segregation of a set of email messages into spam and
ham, then combines the resulting clues they contain to predict the hamminess
or spamminess of email messages it hasn't seen before. It works extremely
well for this, but the basic concept is applicable to other classification
problems.
I've operated the Mojam and Musi-Cal websites for several years. Over that time we've accumulated a sizable venue database. Unfortunately, many entries in the database have become stale and don't contribute anything to the system other than to slow down queries. Venue names get misspelled, venues go out of business, non-venue stuff slips into the database, or other errors occur. As a result, I had a venue database containing roughly 35,000 entries, only about half of which were referenced by concert items in the database. The database as it sat couldn't be licensed to potential customers because of all the errors it contained. I could simply delete all of those entries, but that would delete a lot of useful content from the database. Many of those currently unreferenced venue entries *are* correct and will eventually be associated with other concerts, or will be useful as corollary information for people using our websites or as an extra database we can license to content consumers. I wrote a trivial little application today which allowed me to rummage through the unreferenced records in the database. I could delete entries which I felt were incorrect, but it was a one-at-a-time process. With 15,000+ entries to scan, one-by-one wasn't going to cut it. Then I got the idea to use the Spambayes classifier to watch what I was doing and train on my actions. I was viewing the records in chunks of 20 items at a time, sorted alphabetically. I could choose to delete one or more items or move onto the next chunk of 20 entries. A deletion caused the classifier to be trained on the entry as "spam". Moving onto the next chunk caused the classifier to be trained on the remaining undeleted entries as "ham". Over a short period of time, it got reasonably good at identifying "spam". I then started sorting each chunk of 20 items by its spambayes score and could specify a threshold score below which to eliminate all entries in that chunk. 
The next improvement was to sort the entire mess of records by the spambayes
classification. I was then seeing entire chunks of records whose scores
fell below the threshold and was able to delete them 20 at a time.

The entire Spambayes-related code is a single tokenizer generator function
and a small Classifier class:

    import math
    import string

    import spambayes.storage

    class Classifier:
        def __init__(self):
            self.cls = spambayes.storage.DBDictClassifier("fven.db")

        def classify(self, d):
            return self.cls.spamprob(tokenize(d), True)

        def train(self, d, saved):
            self.cls.learn(tokenize(d), saved)

        def __del__(self):
            self.cls.store()

    def tokenize(d):
        # d is a dictionary as returned by a MySQL query - tokenize the
        # various fields, noting interesting facts
        yield "venue length:%d" % len(d["venue"])
        for word in d["venue"].split():
            # looks like a festival - not a venue at all
            if word.lower().endswith("fest"):
                yield "venue:"
            yield "venue:"+word

        # most correct venue names don't contain punctuation
        # (null_xlate is not defined in this excerpt; string.translate()
        # needs a 256-character table here, presumably the identity map)
        if (string.translate(d["venue"], null_xlate, string.punctuation) !=
            d["venue"]):
            yield "venue:"

        # no address information for this venue - less valuable
        if not d["addr1"]:
            yield "addr1:"
        elif d["addr1"][0] not in string.digits:
            # most valid addresses in the US/Canada begin with a street number
            yield "addr1:"
        for word in d["addr1"].split():
            yield "addr1:"+word
        for word in d["addr2"].split():
            yield "addr2:"+word
        yield "phone:"+d["phone"]
        yield "city:"+d["city"].strip()
        yield "region:"+(d["state"].strip() or d["country"].strip())
        yield "zip:"+d["zip"]

        # sometimes the city gets replicated in the address, making the
        # data "dirtier" and thus less valuable
        vwords = d["venue"].lower().split()
        for word in d["city"].lower().split():
            if word in vwords:
                yield "city:"
                break

        # the record's id reflects its age - older records, and thus
        # smaller ids, are more likely to be outdated
        try:
            yield "id:2**%.0f" % math.log(int(d["id"]) // 100)
        except OverflowError:
            yield "id:2**0"
        return

    ...
    classifier = Classifier()

The input to the tokenizer, instead of being an email message, is a
dictionary representing the return value from an SQL query. When an item is
to be deleted, the classifier is trained on it like so:

    classifier.train(d, False)

When moving to the next chunk, the classifier is trained on the remaining
records like so:

    for item in chunk:
        classifier.train(item, True)

I haven't gotten too crazy with the tokenizer (compare it with the Spambayes
tokenizer!). I will probably collect some other clues in the tokenizer,
such as what other tables reference the venue record. For the time being,
it's working okay. I just need it to do a reasonably good job segregating
records so I can quickly scan a group and make a deletion decision. So far,
it's doing a very good job. Not bad for 15-30 minutes of work...

Skip

From tim_one at email.msn.com Thu Mar 27 23:27:34 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Thu Mar 27 23:28:15 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <16003.1216.922317.235294@montanaro.dyndns.org>
Message-ID:

[T. Alexander Popiel]
> It probably was in the midrange zone to be ignored (.4 to .6 by
> default).

[Skip Montanaro]
> The default is 0.5 (meaning show everything):
> ...
> clue_mailheader_cutoff: 0.5

I expect Alex had this in mind:

    # When scoring a message, ignore all words with
    # abs(word.spamprob - 0.5) < minimum_prob_strength.
    # This may be a hack, but it has proved to reduce error rates in many
    # tests. 0.1 appeared to work well across all corpora.
    minimum_prob_strength: 0.1

abs(p-0.5) < 0.1 is-same-as 0.4 < p < 0.6; Classifier._getclues() doesn't
return any word with a spamprob in that range.
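
The rule Tim describes can be sketched in a few lines. This is not the
actual Classifier._getclues() code, just an illustration of the
minimum_prob_strength test; strong_clues is a made-up name, and the 0.55
given to to:2**6 is a hypothetical value chosen to land in the ignored band
(the other two probabilities are taken from Skip's debug header).

```python
def strong_clues(clues, minimum_prob_strength=0.1):
    """Keep only (word, spamprob) pairs far enough from 0.5 to count.

    Illustrative stand-in for the filtering step; words with
    abs(spamprob - 0.5) < minimum_prob_strength are dropped.
    """
    return [(word, prob) for word, prob in clues
            if abs(prob - 0.5) >= minimum_prob_strength]


clues = [
    ("to:2**6", 0.55),        # hypothetical mid-range value, gets ignored
    ("subject:Watch", 0.75),  # from the X-Spambayes-Debug header
    ("subject:pack", 0.09),   # from the X-Spambayes-Debug header
]
print(strong_clues(clues))
```

With a word sitting in the 0.4 to 0.6 band, the clue never reaches the debug
header no matter what clue_mailheader_cutoff is set to, which would explain
why it shows up in one database's clue list and not in another's.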
From Paul.Moore at atosorigin.com Fri Mar 28 09:22:38 2003 From: Paul.Moore at atosorigin.com (Moore, Paul) Date: Fri Mar 28 04:24:02 2003 Subject: [Spambayes] Non-email use of the spambayes project Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D9CA@UKDCX001.uk.int.atosorigin.com> From: Skip Montanaro [mailto:skip@pobox.com] > I've successfully applied the Spambayes code (http://spambayes.sf.net/) > to a non-email application today and thought I'd pass the concept along > to others. This is a lovely idea! Based on this description, I'm sure I can think of a number of "data cleaning" exercises I'd like to do which might benefit from this sort of approach. Makes me wonder if there's a case for taking the algorithmic guts out of spambayes, and making a standalone library module from it... Thanks for posting this. Paul. From mhammond at skippinet.com.au Fri Mar 28 23:04:48 2003 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri Mar 28 07:05:32 2003 Subject: [Spambayes] RE: error when trying spambayes addin with outlook2002 In-Reply-To: <6D9E338D57C2D411AC5800105AF41E123A1B67@mailnt.eurosys.nl> Message-ID: > exceptions.ValueError: invalid literal for float(): 0.20 This will almost certainly be due to not succumbing to world domination, and not having your locale set to an English one . Adding: import locale locale.setlocale(locale.LC_NUMERIC, "en") Somewhere near the top of addin.py should fix this. I think I will check this in, rather than waiting for a non-Windows user to strike this problem ;) Coalition-of-the-commas ly, Mark. From jon at doobla.com Fri Mar 28 04:02:44 2003 From: jon at doobla.com (Jonathon Jones) Date: Fri Mar 28 07:18:08 2003 Subject: [Spambayes] Using your script with sendmail on my server? Message-ID: <003601c2f508$ccecce10$a98f59cf@doobla> Hi, I am somewhat new to Linux but I am learning fast. I have a Linux server with Ensim installed where I host my own sites and a few for others. 
I want to use your filter to filter out spam on the box and I was wondering how I can do it? Ideally there would have to be a database for each domain or user and I would want it to run between the mail server and their client software, but on my box. I don't want them to have to install any software or anything. I was thinking about setting up training email addresses so that anything sent to spam@domain.com would be flagged as spam and anything sent to ham@domain.com would be flagged as ham. Any suggestions? Am I on the right track, or is there a better way? I'd really appreciate any help you'd be willing to give. God Bless, Jon From skip at pobox.com Fri Mar 28 06:19:39 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri Mar 28 07:19:45 2003 Subject: [Spambayes] Non-email use of the spambayes project In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D9CA@UKDCX001.uk.int.atosorigin.com> References: <16E1010E4581B049ABC51D4975CEDB880113D9CA@UKDCX001.uk.int.atosorigin.com> Message-ID: <16004.15835.738554.638681@montanaro.dyndns.org> Paul> Makes me wonder if there's a case for taking the algorithmic guts Paul> out of spambayes, and making a standalone library module from Paul> it... Given how easy it is to use as-is, I don't see a strong need. More important I think is to document how to use it as I did. So much of what is there now is so strongly tied to classifying email messages that it's easy to lose sight of how well it can be applied to other classifcation problems. Skip From spambayes at rodland.no Fri Mar 28 13:55:00 2003 From: spambayes at rodland.no (Fredrik Rodland) Date: Fri Mar 28 07:56:31 2003 Subject: [Spambayes] Non-email use of the spambayes project In-Reply-To: <16004.15835.738554.638681@montanaro.dyndns.org> Message-ID: > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of Skip Montanaro > Sent: 28. 
March 2003 13:20
> To: Moore, Paul
> Cc: python-list@python.org; spambayes@python.org
> Subject: RE: [Spambayes] Non-email use of the spambayes project
>
> important I think is to document how to use it as I did. So much of
> what is there now is so strongly tied to classifying email messages
> that it's easy to lose sight of how well it can be applied to other
> classification problems.

Totally agree!

Also, for those of us who aren't completely into Python, it would be great
to have some sort of cookbook/skeletons/APIs available and documented. I
tried to read your original code, but gave up after a while. I have a
similar situation, having a database with 100,000 people in it, with quite a
few rows not being real persons. It'd be great to try to use the spambayes
code on this.

The concept should be fairly common, so that one could write a
script/program in any language. What I'm picturing, at least, is writing a
script which loops over the dataset, constructs some kind of concatenated
string, and passes it as an argument to one of three
procedures/methods/scripts:

A. classify as spam
B. classify as ham
C. get_score

Fredrik

--
Fredrik Rodland
Technical Architect, Stocknet, Oslo, Norway
Stocknet: http://www.stocknet.com phone: +47 23 28 40 17
Private: http://rodland.no phone: +47 99 21 98 17

From tchur at optushome.com.au Sat Mar 29 07:22:56 2003
From: tchur at optushome.com.au (Tim Churches)
Date: Fri Mar 28 15:34:05 2003
Subject: Orange (was: [Spambayes] Non-email use of the spambayes project)
In-Reply-To:
References:
Message-ID: <1048882982.1263.23.camel@emilio>

On Fri, 2003-03-28 at 23:55, Fredrik Rodland wrote:
> > important I think is to document how to use it as I did. So much of
> > what is there now is so strongly tied to classifying email messages
> > that it's easy to lose sight of how well it can be applied to other
> > classification problems.
> Totally agree!
> also, for us who're not completely into Python, it would be great with some > sort of cookbook/skeletons/APIs available and documented. I tried to read > your original code, but gave up after a while. I have a similar situation, > having a database with 100.000 people in it, with quite a few rows not being > real persons. It'd be gresat to try to use the spambayes code on this. The Orange project, developed at the University of Ljubljana, is well worth a look. It is a Python framework and collection of modules (many of them C extension modules) for learning about data mining and machine learning techniques. It includes facilities for a number of supervised and non-supervised classification methods apart from the naive Bayes classifier, such as (quoting the Orange Web site) "classification trees, k-NN, majority classifier, support vector machines, logistic regression. Ensemble methods like boosting and bagging are also included ." It is quite well documented and now even has a GUI interface. Code is GPLed. See http://magix.fri.uni-lj.si/orange/ -- Tim C PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere or at http://members.optushome.com.au/tchur/pubkey.asc Key fingerprint = 8C22 BF76 33BA B3B5 1D5B EB37 7891 46A9 EAF9 93D0 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail/spambayes/attachments/20030329/6d31e4b5/attachment.bin From noreply at sourceforge.net Sat Mar 29 08:45:38 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Sat Mar 29 16:10:14 2003 Subject: [Spambayes] [ spambayes-Patches-711845 ] mboxtrain.py in mh mode: trivial fix Message-ID: Patches item #711845, was opened at 2003-03-29 11:45 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=711845&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jay Berkenbilt (jay_berkenbilt) Assigned to: Nobody/Anonymous (nobody) Summary: mboxtrain.py in mh mode: trivial fix Initial Comment: This patch relative to mboxtrain.py in the 2003-01-17 snapshot fixes two trivial problems in mhdir_train: files are overwritten needlessly, and the count of trained messages is not properly updated. I just took the logic from the maildir_train function and duplicated it. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=711845&group_id=61702 From francois.granger at free.fr Sun Mar 30 00:15:59 2003 From: francois.granger at free.fr (Francois Granger) Date: Sat Mar 29 18:16:06 2003 Subject: [Spambayes] Back to language issue (long) Message-ID: I got this mail (see at end) as ham. I did tested it first withe a copy and past into pop3proxy to check why. I then did a copy and pas of all but the french part. My current database is 1.1MB Total emails trained: Spam: 487 Ham: 433 Database available on request. 
See result below: full message: ============= Spam probability: 0.000751759880204 Clues: *H* 1.0 *S* 0.00150351976007 quand 0.00850661625709 quelque 0.0136778115502 chose 0.0145631067961 sous 0.0167286245353 demande 0.0180722891566 pourrait 0.0196506550218 mon 0.0208789727319 comme 0.0221227588516 fait 0.0230376824338 fa?on 0.0238095238095 maintenant 0.0238095238095 donner 0.0266272189349 m'a 0.0266272189349 sait 0.0266272189349 mais 0.0313902714872 aide 0.033396550918 aller 0.0348837209302 famille 0.0348837209302 raison 0.0348837209302 pas 0.0408933100639 j'aurais 0.0412844036697 voudrais 0.0412844036697 leur 0.0498272166206 d'abord 0.0505617977528 temps. 0.0505617977528 valeurs 0.0505617977528 suis 0.0584879191463 j'ai 0.0600818444939 actuellement 0.0652173913043 mort 0.0652173913043 no. 0.0652173913043 pourraient 0.0652173913043 quant 0.0652173913043 que 0.0720802127541 savoir 0.0775927513664 courrier 0.0834026744242 dit 0.0834026744242 peux 0.0834026744242 qui 0.0880139586679 leurs 0.0900374919101 advised 0.0918367346939 d'accord 0.0918367346939 fils 0.0918367346939 garder 0.0918367346939 pays 0.0918367346939 p?re 0.0918367346939 qu'elles 0.0918367346939 raisons 0.0918367346939 r?pondre 0.0918367346939 seraient 0.0918367346939 tells 0.0918367346939 parce 0.0980935237284 faire 0.109870293122 est 0.112227878216 peut 0.121878882484 pour 0.122393105049 beaucoup 0.123296503689 depuis 0.130659320434 ?t? 0.131386324669 dans 0.132119929488 mot 0.133327070159 les 0.144455392115 une 0.145286593037 passe 0.151233850681 puis 0.15146680803 argent 0.155172413793 changement 0.155172413793 contacter 0.155172413793 d'eux. 0.155172413793 envoyer 0.155172413793 fonds 0.155172413793 gouvernement 0.155172413793 jonas 0.155172413793 lui. 0.155172413793 lumi?re 0.155172413793 manque 0.155172413793 mexico 0.155172413793 monsieur, 0.155172413793 pourrais 0.155172413793 p?res 0.155172413793 saisir 0.155172413793 samuel 0.155172413793 tracer 0.155172413793 tu? 
0.155172413793 venir 0.155172413793 sur 0.167633574543 des 0.171795256129 f?vrier 0.175326781486 cette 0.185894857561 r?ponse 0.195556328108 par 0.203011609628 merci 0.213163172086 avec 0.216683274074 votre 0.223048790244 possible 0.223362908057 d'un 0.233611469218 content-type:text/plain 0.237053248399 que, 0.256059940913 soci?t? 0.256059940913 trace 0.256059940913 voila 0.256059940913 son 0.261468013498 details 0.753633575315 agent 0.75736859731 surprise 0.75736859731 please 0.758980156238 looking 0.760041610618 contact 0.77287714023 out 0.779791147442 dear 0.786906048266 lettre 0.794293030271 journal 0.796004632524 skip:n 10 0.799435070973 fact 0.805426676513 internet. 0.817845487146 government 0.831484524404 reasons 0.842224555182 chamber 0.844827586207 email addr:voila.fr 0.844827586207 from:addr:voila.fr 0.844827586207 l'argent 0.844827586207 s?curit? 0.844827586207 veuillez 0.844827586207 voeu 0.844827586207 paid 0.865736495188 family 0.883222448153 letter 0.890903967158 company 0.902210464433 8bit%:9 0.908163265306 ahead. 0.908163265306 assistance 0.908163265306 commerce 0.908163265306 compte 0.908163265306 d?tails 0.908163265306 forces 0.908163265306 l'aide 0.908163265306 transf?rer 0.908163265306 business 0.91145727403 country 0.913240293115 watch 0.915916543989 money 0.919669618949 expenses 0.934782608696 father 0.934782608696 african 0.949438202247 anticipated 0.949438202247 ownership 0.949438202247 funds 0.95871559633 transfer 0.95871559633 transfer. 
0.95871559633 percentage 0.96511627907 all but french: =============== Spam probability: 0.999530237124 Clues: *H* 1.05021213948e-09 *S* 0.999060475299 advised 0.0918367346939 tells 0.0918367346939 jonas 0.155172413793 mexico 0.155172413793 samuel 0.155172413793 possible 0.223362908057 content-type:text/plain 0.237053248399 trace 0.256059940913 son 0.261468013498 worked 0.276699759479 anything 0.29049020174 keeping 0.29049020174 light 0.321159980827 killed 0.332823460338 soon 0.345187733082 running 0.354319106501 trying 0.366311400136 them. 0.395181777302 know 0.398909302873 knows 0.6044824946 the 0.604592362611 subject:- 0.605156130452 want 0.606231021087 this 0.606271581031 not 0.60636980267 make 0.606842817477 start 0.607293682795 come 0.607655753716 and 0.608871670018 going 0.611497626481 can 0.614004582662 who 0.619148006288 netherlands 0.621790545969 under 0.622731249912 agree 0.630287560804 sir, 0.630287560804 for 0.631336804391 it. 0.632419432237 they 0.635641043314 security 0.63597973579 you 0.642573051538 through 0.644897220533 because 0.648506005667 send 0.650492892877 from 0.654952588668 one 0.656989004993 phone 0.657287865339 2002 0.66388718647 ways 0.670909653021 has 0.671329724104 their 0.671472909829 name 0.673196353036 address 0.675205726154 may 0.679409037042 your 0.681455529625 all 0.690605015038 more 0.694798240263 mr. 0.704335845591 leader 0.715217636184 late 0.721105048724 request 0.732178002628 subject:. 0.735614438659 details 0.753633575315 agent 0.75736859731 surprise 0.75736859731 please 0.758980156238 looking 0.760041610618 contact 0.77287714023 out 0.779791147442 dear 0.786906048266 journal 0.796004632524 skip:n 10 0.799435070973 fact 0.805426676513 internet. 0.817845487146 government 0.831484524404 reasons 0.842224555182 chamber 0.844827586207 email addr:voila.fr 0.844827586207 from:addr:voila.fr 0.844827586207 paid 0.865736495188 family 0.883222448153 letter 0.890903967158 company 0.902210464433 ahead. 
0.908163265306 assistance 0.908163265306 commerce 0.908163265306 forces 0.908163265306 business 0.91145727403 country 0.913240293115 watch 0.915916543989 money 0.919669618949 expenses 0.934782608696 father 0.934782608696 african 0.949438202247 anticipated 0.949438202247 ownership 0.949438202247 funds 0.95871559633 transfer 0.95871559633 transfer. 0.95871559633 percentage 0.96511627907 full message: ============= Return-Path: Delivered-To: online.fr-francois.granger@free.fr Received: (qmail 25473 invoked from network); 29 Mar 2003 19:22:26 -0000 Received: from smtp-out.voila.wanadooportails.com (HELO mailsmtp5.ftmms) (193.252.117.74) by mrelay4-2.free.fr with SMTP; 29 Mar 2003 19:22:26 -0000 Received: from voila.fr (10.3.7.82) by mailsmtp5.ftmms (6.7.015) id 3E6540600058A8BC; Sat, 29 Mar 2003 20:08:38 +0100 Date: Sat, 29 Mar 2003 20:08:38 +0100 Message-Id: Subject: anticipated co-operation. MIME-Version: 1.0 X-Sensitivity: 3 Content-Type: text/plain From: "ask_savimbi1" To: "ask_savimbi1" X-XaM3-API-Version: 3.2 R27 (B52-pl1) X-type: 0 X-SenderIP: 81.23.193.84 X-Spambayes-Classification: ham Dear Sir, This letter may come to you as a surprise due to fact that we have not met. I got your address from the south African chamber of commerce business journal from one of their laison officers who knows about what I am going through so he advised me to contact you for assistance. My name is Samuel savimbi son of the late unit a rebel leader Jonas Savimbi from Angola who was killed on the 22nd of February 2002 by the government forces in Mexico province. Since the death of my late father the government has being looking for ways they could seize my fathers money and so it is this light that I need your assistance in trying to keep the funds from them. 
The money is in the Netherlands with a security company who could transfer the money at my command, so i would need you to go to Holland when the need arises but you have to first reply this mail so i could give you more details as to how to go about it. Please know that all your expenses would be paid for and there is a percentage my family has worked out should you agree to help in this transfer. Sir I am presently under security watch in my country Angola and i would want you to help me in keeping the money in your account because the government can?t trace the money to your individual account in America. Please because of security reasons I would prefer not to discuss much on the Internet. I request that you send your reply to chief_teas@voila.fr so we can start with the change of ownership and also I would have to send you the password for the account so we can make the transfer as soon as possible because I am running out of time. Please contact my agent in Holland on phone no.31-61-2722388 and anything he tells you, brief me so I can give them the go ahead. Thanks for your anticipated co-operation. Regards, Mr. Samuel savimbi Cher Monsieur, Cette lettre peut venir ? vous comme surprise due au fait que nous n'avons pas rencontr?. J'ai obtenu votre adresse de la chambre de commerce sud-africaine le journal d'affaires d'un de leurs officiers de laison qui sait ce que j'interviens ainsi il m'a conseill? de vous contacter pour l'aide. Mon nom est fils de savimbi de Samuel de l'unit? en retard par Chef rebelle Jonas Savimbi d'Angola qui a ?t? tu? sur le 22?me f?vrier 2002 par le gouvernement force dans la province du Mexique. Depuis la mort de mon d?funt p?re le gouvernement a rechercher des mani?res qu'elles pourraient saisir mon argent de p?res et ainsi il est cette lumi?re que j'ai besoin de votre aide dans l'essai de garder les fonds d'eux. L'argent est en Hollandes avec une soci?t? de valeurs mobili?res qui pourrait transf?rer l'argent ? 
ma commande, ainsi j'aurais besoin de vous pour aller en Hollande quand le besoin se fait sentir mais vous devez d'abord r?pondre ce courrier ainsi je pourrais vous donner plus de d?tails quant ? la fa?on aborder lui. Veuillez savoir que toutes vos d?penses seraient pay?es pour et il y a un pourcentage que ma famille a ?tabli si vous ?tes d'accord sur l'aide dans ce transfert. Monsieur I suis actuellement sous la montre de s?curit? dans mon pays Angola et je voudrais que vous m'aidiez en maintenant l'argent dans votre compte parce que le gouvernement ne peut pas tracer l'argent ? votre compte individuel en Am?rique. Veuillez en raison des raisons de s?curit? je pr?f?rerais ne pas discuter beaucoup sur l'Internet. Je demande que vous envoyez votre r?ponse au chief_teas@voila.fr ainsi nous peut commencer par le changement de la propri?t? et ?galement je devrais vous envoyer le mot de passe pour le compte ainsi nous pouvons faire le transfert aussit?t que possible parce que je manque de temps. Veuillez entrer en contact avec mon agent en Hollande sur le No. de t?l?phone.31-61-2722388 et quelque chose il vous dit que, donnez- des instructionsmoi ainsi je peux leur donner l'avancement. Merci pour votre coop?ration pr?vue. Respect, M.. Savimbi de Samuel ------------------------------------------ Faites un voeu et puis Voila ! www.voila.fr -- Fran?ois Granger http://francois.granger.free.fr/ From tim at fourstonesExpressions.com Sat Mar 29 18:17:13 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 29 19:17:48 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: Message-ID: How interesting. I wonder if a weakness of spambayes is to include a bunch of gibberish tokens that would almost surely not be in someone's database, which would tend to drive the spamprob strongly towards unknown prob, which is .5 by default... 
(not that French is gibberish ) - TimS 3/29/2003 5:15:59 PM, Francois Granger wrote: >I got this mail (see at end) as ham. I did tested it first withe a >copy and past into pop3proxy to check why. I then did a copy and pas >of all but the french part. My current database is 1.1MB Total >emails trained: Spam: 487 Ham: 433 > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim at fourstonesExpressions.com Sat Mar 29 18:52:06 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 29 19:52:11 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: Message-ID: 3/29/2003 5:15:59 PM, Francois Granger wrote: >I got this mail (see at end) as ham. I did tested it first withe a >copy and past into pop3proxy to check why. Here's my pop3proxy test with the full mail, including the french part... Firmly spam. Spam probability: 0.92419807897 Clues: *H* 0.0145501365628 *S* 0.862946294503 assistance. 0.0412844036697 leurs 0.0505617977528 sur 0.0505617977528 les 0.0505617977528 des 0.0505617977528 courrier 0.0505617977528 chamber 0.0918367346939 tells 0.0918367346939 raison 0.0918367346939 trying 0.139001551758 chef 0.155172413793 forces 0.155172413793 ownership 0.155172413793 province 0.155172413793 force 0.205136233853 pour 0.205136233853 soon 0.21345434995 agree 0.230877558425 going 0.236121618369 individual 0.242284867767 est 0.284013639555 got 0.290993746827 22nd 0.294934298229 skip:n 10 0.30063314409 anything 0.302156320408 skip:t 20 0.303820933856 could 0.326118902066 mail 0.326308585409 knows 0.331910224838 running 0.337647867133 but 0.339140209196 should 0.340976178118 keeping 0.364093796853 no. 0.3773889689 surprise 0.3773889689 it. 
0.390905206875 thanks 0.39803669893 how 0.6002750524 there 0.605416290974 what 0.609061801031 netherlands 0.616411930908 start 0.616433461787 fact 0.621424884871 would 0.624827760755 all 0.654446720984 trace 0.655810415038 help 0.656378980481 phone 0.658303044151 mexico 0.666416791604 discuss 0.666416791604 leader 0.666416791604 send 0.668378643841 under 0.677515506493 more 0.69919514058 their 0.702904517833 much 0.70583428009 plus 0.706968820414 details 0.706968820414 one 0.707042188084 come 0.708327278837 country 0.708704976074 your 0.70918861243 skip:i 10 0.71494003345 please 0.726385151004 dear 0.726868260612 address 0.730123697827 from 0.7306605061 out 0.733744809345 business 0.735181579572 looking 0.739728588612 funds 0.743289737924 contact 0.748813284618 father 0.756245996156 advised 0.756245996156 time. 0.760504822525 because 0.772202794598 south 0.782632496655 transfer 0.782632496655 give 0.791118686042 want 0.796597204303 regards, 0.811918594754 february 0.819174677074 prefer 0.821306509832 watch 0.835226655679 samuel 0.844827586207 unit 0.844827586207 aller 0.844827586207 agent 0.844827586207 company 0.848392489438 through 0.858782955016 paid 0.883475677477 light 0.916342496387 money 0.919305321038 message-id:invalid 0.934782608696 commerce 0.934782608696 chose 0.934782608696 skip:- 40 0.935713160894 government 0.95871559633 expenses 0.96511627907 rebel 0.96511627907 assistance 0.96511627907 c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. 
From matt at mondoinfo.com Sat Mar 29 19:13:25 2003 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sat Mar 29 20:17:46 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: References: Message-ID: <1048985376.4.887@sake.mondoinfo.com>

Dear Tim,

> How interesting.  I wonder if a weakness of spambayes is to include
> a bunch of gibberish tokens that would almost surely not be in
> someone's database, which would tend to drive the spamprob strongly
> towards unknown prob, which is .5 by default...

I don't think it is. The point of ignoring all the clues but the most extreme ones is that bland or gibberish words are unlikely to be counted.

I think that the problem in this case is that Francois doesn't get much spam in French. If he did, the bland French words (which is almost all of them listed in the clues) would likely be ignored and the ones that are indicative of this sort of spam ("argent", "tué", "gouvernement", etc) would be scored correctly.

I suspect that the error is just a matter of spambayes not recognizing a sort of spam that it hasn't been trained on.

Regards,
Matt

From tim_one at email.msn.com Sat Mar 29 20:55:36 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sat Mar 29 20:56:15 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: Message-ID: [Tim Stone] > How interesting. I wonder if a weakness of spambayes is to > include a bunch of gibberish tokens that would almost surely not > be in someone's database, which would tend to drive the spamprob > strongly towards unknown prob, which is .5 by > default...
(not that French is gibberish ) - TimS That won't work: an unknown word has, as you say, spamprob 0.5 by default, and all words with spamprob in (.4, .6) are simply ignored by default. They don't affect the score at all. In Francois's case, it seems clear that he simply hasn't gotten (trained on) many French renditions of the Nigerian scam, but has gotten (trained on) significant numbers of French ham. So even vanilla French words (like quelque) have strong ham scores for him. So long as it remains true that he gets very few French Nigerian scams, they'll continue to score as ham -- but then, by supposition, they are in fact rare, so nothing to get excited about. If French renditions of this spam become common, the very low ham probs of common French words will approach 0.5 (and so common French words will become ignored), and the spamprobs of telltale French words will get much spammier, and the system will nail French spam. From tim at fourstonesExpressions.com Sat Mar 29 20:03:11 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 29 21:03:21 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: Message-ID: <42CBSR6NHGC844121QL6ZSR2X3XKIPJ.3e86505f@myst> 3/29/2003 7:55:36 PM, "Tim Peters" wrote: >That won't work: an unknown word has, as you say, spamprob 0.5 by default, >and all words with spamprob in (.4, .6) are simply ignored by default. That, I didn't know. Learn something new all the time... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. 
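The filtering Tim Peters describes above can be sketched roughly as follows. This is only an illustration: the function name `significant_clues` and its `min_strength` parameter are made up for this example and are not the Spambayes API, and 0.1 is used simply because it produces the (.4, .6) band mentioned in the thread.

```python
def significant_clues(clues, min_strength=0.1):
    """Keep only (word, spamprob) pairs at least min_strength away from 0.5.

    Hypothetical helper for illustration; not part of Spambayes itself.
    """
    return [(word, prob) for word, prob in clues
            if abs(prob - 0.5) >= min_strength]


clues = [
    ("quelque", 0.05),   # strongly hammy for this user
    ("xqzzy", 0.5),      # never-seen gibberish gets the unknown-word value
    ("the", 0.55),       # bland word, inside the ignored band
    ("funds", 0.9587),   # strongly spammy
]
print(significant_clues(clues))
```

With these made-up numbers only "quelque" and "funds" survive, so padding a message with unknown words adds nothing to the score.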
From tim_one at email.msn.com Sat Mar 29 21:46:34 2003 From: tim_one at email.msn.com (Tim Peters) Date: Sat Mar 29 21:47:49 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: <42CBSR6NHGC844121QL6ZSR2X3XKIPJ.3e86505f@myst> Message-ID:

[TimP]
> That won't work: an unknown word has, as you say, spamprob 0.5
> by default, and all words with spamprob in (.4, .6) are simply
> ignored by default.

[TimS]
> That, I didn't know. Learn something new all the time...

FYI, it's controlled by option minimum_prob_strength. You can arrange to ignore nothing by setting that to 0.0 (the default is 0.1), or to ignore everything by setting it to 0.5. Almost all testing reports said 0.1 worked better than 0.0; one report did a little better at 0.0, but, for the reason you gave, a setting of 0.0 would leave an exploitable hole in the scoring.

As is, gibberish words have no effect on scoring, but do have a subtler effect: they bloat the database size.

From tim at fourstonesExpressions.com Sat Mar 29 21:59:21 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Sat Mar 29 22:59:28 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: Message-ID:

3/29/2003 8:46:34 PM, "Tim Peters" wrote:

>but do have a subtler effect: they bloat the database size.

If I recall correctly, single occurrence words are called hapaxes, right? We've talked about aging before, but it seems like it would be clearly a good thing to age hapaxes. After a while, ALL they will do is bloat the database, which is arguably a bad thing.

c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't.
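One blunt way to act on the hapax idea is to copy the database while leaving out every token seen only once. The sketch below assumes a toy database that maps each token to a (spam_count, ham_count) pair; that layout and the name `without_hapaxes` are assumptions made for illustration and do not reflect the real Spambayes storage format. It also drops hapaxes outright rather than aging them.

```python
def without_hapaxes(db):
    """Return a copy of db lacking tokens seen exactly once in total.

    db maps token -> (spam_count, ham_count); a hypothetical layout.
    """
    return {token: counts for token, counts in db.items()
            if sum(counts) > 1}


db = {
    "viagra": (12, 0),
    "quelque": (0, 1),   # hapax: seen once, in ham
    "python": (0, 40),
    "xqzzy": (1, 0),     # hapax: seen once, in spam
}
pruned = without_hapaxes(db)
print(sorted(pruned))
```

Because the original mapping is left untouched, a token that stops being a hapax after later training would simply reappear the next time the copy is rebuilt from a full retrain.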
From skip at pobox.com Sat Mar 29 22:31:00 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat Mar 29 23:31:20 2003 Subject: [Spambayes] Back to language issue (long) In-Reply-To: References: Message-ID: <16006.29444.29137.628295@montanaro.dyndns.org>

TimP> but do have a subtler effect: they bloat the database size.

TimS> If I recall correctly, single occurrence words are called hapaxes,
TimS> right? We've talked about aging before, but it seems like it
TimS> would be clearly a good thing to age hapaxes. After a while, ALL
TimS> they will do is bloat the database, which is arguably a bad thing.

I retrain on my entire saved email collection periodically. After a full retrain, I delete all hapaxes (well, I copy the database except for the hapaxes it contains). It cuts the database size roughly in half, and if, after adding more messages, those tokens are no longer hapaxes, they will be kept after the next retrain. Seems to work for me.

Skip

From noreply at sourceforge.net Sun Mar 30 21:47:32 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 31 00:44:59 2003 Subject: [Spambayes] [ spambayes-Bugs-712480 ] Outlook 2002 (XP) installation fails Message-ID:

Bugs item #712480, was opened at 2003-03-31 05:47
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Marrero (pmarrero)
Assigned to: Mark Hammond (mhammond)
Summary: Outlook 2002 (XP) installation fails

Initial Comment:
I use office XP with the Outlook client. It appears that the registration was successful but I cannot find any menu buttons. XP clipboard does appear to have the Icons. The command line train works. Not sure where to go from here.
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702 From richard at jowsey.com Mon Mar 31 21:14:56 2003 From: richard at jowsey.com (Richard Jowsey) Date: Mon Mar 31 06:15:05 2003 Subject: [Spambayes] Latest spammer trick stymied Message-ID: <3E88AFD0.4984.1072C98C@localhost> Lately (as prophesied), there have been a number of very short spams arriving, containing only a singleton URL. My proxy's classifier was giving these an "unsure" rating -- too few clues. But, these buggers were starting to become quite annoying... So today I added a simple web-crawler, which will venture out on demand and slurp the words off any site. This little hoover is only unleashed when the number of distinct clues/words in an email is less than 150, it's heading for the "unsure" bucket, and we find an http URL in there. The entire source HTML is then whacked through the tokenizer and classified. The extra servlet processing can take a couple seconds, mostly network overhead, and really only noticeable when paying close attention to message download times, but the results are really worth it! It nails them dead. Cheers! Richard From pje at telecommunity.com Mon Mar 31 07:32:36 2003 From: pje at telecommunity.com (Phillip J. Eby) Date: Mon Mar 31 07:33:01 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: <3E88AFD0.4984.1072C98C@localhost> Message-ID: <5.1.0.14.0.20030331073006.01ebac50@mail.telecommunity.com> At 09:14 PM 3/31/03 +1000, Richard Jowsey wrote: >So today I added a simple web-crawler, which will venture out on >demand and slurp the words off any site. This little hoover is only >unleashed when the number of distinct clues/words in an email is less >than 150, it's heading for the "unsure" bucket, and we find an http >URL in there. The entire source HTML is then whacked through the >tokenizer and classified. 
Won't this just convince spammers that: 1) Their spam is "working", because "people are clicking on the link", and 2) If there's a unique ID in the URL, it will confirm that your address is live and that you're a sucker for whatever it is they mailed you. :) Of course, I also suppose it's possible that if enough people install a spam filter that works this way, the resulting "spambayes effect" might crash a few of their servers. :) From anthony at interlink.com.au Mon Mar 31 22:51:03 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Mon Mar 31 07:52:21 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: <5.1.0.14.0.20030331073006.01ebac50@mail.telecommunity.com> Message-ID: <200303311251.h2VCp4419496@localhost.localdomain> >>> "Phillip J. Eby" wrote > Won't this just convince spammers that: > > 1) Their spam is "working", because "people are clicking on the link", and So? More fool them - hopefully they'll spend more money on this useless technique, and go broke, sooner. > 2) If there's a unique ID in the URL, it will confirm that your address is > live and that you're a sucker for whatever it is they mailed you. :) I figure there's little or no point to trying to hide addresses from spammers. Unless you never ever post to a mailing list, or to anyone off-site, and you've got a non-obvious username, they're going to get your address anyway. > Of course, I also suppose it's possible that if enough people install a > spam filter that works this way, the resulting "spambayes effect" might > crash a few of their servers. :) Well, if nothing else, the useless load on their webserver helps push a little of the cost of spam back towards the spammer. Anthony -- Anthony Baxter It's never too late to have a happy childhood. 
From tim at fourstonesExpressions.com Mon Mar 31 07:42:48 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 08:43:25 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: <200303311251.h2VCp4419496@localhost.localdomain> Message-ID: 3/31/2003 6:51:03 AM, Anthony Baxter wrote: >Well, if nothing else, the useless load on their webserver helps push a >little of the cost of spam back towards the spammer. We have to be careful with this. It would be relatively simple to stymie, by simply adding two urls, the spam one, and an unrelated innocent site. Or three urls, or whatever... We definitely should NOT crawl the site, just in case it really is an innocent url. The load can crush a site, particularly if it's hosted. BUT, if we don't crawl the site, then the trick is easily stymied by simply having the page be a linked jpg with the appropriate information, or a flash, or whatever... so we're darned if we do, darned if we don't. Spambayes is superb at recognizing spam based solely upon the payload received. If these mails are slipping through, then we need to examine the clues and see why. Can you show us the clues for one of your mails that headed for unsure? At the moment, we clue url:, which is very likely to become a hapax. Perhaps a better solution is to create a token for the presence of a url... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. 
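For reference, the gate Richard described earlier in the thread (fewer than 150 distinct clues, a score heading for "unsure", and an http URL in the message) might look something like the sketch below. It performs no network access. The function name, the regular expression, and the 0.20/0.90 cutoffs that define "unsure" are assumptions for this example; the thread does not say which values his proxy uses.

```python
import re

# Matches plain http URLs only, as in Richard's description.
URL_RE = re.compile(r"http://\S+", re.IGNORECASE)


def urls_worth_fetching(body, distinct_clues, score,
                        max_clues=150, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return the http URLs to fetch, or [] when the gate is not met.

    All thresholds except max_clues=150 are assumed values.
    """
    if distinct_clues >= max_clues:
        return []                      # enough evidence already
    if not (ham_cutoff <= score < spam_cutoff):
        return []                      # already firmly ham or spam
    return URL_RE.findall(body)


print(urls_worth_fetching("Click http://example.com/x now", 12, 0.5))
```

Given the concerns raised in this message, a caller would presumably fetch only the single listed page rather than follow links from it, and a message listing several URLs shows why the result is a list rather than one address.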
From noreply at sourceforge.net Sun Mar 30 22:05:24 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 31 10:22:04 2003 Subject: [Spambayes] [ spambayes-Bugs-712480 ] Outlook 2002 (XP) installation fails Message-ID:

Bugs item #712480, was opened at 2003-03-31 17:47
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Marrero (pmarrero)
Assigned to: Mark Hammond (mhammond)
Summary: Outlook 2002 (XP) installation fails

Initial Comment:
I use office XP with the Outlook client. It appears that the registration was successful but I cannot find any menu buttons. XP clipboard does appear to have the Icons. The command line train works. Not sure where to go from here.

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-31 18:05

Message:
Logged In: YES
user_id=552329

Which version of the Outlook plugin are you using? (a) the latest CVS, (b) the 001 stand-alone installer, or (c) the 002 stand-alone installer? I know that the 001 installer has been known to have this problem (although it appeared to be fixed in 002).

----------------------------------------------------------------------

You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

From David.Vaughan at trizetto.com Mon Mar 31 08:22:31 2003 From: David.Vaughan at trizetto.com (Vaughan, David) Date: Mon Mar 31 10:27:52 2003 Subject: [Spambayes] setup Message-ID:

I finally figured out my problem. Netscape Webmail uses imap and my employer does not have that port opened up. It works at home but not here at the office. Go figure. Will Spambayes work with imap or must it be pop3?
-----Original Message----- From: Tim Stone - Four Stones Expressions [mailto:tim@fourstonesExpressions.com] Sent: Tuesday, March 18, 2003 2:37 PM To: Vaughan, David; Spambayes Subject: Re: RE: [Spambayes] setup

3/18/2003 1:29:15 PM, "Vaughan, David" wrote:

> > It's not supposed to be this hard :-)
>
> I'll keep trying but presently am unable to set up POP3. I get the
>message "Connection to server imap.mail.netcenter.com timed out." but can
>not find in the Netscape 7.02 preferences where to set the server name.

pop3proxy does not support imap servers at this time. For that matter, there isn't any imap support in spambayes at this point in time... :(

c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org

From tim at fourstonesExpressions.com Mon Mar 31 09:49:14 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 10:49:20 2003 Subject: [Spambayes] setup In-Reply-To: Message-ID: <41OKWVGBUQNYTA5POJI54PC8A9MLIF.3e88637a@myst>

3/31/2003 9:22:31 AM, "Vaughan, David" wrote:

> Will Spambayes work with imap or must it be pop3?

There currently is no imap proxy in spambayes. It is a documented feature request, but nobody has picked it up as of yet. I think the problem (certainly from my point of view) is that imap servers to test against are not nearly as common as pop3 servers.

c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't.

From dave at boost-consulting.com Mon Mar 31 11:07:52 2003 From: dave at boost-consulting.com (David Abrahams) Date: Mon Mar 31 11:08:12 2003 Subject: [Spambayes] Spambayes/procmail Message-ID:

I want to set up spambayes to work with procmail on my mail server. Does anyone have experience with that? If not, will someone please discuss it with me? I'm particularly interested in what the model for getting new spam/ham classifications to procmail might be.
My last query of 24 February went completely unanswered, which is a little discouraging. I have quite a learning curve to overcome, having no experience with procmail and little with IMAP. If someone who knows a little about SpamBayes could at least help me figure out which questions I need to answer in order to get started, that would be a big help. Thanks! -- Dave Abrahams Boost Consulting www.boost-consulting.com From tim at fourstonesExpressions.com Mon Mar 31 10:17:31 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 11:17:37 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: Message-ID: 3/31/2003 10:07:52 AM, David Abrahams wrote: > >I want to set up spambayes to work with procmail on my mail server. >Does anyone have experience with that? You should start by reading http://spambayes.sourceforge.net/applications.html. There is a link to a page called "guide to integrating hammie with your mailer" on that page that should give you some good starting points. The subject of integrating with procmail has been discussed relatively extensively in the mailing list. You might check out the archives, searching on procmail. Again, start at http://spambayes.sourceforge.net If after that you're having trouble, please be sure to drop us a line! c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From skip at pobox.com Mon Mar 31 10:27:29 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 31 11:27:41 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: References: Message-ID: <16008.27761.497235.617892@montanaro.dyndns.org> Dave> I want to set up spambayes to work with procmail on my mail Dave> server. Does anyone have experience with that? Dave, I use spambayes with procmail. The major issue is generally not one of getting messages classified, but of getting them trained. 
Here are the relevant bits out of my procmailrc file:

    PYCKSUM=$HOME/local/bin/pycksum
    HAMMIE=$HOME/local/bin/hammiefilter.py
    BAYESCUSTOMIZE=$HOME/hammie.opt

    :0 fw:hamlock
    | $HAMMIE -d $HOME/hammie.db

    :0
    * ^X-Spambayes-Classification: spam
    {
        ### this recipe gobbles items with matching body checksums (taken
        ### loosely to try and avoid obvious tricks)
        :0 W: cksum.lock
        | $PYCKSUM -v $HOME/tmp/cksum.cache

        ### spam scores come in two flavors - equal to 1.00 and less than
        ### 1.00 scores are much more likely to be real spam, so require
        ### less sifting - therefore keep them separate
        :0:
        * ^X-Spambayes-Classification: spam; 1.00
        $SPAM1

        :0:
        $SPAM
    }

    :0
    * ^X-Spambayes-Classification: unsure
    unsure

    ...

You can dispense with the PYCKSUM stuff, though I find it does delete a fair number of duplicate spams. I get email for a large number of aliases at the same address however. YMMV. I've attached the version of the script which I use. It's similar to the loosecksum.py script in the Spambayes utilities directory, but incorporates the ideas Justin Mason detailed about the SpamAssassin checksummer.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pycksum.py
Type: application/octet-stream
Size: 3099 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030331/c1cfd712/pycksum-0001.obj

From tim at fourstonesExpressions.com Mon Mar 31 10:36:21 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 11:36:26 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: <16008.27761.497235.617892@montanaro.dyndns.org> Message-ID: <71A8XU5ZVSQLC9HEEAKKIZURMNH42X.3e886e85@myst> >You can dispense with the PYCKSUM stuff, though I find it does delete a fair >number of duplicate spams. I get email for a large number of aliases at the >same address however. YMMV. I've attached the version of the script which >I use.
It's similar to the loosecksum.py script in the Spambayes utilities >directory, but incorporates the ideas Justin Mason detailed about the >SpamAssassin checksummer. Maybe you can update integration.txt with these pertinent bits? Also, perhaps check in your checksummer? c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From dave at boost-consulting.com Mon Mar 31 11:35:06 2003 From: dave at boost-consulting.com (David Abrahams) Date: Mon Mar 31 11:42:04 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: <16008.27761.497235.617892@montanaro.dyndns.org> (Skip Montanaro's message of "Mon, 31 Mar 2003 10:27:29 -0600") References: <16008.27761.497235.617892@montanaro.dyndns.org> Message-ID: Skip, thanks for replying! Skip Montanaro writes: > I use spambayes with procmail. The major issue is generally not one of > getting messages classified, but of getting them trained. I figured it would be; I think that's what I meant by "classified". I do have a folder full of accumulated spam. What has been your strategy for training? -- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Mon Mar 31 10:53:05 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 31 11:53:14 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: References: <16008.27761.497235.617892@montanaro.dyndns.org> Message-ID: <16008.29297.744394.325654@montanaro.dyndns.org> >> I use spambayes with procmail. The major issue is generally not one >> of getting messages classified, but of getting them trained. Dave> I figured it would be; I think that's what I meant by Dave> "classified". I do have a folder full of accumulated spam. What Dave> has been your strategy for training? Here's what I do. It's sensitive to my particular mail setup, so you can probably only use this as a rough guide. My mail reader is VM inside XEmacs. 
VM has a "l"abel command prefix. I added two new keys to its keymap, "h" and "s" (which were fortuitously unused) to copy messages to spam and ham folders:

    (defun copy-to-spam ()
      (interactive)
      (vm-save-message (expand-file-name "~/tmp/newspam"))
      (vm-undelete-message 1))

    (defun copy-to-nonspam ()
      (interactive)
      (vm-save-message (expand-file-name "~/tmp/newham"))
      (vm-undelete-message 1))

    (define-key vm-mode-map "ls" 'copy-to-spam)
    (define-key vm-summary-mode-map "ls" 'copy-to-spam)
    (define-key vm-mode-map "lh" 'copy-to-nonspam)
    (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)

~/tmp/new{ham,spam} are then processed using a fairly simple shell script:

    #!/bin/bash

    export BAYESCUSTOMIZE=$HOME/hammie.opt
    cd ~/tmp

    base=new
    db=hammie.db

    # touch the messages up a bit to avoid spurious "clues"
    if [ -f ${base}ham -a -f ${base}spam ] ; then
        unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}ham > ${base}ham.clean
        unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}spam > ${base}spam.clean

        # do the deed
        hammie.py -d -p $db -g ${base}ham.clean -s ${base}spam.clean

        # save the files for later retraining
        cat ${base}ham.clean >> ${base}ham.clean.save
        echo "" >> ${base}ham.clean.save
        rm ${base}ham ${base}ham.clean

        cat ${base}spam.clean >> ${base}spam.clean.save
        echo "" >> ${base}spam.clean.save
        rm ${base}spam ${base}spam.clean
    else
        echo Missing ${base}ham and/or ${base}spam files
    fi

I run the train script periodically to train on new ham and spam, then copy the resulting hammie.db file to where it's really used:

    % train
    Training ham (newham.clean):
    12
    Training spam (newspam.clean):
    29
    % cp -p hammie.db ~

This setup works fine for me, though probably won't be as attractive for people who aren't as addicted to the shell prompt as I am.

Skip

From jh at web.de Mon Mar 31 20:05:18 2003 From: jh at web.de (Juergen Hermann) Date: Mon Mar 31 13:06:06 2003 Subject: [Spambayes] Added headers and no newline Message-ID: Hi!
I did not check whether this was fixed yet, I get a lot of this

    X-Spambayes-Classification: ham
    X-Spambayes-MailId: 1049128650-2
    X-Spambayes-Spam-Probability: 8.78190881126e-009
    X-Spambayes-Evidence: '*H*': 0.00; '*S*': 0.00; 'subject:] ': 0.00; 'url:listinfo': 0.00; 'url:mailman': 0.00; 'skip:_ 40': 0.00; 'url:python': 0.00; 'email addr:python.org': 0.00; 'subject:[': 0.00; 'europython': 0.00; 'email name:europython': 0.00; 'subject:EuroPython': 0.00; 'url:europython': 0.00; 'header:Received:7': 0.00; 'idea.': 0.00; 'url:mail': 0.00; 'url:org': 0.00; 'header:Errors-To:1': 0.00; 'think': 0.00; 'good': 0.00; 'space': 0.00; 'list': 0.00; 'mailing': 0.00; 'some': 0.00; 'big': 0.00; 'will': 0.00; 'url:html': 0.00; 'url:www': 0.00; 'lives': 0.00; 'url:index': 0.00
    Open Space was a big success at PyCon. I think having some will be a good idea. The OpenSpace manifesto lives here:'
    http://www.openspaceworld.org/english/index.html

with a few weeks old spambayes. The newline before the body is missing, thus the first part of the message is not shown normally in the email client. I'll update anyway.

Ciao, Jürgen

From skip at pobox.com Mon Mar 31 12:33:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 31 13:34:00 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: References: <16008.27761.497235.617892@montanaro.dyndns.org> <16008.29297.744394.325654@montanaro.dyndns.org> Message-ID: <16008.35343.710709.566185@montanaro.dyndns.org>

Dave> Here's what I've never understood about this system: shouldn't it
Dave> be enough to label spam? GNUs gives me a key to label a message
Dave> as spam. If I collect all of those, shouldn't I be able to tell
Dave> spambayes that everything in my INBOX that's been read and isn't
Dave> in my SpamBox is ham?

I suspect you can use or adapt Neil Schemenauer's mboxtrain.py script to do what you want. I started doing things this way before that was an option though.
>> This setup works fine for me, though probably won't be as attractive >> for people who aren't as addicted to the shell prompt as I am. Dave> Well, I'm not sure I understand it yet, but I think I'll get Dave> there. Yeah, it will probably take awhile. If you fetch your email via POP you might find the pop3proxy a better fit. It provides a web-based training interface. Skip From bill at parducci.net Mon Mar 31 10:34:41 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 31 13:38:15 2003 Subject: [Spambayes] Spambayes/procmail References: <16008.27761.497235.617892@montanaro.dyndns.org> <16008.29297.744394.325654@montanaro.dyndns.org> Message-ID: <3E888A41.4090104@parducci.net> i think this speaks well to the point that training is a individual and manual proceess! :o) b Skip Montanaro wrote: > Here's what I do. It's sensitive to my particular mail setup, so you can > probably only use this as a rough guide. > > My mail reader is VM inside XEmacs. VM has a "l"abel command prefix. 
I > added two new keys to its keymap, "h" and "s" (which were fortuitously > unused) to copy messages to spam and ham folders: > > (defun copy-to-spam () > (interactive) > (vm-save-message (expand-file-name "~/tmp/newspam")) > (vm-undelete-message 1)) > > (defun copy-to-nonspam () > (interactive) > (vm-save-message (expand-file-name "~/tmp/newham")) > (vm-undelete-message 1)) > > (define-key vm-mode-map "ls" 'copy-to-spam) > (define-key vm-summary-mode-map "ls" 'copy-to-spam) > (define-key vm-mode-map "lh" 'copy-to-nonspam) > (define-key vm-summary-mode-map "lh" 'copy-to-nonspam) > > ~/tmp/new{ham,spam} are then processed using a fairly simple shell script: > > #!/bin/bash > > export BAYESCUSTOMIZE=$HOME/hammie.opt > cd ~/tmp > > base=new > db=hammie.db > > # touch the messages up a bit to avoid spurious "clues" > if [ -f ${base}ham -a -f ${base}spam ] ; then > unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}ham > ${base}ham.clean > unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}spam > ${base}spam.clean > > # do the deed > hammie.py -d -p $db -g ${base}ham.clean -s ${base}spam.clean > > # save the files for later retraining > cat ${base}ham.clean >> ${base}ham.clean.save > echo "" >> ${base}ham.clean.save > rm ${base}ham ${base}ham.clean > > cat ${base}spam.clean >> ${base}spam.clean.save > echo "" >> ${base}spam.clean.save > rm ${base}spam ${base}spam.clean > else > echo Missing ${base}ham and/or ${base}spam files > fi > > I run the train script periodically to train on new ham and spam, then copy > the resulting hammie.db file to where it's really used: > > % train > Training ham (newham.clean): > 12 > Training spam (newspam.clean): > 29 > % cp -p hammie.db ~ > > This setup works fine for me, though probably won't be as attractive for > people who aren't as addicted to the shell prompt as I am. 
> > Skip > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes From dave at boost-consulting.com Mon Mar 31 13:20:34 2003 From: dave at boost-consulting.com (David Abrahams) Date: Mon Mar 31 13:54:58 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: <16008.29297.744394.325654@montanaro.dyndns.org> (Skip Montanaro's message of "Mon, 31 Mar 2003 10:53:05 -0600") References: <16008.27761.497235.617892@montanaro.dyndns.org> <16008.29297.744394.325654@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > >> I use spambayes with procmail. The major issue is generally not one > >> of getting messages classified, but of getting them trained. > > Dave> I figured it would be; I think that's what I meant by > Dave> "classified". I do have a folder full of accumulated spam. What > Dave> has been your strategy for training? > > Here's what I do. It's sensitive to my particular mail setup, so you can > probably only use this as a rough guide. > > My mail reader is VM inside XEmacs. I'm using GNUs, FWIW. > VM has a "l"abel command prefix. I added two new keys to its > keymap, "h" and "s" (which were fortuitously unused) to copy > messages to spam and ham folders: Here's what I've never understood about this system: shouldn't it be enough to label spam? GNUs gives me a key to label a message as spam. If I collect all of those, shouldn't I be able to tell spambayes that everything in my INBOX that's been read and isn't in my SpamBox is ham? 
> (defun copy-to-spam () > (interactive) > (vm-save-message (expand-file-name "~/tmp/newspam")) > (vm-undelete-message 1)) > > (defun copy-to-nonspam () > (interactive) > (vm-save-message (expand-file-name "~/tmp/newham")) > (vm-undelete-message 1)) > > (define-key vm-mode-map "ls" 'copy-to-spam) > (define-key vm-summary-mode-map "ls" 'copy-to-spam) > (define-key vm-mode-map "lh" 'copy-to-nonspam) > (define-key vm-summary-mode-map "lh" 'copy-to-nonspam) > > ~/tmp/new{ham,spam} are then processed using a fairly simple shell script: > > #!/bin/bash > > export BAYESCUSTOMIZE=$HOME/hammie.opt > cd ~/tmp > > base=new > db=hammie.db > > # touch the messages up a bit to avoid spurious "clues" > if [ -f ${base}ham -a -f ${base}spam ] ; then > unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}ham > ${base}ham.clean > unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}spam > ${base}spam.clean > > # do the deed > hammie.py -d -p $db -g ${base}ham.clean -s ${base}spam.clean > > # save the files for later retraining > cat ${base}ham.clean >> ${base}ham.clean.save > echo "" >> ${base}ham.clean.save > rm ${base}ham ${base}ham.clean > > cat ${base}spam.clean >> ${base}spam.clean.save > echo "" >> ${base}spam.clean.save > rm ${base}spam ${base}spam.clean > else > echo Missing ${base}ham and/or ${base}spam files > fi > > I run the train script periodically to train on new ham and spam, then copy > the resulting hammie.db file to where it's really used: > > % train > Training ham (newham.clean): > 12 > Training spam (newspam.clean): > 29 > % cp -p hammie.db ~ > > This setup works fine for me, though probably won't be as attractive for > people who aren't as addicted to the shell prompt as I am. Well, I'm not sure I understand it yet, but I think I'll get there. Thanks! 
-- Dave Abrahams Boost Consulting www.boost-consulting.com From dave at boost-consulting.com Mon Mar 31 13:52:21 2003 From: dave at boost-consulting.com (David Abrahams) Date: Mon Mar 31 13:55:08 2003 Subject: [Spambayes] Spambayes/procmail In-Reply-To: <16008.35343.710709.566185@montanaro.dyndns.org> (Skip Montanaro's message of "Mon, 31 Mar 2003 12:33:51 -0600") References: <16008.27761.497235.617892@montanaro.dyndns.org> <16008.29297.744394.325654@montanaro.dyndns.org> <16008.35343.710709.566185@montanaro.dyndns.org> Message-ID: Skip Montanaro writes: > Dave> Here's what I've never understood about this system: shouldn't it > Dave> be enough to label spam? GNUs gives me a key to label a message > Dave> as spam. If I collect all of those, shouldn't I be able to tell > Dave> spambayes that everything in my INBOX that's been read and isn't > Dave> in my SpamBox is ham? > > I suspect you can use or adapt Neil Schemenauer's mboxtrain.py script to do > what you want. I started doing things this way before that was an option > though. Excellent! Thank you. > >> This setup works fine for me, though probably won't be as attractive > >> for people who aren't as addicted to the shell prompt as I am. > > Dave> Well, I'm not sure I understand it yet, but I think I'll get > Dave> there. > > Yeah, it will probably take awhile. If you fetch your email via POP you > might find the pop3proxy a better fit. It provides a web-based training > interface. Nope; I'm using IMAP. 
Thanks, -- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Mon Mar 31 15:46:36 2003 From: skip at pobox.com (Skip Montanaro) Date: Mon Mar 31 16:46:45 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: <3E893DA7.31420.20D35DB@localhost> References: <200303311251.h2VCp4419496@localhost.localdomain> <3E893DA7.31420.20D35DB@localhost> Message-ID: <16008.46908.795498.412561@montanaro.dyndns.org> >> We definitely should NOT crawl the site, just in case it really is an >> innocent url. The load can crush a site, particularly if it's >> hosted. Richard> Nah. You need to throw thousands of requests at a half-decent Richard> web server before it gives up the ghost. And if they're sending Richard> out 10 million mail pieces, they should expect their http Richard> server to take some load. These are definitely NOT innocent Richard> emails. They come from bogus senders, have minimal headers Richard> (deliberately), and contain *nothing* but a url. Which points, Richard> via redirect naturally, to an incest porn or get-a-huge-penis Richard> site, etc. You can't make that judgement beforehand. If the site you are poking is a valid site and the email received was not spam, none of what you said holds. If I remember correctly, you said this was only to be performed in circumstances where certain criteria were met, none of which included a conclusion the mail was spam. Skip From popiel at wolfskeep.com Mon Mar 31 14:15:29 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Mar 31 17:15:36 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: Message from "Richard Jowsey" of "Tue, 01 Apr 2003 07:20:07 +1000." <3E893DA7.31420.20D35DB@localhost> References: <200303311251.h2VCp4419496@localhost.localdomain> <3E893DA7.31420.20D35DB@localhost> Message-ID: <20030331221529.8A6592DDF2@cashew.wolfskeep.com> In message: <3E893DA7.31420.20D35DB@localhost> "Richard Jowsey" writes: >> We have to be careful with this. 
It would be relatively simple to >> stymie, by simply adding two urls, the spam one, and an unrelated >> innocent site. Or three urls, or whatever... > >Spammers are simple folk. They won't be putting no innocent url's in >these spams... Spammers might be simple folk, but serious crackers (not the script kiddies) certainly are not. If there comes to be a widely deployed tool with this sort of fetch-what-I-tell-you-to behaviour, then it will get exploited by people wanting to do a denial of service attack or similar. Why bother sending out your own IRC-controlled worm, when there's already remote-controllable spamfilters ready and waiting to pound a site into the ground? After all, writing (and releasing) a worm is already recognized as a crime, but the legality of just sending out a not-as-innocent-as-it-looks email blast is still in contention... - Alex From tshumway at jdiworks.net Mon Mar 31 14:40:42 2003 From: tshumway at jdiworks.net (tshumway@jdiworks.net) Date: Mon Mar 31 17:37:34 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: <16008.46908.795498.412561@montanaro.dyndns.org> References: <200303311251.h2VCp4419496@localhost.localdomain> <3E893DA7.31420.20D35DB@localhost> <16008.46908.795498.412561@montanaro.dyndns.org> Message-ID: <1049150442.3e88c3ea2d4d9@jdiworks.net> Quoting Skip Montanaro : > > >> We definitely should NOT crawl the site, just in case it really is an > >> innocent url. The load can crush a site, particularly if it's > >> hosted. > > Richard> Nah. You need to throw thousands of requests at a half-decent > Richard> web server before it gives up the ghost. And if they're sending > Richard> out 10 million mail pieces, they should expect their http > Richard> server to take some load. These are definitely NOT innocent > Richard> emails. They come from bogus senders, have minimal headers > Richard> (deliberately), and contain *nothing* but a url. Which points, > > You can't make that judgement beforehand. 
If the site you are poking is a > valid site and the email received was not spam, none of what you said holds. > If I remember correctly, you said this was only to be performed in > circumstances where certain criteria were met, none of which included a > conclusion the mail was spam. Anyone who includes a URL in a mail message will probably be prepared for some load based on the number of people receiving the message. If I send a message to a client asking him to look at a web site on a staging server, I expect a dozen or so hits, followed by a phone call. If I send a message to my family mailing list, I expect a couple hundred hits (followed by a complaint from my brother that his picture looks ugly (What can I do? 8-) ). If an evil spammer sends a URL to 50 million addresses, it might expect (hope for) a decent slashdot spike. Interpreting the results of the http request opens a new can of worms. All of the tricks we use to mangle addresses (javascript, formmail honeypots, user-agent based web-pages, funky encodings, etc.) can now be used by the spammer against us. hmmm. I think it will take a while for that to become a major problem. In a server-side deployment where the same spam is likely to reach many hosted mailboxes, a specialized proxy server might be able to reduce the perceived response rate and the wasted bandwidth. -- Terrel From tim at fourstonesExpressions.com Mon Mar 31 16:05:23 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 18:01:56 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: <16008.46908.795498.412561@montanaro.dyndns.org> Message-ID: 3/31/2003 3:46:36 PM, Skip Montanaro wrote: > > >> We definitely should NOT crawl the site, just in case it really is an > >> innocent url. The load can crush a site, particularly if it's > >> hosted. > > Richard> Nah. You need to throw thousands of requests at a half-decent > Richard> web server before it gives up the ghost. 
And if they're sending > Richard> out 10 million mail pieces, they should expect their http > Richard> server to take some load. These are definitely NOT innocent > Richard> emails. They come from bogus senders, have minimal headers > Richard> (deliberately), and contain *nothing* but a url. Which points, > Richard> via redirect naturally, to an incest porn or get-a-huge-penis > Richard> site, etc. > >You can't make that judgement beforehand. If the site you are poking is a >valid site and the email received was not spam, none of what you said holds. >If I remember correctly, you said this was only to be performed in >circumstances where certain criteria were met, none of which included a >conclusion the mail was spam. That's right. We really should try to solve this problem with tokenization. > >Skip > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim at fourstonesExpressions.com Mon Mar 31 17:04:33 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 18:04:39 2003 Subject: [Spambayes] Latest spammer trick stymied Message-ID: <1U071TPJPNNL41DC4WXS95KHOMCALKPK.3e88c981@myst> 3/31/2003 4:15:29 PM, "T. Alexander Popiel" wrote: >In message: <3E893DA7.31420.20D35DB@localhost> > "Richard Jowsey" writes: >>> We have to be careful with this. It would be relatively simple to >>> stymie, by simply adding two urls, the spam one, and an unrelated >>> innocent site. Or three urls, or whatever... >> >>Spammers are simple folk. They won't be putting no innocent url's in >>these spams... > >Spammers might be simple folk, but serious crackers (not the script >kiddies) certainly are not. If there comes to be a widely deployed >tool with this sort of fetch-what-I-tell-you-to behaviour, then it >will get exploited by people wanting to do a denial of service >attack or similar. 
Why bother sending out your own IRC-controlled >worm, when there's already remote-controllable spamfilters ready >and waiting to pound a site into the ground? After all, writing >(and releasing) a worm is already recognized as a crime, but the >legality of just sending out a not-as-innocent-as-it-looks email >blast is still in contention... EXCELLENT point, Alex. Case closed. > >- Alex > >_______________________________________________ >Spambayes mailing list >Spambayes@python.org >http://mail.python.org/mailman/listinfo/spambayes > > c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From bill at parducci.net Mon Mar 31 15:21:24 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 31 18:24:58 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION References: <200303311251.h2VCp4419496@localhost.localdomain> <3E893DA7.31420.20D35DB@localhost> <16008.46908.795498.412561@montanaro.dyndns.org> <1049150442.3e88c3ea2d4d9@jdiworks.net> Message-ID: <3E88CD74.4050405@parducci.net> currently, does spambayes treat a URL as a single token or is it parsed somehow? it would seem that if URLs were parsed you would be able to train spambayes to detect mail for odious content based on components of the link. take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj it would seem that the most accurate way to evaluate this would be to parse using '/' (starting after 'http://'). that would allow spambayes to evaluate the domain (check.mypam.com) while giving it the ability to differentiate between directories (which may map to users on ISP systems: http://user.aol.com/niceguy vs. http://user.aol.com/spammer). b From popiel at wolfskeep.com Mon Mar 31 16:06:06 2003 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Mon Mar 31 19:06:10 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION In-Reply-To: Message from bill parducci of "Mon, 31 Mar 2003 15:21:24 PST." <3E88CD74.4050405@parducci.net> References: <200303311251.h2VCp4419496@localhost.localdomain> <3E893DA7.31420.20D35DB@localhost> <16008.46908.795498.412561@montanaro.dyndns.org> <1049150442.3e88c3ea2d4d9@jdiworks.net> <3E88CD74.4050405@parducci.net> Message-ID: <20030401000606.33A3A2DDF2@cashew.wolfskeep.com>

In message: <3E88CD74.4050405@parducci.net>
bill parducci writes:
>currently, does spambayes treat a URL as a single token or is it parsed
>somehow?

URLs are parsed with the following code:

| urlsep_re = re.compile(r"[;?:@&=+,$.]")
|
| class URLStripper(Stripper):
|     def __init__(self):
|         # The empty regexp matches anything at once.
|         Stripper.__init__(self, url_re.search, re.compile("").search)
|
|     def tokenize(self, m):
|         proto, guts = m.groups()
|         tokens = ["proto:" + proto]
|         pushclue = tokens.append
|
|         # Lose the trailing punctuation for casual embedding, like:
|         #     The code is at http://mystuff.org/here?  Didn't resolve.
|         # or
|         #     I found it at http://mystuff.org/there/.  Thanks!
|         assert guts
|         while guts and guts[-1] in '.:?!/':
|             guts = guts[:-1]
|         for piece in guts.split('/'):
|             for chunk in urlsep_re.split(piece):
|                 pushclue("url:" + chunk)
|         return tokens

>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj

That example would yield the tokens:

  proto:http
  url:check
  url:myspam
  url:com
  url:ad
  url:junk
  url:random
  url:fsldkjflksj

>it would seem that the most accurate way to evaluate this would be to
>parse using '/' (starting after 'http://'). that would allow spambayes
>to evaluate the domain (check.mypam.com) while giving it the ability to
>differentiate between directories (which may map to users on ISP
>systems: http://user.aol.com/niceguy vs. http://user.aol.com/spammer).

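The quoted tokenizer depends on Stripper and url_re, neither of which appears in this message, so it cannot be run as posted. The following is a minimal standalone sketch of the same splitting logic for anyone who wants to try it on their own URLs. URL_RE is an assumed, simplified stand-in for the real url_re, and url_tokens is a hypothetical function name; only the separator pattern and the trailing-punctuation rule are taken from the code above.

```python
import re

# Assumption: simplified stand-in for SpamBayes' url_re, which is not
# shown in the message. Captures the protocol and the rest of the URL.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)

# Separator pattern copied from the quoted code.
URLSEP_RE = re.compile(r"[;?:@&=+,$.]")


def url_tokens(text):
    """Return proto:/url: tokens for every URL found in text."""
    tokens = []
    for match in URL_RE.finditer(text):
        proto, guts = match.groups()
        tokens.append("proto:" + proto.lower())
        # Lose trailing punctuation from casually embedded URLs.
        while guts and guts[-1] in ".:?!/":
            guts = guts[:-1]
        # Split on '/' first, then on the separator characters.
        for piece in guts.split("/"):
            for chunk in URLSEP_RE.split(piece):
                tokens.append("url:" + chunk)
    return tokens


if __name__ == "__main__":
    example = "http://check.myspam.com/ad/junk?random=fsldkjflksj"
    for token in url_tokens(example):
        print(token)
```

Run on the example URL, this prints the same eight tokens listed above, which makes it easy to see that the hostname and path are broken into independent pieces with no composite token tying them together.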
This already happens to some extent, though I think there could be better handling of the composite hostname and directory path... to wit, I suspect that adding the following tokens would help: url:myspam.com url:check.myspam.com url:check.myspam.com/ad url:check.myspam.com/ad/junk I haven't tested this yet, but I further suspect that I will have Tim Peters' problem: my results are already good enough that I won't be able to say anything conclusive about it. - Alex From bill at parducci.net Mon Mar 31 16:36:48 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 31 19:41:42 2003 Subject: [Spambayes] Latest spammer trick stymied References: Message-ID: <3E88DF20.9080204@parducci.net> Mark Hammond wrote: > Could you not do the same thing today, by sending out a HTML email > referencing some images from the server you want to attack? Given the > number of mail clients out there that will fetch these images (using their > mailers default settings), I would expect this to remain a far more > effective attack than the one you propose. yes, that would DoS the [http] target, but one could DoS the [mail] recipient's system by sending multiple messages linking to a site that is overloaded (or intentionally slow) so that the [blocking] 'slurp' event clogs up the mail processing flow. it's just a matter of whom you wish to annoy. :o) b From tim.one at comcast.net Mon Mar 31 19:37:16 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Mar 31 19:44:19 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION In-Reply-To: <20030401000606.33A3A2DDF2@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > This already happens to some extent, though I think there could > be better handling of the composite hostname and directory path...
> to wit, I suspect that adding the following tokens would help: > > url:myspam.com > url:check.myspam.com > url:check.myspam.com/ad > url:check.myspam.com/ad/junk > > I haven't tested this yet, but I further suspect that I will have > Tim Peters' problem: my results are already good enough that I won't > be able to say anything conclusive about it. Mining embedded URLs was the first tokenization enhancement added to the project, and it instantly cut the false negative rate in half -- that remains the single biggest win we ever got. At first, it was fancier than it is now. The scheme got simpler over time, as testing showed no significant difference in results as more gimmicks got thrown out. Note that we actually generate more tokens than meet the eye for spam like: """ X-Message-Info: JGTYoYF78jEHjJx36Oi8+Q1OJDRSDidP Received: from wildlife.com ([4.40.47.205]) by mc9-f10.bay6.hotmail.com with Microsoft SMTPSVC(5.0.2195.5600); Sun, 30 Mar 2003 23:44:18 -0800 Date: Sun, 30 Mar 2003 01:37:18 -0300 From: "Ella Schotte" To: Message-ID: <20030330013718.9ltGDlkp5jmJ@wildlife.com> Content-Type: text/plain Subject: with Daughter Return-Path: skoocea@wildlife.com X-OriginalArrivalTime: 31 Mar 2003 07:44:18.0807 (UTC) FILETIME=[56139870:01C2F759] http://jeajeeceap.lewdmother.com """ The complete list of tokens generated by the Outlook client by default for that is: 'cc:none' 'content-type:text/plain' 'from:addr:skoocea' 'from:addr:wildlife.com' 'from:name:ella schotte' 'header:Date:1' 'header:From:1' 'header:Message-ID:1' 'header:Received:1' 'header:Return-Path:1' 'header:Subject:1' 'header:To:1' 'message-id:@wildlife.com' 'noheader:abuse-reports-to' 'noheader:errors-to' 'noheader:importance' 'noheader:in-reply-to' 'noheader:mime-version' 'noheader:organization' 'noheader:reply-to' 'noheader:user-agent' 'noheader:x-abuse-info' 'noheader:x-complaints-to' 'noheader:x-face' 'proto:http' 'reply-to:none' 'sender:none' 'subject: ' 'subject:Daughter' 'subject:with' 'to:2**0' 
'to:addr:email.msn.com' 'to:addr:tim_one' 'to:no real name:2**0' 'url:com' 'url:jeajeeceap' 'url:lewdmother' 'x-mailer:none' Currently, in my home classifier, only 7 of those have spamprobs outside of (.4, .6), so 31 tokens are ignored. If "minimal headers" becomes a popular spam gimmick, that will boost the spamprobs of the assorted "noheader:xyz" and "xyz:none" tokens, to the point where they're no longer ignored. From tim at fourstonesExpressions.com Mon Mar 31 18:44:53 2003 From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions) Date: Mon Mar 31 19:45:03 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: Message-ID: 3/31/2003 6:24:36 PM, "Mark Hammond" wrote: >[Tim S again] >> EXCELLENT point, Alex. Case closed. > >I'm not sure who you are speaking for here . But yeah, fetching the >URL does seem the wrong long-term approach. I'm very impressed with the >creativity of the idea though - I see lots of these spams and did wonder WTF >we could do about it. Speaking for myself, of course... We currently do not provide a token for the *presence* of a url. I'm not sure if this would have pushed it toward spamminess or not, but it bears researching. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't. From tim.one at comcast.net Mon Mar 31 19:59:56 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Mar 31 20:01:18 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: Message-ID: [Tim Stone] > Speaking for myself, of course... > > We currently do not provide a token for the *presence* of a url. We already generate one of proto:http proto:https proto:ftp depending on what's appropriate. > I'm not sure if this would have pushed it toward spamminess or not, but it > bears researching. Look in your database for the spamprob on 'proto:http'.
My bet is that it's near neutral; it's reasonable to expect that a "found a URL" token would have the same spamprob. From bill at parducci.net Mon Mar 31 17:15:02 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 31 20:18:37 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION References: <200303311251.h2VCp4419496@localhost.localdomain> <3E893DA7.31420.20D35DB@localhost> <16008.46908.795498.412561@montanaro.dyndns.org> <1049150442.3e88c3ea2d4d9@jdiworks.net> <3E88CD74.4050405@parducci.net> <20030401000606.33A3A2DDF2@cashew.wolfskeep.com> Message-ID: <3E88E816.4060003@parducci.net> T. Alexander Popiel wrote: >>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj > That example would yield the tokens: > > proto:http > url:check > url:myspam > url:com > url:ad > url:junk > url:random > url:fsldkjflksj doesn't the degree of granularity here dilute the information? in other words, 'com' and 'junk' are extremely common, while 'myspam.com' less so and 'check.myspam.com' completely unique. since neutral tokens are ignored, words like these may not be considered, while the following most likely would be considered: > url:myspam.com > url:check.myspam.com > url:check.myspam.com/ad > url:check.myspam.com/ad/junk therefore, in the case of url parsing, it would seem that less [granularity] is more [accuracy]. b From tim.one at comcast.net Mon Mar 31 20:28:43 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Mar 31 20:40:03 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION In-Reply-To: <3E88E816.4060003@parducci.net> Message-ID: [bill parducci] > > doesn't the degree of granularity here dilute the information? in other > words, 'com' and 'junk' are extremely common, while 'myspam.com' less so > and 'check.myspam.com' completely unique. 
since neutral tokens are > ignored, words like these may not be considered, while the following > most likely would be considered: > >> url:myspam.com That's decent, but likely no better than url:myspam. >> url:check.myspam.com >> url:check.myspam.com/ad >> url:check.myspam.com/ad/junk Those are probably one-shot hapaxes (i.e., worthless, except for catching copies of the same spam). If you own a domain xyz.com, then you can make up all the ABC.xyz.com targets you like, and spammers generally do. ABC doesn't repeat often except in copies of the same spam. > therefore, in the case of url parsing, it would seem that less > [granularity] is more [accuracy]. Test and measure. From tim.one at comcast.net Mon Mar 31 20:36:28 2003 From: tim.one at comcast.net (Tim Peters) Date: Mon Mar 31 20:44:44 2003 Subject: [Spambayes] Latest spammer trick stymied In-Reply-To: Message-ID: [Mark Hammond] > ... > But yeah, fetching the URL does seem the wrong long-term approach. Hard to say. > I'm very impressed with the creativity of the idea though - I see lots of these > spams and did wonder WTF we could do about it. I suggest you wait . I saw a lot of these last week, but a lot less this week so far. As advertising, sending a single URL has got to suck: who would click on it, and why, especially after the novelty wears off? For reasons explained earlier, if this is combined with the minimal-header gimmick, positive tokens generated for the absence of assorted header lines will eventually get high spamprobs too. From bill at parducci.net Mon Mar 31 17:55:40 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 31 20:59:15 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION References: Message-ID: <3E88F19C.9040008@parducci.net> Tim Peters wrote: >>> url:check.myspam.com >>> url:check.myspam.com/ad >>> url:check.myspam.com/ad/junk > > Those are probably one-shot hapaxes (i.e., worthless, except for catching > copies of the same spam). 
If you own a domain xyz.com, then you can make up > all the ABC.xyz.com targets you like, and spammers generally do. ABC > doesn't repeat often except in copies of the same spam. empirically i am not so sure. below are links that have been arriving daily in my trolling account (each listed twice per note, one supposedly in case you are having problems with the other): http://www.nudesletter.com/schoolgirl-FEB/index.html http://www.nudesletter.com/auditions-SC/index.html http://www.nudesletter.com/8thstreet-ND/index.html http://www.nudesletter.com/multi-FEB/index.html http://www.nudesletter.com/russians-WG/index.html while the goal is the same (traffic to www.nudesletter.com), each day the url changes. there are a number of other spam threads that work similarly. >>therefore, in the case of url parsing, it would seem that less >>[granularity] is more [accuracy]. > > Test and measure. you left off 'write code' before 'test and measure'. i am still coming up to speed there so for me this will have to stay in the theoretical for the time being. b From neale at woozle.org Mon Mar 31 17:57:24 2003 From: neale at woozle.org (Neale Pickett) Date: Mon Mar 31 21:03:33 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION In-Reply-To: (Tim Peters's message of "Mon, 31 Mar 2003 19:37:16 -0500") References: Message-ID: Tim Peters writes: > The scheme got simpler over time, as testing showed no significant > difference in results as more gimmicks got thrown out. Hi gang. I'm not supposed to be working on this project anymore but I just can't help following up to this one. I see Tim answering a lot of "I've got a cool tokenizing idea" questions. So many, in fact, that I think there ought to be a FAQ on the web page somewhere, to the tune of: Q: Hey! Why don't you implement cool tokenizer trick X? I think it would really foil those spammers! A: Have you run your tokenizer trick against a set of messages to see if it actually works?
Many times what seems like a good idea turns out not to help much, and sometimes even hurts. If you have a good idea, you've run it against a batch of messages and can prove that it helps, paste the code for your technique and the proof to the mailing list. Otherwise, you will likely get a message from Tim Peters about why you need to test your idea :) Just an idea. Neale From bill at parducci.net Mon Mar 31 18:22:31 2003 From: bill at parducci.net (bill parducci) Date: Mon Mar 31 21:26:05 2003 Subject: [Spambayes] Latest spammer trick stymied - QUESTION References: <1ED4ECF91CDED24C8D012BCF2B034F13010B4430@its-xchg4.massey.ac.nz> Message-ID: <3E88F7E7.3070708@parducci.net> > Many times what seems like a good idea turns > out not to help much, and sometimes even hurts. this very thread started with such an approach [build and show] and was predominantly dismissed. this may not have an effect on the implementer's use of the modification, but i would hate to think that this would be the only 'allowable' method by which ideas can be posted. ...and sometimes someone else has tried it and it didn't help. why would you want to force people to reinvent the wheel before discussing an idea? b From noreply at sourceforge.net Mon Mar 31 18:48:48 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Mar 31 21:34:49 2003 Subject: [Spambayes] [ spambayes-Bugs-712480 ] Outlook 2002 (XP) installation fails Message-ID: Bugs item #712480, was opened at 2003-03-31 17:47 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Paul Marrero (pmarrero) Assigned to: Mark Hammond (mhammond) Summary: Outlook 2002 (XP) installation fails Initial Comment: I use office XP with the Outlook client. It appears that the registration was successful but I cannot find any menu buttons. XP clipboard does appear to have the Icons.
The command line train works. Not sure where to go from here.

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2003-04-01 14:48

Message:
Logged In: YES
user_id=552329

Actually, I get this too. I've just switched to Outlook XP, so I'm
not sure if this is the reason, or just that I'm doing a fresh
install. The log includes the following traces:

SpamAddin - Connecting to Outlook
Failed to load bayes database
Traceback (most recent call last):
  File "E:\src\spambayes\Outlook2000\manager.py", line 310, in LoadBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 118, in open_bayes
AttributeError: 'module' object has no attribute 'DBDictClassifier'
Loaded message database from 'C:\Documents and Settings\tameyer\Application Data\SpamBayes\default_message_database.db'
Either bayes database or message database is missing - creating new
pythoncom error: Failed to call the universal dispatcher

Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 165, in __init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 329, in LoadBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 378, in InitNewBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 94, in new_bayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 118, in open_bayes
exceptions.AttributeError: 'module' object has no attribute 'DBDictClassifier'

SpamAddin - Connecting to Outlook
Failed to load bayes database
Traceback (most recent call last):
  File "E:\src\spambayes\Outlook2000\manager.py", line 310, in LoadBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 118, in open_bayes
AttributeError: 'module' object has no attribute 'DBDictClassifier'
Loaded message database from 'C:\Documents and Settings\tameyer\Application Data\SpamBayes\default_message_database.db'
Either bayes database or message database is missing - creating new
pythoncom error: Failed to call the universal dispatcher

Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 165, in __init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 329, in LoadBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 378, in InitNewBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 94, in new_bayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 118, in open_bayes
exceptions.AttributeError: 'module' object has no attribute 'DBDictClassifier'

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-31 18:05

Message:
Logged In: YES
user_id=552329

Which version of the Outlook plugin are you using? (a) the latest
CVS, (b) the 001 stand-alone installer, or (c) the 002 stand-alone
installer? I know that the 001 installer has been known to have
this problem (although it appeared to be fixed in 002).
----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

From tim at fourstonesExpressions.com Mon Mar 31 19:05:03 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 23:43:53 2003
Subject: [Spambayes] Latest spammer trick stymied
In-Reply-To:
Message-ID:

3/31/2003 6:59:56 PM, Tim Peters wrote:

>
>Look in your database for the spamprob on 'proto:http'. My bet is that it's
>near neutral; it's reasonable to expect that a "found a URL" token would
>have the same spamprob.

Ok. I missed that one. Yeah, it's .56 or so. So that idea's a dumb
one. ;)

So what's your take on the slurping thing, Tim?

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
those who understand binary,
and those who don't.

From tim at fourstonesExpressions.com Mon Mar 31 22:47:10 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 23:47:21 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
In-Reply-To:
Message-ID:

3/31/2003 7:57:24 PM, Neale Pickett wrote:

>Tim Peters writes:
>
>> The scheme got simpler over time, as testing showed no significant
>> difference in results as more gimmicks got thrown out.
>
>Hi gang. I'm not supposed to be working on this project anymore but I
>just can't help following up to this one. I see Tim answering a lot of
>"I've got a cool tokenizing idea" questions. So many, in fact, that I
>think there ought to be a FAQ on the web page somewhere, to the tune of:
>

Problem there is, that it seems like the spambayes site is the last
place people look for information. ;)

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
those who understand binary,
and those who don't.
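
A minimal sketch of the URL-granularity question bill raises earlier in
this thread, run over the five nudesletter.com links he lists. Only the
'proto:http' token name comes from the thread (Tim Peters mentions it);
the url_tokens function, the 'url:domain:', 'url:host:' and 'url:path:'
token names, and the naive "last two host labels" rule are hypothetical
and are not taken from the SpamBayes tokenizer.

```python
from urllib.parse import urlsplit


def url_tokens(url, fine=False):
    """Return a set of illustrative tokens for one URL.

    Coarse mode emits the scheme plus the last two host labels.
    Fine mode additionally emits every host label and path component.
    Assumption: "last two labels" stands in for the registered domain,
    which is wrong for suffixes such as co.uk.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".") if host else []
    tokens = {"proto:" + parts.scheme.lower()}
    if labels:
        tokens.add("url:domain:" + ".".join(labels[-2:]))
    if fine:
        tokens.update("url:host:" + label for label in labels)
        tokens.update("url:path:" + piece.lower()
                      for piece in parts.path.split("/") if piece)
    return tokens


urls = [
    "http://www.nudesletter.com/schoolgirl-FEB/index.html",
    "http://www.nudesletter.com/auditions-SC/index.html",
    "http://www.nudesletter.com/8thstreet-ND/index.html",
    "http://www.nudesletter.com/multi-FEB/index.html",
    "http://www.nudesletter.com/russians-WG/index.html",
]

# Tokens shared by all five links at each granularity.
coarse_common = set.intersection(*(url_tokens(u) for u in urls))
fine_sets = [url_tokens(u, fine=True) for u in urls]
fine_common = set.intersection(*fine_sets)
fine_all = set.union(*fine_sets)

print(sorted(coarse_common))           # tokens every link repeats
print(sorted(fine_all - fine_common))  # tokens that change from day to day
```

This only shows which tokens repeat across the five links and which
change daily. Whether the coarser tokens actually classify better is the
part that would still have to be tested and measured against real ham
and spam.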