From tameyer at ihug.co.nz Mon Aug 2 06:26:33 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Aug 2 06:26:40 2004
Subject: [spambayes-dev] untested idea for calculating message lengths
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130734DBBD@its-xchg4.massey.ac.nz>
Message-ID:

Sorry about the delay - last week was very busy for me - but I've managed to
give this a go, too. Mixed results, but I'd call it a loss for me:

-> tested 4692 hams & 386 spams against 18762 hams & 1537 spams
-> tested 4695 hams & 381 spams against 18759 hams & 1542 spams
-> tested 4693 hams & 383 spams against 18761 hams & 1540 spams
-> tested 4690 hams & 384 spams against 18764 hams & 1539 spams
-> tested 4684 hams & 389 spams against 18770 hams & 1534 spams
-> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> tested 4691 hams & 384 spams against 18763 hams & 1539 spams
-> tested 4690 hams & 384 spams against 18764 hams & 1539 spams

false positive percentages
    0.000  0.000  tied
    0.021  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.064  lost  +(was 0)
    0.000  0.043  lost  +(was 0)

won   1 times
tied  2 times
lost  2 times

total unique fp went from 1 to 5 lost  +400.00%
mean fp % went from 0.00425985090522 to 0.0213192344457 lost  +400.47%

false negative percentages
    1.036  0.779  won    -24.81%
    1.050  1.299  lost   +23.71%
    0.783  1.299  lost   +65.90%
    1.823  1.042  won    -42.84%
    1.285  0.781  won    -39.22%

won   3 times
tied  0 times
lost  2 times

total unique fn went from 23 to 20 won    -13.04%
mean fn % went from 1.19553834481 to 1.03990800866 won    -13.02%

ham mean                     ham sdev
   0.09    0.05  -44.44%        1.85    1.44  -22.16%
   0.12    0.11   -8.33%        2.34    1.92  -17.95%
   0.12    0.09  -25.00%        2.06    1.67  -18.93%
   0.09    0.18 +100.00%        2.01    3.27  +62.69%
   0.04    0.16 +300.00%        0.88    3.00 +240.91%

ham mean and sdev for all runs
   0.09    0.12  +33.33%        1.89    2.38  +25.93%

spam mean                    spam sdev
  95.66   96.82   +1.21%       15.14   13.47  -11.03%
  95.73   96.88   +1.20%       15.31   13.94   -8.95%
  97.07   96.30   -0.79%       11.43   14.42  +26.16%
  95.32   95.68   +0.38%       16.78   15.08  -10.13%
  95.55   96.42   +0.91%       15.67   14.02  -10.53%

spam mean and sdev for all runs
  95.86   96.42   +0.58%       14.99   14.20   -5.27%

ham/spam mean difference: 95.77 96.30 +0.53

I'm not set up to be able to use the tte.py script like Skip did to get
lengths, but I set timcv.py to save the classifiers, and got these numbers:

token length    ham   spam
           3      0      1
           4      6     16
           5    445    201
           6   4910    277
           7   7909    441
           8   3883    271
           9   1205    227
          10    246     87
          11    131     12
          12     24      1

There's 12 times more ham than spam, of course, so I don't know that these
mean much.

With a smaller, but more balanced corpus, it's a wash (well, it gets one
fewer unsure):

-> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> tested 278 hams & 128 spams against 1113 hams & 515 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.870  6.870  tied
    3.125  3.125  tied
    7.813  7.813  tied
    3.906  3.906  tied
    5.469  5.469  tied

won   0 times
tied  5 times
lost  0 times

total unique fn went from 35 to 35 tied
mean fn % went from 5.43654580153 to 5.43654580153 tied

ham mean                     ham sdev
   0.18    0.14  -22.22%        1.77    1.40  -20.90%
   0.01    0.01   +0.00%        0.17    0.11  -35.29%
   0.01    0.01   +0.00%        0.12    0.08  -33.33%
   0.03    0.02  -33.33%        0.39    0.29  -25.64%
   0.28    0.27   -3.57%        3.37    3.27   -2.97%

ham mean and sdev for all runs
   0.10    0.09  -10.00%        1.72    1.60   -6.98%

spam mean                    spam sdev
  88.65   88.68   +0.03%       25.58   25.49   -0.35%
  89.82   89.65   -0.19%       23.25   23.49   +1.03%
  87.20   87.22   +0.02%       28.97   28.92   -0.17%
  90.75   90.79   +0.04%       23.91   23.85   -0.25%
  90.28   90.34   +0.07%       25.98   25.96   -0.08%

spam mean and sdev for all runs
  89.34   89.33   -0.01%       25.65   25.64   -0.04%

ham/spam mean difference: 89.24 89.24 -0.00

Tokens for these ones are:

token length    ham   spam
           3      1      0
           4     11      0
           5     53      2
           6    221    266
           7    297     88
           8    213     94
           9    115     50
          10    168      3
          11     16      8
          12      5      0
          13      8      0

The ratio here is about 2 ham to 1 spam, so, accounting for that, it looks
like (for this corpus, anyway), ham varies in size a lot more than spam.

=Tony Meyer

From sjoerd at acm.org Mon Aug 2 12:43:10 2004
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Mon Aug 2 12:43:18 2004
Subject: [spambayes-dev] Use of email factory function
Message-ID: <410E1ABE.5070700@acm.org>

While looking through the code in sb_imapfilter.py I noticed this
comment with the associated code:

    # Annoyingly, we can't just pass over the RFC822 message to an
    # existing message object (like self) and have it parse it. So
    # we go through the hoops of creating a new message, and then
    # copying over all its internals.
    try:
        new_msg = email.Parser.Parser().parsestr(data[self.rfc822_key])
    [...]

I was wondering, instead of having the email parser create a new
instance and then copying over the internals, could we not make use of
the factory function argument?

We could call the parser like this:

    new_msg = email.Parser.Parser(lambda x=self: x).parsestr(data[self.rfc822_key])

The parser will call the factory function passed as its first argument
to get a new instance of a Message. But we could also just use a
factory function that returns the already existing instance (lambda
x=self: x) and then not copy over the internals.

Or is this something that doesn't work in older but still supported
versions of the email package?
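
The factory-argument idea above can be sketched roughly as follows. This is
only an illustration under stated assumptions: it uses the modern Python 3
spelling (email.parser.Parser rather than email.Parser.Parser), and the class
and method names are hypothetical, not taken from sb_imapfilter.py. One
caveat: the parser calls the factory for every message object it builds,
including each subpart of a multipart message, so the sketch hands out the
existing instance only on the first call.

```python
from email.message import Message
from email.parser import Parser


class SelfFillingMessage(Message):
    """Hypothetical stand-in for a message class that parses into itself."""

    def __init__(self):
        Message.__init__(self)
        self.first_factory_call = True

    def parser_factory(self):
        # Takes no arguments, like the factory in the thread; the
        # Python 3 parser treats such a callable as an old-style factory.
        # First call: return this instance for the top-level message.
        # Later calls: return fresh Message objects for any subparts.
        if self.first_factory_call:
            self.first_factory_call = False
            return self
        return Message()

    def load(self, text):
        # The parser fills in the object returned by the factory, so the
        # headers and payload end up on self without any copying.
        return Parser(self.parser_factory).parsestr(text)
```

Calling load() on a fresh instance should return that same instance with the
headers and payload filled in; for a multipart message the subparts are
separate plain Message objects.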
-- 
Sjoerd Mullender
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 374 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040802/fcf6c816/signature.pgp

From barry at python.org Mon Aug 2 17:34:45 2004
From: barry at python.org (Barry Warsaw)
Date: Mon Aug 2 17:34:41 2004
Subject: [spambayes-dev] Use of email factory function
In-Reply-To: <410E1ABE.5070700@acm.org>
References: <410E1ABE.5070700@acm.org>
Message-ID: <1091460885.9150.39.camel@localhost>

On Mon, 2004-08-02 at 06:43, Sjoerd Mullender wrote:

> Or is this something that doesn't work in older but still supported
> versions of the email package?

IIRC, the factory has worked in every version of this library called
'email' (it may not have worked in some very early version of mimelib).

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040802/8d6aeb93/attachment.pgp

From sjoerd at acm.org Mon Aug 2 22:32:30 2004
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Mon Aug 2 22:32:49 2004
Subject: [spambayes-dev] bug in imap filter or in email package
Message-ID: <410EA4DE.5060200@acm.org>

-----BEGIN PGP SIGNED MESSAGE-----

I noticed that I had way too many Unsures so I did some investigating.
One message I looked at carefully was a pure HTML message (i.e. not a
multipart/alternative) which was encoded with base64. Ordinarily
Spambayes should decode that and tokenize the decoded message.
However, I noticed that this message had a bunch of tokens of the form
'skip:d 60': 0.01; 'skip:l 60': 0.01; 'skip:m 60': 0.03;
and no tokens that came from the decoded message.

But when I use the web interface of sb_imapfilter.py and tokenize a
locally saved copy of the message, I don't get these tokens, but
instead I get tokens which come from the decoded message.

I went through the steps of what sb_imapfilter.py does by hand and I
noticed a few things:

Message.asTokens is defined as follows:
    def asTokens(self):
        return tokenize(self.as_string())
and tokenize (which is really Tokenizer.tokenize) does this:
    def tokenize(self, obj):
        msg = self.get_message(obj)
        [...]
and finally, self.get_message (which is really get_message in
tokenizer.py) creates a Message instance of the argument string.

I have the feeling that this can be made more efficient by having
    def asTokens(self):
        return tokenize(self)
instead. get_message just returns its argument if it is a Message
instance (which self in Message.asTokens is).

But this is not the bug.

tokenize calls tokenize_body which goes through the text parts (only
one here) and calls part.get_payload(decode=True) (where part is a
Message instance as returned by msg.walk()). get_payload in
email.Message.py gets the content-transfer-encoding header, but this
(and here the bug manifests itself) returns the string 'base64\r',
i.e. with \r. Since this is not equal to any of the known encodings,
get_payload doesn't decode and just returns the base64-encoded data.

The question is, is this a bug in the email package in that it should
convert \r\n to \n, or is this a bug somewhere else in that the
message given to the email package should never have included those
\r\n?

The message instance is created with
email.Parser.Parser().parsestr(...) where the argument to parsestr is
the data as returned by the IMAP server (which of course uses \r\n
line endings).

By the way, Windows is not involved anywhere in the process, so the
\r\n aren't OS artifacts.

My Python is almost fully up-to-date, the email package is completely
up-to-date (my last cvs update was after the last change to the email
component).
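
For illustration, the caller-side alternative raised in the question above
(never handing \r\n line endings to the email package) might look like the
sketch below. It assumes modern Python 3 module names, and the function name
is hypothetical. It does not settle where the fix belongs; it only shows
normalising the line endings before parsing, so that the
Content-Transfer-Encoding value cannot pick up a trailing \r.

```python
from email.parser import Parser


def decoded_text_parts(raw):
    """Parse raw message text and return the decoded text parts.

    raw is the message as one string, for example as fetched from an
    IMAP server with CRLF line endings.
    """
    # Normalise CRLF to LF before the parser sees the data.
    normalized = raw.replace("\r\n", "\n")
    msg = Parser().parsestr(normalized)
    # walk() yields the message itself and then any subparts;
    # get_payload(decode=True) undoes base64 or quoted-printable.
    return [
        part.get_payload(decode=True)
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]
```

With a single-part base64 HTML message like the one described, the result
would be a one-element list holding the decoded HTML as bytes.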
- -- Sjoerd Mullender -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iQCVAwUBQQ6k3j7g04AjvIQpAQG65QQAiEzw2wFqnn3TnF1QrnBhaDKuiyIpXo/x 0GxyFztoX29c3us9Yost8Satf4pw2wKmSmHaj6ENkT0bRHhlf+DrqkkPDR/S4rPL DDh9nRXaVMfsRT2v4QZWOmfjeDadwJsXtV0toiTKlRQ4eT68fZkjwBePmMgw+aDv NpXJO4LQX4U= =BhHX -----END PGP SIGNATURE----- From sjoerd at acm.org Mon Aug 2 22:47:41 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Mon Aug 2 22:47:55 2004 Subject: [spambayes-dev] Use of email factory function Message-ID: <410EA86D.7090004@acm.org> I'm resending this because I didn't see it arrive on the list after 10 hours. While looking through the code in sb_imapfilter.py I noticed this comment with the associated code: # Annoyingly, we can't just pass over the RFC822 message to an # existing message object (like self) and have it parse it. So # we go through the hoops of creating a new message, and then # copying over all its internals. try: new_msg = email.Parser.Parser().parsestr(data[self.rfc822_key]) [...] I was wondering, instead of having the email parser create a new instance and then copying over the internals, could we not make use of the factory function argument? We could call the parser like this: new_msg = email.Parser.Parser(lambda x=self: x).parsestr(data[self.rfc822_key]) The parser will call the factory function passed as its first argument to get a new instance of a Message. But we could also just use a factory function that returns the already existing instance (lambda x=self: x) and then not copy over the internals. Or is this something that doesn't work in older but still supported versions of the email package? -- Sjoerd Mullender -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040802/2673763f/signature.pgp From tameyer at ihug.co.nz Tue Aug 3 02:07:38 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Aug 3 02:07:49 2004 Subject: [spambayes-dev] bug in imap filter or in email package In-Reply-To: Message-ID: > I noticed that I had way too many Unsures so I did some investigating. > One message I looked at carefully was a pure HTML message (i.e. not a > multipart/alternative) which was encoded with base64. Ordinarily > Spambayes should decode that and tokenize the decoded message. [...] > My Python is almost fully up-to-date, the email package is completely > up-to-date (my last cvs update was after the last change to the email > component). This sounds a lot like the bug with the email package that Neil Schemenauer brought up here very recently. He said that he'd brought it up with Barry, but not submitted a bug report. I'm not sure if he has yet, or not (and I haven't had a chance to look at it more), but if not, then it would probably be worth you doing this, so that Barry doesn't forget about it (and maybe it could squeeze into Python 2.4a2, if it's a really simple fix and Barry isn't too busy). > I went through the steps of what sb_imapfilter.py does by hand and I > noticed a few things: > > Message.asTokens is defined as follows: > ~ def asTokens(self): > ~ return tokenize(self.as_string()) > and tokenize (which is really Tokenizer.tokenize does this: > ~ def tokenize(self, obj): > ~ msg = self.get_message(obj) > [...] > and finally, self.get_message (which is really get_message in > tokenizer.py) creates a Message instance of the argument string. > > I have the feeling that this can be made more efficient by having > ~ def asTokens(self): > ~ return tokenize(self) > instead. 
get_message just returns its argument if it is a Message
> instance (which self in Message.asTokens is).

+1 to checking this in.

=Tony Meyer

From tameyer at ihug.co.nz Tue Aug 3 02:11:56 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Aug 3 02:12:02 2004
Subject: [spambayes-dev] Use of email factory function
In-Reply-To:
Message-ID:

> I was wondering, instead of having the email parser create a new
> instance and then copying over the internals, could we not
> make use of the factory function argument?
>
> We could call the parser like this:
>
> new_msg = email.Parser.Parser(lambda x=self:
> x).parsestr(data[self.rfc822_key])
>
> The parser will call the factory function passed as its first
> argument to get a new instance of a Message. But we could also just
> use a factory function that returns the already existing instance (lambda
> x=self: x) and then not copy over the internals.

I (think I) wrote that code, and I'm certainly hazy on many parts of the
email package and the parsers. I didn't know that the above was possible (I
suspected something like it, but couldn't figure it out).

+1 to checking this in.

(I think that setPayload in spambayes.Message could then also do this, so
I'll look into that after this has been checked in and works :)

=Tony Meyer

From nas at arctrix.com Tue Aug 3 03:27:22 2004
From: nas at arctrix.com (Neil Schemenauer)
Date: Tue Aug 3 03:27:25 2004
Subject: [spambayes-dev] bug in imap filter or in email package
In-Reply-To:
References:
Message-ID: <20040803012722.GA5502@mems-exchange.org>

On Tue, Aug 03, 2004 at 12:07:38PM +1200, Tony Meyer wrote:
> This sounds a lot like the bug with the email package that Neil Schemenauer
> brought up here very recently. He said that he'd brought it up with Barry,
> but not submitted a bug report.

Have not filed a bug yet and it's still not fixed, AFAIK.
Neil From ta-meyer at ihug.co.nz Tue Aug 3 08:07:08 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Tue Aug 3 08:07:16 2004 Subject: [spambayes-dev] Deprecated options Message-ID: Hi everyone, Given that CVS HEAD is now working towards 1.1 and we have the safe 1.0 release branch, would anyone mind if I ripped out all the options & associated code (three options, two code) that are marked as deprecated? The users have had plenty of warning, and it seems unlikely that anyone is using them anyway. The relevant options are: [Classifier] x-experimental_ham_spam_imbalance_adjustment - the code for this is gone already; it's just the option that's left. [Tokenizer] x-extract_dow [Tokenizer] x-generate_time_buckets I'm interested in up/de-grading some of the experimental options, too. I'd vote for: [Classifier] x-use_bigrams - becomes a regular option (defaulting to False?) [Tokenizer] x-fancy_url_recognition - becomes a regular option (defaulting to True?) [Tokenizer] x-pick_apart_urls - no opinion, here. [Tokenizer] x-reduce_habeas_headers & [Tokenizer] x-search_for_habeas_headers - removed. The habeas headers aren't used much, and are ofttimes spoofed, so these end up not being much use. [URLRetreiver] *. Not sure as of yet. =Tony Meyer From tameyer at ihug.co.nz Tue Aug 3 10:00:19 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Aug 3 10:00:27 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: Message-ID: [Sjoerd Mullender] > I was wondering, instead of having the email parser create a new > instance and then copying over the internals, could we not > make use of the factory function argument? > > We could call the parser like this: > > new_msg = email.Parser.Parser(lambda x=self: > x).parsestr(data[self.rfc822_key]) > > The parser will call the factory function passed as its first > argument to get a new instance of a Message. 
But we could also just > use a factory function that returns the already existing instance (lambda > x=self: x) and then not copy over the internals. [Tony Meyer] > I (think I) wrote that code, and I'm certainly hazy on many > parts of the email package and the parsers. I didn't know > that the above was possible (I suspected something like it, > but couldn't figure it out). > > +1 to checking this in. Thinking about things more (and going over sb_imapfilter code for other reasons), I'm again not sure how this works (but if it does, then still +1!). What we want is to make _ourselves_ be the message that is returned from email.Parser.Parser. I can see the above being nicer if we were copying *to* new_msg, but we are copying *from* it. I think the proper solution is to move the code that calls this outside of the message instance itself, but that involves a fair bit of work. Hopefully, though, I'm misunderstanding things :) =Tony Meyer From sjoerd at acm.org Tue Aug 3 10:15:04 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Tue Aug 3 10:15:09 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: References: Message-ID: <410F4988.1090803@acm.org> Tony Meyer wrote: > [Sjoerd Mullender] > >>I was wondering, instead of having the email parser create a new >>instance and then copying over the internals, could we not >>make use of the factory function argument? >> >>We could call the parser like this: >> >>new_msg = email.Parser.Parser(lambda x=self: >>x).parsestr(data[self.rfc822_key]) >> >>The parser will call the factory function passed as its first >>argument to get a new instance of a Message. But we could also just >>use a factory function that returns the already existing instance (lambda >>x=self: x) and then not copy over the internals. > > > [Tony Meyer] > >>I (think I) wrote that code, and I'm certainly hazy on many >>parts of the email package and the parsers. 
I didn't know
>>that the above was possible (I suspected something like it,
>>but couldn't figure it out).
>>
>>+1 to checking this in.
>
>
> Thinking about things more (and going over sb_imapfilter code for other
> reasons), I'm again not sure how this works (but if it does, then still
> +1!).
>
> What we want is to make _ourselves_ be the message that is returned from
> email.Parser.Parser. I can see the above being nicer if we were copying
> *to* new_msg, but we are copying *from* it. I think the proper solution is
> to move the code that calls this outside of the message instance itself, but
> that involves a fair bit of work. Hopefully, though, I'm misunderstanding
> things :)

I just discovered it doesn't work. It doesn't work for multipart
messages since each part is a new instance, and with my patch all
those instances would be the same instance.

If we can do something that returns self the first time it is called
and a (truly) new instance for all subsequent calls, it might work.
Something like this, perhaps:

class IMAPMessage(message.SBHeaderMessage):
    def __init__(self):
        self.first_factory_call = True
        [...]

    def factory(self):
        # first time return self, all subsequent times, return a new
        # Message instance
        if self.first_factory_call:
            self.first_factory_call = False
            return self
        return email.Message.Message()

    [...]
        try:
            new_msg = email.Parser.Parser(self.factory).parsestr(data[self.rfc822_key])
        [...]

-- 
Sjoerd Mullender
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040803/e4ea4f12/signature.pgp From sjoerd at acm.org Tue Aug 3 13:55:27 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Tue Aug 3 13:55:44 2004 Subject: [spambayes-dev] bug in imap filter or in email package In-Reply-To: References: Message-ID: <410F7D2F.5070307@acm.org> Tony Meyer wrote: >>I noticed that I had way too many Unsures so I did some investigating. >>One message I looked at carefully was a pure HTML message (i.e. not a >>multipart/alternative) which was encoded with base64. Ordinarily >>Spambayes should decode that and tokenize the decoded message. > > [...] > >>My Python is almost fully up-to-date, the email package is completely >>up-to-date (my last cvs update was after the last change to the email >>component). > > > This sounds a lot like the bug with the email package that Neil Schemenauer > brought up here very recently. He said that he'd brought it up with Barry, > but not submitted a bug report. I'm not sure if he has yet, or not (and I > haven't had a chance to look at it more), but if not, then it would probably > be worth you doing this, so that Barry doesn't forget about it (and maybe it > could squeeze into Python 2.4a2, if it's a really simple fix and Barry isn't > too busy). > > >>I went through the steps of what sb_imapfilter.py does by hand and I >>noticed a few things: >> >>Message.asTokens is defined as follows: >>~ def asTokens(self): >>~ return tokenize(self.as_string()) >>and tokenize (which is really Tokenizer.tokenize does this: >>~ def tokenize(self, obj): >>~ msg = self.get_message(obj) >> [...] >>and finally, self.get_message (which is really get_message in >>tokenizer.py) creates a Message instance of the argument string. >> >>I have the feeling that this can be made more efficient by having >>~ def asTokens(self): >>~ return tokenize(self) >>instead. 
get_message just returns its argument if it is a Message >>instance (which self in Message.asTokens is). > > > +1 to checking this in. > > =Tony Meyer > Done in revision 1.52 of spambayes/message.py. -- Sjoerd Mullender -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040803/2c2b91b4/signature.pgp From kennypitt at hotmail.com Tue Aug 3 16:13:58 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Aug 3 16:14:05 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: Message-ID: Tony Meyer wrote: > Given that CVS HEAD is now working towards 1.1 and we have the safe > 1.0 release branch, would anyone mind if I ripped out all the options > & associated code (three options, two code) that are marked as > deprecated? The users have had plenty of warning, and it seems > unlikely that anyone is using them anyway. > > The relevant options are: > > [Classifier] x-experimental_ham_spam_imbalance_adjustment - the code > for this is gone already; it's just the option that's left. > [Tokenizer] x-extract_dow > [Tokenizer] x-generate_time_buckets +1 here. > I'm interested in up/de-grading some of the experimental options, > too. I'd vote for: > > [Classifier] x-use_bigrams - becomes a regular option (defaulting to > False?) > [Tokenizer] x-fancy_url_recognition - becomes a regular option > (defaulting to True?) > [Tokenizer] x-pick_apart_urls - no opinion, here. I've been running with x-pick_apart_urls on for quite awhile and it seems to have been fairly effective, but I haven't run cross-validation to compare with and without the option. > [Tokenizer] x-reduce_habeas_headers & [Tokenizer] > x-search_for_habeas_headers - removed. The habeas headers aren't > used much, and are ofttimes spoofed, so these end up not being much > use. > [URLRetreiver] *. Not sure as of yet. 
+1 on the experimentals as well. I'm all for simplifying the options as
much as possible. In the same vein, I wonder if there are some little-used
options on the configuration pages that might be better left for manual
editing of the config file to clean up the UI a little?

-- 
Kenny Pitt

From skip at pobox.com Tue Aug 3 16:29:59 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue Aug 3 16:30:13 2004
Subject: [spambayes-dev] Deprecated options
In-Reply-To:
References:
Message-ID: <16655.41319.650116.195290@montanaro.dyndns.org>

    Tony> [Classifier] x-experimental_ham_spam_imbalance_adjustment - the code for
    Tony> this is gone already; it's just the option that's left.
    Tony> [Tokenizer] x-extract_dow
    Tony> [Tokenizer] x-generate_time_buckets

Definitely zap the above.

    Tony> I'm interested in up/de-grading some of the experimental options,
    Tony> too. I'd vote for:

    Tony> [Classifier] x-use_bigrams - becomes a regular option (defaulting to False?)

False would be best. We already have people complaining about the size of
their databases.

    Tony> [Tokenizer] x-reduce_habeas_headers & [Tokenizer]
    Tony> x-search_for_habeas_headers - removed. The habeas headers aren't used much,
    Tony> and are ofttimes spoofed, so these end up not being much use.

Are the habeas headers a dead-end in the wider world that most Spambayes
users simply don't use? If they are spoofed they should be a fairly good
spam clue. I'm not sure I'd delete them yet.

Skip

From neel at mediapulse.com Tue Aug 3 17:12:59 2004
From: neel at mediapulse.com (Michael C. Neel)
Date: Tue Aug 3 17:12:41 2004
Subject: [spambayes-dev] Thought on mass hosting spambayes....
Message-ID: <1091545979.12811.38.camel@mike.mediapulse.com>

I'm emailing the list to get ideas on how spambayes could be used in a
mass email hosting setup, as an ISP would have.

There wouldn't be much trouble in setting up separate sb databases per
email account using existing methods already found in sb code, such as
the mysql/sql option.
Filtering can occur on the smtp receive side, placing a header to be
filtered by the client (or subject line). This is the easy part.

The hard part is the training. I'll assume it's generally accepted that
for spambayes to be effective, training must be done on a per account
basis. One man's spam is another man's ham after all. So we need a
training interface to handle a mass hosting setup that is per account.

It gets worse though. IMAP folders could help here, but an ISP does not
want users saving mail on the server. Also, support needs to be low, so
I'm not sure expecting users to view source and cut and paste into a web
app will be the answer either. Most email clients support some form of
'forward as redirect', in which the message is sent again with a new
envelope.

The brainstorming here has gone down the road of some type of email
account that does training, i.e. a ham@isp.com and a spam@isp.com. The
smtp server for these accounts would require authentication to prevent
real spammers from using them, and also to tie the sender to an account
and database for training. Using the forward as redirect, the user
trains his database. I'm also thinking it would also be possible to
detect if the user did not forward as a redirect, but instead did a
normal forward, because there would be no received headers, and there
would also be no spambayes headers, so we could reject the message for
training and prevent the user from incorrectly training his account.

We would want a basic online application for the user to tune his
spambayes preferences, but again this isn't much to do and is really
just working with existing code and interfaces.

I think if spambayes had a solution for the mail server in a mass
hosting, it would get a lot of use by ISPs who constantly hear spam
complaints from users. Current commercial options are too costly and
don't provide the results spambayes can. Current open source server
solutions also don't compare with spambayes in results.
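
The forward-versus-redirect check described above could be sketched along
these lines. It is a rough illustration only: the function name is
hypothetical, the header name X-Spambayes-Classification is assumed to be
what the filtering step adds, and a real deployment would need to look at
the headers of the original message rather than any Received lines added
while delivering the training submission itself.

```python
from email.parser import Parser

# Assumption: the name of the header added when the message was filtered.
SPAMBAYES_HEADER = "X-Spambayes-Classification"


def looks_like_redirect(raw_message):
    """Return True if the submission still carries the original headers.

    A redirected message should keep the Received headers and the
    SpamBayes header from the original delivery; an ordinary forward
    wraps the text in a new message and loses both.
    """
    msg = Parser().parsestr(raw_message, headersonly=True)
    has_received = bool(msg.get_all("Received"))
    has_verdict = msg.get(SPAMBAYES_HEADER) is not None
    return has_received and has_verdict
```

A training address could then refuse any submission for which this returns
False, rather than training on the wrapper a normal forward produces.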
I'm interested in hearing back thoughts, and open to other interface ideas. I'm ready to start coding away on the solution, but I want to be sure in the solution before I start =). Mike __________________________________ michael.neel@mediapulse.com vice president of information systems 865.675.4455 x30 800.380.4514 www.mediapulse.com __________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040803/fccba265/attachment.htm From matt at mondoinfo.com Tue Aug 3 21:04:58 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Tue Aug 3 21:05:06 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: References: Message-ID: <1091559456.68.1196@mint-julep.mondoinfo.com> [Tony Meyer] >> [Tokenizer] x-pick_apart_urls - no opinion, here. [Kenny Pitt] > I've been running with x-pick_apart_urls on for quite awhile and it > seems to have been fairly effective, but I haven't run > cross-validation to compare with and without the option. Cross-validation showed that x-pick_apart_urls wasn't particularly effective on my mail. But creating synthetic tokens for the IP address of a URL's host part is effective for me and that's a small hack on top the code that implements x-pick_apart_urls. I expect that some other useful things could be done based on the code as well. Regards, Matt From kennypitt at hotmail.com Tue Aug 3 22:58:14 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Aug 3 22:58:24 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: <1091559456.68.1196@mint-julep.mondoinfo.com> Message-ID: Matthew Dixon Cowles wrote: > Cross-validation showed that x-pick_apart_urls wasn't particularly > effective on my mail. But creating synthetic tokens for the IP > address of a URL's host part is effective for me and that's a small > hack on top the code that implements x-pick_apart_urls. 
I assume you're doing a DNS lookup on the hostname in the url, then? The potential problem I see with that is that the IP address can change if you lookup the same hostname again. The system could be switched to a different IP, or they might be using a round-robin DNS for load balancing. SpamBayes training relies on the tokenizer generating the exact same set of tokens every time a message is parsed. If you make a mistake in training and later correct that mistake, SpamBayes needs to remove the trained tokens from the incorrect corpus and it does that by tokenizing the message again. If you introduce dynamic information into the token stream, you might end up trying to remove a different set of tokens than what was originally added, which can potentially corrupt your training database. -- Kenny Pitt From skip at pobox.com Tue Aug 3 23:49:06 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Aug 3 23:49:30 2004 Subject: [spambayes-dev] Generating SB tokens based upon information on the net In-Reply-To: References: <20040730163221.GA24953@mems-exchange.org> Message-ID: <16656.2130.446565.493957@montanaro.dyndns.org> (adding the spambayes-dev mailing list to the cc Brad> If nothing else, then *PLEASE* configure SpamBayes to do these Brad> same sorts of things internally, so that we can then score on Brad> these issues, as opposed to rejecting the messages outright. One of the downfalls of many systems that operate deep into the email toolchain is that they try to do lookups on the net of some sort. With Spambayes we have tried to not go down that path and only use information in the message itself. Many's the time I had to stop SpamAssassin because its razor lookups hung. You never know when the network is going to flake out on you. If your training indicates that "no reverse DNS" is a strongly spammy clue, I think you should make darn sure you can check that when you are scoring messages. 
Brad> I don't think you realize the potential scale of problem that this Brad> could cause for us. Certainly no worse than having our process table fill up with smtpd_proxy processes awaiting a DNS response that ain't gonna happen. Skip From matt at mondoinfo.com Wed Aug 4 00:17:16 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Wed Aug 4 00:17:22 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: References: <1091559456.68.1196@mint-julep.mondoinfo.com> Message-ID: <1091567908.21.1196@mint-julep.mondoinfo.com> [me] >> But creating synthetic tokens for the IP address of a URL's host >> part is effective for me and that's a small >> hack on top of the code that implements x-pick_apart_urls. [Kenny Pitt] > I assume you're doing a DNS lookup on the hostname in the url, then? Yes, exactly so. > The potential problem I see with that is that the IP address can > change if you lookup the same hostname again. The system could be > switched to a different IP, or they might be using a round-robin > DNS for load balancing. The code creates a token (actually multiple tokens) for each IP, so the round-robin aspect doesn't apply, but everything else you say is quite correct. > SpamBayes training relies on the tokenizer generating the exact > same set of tokens every time a message is parsed. If you make a > mistake in training and later correct that mistake, SpamBayes needs > to remove the trained tokens from the incorrect corpus and it does > that by tokenizing the message again. If you introduce dynamic > information into the token stream, you might end up trying to > remove a different set of tokens than what was originally added, > which can potentially corrupt your training database. I agree. Nevertheless, I'm glad to use it because it works. Between mine_received_headers and using tokens from the URLs' IPs, even pure word-salad rarely gets through.
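The corruption Kenny describes can be illustrated with a toy model. This is only a sketch and not SpamBayes code: ToyClassifier, its learn_spam/unlearn_spam methods, the "url-ip:" token prefix and the IP addresses are all hypothetical. A message is trained as spam while its URL host resolves to one address, then untrained after the host has moved.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-count database (not the SpamBayes API)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.nspam = 0

    def learn_spam(self, tokens):
        self.nspam += 1
        for tok in set(tokens):
            self.spam_counts[tok] += 1

    def unlearn_spam(self, tokens):
        self.nspam -= 1
        for tok in set(tokens):
            self.spam_counts[tok] -= 1
            if self.spam_counts[tok] == 0:
                del self.spam_counts[tok]


def tokenize(body_words, resolved_ip):
    # Static tokens come from the message itself; the IP token depends on DNS.
    return list(body_words) + ["url-ip:" + resolved_ip]


clf = ToyClassifier()
words = ["cheap", "pills"]

# Trained by mistake as spam while the host resolved to 192.0.2.10.
clf.learn_spam(tokenize(words, "192.0.2.10"))

# Corrected later, after the host moved to 192.0.2.99.
clf.unlearn_spam(tokenize(words, "192.0.2.99"))

# The message count is back to zero, but two token counts are now wrong.
leftover = dict(clf.spam_counts)
print(clf.nspam, leftover)
```

The static tokens cancel out cleanly; only the DNS-derived token leaves a stale count behind and a negative count for a token that was never trained.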
Regards, Matt From tameyer at ihug.co.nz Wed Aug 4 10:39:25 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Aug 4 10:39:31 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: Message-ID: > If we can do something that returns self the first time it is > called and a (truly) new instance for all subsequent calls, it > might work. Something like this, perhaps: [...] I've just checked in lots of changes (and some tests!) to sb_imapfilter. You're running from CVS, yes? If so, if you could see if they break it for you (don't for me), that would be great. There are lots of changes, so running on somewhere safe would be good... I have made a change to fix this problem, along the lines (but not quite the same) as what you suggested. I've also changed from always using our own ID to using the Message-ID header when possible (but it should still pick up the old ones and not retrain/filter them). Interested to know what you think... :) =Tony Meyer From tameyer at ihug.co.nz Wed Aug 4 10:46:45 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Aug 4 10:46:51 2004 Subject: [spambayes-dev] Thought on mass hosting spambayes.... In-Reply-To: Message-ID: > There wouldn't be much trouble in setting up separate > sb databases per email account using existing methods > already found in sb code, such as the mysql/sql option. Note that while easy, this will take a chunk of space. > Most email clients support some form of 'forward as redirect', Some don't do this well, but yes most do. I did some testing of this a while back - the results are in smtpproxy.py in the source archive (in comments at the top of the file). I found that the best system was to get the forwarded message to include an id (originally in the headers is best) and then use that id to look up the message in a (reasonably short-lived) cache. That way whatever mucking about with the message the mailer does, it still gets correctly trained. YMMV.
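The id-plus-cache approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the smtpproxy.py implementation: MessageCache, its methods, the expiry time and the injectable clock are all hypothetical names chosen for the example.

```python
import time


class MessageCache:
    """Hypothetical short-lived cache mapping a message id to the original text."""

    def __init__(self, max_age_seconds=7 * 24 * 3600, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock
        self._store = {}  # id -> (time stored, original message text)

    def add(self, msg_id, original_text):
        self._store[msg_id] = (self.clock(), original_text)

    def get(self, msg_id):
        entry = self._store.get(msg_id)
        if entry is None:
            return None
        stored_at, text = entry
        if self.clock() - stored_at > self.max_age:
            del self._store[msg_id]  # expired, so forget it
            return None
        return text


# A fake clock makes the expiry behaviour easy to see.
now = [1000.0]
cache = MessageCache(max_age_seconds=60, clock=lambda: now[0])
cache.add("id-1", "original message as first seen")

hit = cache.get("id-1")   # still fresh: returns the pristine text
now[0] += 61
miss = cache.get("id-1")  # older than max_age: returns None
```

Training on the cached original rather than on the forwarded copy is what keeps the result independent of whatever the mail client did to the message on the way back.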
> in which the message is sent again with a new envelope. > The brainstorming here has gone down the road of some > type of email account that does training, i.e. a > ham@isp.com and a spam@isp.com. This is the way that the SpamBayes SMTP proxy works, except that it runs locally (by default only accepting local connections) so there isn't any authentication worry. You could probably use it as a basis for this solution (i.e. put it between the user and the SMTP server on the ISP's machine). You'd need to catch the AUTH command to do the authentication, but that shouldn't be tricky. =Tony Meyer From neel at mediapulse.com Wed Aug 4 15:50:27 2004 From: neel at mediapulse.com (Michael C. Neel) Date: Wed Aug 4 15:49:56 2004 Subject: [spambayes-dev] Thought on mass hosting spambayes.... In-Reply-To: References: Message-ID: <1091627427.27910.30.camel@mike.mediapulse.com> On Wed, 2004-08-04 at 04:46, Tony Meyer wrote: > > There wouldn't be much trouble in setting up separate > > sb databases per email account using existing methods > > already found in sb code, such as the mysql/sql option. > > Note that while easy, this will take a chunk of space. > Though, compared to the spam build up it's pretty small. This is really a pricing issue though, the marketing guys will need to make sure the price point allows us to add in servers for horsepower and space as we get more accounts setup. In other words, not my problem =p. > > Most email clients support some form of 'forward as redirect', > > Some don't do this well, but yes most do. I did some testing of this a > while back - the results are in smtpproxy.py in the source archive (in > comments at the top of the file). > Yes, looking at Outlook Express last night, I didn't even find the option. Really, if I can find a way to get OE and O2K/O3K to send the message with headers, I feel pretty good that other clients will be no trouble.
> This is the way that the SpamBayes SMTP proxy works, except that it runs > locally (by default only accepting local connections) so there isn't any > authentication worry. You could probably use it as a basis for this > solution (i.e. put it between the user and the SMTP server on the ISP's > machine). You'd need to catch the AUTH command to do the authentication, > but that shouldn't be tricky. > Thanks, more code to look at will help as always =) Mike From brad.knowles at skynet.be Wed Aug 4 17:21:58 2004 From: brad.knowles at skynet.be (Brad Knowles) Date: Wed Aug 4 17:32:33 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: <16656.2130.446565.493957@montanaro.dyndns.org> References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> Message-ID: At 4:49 PM -0500 2004-08-03, Skip Montanaro wrote: > One of the downfalls of many systems that operate deep into the email > toolchain is that they try to do lookups on the net of some sort. With > Spambayes we have tried to not go down that path and only use information in > the message itself. Many's the time I had to stop SpamAssassin because its > razor lookups hung. You never know when the network is going to flake out > on you. If your training indicates that "no reverse DNS" is a strongly > spammy clue, I think you should make darn sure you can check that when you > are scoring messages. In the case of reverse DNS, all that work will already have been done by the system before you ever get the message. All MTAs I know of automatically do reverse DNS lookups the moment a client connects, regardless of whether or not they actually attempt to use that information to control access. If nothing else, they need this information to put into the "Received:" headers that they're going to add to the message as it passes through. 
I don't know how easy it would be to configure postfix to pull this out and hand it to you on the command-line or otherwise outside of the context of the message itself, but that would probably be possible. Or, you could just parse the content of the appropriate headers that we just added. > Certainly no worse than having our process table fill up smtpd_proxy > processes awaiting a DNS response that ain't gonna happen. We've got that no matter what. If DNS goes down, we're toast, period. The kinds of things I had configured is no additional exposure with respect to that issue. Indeed, all MTAs I know of are toast if DNS ever goes down, at least in their default configurations. If you know what you're doing, you can configure them to disable all attempts to use the DNS, but that's normally only useful in dial-up UUCP-style connections. Otherwise, this greatly reduces the scope of what you can do with the information you have available to you, and really ties the hands of the mail server administrator. If we're not doing DNS blacklist lookups within SpamBayes, then I think we need to seriously look at adding that capability in some other fashion. My experience has been that these are some of the most important information sources you can have available to you when attempting to score a message for spam probability. -- Brad Knowles, "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin (1706-1790), reply of the Pennsylvania Assembly to the Governor, November 11, 1755 SAGE member since 1995. See for more info. From popiel at wolfskeep.com Wed Aug 4 18:30:34 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Aug 4 18:30:43 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: Message from Brad Knowles of "Wed, 04 Aug 2004 19:21:58 +0400." 
References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> Message-ID: <20040804163034.0DDFE2E070@cashew.wolfskeep.com> In message: Brad Knowles writes: > > In the case of reverse DNS, all that work will already have been >done by the system before you ever get the message. All MTAs I know >of automatically do reverse DNS lookups the moment a client connects, >regardless of whether or not they actually attempt to use that >information to control access. If nothing else, they need this >information to put into the "Received:" headers that they're going to >add to the message as it passes through. Actually, the Received header info can come from the HELO or EHLO command that opened the conversation, not DNS. I haven't looked to see if any MTAs actually do it that way, but it's the way I would do it if I were writing one... (And sure, that means a rogue could lie about identification in the HELO... but that's why both the name and the IP appear in the Received line.) >Or, you could just parse the content of the appropriate headers that >we just added. I believe that's the point of the mine_received_headers option. > We've got that no matter what. If DNS goes down, we're toast, >period. The kinds of things I had configured is no additional >exposure with respect to that issue. > > Indeed, all MTAs I know of are toast if DNS ever goes down, at >least in their default configurations. Outbound, certainly... but not for inbound. - Alex From kennypitt at hotmail.com Wed Aug 4 18:45:07 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Aug 4 18:45:26 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information onthe net In-Reply-To: Message-ID: Brad Knowles wrote: > If we're not doing DNS blacklist lookups within SpamBayes, then I > think we need to seriously look at adding that capability in some > other fashion. 
My experience has been that these are some of the > most important information sources you can have available to you when > attempting to score a message for spam probability. I wrote a patch a while back (never submitted to SourceForge) that would query a list of DNS blacklists and insert the results as tokens. In cross-validation testing, I found that the results had virtually no effect on the accuracy of the classifier, probably because one or two DNSBL tokens weren't enough to override the effects of all the other tokens from the message itself. It also resulted in a *huge* increase in the time required for SpamBayes to classify a message. As I mentioned in another recent post, any dynamic tokens like this can also cause problems in the SpamBayes training. Most DNSBL's have an aging feature so that mailhosts will be removed from the blacklist if no spam has been received or reported from them in a certain time period. If I query a DNSBL for a particular host tomorrow, I might get a different result than I got today. This is especially problematic for anyone using a train-on-everything strategy. If SpamBayes identifies a message incorrectly today and automatically trains on it, but I don't get around to reviewing and correcting the training until tomorrow, I could end up trying to remove the wrong set of tokens from the incorrect training corpus and thus corrupting my training database. -- Kenny Pitt From skip at pobox.com Wed Aug 4 19:20:35 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Aug 4 19:21:04 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> Message-ID: <16657.6883.902560.801458@montanaro.dyndns.org> Brad> In the case of reverse DNS, all that work will already have been Brad> done by the system before you ever get the message. Apologies, my bad. 
Brad> Or, you could just parse the content of the appropriate headers Brad> that we just added. Spambayes will root around in the Received: headers if you ask it to. It generates all sorts of tokens based on fragments of IP addresses and hostnames it finds. Perhaps it's already doing what you wanted and I failed to make the connection in my original note. Brad> If we're not doing DNS blacklist lookups within SpamBayes, then I Brad> think we need to seriously look at adding that capability in some Brad> other fashion. My experience has been that these are some of the Brad> most important information sources you can have available to you Brad> when attempting to score a message for spam probability. There's no need to do blacklisting as far as I'm concerned. Spambayes already mines content from the sender fields, so (for example), mail purporting to come from "billr@smart.net" generates tokens which are highly spammy. That effectively serves the same purpose but doesn't have the bad property of blacklists or whitelists - that they ignore everything else the software is trying to tell you about the content of the message. Skip From brad.knowles at skynet.be Wed Aug 4 20:39:00 2004 From: brad.knowles at skynet.be (Brad Knowles) Date: Wed Aug 4 21:29:00 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: <20040804163034.0DDFE2E070@cashew.wolfskeep.com> References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> <20040804163034.0DDFE2E070@cashew.wolfskeep.com> Message-ID: At 9:30 AM -0700 2004-08-04, T. Alexander Popiel wrote: > Actually, the Received header info can come from the HELO or EHLO command > that opened the conversation, not DNS. Depends on the MTA configuration. I believe that all current versions of major MTAs will currently put both claimed EHLO/HELO information into the "Received:" headers, as well as what comes from reverse DNS -- by default. 
>> Indeed, all MTAs I know of are toast if DNS ever goes down, at >>least in their default configurations. > > Outbound, certainly... but not for inbound. For both inbound and outbound, by default. -- Brad Knowles, "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin (1706-1790), reply of the Pennsylvania Assembly to the Governor, November 11, 1755 SAGE member since 1995. See for more info. From brad.knowles at skynet.be Wed Aug 4 20:54:22 2004 From: brad.knowles at skynet.be (Brad Knowles) Date: Wed Aug 4 21:29:11 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: <16657.6883.902560.801458@montanaro.dyndns.org> References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> <16657.6883.902560.801458@montanaro.dyndns.org> Message-ID: At 12:20 PM -0500 2004-08-04, Skip Montanaro wrote: > Spambayes will root around in the Received: headers if you ask it to. It > generates all sorts of tokens based on fragments of IP addresses and > hostnames it finds. Perhaps it's already doing what you wanted and I failed > to make the connection in my original note. I don't know what it's doing. I know nothing about the SpamBayes configuration on bag. I am stating that this is a capability that I believe we need. > There's no need to do blacklisting as far as I'm concerned. Spambayes > already mines content from the sender fields, so (for example), mail > purporting to come from "billr@smart.net" generates tokens which are highly > spammy. That effectively serves the same purpose but doesn't have the bad > property of blacklists or whitelists - that they ignore everything else the > software is trying to tell you about the content of the message. The content of the message is important, yes. But you're throwing away all the envelope information which can also be very instructive. 
At the very least, all the envelope information should be incorporated into whatever system you're using to score the message. -- Brad Knowles, "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin (1706-1790), reply of the Pennsylvania Assembly to the Governor, November 11, 1755 SAGE member since 1995. See for more info. From brad.knowles at skynet.be Wed Aug 4 20:51:37 2004 From: brad.knowles at skynet.be (Brad Knowles) Date: Wed Aug 4 21:30:28 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information onthe net In-Reply-To: References: Message-ID: At 12:45 PM -0400 2004-08-04, Kenny Pitt wrote: > In > cross-validation testing, I found that the results had virtually no effect > on the accuracy of the classifier, probably because one or two DNSBL tokens > weren't enough to override the effects of all the other tokens from the > message itself. When I have used DNSBLs in the past, I have used more than just one or two. I typically use a dozen or two. That should generate enough additional information to have a significant impact. > It also resulted in a *huge* increase in the time required > for SpamBayes to classify a message. The DNS lookup time can be significant. That's true. That's part of why you want to mirror all the DNSBLs that you use so that you can query them locally, as opposed to having to go across the Internet to get that information. It takes coordination to set this up, but all the major DNSBL providers make these sorts of arrangements as a matter of course, and I'm sure that we wouldn't have any problems. I've done this plenty of times before. As for how much additional work is required to process the additional information, I do not know. > Most DNSBL's have an aging > feature so that mailhosts will be removed from the blacklist if no spam has > been received or reported from them in a certain time period. Yup. 
> If I query a > DNSBL for a particular host tomorrow, I might get a different result than I > got today. Indeed. > This is especially problematic for anyone using a > train-on-everything strategy. If SpamBayes identifies a message incorrectly > today and automatically trains on it, but I don't get around to reviewing > and correcting the training until tomorrow, I could end up trying to remove > the wrong set of tokens from the incorrect training corpus and thus > corrupting my training database. Ahh. Well, the aging issue is not likely to be a problem unless you have waited a very significant amount of time between gathering the data and trying to process it -- most servers that are used for spam continue to be used for spam for quite some time. You might have problems the other way, however -- servers that were clean at the time, and which you have saved in your "misclassified" folder, may now be on a black list by the time you try to process the information. -- Brad Knowles, "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin (1706-1790), reply of the Pennsylvania Assembly to the Governor, November 11, 1755 SAGE member since 1995. See for more info. From brad.knowles at skynet.be Wed Aug 4 21:46:17 2004 From: brad.knowles at skynet.be (Brad Knowles) Date: Wed Aug 4 21:48:11 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information onthe net In-Reply-To: References: Message-ID: At 10:51 PM +0400 2004-08-04, Brad Knowles wrote: Damn. Thought I'd removed roto-rooters from the previous message, per Barry's request. I'm really sorry about that. ;( >> This is especially problematic for anyone using a >> train-on-everything strategy. 
If SpamBayes identifies a message incorrectly >> today and automatically trains on it, but I don't get around to reviewing >> and correcting the training until tomorrow, I could end up trying to remove >> the wrong set of tokens from the incorrect training corpus and thus >> corrupting my training database. > > Ahh. Well, the aging issue is not likely to be a problem unless you > have waited a very significant amount of time between gathering the data > and trying to process it -- most servers that are used for spam continue > to be used for spam for quite some time. BTW, even if there is an aging issue to be concerned about at training time, this doesn't mean that we should just throw away all the envelope information at message injection time. This is important stuff, and looking this up on various black lists can give us a good indication as to whether or not the message content is spam. This is not perfect, but then no method is. Throwing away the envelope information (and information based on the envelope information, as obtained from external sources), is cutting off your nose to spite your face. If there is an aging issue on the training side, then I'd encourage you to find ways to encode that additional envelope and envelope-derived information into a more stable form (perhaps added to certain headers of the message), so that you can train on them at a time of your choosing, and without any additional dependence on the DNS. But this is a SpamBayes-specific issue that needs to be resolved internally. -- Brad Knowles, "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin (1706-1790), reply of the Pennsylvania Assembly to the Governor, November 11, 1755 SAGE member since 1995. See for more info.
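Brad's suggestion of encoding lookup results "into a more stable form (perhaps added to certain headers of the message)" might look something like the sketch below. Everything named here is an assumption made for illustration: the X-DNSBL-Result header, the zone names, the "dnsbl:" token prefix and both helper functions are invented, and nothing like this is claimed to exist in SpamBayes or Postfix. The point is only that tokens derived from a stored header are the same at training time as at delivery time.

```python
from email import message_from_string

# Assumed header name, made up for this example.
HYPOTHETICAL_HEADER = "X-DNSBL-Result"


def stamp_lookup_results(msg, results):
    """Record lookup results once, at delivery time. `results` maps zone -> bool."""
    for zone in sorted(results):
        verdict = "listed" if results[zone] else "clean"
        # Assigning a header on a Message appends; it does not replace.
        msg[HYPOTHETICAL_HEADER] = "%s %s" % (zone, verdict)


def tokens_from_stamp(msg):
    """Derive tokens from the stored header only; no network access needed."""
    for value in msg.get_all(HYPOTHETICAL_HEADER, []):
        zone, verdict = value.split()
        yield "dnsbl:%s:%s" % (zone, verdict)


raw = "From: a@example.com\nSubject: hi\n\nbody\n"
msg = message_from_string(raw)
# The lookup results are supplied by hand here; a real system would query DNS once.
stamp_lookup_results(msg, {"bl.example.org": True, "other.example.net": False})

stored = msg.as_string()
first = list(tokens_from_stamp(msg))
# Re-parsing the stored text later gives the same tokens, whatever DNS says by then.
later = list(tokens_from_stamp(message_from_string(stored)))
```

This addresses the untraining problem only; it does nothing about the lookup latency discussed elsewhere in the thread.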
From skip at pobox.com Wed Aug 4 22:00:43 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Aug 4 22:01:09 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> <16657.6883.902560.801458@montanaro.dyndns.org> Message-ID: <16657.16491.569373.875427@montanaro.dyndns.org> Brad> I don't know what it's doing. I know nothing about the SpamBayes Brad> configuration on bag. Aside from the way it's plugged into bag's Postfix setup, the Spambayes proxy running on bag.python.org is no different than any other Spambayes installation. The classification and scoring pieces of the system are no different than what you'd find in any other Spambayes application (sb_server.py, sb_filter.py, etc). In particular, if the mine_received_headers option is True, the tokenizer will spew out all sorts of interesting tokens based on IP addresses and hostnames like received:83.69 received:83.69.163 received:83.69.163.110 and received:grp.scd.yahoo.com received:mail.sc5.yahoo.com received:mail.yahoo.com received:n16.grp.scd.yahoo.com received:n36.grp.scd.yahoo.com received:n39.grp.scd.yahoo.com received:n53.grp.scd.yahoo.com received:n54.grp.scd.yahoo.com received:sc5.yahoo.com received:scd.yahoo.com received:smtp805.mail.sc5.yahoo.com received:web21506.mail.yahoo.com received:web50510.mail.yahoo.com received:web60101.mail.yahoo.com received:web60909.mail.yahoo.com received:web61208.mail.yahoo.com received:yahoo.com It will grub around in many other mail headers as well. Like many other pieces of software, the code is the best documentation you're going to find. You needn't read and understand it all, however.
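The shape of the received: tokens listed above (successively longer IP prefixes, successively longer hostname suffixes) can be reproduced with a few lines. This is a sketch written from that list alone, not a copy of tokenizer.py: the function names and the min_parts cut-off of two components are assumptions, and the real tokenizer may emit more or fewer fragments.

```python
def ip_prefix_tokens(ip, min_parts=2):
    """Tokens for the leading 2, 3, ... dotted components of an IP address."""
    parts = ip.split(".")
    return ["received:" + ".".join(parts[:n])
            for n in range(min_parts, len(parts) + 1)]


def host_suffix_tokens(host, min_parts=2):
    """Tokens for the trailing 2, 3, ... dotted labels of a hostname."""
    parts = host.lower().split(".")
    return ["received:" + ".".join(parts[-n:])
            for n in range(min_parts, len(parts) + 1)]


ip_tokens = ip_prefix_tokens("83.69.163.110")
host_tokens = host_suffix_tokens("n16.grp.scd.yahoo.com")
print(ip_tokens)
print(host_tokens)
```

Because shorter fragments are shared by many hosts, a network block or a parent domain can pick up its own spam or ham weight even when the individual host has never been seen before.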
Thanks in large degree to Tim Peters' skill, the tokenizer is clearly written and very well-commented (comments make up probably half the file): http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/tokenizer.py?rev=1.31&view=markup The classifier is also well-structured and well-commented thanks to Tim and contains links to both Paul Graham's original "Plan for Spam" as well as some of Gary Robinson's writings: http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/classifier.py?rev=1.25&view=markup Brad> I am stating that this is a capability that I believe we need. I think we already have it. >> There's no need to do blacklisting as far as I'm concerned. Brad> The content of the message is important, yes. But you're throwing Brad> away all the envelope information which can also be very Brad> instructive. What envelope information? Does it turn up in a message header? If Postfix adds it to a Received: header, Spambayes probably already takes it into account. Skip From jafo at tummy.com Wed Aug 4 22:28:48 2004 From: jafo at tummy.com (Sean Reifschneider) Date: Wed Aug 4 22:28:54 2004 Subject: [Roto Rooters] RE: [spambayes-dev] Re: Generating SB tokens based upon information onthe net In-Reply-To: References: Message-ID: <20040804202848.GF5364@tummy.com> On Wed, Aug 04, 2004 at 10:51:37PM +0400, Brad Knowles wrote: > The DNS lookup time can be significant. That's true. That's >part of why you want to mirror all the DNSBLs that you use so that >you can query them locally, as opposed to having to go across the What SpamAssassin does is send out all the DNSBL queries at once, and then it determines how long it will wait for responses. The spamassassin man page has information on the specifics, but it's something like this: If after a second we have 90% of the responses, don't listen for any more. If after 2 seconds we have 80%... If after 5 seconds we have 60%... If after 10 seconds we have 40%... Terminate after 30 seconds.
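The waiting schedule as recalled above can be written down as a small decision function. This is a sketch of the schedule exactly as stated in this message ("something like this"), not of SpamAssassin's actual algorithm; SCHEDULE, HARD_LIMIT and should_stop_waiting are hypothetical names.

```python
# (seconds elapsed, percentage of responses that is "enough" from then on)
SCHEDULE = [(1, 90), (2, 80), (5, 60), (10, 40)]
HARD_LIMIT = 30  # seconds; give up regardless of how many answers arrived


def should_stop_waiting(elapsed, received, total):
    """Return True once enough DNSBL answers are in for the time already spent."""
    if total == 0 or received >= total or elapsed >= HARD_LIMIT:
        return True
    for after_seconds, percent in SCHEDULE:
        # Integer arithmetic avoids floating-point edge cases at the thresholds.
        if elapsed >= after_seconds and received * 100 >= percent * total:
            return True
    return False


print(should_stop_waiting(1, 9, 10))   # 90% after 1 second
print(should_stop_waiting(1, 8, 10))   # only 80% after 1 second
print(should_stop_waiting(30, 0, 10))  # hard limit reached
```

The longer the wait, the fewer answers are required, so one slow or dead blacklist cannot hold up every message for the full 30 seconds.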
Sean -- If we don't survive, we don't do anything else. -- John Sinclair Sean Reifschneider, Member of Technical Staff tummy.com, ltd. - Linux Consulting since 1995. Qmail, Python, SysAdmin From popiel at wolfskeep.com Thu Aug 5 01:12:40 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Aug 5 01:12:43 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: Message from Brad Knowles of "Wed, 04 Aug 2004 22:54:22 +0400." References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> <16657.6883.902560.801458@montanaro.dyndns.org> Message-ID: <20040804231240.4AAB22DFC6@cashew.wolfskeep.com> In message: Brad Knowles writes: > The content of the message is important, yes. But you're >throwing away all the envelope information which can also be very >instructive. > > At the very least, all the envelope information should be >incorporated into whatever system you're using to score the message. We are not throwing away the envelope; we consider large parts of the envelope to be content, and incorporate it just like any other token source. Most particularly, we get information from the From, To, CC, Message-Id, and Received headers (plus all the MIME-related headers). We have tested including stuff from other headers too, with various results (Date tends to be useless, for instance). - Alex From tameyer at ihug.co.nz Thu Aug 5 03:04:38 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Aug 5 03:04:43 2004 Subject: [spambayes-dev] bug in imap filter or in email package In-Reply-To: Message-ID: [Tony] > This sounds a lot like the bug with the email package that Neil > Schemenauer brought up here very recently. He said that > he'd brought it up with Barry, but not submitted a bug report. [Neil] > Have not filed a bug yet and it's still not fixed, AFAIK. 
Just so that everyone knows, Sjoerd did file a report (sf #1002475), so this should get fixed before 2.4 final, hopefully (but definitely not 2.4a2). Maybe even someone at the bug day this weekend will get to it. =Tony Meyer From tameyer at ihug.co.nz Thu Aug 5 03:56:49 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Aug 5 03:56:55 2004 Subject: [spambayes-dev] bug in imap filter or in email package In-Reply-To: Message-ID: There has been a little bit of discussion on spambayes-dev about a bug with the 2.4a1 email package, where header lines that end with \r\n are not treated correctly (the value ends up with a \r at the end). A SF bug was opened for this: [ 1002475 ] email message parser doesn't handle \r\n correctly I've created a patch to fix this, and a couple of tests to add to test_email.py: [ 1003693 ] Fix for 1002475 (Feedparser not handling \r\n correctly) If someone would like to review this and check it in after 2.4a2 is all done that would be great. Maybe someone at the bug day? (I might come along to that, but it's the middle of the night, so probably not). Thanks! =Tony Meyer From ta-meyer at ihug.co.nz Thu Aug 5 04:07:39 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Thu Aug 5 04:08:50 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: Message-ID: [Tony] > [Classifier] > x-experimental_ham_spam_imbalance_adjustment - the code for > this is gone already; it's just the option that's left. > [Tokenizer] x-extract_dow > [Tokenizer] x-generate_time_buckets [Skip] > Definitely zap the above. Done. [Tony] > [Classifier] x-use_bigrams - becomes a regular > option (defaulting to False?) [Skip] > False would be best. We already have people complaining > about the size of their databases. Unless anyone speaks up in the next couple of days, I'll remove the "x-" from the option, the "EXPERIMENTAL" from the description, and leave it set to False by default. 
[Skip] > Are the habeas headers a dead-end in the wider world that > most Spambayes users simply don't use? If they are spoofed > they should be a fairly good spam clue. I'm not sure I'd > delete them yet. I'm not certain - I very rarely see mail with them (I have an Outlook thingy that puts a little *H* next to mail with them, so I do notice when mail does) - with the exception of one source (TidBITS/TidBITS-Talk). For a while I saw spam with them, too, but even that seems to have stopped. I wonder whether perhaps the experiment failed, and they simply don't get used any more. I'm happy to leave them for the moment - it would certainly be interesting to see results from anyone that does get habeas-marked mail (good or bad). It's a while since I did any testing with it, so I reran it with my current testing corpora and got a loss and an indifferent: (first line is all defaults, second is searching for habeas headers, third is reducing habeas headers to a single token) -> tested 280 hams & 131 spams against 1111 hams & 512 spams [...] filename: exchanges exchange_habeass exchange_habeas_reduces ham:spam: 1391:643 1391:643 1391:643 fp total: 0 0 0 fp %: 0.00 0.00 0.00 fn total: 35 35 35 fn %: 5.44 5.44 5.44 unsure t: 83 82 82 unsure %: 4.08 4.03 4.03 real cost: $51.60 $51.40 $51.40 best cost: $33.80 $33.20 $33.20 h mean: 0.10 0.09 0.09 h sdev: 1.72 1.60 1.60 s mean: 89.34 89.33 89.33 s sdev: 25.65 25.64 25.64 mean diff: 89.24 89.24 89.24 k: 3.26 3.28 3.28 -> tested 4690 hams & 384 spams against 18764 hams & 1539 spams [...] 
filename: ihugs ihug_habeass ihug_habeas_reduces ham:spam: 23454:1923 23454:1923 23454:1923 fp total: 1 5 5 fp %: 0.00 0.02 0.02 fn total: 23 20 20 fn %: 1.20 1.04 1.04 unsure t: 169 151 154 unsure %: 0.67 0.60 0.61 real cost: $66.80 $100.20 $100.80 best cost: $57.00 $84.20 $83.00 h mean: 0.09 0.12 0.12 h sdev: 1.89 2.36 2.38 s mean: 95.86 96.42 96.43 s sdev: 14.99 14.20 14.17 mean diff: 95.77 96.30 96.31 k: 5.67 5.82 5.82 =Tony Meyer From tim.peters at gmail.com Thu Aug 5 04:31:52 2004 From: tim.peters at gmail.com (Tim Peters) Date: Thu Aug 5 04:31:55 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: References: Message-ID: <1f7befae04080419317101bfa@mail.gmail.com> [Tony] >>> [Classifier] x-use_bigrams - becomes a regular >>> option (defaulting to False?) [Skip] >> False would be best. We already have people complaining >> about the size of their databases. [Tony] > Unless anyone speaks up in the next couple of days, I'll remove the "x-" > from the option, the "EXPERIMENTAL" from the description, and leave it set > to False by default. You're incapable of making a bad decision here, so I've stayed silent . Bigrams remain an interesting option, so I don't expect the code to go away. The database size can be pretty amazing, though! Using bigrams and a giant pickled dict, my Outlook routinely consumes over 120MB of RAM now. Fine by me -- I've got plenty of RAM. But it sure makes False the right default. 
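The database growth described above can be illustrated with a small sketch. It is not the SpamBayes tokenizer or classifier code: the function names, the whitespace-based word splitting, and the "bi:" token prefix are assumptions chosen only for illustration. It emits every word plus one token per adjacent word pair, then counts how many distinct tokens a store would have to hold with and without the pairs.

```python
def unigrams_and_bigrams(words):
    """Yield each word, plus a 'bi:'-prefixed token for each adjacent pair.

    The 'bi:' prefix is an illustrative assumption, not a documented format.
    """
    previous = None
    for word in words:
        yield word
        if previous is not None:
            yield "bi:%s %s" % (previous, word)
        previous = word


def distinct_token_counts(messages):
    """Return (distinct unigrams, distinct unigrams plus bigrams)."""
    unigrams = set()
    mixed = set()
    for text in messages:
        # Naive tokenization: lowercase and split on whitespace.
        words = text.lower().split()
        unigrams.update(words)
        mixed.update(unigrams_and_bigrams(words))
    return len(unigrams), len(mixed)


if __name__ == "__main__":
    sample = ["buy cheap pills now", "cheap pills buy now"]
    print(distinct_token_counts(sample))
```

Even on two four-word messages the mixed scheme more than doubles the number of distinct keys, because word pairs repeat far less often than single words. That is consistent with a default of False for people who are already worried about database size.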
From tim.one at comcast.net Thu Aug 5 06:00:35 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Aug 5 06:00:45 2004 Subject: [spambayes-dev] RE: [Pydotorg] Re: Generating SB tokens based upon information on thenet In-Reply-To: <16657.16491.569373.875427@montanaro.dyndns.org> Message-ID: <20040805040044.51D211E4009@bag.python.org> Lest anyone forget , SpamBayes was originally developed using a python.org mail corpus as "ham", consisting of tens of thousands of "blessed" tech mailing list msgs, hundreds of which turned out to be false negatives, cleaned from the corpus over a period of months as SpamBayes got better at discovering them (the large number of bogus "ham" really hurt at the start -- garbage in, garbage out). The classifier achieved the fabled "four nines" accuracy on that traffic in controlled tests, and showed no possible improvement remaining to be made (there were no false negatives remaining, and the 3 to 9 false positives remaining were technically ham but likely impossible for any useful system to identify as ham -- like the one-time poster to comp.lang.python who quoted an entire Nigerian scam spam with a one-line "this is a scam" comment at the start). SpamBayes doesn't need more info to do a stellar job on tech mailing list traffic (more might make for a tiny improvement, measurable only in a very-large-scale controlled test), but what it does need is ongoing training. I don't know whether the latter is feasible. From tameyer at ihug.co.nz Thu Aug 5 07:43:02 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Aug 5 07:43:13 2004 Subject: [spambayes-dev] RE: [Pydotorg] Re: Generating SB tokens based uponinformation on thenet In-Reply-To: Message-ID: [Tim Peters] > SpamBayes doesn't need more info to do a stellar job on tech > mailing list traffic (more might make for a tiny improvement, > measurable only in a very-large-scale controlled test), but what > it does need is ongoing training. I don't know whether the > latter is feasible. 
I know pretty much nothing about the way that mail to python.org is managed, but this conversation ended up on spambayes-dev, so I'm going to comment anyway. Might this be a case for some nice Mailman-SpamBayes integration? If training a list-specific database could be done as part of something that admin people already do with Mailman, that would seem like a good thing. For example, clicking discard on the review page could add the message to a spam corpus, and some sort of semi-regular in-the-quiet-moments tte script could run through that corpus and the archive (for ham) to update training. But no, I don't think I'm offering to do the coding work... =Tony Meyer From sjoerd at acm.org Thu Aug 5 14:52:57 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Thu Aug 5 14:53:15 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: References: Message-ID: <41122DA9.4000508@acm.org> Tony Meyer wrote: >>If we can do something that returns self the first time it is >>called and a (truly) new instance for all subsequent calls, it >>might work. Something like this, perhaps: > > [...] > > I've just checked in lots of changes (and some tests!) to sb_imapfilter. > You're running from CVS, yes? If so, if you could see if they break it for > you (don't for me), that would be great. There are lots of changes, so > running on somewhere safe would be good... It doesn't work and I wonder how it can work in your environment. The problem is that you use both BadIMAPResponse and BadIMAPResponseError to raise exceptions, but only the latter is actually defined. Also, since in sb_imapfilter.py imap is no longer a global variable, the call to imap.close() in IMAPSession.SelectFolder() causes a NameError exception. Should that be self.close() instead? > I have made a change to fix this problem, along the lines (but not quite the > same) as what you suggested.
I've also changed from always using our own ID > to using the Message-ID header when possible (but it should still pick up > the old ones and not retrain/filter them). > > Interested to know what you think... :) > > =Tony Meyer > -- Sjoerd Mullender -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040805/833fea67/signature.pgp From sjoerd at acm.org Thu Aug 5 15:59:53 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Thu Aug 5 16:00:18 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: <41122DA9.4000508@acm.org> References: <41122DA9.4000508@acm.org> Message-ID: <41123D59.8080804@acm.org> Sjoerd Mullender wrote: > Tony Meyer wrote: > >>> If we can do something that returns self the first time it is called >>> and a (truly) new instance for all subsequent calls, it >>> might work. Something like this, perhaps: >> >> >> [...] >> >> I've just checked in lots of changes (and some tests!) to sb_imapfilter. >> You're running from CVS, yes? If so, if you could see if they break >> it for >> you (don't for me), that would be great. There are lots of changes, so >> running on somewhere safe would be good... > > > It doesn't work and I wonder how it can work in your environment. The > problem is that you use both BadIMAPResponse and BadIMAPResponseError to > raise exceptions, but only the latter is actually defined. > > Also, since in sb_imapfilter.py imap is no longer a global variable, the > call to imap.close() in IMAPSession.SelectFolder() causes a NameError > exception. Should that be self.close() instead? More problems: When a message can't be parsed in IMAPMessage.get_full_message, message.insert_exception_header is called. The argument is data["RFC822"], but that key doesn't exist. It should be self.rfc822_key instead. 
And then there is a reference to "self" in insert_exception_header, so that causes NameError exception. I don't know whether it is still necessary to insert a mailid header in the message after your Message-Id changes, but if so, maybe the id should be passed as a parameter? >> I have made a change to fix this problem, along the lines (but not >> quite the >> same) as what you suggested. I've also changed from always using our >> own ID >> to using the Message-ID header when possible (but it should still pick up >> the old ones and not retrain/filter them). >> >> Interested to know what you think... :) >> >> =Tony Meyer >> > > > > ------------------------------------------------------------------------ > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev -- Sjoerd Mullender -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040805/2260f115/signature.pgp From barry at python.org Thu Aug 5 17:16:39 2004 From: barry at python.org (Barry Warsaw) Date: Thu Aug 5 17:16:32 2004 Subject: [spambayes-dev] RE: [Pydotorg] Re: Generating SB tokens based uponinformation on thenet In-Reply-To: References: Message-ID: <1091718999.8541.58.camel@localhost> On Thu, 2004-08-05 at 01:43, Tony Meyer wrote: > Might this be a case for some nice Mailman-SpamBayes integration? If > training a list-specific database could be done as part of something that > admin people already do with Mailman, that would seem like a good thing. > Clicking discard on the review page adds the message to a spam corpus, for > example, and have some sort of semi-regular in=-the-quiet-moments tte script > that runs through that corpus and the archive (for ham) to update training. 
> > But no, I don't think I'm offering to do the coding work... You wouldn't have to start from scratch though. :) There's a patch on SF, probably 2.5 years old, that I did for Mailman 2.1, and I think Simone Piunno (on mailman-developers) did some more recent work bringing that patch up to date. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 307 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040805/fba6dba0/attachment.pgp From barry at python.org Thu Aug 5 17:25:40 2004 From: barry at python.org (Barry Warsaw) Date: Thu Aug 5 17:25:31 2004 Subject: [spambayes-dev] bug in imap filter or in email package In-Reply-To: References: Message-ID: <1091719086.8545.60.camel@localhost> On Wed, 2004-08-04 at 21:04, Tony Meyer wrote: > Just so that everyone knows, Sjoerd did file a report (sf #1002475), so this > should get fixed before 2.4 final, hopefully (but definitely not 2.4a2). > Maybe even someone at the bug day this weekend will get to it. I'm going to try to participate in bug day this weekend, at least for a little while, so I'd happily work on this with someone. -Barry -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 307 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040805/fc275300/attachment.pgp From brad.knowles at skynet.be Thu Aug 5 18:20:56 2004 From: brad.knowles at skynet.be (Brad Knowles) Date: Thu Aug 5 18:40:00 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information on the net In-Reply-To: <16657.16491.569373.875427@montanaro.dyndns.org> References: <20040730163221.GA24953@mems-exchange.org> <16656.2130.446565.493957@montanaro.dyndns.org> <16657.6883.902560.801458@montanaro.dyndns.org> <16657.16491.569373.875427@montanaro.dyndns.org> Message-ID: At 3:00 PM -0500 2004-08-04, Skip Montanaro wrote: > Like many other pieces of software the code is the best documentation you're > going to find. You needn't read and understand it all, however. I'm not a programmer. I haven't done any proper programming in fifteen years, since I graduated college. Even if I was a programmer, I know absolutely nothing about Python. Therefore, the source code is pretty much useless to me. > What envelope information? Does it turn up in a message header? If Postfix > adds it to a Received: header, Spambayes probably already takes it into > account. To the best of my knowledge, none of the black list stuff ever gets added to any headers, certainly not by postfix. If you have postfix look this stuff up in a black list, then it will make binary decisions based on that, which we don't want. We want something that can look this information up on various black lists, record all those various bits of information, and then score the message based on the result. Given the limited capability to do this sort of thing within postfix, I believe that SpamBayes would be the best place to do this. If you don't want to do this within SpamBayes, then you need to come up with some other external component that fits as a shim between postfix and SpamBayes. 
-- Brad Knowles, "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin (1706-1790), reply of the Pennsylvania Assembly to the Governor, November 11, 1755 SAGE member since 1995. See for more info. From rmalayter at bai.org Thu Aug 5 18:56:24 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Thu Aug 5 18:56:22 2004 Subject: [spambayes-dev] Deprecated options Message-ID: <792DE28E91F6EA42B4663AE761C41C2A02A5BC47@cliff.bai.org> [Tim Peters] > You're incapable of making a bad decision here, so I've stayed silent > . Bigrams remain an interesting option, so I don't expect the > code to go away. The database size can be pretty amazing, though! > Using bigrams and a giant pickled dict, my Outlook routinely consumes > over 120MB of RAM now. Fine by me -- I've got plenty of RAM. But it > sure makes False the right default. CRM-114 uses 5-grams or even more, but ultimately uses a short hash to represent the n-gram strings. This (intentionally?) short hash (effectively 20 bits, from what I've read) results in a lot of collisions, which keeps the classifier DB size small. Performance doesn't seem to suffer much at all because of these collisions. Should this approach be looked at for n-grams and SpamBayes? I would love to try the "engine" of CRM-114's SBPH classifier in SpamBayes' comparatively pretty and easy-to-use skin. I think the maximum DB size using this approach (5 bytes for hex-encoded hash "token", 4 bytes each for ham and spam count) would be something like 13.5 MB. Perhaps there's more overhead (termination strings, whatever) in the DB format than I realize, but the DB could still be kept fairly small. Heck, the size of the hash could be configurable as well, giving people the option to use whatever length (and resulting DB size) they're comfortable with. I know Bill Y. (CRM-114's creator) used to participate here, perhaps he could offer some ideas.
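As a rough check of the size estimate above, here is a minimal sketch under the message's own premises: a 20-bit hash stored as hex digits, 4 bytes each for the ham and spam counts, and every possible hash value present in the database. The function name and parameters are hypothetical, and the 20-bit premise is itself questioned elsewhere in this thread, so the numbers are only an upper bound for this hypothetical layout.

```python
def max_hashed_db_bytes(hash_bits, count_bytes=4):
    """Upper bound on raw record bytes when tokens are fixed-width hashes.

    Assumes one record per possible hash value, a hex-encoded key, and two
    counters (ham and spam) per record. Real database formats add overhead
    that is not modelled here.
    """
    buckets = 2 ** hash_bits
    key_bytes = (hash_bits + 3) // 4  # hex digits needed for the key
    return buckets * (key_bytes + 2 * count_bytes)


if __name__ == "__main__":
    for bits in (16, 20, 24):
        size = max_hashed_db_bytes(bits)
        print("%2d bits: %d bytes (%.1f MiB)" % (bits, size, size / 2.0 ** 20))
```

For 20 bits this gives 1,048,576 records of 13 bytes each, which is in the same range as the figure quoted in the message. Each extra 4 bits of hash multiplies the record count by 16, so a configurable hash length trades database size against collisions quite steeply.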
To me, using SBPH to generate tokens for SpamBayes seems like it would be fairly straightforward. The rest of SpamBayes would stay mostly the same. If I ever find the time to try this out on my own, I will give it a shot. Regards, Ryan From kennypitt at hotmail.com Thu Aug 5 19:25:13 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Aug 5 19:25:21 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon information onthe net In-Reply-To: Message-ID: Brad Knowles wrote: > We want something that can look this information up on various black > lists, record all those various bits of information, and then score > the message based on the result. Given the limited capability to do > this sort of thing within postfix, I believe that SpamBayes would be > the best place to do this. > > If you don't want to do this within SpamBayes, then you need to come > up with some other external component that fits as a shim between > postfix and SpamBayes. IMHO, it's not likely that this will make it into the core SpamBayes source, but if you want to put it in yourself then the patch is attached. Applying the patch only involves inserting several lines into 2 different files. This version only inserts a token if a host is found on a DNSBL, but could easily be tweaked to insert a "spam" token if found or an "ok" token if not found. After applying the patch, you configure it by adding the x-dnsbl_databases option to the Tokenizer section of your config file. The format is a space-separated list of DNSBL hostnames, for example: x-dnsbl_databases:sbl.spamhaus.org xbl.spamhaus.org cbl.abuseat.org -- Kenny Pitt -------------- next part -------------- A non-text attachment was scrubbed... 
Name: dnsbl.patch Type: application/octet-stream Size: 2524 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040805/5e5e7b5c/dnsbl-0001.obj From tim.peters at gmail.com Thu Aug 5 20:42:45 2004 From: tim.peters at gmail.com (Tim Peters) Date: Thu Aug 5 20:42:48 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A02A5BC47@cliff.bai.org> References: <792DE28E91F6EA42B4663AE761C41C2A02A5BC47@cliff.bai.org> Message-ID: <1f7befae0408051142608484fa@mail.gmail.com> [Ryan Malayter] > CRM-114 uses 5-grams or even more, but ultimately uses a short hash to > represent the n-gram strings. This (intentionally?) short hash > (effectively 20 bits, from what I've read) In this thread, which you started , Bill said "effectively 64 bits": http://mail.python.org/pipermail/spambayes/2003-September/007602.html Details are important at this level. > results in a lot of collisions, which keeps the classifier DB size small. > Performance doesn't seem to suffer much at all because of these collisions. Experiments were run on that in SB before, and you can even find patches (probably out of date now!) in the archives that implement it. Results were discouraging. It did learn "faster". After a moderate amount of training data, though, results were worse. Collisions did hurt, and the rare bad classifications as a result of hash aliasing were spectacular: incomprehensibly bad to the human eye. Also needs a different database implementation to be practical (string-keyed mappings are too wasteful when the keys come from a contiguous range of integers, and using a Python dict to represent the mapping then is enormously too wasteful). > ... > I know Bill Y. (CRM-144's creator) used to participate here, perhaps he > could offer some ideas. To me, using SBPH to generate tokens for > SpamBayes seems like it would be fairly straightforward. The rest of > SpamBayes would stay mostly the same. 
It's easy to experiment with, but for practical application it needs a different database approach, to exploit the nature of the keys. From sethg at GoodmanAssociates.com Fri Aug 6 04:12:51 2004 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Fri Aug 6 04:12:50 2004 Subject: [spambayes-dev] Re: Generating SB tokens based upon informationon the net In-Reply-To: <20040804163034.0DDFE2E070@cashew.wolfskeep.com> Message-ID: > From: T. Alexander Popiel > Sent: Wednesday, August 04, 2004 11:31 AM <...> > Actually, the Received header info can come from the HELO or > EHLO command that opened the conversation, not DNS. I haven't > looked to see if any MTAs actually do it that way, but it's > the way I would do it if I were writing one... (And sure, > that means a rogue could lie about identification in the > HELO... but that's why both the name and the IP appear in the > Received line.) You'd be exactly right. Most MTAs put the SMTP-client (the sender) IP in [ ], put the EHLO string before that and put the rDNS result before that. What you usually see for the sending machine in a received header is: rDNS result ( EHLO name [ IP address ]) Some mailers take the pro-active step of noting "May be forged" if they notice that the rDNS lookup and EHLO string were too different. In any case, with these three pieces of information, a user can interpret the received headers going from top to bottom. The only piece of information that can be forged is really the EHLO string. Spoofing an IP address for an SMTP session is very hard and is best done through a proxy with the address you want to spoof, and attacking the rDNS tree is pretty tough. The only implication for Spambayes is that when mining headers, the EHLO string in spam is often non-existent. When it does appear, it is sometimes the target domain (to try to fool the mailer into thinking it is a local message), sometimes a joe-job victim and sometimes just a nonsense string.
That one piece of information may be of dubious value for the classifier, but the IP and rDNS result are certainly useful. The other item of dubious value that seems to generate tokens is the recipient machine in the top received header. That is _always_ your own MX, and listing it is not of any value. Often, the sending machine in the first received line is another internal machine at your own provider, but not always. The first external sender either occurs in the first or second received header. For example, the top two received lines in Alex's post that I am responding to were (for me): Received: from inbound-mx3.atl.registeredsite.com ([64.224.219.91]) by imta04a2.registeredsite.com with ESMTP id <20040804163157.IEAL28804.imta04a2.registeredsite.com@inbound-mx3.atl.registeredsite.com> for ; Wed, 4 Aug 2004 12:31:57 -0400 Received: from smtp-vbr2.xs4all.nl (smtp-vbr2.xs4all.nl [194.109.24.22]) by inbound-mx3.atl.registeredsite.com (8.12.11/8.12.8) with ESMTP id i74GV4aI019873 for ; Wed, 4 Aug 2004 16:31:05 GMT The top received line is a local handoff between the gateway MX and an internal MTA at my provider. The gateway MX did not even bother to provide an EHLO string, since it is a trusted internal handoff. None of the address information in this line is suitable for generating tokens. This would take some effort to suppress, since some providers don't have the incoming MX and MDA (mail delivery agent) functions separated. One possible way to detect an internal transfer: if the sending and receiving machines in the top received header have the same domain, it is an internal handoff. The next received line contains the actual external SMTP-client (the sender). In this case, the EHLO string matches the rDNS exactly, showing the sender has their DNS properly configured. The machine in the 'from' part of this line is worth generating tokens for.
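The reading of the two received lines above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SpamBayes tokenizer code: the regular expression handles just the "from NAME (OTHERNAME [IP]) by HOST" shape of the two examples, the "same domain" test naively compares the last two labels of each hostname, and the "received-ip:" token name is made up for the example.

```python
import re

# Matches "from NAME (OTHERNAME [IP]) by HOST"; OTHERNAME may be absent.
# Covers only the shape of the two example lines; real headers vary widely.
RECEIVED_RE = re.compile(
    r"from\s+(?P<name>\S+)\s+"
    r"\((?P<paren_name>[^\s\[\]]+)?\s*\[(?P<ip>[0-9.]+)\]\)\s+"
    r"by\s+(?P<by>\S+)",
    re.IGNORECASE,
)


def parse_received(value):
    """Return a dict with name, paren_name, ip and by, or None if no match."""
    match = RECEIVED_RE.search(value)
    return match.groupdict() if match else None


def naive_domain(hostname):
    """Last two labels of a hostname (an assumption; wrong for e.g. co.nz)."""
    return ".".join(hostname.lower().rstrip(".").split(".")[-2:])


def is_internal_handoff(fields):
    """True when the 'from' and 'by' machines share a (naive) domain."""
    return naive_domain(fields["name"]) == naive_domain(fields["by"])


def sender_ip_token(fields):
    """Hypothetical token for an external sender's IP, else None."""
    if fields is None or is_internal_handoff(fields):
        return None
    return "received-ip:" + fields["ip"]


if __name__ == "__main__":
    top = ("from inbound-mx3.atl.registeredsite.com ([64.224.219.91]) "
           "by imta04a2.registeredsite.com with ESMTP")
    second = ("from smtp-vbr2.xs4all.nl (smtp-vbr2.xs4all.nl [194.109.24.22]) "
              "by inbound-mx3.atl.registeredsite.com (8.12.11/8.12.8) with ESMTP")
    for line in (top, second):
        print(sender_ip_token(parse_received(line)))
```

The sketch deliberately emits only the IP address. Which of the two names is the EHLO string and which is the rDNS result differs between MTAs, so using either name as a token would need per-MTA knowledge that this example does not attempt.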
The machine in the 'by' part is the same as the machine in the 'from' part of the first header line, and since it is not suitable for generating tokens in the first line, it is not suitable here. In general, only machines listed in the 'from' part of each received line should be candidates for token generation, and only if they have a different domain from the machine in the 'by' part of the top received line. When generating a token for a machine, it would probably be wise to ignore the EHLO string, since it is simply part of the MTA configuration and can be forged to be the same as a machine that you would normally trust. In fact, some spamming MTAs change their EHLO string with every message. Another dead giveaway is when the sending machine has an rDNS result that has a pattern resembling a dynamic IP connection. Typically, this is something like 220-15-7-52-adslpool.bigISP.com. Unfortunately, this gets into regexes, which is really the bailiwick of SpamAssassin and other rule-based systems. The best way to determine if the line is a dynamic IP is to consult a dynamic IP DNSBL, but for all the reasons that have been mentioned, this is really out of the question for Spambayes. OTOH, running a proxy ahead of Spambayes that checks half a dozen DNSBLs on all the machines in the 'from' parts of the header lines might be a useful adjunct. While YMMV, I have had very few errors when I used to do this (I no longer bother). For large spam loads where a few percent unsures amounts to a lot of mail to manually classify, this can be helpful. -- Seth Goodman From tameyer at ihug.co.nz Fri Aug 6 07:25:17 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Fri Aug 6 07:25:23 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: Message-ID: > It doesn't work and I wonder how it can work in your > environment. I did say I did limited testing. The tests didn't raise any of the (incorrect) BadIMAPResponse exceptions, so they didn't cause problems.
I've fixed those, thanks. > Also, since in sb_imapfilter.py imap is no longer a global > variable, the call to imap.close() in IMAPSession.SelectFolder() > causes a NameError exception. Should that be self.close() instead? This one I missed because I don't expunge. It should be self.imap_server.close(), and I've fixed this, too, thanks. > When a message can't be parsed in IMAPMessage.get_full_message, > message.insert_exception_header is called. The argument is > data["RFC822"], but that key doesn't exist. It should be > self.rfc822_key instead. This is a bug present in the old version that was simply carried through. I've fixed this too (and will backport), thanks. > And then there is a reference to "self" in insert_exception_header, so > that causes NameError exception. I don't know whether it is still > necessary to insert a mailid header in the message after your > Message-Id changes, but if so, maybe the id should be passed as a > parameter? Missed this one because I didn't raise an exception. Changed to pass the id as a parameter, thanks. =Tony Meyer From sjoerd at acm.org Fri Aug 6 08:20:47 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Fri Aug 6 08:21:02 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: References: Message-ID: <4113233F.5090104@acm.org> Tony Meyer wrote: >>It doesn't work and I wonder how it can work in your >>environment. > > > I did say I did limited testing. The tests didn't raise any of the > (incorrect) BadIMAPResponse exceptions, so they didn't cause problems. I've > fixed those, thanks. There is still one except BadIMAPResponse: in folder_list. >>Also, since in sb_imapfilter.py imap is no longer a global >>variable, the call to imap.close() in IMAPSession.SelectFolder() >>causes a NameError exception. Should that be self.close() instead? > > > This one I missed because I don't expunge. It should be > self.imap_server.close(), and I've fixed this, too, thanks. Is that so? 
The call is in IMAPSession class, not in IMAPMessage where all other references to self.imap_server are. I don't see where self.imap_server is initialized in the IMAPSession class. -- Sjoerd Mullender -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040806/a9aeae24/signature.pgp From sjoerd at acm.org Fri Aug 6 11:53:09 2004 From: sjoerd at acm.org (Sjoerd Mullender) Date: Fri Aug 6 11:53:23 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: <4113233F.5090104@acm.org> References: <4113233F.5090104@acm.org> Message-ID: <41135505.8090105@acm.org> Another buglet still: in IMAPMessage.Save there is still a reference to the variable imap. This should probably be self.imap_server instead. Sjoerd Mullender wrote: > Tony Meyer wrote: > >>> It doesn't work and I wonder how it can work in your environment. >> >> >> >> I did say I did limited testing. The tests didn't raise any of the >> (incorrect) BadIMAPResponse exceptions, so they didn't cause >> problems. I've >> fixed those, thanks. > > > There is still one except BadIMAPResponse: in folder_list. > >>> Also, since in sb_imapfilter.py imap is no longer a global variable, >>> the call to imap.close() in IMAPSession.SelectFolder() >>> causes a NameError exception. Should that be self.close() instead? >> >> >> >> This one I missed because I don't expunge. It should be >> self.imap_server.close(), and I've fixed this, too, thanks. > > > Is that so? The call is in IMAPSession class, not in IMAPMessage where > all other references to self.imap_server are. I don't see where > self.imap_server is initialized in the IMAPSession class. 
> > > ------------------------------------------------------------------------ > > _______________________________________________ > spambayes-dev mailing list > spambayes-dev@python.org > http://mail.python.org/mailman/listinfo/spambayes-dev -- Sjoerd Mullender -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 374 bytes Desc: OpenPGP digital signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040806/18edca67/signature.pgp From gtoal at gtoal.com Fri Aug 6 14:31:20 2004 From: gtoal at gtoal.com (Graham Toal) Date: Fri Aug 6 14:29:50 2004 Subject: [spambayes-dev] Deprecated options In-Reply-To: <20040806100053.DF6F11E4032@bag.python.org> References: <20040806100053.DF6F11E4032@bag.python.org> Message-ID: <41137A18.mailHZ114KBQ@gtoal.com> Tim Peters wrote: > > I know Bill Y. (CRM-114's creator) used to participate here, perhaps he > > could offer some ideas. To me, using SBPH to generate tokens for > > SpamBayes seems like it would be fairly straightforward. The rest of > > SpamBayes would stay mostly the same. > > It's easy to experiment with, but for practical application it needs a > different database approach, to exploit the nature of the keys. I had a hack at a different DB approach and although I admit I did not take it as far as a working spam filter, the proof of concept implementation was at least enough to convince me that it was an avenue worth exploring. I wrote it up here: http://www.gtoal.com/mt/archives/2004_02.html and there is some sample code here: http://www.gtoal.com/spam/devel-temp/tokra3.c.html Without any knowledge of the structure of text at all, it was able to intuit sequences such as 'e' as being symptomatic of spam. Two conclusions: 1) You can afford to classify much longer sequences than simple n-grams, because if you use variable-length sequences, they're self-limiting. 2) The natural fit data structure for this is a 256-trie.
(specifically a DAWG but implemented as a trie rather than a DAG to allow easy additions) regards Graham From rmalayter at bai.org Fri Aug 6 19:11:59 2004 From: rmalayter at bai.org (Ryan Malayter) Date: Fri Aug 6 19:12:02 2004 Subject: [spambayes-dev] Deprecated options Message-ID: <792DE28E91F6EA42B4663AE761C41C2A02B04010@cliff.bai.org> [Tim Peters] >In this thread, which you started , Bill said >"effectively 64 bits": >http://mail.python.org/pipermail/spambayes/2003-September/007602.html Wow, I totally forgot about that thread. Now I realize why I was so intrigued by this new bigrams/DB size thread ;-). >Details are important at this level. I agree. From the description on the CRM114 site of how hashes are mapped to 1-MB .CSS files, I figured the effective hash length was 20 bits. I missed a step, apparently, in that CRM114 uses the address mapping only for the starting address of a hash bucket, not as a direct clipping of the hash value. I was just plain wrong. >Experiments were run on that in SB before, and you can even find >patches (probably out of date now!) in the archives that implement it. I remember looking then, and I am still unable to find those patches (in CVS) or the statistical results. Only anecdotal references to "hashing performing poorly" seem to appear throughout a bunch of threads. My google search was "CRM-114 site:python.org", there were 93 results that I looked at, nothing pointing to the original tests of these ideas. I guess the failure of the whole hashing issue was never really settled in my mind, since it seems to work so well for CRM-114. But SB has been working "good enough" for me for over a year now, so I never pursued things further. >Results were discouraging. It did learn "faster". After a moderate >amount of training data, though, results were worse. Collisions did >hurt, and the rare bad classifications as a result of hash aliasing >were spectacular: incomprehensibly bad to the human eye.
Did the test just store the hash value as hex/base64/whatever in the regular SpamBayes DB format? What hash was used? The same "fast hash" used in CRM114? Thanks, Ryan From tameyer at ihug.co.nz Sat Aug 7 08:16:53 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Aug 7 08:16:58 2004 Subject: [spambayes-dev] Use of email factory function In-Reply-To: Message-ID: > There is still one except BadIMAPResponse: in folder_list. So much for my searching. Thanks; fixed. > Is that so? The call is in IMAPSession class, not in > IMAPMessage where all other references to self.imap_server are. > I don't see where self.imap_server is initialized in the > IMAPSession class. Damn. Right again. Fixed. FWIW, I'm just about through with more improvements to the test suite, which should catch errors like these. I'll check it all in shortly. > Another buglet still: in IMAPMessage.Save there is still a > reference to the variable imap. This should probably be > self.imap_server instead. Thanks; fixed. =Tony Meyer From barry at python.org Sat Aug 7 17:54:07 2004 From: barry at python.org (Barry Warsaw) Date: Sat Aug 7 17:54:16 2004 Subject: [spambayes-dev] bug in imap filter or in email package In-Reply-To: References: Message-ID: <1091894046.1064.29.camel@anthem.wooz.org> On Wed, 2004-08-04 at 21:56, Tony Meyer wrote: > There has been a little bit of discussion on spambayes-dev about a bug with > the 2.4a1 email package, where header lines that end with \r\n are not > treated correctly (the value ends up with a \r at the end). > > A SF bug was opened for this: > > [ 1002475 ] email message parser doesn't handle \r\n correctly > > > I've created a patch to fix this, and a couple of tests to add to > test_email.py: > [ 1003693 ] Fix for 1002475 (Feedparser not handling \r\n correctly) > 305470> > > If someone would like to review this and check it in after 2.4a2 is all done > that would be great. Maybe someone at the bug day? 
> (I might come along to
> that, but it's the middle of the night, so probably not).

Patch looks good to me. I'm checking them in (after a little stylistic hydrant-pissing :).

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040807/be5d5348/attachment.pgp

From barry at python.org Sat Aug 7 18:00:40 2004
From: barry at python.org (Barry Warsaw)
Date: Sat Aug 7 18:00:43 2004
Subject: [spambayes-dev] bug in imap filter or in email package
In-Reply-To: <410EA4DE.5060200@acm.org>
References: <410EA4DE.5060200@acm.org>
Message-ID: <1091894439.1064.32.camel@anthem.wooz.org>

On Mon, 2004-08-02 at 16:32, Sjoerd Mullender wrote:
> The question is, is this a bug in the email package in that it should
> convert \r\n to \n, or is this a bug somewhere else in that the message
> given to the email package should never have included those \r\n?
>
> The message instance is created with email.Parser.Parser().parsestr(...)
> where the argument to parsestr is the data as returned by the IMAP
> server (which of course uses \r\n line endings).

Just for completeness, the email parser is supposed to be line-ending agnostic. IOW, it should gladly accept any of the 3 standard line endings. Thanks to Tony's patch, it now does. :)

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040807/2b59b304/attachment.pgp

From brown at dui-dwi.com Sun Aug 8 17:50:06 2004
From: brown at dui-dwi.com (DUI-DWI)
Date: Sun Aug 8 17:51:52 2004
Subject: [spambayes-dev] Can SpamBayes be improved with Markovian Weighting or Chained Karnaugh Mapping?
Message-ID: <6.0.0.22.2.20040808114653.0656f338@127.0.0.1>

Guys,

I realized that I just posted this in the general list, and this is definitely for the devs:

More of a technical question here. I've been doing tons of research on spam in general, and am curious about the following.

I recently watched the 2004 Spam Conference Webcast (http://spamconference.org/webcast.html) and a couple of speakers brought up a couple of improvements that can be made to a Bayesian filter, such as Markovian Weighting and Chained Karnaugh Mapping, that were found in their latest research and tests.

With my limited knowledge, I do know that SpamBayes uses its own technique of tiling unigrams and bigrams and the chi-squared combining, but I do not know if these are comparable to Markovian Weighting or Chained Karnaugh Mapping.

I found the option in the current SpamBayes release in the Experimental Configuration called "Use mixed uni/bi-grams scheme". I'm wondering if this is along the lines of my curiosity?

Any kind soul who can break all this down for me would be greatly appreciated. I had no idea that spam filtering would take up this amount of my time........LOL!

Cheers!
Erik Brown

From tim.peters at gmail.com Mon Aug 9 03:29:31 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Mon Aug 9 03:29:36 2004
Subject: [spambayes-dev] Deprecated options
In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A02B04010@cliff.bai.org>
References: <792DE28E91F6EA42B4663AE761C41C2A02B04010@cliff.bai.org>
Message-ID: <1f7befae040808182958643dd3@mail.gmail.com>

[Ryan Malayter]
...
> I remember looking then, and I am still unable to find those patches (in
> CVS) or the statistical results. Only anecdotal references to "hashing
> performing poorly" seem to appear throughout a bunch of threads. My
> google search was "CRM-114 site:python.org", there were 93 results that
> I looked at, nothing pointing to the original tests of these ideas.

There were many threads that tried hashing for one reason or another.
Sorry, I can't make time to search for them. One experiment clearly related to CRM-114, with patch, is here:

http://mail.python.org/pipermail/spambayes/2002-November/001504.html

For whatever reason, pipermail gave the attachment an .exe extension. Rename it to .txt (or whatever works for you for a patch file).

> I guess the failure of the whole hashing issue was never really settled
> in my mind, since it seems to work so well for CRM-114. But SB has been
> working "good enough" for me for over a year now, so I never pursued
> things further.

The CRM experiment had much more to do with generating huge piles of highly correlated features than with hashing. CRM-114 does everything differently, from tokenization through combining rule. The experiment only changed one thing in SB, and that experiment was such a disaster there was no incentive to try to figure out if changing N other things too may have helped.

> ...
> Did the test just store the hash value as hex/base64/whatever in the
> regular SpamBayes DB format?

Yes.

> What hash was used? The same "fast hash" used in CRM114?

Answered in the msg linked to above.

From ta-meyer at ihug.co.nz Tue Aug 10 02:34:22 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Tue Aug 10 02:36:33 2004
Subject: [spambayes-dev] Deprecated options
In-Reply-To:
Message-ID:

[Tony]
> Unless anyone speaks up in the next couple of days, I'll
> remove the "x-" from the option, the "EXPERIMENTAL" from the
> description, and leave it set to False by default.

In case anyone doesn't watch the check-ins list, FYI I've done this.

=Tony Meyer

From tameyer at ihug.co.nz Tue Aug 10 08:21:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Aug 10 08:21:19 2004
Subject: [spambayes-dev] Windows implementation using Commandline(forSpamBayes)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13071A35CD@its-xchg4.massey.ac.nz>
Message-ID:

[This is rather old now - sorry; I missed that it hadn't been responded to yet]

> (2) I discovered that the scripts would not run unless I copied the
> particular script from the script folder to the base folder. Was that
> correct? If so, should I have deleted the original file for
> that script which I left in the scripts folder? (When I attempted
> to run the script from the scripts folder, I got an error message
> saying that stuff was not found.)

The typical thing to do here is to run "python setup.py install" in the root spambayes directory. This copies everything in spambayes/spambayes into /Lib/site-packages/spambayes in your Python directory, and everything in spambayes/scripts into /Scripts in your Python directory. You can then run the scripts from there (or anywhere else - the things that need to be found are in the site-packages directory).

Alternatively, you can just set the environment variable PYTHONPATH to include the path to the root spambayes directory.

> (6) At first sb_mboxtrain.py didn't work. It turns out,
> various assumptions existed which didn't apply in my case.
> The closest function, as far as I could tell, was the
> "maildir_train" function for "maildir" -type mail boxes.
> But even this wasn't compatible. My mail was simply a
> collection of text files in a folder.

I don't really know anything about sb_mboxtrain, so will have to skip this. I would have thought that the text files could have been treated as very small mbox files, though.

> (7) This is where I am stuck.
> When I actually attempt to run
> a file through the filter, I've used the following:
>
> "sb_filter.py -f
> C:\SpamBayesServerEdition\inbox\20040624105403D1E2-00000002.tmp"
>
> It returns with NO feedback and NO errors.

sb_filter.py should print out (to the console) the resulting message. For example, run sb_filter.py with no arguments, type "Subject: Test", then return then control-z and you should get a message printed out that includes the classification header.

If it doesn't find any messages, then it does nothing, so this is probably the case here. It wants the filenames to be certain types of files, and a raw RFC822 message isn't one of them. The easiest thing, if you are always going to have a collection of text files (one per message), would be to change

    mbox = mboxutils.getmbox(fname)

to

    mbox = [email.message_from_file(file(fname))]

in sb_filter.py, and put an "import email" up the top.

> for fname in args:
>     print "fname: " % fname
>
> (notice my print line I eventually added. When this print
> line is placed here, an error message occurs:

The problem here is that you don't have anywhere for the fname to go. It should be 'print "fname: %s" % fname' or just 'print "fname", fname'.

=Tony Meyer

From tspeirs at paradise.net.nz Wed Aug 11 00:48:58 2004
From: tspeirs at paradise.net.nz (Terry Speirs)
Date: Wed Aug 11 00:49:07 2004
Subject: [spambayes-dev] FAQ Typo
Message-ID: <20040810224901.E5D0EAE4C5@smtp-3.paradise.net.nz>

Hi SpamBayes,

A minor correction to your FAQ web page:
http://spambayes.sourceforge.net/faq.html#what-do-i-need-to-do-to-update-the-faq

Section 1.5, item 5, ends with "so all you need to do it correct the odd mistake - it's very quick and easy.". It should say "so all you need to do is correct.".

Thanks for your efforts and this excellent product.

TRS
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040811/0ec8fd87/attachment.html

From tameyer at ihug.co.nz Wed Aug 11 06:51:58 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Aug 11 06:52:05 2004
Subject: [spambayes-dev] FAQ Typo
In-Reply-To:
Message-ID:

> A minor correction to your FAQ web page:
[...]
> Section 1.5, item 5, ends with "so all you need
> to do it correct the odd mistake - it's very quick
> and easy.". It should say "so all you need to do is
> correct.".

Thanks; fixed.

> Thanks for your efforts and this excellent product.

You're welcome (from us all).

=Tony Meyer

From mjgomez at softonic.com Wed Aug 11 12:11:07 2004
From: mjgomez at softonic.com (Maria Julieta Gomez)
Date: Wed Aug 11 12:10:12 2004
Subject: [spambayes-dev] Softonic.com: Request for Permission
Message-ID: <20040811101007.1B5711E4006@bag.python.org>

Hello,

My name is María Julieta Gómez and I am replacing Luis Garrido, Softonic.com's Customer Service Executive, during his summer vacation. Our company is the leading site in software downloads and sales all over Europe and in Spanish worldwide.

Our website, http://www.softonic.com, only available in Spanish at this moment, receives over 9 million visits per month, serving up to 75 million pages per month (in the Global Top 10 websites in Spain). As for the depth of our database, we offer over 25.000 thousand programs in our catalogue, have over 5 million downloads per month, and produce different CD-ROM/DVD collections widely distributed in Spain through PC magazines and newspapers.

As the Windows platform is currently the biggest in terms of downloads and applications, we would like to ask for your permission to include your program SpamBayes 1.0b1 in our CD-ROM collections and make them available to all Softonic users through all our distribution methods & marketing campaigns.
Also, if you are interested in selling the product in Spain, just let us know; we can offer you different selling and marketing methods for reaching a wide range of Spanish users.

Yours sincerely,

----------------------------------------------------------
María Julieta Gomez
Customer Service Department
-Softonic.Com -Grupo Intercom-
URL: http://www.softonic.com
Softonic.com customer line - 902.25.25.45
Customer service hours: 9:00 to 18:00
----------------------------------------------------------
This message and any documents attached to it may contain confidential information. Anyone who receives it in error is therefore informed that the information it contains is confidential and that its unauthorized use is prohibited by law; in that case we ask that you notify us by the same means or by telephone (93 592 01 15), refrain from copying the message or forwarding or delivering it to third parties, and delete it immediately.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040811/88f9c881/attachment.htm

From kennypitt at hotmail.com Wed Aug 11 17:06:27 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Aug 11 17:27:55 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
In-Reply-To: <42393C9DA7930245AB540667607F4F5022C307@SPIKE.city>
Message-ID:

This discussion seems more appropriate for the dev list, so moving it there.

Coe, Bob wrote:
> To me, the solution to the problem seems obvious and almost absurdly
> easy to implement: When the imbalance reaches a certain level
> (determined by the Spambayes gurus), have the program start training
> on every nth message it classifies as ham. Do this until the desired
> balance is restored.
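
For illustration only, the quoted proposal could be sketched roughly as below. This is not SpamBayes code: the class name `NthHamTrainer`, the 2:1 trigger, the 1:1 target and the value of n are all assumptions chosen for the example, and the caller is assumed to do the actual training whenever `should_train()` returns True.

```python
class NthHamTrainer:
    """Decide when to auto-train on a message that was classified as ham.

    Hypothetical helper: all thresholds are illustrative assumptions,
    not values taken from SpamBayes.
    """

    def __init__(self, nham, nspam, trigger_ratio=2.0, target_ratio=1.0,
                 every_nth=5):
        self.nham = nham                    # ham messages trained so far
        self.nspam = nspam                  # spam messages trained so far
        self.trigger_ratio = trigger_ratio  # spam:ham ratio that starts balancing
        self.target_ratio = target_ratio    # spam:ham ratio at which it stops
        self.every_nth = every_nth          # train on every nth classified ham
        self.active = False
        self._seen = 0

    def ratio(self):
        """Current spam:ham training ratio (guarding against zero ham)."""
        return self.nspam / max(self.nham, 1)

    def should_train(self):
        """Call once per message classified as ham; True means 'train on it'."""
        if not self.active and self.ratio() >= self.trigger_ratio:
            self.active = True      # imbalance reached the trigger level
            self._seen = 0
        elif self.active and self.ratio() <= self.target_ratio:
            self.active = False     # desired balance restored
        if not self.active:
            return False
        self._seen += 1
        if self._seen % self.every_nth:
            return False            # not the nth message yet
        self.nham += 1              # caller trains this message as ham
        return True


trainer = NthHamTrainer(nham=10, nspam=20)
decisions = [trainer.should_train() for _ in range(100)]
print(sum(decisions), "messages auto-trained; ham count is now", trainer.nham)
```

Separating a trigger ratio from a lower target ratio keeps the balancing from switching on and off around a single threshold; whether auto-training on unverified ham is a good idea at all is exactly what the rest of this thread debates.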

A while back, I devised a scheme that I wanted to try out where SpamBayes would automatically train on random messages of the classification that is too low, and the probability of training would be based on the extent of the imbalance. Probability would be 0 until imbalance reaches a starting threshold such as 2:1, and would increase exponentially to 1 as the imbalance moves toward a maximum threshold, say 5:1.

Unfortunately, automatic training is not as easy as it would seem in the Outlook add-in. The add-in uses a different trainer so that it can keep track of the Outlook id's of trained messages, rescore trained messages, etc. This makes it difficult to call the trainer from inside the classifier because the classifier is shared with the other SpamBayes apps such as sb_server. I'm sure there are some simple modifications that could be made to the code structure so that this could be implemented, but I just haven't found the time yet to work out the details.

--
Kenny Pitt

From tameyer at ihug.co.nz Thu Aug 12 09:47:12 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Aug 12 09:47:18 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
In-Reply-To:
Message-ID:

> My sense is that when users have an imbalance problem,
> overwhelmingly the situation is that of this user, i.e. more
> spam than ham. I'm about to say a couple of things that
> depend on that assumption, so I just want to state it.

I would agree with that assumption, in general.

>> Firstly, if you are not already, then doing "train on mistakes"
>> is a good idea. This should reduce the imbalance, and make it
>> grow less quickly.
>
> I don't see why.

True, train-on-mistakes might not reduce the imbalance compared to train-on-everything. This would only be true if the percentage of mistakes that are spam is lower than the percentage of incoming mail that is spam. I should really have used "might" instead of "should" there. In some cases, it will, however.

The imbalance almost certainly will grow less quickly, though, because the database size will grow much, much slower.

> The expectation should be that users will
> tune their cutoff values so that most of what goes into the
> unsure folder is spam. If a user then processes every unsure
> message into the database, this will increase, not decrease,
> the imbalance.

I'm not sure that there is an expectation that users will so tune the cutoffs, but I could be wrong. I like my cutoffs so that most of what goes in the unsure folder is unsure (I don't mean that facetiously - mail that *I* am also unsure about). I believe it would be fair to say that unsure messages tend toward spam, and I think I've seen work that shows that ham tends to be more homogeneous than spam (which makes much logical sense, although logical sense has little to do with any of this).

I wouldn't expect a user to raise the ham cutoff to reduce the amount of ham in the unsure folder, though (to me an unsure ham is much better than a false negative).

I think one of the main ways that people can help the imbalance, if they are already doing train-on-mistakes, is reducing the spam threshold a bit. Both Outlook and non- ship with a cutoff of 0.9, and I think quite a few people can get much less spam in their unsure box with a rate more like 0.8.

> Depending (possibly) on your settings, moving messages to the
> spam folder, even manually, will process them into the
> database. Right?

Yes, my mistake. I long ago turned off both the incremental training options, and so ofttimes forget about them. For me, manually moving a message to the spam folder does no training, but by default it will.

> To me, the solution to the problem seems obvious and almost
> absurdly easy to implement: When the imbalance reaches a
> certain level (determined by the Spambayes gurus), have the
> program start training on every nth message it classifies as
> ham. Do this until the desired balance is restored.

A while back now, I tried doing testing with various forms of auto-balancing training. The results were terrible. I never managed to find time to figure out why and how to resolve that, although I'd still like to. In fact, there is a feature request tracker still open (even assigned to me, I think) that requests some sort of auto-balancing.

More recently, the reported success (by Skip with SpamBayes, and by others with other things) of training-to-exhaustion, which implicitly keeps the database balanced, makes me want to try that out, both with more testing and in some sort of integrated fashion with sb_server/Outlook.

I did not try the exact scheme outlined above when I was doing my testing. It would be easy enough to do so, if only the time was available. If anyone would like to run the incremental testing setup, I'm happy to write the above into an appropriate regime.

=Tony Meyer

From kennypitt at hotmail.com Thu Aug 12 16:29:25 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Aug 12 16:34:07 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
In-Reply-To:
Message-ID:

Switching over to the dev list...

Tony Meyer wrote:
>> If we are, in fact, talking about the Outlook add-in then it is very
>> difficult to do anything besides "train on mistakes".
>
> Except for initial training, when the obvious (from what is
> presented, I
> think) choice is train-on-everything. (i.e the Wizard asks for mail
> you already have stored, and does toe on that.

Since I never run the config wizard myself, I had forgotten about that.

I think most casual users fall into one of two camps: either they have every good message they've ever received still sitting in their Inbox, or they get far more spam than ham. Either way, initial training is likely to result in a significant imbalance. I doubt that most users pay any attention to how many of each type of message are in their initial training set.
Does the Wizard give any kind of warning during initial training if there is a significant imbalance in the selected messages?

The config wizard seems to me to encourage initial training on existing messages. Since training on mistakes and unsures starting from an empty database has proven so effective for most of us, I wonder if it would not be better to recommend that method instead?

> It would be
> interesting - and I will try this when I get time - to get it to do
> tte instead).

I think TTE would be an excellent choice for initial training, far better than train on everything given the likely disparity in the number of available messages of each type.

--
Kenny Pitt

From sethg at GoodmanAssociates.com Thu Aug 12 16:59:45 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Thu Aug 12 16:59:49 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
In-Reply-To:
Message-ID:

> From: Kenny Pitt
> Sent: Wednesday, August 11, 2004 10:06 AM

<...>

> Unfortunately, automatic training is not as easy as it would
> seem in the Outlook add-in. The add-in uses a different
> trainer so that it can keep track of the Outlook id's of
> trained messages, rescore trained messages, etc. This makes
> it difficult to call the trainer from inside the classifier
> because the classifier is shared with the other SpamBayes apps such as
> sb_server. I'm sure there are some simple modifications that
> could be made to the code structure so that this could be
> implemented, but I just haven't found the time yet to work
> out the details.

How about something dumb and ugly? When a user trains a message and Spambayes sees that there is an imbalance in the training set sizes, put up a text info box recommending that the user train on N ham (or spam) to keep Spambayes performing well. The user selects the messages, moves them to the unsure folder and trains as appropriate. I did say it was dumb and ugly.
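
As a rough sketch of what the "dumb and ugly" info box would need to compute, the function below works out N from the two training counts. The function name `balance_advice`, the 2:1 limit and the message wording are all assumptions made for illustration; nothing here is taken from the Outlook add-in.

```python
import math


def balance_advice(nham, nspam, max_ratio=2.0):
    """Return a suggestion string, or None if the counts are within max_ratio.

    Hypothetical helper: max_ratio=2.0 is an illustrative limit only.
    """
    small, large = sorted((nham, nspam))
    if large <= max_ratio * small:
        return None  # balanced enough, say nothing
    # Smallest size of the minority class that brings the ratio back
    # to max_ratio, minus what has already been trained.
    needed = math.ceil(large / max_ratio) - small
    kind = "ham" if nham < nspam else "spam"
    return ("Training is imbalanced (%d ham, %d spam); "
            "consider training on %d more %s messages."
            % (nham, nspam, needed, kind))


print(balance_advice(100, 150))   # within 2:1, so no advice
print(balance_advice(100, 500))   # suggests more ham
print(balance_advice(500, 100))   # suggests more spam
```

The check would run after each manual train, and the returned text is what the info box would display; picking which messages to train on is still left to the user, as in the proposal.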

This would be easier if you could train on a correctly classified message without moving it to the unsure folder. At present, there is no "train as good" button in the ham folder and no "train as spam" in the spam folder. That might be a nice addition anyway.

As you say, automating this is not easy. There are no folders of confirmed ham or spam in the Outlook implementation to choose among. Using the dumb and ugly (tm) method, the additional ham or spam the user selects to train are manually selected and are actually ham or spam. The text box could further suggest that they train on messages that scored furthest from perfect classification.

--
Seth Goodman

From rmalayter at bai.org Thu Aug 12 17:38:18 2004
From: rmalayter at bai.org (Ryan Malayter)
Date: Thu Aug 12 17:38:19 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A02B047DC@cliff.bai.org>

[Seth Goodman]
>As you say, automating this is not easy. There are no folders of
>confirmed ham or spam in the Outlook implementation to choose among.
>Using the dumb and ugly (tm) method, the additional ham or
>spam the user
>selects to train are manually selected and are actually ham or spam.
>The text box could further suggest that they train on messages that
>scored furthest from perfect classification.

I think there *are* 99% confirmed classification folders: read (or older than x days) messages in the "watch" folders, and read (or older than x days) messages in the spam folder.

When I get a ham/spam imbalance, and need more hams trained, I do the same thing. I sort my Outlook inbox by the spam column, and find the untrained "edge cases" to train on. That is, I find the hams that scored just under my threshold (20%) and train on them.

It would seem to me that this process could be automated. We have a list of folders Spambayes is watching, which presumably contain ham.
Spambayes knows where it stores spam, and we know which messages we've already trained on. The Outlook plug-in could just check the training balance ratio each time it runs, and if it exceeds 1.5 or something, it could go out and find more stuff to train on to even the load.

Of course, this will not work for people who don't keep around at least a few hams/spams that were classified correctly. I don't know how to solve that issue, other than to have the "autobalance" abort with an error to the user, or simply do nothing.

I'd like to try my hand at Python and contribute this enhancement myself, to see how it works. But I'm not very familiar with the spambayes code base. Any idea which module/classes I should be looking at for starters?

Regards,
Ryan

From erica55 at bol.com.br Fri Aug 13 16:34:56 2004
From: erica55 at bol.com.br (Erica Silveira)
Date: Fri Aug 13 16:34:50 2004
Subject: [spambayes-dev] Direct mail by e-mail - The best email lists
Message-ID: <20040813143448.4C1561E4426@bag.python.org>

Direct mail by e-mail. Selected registries. The best e-mail lists, selected by state, activity and profession. Up-to-date lists for direct mail via e-mail marketing. Visit http://www.promonet.mx.gs Highly selected registries for promoting products by email marketing. E-mail lists and free programs for promotion via electronic mail. Direct mail by e-mail. Visit now: http://www.promonet.mx.gs

From popiel at wolfskeep.com Fri Aug 13 07:17:47 2004
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Aug 14 00:32:54 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
In-Reply-To: Message from "Tony Meyer" of "Thu, 12 Aug 2004 19:47:12 +1200."
References:
Message-ID: <20040813051747.23AED2DE7F@cashew.wolfskeep.com>

In message: "Tony Meyer" writes:
>True, train-on-mistakes might not reduce the imbalance compared to
>train-on-everything.
>This would only be true if the percentage of
>mistakes that are spam is lower than the percentage of incoming mail
>that is spam. I should really have used "might" instead of "should"
>there. In some cases, it will, however.
>
>The imbalance almost certainly will grow less quickly, though, because
>the database size will grow much, much slower.

I don't believe this assertion. Sure, the raw counts of the trained ham and spam will grow more slowly, but the relative imbalance will (based on observation of my mail data) grow even more rapidly in reduced-training regimes.

I use TOAE (train-on-almost-everything, or all spam < .995 and all ham > .005), not TOM, but I can say that the imbalance in my training set is significantly higher than the imbalance in my incoming mail. Specifically, in the retrain from last night (using the last 4 months of my incoming mail):

Total:   3367 ham, 33851 spam (90.95% spam)
Trained:  225 ham, 17421 spam (98.72% spam)

Also, I have in that time period:

Unsure:    83 ham,  8843 spam (99.07% spam)
Errors:     2 fp,    872 fn   (99.77% spam)

This shows that if I was training on errors, my imbalance would be even worse than it currently is. (It also shows that I really need to tune my cutoffs - they're currently at the defaults.)

I have in the past suggested that the ideal imbalance is related to the number of distinct 'topics' in each category. For instance, there's only about 4 topics in my ham: spambayes discussions, pennmush discussions, administrative mails, and idle chatter from my friends. On the other hand, there's many more topics in my spam: delivery errors caused by virus joe-jobs, sexual enhancement, mortgage loans, weight reduction, Nigerian-style scams, must-have lawn ornaments, stock pick of the scammer, chain letters, this-is-not-a-marketing-pyramid, etc. Unfortunately, I don't have enough AI knowledge to rigorously categorize these and quantify the relationship between topics and training imbalance.
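
The spam percentages quoted earlier in this message follow directly from the raw counts. A small sketch that recomputes them (the counts are the ones given in the message; the function name `spam_percentage` is just an illustrative choice, and false positives are counted as ham, false negatives as spam):

```python
def spam_percentage(ham, spam):
    """Percentage of messages in a set that are spam."""
    return 100.0 * spam / (ham + spam)


# (ham, spam) counts as reported in the message; for "Errors" the
# 2 false positives are ham and the 872 false negatives are spam.
counts = {
    "Total": (3367, 33851),
    "Trained": (225, 17421),
    "Unsure": (83, 8843),
    "Errors": (2, 872),
}

for label, (ham, spam) in counts.items():
    print("%-8s %6.2f%% spam" % (label, spam_percentage(ham, spam)))
```

Comparing the "Trained", "Unsure" and "Errors" rows against "Total" is the whole argument: each reduced-training subset is more spam-heavy than the incoming mail it was drawn from.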

- Alex

From tim.peters at gmail.com Sat Aug 14 05:56:42 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Sat Aug 14 05:56:51 2004
Subject: [spambayes-dev] RE: [Spambayes] question regarding training
In-Reply-To: <20040813051747.23AED2DE7F@cashew.wolfskeep.com>
References: <20040813051747.23AED2DE7F@cashew.wolfskeep.com>
Message-ID: <1f7befae04081320566873d6dd@mail.gmail.com>

[T. Alexander Popiel]
> ...
> I have in the past suggested that the ideal imbalance is related to
> the number of distinct 'topics' in each category. For instance,
> there's only about 4 topics in my ham: spambayes discussions, pennmush
> discussions, administrative mails, and idle chatter from my friends.
> On the other hand, there's many more topics in my spam: delivery
> errors caused by virus joe-jobs, sexual enhancement, mortgage loans,
> weight reduction, nigerian-style scams, must-have lawn ornaments,
> stock pick of the scammer, chain letters, this-is-not-a-marketing-pyramid,
> etc.
...

Just FYI, I've heard several anecdotal reports that the N-way classical Bayesian classifier POPFile (which is a good one) does a better job at catching spam if you indeed create several distinct spam categories (porn spam, mortgage spam, etc), instead of having one catch-all spam category.

From erica55 at bol.com.br Mon Aug 16 06:20:51 2004
From: erica55 at bol.com.br (Erica Silveira)
Date: Mon Aug 16 06:20:40 2004
Subject: [spambayes-dev] e-mail listing
Message-ID: <20040816042034.29A631E4002@bag.python.org>

More Emails, online sale of email lists; we do direct mail and advertising for your company or business to millions of emails.
We have email lists: Direct Mail, Direct-Mail, Email Registry, Email List, Mailing List, Millions of Emails, Email Sending Programs, Email Bombers, Email Extractors, Segmented Email Lists, Segmented Emails, Bulk Emails, E-mails http://www.promonet.mx.gs We have email lists: Direct Mail, Direct-Mail, Email Registry, Email List, Mailing List, Millions of Emails, Email Sending Programs, Email Bombers, Email Extractors, Segmented Email Lists, Segmented Emails, Bulk Emails, E-mails http://www.promonet.mx.gs

From ta-meyer at ihug.co.nz Wed Aug 18 07:09:49 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Wed Aug 18 07:09:54 2004
Subject: [spambayes-dev] Absence
Message-ID:

Hi everyone :)

Just in case anyone wonders why I've suddenly gone silent and stopped answering people on spambayes@python.org, I'm leaving for a business trip/vacation for 3.5 weeks from tomorrow. No email (well, not enough to handle all this mail), so no answering SpamBayes questions.

The positive side is that I'm also not here to check anything in and break the scripts, so there should be no *new* problems.

=Tony Meyer

p.s. If anyone wants to, there isn't much to do to get 1.0 out the door. The sf release is all done, the source archives are uploaded, the website is uploaded (but not pushed to the live site). All that needs to be done is build the binary (needs to be done with Outlook 2k), upload it, make the release visible, and send out the announcement emails (remembering Anthony's suggestion to make a big deal of the exhaustive alpha/beta cycle). You might be able to get Mark to build the binary, but I haven't managed to get hold of him recently.

p.p.s. 1.0 is no different to 1.0rc2, so there shouldn't be any new bugs then, either.

From groups.02 at kavalec.com Thu Aug 19 15:33:18 2004
From: groups.02 at kavalec.com (G. Waleed Kavalec)
Date: Thu Aug 19 15:32:12 2004
Subject: [spambayes-dev] The next generation?
In-Reply-To:
Message-ID: <20040819133210.BBD9F1E4003@bag.python.org>

DNA technique protects against 'evil' emails

http://www.newscientist.com/news/print.jsp?id=ns99996292

11:34 19 August 04

Exclusive from New Scientist Print Edition. Subscribe and get 4 free issues.

A technique originally designed to analyse DNA sequences is the latest weapon in the war against spam. An algorithm named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits) can catch nearly 97 per cent of spam.

Chung-Kwei is based on the Teiresias algorithm, developed by the bioinformatics research group at IBM's Thomas J Watson Research Center in New York, US. Teiresias was designed to search different DNA and amino acid sequences for recurring patterns, which often indicate genetic structures that have an important role.

Instead of chains of characters representing DNA sequences, the research group fed the algorithm 65,000 examples of known spam. Each email was treated as a long, DNA-like chain of characters. Teiresias identified six million recurring patterns in this collection, such as "Viagra".

Each pattern represented a common sequence of letters and numbers that had appeared in more than one unsolicited message. The researchers then ran a collection of known non-spam (dubbed "ham") through the same process, and removed the patterns that occurred in both groups.

Genuine email

Incoming email was given a score based on how many spam patterns it had. A long email that only had a few spammy sentences would get a relatively low score; but one with many patterns spread across the length of the message would score much higher.

The Chung-Kwei correctly identified 64,665 of 66,697 test messages as being spam or 96.56 per cent. More importantly, its rate of misidentifying genuine email as spam was just 1 in 6000 messages. Losing a single email in a torrent of spam is a greater failing in a filter than letting the occasional spam email through.
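
The learn-then-score procedure described above can be illustrated with a deliberately simplified toy. This is not Teiresias or Chung-Kwei: pattern discovery is replaced by fixed-length character n-grams, and the function names and sample messages are hypothetical. It keeps only the steps the article spells out: collect patterns seen in more than one spam, drop any that also occur in ham, then score a message by how many surviving patterns it contains.

```python
from collections import Counter


def ngrams(text, n):
    """Set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def learn_patterns(spam_msgs, ham_msgs, n=6):
    """Keep n-grams that occur in more than one spam and in no ham."""
    counts = Counter()
    for msg in spam_msgs:
        counts.update(ngrams(msg.lower(), n))  # one count per message
    ham_grams = set()
    for msg in ham_msgs:
        ham_grams |= ngrams(msg.lower(), n)
    return {g for g, c in counts.items() if c > 1 and g not in ham_grams}


def score(message, patterns, n=6):
    """Number of distinct learned spam patterns found in the message."""
    return len(ngrams(message.lower(), n) & patterns)


# Tiny hypothetical corpora, for illustration only.
spam = ["Buy cheap viagra now", "cheap viagra online today"]
ham = ["Lunch today at noon?", "The patch is checked in now"]
patterns = learn_patterns(spam, ham)

print(score("cheap viagra here", patterns))
print(score("see you at lunch", patterns))
```

A real system would normalise the raw count in some way and pick a cutoff from test data; those details are not given in the article and are left out here.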
Chung-Kwei deals with common spammer strategies to dodge pattern-recognition schemes, such as replacing the s with a $, as in "increa$e your $ex power" using its built-in tolerance for different, but functionally equivalent, DNA sequences. Just as in genetic analysis, Teiresias could be taught that CCC and CCU codons both produce the same amino acid, proline, the anti-spam system can be trained to accept $ and s as identical. IBM intends to include Chung-Kwei in its commercial product, SpamGuru. Justin Mason, who developed SpamAssassin, one of the most popular open-source anti-spam filters, says that Chung-Kwei looks promising. "I think there is still a lot of work to be done. But what is exciting is not the particular algorithm, but the fact that IBM has shown there is the entire field of bioinformatics techniques to explore in the fight against spam." Danny O'Brien, San Jose From jccarv at hotmail.com Sun Aug 22 15:18:03 2004 From: jccarv at hotmail.com (John Stovas) Date: Sun Aug 22 15:18:04 2004 Subject: [spambayes-dev] Olimpics Message-ID: <20040822131802.C6E741E4009@bag.python.org> An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040822/13dbeb09/attachment.html From paulagrassi203 at hotmail.com Mon Aug 23 18:16:14 2004 From: paulagrassi203 at hotmail.com (Paula O. 
Grassi)
Date: Mon Aug 23 18:16:16 2004
Subject: [spambayes-dev] =?iso-8859-1?q?email=2C_mala_direta=2C_propagand?= =?iso-8859-1?q?a_e-mail=2C_marketing_por_e-mails_listas_de_divulga?= =?iso-8859-1?q?=E7=E3o?=
Message-ID: <20040823161615.07C971E4012@bag.python.org>

Visit now: http://www.divulgamail.mx.gs

direct mail by e-mail, e-mail marketing, e-mail advertising, e-mail campaigns, e-mail registries, e-mail lists, e-mail programs, e-mails by region, e-mails by state, e-mails by city, segmented e-mails, segmented lists, anonymous e-mail sending, bulk e-mail software, Internet spam software, Internet advertising software, digital marketing, digital advertising, promote your product, promote your advertising, segmented direct mail, Internet direct mail, spam

From Firmicus at ankabut.net Wed Aug 25 18:22:12 2004
From: Firmicus at ankabut.net (Firmicus@ankabut.net)
Date: Wed Aug 25 18:22:16 2004
Subject: [spambayes-dev] Chung-Kwei algorithm (from BBC news)
Message-ID: <20040825162212.DA55F7EE9@mail.udag.de>

Hello spambayes developers,

Heard of this yet?

Regards,
F

========

'DNA analysis' spots e-mail spam

By Jo Twist

BBC News Online science and technology staff

Few would have thought that when Crick and Watson discovered DNA, it would help in making a tool to fight spam.

But computational biologists at IBM's TJ Watson Research Center have devised an anti-spam filter based on the way scientists analyse genetic sequences.

Called after Feng Shui character Chung-Kwei, the formula automatically learns patterns of spam vocabulary and has proved to be 96.5% efficient.

In tests, the filter only misidentified one message in 6,000 as spam.

Pillar of protection

Isidore Rigoutsos and Tien Huynh, at IBM's bioinformatics and pattern discovery research group, started to develop the formula - or algorithm - a little over a year ago.

They named the formula, Chung-Kwei, after a Feng Shui character who is usually shown carrying a bat, and also holds a sword behind him.

He is an important figure for those involved in business and who have expensive goods that need protection.

Chung-Kwei grew out of another algorithm called Teiresias which the researchers were using for pattern discovery in computation biology sequencing, specifically, in protein annotation.

"To train 88,000 messages takes about 15 minutes on a normal single processor. If, an hour, later we have more spam we can add to the collection so we keep on learning more and more"
-- Isidore Rigoutsos, IBM

The algorithm helped in automatically determining the properties of a protein, like function and structure, directly from a string.

"Obviously algorithms that pertain to pattern discovery are applicable to a vast range of problems," Mr Rigoutsos explained to BBC News Online.

Instead of looking at strings of protein, Chung-Kwei uses Teiresias to identify strings of character sequences which appear in spam, but never in non-spam mail.

Their work, said Mr Rigoutsos, was helped by the large volume of spam which they received at their own workplace.

"We have lots of e-mails that we know are bona fide spam. If we run a pattern analysis on those, it can see letters that appear frequently.

"One of the properties of the algorithm is that it will spot two or more occurrences. It doesn't matter where it is in the message.

"If you do this, effectively you get small collections of letters so you can think of these as a vocabulary of sorts. If you have lots of data to work with, your vocabulary will be able to describe the data in a different form."

Spam training

The algorithm can be trained so that it will not be fooled by cunning replacements of "S" with "$", a common ploy used by spammers to bypass conventional e-mail filters.

The Chung-Kwei method builds up its database of known true-spam patterns and constantly adds new patterns it spots.

It compares its vocabulary to e-mails which it knows do not contain spam. So, an incoming message hit with this pattern analysis will be rejected if it contains a large proportion of the same vocabulary patterns.

If a message received had a lot of spam patterns in it, it was scored highly. Chung-Kwei succeeded in spotting almost 97% of junk mails.

"We experimented with large collections of e-mail. We have 66,000 training messages that are all spam and 22,000 training messages that are all 'white' [non-spam].

"To train 88,000 messages takes about 15 minutes on a normal single processor. If, an hour, later we have more spam, we can add to the collection so we keep on learning more and more."

Various anti-spam software use several techniques to spot and kill junk mail, but IBM believes the Chung-Kwei algorithm to be the only anti-spam tool that uses pattern discovery in this way.

Some tools look at the route an e-mail has taken and its origins; others involve identity verification and black and white listing of accepted and not accepted addresses.

Others use Bayesian combinations of individual words that statistically make up spam messages.

The system has to go through some more pilot studies and testing before it is let loose to protect inboxes.

The research was originally reported in the New Scientist magazine.

Story from BBC NEWS:
http://news.bbc.co.uk/go/pr/fr/-/2/hi/technology/3584534.stm

Published: 2004/08/25 09:38:12 GMT

© BBC MMIV

From sjoerd at acm.org Thu Aug 26 10:48:19 2004
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Thu Aug 26 10:48:28 2004
Subject: [spambayes-dev] crash from sb_imapfilter.py
Message-ID: <412DA3D3.3020900@acm.org>

Skipped content of type multipart/mixed
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 374 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040826/398588c7/signature.pgp

From gtoal at gtoal.com Thu Aug 26 15:46:45 2004
From: gtoal at gtoal.com (Graham Toal)
Date: Thu Aug 26 15:39:42 2004
Subject: [spambayes-dev] Re: Chung-Kwei algorithm (from BBC news)
Message-ID: <412DE9C5.mailIBV1Z5V89@gtoal.com>

Firmicus@ankabut.net quoted a press release:

> 'DNA analysis' spots e-mail spam
>
> By Jo Twist
>
> BBC News Online science and technology staff
>
>
> Few would have thought that when Crick and Watson discovered DNA,
> it would help in making a tool to fight spam.
>
> But computational biologists at IBM's TJ Watson Research Center
> have devised an anti-spam filter based on the way scientists analyse
> genetic sequences.
>
> Called after Feng Shui character Chung-Kwei, the formula automatically
> learns patterns of spam vocabulary and has proved to be 96.5% efficient.
>
> In tests, the filter only misidentified one message in 6,000 as spam.
>
> Pillar of protection
>
> Isidore Rigoutsos and Tien Huynh, at IBM's bioinformatics and pattern
> discovery research group, started to develop the formula - or algorithm
> - a little over a year ago.
>
> They named the formula, Chung-Kwei, after a Feng Shui character who is
> usually shown carrying a bat, and also holds a sword behind him.
>
> He is an important figure for those involved in business and who have
> expensive goods that need protection.
>
> Chung-Kwei grew out of another algorithm called Teiresias which the
> researchers were using for pattern discovery in computation biology
> sequencing, specifically, in protein annotation.
>
> "To train 88,000 messages takes about 15 minutes on a normal single processor.
> If, an hour, later we have more spam we can add to the collection so we keep
> on learning more and more"
> -- Isidore Rigoutsos, IBM
>
> The algorithm helped in automatically determining the properties of
> a protein, like function and structure, directly from a string.
>
> "Obviously algorithms that pertain to pattern discovery are applicable
> to a vast range of problems," Mr Rigoutsos explained to BBC News Online.
>
> Instead of looking at strings of protein, Chung-Kwei uses Teiresias to
> identify strings of character sequences which appear in spam, but never
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           exactly my algorithm
> in non-spam mail.
>
> Their work, said Mr Rigoutsos, was helped by the large volume of spam
> which they received at their own workplace.
>
> "We have lots of e-mails that we know are bona fide spam. If we run a
> pattern analysis on those, it can see letters that appear frequently.
>
> "One of the properties of the algorithm is that it will spot two or
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> more occurrences. It doesn't matter where it is in the message.
  ^^^^^^^^^^^^^^^^ Exactly what mine does
>
> "If you do this, effectively you get small collections of letters so
> you can think of these as a vocabulary of sorts. If you have lots of
> data to work with, your vocabulary will be able to describe the data
> in a different form."
>
> Spam training
>
> The algorithm can be trained so that it will not be fooled by cunning
> replacements of "S" with "$", a common ploy used by spammers to bypass
> conventional e-mail filters.
>
> The Chung-Kwei method builds up its database of known true-spam
> patterns and constantly adds new patterns it spots.
>
> It compares its vocabulary to e-mails which it knows do not contain
> spam. So, an incoming message hit with this pattern analysis will be
> rejected if it contains a large proportion of the same vocabulary
> patterns.
>
> If a message received had a lot of spam patterns in it, it was scored
> highly. Chung-Kwei succeeded in spotting almost 97% of junk mails.
>
> "We experimented with large collections of e-mail. We have 66,000
> training messages that are all spam and 22,000 training messages that
> are all 'white' [non-spam].
>
> "To train 88,000 messages takes about 15 minutes on a normal single
> processor. If, an hour, later we have more spam, we can add to the
> collection so we keep on learning more and more."
>
> Various anti-spam software use several techniques to spot and kill
> junk mail, but IBM believes the Chung-Kwei algorithm to be the only
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> anti-spam tool that uses pattern discovery in this way.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mine does)
>
> Some tools look at the route an e-mail has taken and its origins;
> others involve identity verification and black and white listing
> of accepted and not accepted addresses.
>
> Others use Bayesian combinations of individual words that statistically
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ie this doesn't)
> make up spam messages.
>
> The system has to go through some more pilot studies and testing
> before it is let loose to protect inboxes.

I published what I am fairly sure is the same algorithm, or at least close enough to invalidate a patent on the grounds of prior art, back in February of this year. I also mailed it to a few prominent researchers like Bill Yerazunis, just to get it on the record. (One of those researchers was a spam researcher at IBM incidentally, though I don't believe he was involved in the project quoted above)

Here's the links, decide for yourself. I'm not interested in taking credit for this, if it is the same algorithm, just in making sure it is not patented.

------------------
] Date: Fri, 18 Jun 2004 12:33:02 -0500
] From: Graham Toal
] To: mertz@gnosis.cx
] Subject: spam algorithms
] Message-ID: <40D3274E.mailN3311EYMT@gtoal.com>
] User-Agent: nail 10.5 4/27/03
] MIME-Version: 1.0
] Content-Type: text/plain; charset=us-ascii
] Content-Transfer-Encoding: 7bit
]
] I was searching the IBM search engine for spam algorithms and
] found your page:
]
] > . Bayesian trigram filters
] > I decided to look into how well a much more starkly limited
] > model space would work for a Bayesian spam filter.
] > Specifically, I decided to use trigrams for my probability
] > model rather than "words".
]
] Have a look at some notes I wrote up earlier this year:
] http://www.gtoal.com/mt/archives/2004_02.html
]
] You may find this approach interesting. It's sort of like your idea but
] taken to the extreme. Subsequent to writing the description in the above
] link, I have actually hacked up some prototype code and the idea does show
] promise. I've not had the resources to make a fully usable spam filter
] out of it, but it's a convincing proof of concept.
] ( http://www.gtoal.com/spam/devel-temp/tokra3.c.html )
]
] I hope you don't mind me mailing this to you unsolicited, but I want
] a few people working in the field to be aware of it as I think it is
] original and by making it public I can forestall someone reinventing
] it later and claiming a bogus patent on it!
]
]
] Regards
]
] Graham Toal
---------------------------

I am BCC'ing this post to Isidore Rigoutsos and Tien Huynh for their opinion as to whether their "pattern discovery" algorithm and my "token recognition" algorithm are the same algorithm (and to ask as to whether IBM is attempting to patent it...)

G

From jronen at stern.nyu.edu Sun Aug 29 20:42:43 2004
From: jronen at stern.nyu.edu (Joshua Ronen)
Date: Sun Aug 29 20:42:51 2004
Subject: [spambayes-dev] problem with "junk suspects"
Message-ID:

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Joshua Ronen (jronen@stern.nyu.edu).vcf
Type: text/x-vcard
Size: 401 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040829/7dd101f7/JoshuaRonenjronenstern.nyu.edu.vcf

From janeaustine50 at hotmail.com Tue Aug 31 14:54:59 2004
From: janeaustine50 at hotmail.com (Austine Jane)
Date: Tue Aug 31 14:55:02 2004
Subject: [spambayes-dev] The naive bayes classifier algorithm in spambayes doesn't take in frequency?
Message-ID:

Hello.

I have a question on the naive bayes classifier algorithm used in spambayes.

I suppose if word1 appeared in three ham mail, the probability of word1 being in ham mail would be greater than when it appeared in one ham mail:

(using spambayes-1.0rc2)

>>> c=storage.DBDictClassifier('test.db')
>>> def tok(s): return s.split()
>>> c.learn(tok('word1'),is_spam=False)
>>> c.spamprob(tok('word1'))
0.15517241379310343
>>> c.learn(tok('word1'),False)
>>> c.spamprob(tok('word1'))
0.091836734693877542
>>> c.learn(tok('word1'),False)
>>> c.spamprob(tok('word1'))
0.065217391304347783

As you see the spam probability declines. So far so good.

>>> c.learn(tok('word1'),True)

And word1 also appeared in one spam mail, but it appeared in three ham mail before.

>>> c.spamprob(tok('word1'))
0.5

Hm... Sounds not very right.

>>> c.learn(tok('word1'),False)
>>> c.spamprob(tok('word1'))
0.5

Stays still.

This doesn't sound intuitive. For example, word1 occurred in 1000 spam email and occurred in 1 ham mail. What is the probability of one mail that contains word1 being spam mail? Half and half?

Doesn't it take in the number of occurrences (it does seem to take in the number of distinct tokens though)?

It seems like the concept of the number of occurrences and the number of distinct tokens are mixed in spambayes' classifier.

Machine Learning by Tom Mitchell (esp. page 183 and http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html; see also the p.s.) suggests a formula that gives a quite different result from spambayes' classifier. It always takes in the number of occurrences, hence more intuitive.

Am I missing something big?

Thanks in advance,

Jane

-------------

p.s. The formula in Tom Mitchell's book is:

Vocabulary is the set of all distinct words and other tokens occurring in any text document from Examples

For each target value v_j in V do
  * docs_j is the subset of documents from Examples for which the target value is v_j
  * P(v_j) = |docs_j| / |Examples|
  * Text_j is a single document created by concatenating all members of docs_j
  * n is the total number of distinct word positions in Text_j
  * for each word w_k in Vocabulary
      * P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)
      * n_k is the number of times word w_k occurs in Text_j

As you see it uses m-estimate.

From heli at helimodels.com Tue Aug 31 19:38:38 2004
From: heli at helimodels.com (John Moriarty)
Date: Tue Aug 31 19:42:05 2004
Subject: [spambayes-dev] message subject filtering
Message-ID: <001d01c48f81$59fe2ec0$2101a8c0@user>

I'm not a programmer. I have asked a similar question before, but recently, mounting spams have made me alter it significantly, and so I hope it worthy of reconsideration.

A lot of spam shows:

  * Ungrammatical and/or irrelevant wording
  * Random words
  * Gibberish words
  * Deliberately weird or obscure punctuations
  * Since this is true in the header as well as the text body, this potentially reduces the loads on the filter.

Random words not seen before seem to allow stuff through more easily.
Therefore the presence of these certain features (I don't know for sure if they fall under the definition of tokens) is a high-probability signal. Is this what's new: error signals computed from the entire (or a substantial subset of) message?

  * I also note spam outnumbers ham by up to 100 to one, so header filtering seems good at throwing up warnings.

And invariably the text body contains the web address of the seller, so a web address of itself is a giveaway.

I am fast at identifying spam by the header alone, using the above observations I reckon I spot 90% plus in a blink. However it's still a pain rubbing them out.

It seems to me that application of rules based on the above would be a more sophisticated way of developing spambayes. I think the analysis of text would then focus in better on the more subtle forms of spam, using the tokens to greater effect.

Apologies if any of this is rubbish or goes against the theories.

Kind regards,

John Moriarty
(+353) (0)87 2833 530
www.helimodels.com

From kennypitt at hotmail.com Tue Aug 31 20:57:43 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Aug 31 20:57:49 2004
Subject: [spambayes-dev] message subject filtering
In-Reply-To: <001d01c48f81$59fe2ec0$2101a8c0@user>
Message-ID:

John Moriarty wrote:
> A lot of spam shows:
>
> * Ungrammatical and/or irrelevant wording
> * Random words
> * Gibberish words
> * Deliberately weird or obscure punctuations
> * Since this is true in the header as well as the text body, this
> potentially reduces the loads on the filter.

Are you interested in this because you want to analyze just the headers and not download the entire message if it is determined to be spam? If so, there are other issues besides whether or not we can successfully identify the spam just based on the headers. In the case of the Outlook Add-in, Outlook has already downloaded the message by the time we are told about it. In the case of the POP3 proxy (sb_server), discarding a message that you have partially processed is problematic because the e-mail client is already aware that the message exists and will sometimes get confused if we refuse to give it any data.

> Random words not seen
> before seem to allow stuff through more easily.

In the case of SpamBayes, this is not true. SpamBayes assigns a probability of 0.5 to any word that it hasn't been trained on, and then discards any words that have a probability between 0.4 and 0.6 before calculating the spam score. Because SpamBayes ignores these words, they have absolutely no effect, either positive or negative, on the classification of the message.

The only time that random words have an effect on the classification is if the spammer happens to hit on some words that you *have* seen before. If those words have only been seen in spam messages then it only *increases* the probability that the message will be properly identified as spam. It is very rare for the spammer to stumble across a significant number of words that you have trained as hammy, and even then there aren't usually enough of them to outweigh the other spammy clues in the message.

> * I also note spam outnumbers ham by up to 100 to one

Maybe for you, but not necessarily for everyone. While it does seem that most people these days are receiving more spam than good messages, there are still some people (someone who is extremely active on a lot of high-volume mailing lists, possibly) that get far more ham than spam. SpamBayes needs to work equally well regardless of the ratio of ham vs. spam that a particular user receives.

> And invariably the text body contains the web address of the seller,
> so a web address of itself is a giveaway.

SpamBayes has an option that will break up URLs and create clues from the domain name, directory names, etc. If a particular domain is used a lot in spam then that will become a spam clue. The mere presence of a URL in the message is not a good indicator of spam in general. I receive a lot of legitimate mail such as developer newsletters that contain lots of URLs.

> I am fast at identifying spam by the header alone, using the above
> observations I reckon I spot 90% plus in a blink.

The human brain has a capacity for learning and detecting patterns in the text that far exceeds what SpamBayes can ever be capable of. In most cases, however, SpamBayes can probably process the entire message in less time than you can process just the header. The more information SpamBayes has at its disposal, the less likely it is to make a mistake and toss an important message into your spam folder.

--
Kenny Pitt

From tim.peters at gmail.com Tue Aug 31 21:58:53 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Tue Aug 31 21:58:59 2004
Subject: [spambayes-dev] message subject filtering
In-Reply-To: <001d01c48f81$59fe2ec0$2101a8c0@user>
References: <001d01c48f81$59fe2ec0$2101a8c0@user>
Message-ID: <1f7befae04083112581b5ab627@mail.gmail.com>

[John Moriarty]
> ...
> And invariably the text body contains the web address of the seller,
> so a web address of itself is a giveaway.
...
> www.helimodels.com
...
> spambayes-dev mailing list
> spambayes-dev@python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev

That is, if what you say is true, then every message from you, and every message on this mailing list, is spam. It's truly not that simple -- although maybe I'm not guessing correctly at what "web address of itself" means. Since you posted from heli@helimodels.com, I counted www.helimodels.com as "a web address of itself", and similarly for the footer added to spambayes-dev email.

BTW, we ran controlled experiments on subject-only classification in the early days of the project. That showed there is a lot of info in subject lines, but not enough to make a good classifier. For example, lots of spam has subject "Hello".
So do lots of inquiries from prospective employers.

From kennypitt at hotmail.com Tue Aug 31 23:36:18 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Aug 31 23:36:27 2004
Subject: [spambayes-dev] The naive bayes classifier algorithm in spambayes doesn't take in frequency?
In-Reply-To:
Message-ID:

Austine Jane wrote:
> I have a question on the naive bayes classifier algorithm used in
> spambayes.
>
> I suppose if word1 appeared in three ham mail, the probability of
> word1 being in ham mail would be greater than when it appeared in one
> ham mail:

That depends. The statistics are based on the fraction of ham mail that the word appeared in, not just the absolute number of times it has occurred. A word that appeared in 1 ham message out of 1 would have the same probability as a word that appeared in 3 messages out of 3. A word that appeared in 1 message out of 3 would have a lower probability.

>>>> c=storage.DBDictClassifier('test.db')
>>>> def tok(s): return s.split()
>>>> c.learn(tok('word1'),is_spam=False)
>>>> c.spamprob(tok('word1'))
> 0.15517241379310343
>>>> c.learn(tok('word1'),False)
>>>> c.spamprob(tok('word1'))
> 0.091836734693877542
>>>> c.learn(tok('word1'),False)
>>>> c.spamprob(tok('word1'))
> 0.065217391304347783
>
> As you see the spam probability declines. So far so good.
>
>>>> c.learn(tok('word1'),True)
>
> And word1 also appeared in one spam mail, but it appeared in three
> ham mail before.
>
>>>> c.spamprob(tok('word1'))
> 0.5
>
> Hm... Sounds not very right.
>
>>>> c.learn(tok('word1'),False)
>>>> c.spamprob(tok('word1'))
> 0.5
>
> Stays still.
>
> This doesn't sound intuitive.

Assuming that you started from a clean database file and the training shown in your example is the only training you've done, then this is exactly right. If you go on to train the word as ham 1000 more times, you'll still get 0.5.
Here's why:

The base probability for a word is based on ratios:

    p = spamratio / (spamratio + hamratio)

where spamratio is the number of spam messages that contained the word divided by the total number of spam messages, and hamratio is the same but using only the ham messages.

After training the word 3 times as ham, you had a hamratio of 3 / 3 = 1.0. You had no spam messages, so your spamratio was 0. This leads to:

    p = 0 / (0 + 1) = 0

Because a word that has been seen only a few times is not a good predictor, an adjustment is made to the base probability based on the total number of messages that contained the word:

    n = hamcount + spamcount = 3 + 0 = 3
    adj_p = ((S * X) + (n * p)) / (S + n)

where S and X are constants. S is the "unknown word strength" with a default value of 0.45, and X is the "unknown word probability" with a default value of 0.5 (these are configurable in SpamBayes). When you apply this adjustment you can see how p = 0 becomes the 0.0652 that you saw, and also why the value was slightly higher when you had 1 and 2 messages instead of 3. You can also see that as n approaches infinity, the constant factors of (S * X) and S become irrelevant, the n terms on top and bottom cancel out, and you are left with p.

Now as soon as you trained the first instance of the word as spam, your spamratio became 1 / 1 = 1 also, so your base p becomes:

    p = 1.0 / (1.0 + 1.0) = 0.5

From this point forward, as long as you train only this one word both your hamratio and spamratio will always be 1 and p will always be 0.5. If you train some different words and then calculate the spamprob of this word again, then you will see it start to change from 0.5.

> For example, word1 occurred in 1000
> spam email and occurred in 1 ham mail. What is the probability of one
> mail that contains
> word1 being spam mail? Half and half?

Yes, the probability is 0.5 as long as it appeared in 1000 out of 1000 spam mails and 1 out of 1 ham mail as in your example above.
If, on the other hand, the word appeared in 1000 out of 1000 spams and 1 out of 1000 hams then the spam probability would be very different, approximately 0.999.

> Doesn't it take in the number
> of occurrences (it does seem to take in the number of distinct tokens
> though)? It seems like the concept of the number of occurrences and
> the number of distinct tokens are mixed in spambayes' classifier.

No, it only counts each word once in a single mail message. The original Paul Graham scheme (http://www.paulgraham.com/spam.html) from which SpamBayes evolved counted the total number of occurrences of the word, but early testing of SpamBayes showed that accuracy was better if we considered only the number of messages that contained the word and not the total number of times that the word appeared.

--
Kenny Pitt

From tim.peters at gmail.com Tue Aug 31 23:51:54 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Tue Aug 31 23:52:00 2004
Subject: [spambayes-dev] The naive bayes classifier algorithm in spambayes doesn't take in frequency?
In-Reply-To:
References:
Message-ID: <1f7befae0408311451748ff439@mail.gmail.com>

[Austine Jane]
...
>> Doesn't it take in the number of occurrences (it does seem to take in the
>> number of distinct tokens though)? It seems like the concept of the number of
>> occurrences and the number of distinct tokens are mixed in spambayes'
>> classifier.

[Kenny Pitt]
> No, it only counts each word once in a single mail message. The original
> Paul Graham scheme (http://www.paulgraham.com/spam.html) from which
> SpamBayes evolved counted the total number of occurrences of the word, but
> early testing of SpamBayes showed that accuracy was better if we considered
> only the number of messages that contained the word and not the total number
> of times that the word appeared.

Graham's scheme was actually schizophrenic in this respect: it counted duplicates as if distinct during training, but not during scoring.
There's more explanation in the comment block preceding our classifier.py's Classifier._add_msg() method:

    # NOTE: Graham's scheme had a strange asymmetry: when a word appeared
    # n>1 times in a single message, training added n to the word's hamcount
    # or spamcount, but predicting scored words only once. Tests showed
    # that adding only 1 in training, or scoring more than once when
    # predicting, hurt under the Graham scheme.
    # This isn't so under Robinson's scheme, though: results improve
    # if training also counts a word only once. The mean ham score decreases
    # significantly and consistently, ham score variance decreases likewise,
    # mean spam score decreases (but less than mean ham score, so the spread
    # increases), and spam score variance increases.
    # I (Tim) speculate that adding n times under the Graham scheme helped
    # because it acted against the various ham biases, giving frequently
    # repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
    # adding only once in training, a word like that was simply ignored until
    # it appeared in 5 distinct training spams. Without the ham-favoring
    # biases, though, and never ignoring words, counting n times introduces
    # a subtle and unhelpful bias.
    # There does appear to be some useful info in how many times a word
    # appears in a msg, but distorting spamprob doesn't appear a correct way
    # to exploit it.

BTW, Gary Robinson's _Linux Journal_ article is still the best explanation of the math SB uses:

    http://www.linuxjournal.com/article.php?sid=6467
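
For anyone who wants to check the arithmetic from this thread by hand, here is a minimal sketch of the per-word calculation Kenny described, with each word counted at most once per message. It is not the SpamBayes code: the TinyClassifier class and its word_prob() method are hypothetical names made up for illustration, and the only values taken from the thread are the two formulas and the defaults S = 0.45 and X = 0.5. It computes the adjusted probability of a single word only, not the combined score of a whole message.

```python
# Illustrative sketch only; TinyClassifier and word_prob are hypothetical
# names, not part of the SpamBayes API.
S = 0.45  # "unknown word strength" default quoted in the thread
X = 0.5   # "unknown word probability" default quoted in the thread


class TinyClassifier:
    def __init__(self):
        self.nham = 0     # total ham messages trained
        self.nspam = 0    # total spam messages trained
        self.counts = {}  # word -> [hamcount, spamcount]

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        # set() makes a word count once per message, however often it repeats.
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[1 if is_spam else 0] += 1

    def word_prob(self, word):
        hamcount, spamcount = self.counts.get(word, (0, 0))
        n = hamcount + spamcount
        if n == 0:
            return X  # never-seen word gets the unknown word probability
        hamratio = hamcount / self.nham if self.nham else 0.0
        spamratio = spamcount / self.nspam if self.nspam else 0.0
        p = spamratio / (spamratio + hamratio)
        # Pull p toward X; the pull weakens as n grows.
        return (S * X + n * p) / (S + n)


c = TinyClassifier()
for _ in range(3):
    c.learn(["word1"], is_spam=False)
print(round(c.word_prob("word1"), 4))  # 0.0652, as in the session above
c.learn(["word1"], is_spam=True)
print(round(c.word_prob("word1"), 4))  # 0.5
```

Training the word as ham any number of further times leaves the result at 0.5, because hamratio and spamratio both stay at 1 until some message without the word is trained.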