From skip@pobox.com Tue Oct 1 00:00:32 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 30 Sep 2002 18:00:32 -0500
Subject: [Spambayes] mining dates?
In-Reply-To: <200209302241.g8UMfgJ08118@localhost.localdomain>
References: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> <200209302241.g8UMfgJ08118@localhost.localdomain>
Message-ID: <15768.55184.227888.264009@12-248-11-90.client.attbi.com>

>> It didn't prove my hypothesis, but may have exposed something as
>> useful. Spam seems to be sent at a fairly constant rate throughout
>> the day, which stands to reason, since it's probably all sent
>> automatically. However, ham definitely seems to be sent
>> predominantly during waking hours (doh!). I'm going to give a little
>> date mining a try.

Anthony> Interesting. I'm not sure it actually buys that much, timezones
Anthony> being what they are. Unless you have evidence that, say, all
Anthony> spam is actually sent by a small team of Belgians, in which
Anthony> case we can just knock out stuff sent during business hours in
Anthony> belgian standard time.

That's why I simply ignored the timezone offset. The points plotted were in local time. As I mentioned in my mail, spam seems to be sent at all hours of the day and night. If anything, a small hamminess would be attributed to messages sent during waking hours.

Skip

From skip@pobox.com Tue Oct 1 00:16:12 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 30 Sep 2002 18:16:12 -0500
Subject: [Spambayes] Here's why "generate_long_skips: False" worked...
Message-ID: <15768.56124.22371.659117@12-248-11-90.client.attbi.com>

I figured out why the false positive I saw was interpreted as text. I had been incorrectly forwarding mail from the itineraries@mojam.com command processor alias (for probably five years or more). This wasn't a big deal in the past because I am the only person who receives such messages, but it was incorrect nonetheless.
Instead of sending the original message out with Resent-*: headers prepended, I sent a new message with the original message as the body, e.g.: From itin@manatee.mojam.com Tue Sep 24 15:34:42 2002 Return-Path: Received: from manatee.mojam.com (localhost [127.0.0.1]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id g8OKYf0F013847 for ; Tue, 24 Sep 2002 15:34:41 -0500 Received: (from itin@localhost) by manatee.mojam.com (8.12.1/8.12.1/Submit) id g8OKYfxH013839; Tue, 24 Sep 2002 15:34:41 -0500 Message-Id: <200209242034.g8OKYfxH013839@manatee.mojam.com> From: itin@manatee.mojam.com To: skip@mojam.com Subject: New Itinerary: "nancy fly artist's tour dates" from mg@nflyagency.com Date: Tue, 24 Sep 2002 15:34:41 -0500 Return-Path: Received: from txsmtp02.texas.rr.com (smtp2.texas.rr.com [24.93.36.230]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id g8OKYG0F013791 for ; Tue, 24 Sep 2002 15:34:17 -0500 Received: from [192.168.0.4] (cs24342-228.austin.rr.com [24.243.42.228]) by txsmtp02.texas.rr.com (8.12.5/8.12.2) with ESMTP id g8OKXano027834; Tue, 24 Sep 2002 16:33:36 -0400 (EDT) User-Agent: Microsoft-Outlook-Express-Macintosh-Edition/5.02.2022 Date: Tue, 24 Sep 2002 15:33:56 -0500 Subject: Nancy Fly Artist's Tour Dates From: Martha Guthrie To: Tour Date Recipients Message-ID: Mime-version: 1.0 Content-type: multipart/mixed; boundary="MS_Mac_OE_3115726436_766524_MIME_Part" > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ... I just fixed that piece of code over the weekend. Since I won't be getting any new mail like the above note in the future, I suppose I should purge them from my collection or adjust those messages to have the correct format. So, should I pull the generate_long_skips option back out? 
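The fix Skip describes (resending the original message with Resent-*: headers rather than wrapping it as the body of a brand-new message) can be sketched with the stdlib email package. The addresses below are illustrative, not the real mojam.com plumbing, and note one glossed-over detail: Message.__setitem__ appends headers rather than literally prepending them as RFC 2822 envisions.

```python
# Sketch of the corrected forwarding: keep the original message intact
# and add Resent-* headers, instead of burying it in the body of a new
# message where its headers look like ordinary text to a tokenizer.
from email import message_from_string
from email.utils import formatdate, make_msgid

def resend(raw_message, resent_to):
    msg = message_from_string(raw_message)
    # RFC 2822 section 3.6.6: resent fields are added alongside the
    # original From:/To:/Date:, which stay untouched.
    msg["Resent-From"] = "itin@manatee.mojam.com"  # illustrative address
    msg["Resent-To"] = resent_to
    msg["Resent-Date"] = formatdate()
    msg["Resent-Message-ID"] = make_msgid()
    return msg
```

The point for the classifier is that the forwarded message's original headers stay headers, so they never masquerade as body text.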
Skip

From tim.one@comcast.net Tue Oct 1 01:27:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 20:27:41 -0400
Subject: [Spambayes] Here's why "generate_long_skips: False" worked...
In-Reply-To: <15768.56124.22371.659117@12-248-11-90.client.attbi.com>
Message-ID:

[Skip Montanaro]
> I figured out why the false positive I saw was interpreted as
> text. I had been incorrectly forwarding mail from the
> itineraries@mojam.com command processor alias (for probably five
> years or more). This wasn't a big deal in the past because I am
> the only person who receives such messages, but it was incorrect
> nonetheless. Instead of sending the original message out with
> Resent-*: headers prepended, I sent a new message with the
> original message as the body, e.g.:

[and the original headers "look like body text", ditto the MIME decorations]

> I just fixed that piece of code over the weekend. Since I won't
> be getting any new mail like the above note in the future, I suppose
> I should purge them from my collection or adjust those messages to
> have the correct format.

Out of curiosity, what percentage of your corpus consisted of such msgs? And were they all ham?

> So, should I pull the generate_long_skips option back out?

I'm neutral, but if you leave it in please change the comment (it's misleading now). I believe that whenever a skip token does some good, it's indicating a weakness in the tokenizer (this is nearly tautological: when skip does some good, it says there's useful info in "very long words"!). Over time, I hope people are inspired to find out just what good it is that we're getting by crudely summarizing via "skip" tokens, and extract it purposefully. An easy example is Asian spam, where the lack of whitespace ends up generating oodles of skip tokens (and '8bit%' tokens), but there must be a more effective way to generate useful tokens for that without bloating the database beyond reason.
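For reference, the "skip" tokens under discussion are a crude summary of overlong words. This is a from-memory sketch of what tokenizer.py does; the real code may differ in details such as the length cap and token spelling:

```python
# Words longer than max_word_size aren't used as tokens directly;
# they're crushed into a coarse summary recording only the first
# character and a rounded-down length bucket.
def tokenize_word(word, max_word_size=12):
    n = len(word)
    if n <= max_word_size:
        yield word
    else:
        # e.g. a 47-character run starting with 'x' becomes 'skip:x 40'
        yield "skip:%c %d" % (word[0], n // 10 * 10)
```

Whitespace-free Asian text tokenizes as a few enormous "words", which is why it generates so many of these.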
So I hope that skip-generation will eventually become worthless. From JasonR.Mastaler Tue Oct 1 01:38:38 2002 From: JasonR.Mastaler (JasonR.Mastaler) Date: Mon, 30 Sep 2002 18:38:38 -0600 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D98486A.1050208@startechgroup.co.uk> Message-ID: Matt Sergeant writes: > I've been following this list on gmane.org for a while now (it's a > mail to nntp gateway for those interested in following multiple > technical mailing lists in a read-only fashion) Actually, Gmane is not read-only -- you can both read and post. -- (http://tmda.net/) From nas@python.ca Tue Oct 1 01:42:57 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 30 Sep 2002 17:42:57 -0700 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: References: <15768.56124.22371.659117@12-248-11-90.client.attbi.com> Message-ID: <20021001004256.GA27420@glacier.arctrix.com> Tim Peters wrote: > An easy example is Asian spam, where the lack of whitespace ends up > generating oodles of skip tokens (and '8bit%' tokens), but there must > be a more effective way to generate useful tokens for that without > bloating the database beyond reason. I tried generating 2 character-grams when has_highbit_char was true. I seem to recall that it worked okay. The bonus would be that there would be a limit of 2**16 of these tokens in the DB. Neil From skip@pobox.com Tue Oct 1 01:54:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 30 Sep 2002 19:54:20 -0500 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: References: <15768.56124.22371.659117@12-248-11-90.client.attbi.com> Message-ID: <15768.62012.24430.856757@12-248-11-90.client.attbi.com> >> I just fixed that piece of code over the weekend. Since I won't be >> getting any new mail like the above note in the future, I suppose I >> should purge them from my collection or adjust those messages to have >> the correct format. 
Tim> Out of curiosity, what percentage of your corpus consisted of such
Tim> msgs? And were they all ham?

Of the current Data/{Ham,Spam}/Set* collection (2000 per side), three hams and 252 spams. Three types of mail get sent to itineraries@mojam.com: spam, legitimate (but unrecognized) submissions, and recognized legitimate submissions. I never see the last category, because the command processor dumps those to the correct files for later processing and forwards the rest to me. The vast majority of the other two classes of messages are spam.

>> So, should I pull the generate_long_skips option back out?

Tim> I'm neutral, but if you leave it in please change the comment (it's
Tim> misleading now).

Will do. Does this make sense?

    # If legitimate mail contains things that look like text to the
    # tokenizer, turning off this option may help (perhaps binary
    # attachments get 'defanged' by something upstream from this operation
    # and thus look like text), and should be an alert that perhaps the
    # tokenizer is broken.

Skip

From tim.one@comcast.net Tue Oct 1 02:49:09 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 21:49:09 -0400
Subject: [Spambayes] new option: generate_long_skips
In-Reply-To: <15768.54170.102815.684984@12-248-11-90.client.attbi.com>
Message-ID:

[Skip Montanaro]
> ...
> I notice it's suggesting an even lower cutoff now (0.375).
>
> Before:
>
> -> best cutoff for all runs: 0.4
> -> with weighted total 1*30 fp + 17 fn = 47
> -> fp rate 1.5% fn rate 0.85%
>
> After:
>
> -> best cutoff for all runs: 0.375
> -> with weighted total 1*35 fp + 7 fn = 42
> -> fp rate 1.75% fn rate 0.35%

It's suggesting that cutoff *if* what you want to do is minimize the total number of misclassified messages, without favoring errors of either kind. Most people here hate false positives more, and in that case you should set option best_cutoff_fp_weight (which defaults to 1) to how much more you hate fp than fn.
See the comments for that option in Options.py. You have such extreme overlap that you should also boost nbuckets up from its default 40; the resolution of the automated histogram analysis is limited by the number of buckets. From tim.one@comcast.net Tue Oct 1 03:16:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 30 Sep 2002 22:16:59 -0400 Subject: [Spambayes] mining dates? In-Reply-To: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> Message-ID: [Skip Montanaro] > ... > It didn't prove my hypothesis, but may have exposed something as useful. > Spam seems to be sent at a fairly constant rate throughout the day, > which stands to reason, since it's probably all sent automatically. > However, ham definitely seems to be sent predominantly during waking > hours (doh!). I'm going to give a little date mining a try. You have my encouragement, but are you talking about date mining or time mining? Date mining has hurt lots of folks, by giving good results for bogus reasons ("oops! that whole ham archive came from 1998, and none of my spam does"). So I suggest you *almost* stick to just time-of-day for now. Two extensions: 1. Day of week may also be interesting. I keep a hotmail account alive just to watch the spam pour in, and it definitely gets more spam on weekends. I speculate that the last 500 people to buy a CD of email addresses can't make time until the weekend to become an instant internet millionaire . 2. Greg Ward suggested two Date things SpamAssassin looks for: SPAM: * 1.6 -- Invalid Date: header (not RFC 2822) SPAM: * 2.7 -- Date: is 24 to 48 hours before Received: date If, OTOH, we were trying to distinguish email from Guido from the rest of our email, a great clue would be whether it came from Guido, but an even better one is whether his reply was sent before the original . 
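The kind of time-of-day / day-of-week features being proposed can be sketched with the stdlib's RFC 2822 date parser. The token spellings here are invented for illustration, not what the tokenizer actually emits, and, following Skip, the timezone offset is ignored so hours stay sender-local:

```python
import calendar
from email.utils import parsedate_tz

def date_tokens(date_header):
    parsed = parsedate_tz(date_header)
    if parsed is None:
        # SpamAssassin-style clue: an unparseable Date: header.
        return ["date:invalid"]
    year, month, day, hour = parsed[:4]
    # parsedate_tz leaves the weekday slot unfilled, so compute it.
    dow = calendar.weekday(year, month, day)  # 0 = Monday
    return ["time:%02d" % hour, "dow:%d" % dow]
```

Bucketing hours more coarsely (say, four 6-hour bins) would cut the number of distinct tokens if sparseness turns out to matter.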
From tim.one@comcast.net Tue Oct 1 03:22:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 30 Sep 2002 22:22:03 -0400 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: <20021001004256.GA27420@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > I tried generating 2 character-grams when has_highbit_char was true. In addition to, or in lieu of, generating skip tokens? > I seem to recall that it worked okay. The bonus would be that there > would be a limit of 2**16 of these tokens in the DB. Appreciated. I used to do character 5-grams in this case, and the database burden was significant. Plus results didn't get worse when I stopped doing n-grams altogether. Somebody want to try this on their corpus? 1. Current vs doing character 2-grams when has_highbit_char is true instead of generating skip tokens. 2. Current vs doing character 2-grams when has_highbit_char is true in addition to generating skip tokens. From nas@python.ca Tue Oct 1 04:21:00 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 30 Sep 2002 20:21:00 -0700 Subject: [Spambayes] Here's why "generate_long_skips: False" worked... In-Reply-To: References: <20021001004256.GA27420@glacier.arctrix.com> Message-ID: <20021001032100.GA27892@glacier.arctrix.com> Tim Peters wrote: > [Neil Schemenauer] > > I tried generating 2 character-grams when has_highbit_char was true. > > In addition to, or in lieu of, generating skip tokens? In addition. > 1. Current vs doing character 2-grams when has_highbit_char is true > instead of generating skip tokens. 
Left is current: false positive percentages 0.000 0.000 tied 1.000 1.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.500 0.500 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 5 to 5 tied mean fp % went from 0.25 to 0.25 tied false negative percentages 0.000 0.000 tied 1.000 1.000 tied 1.000 1.000 tied 0.500 0.500 tied 1.500 1.500 tied 1.500 1.500 tied 0.500 0.500 tied 0.500 0.500 tied 1.000 1.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fn went from 15 to 15 tied mean fn % went from 0.75 to 0.75 tied ham mean ham sdev 27.66 27.62 -0.14% 8.52 8.51 -0.12% 26.51 26.47 -0.15% 8.75 8.79 +0.46% 25.82 25.76 -0.23% 7.92 7.91 -0.13% 27.03 27.00 -0.11% 8.22 8.28 +0.73% 26.95 26.88 -0.26% 8.21 8.26 +0.61% 29.23 29.19 -0.14% 9.28 9.27 -0.11% 27.25 27.20 -0.18% 8.15 8.16 +0.12% 26.89 26.83 -0.22% 7.88 7.89 +0.13% 27.02 26.93 -0.33% 9.02 8.99 -0.33% 26.63 26.57 -0.23% 7.20 7.18 -0.28% ham mean and sdev for all runs 27.10 27.05 -0.18% 8.38 8.39 +0.12% spam mean spam sdev 81.73 82.38 +0.80% 10.24 10.96 +7.03% 80.90 81.56 +0.82% 10.16 10.96 +7.87% 80.03 81.11 +1.35% 9.99 11.02 +10.31% 81.51 82.48 +1.19% 10.28 11.29 +9.82% 81.44 82.31 +1.07% 10.43 11.13 +6.71% 81.11 82.17 +1.31% 9.82 10.87 +10.69% 80.64 81.69 +1.30% 9.52 10.47 +9.98% 80.43 81.48 +1.31% 9.84 10.74 +9.15% 81.18 82.02 +1.03% 10.25 10.91 +6.44% 81.17 82.59 +1.75% 9.90 11.10 +12.12% spam mean and sdev for all runs 81.01 81.98 +1.20% 10.06 10.96 +8.95% ham/spam mean difference: 53.91 54.93 +1.02 > > 2. Current vs doing character 2-grams when has_highbit_char is true > in addition to generating skip tokens. 
Again, left is current: false positive percentages 0.000 0.000 tied 1.000 1.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.500 0.500 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 5 to 5 tied mean fp % went from 0.25 to 0.25 tied false negative percentages 0.000 0.000 tied 1.000 1.000 tied 1.000 1.000 tied 0.500 0.500 tied 1.500 1.500 tied 1.500 1.500 tied 0.500 0.500 tied 0.500 0.500 tied 1.000 1.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fn went from 15 to 15 tied mean fn % went from 0.75 to 0.75 tied ham mean ham sdev 27.66 27.66 +0.00% 8.52 8.52 +0.00% 26.51 26.52 +0.04% 8.75 8.79 +0.46% 25.82 25.82 +0.00% 7.92 7.92 +0.00% 27.03 27.06 +0.11% 8.22 8.28 +0.73% 26.95 26.96 +0.04% 8.21 8.25 +0.49% 29.23 29.23 +0.00% 9.28 9.28 +0.00% 27.25 27.26 +0.04% 8.15 8.16 +0.12% 26.89 26.89 +0.00% 7.88 7.88 +0.00% 27.02 27.02 +0.00% 9.02 9.02 +0.00% 26.63 26.63 +0.00% 7.20 7.20 +0.00% ham mean and sdev for all runs 27.10 27.10 +0.00% 8.38 8.39 +0.12% spam mean spam sdev 81.73 82.51 +0.95% 10.24 11.00 +7.42% 80.90 81.66 +0.94% 10.16 10.98 +8.07% 80.03 81.24 +1.51% 9.99 11.18 +11.91% 81.51 82.58 +1.31% 10.28 11.35 +10.41% 81.44 82.38 +1.15% 10.43 11.17 +7.09% 81.11 82.29 +1.45% 9.82 10.91 +11.10% 80.64 81.78 +1.41% 9.52 10.48 +10.08% 80.43 81.57 +1.42% 9.84 10.80 +9.76% 81.18 82.13 +1.17% 10.25 10.96 +6.93% 81.17 82.71 +1.90% 9.90 11.22 +13.33% spam mean and sdev for all runs 81.01 82.09 +1.33% 10.06 11.02 +9.54% ham/spam mean difference: 53.91 54.99 +1.08 From nas@python.ca Tue Oct 1 04:23:12 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 30 Sep 2002 20:23:12 -0700 Subject: [Spambayes] mining dates? In-Reply-To: References: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> Message-ID: <20021001032312.GB27892@glacier.arctrix.com> Tim Peters wrote: > 2. 
Greg Ward suggested two Date things SpamAssassin looks for:
>
> SPAM: * 1.6 -- Invalid Date: header (not RFC 2822)

Tried that. It didn't help my error rate so I mercilessly killed it.

Neil

From tim.one@comcast.net Tue Oct 1 04:47:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 23:47:38 -0400
Subject: [Spambayes] Central limit
In-Reply-To:
Message-ID:

[Tim]
> ...
> I made up a combination of "look at ratios" and "different cutoffs
> for different n" by iteratively staring at the errors and making
> stuff up.

It now appears that the "different cutoffs for different n" was just an accident based on the specific errors I stared at. Recall that the "certainty heuristic" was of the form:

    ratio = max(abs(zham / zspam), abs(zspam / zham))
    certain = ratio > cutoff

and then I went on to choose different cutoffs depending on n (n is the number of "extreme words" found in the msg, with a maximum of 50).

Here's an exhaustive account of all the times the log-central-limit code was wrong (meaning that abs(zham) < abs(zspam) but the msg was really spam, or that abs(zspam) < abs(zham) but the msg was really ham). This is segregated by n (the number of extreme words). For each n, a list of all ratios in the "but I was wrong" cases is given. The number in square brackets is the number of predictions made with this specific value of n. The number in curly braces is the percentage of incorrect predictions. So, for example, 35 times we did a prediction on a msg with 7 extreme words (that's a very short msg!). Twice the prediction was wrong (5.71% of 35), and in one of those cases ratio was 1.31, and in the other ratio was 1.72.
3: [36] {0.00%} 4: [21] {0.00%} 5: [14] {0.00%} 6: [22] {0.00%} 7: [35] {5.71%} 1.31 1.72 8: [42] {4.76%} 1.01 1.33 9: [72] {5.56%} 1.00 1.04 1.14 1.28 10: [123] {0.00%} 11: [129] {1.55%} 1.07 1.09 12: [123] {1.63%} 1.05 1.09 13: [131] {0.00%} 14: [169] {0.59%} 1.11 15: [180] {1.11%} 1.18 1.73 16: [232] {1.29%} 1.12 1.12 1.43 17: [315] {1.27%} 1.06 1.06 1.27 1.48 18: [344] {1.16%} 1.28 1.35 1.50 1.60 19: [333] {1.20%} 1.03 1.24 1.75 1.78 20: [375] {0.53%} 1.10 1.12 21: [448] {0.45%} 1.09 2.54 22: [492] {0.00%} 23: [535] {0.56%} 1.38 1.72 2.20 24: [604] {0.50%} 1.03 1.17 1.66 25: [638] {0.63%} 1.04 1.55 1.64 1.85 26: [594] {0.51%} 1.06 1.07 1.13 27: [676] {0.74%} 1.02 1.03 1.06 1.26 1.35 28: [789] {0.00%} 29: [811] {0.49%} 1.03 1.18 1.41 2.24 30: [763] {0.39%} 1.04 1.04 2.08 31: [805] {0.12%} 1.44 32: [787] {0.13%} 1.19 33: [763] {0.26%} 1.10 1.36 34: [764] {0.13%} 1.04 35: [822] {0.12%} 1.03 36: [796] {0.00%} 37: [819] {0.00%} 38: [947] {0.11%} 1.08 39: [907] {0.00%} 40: [873] {0.00%} 41: [877] {0.11%} 1.21 42: [1016] {0.00%} 43: [1005] {0.00%} 44: [1016] {0.00%} 45: [1003] {0.30%} 1.07 1.10 1.27 46: [1068] {0.09%} 1.24 47: [1019] {0.00%} 48: [1026] {0.10%} 1.15 49: [1056] {0.28%} 1.09 1.10 1.24 50: [63585] {0.07%} 1.02 1.02 1.02 1.03 1.03 1.04 1.04 1.04 1.05 1.05 1.05 1.06 1.06 1.08 1.09 1.09 1.09 1.10 1.10 1.11 1.11 1.12 1.13 1.14 1.17 1.17 1.18 1.18 1.18 1.19 1.19 1.19 1.20 1.21 1.25 1.27 1.27 1.29 1.30 1.30 1.40 1.44 1.48 1.52 1.56 1.63 Several things to note: 1. The error rate is generally lower the more words we've got to work with. 2. There are notable exceptions to that, but error rates are so low that a single message makes a large difference in error rate. 3. There doesn't appear to be any correlation between n and the maximum ratio "that works" for that n. 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8. 
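In code, the heuristic being evaluated amounts to the following sketch (the function name is made up; zham and zspam come from the central-limit machinery not shown, and neither is assumed to be exactly zero):

```python
def judge(zham, zspam, cutoff=1.8):
    # Predict whichever population's z-score is smaller in magnitude,
    # and call the prediction "certain" only when the two magnitudes
    # differ by more than the cutoff ratio.
    ratio = max(abs(zham / zspam), abs(zspam / zham))
    is_spam = abs(zspam) < abs(zham)
    return is_spam, ratio > cutoff
```

With cutoff fixed at 1.8 rather than varying by n, this is the scheme whose error counts are tallied below.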
If we were willing to accept half of 1 percent of 1 percent as an acceptable error rate for "certainty", a fixed cutoff of 1.8 would have caused 5 false negatives (sorry, you can't tell whether they're f-p or f-n from the above) in the region of certainty, and no false positives there: [overall results with a fixed ratio cutoff of 1.8] for all ham 45000 total certain 44830 99.622% (|zham| smaller and ratio > 1.8) wrong 0 0.000% unsure 170 0.378% (|zham| smaller and ratio <= 1.8) wrong 37 21.765% for all spam 45000 total certain 44563 99.029% (|zspam| smaller and ratio > 1.8) wrong 5 0.011% unsure 437 0.971% (|zspam| smaller and ratio <= 1.8) wrong 79 18.078% From tim.one@comcast.net Tue Oct 1 05:03:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 01 Oct 2002 00:03:17 -0400 Subject: [Spambayes] mining dates? In-Reply-To: <20021001032312.GB27892@glacier.arctrix.com> Message-ID: [Tim] >> 2. Greg Ward suggested two Date things SpamAssassin looks for: >> >> SPAM: * 1.6 -- Invalid Date: header (not RFC 2822) [Neil Schemenauer] > Tried that. It didn't help my error rate so I mercilessly killed it. Hmm. You generally chop off the lines revealing how large a test you're running, but from your total error rates in the last report: total unique fp went from 5 to 5 tied mean fp % went from 0.25 to 0.25 tied total unique fn went from 15 to 15 tied mean fn % went from 0.75 to 0.75 tied it seems a safe bet that you're predicting against 200 messages per run. In that case, the smallest non-zero *change* in a one-run error rate you could possibly see is 0.5% (1 of 200 msgs), which essentially *is* your overall error rate. In other words, like me, you've reached the point where your corpus can no longer support measuring improvements reliably -- even if a solid but modest improvement were to be made, it's quite likely you couldn't measure it. 
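The resolution argument is just arithmetic: with N messages scored per run, one reclassified message is the smallest visible change in that run's error rate. A quick sketch:

```python
def min_detectable_change_pct(msgs_per_run):
    # One message flipping class is the finest granularity an error
    # rate measured over msgs_per_run messages can show.
    return 100.0 / msgs_per_run
```

At 200 messages per run the floor is 0.5%, which is already the size of the error rates being compared; a tenfold larger corpus would drop the floor to 0.05%.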
That leaves us staring at ham & spam means & sdevs, which are still good indicators of whether a change moves "in a good direction", but that isn't as exciting as watching error rates plummet. Moving to a larger corpus would help make your life more interesting again: sign up for more mailing lists .

From anthony@interlink.com.au Tue Oct 1 05:12:35 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 01 Oct 2002 14:12:35 +1000
Subject: [Spambayes] Central limit
In-Reply-To:
Message-ID: <200210010412.g914CZJ10661@localhost.localdomain>

>>> Tim Peters wrote
> and then I went on to choose different cutoffs depending on n (n is the
> number of "extreme words" found in the msg, with a maximum of 50).

What happens past 50?

extracting just the ones where it was "dead wrong"...

> 21: [448] {0.45%} 1.09 2.54
> 23: [535] {0.56%} 1.38 1.72 2.20
> 25: [638] {0.63%} 1.04 1.55 1.64 1.85
> 29: [811] {0.49%} 1.03 1.18 1.41 2.24
> 30: [763] {0.39%} 1.04 1.04 2.08

What's the plot of cutoff -vs- uncertain messages like? How do these relate?

> 2. There are notable exceptions to that, but error rates are so low
> that a single message makes a large difference in error rate.

Is there anything "magic" about those 5 fns? Were they the usual suspects? Does inspecting them by hand give any clues about other tokenisation clues that might have helped them? (e.g. if your corpus was sufficiently single-sourced that you could turn on all the disabled clue-extractors...)

> 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.

And all of those were fn, not fp.

Anthony

From nas@python.ca Tue Oct 1 05:25:32 2002
From: nas@python.ca (Neil Schemenauer)
Date: Mon, 30 Sep 2002 21:25:32 -0700
Subject: [Spambayes] mining dates?
In-Reply-To:
References: <20021001032312.GB27892@glacier.arctrix.com>
Message-ID: <20021001042532.GA28075@glacier.arctrix.com>

Tim Peters wrote:
> it seems a safe bet that you're predicting against 200 messages per run

Good work detective Peters.
> Moving to a larger corpus would help make your life more interesting > again: sign up for more mailing lists . I use a different email address for each email list I sign up on. That makes sorting easy. My ham and spam collection is taken from addresses that don't receive mailing list traffic. So, signing up for more lists wouldn't help. Neil From skip@pobox.com Tue Oct 1 05:36:44 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 30 Sep 2002 23:36:44 -0500 Subject: [Spambayes] mining dates? In-Reply-To: References: <15768.50203.838893.944644@12-248-11-90.client.attbi.com> Message-ID: <15769.9820.558475.996393@12-248-11-90.client.attbi.com> >> It didn't prove my hypothesis, but may have exposed something as >> useful. Spam seems to be sent at a fairly constant rate throughout >> the day, which stands to reason, since it's probably all sent >> automatically. However, ham definitely seems to be sent >> predominantly during waking hours (doh!). I'm going to give a little >> date mining a try. Tim> You have my encouragement, but are you talking about date mining or Tim> time mining? Well, I'm mining the Date: field for time information. The other mining option examines Received: headers for host and IP information. I was just following suit. Tim> Date mining has hurt lots of folks, by giving good results for Tim> bogus reasons ("oops! that whole ham archive came from 1998, and Tim> none of my spam does"). So I suggest you *almost* stick to just Tim> time-of-day for now. Two extensions: Tim> 1. Day of week may also be interesting. I keep a hotmail account Tim> alive just to watch the spam pour in, and it definitely gets Tim> more spam on weekends. I speculate that the last 500 people Tim> to buy a CD of email addresses can't make time until the Tim> weekend to become an instant internet millionaire . Yeah, I thought about dow. I'll give it a look-see. Of course, that requires me to actually call time.strptime() and come up with a couple plausible format strings. 
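For what it's worth, the stdlib's RFC 2822 date parser may save inventing strptime format strings: email.utils.parsedate (the rfc822 module in Pythons of this vintage) copes with the usual Date: variants, though it leaves the weekday slot unfilled, so dow has to be derived. A sketch over a couple of representative headers:

```python
import calendar
from email.utils import parsedate

samples = [
    "Mon, 27 May 2002 00:02:09 EDT",
    "26 Sep 2002 20:21:59 -0700",
    "Thu, 2 May 2002 11:12:41 -0700 (PDT)",
]
for s in samples:
    t = parsedate(s)  # time.struct_time-like 9-tuple, or None
    if t is None:
        continue      # spam's malformed dates land here
    year, month, day, hour = t[:4]
    # parsedate does not compute the weekday, so derive it.
    dow = calendar.weekday(year, month, day)  # 0 = Monday
```

A None result is itself a usable clue, echoing SpamAssassin's invalid-Date test.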
Here's a small sample from one of my ham Set directories:

    Date: Mon, 27 May 2002 00:02:09 EDT
    Date: 26 Sep 2002 20:21:59 -0700
    Date: Wed, 25 Sep 2002 23:02:40 -0400
    Date: Thu, 2 May 2002 11:12:41 -0700 (PDT)

Mining time info is simpler because it seems more uniformly formatted than the rest of the Date: header (in my limited experience anyway), so I can be more stupid when I collect that information and just extract it with a simple regular expression. A quickie shell pipeline suggests that spam generally violates date formats a lot more often than ham. Given this pipeline for ham:

    find Data/Ham/Set* -type f \
        | xargs sed -n -e '/^From /,/^$/p' \
        | egrep '^Date: ' \
        | egrep '^Date: [A-Z][a-z][a-z],' \
        | awk '{print $2}' \
        | sort \
        | uniq -c

I get this nice clean output:

    418 Fri,
    217 Mon,
    135 Sat,
    118 Sun,
    396 Thu,
    247 Tue,
    347 Wed,

Changing the first element of the pipe to scan my Spam collection gives the much messier output:

    228 Fri,
    317 Mon,
      1 Mon,16
      1 Mon,23
      2 Mon,27
    178 Sat,
      1 Sex,
    233 Sun,
    294 Thu,
      1 Thu,26
    339 Tue,
      2 Tue,17
    271 Wed,
      2 Wed,16

It is nice to know that every once in a great while Sex is a day of the week. Wish I could predict its occurrence though. Any experts on the list? Just eyeballing things, the frequency patterns look different between spam and ham.

Skip

From tim.one@comcast.net Tue Oct 1 06:25:28 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 01:25:28 -0400
Subject: [Spambayes] Central limit
In-Reply-To: <200210010412.g914CZJ10661@localhost.localdomain>
Message-ID:

[Tim]
>> and then I went on to choose different cutoffs depending on n (n is
>> the number of "extreme words" found in the msg, with a maximum of 50).

[Anthony Baxter]
> What happens past 50?

I don't know. Gary originally suggested 30, and the only reason I tried 50 this time was due to a braino (I was editing the 150 max_discriminators value we use now, and unthinkingly just deleted the "1"). I have no results for any value other than 50.
> extracting just the ones where it was "dead wrong"... By this I guess you mean the error cases where the ratio exceeded 1.8? >> 21: [448] {0.45%} 1.09 2.54 >> 23: [535] {0.56%} 1.38 1.72 2.20 >> 25: [638] {0.63%} 1.04 1.55 1.64 1.85 >> 29: [811] {0.49%} 1.03 1.18 1.41 2.24 >> 30: [763] {0.39%} 1.04 1.04 2.08 > What's the plot of cutoff -vs- uncertain messages like? > How do these relate? Sorry, I don't know what you mean. Here's a histogram showing the # of predictions made at each ratio, where the "99.0" bucket includes all ratios >= 99.0 (there are a lot of those!): 90000 items; mean 96.06; sdev 2174.03 * = 141 items 1.0 794 ****** 2.0 1411 *********** 3.0 2067 *************** 4.0 2373 ***************** 5.0 2640 ******************* 6.0 2708 ******************** 7.0 2883 ********************* 8.0 2747 ******************** 9.0 2598 ******************* 10.0 2478 ****************** 11.0 2307 ***************** 12.0 2194 **************** 13.0 2008 *************** 14.0 1906 ************** 15.0 2403 ****************** 16.0 5814 ****************************************** 17.0 5650 ***************************************** 18.0 3635 ************************** 19.0 2133 **************** 20.0 1762 ************* 21.0 1634 ************ 22.0 1351 ********** 23.0 1154 ********* 24.0 1001 ******** 25.0 937 ******* 26.0 871 ******* 27.0 898 ******* 28.0 861 ******* 29.0 949 ******* 30.0 952 ******* 31.0 937 ******* 32.0 869 ******* 33.0 812 ****** 34.0 715 ****** 35.0 745 ****** 36.0 736 ****** 37.0 576 ***** 38.0 573 ***** 39.0 551 **** 40.0 520 **** 41.0 463 **** 42.0 486 **** 43.0 445 **** 44.0 451 **** 45.0 374 *** 46.0 349 *** 47.0 365 *** 48.0 365 *** 49.0 288 *** 50.0 319 *** 51.0 299 *** 52.0 276 ** 53.0 281 ** 54.0 273 ** 55.0 255 ** 56.0 246 ** 57.0 239 ** 58.0 213 ** 59.0 236 ** 60.0 211 ** 61.0 188 ** 62.0 205 ** 63.0 178 ** 64.0 164 ** 65.0 162 ** 66.0 190 ** 67.0 177 ** 68.0 174 ** 69.0 145 ** 70.0 175 ** 71.0 155 ** 72.0 168 ** 73.0 123 * 74.0 140 * 75.0 132 * 
76.0 130 * 77.0 133 * 78.0 121 * 79.0 119 * 80.0 122 * 81.0 125 * 82.0 124 * 83.0 97 * 84.0 96 * 85.0 125 * 86.0 99 * 87.0 93 * 88.0 94 * 89.0 102 * 90.0 99 * 91.0 105 * 92.0 88 * 93.0 82 * 94.0 95 * 95.0 72 * 96.0 72 * 97.0 82 * 98.0 82 * 99.0 8580 ************************************************************* I suppose you can get a crude answer to whatever it is you're asking from staring at that . Here's restricted to ratios < 10.0: 20221 items; mean 61.62; sdev 23.19 * = 6 items 1.00 93 **************** 1.10 74 ************* 1.20 69 ************ 1.30 75 ************* 1.40 71 ************ 1.50 66 *********** 1.60 69 ************ 1.70 90 *************** 1.80 91 **************** 1.90 96 **************** 2.00 94 **************** 2.10 119 ******************** 2.20 126 ********************* 2.30 146 ************************* 2.40 136 *********************** 2.50 144 ************************ 2.60 134 *********************** 2.70 168 **************************** 2.80 167 **************************** 2.90 177 ****************************** 3.00 192 ******************************** 3.10 176 ****************************** 3.20 222 ************************************* 3.30 203 ********************************** 3.40 198 ********************************* 3.50 230 *************************************** 3.60 205 *********************************** 3.70 183 ******************************* 3.80 209 *********************************** 3.90 249 ****************************************** 4.00 207 *********************************** 4.10 253 ******************************************* 4.20 204 ********************************** 4.30 212 ************************************ 4.40 253 ******************************************* 4.50 240 **************************************** 4.60 249 ****************************************** 4.70 246 ***************************************** 4.80 270 ********************************************* 4.90 239 **************************************** 
5.00   258 *******************************************
5.10   240 ****************************************
5.20   242 *****************************************
5.30   256 *******************************************
5.40   248 ******************************************
5.50   279 ***********************************************
5.60   263 ********************************************
5.70   294 *************************************************
5.80   286 ************************************************
5.90   274 **********************************************
6.00   259 ********************************************
6.10   261 ********************************************
6.20   257 *******************************************
6.30   278 ***********************************************
6.40   278 ***********************************************
6.50   241 *****************************************
6.60   279 ***********************************************
6.70   287 ************************************************
6.80   287 ************************************************
6.90   281 ***********************************************
7.00   299 **************************************************
7.10   291 *************************************************
7.20   311 ****************************************************
7.30   285 ************************************************
7.40   281 ***********************************************
7.50   259 ********************************************
7.60   292 *************************************************
7.70   288 ************************************************
7.80   285 ************************************************
7.90   292 *************************************************
8.00   249 ******************************************
8.10   271 **********************************************
8.20   261 ********************************************
8.30   289 *************************************************
8.40   269 *********************************************
8.50   275 **********************************************
8.60   294 *************************************************
8.70   290 *************************************************
8.80   281 ***********************************************
8.90   268 *********************************************
9.00   258 *******************************************
9.10   263 ********************************************
9.20   268 *********************************************
9.30   280 ***********************************************
9.40   279 ***********************************************
9.50   247 ******************************************
9.60   253 *******************************************
9.70   244 *****************************************
9.80   265 *********************************************
9.90   241 *****************************************

>> 2. There are notable exceptions to that, but error rates are so low
>> that a single message makes a large difference in error rate.

> Is there anything "magic" about those 5 fns? Were they the usual
> suspects? Does inspecting them by hand give any clues about other
> tokenisation clues that might have helped them? (e.g. if your corpus
> was sufficiently single-sourced that you could turn on all the
> disabled clue-extractors...)

Sorry, I can't relate the errors to msgs. All I have is a binary pickle containing 90,000 of these:

    class Node(object):
        __slots__ = 'is_spam', 'n', 'zham', 'zspam', 'delta', 'score'

That was generated when I was testing a different "certainty heuristic" that performed much worse than the one I'm talking about now, and its text output file doesn't contain any error cases with ratios larger than about 1.1 (so it doesn't contain the errors in question now). It never made a mistake, but it considered huge numbers of msgs to be uncertain -- if 25% of msgs are kicked out for manual review, I'd consider the scheme wholly impractical.

>> 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.

> And all of those were fn, not fp.
That's right. In this particular test. OTOH, this particular test ran 90 times each training on 500+500 then predicting against 4500+4500, so it was giving itself a hard job. I've got lots of reasons to believe that training on 500 ham and 500 spam isn't enough to get reasonable coverage of the diversity in my corpora. Offline, Guido tried the use_central_limit2 code exactly as-is on a much larger test, training on about 8K ham + 3K spam for each run. I don't recommend doing that because the "scores" produced by the code as-is make no sense -- they basically produce 1 bit of information (which zscore was smaller?) in a highly confusing way, and a way that's not symmetric around 0.5. I believe he also used max_discriminators=150 (the default these days), which may well be "too large" for the log-central-limit code (Gary designed it to make extreme use of the extreme words, and there's no message that has 150 distinct extreme words). Even so, compared to our current default scheme, his bottom lines across 90 runs were: total unique fp went from 904 to 324 won -64.16% mean fp % went from 0.662958214428 to 0.232509170721 won -64.93% total unique fn went from 97 to 275 lost +183.51% mean fn % went from 0.127271524421 to 0.328802849112 lost +158.35% and we've already seen that this scheme is less certain about spam than about ham. Alas, there's no way to know what the "certainty heuristic" would have said in Guido's large run (there's no code checked in for that, and I'm having an increasingly hard time making insane amounts of time for this project). From tim.one@comcast.net Tue Oct 1 06:36:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 01 Oct 2002 01:36:34 -0400 Subject: [Spambayes] mining dates? In-Reply-To: <15769.9820.558475.996393@12-248-11-90.client.attbi.com> Message-ID: [Skip Montanaro, on day-of-week] > Yeah, I thought about dow. I'll give it a look-see. 
Of course, that
> requires me to actually call time.strptime() and come up with a couple
> plausible format strings.

Stupid almost certainly beats smart here. Match against r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s'. If that succeeds, generate a dow token with the day of the week, else generate a dow token with a "no day" value. All cases are then reduced to 8, and all goofy patterns you see in spam are reduced to one. You could refine that a little (e.g., to distinguish plain-missing from there-but-not-followed-by-space), but I expect more than that would be counterproductive. Testing is the final judge, of course, but trust me on this one: start stupid, and work your way up until results stop improving. From mjm@michaelmeltzer.com Tue Oct 1 07:01:38 2002 From: mjm@michaelmeltzer.com (Michael Meltzer) Date: Tue, 1 Oct 2002 02:01:38 -0400 Subject: [Spambayes] just an idea Message-ID: <010701c26910$017d5760$0b01a8c0@mjm2> This is a multi-part message in MIME format. ---------------------- multipart/alternative attachment For what it is worth, in the same way a time stamp might be useful, the current crop of black hole lists might be helpful. Their problem has always been that they are a little too touchy, slow to react, and a little dangerous for an admin due to their draconian nature. I have had my commercial DSL line included just because they were DSL lines. But the filter is a little more forgiving than a simple binary decision. Knowing an IP is a dial-up line, a cable modem, a known spammer address or an open relay could be useful in a close call. In fact, a real cute application would be for the filters to report to a spambayes blackhole list automatically, but it would not be a blackhole list, just one element of the filter used in the evaluation.
Works nicely with the properties of the filter; it should help with hammy email that might look spammy, especially if it's out of the norm for the user, and the network effect with a little address ageing should be self-maintaining. The down side: those DNS queries can be expensive. Just a thought. http://relays.osirusoft.com/cgi-bin/rbcheck.cgi MJM ---------------------- multipart/alternative attachment-- From skip@pobox.com Tue Oct 1 07:10:00 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 01:10:00 -0500 Subject: [Spambayes] more date field mining Message-ID: <15769.15416.639114.331796@12-248-11-90.client.attbi.com> I have now modified the Tokenizer class thus:

    class Tokenizer:

        date_hms_re = re.compile(r' (?P<hour>[0-9][0-9]):'
                                 r'(?P<minute>[0-9][0-9]):'
                                 r'(?P<second>[0-9][0-9]) ')

        date_formats = ("%a, %d %b %Y %H:%M:%S (%Z)",
                        "%a, %d %b %Y %H:%M:%S %Z",
                        "%d %b %Y %H:%M:%S (%Z)",
                        "%d %b %Y %H:%M:%S %Z")

        ...

        def tokenize_headers(self, msg):
            # Special tagging of header lines and MIME metadata.
            ...
            if options.mine_date_headers:
                for header in msg.get_all("date", ()):
                    mat = self.date_hms_re.search(header)
                    # return the time in Date: headers arranged in
                    # six-minute buckets
                    if mat is not None:
                        h = int(mat.group('hour'))
                        bucket = int(mat.group('minute')) // 10
                        yield 'time:%02d:%d' % (h, bucket)
                    # extract the day of the week
                    for fmt in self.date_formats:
                        try:
                            timetuple = time.strptime(header, fmt)
                        except ValueError:
                            pass
                        else:
                            yield 'dow:%d' % timetuple[6]
                            break
                    else:
                        yield 'dow:invalid'

Times and days of the week seem like they should be pretty distinct. I should probably analyze them separately using two options. Still, here are my initial results using this coarser grained scheme:

cutoffs -> times -> tested 200 hams & 200 spams against 1800 hams & 1800 spams ...
false positive percentages
    1.000  1.000  tied
    1.500  1.500  tied
    1.000  1.000  tied
    1.000  1.500  lost  +50.00%
    1.000  1.000  tied
    1.500  1.500  tied
    3.500  3.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
    1.500  2.000  lost  +33.33%

won   0 times
tied  8 times
lost  2 times

total unique fp went from 30 to 32 lost +6.67%
mean fp % went from 1.5 to 1.6 lost +6.67%

false negative percentages
    0.500  0.500  tied
    1.500  1.500  tied
    0.500  0.500  tied
    0.500  0.500  tied
    2.000  2.000  tied
    0.000  0.000  tied
    1.000  1.500  lost  +50.00%
    1.000  1.000  tied
    0.000  0.000  tied
    1.500  1.500  tied

won   0 times
tied  9 times
lost  1 times

total unique fn went from 17 to 18 lost +5.88%
mean fn % went from 0.85 to 0.9 lost +5.88%

ham mean                     ham sdev
  20.82   21.05   +1.10%       6.43    6.47   +0.62%
  21.86   22.00   +0.64%       6.63    6.61   -0.30%
  21.38   21.56   +0.84%       6.49    6.57   +1.23%
  21.96   22.13   +0.77%       6.26    6.27   +0.16%
  21.51   21.73   +1.02%       6.72    6.73   +0.15%
  21.66   21.88   +1.02%       6.98    7.01   +0.43%
  21.45   21.62   +0.79%       7.66    7.59   -0.91%
  21.74   21.93   +0.87%       6.69    6.67   -0.30%
  21.71   21.88   +0.78%       7.44    7.43   -0.13%
  21.87   22.01   +0.64%       5.93    5.93   +0.00%

ham mean and sdev for all runs
  21.60   21.78   +0.83%       6.75    6.75   +0.00%

spam mean                    spam sdev
  74.10   73.79   -0.42%      12.99   12.71   -2.16%
  72.47   72.11   -0.50%      13.92   13.63   -2.08%
  74.05   73.75   -0.41%      13.00   12.80   -1.54%
  74.00   73.68   -0.43%      12.27   12.03   -1.96%
  72.43   72.06   -0.51%      13.73   13.33   -2.91%
  72.68   72.35   -0.45%      13.27   13.04   -1.73%
  72.57   72.29   -0.39%      13.03   12.84   -1.46%
  71.50   71.26   -0.34%      12.12   11.95   -1.40%
  73.25   72.92   -0.45%      12.67   12.39   -2.21%
  73.02   72.73   -0.40%      12.44   12.24   -1.61%

spam mean and sdev for all runs
  73.01   72.69   -0.44%      12.98   12.73   -1.93%

ham/spam mean difference: 51.41 50.91 -0.50

I'll try it with a more fine-grained set of options tomorrow after a little snooze.
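Tim's "stupid beats smart" day-of-week matcher from earlier in the thread can be sketched as a standalone function. This is only an illustration; the 'dow:' token spellings here are assumptions, not the actual spambayes ones:

```python
import re

# Match the literal day-name prefix of a Date: header instead of
# parsing it with time.strptime().  Anything that doesn't match --
# missing day names and all the goofy spam patterns -- collapses
# into one "no day" token, so there are only 8 possible outputs.
DOW_RE = re.compile(r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s')

def dow_token(date_header):
    """Reduce a Date: header to one of 8 day-of-week tokens."""
    mat = DOW_RE.match(date_header)
    if mat is not None:
        return 'dow:' + mat.group(1)
    return 'dow:none'
```

For example, dow_token("Tue, 24 Sep 2002 15:33:56 -0500") gives 'dow:Tue', while a header with no leading day name gives 'dow:none'.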
Skip From msergeant@startechgroup.co.uk Tue Oct 1 10:18:13 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 10:18:13 +0100 Subject: [Spambayes] Matt Sergeant: Introduction References: Message-ID: <3D996855.9030707@startechgroup.co.uk> Tim Peters wrote: > [Matt Sergeant] > > Thanks for the introduction, Matt! Welcome. > > >>... >>Like you all, I discovered very quickly that it's the tokenisation >>techniques that are the biggest "win" when it comes down to it. > > The first thing I tried after implementing Graham's scheme was special > tokenization and tagging of embedded http/https/ftp thingies. Consider that adopted ;-) And to give back I'll tell you that one of my biggest wins was parsing HTML (with HTML::Parser - a C implementation so it's very fast) and tokenising all attributes, so I get: colspan=2 face=Arial, Helvetica, sans-serif as tokens. Plus using a proper HTML parser I get to parse HTML comments too (which is a win). Using word tuples is also a small win, but increases the database size and number of tokens you have to pull from the database enormously. That's an issue for me because I'm not using an in-memory database (one implementation uses CDB, another uses SQL - the SQL one is really nice because you can so easily do data mining, and the code to extract the token probabilities is just a view). > That > instantly cut the false negative rate in half. It remains the single > biggest win we ever got. Well I very quickly found out that most of the academic research into this has been pretty bogus. For example everyone seems (seemed?) to think that stemming was a big win, but I found it to lose every time. > The rest has been an aggregation of many smaller > wins, and the benefit gotten over time from finding and removing the biases > in Paul's formulation has been highly significant. 
That eventually hit a
> wall, where this set of 3 artificialities was stubborn:
>
>     artificially clamping spamprobs into [0.01, 0.99]
>     artificially boosting ham counts
>     looking at only the 16 most-extreme words
>
> Changing any one, or any two, of those, gave at best mixed results. It took
> wholesale adoption of all of Gary Robinson's ideas at once (some of which
> aren't really explained (yet?) on his webpage) to nuke them all. The fewer
> the number of "mystery knobs", the better results have gotten, but the
> original biases sometimes acted to cancel each other out in the areas they
> hurt most, so you can't get here from there removing just one at a time.

(I've followed this all so far in read-only mode, but thanks for rounding it up into 2 paragraphs ). The one thing that still bothers me about Gary's method is that the threshold value varies depending on corpus. Though I expect there's some mileage in being able to say that the middle ground is "unknown".

>> so I'm hopefully going to get CLT done this week and see how it fares.
>> Unfortunately I find python incredibly difficult to read, so it takes
>> me a while!

> Hmm. I could tell you to mentally translate
>
>     a.b
>
> to
>
>     $a->{b}
>
> but I doubt your problem is at that level . Post a snippet of Python
> you find "incredibly difficult to read", and someone will be happy to walk
> you thru it. I really can't guess, as this particular criticism of Python
> is one I've never heard before!

OK, I'll go over it again this week and next time I get stuck I'll mail out for some help ;-) The hardest part really is getting from how my code is structured (i.e. where I get my data from, how I store it, etc) to your version. Simple examples like where you use a priority queue for the probabilities so you can extract the top N indicators, I just use an array, and use a sort to get the top N. So mostly it's just the details of storage that confuse me.
Oh, and not being able to figure out where a block ends :-P

Off the top of my head, what does frexp() do? And where is compute_population_stats used?

>> ...
>> such as how the probability stuff works so much better on individuals'
>> corpora (or on a particular mailing list's corpus) than it does for
>> hundreds of thousands of users.

> That's been my suspicion, but we haven't tested it here yet. So save us the
> effort and tell us the bottom line from your tests .

On my personal email I was seeing about 5 FP's in 4000, and about 20 FN's in about the same number (can't find the exact figures right now). On a live feed of customer email we're seeing about 4% FN's and 2% FP's. I don't yet have your fancy histograms, mostly because the code works on one email in isolation right now, and knows nothing about what result it should have given - I need to write wrappers to do that stuff yet. From anthony@interlink.com.au Tue Oct 1 10:29:48 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 01 Oct 2002 19:29:48 +1000 Subject: [Spambayes] Matt Sergeant: Introduction In-Reply-To: <3D996855.9030707@startechgroup.co.uk> Message-ID: <200210010929.g919Tnn02347@localhost.localdomain>

>>> Matt Sergeant wrote
> And to give back I'll tell you that one of my biggest wins was parsing
> HTML (with HTML::Parser - a C implementation so it's very fast) and
> tokenising all attributes, so I get:
>
> colspan=2
> face=Arial, Helvetica, sans-serif
>
> as tokens. Plus using a proper HTML parser I get to parse HTML comments
> too (which is a win).

With the Graham code, we found that the simple-minded parsing of HTML actually hurt more than it gained, but it was a _very_ simple split-on-whitespace. In a case of synchronicity, at the moment I'm running a test over my newer larger monster corpus (35Kh/17Ks) to extract the avpairs from HTML tokens.
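Matt's attribute tokenisation above uses Perl's HTML::Parser; a minimal Python analogue of the same idea (standalone sketch, not anything in the spambayes tree) looks like this:

```python
from html.parser import HTMLParser

class AttrTokenizer(HTMLParser):
    """Collect attr=value tokens from markup, HTML comments included."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already dequoted
        for name, value in attrs:
            self.tokens.append('%s=%s' % (name, value))

    def handle_comment(self, data):
        # a proper parser also surfaces HTML comments, "which is a win"
        self.tokens.append('comment:%s' % data.strip())

p = AttrTokenizer()
p.feed('<td colspan=2><font face="Arial, Helvetica, sans-serif">x</font></td>')
# p.tokens is now ['colspan=2', 'face=Arial, Helvetica, sans-serif']
```

The point of the design is that a real parser sees through the quoting, casing, and comment tricks that defeat a simple split-on-whitespace.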
Anthony

From msergeant@startechgroup.co.uk Tue Oct 1 10:37:29 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 10:37:29 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D98486A.1050208@startechgroup.co.uk> Message-ID: <3D996CD9.2030300@startechgroup.co.uk>

Jason R. Mastaler wrote:
> Matt Sergeant writes:
>
>> I've been following this list on gmane.org for a while now (it's a
>> mail to nntp gateway for those interested in following multiple
>> technical mailing lists in a read-only fashion)
>
> Actually, Gmane is not read-only -- you can both read and post.

Does it depend on the list? I tried to post once and my post never showed up.

From anthony@interlink.com.au Tue Oct 1 10:50:01 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 01 Oct 2002 19:50:01 +1000 Subject: [Spambayes] memory consumption... In-Reply-To: <20020926192704.GA3931@schaduw.felnet> Message-ID: <200210010950.g919o3e02522@localhost.localdomain>

>>> Carel Fellinger wrote
> I take it that you're new to linux? Otherwise ignore my rambling.
> Linux uses all its free memory for caching, but only truly free
> memory. So before any swapping starts the cache will shrink to its
> bare minimum first.

That's what I'd expected. But it looked like this little laptop had got confused, and wouldn't let go of the cached RAM. A reboot later and it's happy again, tossing cached data away rather than paging everything else out. Oh well. File under "one of those freaky things that computers sometimes do".

Anthony

From msergeant@startechgroup.co.uk Tue Oct 1 10:54:56 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 10:54:56 +0100 Subject: [Spambayes] memory consumption... References: <200210010950.g919o3e02522@localhost.localdomain> Message-ID: <3D9970F0.8090007@startechgroup.co.uk>

Anthony Baxter wrote:
>>>> Carel Fellinger wrote
>>>
>> I take it that you're new to linux? Otherwise ignore my rambling.
>> Linux uses all its free memory for caching, but only truly free
>> memory. So before any swapping starts the cache will shrink to its
>> bare minimum first.
>
> That's what I'd expected. But it looked like this little laptop had
> got confused, and wouldn't let go of the cached RAM. A reboot later
> and it's happy again, tossing cached data away rather than
> paging everything else out.

FWIW, this is very much dependent on Linux kernel version. Red Hat's stock kernels seem to perform much better than anyone else's at this type of thing.

Matt.

From mwh@python.net Tue Oct 1 12:04:38 2002 From: mwh@python.net (Michael Hudson) Date: 01 Oct 2002 12:04:38 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D996855.9030707@startechgroup.co.uk> Message-ID:

Matt Sergeant writes:
> Off the top of my head, what does frexp() do?

>>> print math.frexp.__doc__
frexp(x)

Return the mantissa and exponent of x, as pair (m, e).
m is a float and e is an int, such that x = m * 2.**e.
If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0.

Cheers, M.

-- The bottom tier is what a certain class of wanker would call "business objects" ... -- Greg Ward, 9 Dec 1999

From mwh@python.net Tue Oct 1 12:03:28 2002 From: mwh@python.net (Michael Hudson) Date: 01 Oct 2002 12:03:28 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D98486A.1050208@startechgroup.co.uk> <3D996CD9.2030300@startechgroup.co.uk> Message-ID:

Matt Sergeant writes:
> Jason R. Mastaler wrote:
>> Matt Sergeant writes:
>>
>>> I've been following this list on gmane.org for a while now (it's a
>>> mail to nntp gateway for those interested in following multiple
>>> technical mailing lists in a read-only fashion)
>>
>> Actually, Gmane is not read-only -- you can both read and post.
>
> Does it depend on the list? I tried to post once and my post never
> showed up.

You should get a once-per-list confirmation email. Reply to that, and you should be able to post via gmane.
If you see this post, you know it works... Cheers, M. -- /* I'd just like to take this moment to point out that C has all the expressive power of two dixie cups and a string. */ -- Jamie Zawinski from the xkeycaps source From msergeant@startechgroup.co.uk Tue Oct 1 13:05:51 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 13:05:51 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <3D996855.9030707@startechgroup.co.uk> Message-ID: <3D998F9F.1030301@startechgroup.co.uk> Michael Hudson wrote: > Matt Sergeant writes: > > >>Off the top of my head, what does frexp() do? > > >>>>print math.frexp.__doc__ >>> > frexp(x) > > Return the mantissa and exponent of x, as pair (m, e). > m is a float and e is an int, such that x = m * 2.**e. > If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0. Ah cool. Same as Math::BigFloat's $x->parts(). From richie@entrian.com Tue Oct 1 14:15:21 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 01 Oct 2002 14:15:21 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <20020928153427.D68E.JCARLSON@uci.edu> References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> Message-ID: Hi Josiah, > I have (in the past) had email software that doesn't allow arbitrary > header matching. By inserting the Subject, I guarantee that ANY email > software can filter it. A case for an option, maybe. How old was this software? (please say "Very old" 8-) Thanks for the explanations of everything else. I hope my comments were useful. 
-- Richie Hindle richie@entrian.com

From richie@entrian.com Tue Oct 1 14:15:25 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 01 Oct 2002 14:15:25 +0100 Subject: [Spambayes] Cunning use of quoted-printable Message-ID:

Afternoon all, I've just found this message in my spam corpus:

-----------------------------------------------------------------------
[Some headers snipped]
Subject: Mail for Richie Hindle
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Mail Express

Dear=20Richie=20Hindle,=0D=0A=0D=0AInternet-Soft.Com=20is=20pleased=20t=
o=20announce=20the=20release=20of=20the=20following=20new=20software=20=
programs:=0D=0A=0D=0A1)=20FTP=20Navigator=206.58=0D=0Ahttp://www.intern=
et-soft.com/DEMO/ftpnavigator.exe=0D=0A=0D=0A2)=20Web=20Site=20eXtracto=
r=208.01=0D=0Ahttp://www.esalesbiz.com/extra/webextrasetup.exe=0D=0A=0D=

[more of the same snipped]
-----------------------------------------------------------------------

Looks like an attempt to fox systems like spambayes. It doesn't make much difference, because the tokenizer decodes the quoted-printable, but it could trigger a clue token. I doubt there are enough spams out there for that to make any difference, and how to quantify whether a message looks like it's using this trick is not obvious. I only really mention it as a curiosity. It did come out as a false positive in my testing, but I don't think that was because of the quoting.

Less interesting are the results of running Tim's 4000-message tests on my corpora:

-> best cutoff for all runs: 0.56
-> with weighted total 10*2 fp + 37 fn = 57
-> fp rate 0.1% fn rate 1.85%

total unique false pos 2
total unique false neg 37
average fp % 0.1
average fn % 1.85

This tells me two things: I am Mr. Average, and the results are astonishingly impressive!
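The obfuscation above is ordinary quoted-printable, and Python's quopri module shows what the tokenizer actually sees after decoding. A minimal illustration (the byte string is retyped from a fragment of the sample, not the full message):

```python
import quopri

# "=20" is a space, "=0D=0A" is CRLF, and a bare "=" at the end of a
# line is a soft line break that simply joins the lines back together.
raw = b"Dear=20Richie=20Hindle,=0D=0A=0D=0AInternet-Soft.Com=20is=20pleased=20t=\no=20announce"
decoded = quopri.decodestring(raw)
# decoded == b"Dear Richie Hindle,\r\n\r\nInternet-Soft.Com is pleased to announce"
```

So by the time tokenization happens, the message is plain text again, which is why the trick buys the spammer nothing here.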
-- Richie Hindle richie@entrian.com From msergeant@startechgroup.co.uk Tue Oct 1 14:32:22 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 14:32:22 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> Message-ID: <3D99A3E6.4050403@startechgroup.co.uk> Richie Hindle wrote: > Hi Josiah, > > >>I have (in the past) had email software that doesn't allow arbitrary >>header matching. By inserting the Subject, I guarantee that ANY email >>software can filter it. > > > A case for an option, maybe. How old was this software? (please say "Very > old" 8-) Lotus Notes still can't filter on arbitrary headers. Matt. From msergeant@startechgroup.co.uk Tue Oct 1 14:36:35 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 14:36:35 +0100 Subject: [Spambayes] Tokenising clues Message-ID: <3D99A4E3.9000108@startechgroup.co.uk> It seems everyone is slowly stumbling on "tokenising clues" here. A "date" header issue here, a "message-id" issue there, and a particular way to format body text as another possible clue. This seems like a vast waste of your time to me. There's a couple of projects out there that have already spent vast amounts of time and programming effort into figuring out these other clues that spambayes misses out on. Rather than repeating that work, why not just rip all the rules out of SpamAssassin or some other spam checking project wholesale, and stuff those into your database? Sorry, I don't want to demean any of your work, but we need to work together to fight spam, and I'd rather not see so much time wasted on individual clues when SpamAssassin already extracts about 800 of them! Matt. 
From skip@pobox.com Tue Oct 1 14:52:55 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 08:52:55 -0500 Subject: [Spambayes] results of mining post time - slight loss Message-ID: <15769.43191.229664.140708@12-248-11-90.client.attbi.com>

(forgot to press the send key yesterday evening...)

Using six-minute time buckets gleaned from Date: headers, here are the results (executive summary: slight loss). Buckets were computed as I suggested in my previous email:

    (h*60+m)//10

that is, six-minute intervals (maybe I should name this option the lawyer-fee-increment (*)?)

Before:

    [TestDriver]
    spam_cutoff: 0.4

After:

    [Tokenizer]
    mine_date_headers: True

    [TestDriver]
    spam_cutoff: 0.4

Results:

cutoffs -> times -> tested 200 hams & 200 spams against 1800 hams & 1800 spams ... yadda yadda yadda

false positive percentages
    1.000  1.000  tied
    1.500  1.500  tied
    1.000  1.000  tied
    1.000  1.500  lost  +50.00%
    1.000  1.000  tied
    1.500  1.500  tied
    3.500  3.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
    1.500  2.000  lost  +33.33%

won   0 times
tied  8 times
lost  2 times

total unique fp went from 30 to 32 lost +6.67%
mean fp % went from 1.5 to 1.6 lost +6.67%

false negative percentages
    0.500  0.500  tied
    1.500  1.500  tied
    0.500  0.500  tied
    0.500  0.500  tied
    2.000  2.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    1.000  1.000  tied
    0.000  0.000  tied
    1.500  1.500  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 17 to 17 tied
mean fn % went from 0.85 to 0.85 tied

ham mean                     ham sdev
  20.82   20.98   +0.77%       6.43    6.47   +0.62%
  21.86   21.96   +0.46%       6.63    6.62   -0.15%
  21.38   21.52   +0.65%       6.49    6.56   +1.08%
  21.96   22.09   +0.59%       6.26    6.29   +0.48%
  21.51   21.67   +0.74%       6.72    6.75   +0.45%
  21.66   21.78   +0.55%       6.98    7.00   +0.29%
  21.45   21.59   +0.65%       7.66    7.62   -0.52%
  21.74   21.88   +0.64%       6.69    6.68   -0.15%
  21.71   21.84   +0.60%       7.44    7.43   -0.13%
  21.87   21.96   +0.41%       5.93    5.93   +0.00%

ham mean and sdev for all runs
  21.60   21.73   +0.60%       6.75    6.76   +0.15%

spam mean                    spam sdev
  74.10   73.87   -0.31%      12.99   12.80   -1.46%
  72.47   72.28   -0.26%      13.92   13.79   -0.93%
  74.05   73.83   -0.30%      13.00   12.85   -1.15%
  74.00   73.83   -0.23%      12.27   12.11   -1.30%
  72.43   72.18   -0.35%      13.73   13.45   -2.04%
  72.68   72.44   -0.33%      13.27   13.11   -1.21%
  72.57   72.44   -0.18%      13.03   12.94   -0.69%
  71.50   71.34   -0.22%      12.12   12.01   -0.91%
  73.25   73.05   -0.27%      12.67   12.50   -1.34%
  73.02   72.81   -0.29%      12.44   12.29   -1.21%

spam mean and sdev for all runs
  73.01   72.81   -0.27%      12.98   12.82   -1.23%

ham/spam mean difference: 51.41 51.08 -0.33

Skip

(*) It's a sad commentary on the litigiousness of Americans if someone like me who's basically never been to a lawyer recognizes the stereotypical six-minute increment lawyers are supposed to use to bill their clients. (Or maybe I watched too much "LA Law" at a crucial period of my life...)

From anthony@interlink.com.au Tue Oct 1 15:22:16 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 02 Oct 2002 00:22:16 +1000 Subject: [Spambayes] Tokenising clues In-Reply-To: <3D99A4E3.9000108@startechgroup.co.uk> Message-ID: <200210011422.g91EMHT04893@localhost.localdomain>

>>> Matt Sergeant wrote
> This seems like a vast waste of your time to me. There's a couple of
> projects out there that have already spent vast amounts of time and
> programming effort into figuring out these other clues that spambayes
> misses out on. Rather than repeating that work, why not just rip all the
> rules out of SpamAssassin or some other spam checking project wholesale,
> and stuff those into your database?

The problems are that

- many of the existing tools are of the "if this header says _this_, it indicates spamminess of -this- much". The stuff here is more trying to work out answers that work without having to try and produce magic numbers for what a particular header value means.

- a lot of the problems are from the testing corpuses (yes, I know the word is corpora, corpuses looks cooler :) and the mixed nature of them. This rules out a bunch of "obvious" tricks.

- spamassassin, in particular, is written in perl.
I tried looking through it to grok clues and started having twitches and convulsions. Been through the perl horror, not going back :) I couldn't find a simple doco of "here's what SA looks at" in the docs. > Sorry, I don't want to demean any of your work, but we need to work > together to fight spam, and I'd rather not see so much time wasted on > individual clues when SpamAssassin already extracts about 800 of them! The problem with SA for at least one of the applications I have is that it's way, way too aggressive. My monster corpus is the main contact email for the company I work for. SA kicks out far too many legitimate commercial email messages. But that mailbox gets (in the last week) something like 200 spams a day - probably more. Sifting through the hits looking for the real posts is too much work. If there is a list of existing tokenisation clues we can work from, excellent! I know I won't mind re-using someone else's hard-won experience in this area. :) Anthony -- Anthony Baxter It's never too late to have a happy childhood. From skip@pobox.com Tue Oct 1 15:39:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 09:39:20 -0500 Subject: [Spambayes] new virus... Message-ID: <15769.45976.592234.222829@12-248-11-90.client.attbi.com> Not quite on-topic for this group, but I know some people are interested in getting this project to identify viruses. FYI... Virus Could Prove Real Bugbear for Networks A new mass-mailing virus, which hit the Internet on Monday, could cause quite a bit of damage to vulnerable networks. The virus, known as Bugbear, installs a Trojan on infected machines that is capable of logging users' keystrokes, which could include passwords and other sensitive information. 
http://eletters1.ziffdavis.com/cgi-bin10/flo?y=eSHe0EWaTF0E4J0q1G0Ac

Skip

From msergeant@startechgroup.co.uk Tue Oct 1 15:41:56 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 15:41:56 +0100 Subject: [Spambayes] Tokenising clues References: <200210011422.g91EMHT04893@localhost.localdomain> Message-ID: <3D99B434.1000006@startechgroup.co.uk>

Anthony Baxter wrote:
>>>> Matt Sergeant wrote
>>>
>> This seems like a vast waste of your time to me. There's a couple of
>> projects out there that have already spent vast amounts of time and
>> programming effort into figuring out these other clues that spambayes
>> misses out on. Rather than repeating that work, why not just rip all the
>> rules out of SpamAssassin or some other spam checking project wholesale,
>> and stuff those into your database?
>
> The problems are that
>
> - many of the existing tools are of the "if this header says _this_,
> it indicates spamminess of -this- much". The stuff here is more
> trying to work out answers that work without having to try and
> produce magic numbers for what a particular header value means.

The scoring is independent of the matching. The scoring is merely a by-product of running the matches through the genetic algorithm - in order to feed that genetic algorithm we have to not care what the score is (as that's prior knowledge, thus bad).

> - a lot of the problems are from the testing corpuses (yes, I know
> the word is corpora, corpuses looks cooler :) and the mixed nature
> of them. This rules out a bunch of "obvious" tricks.

This is suggested as an extension of what you do, not a replacement though. You've already got accurate code, but it seems that spamassassin was able to get clues from your FN's that word tokenisation missed. The very nature of what you're doing will mean that if the SA rules aren't as accurate as the tokens you do find in an email then it won't matter. But it's just that little bit more information.
> - spamassassin, in particular, is written in perl. I tried looking
>   through it to grok clues and started having twitches and convulsions.
>   Been through the perl horror, not going back :)
>   I couldn't find a simple doco of "here's what SA looks at" in the docs.

Check the rules/ directory. You can read regexps I assume. That's all
SpamAssassin is - a big regexp engine. There are rules that run code (we
call them eval tests), but most of them aren't that complex, for example
something that looks at eval:subject_all_caps() will run:

  sub subject_is_all_caps {
    my ($self) = @_;
    my $subject = $self->get('Subject');
    $subject =~ s/^\s+//;
    $subject =~ s/\s+$//;
    return 0 if $subject !~ /\s/;        # don't match one word subjects
    return 0 if (length $subject < 10);  # don't match short subjects
    $subject =~ s/[^a-zA-Z]//g;          # only look at letters
    return length($subject) && ($subject eq uc($subject));
  }

if you change all the arrows to dots, and remove all the dollars,
semi-colons and curly brackets, you get:

  sub subject_is_all_caps
    subject = self.get('Subject')
    subject =~ s/^\s+//
    subject =~ s/\s+$//
    return 0 if subject !~ /\s/        # don't match one word subjects
    return 0 if (length subject < 10)  # don't match short subjects
    subject =~ s/[^a-zA-Z]//g          # only look at letters
    return length(subject) && (subject eq uc(subject))

It's almost like python! ;-)

>>Sorry, I don't want to demean any of your work, but we need to work
>>together to fight spam, and I'd rather not see so much time wasted on
>>individual clues when SpamAssassin already extracts about 800 of them!

> The problem with SA for at least one of the applications I have is that
> it's way, way too aggressive.

So up your threshold, or train it yourself. Isn't that what you're doing
with spambayes?

> My monster corpus is the main contact email
> for the company I work for. SA kicks out far too many legitimate
> commercial email messages. But that mailbox gets (in the last week)
> something like 200 spams a day - probably more.
> Sifting through the hits looking for the real posts is too much work.
>
> If there is a list of existing tokenisation clues we can work from,
> excellent! I know I won't mind re-using someone else's hard-won experience
> in this area. :)

Yep, check the rules/ directory. Particularly the 20_* files, which are
the header, body and rawbody rules (don't worry about the distinction
between body and rawbody for now - it's really rather bogus ;-)

Matt.

From tim.one@comcast.net Tue Oct 1 16:10:12 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 11:10:12 -0400
Subject: [Spambayes] Re: Matt Sergeant: Introduction
Message-ID: <39e6338f5b.38f5b39e63@icomcast.net>

>>>Off the top of my head, what does frexp() do?

>> frexp(x)
>>
>> Return the mantissa and exponent of x, as pair (m, e).
>> m is a float and e is an int, such that x = m * 2.**e.
>> If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0.

> Ah cool. Same as Math::BigFloat's $x->parts().

Maybe -- I like to think of it as being the same as the frexp() defined by
the C standard.
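[Editorial aside: Python's math module exposes the C-standard frexp()
directly, so the invariants quoted above are easy to check. A quick
illustrative snippet, not from the thread:]

```python
import math

# frexp(x) returns (m, e) with x == m * 2.0**e and 0.5 <= abs(m) < 1.0
# for nonzero x; x == 0 gives (0.0, 0), exactly as documented above.
m, e = math.frexp(12.0)
print(m, e)                          # 0.75 4, since 12.0 == 0.75 * 2**4

assert math.frexp(0.0) == (0.0, 0)

x = 0.1
m, e = math.frexp(x)
assert x == m * 2.0 ** e and 0.5 <= abs(m) < 1.0
```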
From noreply@sourceforge.net Tue Oct 1 10:31:38 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 01 Oct 2002 02:31:38 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 13:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 From msergeant@startechgroup.co.uk Tue Oct 1 16:22:59 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Tue, 01 Oct 2002 16:22:59 +0100 Subject: [Spambayes] Re: Matt Sergeant: Introduction References: <39e6338f5b.38f5b39e63@icomcast.net> Message-ID: <3D99BDD3.5020802@startechgroup.co.uk> Tim Peters wrote: >>>>Off the top of my head, what does frexp() do? >>> > >>>frexp(x) >>> >>>Return the mantissa and exponent of x, as pair (m, e). >>>m is a float and e is an int, such that x = m * 2.**e. >>>If x is 0, m and e are both 0. Else 0.5 <= abs(m) < 1.0. >> > >>Ah cool. Same as Math::BigFloat's $x->parts(). > > > Maybe -- I like to think of it as being the same as the frexp() defined > by the C standard . Duh, yeah. That was just the first search.cpan.org result for mantissa ;-) So it's the same as POSIX::frexp() ;-) From gward@python.net Tue Oct 1 16:41:24 2002 From: gward@python.net (Greg Ward) Date: Tue, 1 Oct 2002 11:41:24 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: References: Message-ID: <20021001154123.GA1581@cthulhu.gerg.ca> On 01 October 2002, Richie Hindle said: [... message with lots of quoted-printable in it ...] 
> Looks like an attempt to fox systems like spambayes. It doesn't make
> much difference, because the tokenizer decodes the quoted-printable, but
> it could trigger a clue token.

SpamAssassin has a test for this -- MIME_EXCESSIVE_QP:

  rawbody  MIME_EXCESSIVE_QP  eval:check_for_mime_excessive_qp()
  describe MIME_EXCESSIVE_QP  Excessive quoted-printable encoding in body
  score    MIME_EXCESSIVE_QP  2.070

The implementation is pretty simple:

  sub check_for_mime_excessive_qp {
    my ($self) = @_;

    # Note: We don't use rawbody because it removes MIME parts. Instead,
    # we get the raw unfiltered body. We must not change any lines.
    my $body = join('', @{$self->{msg}->get_body()});

    my $length = length($body);
    my $qp = $body =~ s/\=([0-9A-Fa-f]{2,2})/$1/g;

    # this seems like a decent cutoff
    return ($length != 0 && ($qp > ($length / 20)));
  }

(Hey, now that Matt Sergeant is on the list, I can stop being the local
SpamAssassin expert! *phew*!)

I guess there are a couple of ways to translate this to a stream-of-tokens
approach:

  * do a tokenizing pass over the raw message body, and spit out a whole
    lot of "=20" tokens

  * examine the raw body in a non-tokenizing way, and just emit a "lots
    of quoted-printable" token

  * ...?

Greg

--
Greg Ward http://www.gerg.ca/
Did YOU find a DIGITAL WATCH in YOUR box of VELVEETA?

From gward@python.net Tue Oct 1 16:50:13 2002
From: gward@python.net (Greg Ward)
Date: Tue, 1 Oct 2002 11:50:13 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <3D99A4E3.9000108@startechgroup.co.uk>
References: <3D99A4E3.9000108@startechgroup.co.uk>
Message-ID: <20021001155013.GB1581@cthulhu.gerg.ca>

On 01 October 2002, Matt Sergeant said:
> This seems like a vast waste of your time to me. There's a couple of
> projects out there that have already spent vast amounts of time and
> programming effort into figuring out these other clues that spambayes
> misses out on.
Rather than repeating that work, why not just rip all the > rules out of SpamAssassin or some other spam checking project wholesale, > and stuff those into your database? The tricky part is not stealing relevant code from SpamAssassin -- I just posted SA's "excessive quoted printable" hack, and I'm sure I could translate it into Python in 10 minutes. Not all Python hackers are afraid of Perl. ;-) (Tim could probably do it in 10 seconds, but never mind.) The trick is how to integrate it into spambayes' overall approach, where a message is simply distilled into a stream of tokens for training or prediction. It's a very different model from SpamAssassin -- it's one thing to write the code that says, "this message has a lot of quoted-printable characters in it", and it's another thing entirely to decide how to use that knowledge in an appropriate way. It's like the difference between writing the rule and coming up with a score for it. This, IMHO, is one respect in which SA is much more mature than spambayes: I see a lot of people here groping through a multi-dimensional space made up of various options and algorithm tweaks, trying to optimize something (the FP rate, the FN rate, the distance between the two histograms, whatever). In contrast, SpamAssassin drastically simplifies the space to explore -- it's the space of all SA rules and scores -- and automates the optimization by using a genetic algorithm. There's a middle ground waiting to be found somewhere... Greg -- Greg Ward http://www.gerg.ca/ I used to be a FUNDAMENTALIST, but then I heard about the HIGH RADIATION LEVELS and bought an ENCYCLOPEDIA!! 
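[Editorial aside: Greg's earlier suggestion for the excessive
quoted-printable check -- examine the raw body in a non-tokenizing way and
emit a single "lots of quoted-printable" token -- is easy to prototype. A
minimal sketch reusing SpamAssassin's 1-in-20 cutoff; the function name
and token spelling are invented here, not spambayes' actual code:]

```python
import re

QP_ESCAPE = re.compile(r'=([0-9A-Fa-f]{2})')

def qp_tokens(raw_body):
    """Yield one synthetic clue token when quoted-printable escapes make
    up an excessive share of the raw (undecoded) body.  The 1-in-20
    threshold mirrors SpamAssassin's check_for_mime_excessive_qp; the
    token name itself is made up for illustration."""
    n_escapes = len(QP_ESCAPE.findall(raw_body))
    if raw_body and n_escapes > len(raw_body) / 20.0:
        yield 'control: excessive quoted-printable'

# A heavily QP-encoded body (like the spam Richie posted) trips the token:
spammy = 'Dear=20Richie=20Hindle,=0D=0A' * 10
print(list(qp_tokens(spammy)))        # ['control: excessive quoted-printable']
print(list(qp_tokens('plain text')))  # []
```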
From skip@pobox.com Tue Oct 1 17:00:09 2002 From: skip@pobox.com (Skip Montanaro) Date: Tue, 1 Oct 2002 11:00:09 -0500 Subject: [Spambayes] Tokenising clues In-Reply-To: <3D99B434.1000006@startechgroup.co.uk> References: <200210011422.g91EMHT04893@localhost.localdomain> <3D99B434.1000006@startechgroup.co.uk> Message-ID: <15769.50825.784716.147473@12-248-11-90.client.attbi.com> >>> Sorry, I don't want to demean any of your work, but we need to work >>> together to fight spam, and I'd rather not see so much time wasted >>> on individual clues when SpamAssassin already extracts about 800 of >>> them! >> The problem with SA for at least one of the applications I have is >> that it's way, way too aggressive. Matt> So up your threshold, or train it yourself. Isn't that what you're Matt> doing with spambayes? If I understand things correctly, the SA genetic algorithm trains using a huge body of mail (how many ham & spam test inputs are fed to the GA?). If a huge collection is necessary, that would pretty much rule out individuals doing their own training. Have the SA gang done any tests to see how accurate the GA is with small ham/spam collections? Are the inputs fed to the GA pruned periodically to eliminate old messages? I assume that training using an individual's ham/spam collection would make it more accurate for that person's future mail. On the other hand, spambayes training (ignoring all the experimenting we're doing at the moment) pretty much just consists of separating known ham and spam, training on that periodically, then feeding incoming messages to the classifier. It looks like at this point, the spambayes stuff works pretty well for individuals with relatively small collections (200-400 of each). It remains to be seen if a "default" set of indicators would work for a large population. 
Even within this small community, we have pretty variable results across individuals, partly because we make mistakes establishing our training sets and partly because our email interests vary. I view the two projects as complementary and don't find any of the potential duplication of effort a problem. Having multiple ways to look at ham and spam makes it much harder for the bad guys to sneak something through and also creates new opportunities for each other. Last night I noticed that one of my strongest ham indicators is skip:_ 40 Turns out that many mailing lists - at least those managed by Mailman - by default add a trailer to the end of each message, like so: _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes I subscribe to and administer a number of Mailman-managed mailing lists, so it's a good ham indicator for me. For others who tend not to subscribe to any such lists it would obviously be less valuable. There's a hammy rule for the SA gang which I doubt is currently in the SA rule set. ("url:mailman" is not quite as good a ham indicator as the forty underscore token.) Skip From nas@python.ca Tue Oct 1 17:19:31 2002 From: nas@python.ca (Neil Schemenauer) Date: Tue, 1 Oct 2002 09:19:31 -0700 Subject: [Spambayes] Tokenising clues In-Reply-To: <20021001155013.GB1581@cthulhu.gerg.ca> References: <3D99A4E3.9000108@startechgroup.co.uk> <20021001155013.GB1581@cthulhu.gerg.ca> Message-ID: <20021001161931.GA29333@glacier.arctrix.com> Greg Ward wrote: > This, IMHO, is one respect in which SA is much more mature than > spambayes: I see a lot of people here groping through a > multi-dimensional space made up of various options and algorithm tweaks, > trying to optimize something (the FP rate, the FN rate, the distance > between the two histograms, whatever). 
> In contrast, SpamAssassin drastically simplifies the space to explore --
> it's the space of all SA rules and scores -- and automates the
> optimization by using a genetic algorithm. There's a middle ground
> waiting to be found somewhere...

SpamAssassin's smaller search space comes at a price. People have to
continuously come up with new rules.

I don't like the way the tokenizer is heading right now either. I want to
try generating n-grams from the headers. If that can be made to work
reasonably well I think it will be a much better approach long term.

Neil

From msergeant@startechgroup.co.uk Tue Oct 1 17:14:57 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 01 Oct 2002 17:14:57 +0100
Subject: [Spambayes] Tokenising clues
References: <200210011422.g91EMHT04893@localhost.localdomain> <3D99B434.1000006@startechgroup.co.uk> <15769.50825.784716.147473@12-248-11-90.client.attbi.com>
Message-ID: <3D99CA01.1090100@startechgroup.co.uk>

Skip Montanaro wrote:
> >>> Sorry, I don't want to demean any of your work, but we need to work
> >>> together to fight spam, and I'd rather not see so much time wasted
> >>> on individual clues when SpamAssassin already extracts about 800 of
> >>> them!
>
> >> The problem with SA for at least one of the applications I have is
> >> that it's way, way too aggressive.
>
> Matt> So up your threshold, or train it yourself. Isn't that what you're
> Matt> doing with spambayes?
>
> If I understand things correctly, the SA genetic algorithm trains using a
> huge body of mail (how many ham & spam test inputs are fed to the GA?). If
> a huge collection is necessary, that would pretty much rule out individuals
> doing their own training. Have the SA gang done any tests to see how
> accurate the GA is with small ham/spam collections? Are the inputs fed to
> the GA pruned periodically to eliminate old messages? I assume that
> training using an individual's ham/spam collection would make it more
> accurate for that person's future mail.
We haven't done that much testing on small data sets, but that's because the project aims are very different - I see spambayes as an experiment right now, whereas SpamAssassin has to genericise to large numbers of users out of the box. Feel free to try your own training though and let us know how it goes! > On the other hand, spambayes training (ignoring all the experimenting we're > doing at the moment) pretty much just consists of separating known ham and > spam, training on that periodically, then feeding incoming messages to the > classifier. Same as SpamAssassin. You run mass-check on a bunch of spam and non-spam, then feed that into the GA. It takes a *lot* longer than a statistical classifier, but that's the only difference I can see. > I view the two projects as complementary and don't find any of the potential > duplication of effort a problem. Having multiple ways to look at ham and > spam makes it much harder for the bad guys to sneak something through and > also creates new opportunities for each other. Last night I noticed that > one of my strongest ham indicators is > > skip:_ 40 > > Turns out that many mailing lists - at least those managed by Mailman - by > default add a trailer to the end of each message, like so: > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman-21/listinfo/spambayes > > I subscribe to and administer a number of Mailman-managed mailing lists, so > it's a good ham indicator for me. For others who tend not to subscribe to > any such lists it would obviously be less valuable. > > There's a hammy rule for the SA gang which I doubt is currently in the SA > rule set. ("url:mailman" is not quite as good a ham indicator as the forty > underscore token.) We have a much more robust mailman detector already. 
And that's my point - a spammer can get around your naive "mailman detector" with a bunch of underscores anywhere in his message, but he has to work a lot harder to get around a more robust detection system (it's not invincible, but it would probably require him modifying his software). So give the dog (spambayes) a bone. Let it eat all the information you can give it. None of it is going to hurt, or if it does you can chuck that out like you have been doing for a few weeks already with other tokenising ideas! Matt. From noreply@sourceforge.net Tue Oct 1 17:04:12 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 01 Oct 2002 09:04:12 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 04:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- >Comment By: Skip Montanaro (montanaro) Date: 2002-10-01 11:04 Message: Logged In: YES user_id=44345 ummm.... a bit short on detail/description. What precisely do you mean by "Mozilla Mail integration"? Can you describe what you would like to see feature-wise? Note that no other mail system integration has been attempted at this point with the exception that I believe the hammie script works with procmail. 
----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702

From carel.fellinger@chello.nl Tue Oct 1 18:11:26 2002
From: carel.fellinger@chello.nl (Carel Fellinger)
Date: Tue, 1 Oct 2002 19:11:26 +0200
Subject: [Spambayes] memory consumption...
In-Reply-To: <200210010950.g919o3e02522@localhost.localdomain>
References: <20020926192704.GA3931@schaduw.felnet> <200210010950.g919o3e02522@localhost.localdomain>
Message-ID: <20021001171126.GA4184@mail.felnet>

On Tue, Oct 01, 2002 at 07:50:01PM +1000, Anthony Baxter wrote:
...
> That's what I'd expected. But it looked like this little laptop had
> got confused, and wouldn't let go of the cached ram. A reboot later
> and it's happy again, and tossing cached data away, rather than
> paging everything else out.

Any chance you're running an early 2.4 kernel? There were lots of
problems that sounded just like what you're saying here. Version 18
should work, probably earlier versions too.

--
groetjes, carel

From neale@woozle.org Tue Oct 1 17:57:19 2002
From: neale@woozle.org (Neale Pickett)
Date: 01 Oct 2002 09:57:19 -0700
Subject: [Spambayes] Some ideas I have....
In-Reply-To: References: Message-ID:

So then, John Draper is all like:
> I want to start up another discussion about what the direction of the
> group is heading, as far as addressing the issues of where spam filter
> should take place. IE: Client side, Vs Server side.

Currently we have two applications of the classifier: hammie and
pop3proxy. Both of these can run on either the client or the server.

Your "bureaucrat" model sounds a lot like an observer pattern. This is
what procmail does with incoming mail, dispatching events to various
processing functions (like hammie or spamassassin) who can each take a
swing at the message. What might be really useful would be a hook into an
existing SMTP server.
But before that happens, we need to answer some questions like whether or not it's feasible to run one classifier database against an entire organization or ISP's email. Still trying to port this to my Palm Pilot, Neale From neale@woozle.org Tue Oct 1 18:03:15 2002 From: neale@woozle.org (Neale Pickett) Date: 01 Oct 2002 10:03:15 -0700 Subject: [Spambayes] to From_ or not to From_? In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > Actually, none of mine do, because BruceG's spam didn't. I removed > all the "From " lines from the c.l.py archive to match that (easier > than inventing such lines for Bruce's msgs). I don't know that it > makes any difference for the way I run the tests, but it certainly > could make a difference if "From " lines were getting mined for clues. > I forced all my msgs alike in this respect just to cut off that > possibility. Sorry to enter this discussion a little late--I've been pretty busy with a release at work. I understand some people may not have them, but the "From " lines seem to be very useful, as they report who the sender identified themselves as in the MAIL command of the SMTP envelope. I've had a great deal of success stopping spam at the gate by denying access to people who identify themselves with addresses from certain domains. I would expect that looking at "From " lines would be a clear win for anyone. Here, I'll put my money where my mouth is. My mail program writes the >From header as an X-From: line. 
I add this to my bayescustomize.ini:

  [Tokenizer]
  basic_header_tokenize: True
  basic_header_skip: received date x-[^f][^r].*

And I get this on my tiny corpus (2x5x200 messages):

"""
false positive percentages
    1.500  1.500  tied
    1.000  1.000  tied
    2.000  1.000  won    -50.00%
    1.500  1.000  won    -33.33%
    1.500  1.000  won    -33.33%

won   3 times
tied  2 times
lost  0 times

total unique fp went from 15 to 11  won -26.67%
mean fp % went from 1.5 to 1.1      won -26.67%

false negative percentages
    1.500  1.000  won    -33.33%
    0.000  0.500  lost  +(was 0)
    1.000  1.000  tied
    0.500  0.000  won   -100.00%
    1.000  1.000  tied

won   2 times
tied  2 times
lost  1 times
"""

In all but one case where something changed, it was just a single message.
That's not a huge improvement, but maybe enough of one to convince someone
with a larger test set to try it out?

Neale

From tim.one@comcast.net Tue Oct 1 17:35:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 12:35:46 -0400
Subject: [Spambayes] RE: spambayes: CLT
In-Reply-To: <3D9973EC.5020303@startechgroup.co.uk>
Message-ID:

This is a popular offline question, so I'll take the liberty of answering
it once and for all here:

> Is there one place (or email in the archives) that describes clearly
> what the Central Limit Theorem does, and how it works? I can't seem to
> find one, and the code isn't helping all that much ;-)

CLT is a fundamental tool in statistics, and a google search will turn up
dozens of good intros. Skipping the weasel words, if you've got a large
population P, with mean M and variance V, suppose you want to estimate M.
You can take N values from P at random, and compute *their* mean, M'.
That's called "the sample mean", for obvious reasons. The CLT says that
the sample mean follows a normal distribution, with mean M and variance
V/N. The remarkable thing is that no assumption need be made about the
distribution of P: it may be normal itself, or uniform, or darned-near
crazy, or anything in-between -- it just doesn't matter.
M' is normally distributed regardless. In practical terms this says two things: M' is an unbiased estimate of M, and that you can cut the chance that M' differs significantly from M as low as you want by increasing N. Heck, by observing the variance V' of M' over many trials, you can even estimate the variance V of P, via multiplying V' by N. All that said, it's pretty much irrelevant to the "central limit" code: the calculations act as if the CLT applied here, but CLT doesn't actually apply here. The problem is that picking the N "most extreme" words from a single email isn't a *random* sampling from P by a long shot, and that violates the key precondition for applying the CLT. What we've got instead are two distinct populations, and a seat-of-the-pants rule for deciding whether we're certain a given email belongs to one of the populations (although note that no "certainty code" has been checked in yet, and "the scores" returned by the central-limit code that is checked in are pretty much nonsense right now). From python-spambayes@discworld.dyndns.org Tue Oct 1 18:12:04 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Tue, 1 Oct 2002 11:12:04 -0600 Subject: [Spambayes] to From_ or not to From_? In-Reply-To: ; from neale@woozle.org on Tue, Oct 01, 2002 at 10:03:15AM -0700 References: Message-ID: <20021001111204.A4413@discworld.dyndns.org> Neale Pickett wrote: > > I understand some people may not have them, but the "From " lines seem > to be very useful, as they report who the sender identified themselves > as in the MAIL command of the SMTP envelope. This information is also normally recorded in a Return-Path: header, which is not dependent on the mail storage format, unlike the mbox-only "From " lines. 
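[Editorial aside: a sketch of how either source of the envelope sender
could feed the tokenizer -- prefer the mbox "From " line, fall back to the
Return-Path: header Charles mentions. The helper and token names are
invented here; spambayes' real tokenizer does neither of these verbatim:]

```python
import email
import re

def envelope_sender_tokens(text):
    """Yield a clue token for the SMTP envelope sender's domain, taken
    from an mbox-style "From " separator line when present, else from
    the Return-Path: header.  Names are made up for illustration."""
    sender = None
    m = re.match(r'From (\S+) ', text)   # mbox "From " line (no colon)
    if m:
        sender = m.group(1)
    else:
        msg = email.message_from_string(text)
        rp = msg.get('Return-Path', '')
        m = re.search(r'([^<>\s]+@[^<>\s]+?)>?$', rp)
        if m:
            sender = m.group(1)
    if sender and '@' in sender:
        yield 'env-from domain: ' + sender.split('@', 1)[1].lower()

mbox_msg = 'From mg@nflyagency.com Tue Sep 24 15:33:56 2002\nSubject: x\n\nbody\n'
maildir_msg = 'Return-Path: <mg@nflyagency.com>\nSubject: x\n\nbody\n'
print(list(envelope_sender_tokens(mbox_msg)))     # ['env-from domain: nflyagency.com']
print(list(envelope_sender_tokens(maildir_msg)))  # ['env-from domain: nflyagency.com']
```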
Charles
--
-----------------------------------------------------------------------
Charles Cazabon GPL'ed software available at:
http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From neale@woozle.org Tue Oct 1 19:00:06 2002
From: neale@woozle.org (Neale Pickett)
Date: 01 Oct 2002 11:00:06 -0700
Subject: [Spambayes] Patch and info on how to run a test
In-Reply-To: <3D98B463.3040006@videotron.ca>
References: <3D98B463.3040006@videotron.ca>
Message-ID:

So then, papaDoc is all like:
> Hi,
>
> This is a small patch to help people who don't have python in
> their path.

Hmm, what you probably ought to do is run runtest.sh like so:

  PATH=$PATH:/python/dir ./runtest.sh testname

But maybe you should figure out how to get python in your path instead :)
I have a /home/neale/bin directory in my path, where I can make symbolic
links to or write wrappers around stuff that wouldn't otherwise be in my
path. Failing that, you can always add something like this at the
beginning of the file:

  python() {
      /path/to/python "$@"
  }

which makes a "python" function that runs python.

> runtest.sh is talking about
> # This test requires you have an appropriately-modified
> # Tester.py.new and classifier.py.new as detailed in
> #
> Where can I find those two files ?

That test is obsolete now; Tim's already pronounced on Gary's ideas (he
liked them). I've taken it out of runtest.sh in CVS. Now it only has a
"set1" and "set2" target, which you can use to run two timcv tests.
You'll want to run

  ./runtest.sh -r set1

at least once, and from then on you can diddle with the code and run

  ./runtest.sh set2

to see what the diddling has done.

> I don't have 2000 spams yet only 1546 now but going up every day.
I never thought I'd live to see the day when people were hoping for more spam :) Neale From neale@woozle.org Tue Oct 1 19:48:15 2002 From: neale@woozle.org (Neale Pickett) Date: 01 Oct 2002 11:48:15 -0700 Subject: [Spambayes] to From_ or not to From_? In-Reply-To: <20021001111204.A4413@discworld.dyndns.org> References: <20021001111204.A4413@discworld.dyndns.org> Message-ID: So then, Charles Cazabon is all like: > Neale Pickett wrote: > > > > I understand some people may not have them, but the "From " lines seem > > to be very useful, as they report who the sender identified themselves > > as in the MAIL command of the SMTP envelope. > > This information is also normally recorded in a Return-Path: header, which is > not dependent on the mail storage format, unlike the mbox-only "From " lines. Touché--so it is. I've been working with pipermail archives too long! Thanks for the correction, Neale From tim.one@comcast.net Tue Oct 1 20:17:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 01 Oct 2002 15:17:46 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: Message-ID: [Richie Hindle] > I've just found this message in my spam corpus: > > ----------------------------------------------------------------------- > > [Some headers snipped] > > Subject: Mail for Richie Hindle > Content-Type: text/plain; charset="us-ascii" > Content-Transfer-Encoding: quoted-printable > X-Mailer: Mail Express > > Dear=20Richie=20Hindle,=0D=0A=0D=0AInternet-Soft.Com=20is=20pleased=20t= > o=20announce=20the=20release=20of=20the=20following=20new=20software=20= > programs:=0D=0A=0D=0A1)=20FTP=20Navigator=206.58=0D=0Ahttp://www.intern= > et-soft.com/DEMO/ftpnavigator.exe=0D=0A=0D=0A2)=20Web=20Site=20eXtracto= > r=208.01=0D=0Ahttp://www.esalesbiz.com/extra/webextrasetup.exe=0D=0A=0D= > > [more of the same snipped] > > ----------------------------------------------------------------------- > > Looks like an attempt to fox system like spambayes. 
> It doesn't make much difference, because the tokenizer decodes the
> quoted-printable, but it could trigger a clue token.

The other trick of this nature is to encode the whole msg in base64. We
decode that too. tokenizer.py contains a comment block with
before-and-after tests run with and without generating tokens for
Content-Transfer-Encoding. Results were random (some runs got better,
others got worse), so I left it out. That didn't aim at catching
intentional obfuscation, though.

> I doubt there are enough spams out there for that to make any difference,
> and how to quantify whether a message looks like it's using this trick is
> not obvious. I only really mention it as a curiosity. It did come out
> as a false positive in my testing,

I *think* you meant it was a false negative, since you said it was in your
spam collection, and haven't argued that it's actually ham.

> but I don't think that was because of the quoting.

It's currently as if the quoted-printable business didn't exist. It
likely got mild ham boosts for the text/plain and us-ascii parts of the
Content-Type line.

> Less interesting are the results of running Tim's 4000-message tests on my
> corpora:
>
> -> best cutoff for all runs: 0.56
> -> with weighted total 10*2 fp + 37 fn = 57
> -> fp rate 0.1% fn rate 1.85%
> total unique false pos 2
> total unique false neg 37
> average fp % 0.1
> average fn % 1.85
>
> This tells me two things: I am Mr. Average, and the results are
> astonishingly impressive!

If you can without revealing a confidence, it would be good if you could
share the fp. Short of that, are these fp that bother you? Would you be
upset if you lost them in real life? There are about 10 msgs in my ham I
couldn't care less about, but I keep them in the ham just because they're
not truly spam.
8 of them are correctly classified at the moment, but if I found a change
that slashed the f-n rate at the cost of putting those 8 back into the
f-p class, I wouldn't count the latter against the change much.

BTW, non-Python conference announcements appear to be hated by the
central-limit versions of the classifier too -- but at least those
versions "know" they're confused about what to call them!

From tim.one@comcast.net Tue Oct 1 20:54:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 15:54:32 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <20021001161931.GA29333@glacier.arctrix.com>
Message-ID:

[Neil Schemenauer]
> ...
> I don't like the way the tokenizer is heading right now either.

I only care which way the results are heading.

> I want to try generating n-grams from the headers. If that can be
> made to work reasonably well I think it will be a much better
> approach long term.

Be sure to read the comments in tokenizer.py about previous experiments
with character n-grams. A string of length N produces N-n+1 character
n-grams, and that's a ton of clues for a single string. For example,

  Organization: Massachusetts Institute of Technology

is going to generate a big pile of ham clues, and if a spammer happens to
include that header too, it's going to be hard to overcome them. There
are some specific examples in the aforementioned comments. This should be
less severe now, though, since max_discriminators is about 10x larger than
it used to be. Certainly worth trying!

From rob@hooft.net Tue Oct 1 20:57:57 2002
From: rob@hooft.net (Rob Hooft)
Date: Tue, 01 Oct 2002 21:57:57 +0200
Subject: [Spambayes] Central limit
References: Message-ID: <3D99FE45.3010905@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
>
>> - The standard deviations seem "underestimated". Gary already said
>> this can be caused by correlations between scores.
>> Alternatively this can indicate that the data is not 1D: in more than
>> one dimension, a higher percentage of normally distributed data lies
>> outside of the "core regions".  Anyway, something can be done about
>> this: just calculate the RMS Z-score, and scale it to 1.0.
>
> Sorry, I don't know what that means or how to compute it; neither does
> google .  Let's say this is my population: {2, 5, 10, 64}.  Then what
> are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32, 64, and 1000?

You can calculate the Root Mean Square (RMS) of all Z-scores.  That is the
same as the "standard deviation" of the population.  This appears to be
around 3-4.  If we calculate the value for one test run, it can be used as
a parameter on the next run, to make sure the Z-scores really form a
distribution of 0+/-1.  These parameters might even be relatively
corpus-insensitive.

>> - The "certainty" rule of Tim should be formalized.

> Sure, but how?  I made up a combination of "look at ratios" and
> "different cutoffs for different n" by iteratively staring at the errors
> and making stuff up.  Even then all I get is a binary "certain or
> uncertain?" decision out of it, and without a clear connection to
> quantifiable probabilities I don't have strong reason to believe it's a
> sensible approach in general.

I'd say something like erf(Zspam) is the chance that the message belongs to
the spam corpus (assuming the renormalized Z-scores) and erf(Zham) is the
chance that the message belongs to the ham corpus.  If
1.0-erf(Zham)-erf(Zspam) is sizeable, that could express a "chance" that
the message belongs to neither; erf(Zham)+erf(Zspam) is then a way to
express the "classifyability".  Normally Zham and Zspam are not both small,
but the math might need to handle the weird case where the sum of these two
is larger than one...  Somebody with a proper statistical background can
probably improve on this.  Unfortunately "import math" does not come with
the error function....

Rob
--
Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Tue Oct 1 21:07:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 16:07:53 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <3D99CA01.1090100@startechgroup.co.uk>
Message-ID:

[Matt Sergeant]
> ...
> We have a much more robust mailman detector already.  And that's my point
> - a spammer can get around your naive "mailman detector" with a bunch of
> underscores anywhere in his message, but he has to work a lot harder to
> get around a more robust detection system (it's not invincible, but it
> would probably require him modifying his software).

Matt, we don't have *any* "mailman detector", and that's a key point.  We
generate "skip" tokens for every string longer than 12 chars, and that it
happened to catch a Mailman clue is pure luck.  It's not trying to *do*
anything specific.  We catch so many "Mailman clues", in fact, that I dare
not look at most of the header lines in my mixed-source data -- the Mailman
clues it picks up purely by luck then are too strong.

As to a spammer trying to exploit it, not a problem.  No single word can
determine the outcome, and if spammers take to putting '-'*40 in their
spam, the system will learn to disregard it.  I've done this experiment: I
ran my fat test, looked at the list of the top 50 discriminators, and
purged them all from the database.  Then I ran my fat test again.  The
performance wasn't significantly worse.  If one set of clues becomes
worthless, it finds another set.  So long as spam is trying to sell you
something, "it's different".

> So give the dog (spambayes) a bone.  Let it eat all the information
> you can give it.

This is fine, provided it doesn't bloat the database size, or increase
classification time, without a compensating measurable improvement in
results.
Part of the tokenizer is as finicky as it is because I'm aiming to keep
size and time requirements in bounds too (so, e.g., I deliberately don't
tokenize Content-Transfer-Encoding, and note the presence or absence of an
Organization line but without tokenizing its value: experiments showed that
what I *do* do in these cases helped, but that the parts I left out did not
help).

> None of it is going to hurt, or if it does you can chuck that out like
> you have been doing for a few weeks already with other tokenising ideas!

As a general rule, I add things that help, rather than adding in lots of
ideas at once and then throwing out the things that don't help.  Our
results steadily progress in the right direction, so I'm going to stick
with what works.

From tim.one@comcast.net Tue Oct 1 22:36:33 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 17:36:33 -0400
Subject: [Spambayes] Matt Sergeant: Introduction
In-Reply-To: <3D996855.9030707@startechgroup.co.uk>
Message-ID:

[Matt Sergeant]
> ...
> And to give back I'll tell you that one of my biggest wins was parsing
> HTML (with HTML::Parser - a C implementation so it's very fast) and
> tokenising all attributes, so I get:
>
>     colspan=2
>     face=Arial, Helvetica, sans-serif
>
> as tokens.  Plus using a proper HTML parser I get to parse HTML comments
> too (which is a win).

Matt, what are you using as test data?  The experience here has been that
HTML is sooooo strongly correlated with spam that we've added gimmick after
gimmick to remove evidence that HTML ever existed; else the rare ham that
uses HTML-- or even *discusses* HTML with a few examples! --had an
extremely hard time avoiding getting classified as spam.  As a result, by
default we're stripping all HTML tags unlooked-at (except that we mine http
etc thingies before stripping tags).  Even so, the mere presence of a
text/html content-type, and "&nbsp;", still have very high spamprob, and so
still make it hard for content-free <0.1 wink> HTML hams to get thru.
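The strip-tags-but-mine-URLs behavior Tim describes can be sketched roughly
like this.  This is not the actual spambayes tokenizer -- the regexes, the
"url:" token prefix, and the word-length bounds below are invented for
illustration only:

```python
import re

# Hedged sketch of "mine http thingies, then strip tags unlooked-at".
# The patterns and the "url:" prefix are made up; the real tokenizer
# is more careful than this.
URL_RE = re.compile(r"https?://[^\s'\">]+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")

def tokenize_html(text):
    # Mine URL clues first, so stripping the tags doesn't lose them.
    tokens = ["url:" + u.lower() for u in URL_RE.findall(text)]
    # Then drop all markup without looking at it.
    stripped = TAG_RE.sub(" ", text)
    tokens.extend(w.lower() for w in stripped.split() if 3 <= len(w) <= 12)
    return tokens
```

Run on a spammy anchor tag, this yields the URL token plus the visible
words, and nothing about the markup itself.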
A problem seems to be that everyone here subscribes to a few HTML marketing
newsletters (whether they think of them that way or not), but that the only
other HTML they get in their email is 100x more HTML spam.  That gives
every indication of HTML spamprobs >= 0.99, and legitimately so.

A compounding problem then is that the simplifying assumption of
word-probability independence is grossly violated by HTML markup -- the
prob that a msg contains colspan=2 and the prob that a msg contains
face=Arial aren't independent at all, and pretending that they are
independent grossly overestimates the rarity of seeing them both in a
single msg.

Do you find, for example, that

    colspan=2

is common in HTML ham but rare in HTML spam, or vice versa?  I know of
specific cases where we're missing good clues by purging HTML decorations,
but nobody here has yet found a strategy for leaving them in that isn't a
disaster for at least one of the error rates.  I'm wondering what's sparing
you from that fate.

> Using word tuples is also a small win,

Word bigrams were a loss for us (comments in tokenizer.py).  This should be
revisited under Gary's scheme, and/or we should stop thinking of
unsolicited conference announcements as being ham .

> but increases the database size and number of tokens you have to
> pull from the database enormously.

That was also our experience with word bigrams, but less than "enormously";
about a factor of 2; character 5-grams were snuggling up to enormously.

> That's an issue for me because I'm not using an in-memory database (one
> implementation uses CDB, another uses SQL - the SQL one is really nice
> because you can so easily do data mining, and the code to extract the
> token probabilities is just a view).

I haven't hooked ours up to a database yet, but others have.  It's
premature for my present purposes .

> ...
> Well I very quickly found out that most of the academic research into
> this has been pretty bogus.  For example everyone seems (seemed?)
> to think that stemming was a big win, but I found it to lose every time.

We haven't tried that.  OTOH, the academic research has been on Bayesian
classifiers, and this isn't one (despite that Paul called it one).

> ...
> The one thing that still bothers me about Gary's method is that
> the threshold value varies depending on corpus.  Though I expect there's
> some mileage in being able to say that the middle ground is "unknown".

It does allow for an easy, gradual, and effective way to favor f-n at the
expense of f-p, or vice versa.  There was no such possibility under Paul's
scheme, as the more training data we fed in, the rarer it was for *any*
score not to be extremely close to 0 or extremely close to 1, regardless of
whether the classification was right or wrong.  Gary's method hasn't been
caught in such extreme embarrassment yet.

OTOH, it *is*, as you say, corpus dependent, and it seems hard to get that
across to people.  Gary has said he knows of ways to make the distinction
sharper, but we haven't yet been able to provoke him into revealing them .
The central limit variations, and especially the logarithmic one, are much
more extreme this way.

> ...
> OK, I'll go over it again this week and next time I get stuck I'll mail
> out for some help ;-)  The hardest part really is getting from how my
> code is structured (i.e. where I get my data from, how I store it, etc)
> to your version.  Simple examples like where you use a priority queue for
> the probabilities so you can extract the top N indicators, I just use an
> array, and use a sort to get the top N.

The priority queue was potentially much more efficient when
max_discriminators was 15.  I expect that it costs more than it's worth now
that we've boosted it to 150, so if there's ever a hint that the scoring
time is non-trivial, I'll probably use an array too.

> So mostly it's just the details of storage that confuse me.

I'm not sure what that means, but expect it will get fleshed out in time.

> ...
> And where is compute_population_stats used?

It has a non-trivial implementation only under the central-limit
variations, of which there are two.  It's intended to be called after
update_probabilities is called at the end of training, to do a third
training pass computing population ham & spam means & variances.  Most
people here aren't aware of that, as it happens "by magic" when a test
driver calls TestDriver.Driver.train():

    # CAUTION: this just doesn't work for incremental training when
    # options.use_central_limit is in effect.
    def train(self, ham, spam):
        print "-> Training on", ham, "&", spam, "...",
        c = self.classifier
        nham, nspam = c.nham, c.nspam
        self.tester.train(ham, spam)
        print c.nham - nham, "hams &", c.nspam - nspam, "spams"
        c.compute_population_stats(ham, False)
        c.compute_population_stats(spam, True)

>>> ...
>>> such as how the probability stuff works so much better on individuals'
>>> corpora (or on a particular mailing list's corpus) than it does for
>>> hundreds of thousands of users.

> On my personal email I was seeing about 5 FP's in 4000, and about 20
> FN's in about the same number (can't find the exact figures right now).

So to match the units and order of the next sentence, about 0.5% FN rate
and 0.13% FP rate.

> On a live feed of customer email we're seeing about 4% FN's and 2% FP's.

Is that across hundreds of thousands of users?  Do you know the
corresponding statistics for SpamAssassin?  For python.org use, I've
thought that as long as we could keep this scheme fast, it may be a good
way to reduce the SpamAssassin load.

From tim.one@comcast.net Wed Oct 2 00:35:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 19:35:14 -0400
Subject: [Spambayes] to From_ or not to From_?
In-Reply-To:
Message-ID:

[Neale Pickett, "From " lines]
> ...
> Here, I'll put my money where my mouth is.  My mail program writes the
> From header as an X-From: line.
> I add this to my bayescustomize.ini:
>
> [Tokenizer]
> basic_header_tokenize: True
> basic_header_skip: received
>     date
>     x-[^f][^r].*

Note that this tokenizes a great many header lines beyond just x-from.
Something like

    basic_header_skip: (?!x-from)

would have been sharper (that's a negative lookahead assertion: it matches
iff the header name doesn't match x-from, so it skips a header line iff
it's not x-from, so it looks only at x-from -- all obvious to the most
casual observer ).

> And I get this on my tiny corpus (2x5x200 messages):
>
> """
> false positive percentages
>     1.500  1.500  tied
>     1.000  1.000  tied
>     2.000  1.000  won    -50.00%
>     1.500  1.000  won    -33.33%
>     1.500  1.000  won    -33.33%
>
> won   3 times
> tied  2 times
> lost  0 times
>
> total unique fp went from 15 to 11 won -26.67%
> mean fp % went from 1.5 to 1.1 won -26.67%
>
> false negative percentages
>     1.500  1.000  won    -33.33%
>     0.000  0.500  lost  +(was 0)
>     1.000  1.000  tied
>     0.500  0.000  won   -100.00%
>     1.000  1.000  tied
>
> won   2 times
> tied  2 times
> lost  1 times
> """
>
> In all but one case where something changed, it was just a single
> message.  That's not a huge improvement,

*Relative to* your error rates, it was a huge improvement, but it's hard to
be confident about it because the absolute # of msgs involved is so small.
Still, that it won 3 times on f-p, and never lost, adds to the confidence
you should have that it truly helped.

> but maybe enough of one to convince someone with a larger test
> set to try it out?

I can't get away with tokenizing so many header lines; there are too many
"good clues for bad reasons" in my mixed-source data.

From tim.one@comcast.net Wed Oct 2 00:52:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 19:52:41 -0400
Subject: [Spambayes] Tokenising clues
In-Reply-To: <20021001155013.GB1581@cthulhu.gerg.ca>
Message-ID:

[Greg Ward]
> ...
> This, IMHO, is one respect in which SA is much more mature than
> spambayes: I see a lot of people here groping through a
> multi-dimensional space made up of various options and algorithm tweaks,
> trying to optimize something (the FP rate, the FN rate, the distance
> between the two histograms, whatever).  In contrast, SpamAssassin
> drastically simplifies the space to explore -- it's the space of all SA
> rules and scores -- and automates the optimization by using a genetic
> algorithm.  There's a middle ground waiting to be found somewhere...

There are many ways to automatically improve learning algorithms, and at
least boosting has been mentioned here several times.  It would likely also
be suitable for improving SpamAssassin.  Robert Schapire's papers are the
ones to read:

    http://www.research.att.com/~schapire/boost.html

I can't make time to pursue it, and, at least on my corpus, I doubt any
algorithm is going to do significantly better than what we've got right now
(given my current error rates, that's not really open to rational
debate ).

From tim.one@comcast.net Wed Oct 2 01:03:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 01 Oct 2002 20:03:08 -0400
Subject: [Spambayes] new virus...
In-Reply-To: <15769.45976.592234.222829@12-248-11-90.client.attbi.com>
Message-ID:

[Skip Montanaro]
> Not quite on-topic for this group, but I know some people are
> interested in getting this project to identify viruses.

Greg Ward explained the very simple scheme he uses to catch viruses at
python.org, and by all signs it's very effective.  Provided users are
willing to block executables of all kinds, piece o' cake!

I'm working on a much more insidious virus, which exploits the confusion
caused by mixing tabs and spaces .
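Greg's actual python.org scheme isn't spelled out in this thread, but
extension-based blocking of executable attachments is simple to sketch with
the standard email package.  The blocklist below is a guess for
illustration, not python.org's real list:

```python
import email

# Hypothetical blocklist -- the extensions python.org actually blocks
# aren't given in this thread.
BLOCKED_EXTENSIONS = {".exe", ".scr", ".pif", ".bat", ".com", ".vbs"}

def has_blocked_attachment(raw_message):
    """Return True if any MIME part carries an executable-looking filename."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        filename = part.get_filename()
        if filename and any(filename.lower().endswith(ext)
                            for ext in BLOCKED_EXTENSIONS):
            return True
    return False
```

The appeal of the scheme is exactly what Tim notes: it needs no training at
all, only users willing to live without executables in their mail.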
From jcarlson@uci.edu Wed Oct 2 02:42:51 2002
From: jcarlson@uci.edu (Josiah Carlson)
Date: Tue, 01 Oct 2002 18:42:51 -0700
Subject: [Spambayes] Good evening/morning/afternoon everyone
In-Reply-To:
References: <20020928153427.D68E.JCARLSON@uci.edu>
Message-ID: <20021001183417.D9A6.JCARLSON@uci.edu>

Richie,

> > I have (in the past) had email software that doesn't allow arbitrary
> > header matching.  By inserting the Subject, I guarantee that ANY email
> > software can filter it.
>
> A case for an option, maybe.  How old was this software?  (please say
> "Very old" 8-)
>
> Thanks for the explanations of everything else.  I hope my comments were
> useful.

It is in fact fairly old...like '98 vintage or so.  I recently upgraded to
their 2.0 release and thought it was the same.  Turns out it automatically
parses new headers and allows one to search for those specifically.  Pretty
neat.  (FYI I'm using Becky! internet email and just wrote a console Python
app to parse and read email from their proprietary files; works pretty
smoothly.)

Your comments were very useful.  I've fixed the size stuff and went with
the X-Hammie-Disposition header.  Also fixed some other code and added a
few utilities (like one to convert mbox format files to pasp, and vice
versa).  No problem about the explanations, thanks for the comments *smile*

- Josiah

From tim_one@email.msn.com Wed Oct 2 05:30:50 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 2 Oct 2002 00:30:50 -0400
Subject: [Spambayes] Central limit
In-Reply-To: <3D99FE45.3010905@hooft.net>
Message-ID:

[Rob Hooft]
>>> - The standard deviations seem "underestimated".  Gary already said
>>> this can be caused by correlations between scores.  Alternatively
>>> this can indicate that the data is not 1D: in more than one
>>> dimension, a higher percentage of normally distributed data lies
>>> outside of the "core regions".  Anyway, something can be done about
>>> this: just calculate the RMS Z-score, and scale it to 1.0.
Just noting that since the central limit theorem doesn't really apply here,
the justification for dividing the population variance by n to estimate the
sample variance seems approximately non-existent.  That division has a big
effect on what we're taking to be "the sdev" when n=50.

[Tim]
>> Sorry, I don't know what that means or how to compute it; neither does
>> google .  Let's say this is my population: {2, 5, 10, 64}.  Then
>> what are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32,
>> 64, and 1000?

[Rob]
> You can calculate the Root Mean Square (RMS) of all Z-scores.

By RMS I understand you to mean the square root of the mean of the Z-score
squares.  Is that what you mean?

> That is the same as the "standard deviation" of the population.

But now I'm lost again.  The RMS of a population isn't the same as the sdev
of a population (as I understand sdev), unless the mean of the population
happens to be 0.  The mean Z-score is definitely not 0 in the results I'm
seeing; the zscores for ham are highly skewed to one side of 0.

> This appears to be around 3-4.

For what?  The RMS of {1, 2, 32, 64, 1000} is about 450:

    >>> math.sqrt((1**2 + 2**2 + 32**2 + 64**2 + 1000**2)/5.)
    448.35811579584458

The RMS of {2, 5, 10, 64} is about 32:

    >>> math.sqrt((2**2 + 5**2 + 10**2 + 64**2)/4.)
    32.5

OTOH, what I understand to be the sdevs of those are about 32 and 25,
respectively.  So I'm out of ideas for where 3-4 might come from.  I note
that most google hits on "RMS Z-score" land on the WHAT IF program you
worked on as a postdoc -- so I suspect this is a case where something is so
obvious to you it may be impossible for you to explain it <0.9 wink>.

> If we calculate the value for one test run, it can be used as a parameter
> on the next run, to make sure the Z-scores really form a distribution of
> 0+/-1.  These parameters might even be relatively
> corpus-insensitive.
> ...
> I'd say something like erf(Zspam) is the chance that the message belongs
> to the spam corpus (assuming the renormalized Z scores) and erf(Zham) is
> the chance that the message belongs to the ham corpus.
>
> If 1.0-erf(Zham)-erf(Zspam) is sizeable, that could express a "chance"
> that the message belongs to neither; erf(Zham)+erf(Zspam) is then a way
> to express the "classifyability".  Normally not both of Zham and Zspam
> are small, but the math might need to handle the case that the sum of
> these two is larger than one for the weird case...  Somebody with a
> proper statistical background can probably improve on this.
> Unfortunately import math does not come with the error function....

That won't be a problem if there's something useful here.  I'm willing to
pursue it, but am getting hints that it will work worse than the
seat-of-the-pants ratio gimmick.  The "region of certainty" where the ratio
gimmick has never been *observed* to make a mistake includes cases where
both zscores are so large that, no matter how we fiddle them, the system
would believe there's no earthly chance the msg came from either
population.  But so long as

    |larger zscore| / |smaller zscore|

"is big enough", it doesn't seem to matter how large the smaller zscore is.
The most extreme ham in this "region of (seeming) certainty" had a ham
zscore of -24.3, and a spam zscore of -42.9.  I only tallied percentages up
to ham |zscores| of 10 before, but it's clear from this that the percentage
at 24.3 would be insignificant:

    This % of hams had    abs(zham) <= this
    ------------------    -----------------
         18.377%                 1.0
         36.525%                 2.0
         53.650%                 3.0
         67.919%                 4.0
         78.301%                 5.0
         85.831%                 6.0
         90.788%                 7.0
         93.762%                 8.0
         95.696%                 9.0
         97.044%                10.0

The similar table for spam zscores showed lower variance.

From tim_one@email.msn.com Wed Oct 2 06:01:18 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 2 Oct 2002 01:01:18 -0400
Subject: [Spambayes] mining dates?
In-Reply-To: <20021001042532.GA28075@glacier.arctrix.com>
Message-ID:

[Neil Schemenauer]
> ...
> I use a different email address for each email list I sign up on.  That
> makes sorting easy.  My ham and spam collection is taken from addresses
> that don't receive mailing list traffic.  So, signing up for more lists
> wouldn't help.

Who said anything about signing up for lists you want to read?  We're just
trying to get you more ham .

From crunch@shopip.com Tue Oct 1 20:51:11 2002
From: crunch@shopip.com (John Draper)
Date: Tue, 1 Oct 2002 12:51:11 -0700
Subject: [Spambayes] Some ideas I have....
In-Reply-To:
References:
Message-ID:

Neale writes:

>So then, John Draper is all like:
>
>> I want to start up another discussion about what the direction of the
>> group is heading, as far as addressing the issues of where spam
>> filtering should take place.  IE: Client side vs. server side.
>
>Currently we have two applications of the classifier: hammie and
>pop3proxy.  Both of these can run on either the client or the server.
>
>Your "bureaucrat" model sounds a lot like an observer pattern
>.  This is what procmail
>does with incoming mail, dispatching events to various processing
>functions (like hammie or spamassassin) who can each take a swing at the
>message.
>
>What might be really useful would be a hook into an existing SMTP
>server.  But before that happens, we need to answer some questions like
>whether or not it's feasible to run one classifier database against an
>entire organization or ISP's email.

We wrote a cheap and dirty MTA (SMTP server) in Python.  Definitely not
RFC 821 compliant, but I use it as a test or proof of performance.  These
issues I think are important (system-wide filtering vs. local user-level
filtering).

John

From crunch@shopip.com Wed Oct 2 09:03:27 2002
From: crunch@shopip.com (John Draper)
Date: Wed, 2 Oct 2002 01:03:27 -0700
Subject: [Spambayes] Another proposal from one of us.
Message-ID:

I propose a preprocessor which would convert message "meta-information"
into tokens which would be appended to the email message prior to digestion
by SpamBayes.  Call this SBMIP.

Meta-information is content information which is outside that of the text
in the message header or body.  Example meta-information rules might be:

----------------------------------------
Was the message body entirely in upper case?
Have I sent mail to this address before?
Have I received mail from this address before?
Is this sender in my address book?
Are all the recipients in my address book?
Was this sent to more than one person?
Was this sent to more than two people?
Was this sent between the hours of 8AM and 5PM?
Was this received between the hours of midnight and 6AM?
Did this message have attachments?
Was this message plain or HTML?
Was this sent from a Pacific Rim IP address block (see IANA list)?
Does the subject include non-ASCII (>127) characters?
Does the body include a large number of non-ASCII characters?
Is the character set other than Latin, ISO, ...?
Is this from a time zone other than mine?
Is there a greater than three hour difference in time zones?
Does the source IP address have a DNS entry?
Is the source IP ping-able (requires sending ping and waiting for response)?
Was the domain name forged?
Is this a valid source email (would require connecting to source POP acct)?

There is some overlap between these and what the tokens themselves reveal;
that is not a concern.  The more information, the better.  The last four
rules are relatively slow, so should be made optional.  This rule checker
would probably be implemented as a Python module, like a plug-in.  There's
too much customization to have simply a rules text file, though a text file
and matching engine would work nicely.  This way many people could easily
contribute their own preprocessor rules.
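A minimal sketch of how such a rule checker might look as a Python plug-in.
The two rules, the rule IDs, and the user number below are all invented;
they only show the shape of the rule-to-token mapping:

```python
USER_ID = 54321  # per-user random number, as in the proposal

def rule_mostly_upper_case(body):
    # Invented rule: treat a body as "upper case" if >80% of letters are.
    letters = [c for c in body if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.8

def rule_has_high_bit_chars(body):
    # Invented rule: any character beyond ASCII 127.
    return any(ord(c) > 127 for c in body)

# (rule id, rule function) pairs; a real deployment would register many more.
RULES = [(1234, rule_mostly_upper_case), (1235, rule_has_high_bit_chars)]

def meta_tokens(body):
    """One token per rule: SB<user>-<rule id> followed by T or F."""
    return ["SB%d-%04d%s" % (USER_ID, rule_id, "T" if rule(body) else "F")
            for rule_id, rule in RULES]
```

Each token is an ordinary word as far as SpamBayes is concerned, so the
classifier learns each rule's weight from training data rather than from a
hand-assigned score.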
Each of these rules would generate a unique tag, one value for true and
another for false.  For example, "Is this message mostly in upper case?"
might output "SB54321-1234T" or "SB54321-1234F".  "SB" (SpamBayes) is just
an identifier, "54321" is a user-unique random number so that other
people's posted or forwarded messages don't coincidentally match your own
preprocessor, "1234" is the rule ID number, and the "T" or "F" is whether
the rule was matched or not.

The preprocessor output/SpamBayes input might look something like:
------------------------------------------------------------------
Date: xxxx
From: xxxx
Subject: xxxx
To: xxx
Bla: xxxx
Header junk: xxx

This is the text of any old random message.

24973 is unique to that user.  If that user imports someone else's
knowledge base, they would change the #'s to their own.

SB24973-0001T
SB24973-0002F
SB24973-0004T
SB24973-0034F
...
SB24973-0234T
SB24973-0235T
SB24973-0236F

-BBC, 2002-10-01.

From msergeant@startechgroup.co.uk Wed Oct 2 09:26:41 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Wed, 02 Oct 2002 09:26:41 +0100
Subject: [Spambayes] Matt Sergeant: Introduction
References:
Message-ID: <3D9AADC1.9030207@startechgroup.co.uk>

Tim Peters wrote:
> [Matt Sergeant]
>
>> ...
>> And to give back I'll tell you that one of my biggest wins was parsing
>> HTML (with HTML::Parser - a C implementation so it's very fast) and
>> tokenising all attributes, so I get:
>>
>>     colspan=2
>>     face=Arial, Helvetica, sans-serif
>>
>> as tokens.  Plus using a proper HTML parser I get to parse HTML comments
>> too (which is a win).
>
> Matt, what are you using as test data?  The experience here has been
> that HTML is sooooo strongly correlated with spam that we've added
> gimmick after gimmick to remove evidence that HTML ever existed; else
> the rare ham that uses HTML-- or even *discusses* HTML with a few
> examples! --had an extremely hard time avoiding getting classified as
> spam.
We have a live feed from one of our towers.  You have to be careful to
classify only HTML that is actually going to be rendered as HTML by the
client (i.e. content-type: text/html, or the whole thing is HTML, which is
a heuristic Outlook seems to use -- infuriating).  Due to it being a live
feed, we get all sorts of HTML newsletters in there, so only real spammy
indicators get noticed, rather than HTML being a generic catch-all.  I
guess the point being that we see more HTML newsletters than we see HTML
spam ;-)

> Do you find, for example, that
>
>     colspan=2
>
> is common in HTML ham but rare in HTML spam, or vice versa?

select * from words where word = 'colspan=2';

   word    | goodcount | badcount
-----------+-----------+----------
 colspan=2 |      3950 |     4197

Hmm, I guess colspan=2 wasn't a good example .

> I'm wondering what's sparing you from that fate.

I suspect it's just the corpus.

>> but increases the database size and number of tokens you have to
>> pull from the database enormously.
>
> That was also our experience with word bigrams, but less than
> "enormously"; about a factor of 2; character 5-grams were snuggling up
> to enormously.

I think for me it was more me hitting the limits of the performance I
could expect from PostgreSQL.  Expecting 10,000 selects to come back in
anything like a reasonable timeframe was a bit much to ask ;-)

>> Well I very quickly found out that most of the academic research into
>> this has been pretty bogus.  For example everyone seems (seemed?) to
>> think that stemming was a big win, but I found it to lose every time.
>
> We haven't tried that.  OTOH, the academic research has been on Bayesian
> classifiers, and this isn't one (despite that Paul called it one).

True, but my original classifier was bayesian (naive).

>> The one thing that still bothers me about Gary's method is that
>> the threshold value varies depending on corpus.  Though I expect
>> there's some mileage in being able to say that the middle ground is
>> "unknown".
> It does allow for an easy, gradual, and effective way to favor f-n at
> the expense of f-p, or vice versa.  There was no such possibility under
> Paul's scheme, as the more training data we fed in, the rarer it was
> for *any* score not to be extremely close to 0 or extremely close to 1,
> regardless of whether the classification was right or wrong.  Gary's
> method hasn't been caught in such extreme embarrassment yet.
>
> OTOH, it *is*, as you say, corpus dependent, and it seems hard to get
> that across to people.  Gary has said he knows of ways to make the
> distinction sharper, but we haven't yet been able to provoke him into
> revealing them .  The central limit variations, and especially the
> logarithmic one, are much more extreme this way.

Is that central_limit_2 as you call it?

>> On my personal email I was seeing about 5 FP's in 4000, and about 20
>> FN's in about the same number (can't find the exact figures right now).
>
> So to match the units and order of the next sentence, about 0.5% FN rate
> and 0.13% FP rate.
>
>> On a live feed of customer email we're seeing about 4% FN's and 2% FP's.
>
> Is that across hundreds of thousands of users?

It's just on one particular email tower, so around a few thousand I think.

> Do you know the corresponding statistics for SpamAssassin?  For
> python.org use, I've thought that as long as we could keep this scheme
> fast, it may be a good way to reduce the SpamAssassin load.

I don't keep stats for SpamAssassin - we don't use it "pure" so it wouldn't
be worth it.

FWIW, I'm working on making SpamAssassin 3 significantly faster (like about
50x) by using a decision tree rather than a linear scan of all rules.

I think for your purposes (python.org mailing lists) there's probably a lot
of mileage in doing spambayes first, then if spambayes is unsure (say
between .40 and .60) run the email through spamassassin (but set the
threshold to 7).

Matt.

From rob@hooft.net Wed Oct 2 11:22:16 2002
From: rob@hooft.net (Rob W.W.
Hooft)
Date: Wed, 02 Oct 2002 12:22:16 +0200
Subject: [Spambayes] Central limit
References:
Message-ID: <3D9AC8D8.5020308@hooft.net>

Tim Peters wrote:

[Tim]
>>> Sorry, I don't know what that means or how to compute it; neither does
>>> google .  Let's say this is my population: {2, 5, 10, 64}.  Then
>>> what are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32,
>>> 64, and 1000?

[Rob]
>> You can calculate the Root Mean Square (RMS) of all Z-scores.

[Tim]
> By RMS I understand you to mean the square root of the mean of the
> Z-score squares.  Is that what you mean?

Yep.

>> That is the same as the "standard deviation" of the population.

> But now I'm lost again.  The RMS of a population isn't the same as the
> sdev of a population (as I understand sdev), unless the mean of the
> population happens to be 0.  The mean Z-score is definitely not 0 in the
> results I'm seeing; the zscores for ham are highly skewed to one side
> of 0.

OK.  I'd like to see a "histogram" to see what causes this.  Is 0 still
the most frequently observed value, and is the distribution asymmetric, or
does it look like a bell-curve that is offset?

What we are trying to do is to "describe" the histogram in as few
parameters as possible.  For that, it is not very important to get the
"bulk-form" right, as long as the tails of the distribution (or at least
the relevant one of the two tails) are reasonably described.  This is
because we are not interested in Z-scores lower than 1-2; they indicate by
themselves that the tested message is part of the population.  Only for
abs(Z)>2 do we need a reasonable description, to be able to calculate a
"chance".  It may be necessary to describe the thing with more than an
average and standard deviation (skew, kurtosis), but chances are that we
can do with a "fake average" and a "fake standard deviation" to describe
the one interesting tail.

>> This appears to be around 3-4.
>
> For what?
The RMS of {1, 2, 32, 64, 1000} is about 450: Sorry, I didn't look at your example set, but at the distribution you gave earlier and again a bit lower in this message. > I note that most google hits on "RMS Z-score" land on the WHAT IF program > you worked on as a postdoc -- so I suspect this is a case where something is > so obvious to you it may be impossible for you to explain it <0.9 wink>. :-) and/or I didn't see enough of the clt results yet to communicate in a reasonable way. In fact, my machine is crunching on a clt run now for the first time, but it is taking several hours for my corpora. >>1.0-erf(Zham)-erf(Zspam) is sizeable, that could express a "chance" that >>the message belongs to neither; erf(Zham)+erf(Zspam) is then a way to >>express the "classifyability". Normally not both of Zham and Zspam are >>small, but the math might need to handle the case that the sum of these >>two is larger than one for the weird case... Somebody with a proper >>statistical background can probably improve on this. Unfortunately >>import math does not come with the error function.... > > > That won't be a problem if there's something useful here. I'm willing to > pursue it, but am getting hints that it will work worse than the > seat-of-the-pants ratio gimmick. The "region of certainty" where the ratio > gimmick has never been *observed* to make a mistake includes cases where > both zscores are so large that, no matter how we fiddle them, the system > would believe there's no earthly chance the msg came from either population. > But so long as > > |larger zscore| / |smaller zscore| > > "is big enough", it doesn't seem to matter how large the smaller zscore is. > The most extreme ham in this "region of (seeming) certainty" had a ham > zscore of -24.3, and a spam zscore of -42.9.
I only tallied percentages up > to ham |zscores| of 10 before, but it's clear from this that the percentage > at 24.3 would be insignificant: Does that -24.3/-42.9 look like anything else you have ever seen? For me this would clearly indicate that "it is unlike anything in the two corpora, but if you really want to express chances, it is VERY much more likely (18 sigma!) to be ham than spam".

> This % of hams had    abs(zham) <= this
> --------------        ---------------------
> 18.377%               1.0
> 36.525%               2.0
> 53.650%               3.0
> 67.919%               4.0
> 78.301%               5.0
> 85.831%               6.0
> 90.788%               7.0
> 93.762%               8.0
> 95.696%               9.0
> 97.044%              10.0

This is where I got my scaling factor of ~4: This population has its 65% cutoff at Z=~4 and 95% cutoff at Z=~9. In a normal distribution these two occur at 1 sigma and 2 sigma, respectively. This indicates that Zham is about 4 times too large. Your -24.3 above thus reduces to ~6 standard deviations: still very unlikely (Hm, I have some problems with the definition and use of the error function. Do I need to do erfc(sqrt(p))? That gives p=5e-4 for Z=6), but less so than the original 24.... For spam the reduction factor might be ~3 (you say it is sharper), reducing the 42.9 to 14 standard deviations (p=1e-7), which is indeed much, much less likely (pham/pspam>4000) than the ~6 for the distance to the ham population. But if it were my incoming mail, I think I'd want to see it "unclassified" because it is so different from both. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From richie@entrian.com Wed Oct 2 12:52:56 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 12:52:56 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <3D99A3E6.4050403@startechgroup.co.uk> References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> <3D99A3E6.4050403@startechgroup.co.uk> Message-ID: > Lotus Notes still can't filter on arbitrary headers. Grr.
Do you know what it *can* filter on? Is there a sensible behaviour for pop3proxy that would work for Notes? Preferably something less intrusive than Josiah's idea of modifying the Subject line. -- Richie Hindle richie@entrian.com From msergeant@startechgroup.co.uk Wed Oct 2 13:34:05 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: Wed, 02 Oct 2002 13:34:05 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> <3D99A3E6.4050403@startechgroup.co.uk> Message-ID: <3D9AE7BD.5090607@startechgroup.co.uk> Richie Hindle wrote: >>Lotus Notes still can't filter on arbitrary headers. > > > Grr. Do you know what it *can* filter on? Is there a sensible > behaviour for pop3proxy that would work for Notes? Preferably > something less intrusive than Josiah's idea of modifying the > Subject line. We've been thinking about this at work. We *think* it might be able to look at the Precedence headers, so you could potentially set them to "junk" and have it work. Alternatively you could modify the From header (and set Reply-To if it's not set) to something like "spammer". Or finally yes, you can modify the subject. Definitely the worst piece of junk email client I've ever had to deal with. Wait until you have to ask for an original email from them with all headers intact.
Bwahahahahahaha ;-) From noreply@sourceforge.net Wed Oct 2 10:53:51 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Wed, 02 Oct 2002 02:53:51 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 13:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- >Comment By: Sinchi Pacharuraq (sinchi) Date: 2002-10-02 13:53 Message: Logged In: YES user_id=621182 I just want to have this anti-spam filter built in Mozilla message filters. For example, user might activate this filter to delete spam messages from inbox or to move it to special folder. ---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2002-10-01 20:04 Message: Logged In: YES user_id=44345 ummm.... a bit short on detail/description. What precisely do you mean by "Mozilla Mail integration"? Can you describe what you would like to see feature-wise? Note that no other mail system integration has been attempted at this point with the exception that I believe the hammie script works with procmail. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 From skip@pobox.com Wed Oct 2 14:20:17 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 2 Oct 2002 08:20:17 -0500 Subject: [Spambayes] Matt Sergeant: Introduction In-Reply-To: <3D9AADC1.9030207@startechgroup.co.uk> References: <3D9AADC1.9030207@startechgroup.co.uk> Message-ID: <15770.62097.97078.342522@12-248-11-90.client.attbi.com> Matt> We have a live feed from one of our towers.... then later: Matt> It's just on one particular email tower, ... What's an "email tower"? Skip From neale@woozle.org Wed Oct 2 16:02:35 2002 From: neale@woozle.org (Neale Pickett) Date: 02 Oct 2002 08:02:35 -0700 Subject: [Spambayes] Another proposal from one of us. In-Reply-To: References: Message-ID: So then, John Draper is all like: > I propose a preprocessor which would convert message > "meta-information" into tokens which would be appended to the email > message prior to digestion by SpamBayes. Call this SBMIP. Why not just run it through SpamAssassin first, then have your tokenizer pay attention to the tests reported in the X-Spam-Status header? SA is probably a better place to be doing this sort of testing anyhow--that's what SA is all about. :) Neale From jcarlson@uci.edu Wed Oct 2 16:21:50 2002 From: jcarlson@uci.edu (Josiah Carlson) Date: Wed, 02 Oct 2002 08:21:50 -0700 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: References: <3D99A3E6.4050403@startechgroup.co.uk> Message-ID: <20021002081804.E1B7.JCARLSON@uci.edu> > > Lotus Notes still can't filter on arbitrary headers. > > Grr. Do you know what it *can* filter on? Is there a sensible > behaviour for pop3proxy that would work for Notes? Preferably > something less intrusive than Josiah's idea of modifying the > Subject line. I only modified it when it was a suspected spam. In those cases, it was nice to know. 
Of course my software knows to not use that portion of the subject (or really any word longer than 10 characters). But it now does the X-Hammie-Disposition thing. *grin* And subject line modification is not that intrusive when you consider how intrusive spam itself is. - Josiah From tim.one@comcast.net Wed Oct 2 16:34:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 11:34:12 -0400 Subject: [Spambayes] Central limit In-Reply-To: <3D9AC8D8.5020308@hooft.net> Message-ID: A quickie: [Rob W.W. Hooft] > ... > (Hm, I have some problems with the definition and use of the error function. > Do I need to do erfc(sqrt(p))? I think I can clear this one up: erf() is often documented incorrectly. For hysterical raisins, erf(x) computes the area under the unit Gaussian from -x*sqrt(2) to x*sqrt(2). So if you want the area under the unit Gaussian from -x to x, you need to do erf(x/sqrt(2)). (erf integrates over exp(-t**2) for historical simplicity, while the unit Gaussian integrates over exp(-t**2/2); the difference is where the sqrt(2) comes from) From noreply@sourceforge.net Wed Oct 2 14:33:57 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Wed, 02 Oct 2002 06:33:57 -0700 Subject: [Spambayes] [ spambayes-Feature Requests-616944 ] Mozilla Mail integration Message-ID: Feature Requests item #616944, was opened at 2002-10-01 09:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 Category: None Group: None Status: Open Priority: 5 Submitted By: Sinchi Pacharuraq (sinchi) Assigned to: Nobody/Anonymous (nobody) Summary: Mozilla Mail integration Initial Comment: Integration with Mozilla Mail client ---------------------------------------------------------------------- >Comment By: Richie Hindle (richiehindle) Date: 2002-10-02 13:33 Message: Logged In: YES user_id=85414 I'm no expert on how Mozilla filters work... 
can you add a filter that says "If a message contains an X-Hammie-Disposition header whose value starts with Yes then "? If so, you can use either hammie.py (as part of your unix mail delivery system) or pop3proxy.py (on either a server machine or your own client machine). Both of these add an X-Hammie-Disposition header, with which you can filter your messages. ---------------------------------------------------------------------- Comment By: Sinchi Pacharuraq (sinchi) Date: 2002-10-02 09:53 Message: Logged In: YES user_id=621182 I just want to have this anti-spam filter built in Mozilla message filters. For example, user might activate this filter to delete spam messages from inbox or to move it to special folder. ---------------------------------------------------------------------- Comment By: Skip Montanaro (montanaro) Date: 2002-10-01 16:04 Message: Logged In: YES user_id=44345 ummm.... a bit short on detail/description. What precisely do you mean by "Mozilla Mail integration"? Can you describe what you would like to see feature-wise? Note that no other mail system integration has been attempted at this point with the exception that I believe the hammie script works with procmail. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 From chk@pobox.com Wed Oct 2 16:49:21 2002 From: chk@pobox.com (Harald Koch) Date: Wed, 02 Oct 2002 11:49:21 -0400 Subject: [Spambayes] Re: Tokenising clues In-Reply-To: Your message of "Tue, 01 Oct 2002 15:54:32 -0400". References: Message-ID: <17064.1033573761@elisabeth.cfrq.net> > I only care which way the results are heading . As you've mentioned before, at this point you're tuning the tokenizer to *your* sample, which doesn't necessarily represent the global population of spam. I still strongly suspect that you're entering chaotic space at this point. 
> Organization: Massachussetts Institute of Technology > > is going to generate a big pile of ham clues, and if a spammer happens to > include that header too, it's going to be hard to overcome them. The first time, yes. Then that message gets moved into the spam corpus, the probabilities are recalculated, and those words are no longer discriminators. What's the problem with that? -- Harald Koch From richie@entrian.com Wed Oct 2 17:19:35 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 17:19:35 +0100 Subject: [Spambayes] Cunning use of quoted-printable Message-ID: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> [Sent privately to Tim by accident; now forwarding to the list] [Tim] > I *think* you meant it was a false negative, since you said it was in your > spam collection, and haven't argued that it's actually ham. Correct, sorry. [Tim] > If you can without revealing a confidence, it would be good if you could > share the fp. Short of that, are these fp that bother you? Would you be > upset if you lost them in real life? Here they are. The first is a request to unsubscribe from a mailing list - this one I certainly *would* be bothered about. I've censored the email address slightly in deference to its author - I've replaced every other character with 'x'.
'header:Received:5': 0.14; 'from:email addr:biglobe.ne.jp>': 0.16; 'from:email name:From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997 Received: from punt-2.mail.demon.net by mailstore for sr-list@sundog.demon.co.uk id 862608130:10:24450:1; Fri, 02 May 97 22:22:10 BST Received: from mailsv1.pcvan.or.jp ([192.47.117.193]) by punt-2.mail.demon.net id aa1024075; 2 May 97 22:21 BST Received: from mail-gw.biglobe.ne.jp (mailsv5.pcvan.or.jp [192.47.117.85]) by mailsv1.pcvan.or.jp (8.7.5+2.6Wbeta6/3.5W9-PCVAN01) with ESMTP id GAA11518 for ; Sat, 3 May 1997 06:21:40 +0900 (JST) Received: by mail-gw.biglobe.ne.jp (8.7.5+2.6Wbeta6/6.4J.6-BIGLOBE_GW) id GAA02729; Sat, 3 May 1997 06:21:15 +0900 (JST) Received: by biglobe.ne.jp id 1023702; Sat, 03 May 1997 06:21:22 +0900 Message-Id: <970503062118.23085B03.1023702@biglobe.ne.jp> Date: Sat, 03 May 1997 06:21:22 +0900 From: =?ISO-2022-JP?B?GyRCJV8layUtITwbKEI=?= To: sr-list@sundog.demon.co.uk Subject: =?ISO-2022-JP?B?IBskQiMxGyhC?= Content-Type: Text/Plain; charset=us-ascii MIME-Version: 1.0 unsubscribe [] end ------------------------------------------------------------------------ The second is a spam-looking mail from one of my ISPs, telling me that their web address has changed. I wouldn't care if I'd missed that. 
------------------------------------------------------------------------ >From Orange#18.3250.d5-BLEXlg11G9rR.1@socket.cyberdialogue.com Thu Sep 26 10:04:37 2002 Return-Path: Received: from punt-2.mail.demon.net by mailstore for entrian@sundog.demon.co.uk id 1033031768:20:16776:120; Thu, 26 Sep 2002 09:16:08 GMT Received: from westhost19.westhost.net ([216.71.84.92]) by punt-2.mail.demon.net id aa2017667; 26 Sep 2002 9:15 GMT Received: from accumx-2.cyberdialogue.com (accumx-2.cyberdialogue.com [209.123.95.101]) by westhost19.westhost.net (8.11.6/8.11.6) with SMTP id g8Q9Dxu31643 for ; Thu, 26 Sep 2002 04:13:59 -0500 Received: (qmail 31858 invoked from network); 26 Sep 2002 08:26:33 -0000 Received: from socket.fulcrumanalytics.com (HELO socket.cyberdialogue.com) (209.123.95.99) by 0 with SMTP; 26 Sep 2002 08:26:33 -0000 Message-ID: <7160165.1033031077537.JavaMail.root@socket.cyberdialogue.com> Date: Thu, 26 Sep 2002 05:04:37 -0400 (EDT) From: "orange@orange.co.uk" To: richie@entrian.com Subject: Orange Internet has moved Mime-Version: 1.0 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Mailer: Accucast (http://www.accucast.com) X-Mailer-Version: 2.7.2-1 X-Hammie-Disposition: Yes Orange Internet moving to orange.co.uk
Orange Internet moving to orange.co.uk
Hello Richard
Orange Internet has moved from its old home at orange.net to its brand new address at orange.co.uk. You can still organise your life exactly the way you have been, with the same Orange email address and log in, your diary, and free text messages - all available to you on Orange today. Orange today is our new look site, you'll find the link at the top right of Orange.co.uk.
get just the news you want
Your news service can now be personalised, so you can receive updates on the news that matters to you. Go to Orange today for more details.
tell me more
Do you want to keep up with all the latest news on Orange products and services? Simply click here to provide your contact details.
Click here to see the Orange privacy statement
If you don't want to receive marketing information from us by email, please click here to unsubscribe
orange™
------------------------------------------------------------------------ -- Richie Hindle richie@entrian.com From nas@python.ca Wed Oct 2 18:45:10 2002 From: nas@python.ca (Neil Schemenauer) Date: Wed, 2 Oct 2002 10:45:10 -0700 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> References: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> Message-ID: <20021002174510.GA32247@glacier.arctrix.com> Richie Hindle wrote: > The bit I understand least here is this: > > 'header:Message-Id:1': 0.64 It means there is one Message-Id header. Neil From tim.one@comcast.net Wed Oct 2 18:45:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 13:45:12 -0400 Subject: [Spambayes] Re: Tokenising clues In-Reply-To: <17064.1033573761@elisabeth.cfrq.net> Message-ID: [Harald Koch] > As you've mentioned before, at this point you're tuning the tokenizer to > *your* sample, which doesn't necessarily represent the global population > of spam. I still strongly suspect that you're entering chaotic space at > this point. I do very little tokenizer tuning anymore, for this very reason. Nearly all changes I've made recently were supported by tests conducted publicly on this list, across multiple corpora. When 10 of 10 runs across each of 3 distinct testers all say "yup, it worked the same way here too", chaos isn't a likely explanation . >> Organization: Massachussetts Institute of Technology >> >> is going to generate a big pile of ham clues, and if a spammer >> happens to include that header too, it's going to be hard to overcome them. > The first time, yes. Then that message gets moved into the spam corpus, > the probabilities are recalculated, and those words are no longer > discriminators. Sorry, it takes time for the algorithm to learn, and if there are many ham containing this header line now then it will take almost that many training samples of spam containing the same thing before the spamprobs decrease to neutrality.
> What's the problem with that? I advised Neil to read the comments in tokenizer.py about previous experiments with character n-grams, and can only repeat that advice to you. When a single phrase can generate a large number of clues, a single unlucky (or lucky) phrase can determine the entire outcome. *Mixing* character n-grams in one part of the tokenizer with word-based tokenization elsewhere effectively gives more weight to the parts tokenized via characters, as the latter generate more clues per character of input text. This suggests it introduces biases, and we've never seen a case yet where a bias was actually helpful. Testing is the final judge, but I have reasonable cause to suspect it will backfire, and have seen it backfire in previous attempts. From tim.one@comcast.net Wed Oct 2 19:03:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 14:03:02 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <20021002174510.GA32247@glacier.arctrix.com> Message-ID: [Richie Hindle] >> The bit I understand least here is this: >> >> 'header:Message-Id:1': 0.64 [Neil Schemenauer] > It means there is one Message-Id header. More, that "Message-Id" was exactly how it was spelled: this count is case-sensitive. It's interesting that this Camel-Case spelling is a mild spam indicator for Richie. In Neil's reply, the message id was spelled "Message-id". Staring at your database entries will turn up some interesting things! For example, in my corpus MiME-Version has killer-strong spamprob.
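A rough sketch of the header-count tokens under discussion (illustrative only -- the real tokenizer.py differs in details, and the function name here is made up):

```python
from collections import Counter

def header_count_tokens(header_names):
    """Generate case-sensitive 'header:<Name>:<count>' tokens from the
    raw header names of a message, e.g. ['Received', 'Received',
    'Message-Id'].  Spelling is preserved exactly, so 'Message-Id' and
    'Message-ID' produce different tokens -- which is how an oddball
    spelling like 'MiME-Version' can become a strong clue by itself."""
    counts = Counter(header_names)
    return ['header:%s:%d' % (name, n) for name, n in sorted(counts.items())]

print(header_count_tokens(['Received'] * 5 + ['Message-Id']))
# ['header:Message-Id:1', 'header:Received:5']
```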
From anthony@interlink.com.au Wed Oct 2 19:19:10 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 03 Oct 2002 04:19:10 +1000 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <8r6mpu4tq03lb0i0j4ncoftnsgdd2394up@4ax.com> Message-ID: <200210021819.g92IJA711989@localhost.localdomain> >>> Richie Hindle wrote > 'header:Received:5': 0.14; > 'from:email addr:biglobe.ne.jp>': 0.16; 'from:email name: 'from:skip:= 30': 0.16; 'message-id:@biglobe.ne.jp': 0.16; > 'subject:2022': 0.16; 'subject:IBskQiMxGyhC': 0.16; 'charset:us-ascii': 0.26 ; > 'content-type:text/plain': 0.35; 'subject:ISO': 0.35; > 'header:Message-Id:1': 0.64; 'x-mailer:none': 0.68; 'subject:=?': 0.70; > 'subject:?=': 0.72; 'unsubscribe': 0.93 It looks like it's tokenizing the encoded version of the subject here... ? -- Anthony Baxter It's never too late to have a happy childhood. From tim.one@comcast.net Wed Oct 2 20:23:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 15:23:07 -0400 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <20021002081804.E1B7.JCARLSON@uci.edu> Message-ID: [Josiah Carlson] > ... > And subject line modification is not that intrusive when you consider how > intrusive spam itself is. My employer fiddled our system to prepend a tilde (~) to the Subject of suspected spam. I never even noticed this until it was pointed out to me! Which was months after we started doing it. Then again, there's not much that doesn't escape me . 
From tim.one@comcast.net Wed Oct 2 20:51:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 15:51:26 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: <200210021819.g92IJA711989@localhost.localdomain> Message-ID: [Richie Hindle, after untangling the compressed clue listing so it's readable]

>> 'header:Received:5': 0.14;
>> 'from:email addr:biglobe.ne.jp>': 0.16;
>> 'from:email name:
> 'from:skip:= 30': 0.16;
>> 'message-id:@biglobe.ne.jp': 0.16;
>> 'subject:2022': 0.16;
>> 'subject:IBskQiMxGyhC': 0.16;
>> 'charset:us-ascii': 0.26;
>> 'content-type:text/plain': 0.35;
>> 'subject:ISO': 0.35;
>> 'header:Message-Id:1': 0.64;
>> 'x-mailer:none': 0.68;
>> 'subject:=?': 0.70;
>> 'subject:?=': 0.72;
>> 'unsubscribe': 0.93

[Anthony Baxter] > It looks like it's tokenizing the encoded version of the subject here... ? That's right, both the Subject and From headers: From: =?ISO-2022-JP?B?GyRCJV8layUtITwbKEI=?= Subject: =?ISO-2022-JP?B?IBskQiMxGyhC?= The only things we ever decode are text/* quoted-printable and text/* base64 MIME sections. This isn't by design either way -- AFAIK, it's simply that nobody has *tried* to decode anything else, and so it's unknown how doing so would affect results. (There's one false negative in my corpus that would go away if we decoded uuencoded sections, btw.) OTOH, we don't have a clue about how to tokenize Asian languages anyway, so I'm not sure that anyone here knows *how* to "decode" this in a way that might help. Richie, what do you have spam_cutoff set to? I thought your first message implied it was set to 0.56. The thing that strikes me hardest about this false positive is that it's got a lot more ham clues than spam clues listed. You chopped off the overall score (or the test driver you're using doesn't display it), but looks to me like it should be about 0.4. That's an awfully low value for spam_cutoff!
If spam_cutoff wasn't that low, this should not have *been* a false positive (.4 is too low for the system to consider it spam unless spam_cutoff is less than .4). Or didn't you send the full list of clues? That this was a false positive simply doesn't make sense based on what you've told us. From richie@entrian.com Wed Oct 2 21:26:39 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 21:26:39 +0100 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: <3D9AE7BD.5090607@startechgroup.co.uk> References: <20020928002231.CD68.JCARLSON@uci.edu> <20020928153427.D68E.JCARLSON@uci.edu> <3D99A3E6.4050403@startechgroup.co.uk> <3D9AE7BD.5090607@startechgroup.co.uk> Message-ID: <58kmpuknmmknkhpep044pc1nak4i7u7s96@4ax.com> [Matt] > Lotus Notes still can't filter on arbitrary headers. [Richie] > Grr. Do you know what it *can* filter on? [Matt] > We've been thinking about this at work. We *think* it might be able to > look at the Precedence headers, so you could potentially set them to > "junk" and have it work. That would be good - a more portable version of the X-Hammie-Disposition header. If you confirm whether Notes can do this, please let us know! [Josiah] > subject line modification is not that intrusive when you consider how > intrusive spam itself is. An excellent point! 8-) -- Richie Hindle richie@entrian.com From richie@entrian.com Wed Oct 2 22:34:48 2002 From: richie@entrian.com (Richie Hindle) Date: Wed, 02 Oct 2002 22:34:48 +0100 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: References: <200210021819.g92IJA711989@localhost.localdomain> Message-ID: [Tim] > Richie, what do you have spam_cutoff set to? I thought your first message > implied it was set to 0.56. It is, yes. [Tim] > this should not have *been* a false positive You're right. 
Where 'richie.pickle' is my full ~4000-message database:

>>> import cPickle, pprint, tokenizer, classifier
>>> from Options import options
>>> text = open( "Data/Ham/Set4/1641", "rt" ).read()
>>> bayes = cPickle.load( open( "richie.pickle", "rb" ) )
>>> score, clues = bayes.spamprob( tokenizer.tokenize( text ), True )
>>> print options.spam_cutoff, score
0.56 0.402748505794
>>> pprint.pprint( clues )
[('header:Received:5', 0.13592289441927),
 ('from:email addr:biglobe.ne.jp>', 0.15517241379310345),
 ('from:email name:>>

But running in the test environment, which uses the same 4000 messages (subject to a couple of hundred extras being shuffled around by rebal.py), I get this:

> python timcv.py -n10 --ham=200 --spam=200 -s1
[snip]
-> 1 new false positives
new fp: ['Data/Ham/Set4/1641']
******************************************************************************
Data/Ham/Set4/1641
prob = 0.581295852793
prob('header:Received:5') = 0.141997
prob('charset:us-ascii') = 0.26578
prob('content-type:text/plain') = 0.346687
prob('header:Message-Id:1') = 0.648679
prob('x-mailer:none') = 0.674625
prob('subject:=?') = 0.775229
prob('subject:?=') = 0.908163
prob('unsubscribe') = 0.928485
>From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997
[snip]

What's going on?? Far fewer clues in the test environment (and my other false positive prints 67 of them, so it's not a display issue). I have a bayescustomize.ini like this:

[TestDriver]
best_cutoff_fp_weight = 10
nbuckets = 100

which I guess shouldn't have any effect on this at all.
-- Richie Hindle richie@entrian.com From skip@pobox.com Thu Oct 3 01:13:58 2002 From: skip@pobox.com (Skip Montanaro) Date: Wed, 2 Oct 2002 19:13:58 -0500 Subject: [Spambayes] Integration w/ mail clients Message-ID: <15771.35782.414547.651734@localhost.localdomain> There's a tracker item asking for "Mozilla Mail integration": https://sourceforge.net/tracker/?func=detail&atid=498106&aid=616944&group_id=61702 In my own fiddling around trying to create ham & spam collections (*) as correctly as possible using Emacs' VM mail reader, I've noticed that the simple act of dumping a message into either a ham or spam file after deciding its category is a bit tedious. I think this is where most of my mistaken hams-as-spam or spams-as-ham come from. Currently, all messages so disposed also wind up deleted, so I have to undelete them if I don't want them lost from the regular mail stream. Occasionally, I forget to do this. I should break down and write a little ELisp to do a bit better job of the task. For any other VM users out there, it seems to me that "l h" and "l s" would be decent keybindings for whatever commands I develop to save ham and spam. I think it would be worthwhile understanding what tasks can and/or should be integrated into various mail clients. Here's what I see as required:

* training mode - the user makes the ham/spam distinction (my "l h"/"l s" example for VM)
* run mode - the mail client calls out to SB to get a reading on the message - this may not be necessary in many unixoid environments since other tools upstream from the MUA may run the classifier
* override - the user corrects a mistake by the classifier - presumably it should be able to incrementally subtract the incorrect classification info from the score database and add the correct info

What MUA functionality do other people think is necessary? Skip (*) Can we settle on "collection" instead of "corpus" to avoid the weird plural?
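Skip's "override" task amounts to subtracting the message's word counts from the wrong category and adding them to the right one. A toy sketch of that bookkeeping (invented names -- this is not the actual spambayes API):

```python
class ToyScoreDB:
    """Toy word-count database illustrating train/untrain for the
    'override' case: correcting a misclassification means reversing
    the earlier (wrong) training and redoing it in the other category."""

    def __init__(self):
        self.ham, self.spam = {}, {}
        self.nham = self.nspam = 0

    def learn(self, words, is_spam):
        table = self.spam if is_spam else self.ham
        for w in words:
            table[w] = table.get(w, 0) + 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, words, was_spam):
        # Exactly reverses a previous learn() with the same arguments.
        table = self.spam if was_spam else self.ham
        for w in words:
            table[w] -= 1
            if not table[w]:
                del table[w]
        if was_spam:
            self.nspam -= 1
        else:
            self.nham -= 1

# Override: a message was first filed as ham, then corrected to spam.
db = ToyScoreDB()
msg = ['cheap', 'meds', 'unsubscribe']
db.learn(msg, is_spam=False)
db.unlearn(msg, was_spam=False)
db.learn(msg, is_spam=True)
```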
From jcarlson@uci.edu Thu Oct 3 01:28:28 2002 From: jcarlson@uci.edu (Josiah Carlson) Date: Wed, 02 Oct 2002 17:28:28 -0700 Subject: [Spambayes] Integration w/ mail clients In-Reply-To: <15771.35782.414547.651734@localhost.localdomain> References: <15771.35782.414547.651734@localhost.localdomain> Message-ID: <20021002171806.E1BD.JCARLSON@uci.edu> > What MUA functionality do other people think is necessary? If the mail client stores email in any decent format (mbox, '\n.\n' delimited, etc.), I can't imagine it would be a big deal to just have the classifier check at regular intervals whether or not the file has changed, and if so, re-index. It wouldn't be difficult to add support for multiple recursive folders for people who don't have 15 folders in their email root, but have subdirectories (I've done it in other projects), and multiple databases (to make re-indexing easier). It also wouldn't be a big deal to require that the user keep a 'spam' folder, named in lowercase, or even '__spam__', so that the software can easily determine which is the bad stuff. Of course assuming that the good stuff is every other email anywhere else in your email archives. In terms of actual integration, I don't believe more than the above is required, especially if the proxy knows how to do multiple mail servers (check out pasp or popfile on how we do it). If the above were implemented for a few major email clients, a virtually drop-in spam filter is possible. Mozilla uses mbox, I don't know what Eudora or pegasus use, most linux mail clients use mbox. Outlook is a whore, but you can import your outlook mail into mozilla, then it becomes mbox. Of course there is the borrowing of the outlook->mbox code from the mozilla project that could happen, if only for outlook people.
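A minimal sketch of Josiah's "check at regular intervals whether the file has changed" idea, assuming Python's stdlib mailbox module; the callback name and the single-step structure are invented for illustration:

```python
import mailbox
import os
import tempfile

def maybe_reindex(path, last_mtime, reindex):
    """One polling step: if the mbox file's mtime has changed since
    last_mtime, re-read it and hand every message to `reindex` (a
    caller-supplied callback).  Returns the current mtime, to be
    passed back in on the next poll (e.g. from a timer loop)."""
    mtime = os.stat(path).st_mtime
    if mtime != last_mtime:
        reindex(list(mailbox.mbox(path)))
    return mtime

# Demo against a throwaway one-message mbox file.
tmp = os.path.join(tempfile.mkdtemp(), 'demo.mbox')
with open(tmp, 'w') as f:
    f.write("From demo@example.com Thu Oct  3 00:00:00 2002\n"
            "Subject: hello\n\nbody text\n")
seen = []
mtime = maybe_reindex(tmp, None, seen.append)   # first poll: re-indexes
mtime = maybe_reindex(tmp, mtime, seen.append)  # unchanged: no re-index
```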
- Josiah From tim.one@comcast.net Thu Oct 3 04:30:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 02 Oct 2002 23:30:16 -0400 Subject: [Spambayes] Cunning use of quoted-printable In-Reply-To: Message-ID: [Richie Hindle, continuing to unravel the mystery of the now-it-is, now-it-ain't false positive] > You're right. Where 'richie.pickle' is my full ~4000-message database: Ah! If that's really been trained on all your msgs, then in particular it's been trained on the very message you're predicting against. The test drivers are careful never to do that (unless two msgs happen to have identical content, in which case that's fine -- if that's what real life looks like, it's not cheating to exploit it).

> >>> import cPickle, pprint, tokenizer, classifier
> >>> from Options import options
> >>> text = open( "Data/Ham/Set4/1641", "rt" ).read()
> >>> bayes = cPickle.load( open( "richie.pickle", "rb" ) )
> >>> score, clues = bayes.spamprob( tokenizer.tokenize( text ), True )
> >>> print options.spam_cutoff, score
> 0.56 0.402748505794
> >>> pprint.pprint( clues )
> [('header:Received:5', 0.13592289441927),
> ('from:email addr:biglobe.ne.jp>', 0.15517241379310345),

Let's pause here and ponder. Earlier you said you believed this was the only msg with ISO encodings in the Subject/From lines. Suppose that's true. Then you've trained on exactly one message (this one) producing (among others) the "word"

    'from:email addr:biglobe.ne.jp>'

The estimated *from counting* probability that a message containing this word is spam is then exactly 0.0 (you've seen it once, and only in ham). Then Gary's Bayesian probability adjustment is applied, to account for how much evidence you've got in favor of "the true" spamprob being 0.0:

    s*x + n*p
    ---------
       s+n

The default prior-belief strength (s) is 0.45, the default unknown-word prob (x) is 0.5, the counting probability estimate (p) is 0 (as above), and the total evidence (n -- the number of messages containing this word) is 1.
So the adjusted spamprob is

    0.45*0.5 + 1*0   0.225
    -------------- = ----- = 0.15517241379310345
        0.45+1        1.45

And that's exactly the prob shown on the line above, so we can be pretty certain that your database was in fact trained on this msg.

> ('from:email name: ('from:skip:= 30', 0.15517241379310345),
> ('message-id:@biglobe.ne.jp', 0.15517241379310345),
> ('subject:2022', 0.15517241379310345),
> ('subject:IBskQiMxGyhC', 0.15517241379310345),
> ('charset:us-ascii', 0.26241865802854009),
> ('content-type:text/plain', 0.34572203385342953),
> ('subject:ISO', 0.35151428063116696),
> ('header:Message-Id:1', 0.64496476638361089),
> ('x-mailer:none', 0.67584084707587),
> ('subject:=?', 0.69778644753001717),
> ('subject:?=', 0.7215916912471283),
> ('unsubscribe', 0.93148161126231199)]
> >>>
>
> But running in the test environment, which uses the same 4000 messages
> (subject to a couple of hundred extras being shuffled around by
> rebal.py), I get this:
>
> > python timcv.py -n10 --ham=200 --spam=200 -s1

As at the start, timcv never predicts against a message that the classifier has been trained on. It would be a very much weaker test if it ever did so, and the example we're discussing here shows why. In the test environment, then, *all* the words unique to this message have never been seen in the msgs the classifier was trained on, and so they all get the "unknown word" spamprob, 0.5. Then they're ignored completely, because the default robinson_minimum_prob_strength is 0.1, which ignores all words with spamprob in 0.4 thru 0.6.
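[For readers following along, the arithmetic above is easy to check in a few lines of Python -- a standalone sketch; the function name and parameter defaults are taken from the description above, not from the spambayes source:]

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Gary Robinson's adjustment of a counting-based spam probability.

    p -- probability estimated by counting (spam occurrences / total)
    n -- number of training messages containing the word
    s -- strength of the prior belief in the unknown-word prob
    x -- assumed spamprob for a word never seen before
    """
    return (s * x + n * p) / (s + n)

# The word from the message above: seen once, and only in ham (p=0, n=1).
print(adjusted_spamprob(0.0, 1))  # 0.15517241379310345

# With no evidence at all (n=0), the unknown-word prob comes back unchanged.
print(adjusted_spamprob(0.5, 0))  # 0.5
```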
> [snip] > -> 1 new false positives > new fp: ['Data/Ham/Set4/1641'] > ****************************************************************** > ************ > Data/Ham/Set4/1641 > prob = 0.581295852793 > prob('header:Received:5') = 0.141997 > prob('charset:us-ascii') = 0.26578 > prob('content-type:text/plain') = 0.346687 > prob('header:Message-Id:1') = 0.648679 > prob('x-mailer:none') = 0.674625 > prob('subject:=?') = 0.775229 > prob('subject:?=') = 0.908163 > prob('unsubscribe') = 0.928485 Without those other clues, the best judgment it can make is that it's spam. This is also why the system needs to be trained over time! It can only know what it's been taught. Very brief subscribe/unsubscribe msgs have been a problem in my data too, but I expect more so: such msgs don't belong on c.l.py at all, and they're really quite rare there. That prevents subscribe/unsubscribe from getting milder spamprobs no matter how much c.l.py data I train them on. But if you get a non-trivial number of these, the system will act differently for your data, over time. > From RxMx7x5x@biglobe.ne.jp Fri May 02 22:21:22 1997 > [snip] > > What's going on?? Far fewer clues in the test environment Right -- but what that really shows is that the test environment isn't cheating, so that's a Good Thing. > (and my other false positive prints 67 of them, so it's not a > display issue). > > I have a bayescustomize.ini like this: > > [TestDriver] > best_cutoff_fp_weight = 10 > nbuckets = 100 > > which I guess shouldn't have any effect on this at all. Right again, none at all -- they merely affect the histogram display. The only [TestDriver] option that can affect results is spam_cutoff, and even that has no effect on scores. 
From tim.one@comcast.net Thu Oct 3 04:35:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 02 Oct 2002 23:35:01 -0400
Subject: [Spambayes] Integration w/ mail clients
In-Reply-To: <15771.35782.414547.651734@localhost.localdomain>
Message-ID:

[Skip Montanaro, raises important issues which I'm ignoring because I'm spread too thin -- but if anyone wants a challenge, try getting msgs out of Outlook 2000 natively without 8 weeks of Mark Hammond's help ]
> ...
> (*) Can we settle on "collection" instead of "corpus" to avoid the
> weird plural?

If we agree to call the plural of collection collecora, sure!

From jbublitz@nwinternet.com Thu Oct 3 06:28:30 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Wed, 02 Oct 2002 22:28:30 -0700 (PDT)
Subject: [Spambayes] Here's why "generate_long_skips: False" worked...
Message-ID:

Tim Peters wrote:
> An easy example is Asian spam, where the lack of whitespace
> ends up generating oodles of skip tokens (and '8bit%' tokens),
> but there must be a more effective way to generate useful tokens
> for that without bloating the database beyond reason. So I hope
> that skip-generation will eventually become worthless.

I'm not sure this'll help much, but: I'm playing around with Graham and have just started looking at Spambayes. I have something more than 6500 Asian language spams (about 4 mos. worth, over half my spam), and what I use to tokenize is:

    re.compile(r"[\w'$_-]+", re.U)

which gives tokens (from Asian languages) in the 1 to 10 character length range (mostly toward the low end of that, similar to a distribution of English words). I imagine you could apply something like this when 8 bit data is detected. OTOH, in running some very preliminary tests with hammie.py out of the box, Spambayes catches all of the Asian language spam but gets right the ham msgs which contain a small portion of Asian chars (same with Graham), so your handling of 8 bit data seems to work pretty well.
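[Jim's regex can be tried on its own; the sample text below is an invented illustration, not from his corpus:]

```python
import re

# Jim's Unicode-aware tokenizer: runs of word chars plus ' $ _ -
token_re = re.compile(r"[\w'$_-]+", re.U)

# \w matches CJK and kana characters too, so 8-bit/Asian text yields
# real tokens (broken at punctuation and whitespace) rather than one
# long undifferentiated run feeding the skip-token generator.
sample = u"Free\u30aa\u30d5\u30a1\u30fc! visit now"  # "Free" + katakana + '!'
tokens = token_re.findall(sample)
print(tokens)  # three tokens: the Free+katakana run, 'visit', 'now'
```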
Tokenizing as above certainly adds to the database size, but nowhere near as much as the equivalent number of English language messages probably would. I haven't really quantified it, but I'd guess it adds less than 10% -- perhaps a lot less. I haven't seen any strings of unbounded length, and at the moment I'm not trimming any tokens from the above regex. lower() also works with Asian characters (it doesn't raise an exception, anyway), but I get better results staying case sensitive.

Jim

From msergeant@startechgroup.co.uk Thu Oct 3 11:20:49 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 03 Oct 2002 11:20:49 +0100
Subject: [Spambayes] Matt Sergeant: Introduction
References: <3D9AADC1.9030207@startechgroup.co.uk> <15770.62097.97078.342522@12-248-11-90.client.attbi.com>
Message-ID: <3D9C1A01.1010208@startechgroup.co.uk>

Skip Montanaro wrote:
> Matt> We have a live feed from one of our towers....
>
> then later:
>
> Matt> It's just on one particular email tower, ...
>
> What's an "email tower"?

Sorry -- internal lingo for a rack. Well, not quite a rack; sometimes multiple racks. Basically a group of email servers. We use multiple towers throughout the world for redundancy and proximity reasons.

Matt.

From papaDoc@videotron.ca Thu Oct 3 14:48:46 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Thu, 03 Oct 2002 09:48:46 -0400
Subject: [Spambayes] Result of a test
Message-ID: <3D9C4ABE.8070407@videotron.ca>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment

Hi,

The attachment is the result of a run on my ham and spam. They are coming from 3 different email addresses. The email can be in English or French. Most of the fp are emails from companies (Palm and APC) whose mailing lists I subscribed to. (Even if I don't see those emails I won't miss them, because usually I don't read them.) Others are subscription verifications, and some spam (what I consider spam) are emails forwarded to me by my boss.
I am using all the default values. Most of the false negatives are spam in French! Since my ratio of French/English is really low, and the ratio of French spam/French ham is very low, I did not play with the Python code yet, since I'm new to Python.

Looking at the prob of each word, I saw something:

prob('battery"') = 0.844828
prob('battery,') = 0.844828

prob('powernews,') = 0.77651
prob('powernews.') = 0.77651

prob('outlet,') = 0.844828
prob('outlet.') = 0.844828

prob('luncheon') = 0.844828
prob('luncheon:') = 0.844828
prob('luncheons') = 0.844828

I think it could be interesting to try to remove the punctuation (the . , ? !) at the end of a word and then count it as the same word, and to do the same thing with the plural (luncheon and luncheons) based on a dictionary like the one in ispell.

papaDoc

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: run1.zip
Type: application/x-zip-compressed
Size: 41935 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021003/f71c7995/run1.bin
---------------------- multipart/mixed attachment--

From richie@entrian.com Thu Oct 3 17:44:00 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 03 Oct 2002 17:44:00 +0100
Subject: [Spambayes] Cunning use of quoted-printable
In-Reply-To:
References:
Message-ID:

[Tim]
> you've trained on exactly one message (this one) producing (among
> others) "word"
>
> 'from:email addr:biglobe.ne.jp>'
>
> The estimated *from counting* probability that a message containing this
> word is spam is then exactly 0.0 (you've seen it once, and only in ham).
>
> Then Gary's Bayesian probability adjustment is applied [...]

I did briefly think that this might be due to this message having unique words, but I thought that the non-zero scores for those words meant they must have appeared in a mix of ham and spam.
I confess I've let the mathematical discussions slip past me, so I wasn't expecting words unique to the ham corpus to have non-zero probabilities. I should have looked more carefully at the words ('from:email name: Message-ID: [Richie Hindle] > No need! It's amusing, though . > ... > Many thanks for the explanation, and sorry to have wasted your time. Let's be clear about that: it wasn't a waste of time at all. Tracking down the details to the bloody end verified that some key parts of the system are working as intended, which raises confidence; and made an opportunity to write a little tutorial on what's going on behind the scenes, which should be helpful to those who followed it. If I had to do the same thing every day, then it would become a drag, but I thought this one a very good use of time. From gward@python.net Thu Oct 3 20:07:48 2002 From: gward@python.net (Greg Ward) Date: Thu, 3 Oct 2002 15:07:48 -0400 Subject: [Spambayes] Good evening/morning/afternoon everyone In-Reply-To: References: <20021002081804.E1B7.JCARLSON@uci.edu> Message-ID: <20021003190748.GA29525@cthulhu.gerg.ca> On 02 October 2002, Tim Peters said: > My employer fiddled our system to prepend a tilde (~) to the Subject of > suspected spam. I never even noticed this until it was pointed out to me! > Which was months after we started doing it. Then again, there's not much > that doesn't escape me . They're not the only ones. I noticed this was so prevalent in spam to mail.python.org that I 1) wondered why spammers would make life so easy for filters by making their spam obvious and 2) added this SA rule: header SUBJECT_TILDE Subject =~ /^\~/ describe SUBJECT_TILDE Subject starts with a tilde (~) score SUBJECT_TILDE 2.5 I guess it's not the spammers adding tildes. Whatever, it helps! Greg -- Greg Ward http://www.gerg.ca/ Money is truthful. If a man speaks of his honor, make him pay cash. 
From seant@webreply.com Thu Oct 3 21:59:54 2002 From: seant@webreply.com (Sean True) Date: Thu, 3 Oct 2002 16:59:54 -0400 Subject: [Spambayes] Microsoft Outlook 'support' Message-ID: I've written a couple of scripts which use Mark H's win32com package to do the following: 1) Dump arbitrary mail folders in Outlook 2000 to Data/Spam/reservoir and/or Data/Ham/Reservoir 2) Train a classifier directly from Outlook 2000 folders 3) Move messages from folder to folder based on a thresholded classifier score These scripts are quite raw, and do require Outlook (as opposed to Outlook Express). If there is general interest, I'd be glad to share. The question is, where? -- Sean ------- Sean True WebReply.Com, Inc. From seant@webreply.com Thu Oct 3 22:04:40 2002 From: seant@webreply.com (Sean True) Date: Thu, 3 Oct 2002 17:04:40 -0400 Subject: [Spambayes] Bad at math. Message-ID: Is it plausible to use the classifier as a multi-class classifier by using multiple independent classifiers and 'somehow' taking the best score? Anybody want to comment on the 'somehow'? I strongly suspect that there is a better way to do this, but the results of the multiple independent classifiers appear to match the hand crafted regexp recognizer that I wrote some time ago. And are easier to maintain, if someone will keep the training set up to date. I should have paid more attention in my last job . -- Sean ------- Sean True WebReply.Com, Inc. 
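[Sean's 'somehow' admits at least one naive reading -- nothing tested on this list, just a sketch: train one binary classifier per message category and let the most confident one win. The stub classifiers below are hypothetical keyword counters standing in for real trained classifiers:]

```python
def classify(text, classifiers):
    """Score text with every binary classifier; return the winner.

    classifiers maps a category name to a callable returning a score
    (for a spambayes-style setup, each callable would be a classifier
    trained on that category vs everything else).
    """
    scores = {name: clf(text) for name, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy stand-ins so the sketch runs -- in practice each would be a
# separately trained classifier, not a keyword counter.
classifiers = {
    "itinerary": lambda t: t.lower().count("tour") + t.lower().count("dates"),
    "order":     lambda t: t.lower().count("invoice") + t.lower().count("ship"),
}

print(classify("Nancy Fly Artist's Tour Dates", classifiers))  # ('itinerary', 2)
```

[The obvious weakness, which Tim's reply below gets at: independently trained binary scores aren't calibrated against each other, so "best score" is only meaningful if the per-category scores happen to be comparable.]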
From gward@python.net Thu Oct 3 22:13:51 2002
From: gward@python.net (Greg Ward)
Date: Thu, 3 Oct 2002 17:13:51 -0400
Subject: [Spambayes] Result of a test
In-Reply-To: <3D9C4ABE.8070407@videotron.ca>
References: <3D9C4ABE.8070407@videotron.ca>
Message-ID: <20021003211351.GC29525@cthulhu.gerg.ca>

On 03 October 2002, papaDoc said:
> Looking at the prob of each word I saw something
>
> prob('battery"') = 0.844828
> prob('battery,') = 0.844828
>
> prob('powernews,') = 0.77651
> prob('powernews.') = 0.77651
>
> prob('outlet,') = 0.844828
> prob('outlet.') = 0.844828
>
> prob('luncheon') = 0.844828
> prob('luncheon:') = 0.844828
> prob('luncheons') = 0.844828
>
> I think it can be interesting to try to remove the punctuation (the . ,
> ? !) at the end of a word
> and then count it as the same word and do the same thing with the
> plural (luncheon and luncheons) based
> on a dictionary like the one in ispell.

Tim played with this very early in the project. It turned out that keeping punctuation, preserving case, and not stemming were all wins. A bit counter-intuitive, but there you go. Experiment beats intuition every time in this project.

Greg
--
Greg Ward http://www.gerg.ca/
All right, you degenerates! I want this place evacuated in 20 seconds!

From whisper@oz.net Thu Oct 3 22:19:32 2002
From: whisper@oz.net (David LeBlanc)
Date: Thu, 3 Oct 2002 14:19:32 -0700
Subject: [Spambayes] Microsoft Outlook 'support'
In-Reply-To:
Message-ID:

I'd dearly love to have your scripts for munging Outlook 2000 mail folders!! This seems like a good thing to add to a sub-project of spambayes!
David LeBlanc Seattle, WA USA > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of Sean True > Sent: Thursday, October 03, 2002 14:00 > To: spambayes@python.org > Subject: [Spambayes] Microsoft Outlook 'support' > > > I've written a couple of scripts which use Mark H's win32com package to do > the following: > > 1) Dump arbitrary mail folders in Outlook 2000 to > Data/Spam/reservoir and/or > Data/Ham/Reservoir > 2) Train a classifier directly from Outlook 2000 folders > 3) Move messages from folder to folder based on a thresholded classifier > score > > These scripts are quite raw, and do require Outlook (as opposed to Outlook > Express). If there is general interest, > I'd be glad to share. The question is, where? > > -- Sean > ------- > Sean True > WebReply.Com, Inc. > > > _______________________________________________ > Spambayes mailing list > Spambayes@python.org > http://mail.python.org/mailman-21/listinfo/spambayes From tim.one@comcast.net Thu Oct 3 22:28:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 03 Oct 2002 17:28:25 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Message-ID: [Sean True] > I've written a couple of scripts which use Mark H's win32com package > to do the following: > > 1) Dump arbitrary mail folders in Outlook 2000 to > Data/Spam/reservoir and/or > Data/Ham/Reservoir > 2) Train a classifier directly from Outlook 2000 folders > 3) Move messages from folder to folder based on a thresholded classifier > score > > These scripts are quite raw, and do require Outlook (as opposed > to Outlook Express). If there is general interest, I'd be glad to share. > The question is, where? If you're willing to let the PSF (Python Software Foundation) hold copyright, I'd be delighted to add these to the project, probably in a new Outlook2000 subdirectory. And if you've got a SourceForge account, I'd be delighted to add you as a developer to the project. 
general-interest-be-damned-*i*-use-outlook-2k-ly y'rs - tim From tim.one@comcast.net Thu Oct 3 22:40:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 03 Oct 2002 17:40:18 -0400 Subject: [Spambayes] Bad at math. In-Reply-To: Message-ID: [Sean True] > Is it plausible to use the classifier as a multi-class classifier by > using multiple independent classifiers and 'somehow' taking the best > score? It hasn't been tried here. The math in the schemes here use p(w) and 1-p(w) in various ways, where p(w) is a guess about the probability that a msg is spam given that it contains word w, and so doesn't generalize in a screamingly obvious way to N-way decisions. There's also the demonstrated error rates on traditional N-way classifiers (like ifile), which are decent but much worse than we're getting on this binary decision problem. > Anybody want to comment on the 'somehow'? Undoubtedly . > I strongly suspect that there is a better way to do this, but the > results of the multiple independent classifiers appear to match the hand > crafted regexp recognizer that I wrote some time ago. Is that good or bad? Can you quantify? If you have N classifiers, in which order do you run them? How are error rates affected if you vary the order in which you run them? Why does the porridge bird lay its eggs in the air? > And are easier to maintain, if someone will keep the training set > up to date. > > I should have paid more attention in my last job . Indeed, I'd especially love to have Larry Gillick's help here. I learned a lot about how to test from your last job, although don't tell my old boss as he would have thought that a waste of time . From whisper@oz.net Thu Oct 3 23:09:06 2002 From: whisper@oz.net (David LeBlanc) Date: Thu, 3 Oct 2002 15:09:06 -0700 Subject: [Spambayes] Bad at math. 
In-Reply-To:
Message-ID:

> [Sean True]
> > Is it plausible to use the classifier as a multi-class classifier by
> > using multiple independent classifiers and 'somehow' taking the best
> > score?
>
> It hasn't been tried here. The math in the schemes here use p(w) and 1-p(w)
> in various ways, where p(w) is a guess about the probability that a msg is
> spam given that it contains word w, and so doesn't generalize in a
> screamingly obvious way to N-way decisions. There's also the demonstrated
> error rates on traditional N-way classifiers (like ifile), which are decent
> but much worse than we're getting on this binary decision problem.
>
> > Anybody want to comment on the 'somehow'?
>
> Undoubtedly .
>
> > I strongly suspect that there is a better way to do this, but the
> > results of the multiple independent classifiers appear to match the hand
> > crafted regexp recognizer that I wrote some time ago.
>
> Is that good or bad? Can you quantify? If you have N classifiers, in which
> order do you run them? How are error rates affected if you vary the order
> in which you run them? Why does the porridge bird lay its eggs in the air?
>
> > And are easier to maintain, if someone will keep the training set
> > up to date.
> >
> > I should have paid more attention in my last job .
>
> Indeed, I'd especially love to have Larry Gillick's help here. I learned a
> lot about how to test from your last job, although don't tell my old boss as
> he would have thought that a waste of time .

From the literature search I've done, the best n-way classifier is based on Support Vector Machines. It's significantly better than naive Bayes. (As Tim points out, the Graham-Peters binary classifier isn't Bayesian at all.)
Dave LeBlanc
Seattle, WA USA

From papaDoc@videotron.ca Fri Oct 4 04:16:09 2002
From: papaDoc@videotron.ca (Remi Ricard)
Date: Thu, 03 Oct 2002 23:16:09 -0400
Subject: [Spambayes] Result of a test
In-Reply-To: <20021003211351.GC29525@cthulhu.gerg.ca>
References: <3D9C4ABE.8070407@videotron.ca> <20021003211351.GC29525@cthulhu.gerg.ca>
Message-ID: <1033701369.4141.17.camel@localhost.localdomain>

Hi,

> > I think it can be interesting to try to remove the punctuation (the . ,
> > ? !) at the end of a word
> > and then count it as the same word and do the same thing with the
> > plural (luncheon and luncheons) based
> > on a dictionary like the one in ispell.
>
> Tim played with this very early in the project. Turned out that keeping
> punctuation, preserving case, and not stemming, were all wins. A bit
> counter-intuitive, but there you go. Experiment beats intuition every
> time in this project.

I read the comments in the file tokenizer.py and saw that it was already tried. Sorry... So I tried something else ;-) Since spam wants to catch your attention, it uses ? and ! very often. So I removed only the ',' and '.' and ':'.

This is the patch:

    # Tokenize everything in the body.
    for w in text.split():
        n = len(w)
        # Make sure this range matches in tokenize_word().
        if 3 <= n <= 12:
            if w[-1] == ',' or w[-1] == '.' or w[-1] == ':':
                w = w[:-1]
            yield w
        elif n >= 3:
            for t in tokenize_word(w):
                yield t

Please don't flame me -- this is my first modification of Python code; I'm more a C and C++ guy....
This is the result:

run1s -> run2s
-> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> tested 218 hams & 268 spams against 948 hams & 1124 spams
-> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> tested 218 hams & 268 spams against 948 hams & 1124 spams

false positive percentages
    0.889  0.444  won   -50.06%
    0.826  1.240  lost  +50.12%
    1.594  1.594  tied
    1.304  1.304  tied
    0.000  0.000  tied

won   1 times
tied  3 times
lost  1 times

total unique fp went from 11 to 11 tied
mean fp % went from 0.922661698796 to 0.916417438007 won -0.68%

false negative percentages
    0.717  0.717  tied
    0.727  0.364  won   -49.93%
    1.342  1.678  lost  +25.04%
    0.000  0.368  lost  +(was 0)
    0.746  0.373  won   -50.00%

won   2 times
tied  1 times
lost  2 times

total unique fn went from 10 to 10 tied
mean fn % went from 0.706533828263 to 0.699823195589 won -0.95%

ham mean                          ham sdev
   24.18   24.58   +1.65%            9.24    8.93   -3.35%
   25.70   26.23   +2.06%            8.47    8.21   -3.07%
   25.51   25.87   +1.41%            9.12    8.90   -2.41%
   25.01   25.34   +1.32%            8.08    8.07   -0.12%
   24.93   25.36   +1.72%            8.27    8.17   -1.21%

ham mean and sdev for all runs
   25.08   25.49   +1.63%            8.67    8.49   -2.08%

spam mean                         spam sdev
   80.43   79.91   -0.65%            8.79    8.78   -0.11%
   79.72   79.38   -0.43%            8.30    8.12   -2.17%
   79.67   79.25   -0.53%            8.83    8.69   -1.59%
   80.09   79.73   -0.45%            8.15    8.17   +0.25%
   79.84   79.48   -0.45%            9.35    9.07   -2.99%

spam mean and sdev for all runs
   79.95   79.55   -0.50%            8.70    8.58   -1.38%

ham/spam mean difference: 54.87 54.06 -0.81

papaDoc

From tim.one@comcast.net Fri Oct 4 06:02:31 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 04 Oct 2002 01:02:31 -0400
Subject: [Spambayes] Result of a test
In-Reply-To:
<1033701369.4141.17.camel@localhost.localdomain> Message-ID: [Remi Ricard] >>> I think it can be interesting to try to remove the >>> punctuation (the . , > > > ? !) at the end of a word >>> and then count it as the same word and do the same thing with the >>> plural (luncheon and luncheons) based >>> on a dictionary like the one in ispell. [Greg Ward] >> Tim played with this very early in the project. Turned out that >> keeping punctuation, preserving case, and not stemming, were all >> wins. A bit counter-intuitive, but there you go. Experiment beats >> intuition every time in this project. Rather than saying that keeping punctuation won, I'd say instead that simple split-on-whitespace beat searching for alphanumeric runs (where "alphanumeric runs" sometimes included things like '$' and '_' and '-' too). I'd say that because that's what I tested . This should be revisited, since boosting max_discriminators in particular may change the conclusion. About preserving case, it was my *intuition* (and, indeed, a strong intuition) that preserving case would be a clear win. Testing didn't back me on that, though (see comments in tokenizer.py), and so we stopped preserving case -- the data said it bloated the database without significant benefit. This too should be revisited. However, experiments also showed that preserving case in Subject lines was a winner, so we do preserve case there. See tokenizer.py comments too for another scheme that, at the time, significantly cut the f-n rate at the cost of major database bloat. About stemming/lemmatization, I didn't even try it, mostly because the code to do so is expensive and highly language-dependent. We've since heard Matt Sergeant's testimony that it hurt when he tried it, so I'm even less motivated to bother. Everyone should feel encouraged to try these things on their own! If you find something that wins for you, share it, and we can set up a cross-corpus test here. 
[Remi Ricard]
> I read the comments in the file tokenizer.py and saw that It was
> already tried. Sorry...

No problem. Read TESTING.txt too -- heresy can pay .

> So I tried something else ;-)
> Since spam want to catch your attention they use ? ! very often. So
> I remove only the ',' and '.' and ':'
>
> This is the patch:
>     # Tokenize everything in the body.
>     for w in text.split():
>         n = len(w)
>         # Make sure this range matches in tokenize_word().
>         if 3 <= n <= 12:
>             if w[-1] == ',' or w[-1] == '.' or w[-1] == ':':
>                 w = w[:-1]
>             yield w

I'd write that part:

    while w and w[-1] in ',.:':
        w = w[:-1]
        n -= 1
    if n >= 3:
        yield w

For whatever reason, putting "words" with fewer than 3 chars in the database has hurt results whenever I've tried it.

> ...
> Please don't flame me this is my first modification of python code
> I'm more a C and C++ guy....

That won't last .

> This is the result:
> run1s -> run2s
> -> tested 225 hams & 279 spams against 941 hams & 1113 spams
> -> tested 242 hams & 275 spams against 924 hams & 1117 spams
> -> tested 251 hams & 298 spams against 915 hams & 1094 spams
> -> tested 230 hams & 272 spams against 936 hams & 1120 spams
> -> tested 218 hams & 268 spams against 948 hams & 1124 spams
> -> tested 225 hams & 279 spams against 941 hams & 1113 spams
> -> tested 242 hams & 275 spams against 924 hams & 1117 spams
> -> tested 251 hams & 298 spams against 915 hams & 1094 spams
> -> tested 230 hams & 272 spams against 936 hams & 1120 spams
> -> tested 218 hams & 268 spams against 948 hams & 1124 spams

Do "rebal -h" for instructions on how to use an automagical "rebalancing" script -- rebal will even out the # of ham and spam across your directories. It's *OK* if they're unbalanced, it just complicates life a little.
> false positive percentages
>     0.889  0.444  won   -50.06%
>     0.826  1.240  lost  +50.12%
>     1.594  1.594  tied
>     1.304  1.304  tied
>     0.000  0.000  tied
>
> won   1 times
> tied  3 times
> lost  1 times

Seems to have had small random effects in both directions, with no overall tendency.

> total unique fp went from 11 to 11 tied
> mean fp % went from 0.922661698796 to 0.916417438007 won -0.68%
>
> false negative percentages
>     0.717  0.717  tied
>     0.727  0.364  won   -49.93%
>     1.342  1.678  lost  +25.04%
>     0.000  0.368  lost  +(was 0)
>     0.746  0.373  won   -50.00%
>
> won   2 times
> tied  1 times
> lost  2 times

Ditto. BTW, it looks like the default value of spam_cutoff is probably too high for your data.

> total unique fn went from 10 to 10 tied
> mean fn % went from 0.706533828263 to 0.699823195589 won -0.95%
>
> ham mean                          ham sdev
>    24.18   24.58   +1.65%            9.24    8.93   -3.35%
>    25.70   26.23   +2.06%            8.47    8.21   -3.07%
>    25.51   25.87   +1.41%            9.12    8.90   -2.41%
>    25.01   25.34   +1.32%            8.08    8.07   -0.12%
>    24.93   25.36   +1.72%            8.27    8.17   -1.21%
>
> ham mean and sdev for all runs
>    25.08   25.49   +1.63%            8.67    8.49   -2.08%
>
> spam mean                         spam sdev
>    80.43   79.91   -0.65%            8.79    8.78   -0.11%
>    79.72   79.38   -0.43%            8.30    8.12   -2.17%
>    79.67   79.25   -0.53%            8.83    8.69   -1.59%
>    80.09   79.73   -0.45%            8.15    8.17   +0.25%
>    79.84   79.48   -0.45%            9.35    9.07   -2.99%
>
> spam mean and sdev for all runs
>    79.95   79.55   -0.50%            8.70    8.58   -1.38%
>
> ham/spam mean difference: 54.87 54.06 -0.81

Those are mixed signs, but overall on the bad side: the average score of ham went up on every run, and the average score of spam went down on every run. That means they're closer together . The variance of both decreased, though, so while your populations grew closer together, they're each tighter than they were; alas, the decrease in variance wasn't enough relative to the decrease in spread (the difference between means) to reduce the likely overlap any. You're off to a great start, Remi! Keep at it.
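[Tim's suggested rewrite, pulled out into a runnable sketch -- the enclosing function, the tokenize_word stub, and the sample text are mine, not the spambayes tokenizer:]

```python
def tokenize_body(text, tokenize_word=lambda w: iter(())):
    # Simplified version of the loop under discussion.  tokenize_word is
    # a stub standing in for the helper that breaks up over-long "words".
    for w in text.split():
        n = len(w)
        # Tim's variant: strip any run of trailing ',' '.' ':' first,
        # then only keep "words" that are still at least 3 chars long.
        while w and w[-1] in ',.:':
            w = w[:-1]
            n -= 1
        if 3 <= n <= 12:
            yield w
        elif n > 12:
            for t in tokenize_word(w):
                yield t

# Note '!' is deliberately not stripped -- as Remi says, it's a spam signal.
print(list(tokenize_body("Visit our outlet, free luncheon: now!!")))
# ['Visit', 'our', 'outlet', 'free', 'luncheon', 'now!!']
```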
From tim.one@comcast.net Fri Oct 4 06:11:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 04 Oct 2002 01:11:58 -0400
Subject: [Spambayes] Result of a test
In-Reply-To: <20021003211351.GC29525@cthulhu.gerg.ca>
Message-ID:

>> prob('powernews,') = 0.77651
>> prob('powernews.') = 0.77651

BTW, it's impossible under Gary's probability adjustment (provided you stick to the default "unknown word prob" of 0.5) for a spamprob to move "to the other side" of 0.5 than the probability-by-counting estimate was (this wasn't true when we were using Paul's prob calculations: there it was possible for a word to be a ham indicator even if it appeared more often in spam(!)). So that tells me that these variants of "powernews" *did* appear more often in spam than in ham in the training data. But that's a very unlikely word, and it shows up routinely in all the "APC PowerNews" false positives papaDoc reported. This very strongly suggests that the spam in that collection is polluted with ham, and specifically that some APC PowerNews newsletters were incorrectly classified as spam in the training data. This would go a long way toward explaining why the "APC PowerNews" false positives got such extremely high scores (if the system was fed some and *told* they were spam, it believes you ).

From stephena@hiwaay.net Fri Oct 4 07:31:42 2002
From: stephena@hiwaay.net (Stephen Anderson)
Date: Fri, 4 Oct 2002 01:31:42 -0500 (CDT)
Subject: [Spambayes] splitndirs bug [need help]
Message-ID:

Hi,

I'm trying to use splitndirs to split collected spam mboxes into maildir format for testing. I've got exactly 3 hours of Python experience and I'm running into a wall.

Splitndirs is incorrectly capturing and munging the "From " line of the "next" message at the end of the preceding message. This oddity seems to be happening because of a malfunction of a .read(length) call. In mailbox.py on line 53, part of the _Subfile.read(length) function, there is a call to "self.fp.read(length)".
Now length is defined from self.stop - self.pos. Before the call, self.pos is 7058L and self.stop is 10987L. Consequently, length is 2788L. Now if I understand this right, we should read 2788 bytes. But after the call, self.pos is 11101L. This represents an overread of 114 bytes, which also happens to be the length of the "From " line that we were supposed to stop in front of. And said "From " line is part of the read data.

I traced it down to this point, but I can't seem to find the definition of the self.fp.read(length) function. I suspect it's OS specific, but I can't find it in the Python libraries. FYI, I am running WinXP Pro. Can somebody please help me out; I've hit my Python newbie limit.

Thanks!
Stephen Anderson

From richie@entrian.com Fri Oct 4 09:06:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 04 Oct 2002 09:06:25 +0100
Subject: [Spambayes] Cunning use of quoted-printable
In-Reply-To:
References:
Message-ID:

Tim,

> [...] made an opportunity to
> write a little tutorial on what's going on behind the scenes, which should
> be helpful to those who followed it.

Glad to help -- any time you want to write such a tutorial, and need a slow and dimwitted pupil to explain things to, I'm your man. 8-)

--
Richie Hindle richie@entrian.com

From richie@entrian.com Fri Oct 4 09:06:59 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 04 Oct 2002 09:06:59 +0100
Subject: [Spambayes] splitndirs bug [need help]
In-Reply-To:
References:
Message-ID:

Hi Stephen,

> Splitndirs is incorrectly capturing and munging the "From " line of the
> "next" message at the end of the preceding message. [...] This represents
> an overread of 114 bytes.

This is because mboxutils.py is opening the mailbox file in text mode, but the Python mailbox library uses tell() and seek() to navigate around the file, which is no good with text-mode files on Windows.
I've patched my mboxutils.py by changing the third-to-last line of mboxutils.py from: fp = open(name) to fp = open(name, "rb") and that seemed to fix it. I've been meaning to commit this, but I need to work out whether reading the '\r\n' line endings will break anything (Tim?) In the meantime, that should fix your problem. -- Richie Hindle richie@entrian.com From Alexander@Leidinger.net Fri Oct 4 09:17:19 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Fri, 4 Oct 2002 10:17:19 +0200 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: References: Message-ID: <20021004101719.2730fdf5.Alexander@Leidinger.net> On Thu, 03 Oct 2002 17:28:25 -0400 Tim Peters wrote: > If you're willing to let the PSF (Python Software Foundation) hold > copyright, I'd be delighted to add these to the project, probably in a > new Outlook2000 subdirectory. I suggest to add a level of indirection here, please don't put the Outlook directory directly into the spambayes root. If someone else provides some sort of code for other MUAs we would end up with a lot of directories in the base. Bye, Alexander. -- ...and that is how we know the Earth to be banana-shaped. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From guido@python.org Fri Oct 4 13:31:03 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 08:31:03 -0400 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: Your message of "Fri, 04 Oct 2002 09:06:59 BST." References: Message-ID: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> > This is because mboxutils.py is opening the mailbox file in text mode, but > the Python mailbox library uses tell() and seek() to navigate around the > file, which is no good with text-mode files on Windows. > > I've patched my mboxutils.py by changing the third-to-last line of > mboxutils.py from: > > fp = open(name) > > to > > fp = open(name, "rb") Good catch! 
> and that seemed to fix it. I've been meaning to commit this, but I need > to work out whether reading the '\r\n' line endings will break anything > (Tim?) In the meantime, that should fix your problem. I think Tim is already opening his message files with 'rb' -- his code doesn't use mboxutils.py. So please go ahead! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Oct 4 13:32:06 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 08:32:06 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Your message of "Fri, 04 Oct 2002 10:17:19 +0200." <20021004101719.2730fdf5.Alexander@Leidinger.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> Message-ID: <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> > I suggest to add a level of indirection here, please don't put the > Outlook directory directly into the spambayes root. If someone else > provides some sort of code for other MUAs we would end up with a lot > of directories in the base. I disagree. There aren't going to be that many, and (as you may have noticed :-) I'm not fond of deep hierarchies -- they tend to obscure more than they help. --Guido van Rossum (home page: http://www.python.org/~guido/) From Alexander@Leidinger.net Fri Oct 4 13:56:14 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Fri, 4 Oct 2002 14:56:14 +0200 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20021004145614.452ae254.Alexander@Leidinger.net> On Fri, 04 Oct 2002 08:32:06 -0400 Guido van Rossum wrote: > > I suggest to add a level of indirection here, please don't put the > > Outlook directory directly into the spambayes root. 
If someone else > > provides some sort of code for other MUAs we would end up with a lot > > of directories in the base. > > I disagree. There aren't going to be that many, and (as you may have > noticed :-) I'm not fond of deep hierarchies -- they tend to obscure > more than they help. I don't want something like MUA_Interfaces/Windows/Outlook2000, just one level of indirection, like ProgramInterfaces/Outlook2000 or something like this. Having several pages of ls output isn't very userfriendly. Grouping relevant pieces together and hiding things which aren't relevant in the actual context is userfriendly. Bye, Alexander. -- Weird enough for government work. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From guido@python.org Fri Oct 4 14:07:34 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 09:07:34 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Your message of "Fri, 04 Oct 2002 14:56:14 +0200." <20021004145614.452ae254.Alexander@Leidinger.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> <20021004145614.452ae254.Alexander@Leidinger.net> Message-ID: <200210041307.g94D7YN21286@pcp02138704pcs.reston01.va.comcast.net> > I don't want something like MUA_Interfaces/Windows/Outlook2000, > just one level of indirection, like ProgramInterfaces/Outlook2000 or > something like this. Then at least use "mua/Outlook2000". Let's please start a convention that directory names should be short lowercase words, and keep the ugly camelcase for class names. > Having several pages of ls output isn't very userfriendly. You already got that now. It has to become a *lot* worse before it's going to bother me. One extra subdir per supported MUA won't add that much (especially since most MUAs are irrelevant in practice :-). 
> Grouping relevant pieces together and hiding things > which aren't relevant in the actual context is userfriendly. One directory per MUA seems plenty of hiding to me. --Guido van Rossum (home page: http://www.python.org/~guido/) From Alexander@Leidinger.net Fri Oct 4 14:22:39 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Fri, 4 Oct 2002 15:22:39 +0200 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: <200210041307.g94D7YN21286@pcp02138704pcs.reston01.va.comcast.net> References: <20021004101719.2730fdf5.Alexander@Leidinger.net> <200210041232.g94CW6U20188@pcp02138704pcs.reston01.va.comcast.net> <20021004145614.452ae254.Alexander@Leidinger.net> <200210041307.g94D7YN21286@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20021004152239.4bd25185.Alexander@Leidinger.net> On Fri, 04 Oct 2002 09:07:34 -0400 Guido van Rossum wrote: > > I don't want something like MUA_Interfaces/Windows/Outlook2000, > > just one level of indirection, like ProgramInterfaces/Outlook2000 or > > something like this. > > Then at least use "mua/Outlook2000". Let's please start a convention > that directory names should be short lowercase words, and keep the > ugly camelcase for class names. I don't care how the directory is spelled. I'm fine with everything you come up with (someone may want to add files for MTAs, I like to have them in the same directory, but I don't care if you want to add a mta directory for them ;-) ). > > Having several pages of ls output isn't very userfriendly. > > You already got that now. It has to become a *lot* worse before it's Yes. That's the reason I wrote my initial mail on this topic. > going to bother me. One extra subdir per supported MUA won't add that > much (especially since most MUAs are irrelevant in practice :-). > > > Grouping relevant pieces together and hiding things > > which aren't relevant in the actual context is userfriendly. > > One directory per MUA seems plenty of hiding to me. 
Normally you are only interested in files for your MUA, aren't you? Bye, Alexander. -- Give a man a fish and you feed him for a day; teach him to use the Net and he won't bother you for weeks. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 From tim@zope.com Fri Oct 4 17:56:02 2002 From: tim@zope.com (Tim Peters) Date: Fri, 4 Oct 2002 12:56:02 -0400 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: Message-ID: [Richie Hindle] > ... > This is because mboxutils.py is opening the mailbox file in text mode, > but the Python mailbox library uses tell() and seek() to navigate > around the file, which is no good with text-mode files on Windows. Not quite. seek and tell are fine with text-mode Windows files, provided you stick to what C guarantees about them with text mode files: you can seek to a position previously returned by tell(), but that's essentially *all* that's defined. In particular, trying to do arithmetic on text-mode tell() results has no meaning, and Stephen found code doing > a call exists to "self.fp.read(length)". Now length is defined from > self.stop - self.pos. *That* makes no sense for text-mode files on Windows. (BTW, good detective work, Stephen!) > I've patched my mboxutils.py by changing the third-to-last line of > mboxutils.py from: > > fp = open(name) > > to > > fp = open(name, "rb") > > and that seemed to fix it. Yes! Please check that in. Besides the seek/tell business, opening a mail archive in text mode under Windows is likely to truncate the data prematurely, if the archive contains any 8-bit chars (the first instance of chr(26) is taken to mean EOF in Windows text mode). > I've been meaning to commit this, but I need to work out whether > reading the '\r\n' line endings will break anything (Tim?) I can't say for sure, but if it does I'll fix it.
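The chr(26) hazard Tim describes is easy to make concrete: Windows text mode treats a stray Ctrl-Z byte as end-of-file, while binary mode always returns every byte. A small sketch of the binary-mode guarantee (the truncation itself only manifests on Windows, so only the "rb" side is asserted here):

```python
import os, tempfile

# A message containing a raw Ctrl-Z byte (chr(26)) and \r\n line endings.
payload = b"Subject: test\r\n\r\nbefore\x1aafter\r\n"

fd, name = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)

with open(name, "rb") as fp:       # the fix: always open mail files "rb"
    data = fp.read()
os.remove(name)

assert data == payload             # every byte survives, \r\n untouched
assert b"\x1a" in data             # the Ctrl-Z byte is still there
```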
Offhand, the only pieces that *might* be vulnerable are regular expressions assuming plain \n line endings, but it's unlikely they would fall into a trap here. I normalized all line endings to plain \n in my data, BTW: before that, all my spam had \r\n, and all my ham plain \n, and when experimenting with character n-grams the mere fact of different line endings proved to be a killer strong clue! Bottom lines: all mail files should always be opened in binary mode, and spambayes code should never be sensitive to line endings. From tim@zope.com Fri Oct 4 18:01:17 2002 From: tim@zope.com (Tim Peters) Date: Fri, 4 Oct 2002 13:01:17 -0400 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > I think Tim is already opening his message files with 'rb' Yes, all code I've written for the project already opens files in binary mode. > ... > So please go ahead! Ditto! From tim@zope.com Fri Oct 4 18:05:51 2002 From: tim@zope.com (Tim Peters) Date: Fri, 4 Oct 2002 13:05:51 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: <20021004101719.2730fdf5.Alexander@Leidinger.net> Message-ID: [Alexander Leidinger] > ... > I suggest to add a level of indirection here, please don't put the > Outlook directory directly into the spambayes root. If someone else > provides some sort of code for other MUAs we would end up with a lot > of directories in the base. [and Guido and Alexander go back & forth on this] Sorry, I'm unpersuaded. Most users have no idea what "MUA" or "MTA" mean, and I'm not going to hide what they're looking for under layers of geek-speak. Neither you nor they will be confused by a directory named Outlook2000; if 30 other such directories appear, then I'll think about rearranging stuff; I predict it ain't gonna happen. 
From tim.one@comcast.net Fri Oct 4 20:11:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 04 Oct 2002 15:11:11 -0400 Subject: [Spambayes] For the bold Message-ID: I checked in enough stuff so that bold experimenters can play with the central-limit schemes, but not yet enough so that I (and Rob, bless his heart) can get a detailed picture of what's going on under the covers (patience, please). You CANNOT use a cross-validation test with these schemes. So don't use timcv or mboxtest. timtest is fine, or any other grid driver (are there any?). I believe I'll need to whip up a custom driver for deeper analysis to make progress. You CANNOT meaningfully compare error rates between a cross-validation driver and a grid driver. Don't even think about it. If you want to do comparisons with a central-limit scheme, use a grid driver for both. A sample .ini file:

"""
[Classifier]
use_central_limit2: True
max_discriminators: 50
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
nbuckets: 4
"""

Note that, for now, every message gets one of just 4 distinct scores when a central-limit scheme is in use:

0.00 -- certain it's ham
0.49 -- guesses ham but is unsure
0.51 -- guesses spam but is unsure
1.00 -- certain it's spam

That's the reason for setting nbuckets to 4: more than that won't do you a lick of good, as there are only 4 possible scores. spam_cutoff must also be exactly 0.50, and for the same reason; the "best cutoff" histogram analysis is still displayed, but is meaningless. Nothing is known about how max_discriminators affects this. Play! Nothing is known about how use_central_limit (as opposed to use_central_limit2) works with this. Play! When one of the central-limit schemes is in use, the list of (word, prob) clues returned by spamprob() now has two made-up entries at the start, in this order: ('*zham*', zham), ('*zspam*', zspam) These are the ham and spam zscores.
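The message doesn't spell out how the two zscores turn into the four scores, but one rule that is consistent with every worked example quoted in this thread is: guess the class whose zscore is nearer zero, and be "certain" only when the other zscore is at least zscore_ratio_cutoff times as extreme. A sketch of that guessed rule (an assumption, not the checked-in code):

```python
def central_limit_score(zham, zspam, zscore_ratio_cutoff=1.9):
    """Hypothetical mapping from the two zscores to the four scores.
    NOT the actual spambayes logic -- just one rule that reproduces
    the worked examples in this thread."""
    guess_spam = abs(zspam) < abs(zham)      # nearer-zero zscore wins
    near, far = sorted([abs(zham), abs(zspam)])
    certain = far >= zscore_ratio_cutoff * near
    if guess_spam:
        return 1.00 if certain else 0.51
    return 0.00 if certain else 0.49

# The Nigerian-scam false positive reported later in the thread
# (zham=-39.112, zspam=-7.06214) comes out "certain it's spam":
assert central_limit_score(-39.112, -7.06214) == 1.00
# The false-positive listing quoted next (zham=-65.9011, zspam=-53.3419)
# comes out "guesses spam but is unsure":
assert central_limit_score(-65.9011, -53.3419) == 0.51
```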
So, for example, a listing of a false positive now begins like so:

Data/Ham/Set2/143733.txt
prob = 0.51
prob('*zham*') = -65.9011
prob('*zspam*') = -53.3419
prob('header:Errors-To:1') = 0.0266272
prob('subject:: ') = 0.0266272
prob('python') = 0.0412844
...

Here's something remarkable. I just tried this, with the .ini file given above, like so: timtest.py -n5 --s=10 --h=10 -s123 In other words, this does 5**2-5 = 20 runs, training the classifier each time on *just* 10 random ham and 10 random spam, and then predicting against 10 disjoint random ham and 10 disjoint random spam. Here's the bottom line from this run (the "all runs" histograms at the end):

-> Ham scores for all runs: 200 items; mean 5.42; sdev 15.42
-> min 0; median 0; max 51
* = 3 items
  0.0 178 ************************************************************
 25.0  19 *******
 50.0   3 *
 75.0   0

-> Spam scores for all runs: 200 items; mean 93.13; sdev 17.03
-> min 49; median 100; max 100
* = 3 items
  0.0   0
 25.0   1 *
 50.0  27 *********
 75.0 172 **********************************************************

The 0.00 score ends up in the 0.0 bucket. The 0.49 score ends up in the 25.0 bucket. The 0.51 score ends up in the 50.0 bucket. The 1.00 score ends up in the 75.0 bucket. Even with such little data, this was never wrong when it was certain. For ham, it was wrong 3 of the 19+3=22 times it was unsure. For spam, it was wrong 1 of the 27+1=28 times it was unsure. What surprised me most there is-- given how little training was done --just how often it *was* "certain". This continues to suggest that these schemes have enormous potential, but we still don't know how to exploit it (although with my pragmatic hat on, I'd say we're already doing a not-too-shabby job of exploiting it ).
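The bucket placement Tim walks through is ordinary fixed-width histogram binning on a 0-100 scale; a sketch (assuming that standard binning rule, which reproduces all four placements):

```python
def bucket_label(score, nbuckets=4):
    """Lower bound of the histogram bucket a score in [0, 1] lands in,
    on a 0-100 scale with nbuckets fixed-width buckets."""
    index = min(int(score * nbuckets), nbuckets - 1)  # clamp score 1.0
    return index * (100.0 / nbuckets)

assert bucket_label(0.00) == 0.0    # certain ham
assert bucket_label(0.49) == 25.0   # guesses ham but is unsure
assert bucket_label(0.51) == 50.0   # guesses spam but is unsure
assert bucket_label(1.00) == 75.0   # certain spam
```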
From tim.one@comcast.net Fri Oct 4 20:43:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 04 Oct 2002 15:43:26 -0400 Subject: [Spambayes] For the bold In-Reply-To: Message-ID: BTW, that teensy test run I reported on uncovered a ham hiding in BruceG's spam -- it was one of the "false negatives" the central-limit scheme said it was unsure about, but *guessed* it was ham (note that both zscores are very large):

"""
Data/Spam/Set5/6510.txt
prob = 0.49
prob('*zham*') = -31.6082
prob('*zspam*') = -44.4025
prob('header:Organization:1') = 0.00738916
prob('wrote:') = 0.0110024
prob('header:User-Agent:1') = 0.0167286
prob('class') = 0.0412844
prob('files') = 0.0412844
prob('comes') = 0.0505618
prob('hi,') = 0.0652174
prob('might') = 0.12963
prob('subject:: ') = 0.135891
prob('contains:') = 0.155172
prob('files.') = 0.155172
prob('there.') = 0.155172
prob('inc.') = 0.155172
prob('subject:?') = 0.194323
prob('charset:us-ascii') = 0.244597
prob('line') = 0.263314
prob('content-type:text/plain') = 0.306763
prob('proto:http') = 0.681245
prob('skip:p 10') = 0.691388
prob('will') = 0.700267
prob('url:org') = 0.701342
prob('url:www') = 0.702475
prob('easily') = 0.724719
prob('been') = 0.740964
prob('your') = 0.752572
prob('addresses') = 0.775229
prob('subject:-') = 0.775229
prob('people') = 0.776817
prob('url:html') = 0.776817
prob('world') = 0.810078
prob('subject:000') = 0.844828
prob('subject:. ') = 0.844828
prob('ease') = 0.844828
prob('sent') = 0.85503
prob('bulk') = 0.908163
prob('subject:,') = 0.908163
prob('emails') = 0.908163
prob('low') = 0.908163
prob('our') = 0.918944
prob('regardless') = 0.934783
prob('received.') = 0.934783
prob('info') = 0.958716
prob('million') = 0.965116
prob('send') = 0.969799
prob('unsubscribe') = 0.969799
prob('header:Return-Path:1') = 0.971807
prob('header:Received:7') = 0.973373
prob('money') = 0.983271
prob('email') = 0.991159
prob('please') = 0.991803

Return-Path:
Delivered-To: lists-linux-kernel@bruce-guenter.dyndns.org
Received: (qmail 27880 invoked from network); 16 Apr 2002 17:23:30 -0000
Received: from vger.kernel.org (209.116.70.75) by bruce-guenter.dyndns.org (192.168.1.3) with ESMTP; 16 Apr 2002 17:23:30 -0000
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Tue, 16 Apr 2002 13:20:37 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Tue, 16 Apr 2002 13:20:36 -0400
Received: from moutvdomng1.kundenserver.de ([212.227.126.181]:56795 "EHLO moutvdomng1.kundenserver.de") by vger.kernel.org with ESMTP id ; Tue, 16 Apr 2002 13:20:35 -0400
Received: from [212.227.126.155] (helo=mrvdomng2.kundenserver.de) by moutvdomng1.kundenserver.de with esmtp (Exim 3.22 #2) id 16xWd5-0001Gw-00 for linux-kernel@vger.kernel.org; Tue, 16 Apr 2002 19:20:31 +0200
Received: from pd9e23b10.dip.t-dialin.net ([217.226.59.16] helo=ngforever.de) by mrvdomng2.kundenserver.de with esmtp (Exim 3.22 #2) id 16xWd4-0007sA-00 for linux-kernel@vger.kernel.org; Tue, 16 Apr 2002 19:20:31 +0200
Message-ID: <3CBC5D5D.7060909@ngforever.de>
Date: Tue, 16 Apr 2002 11:20:29 -0600
From: Thunder from the hill
Organization: The LuckyNet Administration
User-Agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:0.9.9+) Gecko/20020405
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: LKML
Subject: Re: 60 Million Emails inc. 600,000 Uk =?ISO-8859-1?Q?=A319=2E95?=
References: <20020416154606Z313666-22651+7853@vger.kernel.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 948

Hi,

Bulk Email Cd wrote:
> Bulk Email CD just #19.95 inc. p&p and contains:
>
> 60 Million World wide email addresses.
> 600,000 VALIDATED UK email addresses - Verified in March 2002, ensuring a low failure rate.
>
> The World-wide emails have been split and compressed into many files for ease of use. The UK lists comes in easily identifiable files.
>
> The CD comes with simple instuctions and will be sent by first class post as soon as your money has been received.
> [Snip]

People selling email addresses really make me sick. People degraded to wares, regardless of their personalities. We might even find Alan in there.

Regards,
Thunder
--
Thunder from the hill. Citizen of our universe.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
"""

If I have to keep the quote of the Nigerian-scam spam in my ham, this one has noooooo excuse for being called spam . Here's another one it was unsure about:

"""
Return-Path:
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 15516 invoked from network); 8 Aug 2002 22:20:13 -0000
Received: from mail.inet.pl (195.116.59.85) by churchill.factcomp.com with SMTP; 8 Aug 2002 22:20:13 -0000
Received: (qmail 26458 invoked by uid 33); 8 Aug 2002 22:26:04 -0000
Date: 8 Aug 2002 22:26:04 -0000
Message-ID: <20020808222604.26455.qmail@mail.inet.pl>
TO: bruceg@em.ca
From: jax@inet.pl
Subject: Wiadomość została dostarczona
Content-Length: 129

Twoja Wiadomość została dostarczona !
Zostanie jednak przeczytana 12 sierpnia.
Do tego czasu korzystam z wypoczynku.
"""

I have no idea -- do you? I really despise the presumption that non-English msgs are spam, BTW. From guido@python.org Fri Oct 4 20:59:12 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 04 Oct 2002 15:59:12 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Your message of "Fri, 04 Oct 2002 13:05:51 EDT." References: Message-ID: <200210041959.g94JxCR29867@pcp02138704pcs.reston01.va.comcast.net> > Sorry, I'm unpersuaded. Most users have no idea what "MUA" or "MTA" mean, > and I'm not going to hide what they're looking for under layers of > geek-speak. Neither you nor they will be confused by a directory named > Outlook2000; if 30 other such directories appear, then I'll think about > rearranging stuff; I predict it ain't gonna happen. What I said. --Guido van Rossum (home page: http://www.python.org/~guido/) From richie@entrian.com Fri Oct 4 20:59:23 2002 From: richie@entrian.com (Richie Hindle) Date: Fri, 04 Oct 2002 20:59:23 +0100 Subject: [Spambayes] splitndirs bug [need help] In-Reply-To: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> References: <200210041231.g94CV3020176@pcp02138704pcs.reston01.va.comcast.net> Message-ID: > So please go ahead! Done. -- Richie Hindle richie@entrian.com From python-spambayes@discworld.dyndns.org Fri Oct 4 21:02:08 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Fri, 4 Oct 2002 14:02:08 -0600 Subject: [Spambayes] For the bold In-Reply-To: ; from tim.one@comcast.net on Fri, Oct 04, 2002 at 03:43:26PM -0400 References: Message-ID: <20021004140208.A3542@discworld.dyndns.org> Tim Peters wrote: > > Here's another one it was unsure about: > """ > Return-Path: > Delivered-To: em-ca-bruceg@em.ca Original envelope recipient address: bruceg@em.ca, his most widely-advertised address, and one that is trivial to harvest from webpages. Suspicious. > Twoja Wiadomość została dostarczona !
> Zostanie jednak przeczytana 12 sierpnia. > Do tego czasu korzystam z wypoczynku. > > I have no idea -- do you? I really despise the presumption that non-English > msgs are spam, BTW. In this case, it appears to be true. I can't read it, but identical messages (with identical envelope senders) went to the linux-kernel mailing list, the qmail mailing list, and a bunch of other people who have it listed in their spam blocklists and such. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From tim.one@comcast.net Fri Oct 4 21:30:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 04 Oct 2002 16:30:51 -0400 Subject: [Spambayes] For the bold In-Reply-To: <20021004140208.A3542@discworld.dyndns.org> Message-ID: >> Twoja Wiadomość została dostarczona ! >> Zostanie jednak przeczytana 12 sierpnia. >> Do tego czasu korzystam z wypoczynku. > In this case, it appears to be true. I can't read it, but > identical messages (with identical envelope senders) went to the > linux-kernel mailing list, the qmail mailing list, and a bunch of > other people who have it listed in their spam blocklists and such. Luckily, I was able to find an excellent Polish -> Ukrainian dictionary online, and with my encyclopedic knowledge of Ukrainian can supply this fine translation: Your it's known < known > your ¶ & # 230; & # 179 zosta; but supplied! However, it will be read 12 august. I use with refreshment for this time. OTOH, it may be a technical service manual for a 1993 Subaru wagon, or perhaps a translation of Romeo and Juliet. I'll keep researching it in my spare time ...
From python-spambayes@discworld.dyndns.org Fri Oct 4 21:38:29 2002 From: python-spambayes@discworld.dyndns.org (Charles Cazabon) Date: Fri, 4 Oct 2002 14:38:29 -0600 Subject: [Spambayes] For the bold In-Reply-To: ; from tim.one@comcast.net on Fri, Oct 04, 2002 at 04:30:51PM -0400 References: <20021004140208.A3542@discworld.dyndns.org> Message-ID: <20021004143829.A5033@discworld.dyndns.org> Tim Peters wrote: > > Luckily, I was able to find an excellent Polish -> Ukrainian dictionary > online, and with my encyclopedic knowledge of Ukrainian can supply this fine > translation: > > Your it's known < known > your ¶ & # 230; & # 179 zosta; but supplied! > However, it will be read 12 august. > I use with refreshment for this time. > > OTOH, it may be a technical service manual for a 1993 Subaru wagon, or perhaps > a translation of Romeo and Juliet. I'll keep researching it in my spare > time ... Your translation sounds suspiciously like a Polish vacation message. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ ----------------------------------------------------------------------- From popiel@wolfskeep.com Fri Oct 4 23:15:17 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 04 Oct 2002 15:15:17 -0700 Subject: [Spambayes] Effects of training set size Message-ID: <20021004221518.04E60F59A@cashew.wolfskeep.com> Executive summary: Increasing the training set size helps, but not as much as one might think. Specifically, the ham/spam means spread apart, but the error rates stay fairly constant. More data improves classification of ham, but it seems that a _very_ small sample of spam (200 messages) is enough to represent it. I'm running with everything at defaults, which means I'm using the Robinson classifier, spam_cutoff of 0.560, x = 0.5, s = 0.45, et cetera, et cetera, ad nauseam.
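Those x and s defaults are the parameters of Gary Robinson's prior adjustment, which blends a word's raw counting estimate toward the unknown-word probability x according to how often the word has been seen. A sketch of the formula, including the can't-cross-0.5 property Tim relied on in the "powernews" post:

```python
def robinson_adjust(p, n, s=0.45, x=0.5):
    """Gary Robinson's prior adjustment: blend the raw counting
    estimate p toward the unknown-word prior x, weighted by the
    number of times (n) the word was actually seen."""
    return (s * x + n * p) / (s + n)

# A rarely seen spammy word is pulled toward 0.5 but stays above it:
assert 0.5 < robinson_adjust(0.9, n=1) < 0.9
# ...and a rarely seen hammy word stays below it:
assert 0.1 < robinson_adjust(0.1, n=1) < 0.5
# With x = 0.5 the result is a convex combination of 0.5 and p, so it
# can never land on the other side of 0.5 from the counting estimate:
for n in range(10):
    for p in (0.01, 0.3, 0.7, 0.99):
        assert (robinson_adjust(p, n) - 0.5) * (p - 0.5) >= 0
```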
I have about 3000 spam and nearly 2000 ham, representing everything from my own personal mail feed since 22 Aug 2002 (when I stopped throwing away a significant portion of my ham). I should have a full 2000 ham in another day or two, at which point I'll probably redo my data directories. I did cross-validation (via timcv.py) using --ham-keep and --spam-keep at each of 50, 70, 90, 110, 130, 150, 170, and 190. This means that I used training corpus sizes of 200, 280, 360, 440, 520, 600, 680, and 760 hams and spams, testing against the smaller numbers of messages. I used the following adaptation of runtest.sh:

"""
#! /bin/sh -x
##
## runsizes.sh -- run some tests for Tim
##
## This does everything you need to test yer data. You may want to skip
## the rebal steps if you've recently moved some of your messages
## (because they were in the wrong corpus) or you may suffer my fate and
## get stuck forever re-categorizing email.
##
## Just set up your messages as detailed in README.txt; put them all in
## the reservoir directories, and this script will take care of the
## rest. Paste the output (also in results.txt) to the mailing list for
## good karma.
##
## Neale Pickett
##

if [ "$1" = "-r" ]; then
    REBAL=1
    shift
fi

# Number of messages per rebalanced set
RNUM=190

# Number of sets
SETS=5

# Seed for random number generator
SEED=13666

if [ -n "$REBAL" ]; then
    # Put them all into reservoirs
    python2.2 rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
    python2.2 rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q
    # Rebalance
    python2.2 rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
    python2.2 rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
fi

for keep in 50 70 90 110 130 150 170 190; do
    python2.2 timcv.py -n $SETS --ham-keep $keep --spam-keep $keep -s $SEED > run$keep.txt
done

for k1 in 50 70 90 110 130 150 170; do
    k2=`echo $k1 20 + p | dc`
    python2.2 rates.py run$k1 run$k2 > runrates$k1.txt
    python2.2 cmp.py run${k1}s run${k2}s | tee results$k1.txt
done

for k1 in 50 70 90 110 130 150 170; do
    k2=190
    python2.2 rates.py run$k1 run$k2 > runrates${k1}-190.txt
    python2.2 cmp.py run${k1}s run${k2}s | tee results${k1}-190.txt
done
"""

I then hand-munged the results output to reveal:

keep:              50    70    90   110   130   150   170   190
fp %:           (meaningless, only 1 or 2 fp in any run)
fn %:            3.20  4.57  4.00  4.36  4.15  3.20  3.53  4.53
h mean:         25.28 24.38 22.19 21.35 21.21 20.91 20.37 19.50
h sdev:          7.45  7.56  6.86  6.89  7.05  6.92  6.87  6.81
s mean:         74.21 74.54 73.65 73.92 74.63 74.99 74.81 74.52
s sdev:          8.56  9.10  8.84  9.13  8.98  8.76  8.62  8.99
mean difference: 48.93 50.16 51.46 52.57 53.42 54.08 54.44 55.02

I'm not sure if the fn % are significant, and they're jumping enough for me to suspect they're not. No obvious trend there, anyway. The ham mean drifted down steadily with more data, and the spam mean held fairly constant with a very slight upward drift. Ham sdev seems to get slowly tighter, with spam sdev jiggling in no particularly obvious direction. Finally, the difference in means steadily increased, echoing the downward drift of the ham mean.
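The mean-difference row is just s mean minus h mean, column by column; recomputing it from the table confirms both the row itself and the observation that the growing spread comes almost entirely from the falling ham mean:

```python
# Values copied from the results table above.
h_mean = [25.28, 24.38, 22.19, 21.35, 21.21, 20.91, 20.37, 19.50]
s_mean = [74.21, 74.54, 73.65, 73.92, 74.63, 74.99, 74.81, 74.52]

diff = [round(s - h, 2) for s, h in zip(s_mean, h_mean)]
assert diff == [48.93, 50.16, 51.46, 52.57, 53.42, 54.08, 54.44, 55.02]

# The ham mean moved about 5.8 points across the runs, the spam mean
# only about 0.3, so the ham drift dominates the spread:
assert abs(h_mean[0] - h_mean[-1]) > 10 * abs(s_mean[0] - s_mean[-1])
```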
All of the reports are available at: http://www.wolfskeep.com/~popiel/spambayes/trainsize My next experiment: try this all again with --ham-keep constant and only --spam-keep variable. :-) - Alex From papaDoc@videotron.ca Sat Oct 5 02:26:25 2002 From: papaDoc@videotron.ca (Remi Ricard) Date: Fri, 04 Oct 2002 21:26:25 -0400 Subject: [Spambayes] New tokenization of the Subject line Message-ID: <1033781185.1125.7.camel@localhost.localdomain> Hi, I try something again. Since most of the mail from subscribed groups have in their subject [spambayes] or [freesco] i.e "[" and "]". I decided to keep this as a word so my words from a subject line like: Re: [Spambayes] Moving closer to Gary's ideal will be Re: [Spambayes] Moving closer to Gary's ideal And this is the result.

-> tested 200 hams & 279 spams against 800 hams & 1113 spams
-> tested 200 hams & 275 spams against 800 hams & 1117 spams
-> tested 200 hams & 298 spams against 800 hams & 1094 spams
-> tested 200 hams & 272 spams against 800 hams & 1120 spams
-> tested 200 hams & 268 spams against 800 hams & 1124 spams
-> tested 200 hams & 279 spams against 800 hams & 1113 spams
-> tested 200 hams & 275 spams against 800 hams & 1117 spams
-> tested 200 hams & 298 spams against 800 hams & 1094 spams
-> tested 200 hams & 272 spams against 800 hams & 1120 spams
-> tested 200 hams & 268 spams against 800 hams & 1124 spams

false positive percentages
1.000 0.500 won -50.00%
1.500 1.500 tied
2.000 2.500 lost +25.00%
1.000 1.000 tied
0.000 0.000 tied

won 1 times tied 3 times lost 1 times
total unique fp went from 11 to 11 tied
mean fp % went from 1.1 to 1.1 tied

false negative percentages
0.717 0.717 tied
0.727 0.727 tied
1.007 1.342 lost +33.27%
0.000 0.368 lost +(was 0)
0.746 0.373 won -50.00%

won 1 times tied 2 times lost 2 times
total unique fn went from 9 to 10 lost +11.11%
mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32%

ham mean ham sdev
24.51 25.20 +2.82% 9.45 9.09 -3.81%
26.14 27.20 +4.06% 8.62 8.32 -3.48%
26.04 26.94 +3.46% 10.00 9.68 -3.20%
25.15 25.85 +2.78% 8.05 7.93 -1.49%
25.12 26.11 +3.94% 8.28 8.16 -1.45%

ham mean and sdev for all runs
25.39 26.26 +3.43% 8.93 8.69 -2.69%

spam mean spam sdev
80.41 79.86 -0.68% 8.80 8.81 +0.11%
79.87 79.47 -0.50% 8.20 8.11 -1.10%
79.87 79.31 -0.70% 8.79 8.73 -0.68%
80.42 80.03 -0.48% 8.13 8.22 +1.11%
80.11 79.70 -0.51% 9.32 9.07 -2.68%

spam mean and sdev for all runs
80.13 79.66 -0.59% 8.66 8.60 -0.69%

ham/spam mean difference: 54.74 53.40 -1.34

I'm still having problems reading the results; can someone explain this a little bit? My statistics knowledge is coming from a course I took almost 15 years ago, and it was the only course I managed to fall asleep in..... even if I like math (I did a B.Sc. in physics). papaDoc From popiel@wolfskeep.com Sat Oct 5 02:30:50 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 04 Oct 2002 18:30:50 -0700 Subject: [Spambayes] New tokenization of the Subject line In-Reply-To: Message from Remi Ricard of "Fri, 04 Oct 2002 21:26:25 EDT." <1033781185.1125.7.camel@localhost.localdomain> References: <1033781185.1125.7.camel@localhost.localdomain> Message-ID: <20021005013050.B393CF59A@cashew.wolfskeep.com> In message: <1033781185.1125.7.camel@localhost.localdomain> Remi Ricard writes: > >I try something again. > >Since most of the mail from subscribed groups have in their >subject [spambayes] or [freesco] i.e "[" and "]". > >I decided to keep this as a word Unfortunately, this makes things worse overall. Good idea, but I think that it's not helping because mailing lists get spammed, too... so showing that something is on a mailing list really doesn't help (it just gives the spam that does show up on the list some apparent validity). >total unique fp went from 11 to 11 tied >mean fp % went from 1.1 to 1.1 tied This is neutral.
>total unique fn went from 9 to 10 lost +11.11% >mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32% This is a loss, though too small of one to be significant. (One message in either direction is too small to care about.) >ham mean and sdev for all runs > 25.39 26.26 +3.43% 8.93 8.69 -2.69% This shows the ham scores moving up, and getting tighter together. The first is bad, the second is good. >spam mean and sdev for all runs > 80.13 79.66 -0.59% 8.66 8.60 -0.69% This shows the spam scores moving down, and getting tighter. Again, first is bad, second is good. >ham/spam mean difference: 54.74 53.40 -1.34 This shows ham and spam getting closer together overall, and is bad. The reduction in the standard deviation is (I think) too small to overcome this... but I'm just eyeballing it; can someone with a bit of the theory help here? - Alex From tim.one@comcast.net Sat Oct 5 05:23:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 00:23:04 -0400 Subject: [Spambayes] For the bold In-Reply-To: Message-ID: Two new programs have been checked in, to help with analyzing internals of the central-limit schemes: clgen.py A test driver. Its primary purpose is to generate a binary pickle, recording all relevant details of every prediction made: msg id, whether it's really ham or really spam, the raw ham-mean value, the ham zscore, the raw spam-mean value, the spam zscore, and the number of "extreme words" used in scoring (up to the maximum of max_discriminators). There's enough info here so that the exact results of changing anything about deciding whether a msg is ham or spam, or about deciding how confident we are, can be determined quickly (i.e., without needing to rerun the test). clpik.py A sample analysis program, showing how to load the pickles created by clgen, how to extract info from them, and how to generate histograms from the data. 
Note that the Histogram module was previously made more robust (numerically speaking) and more flexible, in anticipation of this. Apart from that, here are central_limit2 results from a larger training set than I've reported on before: trained on 2000 ham + 2000 spam, then predicted against 8000 of each (see earlier email for the .ini file I used here; max_discriminators is 50 here, and there are only 4 possible scores): -> Ham scores for all runs: 8000 items; mean 0.17; sdev 3.07 -> min 0; median 0; max 100 * = 131 items 0.0 7975 ************************************************************* 25.0 21 * unsure but right 50.0 2 * unsure but wrong 75.0 2 * sure but wrong -> Spam scores for all runs: 8000 items; mean 99.82; sdev 3.07 -> min 0; median 100; max 100 * = 131 items 0.0 1 * sure but wrong 25.0 3 * unsure but wrong 50.0 24 * unsure but right 75.0 7972 ************************************************************* So the results are even more intense with more training data: it's certain about almost everything, has minuscule error rates when it is certain, and has large error rates on the few msgs it's unsure about. The 2 "certain but wrong" false positives were, again, the Nigerian-scam quote: prob('*zham*') = -39.112 prob('*zspam*') = -7.06214 prob('*hmean*') = -3.35442 prob('*smean*') = -0.569831 prob('*n*') = 50 and the lady with the obnoxious employer-generated sig: prob('*zham*') = -21.6362 prob('*zspam*') = -9.10452 prob('*hmean*') = -1.98372 prob('*smean*') = -0.684297 prob('*n*') = 50 The 1 "certain but wrong" false negative's body consists of a uuencoded text file, which we throw away without decoding: prob('*zham*') = -5.97943 prob('*zspam*') = -12.4922 prob('*hmean*') = -1.45919 prob('*smean*') = -1.92436 prob('*n*') = 8 The histograms generated by clpik on this data are encouraging too (that's your cue, Rob ).
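Since each pickled record carries everything used in scoring, an alternative certainty rule can be tried without rerunning the test. Here is a minimal sketch, assuming a hypothetical flat record layout built from the fields Tim lists (msg id, true category, raw means, zscores, extreme-word count); the actual clgen.py pickle format may differ, and the margin value is illustrative:

```python
# Hypothetical record layout; the real clgen.py pickle may differ:
# record = (msg_id, is_spam, hmean, zham, smean, zspam, n)

def decide(record, margin=3.0):
    """Classify by relative badness only: call a msg spam when its
    spam-rule zscore is much less extreme than its ham-rule zscore,
    and vice versa; anything in between is 'unsure'."""
    msg_id, is_spam, hmean, zham, smean, zspam, n = record
    if abs(zspam) + margin < abs(zham):
        return "spam"
    if abs(zham) + margin < abs(zspam):
        return "ham"
    return "unsure"
```

Applied to the Nigerian-scam record above (zham = -39.112, zspam = -7.06214), this rule confidently calls the message spam, reproducing the "sure but wrong" false positive.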
From janzert@haskincentral.com Fri Oct 4 21:13:02 2002 From: janzert@haskincentral.com (Brian Haskin) Date: Fri, 04 Oct 2002 16:13:02 -0400 Subject: [Spambayes] Re: For the bold References: Message-ID: Tim Peters wrote: > Here's another one it was unsure about: > > """ > Return-Path: > Delivered-To: em-ca-bruceg@em.ca > Received: (qmail 15516 invoked from network); 8 Aug 2002 22:20:13 -0000 > Received: from mail.inet.pl (195.116.59.85) > by churchill.factcomp.com with SMTP; 8 Aug 2002 22:20:13 -0000 > Received: (qmail 26458 invoked by uid 33); 8 Aug 2002 22:26:04 -0000 > Date: 8 Aug 2002 22:26:04 -0000 > Message-ID: <20020808222604.26455.qmail@mail.inet.pl> > TO: bruceg@em.ca > From: jax@inet.pl From the toplevel domain we can guess polish and http://www.poltran.com/ supplies the following > Subject: Wiadomo¶æ zosta³a dostarczona It's known < known > zosta supplied > Content-Length: 129 > > Twoja Wiadomo¶æ zosta³a dostarczona ! It's known < known > your supplied zosta! > Zostanie jednak przeczytana 12 sierpnia. However, it will be read 12 august. > Do tego czasu korzystam z wypoczynku. I use with refreshment for this time. > "" > > I have no idea -- do you? I really despise the presumption that non-English > msgs are spam, BTW. Anyone have an idea what zosta is? or know someone that can actually read polish? Brian Haskin Janzert@haskincentral.com From tim.one@comcast.net Sat Oct 5 05:42:22 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 00:42:22 -0400 Subject: [Spambayes] Re: For the bold In-Reply-To: Message-ID: [Brian Haskin] > Anyone have an idea what zosta is? An online dictionary said "to have been". > or know someone that can actually read polish? Yes, but not well enough to bother with this trivia . 
From tim.one@comcast.net Sat Oct 5 08:18:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 03:18:16 -0400 Subject: [Spambayes] For the bold In-Reply-To: Message-ID: There's one more "central limit" scheme on the table now: use_central_limit3. The spamprob() code is identical to use_central_limit2, but the ham and spam populations are computed differently. Under central_limit2, the spam population is computed like so: for each msg in the training spam: for each extreme word w in msg: if we haven't seen w before: add ln(prob(w)) to the spam population The ham population is computed similarly, except using ln(1-prob(w)) instead. Under central_limit3, the spam population is composed of whole-msg scores, not of individual word scores: for each msg in the training spam: compute the mean of ln(prob(w)) over the extreme words w in msg add that mean to the spam population And likewise for the ham population, using ln(1-prob(w)) instead. There's not even a ghost of an illusion that the central limit theorem applies to this variant, but the spamprob() code remains identical, happily ignoring that it's utterly unjustified . Still, brief preliminary tests suggest this *may* actually work better. Here's the bottom line for a run training against 5000 ham + 5000 spam, then predicting against 5000 of each: -> Ham scores for all runs: 5000 items; mean 0.09; sdev 2.31 -> min 0; median 0; max 100 * = 82 items 0 4992 ************************************************************* 25 7 * 50 0 75 1 * this was the Nigerian scam spam -> Spam scores for all runs: 5000 items; mean 99.68; sdev 4.07 -> min 0; median 100; max 100 * = 82 items 0 1 * this was the spam with a uuencoded body we ignore 25 6 * 50 24 * 75 4969 ************************************************************* The advantage-- if it's real --is that it's certain more often. 
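The contrast between the two population definitions above can be sketched directly from the pseudocode. This is a sketch only, assuming each training message is represented as a list of (word, prob) pairs for its extreme words; the checked-in code is organized differently:

```python
import math

def cl2_population(train_msgs, ham_rule):
    # central_limit2: one entry per unique extreme word seen
    # anywhere in the training msgs.
    seen, pop = set(), []
    for msg in train_msgs:          # msg = [(word, prob), ...]
        for word, p in msg:
            if word not in seen:
                seen.add(word)
                pop.append(math.log(1.0 - p) if ham_rule else math.log(p))
    return pop

def cl3_population(train_msgs, ham_rule):
    # central_limit3: one entry per *message* -- the mean of the
    # per-word logs over that msg's extreme words.
    pop = []
    for msg in train_msgs:
        logs = [math.log(1.0 - p) if ham_rule else math.log(p)
                for (word, p) in msg]
        pop.append(sum(logs) / len(logs))
    return pop
```

With ham_rule=False these build the spam populations (ln(prob(w))); with ham_rule=True, the ham populations (ln(1-prob(w))).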
The populations are sharply separated: ham ham mean: 5000 items; mean -0.35; sdev 0.20 -> min -3.55286; median -0.316515; max -0.00523756 spam ham mean: 5000 items; mean -3.87; sdev 0.92 -> min -6.03683; median -3.857; max -1.22996 That is, when we score a ham using the ham ln(1-prob) rule, the mean msg mean is -0.35 with a small sdev of 0.20. But when we score a spam using the ham ln(1-prob) rule, the mean msg mean is -3.87, with a larger sdev. Another pair of results says what happens when we score ham and spam using the spam ln(prob) rule: ham spam mean: 5000 items; mean -3.02; sdev 0.71 -> min -5.72426; median -2.91819; max -0.602309 spam spam mean: 5000 items; mean -0.11; sdev 0.14 -> min -2.23055; median -0.0546932; max -0.00268306 It's essentially impossible for a msg to score well under both measures, but it's easy for a msg to score poorly under both measures. The most appropriate rule again appears to be that it doesn't matter how poorly a msg scores, it only matters how much more poorly it scores under the other measure. From rob@hooft.net Sat Oct 5 09:34:12 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 10:34:12 +0200 Subject: [Spambayes] clt tests Message-ID: <3D9EA404.8090603@hooft.net> I focussed for our night on optimizing the max_discriminators for clt2 using 10x(200+200) messages out of my corpora, running rebal.py between runs to get the best possible idea of variations. It appears the fp and fn counts drop up to max_discriminators ~ 30, and after that they appear constant up to 400. An optimist might see a slight descent between 30 and 100. I will post a plot later. This morning I programmed a Debugger class with the same functionality as Tim's debugging pickle. Will start analysing those results sometime this weekend. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From paul-bayes@svensson.org Sat Oct 5 10:58:34 2002 From: paul-bayes@svensson.org (Paul Svensson) Date: Sat, 5 Oct 2002 05:58:34 -0400 (EDT) Subject: [Spambayes] Re: For the bold In-Reply-To: Message-ID: On Fri, 4 Oct 2002, Brian Haskin wrote: >> Subject: Wiadomo¶æ zosta³a dostarczona >It's known < known > zosta supplied >> Content-Length: 129 >> >> Twoja Wiadomo¶æ zosta³a dostarczona ! >It's known < known > your supplied zosta! >> Zostanie jednak przeczytana 12 sierpnia. >However, it will be read 12 august. >> Do tego czasu korzystam z wypoczynku. >I use with refreshment for this time. >> "" >> >> I have no idea -- do you? I really despise the presumption that non-English >> msgs are spam, BTW. > >Anyone have an idea what zosta is? or know someone that can actually >read polish? Machine translation has ways to go. A translator colleague at http://www.proz.com gave me: zosta'a = has been delivered "Your message has been delivered, however, it will be read on August 12. I enjoy my holiday until then." /Paul From rob@hooft.net Sat Oct 5 14:31:12 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 15:31:12 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9EE9A0.7010505@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Tim Peters wrote: > Nothing is known about how max_discrimators affects this. Play! See the attached plot anarun1.pdf where this is the horizontal axis. I'd think ~30 would be enough, but more doesn't seem to take much more time and doesn't hurt for me. The raw data are below; run1ref is a comparable non-clt result. 
Rob ==> run1ref.txt <== -> fp rate 1.43% fn rate 1.23% ==> run1_10.txt <== -> fp rate 1.9% fn rate 1.16% ==> run1_15.txt <== -> fp rate 1.22% fn rate 0.867% ==> run1_20.txt <== -> fp rate 1.33% fn rate 1.01% ==> run1_25.txt <== -> fp rate 0.922% fn rate 0.867% ==> run1_30.txt <== -> fp rate 1.07% fn rate 0.95% ==> run1_35.txt <== -> fp rate 1.02% fn rate 1.18% ==> run1_40.txt <== -> fp rate 1.23% fn rate 0.983% ==> run1_45.txt <== -> fp rate 0.939% fn rate 0.983% ==> run1_50.txt <== -> fp rate 0.65% fn rate 1.16% ==> run1_55.txt <== -> fp rate 0.672% fn rate 1.23% ==> run1_60.txt <== -> fp rate 0.983% fn rate 1.26% ==> run1_65.txt <== -> fp rate 0.756% fn rate 1.04% ==> run1_70.txt <== -> fp rate 0.739% fn rate 0.828% ==> run1_75.txt <== -> fp rate 0.917% fn rate 0.867% ==> run1_80.txt <== -> fp rate 0.717% fn rate 0.944% ==> run1_85.txt <== -> fp rate 0.828% fn rate 0.883% ==> run1_90.txt <== -> fp rate 0.972% fn rate 1.07% ==> run1_95.txt <== -> fp rate 1.04% fn rate 0.861% ==> run1_100.txt <== -> fp rate 0.822% fn rate 1.21% ==> run1_110.txt <== -> fp rate 0.989% fn rate 1.2% ==> run1_120.txt <== -> fp rate 0.556% fn rate 1.46% ==> run1_130.txt <== -> fp rate 0.85% fn rate 1.26% ==> run1_140.txt <== -> fp rate 0.794% fn rate 1.33% ==> run1_150.txt <== -> fp rate 0.606% fn rate 1.47% ==> run1_160.txt <== -> fp rate 0.733% fn rate 1.08% ==> run1_170.txt <== -> fp rate 0.628% fn rate 1.37% ==> run1_180.txt <== -> fp rate 0.517% fn rate 1.25% ==> run1_200.txt <== -> fp rate 0.589% fn rate 1.18% ==> run1_300.txt <== -> fp rate 0.872% fn rate 1.17% ==> run1_400.txt <== -> fp rate 0.822% fn rate 0.917% -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: anarun1.pdf Type: application/pdf Size: 6545 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/efb61190/anarun1.pdf ---------------------- multipart/mixed attachment-- From rob@hooft.net Sat Oct 5 15:22:22 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 16:22:22 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9EF59E.4040207@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Here are two zham/zspam scatter plots: one for my spam body, and one for my ham body. This was done using clt2. Tim's test basically says "certain" if the distance to the diagonal line is sufficiently large. You can see that that is a reasonable proposal. I'll do more analyses. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: hamscat.png Type: image/png Size: 27946 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/6da9b8a3/hamscat.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: spamscat.png Type: image/png Size: 22771 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/6da9b8a3/spamscat.png ---------------------- multipart/mixed attachment-- From rob@hooft.net Sat Oct 5 16:26:34 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 17:26:34 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9F04AA.8050706@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Another large message. Appended is a pdf containing six histograms made using max_discriminators=55 The first one is zham for all ham messages. As you can see, the distribution is asymmetric. 
Furthermore, a simple average and standard deviation calculation results in a bell curve that does not follow the important tail of the histogram: the chances will be severely underestimated by these parameters. The second one is abs(zham) for all ham messages. The bell curve fits this histogram much better! The third page is zspam for all spam messages. The fourth page is abs(zspam) for all spam messages. Also much better. Fifth and sixth are zspam for all ham and zham for all spam, just to complete the picture. From the second and fourth image, I drew the conclusion that my Z-scores are overestimated by a factor of 6.7/6.6. This means e.g. that the zspam for all ham distribution is not -53 +/- 20, but -8 +/- 3 and the zham for all spam distribution is not -43 +/- 18, but -6.4 +/- 2.6 I will try a discriminator based on this. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: all.pdf Type: application/pdf Size: 56510 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021005/df86955c/all.pdf ---------------------- multipart/mixed attachment-- From noreply@sourceforge.net Sat Oct 5 14:46:02 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 06:46:02 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 13:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. 
Hooft (hooft) Assigned to: Nobody/Anonymous (nobody) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From noreply@sourceforge.net Sat Oct 5 14:52:48 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 06:52:48 -0700 Subject: [Spambayes] [ spambayes-Patches-618932 ] fpfn.py: add interactivity on unix Message-ID: Patches item #618932, was opened at 2002-10-05 13:52 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618932&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. Hooft (hooft) Assigned to: Nobody/Anonymous (nobody) Summary: fpfn.py: add interactivity on unix Initial Comment: * Add "-i" option to show all falses using "less", and ask the user what to do with them. I used this a lot to clean up my spam/ham corpuses. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618932&group_id=61702 From noreply@sourceforge.net Sat Oct 5 14:53:39 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 06:53:39 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 13:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. 
Hooft (hooft) Assigned to: Nobody/Anonymous (nobody) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- >Comment By: Rob W.W. Hooft (hooft) Date: 2002-10-05 13:53 Message: Logged In: YES user_id=47476 Here is the patch ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From tim.one@comcast.net Sat Oct 5 19:26:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 14:26:23 -0400 Subject: [Spambayes] Microsoft Outlook 'support' In-Reply-To: Message-ID: [Sean True] > I've written a couple of scripts which use Mark H's win32com package > to do the following: [for Outlook 2000] > ... Those not on the spambayes-checkins mailing list probably missed that I checked Sean's files in to the project yesterday. Have at it! They're all in a top-level Outlook2000 directory; see its README.txt. From bkc@murkworks.com Sat Oct 5 19:32:36 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 14:32:36 -0400 Subject: [Spambayes] CL2 results Message-ID: <3D9EF7D4.23399.2790EC5D@localhost> (2nd time posting, first time was rejected for name of attachment quoted in this message ending with a .bat extension) looks like virus buster for this list isn't decoding the message, just scanning for content-type even if it's body text ? --- Wow, sure found a lot of ham in my spam.. Also, turns out I had a lot of zero length message files that came up as false negatives.. I've rm `find -empty` and rebal.. I'm doing 50-50 training testing. -> Training on Data/Ham/Set{6,7,8,9,10} & Data/Spam/Set{6,7,8,9,10} ... 
6500 hams & 6500 spams hammean -0.258919766598 hamvar 0.235232283813 spammean -0.238803626095 spamvar 0.189273495163 -> population hammean -0.258919766598 hamvar 0.235232283813 -> population spammean -0.238803626095 spamvar 0.189273495163 -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams -> false positive %: 1.12307692308 -> false negative %: 0.369230769231 -> 73 new false positives A lot of the false positives are messages from e-trade, paypal, novell, ingram-micro, HP reseller, my mother ... Here's a false negative I think should be caught somehow.. though I don't know how.. (actually, I think this is klez.. I've saved those as spam too. I have 3 other msgs with the same subject, 138k, when I open them Sophos says it's klez) Data/Spam/Set5/11706 prob = 0.0 prob('*zham*') = -1.74447 prob('*zspam*') = -23.1814 prob('*hmean*') = -0.396172 prob('*smean*') = -1.87484 prob('*n*') = 38 prob('header:Received:1') = 0.00372208 prob('base64') = 0.0121951 prob('from:email addr:murkworks.com>') = 0.0261568 prob('skip:g 70') = 0.0266272 prob('from:email name:From webmaster@technofile.com Fri Jun 28 06:09:54 2002 Received: from Izhpfl ([68.98.236.100]) by mail.netmatrix.com with SMTP (IOA-IPAD 3.18a/96) id 1350300; Fri, 28 Jun 2002 06:09:54 -0600 From: bkc To: david900@channeli.net Subject: Japanese girl VS playboy MIME-Version: 1.0 Content-Type: multipart/alternative; boundary=S61861Nu4e17 Date: Fri, 28 Jun 2002 06:09:54 -0600 Message-Id: <200206281009.1350300@mail.netmatrix.com> --S61861Nu4e17 Content-Type: text/html; Content-Transfer-Encoding: quoted-printable --S61861Nu4e17 Content-Type: audio/x-midi; name=x.txt Content-Transfer-Encoding: base64 Content-ID: TVqQAAMAAAAEAAAA//8AALgAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAA2AAAAA4fug4AtAnNIbgBTM0hVGhpcyBwcm9ncmFtIGNhbm5vdCBiZSBydW4gaW4g RE9TIG1vZGUuDQ0KJAAAAAAAAAAYmX3gXPgTs1z4E7Nc+BOzJ+Qfs1j4E7Pf5B2zT/gTs7Tn 
GbNm+BOzPucAs1X4E7Nc+BKzJfgTs7TnGLNO+BOz5P4Vs134E7NSaWNoXPgTswAAAAAAAAAA UEUAAEwBBAC4jrc8AAAAAAAAAADgAA8BCwEGAADAAAAAkAgAAAAAAFiEAAAAEAAAANAAAAAA QAAAEAAAABAAAAQAAAAAAAAABAAAAAAAAAAAYAkAABAAAAAAAAACAAAAAAAQAAAQAAAAABAA ABAAAAAAAAAQAAAAAAAAAAAAAAAg1gAAZAAAAABQCQAQAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ANAAAOwBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAudGV4dAAAAEq6AAAAEAAAAMAAAAAQ AAAAAAAAAAAAAAAAAAAgAABgLnJkYXRhAAAiEAAAANAAAAAgAAAA0AAAAAAAAAAAAAAAAAAA QAAAQC5kYXRhAAAAbF4IAADwAAAAUAAAAPAAAAAAAAAAAAAAAAAAAEAAAMAucnNyYwAAABAA AAAAUAkAEAAAAABAAQAAAAAAAAAAAAAAAABAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 
****************************************************************************** Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sat Oct 5 20:12:00 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 15:12:00 -0400 Subject: [Spambayes] CL3 results vs. CL2 Message-ID: <3D9F0110.22796.27B4FEB2@localhost> Didn't change anything from the CL2 test. I haven't had a chance to examine the new false negatives. CL3 Results: /tmp/clgen-cl3-5x5 -> /tmp/clgen-cl3-5x5s.txt -> Training on Data/Ham/Set{6,7,8,9,10} & Data/Spam/Set{6,7,8,9,10} ... 6500 hams & 6500 spams -> population hammean -0.353152965041 hamvar 0.0190809424597 -> population spammean -0.230738145294 spamvar 0.0141618505748 -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams -> false positive %: 0.8 -> false negative %: 0.569230769231 0.800 0.569 -> 52 new false positives -> 37 new false negatives -> Ham scores for all in this training set: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 -> Spam scores for all in this training set: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 -> best cutoff for all in this training set: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% -> Ham scores for all runs: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 -> Spam scores for all runs: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 -> best cutoff for all runs: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% total unique false pos 52 total unique false neg 37 average fp % 0.8 average fn % 0.569230769231 CL2 Results: /tmp/clgen-cl2-5x5 -> /tmp/clgen-cl2-5x5s.txt -> Training on Data/Ham/Set{6,7,8,9,10} & Data/Spam/Set{6,7,8,9,10} ... 
6500 hams & 6500 spams -> population hammean -0.258919766598 hamvar 0.235232283813 -> population spammean -0.238803626095 spamvar 0.189273495163 -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams -> false positive %: 1.12307692308 -> false negative %: 0.369230769231 1.123 0.369 -> 73 new false positives -> 24 new false negatives -> Ham scores for all in this training set: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 -> Spam scores for all in this training set: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 -> best cutoff for all in this training set: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% -> Ham scores for all runs: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 -> Spam scores for all runs: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 -> best cutoff for all runs: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% total unique false pos 73 total unique false neg 24 average fp % 1.12307692308 average fn % 0.369230769231 Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 5 20:19:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 15:19:32 -0400 Subject: [Spambayes] CL2 results In-Reply-To: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: [Brad Clements] > (2nd time posting, first time was rejected for name of attachment > quoted in this message ending with a .bat extension) looks like virus > buster for this list isn't decoding the message, just scanning for > content-type even if it's body text ? Greg Ward explained how python.org checks for viruses here: http://mail.python.org/pipermail-21/spambayes/2002-September/000327.html Something tagged by that never even makes it to the list moderator. 
This is one area where people seem to have a very high tolerance for false positives! > Wow, sure found a lot of ham in my spam.. > > Also, turns out I had a lot of zero length message files that > came up as false negatives.. I've rm `find -empty` and rebal.. How *should* empty msgs be treated (that's a question for everyone)? When there's nothing to go on, it's hard to decide . > I'm doing 50-50 training testing. > > -> Training on Data/Ham/Set{6,7,8,9,10} & > Data/Spam/Set{6,7,8,9,10} ... 6500 hams & > 6500 spams > hammean -0.258919766598 hamvar 0.235232283813 > spammean -0.238803626095 spamvar 0.189273495163 > -> population hammean -0.258919766598 hamvar 0.235232283813 > -> population spammean -0.238803626095 spamvar 0.189273495163 > -> Predicting Data/Ham/Set{1,2,3,4,5} & Data/Spam/Set{1,2,3,4,5} ... > -> tested 6500 hams & 6500 spams against 6500 hams & 6500 spams > -> false positive %: 1.12307692308 > -> false negative %: 0.369230769231 > -> 73 new false positives Please, please, please, show us the tiny 4-line histograms from the end of the full output file! More than half the point of the clt schemes is whether they *know* when they're uncertain. A "false positive" with a score of 1.0 is bad news, but a false positive with a score of 0.51 is a huge success for the clt schemes relative to the non-clt scheme. Similarly for false negatives, the difference between scores of 0.0 and 0.49 is most of the show here. The histograms reveal all this, and nothing else does. > A lot of the false positives are messages from > > e-trade, paypal, novell, ingram-micro, HP reseller, my mother ... Which is why, for purposes of evaluating the clt schemes, it's vital to know how many of those were "I'm certain" false positives, and how many "here's a guess, but I'm really confused about this one" false positives. No sane deployment would block a "I'm really confused about this one" msg, but *might* shuffle such a thing off to a distinct "please help me, I'm lost" folder. 
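The deployment Tim describes needs a three-way decision rather than a single spam/ham cutoff: confident calls get filed normally, and confused ones go to a review folder. A minimal sketch, with illustrative cutoff values (not project defaults):

```python
def bucket(score, ham_cutoff=0.10, spam_cutoff=0.90):
    # Confident decisions get filed normally; anything in between
    # lands in a "please help me, I'm lost" review folder.
    if score <= ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Under this rule a false positive scoring 0.51 is merely "unsure" (a success for the clt schemes), while one scoring 1.0 is a genuine blocked-ham disaster.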
> Here's a false negative I think should be caught somehow.. though > I don't know how.. > > (actually, I think this is klez.. I've saved those as spam too. I > have 3 other msgs with the same subject, 138k, when I open them > Sophos says it's klez) > > Data/Spam/Set5/11706 > prob = 0.0 > prob('*zham*') = -1.74447 > prob('*zspam*') = -23.1814 > prob('*hmean*') = -0.396172 > prob('*smean*') = -1.87484 > prob('*n*') = 38 > prob('header:Received:1') = 0.00372208 > prob('base64') = 0.0121951 > prob('from:email addr:murkworks.com>') = 0.0261568 > prob('skip:g 70') = 0.0266272 > prob('from:email name: prob('content-id:') = 0.0412844 > prob('skip:/ 70') = 0.0610425 > prob('skip:f 70') = 0.0847751 > prob('skip:t 70') = 0.0910781 > prob('skip:n 40') = 0.0983936 > prob('skip:r 70') = 0.117225 > prob('skip:- 10') = 0.131484 > prob('skip:k 70') = 0.14497 > prob('skip:q 70') = 0.14497 > prob('skip:h 70') = 0.14497 > prob('skip:z 70') = 0.14497 > prob('skip:j 70') = 0.14497 > prob('skip:u 70') = 0.14497 > prob('skip:v 70') = 0.14497 > prob('skip:x 70') = 0.14497 > prob('skip:d 70') = 0.194323 > prob('skip:n 70') = 0.228997 > prob('skip:b 70') = 0.239777 > prob('skip:s 70') = 0.268638 > prob('skip:c 70') = 0.29021 > prob('skip:e 70') = 0.299427 > prob('skip:9 70') = 0.308612 > prob('skip:i 70') = 0.308612 > prob('skip:6 70') = 0.308612 > prob('skip:m 70') = 0.314126 > prob('skip:w 70') = 0.356734 > prob('skip:p 70') = 0.378419 > prob('skip:a 70') = 0.391599 > prob('skip:c 20') = 0.392071 > prob('x-mailer:none') = 0.642702 > prob('subject:Japanese') = 0.844828 > prob('content-type:multipart/alternative') = 0.884053 > prob('content-type:text/html') = 0.941554 That collection of clues *suggests* the email package couldn't parse this msg, so that we fell back to the raw text. You could open this file "by hand" and try to get the email package to parse it, and that would answer the question. 
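Checking that hypothesis by hand is straightforward with Python's email package: parse the raw file and see whether the non-text parts come back as separate MIME sections. A sketch of the check (the project's tokenizer does its own MIME handling internally):

```python
import email

def text_parts(raw):
    """Parse a raw message (bytes) and keep only the text/* leaf
    parts, mirroring the policy of ignoring non-text MIME sections.
    If parsing degraded to one flat text blob, the base64 payload
    would show up inside the returned text."""
    msg = email.message_from_bytes(raw)
    return [part.get_payload(decode=True)
            for part in msg.walk()
            if part.get_content_maintype() == "text"]
```

For a well-formed multipart message like the one quoted, the audio/x-midi section is parsed as its own part and simply dropped here.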
If it was a well-formed message, we *should* have skipped the base64 part entirely, since we ignore all MIME sections that don't have a text/* type. What you showed us here is likely truncated because you have a default show_charlimit setting of 3000 (and there are indeed about 3K bytes in the rest of what you passed on). From jbublitz@nwinternet.com Sat Oct 5 20:32:59 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sat, 05 Oct 2002 12:32:59 -0700 (PDT) Subject: [Spambayes] Sequential Test Results Message-ID: I have a very unusual corpus of ham and spam compared to "normal", so these results may not be widely applicable. In evaluating Graham and Spambayes I've used both random testing (not as extensive as Spambayes) and sequential testing (train on first N, test next M). Since I use qmail, all of my mail is individual files and the filenames are the delivery timestamp, so it's easy to get an accurate sequence. In my experience, testing sequentially (as above) has always been "worst case" performance. A few days ago I switched to simulating actual performance, since I have to implement something sooner or later. My test procedure is: 1. Train on first T msgs 2. Test next t msgs 3. Train (incrementally) on t msgs 4.
Loop on 2 & 3 for N msgs

(all numbers are 50/50 spam/ham, which matches my average of receiving about 200 msgs/day)

For T = 8000, t = 200, N = 14400, the results I got for Graham were (cutoff is independent of anything else, so select the most desirable result):

(zero above)
cutoff: 0.15 -- fn = 0 (0.00%) fp = 7 (0.33%)
cutoff: 0.16 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.17 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.18 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.19 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.20 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.21 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.22 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.23 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.24 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.25 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.26 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.27 -- fn = 0 (0.00%) fp = 6 (0.29%)
cutoff: 0.28 -- fn = 0 (0.00%) fp = 4 (0.19%)
cutoff: 0.29 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.30 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.31 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.32 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.33 -- fn = 0 (0.00%) fp = 1 (0.05%)
cutoff: 0.34 -- fn = 0 (0.00%) fp = 1 (0.05%)
------------------------------------------------------
cutoff: 0.35 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.36 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.37 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.38 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.39 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.40 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.41 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.42 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.43 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.44 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.45 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.46 -- fn = 0 (0.00%) fp = 0 (0.00%)
cutoff: 0.47 -- fn = 0 (0.00%) fp = 0 (0.00%)
------------------------------------------------------
cutoff: 0.48 -- fn = 1 (0.05%) fp = 0 (0.00%)
cutoff: 0.49 -- fn = 1 (0.05%) fp = 0 (0.00%)
cutoff: 0.50 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.51 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.52 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.53 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.54 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.55 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.56 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.57 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.58 -- fn = 2 (0.10%) fp = 0 (0.00%)
cutoff: 0.59 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.60 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.61 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.62 -- fn = 3 (0.14%) fp = 0 (0.00%)
cutoff: 0.63 -- fn = 6 (0.29%) fp = 0 (0.00%)
cutoff: 0.64 -- fn = 6 (0.29%) fp = 0 (0.00%)
cutoff: 0.65 -- fn = 7 (0.33%) fp = 0 (0.00%)
cutoff: 0.66 -- fn = 7 (0.33%) fp = 0 (0.00%)
cutoff: 0.67 -- fn = 8 (0.38%) fp = 0 (0.00%)
cutoff: 0.68 -- fn = 9 (0.43%) fp = 0 (0.00%)
cutoff: 0.69 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.70 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.71 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.72 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.73 -- fn = 10 (0.48%) fp = 0 (0.00%)
cutoff: 0.74 -- fn = 15 (0.71%) fp = 0 (0.00%)
(zero below)

Graham     Spam    Ham
Mean       0.98    0.01
Std Dev    0.04    0.02
3 sigma    0.86    0.07

For Spambayes ("out of the box" - CVS from 10/2):

cutoff: 0.41 -- fn = 0 (0.00%) fp = 164 (2.13%)
cutoff: 0.42 -- fn = 1 (0.01%) fp = 140 (1.82%)
cutoff: 0.43 -- fn = 1 (0.01%) fp = 121 (1.57%)
cutoff: 0.44 -- fn = 1 (0.01%) fp = 103 (1.34%)
cutoff: 0.45 -- fn = 1 (0.01%) fp = 90 (1.17%)
cutoff: 0.46 -- fn = 1 (0.01%) fp = 68 (0.88%)
cutoff: 0.47 -- fn = 2 (0.03%) fp = 55 (0.71%)
cutoff: 0.48 -- fn = 2 (0.03%) fp = 47 (0.61%)
cutoff: 0.49 -- fn = 2 (0.03%) fp = 36 (0.47%)
cutoff: 0.50 -- fn = 3 (0.04%) fp = 30 (0.39%)
cutoff: 0.51 -- fn = 5 (0.06%) fp = 23 (0.30%)
cutoff: 0.52 -- fn = 8 (0.10%) fp = 15 (0.19%)
cutoff: 0.53 -- fn = 11 (0.14%) fp = 13 (0.17%)
cutoff: 0.54 -- fn = 15 (0.19%) fp = 7 (0.09%)
cutoff: 0.55 -- fn = 18 (0.23%) fp = 7 (0.09%)
cutoff: 0.56 -- fn = 28 (0.36%) fp = 5 (0.06%)
cutoff: 0.57 -- fn = 36 (0.47%) fp = 3 (0.04%)
cutoff: 0.58 -- fn = 46 (0.60%) fp = 2 (0.03%)
cutoff: 0.59 -- fn = 55 (0.71%) fp = 2 (0.03%)
cutoff: 0.60 -- fn = 63 (0.82%) fp = 2 (0.03%)
cutoff: 0.61 -- fn = 73 (0.95%) fp = 1 (0.01%)
cutoff: 0.62 -- fn = 90 (1.17%) fp = 0 (0.00%)

Spambayes  Spam    Ham
Mean       0.85    0.16
Std Dev    0.10    0.10
3 sigma    0.54    0.46

For Graham, the modifications made are:

1. Word freq threshold = 1 instead of 5
2. Case sensitive tokenizing
3. Use Gary Robinson's score calculation
4. Use token count instead of msg count in computing probability.

Counting msgs instead of tokens in computing probability is a fairly subtle bias (noted by Graham in "A Plan for Spam") and is still included in Spambayes. If I count msgs instead of tokens I can get about the same results and the mean and std dev are unaffected, but the tails of the distributions for ham/spam scores move closer together (no large dead band as above). Here's why (sort of):

The probability calculation is (s is spam count for a token, h is ham count, H/S are either the number of msgs seen or number of tokens seen):

    prob = (s/S)/((h/H) + (s/S))

which can be refactored to:

    prob = 1/(1 + (S/H)*(h/s))

or with Graham's bias:

    prob = 1/(1 + (S/H)*(2*h/s))

For my mail/testing,

    msgs   -- S/H = 1
    tokens -- S/H ~= 0.5 (ranges from 0.40 to 0.52 over time)

so (for my unusual data anyway) counting msgs doubles the bias on the ham probability, but surprisingly affects the shape of my score distributions adversely. If I count msgs and remove Graham's 2.0 bias I get only slightly worse results than if I count tokens and include Graham's bias, since they're almost the same calculation and the sensitivity to S/H is small but noticeable. Playing around with Spambayes, I get slightly better results if I a) count tokens; b) count every token; c) drop robinson_probability_s to .05, but I still have overlap on the score distribution tails. (Adding Graham's bias back in helps too.)
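The refactoring above is easy to sanity-check numerically. A throwaway sketch (the counts are made up, and `bias` stands for Graham's 2.0 ham multiplier):

```python
def direct(s, h, S, H, bias=2.0):
    # prob = (s/S) / (bias*h/H + s/S), Graham's form with the ham bias
    return (s / S) / (bias * h / H + s / S)

def refactored(s, h, S, H, bias=2.0):
    # algebraically the same thing: 1 / (1 + (S/H) * bias * h/s)
    return 1.0 / (1.0 + (S / H) * bias * h / s)

# Made-up counts: token in 3 spams and 5 hams, totals S=10000, H=20000.
assert abs(direct(3, 5, 10000, 20000) - refactored(3, 5, 10000, 20000)) < 1e-12

# Jim's observation: msg counts give S/H = 1, token counts give S/H ~= 0.5,
# so counting msgs doubles the weight on the ham side and lowers the prob.
assert refactored(3, 5, 1.0, 1.0) < refactored(3, 5, 1.0, 2.0)
```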
Nothing I did to Spambayes had much effect on mean/std dev, but did reshape the distribution curves. I get a lot more tokens than Spambayes, but the ratios are close. Process sizes are comparable (about 100MB peak for the tests above). Spambayes is about 2X faster.

For my data (which, again, is unusual) I'd conclude:

1. Counting msgs and counting tokens once per msg seems wrong to me. It seems to me to be enumerating containers rather than enumerating contents, or at least mixing the two.

2. Sequential testing/training is important to look at (there may be time related effects - certainly S/H (counting tokens) varies over time). These are better than any other test results I've had for either method.

3. I'd concentrate on shaping the tails of the distribution rather than worrying about mean and std dev. Some adjustments will degrade the mean/std dev but improve the shape of the distribution/sharpness of discrimination. If you look at the 3 sigma limits on either method (covers about 99.7% of the distribution in theory), the fns and fps are out past 3 sigma. In EE terms, you want sharper rolloff, not necessarily higher Q or a change in center frequency. Graham appears to be less sensitive to choice of cutoff than Spambayes for my dataset.

As far as training sample size goes, the results above were based on an initial training sample of 8000 msgs. If I start with 200 msgs, I get an additional 15 fns in the first 200 tested, and one additional fn through the rest of the first 8000 msgs, at which time I'm back to the test/results above (choosing a fairly optimal cutoff value). If I start with 1 ham and 1 spam, I get 89 fps on the first 200 and no more failures after that. It seems to converge. Not very sensitive to initial conditions.

All of this might only work for my email. YMMV. For either method the results are fantastic. I'd be happy with a 90% spam reduction and no fps.
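The procedure above amounts to a rolling train/score loop. A toy sketch (the `train`/`score` interface and `CountClassifier` are illustrative stand-ins, not the spambayes or Graham implementations):

```python
class CountClassifier:
    """Toy stand-in: scores a one-token message by the fraction of
    training messages with that token that were spam."""
    def __init__(self):
        self.spam, self.total = {}, {}
    def train(self, msgs, labels):
        for m, is_spam in zip(msgs, labels):
            self.total[m] = self.total.get(m, 0) + 1
            if is_spam:
                self.spam[m] = self.spam.get(m, 0) + 1
    def score(self, m):
        return self.spam.get(m, 0) / self.total.get(m, 1)

def sequential_test(msgs, labels, clf, T, t):
    """1. Train on first T msgs; 2. score next t; 3. train on those t;
    4. loop steps 2 & 3 over the rest. Returns (score, is_spam) pairs."""
    clf.train(msgs[:T], labels[:T])
    results = []
    for i in range(T, len(msgs), t):
        batch, truth = msgs[i:i + t], labels[i:i + t]
        results += [(clf.score(m), s) for m, s in zip(batch, truth)]
        clf.train(batch, truth)  # incremental training before moving on
    return results

msgs = ['$$$', 'meeting'] * 20
labels = [True, False] * 20
res = sequential_test(msgs, labels, CountClassifier(), T=10, t=5)
assert all(score == 1.0 for score, is_spam in res if is_spam)
assert all(score == 0.0 for score, is_spam in res if not is_spam)
```

Because each batch is scored before it is trained on, every prediction is made only from strictly earlier mail, which is what makes this a simulation of live delivery order.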
Jim From rob@hooft.net Sat Oct 5 21:20:59 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 05 Oct 2002 22:20:59 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9F49AB.6040900@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment I am attaching my new version of clpik.py that implements my RMS Z-score ideas. Some results I get are listed hereunder. I'm very interested to hear what other people get with this! amigo[143]clpik%% python clpik.py climbig12.pk 34 descriptions from knownfalse.dat Reading climbig12.pk ... Nham= 12800 RmsZham= 2.76178782393 Nspam= 5600 RmsZspam= 4.64849650515 ====================================================================== HAM: FALSE POSITIVE: zham=-3.92 zspam=-1.51 Data/Ham/Set4/h06542.txt SURE! ==> Mailing list removal confirmation request FALSE POSITIVE: zham=-4.99 zspam=-2.55 Data/Ham/Set4/h07701.txt SURE! ==> E-mail provider newsletter (in German) FALSE POSITIVE: zham=-3.87 zspam=-1.53 Data/Ham/Set6/h03075.txt SURE! ==> Congratulations from the World Birthday Web FALSE POSITIVE: zham=-4.73 zspam=-1.49 Data/Ham/Set7/m05802.txt SURE! ==> Student from India applying to a course FALSE POSITIVE: zham=-4.87 zspam=-0.99 Data/Ham/Set7/h16981.txt SURE! ==> Congratulations from the World Birthday Web FALSE POSITIVE: zham=-4.76 zspam=-1.73 Data/Ham/Set8/h07523.txt SURE! ==> Postmaster autoreply FALSE POSITIVE: zham=-6.02 zspam=-1.54 Data/Ham/Set8/h13038.txt SURE! ==> Amazon.com customer data change announcement FALSE POSITIVE: zham=-4.55 zspam=-2.45 Data/Ham/Set9/h16973.txt SURE! ==> Headhunter hunting me FALSE POSITIVE: zham=-5.90 zspam=-2.32 Data/Ham/Set9/h07516.txt SURE! ==> Autoreply on website request FALSE POSITIVE: zham=-3.63 zspam=-1.62 Data/Ham/Set9/h03070.txt SURE! ==> Congratulations from the World Birthday Web FALSE POSITIVE: zham=-4.70 zspam=-1.74 Data/Ham/Set9/h17001.txt SURE! 
==> Postmaster autoreply

Sure/ok       12477
Unsure/ok       239
Unsure/not ok    73
Sure/not ok      11
Unsure rate = 2.44%
Sure fp rate = 0.09%; Unsure fp rate = 23.40%
======================================================================
SPAM:
FALSE NEGATIVE: zham=-2.28 zspam=-6.70 Data/Spam/Set4/m06556.txt SURE!
FALSE NEGATIVE: zham=-1.69 zspam=-3.63 Data/Spam/Set5/h16027.txt SURE!
FALSE NEGATIVE: zham=-2.28 zspam=-6.70 Data/Spam/Set6/m01349.txt SURE!
Sure/ok        5437
Unsure/ok       141
Unsure/not ok    19
Sure/not ok       3
Unsure rate = 1.25%
Sure fn rate = 0.06%; Unsure fn rate = 11.88%

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
#! /usr/bin/env python

# Analyze a clim.pik file.

"""Usage: %(program)s [options] [central_limit_pickle_file]

An example analysis program showing how to access info from a
central-limit pickle file created by clgen.py. This program produces
histograms of various things.

Scores for all predictions are saved at the end of binary pickle
clim.pik. This contains two lists of tuples, the first list with a
tuple for every ham predicted, the second list with a tuple for every
spam predicted. Each tuple has these values:

    tag         the msg identifier
    is_spam     True if msg came from a spam Set, False if from a ham Set
    zham        the msg zscore relative to the population ham
    zspam       the msg zscore relative to the population spam
    hmean       the raw mean ham score
    smean       the raw mean spam score
    n           the number of clues used to judge this msg

Note that hmean and smean are the same under use_central_limit; they're
very likely to differ under use_central_limit2.

Where:
    -h
        Show usage and exit.

If no file is named on the cmdline, clim.pik is used.
""" surefactor=1000 # This is basically the inverse of the accepted fp/fn rate punsure=0 # Print unsure decisions (otherwise only sure-but-false) import sys,math,os import cPickle as pickle program = sys.argv[0] def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def chance(x): if x>=0: return 1.0 x=-x/math.sqrt(2) if x<1.4: return 1.0 assert x>=1.4 x=float(x) pre=math.exp(-x**2)/math.sqrt(math.pi)/x post=1-(1/(2*x**2)) return pre*post knownfalse={} def readknownfalse(): global knownfalse knownfalse={} try: f=open('knownfalse.dat') except IOError: return while 1: line=f.readline() if not line: break key,desc=line.split(None,1) knownfalse[key]=desc[:-1] print "%d descriptions from knownfalse.dat"%len(knownfalse) def prknown(tag): bn=os.path.basename(tag) if knownfalse.has_key(bn): print " ==>",knownfalse[bn] def drive(fname): print 'Reading', fname, '...' f = open(fname, 'rb') ham = pickle.load(f) spam = pickle.load(f) f.close() zhamsum2=0 nham=0 for msg in ham: if msg[1]: print "spam in ham",msg else: zhamsum2+=msg[2]**2 nham+=1 rmszham=math.sqrt(zhamsum2/nham) print "Nham=",nham print "RmsZham=",rmszham zspamsum2=0 nspam=0 for msg in spam: if not msg[1]: print "ham in spam",msg else: zspamsum2+=msg[3]**2 nspam+=1 rmszspam=math.sqrt(zspamsum2/nspam) print "Nspam=",nspam print "RmsZspam=",rmszspam #========= Analyze ham print "="*70 print "HAM:" nsureok=0 nunsureok=0 nunsurenok=0 nsurenok=0 for msg in ham: zham=msg[2]/rmszham zspam=msg[3]/rmszspam cham=chance(zham) cspam=chance(zspam) if cham>surefactor*cspam and cham>0.01: nsureok+=1 # very certain elif cham>cspam: nunsureok+=1 #print "Unsure",msg[0] #prknown(msg[0]) else: if cspam>surefactor*cham and cspam>0.01: reason="SURE!" nsurenok+=1 elif cham<0.01 and cspam<0.01: reason="neither?" nunsurenok+=1 elif cham>0.1 and cspam>0.1: reason="both?" 
nunsurenok+=1 else: reason="Unsure" nunsurenok+=1 if reason=="SURE!" or punsure: print "FALSE POSITIVE: zham=%.2f zspam=%.2f %s %s"%(zham,zspam,msg[0],reason) prknown(msg[0]) print "Sure/ok ",nsureok print "Unsure/ok ",nunsureok print "Unsure/not ok",nunsurenok print "Sure/not ok ",nsurenok print "Unsure rate = %.2f%%"%(100.*(nunsureok+nunsurenok)/len(ham)) print "Sure fp rate = %.2f%%; Unsure fp rate = %.2f%%"%(100.*nsurenok/(nsurenok+nsureok),100.*nunsurenok/(nunsurenok+nunsureok)) #========= Analyze spam print "="*70 print "SPAM:" nsureok=0 nunsureok=0 nunsurenok=0 nsurenok=0 for msg in spam: zham=msg[2]/rmszham zspam=msg[3]/rmszspam cham=chance(zham) cspam=chance(zspam) if cspam>surefactor*cham and cspam>0.01: nsureok+=1 # very certain elif cspam>cham: nunsureok+=1 #print "Unsure",msg[0] #prknown(msg[0]) else: if cham>surefactor*cspam and cham>0.01: reason="SURE!" nsurenok+=1 elif cham<0.01 and cspam<0.01: reason="neither?" nunsurenok+=1 elif cham>0.1 and cspam>0.1: reason="both?" nunsurenok+=1 else: reason="Unsure" nunsurenok+=1 if reason=="SURE!" 
or punsure: print "FALSE NEGATIVE: zham=%.2f zspam=%.2f %s %s"%(zham,zspam,msg[0],reason) prknown(msg[0]) print "Sure/ok ",nsureok print "Unsure/ok ",nunsureok print "Unsure/not ok",nunsurenok print "Sure/not ok ",nsurenok print "Unsure rate = %.2f%%"%(100.*(nunsureok+nunsurenok)/len(ham)) print "Sure fn rate = %.2f%%; Unsure fn rate = %.2f%%"%(100.*nsurenok/(nsurenok+nsureok),100.*nunsurenok/(nunsurenok+nunsureok)) def main(): import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'h:', ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) nbuckets = 100 for opt, arg in opts: if opt == '-h': usage(0) fname = 'clim.pik' if args: fname = args.pop(0) if args: usage(1, "No more than one positional argument allowed") readknownfalse() drive(fname) if __name__ == "__main__": main() ---------------------- multipart/mixed attachment-- From bkc@murkworks.com Sat Oct 5 21:33:55 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 16:33:55 -0400 Subject: [Spambayes] CL2 results and CL3 results In-Reply-To: References: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: <3D9F1443.24980.27FFFFA0@localhost> Uh, they're not 4-lines because my .ini settings aren't default.. but, I've made them four lines now, snip snip. 
CL2 RESULTS -> Ham scores for all in this training set: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 * = 104 items 0 6321 ************************************************************* 48 106 ** 50 52 * 98 21 * -> Spam scores for all in this training set: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 * = 105 items 0 10 * 48 14 * 50 75 * 98 6401 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.53; sdev 9.48 -> min 0; median 0; max 100 * = 104 items 0 6321 ************************************************************* 48 106 ** 50 52 * 98 21 * -> Spam scores for all runs: 6500 items; mean 99.17; sdev 6.93 -> min 0; median 100; max 100 * = 105 items 0 10 * 48 14 * 50 75 * 98 6401 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*73 fp + 24 fn = 97 -> fp rate 1.12% fn rate 0.369% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik CL3 RESULTS -> Ham scores for all in this training set: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 * = 105 items 0 6373 ************************************************************* 48 75 * 50 34 * 98 18 * -> Spam scores for all in this training set: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 * = 105 items 0 7 * 48 30 * 50 92 * 98 6371 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.11; sdev 8.23 -> min 0; median 0; max 100 * = 105 items 0 6373 
************************************************************* 48 75 * 50 34 * 98 18 * -> Spam scores for all runs: 6500 items; mean 98.96; sdev 7.46 -> min 0; median 100; max 100 * = 105 items 0 7 * 48 30 * 50 92 * 98 6371 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*52 fp + 37 fn = 89 -> fp rate 0.8% fn rate 0.569% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sat Oct 5 21:35:36 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 05 Oct 2002 16:35:36 -0400 Subject: [Spambayes] CL histograms Message-ID: <3D9F14A7.29070.280186A8@localhost> Actually, my .ini doesn't specify nbuckets, but my histograms are still 40 lines.. ?? [Tokenizer] mine_received_headers: True [Classifier] use_central_limit2 = False use_central_limit3 = True [TestDriver] spam_cutoff: 0.50 show_false_negatives: True show_spam_lo: 0.0 show_spam_hi: 0.45 save_trained_pickles: True save_histogram_pickles: True Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 5 22:42:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 17:42:28 -0400 Subject: [Spambayes] CL histograms In-Reply-To: <3D9F14A7.29070.280186A8@localhost> Message-ID: [Brad Clements] > Actually, my .ini doesn't specify nbuckets, but my histograms are > still 40 lines.. ?? Right, nbuckets defaults to 40. All options and their default values are in Options.py. For central limit runs, I recommend this base .ini file. 
"base" means it's irrelevant to test evaluation how you set the various display options (whether you want to see false negatives and/or f-p, how many characters you want to clamp those to, whether you want to save pickles, etc -- none of that has any effect on error rates, or on the stuff that displays error rates): """ [Classifier] use_central_limit2: True # or use_central_limit: True # or use_central_limit3: True max_discriminators: 50 zscore_ratio_cutoff: 1.9 [TestDriver] spam_cutoff: 0.50 nbuckets: 4 """ From tim.one@comcast.net Sun Oct 6 00:32:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 19:32:11 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9F49AB.6040900@hooft.net> Message-ID: [Rob Hooft] > I am attaching my new version of clpik.py that implements my RMS > Z-score ideas. Cool! I'm going to check this into the project, but under the name rmspik.py. People playing along: you DO NOT need to rerun a test to try this! rmspik.py analyzes the binary pickle (clim.pik) left behind by clgen.py (the central-limit analysis test driver), and very quickly (a matter of seconds) determines exactly what would have happened had we used Rob's RMS certainty rules instead. > Some results I get are listed hereunder. I'm very interested to > hear what other people get with this! Here's a use_central_limit2 run with max_discriminators=50, trained on 5000 ham and 5000 spam, then predicting against 7500 of each: -> Ham scores for all runs: 7500 items; mean 0.14; sdev 2.72 -> min 0; median 0; max 100 * = 123 items 0 7480 ************************************************************* 25 18 * 50 1 * 75 1 * -> Spam scores for all runs: 7500 items; mean 99.86; sdev 2.85 -> min 0; median 100; max 100 * = 123 items 0 2 * 25 1 * 50 16 * 75 7481 ************************************************************* Under rmspik, Reading clim.pik ... 
Nham= 7500 RmsZham= 2.27249107964 Nspam= 7500 RmsZspam= 2.354280998 ====================================================================== HAM: Sure/ok 7325 Unsure/ok 172 Unsure/not ok 3 Sure/not ok 0 Unsure rate = 2.33% Sure fp rate = 0.00%; Unsure fp rate = 1.71% ====================================================================== SPAM: FALSE NEGATIVE: zham=-2.39 zspam=-4.93 Data/Spam/Set7/99999.txt SURE! Sure/ok 7422 Unsure/ok 75 Unsure/not ok 2 Sure/not ok 1 Unsure rate = 1.03% Sure fn rate = 0.01%; Unsure fn rate = 2.60% So RMS was unsure much more often, and especially unsure about ham. In the end RMS had one more false positive (2 versus 3), but all 3 were in its region of uncertainty. They both had 3 false negatives, but RMS had one fewer in its region of certainty. The sole f-n it was certain about is also one clim2 was certain about, and is a spam with a uuencoded body that we don't decode. This is a tradeoff in the tokenizer: it simply doesn't generate enough clues to nail this one (10 "words" total). It's especially embarrassing because the subject line is Subject: HOW TO BECOME A MILLIONAIRE IN WEEKS!! Sheesh . BTW, for python.org use, an uncertainty rate over 2% may not fly -- Greg already gripes about reviewing a trivial number of msgs each day. Now all over again, but with use_central_limit3; max_discriminators still 50, and same sets of msgs trained on and predicted against: -> Ham scores for all runs: 7500 items; mean 0.05; sdev 1.61 -> min 0; median 0; max 51 * = 123 items 0 7492 ************************************************************* 25 7 * 50 1 * 75 0 -> Spam scores for all runs: 7500 items; mean 99.63; sdev 4.43 -> min 0; median 100; max 100 * = 123 items 0 2 * 25 5 * 50 48 * 75 7445 ************************************************************* The uncertainty rate on ham is plain jaw-dropping there. It's less sure about spam, but in the end makes the same "but I was certain" mistakes. 
Let's see how rmspik does on it: Reading clim.pik ... Nham= 7500 RmsZham= 9.77605846416 Nspam= 7500 RmsZspam= 10.1887670936 ====================================================================== HAM: Sure/ok 7316 Unsure/ok 183 Unsure/not ok 1 Sure/not ok 0 Unsure rate = 2.45% Sure fp rate = 0.00%; Unsure fp rate = 0.54% ====================================================================== SPAM: FALSE NEGATIVE: zham=-2.32 zspam=-6.04 Data/Spam/Set7/99999.txt SURE! Sure/ok 7269 Unsure/ok 225 Unsure/not ok 5 Sure/not ok 1 Unsure rate = 3.07% Sure fn rate = 0.01%; Unsure fn rate = 2.17% RMS's uncertainty about spam skyrocketed under this scheme, but it did a little better on ham under this scheme (1 fp total versus 3 before). In return, it has more fn (6 total vs 3 before). From tim.one@comcast.net Sun Oct 6 00:47:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 19:47:32 -0400 Subject: [Spambayes] CL2 results and CL3 results In-Reply-To: <3D9F1443.24980.27FFFFA0@localhost> Message-ID: [Brad Clements] > Uh, they're not 4-lines because my .ini settings aren't default.. > > but, I've made them four lines now, snip snip. Thanks! > CL2 RESULTS > ... > -> Ham scores for all runs: 6500 items; mean 1.53; sdev 9.48 > -> min 0; median 0; max 100 > * = 104 items > 0 6321 ************************************************************* > 48 106 ** > 50 52 * > 98 21 * > > -> Spam scores for all runs: 6500 items; mean 99.17; sdev 6.93 > -> min 0; median 100; max 100 > * = 105 items > 0 10 * > 48 14 * > 50 75 * > 98 6401 ************************************************************* > CL3 RESULTS > ... 
> -> Ham scores for all runs: 6500 items; mean 1.11; sdev 8.23 > -> min 0; median 0; max 100 > * = 105 items > 0 6373 ************************************************************* > 48 75 * > 50 34 * > 98 18 * > > -> Spam scores for all runs: 6500 items; mean 98.96; sdev 7.46 > -> min 0; median 100; max 100 > * = 105 items > 0 7 * > 48 30 * > 50 92 * > 98 6371 ************************************************************* Your test data looks tougher than mine, but three outcomes are the same: 1. CL3 is certain more often than CL2 about ham, and makes fewer mistakes when it is certain. 2. CL3 is certain less often than CL2 about spam, but makes fewer mistakes when it's certain there too. 3. CL2 and CL3 both have high error rates in their regions of uncertainty. I think that's a Very Good Thing, because it means manual review won't be overwhelmingly a waste of time. If the error rate in the uncertainty region is just a percent or two, I believe manual review will become careless, or even skipped. But if it's actually wrong in its guess a third of the time, it will be fun to remind yourself of how much smarter you are than a stupid computer . If you've still got the clim pickles from these runs, please try Rob's rmspik.py on them too (I just checked that into the project). From tim.one@comcast.net Sun Oct 6 01:35:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 20:35:49 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: One more test result here, using Gary's *original* central-limit scheme. That didn't get a fair trial when it was introduced: at the time, the business about "certainty" under these schemes wasn't known, or even suspected, so it looked poorer by comparison due to a seemingly large increase in errors rates. Now we know that *most* of that was just the system very helpfully telling us it's unsure of its decision. 
But at the time, Gary immediately came up with central_limit2, and central_limit has been neglected ever since. Same setup as before, but with use_central_limit: -> Ham scores for all runs: 7500 items; mean 0.26; sdev 3.67 -> min 0; median 0; max 100 * = 123 items 0 7461 ************************************************************* 25 37 * 50 1 * 75 1 * -> Spam scores for all runs: 7500 items; mean 99.75; sdev 3.59 -> min 0; median 100; max 100 * = 123 items 0 1 * 25 3 * 50 33 * 75 7463 ************************************************************* Overall, it's quite comparable to the two other central limit variations, just uncertain slightly (in absolute terms) more often. The uncertainty increase is large in *relative* terms, though, which is why this looked like a big jump in error rates when it was first tried. Crunching the raw data via rmspik: Reading clim.pik ... Nham= 7500 RmsZham= 2.93763751621 Nspam= 7500 RmsZspam= 3.62374621717 ====================================================================== HAM: Sure/ok 7491 Unsure/ok 8 Unsure/not ok 1 Sure/not ok 0 Unsure rate = 0.12% Sure fp rate = 0.00%; Unsure fp rate = 11.11% ====================================================================== SPAM: FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE! FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE! FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE! FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE! FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE! 
Sure/ok 0 Unsure/ok 0 Unsure/not ok 7495 Sure/not ok 5 Unsure rate = 99.93% Sure fn rate = 100.00%; Unsure fn rate = 100.00% So the RMS business is certain very much more often under the original central limit scheme: RMS ham unsure RMS spam unsure -------------- --------------- central_limit 9 0 central_limit2 175 77 central_limit3 184 227 That suggests to me that, whatever the heck RMS is doing, it's a much better fit to the original central_limit scheme, but has a bizarre problem with spam there. I don't know whether I care about it, though, as it would have leaked 5 spam out of 7500, and that's a measly 0.067% total f-n rate. Let's look at the "sure but wrong" FN there: FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE! The "Hello, my Name is BlackIntrepid" spam, discussed at length previously here. Had no spam indicators at all when max_discriminators was 16 under the Graham scheme (highest spamprob among the 16 most extreme was about 0.05(!)). FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE! A short "just folks" spam that has given lots of schemes trouble: """ Return-Path: Delivered-To: em-ca-bruceg@em.ca Received: (qmail 13437 invoked from network); 16 Aug 2002 02:37:15 -0000 Received: from unknown (HELO pakistan) (203.135.9.174) by churchill.factcomp.com with SMTP; 16 Aug 2002 02:37:15 -0000 From: "Scott Mark" To: Subject: Hello ! Mime-Version: 1.0 Content-Type: text/html; charset="iso-8859-1" Date: Fri, 9 Aug 2002 08:35:24 Content-Length: 609
Hi,

Just wanted you to check out this cool online website builder. It lets people create cool websites in minutes and for free. You can create your own Flash animations and Intro as well. Its really simple and easy to use :) and its all a matter of minutes, you'll have an impressive website up and running in no time, i'm impressed ... I bet you'll be impressed as well.

This website gives a nice review and how to get started creating your first website easily : www.click-free.com

Thanks,
Scott Mark.
""" FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE! "Subject: Website Programmers Available Now" Loaded with tech terms related to web design and programming, a frequent topic on c.l.py (my ham). The more-extreme central- limit schemes get huge benefit out of extremely large spamprob words like "offshore". Extreme extreme words don't have such extreme effect under the original cl scheme. FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE! "Subject: www.NameYork.com / Webmaster link directory" I've had lots of trouble with this one before. It's a long HTML msg full of links that would be of actual interest to webmasters. It even includes a link for Python. I've never been entirely sure that it's spam, but it "smells more" like spam than ham to me. FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE! Another "just folks" spam that has given lots of schemes trouble: """ Return-Path: Delivered-To: em-ca-bruceg@em.ca Received: (qmail 27970 invoked from network); 14 Jul 2002 01:43:00 -0000 Received: from agamemnon.bfsmedia.com (204.83.201.2) by churchill.factcomp.com with SMTP; 14 Jul 2002 01:43:00 -0000 Received: (qmail 28917 invoked from network); 14 Jul 2002 01:26:53 -0000 Received: from c-24-131-114-96.mw.client2.attbi.com (HELO core.com) (24.131.114.96) by agamemnon.bfsmedia.com with SMTP; 14 Jul 2002 01:26:53 -0000 From: "jarph3@core.com" To: Subject: I want to share with you what I found Sender: "jarph3@core.com" Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 8bit Date: Sat, 13 Jul 2002 21:16:04 -0400 Content-Length: 676 My brother asked me to design a web page for his band Tainted Emotions. At first, his site was nothing more than a few paragraphs describing his unique psychotic melodies. Although a good start, mere words failed to convey the complete Tainted Emotions experience. For that, I needed graphics. Not just any graphics though. 
Fast, sleek, and professional images that only my brother's band deserves. I found all the free public domain photos I needed at freewebgrafix.com. They had everything an aspiring graphics designer needs to transform a texty site into a graphic sensation. Animated GIFs, backgrounds, banners, and of course--photos. http://www.freewebgrafix.com """ I can live with spam like that -- the combination of original-cl and RMS looks very much worth pursuing. From tim.one@comcast.net Sun Oct 6 01:46:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 20:46:32 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: Oops! I misread this data badly. > Crunching the raw data via rmspik [from the original use_central_limit]: > > Reading clim.pik ... > Nham= 7500 > RmsZham= 2.93763751621 > Nspam= 7500 > RmsZspam= 3.62374621717 > ====================================================================== > HAM: > Sure/ok 7491 > Unsure/ok 8 > Unsure/not ok 1 > Sure/not ok 0 > Unsure rate = 0.12% > Sure fp rate = 0.00%; Unsure fp rate = 11.11% > ====================================================================== > SPAM: > FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE! > FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE! > FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE! > FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE! > FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE! > Sure/ok 0 > Unsure/ok 0 > Unsure/not ok 7495 > Sure/not ok 5 > Unsure rate = 99.93% > Sure fn rate = 100.00%; Unsure fn rate = 100.00% It was actually unsure about almost 100% of the spam! So this table's first row: > RMS ham unsure RMS spam unsure > -------------- --------------- > central_limit 9 0 > central_limit2 175 77 > central_limit3 184 227 should have said > central_limit 9 7495 instead. I assume this is evidence of a bug somewhere.
Note that the hmean and smean for a msg are always identical under the original central limit scheme. From tim.one@comcast.net Sun Oct 6 02:32:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 21:32:10 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: [Tim] > ... > It was actually unsure about almost 100% of the spam! So this table's first > row: > > RMS ham unsure RMS spam unsure > -------------- --------------- > central_limit 9 0 > central_limit2 175 77 > central_limit3 184 227 > > should have said > > central_limit 9 7495 > > instead. I assume this is evidence of a bug somewhere. Note > that the hmean and smean for a msg are always identical under the > original central limit scheme. The stuff below changes the first line to central_limit 49 11 I believe "the bug" is in rmspik.chance(), which appears to assume that a zscore in the positive direction is an indicator of certainty. That seems to be true in the logarithmic central-limit schemes, but isn't true in the original central-limit scheme. Changing the first three lines like so:

    # if x >= 0:
    #     return 1.0
    # x = -x/math.sqrt(2)
    x = abs(x)/math.sqrt(2)

and rerunning rmspik leads to very different results under the original central limit scheme: Reading clim.pik ... Nham= 7500 RmsZham= 2.93763751621 Nspam= 7500 RmsZspam= 3.62374621717 ====================================================================== HAM: FALSE POSITIVE: zham=6.64 zspam=-1.66 Data/Ham/Set10/107687.txt SURE! Sure/ok 7413 Unsure/ok 79 Unsure/not ok 7 Sure/not ok 1 Unsure rate = 1.15% Sure fp rate = 0.01%; Unsure fp rate = 8.14% ====================================================================== SPAM: Sure/ok 7451 Unsure/ok 38 Unsure/not ok 11 Sure/not ok 0 Unsure rate = 0.65% Sure fn rate = 0.00%; Unsure fn rate = 22.45% All the problems with spam went away then, and ham gives it more trouble now.
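[Editor's note: a stand-alone sketch of the symmetric tail probability, using the standard library's math.erfc rather than rmspik's own approximation; this is not the actual rmspik.py code.]

```python
import math

def chance(z):
    # Two-tailed tail probability of a unit normal, P(|Z| >= |z|),
    # which is erfc(|z| / sqrt(2)).  Treating positive and negative
    # z-scores symmetrically is exactly the abs() change above.
    return math.erfc(abs(z) / math.sqrt(2))
```

For z = 1.96 this gives about 0.05, the familiar two-sided 5% level, and chance(z) == chance(-z) by construction.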
It's still certain much more often here than under the extreme central-limit schemes, so I still suspect RMS is a better fit to the original cl scheme (but the probability calculation has to change to something more symmetric). The false positive it was certain about was the lady with a brief relevant question, and a long, obnoxious, employer-generated sig. That's one of my two remaining f-p under the all-default scheme too (it so happens that the Nigerian scam quote was in the training data on these runs, so can't show up as an f-p). From tim.one@comcast.net Sun Oct 6 02:45:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 21:45:34 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: Message-ID: >> RMS ham unsure RMS spam unsure >> -------------- --------------- >> central_limit 9 0 >> central_limit2 175 77 >> central_limit3 184 227 >> >> should have said >> >> central_limit 9 7495 >> > ... > The stuff below changes the first line to > > central_limit 49 11 Heh. I give up. That should have read central_limit 86 49 None of the conclusions change: RMS is happiest with the original central limit scheme, and does very well with it indeed, but rmspik.chance() needs to change to be symmetric when used with the original central limit scheme else the spam results are a disaster. From tim.one@comcast.net Sun Oct 6 04:36:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 05 Oct 2002 23:36:25 -0400 Subject: [Spambayes] Sequemtial Test Results In-Reply-To: Message-ID: [Jim Bublitz] Thanks for sharing this! It's an excellent report. > I have a very unusual corpus of ham and spam compared > to "normal", so these results may not be widely > applicable. Could you say something about *what* makes you abnormal ? > In evaluating Graham and Spambayes I've used both > random testing (not as extensive as Spambayes) and > sequential testing (train on first N, test next M). 
> Since I use qmail, all of my mail is individual files > and the filenames are the delivery timestamp, so it's > easy to get an accurate sequence. In my experience, > testing sequentially (as above) has always been "worst > case" performance. > > A few days ago I switched to simulating actual > performance, since I have to implement something sooner > or later. Excellent -- nobody here has done that yet (that I know of), and I've worried out loud that randomization allows msgs to get benefit from training msgs that appeared *after* them in time; e.g., a ham msg can be helped by the fact that a reply to it appeared in the training ham, but that can never happen in real life. > My test procedure is: > > 1. Train on first T msgs > 2. Test next t msgs > 3. Train (incrementally) on t msgs > 4. Loop on 2 & 3 for N msgs > > (all numbers are 50/50 spam/ham, which matches my average of > about 200 msgs/day) > > For T = 8000, t = 200, N = 14400, the results I got for > Graham were (cutoff is independent of anything else, so > select the most desirable result): I'm not sure what these are results *of* -- like, the last time you ran step #2? An average over all times you ran step #2?
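[Editor's note: Jim's steps 1-4 can be sketched as a small harness; `train` and `score` here are hypothetical stand-ins, not the actual spambayes interfaces.]

```python
def sequential_test(msgs, labels, train, score, T, t, N, cutoff=0.5):
    """Train on the first T msgs, then alternate scoring and
    incremental training in chronological batches of t (steps 1-4).
    Returns the total number of misclassified msgs."""
    errors = 0
    train(msgs[:T], labels[:T])                    # step 1
    for start in range(T, T + N, t):
        batch = msgs[start:start + t]
        truth = labels[start:start + t]
        for msg, is_spam in zip(batch, truth):     # step 2
            if (score(msg) >= cutoff) != is_spam:
                errors += 1
        train(batch, truth)                        # step 3 (step 4 = loop)
    return errors
```

The point of the chronological loop is that no message is ever scored with the help of training data that arrived after it.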
> (zero above) > cutoff: 0.15 -- fn = 0 (0.00%) fp = 7 (0.33%) > cutoff: 0.16 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.17 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.18 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.19 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.20 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.21 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.22 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.23 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.24 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.25 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.26 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.27 -- fn = 0 (0.00%) fp = 6 (0.29%) > cutoff: 0.28 -- fn = 0 (0.00%) fp = 4 (0.19%) > cutoff: 0.29 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.30 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.31 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.32 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.33 -- fn = 0 (0.00%) fp = 1 (0.05%) > cutoff: 0.34 -- fn = 0 (0.00%) fp = 1 (0.05%) > ------------------------------------------------------ > cutoff: 0.35 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.36 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.37 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.38 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.39 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.40 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.41 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.42 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.43 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.44 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.45 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.46 -- fn = 0 (0.00%) fp = 0 (0.00%) > cutoff: 0.47 -- fn = 0 (0.00%) fp = 0 (0.00%) > ------------------------------------------------------ > cutoff: 0.48 -- fn = 1 (0.05%) fp = 0 (0.00%) > cutoff: 0.49 -- fn = 1 (0.05%) fp = 0 (0.00%) > cutoff: 0.50 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.51 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.52 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.53 -- fn = 2 (0.10%) fp = 0 
(0.00%) > cutoff: 0.54 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.55 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.56 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.57 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.58 -- fn = 2 (0.10%) fp = 0 (0.00%) > cutoff: 0.59 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.60 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.61 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.62 -- fn = 3 (0.14%) fp = 0 (0.00%) > cutoff: 0.63 -- fn = 6 (0.29%) fp = 0 (0.00%) > cutoff: 0.64 -- fn = 6 (0.29%) fp = 0 (0.00%) > cutoff: 0.65 -- fn = 7 (0.33%) fp = 0 (0.00%) > cutoff: 0.66 -- fn = 7 (0.33%) fp = 0 (0.00%) > cutoff: 0.67 -- fn = 8 (0.38%) fp = 0 (0.00%) > cutoff: 0.68 -- fn = 9 (0.43%) fp = 0 (0.00%) > cutoff: 0.69 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.70 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.71 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.72 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.73 -- fn = 10 (0.48%) fp = 0 (0.00%) > cutoff: 0.74 -- fn = 15 (0.71%) fp = 0 (0.00%) > (zero below) > > > Graham > Spam Ham > Mean 0.98 0.01 And these are the means of what? For example, there's no false-negative rate as large as 0.98 in the table above, so 0.98 certainly isn't the mean of the table entries. 
> Std Dev 0.04 0.02 > 3 sigma 0.86 0.07 > > > For Spambayes ("out of the box" - CVS from 10/2) > > > cutoff: 0.41 -- fn = 0 (0.00%) fp = 164 (2.13%) > cutoff: 0.42 -- fn = 1 (0.01%) fp = 140 (1.82%) > cutoff: 0.43 -- fn = 1 (0.01%) fp = 121 (1.57%) > cutoff: 0.44 -- fn = 1 (0.01%) fp = 103 (1.34%) > cutoff: 0.45 -- fn = 1 (0.01%) fp = 90 (1.17%) > cutoff: 0.46 -- fn = 1 (0.01%) fp = 68 (0.88%) > cutoff: 0.47 -- fn = 2 (0.03%) fp = 55 (0.71%) > cutoff: 0.48 -- fn = 2 (0.03%) fp = 47 (0.61%) > cutoff: 0.49 -- fn = 2 (0.03%) fp = 36 (0.47%) > cutoff: 0.50 -- fn = 3 (0.04%) fp = 30 (0.39%) > cutoff: 0.51 -- fn = 5 (0.06%) fp = 23 (0.30%) > cutoff: 0.52 -- fn = 8 (0.10%) fp = 15 (0.19%) > cutoff: 0.53 -- fn = 11 (0.14%) fp = 13 (0.17%) > cutoff: 0.54 -- fn = 15 (0.19%) fp = 7 (0.09%) > cutoff: 0.55 -- fn = 18 (0.23%) fp = 7 (0.09%) > cutoff: 0.56 -- fn = 28 (0.36%) fp = 5 (0.06%) > cutoff: 0.57 -- fn = 36 (0.47%) fp = 3 (0.04%) > cutoff: 0.58 -- fn = 46 (0.60%) fp = 2 (0.03%) > cutoff: 0.59 -- fn = 55 (0.71%) fp = 2 (0.03%) > cutoff: 0.60 -- fn = 63 (0.82%) fp = 2 (0.03%) > cutoff: 0.61 -- fn = 73 (0.95%) fp = 1 (0.01%) > cutoff: 0.62 -- fn = 90 (1.17%) fp = 0 (0.00%) > > > Spambayes > Spam Ham > Mean 0.85 0.16 > Std Dev 0.10 0.10 > 3 sigma 0.54 0.46 > For Graham, the modifications made are: > > 1. Word freq threshold = 1 instead of 5 That helped us a lot when we were using Graham. > 2. Case sensitive tokenizing That did not (made no overall difference in error rates; it systematically called conference announcements spam, but was better at distinguishing spam screaming about MONEY from casual mentions of money in ham). > 3. Use Gary Robinson's score calculation With or without artificially clamping spamprobs into [0.01, 0.99] first (as Graham does)? > 4. Use token count instead of msg count in computing probability. We haven't tried that.
> Counting msgs instead of tokens in computing probability is > a fairly subtle bias (noted by Graham in "A Plan for Spam") > and is still included in Spambayes. Not really. We currently depart from Graham too in counting multiple occurrences of a word only once in both training and scoring. Our hamcounts and spamcounts are counts of the # of messages a word appears in now, not counts of the total number of times the word appears in msgs (as they were under Graham). > If I count msgs instead of tokens I can get about the same results > and the mean and std dev are unaffected, but the tails of the > distributions for ham/spam scores move closer together (no large > dead band as above). Here's why (sort of): > > The probability calculation is: > > (s is spam count for a token, h is ham count, H/S are either > the number of msgs seen or number of tokens seen) I'm not sure what "spam count for a token" means. For Graham, it means the total number of times a token appears in spam, regardless of msg boundaries. For us today, it means the number of spams in which the token appears (and "Nigeria" appearing 100 times in a single spam adds only 1 to Nigeria's spam count for us; it adds 100 to Graham's Nigeria spam count). Our error rates got lower when we made training symmetric with scoring in this respect, although that wasn't true before we purged *all* of the deliberate biases in Paul's scheme. > prob = (s/S)/((h/H) + (s/S)) > > which can be refactored to: > > prob = 1/(1 + (S/H)*(h/s)) > > or with Graham's bias: > > prob = 1/(1 + (S/H)*(2*h/s)) Did you keep Graham's ham bias? We have not. > For my mail/testing, > > msgs -- S/H = 1 > tokens -- S/H ~= 0.5 (ranges from 0.40 to 0.52 over time) > > so (for my unusual data anyway) counting msgs doubles the bias on > the ham probability, but surprisingly affects the shape of my > score distributions adversely. 
If I count msgs and remove Graham's > 2.0 bias I get only slightly worse results than if I count tokens > and include Graham's bias, since they're almost the same calculation > and the sensitivity to S/H is small but noticeable. Playing around > with Spambayes, I get slightly better results if I a) count tokens; > b) count every token; c) drop robinson_probability_s to .05, but I > still have overlap on the score distribution tails. (Adding Graham's > bias back in helps too). Note that overlapping tails aren't something our default scheme tries to eliminate. It's considered "a feature" here that Gary's scheme has a middle ground where mistakes are very likely to live. This is something you learn to love after realizing that mistakes cannot be stopped. For example, under Graham's scheme, you're *eventually* going to find ham that scores 1.0 (and spam that scores 0.0). For example, with 15 discriminators, sooner or later you're going to find a ham that just happens to have 8 .99 clues and 7 .01 clues, and then Graham is certain it's spam. There's no cutoff value that can save you from this kind of false positive, short of never calling anything spam. When Gary's scheme makes a mistake, it's almost always within a short distance of the data's best spam_cutoff value. In a system with manual human review, this is very exploitable; in a system without manual review, I suppose you just pass such msgs on, but still have the *possibility* to say clearly that the system is known to make mistakes in this range. > Nothing I did to Spambayes had much effect on mean/std dev, but did > reshape the distribution curves. I get a lot more tokens than > Spambayes, ? What does that mean? If you're using spambayes, it's generating tokens, so it seems hard to get a lot more than that . > but the ratios are close. Process sizes are comparable (about 100MB > peak for the tests above). Spambayes is about 2X faster. > > For my data (which, again, is unusual) I'd conclude: > > 1. 
Counting msgs and counting tokens once per msg seems wrong to > me. It seems to me to be enumerating containers rather than > enumerating contents, or at least mixing the two. Graham also *scores* tokens (at most) once per message. The training we do matches the way our scoring uses the information produced by training. We've seen reason to believe that the density of a word in a msg does contain exploitable information, but don't have a way to exploit it; experiment showed that using this info to distort spamprobs was not a useful way to exploit it; for now, we just ignore it. > 2. Sequential testing/training is important to look at (there > may be time related effects - certainly S/H (counting tokens) > varies over time). No argument, and we really need to test that too. > These are better than any other test results I've had for either > method. > > 3. I'd concentrate on shaping the tails of the distribution > rather than worrying about mean and std dev. The so-called central-limit schemes we're investigating now are almost entirely about separating the tails, and *knowing* when we can't, so that should give you cause for hope. > Some adjustments will degrade the mean/std dev but improve the > shape of the distribution/sharpness of discrimination. If you > look at the 3 sigma limits on either method (covers about 99.7% > of the distribution in theory), I've got no reason to assume the distributions here are normal, and short of that you're reduced to Chebyshev's inequality (that at least 8/9ths of the data lives inside 3 sigmas, regardless of distribution). That said, they "look pretty normal" , apart indeed from the long dribbly tails. OTOH, some ham and some spam simply aren't clearcut, even for human judgment, so I see no hope that this can be wholly eliminated "even in theory". > the fns and fps are out past 3 sigma. In EE terms, you want sharper > rolloff, not necessarily higher Q or a change in center frequency. 
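[Editor's note: Tim's 3-sigma comparison above can be checked numerically; a small sketch of the two coverage figures, standard results and nothing spambayes-specific.]

```python
import math

def normal_within(k):
    # Fraction of a normal distribution lying within k standard
    # deviations of the mean: erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

def chebyshev_within(k):
    # Chebyshev's distribution-free lower bound: at least 1 - 1/k**2
    # of *any* distribution lies within k standard deviations.
    return 1.0 - 1.0 / k**2
```

At k = 3 the normal figure is about 99.73%, while the distribution-free guarantee is only 8/9ths (about 88.9%), which is the gap Tim is pointing at.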
> Graham appears to be less sensitive to choice of cutoff than > Spambayes for my dataset. This was universally observed: the Graham score histograms approximated two solid bars, one at 0.0, the other at 1.0, the more data it was trained on. Unfortunately, its *mistakes* also lived on these bars. > As far as training sample size, the results above were based on > an initial training sample of 8000 msgs. If I start with 200 > msgs, I get an additional 15 fns in the first 200 tested, and > one additional fn through the rest of the first 8000 msgs, > at which time I'm back to the test/results above (choosing > a fairly optimum cutoff value). If I start with 1 ham and > 1 spam, I get 89 fps on the first 200 and no more failures > after that. It seems to converge. Not very sensitive to > initial conditions. That's been my experience too, and so has finding that it takes a very large increase in training data to cut an error rate in half. My corpus now is at the point where it can't really improve, due to the "some ham and spam simply aren't clearcut" reason above. It would take telepathy, and even people on this list argue about whether specific msgs are ham or spam. > All of this might only work for my email. YMMV. For either > method the results are fantastic. I'd be happy with a 90% > spam reduction and no fps. You can get a 90% spam reduction, but you won't be as happy with that when the amount of spam you get increases by a factor of 10 again. But there's no way you can get no fps, short of calling nothing spam. I've noted before that the chance my classifier would produce an FP over the next year is smaller than the chance I'll die in that time, and I personally don't fear a false positive more than death . From tim.one@comcast.net Sun Oct 6 06:00:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 01:00:56 -0400 Subject: [Spambayes] Sequemtial Test Results In-Reply-To: Message-ID: [Jim Bublitz] > ...
> Playing around with Spambayes, I get slightly better results if I > ... > c) drop robinson_probability_s to .05, That's a very low value. I find this way of rewriting Gary's adjustment easier to reason about:

    s*x + n*p         x - p
    --------- = p + -------
      s + n         1 + n/s

This makes it clear that it moves p in the direction of x, but less so the larger n is, or the smaller s is. For you, s=.05, and then that's

         x - p
    p + --------
        1 + 20*n

At n=1, that's p + (x-p)/21. The *interesting* thing there is that, since you said you effectively removed Graham's mincount gimmick, under pure Graham you *were* getting extreme spamprobs of 0.01 and 0.99 for words that had been seen only once in the training data. Setting s to 0.05 gives a very similar effect under Gary's adjustment. If x is 0.5, 0 + .5/21 ~= 0.024 and 1 + -.5/21 ~= 0.976 Those are really extreme probability estimates based on 1 measly occurrence in training data, but perhaps this ties in to the unusual nature of your data. For example, I've seen that low s helps ham message threads when a typo or unusual word gets repeated in replies. From rob@hooft.net Sun Oct 6 06:49:06 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 07:49:06 +0200 Subject: [Spambayes] RE: For the bold References: Message-ID: <3D9FCED2.4050802@hooft.net> Tim Peters wrote: > I believe "the bug" is in rmspik.chance(), which appears to assume that a > zscore in the positive direction is an indicator of certainty. That seems > to be true in the logarithmic central-limit schemes, but isn't true in the > original central-limit scheme. Changing the first three lines like so:
>
>     # if x >= 0:
>     #     return 1.0
>     # x = -x/math.sqrt(2)
>     x = abs(x)/math.sqrt(2)

Indeed, the chance function as I wrote it uses the information I had, which was only based on my clt2 experience, where positive Z-scores mean "absolute certainty", and negative Z-scores are increasingly uncertain.
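[Editor's note: Tim's rewritten form of Gary's s adjustment, earlier in this thread, reduces to a one-liner; a minimal sketch with his example values as defaults.]

```python
def adjusted_prob(p, x=0.5, s=0.05, n=1):
    # Gary Robinson's adjustment in the rewritten form
    # p + (x - p) / (1 + n/s): move the raw word probability p toward
    # the prior x, less so the larger n is or the smaller s is.
    # x=0.5, s=0.05, n=1 reproduce Tim's worked example.
    return p + (x - p) / (1.0 + n / s)
```

With these defaults, p = 0 maps to about 0.024 and p = 1 to about 0.976, matching the numbers in Tim's message.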
But: in practice, even for clt2, positive Z-scores above 2.0 do not appear very frequently if at all, and if/when that happens, the chance that the message belongs to the "other" group is extremely small. I just tried it for my clt2 data: your fix doesn't change anything there. In case you're wondering what chance(x) is using under these if statements:

    if x < 1.4:
        return 1.0
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post

This is an approximation of the integral under the tail of the unit normal Gaussian, but the approximation is only valid for x >> 1, so for the "mass" of the curve, we just return 1. Tim: It does look like your messages are a bit easier to classify than mine.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 6 06:54:05 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 07:54:05 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3D9FCFFD.3060609@hooft.net> Tim Peters wrote: [clt2] > Nham= 7500 > RmsZham= 2.27249107964 > Nspam= 7500 > RmsZspam= 2.354280998 [clt3] > Nham= 7500 > RmsZham= 9.77605846416 > Nspam= 7500 > RmsZspam= 10.1887670936 OOF! Under clt3 your rms values are 4x bigger! I have to look at the details of that: the assumption under which the rmspik.py code works is that the distributions of zham and zspam values are normally distributed if all values are "mirrored" around 0. I'll have to test that assumption for clt1 and clt3! Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 6 07:10:12 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 08:10:12 +0200 Subject: [Spambayes] tokenizing identical words Message-ID: <3D9FD3C4.9060902@hooft.net> I have only been following the tokenizer from a distance, but has it been tried yet to use logarithm tokens for multiple occurrences of a word?
So, a spam mentioning Nigeria a couple of times could result in "nigeria nigeria:2 nigeria:4 nigeria:8" tokens. I can imagine that the:16 is not going to mean a lot, but nigeria:4, as in this very message, may quickly result in a spam score... So: if you want to be removed, take your credit card, get rich quick, pay $100000 and click here: http://123456789/ :-) Rob PS: In my ham corpus there is a message from someone sending a list of all ISO country codes. In my spam corpus there is a spam that lists a lot of countries where this company is selling stuff.... -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Sun Oct 6 07:14:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 02:14:15 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9F04AA.8050706@hooft.net> Message-ID: [Rob Hooft] >... > Appended is a pdf containing six histograms made using > max_discriminators=55 > > The first one is zham for all ham messages. As you can see, the > distribution is asymmetric. Furthermore, a simple average and standard > deviation calculation results in a bell curve that does not follow the > important tail of the histogram: the chances will be severely > underestimated by these parameters. Two things. First, the raw spam score (smean) of a msg is the natural log of the geometric mean of the extreme-word spamprobs. This statistic can never be positive, has no theoretical bound on how low it can go, and is typically a small negative number, around -0.12. It's simply impossible to get a raw score "much larger" (much more positive) than that, but easy to get one much smaller (much more negative), so I think the asymmetry is inevitable. The raw ham score (hmean) is similar, but uses the log of the geometric mean of 1-prob, and is typically farther away from 0.0, nearer -0.33.
That gives more room for larger scores to exist (remember that it can never be positive!), and I expect that's why the first stab at fitting a bell curve to the ham worked better than for the spam, even though both were poor fits. All this may well be why the original use_central_limit scheme (which uses the straight mean of the word spamprobs -- no logs, no geometric means, no two-way prob vs 1-prob scoring) worked better for me under your scheme in my tests: that's got no fundamental reason (as far as I can see) to be *so* lopsided; indeed, the mean and median of hmean are very close under use_central_limit, and likewise for smean. This isn't true under the other central limit schemes. They're still lopsided, though; here from an original central limit run:

    ham ham mean: 6000 items; mean 0.18; sdev 0.09 -> min 0.00620435; median 0.183251; max 0.840666
    spam spam mean: 6000 items; mean 0.93; sdev 0.07 -> min 0.486362; median 0.950825; max 0.996632

The ham mean can't get below 0 under that scheme, and 0 is just two sdevs away from the ham-mean mean ~= the ham-mean median. The spam mean can't get above 1.0 under that scheme, and 1.0 is just one sdev removed from the spam-mean mean ~= (but less so) the spam-mean median. So here again, fitting the ham in a bell curve is easier than fitting the spam. Second, there's no real justification for the way zscores are computed in the classifier code now. You may get better results if you ignore the zscores in the pickle, and work directly with the raw hmean and smean scores instead (which are also in the binary pickle saved by clgen). They're the actual data here, and the zscores are a distorted version that factor in n (the number of extreme words) in a way that doesn't make real sense. Note that n is also in the clgen pickle tuples: all the relevant info is there, except for the individual word probabilities used. > The second one is abs(zham) for all ham messages. The bell curve fits > this histogram much better!
Since use_central_limit2 and use_central_limit3 produce inherently and highly lopsided distributions, I think that makes good sense. From tim.one@comcast.net Sun Oct 6 07:43:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 02:43:30 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9FCFFD.3060609@hooft.net> Message-ID: [Tim] > [clt2] > Nham= 7500 > RmsZham= 2.27249107964 > Nspam= 7500 > RmsZspam= 2.354280998 > > [clt3] > Nham= 7500 > RmsZham= 9.77605846416 > Nspam= 7500 > RmsZspam= 10.1887670936 [Rob Hooft] > OOF! Under clt3 your rms values are 4x bigger! I have to look at the > details of that: clt1 and clt2 build ham and spam populations out of individual word probabilities. If the central limit theorem actually applied (which it does not), the way zscores are computed would make sense (at least when n > 30). clt3 builds ham and spam populations out of whole-msg scores. The way zscores are computed there is the same as under clt2, but it makes no sense whatsoever under clt3. I didn't care, because the results were at least as good regardless; "zscores" in the hundreds are pretty common under clt3. I think you should ignore the classifier's zscores, Rob: *none* of them make good sense, and under clt3 they make no sense. The only virtue they have is that tests say they work really well . > the assumption under which the rmspik.py code works is that the > distributions of zham and zspam values are normally distributed > if all values are "mirrored" around 0. I'll have to test that > assumption for clt1 and clt3! I didn't catch the meaning there, but expect any assumption you would like to make is most likely to be true under clt1 (which is the least extreme of these gimmicks). 
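[Editor's note: the raw smean/hmean scores discussed in this thread -- the natural log of the geometric mean of the extreme-word probabilities, or of 1-prob for ham -- reduce to an average of logs. A minimal sketch with made-up probability lists, not the actual classifier code.]

```python
import math

def smean(spamprobs):
    # ln(geometric mean of spamprobs) == arithmetic mean of ln(p).
    # Since every p <= 1, this can never be positive.
    return sum(math.log(p) for p in spamprobs) / len(spamprobs)

def hmean(spamprobs):
    # The raw ham score uses 1 - p instead.
    return sum(math.log(1.0 - p) for p in spamprobs) / len(spamprobs)
```

A batch of 0.9 spamprobs gives smean about -0.105, in the neighborhood of the typical -0.12 Tim cites; pushing the probabilities toward 1.0 moves the score toward 0 but never past it.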
From noreply@sourceforge.net Sun Oct 6 07:41:02 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 23:41:02 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 06:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Rob W.W. Hooft (hooft) >Assigned to: Neale Pickett (npickett) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- Comment By: Rob W.W. Hooft (hooft) Date: 2002-10-05 06:53 Message: Logged In: YES user_id=47476 Here is the patch ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From noreply@sourceforge.net Sun Oct 6 07:48:07 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sat, 05 Oct 2002 23:48:07 -0700 Subject: [Spambayes] [ spambayes-Patches-618928 ] runtest.sh: add timtest + spam/ham!=1 Message-ID: Patches item #618928, was opened at 2002-10-05 06:46 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 Category: None Group: None >Status: Closed >Resolution: Accepted Priority: 5 Submitted By: Rob W.W. Hooft (hooft) Assigned to: Neale Pickett (npickett) Summary: runtest.sh: add timtest + spam/ham!=1 Initial Comment: * Add timtest to runtest.sh * Add different spam/ham counts to runtest.sh ---------------------------------------------------------------------- >Comment By: Neale Pickett (npickett) Date: 2002-10-05 23:48 Message: Logged In: YES user_id=619391 Looks good, thanks for the patch! 
---------------------------------------------------------------------- Comment By: Rob W.W. Hooft (hooft) Date: 2002-10-05 06:53 Message: Logged In: YES user_id=47476 Here is the patch ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498105&aid=618928&group_id=61702 From neale@woozle.org Sun Oct 6 07:53:00 2002 From: neale@woozle.org (Neale Pickett) Date: 05 Oct 2002 23:53:00 -0700 Subject: [Spambayes] ["Neale Pickett" ] [Spambayes-checkins] spambayes runtest.sh,1.6,1.7 Message-ID: ---------------------- multipart/mixed attachment Take heed: the runtest.sh I just checked in uses "cv1.txt" and "cv2.txt" instead of "run1.txt" and "run2.txt", as there can now be different types of runs with different output. If you want to keep your old run data, rename your "run1.txt" and "run2.txt" files. Neale ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: "Neale Pickett" Subject: [Spambayes-checkins] spambayes runtest.sh,1.6,1.7 Date: Sat, 05 Oct 2002 23:47:38 -0700 Size: 6169 Url: http://mail.python.org/pipermail-21/spambayes/attachments/20021005/77169193/attachment.txt ---------------------- multipart/mixed attachment-- From jbublitz@nwinternet.com Sun Oct 6 10:42:23 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sun, 06 Oct 2002 02:42:23 -0700 (PDT) Subject: [Spambayes] Sequemtial Test Results In-Reply-To: Message-ID: On 06-Oct-02 Tim Peters wrote: > Thanks for sharing this! It's an excellent report. Thanks for taking time to reply - I realize I'm somewhat off-topic here. >> I have a very unusual corpus of ham and spam compared >> to "normal", so these results may not be widely >> applicable. > Could you say something about *what* makes you abnormal ? You mean besides coding Python? 
Spam: over 50% is Asian language; includes virus msgs (raw or ISP scrubbed/tagged); business-related or industry-specific spam; plus all of the other usual kinds of spam. A lot of virus msgs (raw or scrubbed/tagged by an ISP or firewall filtering) - my favorite was "I insult your mother". Ham: Over 3/4 is lists (some long) of part numbers/quantities/some other info (e.g. TMS320P14FNL 1000 99DC $10.00) or related correspondence (quotes, RFQs, inquiries, etc), some with a small amount of Asian text mixed in. Otherwise newsletters, mailing lists, personal stuff (small amount). > Excellent -- nobody here has done that yet (that I know of), and > I've worried out loud that randomization allows msgs to get > benefit from training msgs that appeared *after* them in time; > e.g., a ham msg can be helped by the fact that a reply to it appeared in > the training ham, but that can never happen in real life. It seems the opposite is true - my results were worse before (0.5% to 1.0% failures or worse). You might have already achieved perfection and not know it due to randomization :) It appears that the systems both learn gradually. For example, one of my ISPs started virus filtering at a point after the initial training data, and that produced problems in the past when training on N msgs then testing the next M without any retraining. That didn't occur here. Some other hard-to-filter msgs (again, for both methods) also didn't fail. > I'm not sure what these are results *of* -- like, the last time > you ran step #2? An average over all times you ran step #2? Total results for testing 14400 msgs in batches of 200 (and training after each 200) - failures against (virtual) cutoff setting. > Graham > Spam Ham > Mean 0.98 0.01 > And these are the means of what? For example, there's no > false-negative rate as large as 0.98 in the table above, so 0.98 > certainly isn't the mean of the table entries. Mean/std deviation for scores of all msgs tested.
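Jim's sequential protocol (score each batch of 200 against the model trained so far, then train on that batch before moving on) can be sketched as below. This is a hypothetical illustration, not the Spambayes test driver; the classifier object with `train`/`predict` methods is an assumption.

```python
def sequential_test(messages, labels, classifier, batch=200):
    # Train-as-you-go: score each batch with the model trained on all
    # earlier batches, then train on it -- so no message is ever scored
    # by a model that has seen "future" mail.
    errors = 0
    for i in range(0, len(messages), batch):
        chunk = list(zip(messages[i:i + batch], labels[i:i + batch]))
        if i:  # nothing has been trained before the first batch
            errors += sum(classifier.predict(m) != y for m, y in chunk)
        for m, y in chunk:
            classifier.train(m, y)
    return errors
```

Unlike randomized cross-validation, this never lets a reply that arrived later "help" an earlier message.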
>> Std Dev 0.04 0.02 >> 3 sigma 0.86 0.07 >> 1. Word freq threshold = 1 instead of 5 > That helped us a lot when we were using Graham. >> 2. Case sensitive tokenizing > That did not (made no overall difference in error rates; it > systematically called conference announcements spam, but was > better at distinguishing spam screaming about MONEY from casual > mentions of money in ham). Everything made a *small* difference - I'm really quite surprised everything lined up in the same direction for once. I went through most of the tweaks from scratch one at a time (including some of my own that I thought were really cool but ultimately didn't work very well) and what's left is what worked the best. Finally having clean samples really helped too. >> 3. Use Gary Robinson's score calculation > With or without artificially clamping spamprobs into [0.01, 0.99] > first (as Graham does)? Same as Graham. I went back and tried Graham's scoring again too, and it's only marginally worse than Robinson's (but has the problem of extreme values of fp & fn). My "Robinson scoring" is just the S = (P - Q)/(P + Q) kind. >> 4. Use token count instead of msg count in computing >> probability. > We haven't tried that. It's a programming error (wrong indentation in token processing loop) that led to better results. Wish I could say I thought of it, but it makes more sense to me now. Again, it makes a small difference overall, but has a bigger effect on the shape of the score distribution in my tests. >> Counting msgs instead of tokens in computing probability is >> a fairly subtle bias (noted by Graham in "A Plan for Spam") >> and is still included in Spambayes. > Not really. We currently depart from Graham too in counting > multiple occurrences of a word only once in both training and > scoring. Our hamcounts and spamcounts are counts of the # of > messages a word appears in now, not counts of the total number > of times the word appears in msgs (as they were > under Graham).
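For reference, the "S = (P - Q)/(P + Q) kind" of Robinson scoring mentioned above can be sketched as follows. This uses Robinson's geometric-mean aggregates for P and Q; whether Jim's version does the same is an assumption.

```python
from math import prod  # Python 3.8+

def robinson_score(probs):
    # Gary Robinson's combining scheme: P and Q are geometric-mean
    # aggregates of the per-token spam probabilities, and
    # S = (P - Q)/(P + Q) lands in [-1, 1]; rescale to [0, 1] so that
    # 0.5 means "no evidence either way".
    n = len(probs)
    P = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0
```

Because S degrades gracefully toward 0.5 instead of saturating at 0 or 1, it avoids Graham's problem of mistakes scoring at the extremes.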
Yes - that's what bothers me. >> If I count msgs instead of tokens I can get about the same >> results >> and the mean and std dev are unaffected, but the tails of the >> distributions for ham/spam scores move closer together (no large >> dead band as above). Here's why (sort of): >> >> The probability calculation is: >> >> (s is spam count for a token, h is ham count, H/S are either >> the number of msgs seen or number of tokens seen) > I'm not sure what "spam count for a token" means. For Graham, it > means the total number of times a token appears in spam, > regardless of msg boundaries. For us today, it means the number > of spams in which the token appears (and "Nigeria" appearing 100 > times in a single spam adds only 1 to Nigeria's spam count for > us; it adds 100 to Graham's Nigeria spam count). Our error rates > got lower when we made training symmetric with scoring in this > respect, although that wasn't true before we purged *all* of the > deliberate biases in Paul's scheme. "spam count" means the same as your "spamcount" variable in update_probabilities - you count once per msg, I count every occurrence in a msg. Making "training symmetric with scoring" is what seems intuitively incorrect to me, along with nham/nspam being msg counts instead of token counts. If you arbitrarily see the word "fussball" in a wordstream, is the wordstream German or English ("football" in German, a table game found in bars in US English)? I'd guess German because I'm also guessing the word occurs with greater frequency in German wordstreams than in English wordstreams (absent context) - not because I think more German books contain at least one occurrence of the word compared to English books. On the testing side, if the test wordstream contained "fussball!fussball!fussball!", would you change your guess? I'd suggest your guess would still be based on a single occurrence - the repetition doesn't change the probability of which set the wordstream belongs to.
I can't see that it would be 3X more likely one way or the other - what else could you conclude then but that "fussball" and "fussball!fussball!fussball!" have identical probabilities of being elements of a German wordstream without some other kind of data? >> prob = 1/(1 + (S/H)*(2*h/s)) > Did you keep Graham's ham bias? We have not. Yes - again, a small (positive) difference. > Note that overlapping tails aren't something our default scheme > tries to eliminate. It's considered "a feature" here that > Gary's scheme has a middle ground where mistakes are very likely > to live. This is something you learn to love after > realizing that mistakes cannot be stopped. Yes - and if your scores really indicate the actual probability of spamminess, you can use that info to sort the msgs for manual review. Given the volume of spam, fatigue is a real problem in manual review - I wouldn't risk the possibility of fps except that they're more likely with a manual system (as I found out in sorting 25K msgs semi-manually). I'm actually concerned that if the fp rate is too low, there won't be enough reward in reviewing the results manually - my fps could be very expensive. It appears to me that perfect results are not obtainable because everyone probably has msgs that they can't reliably bucket as spam or ham. > For example, under Graham's scheme, you're *eventually* going to > find ham that scores 1.0 (and spam that scores 0.0). For > example, with 15 discriminators, sooner or later you're going to > find a ham that just happens to have 8 .99 clues and 7 .01 > clues, and then Graham is certain it's spam. Happened a lot in other kinds of testing, but not much when testing sequentially as described above - I have no idea why. > There's no cutoff value that can save you from this kind of false > positive, short of never calling anything spam. When Gary's > scheme makes a mistake, it's almost always within a short > distance of the data's best spam_cutoff value.
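The probability formula quoted above, prob = 1/(1 + (S/H)*(2*h/s)), is algebraically the same as Graham's per-word probability (s/S) / (s/S + 2*h/H) with a ham bias of 2. A hedged sketch; whether S and H count messages or tokens is exactly the point under debate in this thread:

```python
def word_prob(s, h, nspam, nham, ham_bias=2.0):
    # Graham-style spam probability for one token.  s, h: the token's
    # spam/ham counts; nspam, nham: corpus totals (messages or tokens,
    # depending on which side of the thread you are on).  For s > 0 this
    # equals 1 / (1 + (nspam/nham) * (ham_bias*h/s)).
    # Tokens unseen in both corpora (s == h == 0) are skipped in
    # practice; this sketch does not guard against that.
    p = (s / nspam) / (s / nspam + ham_bias * h / nham)
    return max(0.01, min(0.99, p))  # Graham's clamp into [0.01, 0.99]
```

The clamp is the "artificial clamping" Tim asks about; dropping it lets rare tokens swing probabilities to exactly 0.0 or 1.0.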
> In a system with manual human review, this is very exploitable; Agree - I should have read ahead before the response above. > in a system without manual review, I suppose you just pass such > msgs on, but still have the *possibility* to say clearly that > the system is known to make mistakes in this range. >> Nothing I did to Spambayes had much effect on mean/std dev, but >> did reshape the distribution curves. I get a lot more tokens >> than Spambayes, > ? What does that mean? If you're using spambayes, it's > generating tokens, so it seems hard to get a lot more than that > . My tokenizer consists of a findall on re.compile(r"[\w'$_-]+", re.U). I get a lot more tokens than the spambayes tokenizer produces. "I get" meant my Graham version vs. spambayes. >> 3. I'd concentrate on shaping the tails of the distribution >> rather than worrying about mean and std dev. > The so-called central-limit schemes we're investigating now are > almost entirely about separating the tails, and *knowing* when > we can't, so that should give you cause for hope. I gathered that from today's list msgs - didn't notice it before. > OTOH, some ham and some spam simply aren't clearcut, even for > human judgment, so I see no hope that this can be wholly > eliminated "even in theory". Either method does better than I do (or thinks it does at any rate) >> the fns and fps are out past 3 sigma. In EE terms, you want >> sharper rolloff, not necessarily higher Q or a change in center >> frequency. Graham appears to be less sensitive to choice of >> cutoff than Spambayes for my dataset. > This was universally observed: the Graham score histograms > approximated two solid bars, one at 0.0, the other at 1.0, the > more data it was trained on. Unfortunately, its *mistakes* also > lived on these bars. Yes, but the (P - Q)/(P + Q) scoring fixes that nicely for my data. > It would take telepathy, and even people on this list argue about > whether specific msgs are ham or spam.
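Jim's one-line tokenizer quoted above, spelled out for comparison with the Spambayes one. It is just a Unicode-aware findall, so runs like "$$$" and hyphenated words survive as tokens:

```python
import re

# Jim's tokenizer: anything matching this character class is a token,
# so currency runs like "$$$" come through intact, and the re.U flag
# makes \w match non-ASCII word characters (relevant for Asian text).
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

Because it never splits on apostrophes, dollar signs, or hyphens, this yields a larger and differently shaped vocabulary than the Spambayes tokenizer, which is presumably why the token counts differ.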
The computer is always right. > I've noted before that the chance my classifier > would produce an FP over the next year is smaller than the > chance I'll die in that time, and I personally don't fear a > false positive more than death . You haven't met my wife - one persistent fp was from a woman who is both my wife's best friend and was (and may be again) our best customer. I suppose that's what whitelists are for. Jim From nas@python.ca Sun Oct 6 17:03:38 2002 From: nas@python.ca (Neil Schemenauer) Date: Sun, 6 Oct 2002 09:03:38 -0700 Subject: [Spambayes] CL2 results In-Reply-To: References: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: <20021006160338.GA9127@glacier.arctrix.com> Tim Peters wrote: > Greg Ward explained how python.org checks for viruses here: > > http://mail.python.org/pipermail-21/spambayes/2002-September/000327.html Here's what I'm using with qmail: http://arctrix.com/~nas/misc/qmail-filter-exe.py Messages with viruses are usually pretty big, so keeping them out of the corpus saves a lot of space. Neil From tim.one@comcast.net Sun Oct 6 17:20:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 12:20:48 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3D9FCED2.4050802@hooft.net> Message-ID: [Rob Hooft] > ... > Tim: It does look like your messages are a bit easier to classify than > mine.... I don't know. The results I reported were: > Here's a use_central_limit2 run with max_discriminators=50, trained > on 5000 ham and 5000 spam, then predicting against 7500 of each and all runs were on the same set of msgs. The last time you mentioned how "big" your tests are was: > I focussed for our night on optimizing the max_discriminators for > clt2 using 10x(200+200) messages out of my corpses, I'm not sure exactly what 10x(200+200) means, but at the plausible extremes it means your classifiers were trained on 200 on each, or on 1800 of each.
Error rates certainly improve with more training data, albeit slowly. OTOH, later you showed output saying > Reading climbig12.pk ... > Nham= 12800 > RmsZham= 2.76178782393 > Nspam= 5600 so at *some* point you stopped predicting against equal amounts of ham and spam, but there's no way to guess how much was trained on for that result. Interpreting results here gets very difficult because it's often not clear what a tester is reporting on (how much training data, how much prediction data, which test driver produced the results, what the relevant options were). That said, I expect my ham is easier than most, because newsgroup traffic almost never contains personal msgs -- no screaming red HTML birthday wishes from 9-year-old nieces, no confirmations of payment received, no opt-in marketing newsletters, no chain letters forwarded from naive brothers, etc. From bkc@murkworks.com Sun Oct 6 17:34:07 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 12:34:07 -0400 Subject: [Spambayes] CL2 test part II Message-ID: <3DA02D8B.11446.2C4AD3E5@localhost> In my earlier CL2 and CL3 tests, I trained on the 2nd half of my corpus, and tested the first half. Now, I'm training on the first half and testing the 2nd half. First run of CL2 uncovered more misclassifications (which probably affected the training of my first test). I'm temporarily "borrowing" a client's dual Xeon machine, still only using one processor of course, but it seems a lot faster than my PIII-933 In any case, here's CL2 results training first, testing second half. 
-> Ham scores for all runs: 6500 items; mean 0.94; sdev 7.21
-> min 0; median 0; max 100
* = 105 items
0 6384 *************************************************************
25 87 *
50 21 *
75 8 *
-> Spam scores for all runs: 6500 items; mean 99.32; sdev 5.94
-> min 0; median 100; max 100
* = 106 items
0 3 *
25 15 *
50 68 *
75 6414 *************************************************************
-> best cutoff for all runs: 0.5
-> with weighted total 1*29 fp + 18 fn = 47
-> fp rate 0.446% fn rate 0.277%

[Tokenizer]
mine_received_headers: True

[Classifier]
use_central_limit2 = True
use_central_limit3 = False
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
show_false_negatives: True
nbuckets: 4
show_spam_lo: 0.0
show_spam_hi: 0.45
save_trained_pickles: True
save_histogram_pickles: True

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 17:48:44 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 12:48:44 -0400 Subject: [Spambayes] CL3 test part II Message-ID: <3DA030F7.11395.2C5834AA@localhost> As in the CL2 part II test, here are the results of training on the first half, and testing on the 2nd half of my data set:

-> Ham scores for all runs: 6500 items; mean 0.64; sdev 5.98
-> min 0; median 0; max 100
* = 106 items
0 6422 *************************************************************
25 59 *
50 13 *
75 6 *
-> Spam scores for all runs: 6500 items; mean 98.85; sdev 7.84
-> min 0; median 100; max 100
* = 105 items
0 8 *
25 22 *
50 113 **
75 6357 *************************************************************
-> best cutoff for all runs: 0.5
-> with weighted total 1*19 fp + 30 fn = 49
-> fp rate 0.292% fn rate 0.462%

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 18:01:03 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 13:01:03 -0400 Subject: [Spambayes]
rmspik results on CL2 and CL3 Message-ID: <3DA033DA.25462.2C637B3C@localhost> on CL2 test II. Reading results/cl2-b/clim.pik ... Nham= 6500 RmsZham= 4.48175325539 Nspam= 6500 RmsZspam= 3.7809204202 ====================================================================== HAM: FALSE POSITIVE: zham=-5.79 zspam=-2.33 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-5.43 zspam=-2.27 Data/Ham/Set6/10068 SURE! FALSE POSITIVE: zham=-3.97 zspam=-1.72 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-6.17 zspam=-2.35 Data/Ham/Set9/6415 SURE! Sure/ok 6297 Unsure/ok 181 Unsure/not ok 18 Sure/not ok 4 Unsure rate = 3.06% Sure fp rate = 0.06%; Unsure fp rate = 9.05% ====================================================================== SPAM: FALSE NEGATIVE: zham=-1.86 zspam=-3.48 Data/Spam/Set7/6718 SURE! FALSE NEGATIVE: zham=-1.48 zspam=-6.37 Data/Spam/Set10/10979 SURE! Sure/ok 6240 Unsure/ok 232 Unsure/not ok 26 Sure/not ok 2 Unsure rate = 3.97% Sure fn rate = 0.03%; Unsure fn rate = 10.08% All the hams really are hams. Network Computing renewal, etc.. The Set10/10979 spam .. really was a ham (oops) The Set7/6718 is .. I think a spam, you decide >From ???@??? 
Sat Sep 21 16:59:31 2002 Received: from SpoolDir by GIMPELSTIMER (Mercury 1.44); 12 Sep 02 11:10:59 -0400 Received: from anvil.murkworks.com (128.153.43.1) by coal.murkworks.com (Mercury 1.44) with ESMTP; 12 Sep 02 11:10:49 -0400 Received: from hotmail.com (oe19.pav1.hotmail.com [64.4.30.123]) by anvil.murkworks.com (8.9.1/8.9.1) with ESMTP id LAA14649 for ; Thu, 12 Sep 2002 11:00:52 -0400 (EDT) Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Thu, 12 Sep 2002 08:11:15 -0700 X-Originating-IP: [202.88.161.198] From: "preeti" To: Cc: , Date: Mon, 19 Aug 2002 22:01:06 +0530 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_000_0005_01C247CB.EB1BDC40" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 Message-ID: X-OriginalArrivalTime: 12 Sep 2002 15:11:15.0656 (UTC) FILETIME=[A38C0480:01C25A6E] This is a multi-part message in MIME format. ------=_NextPart_000_0005_01C247CB.EB1BDC40 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable I want information about novell netware. =20 I think it's spam because the return path is maxi_sexy@hotmail.com Though, we do have software for NetWare that we market, but I'd expect web inquiries to go to info@, not support@ I suppose technically this is not spam, since a human individually typed this in, I'm sure. CL3 test II rmspik results Reading results/cl3-b/clim.pik ... Nham= 6500 RmsZham= 12.6590343376 Nspam= 6500 RmsZspam= 14.8475623174 ====================================================================== HAM: FALSE POSITIVE: zham=-6.65 zspam=-2.35 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-6.19 zspam=-2.29 Data/Ham/Set6/10068 SURE! FALSE POSITIVE: zham=-4.60 zspam=-1.73 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-7.11 zspam=-2.37 Data/Ham/Set9/6415 SURE! 
Sure/ok 6294 Unsure/ok 182 Unsure/not ok 20 Sure/not ok 4 Unsure rate = 3.11% Sure fp rate = 0.06%; Unsure fp rate = 9.90% ====================================================================== SPAM: FALSE NEGATIVE: zham=-1.37 zspam=-6.36 Data/Spam/Set10/10979 SURE! Sure/ok 6271 Unsure/ok 207 Unsure/not ok 21 Sure/not ok 1 Unsure rate = 3.51% Sure fn rate = 0.02%; Unsure fn rate = 9.21% All hams really are hams, The ONE spam .. isn't a spam. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 18:34:37 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 13:34:37 -0400 Subject: [Spambayes] CL3 part I (reprise) Message-ID: <3DA03BB8.17777.2C82369A@localhost> I've re-run CL3 test from yesterday after further cleaning up my corpus (training on second half, testing on first half) -> Ham scores for all runs: 6500 items; mean 0.98; sdev 7.64 -> min 0; median 0; max 100 * = 105 items 0 6386 ************************************************************* 25 74 * 50 26 * 75 14 * -> Spam scores for all runs: 6500 items; mean 98.90; sdev 7.55 -> min 0; median 100; max 100 * = 105 items 0 5 * 25 33 * 50 101 * 75 6361 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*40 fp + 38 fn = 78 -> fp rate 0.615% fn rate 0.585% Reading results/cl3-a/clim.pik ... Nham= 6500 RmsZham= 14.4660600316 Nspam= 6500 RmsZspam= 15.176558614 ====================================================================== HAM: FALSE POSITIVE: zham=-11.40 zspam=-2.41 Data/Ham/Set1/10180 SURE! FALSE POSITIVE: zham=-6.90 zspam=-2.54 Data/Ham/Set1/10852 SURE! FALSE POSITIVE: zham=-4.81 zspam=-2.54 Data/Ham/Set3/5943 SURE! FALSE POSITIVE: zham=-6.97 zspam=-2.22 Data/Ham/Set3/6480 SURE! FALSE POSITIVE: zham=-4.69 zspam=-1.39 Data/Ham/Set4/69 SURE! FALSE POSITIVE: zham=-4.88 zspam=-2.31 Data/Ham/Set4/5548 SURE! 
FALSE POSITIVE: zham=-12.55 zspam=-1.49 Data/Ham/Set4/10008 SURE! FALSE POSITIVE: zham=-6.22 zspam=-2.06 Data/Ham/Set4/10937 SURE! FALSE POSITIVE: zham=-12.06 zspam=-0.40 Data/Ham/Set5/5105 SURE! FALSE POSITIVE: zham=-5.21 zspam=-2.42 Data/Ham/Set5/6369 SURE! Sure/ok 6272 Unsure/ok 182 Unsure/not ok 36 Sure/not ok 10 Unsure rate = 3.35% Sure fp rate = 0.16%; Unsure fp rate = 16.51% ====================================================================== SPAM: FALSE NEGATIVE: zham=-0.96 zspam=-5.47 Data/Spam/Set2/5185 SURE! FALSE NEGATIVE: zham=-2.12 zspam=-4.62 Data/Spam/Set2/6457 SURE! FALSE NEGATIVE: zham=-1.97 zspam=-18.20 Data/Spam/Set3/3010 SURE! Sure/ok 6248 Unsure/ok 215 Unsure/not ok 34 Sure/not ok 3 Unsure rate = 3.83% Sure fn rate = 0.05%; Unsure fn rate = 13.65% I'm going to stick with the ham and spam classification .. hams are mostly network computing renewals, discover card statement, etc. spams .. stuff I don't want.. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Sun Oct 6 18:40:25 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 19:40:25 +0200 Subject: [Spambayes] RE: For the bold References: Message-ID: <3DA07589.50407@hooft.net> Tim Peters wrote: > [Rob Hooft] > >>... >>Tim: It does look like your messages are a bit easier to classify than >>mine.... > > > I don't know. The results I reported were: > > >>Here's a use_central_limit2 run with max_discriminators=50, trained >>on 5000 ham and 5000 spam, then predicting against 7500 of each > > > and all runs were on the same set of msgs. > > The last time you mentioned how "big" your tests are was: > > >>I focussed for our night on optimizing the max_discriminators for >>clt2 using 10x(200+200) messages out of my corpses, > > > I'm not sure exactly what 10x(200+200) means, but at the plausible extremes > it means your classifiers were trained on 200 on each, or on 1800 of each. 
> So at worst, my classifier was trained on 3x as much data, and at best on > 25x as much data. Error rates certainly improve with more training data, > albeit slowly. I did have 10 sets each of ham and spam, each set containing 200 messages out of a total reservoir of ~17000 ham and 7500 spam. This subset of everything was heavy enough for this optimization: it took about 24 hours of calculating to get that analysis done.... > OTOH, later you showed output saying > > >>Reading climbig12.pk ... >>Nham= 12800 >>RmsZham= 2.76178782393 >>Nspam= 5600 > > > so at *some* point you stopped predicting against equal amounts of ham and > spam, but there's no way to guess how much was trained on for that result. At that point, I had 10 sets, each ham set contained 1600 hams, and each spam set 700 spams. I was using 2 sets each to train, and 8 to analyse. Since that time I have cleaned out the spam body by looking for duplicate "Date:" headers, and removed ~1300 spams that were identical (only sent to different addresses). I think this is a useful thing to do, to prevent the same spam from ending up in both the training set and the test set. The "Message-ID" sort I did in the beginning didn't help all that much, because lots of these spams do not have their message-id added by the spammer. I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis. > That said, I expect my ham is easier than most, because newsgroup traffic > almost never contains personal msgs -- no screaming red HTML birthday wishes > from 9-year-old nieces, no confirmations of payment received, no opt-in > marketing newsletters, no chain letters forwarded from naive brothers, etc. Exactly. I find that my ham is very diverse.
Besides all the things you mentioned, I had (but removed) communications with postmasters over early-day spam that was sent using their machines. And I am using some ham from my previous job. There is not a lot of mailing list traffic, because I am no longer storing all that. Lots of customer E-mails with many different computer backgrounds. I removed all the viruses. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From bkc@murkworks.com Sun Oct 6 18:51:17 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 13:51:17 -0400 Subject: [Spambayes] CL2 part I (reprise) Message-ID: <3DA03FA0.19882.2C917996@localhost> FP rate is down after cleaning up training sets but CL2 seems less sure than CL3 -> Ham scores for all runs: 6500 items; mean 1.36; sdev 8.88 -> min 0; median 0; max 100 * = 104 items 0 6339 ************************************************************* 25 100 * 50 44 * 75 17 * -> Spam scores for all runs: 6500 items; mean 99.27; sdev 6.21 -> min 0; median 100; max 100 * = 106 items 0 4 * 25 14 * 50 74 * 75 6408 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*61 fp + 18 fn = 79 -> fp rate 0.938% fn rate 0.277% Reading results/cl2-a/clim.pik ... Nham= 6500 RmsZham= 4.91231538116 Nspam= 6500 RmsZspam= 3.7454876964 ====================================================================== HAM: FALSE POSITIVE: zham=-9.71 zspam=-2.38 Data/Ham/Set1/10180 SURE! FALSE POSITIVE: zham=-6.08 zspam=-2.50 Data/Ham/Set1/10852 SURE! FALSE POSITIVE: zham=-6.09 zspam=-2.19 Data/Ham/Set3/6480 SURE! FALSE POSITIVE: zham=-4.08 zspam=-1.36 Data/Ham/Set4/69 SURE! FALSE POSITIVE: zham=-4.32 zspam=-2.28 Data/Ham/Set4/5548 SURE! FALSE POSITIVE: zham=-10.65 zspam=-1.44 Data/Ham/Set4/10008 SURE! FALSE POSITIVE: zham=-5.43 zspam=-2.03 Data/Ham/Set4/10937 SURE! FALSE POSITIVE: zham=-10.25 zspam=-0.34 Data/Ham/Set5/5105 SURE! 
FALSE POSITIVE: zham=-4.61 zspam=-2.40 Data/Ham/Set5/6369 SURE! Sure/ok 6280 Unsure/ok 184 Unsure/not ok 27 Sure/not ok 9 Unsure rate = 3.25% Sure fp rate = 0.14%; Unsure fp rate = 12.80% ====================================================================== SPAM: FALSE NEGATIVE: zham=-1.00 zspam=-5.48 Data/Spam/Set2/5185 SURE! FALSE NEGATIVE: zham=-2.01 zspam=-4.62 Data/Spam/Set2/6457 SURE! FALSE NEGATIVE: zham=-2.09 zspam=-18.28 Data/Spam/Set3/3010 SURE! FALSE NEGATIVE: zham=-2.46 zspam=-4.49 Data/Spam/Set4/6367 SURE! FALSE NEGATIVE: zham=-2.46 zspam=-4.49 Data/Spam/Set4/6371 SURE! Sure/ok 6226 Unsure/ok 224 Unsure/not ok 45 Sure/not ok 5 Unsure rate = 4.14% Sure fn rate = 0.08%; Unsure fn rate = 16.73% Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Sun Oct 6 19:25:20 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 20:25:20 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3DA08010.3010803@hooft.net> I made a number of changes to rmspik.py:

- The "chance" function was replaced by something a bit more scientific (this helps!).
- There are new parameters in the source code (I'm hoping someone else can make these configurable through the .ini file).

# surefactor: the ratio of the two p's to decide we're sure a message
# belongs to one of the two populations.  Raising this number increases
# the "unsures" on both sides, decreasing the "sure fp" and "sure fn"
# rates.  A value of 1000 works well for me; at 10000 you get slightly
# less sure fp/fn at a cost of a lot more middle ground; at 10 you have
# much less work on the middle ground but ~50% more "sure false"
# scores.  This variable operates on messages that are "a bit of both
# ham and spam".
surefactor = 100

# pminhamsure: the minimal pham at which we say it's surely ham.
# Lowering this value gives less "unsure ham" and more "sure ham"; it
# might however result in more "sure fn".  0.01 works well, but to
# accept a bit more fn, I set it to 0.005.  This variable operates on
# messages that are "neither ham nor spam; but a bit more ham than
# spam".
pminhamsure = 0.005

# pminspamsure: the minimal pspam at which we say it's surely spam.
# Lowering this value gives less "unsure spam" and more "sure spam"; it
# might however result in more "sure fp".  Since most people find fp
# worse than fn, this value should most probably be higher than
# pminhamsure.  0.01 works well, but to accept a bit less fp, I set it
# to 0.02.  This variable operates on messages that are "neither ham
# nor spam; but a bit more spam than ham".
pminspamsure = 0.02

# usetail: if False, use complete distributions to renormalize the
# Z-scores; if True, use only the worst tail value.  I get worse
# results if I set this to True, so the default is False.
usetail = False

# medianoffset: if True, set the median of the zham and zspam to 0
# before calculating rmsZ.  If False, do not shift the data and hence
# assume that 0 is the center of the population.  True seems to help
# for my data.
medianoffset = True

I'd like to invite everyone to play with this. It takes only a few seconds to run once the .pik is set up using "clgen"! I'll post some of my results under separate cover. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From bkc@murkworks.com Sun Oct 6 19:37:30 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 14:37:30 -0400 Subject: [Spambayes] Re: For the bold In-Reply-To: <3DA08010.3010803@hooft.net> Message-ID: <3DA04A75.3377.2CBBC9B0@localhost> On 6 Oct 2002 at 20:25, Rob Hooft wrote: > I made a number of changes to rmspik.py: > > - The "chance" function was replaced by something a bit more > scientific (this helps!). > - There are new parameters in the source code (I'm hoping someone else > can make these configurable through the .ini file). > Does this mean it works for CL now? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Sun Oct 6 19:43:16 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 14:43:16 -0400 Subject: [Spambayes] incremental testing with CL2/CL3? Message-ID: <3DA04BCE.30178.2CC10EB2@localhost> Someone mentioned they did incremental testing and posted their results, but I couldn't figure out what the results meant. So, I want to try it too. I notice in the TestDriver, comments like:

# CAUTION: this just doesn't work for incremental training when
# options.use_central_limit is in effect.
def train(self, ham, spam):

I'm not planning on using untrain(), so does this comment still apply? My plan is:

1. Receive 100 (configurable) messages "per day", with a (configurable) percentage of those being spam.
2. Run the classifier on those messages and make 3 categories: ham, spam, unsure. I want to know how many fall into each category on each "day".
3. Some percentage (configurable) of each category will be fed back into training each "day".
4. Plot fn and fp rate "per day" for .. 30 days (configurable) to show how rates vary..
5. Modulate max_discriminators, training feedback (% of messages in each category fed back into system) vs.
"days" to get a feel for the results a typical user might expect.. 6. re-run testing using new classifier schemes.. where do I start? Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Sun Oct 6 19:43:30 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 20:43:30 +0200 Subject: [Spambayes] rmspik results Message-ID: <3DA08452.8060105@hooft.net> As promised, here are some of my results from the current version of rmspik.py. For the record: I just wrote in a previous message: I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis. These two tests I did with clt1, clt2 and with clt3 resulting in 6 pik files that I analysed using rmspik.py. This results in such a mass of results that I wrote a quick script to make a "score" out of each run, something that weighs the work of filtering unsure messages, the occurrence of fp's and the occurrence of fn's. The score is done using: fprate=float(nfp)/nham fnrate=float(nfn)/nspam unsurerate=float(nunsure)/ntot score=fprate*fpfac+fnrate*fnfac+unsurerate*unsurefac Where: fpfac=3000.0; fnfac=300.0; unsurefac=100.0 representing one possible "private" mix of priorities (you could think of these as the cost in Euros or Dollars for such a mistake). For a mailing list a philosophy tells me fnfac/unsurefac should be about the number of members of the list, and fp's are not too bad if you can send a nice message to the poster telling him what happened and how to get his message posted anyway. The score is the last number on each line describing a run. 
surefactor=1000 pmin(sp|h)amsure=0.01 usetail=False medianoffset=False

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7745   228    21     6     2683    104    13     0    5.6
clt1-67890   7738   225    33     4     2680    108     8     4    5.4
clt2-12345   7814   155    26     5     2690    101     9     0    4.6
clt2-67890   7781   180    34     5     2714     75     8     3    4.9
clt3-12345   7751   211    32     6     2681    110     9     0    5.6
clt3-67890   7704   256    35     5     2699     91     7     3    5.8

With pminhamsure=0.005 and pminspamsure=0.02

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7746   227    21     6     2671    116    13     0    5.7
clt1-67890   7738   225    34     3     2670    118     8     4    5.1
clt2-12345   7835   134    26     5     2673    118     9     0    4.5
clt2-67890   7822   139    34     5     2693     96     8     3    4.8
clt3-12345   7783   179    33     5     2665    126     9     0    5.1
clt3-67890   7752   208    37     3     2672    118     7     3    4.9

With surefactor=10000

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7492   481    23     4     2618    169    13     0    7.9
clt1-67890   7481   482    34     3     2601    187     9     3    8.0
clt2-12345   7810   159    27     4     2644    147     9     0    4.7
clt2-67890   7792   169    35     4     2660    129     8     3    5.0
clt3-12345   7743   219    34     4     2640    151     9     0    5.3
clt3-67890   7717   243    37     3     2643    147     7     3    5.5

With surefactor=10

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7922    51    19     8     2733     54    11     2    4.5
clt1-67890   7905    39    32     5     2749     39     7     5    3.5
clt2-12345   7864   105    25     6     2683    108     8     1    4.6
clt2-67890   7852   109    33     6     2701     88     7     4    4.9
clt3-12345   7814   148    33     5     2675    116     9     0    4.7
clt3-67890   7779   181    36     4     2679    111     6     4    5.0

With surefactor=100, usetail=True

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7777   192    24     7     2705     83    12     0    5.5
clt1-67890   7791   171    34     4     2704     84     8     4    4.7
clt2-12345   7824   143    28     5     2695     96     9     0    4.4
clt2-67890   7668   277    47     8     2731     62     4     3    6.9
clt3-12345   7802   165    28     5     2692     99     9     0    4.7
clt3-67890   7636   309    48     7     2727     66     4     3    6.9

With surefactor=100, usetail=True, medianoffset=True

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7728   241    24     7     2704     84    12     0    6.0
clt1-67890   7753   210    33     4     2695     93     9     3    5.0
clt2-12345   7813   154    28     5     2699     92     9     0    4.5
clt2-67890   7653   292    48     7     2733     60     4     3    6.7
clt3-12345   7803   164    28     5     2693     98     9     0    4.6
clt3-67890   7636   307    50     7     2728     65     4     3    6.9

With surefactor=100, usetail=False, medianoffset=True

expt sets    ham-OK Unsure UnsNOK ERR   spam-OK Unsure UnsNOK ERR
clt1-12345   7842   127    25     6     2672    116    12     0    4.8
clt1-67890   7832   131    34     3     2675    113     8     4    4.2
clt2-12345   7816   147    32     5     2675    116     9     0    4.7
clt2-67890   7786   174    36     4     2684    106     6     4    4.9
clt3-12345   7807   156    32     5     2673    118     9     0    4.8
clt3-67890   7774   187    35     5     2684    106     6     4    5.4

Conclusions so far:
- medianoffset=True helps
- usetail=False is better than True
- clt1 seems to do best, although the difference is not large.
- there are large differences between the 12345 and 67890 runs.

I'm sure that systematic variation of the parameters (e.g. using a simplex optimization?) will give me even better scores. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 6 20:01:57 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 06 Oct 2002 21:01:57 +0200 Subject: [Spambayes] Re: For the bold References: Message-ID: <3DA088A5.9060100@hooft.net> Tim Peters wrote: >>the assumption under which the rmspik.py code works is that the >>distributions of zham and zspam values are normally distributed >>if all values are "mirrored" around 0. I'll have to test that >>assumption for clt1 and clt3! > > > I didn't catch the meaning there, but expect any assumption you would like > to make is most likely to be true under clt1 (which is the least extreme of > these gimmicks). I intended to say that the approach assumes that the (tails of the) zham and zspam values in the pickles can be described with a Gaussian centered on 0. This was a fairly good approximation for clt2 that I tried first, but appeared HORRIBLE for both clt1 and clt3. That is when I started trying to explain the tails by looking at the tails only. But that didn't help.
What did help is recentering the data on the median value (not the average value; that is a bad approximation for an asymmetric distribution). BTW: I tried using the direct values hmean and smean in the pickles as well. This worked fine immediately for clt2 and gives (as expected) the exact same results. But for clt1 and clt3 this needed the medianoffset parameter, which I have only implemented now. There seems to be a small difference, but I did not investigate that yet. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From jbublitz@nwinternet.com Sun Oct 6 22:10:28 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sun, 06 Oct 2002 14:10:28 -0700 (PDT) Subject: [Spambayes] incremental testing with CL2/CL3? In-Reply-To: <3DA04BCE.30178.2CC10EB2@localhost> Message-ID: On 06-Oct-02 Brad Clements wrote: > Someone mentioned they did incremental testing and posted their > results, but I couldn't figure out what the results meant. That would be me. Apparently nobody could figure out what I wrote. The short summary is that for my data, running it sequentially with "daily" retraining gave far better results than any other testing method, Graham worked slightly better than Spambayes for me (< 0.3% difference in fp/fn %'s - small), and the effect of initial training size (as low as 1 ham, 1 spam) disappeared after the first "day". > So, I want to try it too. > I notice in the TestDriver, comments like: > # CAUTION: this just doesn't work for incrememental training > when > # options.use_central_limit is in effect. > def train(self, ham, spam): > > I'm not planning on using untrain(), so does this comment still > apply? > > my plan is: I'd suggest: 0. Start with a size-configurable basic training sample. > 1. Receive 100 (configurable) messages "per day", with a > (configurable) percentage of > those being spam. > > 2. run the classifier on those messages and make 3 categories: > ham, spam, unsure.
I > want to know how many fall into each category on each "day". > > 3. some percentage (configurable) of each category will be fed > back into training each > "day". > > 4. Plot fn and fp rate "per day" for .. 30 days (configurable) to > show how rates vary.. I had no errors in 21 day tests (with large enough initial training sample - otherwise only errors on first "day"). I needed to test 7K to 8K of *each* type of msg to see any errors in the best case. Short tests are nice for code debugging/checking the effects of methodology changes, as in (5) and (6) below. > 5. modulate max_discriminators, training feedback (% of messages > in each category > fed back into system) vs. "days" to get a feel for the results a > typical user might expect.. The other thing that would be interesting (to me anyway) is if it's possible/desirable for the system to modify the discrimination cutoff(s) automatically based on new training data. In other words, if the system starts at "score > 0.5 is spam", can learning adjust that number to compensate for changes in newly learned data? > 6. re-run testing using new classifier schemes.. > > where do I start? I'm not sure what other info you need - you seem to have it all in order. For my data, msg filename == delivery timestamp (one msg per file), but otherwise you'd probably get the most accurate ordering from the first encountered "Received" line in the headers if the msgs aren't already ordered. Otherwise, I instantiated Hammie (from hammie.py) with hammie.createBayes and just did calls to Hammie.train, Hammie.update_probabilities, and Hammie.score. I didn't try "untrain" either - it would be interesting to see whether using that is good or bad. I also accumulated "weekly" totals thinking I might need to smooth out "daily" variations, but the error rates are so low, the only thing it told me is whether the errors occurred early or late in the sequence.
The last test run I did only had two errors - one on the first "day" and one at almost the last "day" (somewhere between 7700 and 8000 ham msgs). Jim From tim.one@comcast.net Sun Oct 6 23:10:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:10:51 -0400 Subject: [Spambayes] incremental testing with CL2/CL3? In-Reply-To: <3DA04BCE.30178.2CC10EB2@localhost> Message-ID: [Brad Clements] > ... > I notice in the TestDriver, comments like: > > # CAUTION: this just doesn't work for incrememental training when > # options.use_central_limit is in effect. > def train(self, ham, spam): > > I'm not planning on using untrain(), so does this comment still apply? Yes, afraid so. A do-something compute_population_stats() is unique to the central limit schemes, and all it knows about the world is the ham and spam passed to train(). If you had trained on 20000 ham and 20000 spam, and then passed 10 of each to train() in another call, the population statistics for the previous 20000 of each would be lost, overwritten by the stats for the new 20 msgs. I don't see an obvious way to fix this, alas. It would be easiest to fix under clt1. You could train on every previous msg every time, but that's a quadratic-time proposition overall. From tim.one@comcast.net Sun Oct 6 23:18:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:18:13 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3DA07589.50407@hooft.net> Message-ID: [Tim] >> I'm not sure exactly what 10x(200+200) means, but at the >> plausible extremes it means your classifiers were trained on >> 200 of each, or on 1800 of each. So at worst, my classifier was >> trained on 3x as much data, and at best on 25x as much data. >> Error rates certainly improve with more training data, >> albeit slowly. [Rob Hooft] > I did have 10 sets each of ham and spam, each set containing 200 > messages out of a total reservoir of ~17000 ham and 7500 spam. See?
This still doesn't give the reader a clue about how many msgs your classifiers were trained on, or how many they predicted against. It's important info, and I don't know how to convince people to reveal it -- the cmp.py output even prints it, but most people snip that part off, as if cmp.py printed it by mistake . > This subset of everything was heavy enough for this optimization: it > took about 24 hours of calculating to get that analysis done.... Without knowing what you did, I really can't comment. This seems like an awfully long time to go through 4,000 (10*(200+200)) total msgs, though, no matter what you were doing. On my box (866MHz, 256MB RAM), the system scores about 80 msgs per second. From tim.one@comcast.net Sun Oct 6 23:26:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:26:08 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3DA088A5.9060100@hooft.net> Message-ID: [Rob Hooft] > ... > BTW: I tried using the direct values hmean and smean in the pickles as > well. This worked fine immediately for clt2 and gives (as expected) the > exact same results. Actually, I wouldn't expect that: the "zscores" aren't a function of smean or hmean alone, they also depend on each msg's n value (in a way that's highly dubious in clt1 and clt2, and wholly unjustified in clt3). To the limited extent that the zscores *may* make sense under clt1 and clt2, it's the dependence on n that the sense comes from. If you got the same results by ignoring n completely (which just looking at hmean or smean does), then that at least confirms that fiddling with n is in fact of no value in the current pseudo-zscore computation. 
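For readers following along: the n-dependence Tim is questioning is, roughly, dividing a message's deviation from the population's mean extreme-word probability by the population sdev scaled down by sqrt(n), n being the number of extreme words in the message. This is a sketch of the idea only, not the actual classifier.py code, and the numbers are made up:

```python
import math

def pseudo_zscore(msg_mean, pop_mean, pop_sdev, n):
    # How many "standard errors" the message's mean extreme-word
    # probability sits from the training population's mean; the
    # sqrt(n) factor is the n-dependence under discussion.
    return (msg_mean - pop_mean) / (pop_sdev / math.sqrt(n))

# Ignoring n (i.e. n = 1) reduces this to a plain comparison of means:
z_with_n = pseudo_zscore(0.8, 0.5, 0.2, 25)    # -> 7.5
z_without_n = pseudo_zscore(0.8, 0.5, 0.2, 1)  # -> 1.5
```

If scores with and without the sqrt(n) factor rank messages the same way, that is the "fiddling with n is of no value" observation above.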
From tim.one@comcast.net Sun Oct 6 23:53:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 18:53:13 -0400 Subject: [Spambayes] CL2 test part II In-Reply-To: <3DA02D8B.11446.2C4AD3E5@localhost> Message-ID: [Brad Clements] > In my earlier CL2 and CL3 tests, I trained on the 2nd half of my > corpus, and tested the first half. > > Now, I'm training on the first half and testing the 2nd half. > > First run of CL2 uncovered more misclassifications (which > probably affected the training of my first test). > > ... > > In any case, here's CL2 results training first, testing second half. > > > Ham scores for all runs: 6500 items; mean 0.94; sdev 7.21 > -> min 0; median 0; max 100 > * = 105 items > 0 6384 ************************************************************* > 25 87 * > 50 21 * > 75 8 * Let me interleave these from various followup msgs. Ham for clt2 rmspick: Reading results/cl2-b/clim.pik ... Nham= 6500 RmsZham= 4.48175325539 Nspam= 6500 RmsZspam= 3.7809204202 ====================================================================== HAM: FALSE POSITIVE: zham=-5.79 zspam=-2.33 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-5.43 zspam=-2.27 Data/Ham/Set6/10068 SURE! FALSE POSITIVE: zham=-3.97 zspam=-1.72 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-6.17 zspam=-2.35 Data/Ham/Set9/6415 SURE! Sure/ok 6297 Unsure/ok 181 Unsure/not ok 18 Sure/not ok 4 Unsure rate = 3.06% Sure fp rate = 0.06%; Unsure fp rate = 9.05% Ham for clt3: -> Ham scores for all runs: 6500 items; mean 0.98; sdev 7.64 -> min 0; median 0; max 100 * = 105 items 0 6386 ************************************************************* 25 74 * 50 26 * 75 14 * Ham for clt3 rmspick: Reading results/cl3-b/clim.pik ... Nham= 6500 RmsZham= 12.6590343376 Nspam= 6500 RmsZspam= 14.8475623174 ====================================================================== HAM: FALSE POSITIVE: zham=-6.65 zspam=-2.35 Data/Ham/Set6/6438 SURE! FALSE POSITIVE: zham=-6.19 zspam=-2.29 Data/Ham/Set6/10068 SURE! 
FALSE POSITIVE: zham=-4.60 zspam=-1.73 Data/Ham/Set7/9964 SURE! FALSE POSITIVE: zham=-7.11 zspam=-2.37 Data/Ham/Set9/6415 SURE! Sure/ok 6294 Unsure/ok 182 Unsure/not ok 20 Sure/not ok 4 Unsure rate = 3.11% Sure fp rate = 0.06%; Unsure fp rate = 9.90% I don't see a significant difference between clt2 & clt3 here; clt2 may be doing slightly better. The differences after rmspick are clearly insignificant. rmspick is unsure twice as often, but is dead wrong half as often as clt2, and even better wrt raw clt3. On to the spam: > -> Spam scores for all runs: 6500 items; mean 99.32; sdev 5.94 > -> min 0; median 100; max 100 > * = 106 items > 0 3 * > 25 15 * > 50 68 * > 75 6414 ************************************************************* > -> best cutoff for all runs: 0.5 > -> with weighted total 1*29 fp + 18 fn = 47 > -> fp rate 0.446% fn rate 0.277% Spam for clt2 rmspick: SPAM: FALSE NEGATIVE: zham=-1.86 zspam=-3.48 Data/Spam/Set7/6718 SURE! FALSE NEGATIVE: zham=-1.48 zspam=-6.37 Data/Spam/Set10/10979 SURE! BOGUS Sure/ok 6240 Unsure/ok 232 Unsure/not ok 26 Sure/not ok 2 Unsure rate = 3.97% Sure fn rate = 0.03%; Unsure fn rate = 10.08% and the 2nd false negative was bogus (really a ham). rmspick was uncertain 3x as often. Spam for clt3: -> Spam scores for all runs: 6500 items; mean 98.85; sdev 7.84 -> min 0; median 100; max 100 * = 105 items 0 8 * 25 22 * 50 113 ** 75 6357 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*19 fp + 30 fn = 49 -> fp rate 0.292% fn rate 0.462% Less certain than clt2, *and* made more mistakes when certain (that's a bad combination ). Spam for clt3 rmspick: SPAM: FALSE NEGATIVE: zham=-1.37 zspam=-6.36 Data/Spam/Set10/10979 SURE! Sure/ok 6271 Unsure/ok 207 Unsure/not ok 21 Sure/not ok 1 Unsure rate = 3.51% Sure fn rate = 0.02%; Unsure fn rate = 9.21% Uncertain more often than raw clt3, but far fewer errors when certain. 
Overall, I'd say that clt2 works better for you than clt3, and that rmspick gives an improvement either way. I bet you're just dying to try clt1.

> [Tokenizer]
> mine_received_headers: True
>
> [Classifier]
> use_central_limit2 = True
> use_central_limit3 = False
> zscore_ratio_cutoff: 1.9
>
> [TestDriver]
> spam_cutoff: 0.50
> show_false_negatives: True
> nbuckets: 4
>
> show_spam_lo: 0.0
> show_spam_hi: 0.45
>
> save_trained_pickles: True
> save_histogram_pickles: True

Looks good! Thank you, Brad. From tim.one@comcast.net Mon Oct 7 03:08:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 22:08:26 -0400 Subject: [Spambayes] RE: For the bold In-Reply-To: <3DA08010.3010803@hooft.net> Message-ID: [Rob Hooft] > I made a number of changes to rmspik.py: > > - The "chance" function was replaced by something a bit more > scientific (this helps!). If you think more accuracy would help more, there are three well-known routines for computing areas under the unit Gaussian in this file: http://lib.stat.cmu.edu/apstat/66 The fanciest is good to 15 digits, which is possibly more than is really needed here. From bkc@murkworks.com Mon Oct 7 03:23:02 2002 From: bkc@murkworks.com (Brad Clements) Date: Sun, 06 Oct 2002 22:23:02 -0400 Subject: [Spambayes] CL1 tests Message-ID: <3DA0B78F.29877.2E65FDB5@localhost> Two tests, a and b using cl1 and rmspik. Not formatted too well, doing this via vnc.
Set A -> Ham scores for all in this training set: 6500 items; mean 1.99; sdev 10.31 -> min 0; median 0; max 100 * = 103 items 0 6254 ************************************************************* 25 169 ** 50 62 * 75 15 * -> Spam scores for all in this training set: 6500 items; mean 99.08; sdev 6.76 -> min 0; median 100; max 100 * = 105 items 0 2 * 25 9 * 50 108 ** 75 6381 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*77 fp + 11 fn = 88 -> fp rate 1.18% fn rate 0.169% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.99; sdev 10.31 -> min 0; median 0; max 100 * = 103 items 0 6254 ************************************************************* 25 169 ** 50 62 * 75 15 * -> Spam scores for all runs: 6500 items; mean 99.08; sdev 6.76 -> min 0; median 100; max 100 * = 105 items 0 2 * 25 9 * 50 108 ** 75 6381 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*77 fp + 11 fn = 88 -> fp rate 1.18% fn rate 0.169% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik Reading results/cl1-a/clim.pik ... Nham= 6500 RmsZham= 4.15398302786 Nspam= 6500 RmsZspam= 4.50455044819 ====================================================================== HAM: FALSE POSITIVE: zham=3.62 zspam=-1.05 Data/Ham/Set7/9964 SURE! Sure/ok 6236 Unsure/ok 232 Unsure/not ok 31 Sure/not ok 1 Unsure rate = 4.05% Sure fp rate = 0.02%; Unsure fp rate = 11.79% ====================================================================== SPAM: FALSE NEGATIVE: zham=0.91 zspam=-3.42 Data/Spam/Set10/9656 SURE! 
Sure/ok 6144 Unsure/ok 336 Unsure/not ok 19 Sure/not ok 1 Unsure rate = 5.46% Sure fn rate = 0.02%; Unsure fn rate = 5.35% Set B -> Ham scores for all in this training set: 6500 items; mean 1.72; sdev 9.38 -> min 0; median 0; max 100 * = 103 items 0 6282 ************************************************************* 25 173 ** 50 37 * 75 8 * -> Spam scores for all in this training set: 6500 items; mean 99.16; sdev 6.43 -> min 0; median 100; max 100 * = 105 items 0 1 * 25 10 * 50 99 * 75 6390 ************************************************************* -> best cutoff for all in this training set: 0.5 -> with weighted total 1*45 fp + 11 fn = 56 -> fp rate 0.692% fn rate 0.169% saving pickle to class1.pik -> Ham scores for all runs: 6500 items; mean 1.72; sdev 9.38 -> min 0; median 0; max 100 * = 103 items 0 6282 ************************************************************* 25 173 ** 50 37 * 75 8 * -> Spam scores for all runs: 6500 items; mean 99.16; sdev 6.43 -> min 0; median 100; max 100 * = 105 items 0 1 * 25 10 * 50 99 * 75 6390 ************************************************************* -> best cutoff for all runs: 0.5 -> with weighted total 1*45 fp + 11 fn = 56 -> fp rate 0.692% fn rate 0.169% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Saving all score data to pickle clim.pik Reading results/cl1-b/clim.pik ... Nham= 6500 RmsZham= 4.43688346925 Nspam= 6500 RmsZspam= 4.49901192821 ====================================================================== HAM: FALSE POSITIVE: zham=8.00 zspam=-1.23 Data/Ham/Set1/10180 SURE! FALSE POSITIVE: zham=3.55 zspam=-1.56 Data/Ham/Set4/69 SURE! FALSE POSITIVE: zham=8.59 zspam=-0.61 Data/Ham/Set4/10008 SURE! FALSE POSITIVE: zham=8.86 zspam=-0.32 Data/Ham/Set5/5105 SURE! 
Sure/ok 6251 Unsure/ok 193 Unsure/not ok 52 Sure/not ok 4 Unsure rate = 3.77% Sure fp rate = 0.06%; Unsure fp rate = 21.22% ====================================================================== SPAM: FALSE NEGATIVE: zham=0.60 zspam=-3.25 Data/Spam/Set2/5185 SURE! FALSE NEGATIVE: zham=0.19 zspam=-9.53 Data/Spam/Set3/3010 SURE! Sure/ok 6131 Unsure/ok 337 Unsure/not ok 30 Sure/not ok 2 Unsure rate = 5.65% Sure fn rate = 0.03%; Unsure fn rate = 8.17%

[Tokenizer]
mine_received_headers: True

[Classifier]
use_central_limit = True
use_central_limit2 = False
use_central_limit3 = False
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
show_false_negatives: True
nbuckets: 4

show_spam_lo: 0.0
show_spam_hi: 0.45

save_trained_pickles: True
save_histogram_pickles: True

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Mon Oct 7 04:12:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 06 Oct 2002 23:12:10 -0400 Subject: [Spambayes] New tokenization of the Subject line In-Reply-To: <1033781185.1125.7.camel@localhost.localdomain> Message-ID: [Remi Ricard] > I try something again. > > Since most of the mail from subscribed groups have in their > subject [spambayes] or [freesco] i.e "[" and "]". > I decided to keep this as a word so my words from a subject line > like: Re: [Spambayes] Moving closer to Gary's ideal > will be
> Re:
> [Spambayes]
> Moving
> closer
> to
> Gary's
> ideal
Two things about that: 1. It's not a precise enough description to know exactly what you did. On a list with programmers, don't be afraid to show code. 2. Do you think it's more likely that a spam would have "freesco" than "[freesco]" in its Subject line? Not bloody likely. That is, you couldn't have picked worse examples for selling the idea that this *might* help. Indeed, that may be why it didn't help.
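For concreteness, Remi's idea — keeping "[Spambayes]" as one token instead of splitting out the brackets — can be shown in a few lines. This is a hypothetical reconstruction of the change, not his actual patch:

```python
import re

# One alternation: a bracketed run such as "[Spambayes]" is kept
# whole; otherwise fall back to the stock subject-word pattern.
subject_word_re = re.compile(r"\[[^\]]+\]|[\w\x80-\xff$.%]+")

subject = "Re: [Spambayes] Moving closer to Gary's ideal"
tokens = subject_word_re.findall(subject)
# -> ['Re', '[Spambayes]', 'Moving', 'closer', 'to', 'Gary', 's', 'ideal']
```

Note the bracketed alternative must come first, so it wins over the plain word pattern.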
It's usually more fruitful to stare at mistakes made by the system, and then see if there's something about them in common that the tokenizer isn't presenting in a usable way (very clear example: we throw away uuencoded pieces entirely; very muddy example: we throw away info about how many times a word appears in a msg). > And this is the result. Alex did a nice job of running through this, so I'll skip to the end.

> -> tested 200 hams & 279 spams against 800 hams & 1113 spams
> -> tested 200 hams & 275 spams against 800 hams & 1117 spams
> -> tested 200 hams & 298 spams against 800 hams & 1094 spams
> -> tested 200 hams & 272 spams against 800 hams & 1120 spams
> -> tested 200 hams & 268 spams against 800 hams & 1124 spams
> -> tested 200 hams & 279 spams against 800 hams & 1113 spams
> -> tested 200 hams & 275 spams against 800 hams & 1117 spams
> -> tested 200 hams & 298 spams against 800 hams & 1094 spams
> -> tested 200 hams & 272 spams against 800 hams & 1120 spams
> -> tested 200 hams & 268 spams against 800 hams & 1124 spams
>
> false positive percentages
> 1.000 0.500 won -50.00%
> 1.500 1.500 tied
> 2.000 2.500 lost +25.00%
> 1.000 1.000 tied
> 0.000 0.000 tied
>
> won 1 times
> tied 3 times
> lost 1 times
>
> total unique fp went from 11 to 11 tied
> mean fp % went from 1.1 to 1.1 tied
>
> false negative percentages
> 0.717 0.717 tied
> 0.727 0.727 tied
> 1.007 1.342 lost +33.27%
> 0.000 0.368 lost +(was 0)
> 0.746 0.373 won -50.00%
>
> won 1 times
> tied 2 times
> lost 2 times
>
> total unique fn went from 9 to 10 lost +11.11%
> mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32%
>
> ham mean ham sdev
> 24.51 25.20 +2.82% 9.45 9.09 -3.81%
> 26.14 27.20 +4.06% 8.62 8.32 -3.48%
> 26.04 26.94 +3.46% 10.00 9.68 -3.20%
> 25.15 25.85 +2.78% 8.05 7.93 -1.49%
> 25.12 26.11 +3.94% 8.28 8.16 -1.45%
>
> ham mean and sdev for all runs
> 25.39 26.26 +3.43% 8.93 8.69 -2.69%
>
> spam mean spam sdev
> 80.41 79.86 -0.68% 8.80 8.81 +0.11%
> 79.87 79.47 -0.50% 8.20 8.11 -1.10%
> 79.87 79.31 -0.70% 8.79 8.73 -0.68%
> 80.42 80.03 -0.48% 8.13 8.22 +1.11%
> 80.11 79.70 -0.51% 9.32 9.07 -2.68%
>
> spam mean and sdev for all runs
> 80.13 79.66 -0.59% 8.66 8.60 -0.69%
>
> ham/spam mean difference: 54.74 53.40 -1.34

[T. Alexander Popiel]
> This shows ham and spam getting closer together overall, and
> is bad. The reduction in the standard deviation is (I think)
> too small to overcome this... but I'm just eyeballing it;
> can someone with a bit of the theory help here?

Not much in this case, because it had nothing else going for it: the conclusion to give up on this idea should have been reached long before getting to this point. We don't know what this distribution "looks like", exactly. It appears to be "kinda normal", but is tighter than normal at the endpoints, and looser than normal where the tails dribble toward each other. This limits the usefulness we can get out of sdevs: the only thoroughly general result is that, for *any* distribution, no more than 1/k**2 of the data lives more than k standard deviations away from the mean. This is an especially useless result when k <= 1. There's a one-tailed version that says something non-trivial for k <= 1: http://www.btinternet.com/~se16/hgb/cheb.htm But we're more interested in the overlap, and that occurs at higher k. The rule of thumb I fall back on is that, *whatever* sdev means for this distribution, I assume it means much the same thing across testers, and that (which is justified although hard to quantify here) separating the means by more sdevs is a good thing. So I look for the value of k such that (and assuming mean1 < mean2):

    mean1 + k * sdev1 = mean2 - k * sdev2

or, rearranging,

        mean2 - mean1
    k = -------------
        sdev1 + sdev2

That tells us the score that's "equally far away" from both means in a standard-deviation sense, and how far away that is from both means (in units of standard deviations).
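The 1/k**2 result mentioned above is Chebyshev's inequality, and it really does hold for any distribution with finite variance; a quick empirical check on uniform random data (chosen purely as an illustration):

```python
import math
import random

random.seed(42)
data = [random.random() for _ in range(100000)]
mean = sum(data) / len(data)
sdev = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

for k in (1.5, 2.0, 3.0):
    tail = sum(1 for x in data if abs(x - mean) > k * sdev)
    frac = float(tail) / len(data)
    # Chebyshev guarantees frac <= 1/k**2, whatever the distribution;
    # for uniform data the observed fraction is far below the bound.
    assert frac <= 1.0 / k ** 2
```

For a uniform distribution sdev is about 0.289, so nothing at all lies beyond 2 sdevs — the bound is loose, which is Tim's point about it being weakest exactly where the overlap lives.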
A little Python helps:

    def findk(mean1, sdev1, mean2, sdev2):
        """Solve mean1 + k*sdev1 = mean2 - k*sdev2 for k.
        Return (k, common value).
        """
        assert mean1 < mean2
        k = (mean2 - mean1) / (sdev1 + sdev2)
        score = mean1 + k * sdev1
        return k, score

Plugging in the "before" means and sdevs gives:

    >>> findk(25.39, 8.93, 80.13, 8.66)
    (3.1119954519613415, 53.180119386014781)

BTW, if you don't favor one kind of error over another, this suggests spam_cutoff=0.5318 may well be a good value for this data. If it isn't, the direction it errs in is a clue about which distribution is stranger. Plugging in the "after" values gives:

    >>> findk(26.26, 8.69, 79.66, 8.60)
    (3.0884904569115093, 53.098982070561014)
    >>>

So the means have gotten a tiny bit closer in an sdev sense too (they meet at 3.09 sdevs from both, instead of at 3.11 before). The difference is so small as to be insignificant, though. From tim.one@comcast.net Mon Oct 7 06:30:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 07 Oct 2002 01:30:41 -0400 Subject: [Spambayes] incremental testing with CL2/CL3? In-Reply-To: <3DA04BCE.30178.2CC10EB2@localhost> Message-ID: [Brad Clements] > Someone mentioned they did incremental testing and posted their > results, but I couldn't > figure out what the results meant. > > So, I want to try it too. > > I notice in the TestDriver, comments like: > > # CAUTION: this just doesn't work for incrememental training when > # options.use_central_limit is in effect. > def train(self, ham, spam): > > > I'm not planning on using untrain(), so does this comment still apply? I replied to this before with "sorry, yes", but this issue needs to be forced, and I checked in changes so we can at least *try* this. Let me explain the problem: Under the all-default scheme, the only thing we remember about training msgs is how many msgs each word appears in. That's all.
Given any msg, we can add it or remove it at will, and the only effect it has is on the word->hamcount and word->spamcount maps (from which we guess probabilities). The central limit schemes are quite different this way: we not only save word->hamcount and word->spamcount maps (and in exactly the same way, so no problem there), we also do a third training pass (.central_limit_compute_population_stats{,2,3}) under the covers. This looks for the set of "extreme words" in each training message (which can't be known until after update_probabilities() completes), and saves away statistics about their probabilities, one set of statistics for all the ham messages trained on, and a parallel, distinct set for all the spam messages trained on. The problem with incremental training under the clt schemes is in that third pass: when you train on any new data:

1. The word->hamcount and word->spamcount maps change.

2. This in turn changes word probabilities. The word probabilities that *were* used in the third training pass for *previous* data are no longer current, and so the statistics computed from them are also incorrect for the new state of the world.

3. Changing word probabilities can in turn even change the *set* of extreme words in a msg. And again, the set of extreme words found by the third training pass for previous data may not even be the correct extreme words for the new state of the world.

There's simply no way to repair #2 and #3 short of recomputing them from scratch for every msg ever trained on, and that requires feeding them all into the system again (or a moral equivalent, like storing, for each msg ever trained on, the set of tokens it generated). In particular, as time goes on the probabilities computed in #2 get more extreme (closer to 0.0 and closer to 1.0) for strong clues, and clt2 and clt3 in particular make extreme use of extreme words. clt1 is less sensitive that way.
This implies that, if you don't retrain on every msg, the mild spamprobs in the msgs first trained on will forever after drag down the statistics toward neutrality. There are two hacks I can think of to try, short of retraining on every msg ever seen:

1. Just keep adding in new statistics, and don't worry about the moderating effects of the early msgs. The code as checked in now will do this: so long as you don't call new_classifier(), each time train() is called it just adds the new statistics to the old ones (before I checked in the changes, it overwrote the old statistics, as if they had never existed).

2. Indeed simply overwrite the old statistics. This is as if the third training pass had never been done for older messages.

My intuition (which isn't worth much!) is that #2 is quirkier and riskier, making much of the effect of the central-limit gimmicks depend solely on the last batch of msgs trained on. #1 should have much greater stability over time, but that's not necessarily a good thing if the stability is bought at the cost of not moving quickly enough toward the true state of the world. Anyway, the only way to know is to try it. > my plan is: > > 1. Receive 100 (configurable) messages "per day", with a > (configurable) percentage of those being spam. You're ordering these by time received, right? > 2. run the classifier on those messages and make 3 categories: > ham, spam, unsure. I want to know how many fall into each > category on each "day". I would like to see eight categories instead:

    ham sure correct
    ham sure incorrect
    ham unsure correct
    ham unsure incorrect

and the same four for spam. > 3. some percentage (configurable) of each category will be fed > back into training each "day". There's a world of interesting variations here. For example, what if you only feed it "sure but wrong" false positives and false negatives? Or only those plus "unsure but wrong" mistakes? Or only the latter? Etc.
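Going back to the two hacks above: the only difference between them is what happens to the stored population statistics on each train() call. A toy sketch with made-up names (the real third-pass statistics live inside the classifier, not in a class like this):

```python
class PopStats:
    # Running count / sum / sum-of-squares of extreme-word probabilities.
    def __init__(self):
        self.n, self.total, self.totalsq = 0, 0.0, 0.0

    def add_batch(self, probs, overwrite=False):
        if overwrite:
            # hack #2: forget everything earlier batches contributed
            self.n, self.total, self.totalsq = 0, 0.0, 0.0
        for p in probs:
            # hack #1: keep accumulating across train() calls
            self.n += 1
            self.total += p
            self.totalsq += p * p

    def mean(self):
        return self.total / self.n

stats = PopStats()
stats.add_batch([0.1, 0.2, 0.3])             # first batch of training msgs
stats.add_batch([0.9, 0.8])                  # hack #1: mean reflects both batches
print(round(stats.mean(), 2))                # -> 0.46
stats.add_batch([0.6, 0.7], overwrite=True)  # hack #2: only this batch survives
print(round(stats.mean(), 2))                # -> 0.65
```

The sketch shows the stability trade-off directly: the accumulated mean drifts slowly, while the overwritten one jumps to whatever the last batch looked like.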
Semi-realistic is to feed it all mistakes, and a random sampling from correct results. It's hard to know what people would really do, but I'm *most* interested at first in what happens if intelligent use of the system is made. > 4. Plot fn and fp rate "per day" for .. 30 days (configurable) to > show how rates vary.. Note that there are two f-n and two f-p rates under the clt schemes (the "sure" and "unsure" mistake rates). > 5. modulate max_discriminators, training feedback (% of messages > in each category fed back into system) vs. "days" to get a feel > for the results a typical user might expect.. Like such a beast exists. I know one of my sisters well enough to guess that she would feed it every false negative, and nothing else. > 6. re-run testing using new classifier schemes.. > > where do I start? At step #1. You'll need a custom test driver, but those are easy enough to write. Really stare at the differences between, e.g., timtest.py and timcv.py: the differences between strategies as different as a grid driver and a cross-validation driver amount to a few dozen lines of code in one function.
For this, something like:

    d = TestDriver.Driver()
    ham, spam = some initial set of msgs to get things started
    d.train(ham, spam)
    for day in range(number_of_days):
        ham, spam = get the day's new msgs
        d.test(ham, spam)
        d.finishtest()
        print out whatever stats you want, although d.finishtest()
            automatically prints out all the stuff you're interested
            in, so this may be much more a matter of writing a custom
            output analyzer; inferring the 4 error rates from pairs of
            4-line histograms would be a PITA that we could make easier
            (adding new "-> " lines is easy, and harmless so long as
            they're not easily confusable with the lines of this kind
            other programs are already extracting)
        ham2, spam2 = the msgs from ham & spam you want to train on
        d.train(ham2, spam2)
    d.alldone()

From papaDoc@videotron.ca Mon Oct 7 13:17:02 2002 From: papaDoc@videotron.ca (papaDoc) Date: Mon, 07 Oct 2002 08:17:02 -0400 Subject: [Spambayes] New tokenization of the Subject line References: Message-ID: <3DA17B3E.4060503@videotron.ca>

Hi,

>[Remi Ricard]
>
>>I try something again.
>>
>>Since most of the mail from subscribed groups have in their
>>subject [spambayes] or [freesco] i.e "[" and "]".
>>I decided to keep this as a word so my words from a subject line
>>like: Re: [Spambayes] Moving closer to Gary's ideal
>>will be
>>Re:
>>[Spambayes]
>>Moving
>>closer
>>to
>>Gary's
>>ideal
>>
>
>Two things about that:
>
>1. It's not a precise enough description to know exactly what you
>   did. On a list with programmers, don't be afraid to show code.
>
>2. Do you think it's more likely that a spam would have "freesco"
>   than "[freesco]" in its Subject line? Not bloody likely.
>   That is, you couldn't have picked worse examples for selling the
>   idea that this *might* help. Indeed, that may be why it didn't
>   help.
>It's usually more fruitful to stare at mistakes made by the system, and then
>see if there's something about them in common that the tokenizer isn't
>presenting in a usable way (very clear example: we throw away uuencoded
>pieces entirely; very muddy example: we throw away info about how many
>times a word appears in a msg).

OK, this is the code I changed. This:

    subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
    punctuation_run_re = re.compile(r'\W+')

for:

    subject_word_re = re.compile(r"[\w\x80-\xff\[\]$.%]+")
    punctuation_run_re = re.compile(r'\W^\[^\]+')

Why I did that is because I found this:

    prob(subject: '[') 0.0012345
    prob(subject: ']') 0.0012345

and usually I have a '[' or ']' in the subject if I have "[someword_from_a_mailing_list]", so instead of having '[', 'someword_from_a_mailing_list' and ']' as three tokens, why not use [someword_from_a_mailing_list] as one token. It is more likely that a ham will have [freesco] in its subject than only freesco "for my case", and I think a spam won't have freesco in its subject at all. (This is a clean mailing list he he.. this is still possible.....)

And I don't want a spam with a subject like "[[[[[[New free porn site]]]]]]" to have its '[' and ']' count as ham.

papaDoc
P.S. Thanks for the statistic explanation of my result :-)

From popiel@wolfskeep.com Mon Oct 7 20:58:51 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 07 Oct 2002 12:58:51 -0700 Subject: [Spambayes] Effects of ham to spam ratio Message-ID: <20021007195851.A54C3F57F@cashew.wolfskeep.com>

Executive summary: more spam is VERY good. 1:4 ham:spam is _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.

I'm back with another unusual experiment. This time, I varied the ratio of ham to spam, while keeping the total number of messages trained and tested constant. Once again, I'm doing this using the all-defaults Robinson classifier.
If someone gives me a good set of .ini files, I'd be more than happy to run this test using any of the central limit algorithms, too.

I again used timcv.py as my test driver, this time with 200 messages in each ham/spam set. For the different runs, I used the --{ham,spam}-keep options to control how much of each set got used, with the total used always being 250 ham+spam from each pair. The script I used (along with all the run output, etc.) is on my website at:

    http://www.wolfskeep.com/~popiel/spambayes/ratio

I also mangled a version of cmp.py (now called table.py, also on the website) to generate the following output:

-> tested 50 hams & 200 spams against 200 hams & 800 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       1       2       2       3       3       1
fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
fn tot:         12      17      20      28      28      30      36
fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84

There are several interesting things here:

1. The false positive rate remains insignificant throughout.

2. The false negative rate drops significantly as the ham:spam
   ratio goes down. The more spam you have in your mailfeed,
   the better this whole thing works.

3. The ham:spam ratio affects the spam sdev much more than the
   ham sdev.

4. Tim's k value (mean separation divided by sum of standard
   deviations) is best with slightly less ham than spam (at 2:3),
   which happens to be about the same ratio as in my real mailfeed.

It would be very interesting to find out if the best ham:spam ratio for k (#4 above) is constant, or if it's actually tied to the ratio in the real mail feed from which the training data is taken.
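As a cross-check, the k row of the table can be recomputed from the printed means and sdevs; a throwaway sketch (the helper name is mine, not from table.py or the spambayes tools):

```python
def k_value(h_mean, h_sdev, s_mean, s_sdev):
    # k = mean separation divided by the sum of the standard deviations
    return (s_mean - h_mean) / (h_sdev + s_sdev)

# Reproducing the first and last columns from the table:
# 50-200 column: (78.32 - 28.80) / (8.37 + 7.87) -> 3.05
# 200-50 column: (68.10 - 16.59) / (7.30 + 10.86) -> 2.84
```

The recomputed values round to the printed k row, which suggests table.py is combining the statistics the same way.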
This may be hard to measure for people who are using corpora augmented from several sources.

- Alex

From skip@pobox.com Mon Oct 7 21:13:14 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 7 Oct 2002 15:13:14 -0500 Subject: [Spambayes] CL2 results In-Reply-To: References: <3D9EF7D4.23399.2790EC5D@localhost> Message-ID: <15777.60122.260016.744625@montanaro.dyndns.org>

>> Also, turns out I had a lot of zero length message files that came up
>> as false negatives.. I've rm `find -empty` and rebal..

Tim> How *should* empty msgs be treated (that's a question for
Tim> everyone)? When there's nothing to go on, it's hard to decide
Tim> .

Well, even empty messages will have headers. Sounds like Brad's files were truly zero-length, that is, not really mail messages.

(I suspect this response is kind of late for this thread. I'm still working through mail problems on my new computer, so this will also serve as a test to see if it makes it out and back...)

Skip

From chk@pobox.com Mon Oct 7 21:17:20 2002 From: chk@pobox.com (Harald Koch) Date: Mon, 07 Oct 2002 16:17:20 -0400 Subject: [Spambayes] Re: Effects of ham to spam ratio In-Reply-To: popiel's message of "Mon, 07 Oct 2002 12:58:51 -0700". <20021007195851.A54C3F57F@cashew.wolfskeep.com> References: <20021007195851.A54C3F57F@cashew.wolfskeep.com> Message-ID: <9288.1034021840@elisabeth.cfrq.net>

> Executive summary: more spam is VERY good. 1:4 ham:spam is
> _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.

Thank the Gods I don't *receive* spam in that ratio...

--
Harald Koch

From tim@zope.com Tue Oct 8 01:47:01 2002 From: tim@zope.com (Tim Peters) Date: Mon, 7 Oct 2002 20:47:01 -0400 Subject: [Spambayes] CL2 results In-Reply-To: <15777.60122.260016.744625@montanaro.dyndns.org> Message-ID:

[Skip Montanaro]
> Well, even empty messages will have headers. Sounds like Brad's
> files were truly zero-length, that is, not really mail messages.

Yup, and they should come out with "a score" of 0.5 then.
Judging a msg from the headers alone (with an empty body) seems too much a crapshoot, though.

> (I suspect this response is kind of late for this thread. I'm
> still working through mail problems on my new computer, so this will
> also serve as a test to see if it makes it out and back...)

I can testify it made it one way. BTW, I don't think it's ever too late for a response. There's been an unreal amount of traffic on this list. I see I still have 134 msgs here I intended to reply to, and God only knows if I'll ever get to a fraction of them. So if someone feels a good point has been lost in the shuffle, please bring it up again.

From tim@zope.com Tue Oct 8 18:02:03 2002 From: tim@zope.com (Tim Peters) Date: Tue, 8 Oct 2002 13:02:03 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID:

This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment

The attached sets up an experiment:

    create a vector of 50 "probabilities" at random, uniformly
    distributed in (0.0, 1.0)

    combine them using Paul Graham's scheme, and using Gary
    Robinson's scheme

    record the results

    repeat 5000 times

The results should look familiar for those playing this game from the start:

Result for random vectors of 50 probs, + 0 forced to 0.99

Graham combining 5000 items; mean 0.50; sdev 0.47
-> min 9.54792e-022; median 0.506715; max 1
* = 35 items
0.00 2051 ***********************************************************
0.05  100 ***
0.10   75 ***
0.15   63 **
0.20   44 **
0.25   35 *
0.30   40 **
0.35   34 *
0.40   30 *
0.45   25 *
0.50   34 *
0.55   32 *
0.60   31 *
0.65   24 *
0.70   39 **
0.75   43 **
0.80   56 **
0.85   55 **
0.90  108 ****
0.95 2081 ************************************************************

Robinson combining 5000 items; mean 0.50; sdev 0.04
-> min 0.350831; median 0.500083; max 0.649056
* = 34 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35   20 *
0.40  450 **************
0.45 2027 ************************************************************
0.50 2019 ************************************************************
0.55  452 **************
0.60   32 *
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

IOW, Paul's scheme is almost always "certain" given 50 discriminators, even in the face of random input. Gary's is never "certain" then.

OTOH, do the experiment all over again, but attach one prob of 0.99 to each random vector of 50 probs. The probs are now systematically biased:

Result for random vectors of 50 probs, + 1 forced to 0.99

Graham combining 5000 items; mean 0.65; sdev 0.45
-> min 8.36115e-021; median 0.992403; max 1
* = 47 items
0.00 1353 *****************************
0.05   92 **
0.10   50 **
0.15   42 *
0.20   40 *
0.25   35 *
0.30   26 *
0.35   31 *
0.40   32 *
0.45   31 *
0.50   23 *
0.55   29 *
0.60   30 *
0.65   31 *
0.70   45 *
0.75   33 *
0.80   58 **
0.85   84 **
0.90  113 ***
0.95 2822 *************************************************************

Robinson combining 5000 items; mean 0.51; sdev 0.04
-> min 0.377845; median 0.513446; max 0.637992
* = 42 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    2 *
0.40  181 *****
0.45 1549 *************************************
0.50 2527 *************************************************************
0.55  698 *****************
0.60   43 **
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

There's a dramatic difference in the Paul results, while the Gary results move subtly (in comparison).
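The two combining schemes can be sketched directly in Python. This is a paraphrase of the experiment, not the attached combine.py (which was scrubbed from the archive); the formulas below are the commonly described Graham and Robinson geometric-mean rules and should be checked against the actual classifier source:

```python
import math
import random

def graham_combine(probs):
    # Paul Graham's scheme: P = prod(p) / (prod(p) + prod(1 - p)).
    p = math.prod(probs)
    q = math.prod(1.0 - x for x in probs)
    return p / (p + q)

def robinson_combine(probs):
    # Gary Robinson's geometric-mean scheme (as commonly described):
    #   P = 1 - (prod(1 - p))**(1/n),  Q = 1 - (prod(p))**(1/n)
    #   score = (1 + (P - Q) / (P + Q)) / 2
    n = len(probs)
    P = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

def experiment(n_random=50, n_forced=0, trials=1000, seed=42):
    # Mirror the experiment: random probs plus n_forced probs of 0.99.
    rng = random.Random(seed)
    g_scores, r_scores = [], []
    for _ in range(trials):
        probs = [rng.random() for _ in range(n_random)] + [0.99] * n_forced
        g_scores.append(graham_combine(probs))
        r_scores.append(robinson_combine(probs))
    return g_scores, r_scores
```

On purely random input, Robinson scores cluster tightly around 0.5 while Graham scores pile up near 0 and 1, matching the histograms above.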
If we force 10 additional .99 spamprobs, the differences are night and day:

Result for random vectors of 50 probs, + 10 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.01
-> min 0.213529; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    1 *
0.25    0
0.30    1 *
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 4998 *************************************************************

Robinson combining 5000 items; mean 0.59; sdev 0.03
-> min 0.49794; median 0.58555; max 0.694905
* = 51 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    2 *
0.50  412 *********
0.55 3068 *************************************************************
0.60 1447 *****************************
0.65   71 **
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

It's hard to know what to make of this, especially in light of the claim that Gary-combining has been proven to be the most sensitive possible test for rejecting the hypothesis that a collection of probs is uniformly distributed. At least in this test, Paul-combining seemed far more sensitive (even when the data is random).

Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining.
Even forcing 50 spamprobs of 0.99, the latter only moves up to an average of 0.7:

Result for random vectors of 50 probs, + 50 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.00
-> min 1; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 5000 *************************************************************

Robinson combining 5000 items; mean 0.70; sdev 0.02
-> min 0.628976; median 0.704543; max 0.810235
* = 45 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60   40 *
0.65 2070 **********************************************
0.70 2743 *************************************************************
0.75  146 ****
0.80    1 *
0.85    0
0.90    0
0.95    0

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: combine.py
Type: application/octet-stream
Size: 1294 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021008/6dd66b22/combine.exe
---------------------- multipart/mixed attachment--

From neale@woozle.org Tue Oct 8 18:45:36 2002 From: neale@woozle.org (Neale Pickett) Date: 08 Oct 2002 10:45:36 -0700 Subject: [Spambayes] quick hammie poll Message-ID:

RSVP, but only if you use hammie :)

1. Do you use the pickle store (pickle jar? :) or the anydbm store (-d
   option)?

2. How big is your store file?

3. Would you be able and willing to run an XML-RPC server process all
   the time for mail scoring?
(Note to Tim: my resistance is weakening ;)

Thanks
Neale

From neale@woozle.org Tue Oct 8 18:58:31 2002 From: neale@woozle.org (Neale Pickett) Date: 08 Oct 2002 10:58:31 -0700 Subject: [Spambayes] spamprob combining In-Reply-To: References: Message-ID:

So then, "Tim Peters" is all like:

> The attached sets up an experiment:
>
>     create a vector of 50 "probabilities" at random, uniformly
>     distributed in (0.0, 1.0)
>
>     combine them using Paul Graham's scheme, and using Gary
>     Robinson's scheme
>
>     record the results
>
>     repeat 5000 times
>
> The results should look familiar for those playing this game from the start:

Heh, I got an exception:

    Traceback (most recent call last):
      File "combine.py", line 56, in ?
        h1.display()
      File "Histogram.py", line 116, in display
        raise ValueError("nbuckets %g > 0 required" % nbuckets)
    TypeError: float argument required

I patched Histogram.py to do what I think you meant (Also ITYM "buckets", not "buckts"):

    @@ -111,6 +112,8 @@
         # buckts to a list of nbuckets counts, but only if at least one
         # data point is in the collection.
         def display(self, nbuckets=None, WIDTH=61):
    +        if nbuckets is None:
    +            nbuckets = self.nbuckets
             if nbuckets <= 0:
                 raise ValueError("nbuckets %g > 0 required" % nbuckets)
             self.compute_stats()

Submitted for your approval,
Neale
>         h1.display()
>       File "Histogram.py", line 116, in display
>         raise ValueError("nbuckets %g > 0 required" % nbuckets)
>     TypeError: float argument required
>
> I patched Histogram.py to do what I think you meant (Also ITYM
> "buckets", not "buckts"):
>
>     @@ -111,6 +112,8 @@
>          # buckts to a list of nbuckets counts, but only if at least one
>          # data point is in the collection.
>          def display(self, nbuckets=None, WIDTH=61):
>     +        if nbuckets is None:
>     +            nbuckets = self.nbuckets
>          if nbuckets <= 0:

Heh. I had made the same change here but neglected to check it in. Be my guest!

> Submitted for your approval,

I always approve of you, Neale.

From tim.one@comcast.net Tue Oct 8 22:45:02 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 08 Oct 2002 17:45:02 -0400 Subject: [Spambayes] Effects of ham to spam ratio In-Reply-To: <20021007195851.A54C3F57F@cashew.wolfskeep.com> Message-ID:

[T. Alexander Popiel]
> Executive summary: more spam is VERY good. 1:4 ham:spam is
> _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.
>
> I'm back with another unusual experiment. This time, I varied
> the ratio of ham to spam, while keeping the total number of
> messages trained and tested constant. Once again, I'm doing
> this using the all-defaults Robinson classifier. If someone
> gives me a good set of .ini files, I'd be more than happy to
> run this test using any of the central limit algorithms, too.

They're all the same, except for which one of

    use_central_limit: True
    use_central_limit2: True
    use_central_limit3: True

you want to use. Other than that, the spam cutoff ratio must be 0.5, and the only semi-automated way to extract the 4 error rates (fp/fn when certain/uncertain) is to set nbuckets to 4 and stare at the little histograms.

> I again used timcv.py as my test driver, this time with 200
> messages in each ham/spam set.

How many sets (-n10, -n5, ...?). Looks like 5.
> For the different runs, I used the --{ham,spam}-keep options to
> control how much of each set got used, with the total used always
> being 250 ham+spam from each pair. The script I used (along with
> all the run output, etc.) is on my website at:
>
>     http://www.wolfskeep.com/~popiel/spambayes/ratio
>
> I also mangled a version of cmp.py (now called table.py,
> also on the website) to generate the following output:
>
> -> tested 50 hams & 200 spams against 200 hams & 800 spams
> [... edited for brevity ...]
> -> tested 200 hams & 50 spams against 800 hams & 200 spams
>
> ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
> fp tot:          2       1       2       2       3       3       1
> fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
> fn tot:         12      17      20      28      28      30      36
> fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
> h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
> h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
> s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
> s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
> mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
> k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84
>
> There are several interesting things here:
>
> 1. The false positive rate remains insignificant throughout.
>
> 2. The false negative rate drops significantly as the ham:spam
>    ratio goes down. The more spam you have in your mailfeed,
>    the better this whole thing works.

The reason isn't clear, though: it may well have less to do with the ratio than with the absolute quantity of spam trained on. If there's sufficient variety in your spam, it could simply be that 200 is way too few to get a representative sampling of the diversity your spam, umm, enjoys.

> 3. The ham:spam ratio affects the spam sdev much more than the
>    ham sdev.

Which is more reason to be suspicious: sdev is a measure of how wild the data is. If the sdev gets steady as the absolute count increases, it means the data is "settling down".
Your spam sdev goes up by about 0.50 in each column, with no sign of settling down "to the left", which suggests that even at the 50-200 extreme it's *still* finding plenty of new stuff in the spam.

Do you have a lot of Asian spam? The gimmicks we've got for that ("skip" and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is just a lucky accident.

> 4. Tim's k value (mean separation divided by sum of standard
>    deviations) is best with slightly less ham than spam (at 2:3),
>    which happens to be about the same ratio as in my real mailfeed.
>
> It would be very interesting to find out if the best ham:spam
> ratio for k (#4 above) is constant, or if it's actually tied to
> the ratio in the real mail feed from which the training data is
> taken. This may be hard to measure for people who are using
> corpora augmented from several sources.

It would be better to get independent results from the same kind of test but run with more data. I know that, for example, in my data, I have to train on several thousand spam before the improvement in spam identification slows to a crawl.

Thanks for the report, Alex! Well done and provocative.

From popiel@wolfskeep.com Tue Oct 8 23:58:37 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 08 Oct 2002 15:58:37 -0700 Subject: [Spambayes] Effects of ham to spam ratio In-Reply-To: Message from Tim Peters of "Tue, 08 Oct 2002 17:45:02 EDT." References: Message-ID: <20021008225837.32620F588@cashew.wolfskeep.com>

In message: Tim Peters writes:

>the only semi-automated way to extract the 4 error rates (fp/fn when
>certain/uncertain) is to set nbuckets to 4 and stare at the little
>histograms.

I'll see if I can get something to read those histograms for me, when I start doing the central limit testing. ;-)

>> I again used timcv.py as my test driver, this time with 200
>> messages in each ham/spam set.
>
>How many sets (-n10, -n5, ...?). Looks like 5.
Yeah, I was only using 5 sets, even though I have 10 available. Doh!

>> There are several interesting things here:
>>
>> 1. The false positive rate remains insignificant throughout.
>>
>> 2. The false negative rate drops significantly as the ham:spam
>>    ratio goes down. The more spam you have in your mailfeed,
>>    the better this whole thing works.
>
>The reason isn't clear, though: it may well have less to do with the ratio
>than with the absolute quantity of spam trained on. If there's sufficient
>variety in your spam, it could simply be that 200 is way too few to get a
>representative sampling of the diversity your spam, umm, enjoys
>
>> 3. The ham:spam ratio affects the spam sdev much more than the
>>    ham sdev.
>
>Which is more reason to be suspicious: sdev is a measure of how wild the
>data is. If the sdev gets steady as the absolute count increases, it means
>the data is "settling down". Your spam sdev goes up by about 0.50 in each
>column, with no sign of settling down "to the left", which suggests that
>even at the 50-200 extreme it's *still* finding plenty of new stuff in the
>spam.

True. Hrm.

>Do you have a lot of Asian spam? The gimmicks we've got for that ("skip"
>and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is
>just a lucky accident.

Nope. No Asian spam at all. My spam is mostly in English, with a fair amount of German porn spam (I have _no_ idea how I got onto that list) and one or two spams in Spanish or Italian (I'm not sure which).

>> 4. Tim's k value (mean separation divided by sum of standard
>>    deviations) is best with slightly less ham than spam (at 2:3),
>>    which happens to be about the same ratio as in my real mailfeed.
>>
>> It would be very interesting to find out if the best ham:spam
>> ratio for k (#4 above) is constant, or if it's actually tied to
>> the ratio in the real mail feed from which the training data is
>> taken.
>> This may be hard to measure for people who are using corpora
>> augmented from several sources.
>
>It would be better to get independent results from the same kind of
>test but run with more data. I know that, for example, in my data, I have
>to train on several thousand spam before the improvement in spam
>identification slows to a crawl.

I'll rerun using all 10 sets instead of just 5. *blush*

- Alex

From popiel@wolfskeep.com Wed Oct 9 06:02:43 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 08 Oct 2002 22:02:43 -0700 Subject: [Spambayes] More ratio experiments Message-ID: <20021009050244.1D47FF588@cashew.wolfskeep.com>

Executive summary: Yes, a low ham:spam ratio is good even with larger (to the limit of my available corpora) data sets, but the degree of the goodness seems to go down as the training corpora get larger. Also, the 2:3 ham:spam ratio seems to be interesting for some reason...

The methodology I'm using for this experiment is almost identical to that I used for my original ratio experiment:

    http://www.wolfskeep.com/~popiel/spambayes/ratio

The only thing that I changed was the number of sets I was using for the timcv.py (from 5 in the original experiment to 8, 10, and 15). At 10 this more than doubled and at 15 it more than tripled the training set size for each run, keeping the testing set size the same. For the runs with 15 sets, I had to rebalance my ham and spam, and I could only go up to 1:1 instead of 4:1 due to lack of raw data.

The original experiment (with 5 sets) produced:

-> tested 50 hams & 200 spams against 200 hams & 800 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       1       2       2       3       3       1
fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
fn tot:         12      17      20      28      28      30      36
fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84

The new experiment (with 8 sets) produced:

-> tested 50 hams & 200 spams against 350 hams & 1400 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 1400 hams & 350 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          1       2       3       2       4       2       2
fp %:         0.25    0.33    0.38    0.20    0.33    0.14    0.12
fn tot:         18      27      34      40      44      44      45
fn %:         1.12    1.93    2.83    4.00    5.50    7.33   11.25
h mean:      26.37   23.64   21.76   19.87   19.03   18.30   17.02
h sdev:       7.73    7.18    6.95    6.89    7.01    7.16    7.35
s mean:      78.66   77.49   76.49   74.85   73.92   72.44   69.86
s sdev:       7.96    8.50    8.64    9.14    9.74   10.31   10.82
mean diff:   52.29   53.85   54.73   54.98   54.89   54.14   52.84
k:            3.33    3.43    3.51    3.43    3.28    3.10    2.91

With 10 sets it produced:

-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[... edited for brevity ...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       3       3       3       4       3       3
fp %:         0.40    0.40    0.30    0.24    0.27    0.17    0.15
fn tot:         32      41      43      43      47      48      51
fn %:         1.60    2.34    2.87    3.44    4.70    6.40   10.20
h mean:      24.25   21.75   20.12   18.87   18.33   17.72   16.71
h sdev:       7.52    7.13    7.04    7.09    7.16    7.31    7.43
s mean:      77.56   76.66   75.93   74.85   74.13   72.80   70.57
s sdev:       8.24    8.62    8.77    9.09    9.68    9.90   10.54
mean diff:   53.31   54.91   55.81   55.98   55.80   55.08   53.86
k:            3.38    3.49    3.53    3.46    3.31    3.20    3.00

With 15 sets it produced:

-> tested 50 hams & 200 spams against 700 hams & 2800 spams
[... edited for brevity ...]
-> tested 125 hams & 125 spams against 1750 hams & 1750 spams

ham-spam:   50-200  75-175 100-150 125-125
fp tot:          2       3       4       3
fp %:         0.27    0.27    0.27    0.16
fn tot:         61      69      62      62
fn %:         2.03    2.63    2.76    3.31
h mean:      21.13   19.54   18.44   17.90
h sdev:       6.96    7.02    6.95    7.24
s mean:      76.89   76.47   76.41   75.85
s sdev:       8.35    8.65    8.85    9.02
mean diff:   55.76   56.93   57.97   57.95
k:            3.64    3.63    3.67    3.56

The value of a small ham:spam ratio seems to go down as the training set size increases... or perhaps the sweet spot on the curve is moving, since the fn rates went up on the small ham:spam ratios while they went down on the large ham:spam ratios.

Of note, the best k seems to remain at the 2:3 ratio, independent of training set size. This is also the point at which the fn rates switched directions as more data was added. _Something_ is interesting about that ratio. This could be due to that being near the real ratio of my mail, or it could be due to some of the tunables (spam cutoff, a, s, whatever) in the classifier, or it could be something completely unexpected.

All of this is (of course) on my website at:

    http://www.wolfskeep.com/~popiel/spambayes/ratio2

- Alex

From rob@hooft.net Wed Oct 9 08:09:49 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Wed, 09 Oct 2002 09:09:49 +0200 Subject: [Spambayes] More ratio experiments References: <20021009050244.1D47FF588@cashew.wolfskeep.com> Message-ID: <3DA3D63D.5090203@hooft.net>

T. Alexander Popiel wrote:

> Executive summary: Yes, a low ham:spam ratio is good even with larger
> (to the limit of my available corpora) data sets, but the degree of the
> goodness seems to go down as the training corpora get larger. Also,
> the 2:3 ham:spam ratio seems to be interesting for some reason...
>
> The methodology I'm using for this experiment is almost identical to
> that I used for my original ratio experiment:
>
>     http://www.wolfskeep.com/~popiel/spambayes/ratio
>
> The only thing that I changed was the number of sets I was using for
> the timcv.py (from 5 in the original experiment to 8, 10, and 15).
> At 10 this more than doubled and at 15 it more than tripled the
> training set size for each run, keeping the testing set size the same.

Nope, it is the other way around: you still train on s+h=250 messages all the time, you're just testing the scores of more messages.

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From Alexander@Leidinger.net Wed Oct 9 09:33:25 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Wed, 9 Oct 2002 10:33:25 +0200 Subject: [Spambayes] quick hammie poll In-Reply-To: References: Message-ID: <20021009103325.62746537.Alexander@Leidinger.net>

On 08 Oct 2002 10:45:36 -0700 Neale Pickett wrote:

> RSVP, but only if you use hammie :)

I don't use it for classifying my regular mail; at the moment I just use it to separate the spam from the ham in my mega corpus (10^6 mails total).

> 1. Do you use the pickle store (pickle jar? :) or the anydbm store (-d
>    option)?

I hadn't time to investigate my dbm issue, so I still use the pickle store.

> 2. How big is your store file?

This depends... :-) If I only train on 11 ham sets (~950 msgs each) and one spam set (~4600 at the moment) it only has 9MB.

> 3. Would you be able and willing to run an XML-RPC server process all
>    the time for mail scoring?

I don't think it makes a difference in my scenario, but if it makes a speed difference in a delivery pipeline: sure (is it an option to make this optional?).

--
Press every key to continue.
http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7

From tim_one@email.msn.com Wed Oct 9 09:35:41 2002 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 9 Oct 2002 04:35:41 -0400 Subject: [Spambayes] More ratio experiments In-Reply-To: <3DA3D63D.5090203@hooft.net> Message-ID:

[T. Alexander Popiel]
> ...
> The only thing that I changed was the number of sets I was using for
> the timcv.py (from 5 in the original experiment to 8, 10, and 15).
> At 10 this more than doubled and at 15 it more than tripled the
> training set size for each run, keeping the testing set size the same.

[Rob W.W. Hooft]
> Nope, it is the other way around: you still train on s+h=250 messages
> all the time, you're just testing the scores of more messages.

It can be confusing if you didn't write these test drivers, which gives me a small advantage here. Alex left some of the test driver output intact, which really helps:

> The original experiment (with 5 sets) produced:
>
> -> tested 50 hams & 200 spams against 200 hams & 800 spams
> [... edited for brevity ...]
> -> tested 200 hams & 50 spams against 800 hams & 200 spams

...

> The new experiment (with 8 sets) produced:
>
> -> tested 50 hams & 200 spams against 350 hams & 1400 spams
> [... edited for brevity ...]
> -> tested 200 hams & 50 spams against 1400 hams & 350 spams

So all can see that the # of ham & spam trained on really did increase; that's why the test driver prints this stuff, and the summary file retains it, of course.
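The two drivers differ in how they pair training sets with test sets; a minimal sketch (function names here are illustrative, not the real driver APIs, and the pairing logic is a paraphrase of how timcv and timtest are described on this list):

```python
# Illustrative sketch of the set pairings the two drivers produce,
# expressed as (train_set_indices, test_set_indices) tuples.

def timcv_runs(n):
    # timcv-style: n classifiers; each trains on n-1 sets and predicts
    # against the sole remaining set (leave-one-set-out).
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]

def timtest_runs(n):
    # timtest-style: n classifiers; each trains on 1 set and predicts
    # against each of the n-1 remaining sets (a grid of runs).
    return [([i], [j]) for i in range(n) for j in range(n) if j != i]
```

With n=5, timcv does 5 runs training on 4 sets each, while timtest does 20 predictions training on only 1 set each, which is why timtest is the much harder test.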
If you're running timcv with n sets:

    n classifiers are built

    1 run is done with each classifier

    each classifier is trained on n-1 sets, and predicts against the
    sole remaining set (the set not used to train the classifier)

    mboxtest does the same

    timcv should not be used for central limit tests (it requires
    incremental learning and unlearning)

If you're running timtest with n sets:

    n classifiers are built

    n-1 runs are done with each classifier

    each classifier is trained on 1 set, and predicts against each of
    the n-1 remaining sets (those not used to train the classifier)

    central limit tests are fine with timtest

    this is a much harder test than timcv, because it trains on less
    data, and makes each classifier predict against n-1 times more
    data than it's been taught about

From jm@jmason.org Wed Oct 9 13:21:11 2002 From: jm@jmason.org (Justin Mason) Date: Wed, 09 Oct 2002 13:21:11 +0100 Subject: [Spambayes] fully-public corpus of mail available Message-ID: <20021009122116.6EB2416F03@jmason.org>

(Please feel free to forward this message to other possibly-interested parties.)

Hi all,

One of the big problems working with spam classification is finding good mail to test with. There are few public corpora available; Ion Androutsopoulos' "Ling-spam" corpus is one (hi Ion!), but unfortunately this does not contain all of the mail message data, so would not be useful to a SpamAssassin-style system (which relies heavily on header data), for example.

Another effect of not having a common, shared corpus is the difficulty this introduces in comparing accuracy rates between spam filter software; since everyone tests using different corpora, statistics can be unportable as a result.

Building public corpora is difficult, as it typically involves saving your own (classified) mail. This brings privacy problems, as your mail senders may not wish to see this made public.
But what the heck, that's what I've done anyway ;) Here's a public corpus I've assembled from my own corpora, removing messages which were not public in the first place. Please feel free to download it and use it for spam-filter development. It's quite small, but should be big enough for use as a reference corpus, at least, so that hit-rate statistics can be compared across tools. Hope it helps. It lives here: http://spamassassin.org/publiccorpus/ and here's the README.txt: Welcome to the SpamAssassin public mail corpus. This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:

  - All headers are reproduced in full.  Some address obfuscation has
    taken place; hostnames in some cases have been replaced with
    "example.com", which should have a valid MX record (if I recall
    correctly).  In most cases though, the headers appear as they were
    received.

  - All of these messages were posted to public fora, were sent to me in
    the knowledge that they may be made public, were sent by me, or
    originated as newsletters from public news web sites.

  - Copyright for the text in the messages remains with the original
    senders.

OK, now onto the corpus description. It's split into three parts, as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 350 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any
    spammish signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

The corpora are prefixed with "200210", because that's the date when I assembled them, so it's as good a version string as anything else ;) . They are compressed using "bzip2". This corpus lives at http://spamassassin.org/publiccorpus/ . Mail jm - public - corpus AT jmason dot org if you have questions, or to donate mail.
(Oct 9 2002 jm) From popiel@wolfskeep.com Wed Oct 9 16:37:43 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 09 Oct 2002 08:37:43 -0700 Subject: [Spambayes] More ratio experiments In-Reply-To: Message from "Tim Peters" of "Wed, 09 Oct 2002 04:35:41 EDT." References: Message-ID: <20021009153743.9EC45F54A@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >Alex left some of the test driver output intact All of the test driver output is available at http://www.wolfskeep.com/~popiel/spambayes/ratio2 just in case someone wants to look at it. Histograms, more verbose indications of the training and testing cycles, false positive excerpts, and everything. After sleeping on the data (yes, my bedroom is over the computer rooms ;-) ), some more things are niggling at me... like the error rates (specifically fn) going _UP_ as more training data is added for the very low ham:spam ratios. I'm guessing that that's due to the classifier seeming to discover that yes, there _is_ ham in the universe, and maybe more stuff should be classified as ham. I'm also wondering if there's a point at which dropping the ham:spam ratio starts increasing the fn rate, holding the training set size constant (this I can test), and if there's an amount of training data above which low ham:spam is no longer good, or even bad (this I don't have enough data to test). Lastly, I'm wondering if I should even bother with the non-central-limit stuff anymore, since the central-limit stuff seems from other reports to be more interesting. (I really ought to do comparisons among the 7 extant classifiers (default, clt[123] x {cl,rms}pik) on my data... heck, it might even be getting close to shootout time again... - Alex From popiel@wolfskeep.com Wed Oct 9 19:04:39 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Wed, 09 Oct 2002 11:04:39 -0700 Subject: [Spambayes] Modifications to timcv.py Message-ID: <20021009180439.62465F54A@cashew.wolfskeep.com> The inability to use timcv.py with the central limit stuff annoyed me. I offer this patch to correct that problem... - Alex

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.9
diff -u -r1.9 timcv.py
--- timcv.py    24 Sep 2002 05:37:11 -0000      1.9
+++ timcv.py    9 Oct 2002 17:59:56 -0000
@@ -26,6 +26,15 @@
 at least on of {--ham-keep, --spam-keep} is specified.  If -s isn't
 specifed, the seed is taken from current time.
 
+If you want full retraining for each classifier (because untrain and
+retrain don't work),
+
+    --trainstyle arg
+        Use one of the following training styles:
+            partial: train on everything, then untrain individual sets
+            full: train from scratch on only applicable sets
+        partial is the historical (and default) behaviour.
+
 In addition, an attempt is made to merge bayescustomize.ini into the
 options.  If that exists, it can be used to change the settings in
 Options.options.
 """
@@ -48,7 +57,7 @@
     print >> sys.stderr, __doc__ % globals()
     sys.exit(code)
 
-def drive(nsets):
+def drive(nsets, trainstyle):
     print options.display()
 
     hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]
@@ -67,16 +76,28 @@
             spamstream = msgs.SpamStream(s, [s])
 
             if i > 0:
-                # Forget this set.
-                d.untrain(hamstream, spamstream)
+                if trainstyle == 'partial':
+                    # Forget this set.
+                    d.untrain(hamstream, spamstream)
+                elif trainstyle == 'full':
+                    # Retrain with the other sets.
+                    hname = "%s-%d, except %d" % (hamdirs[0], nsets, i + 1)
+                    h2 = hamdirs * 1
+                    del h2[i]
+                    sname = "%s-%d, except %d" % (spamdirs[0], nsets, i + 1)
+                    s2 = spamdirs * 1
+                    del s2[i]
+                    d.new_classifier()
+                    d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2))
 
             # Predict this set.
             d.test(hamstream, spamstream)
             d.finishtest()
 
             if i < nsets - 1:
-                # Add this set back in.
-                d.train(hamstream, spamstream)
+                if trainstyle == 'partial':
+                    # Add this set back in.
+                    d.train(hamstream, spamstream)
 
     d.alldone()
 
@@ -85,11 +106,12 @@
     try:
         opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
-                                   ['ham-keep=', 'spam-keep='])
+                                   ['ham-keep=', 'spam-keep=', 'trainstyle='])
     except getopt.error, msg:
         usage(1, msg)
 
     nsets = seed = hamkeep = spamkeep = None
+    trainstyle = 'partial'
     for opt, arg in opts:
         if opt == '-h':
             usage(0)
@@ -101,14 +123,18 @@
             hamkeep = int(arg)
         elif opt == '--spam-keep':
             spamkeep = int(arg)
+        elif opt == '--trainstyle':
+            trainstyle = arg
 
     if args:
         usage(1, "Positional arguments not supported")
     if nsets is None:
         usage(1, "-n is required")
+    if trainstyle not in ('partial', 'full'):
+        usage(1, "Unknown train style '%s'" % trainstyle)
 
     msgs.setparms(hamkeep, spamkeep, seed)
-    drive(nsets)
+    drive(nsets, trainstyle)
 
 if __name__ == "__main__":
     main()

From tim@zope.com Wed Oct 9 19:30:03 2002 From: tim@zope.com (Tim Peters) Date: Wed, 9 Oct 2002 14:30:03 -0400 Subject: [Spambayes] Modifications to timcv.py In-Reply-To: <20021009180439.62465F54A@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > The inability to use timcv.py with the central limit stuff > annoyed me. I offer this patch to correct that problem... Thank you! It annoys me too. I can't work on this now, but will check this in (or a minor variant) tonight, when I can test it first. PS: Don't worry -- I won't tell anyone you're writing Python code. From bkc@murkworks.com Wed Oct 9 21:09:50 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 09 Oct 2002 16:09:50 -0400 Subject: [Spambayes] runratio with timcv.py Message-ID: <3DA45489.14551.B29C989@localhost> Hmm, Well I didn't get tim's message about "using timcv.py for incremental is bad" until after I started my ratio testing. This is timcv.py on -n 10 with 1200 messages total per set.
use_central_limit: True

I also have timtest.py running, I have no idea if runratio.sh will handle its output.. Will post when that finishes running. Also, I modified runratio.sh to handle arbitrary list of spam/ham count steps.. want me to post?

(last stat line)
-> tested 1050 hams & 150 spams against 9450 hams & 1350 spams

And the table

ham-spam:  150-1050 300-900 450-750 600-600 750-450 900-300 1050-150
fp tot:          30      39      45      48      48      50       40
fp %:          2.00    1.30    1.00    0.80    0.64    0.56     0.38
fn tot:          14      20      17      19      14      16       15
fn %:          0.13    0.22    0.23    0.32    0.31    0.53     1.00
h mean:        3.31    2.36    1.93    1.74    1.53    1.36     1.08
h sdev:       13.41   10.95    9.86    9.47    8.86    8.26     7.31
s mean:       99.37   99.16   99.02   98.74   98.57   98.26    97.11
s sdev:        5.61    6.50    7.04    8.08    8.54    9.46    12.29
mean diff:    96.06   96.80   97.09   97.00   97.04   96.90    96.03
k:             5.05    5.55    5.74    5.53    5.58    5.47     4.90

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements

From bkc@murkworks.com Wed Oct 9 22:00:03 2002 From: bkc@murkworks.com (Brad Clements) Date: Wed, 09 Oct 2002 17:00:03 -0400 Subject: [Spambayes] runratio with timtest.py Message-ID: <3DA4604E.10211.B57C38C@localhost>

use_central_limit: true

this is runratio.sh, but using timtest.py

(last stat line is)
-> tested 1050 hams & 150 spams against 1050 hams & 150 spams

ham-spam:  150-1050 300-900 450-750 600-600 750-450 900-300 1050-150
fp tot:         237     213     208     209     168     119       59
fp %:          8.40    4.37    2.72    1.69    1.01    0.58     0.20
fn tot:          34      42      57      87     130     186      181
fn %:          0.06    0.16    0.28    0.54    1.18    2.57     7.23
h mean:       16.84    7.22    4.53    3.38    2.43    1.68     0.96
h sdev:       26.45   19.08   15.16   13.02   10.96    9.08     6.82
s mean:       99.66   99.18   98.71   97.93   96.71   94.66    88.92
s sdev:        4.11    6.34    7.92   10.05   12.62   16.17    23.10
mean diff:    82.82   91.96   94.18   94.55   94.28   92.98    87.96
k:             2.71    3.62    4.08    4.10    4.00    3.68     2.94

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements

From richie@entrian.com Wed Oct 9 22:08:44 2002 From: richie@entrian.com (Richie Hindle) Date:
Wed, 09 Oct 2002 22:08:44 +0100 Subject: [Spambayes] quick hammie poll In-Reply-To: References: Message-ID: Hi Neale, > RSVP, but only if you use hammie :) I use it to create the pickle, but not to classify mails (I use pop3proxy for that, surprise surprise!) > 1. Do you use the pickle store (pickle jar? :) or the anydbm store (-d > option)? Pickle (but mostly out of laziness, not for any concrete reason). > 2. How big is your store file? 4,775,607 bytes. That's after training on around 4,200 messages. > 3. Would you be able and willing to run an XML-RPC server process all > the time for mail scoring? Sure. I have to start and stop pop3proxy anyway, so that would be no problem (but for exactly that reason I'm probably not the sort of user you should be asking...) -- Richie Hindle richie@entrian.com From mhammond@skippinet.com.au Wed Oct 9 23:38:42 2002 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 10 Oct 2002 08:38:42 +1000 Subject: [Spambayes] Demo Outlook Plugin available Message-ID: Hi all, I just released new win32all builds that contain support for Microsoft Outlook Extensions. If you install win32all-149, 150 or the most recent CVS snapshot build, you will find a file win32com\demos\outlookAddin.py - please see the comments in the file for information on how to install and test this plugin. Please let me know if you try it. Also, feel free to contact me if you need some help turning it into something useful. Mark. From jbublitz@nwinternet.com Wed Oct 9 23:55:47 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Wed, 09 Oct 2002 15:55:47 -0700 (PDT) Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: On 08-Oct-02 Tim Peters wrote: > It's hard to know what to make of this, especially in light of > the claim that Gary-combining has been proven to be the most > sensitive possible test for rejecting the hypothesis that a > collection of probs is uniformly distributed. 
At least in this
> test, Paul-combining seemed far more sensitive (even when the
> data is random).

> Intuitively, it *seems* like it would be good to get something
> not so insanely sensitive to random input as Paul-combining, but
> more sensitive to overwhelming amounts of evidence than
> Gary-combining.  Even forcing 50 spamprobs of 0.99, the latter
> only moves up to an average of 0.7:

Since my last msg was incomprehensible, I'm just going to attach my code at the bottom and refer to it. Graham's original score calculation - product/(product + inverseProduct) - does give the kind of score distribution you described. If you substitute Gary Robinson's suggestion (see below - last few lines), the score distribution does spread out to the center a little bit. You can get Robinson's scoring calculation (as below) to produce a normal distribution around the mean ham or spam score if you either:

  a. Increase VECTOR_SIZE (max_discriminators??) - a value of around 100
     seems to do pretty well

  b. Instead of selecting the most extreme N word probabilities from the
     msg being tested, select the words randomly from the list of words
     in the msg (not shown in code below).  You immediately
     (VECTOR_SIZE = 15) get a normal distribution around the means, but
     accuracy sucks until you select 75 to 100 words/msg randomly.

Neither (a) nor (b) works as well as the 15 most extreme words on my test data. Also, Robinson's calculation doesn't produce ham at 0.99 or spam at 0.01 - in fact the msgs that I had a hard time classifying manually are (mostly) the ones that fall near the cutoff. Note also that the code below will produce an unpredictable score if the msg contains only 30 .01 words and 30 .99 words. It depends on how pairs.sort(...) handles ties. Making the limits asymmetrical (eg .989 and .01 instead of .99/.01) doesn't seem to work very well. The other thing that helps make the scores extreme in actual use is that the distribution of word probabilities is extreme.
For my corpora using the code below I get 169378 unique tokens (from 24000 msgs, 50% spam):

    Probability     Number of Tokens   % of Total
    [0.00, 0.01)          46329          27.4%  (never in spam)
    [0.99, 1.00)         104367          61.7%  (never in ham)
                                         -----
                                         89.1%

From looking at failures (and assuming passes behave similarly) the 10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the work, which makes sense, since those are the most commonly used words. My experience has been that the tail tips of the score distribution maintain about the same distance from the mean score no matter what you do. If you improve the shape of the distribution (make it look more normal), you move the tails about the same distance as the distribution has spread out, and the ham and spam tails overlap more and more, increasing the fp/fn rates. The little testing I did on Spambayes (last week's CVS) seemed to show the same effect. For the code below, if I train on 8000 msgs (50% spam) and then test 200, retrain on those 200, and repeat for 16000 msgs, I get 4 fns (3 are identical msgs from the same sender with different dates, all are Klez msgs) and 1 fp (an ISP msg "Routine Service Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The failures all scored in the range [0.495, 0.511] (cutoff at 0.50). I ran the SA Corpus today also and don't get any failures if I train on 8K of my msgs and 50/100 of their msgs (worse results under other conditions), but the sample sizes there are too small to do an adequate training sample and have enough test data to have confidence in the results. I can post those results if anyone is interested. Graham's method was basically designed to produce extreme scores, and the distribution of words in the data seems to reinforce that. If it's of any use to anybody (it's certainly beyond me), both the distribution of msg scores and distribution of word probabilities look like exponential or Weibull distributions.
(They're "bathtub" curves, if anyone is familiar with reliability statistics). This is all based on my data, which is not the same as your data. YMMV. Jim

import re

# classes posted to c.l.p by Erik Max Francis
# algorithm from Paul Graham ("A Plan for Spam")

# was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
# changed to catch Asian charsets
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

FREQUENCY_THRESHHOLD = 1    # was 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
# changed to improve distribution 'width' because
# of smaller token count in training data
GOOD_PROB = 0.0001          # was 0.01
BAD_PROB = 0.9999           # was 0.99
VECTOR_SIZE = 15
UNKNOWN_PROB = 0.5          # was 0.4 or 0.2

# remove mixed alphanumerics or strictly numeric:
# eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
num_re = re.compile(r"^[0-9]+")

class Corpus(dict):
    # instantiate one training Corpus for spam, one for ham,
    # and then one Corpus for each test msg as msgs are tested
    # (the msg Corpus instance is destroyed after testing the msg)
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    # process is used to extract tokens from msg, either in building
    # the training sample or when testing a msg (can process entire
    # msg or one part of msg at a time); 'data' is a string
    def process(self, data):
        tokens = TOKEN_RE.findall(str(data))
        if not len(tokens):
            return
        # added the first 'if' in the loop to reduce
        # total # of tokens by >75%
        deletes = 0
        for token in tokens:
            if (len(token) > 20) \
               or (pn1_re.search(token) != None) \
               or (pn2_re.search(token) != None) \
               or (num_re.search(token) != None):
                deletes += 1
                continue
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
        # count tokens, not msgs
        self.count += len(tokens) - deletes

class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    # 'build' constructs the dict of token: probability; run once after
    # training from the ham/spam Corpus instances; the ham/spam Corpus
    # instances can be destroyed (after saving?) after 'build' is run
    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        # print ngood, nbad, float(nbad)/float(ngood)
        for token in good.keys() + bad.keys():  # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS * good.get(token, 0)
                b = BAD_BIAS * bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHHOLD:
                    # the 'min's are leftovers from counting
                    # msgs instead of tokens for ngood, nbad
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB, min(BAD_PROB, badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:VECTOR_SIZE]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        # Graham scoring - was:
        #   return pairs, significant, product/(product + inverseProduct)
        # 'pairs' and 'significant' added to assist data logging, evaluation
        # Robinson scoring - don't know why, but this works great
        n = float(len(significant))  # n could be < VECTOR_SIZE
        # div by zero possible if no headers (and msg has no body)
        try:
            P = 1 - inverseProduct ** (1/n)
            Q = 1 - product ** (1/n)
            S = (1 + (P - Q)/(P + Q))/2
        except:
            S = 0.99
        return pairs, significant, S

From tim.one@comcast.net Thu Oct 10 01:34:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 09 Oct 2002 20:34:15 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: [Tim]
> ...
> Intuitively, it *seems* like it would be good to get something not so
> insanely sensitive to random input as Paul-combining, but more
> sensitive to overwhelming amounts of evidence than Gary-combining.
So there's a new option,

    [Classifier]
    use_tim_combining: True

The comments (from Options.py) explain it:

    # For the default scheme, use "tim-combining" of probabilities.  This
    # has no effect under the central-limit schemes.  Tim-combining is a
    # kind of cross between Paul Graham's and Gary Robinson's combining
    # schemes.  Unlike Paul's, it's never crazy-certain, and compared to
    # Gary's, in Tim's tests it greatly increased the spread between mean
    # ham-scores and spam-scores, while simultaneously decreasing the
    # variance of both.  Tim needed a higher spam_cutoff value for best
    # results, but spam_cutoff is less touchy than under Gary-combining.
    use_tim_combining: False

"Tim combining" simply takes the geometric mean of the spamprobs as a measure of spamminess S, and the geometric mean of 1-spamprob as a measure of hamminess H, then returns S/(S+H) as "the score". This is well-behaved when fed random, uniformly distributed probabilities, but isn't reluctant to let an overwhelming number of extreme clues lead it to an extreme conclusion (although you're not going to see it give Graham-like 1e-30 or 1.0000000000000 scores). Don't use a central-limit scheme with this (it has no effect on those). If you test it, use whatever variations on the "all default" scheme you usually use, but it will probably help to boost spam_cutoff. Note that the default max_discriminators is still 150, and that's what I used below.
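As a minimal sketch of that description (not the code from the classifier itself; the function name is mine), tim-combining of a list of spamprobs looks like:

```python
import math

def tim_combine(probs):
    # S = geometric mean of the spamprobs, H = geometric mean of (1 - p);
    # the score is S/(S+H).  Logs are used to avoid underflow on long lists.
    n = len(probs)
    s = math.exp(sum(math.log(p) for p in probs) / n)
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return s / (s + h)

print(tim_combine([0.5] * 20))    # no evidence either way -> 0.5
print(tim_combine([0.99] * 20))   # extreme, but never 1.0000000000000
```

Note how a run of uniform 0.5 probabilities yields exactly 0.5, while a run of 0.99s yields 0.99 rather than crazy-certain 1.0.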
Here's a 10-set cross-validation run on my data, restricted to 100 ham and 100 spam per set, with all defaults, except

                       before  after
                       ------  -----
    use_tim_combining  False   True
    spam_cutoff        0.55    0.615

-> tested 100 hams & 100 spams against 900 hams & 900 spams
[ditto 19 times]

false positive percentages
    0.000  0.000  tied
    1.000  0.000  won  -100.00%
    1.000  1.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 2 to 1  won  -50.00%
mean fp % went from 0.2 to 0.1  won  -50.00%

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 1 to 1  tied
mean fn % went from 0.1 to 0.1  tied

The real story here is in the score distributions; contrary to what the comment said above, the ham-score variance increased with this little data:

ham mean                      ham sdev
 30.63   18.80  -38.62%        6.03    6.83  +13.27%
 29.31   17.35  -40.81%        5.48    6.84  +24.82%
 29.96   18.50  -38.25%        6.95    9.02  +29.78%
 29.66   18.12  -38.91%        5.89    6.81  +15.62%
 29.51   17.34  -41.24%        5.73    6.71  +17.10%
 29.40   17.43  -40.71%        5.73    6.61  +15.36%
 29.75   17.74  -40.37%        5.76    6.96  +20.83%
 29.71   18.17  -38.84%        5.97    6.48   +8.54%
 31.98   20.41  -36.18%        5.96    8.02  +34.56%
 29.83   18.11  -39.29%        4.75    5.41  +13.89%

ham mean and sdev for all runs
 29.97   18.20  -39.27%        5.90    7.08  +20.00%

spam mean                     spam sdev
 79.23   88.38  +11.55%        6.96    5.52  -20.69%
 79.40   88.70  +11.71%        7.00    5.64  -19.43%
 78.68   88.06  +11.92%        6.69    5.13  -23.32%
 79.65   89.01  +11.75%        7.20    5.22  -27.50%
 79.91   88.87  +11.21%        6.35    4.67  -26.46%
 80.47   89.16  +10.80%        7.22    6.06  -16.07%
 80.94   89.78  +10.92%        6.60    4.45  -32.58%
 80.30   89.41  +11.34%        6.95    5.49  -21.01%
 78.54   87.70  +11.66%        7.30    6.45  -11.64%
 80.06   89.06  +11.24%        6.98    5.43  -22.21%

spam mean and sdev for all runs
 79.72   88.81  +11.40%        6.97    5.47  -21.52%
ham/spam mean difference: 49.75 70.61 +20.86 So before, the score equidistant from both means was 52.78, at 3.87 sdevs from each; after, it was 58.03, at 5.63 sdevs from each. The populations are much better separated by this measure. Histograms before: -> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90 -> min 13.521; median 29.6919; max 60.8937 * = 2 items ... 13 2 * 14 0 15 2 * 16 8 **** 17 4 ** 18 9 ***** 19 17 ********* 20 14 ******* 21 16 ******** 22 24 ************ 23 38 ******************* 24 47 ************************ 25 62 ******************************* 26 65 ********************************* 27 69 *********************************** 28 73 ************************************* 29 70 *********************************** 30 76 ************************************** 31 70 *********************************** 32 61 ******************************* 33 51 ************************** 34 50 ************************* 35 34 ***************** 36 30 *************** 37 27 ************** 38 18 ********* 39 12 ****** 40 11 ****** 41 13 ******* 42 2 * 43 5 *** 44 8 **** 45 2 * 46 1 * 47 3 ** 48 1 * 49 0 50 3 ** 51 0 52 0 53 0 54 0 55 1 * 56 0 57 0 58 0 59 0 60 1 * ... -> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97 -> min 52.3428; median 79.9799; max 98.1879 * = 2 items ... 
52 1 * 53 0 54 0 55 0 56 3 ** 57 1 * 58 0 59 1 * 60 4 ** 61 4 ** 62 4 ** 63 3 ** 64 4 ** 65 7 **** 66 9 ***** 67 10 ***** 68 13 ******* 69 16 ******** 70 26 ************* 71 18 ********* 72 29 *************** 73 35 ****************** 74 40 ******************** 75 39 ******************** 76 56 **************************** 77 52 ************************** 78 50 ************************* 79 76 ************************************** 80 60 ****************************** 81 77 *************************************** 82 45 *********************** 83 61 ******************************* 84 50 ************************* 85 43 ********************** 86 41 ********************* 87 33 ***************** 88 19 ********** 89 11 ****** 90 11 ****** 91 8 **** 92 2 * 93 9 ***** 94 4 ** 95 9 ***** 96 2 * 97 11 ****** 98 3 ** 99 0 Histograms after: -> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08 -> min 5.6946; median 17.1757; max 73.1302 * = 2 items ... 5 1 * 6 13 ******* 7 16 ******** 8 25 ************* 9 22 *********** 10 37 ******************* 11 45 *********************** 12 56 **************************** 13 70 *********************************** 14 61 ******************************* 15 66 ********************************* 16 79 **************************************** 17 63 ******************************** 18 59 ****************************** 19 59 ****************************** 20 56 **************************** 21 47 ************************ 22 36 ****************** 23 37 ******************* 24 32 **************** 25 9 ***** 26 20 ********** 27 17 ********* 28 8 **** 29 7 **** 30 11 ****** 31 6 *** 32 7 **** 33 5 *** 34 4 ** 35 2 * 36 2 * 37 6 *** 38 1 * 39 0 40 3 ** 41 3 ** 42 0 43 1 * 44 1 * 45 1 * 46 0 47 1 * 48 0 49 0 50 2 * 51 1 * 52 0 53 0 54 0 55 0 56 0 57 0 58 0 59 0 60 0 61 1 * 62 0 63 0 64 0 65 0 66 0 67 0 68 0 69 0 70 0 71 0 72 0 73 1 * -> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47 -> min 54.9382; median 89.5188; max 98.3805 * = 2 items 
... 54 1 * 55 0 56 0 57 0 58 0 59 0 60 0 61 0 62 0 63 1 * 64 3 ** 65 0 66 1 * 67 0 68 2 * 69 2 * 70 3 ** 71 3 ** 72 2 * 73 2 * 74 4 ** 75 4 ** 76 6 *** 77 8 **** 78 8 **** 79 6 *** 80 12 ****** 81 25 ************* 82 26 ************* 83 25 ************* 84 39 ******************** 85 58 ***************************** 86 70 *********************************** 87 64 ******************************** 88 74 ************************************* 89 106 ***************************************************** 90 85 ******************************************* 91 62 ******************************* 92 86 ******************************************* 93 79 **************************************** 94 37 ******************* 95 23 ************ 96 42 ********************* 97 25 ************* 98 6 *** 99 0 There are snaky tails in either case, but "the middle ground" here is larger, sparser, and still contains the errors. Across my full test data, which I actually ran first, you can ignore the "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the overall results would have been virtually identical had I boosted spam_cutoff in the second run (recall that I can't demonstrate an improvement on this data anymore! I can only determine whether something is a disaster, and this ain't). -> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams [ditto 19 times] ... 
false positive percentages
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.100  lost  +100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 2 to 6  lost  +200.00%
mean fp % went from 0.01 to 0.03  lost  +200.00%

false negative percentages
    0.000  0.000  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.071  0.071  tied
    0.143  0.071  won  -50.35%
    0.143  0.000  won  -100.00%
    0.143  0.143  tied
    0.143  0.000  won  -100.00%
    0.071  0.000  won  -100.00%
    0.000  0.000  tied

won   4 times
tied  6 times
lost  0 times

total unique fn went from 11 to 5  won  -54.55%
mean fn % went from 0.0785714285714 to 0.0357142857143  won  -54.55%

ham mean                      ham sdev
 25.65   10.68  -58.36%        5.67    5.44   -4.06%
 25.61   10.68  -58.30%        5.50    5.29   -3.82%
 25.57   10.68  -58.23%        5.67    5.49   -3.17%
 25.66   10.71  -58.26%        5.54    5.27   -4.87%
 25.42   10.55  -58.50%        5.72    5.71   -0.17%
 25.51   10.43  -59.11%        5.39    5.11   -5.19%
 25.65   10.40  -59.45%        5.59    5.29   -5.37%
 25.61   10.51  -58.96%        5.41    5.21   -3.70%
 25.84   10.80  -58.20%        5.48    5.30   -3.28%
 25.81   10.85  -57.96%        5.81    5.73   -1.38%

ham mean and sdev for all runs
 25.63   10.63  -58.53%        5.58    5.39   -3.41%

spam mean                     spam sdev
 83.86   93.17  +11.10%        7.09    4.55  -35.83%
 83.64   93.16  +11.38%        6.83    4.52  -33.82%
 83.27   92.91  +11.58%        6.81    4.52  -33.63%
 83.82   93.14  +11.12%        6.88    4.67  -32.12%
 83.89   93.29  +11.21%        6.65    4.56  -31.43%
 83.78   93.11  +11.14%        6.96    4.72  -32.18%
 83.42   93.00  +11.48%        6.82    4.74  -30.50%
 83.86   93.29  +11.24%        6.71    4.55  -32.19%
 83.88   93.22  +11.13%        6.98    4.71  -32.52%
 83.75   93.28  +11.38%        6.65    4.32  -35.04%

spam mean and sdev for all runs
 83.72   93.16  +11.28%        6.84    4.59  -32.89%

ham/spam mean difference: 58.09 82.53 +24.44

So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to 55.20 at 8.27 sdevs from each. That's big.
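The "equidistant score" figures quoted in these runs can be checked directly; this is just the algebra, with the means and sdevs read off the tables above:

```python
# Find the score x equally many sdevs above the ham mean and below the
# spam mean: (x - ham_mean)/ham_sdev == (spam_mean - x)/spam_sdev.
def equidistant(ham_mean, ham_sdev, spam_mean, spam_sdev):
    x = (ham_mean * spam_sdev + spam_mean * ham_sdev) / (ham_sdev + spam_sdev)
    return x, (x - ham_mean) / ham_sdev

print(equidistant(25.63, 5.58, 83.72, 6.84))  # ~ (51.73, 4.68)  before
print(equidistant(10.63, 5.39, 93.16, 4.59))  # ~ (55.20, 8.27)  after
```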
The "after" histograms had 200 buckets in this run: -> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 -> min 0.281945; median 9.69929; max 81.9673 * = 17 items 0.0 7 * 0.5 13 * 1.0 21 ** 1.5 41 *** 2.0 86 ****** 2.5 166 ********** 3.0 239 *************** 3.5 326 ******************** 4.0 466 **************************** 4.5 554 ********************************* 5.0 642 ************************************** 5.5 701 ****************************************** 6.0 793 *********************************************** 6.5 804 ************************************************ 7.0 933 ******************************************************* 7.5 972 ********************************************************** 8.0 997 *********************************************************** 8.5 934 ******************************************************* 9.0 947 ******************************************************** 9.5 939 ******************************************************** 10.0 839 ************************************************** 10.5 786 *********************************************** 11.0 752 ********************************************* 11.5 760 ********************************************* 12.0 636 ************************************** 12.5 606 ************************************ 13.0 554 ********************************* 13.5 483 ***************************** 14.0 461 **************************** 14.5 399 ************************ 15.0 360 ********************** 15.5 317 ******************* 16.0 275 ***************** 16.5 224 ************** 17.0 193 ************ 17.5 169 ********** 18.0 172 *********** 18.5 154 ********** 19.0 153 ********* 19.5 92 ****** 20.0 104 ******* 20.5 99 ****** 21.0 74 ***** 21.5 73 ***** 22.0 73 ***** 22.5 50 *** 23.0 38 *** 23.5 50 *** 24.0 38 *** 24.5 34 ** 25.0 26 ** 25.5 39 *** 26.0 24 ** 26.5 34 ** 27.0 18 ** 27.5 15 * 28.0 20 ** 28.5 15 * 29.0 14 * 29.5 15 * 30.0 12 * 30.5 15 * 31.0 14 * 31.5 10 * 32.0 12 * 32.5 6 * 33.0 10 * 33.5 4 
* 34.0 8 * 34.5 5 * 35.0 5 * 35.5 6 * 36.0 7 * 36.5 4 * 37.0 2 * 37.5 3 * 38.0 1 * 38.5 4 * 39.0 6 * 39.5 2 * 40.0 2 * 40.5 5 * 41.0 0 41.5 2 * 42.0 3 * 42.5 3 * 43.0 1 * 43.5 2 * 44.0 1 * 44.5 2 * 45.0 1 * 45.5 1 * 46.0 2 * 46.5 0 47.0 3 * 47.5 0 48.0 1 * 48.5 1 * 49.0 1 * 49.5 0 50.0 1 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 1 * 54.0 1 * 54.5 2 * 55.0 0 55.5 0 56.0 1 * 56.5 1 * 57.0 0 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 0 60.5 0 61.0 1 * 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 1 * the lady with the long & obnoxious employer-generated sig 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 1 * the verbatim quote of a long Nigerian-scam spam ... -> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 -> min 24.3497; median 93.8141; max 99.6769 * = 15 items ... 24.0 1 * not really sure -- it's a giant base64-encoded plain text file 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 1 * the spam with the uuencoded body we throw away 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 1 * Hello, my Name is BlackIntrepid 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 1 * unclear; a collection of webmaster links 54.0 1 * Susan makes a propsal (sic) to Tim 54.5 0 55.0 1 * 55.5 0 56.0 0 56.5 1 * 57.0 2 * 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 2 * 61.0 1 * 61.5 1 * 62.0 0 62.5 1 * 63.0 1 * 63.5 0 64.0 1 * 64.5 1 * 65.0 0 65.5 1 * 66.0 1 * 66.5 2 * 67.0 4 * 67.5 2 * 68.0 0 68.5 1 * 69.0 0 69.5 3 * 70.0 1 * 70.5 5 * 71.0 5 * 71.5 3 * 72.0 4 * 72.5 3 * 73.0 3 * 73.5 6 * 74.0 3 * 74.5 4 * 75.0 8 * 
75.5 8 * 76.0 10 * 76.5 10 * 77.0 10 * 77.5 17 ** 78.0 14 * 78.5 27 ** 79.0 16 ** 79.5 23 ** 80.0 28 ** 80.5 29 ** 81.0 37 *** 81.5 37 *** 82.0 46 **** 82.5 55 **** 83.0 47 **** 83.5 53 **** 84.0 58 **** 84.5 68 ***** 85.0 86 ****** 85.5 118 ******** 86.0 135 ********* 86.5 159 *********** 87.0 165 *********** 87.5 178 ************ 88.0 209 ************** 88.5 231 **************** 89.0 299 ******************** 89.5 391 *************************** 90.0 425 ***************************** 90.5 402 *************************** 91.0 501 ********************************** 91.5 582 *************************************** 92.0 636 ******************************************* 92.5 667 ********************************************* 93.0 713 ************************************************ 93.5 685 ********************************************** 94.0 610 ***************************************** 94.5 621 ****************************************** 95.0 721 ************************************************* 95.5 735 ************************************************* 96.0 870 ********************************************************** 96.5 742 ************************************************** 97.0 449 ****************************** 97.5 447 ****************************** 98.0 556 ************************************** 98.5 561 ************************************** 99.0 264 ****************** 99.5 171 ************ The mistakes are all familiar; the good news is that "the normal cases" are far removed from what might plausibly be called a middle ground. For example, if we called the region from 40 thru 70 here "the middle ground", and kicked those out for manual review, there would be very few msgs to review, but they would contain almost all the mistakes. How does this do on your data? I'm in favor of what works.
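[Editor's aside: the review policy Tim floats here -- treat scores from 40 thru 70 as a middle ground and hold those messages for a human -- reduces to a few lines. A hedged sketch; the function name and the thresholds-as-parameters are mine, and the 40/70 band is the one floated above, not a project default:]

```python
def route(score, low=40.0, high=70.0):
    """Route a message by its 0-100 score: confident ham below `low`,
    confident spam above `high`, and everything in the middle band
    is kicked out for manual review."""
    if score < low:
        return "ham"
    if score > high:
        return "spam"
    return "review"

# Per the histograms above, almost every message scores far outside
# the band, so few land in "review" -- but the mistakes mostly do.
```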
From grobinson@transpose.com Thu Oct 10 02:06:56 2002 From: grobinson@transpose.com (Gary Robinson) Date: Wed, 09 Oct 2002 21:06:56 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: The thing about the geometric mean is that it is much more sensitive to numbers near 0, so the S/(S+H) technique is biased in that way. If you want to try something like that, I would suggest using the ARITHMETIC means in computing S and H and again using S(S+H). That would remove that bias. It wouldn't be invoking that optimality theorem, but whatever works... It really seems, as a matter of being educated, that the arithmetic approach is worth trying if it doesn't take a lot of trouble to try it. >"but more sensitive to overwhelming amounts of evidence than Gary-combining" From the email you sent at 1:02PM yesterday: 0.40 0 0.45 2 * 0.50 412 ********* 0.55 3068 ************************************************************* 0.60 1447 ***************************** 0.65 71 ** 0.70 0 One thing I'd like to be more clear on. If I understand the experiment correctly you set 10 to .99 and 40 were random. What percentage actually ended up as > .5, without regard to HOW MUCH over .5? > It's hard to know what to make of this, especially in light of the claim > that Gary-combining has been proven to be the most sensitive possible test > for rejecting the hypothesis that a collection of probs is uniformly > distributed. It's not the (S-H)/(S+H) that is the most sensitive (under certain conditions), it's that the geometric mean approach for computing S gives a result that is MONOTONIC WITH a calculation which is the most sensitive. The real technique would take S and feed it into an inverse chi-square function with (in this experiment) 100 degrees of freedom. The output (roughly speaking) would be the probability that that S (or a more extreme one) might have occurred by chance alone. Call these numbers S' and H' for S and H respectively.
The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') is > 0 (unless I've made some error). So, as a binary indicator, the two are equivalent. However, if you used S' and H', you would see something more like real probabilities that would probably be of magnitudes that would be more attractive to you. You could probably use a table to approximate the inverse chi-square calc rather than actually doing the computations all the time. I didn't suggest doing that, at first, because I was interested in providing a binary indicator and wanting to keep things simple -- and from the POV of a binary indicator, it doesn't make any difference. So, if it happens that you feel like taking the time to go "all the way" with this approach, I would suggest actually computing S' and H' and seeing what happens. I think you would like the results better -- I just didn't suggest it at first because I didn't know the spread would be of such interest and I wanted to keep things simple. I think this would work better than the S/(S+H) approach, because if you use geometric means, it's more sensitive to one condition than the other, and if you use arithmetic means, you don't invoke the optimality theorem. Of course, this is ALL speculative. But the probabilities involved will DEFINITELY be of greater magnitude, and so a better-defined spread, if the inverse chi-square is used. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 > From: Tim Peters > Date: Wed, 09 Oct 2002 20:34:15 -0400 > To: SpamBayes > Cc: Gary Robinson > Subject: RE: [Spambayes] spamprob combining > > [Tim] >> ... >> Intuitively, it *seems* like it would be good to get something not so >> insanely sensitive to random input as Paul-combining, but more >> sensitive to overwhelming amounts of evidence than Gary-combining.
> > So there's a new option, > > [Classifier] > use_tim_combining: True > > The comments (from Options.py) explain it: > > # For the default scheme, use "tim-combining" of probabilities. This > # has no effect under the central-limit schemes. Tim-combining is a > # kind of cross between Paul Graham's and Gary Robinson's combining > # schemes. Unlike Paul's, it's never crazy-certain, and compared to > # Gary's, in Tim's tests it greatly increased the spread between mean > # ham-scores and spam-scores, while simultaneously decreasing the > # variance of both. Tim needed a higher spam_cutoff value for best > # results, but spam_cutoff is less touchy than under Gary-combining. > use_tim_combining: False > > "Tim combining" simply takes the geometric mean of the spamprobs as a > measure of spamminess S, and the geometric mean of 1-spamprob as a measure > of hamminess H, then returns S/(S+H) as "the score". This is well-behaved > when fed random, uniformly distributed probabilities, but isn't reluctant to > let an overwhelming number of extreme clues lead it to an extreme conclusion > (although you're not going to see it give Graham-like 1e-30 or > 1.0000000000000 scores). > > Don't use a central-limit scheme with this (it has no effect on those). If > you test it, use whatever variations on the "all default" scheme you usually > use, but it will probably help to boost spam_cutoff. Note that the default > max_discriminators is still 150, and that's what I used below. 
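[Editor's aside: for readers following along, the scheme the quoted comment describes reduces to a few lines. This is a hedged sketch, not the spambayes source: the function name is mine, and the log-space sums are just a standard way to compute geometric means without underflow.]

```python
import math

def tim_combine(probs):
    """Score a message from its word spamprobs: S is the geometric mean
    of the probs, H the geometric mean of (1 - prob), and the score is
    S/(S+H)."""
    n = len(probs)
    # geometric mean = exp(arithmetic mean of the logs)
    s = math.exp(sum(math.log(p) for p in probs) / n)
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return s / (s + h)
```

Fed probs that are all 0.5 it returns exactly 0.5, and unlike Graham-style multiplication it cannot reach 0.0 or 1.0 on finite evidence, matching the "never crazy-certain" claim above.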
> > Here's a 10-set cross-validation run on my data, restricted to 100 ham and > 100 spam per set, with all defaults, except > > before after > ------ ----- > use_tim_combining False True > spam_cutoff 0.55 0.615 > > > -> tested 100 hams & 100 spams against 900 hams & 900 spams > [ditto 19 times] > > false positive percentages > 0.000 0.000 tied > 1.000 0.000 won -100.00% > 1.000 1.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > > won 1 times > tied 9 times > lost 0 times > > total unique fp went from 2 to 1 won -50.00% > mean fp % went from 0.2 to 0.1 won -50.00% > > false negative percentages > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 1.000 1.000 tied > 0.000 0.000 tied > > won 0 times > tied 10 times > lost 0 times > > total unique fn went from 1 to 1 tied > mean fn % went from 0.1 to 0.1 tied > > The real story here is in the score distributions; contrary to what the > comment said above, the ham-score variance increased with this little data: > > ham mean ham sdev > 30.63 18.80 -38.62% 6.03 6.83 +13.27% > 29.31 17.35 -40.81% 5.48 6.84 +24.82% > 29.96 18.50 -38.25% 6.95 9.02 +29.78% > 29.66 18.12 -38.91% 5.89 6.81 +15.62% > 29.51 17.34 -41.24% 5.73 6.71 +17.10% > 29.40 17.43 -40.71% 5.73 6.61 +15.36% > 29.75 17.74 -40.37% 5.76 6.96 +20.83% > 29.71 18.17 -38.84% 5.97 6.48 +8.54% > 31.98 20.41 -36.18% 5.96 8.02 +34.56% > 29.83 18.11 -39.29% 4.75 5.41 +13.89% > > ham mean and sdev for all runs > 29.97 18.20 -39.27% 5.90 7.08 +20.00% > > spam mean spam sdev > 79.23 88.38 +11.55% 6.96 5.52 -20.69% > 79.40 88.70 +11.71% 7.00 5.64 -19.43% > 78.68 88.06 +11.92% 6.69 5.13 -23.32% > 79.65 89.01 +11.75% 7.20 5.22 -27.50% > 79.91 88.87 +11.21% 6.35 4.67 -26.46% > 80.47 89.16 +10.80% 7.22 6.06 -16.07% > 80.94 89.78 +10.92% 6.60 4.45 -32.58% > 80.30 89.41 +11.34% 6.95 
5.49 -21.01% > 78.54 87.70 +11.66% 7.30 6.45 -11.64% > 80.06 89.06 +11.24% 6.98 5.43 -22.21% > > spam mean and sdev for all runs > 79.72 88.81 +11.40% 6.97 5.47 -21.52% > > ham/spam mean difference: 49.75 70.61 +20.86 > > So before, the score equidistant from both means was 52.78, at 3.87 sdevs > from each; after, it was 58.03, at 5.63 sdevs from each. The populations > are much better separated by this measure. > > Histograms before: > > -> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90 > -> min 13.521; median 29.6919; max 60.8937 > * = 2 items > ... > 13 2 * > 14 0 > 15 2 * > 16 8 **** > 17 4 ** > 18 9 ***** > 19 17 ********* > 20 14 ******* > 21 16 ******** > 22 24 ************ > 23 38 ******************* > 24 47 ************************ > 25 62 ******************************* > 26 65 ********************************* > 27 69 *********************************** > 28 73 ************************************* > 29 70 *********************************** > 30 76 ************************************** > 31 70 *********************************** > 32 61 ******************************* > 33 51 ************************** > 34 50 ************************* > 35 34 ***************** > 36 30 *************** > 37 27 ************** > 38 18 ********* > 39 12 ****** > 40 11 ****** > 41 13 ******* > 42 2 * > 43 5 *** > 44 8 **** > 45 2 * > 46 1 * > 47 3 ** > 48 1 * > 49 0 > 50 3 ** > 51 0 > 52 0 > 53 0 > 54 0 > 55 1 * > 56 0 > 57 0 > 58 0 > 59 0 > 60 1 * > ... > > -> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97 > -> min 52.3428; median 79.9799; max 98.1879 > * = 2 items > ... 
> 52 1 * > 53 0 > 54 0 > 55 0 > 56 3 ** > 57 1 * > 58 0 > 59 1 * > 60 4 ** > 61 4 ** > 62 4 ** > 63 3 ** > 64 4 ** > 65 7 **** > 66 9 ***** > 67 10 ***** > 68 13 ******* > 69 16 ******** > 70 26 ************* > 71 18 ********* > 72 29 *************** > 73 35 ****************** > 74 40 ******************** > 75 39 ******************** > 76 56 **************************** > 77 52 ************************** > 78 50 ************************* > 79 76 ************************************** > 80 60 ****************************** > 81 77 *************************************** > 82 45 *********************** > 83 61 ******************************* > 84 50 ************************* > 85 43 ********************** > 86 41 ********************* > 87 33 ***************** > 88 19 ********** > 89 11 ****** > 90 11 ****** > 91 8 **** > 92 2 * > 93 9 ***** > 94 4 ** > 95 9 ***** > 96 2 * > 97 11 ****** > 98 3 ** > 99 0 > > Histograms after: > > -> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08 > -> min 5.6946; median 17.1757; max 73.1302 > * = 2 items > ... 
> 5 1 * > 6 13 ******* > 7 16 ******** > 8 25 ************* > 9 22 *********** > 10 37 ******************* > 11 45 *********************** > 12 56 **************************** > 13 70 *********************************** > 14 61 ******************************* > 15 66 ********************************* > 16 79 **************************************** > 17 63 ******************************** > 18 59 ****************************** > 19 59 ****************************** > 20 56 **************************** > 21 47 ************************ > 22 36 ****************** > 23 37 ******************* > 24 32 **************** > 25 9 ***** > 26 20 ********** > 27 17 ********* > 28 8 **** > 29 7 **** > 30 11 ****** > 31 6 *** > 32 7 **** > 33 5 *** > 34 4 ** > 35 2 * > 36 2 * > 37 6 *** > 38 1 * > 39 0 > 40 3 ** > 41 3 ** > 42 0 > 43 1 * > 44 1 * > 45 1 * > 46 0 > 47 1 * > 48 0 > 49 0 > 50 2 * > 51 1 * > 52 0 > 53 0 > 54 0 > 55 0 > 56 0 > 57 0 > 58 0 > 59 0 > 60 0 > 61 1 * > 62 0 > 63 0 > 64 0 > 65 0 > 66 0 > 67 0 > 68 0 > 69 0 > 70 0 > 71 0 > 72 0 > 73 1 * > > -> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47 > -> min 54.9382; median 89.5188; max 98.3805 > * = 2 items > ... 
> 54 1 * > 55 0 > 56 0 > 57 0 > 58 0 > 59 0 > 60 0 > 61 0 > 62 0 > 63 1 * > 64 3 ** > 65 0 > 66 1 * > 67 0 > 68 2 * > 69 2 * > 70 3 ** > 71 3 ** > 72 2 * > 73 2 * > 74 4 ** > 75 4 ** > 76 6 *** > 77 8 **** > 78 8 **** > 79 6 *** > 80 12 ****** > 81 25 ************* > 82 26 ************* > 83 25 ************* > 84 39 ******************** > 85 58 ***************************** > 86 70 *********************************** > 87 64 ******************************** > 88 74 ************************************* > 89 106 ***************************************************** > 90 85 ******************************************* > 91 62 ******************************* > 92 86 ******************************************* > 93 79 **************************************** > 94 37 ******************* > 95 23 ************ > 96 42 ********************* > 97 25 ************* > 98 6 *** > 99 0 > > There are snaky tails in either case, but "the middle ground" here is > larger, sparser, and still contains the errors. > > Across my full test data, which I actually ran first, you can ignore the > "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the > overall results would have been virtually identical had I boosted > spam_cutoff in the second run (recall that I can't demonstrate an > improvement on this data anymore! I can only determine whether something is > a disaster, and this ain't). > > -> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams > [ditto 19 times] > ... 
> false positive percentages > 0.000 0.050 lost +(was 0) > 0.000 0.050 lost +(was 0) > 0.000 0.050 lost +(was 0) > 0.000 0.000 tied > 0.050 0.100 lost +100.00% > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.050 0.050 tied > > won 0 times > tied 6 times > lost 4 times > > total unique fp went from 2 to 6 lost +200.00% > mean fp % went from 0.01 to 0.03 lost +200.00% > > false negative percentages > 0.000 0.000 tied > 0.071 0.071 tied > 0.000 0.000 tied > 0.071 0.071 tied > 0.143 0.071 won -50.35% > 0.143 0.000 won -100.00% > 0.143 0.143 tied > 0.143 0.000 won -100.00% > 0.071 0.000 won -100.00% > 0.000 0.000 tied > > won 4 times > tied 6 times > lost 0 times > > total unique fn went from 11 to 5 won -54.55% > mean fn % went from 0.0785714285714 to 0.0357142857143 won -54.55% > > ham mean ham sdev > 25.65 10.68 -58.36% 5.67 5.44 -4.06% > 25.61 10.68 -58.30% 5.50 5.29 -3.82% > 25.57 10.68 -58.23% 5.67 5.49 -3.17% > 25.66 10.71 -58.26% 5.54 5.27 -4.87% > 25.42 10.55 -58.50% 5.72 5.71 -0.17% > 25.51 10.43 -59.11% 5.39 5.11 -5.19% > 25.65 10.40 -59.45% 5.59 5.29 -5.37% > 25.61 10.51 -58.96% 5.41 5.21 -3.70% > 25.84 10.80 -58.20% 5.48 5.30 -3.28% > 25.81 10.85 -57.96% 5.81 5.73 -1.38% > > ham mean and sdev for all runs > 25.63 10.63 -58.53% 5.58 5.39 -3.41% > > spam mean spam sdev > 83.86 93.17 +11.10% 7.09 4.55 -35.83% > 83.64 93.16 +11.38% 6.83 4.52 -33.82% > 83.27 92.91 +11.58% 6.81 4.52 -33.63% > 83.82 93.14 +11.12% 6.88 4.67 -32.12% > 83.89 93.29 +11.21% 6.65 4.56 -31.43% > 83.78 93.11 +11.14% 6.96 4.72 -32.18% > 83.42 93.00 +11.48% 6.82 4.74 -30.50% > 83.86 93.29 +11.24% 6.71 4.55 -32.19% > 83.88 93.22 +11.13% 6.98 4.71 -32.52% > 83.75 93.28 +11.38% 6.65 4.32 -35.04% > > spam mean and sdev for all runs > 83.72 93.16 +11.28% 6.84 4.59 -32.89% > > ham/spam mean difference: 58.09 82.53 +24.44 > > So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to > 55.20 at 8.27 sdevs from each. That's big. 
> > The "after" histograms had 200 buckets in this run: > > -> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 > -> min 0.281945; median 9.69929; max 81.9673 > * = 17 items > 0.0 7 * > 0.5 13 * > 1.0 21 ** > 1.5 41 *** > 2.0 86 ****** > 2.5 166 ********** > 3.0 239 *************** > 3.5 326 ******************** > 4.0 466 **************************** > 4.5 554 ********************************* > 5.0 642 ************************************** > 5.5 701 ****************************************** > 6.0 793 *********************************************** > 6.5 804 ************************************************ > 7.0 933 ******************************************************* > 7.5 972 ********************************************************** > 8.0 997 *********************************************************** > 8.5 934 ******************************************************* > 9.0 947 ******************************************************** > 9.5 939 ******************************************************** > 10.0 839 ************************************************** > 10.5 786 *********************************************** > 11.0 752 ********************************************* > 11.5 760 ********************************************* > 12.0 636 ************************************** > 12.5 606 ************************************ > 13.0 554 ********************************* > 13.5 483 ***************************** > 14.0 461 **************************** > 14.5 399 ************************ > 15.0 360 ********************** > 15.5 317 ******************* > 16.0 275 ***************** > 16.5 224 ************** > 17.0 193 ************ > 17.5 169 ********** > 18.0 172 *********** > 18.5 154 ********** > 19.0 153 ********* > 19.5 92 ****** > 20.0 104 ******* > 20.5 99 ****** > 21.0 74 ***** > 21.5 73 ***** > 22.0 73 ***** > 22.5 50 *** > 23.0 38 *** > 23.5 50 *** > 24.0 38 *** > 24.5 34 ** > 25.0 26 ** > 25.5 39 *** > 26.0 24 ** > 26.5 34 ** > 27.0 18 ** > 
27.5 15 * > 28.0 20 ** > 28.5 15 * > 29.0 14 * > 29.5 15 * > 30.0 12 * > 30.5 15 * > 31.0 14 * > 31.5 10 * > 32.0 12 * > 32.5 6 * > 33.0 10 * > 33.5 4 * > 34.0 8 * > 34.5 5 * > 35.0 5 * > 35.5 6 * > 36.0 7 * > 36.5 4 * > 37.0 2 * > 37.5 3 * > 38.0 1 * > 38.5 4 * > 39.0 6 * > 39.5 2 * > 40.0 2 * > 40.5 5 * > 41.0 0 > 41.5 2 * > 42.0 3 * > 42.5 3 * > 43.0 1 * > 43.5 2 * > 44.0 1 * > 44.5 2 * > 45.0 1 * > 45.5 1 * > 46.0 2 * > 46.5 0 > 47.0 3 * > 47.5 0 > 48.0 1 * > 48.5 1 * > 49.0 1 * > 49.5 0 > 50.0 1 * > 50.5 0 > 51.0 2 * > 51.5 0 > 52.0 1 * > 52.5 0 > 53.0 0 > 53.5 1 * > 54.0 1 * > 54.5 2 * > 55.0 0 > 55.5 0 > 56.0 1 * > 56.5 1 * > 57.0 0 > 57.5 0 > 58.0 0 > 58.5 1 * > 59.0 0 > 59.5 0 > 60.0 0 > 60.5 0 > 61.0 1 * > 61.5 0 > 62.0 0 > 62.5 0 > 63.0 0 > 63.5 0 > 64.0 0 > 64.5 0 > 65.0 0 > 65.5 0 > 66.0 0 > 66.5 0 > 67.0 0 > 67.5 0 > 68.0 0 > 68.5 0 > 69.0 0 > 69.5 0 > 70.0 1 * the lady with the long & obnoxious employer-generated sig > 70.5 0 > 71.0 0 > 71.5 0 > 72.0 0 > 72.5 0 > 73.0 0 > 73.5 0 > 74.0 0 > 74.5 0 > 75.0 0 > 75.5 0 > 76.0 0 > 76.5 0 > 77.0 0 > 77.5 0 > 78.0 0 > 78.5 0 > 79.0 0 > 79.5 0 > 80.0 0 > 80.5 0 > 81.0 0 > 81.5 1 * the verbatim quote of a long Nigerian-scam spam > ... > > -> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 > -> min 24.3497; median 93.8141; max 99.6769 > * = 15 items > ... 
> 24.0 1 * not really sure -- it's a giant base64-encoded plain text file > 24.5 0 > 25.0 0 > 25.5 0 > 26.0 0 > 26.5 0 > 27.0 0 > 27.5 0 > 28.0 0 > 28.5 0 > 29.0 1 * the spam with the uuencoded body we throw away > 29.5 0 > 30.0 0 > 30.5 0 > 31.0 0 > 31.5 0 > 32.0 0 > 32.5 0 > 33.0 0 > 33.5 0 > 34.0 0 > 34.5 0 > 35.0 0 > 35.5 0 > 36.0 0 > 36.5 0 > 37.0 0 > 37.5 0 > 38.0 0 > 38.5 0 > 39.0 0 > 39.5 0 > 40.0 0 > 40.5 0 > 41.0 0 > 41.5 0 > 42.0 0 > 42.5 0 > 43.0 0 > 43.5 0 > 44.0 0 > 44.5 0 > 45.0 0 > 45.5 0 > 46.0 1 * Hello, my Name is BlackIntrepid > 46.5 0 > 47.0 0 > 47.5 0 > 48.0 0 > 48.5 0 > 49.0 0 > 49.5 0 > 50.0 0 > 50.5 0 > 51.0 0 > 51.5 0 > 52.0 0 > 52.5 0 > 53.0 0 > 53.5 1 * unclear; a collection of webmaster links > 54.0 1 * Susan makes a propsal (sic) to Tim > 54.5 0 > 55.0 1 * > 55.5 0 > 56.0 0 > 56.5 1 * > 57.0 2 * > 57.5 0 > 58.0 0 > 58.5 1 * > 59.0 0 > 59.5 0 > 60.0 1 * > 60.5 2 * > 61.0 1 * > 61.5 1 * > 62.0 0 > 62.5 1 * > 63.0 1 * > 63.5 0 > 64.0 1 * > 64.5 1 * > 65.0 0 > 65.5 1 * > 66.0 1 * > 66.5 2 * > 67.0 4 * > 67.5 2 * > 68.0 0 > 68.5 1 * > 69.0 0 > 69.5 3 * > 70.0 1 * > 70.5 5 * > 71.0 5 * > 71.5 3 * > 72.0 4 * > 72.5 3 * > 73.0 3 * > 73.5 6 * > 74.0 3 * > 74.5 4 * > 75.0 8 * > 75.5 8 * > 76.0 10 * > 76.5 10 * > 77.0 10 * > 77.5 17 ** > 78.0 14 * > 78.5 27 ** > 79.0 16 ** > 79.5 23 ** > 80.0 28 ** > 80.5 29 ** > 81.0 37 *** > 81.5 37 *** > 82.0 46 **** > 82.5 55 **** > 83.0 47 **** > 83.5 53 **** > 84.0 58 **** > 84.5 68 ***** > 85.0 86 ****** > 85.5 118 ******** > 86.0 135 ********* > 86.5 159 *********** > 87.0 165 *********** > 87.5 178 ************ > 88.0 209 ************** > 88.5 231 **************** > 89.0 299 ******************** > 89.5 391 *************************** > 90.0 425 ***************************** > 90.5 402 *************************** > 91.0 501 ********************************** > 91.5 582 *************************************** > 92.0 636 ******************************************* > 92.5 667 
********************************************* > 93.0 713 ************************************************ > 93.5 685 ********************************************** > 94.0 610 ***************************************** > 94.5 621 ****************************************** > 95.0 721 ************************************************* > 95.5 735 ************************************************* > 96.0 870 ********************************************************** > 96.5 742 ************************************************** > 97.0 449 ****************************** > 97.5 447 ****************************** > 98.0 556 ************************************** > 98.5 561 ************************************** > 99.0 264 ****************** > 99.5 171 ************ > > The mistakes are all familiar; the good news is that "the normal cases" are > far removed from what might plausibly be called a middle ground. For > example, if we called the region from 40 thru 70 here "the middle ground", > and kicked those out for manual review, there would be very few msgs to > review, but they would contain almost all the mistakes. > > How does this do on your data? I'm in favor what works . > From grobinson@transpose.com Thu Oct 10 02:18:28 2002 From: grobinson@transpose.com (Gary Robinson) Date: Wed, 09 Oct 2002 21:18:28 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: If you do decide to try the chi-square thing, the idea is to find or create a function (perhaps using a lookup table) that takes a chi-square random variable and outputs the associated p-value. The input random variable is the product of the p's or (1-p)'s as the case may be. If the p's are uniformly distributed under the null hypothesis, the product is chi-square with 2n degrees of freedom, where n is the number of terms making up the product. So the inverse chi-square function gives the probability associated with that product. 
--Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From grobinson@transpose.com Thu Oct 10 02:41:31 2002 From: grobinson@transpose.com (Gary Robinson) Date: Wed, 09 Oct 2002 21:41:31 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: No no no, I had that wrong. Silly of me. Sorry. It's been too long since I did that calc. It's not the product of the p's that is a chi-square distribution, it's the following, given p1, p2..., pn: -2*((ln p1) + (ln p2) + ... + (ln pn)) That expression has a chi-square distribution with 2n degrees of freedom. So you feed THAT into the inverse chi-square function to get a p-value. Let invchi(x, f), where x is the random variable and f is the degrees of freedom, be the inverse chi-square function. Let S be a number near 1 when the email looks spammy and H be a number near 1 when the email looks hammy. Then you want S = 1 - invchi(-2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n) and H = 1 - invchi(-2*((ln p1) + (ln p2) + ... + (ln pn)), 2*n) I am a little out-of-practice but I am about 99.9% sure that the above is right. I looked up some of my notes from a few years ago to get the calc. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From tim.one@comcast.net Thu Oct 10 04:08:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 09 Oct 2002 23:08:03 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: [Gary Robinson] > The thing about the geometric mean is that it is much more sensitive to > numbers near 0, so the S/(S+H) technique is biased in that way. A single geometric mean would surely be biased, but the combination of two used here doesn't appear to be.
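[Editor's aside: Gary's corrected recipe in the message just above can be made concrete. The sketch below is mine, not project code: note that the statistic must be the Fisher form, -2 times the sum of the logs, for it to be nonnegative and chi-square distributed; `chi2Q` is the standard series for the chi-square survival function with an even number of degrees of freedom, playing the role of the p-value Gary describes.]

```python
import math

def chi2Q(x2, v):
    """Survival function of the chi-square distribution: probability
    that a chi-square variable with v degrees of freedom exceeds x2.
    Exact series expansion, valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_sh(probs):
    """Gary's S' and H': feed -2*sum(ln(1-p)) and -2*sum(ln p) (minus
    signs keep the statistics nonnegative) into the survival function
    with 2n degrees of freedom."""
    n = len(probs)
    s_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    h_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(s_stat, 2 * n)  # near 1 when the evidence is spammy
    H = 1.0 - chi2Q(h_stat, 2 * n)  # near 1 when the evidence is hammy
    return S, H
```

By construction S and H are genuine tail probabilities, which is why this variant should give the better-defined spread Gary predicts.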
That is, throwing random data at it, the mean and median are 0.5, and it's symmetric around that: 5000 items; mean 0.50; sdev 0.06 -> min 0.291521; median 0.500264; max 0.726668 * = 24 items 0.25 3 * 0.30 34 ** 0.35 211 ********* 0.40 816 ********************************** 0.45 1431 ************************************************************ 0.50 1442 ************************************************************* 0.55 809 ********************************** 0.60 219 ********** 0.65 33 ** 0.70 2 * If I do the same random-data experiment and force a prob of 0.99, the mean rises to 0.52; if I force a prob of 0.01, it falls to 0.48. If there's a bias, it's hiding pretty well . If there is a spamprob near 0, it's very much the intent that S take that seriously, and if one near 1, that H take that seriously; else, as now, I see screaming spam or screaming ham barely cracking scores above 70 or below 30. "Too much ends up in the middle." > If you want to try something like that, I would suggest using the > ARITHMETIC means in computing S and H and again using S(S+H). That > would remove that bias. That doesn't appear promising: If S = Smean = (sum p_i)/n and H = Hmean = (sum 1-p_i)/n then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1. So whether you meant S*(S+H) or S/(S+H), the result is S. To within roundoff error, that's what happens, too. > It wouldn't be invoking that optimality theorem, but whatever works... I'm not sure the optimality theorem in question is relevant to the task at hand, though. Why should we care about rejecting a hypothesis that the word probabilities are uniformly distributed? There's virtually no message in which they are, and no reason to believe that the *majority* of words in spam will have spamprobs over 0.5. Graham got results as good as he did because the spamprob strength of a mere handful of words is usually enough to decide it. In a sense, I am trying to move back toward what worked best in his formulation.
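[Editor's aside: Tim's algebra -- that with arithmetic means H is identically 1 - S, so S/(S+H) collapses to S -- is quick to confirm numerically. A throwaway check, not project code:]

```python
import random

random.seed(1)
probs = [random.random() for _ in range(100)]
n = len(probs)
S = sum(probs) / n                    # arithmetic mean of spamprobs
H = sum(1.0 - p for p in probs) / n   # arithmetic mean of 1 - spamprob
# H is identically 1 - S, so S + H == 1 and S/(S+H) is just S again.
assert abs((S + H) - 1.0) < 1e-9
assert abs(S / (S + H) - S) < 1e-9
```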
> It really seems, as a matter of being educated, that the > arithmetic approach is worth trying if it doesn't take a lot of > trouble to try it. Nope, no trouble, but my test data can't demonstrate improvements, just disasters. On a brief 10-fold cv run with 100 ham + 100 spam in each set, using the arithmetic spamprob mean gave results pretty much the same as the default scheme; error rates were the same, but the best range for spam_cutoff shifted from 0.52 thru 0.54, to 0.56 thru 0.58; it increased the spread a little: ham mean and sdev for all runs 30.35 30.53 +0.59% 5.83 5.91 +1.37% spam mean and sdev for all runs 80.97 84.08 +3.84% 7.07 6.38 -9.76% ham/spam mean difference: 50.62 53.55 +2.93 >> "but more sensitive to overwhelming amounts of evidence than >> Gary-combining" > From the email you sent at 1:02PM yesterday: > > 0.40 0 > 0.45 2 * > 0.50 412 ********* > 0.55 3068 ************************************************************* > 0.60 1447 ***************************** > 0.65 71 ** > 0.70 0 > > One thing I'd like to be more clear on. If I understand the experiment > correctly you set 10 to .99 and 40 were random. I have to dig up that email to find the context ... OK, this one was tagged Result for random vectors of 50 probs, + 10 forced to 0.99 That means there were 60 probs in all, 50 drawn from (0.0, 1.0), + 10 of 0.99. > What percentage actually ended up as > .5, without regard to > HOW MUCH over .5? From the histogram, all but 2, out of 5000 trials. 0.5 doesn't work as a spam_cutoff on anyone's corpus here, though (it's too low; too many false positives). The median value in that run was 0.58555, which is close to what some people have been using for spam_cutoff.
Under the S/(S+H) scheme, the same experiment yields 5000 items; mean 0.68; sdev 0.05 -> min 0.490773; median 0.683328; max 0.819528 * = 34 items 0.45 2 * 0.50 27 * 0.55 171 ****** 0.60 991 ****************************** 0.65 2016 ************************************************************ 0.70 1510 ********************************************* 0.75 275 ********* 0.80 8 * So if the percentage above 0.5 is the sole measure of goodness here, S/(S+H) did equally well in this experiment. > ... > It's not the (S-H)/(S+H) that is the most sensitive (under certain > conditions), it that the geometric mean approach for computing S gives a > result that is MONOTONIC WITH a calculation which is the most sensitive. > > The real technique would take S and feed it into an inverse chi-square > function with (in this experiment) 100 degrees of freedom. The output > (roughly speaking) would be the probability that that S (or a more extreme > one) might have occurred by chance alone. > > Call these numbers S' and H' for S and H respectively. > > The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') > (unless I've made some error). > > So, as a binary indicator, the two are equivalent. However, if you used S' > and H', you would see something more like real probabilities that would > probably be of magnitudes that would be more attractive to you. > > You could probably use a table to approximate the inverse chi-square calc > rather than actually doing the computations all the time. > > I didn't suggest doing that, at first, because I was interested > in providing a binary indicator and wanting to keep things simple -- > and from the POV of a binary indicator, it doesn't make any difference. It's not a question of attraction so much as that this "binary indicator" doesn't come with a decision rule for knowing which outcome is which: it varies across corpus, and within a given corpus varies over time, depending on how much data has been trained on.
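[Editor's aside: the experiment described here -- 50 uniform probs plus 10 forced to 0.99, scored 5000 times under S/(S+H) -- is easy to rerun. A sketch with my own helper name and an arbitrary seed, so the exact numbers will differ slightly from Tim's:]

```python
import math
import random

def combine(probs):
    # geometric means of p and 1-p, then S/(S+H)
    n = len(probs)
    s = math.exp(sum(math.log(p) for p in probs) / n)
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return s / (s + h)

random.seed(2002)
scores = [combine([random.random() for _ in range(50)] + [0.99] * 10)
          for _ in range(5000)]
mean = sum(scores) / len(scores)
above_half = sum(s > 0.5 for s in scores) / len(scores)
# Expect a mean near 0.68, with essentially every score above 0.5,
# matching the histogram quoted above.
```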
So we get a stream of test results where the numbers have to be fudged retroactively via "but if I had set the cutoff to *this* on this run, the results would have been very different". It's just too delicate as is. > So, if it happens that feel like taking the time to go "all the way" > with this approach, I would suggest actually computing S' and H' and > seeing what happens. Sounds like fun. > I think you would like the results better -- I just didn't suggest > it at first because I didn't know the spread would be of such > interest and I wanted to keep things simple. That's fine. In practice, the touchiness of spam_cutoff has been an ongoing practical problem; but it's been the *only* ongoing problem, so that's why we're talking about it. > > I think this would work better than the S/(S+H) approach, because > if you use geometric means, it's more sensitive to one condition than > the other, and if you use arithmetic means, you don't invoke the > optimality theorem. As above, I've found no reason yet to believe S/(S+H) favors one side over the other, and the test runs didn't show me evidence of that either. Indeed, it made the same mistakes on the same messages, but moved mounds of correctly classified messages out of "the middle ground". > Of course, this is ALL speculative. But the probabilities involved will > DEFINATELY be of greater magnitude, and so a better-defined spread, if > the inverse chi-square is used. It's doable, but the experimental results so far are promising enough that I'm still keener to see how it works for others here. From popiel@wolfskeep.com Thu Oct 10 04:19:46 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 09 Oct 2002 20:19:46 -0700 Subject: [Spambayes] spamprob combining In-Reply-To: Message from Tim Peters of "Wed, 09 Oct 2002 20:34:15 EDT."
References:
Message-ID: <20021010031946.EA9ACF54A@cashew.wolfskeep.com>

In message: Tim Peters writes:
>
> So there's a new option,
>
> [Classifier]
> use_tim_combining: True
>
> How does this do on your data? I'm in favor of what works.

Oooh, goodie! Another thing to consume CPU-hours! I'll run this one
after I get done with my initial clt tests (which are taking about 4.5
hours each :-/ ).

I can't really say anything else, yet, but clt seems _much_ slower than
the default classifier.

- Alex

From tim.one@comcast.net Thu Oct 10 04:29:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 09 Oct 2002 23:29:38 -0400
Subject: [Spambayes] spamprob combining
In-Reply-To: <20021010031946.EA9ACF54A@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> Oooh, goodie! Another thing to consume CPU-hours!

Yup, that's the only idea here.

> I'll run this one after I get done with my initial clt tests
> (which are taking about 4.5 hours each :-/ ).

Use less data?

> I can't really say anything else, yet, but clt seems _much_ slower
> than the default classifier.

I haven't really noticed that. If you're using your "--trainstyle full"
patch with timcv, then, yes, it would be enormously slower -- timcv gets
enormous *efficiency* benefits (both instruction-count and temporal
cache locality) out of incremental learning and unlearning. The "third
training pass" unique to the clt methods also doubles the training time
(each msg in the training data is tokenized once to update the
wordprobs, and then a second time to compute the clt ham and spam
population statistics).

From tim.one@comcast.net Thu Oct 10 05:11:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 10 Oct 2002 00:11:20 -0400
Subject: [Spambayes] Demo Outlook Plugin available
In-Reply-To:
Message-ID:

[Mark Hammond]
> I just released new win32all builds that contain support for Microsoft
> Outlook Extensions.
>
> If you install win32all-149, 150 or the most recent CVS snapshot
> build, you will find a file win32com\demos\outlookAddin.py - please
> see the comments in the file for information on how to install and
> test this plugin.

Those of you who aren't Python natives may be wondering where to find
that! Here you go:

    http://starship.python.net/crew/mhammond/

If you want to know how to *use* it, Mark is the co-author of O'Reilly's
"Python Programming on Win32". Tell 'em Uncle Timmy sent you.

From tim.one@comcast.net Thu Oct 10 05:58:06 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 10 Oct 2002 00:58:06 -0400
Subject: [Spambayes] Modifications to timcv.py
In-Reply-To: <20021009180439.62465F54A@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> The inability to use timcv.py with the central limit stuff
> annoyed me. I offer this patch to correct that problem...

Thanks again! I checked this in. The biggest difference is that this has
become a new option in a new section:

[CV Driver]
build_each_classifier_from_scratch: False

See associated comments in Options.py. mboxtest.py is also a
cross-validation (CV) driver, so should also learn how to do this. When
this option is True, a CV driver can be used safely with a central-limit
test -- although it will run much slower due to the "build each from
scratch" business that *makes* it safe.

From quinlan@pathname.com Thu Oct 10 05:47:02 2002
From: quinlan@pathname.com (Daniel Quinlan)
Date: 09 Oct 2002 21:47:02 -0700
Subject: [Spambayes] Re: [SAdev] fully-public corpus of mail available
In-Reply-To: jm@jmason.org's message of "Wed, 09 Oct 2002 13:21:11 +0100"
References: <20021009122116.6EB2416F03@jmason.org>
Message-ID:

> (Please feel free to forward this message to other possibly-interested
> parties.)

Some caveats (in descending order of concern):

1. These messages could end up being falsely (or incorrectly) reported
to Razor, DCC, Pyzor, etc. Certain RBLs too.
I don't think the results for these distributed tests can be trusted in
any way, shape, or form when running over a public corpus.

2. These messages could also be submitted (more than once) to projects
like SpamAssassin that rely on filtering results submission for GA
tuning and development.

3. Spammers could adopt elements of the good messages to throw off
filters. And, of course, there's always progression in technology (by
both spammers and non-spammers).

The second problem could be alleviated somewhat by adding a Nilsimsa
signature (or similar) to the mass-check file (the results format used
by SpamAssassin) and giving the message files unique names (MD5 or SHA-1
of each file). The third problem doesn't really worry me.

These problems (and perhaps others I have not identified) are unique to
spam filtering. Compression corpuses and other performance-related
corpuses have their own set of problems, of course. In other words, I
don't think there's any replacement for having multiple independent
corpuses. Finding better ways to distribute testing and collate results
seems like a more viable long-term solution (and I'm glad we're working
on exactly that for SpamAssassin).

If you're going to seriously work on filter development, building a
corpus of 10000-50000 messages (half spam/half non-spam) is not really
that much work. If you don't get enough spam, creating multi-technique
spamtraps (web, usenet, replying to spam) is pretty easy. And who
doesn't get thousands of non-spam every week? ;-)

Dan

From rob@hooft.net Thu Oct 10 06:00:04 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 10 Oct 2002 07:00:04 +0200
Subject: [Spambayes] spamprob combining
References:
Message-ID: <3DA50954.7040503@hooft.net>

Tim Peters wrote:
> "Tim combining" simply takes the geometric mean of the spamprobs as a
> measure of spamminess S, and the geometric mean of 1-spamprob as a
> measure of hamminess H, then returns S/(S+H) as "the score".
> This is well-behaved when fed random, uniformly distributed
> probabilities, but isn't reluctant to let an overwhelming number of
> extreme clues lead it to an extreme conclusion (although you're not
> going to see it give Graham-like 1e-30 or 1.0000000000000 scores).

While reading this I had a sudden thought: with the distributions I'm
normally interested in, I want to explain the "bulk" accurately, without
being extremely sensitive to the tails. E.g. in my previous job, the
bulk was a database of protein structures, and I wanted to describe the
bulk so that I could recognize the outliers. In my current job, the
population is pixel activity on a CCD, and I don't want to be sensitive
to bad pixels.

The standard way to calculate a standard deviation is to calculate the
mean first, and then calculate sum((x - mean)^2)/(n-1) in a second pass
over the numbers. This is rather sensitive to outliers, however. In
both cases I have experience with, the best way to describe the bulk is
to use the median, and "median ways" to calculate the standard
deviation. These methods absolutely ignore the extreme values.

But now spambayes. The bulk are words like "the" and "with" and "want"
and ... all totally uninteresting. So if we want to be sensitive to
outliers, we should "go the other way". We have two options I can think
of:

* use a (x - mean)^4 function. This will be very sensitive to extremes.

* calculate the mean and standard deviation both using the standard
technique and using medians, and then use the DIFFERENCE between the
results as a measure of the extreme-characteristic.

Just some random ideas I wouldn't yet know how to apply.

Rob

--
Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From mhammond@skippinet.com.au Thu Oct 10 09:52:04 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 10 Oct 2002 18:52:04 +1000
Subject: [Spambayes] Demo Outlook Plugin available
In-Reply-To:
Message-ID:

[Tim]
> [Mark Hammond]
> > I just released new win32all builds that contain support for Microsoft
> > Outlook Extensions.
> >
> > If you install win32all-149, 150 or the most recent CVS snapshot
> > build, you will find a file win32com\demos\outlookAddin.py - please
> > see the comments in the file for information on how to install and
> > test this plugin.
>
> Those of you who aren't Python natives may be wondering where to
> find that! Here you go:
>
> http://starship.python.net/crew/mhammond/

Thanks Tim! Except unfortunately I did screw up the packaging. However,
as I only announced the new package here, I simply re-issued these
win32all builds.

If you have already installed win32all-149, win32all-150, or the CVS
snapshot build, you should move the files
"site-packages\win32com\pythoncom.*" to the "site-packages" directory -
that is, move pythoncom.py/pyc/pyo to its parent directory. All COM
related stuff should then work. You can ignore the fact that the sample
COM objects weren't registered. Of course, simply re-download the
package if you want to make sure!

It looks like Sean True and I are going to be doing a little more work
on this ;)

Mark.

From grobinson@transpose.com Thu Oct 10 13:45:53 2002
From: grobinson@transpose.com (Gary Robinson)
Date: Thu, 10 Oct 2002 08:45:53 -0400
Subject: [Spambayes] chi-square
Message-ID:

> >> It wouldn't be invoking that optimality theorem, but whatever works...
>
> I'm not sure the optimality theorem in question is relevant to the task
> at hand, though. Why should we care about rejecting a hypothesis that
> the word probabilities are uniformly distributed?
> There's virtually no message in which they are, and no reason to
> believe that the *majority* of words in spam will have spamprobs over
> 0.5. Graham got results as good as he did because the spamprob strength
> of a mere handful of words is usually enough to decide it. In a sense,
> I am trying to move back toward what worked best in his formulation.

Right, I agree, and I've noted earlier that because the variables aren't
independent this isn't really an "optimal" use of the optimality
theorem. ;) Nevertheless, I think it is a good idea to come as close as
we can to invoking it, because even approximately invoking such a
theorem is often better than doing something which has no real
mathematics underlying it at all.

> There's a dramatic difference in the Paul results, while the Gary
> results move subtly (in comparison).
>
> If we force 10 additional .99 spamprobs, the differences are night and
> day:
>
> Result for random vectors of 50 probs, + 10 forced to 0.99
> [Histogram here]
>
> It's hard to know what to make of this, especially in light of the
> claim that Gary-combining has been proven to be the most sensitive
> possible test for rejecting the hypothesis that a collection of probs
> is uniformly distributed. At least in this test, Paul-combining seemed
> far more sensitive (even when the data is random).

If you do the chi-square transformation, it should respond strongly to
this experiment, because it figures out a probability in association
with that kind of distortion. That is, doing the inverse chi-square
thing uncovers the probabilistic information that is now completely
buried in the product of the p's, and that can only emerge when the
number of p's is considered, which is done by means of the inverse
chi-square computation. The number of p's is currently ignored; when it
is considered, a very different result will emerge.

Look at it this way. You're saying that in your experiment 17% of the
p's are artificially forced to .99.
If there are 6 p's to start with, 17% would only mean 1 p was skewed and
that is not very unusual. But if you had 1,000,000 p's, and 17% of them
were totally out-of-whack with a uniform distribution, the odds against
it happening by chance alone would be completely astronomical. So, you
have to figure in the number of p's if you want to get anything like a
real probability. You can compute that real probability using the
inverse chi-square calc. Otherwise all the probabilistic detail is lost;
it just gets buried in the process of calculating the geometric mean.

If you are playing with different cutoffs, the details that are lost
when you don't do the inverse chi-square calc may really matter. They
DON'T matter if you are only using a .5 cutoff, because the monotonic
property we've discussed means that a binary choice based on a .5 cutoff
will be the same either way. But the details will matter more as you get
away from .5 for the cutoff.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454

From grobinson@transpose.com Thu Oct 10 13:46:01 2002
From: grobinson@transpose.com (Gary Robinson)
Date: Thu, 10 Oct 2002 08:46:01 -0400
Subject: [Spambayes] spamprob combining
In-Reply-To:
Message-ID:

>> If you want to try something like that, I would suggest using the
>> ARITHMETIC means in computing S and H and again using S(S+H). That
>> would remove that bias.
>
> That doesn't appear promising:
>
> If
>     S = Smean = (sum p_i)/n
>
> and
>     H = Hmean = (sum 1-p_i)/n
>
> then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1. So whether
> you meant S*(S+H) or S/(S+H), the result is S. To within roundoff
> error, that's what happens, too.

Ha ha ha! I should have thought of that!
:) Gary From bkc@murkworks.com Thu Oct 10 15:08:41 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 10 Oct 2002 10:08:41 -0400 Subject: [Spambayes] timcombine results (long) Message-ID: <3DA55161.28574.F0586AC@localhost> I re-ran tests on my corpus using this morning's checkout. The only difference between the two runs was use_tim_combining: (false first, then true) Note, I left spam_cutoff at 0.50 for both runs! [bkc@strader2 spambayes]$ more bayescustomize.ini [Tokenizer] mine_received_headers: True [Classifier] use_central_limit = False use_central_limit2 = False use_central_limit3 = False zscore_ratio_cutoff: 1.9 use_tim_combining: False [TestDriver] spam_cutoff: 0.50 show_false_negatives: True nbuckets: 100 show_spam_lo: 0.0 show_spam_hi: 0.45 save_trained_pickles: True save_histogram_pickles: True Histogram from from use_tim_combining: false -> Ham scores for all runs: 13000 items; mean 25.35; sdev 6.95 -> min 0.771618; median 24.6878; max 78.0095 * = 20 items 0 12 * 1 12 * 2 1 * 3 7 * 4 12 * 5 12 * 6 29 ** 7 13 * 8 32 ** 9 32 ** 10 45 *** 11 35 ** 12 80 **** 13 122 ******* 14 130 ******* 15 191 ********** 16 230 ************ 17 300 *************** 18 369 ******************* 19 533 *************************** 20 701 ************************************ 21 839 ****************************************** 22 946 ************************************************ 23 998 ************************************************** 24 1165 *********************************************************** 25 975 ************************************************* 26 859 ******************************************* 27 780 *************************************** 28 659 ********************************* 29 541 **************************** 30 388 ******************** 31 299 *************** 32 267 ************** 33 197 ********** 34 168 ********* 35 144 ******** 36 135 ******* 37 102 ****** 38 96 ***** 39 84 ***** 40 51 *** 41 64 **** 42 44 *** 43 42 *** 44 39 ** 45 34 ** 46 17 * 47 28 
** 48 24 ** 49 19 * 50 26 ** 51 7 * 52 15 * 53 12 * 54 6 * 55 5 * 56 11 * 57 3 * 58 3 * 59 2 * 60 0 61 2 * 62 1 * 63 0 64 2 * 65 1 * 66 0 67 1 * 68 0 69 0 70 0 71 0 72 0 73 0 74 0 75 0 76 0 77 0 78 1 * 79 0 80 0 81 0 82 0 83 0 84 0 85 0 86 0 87 0 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 13000 items; mean 81.18; sdev 7.55 -> min 34.1005; median 82.5437; max 99.5356 * = 17 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 0 20 0 21 0 22 0 23 0 24 0 25 0 26 0 27 0 28 0 29 0 30 0 31 0 32 0 33 0 34 1 * 35 0 36 1 * 37 0 38 0 39 1 * 40 0 41 0 42 0 43 3 * 44 1 * 45 5 * 46 2 * 47 1 * 48 5 * 49 0 50 7 * 51 9 * 52 7 * 53 12 * 54 18 ** 55 18 ** 56 23 ** 57 22 ** 58 30 ** 59 38 *** 60 37 *** 61 49 *** 62 74 ***** 63 56 **** 64 102 ****** 65 83 ***** 66 100 ****** 67 102 ****** 68 124 ******** 69 160 ********** 70 191 ************ 71 228 ************** 72 211 ************* 73 261 **************** 74 318 ******************* 75 312 ******************* 76 411 ************************* 77 413 ************************* 78 497 ****************************** 79 627 ************************************* 80 745 ******************************************** 81 780 ********************************************** 82 861 *************************************************** 83 991 *********************************************************** 84 903 ****************************************************** 85 860 *************************************************** 86 771 ********************************************** 87 622 ************************************* 88 506 ****************************** 89 510 ****************************** 90 230 ************** 91 158 ********** 92 142 ********* 93 112 ******* 94 79 ***** 95 52 **** 96 38 *** 97 14 * 98 18 ** 99 48 *** -> best cutoff for all runs: 0.53 -> with weighted total 1*50 fp + 43 fn = 93 -> fp rate 0.385% fn rate 0.331% -> matched at 0.54 with 38 fp & 55 
fn; fp rate 0.292%; fn rate 0.423% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik And now, histogram from timcombine true -> Ham scores for all runs: 13000 items; mean 11.93; sdev 8.33 -> min 0.584578; median 9.92718; max 87.3273 * = 18 items 0 61 **** 1 220 ************* 2 233 ************* 3 398 *********************** 4 670 ************************************** 5 811 ********************************************** 6 930 **************************************************** 7 1080 ************************************************************ 8 1081 ************************************************************* 9 1088 ************************************************************* 10 871 ************************************************* 11 788 ******************************************** 12 702 *************************************** 13 629 *********************************** 14 558 ******************************* 15 419 ************************ 16 337 ******************* 17 268 *************** 18 229 ************* 19 195 *********** 20 146 ********* 21 141 ******** 22 125 ******* 23 105 ****** 24 102 ****** 25 75 ***** 26 50 *** 27 60 **** 28 47 *** 29 58 **** 30 48 *** 31 49 *** 32 31 ** 33 35 ** 34 19 ** 35 30 ** 36 34 ** 37 21 ** 38 9 * 39 19 ** 40 13 * 41 25 ** 42 10 * 43 10 * 44 12 * 45 14 * 46 12 * 47 11 * 48 12 * 49 11 * 50 13 * 51 13 * 52 4 * 53 7 * 54 8 * 55 5 * 56 8 * 57 3 * 58 4 * 59 3 * 60 5 * 61 4 * 62 3 * 63 1 * 64 3 * 65 3 * 66 0 67 2 * 68 2 * 69 0 70 0 71 0 72 0 73 2 * 74 1 * 75 2 * 76 0 77 0 78 0 79 0 80 0 81 0 82 0 83 1 * 84 0 85 0 86 0 87 1 * 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 13000 items; mean 90.62; sdev 7.36 -> min 11.1229; median 92.6441; max 99.5389 * = 21 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 1 * 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 1 * 20 0 21 1 * 22 0 23 0 24 0 25 0 26 0 27 0 28 0 29 0 30 0 31 0 32 1 * 33 0 34 0 35 1 * 
36 0 37 1 * 38 0 39 1 * 40 0 41 2 * 42 5 * 43 1 * 44 1 * 45 0 46 3 * 47 1 * 48 0 49 0 50 2 * 51 7 * 52 3 * 53 3 * 54 6 * 55 3 * 56 7 * 57 13 * 58 6 * 59 11 * 60 13 * 61 16 * 62 11 * 63 18 * 64 22 ** 65 20 * 66 28 ** 67 33 ** 68 24 ** 69 36 ** 70 55 *** 71 39 ** 72 55 *** 73 77 **** 74 59 *** 75 69 **** 76 93 ***** 77 100 ***** 78 110 ****** 79 152 ******** 80 156 ******** 81 172 ********* 82 210 ********** 83 193 ********** 84 242 ************ 85 278 ************** 86 313 *************** 87 393 ******************* 88 477 *********************** 89 608 ***************************** 90 689 ********************************* 91 950 ********************************************** 92 1131 ****************************************************** 93 1278 ************************************************************* 94 1244 ************************************************************ 95 945 ********************************************* 96 902 ******************************************* 97 1056 *************************************************** 98 604 ***************************** 99 48 *** -> best cutoff for all runs: 0.57 -> with weighted total 1*40 fp + 51 fn = 91 -> fp rate 0.308% fn rate 0.392% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik And rates cmp.py results/timcombinefalses.txt -> results/timcombinetrues.txt -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams false positive percentages 1.077 1.077 tied 0.769 0.769 tied 0.769 0.769 tied 0.923 0.923 tied 0.769 0.769 tied 0.538 0.538 tied 0.538 0.538 tied 0.692 0.692 tied 0.769 0.769 tied 0.692 0.692 tied won 0 times tied 10 times lost 0 times total unique fp went from 98 to 98 tied mean fp % went from 0.753846153846 to 0.753846153846 tied false negative percentages 0.154 0.154 tied 0.154 0.154 tied 0.231 0.231 tied 0.077 0.077 tied 0.000 0.000 tied 0.231 0.231 tied 0.231 0.231 tied 0.077 0.077 tied 0.154 0.154 tied 0.231 0.231 tied won 0 times tied 
10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.153846153846 to 0.153846153846 tied ham mean ham sdev 25.47 12.23 -51.98% 7.31 9.02 +23.39% 25.37 12.04 -52.54% 7.07 8.57 +21.22% 25.56 12.08 -52.74% 6.96 8.44 +21.26% 25.57 12.21 -52.25% 7.09 8.65 +22.00% 25.33 11.98 -52.70% 6.94 8.40 +21.04% 25.56 12.20 -52.27% 6.77 8.16 +20.53% 25.29 11.69 -53.78% 6.71 7.80 +16.24% 25.19 11.61 -53.91% 6.71 7.91 +17.88% 25.07 11.63 -53.61% 7.02 8.31 +18.38% 25.14 11.60 -53.86% 6.88 7.94 +15.41% ham mean and sdev for all runs 25.35 11.93 -52.94% 6.95 8.33 +19.86% spam mean spam sdev 80.93 90.31 +11.59% 7.72 7.59 -1.68% 81.17 90.59 +11.61% 7.73 7.68 -0.65% 81.36 90.72 +11.50% 7.52 7.40 -1.60% 81.51 90.91 +11.53% 7.40 7.16 -3.24% 81.02 90.54 +11.75% 7.19 6.93 -3.62% 81.26 90.68 +11.59% 7.41 7.23 -2.43% 81.03 90.49 +11.67% 7.52 7.25 -3.59% 81.08 90.61 +11.75% 7.48 7.29 -2.54% 81.47 90.93 +11.61% 7.54 7.21 -4.38% 80.93 90.40 +11.70% 7.95 7.80 -1.89% spam mean and sdev for all runs 81.18 90.62 +11.63% 7.55 7.36 -2.52% ham/spam mean difference: 55.83 78.69 +22.86 Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From bkc@murkworks.com Thu Oct 10 15:57:46 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 10 Oct 2002 10:57:46 -0400 Subject: [Spambayes] timcombine comparison with varied cutoff Message-ID: <3DA55CE2.767.F327508@localhost> I re-ran the comparison between use_tim_combine false --> true this time, I set the spam_cutoff to the recommended value for the false (0.53) and true (0.57) case. ran rates and cmp. 
results/timcombinefalse053s.txt -> results/timcombinetrue057s.txt -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams false positive percentages 0.385 0.231 won -40.00% 0.462 0.462 tied 0.385 0.231 won -40.00% 0.615 0.538 won -12.52% 0.462 0.231 won -50.00% 0.231 0.231 tied 0.154 0.154 tied 0.154 0.154 tied 0.615 0.538 won -12.52% 0.385 0.308 won -20.00% won 6 times tied 4 times lost 0 times total unique fp went from 50 to 40 won -20.00% mean fp % went from 0.384615384615 to 0.307692307692 won -20.00% false negative percentages 0.308 0.385 lost +25.00% 0.385 0.385 tied 0.385 0.385 tied 0.385 0.462 lost +20.00% 0.231 0.231 tied 0.308 0.385 lost +25.00% 0.385 0.385 tied 0.308 0.385 lost +25.00% 0.308 0.538 lost +74.68% 
0.308 0.385 lost +25.00% won 0 times tied 4 times lost 6 times total unique fn went from 43 to 51 lost +18.60% mean fn % went from 0.330769230769 to 0.392307692307 lost +18.60% ham mean ham sdev 25.47 12.23 -51.98% 7.31 9.02 +23.39% 25.37 12.04 -52.54% 7.07 8.57 +21.22% 25.56 12.08 -52.74% 6.96 8.44 +21.26% 25.57 12.21 -52.25% 7.09 8.65 +22.00% 25.33 11.98 -52.70% 6.94 8.40 +21.04% 25.56 12.20 -52.27% 6.77 8.16 +20.53% 25.29 11.69 -53.78% 6.71 7.80 +16.24% 25.19 11.61 -53.91% 6.71 7.91 +17.88% 25.07 11.63 -53.61% 7.02 8.31 +18.38% 25.14 11.60 -53.86% 6.88 7.94 +15.41% ham mean and sdev for all runs 25.35 11.93 -52.94% 6.95 8.33 +19.86% spam mean spam sdev 80.93 90.31 +11.59% 7.72 7.59 -1.68% 81.17 90.59 +11.61% 7.73 7.68 -0.65% 81.36 90.72 +11.50% 7.52 7.40 -1.60% 81.51 90.91 +11.53% 7.40 7.16 -3.24% 81.02 90.54 +11.75% 7.19 6.93 -3.62% 81.26 90.68 +11.59% 7.41 7.23 -2.43% 81.03 90.49 +11.67% 7.52 7.25 -3.59% 81.08 90.61 +11.75% 7.48 7.29 -2.54% 81.47 90.93 +11.61% 7.54 7.21 -4.38% 80.93 90.40 +11.70% 7.95 7.80 -1.89% spam mean and sdev for all runs 81.18 90.62 +11.63% 7.55 7.36 -2.52% ham/spam mean difference: 55.83 78.69 +22.86 Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From popiel@wolfskeep.com Thu Oct 10 17:07:57 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 09:07:57 -0700 Subject: [Spambayes] CLT run results Message-ID: <20021010160757.6AF69F59E@cashew.wolfskeep.com> Not much to say about this one. The magic of the 2:3 ham:spam ratio is maintained across default, clt1, clt2, and clt3. This nudges me to believe that it's something about my corpus or perhaps a universal constant. (Brad's posted results seemed to have high k at 2:3, too). As others have shown, the clt total error rate is lower than that of the default classifier, but the fp rate is higher. I have not yet looked at the certainty stuff that clt gives. 
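An aside on the k figure mentioned above and tabulated below: from the numbers, it appears to be the ham/spam mean difference divided by the sum of the ham and spam score sdevs. That is an inference from the tables, not a definition stated in the thread; a quick sketch:

```python
def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Separation between the ham and spam score populations, measured
    in units of their combined spread."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)

# First column of the default-classifier table below (50-200 ham:spam):
# h mean 24.25, h sdev 7.52, s mean 77.56, s sdev 8.24
k = separation_k(24.25, 7.52, 77.56, 8.24)
print(round(k, 2))  # matches the tabulated k of 3.38
```

Larger k means the two score distributions are farther apart relative to their widths, which is why a higher k suggests a more comfortable place to put the cutoff.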
I used my modified timcv.py (posted earlier, and on my website)... but if you want to reproduce my results, use Tim's version instead (it makes more sense to have the training style as an ini option). Just make sure to use the full retraining when doing clt tests. I also retrieved my 10 set configuration (from before I rebalanced for 15 sets in my last experiment). Note that I did _not_ rebalance to get back to this point; I untarred the archive I'd made. This means that the comparison against the 10set data from my ratio2 experiment might actually be valid. The default (robinson) classifier results (from the ratio2 experiment): -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 2 3 3 3 4 3 3 fp %: 0.40 0.40 0.30 0.24 0.27 0.17 0.15 fn tot: 32 41 43 43 47 48 51 fn %: 1.60 2.34 2.87 3.44 4.70 6.40 10.20 h mean: 24.25 21.75 20.12 18.87 18.33 17.72 16.71 h sdev: 7.52 7.13 7.04 7.09 7.16 7.31 7.43 s mean: 77.56 76.66 75.93 74.85 74.13 72.80 70.57 s sdev: 8.24 8.62 8.77 9.09 9.68 9.90 10.54 mean diff: 53.31 54.91 55.81 55.98 55.80 55.08 53.86 k: 3.38 3.49 3.53 3.46 3.31 3.20 3.00 clt1 results: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 9 4 6 6 10 10 11 fp %: 1.80 0.53 0.60 0.48 0.67 0.57 0.55 fn tot: 6 6 4 6 9 10 13 fn %: 0.30 0.34 0.27 0.48 0.90 1.33 2.60 h mean: 3.17 1.58 1.29 1.22 1.09 0.91 0.77 h sdev: 14.74 9.77 8.77 8.66 8.54 7.83 7.09 s mean: 99.55 99.32 99.18 98.85 98.22 97.88 96.42 s sdev: 5.66 6.68 7.06 7.96 10.00 11.57 14.67 mean diff: 96.38 97.74 97.89 97.63 97.13 96.97 95.65 k: 4.72 5.94 6.18 5.87 5.24 5.00 4.40 clt2 results: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] 
-> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 10 5 6 6 9 10 8 fp %: 2.00 0.67 0.60 0.48 0.60 0.57 0.40 fn tot: 6 6 4 6 11 14 16 fn %: 0.30 0.34 0.27 0.48 1.10 1.87 3.20 h mean: 3.37 1.39 0.89 0.68 0.57 0.57 0.47 h sdev: 15.03 9.31 8.28 7.56 7.17 7.15 6.20 s mean: 99.65 99.43 99.37 99.01 98.46 97.94 96.41 s sdev: 5.22 6.49 6.37 7.75 9.45 11.49 15.04 mean diff: 96.28 98.04 98.48 98.33 97.89 97.37 95.94 k: 4.75 6.21 6.72 6.42 5.89 5.22 4.52 clt3 results: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [... edited for brevity ...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham-spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp tot: 9 4 5 6 8 9 8 fp %: 1.80 0.53 0.50 0.48 0.53 0.51 0.40 fn tot: 7 7 5 11 18 21 21 fn %: 0.35 0.40 0.33 0.88 1.80 2.80 4.20 h mean: 3.27 1.06 0.74 0.48 0.53 0.46 0.38 h sdev: 14.54 8.44 7.51 6.31 6.81 5.85 5.12 s mean: 99.58 99.35 99.18 98.61 97.81 97.07 95.11 s sdev: 5.78 6.80 7.30 8.89 11.15 13.34 17.31 mean diff: 96.31 98.29 98.44 98.13 97.28 96.61 94.73 k: 4.74 6.45 6.65 6.46 5.42 5.03 4.22 The clt variants all are sensitive to the ham:spam ratio in both fp and fn, and the directions are crossed (which makes sense). It's impossible to tell from the fp and fn numbers where the sweet spot really is, but the k values seem to point at 2:3. All of this is (of course) on my website at: http://www.wolfskeep.com/~popiel/spambayes/clt - Alex From popiel@wolfskeep.com Thu Oct 10 19:58:27 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 11:58:27 -0700 Subject: [Spambayes] A Tim-Combining run Message-ID: <20021010185827.39DF9F59E@cashew.wolfskeep.com> All of this is on my website at: http://www.wolfskeep.com/~popiel/spambayes/timcomb Tim combining looks interesting, but I can't really tell if it's a win or a lose. The fp/fn rates only changed by a message or two. 
On the other hand, Tim combining shoved one of my ham scores into the 80s, and some of my spam scores into the single digits. Ouch. For this one, I haven't done the full ratio analysis... I only did the 1:1 case with 10 sets of 200 each. Like Brad, I ran it once to find out what the cutoffs should be, then ran it again to utilize those cutoffs. Default combining is cv1, Tim combining is cv2. The one thing of note is that I seem to have surprisingly low cutoff values (0.42 and 0.38). Default combining, spam_cutoff 0.42: -> Ham scores for all runs: 2000 items; mean 18.07; sdev 7.22 -> min 0.763677; median 17.9834; max 63.3652 * = 3 items 0 2 * 1 20 ******* 2 14 ***** 3 26 ********* 4 33 *********** 5 21 ******* 6 27 ********* 7 29 ********** 8 26 ********* 9 25 ********* 10 41 ************** 11 56 ******************* 12 53 ****************** 13 88 ****************************** 14 115 *************************************** 15 134 ********************************************* 16 153 *************************************************** 17 142 ************************************************ 18 130 ******************************************** 19 159 ***************************************************** 20 129 ******************************************* 21 111 ************************************* 22 87 ***************************** 23 74 ************************* 24 59 ******************** 25 55 ******************* 26 40 ************** 27 28 ********** 28 18 ****** 29 17 ****** 30 19 ******* 31 8 *** 32 2 * 33 10 **** 34 8 *** 35 7 *** 36 12 **** 37 3 * 38 2 * 39 1 * 40 0 41 1 * 42 1 * 43 0 44 1 * 45 0 46 1 * 47 0 48 3 * 49 0 50 2 * 51 0 52 1 * 53 0 54 2 * 55 0 56 0 57 0 58 0 59 2 * 60 0 61 1 * 62 0 63 1 * 64 0 65 0 66 0 67 0 68 0 69 0 70 0 71 0 72 0 73 0 74 0 75 0 76 0 77 0 78 0 79 0 80 0 81 0 82 0 83 0 84 0 85 0 86 0 87 0 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 2000 items; mean 75.97; sdev 9.00 -> min 19.5284; median 
77.2341; max 98.0328 * = 2 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 15 0 16 0 17 0 18 0 19 1 * 20 0 21 0 22 1 * 23 0 24 1 * 25 0 26 0 27 0 28 0 29 1 * 30 1 * 31 0 32 0 33 0 34 1 * 35 0 36 0 37 0 38 0 39 0 40 0 41 0 42 1 * 43 0 44 2 * 45 4 ** 46 4 ** 47 4 ** 48 2 * 49 3 ** 50 7 **** 51 3 ** 52 4 ** 53 1 * 54 6 *** 55 12 ****** 56 11 ****** 57 12 ****** 58 7 **** 59 17 ********* 60 19 ********** 61 21 *********** 62 17 ********* 63 17 ********* 64 31 **************** 65 31 **************** 66 44 ********************** 67 32 **************** 68 53 *************************** 69 44 ********************** 70 51 ************************** 71 63 ******************************** 72 76 ************************************** 73 95 ************************************************ 74 88 ******************************************** 75 90 ********************************************* 76 97 ************************************************* 77 107 ****************************************************** 78 106 ***************************************************** 79 110 ******************************************************* 80 110 ******************************************************* 81 90 ********************************************* 82 87 ******************************************** 83 90 ********************************************* 84 73 ************************************* 85 63 ******************************** 86 42 ********************* 87 56 **************************** 88 26 ************* 89 18 ********* 90 13 ******* 91 11 ****** 92 8 **** 93 2 * 94 1 * 95 1 * 96 5 *** 97 3 ** 98 3 ** 99 0 -> best cutoff for all runs: 0.42 -> with weighted total 1*15 fp + 6 fn = 21 -> fp rate 0.75% fn rate 0.3% -> matched at 0.43 with 14 fp & 7 fn; fp rate 0.7%; fn rate 0.35% -> matched at 0.44 with 14 fp & 7 fn; fp rate 0.7%; fn rate 0.35% With Tim combining, spam_cutoff 0.38: -> Ham scores for all runs: 2000 items; mean 8.43; sdev 6.15 -> min 0.66161; 
median 7.56964; max 81.0785 * = 4 items 0 14 **** 1 95 ************************ 2 115 ***************************** 3 126 ******************************** 4 169 ******************************************* 5 189 ************************************************ 6 188 *********************************************** 7 186 *********************************************** 8 176 ******************************************** 9 178 ********************************************* 10 128 ******************************** 11 90 *********************** 12 64 **************** 13 76 ******************* 14 45 ************ 15 45 ************ 16 21 ****** 17 18 ***** 18 14 **** 19 14 **** 20 3 * 21 5 ** 22 7 ** 23 5 ** 24 3 * 25 1 * 26 3 * 27 2 * 28 1 * 29 3 * 30 0 31 0 32 0 33 0 34 0 35 1 * 36 0 37 1 * 38 0 39 1 * 40 0 41 0 42 0 43 1 * 44 0 45 0 46 1 * 47 2 * 48 0 49 0 50 1 * 51 0 52 1 * 53 0 54 0 55 1 * 56 0 57 0 58 1 * 59 0 60 0 61 1 * 62 0 63 0 64 0 65 0 66 0 67 3 * 68 0 69 0 70 0 71 0 72 0 73 0 74 0 75 0 76 0 77 0 78 0 79 0 80 0 81 1 * 82 0 83 0 84 0 85 0 86 0 87 0 88 0 89 0 90 0 91 0 92 0 93 0 94 0 95 0 96 0 97 0 98 0 99 0 -> Spam scores for all runs: 2000 items; mean 86.91; sdev 10.12 -> min 8.51412; median 89.773; max 98.1726 * = 3 items 0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 1 * 9 1 * 10 0 11 0 12 0 13 0 14 1 * 15 1 * 16 0 17 1 * 18 0 19 0 20 0 21 1 * 22 0 23 0 24 0 25 0 26 1 * 27 0 28 0 29 0 30 0 31 0 32 0 33 1 * 34 0 35 0 36 0 37 0 38 3 * 39 1 * 40 1 * 41 0 42 1 * 43 1 * 44 2 * 45 4 ** 46 0 47 2 * 48 1 * 49 2 * 50 5 ** 51 2 * 52 2 * 53 1 * 54 2 * 55 2 * 56 1 * 57 0 58 5 ** 59 8 *** 60 6 ** 61 2 * 62 3 * 63 6 ** 64 5 ** 65 10 **** 66 10 **** 67 9 *** 68 12 **** 69 14 ***** 70 11 **** 71 6 ** 72 12 **** 73 14 ***** 74 23 ******** 75 18 ****** 76 15 ***** 77 20 ******* 78 21 ******* 79 30 ********** 80 35 ************ 81 39 ************* 82 41 ************** 83 66 ********************** 84 72 ************************ 85 75 ************************* 86 98 
********************************* 87 69 *********************** 88 118 **************************************** 89 112 ************************************** 90 154 **************************************************** 91 151 *************************************************** 92 160 ****************************************************** 93 140 *********************************************** 94 142 ************************************************ 95 83 **************************** 96 91 ******************************* 97 49 ***************** 98 4 ** 99 0 -> best cutoff for all runs: 0.38 -> with weighted total 1*14 fp + 8 fn = 22 -> fp rate 0.7% fn rate 0.4% Finally, results.txt: cv1s -> cv2s -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams -> tested 200 hams & 200 spams against 1800 hams & 1800 spams false positive percentages 0.500 0.500 tied 0.500 0.500 
tied 0.000 0.000 tied 1.000 1.000 tied 0.500 0.500 tied 2.000 1.500 won -25.00% 1.000 1.000 tied 0.500 0.500 tied 1.000 1.000 tied 0.500 0.500 tied won 1 times tied 9 times lost 0 times total unique fp went from 15 to 14 won -6.67% mean fp % went from 0.75 to 0.7 won -6.67% false negative percentages 0.500 0.500 tied 0.000 0.000 tied 1.000 1.500 lost +50.00% 0.500 0.500 tied 0.000 0.500 lost +(was 0) 0.500 0.500 tied 0.000 0.000 tied 0.000 0.000 tied 0.500 0.500 tied 0.000 0.000 tied won 0 times tied 8 times lost 2 times total unique fn went from 6 to 8 lost +33.33% mean fn % went from 0.3 to 0.4 lost +33.33% ham mean ham sdev 17.22 7.72 -55.17% 7.39 6.43 -12.99% 18.69 8.46 -54.74% 7.27 5.77 -20.63% 18.86 8.94 -52.60% 6.50 4.71 -27.54% 16.79 7.92 -52.83% 7.75 6.01 -22.45% 18.66 8.88 -52.41% 7.09 5.98 -15.66% 18.47 8.99 -51.33% 7.83 8.27 +5.62% 18.19 8.51 -53.22% 6.99 6.02 -13.88% 18.38 8.44 -54.08% 6.80 5.45 -19.85% 17.67 8.38 -52.57% 7.88 7.12 -9.64% 17.72 8.10 -54.29% 6.18 4.79 -22.49% ham mean and sdev for all runs 18.07 8.43 -53.35% 7.22 6.15 -14.82% spam mean spam sdev 75.58 86.54 +14.50% 9.15 10.45 +14.21% 76.81 87.80 +14.31% 8.53 8.21 -3.75% 74.95 85.60 +14.21% 9.44 12.09 +28.07% 76.18 87.24 +14.52% 8.64 9.83 +13.77% 76.55 87.63 +14.47% 8.84 9.54 +7.92% 76.08 86.83 +14.13% 8.69 10.19 +17.26% 75.61 86.38 +14.24% 9.72 11.14 +14.61% 76.51 87.65 +14.56% 8.30 8.75 +5.42% 75.92 86.79 +14.32% 9.62 11.13 +15.70% 75.52 86.64 +14.72% 8.76 8.95 +2.17% spam mean and sdev for all runs 75.97 86.91 +14.40% 9.00 10.12 +12.44% ham/spam mean difference: 57.90 78.48 +20.58 - Alex From popiel@wolfskeep.com Thu Oct 10 20:13:49 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 12:13:49 -0700 Subject: [Spambayes] spamprob combining In-Reply-To: Message from Tim Peters of "Wed, 09 Oct 2002 23:29:38 EDT." References: Message-ID: <20021010191349.57718F59E@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. 
Alexander Popiel] >> I'll run this one after I get done with my initial clt tests >> (which are taking about 4.5 hours each :-/ ). > >Use less data? Yes, I could go back to using only 5 sets instead of 10... but then my results would be a bit less comparable with other runs I've done. >> I can't really say anything else, yet, but clt seems _much_ slower >> than the default classifier. > >I haven't really noticed that. If you're using your "--trainstyle full" >patch with timcv, then, yes, it would be enormously slower -- timcv gets >enormous *efficiency* benefits (both instruction-count and temporal cache >locality) out of incremental learning and unlearning. > >The "third training pass" unique to the clt methods also doubles the >training time (each msg in the training data is tokenized once to update the >wordprobs, and then a second time to compute the clt ham and spam population >statistics). Is it worth caching the token streams somehow? (I'm thinking not, since this is still in the research-project stage...) Quite possibly the problem is that I'm running all this on a PII-300 with only 64M RAM, which is also running X (but not Gnome or KDE; I'm a hardened twm user!)... - Alex From tim.one@comcast.net Thu Oct 10 21:09:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 10 Oct 2002 16:09:14 -0400 Subject: [Spambayes] A Tim-Combining run In-Reply-To: <20021010185827.39DF9F59E@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > All of this is on my website at: > > http://www.wolfskeep.com/~popiel/spambayes/timcomb > > Tim combining looks interesting, but I can't really tell if it's > a win or a lose. The fp/fn rates only changed by a message or two. The error rates aren't really the point here. As with the clt schemes, the point is whether you get a more useful middle ground. There's no way to tell that just from running the tests, though -- you have to stare at the mistakes and think.
> On the other hand, Tim combining shoved one of my ham scores into > the 80s, and some of my spam scores into the single digits. Ouch. Why is that painful? For example, if they were mistakes before too, what difference does it make to you if their scores change? In my large test run, it made the same mistakes before and after, but the worst mistakes were already so far out of range of a *usable* "middle ground" before that it made no difference that the scores got more extreme after. Those particular false positives and negatives are never going to swing into the other category, short of never calling anything spam, or never calling anything ham. The usefulness of the change was in a different dimension: the middle ground in which *marginal* mistakes lived contained significantly fewer messages after. The extremes are hopeless under any scheme (e.g., my "ham" consisting almost entirely of a giant Nigerian spam quote is simply never going to be *called* ham by any useful scheme -- and this doesn't have much of anything to do with how spamprobs get combined, the msg simply has an overwhelming number of overwhelmingly large-spamprob words). So one question for you is whether the extreme mistakes in your runs are in fact hopeless (remember that we're running a computer program here, not a psychic hotline ). This requires thought and careful judgment more than running tests. At some point the belief is that we'll actually deploy this code, and then we need a no-fudging way of saying "ham", "spam", "not sure". The clt schemes have appeared to be the only hope of getting a useful "not sure" category, but this variation on the non-clt scheme *may* be able to get there without the extra complication and expense of the clt schemes. From popiel@wolfskeep.com Thu Oct 10 21:36:09 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 10 Oct 2002 13:36:09 -0700 Subject: [Spambayes] A Tim-Combining run In-Reply-To: Message from Tim Peters of "Thu, 10 Oct 2002 16:09:14 EDT."
References: Message-ID: <20021010203609.73FC1F59E@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> On the other hand, Tim combining shoved one of my ham scores into >> the 80s, and some of my spam scores into the single digits. Ouch. > >Why is that painful? For example, if they were mistakes before too, what >difference does it make to you if their scores change? In my large test >run, it made the same mistakes before and after, but the worst mistakes were >already so far out of range of a *usable* "middle ground" before that it >made no difference that the scores got more extreme after. Point. I wasn't thinking. *bonk* >So one question for you is whether the extreme mistakes in your runs are in >fact hopeless (remember that we're running a computer program here, not a >psychic hotline ). Some are... like the FDIC sending me notice that NextBank folded, and (insert long list of marketroid-named services) would no longer be available as it went into receivership. I'll take a closer look. But darn it, I wanted to have it tell me if I was gonna have a hot date this weekend after I won the lottery! (I'd even settle for it telling me that my life is hard because I don't listen to it more often at $1.75 a minute, first 10 minutes free.) ;-) - Alex From tim.one@comcast.net Sat Oct 12 00:53:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 11 Oct 2002 19:53:45 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: [Gary Robinson] > ... > It's not the product of the p's that is a chi-square distribution, it's > the following, given p1, p2..., pn: > > 2*((ln p1) + (ln p2) + ... + (ln pn)) > > That expression has a chi-square distribution with 2n degrees of > freedom. I haven't found a reference to this online, and don't have reasonable access to a good technical library, so I need your help to get this straight. 
The first thing that strikes me is that it can't be quite right : a chi-squared statistic is positive by its very nature, but the expression is a sum of logs of values in (0., 1), so is necessarily negative. Here's the chi-squared function I'm using:

"""
import math as _math

def chi2Q(x2, v, exp=_math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom).

    v must be even.
    """
    assert v & 1 == 0
    m = x2 / 2.0
    sum = term = exp(-m)
    for i in range(1, v//2):
        term *= m / i
        sum += term
    return sum
"""

It's an especially simple and numerically stable calculation when v is even, and v always is even if the formulation is right. I understand I could save time with tabulated values, but speeding this is premature. Example:

>>> chi2Q(129.561, 100)
0.025000686582048785
>>>

Abramowitz & Stegun give 129.561 as the 0.025 point for the chi-squared distribution with 100 degrees of freedom. Etc. I'm confident *that* function is working correctly.

> So you feed THAT into the inverse-chi square function to get a p-value.

I'm feeding its negation in instead, since the correct result would be 1.0 for any negative x2 input.

> Let invchi(x, f), where x is the random variable and f is the degrees of
> freedom, be the inverse chi square function. Let S be a number near 1 when
> the email looks spammy and H be a number near 1 when the email
> looks hammy.
>
> Then you want
>
> S = 1 - invchi(2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n)
>
> and
>
> H = 1 - invchi(2*((ln p1) + (ln p2) + ...
+ (ln pn)), 2*n) OK, I believe I'm doing that, but multiplying the first argument by -2 instead of 2: """ from Histogram import Hist from random import random import sys h = Hist(20, lo=0.0, hi=1.0) def judge(ps, ln=_math.log): H = S = 0.0 for p in ps: S += ln(1.0 - p) H += ln(p) n = len(ps) S = 1.0 - chi2Q(-2.0 * S, 2*n) H = 1.0 - chi2Q(-2.0 * H, 2*n) return S/(S+H) warp = 0 if len(sys.argv) > 1: warp = int(sys.argv[1]) for i in range(5000): ps = [random() for j in range(50)] p = judge(ps + [0.99] * warp) h.add(p) print "Result for random vectors of 50 probs, +", warp, "forced to 0.99" print h.display() """ Note: as usual, scaling (S-H)/(S+H) from [-1, 1] into [0, 1] is ((S-H)/(S+H) + 1)/2 = ((S-H+S+H)/(S+H))/2 = (2*S/(S+H))/2 = S/(S+H) The bad(?) news is that, on random inputs, this is all over the map: Result for random vectors of 50 probs, + 0 forced to 0.99 5000 items; mean 0.50; sdev 0.27 -> min 0.000219435; median 0.49027; max 0.999817 * = 6 items 0.00 206 *********************************** 0.05 215 ************************************ 0.10 209 *********************************** 0.15 240 **************************************** 0.20 282 *********************************************** 0.25 239 **************************************** 0.30 270 ********************************************* 0.35 289 ************************************************* 0.40 276 ********************************************** 0.45 325 ******************************************************* 0.50 291 ************************************************* 0.55 300 ************************************************** 0.60 278 *********************************************** 0.65 267 ********************************************* 0.70 234 *************************************** 0.75 234 *************************************** 0.80 211 ************************************ 0.85 207 *********************************** 0.90 213 ************************************ 0.95 214 
************************************ I don't think it's uniformly distributed (across many runs, the small peakedness near the midpoint persists), but it's close. The better news is that it's indeed very sensitive to bias (perhaps that's *why* all-random data scores all over the map?): Result for random vectors of 50 probs, + 1 forced to 0.99 5000 items; mean 0.59; sdev 0.24 -> min 0.00175781; median 0.596673; max 0.999818 * = 7 items 0.00 49 ******* 0.05 94 ************** 0.10 119 ***************** 0.15 109 **************** 0.20 153 ********************** 0.25 185 *************************** 0.30 214 ******************************* 0.35 253 ************************************* 0.40 286 ***************************************** 0.45 338 ************************************************* 0.50 344 ************************************************** 0.55 381 ******************************************************* 0.60 369 ***************************************************** 0.65 340 ************************************************* 0.70 325 *********************************************** 0.75 315 ********************************************* 0.80 292 ****************************************** 0.85 285 ***************************************** 0.90 255 ************************************* 0.95 294 ****************************************** Result for random vectors of 50 probs, + 2 forced to 0.99 5000 items; mean 0.66; sdev 0.21 -> min 0.0214171; median 0.667916; max 0.9999 * = 7 items 0.00 4 * 0.05 17 *** 0.10 45 ******* 0.15 50 ******** 0.20 58 ********* 0.25 103 *************** 0.30 137 ******************** 0.35 181 ************************** 0.40 259 ************************************* 0.45 324 *********************************************** 0.50 372 ****************************************************** 0.55 377 ****************************************************** 0.60 412 *********************************************************** 0.65 427 
************************************************************* 0.70 345 ************************************************** 0.75 369 ***************************************************** 0.80 379 ******************************************************* 0.85 370 ***************************************************** 0.90 376 ****************************************************** 0.95 395 ********************************************************* Result for random vectors of 50 probs, + 10 forced to 0.99 5000 items; mean 0.88; sdev 0.13 -> min 0.494068; median 0.922177; max 1 * = 33 items 0.00 0 0.05 0 0.10 0 0.15 0 0.20 0 0.25 0 0.30 0 0.35 0 0.40 0 0.45 1 * 0.50 84 *** 0.55 139 ***** 0.60 172 ****** 0.65 257 ******** 0.70 282 ********* 0.75 326 ********** 0.80 369 ************ 0.85 560 ***************** 0.90 800 ************************* 0.95 2010 ************************************************************* Result for random vectors of 50 probs, + 20 forced to 0.99 5000 items; mean 0.97; sdev 0.06 -> min 0.543929; median 0.996147; max 1 * = 69 items 0.00 0 0.05 0 0.10 0 0.15 0 0.20 0 0.25 0 0.30 0 0.35 0 0.40 0 0.45 0 0.50 1 * 0.55 4 * 0.60 12 * 0.65 25 * 0.70 32 * 0.75 48 * 0.80 128 ** 0.85 203 *** 0.90 383 ****** 0.95 4164 ************************************************************* I won't bother to show it here, but the histograms are essentially mirror images if I force the bias value to 0.01 instead of to 0.99. Does the near-uniform spread of "scores" on wholly random inputs strike you as sane or insane? I don't understand the theoretical underpinnings of this test, so can't really guess. It did surprise me -- I guess I expected strong clustering at 0.5. 
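For readers who want to reproduce the shape of these experiments without the Histogram/random scaffolding, here is a self-contained sketch of the same machinery (chi2Q restated so it runs alone) on deterministic inputs. Note the sign convention being debated above: it is -2*sum(ln p), per Fisher's method for combining p-values, that follows a chi-squared distribution with 2n degrees of freedom, which is why the negation gets fed in.

```python
import math

def chi2Q(x2, v):
    """Return prob(chisq >= x2) for v degrees of freedom (v even).

    Same series as the function quoted earlier in the thread: the
    chi-squared survival function with even dof is a partial Poisson
    sum with mean x2/2.
    """
    assert v & 1 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

# Sanity check against the Abramowitz & Stegun 0.025 point at 100 dof
# quoted above.
print(chi2Q(129.561, 100))  # ~0.025

def judge(ps):
    """The S/(S+H) combination from the script above, minus the warp loop."""
    S = sum(math.log(1.0 - p) for p in ps)
    H = sum(math.log(p) for p in ps)
    n = len(ps)
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)  # negated: Fisher's -2*sum(ln(1-p))
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)
    return S / (S + H)

print(judge([0.5] * 50))   # perfectly neutral evidence -> 0.5
print(judge([0.99] * 50))  # uniform spammy bias -> very near 1
print(judge([0.01] * 50))  # uniform hammy bias -> very near 0
```

Deterministic neutral inputs sit at exactly 0.5 by the symmetry of the S/(S+H) scaling; the all-over-the-map histograms above come from feeding in *random* vectors, where S and H each wander freely.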
From tim.one@comcast.net Sat Oct 12 02:34:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 11 Oct 2002 21:34:50 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: Regardless of whether the chi-squared code makes sense, I whipped up another spamprob() variant to use it, and checked it in. There's a new option: [Classifier] use_chi_squared_combining: False This is yet another alternative to use_tim_combining (by the way, offline Gary and I agreed that tim_combining isn't biased, but are still butting heads over whether it's actually just a trivial transformation of Gary-combining ; scores from each are always on the same *side* of 0.5, but tim-combining scores are always at least as far from 0.5 as Gary-combining scores, and usually significantly farther -- that's why the spread increases so dramatically). Small test run, 10-fold CV with 400+400 in each set. As usual when switching combining schemes, the "won/lost" things don't make sense for the "after" run, because the appropriate value for spam_cutoff changes.
The before run is all-default, the after run just setting the new option true: -> tested 400 hams & 400 spams against 3600 hams & 3600 spams [ditto 19 times] false positive percentages 0.000 0.000 tied 0.000 0.250 lost +(was 0) 0.000 0.250 lost +(was 0) 0.000 0.000 tied 0.250 0.500 lost +100.00% 0.000 0.250 lost +(was 0) 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 6 times lost 4 times total unique fp went from 1 to 5 lost +400.00% mean fp % went from 0.025 to 0.125 lost +400.00% false negative percentages 0.000 0.000 tied 0.250 0.000 won -100.00% 0.000 0.000 tied 0.250 0.250 tied 0.250 0.000 won -100.00% 0.500 0.250 won -50.00% 0.000 0.000 tied 0.250 0.000 won -100.00% 0.500 0.250 won -50.00% 0.000 0.000 tied won 5 times tied 5 times lost 0 times total unique fn went from 8 to 3 won -62.50% mean fn % went from 0.2 to 0.075 won -62.50% ham mean ham sdev 27.29 0.49 -98.20% 5.80 3.68 -36.55% 27.62 0.62 -97.76% 5.57 4.91 -11.85% 27.25 0.66 -97.58% 5.52 5.40 -2.17% 27.75 0.25 -99.10% 5.36 2.39 -55.41% 27.47 0.84 -96.94% 6.07 6.78 +11.70% 27.65 0.78 -97.18% 5.84 4.68 -19.86% 28.00 0.75 -97.32% 5.85 4.41 -24.62% 27.44 0.29 -98.94% 5.35 2.47 -53.83% 27.55 0.36 -98.69% 5.31 2.66 -49.91% 27.95 0.68 -97.57% 5.85 4.37 -25.30% ham mean and sdev for all runs 27.60 0.57 -97.93% 5.66 4.39 -22.44% spam mean spam sdev 82.89 99.96 +20.59% 7.17 0.48 -93.31% 82.11 99.84 +21.59% 7.04 2.11 -70.03% 81.34 99.93 +22.85% 7.30 0.79 -89.18% 81.73 99.84 +22.16% 7.38 2.66 -63.96% 82.07 99.85 +21.66% 6.78 1.85 -72.71% 82.02 99.70 +21.56% 7.32 3.28 -55.19% 82.03 99.91 +21.80% 7.05 1.27 -81.99% 82.22 99.93 +21.54% 6.75 0.73 -89.19% 82.14 99.70 +21.38% 7.50 3.27 -56.40% 82.30 99.92 +21.41% 7.30 0.84 -88.49% spam mean and sdev for all runs 82.08 99.86 +21.66% 7.17 2.00 -72.11% ham/spam mean difference: 54.48 99.29 +44.81 Stare at what happened to the means, and it's easy to see that this is more Graham-like in its score distribution than anything we've seen 
since using Graham-combining: -> Ham scores for all runs: 4000 items; mean 0.57; sdev 4.39 -> min -2.22045e-013; median 8.33096e-009; max 100 Check out the median there: that's extreme. Note that one ham scored 1.0! That's the Nigerian-scam quote, and I don't care because it's hopeless. It actually scored 0.999999988294. * = 63 items 0.0 3813 ************************************************************* 0.5 32 * 1.0 18 * 1.5 13 * 2.0 6 * 2.5 5 * 3.0 3 * 3.5 4 * 4.0 7 * 4.5 7 * 5.0 8 * 5.5 2 * 6.0 2 * 6.5 3 * 7.0 2 * 7.5 3 * 8.0 4 * 8.5 0 9.0 4 * 9.5 2 * 10.0 2 * 10.5 0 11.0 2 * 11.5 1 * 12.0 1 * 12.5 1 * 13.0 2 * 13.5 1 * 14.0 1 * 14.5 1 * 15.0 1 * 15.5 2 * 16.0 1 * 16.5 3 * 17.0 1 * 17.5 1 * 18.0 3 * 18.5 0 19.0 1 * 19.5 0 20.0 1 * 20.5 1 * 21.0 0 21.5 1 * 22.0 0 22.5 0 23.0 1 * 23.5 0 24.0 0 24.5 0 25.0 0 25.5 2 * 26.0 2 * 26.5 0 27.0 1 * 27.5 0 28.0 0 28.5 1 * 29.0 1 * 29.5 2 * 30.0 0 30.5 0 31.0 1 * 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 1 * 34.5 0 35.0 0 35.5 0 36.0 1 * 36.5 3 * 37.0 2 * 37.5 0 38.0 0 38.5 0 39.0 0 39.5 2 * 40.0 0 40.5 1 * 41.0 1 * 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 1 * 45.0 0 45.5 1 * 46.0 0 46.5 0 47.0 0 47.5 1 * 48.0 0 48.5 0 49.0 2 * 49.5 0 50.0 0 50.5 0 51.0 0 51.5 1 * 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 1 * 55.0 1 * haven't seen this get a high score since using bigrams; 55.5 0 it's someone putting together a Python user group; 56.0 0 "fully functional", etc -- accidental spam phrases 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * "If you are interested in saving money ...": someone looking 63.5 0 to share a hotel room at a Python conference, but neglecting 64.0 0 to mention it *is* a Python conference 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 0 70.5 1 * this is a disturbing fp -- it's not spammish at all; 71.0 0 someone looking for help writing a webmasterish program; 71.5 0 lots of accidental high-spamprob words 72.0 0 72.5 0 
73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 1 * "TOOLS Europe 2000" conference announcement 77.0 0 ... 99.5 1 * Nigerian-scam quote -> Spam scores for all runs: 4000 items; mean 99.86; sdev 2.00 -> min 46.9565; median 100; max 100 * = 65 items Note that the *median* is 100: that's extreme. ... 46.5 1 * "Hello, my Name is BlackIntrepid" 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 1 * "Website Programmers Available Now!"; lots of tech terms 52.5 0 53.0 0 53.5 0 54.0 1 * This one slays me. It has this meta tag we ignore: It may be the most obvious spam ever created . 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 1 * 59.5 0 60.0 2 * 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 1 * 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 1 * If spam_cutoff had been here, it would have matched the 8 FN from the "before" run, and would have left only the Nigerian-scam and TOOLS announcement as f-p. 75.5 0 76.0 0 76.5 0 And if spam_cutoff had been here, the wretched TOOLS announcement would have gotten thru too (sorry, but that announcement is spam in my eyes) 77.0 1 * 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 1 * 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 1 * 86.0 0 86.5 1 * 87.0 0 87.5 0 88.0 3 * 88.5 1 * 89.0 2 * 89.5 0 90.0 0 90.5 0 91.0 0 91.5 1 * 92.0 2 * 92.5 0 93.0 3 * 93.5 0 94.0 2 * 94.5 0 95.0 0 95.5 1 * 96.0 1 * 96.5 3 * 97.0 2 * 97.5 1 * 98.0 4 * 98.5 6 * 99.0 3 * 99.5 3953 ************************************************************* Looks promising, albeit uncomfortably extreme. There's a huge and sparsely populated middle ground where all the mistakes live, except for the hopeless Nigerian scam quote.
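A sparse middle ground like this maps directly onto the three-way decision a deployed filter needs. A minimal sketch, where the 0.50 and 0.80 cutoffs and the ham_cutoff parameter name are purely illustrative (only spam_cutoff is a real option in these runs):

```python
def classify(score, ham_cutoff=0.50, spam_cutoff=0.80):
    """Three-way decision: scores inside [ham_cutoff, spam_cutoff]
    get kicked out for manual review.  Both cutoff values here are
    hypothetical, not spambayes defaults."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

# With chi-squared combining, almost everything scores near 0 or 1,
# so very little mail lands in "unsure".
print(classify(1e-8))    # -> ham
print(classify(0.65))    # -> unsure
print(classify(0.9999))  # -> spam
```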
Example: if we called everything from 50 thru 80 "the middle ground", that easily contains all but the Nigerian mistake, yet contains only 6 (of 4000 total) ham and only 8 (of 4000 total) spam. So in a manual-review system, this combines all the desirable properties: 1. Very little is kicked out for review. 2. There are high error rates among the msgs kicked out for review. 3. There are unmeasurably low error rates among the msgs not kicked out for review. Feel encouraged to try this if you like, but keep in mind that the *point* here is how useful the middle ground may be -- just pasting in f-p and f-n rates without analysis (== staring at the mistakes and thinking about them) won't help (unless they're both disasters). It may be wise to wait for Gary to look over my previous questions about the math -- I can't swear the implementation even makes sense at this point. From tim.one@comcast.net Sat Oct 12 07:27:29 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 02:27:29 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: OK! Gary and I exchanged info offline, and I believe the implementation of use_chi_squared_combining matches his intent for it. > ... > Example: if we called everything from 50 thru 80 "the middle > ground", ... in a manual-review system, this combines all the > desirable properties: > > 1. Very little is kicked out for review. > > 2. There are high error rates among the msgs kicked out for review. > > 3. There are unmeasurably low error rates among the msgs not kicked > out for review. On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the all-default scheme with the very touchy spam_cutoff. The middle ground is the *interesting* thing, and it's like a laser beam here (yippee!). In the "50 thru 80" range guessed at above, 1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). 
The other 2 FP scored 0.999999929221 (Nigerian scam quote) and 0.972986477986 (lady with the short question and long obnoxious employer-generated SIG). I don't believe any usable scheme will ever call those ham, though, or put them in a middle ground without greatly bloating the middle ground with correctly classified messages. 2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near 0.61, 1 near 0.63, and 1 near 0.68). The 3 remaining spam scored below 0.50: 0.35983017036 "Hello, my Name is BlackIntrepid" Except that it contained a URL and an invitation to visit it, this could have been a poorly written c.l.py post explaining a bit about hackers to newbies (and if you don't think there are plenty of those in my ham, you don't read c.l.py ). 0.39570232415 The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam, whose body consists of a uuencoded text file we throw away unlooked at. (This is quite curable, but I doubt it's worth the bother -- at least until spammers take to putting everything in uuencoded text files!) 0.499567195859 (about as close to "middle ground" cutoff as can be) A giant (> 20KB) base64-encoded plain text file. I've never bothered to decode this to see what it says; like the others, though, it's been a persistent FN under all schemes. Note that we do decode this; I've always assumed it's of the "long, chatty, just-folks" flavor of tech spam that's hard to catch; the list of clues contains "cookies", "editor", "ms-dos", "backslashes", "guis", "commands", "folder", "dumb", "(well,", "cursor", and "trick" (a spamprob 0.00183748 word!). For my original purpose of looking at a scheme for c.l.py traffic, this has become the clear leader among all schemes: while it's more extreme than I might like, it made very few errors, and a minuscule middle ground (less than 0.08% of all msgs) contains 64+% of all errors.
3 FN would survive, and 2 FP, but I don't expect that any usable scheme could do better on this data. Note that Graham combining was also very extreme, but had *no* usable middle ground on this data: all mistakes had scores of almost exactly 0.0 or almost exactly 1.0 (and there were more mistakes). How does it do for you? An analysis like the above is what I'm looking for, although it surely doesn't need to be so detailed. Here's the .ini file I used: """ [Classifier] use_chi_squared_combining: True [TestDriver] spam_cutoff: 0.70 nbuckets: 200 best_cutoff_fp_weight: 10 show_false_positives: True show_false_negatives: True show_best_discriminators: 50 show_spam_lo = 0.40 show_spam_hi = 0.80 show_ham_lo = 0.40 show_ham_hi = 0.80 show_charlimit: 100000 """ Your best spam_cutoff may be different, but the point to this exercise isn't to find the best cutoff, it's to think about the middle ground. Note that I set show_{ham,spam}_{lo,hi} to values such that I would see every ham and spam that lived in my presumed middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set show_charlimit to a large value so that I'd see the full text of each such msg. Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close to the middle ground cutoff. It's a msg I posted 2 years ago to the day (12 Sep 2000), and consists almost entirely of a rather long transcript of part of the infamous Chicago Seven trial: http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html I learned two things from this : 1. There are so many unique lexical clues when I post a thing, I can get away with posting anything. 2. 
"tyranny" is a spam clue, but "nazi" a ham clue: prob('tyranny') = 0.850877 prob('nazi') = 0.282714 leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs - tim From grobinson@transpose.com Sat Oct 12 16:39:24 2002 From: grobinson@transpose.com (Gary Robinson) Date: Sat, 12 Oct 2002 11:39:24 -0400 Subject: [Spambayes] spamprob combining In-Reply-To: Message-ID: This sounds like it's working out pretty well! If we get to the point that it becomes the accepted technique for spambayes, I'll add it to the my online essay. NOTE: As we've discussed ad nauseum, this multipicative thing is one-sided in its sensitivity, which is why we end up having to do something like S/(S+H) where S is based on (1-p) calcs for combining the p's and H is based on p calcs. There ARE meta-analytical ways of combining the p-values which are equally sensitive on both sides... but are a TAD overall less sensitive than the chi-square thing. And frankly, the S/(S+H)-style trick may take away a lot of that super-strength super sensitivity anyway -- maybe even all of the advantage over other methods (I just don't know without directly testing it). So a two-sided combining approach may perform equally well for our practical purposes... there's no way of knowing without trying. The advantage of such an approach would essentially be algorithmic elegance. No longer would we need that klugy (P-Q)/(P+Q) or S/(S+H) stuff which doesn't convert to a real probability. Instead, the combined P would be all we would need. Combined P near 1 would be spammy, and combined P near 0 would by hammy. And P would be a REAL probability (against the null hypothesis of randomness). I wouldn't expect any performance ADVANTAGE to this other approach, but it WOULD be more elegant. (Note, all these approaches depend on one or another statistical function as the current one does the inverse-chi-square). If you are interested in going that way let me know, and I'll send info on how to do it. 
Maybe you'll have another beautifully simple algorithm up your sleeve to implement the necessary statistical function. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 > From: Tim Peters > Date: Sat, 12 Oct 2002 02:27:29 -0400 > To: SpamBayes > Cc: Gary Robinson > Subject: RE: [Spambayes] spamprob combining > > OK! Gary and I exchanged info offline, and I believe the implementation of > use_chi_squared_combining matches his intent for it. > >> ... >> Example: if we called everything from 50 thru 80 "the middle >> ground", ... in a manual-review system, this combines all the >> desirable properties: >> >> 1. Very little is kicked out for review. >> >> 2. There are high error rates among the msgs kicked out for review. >> >> 3. There are unmeasurably low error rates among the msgs not kicked >> out for review. > > On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this > got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the > all-default scheme with the very touchy spam_cutoff. The middle ground is > the *interesting* thing, and it's like a laser beam here (yippee!). In the > "50 thru 80" range guessed at above, > > 1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). > The other 2 FP scored 0.999999929221 (Nigerian scam quote) and > 0.972986477986 (lady with the short question and long obnoxious > employer-generated SIG). I don't believe any usable scheme will > ever call those ham, though, or put them in a middle ground without > greatly bloating the middle ground with correctly classified > messages. > > 2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN > (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near > 0.61, 1 near 0.63, and 1 near 0.68).
The 3 remaining spam scored > below 0.50: > > 0.35983017036 > "Hello, my Name is BlackIntrepid" > Except that it contained a URL and an invitation to visit it, this > could have been a poorly written c.l.py post explaining a bit > about hackers to newbies (and if you don't think there are > plenty of those in my ham, you don't read c.l.py ). > > 0.39570232415 > The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam, > whose body consists of a uuencoded text file we throw away > unlooked at. (This is quite curable, but I doubt it's worth > the bother -- at least until spammers take to putting everything > in uuencoded text files!) > > 0.499567195859 (about as close to "middle ground" cutoff as can be) > A giant (> 20KB) base64-encoded plain text file. I've never > bothered to decode this to see what it says; like the others, > though, it's been a persistent FN under all schemes. Note that > we do decode this; I've always assumed it's of the "long, chatty, > just-folks" flavor of tech spam that's hard to catch; the list of > clues contains "cookies", "editor", "ms-dos", "backslashes", > "guis", "commands", "folder", "dumb", "(well,", "cursor", > and "trick" (a spamprob 0.00183748 word!). > > > For my original purpose of looking at a scheme for c.l.py traffic, this has > become the clear leader among all schemes: while it's more extreme than I > might like, it made very few errors, and a miniscule middle ground (less > than 0.08% of all msgs) contains 64+% of all errors. 3 FN would survive, > and 2 FP, but I don't expect that any usable scheme could do better on this > data. Note that Graham combining was also very extreme, but had *no* usable > middle ground on this data: all mistakes had scores of almost exactly 0.0 > or almost exactly 1.0 (and there were more mistakes). > > How does it do for you? An analysis like the above is what I'm looking for, > although it surely doesn't need to be so detailed. 
Here's the .ini file I > used: > > """ > [Classifier] > use_chi_squared_combining: True > > [TestDriver] > spam_cutoff: 0.70 > > nbuckets: 200 > best_cutoff_fp_weight: 10 > > show_false_positives: True > show_false_negatives: True > show_best_discriminators: 50 > show_spam_lo = 0.40 > show_spam_hi = 0.80 > show_ham_lo = 0.40 > show_ham_hi = 0.80 > show_charlimit: 100000 > """ > > Your best spam_cutoff may be different, but the point to this exercise isn't > to find the best cutoff, it's to think about the middle ground. Note that I > set > > show_{ham,spam}_{lo,hi} > > to values such that I would see every ham and spam that lived in my presumed > middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set > show_charlimit to a large value so that I'd see the full text of each such > msg. > > Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close > to the middle ground cutoff. It's a msg I posted 2 years ago to the day (12 > Sep 2000), and consists almost entirely of a rather long transcript of part > of the infamous Chicago Seven trial: > > http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html > > I learned two things from this : > > 1. There are so many unique lexical clues when I post a thing, I can > get away with posting anything. > > 2. "tyranny" is a spam clue, but "nazi" a ham clue: > > prob('tyranny') = 0.850877 > prob('nazi') = 0.282714 > > leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs - tim > From jm@jmason.org Wed Oct 9 13:21:11 2002 From: jm@jmason.org (Justin Mason) Date: Wed, 09 Oct 2002 13:21:11 +0100 Subject: [Spambayes] [SAtalk] fully-public corpus of mail available Message-ID: <20021009122116.6EB2416F03@jmason.org> (Please feel free to forward this message to other possibly-interested parties.) Hi all, One of the big problems working with spam classification, is finding good mail to test with. 
There are few public corpora available; Ion Androutsopoulos' "Ling-spam" corpus is one (hi Ion!), but unfortunately this does not contain all of the mail message data, so would not be useful to a SpamAssassin-style system (which relies heavily on header data), for example. Another effect of not having a common, shared corpus, is the difficulty this introduces in comparing accuracy rates between spam filter software; since everyone tests using different corpora, statistics can be unportable as a result. Building public corpora is difficult, as it typically involves saving your own (classified) mail. This brings privacy problems, as your mail senders may not wish to see this made public. But what the heck, that's what I've done anyway ;) Here's a public corpus I've assembled from my own corpora, removing messages which were not public in the first place. Please feel free to download it and use it for spam-filter development. It's quite small, but should be big enough for use as a reference corpus, at least, so that hit-rate statistics can be compared across tools. Hope it helps. It lives here: http://spamassassin.org/publiccorpus/ and here's the README.txt: Welcome to the SpamAssassin public mail corpus. This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points: - All headers are reproduced in full. Some address obfuscation has taken place; hostnames in some cases have been replaced with "example.com", which should have a valid MX record (if I recall correctly). In most cases though, the headers appear as they were received. - All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites. - Copyright for the text in the messages remains with the original senders. OK, now onto the corpus description. 
It's split into three parts, as follows: - spam: 500 spam messages, all received from non-spam-trap sources. - easy_ham: 350 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc). - hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc. The corpora are prefixed with "200210", because that's the date when I assembled it, so it's as good a version string as anything else ;) . They are compressed using "bzip2". This corpus lives at http://spamassassin.org/publiccorpus/ . Mail jm - public - corpus AT jmason dot org if you have questions, or to donate mail. (Oct 9 2002 jm) ------------------------------------------------------- This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf _______________________________________________ Spamassassin-talk mailing list Spamassassin-talk@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-talk From bkc@murkworks.com Sat Oct 12 20:07:50 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 12 Oct 2002 15:07:50 -0400 Subject: [Spambayes] Chi True results Message-ID: <3DA83A74.20675.1A642869@localhost> I ran this twice, first to get the recommended spam cutoff, the 2nd time with the recommended cutoff in the .ini then I compared it against the tim_combine_true test I ran previously. In this message: .ini, cmp.py results, histograms from chi true run. 
[Tokenizer] mine_received_headers: True [Classifier] use_central_limit = False use_central_limit2 = False use_central_limit3 = False use_tim_combining: False use_chi_squared_combining: True [TestDriver] spam_cutoff: 0.98 show_false_negatives: True show_false_positives: True nbuckets: 200 best_cutoff_fp_weight: 10 show_spam_lo: 0.4 show_spam_hi: 0.80 show_ham_lo = 0.40 show_ham_hi = 0.80 show_charlimit: 10000 save_trained_pickles: True save_histogram_pickles: True results/timcombinetrues.txt -> results/chitrues.txt -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams false positive percentages 1.077 0.154 won -85.70% 0.769 0.231 won -69.96% 0.769 0.077 won -89.99% 0.923 0.154 won -83.32% 0.769 0.154 won -79.97% 0.538 0.077 won -85.69% 0.538 0.077 won -85.69% 0.692 0.000 won -100.00% 0.769 0.231 won -69.96% 0.692 0.000 won -100.00% won 10 times tied 0 times lost 0 times total unique fp went from 98 to 15 won -84.69% mean fp % went from 0.753846153846 to 0.115384615385 won -84.69% false negative percentages 0.154 0.846 lost +449.35% 0.154 1.231 lost +699.35% 0.231 1.154 lost +399.57% 0.077 0.615 lost +698.70% 0.000 0.923 lost +(was 0) 0.231 1.308 lost +466.23% 0.231 0.692 lost +199.57% 0.077 1.077 lost +1298.70% 0.154 1.231 lost +699.35% 0.231 1.231 lost +432.90% won 0 times tied 0 times lost 10 times total unique fn went from 20 to 134 lost +570.00% mean fn % went from 0.153846153846 to 1.03076923077 lost +570.00% ham mean ham sdev 12.23 1.40 -88.55% 9.02 8.67 -3.88% 12.04 1.12 -90.70% 8.57 8.09 -5.60% 12.08 1.12 -90.73% 8.44 8.02 -4.98% 12.21 1.26 -89.68% 8.65 8.62 -0.35% 11.98 1.06 -91.15% 8.40 8.03 -4.40% 12.20 1.01 -91.72% 8.16 6.87 -15.81% 11.69 0.85 -92.73% 7.80 6.57 -15.77% 11.61 0.96 -91.73% 7.91 7.06 -10.75% 11.63 1.15 -90.11% 8.31 8.38 +0.84% 11.60 1.01 -91.29% 7.94 7.62 -4.03% ham mean and sdev for all runs 11.93 1.09 -90.86% 8.33 7.83 -6.00% spam mean spam sdev 90.31 99.74 +10.44% 7.59 3.59 -52.70% 90.59 99.67 +10.02% 7.68 4.17 -45.70% 90.72 
99.68 +9.88% 7.40 4.12 -44.32% 90.91 99.83 +9.81% 7.16 2.68 -62.57% 90.54 99.84 +10.27% 6.93 2.20 -68.25% 90.68 99.66 +9.90% 7.23 4.29 -40.66% 90.49 99.67 +10.14% 7.25 4.68 -35.45% 90.61 99.79 +10.13% 7.29 2.98 -59.12% 90.93 99.75 +9.70% 7.21 3.24 -55.06% 90.40 99.54 +10.11% 7.80 5.07 -35.00% spam mean and sdev for all runs 90.62 99.72 +10.04% 7.36 3.80 -48.37% ham/spam mean difference: 78.69 98.63 +19.94 -- histogram from chi: true -> Ham scores for all runs: 13000 items; mean 1.09; sdev 7.83 -> min -2.66454e-13; median 2.85882e-12; max 100 * = 204 items 0.0 12433 ************************************************************* 0.5 71 * 1.0 43 * 1.5 33 * 2.0 14 * 2.5 15 * 3.0 12 * 3.5 5 * 4.0 14 * 4.5 11 * 5.0 6 * 5.5 9 * 6.0 9 * 6.5 5 * 7.0 6 * 7.5 3 * 8.0 7 * 8.5 2 * 9.0 5 * 9.5 5 * 10.0 5 * 10.5 5 * 11.0 3 * 11.5 4 * 12.0 7 * 12.5 2 * 13.0 3 * 13.5 2 * 14.0 3 * 14.5 4 * 15.0 3 * 15.5 3 * 16.0 0 16.5 3 * 17.0 2 * 17.5 1 * 18.0 0 18.5 5 * 19.0 3 * 19.5 1 * 20.0 1 * 20.5 3 * 21.0 0 21.5 1 * 22.0 1 * 22.5 2 * 23.0 1 * 23.5 2 * 24.0 2 * 24.5 0 25.0 0 25.5 3 * 26.0 2 * 26.5 2 * 27.0 1 * 27.5 1 * 28.0 2 * 28.5 3 * 29.0 2 * 29.5 2 * 30.0 1 * 30.5 3 * 31.0 1 * 31.5 1 * 32.0 4 * 32.5 2 * 33.0 2 * 33.5 3 * 34.0 1 * 34.5 3 * 35.0 1 * 35.5 3 * 36.0 5 * 36.5 4 * 37.0 0 37.5 3 * 38.0 1 * 38.5 1 * 39.0 0 39.5 2 * 40.0 2 * 40.5 3 * 41.0 2 * 41.5 1 * 42.0 1 * 42.5 3 * 43.0 2 * 43.5 1 * 44.0 2 * 44.5 3 * 45.0 3 * 45.5 5 * 46.0 1 * 46.5 3 * 47.0 1 * 47.5 5 * 48.0 1 * 48.5 3 * 49.0 9 * 49.5 11 * 50.0 8 * 50.5 1 * 51.0 3 * 51.5 1 * 52.0 7 * 52.5 3 * 53.0 2 * 53.5 1 * 54.0 0 54.5 1 * 55.0 2 * 55.5 0 56.0 3 * 56.5 0 57.0 0 57.5 1 * 58.0 2 * 58.5 0 59.0 0 59.5 1 * 60.0 1 * 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 2 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 1 * 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 2 * 69.0 0 69.5 1 * 70.0 1 * 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 1 * 73.0 0 73.5 0 74.0 1 * 74.5 0 75.0 0 75.5 0 76.0 2 * 76.5 0 77.0 0 77.5 0 78.0 0 78.5 1 * 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 
81.5 0 82.0 2 * 82.5 0 83.0 0 83.5 1 * 84.0 0 84.5 3 * 85.0 0 85.5 1 * 86.0 1 * 86.5 0 87.0 1 * 87.5 1 * 88.0 2 * 88.5 1 * 89.0 0 89.5 0 90.0 2 * 90.5 0 91.0 0 91.5 1 * 92.0 0 92.5 1 * 93.0 1 * 93.5 0 94.0 2 * 94.5 1 * 95.0 1 * 95.5 2 * 96.0 1 * 96.5 2 * 97.0 1 * 97.5 2 * 98.0 0 98.5 0 99.0 3 * 99.5 12 * thanks for joining paypal, ETrade news, HP Symposiom, Registration ack from Cingular, EDN renewal, X10 newsletter (argh!), FAFSA US Dept Education renewal :-(, United Connection, Network Computing Renewal, Infotel Distributing -> Spam scores for all runs: 13000 items; mean 99.72; sdev 3.80 This histogram seems broken, I have 4 or 5 spams with prob < .0.05 > Survey on Software Reuse Views and Activity > You are invited to participate in my Dissertation research on the topic of ^M > Software Reuse. (naw) VoIP solutions for providers HP Enterprise Technical Symposium (oops, this should be ham, guess I got sick of getting these) -> min 0.000127988; median 100; max 100 * = 210 items 0.0 1 * ***New SAP Opportunities*** Client interviewing now!! 0.5 1 * Certified IT professional with over 6 years of Experience on Design and Coding. 1.0 0 1.5 1 * Senior Consultant with Experience on JD Edwards, ONE WORLD, XE, CNC, AS/400 is available 2.0 0 2.5 1 * Fax / Copier Sales / service call 2078787 3.0 1 * Development Services on Telecom/Datacom Protocols 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 1 * Certified IT professional with over 6 years of Experience on Design and Coding. 
7.5 0 8.0 0 8.5 1 * 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 1 * Use the Session Scheduler to personalize your training (hp, probably mis-classified, guess I did get sick of them) 16.5 1 * VoIP solutions for providers 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 1 * 21.0 0 21.5 0 22.0 2 * 22.5 0 23.0 0 23.5 0 24.0 0 24.5 1 * 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 1 * 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 1 * 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 1 * 44.5 2 * 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 1 * 49.0 0 49.5 1 * 50.0 9 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 1 * 53.5 0 54.0 0 54.5 0 55.0 0 55.5 2 * 56.0 1 * 56.5 0 57.0 1 * 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 2 * 63.0 0 63.5 0 64.0 0 64.5 2 * 65.0 0 65.5 1 * 66.0 1 * 66.5 0 67.0 0 67.5 0 68.0 0 68.5 1 * 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 1 * 73.0 0 73.5 1 * 74.0 0 74.5 0 75.0 0 75.5 0 76.0 5 * 76.5 0 77.0 1 * 77.5 2 * 78.0 2 * 78.5 1 * 79.0 2 * 79.5 2 * 80.0 1 * 80.5 0 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 2 * 83.5 1 * 84.0 3 * 84.5 0 85.0 1 * 85.5 1 * 86.0 2 * 86.5 1 * 87.0 0 87.5 0 88.0 2 * 88.5 0 89.0 1 * 89.5 1 * 90.0 2 * 90.5 5 * 91.0 0 91.5 4 * 92.0 3 * 92.5 2 * 93.0 1 * 93.5 3 * 94.0 2 * 94.5 5 * 95.0 3 * 95.5 4 * 96.0 5 * 96.5 6 * 97.0 5 * 97.5 4 * 98.0 10 * 98.5 16 * 99.0 33 * 99.5 12807 ************************************************************* -> best cutoff for all runs: 0.98 -> with weighted total 10*15 fp + 134 fn = 284 -> fp rate 0.115% fn rate 1.03% saving ham histogram pickle to class_hamhist.pik saving spam histogram pickle to class_spamhist.pik Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 12 21:09:55 
2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 16:09:55 -0400 Subject: [Spambayes] Chi True results In-Reply-To: <3DA83A74.20675.1A642869@localhost> Message-ID: [Brad Clements] > I ran this twice, first to get the recommended spam cutoff, the > 2nd time with the recommended cutoff in the .ini > > then I compared it against the tim_combine_true test I ran previously. Brad, of all the approaches you've tried here (and I really appreciate how many you've tried!), which have *you* been happiest with? The numbers can't tell me that, it's a human judgment. Note that the "recommended cutoff" isn't really a recommendation, it's a dry-as-dust number that objectively minimizes best_fp_cutoff_weight * total_fp + total_fn What you get out of that is a function of what you feed into it via selecting best_fp_cutoff_weight. The value for that *you* like is also a matter of personal judgment. > In this message: .ini, cmp.py results, histograms from chi true run. > > [Tokenizer] > mine_received_headers: True > > [Classifier] > use_central_limit = False > use_central_limit2 = False > use_central_limit3 = False > use_tim_combining: False > use_chi_squared_combining: True > > [TestDriver] > spam_cutoff: 0.98 Dang, that's big . 
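[Editor's note: the "recommended cutoff" objective Tim describes — minimize best_cutoff_fp_weight * total_fp + total_fn — amounts to a brute-force search over candidate cutoffs. A hypothetical sketch of that computation; the helper name and signature are mine, not the TestDriver's actual code.]

```python
def best_cutoff(ham_scores, spam_scores, fp_weight=10.0):
    """Brute-force the spam cutoff that minimizes
    fp_weight * fp + fn. With fp_weight=10, one false positive
    costs as much as ten false negatives."""
    best_cost, best_c = None, None
    for c in sorted(set(ham_scores) | set(spam_scores) | {0.0, 1.0}):
        fp = sum(1 for s in ham_scores if s >= c)   # ham scored as spam
        fn = sum(1 for s in spam_scores if s < c)   # spam scored as ham
        cost = fp_weight * fp + fn
        if best_cost is None or cost < best_cost:
            best_cost, best_c = cost, c
    return best_c
```

As Tim says, what comes out is a function of the weight fed in: a large fp_weight drags the "best" cutoff upward, trading false negatives for fewer false positives.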
> show_false_negatives: True > show_false_positives: True > nbuckets: 200 > best_cutoff_fp_weight: 10 > > show_spam_lo: 0.4 > show_spam_hi: 0.80 > show_ham_lo = 0.40 > show_ham_hi = 0.80 > show_charlimit: 10000 > > save_trained_pickles: True > save_histogram_pickles: True > > > > results/timcombinetrues.txt -> results/chitrues.txt > -> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams > > false positive percentages > 1.077 0.154 won -85.70% > 0.769 0.231 won -69.96% > 0.769 0.077 won -89.99% > 0.923 0.154 won -83.32% > 0.769 0.154 won -79.97% > 0.538 0.077 won -85.69% > 0.538 0.077 won -85.69% > 0.692 0.000 won -100.00% > 0.769 0.231 won -69.96% > 0.692 0.000 won -100.00% > > won 10 times > tied 0 times > lost 0 times > > total unique fp went from 98 to 15 won -84.69% > mean fp % went from 0.753846153846 to 0.115384615385 won -84.69% > > false negative percentages > 0.154 0.846 lost +449.35% > 0.154 1.231 lost +699.35% > 0.231 1.154 lost +399.57% > 0.077 0.615 lost +698.70% > 0.000 0.923 lost +(was 0) > 0.231 1.308 lost +466.23% > 0.231 0.692 lost +199.57% > 0.077 1.077 lost +1298.70% > 0.154 1.231 lost +699.35% > 0.231 1.231 lost +432.90% > > won 0 times > tied 0 times > lost 10 times > > total unique fn went from 20 to 134 lost +570.00% > mean fn % went from 0.153846153846 to 1.03076923077 lost +570.00% So is that a tradeoff (massive decrease in fp vs massive increase in fn) you're happy with? Would the middle ground here be useful to you? 
> ham mean ham sdev > 12.23 1.40 -88.55% 9.02 8.67 -3.88% > 12.04 1.12 -90.70% 8.57 8.09 -5.60% > 12.08 1.12 -90.73% 8.44 8.02 -4.98% > 12.21 1.26 -89.68% 8.65 8.62 -0.35% > 11.98 1.06 -91.15% 8.40 8.03 -4.40% > 12.20 1.01 -91.72% 8.16 6.87 -15.81% > 11.69 0.85 -92.73% 7.80 6.57 -15.77% > 11.61 0.96 -91.73% 7.91 7.06 -10.75% > 11.63 1.15 -90.11% 8.31 8.38 +0.84% > 11.60 1.01 -91.29% 7.94 7.62 -4.03% > > ham mean and sdev for all runs > 11.93 1.09 -90.86% 8.33 7.83 -6.00% > > spam mean spam sdev > 90.31 99.74 +10.44% 7.59 3.59 -52.70% > 90.59 99.67 +10.02% 7.68 4.17 -45.70% > 90.72 99.68 +9.88% 7.40 4.12 -44.32% > 90.91 99.83 +9.81% 7.16 2.68 -62.57% > 90.54 99.84 +10.27% 6.93 2.20 -68.25% > 90.68 99.66 +9.90% 7.23 4.29 -40.66% > 90.49 99.67 +10.14% 7.25 4.68 -35.45% > 90.61 99.79 +10.13% 7.29 2.98 -59.12% > 90.93 99.75 +9.70% 7.21 3.24 -55.06% > 90.40 99.54 +10.11% 7.80 5.07 -35.00% > > spam mean and sdev for all runs > 90.62 99.72 +10.04% 7.36 3.80 -48.37% > > ham/spam mean difference: 78.69 98.63 +19.94 > > > -- > > histogram from chi: true > > -> Ham scores for all runs: 13000 items; mean 1.09; sdev 7.83 > -> min -2.66454e-13; median 2.85882e-12; max 100 > * = 204 items > 0.0 12433 ************************************************************* > 0.5 71 * > 1.0 43 * > 1.5 33 * > 2.0 14 * > 2.5 15 * > 3.0 12 * > 3.5 5 * > 4.0 14 * > 4.5 11 * > 5.0 6 * > 5.5 9 * > 6.0 9 * > 6.5 5 * > 7.0 6 * > 7.5 3 * > 8.0 7 * > 8.5 2 * > 9.0 5 * > 9.5 5 * > 10.0 5 * > 10.5 5 * > 11.0 3 * > 11.5 4 * > 12.0 7 * > 12.5 2 * > 13.0 3 * > 13.5 2 * > 14.0 3 * > 14.5 4 * > 15.0 3 * > 15.5 3 * > 16.0 0 > 16.5 3 * > 17.0 2 * > 17.5 1 * > 18.0 0 > 18.5 5 * > 19.0 3 * > 19.5 1 * > 20.0 1 * > 20.5 3 * > 21.0 0 > 21.5 1 * > 22.0 1 * > 22.5 2 * > 23.0 1 * > 23.5 2 * > 24.0 2 * > 24.5 0 > 25.0 0 > 25.5 3 * > 26.0 2 * > 26.5 2 * > 27.0 1 * > 27.5 1 * > 28.0 2 * > 28.5 3 * > 29.0 2 * > 29.5 2 * > 30.0 1 * > 30.5 3 * > 31.0 1 * > 31.5 1 * > 32.0 4 * > 32.5 2 * > 33.0 2 * > 33.5 3 * > 34.0 1 * > 34.5 
3 * > 35.0 1 * > 35.5 3 * > 36.0 5 * > 36.5 4 * > 37.0 0 > 37.5 3 * > 38.0 1 * > 38.5 1 * > 39.0 0 > 39.5 2 * > 40.0 2 * > 40.5 3 * > 41.0 2 * > 41.5 1 * > 42.0 1 * > 42.5 3 * > 43.0 2 * > 43.5 1 * > 44.0 2 * > 44.5 3 * > 45.0 3 * > 45.5 5 * > 46.0 1 * > 46.5 3 * > 47.0 1 * > 47.5 5 * > 48.0 1 * > 48.5 3 * > 49.0 9 * > 49.5 11 * Suppose you were to call scores of .50 thru .80 "unsure", and stuffed them in a different folder(s). Then the ham from here: > 50.0 8 * > 50.5 1 * > 51.0 3 * > 51.5 1 * > 52.0 7 * > 52.5 3 * > 53.0 2 * > 53.5 1 * > 54.0 0 > 54.5 1 * > 55.0 2 * > 55.5 0 > 56.0 3 * > 56.5 0 > 57.0 0 > 57.5 1 * > 58.0 2 * > 58.5 0 > 59.0 0 > 59.5 1 * > 60.0 1 * > 60.5 1 * > 61.0 0 > 61.5 0 > 62.0 0 > 62.5 0 > 63.0 2 * > 63.5 0 > 64.0 0 > 64.5 0 > 65.0 0 > 65.5 1 * > 66.0 0 > 66.5 0 > 67.0 0 > 67.5 0 > 68.0 0 > 68.5 2 * > 69.0 0 > 69.5 1 * > 70.0 1 * > 70.5 0 > 71.0 0 > 71.5 1 * > 72.0 0 > 72.5 1 * > 73.0 0 > 73.5 0 > 74.0 1 * > 74.5 0 > 75.0 0 > 75.5 0 > 76.0 2 * > 76.5 0 > 77.0 0 > 77.5 0 > 78.0 0 > 78.5 1 * > 79.0 0 > 79.5 0 through here would get "booted out for manual review". There's not much ham in this range compared to 13,000 msgs. The spam in this range would *also* get "booted out for manual review", and that would catch 41 (see next histogram) of your false negatives. Is that attractive to you? Useless? In either case, would shifting the "unsure" range change your judgment? For example, if you shifted the upper end of the unsure range from .8 to .9, that would add 16 ham to the "booted out" range, in return for catching another 19 spam. 
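[Editor's note: the "unsure" routing sketched above is just a pair of thresholds. A minimal illustration using the 0.50/0.80 cutoffs discussed in this thread; the function itself is hypothetical, not spambayes code.]

```python
def classify(score, ham_cutoff=0.50, spam_cutoff=0.80):
    """Three-way routing: ham below ham_cutoff, spam at or above
    spam_cutoff, and everything in between booted out as 'unsure'
    for manual review. Defaults are the 0.50-0.80 range guessed at
    in this thread."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Shifting the upper cutoff (e.g. from 0.80 to 0.90) widens the unsure range, catching more spam at the cost of kicking out more ham for review.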
> 80.0 1 * > 80.5 1 * > 81.0 1 * > 81.5 0 > 82.0 2 * > 82.5 0 > 83.0 0 > 83.5 1 * > 84.0 0 > 84.5 3 * > 85.0 0 > 85.5 1 * > 86.0 1 * > 86.5 0 > 87.0 1 * > 87.5 1 * > 88.0 2 * > 88.5 1 * > 89.0 0 > 89.5 0 > 90.0 2 * > 90.5 0 > 91.0 0 > 91.5 1 * > 92.0 0 > 92.5 1 * > 93.0 1 * > 93.5 0 > 94.0 2 * > 94.5 1 * > 95.0 1 * > 95.5 2 * > 96.0 1 * > 96.5 2 * > 97.0 1 * > 97.5 2 * > 98.0 0 > 98.5 0 > 99.0 3 * > 99.5 12 * thanks for joining paypal, ETrade news, HP > Symposiom, Registration ack from Cingular, > EDN renewal, X10 newsletter (argh!), FAFSA US Dept > Education renewal :-(, > United Connection, Network Computing Renewal, Infotel Distributing > > -> Spam scores for all runs: 13000 items; mean 99.72; sdev 3.80 > > This histogram seems broken, I have 4 or 5 spams with prob < .0.05 I'm not sure what you mean here. Note that for hysterical raisins, the histogram buckets are labelled with 100x the score values. So a prob of 0.05 is in the bucket with label 5.0. There are 5 spam in the histogram below under 5.0 (= score 0.05). > > Survey on Software Reuse Views and Activity > > > You are invited to participate in my Dissertation research on > the topic of ^M > > Software Reuse. > > (naw) > > VoIP solutions for providers > > HP Enterprise Technical Symposium (oops, this should be ham, > guess I got sick of getting these) > > -> min 0.000127988; median 100; max 100 > * = 210 items > 0.0 1 * ***New SAP Opportunities*** Client interviewing now!! > 0.5 1 * Certified IT professional with over 6 years of > Experience on Design > and Coding. > 1.0 0 > 1.5 1 * Senior Consultant with Experience on JD Edwards, ONE > WORLD, XE, CNC, > AS/400 is available > 2.0 0 > 2.5 1 * Fax / Copier Sales / service call 2078787 > 3.0 1 * Development Services on Telecom/Datacom Protocols > 3.5 0 > 4.0 0 > 4.5 0 > 5.0 0 > 5.5 0 > 6.0 0 > 6.5 0 > 7.0 1 * Certified IT professional with over 6 years of > Experience on Design and Coding. 
> 7.5 0 > 8.0 0 > 8.5 1 * > 9.0 0 > 9.5 0 > 10.0 0 > 10.5 0 > 11.0 0 > 11.5 0 > 12.0 0 > 12.5 0 > 13.0 0 > 13.5 0 > 14.0 0 > 14.5 0 > 15.0 0 > 15.5 0 > 16.0 1 * Use the Session Scheduler to personalize your > training (hp, probably mis-classified, guess I did get sick of them) > 16.5 1 * VoIP solutions for providers > 17.0 0 > 17.5 0 > 18.0 0 > 18.5 0 > 19.0 0 > 19.5 0 > 20.0 0 > 20.5 1 * > 21.0 0 > 21.5 0 > 22.0 2 * > 22.5 0 > 23.0 0 > 23.5 0 > 24.0 0 > 24.5 1 * > 25.0 0 > 25.5 0 > 26.0 0 > 26.5 0 > 27.0 0 > 27.5 0 > 28.0 0 > 28.5 0 > 29.0 0 > 29.5 0 > 30.0 0 > 30.5 0 > 31.0 1 * > 31.5 0 > 32.0 0 > 32.5 0 > 33.0 0 > 33.5 0 > 34.0 0 > 34.5 0 > 35.0 0 > 35.5 0 > 36.0 0 > 36.5 0 > 37.0 0 > 37.5 0 > 38.0 0 > 38.5 1 * > 39.0 0 > 39.5 0 > 40.0 0 > 40.5 0 > 41.0 0 > 41.5 0 > 42.0 0 > 42.5 0 > 43.0 0 > 43.5 0 > 44.0 1 * > 44.5 2 * > 45.0 0 > 45.5 0 > 46.0 0 > 46.5 0 > 47.0 0 > 47.5 0 > 48.0 0 > 48.5 1 * > 49.0 0 > 49.5 1 * .50 thru .80 covers the spam starting here: > 50.0 9 * > 50.5 0 > 51.0 2 * > 51.5 0 > 52.0 1 * > 52.5 0 > 53.0 1 * > 53.5 0 > 54.0 0 > 54.5 0 > 55.0 0 > 55.5 2 * > 56.0 1 * > 56.5 0 > 57.0 1 * > 57.5 0 > 58.0 0 > 58.5 0 > 59.0 0 > 59.5 0 > 60.0 0 > 60.5 0 > 61.0 0 > 61.5 0 > 62.0 0 > 62.5 2 * > 63.0 0 > 63.5 0 > 64.0 0 > 64.5 2 * > 65.0 0 > 65.5 1 * > 66.0 1 * > 66.5 0 > 67.0 0 > 67.5 0 > 68.0 0 > 68.5 1 * > 69.0 0 > 69.5 0 > 70.0 0 > 70.5 0 > 71.0 0 > 71.5 0 > 72.0 0 > 72.5 1 * > 73.0 0 > 73.5 1 * > 74.0 0 > 74.5 0 > 75.0 0 > 75.5 0 > 76.0 5 * > 76.5 0 > 77.0 1 * > 77.5 2 * > 78.0 2 * > 78.5 1 * > 79.0 2 * > 79.5 2 * and ending above. > 80.0 1 * > 80.5 0 > 81.0 1 * > 81.5 0 > 82.0 1 * > 82.5 1 * > 83.0 2 * > 83.5 1 * > 84.0 3 * > 84.5 0 > 85.0 1 * > 85.5 1 * > 86.0 2 * > 86.5 1 * > 87.0 0 > 87.5 0 > 88.0 2 * > 88.5 0 > 89.0 1 * > 89.5 1 * And another 19 spam lived in [0.8, 0.9). 
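[Editor's note: as Tim explains above, the histogram buckets are labelled with 100x the score, so with nbuckets set to 200 each bucket covers half a point. A small sketch of that mapping, assuming evenly sized buckets; the helper name is mine.]

```python
def bucket_label(score, nbuckets=200):
    """Map a score in [0.0, 1.0] to its histogram bucket label.
    Labels are 100x the score ("hysterical raisins"); with 200
    buckets they run 0.0, 0.5, 1.0, ..., 99.5, so a prob of 0.05
    lands in the bucket labelled 5.0."""
    i = min(int(score * nbuckets), nbuckets - 1)  # clamp score == 1.0
    return i * (100.0 / nbuckets)
```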
> 90.0 2 * > 90.5 5 * > 91.0 0 > 91.5 4 * > 92.0 3 * > 92.5 2 * > 93.0 1 * > 93.5 3 * > 94.0 2 * > 94.5 5 * > 95.0 3 * > 95.5 4 * > 96.0 5 * > 96.5 6 * > 97.0 5 * > 97.5 4 * > 98.0 10 * > 98.5 16 * > 99.0 33 * > 99.5 12807 ************************************************************* > -> best cutoff for all runs: 0.98 > -> with weighted total 10*15 fp + 134 fn = 284 > -> fp rate 0.115% fn rate 1.03% One thing I noticed that doesn't require your personal judgment : unless, e.g., you join PayPal over and over again, the system is never going to learn that Subject: Get $5 by Referring Your Friends to PayPal! Dear Tim Peters, Thank you for joining PayPal! You can use your new account to make purchases from over 3 million eBay(TM) auctions, shop online at over 20,000 online stores that accept PayPal, or just collect money from friends and co-workers. [blah blah blah blah blah, and a URL containing "refer" ] isn't spam. Likewise for yearly renewal notices. I haven't had to deal with this since my ham is composed of newsgroup traffic. It does suggest that some form of whitelist is needed for personal email, else there appears no hope for these kinds of rare, commercial, pseudo-personalized mass mailings. They've got all the earmarks of spam; the only difference is that sometime in the past, you asked for them; but the tokenizer can't know that. From bkc@murkworks.com Sat Oct 12 21:21:33 2002 From: bkc@murkworks.com (Brad Clements) Date: Sat, 12 Oct 2002 16:21:33 -0400 Subject: [Spambayes] Chi True results In-Reply-To: References: <3DA83A74.20675.1A642869@localhost> Message-ID: <3DA84BBA.2174.1AA7A657@localhost> On 12 Oct 2002 at 16:09, Tim Peters wrote: > Brad, of all the approaches you've tried here (and I really appreciate how > many you've tried!), which have *you* been happiest with? The numbers can't > tell me that, it's a human judgment. > Oh, I reached nirvana a few weeks ago. Any of these schemes seem like a big win for me. 
though I did like the central limit schemes well enough. That is, the original graham method didn't have "sure, mostly sure" (ham x spam).. Which I like to have. I can appreciate gary's interest in numerical purity, but the absolute difference between 1% fn and 2%fn is, in my case, only 1 spam message a day. At this point, I'm working to put the rubber on the road and tackle deployment issues .. Like how could you implement this scheme for 300 users on an IMAP server? Not with a 20 megabyte pickle per user! if tim_combining works "nearly as well" as chi, but takes 1/4 the processor time.. I'd probably choose the former. Sorry, guess I haven't answered your question. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Sat Oct 12 21:59:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 16:59:45 -0400 Subject: [Spambayes] Chi True results In-Reply-To: <3DA84BBA.2174.1AA7A657@localhost> Message-ID: [Brad Clements] > Oh, I reached nirvana a few weeks ago. Cool -- I hope to join you there soon . > Any of these schemes seem like a big win for me. though I did > like the central limit schemes well enough. Because? That is, what about them was attractive to you, in contrast to the others? > That is, the original graham method didn't have "sure, mostly > sure" (ham x spam).. > Which I like to have. > > I can appreciate gary's interest in numerical purity, but the > absolute difference between 1% fn and 2%fn is, in my case, only > 1 spam message a day. All of the remaining schemes beyond the current default (the 3 clt schemes, tim combining, and chi combining) haven't been about numerical purity, but about refining "the middle ground": isolating as many mistakes into as small a group of "unsure" msgs as possible, with as least touchy a set of cutoff values as possible.
On my test data, chi combining blows the others out of the water by these measures, and python.org: 1. Deals with many more msgs than any individual deals with. and 2. Has a mail admin notorious for whining about currently reviewing a measly 20 msgs per day . Cutting an error rate in half means half the work, and probably a quarter of the whining, in that context. > At this point, I'm working to put the rubber on the road and > tackle deployment issues .. > Like how could you implement this scheme for 300 users on an IMAP > server? There you go: cut an error rate in half there, and your "1 msg per day" instantly turns into 300. > Not with a 20 megabyte pickle per user! Things to look at: we shouldn't need an 8-byte timestamp per word; the killcount may not be useful at all when we stop *comparing* schemes; about half of all words will be found only once in the whole database (this is an Invariant Truth across all computer indexing applications -- "hapax legomena"(*) is what it's called in the literature), so half the words in your database can be expected to be useless because unique; work needs to be done on pruning the database over time; and these are all related. Note that incremental adjustments to the clt schemes bristle with problems the non-clt schemes don't have, due to the third training pass unique to the clt schemes. > If tim_combining works "nearly as well" as chi, but takes 1/4 the > processor time.. I'd probably choose the former. Processor time won't be a factor here -- tokenization and I/O times dominate all schemes so far, and the combining method is an expense distinct from those (note that all the variations discussed here are purely variations in the combining method: they all see the same token streams and word counts, the differences are in how they *use* the evidence). 
I barely noticed the time difference as-is, yet chi combining is invoking log about 50x more often than necessary now, and computing chi2Q() to about 14 significant digits is way more than necessary too. > Sorry, guess I haven't answered your question. Indeed not, but you answered other interesting questions I didn't think to ask . (*) For our grammarians, the plural is hapaxes, as in 31.6% of English hapaxes have corresponding Lithuanian hapaxes. and Among the evangelists, Luke is the most capable of apparently writing "uncharacteristically" since he has the largest vocabulary, the greatest number of hapax legomena, and a disturbing habit of varying his synonyms. Paffenroth does not engage, for example, with Michael Goulder's claim that Luke introduces more hapaxes into Mark than he takes over. And you thought we were getting academic *here* . From rob@hooft.net Sat Oct 12 22:00:58 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 12 Oct 2002 23:00:58 +0200 Subject: [Spambayes] Chi**2 results Message-ID: <3DA88D8A.3070509@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Here is my chi results. I am amazed by the high cutoff it is advising me to use! This feels very good. On the FP side bad messages are:
* a yahoo account created to correct incorrect listings in their database
* A problem with my Linux Journal subscription
* India student applying for a course
* Amazon.com membership update
* Red Cross blood drive announcement
Which is 5 out of 16000; but I have to admit that even missing 4 out of these 5 would not have been too costly. The middle ground is amazingly empty! I'd almost want to set my cutoff at 0.99 or 0.995! One thing that does bother me a bit is that some words have a very high correlation of co-existing in a message, and there is no way of finding this out. E.g. 
all the "bad jokes" I'm referring to in the attachment were sent by a friend of mine that uses a very strange way of forwarding by modifying the "From:" line: From: callaway@indigo.picower.edu (David Callaway) (by way of Pieter Stouten) Which results in the highly correlated: prob('from:pieter') = 0.00151566 prob('message-id:@[158.117.170.103]') = 0.00306331 prob('x-mailer:eudora pro 3.1 for macintosh') = 0.00474183 prob('from:stouten)') = 0.0115681 prob('from:way') = 0.012894 prob('from:(by') = 0.0167286 Regards, Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment [TestDriver] pickle_basename = class save_trained_pickles = False show_histograms = True show_ham_lo = 0.40 show_best_discriminators = 50 nbuckets = 200 show_ham_hi = 0.80 spam_cutoff = 0.70 spam_directories = Data/Spam/Set%d show_spam_lo = 0.40 show_false_negatives = True ham_directories = Data/Ham/Set%d compute_best_cutoffs_from_histograms = True show_false_positives = True best_cutoff_fp_weight = 10 show_spam_hi = 0.80 save_histogram_pickles = False show_charlimit = 100000 [CV Driver] build_each_classifier_from_scratch = False [Tokenizer] mine_received_headers = False octet_prefix_size = 5 generate_long_skips = True count_all_header_lines = False check_octets = False ignore_redundant_html = False basic_header_tokenize = False safe_headers = abuse-reports-to date errors-to from importance in-reply-to message-id mime-version organization received reply-to return-path subject to user-agent x-abuse-info x-complaints-to x-face basic_header_skip = received date x-.* basic_header_tokenize_only = False retain_pure_html_tags = False -> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03 -> min -2.22045e-13; median 9.99201e-14; max 100 * = 253 items 0.0 15408 ************************************************************* 0.5 115 * 1.0 59 * 1.5 27 * 2.0 25 * 2.5 19 * 3.0 27 * 3.5 9 * 4.0 6 * 4.5 8 * 5.0 12 * 5.5 7 * 6.0 4 * 6.5 8 * 
7.0 6 * 7.5 4 * 8.0 6 * 8.5 5 * 9.0 4 * 9.5 12 * 10.0 9 * 10.5 6 * 11.0 3 * 11.5 1 * 12.0 6 * 12.5 4 * 13.0 1 * 13.5 1 * 14.0 2 * 14.5 6 * 15.0 3 * 15.5 2 * 16.0 3 * 16.5 5 * 17.0 4 * 17.5 5 * 18.0 1 * 18.5 2 * 19.0 2 * 19.5 2 * 20.0 7 * 20.5 1 * 21.0 4 * 21.5 2 * 22.0 4 * 22.5 5 * 23.0 2 * 23.5 3 * 24.0 1 * 24.5 3 * 25.0 2 * 25.5 1 * 26.0 2 * 26.5 1 * 27.0 0 27.5 1 * 28.0 2 * 28.5 0 29.0 2 * 29.5 2 * 30.0 4 * 30.5 3 * 31.0 1 * 31.5 1 * 32.0 1 * 32.5 2 * 33.0 0 33.5 1 * 34.0 2 * 34.5 1 * 35.0 1 * 35.5 2 * 36.0 0 36.5 2 * 37.0 0 37.5 6 * 38.0 2 * 38.5 1 * 39.0 4 * 39.5 0 40.0 2 * Someone replying to a spam on a mailinglist (NO Fwd:!); Bad joke 40.5 2 * Official company press release; Bad joke Bruker AXS Announces Appointment of Laura Francis as New Chief Financial Officer 3/25/2002 9:03:00 AM MADISON, Wis., Mar 25, 2002 (BUSINESS WIRE) Bruker AXS Inc., a leading global provider of advanced X-ray solutions for life and advanced materials sciences, today announced that it has appointed Laura Francis as its new Chief Financial Officer, effective April 8, 2002. Ms. Francis will also be responsible for investor relations. 41.0 2 * Bad joke; ISP helpdesk reply (payment related) 41.5 2 * Internic regret; Unsubscribe confirmation commercial mailing list. We regret to inform you that we were unable to accept your credit card payment for the domain names listed below, in the amount of $100.00. To determine the specific reason your credit card was not accepted please contact your credit card company as we do not receive that information. For accounting purposes, we can not reflect a paid status for this domain name. Please resubmit payment by calling (703)742-4777, or by sending a check to the address listed on your invoice. If you submit a check, please ensure that the domain name and invoice number are listed as references. We apologize for any inconvenience, and hope this matter can be resolved as quickly as possible. 
Thank you, Jill Dodson InterNIC Registration Services 42.0 0 42.5 4 * Unsubscribe from commercial newsletter; Bad joke; Bad joke; Someone mass-asking for help 43.0 1 * My wife sending me a link to a housing service. 43.5 2 * Happy birthday via WBW; Happy birthday via WBW. 44.0 2 * Press release International Court of Justice (Nigeria....); Linux journal autoreply 44.5 3 * Linux journal autoreply; Bad joke; Bad joke. 45.0 2 * Bad joke; Bad Joke. 45.5 2 * Customer license request; Bad joke ------=_NextPart_000_0005_01C00135.2727FB40 Content-Type: text/plain; charset="ks_c_5601-1987" Content-Transfer-Encoding: base64 SGksDQpJIGhhdmUgYSBxdWVzdGlvbi4uDQpEbyBJIG5lZWQgdG8gZ2V0IG5ldyBsaWNlbnNlIGlm IEkgdXBncmFkZSB0aGUgY29sbGVjdCBzb2Z0d2FyZT8NCkkgaGF2ZSB1cGdyYWRlZCBpdCBqdXN0 IG5vdywgYnV0IGl0IGRvZXNuJ3Qgc2VlbSB0byBiZSBjb25uZWN0ZWQgdG8gQ0NEIGNvbnRyb2xs ZXIuDQpTbyBJJ20gdXNpbmcgdGhlIG9sZCB2ZXJzaW9uLg0KDQpQbGVhc2UsIHRlbGwgbWUgd2hh dCBzaG9sZCBJIGRvLg0KDQogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNp bmNlcmVseSwNCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgRG9uZyBN b2sgU2hpbg0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBTZW91bCBO YXRpb25hbCBVbml2ZXJzaXR5DQo= 46.0 3 * Bad joke; Shogi mailing list posting; Bad joke 46.5 1 * My boss asking me for help with a SPAM on his mailing list (Fwd) 47.0 0 47.5 2 * Bad joke; Happy birthday via WBW 48.0 4 * Bad joke; Company press release; Bad joke; Colleague notifying me of an important new web service. The following press release went out earlier this morning. Bruker AXS Acquires MAC Science to Further Penetrate Japanese Life Science and Materials Research Markets 48.5 4 * Conference invitation; Happy birthday from WBW; Company sales budget (in German); Press release International Court of Justice (Congo, Burundi,...) 49.0 2 * Headhunter hunting me via mailing; Bad joke. 
49.5 3 * Shogi mailing list posting; My boss asking me how to deal with a spam message (Fwd:); Shogi mailing list posting announcing a tournament 50.0 1 * Customer sending 1.44MB binary file as text/plain attachment :-) 50.5 0 51.0 2 * Bad joke; Bad joke. 51.5 0 52.0 0 52.5 1 * Bad joke. 53.0 0 53.5 1 * ECA Annual fee reminder Dear ECA Members Just to remember those that have not paid the 2001 annual fee can do that in Krakow. It is easier and "cheaper". 54.0 0 54.5 0 55.0 1 * Python Professional Services Europe (PPSE) announcement. 55.5 0 56.0 0 56.5 0 57.0 1 * Bad joke. 57.5 0 58.0 0 58.5 1 * Bad joke. 59.0 1 * Happy birthday via WBW 59.5 0 60.0 1 * ISP Newsletter in German 60.5 2 * Happy birthday via WBW; request for information Hi, I see you name im the web museum, in M. C. Escher page, OK, I am building a site about ilusions, can you help me in this? I use 2 pictures of the artist in my page this is ok? Anyway you known other images that I can use in my work? Please visit my page at: http://www.geocities.com/SoHo/Studios/4762/ 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * Happy birthday via WBW 63.5 1 * Bad joke 64.0 0 64.5 0 65.0 2 * Bad joke; Bad joke. 65.5 0 66.0 0 66.5 0 67.0 0 67.5 2 * Free copy of Caldera linux; Web-site registration code. Linux Developer: We greatly appreciate the contribution you have made to the Linux community and, to demonstrate that appreciation, we would like to send you a free copy of our latest Linux-based product, OpenLinux Standard 1.1. 
68.0 1 * Happy birthday via WBW 68.5 0 69.0 0 69.5 0 70.0 0 70.5 0 71.0 1 * Happy birthday via WBW 71.5 0 72.0 0 72.5 2 * Self-reminder of a bug in a program; Auto-reply to a web request fields can start with 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 1 * Happy birthday via WBW 77.0 0 77.5 0 78.0 0 78.5 1 * Four11 directory listing announcement 79.0 0 79.5 0 80.0 0 80.5 1 * 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 1 * 85.5 0 86.0 0 86.5 1 * 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 1 * 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 1 * 97.0 1 * 97.5 0 98.0 1 * 98.5 0 99.0 1 * 99.5 7 * -> Spam scores for all runs: 5600 items; mean 99.35; sdev 5.40 -> min 4.22602e-09; median 100; max 100 * = 89 items 0.0 3 * 0.5 0 1.0 0 1.5 1 * 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 1 * 5.0 0 5.5 0 6.0 1 * 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 1 * 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 1 * 32.0 0 32.5 0 33.0 1 * 33.5 0 34.0 1 * 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 1 * 40.0 1 * "we would like to send you our information". May be misclassified. 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 1 * ObjectSpace C++ product announcement 45.0 0 45.5 0 46.0 0 46.5 1 * Webcounter You tried other counters now try something AMAZING!!! ONE CODE FOR ALL YOUR PAGES AND DOMAINS!!! http://www.freewebcounter.com 1. View your full raw log files and perform tracerouts from the hosts! 2. See the every page the person visited in order! 3. Top 50 full search phrase used to find your site! 4. All countries 5. unique visits / page views. 6. visites by day/week/month/year. 7. Top 50 browser agents. 8. Emails. 
47.0 0 47.5 0 48.0 0 48.5 1 * Character analysis This is a: Commercial Electronic Mail Message. It is TOTALLY LEGAL (Washington.' Law; chapter 19.190 RCW) and with U.S. Federal requirements for commercial email under bill: S 1618 Title 111 section 301 paragraph (a) (2) (C) because it includes a removal mechanism. To be removed:the list: please see below. 49.0 0 49.5 3 * Translation company based in Beijing; Distance education IT school; Anti-aids medicin from Beijing 50.0 7 * HTML-only with image maps; How to juggle women (book); Far east spam; HTML only far east spam; conference announcement; Conference announcement; Hunza diet bread; Tim's hometown stories; Far east spam 50.5 2 * Web advertising 51.0 1 * Far east spam 51.5 3 * Far east spam; Get rich via python mailinglist; empty message 52.0 0 52.5 0 53.0 1 * Hyper porn YIKES: prob('subject:porn') = 0.696523 only! From: HairyKevin Return-path: To: HairyKevin@aol.com Subject: hyper porn Date: Sun, 24 May 1998 15:11:51 EDT Organization: AOL (http://www.aol.com) Mime-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7bit click here 53.5 0 54.0 0 54.5 0 55.0 2 * Web hosting (German); Internet programming offered 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 2 * Affengeil; Far east spam AFFENGEIL !!!! 002.45.29.65.83 ... Ruf an! 58.5 0 59.0 1 * Microsoft office training 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 1 * here is the picture of me that you asked for... 62.5 1 * "E-bay auction" spam (Congradulations (sic) on your selling) 63.0 1 * Dahanut newsletter 63.5 1 * "My friend is going out with this girl" 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 1 * Make a million 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 2 * Both same mailinglist removal confirmation. MISCLASSIFIED. 70.0 0 70.5 0 71.0 0 71.5 0 72.0 1 * Spanish, HTML only. 
72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 1 * Happy birthday via WBW with commercial appendix 75.5 0 76.0 1 * Christian site in Jerusalem 76.5 1 * "\Below is the result of your feedback form." 77.0 0 77.5 0 78.0 2 * Diet science (close to the biology I worked in for a while); Medical website on-line announcement 78.5 1 * Medical conference announcement 79.0 0 79.5 1 * "I have attached my web page with new photos!" 80.0 1 * 80.5 0 81.0 2 * 81.5 4 * 82.0 2 * 82.5 2 * 83.0 0 83.5 1 * 84.0 1 * 84.5 1 * 85.0 2 * 85.5 1 * 86.0 1 * 86.5 1 * 87.0 0 87.5 5 * 88.0 0 88.5 0 89.0 1 * 89.5 1 * 90.0 1 * 90.5 2 * 91.0 2 * 91.5 2 * 92.0 2 * 92.5 3 * 93.0 36 * 93.5 1 * 94.0 4 * 94.5 4 * 95.0 5 * 95.5 6 * 96.0 9 * 96.5 8 * 97.0 4 * 97.5 7 * 98.0 7 * 98.5 15 * 99.0 26 * 99.5 5378 ************************************************************* -> best cutoff for all runs: 0.87 -> with weighted total 10*12 fp + 71 fn = 191 -> fp rate 0.075% fn rate 1.27% -> matched at 0.875 with 12 fp & 71 fn; fp rate 0.075%; fn rate 1.27% ---------------------- multipart/mixed attachment-- From tim.one@comcast.net Sat Oct 12 23:55:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 18:55:49 -0400 Subject: [Spambayes] chi-squared versus "cancellation disease" Message-ID: Turns out there's a good reason to keep 0.5 in your "middle ground" when using chi-squared combining: it's not fooled by "cancellation disease", and refuses to make a choice in either direction when it happens. I think this is good. chi2.py has a new function showscore() you can use to see exactly what happens on a given vector of probabilities. Like so: >>> from chi2 import showscore as s >>> s([.01, .99] * 30) # 30 pairs of "cancelling" extremes P(chisq >= 276.913 | v=120) = 2.00377e-014 P(chisq >= 276.913 | v=120) = 2.00377e-014 spam prob 1.0 ham prob 1.0 S/(S+H) 0.5 >>> The sums are so large that there's virtually no chance the probs are random under either the ham or spam measures. 
For a peculiar reason, this causes the internals to estimate the probability of both outcomes to be 1. When they're combined, though, 0.5 is the best guess it can make. Adding a bunch more clues in one direction doesn't really change this: >>> s([.01, .99] * 30 + [.99] * 10) P(chisq >= 369.017 | v=140) = 1.55622e-022 P(chisq >= 277.114 | v=140) = 4.54929e-011 spam prob 1.0 ham prob 0.999999999955 S/(S+H) 0.500000000011 >>> That only managed to convince it that spam was a *tiny* bit more likely -- the distribution is still wildly unlikely under either measure. In that sense it's accomplishing much of what the clt schemes try to do, but with less mechanism and pain. Seeing a non-pathological case should make normal behavior clearer: >>> s([.1, .1, .2, .3, .4, .4, .45, .7, .8]) P(chisq >= 10.4469 | v= 18) = 0.91634 P(chisq >= 21.259 | v= 18) = 0.266549 spam prob 0.0836602351022 ham prob 0.733450737215 S/(S+H) 0.102385401661 >>> That was clearly a hammish probability vector, and the scheme has no trouble realizing that. Note that we get some intuitive outcomes via unintuitive means: >>> s([0.5] * 20) P(chisq >= 27.7259 | v= 40) = 0.928958 P(chisq >= 27.7259 | v= 40) = 0.928958 spam prob 0.071042357154 ham prob 0.071042357154 S/(S+H) 0.5 >>> That is, a vector of all 0.5 is quite unlikely against the hypothesis that the probs are uniformly distributed (the sums are too small -- it's *too* regular), but it's equally unlikely under both measures, so 0.5 is the best guess it can make. BTW, when eyeballing this stuff, it's helpful to know that a chi-squared distribution with v degrees of freedom has mean v and sdev sqrt(2*v). From tim.one@comcast.net Sun Oct 13 02:14:24 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 12 Oct 2002 21:14:24 -0400 Subject: [Spambayes] chi-squared versus "prob strength" Message-ID: Note the default robinson_minimum_prob_strength is still 0.1, meaning that we ignore words with spamprobs in 0.4 to 0.6. 
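In code terms, that option is just a pre-filter on the clue vector before any combining happens. A minimal sketch for readers following along -- the helper name here is invented, only the option name and the "ignore spamprobs near 0.5" behavior come from this thread, and the exact treatment of the boundary values is an implementation detail I'm not asserting:

```python
def strong_clues(spamprobs, minimum_prob_strength=0.1):
    """Drop 'bland' words: spamprobs within minimum_prob_strength
    of the neutral 0.5.  With the default 0.1, words scoring roughly
    between 0.4 and 0.6 contribute nothing; a strength of 0.0 keeps
    every word."""
    return [p for p in spamprobs
            if abs(p - 0.5) >= minimum_prob_strength]

strong_clues([0.05, 0.45, 0.5, 0.55, 0.95])  # -> [0.05, 0.95]
```

Setting the option to 0.0 simply makes this filter a no-op, which is the rerun described in the rest of this message.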
Since the chi-squared test is testing the hypothesis that the probs are uniformly distributed, systematically leaving a chunk of probs "out of the middle" may bias it. Rerunning my fat test with this option set to 0.0 (don't ignore any words) gave nearly identical final results, but I didn't like the fine-grained differences. In particular, there's a seeming paradox:

ham mean                      ham sdev
 0.39  0.20  -48.72%           3.47  2.56  -26.22%
 0.33  0.15  -54.55%           3.13  2.30  -26.52%
 0.40  0.28  -30.00%           3.54  3.34   -5.65%
 0.23  0.09  -60.87%           2.24  1.40  -37.50%
 0.47  0.30  -36.17%           4.38  3.69  -15.75%
 0.31  0.18  -41.94%           3.05  2.56  -16.07%
 0.38  0.19  -50.00%           3.23  2.30  -28.79%
 0.29  0.15  -48.28%           2.80  2.12  -24.29%
 0.30  0.17  -43.33%           2.90  2.37  -18.28%
 0.55  0.32  -41.82%           4.45  3.78  -15.06%
ham mean and sdev for all runs
 0.36  0.20  -44.44%           3.38  2.74  -18.93%

spam mean                     spam sdev
99.93 99.95   +0.02%           1.25  1.18   -5.60%
99.94 99.96   +0.02%           1.24  1.11  -10.48%
99.98 99.99   +0.01%           0.34  0.32   -5.88%
99.92 99.93   +0.01%           1.84  2.24  +21.74%
99.93 99.94   +0.01%           1.72  1.44  -16.28%
99.88 99.90   +0.02%           1.95  1.75  -10.26%
99.86 99.86   +0.00%           2.22  2.60  +17.12%
99.91 99.96   +0.05%           1.26  0.57  -54.76%
99.90 99.93   +0.03%           1.75  1.43  -18.29%
99.96 99.97   +0.01%           0.73  0.61  -16.44%
spam mean and sdev for all runs
99.92 99.94   +0.02%           1.53  1.50   -1.96%

ham/spam mean difference: 99.56 99.74 +0.18

While the tight ham distribution got significantly tighter (but note that the effect on the spam distribution was inconsistent and overall virtually nil), the ham at the wrong end of the scale got worse. 
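For anyone who wants to poke at these scores without the project's chi2.py handy, the showscore() transcripts quoted in the previous message can be reproduced with a small standalone sketch. This is a reconstruction under stated assumptions -- the closed-form chi-squared survival function for even degrees of freedom, and S/(S+H) as the final blend, matching the printed output -- not the project source:

```python
import math

def chi2q(x2, v):
    """P(chisq >= x2) with v degrees of freedom, v even:
    exp(-m) * sum(m**i / i!) for i in 0 .. v/2 - 1, where m = x2/2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Chi-squared combining of word spamprobs: run one test against
    'looks like ham' and one against 'looks like spam', then blend
    the two one-sided results as S/(S+H)."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return s, h, s / (s + h)
```

Feeding it the hammish vector from the "cancellation disease" post gives spam prob about 0.084, ham prob about 0.733, and S/(S+H) about 0.102, matching the thread; the pathological [.01, .99] * 30 vector drives both S and H to essentially 1 and lands the blend on 0.5, as described there.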
Here are the ham-score histograms starting at 50, combined into one, "before" in the middle column (default prob strength), "after" in the right column (don't ignore any words): 50.0 1 1 50.5 3 0 51.0 0 1 51.5 - 52.0 1 0 52.5 - 53.0 - 53.5 0 1 54.0 - 54.5 - 55.0 - 55.5 - 56.0 2 0 56.5 - 57.0 - 57.5 - 58.0 - 58.5 1 0 59.0 0 1 59.5 - 60.0 - 60.5 - 61.0 - 61.5 - 62.0 - 62.5 0 1 63.0 - 63.5 1 0 64.0 - 64.5 - 65.0 - 65.5 0 1 66.0 1 0 66.5 - 67.0 - 67.5 - 68.0 1 0 68.5 - 69.0 - 69.5 - 70.0 0 1 70.5 - 71.0 - 71.5 - 72.0 - 72.5 - 73.0 - 73.5 1 0 74.0 0 1 74.5 - 75.0 - 75.5 - 76.0 - 76.5 - 77.0 - 77.5 - 78.0 - 78.5 0 1 79.0 - 79.5 - 80.0 - 80.5 0 1 81.0 - 81.5 - 82.0 - 82.5 - 83.0 - 83.5 - 84.0 - 84.5 - 85.0 0 1 85.5 - 86.0 - 86.5 - 87.0 0 1 87.5 - 88.0 - 88.5 - 89.0 - 89.5 - 90.0 - 90.5 - 91.0 - 91.5 - 92.0 - 92.5 - 93.0 - 93.5 - 94.0 - 94.5 - 95.0 - 95.5 - 96.0 - 96.5 - 97.0 1 0 97.5 - 98.0 - 98.5 - 99.0 - 99.5 1 2 The ham scores drift up here rather dramatically. I haven't found any particular *sense* to it. The lady with the brief question and the obnoxious employer-generated sig saw her score climb from 0.972986477986 to 0.998446743969. Here's the full list of "after" clues: prob('python.') = 0.000144374 prob('subject:Python') = 0.00115551 prob('header:Errors-To:1') = 0.0225343 prob('thanks,') = 0.0642414 prob('x-mailer:microsoft outlook express 4.72.3155.0') = 0.0652174 prob('help?') = 0.134215 prob('edinburgh') = 0.155172 prob('there,') = 0.164471 prob('but') = 0.223265 prob('skip:r 20') = 0.245934 prob('standard') = 0.260615 prob('road,') = 0.283848 prob('tel:') = 0.286105 prob('content-type:text/plain') = 0.306072 prob('calls') = 0.307063 prob('return') = 0.323593 prob('addressee') = 0.340522 prob('alteration') = 0.340522 prob('header:Message-ID:1') = 0.372119 prob('scan') = 0.388899 "before" ignored all words from here ... 
prob('caused') = 0.42498 prob('fax:') = 0.441534 prob('not') = 0.442943 prob('header:Date:1') = 0.47242 prob('the') = 0.476861 prob('to:2**0') = 0.48041 prob('subject: ') = 0.488475 prob('header:To:1') = 0.489883 prob('header:Subject:1') = 0.495711 prob('header:From:1') = 0.496624 prob('0131') = 0.5 prob('1127') = 0.5 prob('1550') = 0.5 prob('2552') = 0.5 prob('2dh,') = 0.5 prob('eh1') = 0.5 prob('email addr:standardlife.com') = 0.5 prob('email name:vickie_mills') = 0.5 prob('from:email addr:standardlife.com>') = 0.5 prob('from:email name:>> from chi2 import showscore as s >>> s([.2, .8, .9]) P(chisq >= 8.27033 | v= 6) = 0.218959 P(chisq >= 3.87588 | v= 6) = 0.693468 spam prob 0.781040515476 ham prob 0.306531778646 S/(S+H) 0.71815043441 >>> s([.2, .8, .9] + [0.5] * 10) P(chisq >= 22.1333 | v= 26) = 0.681383 P(chisq >= 17.7388 | v= 26) = 0.885068 spam prob 0.318617174026 ham prob 0.114932197304 S/(S+H) 0.734904015772 >>> I can't love that adding a pile of 100% neutral probs intensifies the spam judgment, and under the covers the effects on S and H are seen to be dramatic. Yes, "it's even more not uniformly distributed" after adding in 10 0.5s, but that's really got nothing to do with whether the msg is ham or spam! From tim.one@comcast.net Sun Oct 13 06:57:23 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 13 Oct 2002 01:57:23 -0400 Subject: [Spambayes] Chi**2 results In-Reply-To: <3DA88D8A.3070509@hooft.net> Message-ID: [Rob Hooft] > Here is my chi results. Thanks for trying this, Rob! > I am amazed by the high cutoff it is advising me to use! Well, you told it you hate fp 10x more than you hate fn (best_cutoff_fp_weight = 10), and that pushes the best cutoff up. Note that the cutoff is an after-the-fact thing, and moving it improves one error rate at the unavoidable expense of injuring the other -- it doesn't change any scores. 
It looks like this scheme has an extremely usable middle ground for you, so provided your deployment can *do* something with a middle ground, you've got a very large range for absolute cutoffs that would leave you staring at very few "unsure" msgs. > This feels very good. Looks good too . One part is *too* good:

-> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03
-> min -2.22045e-13; median 9.99201e-14; max 100
       ^^^^^^^^^^^^^^^^

It's not logically possible for a score to go negative -- we can thank rounding errors for that. > On the FP side bad messages are: > * a yahoo account created to correct incorrect listings in their > database > * A problem with my Linux Journal subscription > * India student applying for a course > * Amazon.com membership update > * Red Cross blood drive announcement > > Which is 5 out of 16000; but I have to admit that even missing 4 out of > these 5 would not have been too costly. I don't think any scheme can afford to throw msgs away entirely. What I hope instead is that a middle ground can shuffle unclear msgs into a "please help me" folder (or two, if it's still valuable to record the "ham or spam?" guess for these) where most mistakes live, and that any scheme tossing a msg entirely try to notify the sender. I personally would never use a scheme that tosses msgs entirely, but that's just me. Unless you create a lot of Yahoo accts, and have a lot of problems with your Linux Journal subscriptions, and etc, seems likely that the system just won't get enough training examples to learn that they're OK for you. A whitelist might help, except it's hard to populate one without first recognizing an FP from an unfortunate sender. > The middle ground is amazingly empty! I'd almost want to set my cutoff > at 0.99 or 0.995! It's OK by me if you do . > One thing that does bother me a bit is that some words have a very high > correlation of co-existing in a message, and there is no way of finding > this out. E.g. 
all the "bad jokes" I'm referring to in the attachment > were sent by a friend of mine that uses a very strange way of > forwarding by modifying the "From:" line: > > From: callaway@indigo.picower.edu (David Callaway) (by way of Pieter > Stouten) > > > Which results in the highly correlated: > > prob('from:pieter') = 0.00151566 > prob('message-id:@[158.117.170.103]') = 0.00306331 > prob('x-mailer:eudora pro 3.1 for macintosh') = 0.00474183 > prob('from:stouten)') = 0.0115681 > prob('from:way') = 0.012894 > prob('from:(by') = 0.0167286 I don't know whether to call that a bug or a feature. In this specific example, I think I have to call it a feature: the "bad joke" msgs appear to confuse the system routinely, and this bundle of very low-spamprob words may be all that's saving them from getting scores near 1.0. There are a significant number of my ham that are redeemed by this kind of thing too -- a well-known poster posting from a well-known address, but going on about something that has nothing to do with the newsgroup. Sucking out 8 distinct clues about who they are and where they posted from helps them a *lot* in these cases, even if all 8 come from the "From" line. If you turn on mine_received_headers, you'll also find that Neil goes out of his way to present IP addr and machine-name info in multiple ways, triggering the same kind of effect for "bad machines" and "bad networks". So, overall, "this kind of thing" has appeared valuable to me. OTOH, we've been reduced to stripping all HTML tags else we get a mountain of high-spamprob decorations (in legit HTML mail) that are nearly 100% correlated but each counts as if a killer-good clue all by itself. So it's at best a mixed bag. I don't know of a computationally cheap way to take correlations into account, else I would have tried that before resorting to stripping HTML tags (I hate throwing info away). 
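Rob's complaint above ("there is no way of finding this out") can at least be addressed offline: a brute-force pass over a corpus can count how often token pairs co-occur and flag near-perfect overlaps, even though nothing this expensive could run at scoring time. A rough sketch -- every name here is invented, and the quadratic pair expansion per message is exactly the cost problem being discussed:

```python
from collections import Counter
from itertools import combinations

def correlated_pairs(token_sets, min_support=10, min_jaccard=0.95):
    """Find token pairs that (almost) always appear together.

    token_sets: one set of tokens per message.  A pair is flagged when
    it co-occurs at least min_support times and its Jaccard overlap
    (co-occurrences / messages containing either token) is high."""
    token_count = Counter()
    pair_count = Counter()
    for tokens in token_sets:
        token_count.update(tokens)
        # Quadratic in tokens-per-message: cheap per msg, ruinous
        # if you try to track a full 100k-token vocabulary this way.
        pair_count.update(combinations(sorted(tokens), 2))
    flagged = []
    for (a, b), together in pair_count.items():
        union = token_count[a] + token_count[b] - together
        if together >= min_support and together / union >= min_jaccard:
            flagged.append((a, b, together / union))
    return flagged
```

Run over Rob's corpus, the 'from:pieter' / 'from:stouten)' bundle would show up with overlap near 1.0. Whether to then discount such bundles is the open question; as noted above, the same correlation effect is what rescues some legitimate hams.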
From rob@hooft.net Sun Oct 13 08:00:24 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 13 Oct 2002 09:00:24 +0200 Subject: [Spambayes] Chi**2 results References: Message-ID: <3DA91A08.70605@hooft.net> Tim Peters wrote: > Looks good too . One part is *too* good: > > -> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03 > -> min -2.22045e-13; median 9.99201e-14; max 100 > ^^^^^^^^^^^^^^^^ I noticed that... But indeed, one cannot blame the program if it is calculating chi2Q with 14 digit accuracy and then subtract it from 1.0..... > I don't think any scheme can afford to throw msgs away entirely. I have to admit that I do have a "spam" folder from SA at this moment, and that I am only "scanning" the index page of this for 3 seconds per week.... That is almost as good as throwing them out completely. A good feature of spamassassin is that it turns every suspect message into text/plain. This would be a good feature for the middle-ground messages (but it should be easy to undo somehow for middle-ground-negatives). > So it's at best a mixed bag. I don't know of a computationally cheap way to > take correlations into account, else I would have tried that before > resorting to stripping HTML tags (I hate throwing info away). We'd just have to make a 100k*100k correlation matrix. Programmatically very cheap ;-) I'm currently looking at the H and S values of middle ground messages. I have seen a few H+S>1.9 so far. Advantage of the current schema is that if H+S>1.25, the message is always at least in the middle ground. H+S<<1 are quite rare with this schema, but I've seen some with H=0.05 S=0.02 and will investigate whether something can be gained (sure fp/fn) in that area. Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Sun Oct 13 08:27:11 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 13 Oct 2002 09:27:11 +0200 Subject: [Spambayes] chi-squared versus "prob strength" References: Message-ID: <3DA9204F.5000509@hooft.net> Tim Peters wrote: > Note the default robinson_minimum_prob_strength is still 0.1, meaning that > we ignore words with spamprobs in 0.4 to 0.6. > > Since the chi-squared test is testing the hypothesis that the probs are > uniformly distributed, systematically leaving a chunk of probs "out of the > middle" may bias it. > > Rerunning my fat test with this option set to 0.0 (don't ignore any words) > gave nearly identical final results, but I didn't like the fine-grained > differences. Here is my cmp run for this. First is with 0.1, second with 0.0. Distributions are tighter. Is this due to the fact that we have more clues now, so the Chi2 distribution is more decisive? cv2s -> cv3s -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams [...] 
-> tested 1600 hams & 580 spams against 14400 hams & 5220 spams

false positive percentages
    0.062  0.188  lost  +203.23%
    0.312  0.438  lost   +40.38%
    0.062  0.125  lost  +101.61%
    0.062  0.125  lost  +101.61%
    0.062  0.125  lost  +101.61%
    0.062  0.062  tied
    0.250  0.250  tied
    0.125  0.188  lost   +50.40%
    0.250  0.312  lost   +24.80%
    0.000  0.000  tied

won   0 times
tied  3 times
lost  7 times

total unique fp went from 20 to 29 lost +45.00%
mean fp % went from 0.125 to 0.18125 lost +45.00%

false negative percentages
    1.034  1.034  tied
    0.345  0.345  tied
    0.517  0.345  won  -33.27%
    0.517  0.517  tied
    1.207  1.207  tied
    0.862  0.690  won  -19.95%
    0.862  0.690  won  -19.95%
    0.345  0.345  tied
    0.517  0.517  tied
    1.034  0.862  won  -16.63%

won   4 times
tied  6 times
lost  0 times

total unique fn went from 42 to 38 won -9.52%
mean fn % went from 0.724137931034 to 0.655172413793 won -9.52%

ham mean                      ham sdev
 0.52  0.39  -25.00%           4.49  4.46  -0.67%
 0.72  0.60  -16.67%           6.62  6.59  -0.45%
 0.63  0.45  -28.57%           4.83  4.42  -8.49%
 0.60  0.41  -31.67%           4.83  4.51  -6.63%
 0.52  0.36  -30.77%           4.26  4.06  -4.69%
 0.43  0.31  -27.91%           4.21  3.82  -9.26%
 0.64  0.52  -18.75%           5.75  5.72  -0.52%
 0.68  0.51  -25.00%           5.63  5.39  -4.26%
 0.70  0.62  -11.43%           5.71  6.13  +7.36%
 0.41  0.31  -24.39%           3.65  3.24  -11.23%
ham mean and sdev for all runs
 0.59  0.45  -23.73%           5.07  4.94  -2.56%

spam mean                     spam sdev
99.20 99.32   +0.12%           6.10  5.77   -5.41%
99.70 99.71   +0.01%           3.45  3.80  +10.14%
99.55 99.68   +0.13%           3.63  3.23  -11.02%
99.38 99.44   +0.06%           6.34  6.27   -1.10%
99.14 99.19   +0.05%           7.05  7.05   +0.00%
99.40 99.47   +0.07%           4.72  5.24  +11.02%
99.42 99.50   +0.08%           5.09  5.10   +0.20%
99.41 99.51   +0.10%           4.55  4.99   +9.67%
99.48 99.62   +0.14%           3.81  3.20  -16.01%
99.31 99.39   +0.08%           6.09  5.97   -1.97%
spam mean and sdev for all runs
99.40 99.48   +0.08%           5.22  5.21   -0.19%

ham/spam mean difference: 98.81 99.03 +0.22

-- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Sun Oct 13 09:13:47 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 13 Oct 2002 04:13:47 -0400
Subject: [Spambayes] chi-squared versus "prob strength"
In-Reply-To: <3DA9204F.5000509@hooft.net>
Message-ID:

[Tim]
> Note the default robinson_minimum_prob_strength is still 0.1 ...
> ...
> Rerunning my fat test with this option set to 0.0 (don't ignore
> any words) gave nearly identical final results, but I didn't like
> the fine-grained differences.

[Rob Hooft]
> Here is my cmp run for this. First is with 0.1, second with 0.0.
> Distributions are tighter. Is this due to the fact that we have more
> clues now, so the Chi2 distribution is more decisive?

It's been my belief that bland words are at best worthless as clues, and at worst actively hurt (experiment: fiddle your favorite scheme to look *only* at the bland words; do they have predictive power?). I think this is one of the schemes where they hurt, for the reason illustrated by the tiny example at the end of my original post:

"""
>>> from chi2 import showscore as s
>>> s([.2, .8, .9])
P(chisq >= 8.27033 | v= 6) = 0.218959
P(chisq >= 3.87588 | v= 6) = 0.693468
spam prob 0.781040515476
ham prob 0.306531778646
S/(S+H) 0.71815043441
>>> s([.2, .8, .9] + [0.5] * 10)
P(chisq >= 22.1333 | v= 26) = 0.681383
P(chisq >= 17.7388 | v= 26) = 0.885068
spam prob 0.318617174026
ham prob 0.114932197304
S/(S+H) 0.734904015772
>>>

I can't love that adding a pile of 100% neutral probs intensifies the spam judgment, and under the covers the effects on S and H are seen to be dramatic. Yes, "it's even more not uniformly distributed" after adding in 10 0.5s, but that's really got nothing to do with whether the msg is ham or spam!
""" The hypothesis that the spamprobs are uniformly distributed seems irrelevant to whether a msg is ham or spam, and dumping bland words in acts to reject the hypothesis for a reason that also has nothing to do with the distinction we're *trying* to make. The bland words seem most of all to intensify the decision the scheme would have made anyway if they weren't included. That makes things more extreme, but (IMO) not for a *reasonable* reason. I think it's akin to taking scores below 0.1 and dividing them by 2, and taking scores above 0.9 and adding half their distance to 1: it makes things more extreme, but not usefully. Extremity for extremity's sake is no virtue . > -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams > [...] > -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams > > false positive percentages > 0.062 0.188 lost +203.23% > 0.312 0.438 lost +40.38% > 0.062 0.125 lost +101.61% > 0.062 0.125 lost +101.61% > 0.062 0.125 lost +101.61% > 0.062 0.062 tied > 0.250 0.250 tied > 0.125 0.188 lost +50.40% > 0.250 0.312 lost +24.80% > 0.000 0.000 tied > > won 0 times > tied 3 times > lost 7 times > > total unique fp went from 20 to 29 lost +45.00% > mean fp % went from 0.125 to 0.18125 lost +45.00% > > false negative percentages > 1.034 1.034 tied > 0.345 0.345 tied > 0.517 0.345 won -33.27% > 0.517 0.517 tied > 1.207 1.207 tied > 0.862 0.690 won -19.95% > 0.862 0.690 won -19.95% > 0.345 0.345 tied > 0.517 0.517 tied > 1.034 0.862 won -16.63% > > won 4 times > tied 6 times > lost 0 times > > total unique fn went from 42 to 38 won -9.52% > mean fn % went from 0.724137931034 to 0.655172413793 won -9.52% > > ham mean ham sdev > 0.52 0.39 -25.00% 4.49 4.46 -0.67% > 0.72 0.60 -16.67% 6.62 6.59 -0.45% > 0.63 0.45 -28.57% 4.83 4.42 -8.49% > 0.60 0.41 -31.67% 4.83 4.51 -6.63% > 0.52 0.36 -30.77% 4.26 4.06 -4.69% > 0.43 0.31 -27.91% 4.21 3.82 -9.26% > 0.64 0.52 -18.75% 5.75 5.72 -0.52% > 0.68 0.51 -25.00% 5.63 5.39 -4.26% > 0.70 0.62 -11.43% 
5.71 6.13 +7.36%
> 0.41 0.31 -24.39% 3.65 3.24 -11.23%
>
> ham mean and sdev for all runs
> 0.59 0.45 -23.73% 5.07 4.94 -2.56%

Because the ham distribution got tighter and closer to 0, you need a larger spam_cutoff now. A spam_cutoff too low probably explains both the increase in FP rate and the decrease in FN rate.

> spam mean spam sdev
> 99.20 99.32 +0.12% 6.10 5.77 -5.41%
> 99.70 99.71 +0.01% 3.45 3.80 +10.14%
> 99.55 99.68 +0.13% 3.63 3.23 -11.02%
> 99.38 99.44 +0.06% 6.34 6.27 -1.10%
> 99.14 99.19 +0.05% 7.05 7.05 +0.00%
> 99.40 99.47 +0.07% 4.72 5.24 +11.02%
> 99.42 99.50 +0.08% 5.09 5.10 +0.20%
> 99.41 99.51 +0.10% 4.55 4.99 +9.67%
> 99.48 99.62 +0.14% 3.81 3.20 -16.01%
> 99.31 99.39 +0.08% 6.09 5.97 -1.97%
>
> spam mean and sdev for all runs
> 99.40 99.48 +0.08% 5.22 5.21 -0.19%
>
> ham/spam mean difference: 98.81 99.03 +0.22

I saw the same thing (qualitatively), and it's at least curious: ham mean and sdev consistently decrease; spam mean consistently increases, but less so; and the effects on spam sdev are a mixed bag, with almost no net effect when averaged out. BTW, with max_discriminators=150, you *may* have many ham that didn't have 150 unique extreme words, and in that case no longer ignoring the bland words may have a large effect similar to the one in the example above.

From rob@hooft.net Sun Oct 13 12:37:40 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 13:37:40 +0200
Subject: [Spambayes] chi-squared versus "prob strength"
References:
Message-ID: <3DA95B04.2040806@hooft.net>

I'm currently playing with a variant on the S/(S+H) formula. I replaced it with (S-H+1)/2.

Some examples where this doesn't make much difference:

H    S    S/(H+S)  (S-H+1)/2
0.01 0.99 0.99     0.99     Typical spam.
0.99 0.01 0.01     0.01     Typical ham.
0.50 0.50 0.50     0.50     Typical half-way.
0.90 0.90 0.50     0.50     Looks both like ham and spam
0.10 0.10 0.50     0.50     Doesn't look like either
0.80 0.95 0.54     0.57     Both, but a bit more spam

But where it makes a difference is:

H    S    S/(H+S)  (S-H+1)/2
0.05 0.20 0.80     0.57
0.02 0.05 0.71     0.51

Here, the low S value tells you "I don't have any proof that it looks like spam." Just because the H value is even lower, we suddenly put this in, or close to, the realm of certainty using S/(H+S). How come? Well, we're dividing by H+S, which tells the system we're sure it is either ham or spam. If we're fair, however, these messages with H+S << 1 are neither ham nor spam. So maybe we should not divide by H+S at all? Remember, the original formula was (S-H)/(S+H). Replace this by (S-H)/1.0 and you arrive at my (S-H+1)/2, which puts messages that are neither ham nor spam close to 0.50.

Tim Peters wrote:
> It's been my belief that bland words are at best worthless as clues, and at
> worst actively hurt (experiment: fiddle your favorite scheme to look *only*
> at the bland words; do they have predictive power?). I think this is one of
> the schemes where they hurt, for the reason illustrated by tiny example at
> the end of my original post:
>
> """
> >>> from chi2 import showscore as s
> >>> s([.2, .8, .9])
> P(chisq >= 8.27033 | v= 6) = 0.218959
> P(chisq >= 3.87588 | v= 6) = 0.693468
> spam prob 0.781040515476
> ham prob 0.306531778646
> S/(S+H) 0.71815043441

(S-H+1)/2 = 0.737

> >>> s([.2, .8, .9] + [0.5] * 10)
> P(chisq >= 22.1333 | v= 26) = 0.681383
> P(chisq >= 17.7388 | v= 26) = 0.885068
> spam prob 0.318617174026
> ham prob 0.114932197304
> S/(S+H) 0.734904015772

(S-H+1)/2 = 0.602

Better, isn't it?
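The showscore numbers quoted above, and the (S-H+1)/2 annotations, can be reproduced with a short self-contained sketch of chi-squared combining. The helper names (`chi2Q`, `combine`, `score_old`, `score_new`) are mine, not spambayes's, though the `chi2Q` series is the standard closed form for the chi-squared survival function with an even number of degrees of freedom:

```python
import math

def chi2Q(x2, v):
    """P(chisq >= x2) for v degrees of freedom, v even."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Chi-squared combining: S and H for a list of word spamprobs."""
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S, H

def score_old(S, H):   # the original S/(S+H) rule
    return S / (S + H)

def score_new(S, H):   # Rob's (S-H+1)/2 rule
    return (S - H + 1.0) / 2.0

S, H = combine([.2, .8, .9])
print(round(score_old(S, H), 3), round(score_new(S, H), 3))   # 0.718 0.737

# Ten perfectly neutral 0.5 probs push S/(S+H) further toward spam,
# while (S-H+1)/2 moves back toward the 0.5 middle ground:
S, H = combine([.2, .8, .9] + [0.5] * 10)
print(round(score_old(S, H), 3), round(score_new(S, H), 3))   # 0.735 0.602
```

The second pair of calls is the point of the thread: under S/(S+H), bland words intensify whatever decision would have been made anyway; under (S-H+1)/2 they pull the score toward "unsure".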
Elsewhere you write:

> Lady with the obnoxious sig:
> Ignoring bland words:
> P(chisq >= 222.333 | v=136) = 4.23496e-006
> P(chisq >= 106.24 | v=136) = 0.972237
> spam prob 0.999995765045
> ham prob 0.0277633711662
> S/(S+H) 0.972986500253

(S-H+1)/2 = 0.986

> Including bland words:
>
> P(chisq >= 282.465 | v=220) = 0.00283528
> P(chisq >= 163.095 | v=220) = 0.998449
> spam prob 0.997164718534
> ham prob 0.00155126034776
> S/(S+H) 0.99844674524

(S-H+1)/2 = 0.997

The difference is smaller. This small addition of certainty could be due to the bland words actually contributing.

> The ham whose score rose from 0.68 to 0.87:
> Ignoring bland words:
> P(chisq >= 123.422 | v=100) = 0.0560948
> P(chisq >= 97.2217 | v=100) = 0.560026
> spam prob 0.943905161882
> ham prob 0.439974054337
> S/(S+H) 0.682071925656

(S-H+1)/2 = 0.752

> Including bland words:
> P(chisq >= 174.229 | v=172) = 0.438174
> P(chisq >= 146.746 | v=172) = 0.918976
> spam prob 0.561826411084
> ham prob 0.0810237511331
> S/(S+H) 0.873961685171

(S-H+1)/2 = 0.740

Convinced? With this rule, adding the bland words no longer does any harm.
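For reference, the robinson_minimum_prob_strength filter the thread keeps toggling is just a strength cutoff around 0.5: at the default 0.1 it ignores spamprobs strictly between 0.4 and 0.6. A minimal sketch with my own function names (not spambayes's); inverting the test gives the "bland words only" experiment Rob runs:

```python
def strong_words(probs, min_strength=0.1):
    """Keep only words whose spamprob is at least min_strength away
    from 0.5; the default drops spamprobs strictly inside 0.4..0.6."""
    return [p for p in probs if abs(p - 0.5) >= min_strength]

def bland_words(probs, min_strength=0.1):
    """The inverted test: keep only the near-neutral words."""
    return [p for p in probs if abs(p - 0.5) < min_strength]

probs = [0.2, 0.45, 0.5, 0.55, 0.8]
print(strong_words(probs))   # [0.2, 0.8]
print(bland_words(probs))    # [0.45, 0.5, 0.55]
```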
For my set, with bland words, I end up with 3 spams < 0.01; 15499 hams < 0.01 4 spams < 0.10; 15766 hams < 0.01 9 hams > 0.90; 5658 spams < 0.10 3 hams > 0.99; 5392 spams > 0.99 S/S+H left and (S-H+1)/2 right: cv3s -> cv5s -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams -> tested 1600 hams & 580 spams against 14400 hams & 5220 spams false positive percentages 0.188 0.062 won -67.02% 0.438 0.125 won -71.46% 0.125 0.062 won -50.40% 0.125 0.062 won -50.40% 0.125 0.062 won -50.40% 0.062 0.062 tied 0.250 0.188 won -24.80% 0.188 0.250 lost +32.98% 0.312 0.188 won -39.74% 0.000 0.000 tied won 7 times tied 2 times lost 1 times total unique fp went from 29 to 17 won -41.38% mean fp % went from 0.18125 to 0.10625 won -41.38% false negative percentages 1.034 1.207 lost +16.73% 0.345 0.517 lost +49.86% 0.345 0.862 lost +149.86% 
0.517 0.862 lost +66.73% 1.207 1.207 tied 0.690 1.379 lost +99.86% 0.690 1.034 lost +49.86% 0.345 1.034 lost +199.71% 0.517 1.034 lost +100.00% 0.862 1.552 lost +80.05%

won 0 times
tied 1 times
lost 9 times

total unique fn went from 38 to 62 lost +63.16%
mean fn % went from 0.655172413793 to 1.06896551724 lost +63.16%

ham mean ham sdev
0.39 0.58 +48.72% 4.46 4.94 +10.76%
0.60 0.60 +0.00% 6.59 5.74 -12.90%
0.45 0.60 +33.33% 4.42 4.57 +3.39%
0.41 0.57 +39.02% 4.51 4.46 -1.11%
0.36 0.61 +69.44% 4.06 4.63 +14.04%
0.31 0.41 +32.26% 3.82 4.08 +6.81%
0.52 0.66 +26.92% 5.72 5.48 -4.20%
0.51 0.69 +35.29% 5.39 5.74 +6.49%
0.62 0.70 +12.90% 6.13 5.71 -6.85%
0.31 0.44 +41.94% 3.24 3.76 +16.05%

ham mean and sdev for all runs
0.45 0.59 +31.11% 4.94 4.96 +0.40%

spam mean spam sdev
99.32 98.98 -0.34% 5.77 6.32 +9.53%
99.71 99.25 -0.46% 3.80 4.28 +12.63%
99.68 99.15 -0.53% 3.23 4.55 +40.87%
99.44 98.90 -0.54% 6.27 7.00 +11.64%
99.19 98.96 -0.23% 7.05 6.67 -5.39%
99.47 98.96 -0.51% 5.24 5.93 +13.17%
99.50 98.94 -0.56% 5.10 6.17 +20.98%
99.51 98.95 -0.56% 4.99 5.91 +18.44%
99.62 99.18 -0.44% 3.20 4.70 +46.88%
99.39 98.93 -0.46% 5.97 6.40 +7.20%

spam mean and sdev for all runs
99.48 99.02 -0.46% 5.21 5.86 +12.48%

ham/spam mean difference: 99.03 98.43 -0.60

--
Rob W.W.

From rob@hooft.net Sun Oct 13 17:06:55 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 18:06:55 +0200
Subject: [Spambayes] Bland word only score..
Message-ID: <3DA99A1F.9040506@hooft.net>

[Tim: the previous copy of this message I sent to you was too quick.]

Tim Peters wrote:
> It's been my belief that bland words are at best worthless as clues,
> and at worst actively hurt (experiment: fiddle your favorite scheme
> to look *only* at the bland words; do they have predictive power?).

Just for kicks: Yes, with the latest scheme and (S-H+1)/2, it does give a third of a standard deviation of separation on my sets.
And the best is: it doesn't have any false positives :-P [Classifier] use_chi_squared_combining: True robinson_minimum_prob_strength = 0.1 [TestDriver] spam_cutoff: 0.70 nbuckets: 200 best_cutoff_fp_weight: 10 Obviously, the robinson_minimum_prob_strength test is inverted in the code. -> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41 -> min 40.7953; median 49.9561; max 57.7839 40.0 0 40.5 1 * 41.0 0 41.5 1 * 42.0 9 * 42.5 8 * 43.0 17 * 43.5 31 * 44.0 35 * 44.5 61 * 45.0 95 ** 45.5 136 ** 46.0 186 *** 46.5 317 ***** 47.0 383 ****** 47.5 572 ******** 48.0 832 *********** 48.5 1101 *************** 49.0 1455 ******************** 49.5 3829 *************************************************** 50.0 4625 ************************************************************* 50.5 1024 ************** 51.0 520 ******* 51.5 275 **** 52.0 176 *** 52.5 108 ** 53.0 71 * 53.5 66 * 54.0 30 * 54.5 16 * 55.0 10 * 55.5 4 * 56.0 3 * 56.5 2 * 57.0 0 57.5 1 * 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 -> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25 -> min 43.2803; median 50.2241; max 59.1799 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 1 * 43.5 2 * 44.0 1 * 44.5 4 * 45.0 8 * 45.5 12 * 46.0 30 * 46.5 38 * 47.0 53 ** 47.5 65 ** 48.0 94 *** 48.5 118 *** 49.0 206 ***** 49.5 497 ************ 50.0 2580 ************************************************************ 50.5 925 ********************** 51.0 493 ************ 51.5 234 ****** 52.0 135 **** 52.5 88 *** 53.0 95 *** 53.5 43 * 54.0 22 * 54.5 22 * 55.0 17 * 55.5 7 * 56.0 3 * 56.5 2 * 57.0 1 * 57.5 1 * 58.0 0 58.5 0 59.0 3 * 59.5 0 60.0 0 -> best cutoff for all runs: 0.58 -> with weighted total 10*0 fp + 5797 fn = 5797 -> fp rate 0% fn rate 99.9% -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Oct 13 18:58:46 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 19:58:46 +0200
Subject: [Spambayes] Total cost analysis
Message-ID: <3DA9B456.2010908@hooft.net>

I checked in a new program 'cvcost.py' that analyses the total human cost to you of spam filtering, based on a result of timcv.py.

The program is called cvcost.py. The default cost for an unknown message is set to $0.20, for a fn to $1 and for a fp to $10; these numbers can be changed using command line options.

amigo[142]spambayes%% /usr/local/bin/python cvcost.py cv[2345].txt
.........................................................................................
Optimal cost is 127.2 with grey zone between 49.0 and 99.0
.........................................................................................
Optimal cost is 143.4 with grey zone between 49.0 and 98.0
.........................................................................................
Optimal cost is 149.4 with grey zone between 49.0 and 98.0
.........................................................................................
Optimal cost is 103.2 with grey zone between 49.0 and 96.0
/usr/local/bin/python cost.py cv[2345].txt 26.88s user 0.14s system 98% cpu 27.346 total

The four runs that this represents are:

cv2.txt : Tim's suggested run (min_prob_str=0.1)
cv3.txt : Same run but min_prob_str=0 (has more fp)
cv4.txt : Failed try: if H+S<0.3 prob = 0.65 (force middle ground for strange)
cv5.txt : New decision criterion: prob = (S-H+1)/2

The latter is "objectively" the best....

Rob

--
Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From rob@hooft.net Sun Oct 13 20:24:50 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 13 Oct 2002 21:24:50 +0200
Subject: [Spambayes] Total cost analysis
References: <3DA9B456.2010908@hooft.net>
Message-ID: <3DA9C882.1030903@hooft.net>

I had
cv5.txt : New decision criterion: prob = (S-H+1)/2
    robinson_minimum_prob_strength = 0.0

Adding
cv6.txt : Same as cv5 but with
    robinson_minimum_prob_strength = 0.1

amigo[165]spambayes%% /usr/local/bin/python cvcost.py cv[56].txt
.........................................................................................
cv5.txt: Optimal cost is $103.2 with grey zone between 49.0 and 96.0
.........................................................................................
cv6.txt: Optimal cost is $109.0 with grey zone between 49.0 and 97.0

So for me, robinson_minimum_prob_strength = 0.0 gives the best result yet.

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Sun Oct 13 21:42:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 13 Oct 2002 16:42:32 -0400
Subject: [Spambayes] Total cost analysis
In-Reply-To: <3DA9B456.2010908@hooft.net>
Message-ID:

[Rob Hooft]
> I checked in a new program 'cvcost.py' that analyses the total human
> cost to you of human spam filtering based on a result of timcv.py

Very cool! Thank you. Everyone, note that this looks at the 'all runs' ham and spam histograms at the end of the file, so the granularity of the analysis is limited by your nbuckets setting. I usually run with nbuckets 200; maybe I should boost the default to that (it's currently 40).

> The program is called cvcost.py. The default cost for an unknown message
> is set to $0.20, for a fn to $1 and for a fp to $10; these numbers can
> be changed using command line options.

I find I can make almost any scheme "the winner" by fiddling these to extreme enough values.
In particular, by boosting the fp cost toward infinity, the all-default scheme Rulz -- even at nbuckets 200, the extreme schemes don't have fine enough granularity in the histograms to weed out the one or two (depending on scheme) extremely high-scoring false positives in my data. But I don't actually care if the Nigerian scam quote gets rejected, so like all automated analyses this has to be tempered with judgment. It's a wonderfully useful tool then! PS: I'm rerunning my fat test now with your alternative S-and-H combination scheme; I sure agree I like the effects it had in the examples you presented; we'll see whether my data agrees too ... From tim.one@comcast.net Mon Oct 14 02:05:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 13 Oct 2002 21:05:28 -0400 Subject: [Spambayes] chi-squared versus "prob strength" In-Reply-To: <3DA95B04.2040806@hooft.net> Message-ID: [Rob Hooft] > I'm playing currently with a variant on the S/(S+H) formula. I replaced > it with (S-H+1)/2 [and then shows specific examples where this gives intuitively more- sensible endcase results than the current rule] > ... > Better, isn't it? > ... > Convinced? I was, but more importantly my test data agreed, so I'm going to switch to this (the evidence is so consistent and solid on both our datasets that making it an option would supply a pointless choice -- losers are killed). Good show! 
S/(S+H) before, (S-H+1)/2 after (all defaults except use_chi_squared_combining in both): ham mean ham sdev 0.39 0.29 -25.64% 3.47 2.98 -14.12% 0.33 0.24 -27.27% 3.13 2.66 -15.02% 0.40 0.31 -22.50% 3.54 3.23 -8.76% 0.23 0.16 -30.43% 2.24 1.78 -20.54% 0.47 0.39 -17.02% 4.38 4.06 -7.31% 0.31 0.24 -22.58% 3.05 2.73 -10.49% 0.38 0.28 -26.32% 3.23 2.71 -16.10% 0.29 0.21 -27.59% 2.80 2.35 -16.07% 0.30 0.23 -23.33% 2.90 2.51 -13.45% 0.55 0.43 -21.82% 4.45 4.08 -8.31% ham mean and sdev for all runs 0.36 0.28 -22.22% 3.38 2.99 -11.54% spam mean spam sdev 99.93 99.95 +0.02% 1.25 1.01 -19.20% 99.94 99.96 +0.02% 1.24 1.11 -10.48% 99.98 99.99 +0.01% 0.34 0.19 -44.12% 99.92 99.93 +0.01% 1.84 1.93 +4.89% 99.93 99.94 +0.01% 1.72 1.59 -7.56% 99.88 99.90 +0.02% 1.95 1.72 -11.79% 99.86 99.88 +0.02% 2.22 2.27 +2.25% 99.91 99.94 +0.03% 1.26 0.83 -34.13% 99.90 99.92 +0.02% 1.75 1.55 -11.43% 99.96 99.97 +0.01% 0.73 0.43 -41.10% spam mean and sdev for all runs 99.92 99.94 +0.02% 1.53 1.41 -7.84% ham/spam mean difference: 99.56 99.66 +0.10 So it's even more extreme this way, but not in a way that hurts: the weird msgs in "the middle ground" are even more reliably *in* the middle ground now. For example, in my data, conference announcements, and the very difficult but rare long & chatty spam, almost always end up scoring near 0.5 now. But the regions of "extreme certainty" contain more msgs at the same time: HAM BEFORE -> Ham scores for all runs: 20000 items; mean 0.36; sdev 3.38 -> min -1.9984e-013; median 1.18333e-010; max 100 * = 319 items 0.0 19401 ************************************************************* 0.5 97 * HAM AFTER -> Ham scores for all runs: 20000 items; mean 0.28; sdev 2.99 -> min -9.99201e-014; median 6.28553e-011; max 100 * = 320 items 0.0 19492 ************************************************************* 0.5 104 * Median, mean and sdev all decreased, and about 100 more hams scored below 0.05. 
SPAM BEFORE -> Spam scores for all runs: 14000 items; mean 99.92; sdev 1.53 -> min 35.983; median 100; max 100 * = 228 items 99.0 15 * 99.5 13906 ************************************************************* SPAM AFTER -> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.41 -> min 29.6176; median 100; max 100 * = 229 items 99.0 13 * 99.5 13918 ************************************************************* The effects are milder here, but still in the right direction. The "BlackIntrepid" spam is the min-scoring spam in both cases: prob('*H*') = 0.930885 prob('*S*') = 0.523237 Chop that up any way you want, it's always going to look more like ham than spam, and it does look a lot like legit c.l.py traffic. cvcost doesn't find much bottom-line difference: chisq.txt: Optimal cost is $27.2 with grey zone between 50.0 and 74.0 chisq_altsh.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0 Given that I have two false positives that are never going to go away, and they're charged $10 each, the cost of both methods for 34,000 msgs is trivial. From tim.one@comcast.net Mon Oct 14 06:42:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 01:42:58 -0400 Subject: Bland words, and z-combining (was RE: [Spambayes] Bland word only score..) In-Reply-To: <3DA99A1F.9040506@hooft.net> Message-ID: [Rob Hooft] > [Tim: the previous copy of this message I sent to you was too quick.] Ah, replied to that privately. Bottom line: [tail end of histograms after running looking *only* at bland words] > -> best cutoff for all runs: 0.58 > -> with weighted total 10*0 fp + 5797 fn = 5797 > -> fp rate 0% fn rate 99.9% The overlap is so bad that even with 200 buckets, the best the histogram analysis could do is suggest a cutoff with a nearly 100% FN rate. 
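Both the histogram analysis's "best cutoff" line (minimizing the weighted total 10*fp + fn) and cvcost's grey-zone search boil down to scanning candidate thresholds for minimum cost. Here is a rough illustration of the two searches, using the $10/$1/$0.20 costs Rob described; the function names are mine and this is an assumed sketch of the idea, not cvcost.py's or the TestDriver's actual code:

```python
def best_cutoff(ham, spam, fp_weight=10, nbuckets=200):
    """Histogram-style search: one cutoff minimizing fp_weight*fp + fn."""
    grid = (100.0 * i / nbuckets for i in range(nbuckets + 1))
    return min((fp_weight * sum(1 for s in ham if s >= cut)   # ham called spam
                + sum(1 for s in spam if s < cut),            # spam called ham
                cut)
               for cut in grid)

def grey_zone_cost(ham, spam, lo, hi, c_fp=10.0, c_fn=1.0, c_unsure=0.2):
    """Cost if scores above hi are called spam, below lo ham, and the
    rest ('unsure') are handed to a human for review."""
    fp = sum(1 for s in ham if s > hi)
    fn = sum(1 for s in spam if s < lo)
    unsure = sum(1 for s in ham + spam if lo <= s <= hi)
    return c_fp * fp + c_fn * fn + c_unsure * unsure

def optimal_grey_zone(ham, spam, nbuckets=200):
    """cvcost-style brute force over bucket-boundary pairs; granularity
    is limited by nbuckets, as Tim notes in the thread."""
    grid = [100.0 * i / nbuckets for i in range(nbuckets + 1)]
    return min((grey_zone_cost(ham, spam, lo, hi), lo, hi)
               for lo in grid for hi in grid if lo <= hi)

# toy scores on the 0-100 histogram scale: one borderline msg on each side
cost, lo, hi = optimal_grey_zone([1, 2, 3, 50], [50, 97, 98, 99])
```

With the toy data, the cheapest choice is a grey zone covering both borderline 50-scores (two reviews at $0.20 each), rather than eating a $10 false positive or a $1 false negative.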
> -> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41 > -> min 40.7953; median 49.9561; max 57.7839 > -> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25 > -> min 43.2803; median 50.2241; max 59.1799 So whether ham or spam, nearly half the bland words point in the wrong direction. It's too much like adding in coin flips for my tastes. > I had cv5.txt : New decision criterion: prob = (S-H+1)/2 > robinson_minimum_prob_strength = 0.0 > > Adding cv6.txt : Same as cv5 but with > robinson_minimum_prob_strength = 0.1 > > > amigo[165]spambayes%% /usr/local/bin/python cvcost.py cv[56].txt > cv5.txt: Optimal cost is $103.2 with grey zone between 49.0 and 96.0 > cv6.txt: Optimal cost is $109.0 with grey zone between 49.0 and 97.0 > > So for me, robinson_minimum_prob_strength = 0.0 gives the best result > yet. It didn't help on my data: chisq.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0 bland.txt: Optimal cost is $28.2 with grey zone between 50.0 and 85.0 The difference is so small I can't swear it hurt, either. I think the difference in your case is too small to be confident too. There's *one* scheme where including the bland words helps me: there's another option use_z_combining I haven't talked about here, which implements another speculative idea from Gary. That one is, well, extremely extreme. Only 16 of 20,000 ham scored over 0.50 using it, and only 3 of 14,000 spam scored under 0.50. The 16 FP include my 2 that will never go away, and they score 1.00000000000 and 0.999693086732 even with the bland words. BTW, in *some* sense the z-combining score is an actual probability. With the all-default costs, cvcost sez z-combining worked even better for me (including all bland words): zcomb.txt: Optimal cost is $26.8 with grey zone between 75.0 and 90.0 The difference between that and chisq.txt's $27.00 is one "not sure" msg out of 34,000, so I'm not highly motivated to pursue it. 
But I encourage others to try it -- it may work better on harder data than mine! I'll note that it suffers its own form of "cancellation disease" (one of my very long spam scored 0.0000000000041), which the chi-squared scheme is refreshingly free of (that same spam scored 0.5 under chi combining). If you want to try it, I suggest """ [Classifier] use_z_combining: True robinson_minimum_prob_strength: 0.0 [TestDriver] nbuckets: 200 """ I'd rather that people who haven't been playing along lately try chi-combining, though, because as far as I'm concerned, the results so far say it's the best scheme we've got -- and as someone else recently suggested, it's high time to start killing off the losers again. """ [Classifier] use_chi_squared_combining: True [TestDriver] nbuckets: 200 """ I sped that up, BTW (it invokes log() up to 150x less often now). Note that chi and z combining do NOT require "the third" training pass, so cross-validation tests can be run in the default "high speed" mode (incremental training and untraining work fine with these). From rob@hooft.net Mon Oct 14 07:18:49 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Mon, 14 Oct 2002 08:18:49 +0200 Subject: Bland words, and z-combining (was RE: [Spambayes] Bland word only score..) References: Message-ID: <3DAA61C9.4090108@hooft.net> Tim Peters wrote: > There's *one* scheme where including the bland words helps me: there's > another option use_z_combining I haven't talked about here, which implements > another speculative idea from Gary. That one is, well, extremely extreme. > Only 16 of 20,000 ham scored over 0.50 using it, and only 3 of 14,000 spam > scored under 0.50. The 16 FP include my 2 that will never go away, and they > score 1.00000000000 and 0.999693086732 even with the bland words. BTW, in > *some* sense the z-combining score is an actual probability. I tried z-combining before going to bed last night (saw it in the CVS), but it cost me $6 more for my 21800 messages than chi2-combining (i.e. 
a whopping 0.03 cents per message) and I didn't have the time this morning before going to work to check why. The problem starts to be to find a set of corpuses that are difficult enough to score; I am quite happy with the separation I have now.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From anthony@interlink.com.au Mon Oct 14 07:31:31 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 14 Oct 2002 16:31:31 +1000 Subject: [Spambayes] chi-squared versus "prob strength" In-Reply-To: Message-ID: <200210140631.g9E6VWn31907@localhost.localdomain> >>> Tim Peters wrote > I was, but more importantly my test data agreed, so I'm going to switch to > this (the evidence is so consistent and solid on both our datasets that > making it an option would supply a pointless choice -- losers are killed). > Good show! Here's what my mungo-test set shows for this (before is pre-Rob Hooft's change, after is current CVS) chi2s.txt -> chi2as.txt -> tested 3490 hams & 1687 spams against 31410 hams & 15161 spams -> tested 3490 hams & 1682 spams against 31410 hams & 15166 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1679 spams against 31410 hams & 15169 spams -> tested 3490 hams & 1686 spams against 31410 hams & 15162 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1678 spams against 31410 hams & 15170 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1683 spams against 31410 hams & 15165 spams -> tested 3490 hams & 1689 spams against 31410 hams & 15159 spams -> tested 3490 hams & 1687 spams against 31410 hams & 15161 spams -> tested 3490 hams & 1682 spams against 31410 hams & 15166 spams -> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams -> tested 3490 hams & 1679 spams against 31410 hams & 15169 spams -> tested 3490 hams & 1686 spams against 31410 hams & 15162 spams -> tested 3490 hams & 
1688 spams against 31410 hams & 15160 spams
-> tested 3490 hams & 1678 spams against 31410 hams & 15170 spams
-> tested 3490 hams & 1688 spams against 31410 hams & 15160 spams
-> tested 3490 hams & 1683 spams against 31410 hams & 15165 spams
-> tested 3490 hams & 1689 spams against 31410 hams & 15159 spams

false positive percentages
0.946 0.974 lost +2.96%
0.917 0.917 tied
0.802 0.831 lost +3.62%
0.659 0.860 lost +30.50%
0.573 0.659 lost +15.01%
0.802 0.831 lost +3.62%
0.716 0.745 lost +4.05%
0.516 0.544 lost +5.43%
0.630 0.688 lost +9.21%
0.917 1.003 lost +9.38%

won 0 times
tied 1 times
lost 9 times

total unique fp went from 261 to 281 lost +7.66%
mean fp % went from 0.747851002865 to 0.805157593123 lost +7.66%

false negative percentages
0.356 0.296 won -16.85%
0.119 0.059 won -50.42%
0.237 0.237 tied
0.476 0.476 tied
0.297 0.237 won -20.20%
0.415 0.415 tied
0.596 0.477 won -19.97%
0.296 0.237 won -19.93%
0.416 0.416 tied
0.355 0.296 won -16.62%

won 6 times
tied 4 times
lost 0 times

total unique fn went from 60 to 53 won -11.67%
mean fn % went from 0.356257958499 to 0.314689990048 won -11.67%

ham mean ham sdev
3.46 3.24 -6.36% 12.12 11.96 -1.32%
3.01 2.85 -5.32% 11.48 11.39 -0.78%
3.28 3.01 -8.23% 11.45 11.22 -2.01%
3.23 3.02 -6.50% 11.43 11.27 -1.40%
3.15 2.88 -8.57% 10.65 10.37 -2.63%
3.17 2.95 -6.94% 11.30 11.07 -2.04%
3.27 3.02 -7.65% 11.29 10.94 -3.10%
3.06 2.82 -7.84% 10.51 10.20 -2.95%
3.32 3.13 -5.72% 11.37 11.18 -1.67%
3.45 3.21 -6.96% 11.75 11.59 -1.36%

ham mean and sdev for all runs
3.24 3.01 -7.10% 11.34 11.13 -1.85%

spam mean spam sdev
99.75 99.76 +0.01% 3.91 3.85 -1.53%
99.90 99.91 +0.01% 1.62 1.38 -14.81%
99.81 99.82 +0.01% 3.09 3.05 -1.29%
99.60 99.62 +0.02% 4.92 4.80 -2.44%
99.78 99.78 +0.00% 3.24 3.36 +3.70%
99.78 99.78 +0.00% 3.04 3.14 +3.29%
99.62 99.62 +0.00% 4.73 4.78 +1.06%
99.79 99.81 +0.02% 2.75 2.66 -3.27%
99.66 99.66 +0.00% 4.47 4.62 +3.36%
99.70 99.70 +0.00% 4.37 4.32 -1.14%

spam mean and sdev for all runs
99.74 99.75 +0.01% 3.75 3.75 +0.00%

ham/spam mean difference: 96.50 96.74 +0.24

Here's the histograms from the 'after' case:

-> Ham scores for all runs: 34900 items; mean 3.01; sdev 11.13
-> min -9.99201e-14; median 0.000498415; max 100
* = 448 items
0.0 27319 *************************************************************
[... 200 histogram bucket lines elided ...]
99.5 50 *

-> Spam scores for all runs: 16848 items; mean 99.75; sdev 3.75
-> min 0.00333927; median 100; max 100
* = 273 items
0.0 1 *
[... 200 histogram bucket lines elided ...]
99.5 16628 *************************************************************

-> best cutoff for all runs: 0.995
-> with weighted total 10*50 fp + 220 fn = 720
-> fp rate 0.143% fn rate 1.31%

From tim.one@comcast.net Mon Oct 14 18:34:39 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 13:34:39 -0400
Subject: [Spambayes] Total cost analysis
In-Reply-To:
Message-ID:

In order to ease "middle ground" testing, I redid the automatic histogram analysis to do total-cost minimization similar to that done by Rob's cvcost.py. Here's highly atypical sample output. It's from a tiny run so that you can see by eyeball what it means:

-> Ham scores for this pair: 10 items; mean 1.04; sdev 1.21
-> min 0.000428085; median 0.45401; max 3.12227
* = 1 items
0.0 5 *****
0.5 2 **
1.0 0
1.5 0
2.0 1 *
2.5 0
3.0 2 **
3.5 0
...

-> Spam scores for this pair: 10 items; mean 100.00; sdev 0.00
-> min 100; median 100; max 100
* = 1 items
...
99.0 0
99.5 10 **********

-> best cost $0.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 18721 cutoff pairs
-> smallest ham & spam cutoffs 0.035 & 0.035
->     fp 0; fn 0; unsure ham 0; unsure spam 0
->     fp rate 0%; fn rate 0%
-> largest ham & spam cutoffs 0.995 & 0.995
->     fp 0; fn 0; unsure ham 0; unsure spam 0
->     fp rate 0%; fn rate 0%

This is trivial because no "middle ground" is needed here: calling everything >= 0.035 spam works exactly as well as calling everything >= 0.995 spam, and there are no mistakes or unsures in either case.
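[The cutoff-pair scheme described above reduces to a simple three-way decision rule. The following is a minimal illustrative sketch, not the test-driver code; the names hamc/spamc follow the new option documentation, and the default values here are just the trivial run's 0.035 cutoffs:]

```python
def classify(score, hamc=0.035, spamc=0.035):
    """Three-way decision over a cutoff pair, 0.0 <= hamc <= spamc <= 1.0.

    score < hamc          -> "ham"
    score >= spamc        -> "spam"
    hamc <= score < spamc -> "unsure" (the middle ground)
    """
    if score < hamc:
        return "ham"
    if score >= spamc:
        return "spam"
    return "unsure"
```

[When hamc == spamc, as in the trivial run, the middle ground is empty and this degenerates to the old single spam_cutoff rule.]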
Less trivial, because the ham scores slobber all over the range:

-> Ham scores for all runs: 100 items; mean 8.09; sdev 17.25
-> min 3.24153e-007; median 0.846144; max 97.6463
* = 1 items
0.0 44 ********************************************
[... 200 histogram bucket lines elided ...]
99.5 0

-> Spam scores for all runs: 100 items; mean 99.87; sdev 0.71
-> min 94.9387; median 100; max 100
* = 2 items
...
94.0 0
94.5 1 *
95.0 0
95.5 0
96.0 1 *
96.5 1 *
97.0 0
97.5 0
98.0 0
98.5 0
99.0 1 *
99.5 96 ************************************************

-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 141 cutoff pairs
-> smallest ham & spam cutoffs 0.715 & 0.98
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%
-> largest ham & spam cutoffs 0.945 & 0.99
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%

There is a middle ground here: saying something is "unsure" if 0.715 <= score < 0.98 works exactly as well as 0.945 <= score < 0.99, and there are 141-2 = 139 other cutoff pairs from the histogram boundaries that also achieve cost $0.80 (== 4 msgs in the middle ground, and no errors outside the middle ground).

The default nbuckets has been boosted to 200, although TestDriver.printhist() (which does this display and computation) can be passed any number of buckets "after the fact", provided you saved the histogram objects as pickles. There are two new options to support this:

"""
# After the display of a ham+spam histogram pair, you can get a listing of
# all the cutoff values (coinciding with histogram bucket boundaries) that
# minimize
#
#     best_cutoff_fp_weight * (# false positives) +
#     best_cutoff_fn_weight * (# false negatives) +
#     best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs: hamc and spamc, where
#
#     0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called "I'm not sure" -- the middle ground.
#
# Note that cvcost.py does a similar analysis.
#
# Note: You may wish to increase nbuckets, to give this scheme more
# cutoff values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight: 10.00
best_cutoff_fn_weight: 1.00
best_cutoff_unsure_weight: 0.20
"""

Note that the default values match cvcost.py's defaults.
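[The minimization those options describe can be sketched in a few lines. This is only an illustration, not the actual TestDriver code: assume the ham and spam histograms are plain lists of per-bucket counts over [0, 1], so that bucket boundaries serve as the candidate cutoff values:]

```python
def best_cutoffs(ham_counts, spam_counts,
                 fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Search all boundary pairs (hamc, spamc), hamc <= spamc, for the
    pair(s) minimizing

        fp_weight * fp + fn_weight * fn + unsure_weight * unsure

    Buckets below hamc are called ham, buckets at or above spamc are
    called spam, and buckets in between are unsure."""
    n = len(ham_counts)
    assert len(spam_counts) == n
    # Prefix sums so each candidate pair is evaluated in O(1).
    ham_pre, spam_pre = [0], [0]
    for h, s in zip(ham_counts, spam_counts):
        ham_pre.append(ham_pre[-1] + h)
        spam_pre.append(spam_pre[-1] + s)
    best, winners = None, []
    for i in range(n + 1):          # hamc at boundary i
        for j in range(i, n + 1):   # spamc at boundary j >= i
            fp = ham_pre[n] - ham_pre[j]      # ham scoring >= spamc
            fn = spam_pre[i]                  # spam scoring < hamc
            unsure = (ham_pre[j] - ham_pre[i]) + (spam_pre[j] - spam_pre[i])
            cost = fp_weight * fp + fn_weight * fn + unsure_weight * unsure
            if best is None or cost < best:
                best, winners = cost, [(i, j)]
            elif cost == best:
                winners.append((i, j))
    return best, winners
```

[The search is brute force over all boundary pairs, O(nbuckets**2) candidates, which is why raising nbuckets to 200 gives the analysis more, and finer-grained, cutoff pairs to report.]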
From tim.one@comcast.net Mon Oct 14 18:57:03 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 13:57:03 -0400
Subject: [Spambayes] Total cost analysis
In-Reply-To:
Message-ID:

This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment
CAUTION: For the attached histogram pair, cvcost sez:

    tcap.txt: Optimal cost is $10.0 with grey zone between 89.0 and 97.0

but the new histogram analysis says:

-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 24 cutoff pairs
-> smallest ham & spam cutoffs 0.855 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%
-> largest ham & spam cutoffs 0.97 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%

and eyeballing the histograms shows that the latter is correct. I don't know why cvcost.py thinks $10.00 is the best that can be done; I suspect it's because it's skipping some cutoff pairs in order to save time.
---------------------- multipart/mixed attachment
-> Ham scores for all runs: 100 items; mean 7.21; sdev 18.87
-> min 3.34881e-009; median 0.18187; max 99.2347
* = 2 items
0.0 63 ********************************
[... 200 histogram bucket lines elided ...]
99.5 0

-> Spam scores for all runs: 100 items; mean 99.94; sdev 0.34
-> min 97.0896; median 100; max 100
* = 2 items
0.0 0
[... 200 histogram bucket lines elided ...]
99.5 97 *************************************************

-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 24 cutoff pairs
-> smallest ham & spam cutoffs 0.855 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%
-> largest ham & spam cutoffs 0.97 & 0.995
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%; unsure rate 2%

C:\Code\spambayes>tcap/u
---------------------- multipart/mixed attachment--

From bkc@murkworks.com Mon Oct 14 19:08:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 14 Oct 2002 14:08:15 -0400
Subject: [Spambayes] Comparing chi to zcombine
Message-ID: <3DAACFCE.25807.14D268A@localhost>

First, cmp.py results/chitrues.txt -> results/zcombines.txt

-> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams

false positive percentages
0.154 0.154 tied
0.231 0.077 won -66.67%
0.077 0.154 lost +100.00%
0.154 0.462 lost +200.00%
0.154 0.154 tied
0.077 0.154 lost +100.00%
0.077 0.077 tied
0.000 0.000 tied
0.231 0.538 lost +132.90%
0.000 0.154 lost +(was 0)

won 1 times
tied 4 times
lost 5 times

total unique fp went from 15 to 25 lost +66.67%
mean fp % went from 0.115384615385 to 0.192307692308 lost +66.67%

false negative percentages
0.846 1.308 lost +54.61%
1.231 1.538 lost +24.94%
1.154 1.308 lost +13.34%
0.615 0.846 lost +37.56%
0.923 1.000 lost +8.34%
1.308 1.154 won -11.77%
0.692 1.077 lost +55.64%
1.077 1.231 lost +14.30%
1.231 1.154 won -6.26%
1.231 1.077 won -12.51%

won 3 times
tied 0 times
lost 7 times

total unique fn went from 134 to 152 lost +13.43%
mean fn % went from 1.03076923077 to 1.16923076923 lost +13.43%

ham mean ham sdev
1.40 1.37 -2.14% 8.67 9.24 +6.57%
1.12 1.03 -8.04% 8.09 8.37 +3.46%
1.12 1.02 -8.93% 8.02 7.86 -2.00%
1.26 1.13 -10.32% 8.62 9.07 +5.22%
1.06 1.04 -1.89% 8.03 8.27 +2.99%
1.01 0.86 -14.85% 6.87 7.08 +3.06%
0.85 0.71 -16.47% 6.57 6.55 -0.30%
0.96 0.90 -6.25% 7.06 7.56 +7.08%
1.15 1.00 -13.04% 8.38 8.67 +3.46%
1.01 0.77 -23.76% 7.62 7.11 -6.69%

ham mean and sdev for all runs
1.09 0.98 -10.09% 7.83 8.03 +2.55%

spam mean spam sdev
99.74 99.75 +0.01% 3.59 3.85 +7.24%
99.67 99.71 +0.04% 4.17 3.95 -5.28%
99.68 99.70 +0.02% 4.12 4.43 +7.52%
99.83 99.81 -0.02% 2.68 3.16 +17.91%
99.84 99.91 +0.07% 2.20 0.96 -56.36%
99.66 99.73 +0.07% 4.29 3.92 -8.62%
99.67 99.74 +0.07% 4.68 4.28 -8.55%
99.79 99.81 +0.02% 2.98 2.52 -15.44%
99.75 99.78 +0.03% 3.24 2.85 -12.04%
99.54 99.68 +0.14% 5.07 4.96 -2.17%

spam mean and sdev for all runs
99.72 99.76 +0.04% 3.80 3.66 -3.68%

ham/spam mean difference: 98.63 98.78 +0.15

And now, the zcombine histogram

-> Ham scores for all runs: 13000 items; mean 0.98; sdev 8.03
-> min -6.66134e-14; median 0; max 100
* = 205 items
0.0 12487 *************************************************************
[... 200 histogram bucket lines elided ...]
99.5 23 *

-> Spam scores for all runs: 13000 items; mean 99.76; sdev 3.66
-> min 0; median 100; max 100
* = 210 items
0.0 5 *
[... 200 histogram bucket lines elided ...]
99.5 12794 *************************************************************

-> best cutoff for all runs: 0.985
-> with weighted total 10*25 fp + 152 fn = 402
-> fp rate 0.192% fn rate 1.17%
saving ham histogram pickle to class_hamhist.pik
saving spam histogram pickle to class_spamhist.pik

.ini for zcombine run

[Tokenizer]
mine_received_headers: True

[Classifier]
use_central_limit = False
use_central_limit2 = False
use_central_limit3 = False
use_tim_combining: False
use_chi_squared_combining: False
use_z_combining: True
robinson_minimum_prob_strength: 0.0

[TestDriver]
spam_cutoff: 0.985
show_false_negatives: True
show_false_positives: True
nbuckets: 200
best_cutoff_fp_weight: 10
show_spam_lo: 0.4
show_spam_hi: 0.80
show_ham_lo = 0.40
show_ham_hi = 0.80
show_charlimit: 10000
save_trained_pickles: True
save_histogram_pickles: True

Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

From popiel@wolfskeep.com Mon Oct 14 19:57:13 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Mon, 14 Oct 2002 11:57:13 -0700
Subject: [Spambayes] defaults vs. chi-square
Message-ID: <20021014185713.4604CF4D4@cashew.wolfskeep.com>

I'm being lazy today, so I haven't put this one up on my website in all its gory detail.

I did a cvs up, catching the changes to the histograms and the cost determinations. I did not catch Tim's last modification for tagging the cost computations with set/all discriminators.

cv1 is all defaults. cv2 is chi-square, but otherwise default.

"""
cv1s -> cv2s
-> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[yadda yadda yadda]
-> tested 200 hams & 200 spams against 1800 hams & 1800 spams

false positive percentages
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.500 lost +(was 0)
1.000 1.000 tied
0.000 0.500 lost +(was 0)
0.000 0.000 tied
0.500 1.000 lost +100.00%
0.000 0.000 tied

won 0 times
tied 7 times
lost 3 times

total unique fp went from 4 to 7 lost +75.00%
mean fp % went from 0.2 to 0.35 lost +75.00%

false negative percentages
2.000 1.500 won -25.00%
1.500 0.500 won -66.67%
4.000 2.000 won -50.00%
2.000 1.000 won -50.00%
2.000 1.500 won -25.00%
3.000 2.000 won -33.33%
5.000 3.500 won -30.00%
3.000 1.500 won -50.00%
5.000 2.500 won -50.00%
2.000 0.500 won -75.00%

won 10 times
tied 0 times
lost 0 times

total unique fn went from 59 to 33 won -44.07%
mean fn % went from 2.95 to 1.65 won -44.07%

ham mean ham sdev
17.22 0.50 -97.10% 7.39 7.04 -4.74%
18.69 0.27 -98.56% 7.27 3.71 -48.97%
18.86 0.04 -99.79% 6.50 0.41 -93.69%
16.79 0.41 -97.56% 7.75 4.13 -46.71%
18.66 0.36 -98.07% 7.09 4.84 -31.73%
18.47 1.01 -94.53% 7.83 9.42 +20.31%
18.19 0.51 -97.20% 6.99 5.47 -21.75%
18.38 0.16 -99.13% 6.80 1.94 -71.47%
17.67 0.95 -94.62% 7.88 9.40 +19.29%
17.72 0.14 -99.21% 6.18 1.88 -69.58%

ham mean and sdev for all runs
18.07 0.44 -97.57% 7.22 5.65 -21.75%

spam mean spam sdev
75.58 98.42 +30.22% 9.15 10.85 +18.58%
76.81 99.26 +29.23% 8.53 5.56 -34.82%
74.95 97.82 +30.51% 9.44 12.18 +29.03%
76.18 98.85 +29.76% 8.64 8.90 +3.01%
76.55 98.55 +28.74% 8.84 9.65 +9.16%
76.08 98.31 +29.22% 8.69 11.21 +29.00%
75.61 97.25 +28.62% 9.72 13.12 +34.98%
76.51 98.98 +29.37% 8.30 6.15 -25.90%
75.92 98.26 +29.43% 9.62 10.37 +7.80%
75.52 99.01 +31.10% 8.76 5.46 -37.67%

spam mean and sdev for all runs
75.97 98.47 +29.62% 9.00 9.72 +8.00%

ham/spam mean difference: 57.90 98.03 +40.13
"""

Nothing too surprising, though I wonder if it would be good to mangle cmp.py to output a table for unsure like it does for fp and fn. It also looks like it's using the raw untuned numbers for fp and fn, instead of the computed best values.

The best info for cv1 (defaults):

"""
-> best cost $41.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.425 & 0.635
->     fp 0; fn 6; unsure ham 14; unsure spam 162
->     fp rate 0%; fn rate 0.3%; unsure rate 4.4%
"""

The best info for cv2 (chi-square):

"""
-> best cost $48.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.03 & 0.89
->     fp 3; fn 6; unsure ham 12; unsure spam 48
->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
-> largest ham & spam cutoffs 0.03 & 0.9
->     fp 3; fn 6; unsure ham 12; unsure spam 48
->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
"""

The histograms for chi-square look pretty much like all the other histograms reported here (big spikes at the ends for the ham and spam, several spread lightly (and fairly evenly) over the middle ground).
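[The quoted best costs follow directly from the stated per-message weights; `cost` below is just a hypothetical helper restating the weighted sum, not part of the test driver:]

```python
def cost(fp, fn, unsure, fp_w=10.0, fn_w=1.0, unsure_w=0.2):
    # Weighted total: $10 per false positive, $1 per false negative,
    # $0.20 per message left in the unsure middle ground.
    return fp_w * fp + fn_w * fn + unsure_w * unsure

print(cost(0, 6, 14 + 162))  # quoted best cost for cv1 (defaults) was $41.20
print(cost(3, 6, 12 + 48))   # quoted best cost for cv2 (chi-square) was $48.00
```

[For cv1: 10*0 + 1*6 + 0.2*176 = 41.20; for cv2: 10*3 + 1*6 + 0.2*60 = 48.00, matching the reports above.]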
I must say that I like chi-square best out of all the ones I've tested, since it has fairly obvious points for the cutoffs (I suspect that .05 and .90 are not too far from optimal for just about everyone), and it does have a useful middle ground.

(The false positives I get from it are fairly hopeless cases: FDIC informing customers that NextBank died, a contractor's bid containing only an encoded .pdf, info requests wrt getting a new mortgage. The false negatives are a bunch of particularly chatty spams, and one or two with empty bodies. Again, fairly hopeless.)

I'll be testing the zcombining shortly.

- Alex

From tim.one@comcast.net Mon Oct 14 20:35:33 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 15:35:33 -0400
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: <20021014185713.4604CF4D4@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
> I'm being lazy today, so I haven't put this one up on my
> website in all its gory detail.

I confess I haven't been able to make enough time to follow all the msgs on this list carefully, let alone cruise the web mining more details. If stupid beats smart here, let's hope lazy beats ambitious too .

> I did a cvs up, catching the changes to the histograms
> and the cost determinations.

Good!

> I did not catch Tim's last modification for tagging the cost
> computations with set/all discriminators.

That's fine -- purely cosmetic, no difference in results.

> cv1 is all defaults. cv2 is chi-square, but otherwise default.
> > """ > cv1s -> cv2s > -> tested 200 hams & 200 spams against 1800 hams & 1800 spams > [yadda yadda yadda] > -> tested 200 hams & 200 spams against 1800 hams & 1800 spams > > false positive percentages > 0.500 0.500 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.500 lost +(was 0) > 1.000 1.000 tied > 0.000 0.500 lost +(was 0) > 0.000 0.000 tied > 0.500 1.000 lost +100.00% > 0.000 0.000 tied > > won 0 times > tied 7 times > lost 3 times > > total unique fp went from 4 to 7 lost +75.00% > mean fp % went from 0.2 to 0.35 lost +75.00% > > false negative percentages > 2.000 1.500 won -25.00% > 1.500 0.500 won -66.67% > 4.000 2.000 won -50.00% > 2.000 1.000 won -50.00% > 2.000 1.500 won -25.00% > 3.000 2.000 won -33.33% > 5.000 3.500 won -30.00% > 3.000 1.500 won -50.00% > 5.000 2.500 won -50.00% > 2.000 0.500 won -75.00% > > won 10 times > tied 0 times > lost 0 times > > total unique fn went from 59 to 33 won -44.07% > mean fn % went from 2.95 to 1.65 won -44.07% > > ham mean ham sdev > 17.22 0.50 -97.10% 7.39 7.04 -4.74% > 18.69 0.27 -98.56% 7.27 3.71 -48.97% > 18.86 0.04 -99.79% 6.50 0.41 -93.69% > 16.79 0.41 -97.56% 7.75 4.13 -46.71% > 18.66 0.36 -98.07% 7.09 4.84 -31.73% > 18.47 1.01 -94.53% 7.83 9.42 +20.31% > 18.19 0.51 -97.20% 6.99 5.47 -21.75% > 18.38 0.16 -99.13% 6.80 1.94 -71.47% > 17.67 0.95 -94.62% 7.88 9.40 +19.29% > 17.72 0.14 -99.21% 6.18 1.88 -69.58% > > ham mean and sdev for all runs > 18.07 0.44 -97.57% 7.22 5.65 -21.75% > > spam mean spam sdev > 75.58 98.42 +30.22% 9.15 10.85 +18.58% > 76.81 99.26 +29.23% 8.53 5.56 -34.82% > 74.95 97.82 +30.51% 9.44 12.18 +29.03% > 76.18 98.85 +29.76% 8.64 8.90 +3.01% > 76.55 98.55 +28.74% 8.84 9.65 +9.16% > 76.08 98.31 +29.22% 8.69 11.21 +29.00% > 75.61 97.25 +28.62% 9.72 13.12 +34.98% > 76.51 98.98 +29.37% 8.30 6.15 -25.90% > 75.92 98.26 +29.43% 9.62 10.37 +7.80% > 75.52 99.01 +31.10% 8.76 5.46 -37.67% > > spam mean and sdev for all runs > 75.97 98.47 +29.62% 9.00 9.72 +8.00% > > ham/spam 
mean difference: 57.90 98.03 +40.13 > """ > > Nothing too surprising, though I wonder if it would be good > to mangle cmp.py to output a table for unsure like it does > for fp and fn. It also looks like it's using the raw untuned > numbers for fp and fn, instead of the computed best values. Yes, cmp.py doesn't look at the histograms at all, it's mining the individual > ... > 0.000 0.000 tied > 0.000 0.000 tied > 0.000 0.500 lost +(was 0) > 1.000 1.000 tied > 0.000 0.500 lost +(was 0) > 0.000 0.000 tied > ... output lines. Those are still based on a single value for spam_cutoff, and a single cutoff value doesn't really make sense for the "middle ground" schemes. The mean and sdev stats remain interesting for these schemes, but cmp.py's fn and fp accounts are at best misleading for the middle-ground schemes. For now, the histogram analysis is the best analytic ouput we get for such schemes. > The best info for cv1 (defaults): > > """ > -> best cost $41.20 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.425 & 0.635 > -> fp 0; fn 6; unsure ham 14; unsure spam 162 > -> fp rate 0%; fn rate 0.3%; unsure rate 4.4% > """ The all-default scheme does do very well; the practical difficulty has been that "the best" cutoff values seem extremely corpus-dependent, and even so require 3 digits of precision to express, and change depending on how much data you train on. Cutoffs that can only be determined after the fact, and only when knowing exactly what the classifications *should* have been, are impractical on several counts. Still, if you had a time machine (so could pick "the best" cutoffs later and apply them retroactively), nothing else really does better. 
> The best info for cv2 (chi-square):
>
> """
> -> best cost $48.00
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 3 cutoff pairs
> -> smallest ham & spam cutoffs 0.03 & 0.89
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> -> largest ham & spam cutoffs 0.03 & 0.9
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> """

And this seems a lot easier to live with in a world without time machines: the middle ground spans a huge range of scores, yet contains a lot fewer msgs than under highly-corpus-tuned cv1.

> The histograms for chi-square look pretty much like all the other
> histograms reported here (big spikes at the ends for the ham and
> spam, several spread lightly (and fairly evenly) over the middle
> ground.
>
> I must say that I like chi-square best out of all the ones I've
> tested, since it has fairly obvious points for the cutoffs (I suspect
> that .05 and .90 are not too far from optimal for just about everyone),
> and it does have a useful middle ground.

I agree on all counts.

> (The false positives I get from it are fairly hopeless cases:
> FDIC informing customers that NextBank died, a contractor's bid
> containing only an encoded .pdf,

That one surprises me: assuming we threw the body away unlooked-at (we ignore MIME sections that aren't of text/* type), it's hard to get enough other clues to force a spam score so high. If possible, I'd like to see the list of clues (the "prob('word') = 0.432" thingies in the main output file, assuming you have show_false_positives enabled).

> info requests wrt getting a new mortgage. The false negatives are a
> bunch of particularly chatty spams, and one or two with empty bodies.
> Again, fairly hopeless.)

Long chatty spam has been pretty reliably scoring near 0.5 for me, which has been a real advantage of chi combining. So again I'd really like to see the list of clues.
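[For readers skimming the thread: "chi combining" turns the per-word probabilities into two chi-squared tail probabilities, one measuring how hammy the evidence is and one how spammy, then averages them so that strong-but-conflicting evidence cancels toward 0.5. That is why long chatty spam tends to land in the middle ground. A sketch along these lines, assuming word probabilities already clamped away from 0 and 1; this mirrors the idea, not necessarily the exact classifier code:]

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1].

    S is high when the clues look uniformly spammy, H when they look
    uniformly hammy; mutually cancelling evidence drives the score
    toward 0.5 instead of toward either extreme."""
    n = len(probs)
    # Probabilities must already be clamped away from 0 and 1.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

[A mixed clue list such as `[0.99]*5 + [0.01]*5` comes out almost exactly at 0.5, while a uniformly spammy list scores near 1.0 and a uniformly hammy one near 0.0.]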
> I'll be testing the zcombining shortly. I look forward to it. Note that, as above, that's another middle-ground scheme, so only the histogram analysis will be truly interesting. From bkc@murkworks.com Mon Oct 14 21:12:48 2002 From: bkc@murkworks.com (Brad Clements) Date: Mon, 14 Oct 2002 16:12:48 -0400 Subject: [Spambayes] Tokenizer output text range, high bits Message-ID: <3DAAECFE.18531.1BF2CA3@localhost> I thought I'd read in the list that the tokenizer doesn't return chars with the "high bit" set, just creates a new token indicating that. So, when going through the classifier wordlist keys, I don't expect to see any keys with chars where ord(c) & 0x80 != 0 however, I am finding some. Also, finding chars whose ord() < 32. I'm not so worried about the later (as long as there aren't any nuls), but somewhat concerned about the high-bit. Unicode? I don't want to deal with that just now.. :-( Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Mon Oct 14 21:12:38 2002 From: rob@hooft.net (Rob Hooft) Date: Mon, 14 Oct 2002 22:12:38 +0200 Subject: [Spambayes] Total cost analysis References: Message-ID: <3DAB2536.8000601@hooft.net> Tim Peters wrote: > CAUTION: For the attached histogram pair, cvcost sez: > > tcap.txt: Optimal cost is $10.0 with grey zone between 89.0 and 97.0 > > but the new histogram analysis says: > > -> best cost $0.80 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at 24 cutoff pairs > -> smallest ham & spam cutoffs 0.855 & 0.995 > -> fp 0; fn 0; unsure ham 1; unsure spam 3 > -> fp rate 0%; fn rate 0%; unsure rate 2% > -> largest ham & spam cutoffs 0.97 & 0.995 > -> fp 0; fn 0; unsure ham 1; unsure spam 3 > -> fp rate 0%; fn rate 0%; unsure rate 2% > > and eyeballing the histograms shows that the latter is correct. 
I don't > know why cvcost.py thinks $10.00 is the best that can be done; I suspect > it's because it's skipping some cutoff pairs in order to save time. Yep, it only does full percentage points. It is a quick hack that should be done away with now that it is implemented in the histogram analysis. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From popiel@wolfskeep.com Mon Oct 14 21:29:59 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 13:29:59 -0700 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 15:35:33 EDT." References: Message-ID: <20021014202959.76B5BF4D4@cashew.wolfskeep.com> In message: Tim Peters writes: >> The best info for cv2 (chi-square): >> >> """ >> -> best cost $48.00 >> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 >> -> achieved at 3 cutoff pairs >> -> smallest ham & spam cutoffs 0.03 & 0.89 >> -> fp 3; fn 6; unsure ham 12; unsure spam 48 >> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5% >> -> largest ham & spam cutoffs 0.03 & 0.9 >> -> fp 3; fn 6; unsure ham 12; unsure spam 48 >> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5% >> """ > >And this seems a lot easier to live with in a world without time machines: >the middle ground spans a huge range of scores, yet contains a lot fewer >msgs than under highly-corpus-tuned cv1. > >> The histograms for chi-square look pretty much like all the other >> histograms reported here (big spikes at the ends for the ham and >> spam, several spread lightly (and fairly evenly) over the middle >> ground. >> >> I must say that I like chi-square best out of all the ones I've >> tested, since it has fairly obvious points for the cutoffs (I suspect >> that .05 and .90 are not too far from optimal for just about everyone), >> and it does have a useful middle ground. > >I agree on all counts. 
>> (The false positives I get from it are fairly hopeless cases:
>> FDIC informing customers that NextBank died, a contractor's bid
>> containing only an encoded .pdf,
>
> That one surprises me: assuming we threw the body away unlooked-at (we
> ignore MIME sections that aren't of text/* type), it's hard to get enough
> other clues to force a spam score so high. If possible, I'd like to see the
> list of clues (the "prob('word') = 0.432" thingies in the main output file,
> assuming you have show_false_positives enabled).

Data/Ham/Set5/2745
prob = 0.685540245196
prob('*H*') = 0.535842
prob('*S*') = 0.906922
prob('content-type:application/pdf') = 0.0918367
prob('filename:fname piece:pdf') = 0.0918367
prob('subject:Electrical') = 0.155172
prob('content-type:text/plain') = 0.389566
prob('header:Received:5') = 0.389918
prob('content-type:multipart/mixed') = 0.737422
prob('content-type:multipart/alternative') = 0.948917
prob(' ') = 0.959269
prob('content-type:text/html') = 0.986282

That's the whole list of probabilities. I did fib slightly: in addition to the bid.pdf, there's a one-space-character message body represented in both plain text and HTML. Effectively null, but the classifier doesn't see it that way. It's that dual-body that's killing it.

>> info requests wrt getting a new mortgage. The false negatives are a
>> bunch of particularly chatty spams, and one or two with empty bodies.
>> Again, fairly hopeless.)
>
> Long chatty spam has been pretty reliably scoring near 0.5 for me, which has
> been a real advantage of chi combining. So again I'd really like to see the
> list of clues.

My error... I was looking at the fn output without paying attention to the listed probs. Since the fn output is based on the single cutoff (set at 0.56), it was getting some of the chatty stuff. The real fns are pretty short, and generally in odd languages or binary.
This one looks like a worm: Data/Spam/Set3/32 prob = 0.000317545970781 prob('*H*') = 0.999926 prob('*S*') = 0.000560844 prob('skip:b 70') = 0.0412844 prob('skip:a 70') = 0.0505618 prob('skip:d 70') = 0.0505618 prob('skip:e 70') = 0.0505618 prob('email name:debian-java-request') = 0.0547407 prob('email addr:lists.debian.org') = 0.0594895 prob('email name:listmaster') = 0.0599834 prob("control: couldn't decode") = 0.0652174 prob('from:email addr:t-online.de>') = 0.0652174 prob('skip:c 70') = 0.0652174 prob('skip:i 70') = 0.0652174 prob('skip:y 70') = 0.0652174 prob('skip:z 70') = 0.0652174 prob('trouble?') = 0.0753369 prob('skip:" 10') = 0.277389 prob('skip:a 20') = 0.295202 prob('content-type:text/plain') = 0.388944 prob('header:Message-Id:1') = 0.6167 prob('email') = 0.787497 prob('x-mailer:microsoft outlook express 5.50.4133.2400') = 0.791262 prob('message-id:@lists.debian.org') = 0.844828 prob('skip:5 70') = 0.844828 And again: Data/Spam/Set3/2472 prob = 0.0029549796705 prob('*H*') = 0.999949 prob('*S*') = 0.00585924 prob('header:In-Reply-To:1') = 0.000449595 prob('skip:s 70') = 0.0412844 prob('skip:d 70') = 0.0505618 prob('skip:o 70') = 0.0505618 prob('skip:t 70') = 0.0505618 prob("control: couldn't decode") = 0.0652174 prob('skip:c 70') = 0.0652174 prob('skip:i 70') = 0.0652174 prob('skip:l 70') = 0.0652174 prob('skip:z 70') = 0.0652174 prob('from:email addr:mail.com>') = 0.23545 prob('charset:us-ascii') = 0.317057 prob('skip:n 30') = 0.355072 prob('content-type:text/plain') = 0.388944 prob('header:Message-Id:1') = 0.6167 prob('content-disposition:inline') = 0.661659 prob('content-type:multipart/mixed') = 0.696645 prob('x-mailer:microsoft outlook, build 10.0.2616') = 0.97619 This one actually wasn't too long and chatty, but it seemed to hit a bunch of good words, and was half in french: Data/Spam/Set6/2011 prob = 0.00173950022128 prob('*H*') = 0.99774 prob('*S*') = 0.00121919 prob('forum') = 0.0121951 prob('url:be') = 0.0302013 prob('email 
name:debian-java-request') = 0.0341451 prob('email addr:lists.debian.org') = 0.0441114 prob('email name:listmaster') = 0.044487 prob('trouble?') = 0.0604856 prob('des') = 0.0652174 prob('cross') = 0.117486 prob('avec') = 0.155172 prob('est') = 0.155172 prob('firmwares') = 0.155172 prob('progress,') = 0.155172 prob('toute') = 0.155172 prob('...') = 0.180314 prob('occasionally') = 0.184814 prob('still') = 0.237895 prob('but') = 0.249098 prob('skip:" 10') = 0.278104 prob('site') = 0.295343 prob('already') = 0.301798 prob('charset:us-ascii') = 0.308681 prob('after') = 0.341657 prob('x-mailer:microsoft outlook express 6.00.2600.0000') = 0.347036 prob('content-type:text/plain') = 0.390599 prob('header:Reply-To:1') = 0.60073 prob('from') = 0.604083 prob('subject:.') = 0.605015 prob('available') = 0.637633 prob('header:Mime-Version:1') = 0.646706 prob('email') = 0.785132 prob('please') = 0.83219 prob('subject:skip:W 10') = 0.908163 prob('url:') = 0.936848 I don't know what happened to the other fn < 0.03. 
Close, but not quite, is a nigerian spam (!!!): Data/Spam/Set7/352 prob = 0.0344593026264 prob('*H*') = 0.999908 prob('*S*') = 0.0688269 prob('indeed') = 0.00556242 prob('aim') = 0.012894 prob('(my') = 0.0145631 prob('manner') = 0.0180723 prob('wrote') = 0.0211545 prob('reminder') = 0.0238095 prob('nigerian') = 0.0266272 prob('december') = 0.0266272 prob('so.') = 0.0281933 prob('okay') = 0.0302013 prob('although') = 0.0350768 prob('numbered') = 0.0412844 prob('ratio') = 0.0446266 prob('opposed') = 0.0481336 prob('apparently,') = 0.0505618 prob('revert') = 0.0505618 prob('officer') = 0.0505618 prob('subsequently') = 0.0505618 prob('patience') = 0.0505618 prob('however') = 0.0524146 prob('overcome') = 0.0599022 prob('fixed') = 0.0617239 prob('infer') = 0.0652174 prob('presumed') = 0.0652174 prob('filename:fname piece:txt') = 0.0652174 prob('therefore') = 0.0838752 prob('attempts') = 0.0874263 prob('expert,') = 0.0918367 prob('calendar') = 0.0918367 prob('travelling') = 0.0918367 prob('nigeria.') = 0.0918367 prob('apparently') = 0.0929593 prob('forwarding') = 0.106987 prob('saw') = 0.107116 prob('thus') = 0.110275 prob('did') = 0.112618 prob('concern') = 0.114396 prob('especially') = 0.125537 prob('finally,') = 0.126719 prob('shall') = 0.135258 prob('worked') = 0.138554 prob('point') = 0.154593 prob('totaling') = 0.155172 prob('proposition') = 0.155172 prob('6th') = 0.155172 prob('actively') = 0.165428 prob('since') = 0.166612 prob('knows') = 0.169148 prob('which') = 0.172635 prob('necessary') = 0.182854 prob('source') = 0.183395 prob('routine') = 0.189922 prob('driven') = 0.205305 prob('got') = 0.206143 prob('reality') = 0.206601 prob('light') = 0.207284 prob('skip:h 20') = 0.211375 prob('some') = 0.214937 prob('there') = 0.219934 prob('same') = 0.227242 prob('still') = 0.238027 prob('but') = 0.254404 prob('according') = 0.254563 prob('very') = 0.256327 prob('skip:m 10') = 0.258633 prob('stand') = 0.260226 prob('died') = 0.263314 prob('branch') = 0.263314 
prob('zero') = 0.26593 prob('number') = 0.267526 prob('them') = 0.274205 prob('large') = 0.27431 prob('his') = 0.276565 prob('transaction') = 0.281659 prob('consultant') = 0.283198 prob('reason') = 0.288324 prob('dead') = 0.288434 prob('trace') = 0.29021 prob('mr.') = 0.292388 prob('part') = 0.294772 prob('when') = 0.297739 prob('ask') = 0.299886 prob('already') = 0.299963 prob('listing') = 0.310964 prob('given') = 0.311411 prob('down') = 0.311983 prob('charset:us-ascii') = 0.312457 prob('being') = 0.312739 prob('federal') = 0.695627 prob('president') = 0.697044 prob('safely') = 0.700267 prob('notification') = 0.700364 prob('information') = 0.703131 prob('skip:r 10') = 0.706302 prob('inform') = 0.707612 prob('brought') = 0.70783 prob('your') = 0.710937 prob('complete') = 0.711206 prob('content-type:application/octet-stream') = 0.718341 prob('country.') = 0.718341 prob('immediately') = 0.727163 prob('further') = 0.728674 prob('obtained') = 0.732221 prob('risk') = 0.747156 prob('content-type:multipart/mixed') = 0.751609 prob('contract') = 0.754669 prob('informed') = 0.75788 prob('business') = 0.761283 prob('internet') = 0.768097 prob('phone') = 0.774467 prob('questions') = 0.795045 prob('money,') = 0.796192 prob('bank') = 0.801151 prob('succeed') = 0.805677 prob('settled') = 0.810078 prob('month') = 0.811997 prob('claim') = 0.812913 prob('confidential') = 0.815186 prob('money.') = 0.8156 prob('our') = 0.820323 prob('please') = 0.828641 prob('months,') = 0.829218 prob('fund') = 0.83557 prob('national') = 0.835796 prob('sent') = 0.837147 prob('blood') = 0.843797 prob('asked,') = 0.844828 prob('treasury') = 0.844828 prob('address') = 0.860353 prob('reply') = 0.864689 prob('achieving') = 0.87037 prob('money') = 0.878353 prob('70%') = 0.880818 prob('million') = 0.885051 prob('corporation') = 0.891198 prob('free') = 0.90477 prob('approval') = 0.904949 prob('x-mailer:microsoft outlook express 5.00.2919.6900 dm') = 0.908163 prob('modalities') = 0.908163 prob('employment') = 
0.912574 prob('claim.') = 0.915225 prob('skip:y 10') = 0.922406 prob('deposit') = 0.929253 prob('wish') = 0.930416 prob('credit') = 0.941699 prob('valued') = 0.950726 prob('guaranteed') = 0.956906 prob('honored') = 0.958716 prob('message-id:@ucsu.colorado.edu') = 0.965116 prob('conservative') = 0.983271 All you folks _talking_ about the nigerian spams has turned them into ham for me! ;-) - Alex From tim.one@comcast.net Mon Oct 14 21:33:10 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 16:33:10 -0400 Subject: [Spambayes] Comparing chi to zcombine In-Reply-To: <3DAACFCE.25807.14D268A@localhost> Message-ID: [Brad Clements] > ... > -> best cutoff for all runs: 0.985 > -> with weighted total 10*25 fp + 152 fn = 402 > -> fp rate 0.192% fn rate 1.17% > saving ham histogram pickle to class_hamhist.pik > saving spam histogram pickle to class_spamhist.pik Note that a single cutoff value doesn't make sense for the "middle ground" methods. Since you ran this, I checked in changes to histogram analysis that compute "best" ham *and* spam cutoff points, where best minimizes a function with three distinct costs (cost of an FP, cost of an FN, cost of an "unsure" msg). You set those costs to what makes sense for your application (e.g., as I've said many times, *I'd* rather get an fp than an fn for my own use, as I'm going to review every rejection anyway, and I just want to shuffle spam out of my main inbox so it doesn't interfere with normal workflow; I may be unique in that, though). 
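[The three-cost cutoff search Tim describes can be sketched as follows. This is a minimal illustration only -- the function name and the brute-force search over observed scores are assumptions, not the actual histogram-analysis code.]

```python
def best_cutoffs(ham_scores, spam_scores,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Search (ham_cutoff, spam_cutoff) pairs for the pair minimizing
    fp_cost*fp + fn_cost*fn + unsure_cost*unsures.  Ham scoring at or
    above the spam cutoff is an fp, spam scoring below the ham cutoff
    is an fn, and everything in between is "unsure"."""
    candidates = sorted(set(ham_scores) | set(spam_scores) | {0.0, 1.0})
    best_cost, best_pair = float("inf"), None
    for i, ham_cut in enumerate(candidates):
        for spam_cut in candidates[i:]:
            fp = sum(1 for s in ham_scores if s >= spam_cut)
            fn = sum(1 for s in spam_scores if s < ham_cut)
            unsure = sum(1 for s in ham_scores + spam_scores
                         if ham_cut <= s < spam_cut)
            cost = fp_cost * fp + fn_cost * fn + unsure_cost * unsure
            if cost < best_cost:
                best_cost, best_pair = cost, (ham_cut, spam_cut)
    return best_cost, best_pair
```

[Trying only the observed scores as cutoff candidates is what distinguishes this from cvcost.py's quick hack, which stepped in full percentage points and so could miss a better pair between grid points.]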
I was able to run that analysis over the z-combining histogram you included here, but it's impossible to guess what it would have said for your chi-combining run:

-> best cost for Brad z-combining run: $301.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.6 & 0.995
-> fp 23; fn 21; unsure ham 67; unsure spam 185
-> fp rate 0.177%; fn rate 0.162%; unsure rate 0.969%

.995 is the highest bucket there was, so it couldn't draw any finer distinction among the 23 ham in the .995 bucket. Boosting nbuckets would allow a more exact analysis. OTOH, those fp are scoring so high they may be hopeless. On the third hand,

-> Spam scores for all runs: 13000 items; mean 99.76; sdev 3.66
-> min 0; median 100; max 100

*at least* half your spam scored 100 under z-combining (because the median spam score was 100), so there may well be a useful distinction remaining to be drawn within the .995 bucket.

From tim.one@comcast.net Mon Oct 14 21:45:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 16:45:18 -0400
Subject: [Spambayes] Tokenizer output text range, high bits
In-Reply-To: <3DAAECFE.18531.1BF2CA3@localhost>
Message-ID:

[Brad Clements]
> I thought I'd read in the list that the tokenizer doesn't return
> chars with the "high bit" set, just creates a new token indicating that.

"Words" are currently obtained via simple-minded split-on-whitespace after converting to lowercase. If a word has fewer than 3 chars, it's ignored. If it has between 3 and 12 chars inclusive, it's taken as-is (unconditionally -- it doesn't matter if it's all \xff or all NUL bytes or anything in-between). If it has more than 12 chars, then a "skip" metatoken is generated, *and* if it has any high-bit chars, an "8bit%" metatoken is also generated.

> So, when going through the classifier wordlist keys, I don't
> expect to see any keys with chars where ord(c) & 0x80 != 0
>
> however, I am finding some.
>
> Also, finding chars whose ord() < 32.
See above; all that is expected.

> I'm not so worried about the latter (as long as there aren't any
> nuls),

Why would you care about \x00 bytes? Python doesn't.

> but somewhat concerned about the high-bit. Unicode? I don't want to
> deal with that just now.. :-(

Any number of non-Unicode encoding schemes use high-bit characters. For example, *typical* French, German and Spanish use high-bit characters sparingly. The current scheme should work fine for French and Spanish users. German seems to contain a lot of very long words, though, so I'm less sanguine about that. Some Asian languages don't use whitespace at all, so the s-o-w scheme ends up generating lots of "skip" and "8bit%" tokens for those. I expect the current tokenizer would be pretty much useless for Asian users as a result.

From tim.one@comcast.net Mon Oct 14 22:22:45 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 14 Oct 2002 17:22:45 -0400
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: <20021014202959.76B5BF4D4@cashew.wolfskeep.com>
Message-ID:

[T. Alexander Popiel]
>>> (The false positives I get from it are fairly hopeless cases:
>>> FDIC informing customers that NextBank died, a contractor's bid
>>> containing only an encoded .pdf,

[Tim]
>> That one surprises me: assuming we threw the body away unlooked-at (we
>> ignore MIME sections that aren't of text/* type), it's hard to get
>> enough other clues to force a spam score so high. If possible, I'd
>> like to see the list of clues (the "prob('word') = 0.432" thingies in
>> the main output file, assuming you have show_false_positives enabled).

[Alex]
> Data/Ham/Set5/2745
> prob = 0.685540245196

How did this end up getting counted as an FP? A score of 0.69 was very solidly in your middle ground.
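[Aside: the word rules Tim describes in the tokenizer thread above can be sketched roughly as below. This is only an illustration matching the shape of tokens like 'skip:b 70' seen in the clue lists here, not the actual spambayes tokenizer; the exact metatoken spellings and length bucketing are assumptions.]

```python
def tokenize_word(word):
    # Sketch of the rules: words shorter than 3 chars are dropped,
    # 3..12 char words are kept as-is (whatever bytes they contain),
    # and longer words become a "skip" metatoken -- plus an "8bit%"
    # metatoken if any character has the high bit set.
    n = len(word)
    if n < 3:
        return []
    if n <= 12:
        return [word]
    tokens = ["skip:%s %d" % (word[0], n // 10 * 10)]
    if any(ord(c) & 0x80 for c in word):
        tokens.append("8bit%")   # exact spelling of this token is a guess
    return tokens

def tokenize(text):
    # Simple-minded split-on-whitespace after lowercasing.
    return [tok for w in text.lower().split() for tok in tokenize_word(w)]
```

[So an undecoded 70-odd-character base64 line starting with "b" comes out as the single token "skip:b 70" -- which is exactly what the low-spamprob clues above look like.]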
> prob('*H*') = 0.535842
> prob('*S*') = 0.906922
> prob('content-type:application/pdf') = 0.0918367
> prob('filename:fname piece:pdf') = 0.0918367
> prob('subject:Electrical') = 0.155172
> prob('content-type:text/plain') = 0.389566
> prob('header:Received:5') = 0.389918
> prob('content-type:multipart/mixed') = 0.737422
> prob('content-type:multipart/alternative') = 0.948917
> prob(' ') = 0.959269
> prob('content-type:text/html') = 0.986282
>
> That's the whole list of probabilities.

Right, that's what I expected: if we skipped the .pdf attachment, there's very little left, and it's hard for very little to get a killer-strong spam score.

> I did fib slightly: in addition to the bid.pdf, there's a
> one-space-character message body represented in both plain text
> and HTML. Effectively null, but the classifier doesn't see it that
> way. It's that dual-body that's killing it.

As above, this just doesn't *have* a high spam score. I think you must have confused this with some other FP. The tokenizer should probably get rid of " " anyway, but that's a different experiment.

>>> The false negatives are a bunch of particularly chatty spams, and
>>> one or two with empty bodies. Again, fairly hopeless.)

>> Long chatty spam has been pretty reliably scoring near 0.5 for
>> me, which has been a real advantage of chi combining. So again I'd
>> really like to see the list of clues.

> My error... I was looking at the fn output without paying attention
> to the listed probs. Since the fn output is based on the single
> cutoff (set at 0.56),

Ah, that would also explain why the 0.69 msg above was mistaken for an FP rather than a middle-ground msg.

> it was getting some of the chatty stuff. The real fns are pretty
> short, and generally in odd languages or binary.
> > This one looks like a worm: > > Data/Spam/Set3/32 > prob = 0.000317545970781 > prob('*H*') = 0.999926 > prob('*S*') = 0.000560844 > prob('skip:b 70') = 0.0412844 > prob('skip:a 70') = 0.0505618 > prob('skip:d 70') = 0.0505618 > prob('skip:e 70') = 0.0505618 > prob('email name:debian-java-request') = 0.0547407 > prob('email addr:lists.debian.org') = 0.0594895 > prob('email name:listmaster') = 0.0599834 > prob("control: couldn't decode") = 0.0652174 > prob('from:email addr:t-online.de>') = 0.0652174 > prob('skip:c 70') = 0.0652174 > prob('skip:i 70') = 0.0652174 > prob('skip:y 70') = 0.0652174 > prob('skip:z 70') = 0.0652174 An odd thing is that you must have a lot of 'skip:z 70' (etc) tokens in your ham too, else these spamprobs wouldn't be so small. Any idea where they come from? It suggests the tokenizer is giving up on something it should really be picking apart -- but I don't have many of these in my ham, so I'm at a loss to guess where they come from. > prob('trouble?') = 0.0753369 > prob('skip:" 10') = 0.277389 > prob('skip:a 20') = 0.295202 > prob('content-type:text/plain') = 0.388944 > prob('header:Message-Id:1') = 0.6167 > prob('email') = 0.787497 > prob('x-mailer:microsoft outlook express 5.50.4133.2400') = 0.791262 > prob('message-id:@lists.debian.org') = 0.844828 > prob('skip:5 70') = 0.844828 > > And again: > > Data/Spam/Set3/2472 > prob = 0.0029549796705 > prob('*H*') = 0.999949 > prob('*S*') = 0.00585924 > prob('header:In-Reply-To:1') = 0.000449595 > prob('skip:s 70') = 0.0412844 > prob('skip:d 70') = 0.0505618 > prob('skip:o 70') = 0.0505618 > prob('skip:t 70') = 0.0505618 > prob("control: couldn't decode") = 0.0652174 > prob('skip:c 70') = 0.0652174 > prob('skip:i 70') = 0.0652174 > prob('skip:l 70') = 0.0652174 > prob('skip:z 70') = 0.0652174 As above, you must have an awful lot of low-spamprob skip tokens in your ham. 
> prob('from:email addr:mail.com>') = 0.23545 > prob('charset:us-ascii') = 0.317057 > prob('skip:n 30') = 0.355072 > prob('content-type:text/plain') = 0.388944 > prob('header:Message-Id:1') = 0.6167 > prob('content-disposition:inline') = 0.661659 > prob('content-type:multipart/mixed') = 0.696645 > prob('x-mailer:microsoft outlook, build 10.0.2616') = 0.97619 > > > This one actually wasn't too long and chatty, but it seemed > to hit a bunch of good words, and was half in french: You must have more French in your ham, then (else the French words wouldn't have low spamprobs). > Data/Spam/Set6/2011 > prob = 0.00173950022128 > prob('*H*') = 0.99774 > prob('*S*') = 0.00121919 > prob('forum') = 0.0121951 > prob('url:be') = 0.0302013 > prob('email name:debian-java-request') = 0.0341451 > prob('email addr:lists.debian.org') = 0.0441114 > prob('email name:listmaster') = 0.044487 > prob('trouble?') = 0.0604856 > prob('des') = 0.0652174 > prob('cross') = 0.117486 > prob('avec') = 0.155172 > prob('est') = 0.155172 > prob('firmwares') = 0.155172 > prob('progress,') = 0.155172 > prob('toute') = 0.155172 > prob('...') = 0.180314 > prob('occasionally') = 0.184814 > prob('still') = 0.237895 > prob('but') = 0.249098 > prob('skip:" 10') = 0.278104 > prob('site') = 0.295343 > prob('already') = 0.301798 > prob('charset:us-ascii') = 0.308681 > prob('after') = 0.341657 > prob('x-mailer:microsoft outlook express 6.00.2600.0000') = 0.347036 > prob('content-type:text/plain') = 0.390599 > prob('header:Reply-To:1') = 0.60073 > prob('from') = 0.604083 > prob('subject:.') = 0.605015 > prob('available') = 0.637633 > prob('header:Mime-Version:1') = 0.646706 > prob('email') = 0.785132 > prob('please') = 0.83219 > prob('subject:skip:W 10') = 0.908163 > prob('url:') = 0.936848 > > I don't know what happened to the other fn < 0.03. 
Close, but not > quite, is a nigerian spam (!!!): > > Data/Spam/Set7/352 > prob = 0.0344593026264 > prob('*H*') = 0.999908 > prob('*S*') = 0.0688269 > prob('indeed') = 0.00556242 > prob('aim') = 0.012894 > prob('(my') = 0.0145631 > prob('manner') = 0.0180723 > prob('wrote') = 0.0211545 > prob('reminder') = 0.0238095 > prob('nigerian') = 0.0266272 You have lot of ham containing "Nigerian"? If so, that may be my fault for talking about my Nigerian-scam FP every chance I get . > prob('december') = 0.0266272 > prob('so.') = 0.0281933 > prob('okay') = 0.0302013 > prob('although') = 0.0350768 > prob('numbered') = 0.0412844 > prob('ratio') = 0.0446266 > prob('opposed') = 0.0481336 > prob('apparently,') = 0.0505618 > prob('revert') = 0.0505618 > prob('officer') = 0.0505618 > prob('subsequently') = 0.0505618 > prob('patience') = 0.0505618 > prob('however') = 0.0524146 > prob('overcome') = 0.0599022 > prob('fixed') = 0.0617239 > prob('infer') = 0.0652174 > prob('presumed') = 0.0652174 > prob('filename:fname piece:txt') = 0.0652174 > prob('therefore') = 0.0838752 > prob('attempts') = 0.0874263 > prob('expert,') = 0.0918367 > prob('calendar') = 0.0918367 > prob('travelling') = 0.0918367 > prob('nigeria.') = 0.0918367 > prob('apparently') = 0.0929593 > prob('forwarding') = 0.106987 > prob('saw') = 0.107116 > prob('thus') = 0.110275 > prob('did') = 0.112618 > prob('concern') = 0.114396 > prob('especially') = 0.125537 > prob('finally,') = 0.126719 > prob('shall') = 0.135258 > prob('worked') = 0.138554 > prob('point') = 0.154593 > prob('totaling') = 0.155172 > prob('proposition') = 0.155172 > prob('6th') = 0.155172 > prob('actively') = 0.165428 > prob('since') = 0.166612 > prob('knows') = 0.169148 > prob('which') = 0.172635 > prob('necessary') = 0.182854 > prob('source') = 0.183395 > prob('routine') = 0.189922 > prob('driven') = 0.205305 > prob('got') = 0.206143 > prob('reality') = 0.206601 > prob('light') = 0.207284 > prob('skip:h 20') = 0.211375 > prob('some') = 0.214937 > 
prob('there') = 0.219934 > prob('same') = 0.227242 > prob('still') = 0.238027 > prob('but') = 0.254404 > prob('according') = 0.254563 > prob('very') = 0.256327 > prob('skip:m 10') = 0.258633 > prob('stand') = 0.260226 > prob('died') = 0.263314 > prob('branch') = 0.263314 > prob('zero') = 0.26593 > prob('number') = 0.267526 > prob('them') = 0.274205 > prob('large') = 0.27431 > prob('his') = 0.276565 > prob('transaction') = 0.281659 > prob('consultant') = 0.283198 > prob('reason') = 0.288324 > prob('dead') = 0.288434 > prob('trace') = 0.29021 > prob('mr.') = 0.292388 > prob('part') = 0.294772 > prob('when') = 0.297739 > prob('ask') = 0.299886 > prob('already') = 0.299963 > prob('listing') = 0.310964 > prob('given') = 0.311411 > prob('down') = 0.311983 > prob('charset:us-ascii') = 0.312457 > prob('being') = 0.312739 > prob('federal') = 0.695627 > prob('president') = 0.697044 > prob('safely') = 0.700267 > prob('notification') = 0.700364 > prob('information') = 0.703131 > prob('skip:r 10') = 0.706302 > prob('inform') = 0.707612 > prob('brought') = 0.70783 > prob('your') = 0.710937 > prob('complete') = 0.711206 > prob('content-type:application/octet-stream') = 0.718341 Eh? A Nigerian scam with an octet-stream attachment?! That's unique! 
> prob('country.') = 0.718341 > prob('immediately') = 0.727163 > prob('further') = 0.728674 > prob('obtained') = 0.732221 > prob('risk') = 0.747156 > prob('content-type:multipart/mixed') = 0.751609 > prob('contract') = 0.754669 > prob('informed') = 0.75788 > prob('business') = 0.761283 > prob('internet') = 0.768097 > prob('phone') = 0.774467 > prob('questions') = 0.795045 > prob('money,') = 0.796192 > prob('bank') = 0.801151 > prob('succeed') = 0.805677 > prob('settled') = 0.810078 > prob('month') = 0.811997 > prob('claim') = 0.812913 > prob('confidential') = 0.815186 > prob('money.') = 0.8156 > prob('our') = 0.820323 > prob('please') = 0.828641 > prob('months,') = 0.829218 > prob('fund') = 0.83557 > prob('national') = 0.835796 > prob('sent') = 0.837147 > prob('blood') = 0.843797 > prob('asked,') = 0.844828 > prob('treasury') = 0.844828 > prob('address') = 0.860353 > prob('reply') = 0.864689 > prob('achieving') = 0.87037 > prob('money') = 0.878353 > prob('70%') = 0.880818 > prob('million') = 0.885051 > prob('corporation') = 0.891198 > prob('free') = 0.90477 > prob('approval') = 0.904949 > prob('x-mailer:microsoft outlook express 5.00.2919.6900 dm') = 0.908163 > prob('modalities') = 0.908163 > prob('employment') = 0.912574 > prob('claim.') = 0.915225 > prob('skip:y 10') = 0.922406 > prob('deposit') = 0.929253 > prob('wish') = 0.930416 > prob('credit') = 0.941699 > prob('valued') = 0.950726 > prob('guaranteed') = 0.956906 > prob('honored') = 0.958716 > prob('message-id:@ucsu.colorado.edu') = 0.965116 > prob('conservative') = 0.983271 > > All you folks _talking_ about the nigerian spams has turned them > into ham for me! ;-) That could be. I hardly ever mention modalities here . From popiel@wolfskeep.com Mon Oct 14 22:36:15 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 14:36:15 -0700 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 17:22:45 EDT." 
References: Message-ID: <20021014213616.10BE4F4D4@cashew.wolfskeep.com> In message: Tim Peters writes: > >[Alex] >> Data/Ham/Set5/2745 >> prob = 0.685540245196 > >How did this end up getting counted as an FP? A score of 0.69 was very >solidly in your middle ground. You're right, I'm a twit who can't read. Okay, where did those false positives really go? >An odd thing is that you must have a lot of 'skip:z 70' (etc) tokens in your >ham too, else these spamprobs wouldn't be so small. Any idea where they >come from? It suggests the tokenizer is giving up on something it should >really be picking apart -- but I don't have many of these in my ham, so I'm >at a loss to guess where they come from. I'm not sure offhand, either. I'd have to work to track it down, though... and as mentioned earlier, today is a lazy day. My best guess is a few base64 bits that didn't get decoded properly. >You must have more French in your ham, then (else the French words wouldn't >have low spamprobs). Yes, I do, from you folks talking about French messages... this mailing list is doing a fine job of polluting my corpora with difficult messages. ;-) - Alex From popiel@wolfskeep.com Mon Oct 14 22:53:06 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 14:53:06 -0700 Subject: [Spambayes] z-combining Message-ID: <20021014215307.0D632F4D4@cashew.wolfskeep.com> Well, I did a z-combining run. @whee. It replaces my all-defaults run as cv1. chi-square remains as cv2. 
>From results.txt:

"""
ham mean                      ham sdev
 0.50   0.50    +0.00%         7.05   7.04    -0.14%
 0.26   0.27    +3.85%         3.65   3.71    +1.64%
 0.02   0.04  +100.00%         0.29   0.41   +41.38%
 0.49   0.41   -16.33%         5.44   4.13   -24.08%
 0.38   0.36    -5.26%         5.27   4.84    -8.16%
 1.03   1.01    -1.94%         9.88   9.42    -4.66%
 0.51   0.51    +0.00%         5.56   5.47    -1.62%
 0.09   0.16   +77.78%         1.26   1.94   +53.97%
 0.97   0.95    -2.06%         9.66   9.40    -2.69%
 0.12   0.14   +16.67%         1.73   1.88    +8.67%

ham mean and sdev for all runs
 0.44   0.44    +0.00%         5.90   5.65    -4.24%

spam mean                     spam sdev
98.68  98.42    -0.26%        10.66  10.85    +1.78%
99.31  99.26    -0.05%         5.62   5.56    -1.07%
97.68  97.82    +0.14%        13.94  12.18   -12.63%
98.84  98.85    +0.01%         9.00   8.90    -1.11%
98.54  98.55    +0.01%        11.71   9.65   -17.59%
97.99  98.31    +0.33%        13.48  11.21   -16.84%
96.88  97.25    +0.38%        15.83  13.12   -17.12%
99.34  98.98    -0.36%         4.95   6.15   +24.24%
98.07  98.26    +0.19%        11.74  10.37   -11.67%
99.65  99.01    -0.64%         3.04   5.46   +79.61%

spam mean and sdev for all runs
98.50  98.47    -0.03%        10.81   9.72   -10.08%

ham/spam mean difference: 98.06 98.03 -0.03
"""

z-combining loses vs. chi-square there, with looser sdevs.

Next, we have the best computations for z-combining:

"""
-> best cost $54.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 6 cutoff pairs
-> smallest ham & spam cutoffs 0.01 & 0.985
-> fp 3; fn 13; unsure ham 12; unsure spam 44
-> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
-> largest ham & spam cutoffs 0.035 & 0.985
-> fp 3; fn 13; unsure ham 12; unsure spam 44
-> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
"""

Compare with the one from chi-square:

"""
-> best cost $48.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.03 & 0.89
-> fp 3; fn 6; unsure ham 12; unsure spam 48
-> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
-> largest ham & spam cutoffs 0.03 & 0.9
-> fp 3; fn 6; unsure ham 12; unsure spam 48
-> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
"""

Looks like z-combining has real granularity problems near the top end. Trash it.
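[For reference, the chi-squared combining that wins this comparison can be sketched as below: Fisher-style combining of the per-word spamprobs, in which strong but conflicting evidence lands near 0.5. This is a hedged sketch consistent with the scheme discussed in these threads, not the actual spambayes implementation.]

```python
from math import log, exp

def chi2Q(x2, v):
    # Survival function of the chi-squared distribution with v (even)
    # degrees of freedom: the probability that a chi-squared variate
    # is at least x2.
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # Fisher-style combining: if the per-word spamprobs were uniform,
    # -2*sum(ln p) would be chi-squared with 2n degrees of freedom.
    # H measures evidence of ham, S of spam; when both are strong
    # (long, mixed messages), S - H cancels and the score lands near
    # 0.5 -- the "middle ground".
    n = len(probs)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

[Unanimous spammy clues drive the score toward 1, unanimous hammy clues toward 0, and strong conflicting clues toward 0.5 -- which is why the chi-square runs above have a usable middle ground while z-combining piles nearly everything into the extreme buckets.]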
- Alex From tim.one@comcast.net Mon Oct 14 22:59:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 17:59:14 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: <20021014213616.10BE4F4D4@cashew.wolfskeep.com> Message-ID: [Tim] >> An odd thing is that you must have a lot of 'skip:z 70' (etc) >> tokens in your ham too, else these spamprobs wouldn't be so small. >> Any idea where they come from? [T. Alexander Popiel] > I'm not sure offhand, either. I'd have to work to track it down, > though... and as mentioned earlier, today is a lazy day. My best > guess is a few base64 bits that didn't get decoded properly. I cater to lazy: you had a bunch of them in the very spams you were talking about. What does the source for those look like? I *used* to get a bunch of these before we started stripping uuencoded sections, but that shouldn't be happening anymore -- unless the uuencode-finding regexp is missing a pattern that's common in your data but not in mine. Or unless the message headers are damaged to such an extent that the email package barfs on them (in which case we fall back to the raw body text). Whatever the cause, if it's a systematic problem in your data, it will be for others too. It may be unique to Perl programmers, though . From popiel@wolfskeep.com Mon Oct 14 23:09:02 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 14 Oct 2002 15:09:02 -0700 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 17:59:14 EDT." References: Message-ID: <20021014220902.71797F4D4@cashew.wolfskeep.com> In message: Tim Peters writes: >[Tim] >>> An odd thing is that you must have a lot of 'skip:z 70' (etc) >>> tokens in your ham too, else these spamprobs wouldn't be so small. >>> Any idea where they come from? > >[T. Alexander Popiel] >> I'm not sure offhand, either. I'd have to work to track it down, >> though... and as mentioned earlier, today is a lazy day. 
My best >> guess is a few base64 bits that didn't get decoded properly. > >I cater to lazy: you had a bunch of them in the very spams you were talking >about. What does the source for those look like? I *used* to get a bunch >of these before we started stripping uuencoded sections, but that shouldn't >be happening anymore -- unless the uuencode-finding regexp is missing a >pattern that's common in your data but not in mine. Or unless the message >headers are damaged to such an extent that the email package barfs on them >(in which case we fall back to the raw body text). It appears to be a systematic error when a mailing list manager appends plain text to what should be a base64 encoded segment. Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME decoder, too! As a sample: """ Return-Path: bounce-debian-java=popiel=wolfskeep.com@lists.debian.org Delivery-Date: Fri, 23 Aug 2002 02:56:21 -0700 Return-Path: Delivered-To: popiel@wolfskeep.com Received: from murphy.debian.org (murphy.debian.org [65.125.64.134]) by cashew.wolfskeep.com (Postfix) with SMTP id 0EAFBF58E for ; Fri, 23 Aug 2002 02:56:21 -0700 (PDT) Received: (qmail 29739 invoked by uid 38); 23 Aug 2002 09:37:09 -0000 X-Envelope-Sender: ybqiwbt@t-online.de Received: (qmail 29162 invoked from network); 23 Aug 2002 09:36:55 -0000 Received: from adsl-065-081-092-098.sip.gsp.bellsouth.net (HELO xpfoncv) (65.81.92.98) by murphy.debian.org with SMTP; 23 Aug 2002 09:36:55 -0000 From: Cagdas Burhansan31 To: Subject: Arþiv hazýr Date: Fri, 23 Aug 2002 10:33:48 -0400 X-Mailer: Microsoft Outlook Express 5.50.4133.2400 Content-Type: text/plain Content-Transfer-Encoding: base64 Message-Id: X-Spam-Status: No, hits=0.0 required=4.7 tests= version=2.01 Resent-Message-ID: Resent-From: debian-java@lists.debian.org X-Mailing-List: archive/latest/2709 X-Loop: debian-java@lists.debian.org List-Post: List-Help: List-Subscribe: List-Unsubscribe: Precedence: list Resent-Sender: debian-java-request@lists.debian.org 
Resent-Date: Fri, 23 Aug 2002 02:56:21 -0700 (PDT) DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k aS5jb20NCg0KDQoNCg0K -- To UNSUBSCRIBE, email to debian-java-request@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org """ >Whatever the cause, if it's a systematic problem in your data, it will be >for others too. It may be unique to Perl programmers, though . Nope. In this case, it's Java programmers. ;-) - Alex From guido@python.org Mon Oct 14 23:13:46 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 14 Oct 2002 18:13:46 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Your message of "Mon, 14 Oct 2002 15:09:02 PDT." <20021014220902.71797F4D4@cashew.wolfskeep.com> References: <20021014220902.71797F4D4@cashew.wolfskeep.com> Message-ID: <200210142213.g9EMDsP00965@pcp02138704pcs.reston01.va.comcast.net> > It appears to be a systematic error when a mailing list manager > appends plain text to what should be a base64 encoded segment. > Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME > decoder, too! > > As a sample: > > """ [...] > Content-Type: text/plain > Content-Transfer-Encoding: base64 [...] 
> > DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu > ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg > YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu > IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp > 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u > ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl > cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv > bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp > bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh > a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l > eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k > aS5jb20NCg0KDQoNCg0K > > > -- > To UNSUBSCRIBE, email to debian-java-request@lists.debian.org > with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org > """ Mailman used to do this too; I believe it's finally fixed in Mailman 2.1. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Tue Oct 15 03:52:47 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 22:52:47 -0400 Subject: [Spambayes] z-combining In-Reply-To: <20021014215307.0D632F4D4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > Well, I did a z-combining run. @whee. It replaces my > all-defaults run as cv1. chi-square remains as cv2. > > From results.txt: [inconsitent effects on means across runs, small and large effects on sdevs, but overall decreases] > ... > z-combining loses vs. chi-square there, with looser sdevs. 
The sdevs actually got smaller overall:

> ham mean and sdev for all runs
> 0.44 0.44 +0.00% 5.90 5.65 -4.24%
>
> spam mean and sdev for all runs
> 98.50 98.47 -0.03% 10.81 9.72 -10.08%

The means are so far apart compared to the sdevs, though, and the concentration at the endpoints so extreme, that random overlap isn't an issue with either scheme -- the mistakes these guys make are more fundamental than random.

> Next, we have the best computations for z-combining:
>
> """
> -> best cost $54.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 6 cutoff pairs
> -> smallest ham & spam cutoffs 0.01 & 0.985
> -> fp 3; fn 13; unsure ham 12; unsure spam 44
> -> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
> -> largest ham & spam cutoffs 0.035 & 0.985
> -> fp 3; fn 13; unsure ham 12; unsure spam 44
> -> fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
> """
>
> Compare with the one from chi-square:
>
> """
> -> best cost $48.00
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 3 cutoff pairs
> -> smallest ham & spam cutoffs 0.03 & 0.89
> -> fp 3; fn 6; unsure ham 12; unsure spam 48
> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> -> largest ham & spam cutoffs 0.03 & 0.9
> -> fp 3; fn 6; unsure ham 12; unsure spam 48
> -> fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> """
>
> Looks like z-combining has real granularity problems near
> the top end. Trash it.

It's indeed not working better for anyone so far, and it does suffer cancellation disease. OTOH, it was a quick hack to get a quick feel for how this *kind* of approach might work, and it didn't go all the way. Gary would like to "rank" the spamprobs first, but that requires another version of "the third training pass" that I just don't know how to make practical over time.
If Rob is feeling particularly adventurous, it would be interesting (in connection with z-combining) to transform the database spamprobs into unit-normalized zscores via his RMS black magic, as an extra step at the end of update_probabilities(). This wouldn't require another pass over the training data, would speed z-combining scoring a lot, and I *think* would make the inputs to this scheme much closer to what Gary would really like them to be (z-combining *pretends* the "extreme-word" spamprobs are normally distributed now; I don't have any idea how close that is to the truth). The attraction of this scheme is that it gives a single "spam probability" directly; combining distinct ham and spam indicators is still a bit of a puzzle (although a happy puzzle from my POV when both indicators suck, as happens in chi combining with large numbers of strong clues on both ends). From tim.one@comcast.net Tue Oct 15 04:03:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 23:03:08 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: <20021014220902.71797F4D4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel, tracks down a source of his many "skip" tokens] > ... > It appears to be a systematic error when a mailing list manager > appends plain text to what should be a base64 encoded segment. > Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME > decoder, too! > > As a sample: > > """ > [headers] > ... > Content-Type: text/plain > Content-Transfer-Encoding: base64 > ... 
> > DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu
> ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg
> YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu
> IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp
> 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u
> ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl
> cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv
> bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp
> bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh
> a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l
> eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k
> aS5jb20NCg0KDQoNCg0K
>
>
> --
> To UNSUBSCRIBE, email to debian-java-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact
> listmaster@lists.debian.org

Ouch. That would do it, all right, here in tokenizer.py:

    for part in textparts(msg):
        # Decode, or take it as-is if decoding fails.
        try:
            text = part.get_payload(decode=True)
        except:
            yield "control: couldn't decode"
            text = part.get_payload(decode=False)

The base64 decoder will barf on that kind of msg, but you've got so many of these in your ham that even the "couldn't decode" metatoken is taken as a strong ham clue:

    prob("control: couldn't decode") = 0.0652174

I overlooked that in your msg before. So, Barry, what can we do about this? Filling the database with "skip" tokens from raw base64 is a Bad Idea, and I assume the email pkg doesn't know how to, e.g., "decode base64 up until it can't anymore, and then grab the rest as plain text". Heh -- just writing that made me want to puke. We have to do something better with this, though.
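For illustration, the "collect the base64-looking lines, decode those, and handle the excess separately" idea that comes up later in this thread can be sketched in a few lines. This is a sketch only, not the tokenizer's or the email package's code; the function name and the line-matching regexp are made up for the example, and the regexp is a heuristic (a short all-letter word would also match it):

```python
import base64
import re

# Lines consisting only of base64 alphabet characters, with optional '='
# padding at the end.  Anything else (e.g. a footer a mailing list manager
# appended) is treated as plain text.
B64_LINE = re.compile(r'^[A-Za-z0-9+/]+={0,2}$')

def forgiving_b64_decode(payload):
    """Decode the base64-looking lines of a damaged payload.

    Returns (decoded_bytes, leftover_text).  Lines that look like base64
    are joined and decoded together; everything else is returned as text.
    """
    b64_lines, leftover = [], []
    for line in payload.splitlines():
        stripped = line.strip()
        if stripped and B64_LINE.match(stripped):
            b64_lines.append(stripped)
        else:
            leftover.append(line)
    data = ''.join(b64_lines)
    data += '=' * (-len(data) % 4)   # repair missing padding, if any
    try:
        decoded = base64.b64decode(data)
    except Exception:
        decoded = b''                # still hopeless; give up quietly
    return decoded, '\n'.join(leftover)
```

With a payload like the Debian sample above (valid base64 followed by an appended list footer), the base64 part decodes cleanly and the footer comes back as leftover text instead of poisoning the decode.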
From tim.one@comcast.net Tue Oct 15 04:32:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 23:32:57 -0400 Subject: [Spambayes] chi-squared versus "prob strength" In-Reply-To: <200210140631.g9E6VWn31907@localhost.localdomain> Message-ID: [Tim, to Rob, on switching from S/(S+H) to (S-H+1)/2] > I was, but more importantly my test data agreed, so I'm going > to switch to this (the evidence is so consistent and solid on both > our datasets that making it an option would supply a pointless > choice -- losers are killed). Good show! [Anthony Baxter] > Here's what my mungo-test set shows for this (before is pre-Rob Hooft's > change, after is current CVS) This would have been a useful result, but, unfortunately, you ran it before the histogram analysis was beefed up to tell us the useful bits. If you still have the final ("all runs") ham and spam histograms from *both* runs in output files, you could post much more useful info by running them thru cvcost.py. With some pain I can run the new histo analysis for you on your "after" run, because you included the full final histograms for that:

-> best cost for Anthony's CVS run: $626.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.85 & 0.995
-> fp 50; fn 75; unsure ham 110; unsure spam 145
-> fp rate 0.143%; fn rate 0.445%; unsure rate 0.493%

It's a peculiar pair of cutoffs, reflecting that you have few low-scoring spam but (relative to others) many high-scoring ham. The analysis is limited by nbuckets=200, as 50 of your ham scored in the highest bucket:

    99.5 50 *

and so there's no way to get rid of more FP at this granularity short of calling everything ham. However, your *median* spam score was 100:

-> Spam scores for all runs: 16848 items; mean 99.75; sdev 3.75
-> min 0.00333927; median 100; max 100

meaning that at least half your spam scored 100, so there may well be useful distinctions still to be drawn if only we could peer inside the 0.995 bucket.
Your data is so nasty I think 200 buckets is too small for you; try 1000 next time? In any case, the idea that these lines were telling useful truths:

> total unique fp went from 261 to 281 lost +7.66%
> total unique fn went from 60 to 53 won -11.67%

is right out. In part, those say two things:

1. spam_cutoff was too low for the "after" run.

2. A single spam_cutoff doesn't make sense for the middle-ground methods: we're trying to *get* you a useful middle ground here, a small number of nasty msgs where we have strong reason to believe many mistakes will live.

From tim.one@comcast.net Tue Oct 15 04:54:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 14 Oct 2002 23:54:51 -0400 Subject: Bland words, and z-combining (was RE: [Spambayes] Bland word only score..) In-Reply-To: Message-ID: FYI, I doubled the number of accurate digits in z-combining's probability -> zscore calculations. This made it even more extreme for me -- the median ham score fell to 0 on the nose. The good news is that my lowest-scoring spam's score rose, from 4.09672e-012 to 4.10227e-012. Take *that* to the bank.

-> best cost for all runs: $26.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 114 cutoff pairs
-> smallest ham & spam cutoffs 0.63 & 0.944
-> fp 2; fn 3; unsure ham 10; unsure spam 9
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559%
-> largest ham & spam cutoffs 0.868 & 0.946
-> fp 2; fn 5; unsure ham 2; unsure spam 7
-> fp rate 0.01%; fn rate 0.0357%; unsure rate 0.0265%

That's the first run of any kind I've seen where the minimum cost could be achieved in more than one way. I don't mean that there were 114 cutoff pairs that achieved it (that's normal enough), but that the two specific endpoints shown there make different tradeoffs between FN and unsures. What this doesn't show is that picking cutoffs of 0.05 and 0.95 would have been almost as cheap -- getting *close* to the minimum isn't touchy at all, but getting the absolute minimum is.
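The accounting behind these "best cost" reports is straightforward to reproduce. A rough sketch of a brute-force search over (ham_cutoff, spam_cutoff) pairs, using the $10/$1/$0.20 weights quoted above; the function name and the idea of passing an explicit candidate grid are invented for illustration, and the real cvcost.py may differ:

```python
def best_cost(ham_scores, spam_scores, cutoffs,
              c_fp=10.0, c_fn=1.0, c_unsure=0.2):
    """Search (ham_cutoff, spam_cutoff) pairs for the cheapest policy.

    Scores lie in [0, 1]: below ham_cutoff -> called ham, above
    spam_cutoff -> called spam, in between -> unsure.  Returns
    (cost, ham_cutoff, spam_cutoff) for the cheapest pair found.
    """
    best = None
    for lo in cutoffs:
        for hi in cutoffs:
            if lo > hi:
                continue
            fp = sum(1 for s in ham_scores if s > hi)    # ham called spam
            fn = sum(1 for s in spam_scores if s < lo)   # spam called ham
            unsure = (sum(1 for s in ham_scores if lo <= s <= hi) +
                      sum(1 for s in spam_scores if lo <= s <= hi))
            cost = c_fp * fp + c_fn * fn + c_unsure * unsure
            if best is None or cost < best[0]:
                best = (cost, lo, hi)
    return best
```

This also makes the "achieved at N cutoff pairs" phenomenon easy to see: many distinct (lo, hi) pairs can classify every message identically and so tie for the minimum.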
From barry@wooz.org Tue Oct 15 05:10:06 2002 From: barry@wooz.org (Barry A. Warsaw) Date: Tue, 15 Oct 2002 00:10:06 -0400 Subject: [Spambayes] defaults vs. chi-square References: <20021014220902.71797F4D4@cashew.wolfskeep.com> Message-ID: <15787.38174.711987.679063@gargle.gargle.HOWL> >> ... It appears to be a systematic error when a mailing list >> manager appends plain text to what should be a base64 encoded >> segment. Bad MLM, no biscuit. This confuses the MIME >> decoder. Bad MIME decoder, too! Known problem with MM2.0. MM2.1 does a better job of adding headers and footers. If it can't do it in a MIME-safe way, it won't do it. TP> So, Barry, what can we do about this? Filling the database TP> with "skip" tokens from raw base64 is a Bad Idea, and I assume TP> the email pkg doesn't know how to, e.g., "decode base64 up TP> until it can't anymore, and then grab the rest as plain text". TP> Heh -- just writing that made me want to puke. We have to do TP> something better with this, though. Upgrade to MM2.1 :) Seriously, when the email package has to decode a base64 payload, it just hands the whole string off to base64.decodestring(). Given that that function isn't very forgiving, I'm not sure what to do. Sucks. -Barry From guido@python.org Tue Oct 15 05:17:52 2002 From: guido@python.org (Guido van Rossum) Date: Tue, 15 Oct 2002 00:17:52 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: Your message of "Tue, 15 Oct 2002 00:10:06 EDT." <15787.38174.711987.679063@gargle.gargle.HOWL> References: <20021014220902.71797F4D4@cashew.wolfskeep.com> <15787.38174.711987.679063@gargle.gargle.HOWL> Message-ID: <200210150417.g9F4HqZ17493@pcp02138704pcs.reston01.va.comcast.net> > TP> So, Barry, what can we do about this? Filling the database > TP> with "skip" tokens from raw base64 is a Bad Idea, and I assume > TP> the email pkg doesn't know how to, e.g., "decode base64 up > TP> until it can't anymore, and then grab the rest as plain text". 
> TP> Heh -- just writing that made me want to puke. We have to do > TP> something better with this, though. > > Upgrade to MM2.1 :) > > Seriously, when the email package has to decode a base64 payload, it > just hands the whole string off to base64.decodestring(). Given that > that function isn't very forgiving, I'm not sure what to do. Sucks. Split it up in lines first, and collect lines that match a simple regexp to recognize base64. Then feed the collected stuff to base64.decodestring(). If there's non-white excess, deal with that separately. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Tue Oct 15 05:58:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 00:58:28 -0400 Subject: [Spambayes] defaults vs. chi-square In-Reply-To: <200210150417.g9F4HqZ17493@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > Split it up in lines first, and collect lines that match a simple > regexp to recognize base64. Then feed the collected stuff to > base64.decodestring(). If there's non-white excess, deal with that > separately. I'm trying to shame Barry into doing this, since he sucked me into this project and then vanished . More importantly, if he can be provoked into giving it some real thought, he could do a better job faster than I could. For example, I've just got a Message object at this point, and I don't know beans about whether it's plain, base64-encoded, qp-encoded, or whatever. The email pkg knows, though, and Barry knows how to get it to tell him without even thinking about it. Since most base64 stuff isn't damaged, we need smarter recovery code in the "except:" clause of the snippet I posted. For a start, if it failed to decode base64 stuff, it would likely be better to ignore that part entirely than to run off tokenizing it. It would be much better still to decode it anyway. From popiel@wolfskeep.com Tue Oct 15 06:04:42 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Mon, 14 Oct 2002 22:04:42 -0700 Subject: [Spambayes] z-combining In-Reply-To: Message from Tim Peters of "Mon, 14 Oct 2002 22:52:47 EDT." References: Message-ID: <20021015050442.23597F4D4@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> Well, I did a z-combining run. @whee. It replaces my >> all-defaults run as cv1. chi-square remains as cv2. >> z-combining loses vs. chi-square there, with looser sdevs. > >The sdevs actually got smaller overall: Remember, I had z-combining on the left, and chi-square on the right. Just to confuse you. ;-) >The means are so far apart compared to the sdevs, and the extreme >concentration at the endpoints, though, that random overlap isn't an issue >with either scheme -- the mistakes these guys make are more fundamental than >random. Yup. - Alex From barry@wooz.org Tue Oct 15 06:13:38 2002 From: barry@wooz.org (Barry A. Warsaw) Date: Tue, 15 Oct 2002 01:13:38 -0400 Subject: [Spambayes] defaults vs. chi-square References: <200210150417.g9F4HqZ17493@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15787.41986.422263.708035@gargle.gargle.HOWL> >>>>> "TP" == Tim Peters writes: TP> I'm trying to shame Barry into doing this, since he sucked me TP> into this project and then vanished . I think everyone here will agree that was the most productive email, character for character that I ever wrote. It would peg the ham meter. TP> More importantly, if he can be provoked into giving it some TP> real thought, he could do a better job faster than I could. Will do, but tomorrow. I have one more Mailman flame to extinguish tonight and then it's time to drool into my pillow for 6 hours. -Barry From rob@hooft.net Tue Oct 15 14:19:57 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 15 Oct 2002 15:19:57 +0200 Subject: [Spambayes] z-combining References: Message-ID: <3DAC15FD.3020200@hooft.net> This is a multi-part message in MIME format. 
---------------------- multipart/mixed attachment Tim Peters wrote: > If Rob is feeling particularly adventurous, it would be interesting (in > connection with z-combining) to transform the database spamprobs into > unit-normalized zscores via his RMS black magic, as an extra step at the end > of update_probabilities(). This wouldn't require another pass over the > training data, would speed z-combining scoring a lot, and I *think* would > make the inputs to this scheme much closer to what Gary would really like > them to be (z-combining *pretends* the "extreme-word" spamprobs are normally > distributed now; I don't have any idea how close that is to the truth). I'm not exactly sure what you want me to renormalize using my black magic, but I did make an interesting histogram of 250000 single-token spam probabilities... I'm hoping you're not assuming that this is normally distributed, although it looks like that is what you are trying to do when recalculating this into Z-scores. Out of the 250k tokens I put in my histogram, 93k occurred exactly once in the ham corpus of 4500 messages only, and ~75k exactly once in the spam corpus of 4500 messages only... The noise you see at the baseline is words that occur multiple times in both ham and spam; amplified in the second image where all words that occur only once or twice are removed from the histogram. A histogram of words that occur more than 30 times in total is a bit more flat, but still has many >30+0 / 0+>30 extremes. 
My strongest ham clue is "wrote:" (763+0); the second is "het" (533+0) [Dutch for "the" for words without gender, and for "it"]. On the spam side it is "8bit%:100" (0+937) and "charset:ks_c_5601-1987" (0+838). > The > attraction of this scheme is that it gives a single "spam probability" > directly; combining distinct ham and spam indicators is still a bit of a > puzzle (although a happy puzzle from my POV when both indicators suck, as > happens in chi combining with large numbers of strong clues on both ends). I don't see why this scheme could not produce an "H" value as well, and then mix it with the "S" score we're using now. This scheme looks a lot like the "S" half of earlier ones like chi2 combining. Think about what goes wrong if we used only the S half of chi2 combining: messages that look like both ham and spam come out as perfect spam, and messages that look neither like ham nor spam come out as perfect ham. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: probfreq.png Type: image/png Size: 9903 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/probfreq.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: prob2.png Type: image/png Size: 7985 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/prob2.png ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: balk29.png Type: image/png Size: 5900 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/balk29.png ---------------------- multipart/mixed attachment-- From grobinson@transpose.com Tue Oct 15 14:38:21 2002 From: grobinson@transpose.com (Gary Robinson) Date: Tue, 15 Oct 2002 09:38:21 -0400 Subject: [Spambayes] z Message-ID: > It's indeed not working better for anyone so far, and it does suffer > cancellation disease. OTOH, it was a quick hack to get a quick feel for how > this *kind* of approach might work, and it didn't go all the way. Gary > would like to "rank" the spamprobs first, but that requires another version > of "the third training pass" that I just don't know how to make practical > over time. Actually I think it would be complicated or even impossible to do the way it really *should* be done, because it would have to be structured so that spammy words always had a rank over .5 and hammy words had a rank under .5, while the probability of hitting a spam or a ham under a reasonable null hypothesis is the same. It would get complicated, so I recommend not bothering to try to do it right. I know I don't have time to try to work out a good way to do it now. > If Rob is feeling particularly adventurous, it would be interesting (in > connection with z-combining) to transform the database spamprobs into > unit-normalized zscores via his RMS black magic, as an extra step at the end > of update_probabilities(). This wouldn't require another pass over the > training data, would speed z-combining scoring a lot, and I *think* would > make the inputs to this scheme much closer to what Gary would really like > them to be (z-combining *pretends* the "extreme-word" spamprobs are normally > distributed now; I don't have any idea how close that is to the truth). I didn't realize that this wasn't already being done. 
Yes I would recommend that somebody do this because I don't think we're really testing the z approach completely fairly until it is. I'm not saying I believe that the z approach will turn out to be better -- I just don't know -- but it seems worth trying. Gary --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 From rob@hooft.net Tue Oct 15 14:58:29 2002 From: rob@hooft.net (Rob Hooft) Date: Tue, 15 Oct 2002 15:58:29 +0200 Subject: [Spambayes] Tokenizing numbers and money Message-ID: <3DAC1F05.7030809@hooft.net> I just scanned through my 250k token list and found that a surprising number of these are numeric or almost numeric. Here is a random part of the list:

prob   nham nspam token
0.1552    1     0 3601.2
0.1552    1     0 3601.5
0.1552    1     0 3603.6
0.1552    1     0 3604.2
0.8448    0     1 3605
0.0918    2     0 3607
0.1552    1     0 3607.2
0.1552    1     0 3613
0.1552    1     0 3617
0.0918    2     0 3618
0.1552    1     0 3620.
0.8448    0     1 3621
0.1552    1     0 3624.2
0.1552    1     0 3626.5
0.1552    1     0 3627.7
0.1552    1     0 3629
0.1552    1     0 3631
[...]
0.9698    0     7 $65.00
0.8448    0     1 $369.00.
0.9698    0     7 $149.00,
0.9698    0     7 $800,000
0.8448    0     1 $30.00)
0.9587    0     5 $205.00
0.8448    0     1 $.19
0.8448    0     1 $24.00
0.9734    0     8 $800
0.9494    0     4 $37).
0.9587    0     5 $1.70
0.8448    0     1 $50,00
0.8448    0     1 $450.00.
0.9082    0     2 $1,000.00!
0.9494    0     4 $663.90
0.8448    0     1 $30...get
0.8448    0     1 $350,000
0.8448    0     1 $.275,
0.9651    0     6 $185.00
0.1552    1     0 $500,-
0.9651    0     6 $349.95.
0.8448    0     1 $2,000-
[...but also...]
0.9803    0    11 $30.00
0.9938    0    36 $319,210.00
0.9921    0    28 $25.00
0.9884    0    19 $100,000.00
0.9979    0   108 $5,000
0.9002   13   119 $500
0.9755    3   128 $50
0.9843    2   139 $25
[...and...]
0.9921    0    28 $25.00
0.8448    0     1 x5=$25.00.
0.9082    0     2 us$25.00
0.9878    0    18 5=$25.00.
0.9082    0     2 $25.00!
0.9941    0    38 $25.00.
0.9348    0     3 $25.00,

Does anyone believe that "3605" is a real spam clue, and "3607" a real ham clue? 
I think collapsing numbers into a few classes might significantly reduce the size of the database, and actually help the classification. Even though for someone doing fragrances "4711" may be a strong ham clue, I think that on the whole this is just adding noise. How about something like tokens for

num:float    (e.g. 3624.2)
num:int      (e.g. 3629)
num:intpair  (e.g. 439,443)
num:$1       (for amounts between $0.00 and $9.99)
num:$10      (for amounts between $10 and $99.99)
num:$100     (for amounts between $100 and $999.99)
num:$1000    (for amounts between $1k and $10k)
num:$huge    (for amounts >$10k)

Each of these might have "logarithm suffixes"? Is this unrealistic? Currently roughly one in six tokens in my list contains at least 3 digits in a row!

amigo[197]spambayes%% egrep -c ' .*[0-9][0-9][0-9]' balk.dat
44757
amigo[198]spambayes%% wc -l balk.dat
255907 balk.dat

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Tue Oct 15 21:05:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 16:05:33 -0400 Subject: [Spambayes] z In-Reply-To: Message-ID: [Tim] >> If Rob is feeling particularly adventurous, it would be interesting (in >> connection with z-combining) to transform the database spamprobs into >> unit-normalized zscores via his RMS black magic, as an extra >> step at the end of update_probabilities(). This wouldn't require another [Gary Robinson] > I didn't realize that this wasn't already being done. It's unclear to me what "this" means. RMS transformations? No, we're not doing those here. > Yes I would recommend that somebody do this because I don't think we're > really testing the z approach completely fairly until it is. 
You tell me whether this is this; this is the code people have been using:

    def z_spamprob(self, wordstream, evidence=False):
        from math import sqrt

        clues = self._getclues(wordstream)
        zsum = 0.0
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            zsum += normIP(prob)

        n = len(clues)
        if n:
            # We've added n zscores from a unit normal distribution.  By the
            # central limit theorem, their mean is normally distributed with
            # mean 0 and sdev 1/sqrt(n).  So the zscore of zsum/n is
            # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n).
            prob = normP(zsum / sqrt(n))
        else:
            prob = 0.5

normIP() maps a probability p to the real z such that the area under the unit Gaussian from -inf to z is p. normP() is the inverse, mapping real z to the area under the unit Gaussian from -inf to z. Example:

    >>> normIP(.9)
    1.2815502653713151
    >>> normP(_)
    0.8999997718215671
    >>> normIP(.1)
    -1.2815502653713149
    >>> normP(_)
    0.10000022817843296
    >>>

normP() is accurate to about 14 decimal digits; normIP() is accurate to about 6 decimal digits. The word "prob" values here are your f(w). > I'm not saying I believe that the z approach will turn out to be > better -- I just don't know -- but it seems worth trying. Happy to try, but really don't know how to proceed. There seems no reason to believe that the f(w) values lead to normIP() values that are *in fact* unit-normal distributed on a random collection of words, and I don't actually see a reason to believe that this would get closer to being true if the f(w) were ranked first. If we can define precisely what we mean by "a random collection of words", the idea that the resulting normIP() values are or aren't unit-normal distributed seems easily testable, though. From grobinson@transpose.com Tue Oct 15 21:50:43 2002 From: grobinson@transpose.com (Gary Robinson) Date: Tue, 15 Oct 2002 16:50:43 -0400 Subject: [Spambayes] z In-Reply-To: Message-ID: Urgh. Sorry. 
I am so totally swamped with work that I am only quickly looking in sometimes and I think I got a wrong impression before. Based on what you say in the message quoted below, I think you're already doing what I was hoping for, with the exception of the ranking part! I guess I was confused by the earlier message... And I also agree that it doesn't make sense to try ranking now because there are aspects to this data that mean it won't come out to a uniform distribution under a reasonable null hypothesis without more tweaking than I (or, I guess, any of us) can suggest a way to do at this point. --Gary -- Gary Robinson CEO Transpose, LLC grobinson@transpose.com 207-942-3463 http://www.emergentmusic.com http://radio.weblogs.com/0101454 > From: Tim Peters > Date: Tue, 15 Oct 2002 16:05:33 -0400 > To: Gary Robinson > Cc: SpamBayes > Subject: RE: [Spambayes] z > > [Tim] >>> If Rob is feeling particularly adventurous, it would be interesting (in >>> conncection with z-combining) to transform the database spamprobs into >>> unit-normalized zscores via his RMS black magic, as an extra >>> step at the endof update_probabilities(). This wouldn't require another > > [Gary Robinson] >> I didn't realize that this wasn't already being done. > > It's unclear to me what "this" means. RMS transformations? No, we're not > doing those here. > >> Yes I would recommend that somebody do this because I don't think we're >> really testing the z approach completely fairly until it is. > > You tell me whether this is this ; this is the code people have been > using: > > def z_spamprob(self, wordstream, evidence=False): > from math import sqrt > > clues = self._getclues(wordstream) > zsum = 0.0 > for prob, word, record in clues: > if record is not None: # else wordinfo doesn't know about it > record.killcount += 1 > zsum += normIP(prob) > > n = len(clues) > if n: > # We've added n zscores from a unit normal distribution. 
By the > # central limit theorem, their mean is normally distributed with > # mean 0 and sdev 1/sqrt(n). So the zscore of zsum/n is > # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n). > prob = normP(zsum / sqrt(n)) > else: > prob = 0.5 > > normIP() maps a probability p to the real z such that the area under the > unit Gaussian from -inf to z is p. normP() is the inverse, mapping real z > to the area under the unit Gaussian from -inf to z. Example: > >>>> normIP(.9) > 1.2815502653713151 >>>> normP(_) > 0.8999997718215671 >>>> normIP(.1) > -1.2815502653713149 >>>> normP(_) > 0.10000022817843296 >>>> > > normP() is accurate to about 14 decimal digits; normIP() is accurate to > about 6 decimal digits. > > The word "prob" values here are your f(w). > >> I'm not saying I believe that the z approach will turn out to be >> better -- I just don't know -- but it seems worth trying. > > Happy to try, but really don't know how to proceed. There's seems no reason > to believe that the f(w) values lead to normIP() values that are *in fact* > unit-normal distributed on a random collection of words, and I don't > actually see a reason to believe that this would get closer to being true if > the f(w) were ranked first. > > If we can define precisely what we mean by "a random collection of words", > the idea that the resulting normIP() values are or aren't unit-normal > distributed seems easily testable, though. > From tim.one@comcast.net Tue Oct 15 22:40:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 17:40:46 -0400 Subject: [Spambayes] Tokenizing numbers and money In-Reply-To: <3DAC1F05.7030809@hooft.net> Message-ID: [Rob Hooft] > I just scanned through my 250k token list and found that a surprising > number of these are numeric or almost numeric. Here is a random part of > the list: > > prob nham nspam token > 0.1552 1 0 3601.2 > 0.1552 1 0 3601.5 > 0.1552 1 0 3603.6 > ... > 0.9698 0 7 $65.00 > 0.8448 0 1 $369.00. > 0.9698 0 7 $149.00, > ... 
> [...but also...] > 0.9803 0 11 $30.00 > 0.9938 0 36 $319,210.00 > 0.9921 0 28 $25.00 > 0.9884 0 19 $100,000.00 > 0.9979 0 108 $5,000 > 0.9002 13 119 $500 > 0.9755 3 128 $50 > 0.9843 2 139 $25 > [...and...] > 0.9921 0 28 $25.00 > 0.8448 0 1 x5=$25.00. > 0.9082 0 2 us$25.00 > 0.9878 0 18 5=$25.00. > 0.9082 0 2 $25.00! > 0.9941 0 38 $25.00. > 0.9348 0 3 $25.00, > > Does anyone believe that "3605" is a real spam clue, and "3607" a real > ham clue? The question may be more whether hapaxes (unique occurrences) in general are useful clues. About half of all words are unique across all kinds of computer indices, and I expect this app will have more than most (since email has lots of artificial decorations). > I think collapsing numbers into a few classes might significantly reduce > the size of the database, Nuking hapaxes in general would probably cut it by more than half. What's special about numbers in this? > and actually help the classification. That I don't know, but there's reason to question it. We do know that each time it's been tried, fiddling the value of robinson_probability_s has had a real effect on results, and that reducing it from 1 has always helped. The effect of reducing it is to give more extreme spamprobs to rare words, so we already know that the treatment of rare words is important (or was important, in the schemes under which that experiment was tried). I don't know how numbers specifically fit into that. > Even though for someone doing fragrances "4711" may be a strong ham > clue, I think that on the whole this is just adding noise. You can try it, although it fights the "stupid beats smart" meta-rule. It's easy to think of examples in the other direction too. For example, I get an electronic order receipt with an order number, and a few days later get a shipping confirmation referencing the same number. If I trained on the order receipt between times, that "senseless number" is certainly going to help the shipping confirmation score low.
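For reference, robinson_probability_s enters through Gary Robinson's smoothing formula f(w) = (s*x + n*p(w)) / (s + n); a minimal sketch, assuming the spambayes defaults s = 0.45 and x = 0.5 (the default values are an assumption here, not quoted from the thread):

```python
# Sketch of the smoothing behind robinson_probability_s / _x, assuming
# Robinson's formula f(w) = (s*x + n*p(w)) / (s + n) with defaults
# s = 0.45 and x = 0.5 (both values are assumptions of this sketch).
def smoothed_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    # Raw spam ratio for the word, counts normalized by corpus sizes.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # Shrink toward the prior x with strength s; a smaller s gives rare
    # words more extreme spamprobs, which is the effect Tim describes.
    return (s * x + n * p) / (s + n)

# A one-spam hapax lands near 5/6 and a one-ham hapax near 1/6 --
# compare the 0.8448 and 0.1552 entries in Rob's token table.
print(round(smoothed_spamprob(0, 1, 4500, 4500), 4))  # 0.8448
print(round(smoothed_spamprob(1, 0, 4500, 4500), 4))  # 0.1552
```

With these values a hapax can never score more extreme than about 0.845 / 0.155, which is exactly the clamping visible in Rob's table.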
> How about something like tokens for > > num:float (e.g. 3624.2) > num:int (e.g. 3629) > num:intpair (e.g. 439,443) > num:$1 (for amounts between $0.00 and $9.99) > num:$10 (for amounts between $10 and $99.99) > num:$100 (for amounts between $100 and $999.99) > num:$1000 (for amounts between $1k and $10k) > num:$huge (for amounts >$10k) > > Each of these might have "logarithm suffixes"? Is this unrealistic? It's realistic to try it, but more expensive than the tokenization we do now (we do nothing at all for "words" of under 13 chars now except determine their length; the split-on-whitespace business goes at C speed). > Currently roughly one in six tokens in my list contains at least 3 > digits in a row! > > amigo[197]spambayes%% egrep -c ' .*[0-9][0-9][0-9]' balk.dat > 44757 > amigo[198]spambayes%% wc -l balk.dat > 255907 balk.dat I believe that, but it doesn't suggest anything to me other than that a sixth of your tokens contain at least 3 digits in a row -- how many contain at least 3 letters in a row ? From rbodkin@statalabs.com Tue Oct 15 23:01:31 2002 From: rbodkin@statalabs.com (Ron Bodkin) Date: Tue, 15 Oct 2002 15:01:31 -0700 Subject: [Spambayes] Wanted: contractor to work on spam control for innovative email client Message-ID: <200210152206.g9FM6KO27967@host12.webserver1010.com> This is a multipart message ---------------------- multipart/mixed attachment Hi all, I'm a consultant with Stata Labs, which is a Silicon Valley-based R&D firm. We are developing an innovative new email client, NewMonix. Like any email product, it needs to deal with spam effectively. We've been following the spambayes project with interest, and are impressed with the quality of discussion and the development going on. We're looking for a contractor to integrate spam filtering into our email client. I contacted Tim and he suggested that I post to the list. 
Our product has a Java back-end with a qt front-end, so we'd prefer to use those technologies rather than adding Python to the mix. I've attached a description of what we're looking for in a contract. For those interested in applying, please respond to Teresa Stancato (tstancato@statalabs.com). If you'd like to comment or have questions for the list, please cc my email address since I only read the archives occasionally. Thank you, Ron ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: Software Engineer-Spam.doc Type: application/doc Size: 34705 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/47110785/SoftwareEngineer-Spam.bin ---------------------- multipart/mixed attachment-- From popiel@wolfskeep.com Wed Oct 16 00:27:35 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 15 Oct 2002 16:27:35 -0700 Subject: [Spambayes] Wanted: contractor to work on spam control for innovative email client In-Reply-To: Message from "Ron Bodkin" of "Tue, 15 Oct 2002 15:01:31 PDT." <200210152206.g9FM6KO27967@host12.webserver1010.com> References: <200210152206.g9FM6KO27967@host12.webserver1010.com> Message-ID: <20021015232735.2B5ADF590@cashew.wolfskeep.com> In message: <200210152206.g9FM6KO27967@host12.webserver1010.com> "Ron Bodkin" writes: > >I'm a consultant with Stata Labs, which is a Silicon Valley-based R&D firm. [...] >We're looking for a contractor to integrate spam filtering The answer to this is probably of general interest, so I'll ask it publicly: Are you willing to have remote contractors, or do you only want people in the Bay Area? - Alex (several hundred miles to the north, and not moving) From tim.one@comcast.net Wed Oct 16 01:11:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 20:11:01 -0400 Subject: [Spambayes] Slice o' life Message-ID: This is a multi-part message in MIME format. 
---------------------- multipart/mixed attachment In the background, I've been a guinea pig for Sean True's and Mark Hammond's experiments hooking our code up to Outlook 2000. Toward that end, over the last week I've just been shuffling most of the spam I normally get, and the truly *hard* ham, into special folders. By "truly hard ham" I mean assorted HTML newsletters, PayPal announcements, company newsletters in odd formats, order/shipping confirmations, and conference announcements. In all that's 696 spam and 86 truly hard ham so far. Then I added in about 100 "typical" msgs from assorted work sources and friends. This has been my first chance to play with mining the headers for real: """ [Tokenizer] mine_received_headers: True basic_header_tokenize: True [Classifier] use_chi_squared_combining: True """ The performance on my real-life email is nothing short of amazing! The code adds a "Hammie" field to Outlook msgs, and I fiddled my Outlook views to show the new field, and to color msgs with a hammie score > 0.05 bold green. I'll attach a jpeg with a view of the tail end of today's email so far. That view is in chronological order, and the mix of 0.0 and 1.0 is typical. There are 523 pending msgs in my inbox right now that haven't been trained on, and the highest-scoring non-spam is 0.03 (a personal email from someone I didn't train on yet) There's also one with a score of 0.01. All the rest of the non-spam score 0.00 or -0.00 in the display (yes, I should fix that ). All the spam score 1.00. I suppose it helps that one of my email accounts automagically puts "Spam:" at the front of suspected-spam msg Subject lines, but I suspect it wouldn't matter a bit if they didn't. I didn't realize it before, but this stuff is cool ! ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: hammie.jpg Type: image/jpeg Size: 82792 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/6cf680fd/hammie.jpeg ---------------------- multipart/mixed attachment-- From tim.one@comcast.net Wed Oct 16 04:41:48 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 15 Oct 2002 23:41:48 -0400 Subject: [Spambayes] z-combining In-Reply-To: <3DAC15FD.3020200@hooft.net> Message-ID: [Rob Hooft] > I'm not exactly sure what you want me to renormalize using my black > magic, Me neither -- it's just something to think about if you're feeling particularly adventurous . I can't make a good case for it (more on that below). > but I did make an interesting histogram of 250000 single-token > spam probabilities... I'm hoping you're not assuming that this is > normally distributed, No, and my language was unclear before. I'll try to repair that next. > although it looks like that is what you are trying to do when > recalculating this into Z-scores. Not really. Here's the scoop: the chi- and z-combining schemes are both trying to reject the same hypothesis: the extreme-word probabilities in a msg are random, and uniformly distributed across [0, 1]. Now if you run chi2.py as a main program, it computes histograms showing quite clearly (especially if you boost the # of data points) that the internal chi-squared H and S statistics are uniformly distributed if you feed in vectors of random probabilities chosen uniformly from [0, 1]. S and H are as likely then to take any value as any other, and for each x in [0, 1], S <= x with probability x, and H <= x with probability x (another way of saying S and H are uniformly distributed). If you warp the input probabilities even a little, the histograms clearly react by shifting strongly to one side. You can do the same thing with the statistic computed by z-combining.
Replace chi2.judge() like so: def judge(ps, sqrt=_math.sqrt, normIP=normIP): zsum = 0.0 for p in ps: zsum += normIP(p) n = len(ps) prob = normP(zsum / sqrt(n)) return prob, 0, 0 The last two return values are dummies so you don't have to bother changing the code that calls this. Again the output probabilities are uniformly distributed across [0, 1], if the input probabilities are randomly chosen uniformly from [0, 1] too, and again biasing the input probabilities very clearly moves the output histogram away from a uniform distribution. So both are excellent tests for rejecting the hypothesis in question. A mystery is to what extent our computed spamprobs "act enough" like uniformly distributed random values so that rejecting the hypothesis is a valid and useful and predictably relevant thing to try. I don't know. But it *should* be a debating point for both schemes, not just for the z scheme: if our computed spamprobs don't meet the preconditions for the z-scheme to make sense, they probably fail likewise for the chi-squared scheme. In practice I can't say I see any evidence of that, though: both approaches routinely make extreme judgments with very low error rates, and the specific cases where the z-scheme does worse that I've looked at are adequately explained by z's vulnerability to "cancellation disease". Still, it's possible that one or both schemes would do even better if we found some way to precondition the computed spamprobs to fit the schemes' assumptions better. Ranking is one idea Gary has in mind for that (sorting the spamprobs and reassigning to values uniformly spaced). > Out of the 250k tokens I put in my histogram, 93k occurred exactly once > in the ham corpus of 4500 messages only, and ~75k exactly once in the > spam corpus of 4500 messages only..... Ya, those are the infamous hapaxes, and they consume more than half the database.
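The uniformity property Tim describes is easy to verify numerically. A minimal simulation, substituting statistics.NormalDist for chi2.py's hand-rolled normP/normIP helpers (that substitution is an assumption of this sketch, not the project's code):

```python
# Sanity-check: if the input probabilities are uniform on [0, 1], the
# z-combined output is uniform on [0, 1] too.  statistics.NormalDist
# stands in for chi2.py's normP/normIP approximations.
import random
from math import sqrt
from statistics import NormalDist

_nd = NormalDist()      # unit Gaussian
normIP = _nd.inv_cdf    # probability -> z-score
normP = _nd.cdf         # z-score -> probability

def z_combine(ps):
    # The sum of n unit-normal z-scores, divided by sqrt(n), is again
    # unit-normal; normP maps it back to a probability.
    zsum = sum(normIP(p) for p in ps)
    return normP(zsum / sqrt(len(ps)))

random.seed(42)
scores = sorted(z_combine([random.random() for _ in range(50)])
                for _ in range(2000))
# Under the null hypothesis the empirical quartiles should sit close
# to 0.25, 0.50, and 0.75.
print([round(scores[len(scores) * q // 4], 2) for q in (1, 2, 3)])
```

Skewing the inputs (e.g. random.random() ** 2) visibly shifts the whole score distribution, which is the histogram reaction Tim describes for chi2.py.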
A worthwhile experiment I haven't gotten to is to see what would happen if update_probabilities() purged them from the database. That really can't be done right with incremental learning/unlearning, so it would require one of the slower or harder test driver modes. > The noise you see at the baseline is messages that occur multiple times > in both ham and spam; amplified in the second image where > all words that occur only once or twice are removed from the histogram. > A histogram of words that occur more than 30 times in total is a bit > more flat, but still has many >30+0 / 0+>30 extremes. FYI, the modes are approximately at 1/6 and 5/6 because of the specific values we're using for robinson_probability_s and robinson_probability_x. They act to adjust a 1-message probability guess of 0.0 (word appeared in 1 ham, no spam) up to about 1/6, and a 1-message probability guess of 1.0 (word appeared in one spam, no ham) down to about 5/6. Your histogram would be squashed closer together by raising s, or spread out more by decreasing s. The value of x essentially determines where the median lies (assuming equal #s of ham and spam). > My strongest ham clue is "wrote:" (763+0) second "het" (533+0) [Dutch > for "the" for words without gender and for "it"], at the spam side it is > "8bit%:100" (0+937) and "charset:ks_c_5601-1987" (0+838) How's it doing on Asian spam for you? Those are about the only useful Asian clues we get, but they seem to suffice on my spam. >> The attraction of this [z] scheme is that it gives a single "spam >> probability" directly; combining distinct ham and spam indicators is >> still a bit of a puzzle (although a happy puzzle from my POV when >> both indicators suck, as happens in chi combining with large numbers >> of strong clues on both ends). > I don't see why this schema could not produce a "H" value as well, Exactly how? Subtracting the current z prob from 1 would do it, I guess. > and then mix it with the "S" score we're using now. 
Why would we want to? For example, is there some weakness in chi's current H you've identified? > This schema looks a lot like the "S" half of earlier ones like chi2 > combining. If you play with the chi2.py histogram suggestions above, you'll *see* that chi's S is especially sensitive to high-spamprob words, chi's H is especially sensitive to low-spamprob words, while z's output is equally sensitive to both. Those were all Gary's intended results, and they all work as he expected in these respects. > Think about what goes wrong if we would only use the S half of chi2 > combining: messages that look like both ham and spam come out as > perfect spam, and messages that look neither like ham nor spam come > out as perfect ham. I've actually run that experiment (using only the S part of chi-combining), but not reported on it, except to Gary offline. It did very well overall on my data, but had a systematic weakness akin to one you suggest: a higher false positive rate, due to msgs where a few very strong spam words manage to overpower a larger number of strong ham words, and due to S's greater sensitivity to high-spamprob words. The z-scheme isn't systematically weak in that way: it doesn't favor one kind of clue over the other (low-spamprob words generate negative z, high-spamprob words generate positive z, and the absolute value of the distance of a spamprob from 0.5 determines z's magnitude -- it's wholly symmetric). Its weakness appears to be cancellation disease, where a msg with lots of strong ham and lots of strong spam clues gets an extreme score in the direction of the flavor of clue that appears more often. chi-combining tends to get S and H both near 1 then, and returns 0.5. From seant@iname.com Wed Oct 16 04:52:21 2002 From: seant@iname.com (Sean True) Date: Tue, 15 Oct 2002 23:52:21 -0400 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: I think I'm in agreement with Tim. This stuff is wicked cool. 
And a simple regexp filter written in Python was easy to write, and easier to maintain than all those rules in Microsoft's pseudo NL syntax. I train my classifier with the out-of-the-box parameters, and I run Outlook with it turned on all the time. Outlook may not be your mailer of choice, but it has a fine UI for sorting mail. Makes weeding the remnant spam from the mailbox of 4500+ genuine ham much faster. Hats off to Mark for doing the heavy lifting of wiring up a Python addin for Outlook. Before that I was working with a really crappy VBA macro package that almost worked. Mark has been making daily improvements in the UI and the integration. It's COOL stuff. IMHO, and in my daily practice, this stuff is ready for deployment, and deployment inside the MUA makes some sense. The user is the one who knows what spam really is. It's the stuff in the Spam folder! Even if we can provide an efficient server side version for general spam (all that mail from Nigeria), I'm not sure that it's practical (or even wise) to do it all on the server. I've also trained filters to recognize some other mail classifications, and they work quite nicely. Thanks, folks. -- Sean From tim.one@comcast.net Wed Oct 16 05:06:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 00:06:21 -0400 Subject: [Spambayes] z In-Reply-To: Message-ID: [Gary Robinson] > ... > Based on what you say in the message quoted below, I think you're > already doing what I was hoping for, with the exception of the ranking > part! Me too . If I didn't mention it before, that code snippet *does* produce uniformly distributed outputs in [0, 1] when fed artificially constructed vectors of uniformly-distributed random probs, so there's nothing wrong with the theory or this implementation of it -- so far as it goes. > I guess I was confused by the earlier message...
> > And I also agree that it doesn't make sense to try ranking now > because there are aspects to this data that mean it won't come out > to a uniform distribution under a reasonable null hypothesis > without more tweaking than I (or, I guess, any of us) can suggest > a way to do at this point. More, I wouldn't see much point to it even if it were dead easy: the chi- and z- schemes are having no problems at all making correct extreme judgments about ham and spam 99+% of the time. The cases where they're prone to mistakes mostly fall in "a middle ground", and staring at many examples strongly suggests they're just freaking hard to classify. It's hard to imagine in what sense ranking (or any other probability preconditioning) could really help here -- the mistakes aren't failures to separate the spaces when a clear separation exists. However, I think it may well be worth pursuing with your *original* scheme, because that one had trouble establishing a clear boundary between ham and spam scores, and creating "a middle ground" for it via two cutoffs ended up capturing many more correctly classified messages than the middle grounds in the chi- and z- schemes (although the z-scheme is so extreme that sometimes the best spam cutoff is over 0.995! that's in part, though, due to the combination of wanting to avoid false positives, and that cancellation disease sometimes gives ham very high z spam scores). From tim.one@comcast.net Wed Oct 16 05:33:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 00:33:01 -0400 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: [Tim] > ... > This has been my first chance to play with mining the headers for real: > > """ > [Tokenizer] > mine_received_headers: True > basic_header_tokenize: True > > [Classifier] > use_chi_squared_combining: True > """ And now I note the first systematic weakness: I scored my own "spam" folder, and discovered 5 spam with scores of 0.0. 
They all have one thing in common: they're spam that SpamAssassin didn't catch, and came to me via a python.org mailing list. It turns out that python.org, Mailman, and SpamAssassin put sooooooooo many unique "Hey, I had my fingers in this!" clues in the headers that virtually any message coming thru python.org has a relatively huge collection of killer-strong ham clues (just listing headers containing such clues): Received: from mail.python.org (mail.python.org [12.155.117.29]) ... Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) ... Received: from [168.103.194.76] (helo=wvwrbn) by mail.python.org ... Subject: [Python-Help] Mp3sa hwnf Sender: python-help-admin@python.org To: help@python.org Errors-to: python-help-admin@python.org Precedence: bulk X-BeenThere: python-help@python.org X-warning: 168.103.194.76 in blacklist at list.dsbl.org (http://dsbl.org/listing.php?168.103.194.76) X-Spam-Status: No, hits=3.8 required=5.0 tests=BASE64_ENC_TEXT,CTYPE_JUST_HTML X-Spam-Level: *** X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: Expert volunteers answer Python-related questions This was an HTML msg that appeared to be pushing a Turkish MP3 site. It's not a dead-easy msg to score, but I also got a copy from another email account, and it scored 0.64 there (instead of 0 via python.org). I guess I go back to ignoring various header lines again ... From anthony@interlink.com.au Wed Oct 16 05:36:38 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 16 Oct 2002 14:36:38 +1000 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: <200210160436.g9G4acc09914@localhost.localdomain> >>> Tim Peters wrote > And now I note the first systematic weakness: I scored my own "spam" > folder, and discovered 5 spam with scores of 0.0.
They all have one thing > in common: they're spam that SpamAssassin didn't catch, and came to me via > a python.org mailing list. This is precisely the same problem that I had with my personal mail, and I had to take the same approach - disable the header frobbing. It's really frustrating, because there _are_ a bunch of great clues in there, but there are too many ham-pointing clues as well. I'm thinking about trying something which only looks at, say, the two oldest received lines or some such - but not today... -- Anthony Baxter It's never too late to have a happy childhood. From rob@hooft.net Wed Oct 16 06:03:48 2002 From: rob@hooft.net (Rob Hooft) Date: Wed, 16 Oct 2002 07:03:48 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DACF334.3040701@hooft.net> Tim Peters wrote: > I believe that, but it doesn't suggest anything to me other than that a > sixth of your tokens contain at least 3 digits in a row -- how many contain > at least 3 letters in a row ? Roughly two thirds. I may try to tokenize the numbers. Many numbers are not hapaxes, but I've seen ham significantly harmed by numbers that happened to be spam clues. I have customers that send me their log files full of numbers! Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From popiel@wolfskeep.com Wed Oct 16 07:08:47 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 15 Oct 2002 23:08:47 -0700 Subject: [Spambayes] Making Tester and TestDriver unsure Message-ID: <20021016060847.D7753F590@cashew.wolfskeep.com> I thought it would be interesting to bring the middle ground into the Tester and TestDriver, in preparation for new comparators (cmp.py and table.py) which grok the middle ground. Only so much I can do in one night, though. Have patch.
- Alex Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.48 diff -c -r1.48 Options.py *** Options.py 14 Oct 2002 17:13:47 -0000 1.48 --- Options.py 16 Oct 2002 06:08:15 -0000 *************** *** 107,112 **** --- 107,119 ---- # to work best on some data. spam_cutoff: 0.560 + # A message is considered ham iff it scores less than or equal to + # ham_cutoff. For a binary classifier, make ham_cutoff == spam_cutoff. + # If ham_cutoff < spam_cutoff, you get a classifier with a middle + # ground of unsurety. If ham_cutoff > spam_cutoff, results will + # be strange in ways that have not been fully thought out. + ham_cutoff: 0.560 + # Number of buckets in histograms. nbuckets: 200 show_histograms: True *************** *** 146,151 **** --- 153,159 ---- show_false_positives: True show_false_negatives: False + show_unsure: False # Near the end of Driver.test(), you can get a listing of the 'best # discriminators' in the words from the training sets. 
These are the *************** *** 311,322 **** --- 319,332 ---- 'show_spam_hi': float_cracker, 'show_false_positives': boolean_cracker, 'show_false_negatives': boolean_cracker, + 'show_unsure': boolean_cracker, 'show_histograms': boolean_cracker, 'show_best_discriminators': int_cracker, 'save_trained_pickles': boolean_cracker, 'save_histogram_pickles': boolean_cracker, 'pickle_basename': string_cracker, 'show_charlimit': int_cracker, + 'ham_cutoff': float_cracker, 'spam_cutoff': float_cracker, 'spam_directories': string_cracker, 'ham_directories': string_cracker, Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.23 diff -c -r1.23 TestDriver.py *** TestDriver.py 14 Oct 2002 18:04:56 -0000 1.23 --- TestDriver.py 16 Oct 2002 06:08:16 -0000 *************** *** 128,133 **** --- 128,134 ---- def __init__(self): self.falsepos = Set() self.falseneg = Set() + self.unsure = Set() self.global_ham_hist = Hist() self.global_spam_hist = Hist() self.ntimes_finishtest_called = 0 *************** *** 186,191 **** --- 187,197 ---- def alldone(self): if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) + + print "-> cost for all runs: $%.2f" % ( + len(self.falsepos) * options.best_cutoff_fp_weight + + len(self.falseneg) * options.best_cutoff_fn_weight + + len(self.unsure) * options.best_cutoff_unsure_weight) if options.save_histogram_pickles: for f, h in (('ham', self.global_ham_hist), *************** *** 229,234 **** --- 235,246 ---- print "-> false positive %:", t.false_positive_rate() print "-> false negative %:", t.false_negative_rate() + print "-> unsure %:", t.unsure_rate() + print "-> cost: $%.2f" % ( + t.nham_wrong * options.best_cutoff_fp_weight + + t.nspam_wrong * options.best_cutoff_fn_weight + + (t.nham_unsure + t.nspam_unsure) * + options.best_cutoff_unsure_weight) newfpos = Set(t.false_positives()) - self.falsepos 
self.falsepos |= newfpos *************** *** 250,255 **** --- 262,279 ---- if not options.show_false_negatives: newfneg = () for e in newfneg: + print '*' * 78 + prob, clues = c.spamprob(e, True) + printmsg(e, prob, clues) + + newunsure = Set(t.unsures()) - self.unsure + self.unsure |= newunsure + print "-> %d new unsure" % len(newunsure) + if newunsure: + print " new unsure:", [e.tag for e in newunsure] + if not options.show_unsure: + newunsure = () + for e in newunsure: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.5 diff -c -r1.5 Tester.py *** Tester.py 27 Sep 2002 21:18:18 -0000 1.5 --- Tester.py 16 Oct 2002 06:08:16 -0000 *************** *** 35,46 **** --- 35,49 ---- # The number of test instances correctly and incorrectly classified. self.nham_right = 0 self.nham_wrong = 0 + self.nham_unsure = 0; self.nspam_right = 0 self.nspam_wrong = 0 + self.nspam_unsure = 0; # Lists of bad predictions. self.ham_wrong_examples = [] # False positives: ham called spam. self.spam_wrong_examples = [] # False negatives: spam called ham. + self.unsure_examples = [] # Train the classifier on streams of ham and spam. Updates probabilities # before returning, and resets test results. *************** *** 85,108 **** if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! correct = is_spam_guessed == is_spam if is_spam: self.nspam_tested += 1 ! if correct: self.nspam_right += 1 ! else: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) else: self.nham_tested += 1 ! if correct: self.nham_right += 1 ! else: self.nham_wrong += 1 self.ham_wrong_examples.append(example) ! assert self.nham_right + self.nham_wrong == self.nham_tested ! 
assert self.nspam_right + self.nspam_wrong == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" --- 88,119 ---- if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! is_ham_guessed = prob <= options.ham_cutoff if is_spam: self.nspam_tested += 1 ! if is_spam_guessed: self.nspam_right += 1 ! elif is_ham_guessed: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) + else: + self.nspam_unsure += 1 + self.unsure_examples.append(example) else: self.nham_tested += 1 ! if is_ham_guessed: self.nham_right += 1 ! elif is_spam_guessed: self.nham_wrong += 1 self.ham_wrong_examples.append(example) + else: + self.nham_unsure += 1 + self.unsure_examples.append(example) ! assert self.nham_right + self.nham_wrong + self.nham_unsure \ ! == self.nham_tested ! assert self.nspam_right + self.nspam_wrong + self.nspam_unsure \ ! == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" *************** *** 112,123 **** --- 123,140 ---- """Percentage of spam mistakenly identified as ham, in 0.0..100.0.""" return self.nspam_wrong * 1e2 / self.nspam_tested + def unsure_rate(self): + return (self.nham_unsure + self.nspam_unsure) * 1e2 \ + / (self.nham_tested + self.nspam_tested) + def false_positives(self): return self.ham_wrong_examples def false_negatives(self): return self.spam_wrong_examples + def unsures(self): + return self.unsure_examples class _Example: def __init__(self, name, words): From rob@hooft.net Wed Oct 16 11:51:55 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Wed, 16 Oct 2002 12:51:55 +0200 Subject: [Spambayes] Slice o' life References: Message-ID: <3DAD44CB.103@hooft.net> Tim Peters wrote: > It turns out that python.org, Mailman, and SpamAssassin, put sooooooooo many > unique "Hey, I had my fingers this!" 
clues in the headers that virtually any > message coming thru python.org has a relatively huge collection of > killer-strong ham clues (just listing headers containing such clues): Correlations, correlations, correlations. It all boils down to correlations. Not the fact that there are correlations, but that they are very, very different from one clue to the next. All these mailman clues are correlated. And by not downweighting them, we're blinding the procedure to the other clues that do not come by the dozens... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From chk@pobox.com Wed Oct 16 16:13:03 2002 From: chk@pobox.com (Harald Koch) Date: Wed, 16 Oct 2002 11:13:03 -0400 Subject: [Spambayes] Re: Slice o' life In-Reply-To: Your message of "Wed, 16 Oct 2002 00:33:01 -0400". References: Message-ID: <9329.1034781183@elisabeth.cfrq.net> > And now I note the first systematic weakness: I scored my own "spam" > folder, and discovered 5 spam with scores of 0.0. They all have one thing > in common: they're spam that SpamAssassin didn't catch, and came to me via > a python.org mailing list. This is why I don't usually bother spam-filtering my email lists. I don't get much spam that way to begin with; most of my email lists have their own spam filters in place already. In the olden days, filtering lists resulted in too many fps; now it confuses the classifier. -- Harald Koch "It takes a child to raze a village." -Michael T. Fry From tim.one@comcast.net Wed Oct 16 19:49:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 14:49:03 -0400 Subject: [Spambayes] Slice o' life In-Reply-To: <3DAD44CB.103@hooft.net> Message-ID: [Rob W.W. Hooft] > Correlations, correlations, correlations. It all boils down to > correlations. Not the fact that there are correlations, but that they > are very, very different from one clue to the next. All these mailman > clues are correlated. 
And by not downweighting them, we're blinding the > procedure to the other clues that do not come by the dozens... It's not even that they're Mailman clues, though, it's more that python.org specifically already has strong anti-spam and anti-virus measures in place. That's how these "Mailman clues" earned their very low spamprobs to begin with -- it's not that Mailman is stopping spam, it's that virtually all the Mailman lists I'm on go through python.org. So when python.org screws up, there's little anything can do on the user's end, short of ignoring python.org clues as evidence. I don't know how to automate that in a no-brainer cross-user way (and, no, I still don't think 200K x 200K matrix analysis is tractable for this ). So far as python.org goes, I expect it will eventually use the code developed here, and its false negative rate should go down then (I haven't yet seen a spam approved by python.org that *this* code scores low when the python.org header clues are ignored). From tim.one@comcast.net Wed Oct 16 19:55:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 14:55:53 -0400 Subject: [Spambayes] Re: Slice o' life In-Reply-To: <9329.1034781183@elisabeth.cfrq.net> Message-ID: [Tim] > I scored my own "spam" folder, and discovered 5 spam with scores of 0.0. They > all have one thing in common: they're spam that SpamAssassin didn't catch, and > came to me via a python.org mailing list. [Harald Koch] > This is why I don't usually bother spam-filtering my email lists. I > don't get much spam that way to begin with; most of my email lists have > their own spam filters in place already. In the olden days, filtering > lists resulted in too many fps; now it confuses the classifier. Unclear. I retrained my home-mail classifier to go back to the "ignore most header lines" defaults, and these low-scoring spam scored high again. Regular list traffic continued to score low, presumably because it had genuine hammish content. 
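Rather than ignoring header lines wholesale, Anthony's idea from earlier in the thread — mining only the two oldest Received lines — could be prototyped with the standard email module. oldest_received() is a hypothetical helper sketched here, not spambayes code:

```python
# Sketch of mining only the oldest couple of Received headers (the
# hops closest to the sender), skipping the trusted local hops whose
# fingerprints act as killer-strong ham clues.  Hypothetical helper,
# not part of the spambayes tokenizer.
import email

def oldest_received(raw_message, n=2):
    msg = email.message_from_string(raw_message)
    # Each relay prepends its Received header, so the headers appear
    # newest-first; the oldest hops sit at the end of the list.
    return msg.get_all('Received', [])[-n:]

raw = (
    "Received: from mail.python.org (mail.python.org [12.155.117.29])\n"
    "Received: from localhost.localdomain ([127.0.0.1])\n"
    "Received: from [168.103.194.76] (helo=wvwrbn) by mail.python.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
for value in oldest_received(raw):
    print(value)
```

The oldest hop is the one the spammer can least disguise, so tokens mined from it should correlate with the sender rather than with the list server in between.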
What suffered some was personal email, which is sometimes very brief, and where sucking up header clues about who sent it is a real help. Some solicited commercial email also suffered. The system was still highly accurate, although this is not a controlled experiment, and the database has only one week of non-random email (so I won't draw any conclusions based on this). From rob@hooft.net Wed Oct 16 20:40:45 2002 From: rob@hooft.net (Rob Hooft) Date: Wed, 16 Oct 2002 21:40:45 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DADC0BD.7050806@hooft.net> Tim Peters wrote: > You can try it, although it fights the "stupid beats smart" meta-rule. It's > easy to think of examples in the other direction too. For example, I get an > electronic order receipt with an order number, and a few days later get a > shipping confirmation referencing the same number. If I trained on the > order receipt between times, that "senseless number" is certainly going to > help the shipping confirmation score low.
>
>>How about something like tokens for
>>
>>  num:float    (e.g. 3624.2)
>>  num:int      (e.g. 3629)
>>  num:intpair  (e.g. 439,443)
>>  num:$1       (for amounts between $0.00 and $9.99)
>>  num:$10      (for amounts between $10 and $99.99)
>>  num:$100     (for amounts between $100 and $999.99)
>>  num:$1000    (for amounts between $1k and $10k)
>>  num:$huge    (for amounts >$10k)
>>
>>Each of these might have "logarithm suffixes"? Is this unrealistic?
>
> It's realistic to try it, but more expensive than the tokenization we do now > (we do nothing at all for "words" of under 13 chars now except determine > their length; the split-on-whitespace business goes at C speed). More expensive, but I didn't notice it yet. First results: It doesn't make a difference.
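Rob's actual tokenizer change isn't shown in the message, but from the token names in the results that follow (num:int4, num:signfloat6, num:money5, ...) the bucketing appears to be a shape classifier plus a digit-length suffix. Here is a hypothetical sketch — the regexes and shape names are guesses reconstructed from the result table, not the code Rob ran:

```python
import re

# Guessed shapes, ordered so more specific patterns win.  The length
# suffix (e.g. num:int4 for "3629") is inferred from the result table.
_shapes = [
    (re.compile(r'^\$[\d,]+\.\d\d$'), 'money'),      # $1,234.56
    (re.compile(r'^\$[\d,]+$'),       'money'),      # $1234
    (re.compile(r'^[-+]\d+\.\d+$'),   'signfloat'),  # -3.14
    (re.compile(r'^\d+\.\d+$'),       'float'),      # 3624.2
    (re.compile(r'^[-+]\d+$'),        'signint'),    # +439
    (re.compile(r'^\d+,\d+$'),        'intpair'),    # 439,443
    (re.compile(r'^\d+$'),            'int'),        # 3629
]

def numeric_token(word):
    """Map a numeric-looking word to a num:<shape><length> token, else None."""
    for pat, shape in _shapes:
        if pat.match(word):
            return 'num:%s%d' % (shape, len(word))
    return None
```

The point of collapsing "3629" and "4711" into the same num:int4 token is exactly the trade-off discussed above: it shrinks the vocabulary (fewer one-off "senseless numbers"), at the cost of erasing any ham clue a specific number might carry.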
cv5: original code
cv8: with "num:XXX" tokens for simple numerics

amigo[109]spambayes%% grep -A1 'all runs' cv5.txt
-> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> min -1.22125e-13; median 1.3603e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> min 6.85483e-09; median 100; max 100

amigo[110]spambayes%% grep -A1 'all runs' cv8.txt
-> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
-> min -1.44329e-13; median 2.66842e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
-> min 7.69111e-09; median 100; max 100

cv8 now has the following tokens:

  prob nham nspam token
0.0082   27     0 num:float8
0.0088   25     0 num:signfloat6
0.0122   18     0 num:signfloat5
0.0137   16     0 num:signfloat4
0.0138  657     9 num:signint3
0.0167   13     0 num:signfloat7
0.0197   11     0 num:int12
0.0266    8     0 num:float10
0.0266    8     0 num:signfloat8
0.0302    7     0 num:signfloat9
0.0302    7     0 num:signint6
0.0413    5     0 num:signfloat10
0.0506    4     0 num:signfloat11
0.0868  265    25 num:signint5
0.0911   12     1 num:float9
0.1539  111    20 num:int10
0.1552    1     0 num:expfloat10
0.1552    1     0 num:float12
0.1552    1     0 num:signint9
0.1566   71    13 num:float7
0.1654   11     2 num:signint4
0.2085  255    67 num:int7
0.2248    4     1 num:float11
0.2656   64    23 num:int9
0.2935  164    68 num:float5
0.3138  431   197 num:float3
0.3196  194    91 num:float4
0.3550  151    83 num:int6
0.3596 1900  1067 num:int4
0.4041   65    44 num:int8
0.4255 1218   902 num:int3
0.4369  687   533 num:int5
0.4399   65    51 num:float6
0.4471    5     4 num:signint11
0.7432    4    12 num:int11
0.7752    1     4 num:signint8
0.8133  127   554 num:intpair
0.8356    1     6 num:signint10
0.9082    0     2 num:money12
0.9180    9   103 num:money5
0.9383   24   368 num:money4
0.9587    0     5 num:exclmoney12
0.9587    0     5 num:money9
0.9599    2    53 num:fracmoney9
0.9700   13   428 num:money3
0.9730    1    44 num:money10
0.9734    0     8 num:fracmoney4
0.9762    0     9 num:exclmoney4
0.9785    0    10 num:exclmoney11
0.9785    0    10 num:exclmoney9
0.9788    4   195 num:money8
0.9794    1    58 num:exclmoney8
0.9796    4   203 num:money6
0.9833    0    13 num:money11
0.9863    0    16 num:exclmoney5
0.9900    4   417 num:fracmoney6
0.9904    0    23 num:exclmoney10
0.9912    2   249 num:fracmoney7
0.9920    2   274 num:fracmoney5
0.9933    0    33 num:fracmoney8
0.9937    0    35 num:exclmoney6
0.9954    1   262 num:money7
0.9956    0    51 num:exclmoney7
0.9974    0    86 num:fracmoney11
0.9975    0    89 num:fracmoney10

Dead end? Or is the reduction in number of tokens significant? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From gward@python.net Wed Oct 16 21:10:09 2002 From: gward@python.net (Greg Ward) Date: Wed, 16 Oct 2002 16:10:09 -0400 Subject: [Spambayes] Re: Slice o' life In-Reply-To: <9329.1034781183@elisabeth.cfrq.net> References: <9329.1034781183@elisabeth.cfrq.net> Message-ID: <20021016201009.GA6778@cthulhu.gerg.ca> On 16 October 2002, Harald Koch said: > This is why I don't usually bother spam-filtering my email lists. I > don't get much spam that way to begin with; most of my email lists have > their own spam filters in place already. In the olden days, filtering > lists resulted in too many fps; now it confuses the classifier. Depends on the server -- I was surprised to learn that a list I follow fairly closely (optik-users@lists.sourceforge.net) got nothing but spam for most of the summer. I never knew until I looked at the archive, because SA on python.net kept all that spam out of my inbox, even if it's spam from a mailing list. (And yes, I am thinking of moving optik-users to either python.net or python.org...) Greg -- Greg Ward http://www.gerg.ca/ Vote Cthulhu -- why settle for a lesser evil? From popiel@wolfskeep.com Wed Oct 16 22:17:51 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Wed, 16 Oct 2002 14:17:51 -0700 Subject: [Spambayes] More modifications to TestDriver Message-ID: <20021016211751.4B11CF49B@cashew.wolfskeep.com> I mangled TestDriver some more to report the fp, fn, and unsure totals at the end, along with the percentages and cost.
This way I can have table.py eat the TestDriver output directly, instead of mediating it through the summary rates.py script... Another patch, from the same base as before. - Alex Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.48 diff -c -r1.48 Options.py *** Options.py 14 Oct 2002 17:13:47 -0000 1.48 --- Options.py 16 Oct 2002 21:17:54 -0000 *************** *** 107,112 **** --- 107,119 ---- # to work best on some data. spam_cutoff: 0.560 + # A message is considered ham iff it scores less than or equal to + # ham_cutoff. For a binary classifier, make ham_cutoff == spam_cutoff. + # If ham_cutoff < spam_cutoff, you get a classifier with a middle + # ground of unsurety. If ham_cutoff > spam_cutoff, results will + # be strange in ways that have not been fully thought out. + ham_cutoff: 0.560 + # Number of buckets in histograms. nbuckets: 200 show_histograms: True *************** *** 146,151 **** --- 153,159 ---- show_false_positives: True show_false_negatives: False + show_unsure: False # Near the end of Driver.test(), you can get a listing of the 'best # discriminators' in the words from the training sets. 
These are the *************** *** 311,322 **** --- 319,332 ---- 'show_spam_hi': float_cracker, 'show_false_positives': boolean_cracker, 'show_false_negatives': boolean_cracker, + 'show_unsure': boolean_cracker, 'show_histograms': boolean_cracker, 'show_best_discriminators': int_cracker, 'save_trained_pickles': boolean_cracker, 'save_histogram_pickles': boolean_cracker, 'pickle_basename': string_cracker, 'show_charlimit': int_cracker, + 'ham_cutoff': float_cracker, 'spam_cutoff': float_cracker, 'spam_directories': string_cracker, 'ham_directories': string_cracker, Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.23 diff -c -r1.23 TestDriver.py *** TestDriver.py 14 Oct 2002 18:04:56 -0000 1.23 --- TestDriver.py 16 Oct 2002 21:17:54 -0000 *************** *** 128,133 **** --- 128,134 ---- def __init__(self): self.falsepos = Set() self.falseneg = Set() + self.unsure = Set() self.global_ham_hist = Hist() self.global_spam_hist = Hist() self.ntimes_finishtest_called = 0 *************** *** 187,192 **** --- 188,209 ---- if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) + nham = self.global_ham_hist.n + nspam = self.global_spam_hist.n + nfp = len(self.falsepos) + nfn = len(self.falseneg) + nun = len(self.unsure) + print "-> all runs false positives:", nfp + print "-> all runs false negatives:", nfn + print "-> all runs unsure:", nun + print "-> all runs false positive %:", (nfp * 1e2 / nham) + print "-> all runs false negative %:", (nfn * 1e2 / nspam) + print "-> all runs unsure %:", (nun * 1e2 / (nham + nspam)) + print "-> all runs cost: $%.2f" % ( + nfp * options.best_cutoff_fp_weight + + nfn * options.best_cutoff_fn_weight + + nun * options.best_cutoff_unsure_weight) + if options.save_histogram_pickles: for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)): *************** *** 229,234 **** 
--- 246,257 ---- print "-> false positive %:", t.false_positive_rate() print "-> false negative %:", t.false_negative_rate() + print "-> unsure %:", t.unsure_rate() + print "-> cost: $%.2f" % ( + t.nham_wrong * options.best_cutoff_fp_weight + + t.nspam_wrong * options.best_cutoff_fn_weight + + (t.nham_unsure + t.nspam_unsure) * + options.best_cutoff_unsure_weight) newfpos = Set(t.false_positives()) - self.falsepos self.falsepos |= newfpos *************** *** 250,255 **** --- 273,290 ---- if not options.show_false_negatives: newfneg = () for e in newfneg: + print '*' * 78 + prob, clues = c.spamprob(e, True) + printmsg(e, prob, clues) + + newunsure = Set(t.unsures()) - self.unsure + self.unsure |= newunsure + print "-> %d new unsure" % len(newunsure) + if newunsure: + print " new unsure:", [e.tag for e in newunsure] + if not options.show_unsure: + newunsure = () + for e in newunsure: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.5 diff -c -r1.5 Tester.py *** Tester.py 27 Sep 2002 21:18:18 -0000 1.5 --- Tester.py 16 Oct 2002 21:17:55 -0000 *************** *** 35,46 **** --- 35,49 ---- # The number of test instances correctly and incorrectly classified. self.nham_right = 0 self.nham_wrong = 0 + self.nham_unsure = 0; self.nspam_right = 0 self.nspam_wrong = 0 + self.nspam_unsure = 0; # Lists of bad predictions. self.ham_wrong_examples = [] # False positives: ham called spam. self.spam_wrong_examples = [] # False negatives: spam called ham. + self.unsure_examples = [] # Train the classifier on streams of ham and spam. Updates probabilities # before returning, and resets test results. *************** *** 85,108 **** if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! correct = is_spam_guessed == is_spam if is_spam: self.nspam_tested += 1 ! 
if correct: self.nspam_right += 1 ! else: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) else: self.nham_tested += 1 ! if correct: self.nham_right += 1 ! else: self.nham_wrong += 1 self.ham_wrong_examples.append(example) ! assert self.nham_right + self.nham_wrong == self.nham_tested ! assert self.nspam_right + self.nspam_wrong == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" --- 88,119 ---- if callback: callback(example, prob) is_spam_guessed = prob > options.spam_cutoff ! is_ham_guessed = prob <= options.ham_cutoff if is_spam: self.nspam_tested += 1 ! if is_spam_guessed: self.nspam_right += 1 ! elif is_ham_guessed: self.nspam_wrong += 1 self.spam_wrong_examples.append(example) + else: + self.nspam_unsure += 1 + self.unsure_examples.append(example) else: self.nham_tested += 1 ! if is_ham_guessed: self.nham_right += 1 ! elif is_spam_guessed: self.nham_wrong += 1 self.ham_wrong_examples.append(example) + else: + self.nham_unsure += 1 + self.unsure_examples.append(example) ! assert self.nham_right + self.nham_wrong + self.nham_unsure \ ! == self.nham_tested ! assert self.nspam_right + self.nspam_wrong + self.nspam_unsure \ ! == self.nspam_tested def false_positive_rate(self): """Percentage of ham mistakenly identified as spam, in 0.0..100.0.""" *************** *** 112,123 **** --- 123,140 ---- """Percentage of spam mistakenly identified as ham, in 0.0..100.0.""" return self.nspam_wrong * 1e2 / self.nspam_tested + def unsure_rate(self): + return (self.nham_unsure + self.nspam_unsure) * 1e2 \ + / (self.nham_tested + self.nspam_tested) + def false_positives(self): return self.ham_wrong_examples def false_negatives(self): return self.spam_wrong_examples + def unsures(self): + return self.unsure_examples class _Example: def __init__(self, name, words): From popiel@wolfskeep.com Wed Oct 16 22:39:22 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Wed, 16 Oct 2002 14:39:22 -0700 Subject: [Spambayes] Ratios and chi-squared Message-ID: <20021016213923.0D317F49B@cashew.wolfskeep.com> I decided to see how chi-squared coped with differing ham:spam ratios in the training data. I'll also be checking the effect of training set size. In any case, here's a preview of the 5 sets (1000 ham & 1000 spam total) run, with ham_cutoff 0.05 and spam_cutoff 0.9...

"""
-> tested 50 hams & 200 spams against 200 hams & 800 spams
[ yadda yadda yadda ]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham:spam:  50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp total:       1       1       2       2       3       3       2
fp %:        0.40    0.27    0.40    0.32    0.40    0.34    0.20
fn total:       2       2       3       2       3       4       6
fn %:        0.20    0.23    0.40    0.32    0.60    1.07    2.40
unsure t:      26      24      25      33      29      26      37
unsure %:    2.08    1.92    2.00    2.64    2.32    2.08    2.96
real cost: $17.20  $16.80  $28.00  $28.60  $38.80  $39.20  $33.40
best cost: $15.60  $15.00  $19.80  $19.20  $27.80  $14.80  $14.60
h mean:      2.59    1.18    0.73    0.44    0.51    0.46    0.35
h sdev:     11.57    7.82    7.00    5.68    6.46    6.01    5.02
s mean:     99.31   98.95   98.32   97.41   96.84   96.10   93.12
s sdev:      7.03    8.50   10.20   12.75   14.43   15.70   19.33
mean diff:  96.72   97.77   97.59   96.97   96.33   95.64   92.77
k:           5.20    5.99    5.67    5.26    4.61    4.41    3.81
"""

The chi-squared combining seems much less sensitive to training set ratios than the default method. (Of course, it could just be the broad and obvious middle ground that's saving it.) I'll see what the rest of the data shows, and then do a real writeup... - Alex From tim.one@comcast.net Wed Oct 16 22:29:54 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 17:29:54 -0400 Subject: [Spambayes] chi-z combining: a worthless scheme In-Reply-To: Message-ID: If one other person thinks this is funny too, it was worth it.
Since the sum of squares of n unit-normal distributed vars follows a chi-squared distribution with n degrees of freedom, here's Yet Another test for rejecting the hypothesis that a vector of probs is uniformly distributed:

    S = 0.0
    for p in ps:
        z = normIP(p)
        S += z*z
    S = chi2Q(S, len(ps))

This works as it should: S is uniformly distributed when the input ps are uniformly distributed. But it combines the advantage of being equally sensitive to high-spamprob and low-spamprob words, with a remarkable disadvantage no other scheme to date has managed to achieve: it gives very low scores to ham *and* to spam, and very high scores to exceedingly bland msgs. Take that, BlandAssassin. From rbodkin@statalabs.com Wed Oct 16 23:59:47 2002 From: rbodkin@statalabs.com (Ron Bodkin) Date: Wed, 16 Oct 2002 15:59:47 -0700 Subject: [Spambayes] Wanted: contractor to work on spam control for innovative email client Message-ID: <200210162259.g9GMxOj08543@host12.webserver1010.com> Hi Alex, While we prefer local contractors, we are open to applications from outstanding candidates who are remote. The office is in Burlingame, California. In answer to other questions we received: the contract is for forty hour work weeks. The software is being developed on Windows using cygwin for build and test scripts. Thanks! Ron ------------Original Message------------- From: "T. Alexander Popiel" To: Ron Bodkin Date: Tue, 15 Oct 2002 16:27:35 -0700 Subject: Re: [Spambayes] Wanted: contractor to work on spam control for innovative email client In message: <200210152206.g9FM6KO27967@host12.webserver1010.com> "Ron Bodkin" writes: > >I'm a consultant with Stata Labs, which is a Silicon Valley-based R&D firm. [...] >We're looking for a contractor to integrate spam filtering The answer to this is probably of general interest, so I'll ask it publicly: Are you willing to have remote contractors, or do you only want people in the Bay Area?
- Alex (several hundred miles to the north, and not moving) From tim.one@comcast.net Thu Oct 17 04:35:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 16 Oct 2002 23:35:16 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes Message-ID: I propose to remove these options and their supporting code: use_central_limit use_central_limit2 use_central_limit3 The point of the 3 central limit schemes was (or, rather, turned out to be) to create a usable middle ground. chi-combining appears to do a better job of that, or at worst at least as good. As a (highly) practical matter, the central limit schemes are unique in requiring "a third training pass", and it's never become clear how to *do* that in an incremental way, short of saving every msg ever trained on and retraining on all whenever a new msg is added to training. So even if they did better, I don't know how to deploy them in real life. Luckily(?), they're not doing better, so that hard choice decision is easy to sidestep. use_z_combining It hasn't done better than chi-combining for anyone, and has done worse for some; it's known to be systematically vulnerable to cancellation disease. This would leave 3 combining schemes, none of which I'm willing to kill off yet: Gary's original scheme use_tim_combining use_chi_combining Note that these three are 100% compatible at the database level: they don't affect *training* at all. The only difference among them is the implementation of Bayes.spamprob() (the scoring function). A trained classifier can use any of these three freely. Indeed, it's possible (no experiments have been done on this) that a "hard" msg for one scheme could benefit via getting scored again by one or both of the others. Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm growing fonder of the non-chi schemes again. 
Rational or not, I find that the more uniform range of outcomes in [0.0, 1.0] is psychologically reassuring when using a UI that throws the scores in your face. If there are no killer objections, I'll remove the 4 schemes in question. From rob@hooft.net Thu Oct 17 05:22:02 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 06:22:02 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DAE3AEA.4020108@hooft.net> Tim Peters wrote: > That I don't know, but there's reason to question it. We do know that each > time it's been tried, fiddling the value of robinson_probability_s has had a > real effect on results, and that reducing it from 1 has always helped. The > effect of reducing it is to give more extreme spamprobs to rare words, so we > already know that the treatment of rare words is important (or was > important, in the schemes under which that experiment was tried). I don't > know how numbers specifically fit into that. The problem is that the final scoring has been adapted so thoroughly since those tests, that all of that should be done again. And then it becomes very difficult, because the procedure is so good now that we're all looking with a microscope at all our fp/fn's and anyway, I "agree" (that it "looks" wrong) with the filter in most of my fp/fn cases. 
I did try something:

s=0.25:
-> Ham scores for all runs: 16000 items; mean 0.51; sdev 4.70
-> min -1.33227e-13; median 1.19543e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.10; sdev 5.81
-> min 2.89463e-09; median 100; max 100

s=0.45:
-> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
-> min -1.44329e-13; median 2.66842e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
-> min 7.69111e-09; median 100; max 100

s=0.75:
-> Ham scores for all runs: 16000 items; mean 0.73; sdev 5.43
-> min -1.11022e-13; median 9.83325e-11; max 100
--
-> Spam scores for all runs: 5800 items; mean 98.95; sdev 5.68
-> min 3.83111e-05; median 100; max 100

And:

s=0.25:
-> best cost for all runs: $109.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.48 & 0.93
-> fp 6; fn 13; unsure ham 43; unsure spam 140
-> fp rate 0.0375%; fn rate 0.224%; unsure rate 0.839%
-> largest ham & spam cutoffs 0.49 & 0.93
-> fp 6; fn 14; unsure ham 39; unsure spam 139
-> fp rate 0.0375%; fn rate 0.241%; unsure rate 0.817%

s=0.45:
-> best cost for all runs: $112.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.495 & 0.975
-> fp 3; fn 15; unsure ham 42; unsure spam 295
-> fp rate 0.0187%; fn rate 0.259%; unsure rate 1.55%
-> largest ham & spam cutoffs 0.5 & 0.975
-> fp 3; fn 16; unsure ham 38; unsure spam 294
-> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.52%

s=0.75:
-> best cost for all runs: $108.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.505 & 0.95
-> fp 4; fn 13; unsure ham 46; unsure spam 230
-> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%

Don't know what to think about this. Total cost looks fairly insensitive here, but the distribution over the types of cost is different. Rob -- Rob W.W.
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Thu Oct 17 05:42:52 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 06:42:52 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAE3FCC.4060201@hooft.net> Tim Peters wrote: > I propose to remove these options and their supporting code: > > use_central_limit > use_central_limit2 > use_central_limit3 Go ahead. > use_z_combining I guess that means that no RMS magic can help here. Go ahead. > Note that these three are 100% compatible at the database level: they don't > affect *training* at all. The only difference among them is the > implementation of Bayes.spamprob() (the scoring function). A trained > classifier can use any of these three freely. Indeed, it's possible (no > experiments have been done on this) that a "hard" msg for one scheme could > benefit via getting scored again by one or both of the others. I don't expect a lot from that. You and I at least have repeatedly seen the same fp and fn's across methods. > Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm growing > fonder of the non-chi schemes again. Rational or not, I find that the more > uniform range of outcomes in [0.0, 1.0] is psychologically reassuring when > using a UI that throws the scores in your face. But it is unrealistic. Think about the original problem again: "why can't software that classifies ham/spam be very easy? Almost all spam's scream in your face that they are". With chi_squared combining we found a method that agrees with this. Most messages scream either "Ham" or "Spam", and there is very little left to doubt. You can downscale things a bit by reducing the final S,H-score in chi_squared combining before calling chi2Q. Maybe take the sqrt or something similar. That is actually realistic because of correlations. 
It may shift a few messages along the middle ground, but not have a lot of effect on separating ham and spam except broadening the distribution a bit. Maybe the better answer is that the final UI shouldn't throw the scores in your face. > If there are no killer objections, I'll remove the 4 schemes in question. Did you ever try tim combining with (S-H+1)/2? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Thu Oct 17 05:55:51 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 06:55:51 +0200 Subject: [Spambayes] Tokenizing numbers and money References: Message-ID: <3DAE42D7.6040505@hooft.net> Tim Peters wrote: >>Even though for someone doing fragrances "4711" may be a strong ham >>clue, I think that over the whole this is just adding noise. > > > You can try it, although it fights the "stupid beats smart" meta-rule. Here are some more results: original tokenizer: -> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96 -> min -1.22125e-13; median 1.3603e-11; max 100 -- -> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86 -> min 6.85483e-09; median 100; max 100 with my "num:" tokens (~10000 different tokens less): -> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00 -> min -1.44329e-13; median 2.66842e-11; max 100 -- -> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74 -> min 7.69111e-09; median 100; max 100 all words with at least two digits (r'^.*\d.*\d') removed (~45000 different tokens less): -> Ham scores for all runs: 16000 items; mean 0.61; sdev 5.08 -> min -1.11022e-13; median 1.24117e-10; max 100 -- -> Spam scores for all runs: 5800 items; mean 99.05; sdev 5.66 -> min 9.13394e-06; median 100; max 100 conclusion: The more numbers I throw out, the tighter the spam, and the wider the ham (both means go up). BTW: I just realized that the "sdev" in these lines is only determined by the few middle ground messages and the fp/fn's. 
I think this is not a good measure for the tightness of the distributions at all. At the very least we should throw out all points further away than 4 sigma in the calculation of sigma. Better still would be to give numbers like 1% of all ham scores are larger than XXX and 1% of all spam scores are smaller than YYY". Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Thu Oct 17 06:34:38 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 01:34:38 -0400 Subject: [Spambayes] Tokenizing numbers and money In-Reply-To: <3DAE3AEA.4020108@hooft.net> Message-ID: [Tim] > We do know that each time it's been tried, fiddling the value of > robinson_probability_s has had a real effect on results, and that > reducing it from 1 has always helped. The effect of reducing it > is to give more extreme spamprobs to rare words, so we already > know that the treatment of rare words is important (or was important, > in the schemes under which that experiment was tried). I don't > know how numbers specifically fit into that. [Rob Hooft] > The problem is that the final scoring has been adapted so thoroughly > since those tests, that all of that should be done again. Along with everything else <0.1 wink> -- everything is always open to question here. But I have to point out that the training and scoring code has remained absolutely regular under all schemes since abandoning Graham's original collection of deliberate biases: no special cases, no warts, no tweaks (the *tokenizer* code is a different story). The words with extreme spamprobs have the strongest effects under all schemes, and s controls how quickly or slowly a spamprob can *get* extreme relative to the # of msgs a word has been seen in. In that sense, there's some reason to believe "the best" value for s is more a function of the data than of the combining scheme. 
Make s too small and too much credence is given to accidents; make s too large and the amount of training data needed to get crisp decisions zooms.

> And then it becomes very difficult, because the procedure is so
> good now that we're all looking with a microscope at all our
> fp/fn's and anyway, I "agree" (that it "looks" wrong) with the
> filter in most of my fp/fn cases.

There's something else to vary too: nobody has looked at fiddling max_discriminators under the newer schemes, and from what I see here I think we all leave it at the default 150, which was chosen based on the death-match results pitting Gary's original scheme against Paul's scheme. It could be that max_discriminators should change.

> I did try something:
>
> s=0.25:
> -> Ham scores for all runs: 16000 items; mean 0.51; sdev 4.70
> -> min -1.33227e-13; median 1.19543e-11; max 100
> --
> -> Spam scores for all runs: 5800 items; mean 99.10; sdev 5.81
> -> min 2.89463e-09; median 100; max 100
>
> s=0.45:
> -> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
> -> min -1.44329e-13; median 2.66842e-11; max 100
> --
> -> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
> -> min 7.69111e-09; median 100; max 100
>
> s=0.75:
> -> Ham scores for all runs: 16000 items; mean 0.73; sdev 5.43
> -> min -1.11022e-13; median 9.83325e-11; max 100
> --
> -> Spam scores for all runs: 5800 items; mean 98.95; sdev 5.68
> -> min 3.83111e-05; median 100; max 100

That all makes sense, right? The lower s, the more extreme spamprobs get, and the higher s the less extreme. So from top to bottom, ham means and medians increase, spam means and medians decrease (well, that last is invisible for spam at this level of precision: at least half your spam scores above 100, to 6 significant digits, under all variations), and sdevs for all increase.
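For readers outside the project: s enters through Robinson's smoothed word-probability estimate, which pulls rarely seen words toward an assumed prior and lets heavily seen words keep their raw counting estimate. A sketch of the estimator as Tim describes it (the function and variable names here are illustrative, not the classifier's actual API; x is the prior assumed for unknown words):

```python
# Robinson-style smoothed spam probability for one word.
# s controls how fast the estimate can leave the prior x as evidence
# (n, the word's total training count) accumulates; smaller s lets
# rare words reach extreme probabilities sooner.
def adjusted_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    p = spamratio / (hamratio + spamratio)   # raw counting estimate
    n = hamcount + spamcount                 # evidence for this word
    return (s * x + n * p) / (s + n)         # shrink p toward x by s/n
```

A word seen once, only in spam, gets about 0.84 with the default s=0.45 but about 0.95 with s=0.1 — exactly the "more extreme spamprobs for rare words" effect discussed above.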
> And:
>
> s=0.25:
> -> best cost for all runs: $109.60
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.48 & 0.93
> -> fp 6; fn 13; unsure ham 43; unsure spam 140
> -> fp rate 0.0375%; fn rate 0.224%; unsure rate 0.839%
> -> largest ham & spam cutoffs 0.49 & 0.93
> -> fp 6; fn 14; unsure ham 39; unsure spam 139
> -> fp rate 0.0375%; fn rate 0.241%; unsure rate 0.817%
>
> s=0.45:
> -> best cost for all runs: $112.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.495 & 0.975
> -> fp 3; fn 15; unsure ham 42; unsure spam 295
> -> fp rate 0.0187%; fn rate 0.259%; unsure rate 1.55%
> -> largest ham & spam cutoffs 0.5 & 0.975
> -> fp 3; fn 16; unsure ham 38; unsure spam 294
> -> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.52%
>
> s=0.75:
> -> best cost for all runs: $108.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.505 & 0.95
> -> fp 4; fn 13; unsure ham 46; unsure spam 230
> -> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
>
> Don't know what to think about this. Total cost looks fairly
> insensitive here, but the distribution over the types of cost is
> different.

The most interesting thing there may be a coincidence <0.3 wink>: the default s=0.45 was obtained from staring at all the reports that came in during the Graham-vs-Robinson death match (tuning s for your data was part of the task there, although it was called "a" at the time), then picking a default value that appeared to get close to minimizing the fp rate across testers. And s=.45 minimized the fp rate in your results above. With an absolute # of fp so low, though, I'm afraid that just one specific oddball ham can easily warp the conclusions to fit it best.
If I try to mentally discount that, I think the data above suggests most that a higher-than-default value for s is better for some combination of

    your test data
    this combining scheme (which did you use? chi-combining?)
    this value of robinson_minimum_prob_strength (ditto)
    this value of max_discriminators (ditto)

It's not obvious how much training data you used here either, but do note that s=0.45 was picked from 10-fold cv runs with 200 ham and 200 spam in each set (that exact setup was a requirement for participating in the death match). You appear to be using about 10x more ham and something like 2.5x more spam than that, and I think it stands to reason that low s is potentially more helpful the less training data you have (no matter what the value of s, spamprobs *eventually* approach the raw estimates obtained from counting -- if you have a lot of data, the really strong clues remain really strong clues throughout this range of s values). BTW, I was reading a paper on boosting, and one observation struck home: boosting combines many rules in a weighted-average way, where the weights are adjusted iteratively, between passes boosting the "importance" of the examples the previous iteration misclassified. What the author found was that boosting worked better overall if he fiddled it to eventually *stop* paying attention to examples that were persistently and badly misclassified. In effect, trying ever harder to fit the outliers warped the whole scheme in their direction in ever more extreme ways, but almost by definition the outliers didn't fit the scheme at all. Similarly, I believe that some of our persistent fp and fn under this scheme are simply never going to go away, and endless fiddling of parameters to try to make them go away will hurt overall performance in a doomed attempt to redeem them. The combining schemes we've got now are excellent by any measure, and I suspect it's time to leave them alone.
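The ham/unsure/spam middle ground being tuned throughout this thread reduces to two comparisons once a message has a score. A minimal sketch, following the boundary conventions of Alex's Tester.py patch earlier in this digest (spam when prob > spam_cutoff, ham when prob <= ham_cutoff); the default cutoff values below are only illustrative, not project defaults:

```python
# Three-way decision: ham / unsure / spam.
# Boundary conventions match the Tester.py patch above; cutoffs are
# illustrative (Alex's ratio runs used ham_cutoff 0.05, spam_cutoff 0.9).
def classify(prob, ham_cutoff=0.05, spam_cutoff=0.90):
    if prob > spam_cutoff:      # strictly above spam_cutoff -> spam
        return 'spam'
    if prob <= ham_cutoff:      # at or below ham_cutoff -> ham
        return 'ham'
    return 'unsure'             # the middle ground
```

Setting ham_cutoff == spam_cutoff collapses this back to the old binary classifier, which is exactly how the pre-patch behavior is preserved.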
From tim.one@comcast.net Thu Oct 17 07:29:13 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 02:29:13 -0400 Subject: [Spambayes] Making Tester and TestDriver unsure In-Reply-To: <20021016060847.D7753F590@cashew.wolfskeep.com> Message-ID:

[T. Alexander Popiel]
> I thought it would be interesting to bring the middle ground
> into the Tester and TestDriver,

Indeed, long overdue. Thank you! I checked in a minor variation of this patch. Everyone, note that there's a new option ham_cutoff, and the meaning of spam_cutoff has changed slightly. Also a new bool option show_unsure. From the new Options.py:

"""
[TestDriver]
...
# spam_cutoff and ham_cutoff are used in Python slice sense:
#     A msg is considered ham if its score is in 0:ham_cutoff
#     A msg is considered unsure if its score is in ham_cutoff:spam_cutoff
#     A msg is considered spam if its score is in spam_cutoff:
#
# So it's unsure iff ham_cutoff <= score < spam_cutoff.
# For a binary classifier, make ham_cutoff == spam_cutoff.
# ham_cutoff > spam_cutoff doesn't make sense.
#
# The defaults are for the all-default Robinson scheme, which makes a
# binary decision with no middle ground. The precise value that works
# best is corpus-dependent, and values into the .600's have been known
# to work best on some data.
ham_cutoff: 0.560
spam_cutoff: 0.560
...
show_unsure: False
"""

I should probably add that 0.05 and 0.95 probably aren't optimal, but may well be close to optimal, if using chi-combining.

> in preparation for new comparators (cmp.py and table.py) which grok
> the middle ground.

Only so much I can do in one night, though. Same here, I'm afraid -- I won't get to your later patch tonight.

From mal@lemburg.com Thu Oct 17 11:22:01 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 12:22:01 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB Message-ID: <3DAE8F49.5080305@lemburg.com> Is anyone interested in trying out mxBeeBase as hammie DB ?
It is pretty fast, portable and seems to work out nicely. Oh yes, and the generated DB files are much smaller than for e.g. hammie with dbm backend. You do need the latest egenix-mx-base-2.1.0b5 installed though. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From mal@lemburg.com Thu Oct 17 12:21:20 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 13:21:20 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE8F49.5080305@lemburg.com> Message-ID: <3DAE9D30.4050801@lemburg.com> M.-A. Lemburg wrote: > Is anyone interested in trying out mxBeeBase as hammie DB ? > > It is pretty fast, portable and seems to work out nicely. > Oh yes, and the generated DB files are much smaller than > for e.g. hammie with dbm backend. > > You do need the latest egenix-mx-base-2.1.0b5 installed > though. Just to put some numbers by the fishes: Teaching hammie 13000 messages from comp.lang.python gives a database size of 23MB (that's data + index). Checking a single message takes 200ms on my Athlon 1200 (this includes Python startup time). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From guido@python.org Thu Oct 17 12:53:38 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 07:53:38 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Your message of "Thu, 17 Oct 2002 06:42:52 +0200." 
<3DAE3FCC.4060201@hooft.net> References: <3DAE3FCC.4060201@hooft.net> Message-ID: <200210171153.g9HBrcN10612@pcp02138704pcs.reston01.va.comcast.net> [Tim] > > Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm > > growing fonder of the non-chi schemes again. Rational or not, I > > find that the more uniform range of outcomes in [0.0, 1.0] is > > psychologically reassuring when using a UI that throws the scores > > in your face. [Rob] > But it is unrealistic. Think about the original problem again: "why > can't software that classifies ham/spam be very easy? Almost all > spam's scream in your face that they are". With chi_squared > combining we found a method that agrees with this. Most messages > scream either "Ham" or "Spam", and there is very little left to > doubt. But in real life there are also plenty of messages that mislead or defy the human screener (if only for a second), and if these still have a significant chance of becoming a f.p. or f.n., it would be appropriate if the score reflected that uncertainty. It may be clear by now that I haven't been following recent discussions much -- but the "all outcomes are extreme" characteristic was what led us to look for an alternative to Graham's scheme, and I've come to appreciate having a gray area. > Maybe the better answer is that the final UI shouldn't throw the > scores in your face. While you're still deciding on how much value you place on f.p. vs. f.n., the score can be very helpful (as long as it has a middle ground). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Thu Oct 17 13:13:25 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 08:13:25 -0400 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: Your message of "Thu, 17 Oct 2002 13:21:20 +0200." 
<3DAE9D30.4050801@lemburg.com> References: <3DAE8F49.5080305@lemburg.com> <3DAE9D30.4050801@lemburg.com> Message-ID: <200210171213.g9HCDPl11730@pcp02138704pcs.reston01.va.comcast.net> > M.-A. Lemburg wrote: > > Is anyone interested in trying out mxBeeBase as hammie DB ? > > > > It is pretty fast, portable and seems to work out nicely. > > Oh yes, and the generated DB files are much smaller than > > for e.g. hammie with dbm backend. > > > > You do need the latest egenix-mx-base-2.1.0b5 installed > > though. > > Just to put some numbers by the fishes: > > Teaching hammie 13000 messages from comp.lang.python > gives a database size of 23MB (that's data + index). > > Checking a single message takes 200ms on my Athlon 1200 > (this includes Python startup time). Can you post or (better!) check in a variant of hammie with this enabled? I'd like to see this! --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Thu Oct 17 13:39:24 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 14:39:24 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE8F49.5080305@lemburg.com> <3DAE9D30.4050801@lemburg.com> <200210171213.g9HCDPl11730@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3DAEAF7C.2060800@lemburg.com> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Guido van Rossum wrote: >>M.-A. Lemburg wrote: >> >>>Is anyone interested in trying out mxBeeBase as hammie DB ? >>> >>>It is pretty fast, portable and seems to work out nicely. >>>Oh yes, and the generated DB files are much smaller than >>>for e.g. hammie with dbm backend. >>> >>>You do need the latest egenix-mx-base-2.1.0b5 installed >>>though. >> >>Just to put some numbers by the fishes: >> >>Teaching hammie 13000 messages from comp.lang.python >>gives a database size of 23MB (that's data + index). >> >>Checking a single message takes 200ms on my Athlon 1200 >>(this includes Python startup time). 
> > > Can you post or (better!) check in a variant of hammie with this > enabled? I'd like to see this! I'd need checkin rights for that. Here's the drop-in file (I've renamed hammie.py to spambayes.py). The latest beta of egenix-mx-base is here: http://www.egenix.com/files/python/egenix-mx-base-2.1.0b5.tar.gz To install: run "python2.2 setup.py install". -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: spambayes.py Type: text/x-python Size: 13029 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/1f907181/spambayes.py ---------------------- multipart/mixed attachment-- From rob@hooft.net Thu Oct 17 13:59:35 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 14:59:35 +0200 Subject: [Fwd: Re: [Spambayes] Proposing to remove 4 combining schemes] Message-ID: <3DAEB437.6050301@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment sorry, I forgot to CC the list on this one. -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment An embedded message was scrubbed... From: "Rob W. W. Hooft" Subject: Re: [Spambayes] Proposing to remove 4 combining schemes Date: Thu, 17 Oct 2002 14:42:40 +0200 Size: 2123 Url: http://mail.python.org/pipermail-21/spambayes/attachments/20021017/629a7302/SpambayesProposingtoremove4combiningschemes.txt ---------------------- multipart/mixed attachment-- From rob@hooft.net Thu Oct 17 14:18:31 2002 From: rob@hooft.net (Rob W. W. 
Hooft) Date: Thu, 17 Oct 2002 15:18:31 +0200 Subject: [Spambayes] 5% points in statistics Message-ID: <3DAEB8A7.6010807@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment

I added 5% and 95% points to the statistics in Histogram.py. The calculation is similar to a "median": a median is the 50% point. This has as effect:

-> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> min 0; median 1.36141e-11; max 100
-> fivepctlo 0; fivepcthi 0.144228
-> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> min 6.85475e-09; median 100; max 100
-> fivepctlo 96.8278; fivepcthi 100

So indeed this reveals new information about the distributions: where "sdev" for ham and spam are very similar, the fivepct{lo,hi} values show that the distributions are NOT the same width. 95% of ham is 20 times tighter than 95% of spam.

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment

Index: Histogram.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Histogram.py,v
retrieving revision 1.5
diff -u -r1.5 Histogram.py
--- Histogram.py 8 Oct 2002 18:13:49 -0000 1.5
+++ Histogram.py 17 Oct 2002 13:13:41 -0000
@@ -28,6 +28,8 @@
     # min        smallest value in collection
     # max        largest value in collection
     # median     midpoint
+    # fivepctlo  five percent of data is lower than this
+    # fivepcthi  five percent of data is higher than this
     # mean
     # var        variance
     # sdev       population standard deviation (sqrt(variance))
@@ -47,6 +49,14 @@
             self.median = data[n // 2]
         else:
             self.median = (data[n // 2] + data[(n-1) // 2]) / 2.0
+        xfivepct = 0.05 * (n-1)
+        frac = xfivepct % 1.0
+        self.fivepctlo = (data[int(xfivepct)] * (1 - frac) +
+                          data[int(xfivepct)+1] * frac)
+        xfivepct = 0.95 * (n-1)
+        frac = xfivepct % 1.0
+        self.fivepcthi = (data[int(xfivepct)] * (1 - frac) +
+                          data[int(xfivepct) + 1] * frac)
         # Compute mean.
         # Add in increasing order of magnitude, to minimize roundoff error.
         if data[0] < 0.0:
@@ -124,6 +134,8 @@
         print "-> min %g; median %g; max %g" % (self.min,
                                                 self.median,
                                                 self.max)
+        print "-> fivepctlo %g; fivepcthi %g" % (self.fivepctlo,
+                                                 self.fivepcthi)
         lo, hi = self.get_lo_hi()
         if lo > hi:
             return

---------------------- multipart/mixed attachment--

From bkc@murkworks.com Thu Oct 17 14:47:53 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 17 Oct 2002 09:47:53 -0400 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: <3DAE9D30.4050801@lemburg.com> Message-ID: <3DAE86DF.22732.FD1E5F7@localhost> On 17 Oct 2002 at 13:21, M.-A. Lemburg wrote:

> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).
>
> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

What operating system, and how much RAM do you have?

Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements

From rob@hooft.net Thu Oct 17 15:12:06 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 16:12:06 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAEC536.10803@hooft.net> Sean True wrote:

> I hate to try to speak for Joe User (like speaking for the "common man",
> always a red flag), but I _am_ just a user of these scoring schemes. I have
> several hundred messages (commercial email) tucked away in a folder that
> score in the non-chi scheme in the range .4 to .6. That score appears to
> reflect my own real uncertainty about the value of Motley Fool newsletters.
> No snickering, please. A system like chi- looks like a very good choice for
> black and white, upstream discards offers to increase body part size.
>
> But I don't want these messages automatically discarded upstream, I want
> them labelled so that I can deal with them more efficiently.
>
> When I sort this particular folder by spam score, I get MIT club and
> Infoworld newsletters at the beginning (the good end), and the Motley
> Fool and Edgar Online at the other end, with a range of spam score from .2
> to .6. Just right. If I could color them continuously, it would be easy to
> spot the ones I want to read, now. And over time, as I change my definition
> of spam, their position in the list looks like it will vary smoothly -- and
> appropriately.
>
> This may not fit your original mission statement, but mission statements
> often don't survive contact with the enemy, err, customer.

But I agree 100%! Sorting on the spamminess/hamminess is very useful. Coloring on the spamminess/hamminess is very useful. But only in the middle ground folder. And the numeric values as such are useless, that is my humble opinion. Part of my work is to make "clean" user interfaces, and I am allergic to showing things that the user can't do anything with.

I understood the original idea of Tim to be that he wanted to see the spamminess of clearcut spam and the hamminess of clearcut ham. I don't see the point of that, but there would be an easy way to do it: remap the probabilities such that 0->0; hamcutoff->0.33; spamcutoff->0.66; 1->1 using any monotonic increasing function (e.g. three linear segments).

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From mal@lemburg.com Thu Oct 17 15:19:27 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 16:19:27 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE86DF.22732.FD1E5F7@localhost> Message-ID: <3DAEC6EF.6080304@lemburg.com> Brad Clements wrote: > On 17 Oct 2002 at 13:21, M.-A.
Lemburg wrote:

>>Just to put some numbers by the fishes:
>>
>>Teaching hammie 13000 messages from comp.lang.python
>>gives a database size of 23MB (that's data + index).
>>
>>Checking a single message takes 200ms on my Athlon 1200
>>(this includes Python startup time).
>
> What operating system, and how much RAM do you have?

SuSE Linux 8 on 1GB RAM. But why would that matter ? The process size is only 4.8MB.

-- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

From bkc@murkworks.com Thu Oct 17 15:27:32 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 17 Oct 2002 10:27:32 -0400 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: <3DAEC6EF.6080304@lemburg.com> Message-ID: <3DAE9029.2035.FF630F1@localhost> On 17 Oct 2002 at 16:19, M.-A. Lemburg wrote:

> > What operating system, and how much RAM do you have?
>
> SuSE Linux 8 on 1GB RAM. But why would that matter ? The process
> size is only 4.8MB.

Two thoughts:

1. you ran the test at least once before timing it, so Python and other stuff was probably "still in ram". Not exactly sure how Linux pages things, but on Windows this statement would most likely be true.

2. with less ram, you're more likely to need to throw out something to load Python and stuff (especially on Windows OS).

I just found the "load time" to be extremely low for a typical office worker box. You don't appear to have a typical box. If your box is typical, is your company hiring? ;-)

Note I'm not slighting Python, since the load time is a given no matter what. Just wanted to know how you achieved the low load time.

Regarding the 23 megabytes ... well, to run this on an IMAP server supporting 100 users, that's a lot of disk space.
I realize the context switching from one "user" to the next wouldn't be so bad using a database. If you were using a pickle, argh! Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From rob@hooft.net Thu Oct 17 15:49:15 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 16:49:15 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAECDEB.8030206@hooft.net> Sean True wrote: > The mail folder in question is not a middle ground folder, though. It's a > collection > of mail which I decided to keep at some point, for one reason or another. > And sorting by > the spam score appears to be _very_ useful for managing the contents. Ah, that is an interesting application. But do you really use the numbers, or is the ordering sufficient? > These filters learn by example -- and if the examples are ambiguous and > conflicting, the scores > should reflect that, right? Sure. But even the extreme scoring schemes do that! > I'm mostly lobbying for two things: the importance of MUA based scoring and > filtering, and the > retention of the non-extreme scoring schemes. Whether they are _default_ or > not should be a deployment > decision based on fitness to the task. If the scoring schemes are mutually compatible like the ones Tim proposed to keep, there is no harm in keeping them. But I think that the older schemes are a lot worse in their scoring than the newer ones, so I find it questionable whether they will be useful in any application. If you want a more linear judgement array, then rescaling the numbers produced by the chi2 method to something you can always read in 2 decimal digits might be more useful than a procedure that generates a sub-optimal ordering. Rob -- Rob W.W. 
Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Thu Oct 17 15:58:31 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 10:58:31 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAE3FCC.4060201@hooft.net> Message-ID: [Tim, suggests to remove use_z_combining] >> [Rob Hooft] > I guess that means that no RMS magic can help here. Go ahead. I really don't know, but I don't *see* a way. It's the normIP() results that are assumed to be unit-normal, and that happens iff the input probs are uniformly distributed. But the deviation of the latter from uniformity doesn't have any bad consequence I can detect -- to the contrary, if anything, it seems to make the ham-vs-spam decision easier. [on the 3 remaining schemes] >> Indeed, it's possible (no experiments have been done on this) that >> a "hard" msg for one scheme could benefit via getting scored again >> by one or both of the others. > I don't expect a lot from that. You and I at least have repeatedly seen > the same fp and fn's across methods. The same final decision, yes, but in at least my cases the *relative* scores across schemes are quite different. For example, even my worst FP, which scores nearly 1.0000000000000 under chi-combining, doesn't have a particularly high score under Gary-combining *when compared against* the universe of genuine-spam scores under Gary-combining. The few clues that this FP was posted by a real person count a lot under the latter. Not enough to drag it into ham territory (and nothing ever will do that), and not even enough to drag it into what could be reasonably called a middle ground for Gary-combining, but still below the mean for Gary-combining spam scores. The same is true of my other deadly-bad FP under chi-combining, but even more so.
I expect the same is true of Alex's data, because his first reaction when trying the more-extreme tim-combining (but far less extreme than chi-) was despair over how much *more* extreme his FP got. I assume they score 1.0 under chi-combining. So the idea to try here (which remains untested) would be to broaden chi's middle ground via thinking twice when Gary-combining is much less sure of a msg. This needs precise fleshing out before it can be tested, though. Note that the 3 remaining schemes all compute products of prods and of 1-prods, and the loopy bit doing that is the expensive part of scoring. Getting the 3 final measures out of that is really cheap. [on extreme vs non-extreme] > But it is unrealistic. Think about the original problem again: "why > can't software that classifies ham/spam be very easy? Almost all spam's > scream in your face that they are". With chi_squared combining we found > a method that agrees with this. Most messages scream either "Ham" or > "Spam", and there is very little left to doubt. It could be that the UI would be better off with a "ham", "spam", "unsure" string tag than with decimal digits of precision. > You can downscale things a bit by reducing the final S,H-score in > chi_squared combining before calling chi2Q. Maybe take the sqrt or > something similar. Not really attractive; sqrt would be far too gross a distortion, btw (e.g., it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev 2*sqrt(n)). > ... > Maybe the better answer is that the final UI shouldn't throw the scores > in your face. Possibly. For now it's helpful to me, since I'm a developer and really need a window on the internals. > ... > Did you ever try tim combining with (S-H+1)/2? No, but it would be an excellent idea to try it with the current default combining! 
tim-combining is unique in that its S is especially sensitive to *low*-spamprob words, and its H to high-spamprob words; when something really is spam, tim-combining isn't relying so much on having a high S value as on having a low H value, so that the ratio S/(S+H) approaches 1. Gary-combining is much more like chi-combining in these respects, and chi-combining is where the (S-H+1)/2 reformulation helped. From seant@webreply.com Thu Oct 17 14:25:54 2002 From: seant@webreply.com (Sean True) Date: Thu, 17 Oct 2002 09:25:54 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAEB040.3070302@hooft.net> Message-ID: > > [Tim] > > > >>>Now that I'm playing with a UI (Sean & Mark's code) as a user, I'm > >>>growing fonder of the non-chi schemes again. Rational or not, I > >>>find that the more uniform range of outcomes in [0.0, 1.0] is > >>>psychologically reassuring when using a UI that throws the scores > >>>in your face. > >> > > > > [Rob] > > > >>But it is unrealistic. Think about the original problem again: "why > >>can't software that classifies ham/spam be very easy? Almost all > >>spam's scream in your face that they are". With chi_squared > >>combining we found a method that agrees with this. Most messages > >>scream either "Ham" or "Spam", and there is very little left to > >>doubt. > > > > > > But in real life there are also plenty of messages that mislead or > > defy the human screener (if only for a second), and if these still > > have a significant chance of becoming a f.p. or f.n., it would be > > appropriate if the score reflected that uncertainty. > > But it does: between one and two percent of all messages deviates > significantly from 0.0 and 100.0; those are the ones we as humans take > more than split second to judge. > > > While you're still deciding on how much value you place on > > f.p. vs. f.n., the score can be very helpful (as long as it has a > > middle ground). 
>
> Sure, but for Joe User, this "should" be uninteresting.
>
> Rob

I hate to try to speak for Joe User (like speaking for the "common man", always a red flag), but I _am_ just a user of these scoring schemes. I have several hundred messages (commercial email) tucked away in a folder that score in the non-chi scheme in the range .4 to .6. That score appears to reflect my own real uncertainty about the value of Motley Fool newsletters. No snickering, please. A system like chi- looks like a very good choice for black and white, upstream discards offers to increase body part size.

But I don't want these messages automatically discarded upstream, I want them labelled so that I can deal with them more efficiently.

When I sort this particular folder by spam score, I get MIT club and Infoworld newsletters at the beginning (the good end), and the Motley Fool and Edgar Online at the other end, with a range of spam score from .2 to .6. Just right. If I could color them continuously, it would be easy to spot the ones I want to read, now. And over time, as I change my definition of spam, their position in the list looks like it will vary smoothly -- and appropriately.

This may not fit your original mission statement, but mission statements often don't survive contact with the enemy, err, customer.

-- Sean

From rob@hooft.net Thu Oct 17 16:25:54 2002 From: rob@hooft.net (Rob W. W. Hooft) Date: Thu, 17 Oct 2002 17:25:54 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAED682.3090905@hooft.net> I wrote about the huge certainties in chi2 combining:

>>You can downscale things a bit by reducing the final S,H-score in
>>chi_squared combining before calling chi2Q. Maybe take the sqrt or
>>something similar.

Tim wrote:
> Not really attractive; sqrt would be far too gross a distortion, btw (e.g.,
> it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev
> 2*sqrt(n)).

I tried it anyway.
Here are some results:

Normal:

-> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> min 0; median 1.36141e-11; max 100
-> fivepctlo 0; fivepcthi 0.144228
* = 253 items
 0.0 15415 *************************************************************
 0.5    84 *
 1.0    54 *
 1.5    30 *
 2.0    30 *
 2.5    17 *
 3.0    19 *
 3.5    19 *
 4.0    12 *
-> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> min 6.85475e-09; median 100; max 100
-> fivepctlo 96.8278; fivepcthi 100
* = 87 items
95.5    46 *
96.0    17 *
96.5    14 *
97.0    16 *
97.5    21 *
98.0    38 *
98.5    35 *
99.0    92 **
99.5  5300 *************************************************************
-> best cost for all runs: $102.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.495 & 0.96
-> fp 3; fn 14; unsure ham 40; unsure spam 253
-> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%

==================

Dividing the log-products and n by 2:

-> Ham scores for all runs: 16000 items; mean 0.76; sdev 5.07
-> min 0; median 1.19013e-05; max 99.9998
-> fivepctlo 0; fivepcthi 1.54439
* = 242 items
 0.0 14736 *************************************************************
 0.5   316 **
 1.0   134 *
 1.5   103 *
 2.0    74 *
 2.5    60 *
 3.0    37 *
 3.5    35 *
 4.0    34 *
-> Spam scores for all runs: 5800 items; mean 98.71; sdev 5.97
-> min 0.000221093; median 100; max 100
-> fivepctlo 92.9253; fivepcthi 100
* = 83 items
95.5    27 *
96.0    21 *
96.5    35 *
97.0    38 *
97.5    40 *
98.0    59 *
98.5    82 *
99.0   122 **
99.5  5005 *************************************************************
-> best cost for all runs: $104.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.92
-> fp 3; fn 14; unsure ham 43; unsure spam 259
-> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%

=============================================

Dividing the log-products and n by 4:

-> Ham scores for all runs: 16000 items; mean 1.32; sdev 5.49
-> min 0; median 0.0140483; max 99.9378
-> fivepctlo 1.11022e-14; fivepcthi 6.09162
* = 206 items
 0.0 12557 *************************************************************
 0.5   880 *****
 1.0   511 ***
 1.5   298 **
 2.0   223 **
 2.5   176 *
 3.0   135 *
 3.5   113 *
 4.0    91 *
-> min 0.0626454; median 99.9953; max 100
-> fivepctlo 87.8576; fivepcthi 100
* = 71 items
95.5    38 *
96.0    54 *
96.5    55 *
97.0    59 *
97.5    70 *
98.0   150 ***
98.5   142 **
99.0   280 ****
99.5  4331 *************************************************************
-> best cost for all runs: $108.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.48 & 0.855
-> fp 4; fn 13; unsure ham 46; unsure spam 230
-> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
-> largest ham & spam cutoffs 0.485 & 0.855
-> fp 4; fn 14; unsure ham 42; unsure spam 229
-> fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%

As I expected, this significantly broadens the extremes at only very little cost. What this does statistically is downweighting all clues, thereby taking care of a "standard" correlation between clues. This may be functionally equivalent to raising the value of s.

This is the /4 code for reference:

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.38
diff -u -r1.38 classifier.py
--- classifier.py 14 Oct 2002 02:20:35 -0000 1.38
+++ classifier.py 17 Oct 2002 15:24:55 -0000
@@ -516,7 +516,10 @@
         S = ln(S) + Sexp * LN2
         H = ln(H) + Hexp * LN2

-        n = len(clues)
+        S = S/4.0
+        H = H/4.0
+
+        n = len(clues)//4
         if n:
             S = 1.0 - chi2Q(-2.0 * S, 2*n)
             H = 1.0 - chi2Q(-2.0 * H, 2*n)

Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From rob@hooft.net Thu Oct 17 16:33:36 2002 From: rob@hooft.net (Rob W. W.
Hooft) Date: Thu, 17 Oct 2002 17:33:36 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAED850.6060107@hooft.net> Sean True wrote: > I'm not passionate about this in particular, but having a score that looks > like a gaussian > when I have a gaussian feel about the scored messages makes sense to me. Right! But the idea of spambayes is to make a binary classification between spam and ham. We have discovered that there is a middle ground which can be explored. But why would the "ham" behave in a Gaussian way under such a model? Ham is one of the two extremes, and most ham is very easy to recognize as such. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From mal@lemburg.com Thu Oct 17 16:42:24 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Oct 2002 17:42:24 +0200 Subject: [Spambayes] Using mxBeeBase as hammie DB References: <3DAE9029.2035.FF630F1@localhost> Message-ID: <3DAEDA60.20801@lemburg.com> Brad Clements wrote: > On 17 Oct 2002 at 16:19, M.-A. Lemburg wrote: > > >>>What operating system, and how much RAM do you have? >> >>SuSE Linux 8 on 1GB RAM. But why would that matter ? The process >>size is only 4.8MB. > > Two thoughts: > > 1. you ran the test at least once before timing it, so Python and other stuff was probably > "still in ram" Not exactly sure how Linux pages things, but on Windows this statement > would most likely be true. The times come directly from the system's time command and are user + system times (not wall clock). And yes, things were most probably still in memory since I always run the tests a few times and then take the numbers from the last test. > 2. with less ram, you're more likely to need to throw out something to load Python and > stuff (especially on Windows OS). True. > I just found the "load time" to be extremely low for a typical office worker box. You don't > appear to have a typical box. 
Hmm, this is a standard SuSE installation and not even an up-to-date
machine (1.2GHz is only half the speed of today's boxes). I am running
Reiser FS if that makes any difference.

> If your box is typical, is your company hiring? ;-)

Unfortunately, not. Bad times these days...

> Note I'm not slighting Python, since the load time is a given no matter what. Just
> wanted to know how you achieved the low load time.

Could be that the file system is using some smart caching technique
which makes the dozens of stat calls at Python startup time rather fast.

> Regarding the 23 megabytes . well, to run this on an IMAP server supporting 100
> users. That's a lot of disk space. I realize the context switching from one "user" to the
> next wouldn't be so bad using a database. If you were using a pickle, argh!

I suppose that you can easily create and use multiple spam databases,
e.g. have a central one for the whole company which only masks standard
spam and then use smaller ones per user which override the settings in
the main one if needed. Sort of like:

    md = open(maindict)
    ud = open(userdict)
    value = ud.get(key)
    if value is None:
        value = md[key]

The database size only increases as more words find their way into it.
I'm not sure, but perhaps it's possible to filter the entries and remove
meaningless ones (those with ~50% spam level). No idea. This time I'm a
user, not a developer ;-)

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From tim.one@comcast.net Thu Oct 17 16:48:02 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 11:48:02 -0400
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAEAF7C.2060800@lemburg.com>
Message-ID:

[M.-A. Lemburg]
> I'd need checkin rights for that.
Sorry, Marc-Andre, that excuse just went away.

mind-your-whitespace-and-we'll-get-along-just-fine-ly y'rs - tim

From tim.one@comcast.net Thu Oct 17 16:59:31 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 11:59:31 -0400
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAE9D30.4050801@lemburg.com>
Message-ID:

[M.-A. Lemburg, on mxBeeBase]
> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).

Note that at least half the words in the database are almost certainly
unique, and so of no actual use. Pruning the database, and especially
over time, is something that needs work here.

> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

For contrast, I run tests using a plain Python dict for "a database",
and reading up msgs stored one per file, but doing many (on the order of
1e5) scorings per run. On a slower 866MHz Pentium box with 256MB RAM,
this scores about 80 msgs/second, or about 12.5ms per msg (under 2.3 CVS
Python, which is zippier than 2.2.2). Firing up the system once per msg
is a real expense; keeping it running in the background all the time is
a real expense of a different kind.

From mal@lemburg.com Thu Oct 17 17:12:05 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 17 Oct 2002 18:12:05 +0200
Subject: [Spambayes] Using mxBeeBase as hammie DB
References:
Message-ID: <3DAEE155.6060202@lemburg.com>

Tim Peters wrote:
> [M.-A. Lemburg]
>
>>I'd need checkin rights for that.
>
> Sorry, Marc-Andre, that excuse just went away.

Oh dear, I knew that would happen ;-) Will I ever be a plain user ?

> mind-your-whitespace-and-we'll-get-along-just-fine-ly y'rs - tim

Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From tim.one@comcast.net Thu Oct 17 17:12:52 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 12:12:52 -0400
Subject: [Spambayes] 5% points in statistics
In-Reply-To: <3DAEB8A7.6010807@hooft.net>
Message-ID:

[Rob W. W. Hooft]
> I added 5% and 95% points to the statistics in Histogram.py. The
> calculation is similar to a "median": a median is the 50% point.

That's a fine idea! I would like to generalize it, and allow specifying
an arbitrary list of percentile points (e.g., in that sense, you've
hard-coded the list 5 95 and the code already hard-coded 50).

> This has as effect:
>
> -> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
> -> min 0; median 1.36141e-11; max 100
> -> fivepctlo 0; fivepcthi 0.144228
> -> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
> -> min 6.85475e-09; median 100; max 100
> -> fivepctlo 96.8278; fivepcthi 100
>
> So indeed this reveals new information about the distributions: where
> "sdev" for ham and spam are very similar, the fivepct{lo,hi} values show
> that the distributions are NOT the same width. 95% of ham is 20 times
> tighter than 95% of spam.

At least on that data. The sdev is a lot easier to make sense of under
schemes where score distributions look "kinda normal" (or normalish
Weibull, whatever). The same is true of the histograms, for that matter
-- when the histos approximate two solid bars at 0.1 and 1.0, they're
really not helpful. Percentile points make reasonable sense for all
distributions.

From mal@lemburg.com Thu Oct 17 17:19:32 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 17 Oct 2002 18:19:32 +0200
Subject: [Spambayes] Using mxBeeBase as hammie DB
References:
Message-ID: <3DAEE314.6040903@lemburg.com>

Tim Peters wrote:
> [M.-A.
Lemburg, on mxBeeBase] > >>Just to put some numbers by the fishes: >> >>Teaching hammie 13000 messages from comp.lang.python >>gives a database size of 23MB (that's data + index). > > > Note that at least half the words in the database are almost certainly > unique, and so of no actual use. Pruning the database, and especially over > time, is something that needs work here. Is there some way to do this automagically ? >>Checking a single message takes 200ms on my Athlon 1200 >>(this includes Python startup time). > > For contrast, I run tests using a plain Python dict for "a database", and > reading up msgs stored one per file, but doing many (on the order of 1e5) > scorings per run. On a slower 866MHz Pentium box with 256MB RAM, this > scores about 80 msgs/second, or about 12.5ms per msg (under 2.3 CVS Python, > which is zippier than 2.2.2). Firing up the system once per msg is a real > expense; keeping it running in the background all the time is a real expense > of a different kind. I suppose a PCGI style approach would be best here: you use a small C program as client (used for filtering by e.g. procmail) which then talks to a long-running daemon process. The alternative would be to wrap up the whole spambayes package into a mxCGIPython kind of frozen application which then uses an on-disk dictionary. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From popiel@wolfskeep.com Thu Oct 17 17:39:22 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 17 Oct 2002 09:39:22 -0700 Subject: [Spambayes] Full chi-squared ratio/training analysis Message-ID: <20021017163922.1685FF4CD@cashew.wolfskeep.com> Well, I promised a full writeup yesterday, so here it is. 
;-)

Quick summary: The chi-squared method is still sensitive to the
differing ham:spam ratios. Low ham:spam is still better than high
ham:spam, and the sweet spot seems to have moved down to about 1:2
ham:spam (or maybe even 1:3... but my data doesn't have the granularity
to tell). People wanting to use this on their real mailfeeds may want
to train with only a subset of their ham mail.

Also, chi-squared is relatively unaffected by _quantity_ of training
data (same as before). More training data (in the ranges I can provide)
brings at best modest improvement, and that is inconsistent.

Finally, chi-squared seems to do decently with 0.05/0.95 cutoffs,
regardless of ratio. While not perfectly ideal, the costs are generally
close to ideal (at worst about 1.5 times ideal).

Have some tables:

Chi-squared, 0.05-0.90 cutoffs, 5 sets:
-> tested 50 hams & 200 spams against 200 hams & 800 spams
[...]
-> tested 200 hams & 50 spams against 800 hams & 200 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   1       1       2       2       3       3       2
fp %:       0.40    0.27    0.40    0.32    0.40    0.34    0.20
fn total:   2       2       3       2       3       4       6
fn %:       0.20    0.23    0.40    0.32    0.60    1.07    2.40
unsure t:   26      24      25      33      29      26      37
unsure %:   2.08    1.92    2.00    2.64    2.32    2.08    2.96
real cost:  $17.20  $16.80  $28.00  $28.60  $38.80  $39.20  $33.40
best cost:  $15.60  $15.00  $19.80  $19.20  $27.80  $14.80  $14.60
h mean:     2.59    1.18    0.73    0.44    0.51    0.46    0.35
h sdev:     11.57   7.82    7.00    5.68    6.46    6.01    5.02
s mean:     99.31   98.95   98.32   97.41   96.84   96.10   93.12
s sdev:     7.03    8.50    10.20   12.75   14.43   15.70   19.33
mean diff:  96.72   97.77   97.59   96.97   96.33   95.64   92.77
k:          5.20    5.99    5.67    5.26    4.61    4.41    3.81

Chi-squared, 0.05-0.90 cutoffs, 8 sets:
-> tested 50 hams & 200 spams against 350 hams & 1400 spams
[...]
-> tested 200 hams & 50 spams against 1400 hams & 350 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   1       2       3       2       3       3       2
fp %:       0.25    0.33    0.38    0.20    0.25    0.21    0.12
fn total:   2       3       2       2       4       6       10
fn %:       0.12    0.21    0.17    0.20    0.50    1.00    2.50
unsure t:   37      33      39      41      46      44      39
unsure %:   1.85    1.65    1.95    2.05    2.30    2.20    1.95
real cost:  $19.40  $29.60  $39.80  $30.20  $43.20  $44.80  $37.80
best cost:  $16.00  $25.80  $29.20  $25.00  $24.00  $19.60  $18.80
h mean:     1.77    0.78    0.66    0.42    0.49    0.48    0.36
h sdev:     9.19    6.90    6.85    5.58    6.16    5.97    4.91
s mean:     99.42   99.03   98.73   97.96   96.88   96.09   93.81
s sdev:     6.02    7.63    8.39    11.27   14.29   15.85   20.02
mean diff:  97.65   98.25   98.07   97.54   96.39   95.61   93.45
k:          6.42    6.76    6.44    5.79    4.71    4.38    3.75

Chi-squared, 0.05-0.90 cutoffs, 10 sets:
-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   2       3       3       4       5       4       2
fp %:       0.40    0.40    0.30    0.32    0.33    0.23    0.10
fn total:   5       6       4       5       6       7       9
fn %:       0.25    0.34    0.27    0.40    0.60    0.93    1.80
unsure t:   41      37      38      39      45      42      49
unsure %:   1.64    1.48    1.52    1.56    1.80    1.68    1.96
real cost:  $33.20  $43.40  $41.60  $52.80  $65.00  $55.40  $38.80
best cost:  $28.60  $28.40  $34.00  $35.60  $34.60  $30.60  $28.60
h mean:     1.31    0.58    0.50    0.46    0.51    0.48    0.36
h sdev:     8.51    6.47    6.46    6.25    6.44    6.12    4.97
s mean:     99.25   98.92   98.60   98.17   97.25   96.73   94.66
s sdev:     6.75    8.05    9.04    10.76   13.47   14.49   18.20
mean diff:  97.94   98.34   98.10   97.71   96.74   96.25   94.30
k:          6.42    6.77    6.33    5.74    4.86    4.67    4.07

Chi-squared, 0.05-0.95 cutoffs, 10 sets:
-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   2       3       3       3       2       2       2
fp %:       0.40    0.40    0.30    0.24    0.13    0.11    0.10
fn total:   5       6       4       5       6       7       9
fn %:       0.25    0.34    0.27    0.40    0.60    0.93    1.80
unsure t:   49      44      49      46      54      58      53
unsure %:   1.96    1.76    1.96    1.84    2.16    2.32    2.12
real cost:  $34.80  $44.80  $43.80  $44.20  $36.80  $38.60  $39.60
best cost:  $28.60  $28.40  $34.00  $35.60  $34.60  $30.60  $28.60
h mean:     1.31    0.58    0.50    0.46    0.51    0.48    0.36
h sdev:     8.51    6.47    6.46    6.25    6.44    6.12    4.97
s mean:     99.25   98.92   98.60   98.17   97.25   96.73   94.66
s sdev:     6.75    8.05    9.04    10.76   13.47   14.49   18.20
mean diff:  97.94   98.34   98.10   97.71   96.74   96.25   94.30
k:          6.42    6.77    6.33    5.74    4.86    4.67    4.07

Chi-squared, 0.02-0.98 cutoffs, 10 sets:
-> tested 50 hams & 200 spams against 450 hams & 1800 spams
[...]
-> tested 200 hams & 50 spams against 1800 hams & 450 spams

ham:spam:   50-200  75-175  100-150 125-125 150-100 175-75  200-50
fp total:   2       3       2       3       2       2       1
fp %:       0.40    0.40    0.20    0.24    0.13    0.11    0.05
fn total:   4       4       3       3       3       5       6
fn %:       0.20    0.23    0.20    0.24    0.30    0.67    1.20
unsure t:   60      63      63      62      67      70      73
unsure %:   2.40    2.52    2.52    2.48    2.68    2.80    2.92
real cost:  $36.00  $46.60  $35.60  $45.40  $36.40  $39.00  $30.60
best cost:  $28.60  $28.40  $34.00  $35.60  $34.60  $30.60  $28.60
h mean:     1.31    0.58    0.50    0.46    0.51    0.48    0.36
h sdev:     8.51    6.47    6.46    6.25    6.44    6.12    4.97
s mean:     99.25   98.92   98.60   98.17   97.25   96.73   94.66
s sdev:     6.75    8.05    9.04    10.76   13.47   14.49   18.20
mean diff:  97.94   98.34   98.10   97.71   96.74   96.25   94.30
k:          6.42    6.77    6.33    5.74    4.86    4.67    4.07

The first three tables show the effects of ratio and training set size
over otherwise consistent parameters. The last three tables show the
effects of ratio and cutoffs. These results _ARE_ comparable to my
earlier ratio tests; I'm using the same training data, the same random
seed, etc. I ought to rerun the old tests to get the best cost info for
them, though...
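The "real cost" and "best cost" rows use the list's standard cost measure (per-fp cost $10.00, per-fn cost $1.00, per-unsure cost $0.20, as the result listings elsewhere in this digest state). A minimal sketch of that arithmetic, with a function name of my own choosing:

```python
def run_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Dollar cost of one test run: false positives are expensive,
    false negatives cheap, and each unsure costs a bit of review time."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# The 50-200 column of the first table above: 1 fp, 2 fn, 26 unsure
print("$%.2f" % run_cost(1, 2, 26))  # -> $17.20
```

As a cross-check, Rob's "Normal" run quoted elsewhere in this digest (fp 3; fn 14; unsure ham 40; unsure spam 253) comes to $102.60 under the same weights, matching its reported best cost.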
I have this up on my website at:

http://www.wolfskeep.com/~popiel/spambayes/chi

- Alex

From nas@python.ca Thu Oct 17 17:48:14 2002
From: nas@python.ca (Neil Schemenauer)
Date: Thu, 17 Oct 2002 09:48:14 -0700
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAE9D30.4050801@lemburg.com>
References: <3DAE8F49.5080305@lemburg.com> <3DAE9D30.4050801@lemburg.com>
Message-ID: <20021017164814.GA3731@glacier.arctrix.com>

M.-A. Lemburg wrote:
> Just to put some numbers by the fishes:
>
> Teaching hammie 13000 messages from comp.lang.python
> gives a database size of 23MB (that's data + index).
>
> Checking a single message takes 200ms on my Athlon 1200
> (this includes Python startup time).

$ time python2.3 neilfilter.py wordprobs.cdb ~/Maildir/ ~/Maildir/ < test.msg
real    0m0.139s
user    0m0.090s
sys     0m0.020s
$ ls -s wordprobs.cdb
4556 wordprobs.cdb

The database was trained on about 3800 messages. test.msg is about 1 kB
in size. My machine is an Athlon 1700+ with 512 MB of RAM.

Neil

From popiel@wolfskeep.com Thu Oct 17 17:54:28 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 17 Oct 2002 09:54:28 -0700
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: Message from "M.-A. Lemburg" of "Thu, 17 Oct 2002 17:42:24 +0200." <3DAEDA60.20801@lemburg.com>
References: <3DAE9029.2035.FF630F1@localhost> <3DAEDA60.20801@lemburg.com>
Message-ID: <20021017165428.E228CF4CD@cashew.wolfskeep.com>

In message: <3DAEDA60.20801@lemburg.com>
            "M.-A. Lemburg" writes:
>> I just found the "load time" to be extremely low for a typical office
>> worker box. You don't appear to have a typical box.
>
>Hmm, this is a standard SuSE installation and not even an up-to-date
>machine (1.2GHz is only half the speed of today's boxes). I am running
>Reiser FS if that makes any difference.

*snort*

At home, I'm running on a 300MHz PII. At work, I've got a 350MHz PII
and a 450MHz PIII. Getting that latter box required effort akin to
pulling teeth. And I'm a coder!
Office workers often get utter crap for hardware. - Alex From neale@woozle.org Thu Oct 17 18:08:51 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 10:08:51 -0700 Subject: [Spambayes] Using mxBeeBase as hammie DB In-Reply-To: <3DAEE314.6040903@lemburg.com> References: <3DAEE314.6040903@lemburg.com> Message-ID: So then, "M.-A. Lemburg" is all like: > I suppose a PCGI style approach would be best here: you use a small > C program as client (used for filtering by e.g. procmail) which then > talks to a long-running daemon process. You might be interested in hammiesrv.py, which is an XML-RPC hammie server. I'll check in hammiecli.py, which we've been using at work, right after I send this message out. BTW, I have a buttload of email from folks at work now, I'll post some results of testing it all against my wordlist RSN. So many cool projects, so little time. Neale From tim.one@comcast.net Thu Oct 17 19:10:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 14:10:14 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <200210171153.g9HBrcN10612@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > ... > It may be clear by now that I haven't been following recent discussions > much -- but the "all outcomes are extreme" characteristic was what led > us to look for an alternative to Graham's scheme, and I've come to > appreciate having a gray area. Rob covered this well, but I want to hammer the point home, because I expect most people here have been overwhelmed by the tech talk: given enough training data, Graham's combining scheme was *always* extreme. chi-combining is extreme "only" about 99% of the time, and the ~1% of the time it isn't extreme turns out to contain most of its mistakes. It's the closest thing to a laser beam we've got, and it's really quite amazing. Playing with chi2.py as a main program can be very instructive. 
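For readers who want to see the laser-beam behavior without checking out the project, the core of the chi-combining computation can be condensed as below. This is a simplified standalone sketch, not classifier.py itself: the real code additionally tracks exponents (the Sexp/Hexp terms visible in Rob's diff) to avoid floating-point underflow on long clue lists, and draws its per-clue probabilities from the trained database.

```python
from math import exp, log

def chi2Q(x2, v):
    """Survival function (1 - CDF) of the chi-squared distribution,
    for an even number of degrees of freedom v, via the standard
    series expansion."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Chi-combining sketch: fold per-clue spam probabilities (each
    strictly between 0 and 1) into one score in 0..1.  Conflicting or
    neutral clues land near 0.5 -- the 'middle ground'."""
    n = len(probs)
    S = sum(log(1.0 - p) for p in probs)  # log-product of (1 - p)
    H = sum(log(p) for p in probs)        # log-product of p
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)      # evidence the msg is spam
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)      # evidence the msg is ham
    return (S - H + 1.0) / 2.0

print(round(chi_combine([0.9]), 6))        # -> 0.9
print(round(chi_combine([0.99] * 20), 6))  # -> 1.0
print(round(chi_combine([0.5] * 20), 6))   # -> 0.5
```

Note the single-clue case: the combined score is just that clue's own probability, one quick sanity check that the combining favors no outcome in particular.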
If you fiddle its judge() function to do Graham-combining, the
histograms it prints show that G-combining makes an extreme judgement
most of the time even when fed collections of random (uniformly
distributed) probabilities: it infers certainty out of thin air. But
the S and H statistics that go into chi-combining are uniform in the
face of random input: an S value of 0.001 is as likely as a value of
0.999 is as likely as a value of 0.500 (etc): given random input,
they're unbiased, favoring no outcome in particular. These reflect real
life too: chi-combining knows when it's confused, Graham-combining
doesn't, and both behaviors show up in the real-life score
distributions. That chi-combining is rarely confused is really a great
strength; that Graham-combining is (almost) never confused is its great
weakness.

From tim.one@comcast.net Thu Oct 17 19:28:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 14:28:20 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To:
Message-ID:

[Sean True]
> I hate to try to speak for Joe User (like speaking for the "common man",
> always a red flag), but I _am_ just a user of these scoring schemes.
> I have several hundred messages (commercial email) tucked away in a
> folder that score in the non-chi scheme in the range .4 to .6. That
> score appears to reflect my own real uncertainty about the value of
> Motley Fool newsletters. No snickering, please. A system like chi-
> looks like a very good choice for black and white, upstream discards
> offers to increase body part size.

Try rescoring your msgs using chi-combining before simply guessing that
it's inappropriate for you. You don't have to retrain your database,
you just have to change the spamprob() used for scoring. In my email so
far, chi-combining is correctly and extremely certain that most of my
commercial ham is in fact ham.
The worst it does is on the one HTML newsletter I have from Strong
Investments so far, of which there are no examples in my training data.
It gets a score of about 0.6, very solidly in its "middle ground". A UI
that makes most sense under a middle-ground scheme would shuffle "I'm
pretty sure it's spam" into a Spam folder, and the ones it knows it's
confused about into an Unsure folder. The rest ("I'm pretty sure it's
ham") would be left in your inbox. We can't really do that with the
default combining scheme because about half your email would end up in
the Unsure folder -- the separation it makes between populations is too
fuzzy.

> But I don't want these messages automatically discarded upstream, I want
> them labelled so that I can deal with them more efficiently.

Try chi-combining first. There's no requirement that extreme msgs get
discarded here, that's simply a choice the emailmeister at python.org is
likely to make. I've said many times that I personally will never use a
scheme that discards a msg without my review, so short of a major
personality transplant you can be sure I'm not going to put anything in
this project that requires such trust.

> When I sort this particular folder by spam score, I get MIT club and
> Infoworld newsletters at the beginning (the good end), and the
> Motley Fool and Edgar Online at the other end, with a range of spam
> score from .2 to .6. Just right. If I could color them continuously, it
> would be easy to spot the ones I want to read, now. And over time, as I
> change my definition of spam, their position in the list looks like it
> will vary smoothly -- and appropriately.
>
> This may not fit your original mission statement, but mission statements
> often don't survive contact with the enemy, err, customer.

I was the one who said I wasn't willing to kill off the non-extreme
methods (yet), because there are *many* customers here, and the union of
what they want can't be had with a single scheme.
But a fan of a particular scheme gets a lot more credibility after
they've tried the alternatives and given them thought based on actual
experience.

From guido@python.org Thu Oct 17 19:29:59 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 17 Oct 2002 14:29:59 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: Your message of "Thu, 17 Oct 2002 11:19:44 PDT."
References:
Message-ID: <200210171829.g9HITxI21925@odiug.zope.com>

Neale's hammie client and server seem to me to be wasting some effort.
Currently, what happens, is:

    cli sends the entire message to svr
    svr parses and scores the message
    svr inserts the X-Hammie-Disposition header in the message
    svr sends the message, thus modified, back
    cli prints the returned, modified, message to stdout

What would make more sense from the POV of minimizing traffic and
minimizing work done in the server:

    cli parses the message
    cli sends the list of tokens to svr
    svr scores the list of tokens
    svr returns the text to be inserted in the X-Hammie-Disposition header
    cli inserts the X-Hammie-Disposition in the message
    cli prints the message to stdout

(I like to minimize traffic as well as the work done by the server;
minimizing traffic is always a good idea, while minimizing server work
means less load on a shared server -- if the clients run on separate
machines, the combined CPU power of the clients is much more than that
of the server.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From skip@pobox.com Wed Oct 16 22:02:38 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 16 Oct 2002 16:02:38 -0500
Subject: [Spambayes] Slice o' life
In-Reply-To:
References:
Message-ID: <15789.54254.340082.342114@montanaro.dyndns.org>

Tim> It turns out that python.org, Mailman, and SpamAssassin, put
Tim> sooooooooo many unique "Hey, I had my fingers in this!"
clues in the Tim> headers that virtually any message coming thru python.org has a Tim> relatively huge collection of killer-strong ham clues (just listing Tim> headers containing such clues): Why not tweak the code to call the guts of unheader.py? Something like unheader.py -p 'X-Mailman|List-|Errors-to|Sender' should get rid of most of the header fluff. Skip From agmsmith@rogers.com Thu Oct 17 19:58:32 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Thu, 17 Oct 2002 14:58:32 EDT (-0400) Subject: [Spambayes] Client/server model In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com> Message-ID: <10385763431-BeMail@CR593174-A> Guido van Rossum wrote: > What would make more sense from the POV of minimizing traffic and > minimizing work done in the server: > > cli parses the message > cli sends the list of tokens to svr I'd want the server to do tokenization for consistency reasons. Particularly if you are also spam filtering news articles and not just e-mail messages. Also, the server can have all that mail parsing code (discarding attachments, decoding BASE64 etc), making the client simpler. > svr scores the list of tokens > svr returns the text to be inserted in the X-Hammie-Disposition header I'm returning the spam ratio in my server (using BeOS inter-program communication, though I suppose I could use the package which extends the BMessage system to the Internet, but the spam database is really a per-user thing so that isn't useful). I let the client decide if it's over their own threshold limit or not (ok, that may be a bad design choice). I'm also returning the list of words and their individual scores, but that's mostly for debugging (and wastes a lot of space - 150 words at a time!). The client (a plug-in filter for the BeMail package) also does the sound effects (saying "Spam" or "Genuine" as each message comes in). 
> cli inserts the X-Hammie-Disposition in the message
> cli prints the message to stdout
>
> (I like to minimize traffic as well as the work done by the server;
> minimizing traffic is always a good idea, while minimizing server work
> means less load on a shared server -- if the clients run on separate
> machines, the combined CPU power of the clients is much more than that
> of the server.)

Actually, it turns out that my server approach really isn't needed for
speed reasons. It just takes a fraction of a second to load and parse
the spam database (a 0.5MB (stripped of unique strings after initial
training on 1500 messages / 21000 words) text file with words and
numbers). But still it's nice to have it separate from other programs
so that it is more modular.

- Alex

From tim.one@comcast.net Thu Oct 17 20:05:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 15:05:30 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: <3DAEC536.10803@hooft.net>
Message-ID:

[Rob W. W. Hooft]
> ...
> I understood the original idea of Tim as that he wanted to see the
> spamminess of clearcut spam and the hamminess of clearcut ham. I don't
> see the point of that,

Neither do I, but I'm not following your meaning so that shouldn't be
too surprising. For my own use with a middle ground, I want
near-certain spam shuffled into a Spam folder, near-certain Ham left in
my inbox, and the middle ground shuffled into an Unsure folder.

From guido@python.org Thu Oct 17 20:05:25 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 17 Oct 2002 15:05:25 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: Your message of "Thu, 17 Oct 2002 14:58:32 EDT."
<10385763431-BeMail@CR593174-A> References: <10385763431-BeMail@CR593174-A> Message-ID: <200210171905.g9HJ5PD22148@odiug.zope.com> > > What would make more sense from the POV of minimizing traffic and > > minimizing work done in the server: > > > > cli parses the message > > cli sends the list of tokens to svr > > I'd want the server to do tokenization for consistency reasons. > Particularly if you are also spam filtering news articles and not > just e-mail messages. I don't understand this. > Also, the server can have all that mail parsing code (discarding > attachments, decoding BASE64 etc), making the client simpler. But discarding attachments in the client would reduce the traffic to the server tremendously! Maybe your server has more available CPU power than your client though? > > svr scores the list of tokens > > svr returns the text to be inserted in the X-Hammie-Disposition header > > I'm returning the spam ratio in my server (using BeOS inter-program > communication, though I suppose I could use the package which extends > the BMessage system to the Internet, but the spam database is really > a per-user thing so that isn't useful). I let the client decide if > it's over their own threshold limit or not (ok, that may be a bad design > choice). I'm also returning the list of words and their individual > scores, but that's mostly for debugging (and wastes a lot of space - > 150 words at a time!). The client (a plug-in filter for the BeMail > package) also does the sound effects (saying "Spam" or "Genuine" as > each message comes in). Cool. :-) > > cli inserts the X-Hammie-Disposition in the message > > cli prints the message to stdout > > > > (I like to minimize traffic as well as the work done by the server; > > minimizing traffic is always a good idea, while minimizing server work > > means less load on a shared server -- if the clients run on separate > > machines, the combined CPU power of the clients is much more than that > > of the server.) 
> > Actually, it turns out that my server approach really isn't needed
> for speed reasons. It just takes a fraction of a second to load and
> parse the spam database (a 0.5MB (stripped of unique strings after
> initial training on 1500 messages / 21000 words) text file with
> words and numbers). But still it's nice to have it separate from
> other programs so that it is more modular.

Fractions of seconds add up. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From agmsmith@rogers.com Thu Oct 17 20:10:55 2002
From: agmsmith@rogers.com (Alexander G. M. Smith)
Date: Thu, 17 Oct 2002 15:10:55 EDT (-0400)
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171905.g9HJ5PD22148@odiug.zope.com>
Message-ID: <11128972051-BeMail@CR593174-A>

Guido van Rossum wrote:
> > I'd want the server to do tokenization for consistency reasons.
> > Particularly if you are also spam filtering news articles and not
> > just e-mail messages.
>
> I don't understand this.

So that everybody tokenizes the incoming messages in the same way,
particularly the same way as that used earlier during training.

Also, I'd have the server keep track of spam from other sources, such
as UseNet news. Is there anywhere else where spam messages show up that
might need to be included, or is it just mail and news?

- Alex

From guido@python.org Thu Oct 17 20:19:59 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 17 Oct 2002 15:19:59 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: Your message of "Thu, 17 Oct 2002 15:10:55 EDT." <11128972051-BeMail@CR593174-A>
References: <11128972051-BeMail@CR593174-A>
Message-ID: <200210171919.g9HJJxs22230@odiug.zope.com>

> > > I'd want the server to do tokenization for consistency reasons.
> > > Particularly if you are also spam filtering news articles and not
> > > just e-mail messages.
> >
> > I don't understand this.
> > So that everybody tokenizes the incoming messages in the same way,
> particularly the same way as that used earlier during training.

The hammie-client approach has a separate client program that's invoked
each time, and that takes care of the uniform parsing.

> Also, I'd have the server keep track of spam from other sources,
> such as UseNet news. Is there anywhere else where spam messages
> show up that might need to be included, or is it just mail and
> news?

Not that I know of.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net Thu Oct 17 20:25:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 15:25:38 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: <3DAED682.3090905@hooft.net>
Message-ID:

[Rob W. W. Hooft, divides S and H and n by various things before
computing chi2Q]
> Normal:
> -> best cost for all runs: $102.60
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.495 & 0.96
> -> fp 3; fn 14; unsure ham 40; unsure spam 253
> -> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%
> Dividing the log-products and n by 2:
> -> best cost for all runs: $104.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.49 & 0.92
> -> fp 3; fn 14; unsure ham 43; unsure spam 259
> -> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%
> Dividing the log-products and n by 4:
> -> best cost for all runs: $108.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.48 & 0.855
> -> fp 4; fn 13; unsure ham 46; unsure spam 230
> -> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
> -> largest ham & spam cutoffs 0.485 & 0.855
> -> fp 4; fn 14; unsure ham 42; unsure spam 229
> -> fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%
> As I expected, this significantly broadens the extremes at only very
> little cost.
But what's the point?  By your own cost measure, it didn't do you any good, and in fact it raised your FP rate by the time you got to 4.

> What this does statistically is downweighting all clues thereby taking
> care of a "standard" correlation between clues.  This may be
> functionally equivalent to raising the value of s.

I doubt the latter, but if it's true I'd much rather get there by raising s, which is symmetric and comprehensible.  Fudging H, S and n introduces strange biases, because the info you're feeding into chi2Q no longer follows a chi-squared distribution after fudging, and chi2Q may as well be some form of biased random-number generator then.

> This is the /4 code for reference:
>
> Index: classifier.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
> retrieving revision 1.38
> diff -u -r1.38 classifier.py
> --- classifier.py	14 Oct 2002 02:20:35 -0000	1.38
> +++ classifier.py	17 Oct 2002 15:24:55 -0000
> @@ -516,7 +516,10 @@
>          S = ln(S) + Sexp * LN2
>          H = ln(H) + Hexp * LN2
>
> -        n = len(clues)
> +        S = S/4.0
> +        H = H/4.0
> +
> +        n = len(clues)//4
>          if n:
>              S = 1.0 - chi2Q(-2.0 * S, 2*n)
>              H = 1.0 - chi2Q(-2.0 * H, 2*n)

Fiddle chi2.judge() to play with this.
Here's the straight H distribution (S is similar) on vectors of 52 random probs:

52 random probs H
10000 items; mean 0.50; sdev 0.29
-> min 0.000119708; median 0.500356; max 0.999988
* = 9 items
0.00 498 ********************************************************
0.05 494 *******************************************************
0.10 504 ********************************************************
0.15 546 *************************************************************
0.20 484 ******************************************************
0.25 470 *****************************************************
0.30 494 *******************************************************
0.35 491 *******************************************************
0.40 505 *********************************************************
0.45 513 *********************************************************
0.50 504 ********************************************************
0.55 474 *****************************************************
0.60 500 ********************************************************
0.65 502 ********************************************************
0.70 501 ********************************************************
0.75 542 *************************************************************
0.80 517 **********************************************************
0.85 443 **************************************************
0.90 514 **********************************************************
0.95 504 ********************************************************

Do the same but divide everything by 4 first (as you showed), and H is no longer uniformly distributed:

52 random probs H/4 & n//4
10000 items; mean 0.52; sdev 0.18
-> min 0.0144875; median 0.527973; max 0.973816
* = 17 items
0.00 4 *
0.05 47 ***
0.10 116 *******
0.15 238 **************
0.20 303 ******************
0.25 498 ******************************
0.30 631 **************************************
0.35 781 **********************************************
0.40 900 *****************************************************
0.45 933 *******************************************************
0.50 967 *********************************************************
0.55 1017 ************************************************************
0.60 893 *****************************************************
0.65 812 ************************************************
0.70 699 ******************************************
0.75 519 *******************************
0.80 339 ********************
0.85 208 *************
0.90 87 ******
0.95 8 *

The bias also shifts according to the number of extreme words in a msg modulo 4, getting more lopsided the larger n%4:

53 random probs H/4 & n//4
10000 items; mean 0.55; sdev 0.18
-> min 0.030539; median 0.554048; max 0.975847
* = 17 items
0.00 3 *
0.05 24 **
0.10 74 *****
0.15 133 ********
0.20 261 ****************
0.25 420 *************************
0.30 558 *********************************
0.35 706 ******************************************
0.40 822 *************************************************
0.45 936 ********************************************************
0.50 995 ***********************************************************
0.55 1007 ************************************************************
0.60 989 ***********************************************************
0.65 866 ***************************************************
0.70 804 ************************************************
0.75 642 **************************************
0.80 396 ************************
0.85 247 ***************
0.90 106 *******
0.95 11 *

54 random probs H/4 & n//4
10000 items; mean 0.57; sdev 0.17
-> min 0.0562266; median 0.579539; max 0.984772
* = 17 items
0.00 0
0.05 14 *
0.10 47 ***
0.15 97 ******
0.20 201 ************
0.25 327 ********************
0.30 478 *****************************
0.35 643 **************************************
0.40 744 ********************************************
0.45 868 ****************************************************
0.50 981 **********************************************************
0.55 1020 ************************************************************
0.60 1004 ************************************************************
0.65 968 *********************************************************
0.70 894 *****************************************************
0.75 750 *********************************************
0.80 532 ********************************
0.85 298 ******************
0.90 112 *******
0.95 22 **

55 random probs H/4 & n//4
10000 items; mean 0.60; sdev 0.17
-> min 0.0477139; median 0.61042; max 0.971135
* = 19 items
0.00 1 *
0.05 7 *
0.10 26 **
0.15 84 *****
0.20 153 *********
0.25 270 ***************
0.30 359 *******************
0.35 452 ************************
0.40 659 ***********************************
0.45 819 ********************************************
0.50 919 *************************************************
0.55 1022 ******************************************************
0.60 1108 ***********************************************************
0.65 1088 **********************************************************
0.70 959 ***************************************************
0.75 792 ******************************************
0.80 661 ***********************************
0.85 412 **********************
0.90 186 **********
0.95 23 **

So, sorry, but overall this strikes me as the kind of thing we worked like hell to get away from in Paul's scheme: strange and inconsistent biases that don't actually help, but at least cancel each other out when you get lucky.  Extremity merely for the sake of extremity was no virtue, and neither is its converse.
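Tim's uniformity claim is easy to check in a few lines.  This is a toy sketch, not spambayes itself: `chi2Q` below follows the same truncated-series form as spambayes' chi2.py, but `h_score` and the driver loop are invented for illustration.  For uniform random probs, -2*sum(ln p) is chi-squared with 2n degrees of freedom, so 1 - chi2Q(...) is itself uniform; dividing the statistic and n by 4, as in Rob's patch, squeezes the scores toward the middle:

```python
import math
import random

def chi2Q(x2, v):
    """Survival function of a chi-squared variable; v must be even.
    Same series as used in spambayes' chi2.py."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def h_score(probs, divisor=1):
    """1 - chi2Q(-2*ln(prod(probs)), 2n), optionally fudged by `divisor`
    the way Rob's /4 patch fudges S, H and n."""
    n = len(probs)
    H = sum(math.log(p) for p in probs)  # ln of the product
    H /= divisor
    n //= divisor
    return 1.0 - chi2Q(-2.0 * H, 2 * n)

random.seed(42)
straight, fudged = [], []
for _ in range(2000):
    ps = [random.random() for _ in range(52)]
    straight.append(h_score(ps))
    fudged.append(h_score(ps, divisor=4))

def sdev(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Uniform[0,1] has sdev ~0.289; the fudged scores are visibly squeezed
# toward the middle, matching the histograms above (sdev ~0.18 there).
print("straight sdev %.3f, fudged sdev %.3f" % (sdev(straight), sdev(fudged)))
```

The squeeze happens because the fudged statistic keeps the right mean for 2*(n//4) degrees of freedom but has only a quarter of the variance a true chi-squared variable would have, which is exactly the "biased random-number generator" effect Tim describes.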
From python-spambayes@discworld.dyndns.org Thu Oct 17 20:46:47 2002
From: python-spambayes@discworld.dyndns.org (Charles Cazabon)
Date: Thu, 17 Oct 2002 13:46:47 -0600
Subject: [Spambayes] Client/server model
In-Reply-To: <11128972051-BeMail@CR593174-A>; from agmsmith@rogers.com on Thu, Oct 17, 2002 at 03:10:55PM -0400
References: <200210171905.g9HJ5PD22148@odiug.zope.com> <11128972051-BeMail@CR593174-A>
Message-ID: <20021017134647.A3293@discworld.dyndns.org>

Alexander G. M. Smith wrote:
> Also, I'd have the server keep track of spam from other sources, such as
> UseNet news.  Is there anywhere else where spam messages show up that might
> need to be included, or is it just mail and news?

Apparently some enterprising spammers are now experimenting with delivering spam via the netbios messaging protocol (i.e. winpopup) so that the message comes up on the user's screen in a dialog box.  It doesn't get through firewalls, but they're more interested in the home user than corporate users anyway.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From mal@lemburg.com Thu Oct 17 21:02:12 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 17 Oct 2002 22:02:12 +0200
Subject: [Spambayes] Client/server model
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <3DAF1744.8080703@lemburg.com>

Guido van Rossum wrote:
> Neale's hammie client and server seem to me to be wasting some
> effort.
> Currently, what happens, is:
>
>     cli sends the entire message to svr
>
>     svr parses and scores the message
>     svr inserts the X-Hammie-Disposition header in the message
>     svr sends the message, thus modified, back
>
>     cli prints the returned, modified, message to stdout
>
> What would make more sense from the POV of minimizing traffic and
> minimizing work done in the server:
>
>     cli parses the message
>     cli sends the list of tokens to svr
>
>     svr scores the list of tokens
>     svr returns the text to be inserted in the X-Hammie-Disposition header
>
>     cli inserts the X-Hammie-Disposition in the message
>     cli prints the message to stdout
>
> (I like to minimize traffic as well as the work done by the server;
> minimizing traffic is always a good idea, while minimizing server work
> means less load on a shared server -- if the clients run on separate
> machines, the combined CPU power of the clients is much more than that
> of the server.)

This may be true if you have clients on different CPUs, but if you are on the same machine (client talking to daemon), then Neale's model is certainly the better one.  In fact, making the client as tiny as possible would save more CPU time.  I'm thinking of the situation where you have a mail server which uses procmail to do the filtering for many different users having their account on that machine.  Another scenario would be to build the C client directly into the MTA being used for the delivery.  The only situation where the fat client would be better is that of distributed mail servers, but that seems like a rather uncommon setup.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From seant@iname.com Thu Oct 17 21:02:57 2002
From: seant@iname.com (Sean True)
Date: Thu, 17 Oct 2002 16:02:57 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: 
Message-ID: 

> [Sean True]
> > I hate to try to speak for Joe User (like speaking for the "common man",
> > always a red flag), but I _am_ just a user of these scoring schemes.
> > I have several hundred messages (commercial email) tucked away in a
> > folder that score in the non-chi scheme in the range .4 to .6.  That
> > score appears to reflect my own real uncertainty about the value of
> > Motley Fool newsletters.  No snickering, please.  A system like chi-
> > looks like a very good choice for black and white, upstream discards of
> > offers to increase body part size.
>
> Try rescoring your msgs using chi-combining before simply guessing that it's
> inappropriate for you.  You don't have to retrain your database, you just
> have to change the spamprob() used for scoring.
>
> In my email so far, chi-combining is correctly and extremely certain that
> most of my commercial ham is in fact ham.  The worst it does is on the one
> HTML newsletter I have from Strong Investments so far, of which there are no
> examples in my training data.  It gets a score of about 0.6, very solidly in
> its "middle ground".

I changed to chi-scoring.  That was easy.  I rescored the /News filter, which is not used for any training whatsoever.  The spread of scores is now 0 to .99, as advertised.  The scores for the financial newsletters are now much higher, which does not meet my qualitative assessment.  If I'm not sure it's spam, I'd prefer a score that matched that.  All in all, for exposed scoring, I still prefer the old scores, but not enough to keep complaining about.
And since getting whacked by Tim for lack of intellectual rigor and laziness is familiar, rousing, but not all that much fun, I think I'll go back to doing systems engineering and just _use_ the results.

> A UI that makes most sense under a middle-ground scheme would shuffle "I'm
> pretty sure it's spam" into a Spam folder, and the ones it knows it's
> confused about into an Unsure folder.  The rest ("I'm pretty sure it's ham")
> would be left in your inbox.  We can't really do that with the default
> combining scheme because about half your email would end up in the Unsure
> folder -- the separation it makes between populations is too fuzzy.

Depends on whether one wants the machine making the decision, or wants help making the decision oneself.  The ability to label a message, and then sort using the label is really handy if you spend your time classifying mail.

> (yet), because there are *many* customers here, and the union of what they
> want can't be had with a single scheme.  But a fan of a particular scheme
> gets a lot more credibility after they've tried the alternatives and given
> them thought based on actual experience.

Whack, whack, whack.  Always a pleasure, Mr. Peters.  I'm off to rescore all the mail in my Outlook folders, which takes about an hour on an Athlon 2000 XP.

-- Sean

From rob@hooft.net Thu Oct 17 21:06:18 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 17 Oct 2002 22:06:18 +0200
Subject: [Spambayes] Proposing to remove 4 combining schemes
References: 
Message-ID: <3DAF183A.8070600@hooft.net>

Tim Peters wrote:
> But what's the point?  By your own cost measure, it didn't do you any good,
> and in fact it raised your FP rate by the time you got to 4.

There was some discussion about the judgments being too strict.  I was trying to find a statistically sound way to reduce correlations such that results would be less sure.
I explained that here:

>> What this does statistically is downweighting all clues thereby taking
>> care of a "standard" correlation between clues.

To which you said:

> Fudging H, S and n introduces
> strange biases, because the info you're feeding into chi2Q no longer follows
> a chi-squared distribution after fudging, and chi2Q may as well be some form
> of biased random-number generator then.

That is not exactly true.  What I am assuming is that if there is one clue in a message that says 0.8, there are probably more of those.  That is the correlation we're discussing.  A clue rarely comes alone.  The effect of that is that my joke messages with "From: xxx@yyy (by way of ppp@qqq)" get a very strong and repeated signal from the From: line, and your filtered mailman list is much too sure about hamminess.  This is solved by my hack: it practically divides the number of clues by 2 or 4.

> Fiddle chi2.judge() to play with this.  Here's the straight H distribution
> (S is similar) on vectors of 52 random probs:

OK:

Index: chi2.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/chi2.py,v
retrieving revision 1.7
diff -u -r1.7 chi2.py
--- chi2.py	16 Oct 2002 21:31:19 -0000	1.7
+++ chi2.py	17 Oct 2002 20:04:17 -0000
@@ -145,7 +145,7 @@
     for i in range(5000):
         ps = [random() for j in range(50)]
-        s1, h1, score1 = judge(ps + [bias] * warp)
+        s1, h1, score1 = judge((ps + [bias] * warp)*4)
         s.add(s1)
         h.add(h1)
         score.add(score1)

(i.e. adding correlated data points)  Results in:

Result for random vectors of 50 probs, + 0 forced to 0.99

H
5000 items; mean 0.47; sdev 0.38
-> min 1.26528e-11; median 0.444004; max 1
-> fivepctlo 0.000293787; fivepcthi 0.999102
* = 19 items
0.00 1125 ************************************************************
0.05 291 ****************
0.10 230 *************
0.15 182 **********
0.20 157 *********
0.25 146 ********
0.30 119 *******
0.35 135 ********
0.40 129 *******
0.45 121 *******
0.50 120 *******
0.55 131 *******
0.60 128 *******
0.65 152 ********
0.70 128 *******
0.75 167 *********
0.80 172 **********
0.85 208 ***********
0.90 239 *************
0.95 920 *************************************************

S
5000 items; mean 0.50; sdev 0.39
-> min 2.81657e-11; median 0.497487; max 1
-> fivepctlo 0.0005459; fivepcthi 0.999608
* = 18 items
0.00 1049 ***********************************************************
0.05 286 ****************
0.10 195 ***********
0.15 166 **********
0.20 138 ********
0.25 163 **********
0.30 129 ********
0.35 128 ********
0.40 123 *******
0.45 128 ********
0.50 123 *******
0.55 129 ********
0.60 114 *******
0.65 133 ********
0.70 149 *********
0.75 142 ********
0.80 201 ************
0.85 183 ***********
0.90 265 ***************
0.95 1056 ***********************************************************

(S-H+1)/2
5000 items; mean 0.51; sdev 0.34
-> min 3.71508e-09; median 0.515657; max 1
-> fivepctlo 0.00540499; fivepcthi 0.996936
* = 12 items
0.00 651 *******************************************************
0.05 240 ********************
0.10 214 ******************
0.15 180 ***************
0.20 184 ****************
0.25 173 ***************
0.30 163 **************
0.35 181 ****************
0.40 208 ******************
0.45 243 *********************
0.50 217 *******************
0.55 182 ****************
0.60 216 ******************
0.65 157 **************
0.70 185 ****************
0.75 190 ****************
0.80 191 ****************
0.85 225 *******************
0.90 282 ************************
0.95 718 ************************************************************

So: chi2 will be fairly sure even about random data if it is correlated.

Rob

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

From gward@python.net Thu Oct 17 21:52:08 2002
From: gward@python.net (Greg Ward)
Date: Thu, 17 Oct 2002 16:52:08 -0400
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com>
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <20021017205208.GB14491@cthulhu.gerg.ca>

On 17 October 2002, Guido van Rossum said:
> Neale's hammie client and server seem to me to be wasting some
> effort.  Currently, what happens, is:
>
>     cli sends the entire message to svr
>
>     svr parses and scores the message
>     svr inserts the X-Hammie-Disposition header in the message
>     svr sends the message, thus modified, back
>
>     cli prints the returned, modified, message to stdout

Arrggh.  That's exactly how SpamAssassin's spamc/spamd work, and it's a pain-in-the-ass for anyone who wants to access spamd in an unusual way.

> What would make more sense from the POV of minimizing traffic and
> minimizing work done in the server:
>
>     cli parses the message
>     cli sends the list of tokens to svr
>
>     svr scores the list of tokens
>     svr returns the text to be inserted in the X-Hammie-Disposition header
>
>     cli inserts the X-Hammie-Disposition in the message
>     cli prints the message to stdout

If there are multiple client implementations, then spreading work across clients also means duplicating code.  Yuck.  Based on my experience with SA, I think I'd prefer a model like this:

  cli sends message headers
  svr parses the headers
  cli sends message body OR individual attachments [ie.
the protocol needs some state so the client can say, "I'm sending you the headers now", or "I'm sending you the entire body now", or "I'm sending one attachment now"]
  svr parses the message body/attachments/whatever
  cli tells the server what it wants: eg. "give me the X-Hammie-Disposition header", or "give me just the score", or "give me the top-N scoring words and their probabilities"
  svr gives the client what it wants

Yes, I know this is more complex.  But it's how I wish SA's spamd protocol worked!

Greg
--
Greg Ward                                 http://www.gerg.ca/
I have the power to HALT PRODUCTION on all TEENAGE SEX COMEDIES!!

From rob@hooft.net Thu Oct 17 21:49:26 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 17 Oct 2002 22:49:26 +0200
Subject: [Spambayes] optimal max_discriminators for chi2
Message-ID: <3DAF2256.30509@hooft.net>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment
I did a series of runs:

=========================
[Classifier]
use_chi_squared_combining: True
robinson_minimum_prob_strength = 0.0
robinson_probability_s = 0.45
max_discriminators = XXXXXX

[TestDriver]
spam_cutoff: 0.70
nbuckets: 200
best_cutoff_fp_weight: 10
show_false_positives: True
show_false_negatives: True
show_best_discriminators: 50
show_spam_lo = 0.00
show_spam_hi = 0.80
show_ham_lo = 0.40
show_ham_hi = 1.00
show_charlimit: 5000
============

With XXXXXX between 15 and 300.  Attached are plots of the 95th percentile ham, the 5th percentile spam, and the total cost (vertical) against max_discriminators (horizontal).  Please note again that my ham is much tighter than my spam: vertical scales are from 0 to 0.16 and from 89 to 100, respectively (almost a factor of 100!).  The cost plot shows "no trend at all", but the variation is not large.  I'd almost conclude "anything goes", but based on the spam-5% value I'd like to stick with values over ~40.

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: ham95.png
Type: image/png
Size: 6748 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/107955fc/ham95.png
---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: spam5.png
Type: image/png
Size: 6330 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/107955fc/spam5.png
---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: cost.png
Type: image/png
Size: 8545 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021017/107955fc/cost.png
---------------------- multipart/mixed attachment--

From tim.one@comcast.net Thu Oct 17 22:03:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 17:03:30 -0400
Subject: [Spambayes] Using mxBeeBase as hammie DB
In-Reply-To: <3DAEE314.6040903@lemburg.com>
Message-ID: 

[Tim]
>> Pruning the database, and especially over time, is something that
>> needs work here.

[M.-A. Lemburg]
> Is there some way to do this automagically ?

No; that's part of what "needs work here" means.  In addition, some fields in the WordInfo records probably aren't needed, or at best are too big (like saving an 8-byte double for a timestamp).  It's also unknown how pruning will affect accuracy over time, esp. since training is done on a batch-of-words-per-msg basis, but unless the tokenstream for each msg is saved, expiring words from the database will yield a state that doesn't match any real-life combination of training msgs.  Feel free to solve all that in your spare time.
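Tim's point about expiry leaving an inconsistent state can be illustrated with a toy model.  None of this is the real spambayes WordInfo layout; the dict-of-(spam, ham)-count pairs and both function names are invented stand-ins to show the bookkeeping problem:

```python
# Toy word database: token -> (spamcount, hamcount), counting each
# token at most once per message, with "hapaxes" (words seen exactly
# once overall) expired afterwards.

def train(db, tokens, is_spam):
    # count each distinct token once per trained message
    for tok in set(tokens):
        spam, ham = db.get(tok, (0, 0))
        db[tok] = (spam + 1, ham) if is_spam else (spam, ham + 1)

def prune_hapaxes(db):
    # drop words whose total count is 1; return how many were removed
    doomed = [t for t, (s, h) in db.items() if s + h == 1]
    for t in doomed:
        del db[t]
    return len(doomed)

db = {}
train(db, "free viagra now".split(), is_spam=True)
train(db, "lunch at noon".split(), is_spam=False)
train(db, "free lunch".split(), is_spam=False)

removed = prune_hapaxes(db)
# "viagra", "now", "at" and "noon" are gone; "free" and "lunch" survive.
# The surviving counts no longer correspond to any set of complete
# training messages -- the mismatch Tim describes: without the saved
# token streams you can only forget words, not untrain messages.
```

The same problem bites any expiry policy (by count, by timestamp, or otherwise) as long as only per-word aggregates are stored.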
From tim.one@comcast.net Thu Oct 17 22:32:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 17 Oct 2002 17:32:46 -0400
Subject: [Spambayes] Proposing to remove 4 combining schemes
In-Reply-To: 
Message-ID: 

[Sean True]
> I changed to chi-scoring.  That was easy.
> I rescored the /News filter, which is not used for any training
> whatsoever.  The spread of scores is now 0 to .99, as advertised.

That's a bit odd: in all test reports to date, the median spam score under chi-combining was 1.00, and that matches what I've seen on my personal email too (a large majority of spam scores 1.00, to the precision of the Hammie display).

> The scores for the financial newsletters are now much higher,

Meaning 0.99, or ...?  The rule of thumb under chi-combining so far has been to say a thing is spam if its score exceeds 0.95, else to call it a "middle ground" msg (with "it's ham" under 0.05).

> which does not meet my qualitative assessment.

The scores will change over time, of course -- the system learns what it's taught.

> If I'm not sure it's spam, I'd prefer a score that matched that.

Under chi-combining, a score under .95 (as a rule of thumb so far) does mean "I'm not sure it's spam".  So quantifying this would be helpful.

> All in all, for exposed scoring, I still prefer the old scores,
> but not enough to keep complaining about.

You're allowed to -- as I said in the msg that started this thread, I too re-found a fondness for the fuzzy scores when using the same UI you're using.  I'm not sure how much that has to do with the reality of the scoring, and how much to do with the UI, but we have lots of test results saying that chi-combining does objectively better *if* you need to pick your cutoffs in advance.  If you're happy to live with fuzzy and shifting cutoffs (which is allowed, but I expect will be as much a minority position as my position that I'd rather have false positives than false negatives), the all-default scheme may work just as well.
> And since getting whacked by Tim for lack of intellectual rigor and
> laziness is familiar, rousing, but not all that much fun, I think
> I'll go back to doing systems engineering and just _use_ the results.

I think you're confusing me with the geometric mean of "the MREC group" <0.9 wink>, Sean.  I simply challenged you to use a scheme before strongly dissing it.

>> A UI that makes most sense under a middle-ground scheme would
>> shuffle "I'm pretty sure it's spam" into a Spam folder, and the
>> ones it knows it's confused about into an Unsure folder.  The
>> rest ("I'm pretty sure it's ham") would be left in your inbox.
>> We can't really do that with the default combining scheme because
>> about half your email would end up in the Unsure folder -- the
>> separation it makes between populations is too fuzzy.

> Depends on whether one wants the machine making the decision, or wants
> help making the decision oneself.

I only want help, but shuffling msgs off to Spam and Unsure folders for later review is exactly the help I want.  For example, I don't want to be bothered *at all* with probable spam until I put my brain in "spam mode" and go on a mass-delete spree dedicated to reviewing probable spam.  Until then, I don't want it in my inbox at all.

> The ability to label a message, and then sort using the label is really
> handy if you spend your time classifying mail.

With respect to labels specifically measuring spamness, I expect we're destined never to agree on this.  I found sorting Outlook displays by Hammie score (yes, I've tried rescoring under all schemes) to be intellectually interesting, but not the way I'd want to work in real life.  Spam vs non-spam is a boring decision I want to spend as little time on as possible; personal email vs work email is an example of an interesting decision.

> ...
> I'm off to rescore all the mail in my Outlook folders, which
> takes about an hour on an Athlon 2000 XP.
This is something else to look into: scoring thru the Outlook wrappers is *much* slower than scoring msg-per-plain-text-file (which is what I do during large tests, which routinely score over 100,000 times).  I score about 80 msgs per second the latter way.  When scoring about 500 Outlook Inbox msgs, I take a much-needed bathroom break.

From nas@python.ca Thu Oct 17 22:36:09 2002
From: nas@python.ca (Neil Schemenauer)
Date: Thu, 17 Oct 2002 14:36:09 -0700
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com>
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <20021017213608.GA4467@glacier.arctrix.com>

Guido van Rossum wrote:
> (I like to minimize traffic as well as the work done by the server;
> minimizing traffic is always a good idea, while minimizing server work
> means less load on a shared server -- if the clients run on separate
> machines, the combined CPU power of the clients is much more than that
> of the server.)

Why do we want to do it on the server at all?  I think having the DB and classifier on the client would work better.

  Neil

From rob@hooft.net Thu Oct 17 22:38:41 2002
From: rob@hooft.net (Rob Hooft)
Date: Thu, 17 Oct 2002 23:38:41 +0200
Subject: [Spambayes] Proposing to remove 4 combining schemes
References: 
Message-ID: <3DAF2DE1.5090404@hooft.net>

Tim Peters wrote:
> [Sean True]
>> If I'm not sure it's spam, I'd prefer a score that matched that.
>
> Under chi-combining, a score under .95 (as a rule of thumb so far) does mean
> "I'm not sure it's spam".  So quantifying this would be helpful.

My gut feeling says: under ideal combining, a score under .95 means "I'm less than 95% sure this is spam".

> I only want help, but shuffling msgs off to Spam and Unsure folders for
> later review is exactly the help I want.  For example, I don't want to be
> bothered *at all* with probable spam until I put my brain in "spam mode" and
> go on a mass-delete spree dedicated to reviewing probable spam.
> Until then, I don't want it in my inbox at all.

The first time I used SpamAssassin, I used it in label-only mode.  That gave some relief.  After using it for a month, I was confident enough to make a procmail rule to move spam into a spam folder without showing it to me.  I was amazed by the amount of rest that has created.  I did not realize that the spam was having such a psychological effect on me.  This is definitely what I'd want from spambayes.  I'd only read my "incoming ham".  Once a week I'd go into unsure mode, and do some selection work.  Once a month I can probably go into spam-curse mode, and do the mass deletion Tim talks about.  But Sean's "sort on score" idea is also very useful.  I think it'd speed up the manual scanning/deletion process.

Rob

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

From skip@pobox.com Thu Oct 17 22:43:27 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 17 Oct 2002 16:43:27 -0500
Subject: [Spambayes] Client/server model
In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com>
References: <200210171829.g9HITxI21925@odiug.zope.com>
Message-ID: <15791.12031.16824.172208@montanaro.dyndns.org>

    Guido> Neale's hammie client and server seem to me to be wasting some
    Guido> effort.  Currently, what happens, is:

    [current behavior]

    Guido> What would make more sense from the POV of minimizing traffic and
    Guido> minimizing work done in the server:

    [proposed behavior]

SpamAssassin has a spamc/spamd pair which works like hammiecli/hammiesvr.  The strongest reason I see to push all the processing into the server program is that hammiecli can degenerate into a little C program which will beat the pants off anything which starts up the Python interpreter.  Spamc has no Perl bits in it (though it does know about the headers spamd adds to the message).  I rather suspect that by simply changing the port to which spamc connects and simplifying the code executed after the message is returned, it could replace hammiecli.
Skip From rob@hooft.net Thu Oct 17 22:49:07 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 17 Oct 2002 23:49:07 +0200 Subject: [Spambayes] Proposing to remove 4 combining schemes References: Message-ID: <3DAF3053.6040103@hooft.net> Tim Peters wrote: > [Rob] >>Did you ever try tim combining with (S-H+1)/2? > > > No, but it would be an excellent idea to try it with the current default > combining! tim-combining is unique in that its S is especially sensitive to > *low*-spamprob words, and its H to high-spamprob words; when something > really is spam, tim-combining isn't relying so much on having a high S value > as on having a low H value, so that the ratio S/(S+H) approaches 1. > Gary-combining is much more like chi-combining in these respects, and > chi-combining is where the (S-H+1)/2 reformulation helped. tim combining: -> Ham scores for all runs: 16000 items; mean 13.62; sdev 9.66 -> min 0.109175; median 12.3561; max 76.0553 -> fivepctlo 1.35543; fivepcthi 31.4327 -> Spam scores for all runs: 5800 items; mean 84.42; sdev 11.70 -> min 21.351; median 85.6889; max 99.8161 -> fivepctlo 64.4615; fivepcthi 98.8117 -> best cost for all runs: $110.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.625 -> fp 5; fn 16; unsure ham 35; unsure spam 187 -> fp rate 0.0312%; fn rate 0.276%; unsure rate 1.02% default combining: -> Ham scores for all runs: 16000 items; mean 26.37; sdev 8.32 -> min 0.137212; median 27.2524; max 65.3836 -> fivepctlo 11.7696; fivepcthi 38.3897 -> Spam scores for all runs: 5800 items; mean 75.96; sdev 10.74 -> min 33.8547; median 74.3976; max 99.7559 -> fivepctlo 59.9773; fivepcthi 96.4292 -> best cost for all runs: $106.20 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.585 -> fp 5; fn 16; unsure ham 35; unsure spam 166 -> fp rate 0.0312%; fn rate 0.276%; unsure rate 0.922% default combining with P-Q instead of (P-Q)/(P+Q): -> Ham 
scores for all runs: 16000 items; mean 21.49; sdev 8.73 -> min 0.123198; median 21.7049; max 68.8251 -> fivepctlo 7.34536; fivepcthi 35.6937 -> Spam scores for all runs: 5800 items; mean 79.44; sdev 11.00 -> min 29.348; median 79.2283; max 99.786 -> fivepctlo 61.9311; fivepcthi 97.3078 -> best cost for all runs: $103.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.615 -> fp 3; fn 16; unsure ham 37; unsure spam 250 -> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.32% It is all so close together in the final "cost" result that it is very difficult to judge from the statistics. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From neale@woozle.org Thu Oct 17 23:22:34 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 15:22:34 -0700 Subject: [Spambayes] Client/server model In-Reply-To: <20021017205208.GB14491@cthulhu.gerg.ca> References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> Message-ID: So then, Greg Ward is all like: > If there are multiple client implementations, then spreading work across > clients also means duplicating code. Yuck. Based on my experience with > SA, I think I'd prefer a model like this: > > cli sends message headers > svr parses the headers > cli sends message body OR individual attachments > [ie. the protocol needs some state so the client can say, > "I'm sending you the headers now", or "I'm sending you the > entire body now", or "I'm sending one attachment now"] > svr parses the message body/attachments/whatever > cli tells the server what it wants: eg. 
"give me the > X-Hammie-Disposition header", or "give me just the score", or > "give me the top-N scoring words and their probabilities" > svr gives the client what it wants I'm not sure that the tokenizer would be too amenable to splitting the header from the body, although if someone can think of a way to do that, it certainly would rock my world, as it'd make this technique *way* more accessible to $FIRM's embedded product. But if you just want the score, you can do that. Easy squeezy:

    #! /usr/bin/env python
    import xmlrpclib
    import sys

    RPCBASE = "http://localhost:65000"

    msg = sys.stdin.read()
    x = xmlrpclib.ServerProxy(RPCBASE)
    m = xmlrpclib.Binary(msg)
    score = x.score(m)
    print "You get", score, "points."

You can even pass a second (true) argument to x.score to get back a list of the contributing words. I wrote hammiecli to show how easy it is to use hammiesrv. You don't have to do it my way though--feel free to write your own 6 lines of code :) Neale From neale@woozle.org Thu Oct 17 23:31:56 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 15:31:56 -0700 Subject: [Spambayes] Re: Client/server model In-Reply-To: <200210171829.g9HITxI21925@odiug.zope.com> References: <200210171829.g9HITxI21925@odiug.zope.com> Message-ID: So then, Guido van Rossum is all like: > Neale's hammie client and server seem to me to be wasting some > effort. Currently, what happens, is: > > [ server tokenizes ] I did it this way so you could write your own hammiecli as nothing more than an XML-RPC call. So, like, you could easily integrate hammie checking without having to know how to tokenize. And as I pointed out in another message, you can call the .score() method if you don't want the whole message back. I wrote hammiecli to run from my .procmailrc. > What would make more sense from the POV of minimizing traffic and > minimizing work done in the server: > > [ client tokenizes ] That makes sense too. It depends on how you're going to use the thing, I guess.
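The server side of that exchange is nearly as small. Here is a hypothetical sketch of a hammiesrv-style scorer, written against the modern stdlib module xmlrpc.server so it runs today; the toy score() heuristic is invented and merely stands in for the real classifier:

```python
# Minimal XML-RPC scoring server in the spirit of hammiesrv.
# Hypothetical sketch: the real hammiesrv's API and scoring differ.
from xmlrpc.server import SimpleXMLRPCServer


def score(msg_blob, include_clues=False):
    # Accept either an xmlrpclib.Binary (as hammiecli sends) or a plain string.
    text = msg_blob.data.decode("latin-1") if hasattr(msg_blob, "data") else msg_blob
    # Toy stand-in for the classifier: count SHOUTY tokens.
    shouty = [tok for tok in text.split() if tok.isupper() and len(tok) > 2]
    prob = min(1.0, 0.1 + 0.2 * len(shouty))
    # Like the client example above, a second true argument returns clues too.
    return (prob, shouty) if include_clues else prob


def serve(port=65000):
    # Register score() under the name the client calls, then block forever.
    server = SimpleXMLRPCServer(("localhost", port), logRequests=False)
    server.register_function(score, "score")
    server.serve_forever()
```

A procmail-driven client then needs nothing beyond the ServerProxy call shown earlier.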
So I'll make .score() and .filter() accept a tokenized list as well as a string. Then you can call me the Burger King, 'cause you can Have It Your Way. :^) Neale From popiel@wolfskeep.com Thu Oct 17 23:38:23 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Thu, 17 Oct 2002 15:38:23 -0700 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Message from Rob Hooft of "Thu, 17 Oct 2002 23:38:41 +0200." <3DAF2DE1.5090404@hooft.net> References: <3DAF2DE1.5090404@hooft.net> Message-ID: <20021017223823.785E5F4CD@cashew.wolfskeep.com> In message: <3DAF2DE1.5090404@hooft.net> Rob Hooft writes: >Tim Peters wrote: >> [Sean True] > >>>If I'm not sure it's spam, I'd prefer a score that matched that. >> >> >> Under chi-combining, a score under .95 (as a rule of thumb so far) does mean >> "I'm not sure it's spam". So quantifying this would be helpful. > >My gut feeling says: under ideal combining, a score under .95 means "I'm >less than 95% sure this is spam". Ah, here's the basic problem... the final score we're generating has very little to do with a percentage, or any human concept of assurance. Heck, the final number isn't even a percentage of how much the message looks like ham or spam, since we're combining _those_ two numbers in very non-percentage-like ways. On the other hand, end users are quite likely to put inappropriate interpretations like this on the numbers, if they see them... so in any final presentation of this system, I'd _STRONGLY_ discourage showing the numbers. Just the three categories 'spam', 'ham', and 'unknown' should be sufficient. People who are not statisticians tend to make a lot of silly interpretations of numbers, particularly when those numbers are percentages (or look like percentages). If I tell people "I'm 75% sure these dice are loaded", the vast majority of them will expect that they will roll particular values 75% of the time.
(Translation to spambayes: for every message in some set of messages, a classifier says it's 75% sure that the message is spam... and people think that about 3/4 of those messages will be spam. As a simple disproof, consider if all the messages are identical.) People just don't grok that surety has very little to do with distribution of results. They also tend to go for all sorts of logical fallacies like a statement implying its converse, excluded middles, etc. The score we've got is just a number in the range 0 to 1 which has interesting discriminatory properties. It's not linear with any concept of surety, and it's not linear with similarity to spam or ham, either. People not immersed in how it's generated and/or buried in test results over decent sized corpora are sure (there's that troubling word again) to misinterpret it. >But Sean's "sort on score" idea is also very useful. I think it'd speed >up the manual scanning/deletion process. Having looked at the results from the show_unsure config option, I tend to disagree... position in the list doesn't seem to have any correlation with spam vs. ham. - Alex From nas@python.ca Thu Oct 17 23:54:49 2002 From: nas@python.ca (Neil Schemenauer) Date: Thu, 17 Oct 2002 15:54:49 -0700 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF2DE1.5090404@hooft.net> References: <3DAF2DE1.5090404@hooft.net> Message-ID: <20021017225449.GA4778@glacier.arctrix.com> Rob Hooft wrote: > The first time I used SpamAssassin, I used it in label-only mode. That > gave some relief. After using it for a month, I was confident enough to > make a procmail rule to move spam into a spam folder without showing it > to me. I was amazed by the amount of rest that has created. I did not > realize that the spam was having such a psychological effect on me. That matches my experience with setting up a spam filter. After installation, I found it much easier to deal with messages in my inbox (mail not from lists). 
The psychological effect was larger than I had expected as well. I look at the spam mailbox last and only if I have the time and energy. Neil From agmsmith@rogers.com Fri Oct 18 00:04:05 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Thu, 17 Oct 2002 19:04:05 EDT (-0400) Subject: [Spambayes] Sort on Score Usefulness for Manual Updates In-Reply-To: <3DAF2DE1.5090404@hooft.net> Message-ID: <25119245050-BeMail@CR593174-A> Rob Hooft wrote: > But Sean's "sort on score" idea is also very useful. I think it'd speed > up the manual scanning/deletion process. It does. The BeOS version adds an attribute with the spam score to each e-mail (each is a separate file in BeOS). It's then easy to show the attribute as an extra column in a normal directory window and then click on the heading to sort it. Then I can quickly junk the spam and also quickly spot the marginal ones (ham or spam that's close to the threshold). I then manually add those ones as examples to the database (right click on them, open-with the spam classifier program, it asks if they are spam or genuine, and that's it). Then I go back to sort by thread and read the mail. Relatively quick, and effective at keeping the database up to date. - Alex From guido@python.org Fri Oct 18 00:06:34 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 19:06:34 -0400 Subject: [Spambayes] Client/server model In-Reply-To: Your message of "Thu, 17 Oct 2002 15:22:34 PDT." References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> Message-ID: <200210172306.g9HN6Z312594@pcp02138704pcs.reston01.va.comcast.net> > I'm not sure that the tokenizer would be too amenable to splitting the > header from the body, although if someone can think of a way to do that, > it certainly would rock my world, as it'd make this technique *way* more > accessible to $FIRM's embedded product. The email package makes this a breeze AFAIK. 
--Guido van Rossum (home page: http://www.python.org/~guido/) From agmsmith@rogers.com Fri Oct 18 00:09:02 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Thu, 17 Oct 2002 19:09:02 EDT (-0400) Subject: [Spambayes] Client/server model In-Reply-To: Message-ID: <25415735222-BeMail@CR593174-A> Neale Pickett wrote: > I'm not sure that the tokenizer would be too amenable to splitting the > header from the body, although if someone can think of a way to do that, > it certainly would rock my world, as it'd make this technique *way* more > accessible to $FIRM's embedded product. That's a feature I've been asked for. Just classify by the header alone. The idea being that it would only download the header from the mail server, and immediately delete the message on the server if it looked like spam. I'm a bit nervous about implementing it in case it is a false positive and thus irretrievably deletes the message. - Alex From neale@woozle.org Fri Oct 18 00:52:55 2002 From: neale@woozle.org (Neale Pickett) Date: 17 Oct 2002 16:52:55 -0700 Subject: [Spambayes] Client/server model In-Reply-To: <200210172306.g9HN6Z312594@pcp02138704pcs.reston01.va.comcast.net> References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> <200210172306.g9HN6Z312594@pcp02138704pcs.reston01.va.comcast.net> Message-ID: So then, Guido van Rossum is all like: > > I'm not sure that the tokenizer would be too amenable to splitting > > the header from the body, although if someone can think of a way to > > do that, it certainly would rock my world, as it'd make this > > technique *way* more accessible to $FIRM's embedded product. > > The email package makes this a breeze AFAIK. Yeah but I don't think anybody's done any tests to see if classifying on headers alone still gets good results. At least, I assume that's the reasoning behind the "send me the headers, and I'll tell you if I need the body" approach... 
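Guido's point about the email package can be made concrete. A sketch of splitting a raw message into headers and body parts, using the current email API (the 2002-era interface was close to this); the real tokenizer would of course do far more with each piece:

```python
# Split a raw RFC 2822 message into headers and body parts with the
# stdlib email package, per Guido's suggestion.  Sketch only.
import email


def split_message(raw_text):
    msg = email.message_from_string(raw_text)
    headers = list(msg.items())          # [(name, value), ...] pairs
    if msg.is_multipart():
        # Collect the payload of every leaf part.
        bodies = [part.get_payload() for part in msg.walk()
                  if not part.is_multipart()]
    else:
        bodies = [msg.get_payload()]
    return headers, bodies
```

A header-only classifier of the kind Alexander describes would simply ignore the second half of the return value.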
From tim.one@comcast.net Fri Oct 18 01:21:06 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 20:21:06 -0400 Subject: [Spambayes] Ham or spam? Message-ID: Running the classifier over the day's email load so far turned up one with a low (chi-combining) score of 0.01 (I'm skipping most of the header lines, so it's not the python.org clues driving it so low): """ Return-path: Path: news.baymountain.net!uunet!ash.uu.net!easynews.net!newsfeed3.easynews.net!mango.news.easynet.net!easynet.net!proxad.net!feeder2-1.proxad.net!news2-1.free.fr!not-for-mail Received: from bright08.icomcast.net (bright08-qfe0.icomcast.net [172.20.4.65]) by msgstore01.icomcast.net (iPlanet Messaging Server 5.1 HotFix 0.8 (built May 13 2002)) with ESMTP id <0H45002HAFBN1S@msgstore01.icomcast.net> for tim.one@ims-ms-daemon (ORCPT tim.one@comcast.net); Thu, 17 Oct 2002 19:16:35 -0400 (EDT) Received: from mtain02 (bright-LB.icomcast.net [172.20.3.155]) by bright08.icomcast.net (8.11.6/8.11.6) with ESMTP id g9HNGlG22324 for <@msgstore01.icomcast.net:tim.one@comcast.net>; Thu, 17 Oct 2002 19:16:47 -0400 (EDT) Received: from mail.python.org (mail.python.org [12.155.117.29]) by mtain02.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.4 (built Aug 5 2002)) with ESMTP id <0H45004FMFBH05@mtain02.icomcast.net> for tim.one@comcast.net (ORCPT tim.one@comcast.net); Thu, 17 Oct 2002 19:16:29 -0400 (EDT) Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) id 182JsN-00070C-00; Thu, 17 Oct 2002 19:16:23 -0400 X-Trace: 1034896234 news2-1.free.fr 1396 62.212.104.101 Date: Fri, 18 Oct 2002 01:10:19 +0200 From: Meles MELES Subject: Barre de progression Sender: python-list-admin@python.org To: python-list@python.org Errors-to: python-list-admin@python.org Message-id: <5jfnoa.94c.ln@farfadet.home.org> Organization: Guest of ProXad - France X-Complaints-to: abuse@proxad.net MIME-version: 1.0 Content-type:
text/plain; charset=iso-8859-15 Content-transfer-encoding: 8BIT NNTP-posting-date: 18 Oct 2002 01:10:34 MEST Precedence: bulk X-BeenThere: python-list@python.org User-Agent: KNode/0.7.1 Newsgroups: comp.lang.python Lines: 18 NNTP-posting-host: 62.212.104.101 X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: General discussion list for the Python programming language Xref: news.baymountain.net comp.lang.python:186832 Bonsoir à tous, je suis à la recherche d'exemple d'implémentation d'une barre de progression en mode console (ou, à défaut en mode graphique) un peu du style de celle de urpmi lors de l'installation d'un paquet avec la mandrake. Si en plus, à la fin de celle ci, le pourcentage de travail effectué pouvait s'afficher, ce serai le bonheur. L'idéal pour moi serait de voir un code tout fait pour m'en inspirer, au pire un peu de docs derait l'affaire. Cordialement """ I read French well enough to know she's asking for directions to the nearest alligator breeding museum, but what I don't know is whether it's ham or spam. Whaddyathink? It's the only judgment made today that isn't obviously correct (nor incorrect, for that matter) to my English eyes. From guido@python.org Fri Oct 18 01:43:33 2002 From: guido@python.org (Guido van Rossum) Date: Thu, 17 Oct 2002 20:43:33 -0400 Subject: [Spambayes] Ham or spam? In-Reply-To: Your message of "Thu, 17 Oct 2002 20:21:06 EDT." References: Message-ID: <200210180043.g9I0hYS12884@pcp02138704pcs.reston01.va.comcast.net> > Bonsoir à tous, > je suis à la recherche d'exemple d'implémentation d'une barre de progression > en mode console (ou, à défaut en mode graphique) un peu du style de celle de > urpmi lors de l'installation d'un paquet avec la mandrake. Si en plus, à la > fin de celle ci, le pourcentage de travail effectué pouvait s'afficher, ce serai le bonheur.
> > L'idéal pour moi serait de voir un code tout fait pour m'en inspirer, au > pire un peu de docs derait l'affaire. > > Cordialement > """ > > I read French well enough to know she's asking for directions to the > nearest alligator breeding museum, but what I don't know is whether > it's ham or spam. Whaddyathink? It's the only judgment made today > that isn't obviously correct (nor incorrect, for that matter) to my > English eyes. Definitely ham; she's asking how to implement a progress bar in the style of Mandrake's urpmi tool. Bonus points for showing the percentage of work done. Now, is a score of .01 typically ham or spam? --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Fri Oct 18 01:43:52 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 20:43:52 -0400 Subject: [Spambayes] Client/server model In-Reply-To: <25415735222-BeMail@CR593174-A> Message-ID: [Alexander G. M. Smith] > That's a feature I've been asked for. Just classify by the header > alone. The idea being that it would only download the header from > the mail server, and immediately delete the message on the server > if it looked like spam. I'm a bit nervous about implementing it > in case it is a false positive and thus irretrievably deletes the > message. I'd be very nervous about that. You may want to ask Eric Raymond if he got anywhere with this -- at one time he intended to set up a "header score server" in connection with, or as an offshoot of, his bogofilter project. From tim.one@comcast.net Fri Oct 18 01:59:32 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 20:59:32 -0400 Subject: [Spambayes] Client/server model In-Reply-To: Message-ID: [Neale Pickett] > Yeah but I don't think anybody's done any tests to see if classifying on > headers alone still gets good results. A while back I reported on an experiment that looked only at Subject lines: no other headers, and nothing in the body.
It did very heavy tokenization of subject lines (word unigrams, and word bigrams, and folding case, and preserving case, and splitting on whitespace, and sucking out alphanumeric runs, and tokenizing runs of pure punctuation). Using the default combining, the bottom line was -> best cutoff for all runs: 0.575 -> with weighted total 10*65 fp + 486 fn = 1136 -> fp rate 0.325% fn rate 3.47% That's much worse than we do by taking the body into account too, but in absolute terms it's not too shabby! Staring at the results caused me to add the least likely part of that gimmick to our regular tokenizer: generating tokens for runs of pure punctuation in Subject lines. It's obvious in retrospect: spam often has over-the-top PUNCTUATION!!! $$$$$$, and the one that delighted me the most was long runs of blanks. Those come from Subject lines that stuff a short random string at the end of the line to fool dumb filters, separated from the ***SCREAMING PART*** by a long run of blanks. I added one to the Subject line here for illustration . From tim.one@comcast.net Fri Oct 18 02:15:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 17 Oct 2002 21:15:58 -0400 Subject: [Spambayes] Ham or spam? In-Reply-To: <200210180043.g9I0hYS12884@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Tim, displays his profound knowledge of French] [Guido] > Definitely ham; she's asking on how to implement a progress bar in the > style of Mandrake's urpmi tool. Bonus points for showing the > percentage of work done. Thanks! > Now, is a score of .01 typically ham or spam? Under chi-combining (which was used here), it's in the "I'm sure it's ham" range. The median score for ham under chi-combining is too small to express in 2 digits, though, so 0.01 is a fairly large ham score, indicating slight uncertainty. 
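The heavy subject-line tokenization Tim described a few messages up (unigrams, bigrams, both cases, whitespace splitting, alphanumeric runs, punctuation runs) might look roughly like this; a guess at the shape from his description, not the actual spambayes tokenizer:

```python
# Toy multi-view subject tokenizer (sketch only; spambayes differs).
import re


def tokenize_subject(subject):
    tokens = []
    words = subject.split()                     # split on whitespace
    for w in words:
        tokens.append("subj:" + w)              # case-preserving unigram
        if w.lower() != w:
            tokens.append("subj:" + w.lower())  # case-folded unigram
    for a, b in zip(words, words[1:]):          # word bigrams
        tokens.append("subj:%s %s" % (a, b))
    for run in re.findall(r"[A-Za-z0-9]+", subject):
        tokens.append("subj:alnum:" + run)      # alphanumeric runs
    for run in re.findall(r"[^\sA-Za-z0-9]+", subject):
        tokens.append("subj:punc:" + run)       # runs of pure punctuation
    return tokens
```

A subject like "FREE!!! money" would yield, among others, a punctuation-run token for the "!!!", which is exactly the kind of clue Tim found useful enough to fold into the regular tokenizer.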
Since starting this on my own email, there is (exactly) one spam (of 843 so far) that's scored under 0.05, which is the high end of chi's "I'm sure it's ham" range: """ python, A friend of yours, Michael (michael_suswanto@yahoo.com) thought you might like to check out this web page. http://www.newmarketingsite.com/2848/ -- The coolest site in town """ SpamAssassin didn't stop this either, but did find strong spam clues in the headers that we're ignorant about: X-Spam-Status: No, hits=4.2 required=5.0 tests=FROM_NAME_NO_SPACES,FROM_BIGISP,NO_REAL_NAME,FORGED_YAHOO_RCVD X-Spam-Level: **** On the other side, no ham so far has scored in the "I'm sure it's spam" range. My highest-scoring ham is at 0.76 (in chi's "middle ground" range), and is a short "happy new year" msg left over from January, from a friend I exchange email with once every 3 years. From tim.one@comcast.net Fri Oct 18 06:54:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 01:54:20 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Message-ID: I removed the 4 schemes in question. The log msg is attached, as this affected lots of code (mostly in an "it's gone" sense). If anyone has a real use for use_tim_combining, speak up, else I expect to drop that too (it was really another attempt to get a better middle ground, but chi-combining beats it for that). Modified Files: Options.py README.txt TestDriver.py classifier.py Removed Files: clgen.py clpik.py rmspik.py Log Message: Removed 4 combining schemes: use_central_limit use_central_limit2 use_central_limit3 use_z_combining The central limit schemes aimed at getting a useful middle ground, but chi-combining has proved to work better for that. The chi scheme doesn't require the troublesome "third training pass" either.
z-combining was more like chi-combining, and worked well, but not as well as chi-combining; z-combining proved vulnerable to "cancellation disease", to which chi-combining seems all but immune. Removed supporting option zscore_ratio_cutoff. Removed various data attributes of class Bayes, unique to the central limit schemes. __getstate__ and __setstate__ had never been updated to save or restore them, so old pickles will still work fine. Removed method Bayes.compute_population_stats(), which constituted "the third training pass" unique to the central limit schemes. There's scant chance this will ever be needed again, since it was never clear how to make the 3-pass schemes practical over time. Gave the still-default combining scheme's method the name gary_spamprob, and made spamprob an alias for that by default. This allows naming each combining scheme explicitly in case you want to test using more than one (the others are named tim_spamprob and chi2_spamprob). In gary_spamprob, simplified the scaling of (P-Q)/(P+Q) into 0 .. 1, replacing the whole shebang with P/(P+Q). Same result, but a little faster. Removed files clgen.py, clpik.py, and rmspik.py. These were data generation and analysis tools unique to the central limit schemes. From tim.one@comcast.net Fri Oct 18 07:59:41 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 02:59:41 -0400 Subject: [Spambayes] 5% points in statistics In-Reply-To: <3DAEB8A7.6010807@hooft.net> Message-ID: Inspired by Rob's patch, there's a new option: """ [TestDriver] # Histogram analysis also displays percentiles. For each percentile p # in the list, the score S such that p% of all scores are <= S is given. # Note that percentile 50 is the median, and is displayed (along with the # min score and max score) independent of this option.
percentiles: 5 25 75 95 """ Example output from the starts of histogram displays: -> Ham scores for all runs: 100 items; mean 6.23; sdev 16.47 -> min 2.51688e-008; median 0.19102; max 85.9665 -> percentiles: 5% 0.000538997; 25% 0.0281789; 75% 2.81561; 95% 45.2147 -> Spam scores for all runs: 100 items; mean 99.97; sdev 0.26 -> min 97.3715; median 100; max 100 -> percentiles: 5% 99.9512; 25% 100; 75% 100; 95% 100 From that alone you can deduce that this tiny 10-fold cv run using chi-combining nailed all the spam (min spam score was over 95), nailed at least 75% of the ham (75% of all ham scores were under 2.82 < 5), and that no ham scored in the spam zone (max ham score was < 86). BTW, it's a curious thing that *all* schemes have been better at nailing spam than ham with very little training data, going all the way down to training on just one of each. I still don't know where the cutoff point is in my data (i.e., by the time I run my fat test, the roles are reversed: it's better at nailing ham than spam). From msergeant@startechgroup.co.uk Fri Oct 18 10:45:19 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: 18 Oct 2002 10:45:19 +0100 Subject: [Spambayes] Client/server model In-Reply-To: <11128972051-BeMail@CR593174-A> References: <11128972051-BeMail@CR593174-A> Message-ID: <1034934319.23385.26.camel@felony.int.star.co.uk> ---------------------- multipart/signed attachment On Thu, 2002-10-17 at 20:10, Alexander G. M. Smith wrote: > Guido van Rossum wrote: > > > I'd want the server to do tokenization for consistency reasons. > > > Particularly if you are also spam filtering news articles and not > > > just e-mail messages. > > > > I don't understand this. > > So that everybody tokenizes the incoming messages in the same way, > particularly the same way as that used earlier during training. What does it matter?
The worst thing that happens is that the client gets the wrong answer back, in which case it's a good excuse to get the client upgraded ;-) > Also, I'd have the server keep track of spam from other sources, > such as UseNet news. Is there anywhere else where spam messages > show up that might need to be included, or is it just mail and > news? I'm waiting for spammers to start spamming web based forums. It's probably harder than usenet since most have local moderation systems in place, but I suspect it's only a matter of time. Matt. ---------------------- multipart/signed attachment A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 232 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021018/c81ca11c/attachment.bin ---------------------- multipart/signed attachment-- From msergeant@startechgroup.co.uk Fri Oct 18 10:52:21 2002 From: msergeant@startechgroup.co.uk (Matt Sergeant) Date: 18 Oct 2002 10:52:21 +0100 Subject: [Spambayes] Client/server model In-Reply-To: <20021017205208.GB14491@cthulhu.gerg.ca> References: <200210171829.g9HITxI21925@odiug.zope.com> <20021017205208.GB14491@cthulhu.gerg.ca> Message-ID: <1034934741.23383.34.camel@felony.int.star.co.uk> ---------------------- multipart/signed attachment On Thu, 2002-10-17 at 21:52, Greg Ward wrote: > On 17 October 2002, Guido van Rossum said: > > Neale's hammie client and server seem to me to be wasting some > > effort. Currently, what happens, is: > > > > cli sends the entire message to svr > > > > svr parses and scores the message > > svr inserts the X-Hammie-Disposition header in the message > > svr sends the message, thus modified, back > > > > cli prints the returned, modified, message to stdout > > Arrggh. That's exactly how SpamAssassin's spamc/spamd work, and it's > a pain-in-the-ass for anyone who wants to access spamd in an unusual > way.
FWIW this was fixed for Ask Bjorn Hansen, who wanted to be able to use spamd "in an unusual way". Here's how his spamassassin plugin does it for qpsmtpd: print "REPORT_IFSPAM SPAMC/1.0" to spamd's socket. print the message to spamd's socket. shutdown the sending end of the socket. get *just* the spam headers back from the socket (that's all that is sent). All done. See http://cvs.perl.org/viewcvs/qpsmtpd/plugins/spamassassin?rev=1.2&content-type=text/vnd.viewcvs-markup (though I'm not even sure I like that mechanism, but I digress). ---------------------- multipart/signed attachment A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 232 bytes Desc: This is a digitally signed message part Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021018/a586400f/attachment.bin ---------------------- multipart/signed attachment-- From skip@pobox.com Fri Oct 18 14:16:59 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 18 Oct 2002 08:16:59 -0500 Subject: [Spambayes] timcv.py? Message-ID: <15792.2507.448518.596654@montanaro.dyndns.org> I've been busy with other stuff for a couple weeks and have only vaguely noticed all the changes happening. I've been using a somewhat simplified version of Neale's runtest.sh script. Is timcv.py still the core program used for testing? Skip From tim.one@comcast.net Fri Oct 18 17:07:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 12:07:26 -0400 Subject: [Spambayes] timcv.py? In-Reply-To: <15792.2507.448518.596654@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I've been busy with other stuff for a couple weeks and have only > vaguely noticed all the changes happening. Then mostly you missed a whole bunch of experimental code in classifier.py getting tested and thrown away again. The code is more like it was now than it was when you tuned out. > I've been using a somewhat simplified version of Neale's runtest.sh > script.
Is timcv.py still the core program used for testing? That or mboxtest.py (depending on your data setup) remain our two cross-validation drivers, which I recommend. timtest.py remains a grid driver, and there's no reason to use it unless you want to set up brutal tests (train on a little data and predict against a lot, N**2-N times). From tim.one@comcast.net Fri Oct 18 19:43:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 18 Oct 2002 14:43:33 -0400 Subject: [Spambayes] Combining combining schemes Message-ID: I mentioned earlier that chi-combining and gary-combining have quite different ideas about "how certain" they are on my extreme FP and FN. So I checked in some new options to allow us to play with that: """ [Classifier] # Use a weighted average of chi-combining and gary-combining. use_mixed_combining: False mixed_combining_chi_weight: 0.9 """ I ran my fat test just once (10-fold CV with 20,000 ham and 14,000 spam), making parameters up off the top of my head: """ [Classifier] use_mixed_combining: True mixed_combining_chi_weight: 0.9 [TestDriver] ham_cutoff: 0.10 spam_cutoff: 0.90 nbuckets: 200 """ The bottom line is that this particular combination of settings removed all(!) false negatives, left me with my 2 very hard FP, moved all other hard ham very solidly into the middle ground, and had an unsure rate under 1%: -> all runs false positives: 2 -> all runs false negatives: 0 -> all runs unsure: 226 -> all runs false positive %: 0.01 -> all runs false negative %: 0.0 -> all runs unsure %: 0.664705882353 -> all runs cost: $65.20 The histogram analysis found that it was possible to reduce the total middle ground to 20 (out of 34,000!) 
messages at the cost of biting 3 FN: -> best cost for all runs: $27.00 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 3 cutoff pairs -> smallest ham & spam cutoffs 0.5 & 0.75 -> fp 2; fn 3; unsure ham 12; unsure spam 8 -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588% -> largest ham & spam cutoffs 0.5 & 0.76 -> fp 2; fn 3; unsure ham 12; unsure spam 8 -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588% I can't make more time for this right now, but I think there's clearly potential worth pursuing. -> Ham scores for all runs: 20000 items; mean 2.81; sdev 2.92 -> min 0.121417; median 2.54101; max 96.5433 -> percentiles: 5% 1.68334; 25% 2.20207; 75% 2.89507; 95% 3.54761 * = 111 items 0.0 6 * 0.5 41 * 1.0 420 **** 1.5 2355 ********************** 2.0 6526 *********************************************************** 2.5 6743 ************************************************************* 3.0 2789 ************************** 3.5 568 ****** 4.0 120 ** 4.5 71 * 5.0 44 * 5.5 21 * 6.0 23 * 6.5 14 * 7.0 17 * 7.5 8 * 8.0 9 * 8.5 12 * 9.0 6 * 9.5 4 * 10.0 7 * 10.5 10 * 11.0 7 * 11.5 10 * 12.0 7 * 12.5 9 * 13.0 5 * 13.5 3 * 14.0 3 * 14.5 4 * 15.0 3 * 15.5 7 * 16.0 3 * 16.5 7 * 17.0 1 * 17.5 5 * 18.0 4 * 18.5 0 19.0 0 19.5 4 * 20.0 3 * 20.5 3 * 21.0 2 * 21.5 3 * 22.0 1 * 22.5 2 * 23.0 1 * 23.5 3 * 24.0 2 * 24.5 3 * 25.0 0 25.5 1 * 26.0 5 * 26.5 0 27.0 2 * 27.5 3 * 28.0 3 * 28.5 1 * 29.0 3 * 29.5 1 * 30.0 1 * 30.5 0 31.0 1 * 31.5 2 * 32.0 0 32.5 2 * 33.0 3 * 33.5 1 * 34.0 2 * 34.5 0 35.0 0 35.5 1 * 36.0 2 * 36.5 2 * 37.0 1 * 37.5 2 * 38.0 0 38.5 2 * 39.0 1 * 39.5 1 * 40.0 2 * 40.5 1 * 41.0 1 * 41.5 0 42.0 2 * 42.5 2 * 43.0 1 * 43.5 0 44.0 2 * 44.5 0 45.0 0 45.5 2 * 46.0 0 46.5 1 * 47.0 2 * 47.5 0 48.0 0 48.5 2 * 49.0 3 * 49.5 3 * 50.0 1 * A resume from a "an experienced engineer/mathematician/modeler who has built models and done computational mathematics in Python". 
50.5 0 51.0 3 * TOOLS Europe '99 conference announcement A word-free post listing 3 URLs; we've argued before about whether it's ham or spam; I think it's ham Someone posting a reply they got from MSN Hotmail Customer support in response to a complaint about fetish porn spam on c.l.py 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 1 * "If you are interested in saving money ..." 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 1 * questions about the job and real estate markets in France 60.0 1 * HTML "Please unsubscribe me" 60.5 0 61.0 0 61.5 0 62.0 1 * asking for advice on how to break into others' computers 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 1 * long emotional msg the day after the 911 terrorist attack 69.5 0 70.0 0 70.5 0 71.0 0 71.5 1 * Job announcement from Industrial Light & Magic. Hurt in part because split-on-whitespace left "Python-savvy" as one word. 72.0 0 72.5 0 73.0 1 * asking for help with a webmaster-ish program; it's in the middle ground of both schemes: prob('*gary_score*') = 0.532758 prob('*chi_score*') = 0.751966 73.5 0 74.0 0 74.5 1 * inappropriate two-word "confirm 438765" followed by "Get Your Private, Free E-mail from ..."
75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 1 * lady with the long, obnoxious employer-generated sig; gary-combining looks on this one much more kindly (but still outside a reasonable middle ground for it); chi is only slightly unsure prob('*gary_score*') = 0.597568 prob('*chi_score*') = 0.986116 prob('*H*') = 0.0277634 prob('*S*') = 0.999996 prob('*Q*') = 0.542133 prob('*P*') = 0.805009 95.0 0 95.5 0 96.0 0 96.5 1 * Nigerian scam quote gary-combining again has a much milder judgment, but chi is off the charts prob = 0.965433332477 prob('*gary_score*') = 0.654334 prob('*chi_score*') = 1 prob('*H*') = 7.07788e-008 prob('*S*') = 1 prob('*Q*') = 0.466239 prob('*P*') = 0.882573 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all runs: 14000 items; mean 98.32; sdev 1.55 -> min 31.4614; median 98.3667; max 99.9601 -> percentiles: 5% 97.1931; 25% 97.9872; 75% 98.7541; 95% 99.657 Note that > 95% of spam scored higher than the Nigerian "ham"! (its score is lower than spam's 5-percentile score) * = 76 items ... [all 0] ...
30.5 0 31.0 1 * "Hello, my Name is BlackIntrepid" prob = 0.314614377139 prob('*gary_score*') = 0.480559 prob('*chi_score*') = 0.296176 prob('*H*') = 0.930885 prob('*S*') = 0.523237 prob('*Q*') = 0.684254 prob('*P*') = 0.633036 31.5 0 32.0 0 32.5 0 33.0 0 33.5 1 * uuencoded text body we throw away unlooked at 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 1 * giant base64-encoded text file; gary- and chi- both score it near 0.50 49.5 0 50.0 1 * Website Programmers Available Now!; full of tech talk 50.5 2 * webmaster link directory the spam with dozens of killer spam clues hiding in meta tags we don't look at 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 1 * 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 1 * 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 1 * 74.0 0 74.5 0 75.0 0 75.5 0 76.0 1 * 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 1 * 83.5 0 84.0 0 84.5 1 * 85.0 2 * 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 1 * 89.0 3 * 89.5 1 * 90.0 2 * 90.5 1 * 91.0 1 * 91.5 0 92.0 16 * 92.5 3 * 93.0 3 * 93.5 2 * 94.0 2 * 94.5 6 * 95.0 6 * 95.5 20 * 96.0 76 * 96.5 269 **** 97.0 838 ************ 97.5 2329 ******************************* 98.0 4600 ************************************************************* 98.5 3792 ************************************************** 99.0 1045 ************** 99.5 964 ************* From popiel@wolfskeep.com Sat Oct 19 05:44:50 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Fri, 18 Oct 2002 21:44:50 -0700 Subject: [Spambayes] Mixed combining Message-ID: <20021019044450.73D33F5B4@cashew.wolfskeep.com> I did two runs of the mixed combining. Data is not yet indexed on my website; perhaps tomorrow. By my results, mixed spamprob is effectively neutral compared to straight chi-squared. The best cost is better, but how to achieve those costs is no clearer than before. The fp & fn counts are lower, but at a cost of about half again more unsures. I guess it all depends on how you assign your costs. Anyway, here's the tables: Mixed, .9 chi-squared, 0.10-0.90 unsure: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp total: 2 3 3 3 3 2 2 fp %: 0.40 0.40 0.30 0.24 0.20 0.11 0.10 fn total: 5 6 4 5 6 7 9 fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 unsure t: 46 44 45 42 52 51 52 unsure %: 1.84 1.76 1.80 1.68 2.08 2.04 2.08 real cost: $34.20 $44.80 $43.00 $43.40 $46.40 $37.20 $39.40 best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 Mixed, .9 chi-squared, 0.05-0.95 unsure: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [...] 
-> tested 200 hams & 50 spams against 1800 hams & 450 spams ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp total: 2 2 2 2 2 2 1 fp %: 0.40 0.27 0.20 0.16 0.13 0.11 0.05 fn total: 4 4 3 3 3 3 4 fn %: 0.20 0.23 0.20 0.24 0.30 0.40 0.80 unsure t: 69 71 70 73 83 82 89 unsure %: 2.76 2.84 2.80 2.92 3.32 3.28 3.56 real cost: $37.80 $38.20 $37.00 $37.60 $39.60 $39.40 $31.80 best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 And, for reference, pure chi-squared, 0.05-0.95 unsure: -> tested 50 hams & 200 spams against 450 hams & 1800 spams [...] -> tested 200 hams & 50 spams against 1800 hams & 450 spams ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 fp total: 2 3 3 3 2 2 2 fp %: 0.40 0.40 0.30 0.24 0.13 0.11 0.10 fn total: 5 6 4 5 6 7 9 fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 unsure t: 49 44 49 46 54 58 53 unsure %: 1.96 1.76 1.96 1.84 2.16 2.32 2.12 real cost: $34.80 $44.80 $43.80 $44.20 $36.80 $38.60 $39.60 best cost: $28.60 $28.40 $34.00 $35.60 $34.60 $30.60 $28.60 h mean: 1.31 0.58 0.50 0.46 0.51 0.48 0.36 h sdev: 8.51 6.47 6.46 6.25 6.44 6.12 4.97 s mean: 99.25 98.92 98.60 98.17 97.25 96.73 94.66 s sdev: 6.75 8.05 9.04 10.76 13.47 14.49 18.20 mean diff: 97.94 98.34 98.10 97.71 96.74 96.25 94.30 k: 6.42 6.77 6.33 5.74 4.86 4.67 4.07 Enjoy. - Alex From tim.one@comcast.net Sat Oct 19 06:03:42 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:03:42 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF183A.8070600@hooft.net> Message-ID: [Tim] >> But what's the point? By your own cost measure, it didn't do >> you any good, and in fact it raised your FP rate by the time you >> got to 4. 
[Rob Hooft] > There was some discussion about the judgments being too strict. I was > trying to find a statistically sound way to reduce correlations such > that results would be less sure. > > I explained that here: > > >>What this does statistically is downweighting all clues thereby taking > >>care of a "standard" correlation between clues. I understood that was your intent. That's why I asked what the point in the *end* was. By your own cost measure, it didn't help; etc -- at best it was neutral, and taking the results at face value it hurt a little. >> Fudging H, S and n introduces strange biases, because the info >> you're feeding into chi2Q no longer follows a chi-squared >> distribution after fudging, and chi2Q may as well be some form >> of biased random-number generator then. > That is not exactly true. What I am assuming is that if there is one > clue in a message that says 0.8, there are probably more of those. That > is the correlation we're discussing. A clue rarely comes alone. Effect > of that is that my joke messages with "From: xxx@yyy (by way of > ppp@qqq)" gets a very strong and repeated signal from the From: line, Except that those correlations act in your favor, correct? You called the bad jokes ham, and their very high spam scores relative to the rest of your ham suggested that these correlated "From" clues were the only things saving them from being false positives. > and your filtered mailman list is much too sure about hamminess. Yes, and if we don't strip HTML tags, every scheme has been much too sure about spaminess. But these last two are the only two cases I've seen where correlations were harmful: all other cases I've looked at are akin to your "bad joke" example, where correlations were helpful in making the correct decision. > This is solved by my hack: it practically divides the number of clues by > 2 or 4. I'm not sure what it does, but until it demonstrably helps something I'm not keen to pursue it. 
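The exchange above keeps referring to H, S, chi2Q, and the final score without ever restating the formulas in one place. Here is a minimal sketch of chi-combining as described in this thread; the function names are mine, and the real classifier additionally clamps word probabilities away from 0 and 1 and handles empty clue lists, which this sketch omits:

```python
import math

def chi2Q(x2, v):
    # P(chi-squared with v degrees of freedom >= x2); v must be even.
    # This is the standard closed-form series for even degrees of freedom.
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combined_score(probs):
    # H and S measure how non-random the clue probabilities look from the
    # ham and spam sides; the final score is (S - H + 1) / 2, so a message
    # where both indicators fire lands near the 0.5 middle ground.
    n = len(probs)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

The chi-squared connection is that -2 times the sum of logs of n uniform random probabilities follows a chi-squared distribution with 2n degrees of freedom, which is why fudging H, S, or n (as debated above) breaks the statistical interpretation.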
Note that the new use_mixed_combining gimmick can be used to get any scoring behavior between default-combining and chi-combining, using a simple weighted average of two schemes that have been widely tested with good results. > ... > --- chi2.py 16 Oct 2002 21:31:19 -0000 1.7 > +++ chi2.py 17 Oct 2002 20:04:17 -0000 > @@ -145,7 +145,7 @@ > > for i in range(5000): > ps = [random() for j in range(50)] > - s1, h1, score1 = judge(ps + [bias] * warp) > + s1, h1, score1 = judge((ps + [bias] * warp)*4) > s.add(s1) > h.add(h1) > score.add(score1) > > (i.e. adding correlated data points) Results in: > > Result for random vectors of 50 probs, + 0 forced to 0.99 > > H 5000 items; mean 0.47; sdev 0.38 > -> min 1.26528e-11; median 0.444004; max 1 > -> fivepctlo 0.000293787; fivepcthi 0.999102 > * = 19 items > 0.00 1125 ************************************************************ > 0.05 291 **************** > 0.10 230 ************* > 0.15 182 ********** > 0.20 157 ********* > 0.25 146 ******** > 0.30 119 ******* > 0.35 135 ******** > 0.40 129 ******* > 0.45 121 ******* > 0.50 120 ******* > 0.55 131 ******* > 0.60 128 ******* > 0.65 152 ******** > 0.70 128 ******* > 0.75 167 ********* > 0.80 172 ********** > 0.85 208 *********** > 0.90 239 ************* > 0.95 920 ************************************************* > .. > So: chi2 will be fairly sure even about random data if it is correlated. Well, random data isn't correlated , but I see what you mean and it is an interesting point. Whether it's of general importance (as opposed to in the few special cases we've identified by staring at the tiny minority of mistakes) to the ham-vs-spam problem I don't know. I *expect* a related question is why split-on-whitespace works better than searching for alphanumeric runs. When staring at one of his own false positives, Guido complained that, e.g., "hotels" and "hotels," were counted as two distinct clues. And s-o-w routinely creates lots of highly correlated word combinations. 
But staring only at mistakes gives no insight into what *works*, and I suspect that, e.g., counting "Python" and "Python?" and "Python." and "Python," (etc) as distinct clues actively helps my c.l.py ham. Ditto counting "erection" and "Viagra" as distinct, and "Nigeria" and "Nigerian", etc. Recall that we also had Matt Sergeant's testimony that lemmatization harmed performance in his Bayesian classifier, and one clear effect of lemmatization is to reduce the number of highly correlated features. So I'm not willing to believe that reducing correlation is a sensible goal in this task without strong experimental evidence to back it up; so far, all we have is indirect evidence about that, but to the extent that applies, it's not supporting the thesis that correlation is generally a bad thing here. From tim.one@comcast.net Sat Oct 19 06:10:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:10:17 -0400 Subject: [Spambayes] Mixed combining In-Reply-To: <20021019044450.73D33F5B4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > I did two runs of the mixed combining. Data is not yet indexed > on my website; perhaps tomorrow. > > By my results, mixed spamprob is effectively neutral compared to > straight chi-squared. The best cost is better, but how to achieve > those costs is no clearer than before. The fp & fn counts are > lower, but at a cost of about half again more unsures. I guess > it all depends on how you assign your costs. I've run some more experiments of my own, and I'm embarrassed to agree that indeed straight chi-squared did just as well, and that cutoffs got fuzzier under mixed combining, and that Yet Another Parameter to fiddle (the chi weight) was more Yet Another PITA (Parameter In The Ass) than anything else. Chalk it up to youthful enthusiasm -- I should follow my own advice and just give up on my two miserable FP. 
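For reference, the use_mixed_combining gimmick being retired here is nothing more exotic than a weighted average of the two schemes' scores; this one-liner is a sketch, with the default weight taken from the mixed_combining_chi_weight option quoted earlier:

```python
def mixed_score(chi_score, gary_score, chi_weight=0.9):
    # Weighted average of the two combining schemes' scores (both on [0, 1]).
    # chi_weight=0.9 matches the mixed_combining_chi_weight default above.
    return chi_weight * chi_score + (1.0 - chi_weight) * gary_score
```

With a weight of 0.9, even a gary score of 0.0 can only pull a chi score of 1.0 down to 0.9, which is consistent with the observation that the mix mostly nudged hard cases toward the unsure range rather than flipping decisions.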
> Anyway, here's the tables: > > Mixed, .9 chi-squared, 0.10-0.90 unsure: > -> tested 50 hams & 200 spams against 450 hams & 1800 spams > [...] > -> tested 200 hams & 50 spams against 1800 hams & 450 spams > ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 > fp total: 2 3 3 3 3 2 2 > fp %: 0.40 0.40 0.30 0.24 0.20 0.11 0.10 > fn total: 5 6 4 5 6 7 9 > fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 > unsure t: 46 44 45 42 52 51 52 > unsure %: 1.84 1.76 1.80 1.68 2.08 2.04 2.08 > real cost: $34.20 $44.80 $43.00 $43.40 $46.40 $37.20 $39.40 > best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 > h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 > h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 > s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 > s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 > mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 > k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 This is a nice way to present summary info. Are these produced by your table2.py? If so, I know where to find that -- would you consider contributing it to the project? > ... From tim.one@comcast.net Sat Oct 19 06:20:24 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:20:24 -0400 Subject: [Spambayes] optimal max_discriminators for chi2 In-Reply-To: <3DAF2256.30509@hooft.net> Message-ID: [Rob Hooft] > I did a series of runs: > ========================= > [Classifier] > use_chi_squared_combining: True > robinson_minimum_prob_strength = 0.0 > robinson_probability_s = 0.45 > max_discriminators = XXXXXX > ... > With XXXXXX between 15 and 300. Attached are plots of the 95th > percentile ham, 5th percentile spam, and of the total cost vertical > against max_discriminators horizontal. Please note again that my ham is > much tighter than my spam: vertical scales are from 0 to 0.16 and from > 89 to 100, respectively (Almost a factor of 100!). The cost plot shows > "no trend at all", but the variation is not large. Thanks, Rob! 
Have you ever plotted the density of the number of "words" in your msgs? I did at one time but have forgotten the result; IIRC, a surprisingly large percentage didn't *have* 150 distinct words (but then I'm also using the default robinson_minimum_prob_strength, which renders a whole bunch of bland words invisible). The cost plot is disturbing, suggesting we're looking at random effects more than trends. Perhaps "best cost" is just too fickle a measure here, and it would be better to develop a measure of "average cost" across all cutoff pairs within the specified base (ham_cutoff, spam_cutoff) pair. > I'd almost conclude "anything goes", but based on the spam-5% value > I'd like to stick with values over ~40. This sounds sensible to me too, and my own data doesn't contradict it . I'll leave the default at 150 until there's a clear reason to change it. From tim.one@comcast.net Sat Oct 19 06:35:52 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:35:52 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF2DE1.5090404@hooft.net> Message-ID: [Rob Hooft] > My gut feeling says: under ideal combining, a score under .95 means "I'm > less than 95% sure this is spam". That may be close to what my gut feeling was at the start of this: that a spam score of p "should mean" that if I see N messages with that p value, N*p of those messages "should" be spam, and N*(1-p) of them "should be" ham. But over time I've come to believe that (an actual probability) would be a pretty useless measure in real life -- I want absolute certainty now <0.9 wink>. > ... > The first time I used SpamAssassin, I used it in label-only mode. That > gave some relief. After using it for a month, I was confident enough to > make a procmail rule to move spam into a spam folder without showing it > to me. I was amazed by the amount of rest that has created. I did not > realize that the spam was having such a psychological effect on me. 
This > is definitely what I'd want from spambayes. I'd only read my "incoming > ham". Once a week I'd go into unsure mode, and do some selection work. > Once a month I can probably go into spam-curse mode, and do the mass > deletion Tim talks about. I flipped-flopped on this until I actually rescored my mail using the mixed-combining gimmick. It was just plain annoying then to see 20 spam scoring 1.00 and 20 more scoring .99 and 20 more scoring .98, etc. Chi- and default- combining are both satisfying in their own way if you have to see the scores, the former because it does an excellent job of reflecting my own "certainty at a glance" in a vast majority of cases; I'm still not sure what's satisfying about the latter, but it's *something*. > But Sean's "sort on score" idea is also very useful. I think it'd speed > up the manual scanning/deletion process. Simulating, by hand, my ideal of "spam" and "unsure" folders, sorting on score is very helpful, but it's *also* very helpful then to see the actual scores. Especially in my Unsure folder, what's typical is that about 10% of the msgs are very close to the low end of the unsure range, and another 10% very close to the high end of the unsure range, and they're usually what you expect them to be (i.e., ham and spam, respectively). That polishes off a fifth of them very quickly. The rest are more puzzling (they're more solidly in the range where the scheme is very sure it's unsure <0.5 wink>), and I'm not sure the scores help at all then. The helpful part of that could be gotten via color-coding too, where "the helpful part" means segregating the "I'm unsure, but I think I have a good guess" parts at the ends, from the "I'm certain I'm lost" part in the middle. 
From tim.one@comcast.net Sat Oct 19 06:55:16 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 01:55:16 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <3DAF3053.6040103@hooft.net> Message-ID: [Tim, suggests that (S-H+1)/2 would be good to try with gary-combining] [Rob] > tim combining: > -> Ham scores for all runs: 16000 items; mean 13.62; sdev 9.66 > -> min 0.109175; median 12.3561; max 76.0553 > -> fivepctlo 1.35543; fivepcthi 31.4327 > -> Spam scores for all runs: 5800 items; mean 84.42; sdev 11.70 > -> min 21.351; median 85.6889; max 99.8161 > -> fivepctlo 64.4615; fivepcthi 98.8117 > -> best cost for all runs: $110.40 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.5 & 0.625 > -> fp 5; fn 16; unsure ham 35; unsure spam 187 > -> fp rate 0.0312%; fn rate 0.276%; unsure rate 1.02% BTW, note that I killed this scheme off -- it was, at the time, trying to get a better middle ground, but chi-combining works better for that. 
> default combining: > -> Ham scores for all runs: 16000 items; mean 26.37; sdev 8.32 > -> min 0.137212; median 27.2524; max 65.3836 > -> fivepctlo 11.7696; fivepcthi 38.3897 > -> Spam scores for all runs: 5800 items; mean 75.96; sdev 10.74 > -> min 33.8547; median 74.3976; max 99.7559 > -> fivepctlo 59.9773; fivepcthi 96.4292 > -> best cost for all runs: $106.20 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.5 & 0.585 > -> fp 5; fn 16; unsure ham 35; unsure spam 166 > -> fp rate 0.0312%; fn rate 0.276%; unsure rate 0.922% > > default combining with P-Q instead of (P-Q)/(P+Q): > -> Ham scores for all runs: 16000 items; mean 21.49; sdev 8.73 > -> min 0.123198; median 21.7049; max 68.8251 > -> fivepctlo 7.34536; fivepcthi 35.6937 > -> Spam scores for all runs: 5800 items; mean 79.44; sdev 11.00 > -> min 29.348; median 79.2283; max 99.786 > -> fivepctlo 61.9311; fivepcthi 97.3078 > -> best cost for all runs: $103.40 > -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 > -> achieved at ham & spam cutoffs 0.5 & 0.615 > -> fp 3; fn 16; unsure ham 37; unsure spam 250 > -> fp rate 0.0187%; fn rate 0.276%; unsure rate 1.32% > > It is all so close together in the final "cost" result that it is very > difficult to judge from the statistics. Then let's take the stats at face value: these are large runs, so if it doesn't make a clear difference here, it's unlikely to make a clear difference anywhere. IIRC, you were inspired to try S-H under chi-combining by staring at mistakes where a modest S value was paired with a very low H value, leading to S/(S+H) approaching 1 despite that S was far from certain on its own. But gary-combining is much less extreme in both its S and H measures, so it's less of a *potential* problem there.
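Rob's tables compare default (gary) combining against a variant using P-Q in place of (P-Q)/(P+Q). As best I can reconstruct from this thread, the scheme works as sketched below; this is my reconstruction of Robinson's method from the discussion, not the project's exact code, and the real classifier clamps word probabilities away from 0 and 1:

```python
import math

def gary_score(probs):
    # P and Q are one minus the geometric means of (1-p) and p respectively
    # (computed via logs to avoid underflow on long messages); the final
    # score rescales the indicator S = (P-Q)/(P+Q) from [-1, 1] onto [0, 1].
    n = len(probs)
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0
```

The "P-Q instead of (P-Q)/(P+Q)" variant in the quoted stats would simply drop the P+Q normalization, i.e. return (1 + P - Q) / 2, which compresses scores toward the middle when both P and Q are small.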
It *may* account for the two FP that got redeemed in your last run, though -- knowing their internal S and H values would help (oops -- they're called P and Q inside the default scheme, but same thing). From tim.one@comcast.net Sat Oct 19 07:11:49 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 02:11:49 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <20021017223823.785E5F4CD@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > The score we've got is just a number in the range 0 to 1 which has > interesting discriminatory properties. It's not linear with any > concept of surety, and it's not linear with similarity to spam or > ham, either. Ah, but it is linear with 1 minus the probability that -2 times the natural log of the geometric mean of 1-p_i for a vector of random probabilities p would exceed 1 minus -2 times the natural log of the geometric mean of 1-p_i for the estimated spamprobs in the message, minus 1 minus the probability that -2 times the natural log of the geometric mean of p_i for a vector of random probabilities p would exceed 1 minus -2 times the natural log of the geometric mean of p_i for the estimated spamprobs in the message. > People not immersed in how it's generated and/or buried in test results > over decent sized corpora are sure (there's that troubling word again) > to misinterpret it. Even given a clear explanation like the above? I vote we put that in the user docs, and strongly imply that anyone to whom that isn't obvious from mere inspection is an idiot who deserves all the spam they get. [Rob] >> But Sean's "sort on score" idea is also very useful. I think it'd speed >> up the manual scanning/deletion process. [Alex] > Having looked at the results from the show_unsure config option, > I tend to disagree... position in the list doesn't seem to have > any correlation with spam vs. ham. Are you sure?
I've got a GUI that sorts email by "Hammie score" now, and there's a clear correlation by eyeball *adjacent to the endpoints* of the unsure range. The middle of the unsure range is a jumble, though, and predictably so since long messages suffering cancellation disease in particular predictably score very close to 0.5 under chi-combining. Where Graham-combining would score them at 0.0 or 1.0 depending on which flavor of clue just happened to appear more often, chi scores them more like 0.49999 or 0.50001. It's still a coin toss, but of an exceedingly tiny coin. From tim.one@comcast.net Sat Oct 19 07:17:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 02:17:01 -0400 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: <20021017225449.GA4778@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > That matches my experience with setting up a spam filter. After > installation, I found it much easier to deal with messages in my inbox > (mail not from lists). The psychological effect was larger than I had > expected as well. I look at the spam mailbox last and only if I have > the time and energy. Heh. I find myself eagerly awaiting my next batch of spam now, just to see how it scores. When an hour goes by without new spam, I nervously check all the cables and start pinging my POP3 servers. You obviously need an attitude adjustment. From tim.one@comcast.net Sat Oct 19 08:56:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 19 Oct 2002 03:56:56 -0400 Subject: [Spambayes] Scoring spam discussions Message-ID: Here's an interesting result: I've got a separate folder for my email discussing this project. It currently contains 1,316 msgs, and *none* of them have been trained on. Just for fun, I scored them with my growing-but-still-small "Tim's email" classifier. Result: almost all scored 0.00 under chi-combining, including the ones you expected would score as spam.
Non-zero scores: 0.01 7 0.02 6 0.03 1 0.04 5 ------------- 0.05 is a paranoid "I'm sure it's ham" chi cutoff 0.06 3 0.08 1 0.09 1 ------------- 0.10 is a conservative chi ham_cutoff 0.11 2 0.12 1 0.13 1 pvt 30KB email from someone including a full listing of all their FP on a run ------------- 0.30 is a fine chi ham_cutoff on my c.l.py data 0.40 1 strange brief note from an acm.org spam-filter developer 0.66 1 A msg from me, from PythonLabs email discussions that took place before any code was written. That last was a forwarded Asian spam, with a bunch of my comments, and the Subject line is: Subject: [PythonLabs] =?ks_c_5601-1987?B?Rlc6ICixpLDtKbDmwO+75yC48LTPxc24tSC8rbrxvbo=?= It turns out that the MIME structure in this msg is damaged, and the email package gave up after parsing the headers. The high spam score (which is nevertheless solidly in chi's middle ground, thanks to finding clues that I sent this msg), was mostly due to all the gibberish in the Subject line (Rob, avert your eyes ): 'subject:[' 0.206009 'subject:PythonLabs' 0.228589 'subject:-' 0.356645 'subject:?' 0.681345 'subject:1987' 0.844828 'subject:] =?' 0.844828 'subject:ks_c_5601' 0.844828 'subject:skip:7 20' 0.844828 'subject:skip:R 10' 0.844828 'subject:=?=' 0.978469 'subject:+' 0.980349 If I repair the MIME by hand, so that it sees my comments (as well as the forwarded spam), the chi score falls to 0.03. The forwarded spam in isolation scores 1.00. My comments in isolation broke the Pentium's underflow trap . damn-this-stuff-works-good-ly y'rs - tim From agmsmith@rogers.com Sat Oct 19 15:17:54 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sat, 19 Oct 2002 10:17:54 EDT (-0400) Subject: [Spambayes] Client/server model and Headers Only In-Reply-To: Message-ID: <5417381978-BeMail@CR593174-A> Tim Peters wrote: > [Alexander G. M. Smith] > > That's a feature I've been asked for. Just classify by the header > > alone. 
The idea being that it would only download the header from > the mail server, and immediately delete the message on the server > if it looked like spam. I'm a bit nervous about implementing it > in case it is a false positive and thus irretrievably deletes the > message. > > I'd be very nervous about that. You may want to ask Eric Raymond if he got > anywhere with this -- at one time he intended to set up a "header score > server" in connection with, or as an offshoot of, his bogofilter project. No answer yet, but I did add some mail-parsing code to my program that trains on and uses only the headers. It seems to work surprisingly well even though it is just using headers. My database file is now full of e-mail mailbox names from those spammers that use CC:, lots of message IDs, and IP addresses. About a third of the "words" start with a number. The example data is just slapped together from some recent messages I had (probably should try to get a longer history for the ham). Genuine (ham): 374 examples used in training. Spam: 401 examples. Genuine: 100 test messages which came in the last week and a half. Spam: 45 tests. Gary-combining method, simplistic word tokenizing. Genuines: .0862202 to .51863, all under the 0.56 threshold, zero false positives. Spams: .471454 to .726808, giving 23 false negatives under the 0.56 threshold, or 11 under 0.52. Summing up, it can get rid of half the spam by just looking at the headers. Applying the same messages to the full examination (whole e-mail text just examined for words, not parsed into parts, resulting database is 3X larger) I get: Genuine: .147655 to .830371, 6 false positives over the 0.56 threshold. Spam: .600419 to .993935, 0 false negatives. Hmmmm. I'll have to be more careful in selecting my example messages; I usually get better performance with my actual working database. Better tests needed, more later... - Alex From popiel@wolfskeep.com Sat Oct 19 18:32:46 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Sat, 19 Oct 2002 10:32:46 -0700 Subject: [Spambayes] Mixed combining In-Reply-To: Message from Tim Peters of "Sat, 19 Oct 2002 01:10:17 EDT." References: Message-ID: <20021019173246.592DCF4D6@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> Anyway, here's the tables: >> >> Mixed, .9 chi-squared, 0.10-0.90 unsure: >> -> tested 50 hams & 200 spams against 450 hams & 1800 spams >> [...] >> -> tested 200 hams & 50 spams against 1800 hams & 450 spams >> ham:spam: 50-200 75-175 100-150 125-125 150-100 175-75 200-50 >> fp total: 2 3 3 3 3 2 2 >> fp %: 0.40 0.40 0.30 0.24 0.20 0.11 0.10 >> fn total: 5 6 4 5 6 7 9 >> fn %: 0.25 0.34 0.27 0.40 0.60 0.93 1.80 >> unsure t: 46 44 45 42 52 51 52 >> unsure %: 1.84 1.76 1.80 1.68 2.08 2.04 2.08 >> real cost: $34.20 $44.80 $43.00 $43.40 $46.40 $37.20 $39.40 >> best cost: $28.60 $28.20 $34.00 $33.20 $34.20 $30.40 $23.80 >> h mean: 3.61 2.70 2.47 2.30 2.29 2.21 1.99 >> h sdev: 8.09 6.15 6.13 5.93 6.13 5.84 4.79 >> s mean: 97.08 96.69 96.33 95.84 94.94 94.34 92.25 >> s sdev: 6.48 7.71 8.63 10.21 12.73 13.67 17.09 >> mean diff: 93.47 93.99 93.86 93.54 92.65 92.13 90.26 >> k: 6.42 6.78 6.36 5.80 4.91 4.72 4.13 > >This is a nice way to present summary info. Are these produced by your >table2.py? If so, I know where to find that -- would you consider >contributing it to the project? Yes, it's produced by table2.py. Feel free to use it. I just created a sourceforge account for myself (username popiel), just in case you feel the urge to add me as a developer. - Alex From popiel@wolfskeep.com Sat Oct 19 18:37:45 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 19 Oct 2002 10:37:45 -0700 Subject: [Spambayes] Proposing to remove 4 combining schemes In-Reply-To: Message from Tim Peters of "Sat, 19 Oct 2002 02:11:49 EDT." 
References: Message-ID: <20021019173745.E948EF4D6@cashew.wolfskeep.com> In message: Tim Peters writes: >[Rob] >>> But Sean's "sort on score" idea is also very useful. I think it'd speed >>> up the manual scanning/deletion process. > >[Alex] >> Having looked at the results from the show_unsure config option, >> I tend to disagree... position in the list doesn't seem to have >> any correlation with spam vs. ham. > >Are you sure? I've got a GUI that sorts email by "Hammie score" now, and >there's a clear correlation by eyeball *adjacent to the endpoints* of the >unsure range. Well, no, I'm not sure. There was a bit more ham towards the bottom of the range, and a bit more spam towards the top... but I had high scoring ham and low scoring spam right near the endpoints, too. And to make it all worse, each section of the show_unsure output would only have 4-5 messages, so seeing the trends is hard. ;-) - Alex From popiel@wolfskeep.com Sun Oct 20 00:25:37 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 19 Oct 2002 16:25:37 -0700 Subject: [Spambayes] Mixed combining In-Reply-To: Message from "T. Alexander Popiel" of "Sat, 19 Oct 2002 10:32:46 PDT." <20021019173246.592DCF4D6@cashew.wolfskeep.com> References: <20021019173246.592DCF4D6@cashew.wolfskeep.com> Message-ID: <20021019232537.DDFEAF4D6@cashew.wolfskeep.com> In message: <20021019173246.592DCF4D6@cashew.wolfskeep.com> "T. Alexander Popiel" writes: > >Yes, it's produced by table2.py. Feel free to use it. Of course, if you do include it in the project, I'll feel obliged to submit patches to it to correct the horribly out of date (and incorrect) header comments. And clean it up in general. Bah. If people are going to actually _use_ it, I'll have to make it presentable, at least... 
- Alex From tim.one@comcast.net Sun Oct 20 07:03:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 20 Oct 2002 02:03:27 -0400 Subject: [Spambayes] Mixed combining In-Reply-To: <20021019173246.592DCF4D6@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel, on his nice table2.py] > Yes, it's produced by table2.py. Feel free to use it. I just created > a sourceforge account for myself (username popiel), just in case you > feel the urge to add me as a developer. I thought I would be delighted to -- and it turned out I was. Yet another correct prediction. You can cut your SF CVS teeth by adding table2.py to the project, if you like. If there are any problems dealing with the mechanics of remote CVS, feel free to ask (either here, or directly to me if you can afford to wait). Welcome again! From anthony@interlink.com.au Sun Oct 20 13:50:05 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sun, 20 Oct 2002 22:50:05 +1000 Subject: [Spambayes] expiration ideas. Message-ID: <200210201250.g9KCo5C03237@localhost.localdomain> Just thinking again about expiration, and wondering if the following would work: When training new data (say a new week's worth), train it with a new classifier ("interim"). Once it's trained, merge the interim classifier's wordinfo into your master classifier wordinfo by adding the new spamcounts and hamcounts to the master wordinfo blob, then recalc probabilities. Keep the "interim" wordinfo around (gzipped, datestamped) until your expiration time is up - then undo the earlier merge, subtracting the spamcount/hamcounts. Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll play with this tomorrow (ah, leave....) Anthony From agmsmith@rogers.com Sun Oct 20 17:04:14 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sun, 20 Oct 2002 12:04:14 EDT (-0400) Subject: [Spambayes] expiration ideas.
In-Reply-To: <200210201250.g9KCo5C03237@localhost.localdomain> Message-ID: <2124500893-BeMail@CR593174-A> Anthony Baxter wrote: > Keep the "interim" wordinfo around (gzipped, datestamped) until your > expiration time is up - then undo the earlier merge, subtracting > the spamcount/hamcounts. > > Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll > play with this tomorrow (ah, leave....) Sounds reasonable. But I'd rather keep around the whole messages so that I can change tokenizing schemes. Or perhaps use one of those future inter-word relation schemes. The total space is several times (ten times) more than a word list (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is still almost trivial on today's computers and huge disk drives to store the complete messages. So, you have to ask yourself if a 10X space (and tokenizing time) savings is worth it. - Alex From popiel@wolfskeep.com Sun Oct 20 17:52:28 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sun, 20 Oct 2002 09:52:28 -0700 Subject: [Spambayes] expiration ideas. In-Reply-To: Message from "Alexander G. M. Smith" of "Sun, 20 Oct 2002 12:04:14 EDT." <2124500893-BeMail@CR593174-A> References: <2124500893-BeMail@CR593174-A> Message-ID: <20021020165228.51EF0F519@cashew.wolfskeep.com> In message: <2124500893-BeMail@CR593174-A> "Alexander G. M. Smith" writes: >Anthony Baxter wrote: >> Keep the "interim" wordinfo around (gzipped, datestamped) until your >> expiration time is up - then undo the earlier merge, subtracting >> the spamcount/hamcounts. >> >> Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll >> play with this tomorrow (ah, leave....) >Sounds reasonable. But I'd rather keep around the whole messages so >that I can change tokenizing schemes. Or perhaps use one of those >future inter-word relation schemes.
Whether you want to keep whole messages or just the wordlists depends entirely on whether you want to fully retrain when you switch tokenization schemes vs. keeping the old database and just adding new stuff with the new tokenization. If you keep the database through tokenizations, then you want a record of what actually got added during a prior training, instead of what would have been added if the current tokenization had been used. Thus, the word lists are better for database integrity. Of course, if you fully retrain every time you switch tokenizers, then keeping the entire messages is the only way to support arbitrary changes in the tokenizer. It's a question of approach... Personally, I'm keeping all messages for all time, so it doesn't matter much one way or another. - Alex PS. We can really confuse folks if Alex and Alex start holding regular debates on the list... From tim@fourstonesforum.com Sun Oct 20 15:15:43 2002 From: tim@fourstonesforum.com (Four Stones Forum) Date: Sun, 20 Oct 2002 09:15:43 -0500 Subject: [Spambayes] from a new member Message-ID: I've recently become aware of the Spambayes project, and I'm quite interested, so I subscribed to the mailing list, and I've been reading for a while, trying to get my head around the solution you're working on. I think I (kinda) have the idea now, and I figured I'd post to introduce myself and to ask a few questions. I'm Tim Stone, I've never worked on an open source project before, though I've used lots of open source stuff, I've been in the IT industry since 1975 (which makes me a geezer), I know 30-odd languages (Python isn't one of them), I've worked on all kinds of stuff under lots of different architectures... so enough about me. First of all: I HATE SPAM. It is an insidious evil, and I'm glad to see some truly progressive thinking about how to deal with it, not only deal with the mail, but deal with the PROBLEM.
Second of all: I run a website (www.fourstonesExpressions.com) that has a mailing list (I say these words at risk of having this mail rejected by simplistic filters) that I feel to be completely legit. It's a completely voluntary opt-in, there are no checkboxes with 'Yes' defaults, etc. etc. I don't sell the list or give it away. I only send mailings occasionally, perhaps 3 or 4 times a year. I've only ever had one opt-out. I think that speaks well of how the list is run. I say all that to say this: I take great pains in my mailings to ensure that things like spam-assassin don't label my mailings. Spam-assassin is very popular, and it does some great things. It also documents its reasoning in incoming mail's headers, so you can see how it arrived at its conclusion about your mail. This allows me to optimize my mailings by simply sending one to myself and seeing how SA rates it, and then fixing the problems. It shocks me that all spammers don't do this, but I'm certainly glad that they don't, because that allows SA to work for me. However, as our ability to block spam becomes better and better, I think they'll be forced to use this strategy more and more. As someone who sends mailings that *could* be thought of as spam, these are the things that I'm sure spammers will think about. How do you defeat Spambayes? Well, if I'm a spammer, I get me a copy and train it on a vast number of spams that are like mine, then I start tweaking.... As such, I think that Spambayes will work BEST in conjunction with other technologies. One of the best ideas I've seen in the discussions thus far is to keep a PUBLIC list of urls that spammers actively promote. This should probably be done at the domain level. The keeper of this list could very well use a crawler and a Bayesian approach to rating the website itself, which is a double safety net. Otherwise a spammer could include urls that are not related to the spam, and do (at least public relations) damage to other sites.
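[Editor's note: the domain-level blocklist idea above might be sketched as below. The blocklist entries and the URL regular expression are purely illustrative assumptions, not a real published list or the project's actual tokenizer.]

```python
import re

# Hypothetical published blocklist of spammer-promoted domains.
# These entries are made-up placeholders, not real data.
SPAMMER_DOMAINS = {"example-pills.invalid", "cheap-loans.invalid"}

# Naive URL matcher; a real tokenizer would be more careful.
URL_RE = re.compile(r"https?://([^/\s:?#]+)", re.IGNORECASE)

def suspicious_domains(body):
    """Return the blocklisted domains mentioned in a message body."""
    found = set(m.lower() for m in URL_RE.findall(body))
    return found & SPAMMER_DOMAINS

body = "Order now at http://Example-Pills.invalid/buy or visit http://python.org/"
print(sorted(suspicious_domains(body)))
```

Matching at the domain level, as suggested, also sidesteps the letter-spacing and image-only tricks mentioned below: the spam still has to name a domain the recipient can visit.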
Using this in conjunction with Spambayes actually defeats several other simple (temporary) workarounds that spammers could employ, s u c h a s i n c l u d i n g s p a c e s b e t w e e n l e t t e r s, which is quite human readable, but breaks the document down into a large number of single character words, or sending spam as a single jpg. Well, that's enough for now. Is anybody working on the Bayesian crawler idea? - Tim From agmsmith@rogers.com Sun Oct 20 22:12:07 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sun, 20 Oct 2002 17:12:07 EDT (-0400) Subject: [Spambayes] Headers and Other Significant Message Parts Message-ID: <13547908442-BeMail@CR593174-A> This is a multipart message in MIME format. ---------------------- multipart/mixed attachment Database: 341 training genuine (ham) messages, 406 training spam messages (or 398 spam when parsing due to a bug with messages that don't have body text). 40 test genuine messages, 40 test spam messages, all more recent than the training ones. Spam threshold is 0.56, Gary-combining method, simplistic word tokenization. Just headers: Genuine .181352 to .557881, one false positive (a mailbox full announcement). 2.5% wrong. Spam .450602 to .750511, 21 false negatives. 52.5% wrong. Whole raw message text: Genuine .163027 to .627022, 3 false positives. 7.5% wrong. Spam .509355 to .993985, 1 false negative. 2.5% wrong. Any text/* parts and header: Genuine .162697 to .614136, 4 false positives, 10% wrong. Spam .614973 to .994362, 0 false negatives, 0% wrong. Any text parts, no headers: Genuine .221923 to .635487, 6 false positives, 15% wrong. Spam .594271 to .994441, 0 false negatives, 0% wrong. Just text/plain parts (including body text) and headers: Genuine .137869 to .583192, 3 false positives, 7.5% wrong. Spam .448059 to .994119, 17 false negatives, 42.5% wrong. Just text/plain parts, no headers. 150 spam and 1 genuine training message had no words: Genuine .219169 to .696899, 9 false positives, 22.5% wrong. 
Spam .660755 to .994116, 0 false negatives, 27 had no words. So, the headers are quite useful for identifying Spam. The winners are chewing up the whole message, or using all text parts (throwing away binary attachments) and including the headers too. The advantage with the parts method is that the database doesn't fill up with junk words from binary attachments. - Alex ---------------------- multipart/mixed attachment I did some more tests using AGMSBayesianSpam v1.58 for BeOS (http://www.bebits.com/app/3055) to tokenize different parts of mail messages, to see if headers were useful or if some parts could be discarded. Database: 341 training genuine (ham) messages, 406 training spam messages (or 398 spam when parsing due to a bug with messages that don't have body text, shouldn't influence it too much). 40 test genuine messages, 40 test spam messages, all more recent than the training ones. Spam threshold is 0.56, Gary-combining method, simplistic word tokenization. Just headers: Genuine .181352 to .557881, one false positive (a mailbox full announcement). 2.5% wrong. Spam .450602 to .750511, 21 false negatives. 52.5% wrong. Whole raw message text (only quoted-printable decoding): Genuine .163027 to .627022, 3 false positives. 7.5% wrong. Spam .509355 to .993985, 1 false negative. 2.5% wrong. Message parsed into parts (parsing decodes base64 and quoted-printable, and for text converts the character set to UTF-8), plus headers (includes MIME subheaders too): Genuine .168857 to .609005, 4 false positives, 10% wrong. Spam .614564 to .994364, 0 false negatives, 0% wrong. Message parsed into parts of all kinds, no header data: Genuine .220161 to .631161, 5 false positives, 12.5% wrong. Spam .592501 to .994444, 0 false negatives, 0% wrong. Only text/* parts and headers: Genuine .162697 to .614136, 4 false positives, 10% wrong. Spam .614973 to .994362, 0 false negatives, 0% wrong.
Just text/* parts, no headers: Genuine .221923 to .635487, 6 false positives, 15% wrong. Spam .594271 to .994441, 0 false negatives, 0% wrong. Just text/plain parts (including body text) and headers: Genuine .137869 to .583192, 3 false positives, 7.5% wrong. Spam .448059 to .994119, 17 false negatives, 42.5% wrong. Just text/plain parts, no headers. 150 spam and 1 genuine training message had no words. Genuine .219169 to .696899, 9 false positives, 22.5% wrong. Spam .660755 to .994116, 0 false negatives, 27 spam had no words (a good sign of spam). So, the headers are quite useful for identifying Spam in general. If using just headers, there are few false positives, making them suitable for deleting spam on the server (only downloading the header). But they have many false negatives, so it isn't that useful. Harmless and half useless :-). The winners are the whole message as raw text method, or using all text parts (throwing away binary attachments) and including the headers too. The advantage with the parts method is that the database doesn't fill up with junk words from binary attachments. - Alex ---------------------- multipart/mixed attachment-- From agmsmith@rogers.com Sun Oct 20 22:17:17 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Sun, 20 Oct 2002 17:17:17 EDT (-0400) Subject: [Spambayes] Headers and Other Significant Message Parts In-Reply-To: <13547908442-BeMail@CR593174-A> Message-ID: <13857947920-BeMail@CR593174-A> Sorry about the mangled message (another bug found!), here it is again: I did some more tests using AGMSBayesianSpam v1.58 for BeOS (http://www.bebits.com/app/3055) to tokenize different parts of mail messages, to see if headers were useful or if some parts could be discarded. Database: 341 training genuine (ham) messages, 406 training spam messages (or 398 spam when parsing due to a bug with messages that don't have body text, shouldn't influence it too much).
40 test genuine messages, 40 test spam messages, all more recent than the training ones. Spam threshold is 0.56, Gary-combining method, simplistic word tokenization. Just headers: Genuine .181352 to .557881, one false positive (a mailbox full announcement). 2.5% wrong. Spam .450602 to .750511, 21 false negatives. 52.5% wrong. Whole raw message text (only quoted-printable decoding): Genuine .163027 to .627022, 3 false positives. 7.5% wrong. Spam .509355 to .993985, 1 false negative. 2.5% wrong. Message parsed into parts (parsing decodes base64 and quoted-printable, and for text converts the character set to UTF-8), plus headers (includes MIME subheaders too): Genuine .168857 to .609005, 4 false positives, 10% wrong. Spam .614564 to .994364, 0 false negatives, 0% wrong. Message parsed into parts of all kinds, no header data: Genuine .220161 to .631161, 5 false positives, 12.5% wrong. Spam .592501 to .994444, 0 false negatives, 0% wrong. Only text/* parts and headers: Genuine .162697 to .614136, 4 false positives, 10% wrong. Spam .614973 to .994362, 0 false negatives, 0% wrong. Just text/* parts, no headers: Genuine .221923 to .635487, 6 false positives, 15% wrong. Spam .594271 to .994441, 0 false negatives, 0% wrong. Just text/plain parts (including body text) and headers: Genuine .137869 to .583192, 3 false positives, 7.5% wrong. Spam .448059 to .994119, 17 false negatives, 42.5% wrong. Just text/plain parts, no headers. 150 spam and 1 genuine training message had no words. Genuine .219169 to .696899, 9 false positives, 22.5% wrong. Spam .660755 to .994116, 0 false negatives, 27 spam had no words (a good sign of spam). So, the headers are quite useful for identifying Spam in general. If using just headers, there are few false positives, making them suitable for deleting spam on the server (only downloading the header). But they have many false negatives, so it isn't that useful. Harmless and half useless :-).
The winners are the whole message as raw text method, or using all text parts (throwing away binary attachments) and including the headers too. The advantage with the parts method is that the database doesn't fill up with junk words from binary attachments. - Alex From dereks@itsite.com Sun Oct 20 23:43:21 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 20 Oct 2002 15:43:21 -0700 (PDT) Subject: [Spambayes] Deployment time In-Reply-To: Message-ID: After reading The Article and briefly reviewing the code, I've decided to go out on a limb and deploy hammie.py as the server-wide spam filter for a production email server. This is with Postfix. I'll be reporting my experiences here to the list, so you algorithm geniuses get a reminder that dummies like me should be able to install and use it. If things work out well, I'll write a "success story" that can be posted on the website. I may also be able to help contribute enduser documentation. So my immediate first question is: Is there a pre-existing DBM store that I can install on my server? I'm looking for something that will catch most (>90%) of current spams, and has the kind of ham traffic you'd see at a University or Corporate America institution (i.e., words like ConfigParser and -OO shouldn't be required to classify it as ham). If not, could one of you nice guys make one available for download? (Unfortunately, other than my developer-type personal emails, I have no existing ham store of my own to work from.) Any help in this regard is greatly appreciated. And although I'm not a spam expert, I know enough Python to get by, so hopefully I'll make a helpful test candidate.
(My Python experience includes a custom Apache authentication module, a load-balanced cluster management system, and a 3D OpenGL/SDL game engine -- yes, Python is fast enough for a 3D game engine, as long as the rendering is done in C extension modules :) Thank You, Derek Simkowiak From dereks@itsite.com Mon Oct 21 04:18:52 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 20 Oct 2002 20:18:52 -0700 (PDT) Subject: [Spambayes] Deployment time In-Reply-To: Message-ID: > I'll be reporting my experiences here to the list [...] First note: The filter-mode header of "X-Hammie-Disposition" seems inappropriate, and let me explain why. I, the sysadmin, know what hammie.py is. I know that I installed it, and I know that it filters for spam. I know that it is part of the SpamBayes project, and that the header is inserted into spam-like messages. However, someone else looking at the "X-Hammie-Disposition" header out of context would not know at all what that header means, what to do with it, or that they can filter on it for classifying spam. A Google search for "Hammie" does not give any results relating to the SpamBayes project, and even worse, a search for "X-Hammie-Disposition" gives no results at all. It would be much more useful to use a header that can be recognized for what it is, without having to be one of the rare individuals who knows what "hammie.py" is. I suggest something like X-SpamBayes-Disposition [or] X-Spam-Disposition [or] X-Spamfilter-Disposition ...or, better yet, to stick with the conventions that SpamAssassin has used. This would be easiest on endusers and helpdesks, since setting up filters for a SpamBayes installation would be the same as doing it for a SpamAssassin installation. Mobile users with email accounts in both kinds of domain would only need one set of filter rules. That, and the SpamAssassin headers are pretty intuitive. 
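[Editor's note: Derek's suggestion amounts to tagging each message with a header that other filters and mail clients can recognize. A minimal sketch with Python's email package follows; the header name, thresholds, and score value are illustrative assumptions, not the project's actual choices.]

```python
import email

raw = """\
From: sender@example.com
To: recipient@example.com
Subject: test

Hello there.
"""

msg = email.message_from_string(raw)

score = 0.97  # spam score from some classifier (hypothetical value)

# Illustrative thresholds; a real deployment would tune these.
if score >= 0.9:
    disposition = "Yes"
elif score <= 0.2:
    disposition = "No"
else:
    disposition = "Unsure"

msg["X-SpamBayes-Disposition"] = "%s; score=%.3f" % (disposition, score)
print(msg["X-SpamBayes-Disposition"])
```

Downstream, end users could then filter on a single well-known header, exactly as they already do for SpamAssassin's X-Spam-* headers.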
SpamAssassin has slightly different semantics for its headers, but it will be trivial to implement them in hammie.py. If the maintainer(s) are in favor of this approach, I can submit a patch in a couple of weeks. For reference, you can see how SpamAssassin tags spams at the following URL: http://spamassassin.taint.org/doc/spamassassin.html#tagging Thanks, Derek Simkowiak From anthony@interlink.com.au Mon Oct 21 05:22:47 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 14:22:47 +1000 Subject: [Spambayes] expiration ideas. In-Reply-To: <2124500893-BeMail@CR593174-A> Message-ID: <200210210422.g9L4MlK08246@localhost.localdomain> >>> "Alexander G. M. Smith" wrote > Anthony Baxter wrote: > > Keep the "interim" wordinfo around (gzipped, datestamped) until your > > expiration time is up - then undo the earlier merge, subtracting > > the spamcount/hamcounts. > Sounds reasonable. But I'd rather keep around the whole messages so > that I can change tokenizing schemes. Or perhaps use one of those > future inter-word relation schemes. That's fine, but once this stuff is deployed, how many end-users are going to want to tweak their tokeniser? I'd suggest approximately three eighths of one fifth of bugger-all :) > The total space is several times (ten times) more than a word list > (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB > bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is > still almost trivial on today's computers and huge disk drives to > store the complete messages. So, you have to ask yourself if a > 10X space (and tokenizing time) savings is worth it. For one user, fine - but in a setting where you've got multiple users, say, using an IMAP server? You'd want the stuff to happen on the server, before the end users have to run a program to download the mail, check it, and send commands to the IMAP server to move the spam out of the way...
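[Editor's note: the merge/undo-merge expiration scheme being debated in this thread can be sketched with plain dicts. The (hamcount, spamcount) tuple format is an assumption for illustration only, not the project's actual WordInfo record, and the word counts below are made up.]

```python
def merge(master, interim):
    """Fold an interim batch's word counts into the master database."""
    for word, (ham, spam) in interim.items():
        h, s = master.get(word, (0, 0))
        master[word] = (h + ham, s + spam)

def unmerge(master, interim):
    """Expire a batch by subtracting the counts it contributed."""
    for word, (ham, spam) in interim.items():
        h, s = master[word]
        h, s = h - ham, s - spam
        if h == 0 and s == 0:
            del master[word]  # word no longer seen anywhere
        else:
            master[word] = (h, s)

master = {"python": (10, 1)}
week1 = {"python": (3, 0), "viagra": (0, 5)}  # one week's training batch
merge(master, week1)
# ... keep week1 around (gzipped, datestamped); when it expires:
unmerge(master, week1)
print(master)
```

After an unmerge (and a probability recalculation, omitted here), the database again corresponds exactly to some real collection of messages, which is the appeal of the scheme.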
I also get enough email that I really don't want to be lugging around all of my old email for a couple of months... Anthony -- Anthony Baxter It's never too late to have a happy childhood. From dereks@itsite.com Mon Oct 21 05:38:58 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 20 Oct 2002 21:38:58 -0700 (PDT) Subject: [Spambayes] expiration ideas. In-Reply-To: <200210210422.g9L4MlK08246@localhost.localdomain> Message-ID: > > The total space is several times (ten times) more than a word list > > (5.9MB raw, 2.4MB zipped archive, 1.5MB gzip tar file, 1.2MB > > bzip2ed tar file vs 660KB raw, 270KB zipped word list), but it is > > still almost trivial on today's computers and huge disk drives to > > store the complete messages. > For one user, fine - but in a setting where you've got multiple > users, say, using an IMAP server? Many hosting companies only offer 5 or 10 megs of email space with their "basic" accounts. From anthony@interlink.com.au Mon Oct 21 05:38:14 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 14:38:14 +1000 Subject: [Spambayes] expiration ideas. In-Reply-To: Message-ID: <200210210438.g9L4cE608354@localhost.localdomain> > Many hosting companies only offer 5 or 10 megs of email space with > their "basic" accounts. *nod* I think our webmail is < 50M or so.... From tim.one@comcast.net Mon Oct 21 06:57:09 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 21 Oct 2002 01:57:09 -0400 Subject: [Spambayes] expiration ideas. In-Reply-To: <200210201250.g9KCo5C03237@localhost.localdomain> Message-ID: [Anthony Baxter] > Just thinking again about expiration, and wondering if the following > would work: > > When training new data (say a new week's worth), train it with a > new classifier ("interim"). Once it's trained, merge the interim > classifier's wordinfo into your master classifier wordinfo by adding > the new spamcounts and hamcounts to the master wordinfo blob, then > recalc probabilities. 
> > Keep the "interim" wordinfo around (gzipped, datestamped) until your > expiration time is up - then undo the earlier merge, subtracting > the spamcount/hamcounts. > > Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll > play with this tomorrow (ah, leave....) It's sure the most principled idea I've heard, in that it would always leave the database corresponding exactly with *some* real-world collection of msgs. OTOH, what's the purpose of expiration? I can think of two: 1. To reduce database size. 2. To accelerate adaptation to changes in ham and/or spam. I don't know that #2 is a real problem, and there's some reason to doubt it. Over the weekend, I tried my c.l.py ham + bruceg spam classifier on newer data Greg Ward harvested from all non-personal python.org traffic (which turns out to be partly untrue: python.org also hosts a few small & unadvertised "hobby lists" I didn't know about, and they count as "personal email" to me). Anyway, the c.l.py classifier had a very high FP rate, and especially on the "hobby list" traffic. But its FN rate was identical to that of a classifier trained from scratch on the new data: 1 FN, under chi's rules for FN. This suggests that everyone is right in believing that spam is much the same. So far as changes in ham go, it suggests that a significantly new source of ham needs to be trained on ASAP, lest it be viewed as spam. About #1, there are lots of things that haven't been tested properly, the most obvious being to purge unique words from the database immediately after training. That should cut the database size in half with one quick and easy stroke. Whether it hurts performance is unknown. At the start, my favorite gimmick was embodied in the atime attr of WordInfo records: remember the most recent time a word was used in scoring, and get rid of words that haven't been used "recently". If they're not being used, then getting rid of them can't affect accuracy.
It addresses both #1 and #2, but #1 on a revolving-door basis, and #2 in only a very weak sense. From anthony@interlink.com.au Mon Oct 21 07:06:53 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 16:06:53 +1000 Subject: [Spambayes] expiration ideas. In-Reply-To: Message-ID: <200210210606.g9L66sO08825@localhost.localdomain> >>> Tim Peters wrote > OTOH, what's the purpose of expiration? I can think of two: > > 1. To reduce database size. > > 2. To accelerate adaptation to changes in ham and/or spam. The former. I'm trying to think about how this could be deployed "in the real world". Note also that I'm not so much worried about adapting to spam as adapting to changing ham patterns. I know that my own email changes over time (for instance, until this project started, I doubt the word "Nigerian" would have been considered a strong ham indicator for me :) (somewhat off-topic, but related: I also suspect that if the spambayes code is vulnerable to being deliberately sabotaged, it'll be the tokeniser that's the weak point, not the classifier. For instance, I already have a couple of persistent FNs with message bodies entirely encoded in javascript. I don't want to think about having to decode javascript or run it to check if something's spam.) I'm somewhat nervous of the "purge all unique words" approach - one obvious failing is that it means if you _are_ doing ongoing training, you'd want to batch up a bunch of messages. I'm also not sure that deliberately perverting the real world in that way isn't going against the "stupid beats smart" meta-rule that's served us so far... but-then-maybe-stupider-beats-stupid, too. Anthony. From anthony@interlink.com.au Mon Oct 21 07:30:02 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 16:30:02 +1000 Subject: [Spambayes] expiration ideas. 
In-Reply-To: <200210210606.g9L66sO08825@localhost.localdomain> Message-ID: <200210210630.g9L6U3809108@localhost.localdomain> >>> Anthony Baxter wrote > Note also that I'm not so much worried about adapting to spam as > adapting to changing ham patterns. I know that my own email changes > over time (for instance, until this project started, I doubt the word > "Nigerian" would have been considered a strong ham indicator for me :) Another thought - if we were to ship a package with a small "starter" wordinfo dict, it would be very good if this was gradually expired out. Two reasons I can think of: the gradually adapting wordinfo will end up better representing the user's real usage, plus it means anyone out there starting with a standard wordinfo won't be vulnerable to spammers picking up words with high hamprob and deliberately inserting them into their spam. I imagine it's highly possible we'll start seeing things like 'wrote:' appearing, I'm already seeing spam with 'Re: ' in the subject (but as yet, no 'In-reply-to' headers...) Anthony From anthony@interlink.com.au Mon Oct 21 09:44:37 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 18:44:37 +1000 Subject: [Spambayes] Slice o' life In-Reply-To: Message-ID: <200210210844.g9L8icP10231@localhost.localdomain> >>> Tim Peters wrote > [Rob W.W. Hooft] > > Correlations, correlations, correlations. It all boils down to > > correlations. Not the fact that there are correlations, but that they > > are very, very different from one clue to the next. All these mailman > > clues are correlated. And by not downweighting them, we're blinding the > > procedure to the other clues that do not come by the dozens... > > It's not even that they're Mailman clues, though, it's more that python.org > specifically already has strong anti-spam and anti-virus measures in place. 
> That's how these "Mailman clues" earned their very low spamprobs to begin > with -- it's not that Mailman is stopping spam, it's that virtually all the > Mailman lists I'm on go through python.org. For an additional data point - if I turn on mine_received_headers, one of the clues that shows up in a lot of very very low-prob fn's is received lines with mail.python.org. So stripping out just the mailman headers won't help. This also shows up with the footers of the messages that do make it past Greg to python-list. The .sig at the end shows up as strong ham clues. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Mon Oct 21 10:34:55 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 21 Oct 2002 19:34:55 +1000 Subject: [Spambayes] cancellation disease again? Message-ID: <200210210934.g9L9Ytk10743@localhost.localdomain> I think I'm seeing what's been referred to as cancellation disease again, using chi combining. I'm getting very very long spams (like those interminable MLMs with the "5 reports") that are getting both *H* and *S* scores at or near 1, and a final score of 0.5. E.g.
the perfectly standard "send money for 5 reports" spam gets: prob = 0.500000000004 prob('*H*') = 1 prob('*S*') = 1 prob('sent:') = 0.000670741 prob('indeed') = 0.00248756 prob('place.') = 0.0025729 prob('obviously') = 0.00272893 prob('missed') = 0.0033358 prob('persistent') = 0.00378469 prob('replaced') = 0.00455005 prob('something.') = 0.00542823 prob('george') = 0.00617284 prob('think.') = 0.00617284 prob('"no') = 0.0065312 prob('happen.') = 0.00672646 prob('him.') = 0.00672646 prob('basically,') = 0.00819672 prob('key.') = 0.00850662 prob('"why') = 0.00884086 prob('correctly,') = 0.00920245 prob('sorry.') = 0.00920245 prob('"just') = 0.00959488 prob('hopes') = 0.00959488 prob('initially') = 0.00959488 prob('(so') = 0.0104895 prob('it;') = 0.0104895 prob('assumes') = 0.0110024 prob('at.') = 0.0110024 prob('everyone,') = 0.0110024 prob('myself.') = 0.0110024 prob('determining') = 0.0115681 prob('problem).') = 0.0115681 prob('falling') = 0.0121951 prob('received,') = 0.0121951 prob("don't,") = 0.012894 prob('stanford') = 0.012894 prob('struggling') = 0.012894 prob('directions') = 0.0136778 prob('jonathan') = 0.0145631 prob('portland,') = 0.0145631 prob('privately') = 0.0145631 prob('sometime') = 0.0145631 prob('saying,') = 0.0155709 prob('gained') = 0.0180723 prob('sized') = 0.0180723 prob('belief') = 0.0196507 prob('fortunately,') = 0.0196507 prob('goodness') = 0.0196507 prob('encounters') = 0.0215311 prob('scratch.') = 0.0215311 prob('trash') = 0.0215311 prob('build.') = 0.0238095 prob('exactly,') = 0.0238095 prob('invested') = 0.0238095 prob('pressed') = 0.0238095 prob('me;') = 0.0266272 prob('work...') = 0.0266272 prob('financially') = 0.973743 prob('responses.') = 0.974053 prob('money.') = 0.974232 prob('ordering') = 0.974234 prob('wolf') = 0.974514 prob('remember,') = 0.974677 prob('residual') = 0.975263 prob('guidelines') = 0.975736 prob('downline') = 0.97619 prob('investing') = 0.976423 prob('response,') = 0.976574 prob('investment') = 0.976809 
prob('goes,') = 0.976946 prob('pencil') = 0.9772 prob('me!') = 0.977468 prob('envelope') = 0.977667 prob('involved.') = 0.97782 prob('recession') = 0.978047 prob('following.') = 0.978188 prob('lately.') = 0.978188 prob('legal.') = 0.978188 prob('receive,') = 0.97834 prob('in!') = 0.978469 prob('devoted') = 0.978815 prob('orders') = 0.979431 prob('wife,') = 0.979852 prob('purchase') = 0.979994 prob('subject:YOUR') = 0.980474 prob('tested,') = 0.980495 prob('plan.') = 0.980747 prob('materials') = 0.981284 prob('friend,') = 0.981371 prob('opportunity.') = 0.981474 prob('$5,000') = 0.981496 prob('income,') = 0.981928 prob('$50,000') = 0.981962 prob('gambling') = 0.982318 prob('$25') = 0.982672 prob('chicago,') = 0.982771 prob('secrets') = 0.982771 prob('resell') = 0.982897 prob('letter,') = 0.983163 prob('#4.') = 0.983271 prob('e-mails') = 0.983483 prob('currency)') = 0.983805 prob('instructed') = 0.984241 prob('live.') = 0.984241 prob('success:') = 0.985 prob('exceedingly') = 0.985702 prob('her,') = 0.985702 prob('reach.') = 0.98603 prob('earn') = 0.986397 prob('e-mailed') = 0.986405 prob('profits.') = 0.986641 prob('e-mail,') = 0.987065 prob('profit!') = 0.987106 prob('subject:Money') = 0.987106 prob('500,000') = 0.987464 prob('invaluable') = 0.987784 prob('independent.') = 0.988086 prob('marketing,') = 0.988432 prob('crammed') = 0.988647 prob('mitchell.') = 0.988647 prob('p.o.') = 0.988647 prob('prohibiting') = 0.988989 prob("'knew'") = 0.988998 prob("so'") = 0.988998 prob('orders,') = 0.989157 prob('profitable') = 0.989427 prob('reports!') = 0.98951 prob('ordered.') = 0.990472 prob('advertise.') = 0.990959 prob('imagined.') = 0.991185 prob('originator') = 0.991185 prob('$500,000') = 0.991603 prob("1,000's") = 0.991603 prob('feet.') = 0.991603 prob('grumbled') = 0.991603 prob('50,000') = 0.991984 prob('concealed') = 0.991984 prob('year!!!') = 0.992846 prob('refinance') = 0.993653 prob('accurately!') = 0.994148 prob('cash,') = 0.994148 prob('relax,') = 0.994297 
prob('spouting') = 0.994438 prob('instructed.') = 0.994572 prob('jody') = 0.994572 prob('merciless') = 0.994572 prob('(u.s.') = 0.994822 prob('income') = 0.994933 prob('multilevel') = 0.994938 prob('ordering,') = 0.995156 prob('e-mails.') = 0.995258 prob('money!') = 0.99579 prob('message-id:@yarrina.connect.com.au') = 0.998453 I'm not sure what the best way to approach this is.... Anthony From tim.one@comcast.net Mon Oct 21 16:37:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 21 Oct 2002 11:37:57 -0400 Subject: [Spambayes] cancellation disease again? In-Reply-To: <200210210934.g9L9Ytk10743@localhost.localdomain> Message-ID: [Anthony Baxter] > I think I'm seeing what's been referred to as cancellation disease again, > using chi combining. I'm getting very very long spams (like those > interminable MLMs with the "5 reports" that are getting both *H* and > *S* scores at or near 1, and a final score of 0.5. > > E.g. the perfectly standard "send money for 5 reports" spam gets: > > prob = 0.500000000004 In "cancellation disease", "cancellation" does indeed refer to msgs with huge numbers of both low-spamprob and high-spamprob words, and that's a property of the msg in conjunction with the state of your training data -- cancellation can't be stopped. "Diseased" refers to a scheme that infers certainty when given such a msg. For example, Graham-combining is diseased in this way -- it would have scored this msg 0.0 or 1.0, and it's hard to predict which. chi-combining reliably scores such msgs smack in its middle ground, which is the best that can be done -- chi is confused, and it knows it's confused, and it tells you it's confused. > ... > I'm not sure what the best way to approach this is.... Middle-ground schemes *have* a middle ground -- that's their point . You have to be aware of their middle ground. I set up Sean/Mark's Outlook GUI to move chi middle-ground msgs into an Unsure folder. 
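A minimal sketch of the chi-squared combining under discussion (following Gary Robinson's formulation as used by spambayes; this standalone version is illustrative, not the project's actual classifier code):

```python
import math

def chi2Q(x2, v):
    # P(X >= x2) for a chi-squared variable with v degrees of freedom,
    # v even -- the closed form exp(-m) * sum(m**i / i!) with m = x2/2.
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # H ~ 1 when the word probs look uniformly hammy, S ~ 1 when they
    # look uniformly spammy.  A msg loaded with strong clues in *both*
    # directions drives H and S toward 1 together ("cancellation"),
    # and the final score lands in the 0.5 middle ground.
    n = len(probs)
    H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

Feed it twenty uniformly spammy clues and the score is near 1; feed it equal numbers of strong ham and spam clues and it returns essentially 0.5, which is exactly the behavior Anthony reports.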
For python.org use, chi middle-ground msgs will be kicked out for human review. If you lack a mechanism like that, I suppose the best you can do is pass them on (if you hate FP more than FN), or call them spam (if you hate FN more than FP), or decide that 0.000000000004 over 0.5 means spam is the best guess (if you're determined to wish away reality ). In any case, after a correct classification is known, you should add it to your training data. Over time, the word spamprobs will change accordingly. The "5 reports" spams I have in my personal-email classifier score with an internal H of 0 and an internal S of 1, for a final score of 1.

From neale@woozle.org Mon Oct 21 20:21:20 2002
From: neale@woozle.org (Neale Pickett)
Date: 21 Oct 2002 12:21:20 -0700
Subject: [Spambayes] Testing against someone else's corpora (Was: There Can Be Only One)
In-Reply-To: References: Message-ID:

I bet you thought I'd forgotten about this :)

So then, Tim Peters is all like:

> [Tim]
> >> 3. Is it possible to "seed" a database with somebody else's data and
> >> get decent results out of the box?
>
> [Neale Pickett]
> > $FIRM has a tangible interest in the answer to this question.

[snip]

> So I'd build a custom test driver on top of TestDriver, like so:
>
> d = TestDriver.Driver()
> d.train(ham, spam) # create the seed database
> for user in users:
>     d.test(user.ham, user.spam)
> d.finishtest()
> d.alldone()

[snip]

> The output will display results for each user individually, and an aggregate
> across all users. Then you'll want to stare at the output to see how well
> it does. Come back when you get that far .

Okay. Here's my test setup. I have been collecting all the spam sent to $FIRM for the past week and a half. I'm sad to report that "all the spam" means "all incoming mail that spamassassin scored over 10". For the ten days I collected it, I got 14997 spam! If this is typical, I understand better why spam filtering is such a big deal.
The ham came from a guy who's been working here since 1998. It's every message he's sent or received since then. He claims he hand-filtered spam out of it, but I know it's not that clean from timcv runs. I'm working on hand-cleaning this and the spam corpus, but it's going to take some time.

To test things, I hand-cleaned two mailboxes of co-workers, W and B. Then I ran this code:

import TestDriver
from Options import options
import msgs

users = ("B", "W")
hamdir_template = "Data/Users/%s/Ham"
spamdir_template = "Data/Users/%s/Spam"

def drive(nsets):
    print options.display()
    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]
    d = TestDriver.Driver()
    d.train(msgs.HamStream("%s-%d" % (hamdirs[0], nsets), hamdirs),
            msgs.SpamStream("%s-%d" % (spamdirs[0], nsets), spamdirs))
    for user in users:
        hamdir = hamdir_template % user
        spamdir = spamdir_template % user
        d.test(msgs.HamStream(hamdir, [hamdir]),
               msgs.SpamStream(spamdir, [spamdir]))
    d.finishtest()
    d.alldone()

drive(2)

So, here's the output:

[TestDriver]
show_histograms = True
show_best_discriminators = 30
nbuckets = 200
spam_cutoff = 0.560
pickle_basename = class
show_ham_lo = 1.0
show_false_negatives = True
best_cutoff_fn_weight = 1.00
ham_cutoff = 0.560
show_spam_hi = 0.0
show_unsure = False
show_spam_lo = 1.0
save_trained_pickles = False
show_ham_hi = 0.0
show_false_positives = True
spam_directories = Data/Spam/Set%d
percentiles = 5 25 75 95
compute_best_cutoffs_from_histograms = True
best_cutoff_fp_weight = 10.00
show_charlimit = 3000
best_cutoff_unsure_weight = 0.20
ham_directories = Data/Ham/Set%d
save_histogram_pickles = False
[CV Driver]
build_each_classifier_from_scratch = False
[Tokenizer]
mine_received_headers = False
octet_prefix_size = 5
generate_long_skips = True
count_all_header_lines = False
check_octets = False
ignore_redundant_html = False
basic_header_tokenize = True
safe_headers = abuse-reports-to date errors-to from
importance in-reply-to message-id mime-version organization received reply-to return-path subject to user-agent x-abuse-info x-complaints-to x-face basic_header_skip = received x-.* delivered-to date basic_header_tokenize_only = False retain_pure_html_tags = False [Classifier] use_mixed_combining = False robinson_probability_x = 0.5 robinson_minimum_prob_strength = 0.1 robinson_probability_s = 0.45 use_chi_squared_combining = False max_discriminators = 150 mixed_combining_chi_weight = 0.9 -> Training on Data/Ham/Set1-2 & Data/Spam/Set1-2 ... 400 hams & 400 spams -> Predicting Data/Users/B/Ham & Data/Users/B/Spam ... -> tested 121 hams & 23 spams against 400 hams & 400 spams -> false positive %: 7.43801652893 -> false negative %: 0.0 -> unsure %: 0.0 -> cost: $90.00 -> 9 new false positives [snip] -> 0 new false negatives -> 0 new unsure best discriminators: 'edit' 42 0.0564005 'to:skip:w 10' 43 0.370886 'header:Received:4' 44 0.00169875 'subject:PERFORCE' 44 0.00585176 'subject:change' 44 0.00585176 'subject:review' 44 0.00570342 'to:skip:B 10' 44 0.0412844 '...' 
46 0.181134 'message-id:@horus.inside.$FIRM' 46 0.00556242 'your' 46 0.758353 'affected' 47 0.00570342 'message-id:skip:h 20' 48 0.0416277 'precedence:bulk' 48 0.0429152 'header:MIME-Version:1' 49 0.346045 'url:com' 49 0.761515 'you' 49 0.650341 'content-type:plain' 50 0.177419 'from' 50 0.691328 'change' 52 0.267149 'this' 53 0.655698 'proto:http' 54 0.738164 'files' 57 0.125668 'header:Message-Id:1' 60 0.72845 'message-id:skip:2 20' 60 0.724415 'header:Message-ID:1' 64 0.298295 'from:email addr:$FIRM>' 72 0.00825756 'from:skip:w 10' 76 0.0214323 'return-path:skip:w 10' 98 0.038085 'header:Return-Path:1' 121 0.685963 'content-type:text/plain' 124 0.272913 -> Ham scores for this pair: 121 items; mean 35.83; sdev 13.31 -> min 17.1664; median 32.0866; max 71.2362 -> percentiles: 5% 20.1379; 25% 24.186; 75% 45.0782; 95% 60.3254 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 1 * 17.5 0 18.0 0 18.5 2 ** 19.0 0 19.5 2 ** 20.0 4 **** 20.5 2 ** 21.0 1 * 21.5 5 ***** 22.0 5 ***** 22.5 3 *** 23.0 2 ** 23.5 2 ** 24.0 3 *** 24.5 1 * 25.0 1 * 25.5 2 ** 26.0 6 ****** 26.5 4 **** 27.0 0 27.5 3 *** 28.0 2 ** 28.5 1 * 29.0 1 * 29.5 3 *** 30.0 0 30.5 1 * 31.0 2 ** 31.5 1 * 32.0 2 ** 32.5 0 33.0 0 33.5 1 * 34.0 0 34.5 0 35.0 2 ** 35.5 1 * 36.0 1 * 36.5 0 37.0 2 ** 37.5 0 38.0 3 *** 38.5 0 39.0 0 39.5 0 40.0 2 ** 40.5 1 * 41.0 1 * 41.5 0 42.0 4 **** 42.5 1 * 43.0 3 *** 43.5 3 *** 44.0 2 ** 44.5 1 * 45.0 2 ** 45.5 1 * 46.0 3 *** 46.5 0 47.0 0 47.5 2 ** 48.0 2 ** 48.5 1 * 49.0 2 ** 49.5 0 50.0 2 ** 50.5 3 *** 51.0 0 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 0 54.0 1 * 54.5 0 55.0 2 ** 55.5 0 56.0 0 56.5 0 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 2 ** 68.5 1 * 69.0 0 69.5 0 70.0 1 * 
70.5 0 71.0 1 * 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for this pair: 23 items; mean 73.88; sdev 5.94 -> min 62.9927; median 74.0114; max 82.6517 -> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 0 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 1 * 63.0 0 63.5 0 64.0 0 64.5 2 ** 65.0 1 * 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 2 ** 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 0 73.0 2 ** 73.5 1 * 74.0 2 ** 74.5 0 75.0 1 * 75.5 1 * 76.0 0 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 2 ** 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 
98.5 0 99.0 0 99.5 0 -> best cost for this pair: $2.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 4 cutoff pairs -> smallest ham & spam cutoffs 0.61 & 0.715 -> fp 0; fn 0; unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% -> largest ham & spam cutoffs 0.625 & 0.715 -> fp 0; fn 0; unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% -> Ham scores for all in this training set: 121 items; mean 35.83; sdev 13.31 -> min 17.1664; median 32.0866; max 71.2362 -> percentiles: 5% 20.1379; 25% 24.186; 75% 45.0782; 95% 60.3254 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 1 * 17.5 0 18.0 0 18.5 2 ** 19.0 0 19.5 2 ** 20.0 4 **** 20.5 2 ** 21.0 1 * 21.5 5 ***** 22.0 5 ***** 22.5 3 *** 23.0 2 ** 23.5 2 ** 24.0 3 *** 24.5 1 * 25.0 1 * 25.5 2 ** 26.0 6 ****** 26.5 4 **** 27.0 0 27.5 3 *** 28.0 2 ** 28.5 1 * 29.0 1 * 29.5 3 *** 30.0 0 30.5 1 * 31.0 2 ** 31.5 1 * 32.0 2 ** 32.5 0 33.0 0 33.5 1 * 34.0 0 34.5 0 35.0 2 ** 35.5 1 * 36.0 1 * 36.5 0 37.0 2 ** 37.5 0 38.0 3 *** 38.5 0 39.0 0 39.5 0 40.0 2 ** 40.5 1 * 41.0 1 * 41.5 0 42.0 4 **** 42.5 1 * 43.0 3 *** 43.5 3 *** 44.0 2 ** 44.5 1 * 45.0 2 ** 45.5 1 * 46.0 3 *** 46.5 0 47.0 0 47.5 2 ** 48.0 2 ** 48.5 1 * 49.0 2 ** 49.5 0 50.0 2 ** 50.5 3 *** 51.0 0 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 0 54.0 1 * 54.5 0 55.0 2 ** 55.5 0 56.0 0 56.5 0 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 2 ** 68.5 1 * 69.0 0 69.5 0 70.0 1 * 70.5 0 71.0 1 * 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 
87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all in this training set: 23 items; mean 73.88; sdev 5.94 -> min 62.9927; median 74.0114; max 82.6517 -> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 0 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 1 * 63.0 0 63.5 0 64.0 0 64.5 2 ** 65.0 1 * 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 2 ** 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 0 73.0 2 ** 73.5 1 * 74.0 2 ** 74.5 0 75.0 1 * 75.5 1 * 76.0 0 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 2 ** 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> best cost for all in this training set: $2.40 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 4 cutoff pairs -> smallest ham & spam cutoffs 0.61 & 0.715 -> fp 0; fn 0; 
unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% -> largest ham & spam cutoffs 0.625 & 0.715 -> fp 0; fn 0; unsure ham 5; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 8.33% This doesn't look--it's all over the map. However, IANAS, nor am I a Tim, so I'll leave judgement up to you fine folks. Here's W's mail: -> Predicting Data/Users/W/Ham & Data/Users/W/Spam ... -> tested 361 hams & 0 spams against 400 hams & 400 spams -> false positive %: 1.38504155125 -> false negative %: 0.0 -> unsure %: 0.0 -> cost: $50.00 -> 5 new false positives [snip] -> 0 new false negatives -> 0 new unsure best discriminators: '2002' 129 0.805927 'our' 129 0.799565 'message-----' 134 0.0121951 'subject:' 140 0.0297372 'from:' 141 0.0563603 'watchguard' 145 0.0213021 'subject:] ' 163 0.0148575 'are' 166 0.629383 'to:' 167 0.214615 'please' 180 0.839247 'from' 184 0.691328 'subject:-' 187 0.218038 'precedence:bulk' 189 0.0429152 'url:com' 196 0.761515 'your' 206 0.758353 'x-mailer:internet mail service (5.5.2653.19)' 210 0.00556242 'proto:http' 239 0.738164 'you' 239 0.650341 'to:skip:w 10' 240 0.370886 'content-type:text' 246 0.610023 'this' 281 0.655698 'content-type:charset' 287 0.33352 'content-type:plain' 306 0.177419 'return-path:skip:w 10' 312 0.038085 'from:email addr:$FIRM>' 318 0.00825756 'from:skip:w 10' 322 0.0214323 'header:MIME-Version:1' 322 0.346045 'header:Return-Path:1' 331 0.685963 'header:Message-ID:1' 358 0.298295 'content-type:text/plain' 456 0.272913 -> Ham scores for this pair: 361 items; mean 38.74; sdev 7.88 -> min 17.2567; median 39.624; max 63.0457 -> percentiles: 5% 24.5112; 25% 33.4889; 75% 44.0288; 95% 49.719 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 5 ***** 17.5 0 18.0 1 * 18.5 3 *** 19.0 0 19.5 2 ** 20.0 0 20.5 0 21.0 0 21.5 1 * 22.0 0 
22.5 0 23.0 3 *** 23.5 2 ** 24.0 1 * 24.5 2 ** 25.0 2 ** 25.5 2 ** 26.0 3 *** 26.5 2 ** 27.0 4 **** 27.5 3 *** 28.0 0 28.5 1 * 29.0 5 ***** 29.5 6 ****** 30.0 4 **** 30.5 4 **** 31.0 2 ** 31.5 5 ***** 32.0 8 ******** 32.5 9 ********* 33.0 11 *********** 33.5 6 ****** 34.0 6 ****** 34.5 11 *********** 35.0 4 **** 35.5 3 *** 36.0 6 ****** 36.5 7 ******* 37.0 5 ***** 37.5 7 ******* 38.0 6 ****** 38.5 11 *********** 39.0 14 ************** 39.5 13 ************* 40.0 7 ******* 40.5 8 ******** 41.0 7 ******* 41.5 12 ************ 42.0 15 *************** 42.5 8 ******** 43.0 14 ************** 43.5 6 ****** 44.0 14 ************** 44.5 9 ********* 45.0 15 *************** 45.5 10 ********** 46.0 2 ** 46.5 4 **** 47.0 6 ****** 47.5 7 ******* 48.0 2 ** 48.5 3 *** 49.0 3 *** 49.5 1 * 50.0 2 ** 50.5 1 * 51.0 2 ** 51.5 2 ** 52.0 0 52.5 1 * 53.0 0 53.5 1 * 54.0 1 * 54.5 1 * 55.0 2 ** 55.5 0 56.0 0 56.5 1 * 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for this pair: -> Ham scores for all in this training set: 361 items; mean 38.74; sdev 7.88 -> min 17.2567; median 39.624; max 63.0457 -> percentiles: 5% 24.5112; 25% 33.4889; 75% 44.0288; 95% 49.719 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 5 ***** 17.5 0 18.0 1 * 18.5 3 *** 
19.0 0 19.5 2 ** 20.0 0 20.5 0 21.0 0 21.5 1 * 22.0 0 22.5 0 23.0 3 *** 23.5 2 ** 24.0 1 * 24.5 2 ** 25.0 2 ** 25.5 2 ** 26.0 3 *** 26.5 2 ** 27.0 4 **** 27.5 3 *** 28.0 0 28.5 1 * 29.0 5 ***** 29.5 6 ****** 30.0 4 **** 30.5 4 **** 31.0 2 ** 31.5 5 ***** 32.0 8 ******** 32.5 9 ********* 33.0 11 *********** 33.5 6 ****** 34.0 6 ****** 34.5 11 *********** 35.0 4 **** 35.5 3 *** 36.0 6 ****** 36.5 7 ******* 37.0 5 ***** 37.5 7 ******* 38.0 6 ****** 38.5 11 *********** 39.0 14 ************** 39.5 13 ************* 40.0 7 ******* 40.5 8 ******** 41.0 7 ******* 41.5 12 ************ 42.0 15 *************** 42.5 8 ******** 43.0 14 ************** 43.5 6 ****** 44.0 14 ************** 44.5 9 ********* 45.0 15 *************** 45.5 10 ********** 46.0 2 ** 46.5 4 **** 47.0 6 ****** 47.5 7 ******* 48.0 2 ** 48.5 3 *** 49.0 3 *** 49.5 1 * 50.0 2 ** 50.5 1 * 51.0 2 ** 51.5 2 ** 52.0 0 52.5 1 * 53.0 0 53.5 1 * 54.0 1 * 54.5 1 * 55.0 2 ** 55.5 0 56.0 0 56.5 1 * 57.0 0 57.5 1 * 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 0 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 0 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all in this training set: -> Ham scores for all runs: 482 items; mean 38.01; sdev 9.62 -> min 17.1664; median 39.2174; max 71.2362 -> percentiles: 5% 21.5969; 25% 31.7042; 75% 44.2002; 95% 51.8263 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 
16.0 0 16.5 0 17.0 6 ****** 17.5 0 18.0 1 * 18.5 5 ***** 19.0 0 19.5 4 **** 20.0 4 **** 20.5 2 ** 21.0 1 * 21.5 6 ****** 22.0 5 ***** 22.5 3 *** 23.0 5 ***** 23.5 4 **** 24.0 4 **** 24.5 3 *** 25.0 3 *** 25.5 4 **** 26.0 9 ********* 26.5 6 ****** 27.0 4 **** 27.5 6 ****** 28.0 2 ** 28.5 2 ** 29.0 6 ****** 29.5 9 ********* 30.0 4 **** 30.5 5 ***** 31.0 4 **** 31.5 6 ****** 32.0 10 ********** 32.5 9 ********* 33.0 11 *********** 33.5 7 ******* 34.0 6 ****** 34.5 11 *********** 35.0 6 ****** 35.5 4 **** 36.0 7 ******* 36.5 7 ******* 37.0 7 ******* 37.5 7 ******* 38.0 9 ********* 38.5 11 *********** 39.0 14 ************** 39.5 13 ************* 40.0 9 ********* 40.5 9 ********* 41.0 8 ******** 41.5 12 ************ 42.0 19 ******************* 42.5 9 ********* 43.0 17 ***************** 43.5 9 ********* 44.0 16 **************** 44.5 10 ********** 45.0 17 ***************** 45.5 11 *********** 46.0 5 ***** 46.5 4 **** 47.0 6 ****** 47.5 9 ********* 48.0 4 **** 48.5 4 **** 49.0 5 ***** 49.5 1 * 50.0 4 **** 50.5 4 **** 51.0 2 ** 51.5 2 ** 52.0 1 * 52.5 1 * 53.0 0 53.5 1 * 54.0 2 ** 54.5 1 * 55.0 4 **** 55.5 0 56.0 0 56.5 1 * 57.0 0 57.5 2 ** 58.0 0 58.5 2 ** 59.0 0 59.5 0 60.0 2 ** 60.5 1 * 61.0 0 61.5 0 62.0 0 62.5 0 63.0 1 * 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 2 ** 68.5 1 * 69.0 0 69.5 0 70.0 1 * 70.5 0 71.0 1 * 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 0 82.0 0 82.5 0 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> Spam scores for all runs: 23 items; mean 73.88; sdev 5.94 -> min 62.9927; median 74.0114; max 82.6517 -> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079 * = 1 items 0.0 0 0.5 0 1.0 0 1.5 0 2.0 0 2.5 0 3.0 0 3.5 0 4.0 0 4.5 0 
5.0 0 5.5 0 6.0 0 6.5 0 7.0 0 7.5 0 8.0 0 8.5 0 9.0 0 9.5 0 10.0 0 10.5 0 11.0 0 11.5 0 12.0 0 12.5 0 13.0 0 13.5 0 14.0 0 14.5 0 15.0 0 15.5 0 16.0 0 16.5 0 17.0 0 17.5 0 18.0 0 18.5 0 19.0 0 19.5 0 20.0 0 20.5 0 21.0 0 21.5 0 22.0 0 22.5 0 23.0 0 23.5 0 24.0 0 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 0 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 0 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 0 54.0 0 54.5 0 55.0 0 55.5 0 56.0 0 56.5 0 57.0 0 57.5 0 58.0 0 58.5 0 59.0 0 59.5 0 60.0 0 60.5 0 61.0 0 61.5 0 62.0 0 62.5 1 * 63.0 0 63.5 0 64.0 0 64.5 2 ** 65.0 1 * 65.5 0 66.0 0 66.5 1 * 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 2 ** 70.5 0 71.0 0 71.5 1 * 72.0 0 72.5 0 73.0 2 ** 73.5 1 * 74.0 2 ** 74.5 0 75.0 1 * 75.5 1 * 76.0 0 76.5 0 77.0 0 77.5 0 78.0 1 * 78.5 2 ** 79.0 0 79.5 0 80.0 1 * 80.5 1 * 81.0 1 * 81.5 0 82.0 1 * 82.5 1 * 83.0 0 83.5 0 84.0 0 84.5 0 85.0 0 85.5 0 86.0 0 86.5 0 87.0 0 87.5 0 88.0 0 88.5 0 89.0 0 89.5 0 90.0 0 90.5 0 91.0 0 91.5 0 92.0 0 92.5 0 93.0 0 93.5 0 94.0 0 94.5 0 95.0 0 95.5 0 96.0 0 96.5 0 97.0 0 97.5 0 98.0 0 98.5 0 99.0 0 99.5 0 -> best cost for all runs: $2.60 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 4 cutoff pairs -> smallest ham & spam cutoffs 0.61 & 0.715 -> fp 0; fn 0; unsure ham 6; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 2.57% -> largest ham & spam cutoffs 0.625 & 0.715 -> fp 0; fn 0; unsure ham 6; unsure spam 7 -> fp rate 0%; fn rate 0%; unsure rate 2.57% -> all runs false positives: 14 -> all runs false negatives: 0 -> all runs unsure: 0 -> all runs false positive %: 2.90456431535 -> all runs false negative %: 0.0 -> all runs unsure %: 0.0 -> all runs cost: $140.00 The f-ps are conference 
announcements, solicited commercial email, or listserv responses. Should I set the cutoff to 0.63? Do I owe Tim and Gary $140? Sorry I can't answer these questions myself, but I've been lucky to skim subject headers on this list lately so I don't know what all this new-fangled stuff is. I realize the data is less than ideal, but it's all I can get at the moment. Aside from cleaning the training data, what should I do next? Neale From agmsmith@rogers.com Mon Oct 21 23:48:25 2002 From: agmsmith@rogers.com (Alexander G. M. Smith) Date: Mon, 21 Oct 2002 18:48:25 EDT (-0400) Subject: [Spambayes] expiration ideas. In-Reply-To: <200210210630.g9L6U3809108@localhost.localdomain> Message-ID: <2987213371-BeMail@CR593174-A> Anthony Baxter wrote: > Another thought - if we were to ship a package with a small "starter" > wordinfo dict, it would be very good if this was gradually expired > out. Two reasons I can think of: the gradually adapting wordinfo will > end up better representing the user's real usage, plus it means anyone > out there starting with a standard wordinfo won't be vulnerable to > spammers picking up words with high hamprob and deliberately inserting > them into their spam. I imagine it's highly possible we'll start seeing > things like 'wrote:' appearing, I'm already seeing spam with 'Re: ' in > the subject (but as yet, no 'In-reply-to' headers...) That's what I do, with expiry based on the age of training messages added, not the number of times used (so it's not quite as efficient, but it doesn't need to update the database every time it checks for new mail). Plus, every time I release a new version, I include a new sample database, with fresh spam (single words removed, so it's only 185KB). That works well enough to keep the users happy. I've now added an illustrated guide on how to train the system; some people didn't realise they could do that - still need to add a big red flashing button to the mail client :-). 
- Alex From tim.one@comcast.net Thu Oct 24 03:06:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 23 Oct 2002 22:06:28 -0400 Subject: [Spambayes] Foreign language spam: bug or feature? In-Reply-To: Message-ID: There's an interesting bug in the Outlook 2000 client that's absolutely nailing all the Asian spam I get, along with several other non-Asian languages. "The bug" is this, in Outlook2000/manager.py's GetBayesStreamForMessage(): body += message.Text.encode("ascii", "replace") Outlook uses Unicode internally. message.Text grabs the message body from Outlook as a Unicode string. .encode(...) is then plain Python, telling it to encode the Unicode string as a regular string, using the ascii encoding, and replacing Unicode characters that can't be represented faithfully in ascii by "a suitable replacement character". For the ascii encoding, that almost always turns out to be a question mark character, because there's almost always nothing in ascii that's truly suitable. While this may suck from a purity view, it leads to spam-clue listings like this (from a typical Asian spam): Spam Score: 1 '*H*' 0 '*S*' 1 'header:Return-Path:1' 0.611133 'header:Message-ID:1' 0.813889 '15????' 0.844828 '24????' 0.844828 '7??????' 0.844828 '&' 0.863317 'header:Mime-Version:1' 0.89556 'header:Reply-To:1' 0.90756 '10????' 0.934783 '??????!!!' 0.934783 'header:Received:2' 0.957828 '??????????)' 0.958716 '??????...' 0.965116 '????????...' 0.965116 'message-id:@cpimssmtpa05.msn.com' 0.969799 'from:email addr:korea.com>' 0.980349 '(????' 0.981928 '??.' 0.985437 'e-mail??????' 0.986322 '????,' 0.99505 '????????,' 0.995258 '??????,' 0.99545 '????????.' 0.997691 '??????????.' 0.99776 'skip:? 20' 0.998034 '????????????' 0.998192 '??????????' 0.998474 '??????' 0.998562 '????' 0.998598 '????????' 0.998672 'skip:? 
10' 0.998894

That is, languages having scant intersection with ASCII end up getting tokenized as collections of mostly question marks, and each instance of "?"*n ends up earning a high spamprob. The database burden is trivial, since there just aren't many *possible* strings consisting of nearly pure question marks, and the "skip" gimmick kicks in when a contiguous string of question marks gets long. Of course lots of '?'*n thingies in a msg are highly correlated, which in *my* personal email is helpful: spam or not, anything sent to me in a language having small intersection with ASCII may as well be spam -- there's no chance *I* can read it regardless.

If somebody would like to formalize this bug as a tokenizer option, so that non-Outlook American-English users can enjoy its benefits too, I won't object. For International Sensitivity reasons, we may have to put it in a [Dont Ask Dont Tell] .ini section .

From barry@python.org Thu Oct 24 21:51:21 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 24 Oct 2002 16:51:21 -0400
Subject: [Spambayes] Get rid of the email directory?
Message-ID: <15800.23881.157471.402911@gargle.gargle.HOWL>

We checked the email package into spambayes because we wanted to use the new api and avoid the bugs which were present in earlier versions of the library. Python 2.2.2 and Python 2.3a0 have the same, latest version of the email package now, so I think this directory isn't necessary, and may be harmful. I'd like to remove it, but it means if you're running Python 2.2.1, you'll need to upgrade.

Any objections?
-Barry

From popiel@wolfskeep.com Thu Oct 24 22:33:17 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 24 Oct 2002 14:33:17 -0700
Subject: [Spambayes] Get rid of the email directory?
In-Reply-To: Message from barry@python.org (Barry A. Warsaw) of "Thu, 24 Oct 2002 16:51:21 EDT."
<15800.23881.157471.402911@gargle.gargle.HOWL> References: <15800.23881.157471.402911@gargle.gargle.HOWL> Message-ID: <20021024213317.7D3B5F599@cashew.wolfskeep.com> In message: <15800.23881.157471.402911@gargle.gargle.HOWL> barry@python.org (Barry A. Warsaw) writes: > >We checked the email package into spambayes because we wanted to use >the new api and avoid the bugs which were present in earlier versions >of the libary. Python 2.2.2 and Python 2.3a0 have the same, latest >version of the email package now, so I think this directory isn't >necessary, and may be harmful. I'd like to remove it, but it means if >you're running Python 2.2.1, you'll need to upgrade. I'd rather you didn't remove the directory, since python 2.2.2 is not easily available for debian woody. (It appears that 2.2.2 has only been packaged in the unstable branch... which many of us assiduously avoid.) - Alex From dereks@itsite.com Thu Oct 24 23:35:14 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Thu, 24 Oct 2002 15:35:14 -0700 (PDT) Subject: [Spambayes] Get rid of the email directory? In-Reply-To: <15800.23881.157471.402911@gargle.gargle.HOWL> Message-ID: > Any objections? I object because > [...] if you're running Python 2.2.1, you'll need to upgrade. and I intend on using hammie.py in a production environment where upgrading Python would be a big deal. Please let one or two more minor-version releases come out before removing the directory. Just my $0.02. --Derek From tim.one@comcast.net Fri Oct 25 15:56:44 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 10:56:44 -0400 Subject: [Spambayes] Foreign language spam: bug or feature? In-Reply-To: Message-ID: [Tim, remarks about an Outlook client "bug" that caused Asian spam to get nailed via replacing most high-bit chars with question marks, leading to clue lists like this one: ] > Spam Score: 1 > > '*H*' 0 > '*S*' 1 > 'header:Return-Path:1' 0.611133 > 'header:Message-ID:1' 0.813889 > '15????' 0.844828 > '24????' 
0.844828 > '7??????' 0.844828 > '&' 0.863317 > 'header:Mime-Version:1' 0.89556 > 'header:Reply-To:1' 0.90756 > '10????' 0.934783 > '??????!!!' 0.934783 > 'header:Received:2' 0.957828 > '??????????)' 0.958716 > '??????...' 0.965116 > '????????...' 0.965116 > 'message-id:@cpimssmtpa05.msn.com' 0.969799 > 'from:email addr:korea.com>' 0.980349 > '(????' 0.981928 > '??.' 0.985437 > 'e-mail??????' 0.986322 > '????,' 0.99505 > '????????,' 0.995258 > '??????,' 0.99545 > '????????.' 0.997691 > '??????????.' 0.99776 > 'skip:? 20' 0.998034 > '????????????' 0.998192 > '??????????' 0.998474 > '??????' 0.998562 > '????' 0.998598 > '????????' 0.998672 > 'skip:? 10' 0.998894 MarkH subsequently fixed that bug by accident , while greatly speeding the Outlook operations and making the Outlook client more robust. My Asian spam is *still* nailed, but via clue lists like this now: 'skip:\x92 40' 0.958716 'skip:\x95 40' 0.958716 'skip:\x96 30' 0.958716 'skip:\x93 30' 0.965116 'skip:\x93 50' 0.965116 '8bit%:58' 0.969799 'skip:\x82 10' 0.969799 'skip:\x83 30' 0.969799 'skip:\x8d 30' 0.969799 'skip:\x93 20' 0.969799 'subject:==?=' 0.969799 'skip:\x81 60' 0.973373 'skip:\x93 10' 0.973373 'url:jp' 0.973373 'skip:\x81 10' 0.97619 'skip:\x81 40' 0.97619 'skip:\x82 30' 0.97619 'subject:GyRCTCQ' 0.97619 'subject:iso' 0.978469 '8bit%:69' 0.980349 'skip:\x81 30' 0.980349 'skip:\x81 20' 0.981928 '8bit%:97' 0.983271 '8bit%:72' 0.988432 '8bit%:83' 0.990405 '8bit%:87' 0.990798 '8bit%:91' 0.991159 '8bit%:81' 0.99236 '8bit%:56' 0.993274 '8bit%:88' 0.994148 '8bit%:68' 0.9947 '8bit%:85' 0.9947 '8bit%:94' 0.994822 '8bit%:50' 0.994938 '8bit%:80' 0.995258 '8bit%:75' 0.99545 'subject:=?' 0.996151 '8bit%:86' 0.996562 '8bit%:93' 0.99776 '8bit%:100' 0.998375 The downside for me is that the database size took a significant hit, just because there are a lot more potential "skip" tokens than strings of question marks. 
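The two tokenizer schemes being compared come down to a few lines. A rough sketch of both ideas, with hypothetical helper names and a length bucket of 10 inferred from clues like 'skip:? 20' (an illustration, not the actual spambayes source):

```python
import re

HIGHBIT = re.compile(r"[\x80-\xff]")

def replace_nonascii(text):
    # The accidental Outlook scheme: every high-bit character becomes
    # '?', so text in a non-ASCII language tokenizes as runs of '?'.
    return HIGHBIT.sub("?", text)

def tokenize_word(word, max_len=12):
    # Short words are tokens as-is; overly long words collapse into a
    # 'skip:<first char> <length bucketed by 10>' metatoken, so very
    # few distinct skip tokens are possible and the database stays small.
    if len(word) <= max_len:
        yield word
    else:
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

# A 23-character high-bit word under both schemes combined:
print(list(tokenize_word(replace_nonascii("\xbe" * 23))))  # ['skip:? 20']
```

The bucketing is what keeps the database burden trivial: 'skip:? 20' covers every question-mark run of length 20 through 29.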
WRT correlation effects, a msg that has an 8bit% metatoken under this scheme is likely to have lots of them, but is also likely to have lots of distinct '?'*n tokens under the other scheme; in both cases, counting them all as distinct clues actually helps nail this stuff as spam. Unless someone has a strong objection, I expect to introduce a new option: """ [Tokenizer] # If true, replace high-bit characters (ord(c) >= 128) and # control characters with question marks. This allows # non-ASCII character strings to be identified with little # training and small database burden. It's appropriate only # if your ham is plain 7-bit ASCII, or nearly so, so that # the mere presence of non-ASCII character strings is known # in advance to be a strong spam indicator. replace_nonascii_chars: False """ From jeremy@zope.com Fri Oct 25 17:02:33 2002 From: jeremy@zope.com (Jeremy Hylton) Date: Fri, 25 Oct 2002 12:02:33 -0400 Subject: [Spambayes] pop3proxy bug? (resend) Message-ID: I'm resending this message because python.org rejected it the first time around. To: richiehindle@users.sourceforge.net Cc: spambayes@python.org Subject: pop3proxy bug? Reply-to: jeremy@alum.mit.edu I tried to use pop3proxy.py but it failed every time it tried to send data to the real pop server. The traceback printed by asyncore is this: error: uncaptured python exception, closing channel (exceptions.IOError:[Errno 9] Bad file descriptor [/usr/local/lib/python2.2/asyncore.py|poll|99] [/usr/local/lib/python2.2/asyncore.py|handle_read_event|396] [/usr/local/lib/python2.2/asynchat.py|handle_read|130] [/home/jeremy/src/spambayes/pop3proxy.py|found_terminator|187]) When I look at the code, I don't see how it could ever work :-(. found_terminator() is calling self.serverFile.write(), but self.serverFile was produced by calling makefile() on a socket. With makefile() you get either a readable file or a writeable file. pop3proxy.py is using makefile() with no arguments so it gets a readable file. 
There's no way to write to this file. I changed the code to use the raw server socket and sendall() instead of self.serverFile.write() and it worked. But I'm uneasy. Did you ever test this code? Jeremy From tim.one@comcast.net Fri Oct 25 17:36:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 12:36:08 -0400 Subject: [Spambayes] Foreign language spam: bug or feature? In-Reply-To: Message-ID: [Tim] > ... > Unless someone has a strong objection, I expect to introduce a new option: > > """ > [Tokenizer] > # If true, replace high-bit characters (ord(c) >= 128) and > # control characters with question marks. This allows > # non-ASCII character strings to be identified with little > # training and small database burden. It's appropriate only > # if your ham is plain 7-bit ASCII, or nearly so, so that > # the mere presence of non-ASCII character strings is known > # in advance to be a strong spam indicator. > replace_nonascii_chars: False > """ This has been added, and is False by default. However, it's True by default for users of the Outlook 2000 client, since I can't remember the last time Mark or Sean asked me a question in Korean. From jeremy@alum.mit.edu Fri Oct 25 22:19:25 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 17:19:25 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment Message-ID: <15801.46429.507385.482352@slothrop.zope.com> I don't know if anyone else on Earth wants to manage their mail the same way I do. I've made some progress on hooking my mail up to spambayes, however, and wanted to report on the deployment issues. I read my mail with VM, an emacs mail reader. My mail collects on a couple of POP servers, and I fetch the mail directly from the POP servers using VM. I addressed the following issues: - Incremental training from VM folders - Scoring via a POP proxy - Management of training data using ZODB (I don't know if the last part was necessary or not, but I wanted to use ZODB.
I think it's simplified some things.) The runtime environment is fairly complicated. It's got more moving parts than I would like, but I don't know how to eliminate any of them. It's also slower than I would like, but I haven't done enough profiling to really understand why. There are a few open issues: - It was hard to use the classifier module with ZODB because of the __slots__. I ended up using the WordInfo objects unchanged, and __slots__ there helped minimize storage. But I wanted to make the Bayes class persistent and I couldn't do that because of the slots. Since there's only a single Bayes instance, I can't see why it needs to use __slots__. - I thought it would be nice if spambayes was a package, so I could separate it from my code. It can't work as a package, though, because it contains a copy of the email package. When I turned spambayes into a package, it ended up treating email as a subpackage. My apps ended up getting two copies of the email package loaded -- one from the std library and one as a subpackage of spambayes. The duplication broke a bunch of isinstance() tests. - Configuration. It would be nice to use the existing options framework and extend it with application-specific options (like the POP ports, the ZEO server location, etc.). It isn't clear what the best way to extend Options is. The different components involved in the setup are: - A ZEO server managing a ZODB database. I have a long-running ZEO server process. By using ZEO, multiple clients can access the database at the same time. Clients connect to the server using a Unix domain socket. - A persistent mail profile based on VM folders. The profile is stored in the database. A VM folder is just a Unix mailbox. A config file contains a list of folders that contain ham and a list of folders that contain spam. The profile manages these folders and a spambayes classifier. - A training program, update.py. The training program scans the folders listed in the profile.
When it finds new messages, it learns from them. When it finds that a message was deleted, it unlearns it. This process is incremental, but it depends on the mailbox module to parse the folders. The parsing is definitely slow -- especially for large folders. - A POP3 proxy I wrote my own proxy based on SocketServer.ThreadingTCPServer. I don't like the asynchat style of programming, and I was having trouble integrating pop3proxy with ZEO. They both use ZEO, but the way they use them seemed to be causing deadlocks :-(. The proxy uses the same strategy as pop3proxy, intercepting messages and adding a spam score header. I add a header like this: From: Martijn Pieters To: (Zope.Com Geeks) Cc: sa@zope.com Subject: [Zope.Com Geeks] Zope.org storage server was down.. Date: Fri, 25 Oct 2002 17:10:42 -0400 X-Spambayes: 0.001 The proxy doesn't do anything other than add the header. - A set of VM filters and tools for handling spam and training. I wrote some little elisp functions. One saves a message to the spam training folder and deletes it. Another saves a message to the ham training folder, but does not delete it. A third pipes it to a small Python script that prints out the evidence for a message. The next step is to add autofoldering rules that file spam above a certain threshold to the spam folder and messages in the middle to an unsure folder. That's a standard VM thing, but I haven't done it yet. The total code base is about 2000 lines of code, half of it in the POP proxy. I'd be happy to check it in to the spambayes project if anyone else wants to try to use parts of it. Jeremy From dereks@itsite.com Fri Oct 25 22:57:08 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 25 Oct 2002 14:57:08 -0700 (PDT) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: Where did you get your initial training corpses... carpals... um, collections of email? Just personal stuff lying around?
I am still after a nice "real world" hammie.db. (I'll buy a pizza for the first person to send me a good .db file, just include your address, topping list, and the phone number of your favorite local pizza joint in a private email to me.) Not having a nice .db to start out with seems like a pretty heavy barrier for [potential] new users. We need to go searching through undocumented code just to figure out how to play with it. Thanks, Derek From popiel@wolfskeep.com Fri Oct 25 22:58:35 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 14:58:35 -0700 Subject: [Spambayes] Where are we heading? Message-ID: <20021025215835.57F16F5A4@cashew.wolfskeep.com> It seems like all the work for the last week or so has been on integration of the classifier with end-user deployments (clients, mailing list filters, whathaveyou). Have we reached the point where we're no longer interested in this as a research project, but instead as a useful tool? If so, I suggest that we may want to rewrite the whole thing from scratch, after actually deciding on a usage model or two. Choosing the algorithms to use (gary-combining or chi-square?) would be good, too. What we've got now is a decent prototype, but it lacks quite a bit as a finished tool... there are a lot of issues with database storage (what should be in it, how it should be stored, etc.) and options management, just to name two of the hotspots. Personally, I'm still interested in the research aspects; once I get another two free hours to rub together, I'm going to see if I can deal with some of the mail decoding issues in the tokenizer (the unencoded mailing-list footer appended to a base64 body, to be specific). There's also a few experiments I'd like to see revisited: the time of delivery stuff might be interesting to test on multiple corpora, as an example (since my spam does not seem to be evenly spread throughout the day, unlike the original experimenter's spam). 
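If anyone wants to rerun the time-of-delivery experiment, the clue generation itself is nearly a one-liner: pull the hour out of the Date header and emit it as a metatoken, letting training decide whether 3am mail is spammy for your corpus. A sketch (a hypothetical tokenizer addition, nothing like this exists in spambayes today):

```python
import email.utils

def hour_token(date_header):
    # Map an RFC 2822 Date header to a coarse 'hour:NN' metatoken.
    # parsedate() returns the time as written, i.e. the sender's local
    # clock, which is what matters if ham clusters in waking hours
    # while spam arrives around the clock.
    parsed = email.utils.parsedate(date_header)
    if parsed is None:
        return "hour:invalid"       # an unparseable date is a clue too
    return "hour:%02d" % parsed[3]  # index 3 is tm_hour

print(hour_token("Fri, 25 Oct 2002 17:10:42 -0400"))  # hour:17
```

Whether such tokens help or just add correlated noise is exactly the kind of question the usual cross-validation runs would answer.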
- Alex From popiel@wolfskeep.com Fri Oct 25 23:21:38 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 15:21:38 -0700 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from Derek Simkowiak of "Fri, 25 Oct 2002 14:57:08 PDT." References: Message-ID: <20021025222138.EC47BF5A4@cashew.wolfskeep.com> In message: Derek Simkowiak writes: > > Where did you get your initial training corpses... carpals... >um, collections of email? Just personal stuff lying around? I personally get my corpora by adding a procmail entry to save all my incoming email to a folder that I never touch, before doing any other filing on it. Then, as I process my mail, I move any spam I get into a spam folder. The spam folder acts as my spam corpus, and the everything - spam stuff acts as my ham corpus. Do this for about a month, and you should have some decent size corpora. (It took me about a month and a half to get above the 2000 ham and 2000 spam limit that Tim set for doing algorithm shootouts. :-) ) > I am still after a nice "real world" hammie.db. (I'll buy a pizza >for the first person to send me a good .db file, just include your >address, topping list, and the phone number of your favorite local pizza >joint in a private email to me.) I think sharing dbs is actually a very _BAD_ idea. Sure, it saves some initial effort, but it encourages a tendency to just take the stock db and never retrain. One of the things I like most about this system is how easily and automatically it customizes itself to your personal mail patterns... which means that spammers will have a harder time defeating it (since there's no single widespread db to defeat). > Not having a nice .db to start out with seems like a pretty heavy >barrier for [potential] new users. We need to go searching through >undocumented code just to figure out how to play with it. I agree that the documentation needs to be improved, if this is to be used by anyone other than researchers. 
I don't think that providing a starter db is the right way to make up for the lack of documentation. :-) - Alex From jeremy@alum.mit.edu Fri Oct 25 23:27:42 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 18:27:42 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: <15801.50526.459467.387029@slothrop.zope.com> >>>>> "DS" == Derek Simkowiak writes: DS> Where did you get your initial training corpses... carpals... DS> um, collections of email? Just personal stuff lying around? I started with a few messages from my existing VM folders. I've also got two training folders that I just created. I'm adding any message that wasn't classified correctly to the training folder. For example, if a ham comes in and its score isn't < 0.10, I'm training on it. Same for spam, but the min score is 0.95. I've got some new key bindings that automatically save messages in the appropriate folder. DS> I am still after a nice "real world" hammie.db. (I'll buy a DS> pizza DS> for the first person to send me a good .db file, just include DS> your address, topping list, and the phone number of your DS> favorite local pizza joint in a private email to me.) I don't think you want someone else's database. Their ham might be your spam, or vice versa. Tim has mentioned a couple of times the example of Guido's email about hotels. Guido gets a non-trivial amount of email about hotels for conferences. He would have to train his classifier to recognize messages about hotels as ham, but that probably makes it more likely he'll get spams advertising discount hotels. The details of what exactly your ham looks like are pretty personal. The spam is easy to collect, unless you don't get much spam. And if you don't get much spam, it's hardly a problem. DS> Not having a nice .db to start out with seems like a pretty DS> heavy barrier for [potential] new users.
We need to go DS> searching through undocumented code just to figure out how to DS> play with it. I agree that there are a lot of problems to be solved before potential new users can try things out. I think an initial training database is a pretty minor problem. I just spent an entire day getting the POP proxies hooked up to a training database, and I still have a bubble-gum-and-bailing-wire solution. Jeremy From jeremy@alum.mit.edu Fri Oct 25 23:30:46 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 18:30:46 -0400 Subject: [Spambayes] Where are we heading? In-Reply-To: <20021025215835.57F16F5A4@cashew.wolfskeep.com> References: <20021025215835.57F16F5A4@cashew.wolfskeep.com> Message-ID: <15801.50710.553887.279223@slothrop.zope.com> >>>>> "TAP" == T Alexander Popiel writes: TAP> It seems like all the work for the last week or so has been on TAP> integration of the classifier with end-user deployments TAP> (clients, mailing list filters, whathaveyou). Have we reached TAP> the point where we're no longer interested in this as a TAP> research project, but instead as a useful tool? I think we've reached the point where the classifier is good enough to be useful for my email. This issue is pretty much independent of the need for further research on the algorithms. I hope there is continued progress on the classifier. But there's also a big gulf between a good enough classifier and a usable spam filtering system. I'm hoping to contribute to the latter. Jeremy From dereks@itsite.com Sat Oct 26 00:06:26 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 25 Oct 2002 16:06:26 -0700 (PDT) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.50526.459467.387029@slothrop.zope.com> Message-ID: > I don't think you want someone else's database. Their ham might be > your spam, or vice versa. A couple of people have mentioned this, and while I see the point, I disagree. Let me explain why.
The differences between one person's ham and another individual's spam (such as the hotel conference-info example) are far less significant than the difference between one person's ham and everyone's spam. That is, the strongest indicators like "color=#FF0000" and porn-type swearwords are not likely to appear in anyone's ham. At least, not nearly as frequently as they will be found in most of the spams that are out there. I take it for granted that a general starter.db file will not be very accurate for my particular needs. But I should be able to set a fairly high cutoff value and get 80% to 90% of real-world spams correctly flagged right out of the gate -- that's heads and tails above having nothing at all, when trying to learn how this stuff works. But most importantly, training a starter.db for my specialized needs is far easier as "step two" than creating a .db from scratch is as "step one". And that is why I'm asking for a .db file. > I just spent an entire day getting the POP proxies hooked up to a > training database, and I still have a bubble-gum-and-bailing-wire > solution. I just used the Postfix-with-SpamAssassin instructions and replaced SpamAssassin with hammie.py in filter mode. For my needs, finding a nice "real world" starter corpus is what's holding me back. I'm not looking for a "documentation substitute". I'm just looking for something that will (a) tell me if I've installed the software correctly, and (b) correctly identify more than 80% of the spams that I feed it. So again, with full recognition that whatever somebody else has won't be tailored to my email lifestyle, I ask for the .db -- just to save me a few hours of ramp-up time. Once I've had a chance to dink around, and try out the software, I will know if I want to take the time necessary to collect, organize, and manually filter a highly-customized training corpus for my personalized needs.
The pizza offer still stands :) Thanks, Derek Simkowiak From tim.one@comcast.net Sat Oct 26 00:06:28 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 19:06:28 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: [Jeremy Hylton] > I don't know if anyone else on Earth wants to manage their mail the > same way I do. I've made some progress on hooking my mail up to > spambayes, however, and wanted to report on the deploment issues. Thanks for the report! > I read my mail with VM, an emacs mail reader. My mail collects on a > couple of POP servers, and I fetch the mail directly from the POP > servers using VM. > > I addressed the following issues: > > - Incremental training from VM folders > - Scoring via a POP proxy > - Management of training data using ZODB > > (I don't know if the last part was necessary or not, but I wanted to > use ZODB. I think it's simplified some things.) > > The runtime environment is fairly complicated. It's got more moving > parts than I would like, but I don't know how to eliminate any of > them. Check out the Outlook2000 directory -- there's already more code there than in the tokenizer and classifier combined. It's a remarkable and very capable GUI, but still, email clients seem universally poorly designed for programmability. > It's also slower than I would like, but I haven't done enough > profiling to really understand why. MarkH made great progress in speeding the Outlook client via finding a way to tell Outlook to deliver "batches" of msgs. It's still at best twice as slow (when bulk training or bulk classifying) as when running in "one msg per plain text file" tests, but it's at least 30 msgs/second, and I don't notice the speed drag at all when it's doing auto-filtering of incoming email. 
I do notice the increase in Outlook startup time, as it drags in several pickles and lots of Python code (including mounds of the Python win32 extensions). Just for fun, I'd suggest training in a different way: start with an empty database and forget batch training! Feed it examples from your live email. The system does better than chance after training on one ham and one spam, and it's fun & gratifying to see it get better in response to your training efforts. I've done that a few times now, and one day's worth of ham and spam (of which I admittedly get a lot in a day -- about 100 spam) has always been enough that it did better on its own then than my previous collection of by-hand Outlook rules (which I eventually reduced to one, because they made so many mistakes -- I don't use any now, except for spambayes-based "spam" and "unsure" rules). > There are a few open issues: > > - It was hard to use the classifier module with ZODB because of the > __slots__. My understanding here is that this is a problem with inheritance from ZODB's Persistent class. > I ended up using the WordInfo objects unchanged, and __slots__ there > helped minimize storage. But I wanted to make the Bayes class > persistent and I couldn't do that because of the slots. Since > there's only a single Bayes instance, I can't see why it needs to > use __slots__. There may be more than one Bayes instance (for example, I believe Sean True routinely uses several, faking N-way classification via chaining differently trained binary classifiers), but the real reason I used slots here was for their better error-detecting capabilities. This *was* very rapidly changing research code, and __slots__ caught mistakes early. Fine by me if we nuke the Bayes __slots__ now. > - It thought it would be nice if spambayes was a package, so I could > separate it from my code. It can't work as a package, though, > because it contains a copy of the email package.
When I turned > spambayes into a package, it ended up treating email as a > subpackage. My apps ended up getting two copies of the email > package loaded -- one from the std library and one as a subpackage > of spambayes. The duplication broke a bunch of isinstance() tests. As Barry pointed out yesterday, Python 2.2.2 users don't need the duplicated email pkg at all. Neither do people using CVS Python. We should nuke it. People who want to run under 2.2.1 should then work out what they need to do to fiddle their PYTHONPATH to get a backported copy loaded. > - Configuration. It would be nice to use the existing options > framework and extend it with application-specific options (like the > POP ports, the ZEO server location, etc.). It isn't clear what the > best way to extend Options is. Name one way, and it will automatically become "the best". Something that's been a minor problem in the Outlook client: as soon as you load any module in the spambayes core, it imports Options.py, and sometimes makes module compile-time decisions based on the option values then in effect. Setting the BAYESCUSTOMIZE envar after that point has no effect, since Options has already been loaded. > The different components involved in the setup are: > > - A ZEO server managing a ZODB database. And you marvel at how many moving parts you've got? > I have a long-running ZEO server process. By using ZEO, multiple > clients can access the database at the same time. Clients connect > to the server using a Unix domain socket. YAGNI for *you*, right? > - A persistent mail profile based on VM folders. > > The profile is stored in the database. A VM folder is just a Unix > mailbox. A config file contains a list of folders that contain ham > and a list of folders that contain spam. The profile manages these > folders and a spambayes classifier.
Mark added another database to the Outlook client: a mapping from (Outlook) message id to whether it's been trained on as ham or spam (and a message id is absent if neither). So far this has at least two good effects: (1) if a mistake is moved from one flavor of training folder to another, the system automatically knows to untrain it from the wrong flavor; (2) folder-based training is much faster now, as it doesn't even bother to fetch msgs it already trained on. > - A training program, update.py. > > The training program scans the folders listed in the profile. When > it finds new messages, it learns from them. When it finds that a > message was deleted, it unlearns it. I don't think you'll want that over time: If a msg has been deleted, fine, it's gone but still trained. Right now I'm carrying around many megabytes of useless spam in my Outlook store, and that has lots of bad effects: longer backup times, much longer scanpst times (the Outlook "inbox repair tool"), and very much longer times to transfer my msg store between laptop and desktop. Only a researcher wants to carry dead spam around forever. > This process is incremental, but it depends on the mailbox module > to parse the folders. The parsing is definitely slow -- especially > for large folders. Perhaps your moral equivalent to the Outlook client's msgid -> training status map would be a jeremy_msg_id -> seek offset map, along with a highwater mark offset to distinguish old from new msgs. > - A POP3 proxy > > I wrote my own proxy based on SocketServer.ThreadingTCPServer. I > don't like the asynchat style of programming, and I was having > trouble integrating pop3proxy with ZEO. They both use ZEO, but the > way they use them seemed to be causing deadlocks :-(. That's unheard of in ZEO . > The proxy uses the strategy as pop3proxy, intercepting messages and > adding a spam score header. 
I add a header like this: > > From: Martijn Pieters > To: (Zope.Com Geeks) > Cc: sa@zope.com > Subject: [Zope.Com Geeks] Zope.org storage server was down.. > Date: Fri, 25 Oct 2002 17:10:42 -0400 > X-Spambayes: 0.001 > > The proxy doesn't do anything other than add the header. Is your ultimate email reader programmable enough to "do something" with this? One prediction I made for myself turned out to be just right: moving things automagically into Probable-Spam and Unsure folders is exactly what I wanted and turns out to be exactly what I still want. Works great. > - A set of VM filters and tools for handling spam and training. > > I wrote some little elisp functions. One saves a message to the > spam training folder and deletes it. Another saves a message to > the ham training folder, but does not delete it. A third pipes it > to a small Python script that prints out the evidence for a message. > > The next step is to add autofoldering rules that file spam above a > certain threshold to the spam folder and messages in the middle to > an unsure folder. That's a standard VM thing, but I haven't done it > yet. Thank you for answering my questions so quickly. > The total code base is about 2000 lines of code, half of it in the POP > proxy. I'd be happy to check it in to the spambayes project if anyone > else wants to try to use parts of it. I'm bothered that you had no luck with the POP3 proxy already checked in. Who's using that, and why didn't it work for Jeremy? From tim.one@comcast.net Sat Oct 26 00:09:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 19:09:17 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: [Derek Simkowiak] > ... > I am still after a nice "real world" hammie.db. (I'll buy a pizza > for the first person to send me a good .db file, just include your > address, topping list, and the phone number of your favorite local pizza > joint in a private email to me.) You don't need one: just start. 
Train it on examples from your live email. It learns quickly. > Not having a nice .db to start out with seems like a pretty heavy > barrier for [potential] new users. We need to go searching through > undocumented code just to figure out how to play with it. Follow my suggestion, and you'll discover that you still don't know what to do -- it's not the lack of a prepackaged database that's stopping you. From dereks@itsite.com Sat Oct 26 00:18:55 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Fri, 25 Oct 2002 16:18:55 -0700 (PDT) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: > > I don't think you want someone else's database. Their ham might be > > your spam, or vice versa. I just thought of another argument for a stock "starter.db". How can we test out new algorithms if the project doesn't have a control group? We have no way of knowing if someone's successful (or poor) results are an attribute of the new algorithm, or if it's an attribute of their particular sample data. Having a starter.db would both (a) make life easier for getting started, and (b) give us a well-established baseline to test against. --Derek From tim.one@comcast.net Sat Oct 26 00:34:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 19:34:33 -0400 Subject: [Spambayes] Where are we heading? In-Reply-To: <20021025215835.57F16F5A4@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > It seems like all the work for the last week or so has been > on integration of the classifier with end-user deployments > (clients, mailing list filters, whathaveyou). Pretty much, yes. > Have we reached the point where we're no longer interested in > this as a research project, but instead as a useful tool? I'm no longer *primarily* interested in this as a research project -- the results on the corpora I'm targeting have been so good for more than a month that I couldn't measure an improvement if one were to be made. Time to move it along. 
> If so, I suggest that we may want to rewrite the whole thing > from scratch, after actually deciding on a usage model or two. There's no point to that I can see, so far as the "heavy lifting" code goes -- the classifier has had the same interface since the day it was first written, and the tokenizer has changed interface in only minor ways. IOW, there's nothing in need of refactoring there, else it would have been refactored already. Reworking the WordInfo structure is overdue, but it's hard to know what to do with that before we know more about how decisions affect accuracy over time (I'm thinking of database cleaning here). > Choosing the algorithms to use (gary-combining or chi-square?) > would be good, too. I don't mind supporting both (the differences are trivial at the code level). I do intend to get rid of mixed-combining, and want to make chi-combining the default. > What we've got now is a decent prototype, but it lacks quite a bit > as a finished tool... Like, approximately, everything <0.5 wink>. It's a 10,000 horsepower engine without seats, tires, or a steering wheel now. > there are a lot of issues with database storage (what should be in it, > how it should be stored, etc.) People won't agree on that -- let 1000 databases bloom. A clean API for database interface would be nice, *provided that* database heads would actually use it. I expect they're more likely to break into the internals "for speed". > and options management, just to name two of the hotspots. One thing I want to do too is purge useless options that are gumming up the works now. > Personally, I'm still interested in the research aspects; > once I get another two free hours to rub together, I'm going > to see if I can deal with some of the mail decoding issues in > the tokenizer (the unencoded mailing-list footer appended to > a base64 body, to be specific). Barry promised to do that "tomorrow", which was admittedly a week ago, but Barry works on his own calendar . 
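For anyone who hasn't stared at the combining code: chi-combining as referenced above runs the word probabilities through the chi-squared survival function twice, once from the ham direction and once from the spam direction, and splits the difference. A condensed sketch (the production classifier clamps probabilities and guards against log(0), which this omits):

```python
import math

def chi2q(x2, halfdf):
    # Survival function of the chi-squared distribution with 2*halfdf
    # degrees of freedom, via the standard exp(-m) * sum(m**i / i!) series.
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, halfdf):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # S approaches 1 when the clues are uniformly spammy, H when they
    # are uniformly hammy.  (S - H + 1) / 2 lands near 1 for spam, near
    # 0 for ham, and near 0.5 when the evidence conflicts -- the
    # "unsure" middle ground that makes this combiner attractive.
    n = len(probs)
    H = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), n)
    S = chi2q(-2.0 * sum(math.log(p) for p in probs), n)
    return (S - H + 1.0) / 2.0

print(round(chi_combine([0.99, 0.97, 0.95]), 3))  # strongly spammy clues
print(round(chi_combine([0.01, 0.03, 0.05]), 3))  # strongly hammy clues
```

A message with evenly conflicting clues (say, probabilities 0.2 and 0.8) scores exactly 0.5, which is why the middle ground maps so naturally onto an "unsure" folder.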
> There's also a few experiments I'd like to see revisited: the time of > delivery stuff might be interesting to test on multiple corpora, as an > example (since my spam does not seem to be evenly spread throughout > the day, unlike the original experimenter's spam). Yup, that's a good one. I'd also like to dig deeper into header-line tokenization: to date, I've gotten worse results on both my own email, and on a new "pure" collection of python.org traffic, when enabling *any* of these: mine_received_headers count_all_header_lines basic_header_tokenize I suspect but don't know this is mostly due to the dark side of the correlation effects Rob likes to worry about. For example, mine_received_headers presents IP and machine name info in several distinct and partially redundant ways, so that, e.g., email going thru python.org ends up with 6 good ham clues for that alone. I saw one spam get thru under basic_header_tokenize just because 6 different header lines happened to have the string "GMT" in them. Etc. I'm sure there's a world of info in the header lines that default tokenization is missing, but it remains unclear (to me) how to exploit it in a way that does more good than harm. From popiel@wolfskeep.com Sat Oct 26 00:37:39 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 16:37:39 -0700 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from Derek Simkowiak References: Message-ID: <20021025233739.9C94DF5A4@cashew.wolfskeep.com> In message: Derek Simkowiak writes: > > I just thought of another argument for a stock "starter.db". > > How can we test out new algorithms if the project doesn't have a >control group? We have no way of knowing if someone's successful (or >poor) results are an attribute of the new algorithm, or if it's an >attribute of their particular sample data. That's why we have multiple people test anything that looks promising, and compare the variations across all the different runs. 
Since the classifications are reproducible over given corpora, we don't need control groups in the same way that biological experiments do. > Having a starter.db would both (a) make life easier for getting >started, and (b) give us a well-established baseline to test against. I disagree with (b), because changes in the tokenizer (where I suspect some of the advances will come from) will invalidate the database. - Alex From jeremy@alum.mit.edu Sat Oct 26 00:36:08 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 19:36:08 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: <15801.54632.545081.293386@slothrop.zope.com> >>>>> "TP" == Tim Peters writes: TP> MarkH made great progress in speeding the Outlook client via TP> finding a way to tell Outlook to deliver "batches" of msgs. TP> It's still at best twice as slow (when bulk training or bulk TP> classifying) as when running in "one msg per plain text file" TP> tests, but it's at least 30 msgs/second, and I don't notice the TP> speed drag at all when it's doing auto-filtering of incoming TP> email. I do notice the increase in Outlook startup time, as it TP> drags in several pickles and lots of Python code (including TP> mounds of the Python win32 extensions). The POP proxy I'm using is a long-running process with a ZEO client connection. It just calls spamprob() for each email as it passes through the proxy. The ZEO/ZODB cache should do a good job of keeping recently used words in memory. (I'm storing the WordInfo objects in an OOBTree instead of a dict.) TP> Just for fun, I'd suggest training in a different way: start TP> with an empty database and forget batch training! Feed it TP> examples from your live email. I just started it off with a few things I was sure I didn't want to miss that don't show up in my email every day.
Examples: email from my brother and sister, order receipts from things I've bought online, etc. I also started with the 4 spams that were sitting in my INBOX. >> There are a few open issues: >> >> - It was hard to use the classifier module with ZODB because of >> the >> __slots__. TP> My understanding here is that this is a problem with inheritance TP> from ZODB's Persistent class. That's right. I'm going to fix that problem for ZODB4, but I wanted to use ZODB3 for this project. >> I ended up using the WordInfo objects unchanged, and __slots__ >> there helped minimize storage. But I wanted to make the Bayes >> class persistent and I couldn't do that because of the slots. >> Since there's only a single Bayes instance, I can't see why it >> needs to use __slots__. TP> There may be more than one Bayes instance (for example, I TP> believe Sean True routinely uses several, faking N-way TP> classification via chaining differently trained binary TP> classifiers), but the real reason I used slots here was for TP> their better error-detecting capabilities. This *was* very TP> rapidly changing research code, and __slots__ caught errors early. TP> Fine by me if we nuke the Bayes __slots__ now. Cool. TP> As Barry pointed out yesterday, Python 2.2.2 users don't need TP> the duplicated email pkg at all. Neither do people using CVS TP> Python. We should nuke it. People who want to run under 2.2.1 TP> should then work out what they need to do to fiddle their TP> PYTHONPATH to get a backported copy loaded. I agree that we should nuke it. >> - Configuration. It would be nice to use the existing options >> framework and extend it with application-specific options (like >> the POP ports, the ZEO server location, etc.). It isn't clear >> what the best way to extend Options is. TP> Name one way, and it will automatically become "the best" TP> . I haven't come up with one yet <0.5 wink>. I import Options from spambayes, then I add stuff to its all_options dict and call mergefiles a second time.
Yuck. TP> Something that's been a minor problem in the Outlook client: as TP> soon as you load any module in the spambayes core, it imports TP> Options.py, and sometimes makes module compile-time decisions TP> based on the option values then in effect. Setting the TP> BAYESCUSTOMIZE envar after that point has no effect, since TP> Options has already been loaded. This is one of the reasons I'd like something else :-). >> The different components involved in the setup are: >> >> - A ZEO server managing a ZODB database. TP> And you marvel at how many moving parts you've got ? This is one of the moving parts I'm not entirely happy with. But I hope to get to a point where I start the database and POP proxies when I boot my machine and leave them running all the time. >> I have a long-running ZEO server process. By using ZEO, multiple >> clients can access the database at the same time. Clients >> connect to the server using a Unix domain socket. TP> YAGNI for *you*, right? No. I want to be able to train the database while I'm fetching mail or scoring a particular message. Even though I'm a single user, I find it essential to have multiple processes reading and writing the classifier database concurrently. TP> Mark added another database to the Outlook client: a mapping TP> from (Outlook) message id to whether it's been trained on as ham TP> or spam (and a message id is absent if neither). So far this TP> has at least two good effects: (1) if a mistake is moved from TP> one flavor of training folder to another, the system TP> automatically knows to untrain it from the wrong flavor; (2) TP> folder-based training is much faster now, as it doesn't even TP> bother to fetch msgs it already trained on. That's an interesting point. I was thinking about "deletion from folder" as the mechanism to correct training mistakes. I think "shows up in the other folder" sounds like a good alternative. >> - A training program, update.py.
>> >> The training program scans the folders listed in the profile. >> When it finds new messages, it learns from them. When it finds >> that a message was deleted, it unlearns it. TP> I don't think you'll want that over time: If a msg has been TP> deleted, fine, it's gone but still trained. Right now I'm TP> carrying around many megabytes of useless spam in my Outlook TP> store, and that has lots of bad effects: longer backup times, TP> much longer scanpst times (the Outlook "inbox repair tool"), and TP> very much longer times to transfer my msg store between laptop TP> and desktop. Only a researcher wants to carry dead spam around TP> forever. I only train on spam when the existing classifier doesn't mark it as spam. I expect that the amount of spam I keep around won't be that big compared to all the other email that I keep :-). >> This process is incremental, but it depends on the mailbox module >> to parse the folders. The parsing is definitely slow -- >> especially for large folders. TP> Perhaps your moral equivalent to the Outlook client's msgid -> TP> training status map would be a jeremy_msg_id -> seek offset map, TP> along with a highwater mark offset to distinguish old from new TP> msgs. I was thinking that something like that would work. The mailbox module is passing the start and stop point of each message to _Subfile() before it calls the message factory. So if I hook that, I can store the location of the message in the database. Then I only need to check that the locations are still valid, which is true as long as messages aren't deleted. >> The next step is to add autofoldering rules that file spam above >> a certain threshold to the spam folder and messages in the middle >> to an unsure folder. That's a standard VM thing, but I haven't >> done it yet. TP> Thank you for answering my questions so quickly. As I learn more about auto foldering, I discover that my mail client doesn't do quite what I want. 
Once the labelled spam shows up in my INBOX, I can cause it to be deleted by a single key. Unfortunately, that interacts badly with the use of the feature to automatically guess what folder to save a message in when you want to keep it. I'll have to wait for Barry to come up with some serious elisp to move the spam to a secure location. >> The total code base is about 2000 lines of code, half of it in >> the POP proxy. I'd be happy to check it in to the spambayes >> project if anyone else wants to try to use parts of it. TP> I'm bothered that you had no luck with the POP3 proxy already TP> checked in. Who's using that, and why didn't it work for TP> Jeremy? I sent an email earlier about the problem where the proxy attempts to write to a read-only file-wrapping-a-socket. I just don't see how the code can work as written. It's even worse, though, that it uses asyncore. I found asyncore added a lot of complexity to ZEO and would rather we hadn't used it. Then add in a second asyncore app (the proxy) and you've got real trouble. The complexity seems to be multiplicative rather than additive. Jeremy From jeremy@alum.mit.edu Sat Oct 26 03:15:09 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 25 Oct 2002 22:15:09 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: <15801.46429.507385.482352@slothrop.zope.com> Message-ID: <15801.64173.255155.281547@slothrop.zope.com> Earlier I reported that the pop proxy was slow. It's now a lot faster, and I didn't change a stitch of code. I guess the network between me and the real POP servers was very slow this afternoon. Jeremy From tim.one@comcast.net Sat Oct 26 03:20:39 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 22:20:39 -0400 Subject: [Spambayes] Proposing to drop use_mixed_combining In-Reply-To: Message-ID: Proposing to drop the options: use_mixed_combining mixed_combining_chi_weight They haven't worked better than chi_combining for anyone yet.
From tim.one@comcast.net Sat Oct 26 03:29:34 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 22:29:34 -0400 Subject: [Spambayes] Proposing to drop ignore_redundant_html In-Reply-To: Message-ID: Proposing to drop the option ignore_redundant_html This has been False by default for a long time, and there are no known clients. I used it early in the project, before we stripped HTML tags, else (at the time) there was no way to get any multipart/alternative msg with a text/html part to score as ham in the c.l.py tests. Since then, A. We strip HTML tags by default (and &nbsp; character entities -- that's a change I made recently I probably didn't announce here, although I mentioned it often enough ). B. We know that sometimes multipart/alternative msgs have different content in the text/plain and text/html parts, and in particular that some spam can be identified only by staring at the HTML part. C. We no longer count multiple instances of a word in a msg multiple times during training. So if text/html and text/plain parts are in fact redundant, training isn't affected by seeing the content twice. It used to be. IOW, ignore_redundant_html has nothing going for it anymore. From tim.one@comcast.net Sat Oct 26 04:05:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 23:05:12 -0400 Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: Message-ID: The current default combining scheme is anonymous, so this proposal amounts to two things: 1. Introduce option use_gary_combining, defaulting to False, meaning the combining scheme that's currently the default. 2. Change the default for use_chi_combining to True. 2'. Change the default ham_cutoff to 0.20 and the default spam_cutoff to 0.90. I'll introduce a named option for #1 in any case, since an anonymous behavior is a Bad Idea regardless.
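Spelled out as an options file, the proposal above would amount to something like the following sketch. The [Classifier]/[TestDriver] section names and the use_chi_squared_combining spelling are taken from the options dumps posted elsewhere on this list; treat the exact spellings as assumptions.

```ini
[Classifier]
use_gary_combining: False
use_chi_squared_combining: True

[TestDriver]
ham_cutoff: 0.20
spam_cutoff: 0.90
```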
Both combining schemes are 100% compatible at the database level -- they don't affect training at all (you can use either scheme to *score* msgs, using the same database). In all my tests, use_chi_combining works better, because it has a small (in # of msgs) middle ground spanning a large range of scores where most mistakes live, and the boundaries of the middle ground aren't touchy. In contrast, there still seems no way to predict good cutoff values for gary_combining; they're corpus- and training-data dependent. People still seem to have some fear of chi-combining because it makes extreme judgments (median score for spam is near 1.0, and median score for ham is near 0.0, in test after test), and this reminds them of the bad behavior of Graham-combining. The difference is that Graham-combining had no middle ground as training data increased, but chi-combining does. Indeed, the more training data there is, the more certain chi-combining seems to get about just *how* confused it is . FYI, in my personal email I use chi-combining all the time now. About 1% of incoming msgs (I get about 600 per day, w/ about 100 spam) end up in my Unsure folder, using ham_cutoff 0.30 and spam_cutoff 0.80. They're about evenly mixed between ham and spam, they "make sense" to me as Unsure msgs, and training on them correctly ASAP is very effective in preventing followups (for ham) or near-duplicates (for spam) from ending up in the Unsure bucket too. Hapaxes ("words" appearing uniquely in the msg) appear to play a large role in that last happy result.
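The chi-combining behavior described above comes from combining per-word spam probabilities through the chi-squared distribution (the Robinson/Fisher approach). Here is a minimal sketch of the idea, not the project's actual code; the function names and details are my own:

```python
import math

def chi2q(x2, df):
    """Survival function of the chi-squared distribution, for even df."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combined_score(probs):
    """Combine per-word spam probabilities (each strictly in (0,1))
    into one score in [0, 1]; higher means spammier."""
    n = len(probs)
    # S: spamminess evidence -- large when the probs cluster near 1.0
    S = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H: hamminess evidence -- large when the probs cluster near 0.0
    H = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # When both S and H are large (conflicting clues), the midpoint
    # lands near 0.5 -- the "confident about being confused" middle ground.
    return (S - H + 1.0) / 2.0
```

With the cutoffs discussed above, a score at or above spam_cutoff is filed as spam, at or below ham_cutoff as ham, and anything between lands in the Unsure middle ground.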
One spam has been left in my Inbox, which SpamAssassin let thru on the mailing-list version of comp.lang.python, so that the Mailman-inserted URL at the bottom gave it some strong ham clues:

http://mail.python.org/mailman/listinfo/python-list

'url:python-list' 0.00712585
'url:mailman' 0.0120696
'url:listinfo' 0.0121098
'url:python' 0.0170755

The text of the spam was

"""
python, A friend of yours, Michael (michael_suswanto@yahoo.com) thought you might like to check out this web page.

http://www.newmarketingsite.com/2848/

-- The coolest site in town
"""

and so it got strong help from mentioning "python," too. As spam goes, it wasn't particularly disgusting . There have been no false positives in this time. This reminds me that Jim Bublitz reported that using a system "for real" day to day gave even better results than contrived tests, and while I'm not running controlled experiments on my own email, that's my (subjective) impression too. For my personal use in real life, it's been pure delight. From tim.one@comcast.net Sat Oct 26 04:06:06 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 25 Oct 2002 23:06:06 -0400 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.64173.255155.281547@slothrop.zope.com> Message-ID: [Jeremy Hylton] > Earlier I reported that the pop proxy was slow. It's now a lot > faster, and I didn't change a stitch of code. I guess the network > between me and the real POP servers was very slow this afternoon. Well, I checked in a change to the Outlook client. I figured that would cure your speed problems . From popiel@wolfskeep.com Sat Oct 26 05:23:54 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Fri, 25 Oct 2002 21:23:54 -0700 Subject: [Spambayes] Proposing to drop use_mixed_combining In-Reply-To: Message from Tim Peters References: Message-ID: <20021026042354.2A0E7F5A4@cashew.wolfskeep.com> In message: Tim Peters writes: >Proposing to drop the options: > > use_mixed_combining > mixed_combining_chi_weight Hear! Hear! - Alex From popiel@wolfskeep.com Sat Oct 26 05:24:14 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 21:24:14 -0700 Subject: [Spambayes] Proposing to drop ignore_redundant_html In-Reply-To: Message from Tim Peters References: Message-ID: <20021026042414.145B8F5A4@cashew.wolfskeep.com> In message: Tim Peters writes: >Proposing to drop the option > > ignore_redundant_html Sounds good. - Alex From popiel@wolfskeep.com Sat Oct 26 05:29:07 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Fri, 25 Oct 2002 21:29:07 -0700 Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: Message from Tim Peters References: Message-ID: <20021026042907.14850F5A4@cashew.wolfskeep.com> In message: Tim Peters writes: >The current default combining scheme is anonymous, so this proposal amounts >to two things: > >1. Introduce option use_gary_combining, defaulting to False, meaning > the combining scheme that's currently the default. > >2. Change the default for use_chi_combining to True. > >2'. Change the default ham_cutoff to 0.20 and the default spam_cutoff > to 0.90. I'm slightly surprised at the looseness of 2', but as you say, the boundaries aren't all that touchy. I'm all for the above. 
- Alex From jbublitz@nwinternet.com Sat Oct 26 16:25:09 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Sat, 26 Oct 2002 08:25:09 -0700 (PDT) Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: Message-ID: On 26-Oct-02 Tim Peters wrote: > This reminds me that Jim Bublitz reported that using a system > "for real" day to day gave even better results than contrived > tests, and while I'm not running controlled experiments on my > own email, that's my (subjective) impression too. For my > personal use in real life, it's been pure delight. Just like Beetlejuice, if you say my name, I appear :) Previously I just did testing in chronological order and got much better results than random testing. As of Sunday I turned on my new mail system which includes the spam filter, but also replaces fetchmail, procmail, cron (for mail anyway), and some of qmail (qmail still acts as my local smtp/pop3 server). I also have a whitelist in front of the spam filter (my fps can be expensive). It's completely in Python, of course. Over 6 full days of use, 1 or 2 spams per day get through and I've had a couple of fps total. The fns/fps are less than 1%. I haven't written anything to parse the logs yet so I don't have actual stats, and for the first few days I had to restart the mail system a number of times (bugs), so there hasn't been any way to accumulate actual results except the logs. As several other people have mentioned, I'm also juggling msgs between ham and spam folders based on the results of "manual" review. The only thing I think deserves mention is the review process. Every 20 spams received, the mail system puts together a msg with a list of Subj and From lines from 20 spams in score order. Next to each msg is a checkbox []. This email msg gets sent to a user (all 2 of us) on an alternating basis. The user replies to the msg to confirm the scoring was correct (leaves the checkboxes empty), or if a score looks wrong, puts an [x] in the box. 
When the mail system receives the reply, it forwards any checked msgs back to the user. If the checked msg was really spam, the user places it in a local spam folder, along with any fns; if it *was* ham they do nothing (the mail system has already moved the msg to the ham folder temporarily). At the end of the day the mail system empties all of the local spam folders and shifts msgs around again if req'd, and then retrains on the new mail. Reviewing 20 msgs at a time takes less than a minute. Doing it via email makes it more likely the msgs will actually get reviewed. The review email is much easier than having to scan a folder and delete each unwanted msg, but it still gives users a sense of control over the process, demonstrates how much spam is actually being blocked, and offers a sense of "victory over spammers". My wife likes it anyway, and she usually hates my UIs. We still end up looking at spam subject lines because we can't afford any fps, but real mail gets through more quickly and sorting is much more accurate and done more quickly with about the absolute minimum of user activity. In a few months I might have enough confidence to trust the .99 (spam) scores without review, or else have constructed a sizable blacklist that doesn't require scoring or review. The couple of fps so far have scored around .502 (0.5 cutoff) - one was a legitimate mail from a guy who works for a company that's well represented in my spam corpus. Jim From sholden@holdenweb.com Sat Oct 26 18:48:20 2002 From: sholden@holdenweb.com (Steve Holden) Date: Sat, 26 Oct 2002 13:48:20 -0400 Subject: [Spambayes] Some minor nits ... Message-ID: <001a01c27d17$dfec99a0$6300000a@holdenweb.com> I've just been testing the pop3proxy with Outlook Express on Win2K. If I run it under Python 2.2.1 on Windows (the ActiveState distro, if it makes a difference), everything seems to work except that the X-Hammie-Disposition header is treated as a part of the message body, presumably due to the line ending that precedes it.
Could we make the line ending controllable by some sort of option, or are there specific reasons for sticking to RFC standards here :-)? Under cygwin (python 2.2.1) I see the following asyncore error:

error: uncaptured python exception, closing channel <__main__.BayesProxy connected 127.0.0.1:1111 at 0x102026c0> (exceptions.IOError:(0, 'Error')
[/tmp/python.576/usr/lib/python2.2/asyncore.py|poll|95]
[/tmp/python.576/usr/lib/python2.2/asyncore.py|handle_read_event|392]
[/tmp/python.576/usr/lib/python2.2/asynchat.py|handle_read|130]
[pop3proxy.py|found_terminator|181])

Not sure quite what that's about, I'll take a look if I get a chance. regards ----------------------------------------------------------------------- Steve Holden http://www.holdenweb.com/ Python Web Programming http://pydish.holdenweb.com/pwp/ Previous .sig file retired to www.homeforoldsigs.com ----------------------------------------------------------------------- From jeremy@alum.mit.edu Sat Oct 26 19:04:21 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Sat, 26 Oct 2002 14:04:21 -0400 Subject: [Spambayes] Some minor nits ... In-Reply-To: <001a01c27d17$dfec99a0$6300000a@holdenweb.com> References: <001a01c27d17$dfec99a0$6300000a@holdenweb.com> Message-ID: <15802.55589.455847.462428@slothrop.zope.com> >>>>> "SH" == Steve Holden writes: SH> Under cygwin (python 2.2.1) I see the following asyncore error: SH> error: uncaptured python exception, closing channel SH> <__main__.BayesProxy connected 127.0.0.1:1111 at 0x102026c0> SH> (exceptions.IOError:(0, 'Error') SH> [/tmp/python.576/usr/lib/python2.2/asyncore.py|poll|95] SH> [/tmp/python.576/usr/lib/python2.2/asyncore.py|handle_read_event|392] SH> [/tmp/python.576/usr/lib/python2.2/asynchat.py|handle_read|130] SH> [pop3proxy.py|found_terminator|181]) SH> Not sure quite what that's about, I'll take a look if I get a SH> chance. This is the error I saw on Linux.
I assume that means that makefile() on Windows doesn't care whether it's opened with a read or write mode. That's hardly surprising, although it might be a Python bug. Jeremy From popiel@wolfskeep.com Sat Oct 26 21:34:09 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 26 Oct 2002 13:34:09 -0700 Subject: [Spambayes] Mining the headers Message-ID: <20021026203410.166C1F54A@cashew.wolfskeep.com> Tim mentioned three tokenizer options (mine_received_headers, count_all_header_lines, basic_header_tokenize). I hadn't played with these yet, so I ran the 8 combinations of these. Summary: both mine_received_headers and basic_header_tokenize seem good for me, but count_all_header_lines is a minor lose.

r == mine_received_headers: False
R == mine_received_headers: True
c == count_all_header_lines: False
C == count_all_header_lines: True
b == basic_header_tokenize: False
B == basic_header_tokenize: True

Other options are:

[Classifier]
use_chi_squared_combining: True

[TestDriver]
show_false_negatives: False
show_false_positives: False
show_unsure: False
ham_cutoff: 0.20
spam_cutoff: 0.90

-> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[...]
filename:        rcb       rcB       rCb       rCB       Rcb       RcB       RCb       RCB
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:          3         3         3         3         3         3         3         3
fp %:           0.15      0.15      0.15      0.15      0.15      0.15      0.15      0.15
fn total:         12        14        16        14        12        12        12        12
fn %:           0.60      0.70      0.80      0.70      0.60      0.60      0.60      0.60
unsure t:         53        37        50        39        40        31        37        32
unsure %:       1.32      0.93      1.25      0.97      1.00      0.78      0.93      0.80
real cost:    $52.60    $51.40    $56.00    $51.80    $50.00    $48.20    $49.40    $48.40
best cost:    $48.20    $45.20    $49.20    $45.60    $37.20    $38.80    $40.60    $38.60
h mean:         0.40      0.32      0.35      0.32      0.31      0.30      0.29      0.29
h sdev:         5.39      4.71      5.12      4.68      4.55      4.47      4.47      4.43
s mean:        98.45     98.68     98.35     98.68     98.75     98.85     98.72     98.85
s sdev:         9.76      9.57     10.46      9.58      9.08      9.06      9.37      9.11
mean diff:     98.05     98.36     98.00     98.36     98.44     98.55     98.43     98.56
k:              6.47      6.89      6.29      6.90      7.22      7.28      7.11      7.28

Yes, it looks like there's good info in the headers. Counting the header lines doesn't appear to be a helpful way to get at that information, but mining the received headers and just doing basic tokenization over all the headers both seem to work, and work even better together. This is on my website at: http://www.wolfskeep.com/~popiel/spambayes/headers - Alex From gward@python.net Sat Oct 26 22:11:18 2002 From: gward@python.net (Greg Ward) Date: Sat, 26 Oct 2002 17:11:18 -0400 Subject: [Spambayes] python.org corpus updated Message-ID: <20021026211118.GA29889@cthulhu.gerg.ca> Hi all -- I've just updated the python.org email corpus to include mail harvested last week, ie. from 2002-10-19 to 2002-10-24. (The harvest was supposed to run for a full week, ie. until this morning, but for some reason it stopped Thursday evening. Oh well.) As before, I'm not going to share this with just anyone. Let me know if you're interested, and I'll let you know the URL and password. If I've never met you personally, I'll probably ask for approval from Guido/Barry/Tim first.
Oh: there are undoubtedly spams in the ham folder and vice-versa; I've done a manual pass over all of the folders, but running them through a different spam filter always finds some errors. If you download the corpus and find mis-filed messages, please let me know and I'll update the canonical corpus accordingly. Greg -- Greg Ward http://www.gerg.ca/ Never try to outstubborn a cat. From gward@python.net Sat Oct 26 22:13:36 2002 From: gward@python.net (Greg Ward) Date: Sat, 26 Oct 2002 17:13:36 -0400 Subject: [Spambayes] Re: python.org corpus updated In-Reply-To: <20021026211118.GA29889@cthulhu.gerg.ca> References: <20021026211118.GA29889@cthulhu.gerg.ca> Message-ID: <20021026211336.GA29902@cthulhu.gerg.ca> Oh, some depressing statistics. From the Sept harvest:

  dsn     1662 messages    8662 kB
  ham     3819 messages   11249 kB
  spam    1896 messages   16692 kB
  virus    991 messages  120758 kB

(dsn = delivery status notification = bounces, delay notifications, vacation mail, etc. I only kept 10% of the actual DSNs received.) And from October:

  dsn     1006 messages    5347 kB
  ham     2851 messages    7803 kB
  spam    2841 messages   22206 kB
  virus    634 messages   73754 kB

Note that the spam:ham ratio is now 1:1. *sigh* Greg -- Greg Ward http://www.gerg.ca/ Eschew obfuscation! From rob@hooft.net Sat Oct 26 22:25:08 2002 From: rob@hooft.net (Rob Hooft) Date: Sat, 26 Oct 2002 23:25:08 +0200 Subject: [Spambayes] hammie deployment without Outlook Message-ID: <3DBB0834.7050504@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Since I'm Linux-only, I've been trying to put hammie.py to work for me. I made the attached change to hammie.py. I trained it on a few hundred recent representative ham and spam messages.
Then I created an executable file "~/bin/hammie":

----
#!/bin/sh
cd $HOME/p/spambayes
/usr/local/bin/python hammie.py -d -f
----

In .forward I put (on one line):

----
"|exec /mnt/disk2/People/Development/hooft/bin/hammie |/usr/bin/procmail"
----

and I created a file "~/.procmailrc" with contents:

----
LOGFILE=procmail.log

:0:
* ^X-Hammie-Disposition: Yes
imap/spam

:0 c
* ^X-Hammie-Disposition: Unsure
imap/unsure

:0 c
* ^X-Hammie-Disposition: No
imap/ham
----

Any other hints from people with more experience? I'll be waiting for a mozilla plugin to make interactive use of spambayes.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.29
diff -u -r1.29 hammie.py
--- hammie.py	6 Oct 2002 23:07:23 -0000	1.29
+++ hammie.py	26 Oct 2002 21:23:40 -0000
@@ -59,6 +59,7 @@
 
 # Probability at which a message is considered spam
 SPAM_THRESHOLD = options.spam_cutoff
+HAM_THRESHOLD = options.ham_cutoff
 
 # Tim's tokenizer kicks far more booty than anything I would have
 # written.  Score one for analysis ;)
@@ -227,7 +228,8 @@
             import traceback
             traceback.print_exc()
 
-    def filter(self, msg, header=DISPHEADER, cutoff=SPAM_THRESHOLD):
+    def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
+               ham_cutoff=HAM_THRESHOLD):
         """Score (judge) a message and add a disposition header.
 
         msg can be a string, a file object, or a Message object.
@@ -245,10 +247,12 @@
         elif not hasattr(msg, "add_header"):
             msg = email.message_from_string(msg)
         prob, clues = self._scoremsg(msg, True)
-        if prob < cutoff:
+        if prob < ham_cutoff:
             disp = "No"
-        else:
+        elif prob > spam_cutoff:
             disp = "Yes"
+        else:
+            disp = "Unsure"
         disp += "; %.2f" % prob
         disp += "; " + self.formatclues(clues)
         msg.add_header(header, disp)
---------------------- multipart/mixed attachment--

From skip@pobox.com Sun Oct 27 00:18:25 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 26 Oct 2002 18:18:25 -0500 Subject: [Spambayes] Mining the headers In-Reply-To: <20021026203410.166C1F54A@cashew.wolfskeep.com> References: <20021026203410.166C1F54A@cashew.wolfskeep.com> Message-ID: <15803.8897.442152.315985@montanaro.dyndns.org> Alex> Tim mentioned three tokenizer options (mine_received_headers, Alex> count_all_header_lines, basic_header_tokenize). I hadn't played Alex> with these yet, so I ran the 8 combinations of these. I've had three other options knocking around locally which haven't seemed to help or hurt when applied to my collections: mine_date_headers, generate_time_buckets, and extract_dow. The first controls overall attention to the Date: header. The second generates tokens like time:12:3 (the third six-minute bucket of the twelfth hour). The third generates tokens like dow:0 (Monday). Should I check them in to see if they are useful for other people? (I seem to have a bit different fp & fn results than others.)
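The time-bucket and day-of-week tokens described above can be pictured with a small sketch. The token formats come from the description; the function name and everything else here are guesses, not the actual option code:

```python
import time

def date_mining_tokens(t, nbuckets=10):
    """Yield date-mining tokens for a time.struct_time t.

    'time:H:B' puts the minute into one of nbuckets buckets within
    hour H (10 buckets of 6 minutes each); 'dow:N' uses tm_wday,
    so Monday == 0.
    """
    bucket = t.tm_min * nbuckets // 60
    return ["time:%d:%d" % (t.tm_hour, bucket), "dow:%d" % t.tm_wday]
```

For example, a message dated Thursday 00:03 would yield "time:0:0" and "dow:3".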
Skip

From skip@pobox.com Sun Oct 27 00:26:39 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 26 Oct 2002 18:26:39 -0500 Subject: [Spambayes] Re: python.org corpus updated In-Reply-To: <20021026211336.GA29902@cthulhu.gerg.ca> References: <20021026211118.GA29889@cthulhu.gerg.ca> <20021026211336.GA29902@cthulhu.gerg.ca> Message-ID: <15803.9391.949518.510594@montanaro.dyndns.org>

>>>>> "Greg" == Greg Ward writes:

    Greg> And from October:
    Greg> dsn    1006 messages   5347 kB
    Greg> ham    2851 messages   7803 kB
    Greg> spam   2841 messages  22206 kB
    Greg> virus   634 messages  73754 kB
    Greg> Note that the spam:ham ratio is now 1:1. *sigh*

Greg,

Do you get email from python-list? Its source is primarily the c.l.py newsfeed, which has been down for about five days now. According to the last message I saw from Barry, it seems there's a problem between Bay Mountain and UUNet. That would seriously impact your spam:ham ratio. Take a look at the statistics at the bottom of . It looks to me like October's going to be a light month as a result. If the feed isn't reestablished soon, newsgroup messages are likely to begin expiring before they get to the list archive.

Skip

From tim.one@comcast.net Sun Oct 27 01:31:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 26 Oct 2002 21:31:59 -0400 Subject: [Spambayes] python.org corpus updated In-Reply-To: <20021026211118.GA29889@cthulhu.gerg.ca> Message-ID: This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment

[Greg Ward]
> ...
> Oh: there are undoubtedly spams in the ham folder and vice-versa; I've
> done a manual pass over all of the folders, but running them through a
> different spam filter always finds some errors. If you download the
> corpus and find mis-filed messages, please let me know and I'll update
> the canonical corpus accordingly.

Thanks, Greg!
I did the usual quick cheap-ass thing, taking the new data and just splitting it in half, training on the first half and predicting against the second. This seemed discouraging at first:

-> 4 new false positives
new fp: ['pyham/02155.txt', 'pyham/01816.txt', 'pyham/02322.txt', 'pyham/02406.txt']

but I believe they're all spam. I'll attach them for your review. They correspond, respectively, to your

    ham/184Q4b-0000MJ-00  2155
    ham/184Dc6-0003e0-00  1816
    ham/184UDM-0002Cn-00  2322
    ham/184Xs0-0007q6-00  2406

A difference from last time: since python.org already takes a Draconian view of Asian-language traffic, I enabled the new (late this week) tokenizer option replace_nonascii_chars. This allows detection of "hated languages" more reliably with a smaller database and less training data.

In the other direction (training on the 2nd half of the new data and predicting on the 1st half), the FP rate zoomed:

-> 9 new false positives
new fp: ['pyham/00277.txt', 'pyham/00278.txt', 'pyham/00275.txt', 'pyham/00267.txt', 'pyham/01346.txt', 'pyham/00261.txt', 'pyham/00276.txt', 'pyham/01284.txt', 'pyham/00645.txt']

Again I believe these are all spam, and some are so outrageously spam it's hard to believe SpamAssassin let them pass! Then again, most are in a hated language .

    ham/183BtE-00072Z-00   261
    ham/183DZB-0007dJ-00   267
    ham/183Epz-0001IH-00   275
    ham/183Epz-0001II-00   276
    ham/183Epz-0001IJ-00   277
    ham/183Epz-0001IK-00   278
    ham/183aCi-00024k-00   645
    ham/183ueG-0006vd-00  1284
    ham/183xNY-0008Gi-00  1346

Take those away and there were no false positives in either direction.

There's also ham in the spam, but there are a lot more of those to dig thru, they seem to be grosser errors than last time around, and I'm tired of this now. One example:

    spam/183UWS-00060A-00  633

seems a perfectly ordinary piece of mailman-users traffic.
chi-combining is quite certain it's ham:

    prob = 3.37424532759e-012
    prob('*H*') = 1
    prob('*S*') = 6.63913e-012

OTOH, SpamAssassin seems certain it's spam:

"""
Return-Path:
Envelope-To: mailman-users@python.org
Received: from northgate.starhub.net.sg ([203.117.1.53])
	by mail.python.org with esmtp (Exim 4.05)
	id 183UWS-00060A-00
	for mailman-users@python.org; Mon, 21 Oct 2002 00:50:37 -0400
Received: from sourcevisions.net (root@cm29.omega93.scvmaxonline.com.sg [218.186.93.29])
	by northgate.starhub.net.sg (8.12.5/8.12.5) with ESMTP id g9L4oXr2016531
	for ; Mon, 21 Oct 2002 12:50:34 +0800 (SST)
Received: from localhost (whitestar@localhost)
	by sourcevisions.net (8.11.6/8.11.6) with ESMTP id g9L4qx330186
	for ; Mon, 21 Oct 2002 12:52:59 +0800
Date: Mon, 21 Oct 2002 12:52:59 +0800 (SGT)
From: Terence
To: mailman-users@python.org
Subject: Need help on email headers
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-warning: 203.117.1.53 in blacklist at unconfirmed.dsbl.org
	(http://dsbl.org/listing.php?203.117.1.53)
X-Spam-Status: Yes, hits=5.4 required=5.0
	tests=MAILTO_WITH_SUBJ,RCVD_IN_MULTIHOP_DSBL,RCVD_IN_RFCI,
	RCVD_IN_UNCONFIRMED_DSBL,SIGNATURE_SHORT_DENSE,
	SPAM_PHRASE_02_03,USER_AGENT_PINE,WEIRD_PORT
X-Spam-Flag: YES
X-Spam-Level: *****

Hello,

I see some redundant links on the email headers, listed below:

List-Help:
List-Post:
List-Subscribe: ,
List-Id: Basic Essence Mailing List
List-Unsubscribe: ,
List-Archive:

Is there any way to remove those above while only maintaining the List-Id?
Those lines are extras since we can follow the link below to configure our options I'll appreciate if anyone can help me out Thank you very much :) -- Best Regards Terence Tham E-Alias: Odin whitestar@sourcevisions.net """ There also appear to be an awful lot of "false negatives" of the form: """ This is a message from the IFL E-Mail Virus Protection Service -------------------------------------------------------------- The original e-mail attachment "Card.DOC.pif" appears to be infected by a virus and has been replaced by this=20 warning message. """ That may be virus fallout, but I don't believe it belongs in the spam corpus, right? ---------------------- multipart/mixed attachment Return-Path: =0A= Envelope-To: do-sig-request@python.org=0A= Received: from [64.14.139.219] (helo=3Doutboundc.link2buy.com)=0A= by mail.python.org with esmtp (Exim 4.05)=0A= id 184Dc6-0003e0-00=0A= for do-sig-request@python.org; Wed, 23 Oct 2002 00:59:26 -0400=0A= Received: from [10.3.220.208]=0A= by outboundc.link2buy.com (10.3.220.242) with QMQP; 22 Oct 2002 = 21:56:21 +0000=0A= Message-ID: <704023967.1035349071889.mu@link2buy.com>=0A= Date: Tue Oct 22 21:46:35 PDT 2002=0A= From: ConsumerDirect =0A= To: do-sig-request@python.org=0A= Subject: Is there a Miracle Vitamin?=0A= Mime-Version: 1.0=0A= Content-Type: text/html=0A= Content-Transfer-Encoding: 7bit=0A= X-warning: 64.14.139.219 in blacklist at dnsbl.njabl.org=0A= (spam source -- 1031416958)=0A= X-Spam-Status: No, hits=3D-4.6 required=3D5.0 = tests=3DCTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,INVALID_= DATE,MAILTO_LINK,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS=0A= X-Spam-Level:=0A= =0A= =0A= 3D"link2buy.com"
=0A= =0A=




=0A= =0A=
3D"Why
http://link2buy.com/c/u.jsp?E=3Ddo-sig-request@python.or= g&P=3DMO2501_20021022_3638&U=3D259578032
Your email address on record = is do-sig-request@python.org.

Unsubscribe = me.
=0A= =0A= ---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.218] (helo=outbounda.link2buy.com) by mail.python.org with esmtp (Exim 4.05) id 184UDM-0002Cn-00 for do-sig-request@python.org; Wed, 23 Oct 2002 18:43:00 -0400 Received: from [10.3.220.205] by outbounda.link2buy.com (10.3.20.240) with QMQP; 23 Oct 2002 15:42:29 +0000 Message-ID: <704023967.1035412948706.mu@link2buy.com> Date: Wed Oct 23 15:31:03 PDT 2002 From: ConsumerDirect To: do-sig-request@python.org Subject: Back Child Support? We Can Help! Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.218 in blacklist at bl.spamcop.net (Blocked - see http://spamcop.net/bl.shtml?64.14.139.218) X-Spam-Status: No, hits=-4.7 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,INVALID_DATE,PLING_QUERY,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS X-Spam-Level: link2buy.com
Why are you receiving this email?
http://link2buy.com/c/u.jsp?E=do-sig-request@python.org&P=MO4509_20021021_3632&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.214] (helo=link2buy.com) by mail.python.org with smtp (Exim 4.05) id 184Xs0-0007q6-00 for do-sig-request@python.org; Wed, 23 Oct 2002 22:37:12 -0400 Received: (qmail 3223 invoked from network); 24 Oct 2002 02:20:51 -0000 Received: from unknown (HELO app02) (64.14.139.216) by 10.3.220.32 with SMTP; 24 Oct 2002 02:20:51 -0000 Message-ID: <704023967.1035426051418.mu@link2buy.com> Date: Wed, 23 Oct 2002 19:14:17 -0700 (PDT) From: ConsumerDirect To: do-sig-request@python.org Subject: Get a Car Loan Now - Any Credit-No Commitment or Fees! Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.214 in blacklist at dnsbl.njabl.org (spam source -- 1029255112) X-Spam-Status: No, hits=-0.4 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_OSIRUSOFT_COM,RCVD_IN_SBL,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS,X_OSIRU_SPAMWARE_SITE X-Spam-Level: link2buy.com


Why are you receiving this email?
http://link2buy.com/c/u.jsp?E=do-sig-request@python.org&P=MO3922_20021023_3653&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.219] (helo=outboundc.link2buy.com) by mail.python.org with esmtp (Exim 4.05) id 184Q4b-0000MJ-00 for do-sig-request@python.org; Wed, 23 Oct 2002 14:17:41 -0400 Received: from [10.3.220.208] by outboundc.link2buy.com (10.3.220.242) with QMQP; 23 Oct 2002 11:14:17 +0000 Message-ID: <704023967.1035396949410.mu@link2buy.com> Date: Wed Oct 23 11:08:22 PDT 2002 From: ConsumerDirect To: do-sig-request@python.org Subject: Enjoy the Web Again Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.219 in blacklist at dnsbl.njabl.org (spam source -- 1031416958) X-Spam-Status: No, hits=-4.5 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTML_50_70,HTTP_WITH_EMAIL_IN_URL,INVALID_DATE,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS X-Spam-Level: link2buy.com
Why are you receiving this email?
http://link2buy.com/c/u.jsp?E=do-sig-request@python.org&P=MO2735_20021017_3587&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment Return-Path: Envelope-To: jobs@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183BtE-00072Z-00 for jobs@python.org; Sun, 20 Oct 2002 04:56:52 -0400 Received: (qmail 542 invoked from network); 20 Oct 2002 07:26:47 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 07:26:47 -0000 Date: Sun, 20 Oct 2002 16:27:31 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020155600.237F.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: tutor@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183DZB-0007dJ-00 for tutor@python.org; Sun, 20 Oct 2002 06:44:17 -0400 Received: (qmail 542 invoked from network); 20 Oct 2002 07:26:47 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 07:26:47 -0000 Date: Sun, 20 Oct 2002 16:27:31 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020155600.237F.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-docs@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001IH-00 for python-docs@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-list@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001II-00 for python-list@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-help@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001IJ-00 for python-help@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-announce@python.org Received: from 211008071155.cidr.odn.ne.jp ([211.8.71.155] helo=vivis.ntf.ne.jp) by mail.python.org with smtp (Exim 4.05) id 183Epz-0001IK-00 for python-announce@python.org; Sun, 20 Oct 2002 08:05:43 -0400 Received: (qmail 24767 invoked from network); 20 Oct 2002 08:23:57 -0000 Received: from p1096-ipbf04fukuokachu.fukuoka.ocn.ne.jp (HELO ?192.172.0.105?) (220.96.125.96) by 211008071155.cidr.odn.ne.jp with SMTP; 20 Oct 2002 08:23:57 -0000 Date: Sun, 20 Oct 2002 17:24:41 +0900 From: INFO To: a@a.a Subject: =?ISO-2022-JP?B?GyRCTCQ+NUJ6OS05cCIoIVYlKyE8JUkkThsoQg==?= =?ISO-2022-JP?B?GyRCR2MkJEoqT0gkRyUtJWMlQyU3JWUlMhsoQg==?= =?ISO-2022-JP?B?GyRCJUMlSCFXGyhC?= Message-Id: <20021020164146.2388.INFO@goldrush.ntf.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.05.06 X-Spam-Status: No, hits=0.8 required=5.0 tests=SPAM_PHRASE_00_01 X-Spam-Level: $B!c#F#R#O#M!d(B info@goldrush.ntf.ne.jp $B!cAw?.Kt$OL>>N(B:$B%(%`%1%$!&%$%s%?!<%J%7%g%J%k!!9-9pIt(B $B=;=j!'J!2,8)KL6e=#;T>.ARKL6hEDD.(B10-38 $BEEOCHV9f!'(B093-592-4496 $B!cG[?.Dd;_J}K!!d>e5-%"%I%l%9$K$=$N$^$^JV?.%a!<%k$r$*Aw$j$/$@$5$$!#(BDB$B$h$j:o=|$5$;$F$$$?$@$-$^$9!#(B $B!2!2!2!2!2!2!2!2!2!2K\!!!!J8!2!2!2!2!2!2!2!2!2!2(B http://goldrush.ntf.ne.jp/ $B%+!<%I$N%7%g%C%T%s%0OH$G%-%c%C%7%e%2%C%H>pJs(B $B$$$^$^$GEl5~$dBg:e$N0lIt$N?M$7$+MxMQ$G$-$J$+$C$?!J$=$l$b9b$$ZL@>Z$N#F#A#X$G#O#K!#(B $B5$$K$J$k496bN($G$9$,!"9qFb:G9b?e=`$N496bN(#8#7!s$G496b$G$-$^$9!#(B $BJLESHqMQEy$OAwNA$N(B300$B1_$H?6$j9~$_NA!J6d9T5,Dj $B8BEY3[$O(B1$B2s$"$?$j(B10$BK|1_$^$G$G$*0l?MMM(B1$B%v7n$"$?$j(B20$BK|1_$^$G$G$9!#(B

$B!|:#7n$N;YJ'$$$,B-$j$J$$(B
$B!|.8/$$$,$$$k(B
$B!|2q $B!|3X@8%+!<%I$G%-%c%C%7%s%0$,?F$K$P$l$k$H$^$:$$(B

$B$=$s$J$"$J$?$K$T$C$?$j$G$9!#(B

$B$=$l$>$l$N2hLL$O2<$N%j%s%/$r%/%j%C%/$9$l$P=g Envelope-To: python-list-request@python.org Received: from dav11.pav3.hotmail.com ([64.4.38.115] helo=hotmail.com) by mail.python.org with esmtp (Exim 4.05) id 183aCi-00024k-00 for python-list-request@python.org; Mon, 21 Oct 2002 06:54:36 -0400 Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Mon, 21 Oct 2002 03:54:05 -0700 X-Originating-IP: [202.105.138.19] From: "amou *^_^*" To: Subject: re Date: Mon, 21 Oct 2002 18:57:43 +0800 MIME-Version: 1.0 X-Mailer: MSN Explorer 7.00.0021.1700 Content-Type: multipart/alternative; boundary="----=_NextPart_001_0000_01C27933.BC9FFDD0" Message-ID: X-OriginalArrivalTime: 21 Oct 2002 10:54:05.0647 (UTC) FILETIME=[2CA7BDF0:01C278F0] X-Spam-Status: No, hits=-7.3 required=5.0 tests=BASE64_ENC_TEXT,MIME_ALTERNATIVE,SPAM_PHRASE_00_01,USER_IN_WHITELIST_TO X-Spam-Level: ------=_NextPart_001_0000_01C27933.BC9FFDD0 Content-Type: text/plain; charset="gb2312" Content-Transfer-Encoding: base64 tNPN+NW+tcO1vbj8tuDQxc+ioaNNU04gRXhwbG9yZXIgw+K30c/C1Ng6aHR0cDovL2V4cGxvcmVy Lm1zbi5jb20vbGNjbg== ------=_NextPart_001_0000_01C27933.BC9FFDD0 Content-Type: text/html; charset="gb2312" Content-Transfer-Encoding: quoted-printable


=

=B4=D3=CD=F8=D5=BE=B5=C3=B5=BD=B8=FC=B6= =E0=D0=C5=CF=A2=A1=A3MSN Explorer =C3=E2=B7=D1=CF=C2=D4=D8=A3=BAhttp://explorer.msn.com/lccn

------=_NextPart_001_0000_01C27933.BC9FFDD0-- ---------------------- multipart/mixed attachment Return-Path: =0A= Envelope-To: mailman-users-request@python.org=0A= Received: from abn14-168.ist-avrupa-ports.kablonet.net.tr = ([195.174.14.168] helo=3Ddell)=0A= by mail.python.org with smtp (Exim 4.05)=0A= id 183ueG-0006vd-00=0A= for mailman-users-request@python.org; Tue, 22 Oct 2002 04:44:27 -0400=0A= x-esmtp: 0 0 1=0A= Message-ID: <633420021022264341949@dell>=0A= X-EM-Version: 5, 0, 0, 19=0A= X-EM-Registration: #01B0530810E603002D00=0A= X-Priority: 1=0A= Reply-To: istanbul@sushi.co.jp=0A= X-MSMail-Priority: High=0A= From: "www.indirimim.com" =0A= To:=0A= Subject: = =3D?windows-1254?Q?ekim_ay=3DFD_=3DE7ekili=3DFElerine_kat=3DFDl=3DFDn?=3D=0A= Date: Tue, 22 Oct 2002 09:43:41 +0300=0A= MIME-Version: 1.0=0A= Content-Type: multipart/alternative; =0A= boundary=3D"----=3D_NextPart_84815C5ABAF209EF376268C8"=0A= X-SMTPExp-Version: 1, 0, 2, 10=0A= X-SMTPExp-Registration: =FF=FF=FF=FF=0A= X-Spam-Status: No, hits=3D-0.1 required=3D5.0 = tests=3DHEADER_8BITS,MIME_ALTERNATIVE,MISSING_MIMEOLE,PRIORITY_NO_NAME,RC= VD_IN_RFCI,SPAM_PHRASE_02_03,USER_IN_WHITELIST_TO,X_ESMTP,X_MSMAIL_PRIORI= TY_HIGH,X_PRIORITY_HIGH,X_SMTPEXP_REGISTRATION,X_SMTPEXP_VERSION=0A= X-Spam-Level:=0A= =0A= This message was sent using a demo of EasyMail SMTP Express. For more = information about EasyMail SMTP Express visit = http://www.quiksoft.com/easymail.=0A= =0A= ------=3D_NextPart_84815C5ABAF209EF376268C8=0A= Content-type: text/plain; charset=3D"windows-1254"=0A= =0A= =DDnk=FDlap Kitapevin'den=0A= =FCcretsiz kitaplar Jimmi's 'de=0A= Misafirimiz olun Tatilya'da =FCcretsiz=0A= - S=FDn=FDrs=FDz E=F0lence - =DCcretsiz Fresh Look=0A= Renkli Lensler UltraForm G=FCzellik&Zay=FDflama=0A= Merkezi Ekim 2002 hediyelerini kazanmak i=E7in formu doldurun ve = =E7ekili=FEimize kat=FDl=FDn...... 
Aram=FDza Kat=FDlan Yeni Firmalar = Lens Market=0A= =DEa=FE=FDrt=FDc=FD De=F0i=FEiminizi www.indirimim.com avantaj=FDyla = ya=FEay=FDn. 64K/Kaspersky Antivirus (AVP) =0A= Bilgisayar=FDn=FDz=FD www.indirimim.com fiyat fark=FDyla koruyun=0A= Metro Turizm=0A= =DDndirimimli seyahatler sizi bekliyor. =DDnk=FDlap Kitapevi=0A= www.indirimim.com ile kitap almak daha avantajli Solid Sa=F0l=FDk = =DCr=FCnleri =0A= www.indirimim.com m=FC=FEterilerine =F6zel indirimler =0A= =0A= =0A= ------=3D_NextPart_84815C5ABAF209EF376268C8=0A= Content-Type: text/html; charset=3D"windows-1254"=0A= Content-Transfer-Encoding: quoted-printable=0A= =0A= =0A= =0A= Untitled Document=0A= =0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=3DDDnk=3DFDlap Kitapevin'den
=0A= =3DFCcretsiz kitaplar
=0A=
=3D20=0A=
Jimmi's 'de
=0A= Misafirimiz olun
=0A=
=3D20=0A=
Tatilya'da =3DFCcretsiz
=0A= - S=3DFDn=3DFDrs=3DFDz E=3DF0lence -
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=3DDCcretsiz Fresh Look
=0A= Renkli Lensler
=0A=
=3D20=0A=
UltraForm = G=3DFCzellik&Zay=3DFDflama
=0A= Merkezi
=0A=
=3D20=0A=
=0A=
=3D20=0A=
=0A=
=3D20=0A=
Ekim 2002 hediyelerini kazanmak i=3DE7in = formu d=3D=0A= oldurun=3D20=0A= ve =3DE7ekili=3DFEimize = kat=3DFDl=3DFDn=3D2E=3D2E=3D2E=3D2E=3D2E=3D2E
=0A=
=0A= =0A= =3D20=0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A=
=3D20=0A=
Aram=3DFDza Kat=3DFDlan Yeni = Firmalar
=0A=
Lens Market
=0A= =3DDEa=3DFE=3DFDrt=3DFDc=3DFD De=3DF0i=3DFEiminizi = www=3D2Eindirimim=3D2Ecom avantaj=3DFD=3D=0A= yla ya=3DFEay=3DFDn=3D2E
64K/Kaspersky Antivirus (AVP) =
=0A= Bilgisayar=3DFDn=3DFDz=3DFD www=3D2Eindirimim=3D2Ecom fiyat = fark=3DFDyla koruyun=3D=0A=
=0A=
Metro Turizm
=0A= =3DDDndirimimli seyahatler sizi bekliyor=3D2E
=0A= =0A= =3D20=0A= =0A= =0A= =0A= =3D20=0A= =0A= =0A= =0A=
=3DDDnk=3DFDlap Kitapevi
=0A= www=3D2Eindirimim=3D2Ecom ile kitap almak daha avantajli
Solid Sa=3DF0l=3DFDk = =3DDCr=3DFCnleri
=0A= www=3D2Eindirimim=3D2Ecom m=3DFC=3DFEterilerine =3DF6zel = indirimler
=0A=

 

=0A= =0A= =0A= =0A= ------=3D_NextPart_84815C5ABAF209EF376268C8--=0A= =0A= ---------------------- multipart/mixed attachment Return-Path: Envelope-To: do-sig-request@python.org Received: from [64.14.139.137] (helo=mybigbargains.com) by mail.python.org with smtp (Exim 4.05) id 183xNY-0008Gi-00 for do-sig-request@python.org; Tue, 22 Oct 2002 07:39:20 -0400 Received: (qmail 55296 invoked from network); 22 Oct 2002 11:31:39 -0000 Received: from unknown (HELO app11) (64.14.139.216) by 10.3.220.146 with SMTP; 22 Oct 2002 11:31:39 -0000 Message-ID: <704023967.1035286299797.mu@2mbb.com> Date: Tue, 22 Oct 2002 04:26:51 -0700 (PDT) From: ConsumerDirect To: do-sig-request@python.org Subject: Refinance - Get 4 free loan offers Mime-Version: 1.0 Content-Type: text/html Content-Transfer-Encoding: 7bit X-warning: 64.14.139.137 in blacklist at dnsbl.njabl.org (spam source -- 1032063500) X-Spam-Status: No, hits=0.8 required=5.0 tests=CTYPE_JUST_HTML,FROM_ENDS_IN_NUMS,HTTP_WITH_EMAIL_IN_URL,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_OSIRUSOFT_COM,RCVD_IN_SBL,SPAM_PHRASE_03_05,USER_IN_WHITELIST_TO,WEB_BUGS,X_OSIRU_SPAM_SRC X-Spam-Level: 2mbb.com
Why are you receiving this email?
http://2mbb.com/c/u.jsp?E=do-sig-request@python.org&P=MO2003_20021018_3606&U=259578032
Your email address on record is do-sig-request@python.org.

Unsubscribe me.
---------------------- multipart/mixed attachment--

From tim.one@comcast.net Sun Oct 27 02:46:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 26 Oct 2002 22:46:43 -0400 Subject: [Spambayes] Re: python.org corpus updated In-Reply-To: <15803.9391.949518.510594@montanaro.dyndns.org> Message-ID:

[Skip Montanaro, to Greg Ward about the rising spam:ham ratio on python.org traffic]
> Do you get email from python-list?

Yes, I believe all mailing lists going thru python.org are part of this traffic. This includes some private non-tech lists, and administrative requests, which is in part why Greg is rightfully reluctant to open the corpus for free public consumption.

> Its source is primarily the c.l.py newsfeed, which has been down for
> about five days now.

Yup.

From tim.one@comcast.net Sun Oct 27 02:57:06 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 26 Oct 2002 22:57:06 -0400 Subject: [Spambayes] Proposing to make chi-combining the default In-Reply-To: <20021026042907.14850F5A4@cashew.wolfskeep.com> Message-ID:

[Tim]
>> 2'. Change the default ham_cutoff to 0.20 and the default spam_cutoff
>>     to 0.90.

[T. Alexander Popiel]
> I'm slightly surprised at the looseness of 2', but as you say,
> the boundaries aren't all that touchy.

For my own email, and on my large c.l.py test, I use cutoffs of 0.30 and 0.80 with chi-combining very happily, so the suggested defaults are conservative relative to that. But they're *just* defaults, and anyone taking a default too seriously should be shot . Certainly, they should be closer to the endpoints if just starting training.

> I'm all for the above.

Nobody has objected, so I'll make the change next (I already made the other changes threatened, BTW -- use_mixed_combining is gone, and ditto ignore_redundant_html). Anyone wedded to gary-combining, don't panic: your database is unaffected by changing this default. There's no need to retrain it.
If you want to continue using gary-combining for scoring (again, the remaining combining schemes have nothing to do with training, they're purely a scoring-time choice), you'll need to add

    [Classifier]
    use_gary_combining: True

to your .ini file.

From popiel@wolfskeep.com Sun Oct 27 03:10:48 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Sat, 26 Oct 2002 20:10:48 -0700 Subject: [Spambayes] Mining the headers In-Reply-To: Message from Skip Montanaro <15803.8897.442152.315985@montanaro.dyndns.org> References: <20021026203410.166C1F54A@cashew.wolfskeep.com> <15803.8897.442152.315985@montanaro.dyndns.org> Message-ID: <20021027031048.3680AF54A@cashew.wolfskeep.com>

In message: <15803.8897.442152.315985@montanaro.dyndns.org> Skip Montanaro writes:
>
> I've had three other options knocking around locally which haven't seemed to
> help or hurt when applied to my collections: mine_date_headers,
> generate_time_buckets, and extract_dow. The first controls overall attention
> to the Date: header. The second generates tokens like time:12:3 (the third
> six-minute bucket of the twelfth hour). The third generates tokens like
> dow:0 (Monday). Should I check them in to see if they are useful for other
> people? (I seem to have a bit different fp & fn results than others.)

Yes, I'd love to test them.

- Alex

From tim.one@comcast.net Sun Oct 27 04:06:25 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 00:06:25 -0400 Subject: [Spambayes] hammie deployment without Outlook In-Reply-To: <3DBB0834.7050504@hooft.net> Message-ID:

[Rob Hooft]
> Since I'm Linux-only, I've been trying to put hammie.py to work for me.
> I made the attached change to hammie.py.

I can't help you with hammie, but I did check your change in. Especially since I just changed the default scoring scheme to chi-combining, it's important that tools take the middle ground seriously.
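The "middle ground" handling Rob's hammie patch adds boils down to a three-way cutoff test. As a sketch (using the default cutoffs Tim proposes in this thread; the function name here is illustrative, not hammie's API):

```python
def disposition(prob, ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way disposition: scores below ham_cutoff are ham, scores
    above spam_cutoff are spam, and anything in between is left as
    Unsure for human review."""
    if prob < ham_cutoff:
        return "No"
    elif prob > spam_cutoff:
        return "Yes"
    return "Unsure"

print(disposition(0.03), disposition(0.55), disposition(0.97))
# No Unsure Yes
```

Note that a score sitting exactly on a cutoff falls into the Unsure band, which is the conservative choice for a filter whose worst failure mode is a false positive.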
All, Rob's patch taught hammie how to use the ham_cutoff option, and introduced a new X-Hammie-Disposition: Unsure disposition (in addition to its Yes and No dispositions). > ... > I'll be waiting for a mozilla plugin to make interactive use of > spambayes.... Indeed, Mark Hammond and Sean True appear to be proving that people are more willing to contribute mounds of hard work to Windows clients . From skip@pobox.com Sun Oct 27 04:17:20 2002 From: skip@pobox.com (Skip Montanaro) Date: Sat, 26 Oct 2002 23:17:20 -0500 Subject: [Spambayes] training a single message w/ hammie.py? Message-ID: <15803.26832.97712.301007@montanaro.dyndns.org> I'm getting ready to set up my procmailrc file to run hammie.py. I'm hoping to do incremental training. Given an existing persistent store that's possible, right? If incremental training is possible, it seems odd to me that no capability is provided to accept a single message from stdin that is tagged as known spam or non-spam. Before I launch into modifying hammie.py, am I missing something obvious? Also, is it possible to train with the hammiecli/hammiesrv pair? Seems not. Skip From tim.one@comcast.net Sun Oct 27 04:37:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 00:37:58 -0400 Subject: [Spambayes] Mining the headers In-Reply-To: <15803.8897.442152.315985@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I've had three other options knocking around locally which > haven't seemed to help or hurt when applied to my collections: > mine_date_headers, generate_time_buckets, and extract_dow. The first > controls overall attention to the Date: header. The second generates > tokens like time:12:3 (the third six-minute bucket of the twelfth hour). > The third generates tokens like dow:0 (Monday). Should I check them > in to see if they are useful for other people? Of course. Please do. Note that many state governments have agreed to give you an extra hour tonight to do this . 
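The time-bucket and day-of-week tokens described above can be sketched as a small generator. The six-minute bucketing and the zero-indexing are read off Skip's description (the exact indexing is an assumption), and date_tokens is an illustrative name, not the tokenizer's actual function:

```python
import datetime
import email.utils

def date_tokens(date_header):
    """Yield tokens mined from an RFC 2822 Date: header.

    A sketch of the generate_time_buckets and extract_dow options as
    described here: 'time:HH:B' buckets each hour into six-minute
    slices (zero-indexed, an assumption), and 'dow:N' counts days
    with Monday == 0, as in datetime.date.weekday().
    """
    parsed = email.utils.parsedate_tz(date_header)
    if parsed is None:
        return  # unparseable header yields no tokens
    year, month, day, hour, minute = parsed[:5]
    yield 'time:%02d:%d' % (hour, minute // 6)
    yield 'dow:%d' % datetime.date(year, month, day).weekday()
```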
> (I seem to have a bit different fp & fn results than others.) Actually, unless things have changed dramatically, you get worse results than everyone else combined. That still lacks an explanation. Have you tried chi-combining yet? From skip@pobox.com Sun Oct 27 05:37:51 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 27 Oct 2002 00:37:51 -0500 Subject: [Spambayes] Mining the headers In-Reply-To: <20021027031048.3680AF54A@cashew.wolfskeep.com> References: <20021026203410.166C1F54A@cashew.wolfskeep.com> <15803.8897.442152.315985@montanaro.dyndns.org> <20021027031048.3680AF54A@cashew.wolfskeep.com> Message-ID: <15803.31663.391441.711086@montanaro.dyndns.org> >> I've had three other options knocking around locally which haven't >> seemed to help or hurt.... Should I check them in.... Alex> Yes, I'd love to test them. Done. Note that I deleted the mine_date_headers option. It was just a gatekeeper for the other two. Seemed pointless to me. Here's my latest run. The first run was the default. My dates.ini file is [Tokenizer] generate_time_buckets: True extract_dow: True The results: run1s -> datess -> tested 200 hams & 200 spams against 1800 hams & 1800 spams ... etc ... 
false positive percentages
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied
    0.000  0.000  tied
    0.500  0.500  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 5 to 5 tied
mean fp % went from 0.25 to 0.25 tied

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    1.000  1.000  tied
    0.500  0.500  tied
    1.000  0.500  won  -50.00%
    0.500  0.500  tied
    1.500  1.500  tied
    0.000  0.000  tied
    2.000  2.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fn went from 15 to 14 won -6.67%
mean fn % went from 0.75 to 0.7 won -6.67%

ham mean                     ham sdev
  1.38    1.38   +0.00%    10.18   10.17   -0.10%
  0.42    0.43   +2.38%     3.77    3.78   +0.27%
  0.98    0.98   +0.00%     8.39    8.36   -0.36%
  0.17    0.21  +23.53%     1.05    1.52  +44.76%
  0.93    0.93   +0.00%     7.73    7.73   +0.00%
  1.40    1.40   +0.00%     8.36    8.39   +0.36%
  1.18    1.14   -3.39%     7.39    7.24   -2.03%
  0.73    0.74   +1.37%     7.54    7.54   +0.00%
  0.97    0.98   +1.03%     6.62    6.72   +1.51%
  0.79    0.79   +0.00%     7.74    7.74   +0.00%

ham mean and sdev for all runs
  0.89    0.90   +1.12%     7.32    7.32   +0.00%

spam mean                    spam sdev
 99.17   99.16   -0.01%     4.63    4.71   +1.73%
 98.65   98.66   +0.01%     6.34    6.27   -1.10%
 96.71   96.71   +0.00%    13.73   13.74   +0.07%
 96.74   96.73   -0.01%    13.46   13.46   +0.00%
 98.44   98.46   +0.02%     9.25    9.23   -0.22%
 97.35   97.36   +0.01%    12.00   11.92   -0.67%
 98.33   98.34   +0.01%     9.55    9.53   -0.21%
 97.17   97.17   +0.00%    13.68   13.68   +0.00%
 98.94   98.93   -0.01%     6.89    6.90   +0.15%
 97.46   97.45   -0.01%    13.72   13.73   +0.07%

spam mean and sdev for all runs
 97.89   97.90   +0.01%    10.87   10.86   -0.09%

ham/spam mean difference: 97.00 97.00 +0.00
filename:      run1    dates
ham:spam:  2000:2000 2000:2000
fp total:         5        5
fp %:          0.25     0.25
fn total:        15       14
fn %:          0.75     0.70
unsure t:        93       93
unsure %:      2.33     2.33
real cost:   $83.60   $82.60
best cost:   $53.80   $53.60
h mean:        0.89     0.90
h sdev:        7.32     7.32
s mean:       97.89    97.90
s sdev:       10.87    10.86
mean diff:    97.00    97.00
k:             5.33     5.34

Note that my numbers seem to be getting a lot better. My ham/spam collection has slowly gotten cleaner and I've been adding more new stuff, not to mention that the default scheme (chi2?) seems a lot more sensitive/accurate. I noticed that as I lopped off old messages, first those from 1999 and before, then those from 2000, the accuracy improved. That suggests two things to me: first, the nature of "what is spam?" has changed a bit, and second, someone ought to test this notion. ;-) thanks-to-uncle-timmy-for-the-extra-hour-ly, y'rs, Skip

From skip@pobox.com Sun Oct 27 05:40:27 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 27 Oct 2002 00:40:27 -0500 Subject: [Spambayes] Mining the headers In-Reply-To: References: <15803.8897.442152.315985@montanaro.dyndns.org> Message-ID: <15803.31819.353336.954786@montanaro.dyndns.org>

>> (I seem to have a bit different fp & fn results than others.)

Tim> Actually, unless things have changed dramatically, you get worse
Tim> results than everyone else combined. That still lacks an
Tim> explanation. Have you tried chi-combining yet?

Yes, now that it's the default, though I didn't make a conscious attempt to compare it with other schemes. I got a bit behind the past several weeks, and still don't really understand all of what's been changed lately (I don't even pretend to be a statistician as a pick-up line in online chat rooms!). Skip

From skip@pobox.com Sun Oct 27 05:46:09 2002 From: skip@pobox.com (Skip Montanaro) Date: Sun, 27 Oct 2002 00:46:09 -0500 Subject: [Spambayes] setup.py?
Message-ID: <15803.32161.725680.230490@montanaro.dyndns.org> I don't suppose someone would care to skim the setup.py file to see if I've gotten most stuff that needs installing would they? Pretty please? Skip From rob@hooft.net Sun Oct 27 08:11:53 2002 From: rob@hooft.net (Rob Hooft) Date: Sun, 27 Oct 2002 09:11:53 +0100 Subject: [Spambayes] More proposed hammie changes: use Options Message-ID: <3DBB9FC9.2070306@hooft.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Attached are some more changes I'd like to propose to make to hammie: * Add -D option to reverse the -d option * Make the default use of pickle/database configurable * Add a showclue-limit to limit the clues added to the Hammie-Disposition header. I found the header becoming a bit large for many of my messages. This option can be used to make it show only the strongest clues either way. * Add a section [Hammie] to the configuration file to take all these hammie configurations such that hammie doesn't always need to be run with half a dozen of options to work (I always forget one if I'm trying it interactively). Furthermore, the patch changes a lot of the ' and " signs in the default string in Options.py such that the parser in emacs/python-mode.el is now happy with it. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ ---------------------- multipart/mixed attachment Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.59 diff -u -r1.59 Options.py --- Options.py 27 Oct 2002 05:26:01 -0000 1.59 +++ Options.py 27 Oct 2002 08:02:09 -0000 @@ -48,7 +48,7 @@ # Generate tokens just counting the number of instances of each kind of # header line, in a case-sensitive way. # -# Depending on data collection, some headers aren't safe to count. +# Depending on data collection, some headers are not safe to count. 
# For example, if ham is collected from a mailing list but spam from your # regular inbox traffic, the presence of a header like List-Info will be a # very strong ham clue, but a bogus one. In that case, set @@ -150,7 +150,7 @@ # # The idea is that if something scores < hamc, it's called ham; if # something scores >= spamc, it's called spam; and everything else is -# called "I'm not sure" -- the middle ground. +# called 'I am not sure' -- the middle ground. # # Note that cvcost.py does a similar analysis. # @@ -169,7 +169,7 @@ # Display spam when # show_spam_lo <= spamprob <= show_spam_hi -# and likewise for ham. The defaults here don't show anything. +# and likewise for ham. The defaults here do not show anything. show_spam_lo: 1.0 show_spam_hi: 0.0 show_ham_lo: 1.0 @@ -179,8 +179,8 @@ show_false_negatives: False show_unsure: False -# Near the end of Driver.test(), you can get a listing of the 'best -# discriminators' in the words from the training sets. These are the +# Near the end of Driver.test(), you can get a listing of the best +# discriminators in the words from the training sets. These are the # words whose WordInfo.killcount values are highest, meaning they most # often were among the most extreme clues spamprob() found. The number # of best discriminators to show is given by show_best_discriminators; @@ -196,7 +196,7 @@ # pickle_basename, the extension is .pik, and increasing integers are # appended to pickle_basename. By default (if save_trained_pickles is # true), the filenames are class1.pik, class2.pik, ... If a file of that -# name already exists, it's overwritten. pickle_basename is ignored when +# name already exists, it is overwritten. pickle_basename is ignored when # save_trained_pickles is false. # if save_histogram_pickles is true, Driver.train() saves a binary @@ -218,9 +218,9 @@ # training each on N-1 sets, and the predicting against the set not trained # on. 
By default, it does this in a clever way, learning *and* unlearning # sets as it goes along, so that it never needs to train on N-1 sets in one -# gulp after the first time. Setting this option true forces "one gulp -# from-scratch" training every time. There used to be a set of combining -# schemes that needed this, but now it's just in case you're paranoid . +# gulp after the first time. Setting this option true forces ''one gulp +# from-scratch'' training every time. There used to be a set of combining +# schemes that needed this, but now it is just in case you are paranoid . build_each_classifier_from_scratch: False [Classifier] @@ -230,15 +230,15 @@ max_discriminators: 150 # These two control the prior assumption about word probabilities. -# "x" is essentially the probability given to a word that's never been +# "x" is essentially the probability given to a word that has never been # seen before. Nobody has reported an improvement via moving it away # from 1/2. # "s" adjusts how much weight to give the prior assumption relative to # the probabilities estimated by counting. At s=0, the counting estimates # are believed 100%, even to the extent of assigning certainty (0 or 1) -# to a word that's appeared in only ham or only spam. This is a disaster. +# to a word that has appeared in only ham or only spam. This is a disaster. # As s tends toward infintity, all probabilities tend toward x. All -# reports were that a value near 0.4 worked best, so this doesn't seem to +# reports were that a value near 0.4 worked best, so this does not seem to # be corpus-dependent. # NOTE: Gary Robinson previously used a different formula involving 'a' # and 'x'. The 'x' here is the same as before. The 's' here is the old @@ -249,11 +249,11 @@ # When scoring a message, ignore all words with # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. # This may be a hack, but it has proved to reduce error rates in many -# tests over Robinson's base scheme. 
0.1 appeared to work well across +# tests over Robinsons base scheme. 0.1 appeared to work well across # all corpora. robinson_minimum_prob_strength: 0.1 -# The combining scheme currently detailed on Gary Robinon's web page. +# The combining scheme currently detailed on Gary Robinons web page. # The middle ground here is touchy, varying across corpus, and within # a corpus across amounts of training data. It almost never gives extreme # scores (near 0.0 or 1.0), but the tail ends of the ham and spam @@ -261,15 +261,15 @@ use_gary_combining: False # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) -# follows the chi-squared distribution with 2*n degrees of freedom. That's -# the "provably most-sensitive" test Gary's original scheme was monotonic +# follows the chi-squared distribution with 2*n degrees of freedom. That is +# the "provably most-sensitive" test Garys original scheme was monotonic # with. Getting closer to the theoretical basis appears to give an excellent # combining method, usually very extreme in its judgment, yet finding a tiny # (in # of msgs, spread across a huge range of scores) middle ground where -# lots of the mistakes live. This is the best method so far on Tim's data. -# One systematic benefit is that it's immune to "cancellation disease". One -# systematic drawback is that it's sensitive to *any* deviation from a -# uniform distribution, regardless of whether that's actually evidence of +# lots of the mistakes live. This is the best method so far on Tims data. +# One systematic benefit is that it is immune to "cancellation disease". One +# systematic drawback is that it is sensitive to *any* deviation from a +# uniform distribution, regardless of whether that is actually evidence of # ham or spam. Rob Hooft alleviated that by combining the final S and H # measures via (S-H+1)/2 instead of via S/(S+H)). 
# In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95, @@ -278,6 +278,26 @@ # with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets # (original c.l.p data, his own email, and newer general python.org traffic). use_chi_squared_combining: True + +[Hammie] +# The name of the header that hammie adds to an E-mail in filter mode +header: X-Hammie-Disposition + +# The default database path used by hammie +defaultdb: hammie.db + +# The range of clues that are added to the "hammie" header in the E-mail +# All clues that have their probability smaller than this number, or larger +# than one minus this number are added to the header such that you can see +# why spambayes thinks this is ham/spam or why it is unsure. The default is +# to show all clues, but you can reduce that by setting showclue to a lower +# value, such as 0.1 (which Rob is using) +showclue: 0.5 + +# hammie can use either a database (quick to score one message) or a pickle +# (quick to train on huge amounts of messages). Set this to True to use a +# database by default. +usedb: False """ int_cracker = ('getint', None) @@ -333,6 +353,12 @@ 'use_gary_combining': boolean_cracker, 'use_chi_squared_combining': boolean_cracker, }, + 'Hammie': {'header': string_cracker, + 'defaultdb': string_cracker, + 'showclue': float_cracker, + 'usedb': boolean_cracker, + }, + } def _warn(msg): Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.30 diff -u -r1.30 hammie.py --- hammie.py 27 Oct 2002 03:59:52 -0000 1.30 +++ hammie.py 27 Oct 2002 08:02:11 -0000 @@ -22,11 +22,14 @@ Only meaningful with the -u option. -p FILE use file as the persistent store. loads data from this file if it - exists, and saves data to this file at the end. Default: %(DEFAULTDB)s + exists, and saves data to this file at the end. + Default: %(DEFAULTDB)s -d use the DBM store instead of cPickle. 
The file is larger and creating it is slower, but checking against it is much faster, - especially for large word databases. + especially for large word databases. Default: %(USEDB)s + -D + the reverse of -d: use the cPickle instead of DBM -f run as a filter: read a single message from stdin, add an %(DISPHEADER)s header, and write it to stdout. If you want to @@ -52,15 +55,21 @@ program = sys.argv[0] # For usage(); referenced by docstring above # Name of the header to add in filter mode -DISPHEADER = "X-Hammie-Disposition" +DISPHEADER = options.header # Default database name -DEFAULTDB = "hammie.db" +DEFAULTDB = options.defaultdb # Probability at which a message is considered spam SPAM_THRESHOLD = options.spam_cutoff HAM_THRESHOLD = options.ham_cutoff +# Probability limit for a clue to be added to the DISPHEADER +SHOWCLUE = options.showclue + +# Use a database? If False, use a pickle +USEDB = options.usedb + # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) from tokenizer import tokenize @@ -208,7 +217,10 @@ def formatclues(self, clues, sep="; "): """Format the clues into something readable.""" - return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) + return sep.join(["%r: %.2f" % (word, prob) + for word, prob in clues + if (word[0] == '*' or + prob <= SHOWCLUE or prob >= 1.0 - SHOWCLUE)]) def score(self, msg, evidence=False): """Score (judge) a message. 
@@ -377,7 +389,7 @@ def main(): """Main program; parse options and go.""" try: - opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:r') + opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r') except getopt.error, msg: usage(2, msg) @@ -389,7 +401,8 @@ spam = [] unknown = [] reverse = 0 - do_filter = usedb = False + do_filter = False + usedb = USEDB for opt, arg in opts: if opt == '-h': usage(0) @@ -401,6 +414,8 @@ pck = arg elif opt == "-d": usedb = True + elif opt == "-D": + usedb = False elif opt == "-f": do_filter = True elif opt == '-u': ---------------------- multipart/mixed attachment-- From Alexander@Leidinger.net Sun Oct 27 12:40:41 2002 From: Alexander@Leidinger.net (Alexander Leidinger) Date: Sun, 27 Oct 2002 13:40:41 +0100 Subject: [Spambayes] Bugfix for hammie.py Message-ID: <20021027134041.766a951a.Alexander@Leidinger.net> This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Hi, it seems nobody is using multiple -u options besides me... Bye, Alexander. -- Yes, I've heard of "decaf." What's your point? http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... 
Name: hammie.py.diff Type: application/octet-stream Size: 819 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021027/86c846f9/hammie.py.exe ---------------------- multipart/mixed attachment--

From toby@tarind.com Sun Oct 27 14:41:43 2002 From: toby@tarind.com (Toby Dickenson) Date: Sun, 27 Oct 2002 14:41:43 +0000 Subject: [Spambayes] Maildir folders Message-ID: <200210271441.43419@trumpet.tarind.com>

I guess no one else is using maildir folders:

diff -c -1 -r1.2 mboxutils.py
*** mboxutils.py 4 Oct 2002 19:41:36 -0000 1.2
--- mboxutils.py 27 Oct 2002 14:40:49 -0000
***************
*** 12,13 ****
--- 12,15 ----
  files
+ /foo/bar/ -- (existing directory with a cur/ subdirectory)
+     Maildir mailbox
  /foo/Mail/bar/ -- (existing directory with /Mail/ in its path)
***************
*** 83,85 ****
      # else a DirOfTxtFileMailbox.
!     if name.find("/Mail/") >= 0:
          mbox = mailbox.MHMailbox(name, _factory)
--- 85,89 ----
      # else a DirOfTxtFileMailbox.
!     if os.path.exists(name+'/cur'):
!         mbox = mailbox.Maildir(name, _factory)
!     elif name.find("/Mail/") >= 0:
          mbox = mailbox.MHMailbox(name, _factory)

From tim.one@comcast.net Sun Oct 27 17:34:57 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 12:34:57 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <15801.54632.545081.293386@slothrop.zope.com> Message-ID: [Tim] >> Fine by me if we nuke the Bayes __slots__ now. [Jeremy] > Cool. Bayes.__slots__ has been nuked. Old pickles should continue to load without problem, though. > ... > I only train on spam when the existing classifier doesn't mark it as > spam. I expect that's a poor idea over time, because it ends up being too reliant on hapaxes, which in turn makes the scheme brittle.
The intent has always been to train on a random sampling of ham and spam, and in real life I'm finding it valuable to train frequently (in large part because if I get a new spam today, it seems I'm likely to get the same thing four more times over the next day, and 100 *related* spams over the next week; the hapaxes change across the variants, and particularly the specific numbers in numeric IPs in embedded URLs, but key phrases like "private gold mine" stay the same). I'm also finding it valuable to train on ham that scores above 0.01 and spam that scores below 0.99 (under chi-combining), and again seemingly because scores relying mostly on hapaxes work well over the short term but can be counterproductive over time. > I expect that the amount of spam I keep around won't be that > big compared to all the other email that I keep :-). Just don't go blaming the spambayes code when it breaks down <0.9 wink>. > ... > It's even worse, though, that it uses asyncore. I found asyncore > added a lot of complexity to ZEO and would rather we hadn't used it. > Then add in a second asyncore app (the proxy) and you've got real > trouble. The complexity seems to be multiplicative rather than > additive. Upgrade to Outlook and you can enjoy Mark's many threads. From tim.one@comcast.net Sun Oct 27 17:44:36 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 12:44:36 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <20021025233739.9C94DF5A4@cashew.wolfskeep.com> Message-ID: [Derek Simkowiak] >> Having a starter.db would both (a) make life easier for getting >> started, and (b) give us a well-established baseline to test against. [T. Alexander Popiel] > I disagree with (b), because changes in the tokenizer (where I suspect > some of the advances will come from) will invalidate the database. I expect virtually all future advances will come from the tokenizer now.
When staring at msgs, the classifier is the brain but the tokenizer supplies the eyes. We appear to have gotten all the drugs (biases) out of the brain, and gave it some truly kick-ass chintelligence, but it can't judge what it can't see. Obvious example: by default, the tokenizer still ignores the content of To and Cc headers, except to *count* the number of recipients. By default, we still don't "see", e.g., Undisclosed Recipients in the To header! This is so because cracking To gave great results for bogus reasons when using mixed-source corpora (which, e.g., includes people using their own personal email, when mixing in a spam or ham archive from a time when they used a different ISP with a different To address). From tim.one@comcast.net Sun Oct 27 18:02:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 13:02:07 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: [Derek Simkowiak] > How can we test out new algorithms if the project doesn't have a > control group? We have no way of knowing if someone's successful (or > poor) results are an attribute of the new algorithm, or if it's an > attribute of their particular sample data. Do read TESTING.txt, checked into the project. The testing framework is set up in a statistically sound way, so that even people working with a single corpus get it sliced and diced in random ways across multiple testing runs. In addition, as Alex already said, Big Changes have been made only after multiple-corpora tests reported on this list. When 10 randomized runs across each of several distinct corpora all yield similar results, it's easy to have confidence. > Having a starter.db would both (a) make life easier for getting > started, I couldn't give you a starter db that would work well for your ham. The algorithms here aren't *trying* to "identify spam" -- you want something like SpamAssassin if that's what you want. 
The algorithms here are trying to *separate* ham from spam, and the ham words are just as important to that as the spam words. I've run several experiments where a classifier trained on one corpus was used to predict against a different corpus. The false negative rate remained good (little spam snuck thru), but the false positive rate zoomed (many ham were *called* spam). In IR terms, spam recall remained good but spam precision suffered badly. This isn't surprising, either: except for foreign-language spam, spam is still using ordinary words, and the same words show up in ham too. For example, in the very msg I'm replying to, 'give' 0.648963 'skip:w 10' 0.664292 'results' 0.693332 'database.' 0.718815 'successful' 0.821229 'stock' 0.867852 'data.' 0.887295 "someone's" 0.969799 'subject:+' 0.987106 That's a decent collection of high-spamprob words. Nevertheless, chi-combining was extremely confident the msg was ham, because of a much larger number of low-spamprob words, some of which are specific to the topic being discussed on this mailing list, and some of which are specific to computer-geek chatter: 'argument' 0.0155709 'header:In-reply-to:1' 0.0158379 'subject:: [' 0.0169746 'attribute' 0.0196507 'url:mailman-21' 0.0196507 'skip:_ 40' 0.0320263 "else's" 0.0348837 '(b)' 0.0412844 'header:Errors-to:1' 0.0458968 'started,' 0.0505618 'subject:Spambayes' 0.0505618 'algorithms' 0.0652174 'subject:ZODB' 0.0652174 'subject:] ' 0.0772017 'from:derek' 0.0918367 'spambayes' 0.0918367 'header:Return-path:1' 0.0946929 'header:Message-id:1' 0.0962885 'header:MIME-version:1' 0.122459 The low-spamprob words specific to *your* ham will depend on the content of your ham in equally quirky ways. 
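The per-word probabilities in clue lists like the one above come from smoothed counts, so a word's score is pulled toward a neutral prior until it has been seen often enough to trust. A sketch of the Robinson-style s/x adjustment described in the Options.py comments; word_spamprob is an illustrative name, and s=0.45 is an assumption (the comments only say a value near 0.4 worked best):

```python
def word_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Smoothed per-word spam probability, Robinson style.

    x is the probability assumed for a never-before-seen word, and s
    is the weight given to that prior relative to the counting
    estimate.  The s=0.45 default is an assumption, not the list's
    stated value.
    """
    if spamcount == 0 and hamcount == 0:
        return x  # never seen: fall back to the prior entirely
    spamratio = spamcount / float(nspam)   # fraction of spam containing word
    hamratio = hamcount / float(nham)      # fraction of ham containing word
    counting = spamratio / (spamratio + hamratio)  # naive estimate
    n = spamcount + hamcount
    return (s * x + n * counting) / (s + n)
```

A word seen 10 times, only in spam, scores about 0.978 rather than a certain 1.0 -- exactly the "assigning certainty to a word that's appeared in only ham or only spam is a disaster" point the Options.py comments make.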
From dereks@itsite.com Sun Oct 27 19:05:46 2002 From: dereks@itsite.com (Derek Simkowiak) Date: Sun, 27 Oct 2002 11:05:46 -0800 (PST) Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: > I've run several experiments where a classifier trained on one corpus > was used to predict against a different corpus. The false negative > rate remained good (little spam snuck thru), but the false positive > rate zoomed (many ham were *called* spam). In IR terms, spam recall > remained good but spam precision suffered badly. It seems like you're saying that SpamBayes will not work for an enterprise-wide deployment, since different individuals' vocabularies, writing styles, and interests vary so wildly. In the false positives you mention above, was the spam cutoff being used? (If so, what was it set to?) Or, are those "false positives" hams being assigned a spam probability >.50? I am a big fan of enterprise-wide anti-spam measures. In my mind, it makes sense to flag messages and have "default" filter rules for every workstation. It makes it much easier on the I.T. department. Requiring Python on every Windows box would immediately make SpamBayes a no-go in many businesses and Universities, simply because of the (expensive) user support that would be required. So I am concerned when you present evidence that every individual needs to do their own SpamBayes training. It is obvious and well-understood that a .db trained from a specific individual's body of emails will work better for that individual than for some other individual. So what you say above does not surprise me. But what does surprise me is the argument that every individual should do their own SpamBayes training. > The low-spamprob words specific to *your* ham will depend on the > content of your ham in equally quirky ways.
No doubt; but over a large body of emails from many different individuals, I think the "quirkies" would fall by the wayside (because any one individual's quirkies would not be very frequent over the given collection), and that the Spam-specific "quirkies" (things like color=#FF0000) would hence become the strongest identifiers for any given message. (Officially proposing the term "quirkie" to mean a strong spamprob word -- either for or against -- that is specific to a particular corpus of email.) I'm guessing that if you did your tests again, but trained against all the corpuses before doing the test, your false positive rate would drop way down. (Is that not how SpamBayes is supposed to work?) You see, I do not have access to a large corpus of email from many different individuals. All I have is my inbox, which is quite quirky indeed. But I want to set up a hammie.py installation for a small workgroup, to see what kind of performance I get, and to monitor SpamBayes' performance changes over time (as it's trained to the small workgroup's incoming messages). If I had a starter .db file that was trained against many emails from many different individuals, then I'd be able to get going. Instead, I'm stuck wondering what process I should go through to try to collect a large corpus of email that will have its ham quirkies averaged away. But I know from reading test results here that many individuals have already taken the time and effort to do that. So I am asking for someone to share that effort -- kind of like Open Source, except on SpamBayes training instead of code writing. --Derek From seant@iname.com Sun Oct 27 19:19:23 2002 From: seant@iname.com (Sean True) Date: Sun, 27 Oct 2002 14:19:23 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: > I am a big fan of enterprise-wide anti-spam measures. In my mind, > it makes sense to flag messages and have "default" filter rules for every > workstation. It makes it much easier on the I.T.
department. Requiring > Python on every Windows box would immediately make SpamBayes a no-go in > many businesses and Universities, simply because of the (expensive) user > support that would be required. So I am concerned when you present > evidence that every individual needs to do their own SpamBayes training. Tim's concerns seem to center on the very individual definition of what ham is. I think I remember an earlier concession about a more common definition of what spam is. Perhaps we need a starter database that has a predefined set of spam probabilities, to which one could add one's own ham (and additional spam). I have a lot more ham than spam available, and I've been saving spam for months, against a rainy day -- or a project like this. If somebody jump-started my spam collection, I'd be a happy camper. -- Sean From tim.one@comcast.net Sun Oct 27 20:11:50 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 27 Oct 2002 15:11:50 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID: [Sean True] > Tim's concerns seem to center on the very individual definition > of what ham is. I think I remember an earlier concession about a more > common definition of what spam is. They're intimately related: the definition of spam that seems to work best with this system is "automated bulk advertising the user doesn't want". The language and artifacts of advertising "are different", akin to the way it's usually easy to distinguish a TV commercial from a sitcom or movie. But virtually everyone signs up for automated bulk advertising they *do* want, and it's the "X wants Y's crap" part that can't be served by a single database. You don't want ads from esmokes.com but I do; you may want marketing spam from Microsoft about .NET but I don't; etc. It's consistent in tests that all forms of marketing collateral get rated as spam unless specifically trained for an individual's "ya, but I want *that* spam" preferences.
> Perhaps we need a starter database that has a predefined set of spam
> probabilities, to which one could add one's own ham (and additional
> spam). I have a lot more ham than spam available, and I've been saving
> spam for months, against a rainy day -- or a project like this. If
> somebody jump-started my spam collection, I'd be a happy camper.

There are extensive spam archives available for the taking:

    http://www.paulgraham.com/spamarchives.html

My large comp.lang.python test trained against Bruce Guenter's 2002 spam collection (about 14,000 spam at the time). When I used that classifier against Greg Ward's later and more-general python.org corpus, it gave high false positive rates against the handful of small private "hobby" mailing lists run thru python.org. The ham in those didn't look anything like legitimate c.l.py traffic (neither to the classifier nor to human eyes), so got high spam scores: all ham uses spam words, but it's hard for spam to hit a significant # of low-spamprob words, where "low" is defined relative to an individual. Hence the FN rate doesn't suffer much, but the FP rate zooms.

There's really no point in arguing about this, as it's something that can be tested. All tests in that direction to date have been discouraging. OTOH, tests on the general python.org corpus -- barring personal email -- have been very encouraging, but only when run with a classifier trained *on* the general python.org corpus. Tech mailing lists simply don't tolerate much advertising of any kind, and it's still the case that non-Python, non-Zope conference announcements (which are a form of bulk advertising) get high scores relative to other ham.
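Tim's point that "low" spamprob is relative to an individual can be made concrete. Below is a minimal editorial sketch (not spambayes code; the word counts are invented for illustration) of the basic per-word probability -- the word's frequency in spam relative to its frequency in ham:

```python
def word_spamprob(hamcount, nham, spamcount, nspam):
    # Frequency of the word in spam relative to its frequency in ham,
    # each normalized by its corpus size.
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    return spamratio / (hamratio + spamratio)

# For a c.l.py reader, a word like "python" floods the ham corpus, so it's
# a strong ham clue; for a user whose ham never mentions it, the very same
# word (seen only in spam) becomes a certain spam clue.  Counts are
# hypothetical.
clpy_reader = word_spamprob(3000, 10000, 50, 10000)   # well under 0.1
other_user = word_spamprob(0, 10000, 50, 10000)       # exactly 1.0
```

This is one way to see why a classifier trained on one person's ham misfires on another's: the spam side changes little between users, but the low-spamprob vocabulary is exactly the part that is individual.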
From tim.one@comcast.net Sun Oct 27 20:55:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 15:55:25 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

[Derek Simkowiak]
> It seems like you're saying that SpamBayes will not work for an
> enterprise-wide deployment, since different individuals' vocabularies,
> writing styles, and interests vary so wildly.

That's a matter for testing to decide, but it's not the kind of thing I can make time to test. I doubt that their vocabularies or writing styles matter (it's the email you get, not the email you write, that's judged); what matters is what forms of advertising the individuals within the enterprise want. "Enterprise" is too vague a word to guess anything about that in general. If "the enterprise" is general tech mailing-list traffic going thru python.org, then we have strong evidence (from testing) that a single classifier will work great. If "the enterprise" is an ISP serving 1,000 individuals' private email, I expect a single classifier would have such high false positive rates as to be unacceptable.

If you have one user who *wants* porn ads, a single classifier has to be trained to accept them (and can be -- it's easy). Then all users get them. If one user signs up for a minister-by-mail scam (a real-life example reported earlier on this list), then all users get minister-by-mail scams. Etc.

> In the false positives you mention above, was the spam cutoff being
> used? (If so, what was it set to?) Or, are those "false positives" hams
> being assigned a spam probability > .50?

Different tests were done at different times with different combining schemes and different corpora. They all had in common that "false positive" scores were above a realistic middle-ground cutoff.

> I am a big fan of enterprise-wide anti-spam measures. In my mind, it
> makes sense to flag messages and have "default" filter rules for every
> workstation. It makes it much easier on the I.T. department. Requiring
> Python on every Windows box would immediately make SpamBayes a no-go in
> many businesses and Universities, simply because of the (expensive) user
> support that would be required. So I am concerned when you present
> evidence that every individual needs to do their own SpamBayes training.

Spam in the sense of "advertising I don't want to see, as opposed to advertising I do want to see" is a personal judgment. That doesn't preclude server-based approaches, but it would require knowing about (saving info about) each individual, unless "the enterprise" has a single, fixed policy about what constitutes advertising nobody in the enterprise should be allowed to receive.

> It is obvious and well-understood that a .db trained from a specific
> individual's body of emails will work better for that individual than
> for some other individual. So what you say above does not surprise me.
> But what does surprise me is the argument that every individual should
> do their own SpamBayes training.

Test it and draw your own conclusions -- nothing is hidden here.

>> The low-spamprob words specific to *your* ham will depend on the
>> content of your ham in equally quirky ways.

> No doubt; but over a large body of emails from many different
> individuals, I think the "quirkies" would fall by the wayside (because
> any one individual's quirkies would not be very frequent over the given
> collection), and that the Spam-specific "quirkies" (things like
> color=#FF0000) would hence become the strongest identifiers for any
> given message.

In that case the ham quirkies become too weak to let that individual's favored forms of advertising thru. By the way, if you think #FF0000 is a killer-strong spam clue, you don't have young relatives sending you HTML birthday greetings <0.6 wink>.

> (Officially proposing the term "quirkie" to mean a strong spamprob word
> -- either for or against -- that is specific to a particular corpus of
> email.)

Consider it adopted -- I like it!
> I'm guessing that if you did your tests again, but trained against all
> the corpuses before doing the test, your false positive rate would drop
> way down. (Is that not how SpamBayes is supposed to work?)

Training on ham does improve the FP rate. But if I have to train it to allow the forms of bulk advertising you want to see, then a single classifier can't block those forms of advertising for anyone else. In the python.org context, the only community-accepted advertising is highly specific to Python and Zope, so a single classifier works fine. In the context of my personal email, the only advertising I want to see is from the companies I do business with, and I indeed needed to train carefully on several examples each of marketing email from various *specific* financial institutions, companies, and special-interest newsletters *I* like to see. I've even trained it to accept "Joke of the Day" spam, because I often like the jokes, despite that the rest of those spams are trying to sell me the usual range of crap from human growth hormone to miracle diets. You don't want to see that stuff, and that I've trained my classifier to accept marketing blurbs from Strong isn't going to help you get marketing blurbs from Oppenheimer.

> You see, I do not have access to a large corpus of email from many
> different individuals. All I have is my inbox, which is quite quirky
> indeed.

So start with that.

> But I want to set up a hammie.py installation for a small workgroup, to
> see what kind of performance I get, and to monitor SpamBayes'
> performance changes over time (as it's trained to the small workgroup's
> incoming messages).

Then start with that.

> If I had a starter .db file that was trained against many emails from
> many different individuals, then I'd be able to get going.

Just start and see what happens.
You're simply not going to get a DB from anyone trained on personal email, because there are too many clues about individual identities in the database, including things like passwords and account numbers, and email addresses of friends and relatives.

> Instead, I'm stuck wondering what process I should go through to try to
> collect a large corpus of email that will have its ham quirkies averaged
> away.

You don't need a large corpus; the system learns quickly; just start.

> But I know from reading test results here that many individuals have
> already taken the time and effort to do that. So I am asking for someone
> to share that effort -- kind of like Open Source, except on SpamBayes
> training instead of code writing.

I could give you a classifier trained on comp.lang.python traffic plus Bruce G's 2002 spam collection. Indeed, I used to make such a thing available on SourceForge. Few people bothered to try it, and those who did reported poor results on their personal email, so I got rid of it. I don't believe anyone tried it in the context of corporate email. I won't believe that you're going to try it until you report that you've already started and are getting poor results.

From tim.one@comcast.net Sun Oct 27 21:36:16 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 16:36:16 -0500
Subject: [Spambayes] Maildir folders
In-Reply-To: <200210271441.43419@trumpet.tarind.com>
Message-ID:

[Toby Dickenson]
> I guess no one else is using maildir folders:

Thanks for the patch, Toby. I checked it in but can't test it; you should, since I changed

> !         if os.path.exists(name+'/cur'):

to use os.path.join().

From tim.one@comcast.net Sun Oct 27 21:41:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 16:41:17 -0500
Subject: [Spambayes] Bugfix for hammie.py
In-Reply-To: <20021027134041.766a951a.Alexander@Leidinger.net>
Message-ID:

[Alexander Leidinger]
> it seems nobody is using multiple -u options besides me...
hammie users appear missing in action today -- I checked your patch in (and thank you!), but haven't tested it.

From tim.one@comcast.net Sun Oct 27 21:50:03 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 16:50:03 -0500
Subject: [Spambayes] More proposed hammie changes: use Options
In-Reply-To: <3DBB9FC9.2070306@hooft.net>
Message-ID:

[Rob Hooft]
> Attached are some more changes I'd like to propose to make to hammie:

I'd just check this in, if I were you. One suggestion: option names have to be unique across all sections now, so tiny option names like "header" would be better spelled, e.g., hammie_header_name.

From rob@hooft.net Sun Oct 27 22:12:30 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 27 Oct 2002 23:12:30 +0100
Subject: [Spambayes] More proposed hammie changes: use Options
References:
Message-ID: <3DBC64CE.7040100@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
>> Attached are some more changes I'd like to propose to make to hammie:
>
> I'd just check this in, if I were you. One suggestion: option names have
> to be unique across all sections now, so tiny option names like "header"
> would be better spelled, e.g., hammie_header_name.

Done. I was just waiting for one review like this. New options and their defaults are now:

    [Hammie]
    hammie_header_name: X-Hammie-Disposition
    persistant_storage_file: hammie.db
    clue_mailheader_cutoff: 0.5
    persistant_use_database: False

Note that this was crafted such that nothing changes for people that are using hammie.py already, except if they changed the defaults in the source (which would result in collisions now).

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From tim.one@comcast.net Sun Oct 27 22:31:36 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 17:31:36 -0500
Subject: [Spambayes] Mining the headers
In-Reply-To: <15803.31663.391441.711086@montanaro.dyndns.org>
Message-ID:

[Skip Montanaro]
> Done. Note that I deleted the mine_date_headers option. It was just a
> gatekeeper for the other two. Seemed pointless to me. Here's my latest
> run. The first run was the default. My dates.ini file is
>
> [Tokenizer]
> generate_time_buckets: True
> extract_dow: True

Skip, I think there's a bug in the extract_dow code. On a quick python.org test, here are the dow tokens left behind in the database:

                   #ham  #spam  spamprob
    'dow:0'           2      7  0.890542594688
    'dow:1'           3      7  0.854937008074
    'dow:2'         725     71  0.220827483069
    'dow:3'        1038    261  0.420993872704
    'dow:4'         845    234  0.444677806501
    'dow:5'         126    196  0.81766035841
    'dow:6'           0    137  0.998363041106
    'dow:invalid'  2741    946  0.499472081328

Those only trained on half a week's traffic, so it's not surprising that half the days are virtually empty. What is surprising is that every ham trained on, and all but 2 of the spam, generated a dow:invalid token. Because the

    for fmt in self.date_formats:

loop has no early exit, its "else:" clause always executes. If I repair that, dow:invalid becomes a mild spam clue:

    'dow:invalid'     2     33  0.97338283678

I say it's "mild" just because it's infrequent in absolute terms. I'll check that change in anyway, and run a better test.

From dereks@itsite.com Sun Oct 27 22:46:53 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Sun, 27 Oct 2002 14:46:53 -0800 (PST)
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

> can be -- it's easy). Then all users get them. If one user signs up for a
> minister-by-mail scam (a real-life example reported earlier on this list),
> then all users get minister-by-mail scams. Etc.

I'm a little slow, so forgive me if this is... repetitive. But your argument sounds like something of a showstopper to my intended use of SpamBayes, and I want to make sure this behaviour is clearly documented in the archives.

Consider a group of people who all use the same mail server. I'm thinking of a university, or customers of one of those $20/month email services, or a 1000-person company.
Now consider the sysadmin who wants to use SpamBayes for the purpose of flagging spam on that mail server, such that users can set up a generic filter rule that is easily supported by the organization's Help Desk.

The way I understand it, if any _one_ person in the group of people likes to get advertisements, porn mails, hotel conference info, and/or minister-by-mail, and SpamBayes is trained on all incoming mail, then everybody in the group will have their filtering rendered useless.

In other words, Bayesian filtering (as popularized by the article "A Plan for Spam") is only good for individuals, or small groups of individuals who all like the same kinds of ham.

I can't help but feel that I'm missing something. In this setting, it seems like training on hams is quite destructive to the goal of flagging Spam.

What if we pretend that all hams have exactly .5 probability? That is, any given ham cannot be identified as either being a spam or not being a spam -- all hams are just random noise. Then we train against a huge collection of spam, like Bruce G.'s stuff.

Each word in the database gets a "spam likelihood" rating, depending on what percentage of the time it shows up in the spams. A word that shows up in every single spam gets a "1.0", and every word that does not appear in the spam at all gets a "0.0". We throw out ueber-common words like a, and, the, it, just like Google does for its searches, as a matter of efficiency.

Then every email is rated word-by-word. The scores for all the words are then averaged together. So an email with many words commonly found in spam gets a high rating... (?)

Um, I've overstepped my understanding of the problem, so I'll just stop there. But to you algorithm geniuses, I plead for a way to filter spam that depends only on previously-seen Spam, and that does not depend on what ham looks like.
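Taken literally, the spam-only scoring scheme Derek describes could look like the following editorial sketch (the names are hypothetical, and this is not anything in the spambayes codebase):

```python
def spam_only_score(message_words, spam_fraction):
    # spam_fraction maps word -> fraction of training spams containing it;
    # a word never seen in any spam rates 0.0, per Derek's description.
    ratings = [spam_fraction.get(w, 0.0) for w in message_words]
    # Average the per-word ratings; an empty message scores 0.0.
    return sum(ratings) / len(ratings) if ratings else 0.0

table = {'cash': 0.8, 'prize': 0.6, 'meeting': 0.1}
score = spam_only_score(['cash', 'prize'], table)    # (0.8 + 0.6) / 2 = 0.7
```

Note that with no ham counts in the denominator, words that are merely common everywhere ("a", "the") would rate near 1.0 too, which is the weakness Tim's reply dwells on.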
Thanks,
Derek

From tim.one@comcast.net Sun Oct 27 22:57:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 17:57:05 -0500
Subject: [Spambayes] More proposed hammie changes: use Options
In-Reply-To: <3DBC64CE.7040100@hooft.net>
Message-ID:

[Tim]
>> I'd just check this in, if I were you. One suggestion: option names ...

[Rob Hooft]
> Done. I was just waiting for one review like this.

Happy to oblige.

> New options and their defaults are now:
>
> [Hammie]
> hammie_header_name: X-Hammie-Disposition
> persistant_storage_file: hammie.db
> clue_mailheader_cutoff: 0.5
> persistant_use_database: False

Good! NOTE that I did s/persistant/persistent/g later.

> Note that this was crafted such that nothing changes for people that are
> using hammie.py already, except if they changed the defaults in the
> source (which would result in collisions now).

Indeed you did a careful job -- it's appreciated.

From dereks@itsite.com Sun Oct 27 23:27:04 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Sun, 27 Oct 2002 15:27:04 -0800 (PST)
Subject: [Spambayes] Maildir folders
In-Reply-To:
Message-ID:

>> !         if os.path.exists(name+'/cur'):
>
> to use os.path.join().

os.path.join() will ignore all preceding entries as soon as one of the entries starts with a '/'. (Furthermore, '/' is not cross-platform; try to stick to os.sep.) So make sure it's

    os.path.join(name, 'cur')

From tim.one@comcast.net Sun Oct 27 23:43:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 18:43:18 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

[Tim]
>> If one user signs up for a minister-by-mail scam (a real-life example
>> reported earlier on this list), then all users get minister-by-mail
>> scams. Etc.

[Derek Simkowiak]
> I'm a little slow, so forgive me if this is... repetitive. But your
> argument sounds like something of a showstopper to my intended use of
> SpamBayes, and I want to make sure this behaviour is clearly documented
> in the archives.

Arguments don't count for much here: you can set up a test and measure results. That's the only way to know. I've told you my best guess, but guesses here are often wrong.

> Consider a group of people who all use the same mail server. I'm
> thinking of a university, or customers of one of those $20/month email
> services, or a 1000-person company.
>
> Now consider the sysadmin who wants to use SpamBayes for the purpose of
> flagging spam on that mail server, such that users can set up a generic
> filter rule that is easily supported by the organization's Help Desk.

It's not an application I've got in mind, and not one that I've tested or intend to test. Other people here are interested in this, but they don't appear to be around today.

> The way I understand it, if any _one_ person in the group of people
> likes to get advertisements, porn mails, hotel conference info, and/or
> minister-by-mail, and SpamBayes is trained on all incoming mail, then
> everybody in the group will have their filtering rendered useless.

Minus the hyperbole, yes, unless you've done whatever it takes to inject some recipient-specific smarts. If it passes on porn spam to me, how could it possibly block it for you otherwise? For a start, it would have to know that you and I are different. And that's got nothing to do with Bayesian filters, or any other technicality: if different people call different things spam (and they do -- that's a fact), and any scheme that doesn't know the difference between people necessarily treats all people the same (that sure *seems* to be a fact), then if it lets my porn spam through, you get it too, or if it blocks my porn spam for you, it blocks it for me too. Either way one of us is left unhappy.
> In other words, Bayesian filtering (as popularized by the article "A
> Plan for Spam") is only good for individuals, or small groups of
> individuals who all like the same kinds of ham.

I think that's too extreme a conclusion. For example, python.org serves up tech lists for tens of thousands of users, and we have strong evidence that a single classifier will work fine there. Tech lists have a *shared* notion of what's spam, though.

> I can't help but feel that I'm missing something. In this setting, it
> seems like training on hams is quite destructive to the goal of flagging
> Spam.

The algorithm doesn't try to flag spam, it tries to *separate* ham from spam, and the characteristics of both populations feed into that. We've come full circle, and I'll repeat that SpamAssassin may be more to your liking. It does try to flag spam largely independent of any notion of ham, although from what I've seen of SpamAssassin admins they spend a lot of time crafting "positive rules" to try to let through things *their* site considers to be ham. Whitelists seem very effective for that, and I expect some form of whitelist would help a large deployment of the spambayes code too. OTOH, different people also want different whitelists.

> What if we pretend that all hams have exactly .5 probability, that is,
> any given ham cannot be identified as either being a spam, or not being
> a spam. That is, all hams are just random noise.
>
> Then we train against a huge collection of spam, like Bruce G.'s stuff.
>
> Each word in the database gets a "spam likelihood" rating, depending on
> what percentage of the time it shows up in the spams. A word that shows
> up in every single spam gets a "1.0", and every word that does not
> appear in the spam at all gets a "0.0".

I don't know, and it doesn't seem to make sense in the statistical framework the spambayes project is built around. You could test it, though, by fiddling our codebase.
For example, replace update_probabilities like so:

    def update_probabilities(self):
        """Update the word probabilities in the spam database.

        This computes a new probability for every word in the database,
        so can be expensive. learn() and unlearn() update the
        probabilities each time by default. They have an optional
        argument that allows skipping this step when feeding in many
        messages, and in that case you should call
        update_probabilities() after feeding the last message and before
        calling spamprob().
        """
        nspam = float(self.nspam or 1)
        S = options.robinson_probability_s
        StimesX = S * options.robinson_probability_x
        for word, record in self.wordinfo.iteritems():
            spamcount = record.spamcount
            assert spamcount <= nspam
            prob = spamcount / nspam
            # Now do Robinson's Bayesian adjustment.
            # ...
            prob = (StimesX + spamcount * prob) / (S + spamcount)
            if record.spamprob != prob:
                record.spamprob = prob
                self.wordinfo[word] = record

BTW, there's no need to train on ham at all then (doing so would have no effect on computed spamprobs).

> We throw out ueber-common words like a, and, the, it, just like Google
> does for its searches, as a matter of efficiency.

It's not really a matter of efficiency, it's more that since "a" appears in virtually every spam *and* ham, the spamprob of "a" will be approximately 1.0 if you ignore hamcounts (it's approximately 0.5 now). Note too that any ham that just happens to mention "money" will also have a very high-spamprob word. Words you used in this email:

    'filtering'  0.844828
    'plead'      0.844828
    'is...'      0.844828
    'company.'   0.895746
    'spam.'      0.899585
    'scam'       0.908163
    'like.'      0.934783
    'flagging'   0.958716
    'rated'      0.983271
    'porn'       0.988998

will have even higher spamprobs than those, because there will be no hamcounts to counteract them. Indeed, all words will have higher spamprobs than they have now.

> Then every email is rated word-by-word. The scores for all the words are
> then averaged together. So an email with many words commonly found in
> spam gets a high rating... (?)

Here's a spam I picked at random from my personal collection. Which words in this can you hope to get a high spam rating?

"""
Hi i read your profile and you live in my area. Maybe we could chat on line or even meet for a coffee. If you would like to come and chat with me i will be on line most of the night at http://www.designerlove.com/?rid=love2 My screen name is "PenPal" Log in and i'll be in the chat section. Hope to see you soon.
"""

As a matter of fact, none of those words are *common* in spam, except for words like "and", "the" and "on". My classifier nails it anyway (score of 0.97), because while words like "chat" appear in a small percentage of my spam, they appear in even less of my ham (peeking inside a fat c.l.py classifier, 'chat' appeared in 20 of 18,000 ham, and 164 of 12,600 spam: it's rare by any measure; what matters here is that it's *relatively* rarer in my ham than in my spam, and likewise for 'coffee.', and so on).

> Um, I've overstepped my understanding of the problem, so I'll just stop
> there. But to you algorithm geniuses, I plead for a way to filter spam
> that depends only on previously-seen Spam, and that does not depend on
> what ham looks like.

Why do you think you get so much spam? One reason is that one-size-fits-all schemes don't work well. I'd like to plead for world peace too, while the algorithm geniuses are at it.

From tim.one@comcast.net Sun Oct 27 23:57:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 18:57:10 -0500
Subject: [Spambayes] Maildir folders
In-Reply-To:
Message-ID:

[Derek Simkowiak]
> os.path.join() will ignore all preceding entries as soon as one of the
> entries starts with a '/'. (Furthermore, '/' is not cross-platform; try
> to stick to os.sep.)
>
> So make sure it's
>
> os.path.join(name, 'cur')

Yup, that's what it is.
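Derek's two claims about os.path.join can be checked directly. A small sketch, using posixpath (the POSIX flavor of os.path from the standard library) so the behavior is deterministic regardless of the platform it runs on:

```python
import posixpath

# join() supplies the separator itself, so components needn't carry one:
good = posixpath.join('Maildir', 'cur')      # 'Maildir/cur'

# The gotcha Derek notes: an absolute component discards everything
# joined before it.
gotcha = posixpath.join('Maildir', '/cur')   # '/cur'
```

This is exactly why `os.path.join(name, 'cur')` is the safe spelling of `name + '/cur'`: the separator is inserted for you, in the form the host platform expects.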
From skip@pobox.com Mon Oct 28 00:08:35 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 27 Oct 2002 18:08:35 -0600
Subject: [Spambayes] Mining the headers
In-Reply-To:
References: <15803.31663.391441.711086@montanaro.dyndns.org>
Message-ID: <15804.32771.26821.701868@montanaro.dyndns.org>

    Tim> Skip, I think there's a bug in the extract_dow code.

Thanks for catching it. for: ... else: isn't a construct I use often, so it's not entirely surprising that I muffed it.

How did you generate the table of tokens in your note?

    Tim>                    #ham  #spam  spamprob
    Tim> 'dow:0'               2      7  0.890542594688
    Tim> 'dow:1'               3      7  0.854937008074
    Tim> 'dow:2'             725     71  0.220827483069
    Tim> 'dow:3'            1038    261  0.420993872704
    Tim> 'dow:4'             845    234  0.444677806501
    Tim> 'dow:5'             126    196  0.81766035841
    Tim> 'dow:6'               0    137  0.998363041106
    Tim> 'dow:invalid'      2741    946  0.499472081328

The only tokens I've ever seen are in the summaries.

Skip

From tim.one@comcast.net Mon Oct 28 01:07:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 20:07:01 -0500
Subject: [Spambayes] Mining the headers
In-Reply-To:
Message-ID:

About:

    [Tokenizer]
    generate_time_buckets: True
    extract_dow: True

Across my c.l.py test (10-fold cv; mixed source; 20,000 c.l.py ham + 14,000 bruceg spam), it didn't change the FP, FN or unsure rates, but there's nothing that's ever going to get rid of my 2 remaining FP and 2 remaining FN. There's evidence that bruceg got spam more often on weekends than c.l.py got ham on weekends, mostly because c.l.py traffic drops on weekends.
Here, in decreasing order of spamprob, but from just 1 of the 10 classifiers built during the test:

    'dow:invalid'    57   426  0.913973614869
    'dow:5'        1611  1562  0.580722462655
    'dow:6'        1599  1413  0.557982096725
    'dow:1'        2982  1701  0.449007117316
    'dow:0'        2738  1480  0.435736703465
    'dow:3'        3067  1661  0.436204501487
    'dow:4'        2860  1535  0.433990360589
    'dow:2'        3086  1642  0.431861726944

NOTE: Since I'm running with the default robinson_minimum_prob_strength == 0.1, all words with spamprob between 0.4 and 0.6 are ignored. Therefore only the 'dow:invalid' token *could* have had an effect on this test.

Time buckets show higher spamprobs in the hours most of America is asleep. Again this appears to have more to do with the fact that there's less c.l.py traffic then than with an increase in spam then -- but for purposes of prediction, any regularity in spam *or* ham is exploitable:

    hh:mm   #h   #s  spamprob
     0.00  133   66  0.415
     0.10  108   74  0.495
     0.20  114   81  0.504
     0.30   93   85  0.566
     0.40  103   87  0.547
     0.50  102   93  0.566
     1.00   82   62  0.519
     1.10   85   89  0.599
     1.20   83   70  0.546
    -------------------------- above .60 starting roughly here
     1.30   79   89  0.616
     1.40  106   84  0.531
     1.50   74   88  0.629
     2.00   60   65  0.607
     2.10   67   99  0.678
     2.20   60   76  0.644
     2.30   79   89  0.616
     2.40   45   75  0.703
     2.50   81   81  0.588
     3.00   55   67  0.635
     3.10   58   99  0.709
     3.20   52   66  0.644
     3.30   66   81  0.636
     3.40   64   81  0.643
     3.50   62   81  0.651
     4.00   45   68  0.683
     4.10   47   53  0.616
     4.20   45   57  0.643
     4.30   45   85  0.729
     4.40   56   49  0.555
     4.50   46   57  0.638
     5.00   32   83  0.786
     5.10   47   77  0.700
     5.20   42   56  0.655
     5.30   50   49  0.583
     5.40   44   55  0.640
     5.50   48   63  0.652
     6.00   52   76  0.676
     6.10   46   48  0.598
     6.20   42   57  0.659
     6.30   53   59  0.613
     6.40   56   52  0.570
     6.50   41   65  0.693
     7.00   49   56  0.620
    -------------------------- and ending roughly here
     7.10   58   53  0.566
     7.20   69   50  0.509
     7.30   75   64  0.549
     7.40   83   65  0.528
     7.50   94   57  0.464
     8.00   97   48  0.414
     8.10  113   69  0.466
     8.20  109   76  0.499
     8.30  141   70  0.415
     8.40  112   50  0.390
     8.50  117   58  0.415
     9.00  120   55  0.396
     9.10  137   57  0.373
     9.20  154   57  0.346
     9.30  171   57  0.323
     9.40  141   55  0.358
     9.50  170   49  0.292
    10.00  159   81  0.421
    10.10  182   76  0.374
    10.20  200   73  0.343
    10.30  176   69  0.359
    10.40  132   81  0.467
    10.50  163   67  0.370
    11.00  184   92  0.417
    11.10  174   66  0.352
    11.20  181   58  0.314
    11.30  169   73  0.382
    11.40  170   69  0.367
    11.50  167   60  0.339
    12.00  191   95  0.416
    12.10  182   73  0.365
    12.20  128   63  0.413
    12.30  156   70  0.391
    12.40  153   82  0.434
    12.50  170  106  0.471
    13.00  157   78  0.415
    13.10  149   77  0.425
    13.20  160   82  0.423
    13.30  140   71  0.420
    13.40  172   66  0.354
    13.50  192   64  0.323
    14.00  169   99  0.456
    14.10  170   90  0.431
    14.20  203   69  0.327
    14.30  168   89  0.431
    14.40  192   78  0.367
    14.50  199   63  0.312
    15.00  200   68  0.327
    15.10  195   59  0.302
    15.20  183   71  0.357
    15.30  198   67  0.326
    15.40  193   78  0.366
    15.50  195   60  0.306
    16.00  181   81  0.390
    16.10  176   72  0.369
    16.20  194   98  0.419
    16.30  177   70  0.361
    16.40  175   81  0.398
    16.50  185   88  0.405
    17.00  187   79  0.377
    17.10  167   67  0.365
    17.20  165   80  0.409
    17.30  185   74  0.364
    17.40  179   82  0.396
    17.50  166   90  0.437
    18.00  136   65  0.406
    18.10  141   77  0.438
    18.20  165   65  0.360
    18.30  168   74  0.386
    18.40  148   96  0.481
    18.50  144   70  0.410
    19.00  140   83  0.459
    19.10  130   93  0.505
    19.20  139   67  0.408
    19.30  111   79  0.504
    19.40  129   64  0.415
    19.50  128   80  0.472
    20.00  121   71  0.456
    20.10  129   71  0.440
    20.20  124   63  0.421
    20.30  127   95  0.517
    20.40  140   86  0.467
    20.50  131   78  0.460
    21.00  142   84  0.458
    21.10  144   87  0.463
    21.20  143   79  0.441
    21.30  143   82  0.450
    21.40  142   93  0.483
    21.50  132   98  0.515
    22.00  135   70  0.426
    22.10  133   69  0.426
    22.20  136   98  0.507
    22.30  121   95  0.529
    22.40  128   82  0.478
    22.50  117   70  0.461
    23.00  121   60  0.415
    23.10  142   61  0.381
    23.20  146   74  0.420
    23.30  115   77  0.489
    23.40  107   71  0.487
    23.50  114   80  0.501

So, overall, in this data there are mild indicators based on DOW and on time-bucket tokens. The c.l.py test was already getting all the info it could use, though, and they're too mild to make much of a dent in a chi-score unless there are very few total clues.
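The effect of robinson_minimum_prob_strength that Tim notes -- tokens scoring between 0.4 and 0.6 are simply ignored -- can be sketched as a filter. An editorial sketch (the dow spamprobs are rounded from Tim's table above; the function name is hypothetical):

```python
def strong_clues(spamprobs, min_strength=0.1):
    # Keep only tokens whose spamprob lies at least min_strength away
    # from the neutral 0.5; the 0.4-0.6 dead zone is discarded.
    return {tok: p for tok, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

dow = {'dow:invalid': 0.914, 'dow:5': 0.581, 'dow:6': 0.558,
       'dow:1': 0.449, 'dow:0': 0.436, 'dow:3': 0.436,
       'dow:4': 0.434, 'dow:2': 0.432}
surviving = strong_clues(dow)
```

Run on the dow spamprobs, only 'dow:invalid' survives the filter, matching Tim's remark that it was the only token that *could* have affected the test.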
From dereks@itsite.com Mon Oct 28 01:12:00 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Sun, 27 Oct 2002 17:12:00 -0800 (PST)
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

> Why do you think you get so much spam? One reason is that
> one-size-fits-all schemes don't work well.

Tim,

Thanks for your responses today, they've been very helpful. I hold out hope, though, that one day the heuristics will be available to magically know what I want :)

Seriously speaking, my gut says that all the information we need is in the spam collections like Bruce's. Somebody just needs to figure out how to mine it, methinks.

> I'd like to plead for world peace too, while the algorithm geniuses are
> at it.

Pfft, that one's easy. It's the implementation that kills ya! :)

From tim.one@comcast.net Mon Oct 28 01:20:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 27 Oct 2002 20:20:32 -0500
Subject: [Spambayes] Mining the headers
In-Reply-To: <15804.32771.26821.701868@montanaro.dyndns.org>
Message-ID:

    Tim> Skip, I think there's a bug in the extract_dow code.

[Skip]
> Thanks for catching it.

You're welcome.

> for: ... else: isn't a construct I use often, so it's not entirely
> surprising that I muffed it.

It was just one "break" away from perfection -- a loop needs an early exit, else "else:" is a bug (hmm -- I wonder whether PyChecker knows that rule!).

> How did you generate the table of tokens in your note?
>
> Tim>                #ham  #spam  spamprob
> Tim> 'dow:0'           2      7  0.890542594688
> Tim> 'dow:1'           3      7  0.854937008074
> Tim> 'dow:2'         725     71  0.220827483069
> Tim> 'dow:3'        1038    261  0.420993872704
> Tim> 'dow:4'         845    234  0.444677806501
> Tim> 'dow:5'         126    196  0.81766035841
> Tim> 'dow:6'           0    137  0.998363041106
> Tim> 'dow:invalid'  2741    946  0.499472081328
>
> The only tokens I've ever seen are in the summaries.

I do that mostly by hand.
Here's a little Python program I didn't bother to check in:

"""
import cPickle as pickle

#f = file('outlook2000/default_bayes_database.pck', 'rb')
#f = file('fat.pik', 'rb')
f = file('class1.pik', 'rb')
c = pickle.load(f)
f.close()

w = c.wordinfo

def root(prefix):
    for k, r in w.iteritems():
        if k.startswith(prefix):
            print `k`, r.hamcount, r.spamcount, r.spamprob
"""

Run that via, e.g.,

    python -i pik.py

It then loads the trained classifier pickle of your choice into 'c', its wordinfo dict into 'w', and leaves you in an interactive session where you can play around. The utility root() function prints

    token hamcount spamcount spamprob

for every token beginning with a given string. So, in this case, I did

    root('dow:')

and pasted a screen scrape into the email. Note that the option

    [TestDriver]
    save_trained_pickles: True

will leave behind classifier pickles for each classifier trained during a test run.

So there's not much to it! Spend a few minutes studying the classes in Classifier: their instance data members are very simple (esp. since we got rid of a ton of combining schemes), and the whole thing will make a lot more sense to you then. The classifier's data structures are very easy to rummage around in, and there are very few of them.

perfection-is-reached-when-there's-nothing-left-to-throw-away-ly y'rs - tim

From popiel@wolfskeep.com Mon Oct 28 03:15:17 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 27 Oct 2002 19:15:17 -0800
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To: Message from Derek Simkowiak
References:
Message-ID: <20021028031518.E6F8CF4D4@cashew.wolfskeep.com>

In message: Derek Simkowiak writes:

>> can be -- it's easy). Then all users get them. If one user signs up for a
>> minister-by-mail scam (a real-life example reported earlier on this list),
>> then all users get minister-by-mail scams. Etc.
>
> I'm a little slow, so forgive me if this is... repetitive.
> But your argument sounds like it's something of a showstopper to my intended
> use of SpamBayes, and I want to make sure this behaviour is clearly
> documented in the archives.
>
> Consider a group of people who all use the same mail server.
> I'm thinking of a university, or customers of one of those $20/month email
> services, or a 1000-person company.
>
> Now consider the sysadmin who wants to use SpamBayes for the
> purpose of flagging spam on that mail server, such that users can set up a
> generic filter rule that is easily supported by the organization's Help
> Desk.

Okay. There are many distinct ways to set this up. Two ways of interest are:

1) Set up a common database for filtering all mail coming into the system. This will be subject to all the problems and limitations that Tim is talking about.

2) Set up separate databases for each user, filtering only their mail. This will take a lot more space, but will probably have _MUCH_ better results, assuming that you teach your users how to train the beast.

With the new parameterization of the Hammie headers, you could even run both of these, generating two separate headers to filter on. With this sort of combo approach, people could set their clients up to put mail in their normal inbox if _either_ of the headers said ham, put it in unsure if both of the headers said unsure, and put it in spam otherwise (at least one header said spam, and the other didn't say ham). This might solve the 'one guy wants farmgirls and horses spam' problem of #1. Dunno. I don't receive mail (with spam) on a shared server, so there's no way I could test this sort of thing.

> In other words, Bayesian filtering (as popularized by the article
> "A Plan for Spam") is only good for individuals, or small groups of
> individuals who all like the same kinds of ham.

This sort of filtering (I don't call it Bayesian anymore) is probably only good for small high-commonality groups if there's only a single database.
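The two-header combo policy described above reduces to a tiny routing rule. A sketch, assuming each classifier has already stamped its verdict ('ham', 'unsure', or 'spam') into its own header (function and label names are illustrative, not the spambayes API):

```python
def route(common_verdict, personal_verdict):
    """Route a message given the verdicts from a site-wide classifier
    and a per-user classifier, per the combo policy described above."""
    verdicts = (common_verdict, personal_verdict)
    if 'ham' in verdicts:
        return 'inbox'      # either header says ham -> normal inbox
    if verdicts == ('unsure', 'unsure'):
        return 'unsure'     # both are on the fence
    return 'spam'           # at least one says spam, and neither says ham
```

The first branch is what lets one user's wanted "spam" through without forcing it on everyone else.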
Using a separate database for each (and no common database) clearly will work (we've got several individuals now doing that). Combining separate databases and a common database is uncharted territory, which may have things of interest.

- Alex

From seant@iname.com Mon Oct 28 03:38:45 2002
From: seant@iname.com (Sean True)
Date: Sun, 27 Oct 2002 22:38:45 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To: <20021028031518.E6F8CF4D4@cashew.wolfskeep.com>
Message-ID:

> > In other words, Bayesian filtering (as popularized by the article
> > "A Plan for Spam") is only good for individuals, or small groups of
> > individuals who all like the same kinds of ham.

In the "real world", I suspect there are many companies who are probably perfectly happy to block (or label) all "pony and farmgirl" messages, independent of whether some lonely farm guy thinks they are ham. And for corporate email, I think such practices are not unreasonable (granted, I've tried hard to not work places like that, or be in charge of MIS when I was). There are legal standards that may require an employer to make a best effort to keep the "pony and farmgirl" message away from those who might be offended by even having to _label_ it as spam.

If users can nominate mail that violates community standards, and some MIS person agrees, a single filter might well be kept that would be a substantial help to a large body of people. As usual, for those with special needs (customer service), or special privileges (the vice president for reading naughty mail), more personalized filters could be made available.

-- Sean

From tim.one@comcast.net Mon Oct 28 05:49:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 00:49:08 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment

[Derek Simkowiak]
> Thanks for your responses today, they've been very helpful. I
> hold out hope, though, that one day the heuristics will be available to
> magically know what I want :)
>
> Seriously speaking, my gut says that all the information we need
> is in the spam collections like Bruce's. Somebody just needs to figure
> out how to mine it, methinks.

I think you're missing a basic point, but not due to lack of repetition . Let's get real concrete. Go to this site:

    http://www.esmokes.com/

I buy cigarettes from that site, and I get glitzy HTML promotional email from them about once a week. I want that email. You don't (or so I guess), and *any* filter trained on any spam collection on Earth is-- if it's worth anything at all --going to say that's spam. I'll attach one of their emails for your perusal.

This isn't a question of classification technology so much as it's a question of personal preference, and so long as you're determined that everyone must use the same classifier, personal preference goes out the window. That's a bad use of technology, IMO -- I'm not interested in treating everyone like interchangeable cogs. Buy a server with enough disk space so everyone can have their own classifier, and do whatever else it takes to give people a system they'll truly love instead of merely endure.

The spambayes system had no trouble learning that *I* want this crap because it found many lexical clues nearly unique to email from this particular vendor:

    'esmokes.com' 0.0412844
    'subject:eSmokes.com' 0.0412844
    'url:esmokes' 0.0412844
    'carton' 0.0505618
    'esmokes.com!' 0.0505618
    'esmokes.com,' 0.0505618
    'from:email addr:esmokes.com' 0.0505618
    'message-id:@mail-server' 0.0505618
    'url:brandid' 0.0505618
    'url:side' 0.0505618
    'url:template1' 0.0505618
    'url:vadcamp' 0.0505618
    'carton!' 0.0652174

and there are about 25 other low-spamprob words (but with higher spamprobs than those) common in this vendor's (but no other's) email too. The detection of other kinds of spam wasn't injured at all. Even so, email from this place still scores between 0.03 and 0.17 for me (which are high ham scores under chi-combining, but well within my "I'm sure it's ham" range).

>> I'd like to plead for world peace too, while the algorithm geniuses
>> are at it .

> Pfft, that one's easy. It's the implementation that kills ya! :)

In my case, it will be the cigarettes .

---------------------- multipart/mixed attachment--

From anthony@interlink.com.au Mon Oct 28 05:58:50 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 28 Oct 2002 16:58:50 +1100
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID: <200210280558.g9S5woq01988@localhost.localdomain>

>>> Tim Peters wrote
> I think you're missing a basic point, but not due to lack of repetition
> . Let's get real concrete. Go to this site:
>
> http://www.esmokes.com/
>
> I buy cigarettes from that site, and I get glitzy HTML promotional email
> from them about once a week. I want that email. You don't (or so I guess),
> and *any* filter trained on any spam collection on Earth is-- if it's worth
> anything at all --going to say that's spam. I'll attach one of their emails
> for your perusal.

Something to bear in mind, though, is that a site like this is unlikely to be a spammer. Reputable businesses don't spam often, and certainly they don't do it twice :) So when _other_ peddlers of Timmy's noxious habit send spam, he's unlikely to want their particular spam. For him, the many ham-clues from esmokes.com should outweigh the marketing-speak in their email.

Something I've been working on over the last week is the notion that spam data can be shared between users (to a certain extent) while ham data is user-specific.
This example doesn't (to me) seem to show that this isn't still a worthwhile goal to pursue... Anthony -- Anthony Baxter It's never too late to have a happy childhood. From tim.one@comcast.net Mon Oct 28 06:09:53 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 01:09:53 -0500 Subject: [Spambayes] An interesting example of bad correlation In-Reply-To: Message-ID: I just got two copies of this spam from python.org: """ Ol=E1 me chamo Marquinho. Acabei de lan=E7ar um site na WEB que fala = sobre o povo brasileiro e meu projeto... L=E1 voc=EA vai ver minhas fotos. Vo= c=EA pode divulgar o potencial de sua cidade. Al=E9m disso voc=EA pode concorre= r a uma web cam. dia 27 de dezembro. Visite! e vote no meu site! Preciso de apoio... http://www.nossobrasil.kit.net Se n=E3o quiser mais receber nossa informa=E7=E3o favor somente respo= nda. NossoBrasil.kit.net NossoBrasil.kit.net """ One of them showed up in my "I'm sure it's spam" folder, with a score= of 0.96. The other showed up in my "I'm confused" folder, with a score = of 0.75. What's the difference? The former was addressed to webmaster@python.org, and the latter to help@python.org, and the latt= er is a (privately archived) mailing list so Mailman put its fingers on it. = Despite that I *thought* I was ignoring all Mailman headers, I was . B= ut it turns out Mailman does other stuff that reflects in the headers, addi= ng this stuff that didn't exist in the copy I got via webmaster: 'header:Errors-to:1' 0.045086 'subject:Python' 0.0644291 'subject:] ' 0.0772537 'subject:[' 0.147731 'subject:Help' 0.270936 'subject:-' 0.286281 The original didn't have an Errors-to header. The last 5(!) are due = to the [Python-Help] inserted into the Subject line. I believe spam that isn't caught by python.org, and comes thru on a m= ailing list, is my biggest source of Unsure msgs. 
From tim.one@comcast.net Mon Oct 28 06:25:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 01:25:48 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To: <200210280558.g9S5woq01988@localhost.localdomain>
Message-ID:

[Anthony Baxter, discovers the joys of cigarettes-by-web]
> Something to bear in mind, though, is that a site like this is unlikely
> to be a spammer. Reputable businesses don't spam often, and certainly
> they don't do it twice :)

If you stare at the email from them I attached, you're not going to find any lexical clue that they're a reputable business, and the spambayes code certainly didn't before I trained on half a dozen msgs from them. If we shared classifiers, this would be called spam.

> So when _other_ peddlers of Timmy's noxious habit send spam, he's
> unlikely to want their particular spam.

This is true, and I indeed get a lot of shady "avoid state sales tax!" cig spam from other sources. They're nailed as spam.

> For him, the many ham-clues from esmokes.com should outweigh the
> marketing-speak in their email.

There is hope there: marketing email from a firm with an actual marketing dept is dead serious about establishing brand identification, so puts in endless repetitions of their company name and slogans. That's what makes esmokes so easy to distinguish from other cig spam, and the same is true of marketing blurbs from Microsoft, Sun, Expedia, Amazon, Fidelity, etc etc. They're all considered spam (and strongly so) before training on them, though, as they're dripping with the language of advertising.

> Something I've been working on over the last week is the notion that
> spam data can be shared between users (to a certain extent) while
> ham data is user-specific. This example doesn't (to me) seem to
> show that this isn't still a worthwhile goal to pursue...
Not at all -- it's an example of why sharing one classifier *completely* is unlikely to work well, and is more an after-the-fact rationalization attempting to explain why all tests in that *direction* have delivered discouraging results. I expect spam stats are quite sharable, and the same tests that showed high FP rates when using a single classifier across multiple tech-list corpora did not show significant increases in FN rates.

From anthony@interlink.com.au Mon Oct 28 07:08:01 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 28 Oct 2002 18:08:01 +1100
Subject: [Spambayes] options.skip_max_word_size.
Message-ID: <200210280708.g9S781602374@localhost.localdomain>

I noticed a bunch of really nice ham clues were getting skipped in some of my personal email's 'unsure' bucket. They were words like 'interconnection' and other longer techie-words. I added an option skip_max_word_size and tried boosting it to 20 (from the default of 12).

cmp.py shows this (skip_max_word_size 12 on left, 20 on right):

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  4 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.000  0.277  lost  +(was 0)
    0.559  0.838  lost  +49.91%
    0.836  1.114  lost  +33.25%
    0.279  0.836  lost  +199.64%

won   0 times
tied  0 times
lost  4 times

total unique fn went from 6 to 11 lost +83.33%
mean fn % went from 0.418410855729 to 0.766214465324 lost +83.12%

ham mean                          ham sdev
   0.67    0.58  -13.43%             4.54    3.89  -14.32%
   0.45    0.38  -15.56%             2.64    2.24  -15.15%
   0.68    0.67   -1.47%             4.44    4.57   +2.93%
   0.48    0.45   -6.25%             3.52    3.47   -1.42%

ham mean and sdev for all runs
   0.57    0.52   -8.77%             3.86    3.64   -5.70%

spam mean                         spam sdev
  98.30   98.76   +0.47%             8.61    7.95   -7.67%
  97.47   97.61   +0.14%            10.67   10.78   +1.03%
  98.51   98.43   -0.08%             9.13   10.93  +19.72%
  97.58   97.27   -0.32%            10.90   12.08  +10.83%

spam mean and sdev for all runs
  97.97   98.02   +0.05%             9.88   10.56   +6.88%

ham/spam mean difference: 97.40 97.50 +0.10

Unfortunately, cmp.py skips the important bit. My 'unsure' numbers went from 164 to 135! I'm not sure if this is just something that's an artifact of my own data, or more general - if others could try it as well, it would be good.

Anthony

From tim.one@comcast.net Mon Oct 28 07:32:11 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 02:32:11 -0500
Subject: [Spambayes] options.skip_max_word_size.
In-Reply-To: <200210280708.g9S781602374@localhost.localdomain>
Message-ID:

[Anthony Baxter]
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like
> 'interconnection' and other longer techie-words. I added an option
> skip_max_word_size and tried boosting it to 20 (from the default of 12).
>
> cmp.py shows this (skip_max_word_size 12 on left, 20 on right)
...
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
>
> false negative percentages
>     0.000  0.277  lost  +(was 0)
>     0.559  0.838  lost  +49.91%
>     0.836  1.114  lost  +33.25%
>     0.279  0.836  lost  +199.64%
>
> won   0 times
> tied  0 times
> lost  4 times
>
> total unique fn went from 6 to 11 lost +83.33%
> mean fn % went from 0.418410855729 to 0.766214465324 lost +83.12%
...
> Unfortunately, cmp.py skips the important bit. My 'unsure' numbers
> went from 164 to 135!

Under the default costs, this would be judged close to a wash: 5 new fn @ $1 was a loss of $5, while 29 fewer unsure @ $.20 was a gain of $5.80. table.py would show this more clearly, and the histogram analysis (which table.py summarizes) would tell us whether you could have gotten just as good an improvement by changing your ham_cutoff and spam_cutoff values (it's impossible to guess that from what you posted).

> I'm not sure if this is just something that's an artifact of my
> own data, or more general - if others could try it as well, it
> would be good.
It's something I haven't tried under chi-combining yet, so I will, but not right now. In previous tests, boosting to 13 didn't have significant effect on error rates but did boost the database size. This was before we had a usable notion of middle ground, though, so I've no idea what effect those older tests may have had on the unsure rate.

From rob@hooft.net Mon Oct 28 08:44:34 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Mon, 28 Oct 2002 09:44:34 +0100
Subject: [Spambayes] options.skip_max_word_size.
References: <200210280708.g9S781602374@localhost.localdomain>
Message-ID: <3DBCF8F2.1060907@hooft.net>

Anthony Baxter wrote:
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like 'interconnection'
> and other longer techie-words. I added an option skip_max_word_size and
> tried boosting it to 20 (from the default of 12).

Here are my results:

nows122[179]spambayes%% python ./table.py skip12.txt skip20.txt
-> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
[...]
-> tested 1600 hams & 580 spams against 14400 hams & 5220 spams

filename:     skip12     skip20
ham:spam: 16000:5800 16000:5800
fp total:         12         13
fp %:           0.07       0.08
fn total:          7          7
fn %:           0.12       0.12
unsure t:        178        184
unsure %:       0.82       0.84
real cost:   $162.60    $173.80
best cost:   $106.20    $109.60
h mean:         0.51       0.52
h sdev:         4.87       4.92
s mean:        99.42      99.39
s sdev:         5.22       5.34
mean diff:     98.91      98.87
k:              9.80       9.64

Regards,

Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/

From anthony@interlink.com.au Mon Oct 28 08:12:38 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 28 Oct 2002 19:12:38 +1100
Subject: [Spambayes] table.py patch to produce averages at end of line.
Message-ID: <200210280813.g9S8CcL02773@localhost.localdomain>

The following simple patch produces a final column in table.py of averages for all files and all measures.
This is useful if you're doing tests with very small amounts of data, and want to run the test multiple times with different seeds to check that your results are actually meaningful. For instance (ignore the actual results, they won't make sense outside of the context of the testing I'm doing):

filename:  002a_100 002c_100 002b_100 002d_100
ham:spam:  400:1000 400:1000 400:1000 400:1000
fp total:       127      195      104      245         167
fp %:         31.75    48.75    26.00    61.25       41.94
fn total:         0        0        0        0           0
fn %:          0.00     0.00     0.00     0.00        0.00
unsure t:       282      162      287       86         204
unsure %:     20.14    11.57    20.50     6.14       14.59
real cost: $1326.40 $1982.40 $1097.40 $2467.20    $1718.35
best cost:  $231.00  $244.20  $228.00  $249.80     $238.25
h mean:       81.07    78.72    77.23    79.29       79.08
h sdev:       20.09    30.80    24.59    34.25       27.43
s mean:       99.94    99.94    99.93    99.99       99.95
s sdev:        0.71     0.90     1.01     0.09        0.68
mean diff:    18.87    21.22    22.70    20.70       20.87
k:             0.91     0.67     0.89     0.60        0.77

Not sure if this is generally useful enough to anyone else for it to be checked in - any opinions?

Anthony

Index: table.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/table.py,v
retrieving revision 1.4
diff -u -r1.4 table.py
--- table.py	26 Oct 2002 15:30:23 -0000	1.4
+++ table.py	28 Oct 2002 08:06:07 -0000
@@ -122,6 +122,9 @@
 meand = "mean diff:"
 kval = "k: "
 
+tfptot = tfpper = tfntot = tfnper = tuntot = tunper = trcost = tbcost = \
+thmean = thsdev = tsmean = tssdev = tmeand = tkval = 0
+
 for filename in sys.argv[1:]:
     filename = windowsfy(filename)
     (htest, stest, fp, fn, un, fpp, fnp, unp, cost, bestcost,
@@ -147,20 +150,51 @@
     rat2 = rat2[0:(len(ratio) + 8)]
     ratio += " %7s" % ("%d:%d" % (htest, stest))
     fptot += "%8d" % fp
+    tfptot += fp
     fpper += "%8.2f" % fpp
+    tfpper += fpp
     fntot += "%8d" % fn
+    tfntot += fn
     fnper += "%8.2f" % fnp
+    tfnper += fnp
     untot += "%8d" % un
+    tuntot += un
     unper += "%8.2f" % unp
+    tunper += unp
     rcost += "%8s" % ("$%.2f" % cost)
+    trcost += cost
     bcost += "%8s" % ("$%.2f" % bestcost)
+    tbcost += bestcost
     hmean += "%8.2f" % hamdevall[0]
+    thmean += hamdevall[0]
     hsdev += "%8.2f" % hamdevall[1]
+    thsdev += hamdevall[1]
     smean += "%8.2f" % spamdevall[0]
+    tsmean += spamdevall[0]
     ssdev += "%8.2f" % spamdevall[1]
+    tssdev += spamdevall[1]
     meand += "%8.2f" % (spamdevall[0] - hamdevall[0])
+    tmeand += (spamdevall[0] - hamdevall[0])
     k = (spamdevall[0] - hamdevall[0]) / (spamdevall[1] + hamdevall[1])
     kval += "%8.2f" % k
+    tkval += k
+
+nfiles = len(sys.argv[1:])
+if nfiles:
+    fptot += "%12d" % (tfptot/nfiles)
+    fpper += "%12.2f" % (tfpper/nfiles)
+    fntot += "%12d" % (tfntot/nfiles)
+    fnper += "%12.2f" % (tfnper/nfiles)
+    untot += "%12d" % (tuntot/nfiles)
+    unper += "%12.2f" % (tunper/nfiles)
+    rcost += "%12s" % ("$%.2f" % (trcost/nfiles))
+    bcost += "%12s" % ("$%.2f" % (tbcost/nfiles))
+    hmean += "%12.2f" % (thmean/nfiles)
+    hsdev += "%12.2f" % (thsdev/nfiles)
+    smean += "%12.2f" % (tsmean/nfiles)
+    ssdev += "%12.2f" % (tssdev/nfiles)
+    meand += "%12.2f" % (tmeand/nfiles)
+    kval += "%12.2f" % (tkval/nfiles)
 
 print fname
 if len(fnam2.strip()) > 0:

From jeremy@alum.mit.edu Mon Oct 28 15:08:04 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 28 Oct 2002 10:08:04 -0500
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
References:
Message-ID: <15805.21204.928940.814710@slothrop.zope.com>

>>>>> "DS" == Derek Simkowiak writes:

DS> In other words, Bayesian filtering (as popularized by the article
DS> "A Plan for Spam") is only good for individuals, or small groups
DS> of individuals who all like the same kinds of ham.

A single classifier is only good for individuals or for groups / lists where there is a uniform notion of what is ham and what is spam. The general approach to filtering could certainly be used for a large institution, but seems to require some tailoring to an individual's ham. There are a bunch of interesting UI and systems issues to resolve for such usage.
Jeremy

From agmsmith@rogers.com Mon Oct 28 15:45:26 2002
From: agmsmith@rogers.com (Alexander G. M. Smith)
Date: Mon, 28 Oct 2002 10:45:26 EST (-0500)
Subject: [Spambayes] options.skip_max_word_size.
In-Reply-To: <200210280708.g9S781602374@localhost.localdomain>
Message-ID: <10727137924-BeMail@CR593174-A>

Anthony Baxter wrote:
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like 'interconnection'
> and other longer techie-words. I added an option skip_max_word_size and
> tried boosting it to 20 (from the default of 12).

I took the naive approach and allow words up to 50 bytes long. I picked that because I saw some uuencoded data with 60 bytes per line. Also, while looking up the spelling of supercalifragilisticexpialidocious, I found pneumonoultramicroscopicsilicovolcanoconiosis mentioned as the longest word in English, according to some web site* which referred back to the Oxford English Dictionary. So, 50 seems like a nice safe value.

- Alex

*: http://www.dictionary.com/doctor/faq/l/longestword.html

From skip@pobox.com Mon Oct 28 16:31:57 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 28 Oct 2002 10:31:57 -0600
Subject: [Spambayes] incremental training strategies
Message-ID: <15805.26237.16266.425547@montanaro.dyndns.org>

I am now running hammie.py from my procmailrc file, but not yet doing any filtering based on the results. I trained it on my current setup (7000 hams, 5000 spams). Should I:

* train it on every message which passes through my inbox

* only train it on messages which it incorrectly classifies

* some other scheme

? Or is that not yet known?
Skip

From nas@python.ca Mon Oct 28 16:44:05 2002
From: nas@python.ca (Neil Schemenauer)
Date: Mon, 28 Oct 2002 08:44:05 -0800
Subject: [Spambayes] incremental training strategies
In-Reply-To: <15805.26237.16266.425547@montanaro.dyndns.org>
References: <15805.26237.16266.425547@montanaro.dyndns.org>
Message-ID: <20021028164405.GA22741@glacier.arctrix.com>

Skip Montanaro wrote:
> I am now running hammie.py from my procmailrc file, but not yet doing any
> filtering based on the results. I trained it on my current setup (7000
> hams, 5000 spams). Should I:
>
> * train it on every message which passes through my inbox
>
> * only train it on messages which it incorrectly classifies
>
> * some other scheme
>
> ? Or is that not yet known?

I've trained twice since I started using "neilfilter.py" two months ago. One of those times it was because I updated the classifier and tokenizer code. I don't see the benefit of elaborate incremental updates.

Neil

From tim.one@comcast.net Mon Oct 28 17:01:15 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 12:01:15 -0500
Subject: [Spambayes] options.skip_max_word_size.
In-Reply-To: <3DBCF8F2.1060907@hooft.net>
Message-ID:

On skip_max_word_size, my c.l.py test, 10-fold CV, ham_cutoff=0.20 and spam_cutoff=0.80:

-> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto]

filename:       max12       max20
ham:spam: 20000:14000 20000:14000
fp total:           2           2    the same
fp %:            0.01        0.01
fn total:           0           0    the same
fn %:            0.00        0.00
unsure t:         103         100    slight decrease
unsure %:        0.30        0.29
real cost:     $40.60      $40.00    slight improvement with these cutoffs
best cost:     $27.00      $27.40    best possible got slightly worse
h mean:          0.28        0.27
h sdev:          2.99        2.92
s mean:         99.94       99.93
s sdev:          1.41        1.47
mean diff:      99.66       99.66
k:              22.65       22.70

"Best possible" in max20 would have been to boost ham_cutoff to 0.50(!), and drop spam_cutoff a little to 0.78.
This would have traded away most of the unsures in return for letting 3 spam through:

-> smallest ham & spam cutoffs 0.5 & 0.78
-> fp 2; fn 3; unsure ham 11; unsure spam 11
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0647%

Best possible in max12 was much the same:

-> largest ham & spam cutoffs 0.5 & 0.78
-> fp 2; fn 3; unsure ham 12; unsure spam 8
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%

The classifier pickle size increased by about 1.5 MB (~8.4% bigger).

Anthony, you didn't respond to the question about whether you could have gotten a similar improvement simply by changing cutoff values. The data you posted showed a large decrease in unsures at the expense of a large boost in your FN rate. It's quite plausible that exactly the same would have happened if you raised ham_cutoff. See my results above, where boosting ham cutoff from 0.20 to 0.50 would get rid of 80% of my unsures at the cost of letting 3 (vs 0) spam thru.

From dereks@itsite.com Mon Oct 28 17:16:03 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Mon, 28 Oct 2002 09:16:03 -0800 (PST)
Subject: [Spambayes] progress on POP+VM+ZODB deployment
In-Reply-To:
Message-ID:

> > is in the spam collections like Bruce's. Somebody just needs to figure
> > out how to mine it, methinks.
>
> I think you're missing a basic point, but not due to lack of repetition
> .

I'm not missing the basic point, I'm disagreeing with it. (You can stop with the lengthy examples of one guy who wants commercial mails from some particular company or subject domain -- I get it, really, I do.)

I may personally consider messages from you to be "spam" (not as Unsolicited Bulk Email, but simply as unwanted messages). But I don't think it would be the job of a general-purpose installation-wide spam identifier to know that about me, as you seem to suggest. I would want a tool like SpamBayes to flag emails as being like the ones in Bruce's collection.
If I like to get mails similar to those, then nowhere am I obligated to filter those flagged messages into my "Trash" folder. If I like to get messages similar to those, but only if they come from Company X, then I can set up my filters to do that, too. But for the vast majority of people, just knowing that a particular email has Bruce-spam-like content would be enough to want to filter it into a lower-priority folder, or even directly into Trash. At least, I see it as the job of the postmaster to provide a flag that could be used like that.

To summarize: I think it's the job of a spam filter (or "flagger") to identify those messages universally accepted as being spam -- whether or not any one person likes that kind of mail. And although for any given spam there is _somebody_ on Earth who would want to read it, it would be up to them to set up their client-app filter rules to work how they want them to -- even if that includes running a local installation of SpamBayes to do personalized (high-resolution) filtering.

> This isn't a question of classification technology so much as it's a
> question of personal preference, and so long as you're determined that
> everyone must use the same classifier, personal preference goes out the
> window.

Yes, and that's exactly what I'm asking for. I think that for installation-wide filters (I'll use the term 'flagger' from here on since no spam filtering should ever take place at a server -- for both legal and privacy reasons) personal preference is irrelevant. It's irrelevant practically by definition.

> That's a bad use of technology, IMO -- I'm not interested in treating
> everyone like interchangeable cogs.

I think there are a great many people interested in having all spam messages treated like interchangeable cogs. "Spam" meaning a message that would be universally accepted as being a "spam". I've seen many people on this list use Bruce's spam for their training.
But undoubtedly there is a message in his collection that would be of interest to at least *someone* on this list. Does that invalidate his collection as being a spam training repository? I would say no, it does not, because his collection is of the type "universally accepted as spam". That is the type of message I would like to see flagged at Universities, ISPs, and companies. And to do that, I don't think ham training can be in the picture, since somebody's "ham" is another person's "spam", and training on people's "ham" can only weaken what is considered "universally accepted as spam".

--Derek

From gward@python.net Mon Oct 28 17:16:20 2002
From: gward@python.net (Greg Ward)
Date: Mon, 28 Oct 2002 12:16:20 -0500
Subject: [Spambayes] python.org corpus updated
In-Reply-To:
References: <20021026211118.GA29889@cthulhu.gerg.ca>
Message-ID: <20021028171620.GA31109@cthulhu.gerg.ca>

On 26 October 2002, Tim Peters said:
> -> 4 new false positives
> new fp: ['pyham/02155.txt', 'pyham/01816.txt', 'pyham/02322.txt',
> 'pyham/02406.txt']
>
> but I believe they're all spam. I'll attach them for your review. They
> correspond, respectively, to your

Can't really blame SpamAssassin for missing these -- they were all sent to a Mailman -request address, which is explicitly whitelisted on python.org (I don't want to reject unsubscribe requests from people who happen to be on too many RBLs). Moved 'em to spam folder.

> -> 9 new false positives
> new fp: ['pyham/00277.txt', 'pyham/00278.txt', 'pyham/00275.txt',
> 'pyham/00267.txt', 'pyham/01346.txt', 'pyham/00261.txt',
> 'pyham/00276.txt', 'pyham/01284.txt', 'pyham/00645.txt']
>
> Again I believe these are all spam, and some are so outrageously spam it's
> hard to believe SpamAssassin let them pass! Then again, most are in a hated
> language .
> > ham/183BtE-00072Z-00 261 > ham/183DZB-0007dJ-00 267 > ham/183Epz-0001IH-00 275 > ham/183Epz-0001II-00 276 > ham/183Epz-0001IJ-00 277 > ham/183Epz-0001IK-00 278 These should have been dead easy: subject encoded in iso-2022-jp (which is *now* a banned charset on python.org, but wasn't when this harvest started), and are "To: a@a.a". Unfortunately Exim can be made very picky about addresses in sender headers ("From", "Reply-to", "Sender"), but I don't think it has anything for rigorous checking of recipient headers. Hmmm. > ham/183aCi-00024k-00 645 > ham/183ueG-0006vd-00 1284 > ham/183xNY-0008Gi-00 1346 These slipped through because they are to "-request" addresses. > Take those away and there were no false positives in either direction. Wow, awesome. > One example: > > spam/183UWS-00060A-00 633 > > seems a perfectly ordinary piece of mailman-users traffic. chi-combining is > quite certain it's ham: > > prob = 3.37424532759e-012 > prob('*H*') = 1 > prob('*S*') = 6.63913e-012 > > OTOH, SpamAssassin seems certain it's spam: Well, actually, it only scored 5.4. SA doesn't have any formal notion of certainty, but I'm pretty comfortable in stating that scores from 3.0 to 10.0 is the informal SA zone of uncertainty. Blame me: I think I forgot to manually review low-scoring messages in the spam folder for FPs. I'll do that before regenerating the tarballs. > There also appear to be an awful lot of "false negatives" of the form: > > """ > This is a message from the IFL E-Mail Virus Protection Service > -------------------------------------------------------------- > > The original e-mail attachment > > "Card.DOC.pif" > > appears to be infected by a virus and has been replaced by this=20 > warning message. > """ > > That may be virus fallout, but I don't believe it belongs in the spam > corpus, right? 
Correct -- I usually put all that stuff in the virus folder, because I'd like to see all virus-related junk mail stopped, and I think it should be done with different tools from spam detectors. Again, my fault for not manually reviewing the spam folder. Greg -- Greg Ward http://www.gerg.ca/ I'm on a strict vegetarian diet -- I only eat vegetarians. From tim.one@comcast.net Mon Oct 28 17:24:46 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 12:24:46 -0500 Subject: [Spambayes] incremental training strategies In-Reply-To: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I am now running hammie.py from my procmailrc file, but not yet doing any > filtering based on the results. I trained it on my current setup (7000 > hams, 5000 spams). Should I: > > * train it on every message which passes through my inbox > > * only train it on messages which it incorrectly classifies > > * some other scheme > > ? Or is that not yet known? Experiment . Note that chi-combining has a very real middle ground, and you're not used to that yet: you should certainly train it on msgs it says it's unsure about. For my personal email, I've trained on about 1,000 ham and 1,500 spam. As an experiment, I'm going to stop training now, except for Unsure msgs and mistakes; however, I haven't yet seen a mistake beyond one spam python.org let thru (it let thru more than that, but all the rest of those wound up in my Unsure folder despite the "I've been thru python.org" ham clues; the one that fooled both of us is hopeless). From popiel@wolfskeep.com Mon Oct 28 17:28:55 2002 From: popiel@wolfskeep.com (T. 
Alexander Popiel) Date: Mon, 28 Oct 2002 09:28:55 -0800 Subject: [Spambayes] incremental training strategies In-Reply-To: Message from Skip Montanaro <15805.26237.16266.425547@montanaro.dyndns.org> References: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: <20021028172855.53969F53E@cashew.wolfskeep.com> In message: <15805.26237.16266.425547@montanaro.dyndns.org> Skip Montanaro writes: > >I am now running hammie.py from my procmailrc file, but not yet doing any >filtering based on the results. I trained it on my current setup (7000 >hams, 5000 spams). Should I: > > * train it on every message which passes through my inbox > > * only train it on messages which it incorrectly classifies > > * some other scheme > >? Or is that not yet known? > >Skip Speaking from a theoretical purity standpoint, I suspect that training it on everything that came through would be 'cleaner'... but I have no idea if in practise it would work any better than just training on the mistakes and unsure. Try out variations, and post results? - Alex From gward@python.net Mon Oct 28 17:31:13 2002 From: gward@python.net (Greg Ward) Date: Mon, 28 Oct 2002 12:31:13 -0500 Subject: [Spambayes] python.org corpus updated In-Reply-To: <20021028171620.GA31109@cthulhu.gerg.ca> References: <20021026211118.GA29889@cthulhu.gerg.ca> <20021028171620.GA31109@cthulhu.gerg.ca> Message-ID: <20021028173113.GA31162@cthulhu.gerg.ca> OK, I've done a manual pass over all "low-scoring" messages in the spam folder, and moved a bunch of stuff around. Revised tarballs of the Oct 2002 python.org corpus are now online. Greg -- Greg Ward http://www.gerg.ca/ "Question authority!" "Oh yeah? Says who?" From popiel@wolfskeep.com Mon Oct 28 17:30:23 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 09:30:23 -0800 Subject: [Spambayes] table.py patch to produce averages at end of line. 
In-Reply-To: Message from Anthony Baxter <200210280813.g9S8CcL02773@localhost.localdomain> References: <200210280813.g9S8CcL02773@localhost.localdomain> Message-ID: <20021028173023.EB363F53E@cashew.wolfskeep.com> In message: <200210280813.g9S8CcL02773@localhost.localdomain> Anthony Baxter writes: >The following simple patch produces a final column in table.py of >averages for all files and all measures. This is useful if you're >doing tests with very small amounts of data, and want to run the >test multiple times with different seeds to check that your results >are actually meaningful. >Not sure if this is generally useful enough to anyone else for it to >be checked in - any opinions? Make it a command line option, and I'm sure it'll be welcome. ;-) - Alex From tim.one@comcast.net Mon Oct 28 17:42:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 12:42:15 -0500 Subject: [Spambayes] python.org corpus updated In-Reply-To: <20021028171620.GA31109@cthulhu.gerg.ca> Message-ID: [Greg Ward] > ... > Correct -- I usually put all that stuff in the virus folder, because I'd > like to see all virus-related junk mail stopped, and I think it should > be done with different tools from spam detectors. I don't expect that *this* spam detector is going to do well at viruses anyway -- although that hasn't been tested. > Again, my fault for not manually reviewing the spam folder. Before you do (or maybe while ), let's think about what we're trying to accomplish: whether or not you ban various charsets, and whether or not you look at blacklists, and whether or not etc, all tests run against c.l.py and python.org traffic have said that a spambayes classifier catches virtually all the spam and has very low fp rates. So, at this point, what is the purpose of more testing? What's the goal here from your POV? I can run any number of tests over any number of coming months, but it's begun to feel simply redundant from my POV.
We're not learning anything new here, just confirming that this approach works great for tech mailing lists, and even for python.org's private hobby lists (provided the classifier is trained on them too). From guido@python.org Mon Oct 28 17:38:18 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 28 Oct 2002 12:38:18 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Your message of "Mon, 28 Oct 2002 09:16:03 PST." References: Message-ID: <200210281738.g9SHcIL20111@pcp02138704pcs.reston01.va.comcast.net> > But for the vast majority of people, just knowing that a > particular email has Bruce-spam-like content would be enough to want > to filter it into a lower-priority folder, or even directly into > Trash. At least, I see it as the job of the postmaster to provide a > flag that could be used like that. > > To summarize: I think it's the job of a spam filter (or "flagger") > to identify those messages universally accepted as being spam -- > whether or not any one person likes that kind of mail. And although > for any given spam there is _somebody_ on Earth who would want to > read it, it would be up to them to set up their client-app filter > rules to work how they want them to -- even if that includes running > a local installation of SpamBayes to do personalized > (high-resolution) filtering. That would be a laudable goal, but the techniques pursued here don't work like that. They can only do a good job if you train them on *both* spam and non-spam. That's how the math of a Bayesian classifier works, alas. Someone can probably prove that you can't reduce the false positives more without knowing what *your* non-spam looks like. It sounds like SpamAssassin might be your best bet if you don't want to train on your non-spam (and even SpamAssassin requires an elaborate "whitelist" setup to avoid flagging the most flagrant spammish-looking non-spam).
--Guido van Rossum (home page: http://www.python.org/~guido/) From popiel@wolfskeep.com Mon Oct 28 17:50:04 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 09:50:04 -0800 Subject: [Spambayes] Timestamp analysis Message-ID: <20021028175004.64C20F53E@cashew.wolfskeep.com> This set of runs took me a lot longer than expected; first I had a couple errors in my scripts causing result files to collide, then I wanted to do it again saving pickles for probing, and finally I discovered that the day-of-week stuff was failing (getting dow:invalid) for nearly all my mail. I have not yet fixed the latter, so the day-of-week results are invalid for the concept, but valid for the implementation. Also, the implementation of generate_time_buckets seems to use 10 minute time buckets, not 6 minute buckets as the code comments suggest. Overall, looking at the date in detail, unrelated to anything else, seems neutral. Almost perfectly so; at most, there was a one unsure difference, which is not significant. In the table below,

r) mine_received_headers: False   basic_header_tokenize: False
R) mine_received_headers: True    basic_header_tokenize: True
t) generate_time_buckets: False
T) generate_time_buckets: True
d) extract_dow: False
D) extract_dow: True

-> tested 200 hams & 200 spams against 1800 hams & 1800 spams [...]
filename:     rtd     rtD     rTd     rTD     Rtd     RtD     RTd     RTD
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:       3       3       3       3       3       3       3       3
fp %:        0.15    0.15    0.15    0.15    0.15    0.15    0.15    0.15
fn total:      12      12      12      12      12      12      12      12
fn %:        0.60    0.60    0.60    0.60    0.60    0.60    0.60    0.60
unsure t:      53      53      54      54      31      31      31      31
unsure %:    1.32    1.32    1.35    1.35    0.78    0.78    0.78    0.78
real cost: $52.60  $52.60  $52.80  $52.80  $48.20  $48.20  $48.20  $48.20
best cost: $48.20  $48.20  $48.20  $48.20  $38.80  $38.80  $38.80  $38.80
h mean:      0.40    0.40    0.40    0.40    0.30    0.30    0.30    0.30
h sdev:      5.39    5.39    5.38    5.38    4.47    4.47    4.48    4.48
s mean:     98.45   98.46   98.46   98.46   98.85   98.85   98.85   98.85
s sdev:      9.76    9.76    9.76    9.75    9.06    9.06    9.06    9.05
mean diff:  98.05   98.06   98.06   98.06   98.55   98.55   98.55   98.55
k:           6.47    6.47    6.48    6.48    7.28    7.28    7.28    7.28

I have not yet posted this on my website... - Alex From tim.one@comcast.net Mon Oct 28 18:11:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 13:11:18 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Attached is a plot of # of spams sent per 10-minute bucket (based on Skip's Date header cracking), vs time-of-day, across my subset of BruceG's 2002 spam collection. The idea that *his* spam is mostly sent overnight is clearly bogus. Someone who stops looking at email at 5pm and doesn't look again until 8am could sure get that impression, though. The wiggly red line is a one-hour moving average. An obvious conclusion is that many spammers have day jobs, and send out huge spikes at the beginning and end of their lunch hours, but struggle with software problems in between -- IOW, they're us . ---------------------- multipart/mixed attachment A non-text attachment was scrubbed...
Name: spamtime.png Type: image/png Size: 14910 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021028/bb18b31e/spamtime.png ---------------------- multipart/mixed attachment-- From popiel@wolfskeep.com Mon Oct 28 18:11:24 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 10:11:24 -0800 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from Derek Simkowiak References: Message-ID: <20021028181124.D69E8F53E@cashew.wolfskeep.com> In message: Derek Simkowiak writes: > > To summarize: I think it's the job of a spam filter (or "flagger") >to identify those messages universally accepted as being spam -- whether or >not any one person likes that kind of mail. I'm reasonably sure there is no consensus on the definition of spam, so the concept of 'universally accepted' spam is flawed at its root. Some people restrict it to unsolicited commercial email; some consider any marketing message to be spam. Some don't care if it's commercial or not. Worse, for the lowest-common-denominator UCE definition, knowledge of the individual users is required (whether they solicited it or not). As such, I'd say your ideal universal flagger concept is unrealizable. Even if the concept is sound, I think that the classifiers we're working with are a bad fit for your concept, since at their core they need to know something about what's good as well as what's bad. Otherwise, you end up saying stuff is spam because it used the words 'you', 'there', 'some', 'the', etc... the incidentals of the language, with no real import on the message. > I've seen many people on this list use Bruce's spam for their >training. But undoubtedly there is a message in his collection that would >be of interest to at least *someone* on this list. Does that invalidate >his collection as being a spam training repository?
I have avoided using _any_ outside source of spam, precisely because I don't trust their judgement on my mail. If there's a classification error, I want it to be traceable only to me, not to some other person's potentially warped ideas about mail. (Note that this is not to say that I think Bruce's collection is bad or warped... I haven't looked at it, so cannot say. I'm just paranoid about my mail.) > I would say no, it does not, because his collection is of the type >"universally accepted as spam". That is the type of message I would like >to see flagged at Universities, ISPs, and companies. > > And to do that, I don't think ham training can be in the picture, >since somebody's "ham" is another person's "spam", and training on >people's "ham" can only weaken what is considered "universally accepted as >spam". I'll run some experiments (I've been doing the most with ham:spam ratio, anyway), but I suspect that without any ham the spambayes classifier will fail horribly. - Alex From guido@python.org Mon Oct 28 18:20:56 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 28 Oct 2002 13:20:56 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: Your message of "Mon, 28 Oct 2002 13:11:18 EST." References: Message-ID: <200210281820.g9SIKuM20406@pcp02138704pcs.reston01.va.comcast.net> > Attached is a plot of # of spams sent per 10-minute bucket (based on > Skip's Date header cracking), vs time-of-day, across my subset of > BruceG's 2002 spam collection. The idea that *his* spam is mostly > sent overnight is clearly bogus. Someone who stops looking at email > at 5pm and doesn't look again until 8am could sure get that > impression, though. > > The wiggly red line is a one-hour moving average. An obvious > conclusion is that many spammers have day jobs, and send out huge > spikes at the beginning and end of their lunch hours, but struggle > with software problems in between -- IOW, they're us . The Date header reflects local time at the spammer's box, right?
Could it be local time on a box to which the spammer connects to send his mail? And would that box necessarily have the same local time? In the graph, is there a difference between the narrow black bars and the slightly wider blue/gray bars with black outlines? --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@pobox.com Mon Oct 28 18:25:44 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 12:25:44 -0600 Subject: [Spambayes] incremental training strategies In-Reply-To: References: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: <15805.33064.884324.694879@montanaro.dyndns.org> Tim> you should certainly train it on msgs it says it's unsure about. Sounds like a plan. So far, I haven't seen any of these. Skip From jbublitz@nwinternet.com Mon Oct 28 17:34:58 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Mon, 28 Oct 2002 10:34:58 -0700 (PST) Subject: [Spambayes] incremental training strategies In-Reply-To: <20021028172855.53969F53E@cashew.wolfskeep.com> Message-ID: On 28-Oct-02 T. Alexander Popiel wrote: > In message: <15805.26237.16266.425547@montanaro.dyndns.org> > Skip Montanaro writes: >> I am now running hammie.py from my procmailrc file, but not yet >> doing any filtering based on the results. I trained it on my >> current setup (7000 hams, 5000 spams). Should I: >> * train it on every message which passes through my inbox >> * only train it on messages which it incorrectly classifies >> * some other scheme >>? Or is that not yet known? > Speaking from a theoretical purity standpoint, I suspect that > training it on everything that came through would be > 'cleaner'... but I have no idea if in practise it would work any > better than just training on the mistakes and unsure. > Try out variations, and post results? I ran tests in chronological order where I trained on 4000 of each type of msg and then:

a. Tested 8000 msgs of each type without retraining
b. Tested 8000 msgs of each type, retraining on all new msgs after each batch of 100 spam/100 ham

b gave clearly better results by nearly an order of magnitude, but that's only 1% or 2% vs. 0.1% or 0.2% at most, so in absolute terms the effect might not be huge depending on mail volume. In theory a closed-loop system should give more accurate results, but it also requires some measures to make sure the retraining data is clean or performance will probably degrade more quickly than if you never retrain at all. Jim From tim.one@comcast.net Mon Oct 28 18:41:45 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 13:41:45 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: <200210281820.g9SIKuM20406@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > The Date header reflects local time at the spammer's box, right? > Could it be local time on a box to which the spammer connects to send > his mail? And would that box necessarily have the same local time? Barry may know; I don't. I have to suspect that the answer is "it depends on the mail client". > In the graph, is there a difference between the narrow black bars and > the slightly wider blue/gray bars with black outlines? No, there are six vertical bars per labelled hour (one for each of Skip's 144 10-minute buckets), and I don't know why Excel decided to color some differently. Under a magnified view, it appears that it tried to make them strictly alternate, but every now and again put two blue ones next to each other. I may have left some X-axis minor-tick display option set to an unfortunate value, as the instances of doubled blue bars appear to be more-than-less regularly spaced.
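The 10-minute bucketing that Skip's Date-header cracking and Tim's plot rely on can be sketched as follows. This is an illustrative reconstruction, not the actual spambayes tokenizer code; the function name `time_bucket` is made up. As Skip notes elsewhere in the thread, the timezone offset is deliberately ignored, so buckets reflect the sender's local clock.

```python
# Sketch of 10-minute Date-header bucketing (144 buckets per day).
# Hypothetical helper; not the real spambayes implementation.
from email.utils import parsedate_tz

def time_bucket(date_header):
    """Map an RFC 2822 Date header to one of 144 ten-minute buckets,
    using the sender's local time (the timezone offset is ignored)."""
    parsed = parsedate_tz(date_header)
    if parsed is None:
        return None  # invalid dates -- the source of Skip's spike at 0
    hour, minute = parsed[3], parsed[4]
    return hour * 6 + minute // 10  # 24 hours * 6 buckets/hour = 144

print(time_bucket("Tue, 24 Sep 2002 15:33:56 -0500"))  # 93
```

Counting messages per bucket over a corpus and plotting the 144 counts reproduces the kind of time-of-day graphs Skip and Tim describe.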
From skip@pobox.com Mon Oct 28 18:42:08 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 12:42:08 -0600 Subject: [Spambayes] incremental training strategies In-Reply-To: <20021028172855.53969F53E@cashew.wolfskeep.com> References: <15805.26237.16266.425547@montanaro.dyndns.org> <20021028172855.53969F53E@cashew.wolfskeep.com> Message-ID: <15805.34048.389444.16035@montanaro.dyndns.org> Alex> Speaking from a theoretical purity standpoint, I suspect that Alex> training it on everything that came through would be Alex> 'cleaner'... but I have no idea if in practise it would work any Alex> better than just training on the mistakes and unsure. Yeah, but theory and practice often disagree. ;-) The biggest problem I see in training it on every message you encounter is you are likely to make mistakes, generally of the inattentiveness or fumble-fingered variety. That's fine when you're testing the algorithm. You migrate the message to the other pool, then test again. It's a bit different proposition if you are training messages on-the-fly, then delete them (or even if you don't delete them). How do you realize you misclassified a message? If you realize you misclassified a message, how do you undo the effect of the misclassification, particularly if you no longer have the message laying around? From the standpoint of minimizing human error, once you have a decent hammie.db file, it seems to me that only training on either unsure or incorrect messages is likely to be the best way to improve it. Skip From seant@iname.com Mon Oct 28 18:47:36 2002 From: seant@iname.com (Sean True) Date: Mon, 28 Oct 2002 13:47:36 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: <20021028181124.D69E8F53E@cashew.wolfskeep.com> Message-ID: > -----Original Message----- > From: spambayes-bounces@python.org > [mailto:spambayes-bounces@python.org]On Behalf Of T.
Alexander Popiel > Sent: Monday, October 28, 2002 1:11 PM > To: Derek Simkowiak > Cc: spambayes@python.org; popiel@wolfskeep.com > Subject: Re: [Spambayes] progress on POP+VM+ZODB deployment > > > In message: > Derek Simkowiak writes: > > > > To summarize: I think it's the job of a spam filter (or "flagger") > >to identify those messages universally accepted as being spam -- > whether or > >not any one person likes that kind of mail. > > I'm reasonably sure there is no consensus on the definition of spam, > so the concept of 'universally accepted' spam is flawed at its root. > Some people restrict it to unsolicited commercial email; some consider > any marketing message to be spam. Some don't care if it's commercial > or not. Worse, for the lowest-common-denominator UCE definition, > knowledge of the individual users is required (whether they solicited > it or not). > > As such, I'd say your ideal universal flagger concept is unrealizable. At the risk of being repetitive, for many large email systems, sponsored by large companies, spam is "things that don't contribute to productivity". Being able to preemptively -- and intelligently -- filter out porn, get rich quick, and Nigerian scam mail may be of real interest to people who administer 10000+ seat email systems. This may not be the preferred way to use these filters, from _our_ point of view, but it will likely be an interesting one to the MIS manager in charge of keeping system usage reasonable. -- Sean From guido@python.org Mon Oct 28 18:49:15 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 28 Oct 2002 13:49:15 -0500 Subject: [Spambayes] Spam vs time-of-day In-Reply-To: Your message of "Mon, 28 Oct 2002 13:41:45 EST." References: Message-ID: <200210281849.g9SInFr20784@pcp02138704pcs.reston01.va.comcast.net> > > In the graph, is there a difference between the narrow black bars and > > the slightly wider blue/gray bars with black outlines?
> > No, there are six vertical bars per labelled hour (one for each of > Skip's 144 10-minute buckets), and I don't know why Excel decided to > color some differently. Under a magnified view, it appears that it > tried to make them strictly alternate, but every now and again put > two blue ones next to each other. I may have left some X-axis > minor-tick display option set to an unfortunate value, as the > instances of doubled blue bars appear to be more-than-less regularly > spaced. Maybe it's a simple roundoff problem -- the bars could be approximately 2.5 pixels wide, and this gets rounded to 2 or 3 depending on circumstance. I'll ignore the difference then. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Mon Oct 28 18:53:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 13:53:21 -0500 Subject: [Spambayes] incremental training strategies In-Reply-To: <15805.34048.389444.16035@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Yeah, but theory and practice often disagree. ;-) In theory, training on every msg is, over time, the same as training on a small random sampling of msgs. > The biggest problem I see in training it on every message you encounter > is you are likely to make mistakes, generally of the inattentiveness or > fumble-fingered variety. > > That's fine when you're testing the algorithm. You migrate the message > to the other pool, then test again. It's a bit different proposition if > you are training messages on-the-fly, then delete them (or even if you > don't delete them). How do you realize you misclassified a message? I save my personal training ham and spam in their own distinct folders, and use Mark's GUI to score them too, then tell Outlook to sort them by score. Mistakes very reliably end up "at the wrong end" of the display. 
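The review trick described above -- score every trained message and sort by score, so mistakes surface at the "wrong end" -- could be sketched like this. The helper name `suspected_mistakes` and the thresholds are invented for illustration; they are not from the Outlook add-in.

```python
# Sketch: surface likely mistraining by flagging ham with high spam
# probability and spam with low spam probability. Data is made up.
def suspected_mistakes(scored, ham_threshold=0.9, spam_threshold=0.1):
    """scored: list of (msg_id, label, spam_probability) tuples."""
    suspects = [(mid, label, p) for mid, label, p in scored
                if (label == "ham" and p >= ham_threshold)
                or (label == "spam" and p <= spam_threshold)]
    # Sort descending so high-scoring "ham" tops the review list.
    return sorted(suspects, key=lambda t: t[2], reverse=True)

scored = [("m1", "ham", 0.02), ("m2", "ham", 0.97),    # m2 looks mistrained
          ("m3", "spam", 0.99), ("m4", "spam", 0.05)]  # m4 looks mistrained
print(suspected_mistakes(scored))  # [('m2', 'ham', 0.97), ('m4', 'spam', 0.05)]
```

Sorting the full training set by score, as described, achieves the same effect visually in a mail client.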
> If you realize you misclassified a message, how do you undo the effect of > the misclassification, Mark added a msg_id -> training_status database to the Outlook client. If I move a mistake into the other training folder, it automatically "does the right thing" (realizes that the msg was trained in the other direction, untrains it from that category, and retrains it for the correct category). > particularly if you no longer have the message laying around? Then you're hosed. > From the standpoint of minimizing human error, once you have a decent > hammie.db file, it seems to me that only training on either unsure or > incorrect messages is likely to be the best way to improve it. I don't believe it, but it hasn't been tested. The problem I foresee is scores that rely too much on accidental hapaxes. This will appear to work great over the short term. When other messages appear containing the same accidental rare strings, their classification will be a coin toss to a proportional extent. From skip@pobox.com Mon Oct 28 18:53:50 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 12:53:50 -0600 Subject: [Spambayes] Re: Spam vs time-of-day In-Reply-To: References: <15805.26237.16266.425547@montanaro.dyndns.org> Message-ID: <15805.34750.447765.195346@montanaro.dyndns.org> ---------------------- multipart/mixed attachment Tim> Attached is a plot of # of spams sent per 10-minute bucket (based Tim> on Skip's Date header cracking), vs time-of-day, across my subset Tim> of BruceG's 2002 spam collection. The idea that *his* spam is Tim> mostly sent overnight is clearly bogus. Someone who stops looking Tim> at email at 5pm and doesn't look again until 8am could sure get Tim> that impression, though. "*his*" refers to Bruce, right? My contention after plotting time buckets was the same: that spam was generally sent at a continuous rate. Ham, on the other hand, does have a strong diurnal pattern. I posted a gnuplot graph to that effect back at the end of September. 
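The msg_id -> training_status bookkeeping described above (untrain from the old category, retrain in the new one when a message is moved) could be sketched as follows. All names here are hypothetical, and plain word counts stand in for the real classifier state.

```python
# Minimal sketch of untrain-then-retrain bookkeeping, keyed by msg_id.
# Illustrative only; not the actual Outlook add-in code.
from collections import Counter

class Trainer:
    def __init__(self):
        self.status = {}                          # msg_id -> "ham" | "spam"
        self.counts = {"ham": Counter(), "spam": Counter()}

    def train(self, msg_id, words, label):
        prev = self.status.get(msg_id)
        if prev == label:
            return                                 # already trained this way
        if prev is not None:                       # moved folders: untrain first
            self.counts[prev].subtract(words)
        self.counts[label].update(words)
        self.status[msg_id] = label

t = Trainer()
t.train("m1", ["cheap", "pills"], "ham")           # oops, mistrained
t.train("m1", ["cheap", "pills"], "spam")          # move undoes and retrains
print(t.counts["ham"]["cheap"], t.counts["spam"]["cheap"])  # 0 1
```

Without the status map (or the original message), the stale counts from the mistraining cannot be backed out -- which is the "you're hosed" case above.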
That's what convinced me to try mining information from the Date: header. For completeness, I've attached my original graph. I believe the x-axis is the 6-minute bucket offset, starting from midnight. The large spike at 0 is an artifact of my simpleminded Date header scanning. Invalid dates probably wound up with a value of 0. Buckets were calculated using local time. That way I didn't penalize Anthony Baxter and other folks who happen not to live in the US. Skip ---------------------- multipart/mixed attachment A non-text attachment was scrubbed... Name: hour.png Type: image/png Size: 7616 bytes Desc: not available Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021028/3548e9e9/hour.png ---------------------- multipart/mixed attachment-- From jbublitz@nwinternet.com Mon Oct 28 17:43:43 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Mon, 28 Oct 2002 10:43:43 -0700 (PST) Subject: [Spambayes] python.org corpus updated In-Reply-To: Message-ID: On 28-Oct-02 Tim Peters wrote: > [Greg Ward] >> ... >> Correct -- I usually put all that stuff in the virus folder, >> because I'd like to see all virus-related junk mail stopped, >> and I think it should be done with different tools from spam >> detectors. > I don't expect that *this* spam detector is going to do well at > viruses anyway -- although that hasn't been tested. Works great for me - I put all tagged/scrubbed or virgin virus msgs in my spam corpus from the start and haven't had a problem. I don't virus scan (Linux) but some of my ISPs do. The email module has some problems with them though, because some of the virus taggers mung the boundaries or attachments. Viruses look like spam to me. Jim From popiel@wolfskeep.com Mon Oct 28 18:59:00 2002 From: popiel@wolfskeep.com (T.
Alexander Popiel) Date: Mon, 28 Oct 2002 10:59:00 -0800 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message from "Sean True" References: Message-ID: <20021028185901.711BEF53E@cashew.wolfskeep.com> In message: "Sean True" writes: > >At this risk of being repetitive, for many large email systems, sponsored >by large companies, spam is "things that don't contribute to productivity". Uh oh... better filter out this list, then... it is distracting me from work! ;-) - Alex From tim.one@comcast.net Mon Oct 28 19:03:08 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 14:03:08 -0500 Subject: [Spambayes] python.org corpus updated In-Reply-To: Message-ID: [Jim Bublitz] > Works great for me - I put all tagged/scrubbed or virgin virus msgs > in my spam corpus from the start and haven't had a problem. I don't > virus scan (Linux) but some of my ISPs do. The email module has some > problems with them though, because some of the virus taggers mung > the boundaries or attachments. > > Viruses looks like spam to me. How do you tokenize? We ignore MIME sections that aren't text/*, except for generating metatokens from the MIME armor (content-type, content-disposition, charset and filename parameter values). There's another option to suck up the first 5 decoded bytes of octet-stream sections, but enabling that hasn't made any difference in my tests. IOW, a typical virus generates a very small set of tokens, the way we tokenize. We're also missing src=cid: clues from iframe tags. From tim.one@comcast.net Mon Oct 28 19:29:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 14:29:30 -0500 Subject: [Spambayes] RE: Spam vs time-of-day In-Reply-To: <15805.34750.447765.195346@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > "*his*" refers to Bruce, right? Right. > My contention after plotting time buckets was the same: that spam was > generally sent at a continuous rate. 
No, your graph and mine both show that it falls off in the early-morning hours. Offline I did a chi-squared test against the hypothesis that the spam was evenly distributed, and the probability that random data could be so skewed was < 1e-18. But ham falls off much more. > Ham, on the other hand, does have a strong diurnal pattern. Very. > I posted a gnuplot graph to that effect back at the end of September. > That's what convinced me to try mining information from the Date: header. > For completeness, I've attached my original graph. I believe the x-axis > is the 6-minute bucket offset, starting from midnight. Your buckets span 10 minutes. The comment in the code is confused about this too. That's why your graph and mine both have 144 points on the X axis (24 * 6 = 144; you have six *buckets* per hour, and each spans 10 minutes). > The large spike at 0 is an artifact of my simpleminded Date header > scanning. Invalid dates probably wound up with a value of 0. And at that time, *every* Date header generated a dow:invalid token (as well as the correct token, when possible). That's been repaired since then. > Buckets were calculated using local time. That way I didn't penalize > Anthony Baxter and other folks who happen not to live in the US. I'm unsure what "were calculated using local time" means. Does the checked in code do that or not? I took what the checked-in code produced at face value (after untangling the hour.bucket_number format into hour.minute). I doubt that it matters, though. Most c.l.py traffic in my corpus is sent from the U.S., and in any case enabling these things didn't help my results (the spamprobs were too mild to make a difference). From barry@wooz.org Mon Oct 28 19:33:14 2002 From: barry@wooz.org (Barry A. 
Warsaw) Date: Mon, 28 Oct 2002 14:33:14 -0500 Subject: [Spambayes] Spam vs time-of-day References: <200210281820.g9SIKuM20406@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15805.37114.807848.125972@gargle.gargle.HOWL>

>>>>> "TP" == Tim Peters writes:

>> The Date header reflects local time at the spammer's box,
>> right? Could it be local time on a box to which the spammer
>> connects to send his mail? And would that box necessarily have
>> the same local time?

TP> Barry may know; I don't. I have to suspect that the answer is
TP> "it depends on the mail client".

Actually, just "it depends" would be the correct answer. :) Of course, given the right mail client, just about anything can be shoved into a Date header (and often is). How far messages with bogus or even missing Date headers will make it along the delivery path is dependent on all the tools in the chain. Many mail clients will add Date headers, and I can't imagine such headers would reflect anything other than local time on the box composing the message. Because RFC 2822 requires exactly one Date header, an SMTPd would be within its rights to reject a message from a client that was missing Date, although I think all but qmail probably just add one if it's missing. I'd bet that in 99% of the situations it would have the same local time as the composing machine.

BTW, RFC 2822 has this to say about Date:

3.6.1. The origination date field

[...] The origination date specifies the date and time at which the creator of the message indicated that the message was complete and ready to enter the mail delivery system. For instance, this might be the time that a user pushes the "send" or "submit" button in an application program. In any case, it is specifically not intended to convey the time that the message is actually transported, but rather the time at which the human or other creator of the message has put the message into its final form, ready for transport.
(For example, a portable computer user who is not connected to a network might queue a message for delivery. The origination date is intended to contain the date and time that the user queued the message, not the time when the user connected to the network to send the message.)

So I think it's safe to treat Date as the moment in time when the human hit "send".

-Barry

From tim.one@comcast.net Mon Oct 28 19:36:14 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 14:36:14 -0500 Subject: [Spambayes] incremental training strategies In-Reply-To: <15805.33064.884324.694879@montanaro.dyndns.org> Message-ID:

> Tim> you should certainly train it on msgs it says it's unsure about.

[Skip]
> Sounds like a plan. So far, I haven't seen any of these.

If you're just starting, I suggest fiddling ham_cutoff to a low value and spam_cutoff to a high value. For example, 0.05 and 0.95. That should get you some valuable practice with unsures.

From popiel@wolfskeep.com Mon Oct 28 20:43:33 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 12:43:33 -0800 Subject: [Spambayes] Training without ham Message-ID: <20021028204333.3A1BDF53E@cashew.wolfskeep.com>

Summary: Ham is required in the training set, as expected.

Okay, I made some modifications to my standard ham:spam ratio tests for this one. Previously, when I tested the ham:spam ratio, I just used the --ham-keep and --spam-keep options to timcv.py, meaning that the entire mail stream (both training and testing) was shaped to the given ratio. For this set of tests, I mangled timcv.py to use those options for the training set only, and test with all the ham and spam in the given bucket. (I'm thinking of rerunning the less extreme ratio tests, to see if this changes the sweet-spot interpretation.)

Have a table:

-> tested 200 hams & 200 spams against 180 hams & 1620 spams
[...]
-> tested 200 hams & 200 spams against 0 hams & 1800 spams

filename: 20-180 15-185 10-190 5-195 2-198 1-199 0-200
ham:spam: 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total: 36 46 68 95 381 455 2000
fp %: 1.80 2.30 3.40 4.75 19.05 22.75 100.00
fn total: 5 6 4 2 0 0 0
fn %: 0.25 0.30 0.20 0.10 0.00 0.00 0.00
unsure t: 143 175 279 536 1307 1374 0
unsure %: 3.58 4.38 6.97 13.40 32.67 34.35 0.00
real cost: $393.60 $501.00 $739.80 $1059.20 $4071.40 $4824.80 $20000.00
best cost: $296.00 $386.20 $421.00 $425.60 $452.20 $465.20 $800.00
h mean: 5.48 6.65 10.39 18.54 54.99 61.84 100.00
h sdev: 17.80 19.52 23.29 27.58 30.36 27.06 0.00
s mean: 99.58 99.58 99.64 99.74 99.88 99.92 100.00
s sdev: 5.66 5.65 5.21 4.05 2.04 1.71 0.00
mean diff: 94.10 92.93 89.25 81.20 44.89 38.08 0.00
k: 4.01 3.69 3.13 2.57 1.39 1.32 --NaN--

Okay, looking at this, there is a very clear degradation as the amount of ham drops. This degradation is not just with the default cutoffs (.02 and .9), but with the prescient best cutoffs, too. A quick peek in the actual run output shows that the best cutoffs get progressively closer to .995 and 1.0 throughout. In fact, the ideal spam cutoff is 1.0 for all runs with less than 15 ham trained from each bucket, effectively eliminating the spam category and calling all spam unsure just to lower costs. Also note that with no ham in the training set, *EVERYTHING* is called spam (with sane cutoffs) or unsure (with .995 and 1.0). In either case, there is no distinguishing ham from spam.

So yes, spambayes is worthless without ham in the training corpus.

- Alex

From tim.one@comcast.net Mon Oct 28 20:51:20 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 15:51:20 -0500 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: Message-ID:

[Derek Simkowiak]
> I'm not missing the basic point, I'm disagreeing with it.
(You
> can stop with the lengthy examples of one guy who wants commercial mails
> from some particular company or subject domain -- I get it, really, I do.)

Good!

> I may personally consider messages from you to be "spam" (not as
> Unsolicited Bulk Email, but simply as unwanted messages). But I don't
> think it would be the job of a general-purpose installation-wide spam
> identifier to know that about me, as you seem to suggest.

Then you're willing to settle for very little, and I'm glad you're not running my installation.

> I would want a tool like SpamBayes to flag emails as being like
> the ones in Bruce's collection. If I like to get mails similar to those,
> then nowhere am I obligated to filter those flagged messages into my
> "Trash" folder. If I like to get messages similar to those, but only if
> they come from Company X, then I can set up my filters to do that, too.
>
> But for the vast majority of people, just knowing that a
> particular email has Bruce-spam-like content would be enough to want to
> filter it into a lower-priority folder, or even directly into Trash. At
> least, I see it as the job of the postmaster to provide a flag that could
> be used like that.
>
> To summarize: I think it's the job of a spam filter (or "flagger")
> to identify those messages universally accepted as being spam -- whether or
> not any one person likes that kind of mail. And although for any given
> spam there is _somebody_ on Earth who would want to read it, it would be
> up to them to set up their client-app filter rules to work how they want
> them to -- even if that includes running a local installation of SpamBayes
> to do personalized (high-resolution) filtering.

In that case, try this code and see what happens. Use all defaults, because they still favor mixed-source corpora so won't suck out "too many" clues specific to your machines or your recipients.
Generate a starter database from your own email, and then teach it from the complaints your friendly workgroup makes. Put some elbow grease into this!

> ...
> I think there are a great many people interested in having all
> spam messages treated like interchangeable cogs. "Spam" meaning a message
> that would be universally accepted as being a "spam".

I'll leave that argument to you and your users now.

> I've seen many people on this list use Bruce's spam for their
> training.

I know of two.

> But undoubtedly there is a message in his collection that would
> be of interest to at least *someone* on this list. Does that invalidate
> his collection as being a spam training repository?

Of course not, but I've removed messages from his spam corpus that don't fit an appropriate definition of spam for comp.lang.python purposes. There are other messages I'd remove from his spam corpus if training for my personal purposes. There are some messages that need to be removed for any purposes, because they were plainly misclassified.

> I would say no, it does not, because his collection is of the type
> "universally accepted as spam". That is the type of message I would like
> to see flagged at Universities, ISPs, and companies.
>
> And to do that, I don't think ham training can be in the picture,
> since somebody's "ham" is another person's "spam", and training on
> people's "ham" can only weaken what is considered "universally accepted as
> spam".

Set up a test and measure results. I expect it will detect "BruceG spam" quite reliably, but that it will also call many other msgs spam. The variety in spam is, I expect, much larger than you presently imagine, and BruceG's collection includes msgs like this:

"""
Tim,

It was great to talk to you today I should have the propsal done by tommorrow

Take Care,
Susan
"""

In fact, it contains *many* msgs like that. They are in fact spam, but I doubt you would claim that this msg would be "universally recognized as spam".
If you don't want msgs "like that" classified as spam, and won't train on ham too to give it a fighting chance, then you've got weeks of work of your own to do to try to remove msgs like that from BruceG's (or anyone else's) spam collection before training. Our codebase will help you do that, BTW: this kind of spam usually does score as spam, but on the low end of the spam scale. It's statistically unusual compared to the bulk of the spam.

From jbublitz@nwinternet.com Mon Oct 28 20:04:27 2002 From: jbublitz@nwinternet.com (Jim Bublitz) Date: Mon, 28 Oct 2002 13:04:27 -0700 (PST) Subject: [Spambayes] python.org corpus updated In-Reply-To: Message-ID:

On 28-Oct-02 Tim Peters wrote:
> How do you tokenize? We ignore MIME sections that aren't text/*,
> except for generating metatokens from the MIME armor
> (content-type, content-disposition, charset and filename
> parameter values). There's another option to suck up the first 5
> decoded bytes of octet-stream sections, but enabling that hasn't
> made any difference in my tests.

> IOW, a typical virus generates a very small set of tokens, the
> way we tokenize. We're also missing src=cid: clues from iframe
> tags.

When I tried spambayes I didn't have any problems with virus msgs either (before the chi-combining scheme, but that should work better if anything), so I don't think tokenizing is that critical. Viruses do tend to score towards the middle and have been some of my earlier fn problems, IMO for exactly the reasons you state. The ISP-tagged viruses do have more words though. Given sufficient training these schemes appear able to classify almost anything.

I only tokenize anything that's Content-Type: text/* (or headers). I end up (after 30K+ msgs) with a 2.5MB text file db of (token, prob) - about 150K tokens. The complete code was posted to the list a week or two ago.
Tokenizing is:

    # > 50% Asian language spam, some English/Asian language
    # mixed ham
    TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

    # remove mixed alphanumerics or strictly numeric:
    # eg: HM6116, 555N, 1234 (also Windows98, 1337, h4X0r)
    # removes about 500K tokens (also most boundaries, msg IDs,
    # some date/time info)
    pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
    pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
    num_re = re.compile(r"^[0-9]+")

>> in the method that actually tokenizes (headers and everything) <<

    tokens = TOKEN_RE.findall(str(data))
    if not len(tokens):
        return

    # added the first 'if' in the loop to reduce
    # total # of tokens by >75%
    deletes = 0
    for token in tokens:
        if (len(token) > 20) \
           or (pn1_re.search(token) != None) \
           or (pn2_re.search(token) != None) \
           or (num_re.search(token) != None) \
           or token in ignore:
            deletes += 1
            continue
        if token in self:
            self[token] += 1
        else:
            self[token] = 1

    # count tokens, not msgs
    self.count += len(tokens) - deletes

"ignore" is just a list of strings that scoring puts in the headers - about 10 words. The regexes above strip the actual score, so "99" or "0.99" won't be strong indicators after training. The "if" stmt could probably be cleaned up - I was just adding and subtracting different stuff for the best performance and settled on what's there.

Jim

From nas@python.ca Mon Oct 28 21:27:47 2002 From: nas@python.ca (Neil Schemenauer) Date: Mon, 28 Oct 2002 13:27:47 -0800 Subject: [Spambayes] progress on POP+VM+ZODB deployment In-Reply-To: References: Message-ID: <20021028212747.GB23637@glacier.arctrix.com>

It seems to me that the spambayes approach works best when integrated into the user's MUA. That's unfortunate because there are so many different MUAs out there and most of them are not easily extendable. I suppose you could use IMAP and have both the ham and spam folders on the server.
Neil

From tim.one@comcast.net Mon Oct 28 21:33:59 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 28 Oct 2002 16:33:59 -0500 Subject: [Spambayes] Training without ham In-Reply-To: <20021028204333.3A1BDF53E@cashew.wolfskeep.com> Message-ID:

[T. Alexander Popiel]
> Summary: Ham is required in the training set, as expected.
> ...
> So yes, spambayes is worthless without ham in the training corpus.

Ya, but that doesn't prove we need to train on spam. I posted a variant update_probabilities yesterday, which ignored hamcounts when computing spamprobs. What I didn't report on was trying that, after fiddling a combining method to merely compute the average spamprob in a msg. Histogram analysis consistently suggested that my best strategy was to set ham_cutoff at 0.0 then, and spam_cutoff at 1.0; i.e., to call *everything* "unsure". The (possibly surprising) reason can be deduced from this (from a 10-fold randomized CV run over 2000 of each):

-> Spam scores for all runs: 2000 items; mean 19.54; sdev 8.85
-> min 3.97721; median 19.6394; max 71.6909
-> percentiles: 5% 4.65238; 25% 14.8778; 75% 23.7485; 95% 34.8339

-> Ham scores for all runs: 2000 items; mean 24.17; sdev 7.97
-> min 4.2792; median 23.4837; max 73.9471
-> percentiles: 5% 12.0403; 25% 19.4017; 75% 28.2031; 95% 37.6717

IOW, ham scores *higher* than spam for spamness under this measure, although the overlap is extreme. I wasn't much motivated to pursue this.

From popiel@wolfskeep.com Mon Oct 28 23:31:34 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Mon, 28 Oct 2002 15:31:34 -0800 Subject: [Spambayes] Training without ham In-Reply-To: Message from Tim Peters References: Message-ID: <20021028233134.17645F53E@cashew.wolfskeep.com>

In message: Tim Peters writes:
>[T. Alexander Popiel]
>> Summary: Ham is required in the training set, as expected.
>> ...
>> So yes, spambayes is worthless without ham in the training corpus.
>
>Ya, but that doesn't prove we need to train on spam.
You are an evil man, Tim. Just for that, I present the following:

Summary: We need to train on spam, too.

Methodology is identical to my no-ham test, except that I'm using very little spam instead of very little ham.

-> tested 200 hams & 200 spams against 1620 hams & 180 spams
[...]
-> tested 200 hams & 200 spams against 1800 hams & 0 spams

filename: 180-20 185-15 190-10 195-5 198-2 199-1 200-0
ham:spam: 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total: 1 2 0 0 0 0 0
fp %: 0.05 0.10 0.00 0.00 0.00 0.00 0.00
fn total: 68 77 118 291 672 1223 2000
fn %: 3.40 3.85 5.90 14.55 33.60 61.15 100.00
unsure t: 318 378 554 1011 1160 707 0
unsure %: 7.95 9.45 13.85 25.27 29.00 17.68 0.00
real cost: $141.60 $172.60 $228.80 $493.20 $904.00 $1364.40 $2000.00
best cost: $92.60 $98.40 $127.40 $209.40 $371.00 $607.80 $800.00
h mean: 0.29 0.28 0.21 0.11 0.06 0.04 0.00
h sdev: 4.29 4.20 3.04 1.84 1.30 1.19 0.00
s mean: 90.53 88.71 81.88 62.52 37.17 20.21 0.00
s sdev: 22.22 23.77 28.84 33.27 29.01 25.77 0.00
mean diff: 90.24 88.43 81.67 62.41 37.11 20.17 0.00
k: 3.40 3.16 2.56 1.78 1.22 0.75 --NaN--

This is almost a perfect mirror image of the problem on the other end, including the cutoffs approaching 0.0 and 0.005. I won't bother with more detail on this one.

Tim, you're evil.

- Alex

From anthony@interlink.com.au Tue Oct 29 01:25:51 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 29 Oct 2002 12:25:51 +1100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. Message-ID: <200210290125.g9T1Ppw09085@localhost.localdomain>

So I hacked on timcv.py and msgs.py to add options 'spam-test', 'spam-train', 'ham-test' and 'ham-train', to allow you to set the training set size separately to the testing set size. I haven't checked this in because it will break everyone's test scripts - --spam= will no longer be distinct, and getopt will gripe. Let me know if I should check this in anyway - I think it's useful, but YMMV.
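[A small aside on why getopt would gripe - this is an illustrative sketch, not the actual timcv.py option handling: Python's getopt module accepts any unique prefix of a long option name, so once both 'spam-test' and 'spam-train' exist, a bare '--spam=' no longer names either one and parsing fails.]

```python
# Illustrative sketch (not the real timcv.py code): with both
# --spam-test and --spam-train registered, "--spam" matches two
# long options, so getopt raises "not a unique prefix".
import getopt

longopts = ["spam-test=", "spam-train=", "ham-test=", "ham-train="]

try:
    getopt.getopt(["--spam=250"], "", longopts)
except getopt.GetoptError as err:
    print(err)   # option --spam not a unique prefix

# Spelled out in full, the new options parse fine:
opts, args = getopt.getopt(["--spam-train=250"], "", longopts)
print(opts)      # [('--spam-train', '250')]
```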
I wanted to see what would happen to the fp numbers when you were testing against a full-sized spam corpus, but a very very small number of ham messages. Here's some of what I found. The filenames are hamtrain_hamtest numbers - so 002_100 tested 100 hams against 3x2 spams (my personal corpus has 4 data sets in it).

What I was trying to determine here was whether it would be feasible to ship with a pre-canned set of spam wordinfo data, but no ham, and get the user to feed in a bunch of their own ham at the start. And if so, how much ham does the user have to feed in to start getting useful results. I've snipped the 'ratio' numbers, as they didn't reflect reality here.

For each of these tests, the system trained on 350 spam per set, and tested on 250 spam per set. It then trained on a small number of ham per set (the left hand side of the filename gives the number) and tested on 100 hams per set. The righthand-most set is the numbers for the 'full' set - 4 sets, each with around 2500 ham and 400 spam.

For all these tests, ham_cutoff was 0.27 and spam_cutoff was 0.99. They're just numbers I picked, after earlier testing. I'm more interested in how the data changes as the amount of training data changes.
filename: 001:100 002:100 003:100 005:100 010:100 015:100 020:100 full
fp total: 189 167 152 104 69 53 42 0
fp %: 47.44 41.94 38.19 26.12 17.25 13.33 10.69 0.00
fn total: 0 0 0 0 0 0 0 7
fn %: 0.00 0.00 0.00 0.00 0.03 0.03 0.00 0.49
unsure t: 215 204 187 209 191 176 148 164
unsure %: 15.41 14.59 13.39 14.95 13.66 12.57 10.59 1.30
real cost: $1940.65 $1718.35 $1565.00 $1086.85 $728.50 $568.87 $457.40 $39.80
best cost: $244.10 $238.25 $235.90 $227.65 $222.10 $218.13 $215.65 $35.20
h mean: 88.62 79.08 71.88 61.70 48.45 40.65 33.62 0.56
h sdev: 16.26 27.43 33.44 35.76 37.99 37.23 36.79 3.79
s mean: 99.93 99.95 99.95 99.88 99.87 99.83 99.81 97.91
s sdev: 0.94 0.68 0.80 1.59 1.94 2.02 2.27 10.12
mean diff: 11.32 20.87 28.07 38.18 51.41 59.18 66.19 97.35
k: 0.66 0.77 0.83 1.02 1.29 1.51 1.70 7.00

The numbers for each (001:, 002:, 003:, 005:, 010:, 015:, 020:) are actually averages of 4 different runs for each, with different -s options on each one (same set of 4 -s used for each, tho). Otherwise the variation was just too damn high. It's still a little 'bloopy' - the unsure bounces around a bit, but it's not bad.

The next set was just to confirm some of my prejudices - how much did upping the number of test hams change things? Testing with the same setup as before - this time, training with 25 hams from each of 3 sets, and testing with 100, 200, 300, 400, 500 hams. Once again, for reference, the 'full' data is the right hand column. Obviously, for this test the key thing to look at are the percentages, means, medians and the like, not the totals.
filename: 025_100 025_200 025_300 025_400 025_500 full
fp total: 50 70 91 159 150 0
fp %: 12.50 8.75 7.58 9.94 7.50 0.00
fn total: 0 0 0 0 0 7
fn %: 0.00 0.00 0.00 0.00 0.00 0.49
unsure t: 142 252 360 556 592 164
unsure %: 10.14 14.00 16.36 21.38 19.73 1.30
real cost: $528.40 $750.40 $982.00 $1701.20 $1618.40 $39.80
best cost: $216.80 $225.20 $230.20 $243.80 $242.00 $35.20
h mean: 35.31 29.04 26.69 32.89 27.12 0.56
h sdev: 37.76 35.33 33.94 36.29 34.21 3.79
s mean: 99.82 99.75 99.72 99.82 99.72 97.91
s sdev: 1.73 2.86 3.11 1.73 3.11 10.12
mean diff: 64.51 70.71 73.03 66.93 72.60 97.35
k: 1.63 1.85 1.97 1.76 1.95 7.00

Finally, decided to go again, this time with ham-test set to 500, and ham-train set to different numbers (going up much higher this time). This time, I couldn't be bothered screenscraping the summarised averages from each, so here's the data for each of the 4 runs at each ham-train setting, with averages on the right.

filename: 001a_500 001c_500 001b_500 001d_500 avg
fp total: 1576 1170 919 871 1134
fp %: 78.80 58.50 45.95 43.55 56.70
fn total: 0 0 0 0 0
fn %: 0.00 0.00 0.00 0.00 0.00
unsure t: 154 812 1079 1130 793
unsure %: 5.13 27.07 35.97 37.67 26.46
real cost: $15790.80 $11862.40 $9405.80 $8936.00 $11498.75
best cost: $516.00 $438.00 $389.40 $379.80 $430.80
h mean: 86.07 91.31 87.46 87.14 87.99
h sdev: 31.85 16.40 16.67 16.51 20.36
s mean: 100.00 99.95 99.97 99.95 99.97
s sdev: 0.08 0.96 0.39 0.71 0.54
mean diff: 13.93 8.64 12.51 12.81 11.97
k: 0.44 0.50 0.73 0.74 0.60

filename: 010a_500 010c_500 010b_500 010d_500 avg
fp total: 310 297 308 346 315
fp %: 15.50 14.85 15.40 17.30 15.76
fn total: 0 1 0 0 0
fn %: 0.00 0.10 0.00 0.00 0.03
unsure t: 953 987 921 986 961
unsure %: 31.77 32.90 30.70 32.87 32.06
real cost: $3290.60 $3168.40 $3264.20 $3657.20 $3345.10
best cost: $274.00 $270.60 $270.40 $281.20 $274.05
h mean: 46.54 47.73 45.74 50.05 47.52
h sdev: 36.64 35.98 37.15 37.73 36.88
s mean: 99.87 99.82 99.90 99.91 99.88
s sdev: 1.19 2.88 1.25 1.10 1.60
mean diff: 53.33 52.09 54.16 49.86 52.36
k: 1.41 1.34 1.41 1.28 1.36

filename: 020a_500 020c_500 020b_500 020d_500 avg
fp total: 243 173 202 71 172
fp %: 12.15 8.65 10.10 3.55 8.61
fn total: 0 1 0 0 0
fn %: 0.00 0.10 0.00 0.00 0.03
unsure t: 791 691 690 489 665
unsure %: 26.37 23.03 23.00 16.30 22.18
real cost: $2588.20 $1869.20 $2158.00 $807.80 $1855.80
best cost: $261.40 $246.60 $247.60 $226.40 $245.50
h mean: 38.01 31.48 33.09 19.77 30.59
h sdev: 37.36 35.42 36.67 29.44 34.72
s mean: 99.84 99.77 99.88 99.67 99.79
s sdev: 1.40 2.90 1.71 3.52 2.38
mean diff: 61.83 68.29 66.79 79.90 69.20
k: 1.60 1.78 1.74 2.42 1.89

filename: 030a_500 030c_500 030b_500 030d_500 avg
fp total: 173 150 155 133 152
fp %: 8.65 7.50 7.75 6.65 7.64
fn total: 0 0 0 0 0
fn %: 0.00 0.00 0.00 0.00 0.00
unsure t: 624 533 571 580 577
unsure %: 20.80 17.77 19.03 19.33 19.23
real cost: $1854.80 $1606.60 $1664.20 $1446.00 $1642.90
best cost: $248.20 $242.80 $239.80 $237.80 $242.15
h mean: 29.65 24.97 27.12 25.45 26.80
h sdev: 35.74 33.86 35.35 33.55 34.62
s mean: 99.81 99.74 99.88 99.77 99.80
s sdev: 1.89 2.93 1.78 2.69 2.32
mean diff: 70.16 74.77 72.76 74.32 73.00
k: 1.86 2.03 1.96 2.05 1.98

filename: 040a_500 040c_500 040b_500 040d_500 avg
fp total: 121 114 109 81 106
fp %: 6.05 5.70 5.45 4.05 5.31
fn total: 0 0 0 0 0
fn %: 0.00 0.00 0.00 0.00 0.00
unsure t: 522 443 470 404 459
unsure %: 17.40 14.77 15.67 13.47 15.33
real cost: $1314.40 $1228.60 $1184.00 $890.80 $1154.45
best cost: $241.00 $235.60 $229.80 $231.20 $234.40
h mean: 23.62 20.82 21.42 17.71 20.89
h sdev: 33.35 32.06 32.15 29.90 31.87
s mean: 99.77 99.72 99.85 99.69 99.76
s sdev: 2.14 3.17 1.96 3.26 2.63
mean diff: 76.15 78.90 78.43 81.98 78.87
k: 2.15 2.24 2.30 2.47 2.29

filename: 060a_500 060c_500 060b_500 060d_500 avg
fp total: 54 80 74 60 67
fp %: 2.70 4.00 3.70 3.00 3.35
fn total: 0 1 0 0 0
fn %: 0.00 0.10 0.00 0.00 0.03
unsure t: 272 320 281 235 277
unsure %: 9.07 10.67 9.37 7.83 9.23
real cost: $594.40 $865.00 $796.20 $647.00 $725.65
best cost: $224.20 $230.60 $223.60 $223.80 $225.55
h mean: 11.95 15.23 14.08 11.38 13.16
h sdev: 25.25 28.86 27.35 25.31 26.69
s mean: 99.60 99.63 99.78 99.75 99.69
s sdev: 3.75 3.75 2.41 2.45 3.09
mean diff: 87.65 84.40 85.70 88.37 86.53
k: 3.02 2.59 2.88 3.18 2.92

filename: 100a_500 100c_500 100b_500 100d_500 avg
fp total: 40 56 61 40 49
fp %: 2.00 2.80 3.05 2.00 2.46
fn total: 0 0 0 1 0
fn %: 0.00 0.00 0.00 0.10 0.03
unsure t: 190 215 203 185 198
unsure %: 6.33 7.17 6.77 6.17 6.61
real cost: $438.00 $603.00 $650.60 $438.00 $532.40
best cost: $219.60 $225.20 $221.00 $218.40 $221.05
h mean: 8.03 10.37 9.74 8.33 9.12
h sdev: 21.41 24.44 23.46 22.11 22.86
s mean: 99.55 99.50 99.70 99.69 99.61
s sdev: 4.17 4.53 3.16 3.35 3.80
mean diff: 91.52 89.13 89.96 91.36 90.49
k: 3.58 3.08 3.38 3.59 3.41

filename: 150a_500 150c_500 150b_500 150d_500 avg
fp total: 34 36 33 50 38
fp %: 1.70 1.80 1.65 2.50 1.91
fn total: 1 0 1 1 0
fn %: 0.10 0.00 0.10 0.10 0.08
unsure t: 151 152 114 124 135
unsure %: 5.03 5.07 3.80 4.13 4.51
real cost: $371.20 $390.40 $353.80 $525.80 $410.30
best cost: $217.40 $221.00 $216.40 $219.60 $218.60
h mean: 6.43 6.84 5.56 6.50 6.33
h sdev: 19.43 20.21 18.38 20.33 19.59
s mean: 99.50 99.43 99.58 99.48 99.50
s sdev: 4.57 4.95 4.27 4.83 4.65
mean diff: 93.07 92.59 94.02 92.98 93.17
k: 3.88 3.68 4.15 3.70 3.85

filename: 200a_500 200c_500 200b_500 200d_500 avg
fp total: 20 24 16 10 17
fp %: 1.00 1.20 0.80 0.50 0.88
fn total: 1 1 1 1 1
fn %: 0.10 0.10 0.10 0.10 0.10
unsure t: 123 136 101 109 117
unsure %: 4.10 4.53 3.37 3.63 3.91
real cost: $225.60 $268.20 $181.20 $122.80 $199.45
best cost: $213.40 $217.60 $174.40 $114.20 $179.90
h mean: 4.69 5.31 3.95 3.88 4.46
h sdev: 16.53 17.87 15.37 14.41 16.05
s mean: 99.45 99.38 99.50 99.45 99.44
s sdev: 4.81 5.29 5.02 5.10 5.05
mean diff: 94.76 94.07 95.55 95.57 94.99
k: 4.44 4.06 4.69 4.90 4.52

filename: 250a_500 250c_500 250b_500 250d_500 avg
fp total: 13 14 11 8 11
fp %: 0.65 0.70 0.55 0.40 0.58
fn total: 1 1 1 0 0
fn %: 0.10 0.10 0.10 0.00 0.08
unsure t: 117 118 93 118 111
unsure %: 3.90 3.93 3.10 3.93 3.72
real cost: $154.40 $164.60 $129.60 $103.60 $138.05
best cost: $145.20 $158.40 $123.20 $93.40 $130.05
h mean: 3.93 4.03 3.41 4.00 3.84
h sdev: 14.76 15.08 14.07 14.78 14.67
s mean: 99.41 99.33 99.49 99.62 99.46
s sdev: 5.11 5.55 5.13 3.99 4.95
mean diff: 95.48 95.30 96.08 95.62 95.62
k: 4.81 4.62 5.00 5.09 4.88

filename: 300a_500 300c_500 300b_500 300d_500 avg
fp total: 9 8 7 9 8
fp %: 0.45 0.40 0.35 0.45 0.41
fn total: 2 2 1 1 1
fn %: 0.20 0.20 0.10 0.10 0.15
unsure t: 112 107 97 89 101
unsure %: 3.73 3.57 3.23 2.97 3.38
real cost: $114.40 $103.40 $90.40 $108.80 $104.25
best cost: $105.00 $96.00 $84.20 $102.00 $96.80
h mean: 3.43 3.25 3.14 2.99 3.20
h sdev: 13.57 13.03 13.41 13.16 13.29
s mean: 99.34 99.30 99.48 99.58 99.42
s sdev: 5.36 5.73 5.19 4.11 5.10
mean diff: 95.91 96.05 96.34 96.59 96.22
k: 5.07 5.12 5.18 5.59 5.24

filename: 350a_500 350c_500 350b_500 350d_500 avg
fp total: 8 5 4 4 5
fp %: 0.40 0.25 0.20 0.20 0.26
fn total: 2 1 1 3 1
fn %: 0.20 0.10 0.10 0.30 0.17
unsure t: 101 100 89 94 96
unsure %: 3.37 3.33 2.97 3.13 3.20
real cost: $102.20 $71.00 $58.80 $61.80 $73.45
best cost: $93.00 $65.60 $53.80 $54.60 $66.75
h mean: 2.88 2.81 2.65 2.63 2.74
h sdev: 12.12 11.84 11.90 11.54 11.85
s mean: 99.33 99.28 99.43 99.34 99.34
s sdev: 5.48 5.83 5.41 5.93 5.66
mean diff: 96.45 96.47 96.78 96.71 96.60
k: 5.48 5.46 5.59 5.54 5.52

filename: 400a_500 400c_500 400b_500 400d_500 avg
fp total: 6 5 3 6 5
fp %: 0.30 0.25 0.15 0.30 0.25
fn total: 2 3 2 1 2
fn %: 0.20 0.30 0.20 0.10 0.20
unsure t: 94 96 83 80 88
unsure %: 3.13 3.20 2.77 2.67 2.94
real cost: $80.80 $72.20 $48.60 $77.00 $69.65
best cost: $73.40 $64.40 $44.20 $71.00 $63.25
h mean: 2.57 2.51 2.39 2.30 2.44
h sdev: 11.24 11.11 11.20 10.67 11.05
s mean: 99.29 99.23 99.38 99.46 99.34
s sdev: 5.60 6.21 5.62 4.70 5.53
mean diff: 96.72 96.72 96.99 97.16 96.90
k: 5.74 5.58 5.77 6.32 5.85

filename: 450a_500 450c_500 450b_500 450d_500 avg
fp total: 5 4 3 5 4
fp %: 0.25 0.20 0.15 0.25 0.21
fn total: 2 4 2 3 2
fn %: 0.20 0.40 0.20 0.30 0.28
unsure t: 81 88 77 88 83
unsure %: 2.70 2.93 2.57 2.93 2.78
real cost: $68.20 $61.60 $47.40 $70.60 $61.95
best cost: $63.40 $53.60 $42.80 $60.20 $55.00
h mean: 2.14 2.19 2.08 2.79 2.30
h sdev: 10.05 10.12 10.34 12.08 10.65
s mean: 99.21 99.09 99.32 99.56 99.30
s sdev: 5.98 6.91 5.89 5.16 5.99
mean diff: 97.07 96.90 97.24 96.77 96.99
k: 6.06 5.69 5.99 5.61 5.84

filename: 500a_500 500c_500 500b_500 500d_500 avg
fp total: 5 3 3 5 4
fp %: 0.25 0.15 0.15 0.25 0.20
fn total: 2 4 2 1 2
fn %: 0.20 0.40 0.20 0.10 0.23
unsure t: 78 88 74 76 79
unsure %: 2.60 2.93 2.47 2.53 2.63
real cost: $67.60 $51.60 $46.80 $66.20 $58.05
best cost: $63.00 $45.60 $41.80 $59.60 $52.50
h mean: 2.08 2.07 1.80 2.00 1.99
h sdev: 9.82 9.83 9.35 9.79 9.70
s mean: 99.19 99.06 99.21 99.39 99.21
s sdev: 6.11 6.98 6.51 5.31 6.23
mean diff: 97.11 96.99 97.41 97.39 97.22
k: 6.10 5.77 6.14 6.45 6.11

filename: 600a_500 600c_500 600b_500 600d_500 avg
fp total: 4 2 3 4 3
fp %: 0.20 0.10 0.15 0.20 0.16
fn total: 3 3 2 1 2
fn %: 0.30 0.30 0.20 0.10 0.23
unsure t: 79 77 70 79 76
unsure %: 2.63 2.57 2.33 2.63 2.54
real cost: $58.80 $38.40 $46.00 $56.80 $50.00
best cost: $51.60 $33.60 $40.80 $50.60 $44.15
h mean: 1.84 1.69 1.56 1.77 1.71
h sdev: 9.10 8.68 8.52 9.11 8.85
s mean: 99.13 98.98 99.15 99.25 99.13
s sdev: 6.40 7.29 6.80 5.74 6.56
mean diff: 97.29 97.29 97.59 97.48 97.41
k: 6.28 6.09 6.37 6.56 6.33

filename: 700a_500 700c_500 700b_500 700d_500 avg
fp total: 4 2 2 2 2
fp %: 0.20 0.10 0.10 0.10 0.12
fn total: 3 3 2 3 2
fn %: 0.30 0.30 0.20 0.30 0.28
unsure t: 75 70 70 62 69
unsure %: 2.50 2.33 2.33 2.07 2.31
real cost: $58.00 $37.00 $36.00 $35.40 $41.60
best cost: $51.60 $34.00 $33.40 $32.00 $37.75
h mean: 1.64 1.53 1.41 1.35 1.48
h sdev: 8.51 8.13 7.99 7.30 7.98
s mean: 99.07 98.92 99.07 99.23 99.07
s sdev: 6.60 7.60 6.99 6.45 6.91
mean diff: 97.43 97.39 97.66 97.88 97.59
k: 6.45 6.19 6.52 7.12 6.57

filename: 1000a_500 1000c_500 1000b_500 1000d_500 avg
fp total: 2 0 2 1 1
fp %: 0.10 0.00 0.10 0.05 0.06
fn total: 4 3 3 1 2
fn %: 0.40 0.30 0.30 0.10 0.28
unsure t: 68 66 61 56 62
unsure %: 2.27 2.20 2.03 1.87 2.09
real cost: $37.60 $16.20 $35.20 $22.20 $27.80
best cost: $35.20 $15.60 $33.80 $20.20 $26.20
h mean: 1.11 1.10 1.11 1.27 1.15
h sdev: 6.63 6.33 7.02 7.78 6.94
s mean: 98.80 98.71 98.90 99.26 98.92
s sdev: 7.62 8.13 7.66 5.63 7.26
mean diff: 97.69 97.61 97.79 97.99 97.77
k: 6.86 6.75 6.66 7.31 6.89

filename: 1500a_500 1500c_500 1500b_500 1500d_500 avg
fp total: 1 0 1 0 0
fp %: 0.05 0.00 0.05 0.00 0.03
fn total: 4 4 3 7 4
fn %: 0.40 0.40 0.30 0.70 0.45
unsure t: 76 71 77 74 74
unsure %: 2.53 2.37 2.57 2.47 2.48
real cost: $29.20 $18.20 $28.40 $21.80 $24.40
best cost: $28.00 $14.80 $23.80 $11.80 $19.60
h mean: 0.86 0.77 0.79 0.89 0.83
h sdev: 5.54 4.86 5.43 5.58 5.35
s mean: 98.39 98.31 98.42 98.45 98.39
s sdev: 8.81 9.31 9.04 9.10 9.06
mean diff: 97.53 97.54 97.63 97.56 97.56
k: 6.80 6.88 6.75 6.65 6.77

filename: 2000a_500 2000c_500 2000b_500 2000d_500 avg
fp total: 0 0 0 0 0
fp %: 0.00 0.00 0.00 0.00 0.00
fn total: 5 4 3 6 4
fn %: 0.50 0.40 0.30 0.60 0.45
unsure t: 81 82 86 75 81
unsure %: 2.70 2.73 2.87 2.50 2.70
real cost: $21.20 $20.40 $20.20 $21.00 $20.70
best cost: $20.60 $14.00 $14.40 $13.00 $15.50
h mean: 0.73 0.60 0.68 0.66 0.67
h sdev: 4.74 3.80 4.63 4.48 4.41
s mean: 98.06 98.02 98.13 98.11 98.08
s sdev: 9.77 9.94 9.60 10.14 9.86
mean diff: 97.33 97.42 97.45 97.45 97.41
k: 6.71 7.09 6.85 6.67 6.83

filename: 2500a_500 2500c_500 2500b_500 2500d_500 avg
fp total: 0 0 0 0 0
fp %: 0.00 0.00 0.00 0.00 0.00
fn total: 5 3 3 6 4
fn %: 0.50 0.30 0.30 0.60 0.43
unsure t: 87 92 92 82 88
unsure %: 2.90 3.07 3.07 2.73 2.94
real cost: $22.40 $21.40 $21.40 $22.40 $21.90
best cost: $15.40 $13.80 $14.40 $18.80 $15.60
h mean: 0.64 0.55 0.58 0.57 0.58
h sdev: 4.32 3.59 3.87 4.07 3.96
s mean: 97.79 97.76 97.90 97.72 97.79
s sdev: 10.38 10.50 10.02 10.85 10.44
mean diff: 97.15 97.21 97.32 97.15 97.21
k: 6.61 6.90 7.01 6.51 6.76

filename:
full_500 (trained on 3 sets of 2700 each, tested against 500)
fp total: 0 0
fp %: 0.00 0.00
fn total: 5 5
fn %: 0.50 0.50
unsure t: 89 89
unsure %: 2.97 2.97
real cost: $22.80 $22.80
best cost: $20.60 $20.60
h mean: 0.61 0.61
h sdev: 4.31 4.31
s mean: 97.77 97.77
s sdev: 10.63 10.63
mean diff: 97.16 97.16
k: 6.50 6.50

Note that the '0/5/89' fp/fn/unsure could be switched into 1/7/48 by adjusting the ham_cutoff to 0.33 and spam_cutoff to 0.90. I'm not re-running the above series of tests for that, though!

Here's the summary-summary table:

ham-train  bestcost  realcost    fp%   fn%  unsure%
        1    430.80  11498.75  56.70  0.00    26.46
       10    274.05   3345.10  15.76  0.03    32.06
       20    245.50   1855.80   8.61  0.03    22.18
       30    242.15   1642.90   7.64  0.00    19.23
       40    234.40   1154.45   5.31  0.00    15.33
       60    225.55    725.65   3.35  0.03     9.23
      100    221.05    532.40   2.46  0.03     6.61
      150    218.60    410.30   1.91  0.08     4.51
      200    179.90    199.45   0.88  0.10     3.91
      250    130.05    138.05   0.58  0.08     3.72
      300     96.80    104.25   0.41  0.15     3.38
      350     66.75     73.45   0.26  0.17     3.20
      400     63.25     69.65   0.25  0.20     2.94
      450     61.95     61.95   0.21  0.28     2.78
      500     52.50     58.05   0.20  0.23     2.63
      600     44.15     50.00   0.16  0.23     2.54
      700     37.75     41.60   0.12  0.28     2.31
     1000     26.20     27.80   0.06  0.28     2.09
     1500     19.60     24.40   0.03  0.45     2.48
     2000     15.50     20.70   0.00  0.45     2.70
     2500     15.60     21.90   0.00  0.43     2.94
     2700     20.60     22.80   0.00  0.50     2.97

It seems like most of the wins come once you get up around 350, the number of spam trained on. The unsure bucket actually gets a bit worse as more ham is added - looking at the histograms, various bits of spam are dragged downwards.

Anthony

From tim.one@comcast.net Tue Oct 29 03:58:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 22:58:55 -0500
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: <20021014220902.71797F4D4@cashew.wolfskeep.com>
Message-ID: 

[T.
Alexander Popiel [mailto:popiel@wolfskeep.com] Sent: Monday, October 14, 2002 6:09 PM]

> It appears to be a systematic error when a mailing list manager
> appends plain text to what should be a base64 encoded segment.
> Bad MLM, no biscuit. This confuses the MIME decoder. Bad MIME
> decoder, too!

This is ironic: it turns out that the cause is that the MIME decoder was *too* forgiving, in a twisted, relevant sense:

> As a sample:
>
> """
> ...
> Content-Type: text/plain
> Content-Transfer-Encoding: base64
> ...
>
> DQpUck1lbG9kaSwgS/1y/WsgbGlua2xpIOdhbP3+bWF5YW4gdmUgYmlydGVrIG1wMyD8IGlu
> ZGlyaXJrZW4gYmlsZSBpbnNhbmxhcv0ga2FocmVkZW4gc/Z6ZGUgbXAzIHNpdGVsZXJpbmUg
> YWx0ZXJuYXRpZiANCm9sYXJhayBzaXpsZXIgaedpbiD2emVubGUgaGF6/XJsYW5t/f50/XIu
> IEhlciB5Yf50YW4gaGVyIGtlc2ltZGVuIG38emlrc2V2ZXJlIGhpdGFwIGVkZWJpbG1layBp
> 52luIHRhc2FybGFubf3+IDEzIEdCIA0KbP1rIGRldiBNcDMgbGlzdGVzaXlsZSBz/W79Zv1u
> ZGEgcmFraXBzaXogb2xhY2FrIP5la2lsZGUgZG9uYXT9bG39/iB2ZSBzaXogbfx6aWtzZXZl
> cmxlcmluIGhpem1ldGluZSBzdW51bG11/nR1ci4gDQpodHRwOi8vd3d3LnRybWVsb2RpLmNv
> bSBhZHJlc2luZGVraSBkZXYgYXL+aXZpbWl6ZGUgc2l6aSBiZWtsZXllbiBlbiBzZXZkafBp
> bml6IHNhbmF05/1sYXL9biBlbiBzZXZkafBpbml6IA0K/mFya/1sYXL9bv0gYmlya2HnIGRh
> a2lrYSBp52luZGUgYmlsZ2lzYXlhcv1u/XphIGluZGlyaW4gdmUga2V5aWZsZSBkaW5sZW1l
> eWUgYmH+bGF5/W4uIA0KDQrdeWkgRfBsZW5jZWxlci4uIA0KaHR0cDovL3d3dy50cm1lbG9k
> aS5jb20NCg0KDQoNCg0K
>
>
> --
> To UNSUBSCRIBE, email to debian-java-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact
> listmaster@lists.debian.org
> """

I tried like hell to provoke this problem with base64 msgs, and couldn't. It turns out that the final "real base64" line was the key:

> aS5jb20NCg0KDQoNCg0K

Because this section didn't happen to need any '=' padding, the base64 decoder didn't know that it was over, and went on to take the entire remainder of the text as if it were base64 too.
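The lenient-versus-strict split described above is easy to reproduce with Python's base64 module. A minimal sketch (not the actual spambayes fix): decode strictly line by line, and stop at the first line that fails validation, keeping the remainder as plain text instead of folding it into the base64 data.

```python
import base64
import binascii

def split_base64_section(text):
    """Strictly decode leading base64 lines; return (payload, trailing_text).

    Rather than letting the lenient decoder silently swallow a trailing
    plain-text section (discarding characters it doesn't like), stop at
    the first line that isn't valid base64 and keep the rest as text.
    """
    decoded = bytearray()
    lines = text.splitlines()
    for i, line in enumerate(lines):
        try:
            # validate=True refuses characters outside the base64 alphabet
            decoded.extend(base64.b64decode(line, validate=True))
        except binascii.Error:
            return bytes(decoded), "\n".join(lines[i:])
    return bytes(decoded), ""

body = "aGVsbG8h\n-- appended plain text"
# The lenient decoder folds the trailer's letters into the base64 data and
# then chokes on the resulting improper padding:
try:
    base64.b64decode(body)
except binascii.Error:
    pass  # pseudo-base64 remainder -> padding error, as described above
print(split_base64_section(body))  # (b'hello!', '-- appended plain text')
```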
Until it sees a string of '=' marks, it will accept darned near everything, and simply ignore characters that don't make sense for base64. In the end, the error it raises comes from treating the remainder of the msg as pseudo-base64 too, which leads to an improperly padded base64 string.

I believe I've fixed this now, by falling back to a stricter(!) approach when the builtin approach fails. In cases where the base64 section is terminated by a string of '=', the builtin approach doesn't fail, and in those cases we lose the plain text part. If it falls back to the stricter approach, we don't lose the plain text part. Perhaps I should lose the plain text part in this case too?

BTW, looks like your example was foreign-language MP3 spam. It scores like so for me:

0.99970963814
'*H*' 0.000577077329346
'*S*' 0.99999635361

From tim.one@comcast.net Tue Oct 29 04:55:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 28 Oct 2002 23:55:27 -0500
Subject: [Spambayes] defaults vs. chi-square
In-Reply-To: 
Message-ID: 

[Tim, claims to have fixed the "plain text follows a base64 section" decoding glitch]

Just FYI, this had minor good effects on my c.l.py test (10-fold CV):

filename:         cv        tcap
ham:spam: 20000:14000 20000:14000
fp total:          2           2
fp %:           0.01        0.01
fn total:          0           0
fn %:           0.00        0.00
unsure t:        103          97
unsure %:       0.30        0.29
real cost:    $40.60      $39.40
best cost:    $27.00      $26.80
h mean:         0.28        0.26
h sdev:         2.99        2.89
s mean:        99.94       99.94
s sdev:         1.41        1.44
mean diff:     99.66       99.68
k:             22.65       23.02

Hmm! That "after" run there also had replace_nonascii_chars: True, a second difference. Sorry about that; it's not worth it (to me) to separate those out.
The percentiles for this large-training test have gotten very interesting: -> Ham scores for all runs: 20000 items; mean 0.26; sdev 2.89 -> min 0; median 6.37101e-011; max 100 -> percentiles: 5% 0; 25% 2.22045e-014; 75% 8.15779e-007; 95% 0.0358985 -> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.44 -> min 29.8279; median 100; max 100 -> percentiles: 5% 100; 25% 100; 75% 100; 95% 100 Histogram analysis still suggests it would be cheaper to let some FN go through: -> best cost for all runs: $26.80 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at ham & spam cutoffs 0.5 & 0.775 -> fp 2; fn 3; unsure ham 11; unsure spam 8 -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559% From rob@hooft.net Tue Oct 29 09:48:45 2002 From: rob@hooft.net (Rob W.W. Hooft) Date: Tue, 29 Oct 2002 10:48:45 +0100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. References: <200210290125.g9T1Ppw09085@localhost.localdomain> Message-ID: <3DBE597D.2040504@hooft.net> Anthony Baxter wrote: > So I hacked on timcv.py and msgs.py to add options 'spam-test', > 'spam-train', 'ham-test' and 'ham-train', to allow you to set > the training set size separately to the testing set size. > I haven't checked this in because it will break everyone's > test scripts - --spam= will no longer be distinct, and getopt > will gripe. Let me know if I should check this in anyway - I > think it's useful, but YMMV. Can you fix the backward compatibility by adding --spam and --ham and --spam-keep and --ham-keep options that do both? Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From anthony@interlink.com.au Tue Oct 29 09:50:16 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Tue, 29 Oct 2002 20:50:16 +1100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. In-Reply-To: <3DBE597D.2040504@hooft.net> Message-ID: <200210290950.g9T9oHU11606@localhost.localdomain> >>> "Rob W.W. 
Hooft" wrote > Anthony Baxter wrote: > > So I hacked on timcv.py and msgs.py to add options 'spam-test', > > 'spam-train', 'ham-test' and 'ham-train', to allow you to set > > the training set size separately to the testing set size. > > I haven't checked this in because it will break everyone's > > test scripts - --spam= will no longer be distinct, and getopt > > will gripe. Let me know if I should check this in anyway - I > > think it's useful, but YMMV. > > Can you fix the backward compatibility by adding --spam and --ham > and --spam-keep and --ham-keep options that do both? Nah - getopt doesn't like it. Anthony From popiel@wolfskeep.com Tue Oct 29 18:41:24 2002 From: popiel@wolfskeep.com (T. Alexander Popiel) Date: Tue, 29 Oct 2002 10:41:24 -0800 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. In-Reply-To: Message from Anthony Baxter <200210290125.g9T1Ppw09085@localhost.localdomain> References: <200210290125.g9T1Ppw09085@localhost.localdomain> Message-ID: <20021029184124.1A3D9F595@cashew.wolfskeep.com> In message: <200210290125.g9T1Ppw09085@localhost.localdomain> Anthony Baxter writes: >So I hacked on timcv.py and msgs.py to add options 'spam-test', >'spam-train', 'ham-test' and 'ham-train', to allow you to set >the training set size separately to the testing set size. >I haven't checked this in because it will break everyone's >test scripts - --spam= will no longer be distinct, and getopt >will gripe. Let me know if I should check this in anyway - I >think it's useful, but YMMV. I'd like to have it. :-) >The numbers for each (001:, 002:, 003:, 005:, 010:, 015:, 020:) are >actually averages of 4 different runs for each, with different >-s options on each one (same set of 4 -s used for each, tho). >Otherwise the variation was just too damn high. It's still a little >'bloopy' - the unsure bounces around a bit, but it's not bad. Cool. Good to see someone more thorough than I am... I've been getting(?) sloppy. 
I'm not a real statistician, and it shows.

>Here's the summary-summary table:
>ham-train bestcost realcost fp% fn% unsure%
> 1 430.80 11498.75 56.70 0.00 26.46
> 10 274.05 3345.10 15.76 0.03 32.06
> 20 245.50 1855.80 8.61 0.03 22.18
> 30 242.15 1642.90 7.64 0.00 19.23
> 40 234.40 1154.45 5.31 0.00 15.33
> 60 225.55 725.65 3.35 0.03 9.23
> 100 221.05 532.40 2.46 0.03 6.61
> 150 218.60 410.30 1.91 0.08 4.51
> 200 179.90 199.45 0.88 0.10 3.91
> 250 130.05 138.05 0.58 0.08 3.72
> 300 96.80 104.25 0.41 0.15 3.38
> 350 66.75 73.45 0.26 0.17 3.20
> 400 63.25 69.65 0.25 0.20 2.94
> 450 61.95 61.95 0.21 0.28 2.78
> 500 52.50 58.05 0.20 0.23 2.63
> 600 44.15 50.00 0.16 0.23 2.54
> 700 37.75 41.60 0.12 0.28 2.31
> 1000 26.20 27.80 0.06 0.28 2.09
> 1500 19.60 24.40 0.03 0.45 2.48
> 2000 15.50 20.70 0.00 0.45 2.70
> 2500 15.60 21.90 0.00 0.43 2.94
> 2700 20.60 22.80 0.00 0.50 2.97
>
>It seems like most of the wins come once you get up around 350, the
>number of spam trained on. The unsure bucket actually gets a bit worse
>as more ham is added - looking at the histograms, various bits of spam
>are dragged downwards.

Beautiful. It looks like the excess ham only starts hurting unsures after about 1000 (or about 3:1).

- Alex

From popiel@wolfskeep.com Tue Oct 29 18:54:22 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 29 Oct 2002 10:54:22 -0800
Subject: [Spambayes] max word size
Message-ID: <20021029185422.2675EF595@cashew.wolfskeep.com>

Changing the max word size (for generating skip tokens) doesn't seem to have much effect on my data. Have table... it pretty much says it all.

-> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[...]
filename: skip10 skip11 skip12 skip13 skip14 skip20 skip50 ham:spam: 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 fp total: 4 3 3 3 3 4 4 fp %: 0.20 0.15 0.15 0.15 0.15 0.20 0.20 fn total: 12 10 12 11 12 10 10 fn %: 0.60 0.50 0.60 0.55 0.60 0.50 0.50 unsure t: 52 55 53 55 53 52 54 unsure %: 1.30 1.38 1.32 1.38 1.32 1.30 1.35 real cost: $62.40 $51.00 $52.60 $52.00 $52.60 $60.40 $60.80 best cost: $49.20 $49.00 $48.20 $48.40 $48.40 $49.40 $50.00 h mean: 0.42 0.41 0.40 0.40 0.38 0.39 0.39 h sdev: 5.47 5.42 5.39 5.35 5.22 5.30 5.22 s mean: 98.44 98.45 98.45 98.46 98.46 98.48 98.48 s sdev: 9.87 9.79 9.76 9.72 9.75 9.71 9.69 mean diff: 98.02 98.04 98.05 98.06 98.08 98.09 98.09 k: 6.39 6.45 6.47 6.51 6.55 6.53 6.58 It doesn't look like there's any significance in there, even with the extreme sizes... - Alex From richie@entrian.com Tue Oct 29 21:04:01 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 29 Oct 2002 21:04:01 +0000 Subject: [Spambayes] Re: pop3proxy bug? In-Reply-To: References: Message-ID: Hi Jeremy, Sorry for the delay - I've been away since Friday. > Did you ever test this code? Yes, of course I tested it - I've been using it to retrieve all my email since the day I wrote it! 8-) The problem is probably down to platform-dependent behaviour - I'm running on Windows 98 and it works like a charm for me. I'll give it a go on Linux over the next day or two and see what happens. > I changed the code to use the raw server socket and sendall() instead > of self.serverFile.write() and it worked. But I'm uneasy. That's a perfectly reasonable fix. Once I've reproduced the problem on Linux, I'll apply, test and commit that fix - thanks. -- Richie Hindle richie@entrian.com From richie@entrian.com Tue Oct 29 21:04:27 2002 From: richie@entrian.com (Richie Hindle) Date: Tue, 29 Oct 2002 21:04:27 +0000 Subject: [Spambayes] Some minor nits ... 
In-Reply-To: <001a01c27d17$dfec99a0$6300000a@holdenweb.com>
References: <001a01c27d17$dfec99a0$6300000a@holdenweb.com>
Message-ID: 

Hi Steve,

> the X-Hammie-Disposition header is treated as a part of the message body

This was a bug, now fixed. In trying to deal with non-conforming emails, I was converting all your emails into non-conforming ones. Nice. 8-)

> Under cygwin (python 2.2.1) I see the following asyncore error:

See my reply to Jeremy - I'll look at this this week.

--
Richie Hindle
richie@entrian.com

From jeremy@alum.mit.edu Tue Oct 29 21:40:33 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 29 Oct 2002 16:40:33 -0500
Subject: [Spambayes] Re: pop3proxy bug?
In-Reply-To: 
References: 
Message-ID: <15807.81.616489.126728@slothrop.zope.com>

>>>>> "RH" == Richie Hindle writes:

>> Did you ever test this code?

RH> Yes, of course I tested it - I've been using it to retrieve all
RH> my email since the day I wrote it! 8-)

Sorry for the testy response. I didn't realize that you could do what you were doing with makefile().

RH> The problem is probably down to platform-dependent behaviour -
RH> I'm running on Windows 98 and it works like a charm for me.
RH> I'll give it a go on Linux over the next day or two and see what
RH> happens.

You could also create two files with makefile(), just like SocketServer.

Jeremy

From richie@entrian.com Tue Oct 29 22:10:52 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 29 Oct 2002 22:10:52 +0000
Subject: [Spambayes] Re: pop3proxy bug?
In-Reply-To: <15807.81.616489.126728@slothrop.zope.com>
References: <15807.81.616489.126728@slothrop.zope.com>
Message-ID: 

Hi Jeremy,

> Jeremy: Did you ever test this code?
> Richie: Yes, of course I tested it
> Jeremy: Sorry for the testy response.

'testy', very good!

> I didn't realize that you could do what you were doing with makefile().
It probably shouldn't be allowed, but I guess the line in socket.py that says: self.mode = mode # Not actually used in this version means that someone somewhere is aware of this. > You could also create two files with makefile(), just like > SocketServer. Thanks for the suggestion - that's probably the neatest fix. -- Richie Hindle richie@entrian.com From anthony@interlink.com.au Wed Oct 30 07:36:46 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 30 Oct 2002 18:36:46 +1100 Subject: [Spambayes] training on very small ham sets, normal sized spamsets. In-Reply-To: <20021029184124.1A3D9F595@cashew.wolfskeep.com> Message-ID: <200210300736.g9U7alm19317@localhost.localdomain> >>> "T. Alexander Popiel" wrote > >So I hacked on timcv.py and msgs.py to add options 'spam-test', > >'spam-train', 'ham-test' and 'ham-train', to allow you to set > >the training set size separately to the testing set size. > >I haven't checked this in because it will break everyone's > >test scripts - --spam= will no longer be distinct, and getopt > >will gripe. Let me know if I should check this in anyway - I > >think it's useful, but YMMV. > I'd like to have it. :-) I figured out a backwards compatible way to do it - make the new options --SpamTrain --SpamTest &c. I'll check it in shortly. > Cool. Good to see someone more thorough than I am... I've > been getting(?) sloppy. I'm not a real statistician, and > it shows. 
Neither am I - I just know enough to hurt myself :)

> >Here's the summary-summary table:
> >ham-train bestcost realcost fp% fn% unsure%
> > 1 430.80 11498.75 56.70 0.00 26.46
> > 10 274.05 3345.10 15.76 0.03 32.06
> > 20 245.50 1855.80 8.61 0.03 22.18
> > 30 242.15 1642.90 7.64 0.00 19.23
> > 40 234.40 1154.45 5.31 0.00 15.33
> > 60 225.55 725.65 3.35 0.03 9.23
> > 100 221.05 532.40 2.46 0.03 6.61
> > 150 218.60 410.30 1.91 0.08 4.51
> > 200 179.90 199.45 0.88 0.10 3.91
> > 250 130.05 138.05 0.58 0.08 3.72
> > 300 96.80 104.25 0.41 0.15 3.38
> > 350 66.75 73.45 0.26 0.17 3.20
> > 400 63.25 69.65 0.25 0.20 2.94
> > 450 61.95 61.95 0.21 0.28 2.78
> > 500 52.50 58.05 0.20 0.23 2.63
> > 600 44.15 50.00 0.16 0.23 2.54
> > 700 37.75 41.60 0.12 0.28 2.31
> > 1000 26.20 27.80 0.06 0.28 2.09
> > 1500 19.60 24.40 0.03 0.45 2.48
> > 2000 15.50 20.70 0.00 0.45 2.70
> > 2500 15.60 21.90 0.00 0.43 2.94
> > 2700 20.60 22.80 0.00 0.50 2.97
> >
> >It seems like most of the wins come once you get up around 350, the
> >number of spam trained on. The unsure bucket actually gets a bit worse
> >as more ham is added - looking at the histograms, various bits of spam
> >are dragged downwards.
>
> Beautiful. It looks like the excess ham only starts hurting
> unsures after about 1000 (or about 3:1).

fns also get worse after about 2:1, and most of the wins in the fp are there by the time you get to 3:1. So I'd say from this something like 2:1 or 3:1 ham:spam is a good number. But, as always, YMMV.

The 'best cost' column shows something different, but it's overly weighting fp's vs everything else (for my tastes). (yes, I can tweak it, but chose not to for this test).

--
Anthony Baxter
It's never too late to have a happy childhood.
From skip@pobox.com Tue Oct 29 04:19:54 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 28 Oct 2002 22:19:54 -0600 Subject: [Spambayes] RE: Spam vs time-of-day In-Reply-To: References: <15805.34750.447765.195346@montanaro.dyndns.org> Message-ID: <15806.3178.674153.634773@montanaro.dyndns.org> Tim> Your buckets span 10 minutes. The comment in the code is confused Tim> about this too. That's why your graph and mine both have 144 Tim> points on the X axis (24 * 6 = 144; you have six *buckets* per Tim> hour, and each spans 10 minutes). Yeah, after seeing this several times I'm beginning to think I made a mistake. ;-) >> The large spike at 0 is an artifact of my simpleminded Date header >> scanning. Invalid dates probably wound up with a value of 0. Tim> And at that time, *every* Date header generated a dow:invalid token Tim> (as well as the correct token, when possible). That's been Tim> repaired since then. Not really. The graph was generated by a shell pipeline using suitable non-spambayes tools (awk, sed, gnuplot, etc). My dow:invalid mistake came later. >> Buckets were calculated using local time. That way I didn't penalize >> Anthony Baxter and other folks who happen not to live in the US. Tim> I'm unsure what "were calculated using local time" means. Simply that I ignored timezone information. If the Date: header was Date: Mon, 28 Oct 2002 14:29:30 -0500 the send time was taken to be 14:29, local time. The -0500 was ignored. Tim> Does the checked in code do that or not? Yes, the checked in code just uses a regular expression which matches HH:MM:SS preceded and followed by a space. Nothing else in the Date: header is considered for this particular token. 
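A matcher of the shape Skip describes might look like the following sketch; the exact pattern and the token name are assumptions here, not copied from the checked-in tokenizer.

```python
import re

# Assumed shape of Skip's matcher: HH:MM:SS must be preceded and followed
# by a space; everything else in the Date: header is ignored.
TIME_RE = re.compile(r' (\d\d):(\d\d):(\d\d) ')

def time_token(date_header):
    """Return an hour-bucket token for a Date: header, or None."""
    m = TIME_RE.search(date_header)
    if m:
        return "time:" + m.group(1)   # e.g. bucket on the hour
    return None

print(time_token("Mon, 28 Oct 2002 14:29:30 -0500"))  # time:14
```

Note that, as in the thread above, the timezone offset is simply never looked at: the regexp only sees the local HH:MM:SS.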
Skip

From skip@pobox.com Thu Oct 31 01:59:13 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 30 Oct 2002 19:59:13 -0600
Subject: [Spambayes] X-Hammie-Disposition split suggestion
Message-ID: <15808.36465.890458.583477@montanaro.dyndns.org>

The X-Hammie-Disposition header contains multiple bits of information. I'm not sure what the *H* and *S* chunks are for (overall hammieness?), but I think it would be worthwhile to put the individual word probabilities in a separate header. That way, I could tell my mailer to display the much smaller X-Hammie-Disposition header and suppress display of the (for example) X-Hammie-Word-Probabilities header by default, e.g.:

X-Hammie-Disposition: Yes; 1.00; '*H*': 0.00; '*S*': 1.00
X-Hammie-Word-Probabilities:'rbl':0.07; 'script':0.07; 'to:2**1':0.09;
    'osirusoft':0.10; 'url:org':0.15; 'subject:; ':0.15; 'cgi':0.20;
    'sorry':0.22; 'mailing':0.23; 'list:':0.24; 'skip:" 10':0.27;
    'skip:r 20':0.28; 'subject:SPAM':0.30; 'called':0.31; 'body':0.33;
    'rcvd_in_dsbl':0.34; 'open':0.35; 'being':0.35; 'version':0.36;
    'from:':0.36; 'skip:u 10':0.37; ...

If something in the X-Hammie-Disposition header jumps out at you, you can display all the message's headers.

Make sense? If so, I'll be happy to modify hammie.py.

Skip

From tim.one@comcast.net Thu Oct 31 02:18:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 30 Oct 2002 21:18:08 -0500
Subject: [Spambayes] Database reduction
Message-ID: 

There's a semi-standard trick for database size reduction I haven't pursued and don't intend to pursue. Those keenly interested in reducing database size may wish to pursue it.

Currently, the classifier's wordinfo dict is indexed by strings S. There's no bound on how many unique strings may appear, and so also no bound on how large the database may grow. A cheesy but probably-effective trick is to pick an integer N for all time, and index a wordinfo structure by hash(S) % N instead of by S.
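The scheme can be sketched in a few lines. N matches the 100003 from Tim's patch later in the thread, but the table layout is illustrative, and crc32 stands in for hash(S) because Python's builtin hash() is no longer stable across runs.

```python
import zlib

# A sketch of the hash(S) % N trick: word counts live in a fixed-size
# table, so the database can never exceed N records and the strings
# themselves are never stored.
N = 100003  # fixed "for all time"; also the hard cap on record count

def bucket(word):
    # A stable hash is needed if the table is persisted; crc32 stands in
    # for the hash(S) of the original suggestion.
    return zlib.crc32(word.encode("utf-8")) % N

counts = [[0, 0] for _ in range(N)]   # [hamcount, spamcount] per bucket

def train(words, is_spam):
    for w in set(words):              # count each word once per message
        counts[bucket(w)][int(is_spam)] += 1

train(["viagra", "free", "viagra"], is_spam=True)
train(["python", "wordinfo"], is_spam=False)
# Distinct words can collide into one record -- the price of bounded size.
```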
Since strings are no longer stored:

+ Good: Space for storing strings isn't needed.
+ Bad: You can't get words out again (for, e.g., clue lists).

Since hash(S) is a many-to-one mapping:

+ Bad: Words get combined more-than-less randomly.

Since mod N is also a many-to-one mapping:

+ Bad: As above.
+ Good: N is a solid upper bound on the maximum number of wordinfo records, and so you can know exactly how big a classifier can get.
+ Good: You could drop the dict and use hash(S)%N to index a contiguous structure directly, like an mmap'ed file (http://crm114.sf.net/ uses that specific trick after multiple layers of hashing, into distinct files of one-byte clamped ham and spam counts).

Since database size would be bounded:

+ Good: There's less obvious need to prune the database over time (a main point of pruning is to reclaim space for words that aren't being used anymore -- or ever).
+ Bad: If the database is never pruned, it will adapt more slowly to changes in the nature of ham and spam.

I suppose the scariest thing is combining words "at random". It's possible, e.g., that Python would get mapped to the same record as Viagra. And the smaller N is, the more certain "bad stuff like that" *will* happen. We won't know until someone tries it and measures results; my intuition is that unless you get silly with N, it won't hurt much, as most words are approximately worthless anyway. Think about what happens when N=1 for the other side of this coin.

From tim.one@comcast.net Thu Oct 31 02:33:37 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 30 Oct 2002 21:33:37 -0500
Subject: [Spambayes] X-Hammie-Disposition split suggestion
In-Reply-To: <15808.36465.890458.583477@montanaro.dyndns.org>
Message-ID: 

[Skip Montanaro]
> The X-Hammie-Disposition header contains multiple bits of
> information. I'm not sure what the *H* and *S* chunks are for
> (overall hammieness?),

chi-combining computes two scores internally, one for ham-ness (H) and the other for spam-ness (S).
That's what *H* and *S* tell you. The final score is (S-H+1)/2. > but I think it would be worthwhile to put the individual word > probabilities in a separate header. Or drop them altogether. Geeks may find this stuff morbidly interesting, and spambayes developers need to see this stuff when a msg gets a surprising score, but I doubt anyone else has any earthly use for it. It's also a bit like giving away pieces of your private key in public-key cryptosystem: "well, Mister Spammer, you can't guess what's spam and ham to me without breaking into my database, but here are the 150 best & worst guesses you made, along with exactly how good they were". > That way, I could tell my mailer to display the much smaller > X-Hammie-Disposition header and suppress display of the (for > example) X-Hammie-Word-Probabilities header by default, e.g.: > > X-Hammie-Disposition: Yes; 1.00; '*H*': 0.00; '*S*': 1.00 I suggest dropping the *H* and *S* here too. In the Outlook client, we've also switched to feeding the end user int(round(score * 100.0)), i.e. an integer in 0 .. 100 inclusive. There's really no need to bother pretty users' heads with the mysteries of floating point . > X-Hammie-Word-Probabilities:'rbl':0.07; 'script':0.07; 'to:2**1':0.09; > 'osirusoft':0.10; 'url:org':0.15; 'subject:; ':0.15; 'cgi':0.20; > 'sorry':0.22; 'mailing':0.23; 'list:':0.24; 'skip:" 10':0.27; > 'skip:r 20':0.28; 'subject:SPAM':0.30; 'called':0.31; 'body':0.33; > 'rcvd_in_dsbl':0.34; 'open':0.35; 'being':0.35; 'version':0.36; > 'from:':0.36; 'skip:u 10':0.37; ... > > If something in the X-Hammie-Disposition header jumps out at you, you can > display all the message's headers. > > Make sense? If so, I'll be happy to modify hammie.py. I'm not a hammie user, but I know my sisters. That leaves me more neutral than I may sound, as one of my sisters doubtless has no idea "headers" exist. She pays to download them, though! 
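Tim's (S-H+1)/2 combination is easy to sanity-check against the *H*/*S* pair he quoted for the MP3 spam earlier in the thread:

```python
def final_score(S, H):
    # Tim's combination of the internal spam-ness S and ham-ness H,
    # both in [0, 1], into a single score in [0, 1]
    return (S - H + 1) / 2

# the *H*/*S* pair quoted for the foreign-language MP3 spam above
H, S = 0.000577077329346, 0.99999635361
print(round(final_score(S, H), 11))  # matches the 0.99970963814 reported
```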
From tim.one@comcast.net Thu Oct 31 03:47:35 2002 From: tim.one@comcast.net (Tim Peters) Date: Wed, 30 Oct 2002 22:47:35 -0500 Subject: [Spambayes] Database reduction In-Reply-To: Message-ID: [Tim] > ... > A cheesy but probably-effective trick is to pick an integer N for > all time, and index a wordinfo structure by hash(S) % N instead of by S. FYI, if you want to pursue this, here's a start (there's not much to it if you just want to see what happens): Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.45 diff -c -u -r1.45 classifier.py --- classifier.py 27 Oct 2002 17:11:00 -0000 1.45 +++ classifier.py 31 Oct 2002 03:33:40 -0000 @@ -40,6 +40,9 @@ PICKLE_VERSION = 1 +def HashSet(words): + return [n % 100003 for n in map(hash, Set(words))] + class WordInfo(object): __slots__ = ('atime', # when this record was last used by scoring(*) 'spamcount', # # of spams in which this word appears @@ -320,11 +323,11 @@ # adjustment following keeps them in a sane range, and one # that naturally grows the more evidence there is to back up # a probability. 
- hamcount = record.hamcount + hamcount = min(record.hamcount, nham) assert hamcount <= nham hamratio = hamcount / nham - spamcount = record.spamcount + spamcount = min(record.spamcount, nspam) assert spamcount <= nspam spamratio = spamcount / nspam @@ -397,7 +400,7 @@ wordinfo = self.wordinfo wordinfoget = wordinfo.get now = time.time() - for word in Set(wordstream): + for word in HashSet(wordstream): record = wordinfoget(word) if record is None: record = wordinfo[word] = WordInfo(now) @@ -419,7 +422,7 @@ self.nham -= 1 wordinfoget = self.wordinfo.get - for word in Set(wordstream): + for word in HashSet(wordstream): record = wordinfoget(word) if record is not None: if is_spam: @@ -440,7 +443,7 @@ wordinfoget = self.wordinfo.get now = time.time() - for word in Set(wordstream): + for word in HashSet(wordstream): record = wordinfoget(word) if record is None: prob = unknown Since N is 100003 there, no more than 100003 "words" can exist in the database. On my large c.l.py test, about 325,000 unique words exist, so at least 225,000 words get folded into other words. 
Accuracy does suffer: filename: cv tcap ham:spam: 20000:14000 20000:14000 fp total: 2 4 fp %: 0.01 0.02 fn total: 0 1 fn %: 0.00 0.01 unsure t: 97 179 unsure %: 0.29 0.53 real cost: $39.40 $76.80 best cost: $26.80 $58.20 h mean: 0.26 0.42 h sdev: 2.89 3.39 s mean: 99.94 99.70 s sdev: 1.44 3.20 mean diff: 99.68 99.28 k: 23.02 15.07 although the distros remain highly skewed: -> Ham scores for all runs: 20000 items; mean 0.42; sdev 3.39 -> min 0; median 2.4147e-006; max 100 -> percentiles: 5% 2.22045e-014; 25% 1.57802e-009; 75% 0.000878141; 95% 0.588783 -> Spam scores for all runs: 14000 items; mean 99.70; sdev 3.20 -> min 17.485; median 100; max 100 -> percentiles: 5% 99.9864; 25% 100; 75% 100; 95% 100 -> best cost for all runs: $58.20 -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20 -> achieved at 2 cutoff pairs -> smallest ham & spam cutoffs 0.435 & 0.93 -> fp 2; fn 7; unsure ham 27; unsure spam 129 -> fp rate 0.01%; fn rate 0.05%; unsure rate 0.459% -> largest ham & spam cutoffs 0.44 & 0.93 -> fp 2; fn 7; unsure ham 27; unsure spam 129 -> fp rate 0.01%; fn rate 0.05%; unsure rate 0.459% There's not much point digging into "what went wrong" in the new error cases, since the list of clues is worthless; e.g., here's the list for one of the new FP: Data/Ham/Set1/64316.txt prob = 0.838786702949 prob('*H*') = 0.130012 prob('*S*') = 0.807586 prob(744) = 0.0228281 prob(34690) = 0.0918367 prob(87505) = 0.0970545 prob(91999) = 0.304589 prob(29591) = 0.328993 prob(70192) = 0.371651 prob(46915) = 0.634625 prob(60034) = 0.646331 prob(49959) = 0.648468 prob(63366) = 0.686216 prob(66610) = 0.702733 prob(25331) = 0.731237 prob(81757) = 0.747858 prob(13278) = 0.751421 prob(89046) = 0.758242 prob(13498) = 0.773519 prob(5337) = 0.779329 prob(26879) = 0.805219 prob(50301) = 0.912593 prob(26426) = 0.918411 prob(35130) = 0.943716 I can say that all the new errors were difficult cases before this too, and often popped in out of my FP and FN sets over the weeks. Have fun . 
From tim.one@comcast.net Thu Oct 31 05:51:00 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 00:51:00 -0500 Subject: [Spambayes] Spam Clues: Share source code securely, inexpensively Message-ID: This is a multi-part message in MIME format. ---------------------- multipart/mixed attachment Great spam! Fooled my personal spambayes, and python.org. It did everything right. BTW, this email and its attachment was auto-generated by MarkH's Outlook client code, by hitting the "Show spam clues for current msg" button while looking at the spam. Since HTML is (I think) disabled on this list for no good reason, you won't get the intended effect. But it's still great spam . Spam Score: 0.000848613 '*H*' 0.999804 '*S*' 0.00150092 'url:python-list' 0.00653675 'header:X-Complaints-to:1' 0.0103695 'url:mailman' 0.0140407 'url:listinfo' 0.0171948 'url:python' 0.019608 'repository' 0.0196507 'message-id:@posting.google.com' 0.0215311 'subject:skip:i 10' 0.0302013 'algorithm' 0.0348837 'repository.' 0.0412844 'replaced' 0.0412844 'facility' 0.0412844 'encrypted' 0.0412844 '(b)' 0.0412844 'url:org' 0.0474286 'header:Errors-to:1' 0.059578 'get,' 0.0652174 'ssl' 0.0918367 'someone,' 0.0918367 'approach.' 0.0918367 'header:Organization:1' 0.0967191 'header:Return-path:1' 0.0986292 'header:Message-id:1' 0.102295 'url:mail' 0.127291 'header:Received:4' 0.147359 'feature' 0.147511 'communicate' 0.155172 'standard.' 0.155172 'preferable' 0.155172 'converted' 0.155172 'stored' 0.162027 'resources,' 0.164415 'code' 0.187502 'person.' 0.197597 'web' 0.213598 'downloaded' 0.221874 'remote' 0.242271 'server,' 0.251262 'site,' 0.260551 '(d)' 0.267484 'development' 0.2861 'which' 0.286755 'source' 0.293831 'when' 0.311512 'standard' 0.314052 'purposes' 0.325776 'single' 0.32782 'relationship' 0.34209 'using' 0.349124 'skip:l 10' 0.349551 'used' 0.353388 'browser' 0.358337 'software' 0.360879 'skip:u 10' 0.363596 'days.' 
0.378856 'documents' 0.379023 'with' 0.383038 'what' 0.38355 'set' 0.383969 'there' 0.390437 'say' 0.392813 'that' 0.395213 'note,' 0.399936 '"free"' 0.399936 'document,' 0.399936 'now' 0.603761 'even' 0.607558 'government' 0.612155 'based' 0.613413 'account' 0.618173 'pay' 0.634181 'secure.' 0.637489 'securely' 0.637489 'give' 0.644942 'skip:w 10' 0.650571 'system' 0.677844 'special' 0.679297 'is:' 0.684005 'site' 0.684034 'secure' 0.68629 'place.' 0.71577 'unlike' 0.71577 'highly' 0.720868 'sites,' 0.757669 'high' 0.759307 'offshore' 0.767183 'professional' 0.770397 'fax' 0.773719 'information' 0.781197 'cost' 0.783448 'offers' 0.833401 'required.' 0.836352 'received' 0.839223 'encrypted,' 0.844828 'subject:source' 0.844828 'format,' 0.866464 'cheap' 0.866464 'permits' 0.934783 'emailing' 0.969799 'dirt' 0.973373 'inexpensive' 0.9947 Message Stream: Return-path: Path: news.baymountain.com!uunet!ash.uu.net!prodigy.com!news.cc.ukans.edu!logbridg e.uoregon.edu!newsfeed.stanford.edu!postnews1.google.com!not-for-mail Received: from bright14. (bright14-qfe0.icomcast.net [172.20.4.103]) by msgstore01.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H4T00D57ZFLRO@msgstore01.icomcast.net> for tim.one@ims-ms-daemon (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 00:33:21 -0500 (EST) Received: from mtain03 (bright-LB.icomcast.net [172.20.3.155]) by bright14. 
(8.11.6/8.11.6) with ESMTP id g9V5XZq28319 for <@msgstore01.icomcast.net:tim.one@comcast.net>; Thu, 31 Oct 2002 00:33:35 -0500 (EST) Received: from mail.python.org (mail.python.org [12.155.117.29]) by mtain03.icomcast.net (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002)) with ESMTP id <0H4T00GR0ZFXIU@mtain03.icomcast.net> for tim.one@comcast.net (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 00:33:33 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org) by mail.python.org with esmtp (Exim 4.05) id 1877xW-00008g-00; Thu, 31 Oct 2002 00:33:34 -0500 X-Trace: posting.google.com 1036042065 27240 127.0.0.1 (31 Oct 2002 05:27:45 GMT) Date: Wed, 30 Oct 2002 21:27:45 -0800 From: post@ironcitadel.com (ICWeb) Subject: Share source code securely, inexpensively Sender: python-list-admin@python.org To: python-list@python.org Errors-to: python-list-admin@python.org Message-id: <99c3d303.0210302127.15dee024@posting.google.com> Organization: http://groups.google.com/ X-Complaints-to: groups-abuse@google.com Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 8bit NNTP-posting-date: 31 Oct 2002 05:27:45 GMT Precedence: bulk X-BeenThere: python-list@python.org Newsgroups: comp.lang.python Lines: 31 NNTP-posting-host: 66.166.68.234 X-Mailman-Version: 2.0.13 (101270) List-Post: List-Subscribe: , List-Unsubscribe: , List-Archive: List-Help: List-Id: General discussion list for the Python programming language Xref: news.baymountain.com comp.lang.python:187588 When doing source code development using offshore resources, or when team members are geographically distributed, this site offers a very inexpensive and secure approach. It can be used as an adjunct to a source code control system (manually) when a small team does not have access to a secure web based source code control repository. 
For professional and client relationship purposes this can be preferable to emailing code around or using a "free" document repository which is insecure. There is a good site, www.ironcitadel.com, which permits the secure storage and communication of documents. Unlike most "free" sites, this site is: (a) Securely encrypted - all document uploads/downloads are 128bit SSL encrypted; all documents stored are encrypted using the new Rinjdael algorithm which has replaced Triple-DES as the US government high security standard. (b) Has a fax feature - if you just have a hardcopy of a document, say a handwritten note, you can fax it in and the document is received by a fax server, converted to TIFF format, and stored encrypted. (c) All documents can be downloaded (also uploaded) just via a standard web browser - no special software required. (d) Cost is dirt cheap - $5/month and you don't even pay until the end of your first 30 days. Considering what you get, that is CHEAP! If you want to communicate securely with someone, just set up an account and give the login/password to a single other person. All communications are now highly secure. If you want to store secure information remotely, where all the information is at a remote facility that is encrypted, then www.ironcitadel.com is the place. -- http://mail.python.org/mailman/listinfo/python-list ---------------------- multipart/mixed attachment An embedded message was scrubbed... 
From: ICWeb Subject: Share source code securely, inexpensively Date: Thu, 31 Oct 2002 00:27:45 -0500 Size: 2999 Url: http://mail.python.org/pipermail/spambayes/attachments/20021031/2dc295c9/attachment.txt ---------------------- multipart/mixed attachment-- From rob@hooft.net Thu Oct 31 06:37:44 2002 From: rob@hooft.net (Rob Hooft) Date: Thu, 31 Oct 2002 07:37:44 +0100 Subject: [Spambayes] X-Hammie-Disposition split suggestion References: Message-ID: <3DC0CFB8.2010900@hooft.net> Tim Peters wrote: > >>That way, I could tell my mailer to display the much smaller >>X-Hammie-Disposition header and suppress display of the (for >>example) X-Hammie-Word-Probabilities header by default, e.g.: >> >> X-Hammie-Disposition: Yes; 1.00; '*H*': 0.00; '*S*': 1.00 > > > I suggest dropping the *H* and *S* here too. In the Outlook client, we've > also switched to feeding the end user int(round(score * 100.0)), i.e. an > integer in 0 .. 100 inclusive. There's really no need to bother pretty > users' heads with the mysteries of floating point . Hm. Sure, the *H* and *S* could be moved to the "debugging" header, which should be switched by an option (with default off). But I am actually "bothered" ;-) by having only two digits. For chuckles, I'd like to have an indication for the "0" and "100" scores how far they are away from the actual 0 and 100 (as a 10-log). Something like "0 (4)" could mean "0.0000XXX" and "100 (5)" could mean "0.99999XXX". Again, this would probably only be for hackers.... Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From tim.one@comcast.net Thu Oct 31 06:47:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 31 Oct 2002 01:47:26 -0500 Subject: [Spambayes] FW: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58 Message-ID: FYI, for those not on the checkin list. 
-----Original Message----- From: spambayes-checkins-bounces@python.org [mailto:spambayes-checkins-bounces@python.org] On Behalf Of Tim Peters Sent: Thursday, October 31, 2002 1:43 AM To: spambayes-checkins@python.org Subject: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58 Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30231 Modified Files: tokenizer.py Log Message: A new mini-phase of body tokenization scours HTML for common virus clues, variations of src=cid: and height=0 width=0 [Guido] > This gets us awfully close to SA's "precompiled list of clues to look > for" approach. :-( We're throwing away *all* HTML tags now, and missing a lot of info because of that. As I said about this one, virus/worm msgs of this nature often have no other content, period. The classifier can't score what it can't see. Feel free to design a principled approach to tokenizing HTML tags that still allows some HTML messages to avoid getting called spam. In the absence of that, I've got no qualms about adding special cases that help. For goodness' sake, it was a massive special-case hack to *strip* HTML tags to begin with -- think of this as a minor unhack of that. From bkc@murkworks.com Thu Oct 31 17:55:52 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 31 Oct 2002 12:55:52 -0500 Subject: [Spambayes] FW: [Spambayes-checkins] spambayes tokenizer.py,1.57,1.58 In-Reply-To: <200210311622.g9VGMAC07720@odiug.zope.com> References: Your message of "Thu, 31 Oct 2002 01:47:26 EST." Message-ID: <3DC127A8.14297.28B44F20@localhost> On 31 Oct 2002 at 11:22, Guido van Rossum wrote: > > A new mini-phase of body tokenization scours HTML for common virus clues, > > variations of src=cid: and height=0 width=0
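The kind of mini-phase the checkin log describes, emitting synthetic tokens when virus-style HTML clues such as src=cid: references and zero-sized frames appear in the body, might look roughly like this. The patterns and token names below are illustrative guesses, not the actual tokenizer.py code:

```python
import re

# Illustrative patterns for the virus clues the checkin mentions:
# embedded-content references and zero-sized frames/images.
# These are NOT the real regexps from spambayes' tokenizer.py.
SRC_CID = re.compile(r'src=["\']?cid:', re.IGNORECASE)
ZERO_HEIGHT = re.compile(r'height\s*=\s*["\']?0\b', re.IGNORECASE)
ZERO_WIDTH = re.compile(r'width\s*=\s*["\']?0\b', re.IGNORECASE)

def virus_clue_tokens(body):
    """Emit synthetic tokens for suspicious HTML constructs, so the
    classifier can still score them even though HTML tags are
    otherwise stripped before tokenization."""
    tokens = []
    if SRC_CID.search(body):
        tokens.append('virus clue:src=cid')
    if ZERO_HEIGHT.search(body) and ZERO_WIDTH.search(body):
        tokens.append('virus clue:zero size')
    return tokens
```

The point Tim makes above is exactly this: since such worm messages often have no other content, scanning for these constructs before tag-stripping gives the classifier something to see.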