From tim.one@comcast.net Thu Sep 5 18:57:17 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 05 Sep 2002 13:57:17 -0400 Subject: [Spambayes] RE: [Python-Dev] Getting started with GBayes testing In-Reply-To: <3D772EC2.30217.184B6C78@localhost> Message-ID: [Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ] [Brad Clements] > ... > My feeling is that the presentation of "the message" is independent of the > message itself, so if I get a message in Text, HTML, RTF only the actual > content is important, not the markup method. Everything's A Clue. Everything that gets ignored partly blinds the classifier, so the question isn't whether there's a difference, it's how much of a difference it makes. > Though I suppose using lots of red and large fonts might be an > indicator of spam, the text of the message should still suffice. Indeed, Graham reported that the hex color code for bright red was one of the strongest spam indicators in his database. > Tim's comments in timtest.py hint that stripping tags isn't a > catastrophe for f-n's, but he's not planning on doing that for use on > technical lists. When HTML-only email is a 99.99% spam indicator on a tech list, it would be crazy to ignore that clue. But note that the comments *also* say I'd be delighted to remove HTML tags even there if some other way of slashing the f-n rate is proven to work (and most people who have tried it say that mining more header lines does do it -- but then I haven't seen anything from them about how they do when they ignore the header lines. I was happy to ignore header lines in order to get *some* kind of handle on how well one could do on "pure content", and it turned out that works remarkably well). >> # So if a message is multipart/alternative with both text/plain >> # and text/html branches, we ignore the latter, else newbies would never >> # get a message through.
If a message is just HTML, it has virtually no >> # chance of getting through > Tells me (spammer hat on) that I can send a message with a > non-spammish text only part, and a spam html part since most > "non-techie" email client users automatically display the html > version when available, however Tim's implementation will ignore it. Sure. It *certainly* isn't a problem on my test data (as witnessed by the measured error rates). If the nature of the world changes, the code has to adapt along with it. But 90% of the spam I receive (and I get a lot) is still trivial to recognize from a mere glance at the subject line, and I don't buy that spammers are a class of ubergeek with formidable skill. Response rates are a percentage game, and more so than anti-spammers I expect spammers are keen to go for high-percentage wins at the expense of esoterica. > Most "average users" never even see the text-only part of > multipart messages. In Tim's application, that's okay since he's going > to use the text-only part anyway. But for my purposes, I need to consider > both portions. So it's simpler for me to strip html and combine that text > with the text-only part and then "test" the combined parts. Not unreasonable, but testing remains the only way to decide. It's rare you can out-think a fraction of a percent! From tim.one@comcast.net Thu Sep 5 19:30:03 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 05 Sep 2002 14:30:03 -0400 Subject: [Spambayes] RE: [Python-Dev] GBayes design In-Reply-To: <002b01c254f2$f6c7c020$71b53bd0@othello> Message-ID: [Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ] [Raymond Hettinger] > Is it too late to challenge a core design decision? Never too late, but somebody has to do real work to prove that a change is justified. Plausible ideas are cheaper than dirt, alas. > Instead of multiplying probabilities, use fuzzy logic methods. > Classify the indicators into damning, strong, weak, neutral, ...
Think about how that differs from 0.99, 0.80, 0.20 and 0.50. Does it? > After counting the number of indicators in each class, make > a spam/ham decision that can be easily tweaked. This would > make it easy to implement variations of Tim's recent clear > win, where additional indicators are gathered until the > balance shifts sharply to one side. > > Some other advantages are: > -- easily interpreted score vectors (6 damning, 7 strong, 4 weak, ... ) I've seen people see the current prob("TV") = 0.99 style cold and pick it up at once. With character n-grams I think it's frustrating, but word-like tokenization gives easily recognized clues. > -- avoids mathematical issues with indicators not being independent How do you know this? > -- allows the addition of non-token based indicators. for instance, > a preponderance of caps would be a weak indicator. the presence > of caps separated by spaces would be a strong indicator. As far as the current classifier is concerned, "a token" is any Python object usable as a dict key. There are already several ways in which the current tokenization scheme in timtest.py uses strings to *represent* non-textual indicators. For example, if the headers lack an Organization line, a 'bool:noorg' "token" is generated. For large blobs of text that get skipped, a token is generated that records both the first character in that blob and the number of bytes skipped (chopped to the nearest multiple of 10). 
And so on -- you can inject anything you like into the scheme, including stuff like "number of caps separated by spaces: more than 10" (BTW, I happen to know that this particular "clue" acts to block relevant conference announcements, not just spam). I got some interesting results by injecting a crude characters/word statistic:

    yield "cpw:%.1g" % (float(len(text)) / len(text.split()))

There are certain values of that statistic that turned out to be killer-strong spam indicators, but there's a potential problem I've mentioned before: if you have an unbounded number of free parameters you can fiddle, you can train a system to fit any given dataset exactly. That's in part why replication of results by others is necessary to make schemes like this superb (I can only make one merely excellent on my own). > -- the decision logic would be more intuitive > -- avoids the issue of having equal amounts of spam and ham in > the sample It's not clear that this matters; some results of preliminary experiments are written up in the code comments. The way Graham computes P(Spam | Word) is via ratios, *as if* there were an equal number of each; and that's consistent with the other bogus equality assumption in the scorer. I haven't yet changed all these guys at the same time to take P(Spam) and P(Ham) into account. BTW, note that all the results I've reported had a ham/spam training ratio of 4000/2750. I left that non-unity on purpose. > The core concept would stay the same -- it's really just a shift from > continuous to discrete. Let us know how it turns out. From nas@python.ca Thu Sep 5 20:04:21 2002 From: nas@python.ca (Neil Schemenauer) Date: Thu, 5 Sep 2002 12:04:21 -0700 Subject: [Spambayes] all but one testing Message-ID: <20020905190420.GB19726@glacier.arctrix.com> I've written a driver script that does "all but one testing".
The basic algorithm is:

    gb = GrahamBayes()
    for msg in spam:
        gb.learn(msg, is_spam=True)
    for msg in ham:
        gb.learn(msg, is_spam=False)
    for msg in spam:
        gb.unlearn(msg, is_spam=True)
        gb.spamprob(msg)
        gb.learn(msg, is_spam=True)
    for msg in ham:
        gb.unlearn(msg, is_spam=False)
        gb.spamprob(msg)
        gb.learn(msg, is_spam=False)
    print summary

Is this type of testing useful? As I understand it, it's most useful when you have a small amount of testing and training data. That doesn't seem to be a problem for us. Also, it's really slow. Neil From neale@woozle.org Thu Sep 5 20:47:26 2002 From: neale@woozle.org (Neale Pickett) Date: 05 Sep 2002 12:47:26 -0700 Subject: [Spambayes] spamcan release, finally Message-ID: Although at this point its value is limited (if existent at all), yesterday I finally got the green light to release my spamcan package. http://woozle.org/~neale/src/spamcan/spamcan.html Who knows, maybe there's something useful in there. For starters, it uses an anydbm file back-end. I don't know that a dbm file is the best solution, but the way the Debian distribution is set up currently, you can either have ZODB (Python2.1) or you can have generators (Python2.2). I suspect that the dbm method will result in smaller files, but that's just a hunch. The more I look at classifier.py, the more it looks like something well-suited for a cdb file (heresy, I know). But I haven't seen any data showing that cdb is better in any way than Berkeley DB hashes--all I've managed to find so far is dogma. Anyway, if it wasn't obvious before, I'm currently trying to make stuff run quickly, with as little disk usage as possible.
Aloha :) Neale From bkc@murkworks.com Thu Sep 5 20:56:19 2002 From: bkc@murkworks.com (Brad Clements) Date: Thu, 05 Sep 2002 15:56:19 -0400 Subject: [Spambayes] RE: [Python-Dev] Getting started with GBayes testing In-Reply-To: References: <3D772EC2.30217.184B6C78@localhost> Message-ID: <3D777F05.5725.1984F990@localhost> On 5 Sep 2002 at 13:57, Tim Peters wrote: > Not unreasonable , but testing remains the only way to decide. It's > rare you can out-think a fraction of a percent! Hence my interest in getting a test rig setup so I can generate my own percentages to crow about ;-) .. or not :-( Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From neale@woozle.org Thu Sep 5 20:49:44 2002 From: neale@woozle.org (Neale Pickett) Date: 05 Sep 2002 12:49:44 -0700 Subject: [Spambayes] spamcan release, finally In-Reply-To: References: Message-ID: So then, Neale Pickett is all like: > Although at this point its value is limited (if existant at all), > yesterday I finally got the green light to release my spamcan package. Oh, I should have mentioned, I have zero interest in maintaining spamcan now that spambayes exists. I present it mainly for code-lifting purposes :) Neale From skip@pobox.com Thu Sep 5 21:53:23 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 5 Sep 2002 15:53:23 -0500 Subject: [Spambayes] test sets? Message-ID: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> Tim, Any thought to wrapping up your spam and ham test sets for inclusion w/ the spambayes project? Skip From gward@python.net Thu Sep 5 22:03:12 2002 From: gward@python.net (Greg Ward) Date: Thu, 5 Sep 2002 17:03:12 -0400 Subject: [Spambayes] test sets? 
In-Reply-To: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> Message-ID: <20020905210312.GA16171@cthulhu.gerg.ca> On 05 September 2002, Skip Montanaro said: > Any thought to wrapping up your spam and ham test sets for inclusion w/ the > spambayes project? Might be more useful to have other people working other test sets. Variety is the spice of life, and you don't want an algorithm completely biased towards one particular dataset. Greg -- Greg Ward http://www.gerg.ca/ Dyslexics of the world, untie! From skip@pobox.com Thu Sep 5 22:06:55 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 5 Sep 2002 16:06:55 -0500 Subject: [Spambayes] test sets? In-Reply-To: <20020905210312.GA16171@cthulhu.gerg.ca> References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> <20020905210312.GA16171@cthulhu.gerg.ca> Message-ID: <15735.51055.618575.247684@12-248-11-90.client.attbi.com> >> Any thought to wrapping up your spam and ham test sets for inclusion >> w/ the spambayes project? Greg> Might be more useful to have other people working other test sets. Greg> Variety is the spice of life, and you don't want an algorithm Greg> completely biased towards one particular dataset. Agreed, but for the purposes of comparing new stuff with the baseline it seems to make sense to have a standard test set. Sort of a spam-stone. ;-) Skip From paul-bayes@svensson.org Thu Sep 5 22:18:13 2002 From: paul-bayes@svensson.org (Paul Svensson) Date: Thu, 5 Sep 2002 17:18:13 -0400 (EDT) Subject: [Spambayes] GBayes spam filtering In-Reply-To: Message-ID: I've been following the discussion on spam filtering on the python-dev list with great interest.
It looks very promising so far, but there's one issue I would like to explore further: we don't all have a pre-filtered corpus and a Tim or Brad to hand it to, to turn into a well tuned filter, and even if we did, how often would we need to bring them back to re-tune the filter as the flavor of spam changes over time? Thus my interest in the operational side of corpus-collecting. This is more an issue for person-to-person email than for large mailing lists, as the latter are more likely to actually have a Tim or Brad available. I don't think it's realistic to expect users to mark everything they read as ham or spam. For a single-user setup, I would consider a mail reader command "delete as spam"; everything that's read and not thusly marked would go in the ham list. However, for a multi-user system, I think something a little more sophisticated would be necessary. Here's my idea: The message corpus database needs to contain, for each message:

    the message-id
    a timestamp (for removal of old stuff)
    the word count histogram
    a spam/ham flag

On SMTP receipt of a message, it's scanned, and if it smells like spam, it's bounced. It's NOT automatically added to the corpus. If the message does not smell like spam, it's delivered, and added to the corpus as ham. When a user reads a message and finds that it's spam that got through the filter, they need a way to send the message-id to the corpus, to flag it as spam. At this point, it would be a good idea to compare the histogram of the new spam to each histogram in the ham corpus, and remove any that are similar (any good ideas how to do the comparison?), or maybe if they are VERY similar simply flag them as spam. After recomputing the filter from the modified corpus, we could also re-filter the ham corpus, and remove more newfound spam that way.
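To make the idea concrete, here is a hypothetical sketch (not code from anyone on the list) of the corpus record described above, with cosine similarity over the word-count histograms as one possible answer to the comparison question; the names and the 0.9 threshold are invented for illustration:

```python
import math
from dataclasses import dataclass

@dataclass
class CorpusRecord:
    message_id: str
    timestamp: float      # for removal of old stuff
    histogram: dict       # word -> count
    is_spam: bool

def cosine_similarity(h1, h2):
    """One candidate comparison: cosine of the angle between the two
    word-count vectors (1.0 means the same direction, 0.0 no overlap)."""
    dot = sum(h1[w] * h2[w] for w in set(h1) & set(h2))
    n1 = math.sqrt(sum(c * c for c in h1.values()))
    n2 = math.sqrt(sum(c * c for c in h2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)

def flag_as_spam(corpus, message_id, threshold=0.9):
    """A user reported message_id as spam: flag it, and also flag any
    "ham" whose histogram is very similar (threshold is a guess)."""
    reported = corpus[message_id]
    reported.is_spam = True
    for rec in corpus.values():
        if not rec.is_spam and \
           cosine_similarity(rec.histogram, reported.histogram) >= threshold:
            rec.is_spam = True
```

Whether cosine similarity is the right comparison is exactly the open question above; it is only the simplest vector-space candidate.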
Characteristically of this system, the spam corpus will be reasonably clean (assuming the users don't abuse it too much), but the ham corpus will be quite dirty, containing spam that's not yet read, and spam that the recipient didn't bother to mark. I'm curious how GBayes would handle this situation; I assume the false negative rate would go up, but how much? /Paul From tim.one@comcast.net Thu Sep 5 23:20:33 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 05 Sep 2002 18:20:33 -0400 Subject: [Spambayes] all but one testing In-Reply-To: <20020905190420.GB19726@glacier.arctrix.com> Message-ID: [Neil Schemenauer] > I've written a driver script that does "all but one testing". The basic > algorithm is:
>
>     gb = GrahamBayes()
>     for msg in spam:
>         gb.learn(msg, is_spam=True)
>     for msg in ham:
>         gb.learn(msg, is_spam=False)
>     for msg in spam:
>         gb.unlearn(msg, is_spam=True)
>         gb.spamprob(msg)
>         gb.learn(msg, is_spam=True)
>     for msg in ham:
>         gb.unlearn(msg, is_spam=False)
>         gb.spamprob(msg)
>         gb.learn(msg, is_spam=False)
>     print summary
>
> Is this type of testing useful?

It's sure better than nothing. Also better than nothing, but not as good, is doing the same thing but skipping the learn/unlearn calls after initial training. > As I understand it, it's most useful when you have a small amount of testing > and training data. I've run no experiments on training set size yet, and won't hazard a guess as to how much is enough. I'm nearly certain that the 4000h+2750s I've been using is way more than enough, though. It's a question of practical importance open for fresh triumphs. > That doesn't seem to be a problem for us. Also, it's really slow. Each call to learn() and to unlearn() computes a new probability for every word in the database. There's an official way to avoid that in the first two loops, e.g.
    for msg in spam:
        gb.learn(msg, True, False)
    gb.update_probabilities()

In each of the last two loops, the total # of ham and total # of spam in the "learned" set is invariant across loop trips, and you *could* break into the abstraction to exploit that: the only probabilities that actually change across those loop trips are those associated with the words in msg. Then the runtime for each trip would be proportional to the # of words in the msg rather than the number of words in the database. Another area for potentially fruitful study: it's clear that the highest-value indicators usually appear "early" in msgs, and for spam there's an actual reason for that: advertising has to strive to get your attention early. So, for example, if we only bothered to tokenize the first 90% of a msg, would results get worse? I doubt it. And if not, what about the first 50%? The first 10%? The first 1000 bytes? max(1000 bytes, first 10%)? That could also yield a major speed boost, and *may* even improve results -- e.g., sometimes an on-topic message starts well but then rambles. From whisper@oz.net Thu Sep 5 23:42:38 2002 From: whisper@oz.net (David LeBlanc) Date: Thu, 5 Sep 2002 15:42:38 -0700 Subject: [Spambayes] All but one testing Message-ID: Errr... not to be pedantic or anything, but this is called "omit one testing" or OOT in the literature IIRC. Helpful in case you're searching for additional information, say at http://citeseer.nj.nec.com/ for instance. David LeBlanc Seattle, WA USA From nas@python.ca Thu Sep 5 23:49:23 2002 From: nas@python.ca (Neil Schemenauer) Date: Thu, 5 Sep 2002 15:49:23 -0700 Subject: [Spambayes] all but one testing In-Reply-To: References: <20020905190420.GB19726@glacier.arctrix.com> Message-ID: <20020905224923.GA20480@glacier.arctrix.com> Tim Peters wrote: > I've run no experiments on training set size yet, and won't hazard a guess > as to how much is enough. I'm nearly certain that the 4000h+2750s I've been > using is way more than enough, though.
Okay, I believe you. > Each call to learn() and to unlearn() computes a new probability for every > word in the database. There's an official way to avoid that in the first > two loops, e.g.
>
>     for msg in spam:
>         gb.learn(msg, True, False)
>     gb.update_probabilities()

I did that. It's still really slow when you have thousands of messages. > In each of the last two loops, the total # of ham and total # of spam in the > "learned" set is invariant across loop trips, and you *could* break into the > abstraction to exploit that: the only probabilities that actually change > across those loop trips are those associated with the words in msg. Then > the runtime for each trip would be proportional to the # of words in the msg > rather than the number of words in the database. I hadn't tried that. I figured it was better to find out if "all but one" testing had any appreciable value. It looks like it doesn't, so I'll forget about it. > Another area for potentially fruitful study: it's clear that the > highest-value indicators usually appear "early" in msgs, and for spam > there's an actual reason for that: advertising has to strive to get your > attention early. So, for example, if we only bothered to tokenize the first > 90% of a msg, would results get worse? Spammers could exploit this by including a large MIME part at the beginning of the message. In practice that would probably work fine. > sometimes an on-topic message starts well but then rambles. Never. I remember the time when I was ten years old and went down to the fishing hole with my buddies. This guy named Gordon had a really huge head. Wait, maybe that was Joe. Well, no matter. As I recall, it was a hot day and everyone was tired...Human Growth Hormone...girl with huge breasts...blah blah blah......
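[The cost difference being discussed can be shown with a toy stand-in for the real GrahamBayes class -- this is NOT classifier.py, and the probability formula here is deliberately simplified; only the batching pattern (learn with update=False, then one update_probabilities() pass) mirrors the real API quoted above:]

```python
# Toy classifier illustrating the batching pattern: learn() with
# update=False only bumps counters; update_probabilities() does the
# expensive whole-database recompute exactly once.
class ToyBayes:
    def __init__(self):
        self.spamcount = {}   # word -> # of spam msgs containing it
        self.hamcount = {}
        self.nspam = 0
        self.nham = 0
        self.spamprob = {}    # word -> P(spam | word), crudely estimated

    def learn(self, words, is_spam, update=True):
        counts = self.spamcount if is_spam else self.hamcount
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        if update:
            self.update_probabilities()   # touches every word in the db

    def update_probabilities(self):
        for w in set(self.spamcount) | set(self.hamcount):
            s = self.spamcount.get(w, 0) / max(self.nspam, 1)
            h = self.hamcount.get(w, 0) / max(self.nham, 1)
            self.spamprob[w] = s / (s + h) if (s + h) else 0.5

gb = ToyBayes()
for msg in [["cheap", "pills"], ["free", "pills"]]:
    gb.learn(msg, True, False)     # defer the expensive recompute
for msg in [["lunch", "meeting"]]:
    gb.learn(msg, False, False)
gb.update_probabilities()          # one pass over the database
```

With N messages and W words in the database, per-message updating costs O(N*W); deferring it costs one O(W) pass, which is the speedup Tim describes.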
From nas@python.ca Thu Sep 5 23:56:01 2002 From: nas@python.ca (Neil Schemenauer) Date: Thu, 5 Sep 2002 15:56:01 -0700 Subject: [Spambayes] All but one testing In-Reply-To: References: Message-ID: <20020905225601.GA20578@glacier.arctrix.com> David LeBlanc wrote: > Errr... not to be pedantic or anything, but this is called "omit one > testing" or OOT in the literature IIRC. I have no idea. I made up the name. Thanks for the correction. Neil From tim.one@comcast.net Fri Sep 6 03:14:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 05 Sep 2002 22:14:07 -0400 Subject: [Spambayes] test sets? In-Reply-To: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> Message-ID: [Skip Montanaro] > Any thought to wrapping up your spam and ham test sets for > inclusion w/ the spambayes project? I gave it all the thought it deserved. It would be wonderful to get several people cranking on the same test data, and I'm all in favor of that. OTOH, my Data/ subtree currently has more than 35,000 files slobbering over 134 million bytes -- even if I had a place to put that much stuff, I'm not sure my ISP would let me email it in one msg. Apart from that, there was a mistake very early on whose outcome was that this isn't the data I hoped I was using. I *hoped* I was using a snapshot of only recent msgs (to match the spam snapshot, which contains only spam from 2002), but it turns out they actually go back to the last millennium. Greg Ward is currently capturing a stream coming into python.org, and I hope we can get a more modern, and cleaner, test set out of that. But if that stream contains any private email, it may not be ethically possible to make that available. Can you think of anyplace to get a large, shareable ham sample apart from a public mailing list? Everyone's eager to share their spam, but spam is so much alike in so many ways that it's the easy half of the data collection problem.
From skip@pobox.com Fri Sep 6 03:41:13 2002 From: skip@pobox.com (Skip Montanaro) Date: Thu, 5 Sep 2002 21:41:13 -0500 Subject: [Spambayes] test sets? In-Reply-To: References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> Message-ID: <15736.5577.157228.229200@12-248-11-90.client.attbi.com> Tim> I gave it all the thought it deserved . It would be Tim> wonderful to get several people cranking on the same test data, and Tim> I'm all in favor of that. OTOH, my Data/ subtree currently has Tim> more than 35,000 files slobbering over 134 million bytes -- even if Tim> I had a place to put that much stuff, I'm not sure my ISP would let Tim> me email it in one msg . Do you have a dialup or something more modern ? 134MB of messages zipped would probably compress pretty well - under 50MB I'd guess with all the similarity in the headers and such. You could zip each of the 10 sets individually and upload them somewhere. Tim> Can you think of anyplace to get a large, shareable ham sample Tim> apart from a public mailing list? Everyone's eager to share their Tim> spam, but spam is so much alike in so many ways that's the easy Tim> half of the data collection problem. How about random sampling lots of public mailing lists via gmane or something similar, manually cleaning it (distributing that load over a number of people) and then relying on your clever code and your rebalancing script to help further cleanse it? The "problem" with the ham is it tends to be much more tied to one person (not just intimate, but unique) than the spam. I save all incoming email for ten days (gzipped mbox format) before it rolls over and disappears. At any one time I think I have about 8,000-10,000 messages. Most of it isn't terribly personal (which I would cull before passing along anyway) and much of it is machine-generated, so would be of marginal use. Finally, it's all ham-n-spam mixed together. Do we call that an omelette or a Denny's Grand Slam? 
Skip From tim.one@comcast.net Fri Sep 6 07:09:11 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 02:09:11 -0400 Subject: [Spambayes] all but one testing In-Reply-To: <20020905224923.GA20480@glacier.arctrix.com> Message-ID: [Tim] > Another area for potentially fruitful study: it's clear that the > highest-value indicators usually appear "early" in msgs, and for spam > there's an actual reason for that: advertising has to strive > to get your attention early. So, for example, if we only bothered to > tokenize the first 90% of a msg, would results get worse? [Neil Schemenauer] > Spammers could exploit this including a large MIME part at the beginning > of the message. In pratice that would probably work fine. Note that timtest.py's current tokenizer only looks at decoded text/* MIME sections (or raw message text if no MIME exists); spammers could put megabytes of other crap before that and it wouldn't even be looked at (except that the email package has to parse non-text/* parts well enough to skip over them, and tokens for the most interesting parts of Content-{Type, Disposition, Transfer-Encoding} decorations are generated for all MIME sections). Schemes that remain ignorant of MIME are vulnerable to spammers putting arbitrary amounts of "nice text" in the preamble area (after the headers and before the first MIME section), which most mail readers don't display, but which appear first in the file so are latched on to by Graham's scoring scheme. But I don't worry about clever spammers -- I've seen no evidence that they exist <0.5 wink>. Even if they do, the Open Source zoo is such that no particular scheme will gain dominance, and there's no percentage for spammers in trying to fool just one scheme. Even if they did, for the kind of scheme we're using here they can't *know* what "nice text" is, not unless they pay a lot of attention to the spam targets and highly tailor their messages to each different one. 
At that point they'd be doing targeted marketing, and the cost of the game to them would increase enormously. if-you're-out-to-make-a-quick-buck-you-don't-waste-a-second-on-hard- targets-ly y'rs - tim From anthony@interlink.com.au Fri Sep 6 08:59:38 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 06 Sep 2002 17:59:38 +1000 Subject: [Spambayes] test sets? Message-ID: <200209060759.g867xcV03853@localhost.localdomain> I've got a test set here that's the last 3 and a bit years email to info@ekit.com and info@ekno.com - it's a really ugly set of 20,000+ messages, currently broken into 7,000 spam, 9,000 ham, 9,000 currently unclassified. These addresses are all over the 70-some different ekit/ekno/ISIConnect websites, so they get a LOT of spam. As well as the usual spam, it also has customers complaining about credit card charges, it has people interested in the service and asking questions about long distance rates, &c &c &c. Lots and lots of "commercial" speech, in other words. Stuff that SA gets pretty badly wrong. I'm currently mangling it by feeding all parts (text, html, whatever else :) into the filters, as well as both a selected number of headers (to, from, content-type, x-mailer), and also a list of (header,count_of_header). This is showing up some nice stuff - e.g. the X-uidl that stoopid spammers blindly copy into their messages. I did have Received in there, but it's out for the moment, as it causes rates to drop. I'm also stripping out HTML tags, except for href="" and src="" - there's so so much goodness in them (note that I'm only keeping the contents of the attributes). -- Anthony Baxter It's never too late to have a happy childhood. 
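[A sketch of what the selected-headers and (header, count_of_header) tokens Anthony describes can look like, using the stdlib email package. The token spellings below are invented for illustration, not Anthony's actual code:]

```python
import email

# Headers whose values get tokenized directly; the list mirrors the
# ones mentioned above (to, from, content-type, x-mailer).
SELECTED = ("to", "from", "content-type", "x-mailer")

def header_tokens(raw_message):
    """Yield tokens for selected header values, plus one
    (header, count_of_header) token per distinct header name."""
    msg = email.message_from_string(raw_message)
    counts = {}
    for name, value in msg.items():
        lower = name.lower()
        counts[lower] = counts.get(lower, 0) + 1
        if lower in SELECTED:
            for word in value.split():
                yield "%s:%s" % (lower, word)
    for name, n in sorted(counts.items()):
        yield "header-count:%s:%d" % (name, n)
```

A message bearing a copied X-uidl, or a doubled Received chain, then shows up as a distinctive header-count token the classifier can learn from.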
From anthony@interlink.com.au Fri Sep 6 09:06:57 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 06 Sep 2002 18:06:57 +1000 Subject: [Spambayes] Re: [Python-Dev] Getting started with GBayes testing In-Reply-To: Message-ID: <200209060806.g8686ve03964@localhost.localdomain> >>> Tim Peters wrote > > I've actually got a bunch of spam like that. The text/plain is something > > like > > > > **This is a HTML message** > > > > and nothing else. > > Are you sure that's in a text/plain MIME section? I've seen that many times > myself, but it's always been in the prologue (*between* MIME sections -- so > it's something a non-MIME aware reader will show you). *nod* I know - on my todo is to feed the prologue into the system as well. A snippet, hopefully not enough to trigger the spam-filters. To: into89j@gin.elax.ekorp.com X-Mailer: Microsoft Outlook Express 4.72.1712.3 X-MimeOLE: Produced By Microsoft MimeOLE V??D.1712.3 Mime-Version: 1.0 Date: Sun, 28 Jan 2001 23:54:39 -0500 Content-Type: multipart/mixed; boundary="----=_NextPart_000_007F_01BDF6C7.FABAC1 B0" Content-Transfer-Encoding: 7bit This is a MIME Message ------=_NextPart_000_007F_01BDF6C7.FABAC1B0 Content-Type: multipart/alternative; boundary="----=_NextPart_001_0080_01BDF6C7. FABAC1B0" ------=_NextPart_001_0080_01BDF6C7.FABAC1B0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable ***** This is an HTML Message ! ***** ------=_NextPart_001_0080_01BDF6C7.FABAC1B0 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable From anthony@interlink.com.au Fri Sep 6 09:11:50 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 06 Sep 2002 18:11:50 +1000 Subject: [Spambayes] test sets? 
In-Reply-To: <200209060759.g867xcV03853@localhost.localdomain> Message-ID: <200209060811.g868Bo904031@localhost.localdomain> >>> Anthony Baxter wrote > I'm currently mangling it by feeding all parts (text, html, whatever > else :) into the filters, as well as both a selected number of headers > (to, from, content-type, x-mailer), and also a list of > (header,count_of_header). This is showing up some nice stuff - e.g. the > X-uidl that stoopid spammers blindly copy into their messages. The other thing on my todo list (probably tonight's tram ride home) is to add all headers from non-text parts of multipart messages. If nothing else, it'll pick up most virus email real quick. -- Anthony Baxter It's never too late to have a happy childhood. From gward@python.net Fri Sep 6 14:44:17 2002 From: gward@python.net (Greg Ward) Date: Fri, 6 Sep 2002 09:44:17 -0400 Subject: [Spambayes] test sets? In-Reply-To: References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> Message-ID: <20020906134417.GA16820@cthulhu.gerg.ca> On 05 September 2002, Tim Peters said: > Greg Ward is > currently capturing a stream coming into python.org, and I hope we can get a > more modern, and cleaner, test set out of that. Not yet -- still working on the required config changes. But I have a cunning plan... > But if that stream contains > any private email, it may not be ethically possible to make that available. It will! Part of my cunning plan involves something like this:

    if folder == "accepted":  # ie. not suspected junk mail
        if (len(recipients) == 1 and
                recipients[0] in ("guido@python.org", "barry@python.org", ...)):
            folder = "personal"

If you (and Guido, Barry, et al.) prefer, I could change that last statement to "folder = None", so the mail won't be saved at all. I *might* also add a "and sender doesn't look like -bounce-*, -request, -admin, ..." clause to that if statement. > Can you think of anyplace to get a large, shareable ham sample apart from a
Everyone's eager to share their spam, but spam is so > much alike in so many ways that's the easy half of the data collection > problem. I believe the SpamAssassin maintainers have a scheme whereby the corpus of non-spam is distributed, ie. several people have bodies of non-spam that they use for collectively evolving the SA score set. If that sounds vague, it matches my level of understanding. Greg -- Greg Ward http://www.gerg.ca/ Reality is for people who can't handle science fiction. From guido@python.org Fri Sep 6 14:54:14 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 09:54:14 -0400 Subject: [Spambayes] test sets? In-Reply-To: Your message of "Fri, 06 Sep 2002 09:44:17 EDT." <20020906134417.GA16820@cthulhu.gerg.ca> References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> <20020906134417.GA16820@cthulhu.gerg.ca> Message-ID: <200209061354.g86DsEE14105@pcp02138704pcs.reston01.va.comcast.net> > I believe the SpamAssassin maintainers have a scheme whereby the corpus > of non-spam is distributed, ie. several people have bodies of non-spam > that they use for collectively evolving the SA score set. If that > sounds vague, it matches my level of understanding. See if you can get a hold of that so we can do a level-playing-field competition. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From gward@python.net Fri Sep 6 14:57:18 2002 From: gward@python.net (Greg Ward) Date: Fri, 6 Sep 2002 09:57:18 -0400 Subject: [Spambayes] Re: [Python-Dev] Getting started with GBayes testing In-Reply-To: <200209060806.g8686ve03964@localhost.localdomain> References: <200209060806.g8686ve03964@localhost.localdomain> Message-ID: <20020906135718.GC16820@cthulhu.gerg.ca> On 06 September 2002, Anthony Baxter said: > A snippet, hopefully not enough to trigger the spam-filters. As an aside: one of the best ways to dodge SpamAssassin is by having an In-Reply-To header. Most list traffic should meet this criterion. 
Alternately, I can whitelist mail to spambayes@python.org -- that'll work until spammers get ahold of the list address, which usually seems to take a few months. Greg -- Greg Ward http://www.gerg.ca/ Gee, I feel kind of LIGHT in the head now, knowing I can't make my satellite dish PAYMENTS! From barry@python.org Fri Sep 6 15:23:19 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 6 Sep 2002 10:23:19 -0400 Subject: [Spambayes] test sets? References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> Message-ID: <15736.47703.689156.538539@anthem.wooz.org> >>>>> "TP" == Tim Peters writes: >> Any thought to wrapping up your spam and ham test sets for >> inclusion w/ the spambayes project? TP> I gave it all the thought it deserved . It would be TP> wonderful to get several people cranking on the same test TP> data, and I'm all in favor of that. OTOH, my Data/ subtree TP> currently has more than 35,000 files slobbering over 134 TP> million bytes -- even if I had a place to put that much stuff, TP> I'm not sure my ISP would let me email it in one msg . Check it into the spambayes project. SF's disks are cheap . -Barry From guido@python.org Fri Sep 6 15:24:37 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 10:24:37 -0400 Subject: [Spambayes] test sets? In-Reply-To: Your message of "Fri, 06 Sep 2002 10:23:19 EDT." <15736.47703.689156.538539@anthem.wooz.org> References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> <15736.47703.689156.538539@anthem.wooz.org> Message-ID: <200209061424.g86EOcd14363@pcp02138704pcs.reston01.va.comcast.net> > Check it into the spambayes project. SF's disks are cheap . Perhaps more useful would be if Tim could check in the pickle(s?) generated by one of his training runs, so that others can see how Tim's training data performs against their own corpora. 
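[Sharing a trained classifier as a pickle, as Guido suggests, might look like the sketch below. The `Classifier` class here is a hypothetical stand-in for the project's actual scoring-database structure, not code from the repository:]

```python
import pickle

class Classifier:
    """Hypothetical stand-in for the real scoring database."""
    def __init__(self):
        self.wordinfo = {}   # token -> [spam_count, ham_count]
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        # Update per-token counts from one training message.
        for tok in set(tokens):
            counts = self.wordinfo.setdefault(tok, [0, 0])
            counts[0 if is_spam else 1] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

def save_classifier(clf, path):
    # Persist the trained state so others can score their own corpora with it.
    with open(path, "wb") as f:
        pickle.dump(clf, f)

def load_classifier(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

[Whether a pickle or an anydbm file is the better on-disk form comes up again later in the thread.]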
This could also be the starting point for a self-contained distribution (you've got to start with *something*, and training with python-list data seems just as good as anything else). --Guido van Rossum (home page: http://www.python.org/~guido/) From barry@python.org Fri Sep 6 15:28:12 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 6 Sep 2002 10:28:12 -0400 Subject: [Spambayes] test sets? References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> <20020906134417.GA16820@cthulhu.gerg.ca> Message-ID: <15736.47996.84689.421662@anthem.wooz.org> >>>>> "GW" == Greg Ward writes: GW> If you (and Guido, Barry, et al.) prefer, I could change that GW> last statement to "folder = None", so the mail won't be saved GW> at all. I don't care if the mail is foldered on python.org, but personal messages, regardless of who they're for, shouldn't be part of the public spambayes repository unless specifically approved by both the recipient and sender. Note also that we are much more liberal about python.org/zope.org mailing list traffic than most folks. Read list-managers for any length of time and you'll find that there are a lot of people who assert strict copyright over their collections, are very protective of their traffic, and got really pissed when gmane just started gatewaying their messages without asking. Which might be appropriate for their lists, but not for ours (don't think I'm suggesting we do the same -- I /like/ our laissez-faire approach). But for personal email, we should be more careful. -Barry From guido@python.org Fri Sep 6 15:31:22 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 10:31:22 -0400 Subject: [Spambayes] Deployment Message-ID: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Quite independently from testing and tuning the algorithm, I'd like to think about deployment.
Eventually, individuals and postmasters should be able to download a spambayes software distribution, answer a few configuration questions about their mail setup, training and false positives, and install it as a filter. A more modest initial goal might be the production of a tool that can easily be used by individuals (since we're more likely to find individuals willing to risk this than postmasters). There are many ways to do this. Some ideas:

- A program that acts both as a pop client and a pop server. You configure it by telling it about your real pop servers. You then point your mail reader to the pop server at localhost. When it receives a connection, it connects to the remote pop servers, reads your mail, and gives you only the non-spam. To train it, you'd only need to send it the false negatives somehow; it can assume that anything is ham that you don't say is spam within 48 hours.

- A server with a custom protocol: you send it a copy of a message and it answers "spam" or "ham". Then you have a little program that is invoked e.g. by procmail that talks to the server. (The server exists so that it doesn't have to load the pickle with the scoring database for each message. I don't know how big that pickle would be; maybe loading it each time is fine. Or maybe marshalling.)

- Your idea here.

Takers? How is ESR's bogofilter packaged? SpamAssassin? The Perl Bayes filter advertised on slashdot? --Guido van Rossum (home page: http://www.python.org/~guido/) From barry@python.org Fri Sep 6 15:38:56 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 6 Sep 2002 10:38:56 -0400 Subject: [Spambayes] test sets? References: <15735.50243.135743.32180@12-248-11-90.client.attbi.com> <15736.47703.689156.538539@anthem.wooz.org> <200209061424.g86EOcd14363@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15736.48640.283430.184348@anthem.wooz.org> >>>>> "GvR" == Guido van Rossum writes: GvR> Perhaps more useful would be if Tim could check in the GvR> pickle(s?)
generated by one of his training runs, so that GvR> others can see how Tim's training data performs against their GvR> own corpora. He could do that too. :) -Barry From bkc@murkworks.com Fri Sep 6 15:39:48 2002 From: bkc@murkworks.com (Brad Clements) Date: Fri, 06 Sep 2002 10:39:48 -0400 Subject: [Spambayes] Deployment In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <3D788653.9143.1D8992DA@localhost> On 6 Sep 2002 at 10:31, Guido van Rossum wrote: > your mail, and gives you only the non-spam. To train it, you'd only need > to send it the false negatives somehow; it can assume that anything is > ham that you don't say is spam within 48 hours. I have folks who leave their email programs running 24 hours a day, constantly polling for mail. If they go away for a long weekend, lots of "friday night spam" will become ham on sunday night. (Friday night seems to be the most popular time) > - Your idea here. Ultimately I'd like to see tight integration into the "most popular email clients".. As a stop-gap to the auto-ham .. How about adding an IMAP server with a spam and deleted-ham folder. Most email clients can handle IMAP. Users should be able to quickly move "spam" into the spam folder. Instead of deleting messages (or, by reprogramming the delete function) they can quickly move ham into the ham folder. In either case, the message would be processed and then destroyed. Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From tim.one@comcast.net Fri Sep 6 15:45:27 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 10:45:27 -0400 Subject: [Spambayes] test sets? In-Reply-To: <15736.5577.157228.229200@12-248-11-90.client.attbi.com> Message-ID: [Tim] > OTOH, my Data/ subtree currently has more than 35,000 files slobbering > over 134 million bytes -- even if I had a place to put that much stuff, > I'm not sure my ISP would let me email it in one msg . 
[Skip] > Do you have a dialup or something more modern ? Much more modern: a cable modem with a small upload rate cap. There's a reason the less modern uncapped @Home went out of business . > 134MB of messages zipped would probably compress pretty well - under 50MB > I'd guess with all the similarity in the headers and such. You could zip > each of the 10 sets individually and upload them somewhere. I suppose this could finish over the course of an afternoon. Now where's "somewhere"? I expect we'll eventually collect several datasets; SourceForge isn't a good place for it (they expect projects to distribute relatively small code files, and complain if even those get big). > ... > How about random sampling lots of public mailing lists via gmane or > something similar, manually cleaning it (distributing that load over a > number of people) and then relying on your clever code and your > rebalancing script to help further cleanse it? What then are we training the classifier to do? Graham's scoring scheme is based on an assumption that the ham-vs-spam task is *easy*, and half of that is due to that the ham has a lot in common. It was an experiment to apply his scheme to all the comp.lang.python traffic, which is a lot broader than he had in mind (c.l.py has long had a generous definition of "on topic" ). I don't expect good things to come of making it ever broader, *unless* your goal is to investigate just how broad it can be made before it breaks down. > The "problem" with the ham is it tends to be much more tied to one person > (not just intimate, but unique) than the spam. Which is "a feature" from Graham's POV: the more clues, the better this "smoking guns only" approach should work. > I save all incoming email for ten days (gzipped mbox format) before it rolls > over and disappears. At any one time I think I have about 8,000-10,000 > messages. 
Most of it isn't terribly personal (which I would cull before > passing along anyway) and much of it is machine-generated, so would be of > marginal use. Finally, it's all ham-n-spam mixed together. Do we call > that an omelette or a Denny's Grand Slam? Unless you're volunteering to clean it, tag it, package it, and distribute it, I'd call it irrelevant . From guido@python.org Fri Sep 6 15:43:33 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 10:43:33 -0400 Subject: [Spambayes] Deployment In-Reply-To: Your message of "Fri, 06 Sep 2002 10:39:48 EDT." <3D788653.9143.1D8992DA@localhost> References: <3D788653.9143.1D8992DA@localhost> Message-ID: <200209061443.g86Ehie14557@pcp02138704pcs.reston01.va.comcast.net> > > your mail, and gives you only the non-spam. To train it, you'd only need > > to send it the false negatives somehow; it can assume that anything is > > ham that you don't say is spam within 48 hours. > > I have folks who leave their email programs running 24 hours a day, > constantly polling for mail. If they go away for a long weekend, > lots of "friday night spam" will become ham on sunday night. > (Friday night seems to be the most popular time) So we'll make this a config parameter. > > - Your idea here. > > Ultimately I'd like to see tight integration into the "most popular > email clients".. As a stop-gap to the auto-ham .. What's an auto-ham? > How about adding an IMAP server with a spam and deleted-ham > folder. Most email clients can handle IMAP. Users should be able to > quickly move "spam" into the spam folder. I personally don't think IMAP has a bright future, but for people who do use it, that's certainly a good approach. > Instead of deleting messages (or, by reprogramming the delete > function) they can quickly move ham into the ham folder. Yes. 
--Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Fri Sep 6 15:59:38 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 10:59:38 -0400 Subject: [Spambayes] test sets? In-Reply-To: <200209060759.g867xcV03853@localhost.localdomain> Message-ID: [Anthony Baxter] > I've got a test set here that's the last 3 and a bit years email to > info@ekit.com and info@ekno.com - it's a really ugly set of 20,000+ > messages, currently broken into 7,000 spam, 9,000 ham, 9,000 currently > unclassified. These addresses are all over the 70-some different > ekit/ekno/ISIConnect websites, so they get a LOT of spam. > > As well as the usual spam, it also has customers complaining about > credit card charges, it has people interested in the service and > asking questions about long distance rates, &c &c &c. Lots and lots > of "commercial" speech, in other words. Stuff that SA gets pretty > badly wrong. Can this corpus be shared? I suppose not. > I'm currently mangling it by feeding all parts (text, html, whatever > else :) into the filters, as well as both a selected number of headers > (to, from, content-type, x-mailer), and also a list of (header, > count_of_header). This is showing up some nice stuff - e.g. the > X-uidl that stoopid spammers blindly copy into their messages. If we ever have a shared corpus, an easy refactoring of timtest should make it possible to plug in different tokenizers. I've only made three changes to Graham's algorithm so far (well, I've made dozens -- only three survived testing as proven winners); all the rest has been refining the tokenization to provide better clues. > I did have Received in there, but it's out for the moment, as it causes > rates to drop. That's ambiguous. Accuracy rates or error rates, ham or spam rates? > I'm also stripping out HTML tags, except for href="" and src="" - there's > so much goodness in them (note that I'm only keeping the contents of > the attributes).
Mining embedded http/https/ftp thingies cut the false negative rate in half in my tests (not keying off href, just scanning for anything that "looked like" one); that was the single biggest f-n improvement I've seen. It didn't change the false positive rate. Do you know whether src added additional power, or did you do both at once? From skip@pobox.com Fri Sep 6 16:01:51 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 6 Sep 2002 10:01:51 -0500 Subject: [Spambayes] Deployment In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15736.50015.881231.510395@12-248-11-90.client.attbi.com> Guido> Takers? How is ESR's bogofilter packaged? SpamAssassin? The Guido> Perl Bayes filter advertised on slashdot? Dunno about the other tools, but SpamAssassin is a breeze to incorporate into a procmail environment. Lots of people use it in many other ways. For performance reasons, many people run a spamd process and then invoke a small C program called spamc which shoots the message over to spamd and passes the result back out. I think spambayes in incremental mode is probably fast enough to not require such tricks (though I would consider changing the pickle to an anydbm file). Basic procmail usage goes something like this:

    :0fw
    | spamassassin -P

    :0
    * ^X-Spam-Status: Yes
    $SPAM

Which just says: run "spamassassin -P", reinjecting its output into the processing stream. If the resulting mail has a header which begins "X-Spam-Status: Yes", toss it into the folder indicated by the variable $SPAM. SpamAssassin also adds other headers as well, which give you more detail about how its tests fared. I'd like to see spambayes operate in at least this way: do its thing, then return a message to stdout with a modified set of headers which further processing downstream can key on.
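[A filter of the kind Skip describes -- read a message on stdin, add a status header, write it back to stdout for procmail rules to key on -- might be sketched like this. The `spamprob` function is a hypothetical stand-in for the real classifier, and the 0.9 cutoff is an assumed threshold, neither taken from the thread:]

```python
import email
import sys

SPAM_CUTOFF = 0.9  # assumed threshold, not from the thread

def spamprob(msg):
    """Hypothetical stand-in for the real classifier's scoring function."""
    subject = (msg.get("Subject") or "").lower()
    return 0.99 if "viagra" in subject else 0.01

def filter_message(text):
    # Parse the message, score it, and add a header that downstream
    # procmail rules (e.g. "* ^X-Spam-Status: Yes") can match on.
    msg = email.message_from_string(text)
    prob = spamprob(msg)
    verdict = "Yes" if prob >= SPAM_CUTOFF else "No"
    msg["X-Spam-Status"] = "%s, prob=%.2f" % (verdict, prob)
    return msg.as_string()

if __name__ == "__main__":
    # Procmail-style usage:   :0fw
    #                         | python filter.py
    sys.stdout.write(filter_message(sys.stdin.read()))
```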
Skip From bkc@murkworks.com Fri Sep 6 16:02:11 2002 From: bkc@murkworks.com (Brad Clements) Date: Fri, 06 Sep 2002 11:02:11 -0400 Subject: [Spambayes] Deployment In-Reply-To: <200209061443.g86EhXQ14543@pcp02138704pcs.reston01.va.comcast.net> References: Your message of "Fri, 06 Sep 2002 10:39:48 EDT." <3D788653.9143.1D8992DA@localhost> Message-ID: <3D788B92.22739.1D9E0FD1@localhost> Did you want this on the list? I'm replying to the list.. On 6 Sep 2002 at 10:43, Guido van Rossum wrote: > What's an auto-ham? Automatically marking something as ham after a given timeout.. regardless of how long that timeout is, someone is going to forget to submit the message back as spam. How many spams-as-hams can be accepted before the f-n rate gets unacceptable? > > How about adding an IMAP server with a spam and deleted-ham > > folder. Most email clients can handle IMAP. Users should be able to > > quickly move "spam" into the spam folder. > > I personally don't think IMAP has a bright future, but for people who > do use it, that's certainly a good approach. > > > Instead of deleting messages (or, by reprogramming the delete > > function) they can quickly move ham into the ham folder. > > Yes. I view IMAP as a stop-gap measure until tighter integration with various email clients can be achieved. I still feel it's better to require classification feedback from the recipient, rather than make any assumptions after some period of time passes. But this is an end-user issue and we're still at the algorithm stage.. ;-) Brad Clements, bkc@murkworks.com (315)268-1000 http://www.murkworks.com (315)268-9812 Fax AOL-IM: BKClements From guido@python.org Fri Sep 6 16:05:22 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 11:05:22 -0400 Subject: [Spambayes] Deployment In-Reply-To: Your message of "Fri, 06 Sep 2002 11:02:11 EDT." <3D788B92.22739.1D9E0FD1@localhost> References: "Your message of Fri, 06 Sep 2002 10:39:48 EDT." 
<3D788653.9143.1D8992DA@localhost> <3D788B92.22739.1D9E0FD1@localhost> Message-ID: <200209061505.g86F5MM14762@pcp02138704pcs.reston01.va.comcast.net> > > What's an auto-ham? > > Automatically marking something as ham after a given > timeout.. regardless of how long that timeout is, someone is going > to forget to submit the message back as spam. OK, here's a refinement. Assuming very little spam comes through, we only need to pick a small percentage of ham received as new training ham to match the new training spam. The program could randomly select a sufficient number of saved non-spam msgs and ask the user to validate this selection. You could do this once a day or week (config parameter). > How many spams-as-hams can be accepted before the f-n rate gets > unacceptable? Config parameter. > I view IMAP as a stop-gap measure until tighter integration with > various email clients can be achieved. > > I still feel it's better to require classification feedback from the > recipient, rather than make any assumptions after some period of > time passes. But this is an end-user issue and we're still at the > algorithm stage.. ;-) I'm trying to think about the end-user issues because I have nothing to contribute to the algorithm at this point. For deployment we need both! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Sep 6 16:06:26 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 11:06:26 -0400 Subject: [Spambayes] Deployment In-Reply-To: Your message of "Fri, 06 Sep 2002 10:01:51 CDT." <15736.50015.881231.510395@12-248-11-90.client.attbi.com> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <15736.50015.881231.510395@12-248-11-90.client.attbi.com> Message-ID: <200209061506.g86F6Qo14777@pcp02138704pcs.reston01.va.comcast.net> > Dunno about the other tools, but SpamAssassin is a breeze to incorporate > into a procmail environment. Lots of people use it in many other ways. 
For > performance reasons, many people run a spamd process and then invoke a small > C program called spamc which shoots the message over to spamd and passes the > result back out. I think spambayes in incremental mode is probably fast > enough to not require such tricks (though I would consider changing the > pickle to an anydbm file). > > Basic procmail usage goes something like this: > > :0fw > | spamassassin -P > > :0 > * ^X-Spam-Status: Yes > $SPAM > > Which just says, "Run spamassassin -P reinjecting its output into the > processing stream. If the resulting mail has a header which begins > "X-Spam-Status: Yes", toss it into the folder indicated by the variable > $SPAM. > > SpamAssassin also adds other headers as well, which give you more detail > about how its tests fared. I'd like to see spambayes operate in at least > this way: do its thing then return a message to stdout with a modified set > of headers which further processing downstream can key on. Do you feel capable of writing such a tool? It doesn't look too hard. --Guido van Rossum (home page: http://www.python.org/~guido/) From skip@pobox.com Fri Sep 6 16:12:58 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 6 Sep 2002 10:12:58 -0500 Subject: [Spambayes] Deployment In-Reply-To: <200209061443.g86Ehie14557@pcp02138704pcs.reston01.va.comcast.net> References: <3D788653.9143.1D8992DA@localhost> <200209061443.g86Ehie14557@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15736.50682.911121.462698@12-248-11-90.client.attbi.com> >> Ultimately I'd like to see tight integration into the "most popular >> email clients".. The advantage of using a kitchen sink (umm, make that highly programmable) editor+email package like Emacs+VM is that you can twiddle your key bindings and write a little ELisp (or Pymacs) glue to toss messages in the right direction (spam or ham). For this, spambayes would have to operate in an incremental fashion when fed a single ham or spam message. 
(No, I have no idea what an "auto-ham" is. A pig run over by a car, perhaps?) give-a-dog-a-bone-ly, y'rs, Skip From skip@pobox.com Fri Sep 6 16:19:30 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 6 Sep 2002 10:19:30 -0500 Subject: [Spambayes] Deployment In-Reply-To: <200209061506.g86F6Qo14777@pcp02138704pcs.reston01.va.comcast.net> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <15736.50015.881231.510395@12-248-11-90.client.attbi.com> <200209061506.g86F6Qo14777@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15736.51074.369911.905337@12-248-11-90.client.attbi.com> >> Dunno about the other tools, but SpamAssassin is a breeze ... >> SpamAssassin also adds other headers as well, which give you more >> detail ... Guido> Do you feel capable of writing such a tool? It doesn't look too Guido> hard. Sure, but at the moment I have to stop reading email for a few hours and do some real work. ;-) I'll see if I can modify GBayes.py suitably over the weekend. Skip From tim.one@comcast.net Fri Sep 6 16:45:43 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 11:45:43 -0400 Subject: [Spambayes] test sets? In-Reply-To: <200209060759.g867xcV03853@localhost.localdomain> Message-ID: [Anthony Baxter] > ... and also a list of (header,count_of_header). This is showing up some > nice stuff - e.g. the X-uidl that stoopid spammers blindly copy into their > messages. This is a very cool idea. I'm currently special-casing the absence of an Organization line because that proved to give a tiny reduction in the f-n rate. But counting the # of each kind of header line clearly gives the same info at comparable cost and is much more general. Unfortunately, on my corpora it turns out to be *too* strong, because of header fields injected by Mailman into the c.l.py msg archive. 
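[Counting the # of each kind of header line, as described above, reduces to emitting one token per (header name, count) pair. A minimal sketch -- not the actual timtest.py code -- using the same "header:NAME:COUNT" token form that appears in the list that follows:]

```python
import email

def header_count_tokens(text):
    """Generate 'header:NAME:COUNT' tokens for a raw message,
    one token per distinct header field name."""
    msg = email.message_from_string(text)
    counts = {}
    for name in msg.keys():          # keys() includes duplicate field names
        counts[name] = counts.get(name, 0) + 1
    return sorted("header:%s:%d" % (name, n) for name, n in counts.items())
```

[Fields injected by a list gateway -- the Mailman artifacts Tim asks Barry about -- would need to be skipped before tokenizing, for the reason given above.]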
Here's a list of all header fields + counts found in my first spam corpus (Spam/Set1/*.txt), from most valuable to least ("most valuable" here == farthest from spamprob 0.5, and then by total # of hams and spams it appeared in). X-UIDL variants are strong spam indicators here in a probability sense, but are extremely rare even in spam. Barry, can you please identify for me which of these headers are Mailman artifacts so I can avoid counting them? Each line is of the form spam_probability raw_spam_count raw_ham_count token where token is "header:" header_name ":" count Note that the "probabilities" don't make sense : as always, the ham counts are multiplied by 2 (HAMBIAS) artificially before computing a prob. prob nspam nham token 0.01 19 3559 'header:X-Mailman-Version:1' 0.01 19 3559 'header:List-Id:1' 0.01 19 3557 'header:X-BeenThere:1' 0.01 0 3093 'header:Newsgroups:1' 0.01 0 3054 'header:Xref:1' 0.01 0 3053 'header:Path:1' 0.01 0 2846 'header:References:1' 0.01 24 2760 'header:Organization:1' 0.99 2685 14 'header:Content-Length:1' 0.01 19 2668 'header:List-Unsubscribe:1' 0.01 19 2668 'header:List-Subscribe:1' 0.01 19 2668 'header:List-Post:1' 0.01 19 2668 'header:List-Help:1' 0.01 19 2668 'header:List-Archive:1' 0.01 0 2652 'header:NNTP-Posting-Host:1' 0.01 0 2058 'header:X-Trace:1' 0.01 0 1756 'header:X-Complaints-To:1' 0.01 0 1655 'header:NNTP-Posting-Date:1' 0.01 0 1320 'header:X-Newsreader:1' 0.01 1 941 'header:User-Agent:1' 0.99 689 0 'header:Delivered-To:4' 0.01 2 627 'header:X-Accept-Language:1' 0.01 0 538 'header:In-Reply-To:1' 0.99 522 0 'header:Delivered-To:3' 0.99 519 0 'header:Received:8' 0.99 466 1 'header:Received:7' 0.99 364 0 'header:Return-Path:2' 0.99 273 0 'header:MiME-Version:1' 0.01 0 149 'header:X-Spam-Status:1' 0.01 0 103 'header:X-Http-User-Agent:1' 0.01 0 103 'header:X-Http-Proxy:1' 0.01 0 103 'header:X-Article-Creation-Date:1' 0.01 0 103 'header:X-Abuse-Info:2' 0.01 0 99 'header:X-Spam-Level:1' 0.01 0 84 'header:X-MyDeja-Info:1' 0.01 0 
77 'header:X-Face:1' 0.01 0 76 'header:Mail-Followup-To:1' 0.01 0 68 'header:In-reply-to:1' 0.99 52 0 'header:Spam-Apparently-To:1' 0.01 0 49 'header:X-Server-Date:1' 0.01 0 47 'header:cc:1' 0.99 44 0 'header:Received:10' 0.01 0 35 'header:X-Original-NNTP-Posting-Host:1' 0.01 0 32 'header:Distribution:1' 0.99 31 0 'header:X-MailID:1' 0.99 31 0 'header:Complain-To:1' 0.01 0 30 'header:X-NNTP-Posting-Host:1' 0.99 29 0 'header:X-Encoding:1' 0.01 0 28 'header:X-Mimeole:1' 0.99 27 0 'header:1:1' 0.01 0 27 'header:X-Originally-To:1' 0.01 0 26 'header:Cache-Post-Path:1' 0.99 25 0 'header:X-Mailing-List:1' 0.01 0 25 'header:Mail-Copies-To:1' 0.99 24 0 'header:SUBJECT:1' 0.99 24 0 'header:DATE:1' 0.01 0 24 'header:X-Received-Date:1' 0.99 23 0 'header:x-esmtp:1' 0.99 23 0 'header:X-Precedence-Ref:1' 0.01 0 23 'header:X-Cache:1' 0.99 22 0 'header:FROM:1' 0.01 0 22 'header:X-Report:1' 0.01 0 22 'header:X-Orig-Message-ID:1' 0.01 0 22 'header:NNTP-Proxy-Relay:1' 0.01 0 22 'header:NNTP-Posting-Time:1' 0.01 0 22 'header:Abuse-Reports-To:1' 0.01 0 21 'header:X-Comment-To:1' 0.99 20 0 'header:Received:11' 0.01 0 20 'header:X-Comments:1' 0.01 0 20 'header:X-Authenticated-User:1' 0.01 0 20 'header:X-Abuse-Info:1' 0.99 19 0 'header:X-List-Name:1' 0.99 19 0 'header:X-List-Manager:1' 0.99 19 0 'header:X-Library:1' 0.01 0 19 'header:X-Comments2:1' 0.01 0 19 'header:Content-Transfer-Encoding:2' 0.01 0 18 'header:X-Original-Path:1' 0.01 0 18 'header:X-Comments3:1' 0.01 0 16 'header:X-Attribution:1' 0.99 15 0 'header:X-Stormpost-To:1' 0.99 15 0 'header:X-List-Unsubscribe:1' 0.99 15 0 'header:X-EM-Version:1' 0.99 15 0 'header:X-EM-Registration:1' 0.01 0 15 'header:X-Posting-Agent:1' 0.01 0 15 'header:X-Orig-Path:1' 0.01 0 15 'header:X-In-Reply-To:1' 0.01 0 14 'header:X-X-Sender:1' 0.01 0 14 'header:X-UserInfo1:1' 0.01 0 14 'header:X-Filtered-By:1' 0.01 0 14 'header:Mime-version:1' 0.01 0 13 'header:X-scanner:1' 0.01 0 13 'header:X-Url:1' 0.01 0 13 'header:X-Filename:1' 0.99 12 0 'header:TO:1' 
0.01 0 12 'header:X-Gateway:1' 0.01 0 12 'header:X-FTNADDR:1' 0.01 0 12 'header:X-Envelope-To:1' 0.01 0 12 'header:Content-Encoding:1' 0.01 0 11 'header:X-Nntp-Posting-Host:1' 0.99 10 0 'header:X-UIDL:1' 0.01 0 10 'header:X-Reposted-By:1' 0.01 0 10 'header:X-Repost-Date:1' 0.01 0 10 'header:X-Original-Message-ID:1' 0.01 0 10 'header:X-Oblique-Strategy:1' 0.01 0 10 'header:X-No-Productlinks:1' 0.01 0 10 'header:X-MS-TNEF-Correlator:1' 0.01 0 10 'header:X-MS-Has-Attach:1' 0.01 0 10 'header:X-Comments:5' 0.99 9 0 'header:Comments:1' 0.01 0 9 'header:X-eGroups-Return:1' 0.01 0 9 'header:X-No-Archive:1' 0.99 8 0 'header:X-PMG-Userid:1' 0.99 8 0 'header:X-PMG-Recipient:1' 0.99 8 0 'header:X-PMG-Msgid:1' 0.99 8 0 'header:X-PMFLAGS:1' 0.99 8 0 'header:X-Info:2' 0.01 0 8 'header:X-Path-Notice:1' 0.01 0 8 'header:Originator:1' 0.01 0 8 'header:Cancel-Lock:1' 0.99 7 0 'header:X-x:1' 0.01 0 7 'header:content-class:1' 0.01 0 7 'header:X-Organization:1' 0.01 0 7 'header:Followup-To:1' 0.99 6 0 'header:X-MDRemoteIP:1' 0.01 0 6 'header:X-Wren-Trace:1' 0.01 0 6 'header:X-Squaresville:1' 0.01 0 6 'header:X-Originating-Host:1' 0.01 0 6 'header:X-Niggle:1' 0.01 0 6 'header:X-NETCOM-Date:1' 0.01 0 6 'header:X-Moon-Phase:1' 0.01 0 6 'header:X-Licznik:1' 0.01 0 6 'header:X-Copyright:1' 0.01 0 6 'header:NNTP-Posting-User:1' 0.01 0 6 'header:Bytes:1' 0.99 5 0 'header:X-Info:1' 0.99 5 0 'header:Received:12' 0.01 0 5 'header:X-homepage:1' 0.01 0 5 'header:X-Original-Trace:1' 0.01 0 5 'header:X-Mail2News-Path:1' 0.01 0 5 'header:X-Editor:1' 0.01 0 5 'header:X-Admin:1' 0.01 0 5 'header:Supersedes:1' 0.01 0 5 'header:Keywords:1' 0.01 0 4 'header:X-Trace-PostClient-IP:1' 0.01 0 4 'header:X-Meow:1' 0.01 0 4 'header:X-GC-Trace:1' 0.01 0 4 'header:X-DMCA-Complaints-To:1' 0.01 0 4 'header:Posted-And-Mailed:1' 0.01 0 3 'header:X-question:1' 0.01 0 3 'header:X-Sun-Charset:1' 0.01 0 3 'header:X-Spam-Status:2' 0.01 0 3 'header:X-Spam-Level:2' 0.01 0 3 'header:X-SessionID:1' 0.01 0 3 
'header:X-Path-Stamp:1' 0.01 0 3 'header:X-PGP-Fingerprint:1' 0.01 0 3 'header:X-Original-NNTP-Posting-Host:2' 0.01 0 3 'header:X-Home-Page:1' 0.01 0 3 'header:X-Hash:1' 0.01 0 3 'header:X-Hash-Info:1' 0.01 0 3 'header:X-AntiVirus:1' 0.01 0 3 'header:Original-Sender:1' 0.01 0 3 'header:Mail-copies-to:1' 0.01 0 3 'header:Content-ID:1' 0.99 133 1 'header:Received:9' 0.02 55 3559 'header:Errors-To:1' 0.02 65 3559 'header:Precedence:1' 0.04 3 49 'header:X-MIMEOLE:1' 0.95 239 9 'header:X-OriginalArrivalTime:1' 0.05 2 26 'header:Priority:1' 0.06 1 11 'header:Thread-Topic:1' 0.07 1 10 'header:X-MIME-Autoconverted:1' 0.07 219 3647 'header:Sender:1' 0.07 1 9 'header:X-URL:1' 0.11 19 111 'header:Content-Disposition:1' 0.88 135 13 'header:Received:6' 0.86 2373 277 'header:Return-Path:1' 0.14 191 863 'header:X-MimeOLE:1' 0.86 17 2 'header:X-Return-Path:1' 0.86 17 2 'header:X-MDaemon-Deliver-To:1' 0.85 15 2 'header:X-X:1' 0.16 12 46 'header:Content-transfer-encoding:1' 0.84 28 4 'header:Content-Class:1' 0.20 7 21 'header:X-Authentication-Warning:1' 0.20 4 12 'header:X-MSMail-priority:1' 0.20 1 3 'header:thread-index:1' 0.80 1471 273 'header:Delivered-To:1' 0.76 480 112 'header:Message-Id:1' 0.74 16 4 'header:Delivered-To:2' 0.74 4 1 'header:X-SLUIDL:1' 0.72 117 33 'header:CC:1' 0.29 5 9 'header:X-Apparently-From:1' 0.71 17 5 'header:Bcc:1' 0.71 27 8 'header:Thread-Index:1' 0.29 8 14 'header:X-Msmail-Priority:1' 0.69 31 10 'header:X-MIMETrack:1' 0.31 564 903 'header:Mime-Version:1' 0.32 13 20 'header:MIME-version:1' 0.35 349 474 'header:Received:2' 0.65 756 297 'header:Received:3' 0.36 94 123 'header:X-Sender:1' 0.36 1561 3867 'header:Message-ID:1' 0.37 16 20 'header:Message-id:1' 0.63 329 141 'header:Importance:1' 0.62 100 44 'header:Content-type:1' 0.61 68 31 'header:Received:5' 0.39 983 1096 'header:X-Mailer:1' 0.40 754 827 'header:X-MSMail-Priority:1' 0.41 293 307 'header:Cc:1' 0.59 1158 587 'header:Reply-To:1' 0.42 1558 1575 'header:Content-Transfer-Encoding:1' 0.42 16 16 
'header:X-mailer:1' 0.58 224 119 'header:Received:4' 0.42 882 875 'header:X-Priority:1' 0.43 2090 3097 'header:Lines:1' 0.46 1397 1199 'header:MIME-Version:1' 0.47 2410 2207 'header:Content-Type:1' 0.47 2450 4000 'header:Date:1' 0.48 15 12 'header:X-Originating-IP:1' 0.52 15 10 'header:Received:1' 0.48 31 24 'header:Reply-to:1' 0.49 2627 3989 'header:To:1' 0.50 2700 3999 'header:Subject:1' 0.50 2704 4000 'header:From:1' 0.50 4 0 'header:X-RMD-Text:1' 0.50 4 0 'header:X-Owner:1' 0.50 4 0 'header:Cc:90' 0.50 4 0 'header:2:1' 0.50 3 0 'header:received:1' 0.50 3 0 'header:microsoft:1' 0.50 3 0 'header:X-Unsent:1' 0.50 3 0 'header:X-UID:1' 0.50 3 0 'header:X-Set:1' 0.50 3 0 'header:X-SMTPExp-Version:1' 0.50 3 0 'header:X-SMTPExp-Registration:1' 0.50 3 0 'header:X-PLATTER:1' 0.50 3 0 'header:X-MB-Pid:1' 0.50 3 0 'header:X-MB-Mid:1' 0.50 3 0 'header:X-Info-2:1' 0.50 3 0 'header:X-Info-1:1' 0.50 3 0 'header:X-IONK:1' 0.50 3 0 'header:X-Debug:1' 0.50 3 0 'header:X-CRUNCHERS:1' 0.50 3 0 'header:X-CORONNA:1' 0.50 3 0 'header:X-BlackMail:1' 0.50 3 0 'header:X-Authenticated-Timestamp:1' 0.50 3 0 'header:Errors-to:1' 0.50 3 0 'header:Disposition-Notification-To:1' 0.50 3 0 'header:6:1' 0.50 3 0 'header:5:1' 0.50 3 0 'header:4:1' 0.50 3 0 'header:3:1' 0.50 2 1 'header:X-MDRcpt-To:1' 0.50 2 0 'header:X-vsuite-type:1' 0.50 2 0 'header:X-Tracking:1' 0.50 2 0 'header:X-Sent-Mail:1' 0.50 2 0 'header:X-Sender-Ip:1' 0.50 2 0 'header:X-Originating-Ip:1' 0.50 2 0 'header:X-Hops:1' 0.50 2 0 'header:X-Expiredinmiddle:1' 0.50 2 0 'header:X-AntiAbuse:5' 0.50 2 0 'header:X-:1' 0.50 2 0 'header:Date-warning:1' 0.50 2 0 'header:Content-Location:1' 0.50 2 0 'header:Content-Language:1' 0.50 2 0 'header:Cc:95' 0.50 0 2 'header:X-VirusChecked:1' 0.50 0 2 'header:X-Uptime:1' 0.50 0 2 'header:X-RBL-Warning:1' 0.50 0 2 'header:X-Public-Domain:1' 0.50 0 2 'header:X-Postfilter:1' 0.50 0 2 'header:X-Poster-Key:1' 0.50 0 2 'header:X-Operating-System:1' 0.50 0 2 'header:X-OS:1' 0.50 0 2 
'header:X-News-Software:1' 0.50 0 2 'header:X-NFilter:1' 0.50 0 2 'header:X-MailScanner:1' 0.50 0 2 'header:X-Lotus-FromDomain:1' 0.50 0 2 'header:X-Loop-Detect:1' 0.50 0 2 'header:X-Inktomi-Trace:1' 0.50 0 2 'header:X-Delivery-Agent:1' 0.50 0 2 'header:X-DMCA-Notifications:1' 0.50 0 2 'header:X-Complaints-To:2' 0.50 0 2 'header:X-Complaints-Info:1' 0.50 0 2 'header:X-BeenThere:3' 0.50 0 2 'header:X-Access:1' 0.50 0 2 'header:X-Abuse-and-DMCA-Info:2' 0.50 0 2 'header:Status:1' 0.50 0 2 'header:Mail-From:1' 0.50 0 2 'header:Content-return:1' 0.50 0 2 'header:' 0.50 1 0 'header:http:1' 0.50 1 0 'header:content-type:1' 0.50 1 0 'header:X-zippo:1' 0.50 1 0 'header:X-WSS-ID:2' 0.50 1 0 'header:X-WSS-ID:1' 0.50 1 0 'header:X-Sybari-Space:1' 0.50 1 0 'header:X-Server-Uuid:2' 0.50 1 0 'header:X-Server-Uuid:1' 0.50 1 0 'header:X-Sanitizer:1' 0.50 1 0 'header:X-Reply-To:1' 0.50 1 0 'header:X-Rcpt-To:1' 0.50 1 0 'header:X-RAV-Antivirus:1' 0.50 1 0 'header:X-PROJECT-ID:1' 0.50 1 0 'header:X-Msmail-priority:1' 0.50 1 0 'header:X-MessageNo:1' 0.50 1 0 'header:X-MDSend-Notifications-To:1' 0.50 1 0 'header:X-MDMailing-List:1' 0.50 1 0 'header:X-MAILER:1' 0.50 1 0 'header:X-Lookup-Warning:1' 0.50 1 0 'header:X-JsMail-Priority:1' 0.50 1 0 'header:X-Infomail-Spawn:1' 0.50 1 0 'header:X-Infomail-Id:1' 0.50 1 0 'header:X-GWIA:1' 0.50 1 0 'header:X-GCMulti:1' 0.50 1 0 'header:X-Envelope-From:1' 0.50 1 0 'header:X-David-Sym:1' 0.50 1 0 'header:X-David-Flags:1' 0.50 1 0 'header:X-Auto-Forward:1' 0.50 1 0 'header:To:45' 0.50 1 0 'header:Reply-To:2' 0.50 1 0 'header:Received:13' 0.50 1 0 'header:MIME-Version:4' 0.50 1 0 'header:Illegal-Object:1' 0.50 1 0 'header:Delivered-To:5' 0.50 1 0 'header:Cc:82' 0.50 1 0 'header:Cc:80' 0.50 1 0 'header:Cc:74' 0.50 1 0 'header:Cc:24' 0.50 0 1 'header:x-message-flag:1' 0.50 0 1 'header:subject:1' 0.50 0 1 'header:X400-MTS-Identifier:1' 0.50 0 1 'header:X-uri:1' 0.50 0 1 'header:X-no-archive:1' 0.50 0 1 'header:X-news-hint:1' 0.50 0 1 
'header:X-news-hint2:1' 0.50 0 1 'header:X-Virus-Scanned:1' 0.50 0 1 'header:X-User:1' 0.50 0 1 'header:X-TMDA-Fingerprint:1' 0.50 0 1 'header:X-TCP-IDENTITY:1' 0.50 0 1 'header:X-T.O.S.:1' 0.50 0 1 'header:X-Silly:1' 0.50 0 1 'header:X-Report-Abuse-To:1' 0.50 0 1 'header:X-Real-Address:1' 0.50 0 1 'header:X-ROUTED:1' 0.50 0 1 'header:X-Posting-IP:1' 0.50 0 1 'header:X-Phone:1' 0.50 0 1 'header:X-PGP:1' 0.50 0 1 'header:X-PGP-Key:1' 0.50 0 1 'header:X-PGP-Key-Fingerprint:1' 0.50 0 1 'header:X-Original-To:1' 0.50 0 1 'header:X-Original-NNTP-Posting-Host:3' 0.50 0 1 'header:X-Orig-X-Trace:1' 0.50 0 1 'header:X-Orig-X-Complaints-To:1' 0.50 0 1 'header:X-Orig-NNTP-Posting-Host:1' 0.50 0 1 'header:X-Orig-NNTP-Posting-Date:1' 0.50 0 1 'header:X-Organisation:1' 0.50 0 1 'header:X-Notice:2' 0.50 0 1 'header:X-No-Repost:1' 0.50 0 1 'header:X-Nntp-Posting-Date:1' 0.50 0 1 'header:X-Newsposter:1' 0.50 0 1 'header:X-Newsgroups:1' 0.50 0 1 'header:X-Mail-Copies-To:1' 0.50 0 1 'header:X-MTA:1' 0.50 0 1 'header:X-Location:1' 0.50 0 1 'header:X-Islamic-Date:1' 0.50 0 1 'header:X-HomePage:1' 0.50 0 1 'header:X-Get-A-Real-Newsreader:1' 0.50 0 1 'header:X-GPG-Key-ID:1' 0.50 0 1 'header:X-GPG-Fingerprint:1' 0.50 0 1 'header:X-From:1' 0.50 0 1 'header:X-Flags:1' 0.50 0 1 'header:X-Fax:1' 0.50 0 1 'header:X-Favorite-Dwarf:1' 0.50 0 1 'header:X-Faculty:1' 0.50 0 1 'header:X-Eudora-Signature:1' 0.50 0 1 'header:X-Eric-Conspiracy:1' 0.50 0 1 'header:X-Enigmail-Version:1' 0.50 0 1 'header:X-Enigmail-Supports:1' 0.50 0 1 'header:X-Emacs:1' 0.50 0 1 'header:X-Emacs-Acronym:1' 0.50 0 1 'header:X-ELN-Date:1' 0.50 0 1 'header:X-Draft-From:1' 0.50 0 1 'header:X-Disclaimer:1' 0.50 0 1 'header:X-Cyberus:1' 0.50 0 1 'header:X-Commercial-ReplyTo:1' 0.50 0 1 'header:X-Bpc-Relay-Sender-Host:1' 0.50 0 1 'header:X-Bpc-Relay-Info:1' 0.50 0 1 'header:X-Bpc-Relay-Envelope-From:1' 0.50 0 1 'header:X-BeOS-Platform:1' 0.50 0 1 'header:X-Authentication-Info:1' 0.50 0 1 'header:X-Attachments:1' 0.50 0 1 
'header:X-Added:1' 0.50 0 1 'header:X-Abuse-Info2:1' 0.50 0 1 'header:UA-Content-Id:1' 0.50 0 1 'header:To:2' 0.50 0 1 'header:Sensitivity:1' 0.50 0 1 'header:Phone:1' 0.50 0 1 'header:Original-Encoded-Information-Types:1' 0.50 0 1 'header:Nntp-Posting-Host:1' 0.50 0 1 'header:Microsoft:1' 0.50 0 1 'header:MMDF-Warning:1' 0.50 0 1 'header:Injector-Info:1' 0.50 0 1 'header:Favorite-Color:1' 0.50 0 1 'header:Content-disposition:1' 0.50 0 1 'header:Content-description:1' 0.50 0 1 'header:Content-MD5:1' 0.50 0 1 'header:Content-Identifier:1' 0.50 0 1 'header:Cc:2' 0.50 0 1 'header:Bcc:2' 0.50 0 1 'header:Autoforwarded:1' 0.50 0 1 'header:Archive-Name:1'

From gward@python.net Fri Sep 6 16:55:09 2002 From: gward@python.net (Greg Ward) Date: Fri, 6 Sep 2002 11:55:09 -0400 Subject: [Spambayes] test sets? In-Reply-To: References: <200209060759.g867xcV03853@localhost.localdomain> Message-ID: <20020906155509.GA17800@cthulhu.gerg.ca>

On 06 September 2002, Tim Peters said:

> prob  nspam  nham  token
> 0.99   2685    14  'header:Content-Length:1'

That might be a bias of Bruce Guenter's spam collection.

> 0.99    689     0  'header:Delivered-To:4'

And this *definitely* is, because Bruce is a qmail guy.

Greg
--
Greg Ward                                       http://www.gerg.ca/
Life is too short for ordinary music.

From nas@python.ca Fri Sep 6 16:57:05 2002 From: nas@python.ca (Neil Schemenauer) Date: Fri, 6 Sep 2002 08:57:05 -0700 Subject: [Spambayes] Deployment In-Reply-To: <200209061443.g86Ehie14557@pcp02138704pcs.reston01.va.comcast.net> References: <3D788653.9143.1D8992DA@localhost> <200209061443.g86Ehie14557@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020906155705.GA22115@glacier.arctrix.com>

Guido van Rossum wrote:
> I personally don't think IMAP has a bright future, but for people who
> do use it, that's certainly a good approach.

Writing an IMAP server is a non-trivial task. The specification is huge and clients do all kinds of weird stuff. POP is very easy in comparison.
Perhaps you could forward messages to a special address, or save them in a special folder, to mark them as false negatives. Alternatively, perhaps there could be a separate protocol and client that could be used to review additions to the training set. Each day a few random spam and ham messages could be grabbed as candidates. Someone would periodically start up the client, review the candidates, reclassify or remove any messages they don't like, and add them to the training set.

Neil

From tim.one@comcast.net Fri Sep 6 17:21:51 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 12:21:51 -0400 Subject: [Spambayes] test sets? In-Reply-To: Message-ID:

[Tim]
> ...
> Unfortunately, on my corpora it turns out to be *too* strong,
> ...

Here's what happens if I leave all the header counts in:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.025  won   -75.00%
    0.000  0.000  tied
    0.025  0.000  won  -100.00%
    0.025  0.000  won  -100.00%
    0.100  0.025  won   -75.00%
    0.025  0.000  won  -100.00%
    0.025  0.000  won  -100.00%
    0.050  0.000  won  -100.00%
    0.100  0.000  won  -100.00%
    0.025  0.000  won  -100.00%
    0.025  0.025  tied
    0.025  0.000  won  -100.00%
    0.025  0.000  won  -100.00%
    0.025  0.000  won  -100.00%
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.000  won  -100.00%
    0.100  0.025  won   -75.00%

won  14 times
tied  6 times
lost  0 times

total unique fp went from 9 to 2

false negative percentages
    0.364  0.145  won   -60.16%
    0.400  0.291  won   -27.25%
    0.400  0.364  won    -9.00%
    0.909  0.618  won   -32.01%
    0.836  0.545  won   -34.81%
    0.618  0.473  won   -23.46%
    0.291  0.291  tied
    1.018  0.654  won   -35.76%
    0.982  0.655  won   -33.30%
    0.727  0.545  won   -25.03%
    0.800  0.618  won   -22.75%
    1.163  0.872  won   -25.02%
    0.764  0.545  won   -28.66%
    0.473  0.291  won   -38.48%
    0.473  0.327  won   -30.87%
    0.727  0.509  won   -29.99%
    0.655  0.400  won   -38.93%
    0.509  0.218  won   -57.17%
    0.545  0.364  won   -33.21%
    0.509  0.436  won   -14.34%

won  19 times
tied  1 times
lost  0 times

total unique fn went from 168 to 124

A false positive *really* has to work hard then, eh?
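The delta column in tables like these is just the relative change from the first run to the second. A minimal sketch of how one row could be rendered (my reconstruction for illustration, not the actual test-harness code; it assumes the first-run rate is nonzero whenever the two differ):

```python
def delta(before, after):
    """Render one row's verdict the way the comparison tables read:
    'tied' when the two rates are equal, else 'won'/'lost' with the
    percent change relative to the first run."""
    if after == before:
        return "tied"
    change = (after - before) / before * 100.0
    verdict = "won" if after < before else "lost"
    return "%s %+.2f%%" % (verdict, change)
```

For example, the third row above is delta(0.100, 0.025), i.e. a 75% relative reduction in the false positive rate.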
The long quote of a Nigerian scam letter is one of the two that made it, and spamprob() looked at all this stuff before deciding it was spam:

    prob = 0.999945196947
    prob('domestic') = 0.99
    prob('dollars)') = 0.99
    prob('solicit') = 0.99
    prob('partner.') = 0.99
    prob('accounts,') = 0.99
    prob('federal') = 0.99
    prob('nigeria.') = 0.99
    prob('ministry') = 0.99
    prob('subject:Business') = 0.99
    prob('overseas') = 0.99
    prob('housing') = 0.99
    prob('nigeria') = 0.99
    prob('nigerian') = 0.99
    prob('estate') = 0.99
    prob('70%') = 0.99
    prob('regime') = 0.99
    prob('payment') = 0.99
    prob('header:X-Complaints-To:1') = 0.01
    prob('header:X-BeenThere:1') = 0.01
    prob('header:NNTP-Posting-Host:1') = 0.01
    prob('ended.') = 0.01
    prob('wrote') = 0.01
    prob('header:Path:1') = 0.01
    prob('header:NNTP-Posting-Date:1') = 0.01
    prob('header:X-Mailman-Version:1') = 0.01
    prob('header:List-Id:1') = 0.01
    prob('header:List-Archive:1') = 0.01
    prob('header:X-Trace:1') = 0.01
    prob('header:Organization:1') = 0.01
    prob('header:Newsgroups:1') = 0.01
    prob('header:List-Post:1') = 0.01
    prob('header:References:1') = 0.01
    prob('header:List-Help:1') = 0.01
    prob('header:X-Newsreader:1') = 0.01
    prob('states') = 0.959986
    prob('united') = 0.96139
    prob('money') = 0.964852
    prob('country.') = 0.97034
    prob('civil') = 0.96754
    prob('partner') = 0.969003
    prob('complex,') = 0.01
    prob('funds') = 0.972142
    prob('million') = 0.971369
    prob('purchase') = 0.986651
    prob('government') = 0.985578
    prob('header:Precedence:1') = 0.0306554
    prob('header:Xref:1') = 0.01
    prob('header:List-Subscribe:1') = 0.01
    prob('header:List-Unsubscribe:1') = 0.01
    prob('header:Errors-To:1') = 0.0182013

It actually found more 0.01 clues than 0.99 ones then, but the content is *so* bad nothing can overcome the judgment of guilt.

BTW, the false negative rate in my corpora is also getting near the point where I won't be able to measure improvement reliably. Since there are only 2750 spams in a spam set, 1% is 27.5 spams, whereas in the ham corpus 1% is 40 hams.
So, e.g., a f-n rate of 0.364% means a grand total of 10 false negatives, so even changing that by 1 measly msg makes a 10% difference in the f-n rate.

From barry@python.org Fri Sep 6 17:23:33 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 6 Sep 2002 12:23:33 -0400 Subject: [Spambayes] Deployment References: <3D788653.9143.1D8992DA@localhost> <200209061443.g86Ehie14557@pcp02138704pcs.reston01.va.comcast.net> <20020906155705.GA22115@glacier.arctrix.com> Message-ID: <15736.54917.688066.738120@anthem.wooz.org>

>>>>> "NS" == Neil Schemenauer writes:

  NS> Writing an IMAP server is a non-trivial task.

That's what I've been told by everyone I've talked to who's actually tried to write one.

  NS> Alternatively, perhaps there could be a separate protocol and
  NS> client that could be used to review additions to the training
  NS> set.  Each day a few random spam and ham messages could be
  NS> grabbed as candidates.  Someone would periodically start up the
  NS> client, review the candidates, reclassify or remove any
  NS> messages they don't like and add them to the training set.

I think people will be much more motivated to report spam than ham. I like the general approach that copies of random messages will be sequestered for some period of time before they're assumed to be ham. Matched with a simple spam reporting scheme, this could keep the training up to date with little effort. I've sketched out an approach along these lines that a listserver like Mailman could take, and if I get some free time I'll hack something together.

I like the idea of a POP proxy which classifies messages as they're pulled from the server. The easiest way for such a beast to be notified of spam might be to simply save the spam in a special folder or file that the POP proxy would periodically consult.
-Barry

From gward@python.net Fri Sep 6 17:25:05 2002 From: gward@python.net (Greg Ward) Date: Fri, 6 Sep 2002 12:25:05 -0400 Subject: [Spambayes] Deployment In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <20020906162505.GB17800@cthulhu.gerg.ca>

On 06 September 2002, Guido van Rossum said:
> Quite independently from testing and tuning the algorithm, I'd like to
> think about deployment.

I was just pondering this this morning. In case it wasn't obvious, I'm a strong proponent of filtering junk mail as early as possible, ie. right after the SMTP DATA command has been completed. Filtering spam at the MUA just seems stupid to me -- by the time it gets to my MUA, the spammer has already stolen my bandwidth. My public addresses are gward@python.net and gward@mems-exchange.org, so I want spam stopped by the mail servers for those two domains. (Hence the recent MTA switch on starship...)

I guess MUA-level filtering is just a fallback for people who don't have 1) a burning, all-consuming hatred of junk mail, 2) root access to all mail servers they rely on, and 3) the ability and inclination to install an MTA with every bell and whistle tweaked to keep out junk mail.

Anyways, here's how I think it should work:

* as soon as the DATA command is completed, the MTA passes the message to some local message-scanning code: a milter with Sendmail, local_scan() with Exim. Dunno if any other MTAs have similar provisions.

* the local scanner feeds the message to spambayes; if it says "yep, this is spam", the local scanner generates an SMTP rejection message, which the MTA returns to the client, eg.

      DATA
      [...spam...]
      .
      550-rejected -- looks like spam
      550 (see http://mail.python.org/spam/17nLfU-0003IT-00)

The hypothetical web page (one per rejected message) would give an explanation of why the message was considered spam (eg.
the top 15 keywords), and give the sender the option to "request review" -- what I'm thinking is: send email to postmaster, and one of the postmasters will pop over to another web page, look at the message, and either rescue it or decide that it really is spam. Yes, I'm willing to risk giving spammers information in order to make life easier for false positive victims. I very much doubt that spammers read SMTP rejection messages.

As for feeding the message to spambayes: for the Exim servers that I have a hand in, the local_scan() function is written in Python, so there shouldn't be any need to spawn a sub-process or open a socket to do this. Other sites may not be so lucky, in which case a fast, low-overhead way to evaluate a message is essential. Python's startup overhead is not trivial, but I'd bet Python+spambayes is much faster to start up than Perl+SpamAssassin: Python has bytecode compilation, and the spambayes database is much simpler than SpamAssassin's ruleset. (Especially if the pickle is changed to a DB, DBM, or CDB file.) So a spamd-style daemon is worth considering, but not necessarily the answer.

Anyways, I've outlined a way to gather false positives above. We already have a protocol for dealing with false negs -- forward them to spam@python.org. Just have to figure out what to do with them then. (Currently they're piling up in /var/mail/nc-spam [nc = not caught] on mail.python.org.)

> Eventually, individuals and postmasters should be able to download a
> spambayes software distribution, answer a few configuration questions
> about their mail setup, training and false positives, and install it
> as a filter.

Note that SpamAssassin is not as simple as that to install -- I think the "few configuration questions about their mail setup" is a massive black hole that's best avoided. SA provides tools that tell you whether something looks like spam, and how spammy it is. Everything else is up to the local admin, which makes eminent sense to me.
The mantra is: SpamAssassin is a tool for *detecting* spam, not for rejecting it/discarding it/moving it somewhere/whatever. Do one thing, and do it well. The downside of that approach is that every MTA/MDA(/MUA?) community has to figure out clever ways to integrate SpamAssassin. This might not be a bad thing: the ways to integrate SA with Exim just keep getting better and better, and the SA people don't really have to worry about that.

Greg
--
Greg Ward                                       http://www.gerg.ca/
A committee is a life form with six or more legs and no brain.

From paul-bayes@svensson.org Fri Sep 6 17:27:57 2002 From: paul-bayes@svensson.org (Paul Svensson) Date: Fri, 6 Sep 2002 12:27:57 -0400 (EDT) Subject: [Spambayes] Corpus Collection (Was: Re: Deployment) In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID:

On Fri, 6 Sep 2002, Guido van Rossum wrote:

>Quite independently from testing and tuning the algorithm, I'd like to
>think about deployment.
>
>Eventually, individuals and postmasters should be able to download a
>spambayes software distribution, answer a few configuration questions
>about their mail setup, training and false positives, and install it
>as a filter.
>
>A more modest initial goal might be the production of a tool that can
>easily be used by individuals (since we're more likely to find
>individuals willing to risk this than postmasters).

My impression is that a pre-collected corpus would not fit most individuals very well; rather, each individual (or group?) should collect their own corpus. One problem that comes up immediately: individuals are lazy.

If I currently get 50 spam and 50 ham a day, and have to press the 'delete' button once for each spam, I'll be happy to press a 'spam' button instead. However, if I in addition have to press a 'ham' button for each ham, it starts to look much less like a win to me. Add the time to install and set up the whole machinery, and I'll just keep hitting delete.
The suggestions so far have been to hook something on the delete action that adds a message to the ham corpus. I see two problems with this. First, the ham will be a bit skewed: mail that I keep around without deleting will not be counted. Secondly, if I by force of habit happen to press the 'delete' key instead of the 'spam' key, I'll end up with spam in the ham anyway.

I would like to look for a way to deal with spam in the ham. The obvious thing to do is to trigger on the 'spam' button, and at that time look for messages similar to the deleted one in the ham corpus, and simply remove them. To do this we need a way to compare two word count histograms, to see how similar they are. Any ideas?

Also, I personally would prefer to not see the spam at all. If they get bounced (preferably already in the SMTP dialogue), false positives become the sender's problem, to rewrite to remove the spam smell. In a well-tuned system, then, the spam corpus will be much smaller than the ham corpus, so it would be possible to be slightly over-aggressive when clearing potential spam from the ham corpus. This should make it easier to keep it clean. Having a good way to remove spam from the ham corpus, there's less need to worry about it getting there by mistake, and we might as well simply add all messages to the ham corpus that didn't get deleted by the spam filtering.

It might also be useful to have a way to remove messages from the spam corpus, in case of user oops.

/Paul

From barry@wooz.org Fri Sep 6 17:16:17 2002 From: barry@wooz.org (Barry A. Warsaw) Date: Fri, 6 Sep 2002 12:16:17 -0400 Subject: [Spambayes] test sets? References: <200209060759.g867xcV03853@localhost.localdomain> Message-ID: <15736.54481.733005.644033@anthem.wooz.org>

>>>>> "TP" == Tim Peters writes:

  TP> Barry, can you please identify for me which of these headers
  TP> are Mailman artifacts so I can avoid counting them?

Sure, with a little off-topic commentary added for no charge.
0.01     19   3559 'header:X-Mailman-Version:1'
0.01     19   3559 'header:List-Id:1'
0.01     19   3557 'header:X-BeenThere:1'

These three are definitely MM artifacts, although the second one /could/ be inserted by other list management software (it's described in an RFC).

0.01      0   3093 'header:Newsgroups:1'
0.01      0   3054 'header:Xref:1'
0.01      0   3053 'header:Path:1'

These aren't MM artifacts, but are byproducts of gating a message off of an nntp feed. Some of the other NNTP-* headers are similar, but I won't point them out below.

0.01     19   2668 'header:List-Unsubscribe:1'
0.01     19   2668 'header:List-Subscribe:1'
0.01     19   2668 'header:List-Post:1'
0.01     19   2668 'header:List-Help:1'
0.01     19   2668 'header:List-Archive:1'

RFC-recommended generic listserve headers that MM injects.

0.99    689      0 'header:Delivered-To:4'

This one's often a byproduct of the mail server. In particular, Postfix and possibly others put the envelope recipient in this header.

0.99    522      0 'header:Delivered-To:3'

So why do you get two entries for this one?

0.99    519      0 'header:Received:8'
0.99    466      1 'header:Received:7'

And this one?

0.99    273      0 'header:MiME-Version:1'

Note that header names are case insensitive, so this one's no different than "MIME-Version:". Similarly other headers in your list.

0.99     27      0 'header:1:1'

Huh?

0.01      0     27 'header:X-Originally-To:1'

Mailman copies any To: header found in a message gated off of nntp to the X-Originally-To: header. Others possible here include X-Original-To, X-Original-Cc, X-Original-Content-Transfer-Encoding, and X-Original-Date.

0.01      0      9 'header:X-No-Archive:1'

Could be MM or not. This is used to stop the archiving of certain messages, and MM will inject these into digests and password reminders, but it's also possible that user agents have added this. (Aside: in particularly mischievous fashion, the value of this header is "yes", so you see things like "X-No-Archive: yes". X-Isn't-Not-Nonsense: no).

0.02     65   3559 'header:Precedence:1'

Could be Mailman, or not.
This header is supposed to tell other automated software that this message was automated. E.g. a replybot should ignore any message with a Precedence: {bulk|junk|list}.

0.80   1471    273 'header:Delivered-To:1'

Why again?!

0.50      4      0 'header:2:1'

!?

0.50      3      0 'header:6:1'
0.50      3      0 'header:5:1'
0.50      3      0 'header:4:1'
0.50      3      0 'header:3:1'
0.50      2      0 'header:X-:1'

Freaky.

0.50      0      2 'header:X-BeenThere:3'

X-BeenThere: before :)

0.50      0      2 'header:'

Heh?

0.50      0      1 'header:X-Silly:1'

X-Very-Silly: fneh
X-Very-Silly-Indeed: dead parrot

0.50      0      1 'header:X-Get-A-Real-Newsreader:1'
0.50      0      1 'header:X-Favorite-Dwarf:1'
0.50      0      1 'header:X-Eric-Conspiracy:1'
0.50      0      1 'header:Favorite-Color:1'

Cute. :)

Some headers of course are totally unreliable as to their origin. I'm thinking stuff like MIME-Version, Content-Type, To, From, etc, etc. Everyone sticks those in.

-Barry

From guido@python.org Fri Sep 6 17:27:01 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 12:27:01 -0400 Subject: [Spambayes] Deployment In-Reply-To: Your message of "Fri, 06 Sep 2002 12:25:05 EDT." <20020906162505.GB17800@cthulhu.gerg.ca> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <20020906162505.GB17800@cthulhu.gerg.ca> Message-ID: <200209061627.g86GR1p15407@pcp02138704pcs.reston01.va.comcast.net>

> I guess MUA-level filtering is just a fallback for people who don't have
> 1) a burning, all-consuming hatred of junk mail, 2) root access to all
> mail servers they rely on, and 3) the ability and inclination to install
> an MTA with every bell and whistle tweaked to keep out junk mail.

Sure. But for most people, changing their company's or ISP's server requires years of lobbying, while they have total and immediate control over their own MUA. That said, I agree that we should offer a good solution to postmasters, and I trust that your ideas are right on the mark!
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jeremy@alum.mit.edu Fri Sep 6 17:28:09 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 6 Sep 2002 12:28:09 -0400 Subject: [Spambayes] Deployment In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15736.55193.38098.486459@slothrop.zope.com>

I think one step towards deployment is creating a re-usable tokenizer for mail messages. The current codebase doesn't expose an easy-to-use or easy-to-customize tokenizer. The timtest module seems to contain an enormous body of practical knowledge about how to parse mail messages, but the module wasn't designed for re-use.

I'd like to see a module that can take a single message or a collection of messages and tokenize each one. I'd like to see the tokenizer be customizable, too. Tim had to exclude some headers from his test data, because there were particular biases in the test data. If other people have test data without those biases, they ought to be able to customize the tokenizer to include them or exclude others.

Jeremy

From tim.one@comcast.net Fri Sep 6 17:45:09 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 12:45:09 -0400 Subject: [Spambayes] test sets? In-Reply-To: <15736.54481.733005.644033@anthem.wooz.org> Message-ID:

[Barry A. Warsaw, gives answers and asks questions]

Here's the code that produced the header tokens:

    x2n = {}
    for x in msg.keys():
        x2n[x] = x2n.get(x, 0) + 1
    for x in x2n.items():
        yield "header:%s:%d" % x

Some responses:

> 0.01     19   3559 'header:X-Mailman-Version:1'
> 0.01     19   3559 'header:List-Id:1'
> 0.01     19   3557 'header:X-BeenThere:1'
>
> These three are definitely MM artifacts, although the second one
> /could/ be inserted by other list management software (it's described
> in an RFC).
Since all the ham came from Mailman, and only 19 spam had it, it's quite safe to assume then that I should ignore these for now.

> 0.01      0   3093 'header:Newsgroups:1'
> 0.01      0   3054 'header:Xref:1'
> 0.01      0   3053 'header:Path:1'
>
> These aren't MM artifacts, but are byproducts of gating a message off
> of an nntp feed.  Some of the other NNTP-* headers are similar, but I
> won't point them out below.

I should ignore these too then.

> 0.01     19   2668 'header:List-Unsubscribe:1'
> 0.01     19   2668 'header:List-Subscribe:1'
> 0.01     19   2668 'header:List-Post:1'
> 0.01     19   2668 'header:List-Help:1'
> 0.01     19   2668 'header:List-Archive:1'
>
> RFC recommended generic listserve headers that MM injects.

Ditto.

> So why do you get two entries for this one?
>
> 0.99    519      0 'header:Received:8'
> 0.99    466      1 'header:Received:7'

Read the code. The first line counts msgs that had 8 instances of a 'Received' header, and the second counts msgs that had 7 instances. I expect this is a good clue! The more indirect the mail path, the more of those thingies we'll see, and if you're posting from a spam trailer park in Tasmania you may well need to travel thru more machines.

> ...
> Note that header names are case insensitive, so this one's no
> different than "MIME-Version:".  Similarly other headers in your list.

Ignoring case here may or may not help; that's for experiment to decide. It's plausible that case is significant if, e.g., a particular spam mailing package generates unusual case, or a particular clueless spammer misconfigures his package.

> 0.02     65   3559 'header:Precedence:1'
>
> Could be Mailman, or not.  This header is supposed to tell other
> automated software that this message was automated.  E.g. a replybot
> should ignore any message with a Precedence: {bulk|junk|list}.

Rule of thumb: if Mailman inserts a thing, I should ignore it. Or, better, I should stop trying to out-think the flaws in the test data and get better test data instead!

> 0.50      4      0 'header:2:1'
>
> !?
> ...
> 0.50      0      2 'header:'
>
> Heh?

I sucked out all the wordinfo keys that began with "header:". The last line there was probably due to unrelated instances of the string "header:" in message bodies. Harder to guess about the first line.

> ...
> Some headers of course are totally unreliable as to their origin.  I'm
> thinking stuff like MIME-Version, Content-Type, To, From, etc, etc.
> Everyone sticks those in.

The brilliance of Anthony's "just count them" scheme is that it requires no thought, so can't be fooled. Header lines that are evenly distributed across spam and ham will turn out to be worthless indicators (prob near 0.5), so do no harm.

From tim.one@comcast.net Fri Sep 6 17:55:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 12:55:07 -0400 Subject: [Spambayes] test sets? In-Reply-To: <200209060811.g868Bo904031@localhost.localdomain> Message-ID:

[Anthony Baxter]
> The other thing on my todo list (probably tonight's tram ride home) is
> to add all headers from non-text parts of multipart messages.  If nothing
> else, it'll pick up most virus email real quick.

See the checkin comments for timtest.py last night. Adding this code gave a major reduction in the false negative rate:

    def crack_content_xyz(msg):
        x = msg.get_type()
        if x is not None:
            yield 'content-type:' + x.lower()

        x = msg.get_param('type')
        if x is not None:
            yield 'content-type/type:' + x.lower()

        for x in msg.get_charsets(None):
            if x is not None:
                yield 'charset:' + x.lower()

        x = msg.get('content-disposition')
        if x is not None:
            yield 'content-disposition:' + x.lower()

        fname = msg.get_filename()
        if fname is not None:
            for x in fname.lower().split('/'):
                for y in x.split('.'):
                    yield 'filename:' + y

        x = msg.get('content-transfer-encoding:')
        if x is not None:
            yield 'content-transfer-encoding:' + x.lower()

    ...
    t = ''
    for x in msg.walk():
        for w in crack_content_xyz(x):
            yield t + w
        t = '>'

I *suspect* most of that stuff didn't make any difference, but I put it all in as one blob so don't know which parts did and didn't help.

From barry@python.org Fri Sep 6 17:59:49 2002 From: barry@python.org (Barry A. Warsaw) Date: Fri, 6 Sep 2002 12:59:49 -0400 Subject: [Spambayes] test sets? References: Message-ID: <15736.57093.811682.371784@anthem.wooz.org>

  TP> A false positive *really* has to work hard then, eh?  The long
  TP> quote of a Nigerian scam letter is one of the two that made
  TP> it, and spamprob() looked at all this stuff before deciding it
  TP> was spam:

Here's an interesting thing to test: discriminate words differently if they are on a line that starts with `>' or, to catch styles like the above, where the first occurrence on a line of < or > is > (to eliminate html). Then again, it may not be worth trying to un-false-positive that Nigerian scam quote.

-Barry

From neale@woozle.org Fri Sep 6 18:13:17 2002 From: neale@woozle.org (Neale Pickett) Date: 06 Sep 2002 10:13:17 -0700 Subject: [Spambayes] Deployment In-Reply-To: <200209061506.g86F6Qo14777@pcp02138704pcs.reston01.va.comcast.net> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <15736.50015.881231.510395@12-248-11-90.client.attbi.com> <200209061506.g86F6Qo14777@pcp02138704pcs.reston01.va.comcast.net> Message-ID:

So then, Guido van Rossum is all like:

> > Basic procmail usage goes something like this:
> >
> > :0fw
> > | spamassassin -P
> >
> > :0
> > * ^X-Spam-Status: Yes
> > $SPAM
>
> Do you feel capable of writing such a tool?  It doesn't look too hard.

Not to beat a dead horse, but that's exactly what my spamcan package did. For those just tuning in, spamcan is a thingy I wrote before I knew about Tim & co's work on this crazy stuff; you can download it from , but I'm not going to work on it anymore.
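The procmail recipe quoted above works with any filter that tags messages with a header it can match on, and the tagging step itself is tiny. A sketch of just that step (the X-Spam-Status name mirrors SpamAssassin's convention; the cutoff value is a guess, and where the probability comes from is left to the classifier):

```python
def add_spam_header(text, prob, cutoff=0.90):
    """Insert an X-Spam-Status header after the existing headers of
    an RFC 2822 message given as a string.  prob is the classifier's
    spam probability for the message."""
    status = "Yes" if prob > cutoff else "No"
    tag = "X-Spam-Status: %s, prob=%.4f" % (status, prob)
    head, sep, body = text.partition("\n\n")
    return head + "\n" + tag + sep + body
```

A recipe like the quoted `* ^X-Spam-Status: Yes` can then route on the added header, no matter what produced it.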
I'm currently writing a new one based on classifier (and timtest's booty-kicking tokenizer). I'll probably have something soon, like maybe half an hour, and no, it's not too hard. The hard part is storing the data somewhere. I don't want to use ZODB, as I'd like something a person can just drop in with a default Python install. So anydbm is looking like my best option.

I already have a setup like this using Xavier Leroy's SpamOracle, which does the same sort of thing. You call it from procmail, it adds a new header, and then you can filter on that header. Really easy.

Here's how I envision this working. Everybody gets four new mailboxes:

    train-eggs
    train-spam
    trained-eggs
    trained-spam

You copy all your spam and eggs* into the "train-" boxes as you get it. How frequently you do this would be up to you, but you'd get better results if you did it more often, and you'd be wise to always copy over anything which was misclassified. Then, every night, the spam fairy swoops down and reads through your folders, learning about what sorts of things you think are eggs and what sorts of things are spam. After she's done, she moves your mail into the "trained-" folders.

This would work for anybody using IMAP on a Unix box, or folks who read their mail right off the server. I've spoken with some fellows at work about Exchange, and they seem to believe that Exchange exports appropriate functionality to implement a spam fairy as well.

Advanced users could stay ahead of the game by reprogramming their mail client to bind the key "S" to "move to train-spam" and "H" to "move to train-eggs". Eventually, if enough people used this sort of thing, it'd start showing up in mail clients. That's the "delete as spam" button Paul Graham was talking about.

* The Hormel company might not think well of using the word "ham" as the opposite of "spam", and they've been amazingly cool about the use of their product name for things thus far.
So I propose we start calling non-spam something more innocuous (and more Monty Pythonic) such as "eggs".

Neale

From tim.one@comcast.net Fri Sep 6 18:35:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 13:35:55 -0400
Subject: [Spambayes] Deployment
In-Reply-To: <15736.55193.38098.486459@slothrop.zope.com>
Message-ID:

[Jeremy Hylton]
> I think one step towards deployment is creating a re-usable tokenizer
> for mail messages.  The current codebase doesn't expose an easy-to-use
> or easy-to-customize tokenizer.

tokenize() couldn't be easier to use: it takes a string argument, and produces a stream of tokens (whether via explicit list, or generator, or tuple, or ... doesn't matter). All the tokenize() functions in GBayes.py and timtest.py are freely interchangeable this way.

Note that we have no evidence to support that a customizable tokenizer would do any good, or, if it would, in which ways customization could be helpful. That's a research issue on which no work has been done.

> The timtest module seems to contain an enormous body of practical
> knowledge about how to parse mail messages, but the module wasn't
> designed for re-use.

That's partly a failure of imagination. Splitting out all knowledge of tokenization is just a large block cut-and-paste ... there, it's done. Change the "from timtoken import tokenize" at the top to use any other tokenizer now. If you want to make it easier still, feel free to check in something better.

> I'd like to see a module that can take a single message or a collection of
> messages and tokenize each one.

The Msg and MsgStream classes in timtest.py are a start at that, but it's hard to do anything truly *useful* here when people use all sorts of different physical representations for email msgs (mboxes in various formats, one file per "folder", one file per msg, Skip's gzipped gimmick, ...).
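The drop-in contract Tim describes -- any callable that takes one message string and produces a stream of tokens -- is easy to illustrate. The sketch below is not the spambayes tokenizer; the regex and the length cutoffs are made-up assumptions, shown only to make the interface concrete:

```python
import re

def simple_tokenize(text):
    """A toy tokenizer honoring the same interface as timtoken.tokenize:
    take one string, yield token strings."""
    for word in re.findall(r"\S+", text):
        word = word.lower()
        if 2 < len(word) < 13:    # skip very short and very long "words"
            yield word

# Any generator with this shape can be swapped in for the real tokenize().
tokens = list(simple_tokenize("Check out this FREE offer NOW"))
```

Swapping tokenizers is then a one-line change at the import site, which is exactly the interchangeability being claimed.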
If you're a Python coder, you *should* find it very easy to change the guts of Msg and MsgStream to handle your peculiar scheme. Defining interfaces for these guys still needs to be done.

> I'd like to see the tokenizer be customizable, too.  Tim had to exclude
> some headers from his test data, because there were particular biases
> in the test data.  If other people have test data without those
> biases, they ought to be able to customize the tokenizer to include
> them or exclude others.

This sounds like a bottomless pit to me, and there's no easier way to customize than to edit the code. As README.txt still says, though, massive refactoring would help. Hop to it!

From whisper@oz.net Fri Sep 6 18:53:24 2002
From: whisper@oz.net (David LeBlanc)
Date: Fri, 6 Sep 2002 10:53:24 -0700
Subject: [Spambayes] Deployment
In-Reply-To:
Message-ID:

I think that when considering deployment, a solution that supports all Python platforms and not just the L|Unix crowd is desirable. Mac and PC users are more apt to be using a commercial MUA that's unlikely to offer hooking ability (at least not easily). As mentioned elsewhere, even L|Unix users may find an MUA solution easier to use than getting it added to their MTA. (SysOps make programmers look like flaming liberals ;).)

My notion of a solution for Windows/Outlook has been, as Guido described, a client-server. Client side does pop3/imap/mapi fetching (of which, I'm only going to implement pop3 initially) potentially on several hosts, spamhams the incoming mail and puts it into one file per message (qmail style?). The MUA accesses this "eThunk" as a server to obtain all the ham. Spam is retained in the eThunk, and a simple viewer would be used for manual oversight on the spam for ultimate rejection (and training of the spam filter), and the ham will go forward (after being used for training) on the next MUA fetch. eThunk would sit on a timer for 'always online' users, but I am not clear on how to support dialup users with this scheme.
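The routing decision at the heart of the eThunk idea can be sketched in a few lines. Everything here is hypothetical -- the 0.90 cutoff, the stub scorer, and the list-based stores are stand-ins, not anything from a tested setup; the point is only that spam is held for review while ham flows on to the MUA:

```python
SPAM_CUTOFF = 0.90   # arbitrary assumption, not a measured threshold

def route(msg_text, spamprob, held_spam, outbound_ham):
    """File one fetched message: spam stays in the eThunk for the
    viewer, ham is queued for the MUA's next fetch."""
    if spamprob(msg_text) >= SPAM_CUTOFF:
        held_spam.append(msg_text)
    else:
        outbound_ham.append(msg_text)

held, ham = [], []
stub_score = lambda text: 0.99 if "FREE" in text else 0.05  # stand-in scorer
route("Get rich FREE now!!!", stub_score, held, ham)
route("Meeting moved to 3pm", stub_score, held, ham)
```

A real classifier would replace stub_score, and the held-spam store would feed both the simple viewer and retraining.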
Outbound mail would use a direct path from the MUA to the MTA. Hopefully all MUAs can split the host fetch/send URLs.

IMO, end users are likely to be more interested in n-way classification. If this is available, the "simple viewer" could be enhanced to support viewing via folders and (at least for me) the Outlook nightmare is over - I would use this as my only MUA. (N.B. according to my recent readings, the best n-way classifier uses something called a "Support Vector Machine" (SVM), which is 5-8% more accurate than Naive Bayes (NB).)

I wonder if the focus of spambayes ought not to be a classifier that leaves the fetching and feeding of messages to auxiliary code? That way, it could be dropped into whatever harness suited the user's situation.

David LeBlanc
Seattle, WA USA

From guido@python.org Fri Sep 6 18:58:14 2002
From: guido@python.org (Guido van Rossum)
Date: Fri, 06 Sep 2002 13:58:14 -0400
Subject: [Spambayes] Deployment
In-Reply-To: Your message of "Fri, 06 Sep 2002 10:53:24 PDT."
References:
Message-ID: <200209061758.g86HwET15939@pcp02138704pcs.reston01.va.comcast.net>

> I wonder if the focus of spambayes ought not to be a classifier that
> leaves the fetching and feeding of messages to auxiliary code?  That
> way, it could be dropped into whatever harness suited the user's
> situation.

I see no reason to restrict the project to developing the classifier and leave the deployment to others. Attempts at deployment in the real world will surely provide additional feedback for the classifier.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From gward@python.net Fri Sep 6 19:02:23 2002
From: gward@python.net (Greg Ward)
Date: Fri, 6 Sep 2002 14:02:23 -0400
Subject: [Spambayes] test sets?
In-Reply-To:
References: <15736.54481.733005.644033@anthem.wooz.org>
Message-ID: <20020906180223.GA18250@cthulhu.gerg.ca>

On 06 September 2002, Tim Peters said:
> > Note that header names are case insensitive, so this one's no
> > different than "MIME-Version:".  Similarly other headers in your list.
>
> Ignoring case here may or may not help; that's for experiment to decide.
> It's plausible that case is significant, if, e.g., a particular spam mailing
> package generates unusual case, or a particular clueless spammer
> misconfigures his package.

Case of headers is definitely helpful. SpamAssassin has a rule for it -- if you have headers like "DATE" or "SUBJECT", you get a few more points.

Greg

--
Greg Ward http://www.gerg.ca/
God is omnipotent, omniscient, and omnibenevolent ---it says so right here on the label.

From skip@pobox.com Fri Sep 6 19:48:35 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 6 Sep 2002 13:48:35 -0500
Subject: [Spambayes] Deployment
In-Reply-To: <20020906162505.GB17800@cthulhu.gerg.ca>
References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <20020906162505.GB17800@cthulhu.gerg.ca>
Message-ID: <15736.63619.488739.691181@12-248-11-90.client.attbi.com>

    Greg> In case it wasn't obvious, I'm a strong proponent of filtering
    Greg> junk mail as early as possible, ie. right after the SMTP DATA
    Greg> command has been completed.  Filtering spam at the MUA just seems
    Greg> stupid to me -- by the time it gets to my MUA, the spammer has
    Greg> already stolen my bandwidth.

The two problems I see with filtering that early are:

1. Everyone receiving email via that server will contribute ham to the stew, making the Bayesian classification less effective.

2. Given that there will be some false positives, you absolutely have to put the mail somewhere. You can't simply delete it. (I also don't like the TMDA-ish business of replying with a msg that says, "here's what you do to really get your message to me."
That puts an extra burden on my correspondents.) As an individual, I would prefer you put spammish messages somewhere where I can review them, not an anonymous sysadmin who I might not trust with my personal email (nothing against you Greg ;-).

I personally prefer to manage this stuff at the user agent level. Bandwidth is a heck of a lot cheaper than my time.

Skip

From harri.pasanen@bigfoot.com Fri Sep 6 20:07:28 2002
From: harri.pasanen@bigfoot.com (Harri Pasanen)
Date: Fri, 6 Sep 2002 21:07:28 +0200
Subject: [Spambayes] Deployment
In-Reply-To: <15736.63619.488739.691181@12-248-11-90.client.attbi.com>
References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <20020906162505.GB17800@cthulhu.gerg.ca> <15736.63619.488739.691181@12-248-11-90.client.attbi.com>
Message-ID: <200209062107.28106.harri.pasanen@bigfoot.com>

On Friday 06 September 2002 20:48, Skip Montanaro wrote:
> Greg> In case it wasn't obvious, I'm a strong proponent of filtering
> Greg> junk mail as early as possible, ie. right after the SMTP DATA
> Greg> command has been completed.  Filtering spam at the MUA just seems
> Greg> stupid to me -- by the time it gets to my MUA, the spammer has
> Greg> already stolen my bandwidth.
>
> The two problems I see with filtering that early are:
>
> 1. Everyone receiving email via that server will contribute ham
> to the stew, making the Bayesian classification less effective.
>
> 2. Given that there will be some false positives, you absolutely
> have to put the mail somewhere.  You can't simply delete it.  (I also
> don't like the TMDA-ish business of replying with a msg that says,
> "here's what you do to really get your message to me."  That puts an
> extra burden on my correspondents.)  As an individual, I would prefer
> you put spammish messages somewhere where I can review them, not an
> anonymous sysadmin who I might not trust with my personal email
> (nothing against you Greg ;-).
>
> I personally prefer to manage this stuff at the user agent level.
> Bandwidth is a heck of a lot cheaper than my time.

I see no reason why both approaches couldn't, and shouldn't, be used. MTA level filtering would just need to use a different corpus, one that would contain illegal or otherwise commonly unapproved material for the group of people using that MTA. I'm sure that such an approach would significantly reduce the mail traffic as a first step, without giving false positives.

MUA corpus would then be personally trained -- although I'd like the option of 'downloadable' corpuses and merge functionality.

Harri

PS. Just joined the list, so pardon if my thoughts have been hashed through before.

From tim.one@comcast.net Fri Sep 6 20:24:15 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 15:24:15 -0400
Subject: [Spambayes] Deployment
In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net>
Message-ID:

[Guido]
> ...
> - A program that acts both as a pop client and a pop server.  You
> configure it by telling it about your real pop servers.
You then > > point your mail reader to the pop server at localhost. When it > > receives a connection, it connects to the remote pop servers, reads > > your mail, and gives you only the non-spam. > > FYI, I'll never trust such a scheme: I have no tolerance for false > positives, and indeed do nothing to try to block spam on any of my email > accounts now for that reason. Deliver all suspected spam to a Spam folder > instead and I'd love it. Another config parameter. The filter could add a header file. Or a ~ to the subject if you like that style. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From tim.one@comcast.net Fri Sep 6 20:21:22 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 15:21:22 -0400 Subject: [Spambayes] Deployment In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > ... > I don't know how big that pickle would be, maybe loading it each time > is fine. Or maybe marshalling.) My tests train on about 7,000 msgs, and a binary pickle of the database is approaching 10 million bytes. I haven't done anything to try to reduce its size, and know of some specific problem areas (for example, doing character 5-grams of "long words" containing high-bit characters generates a lot of database entries, and I suspect they're approximately worthless). OTOH, adding in more headers will increase the size. So let's call it 10 meg . From tim.one@comcast.net Fri Sep 6 20:43:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 15:43:56 -0400 Subject: [Spambayes] Deployment In-Reply-To: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> Message-ID: [Guido] > Takers? How is ESR's bogofilter packaged? SpamAssassin? The Perl > Bayes filter advertised on slashdot? WRT the last, it's a small pile of Windows .exe files along with cygwin1.dll. The .exes are cmdline programs. One is a POP3 proxy. 
If I currently have an email server named, say, mail.comcast.net, with user name timmy, then I change my email reader to say that my server is 127.0.0.1, and that my user name on that server is mail.comcast.net:timmy. In that way the proxy picks up both the real server and user names from what the mail reader tells it the user name is.

This is an N-way classifier (like ifile that way), and "all it does" is insert a X-Text-Classification: one_of_the_class_names_you_picked header into your email before passing it on to your mail reader. The user then presumably fiddles their mail reader to look for such headers and "do something about it" (and even Outlook can handle *that* much).

The user is responsible for generating text files with appropriate examples of each class of message, and for running the cmdline tools to train the classifier.

From whisper@oz.net Fri Sep 6 20:53:36 2002
From: whisper@oz.net (David LeBlanc)
Date: Fri, 6 Sep 2002 12:53:36 -0700
Subject: [Spambayes] Deployment
In-Reply-To:
Message-ID:

You missed the part that said that spam is kept in the "eThunk" and was viewable by a simple viewer for final disposition? Of course, with Outbloat, you could fire up PythonWin and stuff the spam into the Junk Email folder... but then you lose the ability to retrain on the user-classified ham/spam.

David LeBlanc
Seattle, WA USA

> -----Original Message-----
> From: spambayes-bounces+whisper=oz.net@python.org
> [mailto:spambayes-bounces+whisper=oz.net@python.org]On Behalf Of Tim
> Peters
> Sent: Friday, September 06, 2002 12:24
> To: spambayes@python.org
> Subject: RE: [Spambayes] Deployment
>
>
> [Guido]
> > ...
> > - A program that acts both as a pop client and a pop server.  You
> > configure it by telling it about your real pop servers.  You then
> > point your mail reader to the pop server at localhost.  When it
> > receives a connection, it connects to the remote pop servers, reads
> > your mail, and gives you only the non-spam.
>
> FYI, I'll never trust such a scheme: I have no tolerance for false
> positives, and indeed do nothing to try to block spam on any of my email
> accounts now for that reason.  Deliver all suspected spam to a Spam folder
> instead and I'd love it.
>
>
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org
> http://mail.python.org/mailman-21/listinfo/spambayes

From neale@woozle.org Fri Sep 6 20:58:33 2002
From: neale@woozle.org (Neale Pickett)
Date: 06 Sep 2002 12:58:33 -0700
Subject: [Spambayes] Deployment
In-Reply-To:
References:
Message-ID:

So then, Tim Peters is all like:
> [Guido]
> > ...
> > I don't know how big that pickle would be, maybe loading it each time
> > is fine. Or maybe marshalling.)
>
> My tests train on about 7,000 msgs, and a binary pickle of the database is
> approaching 10 million bytes.

My paltry 3000-message training set makes a 6.3MB (where 1MB=1e6 bytes) pickle. hammie.py, which I just checked in, will optionally let you write stuff out to a dbm file. With that same message base, the dbm file weighs in at a hefty 21.4MB. It also takes longer to write:

Using a database:

    real    8m24.741s
    user    6m19.410s
    sys     1m33.650s

Using a pickle:

    real    1m39.824s
    user    1m36.400s
    sys     0m2.160s

This is on a PIII at 551.257MHz (I don't know what it's *supposed* to be, 551.257 is what /proc/cpuinfo says). For comparison, SpamOracle (currently the gold standard in my mind, at least for speed) on the same data blazes along:

    real    0m29.592s
    user    0m28.050s
    sys     0m1.180s

Its data file, which appears to be a marshalled hash, is 448KB. However, it's compiled O'Caml and it uses a much simpler tokenizing algorithm written with a lexical analyzer (ocamllex), so we'll never be able to outperform it. It's something to keep in mind, though.

I don't have statistics yet for scanning unknown messages. (Actually, I do, and the database blows the pickle out of the water, but it scores every word with 0.00, so I'm not sure that's a fair test.
;) In any case, 21MB per user is probably too large, and 10MB is questionable. On the other hand, my pickle compressed very well with gzip, shrinking down to 1.8MB. Neale From gward@python.net Fri Sep 6 21:01:11 2002 From: gward@python.net (Greg Ward) Date: Fri, 6 Sep 2002 16:01:11 -0400 Subject: [Spambayes] Deployment In-Reply-To: <15736.63619.488739.691181@12-248-11-90.client.attbi.com> References: <200209061431.g86EVM114413@pcp02138704pcs.reston01.va.comcast.net> <20020906162505.GB17800@cthulhu.gerg.ca> <15736.63619.488739.691181@12-248-11-90.client.attbi.com> Message-ID: <20020906200111.GA18381@cthulhu.gerg.ca> On 06 September 2002, Skip Montanaro said: > The two problems I see with filtering that early are: > > 1. Everyone receiving email via that server will contribute ham to the > stew, making the Bayesian classification less effective. I agree -- the whole idea of automatically adding mail to the training corpus makes me really nervous. Especially when you just assume that mail not deemed to be spam is really not spam. > 2. Given that there will be some false positives, you absolutely have to > put the mail somewhere. You can't simply delete it. Absolutely! With SpamAssassin, there's a large grey area where false positives live (scores between 5 and 10, I'd say) -- hence the whole tedious mechanism of setting that mail aside, mailing nightly summaries to postmaster, and one of us logging in to rescue the false positives. Rejecting messages that score above 10 is fairly safe -- if there is the occasional false positive in there, then someone gets a bounce report telling their mail looked like spam. At least it's not discarded. As Tim has reported elsewhere, there's a much smaller grey area with the Bayesian classifier, and false positives are clustered up around 0.99 along with all the spam. Hence the idea I floated, where all rejected messages are saved, and the rejection message includes a URL where the sender can request a second chance. 
I think that would be much less burden on everyone -- the sender and the postmasters. (Also, we could make it more sophisticated than just mailing requests for a second chance to postmaster -- eg. associate each recipient address with a "responsible person" who has the authority to rescue FPs for that address. Not sure what to do about messages with multiple recipients.)

> (I also don't
> like the TMDA-ish business of replying with a msg that says, "here's
> what you do to really get your message to me."  That puts an extra
> burden on my correspondents.)

Err, then I guess you don't like the above -- especially since it's not a nice friendly TMDA-style message, but an MTA-generated bounce. Hmmm.

> I personally prefer to manage this stuff at the user agent level.
> Bandwidth is a heck of a lot cheaper than my time.

Good to hear another perspective. I'm sure some people will come up with a great MUA solution, while others concentrate on the server side. We all have our preferences...

Greg

--
Greg Ward http://www.gerg.ca/
Those of you who think you know everything really annoy those of us who do.

From tim.one@comcast.net Fri Sep 6 21:15:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 16:15:18 -0400
Subject: [Spambayes] GBayes spam filtering
In-Reply-To:
Message-ID:

[Paul Svensson]
> ...
> When a user reads a message and find that it's spam that got thru
> the filter, they need a way to send the message-id to the corpus, to
> flag it as spam.  At this point, it would be a good idea to compare the
> histogram of the new spam to each histogram in the ham corpus, and remove
> any that are similar (any good ideas how to do the comparison?),

Read the "memory-based approach" stuff in

    Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian
    and a Memory-Based Approach
    http://arxiv.org/ftp/cs/papers/0009/0009009.pdf

> or maybe if they are VERY similar simply flag them as spam.
> After recomputing the filter from the modified corpus, we could also
> re-filter the ham corpus, and remove more newfound spam that way.
>
> Characteristically of this system, the spam corpus will be
> reasonably clean (assuming the users don't abuse it too much), but
> the ham corpus will be quite dirty, containing spam that's not yet read,
> and spam that the recipient didn't bother to mark.  I'm curious how
> GBayes would handle this situation; I assume the false negative rate
> would go up, but how much?

You can run an experiment and measure it. That's almost as easy as, and much more reliable than, guessing.

From jeremy@alum.mit.edu Fri Sep 6 21:03:28 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 6 Sep 2002 16:03:28 -0400
Subject: [Spambayes] understanding high false negative rate
Message-ID: <15737.2576.315460.956295@slothrop.zope.com>

I've tried to do some testing with some personal collections of ham and spam. I'm seeing very high false negative rates. 20-30% is typical. The false positive rate is 0-3%. (Finally! I had to scrub a bunch of previously unnoticed spam from my inbox.) Both collections have about 1100 messages.

I'd like to figure out why my false negative rate is so high, but I'm not sure what details I should look at to diagnose. I'm assuming that mboxtest.py is basically correct, but it could have bugs.

One possibility is that my ham test set isn't nearly so useful as the python-list, since it isn't focused on a single topic. I've got some python email, personal correspondence, questions about my Shakespeare web site, and a few email newsletters I get on a regular basis. I've got receipts from various online order sites, mail from the company that manages my student loans, etc. Maybe the great variety in my non-spam email makes it harder to find good discriminators for spam?
Here's a sample spam distribution from a test run:

Spam distribution for this pair:
* = 3 items
 0.00  73 *************************
 2.50   0
 5.00   2 *
 7.50   0
10.00   0
12.50   1 *
15.00   0
17.50   1 *
20.00   1 *
22.50   0
25.00   2 *
27.50   0
30.00   0
32.50   0
35.00   0
37.50   0
40.00   0
42.50   0
45.00   0
47.50   0
50.00   0
52.50   0
55.00   0
57.50   1 *
60.00   0
62.50   1 *
65.00   0
67.50   0
70.00   1 *
72.50   0
75.00   0
77.50   0
80.00   2 *
82.50   2 *
85.00   2 *
87.50   0
90.00   4 **
92.50   1 *
95.00   5 **
97.50 127 *******************************************

And here's a sample false negative. (I'll quote the report so it stands out.) One thing I don't understand is how the spam probability for the message is so low, when there are several high indicators and few low indicators.

> Low prob spam! 1.64654685184e-11
> /home/jeremy/Mail/spam:242 subject: your web site has been mapped
> prob('millions') = 0.99
> prob('skip:= 40') = 0.99
> prob('"remove"') = 0.99
> prob('from:email addr:mail') = 0.99
> prob('email addr:alum') = 0.01
> prob('status') = 0.01
> prob('connected') = 0.01
> prob('returning') = 0.01
> prob('from:email addr:com>') = 0.224056
> prob('every') = 0.789741
> prob('charges') = 0.208406
> prob('free') = 0.818103
> prob('survey.') = 0.14931
> prob('officer') = 0.208406
> prob('its') = 0.155044
> prob('added') = 0.133131
> prob('current') = 0.152639
> prob('email addr:mit') = 0.01
> prob('wide') = 0.0911528
> prob('mark') = 0.136416
> prob('survey') = 0.0850202
> prob('http1:asp') = 0.88055
> prob("i'd") = 0.0470418
> prob('notices') = 0.01
>
> From VM Mon Jul 24 10:05:39 2000
> Return-Path:
> Message-ID: <0112a1010021870MARS1@mars1.internetseer.com>
> Status: RO
> From: "InternetSeer.com"
> To: jeremy@alum.mit.edu
> Subject: Your web site has been mapped
> Date: 23 Jul 2000 22:10:11 -0400
>
> Freewire has added your web site to its map of the World Wide Web.
Freewire will continue to monitor millions of links and web sites every day during its ongoing web survey.
>
> If it is important for you to know that your site is connected to the
> web at all times, Freewire has arranged with InternetSeer.com to notify
> you when your site does not respond.  This means that, AT NO CHARGE;
> InternetSeer.com will monitor your Web site every hour and send
> notification to you by email whenever your site is not connected to the
> Web.  There are NO current or future charges associated with this
> service.
>
> To begin your FREE monitoring NOW, activate your account at:
> http://www.internetseer.com/signup.asp?email=jeremy@alum.mit.edu
>
> Mark McLellan
> Chief Technology Officer
> Freewire.com
>
> Is your web site status important to you? I'd love your comments. If
> you prefer not to receive any future notices that result from our
> ongoing survey please let me know by returning this email with the word
> "remove" in the subject line.
>
> =============================================
> ##Remove: jeremy@alum.mit.edu##

Jeremy

From guido@python.org Fri Sep 6 21:26:52 2002
From: guido@python.org (Guido van Rossum)
Date: Fri, 06 Sep 2002 16:26:52 -0400
Subject: [Spambayes] understanding high false negative rate
In-Reply-To: Your message of "Fri, 06 Sep 2002 16:03:28 EDT." <15737.2576.315460.956295@slothrop.zope.com>
References: <15737.2576.315460.956295@slothrop.zope.com>
Message-ID: <200209062026.g86KQqJ03393@pcp02138704pcs.reston01.va.comcast.net>

> > Low prob spam!
1.64654685184e-11
> > /home/jeremy/Mail/spam:242 subject: your web site has been mapped
> > prob('millions') = 0.99
> > prob('skip:= 40') = 0.99
> > prob('"remove"') = 0.99
> > prob('from:email addr:mail') = 0.99
> > prob('email addr:alum') = 0.01
> > prob('status') = 0.01
> > prob('connected') = 0.01
> > prob('returning') = 0.01
> > prob('from:email addr:com>') = 0.224056
> > prob('every') = 0.789741
> > prob('charges') = 0.208406
> > prob('free') = 0.818103
> > prob('survey.') = 0.14931
> > prob('officer') = 0.208406
> > prob('its') = 0.155044
> > prob('added') = 0.133131
> > prob('current') = 0.152639
> > prob('email addr:mit') = 0.01
> > prob('wide') = 0.0911528
> > prob('mark') = 0.136416
> > prob('survey') = 0.0850202
> > prob('http1:asp') = 0.88055
> > prob("i'd") = 0.0470418
> > prob('notices') = 0.01

Looks like your ham corpus by and large has To: jeremy@alum.mit.edu in a header while your spam corpus by and large doesn't. But this one does.

Where did you gather your spam corpus? Could it be a collection of edge cases that SA didn't kill, like Barry's collection of SA false negatives?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From jeremy@alum.mit.edu Fri Sep 6 21:45:37 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 6 Sep 2002 16:45:37 -0400
Subject: [Spambayes] understanding high false negative rate
In-Reply-To: <200209062026.g86KQqJ03393@pcp02138704pcs.reston01.va.comcast.net>
References: <15737.2576.315460.956295@slothrop.zope.com> <200209062026.g86KQqJ03393@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15737.5105.690511.94543@slothrop.zope.com>

>>>>> "GvR" == Guido van Rossum writes:

  GvR> Looks like your ham corpus by and large has To:
  GvR> jeremy@alum.mit.edu in a header while your spam corpus by and
  GvR> large doesn't.  But this one does.

By and large that's true. Wouldn't it be true of any mailbox? Most of your real mail is addressed to you, but only some of the spam is.

  GvR> Where did you gather your spam corpus?
Could it be a GvR> collection of edge cases that SA didn't kill, like Barry's GvR> collection of SA false negatives? A large chunk of my spam collection is from 2000. The rest is recent, starting about the same time spambayes did. None of it was previously filtered by SA. Jeremy From tim.one@comcast.net Fri Sep 6 22:05:22 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 17:05:22 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <15737.2576.315460.956295@slothrop.zope.com> Message-ID: [Jeremy Hylton] > I've tried to do some testing with some personal collections of ham > and spam. I'm seeing very high false negative rates. 20-30% is > typical. That's very high indeed. > The false positive rate is 0-3%. (Finally! I had to scrub > a bunch of previously unnoticed spam from my inbox.) Both collections > have about 1100 messages. Does this mean you trained on about 1100 of each? > I'd like to figure out why my false negative rate is so high, but I'm > not sure what details I should look at to diagnose. I'm assuming that > mboxtest.py is basically correct, but it could have bugs. > > One possibility is that my ham test set isn't nearly so useful as the > python-list, since it isn't focused on a single topic. Heh -- when's the last time you read c.l.py ? "Python" is a very strong ham indicator, and that certainly helps. "wrote:" is an even stronger ham indicator there, and that helps even more. > I've got some python email, personal correspondence, questions about my > Shakespeare web site, and a few email newsletters I get on a regular basis. > I've got receipts from various online order sites, mail from the company > that manages my student loans, etc. Maybe the great variety in my > non-spam email makes it harder to find good discriminators for spam? Can't guess. You're in a good position to start adding more headers into the analysis, though. 
For example, an easy start would be to uncomment the header-counting lines in tokenize() (look for "Anthony"). Likely the most valuable thing it's missing then is some special parsing and tagging of Received headers.

> Here's a sample spam distribution from a test run:
>
> Spam distribution for this pair:
> * = 3 items
>  0.00  73 *************************
>  2.50   0
>  5.00   2 *
>  7.50   0
> 10.00   0
> 12.50   1 *
> 15.00   0
> 17.50   1 *
> 20.00   1 *
> 22.50   0
> 25.00   2 *
> 27.50   0
> 30.00   0
> 32.50   0
> 35.00   0
> 37.50   0
> 40.00   0
> 42.50   0
> 45.00   0
> 47.50   0
> 50.00   0
> 52.50   0
> 55.00   0
> 57.50   1 *
> 60.00   0
> 62.50   1 *
> 65.00   0
> 67.50   0
> 70.00   1 *
> 72.50   0
> 75.00   0
> 77.50   0
> 80.00   2 *
> 82.50   2 *
> 85.00   2 *
> 87.50   0
> 90.00   4 **
> 92.50   1 *
> 95.00   5 **
> 97.50 127 *******************************************

So the bulk of your f-n woes come from spam scoring near 0.0. Good to know.

> And here's a sample false negative.  (I'll quote the report so it
> stands out.)  One thing I don't understand is how the spam probability
> for the message is so low, when there are several high indicators and
> few low indicators.

You're hallucinating. Let's look:

> > Low prob spam! 1.64654685184e-11
> > /home/jeremy/Mail/spam:242 subject: your web site has been mapped
> > prob('millions') = 0.99
> > prob('skip:= 40') = 0.99
> > prob('"remove"') = 0.99
> > prob('from:email addr:mail') = 0.99
> > prob('email addr:alum') = 0.01
> > prob('status') = 0.01
> > prob('connected') = 0.01
> > prob('returning') = 0.01

Those 8 cancel out completely. They're the strongest indicators it found in both directions, and it's exactly as if they didn't exist.
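The cancellation is easy to check numerically. Below is a sketch of the Graham-style combining rule this classifier descends from -- not the spamprob() implementation itself -- where P = prod(p) / (prod(p) + prod(1-p)). Equal numbers of 0.99's and 0.01's contribute (essentially) identical factors to both products, so the eight indicators combine to a neutral 0.5, exactly as if none of them had been seen:

```python
from functools import reduce
from operator import mul

def combine(probs):
    """Graham-style combining: P = prod(p) / (prod(p) + prod(1-p))."""
    spam_prod = reduce(mul, probs, 1.0)
    ham_prod = reduce(mul, [1.0 - p for p in probs], 1.0)
    return spam_prod / (spam_prod + ham_prod)

# Four 0.99's against four 0.01's: the two products match, so the
# combined score sits at 0.5 (up to floating-point noise).
score = combine([0.99] * 4 + [0.01] * 4)
```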
I'll sort the rest from low to high: > > prob('notices') = 0.01 > > prob('email addr:mit') = 0.01 > > prob("i'd") = 0.0470418 > > prob('survey') = 0.0850202 > > prob('wide') = 0.0911528 > > prob('added') = 0.133131 > > prob('mark') = 0.136416 > > prob('survey.') = 0.14931 > > prob('current') = 0.152639 > > prob('its') = 0.155044 > > prob('officer') = 0.208406 > > prob('charges') = 0.208406 > > prob('from:email addr:com>') = 0.224056 > > prob('every') = 0.789741 > > prob('http1:asp') = 0.88055 > > prob('free') = 0.818103 So you've got 13 indicators below 0.5, versus 3 above 0.5: it's overwhelmingly in favor of ham. > > > > From VM Mon Jul 24 10:05:39 2000 > > Return-Path: > > Message-ID: <0112a1010021870MARS1@mars1.internetseer.com> > > Status: RO > > From: "InternetSeer.com" > > To: jeremy@alum.mit.edu > > Subject: Your web site has been mapped > > Date: 23 Jul 2000 22:10:11 -0400 > > > > Freewire has added your web site to its map of the World Wide > Web. Freewire will continue to monitor millions of links and web > sites every day during its ongoing web survey. > > > > If it is important for you to know that your site is connected > to the web at all times, Freewire has arranged with > InternetSeer.com to notify you when your site does not respond. > This means that, AT NO CHARGE; InternetSeer.com will monitor your > Web site every hour and send notification to you by email > whenever your site is not connected to the Web. There are NO > current or future charges associated with this service. > > > > To begin your FREE monitoring NOW, activate your account at: > > http://www.internetseer.com/signup.asp?email=jeremy@alum.mit.edu > > > > Mark McLellan > > Chief Technology Officer > > Freewire.com > > > > Is your web site status important to you? I'd love your > comments. If you prefer not to receive any future notices that > result from our ongoing survey please let me know by returning > this email with the word "remove" in the subject line.
> > > > ============================================= > > ##Remove: jeremy@alum.mit.edu## Yuck: it got two 0.01's from embedding your email address at the bottom here. From skip@pobox.com Fri Sep 6 22:20:05 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 6 Sep 2002 16:20:05 -0500 Subject: [Spambayes] Deployment In-Reply-To: <200209061924.g86JOc516514@pcp02138704pcs.reston01.va.comcast.net> References: <200209061924.g86JOc516514@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <15737.7173.450960.192144@12-248-11-90.client.attbi.com> >> FYI, I'll never trust such a scheme: I have no tolerance for false >> positives, and indeed do nothing to try to block spam on any of my >> email accounts now for that reason. Deliver all suspected spam to a >> Spam folder instead and I'd love it. Guido> Another config parameter. Nix on that idea. If you make it a config parameter, some sysadmin is bound to use it and lose important mail for a customer. Essentially every mail user agent has some sort of filtering capability. Instead of *ever* deleting mail you should simply tag it with your decision about its spamminess and send it along. The user can then configure Outlook, Outlook Express, procmail, mutt, pine, VM, ... to do what she wants with the stuff that trips the spam check. The biggest false accusation people (not limited to the spammers) make about tools like SpamAssassin is that it deletes email. It doesn't. It just tags email. Someone (the user or the sysadmin) made a decision to delete spammy mail instead of squirreling it away for further review. SA gets wrongly accused. Don't provide detractors with more ammunition than necessary. 
just-message-tags-ma'am-ly y'rs, Skip From tim.one@comcast.net Fri Sep 6 22:21:12 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 17:21:12 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <15737.5105.690511.94543@slothrop.zope.com> Message-ID: > GvR> Looks like your ham corpus by and large has To: > GvR> jeremy@alum.mit.edu in a header while your spam corpus by and > GvR> large doesn't. But this one does. [Jeremy] > By and large that's true. Wouldn't it be true of any mailbox? Most > of your real mail is addressed to you, but only some of the spam is. The "To:" header didn't play any role in this, assuming Jeremy is using timtoken.tokenize. I couldn't use To in my test runs because variants of to:bruceg are extremely common in my spam collection. The prob('email addr:mit') = 0.01 and prob('email addr:alum') = 0.01 came from the ##Remove: jeremy@alum.mit.edu## embedded in the body of the msg. > A large chunk of my spam collection is from 2000. The rest is recent, > starting about the same time spambayes did. None of it was previously > filtered by SA. Good! Even better, if Guido gets into his time machine and starts spamming you from 2000, you're all set . From tim.one@comcast.net Fri Sep 6 22:32:21 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 17:32:21 -0400 Subject: [Spambayes] Deployment In-Reply-To: Message-ID: [Tim] > My tests train on about 7,000 msgs, and a binary pickle of the database is > approaching 10 million bytes. That shrinks to under 2 million bytes, though, if I delete all the WordInfo records with spamprob exactly equal to UNKNOWN_SPAMPROB. Such records aren't needed when scoring (an unknown word gets a made-up probability of UNKNOWN_SPAMPROB). Such records are only needed for training; I've noted before that a scoring-only database can be leaner. 
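[The purge Tim describes amounts to a one-line dictionary filter before pickling; a sketch in present-day Python, with UNKNOWN_SPAMPROB and the record layout as illustrative assumptions rather than the real classifier.py definitions:]

```python
import pickle

UNKNOWN_SPAMPROB = 0.5  # assumed default; the real constant lives in classifier.py

def prune_for_scoring(wordinfo):
    """Drop records that only matter for training, not for scoring."""
    return {word: rec for word, rec in wordinfo.items()
            if rec["spamprob"] != UNKNOWN_SPAMPROB}

db = {"click":  {"spamprob": 0.99},
      "wrote:": {"spamprob": 0.01},
      "rare":   {"spamprob": UNKNOWN_SPAMPROB}}
lean = prune_for_scoring(db)
print(len(pickle.dumps(lean)) < len(pickle.dumps(db)))  # the pruned pickle is smaller
```

[An unknown word is scored with the made-up UNKNOWN_SPAMPROB anyway, so dropping those records changes nothing at scoring time.]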
In part the bloat is due to character 5-gram'ing, part due to the database being brand new so it has never been cleaned via clearjunk(), and part due to plain evil gremlins. From neale@woozle.org Fri Sep 6 22:44:33 2002 From: neale@woozle.org (Neale Pickett) Date: 06 Sep 2002 14:44:33 -0700 Subject: [Spambayes] Deployment In-Reply-To: References: Message-ID: So then, Tim Peters is all like: > [Tim] > > My tests train on about 7,000 msgs, and a binary pickle of the database is > > approaching 10 million bytes. > > That shrinks to under 2 million bytes, though, if I delete all the WordInfo > records with spamprob exactly equal to UNKNOWN_SPAMPROB. Such records > aren't needed when scoring (an unknown word gets a made-up probability of > UNKNOWN_SPAMPROB). Such records are only needed for training; I've noted > before that a scoring-only database can be leaner. That's pretty good. I wonder how much better you could do by using some custom pickler. I just checked my little dbm file and found a lot of what I would call bloat: >>> import anydbm, hammie >>> d = hammie.PersistentGrahamBayes("ham.db") >>> db = anydbm.open("ham.db") >>> db["neale"], len(db["neale"]) ('ccopy_reg\n_reconstructor\nq\x01(cclassifier\nWordInfo\nq\x02c__builtin__\nobject\nq\x03NtRq\x04(GA\xce\xbc{\xfd\x94\xbboK\x00K\x00K\x00G?\xe0\x00\x00\x00\x00\x00\x00tb.', 106) >>> d.wordinfo["neale"], len(`d.wordinfo["neale"]`) (WordInfo'(1031337979.16197, 0, 0, 0, 0.5)', 42) Ignoring the fact that there are too many zeros in there, the pickled version of that WordInfo object is over twice as large as the string representation. So we could get a 50% decrease in size just by using the string representation instead of the pickle, right? Something about that logic seems wrong to me, but I can't see what it is. Maybe pickling is good for heterogeneous data types, but every value of our big dictionary is going to have the same type, so there's a ton of redundancy. I guess that explains why it compressed so well.
Neale From skip@pobox.com Fri Sep 6 23:39:48 2002 From: skip@pobox.com (Skip Montanaro) Date: Fri, 6 Sep 2002 17:39:48 -0500 Subject: [Spambayes] understanding high false negative rate In-Reply-To: References: <15737.2576.315460.956295@slothrop.zope.com> Message-ID: <15737.11956.18745.619040@12-248-11-90.client.attbi.com> >> > ##Remove: jeremy@alum.mit.edu## Tim> Yuck: it got two 0.01's from embedding your email address at the Tim> bottom here. Which suggests that tagging email addresses in To/CC headers should be handled differently than in message bodies? Skip From tim.one@comcast.net Sat Sep 7 00:03:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 19:03:58 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <15737.11956.18745.619040@12-248-11-90.client.attbi.com> Message-ID: > >> > ##Remove: jeremy@alum.mit.edu## > > Tim> Yuck: it got two 0.01's from embedding your email address at the > Tim> bottom here. > > Which suggests that tagging email addresses in To/CC headers should be > handled differently than in message bodies? I don't know whether it suggests that, but they would be tagged differently in to/cc if I were tagging them at all right now. If I were tagging To: addresses, for example, the tokens would look like 'to:email addr:mit' instead of 'email addr:mit' as they appear when an email-like thingie is found in the body. Whether email addresses should be stuck in as one blob or split up as they are now is something I haven't tested. From tim.one@comcast.net Sat Sep 7 00:21:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 19:21:15 -0400 Subject: [Spambayes] [ANN] Trained classifier available In-Reply-To: <20020906162505.GB17800@cthulhu.gerg.ca> Message-ID: http://sf.net/project/showfiles.php?group_id=61702 This is the binary pickle of my classifier after training on my first spam/ham corpora pair. All records with spamprob == UNKNOWN_SPAMPROB have been purged. 
It's in a zip file, and is only half a meg. Jeremy, it would be interesting if you tried that on your data. The false negative rates across my other 4 test sets when run against this are: 0.364% 0.400% 0.400% 0.909% From neale@woozle.org Sat Sep 7 00:50:19 2002 From: neale@woozle.org (Neale Pickett) Date: 06 Sep 2002 16:50:19 -0700 Subject: [Spambayes] Ditching WordInfo Message-ID: I hacked up something to turn WordInfo into a tuple before pickling, and then turn the tuple back into WordInfo right after unpickling. Without this hack, my database was 21549056 bytes. After, it's 9945088 bytes. That's a 50% savings, not a bad optimization. So my question is, would it be too painful to ditch WordInfo in favor of a straight out tuple? (Or list if you'd rather, although making it a tuple has the nice side-effect of forcing you to play nice with my DBDict class). I hope doing this sort of optimization isn't too far distant from the goal of this project, even though README.txt says it is :) Diff attached. I'm not comfortable checking this in, since I don't really like how it works (I'd rather just get rid of WordInfo). But I guess it proves the point :) Neale ---8<--- ? classifier.pyc ? d ? ham.db ? ham.pickle ? ham.spamoracle ? hammie.pyc ? timtoken.pyc Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.5 diff -u -r1.5 hammie.py --- hammie.py 6 Sep 2002 20:48:29 -0000 1.5 +++ hammie.py 6 Sep 2002 23:48:34 -0000 @@ -1,7 +1,8 @@ #! /usr/bin/env python # A driver for the classifier module. Currently mostly a wrapper around -# existing stuff. +# existing stuff. Neale Pickett is the person to +# blame for this. 
"""Usage: %(program)s [options] @@ -36,6 +37,7 @@ import errno import anydbm import cPickle as pickle +from types import * program = sys.argv[0] @@ -69,11 +71,24 @@ def __getitem__(self, key): if self.hash.has_key(key): - return pickle.loads(self.hash[key]) + val = pickle.loads(self.hash[key]) + # XXX: kludge kludge kludge. There's a more elegant + # solution, but this proves the concept for the time being. + if type(val) == TupleType \ + and len(val) == len(classifier.WordInfo.__slots__): + # How does pickle pull this off? + w = classifier.WordInfo(0) + w.__setstate__(val) + val = w + return val else: raise KeyError(key) - def __setitem__(self, key, val): + def __setitem__(self, key, val): + # XXX: This has got to go when the __getitem__ kludge is cleaned + # up + if isinstance(val, classifier.WordInfo): + val = val.__getstate__() v = pickle.dumps(val, 1) self.hash[key] = v @@ -84,7 +99,7 @@ k = self.hash.first() while k != None: key = k[0] - val = pickle.loads(k[1]) + val = self.__getitem__(key) if key not in self.iterskip: if fn: yield fn((key, val)) ---8<--- From jeremy@alum.mit.edu Sat Sep 7 01:00:14 2002 From: jeremy@alum.mit.edu (Jeremy Hylton) Date: Fri, 6 Sep 2002 20:00:14 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: References: <15737.2576.315460.956295@slothrop.zope.com> Message-ID: <15737.16782.542869.368986@slothrop.zope.com> >>>>> "TP" == Tim Peters writes: >> The false positive rate is 0-3%. (Finally! I had to scrub a >> bunch of previously unnoticed spam from my inbox.) Both >> collections have about 1100 messages. TP> Does this mean you trained on about 1100 of each? The total collections are 1100 messages. I trained with 1100/5 messages. TP> Can't guess. You're in a good position to start adding more TP> headers into the analysis, though. For example, an easy start TP> would be to uncomment the header-counting lines in tokenize() TP> (look for "Anthony"). 
Likely the most valuable thing it's TP> missing then is some special parsing and tagging of Received TP> headers. I tried the "Anthony" stuff, but it didn't make any appreciable difference that I could see from staring at the false negative rate. The numbers are big enough that a quick eyeball suffices. Then I tried a dirt simple tokenizer for the headers that tokenized the words in the header and emitted them like this "%s: %s" % (hdr, word). That worked too well :-). The received and date headers helped the classifier discover that most of my spam is old and most of my ham is new. So I tried a slightly more complex one that skipped received, date, and x-from_, which all contained timestamps. I also skipped the X-VM- headers that my mail reader added:

class MyTokenizer(Tokenizer):

    skip = {'received': 1,
            'date': 1,
            'x-from_': 1,
            }

    def tokenize_headers(self, msg):
        for k, v in msg.items():
            k = k.lower()
            if k in self.skip or k.startswith('x-vm'):
                continue
            for w in subject_word_re.findall(v):
                for t in tokenize_word(w):
                    yield "%s:%s" % (k, t)

This did moderately better. The false negative rate is 7-21% over the tests performed so far. This is versus 11-28% for the previous test run that used the timtest header tokenizer. It's interesting to see that the best discriminators are all ham discriminators. There's not a single spam-indicator in the list. Most of the discriminators are header fields. One thing to note is that the presence of Mailman-generated headers is a strong non-spam indicator. That matches my intuition: I got an awful lot of Mailman-generated mail, and those lists are pretty good at suppressing spam. The other thing is that I get a lot of ham from people who use XEmacs. That's probably Barry, Guido, Fred, and me :-). One final note. It looks like many of the false positives are from people I've never met with questions about Shakespeare.
They often start with stuff like: > Dear Sir/Madam, > > May I please take some of your precious time to ask you to help me to find a > solution to a problem that is worrying me greatly. I am old science student I guess that reads a lot like spam :-(. Jeremy 238 hams & 221 spams false positive: 2.10084033613 false negative: 9.50226244344 new false positives: [] new false negatives: [] best discriminators: 'x-mailscanner:clean' 671 0.0483425 'x-spam-status:IN_REP_TO' 679 0.01 'delivered-to:skip:s 10' 691 0.0829876 'x-mailer:Lucid' 699 0.01 'x-mailer:XEmacs' 699 0.01 'x-mailer:patch' 699 0.01 'x-mailer:under' 709 0.01 'x-mailscanner:Found' 716 0.0479124 'cc:zope.com' 718 0.01 "i'll" 750 0.01 'references:skip:1 20' 767 0.01 'rossum' 795 0.01 'x-spam-status:skip:S 10' 825 0.01 'van' 850 0.01 'http0:zope' 869 0.01 'email addr:zope' 883 0.01 'from:python.org' 895 0.01 'to:jeremy' 902 0.185401 'zope' 984 0.01 'list-archive:skip:m 10' 1058 0.01 'list-subscribe:skip:m 10' 1058 0.01 'list-unsubscribe:skip:m 10' 1058 0.01 'from:zope.com' 1098 0.01 'return-path:zope.com' 1115 0.01 'wrote:' 1129 0.01 'jeremy' 1150 0.01 'email addr:python' 1257 0.01 'x-mailman-version:2.0.13' 1311 0.01 'x-mailman-version:101270' 1395 0.01 'python' 1401 0.01 From tim.one@comcast.net Sat Sep 7 01:06:56 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 20:06:56 -0400 Subject: [Spambayes] Ditching WordInfo In-Reply-To: Message-ID: [Neale Pickett] > I hacked up something to turn WordInfo into a tuple before pickling, That's what WordInfo.__getstate__ does. > and then turn the tuple back into WordInfo right after unpickling. Likewise for WordInfo.__setstate__. > Without this hack, my database was 21549056 bytes. After, it's 9945088 bytes. > That's a 50% savings, not a bad optimization. I'm not sure what you're doing, but suspect you're storing individual WordInfo pickles. 
If so, most of the administrative pickle bloat is due to that, and doesn't happen if you pickle an entire classifier instance directly. > So my question is, would it be too painful to ditch WordInfo in favor of > a straight out tuple? (Or list if you'd rather, although making it a > tuple has the nice side-effect of forcing you to play nice with my > DBDict class). > > I hope doing this sort of optimization isn't too far distant from the > goal of this project, even though README.txt says it is :) > > Diff attached. I'm not comfortable checking this in, I think it's healthy that you're uncomfortable checking things in with > + # XXX: kludge kludge kludge. comments . > since I don't really like how it works (I'd rather just get rid of WordInfo). > But I guess it proves the point :) I'm not interested in optimizing anything yet, and get many benefits from the *ease* of working with utterly vanilla Python instance objects. Lots of code all over picks these apart for display and analysis purposes. Very few people have tried this code yet, and there are still many questions about it (see, e.g., Jeremy's writeup of his disappointing first-time experiences today). Let's keep it as easy as possible to modify for now. If you're desperate to save memory, write a subclass? Other people are free to vote in other directions, of course . From tim.one@comcast.net Sat Sep 7 01:18:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 20:18:18 -0400 Subject: [Spambayes] test sets? In-Reply-To: <15736.57093.811682.371784@anthem.wooz.org> Message-ID: [Barry] > Here's an interesting thing to test: discriminate words differently if > they are on a line that starts with `>' or, to catch styles like > above, that the first occurrence on a line of < or > is > (to eliminate > html). Give me a mod to timtoken.py that does this, and I'll be happy to test it. > Then again, it may not be worth trying to un-false-positive that > Nigerian scam quote.
If there's any sanity in the world, even the original poster would be glad to have his kneejerk response blocked . OTOH, you know there are a great many msgs on c.l.py (all over Usenet) that do nothing except quote a previous post and add a one-line comment. Remove the quoted sections from those, and there may be no content left to judge except for the headers. So I can see this nudging the stats in either direction. The only way to find out for sure is for you to write some code . From tim.one@comcast.net Sat Sep 7 01:32:26 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 20:32:26 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <15737.16782.542869.368986@slothrop.zope.com> Message-ID: [Jeremy Hylton] > The total collections are 1100 messages. I trained with 1100/5 > messages. I'm reading this now as meaning that you trained on about 220 spam and about 220 ham. That's less than 10% of the sizes of the training sets I've been using. Please try an experiment: train on 550 of each, and test once against the other 550 of each. Do that a few times making a random split each time (it won't be long until you discover why directories of individual files are a lot easier to work with -- e.g., random.shuffle() makes this kind of thing trivial for me). From tim.one@comcast.net Sat Sep 7 03:51:24 2002 From: tim.one@comcast.net (Tim Peters) Date: Fri, 06 Sep 2002 22:51:24 -0400 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <15737.16782.542869.368986@slothrop.zope.com> Message-ID: [Jeremy] > The total collections are 1100 messages. I trained with 1100/5 > messages. While that's not a lot of training data, I picked random subsets of my corpora and got much better behavior (this is rates.py output; f-p rate per run in left column, f-n rate in right): Training on Data/Ham/Set1 & Data/Spam/Set1 ... 220 hams & 220 spams 0.000 1.364 0.000 0.455 0.000 1.818 0.000 1.364 Training on Data/Ham/Set2 & Data/Spam/Set2 ...
220 hams & 220 spams 0.455 2.727 0.455 0.455 0.000 0.909 0.455 2.273 Training on Data/Ham/Set3 & Data/Spam/Set3 ... 220 hams & 220 spams 0.000 2.727 0.455 0.909 0.000 0.909 0.000 1.818 Training on Data/Ham/Set4 & Data/Spam/Set4 ... 220 hams & 220 spams 0.000 2.727 0.000 0.909 0.000 0.909 0.000 1.818 Training on Data/Ham/Set5 & Data/Spam/Set5 ... 220 hams & 220 spams 0.000 1.818 0.000 1.364 0.000 0.909 0.000 2.273 total false pos 4 0.363636363636 total false neg 29 2.63636363636 Another full run with another randomly chosen (but disjoint) 220 of each in each set was much the same. The score distribution is also quite sharp: Ham distribution for all runs: * = 74 items 0.00 4381 ************************************************************ 2.50 3 * 5.00 3 * 7.50 1 * 10.00 0 12.50 0 15.00 1 * 17.50 1 * 20.00 1 * 22.50 0 25.00 0 27.50 0 30.00 1 * 32.50 0 35.00 0 37.50 0 40.00 1 * 42.50 0 45.00 0 47.50 0 50.00 0 52.50 0 55.00 0 57.50 1 * 60.00 0 62.50 0 65.00 0 67.50 1 * 70.00 0 72.50 0 75.00 0 77.50 0 80.00 0 82.50 0 85.00 0 87.50 1 * 90.00 0 92.50 2 * 95.00 0 97.50 2 * Spam distribution for all runs: * = 73 items 0.00 13 * 2.50 0 5.00 4 * 7.50 5 * 10.00 0 12.50 2 * 15.00 1 * 17.50 1 * 20.00 2 * 22.50 1 * 25.00 0 27.50 1 * 30.00 0 32.50 3 * 35.00 0 37.50 0 40.00 0 42.50 0 45.00 1 * 47.50 3 * 50.00 16 * 52.50 0 55.00 0 57.50 0 60.00 1 * 62.50 0 65.00 2 * 67.50 1 * 70.00 1 * 72.50 0 75.00 1 * 77.50 0 80.00 3 * 82.50 2 * 85.00 1 * 87.50 2 * 90.00 2 * 92.50 4 * 95.00 4 * 97.50 4323 ************************************************************ It's hard to say whether you need better ham or better spam, but I suspect better spam . 
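[The randomly chosen disjoint subsets used for runs like these are easy to produce with random.shuffle(); a sketch, where the message names are placeholders for real corpus files:]

```python
import random

def random_split(items, n_train, seed=None):
    """Shuffle a copy of items and split into disjoint (train, test) lists."""
    rng = random.Random(seed)
    pool = list(items)
    rng.shuffle(pool)
    return pool[:n_train], pool[n_train:]

# e.g. 1100 messages split into 550 for training, 550 held out for testing
msgs = ["msg%04d" % i for i in range(1100)]
train, held_out = random_split(msgs, 550, seed=1)
print(len(train), len(held_out))  # 550 550
```

[Repeating with a fresh seed each run gives the "few times making a random split each time" experiment; the two halves are always disjoint and together cover the whole corpus.]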
18 of the 30 most powerful discriminators here were HTML-related spam indicators; the top 10 overall were: '' 312 0.99 '' 329 0.99 'click' 334 0.99 '' 335 0.99 'wrote:' 381 0.01 'skip:< 10' 398 0.99 'python' 428 0.01 'content-type:text/html' 454 0.99 The HTML tags come from non-multipart/alternative HTML messages, from which HTML tags aren't stripped, and there are lots of these in my spam sets. That doesn't account for it, though. If I strip HTML tags out of those too, the rates are only a little worse: Training on Data/Ham/Set1 & Data/Spam/Set1 ... 220 hams & 220 spams 0.000 1.364 0.000 1.818 0.455 1.818 0.000 1.818 Training on Data/Ham/Set2 & Data/Spam/Set2 ... 220 hams & 220 spams 0.000 1.364 0.455 1.818 0.455 0.909 0.000 1.818 Training on Data/Ham/Set3 & Data/Spam/Set3 ... 220 hams & 220 spams 0.000 2.727 0.000 0.909 0.909 0.909 0.455 1.818 Training on Data/Ham/Set4 & Data/Spam/Set4 ... 220 hams & 220 spams 0.000 1.818 0.000 0.909 0.455 0.909 0.000 1.364 Training on Data/Ham/Set5 & Data/Spam/Set5 ... 220 hams & 220 spams 0.000 2.727 0.000 1.364 0.455 2.273 0.455 2.273 total false pos 4 0.363636363636 total false neg 34 3.09090909091 The 4th-strongest discriminator *still* finds another HTML clue, though! 'subject:Python' 164 0.01 'money' 169 0.99 'content-type:text/plain' 185 0.2 'charset:us-ascii' 191 0.127273 "i'm" 232 0.01 'content-type:text/html' 248 0.983607 ' ' 255 0.99 'wrote:' 372 0.01 'python' 431 0.01 'click' 519 0.99 Heh. I forgot all about  . From anthony@interlink.com.au Sat Sep 7 04:38:51 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 07 Sep 2002 13:38:51 +1000 Subject: [Spambayes] test sets? In-Reply-To: Message-ID: <200209070338.g873cpp20640@localhost.localdomain> > > Note that header names are case insensitive, so this one's no > > different than "MIME-Version:". Similarly other headers in your list. > > Ignoring case here may or may not help; that's for experiment to decide.
> It's plausible that case is significant, if, e.g., a particular spam mailing > package generates unusual case, or a particular clueless spammer > misconfigures his package. I found it made no difference for my testing. > The brilliance of Anthony's "just count them" scheme is that it requires no > thought, so can't be fooled . Header lines that are evenly > distributed across spam and ham will turn out to be worthless indicators > (prob near 0.5), so do no harm. zactly. I started off doing clever clever things, and, as always with this stuff, found that stupid with a rock beats smart with scissors, every time. -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Sat Sep 7 04:44:50 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 07 Sep 2002 13:44:50 +1000 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <200209062026.g86KQqJ03393@pcp02138704pcs.reston01.va.comcast.net> Message-ID: <200209070344.g873io020676@localhost.localdomain> > Looks like your ham corpus by and large has To: jeremy@alum.mit.edu in > a header while your spam corpus by and large doesn't. But this one > does. Interestingly, for me, one of the highest value spam indicators was the name of the mail host that the spam was delivered to, in the To: line. So mail to info@gin.elax2.ekorp.com was pretty much a dead cert for the filters. -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Sat Sep 7 04:50:37 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 07 Sep 2002 13:50:37 +1000 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <15737.16782.542869.368986@slothrop.zope.com> Message-ID: <200209070350.g873obE20720@localhost.localdomain> >>> Jeremy Hylton wrote > Then I tried a dirt simple tokenizer for the headers that tokenize the > words in the header and emitted like this "%s: %s" % (hdr, word). > That worked too well :-). 
The received and date headers helped the > classifier discover that most of my spam is old and most of my ham is > new. Heh. I hit the same problem, but the other way round, when I first started playing with this - I'd collected spam for a week or two, then mixed it up with randomly selected messages from my mail boxes. course, it instantly picked up on 'received:2001' as a non-ham. Curse that too-smart-for-me software. Still, it's probably a good thing to note in the documentation about the software - when collecting spam/ham, make _sure_ you try and collect from the same source. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From anthony@interlink.com.au Sat Sep 7 04:52:54 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 07 Sep 2002 13:52:54 +1000 Subject: [Spambayes] understanding high false negative rate In-Reply-To: <200209070350.g873obE20720@localhost.localdomain> Message-ID: <200209070352.g873qs820746@localhost.localdomain> > course, it instantly picked up on 'received:2001' as a non-ham. -spam. *sigh* -- Anthony Baxter It's never too late to have a happy childhood. From guido@python.org Sat Sep 7 04:51:12 2002 From: guido@python.org (Guido van Rossum) Date: Fri, 06 Sep 2002 23:51:12 -0400 Subject: [Spambayes] hammie.py vs. GBayes.py Message-ID: <200209070351.g873pC613144@pcp02138704pcs.reston01.va.comcast.net> There seem to be two "drivers" for the classifier now: Neale Pickett's hammie.py, and the original GBayes.py. According to the README.txt, GBayes.py hasn't been kept up to date. Is there anything in there that isn't covered by hammie.py? About the only useful feature of GBayes.py that hammie.py doesn't (yet) copy is -u, which calculates spamness for an entire mailbox. This feature can easily be copied into hammie.py. (GBayes.py also has a large collection of tokenizers; but timtoken.py rules, so I'm not sure how interesting that is now.) Therefore I propose to nuke GBayes.py, after adding a -u feature. 
Anyone against? (I imagine that Skip or Barry might have a stake in GBayes.py; Tim seems to have moved all code he's working to other modules.) --Guido van Rossum (home page: http://www.python.org/~guido/) From anthony@interlink.com.au Sat Sep 7 05:00:36 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Sat, 07 Sep 2002 14:00:36 +1000 Subject: [Spambayes] test sets? In-Reply-To: Message-ID: <200209070400.g8740a520809@localhost.localdomain> >>> Tim Peters wrote > > As well as the usual spam, it also has customers complaining about > > credit card charges, it has people interested in the service and > > asking questions about long distance rates, &c &c &c. Lots and lots > > of "commercial" speech, in other words. Stuff that SA gets pretty > > badly wrong. > > Can this corpus be shared? I suppose not. Almost certainly 100% not, at least not without a massive massive amount of manual cleansing. There's just too much personal data in there. > > I did have Received in there, but it's out for the moment, as it causes > > rates to drop. > That's ambiguous. Accuracy rates or error rates, ham or spam rates? It made both the f-p and f-n rates drop. I need to think a bit more about why - I'm currently thinking about a special tokeniser just for received, so that, e.g., hostnames like 'pcp736393pcs.reston01.va.comcast.net' gets turned into received:pcp736393pcs.reston01.va.comcast.net received:reston01.va.comcast.net received:va.comcast.net received:comcast.net Specialising the tokeniser for various headers actually seems to do some good - in particular, keeping the parameters and their values of the content-types makes for a good detector of korean spam. > Mining embedded http/https/ftp thingies cut the false negative rate in half > in my tests (not keying off href, just scanning for anything that "looked > like" one); that was the single biggest f-n improvement I've seen. It > didn't change the false positive rate. 
> Do you know whether src added > additional power, or did you do both at once? Both at once. I added it because
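[Anthony's Received-hostname idea, quoted in the message above -- emitting the full hostname plus each shorter domain suffix as its own token -- might look like this sketch:]

```python
def received_suffixes(hostname):
    """Emit 'received:' tokens for the hostname and each domain suffix."""
    parts = hostname.split(".")
    return ["received:" + ".".join(parts[i:])
            for i in range(len(parts) - 1)]

print(received_suffixes("pcp736393pcs.reston01.va.comcast.net"))
# ['received:pcp736393pcs.reston01.va.comcast.net',
#  'received:reston01.va.comcast.net',
#  'received:va.comcast.net',
#  'received:comcast.net']
```

[Each suffix becomes a separate clue, so a spam-heavy netblock like comcast.net can earn a probability of its own even when the full dial-up hostname is unique to one message.]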