From chris at penar.com Thu Jun 3 09:56:57 2004
From: chris at penar.com (Chris)
Date: Thu Jun 3 09:57:33 2004
Subject: [spambayes-dev] Training
Message-ID:

When I move items from "junk suspects" to my INBOX, the program is still not learning to recognize these messages as good, even though I have this option selected. Do you have any tips?

Chris Penar

From rick at unc.edu Mon Jun 14 14:29:47 2004
From: rick at unc.edu (Rick Peterson)
Date: Mon Jun 14 14:30:46 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
Message-ID: <200406141829.i5EITmMZ017077@smtp.unc.edu>

Any help appreciated...

I am using the SpamBayes Outlook Plug-in. When I 'delete-as-spam' a message it 'leaves' my Inbox for a second but then is redelivered back into my Inbox. This doesn't happen all the time and I have retrained SpamBayes a couple of times and it still happens.

Any thoughts/clues?

Ric

From tim.one at comcast.net Mon Jun 14 14:46:02 2004
From: tim.one at comcast.net (Tim Peters)
Date: Mon Jun 14 14:46:11 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
In-Reply-To: <200406141829.i5EITmMZ017077@smtp.unc.edu>
Message-ID:

[Rick Peterson]
> I am using the SpamBayes Outlook Plug-in. When I 'delete-as-spam' a
> message it 'leaves' my Inbox for a second but then is redelivered back
> into my Inbox. This doesn't happen all the time and I have retrained
> SpamBayes a couple of times and it still happens.
>
> Any thoughts/clues?

Unsure -- have never seen this myself. Perhaps you have the Inbox configured as your Spam folder, or perhaps you have an Outlook rule that's moving it back?

From rick at unc.edu Mon Jun 14 15:09:55 2004
From: rick at unc.edu (Rick Peterson)
Date: Mon Jun 14 15:10:41 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
Message-ID: <200406141909.i5EJ9tMZ027520@smtp.unc.edu>

Thanks for the quick response. My spam folder is "junkemail" and I have no rules configured.
I will try uninstalling and reinstalling the client. Can't think of anything else to do at this point. :-(

-----Original Message-----
From: Tim Peters [mailto:tim.one@comcast.net]
Sent: Monday, June 14, 2004 2:46 PM
To: 'Rick Peterson'
Cc: spambayes-dev@python.org
Subject: RE: [spambayes-dev] Deletes-as-spam come right back into InBox

[Rick Peterson]
> I am using the SpamBayes Outlook Plug-in. When I 'delete-as-spam' a
> message it 'leaves' my Inbox for a second but then is redelivered back
> into my Inbox. This doesn't happen all the time and I have retrained
> SpamBayes a couple of times and it still happens.
>
> Any thoughts/clues?

Unsure -- have never seen this myself. Perhaps you have the Inbox configured as your Spam folder, or perhaps you have an Outlook rule that's moving it back?

From gtoal at gtoal.com Wed Jun 16 02:47:57 2004
From: gtoal at gtoal.com (Graham Toal)
Date: Wed Jun 16 02:06:13 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
In-Reply-To: <200406152211.i5FMBwR5000787@gtoal.com>
References: <200406152211.i5FMBwR5000787@gtoal.com>
Message-ID: <40CFED1D.mailKP1FQ84F@gtoal.com>

I've seen the same thing with an Outlook client talking to a regular IMAP server. In fact my wife's computer does this consistently - and she is not using spambayes. If it is outlook you are using, look there for an answer rather than the mail server or the spam filter.

(and if you work it out let me know because I've spent weeks trying to work this out...)

Graham

From melis1 at freeler.nl Thu Jun 17 17:09:29 2004
From: melis1 at freeler.nl (M.P. vd Heiden)
Date: Sun Jun 20 23:57:13 2004
Subject: [spambayes-dev] update
Message-ID: <002801c454af$62c42c10$d782153e@amd>

please inform me about a update

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040617/5194b968/attachment.html

From ulf at linial.de Thu Jun 17 03:31:47 2004
From: ulf at linial.de (Linial)
Date: Mon Jun 21 00:08:47 2004
Subject: [spambayes-dev] Question Outlook PlugIn
Message-ID:

How does SpamBayes-Outlook PlugIn work together with existing-Outlook rules?

For example: I have a rule that moves all incoming mails that contain "example@test.de" to my "private" - folder. Does SpamBayes move them to the "unsure" folder if it hadn't been trained? In that case I would have to sort my mail by hand again and the Outlook rules would be useless.

Sorry but I can't test it by myself because I have many rules and tons of subfolders.

Thank you in advance

Ulf Klarmann

From kennypitt at hotmail.com Thu Jun 24 13:10:24 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Jun 24 13:12:23 2004
Subject: [spambayes-dev] Question Outlook PlugIn
In-Reply-To:
Message-ID:

Linial wrote:
> How does SpamBayes-Outlook PlugIn work together with existing-Outlook
> rules?
> For example: I have a rule that moves all incoming mails that contain
> "example@test.de" to my "private" - folder. Does SpamBayes move them
> to the "unsure" folder if it hadn't been trained?

You need to make sure that background filtering is enabled (go to the Advanced tab in SpamBayes Manager and check the "Enabled background filtering" option). It should be enabled by default in newer builds, but check it to be sure. This option gives the Outlook rules time to process messages before SpamBayes looks at them, so messages will already be moved to the destination folder before SpamBayes checks for spam.

Configure SpamBayes to only look for spam in your Inbox and not in the destination folders for your rules and everything should work just the way you want.
-- 
Kenny Pitt

From kennypitt at hotmail.com Thu Jun 24 15:01:43 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Jun 24 15:03:45 2004
Subject: [spambayes-dev] update
In-Reply-To: <002801c454af$62c42c10$d782153e@amd>
Message-ID:

M.P. vd Heiden wrote:
> please inform me about a update

At the following URL, you can subscribe to the Spambayes-announce mailing list. A notice is sent to this mailing list whenever a new version of SpamBayes is released.

http://mail.python.org/mailman/listinfo/spambayes-announce

If you are already running SpamBayes, you can also use the "Check for new version" menu item in the Outlook addin or the "Check for latest version" menu item from the Proxy application tray icon to check if you have the latest update.

-- 
Kenny Pitt

From skip at pobox.com Thu Jun 24 22:42:50 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Jun 24 22:42:59 2004
Subject: [spambayes-dev] are checkins on head okay?
Message-ID: <16603.37162.198572.350989@montanaro.dyndns.org>

Can I check stuff in on the spambayes cvs head or are we frozen awaiting the 1.0 release?

Skip

From tameyer at ihug.co.nz Fri Jun 25 19:58:39 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri Jun 25 19:58:45 2004
Subject: [spambayes-dev] are checkins on head okay?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E92A13@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677FDA@its-xchg4.massey.ac.nz>

> Can I check stuff in on the spambayes cvs head or are we
> frozen awaiting the 1.0 release?

There's a 1.0 release branch, so knock yourself out :)

(Mark and I are very close to having 1.0rc2 done - the delay is my fault - and then, hopefully, the mythical 1.0).

=Tony Meyer

---
Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy.

From ta-meyer at ihug.co.nz Sat Jun 26 03:27:49 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Sat Jun 26 03:27:59 2004
Subject: [spambayes-dev] 1.0 and beyond
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02A1@its-xchg4.massey.ac.nz>

As you might have noticed, Mark & I have put out 1.0rc2. The plan this time is that it'll work flawlessly and in about a week we'll put the same thing together for 1.0 (and do a better job of publicising it as someone (Anthony?) suggested a while back).

As I noted in answer to Skip the HEAD is free for anything that you want to do, and the 1.0_release branch should be left alone apart from really major bugs or packaging issues, until 1.0 is done.

Does anyone have a plan after that? Do we want to keep the 1.0 branch alive and copy minor things across to it for a 1.1 release (someday!), or do we want 1.1 to have lots of new features and stuff (i.e. the HEAD)?

Cheers,
Tony

BTW (on an unrelated note) sorry about my double-up replies to some spambayes@python messages recently - the mail arrived here rather sporadically (maybe the mail.python.org problems, maybe mine) and I didn't realise that many of them had been dealt with already :)

From tdickenson at geminidataloggers.com Sat Jun 26 09:47:52 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Sat Jun 26 09:47:57 2004
Subject: [spambayes-dev] correlated clues
Message-ID: <200406261447.52347.tdickenson@geminidataloggers.com>

I'm seeing a significant number of misclassified spams that come through mailing lists. If the original spam body is small then it doesn't generate enough tokens to outweigh those added by the mailing list. Manually removing those tokens from the list causes it to be firmly nailed as spam.

(To be fair, most of these small ones are viruses not spams. But spambayes does a good job of classifying those viruses that I receive direct, rather than via a list.)

Example evidence below.

Has anyone implemented or tested any mechanism to inhibit these gangs of tokens?

X-Spambayes-Classification: ham; 0.25
X-Spambayes-Evidence: '*H*': 0.67; '*S*': 0.16; 'so?': 0.11;
    'header:Received:4': 0.15; 'subject:] ': 0.16; 'url:zope': 0.19;
    'sender:addr:zope.org': 0.19; 'zope': 0.20;
    'email addr:zope.org': 0.20; 'think': 0.20;
    'to:addr:zope.org': 0.21; 'subject:Zope': 0.21;
    'sender:no real name:2**0': 0.23; 'url:mailman': 0.24;
    'url:listinfo': 0.24; 'url:mail': 0.26; 'subject:[': 0.29;
    'maillist': 0.31; 'url:org': 0.31; 'header:Errors-To:1': 0.32;
    'content-disposition:inline': 0.33; 'reply-to:none': 0.34;
    'subject:!': 0.72; 'charset:windows-1252': 0.88;
    'from:addr:info': 0.93; 'message-id:@mail.zope.org': 0.94;
    'subject:you': 0.95;
    'content-type:application/x-zip-compressed': 0.98;
    'filename:fname piece:zip': 0.98

-- 
Toby Dickenson

From sethg at GoodmanAssociates.com Sun Jun 27 19:31:49 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun Jun 27 19:31:53 2004
Subject: [spambayes-dev] spam on lists
Message-ID:

I know that running a server-side filter is a pain and requiring registration is seen as a hurdle for new posters. However, I wonder if your mail admins have considered using blacklists? In addition to the usual FCrDNS tests (I hope they're doing that), just checking the IP against a few DNSBL's would probably stop a lot of the junk. If you're not already doing this, a good place to start would be these three lists:

dnsbl.sorbs.net
sbl-xbl.spamhaus.org
bl.spamcop.net

-- 
Seth Goodman

From tim.peters at gmail.com Sun Jun 27 20:21:38 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Sun Jun 27 20:21:43 2004
Subject: [spambayes-dev] spam on lists
In-Reply-To:
References:
Message-ID: <1f7befae04062717213a8f8f52@mail.gmail.com>

[Seth Goodman]
> I know that running a server-side filter is a pain and requiring
> registration is seen as a hurdle for new posters. However, I wonder if your
> mail admins have considered using blacklists?

Well, nobody here has any access to the machines running mail.python.org. If you want to talk to them, mailto:postmaster@python.org is the way to do it. But they're all volunteers too, and generally can't make time to do anything except crisis mgmt.

From skip at pobox.com Sun Jun 27 20:30:55 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Jun 27 20:31:07 2004
Subject: [spambayes-dev] spam on lists
In-Reply-To:
References:
Message-ID: <16607.26303.639870.86327@montanaro.dyndns.org>

    Seth> I know that running a server-side filter is a pain and requiring
    Seth> registration is seen as a hurdle for new posters. However, I
    Seth> wonder if your mail admins have considered using blacklists?

What are you referring to? Who are "your mail admins"? Are you referring to just the spambayes mailing lists or to something more global (all of the lists on python.org)? The spambayes lists are explicitly not filtered because people need to send spam to them for study purposes on occasion.

If you're referring to the more global issue of mailing lists hosted on mail.python.org, a number of us are working on bringing up a new machine. I'm not sure exactly what all the various bits will be at this point, but the plan is for it to be better at rejecting spam and virii than the current machine.

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://www.spambayes.org/
skip@pobox.com

From tim.one at comcast.net Sun Jun 27 23:24:26 2004
From: tim.one at comcast.net (Tim Peters)
Date: Sun Jun 27 23:24:37 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <200406261447.52347.tdickenson@geminidataloggers.com>
Message-ID:

[Toby Dickenson]
> I'm seeing a significant number of misclassified spams that come through
> mailing lists. If the original spam body is small then it doesn't generate
> enough tokens to outweigh those added by the mailing list. Manually
> removing those tokens from the list causes it to be firmly nailed as
> spam.

Toby, which training strategy do you use? I don't have a real problem with this, but I'm using train-on-error (mistakes and unsures). A consequence is that I train on only a tiny fraction of the mailing-list ham I receive, and on a roughly equal number of mailing-list spam. As a result, the "Mailman clues" are roughly neutral for me. Under train-on-everything, Mailman clues would be strongly hammy (because I get much more ham than spam from most Mailman lists I'm on).

> (To be fair, most of these small ones are viruses not spams. But
> spambayes does a good job of classifying those viruses that I receive
> direct, rather than via a list.)

And it's still a mystery to me as to why.

> Example evidence below.
>
> Has anyone implemented or tested any mechanism to inhibit these gangs of
> tokens?

Not any I know of that aren't already implemented. Ignoring most header lines by default is implemented, and your situation would be worse if it weren't: Mailman inserts a large pile of Mailman-specific header lines too, like:

X-BeenThere: spambayes-dev@python.org
X-Mailman-Version: 2.1.5
List-Id: Development of the Pythonic Bayesian classifier
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: spambayes-dev-bounces@python.org
Errors-To: spambayes-dev-bounces@python.org

for email on this list, and we ignore almost all of those by default.

A general problem with "doing something about this" is that correlation *usually* helps in determining a correct classification, so I don't think correlation is evil per se. For example, your two strongest spam clues were

'content-type:application/x-zip-compressed': 0.98;
'filename:fname piece:zip': 0.98

and those are certainly correlated. The cases where correlation works in the wrong direction are noteworthy enough that people write about them when they occur. I don't have a good idea for identifying "bad correlation" automatically and efficiently.
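
The effect of training strategy on "Mailman clues" described above can be illustrated with a small sketch. It is not the SpamBayes source: it assumes a Robinson-style per-token estimate with a prior strength of 0.45 and a prior probability of 0.5 (those two values, and the function and parameter names, are assumptions chosen for illustration), and it assumes the final score is derived from the '*H*' and '*S*' values as (S - H + 1) / 2. Under those assumptions the arithmetic agrees with the '*H*': 0.67, '*S*': 0.16, score 0.25 evidence quoted in this thread.

```python
def token_spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Estimate how spammy one token is (sketch; s and x are assumed priors).

    ham_count / spam_count: trained messages of each kind containing the token.
    nham / nspam: total trained ham and spam messages.
    """
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    if ham_ratio + spam_ratio == 0:
        return x  # never-seen token: fall back to the prior
    prob = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    # Pull the raw estimate toward the prior x, more strongly when n is small.
    return (s * x + n * prob) / (s + n)


def combined_score(h, s):
    """Fold the ham indicator H and spam indicator S into one 0..1 score."""
    return (s - h + 1.0) / 2.0


# Train-on-error: a Mailman token trained in roughly equal ham and spam.
balanced = token_spamprob(10, 10, nham=100, nspam=100)

# Train-on-everything: the same token seen in far more list ham than spam.
hammy = token_spamprob(900, 10, nham=1000, nspam=100)

print(f"balanced training: {balanced:.3f}")
print(f"train-on-everything: {hammy:.3f}")
print(f"score for H=0.67, S=0.16: {combined_score(0.67, 0.16):.3f}")
```

With balanced counts the token comes out neutral; with the lopsided counts it becomes a strong ham clue, which is how a pile of list-added tokens can outweigh a short spam body.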

From stephena at hiwaay.net Mon Jun 28 03:42:29 2004
From: stephena at hiwaay.net (Stephen Anderson)
Date: Mon Jun 28 03:41:38 2004
Subject: [spambayes-dev] Mental Musings on Spam Catching
Message-ID: <40DF6975.22835.1DAFD79@localhost>

[Right off the bat, my email is long-winded and I apologize. I think Ben Franklin once said something like, "I apologize but I don't have time to be brief."]

I've been slowly trying to design a spam filter program using many of the lessons learned tuning spambayes as a starting place. I'm not saying I tuned spambayes but I've followed most of the discourses on the list and read through the source documentation for several ideas that predated my joining.

I'm going to give you a little background (cause somebody always asks) but I was hoping maybe some of you with more experience could give me your opinion on a filtering idea that I had. (I know, theorizing is nothing, testing is everything)

There are really four reasons I've been trying to do this. 1) It's fun and I like it; 2) My gut tells me SB is missing a lot of potentially useful information; 3) SB algorithm development has really slowed down from its heyday (which I can understand since it seems to work for most of the developers) and; 4) SB really underperforms on my father's email in comparison to most other people's, my own included. That last point isn't so much why I started trying to build my own but has been leading me to try to figure out what exactly is SB's weakness in that it is so unsure (and so often wrong) about his email.

The general direction I had been working towards is something very SB'ish with three main exceptions: 1) much more reliant on meta-tokens for email characteristics (ala SA) than SB, 2) Very different token scoring using mean and std-dev of occurrence / # of total body tokens; and 3) automatic expiration and removal of statistics after a certain time-period.

Now a facet that I had been thinking about was elimination of duplicate messages.
In the system I'm building, I don't want the system to ever be trained on the same message more than once. But obviously I don't want to store every message it has seen up to the expiration date (which I'm thinking would be 9-16 months). I also wanted to eliminate substantially duplicate emails. My googling brought me to the Distributed Checksum Clearinghouse and DCC's use of a "fuzzy checksum" algorithm to do almost exactly what I was looking for.

After discovering that, I started contemplating how else I might incorporate a fuzzy checksum. I hadn't really thought of anything, but it put me in the right frame of mind for my next thought.

I was reading through Tim Peters' very recent list response about "Correlated Values." I have reflected a lot on that very subject. How in the world can the system know that some highly correlated values should have more weight than other highly correlated values? Honestly, I've never been able to come up with any reasonable scheme that might work.

I was sitting on my think chair (flush..) contemplating all that when I asked myself how some small emails could be so obviously spam to me and yet be so hard for SB because they have a few highly correlated ham tokens and fewer highly correlated spam tokens. I realized that it's much easier for me because I'm looking at the whole picture, not on a token by token basis.

So this brought me round to the bi-grams and n-grams thinking. I'm no real fan of them because I don't like the database growth and past poor tested performance. I was also mulling over DSPAM's n-gram offshoot scheme which I can't remember the name of.

Finally it all kind of coalesced in my mind to the idea I wanted to run past everybody.

Flavors. Emails have flavors and you can mix and match flavors. Some mixes lead to good tasting emails and some lead to bad tasting emails. Just like desserts. I can have ice cream, nuts, fruit, chocolate syrup, sugar sprinkles, and other stuff.
Ice cream, fruit, and chocolate syrup is usually a winner. Nuts, sugar sprinkles, and chocolate syrup is usually not a good dessert.

What if we could use n-gram statistics to process the email as a more gestalt entity? But, instead of using n-grams on the words, what if we used it on a fixed set of potential tokens. I was thinking how one might take the token stream from the body and come up with a fuzzy checksum if you will. But instead of going the fuzzy checksum route, could we analyse the token stream to come up with meta-tokens that would indicate a more complex characteristic of the email than just a single word or token could provide?

For instance, what if after all the tokens are generated, we can look at the token stream and perform some defined analysis to make decisions about certain specific characteristics of email messages. I know the devil is in the details and I haven't thought that far ahead. Right now I'm thinking things like: HIGH-PRCNT-HTML, or TYPICAL-MAILLIST, or HIGH-PIC-CONTENT, or NESTED-ENCODING-SCHEMES, or HIGH-HAPAXE-CONTENT, or HIGH-CODE-CONTENT or HAM-BLOCKS-AND-SPAM-BLOCKS.

There should be characteristics we could objectively define about email messages where the characteristics are orthogonal and each one conveys information aggregated from the token stream but more than any single regular token could convey. Most of my examples are crap, but the high mailing list probability is a good one. There should be several email traits that we could see in the token stream that when taken together are characteristic of a mailing list message. This could then be one flavor token.

After this, we can use an n-gram scheme to give every email message a flavor list. Like this message I'm writing could be FLAVS-A,D,E where A is mailing list content, D is long-winded, and E is high sentence length. With 10 email characteristics there would be 1023 possible combinations.
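
The figure of 1023 is the number of non-empty subsets of 10 characteristics, 2^10 - 1. A minimal sketch of how such flavor-list tokens could be enumerated follows; the letters A to J, the helper names, and the choice to sort letters inside the token are assumptions made for illustration, and only the "FLAVS-A,D,E" shape comes from the message.

```python
from itertools import combinations

# Ten hypothetical characteristic labels; the message only names A, D and E.
FLAVORS = "ABCDEFGHIJ"


def flavor_token(present):
    """Build one combined token from the set of flavors found in a message."""
    return "FLAVS-" + ",".join(sorted(present))


def all_flavor_tokens(flavors=FLAVORS):
    """Enumerate every non-empty combination of flavors as a token."""
    tokens = []
    for size in range(1, len(flavors) + 1):
        for combo in combinations(flavors, size):
            tokens.append(flavor_token(combo))
    return tokens


tokens = all_flavor_tokens()
print(len(tokens))                      # number of distinct flavor lists
print(flavor_token({"E", "A", "D"}))    # the example from the message
```

Sorting the letters means the same set of characteristics always maps to the same token, so the database holds at most one entry per combination.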
That's not a lot to track in the database but I think it would be very valuable because certain flavor combinations will be very spammy and certain ones will be very hammy.

So many of my father's emails that SB stumbles on are tripped up because spammers (or dumb luck) have been able to put highly-correlated ham words inside with their short spam message. But these emails don't have the same flavor of a real ham message. I guess what I'm trying to say is that it's relatively easy to bury spam in fake ham "content" because we look at the words in isolation. It would be considerably harder to fake multiple characteristics that consider the email in its entirety.

Okay, I should sleep more and type less. Thanks for your time.

Cheers,
Steve

From skip at pobox.com Mon Jun 28 07:05:31 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Jun 28 07:05:36 2004
Subject: [spambayes-dev] spam on lists
In-Reply-To:
References: <16607.26303.639870.86327@montanaro.dyndns.org>
Message-ID: <16607.64379.201895.103938@montanaro.dyndns.org>

    Seth> However, I wonder if your mail admins have considered using
    Seth> blacklists?

    >> What are you referring to?

    Seth> All I suggested was to consider using a few DNSBL's to reject
    Seth> incoming posts based on the connecting IP's being listed.

Ah, well, I think most of us have been erroneously blacklisted. Makes me gunshy. At any rate, I believe that's probably going to fall to XS4ALL, the host for the new machine. They will be the MX for python.org and have a suite of stuff they do before passing the mail along to the new mail.python.org.

Note also that although you mentioned DNSBLs, you also used the very generic term "blacklist" right off the bat, which has very different connotations in a Spambayes context.

Skip

From missmoney at alist.co.uk Mon Jun 28 12:12:22 2004
From: missmoney at alist.co.uk (Miss Moneypennys)
Date: Mon Jun 28 07:12:27 2004
Subject: [spambayes-dev] Music2Dance2 2nd July
Message-ID: PM200012:12:22

An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040628/0e570dda/attachment.html

From anthony at interlink.com.au Mon Jun 28 09:13:20 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Jun 28 09:13:38 2004
Subject: [spambayes-dev] 1.0 and beyond
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C02A1@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13064C02A1@its-xchg4.massey.ac.nz>
Message-ID: <40E01970.3090803@interlink.com.au>

Tony Meyer wrote:
> As you might have noticed, Mark & I have put out 1.0rc2. The plan this time
> is that it'll work flawlessly and in about a week we'll put the same
> thing together for 1.0 (and do a better job of publicising it as someone
> (Anthony?) suggested a while back).
>
> As I noted in answer to Skip the HEAD is free for anything that you want to
> do, and the 1.0_release branch should be left alone apart from really major
> bugs or packaging issues, until 1.0 is done.

I would make this statement even stronger - unless you're involved in the release, you should NOT be touching the 1.0 branch until 1.0 is out the door.

> Does anyone have a plan after that? Do we want to keep the 1.0 branch alive
> and copy minor things across to it for a 1.1 release (someday!), or do we
> want 1.1 to have lots of new features and stuff (i.e. the HEAD).

Don't put new features into a 1.0.1 &c release - only include bug fixes. This saves effort (no wasted effort to backport new features, which will become harder as the trunk drifts away from the 1.0 branch), and means that people with 1.0 know that they can safely install 1.0.1, 1.0.2 and not have it break things. We do this for Python, and the feedback I get is overwhelmingly positive.
Speaking as a former sysadmin, I'm going to be much happier if I know that a bugfix release is only a bugfix release.

Anthony

-- 
Anthony Baxter
It's never too late to have a happy childhood.

From kennypitt at hotmail.com Mon Jun 28 12:12:11 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Jun 28 12:14:23 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677FDB@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
>> This screenshot is taken from a development version running from
>> source code. SpamBayes detects if it is running from the released
>> binaries or from source code, and it does not show the "Execute test
>> suite" option if running from the released binaries.
>
> We really ought to use a shot from the binary and use that. Any
> chance you want to do that?

OK, I grabbed the attached screenshot. I'm running Outlook 2003 so the toolbar style is a little different. If that's OK with everyone then I'll go ahead and check it in on the trunk and Tony can migrate it back to the 1.0 branch if desired.

-- 
Kenny Pitt

-------------- next part --------------
A non-text attachment was scrubbed...
Name: manager-select.jpg
Type: image/jpeg
Size: 14556 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040628/1b1280b1/manager-select.jpg

From iimspg at massey.ac.nz Mon Jun 28 18:33:32 2004
From: iimspg at massey.ac.nz (IIMS Postgraduate Representative)
Date: Mon Jun 28 18:33:41 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E930E6@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678014@its-xchg4.massey.ac.nz>

> OK, I grabbed the attached screenshot. I'm running Outlook
> 2003 so the toolbar style is a little different. If that's OK
> with everyone then I'll go ahead and check it in on the trunk
> and Tony can migrate it back to the 1.0 branch if desired.

Go ahead and check it in (thanks!). I notice now that the old one is actually missing an item (Filter Messages), too. The style doesn't really matter - it's not that different.

I'll leave the 1.0 one alone (it's a pretty minor thing, even though it's only documentation), but copy the change across for any 1.0.1 release.

=Tony Meyer

---
Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy.

From tameyer at ihug.co.nz Mon Jun 28 18:35:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Jun 28 18:35:47 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E931D5@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678015@its-xchg4.massey.ac.nz>

> -----Original Message-----
> From: spambayes-dev-bounces@python.org
> [mailto:spambayes-dev-bounces@python.org] On Behalf Of IIMS
> Postgraduate Representative
[...]

Oops. Don't clean your keyboard while in the middle of typing a message, or look what might happen.

=Tony Meyer

From kstocky at yahoo.com Mon Jun 28 22:37:10 2004
From: kstocky at yahoo.com (Kim Stockdale)
Date: Mon Jun 28 22:37:13 2004
Subject: [spambayes-dev] FAQ quest? Using spam bayes with outlook 2003 spam filter
Message-ID: <20040629023710.20126.qmail@web40413.mail.yahoo.com>

I've been using spam bayes with Outlook 2000 and loved it. I upgraded to Outlook 2003, and I understand Outlook's built in spam filter is automatically on. Is there anything I need to do to the Outlook 2003 spam settings in order to use Spam Bayes? Can/should they both be used?

Thank you,
Kim Stockdale

__________________________________
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail

From ta-meyer at ihug.co.nz Mon Jun 28 22:53:22 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Jun 28 22:53:29 2004
Subject: [spambayes-dev] Difference between "show clues" scoring and filter scoring
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02B0@its-xchg4.massey.ac.nz>

I just had a very odd experience - a message (the one that Tim just replied to on spambayes@python.org) arrived and ended up in my spam folder. The spam field has a value of 1.00 in it - I was quite surprised at the false positive, so I did a "Show clues", and it scores 15% (high, but ok).

I can't figure out why there is a discrepancy. There was definitely no training (I even checked the log, which confirms the move to the spam folder) between the two scorings. Does "show clues" get the message differently somehow? Is there some other explanation for this?

FWIW, there aren't any errors in the log, either.

=Tony Meyer

From mhammond at skippinet.com.au Tue Jun 29 00:29:51 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Jun 29 00:29:53 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and filter scoring
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C02B0@its-xchg4.massey.ac.nz>
Message-ID: <002701c45d91$b9bdeed0$0200a8c0@eden>

> I just had a very odd experience - a message (the one that Tim just replied
> to on spambayes@python.org) arrived and ended up in my spam folder. The
> spam field has a value of 1.00 in it - I was quite surprised at the false
> positive, so I did a "Show clues", and it scores 15% (high, but ok).

See also:
https://sourceforge.net/tracker/index.php?func=detail&aid=972359&group_id=61702&atid=498103

If the message you are referring to is a mime message, it may be the same thing. I believe the issue will be the butchery we do with the content-type and mime-armour.

In that bug, I notice that the headers, as displayed by outlook, end with:

"""
X-OriginalArrivalTime: 14 Jun 2004 01:39:10.0757 (UTC) FILETIME=[64083550:01C451B0]

------=_NextPart_000_0D33_D859C01A.71B0FC9A
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

------=_NextPart_000_0D33_D859C01A.71B0FC9A
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

------=_NextPart_000_0D33_D859C01A.71B0FC9A--
"""

Unless I am mistaken, everything after the blank line is part of the body - but as I mentioned, Outlook is returning it in the headers. I assume, but have not verified, that us fetching the headers via the MAPI property will give us the same string. I also suspect, but haven't confirmed, that this will screw up what we do with the headers, especially if we append headers *after* the blank line.

Mark.

From tameyer at ihug.co.nz Tue Jun 29 01:03:41 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jun 29 01:03:43 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring andfilter scoring
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E93294@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02B3@its-xchg4.massey.ac.nz>

> If the message you are referring to is a mime message,
> it may be the same thing. I believe the issue will be
> the butchery we do with the content-type and mime-armour.

The message has a "MIME-Version" header, but no other mime headers. I've attached the message to the tracker.

> Unless I am mistaken, everything after the blank line
> is part of the body - but as I mentioned, Outlook is
> returning it in the headers. I assume, but have not
> verified, that us fetching the headers via the MAPI
> property will give us the same string. I also suspect, but
> haven't confirmed, that this will screw up what we do
> with the headers, especially if we append headers
> *after* the blank line.
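
The concern in the quoted text can be illustrated with a short sketch. It is not the Outlook addin's code: the function names and the sample string are hypothetical, and it only shows that a header block should be cut at the first blank line before any extra header is appended, because anything after that blank line is read as body.

```python
def split_header_block(raw):
    """Split raw text at the first blank line into (headers, leftover)."""
    text = raw.replace("\r\n", "\n")
    headers, _, leftover = text.partition("\n\n")
    return headers, leftover


def append_header(raw, name, value):
    """Append a header inside the header block, discarding stray body text."""
    headers, _ = split_header_block(raw)
    return headers + "\n" + name + ": " + value


# Hypothetical sample shaped like the string described in the bug report:
# one real header, a blank line, then MIME part text that belongs to the body.
RAW = (
    "X-OriginalArrivalTime: 14 Jun 2004 01:39:10.0757 (UTC)\r\n"
    "\r\n"
    "------=_NextPart_000_0D33_D859C01A.71B0FC9A\r\n"
    "Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
)

headers, leftover = split_header_block(RAW)
fixed = append_header(RAW, "X-Spambayes-Classification", "ham")
naive = RAW.replace("\r\n", "\n") + "X-Spambayes-Classification: ham"

print(headers)
print(fixed)
```

In the naive version the appended line lands after the blank line, so a parser would treat it as body text rather than as a header.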
Does that still fit with my message that didn't have the multipart stuff?

=Tony Meyer

From tim.peters at gmail.com Tue Jun 29 01:06:34 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Tue Jun 29 01:06:41 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and filter scoring
In-Reply-To: <002701c45d91$b9bdeed0$0200a8c0@eden>
References: <002701c45d91$b9bdeed0$0200a8c0@eden>
Message-ID: <1f7befae04062822063069a9d0@mail.gmail.com>

[Mark Hammond]
> ...
> If the message you are referring to is a MIME message, it may be the same
> thing. I believe the issue will be the butchery we do with the content-type
> and mime-armour.

All email is MIME now <0.9 wink>. I think Tony must mean this msg:

http://mail.python.org/pipermail/spambayes/2004-June/013665.html

If so, it's

Content-Type: text/plain; charset=us-ascii

and has no non-trivial MIME structure (one text/plain part, no boundaries).

I'd be curious to know even why it scored 15 for Tony! There's nothing
clearly spammy about it, apart from a Yahoo blurb at the bottom, and a Yahoo
sender address.

From tameyer at ihug.co.nz Tue Jun 29 01:16:47 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jun 29 01:16:52 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and filter scoring
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E932A7@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678028@its-xchg4.massey.ac.nz>

> I'd be curious to know even why it scored 15 for Tony!
> There's nothing clearly spammy about it, apart from a
> Yahoo blurb at the bottom, and yahoo sender address.

I do get a bit of spam from Yahoo (lists), so that's part of it. The other
spammy clues are all from 6 or fewer messages, so maybe 'small database'
accounts for some of it. The imbalance probably doesn't help (though doesn't
an imbalance in this direction account more for false negatives?).
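The numbers in the clue report that follows can be checked by hand. The sketch below assumes the usual per-token adjustment toward an "unknown word" probability, with strength s = 0.45 and prior x = 0.5, and a final score of (S - H + 1) / 2. Those two constants are assumed to be the defaults; they are not stated in this thread, but they do reproduce the reported figures. The function names are illustrative only.

```python
def spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Adjusted spam probability for one token.

    s and x are ASSUMED values for the unknown-word strength and prior.
    Requires ham_count + spam_count > 0.
    """
    ham_ratio = ham_count / nham      # fraction of trained ham with the token
    spam_ratio = spam_count / nspam   # fraction of trained spam with the token
    prob = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * prob) / (s + n)

def combined_score(h, s):
    """Final score from the internal ham (*H*) and spam (*S*) scores."""
    return (s - h + 1.0) / 2.0

NHAM, NSPAM = 55, 322  # training counts from the report

# (token, #ham, #spam, spamprob as reported)
for token, ham, spam, reported in [
    ("bi:2000 and", 2, 0, 0.0918367),
    ("outlook", 4, 2, 0.108078),
    ("from:addr:yahoo.com", 0, 17, 0.987106),
]:
    print(token, spamprob(ham, spam, NHAM, NSPAM), reported)

print(combined_score(0.802823, 0.104054), 0.150615)
```

Because the ratios are taken against the 55 ham and 322 spam trained, a token seen in 4 ham and 4 spam ('built', 'spam') still comes out strongly hammy under this formula.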
Combined Score: 15% (0.150615)
Internal ham score (*H*): 0.802823
Internal spam score (*S*): 0.104054

# ham trained on: 55
# spam trained on: 322

50 Significant Tokens

token                                    spamprob    #ham  #spam
'bi:2000 and'                            0.0918367   2     0
'bi:outlook 2003'                        0.0918367   2     0
'outlook'                                0.108078    4     2
'it.'                                    0.116936    10    7
'filter'                                 0.135074    3     2
'bi:they both'                           0.155172    1     0
'skip:_ 30'                              0.155172    1     0
'built'                                  0.164747    4     4
'spam'                                   0.164747    4     4
'anything'                               0.173132    7     8
'there'                                  0.186543    22    29
'settings'                               0.210929    1     1
'upgraded'                               0.210929    1     1
"i've"                                   0.218646    7     11
'check'                                  0.257231    14    28
'skip:_ 40'                              0.260614    6     12
'use'                                    0.261442    19    39
'url:listinfo'                           0.270438    8     17
'url:mailman'                            0.270438    8     17
'before'                                 0.271457    13    28
'on.'                                    0.271748    2     4
'loved'                                  0.286634    1     2
'url:org'                                0.288353    17    40
'need'                                   0.289067    14    33
'order'                                  0.289706    6     14
'yahoo!'                                 0.294234    3     7
'bi:url:mail url:python'                 0.329461    6     17
'skip:a 10'                              0.336552    25    74
'subject:] '                             0.339934    16    48
'url:sf'                                 0.344635    3     9
'header:Errors-To:1'                     0.360176    21    69
'subject:Spambayes'                      0.371204    7     24
'sender:no real name:2**0'               0.371437    20    69
'url:yahoo'                              0.377219    4     14
'more'                                   0.38184     23    83
'subject:['                              0.390077    15    56
'bi:header:Received:6 header:From:1'     0.393314    9     34
'sender:addr:spambayes-bounces'          0.397277    6     23
'url:html'                               0.603261    12    107
'bi:header:MIME-Version:1 proto:http'    0.62941     4     40
'bi:been using'                          0.844828    0     1
'bi:than other'                          0.844828    0     1
'bi:thank you,'                          0.844828    0     1
'bi:subject:? subject: '                 0.934783    0     3
'subject:2003'                           0.934783    0     3
'understand'                             0.934783    0     3
'2003,'                                  0.949438    0     4
'subject:with'                           0.958716    0     5
'subject:\n\t'                           0.965116    0     6
'from:addr:yahoo.com'                    0.987106    0     17

=Tony Meyer

From mhammond at skippinet.com.au Tue Jun 29 01:50:42 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Jun 29 01:50:48 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To:
Message-ID: <004401c45d9d$05d302a0$0200a8c0@eden>

> OK, I grabbed the attached screenshot. I'm running Outlook 2003 so the
> toolbar style is a little different.
> If that's OK with everyone then I'll go ahead and check it in on the
> trunk and Tony can migrate it back to the 1.0 branch if desired.

Sounds and looks great!

Thanks,

Mark.

From tina at alist.co.uk Tue Jun 29 10:29:01 2004
From: tina at alist.co.uk (Tina Jones)
Date: Tue Jun 29 05:29:06 2004
Subject: [spambayes-dev] Saturday 3rd July
Message-ID:

PM200010:29:01
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040629/ad6f448a/attachment.html

From tdickenson at geminidataloggers.com Wed Jun 30 12:36:48 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Wed Jun 30 12:36:52 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
References: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
Message-ID: <200406301736.48606.tdickenson@geminidataloggers.com>

On Monday 28 June 2004 04:24, Tim Peters wrote:
> [Toby Dickenson]
>
> > I'm seeing a significant number of misclassified spams that come
> > through mailing lists.
>
> Toby, which training strategy do you use?

As you guessed, I train on everything. It's very low effort to maintain, but
it has left me with several training imbalances like this that have
adversely affected classification accuracy.

> A general problem with "doing something about this" is that correlation
> *usually* helps in determining a correct classification, so I don't think
> correlation is evil per se.

> I don't have a good idea for identifying "bad correlation"
> automatically and efficiently.

My standards are lower... For now I would be happy with a manual and
inefficient process that picked out just this one special type of correlated
clue. A couple of prototypes later, and even that isn't as easy as I hoped
:-(

I'll let you know if anything good comes of this.
--
Toby Dickenson

From lyris at newsletter.stern.de Wed Jun 30 15:39:05 2004
From: lyris at newsletter.stern.de (Lyris ListManager)
Date: Wed Jun 30 15:45:05 2004
Subject: [spambayes-dev] Automatic reply - Lifestyle newsletter
Message-ID:

Dear users,

This is an automatically generated mail. If you want to unsubscribe from
your newsletter, you must click the "Unsubscribe" button at the end of the
newsletter. Note: you must be online for this process.

If your mail client has disabled the button, you can visit the following
page: http://www.stern.de/mein-stern-de/newsletter/

In the box in the right-hand column, select the newsletter you no longer
wish to receive, then enter your e-mail address. You will be removed from
the distribution list immediately.

Your stern.de team.

From gbrown at alumni.caltech.edu Wed Jun 30 15:46:38 2004
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Wed Jun 30 15:46:39 2004
Subject: [spambayes-dev] Phishing detection?
In-Reply-To: <200406301736.48606.tdickenson@geminidataloggers.com>
Message-ID: <01d101c45eda$f5c3fc40$0d08a8c0@Glenn>

HTML like the following is a very strong indicator of phishing. That is, if
both the HREF and content of an anchor are URLs, but are different URLs,
then someone is phishing; especially if the servers are different. Heck,
just the visible content part of an anchor being a URL is phishy.

Joe Spammer > Joe Spammer > https://wwww.citibank.com/signin/confirmation.jsp

Anybody think this worth a "URL:in_anchor" token (easier to implement) or
"URL:contradicts_anchor" token (harder)? Easier yet would be
"URL:IPaddr_instead_of_hostname", but I know the weaker "URL:0" to "URL:255"
tokens are already generated.
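The two proposed tokens could be prototyped roughly as follows. This is a sketch using only Python's standard `html.parser` and `urllib.parse`. It is not the SpamBayes tokenizer, the class and function names are hypothetical, and the URLs in the usage lines are placeholders (`example.com`, `192.0.2.1`), not taken from any real message. "Contradicts" is approximated here as "the visible URL and the HREF name different hosts".

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class AnchorTokenizer(HTMLParser):
    """Collect the proposed anchor tokens from one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tokens = set()
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # visible text collected inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        text = "".join(self._text).strip()
        # The visible part of the anchor is itself a URL.
        if text.lower().startswith(("http://", "https://")):
            self.tokens.add("URL:in_anchor")
            # The visible URL names a different host than the real target.
            if urlsplit(text).hostname != urlsplit(self._href).hostname:
                self.tokens.add("URL:contradicts_anchor")
        self._href = None

def anchor_tokens(html):
    parser = AnchorTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

print(anchor_tokens('<a href="http://192.0.2.1/login">https://www.example.com/signin</a>'))
print(anchor_tokens('<a href="http://www.example.com/">click here</a>'))
```

Comparing hosts rather than whole URLs keeps legitimate links whose visible text is a shortened form of the target from generating the stronger token.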
--Glenn

From tim.peters at gmail.com Wed Jun 30 20:00:38 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Wed Jun 30 20:00:46 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <200406301736.48606.tdickenson@geminidataloggers.com>
References: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com> <200406301736.48606.tdickenson@geminidataloggers.com>
Message-ID: <1f7befae040630170030eef019@mail.gmail.com>

[Tim Peters]
>> I don't have a good idea for identifying "bad correlation"
>> automatically and efficiently.

[Toby Dickenson]
> My standards are lower... For now I would be happy with a manual and
> inefficient process that picked out just this one special type of
> correlated clue. A couple of prototypes later, and even that isn't as
> easy as I hoped :-(
>
> I'll let you know if anything good comes of this.

If you like, talk about what you tried, and what happened. The history of
trying new gimmicks here is overwhelmingly a story of failure, and that's
normal for this kind of classifier. Sharing what doesn't work saves overall
effort too.

We have two anti-bad-correlation gimmicks now, driven by early testing
results, and rationalized after the fact:

1. As mentioned last time, ignoring most header lines. If we didn't,
   virtually all spam on mailing lists would score unsure or FN (thanks
   to a large number of distinct but correlated "I came from a mailing
   list" header tokens).

2. Stripping most evidence of HTML (like throwing away all HTML tags).
   If we didn't, virtually all HTML email would score unsure or FP
   (thanks to a huge number of distinct but correlated "HTML was used"
   body tokens).

#1 bothers me more than #2 -- I view #2 mostly as a way of scoring content
(what the user sees) instead of encoding, and I think that's a good thing
to strive for independent of the correlation aspect. #1 is more purely a
hack.

Maybe another pure but personalized hack would be to add a list of specific
tokens you want the classifier to pretend didn't exist.
For example, 'url:zope' is probably strongly hammy for you, but isn't needed to get Zope mailing list ham scored as ham. If that's so, the only visible effect it has is to reduce the score on Zope mailing list spam. OTOH, most Zope mailing lists will soon have a members-only posting policy, so fighting Zope mailing list spam may soon be yesterday's war.
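
A personalized ignore list of this kind could be as small as a filter placed between the tokenizer and the classifier. The sketch below is illustrative only: the names `IGNORED_TOKENS` and `drop_ignored` are hypothetical, and nothing here is an existing SpamBayes option.

```python
# Hypothetical per-user list of tokens the classifier should never see.
IGNORED_TOKENS = frozenset({"url:zope"})

def drop_ignored(tokens, ignored=IGNORED_TOKENS):
    """Yield every token except those on the ignore list, in order."""
    return (tok for tok in tokens if tok not in ignored)

stream = ["url:zope", "subject:Spambayes", "url:org", "url:zope"]
print(list(drop_ignored(stream)))
```

Applying the same filter during both training and scoring would keep an ignored token from ever acquiring a spamprob in the first place.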