From jerzee_guy at hotmail.com Tue Oct 3 13:49:48 2006
From: jerzee_guy at hotmail.com (Chet C)
Date: Tue, 03 Oct 2006 07:49:48 -0400
Subject: [spambayes-dev] SpamBayes Stopped Working
Message-ID:

I'm using Outlook 2002. SpamBayes was working fine. No problems other than when I once deleted my Suspect Junk Mail directory, but that was easy enough to correct.

What has happened is this. When I would get SPAM, highlight the message and click "Delete As SPAM", nothing happened. I checked the FAQ and saw the question regarding something similar, and the suggestion was to go to another directory (folder) and then go back and it would be gone. This wasn't the case for me. So I thought perhaps the program had become corrupted and removed it. I then restarted my computer just to make sure everything was cleared. I went back into Outlook and the SpamBayes buttons were still there. I thought that was odd. So I checked the Program folder. SpamBayes was gone. So I restarted again. Buttons for SpamBayes remained.

OK. I decided to reload the program. Once done, I restarted the computer again and went into Outlook. Buttons were there as expected, but the same situation. The buttons weren't doing anything.

I haven't checked the forums to see if anyone else has had this problem but I'll do that as I get time. Just thought you might want to know if hundreds of people haven't already e-mailed.

Chet

From rmezzone at pjsolomon.com Tue Oct 3 13:52:35 2006
From: rmezzone at pjsolomon.com (Robert Mezzone)
Date: Tue, 3 Oct 2006 07:52:35 -0400
Subject: [spambayes-dev] SpamBayes Stopped Working
Message-ID:

Try Help, About Outlook, Disabled Items.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061003/0cd13350/attachment.html

From PHunt at aminc.com Wed Oct 11 18:44:16 2006
From: PHunt at aminc.com (Philip Hunt)
Date: Wed, 11 Oct 2006 09:44:16 -0700
Subject: [spambayes-dev] Terminal Services
Message-ID:

Dear Sirs,

Can I use SpamBayes in a multi-user environment such as Microsoft Terminal Server? I have installed it, but only the first person to use Outlook gets the SpamBayes filter.

Thank you,
Philip Hunt

From tdean at abbrev.com.au Fri Oct 20 07:24:01 2006
From: tdean at abbrev.com.au (Tim Dean)
Date: Fri, 20 Oct 2006 15:24:01 +1000
Subject: [spambayes-dev] Media enquiry
Message-ID: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au>

Hi folks,

I'm currently writing a column in PC Authority magazine (www.pcauthority.com.au ) on the new wave of spam that use randomised or semi-randomised words to confound Bayesian filters.

I'm looking for a developer for SpamBayes who would be willing to help me understand the issue and who can make a few comments on the impact of this kind of spam on filters such as SpamBayes, and how spam is evolving in general.

Any information, comments or quotes would be greatly appreciated. You can contact me through this email address: tdean at abbrev.com.au.

Best regards,
Tim Dean
Freelance journalist
w. www.timstechguide.com.au
e. tdean at abbrev.com.au
p. (02) 9518 3481
m. 0412 560 365
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061020/004aae25/attachment.htm

From anthony at interlink.com.au Fri Oct 20 10:29:32 2006
From: anthony at interlink.com.au (Anthony Baxter)
Date: Fri, 20 Oct 2006 18:29:32 +1000
Subject: [spambayes-dev] Media enquiry
In-Reply-To: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au>
References: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au>
Message-ID: <200610201829.33030.anthony@interlink.com.au>

> I'm currently writing a column in PC Authority magazine
> (www.pcauthority.com.au ) on the new wave
> of spam that use randomised or semi-randomised words to confound Bayesian
> filters.

I can take this, cos it's in my timezone, if no-one else wants to. I figure the key point about the random word spam is that it's just trying to overwhelm the bayesian filters. Personally, I'm finding them _slightly_ effective (2 or 3 a day slip through if they hit the right words) but not significantly more than that. Fundamentally, they still have to put words in that sell a product, and that screws them over.

Anthony

From tameyer at ihug.co.nz Fri Oct 20 10:55:28 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri, 20 Oct 2006 21:55:28 +1300
Subject: [spambayes-dev] Media enquiry
In-Reply-To: <200610201829.33030.anthony@interlink.com.au>
References: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au> <200610201829.33030.anthony@interlink.com.au>
Message-ID: <42826C40-3902-4D48-BA2B-B7B82C7331C7@ihug.co.nz>

>> I'm currently writing a column in PC Authority magazine
>> (www.pcauthority.com.au ) on the new wave
>> of spam that use randomised or semi-randomised words to confound
>> Bayesian filters.
>
> I can take this, cos it's in my timezone,

Plus you speak the local language ;)

> if no-one else wants to.

Sounds good to me :)

> I figure the key point about the random word spam is that it's just
> trying to overwhelm the bayesian filters. Personally, I'm finding them
> _slightly_ effective (2 or 3 a day slip through if they hit the right
> words) but not significantly more than that. Fundamentally, they still
> have to put words in that sell a product, and that screws them over.

I think that people have shown that random words are pretty ineffective (e.g. John Graham-Cumming at 2004's MIT Spam Conference). Random paragraphs (those news clippings and the like) are a bit more effective. I think that image-based spam is clearly far superior to any sort of random-word technique, though (although some of the image-spam also has the random words - I'm not sure that really helps the spammer, though).

=Tony.Meyer

From anthony at interlink.com.au Fri Oct 20 11:36:04 2006
From: anthony at interlink.com.au (Anthony Baxter)
Date: Fri, 20 Oct 2006 19:36:04 +1000
Subject: [spambayes-dev] Media enquiry
In-Reply-To: <42826C40-3902-4D48-BA2B-B7B82C7331C7@ihug.co.nz>
References: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au> <200610201829.33030.anthony@interlink.com.au> <42826C40-3902-4D48-BA2B-B7B82C7331C7@ihug.co.nz>
Message-ID: <200610201936.09140.anthony@interlink.com.au>

On Friday 20 October 2006 18:55, Tony Meyer wrote:
> >> I'm currently writing a column in PC Authority magazine
> >> (www.pcauthority.com.au ) on the new wave
> >> of spam that use randomised or semi-randomised words to confound
> >> Bayesian filters.
> >
> > I can take this, cos it's in my timezone,
>
> Plus you speak the local language ;)

Does babelfish not have a Kiwi-ese to English translator? Pah.

> I think that people have shown that random words are pretty
> ineffective (e.g. John Graham-Cumming at 2004's MIT Spam
> Conference). Random paragraphs (those news clippings and the like)
> are a bit more effective. I think that image-based spam is clearly
> far superior to any sort of random-word technique, though (although
> some of the image-spam also has the random words - I'm not sure that
> really helps the spammer, though).

The stuff that slips through for me tends to have a lot of individual lines from various news articles, smashed together randomly. My favourite one (this didn't get through - I just noticed it when emptying my spam box) was one that had something like

[%RANDOM_LINE_1%]
[%RANDOM_LINE_2%]
[%RANDOM_LINE_3%]
[%RANDOM_LINE_4%]

Ah spammers - clearly they're the best and the brightest. :)

The nasty one which I've only seen occasionally would be one that spammed by replying to an email you'd already sent (either from a public mailing list archive or from the mailbox of a compromised PC). Fortunately, the cost to individualise spams like this is much much higher than mass random blasting, so I've seen very very little of it. The ones I have seen seem to be manually entered - someone will reply to a post with "Have you heard about XYZspamproduct" with a link.

Image spam could be more of a problem, except that the less text in the message, the more header clues come into play, as well. While SB doesn't do a massive amount of, for instance, RBL checking, defense in depth (spamassassin+graylisting on the server, SB on the client) seems pretty effective.

I'll email the guy back.

Anthony
--
Anthony Baxter
It's never too late to have a happy childhood.

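The argument running through this thread, that random filler words do little to move a Bayesian score while the sales words still count against the message, can be sketched with a toy scorer. This is only an illustration and not SpamBayes's actual classifier: the token probabilities, the 0.5 value for unknown tokens, the 0.1 strength cutoff, and the function name `spam_score` are all assumptions made up for the example.

```python
import math

# Hypothetical per-token spam probabilities, as a trained filter might hold.
TOKEN_SPAM_PROB = {
    "viagra": 0.99,
    "cheap": 0.90,
    "meeting": 0.05,
    "python": 0.02,
}
UNKNOWN_PROB = 0.5   # assumed probability for a token never seen in training
MIN_STRENGTH = 0.1   # assumed cutoff: skip tokens this close to neutral


def spam_score(tokens):
    """Combine token probabilities naive-Bayes style, in log space."""
    log_spam = 0.0
    log_ham = 0.0
    for tok in set(tokens):
        p = TOKEN_SPAM_PROB.get(tok, UNKNOWN_PROB)
        if abs(p - 0.5) < MIN_STRENGTH:
            continue  # near-neutral tokens carry no evidence either way
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Turn the two log-likelihoods back into a score between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


pitch = ["cheap", "viagra"]
# Random filler the filter has never seen before.
padded = pitch + ["zebra", "quixotic", "lantern", "marmalade"]
# Filler that happens to include words this user's ham is full of.
lucky = pitch + ["meeting", "python"]
```

In this sketch `spam_score(padded)` matches `spam_score(pitch)`, because the unknown filler is skipped entirely, while `spam_score(lucky)` comes out lower. That lines up with the "if they hit the right words" remark above: filler only helps the spammer when it happens to contain tokens that are strongly hammy for the particular recipient.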
From skip at pobox.com Fri Oct 20 13:05:58 2006
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 20 Oct 2006 06:05:58 -0500
Subject: [spambayes-dev] Media enquiry
In-Reply-To: <200610201829.33030.anthony@interlink.com.au>
References: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au> <200610201829.33030.anthony@interlink.com.au>
Message-ID: <17720.44438.679082.810296@montanaro.dyndns.org>

    >> I'm currently writing a column in PC Authority magazine
    >> (www.pcauthority.com.au ) on the new
    >> wave of spam that use randomised or semi-randomised words to confound
    >> Bayesian filters.

    Anthony> I can take this, cos it's in my timezone, if no-one else wants
    Anthony> to. I figure the key point about the random word spam is that
    Anthony> it's just trying to overwhelm the bayesian filters. Personally,
    Anthony> I'm finding them _slightly_ effective (2 or 3 a day slip
    Anthony> through if they hit the right words) but not significantly more
    Anthony> than that. Fundamentally, they still have to put words in that
    Anthony> sell a product, and that screws them over.

The random word spam generally relies on the fact that the actual message is in the attached GIF image. It's not so much that the random words are defeating the filter. It's more that not enough useful tokens are being extracted from the image. Hence the move toward OCR approaches. Spammers evolve. Spam filters evolve.

Skip

From skip at pobox.com Fri Oct 20 13:14:56 2006
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 20 Oct 2006 06:14:56 -0500
Subject: [spambayes-dev] Media enquiry
In-Reply-To: <200610201936.09140.anthony@interlink.com.au>
References: <5ci99q$t0cv2i@iinet-mail.icp-qv1-irony5.iinet.net.au> <200610201829.33030.anthony@interlink.com.au> <42826C40-3902-4D48-BA2B-B7B82C7331C7@ihug.co.nz> <200610201936.09140.anthony@interlink.com.au>
Message-ID: <17720.44976.106145.348642@montanaro.dyndns.org>

    Anthony> Image spam could be more of a problem, except that the less
    Anthony> text in the message, the more header clues come into play, as
    Anthony> well. While SB doesn't do a massive amount of, for instance,
    Anthony> RBL checking, defense in depth (spamassassin+graylisting on the
    Anthony> server, SB on the client) seems pretty effective.

Have the spammers still not figured out how to defeat greylisting? (I suppose they may just not have the time to wait for the timeout on a compromised machine.) I've run postgrey for a couple years. Maybe that's one reason I don't see as much junk.

Oh, another thing. I read my mail through XEmacs+VM and very rarely get legitimate email containing GIFs. When I do, legitimate or not, it's clear that the message has an image attached. I noticed with email clients like Thunderbird, that's not always the case. The GIF images might look like plain (though often colored, blinking) text when rendered. This became obvious to me when a guy at work showed me such a spam. He couldn't figure out why the spam filter at work hadn't caught it because it obviously had lots of spammy text. I explained that was actually a GIF image being displayed. He has a PhD in Computer Science and is an extremely bright guy, so I'm sure on casual glance lots of people focus on the random text and don't realize the sales pitch is embedded in an image, and conclude it must be the gibberish that's defeating the spam filter. It's just that the spam filter can't see what you see.

Skip

From kaspervg at post.cybercity.dk Mon Oct 23 00:52:39 2006
From: kaspervg at post.cybercity.dk (Kasper Vibe Grevsen)
Date: Mon, 23 Oct 2006 00:52:39 +0200
Subject: [spambayes-dev] NetPbm for Windows
Message-ID: <000601c6f62c$c7e73490$0600000a@kasper>

Hi friends,

just thought I would share this link:

NetPbm for Windows: http://gnuwin32.sourceforge.net/packages/netpbm.htm

--
Best regards
Vibe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061023/b5902463/attachment.htm

From skip at pobox.com Mon Oct 23 05:25:48 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 22 Oct 2006 22:25:48 -0500
Subject: [spambayes-dev] NetPbm for Windows
In-Reply-To: <000601c6f62c$c7e73490$0600000a@kasper>
References: <000601c6f62c$c7e73490$0600000a@kasper>
Message-ID: <17724.13884.721796.43878@montanaro.dyndns.org>

    Kasper> just thought I would share this link:
    Kasper> NetPbm for Windows: http://gnuwin32.sourceforge.net/packages/netpbm.htm

Thanks. I presume you're referring to the use of NetPBM to convert GIF/JPEG/PNG images into NetPBM for use as input to ocrad. I got rid of that code a few weeks ago and now just use PIL (http://www.pythonware.com/products/pil/) exclusively.

Skip

From stevenk at workingsystems.com Tue Oct 24 18:51:50 2006
From: stevenk at workingsystems.com (Steven Kant)
Date: Tue, 24 Oct 2006 09:51:50 -0700
Subject: [spambayes-dev] Question
Message-ID:

In my installation of SpamBayes (1.0.4) with Outlook 2003, I have set the Manager | Filtering Tab to Mark Spam as Read, but it does not do this, so Spam keeps coming up as unread mail. This means that I have to keep looking at messages that are Spam.

Any ideas?

Steven Kant
Working Systems Inc.
218 1/2 West Fourth Ave
Olympia WA 98501
(360) 943-7640 x103
toll-free: (866)-396-6767
fax: (360) 943-0596
http://www.workingsystems.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061024/31ac2fc3/attachment.html

From dk7x at berkeley.edu Mon Oct 30 01:08:42 2006
From: dk7x at berkeley.edu (dk7x at berkeley.edu)
Date: Sun, 29 Oct 2006 16:08:42 -0800 (PST)
Subject: [spambayes-dev] Design Doc for Tokenizer
Message-ID: <63396.24.4.231.235.1162166922.squirrel@calmail.berkeley.edu>

Hey all,

I'm currently doing a UC Berkeley research project. We would like to understand what interactions the tokenizer has with the different modules. Is there any documentation available that describes the different modules? We are interested in what the email representation is after email is tokenized and going into the learner and classifier. In addition, we would like to isolate the tokenizer. Any help would be appreciated.

Thanks in advance for your response.

Kai Xia

From tameyer at ihug.co.nz Mon Oct 30 07:28:52 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon, 30 Oct 2006 19:28:52 +1300
Subject: [spambayes-dev] Design Doc for Tokenizer
In-Reply-To: <63396.24.4.231.235.1162166922.squirrel@calmail.berkeley.edu>
References: <63396.24.4.231.235.1162166922.squirrel@calmail.berkeley.edu>
Message-ID: <075826FD-2603-4915-B88F-1764465C85ED@ihug.co.nz>

> We would like to understand what interactions the tokenizer has with
> the different modules.

The tokenizer reads the options to know what tokenizing to do. Any of the other modules that need to tokenize a message use the tokenizer. That's about it*.

> Is there any documentation available that describes the different
> modules?

There's README-DEVEL.txt in the source, and the (extensive) comments in the code. Feel free to ask questions here.

> We are interested in what the email representation is after email is
> tokenized and going into the learner and classifier.

The email is an iterable (generator in this case, but any iterable would do) of strings.

> In addition, we would like to isolate the tokenizer.

Already done - tokenizer.py is already isolated from the rest of SpamBayes, other than the options (which control what tokenization is done).

=Tony.Meyer

* Ok, not quite all. The experimental URL slurping option imports the classifier, because it only generates tokens if the score is already known to be unsure, and the tokenizer doesn't otherwise know anything about score. If this became non-experimental a tidier way would be found for this. The experimental image tokenization also uses the ImageStripper module. And the tokenizer uses mboxutils.get_message so that you can pass a string, file, or something like that, or an email.Message object, to tokenize (this is just convenience, really).

From matt at matt-good.net Mon Oct 30 22:38:05 2006
From: matt at matt-good.net (Matt Good)
Date: Mon, 30 Oct 2006 16:38:05 -0500
Subject: [spambayes-dev] effective tokenizer for wiki text
Message-ID: <1162244286.5924.11.camel@nny>

The Trac[1] project has resurrected work on a SpamBayes plugin for filtering Wiki and ticket edits after finding the current Akismet system to be unreliable. Tony Meyer added some comments[2] to the Wiki suggesting that we write a custom tokenizer instead of using the built-in email-centric tokenizer.

Are there examples from other people that have written custom tokenizers that may be helpful, or do you have any hints on what to take into account for writing an effective tokenizer for Wiki text?

-- Matt Good

[1] http://trac.edgewall.org
[2] http://trac.edgewall.org/wiki/SpamFilter#Bayes

From skip at pobox.com Mon Oct 30 22:50:09 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 30 Oct 2006 15:50:09 -0600
Subject: [spambayes-dev] effective tokenizer for wiki text
In-Reply-To: <1162244286.5924.11.camel@nny>
References: <1162244286.5924.11.camel@nny>
Message-ID: <17734.29585.100735.130425@montanaro.dyndns.org>

    Matt> The Trac[1] project has resurrected work on a SpamBayes plugin for
    Matt> filtering Wiki and ticket edits after finding the current Akismet
    Matt> system to be unreliable. Tony Meyer added some comments[2] to the
    Matt> Wiki suggesting that we write a custom tokenizer instead of using
    Matt> the built-in email-centric tokenizer.

Why not just create an "email message" out of the input? If the headers are identical in every message they won't generate any useful tokens and the message body will be all that yields useful clues. OTOH, if you have login or IP address information for the spammers, you might suitably populate the From: field.

    Matt> Are there examples from other people that have written custom tokenizers
    Matt> that may be helpful, or do you have any hints on what to take into
    Matt> account for writing an effective tokenizer for Wiki text?

So far, I think most of us have bent our input to look like email. I think that would be a lot easier than writing and debugging a new tokenizer.

Skip

From skip at pobox.com Tue Oct 31 01:17:38 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 30 Oct 2006 18:17:38 -0600
Subject: [spambayes-dev] effective tokenizer for wiki text
In-Reply-To: <1162247304.5924.22.camel@nny>
References: <1162244286.5924.11.camel@nny> <17734.29585.100735.130425@montanaro.dyndns.org> <1162247304.5924.22.camel@nny>
Message-ID: <17734.38434.168703.680519@montanaro.dyndns.org>

    >> So far, I think most of us have bent our input to look like email. I
    >> think that would be a lot easier than writing and debugging a new
    >> tokenizer.

    Matt> Yes, I think it would be fine to start testing the filter that
    Matt> way, but I figured since the custom tokenizer had been suggested
    Matt> it was worth looking into what would be required and what the
    Matt> advantages might be.

Maybe subclass tokenizer.Tokenizer and override the tokenize method?

Skip

From tameyer at ihug.co.nz Tue Oct 31 01:37:17 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue, 31 Oct 2006 13:37:17 +1300
Subject: [spambayes-dev] effective tokenizer for wiki text
In-Reply-To: <17734.29585.100735.130425@montanaro.dyndns.org>
References: <1162244286.5924.11.camel@nny> <17734.29585.100735.130425@montanaro.dyndns.org>
Message-ID:

[Skip]
> Why not just create an "email message" out of the input? If the
> headers are identical in every message they won't generate any useful
> tokens and the message body will be all that yields useful clues.
> OTOH, if you have login or IP address information for the spammers,
> you might suitably populate the From: field.

ISTM that it would be just as little work to write a "wiki-page to email" module as to create a Tokenizer subclass that tokenizes wiki pages. You can then skip all of the header tokenization (and any email-specific tokenization in the body, if there is any, but I can't think of any) and generate any additional tokens out of any metadata that might be available (maybe comment, author, etc?).

[Matt]
>> Are there examples from other people that have written custom
>> tokenizers that may be helpful, or do you have any hints on what to
>> take into account for writing an effective tokenizer for Wiki text?

What exactly gets passed to the tokenizer? Anything more than just the content (complete? diff?) of the wiki page? If it's just the content/diff then other than the words themselves, URLs are probably the most useful content. You could try enabling (or improving) the URL slurping code, perhaps.

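As a rough illustration of the kind of non-email tokenizer discussed here, the following standalone generator yields prefixed metadata tokens, URL tokens, and plain words for one wiki edit. It is a sketch under stated assumptions and does not use the SpamBayes API: the function name `tokenize_wiki_edit`, its parameters, the `author:`/`ip:`/`comment:`/`url:` prefixes, the regular expressions, and the 3 to 12 character word-length limits are all hypothetical choices made for the example.

```python
import re

# Assumed URL pattern; real wiki markup would need more careful handling.
URL_RE = re.compile(r"https?://([^\s/\]]+)(/[^\s\]]*)?")
# Anything outside this character set is treated as a word separator.
SPLIT_RE = re.compile(r"[^a-z0-9'$_-]+")


def _words(text):
    """Yield lowercase words of a moderate length (limits are arbitrary)."""
    for word in SPLIT_RE.split(text.lower()):
        if 3 <= len(word) <= 12:
            yield word


def tokenize_wiki_edit(text, author=None, ip=None, comment=None):
    """Yield string tokens for one wiki edit (hypothetical helper)."""
    # Prefix metadata tokens so they never collide with body words.
    if author:
        yield "author:%s" % author.lower()
    if ip:
        yield "ip:%s" % ip
    if comment:
        for word in _words(comment):
            yield "comment:%s" % word

    text = text.lower()
    # URLs are probably the strongest clue, so emit the host and its parts.
    for match in URL_RE.finditer(text):
        host = match.group(1)
        yield "url:%s" % host
        for part in host.split("."):
            yield "url:%s" % part

    # Strip the URLs out, then emit the remaining plain words.
    for word in _words(URL_RE.sub(" ", text)):
        yield word


tokens = list(tokenize_wiki_edit(
    "Buy CHEAP pills at http://spam.example.com/buy now!",
    author="Mallory", ip="192.0.2.7"))
```

The resulting `tokens` list contains only strings, so it could be handed to anything that accepts an iterable of string tokens. Whether the metadata tokens help or hurt would have to be checked against real wiki spam and ham.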
> So far, I think most of us have bent our input to look like email.
> I think that would be a lot easier than writing and debugging a new
> tokenizer.

A tokenizer's pretty simple, really - all it has to do is take the object you want to tokenize and yield a series of strings. It's been a couple of years, but I wrote some non-email tokenizers at one point.

=Tony.Meyer

From tameyer at ihug.co.nz Tue Oct 31 01:51:33 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue, 31 Oct 2006 13:51:33 +1300
Subject: [spambayes-dev] effective tokenizer for wiki text
In-Reply-To: <17734.38434.168703.680519@montanaro.dyndns.org>
References: <1162244286.5924.11.camel@nny> <17734.29585.100735.130425@montanaro.dyndns.org> <1162247304.5924.22.camel@nny> <17734.38434.168703.680519@montanaro.dyndns.org>
Message-ID: <5E234D91-D4DB-4267-9639-FE4D3BC42F2A@ihug.co.nz>

[Matt]
>> Yes, I think it would be fine to start testing the filter that
>> way, but I figured since the custom tokenizer had been suggested
>> it was worth looking into what would be required and what the
>> advantages might be.

[Skip]
> Maybe subclass tokenizer.Tokenizer and override the tokenize method?

That's all that's needed. Just changing:

    def tokenize(self, obj):
        msg = self.get_message(obj)

        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok

to

    def tokenize(self, obj):
        text = obj
        # The rest of this is from tokenize_body.

        # Replace numeric character entities (like &#97; for the letter
        # 'a').
        text = numeric_entity_re.sub(numeric_entity_replacer, text)

        # Normalize case.
        text = text.lower()

        if options["Tokenizer", "replace_nonascii_chars"]:
            # Replace high-bit chars and control chars with '?'.
            text = text.translate(non_ascii_translate_tab)

        for t in find_html_virus_clues(text):
            yield "virus:%s" % t

        # Get rid of uuencoded sections, embedded URLs,