From liz.gifford at gmail.com Thu May 3 22:51:01 2007
From: liz.gifford at gmail.com (Elizabeth Gifford)
Date: Thu, 3 May 2007 16:51:01 -0400
Subject: [spambayes-dev] I used SpamBayes for class project - thanks!
Message-ID: <9E26B57C-C31B-46AD-A44F-DEDA0AF3DCF0@gmail.com>

Hi SpamBayes developers,

I just wanted to let you know that I used SpamBayes for my term project
in my Machine Learning class, and to say thanks. I thought there might
be some academic interest here, and that I'd share with the list in
return for the great code of yours that I used.

The overall assignment was to "do something with evolutionary
algorithms." The goal of my project was more or less to attack a spam
filter and attempt to use an EA to find weaknesses it might have. I set
up a blind watchmaker sort of natural selection/evolution environment
using a SpamBayes filter that I had trained on an old training corpus as
the fitness function. The program I wrote attempted to evolve a generic
formula that worked sort of like "MadLibs" to write spam emails that
could get past the filter.

I started off with almost zero knowledge of how a spam filter works, and
tried to keep it that way while I wrote the algorithm so that it would
be as naive as possible. I've since read up on SpamBayes and some of the
math behind it, and it looks really interesting. I'm looking forward to
taking more time to look through the codebase.

Now that the project is completed, I see where I could have trained the
filter on different content or parts of the evolution structure that I
could have written differently to make this project more successful, but
that's research for you! This was a very quick proof of concept project,
and if I were to continue with this research there are certainly a lot
of things I would do differently and many more complicated techniques
that I'd like to try.
If anyone is interested in reading about the project in more detail,
I've posted the paper I wrote here:

http://www.cs.brandeis.edu/~egifford/classes/egiffordTermProject.pdf

Thanks again for maintaining this cool project. I hope you'll take my
contribution in the intellectual sense that is intended, and not as an
attack. :)

Liz Gifford

From marko at von-oppen.com Thu May 10 22:33:54 2007
From: marko at von-oppen.com (Marko von Oppen)
Date: Thu, 10 May 2007 22:33:54 +0200
Subject: [spambayes-dev] Vista compatibilty
Message-ID: <000a01c79342$877561d0$96602570$@com>

Under Vista, win32traceutil does not work with normal user privileges.

I have published a patch on Sourceforge which switches back to the
logfile mechanism from the binary release in such an environment.

Marko

From skip at pobox.com Mon May 14 14:31:15 2007
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 14 May 2007 07:31:15 -0500
Subject: [spambayes-dev] Standalone SpamBayes classifier for websites
Message-ID: <17992.22163.417514.267934@montanaro.dyndns.org>

CC'ing Richard Jones - Roundup guru and Reimar Bauer - MoinMoin guru.
Reimar, I don't seem to have Marian Neagul's email handy. Can you
forward this to him?

I've been trying (rather unsuccessfully) to figure out how to integrate
a SpamBayes classifier into Roundup. Basically I know zilch about
Roundup's code. You need to score form submissions (the easy part), save
them for later retraining and allow misclassified submissions to be
reinjected into the website (the hard parts). I had similar problems
when I tried to incorporate SpamBayes into MoinMoin.

These sites generally treat all submissions as valid. Presuming we have
a SpamBayes training database and classifier we can talk to, it's a
fairly easy task to score a submission and reject it if it looks like
spam. Alas, if the submission is scored as spam, Roundup and MoinMoin
have no convenient way to save the submission yet keep it sequestered so
it doesn't turn up on the web.
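
A rough sketch of that score-then-sequester step might look like the
following. Everything in it is hypothetical and only for illustration:
disposition, Quarantine, handle and the two cutoff values are made-up
names and numbers, not part of SpamBayes, Roundup or MoinMoin, and the
score is assumed to be a spam probability between 0 and 1 supplied by
whatever classifier is in use.

```python
# Hypothetical sketch: map a spam probability to a disposition and keep
# anything that is not clearly ham out of public view until reviewed.
# None of these names come from SpamBayes, Roundup or MoinMoin.

HAM_CUTOFF = 0.20   # assumed threshold, tune for your site
SPAM_CUTOFF = 0.90  # assumed threshold, tune for your site


def disposition(prob, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Return 'ham', 'unsure' or 'spam' for a probability in [0, 1]."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability out of range: %r" % (prob,))
    if prob < ham_cutoff:
        return "ham"
    if prob < spam_cutoff:
        return "unsure"
    return "spam"


class Quarantine:
    """In-memory holding area for submissions that were not published."""

    def __init__(self):
        self._held = {}
        self._next_id = 0

    def hold(self, submission, prob):
        """Store a submission with its score; return an id for review."""
        self._next_id += 1
        self._held[self._next_id] = (submission, prob)
        return self._next_id

    def release(self, held_id):
        """Remove and return a submission so the site can resubmit it."""
        submission, _prob = self._held.pop(held_id)
        return submission

    def __len__(self):
        return len(self._held)


def handle(submission, prob, quarantine, publish):
    """Publish ham right away; sequester everything else."""
    verdict = disposition(prob)
    if verdict == "ham":
        publish(submission)
        return verdict, None
    return verdict, quarantine.hold(submission, prob)
```

The in-memory dictionary only stands in for real storage; the point is
that a held submission stays retrievable by id, so a reviewer can later
release it back to the site or leave it for retraining as spam.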

It occurred to me yesterday that the SpamBayes POP3 proxy and IMAP
filter solve the storage and classification problems for the specific
case where you're talking those two email protocols. The only trick is
that they are tied to POP3 and IMAP. Instead of email I need some other
way to get a "message" into and out of the classifier/database manager.
Given an arbitrary form submission I should be able to convert it to a
MIME message (file uploads map to attachments) and hand it off to a
standalone SpamBayes server for scoring and storage. If the submission
is originally marked as spam (or unsure) but is later deemed okay, I
should be able to convert the MIME message back into the necessary bits
for resubmission. If the submission is originally marked as ham but is
later deemed to be spam, the regular Roundup or MoinMoin facility for
deleting tickets, pages or attachments would get rid of it.

Alas, sb_server.py and sb_imapfilter.py don't seem to share a lot of
code (save for using Dibbler to build the web user interface). Is that
true? It seems the user interface, classifier bits and storage should be
essentially identical. All that should be different between them is the
way you transmit messages to and from external systems:

                                    sink
                                     ^
                                     |
    +------------------+        +----------+
    |                  |        |          |
    |       Core       |<------>| Protocol |
    |      Server      |        | Adapter  |
    |                  |        |          |
    +------------------+        +----------+
             ^                       ^
             |                       |
             v                     source
     web & msg storage

For POP3 the source would be the email client and the sink would be the
real POP3 server. For IMAP the source and sink would be the IMAP server.
For websites the source and sink would be the web site (Roundup,
MoinMoin, etc). The data sent from the protocol adapter to the core
server would be MIME messages. The data sent to the protocol adapter
would be simply score info (ham, spam, unsure, perhaps raw scores).

Any ideas on the shortest route to a core server that provides the user,
training and storage interfaces? Start from scratch?
Rip the POP3 stuff out of sb_server.py? Rip the IMAP stuff out of
sb_imapfilter.py? I'd really hate to reinvent the wheel since we seem to
have two wheels already. Once that core server is available, adapting to
different environments should be possible by plugging in specific
protocol adapters.

Thx,

Skip

From spambayes-dev at tangomu.com Wed May 16 09:29:57 2007
From: spambayes-dev at tangomu.com (Tony Meyer)
Date: Wed, 16 May 2007 19:29:57 +1200
Subject: [spambayes-dev] Standalone SpamBayes classifier for websites
In-Reply-To: <17992.22163.417514.267934@montanaro.dyndns.org>
References: <17992.22163.417514.267934@montanaro.dyndns.org>
Message-ID: <985394BC-0B69-4778-B5B9-62D4B2773C47@tangomu.com>

> Alas, sb_server.py and sb_imapfilter.py don't seem to share a lot of
> code (save for using Dibbler to build the web user interface). Is
> that true?

Somewhat. Unfortunately, at the time I originally wrote sb_imapfilter I
didn't use IMAP myself and so when deciding whether the IMAP solution
should be a 'filter' (i.e. periodically connect to an IMAP server and
classify messages) or a proxy (i.e. intercept connections to the server
and classify on the fly) I went with the majority vote. In many ways a
proxy would be the simpler solution, and would certainly resemble
sb_server a lot more (I've written an IMAP proxy for another project,
based on the sb_server POP3 proxy, and there is a lot of overlap). OTOH,
there are advantages (particularly training) to the filter method.

> It seems the user interface, classifier bits and storage should be
> essentially identical.

The user interface is nearly identical. The shared part is in
UserInterface.py, with the separate subclasses in ProxyUI.py (POP3) and
ImapUI.py. The majority of the code in ImapUI.py deals with presenting a
list of folders from an IMAP server to the user, to select which should
be scanned for messages to classify/train (this probably wouldn't be
necessary with a proxy).
The majority of the code in ProxyUI.py deals with the browser-based
training interface (which the IMAP filter doesn't have - you just put
messages in the appropriate folders on the server).

The classifier and storage bits are pretty much identical (storage.py
and FileCorpus.py respectively).

> Any ideas on the shortest route to a core server that provides the
> user, training and storage interfaces? Start from scratch? Rip the
> POP3 stuff out of sb_server.py? Rip the IMAP stuff out of
> sb_imapfilter.py? I'd really hate to reinvent the wheel since we seem
> to have two wheels already. Once that core server is available,
> adapting to different environments should be possible by plugging in
> specific protocol adapters

Definitely don't start with sb_imapfilter.py - it's basically a scanner,
not an on-demand-classifier.

Probably the best place to start would be with the State class in
sb_server.py. There are some POP3-specific parts in there, but
personally I would be happy if they were abstracted out (e.g. a State
class and a POP3ProxyState subclass). I could do that (promptly ;)).

What you then have are:

  * State.bayes (the classifier)
  * State.hamCorpus, State.spamCorpus, State.unknownCorpus (storage of
    'messages' - untrained messages in unknownCorpus, and trained
    messages (expiring) in ham/spamCorpus).
  * Training via moving messages between corpora.

Once you've got something that looks like a message, you can do
something like sb_server's onRetr for classification and storage (I've
cut bits that probably aren't relevant):

"""
msg = email.message_from_string(messageText,
                                _class=spambayes.message.SBHeaderMessage)
msg.setId(state.getNewMessageName())
# Now find the spam disposition and add the header.
(prob, clues) = state.bayes.spamprob(msg.tokenize(), evidence=True)
msg.addSBHeaders(prob, clues)
cls = msg.GetClassification()
state.RecordClassification(cls, prob)
# Cache the message. Write the message into the Unknown cache.
makeMessage = state.unknownCorpus.makeMessage
message = makeMessage(msg.getId(), msg.as_string())
state.unknownCorpus.addMessage(message)
"""

For the user interface, you can just create a
UserInterface.UserInterface subclass (needs a Home page/method, and an
__init__ method). Actually, you probably want ProxyUI.ProxyUserInterface
as-is, with a different set of options to offer in the configuration
pages (the parm_ini_map and adv_map used in the __init__). (There would
be a "No POP3 proxies running" message on the main page, but you could
ignore that or subclass appropriately).

It's a long time since I've worked with the browser interface code, but
I'm pretty sure that this would give you what you want.

Cheers,
Tony

From skip at pobox.com Wed May 16 12:36:02 2007
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 16 May 2007 05:36:02 -0500
Subject: [spambayes-dev] Standalone SpamBayes classifier for websites
In-Reply-To: <985394BC-0B69-4778-B5B9-62D4B2773C47@tangomu.com>
References: <17992.22163.417514.267934@montanaro.dyndns.org> <985394BC-0B69-4778-B5B9-62D4B2773C47@tangomu.com>
Message-ID: <17994.56978.558249.789217@montanaro.dyndns.org>

At this point I got things whittled down to a "core" server by writing
very little actual code, mostly just ripping the POP3 stuff out of the
POP3 proxy. I need to do some rearranging, then start figuring out what
the protocol interface should look like, then write a protocol plugin
for use by web apps (Roundup, Trac, MoinMoin, etc).

Skip

From sjoerd at acm.org Thu May 24 16:20:53 2007
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Thu, 24 May 2007 16:20:53 +0200
Subject: [spambayes-dev] spambayes crash due to bad image
Message-ID: <46559F45.9010306@acm.org>

I got this crash, probably due to a bad image. The code in
ImageStripper.py that calls PIL is currently in an except IOError, but
should probably be in a bare except to catch any and all errors from
PIL. In my case I got an IndexError.
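
The kind of change being proposed could be sketched roughly as below.
This is not the ImageStripper.py code: decode_parts, fake_decode and the
"invalid-image" token are hypothetical names used only to show the
pattern. It also catches Exception rather than using a truly bare
except, which is a substitution on my part so that KeyboardInterrupt and
SystemExit still propagate.

```python
# Hypothetical sketch of tolerant image decoding; not the ImageStripper
# code. All names here are invented for illustration.

def decode_parts(parts, decode):
    """Decode each part, skipping any that the decoder cannot handle.

    `decode` stands in for the PIL-based loader. Returns (results,
    tokens), where tokens holds one marker per kind of decode failure.
    """
    results = []
    tokens = set()
    for part in parts:
        try:
            results.append(decode(part))
        except Exception as exc:
            # IOError, IndexError, ValueError, ... all end up here.
            # KeyboardInterrupt and SystemExit are not caught.
            tokens.add("invalid-image:%s" % type(exc).__name__)
    return results, tokens


def fake_decode(data):
    """Toy decoder that raises IndexError on input shorter than 4 chars."""
    header = data[:4]
    return (ord(header[3]) + (ord(header[2]) << 8) +
            (ord(header[1]) << 16) + (ord(header[0]) << 24))
```

With a handler this broad, a truncated image costs one marker token
instead of aborting the whole filter run, at the price of also hiding
genuine bugs in the decoding path.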

I could make this change, if there is consensus and once Sourceforge
lets me get at the spambayes repository.

16:02:17+ sb_imapfilter.py -c
SpamBayes IMAP Filter Version 1.1a3 (August 2006).

Traceback (most recent call last):
  File "/ufs/sjoerd/bin/x86_64/sb_imapfilter.py", line 1301, in <module>
    run()
  File "/ufs/sjoerd/bin/x86_64/sb_imapfilter.py", line 1281, in run
    imap_filter.Filter()
  File "/ufs/sjoerd/bin/x86_64/sb_imapfilter.py", line 1080, in Filter
    self.unsure_folder, self.ham_folder)
  File "/ufs/sjoerd/bin/x86_64/sb_imapfilter.py", line 955, in Filter
    evidence=True)
  File "/ufs/sjoerd/lib/python2.6/site-packages/spambayes/classifier.py", line 196, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/ufs/sjoerd/lib/python2.6/site-packages/spambayes/classifier.py", line 498, in _getclues
    for word in Set(wordstream):
  File "/ufs/sjoerd/lib/python2.6/site-packages/spambayes/tokenizer.py", line 1284, in tokenize
    for tok in self.tokenize_body(msg):
  File "/ufs/sjoerd/lib/python2.6/site-packages/spambayes/tokenizer.py", line 1643, in tokenize_body
    text, tokens = crack_images(engine_name, parts)
  File "/ufs/sjoerd/lib/python2.6/site-packages/spambayes/ImageStripper.py", line 367, in analyze
    pnmfiles, tokens = PIL_decode_parts(parts)
  File "/ufs/sjoerd/lib/python2.6/site-packages/spambayes/ImageStripper.py", line 144, in PIL_decode_parts
    image.load()
  File "/ufs/sjoerd/lib/python2.6/site-packages/PIL/ImageFile.py", line 189, in load
    s = read(self.decodermaxblock)
  File "/ufs/sjoerd/lib/python2.6/site-packages/PIL/PngImagePlugin.py", line 349, in load_read
    cid, pos, len = self.png.read()
  File "/ufs/sjoerd/lib/python2.6/site-packages/PIL/PngImagePlugin.py", line 92, in read
    len = i32(s)
  File "/ufs/sjoerd/lib/python2.6/site-packages/PIL/PngImagePlugin.py", line 40, in i32
    return ord(c[3]) + (ord(c[2])<<8) + (ord(c[1])<<16) + (ord(c[0])<<24)
IndexError: string index out of range

-- 
Sjoerd Mullender

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 369 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20070524/86a69f3b/attachment.pgp

From mhammond at skippinet.com.au Fri May 25 03:42:23 2007
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 25 May 2007 11:42:23 +1000
Subject: [spambayes-dev] spambayes crash due to bad image
In-Reply-To: <46559F45.9010306@acm.org>
Message-ID: <00a201c79e6d$f187ef30$1f0a0a0a@enfoldsystems.local>

> I got this crash, probably due to a bad image. The code in
> ImageStripper.py that calls PIL is currently in an except IOError, but
> should probably be in a bare except to catch any and all errors from
> PIL. In my case I got an IndexError.
>
> I could make this change, if there is consensus and once Sourceforge
> lets me get at the spambayes repository.

That sounds perfectly reasonable to me.

Cheers,

Mark

From rjs at wmw.com Tue May 29 18:42:05 2007
From: rjs at wmw.com (Robert Savage)
Date: Tue, 29 May 2007 12:42:05 -0400
Subject: [spambayes-dev] SpamBayes developer needed for small paid job
In-Reply-To: <465C0644.4010101@wmw.com>
References: <6.0.1.1.2.20070528132811.05239188@localhost> <465C0644.4010101@wmw.com>
Message-ID: <6.0.1.1.2.20070529123547.04980b98@localhost>

Hi,

This is a request for paid help from a SpamBayes developer, please.

We have source files which are creating a COM object from SpamBayes
1.1a2 or 1.1a3, but the COM isn't working well. This may be because
these versions are alpha.

We want to create a SpamBayes COM by using the latest stable release,
1.0.4. However, we tried to do it ourselves, and the build was created
without errors, but the resulting COM isn't working. So having the COM
correctly built with version 1.0.4 is the first task.

Once we have the COM, it's possible that the SpamBayes COM still won't
function as expected; in which case the Python source file which
implements the COM would need to be fixed.
This is the second task, and may not be necessary, as migrating to
version 1.0.4 may fix the problem.

We do not know Python, and the original author is unreachable.

Robert

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20070529/2c7ea1f3/attachment.htm

From skip at pobox.com Tue May 29 19:15:25 2007
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 29 May 2007 12:15:25 -0500
Subject: [spambayes-dev] SpamBayes developer needed for small paid job
In-Reply-To: <6.0.1.1.2.20070529123547.04980b98@localhost>
References: <6.0.1.1.2.20070528132811.05239188@localhost> <465C0644.4010101@wmw.com> <6.0.1.1.2.20070529123547.04980b98@localhost>
Message-ID: <18012.24493.254603.637415@montanaro.dyndns.org>

    Robert> This is a request for paid help from a SpamBayes developer,
    Robert> please.

    Robert> We have source files which are creating a COM object from
    Robert> SpamBayes 1.1a2 or 1.1a3, but the COM isn't working well....

I'm not committing him to anything, you understand, but I suspect your
best bet will be Mark Hammond:

    http://www.enfoldsystems.com/
    http://www.enfoldsystems.com/About

Skip