From skip at pobox.com Tue Aug 1 12:51:00 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 1 Aug 2006 05:51:00 -0500
Subject: [spambayes-dev] Trouble w/ zodb persistence of dnscache
Message-ID: <17615.12820.699706.298379@montanaro.dyndns.org>

I've been using Matt Cowles's x-lookup-ip extension with some success
recently to reveal the real IP addresses behind spammers' hostnames. For
example, the following hostnames are mentioned in pharma come-ons:

    % host www.astlehover.com
    www.astlehover.com has address 211.144.68.87
    % host www.tornetseen.com
    www.tornetseen.com has address 211.144.68.87
    % host www.erlikuvera.com
    www.erlikuvera.com has address 211.144.68.87
    % host www.oplimazexu.com
    www.oplimazexu.com has address 211.144.68.87

The rest of the message content is pretty well disguised (very little
content, random common text boilerplate, etc), so without IP lookup they
tend to plop into my unsure mailbox. They sometimes score low enough to
land in my regular inbox. Matt's extension solves that by looking up the
IP addresses for hosts it encounters and generating a number of new tokens:

    % spamcounts -r :211
    token,nspam,nham,spam prob
    url-ip:211.144.68.87/32,1,0,0.844827586207
    url-ip:211.144.68/24,1,0,0.844827586207
    url-ip:211/8,4,0,0.949438202247
    url-ip:211.20.189/24,1,0,0.844827586207
    url-ip:211.189.18/24,1,0,0.844827586207
    url-ip:211.144/16,1,0,0.844827586207
    received:211.95.72.130,1,0,0.844827586207
    url-ip:211.189.18.186/32,1,0,0.844827586207
    url-ip:211.22.166.116/32,1,0,0.844827586207
    received:211.96,1,0,0.844827586207
    received:211.95,1,0,0.844827586207
    url-ip:211.22.166/24,1,0,0.844827586207
    received:211.95.72,1,0,0.844827586207
    url-ip:211.20/16,1,0,0.844827586207
    url-ip:211.20.189.50/32,1,0,0.844827586207
    received:211.96.42,1,0,0.844827586207
    url-ip:211.22/16,1,0,0.844827586207
    received:211,2,0,0.908163265306
    received:211.96.42.103,1,0,0.844827586207
    url-ip:211.189/16,1,0,0.844827586207

Unfortunately it doesn't cache IP addresses across sessions.
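
To make the spamcounts listing above concrete, here is a minimal sketch of
two things it shows: how one resolved address fans out into url-ip tokens of
decreasing precision, and how the "spam prob" column relates to the counts.
It is not Matt's code. The function names are hypothetical, and the
constants s=0.45 and x=0.5 are assumptions, chosen because a
Robinson-style adjustment with those values reproduces the listed figures
for tokens seen only in spam.

```python
def url_ip_tokens(address):
    """Return url-ip tokens for a dotted-quad IPv4 address, most specific first."""
    octets = address.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % (address,))
    tokens = []
    for n in (4, 3, 2, 1):
        # Keep the first n octets and label the token with the prefix length.
        tokens.append("url-ip:%s/%d" % (".".join(octets[:n]), 8 * n))
    return tokens


def robinson_adjust(p, n, s=0.45, x=0.5):
    """Shrink a raw spam ratio p toward the prior x, given n sightings.

    s (strength of the prior) and x (prior probability) are assumed values.
    """
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    for token in url_ip_tokens("211.144.68.87"):
        print(token)
    # A token seen in spam only has raw ratio p = 1.0.
    for n in (1, 2, 4):
        print(n, robinson_adjust(1.0, n))
```

With p = 1.0 the adjusted value is about 0.8448 for one sighting, 0.9082 for
two and 0.9494 for four, which matches the nspam=1, 2 and 4 rows above. The
broader tokens (/8, /16) accumulate counts faster than the /32 token, which
is why url-ip:211/8 already scores higher.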
My train-to-exhaustion scheme scores my entire training database. The first
round of scoring is very time-consuming. I decided to solve that
shortcoming. I added "dbm" and "zodb" support to Matt's dnscache module,
since those are probably the two most prevalent storage schemes (default and
emeritus default).

I've been testing the zodb scheme but having trouble with it. If I start
with no ~/.dnscache* files it correctly creates a new one. If I have an
existing database already, it doesn't update the database file, though the
timestamps on the .index and .tmp files are updated. I asked on zodb-dev
and got some partial help (I was relying on __del__ to close() the
FileStorage object), but even with that fixed it's not working properly. My
recent pleas for help have gone unanswered, so I'm turning to this list. My
zodb code was cribbed from the support in SpamBayes itself, so maybe the
author of that code will see what I've done wrong.

I set up the cache in tokenizer.py like so:

    try:
        import dnscache
        cache = dnscache.cache(cachefile=os.path.expanduser("~/.dnscache"))
        cache.printStatsAtEnd = True
    except (IOError, ImportError):
        cache = None
    else:
        import atexit
        atexit.register(cache.close)

In the cache class's __init__ I open the cachefile if given:

        if cachefile:
            self.open_cachefile(cachefile)
        else:
            self.caches={ "A": {}, "PTR": {} }

    def open_cachefile(self, cachefile):
        filetype = options["Storage", "persistent_use_database"]
        cachefile = os.path.expanduser(cachefile)
        if filetype == "dbm":
            if os.path.exists(cachefile):
                self.caches=shelve.open(cachefile)
            else:
                self.caches=shelve.open(cachefile)
                self.caches["A"] = {}
                self.caches["PTR"] = {}
        elif filetype == "zodb":
            from ZODB import DB
            from ZODB.FileStorage import FileStorage
            self._zodb_storage = FileStorage(cachefile, read_only=False)
            self._DB = DB(self._zodb_storage, cache_size=10000)
            self._conn = self._DB.open()
            root = self._conn.root()
            self.caches = root.get("dnscache")
            if self.caches is None:
                # There is no classifier, so create one.
                from BTrees.OOBTree import OOBTree
                self.caches = root["dnscache"] = OOBTree()
                self.caches["A"] = {}
                self.caches["PTR"] = {}
                print "opened new cache"
            else:
                print "opened existing cache with", len(self.caches["A"]), "A records",
                print "and", len(self.caches["PTR"]), "PTR records"

and when it's closed, this code executes:

    def close(self):
        filetype = options["Storage", "persistent_use_database"]
        if filetype == "dbm":
            self.caches.close()
        elif filetype == "zodb":
            self._zodb_close()

    def _zodb_store(self):
        import transaction
        from ZODB.POSException import ConflictError
        from ZODB.POSException import TransactionFailedError

        try:
            transaction.commit()
        except ConflictError, msg:
            # We'll save it next time, or on close. It'll be lost if we
            # hard-crash, but that's unlikely, and not a particularly big
            # deal.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Conflict on commit.", msg
            transaction.abort()
        except TransactionFailedError, msg:
            # Saving isn't working. Try to abort, but chances are that
            # restarting is needed.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Store failed. Need to restart.", msg
            transaction.abort()

    def _zodb_close(self):
        # Ensure that the db is saved before closing. Alternatively, we
        # could abort any waiting transaction. We need to do *something*
        # with it, though, or it will be still around after the db is
        # closed and cause problems. For now, saving seems to make sense
        # (and we can always add abort methods if they are ever needed).
        self._zodb_store()

        # Do the closing.
        self._DB.close()

        # We don't make any use of the 'undo' capabilities of the
        # FileStorage at the moment, so might as well pack the database
        # each time it is closed, to save as much disk space as possible.
        # Pack it up to where it was 'yesterday'.
        # XXX What is the 'referencesf' parameter for pack()? It doesn't
        # XXX seem to do anything according to the source.
        ## self._zodb_storage.pack(time.time()-60*60*24, None)
        self._zodb_storage.close()

        self._zodb_closed = True
        if options["globals", "verbose"]:
            print >> sys.stderr, 'Closed dnscache database'

When run, it correctly announces that it's either creating a new cache or
that it opened an existing cache, e.g.:

    opened existing cache with 479 A records and 0 PTR records

No errors appear on stdout or stderr during the run. At completion it tells
me that, "Closed dnscache database". I can see that the database isn't
getting updated because a) its timestamp doesn't get updated and b) because
running strings over the file and grepping for new names doesn't display
them:

    % # this one exists...
    % strings -a ~/.dnscache* | egrep -i timsblogger
    www.timsbloggers.comq
    % # this one is new...
    % strings -a ~/.dnscache* | egrep -i tradelink
    % # bummer...

Does anyone have any suggestions about getting this beast to work properly?

Thx,

Skip

From spambayes at masters.me.uk Wed Aug 2 09:45:03 2006
From: spambayes at masters.me.uk (spambayes at masters.me.uk)
Date: Wed, 2 Aug 2006 08:45:03 +0100
Subject: [spambayes-dev] Spambayes is starting not to work due to retaliatory action by spammers
Message-ID: <000f01c6b607$9161a640$1202a8c0@trump>

Dear Spambayes developers,

I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
excellent. However, over the last couple of months, it has become
compromised by a particular type of spam that I believe, over time, will
render Spambayes much less effective unless something is done.

I expect you've seen these Spams - at the moment, they are always the
stock-market related ones but I'm sure once others catch on, they will start
to use the same technique. The start of the email is a picture that looks
like ordinary text but isn't. All the spam info is in the text. The
picture is followed by a whole load of randomly selected words.

There are 2 bad things about this:

1. These spams are successfully evading Spambayes in some cases.
Firstly the Spam usually reaches the "possible Spam" folder. As a result, I
am now spending significant time clearing out the possible spam folder
whereas 2 or 3 months ago I wasn't. Secondly, the odd spam is actually
managing to get through as ham. This is the first time this has happened
ever.

2. Because I obviously mark these as Spam, all the randomly generated words
in each spam email have their spam likelihood scores increased. The result
of this is that over time, the spam-scores for loads of perfectly
non-spam-like words are being gradually increased. The more this goes on,
the more these "ham words" are being compromised. I suspect that this is
why, to begin with, I only saw a few of these stock market emails, now I'm
seeing loads and over the last 2 or 3 weeks some have started to come in as
ham. I fear that the long term effect of this will be to spoil spambayes
bigtime.

I know that Spambayes has a deep-rooted principle in only using the bayesian
algorithm and I wouldn't suggest changing that. However, I am wondering if
it might be possible to analyse these messages and include some parts of the
hidden text relating to the picture that are not presently included in the
bayesian statistics. My thesis is this - I rarely get pictures in my email
that are not just attachments - virtually all pictures that are embedded
into the mail seem to be spam. So if there is some token or tag in the
email that represents the embedded picture that can be included in the
bayesian analysis, this might fix the problem.

I hope that this suggestion is useful - I certainly fear for the future of
Spambayes if this new spam threat is not dealt with....

thanks for reading,

James Masters.

From ron at ridic.com Wed Aug 2 13:21:01 2006
From: ron at ridic.com (Ron Theis)
Date: Wed, 02 Aug 2006 04:21:01 -0700
Subject: [spambayes-dev] Correct formatting of HTTP Post for training
Message-ID: <44D08A9D.1030100@ridic.com>

> Apparently I'm formatting the requests
> incorrectly, because the server is returning a 500 error.

Whoops, sorry, I was missing the "text" parameter in the POST. Dumb
diddly. It seems to be training fine now.

Ron

From ron at ridic.com Wed Aug 2 13:05:16 2006
From: ron at ridic.com (Ron Theis)
Date: Wed, 02 Aug 2006 04:05:16 -0700
Subject: [spambayes-dev] Correct formatting of HTTP Post for training
Message-ID: <44D086EC.6070703@ridic.com>

Hi,

I'm trying to submit spam/ham via manually assembled HTTP POSTs to
SpamBayes on Windows. Apparently I'm formatting the requests incorrectly,
because the server is returning a 500 error. The error message includes a
traceback of:

    File "spambayes\Dibbler.pyc", line 470, in found_terminator
    TypeError: onTrain() takes exactly 4 non-keyword arguments (2 given)

Does anyone have a sample of what such a POST should look like? I suspect
I'm bungling the formatting.

Thanks,
Ron

From tim.peters at gmail.com Thu Aug 3 09:25:32 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Thu, 3 Aug 2006 03:25:32 -0400
Subject: [spambayes-dev] Spambayes is starting not to work due to retaliatory action by spammers
In-Reply-To: <000f01c6b607$9161a640$1202a8c0@trump>
References: <000f01c6b607$9161a640$1202a8c0@trump>
Message-ID: <1f7befae0608030025i3ce746eatbd264fd5ad094725@mail.gmail.com>

[spambayes at masters.me.uk]
> I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
> excellent. However, over the last couple of months, it has become
> compromised by a particular type of spam that I believe, over time, will
> render Spambayes much less effective unless something is done.
>
> I expect you've seen these Spams - at the moment, they are always the
> stock-market related ones

I've seen a few drug spams using the same techniques, starting in July
-- but they seemed to dry up quickly.

> but I'm sure once others catch on, they will start
> to use the same technique. The start of the email is a picture that looks
> like ordinary text but isn't. All the spam info is in the text. The
> picture is followed by a whole load of randomly selected words.

You're probably not getting any reaction here because exactly the same
thing is currently being discussed on the SpamBayes "user" mailing
list, in this thread:

    Spam in Images
    http://mail.python.org/pipermail/spambayes/2006-August/date.html

> There are 2 bad things about this:
>
> 1. These spams are successfully evading Spambayes in some cases. Firstly
> the Spam usually reaches the "possible Spam" folder. As a result, I am now
> spending significant time clearing out the possible spam folder whereas 2 or
> 3 months ago I wasn't.

Same here, except the time isn't significant. If you don't believe
me, stop using SpamBayes for a week to rediscover what "significant"
means ;-)

> Secondly, the odd spam is actually managing to get through as ham. This
> is the first time this has happened ever.

Not here -- they're very good at scoring Unsure, but haven't seen any
false negatives yet.

> 2. Because I obviously mark these as Spam, all the randomly generated words
> in each spam email have their spam likelihood scores increased. The result
> of this is that over time, the spam-scores for loads of perfectly
> non-spam-like words are being gradually increased. The more this goes on,
> the more these "ham words" are being compromised.

I certainly haven't seen any ham pushed into "unsure" because of this,
and doubt it matters -- it generally doesn't hurt at all to have any
number of "ham words" show up in a few spam.
One of the characteristics of the spam you're talking about that /makes/
it effective is that it's very good at /not/ repeating gibberish phrases
across messages. That's exactly why training on the gibberish is
ineffective at catching future messages of the same ilk. But, OTOH,
the non-repetition also prevents it from "poisoning" your strong ham
tokens. They get slightly less hammy, and that doesn't hurt because
most ham is nowhere near the unsure range.

> I suspect that this is why, to begin with, I only saw a few of these stock market
> emails, now I'm seeing loads

The only reason you see loads of any kind of spam is that it's making
a profit for the sender. Pump-&-dump scams violate major securities
laws, and it's quite possible these scammers will quit before getting
too greedy (= getting caught).

> and over the last 2 or 3 weeks some have started to come in as ham.

While I haven't seen that, it's inconsistent with your explanation
above: if your "ham tokens" /were/ being compromised, that makes it
/less/ likely that a message containing your ham tokens will be scored
as ham, not more likely.

A more likely explanation is simply "loads": gibberish does have a
real chance of scoring as ham, and the more attempts are made, the
more likely one will succeed. What they can't do is craft a message
that scores as ham for all users, or even for most.

> I fear that the long term effect of this will be to spoil spambayes bigtime.

Possibly. People have panicked prematurely before ;-)

> I know that Spambayes has a deep-rooted principle in only using the bayesian
> algorithm and I wouldn't suggest changing that. However, I am wondering if
> it might be possible to analyse these messages and include some parts of the
> hidden text relating to the picture that are not presently included in the
> bayesian statistics.

See the thread above.
Nobody knows a realistic way to extract the text from these images (there
is no "text" here -- just a large matrix of individual pixels, something
the human eye/brain system is very much better at decoding than
programs). OTOH, the images themselves probably have many statistical
characteristics not shared with "legitimate" images, and those can be
computed/extracted with finite effort.

> My thesis is this - I rarely get pictures in my email that are not just attachments -
> virtually all pictures that are embedded into the mail seem to be spam.

Of course that varies. For example, it's very easy to create embedded
pictures in Outlook, and even small children know how to do it.
Worse, their grandparents are required by law to consider such email
"ham" :-)

> So if there is some token or tag in the email that represents the embedded picture
> that can be included in the bayesian analysis, this might fix the problem.

This is harder in Outlook because Outlook destroys the original MIME
structure of the email before SpamBayes sees it. There are already
several such tokens generated when the original MIME structure is
available. In Outlook, it's most likely you'll get the single
synthesized token:

    virus:src="cid:

or a simple variation on that, and that's all that remains of the
embedded GIF. A single token helps a bit, but not enough. Do note
that pump-&-dump scams don't even contain a URL to click on: they
want you to buy the stock on the open market, not send them money
directly. That also makes it a unique (and uniquely effective) kind
of spam: the pitch is /entirely/ buried in the GIF, with no useful
text (not even a URL) of any kind to tokenize.

> I hope that this suggestion is useful - I certainly fear for the future of
> Spambayes if this new spam threat is not dealt with....

Don't assume that most spammers are capable of becoming competent :-)

From skip at pobox.com Fri Aug 4 17:20:36 2006
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 4 Aug 2006 10:20:36 -0500
Subject: [spambayes-dev] Maybe a little OCR would help...
Message-ID: <17619.26052.556242.798290@montanaro.dyndns.org>

This is just one simple little test...

I took two pump & dump messages for HLVK I received overnight. The GIF
image is actually sliced into pieces horizontally, so I wrote a little
shell script to convert the images to netpbm and concatenate them, then sent
the result through ocrad, sorted, uniq'd and downshifted the whole mess,
then checked for words the two had in common. I came up with:

    _ __ and co company hlv hlvc lnc. low new news nlv now! now!!! on the
    tnis wl_ |_

While that is not a huge increase in the number of tokens and some aren't
going to help, it's still better than what we have today. Time will tell if
the cost is worth it. Perhaps if we generate some further interest in ocrad
it will improve as well.

Skip

From james at masters.me.uk Sat Aug 5 23:16:23 2006
From: james at masters.me.uk (James Masters)
Date: Sat, 5 Aug 2006 22:16:23 +0100
Subject: [spambayes-dev] Spambayes is starting not to work due to retaliatory action by spammers
In-Reply-To: <1f7befae0608030025i3ce746eatbd264fd5ad094725@mail.gmail.com>
Message-ID: <002701c6b8d4$68031d90$1202a8c0@trump>

Dear Tim,

Thank you very much for your comprehensive reply and apologies to the group
for putting my email to the wrong place. If I have anything more to write,
I'll put it in the forum you mention.

thanks,

James.

From skip at pobox.com Sun Aug 6 19:25:47 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 6 Aug 2006 12:25:47 -0500
Subject: [spambayes-dev] Several new tokenizing gimmicks checked in
Message-ID: <17622.9755.800465.16215@montanaro.dyndns.org>

With the current crop of pump & dump spams I decided to break down and
actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would
help. It does a miserable job from a readability standpoint at extracting
text from an image, but SpamBayes seems to love what it does generate. This
morning I thought, "what the hell", and checked in all the current new
tricks I've been working on/with:

* IP address lookup and more extensive tokenization. This is from Matt
  Cowles. I added persistence beyond the current run. Unfortunately, the
  dbm persistence is untested (though should probably work okay) while the
  zodb persistence still has problems (writes the file the first time, but
  doesn't update it on successive runs). Maybe someone can look at those
  issues. This seems to work very well for those spams where the only
  useful clue is a URL, but with a domain name that changes each time.
  They seem to pretty much all point to the same IP address as far as I
  can tell. Enabled using the x-lookup_ip and lookup_ip_cache options.
  Requires installation of PyDNS.

* Note image size.
  This was my first stab at trying to get some information out of an
  image. Seems to work pretty well. Enabled using the x-image_size
  option.

* Note short runs of too-short words. Text spammers (as opposed to image
  spammers) seem to like to use this technique:

      X j A m N j A d X h M k E z R d I p D u I m A c C o I d A t L j I v S j

  to hide their tokens from spam filters. Enabled using the x-short_runs
  option. Based on my current database I'm skeptical this will add much
  over what else we already have.

* Try OCR on images. The latest technique we've all encountered seems to
  be the pump and dump stock scams where the entire come-on is embedded in
  one or more GIF images. I wrote a small ImageStripper module which
  handles these. It grabs the image parts, converts them to netpbm format,
  concatenates them left-to-right, then submits the result to ocrad. This
  is just a proof-of-concept. It requires ocrad and netpbm to be
  available. As such I suspect it will only run currently on Unix-like
  systems. Enabled using the x-crack_images and max_image_size options.

I added these extensions using multiple checkins, so if we decide to back
one or more of them out it shouldn't be a major PITA.

Skip

From skip at pobox.com Mon Aug 7 00:50:44 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 6 Aug 2006 17:50:44 -0500
Subject: [spambayes-dev] Some test results
Message-ID: <17622.29252.971244.847129@montanaro.dyndns.org>

I put together some test databases today using spam received in the past
week or so (about 1800 messages) and a reasonable cross-section of my ham
(all saved python-related mail plus my regular non-specific mailbox, about
2300 messages) and did some 5x5 cross-validation tests (that's the correct
term, right?). For the control test I set all these options False:

    x-lookup_ip
    x-short_runs
    x-image_size
    x-crack_images

but otherwise used my standard configuration. I then made four runs,
setting one option True for each run, then compared each test with the
control run.
The results are summarized briefly below.

control v. x-lookup_ip
----------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.217  tied
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.412  tied
    4.533  4.533  tied
    4.222  4.222  tied

won   0 times
tied  5 times
lost  0 times

control v. x-short_runs
-----------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.217  tied
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.412  tied
    4.533  4.533  tied
    4.222  4.222  tied

won   0 times
tied  5 times
lost  0 times

control v. x-image_size
-----------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.434  lost  +100.00%
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  4 times
lost  1 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.118  won     -6.66%
    4.533  4.533  tied
    4.222  3.958  won     -6.25%

won   2 times
tied  3 times
lost  0 times

control v. x-crack_images
-------------------------

false positive percentages
    0.000  0.000  tied
    0.217  0.217  tied
    0.000  0.000  tied
    0.219  0.219  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

...

false negative percentages
    4.199  4.199  tied
    1.404  1.404  tied
    4.412  4.118  won     -6.66%
    4.533  3.966  won    -12.51%
    4.222  3.430  won    -18.76%

won   3 times
tied  2 times
lost  0 times

I didn't do anything to verify the accuracy of my spam and ham data. I'm
doing that now. Also, the fact that the first two tests were identical to
the control seems a bit suspicious, so I'm going to try them again after
picking over my training database. Still, the image_size and crack_images
runs look promising, perhaps because my recent spam is so full of these
pump and dump spams.

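
For readers unfamiliar with the comparison tables above: the percentage
after "won" or "lost" is consistent with the relative change of the trial
error rate against the control error rate for that fold. The following is a
minimal sketch of that arithmetic only. The function names are hypothetical
and this is not the code of the SpamBayes comparison script.

```python
def relative_change(control, trial):
    """Percentage change of trial relative to control (negative means fewer errors)."""
    if control == 0:
        raise ValueError("relative change is undefined when the control rate is 0")
    return (trial - control) / control * 100.0


def verdict(control, trial):
    """Label a pair of error rates the way the tables do: lower is better."""
    if trial == control:
        return "tied"
    outcome = "won" if trial < control else "lost"
    # Raises ValueError for a nonzero trial rate against a zero control rate.
    return "%s %+.2f%%" % (outcome, relative_change(control, trial))


if __name__ == "__main__":
    # Pairs taken from the x-image_size and x-crack_images tables above.
    for control, trial in [(0.217, 0.434), (4.412, 4.118),
                           (4.533, 3.966), (4.222, 3.430), (0.000, 0.000)]:
        print(control, trial, verdict(control, trial))
```

Because the change is relative, the single extra false positive in the
x-image_size run (0.217 to 0.434) shows up as +100.00%, while a larger
absolute drop in false negatives (4.222 to 3.430) shows up as -18.76%.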
Skip

From dave at boost-consulting.com Mon Aug 7 19:15:00 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Mon, 07 Aug 2006 13:15:00 -0400
Subject: [spambayes-dev] sb_imapfilter: bad FETCH response
References: <87wtgg4goi.fsf@boost-consulting.com>
Message-ID:

A non-text attachment was scrubbed...
Name: sb_imapfilter.diff
Type: text/x-patch
Size: 1049 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060807/d747a971/attachment.bin

From tim.peters at gmail.com Tue Aug 8 04:47:00 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Mon, 7 Aug 2006 22:47:00 -0400
Subject: [spambayes-dev] Spambayes is starting not to work due to retaliatory action by spammers
In-Reply-To: <002701c6b8d4$68031d90$1202a8c0@trump>
References: <1f7befae0608030025i3ce746eatbd264fd5ad094725@mail.gmail.com>
 <002701c6b8d4$68031d90$1202a8c0@trump>
Message-ID: <1f7befae0608071947t24c28cafwc5892d335ce8f8af@mail.gmail.com>

[James Masters]
> Thank you very much for your comprehensive reply and apologies to the group
> for putting my email to the wrong place.

No apology necessary. I didn't even intend to imply you were posting
in a wrong place, just pointing out that the same topic just happened
to be actively discussed elsewhere. Since SpamBayes in fact does a
much poorer job on image-based spam than on "traditional" spam, and
that's A Problem for both users and developers, discussing it on both
the user and developer lists is thoroughly appropriate.

> If I have anything more to write, I'll put it in the forum you mention.

But only if it's appropriate there, else we'll have to ask you to
apologize ;-)

From skip at pobox.com Tue Aug 8 06:10:15 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 7 Aug 2006 23:10:15 -0500
Subject: [spambayes-dev] Updated test results
Message-ID: <17624.3751.511224.861381@montanaro.dyndns.org>

I picked through my new training database, found one or two outright
mistakes, deleted a few other administrative mails, fixed a few bugs in my
recent checkins and rebalanced my database. I then made a baseline run with
the following settings:

    [globals]
    verbose: True

    [Headers]
    include_evidence: True

    [Tokenizer]
    record_header_absence: True
    summarize_email_prefixes: True
    summarize_email_suffixes: True
    mine_received_headers:True
    x-pick_apart_urls:True
    x-fancy_url_recognition:False
    x-lookup_ip:False
    lookup_ip_cache:~/src/spambayes/ip.pickle
    x-short_runs:False
    x-image_size:False
    x-crack_images:False
    x-max_image_size:100000

    [Categorization]
    ham_cutoff: 0.15
    spam_cutoff: 0.50

    [Storage]
    persistent_storage_file: ~/src/spambayes/test.pickle
    persistent_use_database: pickle

followed by a series of test runs, each one with one of the following
options set to True:

    x-lookup_ip
    x-short_runs
    x-image_size
    x-crack_images

All tests were run against the same combination of ham and spam:

    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> tested 459 hams & 359 spams against 1836 hams & 1436 spams

baseline vs. x-lookup_ip:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.218  0.218  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

false negative percentages
    2.228  1.671  won    -25.00%
    3.343  3.064  won     -8.35%
    5.292  4.735  won    -10.53%
    4.735  4.457  won     -5.87%
    2.786  2.507  won    -10.01%

won   5 times
tied  0 times
lost  0 times

baseline vs. x-short_runs:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.218  0.218  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

false negative percentages
    2.228  2.228  tied
    3.343  3.343  tied
    5.292  5.292  tied
    4.735  4.735  tied
    2.786  2.786  tied

won   0 times
tied  5 times
lost  0 times

baseline vs. x-image_size:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.218  0.218  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

false negative percentages
    2.228  1.950  won    -12.48%
    3.343  3.343  tied
    5.292  5.014  won     -5.25%
    4.735  4.457  won     -5.87%
    2.786  2.786  tied

won   3 times
tied  2 times
lost  0 times

baseline vs. x-crack_images:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.218  0.218  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

false negative percentages
    2.228  1.671  won    -25.00%
    3.343  3.064  won     -8.35%
    5.292  4.457  won    -15.78%
    4.735  4.457  won     -5.87%
    2.786  2.786  tied

won   4 times
tied  1 times
lost  0 times

Based on the mixture of ham and spam I have it would appear only the
x-short_runs option doesn't help discriminate ham from spam.

Skip

From matt at mondoinfo.com Tue Aug 8 21:37:02 2006
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Tue, 8 Aug 2006 14:37:02 -0500 (CDT)
Subject: [spambayes-dev] Updated test results
In-Reply-To: <17624.3751.511224.861381@montanaro.dyndns.org>
References: <17624.3751.511224.861381@montanaro.dyndns.org>
Message-ID: <1155050781.71.19994@mint-julep.mondoinfo.com>

> baseline vs. x-lookup_ip:
[. . .]
> false negative percentages > 2.228 1.671 won -25.00% > 3.343 3.064 won -8.35% > 5.292 4.735 won -10.53% > 4.735 4.457 won -5.87% > 2.786 2.507 won -10.01% > > won 5 times > tied 0 times > lost 0 times I'm glad to see that. That's the sort of improvement that I see with that code, but I think it's the first time that anyone else has reproduced it. Still, as people have pointed out before, there's at least one potential problem in the code. That's that data from DNS isn't necessarily stable. If someone needed to un-train their database on a message a day or two later, the tokens generated might easily not be the same as they were when the message was first trained on. That could send a token's count below zero. That doesn't affect me in practice, but it would surely affect someone if the code were used widely. Fixing it in general would require some rather elaborate persistence mechanism, I think. Regards, Matt From skip at pobox.com Tue Aug 8 22:12:09 2006 From: skip at pobox.com (skip at pobox.com) Date: Tue, 8 Aug 2006 15:12:09 -0500 Subject: [spambayes-dev] Updated test results In-Reply-To: <1155050781.71.19994@mint-julep.mondoinfo.com> References: <17624.3751.511224.861381@montanaro.dyndns.org> <1155050781.71.19994@mint-julep.mondoinfo.com> Message-ID: <17624.61465.232582.676437@montanaro.dyndns.org> Matt> Still, as people have pointed out before, there's at least one Matt> potential problem in the code. That's that data from DNS isn't Matt> necessarily stable.... Matt> That doesn't affect me in practice, but it would surely affect Matt> someone if the code were used widely. Fixing it in general would Matt> require some rather elaborate persistence mechanism, I think. Or simply retraining from scratch after deleting your cache. Speaking of which, I gave up on persistence via the dbm or zodb routes. Instead I just save/restore the cache using pickle. I'll probably check that into CVS this evening. 
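For anyone following along, a minimal sketch of what such a pickle save/restore could look like. It assumes the cache is the plain {"A": {}, "PTR": {}} dict shown earlier in this thread; the names load_cache and save_cache are hypothetical, and this is not necessarily what gets checked into dnscache.py.

```python
import os
import pickle


def load_cache(path):
    """Return the cache pickled at path, or a fresh empty cache if it is missing or unreadable."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        # No usable cache file: start over with empty A and PTR tables.
        return {"A": {}, "PTR": {}}


def save_cache(caches, path):
    """Pickle the cache to a temporary file, then rename it over the real one."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(caches, f, pickle.HIGHEST_PROTOCOL)
    # Renaming last means an interrupted write leaves the old cache intact.
    os.replace(tmp, path)
```

Deleting the pickle file and retraining from scratch then starts from an empty cache, which is the simple way around the unstable-DNS problem Matt describes.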
Skip From dave at boost-consulting.com Wed Aug 9 02:28:58 2006 From: dave at boost-consulting.com (David Abrahams) Date: Tue, 08 Aug 2006 20:28:58 -0400 Subject: [spambayes-dev] Is IMAP supported? Message-ID: Hi, I made a bug report in January, and recently followed up with a partial diagnosis, but have received no reply to either one. Is sb_imapfilter still supported? Is there someone I should contact directly about this problem? I'd like to be able to make an informed decision about what to do about it next... Thanks in advance, -- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Wed Aug 9 17:55:41 2006 From: skip at pobox.com (skip at pobox.com) Date: Wed, 9 Aug 2006 10:55:41 -0500 Subject: [spambayes-dev] [Spambayes] Posting problems In-Reply-To: References: Message-ID: <17626.1405.993809.368716@montanaro.dyndns.org> >>>>> "Dave" == David Abrahams writes: Dave> David Abrahams writes: >> I've posted several messages to this list through GMane, and wondered >> why nobody answered them. Well as it turned out, I wasn't subscribed, >> and you don't seem to be accepting posts from nonsubscribers (totally >> understandable). But I got no clue that subscription was needed, and >> GMane shows my posts anyway: >> >> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613 >> >> They just don't show up in your email archive. Furthermore, for a >> little extra weirdness, I've posted successfully here before: >> >> http://mail.python.org/pipermail/spambayes/2003-December/author.html >> >> I dunno what's going on here. Dave> Um, I guess I was confusing the -dev list (on which nobody is Dave> answering me) with this one. Sorry for the noise. Dave> But if someone could get back to me on the -dev list I'd really Dave> appreciate it! Even just an ACK would be useful at this point! I saw no pending moderator requests on either spambayes or spambayes-dev. 
I saw nothing in the Mailman config for spambayes-dev that would prevent you from posting. Perhaps GMane isn't actually posting the messages. Skip From dave at boost-consulting.com Wed Aug 9 18:06:53 2006 From: dave at boost-consulting.com (David Abrahams) Date: Wed, 09 Aug 2006 12:06:53 -0400 Subject: [spambayes-dev] [Spambayes] Posting problems In-Reply-To: <17626.1405.993809.368716@montanaro.dyndns.org> (skip@pobox.com's message of "Wed, 9 Aug 2006 10:55:41 -0500") References: <17626.1405.993809.368716@montanaro.dyndns.org> Message-ID: skip at pobox.com writes: >>>>>> "Dave" == David Abrahams writes: > > Dave> David Abrahams writes: > >> I've posted several messages to this list through GMane, and wondered > >> why nobody answered them. Well as it turned out, I wasn't subscribed, > >> and you don't seem to be accepting posts from nonsubscribers (totally > >> understandable). But I got no clue that subscription was needed, and > >> GMane shows my posts anyway: > >> > >> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613 > >> > >> They just don't show up in your email archive. Furthermore, for a > >> little extra weirdness, I've posted successfully here before: > >> > >> http://mail.python.org/pipermail/spambayes/2003-December/author.html > >> > >> I dunno what's going on here. > > Dave> Um, I guess I was confusing the -dev list (on which nobody is > Dave> answering me) with this one. Sorry for the noise. > > Dave> But if someone could get back to me on the -dev list I'd really > Dave> appreciate it! Even just an ACK would be useful at this point! > > I saw no pending moderator requests on either spambayes or spambayes-dev. I > saw nothing in the Mailman config for spambayes-dev that would prevent you > from posting. Perhaps GMane isn't actually posting the messages. 
No, it is posting the messages: http://mail.python.org/pipermail/spambayes-dev/2006-August/003701.html But the archive, at least, seems to have scrubbed out all the content along with the patch. The original message in the thread, however, does appear: http://mail.python.org/pipermail/spambayes-dev/2006-January/003616.html Again, you can see what I actually posted at http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613 -- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Wed Aug 9 19:53:38 2006 From: skip at pobox.com (skip at pobox.com) Date: Wed, 9 Aug 2006 12:53:38 -0500 Subject: [spambayes-dev] [Spambayes] Posting problems In-Reply-To: References: <17626.1405.993809.368716@montanaro.dyndns.org> Message-ID: <17626.8482.159339.985701@montanaro.dyndns.org> Dave> No, it is posting the messages: Dave> http://mail.python.org/pipermail/spambayes-dev/2006-August/003701.html Dave> But the archive, at least, seems to have scrubbed out all the Dave> content along with the patch. Looking at the gmane version of the message, I do remember seeing it, so it's clearly getting to the list. The attachment is here: http://mail.python.org/pipermail/spambayes-dev/attachments/20060807/d747a971/attachment.bin though as you indicated the message body seems to have been vaporized. Dave> Again, you can see what I actually posted at Dave> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613 My best guess is that it's a pipermail bug.
Skip From dave at boost-consulting.com Wed Aug 9 20:32:57 2006 From: dave at boost-consulting.com (David Abrahams) Date: Wed, 09 Aug 2006 14:32:57 -0400 Subject: [spambayes-dev] [Spambayes] Posting problems References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> Message-ID: skip at pobox.com writes: > Dave> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613 > > Looking at the gmane version of the message, I do remember seeing it, so > it's clearly getting to the list. Any idea why I'm not getting an answer? -- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Wed Aug 9 20:49:56 2006 From: skip at pobox.com (skip at pobox.com) Date: Wed, 9 Aug 2006 13:49:56 -0500 Subject: [spambayes-dev] [Spambayes] Posting problems In-Reply-To: References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> Message-ID: <17626.11860.614678.323143@montanaro.dyndns.org> Dave> skip at pobox.com writes: Dave> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613 >> >> Looking at the gmane version of the message, I do remember seeing it, so >> it's clearly getting to the list. Dave> Any idea why I'm not getting an answer? Lack of round tuits perhaps? Skip From tim.peters at gmail.com Wed Aug 9 21:34:37 2006 From: tim.peters at gmail.com (Tim Peters) Date: Wed, 9 Aug 2006 15:34:37 -0400 Subject: [spambayes-dev] [Spambayes] Posting problems In-Reply-To: References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> Message-ID: <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> [David Abrahams] > Any idea why I'm not getting an answer? Perhaps because you've become the leading expert on sb_imapfilter ;-) It might help to drop the meta-discussion and start over with what the problem is. 
From dave at boost-consulting.com Wed Aug 9 21:54:38 2006 From: dave at boost-consulting.com (David Abrahams) Date: Wed, 09 Aug 2006 15:54:38 -0400 Subject: [spambayes-dev] [Spambayes] Posting problems References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> Message-ID: "Tim Peters" writes: > [David Abrahams] >> Any idea why I'm not getting an answer? > > Perhaps because you've become the leading expert on sb_imapfilter ;-) That's what I was afraid of. -- Dave Abrahams Boost Consulting www.boost-consulting.com From dave at boost-consulting.com Wed Aug 9 22:04:18 2006 From: dave at boost-consulting.com (David Abrahams) Date: Wed, 09 Aug 2006 16:04:18 -0400 Subject: [spambayes-dev] sb_imapfilter: problem parsing result of FETCH (was: Posting problems) References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> Message-ID: A non-text attachment was scrubbed... Name: sb_imapfilter.diff Type: text/x-patch Size: 1049 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060809/040a726c/attachment.bin From skip at pobox.com Thu Aug 10 07:00:04 2006 From: skip at pobox.com (skip at pobox.com) Date: Thu, 10 Aug 2006 00:00:04 -0500 Subject: [spambayes-dev] Latest image spam/OCR update Message-ID: <17626.48468.837805.425784@montanaro.dyndns.org> I just checked in a couple significant changes to the OCR stuff. First, I added support for conversion of input images using PIL. That means netpbm is no longer required. PIL is faster and more robust than netpbm, and is platform-independent. Perhaps someone in Windows-land can take the time to see if it's possible to build ocrad on Windows. 
We could then (in theory, at least) distribute an ocrad installer alongside the SpamBayes Windows installer and perform crude, but apparently effective, OCR analysis of image-based spam. The second change to the OCR code was the addition of a simple pickled cache file (controlled by the "crack_image_cache" option). The conversion to netpbm format is still required; however, the ocrad step is skipped if the md5 hexdigest of the generated image is present in the cache. In this case any cached text and tokens are returned. I have no Windows capability, so someone else will have to take the steps necessary to make this all play on Windows. There are a few other things that need testing, but I'm out of time. First, I arbitrarily set an upper limit of 100kbytes on input images (per image before converting to netpbm). I think that allows all images that would hold spam content, but I'm not sure I have many images in my training database besides spam. I don't know if that's a useful cutoff or if there should even be a cutoff. Second, I observed that ocrad routinely seemed to get the letter case wrong (e.g. coming up with "EGLy" instead of "EGLY"), so I blindly downshift its output. I have nothing other than that simple observation to suggest that should be done. Third, if other people have training databases, running N-fold cross validation tests of these new gimmicks would be beneficial. It would be nice if others could verify my results before a new release is made. Finally, if you're a Python programmer (or aspire to be one), picking through the new code would be a good check. Too bad the summer's nearly over. We could use a Summer of Code intern...
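To make the caching and downshifting concrete, here is a rough sketch of the idea rather than the checked-in code. The function name cached_ocr and the run_ocr callable (a stand-in for however ocrad actually gets invoked) are hypothetical, and the cache here is just a dict that would be pickled to the file named by crack_image_cache. It caches only the text, not the tokens.

```python
import hashlib


def cached_ocr(image_bytes, cache, run_ocr):
    """Return lowercased OCR text for image_bytes, calling run_ocr only on a cache miss."""
    # Key the cache on the md5 hexdigest of the converted image data.
    key = hashlib.md5(image_bytes).hexdigest()
    if key not in cache:
        # OCR output often has the wrong letter case, so downshift it blindly.
        cache[key] = run_ocr(image_bytes).lower()
    return cache[key]
```

Since spammers resend the same image many times, a hit on the digest skips the expensive OCR step entirely; an image altered by even one byte hashes differently and is OCR'd again.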
Skip From tameyer at ihug.co.nz Thu Aug 10 07:54:56 2006 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu, 10 Aug 2006 17:54:56 +1200 Subject: [spambayes-dev] Posting problems In-Reply-To: <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> Message-ID: <6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz> > [David Abrahams] >> Any idea why I'm not getting an answer? [Tim Peters] > Perhaps because you've become the leading expert on sb_imapfilter ;-) Or it could be because the wife of the previously leading expert on sb_imapfilter is due to have their first child any day now ;). Those round tuits are pretty scarce here at the moment. sb_imapfilter has always been an unloved child. Unlike most of the rest of the SpamBayes code, it wasn't scratching an itch, but shutting up people asking for it on spambayes at python.org. Back then I had time to spare, so Tim Stone & I put it together - I didn't have an IMAP account at the time. I still dislike IMAP, so use POP for all my accounts, so although I probably know the code better than anyone else (although it has been a while), I rarely exercise it. I've put it up for adoption on spambayes-dev at various times, but no-one has taken up the offer. I'll try to take a look at your message & the problem this weekend. I know that the code I added between 1.1a1 and 1.1a2 to both sb_server and sb_imapfilter to deal with the "changed database type causes a crash" bug wasn't as well designed as it should have been, and does cause the odd problem. I plan to fix that as soon as I can.
=Tony.Meyer From mhammond at skippinet.com.au Thu Aug 10 09:32:25 2006 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu, 10 Aug 2006 17:32:25 +1000 Subject: [spambayes-dev] [Spambayes] Latest image spam/OCR update In-Reply-To: <17626.48468.837805.425784@montanaro.dyndns.org> Message-ID: <192c01c6bc4f$26046ba0$0200a8c0@enfoldsystems.local> > Perhaps someone in Windows-land can > take the time to > see if it's possible to build ocrad on Windows. Using cygwin and gcc I was able to build an ocrad.exe on Windows (with one simple patch necessary; a complaint about std::sprintf - just removing the 'std::' prefix got it building) Sadly that is all I have time for today too though, but if anyone wants that .exe to fiddle with, let me know. Mark From dave at boost-consulting.com Thu Aug 10 18:48:45 2006 From: dave at boost-consulting.com (David Abrahams) Date: Thu, 10 Aug 2006 12:48:45 -0400 Subject: [spambayes-dev] Posting problems References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> <6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz> Message-ID: Tony Meyer writes: >> [David Abrahams] >>> Any idea why I'm not getting an answer? > > [Tim Peters] >> Perhaps because you've become the leading expert on sb_imapfilter ;-) > > Or it could be because the wife of the previously leading expert on > sb_imapfilter is due to have their first child any day now ;). Those > round tuits are pretty scarce here at the moment. Understood. Congratulations, though! > > sb_imapfilter has always been an unloved child. Unlike most of the > rest of the SpamBayes code, it's wasn't a scratching an itch, but > shutting up people asking for it on spambayes at python.org. Back then > I had time to spare, so Tim Stone & I put it together - I didn't have > an IMAP account at the time. 
> > I still dislike IMAP, so use POP for all my accounts, I don't see POP as an option if I want server-side mail storage. > so although I probably know the code better than anyone else > (although it has been a while), I rarely exercise it. I've put it > up for adoption on spambayes-dev at various times, but no-one has > taken up the offer. I might be willing to be trained toward that end (there's lots I want to do with IMAP and so it would be good to learn how), but I'm sure not competent to do it right now. > I'll try to take a look at your message & the problem this weekend. Thanks, I really appreciate it. > I know that the code I added between 1.1a1 and 1.1a2 to both > sb_server and sb_imapfilter to deal with the "changed database type > causes a crash" bug wasn't as well designed as it should have been, > and does cause the odd problem. I plan to fix that as soon as I can. Thanks again, Tony -- Dave Abrahams Boost Consulting www.boost-consulting.com From skip at pobox.com Sun Aug 13 18:56:20 2006 From: skip at pobox.com (skip at pobox.com) Date: Sun, 13 Aug 2006 11:56:20 -0500 Subject: [spambayes-dev] [Spambayes-checkins] spambayes/spambayes dnscache.py, 1.2, 1.3 In-Reply-To: <20060813020548.AA6721E4002@bag.python.org> References: <20060813020548.AA6721E4002@bag.python.org> Message-ID: <17631.22964.232555.383050@montanaro.dyndns.org> Tony> Remove reference to Skip, probably left there by mistake :) Yes, probably... Thanks for catching it. S From skip at pobox.com Sun Aug 13 20:48:37 2006 From: skip at pobox.com (skip at pobox.com) Date: Sun, 13 Aug 2006 13:48:37 -0500 Subject: [spambayes-dev] Patch for ocrad to run on Windows? Message-ID: <17631.29701.373791.625191@montanaro.dyndns.org> Mark Hammond and Sean True both said they had an ocrad.exe executable built under cygwin. (Hopefully it doesn't require cygwin runtime?) Was the only change you made to the source the "std::fprintf" -> "fprintf" replacement? 
Ocrad is GPL'd, so all we have to do to make it available is also distribute the modified source. If you can stick the .exe file somewhere and let me know if there are any Windows version restrictions, I'll put together the requisite modified Ocrad source distribution and place both the distribution and the executable on the SpamBayes website for Windows users to try out. I'll also send a (second) note to Antonio Diaz Diaz, the Ocrad author, letting him know where it is. Thx, Skip From tameyer at ihug.co.nz Sun Aug 13 23:32:46 2006 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon, 14 Aug 2006 09:32:46 +1200 Subject: [spambayes-dev] Patch for ocrad to run on Windows? In-Reply-To: <17631.29701.373791.625191@montanaro.dyndns.org> References: <17631.29701.373791.625191@montanaro.dyndns.org> Message-ID: <31F368D4-4103-40B6-955A-698D8F813BD6@ihug.co.nz> > Mark Hammond and Sean True both said they had an ocrad.exe > executable built > under cygwin. (Hopefully it doesn't require cygwin runtime?) AFAIK, it will require cygwin1.dll, unless a change is also made (it is in the attached patch) to compile with -mno-cygwin. This seems to run fine on my machine, without any of the cygwin DLLs (they are installed, of course, but shouldn't be accessible outside of a Cygwin shell). > Was the only > change you made to the source the "std::fprintf" -> "fprintf" > replacement? Two of these, plus the Makefile.in as above. > Ocrad is GPL'd, so all we have to do to make it available is also > distribute > the modified source. If you can stick the .exe file somewhere and > let me > know if there are any Windows version restrictions, I'll put > together the > requisite modified Ocrad source distribution and place both the > distribution > and the executable on the SpamBayes website for Windows users to > try out. Patch is attached. .exe is at: http://tangomu.com/ocrad.exe I have no idea about Windows version restrictions. 
My assumption would be it will run on any version from Win95 to WinXP (no idea about Vista). =Tony.Meyer -------------- next part -------------- A non-text attachment was scrubbed... Name: ocrad.patch Type: application/octet-stream Size: 2004 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060814/b54b7471/attachment.obj -------------- next part -------------- From skip at pobox.com Mon Aug 14 05:37:20 2006 From: skip at pobox.com (skip at pobox.com) Date: Sun, 13 Aug 2006 22:37:20 -0500 Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows Message-ID: <17631.61424.222629.225936@montanaro.dyndns.org> I updated the OCR capabilities a bit more today. I added more intelligent assembly of split images into a single image after noticing that the spammers don't simply chop up multi-part GIF images horizontally. I also added a couple extra options (ocrad_scale and ocrad_charset) which control the image scaling factor (default is 2) and character set (default is "ascii") Ocrad uses. Scaling the image by a factor of 2 was a pretty obvious win: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 5 times lost 0 times total unique fp went from 0 to 0 tied mean fp % went from 0.0 to 0.0 tied false negative percentages 4.213 4.213 tied 1.404 0.843 won -39.96% 3.371 2.809 won -16.67% 2.528 2.247 won -11.12% 4.213 3.652 won -13.32% won 4 times tied 1 times lost 0 times total unique fn went from 56 to 49 won -12.50% mean fn % went from 3.14606741573 to 2.75280898876 won -12.50% Scaling by a factor of three was even better in the false negative department but regressed a bit in the false positive category so I checked Options.py in with a default scaling factor of 2. A couple things could stand to be further tested: * I have no idea how good Ocrad's scaling algorithm is. It's possible that PIL or NetPBM's scaling code is better. 
If so, it would make sense to scale the images before feeding them to Ocrad. * The images I've seen so far were all plain English, so I blindly made ascii the default charset. The other choices were iso-8859-9 and iso-8859-15. I simply assumed ascii would be the most appropriate default, but didn't test it. Finally, I put together a really simpleminded Ocrad-for-Windows release based upon the ocrad.exe binary that Tony built. Check the Files section of the SpamBayes project site: http://sourceforge.net/project/showfiles.php?group_id=61702 and grab ocrad-cygwin. There are a few caveats: 1. I don't do Windows. (No, really, I don't, strange as that may seem.) This is no fancy-schmancy point-and-shoot Windows installer. It's just a simple zip file with the Ocrad 0.15 distribution, Tony's .exe file and the patch he applied to the source. 2. I don't do Windows. The code I've written so far has been done entirely on my Mac. I've made no obvious concessions to portability. That said, I hope portability issues won't be daunting for any early adopters. 3. I don't do Windows. If you have problems it won't do you any good to mail me directly. Post about problems on the SpamBayes bug tracker: http://sourceforge.net/tracker/?group_id=61702&atid=498103 4. If you do Windows you will need PIL to take advantage of the recent changes: http://www.pythonware.com/products/pil/ (unless you want to put hair on your chest and build NetPBM on Windows). Fredrik Lundh provides prebuilt Windows versions of PIL. Grab the one appropriate for the version of Python you have installed. 5. If you do Windows (or any other platform for that matter), feedback to the lists about successes and failures would be helpful. Cheers, Skip From skip at pobox.com Sat Aug 19 23:06:10 2006 From: skip at pobox.com (skip at pobox.com) Date: Sat, 19 Aug 2006 16:06:10 -0500 Subject: [spambayes-dev] How about a 1.1a3 release? Message-ID: <17639.32066.677940.963348@montanaro.dyndns.org> Any thought on making a 1.1a3 release?
I'd like to get the image spam stuff into more peoples' hands. (Has anyone tried it yet?) Tony is extremely busy, and doesn't have the requisite Win2K setup to create a widely runnable Windows installer. Can someone else do that? While I have the ocrad-cygwin zipfile available for people to download, what do people think about bundling the ocrad.exe file and the source patch as part of a SpamBayes Windows installer? I confirmed with the Ocrad author that all we need to distribute is the (very small) patch, not the entire distribution as I originally did. Thx, Skip From tameyer at ihug.co.nz Sat Aug 19 23:38:02 2006 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun, 20 Aug 2006 09:38:02 +1200 Subject: [spambayes-dev] How about a 1.1a3 release? In-Reply-To: <17639.32066.677940.963348@montanaro.dyndns.org> References: <17639.32066.677940.963348@montanaro.dyndns.org> Message-ID: <97CA4326-5332-466F-A2AA-454C6E1C3F91@ihug.co.nz> > Any thought on making a 1.1a3 release? +1 > I'd like to get the image spam stuff > into more peoples' hands. (Has anyone tried it yet?) Bits of it. I'll report more when I can :) (If this baby would hurry up and be born, that would help ;) > Tony is extremely > busy, and doesn't have the requisite Win2K setup to create a widely > runnable > Windows installer. Can someone else do that? To clarify this: this is the same issue I had with 1.1a2 - I don't have access to Outlook 2000 any more (I have 2002 and 2007b2). Last time Mark did this for me (basically, just cvs-up, run setup_all.py, and either do the Inno part or just email me the dist folder and I can do the rest). Alternatively, we could do a Windows build with Outlook 2002 and see how much complaining there is ;) > While I have the ocrad-cygwin zipfile available for people to > download, what > do people think about bundling the ocrad.exe file and the source > patch as > part of a SpamBayes Windows installer?
I confirmed with the Ocrad > author > that all we need to distribute in the (very small) patch, not the > entire > distribution as I originally did. Fine by me. I can make the changes to the Inno installer script if this is ok with everyone. =Tony.Meyer From dave at boost-consulting.com Sun Aug 20 04:05:15 2006 From: dave at boost-consulting.com (David Abrahams) Date: Sat, 19 Aug 2006 22:05:15 -0400 Subject: [spambayes-dev] Posting problems In-Reply-To: <6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz> (Tony Meyer's message of "Thu, 10 Aug 2006 17:54:56 +1200") References: <17626.1405.993809.368716@montanaro.dyndns.org> <17626.8482.159339.985701@montanaro.dyndns.org> <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com> <6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz> Message-ID: Tony Meyer writes: > I'll try to take a look at your message & the problem this weekend. > I know that the code I added between 1.1a1 and 1.1a2 to both > sb_server and sb_imapfilter to deal with the "changed database type > causes a crash" bug wasn't as well designed as it should have been, > and does cause the odd problem. I plan to fix that as soon as I can. Hi Tony, Any progress on this one? -- Dave Abrahams Boost Consulting www.boost-consulting.com From mhammond at skippinet.com.au Sun Aug 20 09:22:54 2006 From: mhammond at skippinet.com.au (Mark Hammond) Date: Sun, 20 Aug 2006 17:22:54 +1000 Subject: [spambayes-dev] How about a 1.1a3 release? In-Reply-To: <97CA4326-5332-466F-A2AA-454C6E1C3F91@ihug.co.nz> Message-ID: Tony writes: > To clarify this: this is the same issue I had with 1.1a2 - I don't > have access to Outlook 2000 any more (I have 2002 and 2007b2). Last > time Mark did this for me (basically, just cvs-up, run setup_all.py, > and either do the Inno part or just email me the dist folder and I > can do the rest). 
I'm happy to turn that crank - just say the word (and let me know which of those cranks you prefer) > Alternatively, we could do a Windows build with Outlook 2002 and see > how much complaining there is ;) I'm still on Office-2k - although I expect that to change shortly - so this may well be the last release I can simply make using outlook 2000. Mark From tameyer at ihug.co.nz Sun Aug 20 11:08:13 2006 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun, 20 Aug 2006 21:08:13 +1200 Subject: [spambayes-dev] How about a 1.1a3 release? In-Reply-To: References: Message-ID: [Building a 1.1a3 binary] > I'm happy to turn that crank - just say the word (and let me know > which of > those cranks you prefer) Great. Skip - just tell Mark when you feel everything is ready, and Mark can (cvs-up and) run setup_all.py, compress the resulting dist folder, and send that to me via FTP (details offlist). >> Alternatively, we could do a Windows build with Outlook 2002 and see >> how much complaining there is ;) > > I'm still on Office-2k - although I expect that to change shortly - > so this > may well be the last release I can simply make using outlook 2000. Alternatively, we could drop OL2K support for 1.1, at least for now, and see if anyone complains (and if they do, they can maybe volunteer the price of a 2nd-hand copy of Office-2k <0.5 wink>). =Tony.Meyer From sethg at GoodmanAssociates.com Mon Aug 21 01:56:41 2006 From: sethg at GoodmanAssociates.com (Seth Goodman) Date: Sun, 20 Aug 2006 18:56:41 -0500 Subject: [spambayes-dev] How about a 1.1a3 release? In-Reply-To: Message-ID: On -0500, Tony Meyer wrote: > > > Alternatively, we could do a Windows build with Outlook 2002 > > > and see how much complaining there is ;) > > > > I'm still on Office-2k - although I expect that to change shortly > > - so this may well be the last release I can simply make using > > outlook 2000. 
> > Alternatively, we could drop OL2K support for 1.1, at least for now, > and see if anyone complains (and if they do, they can maybe > volunteer the price of a 2nd-hand copy of Office-2k <0.5 wink>). I'm still stuck on Win2K/Office2K for quite a while, yet. I'd be willing to obtain a copy of Outlook2K for someone. Does this mean shipping to NZ? -- Seth Goodman not in NZ From skip at pobox.com Mon Aug 21 16:07:36 2006 From: skip at pobox.com (skip at pobox.com) Date: Mon, 21 Aug 2006 09:07:36 -0500 Subject: [spambayes-dev] How about a 1.1a3 release? In-Reply-To: References: Message-ID: <17641.48680.645677.461714@montanaro.dyndns.org> >> I'm happy to turn that crank - just say the word (and let me know >> which of those cranks you prefer) Tony> Great. Skip - just tell Mark when you feel everything is ready, Tony> and Mark can (cvs-up and) run setup_all.py, compress the resulting Tony> dist folder, and send that to me via FTP (details offlist). I think we're about ready except for boosting the version info in spambayes/__init__.py: __version__ = "1.1a3" __date__ = _("August 2006") Feel free to turn the crank. I agree that trying a build on a more recent version of Outlook would be a good idea. For testing purposes that probably opens up the pool of potential release builders a bit. When we near a final release, if OL2K is still deemed desirable, we can cut a release that supports it. Skip From mhammond at skippinet.com.au Tue Aug 22 23:51:05 2006 From: mhammond at skippinet.com.au (Mark Hammond) Date: Wed, 23 Aug 2006 07:51:05 +1000 Subject: [spambayes-dev] How about a 1.1a3 release? In-Reply-To: <17641.48680.645677.461714@montanaro.dyndns.org> Message-ID: <04d801c6c635$12848980$2f0a0a0a@enfoldsystems.local> Skip writes: > Feel free to turn the crank. I agree that trying a build on > a more recent version of Outlook would be a good idea. For testing > purposes that probably > opens up the pool of potential release builders a bit. 
I currently *only* have Office2k installed. Thus, it is not possible for me
to build a version that depends on a later version.

The next time someone without Office2k installed wants to build a new
version, they should just try and do so, patching the code where necessary.
This would include 'addin.py', with the lines starting:

    gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0,
                          bForDemand=True,
                          bValidateFile=bValidateGencache) # Outlook 9

The existing code should be kept in place, but wrapped with an exception
handler that 'falls back' to the newer version. This code should probably
be cloned into setup_all.py, and depending on success or failure, change
the 'typelibs' option passed to py2exe, reflecting what is known to be
installed. I'd suggest that this print a fairly noisy warning so the
packager is aware the built version will not work on Office 2k.

On a more general note though, I think it is fairly clear that for all
official releases, Office2k remain supported for a few years yet - when a
few people on the -dev list still use Office2k, I would guess that many
more users also do.

I can't make the changes I recommend above as I don't have OfficeXP
installed - but if someone else makes the change so it works for them, I'd
be happy to repair any unintended breakage on Office2k systems.

Mark

From mhammond at skippinet.com.au Wed Aug 23 15:26:28 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 23 Aug 2006 23:26:28 +1000
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <17631.61424.222629.225936@montanaro.dyndns.org>
Message-ID: <007201c6c6b7$be617290$020a0a0a@enfoldsystems.local>

Hi Skip,

> Scaling the image by a factor of 2 was a pretty
> obvious win:
>
> false positive percentages
>     0.000 0.000 tied
>     0.000 0.000 tied
>     0.000 0.000 tied
>     0.000 0.000 tied
>     0.000 0.000 tied

I'm playing a little with the new code and am trying to get things working
with outlook.
I'm a little stuck working out how to get some test data (and it
doesn't help I'm a little rusty wrt spambayes :)

I'm trying to run the testtools code. The Outlook code that sets up the
Data/Ham, Data/Spam directories etc just exports the text body of the
message, but completely ignores 'attachments'. I'm out of time for tonight
- can you offer any quick clues how your test environment is setup?

Thanks,

Mark.

From skip at pobox.com Wed Aug 23 17:33:03 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 23 Aug 2006 10:33:03 -0500
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <007201c6c6b7$be617290$020a0a0a@enfoldsystems.local>
References: <17631.61424.222629.225936@montanaro.dyndns.org> <007201c6c6b7$be617290$020a0a0a@enfoldsystems.local>
Message-ID: <17644.29999.936120.882486@montanaro.dyndns.org>

    Mark> I'm trying to run the testtools code. The Outlook code that sets
    Mark> up the Data/Ham, Data/Spam directories etc just exports the text
    Mark> body of the message, but completely ignores 'attachments'. I'm
    Mark> out of time for tonight - can you offer any quick clues how your
    Mark> test environment is setup?

Quick clue: I'm not using Outlook or Windows. ;-) I don't know what to do
given that Outlook shreds email so completely. Maybe this stuff can only be
tested on Unix-y machines. Maybe the image analysis code won't even work
because there's no such thing as an attachment with MIME content-type
image/*... in Outlook.

As for actual setup, it's done in what I think is the "usual" way. I start
with two or more Unix mbox format files (at least one full of ham, one full
of spam). I then run utilities/splitndirs.py to allocate them to the
desired number of Data/{Ham,Spam}/SetN directories. I then make a series of
runs like so:

    # control run
    python testtools/timcv.py ... args ... > std.txt
    python testtools/rates.py std.txt

    # one or more test runs with various parameters changed
    python testtools/timcv.py ... slightly different args ... > testN.txt
    python testtools/rates.py testN.txt
    python testtools/cmp.py stds.txt testNs.txt

My guess is there's an easier way to run the tests and summarize the
results, but it had been awhile since I'd done any testing either. This was
the first "working" setup I stumbled upon, and thanks to my enormous bash
command history buffer, I just recall the commands as I need them, so the
pain of re-remembering is small.

HTH,

Skip

From mhammond at skippinet.com.au Wed Aug 23 23:52:45 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 24 Aug 2006 07:52:45 +1000
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <17644.29999.936120.882486@montanaro.dyndns.org>
Message-ID: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>

> Quick clue: I'm not using Outlook or Windows. ;-)

Yep, I know that :) My mail was sent fairly late, so I didn't explain very
well.

> I don't know what to do given that Outlook shreds email so completely.
> Maybe this stuff can only be tested on Unix-y machines. Maybe the image
> analysis code won't even work because there's no such thing as an
> attachment with MIME content-type image/*... in Outlook.

I can manage all of that. What I need to know is in what format your Ham
and Spam directories are. Currently mine are in plain-text. A quick look at
the code showed that these were *not* expected to be a dump of a mime
message, but instead a simple "word stream" - which didn't seem to fit with
the binary data inside attachments. I was guessing they had already been
processed to some degree, but gave up before digging deeper.

> As for actual setup, it's done in what I think is the "usual" way. I
> start with two or more Unix mbox format files (at least one full of
> ham, one full of spam). I then run utilities/splitndirs.py to allocate
> them to the desired number of Data/{Ham,Spam}/SetN directories.
> I then make a series of runs like so:

hrm - so maybe they *are* just the complete dump of the message including
the encoded image data and mime boundaries etc - I'll play a little more
and look inside splitndirs.

Thanks,

Mark

From tameyer at ihug.co.nz Thu Aug 24 08:18:11 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu, 24 Aug 2006 18:18:11 +1200
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
References: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
Message-ID: <40724010-C6F3-410C-9FAF-1F866F86B30C@ihug.co.nz>

>> I don't know what to do given that Outlook shreds email so
>> completely. Maybe this stuff can only be tested on Unix-y machines.
>> Maybe the image analysis code won't even work because there's no such
>> thing as an attachment with MIME content-type image/*... in Outlook.
>
> I can manage all of that. What I need to know is in what format
> your Ham and Spam directories are.

They're RFC2822. So for mail in a .pst, presumably the job (of
export_messages.py) would be to get the attachments and insert them into
the messages (encoded in base64 or whatever) with the appropriate headers.
I planned to write code to do this at some point last year, but don't
recall getting around to it (and then I switched to Mail as my main email
client).

> hrm - so maybe they *are* just the complete dump of the message
> including the encoded image data and mime boundaries etc

Yup. Is that accessible in Outlook? I had the feeling it wasn't. If you
can get the attachments then it's easy enough to use the email package to
build up the message with those and the plain text.
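
For what it's worth, the "build up the message with the email package" idea
might look roughly like the sketch below. It is only an illustration under
stated assumptions: build_message and its argument shapes are hypothetical
(they are not taken from export_messages.py), it uses the current Python 3
module names under email.mime, and fetching the body and attachment bytes
out of Outlook is left entirely to the caller.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(headers, body_text, attachments):
    """Assemble an RFC 2822 message from a text body plus image attachments.

    headers     -- dict of header name -> value (hypothetical input shape)
    body_text   -- the plain-text body exported from the mail store
    attachments -- list of (filename, image_subtype, raw_bytes) tuples
    """
    msg = MIMEMultipart("mixed")
    for name, value in headers.items():
        msg[name] = value
    msg.attach(MIMEText(body_text, "plain"))
    for filename, subtype, data in attachments:
        # Giving the subtype explicitly avoids guessing it from the bytes;
        # MIMEImage base64-encodes the payload by default.
        part = MIMEImage(data, _subtype=subtype)
        part.add_header("Content-Disposition", "attachment",
                        filename=filename)
        msg.attach(part)
    return msg


if __name__ == "__main__":
    fake_gif = b"GIF89a" + b"\x00" * 16  # placeholder bytes, not a real image
    message = build_message(
        {"From": "a@example.com", "To": "b@example.com", "Subject": "test"},
        "plain text body",
        [("spam.gif", "gif", fake_gif)],
    )
    print(message.as_string())
```

Writing message.as_string() to one file per message would give the
Data/{Ham,Spam}/SetN layout something with image/* parts to chew on, if the
attachments can be extracted at all.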
=Tony.Meyer

From skip at pobox.com Thu Aug 24 13:00:49 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 24 Aug 2006 06:00:49 -0500
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
References: <17644.29999.936120.882486@montanaro.dyndns.org> <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
Message-ID: <17645.34529.624745.399216@montanaro.dyndns.org>

    Mark> hrm - so maybe they *are* just the complete dump of the message
    Mark> including the encoded image data and mime boundaries etc - I'll
    Mark> play a little more and look inside splitndirs.

Yup, plain old RFC 2822 messages...

Skip

From kenny.pitt at gmail.com Thu Aug 24 16:40:05 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Thu, 24 Aug 2006 10:40:05 -0400
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <04d801c6c635$12848980$2f0a0a0a@enfoldsystems.local>
References: <17641.48680.645677.461714@montanaro.dyndns.org> <04d801c6c635$12848980$2f0a0a0a@enfoldsystems.local>
Message-ID: <2a052b990608240740w280bcdc9h7505034fbc45c034@mail.gmail.com>

On 8/22/06, Mark Hammond wrote:
> I currently *only* have Office2k installed. Thus, it is not possible
> for me to build a version that depends on a later version.
>
> [...]
>
> On a more general note though, I think it is fairly clear that for all
> official releases, Office2k remain supported for a few years yet - when
> a few people on the -dev list still use Office2k, I would guess that
> many more users also do.

Maybe it would be a good idea to check a copy of the generated COM wrappers
for 2k into CVS while we still have the capability to build them. It might
require some tweaking to py2exe and/or win32com, but I'm sure we could find
a way to utilize a pre-built wrapper instead of regenerating it from the
installed typelibs on every build. That would certainly make it easier to
build compatible versions in the future.
--
Kenny Pitt

From g12__ at hotmail.com Thu Aug 24 16:48:18 2006
From: g12__ at hotmail.com (Greg)
Date: Thu, 24 Aug 2006 14:48:18 +0000
Subject: [spambayes-dev] My Humble Thanks To You
Message-ID:

Guys,

I run a small corporate network with around 100 e-mail users. We've been
using Sophos PureMessage as an anti-SPAM solution, but it doesn't work very
well. It's just too simplistic and easy for the spammers to work around.

After some online research I decided to give SpamBayes a go. I downloaded
the Outlook plugin but didn't know what to expect. I don't get any SPAM,
but I have access to all the e-mail inboxes and can see that some users get
around 50-60 per day, so I was pleased when I discovered I could get the
plug-in to look at any folder I had access to !!

I trialled it on 4 users, 2 who get heavy amounts of SPAM and 2 who get
light amounts. Day 1, I had to go through a lot of e-mail and tell it what
was SPAM and what was HAM. Day 2, I only had to tell it about a few. Day 3,
I think there was 1. Day 4, well.... you get the picture.

The users were delighted. They think I am a god now!!

I now have the plug-in filtering every Inbox on the system and it doesn't
miss a beat. We, at last, have a clean e-mail system. And the truth is that
you guys are gods!

My thanks for all your efforts. This has made everyone's work life here
much easier.

Greg.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20060824/8446e281/attachment.html

From skip at pobox.com Thu Aug 24 17:40:50 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 24 Aug 2006 10:40:50 -0500
Subject: [spambayes-dev] My Humble Thanks To You
In-Reply-To:
References:
Message-ID: <17645.51330.474614.22601@montanaro.dyndns.org>

    Greg> The users were delighted. They think I am a god now!!

I won't tell them if you won't. All hail Greg the God!!!

Glad we could help.
Skip

From tameyer at ihug.co.nz Fri Aug 25 02:39:59 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri, 25 Aug 2006 12:39:59 +1200
Subject: [spambayes-dev] [Spambayes-checkins] spambayes/windows/py2exe setup_all.py, 1.26, 1.27
In-Reply-To: <20060824131835.EB71E1E4005@bag.python.org>
References: <20060824131835.EB71E1E4005@bag.python.org>
Message-ID:

On 25/08/2006, at 1:18 AM, Mark Hammond wrote:

> Update of /cvsroot/spambayes/spambayes/windows/py2exe
> In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540
>
> Modified Files:
> setup_all.py
> Log Message:
> Ship with PIL (but no Tkinter) and pyDNS
>
>
> [...]
> ! excludes = "Tkinter," # side-effect of PIL and markh doesn't
> have it :)
> ! "win32ui,pywin,pywin.debugger," # *sob* - these
> still appear
> ! # Keep zope out else outlook users lose training.
> ! # (sob - but some of these may still appear!)
> !
> "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence",

I don't care about this for 1.1a3, but is this right? Outlook users (any
users, really) would only lose training if they chose not to convert the
database on installation and didn't change their configuration to continue
to use bsddb.

=Tony.Meyer

From skip at pobox.com Fri Aug 25 04:08:20 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 24 Aug 2006 21:08:20 -0500
Subject: [spambayes-dev] SpamBayes 1.1a3
Message-ID: <17646.23444.394006.480373@montanaro.dyndns.org>

The SpamBayes team is pleased to announce release 1.1a3 of SpamBayes. As is
now usual, this is both a release of the source code and of an installation
program for all Microsoft Windows users.

This is an *ALPHA* release. It should only be installed by users willing to
try out experimental software, and almost certainly contains new bugs. If
you don't know what an alpha release is, please stick with 1.0.4 for the
moment.

The 1.1 release has been worked on since May of 2004, so contains a vast
number of improvements over the 1.0.x line.
These include, but are not limited to:

 * New database backends, including ZODB and ZOE.
 * Internationalisation support, including partial translations into
   French and Spanish.
 * Improved statistics reporting.
 * The ability to set audio notifications with the Outlook plug-in.
 * The ability to set the Outlook plug-in to move/copy ham, as well as
   spam/unsures.
 * Partial POP3 over SSL support for sb_server.
 * A vastly improved sb_imapfilter.
 * Several new experimental options, including one designed to help
   extract text content from image-based spams.

Suggestions about what to try out can be found here:

    http://entrian.com/sbwiki/TryOutThePreRelease

This release, like the ill-fated 1.0.2 and 1.0.3, is built with Python 2.4.
We believe that the remaining incompatibilities with Python 2.4 have been
resolved, and so this release should also include superior email parsing to
the 1.0.x line.

Details about the changes in this release can be found at

    http://sourceforge.net/project/shownotes.php?release_id=442102

You can get the release via the 'Download' page at

    http://spambayes.org/download.html

Enjoy the new release and your spam-free mailbox.

As always, thanks to everyone involved in this release!

Skip Montanaro.
(on behalf of the SpamBayes team)

--- What is SpamBayes? ---

The SpamBayes project is working on developing a Bayesian (of sorts)
anti-spam filter (in Python), initially based on the work of Paul Graham,
but since modified with ideas from Robinson, Peters, et al.

The project includes a number of different applications, all using the same
core code, ranging from a plug-in for Microsoft Outlook, to a POP3 proxy,
to various command-line tools and a command-line-based framework for
testing new anti-spam techniques.
The Windows installation program will install either the Outlook add-in
(for Microsoft Outlook users), the SpamBayes server program (for all other
POP3 mail client users, including Microsoft Outlook Express), or the
SpamBayes IMAP filter (for all IMAP mail client users). All Windows users
(including existing users of the Outlook add-in) are encouraged to use the
installation program.

If you wish to use the source-code version, you will also need to install
Python - see README.txt in the source tree for more information.

From mhammond at skippinet.com.au Fri Aug 25 04:44:17 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 25 Aug 2006 12:44:17 +1000
Subject: [spambayes-dev] [Spambayes-checkins] spambayes/windows/py2exesetup_all.py, 1.26, 1.27
In-Reply-To:
Message-ID: <023d01c6c7f0$5c25d6e0$050a0a0a@enfoldsystems.local>

> > Update of /cvsroot/spambayes/spambayes/windows/py2exe
> > In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540
> >
> > Modified Files:
> > setup_all.py
> > Log Message:
> > Ship with PIL (but no Tkinter) and pyDNS
> >
> >
> > [...]
> > ! excludes = "Tkinter," # side-effect of PIL and markh doesn't
> > have it :)
> > ! "win32ui,pywin,pywin.debugger," # *sob* - these
> > still appear
> > ! # Keep zope out else outlook users lose training.
> > ! # (sob - but some of these may still appear!)
> > !
> > "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence",
>
> I don't care about this for 1.1a3, but is this right? Outlook users
> (any users, really) would only lose training if they chose not to
> convert the database on installation and didn't change their
> configuration to continue to use bsddb.

If the inno installer offers to convert databases, then you may be correct.
However, for my testing I didn't use the inno installer, so suddenly and
without warning 'lost' the training info. I wonder if people who roll
spambayes out to many seats all use our Inno setup to achieve that - if
not, they too will lose.
More generally though, even if I was prompted about converting the
databases, if I answered 'No' I would expect my old existing database would
still work as before. An upgrade that *forces* pain on you (answer yes,
wait while 1x20MB and 1x10MB pickles are migrated, or answer 'no' and take
the pain of retraining from scratch) doesn't sound friendly. A better
approach may be that before *creating* a database in the new format, check
to see if the old format exists and continue to use it.

And more generally still, the ZODB that I have installed is built from
Zope3 from SVN - from a branch, but not (necessarily) corresponding to an
official release. This didn't seem prudent (but OTOH, probably would not
itself have caused me to exclude it without the above :)

Cheers,

Mark

From vilisch at wmw.com Mon Aug 28 14:43:58 2006
From: vilisch at wmw.com (Vilmos Schnedarek)
Date: Mon, 28 Aug 2006 15:43:58 +0300
Subject: [spambayes-dev] Integrate SpamBayes into a Win32 application
Message-ID: <44F2E50E.4040607@wmw.com>

An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20060828/52648bab/attachment.html

From skip at pobox.com Thu Aug 31 14:26:31 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 31 Aug 2006 07:26:31 -0500
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
Message-ID: <17654.54647.420342.345139@montanaro.dyndns.org>

It didn't take long for the spammers to start with the blinking GIF images.
Now I think they are using blinkers where the first image in the sequence
is pretty much empty. The real content is in the second frame. I need to
handle that. I don't think I can just blindly overwrite one frame with the
next since they could just make the last one the blankish image. The way I
concatenate images left-to-right and top-to-bottom makes it impossible to
just concatenate the frames together either. Ideas?

The code is in spambayes/ImageStripper.py in the distribution.
Look at PIL_decode_parts.

Skip

From skip at pobox.com Thu Aug 31 15:48:57 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 31 Aug 2006 08:48:57 -0500
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
In-Reply-To: <17654.54647.420342.345139@montanaro.dyndns.org>
References: <17654.54647.420342.345139@montanaro.dyndns.org>
Message-ID: <17654.59593.363231.433652@montanaro.dyndns.org>

I was a little rushed this morning heading out the door, so didn't
completely dump my brain in my earlier message:

    skip> I don't think I can just blindly overwrite one frame with the next
    skip> since they could just make the last one the blankish image. The
    skip> way I concatenate images left-to-right and top-to-bottom makes it
    skip> impossible to just concatenate the frames together either. Ideas?

The implication I should have stated explicitly is that we need to select
an image that's most likely the one with text in it. If spammers are going
to blink their GIFs I suspect one or more of the images will have to be
mostly background, while other messages will have to be a mixture of
colors. That suggests choosing one based on histograms. Another possibility
is to decide which color is the background, make it transparent, then
overlay all the images on top of each other.

I don't have time to look at this right now. Perhaps someone else does.

Skip

From kenny.pitt at gmail.com Thu Aug 31 17:38:37 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Thu, 31 Aug 2006 11:38:37 -0400
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
In-Reply-To: <17654.59593.363231.433652@montanaro.dyndns.org>
References: <17654.54647.420342.345139@montanaro.dyndns.org> <17654.59593.363231.433652@montanaro.dyndns.org>
Message-ID: <2a052b990608310838x53dc9f5ftac4a89eb64822911@mail.gmail.com>

On 8/31/06, skip at pobox.com wrote:
> The implication I should have stated explicitly is that we need to
> select an image that's most likely the one with text in it.
> If spammers are going to blink their GIFs I suspect one or more of the
> images will have to be mostly background, while other messages will
> have to be a mixture of colors. That suggests choosing one based on
> histograms. Another possibility is to decide which color is the
> background, make it transparent, then overlay all the images on top of
> each other.

Could we extract a list of text tokens from each frame separately, and then
choose the token list that has the most tokens in it?

--
Kenny Pitt

From skip at pobox.com Thu Aug 31 19:14:10 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 31 Aug 2006 12:14:10 -0500
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
In-Reply-To: <2a052b990608310838x53dc9f5ftac4a89eb64822911@mail.gmail.com>
References: <17654.54647.420342.345139@montanaro.dyndns.org> <17654.59593.363231.433652@montanaro.dyndns.org> <2a052b990608310838x53dc9f5ftac4a89eb64822911@mail.gmail.com>
Message-ID: <17655.6370.856714.140457@montanaro.dyndns.org>

    Kenny> Could we extract a list of text tokens from each frame
    Kenny> separately, and then choose the token list that has the most
    Kenny> tokens in it?

In theory, yes, though that would require running ocrad on each possibly
partial image (could get expensive) and would require code restructuring.
At the moment, the images come in one of three forms:

 * a single non-blinking image
 * a set of images, non-blinking, which, when assembled, make a single
   larger image
 * a single blinking image

Right now, I assume there might be multiple parts to the image, so I
convert from the source to PIL's internal format, concatenate them
together, then run ocrad on the total image. I imagine it's not going to be
long before the spammers start splitting up their blinking images into
parts.

Skip
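
As a starting point for the histogram idea floated earlier in the thread,
here is a minimal sketch, not the ImageStripper code. It assumes a
mostly-background frame puts nearly all of its pixels into one histogram
bin, while a frame carrying text spreads them over more bins. The names
background_fraction and pick_text_frame are made up for illustration, and
the per-frame histograms are plain lists of counts; with PIL they could
presumably come from Image.histogram() on each frame yielded by
ImageSequence.Iterator, but that wiring is left out here.

```python
def background_fraction(histogram):
    """Return the fraction of pixels in the single most common bin."""
    total = sum(histogram)
    if total == 0:
        # An empty histogram carries no content; treat it as all background.
        return 1.0
    return max(histogram) / total


def pick_text_frame(histograms):
    """Return the index of the frame least dominated by one colour.

    histograms -- one list of per-bin pixel counts per frame
    """
    if not histograms:
        raise ValueError("no frames to choose from")
    return min(range(len(histograms)),
               key=lambda i: background_fraction(histograms[i]))


if __name__ == "__main__":
    blank = [1000, 0, 0, 0]        # nearly everything is background
    texty = [800, 150, 50, 0]      # some pixels in other colours
    print(pick_text_frame([blank, texty]))
```

This only picks one frame, so it would not help if the text itself is split
across several frames; the overlay and per-frame-token ideas above would
still be needed for that case.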