From montanaro at users.sourceforge.net Sat Aug 5 14:48:11 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 05 Aug 2006 05:48:11 -0700 Subject: [Spambayes-checkins] spambayes/contrib tte.py,1.16,1.17 Message-ID: <20060805124814.1F7351E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/contrib In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv28933/contrib Modified Files: tte.py Log Message: close the store - that's the ticket Index: tte.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/contrib/tte.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** tte.py 19 Apr 2005 11:15:12 -0000 1.16 --- tte.py 5 Aug 2006 12:48:09 -0000 1.17 *************** *** 260,264 **** sh_ratio) ! store.store() if cullext is not None: --- 260,264 ---- sh_ratio) ! store.close() if cullext is not None: From montanaro at users.sourceforge.net Sun Aug 6 03:19:37 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sat, 05 Aug 2006 18:19:37 -0700 Subject: [Spambayes-checkins] spambayes/contrib spamcounts.py,1.7,1.8 Message-ID: <20060806011939.E72631E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/contrib In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv5191 Modified Files: spamcounts.py Log Message: Dump the -d and -p flags in favor of the more general -o flag. Index: spamcounts.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/contrib/spamcounts.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** spamcounts.py 23 Apr 2006 22:30:46 -0000 1.7 --- spamcounts.py 6 Aug 2006 01:19:35 -0000 1.8 *************** *** 2,15 **** """ ! Check spamcounts for various tokens or patterns ! usage %(prog)s [ -h ] [ -r ] [ -d db ] [ -p ] [ -t ] ... -h - print this documentation and exit. 
-r - treat tokens as regular expressions - may not be used with -t - -d db - use db instead of the default found in the options file - -p - db is actually a pickle -t - read message from stdin, tokenize it, then display their counts may not be used with -r """ --- 2,15 ---- """ ! Check spamcounts for one or more tokens or patterns ! usage %(prog)s [ options ] token ... -h - print this documentation and exit. -r - treat tokens as regular expressions - may not be used with -t -t - read message from stdin, tokenize it, then display their counts may not be used with -r + -o section:option:value + - set [section, option] in the options database to value """ *************** *** 64,70 **** def main(args): try: ! opts, args = getopt.getopt(args, "hrd:t", ! ["help", "re", "database=", "pickle", ! "tokenize"]) except getopt.GetoptError, msg: usage(msg) --- 64,69 ---- def main(args): try: ! opts, args = getopt.getopt(args, "hrto:", ! ["help", "re", "tokenize", "option="]) except getopt.GetoptError, msg: usage(msg) *************** *** 72,77 **** usere = False - dbname = get_pathname_option("Storage", "persistent_storage_file") - ispickle = not options["Storage", "persistent_use_database"] tokenizestdin = False for opt, arg in opts: --- 71,74 ---- *************** *** 79,90 **** usage() return 0 - elif opt in ("-d", "--database"): - dbname = arg elif opt in ("-r", "--re"): usere = True - elif opt in ("-p", "--pickle"): - ispickle = True elif opt in ("-t", "--tokenize"): tokenizestdin = True if usere and tokenizestdin: --- 76,85 ---- usage() return 0 elif opt in ("-r", "--re"): usere = True elif opt in ("-t", "--tokenize"): tokenizestdin = True + elif opt in ('-o', '--option'): + options.set_from_cmdline(arg, sys.stderr) if usere and tokenizestdin: From montanaro at users.sourceforge.net Sun Aug 6 16:50:32 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 07:50:32 -0700 Subject: [Spambayes-checkins] spambayes/scripts sb_filter.py,1.19,1.20 
Message-ID: <20060806145034.662C91E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/scripts In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv19924 Modified Files: sb_filter.py Log Message: Run under control of the new cProfile profiler, if it's available. I found this useful to help identify where SB spends its time while training. Index: sb_filter.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/scripts/sb_filter.py,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** sb_filter.py 7 Apr 2006 02:25:25 -0000 1.19 --- sb_filter.py 6 Aug 2006 14:50:29 -0000 1.20 *************** *** 47,50 **** --- 47,53 ---- set [section, option] in the options database to value + -P + Run under control of the Python profiler, if it is available + All options marked with '*' operate on stdin, and write the resultant message to stdout. *************** *** 211,220 **** self.h.store() ! def main(): h = HammieFilter() actions = [] ! opts, args = getopt.getopt(sys.argv[1:], 'hvxd:p:nfgstGSo:', ['help', 'version', 'examples', 'option=']) create_newdb = False for opt, arg in opts: if opt in ('-h', '--help'): --- 214,224 ---- self.h.store() ! def main(profiling=False): h = HammieFilter() actions = [] ! 
opts, args = getopt.getopt(sys.argv[1:], 'hvxd:p:nfgstGSo:P', ['help', 'version', 'examples', 'option=']) create_newdb = False + do_profile = False for opt, arg in opts: if opt in ('-h', '--help'): *************** *** 238,241 **** --- 242,254 ---- elif opt == '-S': actions.append(h.untrain_spam) + elif opt == '-P': + do_profile = True + if not profiling: + try: + import cProfile + except ImportError: + pass + else: + return cProfile.run("main(True)") elif opt == "-n": create_newdb = True From montanaro at users.sourceforge.net Sun Aug 6 18:14:20 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 09:14:20 -0700 Subject: [Spambayes-checkins] spambayes/spambayes Options.py,1.131,1.132 Message-ID: <20060806161422.065C21E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv23311/spambayes Modified Files: Options.py Log Message: slight reformat, doc tweak Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.131 retrieving revision 1.132 diff -C2 -d -r1.131 -r1.132 *** Options.py 27 Nov 2005 22:05:45 -0000 1.131 --- Options.py 6 Aug 2006 16:14:17 -0000 1.132 *************** *** 134,144 **** BOOLEAN, RESTORE), ! ("address_headers", _("Address headers to mine"), ("from", "to", "cc", "sender", "reply-to"), _("""Mine the following address headers. If you have mixed source corpuses (as opposed to a mixed sauce walrus, which is delicious!) then you probably don't want to use 'to' or 'cc') Address headers will be decoded, and will generate charset tokens as well as the real ! address. Others to consider: to, cc, reply-to, errors-to, sender, ! ..."""), HEADER_NAME, RESTORE), --- 134,144 ---- BOOLEAN, RESTORE), ! ("address_headers", _("Address headers to mine"), ("from", "to", "cc", ! "sender", "reply-to"), _("""Mine the following address headers. 
If you have mixed source corpuses (as opposed to a mixed sauce walrus, which is delicious!) then you probably don't want to use 'to' or 'cc') Address headers will be decoded, and will generate charset tokens as well as the real ! address. Others to consider: errors-to, ..."""), HEADER_NAME, RESTORE), From montanaro at users.sourceforge.net Sun Aug 6 18:19:21 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 09:19:21 -0700 Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.37,1.38 Message-ID: <20060806161923.4FFAF1E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv25513 Modified Files: tokenizer.py Log Message: Break basic text tokenizing out into its own method in preparation for some other changes. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.37 retrieving revision 1.38 diff -C2 -d -r1.37 -r1.38 *** tokenizer.py 15 Nov 2005 00:16:20 -0000 1.37 --- tokenizer.py 6 Aug 2006 16:19:19 -0000 1.38 *************** *** 1528,1533 **** yield "noheader:" + k ! def tokenize_body(self, msg, maxword=options["Tokenizer", ! "skip_max_word_size"]): """Generate a stream of tokens from an email Message. --- 1528,1545 ---- yield "noheader:" + k ! def tokenize_text(self, text, maxword=options["Tokenizer", ! "skip_max_word_size"]): ! """Tokenize everything in the chunk of text we were handed.""" ! for w in text.split(): ! n = len(w) ! # Make sure this range matches in tokenize_word(). ! if 3 <= n <= maxword: ! yield w ! ! elif n >= 3: ! for t in tokenize_word(w): ! yield t ! ! def tokenize_body(self, msg): """Generate a stream of tokens from an email Message. *************** *** 1606,1619 **** text = html_re.sub('', text) ! # Tokenize everything in the body. ! for w in text.split(): ! n = len(w) ! 
# Make sure this range matches in tokenize_word(). ! if 3 <= n <= maxword: ! yield w ! ! elif n >= 3: ! for t in tokenize_word(w): ! yield t global_tokenizer = Tokenizer() --- 1618,1623 ---- text = html_re.sub('', text) ! for t in self.tokenize_text(text): ! yield t global_tokenizer = Tokenizer() From montanaro at users.sourceforge.net Sun Aug 6 18:34:39 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 09:34:39 -0700 Subject: [Spambayes-checkins] spambayes/spambayes Options.py, 1.132, 1.133 tokenizer.py, 1.38, 1.39 Message-ID: <20060806163441.7E8C41E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv30712/spambayes Modified Files: Options.py tokenizer.py Log Message: Add an x-short_runs option. When enabled, instead of completely skipping short words, runs of them are counted, the longest generating a token using the usual log2() technique. See the comment in tokenizer.py and doc string in Options.py for examples of the sort of things it attempts to catch. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.132 retrieving revision 1.133 diff -C2 -d -r1.132 -r1.133 *** Options.py 6 Aug 2006 16:14:17 -0000 1.132 --- Options.py 6 Aug 2006 16:34:37 -0000 1.133 *************** *** 98,101 **** --- 98,109 ---- INTEGER, RESTORE), + ("x-short_runs", _("Count runs of short 'words'"), False, + _("""(EXPERIMENTAL) If true, generate tokens based on max number of + short word runs. Short words are anything of length < the + skip_max_word_size option. Normally they are skipped, but one common + spam technique spells words like 'V I A G RA'. 
+ """), + BOOLEAN, RESTORE), + ("count_all_header_lines", _("Count all header lines"), False, _("""Generate tokens just counting the number of instances of each kind Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** tokenizer.py 6 Aug 2006 16:19:19 -0000 1.38 --- tokenizer.py 6 Aug 2006 16:34:37 -0000 1.39 *************** *** 1531,1543 **** "skip_max_word_size"]): """Tokenize everything in the chunk of text we were handed.""" for w in text.split(): n = len(w) ! # Make sure this range matches in tokenize_word(). ! if 3 <= n <= maxword: ! yield w ! elif n >= 3: ! for t in tokenize_word(w): ! yield t def tokenize_body(self, msg): --- 1531,1558 ---- "skip_max_word_size"]): """Tokenize everything in the chunk of text we were handed.""" + short_runs = Set() + short_count = 0 for w in text.split(): n = len(w) ! if n < 3: ! # count how many short words we see in a row - meant to ! # latch onto crap like this: ! # X j A m N j A d X h ! # M k E z R d I p D u I m A c ! # C o I d A t L j I v S j ! short_count += 1 ! else: ! if short_count: ! short_runs.add(short_count) ! short_count = 0 ! # Make sure this range matches in tokenize_word(). ! if 3 <= n <= maxword: ! yield w ! elif n >= 3: ! for t in tokenize_word(w): ! yield t ! if short_runs and options["Tokenizer", "x-short_runs"]: ! 
yield "short:%d" % int(log2(max(short_runs))) def tokenize_body(self, msg): From montanaro at users.sourceforge.net Sun Aug 6 18:52:57 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 09:52:57 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py, NONE, 1.1 Options.py, 1.133, 1.134 tokenizer.py, 1.39, 1.40 Message-ID: <20060806165259.7C81F1E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv5725/spambayes Modified Files: Options.py tokenizer.py Added Files: dnscache.py Log Message: Add Matt Cowles' dnscache module and x-lookup_ip option. Underwent some substantial changes, most importantly, I got most of the way adding support for persisting the cache to either dbm or zodb stores. Also ran reindent over dnscache.py. --- NEW FILE: dnscache.py --- # Copyright 2004, Matthew Dixon Cowles . # Distributable under the same terms as the Python programming language. # Inspired by the KevinL's cache included with PyDNS. # Provided with NO WARRANTY. 
# Version 0.1 2004 06 27 # Version 0.11 2004 07 06 Fixed zero division error in __del__ import DNS # From http://sourceforge.net/projects/pydns/ import sys import os import operator import time import types import shelve import socket from spambayes.Options import options kCheckForPruneEvery=20 kMaxTTL=60 * 60 * 24 * 7 # One week kPruneThreshold=1500 # May go over slightly; numbers chosen at random kPruneDownTo=1000 class lookupResult(object): #__slots__=("qType","answer","question","expiresAt","lastUsed") def __init__(self,qType,answer,question,expiresAt,now): self.qType=qType self.answer=answer self.question=question self.expiresAt=expiresAt self.lastUsed=now return None # From ActiveState's Python cookbook # Yakov Markovitch, Fast sort the list of objects by object's attribute # http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52230 def sort_by_attr(seq, attr): """Sort the sequence of objects by object's attribute Arguments: seq - the list or any sequence (including immutable one) of objects to sort. attr - the name of attribute to sort by Returns: the sorted list of objects. """ #import operator # Use the "Schwartzian transform" # Create the auxiliary list of tuples where every i-th tuple has form # (seq[i].attr, i, seq[i]) and sort it. The second item of tuple is needed not # only to provide stable sorting, but mainly to eliminate comparison of objects # (which can be expensive or prohibited) in case of equal attribute values. intermed = map(None, map(getattr, seq, (attr,)*len(seq)), xrange(len(seq)), seq) intermed.sort() return map(operator.getitem, intermed, (-1,) * len(intermed)) class cache: def __init__(self,dnsServer=None,cachefile=None): # These attributes intended for user setting self.printStatsAtEnd=False # As far as I can tell from the standards, # it's legal to have more than one PTR record # for an address. That is, it's legal to get # more than one name back when you do a # reverse lookup on an IP address. 
I don't # know of a use for that and I've never seen # it done. And I don't think that most # people would expect it. So forward ("A") # lookups always return a list. Reverse # ("PTR") lookups return a single name unless # this attribute is set to False. self.returnSinglePTR=True # How long to cache an error as no data self.cacheErrorSecs=5*60 # How long to wait for the server self.dnsTimeout=10 # Some servers always return a TTL of zero. # In those cases, turning this up a bit is # probably reasonable. self.minTTL=0 # end of user-settable attributes self.cachefile = cachefile if cachefile: self.open_cachefile(cachefile) else: self.caches={ "A": {}, "PTR": {} } self.hits=0 # These two for statistics self.misses=0 self.pruneTicker=0 if dnsServer==None: DNS.DiscoverNameServers() self.queryObj=DNS.DnsRequest() else: self.queryObj=DNS.DnsRequest(server=dnsServer) return None def open_cachefile(self, cachefile): filetype = options["Storage", "persistent_use_database"] cachefile = os.path.expanduser(cachefile) if filetype == "dbm": self.caches=shelve.open(cachefile) if not self.caches.has_key("A"): self.caches["A"] = {} if not self.caches.has_key("PTR"): self.caches["PTR"] = {} elif filetype == "zodb": from ZODB import DB from ZODB.FileStorage import FileStorage self._zodb_storage = FileStorage(cachefile, read_only=False) self._DB = DB(self._zodb_storage, cache_size=10000) self._conn = self._DB.open() root = self._conn.root() self.caches = root.get("dnscache") if self.caches is None: # There is no classifier, so create one. 
from BTrees.OOBTree import OOBTree self.caches = root["dnscache"] = OOBTree() self.caches["A"] = {} self.caches["PTR"] = {} print "opened new cache" else: print "opened existing cache with", len(self.caches["A"]), "A records", print "and", len(self.caches["PTR"]), "PTR records" def close(self): if not self.cachefile: return filetype = options["Storage", "persistent_use_database"] if filetype == "dbm": self.caches.close() elif filetype == "zodb": self._zodb_close() def _zodb_store(self): import transaction from ZODB.POSException import ConflictError from ZODB.POSException import TransactionFailedError try: transaction.commit() except ConflictError, msg: # We'll save it next time, or on close. It'll be lost if we # hard-crash, but that's unlikely, and not a particularly big # deal. if options["globals", "verbose"]: print >> sys.stderr, "Conflict on commit.", msg transaction.abort() except TransactionFailedError, msg: # Saving isn't working. Try to abort, but chances are that # restarting is needed. if options["globals", "verbose"]: print >> sys.stderr, "Store failed. Need to restart.", msg transaction.abort() def _zodb_close(self): # Ensure that the db is saved before closing. Alternatively, we # could abort any waiting transaction. We need to do *something* # with it, though, or it will be still around after the db is # closed and cause problems. For now, saving seems to make sense # (and we can always add abort methods if they are ever needed). self._zodb_store() # Do the closing. self._DB.close() # We don't make any use of the 'undo' capabilities of the # FileStorage at the moment, so might as well pack the database # each time it is closed, to save as much disk space as possible. # Pack it up to where it was 'yesterday'. # XXX What is the 'referencesf' parameter for pack()? It doesn't # XXX seem to do anything according to the source. 
## self._zodb_storage.pack(time.time()-60*60*24, None) self._zodb_storage.close() self._zodb_closed = True if options["globals", "verbose"]: print >> sys.stderr, 'Closed dnscache database' def __del__(self): if self.printStatsAtEnd: self.printStats() def printStats(self): for key,val in self.caches.items(): totAnswers=0 for item in val.values(): totAnswers+=len(item) print "cache %s has %i question(s) and %i answer(s)" % (key,len(self.caches[key]),totAnswers) if self.hits+self.misses==0: print "No queries" else: print "%i hits, %i misses (%.1f%% hits)" % (self.hits, self.misses, self.hits/float(self.hits+self.misses)*100) def prune(self,now): # I want this to be as fast as reasonably possible. # If I didn't, I'd probably do various things differently # Is there a faster way to do this? allAnswers=[] for cache in self.caches.values(): for val in cache.values(): allAnswers += val allAnswers=sort_by_attr(allAnswers,"expiresAt") allAnswers.reverse() while True: if allAnswers[-1].expiresAt>now: break answer=allAnswers.pop() c=self.caches[answer.type] c[answer.question].remove(answer) if len(c[answer.question])==0: del c[answer.question] self.printStats() if len(allAnswers)<=kPruneDownTo: return None # Expiring didn't get us down to the size we want, so delete # some entries least-recently-used-wise. I'm not by any means # sure that this is the best strategy, but as yet I don't have # data to test different strategies. 
allAnswers=sort_by_attr(allAnswers,"lastUsed") allAnswers.reverse() numToDelete=len(allAnswers)-kPruneDownTo for count in range(numToDelete): answer=allAnswers.pop() c=self.caches[answer.type] c[answer.question].remove(answer) if len(c[answer.question])==0: del c[answer.question] return None def formatForReturn(self,listOfObjs): if len(listOfObjs)==1 and listOfObjs[0].answer==None: return [] if listOfObjs[0].qType=="PTR" and self.returnSinglePTR: return listOfObjs[0].answer return [ obj.answer for obj in listOfObjs ] def lookup(self,question,qType="A"): qType=qType.upper() if qType not in ("A","PTR"): raise ValueError,"Query type must be one of A, PTR" now=int(time.time()) # Finding the len() of a dictionary isn't an expensive operation # but doing it twice for every lookup isn't necessary. self.pruneTicker+=1 if self.pruneTicker==kCheckForPruneEvery: self.pruneTicker=0 if len(self.caches["A"])+len(self.caches["PTR"])>kPruneThreshold: self.prune(now) cacheToLookIn=self.caches[qType] try: answers=cacheToLookIn[question] except KeyError: pass else: assert len(answers)>0 ind=0 # No guarantee that expire has already been done while ind"Timeout": print "Error, fixme",detail print "Question was",queryQuestion print "Origianal question was",question print "Type was",qType objs=[ lookupResult(qType,None,question,self.cacheErrorSecs+now,now) ] cacheToLookIn[question]=objs # Add to format for return? return self.formatForReturn(objs) except socket.gaierror,detail: print "DNS connection failure:", self.queryObj.ns, detail print "Defaults:", DNS.defaults objs=[] for answer in reply.answers: if answer["typename"]==qType: # PyDNS returns TTLs as longs but RFC 1035 says that the # TTL value is a signed 32-bit value and must be positive, # so it should be safe to coerce it to a Python integer. # And anyone who sets a time to live of more than 2^31-1 # seconds (68 years and change) is drunk. 
# Arguably, I ought to impose a maximum rather than continuing # with longs (int(long) returns long in recent versions of Python). ttl=max(min(int(answer["ttl"]),kMaxTTL),self.minTTL) # RFC 2308 says that you should cache an NXDOMAIN for the # minimum of the minimum field of the SOA record and the TTL # of the SOA. if ttl>0: item=lookupResult(qType,answer["data"],question,ttl+now,now) objs.append(item) if len(objs)>0: cacheToLookIn[question]=objs return self.formatForReturn(objs) # Probably SERVFAIL or the like if len(reply.authority)==0: objs=[ lookupResult(qType,None,question,self.cacheErrorSecs+now,now) ] cacheToLookIn[question]=objs return self.formatForReturn(objs) # No such host # # I don't know in what circumstances you'd have more than one authority, # so I'll just assume that the first is what we want. # # RFC 2308 specifies that this how to decide how long to cache an # NXDOMAIN. auth=reply.authority[0] auTTL=int(auth["ttl"]) for item in auth["data"]: if type(item)==types.TupleType and item[0]=="minimum": auMin=int(item[1]) cacheNeg=min(auMin,auTTL) break else: cacheNeg=auTTL objs=[ lookupResult(qType,None,question,cacheNeg+now,now) ] cacheToLookIn[question]=objs return self.formatForReturn(objs) def main(): import transaction c=cache(cachefile=os.path.expanduser("~skip/.dnscache")) c.printStatsAtEnd=True for host in ["www.python.org", "www.timsbloggers.com", "www.seeputofor.com", "www.completegarbage.tv", "www.tradelinkllc.com"]: print "checking", host now=time.time() ips=c.lookup(host) print ips,time.time()-now now=time.time() ips=c.lookup(host) print ips,time.time()-now if ips: ip=ips[0] now=time.time() name=c.lookup(ip,qType="PTR") print name,time.time()-now now=time.time() name=c.lookup(ip,qType="PTR") print name,time.time()-now else: print "unknown" c.close() return None if __name__=="__main__": main() Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v 
retrieving revision 1.133 retrieving revision 1.134 diff -C2 -d -r1.133 -r1.134 *** Options.py 6 Aug 2006 16:34:37 -0000 1.133 --- Options.py 6 Aug 2006 16:52:54 -0000 1.134 *************** *** 106,109 **** --- 106,123 ---- BOOLEAN, RESTORE), + ("x-lookup_ip", _("Generate IP address tokens from hostnames"), False, + _("""(EXPERIMENTAL) Generate IP address tokens from hostnames. + Requires PyDNS (http://pydns.sourceforge.net/)."""), + BOOLEAN, RESTORE), + + ("lookup_ip_cache", _("x-lookup_ip cache file location"), "", + _("""Tell SpamBayes where to cache IP address lookup information. + Only comes into play if lookup_ip is enabled. The default + (empty string) disables the file cache. When caching is enabled, + the cache file is stored using the same database type as the main + token store (only dbm and zodb supported so far, zodb has problems, + dbm is untested, hence the default)."""), + FILE, RESTORE), + ("count_all_header_lines", _("Count all header lines"), False, _("""Generate tokens just counting the number of instances of each kind Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** tokenizer.py 6 Aug 2006 16:34:37 -0000 1.39 --- tokenizer.py 6 Aug 2006 16:52:54 -0000 1.40 *************** *** 40,43 **** --- 40,54 ---- + try: + import dnscache + cache = dnscache.cache(cachefile=options["Tokenizer", "lookup_ip_cache"]) + cache.printStatsAtEnd = True + except (IOError, ImportError): + cache = None + else: + import atexit + atexit.register(cache.close) + + # Patch encodings.aliases to recognize 'ansi_x3_4_1968' from encodings.aliases import aliases # The aliases dictionary From montanaro at users.sourceforge.net Sun Aug 6 18:58:33 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 09:58:33 -0700 Subject: [Spambayes-checkins] spambayes/spambayes 
Options.py, 1.134, 1.135 tokenizer.py, 1.40, 1.41 Message-ID: <20060806165834.E34BF1E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv8064/spambayes Modified Files: Options.py tokenizer.py Log Message: Add an image-size token. Enabled with the x-image_size option. Uses the usual log2() gimmick. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.134 retrieving revision 1.135 diff -C2 -d -r1.134 -r1.135 *** Options.py 6 Aug 2006 16:52:54 -0000 1.134 --- Options.py 6 Aug 2006 16:58:31 -0000 1.135 *************** *** 120,123 **** --- 120,128 ---- FILE, RESTORE), + ("x-image_size", _("Generate image size tokens"), False, + _("""(EXPERIMENTAL) If true, generate tokens based on the sizes of + embedded images."""), + BOOLEAN, RESTORE), + ("count_all_header_lines", _("Count all header lines"), False, _("""Generate tokens just counting the number of instances of each kind Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** tokenizer.py 6 Aug 2006 16:52:54 -0000 1.40 --- tokenizer.py 6 Aug 2006 16:58:31 -0000 1.41 *************** *** 636,639 **** --- 636,647 ---- msg.walk())) + def imageparts(msg): + """Return a list of all msg parts with type 'image/*'.""" + # Don't want a set here because we want to be able to process them in + # order. + return filter(lambda part: + part.get_content_type().startswith('image/'), + msg.walk()) + has_highbit_char = re.compile(r"[\x80-\xff]").search *************** *** 1592,1595 **** --- 1600,1621 ---- "octet_prefix_size"]] + parts = imageparts(msg) + if options["Tokenizer", "x-image_size"]: + # Find image/* parts of the body, calculating the log(size) of + # each image. 
+ + for part in parts: + try: + text = part.get_payload(decode=True) + except: + yield "control: couldn't decode image" + text = part.get_payload(decode=False) + + if text is None: + yield "control: image payload is None" + continue + + yield "image-size:2**%d" % round(log2(len(text))) + # Find, decode (base64, qp), and tokenize textual parts of the body. for part in textparts(msg): From montanaro at users.sourceforge.net Sun Aug 6 19:09:07 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 10:09:07 -0700 Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, NONE, 1.1 Options.py, 1.135, 1.136 tokenizer.py, 1.41, 1.42 Message-ID: <20060806170910.511471E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv11708/spambayes Modified Files: Options.py tokenizer.py Added Files: ImageStripper.py Log Message: Crude OCR capability based on the ocrad program and netpbm. As bad as ocrad's text extraction is, this gimmick seems to work pretty well at catching the currently crop of pump-n-dump spams. Unix only until someone implements similar functionality for Windows. --- NEW FILE: ImageStripper.py --- """ This is the place where we try and discover information buried in images. """ import os import tempfile import math import time try: # We have three possibilities for Set: # (a) With Python 2.2 and earlier, we use our compatsets class # (b) With Python 2.3, we use the sets.Set class # (c) With Python 2.4 and later, we use the builtin set class Set = set except NameError: try: from sets import Set except ImportError: from spambayes.compatsets import Set from spambayes.Options import options # copied from tokenizer.py - maybe we should split it into pieces... def log2(n, log=math.log, c=math.log(2)): return log(n)/c # I'm sure this is all wrong for Windows. Someone else can fix it. 
;-) def is_executable(prog): info = os.stat(prog) return (info.st_uid == os.getuid() and (info.st_mode & 0100) or info.st_gid == os.getgid() and (info.st_mode & 0010) or info.st_mode & 0001) def find_program(prog): for directory in os.environ.get("PATH", "").split(os.pathsep): program = os.path.join(directory, prog) if os.path.exists(program) and is_executable(program): return program return "" def find_decoders(): # check for filters to convert to netpbm for decode_jpeg in ["jpegtopnm", "djpeg"]: if find_program(decode_jpeg): break else: decode_jpeg = None for decode_png in ["pngtopnm"]: if find_program(decode_png): break else: decode_png = None for decode_gif in ["giftopnm"]: if find_program(decode_gif): break else: decode_gif = None decoders = { "image/jpeg": decode_jpeg, "image/gif": decode_gif, "image/png": decode_png, } return decoders def decode_parts(parts, decoders): pnmfiles = [] for part in parts: decoder = decoders.get(part.get_content_type()) if decoder is None: continue try: bytes = part.get_payload(decode=True) except: continue if len(bytes) > options["Tokenizer", "max_image_size"]: continue # assume it's just a picture for now fd, imgfile = tempfile.mkstemp() os.write(fd, bytes) os.close(fd) fd, pnmfile = tempfile.mkstemp() os.close(fd) os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) pnmfiles.append(pnmfile) if not pnmfiles: return if len(pnmfiles) > 1: if find_program("pnmcat"): fd, pnmfile = tempfile.mkstemp() os.close(fd) os.system("pnmcat -lr %s > %s 2>/dev/null" % (" ".join(pnmfiles), pnmfile)) for f in pnmfiles: os.unlink(f) pnmfiles = [pnmfile] return pnmfiles def extract_ocr_info(pnmfiles): fd, orf = tempfile.mkstemp() os.close(fd) textbits = [] tokens = Set() for pnmfile in pnmfiles: ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile)) textbits.append(ocr.read()) ocr.close() for line in open(orf): if line.startswith("lines"): nlines = int(line.split()[1]) if nlines: tokens.add("image-text-lines:%d" % 
int(log2(nlines))) os.unlink(pnmfile) os.unlink(orf) return "\n".join(textbits), tokens class ImageStripper: def analyze(self, parts): if not parts: return "", Set() # need ocrad if not find_program("ocrad"): return "", Set() decoders = find_decoders() pnmfiles = decode_parts(parts, decoders) if not pnmfiles: return "", Set() return extract_ocr_info(pnmfiles) Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.135 retrieving revision 1.136 diff -C2 -d -r1.135 -r1.136 *** Options.py 6 Aug 2006 16:58:31 -0000 1.135 --- Options.py 6 Aug 2006 17:09:05 -0000 1.136 *************** *** 125,128 **** --- 125,142 ---- BOOLEAN, RESTORE), + ("x-crack_images", _("Look inside images for text"), False, + _("""(EXPERIMENTAL) If true, generate tokens based on the + (hopefully) text content contained in any images in each message. + The current support is minimal, relies on the installation of + ocrad (http://www.gnu.org/software/ocrad/ocrad.html) and netpbm. 
+ It is almost certainly only useful in its current form on Unix-like + machines."""), + BOOLEAN, RESTORE), + + ("max_image_size", _("Max image size to try OCR-ing"), 100000, + _("""When crack_images is enabled, this specifies the largest + image to try OCR on."""), + INTEGER, RESTORE), + ("count_all_header_lines", _("Count all header lines"), False, _("""Generate tokens just counting the number of instances of each kind Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** tokenizer.py 6 Aug 2006 16:58:31 -0000 1.41 --- tokenizer.py 6 Aug 2006 17:09:05 -0000 1.42 *************** *** 1618,1621 **** --- 1618,1629 ---- yield "image-size:2**%d" % round(log2(len(text))) + if options["Tokenizer", "x-crack_images"]: + from spambayes.ImageStripper import ImageStripper + text, tokens = ImageStripper().analyze(parts) + for t in tokens: + yield t + for t in self.tokenize_text(text): + yield t + # Find, decode (base64, qp), and tokenize textual parts of the body. for part in textparts(msg): From montanaro at users.sourceforge.net Sun Aug 6 22:55:12 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 13:55:12 -0700 Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.42,1.43 Message-ID: <20060806205514.D43581E4011@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv3937 Modified Files: tokenizer.py Log Message: log(0) is a no-no. 
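A minimal sketch of the failure mode the log message above refers to, assuming a helper named `image_size_token` (hypothetical, not part of tokenizer.py): `math.log(0)` raises `ValueError`, so an empty payload has to be skipped before the size bucket is computed.

```python
import math

# Same shape as the log2 helper shown in ImageStripper.py above.
def log2(n, log=math.log, c=math.log(2)):
    return log(n) / c

def image_size_token(payload):
    """Return a size-bucket token, or None for an empty or missing payload.

    Hypothetical helper for illustration only.
    """
    if not payload:
        # len(payload) == 0 would mean log(0), which raises ValueError.
        return None
    return "image-size:2**%d" % round(log2(len(payload)))
```

Guarding on the payload itself covers both `None` and the empty string, so no separate length check is needed.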
Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** tokenizer.py 6 Aug 2006 17:09:05 -0000 1.42 --- tokenizer.py 6 Aug 2006 20:55:10 -0000 1.43 *************** *** 1616,1620 **** continue ! yield "image-size:2**%d" % round(log2(len(text))) if options["Tokenizer", "x-crack_images"]: --- 1616,1621 ---- continue ! if text: ! yield "image-size:2**%d" % round(log2(len(text))) if options["Tokenizer", "x-crack_images"]: From montanaro at users.sourceforge.net Mon Aug 7 04:47:13 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 06 Aug 2006 19:47:13 -0700 Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.43,1.44 Message-ID: <20060807024715.6C64D1E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv10981 Modified Files: tokenizer.py Log Message: In splicing back several changes one-by-one I completely left out the code to handle x-lookup_ip... That would explain why my testing today didn't show any improvement! Also, tweak image-size to only yield a single token, and only if there is at least one decodable image. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** tokenizer.py 6 Aug 2006 20:55:10 -0000 1.43 --- tokenizer.py 7 Aug 2006 02:47:10 -0000 1.44 *************** *** 1085,1088 **** --- 1085,1103 ---- scheme, netloc, path, params, query, frag = urlparse.urlparse(url) + if cache is not None and options["Tokenizer", "x-lookup_ip"]: + ips=cache.lookup(netloc) + if len(ips)==0: + pushclue("url-ip:timeout") + else: + for ip in ips: # Should we limit to one A record? 
+ pushclue("url-ip:%s/32" % ip) + dottedQuadList=ip.split(".") + pushclue("url-ip:%s/8" % dottedQuadList[0]) + pushclue("url-ip:%s.%s/16" % (dottedQuadList[0], + dottedQuadList[1])) + pushclue("url-ip:%s.%s.%s/24" % (dottedQuadList[0], + dottedQuadList[1], + dottedQuadList[2])) + # one common technique in bogus "please (re-)authorize yourself" # scams is to make it appear as if you're visiting a valid *************** *** 1605,1608 **** --- 1620,1624 ---- # each image. + total_len = 0 for part in parts: try: *************** *** 1612,1621 **** text = part.get_payload(decode=False) if text is None: yield "control: image payload is None" - continue ! if text: ! yield "image-size:2**%d" % round(log2(len(text))) if options["Tokenizer", "x-crack_images"]: --- 1628,1637 ---- text = part.get_payload(decode=False) + total_len += len(text or "") if text is None: yield "control: image payload is None" ! if total_len: ! yield "image-size:2**%d" % round(log2(total_len)) if options["Tokenizer", "x-crack_images"]: From anadelonbrin at users.sourceforge.net Tue Aug 8 00:22:33 2006 From: anadelonbrin at users.sourceforge.net (Tony Meyer) Date: Mon, 07 Aug 2006 15:22:33 -0700 Subject: [Spambayes-checkins] website docs.ht,1.19,1.20 Message-ID: <20060807222238.0B9FB1E4007@bag.python.org> Update of /cvsroot/spambayes/website In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv25575 Modified Files: docs.ht Log Message: Sourceforge broke our links! Index: docs.ht =================================================================== RCS file: /cvsroot/spambayes/website/docs.ht,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** docs.ht 9 Jul 2004 00:39:20 -0000 1.19 --- docs.ht 7 Aug 2006 22:22:28 -0000 1.20 *************** *** 11,21 **** hints and tips, scripts and recipes, and anything else (related to SpamBayes) that takes your fancy added here. !
  • Instructions on installing Spambayes and integrating it into your mail system.
  • !
  • The Outlook plugin includes an "About" File, and a "Troubleshooting Guide" that can be accessed via the toolbar. (Note that the online documentation is always for the latest source version, and so might not correspond exactly with the version you are using. Always start with the documentation that came with the version you installed.)
  • !
  • The README-DEVEL.txt information that should be of use to people planning on developing code based on SpamBayes.
  • !
  • The TESTING.txt file -- Clues about the practice of statistical testing, adapted from Tim's comments on python-dev.
  • There are also a vast number of clues and notes scattered as block comments through the code. --- 11,21 ---- hints and tips, scripts and recipes, and anything else (related to SpamBayes) that takes your fancy added here.
  • !
  • Instructions on installing Spambayes and integrating it into your mail system.
  • !
  • The Outlook plugin includes an "About" File, and a "Troubleshooting Guide" that can be accessed via the toolbar. (Note that the online documentation is always for the latest source version, and so might not correspond exactly with the version you are using. Always start with the documentation that came with the version you installed.)
  • !
  • The README-DEVEL.txt information that should be of use to people planning on developing code based on SpamBayes.
  • !
  • The TESTING.txt file -- Clues about the practice of statistical testing, adapted from Tim's comments on python-dev.
  • There are also a vast number of clues and notes scattered as block comments through the code. From anadelonbrin at users.sourceforge.net Tue Aug 8 00:23:29 2006 From: anadelonbrin at users.sourceforge.net (Tony Meyer) Date: Mon, 07 Aug 2006 15:23:29 -0700 Subject: [Spambayes-checkins] website download.ht, 1.36, 1.37 index.ht, 1.40, 1.41 Message-ID: <20060807222331.2EABA1E4005@bag.python.org> Update of /cvsroot/spambayes/website In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv26079 Modified Files: download.ht index.ht Log Message: 1.1a2 has been out for a bit now. Index: download.ht =================================================================== RCS file: /cvsroot/spambayes/website/download.ht,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** download.ht 10 Sep 2005 00:29:55 -0000 1.36 --- download.ht 7 Aug 2006 22:23:26 -0000 1.37 *************** *** 11,18 **** spambayes at python.org. !

    The first alpha release of 1.1 is also now available. It is highly likely ! that there are new bugs in this release, but if you are willing and able to ! give it a spin for us, that would be greatly appreciated. You might like ! to look at this list of things to try out.

    --- 11,19 ---- spambayes at python.org. !

    The second alpha release of 1.1 is also now available. It is highly likely ! that there are new bugs in this release (especially with the IMAP filter), ! but if you are willing and able to give it a spin for us, that would be ! greatly appreciated. You might like to look at this ! list of things to try out.

    *************** *** 70,87 ****

  • !
  • d6457f141e2485d26cb2fa61a8d804c7 ! spambayes-1.1a1.exe (3,025,816 bytes, ! sig)
  • !
  • 380bb81006064aeaad16d192439214a4 ! spambayes-1.1a1.tar.gz ! (823,660 bytes, ! sig)
  • !
  • 1b67365a847e97f24cc50236ba6e2183 ! spambayes-1.1a1.zip (971,031 bytes, ! sig)
  • --- 71,88 ----
    !
  • ! spambayes-1.1a2.exe (3,025,816 bytes, ! sig)
  • !
  • 6c94cb14008580c309dd176af73f2132 ! spambayes-1.1a2.tar.gz ! (830,084 bytes, ! sig)
  • !
  • ! spambayes-1.1a2.zip (971,031 bytes, ! sig)
  • Index: index.ht =================================================================== RCS file: /cvsroot/spambayes/website/index.ht,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** index.ht 10 Sep 2005 00:31:20 -0000 1.40 --- index.ht 7 Aug 2006 22:23:26 -0000 1.41 *************** *** 8,12 **** archives and a Windows binary installer).

    See the download page for more.

    !

    SpamBayes 1.1a1 is also now available! (This includes both the source archives and a Windows binary installer). This is an alpha release, so you should only try it if you are willing to try out --- 8,12 ---- archives and a Windows binary installer).

    See the download page for more.

    !

    SpamBayes 1.1a2 is also now available! (This includes both the source archives and a Windows binary installers). This is an alpha release, so you should only try it if you are willing to try out From montanaro at users.sourceforge.net Wed Aug 9 06:26:39 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Tue, 08 Aug 2006 21:26:39 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py,1.1,1.2 Message-ID: <20060809042641.A19F01E4006@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv7959 Modified Files: dnscache.py Log Message: Don't beat my brains out trying to get dbm and zodb caches to work. Just use a simple pickled dict... Index: dnscache.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/dnscache.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** dnscache.py 6 Aug 2006 16:52:54 -0000 1.1 --- dnscache.py 9 Aug 2006 04:26:36 -0000 1.2 *************** *** 14,19 **** import time import types - import shelve import socket from spambayes.Options import options --- 14,22 ---- import time import types import socket + try: + import cPickle as pickle + except ImportError: + import pickle from spambayes.Options import options *************** *** 63,67 **** class cache: ! def __init__(self,dnsServer=None,cachefile=None): # These attributes intended for user setting self.printStatsAtEnd=False --- 66,70 ---- class cache: ! def __init__(self,dnsServer=None,cachefile=""): # These attributes intended for user setting self.printStatsAtEnd=False *************** *** 93,101 **** # end of user-settable attributes ! self.cachefile = cachefile ! if cachefile: ! self.open_cachefile(cachefile) else: ! self.caches={ "A": {}, "PTR": {} } self.hits=0 # These two for statistics self.misses=0 --- 96,114 ---- # end of user-settable attributes ! self.cachefile = os.path.expanduser(cachefile) ! 
if self.cachefile and os.path.exists(self.cachefile): ! self.caches = pickle.load(open(self.cachefile, "rb")) else: ! self.caches = {"A": {}, "PTR": {}} ! ! if options["globals", "verbose"]: ! if self.caches["A"] or self.caches["PTR"]: ! print >> sys.stderr, "opened existing cache with", ! print >> sys.stderr, len(self.caches["A"]), "A records", ! print >> sys.stderr, "and", len(self.caches["PTR"]), ! print >> sys.stderr, "PTR records" ! else: ! print >> sys.stderr, "opened new cache" ! self.hits=0 # These two for statistics self.misses=0 *************** *** 109,198 **** return None - def open_cachefile(self, cachefile): - filetype = options["Storage", "persistent_use_database"] - cachefile = os.path.expanduser(cachefile) - if filetype == "dbm": - self.caches=shelve.open(cachefile) - if not self.caches.has_key("A"): - self.caches["A"] = {} - if not self.caches.has_key("PTR"): - self.caches["PTR"] = {} - elif filetype == "zodb": - from ZODB import DB - from ZODB.FileStorage import FileStorage - self._zodb_storage = FileStorage(cachefile, read_only=False) - self._DB = DB(self._zodb_storage, cache_size=10000) - self._conn = self._DB.open() - root = self._conn.root() - self.caches = root.get("dnscache") - if self.caches is None: - # There is no classifier, so create one. 
- from BTrees.OOBTree import OOBTree - self.caches = root["dnscache"] = OOBTree() - self.caches["A"] = {} - self.caches["PTR"] = {} - print "opened new cache" - else: - print "opened existing cache with", len(self.caches["A"]), "A records", - print "and", len(self.caches["PTR"]), "PTR records" - def close(self): - if not self.cachefile: - return - filetype = options["Storage", "persistent_use_database"] - if filetype == "dbm": - self.caches.close() - elif filetype == "zodb": - self._zodb_close() - - def _zodb_store(self): - import transaction - from ZODB.POSException import ConflictError - from ZODB.POSException import TransactionFailedError - - try: - transaction.commit() - except ConflictError, msg: - # We'll save it next time, or on close. It'll be lost if we - # hard-crash, but that's unlikely, and not a particularly big - # deal. - if options["globals", "verbose"]: - print >> sys.stderr, "Conflict on commit.", msg - transaction.abort() - except TransactionFailedError, msg: - # Saving isn't working. Try to abort, but chances are that - # restarting is needed. - if options["globals", "verbose"]: - print >> sys.stderr, "Store failed. Need to restart.", msg - transaction.abort() - - def _zodb_close(self): - # Ensure that the db is saved before closing. Alternatively, we - # could abort any waiting transaction. We need to do *something* - # with it, though, or it will be still around after the db is - # closed and cause problems. For now, saving seems to make sense - # (and we can always add abort methods if they are ever needed). - self._zodb_store() - - # Do the closing. - self._DB.close() - - # We don't make any use of the 'undo' capabilities of the - # FileStorage at the moment, so might as well pack the database - # each time it is closed, to save as much disk space as possible. - # Pack it up to where it was 'yesterday'. - # XXX What is the 'referencesf' parameter for pack()? It doesn't - # XXX seem to do anything according to the source. 
- ## self._zodb_storage.pack(time.time()-60*60*24, None) - self._zodb_storage.close() - - self._zodb_closed = True - if options["globals", "verbose"]: - print >> sys.stderr, 'Closed dnscache database' - - - def __del__(self): if self.printStatsAtEnd: self.printStats() def printStats(self): --- 122,130 ---- return None def close(self): if self.printStatsAtEnd: self.printStats() + if self.cachefile: + pickle.dump(self.caches, open(self.cachefile, "wb")) def printStats(self): *************** *** 201,209 **** for item in val.values(): totAnswers+=len(item) ! print "cache %s has %i question(s) and %i answer(s)" % (key,len(self.caches[key]),totAnswers) if self.hits+self.misses==0: ! print "No queries" else: ! print "%i hits, %i misses (%.1f%% hits)" % (self.hits, self.misses, self.hits/float(self.hits+self.misses)*100) def prune(self,now): --- 133,144 ---- for item in val.values(): totAnswers+=len(item) ! print >> sys.stderr, "cache", key, "has", len(self.caches[key]), ! print >> sys.stderr, "question(s) and", totAnswers, "answer(s)" if self.hits+self.misses==0: ! print >> sys.stderr, "No queries" else: ! print >> sys.stderr, self.hits, "hits,", self.misses, "misses", ! print >> sys.stderr, "(%.1f%% hits)" % \ ! (self.hits/float(self.hits+self.misses)*100) def prune(self,now): *************** *** 223,232 **** break answer=allAnswers.pop() ! c=self.caches[answer.type] c[answer.question].remove(answer) if len(c[answer.question])==0: del c[answer.question] ! self.printStats() if len(allAnswers)<=kPruneDownTo: --- 158,168 ---- break answer=allAnswers.pop() ! c=self.caches[answer.qType] c[answer.question].remove(answer) if len(c[answer.question])==0: del c[answer.question] ! if options["globals", "verbose"]: ! self.printStats() if len(allAnswers)<=kPruneDownTo: *************** *** 242,246 **** for count in range(numToDelete): answer=allAnswers.pop() ! 
c=self.caches[answer.type] c[answer.question].remove(answer) if len(c[answer.question])==0: --- 178,182 ---- for count in range(numToDelete): answer=allAnswers.pop() ! c=self.caches[answer.qType] c[answer.question].remove(answer) if len(c[answer.question])==0: From montanaro at users.sourceforge.net Thu Aug 10 06:08:03 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Wed, 09 Aug 2006 21:08:03 -0700 Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, 1.1, 1.2 Options.py, 1.136, 1.137 tokenizer.py, 1.44, 1.45 Message-ID: <20060810040805.9A76E1E4007@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv21273/spambayes Modified Files: ImageStripper.py Options.py tokenizer.py Log Message: Use PIL to decode input images if available (faster, much more robust, and platform-independent). Add a token cache for the ocr output to speed up that operation. Slight API change for the ocr stuff. Now a singleton is created and used for all analysis. Index: ImageStripper.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** ImageStripper.py 6 Aug 2006 17:09:04 -0000 1.1 --- ImageStripper.py 10 Aug 2006 04:07:59 -0000 1.2 *************** *** 3,10 **** --- 3,28 ---- """ + from __future__ import division + + import sys import os import tempfile import math import time + import md5 + import atexit + try: + import cPickle as pickle + except ImportError: + import pickle + try: + import cStringIO as StringIO + except ImportError: + import StringIO + + try: + from PIL import Image + except ImportError: + Image = None try: *************** *** 65,128 **** return decoders ! def decode_parts(parts, decoders): ! pnmfiles = [] ! for part in parts: ! decoder = decoders.get(part.get_content_type()) ! if decoder is None: ! continue ! try: ! 
bytes = part.get_payload(decode=True) ! except: ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! fd, imgfile = tempfile.mkstemp() ! os.write(fd, bytes) ! os.close(fd) ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) ! pnmfiles.append(pnmfile) ! if not pnmfiles: ! return - if len(pnmfiles) > 1: - if find_program("pnmcat"): fd, pnmfile = tempfile.mkstemp() os.close(fd) ! os.system("pnmcat -lr %s > %s 2>/dev/null" % ! (" ".join(pnmfiles), pnmfile)) ! for f in pnmfiles: ! os.unlink(f) ! pnmfiles = [pnmfile] ! return pnmfiles ! def extract_ocr_info(pnmfiles): ! fd, orf = tempfile.mkstemp() ! os.close(fd) ! textbits = [] ! tokens = Set() ! for pnmfile in pnmfiles: ! ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile)) ! textbits.append(ocr.read()) ! ocr.close() ! for line in open(orf): ! if line.startswith("lines"): ! nlines = int(line.split()[1]) ! if nlines: ! tokens.add("image-text-lines:%d" % int(log2(nlines))) ! os.unlink(pnmfile) ! os.unlink(orf) ! return "\n".join(textbits), tokens - class ImageStripper: def analyze(self, parts): if not parts: --- 83,211 ---- return decoders ! def imconcat(im1, im2): ! # concatenate im1 and im2 left-to-right ! w1, h1 = im1.size ! w2, h2 = im2.size ! im3 = Image.new("RGB", (w1+w2, max(h1, h2))) ! im3.paste(im1, (0, 0)) ! im3.paste(im2, (0, w1)) ! return im3 ! class ImageStripper: ! def __init__(self, cachefile=""): ! self.cachefile = os.path.expanduser(cachefile) ! if os.path.exists(self.cachefile): ! self.cache = pickle.load(open(self.cachefile)) ! else: ! self.cache = {} ! self.misses = self.hits = 0 ! if self.cachefile: ! atexit.register(self.close) ! def NetPBM_decode_parts(self, parts, decoders): ! pnmfiles = [] ! for part in parts: ! decoder = decoders.get(part.get_content_type()) ! if decoder is None: ! continue ! try: ! bytes = part.get_payload(decode=True) ! except: ! 
continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! fd, imgfile = tempfile.mkstemp() ! os.write(fd, bytes) ! os.close(fd) fd, pnmfile = tempfile.mkstemp() os.close(fd) ! os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) ! pnmfiles.append(pnmfile) ! os.unlink(imgfile) ! if not pnmfiles: ! return ! if len(pnmfiles) > 1: ! if find_program("pnmcat"): ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! os.system("pnmcat -lr %s > %s 2>/dev/null" % ! (" ".join(pnmfiles), pnmfile)) ! for f in pnmfiles: ! os.unlink(f) ! pnmfiles = [pnmfile] ! return pnmfiles ! def PIL_decode_parts(self, parts): ! full_image = None ! for part in parts: ! try: ! bytes = part.get_payload(decode=True) ! except: ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! ! # We're dealing with spammers here - who knows what garbage they ! # will call a GIF image to entice you to open it? ! try: ! image = Image.open(StringIO.StringIO(bytes)) ! image.load() ! except IOError: ! continue ! else: ! image = image.convert("RGB") ! ! if full_image is None: ! full_image = image ! else: ! full_image = imconcat(full_image, image) ! ! if not full_image: ! return ! ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! full_image.save(open(pnmfile, "wb"), "PPM") ! ! return [pnmfile] ! ! def extract_ocr_info(self, pnmfiles): ! fd, orf = tempfile.mkstemp() ! os.close(fd) ! ! textbits = [] ! tokens = Set() ! for pnmfile in pnmfiles: ! fhash = md5.new(open(pnmfile).read()).hexdigest() ! if fhash in self.cache: ! self.hits += 1 ! ctext, ctokens = self.cache[fhash] ! else: ! self.misses += 1 ! ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile)) ! ctext = ocr.read().lower() ! ocr.close() ! ctokens = set() ! for line in open(orf): ! if line.startswith("lines"): ! nlines = int(line.split()[1]) ! if nlines: ! ctokens.add("image-text-lines:%d" % ! int(log2(nlines))) ! 
self.cache[fhash] = (ctext, ctokens) ! textbits.append(ctext) ! tokens |= ctokens ! os.unlink(pnmfile) ! os.unlink(orf) ! ! return "\n".join(textbits), tokens def analyze(self, parts): if not parts: *************** *** 133,143 **** return "", Set() ! decoders = find_decoders() ! pnmfiles = decode_parts(parts, decoders) ! if not pnmfiles: ! return "", Set() ! return extract_ocr_info(pnmfiles) ! --- 216,240 ---- return "", Set() ! if Image is not None: ! pnmfiles = self.PIL_decode_parts(parts) ! else: ! pnmfiles = self.NetPBM_decode_parts(parts, find_decoders()) ! if pnmfiles: ! return self.extract_ocr_info(pnmfiles) ! return "", Set() ! ! def close(self): ! if options["globals", "verbose"]: ! print >> sys.stderr, "saving", len(self.cache), ! print >> sys.stderr, "items to", self.cachefile, ! if self.hits + self.misses: ! print >> sys.stderr, "%.2f%% hit rate" % \ ! (100 * self.hits / (self.hits + self.misses)), ! print >> sys.stderr ! pickle.dump(self.cache, open(self.cachefile, "wb")) ! ! _cachefile = options["Tokenizer", "crack_image_cache"] ! crack_images = ImageStripper(_cachefile).analyze Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.136 retrieving revision 1.137 diff -C2 -d -r1.136 -r1.137 *** Options.py 6 Aug 2006 17:09:05 -0000 1.136 --- Options.py 10 Aug 2006 04:07:59 -0000 1.137 *************** *** 118,122 **** token store (only dbm and zodb supported so far, zodb has problems, dbm is untested, hence the default)."""), ! FILE, RESTORE), ("x-image_size", _("Generate image size tokens"), False, --- 118,122 ---- token store (only dbm and zodb supported so far, zodb has problems, dbm is untested, hence the default)."""), ! 
PATH, RESTORE), ("x-image_size", _("Generate image size tokens"), False, *************** *** 134,137 **** --- 134,142 ---- BOOLEAN, RESTORE), + ("crack_image_cache", _("Cache to speed up ocr."), "", + _("""If non-empty, names a file from which to read cached ocr info + at start and to which to save that info at exit."""), + PATH, RESTORE), + ("max_image_size", _("Max image size to try OCR-ing"), 100000, _("""When crack_images is enabled, this specifies the largest Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.44 retrieving revision 1.45 diff -C2 -d -r1.44 -r1.45 *** tokenizer.py 7 Aug 2006 02:47:10 -0000 1.44 --- tokenizer.py 10 Aug 2006 04:07:59 -0000 1.45 *************** *** 1636,1641 **** if options["Tokenizer", "x-crack_images"]: ! from spambayes.ImageStripper import ImageStripper ! text, tokens = ImageStripper().analyze(parts) for t in tokens: yield t --- 1636,1641 ---- if options["Tokenizer", "x-crack_images"]: ! from spambayes.ImageStripper import crack_images ! 
text, tokens = crack_images(parts) for t in tokens: yield t From anadelonbrin at users.sourceforge.net Sun Aug 13 04:05:46 2006 From: anadelonbrin at users.sourceforge.net (Tony Meyer) Date: Sat, 12 Aug 2006 19:05:46 -0700 Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py,1.2,1.3 Message-ID: <20060813020548.AA6721E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31206/spambayes Modified Files: dnscache.py Log Message: Remove reference to Skip, probably left there by mistake :) Index: dnscache.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/dnscache.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** dnscache.py 9 Aug 2006 04:26:36 -0000 1.2 --- dnscache.py 13 Aug 2006 02:05:43 -0000 1.3 *************** *** 314,318 **** def main(): import transaction ! c=cache(cachefile=os.path.expanduser("~skip/.dnscache")) c.printStatsAtEnd=True for host in ["www.python.org", "www.timsbloggers.com", --- 314,318 ---- def main(): import transaction ! c=cache(cachefile=os.path.expanduser("~/.dnscache")) c.printStatsAtEnd=True for host in ["www.python.org", "www.timsbloggers.com", From montanaro at users.sourceforge.net Sun Aug 13 18:27:51 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 13 Aug 2006 09:27:51 -0700 Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py,1.2,1.3 Message-ID: <20060813162754.806071E4002@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv18791 Modified Files: ImageStripper.py Log Message: The spammers don't just chop up their GIF images left-to-right. Concatenate them left-to-right until the height of adjacent images changes, then start a new row. At the end concatenate the rows top-to-bottom. Add a couple tokens to mark decode or conversion errors. 
The *_decode_parts don't use the class's state, so make them functions instead of methods. Index: ImageStripper.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** ImageStripper.py 10 Aug 2006 04:07:59 -0000 1.2 --- ImageStripper.py 13 Aug 2006 16:27:49 -0000 1.3 *************** *** 83,179 **** return decoders ! def imconcat(im1, im2): ! # concatenate im1 and im2 left-to-right ! w1, h1 = im1.size ! w2, h2 = im2.size ! im3 = Image.new("RGB", (w1+w2, max(h1, h2))) ! im3.paste(im1, (0, 0)) ! im3.paste(im2, (0, w1)) ! return im3 ! class ImageStripper: ! def __init__(self, cachefile=""): ! self.cachefile = os.path.expanduser(cachefile) ! if os.path.exists(self.cachefile): ! self.cache = pickle.load(open(self.cachefile)) ! else: ! self.cache = {} ! self.misses = self.hits = 0 ! if self.cachefile: ! atexit.register(self.close) ! def NetPBM_decode_parts(self, parts, decoders): ! pnmfiles = [] ! for part in parts: ! decoder = decoders.get(part.get_content_type()) ! if decoder is None: ! continue ! try: ! bytes = part.get_payload(decode=True) ! except: ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! fd, imgfile = tempfile.mkstemp() ! os.write(fd, bytes) ! os.close(fd) fd, pnmfile = tempfile.mkstemp() os.close(fd) ! os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) ! pnmfiles.append(pnmfile) ! os.unlink(imgfile) ! if not pnmfiles: ! return ! if len(pnmfiles) > 1: ! if find_program("pnmcat"): ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! os.system("pnmcat -lr %s > %s 2>/dev/null" % ! (" ".join(pnmfiles), pnmfile)) ! for f in pnmfiles: ! os.unlink(f) ! pnmfiles = [pnmfile] ! return pnmfiles ! def PIL_decode_parts(self, parts): ! full_image = None ! for part in parts: ! try: ! bytes = part.get_payload(decode=True) ! except: ! 
continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! # We're dealing with spammers here - who knows what garbage they ! # will call a GIF image to entice you to open it? ! try: ! image = Image.open(StringIO.StringIO(bytes)) ! image.load() ! except IOError: ! continue ! else: ! image = image.convert("RGB") ! if full_image is None: ! full_image = image ! else: ! full_image = imconcat(full_image, image) ! if not full_image: ! return ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! full_image.save(open(pnmfile, "wb"), "PPM") ! return [pnmfile] def extract_ocr_info(self, pnmfiles): --- 83,228 ---- return decoders ! def imconcatlr(left, right): ! """Concatenate two images left to right.""" ! w1, h1 = left.size ! w2, h2 = right.size ! result = Image.new("RGB", (w1 + w2, max(h1, h2))) ! result.paste(left, (0, 0)) ! result.paste(right, (w1, 0)) ! return result ! def imconcattb(upper, lower): ! """Concatenate two images top to bottom.""" ! w1, h1 = upper.size ! w2, h2 = lower.size ! result = Image.new("RGB", (max(w1, w2), h1 + h2)) ! result.paste(upper, (0, 0)) ! result.paste(lower, (0, h1)) ! return result ! def pnmsize(pnmfile): ! """Return dimensions of a PNM file.""" ! f = open(pnmfile) ! line1 = f.readline() ! line2 = f.readline() ! w, h = [int(n) for n in line2.split()] ! return w, h ! def NetPBM_decode_parts(parts, decoders): ! """Decode and assemble a bunch of images using NetPBM tools.""" ! rows = [] ! tokens = Set() ! for part in parts: ! decoder = decoders.get(part.get_content_type()) ! if decoder is None: ! continue ! try: ! bytes = part.get_payload(decode=True) ! except: ! tokens.add("invalid-image:%s" % part.get_content_type()) ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! tokens.add("image:big") ! 
continue # assume it's just a picture for now + fd, imgfile = tempfile.mkstemp() + os.write(fd, bytes) + os.close(fd) + + fd, pnmfile = tempfile.mkstemp() + os.close(fd) + os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) + w, h = pnmsize(pnmfile) + if not rows: + # first image + rows.append([pnmfile]) + elif pnmsize(rows[-1][-1])[1] != h: + # new image, different height => start new row + rows.append([pnmfile]) + else: + # new image, same height => extend current row + rows[-1].append(pnmfile) + + for (i, row) in enumerate(rows): + if len(row) > 1: fd, pnmfile = tempfile.mkstemp() os.close(fd) ! os.system("pnmcat -lr %s > %s 2>/dev/null" % ! (" ".join(row), pnmfile)) ! for f in row: ! os.unlink(f) ! rows[i] = pnmfile ! else: ! rows[i] = row[0] ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! os.system("pnmcat -tb %s > %s 2>/dev/null" % (" ".join(rows), pnmfile)) ! for f in rows: ! os.unlink(f) ! return [pnmfile], tokens ! def PIL_decode_parts(parts): ! """Decode and assemble a bunch of images using PIL.""" ! tokens = Set() ! rows = [] ! for part in parts: ! try: ! bytes = part.get_payload(decode=True) ! except: ! tokens.add("invalid-image:%s" % part.get_content_type()) ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! tokens.add("image:big") ! continue # assume it's just a picture for now ! # We're dealing with spammers and virus writers here. Who knows ! # what garbage they will call a GIF image to entice you to open ! # it? ! try: ! image = Image.open(StringIO.StringIO(bytes)) ! image.load() ! except IOError: ! tokens.add("invalid-image:%s" % part.get_content_type()) ! continue ! else: ! image = image.convert("RGB") ! if not rows: ! # first image ! rows.append(image) ! elif image.size[1] != rows[-1].size[1]: ! # new image, different height => start new row ! rows.append(image) ! else: ! # new image, same height => extend current row ! rows[-1] = imconcatlr(rows[-1], image) ! if not rows: ! return [], tokens ! 
# now concatenate the resulting row images top-to-bottom ! full_image, rows = rows[0], rows[1:] ! for image in rows: ! full_image = imconcattb(full_image, image) ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! full_image.save(open(pnmfile, "wb"), "PPM") ! return [pnmfile], tokens ! class ImageStripper: ! def __init__(self, cachefile=""): ! self.cachefile = os.path.expanduser(cachefile) ! if os.path.exists(self.cachefile): ! self.cache = pickle.load(open(self.cachefile)) ! else: ! self.cache = {} ! self.misses = self.hits = 0 ! if self.cachefile: ! atexit.register(self.close) def extract_ocr_info(self, pnmfiles): *************** *** 217,228 **** if Image is not None: ! pnmfiles = self.PIL_decode_parts(parts) else: ! pnmfiles = self.NetPBM_decode_parts(parts, find_decoders()) if pnmfiles: ! return self.extract_ocr_info(pnmfiles) ! return "", Set() --- 266,280 ---- if Image is not None: ! pnmfiles, tokens = PIL_decode_parts(parts) else: ! if not find_program("pnmcat"): ! return "", Set() ! pnmfiles, tokens = NetPBM_decode_parts(parts, find_decoders()) if pnmfiles: ! text, new_tokens = self.extract_ocr_info(pnmfiles) ! return text, tokens | new_tokens ! return "", tokens From montanaro at users.sourceforge.net Mon Aug 14 04:58:13 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 13 Aug 2006 19:58:13 -0700 Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, 1.3, 1.4 Options.py, 1.137, 1.138 OptionsClass.py, 1.32, 1.33 Message-ID: <20060814025816.9CCEB1E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv26750/spambayes Modified Files: ImageStripper.py Options.py OptionsClass.py Log Message: Add scale and charset options (ocrad_scale and ocrad_charset, respectively) to pass to the ocrad command. Antonio Diaz Diaz, the author of Ocrad, suggested scaling up the images. Ocrad does indeed seem to perform better with the scaled images. 
Scaling by a factor of two seems to do significantly better than not scaling in my 5x5 N-fold test setup. Scaling by a factor of three might even be better, improving false negative percentages in four of the five sets, but it made the false positive score worse in one of the five sets, so I left the default scale at 2. I added the charset flag as well and defaulted to ascii. So far the spammers seem to be "GIFting" us with plain English, so searching for accented characters seems like it would just distract Ocrad. This has yet to be tested though. Index: ImageStripper.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** ImageStripper.py 13 Aug 2006 16:27:49 -0000 1.3 --- ImageStripper.py 14 Aug 2006 02:58:11 -0000 1.4 *************** *** 232,235 **** --- 232,237 ---- textbits = [] tokens = Set() + scale = options["Tokenizer", "ocrad_scale"] or 1 + charset = options["Tokenizer", "ocrad_charset"] for pnmfile in pnmfiles: fhash = md5.new(open(pnmfile).read()).hexdigest() *************** *** 239,243 **** else: self.misses += 1 ! ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile)) ctext = ocr.read().lower() ocr.close() --- 241,246 ---- else: self.misses += 1 ! ocr = os.popen("ocrad -s %s -c %s -x %s < %s 2>/dev/null" % ! (scale, charset, orf, pnmfile)) ctext = ocr.read().lower() ocr.close() Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.137 retrieving revision 1.138 diff -C2 -d -r1.137 -r1.138 *** Options.py 10 Aug 2006 04:07:59 -0000 1.137 --- Options.py 14 Aug 2006 02:58:11 -0000 1.138 *************** *** 139,142 **** --- 139,154 ---- PATH, RESTORE), + ("ocrad_scale", _("Scale factor to use with ocrad."), 2, + _("""Specifies the scale factor to apply when running ocrad. 
While + you can specify a negative scale it probably won't help. Scaling up + by a factor of 2 or 3 seems to work well for the sort of spam images + encountered by SpamBayes."""), + INTEGER, RESTORE), + + ("ocrad_charset", _("Charset to apply with ocrad."), "ascii", + _("""Specifies the charset to use when running ocrad. Valid values + are 'ascii', 'iso-8859-9' and 'iso-8859-15'."""), + OCRAD_CHARSET, RESTORE), + ("max_image_size", _("Max image size to try OCR-ing"), 100000, _("""When crack_images is enabled, this specifies the largest Index: OptionsClass.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** OptionsClass.py 22 Jun 2006 10:36:58 -0000 1.32 --- OptionsClass.py 14 Aug 2006 02:58:11 -0000 1.33 *************** *** 119,122 **** --- 119,123 ---- 'IMAP_FOLDER', 'IMAP_ASTRING', 'RESTORE', 'DO_NOT_RESTORE', 'IP_LIST', + 'OCRAD_CHARSET', ] *************** *** 871,872 **** --- 872,875 ---- RESTORE = True DO_NOT_RESTORE = False + + OCRAD_CHARSET = r"ascii|iso-8859-9|iso-8859-15" From montanaro at users.sourceforge.net Fri Aug 18 04:29:05 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Thu, 17 Aug 2006 19:29:05 -0700 Subject: [Spambayes-checkins] spambayes/contrib pycksum.py,1.1,1.2 Message-ID: <20060818022907.D10021E4004@bag.python.org> Update of /cvsroot/spambayes/spambayes/contrib In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv16513 Modified Files: pycksum.py Log Message: * Try to improve the duplicate detection capability. Lots of spam nowadays has random text junk, so be more lenient about how many chunks have to match. 
Also do a little more filtering on the source: - Compress multiple spaces and tabs to a single space - Compress multiple contiguous newlines into one - Map all strings of digits to a single "#" character - Map some common html entities to their plain text equivalents. * Use md5 checksum hexdigests instead of binascii.b2a_hex. * Correct line breaking of filtered body. * Use email.generator to flatten body instead of the broken flatten() function. Index: pycksum.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/contrib/pycksum.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** pycksum.py 25 May 2004 14:58:39 -0000 1.1 --- pycksum.py 18 Aug 2006 02:29:02 -0000 1.2 *************** *** 39,60 **** import sys import email.Parser import md5 import anydbm import re import time ! import binascii ! ! def flatten(body): ! # three types are possible: list, string, Message ! if isinstance(body, str): ! return body ! if hasattr(body, "get_payload"): ! payload = body.get_payload() ! if payload is None: ! return "" ! return flatten(payload) ! if isinstance(body, list): ! return "\n".join([flatten(b) for b in body]) ! raise TypeError, ("unrecognized body type: %s" % type(body)) def clean(data): --- 39,51 ---- import sys import email.Parser + import email.generator import md5 import anydbm import re import time ! try: ! import cStringIO as StringIO ! except ImportError: ! import StringIO def clean(data): *************** *** 67,74 **** data = re.sub(r"<[^>]*>", "", data).lower() # delete anything which looks like a url or email address # not sure what a pmguid: url is but it seems to occur frequently in spam # also convert all runs of whitespace into a single space ! 
return " ".join([w for w in data.split() if ('@' not in w and (':' not in w or --- 58,78 ---- data = re.sub(r"<[^>]*>", "", data).lower() + # Map all digits to '#' + data = re.sub(r"[0-9]+", "#", data) + + # Map a few common html entities + data = re.sub(r"(&nbsp;)+", " ", data) + data = re.sub(r"&lt;", "<", data) + data = re.sub(r"&gt;", ">", data) + data = re.sub(r"&amp;", "&", data) + + # Elide blank lines and multiple horizontal whitespace + data = re.sub(r"\n+", "\n", data) + data = re.sub(r"[ \t]+", " ", data) + # delete anything which looks like a url or email address # not sure what a pmguid: url is but it seems to occur frequently in spam # also convert all runs of whitespace into a single space ! return " ".join([w for w in data.split(" ") if ('@' not in w and (':' not in w or *************** *** 87,97 **** # separately or in various combinations if desired. ! body = flatten(msg) ! lines = clean(body) chunksize = len(lines)//4+1 sum = [] for i in range(4): chunk = "\n".join(lines[i*chunksize:(i+1)*chunksize]) ! sum.append(binascii.b2a_hex(md5.new(chunk).digest())) return ".".join(sum) --- 91,105 ---- # separately or in various combinations if desired. ! fp = StringIO.StringIO() ! g = email.generator.Generator(fp, mangle_from_=False, maxheaderlen=60) ! g.flatten(msg) ! text = fp.getvalue() ! body = text.split("\n\n", 1)[1] ! lines = clean(body).split("\n") chunksize = len(lines)//4+1 sum = [] for i in range(4): chunk = "\n".join(lines[i*chunksize:(i+1)*chunksize]) ! sum.append(md5.new(chunk).hexdigest()) return ".".join(sum) *************** *** 102,111 **** db = anydbm.open(f, "c") maxdblen = 2**14 ! # consider the first three pieces, the last three pieces and the middle ! # two pieces - one or more will likely eliminate attempts at disrupting ! # the checksum - if any are found in the db file, call it a match ! for subsum in (".".join(pieces[:-1]), ".".join(pieces[1:-1]), !
".".join(pieces[1:])): if not db.has_key(subsum): db[subsum] = str(time.time()) --- 110,119 ---- db = anydbm.open(f, "c") maxdblen = 2**14 ! # consider the first two pieces, the middle two pieces and the last two ! # pieces - one or more will likely eliminate attempts at disrupting the ! # checksum - if any are found in the db file, call it a match ! for subsum in (".".join(pieces[:-2]), ".".join(pieces[1:-1]), ! ".".join(pieces[2:])): if not db.has_key(subsum): db[subsum] = str(time.time()) *************** *** 155,157 **** if __name__ == "__main__": sys.exit(main(sys.argv[1:])) - --- 163,164 ---- From montanaro at users.sourceforge.net Fri Aug 18 19:26:52 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 10:26:52 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.54,1.55 Message-ID: <20060818172655.DF0ED1E4004@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv9126 Modified Files: CHANGELOG.txt Log Message: I hope this doesn't break any scripts or irritate anyone too much, however... Just as mm/dd/yyyy format looks strange to non-US folks, dd/mm/yyyy looks just as strange to us cowboy types. Compromise on ISO-8601 dates. They sort, they're unambiguous, and they probably piss off both camps equally well. ;-) Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.54 retrieving revision 1.55 diff -C2 -d -r1.54 -r1.55 *** CHANGELOG.txt 7 Apr 2006 02:37:28 -0000 1.54 --- CHANGELOG.txt 18 Aug 2006 17:26:50 -0000 1.55 *************** *** 1,231 **** ! [Note that all dates are in English, not American format - i.e. day/month/year] Release 1.1a2 ============= ! Tony Meyer 03/04/2006 Add [ 1081787 ] Adding the version only to sb_filter.py ! Tony Meyer 03/04/2006 Fix [ 1383801 ] trustedIPs wildcard to regex broken ! 
Tony Meyer 02/04/2006 Fix [ 1387699 ] train_on_filter=True needs the db to be opened read/write ! Tony Meyer 02/04/2006 Fix [ 1387709 ] If globals:dbm_type is non-default, then don't use whichdb. ! Tony Meyer 27/11/2005 Install the conversion utility and offer to run it on Windows install. ! Tony Meyer 26/11/2005 Add conversion utility to easily convert dbm to ZODB. [...1933 lines suppressed...] ! Tim Stone 2003-02-25 Add option for pop3proxy to notate Subject: header. ! Tony Meyer 2003-02-25 Fix bug in Corpus.get() which would never return the default value. ! Mark Hammond 2003-02-18 "Store Outlook plugin files in the ""correct"" Windows directory." ! Neil Schemenauer 2003-02-16 Add -c and -d options to mailsort.py. ! Neil Schemenauer 2003-02-16 First check-in of dump_cdb.py ! Mark Hammond 2003-02-13 Add SF#685746 ('Outlook plugin folder list sorted alphabetically'). ! Mark Hammond 2003-02-13 Handle exceptions when opening folders in Outlook plugin better. ! Skip Montanaro 2003-02-13 Split BAYESCUSTOMIZE environment variable using os.pathsep. ! Mark Hammond 2003-02-12 Check for correct exception when removing file in Outlook addin. ! Mark Hammond 2003-02-12 Check for bsddb3 before bsddb (previously bsddb3 would never be found). ! Tim Stone 2003-02-10 Changed BAYESCUSTOMIZE environment variable parsing from a split to a regex to fix filenames with embedded spaces. ! Tim Stone 2003-02-08 Ensure that nham and nspam are instances of integer in dbExpImp.py ! Tim Stone 2003-02-08 Ensure that nham and nspam becoming strings doesn't break classification. ! Tim Stone 2003-02-08 Added ability to put classification in subject or to headers (for OE). ! Mark Hammond 2003-02-07 Fix some errors using bsddb3 in Outlook plugin. ! Mark Hammond 2003-02-04 "Fix SF#642740 ('""Recover from Spam"" wrong folder')." ! Mark Hammond 2003-02-03 Change train.py to be able to work with a bsddb database. ! 
Mark Hammond 2003-02-03 If a new bsddb or bsddb3 module is available use this instead of a pickle in the Outlook plugin. ! Mark Hammond 2003-02-03 Add tick-marks to the filter dialog. ! Mark Hammond 2003-02-03 Fix SF#677804 ('Untouched filter command error'). From montanaro at users.sourceforge.net Fri Aug 18 19:42:39 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 10:42:39 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.55,1.56 Message-ID: <20060818174242.264BB1E400D@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv16016 Modified Files: CHANGELOG.txt Log Message: Add my recent changes to changelog Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.55 retrieving revision 1.56 diff -C2 -d -r1.55 -r1.56 *** CHANGELOG.txt 18 Aug 2006 17:26:50 -0000 1.55 --- CHANGELOG.txt 18 Aug 2006 17:42:37 -0000 1.56 *************** *** 1,4 **** --- 1,23 ---- [Note that all dates are in ISO 8601 format, e.g. 
YYYY-MM-DD to ease sorting] + Release 1.1a3 + ============= + + Skip Montanaro 2006-08-18 Update pycksum.py to try and identify more duplicates + Skip Montanaro 2006-08-14 Add scale and charset options to ImageStripper + Skip Montanaro 2006-08-13 Stitch spam images back together properly, add a couple more tokens + Skip Montanaro 2006-08-10 Add support for PIL to ImageStripper.py + Skip Montanaro 2006-08-09 Cache x-lookup_ip in a pickle instead of trying to use anydbm or zodb + Skip Montanaro 2006-08-06 Add crude OCR capability to try and parse image-based spam using Ocrad & NetPBM + Skip Montanaro 2006-08-06 Add x-short_runs option + Skip Montanaro 2006-08-06 Add x-image_size option & corresponding token + Skip Montanaro 2006-08-06 Add Matt Cowles' x-lookup_ip extension w/ slight modifications + Skip Montanaro 2006-08-06 Add profiling using cProfile (if available) to sb_filter.py + Skip Montanaro 2006-08-06 Delete -d and -p flags from spamcounts.py + Skip Montanaro 2006-08-06 Refactor basic text tokenizing out of tokenize_body into a separate method, tokenize_text + Skip Montanaro 2006-08-05 Explicitly close ZODB store in tte.py + Skip Montanaro 2006-04-23 Reduce sensitivity of spamcounts.py to classifier changes + + Release 1.1a2 ============= From montanaro at users.sourceforge.net Sat Aug 19 02:26:40 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 17:26:40 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.56,1.57 Message-ID: <20060819002643.4F05C1E400C@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv14309 Modified Files: CHANGELOG.txt Log Message: Add other recent changelog bits Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.56 retrieving revision 1.57 diff -C2 -d -r1.56 -r1.57 *** CHANGELOG.txt 18 Aug 2006 17:42:37 -0000 1.56 --- 
CHANGELOG.txt 19 Aug 2006 00:26:38 -0000 1.57 *************** *** 17,21 **** --- 17,26 ---- Skip Montanaro 2006-08-06 Refactor basic text tokenizing out of tokenize_body into a separate method, tokenize_text Skip Montanaro 2006-08-05 Explicitly close ZODB store in tte.py + Tony Meyer 2006-06-22 Fix bug in regex preventing valid IPs + Toby Dickenson 2006-06-12 Suppress spurious duplicate From_ lines in sb_bnfilter.py + Tony Meyer 2006-06-10 Add simple parts of [ 824651 ] Multibyte message support + Tony Meyer 2006-05-06 Enable -o command line option setting, and follow TestDriver directories in testtools/mksets.py Skip Montanaro 2006-04-23 Reduce sensitivity of spamcounts.py to classifier changes + Tony Meyer 2006-04-22 Set zodb cache size to 10,000 From montanaro at users.sourceforge.net Sat Aug 19 02:37:55 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 17:37:55 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.40,1.41 Message-ID: <20060819003757.B88791E4006@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv18704 Modified Files: WHAT_IS_NEW.txt Log Message: Update for 1.1a3 Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** WHAT_IS_NEW.txt 27 Nov 2005 02:15:33 -0000 1.40 --- WHAT_IS_NEW.txt 19 Aug 2006 00:37:52 -0000 1.41 *************** *** 16,19 **** --- 16,88 ---- is released. + New in 1.1 Alpha 3 + ================== + + + -------------------------------------------- + ** Incompatible changes and Transitioning ** + -------------------------------------------- + + There should be no incompatible changes since 1.1a2, though users new to the + 1.1 series should pay careful attention to the database changes introduced + in 1.1a2. 
+ + + ------------------- + ** Other changes ** + ------------------- + + General + ------- + + Reported Bugs Fixed + =================== + No bugs tracked via the Sourceforge system were fixed. + + + Patches integrated + =================== + The following patches tracked via the Sourceforge system were integrated + in this release: + 824651 + + Feature Requests Added + ====================== + No feature requests tracked via the Sourceforge system were added + in this release. + + + Experimental Options + ==================== + + In addition to the experimental options listed for the 1.1a2 release, four + more new experimental options were added to SpamBayes. They all need + further testing. + + o x-short_runs - If true, generate tokens based on max number of short + word runs. Short words are anything of length < the skip_max_word_size + option. Normally they are skipped, but one common spam technique spells + words like 'V m I n A o G p RA' to try and avoid exposing them to + content filters. + + o x-lookup_ip - If true, generate IP address tokens from hostnames. This + requires PyDNS (http://pydns.sourceforge.net/). + + o x-image_size - If true, generate tokens based on the size of the largest + attached image. + + o x-crack_images - A lot of recent spam contains the entire message + embedded in one or more attached images. This option, if true, + generates tokens based on the (hopefully) text content contained in any + images in each message. The current support is minimal, relies on the + installation of ocrad (http://www.gnu.org/software/ocrad/ocrad.html) and + the Python Imaging Library (a.k.a. PIL, available at + http://www.pythonware.com/products/pil/). It has not yet been tested on + Windows, but for brave souls there is a simple zip file binary of ocrad + called "ocrad-cygwin" on the SpamBayes download page for Windows users + who can't build it themselves. PIL has its own Windows binary + installers specific to versions of Python as far back as 2.1. 
+ + New in 1.1 Alpha 2 ================== From mhammond at users.sourceforge.net Thu Aug 24 14:42:03 2006 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Thu, 24 Aug 2006 05:42:03 -0700 Subject: [Spambayes-checkins] spambayes/spambayes __init__.py,1.18,1.19 Message-ID: <20060824124205.F40CC1E400A@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv23537/spambayes Modified Files: __init__.py Log Message: Version 1.1a3 Index: __init__.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/__init__.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** __init__.py 20 Apr 2006 03:13:26 -0000 1.18 --- __init__.py 24 Aug 2006 12:41:57 -0000 1.19 *************** *** 6,9 **** _ = lambda arg: arg ! __version__ = "1.1a2" ! __date__ = _("April 2005") --- 6,9 ---- _ = lambda arg: arg ! __version__ = "1.1a3" ! __date__ = _("August 2006") From mhammond at users.sourceforge.net Thu Aug 24 14:45:46 2006 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Thu, 24 Aug 2006 05:45:46 -0700 Subject: [Spambayes-checkins] spambayes/windows pop3proxy_tray.py, 1.24, 1.25 Message-ID: <20060824124548.E46E61E4005@bag.python.org> Update of /cvsroot/spambayes/spambayes/windows In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv24833/windows Modified Files: pop3proxy_tray.py Log Message: re-add the taskbar icon in the case of explorer crashing and restarting Index: pop3proxy_tray.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/windows/pop3proxy_tray.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** pop3proxy_tray.py 29 Mar 2005 05:59:25 -0000 1.24 --- pop3proxy_tray.py 24 Aug 2006 12:45:42 -0000 1.25 *************** *** 144,148 **** --- 144,150 ---- 1099 : ("Exit SpamBayes", self.OnExit), } + msg_TaskbarRestart = 
RegisterWindowMessage("TaskbarCreated"); message_map = { + msg_TaskbarRestart: self.OnTaskbarRestart, win32con.WM_DESTROY: self.OnDestroy, win32con.WM_COMMAND: self.OnCommand, *************** *** 188,195 **** 16, 16, icon_flags) ! flags = NIF_ICON | NIF_MESSAGE | NIF_TIP ! nid = (self.hwnd, 0, flags, WM_TASKBAR_NOTIFY, self.hstartedicon, ! "SpamBayes") ! Shell_NotifyIcon(NIM_ADD, nid) self.started = IsServerRunningAnywhere() self.tip = None --- 190,194 ---- 16, 16, icon_flags) ! self._AddTaskbarIcon() self.started = IsServerRunningAnywhere() self.tip = None *************** *** 205,208 **** --- 204,221 ---- "a local server" + def _AddTaskbarIcon(self): + flags = NIF_ICON | NIF_MESSAGE | NIF_TIP + nid = (self.hwnd, 0, flags, WM_TASKBAR_NOTIFY, self.hstartedicon, + "SpamBayes") + try: + Shell_NotifyIcon(NIM_ADD, nid) + except win32api_error: + # Apparently can be seen as XP is starting up. Certainly can + # be seen if explorer.exe is not running when started. + print "Ignoring error adding taskbar icon - explorer may not " \ + "be running (yet)." 
+ # The TaskbarRestart message will fire in this case, and + # everything will work out :) + def BuildToolTip(self): tip = None *************** *** 394,397 **** --- 407,415 ---- function() + def OnTaskbarRestart(self, hwnd, msg, wparam, lparam): + # Called as the taskbar is created (either as Windows starts, or + # as Windows recovers from a crashed explorer.exe) + self._AddTaskbarIcon() + def OnExit(self): if self.started and not self.use_service: From mhammond at users.sourceforge.net Thu Aug 24 15:18:34 2006 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Thu, 24 Aug 2006 06:18:34 -0700 Subject: [Spambayes-checkins] spambayes/windows/py2exe setup_all.py, 1.26, 1.27 Message-ID: <20060824131835.EB71E1E4005@bag.python.org> Update of /cvsroot/spambayes/spambayes/windows/py2exe In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540 Modified Files: setup_all.py Log Message: Ship with PIL (but no Tkinter) and pyDNS Index: setup_all.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/windows/py2exe/setup_all.py,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** setup_all.py 28 Feb 2006 08:11:40 -0000 1.26 --- setup_all.py 24 Aug 2006 13:18:32 -0000 1.27 *************** *** 47,54 **** "spambayes.languages.fr,spambayes.languages.es.DIALOGS," \ "spambayes.languages.es_AR.DIALOGS," \ ! "spambayes.languages.fr.DIALOGS", ! excludes = "win32ui,pywin,pywin.debugger", # pywin is a package, and still seems to be included. ! includes = "dialogs.resources.dialogs,weakref", # Outlook dynamic dialogs ! dll_excludes = "dapi.dll,mapi32.dll", typelibs = [ ('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0), --- 47,61 ---- "spambayes.languages.fr,spambayes.languages.es.DIALOGS," \ "spambayes.languages.es_AR.DIALOGS," \ ! "spambayes.languages.fr.DIALOGS," \ ! "PIL", ! excludes = "Tkinter," # side-effect of PIL and markh doesn't have it :) ! 
"win32ui,pywin,pywin.debugger," # *sob* - these still appear ! # Keep zope out else outlook users lose training. ! # (sob - but some of these may still appear!) ! "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence", ! includes = "dialogs.resources.dialogs,weakref," # Outlook dynamic dialogs ! "BmpImagePlugin,JpegImagePlugin", # PIL modules not auto found ! dll_excludes = "dapi.dll,mapi32.dll," ! "tk84.dll,tcl84.dll", # No Tkinter == no tk/tcl dlls typelibs = [ ('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0), From anadelonbrin at users.sourceforge.net Fri Aug 25 02:43:30 2006 From: anadelonbrin at users.sourceforge.net (Tony Meyer) Date: Thu, 24 Aug 2006 17:43:30 -0700 Subject: [Spambayes-checkins] spambayes/windows spambayes.iss,1.25,1.26 Message-ID: <20060825004333.172E51E4004@bag.python.org> Update of /cvsroot/spambayes/spambayes/windows In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32424/windows Modified Files: spambayes.iss Log Message: Bump version number. For 1.1a3 at least, include ocrad.exe and the patch required to build it. Display license. Maybe binary users aren't aware that this gets installed, and so this might get rid of some of the "can I do X with spambayes" queries. For 1.1a3 at least, it also clarifies where ocrad comes from. Fix typo. Index: spambayes.iss =================================================================== RCS file: /cvsroot/spambayes/spambayes/windows/spambayes.iss,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** spambayes.iss 27 Nov 2005 00:42:11 -0000 1.25 --- spambayes.iss 25 Aug 2006 00:43:28 -0000 1.26 *************** *** 5,11 **** [Setup] ; Version specific constants ! AppVerName=SpamBayes 1.1a1 ! AppVersion=1.1a1 ! OutputBaseFilename=spambayes-1.1a1 ; Normal constants. Be careful about changing 'AppName' AppName=SpamBayes --- 5,11 ---- [Setup] ; Version specific constants ! AppVerName=SpamBayes 1.1a3 ! AppVersion=1.1a3 ! 
OutputBaseFilename=spambayes-1.1a3 ; Normal constants. Be careful about changing 'AppName' AppName=SpamBayes *************** *** 15,18 **** --- 15,19 ---- ShowComponentSizes=no UninstallDisplayIcon={app}\sbicon.ico + LicenseFile=py2exe\dist\license.txt [Files] *************** *** 51,54 **** --- 52,59 ---- Source: "py2exe\dist\bin\convert_database.exe"; DestDir: "{app}\bin"; Flags: ignoreversion + ; Include ocrad.exe and the patch required to get it to compile for Windows. + Source: "py2exe\ocrad.exe"; DestDir: "{app}\bin"; Flags: ignoreversion + Source: "py2exe\ocrad.patch"; DestDir: "{app}\docs"; Flags: ignoreversion + ; There is a problem attempting to get Inno to unregister our DLL. If we mark our DLL ; as 'regserver', it installs and registers OK, but at uninstall time, it unregisters *************** *** 90,94 **** InstallOutlook, InstallProxy, InstallIMAP: Boolean; WarnedNoOutlook, WarnedBoth : Boolean; ! startup, desktop, allusers, startup_imap : Boolean; // Tasks function InstallingOutlook() : Boolean; --- 95,99 ---- InstallOutlook, InstallProxy, InstallIMAP: Boolean; WarnedNoOutlook, WarnedBoth : Boolean; ! startup, desktop, allusers, startup_imap, convert_db : Boolean; // Tasks function InstallingOutlook() : Boolean; From montanaro at users.sourceforge.net Fri Aug 25 04:02:16 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Thu, 24 Aug 2006 19:02:16 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.41,1.42 Message-ID: <20060825020218.5E98D1E4007@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv3443 Modified Files: WHAT_IS_NEW.txt Log Message: Slight update. 
Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** WHAT_IS_NEW.txt 19 Aug 2006 00:37:52 -0000 1.41 --- WHAT_IS_NEW.txt 25 Aug 2006 02:02:12 -0000 1.42 *************** *** 67,74 **** o x-lookup_ip - If true, generate IP address tokens from hostnames. This ! requires PyDNS (http://pydns.sourceforge.net/). o x-image_size - If true, generate tokens based on the size of the largest ! attached image. o x-crack_images - A lot of recent spam contains the entire message --- 67,75 ---- o x-lookup_ip - If true, generate IP address tokens from hostnames. This ! requires PyDNS (http://pydns.sourceforge.net/). This is included in the ! Windows installer. o x-image_size - If true, generate tokens based on the size of the largest ! attached image. o x-crack_images - A lot of recent spam contains the entire message *************** *** 79,86 **** the Python Imaging Library (a.k.a. PIL, available at http://www.pythonware.com/products/pil/). It has not yet been tested on ! Windows, but for brave souls there is a simple zip file binary of ocrad ! called "ocrad-cygwin" on the SpamBayes download page for Windows users ! who can't build it themselves. PIL has its own Windows binary ! installers specific to versions of Python as far back as 2.1. --- 80,84 ---- the Python Imaging Library (a.k.a. PIL, available at http://www.pythonware.com/products/pil/). It has not yet been tested on ! Windows, but is available in the Windows installer (as is PIL).
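
[Editorial aside, not part of the archived check-in mail.] For readers following the x-crack_images work in this batch of check-ins: the ImageStripper.py 1.3 diff earlier in the archive stitches image parts back together by extending the current row while the next image has the same height, starting a new row when the height changes, and finally stacking the rows top to bottom. The sketch below only illustrates that grouping rule on plain (width, height) tuples. It does not use PIL or NetPBM, and the names group_into_rows and stitched_size are made up for this illustration; they do not exist in SpamBayes.

```python
def group_into_rows(sizes):
    """Group (width, height) pairs into rows.

    Mirrors the rule in the check-in: an image whose height equals the
    height of the previous image extends the current row; any other
    image starts a new row.  (Hypothetical helper, for illustration.)
    """
    rows = []
    for size in sizes:
        if rows and rows[-1][-1][1] == size[1]:
            # same height as the last image in the current row
            rows[-1].append(size)
        else:
            # first image, or a different height => start a new row
            rows.append([size])
    return rows


def stitched_size(rows):
    """Return the (width, height) of the canvas holding all rows.

    Each row is laid out left to right (widths add up, height is the
    tallest member); rows are then stacked top to bottom (heights add
    up, width is the widest row).
    """
    row_sizes = [(sum(w for w, h in row), max(h for w, h in row))
                 for row in rows]
    return (max(w for w, h in row_sizes), sum(h for w, h in row_sizes))


if __name__ == "__main__":
    parts = [(100, 20), (50, 20), (80, 30), (80, 30), (10, 5)]
    rows = group_into_rows(parts)
    print(rows)
    print(stitched_size(rows))
```

With the five sizes above, the first two parts share a row, the next two share a second row, and the last part sits alone, so the stitched canvas is as wide as the widest row and as tall as the three row heights added together.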