From montanaro at users.sourceforge.net Sat Aug 5 14:48:11 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sat, 05 Aug 2006 05:48:11 -0700
Subject: [Spambayes-checkins] spambayes/contrib tte.py,1.16,1.17
Message-ID: <20060805124814.1F7351E4003@bag.python.org>
Update of /cvsroot/spambayes/spambayes/contrib
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv28933/contrib
Modified Files:
tte.py
Log Message:
close the store - that's the ticket
Index: tte.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/contrib/tte.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** tte.py 19 Apr 2005 11:15:12 -0000 1.16
--- tte.py 5 Aug 2006 12:48:09 -0000 1.17
***************
*** 260,264 ****
sh_ratio)
! store.store()
if cullext is not None:
--- 260,264 ----
sh_ratio)
! store.close()
if cullext is not None:
From montanaro at users.sourceforge.net Sun Aug 6 03:19:37 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sat, 05 Aug 2006 18:19:37 -0700
Subject: [Spambayes-checkins] spambayes/contrib spamcounts.py,1.7,1.8
Message-ID: <20060806011939.E72631E4003@bag.python.org>
Update of /cvsroot/spambayes/spambayes/contrib
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv5191
Modified Files:
spamcounts.py
Log Message:
Dump the -d and -p flags in favor of the more general -o flag.
Index: spamcounts.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/contrib/spamcounts.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** spamcounts.py 23 Apr 2006 22:30:46 -0000 1.7
--- spamcounts.py 6 Aug 2006 01:19:35 -0000 1.8
***************
*** 2,15 ****
"""
! Check spamcounts for various tokens or patterns
! usage %(prog)s [ -h ] [ -r ] [ -d db ] [ -p ] [ -t ] ...
-h - print this documentation and exit.
-r - treat tokens as regular expressions - may not be used with -t
- -d db - use db instead of the default found in the options file
- -p - db is actually a pickle
-t - read message from stdin, tokenize it, then display their counts
may not be used with -r
"""
--- 2,15 ----
"""
! Check spamcounts for one or more tokens or patterns
! usage %(prog)s [ options ] token ...
-h - print this documentation and exit.
-r - treat tokens as regular expressions - may not be used with -t
-t - read message from stdin, tokenize it, then display their counts
may not be used with -r
+ -o section:option:value
+ - set [section, option] in the options database to value
"""
***************
*** 64,70 ****
def main(args):
try:
! opts, args = getopt.getopt(args, "hrd:t",
! ["help", "re", "database=", "pickle",
! "tokenize"])
except getopt.GetoptError, msg:
usage(msg)
--- 64,69 ----
def main(args):
try:
! opts, args = getopt.getopt(args, "hrto:",
! ["help", "re", "tokenize", "option="])
except getopt.GetoptError, msg:
usage(msg)
***************
*** 72,77 ****
usere = False
- dbname = get_pathname_option("Storage", "persistent_storage_file")
- ispickle = not options["Storage", "persistent_use_database"]
tokenizestdin = False
for opt, arg in opts:
--- 71,74 ----
***************
*** 79,90 ****
usage()
return 0
- elif opt in ("-d", "--database"):
- dbname = arg
elif opt in ("-r", "--re"):
usere = True
- elif opt in ("-p", "--pickle"):
- ispickle = True
elif opt in ("-t", "--tokenize"):
tokenizestdin = True
if usere and tokenizestdin:
--- 76,85 ----
usage()
return 0
elif opt in ("-r", "--re"):
usere = True
elif opt in ("-t", "--tokenize"):
tokenizestdin = True
+ elif opt in ('-o', '--option'):
+ options.set_from_cmdline(arg, sys.stderr)
if usere and tokenizestdin:
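Note on the -o flag above: the usage string gives its argument as section:option:value. What options.set_from_cmdline() does internally is not shown in this diff, so the following is only a minimal sketch in modern Python of how such an argument could be collected with getopt. The function name parse_spamcounts_args and the dictionary of overrides are hypothetical, introduced purely for illustration.

```python
import getopt


def parse_spamcounts_args(argv):
    """Split argv into flags, option overrides and remaining tokens."""
    opts, args = getopt.getopt(argv, "hrto:",
                               ["help", "re", "tokenize", "option="])
    flags = set()
    overrides = {}
    for opt, arg in opts:
        if opt in ("-o", "--option"):
            # Hypothetical stand-in for options.set_from_cmdline():
            # split only twice so the value may itself contain colons.
            section, option, value = arg.split(":", 2)
            overrides[(section, option)] = value
        elif opt in ("-r", "--re"):
            flags.add("re")
        elif opt in ("-t", "--tokenize"):
            flags.add("tokenize")
        elif opt in ("-h", "--help"):
            flags.add("help")
    return flags, overrides, args


if __name__ == "__main__":
    print(parse_spamcounts_args(
        ["-r", "-o", "Storage:persistent_storage_file:/tmp/hammie.db",
         "viagra"]))
```

A single general flag of this shape can reach any [section, option] pair, which is presumably why the dedicated -d and -p flags became redundant.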
From montanaro at users.sourceforge.net Sun Aug 6 16:50:32 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 07:50:32 -0700
Subject: [Spambayes-checkins] spambayes/scripts sb_filter.py,1.19,1.20
Message-ID: <20060806145034.662C91E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/scripts
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv19924
Modified Files:
sb_filter.py
Log Message:
Run under control of the new cProfile profiler, if it's available. I found
this useful to help identify where SB spends its time while training.
Index: sb_filter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/scripts/sb_filter.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** sb_filter.py 7 Apr 2006 02:25:25 -0000 1.19
--- sb_filter.py 6 Aug 2006 14:50:29 -0000 1.20
***************
*** 47,50 ****
--- 47,53 ----
set [section, option] in the options database to value
+ -P
+ Run under control of the Python profiler, if it is available
+
All options marked with '*' operate on stdin, and write the resultant
message to stdout.
***************
*** 211,220 ****
self.h.store()
! def main():
h = HammieFilter()
actions = []
! opts, args = getopt.getopt(sys.argv[1:], 'hvxd:p:nfgstGSo:',
['help', 'version', 'examples', 'option='])
create_newdb = False
for opt, arg in opts:
if opt in ('-h', '--help'):
--- 214,224 ----
self.h.store()
! def main(profiling=False):
h = HammieFilter()
actions = []
! opts, args = getopt.getopt(sys.argv[1:], 'hvxd:p:nfgstGSo:P',
['help', 'version', 'examples', 'option='])
create_newdb = False
+ do_profile = False
for opt, arg in opts:
if opt in ('-h', '--help'):
***************
*** 238,241 ****
--- 242,254 ----
elif opt == '-S':
actions.append(h.untrain_spam)
+ elif opt == '-P':
+ do_profile = True
+ if not profiling:
+ try:
+ import cProfile
+ except ImportError:
+ pass
+ else:
+ return cProfile.run("main(True)")
elif opt == "-n":
create_newdb = True
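Note on the -P flag above: main() re-enters itself under the profiler and uses the profiling argument to avoid wrapping itself a second time. The sketch below shows the same guard in modern Python. It is a variation, not the committed code: it uses cProfile.Profile.runcall() instead of cProfile.run("main(True)") so that the return value is preserved, and train() is a hypothetical stand-in for the real work.

```python
import io
import pstats
import sys


def train(n=1000):
    """Hypothetical stand-in for the real filtering or training work."""
    return sum(i * i for i in range(n))


def main(argv, profiling=False):
    if "-P" in argv and not profiling:
        try:
            import cProfile
        except ImportError:
            pass  # no profiler available: fall through and run normally
        else:
            profiler = cProfile.Profile()
            # Re-enter main() with profiling=True so it is not wrapped twice.
            result = profiler.runcall(main, argv, True)
            report = io.StringIO()
            stats = pstats.Stats(profiler, stream=report)
            stats.sort_stats("cumulative").print_stats(5)
            sys.stderr.write(report.getvalue())
            return result
    return train()


if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Writing the report to stderr keeps stdout free for the filtered message, which matters for a script that operates on stdin and stdout.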
From montanaro at users.sourceforge.net Sun Aug 6 18:14:20 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:14:20 -0700
Subject: [Spambayes-checkins] spambayes/spambayes Options.py,1.131,1.132
Message-ID: <20060806161422.065C21E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv23311/spambayes
Modified Files:
Options.py
Log Message:
slight reformat, doc tweak
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.131
retrieving revision 1.132
diff -C2 -d -r1.131 -r1.132
*** Options.py 27 Nov 2005 22:05:45 -0000 1.131
--- Options.py 6 Aug 2006 16:14:17 -0000 1.132
***************
*** 134,144 ****
BOOLEAN, RESTORE),
! ("address_headers", _("Address headers to mine"), ("from", "to", "cc", "sender", "reply-to"),
_("""Mine the following address headers. If you have mixed source
corpuses (as opposed to a mixed sauce walrus, which is delicious!)
then you probably don't want to use 'to' or 'cc') Address headers will
be decoded, and will generate charset tokens as well as the real
! address. Others to consider: to, cc, reply-to, errors-to, sender,
! ..."""),
HEADER_NAME, RESTORE),
--- 134,144 ----
BOOLEAN, RESTORE),
! ("address_headers", _("Address headers to mine"), ("from", "to", "cc",
! "sender", "reply-to"),
_("""Mine the following address headers. If you have mixed source
corpuses (as opposed to a mixed sauce walrus, which is delicious!)
then you probably don't want to use 'to' or 'cc') Address headers will
be decoded, and will generate charset tokens as well as the real
! address. Others to consider: errors-to, ..."""),
HEADER_NAME, RESTORE),
From montanaro at users.sourceforge.net Sun Aug 6 18:19:21 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:19:21 -0700
Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.37,1.38
Message-ID: <20060806161923.4FFAF1E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv25513
Modified Files:
tokenizer.py
Log Message:
Break basic text tokenizing out into its own method in preparation for some
other changes.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** tokenizer.py 15 Nov 2005 00:16:20 -0000 1.37
--- tokenizer.py 6 Aug 2006 16:19:19 -0000 1.38
***************
*** 1528,1533 ****
yield "noheader:" + k
! def tokenize_body(self, msg, maxword=options["Tokenizer",
! "skip_max_word_size"]):
"""Generate a stream of tokens from an email Message.
--- 1528,1545 ----
yield "noheader:" + k
! def tokenize_text(self, text, maxword=options["Tokenizer",
! "skip_max_word_size"]):
! """Tokenize everything in the chunk of text we were handed."""
! for w in text.split():
! n = len(w)
! # Make sure this range matches in tokenize_word().
! if 3 <= n <= maxword:
! yield w
!
! elif n >= 3:
! for t in tokenize_word(w):
! yield t
!
! def tokenize_body(self, msg):
"""Generate a stream of tokens from an email Message.
***************
*** 1606,1619 ****
text = html_re.sub('', text)
! # Tokenize everything in the body.
! for w in text.split():
! n = len(w)
! # Make sure this range matches in tokenize_word().
! if 3 <= n <= maxword:
! yield w
!
! elif n >= 3:
! for t in tokenize_word(w):
! yield t
global_tokenizer = Tokenizer()
--- 1618,1623 ----
text = html_re.sub('', text)
! for t in self.tokenize_text(text):
! yield t
global_tokenizer = Tokenizer()
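Note on the refactoring above: the word loop moves into its own generator and tokenize_body() simply re-yields whatever it produces. A minimal sketch of that shape in modern Python follows, under stated assumptions: summarize_long_word() is a hypothetical stand-in for tokenize_word(), whose body is not part of this diff, and the default of 12 for maxword is assumed rather than read from the options file.

```python
def summarize_long_word(word):
    """Hypothetical stand-in for tokenize_word(): emit one summary token."""
    yield "skip:%s %d" % (word[0], len(word) // 10 * 10)


def tokenize_text(text, maxword=12):
    """Tokenize one chunk of text, skipping words shorter than 3 chars."""
    for word in text.split():
        n = len(word)
        if 3 <= n <= maxword:
            yield word
        elif n >= 3:
            # Too long to keep verbatim: delegate to the long-word handler.
            yield from summarize_long_word(word)


def tokenize_body(chunks):
    """Delegate each chunk to tokenize_text(), as the new method does."""
    for chunk in chunks:
        yield from tokenize_text(chunk)


if __name__ == "__main__":
    print(list(tokenize_body(["a big supercalifragilistic word"])))
```

In Python 3 the "for t in ...: yield t" loop from the diff can be written as "yield from", which is what the sketch does.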
From montanaro at users.sourceforge.net Sun Aug 6 18:34:39 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:34:39 -0700
Subject: [Spambayes-checkins] spambayes/spambayes Options.py, 1.132,
1.133 tokenizer.py, 1.38, 1.39
Message-ID: <20060806163441.7E8C41E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv30712/spambayes
Modified Files:
Options.py tokenizer.py
Log Message:
Add an x-short_runs option. When enabled, instead of completely skipping
short words, runs of them are counted, the longest generating a token using
the usual log2() technique. See the comment in tokenizer.py and doc string
in Options.py for examples of the sort of things it attempts to catch.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.132
retrieving revision 1.133
diff -C2 -d -r1.132 -r1.133
*** Options.py 6 Aug 2006 16:14:17 -0000 1.132
--- Options.py 6 Aug 2006 16:34:37 -0000 1.133
***************
*** 98,101 ****
--- 98,109 ----
INTEGER, RESTORE),
+ ("x-short_runs", _("Count runs of short 'words'"), False,
+ _("""(EXPERIMENTAL) If true, generate tokens based on max number of
+ short word runs. Short words are anything of length < the
+ skip_max_word_size option. Normally they are skipped, but one common
+ spam technique spells words like 'V I A G RA'.
+ """),
+ BOOLEAN, RESTORE),
+
("count_all_header_lines", _("Count all header lines"), False,
_("""Generate tokens just counting the number of instances of each kind
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** tokenizer.py 6 Aug 2006 16:19:19 -0000 1.38
--- tokenizer.py 6 Aug 2006 16:34:37 -0000 1.39
***************
*** 1531,1543 ****
"skip_max_word_size"]):
"""Tokenize everything in the chunk of text we were handed."""
for w in text.split():
n = len(w)
! # Make sure this range matches in tokenize_word().
! if 3 <= n <= maxword:
! yield w
! elif n >= 3:
! for t in tokenize_word(w):
! yield t
def tokenize_body(self, msg):
--- 1531,1558 ----
"skip_max_word_size"]):
"""Tokenize everything in the chunk of text we were handed."""
+ short_runs = Set()
+ short_count = 0
for w in text.split():
n = len(w)
! if n < 3:
! # count how many short words we see in a row - meant to
! # latch onto crap like this:
! # X j A m N j A d X h
! # M k E z R d I p D u I m A c
! # C o I d A t L j I v S j
! short_count += 1
! else:
! if short_count:
! short_runs.add(short_count)
! short_count = 0
! # Make sure this range matches in tokenize_word().
! if 3 <= n <= maxword:
! yield w
! elif n >= 3:
! for t in tokenize_word(w):
! yield t
! if short_runs and options["Tokenizer", "x-short_runs"]:
! yield "short:%d" % int(log2(max(short_runs)))
def tokenize_body(self, msg):
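Note on x-short_runs above: the idea is to track the longest run of consecutive short words and emit a single token carrying int(log2()) of that length. The sketch below, in modern Python, illustrates the bucketing only. It is not the committed code: short_run_token() is a hypothetical name, and unlike the diff it also counts a run that reaches the end of the text.

```python
import math


def short_run_token(text, min_len=3):
    """Return a 'short:N' token for the longest run of short words."""
    longest = 0
    current = 0
    for word in text.split():
        if len(word) < min_len:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a normal word ends the run
    if longest == 0:
        return None
    # log2 bucketing: runs of 4 to 7 words share one token, 8 to 15 the next.
    return "short:%d" % int(math.log2(longest))


if __name__ == "__main__":
    print(short_run_token("buy V I A G RA now"))
    print(short_run_token("X j A m N j A d X h"))
```

Bucketing by log2 keeps the number of distinct tokens small, so the classifier sees a few well-populated tokens rather than one per exact run length.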
From montanaro at users.sourceforge.net Sun Aug 6 18:52:57 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:52:57 -0700
Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py, NONE,
1.1 Options.py, 1.133, 1.134 tokenizer.py, 1.39, 1.40
Message-ID: <20060806165259.7C81F1E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv5725/spambayes
Modified Files:
Options.py tokenizer.py
Added Files:
dnscache.py
Log Message:
Add Matt Cowles' dnscache module and x-lookup_ip option. Underwent some
substantial changes, most importantly, I got most of the way adding support
for persisting the cache to either dbm or zodb stores. Also ran reindent
over dnscache.py.
--- NEW FILE: dnscache.py ---
# Copyright 2004, Matthew Dixon Cowles
The first alpha release of 1.1 is also now available. It is highly likely
! that there are new bugs in this release, but if you are willing and able to
! give it a spin for us, that would be greatly appreciated. You might like
! to look at this list
of things to try out.
The second alpha release of 1.1 is also now available. It is highly likely
! that there are new bugs in this release (especially with the IMAP filter),
! but if you are willing and able to give it a spin for us, that would be
! greatly appreciated. You might like to look at this
! list
of things to try out.
!
!
See the download page for more.
!SpamBayes 1.1a1 is also now available! (This includes both the source archives and a Windows binary installers). This is an alpha release, so you should only try it if you are willing to try out
--- 8,12 ----
archives and a Windows binary installer).
See the download page for more.
!SpamBayes 1.1a2 is also now available! (This includes both the source archives and a Windows binary installers). This is an alpha release, so you should only try it if you are willing to try out
From montanaro at users.sourceforge.net Wed Aug 9 06:26:39 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Tue, 08 Aug 2006 21:26:39 -0700
Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py,1.1,1.2
Message-ID: <20060809042641.A19F01E4006@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv7959
Modified Files:
dnscache.py
Log Message:
Don't beat my brains out trying to get dbm and zodb caches to work. Just use a simple pickled dict...
Index: dnscache.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/dnscache.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** dnscache.py 6 Aug 2006 16:52:54 -0000 1.1
--- dnscache.py 9 Aug 2006 04:26:36 -0000 1.2
***************
*** 14,19 ****
import time
import types
- import shelve
import socket
from spambayes.Options import options
--- 14,22 ----
import time
import types
import socket
+ try:
+ import cPickle as pickle
+ except ImportError:
+ import pickle
from spambayes.Options import options
***************
*** 63,67 ****
class cache:
! def __init__(self,dnsServer=None,cachefile=None):
# These attributes intended for user setting
self.printStatsAtEnd=False
--- 66,70 ----
class cache:
! def __init__(self,dnsServer=None,cachefile=""):
# These attributes intended for user setting
self.printStatsAtEnd=False
***************
*** 93,101 ****
# end of user-settable attributes
! self.cachefile = cachefile
! if cachefile:
! self.open_cachefile(cachefile)
else:
! self.caches={ "A": {}, "PTR": {} }
self.hits=0 # These two for statistics
self.misses=0
--- 96,114 ----
# end of user-settable attributes
! self.cachefile = os.path.expanduser(cachefile)
! if self.cachefile and os.path.exists(self.cachefile):
! self.caches = pickle.load(open(self.cachefile, "rb"))
else:
! self.caches = {"A": {}, "PTR": {}}
!
! if options["globals", "verbose"]:
! if self.caches["A"] or self.caches["PTR"]:
! print >> sys.stderr, "opened existing cache with",
! print >> sys.stderr, len(self.caches["A"]), "A records",
! print >> sys.stderr, "and", len(self.caches["PTR"]),
! print >> sys.stderr, "PTR records"
! else:
! print >> sys.stderr, "opened new cache"
!
self.hits=0 # These two for statistics
self.misses=0
***************
*** 109,198 ****
return None
- def open_cachefile(self, cachefile):
- filetype = options["Storage", "persistent_use_database"]
- cachefile = os.path.expanduser(cachefile)
- if filetype == "dbm":
- self.caches=shelve.open(cachefile)
- if not self.caches.has_key("A"):
- self.caches["A"] = {}
- if not self.caches.has_key("PTR"):
- self.caches["PTR"] = {}
- elif filetype == "zodb":
- from ZODB import DB
- from ZODB.FileStorage import FileStorage
- self._zodb_storage = FileStorage(cachefile, read_only=False)
- self._DB = DB(self._zodb_storage, cache_size=10000)
- self._conn = self._DB.open()
- root = self._conn.root()
- self.caches = root.get("dnscache")
- if self.caches is None:
- # There is no classifier, so create one.
- from BTrees.OOBTree import OOBTree
- self.caches = root["dnscache"] = OOBTree()
- self.caches["A"] = {}
- self.caches["PTR"] = {}
- print "opened new cache"
- else:
- print "opened existing cache with", len(self.caches["A"]), "A records",
- print "and", len(self.caches["PTR"]), "PTR records"
-
def close(self):
- if not self.cachefile:
- return
- filetype = options["Storage", "persistent_use_database"]
- if filetype == "dbm":
- self.caches.close()
- elif filetype == "zodb":
- self._zodb_close()
-
- def _zodb_store(self):
- import transaction
- from ZODB.POSException import ConflictError
- from ZODB.POSException import TransactionFailedError
-
- try:
- transaction.commit()
- except ConflictError, msg:
- # We'll save it next time, or on close. It'll be lost if we
- # hard-crash, but that's unlikely, and not a particularly big
- # deal.
- if options["globals", "verbose"]:
- print >> sys.stderr, "Conflict on commit.", msg
- transaction.abort()
- except TransactionFailedError, msg:
- # Saving isn't working. Try to abort, but chances are that
- # restarting is needed.
- if options["globals", "verbose"]:
- print >> sys.stderr, "Store failed. Need to restart.", msg
- transaction.abort()
-
- def _zodb_close(self):
- # Ensure that the db is saved before closing. Alternatively, we
- # could abort any waiting transaction. We need to do *something*
- # with it, though, or it will be still around after the db is
- # closed and cause problems. For now, saving seems to make sense
- # (and we can always add abort methods if they are ever needed).
- self._zodb_store()
-
- # Do the closing.
- self._DB.close()
-
- # We don't make any use of the 'undo' capabilities of the
- # FileStorage at the moment, so might as well pack the database
- # each time it is closed, to save as much disk space as possible.
- # Pack it up to where it was 'yesterday'.
- # XXX What is the 'referencesf' parameter for pack()? It doesn't
- # XXX seem to do anything according to the source.
- ## self._zodb_storage.pack(time.time()-60*60*24, None)
- self._zodb_storage.close()
-
- self._zodb_closed = True
- if options["globals", "verbose"]:
- print >> sys.stderr, 'Closed dnscache database'
-
-
- def __del__(self):
if self.printStatsAtEnd:
self.printStats()
def printStats(self):
--- 122,130 ----
return None
def close(self):
if self.printStatsAtEnd:
self.printStats()
+ if self.cachefile:
+ pickle.dump(self.caches, open(self.cachefile, "wb"))
def printStats(self):
***************
*** 201,209 ****
for item in val.values():
totAnswers+=len(item)
! print "cache %s has %i question(s) and %i answer(s)" % (key,len(self.caches[key]),totAnswers)
if self.hits+self.misses==0:
! print "No queries"
else:
! print "%i hits, %i misses (%.1f%% hits)" % (self.hits, self.misses, self.hits/float(self.hits+self.misses)*100)
def prune(self,now):
--- 133,144 ----
for item in val.values():
totAnswers+=len(item)
! print >> sys.stderr, "cache", key, "has", len(self.caches[key]),
! print >> sys.stderr, "question(s) and", totAnswers, "answer(s)"
if self.hits+self.misses==0:
! print >> sys.stderr, "No queries"
else:
! print >> sys.stderr, self.hits, "hits,", self.misses, "misses",
! print >> sys.stderr, "(%.1f%% hits)" % \
! (self.hits/float(self.hits+self.misses)*100)
def prune(self,now):
***************
*** 223,232 ****
break
answer=allAnswers.pop()
! c=self.caches[answer.type]
c[answer.question].remove(answer)
if len(c[answer.question])==0:
del c[answer.question]
! self.printStats()
if len(allAnswers)<=kPruneDownTo:
--- 158,168 ----
break
answer=allAnswers.pop()
! c=self.caches[answer.qType]
c[answer.question].remove(answer)
if len(c[answer.question])==0:
del c[answer.question]
! if options["globals", "verbose"]:
! self.printStats()
if len(allAnswers)<=kPruneDownTo:
***************
*** 242,246 ****
for count in range(numToDelete):
answer=allAnswers.pop()
! c=self.caches[answer.type]
c[answer.question].remove(answer)
if len(c[answer.question])==0:
--- 178,182 ----
for count in range(numToDelete):
answer=allAnswers.pop()
! c=self.caches[answer.qType]
c[answer.question].remove(answer)
if len(c[answer.question])==0:
From montanaro at users.sourceforge.net Thu Aug 10 06:08:03 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Wed, 09 Aug 2006 21:08:03 -0700
Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, 1.1, 1.2 Options.py, 1.136, 1.137 tokenizer.py, 1.44, 1.45
Message-ID: <20060810040805.9A76E1E4007@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv21273/spambayes
Modified Files:
ImageStripper.py Options.py tokenizer.py
Log Message:
Use PIL to decode input images if available (faster, much more robust, and
platform-independent). Add a token cache for the ocr output to speed up that
operation. Slight API change for the ocr stuff. Now a singleton is created
and used for all analysis.
Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** ImageStripper.py 6 Aug 2006 17:09:04 -0000 1.1
--- ImageStripper.py 10 Aug 2006 04:07:59 -0000 1.2
***************
*** 3,10 ****
--- 3,28 ----
"""
+ from __future__ import division
+
+ import sys
import os
import tempfile
import math
import time
+ import md5
+ import atexit
+ try:
+ import cPickle as pickle
+ except ImportError:
+ import pickle
+ try:
+ import cStringIO as StringIO
+ except ImportError:
+ import StringIO
+
+ try:
+ from PIL import Image
+ except ImportError:
+ Image = None
try:
***************
*** 65,128 ****
return decoders
! def decode_parts(parts, decoders):
! pnmfiles = []
! for part in parts:
! decoder = decoders.get(part.get_content_type())
! if decoder is None:
! continue
! try:
! bytes = part.get_payload(decode=True)
! except:
! continue
! if len(bytes) > options["Tokenizer", "max_image_size"]:
! continue # assume it's just a picture for now
! fd, imgfile = tempfile.mkstemp()
! os.write(fd, bytes)
! os.close(fd)
! fd, pnmfile = tempfile.mkstemp()
! os.close(fd)
! os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile))
! pnmfiles.append(pnmfile)
! if not pnmfiles:
! return
- if len(pnmfiles) > 1:
- if find_program("pnmcat"):
fd, pnmfile = tempfile.mkstemp()
os.close(fd)
! os.system("pnmcat -lr %s > %s 2>/dev/null" %
! (" ".join(pnmfiles), pnmfile))
! for f in pnmfiles:
! os.unlink(f)
! pnmfiles = [pnmfile]
! return pnmfiles
! def extract_ocr_info(pnmfiles):
! fd, orf = tempfile.mkstemp()
! os.close(fd)
! textbits = []
! tokens = Set()
! for pnmfile in pnmfiles:
! ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
! textbits.append(ocr.read())
! ocr.close()
! for line in open(orf):
! if line.startswith("lines"):
! nlines = int(line.split()[1])
! if nlines:
! tokens.add("image-text-lines:%d" % int(log2(nlines)))
! os.unlink(pnmfile)
! os.unlink(orf)
! return "\n".join(textbits), tokens
- class ImageStripper:
def analyze(self, parts):
if not parts:
--- 83,211 ----
return decoders
! def imconcat(im1, im2):
! # concatenate im1 and im2 left-to-right
! w1, h1 = im1.size
! w2, h2 = im2.size
! im3 = Image.new("RGB", (w1+w2, max(h1, h2)))
! im3.paste(im1, (0, 0))
! im3.paste(im2, (0, w1))
! return im3
! class ImageStripper:
! def __init__(self, cachefile=""):
! self.cachefile = os.path.expanduser(cachefile)
! if os.path.exists(self.cachefile):
! self.cache = pickle.load(open(self.cachefile))
! else:
! self.cache = {}
! self.misses = self.hits = 0
! if self.cachefile:
! atexit.register(self.close)
! def NetPBM_decode_parts(self, parts, decoders):
! pnmfiles = []
! for part in parts:
! decoder = decoders.get(part.get_content_type())
! if decoder is None:
! continue
! try:
! bytes = part.get_payload(decode=True)
! except:
! continue
! if len(bytes) > options["Tokenizer", "max_image_size"]:
! continue # assume it's just a picture for now
! fd, imgfile = tempfile.mkstemp()
! os.write(fd, bytes)
! os.close(fd)
fd, pnmfile = tempfile.mkstemp()
os.close(fd)
! os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile))
! pnmfiles.append(pnmfile)
! os.unlink(imgfile)
! if not pnmfiles:
! return
! if len(pnmfiles) > 1:
! if find_program("pnmcat"):
! fd, pnmfile = tempfile.mkstemp()
! os.close(fd)
! os.system("pnmcat -lr %s > %s 2>/dev/null" %
! (" ".join(pnmfiles), pnmfile))
! for f in pnmfiles:
! os.unlink(f)
! pnmfiles = [pnmfile]
! return pnmfiles
! def PIL_decode_parts(self, parts):
! full_image = None
! for part in parts:
! try:
! bytes = part.get_payload(decode=True)
! except:
! continue
! if len(bytes) > options["Tokenizer", "max_image_size"]:
! continue # assume it's just a picture for now
!
! # We're dealing with spammers here - who knows what garbage they
! # will call a GIF image to entice you to open it?
! try:
! image = Image.open(StringIO.StringIO(bytes))
! image.load()
! except IOError:
! continue
! else:
! image = image.convert("RGB")
!
! if full_image is None:
! full_image = image
! else:
! full_image = imconcat(full_image, image)
!
! if not full_image:
! return
!
! fd, pnmfile = tempfile.mkstemp()
! os.close(fd)
! full_image.save(open(pnmfile, "wb"), "PPM")
!
! return [pnmfile]
!
! def extract_ocr_info(self, pnmfiles):
! fd, orf = tempfile.mkstemp()
! os.close(fd)
!
! textbits = []
! tokens = Set()
! for pnmfile in pnmfiles:
! fhash = md5.new(open(pnmfile).read()).hexdigest()
! if fhash in self.cache:
! self.hits += 1
! ctext, ctokens = self.cache[fhash]
! else:
! self.misses += 1
! ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
! ctext = ocr.read().lower()
! ocr.close()
! ctokens = set()
! for line in open(orf):
! if line.startswith("lines"):
! nlines = int(line.split()[1])
! if nlines:
! ctokens.add("image-text-lines:%d" %
! int(log2(nlines)))
! self.cache[fhash] = (ctext, ctokens)
! textbits.append(ctext)
! tokens |= ctokens
! os.unlink(pnmfile)
! os.unlink(orf)
!
! return "\n".join(textbits), tokens
def analyze(self, parts):
if not parts:
***************
*** 133,143 ****
return "", Set()
! decoders = find_decoders()
! pnmfiles = decode_parts(parts, decoders)
! if not pnmfiles:
! return "", Set()
! return extract_ocr_info(pnmfiles)
!
--- 216,240 ----
return "", Set()
! if Image is not None:
! pnmfiles = self.PIL_decode_parts(parts)
! else:
! pnmfiles = self.NetPBM_decode_parts(parts, find_decoders())
! if pnmfiles:
! return self.extract_ocr_info(pnmfiles)
! return "", Set()
!
! def close(self):
! if options["globals", "verbose"]:
! print >> sys.stderr, "saving", len(self.cache),
! print >> sys.stderr, "items to", self.cachefile,
! if self.hits + self.misses:
! print >> sys.stderr, "%.2f%% hit rate" % \
! (100 * self.hits / (self.hits + self.misses)),
! print >> sys.stderr
! pickle.dump(self.cache, open(self.cachefile, "wb"))
!
! _cachefile = options["Tokenizer", "crack_image_cache"]
! crack_images = ImageStripper(_cachefile).analyze
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.136
retrieving revision 1.137
diff -C2 -d -r1.136 -r1.137
*** Options.py 6 Aug 2006 17:09:05 -0000 1.136
--- Options.py 10 Aug 2006 04:07:59 -0000 1.137
***************
*** 118,122 ****
token store (only dbm and zodb supported so far, zodb has problems,
dbm is untested, hence the default)."""),
! FILE, RESTORE),
("x-image_size", _("Generate image size tokens"), False,
--- 118,122 ----
token store (only dbm and zodb supported so far, zodb has problems,
dbm is untested, hence the default)."""),
! PATH, RESTORE),
("x-image_size", _("Generate image size tokens"), False,
***************
*** 134,137 ****
--- 134,142 ----
BOOLEAN, RESTORE),
+ ("crack_image_cache", _("Cache to speed up ocr."), "",
+ _("""If non-empty, names a file from which to read cached ocr info
+ at start and to which to save that info at exit."""),
+ PATH, RESTORE),
+
("max_image_size", _("Max image size to try OCR-ing"), 100000,
_("""When crack_images is enabled, this specifies the largest
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.44
retrieving revision 1.45
diff -C2 -d -r1.44 -r1.45
*** tokenizer.py 7 Aug 2006 02:47:10 -0000 1.44
--- tokenizer.py 10 Aug 2006 04:07:59 -0000 1.45
***************
*** 1636,1641 ****
if options["Tokenizer", "x-crack_images"]:
! from spambayes.ImageStripper import ImageStripper
! text, tokens = ImageStripper().analyze(parts)
for t in tokens:
yield t
--- 1636,1641 ----
if options["Tokenizer", "x-crack_images"]:
! from spambayes.ImageStripper import crack_images
! text, tokens = crack_images(parts)
for t in tokens:
yield t
From anadelonbrin at users.sourceforge.net Sun Aug 13 04:05:46 2006
From: anadelonbrin at users.sourceforge.net (Tony Meyer)
Date: Sat, 12 Aug 2006 19:05:46 -0700
Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py,1.2,1.3
Message-ID: <20060813020548.AA6721E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31206/spambayes
Modified Files:
dnscache.py
Log Message:
Remove reference to Skip, probably left there by mistake :)
Index: dnscache.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/dnscache.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** dnscache.py 9 Aug 2006 04:26:36 -0000 1.2
--- dnscache.py 13 Aug 2006 02:05:43 -0000 1.3
***************
*** 314,318 ****
def main():
import transaction
! c=cache(cachefile=os.path.expanduser("~skip/.dnscache"))
c.printStatsAtEnd=True
for host in ["www.python.org", "www.timsbloggers.com",
--- 314,318 ----
def main():
import transaction
! c=cache(cachefile=os.path.expanduser("~/.dnscache"))
c.printStatsAtEnd=True
for host in ["www.python.org", "www.timsbloggers.com",
From montanaro at users.sourceforge.net Sun Aug 13 18:27:51 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 13 Aug 2006 09:27:51 -0700
Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py,1.2,1.3
Message-ID: <20060813162754.806071E4002@bag.python.org>
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv18791
Modified Files:
ImageStripper.py
Log Message:
The spammers don't just chop up their GIF images left-to-right. Concatenate
them left-to-right until the height of adjacent images changes, then start a
new row. At the end concatenate the rows top-to-bottom. Add a couple tokens
to mark decode or conversion errors.
The *_decode_parts don't use the class's state, so make them functions instead of methods. Index: ImageStripper.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** ImageStripper.py 10 Aug 2006 04:07:59 -0000 1.2 --- ImageStripper.py 13 Aug 2006 16:27:49 -0000 1.3 *************** *** 83,179 **** return decoders ! def imconcat(im1, im2): ! # concatenate im1 and im2 left-to-right ! w1, h1 = im1.size ! w2, h2 = im2.size ! im3 = Image.new("RGB", (w1+w2, max(h1, h2))) ! im3.paste(im1, (0, 0)) ! im3.paste(im2, (0, w1)) ! return im3 ! class ImageStripper: ! def __init__(self, cachefile=""): ! self.cachefile = os.path.expanduser(cachefile) ! if os.path.exists(self.cachefile): ! self.cache = pickle.load(open(self.cachefile)) ! else: ! self.cache = {} ! self.misses = self.hits = 0 ! if self.cachefile: ! atexit.register(self.close) ! def NetPBM_decode_parts(self, parts, decoders): ! pnmfiles = [] ! for part in parts: ! decoder = decoders.get(part.get_content_type()) ! if decoder is None: ! continue ! try: ! bytes = part.get_payload(decode=True) ! except: ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! fd, imgfile = tempfile.mkstemp() ! os.write(fd, bytes) ! os.close(fd) fd, pnmfile = tempfile.mkstemp() os.close(fd) ! os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) ! pnmfiles.append(pnmfile) ! os.unlink(imgfile) ! if not pnmfiles: ! return ! if len(pnmfiles) > 1: ! if find_program("pnmcat"): ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! os.system("pnmcat -lr %s > %s 2>/dev/null" % ! (" ".join(pnmfiles), pnmfile)) ! for f in pnmfiles: ! os.unlink(f) ! pnmfiles = [pnmfile] ! return pnmfiles ! def PIL_decode_parts(self, parts): ! full_image = None ! for part in parts: ! try: ! bytes = part.get_payload(decode=True) ! except: ! 
continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! continue # assume it's just a picture for now ! # We're dealing with spammers here - who knows what garbage they ! # will call a GIF image to entice you to open it? ! try: ! image = Image.open(StringIO.StringIO(bytes)) ! image.load() ! except IOError: ! continue ! else: ! image = image.convert("RGB") ! if full_image is None: ! full_image = image ! else: ! full_image = imconcat(full_image, image) ! if not full_image: ! return ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! full_image.save(open(pnmfile, "wb"), "PPM") ! return [pnmfile] def extract_ocr_info(self, pnmfiles): --- 83,228 ---- return decoders ! def imconcatlr(left, right): ! """Concatenate two images left to right.""" ! w1, h1 = left.size ! w2, h2 = right.size ! result = Image.new("RGB", (w1 + w2, max(h1, h2))) ! result.paste(left, (0, 0)) ! result.paste(right, (w1, 0)) ! return result ! def imconcattb(upper, lower): ! """Concatenate two images top to bottom.""" ! w1, h1 = upper.size ! w2, h2 = lower.size ! result = Image.new("RGB", (max(w1, w2), h1 + h2)) ! result.paste(upper, (0, 0)) ! result.paste(lower, (0, h1)) ! return result ! def pnmsize(pnmfile): ! """Return dimensions of a PNM file.""" ! f = open(pnmfile) ! line1 = f.readline() ! line2 = f.readline() ! w, h = [int(n) for n in line2.split()] ! return w, h ! def NetPBM_decode_parts(parts, decoders): ! """Decode and assemble a bunch of images using NetPBM tools.""" ! rows = [] ! tokens = Set() ! for part in parts: ! decoder = decoders.get(part.get_content_type()) ! if decoder is None: ! continue ! try: ! bytes = part.get_payload(decode=True) ! except: ! tokens.add("invalid-image:%s" % part.get_content_type()) ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! tokens.add("image:big") ! 
continue # assume it's just a picture for now + fd, imgfile = tempfile.mkstemp() + os.write(fd, bytes) + os.close(fd) + + fd, pnmfile = tempfile.mkstemp() + os.close(fd) + os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile)) + w, h = pnmsize(pnmfile) + if not rows: + # first image + rows.append([pnmfile]) + elif pnmsize(rows[-1][-1])[1] != h: + # new image, different height => start new row + rows.append([pnmfile]) + else: + # new image, same height => extend current row + rows[-1].append(pnmfile) + + for (i, row) in enumerate(rows): + if len(row) > 1: fd, pnmfile = tempfile.mkstemp() os.close(fd) ! os.system("pnmcat -lr %s > %s 2>/dev/null" % ! (" ".join(row), pnmfile)) ! for f in row: ! os.unlink(f) ! rows[i] = pnmfile ! else: ! rows[i] = row[0] ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! os.system("pnmcat -tb %s > %s 2>/dev/null" % (" ".join(rows), pnmfile)) ! for f in rows: ! os.unlink(f) ! return [pnmfile], tokens ! def PIL_decode_parts(parts): ! """Decode and assemble a bunch of images using PIL.""" ! tokens = Set() ! rows = [] ! for part in parts: ! try: ! bytes = part.get_payload(decode=True) ! except: ! tokens.add("invalid-image:%s" % part.get_content_type()) ! continue ! if len(bytes) > options["Tokenizer", "max_image_size"]: ! tokens.add("image:big") ! continue # assume it's just a picture for now ! # We're dealing with spammers and virus writers here. Who knows ! # what garbage they will call a GIF image to entice you to open ! # it? ! try: ! image = Image.open(StringIO.StringIO(bytes)) ! image.load() ! except IOError: ! tokens.add("invalid-image:%s" % part.get_content_type()) ! continue ! else: ! image = image.convert("RGB") ! if not rows: ! # first image ! rows.append(image) ! elif image.size[1] != rows[-1].size[1]: ! # new image, different height => start new row ! rows.append(image) ! else: ! # new image, same height => extend current row ! rows[-1] = imconcatlr(rows[-1], image) ! if not rows: ! return [], tokens ! 
# now concatenate the resulting row images top-to-bottom ! full_image, rows = rows[0], rows[1:] ! for image in rows: ! full_image = imconcattb(full_image, image) ! fd, pnmfile = tempfile.mkstemp() ! os.close(fd) ! full_image.save(open(pnmfile, "wb"), "PPM") ! return [pnmfile], tokens ! class ImageStripper: ! def __init__(self, cachefile=""): ! self.cachefile = os.path.expanduser(cachefile) ! if os.path.exists(self.cachefile): ! self.cache = pickle.load(open(self.cachefile)) ! else: ! self.cache = {} ! self.misses = self.hits = 0 ! if self.cachefile: ! atexit.register(self.close) def extract_ocr_info(self, pnmfiles): *************** *** 217,228 **** if Image is not None: ! pnmfiles = self.PIL_decode_parts(parts) else: ! pnmfiles = self.NetPBM_decode_parts(parts, find_decoders()) if pnmfiles: ! return self.extract_ocr_info(pnmfiles) ! return "", Set() --- 266,280 ---- if Image is not None: ! pnmfiles, tokens = PIL_decode_parts(parts) else: ! if not find_program("pnmcat"): ! return "", Set() ! pnmfiles, tokens = NetPBM_decode_parts(parts, find_decoders()) if pnmfiles: ! text, new_tokens = self.extract_ocr_info(pnmfiles) ! return text, tokens | new_tokens ! return "", tokens From montanaro at users.sourceforge.net Mon Aug 14 04:58:13 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Sun, 13 Aug 2006 19:58:13 -0700 Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, 1.3, 1.4 Options.py, 1.137, 1.138 OptionsClass.py, 1.32, 1.33 Message-ID: <20060814025816.9CCEB1E4003@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv26750/spambayes Modified Files: ImageStripper.py Options.py OptionsClass.py Log Message: Add scale and charset options (ocrad_scale and ocrad_charset, respectively) to pass to the ocrad command. Antonio Diaz Diaz, the author of Ocrad, suggested scaling up the images. Ocrad does indeed seem to perform better with the scaled images. 
Scaling by a factor of two seems to do significantly better than not scaling in my 5x5 N-fold test setup. Scaling by a factor of three might even be better, improving false negative percentages in four of the five sets, but it made the false positive score worse in one of the five sets, so I left the default scale at 2. I added the charset flag as well and defaulted to ascii. So far the spammers seem to be "GIFting" us with plain English, so searching for accented characters seems like it would just distract Ocrad. This has yet to be tested though. Index: ImageStripper.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** ImageStripper.py 13 Aug 2006 16:27:49 -0000 1.3 --- ImageStripper.py 14 Aug 2006 02:58:11 -0000 1.4 *************** *** 232,235 **** --- 232,237 ---- textbits = [] tokens = Set() + scale = options["Tokenizer", "ocrad_scale"] or 1 + charset = options["Tokenizer", "ocrad_charset"] for pnmfile in pnmfiles: fhash = md5.new(open(pnmfile).read()).hexdigest() *************** *** 239,243 **** else: self.misses += 1 ! ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile)) ctext = ocr.read().lower() ocr.close() --- 241,246 ---- else: self.misses += 1 ! ocr = os.popen("ocrad -s %s -c %s -x %s < %s 2>/dev/null" % ! (scale, charset, orf, pnmfile)) ctext = ocr.read().lower() ocr.close() Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.137 retrieving revision 1.138 diff -C2 -d -r1.137 -r1.138 *** Options.py 10 Aug 2006 04:07:59 -0000 1.137 --- Options.py 14 Aug 2006 02:58:11 -0000 1.138 *************** *** 139,142 **** --- 139,154 ---- PATH, RESTORE), + ("ocrad_scale", _("Scale factor to use with ocrad."), 2, + _("""Specifies the scale factor to apply when running ocrad. 
While + you can specify a negative scale it probably won't help. Scaling up + by a factor of 2 or 3 seems to work well for the sort of spam images + encountered by SpamBayes."""), + INTEGER, RESTORE), + + ("ocrad_charset", _("Charset to apply with ocrad."), "ascii", + _("""Specifies the charset to use when running ocrad. Valid values + are 'ascii', 'iso-8859-9' and 'iso-8859-15'."""), + OCRAD_CHARSET, RESTORE), + ("max_image_size", _("Max image size to try OCR-ing"), 100000, _("""When crack_images is enabled, this specifies the largest Index: OptionsClass.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** OptionsClass.py 22 Jun 2006 10:36:58 -0000 1.32 --- OptionsClass.py 14 Aug 2006 02:58:11 -0000 1.33 *************** *** 119,122 **** --- 119,123 ---- 'IMAP_FOLDER', 'IMAP_ASTRING', 'RESTORE', 'DO_NOT_RESTORE', 'IP_LIST', + 'OCRAD_CHARSET', ] *************** *** 871,872 **** --- 872,875 ---- RESTORE = True DO_NOT_RESTORE = False + + OCRAD_CHARSET = r"ascii|iso-8859-9|iso-8859-15" From montanaro at users.sourceforge.net Fri Aug 18 04:29:05 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Thu, 17 Aug 2006 19:29:05 -0700 Subject: [Spambayes-checkins] spambayes/contrib pycksum.py,1.1,1.2 Message-ID: <20060818022907.D10021E4004@bag.python.org> Update of /cvsroot/spambayes/spambayes/contrib In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv16513 Modified Files: pycksum.py Log Message: * Try to improve the duplicate detection capability. Lots of spam nowadays has random text junk, so be more lenient about how many chunks have to match. 
Also do a little more filtering on the source: - Compress multiple spaces and tabs to a single space - Compress multiple contiguous newlines into one - Map all strings of digits to a single "#" character - Map some common html entities to their plain text equivalents. * Use md5 checksum hexdigests instead of binascii.b2a_hex. * Correct line breaking of filtered body. * Use email.generator to flatten body instead of the broken flatten() function. Index: pycksum.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/contrib/pycksum.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** pycksum.py 25 May 2004 14:58:39 -0000 1.1 --- pycksum.py 18 Aug 2006 02:29:02 -0000 1.2 *************** *** 39,60 **** import sys import email.Parser import md5 import anydbm import re import time ! import binascii ! ! def flatten(body): ! # three types are possible: list, string, Message ! if isinstance(body, str): ! return body ! if hasattr(body, "get_payload"): ! payload = body.get_payload() ! if payload is None: ! return "" ! return flatten(payload) ! if isinstance(body, list): ! return "\n".join([flatten(b) for b in body]) ! raise TypeError, ("unrecognized body type: %s" % type(body)) def clean(data): --- 39,51 ---- import sys import email.Parser + import email.generator import md5 import anydbm import re import time ! try: ! import cStringIO as StringIO ! except ImportError: ! import StringIO def clean(data): *************** *** 67,74 **** data = re.sub(r"<[^>]*>", "", data).lower() # delete anything which looks like a url or email address # not sure what a pmguid: url is but it seems to occur frequently in spam # also convert all runs of whitespace into a single space ! 
return " ".join([w for w in data.split()
if ('@' not in w and
(':' not in w or
--- 58,78 ----
data = re.sub(r"<[^>]*>", "", data).lower()
+ # Map all digits to '#'
+ data = re.sub(r"[0-9]+", "#", data)
+
+ # Map a few common html entities
+ data = re.sub(r"(&nbsp;)+", " ", data)
+ data = re.sub(r"&lt;", "<", data)
+ data = re.sub(r"&gt;", ">", data)
+ data = re.sub(r"&amp;", "&", data)
+
+ # Elide blank lines and multiple horizontal whitespace
+ data = re.sub(r"\n+", "\n", data)
+ data = re.sub(r"[ \t]+", " ", data)
+
# delete anything which looks like a url or email address
# not sure what a pmguid: url is but it seems to occur frequently in spam
# also convert all runs of whitespace into a single space
! return " ".join([w for w in data.split(" ")
if ('@' not in w and
(':' not in w or
***************
*** 87,97 ****
# separately or in various combinations if desired.
! body = flatten(msg)
! lines = clean(body)
chunksize = len(lines)//4+1
sum = []
for i in range(4):
chunk = "\n".join(lines[i*chunksize:(i+1)*chunksize])
! sum.append(binascii.b2a_hex(md5.new(chunk).digest()))
return ".".join(sum)
--- 91,105 ----
# separately or in various combinations if desired.
! fp = StringIO.StringIO()
! g = email.generator.Generator(fp, mangle_from_=False, maxheaderlen=60)
! g.flatten(msg)
! text = fp.getvalue()
! body = text.split("\n\n", 1)[1]
! lines = clean(body).split("\n")
chunksize = len(lines)//4+1
sum = []
for i in range(4):
chunk = "\n".join(lines[i*chunksize:(i+1)*chunksize])
! sum.append(md5.new(chunk).hexdigest())
return ".".join(sum)
***************
*** 102,111 ****
db = anydbm.open(f, "c")
maxdblen = 2**14
! # consider the first three pieces, the last three pieces and the middle
! # two pieces - one or more will likely eliminate attempts at disrupting
! # the checksum - if any are found in the db file, call it a match
! for subsum in (".".join(pieces[:-1]), ".".join(pieces[1:-1]),
! ".".join(pieces[1:])):
if not db.has_key(subsum):
db[subsum] = str(time.time())
--- 110,119 ----
db = anydbm.open(f, "c")
maxdblen = 2**14
! # consider the first two pieces, the middle two pieces and the last two
! # pieces - one or more will likely eliminate attempts at disrupting the
! # checksum - if any are found in the db file, call it a match
! for subsum in (".".join(pieces[:-2]), ".".join(pieces[1:-1]),
! ".".join(pieces[2:])):
if not db.has_key(subsum):
db[subsum] = str(time.time())
***************
*** 155,157 ****
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
-
--- 163,164 ----
From montanaro at users.sourceforge.net Fri Aug 18 19:26:52 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Fri, 18 Aug 2006 10:26:52 -0700
Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.54,1.55
Message-ID: <20060818172655.DF0ED1E4004@bag.python.org>
Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv9126
Modified Files:
CHANGELOG.txt
Log Message:
I hope this doesn't break any scripts or irritate anyone too much,
however... Just as mm/dd/yyyy format looks strange to non-US folks,
dd/mm/yyyy looks just as strange to us cowboy types. Compromise on
ISO-8601 dates. They sort, they're unambiguous, and they probably piss
off both camps equally well. ;-)
Index: CHANGELOG.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v
retrieving revision 1.54
retrieving revision 1.55
diff -C2 -d -r1.54 -r1.55
*** CHANGELOG.txt 7 Apr 2006 02:37:28 -0000 1.54
--- CHANGELOG.txt 18 Aug 2006 17:26:50 -0000 1.55
***************
*** 1,231 ****
! [Note that all dates are in English, not American format - i.e. day/month/year]
Release 1.1a2
=============
! Tony Meyer 03/04/2006 Add [ 1081787 ] Adding the version only to sb_filter.py
! Tony Meyer 03/04/2006 Fix [ 1383801 ] trustedIPs wildcard to regex broken
! 
Tony Meyer 02/04/2006 Fix [ 1387699 ] train_on_filter=True needs the db to be opened read/write ! Tony Meyer 02/04/2006 Fix [ 1387709 ] If globals:dbm_type is non-default, then don't use whichdb. ! Tony Meyer 27/11/2005 Install the conversion utility and offer to run it on Windows install. ! Tony Meyer 26/11/2005 Add conversion utility to easily convert dbm to ZODB. [...1933 lines suppressed...] ! Tim Stone 2003-02-25 Add option for pop3proxy to notate Subject: header. ! Tony Meyer 2003-02-25 Fix bug in Corpus.get() which would never return the default value. ! Mark Hammond 2003-02-18 "Store Outlook plugin files in the ""correct"" Windows directory." ! Neil Schemenauer 2003-02-16 Add -c and -d options to mailsort.py. ! Neil Schemenauer 2003-02-16 First check-in of dump_cdb.py ! Mark Hammond 2003-02-13 Add SF#685746 ('Outlook plugin folder list sorted alphabetically'). ! Mark Hammond 2003-02-13 Handle exceptions when opening folders in Outlook plugin better. ! Skip Montanaro 2003-02-13 Split BAYESCUSTOMIZE environment variable using os.pathsep. ! Mark Hammond 2003-02-12 Check for correct exception when removing file in Outlook addin. ! Mark Hammond 2003-02-12 Check for bsddb3 before bsddb (previously bsddb3 would never be found). ! Tim Stone 2003-02-10 Changed BAYESCUSTOMIZE environment variable parsing from a split to a regex to fix filenames with embedded spaces. ! Tim Stone 2003-02-08 Ensure that nham and nspam are instances of integer in dbExpImp.py ! Tim Stone 2003-02-08 Ensure that nham and nspam becoming strings doesn't break classification. ! Tim Stone 2003-02-08 Added ability to put classification in subject or to headers (for OE). ! Mark Hammond 2003-02-07 Fix some errors using bsddb3 in Outlook plugin. ! Mark Hammond 2003-02-04 "Fix SF#642740 ('""Recover from Spam"" wrong folder')." ! Mark Hammond 2003-02-03 Change train.py to be able to work with a bsddb database. ! 
Mark Hammond 2003-02-03 If a new bsddb or bsddb3 module is available use this instead of a pickle in the Outlook plugin. ! Mark Hammond 2003-02-03 Add tick-marks to the filter dialog. ! Mark Hammond 2003-02-03 Fix SF#677804 ('Untouched filter command error'). From montanaro at users.sourceforge.net Fri Aug 18 19:42:39 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 10:42:39 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.55,1.56 Message-ID: <20060818174242.264BB1E400D@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv16016 Modified Files: CHANGELOG.txt Log Message: Add my recent changes to changelog Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.55 retrieving revision 1.56 diff -C2 -d -r1.55 -r1.56 *** CHANGELOG.txt 18 Aug 2006 17:26:50 -0000 1.55 --- CHANGELOG.txt 18 Aug 2006 17:42:37 -0000 1.56 *************** *** 1,4 **** --- 1,23 ---- [Note that all dates are in ISO 8601 format, e.g. 
YYYY-MM-DD to ease sorting] + Release 1.1a3 + ============= + + Skip Montanaro 2006-08-18 Update pycksum.py to try and identify more duplicates + Skip Montanaro 2006-08-14 Add scale and charset options to ImageStripper + Skip Montanaro 2006-08-13 Stitch spam images back together properly, add a couple more tokens + Skip Montanaro 2006-08-10 Add support for PIL to ImageStripper.py + Skip Montanaro 2006-08-09 Cache x-lookup_ip in a pickle instead of trying to use anydbm or zodb + Skip Montanaro 2006-08-06 Add crude OCR capability to try and parse image-based spam using Ocrad & NetPBM + Skip Montanaro 2006-08-06 Add x-short_runs option + Skip Montanaro 2006-08-06 Add x-image_size option & corresponding token + Skip Montanaro 2006-08-06 Add Matt Cowles' x-lookup_ip extension w/ slight modifications + Skip Montanaro 2006-08-06 Add profiling using cProfile (if available) to sb_filter.py + Skip Montanaro 2006-08-06 Delete -d and -p flags from spamcounts.py + Skip Montanaro 2006-08-06 Refactor basic text tokenizing out of tokenize_body into a separate method, tokenize_text + Skip Montanaro 2006-08-05 Explicitly close ZODB store in tte.py + Skip Montanaro 2006-04-23 Reduce sensitivity of spamcounts.py to classifier changes + + Release 1.1a2 ============= From montanaro at users.sourceforge.net Sat Aug 19 02:26:40 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 17:26:40 -0700 Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.56,1.57 Message-ID: <20060819002643.4F05C1E400C@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv14309 Modified Files: CHANGELOG.txt Log Message: Add other recent changelog bits Index: CHANGELOG.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v retrieving revision 1.56 retrieving revision 1.57 diff -C2 -d -r1.56 -r1.57 *** CHANGELOG.txt 18 Aug 2006 17:42:37 -0000 1.56 --- 
CHANGELOG.txt 19 Aug 2006 00:26:38 -0000 1.57 *************** *** 17,21 **** --- 17,26 ---- Skip Montanaro 2006-08-06 Refactor basic text tokenizing out of tokenize_body into a separate method, tokenize_text Skip Montanaro 2006-08-05 Explicitly close ZODB store in tte.py + Tony Meyer 2006-06-22 Fix bug in regex preventing valid IPs + Toby Dickenson 2006-06-12 Suppress spurious duplicate From_ lines in sb_bnfilter.py + Tony Meyer 2006-06-10 Add simple parts of [ 824651 ] Multibyte message support + Tony Meyer 2006-05-06 Enable -o command line option setting, and follow TestDriver directories in testtools/mksets.py Skip Montanaro 2006-04-23 Reduce sensitivity of spamcounts.py to classifier changes + Tony Meyer 2006-04-22 Set zodb cache size to 10,000 From montanaro at users.sourceforge.net Sat Aug 19 02:37:55 2006 From: montanaro at users.sourceforge.net (Skip Montanaro) Date: Fri, 18 Aug 2006 17:37:55 -0700 Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.40,1.41 Message-ID: <20060819003757.B88791E4006@bag.python.org> Update of /cvsroot/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv18704 Modified Files: WHAT_IS_NEW.txt Log Message: Update for 1.1a3 Index: WHAT_IS_NEW.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** WHAT_IS_NEW.txt 27 Nov 2005 02:15:33 -0000 1.40 --- WHAT_IS_NEW.txt 19 Aug 2006 00:37:52 -0000 1.41 *************** *** 16,19 **** --- 16,88 ---- is released. + New in 1.1 Alpha 3 + ================== + + + -------------------------------------------- + ** Incompatible changes and Transitioning ** + -------------------------------------------- + + There should be no incompatible changes since 1.1a2, though users new to the + 1.1 series should pay careful attention to the database changes introduced + in 1.1a2. 
+ + + ------------------- + ** Other changes ** + ------------------- + + General + ------- + + Reported Bugs Fixed + =================== + No bugs tracked via the Sourceforge system were fixed. + + + Patches integrated + =================== + The following patches tracked via the Sourceforge system were integrated + in this release: + 824651 + + Feature Requests Added + ====================== + No feature requests tracked via the Sourceforge system were added + in this release. + + + Experimental Options + ==================== + + In addition to the experimental options listed for the 1.1a2 release, four + more new experimental options were added to SpamBayes. They all need + further testing. + + o x-short_runs - If true, generate tokens based on max number of short + word runs. Short words are anything of length < the skip_max_word_size + option. Normally they are skipped, but one common spam technique spells + words like 'V m I n A o G p RA' to try and avoid exposing them to + content filters. + + o x-lookup_ip - If true, generate IP address tokens from hostnames. This + requires PyDNS (http://pydns.sourceforge.net/). + + o x-image_size - If true, generate tokens based on the size of the largest + attached image. + + o x-crack_images - A lot of recent spam contains the entire message + embedded in one or more attached images. This option, if true, + generates tokens based on the (hopefully) text content contained in any + images in each message. The current support is minimal, relies on the + installation of ocrad (http://www.gnu.org/software/ocrad/ocrad.html) and + the Python Imaging Library (a.k.a. PIL, available at + http://www.pythonware.com/products/pil/). It has not yet been tested on + Windows, but for brave souls there is a simple zip file binary of ocrad + called "ocrad-cygwin" on the SpamBayes download page for Windows users + who can't build it themselves. PIL has its own Windows binary + installers specific to versions of Python as far back as 2.1. 
+ + New in 1.1 Alpha 2 ================== From mhammond at users.sourceforge.net Thu Aug 24 14:42:03 2006 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Thu, 24 Aug 2006 05:42:03 -0700 Subject: [Spambayes-checkins] spambayes/spambayes __init__.py,1.18,1.19 Message-ID: <20060824124205.F40CC1E400A@bag.python.org> Update of /cvsroot/spambayes/spambayes/spambayes In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv23537/spambayes Modified Files: __init__.py Log Message: Version 1.1a3 Index: __init__.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/__init__.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** __init__.py 20 Apr 2006 03:13:26 -0000 1.18 --- __init__.py 24 Aug 2006 12:41:57 -0000 1.19 *************** *** 6,9 **** _ = lambda arg: arg ! __version__ = "1.1a2" ! __date__ = _("April 2005") --- 6,9 ---- _ = lambda arg: arg ! __version__ = "1.1a3" ! __date__ = _("August 2006") From mhammond at users.sourceforge.net Thu Aug 24 14:45:46 2006 From: mhammond at users.sourceforge.net (Mark Hammond) Date: Thu, 24 Aug 2006 05:45:46 -0700 Subject: [Spambayes-checkins] spambayes/windows pop3proxy_tray.py, 1.24, 1.25 Message-ID: <20060824124548.E46E61E4005@bag.python.org> Update of /cvsroot/spambayes/spambayes/windows In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv24833/windows Modified Files: pop3proxy_tray.py Log Message: re-add the taskbar icon in the case of explorer crashing and restarting Index: pop3proxy_tray.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/windows/pop3proxy_tray.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** pop3proxy_tray.py 29 Mar 2005 05:59:25 -0000 1.24 --- pop3proxy_tray.py 24 Aug 2006 12:45:42 -0000 1.25 *************** *** 144,148 **** --- 144,150 ---- 1099 : ("Exit SpamBayes", self.OnExit), } + msg_TaskbarRestart = 
RegisterWindowMessage("TaskbarCreated"); message_map = { + msg_TaskbarRestart: self.OnTaskbarRestart, win32con.WM_DESTROY: self.OnDestroy, win32con.WM_COMMAND: self.OnCommand, *************** *** 188,195 **** 16, 16, icon_flags) ! flags = NIF_ICON | NIF_MESSAGE | NIF_TIP ! nid = (self.hwnd, 0, flags, WM_TASKBAR_NOTIFY, self.hstartedicon, ! "SpamBayes") ! Shell_NotifyIcon(NIM_ADD, nid) self.started = IsServerRunningAnywhere() self.tip = None --- 190,194 ---- 16, 16, icon_flags) ! self._AddTaskbarIcon() self.started = IsServerRunningAnywhere() self.tip = None *************** *** 205,208 **** --- 204,221 ---- "a local server" + def _AddTaskbarIcon(self): + flags = NIF_ICON | NIF_MESSAGE | NIF_TIP + nid = (self.hwnd, 0, flags, WM_TASKBAR_NOTIFY, self.hstartedicon, + "SpamBayes") + try: + Shell_NotifyIcon(NIM_ADD, nid) + except win32api_error: + # Apparently can be seen as XP is starting up. Certainly can + # be seen if explorer.exe is not running when started. + print "Ignoring error adding taskbar icon - explorer may not " \ + "be running (yet)." 
+ # The TaskbarRestart message will fire in this case, and
+ # everything will work out :)
+
def BuildToolTip(self):
tip = None
***************
*** 394,397 ****
--- 407,415 ----
function()
+ def OnTaskbarRestart(self, hwnd, msg, wparam, lparam):
+ # Called as the taskbar is created (either as Windows starts, or
+ # as Windows recovers from a crashed explorer.exe)
+ self._AddTaskbarIcon()
+
def OnExit(self):
if self.started and not self.use_service:
From mhammond at users.sourceforge.net Thu Aug 24 15:18:34 2006
From: mhammond at users.sourceforge.net (Mark Hammond)
Date: Thu, 24 Aug 2006 06:18:34 -0700
Subject: [Spambayes-checkins] spambayes/windows/py2exe setup_all.py, 1.26, 1.27
Message-ID: <20060824131835.EB71E1E4005@bag.python.org>
Update of /cvsroot/spambayes/spambayes/windows/py2exe
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540
Modified Files:
setup_all.py
Log Message:
Ship with PIL (but no Tkinter) and pyDNS
Index: setup_all.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/windows/py2exe/setup_all.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** setup_all.py 28 Feb 2006 08:11:40 -0000 1.26
--- setup_all.py 24 Aug 2006 13:18:32 -0000 1.27
***************
*** 47,54 ****
"spambayes.languages.fr,spambayes.languages.es.DIALOGS," \
"spambayes.languages.es_AR.DIALOGS," \
! "spambayes.languages.fr.DIALOGS",
! excludes = "win32ui,pywin,pywin.debugger", # pywin is a package, and still seems to be included.
! includes = "dialogs.resources.dialogs,weakref", # Outlook dynamic dialogs
! dll_excludes = "dapi.dll,mapi32.dll",
typelibs = [
('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0),
--- 47,61 ----
"spambayes.languages.fr,spambayes.languages.es.DIALOGS," \
"spambayes.languages.es_AR.DIALOGS," \
! "spambayes.languages.fr.DIALOGS," \
! "PIL",
! excludes = "Tkinter," # side-effect of PIL and markh doesn't have it :)
! "win32ui,pywin,pywin.debugger," # *sob* - these still appear
! # Keep zope out else outlook users lose training.
! # (sob - but some of these may still appear!)
! "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence",
! includes = "dialogs.resources.dialogs,weakref," # Outlook dynamic dialogs
! "BmpImagePlugin,JpegImagePlugin", # PIL modules not auto found
! dll_excludes = "dapi.dll,mapi32.dll,"
! "tk84.dll,tcl84.dll", # No Tkinter == no tk/tcl dlls
typelibs = [
('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0),
From anadelonbrin at users.sourceforge.net Fri Aug 25 02:43:30 2006
From: anadelonbrin at users.sourceforge.net (Tony Meyer)
Date: Thu, 24 Aug 2006 17:43:30 -0700
Subject: [Spambayes-checkins] spambayes/windows spambayes.iss,1.25,1.26
Message-ID: <20060825004333.172E51E4004@bag.python.org>
Update of /cvsroot/spambayes/spambayes/windows
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32424/windows
Modified Files:
spambayes.iss
Log Message:
Bump version number.
For 1.1a3 at least, include ocrad.exe and the patch required to build it.
Display license. Maybe binary users aren't aware that this gets installed, and
so this might get rid of some of the "can I do X with spambayes" queries. For 1.1a3
at least, it also clarifies where ocrad comes from.
Fix typo.
Index: spambayes.iss
===================================================================
RCS file: /cvsroot/spambayes/spambayes/windows/spambayes.iss,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** spambayes.iss 27 Nov 2005 00:42:11 -0000 1.25
--- spambayes.iss 25 Aug 2006 00:43:28 -0000 1.26
***************
*** 5,11 ****
[Setup]
; Version specific constants
! AppVerName=SpamBayes 1.1a1
! AppVersion=1.1a1
! OutputBaseFilename=spambayes-1.1a1
; Normal constants. Be careful about changing 'AppName'
AppName=SpamBayes
--- 5,11 ----
[Setup]
; Version specific constants
! AppVerName=SpamBayes 1.1a3
! AppVersion=1.1a3
! OutputBaseFilename=spambayes-1.1a3
; Normal constants. Be careful about changing 'AppName'
AppName=SpamBayes
***************
*** 15,18 ****
--- 15,19 ----
ShowComponentSizes=no
UninstallDisplayIcon={app}\sbicon.ico
+ LicenseFile=py2exe\dist\license.txt
[Files]
***************
*** 51,54 ****
--- 52,59 ----
Source: "py2exe\dist\bin\convert_database.exe"; DestDir: "{app}\bin"; Flags: ignoreversion
+ ; Include ocrad.exe and the patch required to get it to compile for Windows.
+ Source: "py2exe\ocrad.exe"; DestDir: "{app}\bin"; Flags: ignoreversion
+ Source: "py2exe\ocrad.patch"; DestDir: "{app}\docs"; Flags: ignoreversion
+
; There is a problem attempting to get Inno to unregister our DLL. If we mark our DLL
; as 'regserver', it installs and registers OK, but at uninstall time, it unregisters
***************
*** 90,94 ****
InstallOutlook, InstallProxy, InstallIMAP: Boolean;
WarnedNoOutlook, WarnedBoth : Boolean;
! startup, desktop, allusers, startup_imap : Boolean; // Tasks
function InstallingOutlook() : Boolean;
--- 95,99 ----
InstallOutlook, InstallProxy, InstallIMAP: Boolean;
WarnedNoOutlook, WarnedBoth : Boolean;
! startup, desktop, allusers, startup_imap, convert_db : Boolean; // Tasks
function InstallingOutlook() : Boolean;
From montanaro at users.sourceforge.net Fri Aug 25 04:02:16 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Thu, 24 Aug 2006 19:02:16 -0700
Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.41,1.42
Message-ID: <20060825020218.5E98D1E4007@bag.python.org>
Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv3443
Modified Files:
WHAT_IS_NEW.txt
Log Message:
Slight update.
Index: WHAT_IS_NEW.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** WHAT_IS_NEW.txt 19 Aug 2006 00:37:52 -0000 1.41
--- WHAT_IS_NEW.txt 25 Aug 2006 02:02:12 -0000 1.42
***************
*** 67,74 ****
o x-lookup_ip - If true, generate IP address tokens from hostnames. This
! requires PyDNS (http://pydns.sourceforge.net/).
o x-image_size - If true, generate tokens based on the size of the largest
! attached image.
o x-crack_images - A lot of recent spam contains the entire message
--- 67,75 ----
o x-lookup_ip - If true, generate IP address tokens from hostnames. This
! requires PyDNS (http://pydns.sourceforge.net/). This is included in the
! Windows installer.
o x-image_size - If true, generate tokens based on the size of the largest
! attached image.
o x-crack_images - A lot of recent spam contains the entire message
***************
*** 79,86 ****
the Python Imaging Library (a.k.a. PIL, available at
http://www.pythonware.com/products/pil/). It has not yet been tested on
! Windows, but for brave souls there is a simple zip file binary of ocrad
! called "ocrad-cygwin" on the SpamBayes download page for Windows users
! who can't build it themselves. PIL has its own Windows binary
! installers specific to versions of Python as far back as 2.1.
--- 80,84 ----
the Python Imaging Library (a.k.a. PIL, available at
http://www.pythonware.com/products/pil/). It has not yet been tested on
! Windows, but is available in the Windows installer (as is PIL).