From montanaro at users.sourceforge.net  Sat Aug  5 14:48:11 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sat, 05 Aug 2006 05:48:11 -0700
Subject: [Spambayes-checkins] spambayes/contrib tte.py,1.16,1.17
Message-ID: <20060805124814.1F7351E4003@bag.python.org>

Update of /cvsroot/spambayes/spambayes/contrib
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv28933/contrib

Modified Files:
	tte.py 
Log Message:
close the store - that's the ticket

Index: tte.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/contrib/tte.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** tte.py	19 Apr 2005 11:15:12 -0000	1.16
--- tte.py	5 Aug 2006 12:48:09 -0000	1.17
***************
*** 260,264 ****
            sh_ratio)
  
!     store.store()
  
      if cullext is not None:
--- 260,264 ----
            sh_ratio)
  
!     store.close()
  
      if cullext is not None:


From montanaro at users.sourceforge.net  Sun Aug  6 03:19:37 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sat, 05 Aug 2006 18:19:37 -0700
Subject: [Spambayes-checkins] spambayes/contrib spamcounts.py,1.7,1.8
Message-ID: <20060806011939.E72631E4003@bag.python.org>

Update of /cvsroot/spambayes/spambayes/contrib
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv5191

Modified Files:
	spamcounts.py 
Log Message:
Dump the -d and -p flags in favor of the more general -o flag.


Index: spamcounts.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/contrib/spamcounts.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** spamcounts.py	23 Apr 2006 22:30:46 -0000	1.7
--- spamcounts.py	6 Aug 2006 01:19:35 -0000	1.8
***************
*** 2,15 ****
  
  """
! Check spamcounts for various tokens or patterns
  
! usage %(prog)s [ -h ] [ -r ] [ -d db ] [ -p ] [ -t ] ...
  
  -h    - print this documentation and exit.
  -r    - treat tokens as regular expressions - may not be used with -t
- -d db - use db instead of the default found in the options file
- -p    - db is actually a pickle
  -t    - read message from stdin, tokenize it, then display their counts
          may not be used with -r
  """
  
--- 2,15 ----
  
  """
! Check spamcounts for one or more tokens or patterns
  
! usage %(prog)s [ options ] token ...
  
  -h    - print this documentation and exit.
  -r    - treat tokens as regular expressions - may not be used with -t
  -t    - read message from stdin, tokenize it, then display their counts
          may not be used with -r
+ -o section:option:value
+       - set [section, option] in the options database to value
  """
  
***************
*** 64,70 ****
  def main(args):
      try:
!         opts, args = getopt.getopt(args, "hrd:t",
!                                    ["help", "re", "database=", "pickle",
!                                     "tokenize"])
      except getopt.GetoptError, msg:
          usage(msg)
--- 64,69 ----
  def main(args):
      try:
!         opts, args = getopt.getopt(args, "hrto:",
!                                    ["help", "re", "tokenize", "option="])
      except getopt.GetoptError, msg:
          usage(msg)
***************
*** 72,77 ****
  
      usere = False
-     dbname = get_pathname_option("Storage", "persistent_storage_file")
-     ispickle = not options["Storage", "persistent_use_database"]
      tokenizestdin = False
      for opt, arg in opts:
--- 71,74 ----
***************
*** 79,90 ****
              usage()
              return 0
-         elif opt in ("-d", "--database"):
-             dbname = arg
          elif opt in ("-r", "--re"):
              usere = True
-         elif opt in ("-p", "--pickle"):
-             ispickle = True
          elif opt in ("-t", "--tokenize"):
              tokenizestdin = True
  
      if usere and tokenizestdin:
--- 76,85 ----
              usage()
              return 0
          elif opt in ("-r", "--re"):
              usere = True
          elif opt in ("-t", "--tokenize"):
              tokenizestdin = True
+         elif opt in ('-o', '--option'):
+             options.set_from_cmdline(arg, sys.stderr)
  
      if usere and tokenizestdin:

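The -o flag above takes one section:option:value argument. Here is a minimal Python 3 sketch of how such an argument could be handled alongside getopt. `split_option_arg` and `parse_args` are hypothetical helpers written for illustration. They are not the `options.set_from_cmdline` implementation, whose parsing rules this diff does not show, and -h handling is left out.

```python
import getopt

def split_option_arg(arg):
    """Split 'section:option:value' into its three parts.

    Hypothetical helper: only the first two colons are treated as
    separators, so the value itself may contain colons.
    """
    parts = arg.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected section:option:value, got %r" % arg)
    return tuple(parts)

def parse_args(argv):
    """Return (usere, tokenizestdin, overrides, remaining_tokens)."""
    opts, args = getopt.getopt(argv, "hrto:",
                               ["help", "re", "tokenize", "option="])
    usere = tokenizestdin = False
    overrides = []
    for opt, arg in opts:
        if opt in ("-r", "--re"):
            usere = True
        elif opt in ("-t", "--tokenize"):
            tokenizestdin = True
        elif opt in ("-o", "--option"):
            # Collect the override instead of applying it to a real
            # options database.
            overrides.append(split_option_arg(arg))
    return usere, tokenizestdin, overrides, args
```

Judging by the removed code, the old "-d db" presumably corresponds to "-o Storage:persistent_storage_file:db", and the old -p to an override of Storage:persistent_use_database.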

From montanaro at users.sourceforge.net  Sun Aug  6 16:50:32 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 07:50:32 -0700
Subject: [Spambayes-checkins] spambayes/scripts sb_filter.py,1.19,1.20
Message-ID: <20060806145034.662C91E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/scripts
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv19924

Modified Files:
	sb_filter.py 
Log Message:
Run under control of the new cProfile profiler, if it's available.  I found
this useful to help identify where SB spends its time while training.


Index: sb_filter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/scripts/sb_filter.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** sb_filter.py	7 Apr 2006 02:25:25 -0000	1.19
--- sb_filter.py	6 Aug 2006 14:50:29 -0000	1.20
***************
*** 47,50 ****
--- 47,53 ----
          set [section, option] in the options database to value
  
+     -P
+         Run under control of the Python profiler, if it is available
+ 
  All options marked with '*' operate on stdin, and write the resultant
  message to stdout.
***************
*** 211,220 ****
          self.h.store()
  
! def main():
      h = HammieFilter()
      actions = []
!     opts, args = getopt.getopt(sys.argv[1:], 'hvxd:p:nfgstGSo:',
                                 ['help', 'version', 'examples', 'option='])
      create_newdb = False
      for opt, arg in opts:
          if opt in ('-h', '--help'):
--- 214,224 ----
          self.h.store()
  
! def main(profiling=False):
      h = HammieFilter()
      actions = []
!     opts, args = getopt.getopt(sys.argv[1:], 'hvxd:p:nfgstGSo:P',
                                 ['help', 'version', 'examples', 'option='])
      create_newdb = False
+     do_profile = False
      for opt, arg in opts:
          if opt in ('-h', '--help'):
***************
*** 238,241 ****
--- 242,254 ----
          elif opt == '-S':
              actions.append(h.untrain_spam)
+         elif opt == '-P':
+             do_profile = True
+             if not profiling:
+                 try:
+                     import cProfile
+                 except ImportError:
+                     pass
+                 else:
+                     return cProfile.run("main(True)")
          elif opt == "-n":
              create_newdb = True

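The -P handling above falls back to a normal run when cProfile cannot be imported. A minimal Python 3 sketch of the same "profile if available" idea follows. `run_maybe_profiled` is a hypothetical helper, and it uses `Profile.runcall` rather than the `cProfile.run("main(True)")` re-entry used in the commit.

```python
def run_maybe_profiled(func, *args):
    """Call func(*args), under cProfile when the module is importable."""
    try:
        import cProfile
    except ImportError:
        # No profiler available: just run the function.
        return func(*args)
    profiler = cProfile.Profile()
    try:
        return profiler.runcall(func, *args)
    finally:
        # Print the collected statistics, most expensive call chains first.
        profiler.print_stats(sort="cumulative")
```

Calling the function through `runcall` avoids the `profiling` flag that the committed code needs to keep main() from profiling itself recursively.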

From montanaro at users.sourceforge.net  Sun Aug  6 18:14:20 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:14:20 -0700
Subject: [Spambayes-checkins] spambayes/spambayes Options.py,1.131,1.132
Message-ID: <20060806161422.065C21E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv23311/spambayes

Modified Files:
	Options.py 
Log Message:
slight reformat, doc tweak

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.131
retrieving revision 1.132
diff -C2 -d -r1.131 -r1.132
*** Options.py	27 Nov 2005 22:05:45 -0000	1.131
--- Options.py	6 Aug 2006 16:14:17 -0000	1.132
***************
*** 134,144 ****
       BOOLEAN, RESTORE),
  
!     ("address_headers", _("Address headers to mine"), ("from", "to", "cc", "sender", "reply-to"),
       _("""Mine the following address headers. If you have mixed source
       corpuses (as opposed to a mixed sauce walrus, which is delicious!)
       then you probably don't want to use 'to' or 'cc') Address headers will
       be decoded, and will generate charset tokens as well as the real
!      address.  Others to consider: to, cc, reply-to, errors-to, sender,
!      ..."""),
       HEADER_NAME, RESTORE),
  
--- 134,144 ----
       BOOLEAN, RESTORE),
  
!     ("address_headers", _("Address headers to mine"), ("from", "to", "cc",
!                                                        "sender", "reply-to"),
       _("""Mine the following address headers. If you have mixed source
       corpuses (as opposed to a mixed sauce walrus, which is delicious!)
       then you probably don't want to use 'to' or 'cc') Address headers will
       be decoded, and will generate charset tokens as well as the real
!      address.  Others to consider: errors-to, ..."""),
       HEADER_NAME, RESTORE),
  

From montanaro at users.sourceforge.net  Sun Aug  6 18:19:21 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:19:21 -0700
Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.37,1.38
Message-ID: <20060806161923.4FFAF1E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv25513

Modified Files:
	tokenizer.py 
Log Message:
Break basic text tokenizing out into its own method in preparation for some
other changes.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** tokenizer.py	15 Nov 2005 00:16:20 -0000	1.37
--- tokenizer.py	6 Aug 2006 16:19:19 -0000	1.38
***************
*** 1528,1533 ****
                      yield "noheader:" + k
  
!     def tokenize_body(self, msg, maxword=options["Tokenizer",
!                                                  "skip_max_word_size"]):
          """Generate a stream of tokens from an email Message.
  
--- 1528,1545 ----
                      yield "noheader:" + k
  
!     def tokenize_text(self, text, maxword=options["Tokenizer",
!                                                   "skip_max_word_size"]):
!         """Tokenize everything in the chunk of text we were handed."""
!         for w in text.split():
!             n = len(w)
!             # Make sure this range matches in tokenize_word().
!             if 3 <= n <= maxword:
!                 yield w
! 
!             elif n >= 3:
!                 for t in tokenize_word(w):
!                     yield t
! 
!     def tokenize_body(self, msg):
          """Generate a stream of tokens from an email Message.
  
***************
*** 1606,1619 ****
              text = html_re.sub('', text)
  
!             # Tokenize everything in the body.
!             for w in text.split():
!                 n = len(w)
!                 # Make sure this range matches in tokenize_word().
!                 if 3 <= n <= maxword:
!                     yield w
! 
!                 elif n >= 3:
!                     for t in tokenize_word(w):
!                         yield t
  
  global_tokenizer = Tokenizer()
--- 1618,1623 ----
              text = html_re.sub('', text)
  
!             for t in self.tokenize_text(text):
!                 yield t
  
  global_tokenizer = Tokenizer()


From montanaro at users.sourceforge.net  Sun Aug  6 18:34:39 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:34:39 -0700
Subject: [Spambayes-checkins] spambayes/spambayes Options.py, 1.132,
	1.133 tokenizer.py, 1.38, 1.39
Message-ID: <20060806163441.7E8C41E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv30712/spambayes

Modified Files:
	Options.py tokenizer.py 
Log Message:
Add an x-short_runs option.  When enabled, instead of completely skipping
short words, runs of them are counted, the longest generating a token using
the usual log2() technique.  See the comment in tokenizer.py and doc string
in Options.py for examples of the sort of things it attempts to catch.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.132
retrieving revision 1.133
diff -C2 -d -r1.132 -r1.133
*** Options.py	6 Aug 2006 16:14:17 -0000	1.132
--- Options.py	6 Aug 2006 16:34:37 -0000	1.133
***************
*** 98,101 ****
--- 98,109 ----
       INTEGER, RESTORE),
  
+     ("x-short_runs", _("Count runs of short 'words'"), False,
+      _("""(EXPERIMENTAL) If true, generate tokens based on max number of
+      short word runs. Short words are anything shorter than three
+      characters.  Normally they are skipped, but one common
+      spam technique spells words like 'V I A G RA'.
+      """),
+      BOOLEAN, RESTORE),
+ 
      ("count_all_header_lines", _("Count all header lines"), False,
       _("""Generate tokens just counting the number of instances of each kind

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** tokenizer.py	6 Aug 2006 16:19:19 -0000	1.38
--- tokenizer.py	6 Aug 2006 16:34:37 -0000	1.39
***************
*** 1531,1543 ****
                                                    "skip_max_word_size"]):
          """Tokenize everything in the chunk of text we were handed."""
          for w in text.split():
              n = len(w)
!             # Make sure this range matches in tokenize_word().
!             if 3 <= n <= maxword:
!                 yield w
  
!             elif n >= 3:
!                 for t in tokenize_word(w):
!                     yield t
  
      def tokenize_body(self, msg):
--- 1531,1558 ----
                                                    "skip_max_word_size"]):
          """Tokenize everything in the chunk of text we were handed."""
+         short_runs = Set()
+         short_count = 0
          for w in text.split():
              n = len(w)
!             if n < 3:
!                 # count how many short words we see in a row - meant to
!                 # latch onto crap like this:
!                 # X j A m N j A d X h
!                 # M k E z R d I p D u I m A c
!                 # C o I d A t L j I v S j
!                 short_count += 1
!             else:
!                 if short_count:
!                     short_runs.add(short_count)
!                     short_count = 0
!                 # Make sure this range matches in tokenize_word().
!                 if 3 <= n <= maxword:
!                     yield w
  
!                 elif n >= 3:
!                     for t in tokenize_word(w):
!                         yield t
!         if short_runs and options["Tokenizer", "x-short_runs"]:
!             yield "short:%d" % int(log2(max(short_runs)))
  
      def tokenize_body(self, msg):

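The short-run counting in the tokenizer.py change above can be tried in isolation. This is a standalone Python 3 sketch, not the committed method: `math.log2` stands in for the tokenizer's own log2 helper, the `maxword` default of 12 is an assumption, words longer than `maxword` are dropped instead of being passed to tokenize_word(), and the option lookup is replaced by a plain argument. It also records a run that ends the text, which the loop as committed does not appear to do because a run is only saved when a longer word follows it.

```python
import math

def tokenize_text(text, maxword=12, short_runs_enabled=True):
    """Yield words of 3..maxword characters, then one short-run token."""
    short_runs = set()
    short_count = 0
    for w in text.split():
        n = len(w)
        if n < 3:
            # Count consecutive short "words" such as 'V I A G RA'.
            short_count += 1
        else:
            if short_count:
                short_runs.add(short_count)
                short_count = 0
            if n <= maxword:
                yield w
    # Deviation from the diff: also keep a run that ends the text.
    if short_count:
        short_runs.add(short_count)
    if short_runs and short_runs_enabled:
        # Bucket the longest run by the integer part of its log2.
        yield "short:%d" % int(math.log2(max(short_runs)))
```

With this bucketing, runs of 4 to 7 short words all produce "short:2" and runs of 8 to 15 produce "short:3", so only the rough size of the longest run reaches the classifier.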

From montanaro at users.sourceforge.net  Sun Aug  6 18:52:57 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:52:57 -0700
Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py, NONE,
	1.1 Options.py, 1.133, 1.134 tokenizer.py, 1.39, 1.40
Message-ID: <20060806165259.7C81F1E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv5725/spambayes

Modified Files:
	Options.py tokenizer.py 
Added Files:
	dnscache.py 
Log Message:
Add Matt Cowles' dnscache module and x-lookup_ip option.  The module underwent
some substantial changes; most importantly, I got most of the way to adding
support for persisting the cache to either dbm or zodb stores.  Also ran
reindent over dnscache.py.


--- NEW FILE: dnscache.py ---
# Copyright 2004, Matthew Dixon Cowles <matt at mondoinfo.com>.
# Distributable under the same terms as the Python programming language.
# Inspired by KevinL's cache included with PyDNS.
# Provided with NO WARRANTY.

# Version 0.1 2004 06 27
# Version 0.11 2004 07 06 Fixed zero division error in __del__

import DNS # From http://sourceforge.net/projects/pydns/

import sys
import os
import operator
import time
import types
import shelve
import socket

from spambayes.Options import options

kCheckForPruneEvery=20
kMaxTTL=60 * 60 * 24 * 7 # One week
kPruneThreshold=1500 # May go over slightly; numbers chosen at random
kPruneDownTo=1000


class lookupResult(object):
    #__slots__=("qType","answer","question","expiresAt","lastUsed")

    def __init__(self,qType,answer,question,expiresAt,now):
        self.qType=qType
        self.answer=answer
        self.question=question
        self.expiresAt=expiresAt
        self.lastUsed=now
        return None


# From ActiveState's Python cookbook
# Yakov Markovitch, Fast sort the list of objects by object's attribute
# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52230
def sort_by_attr(seq, attr):
    """Sort the sequence of objects by object's attribute

    Arguments:
    seq  - the list or any sequence (including immutable one) of objects to sort.
    attr - the name of attribute to sort by

    Returns:
    the sorted list of objects.
    """
    #import operator

    # Use the "Schwartzian transform"
    # Create the auxiliary list of tuples where every i-th tuple has form
    # (seq[i].attr, i, seq[i]) and sort it. The second item of tuple is needed not
    # only to provide stable sorting, but mainly to eliminate comparison of objects
    # (which can be expensive or prohibited) in case of equal attribute values.
    intermed = map(None, map(getattr, seq, (attr,)*len(seq)), xrange(len(seq)), seq)
    intermed.sort()
    return map(operator.getitem, intermed, (-1,) * len(intermed))


class cache:
    def __init__(self,dnsServer=None,cachefile=None):
        # These attributes are intended for user setting
        self.printStatsAtEnd=False

        # As far as I can tell from the standards,
        # it's legal to have more than one PTR record
        # for an address. That is, it's legal to get
        # more than one name back when you do a
        # reverse lookup on an IP address. I don't
        # know of a use for that and I've never seen
        # it done. And I don't think that most
        # people would expect it. So forward ("A")
        # lookups always return a list. Reverse
        # ("PTR") lookups return a single name unless
        # this attribute is set to False.
        self.returnSinglePTR=True

        # How long to cache an error as no data
        self.cacheErrorSecs=5*60

        # How long to wait for the server
        self.dnsTimeout=10

        # Some servers always return a TTL of zero.
        # In those cases, turning this up a bit is
        # probably reasonable.
        self.minTTL=0

        # end of user-settable attributes

        self.cachefile = cachefile
        if cachefile:
            self.open_cachefile(cachefile)
        else:
            self.caches={ "A": {}, "PTR": {} }
        self.hits=0 # These two for statistics
        self.misses=0
        self.pruneTicker=0

        if dnsServer==None:
            DNS.DiscoverNameServers()
            self.queryObj=DNS.DnsRequest()
        else:
            self.queryObj=DNS.DnsRequest(server=dnsServer)
        return None

    def open_cachefile(self, cachefile):
        filetype = options["Storage", "persistent_use_database"]
        cachefile = os.path.expanduser(cachefile)
        if filetype == "dbm":
            self.caches=shelve.open(cachefile)
            if not self.caches.has_key("A"):
                self.caches["A"] = {}
            if not self.caches.has_key("PTR"):
                self.caches["PTR"] = {}
        elif filetype == "zodb":
            from ZODB import DB
            from ZODB.FileStorage import FileStorage
            self._zodb_storage = FileStorage(cachefile, read_only=False)
            self._DB = DB(self._zodb_storage, cache_size=10000)
            self._conn = self._DB.open()
            root = self._conn.root()
            self.caches = root.get("dnscache")
            if self.caches is None:
                # There is no dnscache entry yet, so create one.
                from BTrees.OOBTree import OOBTree
                self.caches = root["dnscache"] = OOBTree()
                self.caches["A"] = {}
                self.caches["PTR"] = {}
                print "opened new cache"
            else:
                print "opened existing cache with", len(self.caches["A"]), "A records",
                print "and", len(self.caches["PTR"]), "PTR records"

    def close(self):
        if not self.cachefile:
            return
        filetype = options["Storage", "persistent_use_database"]
        if filetype == "dbm":
            self.caches.close()
        elif filetype == "zodb":
            self._zodb_close()

    def _zodb_store(self):
        import transaction
        from ZODB.POSException import ConflictError
        from ZODB.POSException import TransactionFailedError

        try:
            transaction.commit()
        except ConflictError, msg:
            # We'll save it next time, or on close.  It'll be lost if we
            # hard-crash, but that's unlikely, and not a particularly big
            # deal.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Conflict on commit.", msg
            transaction.abort()
        except TransactionFailedError, msg:
            # Saving isn't working.  Try to abort, but chances are that
            # restarting is needed.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Store failed.  Need to restart.", msg
            transaction.abort()

    def _zodb_close(self):
        # Ensure that the db is saved before closing.  Alternatively, we
        # could abort any waiting transaction.  We need to do *something*
        # with it, though, or it will be still around after the db is
        # closed and cause problems.  For now, saving seems to make sense
        # (and we can always add abort methods if they are ever needed).
        self._zodb_store()

        # Do the closing.
        self._DB.close()

        # We don't make any use of the 'undo' capabilities of the
        # FileStorage at the moment, so might as well pack the database
        # each time it is closed, to save as much disk space as possible.
        # Pack it up to where it was 'yesterday'.
        # XXX What is the 'referencesf' parameter for pack()?  It doesn't
        # XXX seem to do anything according to the source.
##       self._zodb_storage.pack(time.time()-60*60*24, None)
        self._zodb_storage.close()

        self._zodb_closed = True
        if options["globals", "verbose"]:
            print >> sys.stderr, 'Closed dnscache database'


    def __del__(self):
        if self.printStatsAtEnd:
            self.printStats()

    def printStats(self):
        for key,val in self.caches.items():
            totAnswers=0
            for item in val.values():
                totAnswers+=len(item)
            print "cache %s has %i question(s) and %i answer(s)" % (key,len(self.caches[key]),totAnswers)
        if self.hits+self.misses==0:
            print "No queries"
        else:
            print "%i hits, %i misses (%.1f%% hits)" % (self.hits, self.misses, self.hits/float(self.hits+self.misses)*100)

    def prune(self,now):
        # I want this to be as fast as reasonably possible.
        # If I didn't, I'd probably do various things differently
        # Is there a faster way to do this?
        allAnswers=[]
        for cache in self.caches.values():
            for val in cache.values():
                allAnswers += val

        allAnswers=sort_by_attr(allAnswers,"expiresAt")
        allAnswers.reverse()

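        # Pop expired answers off the end.  Stop when the list is empty
        # or the soonest-to-expire answer is still valid.
        while allAnswers:
            if allAnswers[-1].expiresAt>now:
                break
            answer=allAnswers.pop()
            c=self.caches[answer.qType]
            c[answer.question].remove(answer)
            if len(c[answer.question])==0:
                del c[answer.question]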

        self.printStats()

        if len(allAnswers)<=kPruneDownTo:
            return None

        # Expiring didn't get us down to the size we want, so delete
        # some entries least-recently-used-wise. I'm not by any means
        # sure that this is the best strategy, but as yet I don't have
        # data to test different strategies.
        allAnswers=sort_by_attr(allAnswers,"lastUsed")
        allAnswers.reverse()
        numToDelete=len(allAnswers)-kPruneDownTo
        for count in range(numToDelete):
            answer=allAnswers.pop()
            c=self.caches[answer.qType]
            c[answer.question].remove(answer)
            if len(c[answer.question])==0:
                del c[answer.question]

        return None


    def formatForReturn(self,listOfObjs):
        if len(listOfObjs)==1 and listOfObjs[0].answer==None:
            return []

        if listOfObjs[0].qType=="PTR" and self.returnSinglePTR:
            return listOfObjs[0].answer

        return [ obj.answer for obj in listOfObjs ]


    def lookup(self,question,qType="A"):
        qType=qType.upper()
        if qType not in ("A","PTR"):
            raise ValueError,"Query type must be one of A, PTR"

        now=int(time.time())

        # Finding the len() of a dictionary isn't an expensive operation
        # but doing it twice for every lookup isn't necessary.
        self.pruneTicker+=1
        if self.pruneTicker==kCheckForPruneEvery:
            self.pruneTicker=0
            if len(self.caches["A"])+len(self.caches["PTR"])>kPruneThreshold:
                self.prune(now)

        cacheToLookIn=self.caches[qType]

        try:
            answers=cacheToLookIn[question]
        except KeyError:
            pass
        else:
            assert len(answers)>0
            ind=0
            # No guarantee that expire has already been done
            while ind<len(answers):
                thisAnswer=answers[ind]
                if thisAnswer.expiresAt<now:
                    del answers[ind]
                else:
                    thisAnswer.lastUsed=now
                    ind+=1

            if len(answers)==0:
                del cacheToLookIn[question]
            else:
                self.hits+=1
                return self.formatForReturn(answers)

        # Not in cache or we just expired it
        self.misses+=1

        if qType=="PTR":
            qList=question.split(".")
            qList.reverse()
            queryQuestion=".".join(qList)+".in-addr.arpa"
        else:
            queryQuestion=question

        # where do we get NXDOMAIN?
        try:
            reply=self.queryObj.req(queryQuestion,qtype=qType,timeout=self.dnsTimeout)
        except DNS.Base.DNSError,detail:
            if detail.args[0]<>"Timeout":
                print "Error, fixme",detail
                print "Question was",queryQuestion
                print "Original question was",question
                print "Type was",qType
            objs=[ lookupResult(qType,None,question,self.cacheErrorSecs+now,now) ]
            cacheToLookIn[question]=objs # Add to format for return?
            return self.formatForReturn(objs)
        except socket.gaierror,detail:
            print "DNS connection failure:", self.queryObj.ns, detail
            print "Defaults:", DNS.defaults
            # There is no reply to examine, so report "no data" without
            # caching the failure.
            return []

        objs=[]
        for answer in reply.answers:
            if answer["typename"]==qType:
                # PyDNS returns TTLs as longs but RFC 1035 says that the
                # TTL value is a signed 32-bit value and must be positive,
                # so it should be safe to coerce it to a Python integer.
                # And anyone who sets a time to live of more than 2^31-1
                # seconds (68 years and change) is drunk.
                # Arguably, I ought to impose a maximum rather than continuing
                # with longs (int(long) returns long in recent versions of Python).
                ttl=max(min(int(answer["ttl"]),kMaxTTL),self.minTTL)
                # RFC 2308 says that you should cache an NXDOMAIN for the
                # minimum of the minimum field of the SOA record and the TTL
                # of the SOA.
                if ttl>0:
                    item=lookupResult(qType,answer["data"],question,ttl+now,now)
                    objs.append(item)

        if len(objs)>0:
            cacheToLookIn[question]=objs
            return self.formatForReturn(objs)

        # Probably SERVFAIL or the like
        if len(reply.authority)==0:
            objs=[ lookupResult(qType,None,question,self.cacheErrorSecs+now,now) ]
            cacheToLookIn[question]=objs
            return self.formatForReturn(objs)


        # No such host
        #
        # I don't know in what circumstances you'd have more than one authority,
        # so I'll just assume that the first is what we want.
        #
        # RFC 2308 specifies that this is how to decide how long to cache an
        # NXDOMAIN.
        auth=reply.authority[0]
        auTTL=int(auth["ttl"])
        for item in auth["data"]:
            if type(item)==types.TupleType and item[0]=="minimum":
                auMin=int(item[1])
                cacheNeg=min(auMin,auTTL)
                break
        else:
            cacheNeg=auTTL
        objs=[ lookupResult(qType,None,question,cacheNeg+now,now) ]

        cacheToLookIn[question]=objs
        return self.formatForReturn(objs)


def main():
    import transaction
    c=cache(cachefile=os.path.expanduser("~skip/.dnscache"))
    c.printStatsAtEnd=True
    for host in ["www.python.org", "www.timsbloggers.com",
                 "www.seeputofor.com", "www.completegarbage.tv",
                 "www.tradelinkllc.com"]:
        print "checking", host
        now=time.time()
        ips=c.lookup(host)
        print ips,time.time()-now
        now=time.time()
        ips=c.lookup(host)
        print ips,time.time()-now

        if ips:
            ip=ips[0]
            now=time.time()
            name=c.lookup(ip,qType="PTR")
            print name,time.time()-now
            now=time.time()
            name=c.lookup(ip,qType="PTR")
            print name,time.time()-now
        else:
            print "unknown"

    c.close()

    return None

if __name__=="__main__":
    main()

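The TTL handling and PTR name construction in lookup() above can be checked on their own. This is a small Python 3 sketch with hypothetical function names; it mirrors the constants and the RFC 2308 rule quoted in the comments, and performs no DNS queries.

```python
K_MAX_TTL = 60 * 60 * 24 * 7  # one week, same value as kMaxTTL

def reverse_name(ip):
    """Build the in-addr.arpa name queried for a PTR lookup."""
    return ".".join(reversed(ip.split("."))) + ".in-addr.arpa"

def clamp_ttl(ttl, min_ttl=0, max_ttl=K_MAX_TTL):
    """Cap a record's TTL at max_ttl, then raise it to at least min_ttl."""
    return max(min(int(ttl), max_ttl), min_ttl)

def negative_ttl(soa_ttl, soa_minimum=None):
    """How long to cache "no such host".

    Uses the smaller of the SOA minimum field and the SOA TTL, falling
    back to the TTL alone when no minimum field is present.
    """
    if soa_minimum is None:
        return int(soa_ttl)
    return min(int(soa_minimum), int(soa_ttl))
```

For example, reverse_name("192.0.2.1") gives "1.2.0.192.in-addr.arpa", and a server that always reports a TTL of zero is handled by passing a non-zero min_ttl, which is what the minTTL attribute is for.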
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.133
retrieving revision 1.134
diff -C2 -d -r1.133 -r1.134
*** Options.py	6 Aug 2006 16:34:37 -0000	1.133
--- Options.py	6 Aug 2006 16:52:54 -0000	1.134
***************
*** 106,109 ****
--- 106,123 ----
       BOOLEAN, RESTORE),
  
+     ("x-lookup_ip", _("Generate IP address tokens from hostnames"), False,
+      _("""(EXPERIMENTAL) Generate IP address tokens from hostnames.
+      Requires PyDNS (http://pydns.sourceforge.net/)."""),
+      BOOLEAN, RESTORE),
+ 
+     ("lookup_ip_cache", _("x-lookup_ip cache file location"), "",
+      _("""Tell SpamBayes where to cache IP address lookup information.
+      Only comes into play if x-lookup_ip is enabled. The default
+      (empty string) disables the file cache.  When caching is enabled,
+      the cache file is stored using the same database type as the main
+      token store (only dbm and zodb supported so far, zodb has problems,
+      dbm is untested, hence the default)."""),
+      FILE, RESTORE),
+ 
      ("count_all_header_lines", _("Count all header lines"), False,
       _("""Generate tokens just counting the number of instances of each kind

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** tokenizer.py	6 Aug 2006 16:34:37 -0000	1.39
--- tokenizer.py	6 Aug 2006 16:52:54 -0000	1.40
***************
*** 40,43 ****
--- 40,54 ----
  
  
+ try:
+     import dnscache
+     cache = dnscache.cache(cachefile=options["Tokenizer", "lookup_ip_cache"])
+     cache.printStatsAtEnd = True
+ except (IOError, ImportError):
+     cache = None
+ else:
+     import atexit
+     atexit.register(cache.close)
+ 
+  
  # Patch encodings.aliases to recognize 'ansi_x3_4_1968'
  from encodings.aliases import aliases # The aliases dictionary
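
The optional-dependency pattern in the tokenizer.py hunk above (build the
cache if possible, fall back to None, and register close() with atexit only
on success) can be sketched in isolation.  This is a minimal illustration
written for Python 3, not the SpamBayes code: DemoCache and make_cache are
hypothetical names standing in for dnscache.cache and the module-level
try/except/else block.

```python
import atexit


class DemoCache:
    """Hypothetical stand-in for dnscache.cache; not the real class."""

    def __init__(self, cachefile=""):
        self.cachefile = cachefile
        self.closed = False

    def close(self):
        # The real cache would write its contents out here.
        self.closed = True


def make_cache(factory, cachefile=""):
    """Return a cache built by factory, or None if it cannot be created."""
    try:
        cache = factory(cachefile=cachefile)
    except (IOError, ImportError):
        # Missing dependency or unreadable cache file: run without a cache.
        return None
    else:
        # Only register the cleanup hook when construction succeeded.
        atexit.register(cache.close)
        return cache
```

Callers then test "cache is not None" before every lookup, which is what the
later x-lookup_ip code in tokenizer.py does.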


From montanaro at users.sourceforge.net  Sun Aug  6 18:58:33 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 09:58:33 -0700
Subject: [Spambayes-checkins] spambayes/spambayes Options.py, 1.134,
	1.135 tokenizer.py, 1.40, 1.41
Message-ID: <20060806165834.E34BF1E4003@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv8064/spambayes

Modified Files:
	Options.py tokenizer.py 
Log Message:
Add an image-size token.  Enabled with the x-image_size option.  Uses the
usual log2() gimmick.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.134
retrieving revision 1.135
diff -C2 -d -r1.134 -r1.135
*** Options.py	6 Aug 2006 16:52:54 -0000	1.134
--- Options.py	6 Aug 2006 16:58:31 -0000	1.135
***************
*** 120,123 ****
--- 120,128 ----
       FILE, RESTORE),
  
+     ("x-image_size", _("Generate image size tokens"), False,
+      _("""(EXPERIMENTAL) If true, generate tokens based on the sizes of
+      embedded images."""),
+      BOOLEAN, RESTORE),
+ 
      ("count_all_header_lines", _("Count all header lines"), False,
       _("""Generate tokens just counting the number of instances of each kind

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** tokenizer.py	6 Aug 2006 16:52:54 -0000	1.40
--- tokenizer.py	6 Aug 2006 16:58:31 -0000	1.41
***************
*** 636,639 ****
--- 636,647 ----
                        msg.walk()))
  
+ def imageparts(msg):
+     """Return a list of all msg parts with type 'image/*'."""
+     # Don't want a set here because we want to be able to process them in
+     # order.
+     return filter(lambda part:
+                   part.get_content_type().startswith('image/'),
+                   msg.walk())
+ 
  has_highbit_char = re.compile(r"[\x80-\xff]").search
  
***************
*** 1592,1595 ****
--- 1600,1621 ----
                                                   "octet_prefix_size"]]
  
+         parts = imageparts(msg)
+         if options["Tokenizer", "x-image_size"]:
+             # Find image/* parts of the body, calculating the log(size) of
+             # each image.
+             
+             for part in parts:
+                 try:
+                     text = part.get_payload(decode=True)
+                 except:
+                     yield "control: couldn't decode image"
+                     text = part.get_payload(decode=False)
+ 
+                 if text is None:
+                     yield "control: image payload is None"
+                     continue
+ 
+                 yield "image-size:2**%d" % round(log2(len(text)))
+ 
          # Find, decode (base64, qp), and tokenize textual parts of the body.
          for part in textparts(msg):
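
The "usual log2() gimmick" from the log message buckets sizes by their
nearest power of two, so images of roughly the same size share a token.  A
minimal sketch in Python 3 follows; image_size_token is a hypothetical
helper name, and the empty-payload guard is an addition of this sketch
(log2(0) is undefined), not part of revision 1.41.

```python
import math


def image_size_token(payload):
    """Return a power-of-two size token for payload, or None if it is empty."""
    if not payload:
        return None  # log2(0) is undefined, so emit no token
    # round() picks the nearest exponent, so 1000 and 1024 bytes share a token.
    return "image-size:2**%d" % round(math.log2(len(payload)))
```

For example, a 1000-byte payload and a 1024-byte payload both map to the
2**10 bucket, while a 1500-byte payload rounds up to 2**11.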


From montanaro at users.sourceforge.net  Sun Aug  6 19:09:07 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 10:09:07 -0700
Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, NONE,
	1.1 Options.py, 1.135, 1.136 tokenizer.py, 1.41, 1.42
Message-ID: <20060806170910.511471E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv11708/spambayes

Modified Files:
	Options.py tokenizer.py 
Added Files:
	ImageStripper.py 
Log Message:
Crude OCR capability based on the ocrad program and netpbm.  As bad as
ocrad's text extraction is, this gimmick seems to work pretty well at
catching the current crop of pump-n-dump spams.  Unix only until someone
implements similar functionality for Windows.


--- NEW FILE: ImageStripper.py ---
"""
This is the place where we try and discover information buried in images.
"""

import os
import tempfile
import math
import time

try:
    # We have three possibilities for Set:
    #  (a) With Python 2.2 and earlier, we use our compatsets class
    #  (b) With Python 2.3, we use the sets.Set class
    #  (c) With Python 2.4 and later, we use the builtin set class
    Set = set
except NameError:
    try:
        from sets import Set
    except ImportError:
        from spambayes.compatsets import Set

from spambayes.Options import options

# copied from tokenizer.py - maybe we should split it into pieces...
def log2(n, log=math.log, c=math.log(2)):
    return log(n)/c

# I'm sure this is all wrong for Windows.  Someone else can fix it. ;-)
def is_executable(prog):
    info = os.stat(prog)
    return (info.st_uid == os.getuid() and (info.st_mode & 0100) or
            info.st_gid == os.getgid() and (info.st_mode & 0010) or
            info.st_mode & 0001)

def find_program(prog):
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        program = os.path.join(directory, prog)
        if os.path.exists(program) and is_executable(program):
            return program
    return ""

def find_decoders():
    # check for filters to convert to netpbm
    for decode_jpeg in ["jpegtopnm", "djpeg"]:
        if find_program(decode_jpeg):
            break
    else:
        decode_jpeg = None
    for decode_png in ["pngtopnm"]:
        if find_program(decode_png):
            break
    else:
        decode_png = None
    for decode_gif in ["giftopnm"]:
        if find_program(decode_gif):
            break
    else:
        decode_gif = None

    decoders = {
        "image/jpeg": decode_jpeg,
        "image/gif": decode_gif,
        "image/png": decode_png,
        }
    return decoders

def decode_parts(parts, decoders):
    pnmfiles = []
    for part in parts:
        decoder = decoders.get(part.get_content_type())
        if decoder is None:
            continue
        try:
            bytes = part.get_payload(decode=True)
        except:
            continue

        if len(bytes) > options["Tokenizer", "max_image_size"]:
            continue                # assume it's just a picture for now

        fd, imgfile = tempfile.mkstemp()
        os.write(fd, bytes)
        os.close(fd)

        fd, pnmfile = tempfile.mkstemp()
        os.close(fd)
        os.system("%s <%s >%s 2>/dev/null" % (decoder, imgfile, pnmfile))
        pnmfiles.append(pnmfile)

    if not pnmfiles:
        return

    if len(pnmfiles) > 1:
        if find_program("pnmcat"):
            fd, pnmfile = tempfile.mkstemp()
            os.close(fd)
            os.system("pnmcat -lr %s > %s 2>/dev/null" %
                      (" ".join(pnmfiles), pnmfile))
            for f in pnmfiles:
                os.unlink(f)
            pnmfiles = [pnmfile]

    return pnmfiles

def extract_ocr_info(pnmfiles):
    fd, orf = tempfile.mkstemp()
    os.close(fd)

    textbits = []
    tokens = Set()
    for pnmfile in pnmfiles:
        ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
        textbits.append(ocr.read())
        ocr.close()
        for line in open(orf):
            if line.startswith("lines"):
                nlines = int(line.split()[1])
                if nlines:
                    tokens.add("image-text-lines:%d" % int(log2(nlines)))

        os.unlink(pnmfile)
    os.unlink(orf)

    return "\n".join(textbits), tokens

class ImageStripper:
    def analyze(self, parts):
        if not parts:
            return "", Set()

        # need ocrad
        if not find_program("ocrad"):
            return "", Set()

        decoders = find_decoders()
        pnmfiles = decode_parts(parts, decoders)

        if not pnmfiles:
            return "", Set()

        return extract_ocr_info(pnmfiles)

        
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.135
retrieving revision 1.136
diff -C2 -d -r1.135 -r1.136
*** Options.py	6 Aug 2006 16:58:31 -0000	1.135
--- Options.py	6 Aug 2006 17:09:05 -0000	1.136
***************
*** 125,128 ****
--- 125,142 ----
       BOOLEAN, RESTORE),
  
+     ("x-crack_images", _("Look inside images for text"), False,
+      _("""(EXPERIMENTAL) If true, generate tokens based on the
+      (hopefully) text content contained in any images in each message.
+      The current support is minimal, relies on the installation of
+      ocrad (http://www.gnu.org/software/ocrad/ocrad.html) and netpbm.
+      It is almost certainly only useful in its current form on Unix-like
+      machines."""),
+      BOOLEAN, RESTORE),
+ 
+     ("max_image_size", _("Max image size to try OCR-ing"), 100000,
+      _("""When x-crack_images is enabled, this specifies the largest
+      image (in bytes) to try OCR on."""),
+      INTEGER, RESTORE),
+ 
      ("count_all_header_lines", _("Count all header lines"), False,
       _("""Generate tokens just counting the number of instances of each kind

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** tokenizer.py	6 Aug 2006 16:58:31 -0000	1.41
--- tokenizer.py	6 Aug 2006 17:09:05 -0000	1.42
***************
*** 1618,1621 ****
--- 1618,1629 ----
                  yield "image-size:2**%d" % round(log2(len(text)))
  
+         if options["Tokenizer", "x-crack_images"]:
+             from spambayes.ImageStripper import ImageStripper
+             text, tokens = ImageStripper().analyze(parts)
+             for t in tokens:
+                 yield t
+             for t in self.tokenize_text(text):
+                 yield t
+ 
          # Find, decode (base64, qp), and tokenize textual parts of the body.
          for part in textparts(msg):
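
The PATH search done by find_program() in ImageStripper.py can also be
written with os.access(), which asks the operating system whether the
current user may execute a file instead of comparing mode bits by hand.
The following is a sketch in Python 3 under that assumption, not the
checked-in code; the optional path argument is hypothetical and exists only
to make the function easy to try out.

```python
import os


def find_program(prog, path=None):
    """Return the full path of prog found on the search path, or ""."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, prog)
        # isfile() skips directories; X_OK checks execute permission
        # for the current user.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return ""
```

As in the original, an empty string means "not found", so callers can keep
using "if not find_program(...)".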


From montanaro at users.sourceforge.net  Sun Aug  6 22:55:12 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 13:55:12 -0700
Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.42,1.43
Message-ID: <20060806205514.D43581E4011@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv3937

Modified Files:
	tokenizer.py 
Log Message:
log(0) is a no-no.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** tokenizer.py	6 Aug 2006 17:09:05 -0000	1.42
--- tokenizer.py	6 Aug 2006 20:55:10 -0000	1.43
***************
*** 1616,1620 ****
                      continue
  
!                 yield "image-size:2**%d" % round(log2(len(text)))
  
          if options["Tokenizer", "x-crack_images"]:
--- 1616,1621 ----
                      continue
  
!                 if text:
!                     yield "image-size:2**%d" % round(log2(len(text)))
  
          if options["Tokenizer", "x-crack_images"]:


From montanaro at users.sourceforge.net  Mon Aug  7 04:47:13 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 06 Aug 2006 19:47:13 -0700
Subject: [Spambayes-checkins] spambayes/spambayes tokenizer.py,1.43,1.44
Message-ID: <20060807024715.6C64D1E4003@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv10981

Modified Files:
	tokenizer.py 
Log Message:
In splicing back several changes one-by-one I completely left out the code
to handle x-lookup_ip...  That would explain why my testing today didn't
show any improvement!

Also, tweak image-size to only yield a single token, and only if there is at
least one decodable image.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** tokenizer.py	6 Aug 2006 20:55:10 -0000	1.43
--- tokenizer.py	7 Aug 2006 02:47:10 -0000	1.44
***************
*** 1085,1088 ****
--- 1085,1103 ----
              scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
  
+             if cache is not None and options["Tokenizer", "x-lookup_ip"]:
+                 ips=cache.lookup(netloc)
+                 if len(ips)==0:
+                     pushclue("url-ip:timeout")
+                 else:
+                     for ip in ips: # Should we limit to one A record?
+                         pushclue("url-ip:%s/32" % ip)
+                         dottedQuadList=ip.split(".")
+                         pushclue("url-ip:%s/8" % dottedQuadList[0])
+                         pushclue("url-ip:%s.%s/16" % (dottedQuadList[0],
+                                                       dottedQuadList[1]))
+                         pushclue("url-ip:%s.%s.%s/24" % (dottedQuadList[0],
+                                                          dottedQuadList[1],
+                                                          dottedQuadList[2]))
+ 
              # one common technique in bogus "please (re-)authorize yourself"
              # scams is to make it appear as if you're visiting a valid
***************
*** 1605,1608 ****
--- 1620,1624 ----
              # each image.
              
+             total_len = 0
              for part in parts:
                  try:
***************
*** 1612,1621 ****
                      text = part.get_payload(decode=False)
  
                  if text is None:
                      yield "control: image payload is None"
-                     continue
  
!                 if text:
!                     yield "image-size:2**%d" % round(log2(len(text)))
  
          if options["Tokenizer", "x-crack_images"]:
--- 1628,1637 ----
                      text = part.get_payload(decode=False)
  
+                 total_len += len(text or "")
                  if text is None:
                      yield "control: image payload is None"
  
!             if total_len:
!                 yield "image-size:2**%d" % round(log2(total_len))
  
          if options["Tokenizer", "x-crack_images"]:
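
The url-ip clues added in the first hunk above give the classifier one token
per address plus one token for each of the /8, /16 and /24 prefixes, so
hosts in the same network block share evidence.  A small Python 3 sketch of
the same token scheme follows; url_ip_tokens is a hypothetical function
that returns a list instead of calling pushclue(), and it assumes
dotted-quad IPv4 addresses as the original does.

```python
def url_ip_tokens(ips):
    """Build url-ip tokens for a list of dotted-quad IPv4 address strings."""
    if not ips:
        # No answer from the resolver is reported as a timeout clue.
        return ["url-ip:timeout"]
    tokens = []
    for ip in ips:
        quad = ip.split(".")
        tokens.append("url-ip:%s/32" % ip)
        tokens.append("url-ip:%s/8" % quad[0])
        tokens.append("url-ip:%s/16" % ".".join(quad[:2]))
        tokens.append("url-ip:%s/24" % ".".join(quad[:3]))
    return tokens
```

For the address 192.0.2.7 this yields url-ip:192.0.2.7/32, url-ip:192/8,
url-ip:192.0/16 and url-ip:192.0.2/24.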


From anadelonbrin at users.sourceforge.net  Tue Aug  8 00:22:33 2006
From: anadelonbrin at users.sourceforge.net (Tony Meyer)
Date: Mon, 07 Aug 2006 15:22:33 -0700
Subject: [Spambayes-checkins] website docs.ht,1.19,1.20
Message-ID: <20060807222238.0B9FB1E4007@bag.python.org>

Update of /cvsroot/spambayes/website
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv25575

Modified Files:
	docs.ht 
Log Message:
Sourceforge broke our links!

Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** docs.ht	9 Jul 2004 00:39:20 -0000	1.19
--- docs.ht	7 Aug 2006 22:22:28 -0000	1.20
***************
*** 11,21 ****
  hints and tips, scripts and recipes, and anything else (related to SpamBayes) that takes
  your fancy added here.</li>
! <li>Instructions on <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/README.txt?rev=HEAD&content-type=text/plain">installing Spambayes</a> and integrating it into your mail system.</li>
! <li>The Outlook plugin includes an <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/Outlook2000/about.html?rev=HEAD">&quot;About&quot; File</a>, and a <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/Outlook2000/docs/troubleshooting.html?rev=HEAD">
  &quot;Troubleshooting Guide&quot;</a> that can be accessed via the toolbar.
  (Note that the online documentation is always for the <strong>latest source</strong> version, and so might not correspond exactly with the version you are using.
  Always start with the documentation that came with the version you installed.)</li>
! <li>The <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/README-DEVEL.txt?rev=HEAD&content-type=text/plain">README-DEVEL.txt</a> information that should be of use to people planning on developing code based on SpamBayes.</li>
! <li>The <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/TESTING.txt?rev=HEAD&content-type=text/plain">TESTING.txt</a> file -- Clues about the practice of statistical testing, adapted from Tim
   comments on python-dev.
  <li>There are also a vast number of clues and notes scattered as block comments through the code.
--- 11,21 ----
  hints and tips, scripts and recipes, and anything else (related to SpamBayes) that takes
  your fancy added here.</li>
! <li>Instructions on <a href="http://spambayes.cvs.sourceforge.net/*checkout*/spambayes/spambayes/README.txt">installing Spambayes</a> and integrating it into your mail system.</li>
! <li>The Outlook plugin includes an <a href="http://spambayes.cvs.sourceforge.net/*checkout*/spambayes/spambayes/Outlook2000/about.html">&quot;About&quot; File</a>, and a <a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/Outlook2000/docs/troubleshooting.html?rev=HEAD">
  &quot;Troubleshooting Guide&quot;</a> that can be accessed via the toolbar.
  (Note that the online documentation is always for the <strong>latest source</strong> version, and so might not correspond exactly with the version you are using.
  Always start with the documentation that came with the version you installed.)</li>
! <li>The <a href="http://spambayes.cvs.sourceforge.net/*checkout*/spambayes/spambayes/README-DEVEL.txt">README-DEVEL.txt</a> information that should be of use to people planning on developing code based on SpamBayes.</li>
! <li>The <a href="http://spambayes.cvs.sourceforge.net/*checkout*/spambayes/spambayes/TESTING.txt">TESTING.txt</a> file -- Clues about the practice of statistical testing, adapted from Tim
   comments on python-dev.
  <li>There are also a vast number of clues and notes scattered as block comments through the code.


From anadelonbrin at users.sourceforge.net  Tue Aug  8 00:23:29 2006
From: anadelonbrin at users.sourceforge.net (Tony Meyer)
Date: Mon, 07 Aug 2006 15:23:29 -0700
Subject: [Spambayes-checkins] website download.ht, 1.36, 1.37 index.ht, 1.40,
	1.41
Message-ID: <20060807222331.2EABA1E4005@bag.python.org>

Update of /cvsroot/spambayes/website
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv26079

Modified Files:
	download.ht index.ht 
Log Message:
1.1a2 has been out for a bit now.

Index: download.ht
===================================================================
RCS file: /cvsroot/spambayes/website/download.ht,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** download.ht	10 Sep 2005 00:29:55 -0000	1.36
--- download.ht	7 Aug 2006 22:23:26 -0000	1.37
***************
*** 11,18 ****
  <a href="mailto:spambayes at python.org">spambayes at python.org</a>. 
  
! <p>The first alpha release of 1.1 is also now available.  It is highly likely
! that there are new bugs in this release, but if you are willing and able to
! give it a spin for us, that would be greatly appreciated.  You might like
! to look at this <a href="http://entrian.com/sbwiki/TryOutThePreRelease">list
  of things to try out</a>.</p>
  
--- 11,19 ----
  <a href="mailto:spambayes at python.org">spambayes at python.org</a>. 
  
! <p>The second alpha release of 1.1 is also now available.  It is highly likely
! that there are new bugs in this release (especially with the IMAP filter),
! but if you are willing and able to give it a spin for us, that would be
! greatly appreciated.  You might like to look at this
! <a href="http://entrian.com/sbwiki/TryOutThePreRelease">list
  of things to try out</a>.</p>
  
***************
*** 70,87 ****
  </li>
  <hr />
! <li><tt>d6457f141e2485d26cb2fa61a8d804c7</tt>
! <a href="http://prdownloads.sourceforge.net/spambayes/spambayes-1.1a1.exe?download">spambayes-1.1a1.exe</a>
  (3,025,816 bytes,
! <a href="sigs/spambayes-1.1a1.exe.asc">sig</a>)
  </li>
! <li><tt>380bb81006064aeaad16d192439214a4</tt>
! <a href="http://prdownloads.sourceforge.net/spambayes/spambayes-1.1a1.tar.gz?download">spambayes-1.1a1.tar.gz</a>
! (823,660 bytes,
! <a href="sigs/spambayes-1.1a1.tar.gz.asc">sig</a>)
  </li>
! <li><tt>1b67365a847e97f24cc50236ba6e2183</tt>
! <a href="http://prdownloads.sourceforge.net/spambayes/spambayes-1.1a1.zip?download">spambayes-1.1a1.zip</a>
  (971,031 bytes,
! <a href="sigs/spambayes-1.1a1.zip.asc">sig</a>)
  </li>
  </ul>
--- 71,88 ----
  </li>
  <hr />
! <li><tt></tt>
! <a href="http://prdownloads.sourceforge.net/spambayes/spambayes-1.1a2.exe?download">spambayes-1.1a2.exe</a>
  (3,025,816 bytes,
! <a href="sigs/spambayes-1.1a2.exe.asc">sig</a>)
  </li>
! <li><tt>6c94cb14008580c309dd176af73f2132</tt>
! <a href="http://prdownloads.sourceforge.net/spambayes/spambayes-1.1a2.tar.gz?download">spambayes-1.1a2.tar.gz</a>
! (830,084 bytes,
! <a href="sigs/spambayes-1.1a2.tar.gz.asc">sig</a>)
  </li>
! <li><tt></tt>
! <a href="http://prdownloads.sourceforge.net/spambayes/spambayes-1.1a2.zip?download">spambayes-1.1a2.zip</a>
  (971,031 bytes,
! <a href="sigs/spambayes-1.1a2.zip.asc">sig</a>)
  </li>
  </ul>

Index: index.ht
===================================================================
RCS file: /cvsroot/spambayes/website/index.ht,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** index.ht	10 Sep 2005 00:31:20 -0000	1.40
--- index.ht	7 Aug 2006 22:23:26 -0000	1.41
***************
*** 8,12 ****
  archives and a Windows binary installer).</p>
  <p>See the <a href="download.html">download</a> page for more.</p>
! <p>SpamBayes 1.1a1 is also now available!  (This includes both the source
  archives and a Windows binary installer).  This is an <em>alpha</em>
  release, so you should only try it if you are willing to try out
--- 8,12 ----
  archives and a Windows binary installer).</p>
  <p>See the <a href="download.html">download</a> page for more.</p>
! <p>SpamBayes 1.1a2 is also now available!  (This includes both the source
  archives and a Windows binary installer).  This is an <em>alpha</em>
  release, so you should only try it if you are willing to try out


From montanaro at users.sourceforge.net  Wed Aug  9 06:26:39 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Tue, 08 Aug 2006 21:26:39 -0700
Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py,1.1,1.2
Message-ID: <20060809042641.A19F01E4006@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv7959

Modified Files:
	dnscache.py 
Log Message:
Don't beat my brains out trying to get dbm and zodb caches to work.  Just
use a simple pickled dict...


Index: dnscache.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/dnscache.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** dnscache.py	6 Aug 2006 16:52:54 -0000	1.1
--- dnscache.py	9 Aug 2006 04:26:36 -0000	1.2
***************
*** 14,19 ****
  import time
  import types
- import shelve
  import socket
  
  from spambayes.Options import options
--- 14,22 ----
  import time
  import types
  import socket
+ try:
+     import cPickle as pickle
+ except ImportError:
+     import pickle
  
  from spambayes.Options import options
***************
*** 63,67 ****
  
  class cache:
!     def __init__(self,dnsServer=None,cachefile=None):
      # These attributes intended for user setting
          self.printStatsAtEnd=False
--- 66,70 ----
  
  class cache:
!     def __init__(self,dnsServer=None,cachefile=""):
      # These attributes intended for user setting
          self.printStatsAtEnd=False
***************
*** 93,101 ****
          # end of user-settable attributes
  
!         self.cachefile = cachefile
!         if cachefile:
!             self.open_cachefile(cachefile)
          else:
!             self.caches={ "A": {}, "PTR": {} }
          self.hits=0 # These two for statistics
          self.misses=0
--- 96,114 ----
          # end of user-settable attributes
  
!         self.cachefile = os.path.expanduser(cachefile)
!         if self.cachefile and os.path.exists(self.cachefile):
!             self.caches = pickle.load(open(self.cachefile, "rb"))
          else:
!             self.caches = {"A": {}, "PTR": {}}
! 
!         if options["globals", "verbose"]:
!             if self.caches["A"] or self.caches["PTR"]:
!                 print >> sys.stderr, "opened existing cache with",
!                 print >> sys.stderr, len(self.caches["A"]), "A records",
!                 print >> sys.stderr, "and", len(self.caches["PTR"]),
!                 print >> sys.stderr, "PTR records"
!             else:
!                 print >> sys.stderr, "opened new cache"
! 
          self.hits=0 # These two for statistics
          self.misses=0
***************
*** 109,198 ****
          return None
  
-     def open_cachefile(self, cachefile):
-         filetype = options["Storage", "persistent_use_database"]
-         cachefile = os.path.expanduser(cachefile)
-         if filetype == "dbm":
-             self.caches=shelve.open(cachefile)
-             if not self.caches.has_key("A"):
-                 self.caches["A"] = {}
-             if not self.caches.has_key("PTR"):
-                 self.caches["PTR"] = {}
-         elif filetype == "zodb":
-             from ZODB import DB
-             from ZODB.FileStorage import FileStorage
-             self._zodb_storage = FileStorage(cachefile, read_only=False)
-             self._DB = DB(self._zodb_storage, cache_size=10000)
-             self._conn = self._DB.open()
-             root = self._conn.root()
-             self.caches = root.get("dnscache")
-             if self.caches is None:
-                 # There is no classifier, so create one.
-                 from BTrees.OOBTree import OOBTree
-                 self.caches = root["dnscache"] = OOBTree()
-                 self.caches["A"] = {}
-                 self.caches["PTR"] = {}
-                 print "opened new cache"
-             else:
-                 print "opened existing cache with", len(self.caches["A"]), "A records",
-                 print "and", len(self.caches["PTR"]), "PTR records"
- 
      def close(self):
-         if not self.cachefile:
-             return
-         filetype = options["Storage", "persistent_use_database"]
-         if filetype == "dbm":
-             self.caches.close()
-         elif filetype == "zodb":
-             self._zodb_close()
- 
-     def _zodb_store(self):
-         import transaction
-         from ZODB.POSException import ConflictError
-         from ZODB.POSException import TransactionFailedError
- 
-         try:
-             transaction.commit()
-         except ConflictError, msg:
-             # We'll save it next time, or on close.  It'll be lost if we
-             # hard-crash, but that's unlikely, and not a particularly big
-             # deal.
-             if options["globals", "verbose"]:
-                 print >> sys.stderr, "Conflict on commit.", msg
-             transaction.abort()
-         except TransactionFailedError, msg:
-             # Saving isn't working.  Try to abort, but chances are that
-             # restarting is needed.
-             if options["globals", "verbose"]:
-                 print >> sys.stderr, "Store failed.  Need to restart.", msg
-             transaction.abort()
- 
-     def _zodb_close(self):
-         # Ensure that the db is saved before closing.  Alternatively, we
-         # could abort any waiting transaction.  We need to do *something*
-         # with it, though, or it will be still around after the db is
-         # closed and cause problems.  For now, saving seems to make sense
-         # (and we can always add abort methods if they are ever needed).
-         self._zodb_store()
- 
-         # Do the closing.
-         self._DB.close()
- 
-         # We don't make any use of the 'undo' capabilities of the
-         # FileStorage at the moment, so might as well pack the database
-         # each time it is closed, to save as much disk space as possible.
-         # Pack it up to where it was 'yesterday'.
-         # XXX What is the 'referencesf' parameter for pack()?  It doesn't
-         # XXX seem to do anything according to the source.
- ##       self._zodb_storage.pack(time.time()-60*60*24, None)
-         self._zodb_storage.close()
- 
-         self._zodb_closed = True
-         if options["globals", "verbose"]:
-             print >> sys.stderr, 'Closed dnscache database'
- 
- 
-     def __del__(self):
          if self.printStatsAtEnd:
              self.printStats()
  
      def printStats(self):
--- 122,130 ----
          return None
  
      def close(self):
          if self.printStatsAtEnd:
              self.printStats()
+         if self.cachefile:
+             pickle.dump(self.caches, open(self.cachefile, "wb"))
  
      def printStats(self):
***************
*** 201,209 ****
              for item in val.values():
                  totAnswers+=len(item)
!             print "cache %s has %i question(s) and %i answer(s)" % (key,len(self.caches[key]),totAnswers)
          if self.hits+self.misses==0:
!             print "No queries"
          else:
!             print "%i hits, %i misses (%.1f%% hits)" % (self.hits, self.misses, self.hits/float(self.hits+self.misses)*100)
  
      def prune(self,now):
--- 133,144 ----
              for item in val.values():
                  totAnswers+=len(item)
!             print >> sys.stderr, "cache", key, "has", len(self.caches[key]),
!             print >> sys.stderr, "question(s) and", totAnswers, "answer(s)"
          if self.hits+self.misses==0:
!             print >> sys.stderr, "No queries"
          else:
!             print >> sys.stderr, self.hits, "hits,", self.misses, "misses",
!             print >> sys.stderr, "(%.1f%% hits)" % \
!                   (self.hits/float(self.hits+self.misses)*100)
  
      def prune(self,now):
***************
*** 223,232 ****
                  break
              answer=allAnswers.pop()
!             c=self.caches[answer.type]
              c[answer.question].remove(answer)
              if len(c[answer.question])==0:
                  del c[answer.question]
  
!         self.printStats()
  
          if len(allAnswers)<=kPruneDownTo:
--- 158,168 ----
                  break
              answer=allAnswers.pop()
!             c=self.caches[answer.qType]
              c[answer.question].remove(answer)
              if len(c[answer.question])==0:
                  del c[answer.question]
  
!         if options["globals", "verbose"]:
!             self.printStats()
  
          if len(allAnswers)<=kPruneDownTo:
***************
*** 242,246 ****
          for count in range(numToDelete):
              answer=allAnswers.pop()
!             c=self.caches[answer.type]
              c[answer.question].remove(answer)
              if len(c[answer.question])==0:
--- 178,182 ----
          for count in range(numToDelete):
              answer=allAnswers.pop()
!             c=self.caches[answer.qType]
              c[answer.question].remove(answer)
              if len(c[answer.question])==0:
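
The "simple pickled dict" approach from the log message boils down to
loading the dictionary at start-up if the file exists and dumping it again
on close.  The sketch below shows that round trip in Python 3; load_caches
and save_caches are hypothetical free functions rather than the methods of
the cache class, and the with-statements are an addition so the files are
closed promptly.  Note that pickle.load() should only be used on files you
trust.

```python
import os
import pickle


def load_caches(cachefile):
    """Load the cache dict from cachefile, or return a fresh empty one."""
    cachefile = os.path.expanduser(cachefile)
    if cachefile and os.path.exists(cachefile):
        with open(cachefile, "rb") as f:
            return pickle.load(f)
    # No file configured, or nothing saved yet: start with empty caches.
    return {"A": {}, "PTR": {}}


def save_caches(caches, cachefile):
    """Write the cache dict to cachefile; an empty name disables saving."""
    cachefile = os.path.expanduser(cachefile)
    if cachefile:
        with open(cachefile, "wb") as f:
            pickle.dump(caches, f)
```

An empty cachefile string keeps everything in memory only, matching the
option's documented default.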


From montanaro at users.sourceforge.net  Thu Aug 10 06:08:03 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Wed, 09 Aug 2006 21:08:03 -0700
Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, 1.1,
	1.2 Options.py, 1.136, 1.137 tokenizer.py, 1.44, 1.45
Message-ID: <20060810040805.9A76E1E4007@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv21273/spambayes

Modified Files:
	ImageStripper.py Options.py tokenizer.py 
Log Message:
Use PIL to decode input images if available (faster, much more robust, and
platform-independent).  Add a token cache for the ocr output to speed up
that operation.  Slight API change for the OCR code: a single ImageStripper
instance is now created at import time and its analyze method is exported
as crack_images.
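
The token cache described above can be sketched roughly as follows.  This is
a minimal illustration in modern Python 3, not the checked-in code: the class
name OcrCache, the lookup() method and the run_ocr callback are hypothetical,
and hashlib stands in for the old md5 module used in the diff below.

```python
import hashlib
import os
import pickle


class OcrCache:
    """Hypothetical stand-in for the cache kept by ImageStripper."""

    def __init__(self, cachefile=""):
        # An empty name means "keep the cache in memory only".
        self.cachefile = os.path.expanduser(cachefile)
        self.cache = {}
        if self.cachefile and os.path.exists(self.cachefile):
            with open(self.cachefile, "rb") as fp:
                self.cache = pickle.load(fp)
        self.hits = self.misses = 0

    def lookup(self, image_bytes, run_ocr):
        """Return cached OCR output, calling run_ocr only on a miss."""
        key = hashlib.md5(image_bytes).hexdigest()
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = run_ocr(image_bytes)
        return self.cache[key]

    def close(self):
        """Persist the cache; does nothing if no cache file was named."""
        if self.cachefile:
            with open(self.cachefile, "wb") as fp:
                pickle.dump(self.cache, fp)
```

Keying on a digest of the image bytes means the same image arriving in many
messages is run through the OCR program only once.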


Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** ImageStripper.py	6 Aug 2006 17:09:04 -0000	1.1
--- ImageStripper.py	10 Aug 2006 04:07:59 -0000	1.2
***************
*** 3,10 ****
--- 3,28 ----
  """
  
+ from __future__ import division
+ 
+ import sys
  import os
  import tempfile
  import math
  import time
+ import md5
+ import atexit
+ try:
+     import cPickle as pickle
+ except ImportError:
+     import pickle
+ try:
+     import cStringIO as StringIO
+ except ImportError:
+     import StringIO
+ 
+ try:
+     from PIL import Image
+ except ImportError:
+     Image = None
  
  try:
***************
*** 65,128 ****
      return decoders
  
! def decode_parts(parts, decoders):
!     pnmfiles = []
!     for part in parts:
!         decoder = decoders.get(part.get_content_type())
!         if decoder is None:
!             continue
!         try:
!             bytes = part.get_payload(decode=True)
!         except:
!             continue
  
!         if len(bytes) > options["Tokenizer", "max_image_size"]:
!             continue                # assume it's just a picture for now
  
!         fd, imgfile = tempfile.mkstemp()
!         os.write(fd, bytes)
!         os.close(fd)
  
!         fd, pnmfile = tempfile.mkstemp()
!         os.close(fd)
!         os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile))
!         pnmfiles.append(pnmfile)
  
!     if not pnmfiles:
!         return
  
-     if len(pnmfiles) > 1:
-         if find_program("pnmcat"):
              fd, pnmfile = tempfile.mkstemp()
              os.close(fd)
!             os.system("pnmcat -lr %s > %s 2>/dev/null" %
!                       (" ".join(pnmfiles), pnmfile))
!             for f in pnmfiles:
!                 os.unlink(f)
!             pnmfiles = [pnmfile]
  
!     return pnmfiles
  
! def extract_ocr_info(pnmfiles):
!     fd, orf = tempfile.mkstemp()
!     os.close(fd)
  
!     textbits = []
!     tokens = Set()
!     for pnmfile in pnmfiles:
!         ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
!         textbits.append(ocr.read())
!         ocr.close()
!         for line in open(orf):
!             if line.startswith("lines"):
!                 nlines = int(line.split()[1])
!                 if nlines:
!                     tokens.add("image-text-lines:%d" % int(log2(nlines)))
  
!         os.unlink(pnmfile)
!     os.unlink(orf)
  
!     return "\n".join(textbits), tokens
  
- class ImageStripper:
      def analyze(self, parts):
          if not parts:
--- 83,211 ----
      return decoders
  
! def imconcat(im1, im2):
!     # concatenate im1 and im2 left-to-right
!     w1, h1 = im1.size
!     w2, h2 = im2.size
!     im3 = Image.new("RGB", (w1+w2, max(h1, h2)))
!     im3.paste(im1, (0, 0))
!     im3.paste(im2, (0, w1))
!     return im3
  
! class ImageStripper:
!     def __init__(self, cachefile=""):
!         self.cachefile = os.path.expanduser(cachefile)
!         if os.path.exists(self.cachefile):
!             self.cache = pickle.load(open(self.cachefile))
!         else:
!             self.cache = {}
!         self.misses = self.hits = 0
!         if self.cachefile:
!             atexit.register(self.close)
  
!     def NetPBM_decode_parts(self, parts, decoders):
!         pnmfiles = []
!         for part in parts:
!             decoder = decoders.get(part.get_content_type())
!             if decoder is None:
!                 continue
!             try:
!                 bytes = part.get_payload(decode=True)
!             except:
!                 continue
  
!             if len(bytes) > options["Tokenizer", "max_image_size"]:
!                 continue                # assume it's just a picture for now
  
!             fd, imgfile = tempfile.mkstemp()
!             os.write(fd, bytes)
!             os.close(fd)
  
              fd, pnmfile = tempfile.mkstemp()
              os.close(fd)
!             os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile))
!             pnmfiles.append(pnmfile)
!             os.unlink(imgfile)
  
!         if not pnmfiles:
!             return
  
!         if len(pnmfiles) > 1:
!             if find_program("pnmcat"):
!                 fd, pnmfile = tempfile.mkstemp()
!                 os.close(fd)
!                 os.system("pnmcat -lr %s > %s 2>/dev/null" %
!                           (" ".join(pnmfiles), pnmfile))
!                 for f in pnmfiles:
!                     os.unlink(f)
!                 pnmfiles = [pnmfile]
  
!         return pnmfiles
  
!     def PIL_decode_parts(self, parts):
!         full_image = None
!         for part in parts:
!             try:
!                 bytes = part.get_payload(decode=True)
!             except:
!                 continue
  
!             if len(bytes) > options["Tokenizer", "max_image_size"]:
!                 continue                # assume it's just a picture for now
! 
!             # We're dealing with spammers here - who knows what garbage they
!             # will call a GIF image to entice you to open it?
!             try:
!                 image = Image.open(StringIO.StringIO(bytes))
!                 image.load()
!             except IOError:
!                 continue
!             else:
!                 image = image.convert("RGB")
! 
!             if full_image is None:
!                 full_image = image
!             else:
!                 full_image = imconcat(full_image, image)
! 
!         if not full_image:
!             return
! 
!         fd, pnmfile = tempfile.mkstemp()
!         os.close(fd)
!         full_image.save(open(pnmfile, "wb"), "PPM")
! 
!         return [pnmfile]
! 
!     def extract_ocr_info(self, pnmfiles):
!         fd, orf = tempfile.mkstemp()
!         os.close(fd)
! 
!         textbits = []
!         tokens = Set()
!         for pnmfile in pnmfiles:
!             fhash = md5.new(open(pnmfile).read()).hexdigest()
!             if fhash in self.cache:
!                 self.hits += 1
!                 ctext, ctokens = self.cache[fhash]
!             else:
!                 self.misses += 1
!                 ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
!                 ctext = ocr.read().lower()
!                 ocr.close()
!                 ctokens = set()
!                 for line in open(orf):
!                     if line.startswith("lines"):
!                         nlines = int(line.split()[1])
!                         if nlines:
!                             ctokens.add("image-text-lines:%d" %
!                                         int(log2(nlines)))
!                 self.cache[fhash] = (ctext, ctokens)
!             textbits.append(ctext)
!             tokens |= ctokens
!             os.unlink(pnmfile)
!         os.unlink(orf)
! 
!         return "\n".join(textbits), tokens
  
      def analyze(self, parts):
          if not parts:
***************
*** 133,143 ****
              return "", Set()
  
!         decoders = find_decoders()
!         pnmfiles = decode_parts(parts, decoders)
  
!         if not pnmfiles:
!             return "", Set()
  
!         return extract_ocr_info(pnmfiles)
  
!         
--- 216,240 ----
              return "", Set()
  
!         if Image is not None:
!             pnmfiles = self.PIL_decode_parts(parts)
!         else:
!             pnmfiles = self.NetPBM_decode_parts(parts, find_decoders())
  
!         if pnmfiles:
!             return self.extract_ocr_info(pnmfiles)
  
!         return "", Set()
  
! 
!     def close(self):
!         if options["globals", "verbose"]:
!             print >> sys.stderr, "saving", len(self.cache),
!             print >> sys.stderr, "items to", self.cachefile,
!             if self.hits + self.misses:
!                 print >> sys.stderr, "%.2f%% hit rate" % \
!                       (100 * self.hits / (self.hits + self.misses)),
!             print >> sys.stderr
!         pickle.dump(self.cache, open(self.cachefile, "wb"))
! 
! _cachefile = options["Tokenizer", "crack_image_cache"]
! crack_images = ImageStripper(_cachefile).analyze

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.136
retrieving revision 1.137
diff -C2 -d -r1.136 -r1.137
*** Options.py	6 Aug 2006 17:09:05 -0000	1.136
--- Options.py	10 Aug 2006 04:07:59 -0000	1.137
***************
*** 118,122 ****
       token store (only dbm and zodb supported so far, zodb has problems,
       dbm is untested, hence the default)."""),
!      FILE, RESTORE),
  
      ("x-image_size", _("Generate image size tokens"), False,
--- 118,122 ----
       token store (only dbm and zodb supported so far, zodb has problems,
       dbm is untested, hence the default)."""),
!      PATH, RESTORE),
  
      ("x-image_size", _("Generate image size tokens"), False,
***************
*** 134,137 ****
--- 134,142 ----
       BOOLEAN, RESTORE),
  
+     ("crack_image_cache", _("Cache to speed up ocr."), "",
+      _("""If non-empty, names a file from which to read cached ocr info
+      at start and to which to save that info at exit."""),
+      PATH, RESTORE),
+ 
      ("max_image_size", _("Max image size to try OCR-ing"), 100000,
       _("""When crack_images is enabled, this specifies the largest

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.44
retrieving revision 1.45
diff -C2 -d -r1.44 -r1.45
*** tokenizer.py	7 Aug 2006 02:47:10 -0000	1.44
--- tokenizer.py	10 Aug 2006 04:07:59 -0000	1.45
***************
*** 1636,1641 ****
  
          if options["Tokenizer", "x-crack_images"]:
!             from spambayes.ImageStripper import ImageStripper
!             text, tokens = ImageStripper().analyze(parts)
              for t in tokens:
                  yield t
--- 1636,1641 ----
  
          if options["Tokenizer", "x-crack_images"]:
!             from spambayes.ImageStripper import crack_images
!             text, tokens = crack_images(parts)
              for t in tokens:
                  yield t


From anadelonbrin at users.sourceforge.net  Sun Aug 13 04:05:46 2006
From: anadelonbrin at users.sourceforge.net (Tony Meyer)
Date: Sat, 12 Aug 2006 19:05:46 -0700
Subject: [Spambayes-checkins] spambayes/spambayes dnscache.py,1.2,1.3
Message-ID: <20060813020548.AA6721E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv31206/spambayes

Modified Files:
	dnscache.py 
Log Message:
Remove reference to Skip, probably left there by mistake :)

Index: dnscache.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/dnscache.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** dnscache.py	9 Aug 2006 04:26:36 -0000	1.2
--- dnscache.py	13 Aug 2006 02:05:43 -0000	1.3
***************
*** 314,318 ****
  def main():
      import transaction
!     c=cache(cachefile=os.path.expanduser("~skip/.dnscache"))
      c.printStatsAtEnd=True
      for host in ["www.python.org", "www.timsbloggers.com",
--- 314,318 ----
  def main():
      import transaction
!     c=cache(cachefile=os.path.expanduser("~/.dnscache"))
      c.printStatsAtEnd=True
      for host in ["www.python.org", "www.timsbloggers.com",


From montanaro at users.sourceforge.net  Sun Aug 13 18:27:51 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 13 Aug 2006 09:27:51 -0700
Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py,1.2,1.3
Message-ID: <20060813162754.806071E4002@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv18791

Modified Files:
	ImageStripper.py 
Log Message:
The spammers don't just chop up their GIF images left-to-right.  Concatenate
them left-to-right until the height of adjacent images changes, then start a
new row.  At the end concatenate the rows top-to-bottom.

Add a couple of tokens to mark decode errors and oversized images
(invalid-image:<content-type> and image:big).

The *_decode_parts methods don't use the class's state, so make them
functions instead of methods.
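
The row-building rule in the first paragraph of this log message can be
sketched without any imaging library by working on (width, height) pairs
alone.  The names group_into_rows and canvas_size are hypothetical and only
illustrate the layout arithmetic, not the actual pasting of pixels.

```python
def group_into_rows(sizes):
    """Group (width, height) pairs into rows.

    An image extends the current row when its height matches the
    previous image's height; otherwise it starts a new row.
    """
    rows = []
    for w, h in sizes:
        if rows and rows[-1][-1][1] == h:
            rows[-1].append((w, h))
        else:
            rows.append([(w, h)])
    return rows


def canvas_size(rows):
    """Size of the assembled image: rows joined left-to-right,
    then stacked top-to-bottom."""
    row_sizes = [(sum(w for w, _ in row), row[0][1]) for row in rows]
    width = max((w for w, _ in row_sizes), default=0)
    height = sum(h for _, h in row_sizes)
    return (width, height)
```

For example, five slices of sizes (10,5), (20,5), (7,8), (3,8) and (4,5)
form three rows and a 30x18 canvas.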


Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** ImageStripper.py	10 Aug 2006 04:07:59 -0000	1.2
--- ImageStripper.py	13 Aug 2006 16:27:49 -0000	1.3
***************
*** 83,179 ****
      return decoders
  
! def imconcat(im1, im2):
!     # concatenate im1 and im2 left-to-right
!     w1, h1 = im1.size
!     w2, h2 = im2.size
!     im3 = Image.new("RGB", (w1+w2, max(h1, h2)))
!     im3.paste(im1, (0, 0))
!     im3.paste(im2, (0, w1))
!     return im3
  
! class ImageStripper:
!     def __init__(self, cachefile=""):
!         self.cachefile = os.path.expanduser(cachefile)
!         if os.path.exists(self.cachefile):
!             self.cache = pickle.load(open(self.cachefile))
!         else:
!             self.cache = {}
!         self.misses = self.hits = 0
!         if self.cachefile:
!             atexit.register(self.close)
  
!     def NetPBM_decode_parts(self, parts, decoders):
!         pnmfiles = []
!         for part in parts:
!             decoder = decoders.get(part.get_content_type())
!             if decoder is None:
!                 continue
!             try:
!                 bytes = part.get_payload(decode=True)
!             except:
!                 continue
  
!             if len(bytes) > options["Tokenizer", "max_image_size"]:
!                 continue                # assume it's just a picture for now
  
!             fd, imgfile = tempfile.mkstemp()
!             os.write(fd, bytes)
!             os.close(fd)
  
              fd, pnmfile = tempfile.mkstemp()
              os.close(fd)
!             os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile))
!             pnmfiles.append(pnmfile)
!             os.unlink(imgfile)
  
!         if not pnmfiles:
!             return
  
!         if len(pnmfiles) > 1:
!             if find_program("pnmcat"):
!                 fd, pnmfile = tempfile.mkstemp()
!                 os.close(fd)
!                 os.system("pnmcat -lr %s > %s 2>/dev/null" %
!                           (" ".join(pnmfiles), pnmfile))
!                 for f in pnmfiles:
!                     os.unlink(f)
!                 pnmfiles = [pnmfile]
  
!         return pnmfiles
  
!     def PIL_decode_parts(self, parts):
!         full_image = None
!         for part in parts:
!             try:
!                 bytes = part.get_payload(decode=True)
!             except:
!                 continue
  
!             if len(bytes) > options["Tokenizer", "max_image_size"]:
!                 continue                # assume it's just a picture for now
  
!             # We're dealing with spammers here - who knows what garbage they
!             # will call a GIF image to entice you to open it?
!             try:
!                 image = Image.open(StringIO.StringIO(bytes))
!                 image.load()
!             except IOError:
!                 continue
!             else:
!                 image = image.convert("RGB")
  
!             if full_image is None:
!                 full_image = image
!             else:
!                 full_image = imconcat(full_image, image)
  
!         if not full_image:
!             return
  
!         fd, pnmfile = tempfile.mkstemp()
!         os.close(fd)
!         full_image.save(open(pnmfile, "wb"), "PPM")
  
!         return [pnmfile]
  
      def extract_ocr_info(self, pnmfiles):
--- 83,228 ----
      return decoders
  
! def imconcatlr(left, right):
!     """Concatenate two images left to right."""
!     w1, h1 = left.size
!     w2, h2 = right.size
!     result = Image.new("RGB", (w1 + w2, max(h1, h2)))
!     result.paste(left, (0, 0))
!     result.paste(right, (w1, 0))
!     return result
  
! def imconcattb(upper, lower):
!     """Concatenate two images top to bottom."""
!     w1, h1 = upper.size
!     w2, h2 = lower.size
!     result = Image.new("RGB", (max(w1, w2), h1 + h2))
!     result.paste(upper, (0, 0))
!     result.paste(lower, (0, h1))
!     return result
  
! def pnmsize(pnmfile):
!     """Return dimensions of a PNM file."""
!     f = open(pnmfile)
!     line1 = f.readline()
!     line2 = f.readline()
!     w, h = [int(n) for n in line2.split()]
!     return w, h
  
! def NetPBM_decode_parts(parts, decoders):
!     """Decode and assemble a bunch of images using NetPBM tools."""
!     rows = []
!     tokens = Set()
!     for part in parts:
!         decoder = decoders.get(part.get_content_type())
!         if decoder is None:
!             continue
!         try:
!             bytes = part.get_payload(decode=True)
!         except:
!             tokens.add("invalid-image:%s" % part.get_content_type())
!             continue
  
!         if len(bytes) > options["Tokenizer", "max_image_size"]:
!             tokens.add("image:big")
!             continue                # assume it's just a picture for now
  
+         fd, imgfile = tempfile.mkstemp()
+         os.write(fd, bytes)
+         os.close(fd)
+ 
+         fd, pnmfile = tempfile.mkstemp()
+         os.close(fd)
+         os.system("%s <%s >%s 2>dev.null" % (decoder, imgfile, pnmfile))
+         w, h = pnmsize(pnmfile)
+         if not rows:
+             # first image
+             rows.append([pnmfile])
+         elif pnmsize(rows[-1][-1])[1] != h:
+             # new image, different height => start new row
+             rows.append([pnmfile])
+         else:
+             # new image, same height => extend current row
+             rows[-1].append(pnmfile)
+ 
+     for (i, row) in enumerate(rows):
+         if len(row) > 1:
              fd, pnmfile = tempfile.mkstemp()
              os.close(fd)
!             os.system("pnmcat -lr %s > %s 2>/dev/null" %
!                       (" ".join(row), pnmfile))
!             for f in row:
!                 os.unlink(f)
!             rows[i] = pnmfile
!         else:
!             rows[i] = row[0]
  
!     fd, pnmfile = tempfile.mkstemp()
!     os.close(fd)
!     os.system("pnmcat -tb %s > %s 2>/dev/null" % (" ".join(rows), pnmfile))
!     for f in rows:
!         os.unlink(f)
!     return [pnmfile], tokens
  
! def PIL_decode_parts(parts):
!     """Decode and assemble a bunch of images using PIL."""
!     tokens = Set()
!     rows = []
!     for part in parts:
!         try:
!             bytes = part.get_payload(decode=True)
!         except:
!             tokens.add("invalid-image:%s" % part.get_content_type())
!             continue
  
!         if len(bytes) > options["Tokenizer", "max_image_size"]:
!             tokens.add("image:big")
!             continue                # assume it's just a picture for now
  
!         # We're dealing with spammers and virus writers here.  Who knows
!         # what garbage they will call a GIF image to entice you to open
!         # it?
!         try:
!             image = Image.open(StringIO.StringIO(bytes))
!             image.load()
!         except IOError:
!             tokens.add("invalid-image:%s" % part.get_content_type())
!             continue
!         else:
!             image = image.convert("RGB")
  
!         if not rows:
!             # first image
!             rows.append(image)
!         elif image.size[1] != rows[-1].size[1]:
!             # new image, different height => start new row
!             rows.append(image)
!         else:
!             # new image, same height => extend current row
!             rows[-1] = imconcatlr(rows[-1], image)
  
!     if not rows:
!         return [], tokens
  
!     # now concatenate the resulting row images top-to-bottom
!     full_image, rows = rows[0], rows[1:]
!     for image in rows:
!         full_image = imconcattb(full_image, image)
  
!     fd, pnmfile = tempfile.mkstemp()
!     os.close(fd)
!     full_image.save(open(pnmfile, "wb"), "PPM")
  
!     return [pnmfile], tokens
  
! class ImageStripper:
!     def __init__(self, cachefile=""):
!         self.cachefile = os.path.expanduser(cachefile)
!         if os.path.exists(self.cachefile):
!             self.cache = pickle.load(open(self.cachefile))
!         else:
!             self.cache = {}
!         self.misses = self.hits = 0
!         if self.cachefile:
!             atexit.register(self.close)
  
      def extract_ocr_info(self, pnmfiles):
***************
*** 217,228 ****
  
          if Image is not None:
!             pnmfiles = self.PIL_decode_parts(parts)
          else:
!             pnmfiles = self.NetPBM_decode_parts(parts, find_decoders())
  
          if pnmfiles:
!             return self.extract_ocr_info(pnmfiles)
  
!         return "", Set()
  
  
--- 266,280 ----
  
          if Image is not None:
!             pnmfiles, tokens = PIL_decode_parts(parts)
          else:
!             if not find_program("pnmcat"):
!                 return "", Set()
!             pnmfiles, tokens = NetPBM_decode_parts(parts, find_decoders())
  
          if pnmfiles:
!             text, new_tokens = self.extract_ocr_info(pnmfiles)
!             return text, tokens | new_tokens
  
!         return "", tokens
  
  
From montanaro at users.sourceforge.net  Mon Aug 14 04:58:13 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Sun, 13 Aug 2006 19:58:13 -0700
Subject: [Spambayes-checkins] spambayes/spambayes ImageStripper.py, 1.3,
	1.4 Options.py, 1.137, 1.138 OptionsClass.py, 1.32, 1.33
Message-ID: <20060814025816.9CCEB1E4003@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv26750/spambayes

Modified Files:
	ImageStripper.py Options.py OptionsClass.py 
Log Message:
Add scale and charset options (ocrad_scale and ocrad_charset, respectively)
to pass to the ocrad command.  Antonio Diaz Diaz, the author of Ocrad,
suggested scaling up the images.  Ocrad does indeed seem to perform better
with the scaled images.  Scaling by a factor of two seems to do
significantly better than not scaling in my 5x5 N-fold test setup.  Scaling
by a factor of three might even be better, improving false negative
percentages in four of the five sets, but it made the false positive score
worse in one of the five sets, so I left the default scale at 2.

I added the charset flag as well and defaulted to ascii.  So far the
spammers seem to be "GIFting" us with plain English, so searching for
accented characters seems like it would just distract Ocrad.  This has yet
to be tested though.
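
As a rough sketch of how the two new options feed the ocrad command line,
assuming modern Python 3: the function name ocrad_command and the use of
shlex.quote are illustrative additions, while the -s, -c and -x flags, the
default scale of 2, the "or 1" fallback and the three charset names are taken
from the diffs below.

```python
import shlex

# Charsets accepted by the OCRAD_CHARSET option in this check-in.
VALID_CHARSETS = ("ascii", "iso-8859-9", "iso-8859-15")


def ocrad_command(pnmfile, orf, scale=2, charset="ascii"):
    """Build the shell command used to OCR one PNM file."""
    if charset not in VALID_CHARSETS:
        raise ValueError("unsupported ocrad charset: %r" % (charset,))
    scale = scale or 1          # a scale of 0 falls back to "no scaling"
    return "ocrad -s %d -c %s -x %s < %s 2>/dev/null" % (
        scale, charset, shlex.quote(orf), shlex.quote(pnmfile))
```

Quoting the file names is a precaution added in this sketch only; the
temporary names produced by tempfile.mkstemp() would not normally need it.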


Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** ImageStripper.py	13 Aug 2006 16:27:49 -0000	1.3
--- ImageStripper.py	14 Aug 2006 02:58:11 -0000	1.4
***************
*** 232,235 ****
--- 232,237 ----
          textbits = []
          tokens = Set()
+         scale = options["Tokenizer", "ocrad_scale"] or 1
+         charset = options["Tokenizer", "ocrad_charset"]
          for pnmfile in pnmfiles:
              fhash = md5.new(open(pnmfile).read()).hexdigest()
***************
*** 239,243 ****
              else:
                  self.misses += 1
!                 ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
                  ctext = ocr.read().lower()
                  ocr.close()
--- 241,246 ----
              else:
                  self.misses += 1
!                 ocr = os.popen("ocrad -s %s -c %s -x %s < %s 2>/dev/null" %
!                                (scale, charset, orf, pnmfile))
                  ctext = ocr.read().lower()
                  ocr.close()

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.137
retrieving revision 1.138
diff -C2 -d -r1.137 -r1.138
*** Options.py	10 Aug 2006 04:07:59 -0000	1.137
--- Options.py	14 Aug 2006 02:58:11 -0000	1.138
***************
*** 139,142 ****
--- 139,154 ----
       PATH, RESTORE),
  
+     ("ocrad_scale", _("Scale factor to use with ocrad."), 2,
+      _("""Specifies the scale factor to apply when running ocrad.  While
+      you can specify a negative scale it probably won't help.  Scaling up
+      by a factor of 2 or 3 seems to work well for the sort of spam images
+      encountered by SpamBayes."""),
+      INTEGER, RESTORE),
+ 
+     ("ocrad_charset", _("Charset to apply with ocrad."), "ascii",
+      _("""Specifies the charset to use when running ocrad.  Valid values
+      are 'ascii', 'iso-8859-9' and 'iso-8859-15'."""),
+      OCRAD_CHARSET, RESTORE),
+ 
      ("max_image_size", _("Max image size to try OCR-ing"), 100000,
       _("""When crack_images is enabled, this specifies the largest

Index: OptionsClass.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** OptionsClass.py	22 Jun 2006 10:36:58 -0000	1.32
--- OptionsClass.py	14 Aug 2006 02:58:11 -0000	1.33
***************
*** 119,122 ****
--- 119,123 ----
             'IMAP_FOLDER', 'IMAP_ASTRING',
             'RESTORE', 'DO_NOT_RESTORE', 'IP_LIST',
+            'OCRAD_CHARSET',
            ]
  
***************
*** 871,872 ****
--- 872,875 ----
  RESTORE = True
  DO_NOT_RESTORE = False
+ 
+ OCRAD_CHARSET = r"ascii|iso-8859-9|iso-8859-15"


From montanaro at users.sourceforge.net  Fri Aug 18 04:29:05 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Thu, 17 Aug 2006 19:29:05 -0700
Subject: [Spambayes-checkins] spambayes/contrib pycksum.py,1.1,1.2
Message-ID: <20060818022907.D10021E4004@bag.python.org>

Update of /cvsroot/spambayes/spambayes/contrib
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv16513

Modified Files:
	pycksum.py 
Log Message:
* Try to improve the duplicate detection capability.  Lots of spam nowadays
  has random text junk, so be more lenient about how many chunks have to
  match.  Also do a little more filtering on the source:

  - Compress multiple spaces and tabs to a single space
  - Compress multiple contiguous newlines into one
  - Map all strings of digits to a single "#" character
  - Map some common html entities to their plain text equivalents.

* Use md5 checksum hexdigests instead of binascii.b2a_hex.

* Correct line breaking of filtered body.

* Use email.generator to flatten body instead of the broken flatten()
  function.
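
The filtering and checksumming steps listed above might look roughly like
this in modern Python 3.  It is a simplified sketch: the function names
normalize and chunk_checksums are hypothetical, hashlib replaces the old md5
module, and the tag stripping and URL/address removal done by the real
clean() function are left out.

```python
import hashlib
import re


def normalize(text):
    """Apply the extra filtering steps from the log message."""
    text = re.sub(r"[0-9]+", "#", text)        # runs of digits -> "#"
    text = re.sub(r"(&nbsp;)+", " ", text)     # common HTML entities
    text = text.replace("&lt;", "<").replace("&gt;", ">")
    text = text.replace("&amp;", "&")
    text = re.sub(r"\n+", "\n", text)          # drop blank lines
    text = re.sub(r"[ \t]+", " ", text)        # squeeze spaces and tabs
    return text


def chunk_checksums(text, nchunks=4):
    """Return dot-joined MD5 hexdigests of nchunks slices of the text."""
    lines = normalize(text).split("\n")
    chunksize = len(lines) // nchunks + 1
    sums = []
    for i in range(nchunks):
        chunk = "\n".join(lines[i * chunksize:(i + 1) * chunksize])
        sums.append(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return ".".join(sums)
```

Two bodies that differ only in random digits or blank lines normalize to the
same text and therefore produce the same checksum string.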


Index: pycksum.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/contrib/pycksum.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** pycksum.py	25 May 2004 14:58:39 -0000	1.1
--- pycksum.py	18 Aug 2006 02:29:02 -0000	1.2
***************
*** 39,60 ****
  import sys
  import email.Parser
  import md5
  import anydbm
  import re
  import time
! import binascii
! 
! def flatten(body):
!     # three types are possible: list, string, Message
!     if isinstance(body, str):
!         return body
!     if hasattr(body, "get_payload"):
!         payload = body.get_payload()
!         if payload is None:
!             return ""
!         return flatten(payload)
!     if isinstance(body, list):
!         return "\n".join([flatten(b) for b in body])
!     raise TypeError, ("unrecognized body type: %s" % type(body))
  
  def clean(data):
--- 39,51 ----
  import sys
  import email.Parser
+ import email.generator
  import md5
  import anydbm
  import re
  import time
! try:
!     import cStringIO as StringIO
! except ImportError:
!     import StringIO
  
  def clean(data):
***************
*** 67,74 ****
      data = re.sub(r"<[^>]*>", "", data).lower()
  
      # delete anything which looks like a url or email address
      # not sure what a pmguid: url is but it seems to occur frequently in spam
      # also convert all runs of whitespace into a single space
!     return " ".join([w for w in data.split()
                       if ('@' not in w and
                           (':' not in w or
--- 58,78 ----
      data = re.sub(r"<[^>]*>", "", data).lower()
  
+     # Map all digits to '#'
+     data = re.sub(r"[0-9]+", "#", data)
+ 
+     # Map a few common html entities
+     data = re.sub(r"(&nbsp;)+", " ", data)
+     data = re.sub(r"&lt;", "<", data)
+     data = re.sub(r"&gt;", ">", data)
+     data = re.sub(r"&amp;", "&", data)
+ 
+     # Elide blank lines and multiple horizontal whitespace
+     data = re.sub(r"\n+", "\n", data)
+     data = re.sub(r"[ \t]+", " ", data)
+ 
      # delete anything which looks like a url or email address
      # not sure what a pmguid: url is but it seems to occur frequently in spam
      # also convert all runs of whitespace into a single space
!     return " ".join([w for w in data.split(" ")
                       if ('@' not in w and
                           (':' not in w or
***************
*** 87,97 ****
      # separately or in various combinations if desired.
  
!     body = flatten(msg)
!     lines = clean(body)
      chunksize = len(lines)//4+1
      sum = []
      for i in range(4):
          chunk = "\n".join(lines[i*chunksize:(i+1)*chunksize])
!         sum.append(binascii.b2a_hex(md5.new(chunk).digest()))
  
      return ".".join(sum)
--- 91,105 ----
      # separately or in various combinations if desired.
  
!     fp = StringIO.StringIO()
!     g = email.generator.Generator(fp, mangle_from_=False, maxheaderlen=60)
!     g.flatten(msg)
!     text = fp.getvalue()
!     body = text.split("\n\n", 1)[1]
!     lines = clean(body).split("\n")
      chunksize = len(lines)//4+1
      sum = []
      for i in range(4):
          chunk = "\n".join(lines[i*chunksize:(i+1)*chunksize])
!         sum.append(md5.new(chunk).hexdigest())
  
      return ".".join(sum)
***************
*** 102,111 ****
      db = anydbm.open(f, "c")
      maxdblen = 2**14
!     # consider the first three pieces, the last three pieces and the middle
!     # two pieces - one or more will likely eliminate attempts at disrupting
!     # the checksum - if any are found in the db file, call it a match
!     for subsum in (".".join(pieces[:-1]),
                     ".".join(pieces[1:-1]),
!                    ".".join(pieces[1:])):
          if not db.has_key(subsum):
              db[subsum] = str(time.time())
--- 110,119 ----
      db = anydbm.open(f, "c")
      maxdblen = 2**14
!     # consider the first two pieces, the middle two pieces and the last two
!     # pieces - one or more will likely eliminate attempts at disrupting the
!     # checksum - if any are found in the db file, call it a match
!     for subsum in (".".join(pieces[:-2]),
                     ".".join(pieces[1:-1]),
!                    ".".join(pieces[2:])):
          if not db.has_key(subsum):
              db[subsum] = str(time.time())
***************
*** 155,157 ****
  if __name__ == "__main__":
      sys.exit(main(sys.argv[1:]))
- 
--- 163,164 ----


From montanaro at users.sourceforge.net  Fri Aug 18 19:26:52 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Fri, 18 Aug 2006 10:26:52 -0700
Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.54,1.55
Message-ID: <20060818172655.DF0ED1E4004@bag.python.org>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv9126

Modified Files:
	CHANGELOG.txt 
Log Message:
I hope this doesn't break any scripts or irritate anyone too much,
however...  Just as mm/dd/yyyy format looks strange to non-US folks,
dd/mm/yyyy looks just as strange to us cowboy types.  Compromise on ISO-8601
dates.  They sort, they're unambiguous, and they probably piss off both
camps equally well. ;-)


Index: CHANGELOG.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v
retrieving revision 1.54
retrieving revision 1.55
diff -C2 -d -r1.54 -r1.55
*** CHANGELOG.txt	7 Apr 2006 02:37:28 -0000	1.54
--- CHANGELOG.txt	18 Aug 2006 17:26:50 -0000	1.55
***************
*** 1,231 ****
! [Note that all dates are in English, not American format - i.e. day/month/year]
  
  Release 1.1a2
  =============
! Tony Meyer        03/04/2006  Add [ 1081787 ] Adding the version only to sb_filter.py
! Tony Meyer        03/04/2006  Fix [ 1383801 ] trustedIPs wildcard to regex broken
! Tony Meyer        02/04/2006  Fix [ 1387699 ] train_on_filter=True needs the db to be opened read/write
! Tony Meyer        02/04/2006  Fix [ 1387709 ] If globals:dbm_type is non-default, then don't use whichdb.
! Tony Meyer        27/11/2005  Install the conversion utility and offer to run it on Windows install.
! Tony Meyer        26/11/2005  Add conversion utility to easily convert dbm to ZODB.
[...1933 lines suppressed...]
! Tim Stone	2003-02-25	Add option for pop3proxy to notate Subject: header.
! Tony Meyer	2003-02-25	Fix bug in Corpus.get() which would never return the default value.
! Mark Hammond	2003-02-18	"Store Outlook plugin files in the ""correct"" Windows directory."
! Neil Schemenauer	2003-02-16	Add -c and -d options to mailsort.py.
! Neil Schemenauer	2003-02-16	First check-in of dump_cdb.py
! Mark Hammond	2003-02-13	Add SF#685746 ('Outlook plugin folder list sorted alphabetically').
! Mark Hammond	2003-02-13	Handle exceptions when opening folders in Outlook plugin better.
! Skip Montanaro	2003-02-13	Split BAYESCUSTOMIZE environment variable using os.pathsep.
! Mark Hammond	2003-02-12	Check for correct exception when removing file in Outlook addin.
! Mark Hammond	2003-02-12	Check for bsddb3 before bsddb (previously bsddb3 would never be found).
! Tim Stone	2003-02-10	Changed BAYESCUSTOMIZE environment variable parsing from a split to a regex to fix filenames with embedded spaces.
! Tim Stone	2003-02-08	Ensure that nham and nspam are instances of integer in dbExpImp.py
! Tim Stone	2003-02-08	Ensure that nham and nspam becoming strings doesn't break classification.
! Tim Stone	2003-02-08	Added ability to put classification in subject or to headers (for OE).
! Mark Hammond	2003-02-07	Fix some errors using bsddb3 in Outlook plugin.
! Mark Hammond	2003-02-04	"Fix SF#642740 ('""Recover from Spam"" wrong folder')."
! Mark Hammond	2003-02-03	Change train.py to be able to work with a bsddb database.
! Mark Hammond	2003-02-03	If a new bsddb or bsddb3 module is available use this instead of a pickle in the Outlook plugin.
! Mark Hammond	2003-02-03	Add tick-marks to the filter dialog.
! Mark Hammond	2003-02-03	Fix SF#677804 ('Untouched filter command error').
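
Note on the change above: the date rewrite described in the log message is
a day/month/year to ISO 8601 conversion. A minimal sketch using the
standard datetime module follows; the helper name to_iso is hypothetical
and this is not the script (if any) that was actually used on CHANGELOG.txt.

```python
from datetime import datetime


def to_iso(ddmmyyyy):
    """Convert a dd/mm/yyyy string to an ISO 8601 date (YYYY-MM-DD)."""
    return datetime.strptime(ddmmyyyy, "%d/%m/%Y").date().isoformat()


if __name__ == "__main__":
    # "03/04/2006" in the old changelog means 3 April 2006.
    print(to_iso("03/04/2006"))  # prints 2006-04-03
    # ISO dates sort correctly as plain strings.
    print(sorted([to_iso("27/11/2005"), to_iso("02/04/2006")]))
```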


From montanaro at users.sourceforge.net  Fri Aug 18 19:42:39 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Fri, 18 Aug 2006 10:42:39 -0700
Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.55,1.56
Message-ID: <20060818174242.264BB1E400D@bag.python.org>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv16016

Modified Files:
	CHANGELOG.txt 
Log Message:
Add my recent changes to changelog


Index: CHANGELOG.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v
retrieving revision 1.55
retrieving revision 1.56
diff -C2 -d -r1.55 -r1.56
*** CHANGELOG.txt	18 Aug 2006 17:26:50 -0000	1.55
--- CHANGELOG.txt	18 Aug 2006 17:42:37 -0000	1.56
***************
*** 1,4 ****
--- 1,23 ----
  [Note that all dates are in ISO 8601 format, e.g. YYYY-MM-DD to ease sorting]
  
+ Release 1.1a3
+ =============
+ 
+ Skip Montanaro    2006-08-18  Update pycksum.py to try and identify more duplicates
+ Skip Montanaro	  2006-08-14  Add scale and charset options to ImageStripper
+ Skip Montanaro	  2006-08-13  Stitch spam images back together properly, add a couple more tokens
+ Skip Montanaro	  2006-08-10  Add support for PIL to ImageStripper.py
+ Skip Montanaro	  2006-08-09  Cache x-lookup_ip in a pickle instead of trying to use anydbm or zodb
+ Skip Montanaro	  2006-08-06  Add crude OCR capability to try and parse image-based spam using Ocrad & NetPBM
+ Skip Montanaro	  2006-08-06  Add x-short_runs option
+ Skip Montanaro	  2006-08-06  Add x-image_size option & corresponding token
+ Skip Montanaro	  2006-08-06  Add Matt Cowles' x-lookup_ip extension w/ slight modifications
+ Skip Montanaro	  2006-08-06  Add profiling using cProfile (if available) to sb_filter.py
+ Skip Montanaro	  2006-08-06  Delete -d and -p flags from spamcounts.py
+ Skip Montanaro	  2006-08-06  Refactor basic text tokenizing out of tokenize_body into a separate method, tokenize_text
+ Skip Montanaro	  2006-08-05  Explicitly close ZODB store in tte.py
+ Skip Montanaro	  2006-04-23  Reduce sensitivity of spamcounts.py to classifier changes
+ 
+ 
  Release 1.1a2
  =============


From montanaro at users.sourceforge.net  Sat Aug 19 02:26:40 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Fri, 18 Aug 2006 17:26:40 -0700
Subject: [Spambayes-checkins] spambayes CHANGELOG.txt,1.56,1.57
Message-ID: <20060819002643.4F05C1E400C@bag.python.org>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv14309

Modified Files:
	CHANGELOG.txt 
Log Message:
Add other recent changelog bits


Index: CHANGELOG.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/CHANGELOG.txt,v
retrieving revision 1.56
retrieving revision 1.57
diff -C2 -d -r1.56 -r1.57
*** CHANGELOG.txt	18 Aug 2006 17:42:37 -0000	1.56
--- CHANGELOG.txt	19 Aug 2006 00:26:38 -0000	1.57
***************
*** 17,21 ****
--- 17,26 ----
  Skip Montanaro	  2006-08-06  Refactor basic text tokenizing out of tokenize_body into a separate method, tokenize_text
  Skip Montanaro	  2006-08-05  Explicitly close ZODB store in tte.py
+ Tony Meyer	  2006-06-22  Fix bug in regex preventing valid IPs
+ Toby Dickenson	  2006-06-12  Suppress spurious duplicate From_ lines in sb_bnfilter.py
+ Tony Meyer	  2006-06-10  Add simple parts of [ 824651 ] Multibyte message support
+ Tony Meyer	  2006-05-06  Enable -o command line option setting, and follow TestDriver directories in testtools/mksets.py
  Skip Montanaro	  2006-04-23  Reduce sensitivity of spamcounts.py to classifier changes
+ Tony Meyer	  2006-04-22  Set zodb cache size to 10,000
  
  
From montanaro at users.sourceforge.net  Sat Aug 19 02:37:55 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Fri, 18 Aug 2006 17:37:55 -0700
Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.40,1.41
Message-ID: <20060819003757.B88791E4006@bag.python.org>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv18704

Modified Files:
	WHAT_IS_NEW.txt 
Log Message:
Update for 1.1a3


Index: WHAT_IS_NEW.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** WHAT_IS_NEW.txt	27 Nov 2005 02:15:33 -0000	1.40
--- WHAT_IS_NEW.txt	19 Aug 2006 00:37:52 -0000	1.41
***************
*** 16,19 ****
--- 16,88 ----
  is released.
  
+ New in 1.1 Alpha 3
+ ==================
+ 
+ 
+ --------------------------------------------
+ ** Incompatible changes and Transitioning **
+ --------------------------------------------
+ 
+ There should be no incompatible changes since 1.1a2, though users new to the
+ 1.1 series should pay careful attention to the database changes introduced
+ in 1.1a2.
+ 
+ 
+ -------------------
+ ** Other changes **
+ -------------------
+ 
+ General
+ -------
+ 
+ Reported Bugs Fixed
+ ===================
+ No bugs tracked via the Sourceforge system were fixed.
+ 
+ 
+ Patches integrated
+ ===================
+ The following patches tracked via the Sourceforge system were integrated
+ in this release:
+     824651
+ 
+ Feature Requests Added
+ ======================
+ No feature requests tracked via the Sourceforge system were added
+ in this release.
+ 
+ 
+ Experimental Options
+ ====================
+ 
+ In addition to the experimental options listed for the 1.1a2 release, four
+ more new experimental options were added to SpamBayes.  They all need
+ further testing.
+ 
+   o x-short_runs - If true, generate tokens based on max number of short
+     word runs. Short words are anything of length < the skip_max_word_size
+     option.  Normally they are skipped, but one common spam technique spells
+     words like 'V m I n A o G p RA' to try and avoid exposing them to
+     content filters.
+ 
+   o x-lookup_ip - If true, generate IP address tokens from hostnames.  This
+     requires PyDNS (http://pydns.sourceforge.net/).
+ 
+   o x-image_size - If true, generate tokens based on the size of the largest
+     attached image.
+ 
+   o x-crack_images - A lot of recent spam contains the entire message
+     embedded in one or more attached images.  This option, if true,
+     generates tokens based on the (hopefully) text content contained in any
+     images in each message.  The current support is minimal, relies on the
+     installation of ocrad (http://www.gnu.org/software/ocrad/ocrad.html) and
+     the Python Imaging Library (a.k.a. PIL, available at
+     http://www.pythonware.com/products/pil/).  It has not yet been tested on
+     Windows, but for brave souls there is a simple zip file binary of ocrad
+     called "ocrad-cygwin" on the SpamBayes download page for Windows users
+     who can't build it themselves.  PIL has its own Windows binary
+     installers specific to versions of Python as far back as 2.1.
+ 
+ 
  New in 1.1 Alpha 2
  ==================
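
Note on the x-short_runs description above: one way to read "max number of
short word runs" is the length of the longest run of consecutive short
words. The sketch below illustrates that reading only. The function name
and the threshold parameter are assumptions, and the real tokenizer may
count runs and build its tokens differently.

```python
def longest_short_run(words, threshold):
    """Return the length of the longest run of words shorter than threshold."""
    longest = current = 0
    for word in words:
        if len(word) < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


if __name__ == "__main__":
    text = "buy V m I n A o G p RA today"
    # With a threshold of 3 (an assumed value), the obfuscated word is one
    # long run of nine short "words".
    print(longest_short_run(text.split(), 3))  # prints 9
```

Ordinary prose rarely contains long runs of one- and two-letter words, so
a large value could serve as a spam clue.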


From mhammond at users.sourceforge.net  Thu Aug 24 14:42:03 2006
From: mhammond at users.sourceforge.net (Mark Hammond)
Date: Thu, 24 Aug 2006 05:42:03 -0700
Subject: [Spambayes-checkins] spambayes/spambayes __init__.py,1.18,1.19
Message-ID: <20060824124205.F40CC1E400A@bag.python.org>

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv23537/spambayes

Modified Files:
	__init__.py 
Log Message:
Version 1.1a3


Index: __init__.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/__init__.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** __init__.py	20 Apr 2006 03:13:26 -0000	1.18
--- __init__.py	24 Aug 2006 12:41:57 -0000	1.19
***************
*** 6,9 ****
      _ = lambda arg: arg
  
! __version__ = "1.1a2"
! __date__ = _("April 2005")
--- 6,9 ----
      _ = lambda arg: arg
  
! __version__ = "1.1a3"
! __date__ = _("August 2006")


From mhammond at users.sourceforge.net  Thu Aug 24 14:45:46 2006
From: mhammond at users.sourceforge.net (Mark Hammond)
Date: Thu, 24 Aug 2006 05:45:46 -0700
Subject: [Spambayes-checkins] spambayes/windows pop3proxy_tray.py, 1.24, 1.25
Message-ID: <20060824124548.E46E61E4005@bag.python.org>

Update of /cvsroot/spambayes/spambayes/windows
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv24833/windows

Modified Files:
	pop3proxy_tray.py 
Log Message:
re-add the taskbar icon in the case of explorer crashing and restarting


Index: pop3proxy_tray.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/windows/pop3proxy_tray.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** pop3proxy_tray.py	29 Mar 2005 05:59:25 -0000	1.24
--- pop3proxy_tray.py	24 Aug 2006 12:45:42 -0000	1.25
***************
*** 144,148 ****
--- 144,150 ----
                                    1099 : ("Exit SpamBayes", self.OnExit),
                                    }
+         msg_TaskbarRestart = RegisterWindowMessage("TaskbarCreated");
          message_map = {
+             msg_TaskbarRestart: self.OnTaskbarRestart,
              win32con.WM_DESTROY: self.OnDestroy,
              win32con.WM_COMMAND: self.OnCommand,
***************
*** 188,195 ****
                                            16, 16, icon_flags)
  
!         flags = NIF_ICON | NIF_MESSAGE | NIF_TIP
!         nid = (self.hwnd, 0, flags, WM_TASKBAR_NOTIFY, self.hstartedicon,
!             "SpamBayes")
!         Shell_NotifyIcon(NIM_ADD, nid)
          self.started = IsServerRunningAnywhere()
          self.tip = None
--- 190,194 ----
                                            16, 16, icon_flags)
  
!         self._AddTaskbarIcon()
          self.started = IsServerRunningAnywhere()
          self.tip = None
***************
*** 205,208 ****
--- 204,221 ----
                    "a local server"
  
+     def _AddTaskbarIcon(self):
+         flags = NIF_ICON | NIF_MESSAGE | NIF_TIP
+         nid = (self.hwnd, 0, flags, WM_TASKBAR_NOTIFY, self.hstartedicon,
+             "SpamBayes")
+         try:
+             Shell_NotifyIcon(NIM_ADD, nid)
+         except win32api_error:
+             # Apparently can be seen as XP is starting up.  Certainly can
+             # be seen if explorer.exe is not running when started.
+             print "Ignoring error adding taskbar icon - explorer may not " \
+                   "be running (yet)."
+             # The TaskbarRestart message will fire in this case, and
+             # everything will work out :)
+ 
      def BuildToolTip(self):
          tip = None
***************
*** 394,397 ****
--- 407,415 ----
          function()
  
+     def OnTaskbarRestart(self, hwnd, msg, wparam, lparam):
+         # Called as the taskbar is created (either as Windows starts, or
+         # as Windows recovers from a crashed explorer.exe)
+         self._AddTaskbarIcon()
+ 
      def OnExit(self):
          if self.started and not self.use_service:
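
Note on the change above: the pattern in this commit is "try to add the
icon, tolerate failure, and add it again whenever the shell announces that
the taskbar was (re)created". The sketch below simulates that pattern in
portable Python without any win32 calls. FakeShell, ShellUnavailable, Tray
and the message id are all hypothetical stand-ins. In the real code the id
comes from RegisterWindowMessage("TaskbarCreated") and the icon is added
with Shell_NotifyIcon.

```python
class ShellUnavailable(Exception):
    """Stand-in for the error raised when explorer.exe is not running."""


class FakeShell:
    """Simulated shell: refuses icons until it is marked as running."""

    def __init__(self):
        self.running = False
        self.icons = []

    def add_icon(self, name):
        if not self.running:
            raise ShellUnavailable(name)
        self.icons.append(name)


MSG_TASKBAR_CREATED = 0xC000  # made-up id for the simulation


class Tray:
    def __init__(self, shell):
        self.shell = shell
        # Map message ids to handlers, like message_map in the diff.
        self.message_map = {MSG_TASKBAR_CREATED: self.on_taskbar_restart}
        self._add_taskbar_icon()

    def _add_taskbar_icon(self):
        try:
            self.shell.add_icon("SpamBayes")
        except ShellUnavailable:
            # Ignore it: the taskbar-created message will arrive later.
            pass

    def on_taskbar_restart(self):
        self._add_taskbar_icon()

    def dispatch(self, msg):
        handler = self.message_map.get(msg)
        if handler is not None:
            handler()


if __name__ == "__main__":
    shell = FakeShell()
    tray = Tray(shell)                 # shell is down, so no icon yet
    shell.running = True               # explorer comes back
    tray.dispatch(MSG_TASKBAR_CREATED)
    print(shell.icons)                 # prints ['SpamBayes']
```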


From mhammond at users.sourceforge.net  Thu Aug 24 15:18:34 2006
From: mhammond at users.sourceforge.net (Mark Hammond)
Date: Thu, 24 Aug 2006 06:18:34 -0700
Subject: [Spambayes-checkins] spambayes/windows/py2exe setup_all.py, 1.26,
	1.27
Message-ID: <20060824131835.EB71E1E4005@bag.python.org>

Update of /cvsroot/spambayes/spambayes/windows/py2exe
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540

Modified Files:
	setup_all.py 
Log Message:
Ship with PIL (but no Tkinter) and pyDNS


Index: setup_all.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/windows/py2exe/setup_all.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** setup_all.py	28 Feb 2006 08:11:40 -0000	1.26
--- setup_all.py	24 Aug 2006 13:18:32 -0000	1.27
***************
*** 47,54 ****
                 "spambayes.languages.fr,spambayes.languages.es.DIALOGS," \
                 "spambayes.languages.es_AR.DIALOGS," \
!                "spambayes.languages.fr.DIALOGS",
!     excludes = "win32ui,pywin,pywin.debugger", # pywin is a package, and still seems to be included.
!     includes = "dialogs.resources.dialogs,weakref", # Outlook dynamic dialogs
!     dll_excludes = "dapi.dll,mapi32.dll",
      typelibs = [
          ('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0),
--- 47,61 ----
                 "spambayes.languages.fr,spambayes.languages.es.DIALOGS," \
                 "spambayes.languages.es_AR.DIALOGS," \
!                "spambayes.languages.fr.DIALOGS," \
!                "PIL",
!     excludes = "Tkinter," # side-effect of PIL and markh doesn't have it :)
!                 "win32ui,pywin,pywin.debugger," # *sob* - these still appear
!                 # Keep zope out else outlook users lose training.
!                 # (sob - but some of these may still appear!)
!                "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence",
!     includes = "dialogs.resources.dialogs,weakref," # Outlook dynamic dialogs
!                "BmpImagePlugin,JpegImagePlugin", # PIL modules not auto found
!     dll_excludes = "dapi.dll,mapi32.dll,"
!                    "tk84.dll,tcl84.dll", # No Tkinter == no tk/tcl dlls
      typelibs = [
          ('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0),


From anadelonbrin at users.sourceforge.net  Fri Aug 25 02:43:30 2006
From: anadelonbrin at users.sourceforge.net (Tony Meyer)
Date: Thu, 24 Aug 2006 17:43:30 -0700
Subject: [Spambayes-checkins] spambayes/windows spambayes.iss,1.25,1.26
Message-ID: <20060825004333.172E51E4004@bag.python.org>

Update of /cvsroot/spambayes/spambayes/windows
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv32424/windows

Modified Files:
	spambayes.iss 
Log Message:
Bump version number.

For 1.1a3 at least, include ocrad.exe and the patch required to build it.

Display license.  Maybe binary users aren't aware that this gets installed, and so
 this might get rid of some of the "can I do X with spambayes" queries.  For 1.1a3
 at least, it also clarifies where ocrad comes from.

Fix typo.

Index: spambayes.iss
===================================================================
RCS file: /cvsroot/spambayes/spambayes/windows/spambayes.iss,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** spambayes.iss	27 Nov 2005 00:42:11 -0000	1.25
--- spambayes.iss	25 Aug 2006 00:43:28 -0000	1.26
***************
*** 5,11 ****
  [Setup]
  ; Version specific constants
! AppVerName=SpamBayes 1.1a1
! AppVersion=1.1a1
! OutputBaseFilename=spambayes-1.1a1
  ; Normal constants.  Be careful about changing 'AppName'
  AppName=SpamBayes
--- 5,11 ----
  [Setup]
  ; Version specific constants
! AppVerName=SpamBayes 1.1a3
! AppVersion=1.1a3
! OutputBaseFilename=spambayes-1.1a3
  ; Normal constants.  Be careful about changing 'AppName'
  AppName=SpamBayes
***************
*** 15,18 ****
--- 15,19 ----
  ShowComponentSizes=no
  UninstallDisplayIcon={app}\sbicon.ico
+ LicenseFile=py2exe\dist\license.txt
  
  [Files]
***************
*** 51,54 ****
--- 52,59 ----
  Source: "py2exe\dist\bin\convert_database.exe"; DestDir: "{app}\bin"; Flags: ignoreversion
  
+ ; Include ocrad.exe and the patch required to get it to compile for Windows.
+ Source: "py2exe\ocrad.exe"; DestDir: "{app}\bin"; Flags: ignoreversion
+ Source: "py2exe\ocrad.patch"; DestDir: "{app}\docs"; Flags: ignoreversion
+ 
  ; There is a problem attempting to get Inno to unregister our DLL.  If we mark our DLL
  ; as 'regserver', it installs and registers OK, but at uninstall time, it unregisters
***************
*** 90,94 ****
    InstallOutlook, InstallProxy, InstallIMAP: Boolean;
    WarnedNoOutlook, WarnedBoth : Boolean;
!   startup, desktop, allusers, startup_imap : Boolean; // Tasks
  
  function InstallingOutlook() : Boolean;
--- 95,99 ----
    InstallOutlook, InstallProxy, InstallIMAP: Boolean;
    WarnedNoOutlook, WarnedBoth : Boolean;
!   startup, desktop, allusers, startup_imap, convert_db : Boolean; // Tasks
  
  function InstallingOutlook() : Boolean;


From montanaro at users.sourceforge.net  Fri Aug 25 04:02:16 2006
From: montanaro at users.sourceforge.net (Skip Montanaro)
Date: Thu, 24 Aug 2006 19:02:16 -0700
Subject: [Spambayes-checkins] spambayes WHAT_IS_NEW.txt,1.41,1.42
Message-ID: <20060825020218.5E98D1E4007@bag.python.org>

Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv3443

Modified Files:
	WHAT_IS_NEW.txt 
Log Message:
Slight update.


Index: WHAT_IS_NEW.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/WHAT_IS_NEW.txt,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** WHAT_IS_NEW.txt	19 Aug 2006 00:37:52 -0000	1.41
--- WHAT_IS_NEW.txt	25 Aug 2006 02:02:12 -0000	1.42
***************
*** 67,74 ****
  
    o x-lookup_ip - If true, generate IP address tokens from hostnames.  This
!     requires PyDNS (http://pydns.sourceforge.net/).
  
    o x-image_size - If true, generate tokens based on the size of the largest
!     attached image.
  
    o x-crack_images - A lot of recent spam contains the entire message
--- 67,75 ----
  
    o x-lookup_ip - If true, generate IP address tokens from hostnames.  This
!     requires PyDNS (http://pydns.sourceforge.net/).  This is included in the
!     Windows installer. 
  
    o x-image_size - If true, generate tokens based on the size of the largest
!     attached image. 
  
    o x-crack_images - A lot of recent spam contains the entire message
***************
*** 79,86 ****
      the Python Imaging Library (a.k.a. PIL, available at
      http://www.pythonware.com/products/pil/).  It has not yet been tested on
!     Windows, but for brave souls there is a simple zip file binary of ocrad
!     called "ocrad-cygwin" on the SpamBayes download page for Windows users
!     who can't build it themselves.  PIL has its own Windows binary
!     installers specific to versions of Python as far back as 2.1.
  
  
--- 80,84 ----
      the Python Imaging Library (a.k.a. PIL, available at
      http://www.pythonware.com/products/pil/).  It has not yet been tested on
!     Windows, but is available in the Windows installer (as is PIL).