[spambayes-dev] Trouble w/ zodb persistence of dnscache

skip at pobox.com
Tue Aug 1 12:51:00 CEST 2006


I've been using Matt Cowles's x-lookup-ip extension with some success
recently to reveal the real IP addresses behind spammers' hostnames.  For
example, the following hostnames are mentioned in pharma come-ons:

    % host www.astlehover.com
    www.astlehover.com has address 211.144.68.87
    % host www.tornetseen.com
    www.tornetseen.com has address 211.144.68.87
    % host www.erlikuvera.com
    www.erlikuvera.com has address 211.144.68.87
    % host www.oplimazexu.com
    www.oplimazexu.com has address 211.144.68.87

The rest of the message content is pretty well disguised (very little
content, random common-text boilerplate, etc.), so without IP lookup these
messages tend to plop into my unsure mailbox; sometimes they even score low
enough to land in my regular inbox.

Matt's extension solves that by looking up the IP addresses for hosts it
encounters and generating a number of new tokens:

    % spamcounts -r :211          
    token,nspam,nham,spam prob
    url-ip:211.144.68.87/32,1,0,0.844827586207
    url-ip:211.144.68/24,1,0,0.844827586207
    url-ip:211/8,4,0,0.949438202247
    url-ip:211.20.189/24,1,0,0.844827586207
    url-ip:211.189.18/24,1,0,0.844827586207
    url-ip:211.144/16,1,0,0.844827586207
    received:211.95.72.130,1,0,0.844827586207
    url-ip:211.189.18.186/32,1,0,0.844827586207
    url-ip:211.22.166.116/32,1,0,0.844827586207
    received:211.96,1,0,0.844827586207
    received:211.95,1,0,0.844827586207
    url-ip:211.22.166/24,1,0,0.844827586207
    received:211.95.72,1,0,0.844827586207
    url-ip:211.20/16,1,0,0.844827586207
    url-ip:211.20.189.50/32,1,0,0.844827586207
    received:211.96.42,1,0,0.844827586207
    url-ip:211.22/16,1,0,0.844827586207
    received:211,2,0,0.908163265306
    received:211.96.42.103,1,0,0.844827586207
    url-ip:211.189/16,1,0,0.844827586207
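Incidentally, the url-ip prefix tokens above can be derived mechanically
from a dotted quad, one token per prefix length; a sketch (illustrative
only, not Matt's actual code):

```python
def ip_tokens(ip):
    # Emit one "url-ip" token per prefix length, most specific first:
    # a.b.c.d/32, a.b.c/24, a.b/16, a/8 -- matching the spamcounts
    # output above.
    parts = ip.split(".")
    return ["url-ip:%s/%d" % (".".join(parts[:n]), 8 * n)
            for n in (4, 3, 2, 1)]
```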

Unfortunately, the extension doesn't cache IP addresses across sessions.  My
train-to-exhaustion scheme scores my entire training database, so the first
round of scoring is very time-consuming.
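For context, the caching the extension already does within a session
amounts to memoizing the resolver; a minimal sketch (the cached_lookup name
and resolver hook are mine, not the extension's):

```python
import socket

_dns_cache = {}  # hostname -> list of IP addresses, this session only

def cached_lookup(hostname, resolver=socket.gethostbyname_ex):
    # Hit the network only the first time a hostname is seen; every
    # later lookup in the same session is a dictionary hit.
    if hostname not in _dns_cache:
        _dns_cache[hostname] = resolver(hostname)[2]
    return _dns_cache[hostname]
```

The whole point of my patch is to make that dictionary survive across
processes.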

I decided to solve that shortcoming.  I added "dbm" and "zodb" support to
Matt's dnscache module, since those are probably the two most prevalent
storage schemes (default and emeritus default).  I've been testing the zodb
scheme but having trouble with it.  If I start with no ~/.dnscache* files it
correctly creates a new one.  If I have an existing database already, it
doesn't update the database file, though the timestamps on the .index and
.tmp files are updated.

I asked on zodb-dev and got some partial help (I was relying on __del__ to
close() the FileStorage object), but even with that fixed it's not working
properly.  My recent pleas for help have gone unanswered, so I'm turning to
this list.  My zodb code was cribbed from the support in SpamBayes itself,
so maybe the author of that code will see what I've done wrong.

I set up the cache in tokenizer.py like so:

    try:
        import dnscache
        cache = dnscache.cache(cachefile=os.path.expanduser("~/.dnscache"))
        cache.printStatsAtEnd = True
    except (IOError, ImportError):
        cache = None
    else:
        import atexit
        atexit.register(cache.close)

In the cache class's __init__ I open the cachefile if given:

    if cachefile:
      self.open_cachefile(cachefile)
    else:
      self.caches={ "A": {}, "PTR": {} }

    def open_cachefile(self, cachefile):
      filetype = options["Storage", "persistent_use_database"]
      cachefile = os.path.expanduser(cachefile)
      if filetype == "dbm":
        existed = os.path.exists(cachefile)
        self.caches = shelve.open(cachefile)
        if not existed:
          self.caches["A"] = {}
          self.caches["PTR"] = {}
      elif filetype == "zodb":
        from ZODB import DB
        from ZODB.FileStorage import FileStorage
        self._zodb_storage = FileStorage(cachefile, read_only=False)
        self._DB = DB(self._zodb_storage, cache_size=10000)
        self._conn = self._DB.open()
        root = self._conn.root()
        self.caches = root.get("dnscache")
        if self.caches is None:
          # There is no cache yet, so create one.
          from BTrees.OOBTree import OOBTree
          self.caches = root["dnscache"] = OOBTree()
          self.caches["A"] = {}
          self.caches["PTR"] = {}
          print "opened new cache"
        else:
          print "opened existing cache with", len(self.caches["A"]), "A records",
          print "and", len(self.caches["PTR"]), "PTR records"

and when it's closed, this code executes:

    def close(self):
      filetype = options["Storage", "persistent_use_database"]
      if filetype == "dbm":
        self.caches.close()
      elif filetype == "zodb":
        self._zodb_close()

    def _zodb_store(self):
        import transaction
        from ZODB.POSException import ConflictError
        from ZODB.POSException import TransactionFailedError

        try:
            transaction.commit()
        except ConflictError, msg:
            # We'll save it next time, or on close.  It'll be lost if we
            # hard-crash, but that's unlikely, and not a particularly big
            # deal.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Conflict on commit.", msg
            transaction.abort()
        except TransactionFailedError, msg:
            # Saving isn't working.  Try to abort, but chances are that
            # restarting is needed.
            if options["globals", "verbose"]:
              print >> sys.stderr, "Store failed.  Need to restart.", msg
            transaction.abort()

    def _zodb_close(self):
        # Ensure that the db is saved before closing.  Alternatively, we
        # could abort any waiting transaction.  We need to do *something*
        # with it, though, or it will be still around after the db is
        # closed and cause problems.  For now, saving seems to make sense
        # (and we can always add abort methods if they are ever needed).
        self._zodb_store()

        # Do the closing.        
        self._DB.close()

        # We don't make any use of the 'undo' capabilities of the
        # FileStorage at the moment, so might as well pack the database
        # each time it is closed, to save as much disk space as possible.
        # Pack it up to where it was 'yesterday'.
        # XXX What is the 'referencesf' parameter for pack()?  It doesn't
        # XXX seem to do anything according to the source.
        ## self._zodb_storage.pack(time.time()-60*60*24, None)
        self._zodb_storage.close()

        self._zodb_closed = True
        if options["globals", "verbose"]:
            print >> sys.stderr, 'Closed dnscache database'
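A note on what transaction.commit() actually writes, since it bears on
this sort of debugging: ZODB persists only objects whose _p_changed flag is
set.  Assigning a key on a Persistent container such as an OOBTree sets the
flag, but a plain dict stored as a value inside it is an ordinary Python
object, so mutating it in place sets nothing and a commit can succeed
without touching the data file.  A toy stand-in for that bookkeeping (pure
stdlib, just an analogy, not the ZODB API):

```python
class TrackedMapping(object):
    # Toy analogue of a Persistent container: "changed" plays the role
    # of ZODB's _p_changed flag.
    def __init__(self):
        self._data = {}
        self.changed = False

    def __setitem__(self, key, value):
        self._data[key] = value
        self.changed = True        # top-level assignment is noticed

    def __getitem__(self, key):
        return self._data[key]     # hands back the value itself; mutating
                                   # it never goes through __setitem__

caches = TrackedMapping()
caches["A"] = {}                   # noticed: changed is now True
caches.changed = False             # pretend a commit just ran
caches["A"]["www.example.com"] = ["1.2.3.4"]   # plain-dict mutation
dirty = caches.changed             # still False: a "commit" would skip it
```

The usual ZODB remedies are a persistent.mapping.PersistentMapping in place
of each plain dict, reassigning the dict after mutating it, or setting
_p_changed = True on the container by hand.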

When run, it correctly announces that it's either creating a new cache or
that it opened an existing cache, e.g.:

    opened existing cache with 479 A records and 0 PTR records

No errors appear on stdout or stderr during the run.  At completion it
announces, "Closed dnscache database".

I can see that the database isn't getting updated, because a) its timestamp
doesn't change, and b) running strings over the file and grepping for new
hostnames doesn't show them:

    % # this one exists...
    % strings -a ~/.dnscache* | egrep -i timsblogger
    www.timsbloggers.comq
    % # this one is new...
    % strings -a ~/.dnscache* | egrep -i tradelink
    % # bummer...

Does anyone have any suggestions about getting this beast to work properly?

Thx,

Skip

