[Spambayes-checkins] spambayes/spambayes storage.py,1.6,1.7

Mark Hammond mhammond at users.sourceforge.net
Thu May 29 18:37:22 EDT 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9910

Modified Files:
	storage.py 
Log Message:
Two changes to the way the DB classifier manages words:

* As per Tim P's mail, keep a list of "changed words" with a flag 
indicating "change" or "delete".  This prevents the database save
from updating every single word ever loaded by the db.

* From Sean, a change that prevents caching of hapaxes (words that have
been seen only once).  Such words are saved directly to the DB.  This
reduces the memory footprint significantly (as these words are not kept
in memory) and improves save times.

Together, these changes make "incremental" saves of the database complete
in a reasonable time, and save times no longer degrade after, for example,
a complete retrain.

I'm off for a weekend holiday - someone can just back this out if I
screwed it up <wink>
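
The two changes described in the log message can be sketched outside
the classifier as follows.  This is a minimal illustration, not the code
from storage.py: the class name DirtyTrackingStore, the method names, and
the use of a plain dict and (spamcount, hamcount) tuples in place of the
real database and word records are all assumptions made for the example,
and it is written for current Python 3 rather than the Python 2 of the
patch.

```python
WORD_DELETED = "D"
WORD_CHANGED = "C"


class DirtyTrackingStore:
    """Toy stand-in (hypothetical) for the DB-backed word cache."""

    def __init__(self, db):
        self.db = db              # any dict-like persistent mapping
        self.wordinfo = {}        # in-memory cache of non-hapax records
        self.changed_words = {}   # word -> WORD_CHANGED or WORD_DELETED

    def get(self, word):
        # Serve from the cache first, then fall back to the database.
        if word in self.wordinfo:
            return self.wordinfo[word]
        # Assumption of this sketch: a word with a pending delete is
        # treated as absent rather than re-fetched from the database.
        if self.changed_words.get(word) == WORD_DELETED:
            return None
        record = self.db.get(word)
        if record is not None:
            self.wordinfo[word] = record
        return record

    def set(self, word, record):
        spamcount, hamcount = record
        if spamcount + hamcount <= 1:
            # Hapax: write straight through and keep it out of memory.
            self.db[word] = record
            self.wordinfo.pop(word, None)       # precaution in this sketch
            self.changed_words.pop(word, None)  # nothing left to flush
        else:
            self.wordinfo[word] = record
            self.changed_words[word] = WORD_CHANGED

    def delete(self, word):
        self.wordinfo.pop(word, None)
        self.changed_words[word] = WORD_DELETED

    def store(self):
        """Flush only the flagged words; return how many were touched."""
        touched = 0
        for word, flag in list(self.changed_words.items()):
            if flag == WORD_CHANGED:
                self.db[word] = self.wordinfo[word]
            elif flag == WORD_DELETED:
                self.db.pop(word, None)
            else:
                raise RuntimeError("Unknown flag value")
            touched += 1
        self.changed_words = {}
        return touched


db = {}
store = DirtyTrackingStore(db)
store.set("cheap", (5, 0))   # cached and flagged as changed
store.set("xyzzy", (1, 0))   # hapax: already in db, never cached
first_save = store.store()   # writes only "cheap"
second_save = store.store()  # nothing changed since, so nothing written
```

The point of the design is that the cost of store() scales with the number
of words touched since the last save, not with the number of words ever
loaded into the cache.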


Index: storage.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/storage.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** storage.py	26 May 2003 05:34:13 -0000	1.6
--- storage.py	30 May 2003 00:37:19 -0000	1.7
***************
*** 131,134 ****
--- 131,137 ----
          fp.close()
  
+ # Values for our changed words map
+ WORD_DELETED = "D"
+ WORD_CHANGED = "C"
  
  class DBDictClassifier(classifier.Classifier):
***************
*** 169,172 ****
--- 172,176 ----
              self.nham = 0
          self.wordinfo = {}
+         self.changed_words = {} # value may be one of the WORD_ constants
  
      def store(self):
***************
*** 176,190 ****
              print 'Persisting',self.db_name,'state in database'
  
!         # Must use .keys() since we modify the dict in the loop
!         for key in self.wordinfo.keys():
!             val = self.wordinfo[key]
!             if val is None:
!                 del self.wordinfo[key]
!                 try:
!                     del self.db[key]
!                 except KeyError:
!                     pass
!             else:
                  self.db[key] = val.__getstate__()
          self.db[self.statekey] = (classifier.PICKLE_VERSION,
                                    self.nspam, self.nham)
--- 180,202 ----
              print 'Persisting',self.db_name,'state in database'
  
!         # Iterate over our changed word list.
!         # This is *not* thread-safe - another thread changing our
!         # changed_words could mess us up a little.  Possibly a little
!         # lock while we copy and reset self.changed_words would be appropriate.
!         # For now, just do it the naive way.
!         for key, flag in self.changed_words.items():
!             if flag == WORD_CHANGED:
!                 val = self.wordinfo[key]
                  self.db[key] = val.__getstate__()
+             elif flag == WORD_DELETED:
+                 assert not self.wordinfo.has_key(key), \
+                        "Should not have a wordinfo for words flagged for delete"
+                 del self.db[key]
+             else:
+                 raise RuntimeError, "Unknown flag value"
+ 
+         # Reset the changed word list.
+         self.changed_words = {}
+         # Update the global state, then do the actual save.
          self.db[self.statekey] = (classifier.PICKLE_VERSION,
                                    self.nspam, self.nham)
***************
*** 192,198 ****
  
      def _wordinfoget(self, word):
-         # Note an explicit None in the dict means the word
-         # has previously been deleted, but the DB has not been saved,
-         # so therefore should not be re-fecthed.
          try:
              return self.wordinfo[word]
--- 204,207 ----
***************
*** 206,214 ****
              return ret
  
!     # _wordinfoset is the same
  
      def _wordinfodel(self, word):
!         self.wordinfo[word] = None
! 
  
  class Trainer:
--- 215,243 ----
              return ret
  
!     def _wordinfoset(self, word, record):
!         # "Singleton" words (i.e. words that only have a single instance)
!         # take up more than 1/2 of the database, but are rarely used
!         # so we don't put them into the wordinfo cache, but write them
!         # directly to the database
!         # If the word occurs again, then it will be brought back in and
!         # never be a singleton again.
!         # This seems to reduce the memory footprint of the DBDictClassifier by
!         # as much as 60%!!!  This also has the effect of reducing the time it
!         # takes to store the database
!         if record and (record.spamcount+record.hamcount <= 1):
!             self.db[word] = record.__getstate__()
!             # Remove this word from the changed list (not that it should be
!             # there, but strange things can happen :)
!             try:
!                 del self.changed_words[word]
!             except KeyError:
!                 pass
!         else:
!             self.wordinfo[word] = record
!             self.changed_words[word] = WORD_CHANGED
  
      def _wordinfodel(self, word):
!         del self.wordinfo[word]
!         self.changed_words[word] = WORD_DELETED
  
  class Trainer:
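
The comment in store() above suggests that a small lock around copying
and resetting self.changed_words might be appropriate.  One way that idea
could look is sketched below.  It is only an illustration under stated
assumptions: LockedChangeLog, mark, take_snapshot and flush are
hypothetical names that do not appear in storage.py, plain dicts stand in
for the database and the wordinfo cache, and the code targets Python 3.

```python
import threading

WORD_DELETED = "D"
WORD_CHANGED = "C"


class LockedChangeLog:
    """Hypothetical holder for the changed-words map, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.changed_words = {}

    def mark(self, word, flag):
        # Writers take the lock only long enough to record one flag.
        with self._lock:
            self.changed_words[word] = flag

    def take_snapshot(self):
        # Swap in a fresh dict while holding the lock, so that flags
        # recorded during a save land in the new dict and are picked up
        # by the next save instead of being lost by the reset.
        with self._lock:
            snapshot, self.changed_words = self.changed_words, {}
        return snapshot


def flush(log, wordinfo, db):
    """Write the snapshot of flagged words to db, outside the lock."""
    for word, flag in log.take_snapshot().items():
        if flag == WORD_CHANGED:
            db[word] = wordinfo[word]
        elif flag == WORD_DELETED:
            db.pop(word, None)
        else:
            raise RuntimeError("Unknown flag value")
```

Note that this protects only the changed-words map.  Reads of wordinfo
during the flush are still unguarded, so a complete solution would need
to consider that cache as well.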




