[spambayes-dev] Pickle vs DB inconsistencies

Thu Jun 12 22:19:28 EDT 2003

I'm getting inconsistent results using the same training corpus when I
store the database to a pickle vs a DB file.  Here's how I created the
training databases (once DB, once pickle):

$ hammie.py -d -p db/default.db -g corpus/default/ham -s corpus/default/spam
$ hammie.py -D -p db/default.pkl -g corpus/default/ham -s corpus/default/spam

Results are unsurprising:

$ ll db/default.{db,pkl}
-rw-rw-r--    1 greg     dev       2600960 Jun 12 20:20 db/default.db
-rw-rw-r--    1 greg     dev       2277665 Jun 12 20:16 db/default.pkl

Now I try to score a message with each database:

$ msg=corpus/checkins/ham/cur/19OLiy-0000XO-00:2,S
$ hammie.py -f -d -p db/default.db < $msg | grep X-Spambayes
X-Spambayes-Classification: unsure; 0.26
$ hammie.py -f -D -p db/default.pkl < $msg | grep X-Spambayes
X-Spambayes-Classification: ham; 0.15

Huh?!?  My own scoring script (which exists only because I like one line
of output per scored message) shows the same thing:

$ ./score -d db/default.db $msg
? 0.258 corpus/checkins/ham/cur/19OLiy-0000XO-00:2,S
$ ./score -d db/default.pkl $msg
N 0.153 corpus/checkins/ham/cur/19OLiy-0000XO-00:2,S

The other neat feature of my "score" script is its -v option, which
dumps all the clues.  -v on the above two runs reveals that the sets of
clue tokens are *nearly* identical, but the scores of individual tokens
differ subtly between DB and pickle.  Some excerpts (DB on the left,
pickle on the right):

DB store                               pickle store
'*H*': 1.000                           '*H*': 0.991
'*S*': 0.515                           '*S*': 0.297                         
'to:spambayes': 0.001                  'to:spambayes': 0.002                
'from:Greg': 0.003                     'from:Greg': 0.006                   
'system.': 0.006                       'system.': 0.012                     
'binary': 0.007                        'binary': 0.014                      
'(not': 0.009                          '(not': 0.018                        
[...]
'taking': 0.370                        'taking': 0.372                      
'are': 0.377                           'are': 0.377                         
'reply-to:none': 0.379                 'reply-to:none': 0.379               
'for': 0.382                           'for': 0.382                         
'privileges': 0.386                    [not in the pickle store]
'header:Received:2': 0.392             'header:Received:2': 0.392           
'windows': 0.397                       'windows': 0.398                     
'unable': 0.606                        'unable': 0.604                      
'west': 0.617                          'west': 0.614                        
[...]
'notified': 0.946                      'staff': 0.937                       
'hereby': 0.949                        'notified': 0.941                    
'click': 0.955                         'federal': 0.941                     
'message-id:skip:3 30': 0.965          'click': 0.955                       
'belonging': 0.978                     'belonging': 0.959                   
'los': 0.984                           'los': 0.970                         
'medical': 0.984                       'medical': 0.970                     
'message-id:skip:t 20': 0.988          'message-id:skip:t 20': 0.976        
'street,': 0.990                       'street,': 0.980

(The correspondence gets jumbled near the end because the tokens are
sorted by score; it appears that the variance is higher near the top
end.)
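
Comparing two clue dumps by eye gets tedious once the sort order drifts,
so here is a minimal sketch of doing it mechanically.  It assumes each
dump is plain text with one 'token': score clue per line, as in the
excerpts above; parse_clues and compare_clues are hypothetical helper
names, not part of spambayes or of the "score" script.

```python
import re

# Matches one clue as printed in the excerpts, e.g. 'to:spambayes': 0.001
# (assumed format; tokens may contain spaces, as in 'message-id:skip:3 30').
CLUE_RE = re.compile(r"'((?:[^'\\]|\\.)*)':\s*([01]\.\d+)")


def parse_clues(text):
    """Return {token: score} for every clue found in one dump."""
    return {token: float(score) for token, score in CLUE_RE.findall(text)}


def compare_clues(a, b, tolerance=0.0005):
    """Split tokens into: only in a, only in b, and shared but different."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {
        token: (a[token], b[token])
        for token in a.keys() & b.keys()
        if abs(a[token] - b[token]) > tolerance
    }
    return only_a, only_b, changed
```

Feed it the text of each -v run separately (not the side-by-side layout
shown above); tokens are then matched by name rather than by their
position in the sorted output.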

Anybody have a clue WTF is going on here?  I'm running a
several-days-old CVS spambayes, so I'll try "cvs up" first.  And then I
guess I'll start picking through the DB and pickle files manually to see
if those differences are visible that way.  But I have no idea what that
will tell me ...
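
For the manual comparison, something along these lines might do as a
starting point.  It is only a sketch: it assumes both stores have already
been loaded into plain mappings of token -> (spam count, ham count),
which is a guess at the layout, and it leaves out the loading step
entirely.  diff_counts is a hypothetical name.

```python
def diff_counts(db_counts, pkl_counts):
    """Yield (token, db_value, pkl_value) wherever the two stores disagree.

    Both arguments are assumed to be mappings of token -> (spam, ham)
    counts.  A token missing from one store is reported with None on
    that side.
    """
    for token in sorted(set(db_counts) | set(pkl_counts)):
        db_value = db_counts.get(token)
        pkl_value = pkl_counts.get(token)
        if db_value != pkl_value:
            yield token, db_value, pkl_value


if __name__ == "__main__":
    # Made-up counts, purely to show the output shape.
    db = {"click": (40, 2), "are": (100, 300), "privileges": (1, 1)}
    pkl = {"click": (20, 1), "are": (100, 300)}
    for token, db_value, pkl_value in diff_counts(db, pkl):
        print(token, db_value, pkl_value)
```

If the raw counts agree and only the scores differ, the problem is more
likely in how the scores are computed or cached than in what was stored.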

        Greg
-- 
Greg Ward <gward at python.net>                         http://www.gerg.ca/
Outside of a dog, a book is man's best friend.
Inside of a dog, it's too dark to read.