[spambayes-dev] Pickle vs DB inconsistencies
Greg Ward
gward at python.net
Thu Jun 12 22:19:28 EDT 2003
I'm getting inconsistent results using the same training corpus when I
store the database to a pickle vs a DB file. Here's how I created the
training databases (once DB, once pickle):
$ hammie.py -d -p db/default.db -g corpus/default/ham -s corpus/default/spam
$ hammie.py -D -p db/default.pkl -g corpus/default/ham -s corpus/default/spam
Results are unsurprising:
$ ll db/default.{db,pkl}
-rw-rw-r-- 1 greg dev 2600960 Jun 12 20:20 db/default.db
-rw-rw-r-- 1 greg dev 2277665 Jun 12 20:16 db/default.pkl
Now I try to score a message with each database:
$ msg=corpus/checkins/ham/cur/19OLiy-0000XO-00:2,S
$ hammie.py -f -d -p db/default.db < $msg | grep X-Spambayes
X-Spambayes-Classification: unsure; 0.26
$ hammie.py -f -D -p db/default.pkl < $msg | grep X-Spambayes
X-Spambayes-Classification: ham; 0.15
Huh?!? My own scoring script (which just exists because I like one line
of output per scored message) shows the same thing:
$ ./score -d db/default.db $msg
? 0.258 corpus/checkins/ham/cur/19OLiy-0000XO-00:2,S
$ ./score -d db/default.pkl $msg
N 0.153 corpus/checkins/ham/cur/19OLiy-0000XO-00:2,S
The other neat feature of my "score" script is its -v option, which
dumps all the clues. -v on the above two runs reveals that the sets of
clue tokens are *nearly* identical, but the score of each token is
subtly different between DB and pickle. Some excerpts:
DB                              pickle
'*H*': 1.000                    '*H*': 0.991
'*S*': 0.515                    '*S*': 0.297
'to:spambayes': 0.001           'to:spambayes': 0.002
'from:Greg': 0.003              'from:Greg': 0.006
'system.': 0.006                'system.': 0.012
'binary': 0.007                 'binary': 0.014
'(not': 0.009                   '(not': 0.018
[...]
'taking': 0.370                 'taking': 0.372
'are': 0.377                    'are': 0.377
'reply-to:none': 0.379          'reply-to:none': 0.379
'for': 0.382                    'for': 0.382
'privileges': 0.386             [not in the pickle store]
'header:Received:2': 0.392      'header:Received:2': 0.392
'windows': 0.397                'windows': 0.398
'unable': 0.606                 'unable': 0.604
'west': 0.617                   'west': 0.614
[...]
'notified': 0.946               'staff': 0.937
'hereby': 0.949                 'notified': 0.941
'click': 0.955                  'federal': 0.941
'message-id:skip:3 30': 0.965   'click': 0.955
'belonging': 0.978              'belonging': 0.959
'los': 0.984                    'los': 0.970
'medical': 0.984                'medical': 0.970
'message-id:skip:t 20': 0.988   'message-id:skip:t 20': 0.976
'street,': 0.990                'street,': 0.980
(The correspondence gets jumbled near the end because the tokens are
sorted by score; it appears that the variance is higher near the top
end.)
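
As a sanity check on the excerpt, here is a minimal sketch that recombines
the '*S*' and '*H*' values into an overall score. It assumes the score is
(S - H + 1) / 2 and that the ham and spam cutoffs are 0.20 and 0.90; the
formula, the cutoffs, and the function names are assumptions for
illustration, not something stated in this message.

```python
# Assumption: overall score = (S - H + 1) / 2, where S and H are the
# '*S*' and '*H*' values from the clue dump.
def combined_score(s, h):
    return (s - h + 1.0) / 2.0


# Assumption: cutoffs of 0.20 (ham) and 0.90 (spam); anything in between
# is reported as "unsure".
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# '*S*' and '*H*' values copied from the excerpt above.
db_score = combined_score(0.515, 1.000)
pkl_score = combined_score(0.297, 0.991)

print("DB     %.4f %s" % (db_score, classify(db_score)))
print("pickle %.4f %s" % (pkl_score, classify(pkl_score)))
```

Under those assumptions the two results come out at about 0.2575 and
0.153, in line with the 0.258 ("unsure") and 0.153 ("ham") reported
above, so the disagreement in the final classification would follow
directly from the differing '*S*' and '*H*' values.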
Anybody have a clue WTF is going on here? I'm running a
several-days-old CVS spambayes, so I'll try "cvs up" first. And then I
guess I'll start picking through the DB and pickle files manually to see
if those differences are visible that way. But I have no idea what that
will tell me ...
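
One way to make that comparison less manual is to diff the two clue
dumps mechanically. The sketch below assumes each dump has already been
loaded into a plain dict mapping token to score; compare_clues and its
tolerance parameter are hypothetical names, and the sample values are a
few rows copied from the excerpt above.

```python
def compare_clues(db_clues, pkl_clues, tolerance=0.0005):
    """Return tokens unique to each store and tokens whose scores differ."""
    only_db = sorted(set(db_clues) - set(pkl_clues))
    only_pkl = sorted(set(pkl_clues) - set(db_clues))
    differing = []
    for token in sorted(set(db_clues) & set(pkl_clues)):
        # Ignore differences smaller than the three-decimal print precision.
        if abs(pkl_clues[token] - db_clues[token]) > tolerance:
            differing.append((token, db_clues[token], pkl_clues[token]))
    return only_db, only_pkl, differing


# A few rows from the excerpt, for illustration only.
db_clues = {"to:spambayes": 0.001, "are": 0.377,
            "privileges": 0.386, "belonging": 0.978}
pkl_clues = {"to:spambayes": 0.002, "are": 0.377, "belonging": 0.959}

only_db, only_pkl, differing = compare_clues(db_clues, pkl_clues)
print("only in DB:    ", only_db)
print("only in pickle:", only_pkl)
for token, db_value, pkl_value in differing:
    print("%-20r DB %.3f  pickle %.3f" % (token, db_value, pkl_value))
```

The same function could be pointed at per-token counts pulled from the
two stores, provided they are first read into dicts of the same shape.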
Greg
--
Greg Ward <gward at python.net> http://www.gerg.ca/
Outside of a dog, a book is man's best friend.
Inside of a dog, it's too dark to read.