[Spambayes] Database reduction
Tim Peters
tim.one@comcast.net
Mon Nov 4 18:53:26 2002
[Neale Pickett]
> Perhaps a picture would be worth 1K words:
>
> >>> import classifier
> >>> w = classifier.WordInfo('aoeu', 2)
> >>> import pickle
> >>> w
> WordInfo"('aoeu', 0, 0, 0, 2)"
> >>> pickle.dumps(w, 1)
>
> 'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__b
> uiltin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02
> tq\x05bq\x06.'
>
> In case it isn't obvious yet, here's the problem:
>
> >>> len(pickle.dumps(w, 1))
> 102
> >>> len(`w`)
> 30
OTOH,
>>> cPickle.dumps(w.__getstate__(), 1)
'(U\x04aoeuq\x01K\x00K\x00K\x00K\x02t.'
>>> len(_)
19
>>>
which is shorter than your string repr. This isn't typical because 2 is an
absurd spamprob (it's > 1, and is an int instead of a double); the savings
would be greater with a real spamprob (which will consume about 19 bytes in
a string repr, but about 8 in a pickle).
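To make the size difference concrete for anyone reading this today, here's a Python 3 sketch. The WordInfo class below is a hypothetical stand-in I'm defining here (the real one is spambayes' classifier.WordInfo); the point is that pickling the bare state tuple skips the module-and-class-name bookkeeping, so it comes out much shorter than pickling the instance.

```python
import pickle

# Hypothetical stand-in for spambayes' classifier.WordInfo --
# just enough structure to show the size difference.
class WordInfo:
    def __init__(self, word, spamprob):
        self.word = word
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __getstate__(self):
        return (self.word, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.word, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state

w = WordInfo('aoeu', 0.8333333333333334)
whole = pickle.dumps(w, 1)                  # class name + state
state = pickle.dumps(w.__getstate__(), 1)   # state tuple only
print(len(whole), len(state))               # the state pickle is much shorter
```

Note that the float spamprob costs about 8 bytes in a binary pickle, versus the ~19 characters its repr needs.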
> So, at least for hammie, you can get a 66% reduction in database size
> by *not* pickling WordInfo types. Tim calls this "administrative pickle
> bloat", which is the coolest jargon term I've heard all year.
Glad you liked it <wink>. If you pickle the states instead, you'll save a
lot of space. The state is a plain tuple. On the other end, you have to
construct a WordInfo object and pass the unpickled tuple to its __setstate__
method.
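A round-trip sketch of that scheme, using a hypothetical WordInfo stand-in (the real class lives in spambayes' classifier module): the sender pickles only the state tuple, and the receiver builds an empty instance and hands the tuple to __setstate__.

```python
import pickle

# Hypothetical WordInfo stand-in; the real one is classifier.WordInfo.
class WordInfo:
    def __init__(self, word=None, spamprob=None):
        self.word = word
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __getstate__(self):
        return (self.word, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.word, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state

# Sender: pickle just the state tuple, not the instance.
blob = pickle.dumps(WordInfo('aoeu', 0.5).__getstate__(), 1)

# Receiver: make a bare instance (skipping __init__) and restore it.
w = WordInfo.__new__(WordInfo)
w.__setstate__(pickle.loads(blob))
print(w.word, w.spamprob)
```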
> As I understand it, things which pickle the Bayes object avoid this
> overhead from some pickler optimizations along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name."
Yes, but a Pickler does this automatically. You're using convenience
functions, which is why you get no savings. Here's pickle.dumps():
def dumps(object, bin = 0):
    file = StringIO()
    Pickler(file, bin).dump(object)
    return file.getvalue()
It creates a brand new Pickler every time you call dumps, so nothing can be
remembered from one call to the next. Avoiding that is clumsy in this
context, but possible:
>>> f = StringIO.StringIO()
>>> p = cPickle.Pickler(f, 1)
>>> p.dump(w)
<cPickle.Pickler object at 0x007EE020>
>>> f.getvalue()
'ccopy_reg\n_reconstructor\nq\x01(cclassifier\nWordInfo\nq\x02c__builtin__\n
object\nq\x03NtRq\x04(U\x04abdeq\x05K\x00K\x00K\x00G?\xd3333333tb.'
>>> f.truncate(0)
>>> p.dump(w)
<cPickle.Pickler object at 0x007EE020>
>>> f.getvalue()
'h\x04.'
>>>
In this case, by reusing the Pickler, the second time dumping w created a
2-byte pickle: the Pickler maintains its own internal dict remembering
everything it pickled in the past. This can be a real data burden of its
own, though. See the docs for ways to clear a Pickler's dict (called the
pickle "memo" in the docs).
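For anyone redoing this experiment on Python 3 (which these transcripts predate): io.BytesIO replaces StringIO, the pickle module replaces cPickle, and CPython's Pickler exposes clear_memo() to empty the memo. The class below is just a throwaway stand-in.

```python
import io
import pickle

class Thing:           # throwaway stand-in for WordInfo
    pass

w = Thing()
f = io.BytesIO()
p = pickle.Pickler(f, 1)

p.dump(w)
first = f.getvalue()           # full pickle: module + class name + state

f.seek(0); f.truncate()
p.dump(w)
second = f.getvalue()          # tiny: just a memo reference to w
print(len(first), len(second))

# The memo grows for as long as the Pickler lives; clear it to
# forget everything pickled so far and start fresh.
p.clear_memo()
f.seek(0); f.truncate()
p.dump(w)
third = f.getvalue()
print(len(third))              # back to full size
```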
I'd avoid all that and pickle the states, but that's just me.