[Spambayes] Database reduction
Tim Peters
tim.one@comcast.net
Mon Nov 4 18:53:26 2002
[Neale Pickett]
> Perhaps a picture would be worth 1K words:
>
> >>> import classifier
> >>> w = classifier.WordInfo('aoeu', 2)
> >>> import pickle
> >>> w
> WordInfo"('aoeu', 0, 0, 0, 2)"
> >>> pickle.dumps(w, 1)
>
> 'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__b
> uiltin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02
> tq\x05bq\x06.'
>
> In case it isn't obvious yet, here's the problem:
>
> >>> len(pickle.dumps(w, 1))
> 102
> >>> len(`w`)
> 30
OTOH,
>>> cPickle.dumps(w.__getstate__(), 1)
'(U\x04aoeuq\x01K\x00K\x00K\x00K\x02t.'
>>> len(_)
19
>>>
which is shorter than your string repr. This isn't typical because 2 is an
absurd spamprob (it's > 1, and is an int instead of a double); the savings
would be greater with a real spamprob (which will consume about 19 bytes in
a string repr, but about 8 in a pickle).
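To make the size difference concrete for anyone reading this today, here's a Python 3 sketch. The WordInfo class below is a hypothetical stand-in I'm defining here (the real one is spambayes' classifier.WordInfo); the point is that pickling the bare state tuple skips the module-and-class-name bookkeeping, so it comes out much shorter than pickling the instance.

```python
import pickle

# Hypothetical stand-in for spambayes' classifier.WordInfo --
# just enough structure to show the size difference.
class WordInfo:
    def __init__(self, word, spamprob):
        self.word = word
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __getstate__(self):
        return (self.word, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.word, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state

w = WordInfo('aoeu', 0.8333333333333334)
whole = pickle.dumps(w, 1)                  # class name + state
state = pickle.dumps(w.__getstate__(), 1)   # state tuple only
print(len(whole), len(state))               # the state pickle is much shorter
```

Note that the float spamprob costs about 8 bytes in a binary pickle, versus the ~19 characters its repr needs.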
> So, at least for hammie, you can get a 66% reduction in database size
> by *not* pickling WordInfo types. Tim calls this "administrative pickle
> bloat", which is the coolest jargon term I've heard all year.
Glad you liked it <wink>. If you pickle the states instead, you'll save a
lot of space. The state is a plain tuple. On the other end, you have to
construct a WordInfo object and pass the unpickled tuple to its __setstate__
method.
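A round-trip sketch of that scheme, using a hypothetical WordInfo stand-in (the real class lives in spambayes' classifier module): the sender pickles only the state tuple, and the receiver builds an empty instance and hands the tuple to __setstate__.

```python
import pickle

# Hypothetical WordInfo stand-in; the real one is classifier.WordInfo.
class WordInfo:
    def __init__(self, word=None, spamprob=None):
        self.word = word
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __getstate__(self):
        return (self.word, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.word, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state

# Sender: pickle just the state tuple, not the instance.
blob = pickle.dumps(WordInfo('aoeu', 0.5).__getstate__(), 1)

# Receiver: make a bare instance (skipping __init__) and restore it.
w = WordInfo.__new__(WordInfo)
w.__setstate__(pickle.loads(blob))
print(w.word, w.spamprob)
```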
> As I understand it, things which pickle the Bayes object avoid this
> overhead from some pickler optimizations along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name."
Yes, but a Pickler does this automatically. You're using convenience
functions, which is why you get no savings. Here's pickle.dumps():
def dumps(object, bin = 0):
    file = StringIO()
    Pickler(file, bin).dump(object)
    return file.getvalue()
It creates a brand new Pickler every time you call dumps, so nothing can be
remembered from one call to the next. Avoiding that is clumsy in this
context, but possible:
>>> f = StringIO.StringIO()
>>> p = cPickle.Pickler(f, 1)
>>> p.dump(w)
<cPickle.Pickler object at 0x007EE020>
>>> f.getvalue()
'ccopy_reg\n_reconstructor\nq\x01(cclassifier\nWordInfo\nq\x02c__builtin__\n
object\nq\x03NtRq\x04(U\x04abdeq\x05K\x00K\x00K\x00G?\xd3333333tb.'
>>> f.truncate(0)
>>> p.dump(w)
<cPickle.Pickler object at 0x007EE020>
>>> f.getvalue()
'h\x04.'
>>>
In this case, by reusing the Pickler, the second time dumping w created a
2-byte pickle: the Pickler maintains its own internal dict remembering
everything it pickled in the past. This can be a real data burden of its
own, though. See the docs for ways to clear a Pickler's dict (called the
pickle "memo" in the docs).
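For anyone redoing this experiment on Python 3 (which these transcripts predate): io.BytesIO replaces StringIO, the pickle module replaces cPickle, and CPython's Pickler exposes clear_memo() to empty the memo. The class below is just a throwaway stand-in.

```python
import io
import pickle

class Thing:           # throwaway stand-in for WordInfo
    pass

w = Thing()
f = io.BytesIO()
p = pickle.Pickler(f, 1)

p.dump(w)
first = f.getvalue()           # full pickle: module + class name + state

f.seek(0); f.truncate()
p.dump(w)
second = f.getvalue()          # tiny: just a memo reference to w
print(len(first), len(second))

# The memo grows for as long as the Pickler lives; clear it to
# forget everything pickled so far and start fresh.
p.clear_memo()
f.seek(0); f.truncate()
p.dump(w)
third = f.getvalue()
print(len(third))              # back to full size
```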
I'd avoid all that and pickle the states, but that's just me.