[Python-Dev] Pickler/Unpickler API clarification

Michael Haggerty mhagger at alum.mit.edu
Fri Mar 6 19:01:59 CET 2009


Antoine Pitrou wrote:
> On Friday, 06 March 2009 at 13:44 +0100, Michael Haggerty wrote:
>> Antoine Pitrou wrote:
>>> Michael Haggerty <mhagger <at> alum.mit.edu> writes:
>>>> It is easy to optimize the pickling of instances by giving them
>>>> __getstate__() and __setstate__() methods.  But the pickler still
>>>> records the type of each object (essentially, the name of its class) in
>>>> each record.  The space for these strings constituted a large fraction
>>>> of the database size.
>>> If these strings are not interned, then perhaps they should be.
>>> There is a similar optimization proposal (w/ patch) for attribute names:
>>> http://bugs.python.org/issue5084
>> If I understand correctly, this would not help:
>>
>> - on writing, the strings are identical anyway, because they are read
>> out of the class's __name__ and __module__ fields.  Therefore the
>> Pickler's usual memoizing behavior will prevent the strings from being
>> written more than once.
> 
> Then why did you say that "the space for these strings constituted a
> large fraction of the database size", if they are already shared? Are
> your objects so tiny that even the space taken by the pointer to the
> type name grows the size of the database significantly?

Sorry for the confusion.  I thought you were suggesting the change to
help the more typical use case, when a single Pickler is used for a lot
of data.  That use case will not be helped by interning the class
__name__ and __module__ strings, for the reasons given in my previous email.

In my case, the strings are shared via the Pickler memoizing mechanism
because I pre-populate the memo (using the API that the OP proposes to
remove), so your suggestion won't help my current code, either.  The
remark that "the space for these strings constituted a large fraction
of the database size" described the situation before I implemented the
pre-populated memo, and interning wouldn't have helped that case, either.

Here are the main use cases:

1. Saving and loading one large record.  A class's __name__ string is
the same string object every time it is retrieved, so it only needs to
be stored once and the Pickler memo mechanism works.  Similarly for the
class's __module__ string.
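
Concretely, case 1 looks something like this (the Point class below is
just illustrative, not from cvs2svn):

import pickle
import pickletools

class Point(object):
    def __init__(self, x):
        self.x = x

blob = pickle.dumps([Point(i) for i in range(3)], protocol=2)
# dis() shows a single GLOBAL opcode naming '__main__ Point'; the later
# instances refer back to it with memo GET opcodes.
pickletools.dis(blob)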

2. Saving and loading lots of records sequentially.  Provided a single
Pickler is used for all records and its memo is never cleared, this
works just as well as case 1.
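
A sketch of case 2, with the same illustrative Point class: one Pickler
writes all records to one stream without clearing its memo, so the
class reference is emitted only in the first record, and a single
Unpickler then has to read the records back in the same order:

import pickle
from io import BytesIO

class Point(object):
    def __init__(self, x):
        self.x = x

stream = BytesIO()
pickler = pickle.Pickler(stream, protocol=2)
for i in range(3):
    pickler.dump(Point(i))      # memo is retained between dump() calls

stream.seek(0)
unpickler = pickle.Unpickler(stream)
records = [unpickler.load() for _ in range(3)]   # must load in order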

3. Saving and loading lots of records in random order, as for example in
the shelve module.  Here a Pickler/Unpickler pair with a retained memo
cannot be reused, because a record read out of order might refer to
memoized objects that the Unpickler has not seen yet.  There are two
subcases:

   a. Use a clean Pickler/Unpickler object for each record.  In this
case the __name__ and __module__ of a class will appear once in each
record in which the class appears.  (This is the case regardless of
whether they are interned.)  On reading, the __name__ and __module__ are
only used to look up the class, so interning them won't help.  It is
thus impossible to avoid wasting a lot of space in the database.
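
A sketch of 3a, again with an illustrative Point class: a fresh pickle
per record means every stored blob repeats the class reference, but the
records can be loaded independently and in any order:

import pickle

class Point(object):
    def __init__(self, x):
        self.x = x

db = {}                        # stand-in for a shelve/DB-style mapping
for key in range(3):
    db[str(key)] = pickle.dumps(Point(key), protocol=2)

point = pickle.loads(db["2"])  # any record, in any order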

   b. Use a Pickler/Unpickler with a preset memo for each record (my
unorthodox technique).  In this case the class __name__ and __module__
are memoized in the shared memo, so each record only needs to store a
memo reference (in fact, only a reference to the class object itself).
This allows the database to be smaller, but does not have any effect on
the RAM usage of the loaded objects.
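
And a sketch of the 3b technique, assuming Pickler.memo and
Unpickler.memo can still be read and assigned (exactly the API whose
removal is being discussed); the Point class and helper names are only
illustrative:

import pickle
from io import BytesIO

class Point(object):
    def __init__(self, x):
        self.x = x

# Write side: pickle a "preamble" holding the shared objects once and
# keep the resulting memo.
preamble = BytesIO()
primer = pickle.Pickler(preamble, protocol=2)
primer.dump(Point)                      # the class is the shared object
shared_pickler_memo = primer.memo.copy()

def dump_record(obj):
    out = BytesIO()
    p = pickle.Pickler(out, protocol=2)
    p.memo = shared_pickler_memo.copy() # pre-populate the memo; the copy
                                        # keeps records independent
    p.dump(obj)
    return out.getvalue()               # refers to Point by memo GET,
                                        # not by module/name

# Read side: prime an Unpickler memo by loading the preamble once, then
# hand a copy of it to each per-record Unpickler.
u = pickle.Unpickler(BytesIO(preamble.getvalue()))
u.load()
shared_unpickler_memo = u.memo.copy()

def load_record(data):
    up = pickle.Unpickler(BytesIO(data))
    up.memo = shared_unpickler_memo.copy()
    return up.load()

blob = dump_record(Point(42))
print(load_record(blob).x)              # -> 42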

If the OP's proposal is accepted, 3b will become impossible.  The
technique seems not to be well known, so maybe it doesn't need to be
supported.  It would mean some extra work for me on the cvs2svn project
though :-(

Michael


