[Python-Dev] undesireable unpickle behavior, proposed fix
Guido van Rossum
guido at python.org
Tue Jan 27 18:57:40 CET 2009
On Tue, Jan 27, 2009 at 6:23 AM, Jesse Noller <jnoller at gmail.com> wrote:
> On Tue, Jan 27, 2009 at 4:49 AM, Jake McGuire <jake at youtube.com> wrote:
>> Instance attribute names are normally interned - this is done in
>> PyObject_SetAttr (among other places). Unpickling (in pickle and cPickle)
>> directly updates __dict__ on the instance object. This bypasses the
>> interning so you end up with many copies of the strings representing your
>> attribute names, which wastes a lot of space, both in RAM and in pickles of
>> sequences of objects created from pickles. Note that the native python
>> memcached client uses pickle to serialize objects.
>>
>>>>> import pickle
>>>>> class C(object):
>> ... def __init__(self, x):
>> ... self.long_attribute_name = x
>> ...
>>>>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None),
>>>>> pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL))
>> 3658
>>>>> len(pickle.dumps([C(None) for i in range(100)],
>>>>> pickle.HIGHEST_PROTOCOL))
>> 1441
>>>>>
>>
>> Interning the strings on unpickling makes the pickles smaller, and at least
>> for cPickle actually makes unpickling sequences of many objects slightly
>> faster. I have included proposed patches to cPickle.c and pickle.py, and
>> would appreciate any feedback.
>>
>> dhcp-172-31-170-32:~ mcguire$ diff -u
>> Downloads/Python-2.4.3/Modules/cPickle.c cPickle.c
>> --- Downloads/Python-2.4.3/Modules/cPickle.c 2004-07-26
>> 22:22:33.000000000 -0700
>> +++ cPickle.c 2009-01-26 23:30:31.000000000 -0800
>> @@ -4258,6 +4258,8 @@
>> PyObject *state, *inst, *slotstate;
>> PyObject *__setstate__;
>> PyObject *d_key, *d_value;
>> + PyObject *name;
>> + char * key_str;
>> int i;
>> int res = -1;
>>
>> @@ -4319,8 +4321,24 @@
>>
>> i = 0;
>> while (PyDict_Next(state, &i, &d_key, &d_value)) {
>> - if (PyObject_SetItem(dict, d_key, d_value) < 0)
>> - goto finally;
>> + /* normally the keys for instance attributes are
>> + interned. we should try to do that here. */
>> + if (PyString_CheckExact(d_key)) {
>> + key_str = PyString_AsString(d_key);
>> + name = PyString_FromString(key_str);
>> + if (! name)
>> + goto finally;
>> +
>> + PyString_InternInPlace(&name);
>> + if (PyObject_SetItem(dict, name, d_value) <
>> 0) {
>> + Py_DECREF(name);
>> + goto finally;
>> + }
>> + Py_DECREF(name);
>> + } else {
>> + if (PyObject_SetItem(dict, d_key, d_value) <
>> 0)
>> + goto finally;
>> + }
>> }
>> Py_DECREF(dict);
>> }
>>
>> dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/pickle.py
>> pickle.py
>> --- Downloads/Python-2.4.3/Lib/pickle.py 2009-01-27
>> 01:41:43.000000000 -0800
>> +++ pickle.py 2009-01-27 01:41:31.000000000 -0800
>> @@ -1241,7 +1241,15 @@
>> state, slotstate = state
>> if state:
>> try:
>> - inst.__dict__.update(state)
>> + d = inst.__dict__
>> + try:
>> + for k,v in state.items():
>> + d[intern(k)] = v
>> + # keys in state don't have to be strings
>> + # don't blow up, but don't go out of our way
>> + except TypeError:
>> + d.update(state)
>> +
>> except RuntimeError:
>> # XXX In restricted execution, the instance's __dict__
>> # is not accessible. Use the old way of unpickling
>>
>
> Hi Jake,
>
> You should really post this to bugs.python.org as an enhancement so we
> can track the discussion there.
>
> -jesse
Seconded, with eagerness -- interning attribute names when unpickling
makes a lot of sense!
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
More information about the Python-Dev
mailing list