cPickle.dumps differs from Pickle.dumps; looks like a bug.

Nick Vatamaniuc vatamane at gmail.com
Wed May 16 18:33:25 EDT 2007


On May 16, 1:13 pm, Victor Kryukov <victor.kryu... at gmail.com> wrote:
> Hello list,
>
> I've found the following strange behavior of cPickle. Do you think
> it's a bug, or is it by design?
>
> Best regards,
> Victor.
>
> from pickle import dumps
> from cPickle import dumps as cdumps
>
> print dumps('1001799')==dumps(str(1001799))
> print cdumps('1001799')==cdumps(str(1001799))
>
> outputs
>
> True
> False
>
> vicbook:~ victor$ python
> Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
> [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> quit()
>
> vicbook:~ victor$ uname -a
> Darwin vicbook 8.9.1 Darwin Kernel Version 8.9.1: Thu Feb 22 20:55:00
> PST 2007; root:xnu-792.18.15~1/RELEASE_I386 i386 i386

I might have found the culprit: see http://svn.python.org/projects/python/trunk/Modules/cPickle.c
The function static int put2(...) contains the following block:

---------cPickle.c-----------
int p;
...
if ((p = PyDict_Size(self->memo)) < 0)
    goto finally;

/* Make sure memo keys are positive! */
/* XXX Why?
 * XXX And does "positive" really mean non-negative?
 * XXX pickle.py starts with PUT index 0, not 1.  This makes for
 * XXX gratuitous differences between the pickling modules.
 */
p++;
-------------------------------

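The p++ will cause a difference between the two modules: cPickle's memo
keys start at 1, while pickle.py's start at 0, so the PUT indices they
write do not match. Judging by the XXX comments, the developers are not
quite sure why the increment is there, or whether memo keys can start at
0 or have to start at 1.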
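
To make the off-by-one concrete, here is a toy model in Python. It is not
the real pickler code: next_memo_key and its one_based flag are names I
made up for illustration. It only shows the two ways the next memo key
could be derived from the current memo size.

```python
def next_memo_key(memo, one_based):
    """Return the key the next memoized object would get (toy model).

    one_based=False models pickle.py (key = current memo size);
    one_based=True models the extra p++ in cPickle.c's put2().
    """
    key = len(memo)
    if one_based:
        key += 1
    return key


# First object memoized by each scheme, starting from an empty memo.
py_style_key = next_memo_key({}, one_based=False)
c_style_key = next_memo_key({}, one_based=True)
print(py_style_key, c_style_key)
```

Both schemes hand out unique keys, so either should be usable; they just
never agree on the number.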

Here is the corresponding section of the Python version (pickle.py),
taken from Python 2.5:
---------pickle.py----------
    def memoize(self, obj):
        """Store an object in the memo."""
        # The Pickler memo is a dictionary mapping object ids to 2-tuples
        # that contain the Unpickler memo key and the object being memoized.
        # The memo key is written to the pickle and will become
        # the key in the Unpickler's memo.  The object is stored in the
        # Pickler memo so that transient objects are kept alive during
        # pickling.

        # The use of the Unpickler memo length as the memo key is just a
        # convention.  The only requirement is that the memo values be unique.
        # But there appears no advantage to any other scheme, and this
        # scheme allows the Unpickler memo to be implemented as a plain (but
        # growable) array, indexed by memo key.
        if self.fast:
            return
        assert id(obj) not in self.memo
        memo_len = len(self.memo)
        self.write(self.put(memo_len))
        self.memo[id(obj)] = memo_len, obj

    # Return a PUT (BINPUT, LONG_BINPUT) opcode string, with argument i.
    def put(self, i, pack=struct.pack):
        if self.bin:
            if i < 256:
                return BINPUT + chr(i)
            else:
                return LONG_BINPUT + pack("<i", i)
        return PUT + repr(i) + '\n'
------------------------------------------
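
For anyone who wants to poke at the opcode encoding outside the Pickler
class, here is a standalone sketch of put(). The name put_opcode and the
binary flag (standing in for self.bin) are my own, and I use bytes
literals so that it also runs on newer Pythons; treat it as an
approximation of the method above, not a copy of it.

```python
import struct

# Opcode bytes as defined in pickle.py.
PUT = b'p'
BINPUT = b'q'
LONG_BINPUT = b'r'


def put_opcode(i, binary=False):
    """Build the PUT-family opcode for memo key i (standalone sketch)."""
    if binary:
        if i < 256:
            # One-byte argument, the equivalent of chr(i) in Python 2.
            return BINPUT + bytes(bytearray([i]))
        # Four-byte little-endian argument.
        return LONG_BINPUT + struct.pack("<i", i)
    # Text protocol: decimal key followed by a newline.
    return PUT + repr(i).encode("ascii") + b"\n"


text_put_0 = put_opcode(0)                 # what a 0-based memo writes first
text_put_1 = put_opcode(1)                 # what a 1-based memo writes first
binary_put_0 = put_opcode(0, binary=True)
```

The only thing that changes between a 0-based and a 1-based memo is the
argument i, so the two outputs differ by a single digit in the text
protocol.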

In memoize, memo_len plays the role of 'int p' in the C version. In both
versions the memo starts out empty, so the size is 0 for the first
object. pickle.py writes that 0 as the first PUT index, while the C
version increments it with p++ and writes 1.
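
A quick way to see that the PUT index is only bookkeeping: the two
byte strings below are hand-written protocol-0 pickles that I assume
match the shape each module would produce (they are not captured
output), differing only in the PUT argument. put_indices is a helper
name I made up; pickletools.genops is the standard library call that
walks the opcodes.

```python
import pickle
import pickletools

# Hand-written protocol-0 pickles (assumed shapes, not captured output).
pickle_with_put0 = b"S'1001799'\np0\n."   # memo key 0, pickle.py style
pickle_with_put1 = b"S'1001799'\np1\n."   # memo key 1, cPickle style


def put_indices(data):
    """List the arguments of every PUT opcode in a pickle byte string."""
    return [arg for op, arg, pos in pickletools.genops(data)
            if op.name == "PUT"]


# The serialized bytes differ, but both load back to the same value.
same_bytes = pickle_with_put0 == pickle_with_put1
same_value = pickle.loads(pickle_with_put0) == pickle.loads(pickle_with_put1)
print(same_bytes, same_value)
print(put_indices(pickle_with_put0), put_indices(pickle_with_put1))
```

So comparing the output of dumps byte for byte is not a reliable way to
compare objects; comparing the unpickled values is.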

Any developers that know more about this?

-Nick Vatamaniuc



