[Python-Dev] [ 654866 ] pickle and cPickle not equivalent

Martin v. Löwis martin@v.loewis.de
18 Dec 2002 11:25:16 +0100


[Jim, this is about a perceived cPickle bug where refcount 1 objects
 lose their identity on load when they appear in multiple dump calls]

"Patrick K. O'Brien" <pobrien@orbtech.com> writes:

> Thank you for simplifying my example even more, and describing why 
> cPickle is doing what it's doing. Of course, we're still left with the 
> question of whether or not this should be considered a bug. I think it 
> is, but I couldn't tell from your description on SF whether you feel 
> the same way or not. Certainly the difference between pickle and 
> cPickle is disturbing, is it not?

I was happy to discover that this is a *really* rare case, so I
refrained from classifying it as a bug or non-bug. For example, you
can't trigger it by using cPickle.dump[s].
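To make the setup concrete, here is a minimal sketch of what "multiple
dump calls to the same pickler" means. It uses the `pickle` module of a
current Python 3, not the cPickle under discussion, so it shows the
documented cross-dump behaviour and does not reproduce the reported
problem. The helper name `roundtrip_with_one_pickler` is hypothetical.

```python
import io
import pickle


def roundtrip_with_one_pickler(first, second):
    """Dump two objects through one Pickler, then load both through one Unpickler."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf)
    pickler.dump(first)
    pickler.dump(second)  # the memo built by the first dump is still in effect
    buf.seek(0)
    unpickler = pickle.Unpickler(buf)
    return unpickler.load(), unpickler.load()


shared = ["shared"]

# Same pickler, two dump calls: the second dump refers back to the memo entry.
a, b = roundtrip_with_one_pickler([shared], [shared])
same_pickler_identity = a[0] is b[0]

# Two independent dumps calls: each pickle is self-contained, nothing is shared.
c = pickle.loads(pickle.dumps([shared]))
d = pickle.loads(pickle.dumps([shared]))
separate_calls_identity = c[0] is d[0]

print(same_pickler_identity, separate_calls_identity)
```

With the module-level dump[s] functions every call starts from an empty
memo, which is why cross-dump identity can only matter for people who
keep a pickler object around.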

It is, strictly speaking, a bug, since the documentation says that all
objects will be recorded in the memo. The question is whether this is
a documentation bug (i.e. the documentation is promising too much) or
an implementation bug in cPickle.

There clearly is a documentation bug: objects like integers are never
put in the memo, and lose identity on unmarshalling. Of course, it is
not in the spirit of pickle that their identity is preserved. So any
correction would have to include a clarification of the documentation.
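The point about integers can be checked by looking at the opcodes a
pickle contains. The sketch below assumes a current CPython 3 and
protocol 2; the helper name `opcode_names` is hypothetical, and the
identity result for integers is an implementation detail rather than a
guarantee.

```python
import pickle
import pickletools


def opcode_names(obj, protocol=2):
    """Return the names of the opcodes in the pickle of obj."""
    data = pickle.dumps(obj, protocol=protocol)
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


big = 10 ** 20  # large enough that it is not a cached small integer

int_ops = opcode_names(big)   # no memo opcode is emitted for the integer
list_ops = opcode_names([])   # the list gets a BINPUT, i.e. a memo entry

# The same integer object referenced twice comes back as two equal objects,
# while the same list referenced twice comes back as one object.
ints = pickle.loads(pickle.dumps([big, big], protocol=2))
inner = []
lists = pickle.loads(pickle.dumps([inner, inner], protocol=2))

print(int_ops, list_ops)
print(ints[0] is ints[1], lists[0] is lists[1])
```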

I'm a bit concerned about performance implications of changing the
behaviour: you'll have to put every object into the memo, even though
in most cases, you'll never need to look up any refcount 1 objects.

It is not clear to me by what principle cPickle decides to use put or
put2. If cPickle is changed, the distinction between put and put2 goes
away. A cPickle expert might be able to tell, but I think there are
none left (which is also a reason for my initial reaction - you are as
much a cPickle expert as anybody else here).

To eliminate the performance concerns, it might be feasible to add
another flag to pickler objects, indicating whether this was called
from pickle.dump[s] or pickle.Pickler, adding the performance cost
only to people who use pickler objects. For that to be reasonable, one
should be sure that
a) the majority of users uses pickle.dump[s], and thus sees the old
   behaviour, and
b) refcount 1 objects are frequent enough to worry about the 
   performance hit

I would appreciate it if anybody could provide data on either aspect.

> I honestly didn't have that attitude. Sorry if my message sounded that 
> way. I just wanted to try to resolve this while it was fresh, that's 
> all. And I do consider it pretty serious that pickle and cPickle are 
> not working identically (in substance, not cosmetic differences). But I 
> don't mean to imply that this bug is more or less important than any 
> other. 

Whether it is serious or not can only be answered once it is
understood: As you can see, I still hesitate to classify it as
serious. The reason is that cPickle always behaved this way, and
nobody ever noticed. This tells me:
a) almost nobody uses multiple dump calls to the same pickler, and
b) of those who do, almost nobody worries about cross-dump
   identities, and
c) of those who do, almost nobody ever ran into a case where a
   refcount 1 object occurred in a bad position of two dumps, and
d) of those who did, almost nobody noticed, and
e) of those who did, only a single person was worried enough to
   report this to the cPickle maintainers.

I'd really like to know what Jim Fulton thinks about all this.

Regards,
Martin