Title:  Pickle protocol 5 with out-of-band data
Author: Antoine Pitrou <solipsis at pitrou.net>
This PEP proposes to standardize a new pickle protocol version, and accompanying APIs to take full advantage of it:
- A new pickle protocol version (5) to cover the extra metadata needed for out-of-band data buffers.
- A new PickleBuffer type for __reduce_ex__ implementations to return out-of-band data buffers.
- A new buffer_callback parameter when pickling, to handle out-of-band data buffers.
- A new buffers parameter when unpickling to provide out-of-band data buffers.
The PEP guarantees unchanged behaviour for anyone not using the new APIs.
The pickle protocol was originally designed in 1995 for on-disk persistence of arbitrary Python objects. Given the performance of 1995-era storage media, there was little point in optimizing metrics such as the RAM bandwidth spent copying temporary data before writing it to disk.
Nowadays the pickle protocol sees growing use in applications where most of the data is never persisted to disk (or, when it is, a portable format is used instead of a Python-specific one). Instead, pickle is used to transmit data and commands from one process to another, either on the same machine or across multiple machines. Those applications sometimes deal with very large data (such as Numpy arrays or Pandas dataframes) that needs to be transferred around. For those applications, pickle is currently wasteful as it imposes spurious memory copies of the data being serialized.
As a matter of fact, the standard multiprocessing module uses pickle for serialization, and therefore also suffers from this problem when sending large data to another process.
Third-party Python libraries, such as Dask, PyArrow and IPyParallel, have started implementing alternative serialization schemes with the explicit goal of avoiding copies on large data. Implementing a new serialization scheme is difficult and often leads to reduced generality (since many Python objects support pickle but not the new serialization scheme). Falling back on pickle for unsupported types is an option, but then you get back the spurious memory copies you wanted to avoid in the first place. For example, dask is able to avoid memory copies for Numpy arrays and built-in containers thereof (such as lists or dicts containing Numpy arrays), but if a large Numpy array is an attribute of a user-defined object, dask will serialize the user-defined object as a pickle stream, leading to memory copies.
The common theme of these third-party serialization efforts is to generate a stream of object metadata (which contains pickle-like information about the objects being serialized) and a separate stream of zero-copy buffer objects for the payloads of large objects. Note that, in this scheme, small objects such as ints, etc. can be dumped together with the metadata stream. Refinements can include opportunistic compression of large data depending on its type and layout, like dask does.
This PEP aims to make pickle usable in a way where large data is handled as a separate stream of zero-copy buffers, letting the application handle those buffers optimally.
To keep the example simple and avoid requiring knowledge of third-party libraries, we will focus here on a bytearray object (but the issue is conceptually the same with more sophisticated objects such as Numpy arrays). Like most objects, the bytearray object isn't immediately understood by the pickle module and must therefore specify its decomposition scheme.
Here is how a bytearray object currently decomposes for pickling:
    >>> b = bytearray(b'abc')
    >>> b.__reduce_ex__(4)
    (<class 'bytearray'>, (b'abc',), None)
This is because the bytearray.__reduce_ex__ implementation reads morally as follows:
    class bytearray:
        def __reduce_ex__(self, protocol):
            if protocol == 4:
                return type(self), (bytes(self),), None
            # Legacy code for earlier protocols omitted
In turn it produces the following pickle code:
    >>> pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=4)))
        0: \x80 PROTO      4
        2: \x95 FRAME      30
       11: \x8c SHORT_BINUNICODE 'builtins'
       21: \x8c SHORT_BINUNICODE 'bytearray'
       32: \x93 STACK_GLOBAL
       33: C    SHORT_BINBYTES b'abc'
       38: \x85 TUPLE1
       39: R    REDUCE
       40: .    STOP
(the call to pickletools.optimize above is only meant to make the pickle stream more readable by removing the MEMOIZE opcodes)
We can notice several things about the bytearray's payload (the sequence of bytes b'abc'):
- bytearray.__reduce_ex__ produces a first copy by instantiating a new bytes object from the bytearray's data.
- pickle.dumps produces a second copy when inserting the contents of that bytes object into the pickle stream, after the SHORT_BINBYTES opcode.
- Furthermore, when deserializing the pickle stream, a temporary bytes object is created when the SHORT_BINBYTES opcode is encountered (inducing a data copy).
What we really want is something like the following:
- bytearray.__reduce_ex__ produces a view of the bytearray's data.
- pickle.dumps doesn't try to copy that data into the pickle stream but instead passes the buffer view to its caller (which can decide on the most efficient handling of that buffer).
- When deserializing, pickle.loads takes the pickle stream and the buffer view separately, and passes the buffer view directly to the bytearray constructor.
We see that several conditions are required for the above to work:
- __reduce__ or __reduce_ex__ must be able to return something that indicates a serializable no-copy buffer view.
- The pickle protocol must be able to represent references to such buffer views, instructing the unpickler that it may have to get the actual buffer out of band.
- The pickle.Pickler API must provide its caller with a way to receive such buffer views while serializing.
- The pickle.Unpickler API must similarly allow its caller to provide the buffer views required for deserialization.
- For compatibility, the pickle protocol must also be able to contain direct serializations of such buffer views, such that current uses of the pickle API don't have to be modified if they are not concerned with memory copies.
We are introducing a new type pickle.PickleBuffer which can be instantiated from any buffer-supporting object, and is specifically meant to be returned from __reduce__ implementations:
    class bytearray:
        def __reduce_ex__(self, protocol):
            if protocol >= 5:
                return type(self), (PickleBuffer(self),), None
            # Legacy code for earlier protocols omitted
PickleBuffer is a simple wrapper that doesn't have all the memoryview semantics and functionality, but is specifically recognized by the pickle module if protocol 5 or higher is enabled. It is an error to try to serialize a PickleBuffer with pickle protocol version 4 or earlier.
Only the raw data of the PickleBuffer will be considered by the pickle module. Any type-specific metadata (such as shapes or datatype) must be returned separately by the type's __reduce__ implementation, as is already the case.
The PickleBuffer class supports a very simple Python API. Its constructor takes a single PEP 3118-compatible object. PickleBuffer objects themselves support the buffer protocol, so consumers can call memoryview(...) on them to get additional information about the underlying buffer (such as the original type, shape, etc.). In addition, PickleBuffer objects can be explicitly released using their release() method.
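As an illustration, here is a minimal sketch of that Python API, assuming an interpreter implementing this PEP (i.e. with protocol 5 support):

    import pickle

    data = bytearray(b"abc")
    pb = pickle.PickleBuffer(data)         # wrap any PEP 3118-compatible object

    m = memoryview(pb)                     # PickleBuffer itself supports the buffer protocol
    print(m.nbytes, m.readonly, m.format)  # inspect the underlying buffer
    m.release()

    pb.release()                           # explicitly release the underlying buffer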
On the C side, a simple API will be provided to create and inspect PickleBuffer objects:
PyObject *PyPickleBuffer_FromObject(PyObject *obj)
Create a PickleBuffer object holding a view over the PEP 3118-compatible obj.
int PyPickleBuffer_Check(PyObject *obj)
Return whether obj is a PickleBuffer instance.
const Py_buffer *PyPickleBuffer_GetBuffer(PyObject *picklebuf)
Return a pointer to the internal Py_buffer owned by the PickleBuffer instance. An exception is raised if the buffer is released.
int PyPickleBuffer_Release(PyObject *picklebuf)
Release the PickleBuffer instance's underlying buffer.
PickleBuffer can wrap any kind of buffer, including non-contiguous buffers. It's up to consumers to decide how best to handle different kinds of buffers (for example, some consumers may find it acceptable to make a contiguous copy of non-contiguous buffers).
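For instance, a consumer that needs flat data could use a helper along the following lines (a hypothetical sketch, not part of the proposed API), passing contiguous buffers through unchanged and copying non-contiguous ones:

    def contiguous_view(pb):
        """Hypothetical helper: return a contiguous view of a PickleBuffer,
        copying the data only if the underlying buffer is non-contiguous."""
        m = memoryview(pb)
        if m.contiguous:
            return m                        # zero-copy path
        return memoryview(m.tobytes())      # contiguous (read-only) copy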
pickle.Pickler.__init__ and pickle.dumps are augmented with an additional buffer_callback parameter:
    class Pickler:
        def __init__(self, file, protocol=None, ..., buffer_callback=None):
            """
            If *buffer_callback* is not None, then it is called with a list
            of out-of-band buffer views when deemed necessary (this could be
            once every buffer, or only after a certain size is reached, or
            once at the end, depending on implementation details).  The
            callback should arrange to store or transmit those buffers
            without changing their order.

            If *buffer_callback* is None (the default), buffer views are
            serialized into *file* as part of the pickle stream.

            It is an error if *buffer_callback* is not None and *protocol*
            is None or smaller than 5.
            """

    def pickle.dumps(obj, protocol=None, *, ..., buffer_callback=None):
        """
        See above for *buffer_callback*.
        """
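For illustration, here is a hypothetical producer-side sketch built on this API. The send_multipart function is an assumption for the example (think of a ZeroMQ-style multipart send), and the callback follows the draft semantics above, where it receives a list of buffer views:

    import pickle

    def dump_for_transport(obj, send_multipart):
        buffers = []
        # Per the draft API above, the callback is handed a list of
        # out-of-band buffer views; we simply accumulate them in order.
        data = pickle.dumps(obj, protocol=5, buffer_callback=buffers.extend)
        # Frame 0 is the pickle metadata stream; the remaining frames are
        # the zero-copy payload buffers, in the order they were emitted.
        send_multipart([data, *buffers])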
pickle.Unpickler.__init__ and pickle.loads are augmented with an additional buffers parameter:
    class Unpickler:
        def __init__(self, file, *, ..., buffers=None):
            """
            If *buffers* is not None, it should be an iterable of
            buffer-enabled objects that is consumed each time the pickle
            stream references an out-of-band buffer view.  Such buffers
            have been given in order to the *buffer_callback* of a Pickler
            object.

            If *buffers* is None (the default), then the buffers are taken
            from the pickle stream, assuming they are serialized there.
            It is an error for *buffers* to be None if the pickle stream
            was produced with a non-None *buffer_callback*.
            """

    def pickle.loads(data, *, ..., buffers=None):
        """
        See above for *buffers*.
        """
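A matching consumer-side sketch (again hypothetical, mirroring dump_for_transport above) hands the received frames back to the unpickler in the same order:

    import pickle

    def load_from_transport(frames):
        # Frame 0 is the pickle metadata stream; the remaining frames are
        # the out-of-band buffers, passed back to the unpickler in order.
        data, *buffers = frames
        return pickle.loads(data, buffers=buffers)

Because the payload buffers never enter the metadata stream, the transport is free to forward, store or memory-map them without any intermediate copy.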
Three new opcodes are introduced:
- BYTEARRAY8 creates a bytearray from the data following it in the pickle stream and pushes it on the stack (just like BINBYTES8 does for bytes objects);
- NEXT_BUFFER fetches a buffer from the buffers iterable and pushes it on the stack.
- READONLY_BUFFER makes a readonly view of the top of the stack.
When the pickler encounters a PickleBuffer, there are four possible cases (a short demonstration follows below):
- If a buffer_callback is given and the PickleBuffer is writable, the PickleBuffer is given to the callback and a NEXT_BUFFER opcode is appended to the pickle stream.
- If a buffer_callback is given and the PickleBuffer is readonly, the PickleBuffer is given to the callback and a NEXT_BUFFER opcode is appended to the pickle stream, followed by a READONLY_BUFFER opcode.
- If no buffer_callback is given and the PickleBuffer is writable, it is serialized into the pickle stream as if it were a bytearray object.
- If no buffer_callback is given and the PickleBuffer is readonly, it is serialized into the pickle stream as if it were a bytes object.
The distinction between readonly and writable buffers is explained below (see "Mutability").
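The following sketch (assuming an interpreter implementing this PEP, and the draft list-based callback semantics described earlier) illustrates these cases:

    import pickle, pickletools

    writable = pickle.PickleBuffer(bytearray(b"abc"))   # writable buffer
    readonly = pickle.PickleBuffer(b"abc")              # readonly buffer

    # Without a buffer_callback, the data is serialized in band, as if it
    # were a bytearray (BYTEARRAY8) or a bytes object respectively.
    pickletools.dis(pickle.dumps(writable, protocol=5))
    pickletools.dis(pickle.dumps(readonly, protocol=5))

    # With a buffer_callback, the buffer goes out of band and the stream
    # only contains a NEXT_BUFFER reference (followed by READONLY_BUFFER
    # for readonly buffers).
    out = []
    pickletools.dis(pickle.dumps(readonly, protocol=5,
                                 buffer_callback=out.extend))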
PEP 3118 buffers can be readonly or writable. Some objects, such as Numpy arrays, need to be backed by a mutable buffer for full operation. Pickle consumers that use the buffer_callback and buffers arguments will have to be careful to recreate mutable buffers. When doing I/O, this implies using buffer-passing API variants such as readinto (which are also often preferable for performance).
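For example, a receive path that must hand mutable buffers to the unpickler could read each payload into a bytearray with recv_into (a hypothetical sketch; sock and sizes are assumptions for the example):

    def recv_buffers(sock, sizes):
        """Hypothetical helper: receive out-of-band payloads into mutable
        bytearray objects, suitable for passing to Unpickler(buffers=...)."""
        buffers = []
        for size in sizes:
            buf = bytearray(size)             # writable backing storage
            view = memoryview(buf)
            while view:
                n = sock.recv_into(view)      # no intermediate bytes copy
                if n == 0:
                    raise EOFError("connection closed mid-buffer")
                view = view[n:]
            buffers.append(buf)
        return buffers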
If you pickle and then unpickle an object in the same process, passing out-of-band buffer views, then the unpickled object may be backed by the same buffer as the original pickled object.
For example, it might be reasonable to implement reduction of a Numpy array as follows (crucial metadata such as shapes is omitted for simplicity):
    class ndarray:
        def __reduce_ex__(self, protocol):
            if protocol == 5:
                return numpy.frombuffer, (PickleBuffer(self), self.dtype)
            # Legacy code for earlier protocols omitted
Then simply passing the PickleBuffer around from dumps to loads will produce a new Numpy array sharing the same underlying memory as the original Numpy object (and, incidentally, keeping it alive):
    >>> import numpy as np
    >>> a = np.zeros(10)
    >>> a[0]
    0.0
    >>> buffers = []
    >>> data = pickle.dumps(a, protocol=5, buffer_callback=buffers.extend)
    >>> b = pickle.loads(data, buffers=buffers)
    >>> b[0] = 42
    >>> a[0]
    42.0
This won't happen with the traditional pickle API (i.e. without passing buffers and buffer_callback parameters), because then the buffer view is serialized inside the pickle stream with a copy.
The pickle persistence interface is a way of storing references to designated objects in the pickle stream while handling their actual serialization out of band. For example, one might consider the following for zero-copy serialization of bytearrays:
    class MyPickle(pickle.Pickler):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.buffers = []

        def persistent_id(self, obj):
            if type(obj) is not bytearray:
                return None
            else:
                index = len(self.buffers)
                self.buffers.append(obj)
                return ('bytearray', index)


    class MyUnpickle(pickle.Unpickler):

        def __init__(self, *args, buffers, **kwargs):
            super().__init__(*args, **kwargs)
            self.buffers = buffers

        def persistent_load(self, pid):
            type_tag, index = pid
            if type_tag == 'bytearray':
                return self.buffers[index]
            else:
                assert 0  # unexpected type
This mechanism has two drawbacks:
- Each pickle consumer must reimplement Pickler and Unpickler subclasses, with custom code for each type of interest. Essentially, N pickle consumers end up each implementing custom code for M producers. This is difficult (especially for sophisticated types such as Numpy arrays) and poorly scalable.
- Each object encountered by the pickle module (even simple built-in objects such as ints and strings) triggers a call to the user's persistent_id() method, leading to a possible performance drop compared to nominal.
Should buffer_callback take a single buffer or a sequence of buffers?
- Taking a single buffer would allow returning a boolean indicating whether the given buffer should be serialized in-band or out-of-band (a sketch of this variant follows the list below).
- Taking a sequence of buffers is potentially more efficient by reducing function call overhead.
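To make the first option concrete, a hypothetical single-buffer callback could look as follows (the boolean convention and the size threshold are purely illustrative, since this question is still open):

    out_of_band = []

    def buffer_callback(buf):
        # Illustrative convention: return True to keep the buffer in band,
        # False to take charge of it out of band.
        if memoryview(buf).nbytes < 1024:
            return True                   # small buffer: leave it in the stream
        out_of_band.append(buf)
        return False                      # large buffer: stored out of band

    # data = pickle.dumps(obj, protocol=5, buffer_callback=buffer_callback)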
Should it be allowed to serialize a PickleBuffer in protocol 4 and earlier? It would simply be serialized as a bytes object (if read-only) or bytearray (if writable).
- It can make implementing __reduce__ simpler.
- Serializing a bytearray in protocol 4 makes a supplementary memory copy when bytearray.__reduce_ex__ returns a bytes object. This is a performance regression that may be overlooked by __reduce__ implementors.
A first implementation is available in the author's GitHub fork.
An experimental backport for Python 3.6 and 3.7 is downloadable from PyPI.
Thanks to the following people for early feedback: Nick Coghlan, Olivier Grisel, Stefan Krah, MinRK, Matt Rocklin, Eric Snow.
- Dask.distributed -- A lightweight library for distributed computing in Python: https://distributed.readthedocs.io/
- Dask.distributed custom serialization: https://distributed.readthedocs.io/en/latest/serialization.html
- IPyParallel -- Using IPython for parallel computing: https://ipyparallel.readthedocs.io/
- PyArrow -- A cross-language development platform for in-memory data: https://arrow.apache.org/docs/python/
- PyArrow IPC and component-based serialization: https://arrow.apache.org/docs/python/ipc.html#component-based-serialization
- PEP 3118 -- Revising the buffer protocol: https://www.python.org/dev/peps/pep-3118/
- PEP 554 -- Multiple Interpreters in the Stdlib: https://www.python.org/dev/peps/pep-0554/
- pickle5 branch on GitHub: https://github.com/pitrou/cpython/tree/pickle5
- pickle5 project on PyPI: https://pypi.org/project/pickle5/
- Draft use of pickle protocol 5 for Numpy array pickling: https://gist.github.com/ogrisel/a2b0e5ae4987a398caa7f9277cb3b90a
This document has been placed into the public domain.