[Python-Dev] PEP 574 -- Pickle protocol 5 with out-of-band data

Nathaniel Smith njs at pobox.com
Thu Mar 29 04:18:00 EDT 2018


On Thu, Mar 29, 2018 at 12:56 AM, Chris Jerdonek
<chris.jerdonek at gmail.com> wrote:
> On Wed, Mar 28, 2018 at 6:15 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Wed, Mar 28, 2018 at 1:03 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>> 28.03.18 21:39, Antoine Pitrou wrote:
>>>> I'd like to submit this PEP for discussion.  It is quite specialized
>>>> and the main target audience of the proposed changes is
>>>> users and authors of applications/libraries transferring large amounts
>>>> of data (read: the scientific computing & data science ecosystems).
>>>
>>> Currently I'm working on porting some features from cloudpickle to the
>>> stdlib. For those which can't or shouldn't be implemented in the
>>> general purpose library (like serializing local functions by serializing
>>> their code objects, because it is not portable) I want to add hooks that
>>> would allow implementing them in cloudpickle using an official API. This
>>> would allow cloudpickle to use the C implementation of the pickler and
>>> unpickler.
>>
>> There's obviously some tension here between pickle's use as a
>> persistent storage format, and its use as a transient wire format. For
>> the former, you definitely can't store code objects because there's no
>> forwards- or backwards-compatibility guarantee for bytecode. But for
>> the latter, transmitting bytecode is totally fine, because all you
>> care about is whether it can be decoded once, right now, by some peer
>> process whose python version you can control -- that's why cloudpickle
>> exists.
>
> Is it really true you'll always be able to control the Python version
> on the other side? Even if they're internal services, it seems like
> there could be times / reasons preventing you from upgrading the
> environment of all of your services at the same rate. Or did you mean
> to say "often" all you care about ...?

Yeah, maybe I spoke a little sloppily -- I'm sure there are cases
where you're using pickle as a wire format between heterogeneous
interpreters, in which case you wouldn't use version=NONPORTABLE. But
projects like dask, and everyone else who uses cloudpickle/dill, are
already assuming homogeneous interpreters.
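
As a rough illustration of what "assuming homogeneous interpreters"
could look like when made explicit, here is a minimal sketch that tags
marshalled bytecode with the interpreter's bytecode magic number and
refuses to load it on a mismatch. The helper names pack_code and
unpack_code are made up for this example; nothing here is taken from
cloudpickle, dill, or the PEP.

```python
import importlib.util
import marshal

# Identifies the bytecode format of the running interpreter.
MAGIC = importlib.util.MAGIC_NUMBER


def pack_code(code):
    """Prefix marshalled bytecode with the sender's magic number (hypothetical helper)."""
    return MAGIC + marshal.dumps(code)


def unpack_code(blob):
    """Load bytecode only if it came from a compatible interpreter (hypothetical helper)."""
    prefix, payload = blob[:len(MAGIC)], blob[len(MAGIC):]
    if prefix != MAGIC:
        raise ValueError("bytecode was produced by an incompatible interpreter")
    return marshal.loads(payload)


code = compile("6 * 7", "<demo>", "eval")
blob = pack_code(code)
print(eval(unpack_code(blob)))
```

A receiver running a different bytecode format would hit the
ValueError instead of trying to execute bytes it cannot interpret.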

A typical way of using these kinds of systems is: you start your
script, it spins up some cloud VMs or local cluster nodes (maybe
sending them all a conda environment you made), they all chat for a
while doing your computation, and then they spin down again and your
script reports the results. So versioning and coordinated upgrades
really aren't a thing you need to worry about :-).

Another example is the multiprocessing module: it's very safe to
assume that the parent and the child are using the same interpreter
:-). There's no fundamental reason you shouldn't be able to send
bytecode between them.
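
To make "sending bytecode" concrete, here is a minimal sketch of
shipping a local function by value, assuming the sender and receiver
run the same interpreter. dump_function and load_function are
hypothetical helpers written for this message; this is not how
cloudpickle is actually implemented, and the sketch deliberately
ignores closures and per-function globals.

```python
import marshal
import pickle
import types


def dump_function(func):
    """Serialize a closure-free function by value (hypothetical helper)."""
    if func.__closure__:
        raise ValueError("this sketch does not handle closures")
    # marshal can serialize code objects; its format is tied to the
    # interpreter version, which is why this is only a wire format.
    payload = (marshal.dumps(func.__code__), func.__name__, func.__defaults__)
    return pickle.dumps(payload)


def load_function(data):
    """Rebuild the function on the receiving side (hypothetical helper)."""
    code_bytes, name, defaults = pickle.loads(data)
    code = marshal.loads(code_bytes)
    return types.FunctionType(code, globals(), name, defaults)


def make_local():
    # A local function: plain pickle cannot look this up by name.
    def scale(x, factor=3):
        return x * factor
    return scale


rebuilt = load_function(dump_function(make_local()))
print(rebuilt(14))
```

In a real setup the bytes returned by dump_function would travel over
a pipe or socket to the child process rather than being loaded in the
same process.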

Pickle's not really the ideal wire format for persistent services
anyway, given the arbitrary code execution and tricky versioning --
even if you aren't playing games with bytecode, pickle still assumes
that if two classes in two different interpreters have the same name,
then their internal implementation details are all the same. You can
make it work, but usually there are better options. It's perfect
though for multi-core and multi-machine parallelism.
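
The same-name assumption can be seen in a small sketch. Redefining
the class in one process stands in for a peer that has a different
version of the code installed; the Point class is invented for the
example.

```python
import pickle


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


# The stream records the module and class name plus the instance
# attributes, not the class's implementation.
data = pickle.dumps(Point(1, 2))


# Stand-in for a peer whose Point has a different internal layout.
class Point:
    def __init__(self, r, theta):
        self.r, self.theta = r, theta

    def radius(self):
        return self.r


restored = pickle.loads(data)
print(type(restored) is Point, vars(restored))
```

The restored object is an instance of the new Point but carries the
old attributes, so calling restored.radius() fails with
AttributeError even though unpickling itself succeeded.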

-n

-- 
Nathaniel J. Smith -- https://vorpus.org

