enhancement request: make py3 read/write py2 pickle format

Devin Jeanpierre jeanpierreda at gmail.com
Wed Jun 10 18:10:25 EDT 2015


FWIW most of the objections below also apply to JSON, so this doesn't
just have to be about repr/literal_eval. I'm definitely a huge
proponent of widespread use of something like protocol buffers, both
for production code and personal hacky projects.

On Wed, Jun 10, 2015 at 2:36 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Wednesday 10 June 2015 14:48, Devin Jeanpierre wrote:
>
> [...]
>> and literal_eval is not a great idea.
>>
>> * the common serializer (repr) does not output a canonical form, and
>>   can serialize things in a way that they can't be deserialized
>
> For literals, the canonical form is that understood by Python. I'm pretty
> sure that these have been stable since the days of Python 1.0, and will
> remain so pretty much forever:

The problem is that repr can write out a dict equal to {'a': 1, 'b': 2}
in more than one way, so there is no single canonical text for the same
value. That makes tests brittle -- it's why doctest fails badly at
examples involving dictionaries. Text format protocol buffers output
everything in a stable, sorted order, so you can do textual diffs for
compatibility tests and the like.
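
For instance (illustrative output; with string hash randomization, on
by default since Python 3.3, the ordering can even differ between
fresh runs of the same program):

    >>> repr({'a': 1, 'b': 2})     # one interpreter session
    "{'b': 2, 'a': 1}"
    >>> repr({'a': 1, 'b': 2})     # a fresh session, different hash seed
    "{'a': 1, 'b': 2}"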

At work, one thing we do in places is mock out services using "golden"
expected protobuf responses: one test checks that the server returns
exactly the golden response, and a separate test checks what the
client does with it. The golden files are checked into Perforce in
text format.

>> * there is no schema
>> * there is no well understood migration story for when the data you
>>   load and store changes
>
> literal_eval is not a serialisation format itself. It is a primitive
> operation usable when serialising. E.g. you might write out a simple Unix-
> style rc file of key:value pairs:
>
-snip-
>
> split on "=" and call literal_eval on the value.
>
> This is a perfectly reasonable light-weight solution for simple
> serialisation needs.

I could spend a bunch of time writing yet another config file format,
or I could use text format protocol buffers, YAML, or TOML and call it
a day.
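
For reference, the parser Steven describes is only a handful of lines
(a rough sketch, assuming one NAME = value pair per line, '#'
comments, and values written as Python literals):

    import ast

    def load_rc(path):
        # Rough sketch of the "split on '=' and literal_eval the value"
        # approach; not hardened against malformed lines.
        settings = {}
        with open(path) as f:
            for line in f:
                line = line.split('#', 1)[0].strip()
                if not line:
                    continue
                name, _, value = line.partition('=')
                settings[name.strip()] = ast.literal_eval(value.strip())
        return settings

It works, but the quoting rules, comment syntax, and error handling
are now mine to specify, document, and maintain -- which is exactly
the work the existing formats have already done.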

>> * it encourages the use of eval when literal_eval becomes inconvenient
>>   or insufficient
>
> I don't think so. I think that people who make the effort to import ast and
> call ast.literal_eval are fully aware of the dangers of eval and aren't
> silly enough to start using eval.

The problem is when your config file format uses Python literals and
another programmer wants to consume it without ever looking at your
codebase. When data is transferred between systems this happens a lot:
you are often not the consumer of the data you wrote, and you can't
control how others parse it. They might reach for eval even if you
never meant for them to. In JavaScript, for example, this was once a
common problem for services exposing JSON, and it still happens even
now.
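
The difference matters because literal_eval rejects anything that
isn't a literal, while eval will run whatever it's handed (a contrived
example; the payload string is made up, but the pattern is the classic
one):

    >>> import ast
    >>> payload = "__import__('os').system('echo pwned')"
    >>> ast.literal_eval(payload)       # refuses anything but literals
    Traceback (most recent call last):
      ...
    ValueError: malformed node or string: <_ast.Call object at 0x...>
    >>> eval(payload)                   # happily runs arbitrary code
    pwned
    0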

>> * It is not particularly well specified or documented compared to the
>>   alternatives.
>> * The types you get back differ in python 2 vs 3
>
> Doesn't matter. The types you *write* are different in Python 2 vs 3, so of
> course you do.

In a shared 2/3 codebase, if I write bytes I expect to get bytes back,
and if I write unicode I expect to get unicode back. (There is a third
category -- native strings, which should be bytes on 2.x and text on
3.x -- but that's probably best handled outside the deserializer.) If
you round-trip data through repr and literal_eval on different
versions, unicode written by Python 3 comes back as bytes on Python 2,
and vice versa. That makes migrating to Python 3 even harder.
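
Concretely (a sketch spanning two interpreters, so it isn't one
runnable script; this is what I'd expect each side to produce):

    # Written by Python 3:
    #     repr('hello')                ->  "'hello'"   (text, no prefix)
    # Read back under Python 2:
    #     ast.literal_eval("'hello'")  ->  str, i.e. bytes, not unicode
    #
    # Written by Python 2:
    #     repr(b'hello')               ->  "'hello'"   (the b prefix is dropped)
    # Read back under Python 3:
    #     ast.literal_eval("'hello'")  ->  str, i.e. text, not bytes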

>> For most apps, the alternatives are better. Irmen's serpent library is
>> strictly better on every front, for example. (Except potentially
>> security, who knows.)
>
> Beyond simple needs, like rc files, literal_eval is not sufficient. You
> can't use it to deserialise arbitrary objects. That might be a feature, but
> if you need something more powerful than basic ints, floats, strings and a
> few others, literal_eval will not be powerful enough.

No, it is powerful enough. After all, JSON has the same limitations.
Protobuf only adds enums and structs to JSON's types, and it's
potentially the most-used serialization format in the world by
operations per second.

Serialization libraries and formats usually need handholding to turn
complex Python objects into simple serializable types. [Except pickle,
and that's the very reason it's insecure (per the earlier discussion
in this thread).]
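
What that handholding looks like, for literal_eval just as it would
for JSON (a toy sketch; Point and the helper names are invented for
illustration):

    import ast

    class Point(object):
        def __init__(self, x, y):
            self.x, self.y = x, y

    def dump_point(p):
        # Reduce the object by hand to literal-friendly types...
        return repr({'x': p.x, 'y': p.y})

    def load_point(text):
        # ...and rebuild the object by hand on the other side.
        d = ast.literal_eval(text)
        return Point(d['x'], d['y'])

load_point(dump_point(Point(1, 2))) gets you a Point with the same
coordinates back, but only because both helpers agree by hand on the
shape of the intermediate dict.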

-- Devin


