enhancement request: make py3 read/write py2 pickle format

Steven D'Aprano steve at pearwood.info
Wed Jun 10 23:21:25 EDT 2015


On Thu, 11 Jun 2015 08:10 am, Devin Jeanpierre wrote:

[...]
>> For literals, the canonical form is that understood by Python. I'm pretty
>> sure that these have been stable since the days of Python 1.0, and will
>> remain so pretty much forever:
> 
> The problem is that there are two different ways repr might write out
> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle -- e.g.
> it's why doctest fails badly at examples involving dictionaries. 

Only if they are badly written.

Yes, dicts are *less convenient* for doctests, but if such tests fail, the
blame lies with the author of the tests, not with doctest.

Unordered output is not a problem for dicts, because dicts also have
unordered *input*. It doesn't matter whether you input {'a':1,'b':2} or
{'b':2,'a':1}, you will get the same dict either way.
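To illustrate, a doctest over a dict can be made order-independent by comparing sorted items rather than the raw dict repr. A minimal sketch (the function and its values are hypothetical):

```python
def tally():
    """Return letter counts for an illustrative sample.

    Sorting the items gives the doctest a stable, order-independent
    expected value, regardless of how the dict displays:

    >>> sorted(tally().items())
    [('a', 1), ('b', 2)]
    """
    return {'b': 2, 'a': 1}
```

The same trick works interactively: sorted(d.items()) is deterministic even when the dict's own repr is not.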


[...]
> I could spend a bunch of time writing yet another config file format,
> or I could use text format protocol buffers, YAML, or TOML and call it
> a day.

Writing an rc parser is so trivial that it's almost easier to just write it
than it is to look up the APIs for YAML or JSON, to say nothing of the
rigmarole of defining a protocol buffer config file, compiling it,
importing the module, and using that.

import ast
import collections

def read(configfile):
    config = collections.OrderedDict()
    with open(configfile) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            key, value = line.split("=", 1)
            key = key.rstrip()
            value = value.lstrip()
            config[key] = ast.literal_eval(value)
    return config


That's a basic, *but acceptable*, rc parser written in literally under a
minute. At the risk of ending up with egg on my face, I reckon that it's so
simple and so obviously correct that I can tell it works correctly without
even testing it. (Famous last words, huh?)

Unlike any of the richer, more powerful serialisation formats like YAML,
JSON, or protocol buffers, it's not only human readable but human writable
too. By which I mean, while it is *possible* for a sufficiently motivated
person to write correctly formatted JSON, YAML or even XML, it's not really
something you would choose to do willingly. But Unix sys admins hand-edit
rc files every day.

But of course this also means it's less powerful and can deal with fewer
types of data. Power comes at a cost of complexity, and simplicity itself can be
a virtue. I wouldn't use JSON etc. for config files until I was sure that a
simpler INI or RC file wasn't sufficient for my needs.

Somehow I have drifted away from serialisation in general to specifically
config files... never mind.


[...]
> The problem is when you have your config file format using python
> literals, and another programmer wants to deal with it and doesn't
> look at your codebase, and things like that. When transferring data,
> this can happen a lot, since you are often not the user of the data
> you wrote, and you can't control how others consume it. 

Not only can I not control how they consume it, but I don't care how they
consume it :-)

I hear what you are saying, and I don't disagree with it. I'm just standing
up for simplicity as a virtue when appropriate. If I'm writing a script to
save a bunch of values to pass to another script after some human editing,
it's faster for me to just write out the key:value pairs than it is to
learn how to use protocol buffer, deal with a separate compilation step,
etc. It's actually easier to write out, and read in, the key:values than to
use the configparser module. If you don't need multiple sections, default
values, or variable interpolation, even configparser is overkill.

But if I'm swapping data with others, or if I have to use a richer set of
types or functionality, then naturally I'm going to need something more
powerful, preferably something standard so I don't have to document the
internal format, just say "use XML with this schema" or whatever.


> They might use 
> eval even if you didn't mean for them to. For example, in JavaScript,
> this was once a common problem for services exposing JSON, and it
> still happens even now.

<shrug> If they choose to use eval, *that's not my fault*. You can't stop
them from deserialising your data and then passing any and all strings to
eval, so why should I be expected to stop them from something similar?
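That caveat is exactly what ast.literal_eval is for: it parses literals only and rejects anything with calls or side effects. A quick contrast (the payload strings are made up for illustration):

```python
import ast

# A benign config payload: literals only, parses fine.
payload = "{'debug': True, 'retries': 3}"
assert ast.literal_eval(payload) == {'debug': True, 'retries': 3}

# A malicious "config" that a bare eval() would happily execute;
# literal_eval refuses it because a function call is not a literal.
attack = "__import__('os').getcwd()"
rejected = False
try:
    ast.literal_eval(attack)
except ValueError:
    rejected = True
assert rejected
```

The consumer can still shoot themselves in the foot with eval, but the safe tool is right there in the standard library.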


[...]
>> Beyond simple needs, like rc files, literal_eval is not sufficient. You
>> can't use it to deserialise arbitrary objects. That might be a feature,
>> but if you need something more powerful than basic ints, floats, strings
>> and a few others, literal_eval will not be powerful enough.
> 
> No, it is powerful enough. After all, JSON has the same limitations.

In the sense that you can build arbitrary objects from a combination of a
few basic types, yes, literal_eval is "powerful enough" if you are prepared
to re-invent JSON, YAML, or protocol buffer.

But I'm not talking about re-inventing what already exists. If I want JSON,
I'll use JSON, not spend weeks or months re-writing it from scratch. I
can't do this:

class MyClass:
    pass

a = MyClass()
serialised = repr(a)   # something like '<MyClass object at 0x...>'
b = ast.literal_eval(serialised)   # fails: repr(a) is not a literal
assert a == b

which is what I mean when I say literal_eval isn't powerful enough to handle
arbitrary types. That's not a bug, that's a feature of literal_eval. It is
*designed* to have that limitation.
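By contrast, the round trip works exactly as advertised for the types literal_eval is designed for, as this quick sketch shows:

```python
import ast

# Nested containers of the supported literal types survive the
# repr -> literal_eval round trip unchanged.
data = {'name': 'spam', 'sizes': [1, 2.5, 3], 'flags': (True, None)}
serialised = repr(data)
assert ast.literal_eval(serialised) == data
```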




-- 
Steven




More information about the Python-list mailing list