Bulletproof json.dump?

Thu Jul 9 06:11:01 EDT 2020

On 2020-07-07, Stephen Rosen wrote:

> On Mon, Jul 6, 2020 at 6:37 AM Adam Funk <a24061 at ducksburg.com> wrote:
>
>> Is there a "bulletproof" version of json.dump somewhere that will
>> convert bytes to str, any other iterables to list, etc., so you can
>> just get your data into a file & keep working?
>>
>
> Is the data only being read by python programs? If so, consider using
> pickle: https://docs.python.org/3/library/pickle.html
> Unlike JSON, pickle aims to represent objects as exactly as possible
> and is *not* meant to be interoperable with other languages.
>
>
> If you're using json to pass data between python and some other language,
> you don't want to silently convert bytes to strings.
> If you have a bytestring of utf-8 data, you want to utf-8 decode it before
> passing it to json.dumps.
> Likewise, if you have latin-1 data, you want to latin-1 decode it.
> There is no universal and correct bytes-to-string conversion.
>
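
To make that point concrete, here is a small sketch (the byte string is
made-up sample data) showing that the same bytes decode without error
under two codecs but give different text, so the choice has to be made
explicitly before calling json.dumps:

```python
import json

raw = b"caf\xc3\xa9"                  # sample data: "cafe" with an accented e, UTF-8 encoded

as_utf8 = raw.decode("utf-8")         # the intended text
as_latin1 = raw.decode("latin-1")     # also succeeds, but yields mojibake

# Both results are valid str objects, so json.dumps accepts either one;
# only the caller knows which decoding is the right one.
print(json.dumps({"name": as_utf8}, ensure_ascii=False))
print(json.dumps({"name": as_latin1}, ensure_ascii=False))
```
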
> On Mon, Jul 6, 2020 at 9:45 AM Chris Angelico <rosuav at gmail.com> wrote:
>
>> Maybe what we need is to fork out the default JSON encoder into two,
>> or have a "strict=True" or "strict=False" flag. In non-strict mode,
>> round-tripping is not guaranteed, and various types will be folded to
>> each other - mainly, many built-in and stdlib types will be
>> represented in strings. In strict mode, compliance with the RFC is
>> ensured (so ValueError will be raised on inf/nan), and everything
>> should round-trip safely.
>>
>
> Wouldn't it be reasonable to represent this as an encoder which is provided
> by `json`? i.e.
>
>     from json import dumps, UnsafeJSONEncoder
>     ...
>     dumps(foo, cls=UnsafeJSONEncoder)
>
> Emphasizing the "Unsafe" part of this and introducing people to the idea of
> setting an encoder also seems nice.
>
>
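
For what it's worth, something along those lines can already be
approximated today by subclassing json.JSONEncoder and overriding
default(). The sketch below is only an illustration: the name
LenientJSONEncoder is hypothetical (nothing like it ships in the json
module), and the UTF-8 assumption for bytes is exactly the kind of guess
discussed above.

```python
import json


class LenientJSONEncoder(json.JSONEncoder):
    """Hypothetical lossy encoder; not part of the json module."""

    def default(self, o):
        # default() is only called for objects json cannot serialize itself.
        if isinstance(o, (bytes, bytearray)):
            # Assumption: bytes are UTF-8; undecodable bytes become U+FFFD.
            return o.decode("utf-8", errors="replace")
        try:
            # Sets, ranges, generators and other iterables become lists.
            return list(o)
        except TypeError:
            # Last resort: fall back to the object's string form.
            return str(o)


data = {"raw": b"abc", "tags": {"x"}, "span": range(3), "z": 1 + 2j}
print(json.dumps(data, cls=LenientJSONEncoder))
```

Note that this is lossy by design: a dict-like object that is not a
dict subclass would be reduced to a list of its keys, and nothing
round-trips back to its original type.
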
> On Mon, Jul 6, 2020 at 9:12 AM Chris Angelico <rosuav at gmail.com> wrote:
>
>> On Mon, Jul 6, 2020 at 11:06 PM Jon Ribbens via Python-list
>> <python-list at python.org> wrote:
>> >
>> > The 'json' module already fails to provide round-trip functionality:
>> >
>> >     >>> import json
>> >     >>> for data in ({True: 1}, {1: 2}, (1, 2)):
>> >     ...     if json.loads(json.dumps(data)) != data:
>> >     ...         print('oops', data, json.loads(json.dumps(data)))
>> >     ...
>> >     oops {True: 1} {'true': 1}
>> >     oops {1: 2} {'1': 2}
>> >     oops (1, 2) [1, 2]
>>
>> There's a fundamental limitation of JSON in that it requires string
>> keys, so this is an obvious transformation. I suppose you could call
>> that one a bug too, but it's very useful and not too dangerous. (And
>> then there's the tuple-to-list transformation, which I think probably
>> shouldn't happen, although I don't think that's likely to cause issues
>> either.)
>
>
> Ideally, all of these bits of support for non-JSON types should be opt-in,
> not opt-out.
> But it's not worth making a breaking change to the stdlib over this.
>
> Especially for new programmers, the notion that
>     deserialize(serialize(x)) != x
> just seems like a recipe for subtle bugs.
>
> You're never guaranteed that the deserialized object will match the
> original, but shouldn't one of the goals of a de/serialization library be
> to get it as close as is reasonable?
>
>
> I've seen people do things which boil down to
>
>     json.loads(x)["some_id"] == UUID(...)
>
> plenty of times. It's obviously wrong and the fix is easy, but isn't making
> the default json encoder less strict just encouraging this type of bug?
>
> Comparing JSON data against non-JSON types is part of the same category of
> errors: conflating JSON with dictionaries.
> It's very easy for people to make this mistake, especially since JSON
> syntax looks so much like Python dict literal syntax, so I don't think
> `json.dumps` should be encouraging it.
>
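
A minimal sketch of the UUID comparison bug described above, using a
made-up UUID value: after a round trip through JSON the ID is a plain
str, so comparing it with a UUID object is always False until one side
is converted.

```python
import json
from uuid import UUID

some_id = UUID("12345678-1234-5678-1234-567812345678")   # sample value

# json.dumps cannot serialize a UUID directly, so it goes out as a string.
payload = json.dumps({"some_id": str(some_id)})
loaded = json.loads(payload)

print(loaded["some_id"] == some_id)          # compares str with UUID
print(UUID(loaded["some_id"]) == some_id)    # compares UUID with UUID
print(loaded["some_id"] == str(some_id))     # compares str with str
```
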
> On Tue, Jul 7, 2020 at 6:52 AM Adam Funk <a24061 at ducksburg.com> wrote:
>
>> Here's another "I'd expect to have to deal with this sort of thing in
>> Java" example I just ran into:
>>
>> >>> r = requests.head(url, allow_redirects=True)
>> >>> print(json.dumps(r.headers, indent=2))
>> ...
>> TypeError: Object of type CaseInsensitiveDict is not JSON serializable
>> >>> print(json.dumps(dict(r.headers), indent=2))
>> {
>>   "Content-Type": "text/html; charset=utf-8",
>>   "Server": "openresty",
>> ...
>> }
>>
>
> Why should the JSON encoder know about an arbitrary dict-like type?
> It might implement Mapping, but there's no way for json.dumps to know that
> in the general case (because not everything which implements Mapping
> actually inherits from the Mapping ABC).
> Converting it to a type which json.dumps understands is a reasonable
> constraint.
>
> Also, wouldn't it be fair, if your object is "case insensitive" to
> serialize it as
>   { "CONTENT-TYPE": ... } or { "content-type": ... } or ...
> ?
>
> `r.headers["content-type"]` presumably gets a hit.
> `json.loads(json.dumps(dict(r.headers)))["content-type"]` will get a
> KeyError.
>
> This seems very much out of scope for the json package because it's not
> clear what it's supposed to do with this type.
> Libraries should ask users to specify what they mean and not make
> potentially harmful assumptions.
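
One way to spell out the choice explicitly, sketched with a plain dict
standing in for dict(r.headers) (the header values are sample data, and
lower-casing the keys is just one possible convention):

```python
import json

# Stand-in for dict(r.headers); no network access needed for the sketch.
headers = {"Content-Type": "text/html; charset=utf-8", "Server": "openresty"}

# Without normalization, the round-tripped dict is case sensitive.
plain = json.loads(json.dumps(headers))
print("content-type" in plain)

# Explicit choice: lower-case every key before serializing.
normalized = {key.lower(): value for key, value in headers.items()}
restored = json.loads(json.dumps(normalized))
print(restored["content-type"])
```
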

I see what you mean.  I guess it just bugs me to have to do all this
explicit type conversion (when I'm not using Java!).

-- 
A drug is not bad. A drug is a chemical compound. The problem comes in
when people who take drugs treat them like a license to behave like an
asshole.                                                ---Frank Zappa