[Python-Dev] Guarantee ordered dict literals in v3.7?

Nick Coghlan ncoghlan at gmail.com
Tue Nov 7 20:44:12 EST 2017


On 8 November 2017 at 07:15, Paul Moore <p.f.moore at gmail.com> wrote:
> On 7 November 2017 at 20:35, Paul G <paul at ganssle.io> wrote:
>> If dictionary order is *not* guaranteed in the spec and the dictionary order isn't randomized (which I think everyone agrees is a bit messed up), it would probably be useful if you could enable "random order mode" in CPython, so you can stress-test that your code isn't making any assumptions about dictionary ordering without having to use an implementation where order isn't deterministic.
>>
>> It could be either an environment variable like SCRAMBLE_DICT_ORDER or a flag like --scramble-dict-order. That would probably help somewhat with the very real problem of "everyone's going to start counting on this ordered property".
>
> This seems like overkill to me. By the same logic, we should add a
> "delay garbage collection" mode, that allows people to test that their
> code doesn't make unwarranted assumptions that a reference-counting
> garbage collector is in use.

Quite a few projects these days include PyPy in their CI test matrix,
and one of the things that does is test whether or not you're relying
on refcounting semantics.

We also offer ResourceWarning natively in CPython, which means if you
run under "python3 -Wd", you'll get a warning when external resources
like files are cleaned up implicitly:

    $ python3 -Wd
    Python 3.6.2 (default, Oct  2 2017, 16:51:32)
    [GCC 7.2.1 20170915 (Red Hat 7.2.1-2)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> f = open(".bashrc")
    >>> del f
    __main__:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='.bashrc' mode='r' encoding='UTF-8'>
    >>>

> Most public projects (which are the only ones that really need to
> worry about this sort of detail) will probably be supporting Python
> 3.5 and likely even Python 2.7 for some time yet. So they test under
> non-order-preserving dictionary implementations anyway. And if code
> that's only targeted for Python 3.7 assumes order preserving
> dictionaries, it's likely not got a huge user base anyway, so what's
> the problem?

The concern is that if we don't explicitly perturb dict iteration
order (or offer a way to opt-in to that), then insertion ordering will
end up becoming a *de facto* compatibility requirement for Python
implementations as CPython 2.7 and other releases prior to 3.6 start
dropping out of typical test matrices. With both 2.7 and 3.5 going
end-of-life in 2020, that means 3.7 (mid 2018) and 3.8 (late 2019 or
early 2020) are our available opportunities to make that decision -
beyond that, it starts getting a lot harder to change course away from
implicit standardisation, as there'll be a lot more 3.6+-only code in
the world by then.

Go handled this problem in its map iterator: they added an extra field
to hold a randomly generated hash, and used that hash as the starting
point in the iteration sequence.
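In pure Python terms (an illustrative toy, not how Go or CPython actually lay out their hash tables), a per-iterator random starting offset looks roughly like this:

```python
import random

def perturbed_iter(d):
    """Yield the keys of d starting from a random offset, wrapping around.

    A toy model of Go-style per-iterator randomisation: each new iterator
    picks its own starting point, so code that accidentally relies on a
    fixed iteration order fails sooner rather than later.
    """
    keys = list(d)  # snapshot of the current iteration order
    if not keys:
        return
    start = random.randrange(len(keys))  # per-iterator random start
    for i in range(len(keys)):
        yield keys[(start + i) % len(keys)]

d = {"a": 1, "b": 2, "c": 3}
# The set of keys is stable even though the starting point varies per call
print(sorted(perturbed_iter(d)))
```

The key property is that the randomisation lives in the *iterator*, not the dict itself, which is exactly what PyDict_Next's caller-owned position variable makes hard to replicate in CPython.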

We wouldn't be able to implement per-iterator order randomisation in
CPython due to the way the PyDict_Next API works: that uses a
caller-provided Py_ssize_t entry to track the current position in the
iteration sequence.

This means the simplest change we can make is to adjust the code in
_PyDict_Next that reads the "current iteration index" from the
user-supplied pointer to instead interpret that index as having a
constant offset applied (e.g. starting with the last item in the
"natural" iteration order, and then looping back around to the first
one).
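Modelled in Python (an illustrative sketch only; the real change would live in the C internals of dictobject.c), the constant-offset mapping from caller-visible positions to storage indices would be:

```python
def offset_indices(n, offset):
    """Map caller-visible positions 0..n-1 to storage indices with a
    constant offset applied, wrapping around the end of the table.

    With offset = n - 1, iteration starts at the last entry in the
    "natural" order and loops back around to the first.
    """
    return [(pos + offset) % n for pos in range(n)]

# A 4-entry table iterated with offset 3 visits entries 3, 0, 1, 2
print(offset_indices(4, 3))  # [3, 0, 1, 2]
```

Every entry is still visited exactly once, and the caller's position variable still just counts upward, which is why this is the least invasive way to decouple visible order from storage order.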

"Simplest" isn't the same as "simple" though, as:

1. We can't change this globally for all dicts, as we actually *do*
need keyword argument dicts and class body execution namespaces to be
insertion ordered. That makes it either a per-instance setting, or
else a subtly different dict type.
2. So far, I haven't actually come up with a perturbed iteration
implementation that doesn't segfault the interpreter. The dict
internals are nicely laid out to be iteration friendly, but they
really do assume that you're going to start at index zero, and then
iterate through to the end of the array. The bounds checking and
pointer validity testing become relatively fiddly if you try to push
against that and instead start iteration from a point partway through
the storage array.

That second point also becomes a concern from a performance
perspective because this is code that runs on *each* iteration of a
loop, rather than purely as part of the loop setup.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

