When convert two sets with the same elements to lists, are the lists always going to be the same?

Terry Reedy tjreedy at udel.edu
Sat May 5 00:28:40 EDT 2012


Peng, I actually am thinking about it.

Underlying problem: while unordered means conceptually unordered as far 
as the collection goes, the items in the collection, if homogenous 
enough, may have a natural order, which users find hard to ignore. Even 
if not comparable, an implementation such as CPython that uses linear 
sequential memory will impose some order. Even if the implementation 
uses unordered (holographic?) memory, order will be imposed to iterate, 
as when creating a serialized representation of the collection. Abstract 
objects, concrete objects, and serialized representations are three 
different things, but people tend to conflate them.

Order consistency issues: if the unordered collection is iterated, when 
can one expect the order to be the same? Theoretically, essentially 
never, except that iterating dicts by keys, values, or key-value pairs 
is guaranteed to be consistent, which means that re-iterating has to be 
consistent. I actually think the same might as well be true for sets, 
although there is no doc that says so.

If two collections are equal, should the iteration order be the same? It 
has always been true that if hash values collide, insertion order 
matters. However, a good hash function avoids hash collisions as much as 
possible in practical use cases. Without doing something artificial, as 
I did with the example, collisions should be especially rare on 64-bit 
builds. If one collection has a series of additions and deletions so 
that the underlying hash table has a different size than an equal 
collection build just from insertions, then order will also be different.

If the same collection is built by insertion in the same order, but in 
different runs, bugfix versions, or language versions, will iteration 
order by the same? Historically, it has been for CPython for about a 
decade, and people has come to depend on that constancy, in spite of 
warning not to. Even core developers have not been immune, as the 
CPython test suite has a few set or dict iteration order dependencies 
until it was recently randomized.

Late last year, it became obvious that this constancy is a practical 
denial-of-service security hole. The option to randomize hashing for 
each run was added to 2.6, 2.7, 3.1, and 3.2. Randomized hashing by run 
is part of 3.3. So some of the discussion above is obsolete. The example 
I gave only works for that one run, as hash('a') changes each run. So 
iteration order now changes with each run in fact as well as in theory.

For the doc, the problem is what to say and where without being 
repetitous (and to get multiple people to agree ;-).

-- 
Terry Jan Reedy




More information about the Python-list mailing list