Which type should be used when testing static structure appartenance

Wed Nov 18 06:37:20 EST 2015

On Wed, 18 Nov 2015 01:27 am, Nicolas Évrard wrote:

> Hello,
> 
> I saw the following retweet by Raymond Hettinger in this morning:
> 
>     https://twitter.com/sanityinc/status/666485814214287360
> 
>     Programming tip: many of those arrays and hashes in your code
>     should actually be sets. Match data structures to data
>     constraints!

This is why I hold Twitter in contempt. What Raymond says could be a good
tip, or a load of old hogwash, and there is no way of knowing because you
can't explain much in 130 characters. If it were a blog post, he could
explain what he means. But its a tweet, so he can't.

Of course you should match the data structure to your data constraints, but
what does that mean in practice? *Which* arrays and hashes should be sets?
How do you know which should be changed to sets?

> I saw just in time because in a review I wrote something like this:
> 
>     if operator not in ('where', 'not where')
> 
> and my colleague proposed that I should use a list instead of a tuple.
> But reading the mentioned tweet I tend to think that a set would be a
> better choice.

Exactly. Raymond's tweet only serves to muddy the water and add more
confusion. Before, you thought you knew the right answer: use a tuple.
Then, your colleague proposed a list, adding some uncertainty and
confusion: which is better, a list or a tuple? And now Raymond's tweet has
*increased* the uncertainty: should you use a set?

> What are your opinion on this issue (I realize it's not something of
> the utmost importance but rather a "philosophical" question).

I would say that there is no doubt that in Python 2, using a tuple is far
and away the best solution for the situation you show above.

There are three factors to weigh up: 

(1) how easy is it to create the data structure?

(2) how much space does the data structure use?

(3) how fast is searching the data structure?

How easy is it to create the data structure? In Python 2, lists and tuples
are effectively just as easy to type, but sets not so much:

('where', 'not where')
['where', 'not where']
set(['where', 'not where'])

Not only are sets longer to type, but they are created at runtime. Compare
the byte-code compiled for each expression:

py> from dis import dis
py> a = compile("('where', 'not where')", "", "single")
py> b = compile("['where', 'not where']", "", "single")
py> c = compile("set(['where', 'not where'])", "", "single")
py> dis(a)
  1           0 LOAD_CONST               3 (('where', 'not where'))
              3 PRINT_EXPR
              4 LOAD_CONST               2 (None)
              7 RETURN_VALUE
py> dis(b)
  1           0 LOAD_CONST               0 ('where')
              3 LOAD_CONST               1 ('not where')
              6 BUILD_LIST               2
              9 PRINT_EXPR
             10 LOAD_CONST               2 (None)
             13 RETURN_VALUE
py> dis(c)
  1           0 LOAD_NAME                0 (set)
              3 LOAD_CONST               0 ('where')
              6 LOAD_CONST               1 ('not where')
              9 BUILD_LIST               2
             12 CALL_FUNCTION            1
             15 PRINT_EXPR
             16 LOAD_CONST               2 (None)
             19 RETURN_VALUE

Python can optimise the creation of the tuple to compile-time, but it has to
create the list at run time. It also creates the set at run time, but it
also needs to do a name lookup. All this takes time.

How much space does each data structure take? Again, a tuple easily wins
over the others:

py> import sys
py> sys.getsizeof(('where', 'not where'))
32
py> sys.getsizeof(['where', 'not where'])
40
py> sys.getsizeof(set(['where', 'not where']))
112

The tuple is smaller than the list, and MUCH smaller than the set. (The
exact sizes you see will depend on the precise version and platform you are
using, but I am confident that the tuple will win.)

Last but not least, which is fastest? For that, we can use the timeit
module, running it as a script from the shell:

[steve at ando ~]$ python -m timeit -s "x = 'not where'" "x in ('where', 'not
where')"
10000000 loops, best of 3: 0.122 usec per loop
[steve at ando ~]$ python -m timeit -s "x = 'not where'" "x in ['where', 'not
where']"
10000000 loops, best of 3: 0.119 usec per loop
[steve at ando ~]$ python -m timeit -s "x = 'not where'" "x in
set(['where', 'not where'])"
1000000 loops, best of 3: 0.728 usec per loop

The difference between 0.122 microseconds and 0.119 microseconds is too
close to call. We can say that the tuple and the list are equally fast, and
the set much, much slower.

But... we can speed the set up, by only creating it once:

[steve at ando ~]$ python -m timeit -s "x = 'not where'" -s "S =
set(['where', 'not where'])" "x in S"
10000000 loops, best of 3: 0.101 usec per loop

That's better! But only *marginally* faster than the tuple, and to get the
speed increase you have to factor out the set into a constant.

So for Python 2, I would say that tuples are the clean winner, at least for
the case where you only have two items to check. If you have more than two,
sets start becoming more attractive.

How about Python 3? Python 3 has the advantage of using set literals. Let's
test the speed:

[steve at ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in
('where', 'not where')"
10000000 loops, best of 3: 0.11 usec per loop
[steve at ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in
['where', 'not where']"
10000000 loops, best of 3: 0.113 usec per loop
[steve at ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in
{'where', 'not where'}"
10000000 loops, best of 3: 0.0907 usec per loop

This makes sets much more attractive in Python 3.

-- 
Steven