how fast is Python code - another detail

Thu Mar 4 15:18:15 EST 2004

----- Original Message ----- 
From: "Gregor Lingl" <glingl at aon.at>
Newsgroups: comp.lang.python
Sent: Thursday, March 04, 2004 7:43 PM
Subject: how fast is Python code - another detail

> "Quality of design" may - for instance - consist
> in deleting 7 characters:
>
> Today I made - by accident - an interesting
> observation. The following is a snippet from
> a program which computes anagrams from a wordlist:
>
> def find_anagrams(wordlist):
>      d = {}
>      for word in wordlist:
> sig = anagram_signature(word)
> if sig not in d:
>              d[sig] = []     # d.setdefault intentionally not used
>      return d
>
> On my machine it computes d
> for a wordlist of 10000 items in less than 0.13s
> and for a wordlist of 20000 items in less than 0.25s
>
> Today one of my students inadvertedly wrote in the fifth line
>
>          if sig not in d.keys():
>
> and now timing of find_anagrams had the following results:
>
> for a wordlist of 10000 items approx. 9.33s, i.e a factor
> of approx 65 slower
>
> and for a wordlist of 20000 items approx. 60s, i.e. a factor
> of approx 250 slower. (!!!) (Imagine that the full wordlist
> in this example has 45000 entires, so an additional factor of
> 16 applies ...)
>
> I think this is a very astonishing and relevant difference!
> I think I understand why it is the case - the timing factor
> is proportional to the square of len(wordlist), because
> in every loop a list d.keys() has to be computed and now
> this has to be used for searching for the key sig,
> while for sig in d uses some built-in generator-property
> of dictionaries ...  (?)
>
> *on_the_other_ hand* I thought, that those two idioms
> are semantically identical.

Nope. They are not. "if sig not in d" equates to "not d.has_key(sig)"
("O(1)" in terms of performance)
And d.keys() makes a copy of d's entire list of keys (giving you "O(N)").

You might also consider using:

try:
   d[sig]
except KeyError:
   d[sig]=[]

HTH,

Vincent Wehren

So my question: why does the
> Python interpreter not recognize this identity and internally
> replace the latter by the first in order to get *much*
> better performance? >
> - would this lead to wrong results in some cases?
> - or would it be too difficult to implement?
> - or is this considered a problem of minor importance?
> - or was it only forgotten to do?
> - or what else?

> Regards,
> Gregor
>
> P.S.: Or was the key in dict idiom introduced exactly
> inorder to solve this problem?