Performance issue

Sat Apr 2 11:43:02 EST 2005

Tom Carrick <knyght at gmail.com> writes:

> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds.

Indeed, your program can be improved to run about ten times as fast,
which (on my system, with 96274 entries in /usr/share/dict/words) is
below a second.

In general you should try to move the loops into C code, i.e. use
built-in functions instead of long 'for' blocks.

Some comments on your version:

> import string
>
> # Need a function to convert a string to a list to be
> # able to use the sort() function
> def string2list(s): [snipped]

list() achieves the same thing a lot faster.

> words = []

You do not need to initialize 'words' here, as you're overwriting it a
few lines afterwards.

> found = []
>
> anagram = raw_input("Find anagrams of word: ")
>
> f = open('words.txt', 'r')
> file = f.read()
> f.close()
>
> words = file.splitlines()

Try to avoid assigning to the names of built-in functions if you can.
Names like 'file', 'list', 'dict', 'map' etc. are often an obvious
choice, but overwriting them means that you don't "just" know what a
later use refers to.

> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))

Unless you *really* have to, don't use comparison functions with
sort(), as they slow the operation considerably.  In this (as in most)
cases, a plain sorted_anagram.sort() does the trick, and in version
2.4 you can achieve custom sort orders with the optional 'key'
argument.  The sorted() built-in also comes in handy here.

> while words:
>     if len(words[0]) == len(sorted_anagram):
>         wordlist = string2list(words[0])
>         wordlist.sort(lambda x, y: cmp(x, y))
>         sorted_wordlist = wordlist
>         if sorted_anagram == sorted_wordlist:
>             found.append(words[0])
>     del words[0]

Avoid this style of looping at all times!  Removing the first element
of a list is O(n), so looping through the whole list as above is
O(n**2).  In most cases you should use a for loop:

for word in words:
  # do something

which is O(n) of course.  If you do have to loop destructively, pop()
from the end (which is the default) like so:

while words:
  word = words.pop()
  # do something

This is also O(n), because removing the *last* element of a list is
O(1) (amortized; I suppose the implementation will occasionally shrink
the underlying array at linear cost).

> print "Anagrams of " + anagram + ": "
> while found:
>     print found[0] + " "
>     del found[0]

I assume you meant not to print a newline between the words, which
'print' does by default.  The best solution in that case is "
".join(found).

A better version (2.4+ only):

-- 8< -- 8< --
anagram = raw_input("Find anagrams of word: ")

words = open('words.txt', 'r')

sorted_anagram = sorted(anagram.lower())

found = []

for word in words.read().splitlines():
    if len(word) == len(anagram) and sorted(word) == sorted_anagram:
        found.append(word)

print "Anagrams of %s: %s" % (anagram, ' '.join(found))
-- >8 -- >8 --

Interestingly, the length comparison makes quite a difference!  I
removed it at first, thinking it was unnecessary.  Here are some
timings:

* Your original version (for comparison):

  $ time echo stop | python2.4 anagram_slow.py
  [...]
  real    0m9.090s
  user    0m8.790s
  sys     0m0.013s

* Your version, but with the O(n**2) loop replaced by an O(n) 'for':

  $ time echo stop | python2.4 anagram_forloop.py
  [...]
  real    0m0.221s
  user    0m0.134s
  sys     0m0.014s

* My version but with the length comparison removed:

  $ time echo stop | python2.4 anagram_no_lencmp.py
  [...]
  real    0m0.408s
  user    0m0.353s
  sys     0m0.010s

* My version as above:

  $ time echo stop | python2.4 anagram_fast.py
  [...]
  real    0m0.144s
  user    0m0.099s
  sys     0m0.008s

Hope that helps :-)

- Thomas

-- 
If you want to reply by mail, substitute my first and last name for
'foo' and 'bar', respectively, and remove '.invalid'.