Performance issue
Thomas Rast
foo.bar at freesurf.ch.invalid
Sat Apr 2 11:43:02 EST 2005
Tom Carrick <knyght at gmail.com> writes:
> In my attempted learning of python, I've decided to recode an old
> anagram solving program I made in C++. The C++ version runs in less
> than a second, while the python takes 30 seconds.
Indeed, your program can be improved to run about ten times as fast,
which (on my system, with 96274 entries in /usr/share/dict/words) is
below a second.
In general you should try to move the loops into C code, i.e. use
built-in functions instead of long 'for' blocks.
Some comments on your version:
> import string
>
> # Need a function to convert a string to a list to be
> # able to use the sort() function
> def string2list(s): [snipped]
list() achieves the same thing a lot faster.
> words = []
You do not need to initialize 'words' here, as you're overwriting it a
few lines afterwards.
> found = []
>
> anagram = raw_input("Find anagrams of word: ")
>
> f = open('words.txt', 'r')
> file = f.read()
> f.close()
>
> words = file.splitlines()
Try to avoid assigning to the names of built-in functions if you can.
Names like 'file', 'list', 'dict', 'map' etc. are often an obvious
choice, but overwriting them means that you don't "just" know what a
later use refers to.
> sorted_anagram = anagram.lower()
> sorted_anagram = string2list(anagram)
> sorted_anagram.sort(lambda x, y: cmp(x, y))
Unless you *really* have to, don't use comparison functions with
sort(), as they slow the operation considerably. In this (as in most)
cases, a plain sorted_anagram.sort() does the trick, and in version
2.4 you can achieve custom sort orders with the optional 'key'
argument. The sorted() built-in also comes in handy here.
> while words:
> if len(words[0]) == len(sorted_anagram):
> wordlist = string2list(words[0])
> wordlist.sort(lambda x, y: cmp(x, y))
> sorted_wordlist = wordlist
> if sorted_anagram == sorted_wordlist:
> found.append(words[0])
> del words[0]
Avoid this style of looping at all times! Removing the first element
of a list is O(n), so looping through the whole list as above is
O(n**2). In most cases you should use a for loop:
for word in words:
# do something
which is O(n) of course. If you do have to loop destructively, pop()
from the end (which is the default) like so:
while words:
word = words.pop()
# do something
This is also O(n), because removing the *last* element of a list is
O(1) (amortized; I suppose the implementation will occasionally shrink
the underlying array at linear cost).
> print "Anagrams of " + anagram + ": "
> while found:
> print found[0] + " "
> del found[0]
I assume you meant not to print a newline between the words, which
'print' does by default. The best solution in that case is "
".join(found).
A better version (2.4+ only):
-- 8< -- 8< --
anagram = raw_input("Find anagrams of word: ")
words = open('words.txt', 'r')
sorted_anagram = sorted(anagram.lower())
found = []
for word in words.read().splitlines():
if len(word) == len(anagram) and sorted(word) == sorted_anagram:
found.append(word)
print "Anagrams of %s: %s" % (anagram, ' '.join(found))
-- >8 -- >8 --
Interestingly, the length comparison makes quite a difference! I
removed it at first, thinking it was unnecessary. Here are some
timings:
* Your original version (for comparison):
$ time echo stop | python2.4 anagram_slow.py
[...]
real 0m9.090s
user 0m8.790s
sys 0m0.013s
* Your version, but with the O(n**2) loop replaced by an O(n) 'for':
$ time echo stop | python2.4 anagram_forloop.py
[...]
real 0m0.221s
user 0m0.134s
sys 0m0.014s
* My version but with the length comparison removed:
$ time echo stop | python2.4 anagram_no_lencmp.py
[...]
real 0m0.408s
user 0m0.353s
sys 0m0.010s
* My version as above:
$ time echo stop | python2.4 anagram_fast.py
[...]
real 0m0.144s
user 0m0.099s
sys 0m0.008s
Hope that helps :-)
- Thomas
--
If you want to reply by mail, substitute my first and last name for
'foo' and 'bar', respectively, and remove '.invalid'.
More information about the Python-list
mailing list