When Python outruns C++ ...

Julian Tibble chasm at rift.sytes.net
Mon Mar 31 20:48:26 EST 2003


In article <tyfr88nwbx6.fsf_-_ at pcepsft001.cern.ch>, Jacek Generowicz wrote:
> Alex Martelli <aleax at aleax.it> writes:
>> Thanks to the excellent support given for these tasks by the standard
>> library, the C++ source is ONLY twice as big as the Python source for
>> the same job (a more usual ratio is around 5:1).  This holds for both
>> the simplest versions, and the slightly more complicated ones with
>> somewhat better optimization.  The runtimes on my box (Linux Mandrake
>> 9.0, gcc 3.1, Python 2.3) are, when the programs are run on the 4.4 MB
>> of the "King James Bible" in plain ASCII text: 17.4 seconds for the
>> simplest C++, going down to 15 with optimizations; 11.2 seconds for the
>> simplest Python, going down to 8.1 with optimizations (CPU times are
>> in very similar ratios).
> 
> Will you make this code public ?
> 
> Having a few examples of this sort readily available could be very
> useful when trying to make the point that the belief that C++ is
> absolutely essential for writing fast code, is slightly, errm,
> misguided.

I would also be interested to see the code, because I recently wrote
a similar program in C.  My program counts the number of occurrences of
each word in a file (where a word is a non-empty run of characters from
a-z; uppercase letters are converted to lowercase first).
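
For example, once the input has been lowercased, anything other than a
lowercase letter acts as a word separator, so contractions and digits
split or drop out:

>>> import re
>>> re.findall('[a-z]+', "Don't stop, 123 go!".lower())
['don', 't', 'stop', 'go']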

Having read Alex's original post, I quickly coded a Python version to
do the same thing.

Now the Python version was a _lot_ shorter (30 lines vs. 273 lines).

However, on a file of around 80 MB, the C program outperformed the
Python one by a considerable margin:

$ (time ./words < biginput) > /dev/null

real    0m8.476s
user    0m7.700s
sys     0m0.770s


$ (time python words.py < biginput) > /dev/null
 
real    1m59.747s
user    1m57.040s
sys     0m1.380s


Note that the Python time is roughly in line with what Alex's original
program would take if it scales linearly (and there is no reason why it
shouldn't): his 8.1 seconds for 4.4 MB scales to roughly 2.5 minutes for
80 MB, which is in the same ballpark as the two minutes measured above.
I therefore have to question the efficiency of his C++ program, since
I'm not sure I can write a better hash table than the g++ STL developers.


Julian

P.S. Maybe my Python version is not very good; feel free to make
it go faster:

#
# Count the number of occurrences of each word in a file
# -- Julian Tibble
#

import re

# "Word -> Number of occurences" mapping
wordcount = {}

# Regular expression to match a word
re_word = re.compile('[a-z]+')

# Read the input line by line and split each line into words
try:
	while 1:
		line = raw_input().lower()
		line = re_word.findall(line)
		for word in line:
			if wordcount.has_key(word):
				wordcount[word] += 1
			else:
				wordcount[word] = 1
except EOFError:
	pass


# Order by descending number of occurrences
sortable = zip(wordcount.values(), wordcount.keys())
sortable.sort()
sortable.reverse()

for n,w in sortable:
	print w,n
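
One direction that might make it faster (a rough, untested sketch): read
stdin in large chunks instead of line by line, and use dict.get() to avoid
looking each word up twice.  A word cut in half by a chunk boundary is
carried over into the next chunk.

import re
import sys

re_word = re.compile('[a-z]+')

wordcount = {}
leftover = ''
while 1:
	# Read about 1 MB at a time instead of one line at a time
	chunk = sys.stdin.read(1 << 20)
	if not chunk:
		break
	text = (leftover + chunk).lower()
	words = re_word.findall(text)
	# If the text ends mid-word, the last match may be only the first
	# half of a word, so save it and prepend it to the next chunk
	if words and text.endswith(words[-1]):
		leftover = words.pop()
	else:
		leftover = ''
	for word in words:
		wordcount[word] = wordcount.get(word, 0) + 1
if leftover:
	wordcount[leftover] = wordcount.get(leftover, 0) + 1

# Order by descending number of occurrences, as before
sortable = zip(wordcount.values(), wordcount.keys())
sortable.sort()
sortable.reverse()
for n, w in sortable:
	print w, n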




