map del efficiency python2.2 and 2.1

Tue Jul 16 11:08:00 EDT 2002

I have a file formatted like

word1 int1
word2 int2

with 800,000+ lines like that.  I want to load that file into a map 

  m[word] = int

and my original implementation was quite slow.  So I have been
experimenting with the most efficient way to load it and found that
the best way was to load the entire file as a string, split it, and
then iterate over the sequence with a stride of 2.  With this approach
I was able to initialize the entire map in 3.5 - 4s (if anyone has any
further suggestions, please jump in, but I am reasonably satisfied
with this performance).  

In the process of profiling, however, I found something surprising.
Most of the total execution time of the code is spent clearing the map
from memory, after all the lines in my script have executed (ie, the
equivalent of calling __del__ on a dictionary type.  In python 2.1,
here is a typical execution time profile

time to read file as string and split into seq: 1.000s
time to create map: 2.960s
total execution time: 9.87s

5.9s of the total runtime occurred *after* all the lines of my code
had executed.  Since my program does nothing after reading the map, I
assume most of this time is spent freeing the map.  If I comment out
the part of my program that creates the map, I do not have that long
delay at the end of the script.

Finally, this time is dramatically longer in python 2.2.  Here are
typical runtimes in 2.2

time to read file as string and split into seq: 0.980s
time to create map: 2.650s
total execution time: 27.87s

These runtimes are highly consistent (the machine isn't doing much
else).  So 2.2 does about 10% better in filling the map, but then
takes 24s to shut down.  

Here's the script:

import time

def pairs(l):
    return [(l[i], l[i+1]) for i in range(0, len(l), 2)]

start = time.clock()

file = '/home/jdhunter/dict/meta/allwords.dat'
fh = open(file, 'r')

# this takes 0.97s
s = fh.read()
seq = s.split()
readTime = time.clock()-start
print 'seq has %d items; elapsed time is %1.3f' % \
      (len(seq), readTime)

m = {}
#this takes 13.9 s
#for (key, val) in pairs(seq):
#    m[key] = int(val)

#this takes 2.9s
for i in xrange(0,len(seq),2):
    m[seq[i]] = int(seq[i+1])

mapTime = time.clock() - readTime
print 'map has %d keys; time to load map is %1.3f' % \
      (len(m.keys()), mapTime)

#cruncher1:~/dict/test> time python2.1 test_allwords.py
#seq has 1712676 items; elapsed time is 1.000
#map has 856338 keys; time to load map is 2.960
#4.400u 0.480s 0:09.87 49.4%	0+0k 0+0io 298pf+0w

#cruncher1:~/dict/test> time python2.2 test_allwords.py
#seq has 1712676 items; elapsed time is 0.980
#map has 856338 keys; time to load map is 2.650
#13.390u 0.500s 0:27.87 49.8%	0+0k 0+0io 350pf+0w

John Hunter