Populating a dictionary, fast

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Nov 10 23:32:44 EST 2007


On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:


> So, you think Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?

No.

Here's my sample file:

$ wc -l id2name.txt
8191180 id2name.txt
$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt

And the results of reading it into a dict (code follows below):

$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...


Traceback (most recent call last):
  File "./slurp_dict.py", line 20, in <module>
    del id2name
KeyboardInterrupt

real    35m52.334s
user    1m17.663s
sys     0m16.758s


Notice that the dict is completely read into memory in just two and a 
half minutes. The script then tries to delete the dict, and 32 minutes 
later it is still struggling. That's the point at which I got sick of 
waiting and interrupted the script.

Conclusion: it's a memory issue, or maybe a garbage collection issue, not 
a problem with dicts.




Here's my code for creating the key:value file in the first place:

#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

import random

bits64 = 2**64
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    # randint's upper bound is inclusive, so subtract 1 to stay within 64 bits.
    fp.write(template % random.randint(0, bits64 - 1))
fp.close()

###

And here's my code for slurping it into a dict:

#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

###


So, what can you do? Given that I'm not willing to do any more unpaid 
experimentation for you, here are my suggestions, in no particular order:

(1) Presumably you don't want to run your app with the garbage collector 
turned off. You might still want to play around with the gc module to see 
what you can learn; there's a small sketch of that after this list.

(2) More memory will help avoid paging. If you can't get more memory, try 
more virtual memory. It will still be slow, but at least the operating 
system doesn't have to try moving blocks around as much.

(3) Are you sure you need all eight-million-plus items in the cache all 
at once?

(4) There are lots of algorithms out there for dealing with data too big 
to fit into main memory. Do some research; the shelve sketch after this 
list is one of the cheapest starting points.

(5) There is a data structure designed for dealing with tens of millions 
of records at once. It is called "a database". If you can't find a better 
algorithm, and refuse to use an existing RDBMS, I suspect you're going to 
end up inventing a primitive, inefficient, buggy database which is no 
faster than existing systems out there. (There's a sqlite3 sketch below; 
you don't even have to install anything.)
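
To illustrate (1), here's the shape of the experiment I have in mind: 
build a biggish dict with the collector on, with it off, and with the 
thresholds raised, then compare timings. I haven't benchmarked this 
particular snippet, and the sizes and threshold values are plucked out 
of thin air:

import gc
import time

def build(n):
    """Build a throwaway dict of n items and report how long it takes."""
    start = time.time()
    d = {}
    for i in xrange(n):
        d[i] = ('x' * 10, i)  # tuples, so the cyclic collector has work to do
    print "%d items in %.1f seconds (thresholds %s)" % (
            n, time.time() - start, gc.get_threshold())

build(2000000)            # collector on, default thresholds

gc.disable()              # collector off just for the duration of the build
try:
    build(2000000)
finally:
    gc.enable()

gc.set_threshold(700, 10000, 10000)   # or make the older collections rarer
build(2000000)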
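
For (3) and (4), one cheap trick for keeping the data on disk is the 
standard library's shelve module, which gives you a dict-like object 
backed by a dbm file, so only the entries you actually touch get loaded. 
A rough sketch, which I haven't tried on eight million rows; note that 
shelf keys have to be strings:

import shelve

# One-off conversion: id2name.txt -> an on-disk mapping.
db = shelve.open('id2name.shelf', 'c')
for n, line in enumerate(open('id2name.txt', 'r')):
    key, name = line.strip().split(':', 1)
    db[key] = name            # keys stay as strings
    if n % 1000000 == 0:
        print "Line %d" % n
db.close()

# Later: lookups hit the disk file instead of a giant in-memory dict.
db = shelve.open('id2name.shelf', 'r')
print db.get('1234', 'no such key')   # '1234' is just a made-up example
db.close()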
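
And for (5), you don't even have to install anything: Python 2.5 ships 
with sqlite3 in the standard library. A minimal sketch, untested against 
your data, with the file and table names invented for the example:

import sqlite3

conn = sqlite3.connect('id2name.db')
# ids are stored as TEXT because SQLite integers are signed 64-bit, and
# these keys can exceed 2**63 - 1.
conn.execute(
    "CREATE TABLE IF NOT EXISTS id2name (id TEXT PRIMARY KEY, name TEXT)")

def rows():
    # Stream (id, name) pairs straight out of the text file.
    for line in open('id2name.txt', 'r'):
        key, name = line.strip().split(':', 1)
        yield (key, name)

conn.executemany("INSERT OR REPLACE INTO id2name VALUES (?, ?)", rows())
conn.commit()

# Individual lookups, without holding eight million items in RAM.
row = conn.execute(
    "SELECT name FROM id2name WHERE id = ?", ('1234',)).fetchone()
print row[0] if row else "no such key"    # again, '1234' is made up
conn.close()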



-- 
Steven.


