Populating a dictionary, fast

Michael Bacarella mbac at gpshopper.com
Sun Nov 11 11:25:15 EST 2007


Firstly, thank you for all of your help so far, I really appreciate it.

> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
> 
> No.

Yes.

I tried your code (with one change: a timestamp on the feedback lines) and
got the same terrible performance against my data set.

$ cat stevescode.py
#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n,time.asctime()
    id,name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

$ ./stevescode.py
Starting at Sun Nov 11 10:46:37 2007
Line 0 Sun Nov 11 10:46:37 2007
Line 1000000 Sun Nov 11 10:48:22 2007
Line 2000000 Sun Nov 11 10:51:08 2007
Line 3000000 Sun Nov 11 10:56:16 2007
Line 4000000 Sun Nov 11 10:58:41 2007
Line 5000000 Sun Nov 11 11:03:26 2007
^C

To prove that my machine is sane, I ran the same script against your generated
sample file and got _excellent_ performance: start to finish in under a minute.

$ cat steves-makedict.py
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()

$ ./steves-makedict.py

$ ./stevescode.py
Starting at Sun Nov 11 11:15:31 2007
Line 0 Sun Nov 11 11:15:31 2007
Line 1000000 Sun Nov 11 11:15:37 2007
Line 2000000 Sun Nov 11 11:15:43 2007
Line 3000000 Sun Nov 11 11:15:49 2007
Line 4000000 Sun Nov 11 11:15:54 2007
Line 5000000 Sun Nov 11 11:16:00 2007
Line 6000000 Sun Nov 11 11:16:07 2007
Line 7000000 Sun Nov 11 11:16:12 2007
Line 8000000 Sun Nov 11 11:16:18 2007
Items in dict: 8191180
Completed import at Sun Nov 11 11:16:19 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 11:16:23 2007
Finishing at Sun Nov 11 11:16:23 2007
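The only difference between the two runs is the keys themselves, so the next
thing I'm inclined to try is a quick look at how my real keys spread out under
hash().  This is just a rough sketch of that check; the table size and the
one-million-line sample are arbitrary choices on my part:

#!/usr/bin/python
"""Rough check of how the real keys spread across hash slots.

Sketch only: TABLE_SIZE and the one-million-line sample are arbitrary;
the masking just mimics how a power-of-two hash table picks a slot.
"""

TABLE_SIZE = 2**24
SAMPLE = 1000000

slots = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n >= SAMPLE:
        break
    key = long(line.split(':', 1)[0])
    slot = hash(key) & (TABLE_SIZE - 1)
    slots[slot] = slots.get(slot, 0) + 1

print "distinct slots used:", len(slots)
print "largest pile-up in one slot:", max(slots.itervalues())

If the keys were behaving like the pseudo-random ones, the number of distinct
slots should be close to the sample size and the worst slot should hold only
a handful of keys.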


> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32 minutes
> later is still struggling. That's the point I got sick of waiting and
> interrupted the script.
>
> Conclusion: it's a memory issue, or maybe a garbage collection issue,
> not a problem with dicts.

As you can see, not the case at all against my data set.

> (1) Presumably you don't want to run your app with the garbage collector 
> turned off. You might still want to play around with the gc module to see 
> what you can learn.

As you can see, your version did no better. :(
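Since the GC is already disabled in the script above, the other thing I may
try is splitting the parse step from the dict-insert step, to see which one is
actually eating the time.  A rough sketch (note it holds both the list of
pairs and the dict, so it needs roughly twice the memory):

#!/usr/bin/python
"""Time the parse phase and the dict-build phase separately.

Sketch only: it keeps the parsed pairs in a list before building the
dict, so peak memory is roughly double that of the original script.
"""
import time

start = time.time()
pairs = []
for line in open('id2name.txt', 'r'):
    id, name = line.strip().split(':', 1)
    pairs.append((long(id), name))
print "parse phase:  %.1f seconds" % (time.time() - start)

start = time.time()
id2name = dict(pairs)
print "insert phase: %.1f seconds" % (time.time() - start)
print "items:", len(id2name)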

> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the operating
> system doesn't have to try moving blocks around as much.

The machine has 8GB, and is not doing anything else when I run this.
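For what it's worth, I can also watch the resident set size while the dict
loads, to rule paging out completely.  A Linux-only sketch (it just scrapes
VmRSS out of /proc/self/status):

#!/usr/bin/python
"""Same load loop, but report resident memory at each feedback line.

Linux-only sketch: VmRSS is scraped out of /proc/self/status. If the
number stays far below 8GB, the slowdown isn't paging.
"""
import time

def rss_kb():
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])   # reported in kB
    return -1

id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        print "Line %d" % n, time.asctime(), "(RSS %d kB)" % rss_kb()
    id, name = line.strip().split(':', 1)
    id2name[long(id)] = name
print "Items in dict:", len(id2name)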

> (3) Are you sure you need all eight-million-plus items in the cache all
> at once?

Yes.

> (4) There are lots of algorithms out there for dealing with data too big 
> to fit into main memory. Do some research.

It DOES fit into main memory, and a dictionary is exactly the right way
to do this.

> (5) There is a data structure designed for dealing with tens of millions
> of records at once. It is called "a database". If you can't find a better
> algorithm, and refuse to use an existing RDBMS, I suspect you're going to
> end up inventing a primitive, inefficient, buggy database which is no
> faster than existing systems out there.

I've tried three disk-based implementations already (BerkeleyDB, cdb, and an
RDBMS).  Performance is terrible because they end up doing too many random
disk seeks.  Pre-caching all of the data ahead of time has offered us the best
performance so far, but it's still slower than it ought to be.

Creating a HEAP TABLE in the RDBMS is an option, but moving all of this really
simple code into the RDBMS just to find a hashing algorithm that doesn't choke
on my keys sounds pretty lame.

A hash cached in main memory is the right way to do this.

The Perl version runs *very* fast, after all.
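(Perl hashes its keys as strings, so one cheap experiment I can run is keeping
the ids as strings on the Python side too.  If this variant is fast against
the real data, the problem is specific to hashing the 64-bit longs.  A sketch
of that variation:

#!/usr/bin/python
"""Same load, but keep the ids as strings instead of longs.

Sketch of one experiment: if this runs as fast on the real data as the
random-key test did, the slowdown is specific to the 64-bit long keys.
"""
import time

id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        print "Line %d" % n, time.asctime()
    id, name = line.strip().split(':', 1)
    id2name[id] = name          # note: no long() conversion
print "Items in dict:", len(id2name), time.asctime()
)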





