Is 100,000 entries big for a dictionary?

Alex Martelli aleaxit at yahoo.com
Fri Dec 29 19:22:39 EST 2000


<rturpin at my-deja.com> wrote in message news:92iug4$vqb$1 at nnrp1.deja.com...
> I'm in the process of architecting an application with medium sized data
> sets. Python dictionaries are tempting as part of the representation
> mechanism. How do they perform when they have 100,000 entries? Can you

Some empiricism to the rescue...:

import whrandom as rand   # whrandom: the random-number module of the day
import time

dict = {}   # the dictionary under test

# populate: insert 100,000 integer keys
start = time.clock()
for i in range(100000):
    dict[i] = i
stend = time.clock()

print "Populating the dict (100,000 items):",stend-start

# baseline: the cost of generating 10,000 random indices by itself,
# to be subtracted from the read/write timings below
start = time.clock()
for i in range(10000):
    x = rand.randrange(100000)
stend = time.clock()
gentime = stend-start


# 10,000 reads at random keys, net of the index-generation overhead
start = time.clock()
for i in range(10000):
    x = rand.randrange(100000)
    y = dict[x]
stend = time.clock()
reatime = (stend-start) - gentime

print "10,000 reads:", reatime


# 10,000 overwrites of existing keys, again net of that overhead
start = time.clock()
for i in range(10000):
    x = rand.randrange(100000)
    dict[x] = -93
stend = time.clock()
writime = (stend-start) - gentime

print "10,000 writes", writime


It should not matter that the dictionary is filled with consecutive
integer keys (except that such keys are guaranteed to hash well,
since small integers hash to themselves!).
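
(If you wanted to double-check that, a quick variation -- just a
sketch, not something I timed for this post -- is to insert the same
keys in shuffled order and compare the populating time:)

# sketch: populate with the same 100,000 integer keys in shuffled order
import whrandom as rand
import time

keys = range(100000)
# shuffle by random swaps (Fisher-Yates), using rand.randrange as in the script above
for i in xrange(len(keys)-1, 0, -1):
    j = rand.randrange(i+1)
    keys[i], keys[j] = keys[j], keys[i]

d = {}
start = time.clock()
for k in keys:
    d[k] = k
print "Populating from shuffled keys:", time.clock()-start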

On my old cranky K5/300MHz running Win98, 96M RAM:


D:\PySym>python provdic.py
Populating the dict (100,000 items): 1.03804371511
10,000 reads: 0.174285522721
10,000 writes 0.21197220872

D:\PySym>python provdic.py
Populating the dict (100,000 items): 1.18972242243
10,000 reads: 0.0994661325198
10,000 writes 0.123712264704

D:\PySym>python provdic.py
Populating the dict (100,000 items): 1.22702609833
10,000 reads: 0.0950619353325
10,000 writes 0.0779203473072

So: about 1.0 to 1.2 seconds to insert 100,000 items, and
about 0.1 to 0.2 seconds for 10,000 reads or writes, depending
on cache effects.  Prudentially, I'd state this as roughly 10
to 20 microseconds per operation, assuming hashing is maximally
fast and effective.
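
(Spelled out, the per-operation figure is just a measured total
divided by the operation count -- the numbers here are illustrative,
not a new measurement:)

# back-of-the-envelope: turn a measured total into microseconds per op
total = 0.15         # e.g. a typical timing for 10,000 reads above
nops = 10000
print "per operation:", total/nops*1e6, "microseconds"   # roughly 15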

To test using strings as keys, I modified the script a
bit.  I prepended the generation of a list of 100,000
random words of 6 lowercase letters each (plus an import
of the string module):

import string   # for string.lowercase

print "generating 100,000 words..."
start = time.clock()
words = 100000*[None]
for i in range(100000):
    word = 6*[None]
    for j in range(6):
        word[j] = rand.choice(string.lowercase)   # one random lowercase letter
    words[i] = ''.join(word)
    if i&255==0: print '.',   # progress dot every 256 words
stend = time.clock()
print
print "Words generated",stend-start

and I changed each dict[i] to dict[words[i]] in the
later part of the script.
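
Concretely, the populating and reading loops end up looking like
this (a sketch of the modification, not a verbatim excerpt; gentime
is measured beforehand exactly as in the integer-key version):

# populate: 100,000 entries keyed by the random 6-letter words
start = time.clock()
for i in range(100000):
    dict[words[i]] = i
stend = time.clock()
print "Populating the dict (100,000 items):",stend-start

# 10,000 reads by string key, net of the index-generation overhead
start = time.clock()
for i in range(10000):
    x = rand.randrange(100000)
    y = dict[words[x]]
stend = time.clock()
print "10,000 reads:", (stend-start) - gentime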

Snipping the dots, I saw:

D:\PySym>python provdic.py
generating 100,000 words...
Words generated 48.0594755192
Populating the dict (100,000 items): 1.50119428753
10,000 reads: 0.0199911161769
10,000 writes 0.0385516016024

D:\PySym>python provdic.py
generating 100,000 words...
Words generated 54.563155601
Populating the dict (100,000 items): 1.48951792688
10,000 reads: 0.0423557216849
10,000 writes 0.134785195863

The populating time grows modestly but consistently with string
keys, while the read and write times show no clear increase --
perhaps an anomaly in my measurements -- but either way, about 15
microseconds per operation still seems a good rough estimate.


If these very approximate estimates are not decisive
enough to be a red or green light for your project, it
should not be hard to prepare a simple simulation that
mimics your intended use more closely.
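
For instance, a little harness along these lines would let you plug
in your own key shapes and sizes (purely a sketch -- time_dict,
make_key, dictsize and nops are made-up names, and it is cruder than
the script above in that it doesn't subtract the random-number
overhead):

import time
import whrandom as rand

def time_dict(dictsize, nops, make_key):
    # make_key(i) is a placeholder: build keys the way your application will
    keys = [make_key(i) for i in range(dictsize)]

    d = {}
    start = time.clock()
    for k in keys:
        d[k] = None
    print "populate %d items:" % dictsize, time.clock()-start

    start = time.clock()
    for i in range(nops):
        x = d[keys[rand.randrange(dictsize)]]
    print "%d random reads:" % nops, time.clock()-start

# e.g.: integer keys versus string keys shaped like your real data
time_dict(100000, 10000, lambda i: i)
time_dict(100000, 10000, lambda i: 'rec%06d' % i)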


> offer any other experience with this size data set in Python?

As long as everything fits comfortably in memory (and
these days you can fit quite a few 100,000-item dicts
in a typical PC's RAM), I have never had any performance
surprises from dictionary use.  Lists and strings can
give such surprises -- say, if naive code builds a string
by many small appends, or innocently inserts into or
deletes from the interior of a big list -- though these
are not hard to finesse once you know what to look out
for; but I never noticed any such issue with dictionaries.
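
(To illustrate the string pitfall I mean -- a sketch with arbitrary
sizes, not part of the measurements above: repeated concatenation
copies the whole string at every step, so the work grows
quadratically, while collecting the pieces and joining once stays
linear:)

import time

pieces = 20000*['x']

# quadratic: each concatenation copies the ever-growing string
start = time.clock()
s = ''
for p in pieces:
    s = s + p
print "naive concatenation:", time.clock()-start

# linear: join all the pieces in one go
start = time.clock()
s = ''.join(pieces)
print "''.join:", time.clock()-start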


Alex





