Large Dictionaries

John Machin sjmachin at lexicon.net
Mon May 15 19:45:42 EDT 2006


1. Why is two minutes to insert 5M keys "bad" for you? What would be
"good"? What would good and bad look-up times be? Have you measured the
typical look-up time? How often does the dict creation need to be done?
How often does the data change? Is multi-user access required for
(a) look-up and (b) updating? Have you considered loading the dict from
a pickle? (There's a sketch of what I mean after this list.)
2. Assuming the code that creates the dict looks, in essence, like
this:
adict = {}
for k, v in some_iterable:
    adict[k] = v
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco? In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco -> Intel x86, but what
chip exactly? How much free memory?
4. Consider printing time-so-far results, say every 100K keys (see the
timing sketch after this list). Multiple step-ups might indicate dict
resizings; a dog-leg probably means you are running out of memory. Why
"roughly" 5M keys?
5. How large are your long_integers?
6. What is the nature of the value associated with each key?
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)? (A packing sketch follows this list.)
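
Re the pickle idea in (1), roughly this kind of thing -- the file name
and helper names are just placeholders, and on Python 2 you'd import
cPickle instead for speed:

import pickle

def save_dict(adict, path):
    # dump once, after the expensive build
    f = open(path, 'wb')
    pickle.dump(adict, f, pickle.HIGHEST_PROTOCOL)
    f.close()

def load_dict(path):
    # later runs just reload the pickle instead of rebuilding
    f = open(path, 'rb')
    adict = pickle.load(f)
    f.close()
    return adict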
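
Re (4), a sketch of the kind of instrumented build loop I mean, with
some_iterable standing in for whatever your real data source is:

import time

def build_with_progress(some_iterable, report_every=100000):
    # print elapsed time every report_every keys; sudden jumps between
    # reports hint at dict resizings or at the machine starting to swap
    adict = {}
    start = time.time()
    count = 0
    for k, v in some_iterable:
        adict[k] = v
        count += 1
        if count % report_every == 0:
            print("%9d keys  %8.2f seconds" % (count, time.time() - start))
    return adict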
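
Re (7), what I mean by packing the pair into one long integer --
assuming b always satisfies 0 <= b < 2**32:

def pack_key(a, b):
    # a single integer key instead of a 2-tuple; no tuple object
    # is allocated per key
    return a * 2 ** 32 + b

def unpack_key(key):
    # recover the original (a, b) pair when needed
    return divmod(key, 2 ** 32)

# usage: adict[pack_key(a, b)] = v   rather than   adict[(a, b)] = v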

HTH,
John



