Populating a dictionary, fast

Ben Finney bignose+hates-spam at benfinney.id.au
Sat Nov 10 17:28:15 EST 2007


Michael Bacarella <mbac at gpshopper.com> writes:

> id2name = {}
> for line in iter(open('id2name.txt').readline,''):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
> 
> This takes about 45 *minutes*
> 
> If I comment out the last line in the loop body it takes only about
> 30 _seconds_ to run.  This would seem to implicate the line
> id2name[id] = name as being excruciatingly slow.

Or, rather, it shows that the slowdown comes from storing these items
in a dictionary at all, not from anything peculiar about that one line.

Dictionaries are implemented very efficiently in Python, but there
will still be overhead in inserting millions of distinct items. Of
course, if you just throw each item away instead of allocating space
for it, the loop will run very quickly.
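
If you want to see that effect on your own machine, a rough comparison
like the following sketch should show it (the helper names are mine,
and I'm assuming the same 'id2name.txt' format from your snippet)::

    import time

    def parse_only(path):
        # Parse every line, then throw the result away.
        start = time.time()
        for line in open(path):
            id, name = line.strip().split(':')
            id = long(id)
        return time.time() - start

    def parse_and_store(path):
        # Parse every line and keep it in a dict.
        start = time.time()
        id2name = {}
        for line in open(path):
            id, name = line.strip().split(':')
            id2name[long(id)] = name
        return time.time() - start

    print "parse only:      %.1f seconds" % parse_only('id2name.txt')
    print "parse and store: %.1f seconds" % parse_and_store('id2name.txt')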

> Is there a fast, functionally equivalent way of doing this?

You could, instead of individual assignments in a 'for' loop, try
letting the 'dict' type operate on a generator::

    input_file = open("id2name.txt")
    id2name = dict(
        (long(id), name)
        for (id, name) in
            (line.strip().split(":") for line in input_file)
    )

All that code inside the 'dict()' call is a "generator expression"; if
you don't know what they are yet, have a read of Python's
documentation on them. It creates a generator that yields (key, value)
tuples, which are fed directly to the dict constructor as it requests
them.
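
As a toy illustration (nothing to do with your data set), you can watch
the constructor consume pairs straight from a generator::

    # dict() pulls each (key, value) pair from the generator as it
    # needs it; no intermediate list is ever built.
    pairs = ((n, n * n) for n in xrange(3))
    squares = dict(pairs)   # {0: 0, 1: 1, 2: 4}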

That allows the generator to parse each item from the file exactly as
the 'dict' constructor needs it, possibly saving some extra "allocate,
assign, discard" steps. Not having your data set, I can't say if it'll
be significantly faster.

-- 
 \      "Compulsory unification of opinion achieves only the unanimity |
  `\     of the graveyard."  -- Justice Roberts in 319 U.S. 624 (1943) |
_o__)                                                                  |
Ben Finney


