Populating a dictionary, fast

Michael Bacarella mbac at gpshopper.com
Sat Nov 10 20:18:37 EST 2007


> That's an awfully complicated way to iterate over a file. Try this
> instead:
>
> id2name = {}
> for line in open('id2name.txt'):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> > This takes about 45 *minutes*
>
> On my system, it takes about a minute and a half to produce a
> dictionary with 8191180 entries.

Doing something similar on my system is very fast as well.

$ cat dict-8191180.py
#!/usr/bin/python

v = {}
for i in xrange(8191180):
        v[i] = i

$ time ./dict-8191180.py

real    0m5.877s
user    0m4.953s
sys     0m0.924s
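For contrast, the same synthetic test can be run with pseudo-random
64-bit keys instead of small sequential ints, which is closer to the
real id2name workload. A sketch, in modern Python syntax; the count of
1000000 keys is an arbitrary smaller choice so it finishes quickly:

```python
# Sketch (modern Python syntax): populate a dict with pseudo-random
# 64-bit integer keys rather than small sequential ints. The key
# count of 1000000 is an arbitrary choice, not the real 8191180.
import random
import time

random.seed(0)  # deterministic keys for repeatability
keys = [random.getrandbits(64) for _ in range(1000000)]

start = time.time()
v = {}
for k in keys:
    v[k] = k
elapsed = time.time() - start

print("%d entries in %.2f seconds" % (len(v), elapsed))
```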

But...

> > If I comment out the last line in the loop body it takes only
> > about 30 _seconds_ to run. This would seem to implicate the line
> > id2name[id] = name as being excruciatingly slow.
>
> No, dictionary access is one of the most highly-optimized, fastest,
> most efficient parts of Python. What it indicates to me is that your
> system is running low on memory, and is struggling to find room for
> 517MB worth of data.


If only it were so easy.

$ free
             total       used       free     shared    buffers     cached
Mem:       7390244    2103448    5286796          0      38996    1982756
-/+ buffers/cache:      81696    7308548
Swap:      2096472      10280    2086192

Here's your Python implementation running as badly as mine did.



$ wc -l id2name.txt
8191180 id2name.txt

$ cat cache-id2name.py
#!/usr/bin/python

id2name = {}
for line in open('id2name.txt'):
        id, name = line.strip().split(':', 1)
        id = long(id)
        id2name[id] = name

$ time ./cache-id2name.py
^C

I let it run for over 30 minutes before killing it, since I had to
leave. Here it is in top just before I did the deed:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18802 root      25   0 1301m 1.2g 1148 R 99.9 17.1  36:05.99 cache-id2name.p

36 minutes, 05.99 seconds of CPU time.



To rule out the file reading/parsing logic as the culprit, here's the
same program with the dictionary insertion removed.

$ cat nocache-id2name.py
#!/usr/bin/python

id2name = {}
for line in open('id2name.txt'):
        id, name = line.strip().split(':', 1)
        id = long(id)

$ time ./nocache-id2name.py

real    0m33.518s
user    0m33.070s
sys     0m0.415s
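Another way to split the two costs apart is to parse everything up
front and then time only the inserts. A hedged sketch, in modern
Python syntax; the synthetic lines below merely stand in for
id2name.txt, and the count of 200000 is arbitrary:

```python
# Sketch: parse all lines first, then time only the dict inserts, so
# parsing cost and insertion cost are measured separately. The
# synthetic lines stand in for id2name.txt (an assumption).
import time

lines = ["%d:name-%d\n" % (i * 7919, i) for i in range(200000)]

# Phase 1: parsing only.
t0 = time.time()
pairs = []
for line in lines:
    id_str, name = line.strip().split(':', 1)
    pairs.append((int(id_str), name))
parse_time = time.time() - t0

# Phase 2: dictionary insertion only.
t0 = time.time()
id2name = {}
for key, name in pairs:
    id2name[key] = name
insert_time = time.time() - t0

print("parse: %.3fs  insert: %.3fs  entries: %d"
      % (parse_time, insert_time, len(id2name)))
```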





Here's a Perl implementation running very fast.



$ cat cache-id2name.pl
#!/usr/bin/perl

my %id2name = ();
my $line;
my $id;
my $name;

open(F, "<id2name.txt");

foreach $line (<F>) {
        chomp($line);
        ($id, $name) = split(/:/, $line, 2);
        $id = int($id);
        $id2name{$id} = $name;
}

$ time ./cache-id2name.pl

real    0m46.363s
user    0m43.730s
sys     0m2.611s





So, you think Python's dict implementation degrades towards O(N)
performance when it's fed millions of 64-bit pseudo-random longs?
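One variable worth ruling out before blaming the dict itself (a hedged
guess, not a diagnosis): CPython's cyclic garbage collector runs
periodically during bulk allocation, and its cost can be compared
directly by disabling it around the load. A sketch in modern Python
syntax, with an arbitrary key count of 500000:

```python
# Sketch: time bulk insertion of pseudo-random 64-bit keys with the
# cyclic garbage collector disabled vs. enabled. This probes one
# possible culprit; it is a guess, not a diagnosis from this thread.
import gc
import random
import time

random.seed(0)
keys = [random.getrandbits(64) for _ in range(500000)]
values = [str(k) for k in keys]

def fill(keys, values):
    d = {}
    for k, name in zip(keys, values):
        d[k] = name
    return d

gc.disable()
t0 = time.time()
d_nogc = fill(keys, values)
nogc_time = time.time() - t0
gc.enable()

t0 = time.time()
d_gc = fill(keys, values)
gc_time = time.time() - t0

print("gc disabled: %.2fs  gc enabled: %.2fs  entries: %d"
      % (nogc_time, gc_time, len(d_nogc)))
```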







More information about the Python-list mailing list