dict is really slow for big truck

bearophileHUGS at lycos.com
Tue Apr 28 14:05:46 EDT 2009


On Apr 28, 2:54 pm, forrest yang <Gforrest.y... at gmail.com> wrote:
> I'm trying to load a big file, about 9,000,000 lines, into a dict,
> something like
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
> for line in open(file):
>     arr = line.strip().split('\t')
>     dict[line.split(None, 1)[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By the
> way, the Mac I use has 16 GB of memory.
> Is this caused by the low performance of the dict when it has to grow,
> or is there some other reason?
> Can anyone suggest a better solution?

Small keys like yours are managed very efficiently by the dict.
If I do this:
d = dict.fromkeys(xrange(9000000))

It takes only a little more than a second on my normal PC.
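
(If you want to check that on your own machine, here is a minimal timing
sketch, Python 2 to match the xrange above; it just wraps that one line
with the time module:)

import time
t0 = time.time()
d = dict.fromkeys(xrange(9000000))
print 'built the dict in %.2f seconds' % (time.time() - t0)
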
So the problem probably isn't the dict; it's the I/O and/or the
allocation of nine million small lists. One suggestion: don't split
each line into a list up front; keep the whole line as a string and
split it only when you actually use it:

d = {}
for line in open(file):
    line = line.strip()
    d[line.split(None, 1)[0]] = line    # first field -> whole stripped line
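
Then split a record only at the point where you need its fields, for
example (assuming the fields are tab-separated, as your split('\t')
suggests; the key '3' just matches your sample data):

fields = d['3'].split('\t')    # -> ['3', '4', '5', '6'] for the sample line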

If that's not fast enough, you can simplify it further by skipping the strip:

d = {}
for line in open(file):
    d[line.split(None, 1)[0]] = line    # value keeps its trailing newline

If you still have memory problems, you can keep only the line number
as the dict value, or even the absolute file position, and seek to it
later. You can also use memory-mapped files.
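
Here is a rough sketch of the file-position idea; the path and the
lookup key below are placeholders, not something from your post:

path = 'data.txt'               # placeholder for your real file name

offsets = {}
f = open(path, 'rb')
while True:
    pos = f.tell()              # byte offset of the line we're about to read
    line = f.readline()
    if not line:
        break
    offsets[line.split(None, 1)[0]] = pos   # store the position, not the data

# Later, load one record on demand instead of keeping 9M lists in RAM:
def fetch(key):
    f.seek(offsets[key])
    return f.readline().split()

(The explicit readline() loop matters: mixing "for line in f" with tell()
gives misleading offsets because of the file iterator's read-ahead buffer.)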

Tell us how the performance is now.

Bye,
bearophile


