Populating huge data structures from disk

Tue Nov 6 15:40:26 EST 2007

> > For various reasons I need to cache about 8GB of data from disk into
core on
> > application startup.
>
> Are you sure? On PC hardware, at least, doing this doesn't make any
> guarantee that accessing it actually going to be any faster. Is just
> mmap()ing the file a problem for some reason?
>
> I assume you're on a 64 bit machine.

Very sure.  If we hit the disk at all performance drops unacceptably.  The
application
has low locality of reference so on-demand caching isn't an option.  We get
the behavior
we want when we pre-cache; the issue is simply that it takes so long to
build this cache.

> > Building this cache takes nearly 2 hours on modern hardware.  I am
surprised
> > to discover that the bottleneck here is CPU.
> >
> > The reason this is surprising is because I expect something like this to
be
> > very fast:
> >
> > #!python
> > import array
> >
> > a = array.array('L')
> >
> > f = open('/dev/zero','r')
> >
> > while True:
> >
> >     a.fromstring(f.read(8))
>
> This just creates the same array over and over, forever. Is this
> really the code you meant to write? I don't know why you'd expect an
> infinite loop to be "fast"...

Not exactly.  fromstring() appends to the array.  It's growing the array
towards
infinity.  Since infinity never finishes it's hard to get an idea of how
slow
this looks.  Let's do 800MB instead.

Here's an example of loading 800MB in C:

$ time ./eat800 

real    0m44.939s
user    0m10.620s
sys     0m34.303s

$ cat eat800.c 
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

int main(void)
{
        int f = open("/dev/zero",O_RDONLY);
        int vlen = 8;
        long *v = malloc((sizeof (long)) * vlen);
        int i;

        for (i = 0; i < 100000000; i++) {
                if (i >= vlen) {
                        vlen *= 2;
                        v = (long *)realloc(v,(sizeof (long)) * vlen);
                }
                read(f,v+i,sizeof (long));
        }
        return 0;
}

Here's the similar operation in Python:
$ time python eat800.py 

real    3m8.407s
user    2m40.189s
sys     0m27.934s

$ cat eat800.py
#!/usr/bin/python

import array
a = array.array('L')

f = open('/dev/zero')
for i in xrange(100000000):
        a.fromstring(f.read(8))

They spend about the same amount of time in system, but Python spends 4.7x
as much
CPU in userland as C does.

And there's no solace in lists either:

$ time python eat800.py 

real    4m2.796s
user    3m57.865s
sys     0m3.638s

$ cat eat800.py 
#!/usr/bin/python

import struct

d = []
f = open('/dev/zero')
for i in xrange(100000000):
        d.append(struct.unpack('L',f.read(8))[0])

cPickle with protocol 2 has some promise but is more complicated because
arrays can't be pickled.  In a perfect world I could do something like this
somewhere in the backroom:

x = lengthy_number_crunching()
magic.save_mmap("/important-data")

and in the application do...

x = magic.mmap("/important-data")
magic.mlock("/important-data")

and once the mlock finishes bringing important-data into RAM, at
the speed of your disk I/O subsystem, all accesses to x will be
hits against RAM.

Any thoughts?