Populating huge data structures from disk

Chris Mellon arkanes at gmail.com
Tue Nov 6 16:00:44 EST 2007


On Nov 6, 2007 2:40 PM, Michael Bacarella <mbac at gpshopper.com> wrote:
>
> > > For various reasons I need to cache about 8GB of data from disk
> > > into core on application startup.
> >
> > Are you sure? On PC hardware, at least, doing this doesn't make any
> > guarantee that accessing it is actually going to be any faster. Is
> > just mmap()ing the file a problem for some reason?
> >
> > I assume you're on a 64 bit machine.
>
> Very sure.  If we hit the disk at all, performance drops unacceptably.
> The application has low locality of reference, so on-demand caching
> isn't an option.  We get the behavior we want when we pre-cache; the
> issue is simply that it takes so long to build this cache.
>

You're not going to avoid hitting disk just by reading into your
memory space. If your performance needs are really so tight that you
can't rely on the VM system to keep pages you're using in memory,
you're going to need to do this at a much lower (and system specific)
level.

mmap() with a reasonable VM system shouldn't be any slower than
reading it all into memory.
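
As a minimal sketch of what that looks like from Python (the path and the flat-file-of-native-longs layout are assumptions from the discussion, not anything the OP has confirmed):

```python
import mmap
import os
import struct

def map_longs(path):
    """Map a file of native longs read-only; the OS pages it in on demand."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # CPython's mmap dups the descriptor, so closing fd afterwards is safe.
        return mmap.mmap(fd, size, prot=mmap.PROT_READ)
    finally:
        os.close(fd)

def get_long(m, i):
    # Assumes a 64-bit native long, i.e. 8 bytes per item.
    return struct.unpack_from('L', m, i * 8)[0]

# m = map_longs('/important-data')   # hypothetical path from the thread
# value = get_long(m, index)         # random access, no up-front copy
```

Nothing is copied into the process up front; hot pages stay in the page cache across runs, which is most of what the "magic.mmap" described later in the thread asks for.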

> > > Building this cache takes nearly 2 hours on modern hardware.  I am
> > > surprised to discover that the bottleneck here is CPU.
> > >
> > > The reason this is surprising is that I expect something like
> > > this to be very fast:
> > >
> > > #!python
> > > import array
> > >
> > > a = array.array('L')
> > >
> > > f = open('/dev/zero','r')
> > >
> > > while True:
> > >
> > >     a.fromstring(f.read(8))
> >
> > This just creates the same array over and over, forever. Is this
> > really the code you meant to write? I don't know why you'd expect an
> > infinite loop to be "fast"...
>
> Not exactly.  fromstring() appends to the array.  It's growing the array
> towards

You're correct, I misread the results of my testing.

> infinity.  Since infinity never finishes, it's hard to get an idea of
> how slow this looks.  Let's do 800MB instead.
>

That makes this a useless benchmark, though...

> Here's an example of loading 800MB in C:
>
> $ time ./eat800
>
> real    0m44.939s
> user    0m10.620s
> sys     0m34.303s
>
> $ cat eat800.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <unistd.h>  /* for read() */
>
> int main(void)
> {
>         int f = open("/dev/zero",O_RDONLY);
>         int vlen = 8;
>         long *v = malloc((sizeof (long)) * vlen);
>         int i;
>
>         for (i = 0; i < 100000000; i++) {
>                 if (i >= vlen) {
>                         vlen *= 2;
>                         v = (long *)realloc(v,(sizeof (long)) * vlen);
>                 }
>                 read(f,v+i,sizeof (long));
>         }
>         return 0;
> }
>
> Here's the similar operation in Python:
> $ time python eat800.py
>
> real    3m8.407s
> user    2m40.189s
> sys     0m27.934s
>
> $ cat eat800.py
> #!/usr/bin/python
>
> import array
> a = array.array('L')
>
> f = open('/dev/zero')
> for i in xrange(100000000):
>         a.fromstring(f.read(8))
>
>

Note that you're not doing the same thing at all. You're
pre-allocating the array in the C code, but not in Python (and I don't
think you can). Is there some reason you're growing an 8 gig array 8
bytes at a time?

> They spend about the same amount of time in system, but Python spends
> 4.7x as much CPU in userland as C does.
>

Python has to grow the array. It's possible that this is tripping a
degenerate case in the GC behavior as well (I don't know if array uses
PyObjects for its internal buffer), and if it is, you'll see an
improvement by disabling GC.
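
A quick way to test that theory is to bracket the load with gc.disable()/gc.enable() and compare wall-clock times; a minimal harness:

```python
import gc
import time

def timed(fn, *args):
    """Run fn twice, with and without the cyclic GC, and return both times."""
    results = {}
    for label, disable in (('gc on', False), ('gc off', True)):
        if disable:
            gc.disable()
        try:
            start = time.perf_counter()
            fn(*args)
            results[label] = time.perf_counter() - start
        finally:
            gc.enable()  # enabling an already-enabled GC is harmless
    return results
```

If "gc off" is dramatically faster, the collector is being tripped by the millions of small allocations; since array('L') stores raw machine words rather than PyObjects, the list/struct variant later in the thread is the more likely beneficiary.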

> And there's no solace in lists either:
>
> $ time python eat800.py
>
> real    4m2.796s
> user    3m57.865s
> sys     0m3.638s
>
> $ cat eat800.py
> #!/usr/bin/python
>
> import struct
>
> d = []
> f = open('/dev/zero')
> for i in xrange(100000000):
>         d.append(struct.unpack('L',f.read(8))[0])
>
>
> cPickle with protocol 2 has some promise but is more complicated because
> arrays can't be pickled.  In a perfect world I could do something like this
> somewhere in the backroom:
>
> x = lengthy_number_crunching()
> magic.save_mmap("/important-data")
>
> and in the application do...
>
> x = magic.mmap("/important-data")
> magic.mlock("/important-data")
>
> and once the mlock finishes bringing important-data into RAM, at
> the speed of your disk I/O subsystem, all accesses to x will be
> hits against RAM.
>

You've basically described what mmap does, as far as I can tell. Have
you tried just mmapping the file?

>
> Any thoughts?
>
>

Did you try array.fromfile like I suggested?
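
For reference, fromfile collapses the 100-million-iteration loop into a single buffered C-level call; a sketch against the same benchmark:

```python
import array

def bulk_load(path, n_items):
    """Read n_items native longs in one array.fromfile() call."""
    a = array.array('L')
    with open(path, 'rb') as f:
        a.fromfile(f, n_items)  # one bulk read, no per-item Python overhead
    return a

# The 800MB benchmark then becomes:
#   a = bulk_load('/dev/zero', 100000000)
```

fromfile raises EOFError if fewer than n_items are available, but the items that were read are still appended, so it degrades gracefully on short files.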

> --
> http://mail.python.org/mailman/listinfo/python-list
>


