Populating huge data structures from disk

Chris Mellon arkanes at gmail.com
Tue Nov 6 14:04:57 EST 2007


On Nov 6, 2007 12:18 PM, Michael Bacarella <mbac at gpshopper.com> wrote:
>
> For various reasons I need to cache about 8GB of data from disk into core on
> application startup.
>

Are you sure? On PC hardware, at least, doing this gives no guarantee
that accessing the data is actually going to be any faster. Is just
mmap()ing the file a problem for some reason?

I assume you're on a 64-bit machine.
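
Roughly what I have in mind, as a sketch only -- the file name and the
fixed 8-byte record layout are made up, substitute whatever your cache
actually holds:

import mmap
import struct

# Map the file read-only; nothing is copied into Python objects up
# front, the OS pages data in as you touch it.
f = open('cache.dat', 'rb')          # placeholder file name
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def record(i):
    # Slice out the i-th 8-byte record and unpack it on demand.
    return struct.unpack('<Q', m[i * 8:(i + 1) * 8])[0]

print record(0)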

> Building this cache takes nearly 2 hours on modern hardware.  I am surprised
> to discover that the bottleneck here is CPU.
>
> The reason this is surprising is because I expect something like this to be
> very fast:
>
> #!python
>
> import array
> a = array.array('L')
> f = open('/dev/zero','r')
> while True:
>     a.fromstring(f.read(8))
>

This just keeps appending the same eight zero bytes to one array,
forever. Is this really the code you meant to write? I don't know why
you'd expect an infinite loop to be "fast"...
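
If you did mean to read a real file rather than /dev/zero, a bounded
version of that loop would look more like this (the file name is a
placeholder, and the file's size should be a multiple of the array's
item size):

import array

a = array.array('L')
f = open('cache.dat', 'rb')
while True:
    chunk = f.read(1024 * 1024)      # 1 MB at a time, not 8 bytes
    if not chunk:                    # stop at EOF instead of looping forever
        break
    a.fromstring(chunk)              # append the unpacked values
f.close()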

>
> Profiling this application shows all of the time is spent inside
> a.fromstring.
>

Obviously, because that's all that's inside your while True loop.
There's nothing else that it could spend time on.

> Little difference if I use list instead of array.
>
> Is there anything I could tell the Python runtime to help it run this
> pathologically slanted case faster?
>

This code executes in a couple of seconds for me (size reduced to fit
in my 32-bit address space):

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> s = '\x00' * ((1024 **3)/2)
>>> len(s)
536870912
>>> a = array.array('L')
>>> a.fromstring(s)
>>>

You might also want to look at array.fromfile().
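
Something along these lines, as a sketch -- the file name and the 'L'
typecode are assumptions about your data; fromfile() reads the items
straight from the file object, with no intermediate string to build
and copy:

import array
import os

a = array.array('L')
n_items = os.path.getsize('cache.dat') // a.itemsize
f = open('cache.dat', 'rb')
a.fromfile(f, n_items)               # read n_items items directly
f.close()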


