Populating huge data structures from disk

Tue Nov 6 16:42:04 EST 2007

> Note that you're not doing the same thing at all. You're
> pre-allocating the array in the C code, but not in Python (and I don't
> think you can). Is there some reason you're growing an 8 gig array 8
> bytes at a time?
>
> They spend about the same amount of time in system, but Python spends
> 4.7x as much CPU in userland as C does.
>
> Python has to grow the array. It's possible that this is tripping a
> degenerate case in the gc behavior also (I don't know if array uses
> PyObjects for its internal buffer), and if it is you'll see an
> improvement by disabling GC.

That does explain why it's consuming 4.7x as much CPU.
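
For what it's worth, here is a minimal sketch of loading the data without growing the array one element at a time. It assumes the file holds nothing but raw native-endian 8-byte doubles and that a modern Python 3 is available; `load_doubles` is a hypothetical name of mine, not anything from this thread. `array.fromfile` is asked for the whole element count in one call, so there is no per-element append loop in Python code.

```python
import os
from array import array

def load_doubles(path):
    """Read a file of raw native-endian doubles into an array in one bulk call.

    Assumption: the file contains only 8-byte doubles, no header.
    """
    values = array("d")
    # Number of whole elements in the file, derived from its size on disk.
    count = os.path.getsize(path) // values.itemsize
    with open(path, "rb") as f:
        # One bulk read instead of appending element by element.
        values.fromfile(f, count)
    return values
```

This still copies the whole file into the process, so it addresses the userland CPU cost of growing the array but not the memory footprint.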

> > x = lengthy_number_crunching()
> > magic.save_mmap("/important-data")
> >
> > and in the application do...
> >
> > x = magic.mmap("/important-data")
> > magic.mlock("/important-data")
> >
> > and once the mlock finishes bringing important-data into RAM, at
> > the speed of your disk I/O subsystem, all accesses to x will be
> > hits against RAM.
>
> You've basically described what mmap does, as far as I can tell. Have
> you tried just mmapping the file?

Yes, that would be why my fantasy functions have 'mmap' in their names.

However, in C you can mmap arbitrarily complex data structures whereas
in Python all you can mmap without transformations is an array or a string.
I didn't say this earlier, but I do need to pull more than arrays
and strings into RAM.  Not being able to pre-allocate storage is a big
loser for this approach.
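
To make the "array or string" case concrete, here is a rough sketch of mapping a file of doubles without a copy. It assumes a modern Python 3 (where `memoryview.cast` exists) and a non-empty file of raw native-endian 8-byte doubles; `map_doubles` is a hypothetical helper, not the fantasy `magic` module above. It does not pin pages in RAM the way mlock would, and it only covers a flat array, which is exactly the limitation I'm complaining about.

```python
import mmap

def map_doubles(path):
    """Map a file of raw native-endian doubles and return (mapping, view).

    Assumptions: the file is non-empty and its size is a multiple of 8 bytes.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after
        # the file object is closed.
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Reinterpret the mapped bytes as doubles without copying them.
    view = memoryview(mapping).cast("d")
    return mapping, view
```

Indexing `view[i]` then reads straight from the mapped pages. The view has to be released with `view.release()` before calling `mapping.close()`.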