Populating huge data structures from disk

Rhamphoryncus rhamph at gmail.com
Wed Nov 7 15:42:11 EST 2007


On Nov 6, 2:42 pm, "Michael Bacarella" <m... at gpshopper.com> wrote:
> > Note that you're not doing the same thing at all. You're
> > pre-allocating the array in the C code, but not in Python (and I don't
> > think you can). Is there some reason you're growing an 8 gig array 8
> > bytes at a time?
>
> > They spend about the same amount of time in system, but Python spends
> > 4.7x as much CPU in userland as C does.
>
> > Python has to grow the array. It's possible that this is tripping a
> > degenerate case in the gc behavior also (I don't know if array uses
> > PyObjects for its internal buffer), and if it is you'll see an
> > improvement by disabling GC.
>
> That does explain why it's consuming 4.7x as much CPU.
>
> > > x = lengthy_number_crunching()
> > > magic.save_mmap("/important-data")
>
> > > and in the application do...
>
> > > x = magic.mmap("/important-data")
> > > magic.mlock("/important-data")
>
> > > and once the mlock finishes bringing important-data into RAM, at
> > > the speed of your disk I/O subsystem, all accesses to x will be
> > > hits against RAM.
>
> > You've basically described what mmap does, as far as I can tell. Have
> > you tried just mmapping the file?
>
> Yes, that would be why my fantasy functions have 'mmap' in their names.
>
> However, in C you can mmap arbitrarily complex data structures whereas
> in Python all you can mmap without transformations is an array or a string.
> I didn't say this earlier, but I do need to pull more than arrays
> and strings into RAM.  Not being able to pre-allocate storage is a big
> loser for this approach.

I don't see how needing transformations is an issue, as it's just a
constant-factor overhead (in big-O terms).

The bigger concern is whether what you're storing is self-contained
(numbers, strings) or contains inter-object references.  The latter may
require you to translate between indexes and temporary handle
objects.  This in turn may require some sort of garbage collection
scheme (refcounting, tracing).  Note that the address of the mmap'd
region can change each time the program runs, so even in C you may
need to use indexes.  (You may also want to eliminate C's padding,
although the objects would then no longer be directly accessible, due
to hardware alignment requirements.)
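
To make the index idea concrete, here is a rough sketch in Python.  The
record layout (a 64-bit value plus the index of the next record, with -1
as the end marker) and the names NODE, write_nodes, walk and load_values
are all made up for illustration.  The "<" in the struct format asks for
little-endian fields with no padding.  It assumes a non-empty list.

```python
import mmap
import struct

# Hypothetical record layout: a signed 64-bit value followed by the index
# of the next record (-1 marks the end). "<" = little-endian, no padding.
NODE = struct.Struct("<qq")

def write_nodes(path, values):
    """Write values as a linked list whose links are record indexes."""
    with open(path, "wb") as f:
        for i, value in enumerate(values):
            next_index = i + 1 if i + 1 < len(values) else -1
            f.write(NODE.pack(value, next_index))

def walk(buf, index=0):
    """Follow index links through the mapped buffer, yielding values."""
    while index != -1:
        # An index becomes a byte offset only at the moment of access,
        # so the mapping's base address never appears in the file.
        value, index = NODE.unpack_from(buf, index * NODE.size)
        yield value

def load_values(path):
    """Map the file read-only and collect the values in link order."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return list(walk(buf))
```

Because the links are indexes rather than addresses, the file stays
valid no matter where the mapping lands on the next run.  The unpack on
every access is the constant-factor transformation cost.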

Having a separate process (or thread) that occasionally pokes each
page (as suggested by Paul Rubin) seems like the cleanest way to
ensure they stay "hot".  One pass during startup is insufficient, as
unused portions of a long-running program may get swapped out.  Also
note that poking need only touch 1 byte per page, which is much cheaper
than copying the entire page (so long as the page is already loaded
from disk).
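
A minimal sketch of the poking approach, using a thread for brevity.
The names touch_pages and start_poker and the interval parameter are
invented for this example, and it assumes the mapping starts on a page
boundary (true for a whole-file mmap).  Note that this only re-touches
pages; it does not pin them the way mlock would.

```python
import mmap
import threading

def touch_pages(buf, page_size=mmap.PAGESIZE):
    """Read one byte from each page of buf; return the number of pages read."""
    touched = 0
    for offset in range(0, len(buf), page_size):
        buf[offset]  # a one-byte read faults the page in if it is not resident
        touched += 1
    return touched

def start_poker(buf, interval_seconds, stop_event):
    """Touch every page of buf repeatedly until stop_event is set."""
    def run():
        while not stop_event.is_set():
            touch_pages(buf)
            # wait() returns early once stop_event is set
            stop_event.wait(interval_seconds)
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The caller keeps the threading.Event and sets it to stop the poker
before closing the mapping.  A separate process mapping the same file
would follow the same loop.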


--
Adam Olsen, aka Rhamphoryncus



