[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Tue Mar 1 16:36:28 EST 2011

Robert Kern <robert.kern <at> gmail.com> writes:
> >> >> You can have each of those processes memory-map the whole file and
> >> >> just operate on their own slices. Your operating system's virtual
> >> >> memory manager should handle all of the details for you.
> >
> > Wow, I didn't know that. So as long as the ranges touched by each process do
> > not overlap, I'll be safe? If I modify only a few discontiguous chunks in a
> > range, will the virtual memory manager decide whether it is most efficient to
> > write just the chunks or the entire range back to disk?
> 
> It's up to the virtual memory manager, but usually, it will just load
> those pages (chunks the size of mmap.PAGESIZE) that are touched by
> your request and write them back.

What if two processes touch adjacent chunks that are smaller than a page? Is 
there a risk that writing back an entire page will overwrite the efforts of 
another process?
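
To make the question concrete, here is a minimal sketch of the pattern as I understand it. It is only an illustration: fill_slice() is a made-up name, the file name and sizes are arbitrary, and the two calls run one after the other here, whereas in practice each would run in its own process.

```python
import os
import tempfile

import numpy as np


def fill_slice(filename, start, stop, value):
    """Hypothetical worker: map the whole .npy file and write one range."""
    # Map the entire file read-write; only the touched pages should be loaded.
    arr = np.load(filename, mmap_mode='r+')
    arr[start:stop] = value
    arr.flush()  # request that dirty pages be written back to disk
    del arr


# Step 1 of the use case: preallocate the output file on disk.
filename = os.path.join(tempfile.mkdtemp(), 'output.npy')
out = np.lib.format.open_memmap(filename, mode='w+',
                                dtype=np.float64, shape=(1000,))
del out

# Step 2: each worker fills in its own, non-overlapping range.
fill_slice(filename, 0, 500, 1.0)     # would run in process A
fill_slice(filename, 500, 1000, 2.0)  # would run in process B

result = np.load(filename)
```

Note that the boundary between the two ranges (item 500) will generally not fall on a page boundary, which is exactly the case I am asking about.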

> > Use case: Generate "large" output for "many" parameter scenarios.
> > 1. Preallocate "enormous" output file on disk.
> > 2. Each process fills in part of the output.
> > 3. Analyze, aggregate results, perhaps save to HDF or database, in a
> > sliding-window fashion using a memory-mapped array. The aggregated results fit in
> > memory, even though the raw output doesn't.
[...]
> Okay, in this case, I don't think that just adding an offset argument
> to np.load() is very useful. You will want to read the dtype and shape
> information from the header, *then* decide what offset and shape to
> use for the memory-mapped segment. You will want to use the functions
> read_magic() and read_array_header_1_0() from np.lib.format directly.

Pardon me if I misunderstand, but isn't that what np.load does already, with or
without my modifications? The existing np.load calls open_memmap if
memory-mapping is requested. open_memmap does read the header first, calling
read_magic() and then getting the shape and dtype from read_array_header_1_0().
It currently passes offset=fp.tell() to numpy.memmap. I just modify this offset
based on the number of items to skip and the dtype's item size.
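
Roughly, the offset arithmetic looks like the sketch below. This is not the actual patch: memmap_items() and its skip/count parameters are names made up for illustration, and it assumes a 1-D array stored in format version (1, 0).

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import read_magic, read_array_header_1_0


def memmap_items(filename, skip, count, mode='r+'):
    """Hypothetical helper: map `count` items of a 1-D .npy file after `skip`."""
    # Read the header first to learn the dtype, shape and where the data starts.
    with open(filename, 'rb') as fp:
        version = read_magic(fp)
        if version != (1, 0):
            raise ValueError("only version (1, 0) supported, not %r" % (version,))
        shape, fortran_order, dtype = read_array_header_1_0(fp)
        data_start = fp.tell()
    if len(shape) != 1:
        raise ValueError("this sketch handles 1-D arrays only")
    if skip < 0 or count < 0 or skip + count > shape[0]:
        raise ValueError("requested range lies outside the array")
    # Shift the offset by the number of skipped items times the item size.
    offset = data_start + skip * dtype.itemsize
    return np.memmap(filename, dtype=dtype, shape=(count,),
                     mode=mode, offset=offset)


# Usage: save a small array, then map items 10..14 only.
filename = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(filename, np.arange(100, dtype=np.int64))
chunk = memmap_items(filename, skip=10, count=5, mode='r')
```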

> You can slightly modify the logic in open_memmap():
>
>         # Read the header of the file first.
>         fp = open(filename, 'rb')
>         try:
>             version = read_magic(fp)
>             if version != (1, 0):
>                 msg = "only support version (1,0) of file format, not %r"
>                 raise ValueError(msg % (version,))
>             shape, fortran_order, dtype = read_array_header_1_0(fp)
>             if dtype.hasobject:
>                 msg = "Array can't be memory-mapped: Python objects in dtype."
>                 raise ValueError(msg)
>             offset = fp.tell()
>         finally:
>             fp.close()
> 
>         chunk_offset, chunk_shape = decide_offset_shape(dtype, shape,
>             fortran_order, offset)
> 
>         marray = np.memmap(filename, dtype=dtype, shape=chunk_shape,
>             order=('F' if fortran_order else 'C'), mode='r+',
>             offset=chunk_offset)

To me this seems equivalent to what my hack is doing:
https://github.com/jonovik/numpy/compare/master...offset_memmap
I guess decide_offset_shape() would encapsulate the gist of what I added.
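
For what it's worth, one possible shape for such a helper is sketched below. It is hypothetical: the start/stop arguments are my own addition to the signature in the quoted snippet, and it only handles a contiguous block of leading rows in a C-ordered array.

```python
import numpy as np


def decide_offset_shape(dtype, shape, fortran_order, offset,
                        start=0, stop=None):
    """Hypothetical: byte offset and shape for rows [start, stop)."""
    if fortran_order:
        # Rows are not contiguous on disk in Fortran order.
        raise ValueError("this sketch only supports C-ordered arrays")
    if not shape:
        raise ValueError("cannot take a row range of a 0-d array")
    n_rows = shape[0]
    if stop is None:
        stop = n_rows
    if not 0 <= start <= stop <= n_rows:
        raise ValueError("row range lies outside the array")
    # Number of items in one "row", i.e. one step along the first axis.
    items_per_row = 1
    for n in shape[1:]:
        items_per_row *= n
    chunk_offset = offset + start * items_per_row * dtype.itemsize
    chunk_shape = (stop - start,) + tuple(shape[1:])
    return chunk_offset, chunk_shape


# Usage: rows 100..199 of a (1000, 3) float64 array whose data starts
# at byte 128 (an assumed header length, for illustration only).
chunk_offset, chunk_shape = decide_offset_shape(
    np.dtype('<f8'), (1000, 3), False, 128, start=100, stop=200)
```

The two return values would then go straight into np.memmap() as offset= and shape=, as in the quoted snippet.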