[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load
Jon Olav Vik
jonovik at gmail.com
Tue Mar 1 16:36:28 EST 2011
Robert Kern <robert.kern <at> gmail.com> writes:
> >> >> You can have each of those processes memory-map the whole file and
> >> >> just operate on their own slices. Your operating system's virtual
> >> >> memory manager should handle all of the details for you.
> >
> > Wow, I didn't know that. So as long as the ranges touched by each process do
> > not overlap, I'll be safe? If I modify only a few discontiguous chunks in a
> > range, will the virtual memory manager decide whether it is most efficient to
> > write just the chunks or the entire range back to disk?
>
> It's up to the virtual memory manager, but usually, it will just load
> those pages (chunks the size of mmap.PAGESIZE) that are touched by
> your request and write them back.
What if two processes touch adjacent chunks that are smaller than a page? Is
there a risk that writing back an entire page will overwrite the efforts of
another process?
> > Use case: Generate "large" output for "many" parameter scenarios.
> > 1. Preallocate "enormous" output file on disk.
> > 2. Each process fills in part of the output.
> > 3. Analyze, aggregate results, perhaps save to HDF or database, in a
> > sliding-window fashion using a memory-mapped array. The aggregated results
> > fit in memory, even though the raw output doesn't.
[...]
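For illustration, the three steps of the use case could be sketched as below. This is a minimal single-process sketch, assuming a 2-D float array with one row per scenario; the helper names (preallocate, fill_part, aggregate) and the placeholder "results" are hypothetical. In real use each fill_part() call would run in a separate process on its own non-overlapping row range.

```python
import numpy as np
from numpy.lib.format import open_memmap


def preallocate(filename, shape, dtype=np.float64):
    """Step 1: create the .npy file on disk at its full size."""
    out = open_memmap(filename, mode="w+", dtype=dtype, shape=shape)
    out.flush()
    del out  # drop the mapping


def fill_part(filename, start, stop):
    """Step 2: one worker writes only rows [start, stop)."""
    out = open_memmap(filename, mode="r+")
    for i in range(start, stop):
        out[i, :] = i  # placeholder for the real per-scenario output
    out.flush()
    del out


def aggregate(filename, window):
    """Step 3: reduce the output one window of rows at a time."""
    data = open_memmap(filename, mode="r")
    total = 0.0
    for lo in range(0, data.shape[0], window):
        # Only the rows in the current window need to be paged in.
        total += float(data[lo:lo + window].sum())
    return total
```

Each worker here still maps the whole file and touches only its own rows, which is the approach Robert suggested earlier in the thread.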
> Okay, in this case, I don't think that just adding an offset argument
> to np.load() is very useful. You will want to read the dtype and shape
> information from the header, *then* decide what offset and shape to
> use for the memory-mapped segment. You will want to use the functions
> read_magic() and read_array_header_1_0() from np.lib.format directly.
Pardon me if I misunderstand, but isn't that what np.load does already, with or
without my modifications? The existing np.load calls open_memmap if
memory-mapping is requested. open_memmap does read the header first, using
read_magic() and then getting the shape and dtype from read_array_header_1_0().
It currently passes offset=fp.tell() to numpy.memmap. I just modify this offset
based on the number of items to skip and the dtype's item size.
> You can slightly modify the logic in open_memmap():
>
>     # Read the header of the file first.
>     fp = open(filename, 'rb')
>     try:
>         version = read_magic(fp)
>         if version != (1, 0):
>             msg = "only support version (1,0) of file format, not %r"
>             raise ValueError(msg % (version,))
>         shape, fortran_order, dtype = read_array_header_1_0(fp)
>         if dtype.hasobject:
>             msg = "Array can't be memory-mapped: Python objects in dtype."
>             raise ValueError(msg)
>         offset = fp.tell()
>     finally:
>         fp.close()
>
>     chunk_offset, chunk_shape = decide_offset_shape(dtype, shape,
>         fortran_order, offset)
>
>     marray = np.memmap(filename, dtype=dtype, shape=chunk_shape,
>         order=('F' if fortran_order else 'C'), mode='r+',
>         offset=chunk_offset)
To me this seems equivalent to what my hack is doing:
https://github.com/jonovik/numpy/compare/master...offset_memmap
I guess decide_offset_shape() would encapsulate the gist of what I added.