[Matrix-SIG] An Experiment in code-cleanup.

Andrew P. Mullhaupt amullhau@zen-pharaohs.com
Wed, 9 Feb 2000 02:17:39 -0500


> Travis Oliphant writes:
>  >
>  > 3) Facility for memory-mapped dataspace in arrays.
>
> For the NumPy users who are as ignorant about mmap, msync,
> and madvise as I am, I've put a couple of documents on
> my web site:

I have Kevin's "Why Aren't You Using mmap() Yet?" on my site. Kevin is
working on a new edition (11th anniversary? 1xth anniversary?).

By the way, Uresh Vahalia's book on Unix Internals is well worth reading for
anyone not yet familiar with modern operating systems, especially Unices.

Kevin is extremely knowledgeable on this subject, and several others.

> Executive summary:
>
> i) mmap on Solaris can be a very big win

Orders of magnitude.

> (see bottom of
> http://www.geog.ubc.ca/~phil/mmap/msg00003.html) when
> used in combination with WILLNEED/WONTNEED  madvise calls to
> guide the page prefetching.

And with the newer versions of Solaris, madvise() is a good way to go.

madvise is _not_ SVR4 (not in SVID3) but it _is_ in the OSF/1 AES which
means it is _not_ vendor specific. But the standard part of madvise is that
it is a "hint".

However, everything it actually _does_ when you hint the kernel with madvise
is usually specific to particular versions of an operating system.
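
A minimal sketch of hinting around a sequential scan, assuming Python 3.8
or later, where the mmap module exposes madvise() only on systems that
provide it. The function name sum_first_bytes is hypothetical, and the
constant the OS calls MADV_DONTNEED stands in for "WONTNEED" as used above.
Because the calls are only hints, the result is the same whether or not
the kernel acts on them.

```python
import mmap


def sum_first_bytes(path):
    """Touch one byte per page of a file through a read-only mapping."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # madvise() and its constants are missing on some platforms,
        # so each hint is guarded and silently skipped when unavailable.
        can_hint = hasattr(m, "madvise")
        if can_hint and hasattr(mmap, "MADV_WILLNEED"):
            m.madvise(mmap.MADV_WILLNEED)  # hint: we are about to read this
        total = sum(m[i] for i in range(0, len(m), mmap.PAGESIZE))
        if can_hint and hasattr(mmap, "MADV_DONTNEED"):
            m.madvise(mmap.MADV_DONTNEED)  # hint: we are done with this
        return total
```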

There are tricks to get around madvise not doing everything you want.
WONTNEED didn't work in Solaris for a long time, and Kevin found a trick
that worked really well instead. Kevin knows people at Sun, since he was one
of the very earliest employees there, and the trick he used to suggest has
since turned out to be the implementation of WONTNEED in Solaris.

And that trick is well worth understanding. It happens that msync() is a
good call to know. On Solaris it has an undocumented behavior: when you
msync a memory region with MS_INVALIDATE | MS_ASYNC, the dirty pages are
queued for writing and backing store is available immediately, or if dirty,
as soon as written out. This means that the pager doesn't have to run at all
to scavenge the pages. Linux didn't do this last time I looked. I suggested
it to the kernel guys and the idea got some positive response, but I don't
know if they did it.
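
A rough sketch of issuing that msync() call from Python, under several
assumptions: Python's mmap module does not expose msync flags, so this goes
through ctypes to the C library; the numeric flag values below are the ones
found in <sys/mman.h> on Linux and Solaris and should be checked on any
other system; and the helper name release_pages is hypothetical. Whether
the kernel actually frees the pages is, as described above, specific to
the operating system.

```python
import ctypes
import mmap
import os

# Assumed values (not exported by Python); verify against <sys/mman.h>.
MS_ASYNC = 0x1
MS_INVALIDATE = 0x2


def release_pages(mapping):
    """Call msync(MS_ASYNC | MS_INVALIDATE) on a whole writable mapping."""
    libc = ctypes.CDLL(None, use_errno=True)  # Unix-like systems only
    libc.msync.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.msync.restype = ctypes.c_int
    # from_buffer() needs a writable mapping; it gives us the base address.
    view = (ctypes.c_char * len(mapping)).from_buffer(mapping)
    try:
        rc = libc.msync(ctypes.addressof(view), len(mapping),
                        MS_ASYNC | MS_INVALIDATE)
        if rc != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        del view  # drop the buffer export so the mapping can be closed
```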

> ii) IRIX and some other Unices (Linux 2.2 in particular), haven't
> implemented madvise, and naive use of mmap without madvise can produce
> lots of page faulting and much slower io than, say, asynchronous io
> calls on IRIX.  (http://www.geog.ubc.ca/~phil/mmap/msg00009.html)

IRIX has an awful implementation of mmap. And SGI people go around
badmouthing mmap; not that they don't have cause, but they are usually very
surprised to see how big the win is with a good implementation. Of course,
the msync() trick didn't work on IRIX last I looked, which leads the SGI
people to believe that mmap() is brain damaged because it runs the pager
into the ground. It's a point of view that is bound to come up.

HP/UX was really whacked last time I looked. They had a version (10) which
supported the full mmap() on one series of workstations (700, 7000, I
forget, let's say 7e+?) and didn't support it except in the non-useful
SVR3.2 way on another series of workstations (8e+?). The reason was that the
8e+? workstations were multiprocessor and they hadn't figured out how to get
the newer kernel flying on the multiprocessors. I know Konrad had HP systems
at one point, maybe he has the scoop on those.

> So I'd love to see mmap in Numpy, but we may need to produce a
> tutorial outlining the tradeoffs, and giving some examples of
> madvise/msync/mmap used together (with a few benchmarks).  Any mmap
> module would need to include member functions that call madvise/msync
> for the mmapped array (but these may be no-ops on several popular OSes.)

I don't know if you want a separate module; maybe what you want is the
normal allocation of memory for all Numerical Python objects to be handled
in a way that makes sense for each operating system.

The approach I took when I was writing portable code for this sort of thing
was to write a wrapper for the memory operation semantics and then implement
the operations as a small library that would be OS specific, although not
_that_ specific. It was possible to write single source code for SVID3 and
OSF/1 AES systems with sparing use of conditional defines. Unfortunately,
that code is the intellectual property of another firm, or else I'd donate
it as an example for people who want to learn stuff about mmap. As it
stands, there was some similar code I was able to produce at some point. I
forget who here has a copy, maybe Konrad, maybe David Ascher.
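
To illustrate the wrapper idea (this is not the code mentioned above, which
is not available), here is a small sketch assuming Python 3.8 or later. The
class name MappedRegion and its methods are hypothetical. Callers see one
set of memory operations, and each operation quietly becomes a no-op where
the platform lacks the underlying call, which plays the role that
conditional defines would play in C.

```python
import mmap


class MappedRegion:
    """Hypothetical wrapper: one interface, OS-specific behavior behind it."""

    def __init__(self, path, writable=False):
        self._writable = writable
        with open(path, "r+b" if writable else "rb") as f:
            access = mmap.ACCESS_WRITE if writable else mmap.ACCESS_READ
            self.data = mmap.mmap(f.fileno(), 0, access=access)
        # Resolve hints once; None means "unsupported here, do nothing".
        self._hints = {
            "will_need": getattr(mmap, "MADV_WILLNEED", None),
            "wont_need": getattr(mmap, "MADV_DONTNEED", None),
        }

    def _advise(self, name):
        option = self._hints[name]
        if option is None or not hasattr(self.data, "madvise"):
            return False  # no-op on this platform
        self.data.madvise(option)
        return True

    def will_need(self):
        return self._advise("will_need")

    def wont_need(self):
        return self._advise("wont_need")

    def sync(self):
        # Only a writable mapping has anything to write back.
        if self._writable:
            self.data.flush()

    def close(self):
        self.data.close()
```

Each hint method returns True only when a hint was actually issued, so a
benchmark can report which operations were live on a given system.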

Later,
Andrew Mullhaupt