[Numpy-discussion] Memory leak/fragmentation when using np.memmap

G Jones glenn.caltech at gmail.com
Wed May 18 19:36:31 EDT 2011


Hello,
I have seen the effect you describe, and I had originally assumed it was the
cause, but there seems to be more to the problem. If it were only that
effect, there should be no memory error, because the OS would drop the pages
when the memory was actually needed for something else. At least I would
hope so; if not, this seems like a serious problem for Linux.

As a follow-up, I managed to install tcmalloc as described in the article I
mentioned. Running the example I sent now shows a constant memory footprint,
as expected. I am surprised such a workaround was necessary. Surely others
must work with datasets this large using numpy/Python?
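(Editorial note, not from the thread: tcmalloc does not require rebuilding
Python; on Linux it can usually be preloaded into an existing interpreter.
A sketch, assuming tcmalloc is installed; the .so path and script name below
are placeholders and vary by distribution:)

```shell
# Preload tcmalloc so it replaces glibc's malloc for the whole process.
# Adjust the library path to wherever your distribution installs it.
LD_PRELOAD=/usr/lib/libtcmalloc.so python my_memmap_script.py

# Or export it for everything launched from the current shell:
export LD_PRELOAD=/usr/lib/libtcmalloc.so
python my_memmap_script.py
```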

Thanks,
Glenn

On Wed, May 18, 2011 at 4:21 PM, Pauli Virtanen <pav at iki.fi> wrote:

> On Wed, 18 May 2011 15:09:31 -0700, G Jones wrote:
> [clip]
> > import numpy as np
> >
> > x = np.memmap('mybigfile.bin', mode='r', dtype='uint8')
> > print x.shape   # prints (42940071360,) in my case
> > ndat = x.shape[0]
> > for k in range(1000):
> >     # The astype ensures the data is actually read in from disk:
> >     y = x[k*ndat/1000:(k+1)*ndat/1000].astype('float32')
> >     del y
> >
> > One would expect such a program would have a roughly constant memory
> > footprint, but in fact 'top' shows that the RES memory continually
> > increases. I can see that the memory usage is actually occurring because
> > the OS eventually starts to swap to disk. The memory usage does not seem
> > to correspond with the total size of the file.
>
> Your OS probably likes to keep the pages touched in memory and in swap,
> rather than dropping them. This happens at least on Linux.
>
> You can check that an equivalent simple C program displays
> the same behavior (use with file "data" with enough bytes):
>
> #include <sys/mman.h>
> #include <fcntl.h>
> #include <unistd.h>
>
> int main()
> {
>    unsigned long size = 2000000000;
>    unsigned long i;
>    char *p;
>    int fd;
>    char sum;
>
>    fd = open("data", O_RDONLY);
>    if (fd < 0)
>        return 1;
>    p = (char*)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
>    if (p == MAP_FAILED)
>        return 1;
>
>    sum = 0;
>    for (i = 0; i < size; ++i) {
>        sum += *(p + i);
>    }
>    munmap(p, size);
>    close(fd);
>
>    return sum;  /* use sum so the loop isn't optimized away */
> }
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>