[Numpy-discussion] Change in memmap behaviour

Nathaniel Smith njs at pobox.com
Wed Jul 4 14:21:38 EDT 2012


On Tue, Jul 3, 2012 at 4:08 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Tue, Jul 3, 2012 at 10:35 AM, Thouis (Ray) Jones <thouis at gmail.com> wrote:
>> On Mon, Jul 2, 2012 at 11:52 PM, Sveinung Gundersen <sveinugu at gmail.com> wrote:
>>>
>>> On 2. juli 2012, at 22.40, Nathaniel Smith wrote:
>>>
>>>> On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen <sveinugu at gmail.com> wrote:
>>>>> [snip]
>>>>>
>>>>>
>>>>>
>>>>> Your actual memory usage may not have increased as much as you think,
>>>>> since memmap objects don't necessarily take much memory -- it sounds
>>>>> like you're leaking virtual memory, but your resident set size
>>>>> shouldn't go up as much.
>>>>>
>>>>>
>>>>> As I understand it, memmap objects retain the contents of the memmap in
>>>>> memory after it has been read the first time (in a lazy manner). Thus, when
>>>>> reading a slice of a 24GB file, only that part resides in memory. Our system
>>>>> reads a slice of a memmap, calculates something (say, the sum), and then
>>>>> deletes the memmap. It then loops through this for consecutive slices,
>>>>> retaining a low memory usage. Consider the following code:
>>>>>
>>>>> import numpy as np
>>>>> res = []
>>>>> vecLen = 3095677412
>>>>> for i in xrange(vecLen/10**8+1):
>>>>>     x = i * 10**8
>>>>>     y = min((i+1) * 10**8, vecLen)
>>>>>     res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())
>>>>>
>>>>> The memory usage of this code on a 24GB file (one value for each nucleotide
>>>>> in the human DNA!) is 23g resident memory after the loop is finished (not
>>>>> 24g for some reason..).
>>>>>
>>>>> Running the same code on 1.5.1rc1 gives a resident memory of 23m after the
>>>>> loop.
>>>>
>>>> Your memory measurement tools are misleading you. The same memory is
>>>> resident in both cases, just in one case your tools say it is
>>>> operating system disk cache (and not attributed to your app), and in
>>>> the other case that same memory, treated in the same way by the OS, is
>>>> shown as part of your app's resident memory. Virtual memory is
>>>> confusing...
>>>
>>> But the crucial difference is perhaps that the disk cache can be cleared by the OS if needed, but not the application memory in the same way, which must be swapped to disk? Or am I still confused?
>>>
>>> (snip)
>>>
>>>>>
>>>>> Great! Any idea on whether such a patch may be included in 1.7?
>>>>
>>>> Not really, if I or you or someone else gets inspired to take the time
>>>> to write a patch soon then it will be, otherwise not...
>>>>
>>>> -N
>>>
>>> I have now tried to add a patch, in the way you proposed, but I may have gotten it wrong..
>>>
>>> http://projects.scipy.org/numpy/ticket/2179
>>
>> I put this in a github repo, and added tests (author credit to Sveinung)
>> https://github.com/thouis/numpy/tree/mmap_children
>>
>> I'm not sure which branch to issue a PR against, though.
>
> Looks good to me, thanks to both of you!
>
> Obviously should be merged to master; beyond that I'm not sure. We
> definitely want it in 1.7, but I'm not sure if that's been branched
> yet or not. (Or rather, it has been branched, but then maybe it was
> unbranched again? Travis?) Since it was a 1.6 regression it'd make
> sense to cherrypick to the 1.6 branch too, just in case it gets
> another release.

Merged into master and maintenance/1.6.x, but not maintenance/1.7.x;
I'll let Ondrej or Travis figure that out...
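
For anyone reading this in the archive, here is a minimal sketch of the
chunked-sum pattern from Sveinung's example, written for Python 3. It is
not the patch from the ticket. The function name chunked_sums and the
chunk_len parameter are made up for illustration, and it maps the file
once and slices it rather than re-creating the memmap on every pass,
which is an assumption about what the loop is meant to do.

```python
import os
import tempfile

import numpy as np


def chunked_sums(path, chunk_len, dtype="float64"):
    """Return the sum of each consecutive chunk of a flat binary file.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # Map the file once, read-only. Pages are only read from disk when a
    # slice is actually touched.
    data = np.memmap(path, dtype=dtype, mode="r")
    sums = []
    for start in range(0, data.shape[0], chunk_len):
        stop = min(start + chunk_len, data.shape[0])
        # Convert to a plain Python float so the result holds no
        # reference back to the memmap.
        sums.append(float(data[start:stop].sum()))
    # Drop the last reference so the mapping can be released.
    del data
    return sums


if __name__ == "__main__":
    # Small stand-in file; the file discussed in the thread was about 24GB.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "val.float64")
        np.arange(1000, dtype="float64").tofile(path)
        print(chunked_sums(path, chunk_len=300))
```

Whether the touched pages are reported as your process's resident
memory or as OS disk cache depends on the tool and platform, as
discussed above, so treat any RSS number from a loop like this with
some care.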

-N


