Generator using item[n-1] + item[n] memory

Ian Kelly ian.g.kelly at gmail.com
Fri Feb 14 17:59:05 EST 2014


On Fri, Feb 14, 2014 at 3:27 PM, Nick Timkovich <prometheus235 at gmail.com> wrote:
> I have a Python 3.x program that processes several large text files that
> contain sizeable arrays of data that can occasionally brush up against the
> memory limit of my puny workstation.  From some basic memory profiling, it
> seems like when using the generator, the memory usage of my script balloons
> to hold consecutive elements, using up to twice the memory I expect.
>
> I made a simple, stand-alone example to test the generator and I get similar
> results in Python 2.7, 3.3, and 3.4.  My test code follows; `memory_usage()`
> is a modified version of [this function from an SO
> question](http://stackoverflow.com/a/898406/194586) which uses
> `/proc/self/status` and agrees with `top` as I watch it.  `resource` is
> probably a more cross-platform method:
>
> ###############
>
> import sys, resource, gc, time
>
> def biggen():
>     sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
>     for size in sizes:
>         data = [1] * int(size * 1e6)
>         #time.sleep(1)
>         yield data
>
> def consumer():
>     for data in biggen():
>         rusage = resource.getrusage(resource.RUSAGE_SELF)
>         peak_mb = rusage.ru_maxrss/1024.0
>         print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
>                 peak_mb, len(data)/1e6))
>         #print(memory_usage())
>
>         data = None  # go
>         del data     # away
>         gc.collect() # please.
>
> # def memory_usage():
> #     """Memory usage of the current process, requires /proc/self/status"""
> #     # http://stackoverflow.com/a/898406/194586
> #     result = {'peak': 0, 'rss': 0}
> #     for line in open('/proc/self/status'):
> #         parts = line.split()
> #         key = parts[0][2:-1].lower()
> #         if key in result:
> #             result[key] = int(parts[1])/1024.0
> #     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)
>
> print(sys.version)
> consumer()
>
> ###############
>
> In practice I'll process data coming from such a generator loop, saving just
> what I need, then discard it.
>
> When I run the above script and two large elements come in series (the data
> size can be highly variable), it seems like Python builds the next element
> before freeing the previous one, leading to up to double the expected memory
> usage.
>
> [...]
>
> The crazy belt-and-suspenders-and-duct-tape approach `data = None`, `del
> data`, and `gc.collect()` does nothing.

Because at the time you call gc.collect(), the generator still holds a
reference to the data, so it can't be collected.  Assuming this is
running in CPython and there are no reference cycles in the data, the
collection is unnecessary anyway, since CPython will automatically
free the data immediately when there are no references (but this is
not guaranteed for other implementations of Python).
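You can see that lingering reference directly.  Here is a small
illustration of my own (not from your script; it peeks at the suspended
generator's locals through the gi_frame attribute that CPython exposes):

    import gc

    def gen():
        data = [1] * 10**6
        yield data

    g = gen()
    chunk = next(g)
    del chunk     # drop the consumer's reference
    gc.collect()  # frees nothing useful here
    # The suspended generator frame still references the list, so it stays alive:
    print(len(g.gi_frame.f_locals['data']))  # -> 1000000

Only when the generator advances past the next assignment to data (or is
closed and collected) does that last reference go away.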

> I'm pretty sure the generator itself is not doubling up on memory, because
> otherwise a single large yielded value would raise the peak usage in the
> *same iteration* the large object appeared; instead, the peak only rises
> when two large objects come consecutively.

Look again.  What happens to the data between two iterations of the generator?

1) the data variable holds the data from the prior iteration
2) the loop jumps back up to the top
3) the data for the next iteration is constructed
4) the data for the next iteration is assigned to the data variable

It is not until step 4 that the variable stops referencing the data
from the prior iteration.  So there is a brief window, during step 3,
where both chunks must still be in memory at once.
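Mapped onto your biggen() (same code, just annotated with those steps):

    def biggen():
        sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
        for size in sizes:                # step 2: loop jumps back to the top
            data = [1] * int(size * 1e6)  # step 3: new chunk built while 'data'
                                          #   still references the old one;
                                          #   step 4: rebinding finally drops it
            yield data                    # step 1: suspended here between
                                          #   iterations, 'data' holds the chunk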

> How can I save my memory?

Try unreferencing the data in the generator at the end of each iteration.
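For example (an untested sketch of that suggestion; it is your biggen()
with one line added):

    def biggen():
        sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
        for size in sizes:
            data = [1] * int(size * 1e6)
            yield data
            del data  # drop the generator's reference before the next chunk
                      # is constructed on the following pass through the loop

Combined with the del in your consumer, the old chunk has no remaining
references by the time the next one is built.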


