[Baypiggies] json using huge memory footprint and not releasing

Bob Ippolito bob at redivi.com
Fri Jun 15 23:41:59 CEST 2012


On Fri, Jun 15, 2012 at 5:32 PM, David Lawrence <david at bitcasa.com> wrote:

> On Fri, Jun 15, 2012 at 2:22 PM, Bob Ippolito <bob at redivi.com> wrote:
>
>> On Fri, Jun 15, 2012 at 4:15 PM, David Lawrence <david at bitcasa.com>wrote:
>>
>>> When I load the file with json, Python's memory usage spikes to about
>>> 1.8GB and I can't seem to get that memory released.  I put together a
>>> test case that's very simple:
>>>
>>> import json
>>> with open("test_file.json", 'r') as f:
>>>     j = json.load(f)
>>>
>>> I'm sorry that I can't provide a sample JSON file; my test file has a
>>> lot of sensitive information.  For context, I'm dealing with a file on
>>> the order of 240MB.  After running the above snippet I have the
>>> previously mentioned 1.8GB of memory in use.  If I then do "del j", memory
>>> usage doesn't drop at all.  If I follow that with a "gc.collect()" it still
>>> doesn't drop.  I even tried unloading the json module and running another
>>> gc.collect().
>>>
>>> I'm trying to run some memory profiling, but heapy has been churning at
>>> 100% CPU for about an hour now and has yet to produce any output.
>>>
>>> Does anyone have any ideas?  I've also tried the above using cjson
>>> rather than the packaged json module.  cjson used about 30% less memory but
>>> otherwise displayed exactly the same issues.
>>>
>>> I'm running Python 2.7.2 on Ubuntu server 11.10.
>>>
>>> I'm happy to load up any memory profiler and see if it does better than
>>> heapy, and provide any diagnostics you might think are necessary.  I'm
>>> hunting around for a large test JSON file that I can provide for anyone
>>> else to give it a go.
>>>
>>
>> It may just be the way that the allocator works. What happens if you load
>> the JSON, del the object, then do it again? Does it take up 3.6GB or stay
>> at 1.8GB? You may not be able to "release" that memory to the OS in such a
>> way that RSS gets smaller... but at the same time it's not really a leak
>> either.
>>
>> GC shouldn't really come into play for a JSON structure, since it's
>> guaranteed to be acyclic… ref counting alone should be sufficient to
>> instantly reclaim that space. I'm not at all surprised that gc.collect()
>> doesn't change anything for CPython in this case.
>>
>> $ python
>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>> darwin
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import os, subprocess, simplejson
>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>> ...
>> >>> rss()
>> '7284'
>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>> >>> rss()
>> '49032'
>> >>> del l
>> >>> rss()
>> '42232'
>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>> >>> rss()
>> '49032'
>> >>> del l
>> >>> rss()
>> '42232'
>>
>> $ python
>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>> darwin
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import os, subprocess, simplejson
>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>> ...
>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>> >>> rss()
>> '288116'
>> >>> del l
>> >>> rss()
>> '84384'
>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>> >>> rss()
>> '288116'
>> >>> del l
>> >>> rss()
>> '84384'
>>
>> -bob
>>
>>
> It does appear that after deleting the object and running the example
> again, memory stays static at about 1.8GB.  Could you provide a little more
> detail on what your examples are meant to demonstrate?  One shows a static
> memory footprint and the other shows the footprint fluctuating up and down.
> I would expect the static footprint in the first example just from my
> understanding of Python's free lists for integers.
>
>
Both examples show the same thing, but with different data structures (list
of int, dict of str:int). The only thing missing is that I left out the
baseline in the second example; it would be the same as in the first example.

The other suggestions are spot on. If you want the memory to really be
released, you have to do the work in a transient subprocess, and/or you could
probably have lower overhead if you use a streaming parser (if there's
something you can do with the data incrementally).
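
For what it's worth, here's a rough, untested sketch of the subprocess
approach, assuming all you need back from the parse is something small like
a count (the summarize() function and its return value are just
placeholders):

    import json
    import multiprocessing

    def summarize(path):
        # Runs in a short-lived worker process; the ~1.8GB of parsed
        # objects lives and dies there, so the parent's RSS never grows.
        with open(path, 'r') as f:
            j = json.load(f)
        return len(j)  # return only the small result you actually need

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=1)
        result = pool.apply(summarize, ('test_file.json',))
        pool.close()
        pool.join()
        print(result)

And if the top level of your file happens to be a big JSON array, a
third-party streaming parser such as ijson (just one example; the handle()
callback below is hypothetical) would let you walk the elements one at a
time without ever holding the whole structure in memory:

    import ijson  # third-party streaming JSON parser, not in the stdlib

    with open('test_file.json', 'rb') as f:
        # The 'item' prefix yields each element of a top-level array one
        # at a time, so only one element is materialized at once.
        for item in ijson.items(f, 'item'):
            handle(item)  # placeholder for whatever incremental work you do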

-bob