[Baypiggies] json using huge memory footprint and not releasing

David Lawrence david at bitcasa.com
Fri Jun 15 23:44:44 CEST 2012


On Fri, Jun 15, 2012 at 2:41 PM, Bob Ippolito <bob at redivi.com> wrote:

> On Fri, Jun 15, 2012 at 5:32 PM, David Lawrence <david at bitcasa.com> wrote:
>
>> On Fri, Jun 15, 2012 at 2:22 PM, Bob Ippolito <bob at redivi.com> wrote:
>>
>>> On Fri, Jun 15, 2012 at 4:15 PM, David Lawrence <david at bitcasa.com> wrote:
>>>
>>>> When I load the file with json.load, Python's memory usage spikes to about
>>>> 1.8GB and I can't seem to get that memory released.  I put together a
>>>> test case that's very simple:
>>>>
>>>> with open("test_file.json", 'r') as f:
>>>>     j = json.load(f)
>>>>
>>>> I'm sorry that I can't provide a sample json file, my test file has a
>>>> lot of sensitive information, but for context, I'm dealing with a file in
>>>> the order of 240MB.  After running the above 2 lines I have the
>>>> previously mentioned 1.8GB of memory in use.  If I then do "del j" memory
>>>> usage doesn't drop at all.  If I follow that with a "gc.collect()" it still
>>>> doesn't drop.  I even tried unloading the json module and running another
>>>> gc.collect.
>>>>
>>>> I'm trying to run some memory profiling but heapy has been churning
>>>> 100% CPU for about an hour now and has yet to produce any output.
>>>>
>>>> Does anyone have any ideas?  I've also tried the above using cjson
>>>> rather than the packaged json module.  cjson used about 30% less memory but
>>>> otherwise displayed exactly the same issues.
>>>>
>>>> I'm running Python 2.7.2 on Ubuntu server 11.10.
>>>>
>>>> I'm happy to load up any memory profiler and see if it does better than
>>>> heapy and provide any diagnostics you might think are necessary.  I'm
>>>> hunting around for a large test json file that I can provide for anyone
>>>> else to give it a go.
>>>>
>>>
>>> It may just be the way that the allocator works. What happens if you
>>> load the JSON, del the object, then do it again? Does it take up 3.6GB or
>>> stay at 1.8GB? You may not be able to "release" that memory to the OS in
>>> such a way that RSS gets smaller... but at the same time it's not really a
>>> leak either.
>>>
>>> GC shouldn't really take part in a JSON structure, since it's guaranteed
>>> to be acyclic… ref counting alone should be sufficient to instantly reclaim
>>> that space. I'm not at all surprised that gc.collect() doesn't change
>>> anything for CPython in this case.
>>>
>>> $ python
>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>>> darwin
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import os, subprocess, simplejson
>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>>> ...
>>> >>> rss()
>>> '7284'
>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>>> >>> rss()
>>> '49032'
>>> >>> del l
>>> >>> rss()
>>> '42232'
>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>>> >>> rss()
>>> '49032'
>>> >>> del l
>>> >>> rss()
>>> '42232'
>>>
>>> $ python
>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>>> darwin
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import os, subprocess, simplejson
>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>>> ...
>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>>> >>> rss()
>>> '288116'
>>> >>> del l
>>> >>> rss()
>>> '84384'
>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>>> >>> rss()
>>> '288116'
>>> >>> del l
>>> >>> rss()
>>> '84384'
>>>
>>> -bob
>>>
>>>
>> It does appear that after deleting the object and running the example again,
>> the memory stays static at about 1.8GB.  Could you provide a little more
>> detail on what your examples are meant to demonstrate?  One shows a static
>> memory footprint and the other shows the footprint fluctuating up and down.  I
>> would expect the static footprint in the first example just from my
>> understanding of Python's free lists for integers.
>>
>>
> Both examples show the same thing, but with different data structures
> (list of int, dict of str:int). The only thing missing is that I left out
> the baseline in the second example, it would be the same as the first
> example.
>
> The other suggestions are spot on. If you want the memory to really be
> released, you have to do it in a transient subprocess, and/or you could
> probably have lower overhead if you're using a streaming parser (if there's
> something you can do with it incrementally).
>
> -bob
>
>
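The streaming-parser suggestion depends on the shape of the data. As a rough stdlib-only sketch (assuming the input is a stream of whitespace-separated JSON values; a single huge document would need a true incremental parser such as the third-party ijson, not shown here), json.JSONDecoder.raw_decode can pull one value at a time out of a buffer:

```python
import json

def iter_json_values(text):
    # Incrementally decode a buffer of whitespace-separated JSON values,
    # yielding one Python object at a time so the caller can process and
    # discard each value instead of materializing everything at once.
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        while pos < end and text[pos].isspace():
            pos += 1  # skip whitespace between values
        if pos >= end:
            break
        # raw_decode returns (object, index just past the decoded value)
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

for value in iter_json_values('{"a": 1} {"b": 2}\n[1, 2, 3]'):
    print(value)
```

Each decoded value can be processed and dropped before the next one is parsed, so the full structure is never resident at once.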
Thank you all for the help.  Multiprocessing with a Queue and blocking
get() calls looks like it will work well.
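That approach might look roughly like the following sketch (the file contents and the len() summary are placeholders; the key point is that json.load runs in a child process, so all of its memory is returned to the OS when the child exits, and only a small reduced result is pickled back through the Queue):

```python
import json
import multiprocessing

def load_and_summarize(path, queue):
    # Runs in a child process: all memory allocated by json.load is
    # returned to the OS when this process exits.
    with open(path, 'r') as f:
        j = json.load(f)
    # Put only a reduced result on the queue, not the whole structure --
    # pickling the full 1.8GB back to the parent would defeat the purpose.
    queue.put(len(j))

if __name__ == '__main__':
    # Small stand-in for the real 240MB file.
    with open('test_file.json', 'w') as f:
        json.dump(dict((str(x), x) for x in range(1000)), f)

    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=load_and_summarize,
                                args=('test_file.json', q))
    p.start()
    result = q.get()   # blocking get(), as described above
    p.join()
    print(result)      # number of top-level keys
```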
