Memory usage per top 10x usage per heapy

Junkshops junkshops at gmail.com
Tue Sep 25 13:39:00 EDT 2012


> I'm a bit surprised you aren't beyond the 2gb limit, just with the
> structures you describe for the file.  You do realize that each object
> has quite a few bytes of overhead, so it's not surprising to use several
> times the size of a file, to store the file in an organized way.
I did some back-of-the-envelope calcs which more or less agreed with 
heapy. The code stores one string, which is on average about 50 chars, 
plus one 32-char MD5 hex string, per line of the file. There's about 40 
bytes of overhead per string according to sys.getsizeof(). I'm also 
storing an int (24 bytes) and a <10 char string in an object with 
__slots__ set. Each object, per heapy (this is one area where I might 
be underestimating things), takes 64 bytes plus instance variable 
storage, so per line:

50 (avg string) + 32 (MD5 hex) + 10 (short string) + 3 * 40 (string 
overhead) + 24 (int) + 64 (object) = 300 bytes per line * 2M lines = 
~600MB, plus some memory for the dicts, which is about what heapy is 
reporting. (Note I'm currently not running all 2M lines; I'm just 
running subsets for my tests.)
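
For anyone who wants to sanity-check numbers like these on their own 
build, here's roughly the kind of measurement I did (the class and 
field names below are stand-ins for mine, and exact sizes vary by 
Python version and 32- vs 64-bit build):

import sys

class Record(object):
    # stand-in for my real class: one int and one short string
    __slots__ = ('count', 'tag')
    def __init__(self, count, tag):
        self.count = count
        self.tag = tag

print(sys.getsizeof('x' * 50))               # ~50-char data string + overhead
print(sys.getsizeof('0' * 32))               # 32-char MD5 hex string + overhead
print(sys.getsizeof(12345))                  # int
print(sys.getsizeof(Record(1, 'abcdefghi'))) # instance header + slot pointers

Keep in mind sys.getsizeof() doesn't recurse: the instance size 
excludes the int and strings the slots point at, which is why 
everything gets summed by hand above.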

Is there something I'm missing? Here's the heapy output after loading 
~300k lines:

Partition of a set of 1199849 objects. Total size = 89965376 bytes.
 Index   Count    %      Size    %  Cumulative    %  Kind
     0  599999   50  38399920   43    38399920   43  str
     1       5    0  25167224   28    63567144   71  dict
     2  299998   25  19199872   21    82767016   92  0xa13330
     3  299836   25   7196064    8    89963080  100  int
     4       4    0      1152    0    89964232  100  collections.defaultdict

Note that 3 of the dicts are empty. I assume 0xa13330 is the address of 
the class of my __slots__ instances, since heapy shows that single 
address for all ~300k of them. I'd actually expect to see 900k strings, 
but the <10 char string is always the same in this case, so perhaps the 
runtime is interning it and reusing one object...? At this point, top 
reports python as using 1.1G of virt and 1.0G of res.
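
If the runtime really is reusing one object, that would be CPython 
string interning, which is easy to demonstrate (toy values below; 
intern() is the Python 2 spelling, it moved to sys.intern() in 3.x):

a = 'GOOD'
b = 'GOOD'
print(a is b)          # True: identifier-like literals get interned

c = ''.join(['GO', 'OD'])
print(a is c)          # usually False: built at runtime, same value, new object
print(a is intern(c))  # True again: intern() returns the shared object

So if my <10 char string comes from a literal or gets interned 
somewhere, heapy counting only ~600k strings instead of 900k makes 
sense.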

> I also
> wonder if heapy has been written to take into account the larger size of
> pointers in a 64bit build.
That I don't know, but that would only explain, at most, a 2x increase 
in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.

> Another thing is to make sure
> that the md5 object used in your two maps is the same object, and not
> just one with the same value.
That's certainly the way the code is written, and heapy seems to confirm 
that the strings aren't duplicated in memory.
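
The pattern in the loading code is essentially this (the names here 
are made up, but the point is that both maps end up holding references 
to a single str object per distinct digest value):

import hashlib

_canonical = {}

def canonical(digest):
    # Return one shared str object per distinct digest value: the
    # first time a digest is seen it's stored as both key and value;
    # later equal digests get that original object back.
    return _canonical.setdefault(digest, digest)

d1 = canonical(hashlib.md5('some line').hexdigest())
d2 = canonical(hashlib.md5('some line').hexdigest())
print(d1 == d2)  # True: equal values
print(d1 is d2)  # True: and the very same object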

Thanks for sticking with me on this,

MrsE

On 9/25/2012 4:06 AM, Dave Angel wrote:
> On 09/25/2012 12:21 AM, Junkshops wrote:
>>> Just curious;  which is it, two million lines, or half a million bytes?
> <snip>
>> Sorry, that should've been a 500MB, 2M line file.
>>
>>> which machine is 2gb, the Windows machine, or the VM?
>> VM. Winders is 4gb.
>>
>>> ...but I would point out that just because
>>> you free up the memory from the Python doesn't mean it gets released
>>> back to the system.  The C runtime manages its own heap, and is pretty
>>> persistent about hanging onto memory once obtained.  It's not normally a
>>> problem, since most small blocks are reused.  But it can get
>>> fragmented.  And I have no idea how well Virtual Box maps the Linux
>>> memory map into the Windows one.
>> Right, I understand that - but what's confusing me is that, given the
>> memory use is (I assume) monotonically increasing, the code should never
>> use more than what's reported by heapy once all the data is loaded into
>> memory, given that memory released by the code to the Python runtime is
>> reused. To the best of my ability to tell I'm not storing anything I
>> shouldn't, so the only thing I can think of is that all the object
>> creation and destruction is, for some reason, preventing reuse of
>> memory. I'm at a bit of a loss regarding what to try next.
> I'm not familiar with heapy, but perhaps it's missing something there.
> I'm a bit surprised you aren't beyond the 2gb limit, just with the
> structures you describe for the file.  You do realize that each object
> has quite a few bytes of overhead, so it's not surprising to use several
> times the size of a file, to store the file in an organized way.  I also
> wonder if heapy has been written to take into account the larger size of
> pointers in a 64bit build.
>
> Perhaps one way to save space would be to use a long to store those md5
> values.  You'd have to measure it, but I suspect it'd help (at the cost
> of lots of extra hexlify-type calls).  Another thing is to make sure
> that the md5 object used in your two maps is the same object, and not
> just one with the same value.
>
>
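
P.S. For the archives, the store-the-digest-as-a-long idea would look 
something like this (a sketch, not my actual code; on a 64-bit build a 
128-bit long should be roughly half the size of the 32-char hex str it 
replaces, at the cost of converting back and forth):

import hashlib

digest_int = int(hashlib.md5('some line').hexdigest(), 16)  # 128-bit long

# back to hex only when needed for display or external comparison
digest_hex = '%032x' % digest_int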

