Memory usage per top 10x usage per heapy

Dave Angel d at davea.name
Tue Sep 25 14:50:35 EDT 2012


On 09/25/2012 01:39 PM, Junkshops wrote:

Procedural point:  I know you're trying to conform to the standard that
this mailing list uses, but you're off a little, and it's distracting.
It's also probably more work for you, and certainly for us.

You need an attribution in front of the quoted portions.  This next
section is by me, but you don't say so.  That's because you copy/pasted
it from elsewhere in the reply, and didn't copy the "... Dave Angel
wrote" part.

Much easier is to take the reply, remove the parts you're not going to
respond to, and put your own comments in between the parts that are left
(as you're doing).  And generally there's no need for anything after
your last remark, so just delete everything from there down to your
signature, if any.


>> I'm a bit surprised you aren't beyond the 2GB limit, just with the
>> structures you describe for the file.  You do realize that each object
>> has quite a few bytes of overhead, so it's not surprising to use several
>> times the size of the file to store it in an organized way.
> I did some back-of-the-envelope calcs which more or less agreed with
> heapy. The code stores one string, which is, on average, about 50 chars
> or so, and one MD5 hex string per line of the file. There's about 40
> bytes or so of overhead per string, per sys.getsizeof(). I'm also
> storing an int (24 bytes) and a <10 char string in an object with
> __slots__ set. Each object, per heapy (this is one area where I might
> be underestimating things), takes 64 bytes plus instance variable
> storage, so per line:
> 
> 50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line * 2M lines = ~600MB
> plus some memory for the dicts, which is about what heapy is reporting
> (note I'm currently not actually running all 2M lines, I'm just running
> subsets for my tests).
> 
> Is there something I'm missing? Here's the heapy output after loading
> ~300k lines:
> 
> Partition of a set of 1199849 objects. Total size = 89965376 bytes.
>  Index   Count    %      Size    %  Cumulative    %  Kind
>      0  599999   50  38399920   43    38399920   43  str
>      1       5    0  25167224   28    63567144   71  dict
>      2  299998   25  19199872   21    82767016   92  0xa13330
>      3  299836   25   7196064    8    89963080  100  int
>      4       4    0      1152    0    89964232  100  collections.defaultdict
> 
> Note that 3 of the dicts are empty. I assume 0xa13330 is the
> address of the object. I'd actually expect to see 900k strings, but the
> <10 char string is always the same in this case so perhaps the runtime
> is using the same object...? 

CPython currently interns short strings that conform to variable name
rules.  You can't count on that behavior (and I probably don't have it
quite right anyway), but it's probably what you're seeing.
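
A quick way to see the effect (this is a CPython implementation detail,
so treat it as illustration only, not something to rely on):

    # Equal identifier-like string literals end up as one shared object:
    a = 'shortname'
    b = 'shortname'
    print(a is b)                    # True on current CPython

    # Strings built at runtime are generally NOT auto-interned:
    c = ''.join(['short', 'name'])
    print(a is c)                    # False: equal value, distinct object

    # sys.intern() (the intern() builtin on Python 2) forces sharing
    # when you need it.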


> At this point, top reports python as using
> 1.1g of virt and 1.0g of res.
> 
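
Incidentally, the individual numbers in that estimate are easy to
double-check with sys.getsizeof().  A quick sketch; the Record class
here is just a stand-in for your __slots__ class, and the exact sizes
vary with Python version and build:

    import sys

    class Record(object):
        __slots__ = ('count', 'tag')
        def __init__(self, count, tag):
            self.count = count
            self.tag = tag

    print(sys.getsizeof('x' * 50))   # ~50-char data string plus overhead
    print(sys.getsizeof('0' * 32))   # 32-char md5 hex string plus overhead
    print(sys.getsizeof(12345))      # int: 24 bytes on a 64-bit 2.x build
    print(sys.getsizeof(Record(1, 'shortname')))  # the __slots__ instance
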
>> I also
>> wonder if heapy has been written to take into account the larger size of
>> pointers in a 64bit build.
> That I don't know, but that would only explain, at most, a 2x increase
> in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.
> 
>> Another thing is to make sure
>> that the md5 object used in your two maps is the same object, and not
>> just one with the same value.
> That's certainly the way the code is written, and heapy seems to confirm
> that the strings aren't duplicated in memory.
> 
> Thanks for sticking with me on this,

You're certainly welcome.  I suspect that heapy has some limitation in
its reporting, and that's the source of the discrepancy.  Oscar points
out that you have a bunch of exception objects, which certainly looks
suspicious.  If you're somehow storing one of these per line, and heapy
isn't reporting them, that could be a large discrepancy.
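
One way to test that directly is to allocate a pile of exceptions and
see whether heapy's report grows to match.  A rough sketch, assuming
guppy's usual hpy() interface (I haven't run this against your setup):

    from guppy import hpy

    hp = hpy()
    hp.setrelheap()      # only count objects allocated after this point

    excs = [ValueError(str(i)) for i in range(100000)]

    # If heapy accounts for exceptions, ~100k ValueError objects
    # should dominate this report:
    print(hp.heap())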

He also points out that you have a couple of lambda functions stored in
one of your dictionaries.  Lambda functions can be an expensive
proposition if you're building millions of them.  So can nested
functions with non-local variable references, in case you have any of those.
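
You can get a feel for the per-object cost with sys.getsizeof(), though
it understates the real total, since closure cells and their contents
are counted separately:

    import sys

    # Every lambda evaluated at runtime is a fresh function object, so
    # a few million of these means a few million function objects:
    fns = [lambda x, n=n: x + n for n in range(5)]
    print(sys.getsizeof(fns[0]))     # the function object alone

    # A nested function referencing a non-local variable also drags
    # cell objects around with it:
    def make_adder(n):
        def add(x):
            return x + n             # n lives in a closure cell
        return add

    add3 = make_adder(3)
    print(add3.__closure__)          # tuple of cell objects holding n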

Oscar also reminds you of what I suggested for the md5 fields.  Storing
them as ints instead of hex strings could save a good bit.  Just
remember to use the same one for both dicts, as you've been doing with
the strings.
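
Something like this shows the difference (the exact sizes vary with
Python version and build; these are rough 64-bit figures):

    import hashlib
    import sys

    digest_hex = hashlib.md5(b'a line from the file').hexdigest()
    digest_int = int(digest_hex, 16)   # the same 128-bit value as an int

    print(sys.getsizeof(digest_hex))   # 32 chars plus string overhead
    print(sys.getsizeof(digest_int))   # roughly half that, give or take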


Other than that, I'm stumped.


-- 

DaveA


