Memory utilization blows up with dict structure

Christian mining.facts at gmail.com
Fri Sep 23 06:22:50 EDT 2016


On Friday, 23 September 2016 12:02:47 UTC+2, Chris Angelico wrote:
> On Fri, Sep 23, 2016 at 7:05 PM, Christian <mining.facts at gmail.com> wrote:
> > I'm wondering why Python blows up a dictionary structure so much.
> >
> > The ids and cat substructures can have 0..n entries, but in most cases they have <= 10; t is limited to <= 6.
> >
> > Example:
> >
> > {'0a0f7a3a0e09826caef1bff707785662': {'ids': {'aa316b86-8169-11e6-bab9-0050563e2d7c',
> >  'aa3174f0-8169-11e6-bab9-0050563e2d7c',
> >  'aa319408-8169-11e6-bab9-0050563e2d7c',
> >  'aa3195e8-8169-11e6-bab9-0050563e2d7c',
> >  'aa319732-8169-11e6-bab9-0050563e2d7c',
> >  'aa319868-8169-11e6-bab9-0050563e2d7c',
> >  'aa31999e-8169-11e6-bab9-0050563e2d7c',
> >  'aa319b06-8169-11e6-bab9-0050563e2d7c'},
> >   't': {'type1', 'type2'},
> >   'dt': datetime.datetime(2016, 9, 11, 15, 15, 54, 343000),
> >   'nids': 8,
> >   'ntypes': 2,
> >   'cat': [('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''),
> >    ('ABC','aa3174f0-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> >    ('ABC','aa319408-8169-11e6-bab9-0050563e2d7c','3', 'type1'),
> >    ('ABC','aa3195e8-8169-11e6-bab9-0050563e2d7c', '3', 'type2'),
> >    ('ABC','aa319732-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> >    ('ABC','aa319868-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> >    ('ABC','aa31999e-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
> >    ('ABC','aa319b06-8169-11e6-bab9-0050563e2d7c', '3', 'type2')]},
> >
> >
> > >>> sys.getsizeof(superdict)
> > 50331744
> > >>> len(superdict)
> > 941272
> 
> So... you have a million entries in the master dictionary, each of
> which has an associated collection of data, consisting of half a dozen
> things, some of which have subthings. The very smallest an object will
> ever be on a 64-bit Linux system is 16 bytes:
> 
> >>> sys.getsizeof(object())
> 16
> 
> and most of these will be much larger:
> 
> >>> sys.getsizeof(8)
> 28
> >>> sys.getsizeof(datetime.datetime(2016, 9, 11, 15, 15, 54, 343000))
> 48
> >>> sys.getsizeof([])
> 64
> >>> sys.getsizeof(('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''))
> 80
> >>> sys.getsizeof('aa316b86-8169-11e6-bab9-0050563e2d7c')
> 85
> >>> sys.getsizeof({})
> 240
> 
> (Bear in mind that sys.getsizeof counts only the object itself, not
> the things it references - that's why the tuple can take up less space
> than one of its members.)

Thanks for this clarification!
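
One thing that follows from it: the 50331744 bytes reported above is only
the dict's own hash table, about 53 bytes per entry, with none of the keys
or values counted. To get a rough idea of the real total, something like
this recursive variant of sys.getsizeof should work (deep_getsizeof is my
own name for it, and it only handles the container types that occur in
this structure):

import sys

def deep_getsizeof(obj, seen=None):
    # Estimate the total size of obj, following references.
    # Only covers the containers used here (dict, set, list,
    # tuple); a general version would need more cases.
    if seen is None:
        seen = set()          # avoid counting shared objects twice
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

Applied to a single entry like the example above, this should come out in
the low kilobytes, which fits the ~1KB-per-entry estimate below.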

> 
> I don't think your collections can average less than about 1KB (even
> the textual representation of your example data is about that big),
> and you have a million of them. That's a gigabyte of memory, right
> there. Your peak memory usage is showing 3GB, so most likely, my
> conservative estimates have put an absolute lower bound on this. Try
> doing everything exactly the same as you did, only without actually
> loading the pickle - then see what memory usage is. I think you'll
> find that the usage is fully legitimate.
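
That's a fair test. For comparing the two runs, something like this with
the stdlib tracemalloc module should show how much of the footprint the
unpickled structure itself accounts for (the file name is made up):

import pickle
import tracemalloc

tracemalloc.start()
with open("superdict.pickle", "rb") as f:   # hypothetical file name
    superdict = pickle.load(f)
current, peak = tracemalloc.get_traced_memory()
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()

Running it once as-is and once with the loading commented out should make
the difference obvious.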
> 
> > Thanks for any advice to save memory.
> 
> Use a database. I suggest PostgreSQL. You won't have to load
> everything into memory all at once that way, and (bonus!) you can even
> update stuff on disk without rewriting everything.

Yes, it seems I have no way to avoid that, especially because the example dict is no smaller than the real data will be. I'm facing a trade-off between performance and scalability: the dict construction should be as fast as possible, and the extra reads and writes of a database (I'm using MongoDB) are a performance drawback.
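
For the record, a minimal sketch of what that could look like on disk,
using the stdlib sqlite3 module as a stand-in for a real PostgreSQL
setup (table and column names are only illustrative):

import sqlite3

conn = sqlite3.connect("superdict.db")      # stand-in for PostgreSQL
conn.execute("""
    CREATE TABLE IF NOT EXISTS entries (
        key    TEXT PRIMARY KEY,  -- e.g. '0a0f7a3a0e09826caef1bff707785662'
        dt     TEXT,              -- ISO timestamp
        nids   INTEGER,
        ntypes INTEGER
    )""")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cat (
        key  TEXT,                -- refers back to entries.key
        code TEXT,                -- e.g. 'ABC'
        id   TEXT,                -- the UUID
        n    TEXT,
        t    TEXT
    )""")
conn.execute("CREATE INDEX IF NOT EXISTS cat_key ON cat (key)")
conn.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
             ("0a0f7a3a0e09826caef1bff707785662",
              "2016-09-11T15:15:54.343000", 8, 2))
conn.commit()

With the index in place, updating one entry touches a few rows instead of
rewriting the whole structure.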
Christian

> ChrisA



