[Python-Dev] Investigating Python memory footprint of one real Web application

Antoine Pitrou solipsis at pitrou.net
Fri Jan 20 06:43:33 EST 2017


On Fri, 20 Jan 2017 19:49:01 +0900
INADA Naoki <songofacandy at gmail.com> wrote:
> 
> Report is here
> https://gist.github.com/methane/ce723adb9a4d32d32dc7525b738d3c31

"this script counts static memory usage. It doesn’t care about dynamic
memory usage of processing real request"

You may be trying to optimize something which is only a very small
fraction of your actual memory footprint.  That said, the marshal
module could certainly try to intern some tuples and other immutable
structures.

> * Most large strings are docstrings.  Is it worth adding an option to
> trim docstrings, without disabling asserts?

Perhaps docstrings could be compressed and then lazily decompressed
when they are first accessed.  lz4 and zstd are good modern candidates
for that.  zstd also has a dictionary mode that helps with small data
(*).  See https://facebook.github.io/zstd/

(*) Even a 200-byte docstring can be compressed this way:

>>> import os
>>> import lz4                  # third-party lz4 bindings (old-style lz4.compress API)
>>> import zstandard as zstd    # assuming the python-zstandard package
>>> data = os.times.__doc__.encode()
>>> len(data)
211
>>> len(lz4.compress(data))
200
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
156
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
104

`dict_data` here is some 16KB dictionary I've trained on some Python
docstrings.  That 16KB dictionary could be computed while building
Python (or hand-generated from time to time, since it's unlikely to
change a lot) and put in a static array somewhere:

>>> import sys
>>> samples = [(mod.__doc__ or '').encode() for mod in sys.modules.values()]
>>> sum(map(len, samples))
258113
>>> dict_data = zstd.train_dictionary(16384, samples)
>>> len(dict_data.as_bytes())
16384
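
(The same dictionary then has to be available at decompression time,
e.g. by shipping it in that same static array.  Assuming a reasonably
recent python-zstandard, the round trip would look something like:)

>>> compressed = c.compress(data)
>>> d = zstd.ZstdDecompressor(dict_data=dict_data)
>>> d.decompress(compressed, max_output_size=len(data)) == data
True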

Of course, compression is much more efficient on larger docstrings:

>>> import numpy as np
>>> data = np.__doc__.encode()
>>> len(data)
3140
>>> len(lz4.compress(data))
2271
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
1539
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
1348

>>> import pdb
>>> data = pdb.__doc__.encode()
>>> len(data)
12018
>>> len(lz4.compress(data))
6592
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
4502
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
4128
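
As a rough sketch of the "lazily decompressed on first access" part
(using zlib here only because it is in the stdlib; lz4 or zstd with a
shared dictionary would be drop-in replacements, and the real thing
would of course live at the C level):

    import zlib

    class CompressedDoc:
        """Hold a docstring compressed; decompress it on first access only."""

        __slots__ = ('_compressed', '_text')

        def __init__(self, text):
            self._compressed = zlib.compress(text.encode('utf-8'))
            self._text = None

        @property
        def text(self):
            if self._text is None:
                # First access: pay the decompression cost once, then cache.
                self._text = zlib.decompress(self._compressed).decode('utf-8')
            return self._text

Code that never looks at a module's docstrings would then only pay for
the compressed bytes.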


A similar strategy may be used for annotations and other
rarely-accessed metadata.

Another possibility, but probably much more costly in terms of initial
development and maintenance, is to put the docstrings (+ annotations,
etc.) in a separate file that's lazily read.
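
(A sketch of that, with a file name and keying scheme invented purely
for illustration: the compiler would dump docstrings into a side
database next to the .pyc, and __doc__ lookup would fetch from it on
demand:)

    import shelve

    def load_docstring(module_name, qualname,
                       path='__pycache__/docstrings.db'):
        # Read-only, opened lazily; raises if the side file is missing.
        with shelve.open(path, flag='r') as db:
            return db.get('%s.%s' % (module_name, qualname))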

I think optimizing the footprint for everyone is much better than
adding command-line options to disable some specific metadata.

Regards

Antoine.



