[Python-ideas] Move optional data out of pyc files

Thu Apr 12 14:32:05 EDT 2018

I've been playing a bit with this trying to collect some data and measure
how useful this would be. You can take a look at the script I'm using at:
https://github.com/dmoisset/pycstats

What I'm measuring is:
1. Number of objects in the pyc, and how many of those are:
   * docstrings (I'm using a heuristic here that I'm not 100% sure is
correct)
   * lnotabs
   * Duplicate objects; these have not been discussed in this thread before
but are another source of optimization I noticed while writing this.
Essentially I'm referring to immutable constants that are instantiated more
than once and could be shared. You can also measure the effect of this
optimization across modules and within a single module.[1]
2. Bytes used in memory by the categories above (sum of sys.getsizeof() for
each category).
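
To give an idea of the kind of walk involved, here is a minimal sketch (not
the actual pycstats code). It compiles a snippet instead of reading a pyc,
and the docstring heuristic ("first constant of a function-like code object
is a str") is my assumption for illustration; it misfires on things like
lambdas that return a string literal. The names iter_code_objects,
line_table, guess_docstring and measure are made up for this example.

```python
import inspect
import sys
import types


def iter_code_objects(code):
    """Yield `code` and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def line_table(code):
    """Return the line number table (co_linetable on 3.10+, else co_lnotab)."""
    table = getattr(code, "co_linetable", None)
    return table if table is not None else code.co_lnotab


def guess_docstring(code):
    """Heuristic (an assumption): for function-like code objects, treat the
    first constant as the docstring when it is a str."""
    consts = code.co_consts
    is_function_like = bool(code.co_flags & inspect.CO_OPTIMIZED)
    if is_function_like and consts and isinstance(consts[0], str):
        return consts[0]
    return None


def measure(code):
    """Count docstrings and line tables and sum their shallow sizes."""
    stats = {
        "code_objects": 0,
        "docstrings": 0,
        "docstring_bytes": 0,
        "line_tables": 0,
        "line_table_bytes": 0,
    }
    for co in iter_code_objects(code):
        stats["code_objects"] += 1
        doc = guess_docstring(co)
        if doc is not None:
            stats["docstrings"] += 1
            stats["docstring_bytes"] += sys.getsizeof(doc)
        stats["line_tables"] += 1
        stats["line_table_bytes"] += sys.getsizeof(line_table(co))
    return stats


SOURCE = '''
def f(x):
    "Add one to x."
    return x + 1

def g(x):
    return x * 2
'''

stats = measure(compile(SOURCE, "<example>", "exec"))
print(stats)
```

Note that sys.getsizeof() is shallow, so these sums only cover the objects
themselves, which matches what I describe in point 2.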

I'm not measuring anything related to annotations because, as I mentioned
before, they are generated piecemeal by executable bytecode so they are
hard to separate.

Running this on my python 3.6 pyc cache I get:

$ find /usr/lib/python3.6 -name '*.pyc' |xargs python3.6 pycstats.py
8645 docstrings, 1705441B
19060 lineno tables, 941702B
59382/202898 duplicate objects for 3101287/18582807 memory size

So around 10% of the memory used after loading goes to docstrings, ~5% to
lnotabs, and ~15% to objects that could be shared. The sharing figure
assumes we can share between modules, but even sharing only within
modules gets you to ~7%.

In short, this could mean a 25%-35% reduction in memory use for code
objects if the stdlib is a good benchmark.
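
For anyone who wants to redo the arithmetic, this is just the ratio of each
category to the total, using the numbers from the run above. The combined
figure assumes the three categories don't overlap (for example, a docstring
that is also a duplicate would be counted twice), which I haven't verified.

```python
# Figures copied from the pycstats run on the Python 3.6 stdlib cache.
TOTAL_BYTES = 18582807
TOTAL_OBJECTS = 202898
DUPLICATE_OBJECTS = 59382

sizes = {
    "docstrings": 1705441,
    "lineno tables": 941702,
    "duplicate objects": 3101287,
}

# Share of total memory for each category, in percent.
shares = {name: 100 * size / TOTAL_BYTES for name, size in sizes.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of memory")

# Upper estimate when sharing across modules, assuming no overlap
# between the categories (an assumption, see above).
combined = sum(shares.values())
print(f"combined: {combined:.1f}% of memory")

# Share of objects (by count, not bytes) that are duplicates.
duplicate_object_share = 100 * DUPLICATE_OBJECTS / TOTAL_OBJECTS
print(f"duplicates: {duplicate_object_share:.1f}% of objects")
```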

Best,
D.

[1] Regarding duplicates, I've found some unexpected things within loaded
code objects, for example instances of the small integer "1" with different
id() than the singleton that cpython normally uses for "1", although most
duplicates are some small strings, tuples with argument names, or .
Something that could be interesting to write is a "pyc optimizer" that
removes duplicates; this should be a gain at a minimal preprocessing cost.
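
A rough sketch of how such duplicates can be detected (again, not the
pycstats code): treat two constants as duplicates when they have the same
type and repr() but a different id(). The helper names iter_constants and
find_duplicates are made up, it only looks at top-level entries of co_consts
(not inside tuples), and sizes are shallow sys.getsizeof() values.

```python
import sys
import types


def iter_constants(code):
    """Yield every non-code constant reachable from `code`, recursing into
    nested code objects (but not into tuples)."""
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_constants(const)
        else:
            yield const


def find_duplicates(objects):
    """Return (count, bytes) of objects that are equal-looking copies of an
    earlier object but are not the same object."""
    objects = list(objects)  # keep everything alive so id() values stay unique
    seen_ids = set()
    seen_keys = set()
    dup_count = 0
    dup_bytes = 0
    for obj in objects:
        if id(obj) in seen_ids:
            continue  # the very same object again: already shared
        seen_ids.add(id(obj))
        # type + repr keeps 1, 1.0 and True apart, unlike plain equality.
        key = (type(obj), repr(obj))
        if key in seen_keys:
            dup_count += 1
            dup_bytes += sys.getsizeof(obj)
        else:
            seen_keys.add(key)
    return dup_count, dup_bytes


# Two separately compiled "modules" with the same constant.
mod_a = compile("x = ('alpha', 'beta', 1)", "<a>", "exec")
mod_b = compile("y = ('alpha', 'beta', 1)", "<b>", "exec")
constants = list(iter_constants(mod_a)) + list(iter_constants(mod_b))
count, size = find_duplicates(constants)
print(count, "duplicate objects,", size, "bytes")
```

How many duplicates this reports depends on the interpreter version, since
the compiler and the interning rules for constants have changed over time.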

On 12 April 2018 at 15:16, Daniel Moisset <dmoisset at machinalis.com> wrote:

> One implementation difficulty specifically related to annotations is that
> they are quite hard to find/extract from the code objects. Both docstrings
> and lnotab are within specific fields of the code object for their
> function/class/module; annotations are spread as individual constants
> (assuming PEP 563), which are loaded in bytecode through separate
> LOAD_CONST instructions before creating the function object, and that can
> happen in the middle of bytecode for the higher level object (the module or
> class containing a function definition). So the change for achieving that
> will be more significant than just "add a couple of descriptors to function
> objects and change the module marshalling code".
>
> Making annotations fit a single structure that can live in co_consts would
> probably make this change easier, and also make startup of annotated
> modules faster (because you just load a single constant instead of one per
> argument). This might be a valuable change by itself.
>
>
>
> On 12 April 2018 at 11:48, INADA Naoki <songofacandy at gmail.com> wrote:
>
>> > Finally, loading docstrings and other optional components can be made lazy.
>> > This was not in my original idea, and this will significantly complicate the
>> > implementation, but in principle it is possible. This will require larger
>> > changes in the marshal format and bytecode.
>>
>> I'm +1 on this idea.
>>
>> * New pyc format has a code section (same as the current one) and a text
>> section. The text section stores UTF-8 strings and is not loaded at import time.
>> * Function annotations (only when PEP 563 is used) and docstrings are
>> stored as integers pointing to offsets in the text section.
>> * When type.__doc__, PyFunction.__doc__, PyFunction.__annotations__ are
>> integers, the text is loaded from the text section lazily.
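
To make the lazy text section idea above a bit more concrete, here is a
minimal pure-Python sketch. Everything in it is hypothetical: the
length-prefixed layout, the TextSection and LazyText names, and the
FakeFunction stand-in are made up for illustration. A real implementation
would live in the marshal code and in C-level descriptors.

```python
import struct


class TextSection:
    """Blob of length-prefixed UTF-8 strings (hypothetical layout)."""

    def __init__(self):
        self._blob = bytearray()
        self.loads = 0  # how many times text was actually decoded

    def add(self, text):
        """Append `text` and return the offset where it was stored."""
        offset = len(self._blob)
        data = text.encode("utf-8")
        # 4-byte little-endian length, then the encoded text.
        self._blob += struct.pack("<I", len(data)) + data
        return offset

    def load(self, offset):
        """Decode the string stored at `offset`."""
        (length,) = struct.unpack_from("<I", self._blob, offset)
        start = offset + 4
        self.loads += 1
        return self._blob[start:start + length].decode("utf-8")


class LazyText:
    """Descriptor: while the stored value is an int, treat it as an offset,
    load the text on first access and cache it."""

    def __init__(self, name):
        self._slot = "_lazy_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = getattr(obj, self._slot)
        if isinstance(value, int):
            value = obj.text_section.load(value)
            setattr(obj, self._slot, value)
        return value


class FakeFunction:
    """Stand-in for a function object that carries a docstring offset."""

    doc = LazyText("doc")

    def __init__(self, text_section, doc_offset):
        self.text_section = text_section
        self._lazy_doc = doc_offset


section = TextSection()
offset = section.add("Return the sum of a and b.")
func = FakeFunction(section, offset)
print(section.loads)  # nothing decoded yet
print(func.doc)       # decoded on first access
```

The point of the design is that code which never looks at docstrings never
pays for decoding them; only an integer per function stays in memory.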
>>
>> PEP 563 will reduce some startup time, but __annotations__ is still
>> a dict.  Memory overhead is negligible.
>>
>> In [1]: def foo(a: int, b: int) -> int:
>>    ...:     return a + b
>>    ...:
>>    ...:
>>
>> In [2]: import sys
>> In [3]: sys.getsizeof(foo)
>> Out[3]: 136
>>
>> In [4]: sys.getsizeof(foo.__annotations__)
>> Out[4]: 240
>>
>> When PEP 563 is used, there are no side effects while building the
>> annotation.
>> So the annotation can be serialized in text, like
>> {"a":"int","b":"int","return":"int"}.
>>
>> This change will require a new pyc format, and descriptors for
>> PyFunction.__doc__, PyFunction.__annotations__
>> and type.__doc__.
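
As a quick check of the "serialize annotations as text" point: with PEP 563
enabled (Python 3.7+), the annotation values are already plain strings, so
the dict round-trips through JSON. Using json here is only my choice for the
illustration; the proposal does not say which text encoding would be used.

```python
import json

# PEP 563 is enabled per module through the __future__ import.
SOURCE = '''
from __future__ import annotations

def foo(a: int, b: int) -> int:
    return a + b
'''

namespace = {}
exec(compile(SOURCE, "<example>", "exec"), namespace)
foo = namespace["foo"]

# With PEP 563 the values are the source text of each annotation.
annotations = foo.__annotations__

# Compact separators give the same shape as the example in the message above.
serialized = json.dumps(annotations, separators=(",", ":"))
restored = json.loads(serialized)
print(serialized)
```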
>>
>> Regards,
>>
>> --
>> INADA Naoki  <songofacandy at gmail.com>
>>
>
>
>
> --
> Daniel F. Moisset - UK Country Manager - Machinalis Limited
> www.machinalis.co.uk <http://www.machinalis.com>
> Skype: @dmoisset T: + 44 7398 827139
>
> 1 Fore St, London, EC2Y 9DT
>
> Machinalis Limited is a company registered in England and Wales.
> Registered number: 10574987.
>
