[Python-ideas] Move optional data out of pyc files

Thu Apr 12 15:46:34 EDT 2018

I think moving data out of pyc files is going in the wrong direction:
more stat calls mean slower imports and slower startup.

Trying to make pycs smaller also isn't really worth it (they
compress quite well).

Memory could be saved by reading these objects lazily from the
file instead of loading them at import time - without removing
anything from the pyc file. Whether the few hundred kB of RAM this
saves are worth the effort depends on the application space.

This leaves the proposal to restructure pyc files into a sectioned
and possibly indexed file, to make access to (lazily) loaded
parts faster. More structure would make it easier to update the
content going forward (similar to how PE executable files
are structured) and would allow us to get rid of extra pyc file
variants (e.g. for special optimized versions). So that's an
interesting approach :-)
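
To make the sectioned/indexed idea more concrete, here is a minimal
sketch of such a container. Everything in it is an assumption made for
illustration: the "SPYC" magic, the index layout, the section names and
the SectionedFile class are hypothetical and not an existing or proposed
CPython format. It only shows that an index lets a reader fetch one
section without touching the others.

```python
import io
import marshal
import struct

# Hypothetical layout (not an existing CPython format):
#   magic (4 bytes) | section count (uint32) | index entries | payloads
# Each index entry: 8-byte NUL-padded name | offset (uint32) | size (uint32)
MAGIC = b"SPYC"
HEADER = struct.Struct("<4sI")
ENTRY = struct.Struct("<8sII")


def write_sections(sections):
    """Serialize a {name: bytes} mapping into one indexed blob."""
    offset = HEADER.size + ENTRY.size * len(sections)
    index, payloads = [], []
    for name, data in sections.items():
        index.append(ENTRY.pack(name.encode("ascii"), offset, len(data)))
        payloads.append(data)
        offset += len(data)
    return HEADER.pack(MAGIC, len(sections)) + b"".join(index) + b"".join(payloads)


class SectionedFile:
    """Reads only the index up front; payloads are read on demand."""

    def __init__(self, fileobj):
        self._file = fileobj
        magic, count = HEADER.unpack(fileobj.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("not a sectioned file")
        self._index = {}
        for _ in range(count):
            name, offset, size = ENTRY.unpack(fileobj.read(ENTRY.size))
            self._index[name.rstrip(b"\0").decode("ascii")] = (offset, size)

    def names(self):
        return list(self._index)

    def read(self, name):
        offset, size = self._index[name]
        self._file.seek(offset)
        return self._file.read(size)


# Demo: the code section is needed at import time, the text section is not.
# (This sketch does not strip the docstring from the marshalled code.)
code = compile("def f():\n    'doc for f'\n    return 1\n", "<demo>", "exec")
blob = write_sections({
    "code": marshal.dumps(code),
    "text": "doc for f".encode("utf-8"),
})
sf = SectionedFile(io.BytesIO(blob))
loaded = marshal.loads(sf.read("code"))
doc = sf.read("text").decode("utf-8")
```

New section types could be added later without changing the reader,
since unknown names in the index are simply never requested.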

BTW: In all this, please remember that quite a few applications
do use doc strings as part of the code, not only for documentation.
Most prominent are probably parsers which keep the parsing
definitions in doc strings.
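
As an illustration of docstrings being used as code, here is a small
sketch of a tokenizer that takes its patterns from rule docstrings. It
is loosely modelled on the style of parser generators such as PLY but
does not use any real library API; build_lexer, tokenize and the rule
names are made up for this example.

```python
import re


def t_NUMBER(text):
    r"\d+"
    return int(text)


def t_NAME(text):
    r"[A-Za-z_]\w*"
    return text


def build_lexer(*rules):
    """Compile each rule's docstring as its regular expression."""
    compiled = []
    for rule in rules:
        if rule.__doc__ is None:
            # This is the situation under -OO, or if docstrings were
            # dropped from the pyc file.
            raise RuntimeError(f"{rule.__name__} has no docstring to use as a pattern")
        compiled.append((re.compile(rule.__doc__), rule))
    return compiled


def tokenize(lexer, source):
    """Split source into tokens using the first rule that matches."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for pattern, rule in lexer:
            match = pattern.match(source, pos)
            if match:
                tokens.append(rule(match.group()))
                pos = match.end()
                break
        else:
            raise SyntaxError(f"unexpected character at position {pos}")
    return tokens


lexer = build_lexer(t_NUMBER, t_NAME)
tokens = tokenize(lexer, "x 42 y_1")
```

If the docstrings are not available at runtime, build_lexer() fails
before any input is parsed, which is why such code needs docstrings to
stay loadable.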

On 12.04.2018 20:32, Daniel Moisset wrote:
> I've been playing a bit with this trying to collect some data and
> measure how useful this would be. You can take a look at the script I'm
> using at: https://github.com/dmoisset/pycstats 
> 
> What I'm measuring is:
> 1. Number of objects in the pyc, and how many of those are:
>    * docstrings (I'm using a heuristic here which I'm not 100% sure
> is correct)
>    * lnotabs
>    * Duplicate objects; these have not been discussed in this thread
> before but are another source of optimization I noticed while writing
> this. Essentially I'm referring to immutable constants that are instantiated
> more than once and could be shared. You can also measure the effect of
> this optimization across modules and within a single module[1]
> 2. Bytes used in memory by the categories above (sum of sys.getsizeof()
> for each category).
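
The measurement described above can be sketched roughly as follows.
This is not the pycstats script linked above; it is a minimal
illustration under stated assumptions: the docstring heuristic (a
leading string constant), the set of "shareable" types and all function
names are made up for this example.

```python
import sys
import types

# Assumption: only these immutable types are considered for sharing.
SHAREABLE = (str, bytes, int, float, tuple)


def walk_code(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const)


def line_table(code):
    # Newer CPython versions expose co_linetable; older ones only co_lnotab.
    table = getattr(code, "co_linetable", None)
    return code.co_lnotab if table is None else table


def guess_docstring(code):
    """Heuristic: treat a leading string constant as the docstring."""
    consts = code.co_consts
    return consts[0] if consts and isinstance(consts[0], str) else None


def duplicate_stats(objects):
    """Count equal-valued objects that exist as more than one instance."""
    by_identity = {id(obj): obj for obj in objects}
    seen, count, size = set(), 0, 0
    for obj in by_identity.values():
        key = (type(obj), obj)
        if key in seen:
            count += 1
            size += sys.getsizeof(obj)
        else:
            seen.add(key)
    return count, size


def pyc_stats(code):
    """Return (count, bytes) per category for one module code object."""
    docs, tables, objects = [], [], []
    for co in walk_code(code):
        doc = guess_docstring(co)
        if doc is not None:
            docs.append(doc)
        tables.append(line_table(co))
        objects.extend(c for c in co.co_consts if isinstance(c, SHAREABLE))
        objects.extend((co.co_names, co.co_varnames))
    return {
        "docstrings": (len(docs), sum(map(sys.getsizeof, docs))),
        "line tables": (len(tables), sum(map(sys.getsizeof, tables))),
        "duplicates": duplicate_stats(objects),
    }


SOURCE = '''
def f(a, b):
    "Add two numbers."
    return a + b

def g(a, b):
    return a - b
'''
stats = pyc_stats(compile(SOURCE, "<demo>", "exec"))
```

The heuristic will misclassify some code objects (for example a module
whose first constant happens to be a string), and the duplicate count
for a given module depends on how much the compiler already shares.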
> 
> I'm not measuring anything related to annotations because, as I
> mentioned before, they are generated piecemeal by executable bytecode so
> they are hard to separate
> 
> Running this on my python 3.6 pyc cache I get:
> 
> $ find /usr/lib/python3.6 -name '*.pyc' |xargs python3.6 pycstats.py 
> 8645 docstrings, 1705441B
> 19060 lineno tables, 941702B
> 59382/202898 duplicate objects for 3101287/18582807 memory size
> 
> So this means ~10% of the memory used after loading is used for
> docstrings, ~5% for lnotabs, and ~15% for objects that could be shared.
> The sharing assumes we can share between modules, but even doing it
> within modules, you can get to ~7%. 
> 
> In short, this could mean a 25%-35% reduction in memory use for code
> objects if the stdlib is a good benchmark.
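
For reference, the percentages can be recomputed from the byte counts
reported above. This sketch assumes that 18582807 is the total size of
all measured objects and that the three categories do not overlap;
neither is stated explicitly in the output.

```python
# Byte counts copied from the pycstats output quoted above.
total_bytes = 18582807
reported = {
    "docstrings": 1705441,
    "lineno tables": 941702,
    "duplicate objects": 3101287,
}

# Share of the total for each category, in percent.
shares = {name: 100 * size / total_bytes for name, size in reported.items()}
combined = sum(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"combined: {combined:.1f}%")
```

The individual shares come out at roughly 9%, 5% and 17%, and the
combined figure at about 31%, which sits inside the 25%-35% range
quoted above.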
> 
> Best,
> D.
> 
> [1] Regarding duplicates, I've found some unexpected things within
> loaded code objects, for example instances of the small integer "1" with
> different id() than the singleton that cpython normally uses for "1",
> although most duplicates are some small strings, tuples with argument
> names, or . Something that could be interesting to write is a "pyc
> optimizer" that removes duplicates; this should be a gain at a minimal
> preprocessing cost.
> 
> 
> On 12 April 2018 at 15:16, Daniel Moisset <dmoisset at machinalis.com
> <mailto:dmoisset at machinalis.com>> wrote:
> 
>     One implementation difficulty specifically related to annotations
>     is that they are quite hard to find/extract from the code objects.
>     Both docstrings and lnotab are within specific fields of the code
>     object for their function/class/module; annotations are spread as
>     individual constants (assuming PEP 563), which are loaded in
>     bytecode through separate LOAD_CONST instructions before creating the
>     function object, and that can happen in the middle of bytecode for
>     the higher level object (the module or class containing a function
>     definition). So the change for achieving that will be more
>     significant than just "add a couple of descriptors to function
>     objects and change the module marshalling code".
> 
>     Probably making annotations fit a single structure that can live in
>     co_consts could make this change easier, and also make startup of
>     annotated modules faster (because you just load a single constant
>     instead of one per argument). This might be a valuable change by itself.
> 
> 
> 
>     On 12 April 2018 at 11:48, INADA Naoki <songofacandy at gmail.com
>     <mailto:songofacandy at gmail.com>> wrote:
> 
>         > Finally, loading docstrings and other optional components can be made lazy.
>         > This was not in my original idea, and this will significantly complicate the
>         > implementation, but in principle it is possible. This will require larger
>         > changes in the marshal format and bytecode.
> 
>         I'm +1 on this idea.
> 
>         * The new pyc format has a code section (same as the current one)
>         and a text section. The text section stores UTF-8 strings and is
>         not loaded at import time.
>         * Function annotations (only when PEP 563 is used) and docstrings
>         are stored as integers that point to offsets in the text section.
>         * When type.__doc__, PyFunction.__doc__, or
>         PyFunction.__annotations__ are integers, the text is loaded
>         lazily from the text section.
> 
>         PEP 563 will reduce some startup time, but __annotations__ is still
>         a dict.  Memory overhead is negligible.
> 
>         In [1]: def foo(a: int, b: int) -> int:
>            ...:     return a + b
>            ...:
>            ...:
> 
>         In [2]: import sys
>         In [3]: sys.getsizeof(foo)
>         Out[3]: 136
> 
>         In [4]: sys.getsizeof(foo.__annotations__)
>         Out[4]: 240
> 
>         When PEP 563 is used, there are no side effects while building
>         the annotation.
>         So the annotation can be serialized in text, like
>         {"a":"int","b":"int","return":"int"}.
> 
>         This change will require a new pyc format, and descriptors for
>         PyFunction.__doc__, PyFunction.__annotations__
>         and type.__doc__.
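
The integer-offset idea above can be sketched in pure Python with a
descriptor. This is only an illustration under assumptions: TextSection,
LazyText and FunctionStub are hypothetical names, the length-prefixed
layout is invented for the example, and a real implementation would
live in C on the function and type objects.

```python
import json
import struct


class TextSection:
    """Length-prefixed UTF-8 strings addressed by byte offset."""

    def __init__(self):
        self.data = bytearray()
        self.reads = 0  # counts how often text was actually decoded

    def add(self, text):
        offset = len(self.data)
        raw = text.encode("utf-8")
        self.data += struct.pack("<I", len(raw)) + raw
        return offset

    def read(self, offset):
        self.reads += 1
        (size,) = struct.unpack_from("<I", self.data, offset)
        start = offset + 4
        return bytes(self.data[start:start + size]).decode("utf-8")


class LazyText:
    """Descriptor that resolves an integer offset into text on first access."""

    def __init__(self, decode=lambda text: text):
        self._decode = decode

    def __set_name__(self, owner, name):
        self._slot = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = getattr(obj, self._slot)
        if isinstance(value, int):
            # Still an offset: load the text now and cache the result.
            value = self._decode(obj.text_section.read(value))
            setattr(obj, self._slot, value)
        return value

    def __set__(self, obj, value):
        setattr(obj, self._slot, value)


class FunctionStub:
    """Stand-in for a function object with lazily loaded metadata."""

    doc = LazyText()
    annotations = LazyText(decode=json.loads)

    def __init__(self, text_section, doc_offset, annotations_offset):
        self.text_section = text_section
        self.doc = doc_offset
        self.annotations = annotations_offset


section = TextSection()
doc_at = section.add("Add two integers.")
ann_at = section.add('{"a":"int","b":"int","return":"int"}')
foo = FunctionStub(section, doc_at, ann_at)
```

Until foo.doc or foo.annotations is first accessed, only two integers
are held per function; the dict is built once on first access and
cached afterwards.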
> 
>         Regards,
> 
>         -- 
>         INADA Naoki  <songofacandy at gmail.com
>         <mailto:songofacandy at gmail.com>>
>         _______________________________________________
>         Python-ideas mailing list
>         Python-ideas at python.org <mailto:Python-ideas at python.org>
>         https://mail.python.org/mailman/listinfo/python-ideas
>         <https://mail.python.org/mailman/listinfo/python-ideas>
>         Code of Conduct: http://python.org/psf/codeofconduct/
>         <http://python.org/psf/codeofconduct/>
> 
> 
> 
> 
>     -- 
>     Daniel F. Moisset - UK Country Manager - Machinalis Limited
>     www.machinalis.co.uk <http://www.machinalis.com>
>     Skype: @dmoisset T: + 44 7398 827139
> 
>     1 Fore St, London, EC2Y 9DT
> 
>     Machinalis Limited is a company registered in England and Wales.
>     Registered number: 10574987.
> 

-- 
Marc-Andre Lemburg
eGenix.com
