[Python-Dev] Python startup time

Nick Coghlan ncoghlan at gmail.com
Fri Jul 21 02:23:58 EDT 2017


On 21 July 2017 at 15:30, Cesare Di Mauro <cesare.di.mauro at gmail.com> wrote:

>
>
> 2017-07-21 4:52 GMT+02:00 Nick Coghlan <ncoghlan at gmail.com>:
>
>> On 21 July 2017 at 12:44, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> > We can separately measure the cost of unmarshalling the code object:
>> >
>> > $ python3 -m perf timeit \
>> >     -s "import typing; from marshal import loads; from importlib.util import cache_from_source; cache = cache_from_source(typing.__file__); data = open(cache, 'rb').read()[12:]" \
>> >     "loads(data)"
>> > .....................
>> > Mean +- std dev: 286 us +- 4 us
>>
>> Slight adjustment here, as the cost of locating the cached bytecode
>> and reading it from disk should really be accounted for in each
>> iteration:
>>
>> $ python3 -m perf timeit \
>>     -s "import typing; from marshal import loads; from importlib.util import cache_from_source" \
>>     "cache = cache_from_source(typing.__spec__.origin); data = open(cache, 'rb').read()[12:]; loads(data)"
>> .....................
>> Mean +- std dev: 337 us +- 8 us
>>
>> That will have a bigger impact when loading from spinning disk or a
>> network drive, but it's fairly negligible when loading from a local
>> SSD or an already primed filesystem cache.
>>
>> Cheers,
>> Nick.
>>
>> --
>> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>>
> Thanks for your tests, Nick. It's quite evident that optimising the
> marshal code cannot improve the situation much, so I withdraw my proposal.
>

It was still a good suggestion, since it made me realise I *hadn't*
actually measured the relative timings lately, so it was technically an
untested assumption that module-level code execution still dominated the
overall import time.

typing is also a particularly large and complex module; for simpler modules
like abc, bytecode unmarshalling represents a larger fraction of the
overall import time:

$ python3 -m perf timeit -s "import abc; from marshal import loads; from
importlib.util import cache_from_source" "cache =
cache_from_source(abc.__spec__.origin); data = open(cache,
'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 45.2 us +- 1.1 us

$ python3 -m perf timeit -s "import abc; loader_exec =
abc.__spec__.loader.exec_module" "loader_exec(abc)"
.....................
Mean +- std dev: 172 us +- 5 us

$ python3 -m perf timeit -s "import abc; from importlib import reload"
"reload(abc)"
.....................
Mean +- std dev: 280 us +- 14 us

And _weakrefset:

$ python3 -m perf timeit -s "import _weakrefset; from marshal import loads;
from importlib.util import cache_from_source" "cache =
cache_from_source(_weakrefset.__spec__.origin); data = open(cache,
'rb').read()[12:]; loads(data)"
.....................
Mean +- std dev: 57.7 us +- 1.3 us

$ python3 -m perf timeit -s "import _weakrefset; loader_exec =
_weakrefset.__spec__.loader.exec_module" "loader_exec(_weakrefset)"
.....................
Mean +- std dev: 129 us +- 6 us

$ python3 -m perf timeit -s "import _weakrefset; from importlib import
reload" "reload(_weakrefset)"
.....................
Mean +- std dev: 226 us +- 4 us
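
Since the same three measurements recur for each module, they can also be
bundled into a small driver script. This is just a rough sketch of my own:
it assumes the perf module is installed, and only handles top-level pure
Python modules (the bench helper and script name are illustrative, not
anything from the stdlib):

# bench_import_phases.py - illustrative helper only
# Usage: python3 bench_import_phases.py abc
import subprocess
import sys

def bench(label, setup, stmt):
    print("==", label)
    subprocess.run(
        [sys.executable, "-m", "perf", "timeit", "-s", setup, stmt],
        check=True,
    )

mod = sys.argv[1]  # assumed to be a top-level, pure Python module

# 1. Unmarshal the cached bytecode (includes locating and reading the .pyc)
bench(
    "unmarshal",
    f"import {mod}; from marshal import loads; "
    "from importlib.util import cache_from_source",
    f"cache = cache_from_source({mod}.__spec__.origin); "
    "data = open(cache, 'rb').read()[12:]; loads(data)",
)

# 2. Re-execute the module body against the existing module object
bench(
    "exec_module",
    f"import {mod}; loader_exec = {mod}.__spec__.loader.exec_module",
    f"loader_exec({mod})",
)

# 3. Full reload: spec lookup + unmarshal + execution combined
bench(
    "reload",
    f"import {mod}; from importlib import reload",
    f"reload({mod})",
)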

The conclusion still holds: the absolute numbers here are likely still too
small for the extra complexity of parallelising bytecode loading to pay off
in any significant way, since unmarshalling accounts for only ~45 us of
abc's ~280 us reload time (roughly 16%), and ~58 us of _weakrefset's
~226 us (roughly 26%). But these figures also help us set reasonable
expectations around how much of a gain we're likely to get just from
precompilation with Cython.

That does actually raise a small microbenchmarking problem: for source and
bytecode imports, we can force the import system to genuinely rerun the
module or unmarshal the bytecode inside a single Python process, allowing
perf to measure those steps independently of CPython startup. While I'm
pretty sure it's possible to trick the import machinery into rerunning
module-level init functions even for old-style extension modules (which
would let us run tests similar to those above against a Cython-compiled
module), I don't actually remember how to do it off the top of my head.
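
One possibility that might work (purely an untested sketch, relying on
CPython implementation details: the PyInit_* symbol name is derived from
the part of the module name after the last dot, while the interpreter's
internal cache of single-phase extension modules is keyed on the full
(name, path) pair) is to load the extension under a varying dotted alias,
so the cache never matches and the init function runs again. The helper
name below is mine:

# Untested sketch: force a single-phase ("old-style") extension module's
# init function to run again by loading it under a fresh dotted alias.
import itertools
from importlib.machinery import ExtensionFileLoader
from importlib.util import module_from_spec, spec_from_file_location

_counter = itertools.count()

def load_fresh(base_name, path):
    # e.g. "_fresh0.math" still resolves to the PyInit_math entry point,
    # but misses the per-(name, path) extensions cache
    alias = "_fresh%d.%s" % (next(_counter), base_name)
    loader = ExtensionFileLoader(alias, path)
    spec = spec_from_file_location(alias, path, loader=loader)
    module = module_from_spec(spec)  # PyInit_* runs during module creation
    loader.exec_module(module)       # no-op for single-phase init modules
    return module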

Cheers,
Nick.

P.S. I'll also note that in these cases where the import overhead is
proportionally significant for always-imported modules, we may want to look
at the benefits of freezing them (if they otherwise remain as pure Python
modules), or compiling them as builtin modules (if we switch them over to
Cython), in addition to looking at ways to make the modules themselves
faster. Being built directly into the interpreter binary is pretty much the
best case scenario for reducing import overhead.
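
As a quick way to see which category a given module currently falls into,
something like the following works (a sketch poking at the private _imp
module, so purely illustrative):

import _imp

for name in ("sys", "_weakref", "abc", "_frozen_importlib"):
    if _imp.is_builtin(name):
        kind = "builtin (C code linked into the binary)"
    elif _imp.is_frozen(name):
        kind = "frozen (bytecode embedded in the binary)"
    else:
        kind = "regular (located on disk at import time)"
    print(name, "->", kind)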

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia