[Python-ideas] Move optional data out of pyc files

Chris Angelico rosuav at gmail.com
Wed Apr 11 10:09:38 EDT 2018


On Wed, Apr 11, 2018 at 4:06 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:
>
> [...]
>> > Yes, it will double the number of files. Actually quadruple it, if the
>> > annotations and line numbers are in separate files too. But if most of
>> > those extra files never need to be opened, then there's no cost to them.
>> > And whatever extra cost there is, is amortized over the lifetime of the
>> > interpreter.
>>
>> Yes, if they are actually not needed. My question was about whether
>> that is truly valid.
>
> We're never really going to know the effect on performance without
> implementing and benchmarking the code. It might turn out that, to our
> surprise, three quarters of the std lib relies on loading docstrings
> during startup. But I doubt it.
>
>
>> Consider a very common use-case: an OS-provided
>> Python interpreter whose files are all owned by 'root'. Those will be
>> distributed with .pyc files for performance, but you don't want to
>> deprive the users of help() and anything else that needs docstrings
>> etc. So... are the docstrings lazily loaded or eagerly loaded?
>
> What relevance is that they're owned by root?

Ordinary users can't write to root-owned directories, so the pyc
files can't be created on the fly. Whoever builds the package has to
predict in advance what you'll want to have in them.

>> If eagerly, you've doubled the number of file-open calls to initialize
>> the interpreter.
>
> I do not understand why you think this is even an option. Has Serhiy
> said something that I missed that makes this seem to be on the table?
> That's not a rhetorical question -- I may have missed something. But I'm
> sure he understands that doubling or quadrupling the number of file
> operations during startup is not an optimization.
>
>
>> (Or quadrupled, if you need annotations and line
>> numbers and they're all separate.) If lazily, things are a lot more
>> complicated than the original description suggested, and there'd need
>> to be some semantic changes here.
>
> What semantic change do you expect?
>
> There's an implementation change, of course, but that's Serhiy's problem
> to deal with and I'm sure that he has considered that. There should be
> no semantic change. When you access obj.__doc__, then and only then are
> the compiled docstrings for that module read from the disk.

In other words, attempting to access obj.__doc__ can actually go and
open a file. Does it need to check if the file exists as part of the
import, or does it go back to sys.path? If the former, you're right
back with the eager loading problem of needing to do 2-4 times as many
stat calls; if the latter, it's semantically different in that a
change to sys.path can influence something that normally is preloaded.
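
To make the second case concrete, here's a rough sketch. It is not
anything from Serhiy's proposal: the ".doc" file name, LazyDoc and
find_doc_file are all made up for illustration, and a plain list
stands in for sys.path. The lookup happens at first access, so a path
change made after the "import" decides which file gets read.

```python
import os
import tempfile


def find_doc_file(name, search_path):
    """Hypothetical lookup: first '<name>.doc' found on search_path."""
    for entry in search_path:
        candidate = os.path.join(entry, name + ".doc")
        if os.path.exists(candidate):
            return candidate
    return None


class LazyDoc:
    """Reads a docstring from disk on first access, not at import time."""

    def __init__(self, name, search_path):
        self.name = name
        self.search_path = search_path  # the live list, not a snapshot
        self._text = None

    def get(self):
        if self._text is None:
            path = find_doc_file(self.name, self.search_path)
            if path is None:
                return None
            with open(path, encoding="utf-8") as f:
                self._text = f.read()
        return self._text


def demo_lazy_lookup():
    with tempfile.TemporaryDirectory() as old, \
            tempfile.TemporaryDirectory() as new:
        for directory, text in ((old, "old docs"), (new, "new docs")):
            with open(os.path.join(directory, "spam.doc"), "w",
                      encoding="utf-8") as f:
                f.write(text)
        search_path = [old]                  # stand-in for sys.path
        doc = LazyDoc("spam", search_path)   # "import" happens here
        search_path.insert(0, new)           # path mutated after import
        return doc.get()                     # first access opens a file
```

With an eagerly loaded docstring, the text from the first directory
would have been fixed at import time; here the later directory wins.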

> As for the in-memory data structures of objects themselves, I imagine
> something like the __doc__ and __annotations__ slots pointing to a table
> of strings, which is not initialised until you attempt to read from the
> table. Or something -- don't pay too much attention to my wild guesses.
>
> The bottom line is, is there some reason *aside from performance* to
> avoid this? Because if the performance is worse, I'm sure Serhiy will be
> the first to dump this idea.

Obviously it could be turned into just a performance question, but in
that case everything has to be preloaded, and I doubt there's going to
be any advantage. To be absolutely certain of retaining the existing
semantics, there'd need to be some sort of anchoring to ensure that
*this* .pyc file goes with *that* .pyc_docstrings file. Looking them
up anew will mean that there's every possibility that you get the
wrong file back.
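
One possible form of anchoring, purely as a sketch: the main file
records a digest of its companion, and the lazy loader refuses a
companion that doesn't match. The record layout and the names
make_pair and load_docs are hypothetical, not part of any existing
pyc format.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_pair(code: bytes, docs: bytes):
    """Build a main record that remembers which docs blob belongs to it."""
    main_record = {"code": code, "docs_digest": digest(docs)}
    return main_record, docs


def load_docs(main_record, docs_blob: bytes) -> str:
    """Return the docs text, but only if it is the blob we were built with."""
    if digest(docs_blob) != main_record["docs_digest"]:
        raise ValueError("docstrings file does not match this .pyc")
    return docs_blob.decode("utf-8")
```

That catches a mismatched file, but it can only fail loudly; it can't
get the original docstrings back once the file on disk has changed.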

As a simple example, upgrading your Python installation while you have
a Python script running can give you this effect already. Just import
a few modules, then change everything on disk. If you now import a
module that was already imported, you get it from cache (and the
unmodified version); import something that wasn't imported already,
and it goes to the disk. At the granularity of modules, this is seldom
a problem (I can imagine some package modules getting confused by
this, but otherwise not usually), but if docstrings are looked up
separately - and especially if lnotab is too - you could happily
import and use something (say, in a web server), then run updates, and
then an exception requires you to look up a line number. Oops, a few
lines got inserted into that file, and now all the line numbers are
straight-up wrong. That's a definite behavioural change. Maybe it's
one that's considered acceptable, but it definitely is a change. And
if mutations to sys.path can do this, it's definitely a semantic
change in Python.
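
The module-level half of that can be reproduced today. A small sketch
(the module name spam_demo_mod is invented for the demo): import a
module, rewrite its source on disk, and import it again. The second
import comes from sys.modules, so the running process keeps the old
code while the disk has moved on.

```python
import importlib
import os
import sys
import tempfile


def demo_stale_module():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "spam_demo_mod.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write("VERSION = 1\n")
        sys.path.insert(0, tmp)
        try:
            first = importlib.import_module("spam_demo_mod")
            # Simulate an upgrade: a line is inserted and the value changes.
            with open(path, "w", encoding="utf-8") as f:
                f.write("# inserted line\nVERSION = 2\n")
            second = importlib.import_module("spam_demo_mod")
            with open(path, encoding="utf-8") as f:
                on_disk = f.read()
            return first is second, second.VERSION, on_disk
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("spam_demo_mod", None)
```

Anything fetched lazily from disk after the rewrite (docstrings, line
number tables) would be describing the second version of the file, not
the code object that is actually in memory.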

ChrisA

