[Python-ideas] Move optional data out of pyc files

Chris Angelico rosuav at gmail.com
Thu Apr 12 00:44:40 EDT 2018


On Thu, Apr 12, 2018 at 11:59 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote:
>
> [...]
>> >> Consider a very common use-case: an OS-provided
>> >> Python interpreter whose files are all owned by 'root'. Those will be
>> >> distributed with .pyc files for performance, but you don't want to
>> >> deprive the users of help() and anything else that needs docstrings
>> >> etc. So... are the docstrings lazily loaded or eagerly loaded?
>> >
>> > What relevance is that they're owned by root?
>>
>> You have to predict in advance what you'll want to have in your pyc
>> files. Can't create them on the fly.
>
> How is that different from the situation right now?

If the files aren't owned by root (more specifically, if they're owned
by you, and you can write to the pycache directory), you can do
everything at runtime. Otherwise, you have to do everything at
installation time.
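
That installation-time step is what distros already do today with
the stdlib compileall module. Roughly, with an invented path:

    import compileall

    # Run as root at package-install time, so the .pyc files exist
    # before the tree becomes read-only for ordinary users.
    compileall.compile_dir("/usr/lib/python3/dist-packages", quiet=1)

Any split-out docstring/lnotab files would have to be generated in
that same window, or never.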

>> > What semantic change do you expect?
>> >
>> > There's an implementation change, of course, but that's Serhiy's problem
>> > to deal with and I'm sure that he has considered that. There should be
>> > no semantic change. When you access obj.__doc__, then and only then are
>> > the compiled docstrings for that module read from the disk.
>>
>> In other words, attempting to access obj.__doc__ can actually go and
>> open a file. Does it need to check if the file exists as part of the
>> import, or does it go back to sys.path?
>
> That's implementation, so I don't know, but I imagine that the module
> object will have a link pointing directly to the expected file on disk.
> No need to search the path, you just go directly to the expected file.
> Apart from handling the case when it doesn't exist, in which case the
> docstring or annotations get set to None, it should be relatively
> straightforward.
>
> That link could be an explicit pathname:
>
>     /path/to/__pycache__/foo.cpython-33-doc.pyc
>
> or it could be implicitly built when required from the "master" .pyc
> file's path, since the differences are likely to be deterministic.

Resolving a path name requires a lookup in every directory along it.
Checking whether the file exists then costs, at absolute best, one
more stat call, and that's assuming you already hold an open handle
to the directory.
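
To make the cost concrete, here's roughly what "go directly to the
expected file" has to look like. This is a sketch; the helper and the
naming scheme are invented, cribbed from Steven's example above:

    import os

    def doc_path_for(pyc_path):
        # foo.cpython-33.pyc -> foo.cpython-33-doc.pyc (invented scheme)
        root, ext = os.path.splitext(pyc_path)
        return root + "-doc" + ext

    def load_docs(pyc_path):
        try:
            # open() implies at least one stat, plus a lookup in
            # every directory component of the path.
            with open(doc_path_for(pyc_path), "rb") as f:
                return f.read()
        except OSError:
            return None  # missing file: docstrings become None

And that open() in the happy path is the *cheap* case.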

>> If the former, you're right
>> back with the eager loading problem of needing to do 2-4 times as many
>> stat calls;
>
> Except that's not eager loading. When you open the file on demand, it
> might never be opened at all. If it is opened, it is likely to be a long
> time after interpreter startup.

I have no idea what you mean here. Eager loading is not the same
thing as opening the file on demand, and neither is eager statting.
If you're not going to hold open handles to heaps of directories, you
have to reference everything by path name.

>> > As for the in-memory data structures of objects themselves, I imagine
>> > something like the __doc__ and __annotations__ slots pointing to a table
>> > of strings, which is not initialised until you attempt to read from the
>> > table. Or something -- don't pay too much attention to my wild guesses.
>> >
>> > The bottom line is, is there some reason *aside from performance* to
>> > avoid this? Because if the performance is worse, I'm sure Serhiy will be
>> > the first to dump this idea.
>>
>> Obviously it could be turned into just a performance question, but in
>> that case everything has to be preloaded
>
> You don't need to preload things to get a performance benefit.
> Preloading things that you don't need immediately and may never need at
> all, like docstrings, annotations and line numbers, is inefficient.

Right, and if you DON'T preload everything, you have a potential
semantic difference. Which is exactly what you were asking me, and I
was answering.
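
For the record, here's my reading of that "table of strings" idea as
a sketch. These are my guesses, not Serhiy's design; the file format
and names are invented:

    import marshal

    class LazyDocTable:
        def __init__(self, path):
            self._path = path   # remembered at import time
            self._docs = None   # nothing read from disk yet

        def get(self, name):
            if self._docs is None:
                try:
                    with open(self._path, "rb") as f:
                        # assume a marshalled {name: docstring} dict
                        self._docs = marshal.load(f)
                except OSError:
                    self._docs = {}  # file vanished: every doc is None
            return self._docs.get(name)

The semantic difference lives in that except clause: whether you ever
hit it depends on what happened on disk between import time and the
first attribute access.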

> So let's look at a few common scenarios:
>
>
> 1. You run a script. Let's say that the script ends up loading, directly
> or indirectly, 200 modules, none of which need docstrings or annotations
> during runtime, and the script runs to completion without needing to
> display a traceback. You save loading 200 sets of docstrings,
> annotations and line numbers ("metadata" for brevity) so overall the
> interpreter starts up quicker and the script runs faster.
>
>
> 2. You run the same script, but this time it raises an exception and
> displays a traceback. So now you have to load, let's say, 20 sets of
> line numbers, which is a bit slower, but that doesn't happen until the
> exception is raised and the traceback printed, which is already a slow
> and exceptional case so who cares if it takes an extra few milliseconds?
> It is still an overall win because of the 180 sets of metadata you
> didn't need to load.

Does this loading happen when the exception is constructed or when
it's printed? How much can you do with an exception without triggering
the loading of metadata? Is it now possible for the mere formatting of
a traceback to fail because of disk/network errors?
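
Note that one piece of this is already lazy today: the traceback
machinery fetches *source lines* from disk at formatting time, via
linecache. For example:

    import traceback, linecache

    try:
        1 / 0
    except ZeroDivisionError as e:
        linecache.clearcache()
        # The source text shown here is read from disk at format
        # time; only the line numbers are baked into the code object.
        # The proposal would make the numbers lazy as well.
        print("".join(traceback.format_exception(
            type(e), e, e.__traceback__)))

So the real question is how much further down that road we go.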

> These are, in my opinion, typical scenarios. If you're in an atypical
> scenario, say all your modules are loaded over a network running over a
> piece of string stuck between two tin cans *wink*, then you probably
> will feel a lot more pain, but honestly that's not our problem. We're
> not obliged to optimize Python for running on broken networks.

People DO run Python over networks, though, and people DO upgrade
their Python installations.

>> As a simple example, upgrading your Python installation while you have
>> a Python script running can give you this effect already.
>
> Right -- so we're not adding any failure modes that don't already exist.
>
> It is *already* a bad idea to upgrade your Python installation, or even
> modify modules, while Python is running, since the source code may get
> out of sync with the cached line numbers and the tracebacks will become
> inaccurate. This is especially a problem when running in the interactive
> interpreter while editing the file you are running.

Do you terminate every single Python process on your system before you
upgrade Python? Let's say you're running a server on Red Hat
Enterprise Linux or Debian Stable, and you go to apply all the latest
security updates. Is that best done by shutting down every single
application, THEN applying all updates, and only when that's all done,
starting everything up? Or do you update everything on the disk, then
pick one process at a time and signal it to restart?

I don't know for sure about RHEL, but I do know that Debian's package
management system involves a lot of Python. So it'd be a bit tricky to
build your updater such that no Python is running during updates -
you'd have to deploy a brand-new Python tree somewhere to use for
installation, or something. And if you have any tiny little wrapper
scripts written in Python, they could easily still be running across
an update, even if the rest of the app is written in C.

So, no. You should NOT have to adopt a blanket rule of "don't update
while it's running". Instead, what you rely on is: "binaries can
safely be unlinked, and Python modules are only read from disk when
you import them".

>> if docstrings are looked up
>> separately - and especially if lnotab is too - you could happily
>> import and use something (say, in a web server), then run updates, and
>> then an exception requires you to look up a line number. Oops, a few
>> lines got inserted into that file, and now all the line numbers are
>> straight-up wrong. That's a definite behavioural change.
>
> Indeed, but that's no different from what happens now when the same line
> number might point to a different line of source code.

Yes, this is true; but at least the mapping from bytecode to line
number is trustworthy. Worst case, you look at the traceback and then
interpret it against an older copy of the .py file. If lnotab is
loaded lazily, you don't even have that: something has to reconstruct
what the mapping was.
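
Today you can watch that mapping travelling with the code object
itself:

    import dis

    def f():
        x = 1
        return x

    # The (offset, lineno) pairs are decoded from f.__code__.co_lnotab,
    # which lives in memory; editing the source file on disk cannot
    # change them.
    for offset, lineno in dis.findlinestarts(f.__code__):
        print(offset, lineno)

Make lnotab lazily loaded and that guarantee goes away.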

>> Maybe it's
>> one that's considered acceptable, but it definitely is a change.
>
> I don't think it is a change, and I think it is acceptable. I think the
> solution is, don't upgrade your modules while you're still running them!

If you need a solution to it, then it IS a change. Doesn't mean it
can't be done, but it definitely is a change. (Look at the PEP 572
changes to list comprehensions at class scope. Nobody's denying that
the semantics are changing; but normal usage won't ever witness the
changes.)

I don't think this is purely a performance question.

ChrisA

