[Python-Dev] Better support for consuming vendored packages

Gregory Szorc gregory.szorc at gmail.com
Thu Mar 22 12:58:07 EDT 2018


 I'd like to start a discussion around practices for vendoring package
dependencies. I'm not sure python-dev is the appropriate venue for this
discussion. If not, please point me to one and I'll gladly take it there.

I'll start with a problem statement.

Not all consumers of Python packages wish to consume Python packages in the
common `pip install <package>` + `import <package>` manner. Some Python
applications may wish to vendor Python package dependencies such that known
compatible versions are always available.

For example, a Python application targeting a general audience may not wish
to expose the existence of Python nor want its users to be concerned about
Python packaging. This is good for the application because it reduces
complexity and the surface area of things that can go wrong.

But at the same time, Python applications need to be aware that the Python
environment may contain more than just the Python standard library and
whatever Python packages are provided by that application. If using the
system Python executable, other system packages may have installed Python
packages in the system site-packages and those packages would be visible to
your application. A user could `pip install` a package and that would be in
the Python environment used by your application. In short, unless your
application distributes its own copy of Python, all bets are off with
regards to what packages are installed. (And even then advanced users could
muck with the bundled Python, but let's ignore that edge case.)

Simply put, `import X` is often the wild west. For applications that want to
"just work" without requiring end users to manage Python packages, `import
X` is dangerous because `X` could come from anywhere and be anything -
possibly even a separate code base providing the same package name!
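This hazard is easy to demonstrate. Here is a minimal sketch (the module name
"shrubbery" and both file bodies are invented for illustration): whichever
copy appears first on `sys.path` is the one `import` returns.

```python
# Two unrelated "shrubbery" modules; importing "shrubbery" silently picks
# whichever directory appears first on sys.path.
import importlib
import sys
import tempfile
from pathlib import Path

def first_wins_demo():
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        Path(a, "shrubbery.py").write_text("ORIGIN = 'copy A'\n")
        Path(b, "shrubbery.py").write_text("ORIGIN = 'copy B'\n")
        sys.path.insert(0, b)
        sys.path.insert(0, a)  # "a" now shadows "b"
        importlib.invalidate_caches()  # the temp dirs are new to the importer
        try:
            sys.modules.pop("shrubbery", None)
            return importlib.import_module("shrubbery").ORIGIN
        finally:
            sys.path.remove(a)
            sys.path.remove(b)
            sys.modules.pop("shrubbery", None)

print(first_wins_demo())  # -> copy A
```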

Since Python applications may not want to burden users with Python
packaging, they may vendor Python package dependencies such that a known
compatible version is always available. In most cases, a Python application
can insert itself into `sys.path` to ensure its copies of packages are
picked up first. This works a lot of the time. But the strategy can fall
apart.
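The insertion itself is simple. A hedged sketch - the `packages` directory
name and the application root are entirely hypothetical:

```python
import os
import sys

def prefer_bundled_packages(app_root):
    """Prepend the application's private package root to sys.path so its
    vendored copies are found before anything else in the environment."""
    packages_dir = os.path.join(app_root, "packages")
    if packages_dir in sys.path:   # avoid duplicate entries on re-entry
        sys.path.remove(packages_dir)
    sys.path.insert(0, packages_dir)
    return packages_dir

# Hypothetical install location.
prefer_bundled_packages("/opt/knights")
```

This holds only as long as nothing run later prepends its own entry ahead of
yours, which is precisely where the strategy falls apart.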

Some Python applications support loading plugins or extensions. When
user-provided code can be executed, that code could have dependencies on
additional Python packages. Or that custom code could perform `sys.path`
modifications to provide its own package dependencies. What this means is
that `import X` from the perspective of the main application becomes
dangerous again. You want to pick up the packages that you provided. But
you just aren't sure that those packages will actually be picked up. And to
complicate matters even more, an extension may wish to use a *different*
version of a package from what you distribute. e.g. they may want to adopt
the latest version before you have ported to it, or they may want to use an
old version because they haven't ported yet. So now you have the requirement
that multiple versions of a package be available. In Python's
shared module namespace, that means having separate package names.
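Because `sys.modules` is keyed by name, co-existing versions do need distinct
names. A hedged sketch of how that can be arranged today, using `importlib`
to load two hypothetical versions of the same module under different names:

```python
# sys.modules is keyed by module name, so two versions of the "same"
# package can only coexist under distinct names. importlib can load a
# file under an arbitrary name. File contents here are invented.
import importlib.util
import sys
import tempfile
from pathlib import Path

def load_as(name, path):
    """Load the module at `path` under the module name `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    sys.modules[name] = mod
    spec.loader.exec_module(mod)
    return mod

with tempfile.TemporaryDirectory() as d:
    Path(d, "v1.py").write_text("VERSION = '1.0'\n")
    Path(d, "v2.py").write_text("VERSION = '2.0'\n")
    old = load_as("shrubbery_v1", str(Path(d, "v1.py")))
    new = load_as("shrubbery_v2", str(Path(d, "v2.py")))

print(old.VERSION, new.VERSION)  # -> 1.0 2.0
```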

A partial solution to this quagmire is using relative - not absolute -
imports. e.g. say you have a package named "knights." It has a dependency
on a 3rd party package named "shrubbery." Let's assume you distribute your
application with a copy of "shrubbery" which is installed at some packages
root, alongside "knights":

  /
  /knights/__init__.py
  /knights/ni.py
  /shrubbery/__init__.py

If from `knights.ni` you `import shrubbery`, you /could/ get the copy of
"shrubbery" distributed by your application. Or you could pick up some
other random copy that is also installed somewhere in `sys.path`.

Whereas if you vendor "shrubbery" into your package, e.g.

  /
  /knights/__init__.py
  /knights/ni.py
  /knights/vendored/__init__.py
  /knights/vendored/shrubbery/__init__.py

Then if from `knights.ni` you do `from .vendored import shrubbery`, you are
*guaranteed* to get your local copy of the "shrubbery" package.
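A sketch reproducing this guarantee end to end, using the layout above (the
decoy package stands in for "some other random copy" on `sys.path`; all file
contents are invented):

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as decoy:
    vendored = Path(root, "knights", "vendored", "shrubbery")
    vendored.mkdir(parents=True)
    Path(root, "knights", "__init__.py").write_text("")
    Path(root, "knights", "vendored", "__init__.py").write_text("")
    (vendored / "__init__.py").write_text("ORIGIN = 'vendored'\n")
    # knights/ni.py uses the package-relative form.
    Path(root, "knights", "ni.py").write_text("from .vendored import shrubbery\n")

    # A competing top-level "shrubbery" that sits EARLIER on sys.path.
    Path(decoy, "shrubbery").mkdir()
    Path(decoy, "shrubbery", "__init__.py").write_text("ORIGIN = 'decoy'\n")

    sys.path[:0] = [decoy, root]
    importlib.invalidate_caches()
    for name in list(sys.modules):   # drop any cached "knights" modules
        if name == "knights" or name.startswith("knights."):
            del sys.modules[name]
    ni = importlib.import_module("knights.ni")
    sys.path.remove(decoy)
    sys.path.remove(root)

print(ni.shrubbery.ORIGIN)  # -> vendored, despite the decoy coming first
```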

This reliable behavior is highly desired by Python applications.

But there are problems.

What we've done is effectively rename the "shrubbery" package to
"knights.vendored.shrubbery." If a module inside that package attempts an
`import shrubbery.x`, this could fail because "shrubbery" is no longer the
package name. Or worse, it could pick up a separate copy of "shrubbery"
somewhere else in `sys.path` and you could have a Frankenstein package
pulling its code from multiple installs. So for this to work, all
package-local imports must be using relative imports. e.g. `from . import
x`.
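Concretely, a vendored "shrubbery" whose internal imports are all of the
`from . import x` form keeps working even though no top-level "shrubbery"
exists anywhere on `sys.path` (file contents are invented for illustration):

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    shrub = Path(root, "knights", "vendored", "shrubbery")
    shrub.mkdir(parents=True)
    Path(root, "knights", "__init__.py").write_text("")
    Path(root, "knights", "vendored", "__init__.py").write_text("")
    # Package-local relative import: survives the rename to
    # "knights.vendored.shrubbery" with no source modification.
    (shrub / "__init__.py").write_text("from . import x\n")
    (shrub / "x.py").write_text("ANSWER = 42\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    for name in list(sys.modules):   # drop any cached "knights" modules
        if name == "knights" or name.startswith("knights."):
            del sys.modules[name]
    shrubbery = importlib.import_module("knights.vendored.shrubbery")
    sys.path.remove(root)

print(shrubbery.x.ANSWER)  # -> 42
```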

The takeaway is that packages using relative imports for their own modules
are much more flexible and therefore friendly to downstream consumers that
may wish to vendor them under different names. Packages using relative
imports can be dropped in and used, often without source modifications.
This is a big deal, as downstream consumers don't want to be
modifying/forking packages they don't maintain. Because of the advantages
of relative imports, *I've independently reached the conclusion that
relative imports within packages should be considered a best practice.* I
would encourage the Python community to discuss adopting that practice more
formally (perhaps as a PEP or something).

But package-local relative imports aren't a cure-all. There is a major
problem with nested dependencies. e.g. if "shrubbery" depends on the
"herring" package. There's no reasonable way of telling "shrubbery" that
"herring" is actually provided by "knights.vendored." You might be tempted
to convert non package-local imports to relative. e.g. `from .. import
herring`. But the importer doesn't allow relative imports outside the
current top-level package and this would break classic installs where
"shrubbery" and "herring" are proper top-level packages and not
sub-packages in e.g. a "vendored" sub-package. For cases where this occurs,
the easiest recourse today is to rewrite imported source code to use
relative imports. That's annoying, but it works.
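A hedged sketch of what that rewriting can look like. This regex version only
handles simple one-line statements (a robust tool would parse with the `ast`
module); the `knights.vendored` namespace and the dependency names are
hypothetical:

```python
import re

# Hypothetical vendored namespace and set of vendored dependencies.
VENDOR_PKG = "knights.vendored"
VENDORED = {"shrubbery", "herring"}

def rewrite_imports(source):
    """Rewrite simple absolute imports of vendored packages so they
    resolve inside the vendored namespace instead of top-level."""
    names = "|".join(sorted(VENDORED))
    # "import herring"        -> "from knights.vendored import herring"
    source = re.sub(
        r"^import (%s)$" % names,
        r"from %s import \1" % VENDOR_PKG,
        source, flags=re.MULTILINE)
    # "from herring import x" -> "from knights.vendored.herring import x"
    source = re.sub(
        r"^from (%s)(\.| )" % names,
        r"from %s.\1\2" % VENDOR_PKG,
        source, flags=re.MULTILINE)
    return source

print(rewrite_imports("import herring\nfrom shrubbery.util import trim\n"))
```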

In summary, some Python applications may want to vendor and distribute
Python package dependencies. Reliance on absolute imports is dangerous
because the global Python environment is effectively undefined from the
perspective of the application. The safest thing to do is use relative
imports from within the application. But because many packages don't use
relative imports themselves, vendoring a package can require rewriting
source code so imports are relative. And even if relative imports are used
within that package, relative imports can't be used for other top-level
packages. So source code rewriting is required to handle these. If you
vendor your Python package dependencies, your world often consists of a lot
of pain. It's better to absorb that pain than inflict it on the end-users
of your application (who shouldn't need to care about Python packaging).
But this is a pain that Python application developers must deal with. And I
feel that pain undermines the health of the Python ecosystem because it
makes Python a less attractive platform for standalone applications.

I would very much welcome a discussion and any ideas on improving the
Python package dependency problem for standalone Python applications. I
think encouraging the use of relative imports within packages is a solid
first step. But it obviously isn't a complete solution.

Gregory