[Distutils] formencode as .egg in Debian ??

Phillip J. Eby pje at telecommunity.com
Wed Nov 23 21:29:17 CET 2005


At 08:12 PM 11/23/2005 +0100, Matthias Urlichs wrote:
>Hi,
>
>Phillip J. Eby:
> > I'm thinking that perhaps I should add an option like
> > '--single-version-externally-managed' to the install command so that you
> > can indicate that you are installing for the sake of an external package
> > manager that will manage conflicts and uninstallation needs.  This would
> > then allow installation using the .egg-info form and no .pth files.
> >
>You might shorten that option a bit. ;-)  I agree that this would be a
>good option to have.

I try to use very long names for options that can have damaging effects if 
used indiscriminately.  A project that's installed the "old-fashioned way" 
(which is what this does, apart from adding .egg-info) is hard to uninstall 
and may overwrite other projects' files.  So, it is only safe to use if the 
files are being managed by some external package manager, and it further 
only works for a single installed version at a time.  So the name is 
intended to advertise these facts, and to discourage people who are just 
reading the option list from trying it out to see what it does.  :)


> > >People will often inspect sys.path to understand where Python
> > >is looking for their code.
> >
> > As I pointed out, eggs give you much better information on this.
>
>The .egg metadata does. That, as you say, is distinct from the idea of
>packaging the .egg as a zip file. Most likely, one that includes .pyc
>files which were byte-compiled with different file paths; that causes no
>problems whatsoever ... until you get obscure ideas like trying to step
>through the code with pdb, or opening it in your editor to insert an
>assertion or a printf, trying to figure out why your code breaks.  :-/

This is actually what the .egg-info mode was designed for.  That is, doing 
development of the project.  A setuptools-based project can run "setup.py 
develop" to add the project's source directory to sys.path, after 
generating an .egg-info directory in the project source if necessary.  This 
allows you to do all your development right in your source checkout, and of 
course all the file paths are just fine, and the egg metadata is available 
at runtime.  You can then deploy the project as an .egg file or directory.

(Also, for the .egg directory format, note that easy_install recompiles the 
.pyc/.pyo files so their paths *do* point to the .egg contents instead of 
the original build paths.  The issues with zipfiles and precompiled .pyc 
files are orthogonal to anything about setuptools, eggs, etc.; they will 
bite you in today's Python no matter what's in the zipfile or who 
precompiled the .pyc files.  I do have some ideas for fixing both of these 
problems in future versions of Python, but they're rather off-topic for all 
the lists we are currently talking on.)
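
To make the .pyc path issue concrete: a compiled code object records the
file name it was compiled under, and that name travels with the marshalled
bytecode that a .pyc file stores. The sketch below uses only the standard
library. The build path in it is a made-up example, and the import machinery
in later Python releases may adjust this field in some cases.

```python
import marshal

SOURCE = "def f():\n    return 1\n"

# Hypothetical path where the module was byte-compiled (not from the thread).
BUILD_PATH = "/original/build/path/mod.py"

# Compile as if the source lived at BUILD_PATH.
code = compile(SOURCE, BUILD_PATH, "exec")

# Round-trip through marshal, the serialization used for the body of a .pyc.
restored = marshal.loads(marshal.dumps(code))

# The recorded file name is still the build path, wherever the bytes end up.
recorded_path = restored.co_filename
print(recorded_path)
```

Tracebacks and debuggers consult this recorded name to locate source lines,
which is why a stale build path makes stepping through the code awkward.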


>That's not exactly negotiable. Debian has a packaging format which
>resolves generic installation dependencies on its own. Therefore it
>cannot depend on Python-specific .egg metadata. Therefore we need a way
>to translate .egg metadata to Debian metadata.

Yes, that's precisely what I was suggesting would be helpful.  As Vincenzo 
already mentioned, the egg metadata is a good starting point for defining 
the Debian metadata.  I'm obviously not proposing changing Debian's 
metadata system.  Well, maybe it wasn't *obvious* that I wasn't proposing 
that, but in any case I'm not.  :)


> > I remain concerned about how such packages will work with namespace
> > packages, since namespace packages mean that two different distributions
> > may be supplying the same __init__.py files, and some package managers may
> > not be able to deal with two system packages (e.g. Debian packages, RPMs,
> > etc.) supplying the same file, even if it has identical contents in each
> > system package.
> >
>Debian packaging has a method to explicitly rename a different package's
>file if it conflicts with yours ("dpkg-divert"; it does _not_ depend on
>which package gets installed first). IMHO that's actually superior to
>randomly executing only one of these files, since you are aware that
>there is a conflict (the second package simply doesn't install if you
>don't fix it), and thus can handle it intelligently.

The two kinds of possible conflicts are namespace packages, and 
project-level resources.

A namespace package is more like a Java package than a traditional Python 
package.  A Java package can be split across multiple directories or jar 
files; it doesn't have to be all in one place.  Thus you can have lots of 
jars with org.apache.* classes in them.

Python, however, requires packages to have an __init__.py file, and by 
default the entire package is assumed to be in the directory containing the 
__init__.py file.  Python 2.3 added the 'pkgutil' module to the standard 
library, which lets you create a Java-style "namespace package" by 
automatically combining package directories found on different parts of 
sys.path.  So, if in one sys.path directory you had a 'zope.interface' 
package, and in another you had a 'zope.publisher' package, these would be 
combined, instead of the first one being treated as if it were all of 
'zope.*', and the second being completely ignored.  For this to work, 
though, *each* of the subpackages needs its own zope/__init__.py file.
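
A minimal sketch of the pkgutil mechanism, assuming a made-up namespace
called `nsdemo` (standing in for zope so that it cannot clash with anything
installed) and two throwaway directories standing in for separate sys.path
entries. The helper `make_portion` is hypothetical, written only for this
illustration.

```python
import importlib
import os
import sys
import tempfile

# The snippet that every copy of the namespace __init__.py carries.
NAMESPACE_INIT = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_portion(root, namespace, subpackage):
    """Create root/namespace/subpackage with the needed __init__.py files."""
    ns_dir = os.path.join(root, namespace)
    sub_dir = os.path.join(ns_dir, subpackage)
    os.makedirs(sub_dir)
    with open(os.path.join(ns_dir, "__init__.py"), "w") as f:
        f.write(NAMESPACE_INIT)
    with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
        f.write("NAME = %r\n" % subpackage)


# Two separate directories, each holding one piece of the namespace.
dir_a = tempfile.mkdtemp()
dir_b = tempfile.mkdtemp()
make_portion(dir_a, "nsdemo", "interface")
make_portion(dir_b, "nsdemo", "publisher")

sys.path[:0] = [dir_a, dir_b]
importlib.invalidate_caches()

# Both subpackages import, although they live under different path entries.
interface = importlib.import_module("nsdemo.interface")
publisher = importlib.import_module("nsdemo.publisher")
namespace_path = list(sys.modules["nsdemo"].__path__)
print(interface.NAME, publisher.NAME, len(namespace_path))
```

Without the two `extend_path` lines, only the `nsdemo` directory found first
on sys.path would be searched, and the second import would fail.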

So, the issue here is that if you install two projects that contain zope.* 
packages into the *same* directory (e.g. site-packages), then there will be 
two different zope/__init__.py files installed at the same location, even 
though they will have the same content (a short snippet of code to activate 
the namespace mechanism via the pkgutil module or via setuptools' 
pkg_resources module).
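
The collision can be sketched as a simple check over the file lists of the
projects being installed into one directory. The project names and file
lists below are illustrative assumptions, not real package manifests, and
`find_collisions` is a hypothetical helper.

```python
from collections import defaultdict


def find_collisions(file_lists):
    """Return {path: [projects]} for paths that more than one project installs."""
    owners = defaultdict(list)
    for project, paths in file_lists.items():
        for path in paths:
            owners[path].append(project)
    return {
        path: sorted(projects)
        for path, projects in owners.items()
        if len(projects) > 1
    }


# Illustrative file lists, relative to a shared site-packages directory.
file_lists = {
    "zope.interface": ["zope/__init__.py", "zope/interface/__init__.py"],
    "zope.publisher": ["zope/__init__.py", "zope/publisher/__init__.py"],
}

collisions = find_collisions(file_lists)
print(collisions)
```

Only the shared namespace __init__.py shows up as a conflict; the
subpackage files themselves never overlap.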

To date, there are only a small number of these namespace packages in 
existence, but over time they will represent a fairly large number of 
*projects*.  As I go through the breakup of the PEAK meta-project into 
separate components, I expect to have a dozen or so projects contributing 
to the peak.* and peak.util.* namespace packages.  Ian Bicking's Paste 
meta-project has a paste.* namespace package spread out in two or three 
subprojects so far.  There has been some off-and-on discussion about 
whether Zope 3 will move to eggs instead of their own zpkg tool (which has 
issues on Windows and Mac OS that eggs do not), and in that case they will 
likely have a couple dozen components in zope.* and zope.app.*.

So, for the long-term solution of wrapping Python projects in Debian 
packages, the namespace issue needs to be addressed, because renaming each 
project's zope/__init__.py or whatever isn't going to work very 
well.  There has to be one __init__.py file, or else such projects need to 
be installed in their own .egg directories or zipfiles to avoid collisions.

The second collision issue with --single-version-externally-managed is 
top-level resource collisions.  Some existing projects that are not 
egg-based manipulate their install_data operation in such a way that they 
create files or directories in site-packages directly, rather than inside 
their own package data structures.  Setuptools neither encourages nor 
discourages this, because it doesn't cause any problems for any egg layout 
except the .egg-info one -- and the .egg-info one was originally designed 
to support development, not deployment.  In the development scenario, any 
such files are isolated to the source tree, and for deployment the .egg 
file or directory keeps each project's contents completely isolated.

So, what I'm saying is that putting all projects in the same directory (as 
all "traditional" Python installations do) has some inherent limitations 
with respect to namespace packages and top-level resources, and these 
limitations are orthogonal to the question of egg metadata.  The .egg 
formats were created to solve these problems (including clean upgrades, 
multi-version support, and uninstallation in scenarios where a package 
manager isn't usable), and so the other features that they enable will be 
increasingly popular as well.

In other words, as people make more use of PyPI (because they now really 
*can*), more people will put things on PyPI, and the probability of package 
name conflicts will increase more rapidly.  The natural response will be a 
desire to claim uber-project or organizational names (like paste.*, peak.*, 
zope.*, etc.), putting individual projects under sub-package names.  (For 
example, someone has already argued that I should move RuleDispatch's 
'dispatch' package to 'peak.dispatch' rather than keeping the top-level 
'dispatch' name all to myself.)

So, I'm just saying that using the --single-version-externally-managed 
approach requires that a package manager like Debian grow a way to handle 
these namespace packages safely and sanely.  One possibility is to create 
dummy packages that contain only the __init__.py file for that namespace, 
and then have the real packages all depend on the dummy package, while 
omitting the __init__.py.  So, perhaps each project containing a 
peak.util.* subpackage would depend on a 'python2.4-peak.util-namespace' 
package, which in turn would depend on a 'python2.4-peak-namespace' 
package.  It's rather ugly, to say the least, but it would work as long as 
upstream developers never put anything in namespace __init__.py files 
except for the pkg_resources.declare_namespace() call.

(By the way, since part of an egg's metadata lists what namespace packages 
the project contains code or data for, the generation of these dependencies 
can be automated as part of the egg-to-deb conversion process.)
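
A rough sketch of how that automation might derive the dummy-package chain
from a project's list of namespace packages. The naming pattern follows the
'python2.4-peak.util-namespace' example above; the function names and the
idea of passing the namespaces in as a plain list are assumptions made for
illustration, not part of any existing tool.

```python
def namespace_package_name(namespace, python_version="2.4"):
    """Name of the dummy system package holding one namespace __init__.py."""
    return "python%s-%s-namespace" % (python_version, namespace)


def namespace_dependencies(namespaces, python_version="2.4"):
    """Map each dummy package to the parent dummy package it depends on.

    Top-level namespaces map to None because they have no parent.
    """
    deps = {}
    for namespace in namespaces:
        parts = namespace.split(".")
        # Walk from the full dotted name up to the top-level namespace.
        for depth in range(len(parts), 0, -1):
            name = namespace_package_name(".".join(parts[:depth]), python_version)
            if depth > 1:
                parent = namespace_package_name(
                    ".".join(parts[:depth - 1]), python_version
                )
            else:
                parent = None
            deps[name] = parent
    return deps


deps = namespace_dependencies(["peak.util"])
for package, parent in sorted(deps.items()):
    print(package, "->", parent)
```

A real project would then depend on the dummy package for its deepest
namespace and leave the namespace __init__.py files out of its own file list.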

Or, of course, the .egg directory approach can also be used to bypass all 
collision issues, but this brings sys.path and .pth files back into the 
discussion.  On the other hand, it can possibly be assumed that anything in 
a namespace package can be used only after a require() (either implicit or 
explicit), so maybe the .pth can be dropped for projects with namespace 
packages.  These are possibilities worth considering, since they avoid the 
ugliness of creating dummy packages just to hold namespace __init__.py files.


