[Python-Dev] setuptools: past, present, future

Phillip J. Eby pje at telecommunity.com
Fri Apr 21 21:57:42 CEST 2006


I've noticed that there seems to be a lot of confusion out there about what 
setuptools is and/or does, at least among Python-Dev folks, so I thought it 
might be a good idea to give an overview of its structure, so that people 
have a better idea of what is and isn't "magic".

Setuptools began as a fairly routine collection of distutils extensions, to 
do the same boring things that everybody needs distutils extensions to 
do.  Basic stuff like installing data with your packages, running unit 
tests, that sort of thing.
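
For concreteness, a minimal setup script using that kind of routine
extension might look something like this with today's setuptools (the
project name and layout are invented for illustration):

    from setuptools import setup, find_packages

    setup(
        name="ExampleProject",                     # hypothetical project
        version="0.1",
        packages=find_packages(),
        package_data={"example": ["data/*.txt"]},  # ship data files with the package
        test_suite="example.tests",                # enables "python setup.py test"
    )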

At some point, I was getting tired of having to deal with dependencies by 
making people install them manually, or else having to bundle them.  I 
wanted a more automated way to deal with this problem, and in 2004 brought 
the problem to the distutils-sig and planned to do a PyCon sprint to try to 
address the problem.  Tim Peters encouraged me to move the preliminary work 
I'd done to the Python sandbox, where others could follow the work and 
improve upon it, and he sponsored me for CVS privileges so I could do so.

As it turned out, I wasn't able to go to PyCon, but I produced some crude 
stuff to try to implement dependency handling, based on some previous work 
by Bob Ippolito.  Bob's stuff used imports to check version strings, and 
mine was a bit more sophisticated in that it could scan .py or .pyc files 
without actually importing them.  But there was no reasonable way to track 
download URLs, or to deal with the myriad package formats (source, RPM, 
etc.), their platform-specific quirks, and so on; and PyPI didn't really 
exist yet.

To top it all off, within a couple of months I was laid off, so the problem 
ceased to be of immediate practical interest to me.  I decided to 
take a six-month sabbatical and work on RuleDispatch, after which I began 
contracting for OSAF.

OSAF's Chandler application has a plugin platform akin to Eclipse, and I 
saw that it was going to need a cross-platform plugin format.  I put out 
the call to distutils-sig, and Bob Ippolito took up the challenge.  We 
designed the first egg format, agreeing that it should support Python 
libraries, not just plugins; that .egg zipfiles and directories should be 
treated interchangeably; and that it should be possible to put more than 
one conceptual egg into a single physical zipfile.  The true "egg" 
was the project release, not the zipfile itself.  (We called a zipfile 
containing multiple eggs a "basket", which we thought would be useful for 
things like py2exe.  pkg_resources still supports baskets today, but there 
are no tools for generating them - you have to just zip up a bunch of .egg 
directories to make one.)
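
Making a basket by hand really is just that: zip the .egg directories up. 
A rough sketch, with invented .egg directory names, might be:

    import os
    import zipfile

    def make_basket(basket_path, egg_dirs):
        """Zip several .egg directories into a single 'basket' zipfile."""
        zf = zipfile.ZipFile(basket_path, "w", zipfile.ZIP_DEFLATED)
        try:
            for egg_dir in egg_dirs:
                base = os.path.basename(egg_dir.rstrip(os.sep))
                for root, dirs, files in os.walk(egg_dir):
                    for name in files:
                        full = os.path.join(root, name)
                        # keep each file under its Project-1.0-pyX.Y.egg/ prefix
                        arcname = os.path.join(base, os.path.relpath(full, egg_dir))
                        zf.write(full, arcname)
        finally:
            zf.close()

    make_basket("plugins.zip",
                ["FooPlugin-1.0-py2.4.egg", "BarPlugin-0.3-py2.4.egg"])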

Bob wrote the prototype pkg_resources module to support accessing resources 
in zipfiles and regular directories, while I worked on creating a bdist_egg 
command, which I added to the then-dormant setuptools package, figuring 
that the experimental dependency stuff could be later refactored to allow 
dependencies to be resolved using eggs.  We had a general notion that there 
would be some kind of web pages you could use to list packages on, since at 
that time PyPI didn't allow uploads yet.  Or at any rate, we didn't know 
that it did until PyCon in 2005.

After PyCon, I kept hearing about projects to make a CPAN-like tool for 
Python, such as the Uragas project.  However, all of these projects sounded 
like they were going to reinvent everything from scratch, particularly a 
lot of stuff that Bob and I had just done.  It then occurred to me for the 
first time that the .egg format could be used to solve the problems both of 
having a local package database, and also the uninstallation and upgrade of 
packages.  In fact, the only piece missing was that there was no way to 
find and download the packages to be installed, and if I could solve that 
problem, the CPAN problem would be solved.

So, I did some research by taking a random sample of packages from PyPI, to 
find out what information people were actually registering.  I found that, 
more often than not, at least one of their PyPI URLs would point to a page 
that had links to packages that could be downloaded directly.  And that was 
basically enough to permit writing a very simple spider that would only 
follow "download" or "homepage" links from PyPI pages, and would also 
inspect URLs to see if they were recognizable as distutils-generated 
filenames, from which it could extract package name and version info.
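
The filename check is roughly the kind of heuristic sketched below.  This 
is only an illustration (the real code in setuptools.package_index handles 
many more cases, and the example URL is made up):

    SDIST_EXTS = ('.tar.gz', '.tgz', '.tar.bz2', '.zip')

    def guess_name_version(url):
        """Guess (name, version) from a distutils-style sdist URL, else None."""
        filename = url.split('#')[0].split('?')[0].rstrip('/').split('/')[-1]
        for ext in SDIST_EXTS:
            if filename.endswith(ext):
                stem = filename[:-len(ext)]
                break
        else:
            return None
        # assume the version starts at the first hyphen-separated piece
        # that begins with a digit, e.g. "SQLObject-0.7.0"
        parts = stem.split('-')
        for i in range(1, len(parts)):
            if parts[i][:1].isdigit():
                return '-'.join(parts[:i]), '-'.join(parts[i:])
        return None

    print(guess_name_version("http://example.com/dist/SQLObject-0.7.0.tar.gz"))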

Thus, easy_install was born, completing what some people now call the 
eggs/setuptools/easy_install trifecta.

If you are going to work on or support these tools, it's important that you 
understand that these three things are related, but distinct.  Setuptools 
is at heart just an ordinary collection of distutils enhancements that 
happens to include a bdist_egg command.  EasyInstall is another enhanced 
command built on setuptools, one that leverages setuptools to build eggs 
for packages that don't have them.  But setuptools in turn depends on 
EasyInstall, so that packages can have dependencies.

So the components are:

     pkg_resources: standalone module for working with project releases, 
dependency specification and resolution, and bundled resources

     setuptools: a package of distutils extensions, including ones to build 
eggs with

     easy_install: a distutils extension built using setuptools, that 
finds, downloads, builds eggs for, and installs packages that use either 
distutils or setuptools

And if you look at that list, it's pretty easy to see which part is the 
most magical, implicit, heuristic, etc.  It's easy_install, no 
question.  If it weren't for the fact that easy_install tries to support 
non-setuptools packages, there would be little need for monkeypatching or 
sandboxing.  If it weren't for the fact that easy_install tries to 
interpret web pages, there would be no need for heuristics or guessing.
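
To make the division of labor concrete, here's a minimal, hypothetical 
setup script; setuptools records the requirement in the egg metadata, and 
easy_install is the part that goes out and satisfies it:

    from setuptools import setup, find_packages

    setup(
        name="MyLibrary",                    # invented names throughout
        version="1.0",
        packages=find_packages(),
        # easy_install finds a suitable OtherLibrary release, builds an
        # egg for it if necessary, and installs it at install time
        install_requires=["OtherLibrary>=0.5"],
    )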

So, in a perfect world where everybody neatly files everything with PyPI, 
easy_install would not have anything implicit about it.  But this isn't a 
perfect world, and to gain adoption, it had to have backward 
compatibility.  If easy_install could handle *enough* existing packages, 
then it would encourage package authors to use it so that they could depend 
on those existing packages.  These authors would end up using setuptools, 
which would then tend to ensure that *their* package would be 
easy_install-able as well.

And, since the user needs setuptools to install these new packages, then 
the user now has setuptools, and the option to try using it to install 
other packages.  Users then encourage package authors to have correct PyPI 
information so their packages can be easy_install-ed as well, and the 
network effect increases from there.

So, I bundled all three things (pkg_resources, setuptools, and 
easy_install) into a single distribution bundle precisely so it would have 
this "viral" network effect.  I knew that if everybody had to be made to 
get their PyPI entries straight *first*, it would never work.  But if I 
could leverage an ever-growing user population to put pressure on authors 
and system packagers, and an ever-growing author population to increase the 
number of users, then the natural course of things should be that packages 
that don't play will die off, be forked, etc., and those who do play will 
be rewarded with more users.

I made an explicit, conscious, and cold-blooded decision to do things that 
way, knowing full well that it would immediately kill off all the competing 
"CPAN for Python" projects, and that it would also force lots of people to 
deal with setuptools who didn't care about it one way or another.  The 
community as a whole would benefit immensely, even if the costs would be 
borne by people who didn't agree with what I was doing.

So, yes, I'm a cold calculating bastard.  EasyInstall is #1 in the field 
because it was designed to make its competition irrelevant and to virally 
spread itself across the entire Python ecosphere.  I'm pointing these 
things out now because I think it's better not to mince words; easy_install 
was designed with Total World Domination in mind from day one and that is 
exactly what it's here to do.  Compatibility at any cost is its watchword, 
because that is what fuels its adoption.  End-users are its market, because 
what the end users want ultimately controls what the developers and the 
packagers do.

Thus, if you look at the history of setuptools, you'll see that the vast 
majority of work I do on it is increasing the Just-Works-iness of 
easy_install.  The majority of changes to non-easy_install code (and both 
setuptools.package_index and setuptools.sandbox are there only for 
easy_install) are architectural or format changes intended to support 
greater just-works-iness for easy_install.

(There are also lots of changes included to enhance setuptools' usefulness 
as a distutils extension, but these are driven mainly by user requests and 
Chandler needs, and there aren't nearly as many such changes.)

So, if you take easy_install and its support modules entirely out of 
setuptools, you would be left with a modest assortment of distutils 
extensions, most of which don't have any backward compatibility 
issues.  They could be merged into the distutils with nary a 
complaint.  The only significant change is the "sdist" command, which in 
setuptools supports a cleaner (and extensible) way of managing the source 
distribution manifest, one that frees developers from messing with the 
MANIFEST file and from having to constantly add junk to MANIFEST.in.  And 
we could probably decide either to keep the old behavior or to make it an 
option for anybody who's relying on the way it worked before.

And that's all well and good, but now you don't have the features that are 
the real reason end users want the whole thing: easy_install.

And it's not just the users.  Package authors want it too.  TurboGears 
really couldn't exist without this.  It's easy to argue that oh, they 
could've made distribution packages for six formats and nine platforms, or 
they could've made tarballs, etc. to bundle all the dependencies in, but 
those approaches really just don't scale -- especially for the single 
package author just starting to build something new.

None of these options are economically viable for the author of a new 
package, especially if their core competency isn't packaging and 
distribution.  Now that there's a TurboGears community, yes, there are 
probably people available who can do a lot of those distribution-related 
tasks.  But there wouldn't have *been* a community if Kevin couldn't have 
shipped the software by himself!

This is the *real* problem that I always meant to address, from the very 
beginning: Python development and distribution *costs too much* for the 
community to flourish as it should.  It's too hard for non-experts, and 
until now it required bundling, system packaging, or asking users to 
install their own dependencies.  But asking users to install dependencies 
doesn't scale for large numbers of dependencies.  And not being able to 
reuse packages leads to proliferating wheel-reinvention, because 
installation cost is a barrier to entry.

So, the work that I've done is simply social engineering through economic 
leverage.  The goal is to change the cost equations so that entry barriers 
for package distribution are low, so that users can try different packages, 
so they can switch, so market forces can choose winners.  Because switching 
and installation costs are low, interoperability and reuse are more 
attractive choices, and more likely to be demanded by users.  You can 
already see these forces taking effect in such developments as the joint 
CherryPy/TurboGears template plugin interface, which uses another 
setuptools innovation (entry points) to allow plug-and-play.
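
As a hedged illustration of the entry-point mechanism (the group and class 
names below are invented; the real TurboGears/CherryPy group names differ), 
a plugin advertises itself in its setup script:

    # in the plugin project's setup.py:
    from setuptools import setup

    setup(
        name="MyTemplatePlugin",
        version="0.1",
        py_modules=["mytemplate"],
        entry_points={
            "myapp.template_engines": [
                "mytpl = mytemplate:MyEngine",
            ],
        },
    )

and the host application discovers whatever plugins happen to be installed:

    import pkg_resources

    for entry in pkg_resources.iter_entry_points("myapp.template_engines"):
        engine = entry.load()    # imports mytemplate and returns MyEngine
        print("%s -> %r" % (entry.name, engine))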

I am doing all this because I got tired of reinventing wheels.  When you 
add in installation costs, writing your own package looks more attractive 
than reusing the other guy's.  But if installation is cheap, then people 
are more inclined to overlook the minor differences between how the other 
guy did it and how they would have done it, and are more likely to say to 
the "other guy", hey, I like this but would you add X?  And it's more 
likely that the "other guy" will say yes, because having another published 
package depend on his project will *multiply* his install base.

So, my question to all of you is, is that worth a little implicitness, a 
little magic?  My answer, of course, is yes.  It will probably be a 
multi-year effort to get the state of community practice up to a level 
where all the heuristics and webscraping can be removed from easy_install, 
without negatively affecting the cost equation.

Or maybe not.  Maybe we're just hitting the turn of the hockey stick now, 
and inclusion in 2.5 is just what the doctor ordered to kick the number of 
users so high that anybody would be crazy not to have clean PyPI listings, 
I don't know.  To be honest, though, I think the outstanding proposal on 
Catalog-SIG to merge Grig's "CheeseCake" rating system into PyPI (so that 
package authors will be shown what they can do to improve their listing 
quality) will actually have more direct impact on this than 2.5 inclusion 
will.  Guido's choice to bless setuptools is important for system packagers 
and developers to have confidence that this is the direction Python is 
taking; it doesn't have to actually go *in* 2.5 to do 
that.  install_egg_info clearly shows the direction we're taking.

So, after reading all the other stuff that's gone by in the last few days, 
this is what I think should happen:

First, setuptools should be withdrawn from inclusion in Python 2.5.  Not 
directly because of the opposition, but because of the simple truth that 
it's just not ready.  Some of that is because I've spent way too much time 
on the discussions this week, to the point of significant sleep deprivation 
at one point.  But when Guido first asked about it, I had concerns about 
getting everything done that really needed to be done, and effectively only 
agreed because I figured out a way to allow new versions to be distributed 
after-the-fact.  With the latest Python 2.5 release schedule, I'd be 
hard-pressed to get 0.7 to stability before the 2.5 betas go, certainly if 
I'm the only one working on it.

And a stable version of 0.7 is really the minimum that should go in the 
standard library, because the package management side of things really 
needs to have commands to list, uninstall, upgrade, etc., and they need to 
be easy to understand, not the confusing mishmash that is easy_install's 
current assortment of options.  (Which grew organically, rather than being 
designed ahead of time.)

And Fredrik is right to bring up concerns about both easy_install's 
confusing array of options, and the general support issues of asking 
Python-Dev to adopt setuptools.  These are things that can be addressed, 
and *are* being addressed, but they're not going to happen by Tuesday, when 
the alpha release is scheduled.

I hate to say this, because I really don't want to disappoint Guido or 
anyone on Python-Dev or elsewhere who has been calling for it to go in.  I 
really appreciate all your support, but Fredrik is right, and I can't let 
my desire to please all of you get in the way of what's right.

What *should* happen now instead, is a plan for merging setuptools into the 
distutils for 2.6.  That includes making the decisions about what "install" 
and "sdist" should do, and whether backward compatibility of internal 
behaviors should be implicit or explicit.  I don't want to start *that* 
thread right now, and we've already heard plenty of arguments on both 
sides.  Indeed, since Martin and Marc seem to be diametrically opposed on 
that issue, it is guaranteed that *somebody* will be unhappy with whatever 
decision is made.  :)

Between 2.5 and 2.6, setuptools should continue to be developed in the 
sandbox, and keep the name 'setuptools'.  For 2.6, however, we should merge 
the code bases and have setuptools just be an alias.  Or, perhaps what is 
now called setuptools should be called "distutils2" and distributed as 
such, with "setuptools" only being a legacy name.  But regardless, the plan 
should be to have only one codebase for 2.6, and to issue backported 
releases of that codebase for at least Python 2.4 and 2.5.

These ideas are new for me, because I hadn't thought that anybody would 
have cared enough to want to get into the code and share any of the 
work.  That being the case, it seems to make more sense for me to back off 
a little on the development in order to work on developer documentation, 
of the kind Fredrik has been asking for, and to work on a 
development roadmap so we can co-ordinate who will work on what, when, to 
get 0.7 to stability.

In the meantime, Python 2.5 *does* have install_egg_info, and it should 
definitely not be pulled out.  install_egg_info ensures that every package 
installed by the distutils is detectable by setuptools, and thus will not 
be reinstalled just because it wasn't installed by setuptools.

And there is one other thing that should go into 2.5: PKG-INFO files for 
each package bundled into the standard library that is also distributed 
separately (and API-compatibly) for older Python versions.  So, for 
example, if ctypes 0.9.6 is going into Python 2.5, it should have a 
PKG-INFO in the appropriate directory to say so.  Thus, programs written 
for Python 2.4 that declare a dependency like "ctypes>=0.9" will work with 
Python 2.5 without needing their setup scripts changed to remove the 
dependency when run under Python 2.5.
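
As a sketch of what that buys (assuming the metadata is present; the check 
below is illustrative, not the literal mechanism):

    import pkg_resources

    try:
        # with ctypes metadata installed, this is satisfied locally;
        # without it, a tool like easy_install would conclude that it
        # needs to go download and install ctypes
        pkg_resources.require("ctypes>=0.9")
        print("ctypes>=0.9 satisfied by installed metadata")
    except (pkg_resources.DistributionNotFound, pkg_resources.VersionConflict):
        print("no suitable ctypes metadata found")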

Last, but not least, we need to find an appropriate spot to add 
documentation for install_egg_info.

These are tasks that can be accomplished for 2.5, they are reasonably 
noncontroversial, and they do not add any new support requirements or 
stability issues that I can think of.

One final item that is a possibility: we could leave pkg_resources in for 
2.5, and add its documentation.  This would allow people to begin using its 
API to check for installed packages, access resources, and so on.  I'd be 
interested in hearing folks' opinions about that, one way or the other.
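
For example, here are a couple of the pkg_resources calls people could 
start using right away (the project and resource names are made up):

    import pkg_resources

    # every distribution whose metadata pkg_resources can see on sys.path
    for dist in pkg_resources.working_set:
        print("%s %s (%s)" % (dist.project_name, dist.version, dist.location))

    # read a data file bundled with a package, whether the package is
    # installed as a zipped egg or as plain files on disk
    data = pkg_resources.resource_string("someproject", "templates/default.html")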


