[Distutils] thoughts on distutils 1 & 2

has hengist.podd at virgin.net
Fri May 14 16:25:22 EDT 2004


[Deep breath, everyone; it's gonna get longer before it gets shorter...;]

Stefan Seefeld wrote:

>Example: compilation of extension modules.
>
>Scons is aiming at providing an abstraction layer for portable compilation.
>DU2 should at least allow to just delegate compilation of extension 
>modules to scons.
>(and as I said previously, I think anything that doesn't allow to
>wrap traditional build systems based on 'make' and possibly the autotools
>is 'not good enough' as a general solution).

I'm far from an expert on build systems (interpreted-language
weenie), but I do think it makes sense for DU to hand off compilation
duties to a third party as quickly as it can. That third party might
be a separate Python-based system, a makefile, or anything else;
DU shouldn't need to know which.

Thoughts on how one might separate extension compilation from the
rest of the installation procedure...

Let's say we had a standard 'src' folder containing everything needed
to produce a module's .so file(s), and we treated that folder as a
self-contained entity. How would we instruct it to compile and
deliver those .so files? The party requesting the compile operation
should not need to know anything about the compilation system used.
Presumably the easiest way to decouple the two is to have a standard
'compile.py' within the 'src' folder that is executed whenever
somebody wants .so files created. Whatever code that compile.py file
then executes is its own business, and if it needs any information
about the OS/Python installation then it's up to it to request that
information itself; ideally through existing Python APIs if possible,
or through a specific DU API if not.
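
To make that concrete, here's a rough sketch of DU's side of the
handshake (the 'src'/'compile.py' names are just the convention
proposed here, not anything that exists today):

    # Rough sketch: how DU might hand off extension building to a
    # package's own compile script, knowing nothing about whatever
    # build system sits behind it.
    import os, subprocess, sys

    def build_extensions(package_dir):
        src_dir = os.path.join(package_dir, 'src')
        if not os.path.exists(os.path.join(src_dir, 'compile.py')):
            return  # pure-Python package; nothing to build
        # Run the script with the current interpreter; it can query
        # Python itself (sys.prefix, distutils.sysconfig, ...) for
        # whatever platform information it needs.
        subprocess.check_call([sys.executable, 'compile.py'],
                              cwd=src_dir)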

Once that's done, it should be easy for developers to select their 
own build system from all those available to them. The Python-based 
build system that's currently incorporated into DU could, of course, 
be spun off as a peer to make, etc. - giving developers one more 
option to choose from without forcing it upon them.

BTW, once .so compilation is decoupled from installation, it should 
be possible/practical/easy? to defer .so compilation to import time 
(as is currently done for .pyc files).


>>- Every Python module should be distributed, managed and used as a 
>>single folder containing ALL resources relating to that module: 
>>sub-modules, extensions, documentation (bundled, generated, etc.), 
>>tests, examples, etc. (Note: this can be done without affecting 
>>backwards-compatibility, which is important.) Similar idea to OS 
>>X's package scheme, where all resources for [e.g.] an application 
>>are bundled in a single folder, but less formal (no need to hide 
>>package contents from user).
>
>are you really talking about 'package' here when you say 'module' ?
>I don't think that mandating modules to be self contained is a good
>idea. Often modules only 'make sense' in the context of the package
>that contains them. Also, are you talking about how to distribute
>packages, or about the layout of the installed files ?
>I don't think DU2 should mandate any particular layout for the target
>installation. It may well suggest layout of the files inside the
>(not yet installed) package.

(Like I say, rough notes; please keep pointing out where I'm not 
making sense.:)

Basically, what I'm proposing is that module developers stop
distributing 'naked' Python modules and use the package format only
(even when there's only a single .py file involved). We then take all
the other stuff that's traditionally been bundled alongside the
module/package - documentation, unit tests, examples, etc. - and put
those into the package folder too.
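
In other words, a typical distribution might look something like this
(names purely illustrative):

    FooLib/
        __init__.py
        stuff.py        <- sub-modules
        src/            <- sources + compile.py for any extensions
        docs/
        tests/
        examples/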

The term 'package' would basically become redundant; you could just 
describe everything as 'modules' and 'sub-modules'.

It's largely a philosophical shift from treating .py and .so files as 
separate from source files, documentation, unit tests, examples, etc. 
to treating _all_ of them equally: each being an integral component 
of the module/package as a whole.

It won't require any modifications to Python itself, since Python's 
import mechanism already supports the package format. For module 
developers, it's really just a logistical shift from being able to 
distribute 'bare' modules to always using package format. Module 
developers should be happy with this, given that it's much more 
accommodating towards documentation, unit tests, examples, etc.:
stuff they already need to put somewhere, and where better than as
part of the module/package itself? And users should benefit too, as 
they'll always know where to look for documentation, etc.


DU will benefit too, in that distributions will become much simpler
to create: in most cases the only thing the developer will have to do
is zip the package folder before uploading it, something that doesn't
even require DU. (That's what I'm hoping for, anyway. In practice
there might be some reason I'm unaware of why certain platforms would
require all the extra shuffling that DU currently does when
installing packages - creating folders, copying files, etc. I'm not a
cross-platform expert. But I'd be kinda surprised if that were the
case.)
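
For instance, under the assumptions above, 'building' FooLib for
upload could amount to little more than:

    # Sketch: zip the package folder and you have your distribution.
    # Produces FooLib-1.0.zip containing the FooLib/ folder and
    # everything inside it.
    import shutil

    shutil.make_archive('FooLib-1.0', 'zip', root_dir='.',
                        base_dir='FooLib')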


[Sidenote: in an ideal world, a Python end-user should _never_ need
to know whether FooLib exists in bare-module or package form; the
transition from operating in a file-based namespace to a
class-/object-based namespace would be seamless. Python's import
statement is a bit flawed here; e.g. import foo.bar can be used when
bar is a module/package within package foo, but not when it's an
attribute in module foo.]
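
For instance (a contrived illustration of the asymmetry):

    # If 'bar' is a sub-module of package 'foo', this works:
    import foo.bar

    # But if 'foo' is a plain module and 'bar' is merely an attribute
    # defined inside it (a class, function, constant...), the
    # statement above raises ImportError and you have to write:
    from foo import bar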



>>- Question: is there any reason why modules should not be 
>>installable via simple drag-n-drop (GUI) or mv (CLI)? A standard 
>>policy of "the package IS the module" (see above) would allow a 
>>good chunk of both existing and proposed DU "features" to be gotten 
>>rid of completely without any loss of "functionality", greatly 
>>simplifying both build and install procedures.
>
>Again, I don't think it is DU2's role to impose anything concerning
>the target layout. This is often platform dependent anyways.

Not quite sure if we're talking on the same wavelength here. Let me
try to clarify my previous point first, then maybe you can explain
yours to me (feel free to phrase it in terms even an idiot like me
can understand; I won't be offended;).

I'm talking about how a module/package gets put into a suitable
Python directory (e.g. site-packages), which I'm assuming (unless
proven otherwise) only requires knowing which directory to put it in
and moving the module/package there. I'm also assuming that DU should
not need to rearrange the contents of that package folder when
installing it (except perhaps in special cases where it must install
one of several platform-specific versions of a file, say; but that'll
be the exception rather than the rule, and packages that don't
require such special handling shouldn't have to go through the same
in-depth procedures to install).

I can't immediately see anything that DU adds to this process of
duplicating a package folder from A to B, apart from filling my
Terminal window with lots of technical-looking stuff about how it's
creating new directories in site-packages and copying files over to
them. Which looks impressive, but I'm not convinced it's really
necessary, given that a single 'mv' command can shift the package
directory and all its contents over just fine from what I can tell.
And if 95-100% of modules can be installed with just a simple mv,
then let's make that the default procedure for installing modules and
squeeze DU out of that part of the process too.
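
As a sketch of the entire 'install' step I have in mind
(distutils.sysconfig.get_python_lib() is real; the rest is pure
illustration):

    # Sketch: installing a self-contained package = copying its
    # folder into site-packages, contents untouched.
    import os, shutil
    from distutils.sysconfig import get_python_lib

    def install(package_dir):
        target = os.path.join(get_python_lib(),
                              os.path.basename(package_dir))
        shutil.copytree(package_dir, target)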



>>- Replace current system where user must explicitly state what they
>>want included with one where user need only state what they want
>>excluded.
>
>That depends on how much control users want over the process. I believe
>both are equally valid, and should be supported (similar in spirit to
>the MANIFEST.in syntax 'include' and 'exclude')

<ASIDE>
Quick bit of background info so you know where I'm coming from...

I'm also big on the "There should be [preferably] only one way to do 
it" philosophy (one of the things that attracts me to Python). This 
is as much out of necessity as anything, mind: I'm absolutely awful 
at absorbing and retaining technical information, especially compared 
to 'real' programmers who seem to soak up knowledge like a sponge. 
e.g. I admire Perl for its "hell, let's totally go for broke and put 
in _everything_ we can possibly think of" approach and am glad 
there's somebody out there doing it cos then other languages can look 
at Perl to see what's worked and what hasn't and steal the best stuff 
for themselves. But it's not a language I can really use; my brain 
capacity is far too limited to accommodate more than a fraction of 
Perl's vast featureset and rules, so I much prefer to stick to 
tighter languages like Python where I can work at a decent clip 
without having to look up some 1000-page reference book at every 
other line.

Thus I tend to set the bar for feature inclusion pretty high; 
probably much higher than most other programmers who can happily cope 
with a bit of API flab without any problem. Don't take my 
feature-flaying tendencies as a religious thing. It's more a matter 
of simple survival: I can't keep up with y'all otherwise. ;)
</ASIDE>


The problem I see is that manifests seem to be involved whether you 
need/want them or not. If [as I'm assuming] the majority of 
distributions are trivial to assemble, then manifests should be the 
exception, not the rule. I dunno how other folks work, but in my Home 
folder I have a PythonDev folder containing folders for each of my 
module projects - FooDev, BarDev, etc. Within each of these I have a 
folder named Distro, which contains all the files and folders that'll 
go into my distribution.

For me, manifests are nothing but a menace: this folder setup already
makes clear what I want put into the distribution, and I can't see
why I should have to explain it twice to the stupid machine. There
have been several occasions where an error or omission in a manifest
file has gone unnoticed until I've received an email from a user
saying that the package they downloaded is missing some parts
(embarrassing). Right now I manually unzip and check distributions
before uploading, but this is kinda crazy; I shouldn't have to worry
that DU might have screwed up a build, seeing as one of the reasons
for automating the process is to avoid making such mistakes.


Thus my conclusion: explicit inclusion is inherently unsafe; a single
mistake, or simply forgetting to keep the manifest file in sync with
changes to the package, can easily result in a broken distribution.

A much more sensible default is to include everything, and leave it
to the developer to exclude anything they don't want included. The
worst accident likely to occur here with any regularity is that you
forget to strip out a few .pyc files, resulting in a distribution
that's a few KB bigger than it really needs to be. Plus it adheres to
the philosophy that the most common case should require the least
amount of work: in this case, the majority of modules won't ever need
a manifest file and can safely skip it.

[BTW, I'll check out the include/exclude feature, which I wasn't
previously aware of. Though my argument would be that I shouldn't
need to know about such 'advanced' features just to produce a simple,
reliable distribution: the process should be as simple as falling off
a log to begin with.]

...

We can take this manifest issue quite a bit further, btw. Another big
frustration with manifests is that they're quite brain-dead. All I
want to say is "Package everything in folder X for distribution
except for .pyc and .so files". Thus a more pragmatic approach might
be to do away with dumb manifest files completely, and leave the
developer to optionally supply a 'build.py' script that will be
automatically executed as part of the build process.
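
e.g. a hypothetical build.py for the case above might be nothing more
than this (the resulting zip being the distribution, as per the
'package folder = distribution' idea earlier):

    # Hypothetical build.py: "package everything in this folder
    # except .pyc and .so files" expressed as ordinary Python.
    import os, zipfile

    EXCLUDE = ('.pyc', '.pyo', '.so')

    zf = zipfile.ZipFile('FooLib-1.0.zip', 'w', zipfile.ZIP_DEFLATED)
    for dirpath, dirnames, filenames in os.walk('FooLib'):
        for name in filenames:
            if not name.endswith(EXCLUDE):
                zf.write(os.path.join(dirpath, name))
    zf.close()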




>>-- In particular, removing most DU involvment from build procedures 
>>would allow developers to use their own development/build systems 
>>much more easily.
>
>yes !! Though that's more easily said than done: a minimum of collaboration
>between the two is required, at least the adherence to some conventions.

Of course (see earlier comments). Just how many... no, _few_ 
conventions would be needed?


>>- Installation and compilation should be separate procedures.
>
>As a starting point, the whole 'build_ext' mechanism should be re-evaluated.
>The current 'Extension' mechanism is by far not abstract enough. Either
>the build_ext or the Extension class should be made polymorphic to wrap
>any external build system that could be used (make, scons, jam, ...)

Or invert and decouple the process to put the [e.g.] 'src/compile.py'
script in control. In this case, I think we could greatly simplify
the extension-building process if DU can say to the 'src' folder:
"Build me some .so files", then stand back and let it get on with it
(while being happy to lend any support if/when it's asked for).
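
For example, a purely illustrative src/compile.py that just delegates
to the developer's own Makefile (scons, jam, or anything else would
do equally well):

    # Hypothetical src/compile.py: DU says "build me some .so files";
    # how that happens is this package's own business.
    import subprocess
    from distutils import sysconfig

    # Pull the platform details we need from Python itself and hand
    # them to whatever build system we happen to use (here, make).
    subprocess.check_call(
        ['make', 'PYTHON_INCLUDE=' + sysconfig.get_python_inc()])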

[i.e. it's a mental trick I often try when resolving an API design:
seeing if I can switch from a complex 'push' process to a simpler
'pull' process, or vice versa. It can make quite a difference.]


>>- What else may setup.py scripts do apart from install modules (2) 
>>and build extensions (3)?
>
>* building documentation (that, too, is highly domain specific. From
>   Latex over Docbook to doxygen...)

Yup. So let's say we have a standard 'docs' folder within a package,
which may optionally contain a 'format.py' script that will be called
as necessary.


>* running unit tests

Have a standard 'tests' folder containing an optional 'test.py'
script. (Hey, I think I see a pattern evolving here...)
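
The pattern, roughly (folder and script names are just the ones
suggested above, nothing official):

    # Sketch of the emerging convention: each optional task is a
    # well-known script in a well-known folder; DU simply runs
    # whichever of them exist.
    import os, subprocess, sys

    TASKS = {'build': os.path.join('src', 'compile.py'),
             'docs':  os.path.join('docs', 'format.py'),
             'test':  os.path.join('tests', 'test.py')}

    def run_task(package_dir, task):
        script = os.path.join(package_dir, TASKS[task])
        if os.path.exists(script):
            subprocess.check_call([sys.executable,
                                   os.path.basename(script)],
                                  cwd=os.path.dirname(script))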


>>- Remove metadata from setup.py and modules.
>
>I don't quite agree in general. What metadata are we talking about
>anyways ? There's metadata that is to be provided to the packager
>backends, i.e. a package description of some sort. Some of these
>can be generated automatically (such as MANIFEST.in -> MANIFEST,
>build / host platform, etc.), others have to be explicitely provided
>(maintainer address, package description).

I mean user-defined metadata (I'll assume that tools which generate
metadata automatically for their own consumption can be left to
handle it however best suits them):

1. A module may contain various bits of user-defined metadata, e.g.
__version__, __author__. This info is almost certainly recorded
elsewhere, so [afaik] it shouldn't need to be duplicated here.

2. The setup.py script also contains the module name, version,
author, etc... potentially quite a lot of metadata, in fact, all
mooshed together with code for building and installing packages. We
should move this data out of there into a separate, dedicated
metadata file that's included in each package.
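
Purely for illustration, suppose that file were a dumb set of
'Key: value' lines (the filename 'METADATA' is my own invention);
then any client could read it with something like:

    # Sketch: any client, not just DU, can get at a package's
    # metadata once it lives in a plain file inside the package.
    # Assumes a hypothetical RFC-822-style METADATA file of
    # 'Name: ...', 'Version: ...', 'Author: ...' lines.
    import email, os

    def read_metadata(package_dir):
        f = open(os.path.join(package_dir, 'METADATA'))
        try:
            return dict(email.message_from_file(f).items())
        finally:
            f.close()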


>Having 'all metadata' lumped together brings us back to the 'swiss
>army knife' syndrome.

Well, Swiss-Army-ness is always a concern. If it really turns out to
be a problem here, we'd just need more than one metadata file. But I
don't think it'll come to that.

Also, one great advantage of pulling metadata out of module and 
setup.py files is that it'll make it much easier for other clients to 
access it. Right now it's kinda locked away: the only folks who know 
how to access and use it are Python (module metadata) and DU 
(setup.py metadata).


>>- Improve version control. Junk current "operators" scheme (=,
>><, >, >=, <=) as both unnecessarily complex and inadequate (i.e.
>>stating module X requires module Y (>= 1.0) is useless in practice
>>as it's impossible to predict _future_ compatibility). Metadata
>>should support 'Backwards Compatibility' (optional) value indicating
>>earliest version of the module that current version is
>>backwards-compatible with. Dependencies list should declare name and
>>version of each required package (specifically, the version used as
>>package was developed and released).
>
>Good idea, though this issue highly depends on the packager backend used.

Could you cite some examples to help me understand the issues involved?
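
To make my own proposal above a bit more concrete, the check I have
in mind is roughly this (naive dotted-version comparison, purely for
illustration):

    # Sketch: module X declares it was built against Y 1.2; an
    # installed Y 1.5 satisfies that requirement if Y 1.5 declares
    # itself backwards-compatible with 1.2 or earlier.
    def parse(version):
        return tuple(int(part) for part in version.split('.'))

    def satisfies(required, installed, backwards_compatible_since):
        return (parse(backwards_compatible_since) <= parse(required)
                <= parse(installed))

    # e.g. satisfies('1.2', '1.5', '1.0') -> True
    #      satisfies('1.2', '1.5', '1.3') -> False (compatibility
    #      with 1.2 was broken at 1.3)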


>>- Make it easier to have multiple installed versions of a module.
>
>That, too, isn't really an DU2 issue, or is it ?

Not really; more a general packaging and Python import issue. But I 
included it here as I think packaging issues have a big impact on DU 
policy.


>>- Reject PEP 262 (installed packages database). Complex, fragile, 
>>duplication of information, single point of failure reminiscent of 
>>Windows Registry. Exploit the filesystem instead - any info a 
>>separate db system would provide should already be available from 
>>each module's metadata.
>
>I don't quite agree. I couldn't live without rpm these days.

Well, it's not to say that users can't build their own databases 
listing all their installed gunk if they want to. Ensuring user 
freedom in such areas is crucial. Perhaps it would be clearer to say 
that the intention is sound (make information on installed modules 
easy to retrieve; something I'm all for), but the way 262 proposes to 
do it is not.

In fact, one of my main objections to 262 is that it could well
restrict user freedom: by creating lots of dependencies and
synchronisation issues, users could find themselves locked into using
a single 'official' Package Manager because it's the only one smart
enough to deal with all these complexities. Users who venture into
their site-packages folder by any other means will quickly find
themselves being punished by the PackMan Police for unlawful
infractions.

This should be one of the benefits that comes from decoupling module 
metadata from implementation as I've suggested above. There'll be no 
need for a central authority (262's DB) to maintain metadata, because 
each module already contains and looks after its own. And because 
there's only one metadata instance in existence for each module, 
there are no dependency/synchronisation issues to worry about. You can
still provide users with exactly the same API that the 262 DB would 
have done for accessing this info, of course, so you still get all 
the functionality 262 would have provided, but without any of the 
headaches.

Funnily enough though, one of the possible DB implementations floated 
for 262 is to put the metadata for each module into a separate file 
on disk. So perhaps I should say that 262's idea of maintaining a 
_separate_ database simply isn't necessary: all the info it would 
have provided can already be retrieved from files in the filesystem; 
the only difference is that each file is bundled inside its package. The
module/file system _is_ the database, if you like. (After all, what's 
a filesystem but a big ol' object database by any other name?;)


Thanks,

has

p.s. If you're interested, you can see a module system I designed a 
couple years back at applemods.sourceforge.net. It actually uses a 
version of the "module = package = distribution with all batteries 
included" concept I'm floating here. (Which I think was itself 
influenced by Python's package system.)

p.p.s. Anything folk can do to help me understand the issues involved 
in cross-platform and extension compilation lest I spout off too much 
about things I know not will be much appreciated, ta. :)

-- 
http://freespace.virgin.net/hamish.sanderson/


