[Distutils] Extracting C extensions from zipfiles on sys.path (Was: wheels on sys.path clarification (reboot))

Vinay Sajip vinay_sajip at yahoo.co.uk
Thu Jan 30 17:10:05 CET 2014


--------------------------------------------
On Thu, 30/1/14, Paul Moore <p.f.moore at gmail.com> wrote:

 Subject: Extracting C extensions from zipfiles on sys.path (Was: wheels on sys.path clarification (reboot))
 To: "Vinay Sajip" <vinay_sajip at yahoo.co.uk>
 Cc: "Distutils" <distutils-sig at python.org>
 Date: Thursday, 30 January, 2014, 13:23
 
> OK. Note that this is not, in my view, an issue with wheels, but rather
> about zipfiles on sys.path, and (deliberate) design limitations of the
> module loader and zipimport implementations.[1]

Okay, I'm glad that's clarified. Otherwise, there's a danger of it being
conflated with an "importing wheels is bad" viewpoint which relates
even to pure-Python code.

> First of all, it is not possible to load a DLL into a process' memory
> [2, 3] unless it is stored as a file in the filesystem. So any attempt
> to import a C extension from a zipfile must, by necessity, involve
> extracting that DLL to the filesystem. That's where I see the
> problems. None are deal-breaking issues, but they consist of a
> number of niggling issues that cumulatively chip away at the
> reliability of the concept until the end result has enough corner
> cases and risks to make it unacceptable (depending on your
> tolerance for risks - there's a definite judgement call involved).

Okay, let's work through the issues you raise.

> 1. You need to choose a location to put the extracted file. On
> Windows in particular, there is no guaranteed-available
> filesystem location that can be used without risk. Some
> accounts have no home directory, some (locked down) users
> have no permissions anywhere but very specific places, even
> TEMP may not be usable if there's an aggressive housekeeping
> routine in place - but TEMP is probably the best choice of a bad
> lot.

There are always going to be environments where you can't do
stuff, say because of corporate lock-down policies. There is no
requirement on any solution to do the impossible; merely to fail
with an informative error message. There is lots of other
functionality that fails in these environments, too (e.g. access to
the Internet). So my view is that this should not be an obstacle
to developing such functionality for environments where it can
work, as long as it fails fast and informatively when it fails.
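
As a rough illustration of "fail fast and informatively", the check
below tries an actual write in the chosen location and reports the
reason if that fails. It is only a sketch: ensure_writable_dir and
ExtractionLocationError are hypothetical names, not part of distlib
or zipimport.

```python
import os
import tempfile


class ExtractionLocationError(RuntimeError):
    """Raised when no usable directory exists for extracted extensions."""


def ensure_writable_dir(path):
    """Create `path` if needed and prove that a file can be written there.

    Raises ExtractionLocationError, carrying the underlying OS error,
    if the directory cannot be created or written to.
    """
    try:
        os.makedirs(path, exist_ok=True)
        # An actual write is a more direct check than inspecting
        # permission bits, so create and remove a probe file.
        fd, probe = tempfile.mkstemp(dir=path)
        os.close(fd)
        os.remove(probe)
    except OSError as exc:
        raise ExtractionLocationError(
            "cannot extract C extensions to %r: %s" % (path, exc)
        ) from exc
    return path
```

Calling this once, before anything is extracted, would surface a
locked-down environment as one clear error rather than as an
obscure import failure later on.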

> 2. There are race conditions to consider. If the extraction is
> not completely isolated per-process, what if 2 processes
> want to use different versions of the same DLL? How will
> these be distinguished?

Processes are isolated from each other, so that doesn't stop
different processes using different versions of DLLs. Software
in those DLLs needs to be designed to avoid stepping on its
own toes, but that's orthogonal to whether it came from a zip
or not (.NET SxS assemblies, for example - if they have files
they write to, they need to not overwrite each other's stuff).

Distlib covers this by placing the DLL in a location which is
based on the absolute pathname of the wheel it came from.
So any software which uses the exact same wheel will use
the same DLL, but other software which uses a wheel with a
different version (which by definition will have a different
absolute path) will use a different DLL.
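
The idea of keying the extraction location on the wheel's absolute
path can be sketched as follows. This is an illustration of the
principle only, not distlib's actual code: cache_dir_for_wheel is a
hypothetical helper, and hashing the path is just one assumed way of
turning it into a directory name.

```python
import hashlib
import os


def cache_dir_for_wheel(cache_root, wheel_path):
    """Return a per-wheel directory under `cache_root`.

    The name depends only on the wheel's absolute path, so the same
    wheel always maps to the same directory, while a wheel stored at
    a different path gets a different directory.
    """
    abs_path = os.path.normcase(os.path.abspath(wheel_path))
    digest = hashlib.sha256(os.fsencode(abs_path)).hexdigest()
    return os.path.join(cache_root, digest)
```

Two processes mounting the same wheel file would resolve to the same
directory, while a wheel for another version, living at another
path, would resolve elsewhere.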

Perhaps there are holes in this approach - if so, please point
out any that you see.

> 3. Clean-up is an issue. How will the extracted files be
> removed? You can't unload the DLLs from Python, and
> you can't delete open files in Windows. So do you simply
> leave the files lying round? Or do you do some sort of atexit
> dance to run a separate process after the Python process
> terminates which will do the cleanup? What happens
> to that process when virus checkers hold the file open?
> Leaving the files around is probably the most robust answer,
> but it's not exactly friendly.

But it's a drawback of the underlying platform, and it seems to
me OK to do the best that's possible (like we do with pip
updating itself on Windows). Also, it's not clear if you always
want to clean up: perhaps you don't want to extract DLLs
every time if they're already there (let's not go down a cache
invalidation rabbit-hole - later is definitely better than right now ;-)

My view is that cleanup belongs with the application, not the
library - the application developer is best placed to know what
the right thing to do is for that particular application.

This is currently covered in distlib by having an API which
provides the root directory path for the cache. Cache cleanup
can be done on start-up before any wheels are mounted.
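
An application-level clean-up on start-up might look roughly like
the sketch below. It assumes the application already knows the cache
root directory (however it was obtained); clear_extension_cache is a
hypothetical helper, not a distlib API.

```python
import os
import shutil


def clear_extension_cache(cache_root):
    """Remove everything under `cache_root`, keeping the root itself.

    Returns the entries that could not be removed, for example DLLs
    still held open by another process on Windows.
    """
    leftovers = []
    if not os.path.isdir(cache_root):
        return leftovers
    for name in os.listdir(cache_root):
        entry = os.path.join(cache_root, name)
        try:
            if os.path.isdir(entry) and not os.path.islink(entry):
                shutil.rmtree(entry)
            else:
                os.remove(entry)
        except OSError:
            # Leave it for the next start-up rather than failing now.
            leftovers.append(entry)
    return leftovers
```

Run before any wheels are mounted, this leaves the decision about
what to do with stubborn files to the application.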

By the way, surely you've seen how much cruft accumulates
in TEMP on Windows machines? It's not as if Windows users'
expectations can be particularly high here ;-) I'm all for keeping
things tidy, of course.

> The only place where having a wheel rather than a general
> zipfile makes a difference is that a wheel *might* at some
> point contain metadata that allows the wheel to claim that it's
> "OK" to load its contents from a zipfile.
> But my points above are not something that the author of the
> C extension can address, so there's no way that I can see
> that an extension author can justifiably set that flag.

It's not the extension author exactly, it's the wheel packager.
In a corporate environment, they might be someone in a 
systems integrator role. Even if they are one and the same,
the assertion is that the wheel is designed to run from a zip.
Beyond that, it's up to the application developer and/or
systems integrator: it doesn't mean it will work in every
circumstance as one would wish. Are you telling me that
most Python packages on PyPI, conventionally installed, 
will handle gracefully an out-of-disk-space condition? Where
the cause of the failure is immediately apparent rather than
"weird" at first glance? I doubt it.

> Ideally, if these problems can be solved, the solution should
> be included in the core zipimport module so that all users
> can benefit. If there are still issues to iron out and

Nick and I have both given reasons why zipimport might not be
best placed to pioneer this. Although you are not concerned
with binary compatibility, it is a valid concern which needs
addressing, and bolstering the WHEEL metadata seems the
right place for such work.

> but baking the feature into wheel mount just limits your user
> base (and hence your audience for raising bug reports, etc)
> needlessly.

I'm not hung up about exactly where the functionality gets
implemented, just that it's useful. It would seem better to focus
on real issues (like the ones you've raised, and the ones I
raised about binary compatibility) rather than debating how best
to package it. If someone is interested in developing this area,
they will put in the work of looking at the issues and coming up
with ideas to address them, whether it's package X or Y. What
makes you think an enhanced third-party zipimport module is
suddenly going to get lots of eyeballs? The functionality in distlib
as a whole is a lot more useful (this being a tiny corner of it), but
there aren't too many eyeballs on that.

> implementation of zipimport, and who has kept an interested
> eye on how it has been used in the 11 years since its
> introduction - and in particular how people have tried to
> overcome the limitations we felt we had to impose when
> designing it.

I didn't know - thanks for your work on zipimport, I think it's great.
Surely 11 years is long enough for that initial functionality to
have bedded down? Often, getting a new feature in means
working to a feature-freeze deadline where not every avenue
can be explored. That's par for the course, especially where
hard technical problems are to be faced. But, surely there
comes a time when it's worth taking another look, and seeing
if we can push the envelope further?

I hope that in the above I've addressed at least in part the
issues you've raised - I'm sure you'll tell me if not.
 
> choice to only look at pure Python files, because the
> platform issues around C extensions were "too hard".

Were those just the issues you raised here? Wasn't
binary compatibility discussed?

> There is, I believe, code "out there" on the internet to
> map a DLL image into a process based purely in memory,

I would discount this: any solution has to work on multiple
Windows versions and the lower level the solution, the more
the risk. We're talking (in the current implementation) just
about file-system operations and imp.load_dynamic, which are
fairly mature and well understood by comparison.
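
The two steps involved (copy the extension out of the zip, then load
it from the filesystem) can be sketched with the standard library
alone. This is not the distlib implementation: extract_extension and
load_extension are hypothetical helpers, and the loading step uses
importlib because the imp module was removed in Python 3.12.

```python
import importlib.util
import os
import zipfile


def extract_extension(zip_path, member, target_dir):
    """Copy one archive member to `target_dir` and return its new path."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(member))
    with zipfile.ZipFile(zip_path) as zf, zf.open(member) as src:
        data = src.read()
    # Write under a temporary name first, then rename, so another
    # process never sees a half-written file.
    tmp = target + ".%d.tmp" % os.getpid()
    with open(tmp, "wb") as dst:
        dst.write(data)
    os.replace(tmp, target)
    return target


def load_extension(name, path):
    """Load the extension module at `path` under the module name `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Only ordinary file operations and the regular extension loader are
involved; nothing here maps a DLL image into memory by hand.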

> [5] To be fair, this is where the wheel metadata might help
> in distinguishing. But consider development and testing,
> where repeated test runs would not typically have different
> versions, but the user might well want to test whether
> running from zip still works.

What's wrong with having the test setup code clearing the DLL
cache for every run? Clearly the wheel has to be rebuilt for
each run, but that's not going to be a show-stopper if the tests
are arranged optimally.

Anyway, thanks for taking the time to raise the issues in detail.
This kind of discussion will hopefully help to move things
forward.

Regards,

Vinay Sajip

