[Distutils] GSoC 2017 - RFC on Proposal

Pradyun Gedam pradyunsg at gmail.com
Sat Mar 25 11:50:17 EDT 2017


I apologize for the duplicate mail. Please respond on this thread.

On Sat, 25 Mar 2017 at 21:16 Pradyun Gedam <pradyunsg at gmail.com> wrote:

> Hello Everyone!
>
> I had previously sent a mail on this list, stating that I would like to
> work on pip's dependency resolution for my GSoC 2017. I now have drafted a
> proposal for the same; with help from my mentors - Donald Stufft and Justin
> Cappos. I'd also take this opportunity to thank them for agreeing to be my
> mentors for this GSoC.
>
> I would like to request for comments on my proposal for GSoC - it is
> hosted at
> https://gist.github.com/pradyunsg/5cf4a35b81f08b6432f280aba6f511eb.
>
> Please find trailing a MarkDown version of the proposal.
>
> Thanks,
> Pradyun Gedam
>
> -----
>
> # Adding Proper Dependency Resolution to pip
>
> - **Name:** Pradyun S. Gedam
> - **Email:** [pradyunsg at gmail.com][mailto-email]
> - **Github:** [pradyunsg][github-profile]
> - **University:** [VIT University, Vellore, India][vit-homepage]
> - **Course:** Bachelor of Technology in Computer Science and Engineering
> - **Course Term:** 2016/17 - 2019/20 (4 Year)
> - **Timezone:** IST (GMT +5:30)
> - **GSoC Blog RSS Feed URL:** <
> https://pradyunsg.github.io/gsoc-2017/feed.xml>
>
> [github-profile]: http://github.com/pradyunsg/
> [vit-homepage]: http://vit.ac.in/
> [mailto-email]: mailto:pradyunsg at gmail.com
>
> ## About Me
>
> I was introduced to Python about five years ago, through "Core Python
> Programming" by Weasley J Chun. After the initial struggle with Python
> 2/3, the
> ball was set rolling and hasn't stopped since. I have fiddled around with
> Game Programming (PyGame), Computer Vision (OpenCV), Data Analytics
> (Pandas,
> SciPy, NumPy), transcompilers (Py2C) and more.
>
> As a high school student, I did internship at Enthought in 2013 and 2014.
> The two summers that I spent at Enthought were a great learning experience
> and
> I thoroughly enjoyed the environment there.
>
> Other than Python, I have also used C, C++, Web Technologies (JavaScript,
> HTML,
> CSS) and preprocessors (Pug, TypeScript, LESS/SASS/SCSS), Java and
> Bash/Zsh for
> some other projects.
>
> Curious to understand how pip works, I began looking into pip's source
> code.
> I started contributing to pip in May 2016. I am now fairly familiar with
> the
> design of pip and have a fair understanding of how it works, due to the
> various
> contributions I have made to pip in the past year.
>
> ## Mentors
>
> - Donald Stufft (@dstufft on GitHub)
> - Justin Cappos (@JustinCappos on GitHub)
>
> Communication with the mentors will be done mostly on issues and pull
> requests
> on pip's GitHub repository. If at any point in time, a real time
> discussion is
> to be done with the mentors, it would be done over IRC or Skype. Email can
> also
> be used if needed.
>
> ## Proposal
>
> This project is regarding improving dependency resolution performed within
> pip
> by implementing a dependency resolver within it.
>
> ### Abstract
>
> Currently, pip does not resolve dependencies correctly when there are
> conflicting requirements. The lack of dependency resolution has caused
> hard-to-debug bugs/failures due to the installation of incompatible
> packages.
> The lack of a dependency resolver is also a blocker for various other
> features -
> adding an upgrade-all functionality to pip and properly determining
> build-time
> dependencies for packages are two such features.
>
> ### Deliverables
>
> At the end of this project, pip will have the ability to:
>
> - detect requirement conflicts
> - resolve conflicting dependency requirements where possible
>
> ### Implementation
>
> The implementation language will be Python. The code will maintain the
> compatibility requirements of pip - the same source code will support the
> multiple Python implementations and version, including but not limited to,
> CPython 2.7, CPython 3.3+, PyPy 2, PyPy3.
>
> New Tests for the functionality introduced will be added to pip's current
> test
> suite.
>
> User documentation would not be a major part of this project. The only
> changes
> would be to mention pip can now resolve dependencies properly. There are
> certain
> sections that might need updating.
>
> #### Overview
>
> The project will be composed of the following stages:
>
> 1. Refactor the dependency resolution logic of pip into a separate module.
> 1. Implement dependency information caching in pip.
> 1. Implement a backtracking dependency resolver to resolve the
> dependencies.
>
> Every stage depends on the previous ones being completed. This step-wise
> approach would make incremental improvements so that there is a constant
> feedback on the work being done as well as making it easy to change course
> without losing existing work; if needed for some unforeseen reason.
>
> #### Discussion
>
> There is a class in pip - `RequirementSet`, which is currently a God
> class. It
> is responsible for the following:
>
>   1. Holding a list of things to be installed
>   1. Downloading Files
>   1. Dependency Resolution
>   1. Unpacking downloaded files (preparation for installation)
>   1. Uninstalling packages
>   1. Installing packages
>
> This is clearly a bad situation since this is most of the heavy lifting
> involved in pip. These responsibilities need to be separated and moved out
> into
> their independent modules/classes, to allow for simplification of
> `RequirementSet` while providing a clean platform for the remaining work.
> This is the most tricky portion of this project, given the complexity of
> `RequirementSet` as it stands today.
>
> There are two kinds of distributions that may be used to install a package
> -
> wheels (binary) and sdists (source). When installing a package, pip builds
> a
> wheel from an sdist and then proceeds to install the wheel. The difference
> between the two formats of distribution relevant to this project is -
> wheels
> store the information about dependencies within them statically; sdists do
> not.
>
> Determining the dependencies of a wheel distribution is merely the matter
> of
> fetching the information from a METADATA file within the `.whl` file. The
> dependency information for an sdist, on the other hand, can only be
> determined
> after running its `setup.py` file on the target system. This means that
> dependencies of an sdist depend on how its `setup.py` behaves which can
> change
> due to variations on target systems or could even contain through random
> choices.
>
> Computation of an sdist's dependencies on the target system is a
> time-consuming
> task since it potentially involves fetching a package from PyPI and
> executing
> it's setup.py to get the dependency information. In order to improve
> performance, once an sdist's dependencies are computed, they would be
> stored in
> a cache so that during dependency resolution, the dependencies of an sdist
> are
> not computed every time they are needed. Further, pip caches wheels it
> downloads or builds meaning that any installed package or downloaded
> wheel's
> dependency information would available statically, without the need to go
> through the sdist dependency cache.
>
> Like the wheel cache, sdist-dependency-cache will be a file system based
> cache.
> The sdist-dependency-cache would only be used if the corresponding sdist is
> being used.
>
> Since sdist dependencies are non-deterministic, the cached dependency
> information is potentially incorrect - in certain corner cases such as
> using
> random choices in setup.py files. Such uses are not seen as important
> enough to
> cater to, compared the benefits of having a cache. Further, this is
> already the
> case with the information in the wheel cache.
>
> SAT solver based resolution is not feasible for pip since a SAT solver
> needs
> the entire set of packages and their dependencies to compute a solution,
> which
> cannot be generated due to the aforementioned non-deterministic behaviour
> of
> setup.py file. Computing dependency information for all of PyPI on a target
> system for installing a package is simply not feasible.
>
> The most reasonable approach is using a backtracking solver. Such a solver
> can
> be incrementally provided information about the dependencies of a package
> and
> would only need dependency information about packages in the dependency
> graph
> of the current system.
>
> There is a need to keep a cache of visited packages during dependency
> resolution. A certain package-version combination may be reached via
> multiple
> paths and it is an inefficient use of computation time to re-compute that
> whether it is indeed going to satisfy the requirements or not. By storing
> information about which package-version combinations have been visited and
> do
> (or do not) satisfy the constraints, it is possible to speedup the
> resolution.
>
> Consider the following example:
>
> ```
> A-1.0 (depends: B)
> A-2.0 (depends: B and E-1.0)
> B-1.0 (depends: C and D)
> C-1.0 (depends: D)
> D-1.0
> D-1.1 (depends: E-2.0)
> E-1.0
> E-2.0
> ```
>
> If an installation of A is required, either A-2.0 or D-1.1 should not be
> installed because they have a conflicting requirement - E. While there are
> multiple possible solutions to this situation, the "most obvious" one us
> to use
> the D-1.0, instead of not installing A-2.0. Further, as multiple packages
> depend on D, the algorithm would "reach it" multiple times. By maintaining
> a
> cache for the visited packages, it is possible to achieve a speedup in
> such a
> scenario.
>
> #### Details
>
> Pull requests would be made on a regular basis during the project to ensure
> that the feedback loop is quick. This also reduces the possibilities of
> conflicts due to unrelated changes in pip.
>
> All the code will be tested within pip's existing testing infrastructure.
> It
> has everything needed to write tests related to all the changes to be made.
> Every PR made to pip as a part of this project will contain new tests or
> modifications to existing ones as needed.
>
> ##### Stage 1
>
> Initially, some abstractions will be introduced to the pip codebase to
> improve
> the reuse of certain common patterns within the codebase. This includes
> cleaner
> temporary directory management through a `TempDirectory` class.
>
> `RequirementSet.prepare_files` and `RequirementSet._prepare_file` are
> downloading, unpacking packages as well as doing what pip does as
> dependency
> resolution. Taking these functions apart neatly is going to be a tricky
> task.
>
> The following is a listing of the final modules that will be responsible
> for
> the various tasks that are currently being performed by `RequirementSet`:
>
> - `pip.resolve` - Dependency Resolution
> - `pip.download` - Downloading Files & Unpacking downloaded files
> - `pip.req.req_set` - Holding a list of things to be installed /
> uninstalled
> - `pip.operations.install` - Installing Packages (preparation for
> installation)
> - `pip.operations.uninstall` - Uninstalling Packages
>
> To be able to proceed to the next step, only the dependency resolution
> related
> code needs to be refactored into a separate module. Other portions of
> `RequirementSet` are not required to be refactored but the same will be
> tackled
> as optional deliverables. In other words, only `pip.resolve` needs to be
> completed to be able to proceed to the next stage in this project. This is
> needed since in Stage 3, the resolver would be written in `pip.resolve`,
> independent of the rest of the codebase.
>
> ##### Stage 2
>
> A new module `pip.cache` will be created. Within this module, all the
> caching
> will be handled. Thus, the code for the current wheel cache would be moved.
> The new code for a dependency cache for sdists would also be written here.
>
> The new cache would hold all of an sdist's egg-info. The information will
> be
> stored on the file-system in a sub directory structure much like that of
> the
> wheel cache, in a directory structure based on hash of the sdist file
> holding
> the egg-info at the end. This will be done in a class named `EggInfoCache`.
>
> `EggInfoCache` cache will be used only if a corresponding wheel for an
> sdist is
> not available. Installing an sdist results in the creation of a wheel which
> contains the dependency information, which would be used over the
> information
> available in the `EggInfoCache`.
>
> To be able to proceed to the next step, it is required that `EggInfoCache`
> is
> implemented.
>
> ##### Stage 3
>
> The module `pip.resolve` will be modified and a class named
> `BacktrackingResolver` will be added to it. The class does what you expect
> it
> to do - it would resolve dependencies with recursive backtracking. As
> described
> above, there will be some global state stored about the packages that have
> been
> explored. Other than the maintenance of global state, in the form of the
> cache,
> the rest of the algorithm will essentially follow the same structure as any
> backtracking algorithm.
>
> The project would be complete when the aforementioned resolver is
> implemented.
>
> #### Existing Work
>
> There is existing work directly related to dependency resolution in pip,
> done by
> multiple individuals.
>
> - Robert Collins (un-merged closed pull request on pypa/pip)
>
>   This has an incomplete backtracking dependency resolver and dependency
>   caching.
>
> - Sebastien Awwad (branch on a separate fork)
>
>   This was used for the depresolve project, to investigate the state of
>   Dependency Resolution in PyPI/pip ecosystem.
>
> - `pip-compile` (separate project)
>
>   This has a backtracking dependency resolver implemented to overcome pip's
>   inablity to resolve dependencies.
>
> Their work would be used for reference, where appropriate, during the
> course of
> the project. Further, there are many package managers which implement
> dependency
> resolution, which would also be looked into.
>
> ### Tentative Timeline
>
> - Community Bonding Period: **5 May - 29 May**
>
>   - Clean up and finalize my existing pull requests.
>   - Read existing literature regarding dependency resolution.
>   - Inspect existing implementations of dependency resolvers.
>
>   GOAL: Be ready for the work coming up.
>
> - Week 1: **30 May - 5 June**
>
>   - Introduce abstractions across pip's codebase to make refactoring
>     `RequirementSet` easier.
>   - Begin breaking down `RequirementSet.prepare_file`.
>
> - Week 2: **6 June - 12 June**
>
>   - Continue working on the refactor of `RequirementSet`.
>
> - Week 3: **13 June - 19 June**
>
>   - Continue working on the refactor of `RequirementSet`.
>   - Finish and polish `pip.resolve`.
>
>   GOAL: `pip.resolve` module will be ready, using the current resolution
>         strategy.
>
> - Week 4: **20 June - 26 June**
>
>   - Finish and polish all work on the refactor of `RequirementSet`.
>
> - Week 5: **27 June - 3 July**
>
>   - Move code for `WheelCache` into a new `pip.cache` module.
>   - Write tests for `pip.cache.EggInfoCache`, based on `WheelCache`.
>   - Begin implementation of `pip.cache.EggInfoCache`.
>
> - Week 6: **4 July - 10 July**
>
>   - Finish and polish `pip.cache.EggInfoCache`.
>
>   GOAL: A cache for storing dependency information of sdists would be
> ready to
>         add to pip.
>
> - Week 7: **10 July - 16 July**
>
>   - Create a comprehensive set of tests for the dependency resolver.
>     (in `tests/unit/test_resolve.py`)
>   - Begin implementation on the backtracking algorithm.
>
>   GOAL: A comprehensive test bed is ready for testing the dependency
> resolver.
>
> - Week 8: **17 July - 24 July**
>
>   - Complete a rough implementation of the backtracking algorithm
>
>   GOAL: An implementation of a dependency resolver to begin running tests
> on
>   and work on improving.
>
> - Week 9: **25 July - 31 July**
>
>   - Fixing bugs in dependency resolver
>
> - Week 10: **1 August - 6 August**
>
>   - Finish and polish work on dependency resolver
>
>   GOAL: A ready-to-merge PR adding a backtracking dependency resolver for
> pip.
>
> - Week 11: **6 August - 13 August**
>
>   Buffer Week.
>
> - Week 12: **13 August - 21 August**
>
>   Buffer Week. Finalization of project for evaluation.
>
> If the deliverable is achieved ahead of schedule, the remaining time will
> be
> utilized to resolve open issues on pip's repository in the order of
> priority as
> determined under the guidance of the mentors.
>
> #### Other Commitments
>
> I expect to not be able to put in 40 hour/week for at most 3 weeks
> throughout
> the working period, due to the schedule of my university.
>
> I will have semester-end examinations - from 10th May 2017 to 24th May
> 2017 -
> during the Community Bonding Period. My university will re-open for my
> third
> semester on 12 July 2017. I expect mid-semester examinations to be held in
> my
> University around 20th August 2017. During these times, I would not be
> able to
> put in full 40 hour weeks due to the academic workload.
>
> I might take a 3-4 day break during this period, regarding which I would be
> informing my mentor around a week in advance.
>
> I will be completely free from 30th May 2017 to 11 July 2017.
>
> ### Future Work
>
> There seems to be some interest in being able to reuse the above dependency
> resolution algorithm in other packaging related tools, specifically from
> the
> buildout project. I intend to eventually move the dependency resolution
> code
> that would come out of this project into a separate library to allow for
> reuse
> by installer projects - pip, buildout and other tools.
>
> ## Previous Contributions to pip
>
> (As on 12th March 2017)
>
> ### Issues
>
> Authored:
>
> - #3785 - Prefering wheel-based installation over source-based
> installation (Open)
> - #3786 - Make install command upgrade packages by default (Closed)
> - #3787 - Check if pip broke the dependency graph and warn the user (Open)
> - #3807 - Tests fail since change on PyPI (Closed)
> - #3809 - Switch to TOML for configuration files (Open)
> - #3871 - Provide a way to perform non-eager upgrades (Closed)
> - #4198 - Travis CI - pypy broken dues to dependency change in pycrypto
> (Closed)
> - #4282 - What's the release schedule? (Closed)
>
> Participated:
>
> - #59 - Add "upgrade" and "upgrade-all" commands (Open)
> - #988 - Pip needs a dependency resolver (Open)
> - #1056 - pip install -U does not remember whether a package was installed
> with --user (Open)
> - #1511 - ssl certificate hostname mismatch errors presented badly (Open)
> - #1668 - Default to --user  (Open)
> - #1736 - Create a command to make it easy to access the configuration
> file (Open)
> - #1737 - Don't tell the user what they meant, just do what they meant
> (Open)
> - #2313 - Automated the Creation and Upload of Release Artifacts (Open)
> - #2732 - pip install hangs with interactive setup.py setups (Open)
> - #3549 - pip -U pip fails (Open)
> - #3580 - Update requests/urllib3 (Closed)
> - #3610 - pip install from package from github, with github dependencies
> (Open)
> - #3788 - `pip` version suggested is older than the version which is
> installed (Closed)
> - #3789 - Error installing Mayavi in Mac OS X  (Closed)
> - #3798 - On using python -m pip install -upgrade pip.. its throwing an
> error like the one below (Closed)
> - #3811 - no matching distribution found for install (Closed)
> - #3814 - pip could not find a version that satisfies the requirement
> oslo.context (Closed)
> - #3876 - support git refs in @ syntax (Open)
> - #4021 - Will you accept PRs with pep484 type hints?  (Open)
> - #4087 - pip list produces error (Closed)
> - #4149 - Exception thrown when binary is already linked to /usr/local/bin
> (Open)
> - #4160 - Pip does not seem to be handling deep requirements correctly
> (Open)
> - #4162 - Let --find-links be context aware to support github, gitlab,
> etc. links (Open)
> - #4170 - pip list |head raises BrokenPipeError (Open)
> - #4182 - pip install should install packages in order to avoid ABI
> incompatibilities in compiled  (Open)
> - #4186 - IOError: [Errno 13] Permission denied: '/usr/local/bin/pip'
> (Open)
> - #4206 - Where on Windows 10 is pip.conf or pip.ini located? (Closed)
> - #4221 - Feature request: Check if user has permissions before
> downloading files (Closed)
> - #4229 - "pip uninstall" is too noisy (Open)
>
> #### Pull Requests
>
> Authored:
>
> - #3806 - Change install command's default behaviour to upgrade packages
> by default (Closed, superseded by #3972)
> - #3808 - Fix Tests (Merged)
> - #3818 - Improve UX and tests of check command (Merged)
> - #3972 - Add an upgrade-strategy option (Merged)
> - #3974 - [minor] An aesthetic change to wheel command source (Merged)
> - #4192 - Move out all the config code to a separate module (Merged)
> - #4193 - Add the ability to autocorrect a user's command (Open)
> - #4199 - Fix Tests for Travis CI (Merged)
> - #4200 - Reduce the API exposed by the configuration class (Merged)
> - #4232 - Update documentation to mention upgrade-strategy (Merged)
> - #4233 - Nicer permissions error message (Open)
> - #4240 - [WIP] Add a configuration command (Open)
>
> Participated:
>
> - #2716 - Issue #988: new resolver for pip (Closed)
> - #2975 - Different mock dependencies based on Python version (Merged)
> - #3744 - Add a "Upgrade all local installed packages" (Open)
> - #3750 - Add a `pip check` command. (Merged)
> - #3751 - tox.ini: Add "cover" target (Open)
> - #3794 - Use the canonicalize_name function for finding .dist-info
> (Merged)
> - #4142 - Optionally load C dependencies based on platform (Open)
> - #4144 - Install build dependencies as specified in PEP 518 (Open)
> - #4150 - Clarify default for --upgrade-strategy is eager (Merged)
> - #4194 - Allow passing a --no-progress-bar to the install script to
> surpress progress bar (Merged)
> - #4201 - Register req_to_install for cleanup sooner (Merged)
> - #4202 - Switch to 3.6.0 final as our latest 3.x release (Merged)
> - #4211 - improve message when installing requirements file (#4127)
> (Merged)
> - #4241 - Python 3.6 is supported (Merged)
>
> ## References
>
> 1. [pypa/pip#988](https://github.com/pypa/pip/issues/988)
>
>    Tracking issue for adding a proper dependency resolver in pip. Contains
>    links to various useful resources.
>
> 1. [pypa/pip#2716](https://github.com/pypa/pip/issues/2716)
>
>    Prior work by Robert Collins for adding a proper dependency resolver in
> pip.
>
> 1. [Python Dependency Resolution](
> https://docs.google.com/document/d/1x_VrNtXCup75qA3glDd2fQOB2TakldwjKZ6pXaAjAfg/edit?usp=sharing
> )
>
>    A writeup by Sebastian Awwad on the current state of dependency
> resolution
>    in pip and PyPI in general.
>
> 1. [PSF Application Template](
> https://wiki.python.org/moin/SummerOfCode/ApplicationTemplate2017)
>
>    For guidance on how to write the application and what information is
> needed.
>
> 1. [Stork: Secure Package Management for VM Environments](
> http://isis.poly.edu/~jcappos/papers/cappos_stork_dissertation_08.pdf)
>
>    A Paper by Justin Cappos about Stork, used for reference regarding
>    Backtracking Resolution
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/distutils-sig/attachments/20170325/336c36dc/attachment-0001.html>


More information about the Distutils-SIG mailing list