[Distutils] distlib and wheel metadata

Nick Coghlan ncoghlan at gmail.com
Wed Feb 15 09:55:53 EST 2017


 On 15 February 2017 at 14:00, Wes Turner <wes.turner at gmail.com> wrote:
> On Wed, Feb 15, 2017 at 5:33 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> I asked Daniel to *stop* using pydist.json, since wheel was emitting a
>> point-in-time snapshot of PEP 426 (which includes a lot of
>> potentially-nice-to-have things that nobody has actually implemented
>> so far, like the semantic dependency declarations and the enhancements
>> to the extras syntax), rather than the final version of the spec.
>
> Would you send a link to the source for this?

It came up when Vinay reported a problem with the way bdist_wheel was
handling combined extras and environment marker definitions:
https://bitbucket.org/pypa/wheel/issues/103/problem-with-currently-generated

>> - dist-info/METADATA as defined at
>> https://packaging.python.org/specifications/#package-distribution-metadata
>> - dist-info/requires.txt runtime dependencies as defined at
>> http://setuptools.readthedocs.io/en/latest/formats.html#requires-txt
>> - dist-info/setup_requires.txt build time dependencies as defined at
>> http://setuptools.readthedocs.io/en/latest/formats.html#setup-requires-txt
>>
>> The dependency fields in METADATA itself unfortunately aren't really
>> useful for anything.
>
> Graph: Nodes and edges.

Unfortunately, it's not that simple, since:

- dependency declarations refer to time dependent node *sets*, not to
specific edges
- node resolution is not only time dependent, but also DNS and client
configuration dependent
- this is true even for "pinned" dependencies due to the way "=="
handles post-releases and local build IDs
- the legacy module based declarations are inconsistently populated
and don't refer to nodes by a useful name
- the new distribution package based declarations refer to nodes by a
useful name, but largely aren't populated
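
To make the "==" point concrete, here's a quick illustration using the
packaging library (this is the sketch referenced in the list above; the
behaviour shown is PEP 440's, the snippet itself is just illustrative):

    from packaging.specifiers import Specifier

    # A bare "==" pin ignores local version labels, so it still matches
    # locally rebuilt artifacts:
    Specifier("==1.4").contains("1.4+rebuild.1")   # True

    # Prefix matching pulls in post-releases as well:
    Specifier("==1.4.*").contains("1.4.post2")     # True
    Specifier("==1.4").contains("1.4.post2")       # False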

By contrast, METADATA *does* usefully define nodes in the graph, while
requires.txt and setup_requires.txt can be used to extract edges when
combined with suitable additional data sources (primarily a nominated
index server or set of index servers to use for dependency specifier
resolution).
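
For instance, pulling the raw edge data out of requires.txt is
straightforward (a sketch only - the helper name is mine, the section
handling follows the setuptools format linked above):

    from packaging.requirements import Requirement

    def iter_requires_txt(text):
        """Yield (section, Requirement) pairs from a requires.txt body.

        The section is None for unconditional dependencies, or the
        "[extra]" / "[extra:marker]" header written by setuptools.
        """
        section = None
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("[") and line.endswith("]"):
                section = line[1:-1]
                continue
            yield section, Requirement(line)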

>> There's definitely still a place for a pydist.json created by going
>> through PEP 426, comparing it to what bdist_wheel already does to
>> populate metadata.json, and either changing the PEP to match the
>> existing practice, or else agreeing that we prefer what the PEP
>> recommends, that we want to move in that direction, and that there's a
>> definite commitment to implement the changes in at least setuptools
>> and bdist_wheel (plus a migration strategy that allows for reasonably
>> sensible consumption of old metadata).
>
> Which function reads metadata.json?

Likely eventually nothing, since anything important that it contains
will be readable either from pydist.json or from the other legacy
metadata files.

> Which function reads pydist.json?

Eventually everything, with tools falling back to dynamically
generating it from legacy metadata formats as a transition plan to
handle component releases made with older toolchains.
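
In practice that means consumers would end up with something along
these lines (purely a sketch - the function name and return types are
made up for illustration):

    import json
    import os
    from email.parser import Parser

    def read_dist_metadata(dist_info_dir):
        """Prefer pydist.json, falling back to the legacy METADATA file."""
        pydist = os.path.join(dist_info_dir, "pydist.json")
        if os.path.exists(pydist):
            with open(pydist) as f:
                return json.load(f)
        # Legacy fallback: METADATA uses email-style headers
        with open(os.path.join(dist_info_dir, "METADATA")) as f:
            return Parser().parse(f)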

> An RDFS Vocabulary contains Classes and Properties with rdfs:ranges and
> rdfs:domains.
>
> There are many representations for RDF: RDF/XML, Turtle/N3, JSONLD.
>
> RDF is implementation-neutral. JSONLD is implementation-neutral.

While true, both of these are still oriented towards working with a
*resolved* graph snapshot, rather than a deliberately underspecified
graph description that requires subsequent resolution within the node
set of a particular index server (or set of index servers).

Just incorporating the time dimension is already messy, even before
accounting for the fact that the metadata carried along with the
artifacts is designed to be independent of the particular server that
happens to be hosting them.

Tangent: if anyone is looking for an open source stack for working
with distributed graph storage manipulation from Python, the
combination of http://janusgraph.org/ and
https://pypi.org/project/gremlinpython/ is well worth a look ;)

>> The equivalent for PEP 426 would probably be legacy-to-pydist and
>> pydist-to-legacy converters that setuptools, bdist_wheel and other
>> publishing tools can use to ship legacy metadata alongside the
>> standardised format (and I believe Daniel already has at least the
>> former in order to generate metadata.json in bdist_wheel). With PEP
>> 426 as currently written, a pydist-to-legacy converter isn't really
>> feasible, since pydist proposes new concepts that can't be readily
>> represented in the old format.
>
> pydist-to-legacy would be a lossy transformation.

Given appropriate use of the "extras" system and a couple of new
METADATA fields, it doesn't have to be, at least for the initial
version - that's the new design constraint I'm proposing for
everything that isn't defined as a metadata extension.

The rationale being that if legacy dependency metadata can be reliably
generated from the new format, that creates an incentive for *new*
tools to adopt it ("generate the new format, get the legacy formats
for free"), while also offering a clear migration path for existing
publishing tools (refactor their metadata generation to produce the
new format only, then derive the legacy metadata files from that) and
consumption tools (consume the new fields immediately, look at
consuming the new files later).
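
As a rough illustration of "generate the new format, get the legacy
formats for free" (the run_requires structure below follows the current
PEP 426 draft, but treat the details as assumptions rather than settled
spec):

    def legacy_requires_dist(pydist):
        """Derive legacy Requires-Dist lines from a PEP 426 style dict."""
        lines = []
        for group in pydist.get("run_requires", []):
            markers = []
            if "extra" in group:
                markers.append('extra == "%s"' % group["extra"])
            if "environment" in group:
                markers.append(group["environment"])
            suffix = ("; " + " and ".join(markers)) if markers else ""
            for req in group["requires"]:
                lines.append("Requires-Dist: %s%s" % (req, suffix))
        return lines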

>> > I understand that social reasons are often more important than technical
>> > reasons
>> > when it comes to success or failure of an approach; I'm just not sure
>> > that
>> > in this case, it wasn't given up on too early.
>>
>> I think of PEP 426 as "deferred indefinitely pending specific
>> practical problems to provide clearer design constraints" rather than
>> abandoned :)
>
> Is it too late to request lowercased property names without dashes?

That's already the case in PEP 426 as far as I know.


> class PackageMetadata
>     def __init__():
>         self.data = collections.OrderedDict()
>     @staticmethod
>     def read_legacy()
>     def read_metadata_json()
>     def read_pydist_json()
>     def read_pyproject_toml()
>     def read_jsonld()
>
>     def to_legacy():
>     def to_metadata_json()
>     def to_pydist_json()
>     def to_pyproject_toml()
>     def to_jsonld()
>
>     @classmethod
>     def Legacy()
>     def MetadataJson()
>     def PydistJson()
>     def PyprojectToml()
>     def Jsonld(cls, *args, **kwargs)
>         obj = cls(*args, **kwargs)
>         obj.read_jsonld(*args, **kwargs)
>         return obj
>
>     @classmethod
>     def from(cls, path,
> format='legacy|metadatajson|pydistjson|pyprojecttoml|jsonld'):
>         # or this
>
>
> ... for maximum reusability, we really shouldn't need an adapter registry
> here;

I'm not really worried about the Python API at this point; I'm
interested in the isomorphism of the data formats, to help streamline
the migration (as that's the current main problem with PEP 426).

But yes, just as packaging grew "LegacyVersion" *after* PEP 440
defined the strict forward looking semantics, it will likely grow some
additional tools for reading and converting the legacy formats once
there's a clear pydist.json specification to document the semantics of
the translated fields.

>> 2. the new pipenv project to provide a simpler alternative to the
>> pip+virtualenv+pip-tools combination for environment management in web
>> service development (and similar layered application architectures).
>> As with the "install vs setup" split in setuptools, pipenv settled on
>> an "only two kinds of requirement (deployment and development)" model
>> for usability reasons, but it also distinguishes abstract dependencies
>> stored in Pipfile from pinned concrete dependencies stored in
>> Pipfile.lock.
>
> Does the Pipfile/Pipfile.lock distinction overlap with 'integrates' as a
> replacement for meta_requires?

Somewhat - the difference is that where the concrete dependencies in
Pipfile.lock are derived from the abstract dependencies in Pipfile,
the separation in pydist.json would be a declaration of "Yes, I really
did mean to publish this with a concrete dependency, it's not an
accident".

>> If we put those together with the existing interest in automating
>> generation of policy compliant operating system distribution packages,
>
>
> Downstream OS packaging could easily (and without permission) include extra
> attributes (properties specified with full URIs) in JSONLD metadata.

We can already drop arbitrary files into dist-info directories if we
really want to, but in practice that extra metadata tends to end up in
the system level package database rather than in the Python metadata.

>> - "integrates": replacement for "meta_requires" that only allows
>> pinned dependencies (i.e. hash maps with "name" & "version" fields, or
>> direct URL references, rather than a general PEP 508 specifier as a
>> string)
>
>
> Pipfile.lock?
>
> What happens here when something is listed in both requires and integrates?

The simplest approach would be to treat it the same way that tools
treat mentioning the same component in multiple requirements entries
(since that's really what you'd be doing).

> Where/do these get merged on the "name" attr as a key, given a presumed
> namespace URI prefix (https://pypi.org/project/)?

For installation purposes, they'd be combined into a single requirements set.
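
That is, something like this sketch using packaging's SpecifierSet (the
helper itself is illustrative, not an existing API):

    from packaging.requirements import Requirement
    from packaging.specifiers import SpecifierSet

    def combined_specifier(requirements):
        """Merge the specifiers for a single project into one set."""
        merged = SpecifierSet()
        for req in requirements:
            merged &= req.specifier
        return merged

    reqs = [Requirement("example-dist>=1.0"), Requirement("example-dist==1.2.3")]
    combined_specifier(reqs)   # both constraints must hold: ==1.2.3,>=1.0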

>> For converting old metadata, any concrete dependencies that are
>> compatible with the "integrates" field format would be mapped that
>> way, while everything else would be converted to "requires" entries.
>
> What heuristic would help identify compatibility with the integrates field?

PEP 440 version matching (==), arbitrary equality (===), and direct
references (@...), with the latter being disallowed on PyPI (but fine
when using a private index server).
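
In code, the check would look roughly like this (the function name is
made up; the operator rule is the one described above):

    from packaging.requirements import Requirement

    def is_integrates_compatible(req_string):
        req = Requirement(req_string)
        if req.url:   # PEP 508 direct reference ("name @ URL")
            return True
        operators = {spec.operator for spec in req.specifier}
        return bool(operators) and operators <= {"==", "==="}

    is_integrates_compatible("pyobjc-core==3.2.1")   # True
    is_integrates_compatible("requests>=2.0")        # False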

>> The semantic differences between normal runtime dependencies and
>> "dev", "test", "doc" and "build" requirements would be handled as
>> extras, regardless of whether you were using the old metadata format
>> or the new one.
>
> +1 from me.
>
> I can't recall whether I've used {"dev", "test", "doc", and "build"} as
> extras names in the past; though I can remember thinking "wouldn't it be
> more intuitive to do it [that way]"
>
> Is this backward compatible? Extras still work as extras?

Yeah, this is essentially the way Provides-Extra ended up being
documented in https://packaging.python.org/specifications/#provides-extra-multiple-use

That already specifies the expected semantics for "test" and "doc", so
it would be a matter of adding "dev" and "build" (as well as surveying
PyPI for components that already define those extras).
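
With placeholder dependencies (pytest and sphinx are just examples
here), the legacy encoding would look like:

    Provides-Extra: test
    Requires-Dist: pytest; extra == "test"
    Provides-Extra: doc
    Requires-Dist: sphinx; extra == "doc"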

>> P.S. I'm definitely open to a PR that amends the PEP 426 draft along
>> these lines. I'll get to it eventually myself, but there are some
>> other things I see as higher priority for my open source time at the
>> moment (specifically the C locale handling behaviour of Python 3.6 in
>> Fedora 26 and the related upstream proposal for Python 3.7 in PEP 538)
>
> I need to find a job; my time commitment here is inconsistent.

Yeah, I assume work takes precedence for everyone, which is why I
spend time needling redistributors and major end users about the
disparity between "level of use" and "level of investment" when it
comes to the upstream Python packaging ecosystem. While progress on
that front isn't particularly visible yet, the nature of the
conversations is changing in a good way.

> I'm working on a project (nbmeta) for generating, displaying, and embedding
> RDFa and JSONLD in Jupyter notebooks (w/ _repr_html_() and an OrderedDict)
> which should refresh the JSONLD @context-writing skills necessary to define
> the RDFS vocabulary we could/should have at https://schema.python.org/ .

I'm definitely open to ensuring the specs are RDF/JSONLD friendly,
especially as some of the characteristics of that are beneficial in
other kinds of mappings as well (e.g.
lists-of-hash-maps-with-fixed-key-names are easier to work with than
hash-maps-with-data-dependent-key-names for a whole lot of reasons).
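
A small illustration of that shape difference (the field names are
purely illustrative):

    # Data-dependent keys: the project names themselves are the keys,
    # which is awkward for schema validation and JSON-LD @context mapping
    deps_by_name = {"requests": ">=2.0", "idna": ">=2.1"}

    # Fixed key names: every entry has the same shape, so it maps cleanly
    # to both JSON Schema definitions and RDF properties
    deps_as_list = [
        {"name": "requests", "specifier": ">=2.0"},
        {"name": "idna", "specifier": ">=2.1"},
    ]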

> - [ ] JSONLD PEP (<- PEP426)
>   - [ ] examples / test cases
>     - I've referenced IPython as an example package; are there other hard
> test cases for python packaging metadata conversion? (i.e. one that uses
> every feature of each metadata format)?

PyObjC is my standard example for legitimate version pinning in a
public project (it's a metapackage where each release just depends on
particular versions of the individual components).

django-mezzanine is one I like as a decent example of a reasonably
large dependency tree for something that still falls short of a
complete application.

setuptools is a decent example for basic use of environment markers.

I haven't found great examples for defining lots of extras or using
complex environment marker options (but I also haven't really gone
looking).

>   - [ ] JSONLD @context
>   - [ ] class PackageMetadata
>   - [ ] wheel: (additionally) generate JSONLD metadata
>   - [ ] schema.python.org: master, gh-pages (or e.g.
> "https://www.pypa.io/ns#")
>
> - [ ] warehouse: add a ./jsonld view (to legacy?)

This definitely won't be an option for the legacy service, but it
could be an interesting addition to Warehouse.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

