[Import-SIG] Loading Resources From a Python Module/Package

Sun Feb 1 06:28:42 CET 2015

On 1 February 2015 at 08:27, Brett Cannon <brett at python.org> wrote:
> As I said above, I partially feel like the desire for this support is to
> work around some API decisions that are somewhat poor.
>
> How about this: get_path(package, path, *, real=False) or get_path(package,
> filename, *, real=False) -- depending on whether Barry and me get our way
> about paths or you do, Donald -- where 'real' is a flag specifying whether
> the path has to work as a path argument to builtins.open() and thus fails
> accordingly (in instances where it won't work it can fail immediately and so
> loader implementers only have two lines of code to care about to manage it).
> Then loaders can keep their get_data() method without issue and the API for
> loaders only grew by 1 (or stays constant depending on whether we want/can
> have it subsume get_filename() long-term).

Jumping in here, since I specifically object to the "real=<boolean
flag>" API design concept (on the grounds that the presence of that
kind of flag means you have two different methods trying to get out).
This thread is already quite long, and there are several different
aspects I'd like to comment on :)

* I like the overall naming suggestion of referring to this as a
"resources" API. That not only has precedent in pkg_resources, but is
also the standard terminology for referring to this kind of thing in
rich client applications. (see
https://msdn.microsoft.com/en-us/library/windows/apps/hh465241.aspx
for example)

* I think the PEP 302 approach of referring to resource anchors as
"paths" is inherently confusing, especially when the most common
anchor is __file__. As a result, I think we should refer to "resource
anchors" and "relative paths", rather than the current approach of
trying to create and pass around "absolute paths" (which then end up
only working properly when packages are installed to a real
filesystem).

* I think Donald's overview at https://bpaste.net/show/0c490aa07c07 is
a good summary of the functionality we should aim to provide (naming
bikesheds aside).

* I agree we should treat extraction and loading of C extension
modules (and shared libraries in general) as out of scope for the
resource API. They face several restrictions that don't apply to
pure data files.

* I agree that the resource APIs should be for read-only access only.
Images, localisation strings, application templates, those are the
kinds of things this API is aimed at: they're an essential part of the
application, and hence it's appropriate to bundle them with it in a
way that still works for single-file zip archive applications, but
they're not Python code.

* For the "must exist as a real shareable filesystem artefact, fail
immediately if that isn't possible" API, I think we should support
both implicit cleanup *and* explicit context managers for
deterministic resource control. "Make this available until I'm done
with it, regardless of where I use it" and "make this available for
this defined region of code" are different use cases. Depending on how
these objects are modelled in the API (more on that below), we could
potentially drop the atexit handler in favour of suitable
weakref.finalize() calls (which would then clean them up once the last
reference to the resource was dropped, rather than always waiting
until the end of the process - "keep this resource available until the
process ends" would then be a matter of reference it from the
appropriate module globals or some other similarly long lived data
structure). Leaks due to process crashes would then be cleaned up by
normal OS tempfile management processes.

* I don't think we should couple the concept of resource anchors
directly to package names (as discussed, it doesn't work for namespace
packages, for example). I think we *should* be able to *look up*
resource anchors by package name, although this may fail in some cases
(such as namespace packages), and that the top level API should do
that lookup implicitly (allowing package names to be passed wherever
an anchor is expected). A module object should also be usable as its
own anchor. I believe we should disallow the use of filesystem paths
as resource anchors, as that breaks the intended abstraction (looking
resources up relative to the related modules), and the API behaviour
is clearer if strings are always assumed to be referring to
package/module names.

* I *don't* think it's a good idea to incorporate this idea directly
onto the existing module Loader API. Better to create a new
"ResourceLoader" abstraction, such that we can easily provide a
default LocationResourceLoader. Reusing module Loader instances across
modules would still be permitted, but reusing ResourceLoader instances
*would not*. This allows the resource anchor to be specified when
creating the resource loader, rather than on every call.

* As a consequence of the previous point, the ResourceLoader instance
would be linked *from the module spec* (and perhaps from the module
globals), rather than from the module loader instance. (This is how we
would support using a module as its own anchor). Having a resource
loader defined in the spec would be optional, making it clear that
namespace packages (for example) don't provide a resource access API -
if you want to store resources inside a namespace package, you need to
create a submodule or self-contained subpackage to serve as the
resource anchor.

* As a consequence of making a suitably configured resource loader
available through the module spec as part of the module finding
process, it would become possible to access module-relative resources
*without actually loading the module itself*.

* If the import system gets a module spec where "spec.has_location" is
set and Loader.get_data is available, but the new
"spec.resource_loader" attribute is set to None, then it will set it
to "LocationResourceLoader(spec.origin)", which will rely solely on
Loader.get_data() for content access.

* We'd also provide an optimised FilesystemResourceLoader for use with
actual installed packages where the resources already exist on disk
and don't need to be copied to memory or a temporary directory to
provide a suitable API.

* For abstract data access at the ResourceLoader API level, I like
"get_anchor()" (returning a suitably descriptive string such that
"os.path.join(anchor, <relative path>)" will work with get_data() on
the corresponding module Loader), "get_bytes(<relative path>)",
"get_bytestream(<relative path>)" and "get_filesystem_path(<relative
path>)". get_anchor() would be the minimum API, with default
implementations of the other three based on Loader.get_data(), BytesIO
and tempfile (this would involve suitable use of lazy or on-demand
imports for the latter two, as we'd need access to these from
importlib._bootstrap, but wouldn't want to load them on every
interpreter startup).

* For the top-level API, I similarly favour
importlib.resources.get_bytes(), get_bytestream() and
get_filesystem_path(). However, I would propose that the latter be an
object implementing a to-be-defined subset of the pathlib Path API,
rather than a string. Resource listing, etc, would then be handled
through the existing Path abstraction, rather than defining a new one.
In the standard library, because we'd just be using a temporary
directory, we could use real Path objects (although we'd need to add
weakref support to them to implement the weakref.finalize suggestion I
make above).
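
To make the cleanup idea above more concrete, here is a minimal
sketch of a handle that copies resource bytes into a temporary
directory and supports both explicit cleanup (as a context manager)
and implicit cleanup (via weakref.finalize()). The class name
ExtractedResource, its constructor arguments and its "path" attribute
are all hypothetical, invented for illustration; nothing here is a
proposed or existing importlib API.

```python
import os
import shutil
import tempfile
import weakref


class ExtractedResource:
    """Hypothetical handle for resource bytes copied to a real file."""

    def __init__(self, data, filename):
        # Each resource gets its own temporary directory, so the original
        # file name can be preserved for consumers that care about it.
        self._tmpdir = tempfile.mkdtemp(prefix="resource-")
        self.path = os.path.join(self._tmpdir, filename)
        with open(self.path, "wb") as f:
            f.write(data)
        # Implicit cleanup: the callback runs when this object is garbage
        # collected, or at interpreter exit if it is still alive then.
        # It must not hold a reference to self, so only the directory
        # name (a plain string) is passed in.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self._tmpdir, ignore_errors=True
        )

    def close(self):
        # Explicit cleanup. A finalizer runs its callback at most once,
        # so calling close() repeatedly is harmless.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

"Make this available for this defined region of code" is then
"with ExtractedResource(data, name) as res: ...", while "make this
available until I'm done with it" is just a matter of keeping a
reference to the object for as long as the file is needed.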

> As for importlib.resources, that can provide a higher-level API for a
> file-like object along with some way to say whether the file must be
> addressable on the filesystem to know if tempfile.NamedTemporaryFile() may
> be backing the file-like object or if io.BytesIO could provide the API.
>
> This gets me a clean API for loaders and importlib and gets you your real
> file paths as needed.

Yep, as you can see above, I agree there are two APIs to be designed
here - the high level user facing one, and the one between the import
machinery and plugin authors.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia