[Import-SIG] Loading Resources From a Python Module/Package

Sun Feb 1 01:59:27 CET 2015

> On Jan 31, 2015, at 7:19 PM, Brett Cannon <brett at python.org> wrote:
> 
> 
> 
> On Sat Jan 31 2015 at 6:05:39 PM Donald Stufft <donald at stufft.io <mailto:donald at stufft.io>> wrote:
> 
> I honestly don’t really care what API the loader has because I don’t think anybody but the importlib.resources functions are ever going to use, so if you want to do something I hate with them knock yourself out I don’t care that much,
> 
> Well you have to care to an extent because that API will be what lets you do what you want to do at a higher API level.

Well I mean I don’t care enough to argue about what the API looks like as long as the high level API is possible with them.

>  
> I just think that requiring using __file__ and __path__ outside of the implementation of the loader itself is a pretty big code smell.
> 
> Who ever said anything about __file__ and __path__ outside of loaders? All I have proposed is something to allow you to do, e.g.::

Sorry, I mean the current API for get_data requires a Loader() user to construct an absolute path using __file__ and/or __path__ (at least I think it does, and the PEP example I think shows it using that). I think that whatever solution that ends up with should not require someone using the Loader() to know about them.

> 
>   def resource_stream(package, path):
>     """Return a file-like object for a file relative to a package."""
>    loader = ...
>     try:
>       return open(loader.get_path(package, path, real=True))
>     except NotARealFileThingy:
>       loader_path = loader.get_path(package, path)
>       return io.BytesIO(loader.get_data(loader_path))
> 
> Otherwise get_stream(package, path) and then, e.g.::
> 
>   def resource_filename(package, path):
>     loader = ...
>     stream = loader.get_stream()
>     if hasattr(stream, 'path'):
>       return stream.path
>     path = make a tempfile path somehow ...
>     with open(path, 'wb') as file:
>       file.write(stream.read())
>     return path

This is what I meant when I didn’t care. I don’t really like that loader API much but it (mostly) allows the top level concepts I want without having to sacrifice much.

The only negative I can see here is that it’s impossible to implement get_stream in an efficient way for large objects except in a file system. This might not matter much because in practice these files aren’t often going to be very big nor are they often going to live anywhere but the file system. However for instance it’s impossible for a loader that downloads things from the internet to return a stream that won’t load the entire response in memory inside of an io.BytesIO container. It might matter for zip files (I don’t know if it’s possible to open a handle to a particular file inside of a zip file and read() that without reading the whole file into memory?).

It might make sense to deprecate Loader().get_data() and replace it with Loader().get_data_stream() or just have them both (in the simple case Loader().get_data() could be a small wrapper around Loader().get_data_stream(). That would allow efficient access with resource_stream() even for non file systems things without changing the API much from what you like.

> 
> Please realize the kind of bind you're putting (at least) me in: trying to abstract away paths so they don't really exist except as opaque things to pass around for a loader is great and a goal I have been trying to meet, but then you want real file paths when available so you can open files directly and feed file paths to APIs that require them without creating a temporary file. So you're asking for a loader API that doesn't directly work with paths but that will spit them out when they are available and have a way to differentiate them which directly contradicts the idea of having the loader APi hide the concept of a file path away entirely (granted it is on the edge of the API in terms of what it emits and not what it directly takes in, but it does pierce the abstraction away from paths).
> 
> This use-case you're after is not something I haven't thought about or purposefully ignored. I too want people to work with loaders somehow so that data carried with a project from PyPI can be loaded from any reasonable loader implementation. But it's bloody hard and it's going to require some patience and compromise on all sides if we are going to get something that doesn't make loaders explicitly path-aware and hard to implement while still allowing the common case you are after of avoiding unnecessary overhead or doing something like isinstance() checks for importlib.machinery.SourceFileLoader or something.

Honestly I’m not trying to put anyone in a bind, I want the top level APIs that I pointed out and I want them to work in roughly the best way for each particular backend. For me the best way to do that is to give each top level function an optional hook into on the Loaders so that the loader can do something better and less generic if possible since it knows more about how it is implemented and can possibly make better choices about what is the best way to implement each particular thing. I also don’t like boolean arguments to functions that drastically change their behavior which is why I wanted to add things like Loader().get_data_filename() which would only return a non None value if a real path existed (aka real=True from above).

Sans the stream issue from above, It looks like the implementation you defined would work, I just don’t particularly like the API much because i don’t like cramming source code and data files into the same API. That’s OK though because I don’t really need to work with the API hardly ever so it doesn’t affect me much, I was just being opinionated because that’s the way I am.

>  
> 
> How about this, instead here’s the top level APIs I want: https://bpaste.net/show/0c490aa07c07 <https://bpaste.net/show/0c490aa07c07>
> 
> Other than seeing resource_bytes() as redundant and not wanting to give all of them a "resource_" prefix if they are going to live in importlib.resources I'm basically fine with what you're after.

It is somewhat redundant but I think it’s also a useful wrapper around resource_stream() to have, especially since it can then replace pkgutil.get_data directly and that function can be deprecated for it. I don’t really have a strong opinion on the names themselves, the resource_<foo> names were taken from what pkg_resources called them and because I couldn’t think of a good naming scheme for them that wasn’t just prefixing with get_ which I didn’t like as much as resource_. It might make sense to just put them in importlib.util with a resource_ prefix on them, or in import lib’s top level. I don’t have a strong opinion on which ones of those options are “best”.

>  
> 
> How this is implemented in the Loader() API can be whatever folks want. The important thing is that these all solve actual use cases and solve them better and easier than the naive approach of using os.path functions directly. 
> 
> Important Things:
> 
> * resource_filename and ResourceFilename() must return the real file system path if available and a temporary file else wise.
> * resource_filename *must* be available for the lifetime of the process once the function has been called.
> * ResourceFilename *must* clean itself up at the end of the context manager.
> * These functions/context managers *must* work in terms of package names and relative file paths.
> 
> All seem reasonable to me.
> 
> -Brett
>  
> 
> 
>> -Brett
>>  
>> 
>> 
>>> 
>>> -Brett
>>>  
>>> 
>>> This means that we’re not talking about “data” files, but “resource” files. This also removes the idea that you can call Loader.set_data() on those files (like i’ve seen in the implementation).
>>> 
>>>>  
>>>> 
>>>>> 
>>>>> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs.
>>>> 
>>>> I think we do want to allow directories, it’s not unusual to have something like:
>>>> 
>>>> warehouse
>>>> ├── __init__.py
>>>> ├── templates
>>>> │   ├── accounts
>>>> │   │   └── profile.html
>>>> │   └── hello.html
>>>> ├── utils
>>>> │   └── mapper.py
>>>> └── wsgi.py
>>>> 
>>>> Conceptually templates isn’t a package (even though with namespace packages it kinda is) and I’d want to load profile.html by doing something like:
>>>> 
>>>> importlib.resources.get_bytes(“warehouse”, “templates/accounts/profile.html”)
>>>> 
>>>> Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html')  =)
>>>>  
>>>> 
>>>> In pkg_resources the second argument to that function is a “resource path” which is defined as a relative to the given module/package and it must use / to denote them. It explicitly says it’s not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader).
>>>> 
>>>> Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement.
>>>>  
>>>> How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader.
>>>> 
>>>>> 
>>>>> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules.
>>>> 
>>>> Is an actual module what gets passed into Loader().exec_module()?
>>>> 
>>>> Yes.
>>>>  
>>>> If so I think it’s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, “logo.gif”) you’d do importlib.resources.get_bytes(__name__, “logo.gif”).
>>>> 
>>>> If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues.
>>>> 
>>>> And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3.
>>> 
>>> ---
>>> Donald Stufft
>>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>> 
>> 
>> ---
>> Donald Stufft
>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
> 
> 
> ---
> Donald Stufft
> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/import-sig/attachments/20150131/03cf0c03/attachment-0001.html>