[Import-SIG] Loading Resources From a Python Module/Package

Sat Jan 31 22:43:47 CET 2015

> On Jan 31, 2015, at 4:22 PM, Brett Cannon <brett at python.org> wrote:
> 
> 
> 
> On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft <donald at stufft.io <mailto:donald at stufft.io>> wrote:
>> On Jan 31, 2015, at 12:00 PM, Brett Cannon <brett at python.org <mailto:brett at python.org>> wrote:
>> 
>> 
>> 
>> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft <donald at stufft.io <mailto:donald at stufft.io>> wrote:
>>> On Jan 31, 2015, at 11:31 AM, Brett Cannon <brett at python.org <mailto:brett at python.org>> wrote:
>>> 
>>> 
>>> 
>>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore <p.f.moore at gmail.com <mailto:p.f.moore at gmail.com>> wrote:
>>> On 31 January 2015 at 15:47, Donald Stufft <donald at stufft.io <mailto:donald at stufft.io>> wrote:
>>> >> It's certainly possible to add a new API that loads resources based on
>>> >> a relative name, but you'd have to specify relative to *what*.
>>> >> get_data explicitly ducks out of making that decision.
>>> >
>>> > data = __loader__.get_bytes(__name__, “logo.gif”)
>>> 
>>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't
>>> prohibit sharing of loaders, etc, in the way Brett mentions.
>>> 
>>> By specifying the package anchor point I don't think it does.
>>>  
>>> Also, the
>>> fact that it needs __name__ in there feels wrong - a bit like the old
>>> version of super() needing to be told which class it was being called
>>> from.
>>> 
>>> You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader.
>>> 
>>> But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'.
>>>  
>>> But in principle I don't object to finding a suitable form of
>>> this.
>>> 
>>> And I like the name get_bytes - much more explicit in these Python 3
>>> days of explicit str/bytes distinctions :-)
>>> 
>>> One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden  behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path).
>> 
>> I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data.
>> 
>> Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary.
> 
> Because the idea behind get_data_filename() is that it returns a path that can be used regularly by APIs that expect to be handed a file on the file system.
> 
> In my head that expectation is not placed on the method.
>  
> Simulating a path with an opaque string isn’t good enough because, for example, OpenSSL doesn’t know how to open /data/foo.zip/foobar/cacert.pem. The idea here is that _if_ a regular file system path is available for a particular resource file then Loader().get_data_filename() would return it, otherwise it’d return None (or not exist at all).
> 
> This means that pkgutil.get_data_filename (or importlib.resources.get_filename) can attempt to call Loader().get_data_filename() and just return that path if one exists on the file system already, and if it doesn’t then it can create a temporary file and call Loader.get_data() and write the data to that temporary file and return the path to that.
> 
> See I'm not even attempting to guarantee there is any API that will return a reasonable file system path as the import API makes no such guarantees. If an API like OpenSSL requires a file on the filesystem then you will have to write to a temporary file and that's just life. That's the same as if everything was stored in a zip file anyway.

The entire *point* is this thread is that sometimes you need a file path that is a valid path to a resource.

The naive approach is to just make it do something like:

# in pkgutil
def get_data_filename(package, resource):
    data = get_data(package, resource)
    if data is not None:
        with open("/tmp/path", "wb") as fp:
            fp.write(data)
        return "/tmp/path"

However the problem with this is that it imposes a read() into memory and then creating a new file, and then writing that data back to a file even in cases where there is already a file available on the file system. The Loader().get_data_filename() exists for a Loader() to *optionally* say that “We already have a file path for this file, so you can just use this instead of copying to a temporary location”.

Then the “optimized” but still naive approach becomes:

# in pkgutil
def get_data_filename(package, resource):
    mod = importlib.import_module(package)
    if hasattr(mod.__loader__, "get_data_filename"):
        try:
            filename = mod.__loader__.get_data_filename(package, resource)
        except FileNotFoundError:
            pass
        else:
            if filename is not None:
                return filename

    data = get_data(package, resource)
    if data is not None:
        with open("/tmp/path", "wb") as fp:
            fp.write(data)
        return "/tmp/path"

This means there’s basically no penalty for using this API to access resources files when you’re accessing files from a FileLoader. In my opinion anything that is harder to use than:

MY_PATH = os.path.join(os.path.dirname(__file__), “my/file.txt”)

Is highly unlikely to be used. People can already just write things to a temporary directory using get_data, but the point is they don’t because it’s a waste of time for the common case and it’s easier not to do that.

>  
> 
>>  
>> 
>> I think it is redundant but given that it’s a new API (passing module and a “resource path”) I think it makes sense. The old get_data API can be deprecated but left in for compatibility reasons if we want (sort of like Loader().load_module() -> Loader().exec_module()).
>> 
>> If we do that then there would have to be a way to specify how to read the bytes for the module code itself since get_data() is used in the implementation of import by coupling it with get_filename() (which is why I'm trying not have to drop get_filename()/get_data() and instead come up with some new approach to reading bytes since the current approach is very composable). So get_bytes() would need a way to signal that you don't want some data file but the bytes for the module. Maybe if the path section is unspecified then that's a signal that the module's bytes is wanted and not some data file?
> 
> Perhaps trying to read modules and resource files with the same method is the wrong approach?
> 
> If we are going to do that then we might as well deprecate all the methods that try to expose reading data and paths as the PEP 302 APIs tried to expose it uniformly.

I don’t think it makes sense to expose it uniformly, code is semantically different than data files and people need the ability to do different things with them. It’s unlikely you’ll get a 2GB.py file, however a 2GB data file is completely within the realms of possibility.

>  
> 
> Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 <https://bpaste.net/show/b25b7e8dc8f0>
> 
> That seems like a bit much, e.g. why do you needs bytes **and** and a file-like object() when you get the former from the latter? And why do you need the path argument when you can get the path off the file-like object if it's an actual file object?

I don’t think it’s a bit much at all.

You get a stream method because sometimes things expect a file like object or sometimes the file is big and the ability to access a stream that handles that for you is super important. However when using a stream you need to ensure you close the stream after you’re done using it.

You get a bytes method because sometimes you don’t care about all of that and you just need/want the raw bytes, it’s a nicer API for those people to be able to just get bytes without having to worry about reading a file or closing the file after they are done reading it.

You get a filename method because the stream method may or may not return a file object that has a path at all, and if you just need to pass the path into another API having an open file handle just to get the filename is a waste of a file handle.

> 
> -Brett
>  
> 
> This means that we’re not talking about “data” files, but “resource” files. This also removes the idea that you can call Loader.set_data() on those files (like i’ve seen in the implementation).
> 
>>  
>> 
>>> 
>>> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs.
>> 
>> I think we do want to allow directories, it’s not unusual to have something like:
>> 
>> warehouse
>> ├── __init__.py
>> ├── templates
>> │   ├── accounts
>> │   │   └── profile.html
>> │   └── hello.html
>> ├── utils
>> │   └── mapper.py
>> └── wsgi.py
>> 
>> Conceptually templates isn’t a package (even though with namespace packages it kinda is) and I’d want to load profile.html by doing something like:
>> 
>> importlib.resources.get_bytes(“warehouse”, “templates/accounts/profile.html”)
>> 
>> Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html')  =)
>>  
>> 
>> In pkg_resources the second argument to that function is a “resource path” which is defined as a relative to the given module/package and it must use / to denote them. It explicitly says it’s not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader).
>> 
>> Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement.
>>  
>> How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader.
>> 
>>> 
>>> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules.
>> 
>> Is an actual module what gets passed into Loader().exec_module()?
>> 
>> Yes.
>>  
>> If so I think it’s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, “logo.gif”) you’d do importlib.resources.get_bytes(__name__, “logo.gif”).
>> 
>> If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues.
>> 
>> And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3.
> 
> ---
> Donald Stufft
> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/import-sig/attachments/20150131/4584be3c/attachment-0001.html>