[Import-SIG] Loading Resources From a Python Module/Package

Brett Cannon brett at python.org
Sun Feb 1 01:19:21 CET 2015


On Sat Jan 31 2015 at 6:05:39 PM Donald Stufft <donald at stufft.io> wrote:

> On Jan 31, 2015, at 5:27 PM, Brett Cannon <brett at python.org> wrote:
>
>
>
> On Sat Jan 31 2015 at 4:43:50 PM Donald Stufft <donald at stufft.io> wrote:
>
>> On Jan 31, 2015, at 4:22 PM, Brett Cannon <brett at python.org> wrote:
>>
>>
>>
>> On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft <donald at stufft.io> wrote:
>>
>>> On Jan 31, 2015, at 12:00 PM, Brett Cannon <brett at python.org> wrote:
>>>
>>>
>>>
>>> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft <donald at stufft.io>
>>> wrote:
>>>
>>>> On Jan 31, 2015, at 11:31 AM, Brett Cannon <brett at python.org> wrote:
>>>>
>>>>
>>>>
>>>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore <p.f.moore at gmail.com>
>>>> wrote:
>>>>
>>>>> On 31 January 2015 at 15:47, Donald Stufft <donald at stufft.io> wrote:
>>>>> >> It's certainly possible to add a new API that loads resources based on
>>>>> >> a relative name, but you'd have to specify relative to *what*.
>>>>> >> get_data explicitly ducks out of making that decision.
>>>>> >
>>>>> > data = __loader__.get_bytes(__name__, "logo.gif")
>>>>>
>>>>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't
>>>>> prohibit sharing of loaders, etc, in the way Brett mentions.
>>>>
>>>>
>>>> By specifying the package anchor point I don't think it does.
>>>>
>>>>
>>>>> Also, the
>>>>> fact that it needs __name__ in there feels wrong - a bit like the old
>>>>> version of super() needing to be told which class it was being called
>>>>> from.
>>>>
>>>>
>>>> You can't avoid that. This is the entire reason why loader reuse is a
>>>> pain; you **have** to specify what to work off of, otherwise it's ambiguous
>>>> and becomes a feature specific to a particular loader.
>>>>
>>>> But this is only an issue when you are trying to access a file relative
>>>> to the package/module you're in. Otherwise you're going to be specifying a
>>>> string constant like 'foo.bar'.
>>>>
>>>>
>>>>> But in principle I don't object to finding a suitable form of
>>>>> this.
>>>>>
>>>>> And I like the name get_bytes - much more explicit in these Python 3
>>>>> days of explicit str/bytes distinctions :-)
>>>>
>>>>
>>>> One unfortunate side-effect from having a new method to return bytes
>>>> from a data file is that it makes get_data() somewhat redundant. If we make
>>>> it get_data_filename(package_name, path) then it can return an absolute
>>>> path which can then be passed to get_data() to read the actual bytes. If we
>>>> create importlib.resources as Donald has suggested then all of this can be
>>>> hidden behind a function and users don't have to care about any of this,
>>>> e.g. importlib.resources.read_data(module_anchor, path).
>>>>
>>>>
>>>> I think we actually have to go the other way, because only some Loaders
>>>> will be able to actually return a filename (returning a filename is
>>>> basically an optimization to prevent needing to call get_data and write
>>>> that out to a temporary directory) but pretty much any loader should
>>>> theoretically be able to support get_data.
>>>>
>>>
>>> Why can only some loaders return a filename? As I have said, loaders can
>>> return an opaque string to simulate a path if necessary.
>>>
>>>
>>> Because the idea behind get_data_filename() is that it returns a path
>>> that can be used regularly by APIs that expect to be handed a file on the
>>> file system.
>>>
>>
>> In my head that expectation is not placed on the method.
>>
>>
>>> Simulating a path with an opaque string isn’t good enough because, for
>>> example, OpenSSL doesn’t know how to open /data/foo.zip/foobar/cacert.pem.
>>> The idea here is that _if_ a regular file system path is available for a
>>> particular resource file then Loader().get_data_filename() would return it,
>>> otherwise it’d return None (or not exist at all).
>>>
>>> This means that pkgutil.get_data_filename (or
>>> importlib.resources.get_filename) can attempt to call
>>> Loader().get_data_filename() and just return that path if one exists on the
>>> file system already, and if it doesn’t then it can create a temporary file
>>> and call Loader.get_data() and write the data to that temporary file and
>>> return the path to that.
>>>
>>
>> See I'm not even attempting to guarantee there is any API that will
>> return a reasonable file system path as the import API makes no such
>> guarantees. If an API like OpenSSL requires a file on the filesystem then
>> you will have to write to a temporary file and that's just life. That's the
>> same as if everything was stored in a zip file anyway.
>>
>>
>> The entire *point* of this thread is that sometimes you need a file path
>> that is a valid path to a resource.
>>
>
> Right, but I also have to make sure the import API doesn't get too
> ridiculous because it took me years and several versions of Python to make
> it work with the APIs inherited from PEP 302 and to make sure it didn't grow
> into a huge mess.
>
>
>>
>> The naive approach is to just make it do something like:
>>
>> # in pkgutil
>> def get_data_filename(package, resource):
>>     data = get_data(package, resource)
>>     if data is not None:
>>         with open("/tmp/path", "wb") as fp:
>>             fp.write(data)
>>         return "/tmp/path"
>>
>> However the problem with this is that it imposes reading the data into
>> memory, creating a new file, and then writing that data back out, even in
>> cases where there is already a file available on the file system. The
>> Loader().get_data_filename() exists for a Loader() to *optionally* say that
>> “We already have a file path for this file, so you can just use this
>> instead of copying to a temporary location”.
>>
>
> And that's fine, but my point is forcing it to only play that role seems
> unnecessary. If you want a 'real' parameter to say "only return a path if I
> can pass it to an API that requires it" then that's fine.
>
>
>>
>> Then the “optimized” but still naive approach becomes:
>>
>> # in pkgutil
>> def get_data_filename(package, resource):
>>     mod = importlib.import_module(package)
>>     if hasattr(mod.__loader__, "get_data_filename"):
>>         try:
>>             filename = mod.__loader__.get_data_filename(package, resource)
>>         except FileNotFoundError:
>>             pass
>>         else:
>>             if filename is not None:
>>                 return filename
>>
>>     data = get_data(package, resource)
>>     if data is not None:
>>         with open("/tmp/path", "wb") as fp:
>>             fp.write(data)
>>         return "/tmp/path"
>>
>> This means there’s basically no penalty for using this API to access
>> resource files when you’re accessing files from a FileLoader.
>>
>
> And leaking a temp file until shutdown, which is why Barry and I prefer a
> context manager. =)
>
>
> So the top level API can have both and people can use whichever fits their
> situation best.
>
> It’s not really leaking a temp file, it’s making it available to the
> process once the function has been called. Needing a data file for the
> entire life of the process is a common use case.
>
>
>
>> In my opinion anything that is harder to use than:
>>
>> MY_PATH = os.path.join(os.path.dirname(__file__), "my/file.txt")
>>
>> is highly unlikely to be used. People can already just write things to a
>> temporary directory using get_data, but the point is they don’t because
>> it’s a waste of time for the common case and it’s easier not to do that.
>>
>
> That's fine, but I also feel like we are trying to design around bad API
> design where something is assuming all data is going to be on disk and thus
> it's okay to require a file path on the filesystem instead of taking the
> bytes directly or a file-like object.
>
> I realize you are trying to solve this specifically for OpenSSL since it
> has the nasty practice of wanting a file path, but from an import
> perspective I also have to worry about what makes sense for the API as a
> whole.
>
>
> I’m not actually trying to solve this specifically for OpenSSL at all, I’m
> trying to solve it for any API that requires a file path where I don’t
> control that API. My end goal is to make zip imports useful enough that
> people can assume they are going to work, not a curiosity that, for most
> projects, only works by accident. Right now, take almost any project on
> PyPI that has a data file and there’s an extremely high chance it won’t
> work with zip import.
>
> My long-term goal here is to make a “static” deployment format for Python
> that can wrap everything up into one file, so one step along this path is
> getting people to stop doing things that rely on having a real filesystem
> and to go through some abstraction instead. However, I can’t get them to do
> that if the API for it is awkward and something they don’t want to use.
> Purity is a great thing, but when there is a direct competitor to this API
> you have to weigh purity against actual usefulness in the common case.
>
> So yea, it’s designing around a bad API, but it’s also an API design that
> exists all over the place in the real world and telling people “well just
> don’t do that” means they won’t use this API and their thing will continue
> to not be usable with zip import.
>
>
>
>>
>>
>>
>>>
>>>
>>>
>>>>
>>>> I think it is redundant but given that it’s a new API (passing module
>>>> and a “resource path”) I think it makes sense. The old get_data API can be
>>>> deprecated but left in for compatibility reasons if we want (sort of like
>>>> Loader().load_module() -> Loader().exec_module()).
>>>>
>>>
>>> If we do that then there would have to be a way to specify how to read
>>> the bytes for the module code itself since get_data() is used in the
>>> implementation of import by coupling it with get_filename() (which is why
>>> I'm trying not to have to drop get_filename()/get_data() and instead come up
>>> with some new approach to reading bytes since the current approach is very
>>> composable). So get_bytes() would need a way to signal that you don't want
>>> some data file but the bytes for the module. Maybe if the path section is
>>> unspecified then that's a signal that the module's bytes are wanted and not
>>> some data file?
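>>>
>>> Roughly (an illustrative sketch only; the name and signature aren't
>>> settled)::
>>>
>>>   loader.get_bytes('foo.bar', 'logo.gif')  # bytes of a data file
>>>   loader.get_bytes('foo.bar')  # bytes backing the foo.bar module itself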
>>>
>>>
>>> Perhaps trying to read modules and resource files with the same method
>>> is the wrong approach?
>>>
>>
>> If we are going to do that then we might as well deprecate all the
>> methods that try to expose reading data and paths, since the PEP 302 APIs
>> tried to expose it all uniformly.
>>
>>
>> I don’t think it makes sense to expose it uniformly; code is semantically
>> different from data files and people need the ability to do different
>> things with them. It’s unlikely you’ll get a 2 GB .py file, but a 2 GB
>> data file is completely within the realm of possibility.
>>
>>
>>
>>>
>>> Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0
>>>
>>
>> That seems like a bit much, e.g. why do you need bytes **and** a
>> file-like object when you can get the former from the latter? And why do you
>> need the path argument when you can get the path off the file-like object
>> if it's an actual file object?
>>
>>
>> I don’t think it’s a bit much at all.
>>
>> You get a stream method because sometimes things expect a file-like
>> object, or sometimes the file is big and the ability to access a stream that
>> handles that for you is super important. However, when using a stream you
>> need to ensure you close it after you're done using it.
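>>
>> For the big-file case I mean something roughly like this (a sketch only;
>> get_stream() stands in for whatever helper we end up with):
>>
>>   import hashlib
>>
>>   digest = hashlib.sha256()
>>   with importlib.resources.get_stream("foo", "big.dat") as stream:
>>       # Read in chunks so a 2GB resource never has to sit in memory.
>>       for chunk in iter(lambda: stream.read(64 * 1024), b""):
>>           digest.update(chunk)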
>>
>
> With a context manager the closing requirement is negligible. And that's
> only an optimization if you're reading from something that allows for
> incremental reads, e.g. it's not an optimization for a SQL-backed loader
> (which is probably why PEP 302 has get_data() instead of get_file_object()
> or something).
>
>
> In almost all of the cases where I would personally use it, a context
> manager is awkward, so I won’t use it and I’ll just continue to not be
> zip-import compatible.
>
>
>
>>
>> You get a bytes method because sometimes you don’t care about all of that
>> and you just need/want the raw bytes; it’s a nicer API for those people to
>> be able to just get bytes without having to worry about reading a file or
>> closing the file after they are done reading it.
>>
>
> That seems unnecessary if you're already providing the optimization of
> returning a file-like object, since reading all of the bytes from it takes
> two lines of code instead of one. People know how to read files, so it isn't
> like it's a new paradigm.
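>
> I.e. the difference is roughly (with get_stream()/get_bytes() as the
> hypothetical higher-level helpers)::
>
>   with importlib.resources.get_stream(package, path) as stream:
>       data = stream.read()
>
> versus a single importlib.resources.get_bytes(package, path) call.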
>
>
>
>
>>
>> You get a filename method because the stream method may or may not return
>> a file object that has a path at all, and if you just need to pass the path
>> into another API, having an open file handle just to get the filename is a
>> waste of a file handle.
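>>
>> E.g. (a sketch; get_filename() here is the proposed helper, not an existing
>> function):
>>
>>   import ssl
>>
>>   cafile = importlib.resources.get_filename("mypkg", "cacert.pem")
>>   # OpenSSL insists on a real path here; bytes or a file object won't do.
>>   context = ssl.create_default_context(cafile=cafile)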
>>
>
> As I said above, I partially feel like the desire for this support is to
> work around some API decisions that are somewhat poor.
>
> How about this: get_path(package, path, *, real=False) or
> get_path(package, filename, *, real=False) -- depending on whether Barry
> and I get our way about paths or you do, Donald -- where 'real' is a flag
> specifying whether the path has to work as a path argument to
> builtins.open() and thus fails accordingly (in instances where it won't
> work it can fail immediately and so loader implementers only have two lines
> of code to care about to manage it). Then loaders can keep their get_data()
> method without issue and the API for loaders only grew by 1 (or stays
> constant depending on whether we want/can have it subsume get_filename()
> long-term).
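>
> On the loader side that might look roughly like this (purely an illustrative
> sketch; the class names are made up)::
>
>   import os
>
>   class FileLoaderish:
>       def get_path(self, package, path, *, real=False):
>           # A file-backed loader can always hand out a real path.
>           pkg_dir = os.path.dirname(self.get_filename(package))
>           return os.path.join(pkg_dir, path)
>
>   class ZipLoaderish:
>       def get_path(self, package, path, *, real=False):
>           if real:
>               # The two lines a loader has to care about: nothing inside a
>               # zip archive can be handed to builtins.open(), so fail now.
>               raise FileNotFoundError(path)
>           return package.replace('.', '/') + '/' + path  # opaque string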
>
> As for importlib.resources, that can provide a higher-level API for a
> file-like object, along with some way to say whether the file must be
> addressable on the filesystem, so it knows whether tempfile.NamedTemporaryFile()
> needs to back the file-like object or whether io.BytesIO can provide it.
>
> This gets me a clean API for loaders and importlib and gets you your real
> file paths as needed.
>
>
> I honestly don’t really care what API the loader has because I don’t think
> anybody but the importlib.resources functions is ever going to use it, so if
> you want to do something I hate with them, knock yourself out; I don’t care
> that much.
>

Well, you have to care to an extent because that API will be what lets you
do what you want to do at a higher level.


> I just think that requiring the use of __file__ and __path__ outside of the
> implementation of the loader itself is a pretty big code smell.
>

Who ever said anything about __file__ and __path__ outside of loaders? All
I have proposed is something to allow you to do, e.g.::

  def resource_stream(package, path):
    """Return a file-like object for a file relative to a package."""
    loader = ...
    try:
      return open(loader.get_path(package, path, real=True), 'rb')
    except NotARealFileThingy:
      loader_path = loader.get_path(package, path)
      return io.BytesIO(loader.get_data(loader_path))

Otherwise get_stream(package, path) and then, e.g.::

  def resource_filename(package, path):
    loader = ...
    stream = loader.get_stream(package, path)
    if hasattr(stream, 'path'):
      return stream.path
    fd, path = tempfile.mkstemp()  # i.e. make a tempfile path (assumes 'import os, tempfile')
    os.close(fd)
    with open(path, 'wb') as file:
      file.write(stream.read())
    return path

Please realize the kind of bind you're putting (at least) me in: trying to
abstract away paths so they don't really exist except as opaque things to
pass around for a loader is great and a goal I have been trying to meet,
but then you want real file paths when available so you can open files
directly and feed file paths to APIs that require them without creating a
temporary file. So you're asking for a loader API that doesn't directly
work with paths but that will spit them out when they are available and
has a way to differentiate them, which directly contradicts the idea of
having the loader API hide the concept of a file path away entirely
(granted it is on the edge of the API in terms of what it emits and not
what it directly takes in, but it does pierce the abstraction away from
paths).

This use case you're after is not something I haven't thought about or have
purposefully ignored. I too want people to work with loaders somehow so
that data carried with a project from PyPI can be loaded from any
reasonable loader implementation. But it's bloody hard, and it's going to
require some patience and compromise on all sides if we are going to get
something that doesn't make loaders explicitly path-aware and hard to
implement, while still allowing the common case you are after of avoiding
unnecessary overhead or having to do isinstance() checks for
importlib.machinery.SourceFileLoader.


>
> How about this: instead, here are the top-level APIs I want:
> https://bpaste.net/show/0c490aa07c07
>

Other than seeing resource_bytes() as redundant, and not wanting to give all
of them a "resource_" prefix if they are going to live in
importlib.resources, I'm basically fine with what you're after.


>
> How this is implemented in the Loader() API can be whatever folks want.
> The important thing is that these all solve actual use cases and solve them
> better and more easily than the naive approach of using os.path functions
> directly.
>
> Important Things:
>
> * resource_filename and ResourceFilename() must return the real file
> system path if available and a temporary file otherwise.
> * The file returned by resource_filename *must* remain available for the
> lifetime of the process once the function has been called.
> * ResourceFilename *must* clean itself up at the end of the context
> manager.
> * These functions/context managers *must* work in terms of package names
> and relative file paths.
>

All seem reasonable to me.
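
To make sure we're picturing the same behavior, here's a rough, untested
sketch of the temp-file fallback; get_bytes() and real_path_or_none() below
stand in for whatever loader-facing calls we end up settling on::

  import atexit
  import contextlib
  import os
  import tempfile

  def _fallback_tempfile(package, resource):
      # Materialize the resource on disk for loaders with no real file.
      data = get_bytes(package, resource)
      fd, path = tempfile.mkstemp()
      with os.fdopen(fd, 'wb') as fp:
          fp.write(data)
      return path

  def resource_filename(package, resource):
      """Real path if the loader has one, else a temp file kept for the process."""
      path = real_path_or_none(package, resource)
      if path is None:
          path = _fallback_tempfile(package, resource)
          # Not "leaked": the file stays usable until the process exits.
          atexit.register(os.remove, path)
      return path

  @contextlib.contextmanager
  def ResourceFilename(package, resource):
      """Same, but any temp file is removed when the with block exits."""
      path = real_path_or_none(package, resource)
      if path is not None:
          yield path
          return
      path = _fallback_tempfile(package, resource)
      try:
          yield path
      finally:
          os.remove(path)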

-Brett


>
>
> -Brett
>
>
>>
>>
>>
>> -Brett
>>
>>
>>>
>>> This means that we’re not talking about “data” files, but “resource”
>>> files. This also removes the idea that you can call Loader.set_data() on
>>> those files (like I’ve seen in the implementation).
>>>
>>>
>>>
>>>>
>>>>
>>>> One thing to consider is whether we want to allow anything other than
>>>> filenames for the path part? Thanks to namespace packages every directory
>>>> is essentially a package, so we could say that the package anchor has to
>>>> encapsulate the directory and the path bit can only be a filename. That
>>>> gets us even farther away from having the concept of file paths being
>>>> manipulated in relation to import-related APIs.
>>>>
>>>>
>>>> I think we do want to allow directories; it’s not unusual to have
>>>> something like:
>>>>
>>>> warehouse
>>>> ├── __init__.py
>>>> ├── templates
>>>> │   ├── accounts
>>>> │   │   └── profile.html
>>>> │   └── hello.html
>>>> ├── utils
>>>> │   └── mapper.py
>>>> └── wsgi.py
>>>>
>>>> Conceptually templates isn’t a package (even though with namespace
>>>> packages it kinda is) and I’d want to load profile.html by doing something
>>>> like:
>>>>
>>>> importlib.resources.get_bytes("warehouse",
>>>> "templates/accounts/profile.html")
>>>>
>>>
>>> Where I would be fine with get_bytes('warehouse.templates.accounts',
>>> 'profile.html')  =)
>>>
>>>
>>>>
>>>> In pkg_resources the second argument to that function is a “resource
>>>> path” which is defined as relative to the given module/package and
>>>> must use / as the separator. It explicitly says it’s not a file system path
>>>> but a resource path. It may translate to a file system path (as is the case
>>>> with the FileLoader) but it also may not (as is the case with a theoretical
>>>> S3Loader or PostgreSQLLoader).
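>>>>
>>>> For reference, the pkg_resources spelling of the example above would be
>>>> something like:
>>>>
>>>>   import pkg_resources
>>>>
>>>>   data = pkg_resources.resource_string(
>>>>       "warehouse", "templates/accounts/profile.html")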
>>>>
>>>
>>> Yep, which is why I'm making sure that if we have paths we minimize them,
>>> as they instantly make these alternative loader concepts a bigger pain to
>>> implement.
>>>
>>>
>>>> How you turn a warehouse + a resource path into some data (or whatever
>>>> other function we support) is an implementation detail of the Loader.
>>>>
>>>>
>>>> And just so I don't forget it, I keep wanting to pass an actual module
>>>> in so the code can extract the name that way, but that prevents the
>>>> __name__ trick as you would have to import yourself or grab the module from
>>>> sys.modules.
>>>>
>>>>
>>>> Is an actual module what gets passed into Loader().exec_module()?
>>>>
>>>
>>> Yes.
>>>
>>>
>>>> If so I think it’s fine to pass that into the new Loader() functions
>>>> and a new top level API in importlib.resources can do the things needed to
>>>> turn a string into a module object. So instead of doing
>>>> __loader__.get_bytes(__name__, "logo.gif") you’d do
>>>> importlib.resources.get_bytes(__name__, "logo.gif").
>>>>
>>>
>>> If we go the route of importlib.resources then that seems like a
>>> reasonable idea, although we will need to think through the ramifications
>>> to exec_module() itself, although I don't think there will be any issues.
>>>
>>> And if we do go with importlib.resources I will probably want to make it
>>> available on PyPI with appropriate imp/pkgutil fallbacks to help people
>>> transitioning from Python 2 to 3.
>>>
>>> ---
>>> Donald Stufft
>>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>
>>
>> ---
>> Donald Stufft
>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>
>
> ---
> Donald Stufft
> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>
>

