From bcannon at gmail.com Fri Jan 30 20:28:40 2015 From: bcannon at gmail.com (Brett Cannon) Date: Fri, 30 Jan 2015 19:28:40 +0000 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? Message-ID: Something I have been thinking about is whether we should start embedding the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo (the O could also be lowercase if people preferred). It would save people from making the mistake of executing their code with a mixture of -O and -OO. It also avoids having to regenerate all of your .pyo whenever you want to tweak which optimization level you are running at. And finally, if we make importlib.cache_from_source() take an optional `optimization` argument then people could even start specifying their own optimizations and have them saved to their own .pyo files (with the caveat that some restrictions be placed on the value, such as it has pass str.isalnum()). As for importlib.cache_from_source() and it's debug_override parameter, I would say we should lean on bools being ints and simply use its argument as the optimization level (while it gets phased out). I would love to even go so far as to say that we drop the .pyo file extension and make what has normally been .pyc files be .O0.pyc and what has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is that it might break too much code in a transition and so .pyc stays as such and then .O1.pyo and .O2.pyo comes into existence from the stdlib. By doing this the last bit of runtime state that influences compiling and importing code will somehow be exposed in bytecode files. I don't think it should be embedded in the bytecode file header as this has nothing to do with the validity of the bytecode compared to the source, just whether it should be run with the current interpreter (much like the interpreter name). Thoughts? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Fri Jan 30 20:35:28 2015 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 30 Jan 2015 11:35:28 -0800 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: Message-ID: <54CBDD00.2060703@stoneleaf.us> From a user perspective that sounds like a good idea. -- ~Ethan~ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From barry at python.org Fri Jan 30 22:46:46 2015 From: barry at python.org (Barry Warsaw) Date: Fri, 30 Jan 2015 16:46:46 -0500 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: Message-ID: <20150130164646.5d1538ff@anarchist.wooz.org> On Jan 30, 2015, at 07:28 PM, Brett Cannon wrote: >Something I have been thinking about is whether we should start embedding >the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo +1 - we've had some trouble in the past in Debian with the name collisions on .pyo for the different optimization levels. >I would love to even go so far as to say that we drop the .pyo file >extension and make what has normally been .pyc files be .O0.pyc and what >has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is >that it might break too much code in a transition and so .pyc stays as such >and then .O1.pyo and .O2.pyo comes into existence from the stdlib. I actually *would* go so far. 
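To make the naming scheme under discussion concrete, the tagged files would look roughly like this (a sketch only: the exact tag spelling, and the `optimization` argument to importlib.util.cache_from_source(), are hypothetical at this point):

    foo.cpython-35.pyc       # no optimization (today's plain .pyc)
    foo.cpython-35.O1.pyo    # python -O  (asserts stripped)
    foo.cpython-35.O2.pyo    # python -OO (asserts and docstrings stripped)

    # Hypothetical keyword; the current signature only has debug_override.
    importlib.util.cache_from_source('foo.py', optimization=2)
    # -> '__pycache__/foo.cpython-35.O2.pyo'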
I thought about it during the PEP 3147 time frame but it was out-of-scope at the time. A transition period might be necessary (and/or a switch to choose) but I think it's a good end state. Cheers, -Barry

From donald at stufft.io Sat Jan 31 00:37:44 2015 From: donald at stufft.io (Donald Stufft) Date: Fri, 30 Jan 2015 18:37:44 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package Message-ID:

It's often useful to be able to load a resource from a Python module or package. Currently you can load the data into memory using pkgutil.get_data, however this doesn't help much if you need to pass that data into an API that only accepts a filepath. Code that needs to do this often does something like os.path.join(os.path.dirname(__file__), "myfile.txt"), however that doesn't work from within a zip file.

I think it would be a good idea to implement a pkgutil.get_data_filename function which would return a filename that can be accessed to get at that particular bit of package data. In addition I think it would be a good idea to add an optional get_data_filename method onto the Loader that can be used by a loader to indicate when a file *already* exists on the filesystem. Essentially this boils down to the pkgutil.get_data_filename(package, resource) function doing this:

1. Check if the loader for the package implements a get_data_filename method, and if it does and it returns a value that is not None, simply return that value. The FileLoader can then have a simple get_data_filename that just returns the on-disk filename.

2. If the loader doesn't have a get_data_filename method, or it returns None, then call pkgutil.get_data; if that returns None then return None ourselves. If it doesn't return None then save that data to a temporary file and return the path to that temporary file.

I've implemented this (without tests); you can see it here: https://bpaste.net/show/2e51b0588dcd

I have a few concerns, however. Currently Loader.get_data() requires you to pass the entire path of the file you want to open (like /usr/lib/python3.5/site-packages/foo/bar.txt or /data/foo.zip/bar.txt), whereas I've made Loader.get_data_filename() want a relative path (like bar.txt). I wonder if this difference is OK? If not, I wonder if we can make Loader.get_data accept a relative path as well. I think this is a generally more useful way of using the function because it doesn't restrict loaders to the file system only (which get_data currently is restricted to, I believe) and it lets the Loader encapsulate the logic about how to translate a relative path to a chunk of data instead of needing the caller to do that.

My other problem is that pkgutil.get_data doesn't currently work for PEP 420 namespace packages, and due to the above I'm not sure how to actually make it work in a reasonable way without allowing get_data to accept relative paths as well. Because my patch lets the Loader encapsulate turning a relative path into a file path, pkgutil.get_data_filename() and _NamespaceLoader.get_data_filename both work and support PEP 420 namespace packages.

A. What do people think about pkgutil.get_data_filename and Loader.get_data_filename?

B. What do people think about modifying Loader.get_data so it can support relative filenames instead of the calling code needing to handle that?
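A minimal sketch of the lookup order in points 1 and 2 above (hypothetical helper and loader method names; not the code behind the bpaste link):

    import atexit
    import importlib
    import os
    import pkgutil
    import tempfile

    def get_data_filename(package, resource):
        loader = importlib.import_module(package).__loader__
        # 1. Prefer a loader that already knows an on-disk filename.
        #    'get_data_filename' on the loader is the optional method proposed
        #    above; it is assumed to take a path relative to the package.
        get_filename = getattr(loader, 'get_data_filename', None)
        if get_filename is not None:
            filename = get_filename(resource)
            if filename is not None:
                return filename
        # 2. Otherwise fall back to get_data() and spill to a temporary file.
        data = pkgutil.get_data(package, resource)
        if data is None:
            return None
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        atexit.register(os.remove, path)  # clean the temp file up at exit
        return path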
--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From ericsnowcurrently at gmail.com Sat Jan 31 01:01:58 2015 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Fri, 30 Jan 2015 17:01:58 -0700 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: <20150130164646.5d1538ff@anarchist.wooz.org> References: <20150130164646.5d1538ff@anarchist.wooz.org> Message-ID: On Fri, Jan 30, 2015 at 2:46 PM, Barry Warsaw wrote: > On Jan 30, 2015, at 07:28 PM, Brett Cannon wrote: > >>Something I have been thinking about is whether we should start embedding >>the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo > > +1 - we've had some trouble in the past in Debian with the name collisions on > .pyo for the different optimization levels. > >>I would love to even go so far as to say that we drop the .pyo file >>extension and make what has normally been .pyc files be .O0.pyc and what >>has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is >>that it might break too much code in a transition and so .pyc stays as such >>and then .O1.pyo and .O2.pyo comes into existence from the stdlib. > > I actually *would* go so far. I thought about it during the PEP 3147 > time frame but it was out-of-scope at the time. A transition period might be > necessary (and/or a switch to choose) but I think it's a good end state. +1 to all of it. :) -eric From ericsnowcurrently at gmail.com Sat Jan 31 01:03:32 2015 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Fri, 30 Jan 2015 17:03:32 -0700 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: Message-ID: On Fri, Jan 30, 2015 at 12:28 PM, Brett Cannon wrote: > And finally, if we make > importlib.cache_from_source() take an optional `optimization` argument then > people could even start specifying their own optimizations and have them > saved to their own .pyo files (with the caveat that some restrictions be > placed on the value, such as it has pass str.isalnum()). I like that! It would make it much easier to work on new optimizations. -eric From p.f.moore at gmail.com Sat Jan 31 01:18:30 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 00:18:30 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: On 30 January 2015 at 23:37, Donald Stufft wrote: > A. What do people think about pkgutil.get_data_filename and > Loader.get_data_filename? Sounds reasonable. It's a relatively rare, but useful use case. One possible issue, though, would people assume that if they get a filename it'd be writeable? For the filesystem loader it would be, but that would break subtly (writes work but would get discarded) for loaders that don't have a native get_data_filename. Related question - how would the temp files be cleaned up? At exit? > B. What do people think about modifying Loader.get_data so it can support > relative filenames instead of the calling code needing to handle that? I'd have to think about that one, but in principle it seems reasonable. While we're extending the loaders, a far more commonly requested feature would be to list available data files. At the moment, code can only load data from known paths, which is not ideal. While it's unrelated to the original proposal, it makes sense if we're changing the spec of loaders to do it in one go, rather than having multiple iterations. 
Paul From donald at stufft.io Sat Jan 31 01:52:50 2015 From: donald at stufft.io (Donald Stufft) Date: Fri, 30 Jan 2015 19:52:50 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> > On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: > > On 30 January 2015 at 23:37, Donald Stufft wrote: >> A. What do people think about pkgutil.get_data_filename and >> Loader.get_data_filename? > > Sounds reasonable. It's a relatively rare, but useful use case. One > possible issue, though, would people assume that if they get a > filename it'd be writeable? For the filesystem loader it would be, but > that would break subtly (writes work but would get discarded) for > loaders that don't have a native get_data_filename. I don?t think you can assume it?s writeable since that?ll break in a lot of common cases even with the filesystem loader since often times things in the filesystem will be installed in the system and users won?t have permissions to write to them anyways. > > Related question - how would the temp files be cleaned up? At exit? My patch registers an atexit handler that cleans up the temporary files yea. > >> B. What do people think about modifying Loader.get_data so it can support >> relative filenames instead of the calling code needing to handle that? > > I'd have to think about that one, but in principle it seems reasonable. > > While we're extending the loaders, a far more commonly requested > feature would be to list available data files. At the moment, code can > only load data from known paths, which is not ideal. While it's > unrelated to the original proposal, it makes sense if we're changing > the spec of loaders to do it in one go, rather than having multiple > iterations. Well both pkgutil.get_data and pkgutil.get_data_filename have parallels in the pkg_resources library for similar reasons. If we want to extend this to more things it might make sense to take a look at what all exists there currently: resource_exists(package_or_requirement, resource_name) Does the named resource exist? Return True or False accordingly. resource_stream(package_or_requirement, resource_name) Return a readable file-like object for the specified resource; it may be an actual file, a StringIO, or some similar object. The stream is in ?binary mode?, in the sense that whatever bytes are in the resource will be read as-is. resource_string(package_or_requirement, resource_name) Return the specified resource as a string. The resource is read in binary fashion, such that the returned string contains exactly the bytes that are stored in the resource. resource_isdir(package_or_requirement, resource_name) Is the named resource a directory? Return True or False accordingly. resource_listdir(package_or_requirement, resource_name) List the contents of the named resource directory, just like os.listdir except that it works even if the resource is in a zipfile. resource_filename(package_or_requirement, resource_name) Sometimes, it is not sufficient to access a resource in string or stream form, and a true filesystem filename is needed. In such cases, you can use this method (or module-level function) to obtain a filename for a resource. If the resource is in an archive distribution (such as a zipped egg), it will be extracted to a cache directory, and the filename within the cache will be returned. 
If the named resource is a directory, then all resources within that directory (including subdirectories) are also extracted. If the named resource is a C extension or ?eager resource? (see the setuptools documentation for details), then all C extensions and eager resources are extracted at the same time. See https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access and https://pythonhosted.org/setuptools/pkg_resources.html#resource-extraction Obviously the similar functions here are: * pkgutil.get_data is pkg_resources.resource_string * pkgutil.get_data_filename is pkg_resources.resource_filename The major difference being that pkg_resource.resource_filename will extract to a cache directory (controllable with an environment variable or programatically) and won't clean up the extracted files. This means that they are (by default) extracted once per user and reused between extractions. I felt like it made more sense to just extract to a temporary location (even though this is less performant) in the stdlib. That leaves: * resource_exists * resource_stream * resource_isdir * resource_listdir Which can be done via pkg_resources but not via the standard library, I don't have a major opinion on whether or not the standard library should do all of them but I don't think it would hurt if it did. Another interesting question if we're going to add more methods is where they should all live. As far as I know pkgutil.get_data predates the importlib module. Perhaps deprecating pkgutil.get_data and adding a importlib.resources module which supports functions like: * get_bytes(package, resource) * get_stream(package, resource) * get_filename(package, resource) * exists(package, resource) * isdir(package, resource) * listdir(package, resource) Changing the names (particular get_data -> get_bytes) could also provide the mechanism for allowing relative files and deprecating the "you must pass in a full file path to the Loader()" behavior since the get_data method could be left alone and a new get_bytes method could be added. This would mean people can do things like: import importlib.resources import socket import ssl context = ssl.SSLContext(ssl.PROTOCOL_SSLv23) context.verify_mode = ssl.CERT_REQUIRED context.check_hostname = True context.load_verify_locations( cafile=importlib.resources.get_filename("certifi", "cacert.pem"), ) s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ssl_sock = context.wrap_socket(s, server_hostname='www.verisign.com') ssl_sock.connect(('www.verisign.com', 443)) --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From p.f.moore at gmail.com Sat Jan 31 10:34:45 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 09:34:45 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On 31 January 2015 at 00:52, Donald Stufft wrote: >> Sounds reasonable. It's a relatively rare, but useful use case. One >> possible issue, though, would people assume that if they get a >> filename it'd be writeable? For the filesystem loader it would be, but >> that would break subtly (writes work but would get discarded) for >> loaders that don't have a native get_data_filename. 
> > I don?t think you can assume it?s writeable since that?ll break in a lot > of common cases even with the filesystem loader since often times things > in the filesystem will be installed in the system and users won?t have > permissions to write to them anyways. Agreed, It's just that it could happen (either deliberately or by accident). One example I found was pytz, which downloads and builds the timezone data by doing dirname(__file__) in an "update the DB" API call - it'd be an "obvious" case for using resource data. (That was from a long time ago - checking the code now they seem to have tidied this up so it's no longer that way). But yes, documenting it as "don't do that" is probably fine. >> Related question - how would the temp files be cleaned up? At exit? > > My patch registers an atexit handler that cleans up the temporary files yea. Great. >>> B. What do people think about modifying Loader.get_data so it can support >>> relative filenames instead of the calling code needing to handle that? >> >> I'd have to think about that one, but in principle it seems reasonable. >> >> While we're extending the loaders, a far more commonly requested >> feature would be to list available data files. At the moment, code can >> only load data from known paths, which is not ideal. While it's >> unrelated to the original proposal, it makes sense if we're changing >> the spec of loaders to do it in one go, rather than having multiple >> iterations. > > Well both pkgutil.get_data and pkgutil.get_data_filename have parallels in the > pkg_resources library for similar reasons. If we want to extend this to more > things it might make sense to take a look at what all exists there currently: +1 on following pkg_resources. > resource_exists(package_or_requirement, resource_name) > Does the named resource exist? Return True or False accordingly. > > resource_stream(package_or_requirement, resource_name) > Return a readable file-like object for the specified resource; it may be an > actual file, a StringIO, or some similar object. The stream is in > ?binary mode?, in the sense that whatever bytes are in the resource will be > read as-is. > > resource_string(package_or_requirement, resource_name) > Return the specified resource as a string. The resource is read in binary > fashion, such that the returned string contains exactly the bytes that are > stored in the resource. > > resource_isdir(package_or_requirement, resource_name) > Is the named resource a directory? Return True or False accordingly. > > resource_listdir(package_or_requirement, resource_name) > List the contents of the named resource directory, just like os.listdir > except that it works even if the resource is in a zipfile. > > resource_filename(package_or_requirement, resource_name) > Sometimes, it is not sufficient to access a resource in string or stream > form, and a true filesystem filename is needed. In such cases, you can use > this method (or module-level function) to obtain a filename for a resource. > If the resource is in an archive distribution (such as a zipped egg), it > will be extracted to a cache directory, and the filename within the cache > will be returned. If the named resource is a directory, then all resources > within that directory (including subdirectories) are also extracted. If the > named resource is a C extension or ?eager resource? (see the setuptools > documentation for details), then all C extensions and eager resources are > extracted at the same time. 
> > See https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access > and https://pythonhosted.org/setuptools/pkg_resources.html#resource-extraction > > Obviously the similar functions here are: > > * pkgutil.get_data is pkg_resources.resource_string > * pkgutil.get_data_filename is pkg_resources.resource_filename > > The major difference being that pkg_resource.resource_filename will extract to > a cache directory (controllable with an environment variable or > programatically) and won't clean up the extracted files. This means that they > are (by default) extracted once per user and reused between extractions. I felt > like it made more sense to just extract to a temporary location (even though > this is less performant) in the stdlib. > > That leaves: > > * resource_exists > * resource_stream > * resource_isdir > * resource_listdir > > Which can be done via pkg_resources but not via the standard library, I don't > have a major opinion on whether or not the standard library should do all of > them but I don't think it would hurt if it did. > > Another interesting question if we're going to add more methods is where they > should all live. As far as I know pkgutil.get_data predates the importlib > module. Perhaps deprecating pkgutil.get_data and adding a importlib.resources > module which supports functions like: > > * get_bytes(package, resource) > * get_stream(package, resource) > * get_filename(package, resource) > * exists(package, resource) > * isdir(package, resource) > * listdir(package, resource) > > Changing the names (particular get_data -> get_bytes) could also provide the > mechanism for allowing relative files and deprecating the "you must pass in > a full file path to the Loader()" behavior since the get_data method could be > left alone and a new get_bytes method could be added. > > This would mean people can do things like: > > import importlib.resources > import socket > import ssl > > context = ssl.SSLContext(ssl.PROTOCOL_SSLv23) > context.verify_mode = ssl.CERT_REQUIRED > context.check_hostname = True > context.load_verify_locations( > cafile=importlib.resources.get_filename("certifi", "cacert.pem"), > ) > > s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) > ssl_sock = context.wrap_socket(s, server_hostname='www.verisign.com') > ssl_sock.connect(('www.verisign.com', 443)) +1 on all of the above. Obviously, a lot of the support methods in loaders would need to be optional, but that's fine - and the vast majority of use cases are the filesystem and zipfiles, both of which support these methods, and can be handled in the stdlib. Paul From brett at python.org Sat Jan 31 15:48:12 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 14:48:12 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On Fri, Jan 30, 2015, 19:52 Donald Stufft wrote: > On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: > > On 30 January 2015 at 23:37, Donald Stufft wrote: >> A. What do people think about pkgutil.get_data_filename and >> Loader.get_data_filename? > > Sounds reasonable. It's a relatively rare, but useful use case. One > possible issue, though, would people assume that if they get a > filename it'd be writeable? For the filesystem loader it would be, but > that would break subtly (writes work but would get discarded) for > loaders that don't have a native get_data_filename. 
I don?t think you can assume it?s writeable since that?ll break in a lot of common cases even with the filesystem loader since often times things in the filesystem will be installed in the system and users won?t have permissions to write to them anyways. > > Related question - how would the temp files be cleaned up? At exit? My patch registers an atexit handler that cleans up the temporary files yea. > >> B. What do people think about modifying Loader.get_data so it can support >> relative filenames instead of the calling code needing to handle that? > > I'd have to think about that one, but in principle it seems reasonable. > > While we're extending the loaders, a far more commonly requested > feature would be to list available data files. At the moment, code can > only load data from known paths, which is not ideal. While it's > unrelated to the original proposal, it makes sense if we're changing > the spec of loaders to do it in one go, rather than having multiple > iterations. Well both pkgutil.get_data and pkgutil.get_data_filename have parallels in the pkg_resources library for similar reasons. If we want to extend this to more things it might make sense to take a look at what all exists there currently: resource_exists(package_or_requirement, resource_name) Does the named resource exist? Return True or False accordingly. resource_stream(package_or_requirement, resource_name) Return a readable file-like object for the specified resource; it may be an actual file, a StringIO, or some similar object. The stream is in ?binary mode?, in the sense that whatever bytes are in the resource will be read as-is. resource_string(package_or_requirement, resource_name) Return the specified resource as a string. The resource is read in binary fashion, such that the returned string contains exactly the bytes that are stored in the resource. resource_isdir(package_or_requirement, resource_name) Is the named resource a directory? Return True or False accordingly. resource_listdir(package_or_requirement, resource_name) List the contents of the named resource directory, just like os.listdir except that it works even if the resource is in a zipfile. resource_filename(package_or_requirement, resource_name) Sometimes, it is not sufficient to access a resource in string or stream form, and a true filesystem filename is needed. In such cases, you can use this method (or module-level function) to obtain a filename for a resource. If the resource is in an archive distribution (such as a zipped egg), it will be extracted to a cache directory, and the filename within the cache will be returned. If the named resource is a directory, then all resources within that directory (including subdirectories) are also extracted. If the named resource is a C extension or ?eager resource? (see the setuptools documentation for details), then all C extensions and eager resources are extracted at the same time. See https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access and https://pythonhosted.org/setuptools/pkg_resources.html#resource-extraction Obviously the similar functions here are: * pkgutil.get_data is pkg_resources.resource_string * pkgutil.get_data_filename is pkg_resources.resource_filename The major difference being that pkg_resource.resource_filename will extract to a cache directory (controllable with an environment variable or programatically) and won't clean up the extracted files. This means that they are (by default) extracted once per user and reused between extractions. 
I felt like it made more sense to just extract to a temporary location (even though this is less performant) in the stdlib. That leaves: * resource_exists * resource_stream * resource_isdir * resource_listdir Which can be done via pkg_resources but not via the standard library, I don't have a major opinion on whether or not the standard library should do all of them but I don't think it would hurt if it did. Another interesting question if we're going to add more methods is where they should all live. As far as I know pkgutil.get_data predates the importlib module. It does, so you really have to think in terms of finders and loaders. Perhaps deprecating pkgutil.get_data and adding a importlib.resources module which supports functions like: * get_bytes(package, resource) * get_stream(package, resource) * get_filename(package, resource) * exists(package, resource) * isdir(package, resource) * listdir(package, resource) Changing the names (particular get_data -> get_bytes) could also provide the mechanism for allowing relative files and deprecating the "you must pass in a full file path to the Loader()" behavior since the get_data method could be left alone and a new get_bytes method could be added. The reason Loader.get_data() takes absolute paths is to do away with ambiguity. If you have a relative path and ask a loader to read that path, where should that relative path be anchored? Should it be the top-level package? What about the module that loader ewas returned to handle? But then what about if a finder caches loaders and reuses them across modules (nothing in PEP 302 says you can't do this and in actuality the frozen and built-in loaders are just static and class methods). The choice of dealing exclusively in absolute paths was a conscious choice on my part. Now having said that, there is nothing to say absolute paths require file system I based paths. What you should really do is think of these paths as opaque, non-ambiguous paths for the loader which claimed it knew what file path was needed to pass to get_data(). If you think that way then you realize you can use markers in the path as necessary, e.g. some/path/file.zip/pkg/sub/data.txt. As long as loader.get_data() can unambiguously read that path as returned by get_data_filename() or whatever the method is called then you have fully abstracted paths out while still being able to read data from a loader. Basically any API dealing with paths for loaders needs to abstract away the concept of files, file-like paths, etc. and rely on using the loader API on pretty much everything as a simple os.path of its own. This is why I have not tried to tackle the issue of the list_contents() or some such API to list modules and potentially data files as it needs to not really have a concrete concept of file paths (and it really should be on finders and not loaders which complicates discovery, selecting the right finder, etc.). This is also why APIs wanting a file path instead of taking a file-like object simply cannot play well with importlib and loaders which have alternative back end storage without simply being lucky that the loader they are working with uses filesystem paths (or writing out to a temp file). 
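As an illustration of that "opaque but unambiguous path" idea, here is a made-up zip-backed loader (not importlib or zipimport code) whose get_data() only understands paths that embed the archive name, e.g. some/path/file.zip/pkg/sub/data.txt:

    import zipfile

    class ZipLoader:
        def __init__(self, archive):
            self.archive = archive  # e.g. 'some/path/file.zip'

        def get_data(self, path):
            # The path is "absolute" for this loader: it must start with the
            # archive name; the remainder is resolved inside the zip file.
            # (Assumes '/' separators for simplicity.)
            prefix = self.archive + '/'
            if not path.startswith(prefix):
                raise OSError('path not handled by this loader: %r' % path)
            with zipfile.ZipFile(self.archive) as zf:
                return zf.read(path[len(prefix):])  # bytes, as get_data() promises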
-brett This would mean people can do things like: import importlib.resources import socket import ssl context = ssl.SSLContext(ssl.PROTOCOL_SSLv23) context.verify_mode = ssl.CERT_REQUIRED context.check_hostname = True context.load_verify_locations( cafile=importlib.resources.get_filename("certifi", "cacert.pem"), ) s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ssl_sock = context.wrap_socket(s, server_hostname='www.verisign.com') ssl_sock.connect(('www.verisign.com', 443)) --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA _______________________________________________ Import-SIG mailing list Import-SIG at python.org https://mail.python.org/mailman/listinfo/import-sig -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 16:34:41 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 10:34:41 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> > On Jan 31, 2015, at 9:48 AM, Brett Cannon wrote: > > The reason Loader.get_data() takes absolute paths is to do away with ambiguity. If you have a relative path and ask a loader to read that path, where should that relative path be anchored? Should it be the top-level package? What about the module that loader ewas returned to handle? But then what about if a finder caches loaders and reuses them across modules (nothing in PEP 302 says you can't do this and in actuality the frozen and built-in loaders are just static and class methods). The choice of dealing exclusively in absolute paths was a conscious choice on my part. > > Now having said that, there is nothing to say absolute paths require file system I based paths. What you should really do is think of these paths as opaque, non-ambiguous paths for the loader which claimed it knew what file path was needed to pass to get_data(). If you think that way then you realize you can use markers in the path as necessary, e.g. some/path/file.zip/pkg/sub/data.txt. As long as loader.get_data() can unambiguously read that path as returned by get_data_filename() or whatever the method is called then you have fully abstracted paths out while still being able to read data from a loader. > > Basically any API dealing with paths for loaders needs to abstract away the concept of files, file-like paths, etc. and rely on using the loader API on pretty much everything as a simple os.path of its own. This is why I have not tried to tackle the issue of the list_contents() or some such API to list modules and potentially data files as it needs to not really have a concrete concept of file paths (and it really should be on finders and not loaders which complicates discovery, selecting the right finder, etc.). This is also why APIs wanting a file path instead of taking a file-like object simply cannot play well with importlib and loaders which have alternative back end storage without simply being lucky that the loader they are working with uses filesystem paths (or writing out to a temp file). > I think that dealing in absolute file paths (whether they are ?real? paths or not) makes the APIs super hard to use in anything but the simple case. For instance what do you do in a namespace package (either PEP 420 or one that extends the module __path__). 
There you have multiple candidate file paths and no good way to figure out which one you need to use and It requires that your code couple itself with the implementation of the package and it will break if someone changes from a module to a namespace package. The way the PEP 302 Loaders work isn?t super obvious to me, so I?m looking at the implementation and making assumptions about it and I thought that it was one Loader per importable name. Looking closer it appears the way you ?import? a module from a Loader is using Loader().exec_module(?foo.bar?). So I?d say then that the Loader() APIs should be Loader().get_bytes(?foo.bar?, ?relative/to/foo.bar/file.txt?). That should resolve the case about not knowing what it should be relative to, since it should be relative to the name given. Then the Loader() can encapsulate the logic about how to turn ?foo.bar? + ?relative/to/foo.bar/file.txt? into an absolute path for to get some data (or something else). It seems obvious to me that requiring a full path like that is the wrong way to expect people to work with constructing full paths for resources. It would be similar to expecting people to do ``import /data/foo.zip/submodule``. The import system should be abstracting all of that away for them. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Sat Jan 31 16:38:43 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:38:43 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On 31 January 2015 at 14:48, Brett Cannon wrote: > Basically any API dealing with paths for loaders needs to abstract away the > concept of files, file-like paths, etc. and rely on using the loader API on > pretty much everything as a simple os.path of its own. This is why I have > not tried to tackle the issue of the list_contents() or some such API to > list modules and potentially data files as it needs to not really have a > concrete concept of file paths (and it really should be on finders and not > loaders which complicates discovery, selecting the right finder, etc.). This > is also why APIs wanting a file path instead of taking a file-like object > simply cannot play well with importlib and loaders which have alternative > back end storage without simply being lucky that the loader they are working > with uses filesystem paths (or writing out to a temp file). At the time we designed PEP 302, the principle was very strongly to limit the API to the bare minimum that we knew loaders would have to support (you have to be able to get the content of a file, because that's how you load a module). This was because non-filesystem modules were a new concept at the time, and if we'd asked what do people need, everyone (ourselves included) would have automatically assumed "everything a filesystem can do" and we'd have ended up just designing a virtual filesystem API and excluding a lot of possible flexibility (loaders for URLs, or databases, or whatever). Now we've had experience with PEP 302, it's clear that people aren't using the extra flexibility much - but they *do* miss filesystem-like APIs. At the moment, pkg_resources fills in the gap, but that's not integrated with the loader system. 
So I think it's probably about time to accept that these extensions are useful and *don't* limit flexibility in any practical way, and add them to the loader protocol. Paul From donald at stufft.io Sat Jan 31 16:45:16 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 10:45:16 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <7A29542D-8925-4C9A-9F89-999EB679C44D@stufft.io> > On Jan 31, 2015, at 10:38 AM, Paul Moore wrote: > > On 31 January 2015 at 14:48, Brett Cannon wrote: >> Basically any API dealing with paths for loaders needs to abstract away the >> concept of files, file-like paths, etc. and rely on using the loader API on >> pretty much everything as a simple os.path of its own. This is why I have >> not tried to tackle the issue of the list_contents() or some such API to >> list modules and potentially data files as it needs to not really have a >> concrete concept of file paths (and it really should be on finders and not >> loaders which complicates discovery, selecting the right finder, etc.). This >> is also why APIs wanting a file path instead of taking a file-like object >> simply cannot play well with importlib and loaders which have alternative >> back end storage without simply being lucky that the loader they are working >> with uses filesystem paths (or writing out to a temp file). > > At the time we designed PEP 302, the principle was very strongly to > limit the API to the bare minimum that we knew loaders would have to > support (you have to be able to get the content of a file, because > that's how you load a module). This was because non-filesystem modules > were a new concept at the time, and if we'd asked what do people need, > everyone (ourselves included) would have automatically assumed > "everything a filesystem can do" and we'd have ended up just designing > a virtual filesystem API and excluding a lot of possible flexibility > (loaders for URLs, or databases, or whatever). > > Now we've had experience with PEP 302, it's clear that people aren't > using the extra flexibility much - but they *do* miss filesystem-like > APIs. At the moment, pkg_resources fills in the gap, but that's not > integrated with the loader system. So I think it's probably about time > to accept that these extensions are useful and *don't* limit > flexibility in any practical way, and add them to the loader protocol. I don?t think we need to even limit things to file system like loaders. If we make the ?expanded? resource APIs optional then if your non file system loader can?t support something like listing all of the files at a sub resource then it just doesn?t implement that. It means that maybe every type of code won?t work with every type of loader but I think that?s a situation that isn?t able to be remedied. 
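In practice that just means the wrappers (pkgutil today, or an importlib.resources module if one is added) feature-detect the optional method and fail cleanly when it is missing; a rough sketch, with the method name invented for illustration:

    import importlib

    def resource_listdir(package, resource):
        loader = importlib.import_module(package).__loader__
        # 'resource_listdir' is a placeholder for whatever optional listing
        # API a loader might grow; loaders that cannot enumerate their
        # backing store simply do not define it.
        method = getattr(loader, 'resource_listdir', None)
        if method is None:
            raise NotImplementedError(
                '{!r} does not support listing resources'.format(loader))
        return method(package, resource)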
--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From p.f.moore at gmail.com Sat Jan 31 16:46:33 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:46:33 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On 31 January 2015 at 15:34, Donald Stufft wrote: > It seems obvious to me that requiring a full path like that is the wrong way > to expect people to work with constructing full paths for resources. It > would be similar to expecting people to do ``import > /data/foo.zip/submodule``. The import system should be abstracting all of > that away for them. Note the example in PEP 302: d = os.path.dirname(__file__) data = __loader__.get_data(os.path.join(d, "logo.gif")) The parallel is with the historical filesystem-only approach, d = os.path.dirname(__file__) with open(os.path.join(d, "logo.gif"), 'rb') as f: data = f.read() You *don't* want to use a relative pathname then in this case, so the loader protocol is designed to follow that usage. As Brett says, __file__ can have non-filesystem "token" elements (e.g., a zipfile name) if necessary. It's certainly possible to add a new API that loads resources based on a relative name, but you'd have to specify relative to *what*. get_data explicitly ducks out of making that decision. Paul From p.f.moore at gmail.com Sat Jan 31 16:47:13 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:47:13 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <7A29542D-8925-4C9A-9F89-999EB679C44D@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <7A29542D-8925-4C9A-9F89-999EB679C44D@stufft.io> Message-ID: On 31 January 2015 at 15:45, Donald Stufft wrote: > I don?t think we need to even limit things to file system like loaders. > If we make the ?expanded? resource APIs optional then if your non file > system loader can?t support something like listing all of the files at > a sub resource then it just doesn?t implement that. It means that maybe > every type of code won?t work with every type of loader but I think that?s > a situation that isn?t able to be remedied. Sorry, I didn't state that explicitly, but I certainly assumed that would be the case. Paul From donald at stufft.io Sat Jan 31 16:47:44 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 10:47:44 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 10:46 AM, Paul Moore wrote: > > On 31 January 2015 at 15:34, Donald Stufft wrote: >> It seems obvious to me that requiring a full path like that is the wrong way >> to expect people to work with constructing full paths for resources. It >> would be similar to expecting people to do ``import >> /data/foo.zip/submodule``. The import system should be abstracting all of >> that away for them. 
> > Note the example in PEP 302: > > d = os.path.dirname(__file__) > data = __loader__.get_data(os.path.join(d, "logo.gif")) > > The parallel is with the historical filesystem-only approach, > > d = os.path.dirname(__file__) > with open(os.path.join(d, "logo.gif"), 'rb') as f: > data = f.read() > > You *don't* want to use a relative pathname then in this case, so the > loader protocol is designed to follow that usage. As Brett says, > __file__ can have non-filesystem "token" elements (e.g., a zipfile > name) if necessary. > > It's certainly possible to add a new API that loads resources based on > a relative name, but you'd have to specify relative to *what*. > get_data explicitly ducks out of making that decision. data = __loader__.get_bytes(__name__, ?logo.gif?) --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From p.f.moore at gmail.com Sat Jan 31 16:54:13 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:54:13 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On 31 January 2015 at 15:47, Donald Stufft wrote: >> It's certainly possible to add a new API that loads resources based on >> a relative name, but you'd have to specify relative to *what*. >> get_data explicitly ducks out of making that decision. > > data = __loader__.get_bytes(__name__, ?logo.gif?) Quite possibly. It needs a bit of fleshing out to make sure it doesn't prohibit sharing of loaders, etc, in the way Brett mentions. Also, the fact that it needs __name__ in there feels wrong - a bit like the old version of super() needing to be told which class it was being called from. But in principle I don't object to finding a suitable form of this. And I like the name get_bytes - much more explicit in these Python 3 days of explicit str/bytes distinctions :-) Paul From donald at stufft.io Sat Jan 31 17:13:05 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 11:13:05 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 10:54 AM, Paul Moore wrote: > > On 31 January 2015 at 15:47, Donald Stufft wrote: >>> It's certainly possible to add a new API that loads resources based on >>> a relative name, but you'd have to specify relative to *what*. >>> get_data explicitly ducks out of making that decision. >> >> data = __loader__.get_bytes(__name__, ?logo.gif?) > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > prohibit sharing of loaders, etc, in the way Brett mentions. Also, the > fact that it needs __name__ in there feels wrong - a bit like the old > version of super() needing to be told which class it was being called > from. But in principle I don't object to finding a suitable form of > this. To be clear, I think using __name__ is massively better than using __file__, for one even though PEP 302 states that __file__ must be set, it actually doesn?t have to be set and PEP 420 doesn?t set it. Even if it did set it that pattern is only actually really usable for non namespace packages (of any type). 
The namespace package way of doing that is basically:

    for path in __path__:
        try:
            data = __loader__.get_data(os.path.join(path, 'logo.gif'))
        except FileNotFoundError:
            pass
        else:
            break
    else:
        raise Exception("Cannot find the file 'logo.gif'")

Either way, if a Loader isn't specific to a particular importable name and can be re-used between them, then you need a way to specify what module it's relative to, and it seems to me the *obvious* way to load a resource that is relative to a module is to tell Python you want to load a particular resource from a particular module, not to construct some (pseudo) file path that says all that information as well but requires you to know if the thing you're importing is a Python module, a Python package, or a namespace package. In order to make a function like pkgutil.get_data that actually works in all situations you'd have to do something like:

    def get_data(package, resource):
        mod = importlib.import_module(package)

        if hasattr(mod, '__path__'):
            for path in mod.__path__:
                try:
                    return mod.__loader__.get_data(os.path.join(path, resource))
                except FileNotFoundError:
                    pass

        if hasattr(mod, "__file__"):
            d = os.path.dirname(mod.__file__)
            try:
                return mod.__loader__.get_data(os.path.join(d, resource))
            except FileNotFoundError:
                pass

This is compared to the situation where the Loaders encapsulate that logic for you:

    def get_data(package, resource):
        mod = importlib.import_module(package)

        try:
            return mod.__loader__.get_bytes(package, resource)
        except FileNotFoundError:
            pass

Obviously the logic in the first function still exists, it's just moved away from the caller needing to handle it and instead the Loader handles it, just like the loader abstracts away the __file__ location for importing a particular module. Although looking closer at the Loader().exec_module implementation, it appears that it expects something other than a string to be passed to it. So if it makes sense, possibly Loader().get_bytes() etc should also expect something other than a string to identify the module as well (whatever it actually wants, I can't tell). Then the utility functions in pkgutil or importlib.resources or whatever will do the logic to translate from a string to whatever the Loader itself wants.

> > And I like the name get_bytes - much more explicit in these Python 3 > days of explicit str/bytes distinctions :-) > Paul

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

From brett at python.org Sat Jan 31 17:19:25 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:19:25 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 10:34:45 AM Donald Stufft wrote: > > On Jan 31, 2015, at 9:48 AM, Brett Cannon wrote: > > The reason Loader.get_data() takes absolute paths is to do away with > ambiguity. If you have a relative path and ask a loader to read that path, > where should that relative path be anchored? Should it be the top-level > package? What about the module that loader was returned to handle? But > then what about if a finder caches loaders and reuses them across modules > (nothing in PEP 302 says you can't do this and in actuality the frozen and > built-in loaders are just static and class methods). The choice of dealing > exclusively in absolute paths was a conscious choice on my part. > > Now having said that, there is nothing to say absolute paths require file > system I based paths.
What you should really do is think of these paths as > opaque, non-ambiguous paths for the loader which claimed it knew what file > path was needed to pass to get_data(). If you think that way then you > realize you can use markers in the path as necessary, e.g. > some/path/file.zip/pkg/sub/data.txt. As long as loader.get_data() can > unambiguously read that path as returned by get_data_filename() or whatever > the method is called then you have fully abstracted paths out while still > being able to read data from a loader. > > Basically any API dealing with paths for loaders needs to abstract away > the concept of files, file-like paths, etc. and rely on using the loader > API on pretty much everything as a simple os.path of its own. This is why I > have not tried to tackle the issue of the list_contents() or some such API > to list modules and potentially data files as it needs to not really have a > concrete concept of file paths (and it really should be on finders and not > loaders which complicates discovery, selecting the right finder, etc.). > This is also why APIs wanting a file path instead of taking a file-like > object simply cannot play well with importlib and loaders which have > alternative back end storage without simply being lucky that the loader > they are working with uses filesystem paths (or writing out to a temp file). > > > I think that dealing in absolute file paths (whether they are ?real? paths > or not) makes the APIs super hard to use in anything but the simple case. > I think we are talking about two different things when we say "relative"; I clarify later. > For instance what do you do in a namespace package (either PEP 420 or one > that extends the module __path__). > There you have multiple candidate file paths and no good way to figure out > which one you need to use and It requires that your code couple itself with > the implementation of the package and it will break if someone changes from > a module to a namespace package. > Yep, but that's just life. If you're reading data out of a package anyway then you are already coupled to its structure so this is no different. > > The way the PEP 302 Loaders work isn?t super obvious to me, so I?m looking > at the implementation and making assumptions about it and I thought that it > was one Loader per importable name. Looking closer it appears the way you > ?import? a module from a Loader is using Loader().exec_module(?foo.bar?). > So I?d say then that the Loader() APIs should be > Loader().get_bytes(?foo.bar?, ?relative/to/foo.bar/file.txt?). That should > resolve the case about not knowing what it should be relative to, since it > should be relative to the name given. Then the Loader() can encapsulate the > logic about how to turn ?foo.bar? + ?relative/to/foo.bar/file.txt? into an > absolute path for to get some data (or something else). > Yes, specifying the package anchor point does away with the ambiguity of relativity as it has an absolute position in a namespace. As long as we do **that** then there are no relative paths to speak of as all the information necessary to calculate an absolute path without ambiguity is provided. > > It seems obvious to me that requiring a full path like that is the wrong > way to expect people to work with constructing full paths for resources. It > would be similar to expecting people to do ``import > /data/foo.zip/submodule``. The import system should be abstracting all of > that away for them. 
> I think what you mean by "relative" and what I mean by "relative" are different. When I say "relative" I mean what you pass to loader.get_data(). What you mean by "relative" is I think the "file.txt" part of a call to get_bytes('some.module', "file.txt") which I don't consider relative as you specify everything for an absolute path. IOW I'm talking about the existing API and its semantics and you're talking in terms of your new API, so we are talking past each other. =) -Brett > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 17:31:41 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:31:41 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote: > On 31 January 2015 at 15:47, Donald Stufft wrote: > >> It's certainly possible to add a new API that loads resources based on > >> a relative name, but you'd have to specify relative to *what*. > >> get_data explicitly ducks out of making that decision. > > > > data = __loader__.get_bytes(__name__, ?logo.gif?) > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > prohibit sharing of loaders, etc, in the way Brett mentions. By specifying the package anchor point I don't think it does. > Also, the > fact that it needs __name__ in there feels wrong - a bit like the old > version of super() needing to be told which class it was being called > from. You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. > But in principle I don't object to finding a suitable form of > this. > > And I like the name get_bytes - much more explicit in these Python 3 > days of explicit str/bytes distinctions :-) One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From donald at stufft.io Sat Jan 31 17:43:52 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 11:43:52 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 10:54:22 AM Paul Moore > wrote: > On 31 January 2015 at 15:47, Donald Stufft > wrote: > >> It's certainly possible to add a new API that loads resources based on > >> a relative name, but you'd have to specify relative to *what*. > >> get_data explicitly ducks out of making that decision. > > > > data = __loader__.get_bytes(__name__, ?logo.gif?) > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > prohibit sharing of loaders, etc, in the way Brett mentions. > > By specifying the package anchor point I don't think it does. > > Also, the > fact that it needs __name__ in there feels wrong - a bit like the old > version of super() needing to be told which class it was being called > from. > > You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. > > But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. > > But in principle I don't object to finding a suitable form of > this. > > And I like the name get_bytes - much more explicit in these Python 3 > days of explicit str/bytes distinctions :-) > > One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data. I think it is redundant but given that it?s a new API (passing module and a ?resource path?) I think it makes sense. The old get_data API can be deprecated but left in for compatibility reasons if we want (sort of like Loader().load_module() -> Loader().exec_module()). > > One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. I think we do want to allow directories, it?s not unusual to have something like: warehouse ??? __init__.py ??? templates ? ??? accounts ? ? ??? profile.html ? ??? hello.html ??? utils ? ??? mapper.py ??? 
wsgi.py Conceptually templates isn?t a package (even though with namespace packages it kinda is) and I?d want to load profile.html by doing something like: importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) In pkg_resources the second argument to that function is a ?resource path? which is defined as a relative to the given module/package and it must use / to denote them. It explicitly says it?s not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. > > And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. Is an actual module what gets passed into Loader().exec_module()? If so I think it?s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, ?logo.gif?) you?d do importlib.resources.get_bytes(__name__, ?logo.gif?). --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 17:21:57 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:21:57 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 11:13:08 AM Donald Stufft wrote: > > > On Jan 31, 2015, at 10:54 AM, Paul Moore wrote: > > > > On 31 January 2015 at 15:47, Donald Stufft wrote: > >>> It's certainly possible to add a new API that loads resources based on > >>> a relative name, but you'd have to specify relative to *what*. > >>> get_data explicitly ducks out of making that decision. > >> > >> data = __loader__.get_bytes(__name__, ?logo.gif?) > > > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > > prohibit sharing of loaders, etc, in the way Brett mentions. Also, the > > fact that it needs __name__ in there feels wrong - a bit like the old > > version of super() needing to be told which class it was being called > > from. But in principle I don't object to finding a suitable form of > > this. > > To be clear, I think using __name__ is massively better than using > __file__, > for one even though PEP 302 states that __file__ must be set, it actually > doesn?t have to be set and PEP 420 doesn?t set it. Even if it did set it > that pattern is only actually really usable for non namespace packages (of > any type). > So you're starting to get into the murky corners of import. =) PEP 420 actually supercedes PEP 302, but that doesn't mean it negates it. For backwards-compatibility importlib still sets __file__, but you're right it isn't necessary as long as __spec__ is set. -Brett > > The namespace package way of doing that is basically: > > for path in __path__: > try: > data = __loader__.get_data(os.path.join(path, ?logo.gif?)) > except FileNotFoundError: > pass > else: > break > else: > raise Exception(?Cannot Find the file ?logo.gif??) 
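(A runnable restatement of the loop quoted above, assuming, as the snippet itself does, that the package's __loader__ exposes get_data(); 'logo.gif' is only an example name:)

    import os.path

    def find_logo(mod):
        # Try each directory that makes up the (possibly namespace) package
        # until one of them yields the data file.
        for path in mod.__path__:
            try:
                return mod.__loader__.get_data(os.path.join(path, 'logo.gif'))
            except FileNotFoundError:
                pass
        raise FileNotFoundError("Cannot find the file 'logo.gif'")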
> > > Either way if a Loader isn?t specific to a particular importable name and > can > be re-used between them then you need a way to specify what module it?s > relative > to and it seems to me the *obvious* way to load a resource that is > relative to > a module is to tell Python you want to load a particular resource from a > particular > module, not to construct some (pseudo) file path that says all that > information > as well but requires you to know if the thing you?re importing is a Python > module, a python package, or a namespace package. > > In order to make a function like pkgutil.get_data that actually works in > all > situations that you?d have to do something like: > > def get_data(package, resource): > mod = importlib.import_module(package) > if hasattr(mod, ?__path__?): > for path in __path__: > try: > return mod.__loader__.get_data(os.path.join(path, > resource)) > except FileNotFoundError: > pass > if hasattr(mod, "__file__"): > d = os.path.dirname(__file__) > try: > return mod.__loader__.get_data(os.path.join(d, resource)) > except FileNotFoundError: > pass > > This is compared to the situation where the Loaders encapsulate that logic > for you: > > def get_data(package, resource): > mod = importlib.import_module(package) > try: > mod.__loader__.get_bytes(package, resource) > except FileNotFoundError: > pass > > Obviously the logic in the first function still exists, it?s just moved > away > from the caller needing to handle it and instead the Loader handles it, > just > like the loader abstracts away the __file__ location for importing a > particular > module. > > Although looking closer at the Loader().exec_module implementation, It > appears > that it expects something other than a string to be passed to it. So if it > makes > sense possibly Loader().get_bytes() etc should also expect something other > than > a string to identify the module as well (whatever it actually wants, I > can?t tell). > Then the utility functions in pkgutil or importlib.resources or whatever > will do > the logic to translate from a string to whatever the Loader itself wants. > > > > > > And I like the name get_bytes - much more explicit in these Python 3 > > days of explicit str/bytes distinctions :-) > > Paul > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 17:48:36 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:48:36 +0000 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? References: <20150130164646.5d1538ff@anarchist.wooz.org> Message-ID: On Fri Jan 30 2015 at 7:02:04 PM Eric Snow wrote: > On Fri, Jan 30, 2015 at 2:46 PM, Barry Warsaw wrote: > > On Jan 30, 2015, at 07:28 PM, Brett Cannon wrote: > > > >>Something I have been thinking about is whether we should start embedding > >>the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo > > > > +1 - we've had some trouble in the past in Debian with the name > collisions on > > .pyo for the different optimization levels. > > > >>I would love to even go so far as to say that we drop the .pyo file > >>extension and make what has normally been .pyc files be .O0.pyc and what > >>has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is > >>that it might break too much code in a transition and so .pyc stays as > such > >>and then .O1.pyo and .O2.pyo comes into existence from the stdlib. > > > > I actually *would* go so far. 
I thought about it during the PEP 3147 > > time frame but it was out-of-scope at the time. A transition period > might be > > necessary (and/or a switch to choose) but I think it's a good end state. > Assuming no one flips out about writing a bunch of files we could write files using the new and old file paths (or symlink the old paths to the new, but that seems to be asking for trouble on some OS that doesn't support them but maybe I'm being paranoid). That way people who construct file paths manually can still read the old paths but those who use cache_from_source() will get the new paths automatically (although override_debug will be a little wonky but nothing horrible in the New World). And anyone who really doesn't want all of those files written can run with sys.dont_write_bytecode set to True after byte-compiling their code. This "multiple bytecode files for the same thing" approach might spike stat calls since we would have to check which path is newer in case someone edited the old path out-of-band, but it shouldn't be too bad (it will obviously startup time will have to be measured). > > +1 to all of it. :) > Since everyone seems to think it's a good idea I will write up a PEP with the end goal of going all the way with .pyc (probably on Friday). -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 18:00:52 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 17:00:52 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote: > On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote: > >> On 31 January 2015 at 15:47, Donald Stufft wrote: >> >> It's certainly possible to add a new API that loads resources based on >> >> a relative name, but you'd have to specify relative to *what*. >> >> get_data explicitly ducks out of making that decision. >> > >> > data = __loader__.get_bytes(__name__, ?logo.gif?) >> >> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >> prohibit sharing of loaders, etc, in the way Brett mentions. > > > By specifying the package anchor point I don't think it does. > > >> Also, the >> fact that it needs __name__ in there feels wrong - a bit like the old >> version of super() needing to be told which class it was being called >> from. > > > You can't avoid that. This is the entire reason why loader reuse is a > pain; you **have** to specify what to work off of, else its ambiguous and a > specific feature of a specific loader. > > But this is only an issue when you are trying to access a file relative to > the package/module you're in. Otherwise you're going to be specifying a > string constant like 'foo.bar'. > > >> But in principle I don't object to finding a suitable form of >> this. >> >> And I like the name get_bytes - much more explicit in these Python 3 >> days of explicit str/bytes distinctions :-) > > > One unfortunate side-effect from having a new method to return bytes from > a data file is that it makes get_data() somewhat redundant. If we make it > get_data_filename(package_name, path) then it can return an absolute path > which can then be passed to get_data() to read the actual bytes. 
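(Roughly, that composition, with get_data_filename treated as a hypothetical loader method:)

    import importlib

    def read_data(package, path):
        # The proposed get_data_filename() turns a package anchor plus a
        # relative name into an absolute name; the existing PEP 302
        # get_data() then reads the bytes stored behind that name.
        loader = importlib.import_module(package).__loader__
        full_name = loader.get_data_filename(package, path)  # proposed method
        return loader.get_data(full_name)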
If we > create importlib.resources as Donald has suggested then all of this can be > hidden behind a function and users don't have to care about any of this, > e.g. importlib.resources.read_data(module_anchor, path). > > > I think we actually have to go the other way, because only some Loaders > will be able to actually return a filename (returning a filename is > basically an optimization to prevent needing to call get_data and write > that out to a temporary directory) but pretty much any loader should > theoretically be able to support get_data. > Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary. > > I think it is redundant but given that it?s a new API (passing module and > a ?resource path?) I think it makes sense. The old get_data API can be > deprecated but left in for compatibility reasons if we want (sort of like > Loader().load_module() -> Loader().exec_module()). > If we do that then there would have to be a way to specify how to read the bytes for the module code itself since get_data() is used in the implementation of import by coupling it with get_filename() (which is why I'm trying not have to drop get_filename()/get_data() and instead come up with some new approach to reading bytes since the current approach is very composable). So get_bytes() would need a way to signal that you don't want some data file but the bytes for the module. Maybe if the path section is unspecified then that's a signal that the module's bytes is wanted and not some data file? > > > One thing to consider is do we want to allow anything other than filenames > for the path part? Thanks to namespace packages every directory is > essentially a package, so we could say that the package anchor has to > encapsulate the directory and the path bit can only be a filename. That > gets us even farther away from having the concept of file paths being > manipulated in relation to import-related APIs. > > > I think we do want to allow directories, it?s not unusual to have > something like: > > warehouse > ??? __init__.py > ??? templates > ? ??? accounts > ? ? ??? profile.html > ? ??? hello.html > ??? utils > ? ??? mapper.py > ??? wsgi.py > > Conceptually templates isn?t a package (even though with namespace > packages it kinda is) and I?d want to load profile.html by doing something > like: > > importlib.resources.get_bytes(?warehouse?, > ?templates/accounts/profile.html?) > Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) > > In pkg_resources the second argument to that function is a ?resource path? > which is defined as a relative to the given module/package and it must use > / to denote them. It explicitly says it?s not a file system path but a > resource path. It may translate to a file system path (as is the case with > the FileLoader) but it also may not (as is the case with a theoretical > S3Loader or PostgreSQLLoader). > Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. > How you turn a warehouse + a resource path into some data (or whatever > other function we support) is an implementation detail of the Loader. > > > And just so I don't forget it, I keep wanting to pass an actual module in > so the code can extract the name that way, but that prevents the __name__ > trick as you would have to import yourself or grab the module from > sys.modules. 
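(One hedged way to reconcile the two: let the top-level helper accept either a module object or a name such as __name__, and resolve strings itself -- a sketch only, with get_bytes standing in for whatever the loader method ends up being called:)

    import importlib
    import sys

    def get_bytes(module_or_name, resource):
        # Accept a module object or a dotted name like __name__; callers
        # never have to import themselves or dig through sys.modules.
        if isinstance(module_or_name, str):
            module = sys.modules.get(module_or_name)
            if module is None:
                module = importlib.import_module(module_or_name)
        else:
            module = module_or_name
        return module.__loader__.get_bytes(module.__name__, resource)  # proposed loader method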
> > > Is an actual module what gets passed into Loader().exec_module()? > Yes. > If so I think it?s fine to pass that into the new Loader() functions and a > new top level API in importlib.resources can do the things needed to turn a > string into a module object. So instead of doing > __loader__.get_bytes(__name__, ?logo.gif?) you?d do > importlib.resources.get_bytes(__name__, ?logo.gif?). > If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues. And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 18:28:04 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 12:28:04 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft > wrote: >> On Jan 31, 2015, at 11:31 AM, Brett Cannon > wrote: >> >> >> >> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore > wrote: >> On 31 January 2015 at 15:47, Donald Stufft > wrote: >> >> It's certainly possible to add a new API that loads resources based on >> >> a relative name, but you'd have to specify relative to *what*. >> >> get_data explicitly ducks out of making that decision. >> > >> > data = __loader__.get_bytes(__name__, ?logo.gif?) >> >> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >> prohibit sharing of loaders, etc, in the way Brett mentions. >> >> By specifying the package anchor point I don't think it does. >> >> Also, the >> fact that it needs __name__ in there feels wrong - a bit like the old >> version of super() needing to be told which class it was being called >> from. >> >> You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. >> >> But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. >> >> But in principle I don't object to finding a suitable form of >> this. >> >> And I like the name get_bytes - much more explicit in these Python 3 >> days of explicit str/bytes distinctions :-) >> >> One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). > > I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data. 
> > Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary. Because the idea behind get_data_filename() is that it returns a path that can be used regularly by APIs that expect to be handed a file on the file system. Simulating a path with an opaque string isn?t good enough because, for example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. The idea here is that _if_ a regular file system path is available for a particular resource file then Loader().get_data_filename() would return it, otherwise it?d return None (or not exist at all). This means that pkgutil.get_data_filename (or importlib.resources.get_filename) can attempt to call Loader().get_data_filename() and just return that path if one exists on the file system already, and if it doesn?t then it can create a temporary file and call Loader.get_data() and write the data to that temporary file and return the path to that. > > > I think it is redundant but given that it?s a new API (passing module and a ?resource path?) I think it makes sense. The old get_data API can be deprecated but left in for compatibility reasons if we want (sort of like Loader().load_module() -> Loader().exec_module()). > > If we do that then there would have to be a way to specify how to read the bytes for the module code itself since get_data() is used in the implementation of import by coupling it with get_filename() (which is why I'm trying not have to drop get_filename()/get_data() and instead come up with some new approach to reading bytes since the current approach is very composable). So get_bytes() would need a way to signal that you don't want some data file but the bytes for the module. Maybe if the path section is unspecified then that's a signal that the module's bytes is wanted and not some data file? Perhaps trying to read modules and resource files with the same method is the wrong approach? Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 This means that we?re not talking about ?data? files, but ?resource? files. This also removes the idea that you can call Loader.set_data() on those files (like i?ve seen in the implementation). > > >> >> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. > > I think we do want to allow directories, it?s not unusual to have something like: > > warehouse > ??? __init__.py > ??? templates > ? ??? accounts > ? ? ??? profile.html > ? ??? hello.html > ??? utils > ? ??? mapper.py > ??? wsgi.py > > Conceptually templates isn?t a package (even though with namespace packages it kinda is) and I?d want to load profile.html by doing something like: > > importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) > > Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) > > > In pkg_resources the second argument to that function is a ?resource path? which is defined as a relative to the given module/package and it must use / to denote them. It explicitly says it?s not a file system path but a resource path. 
It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). > > Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. > > How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. > >> >> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. > > Is an actual module what gets passed into Loader().exec_module()? > > Yes. > > If so I think it?s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, ?logo.gif?) you?d do importlib.resources.get_bytes(__name__, ?logo.gif?). > > If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues. > > And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Sat Jan 31 18:31:46 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 12:31:46 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: <20150131123146.6ad3f1a6@marathon> On Jan 30, 2015, at 06:37 PM, Donald Stufft wrote: >I think it would be a good idea to implement a pkgutil.get_data_filename >function which would return a filename that can be accessed to get at that >particular bit of package data. +1 Of the pkg_resource methods that I use all the time, resource_string() (which in Python 3 should really be called resource_bytes()) and resource_filename() are the overwhelming favorites. I do occasionally use resource_stream() and even more rarely, resource_listdir(). Given that pkgutil.get_data() is essentially resource_bytes(), adopting (and improving) equivalents for resource_filename() and resource_stream() would be really nice. >I have a few concerns however, currently Loader.get_data() requires you to >pass the entire path of the file you want to open (like >/usr/lib/python3.5/site-packages/foo/bar.txt or /data/foo.zip/bar.txt) >however I've made Loader.get_data_filename() want a relative path (like >bar.txt). > >I wonder if this difference is OK? Depends on who you ask :). Clearly, most users should never be confronted with the difference. The APIs they should use are the pkgutil ones and there, everything's relative to a package namespace path, which is (well, modulo perhaps some PEP 420 corners) unambiguous. I don't particularly like the "feature" of get_data() allowing resources paths with / in the name. I'd much rather the resource either be a dotted module path, or just not allowing subpaths. The difference is a requirement in the layout of the package, e.g. 
pkgutil.get_data('my.package.path', 'subpath/foo.dat') pkgutil.get_data('my.package.path.subpath', 'foo.dat') The latter requires that 'subpath' be a subpackage while the former does not. Personally, that seems like a fine restriction to me, but that's how I always lay out my in-package data anyway. Loader implementers OTOH, do care, but there's a lot fewer of them than users. >If not I wonder if we can make Loader.get_data accept a relative path as >well. I think this is a generally more useful way of using the function >because it doesn't restrict loaders to file system only (which get_data >currently is restricted to I believe) and it lets the Loader encaspulate the >logic about how to translate a relative path to a chunk of data instead of >needing the caller to do that. +1 >My other problem is that pkgutil.get_data doesn't currently work for the PEP >420 namespace packages and due to the above I'm not sure how to actually make >it work in a reasonable way without allowing get_data to accept relative >paths as well. Well, with the restriction on resource subpaths above, there's no problem, right? pkgutil.get_data('my.package.path.subpath', 'foo.dat') Assuming subpath is contained within a namespace portion, it should be unambiguous where it comes from. pkgutil.get_data('my.package.path', 'foo.dat') If 'my.package.path' is a namespace package then there *isn't* any portion containing foo.dat, so this should return None because the namespace loader won't have get_data() implemented on it. I understand that imposing this restriction is a backward compatibility break, so it may not be adoptable. There are ways to get around that (add a flag to the API, implement a new pkgutil API with the restriction and deprecate .get_data(), etc.). However, for PEP 420 packages, you could impose this restriction in .get_data() without the backward compatibility problem. And certainly in any new APIs, e.g. .get_package_filename() a.k.a. resource_filename() you can do impose this restriction. I also think resource_stream() should be implemented as well, but maybe it should be called `pkgutil.open(package, resource, mode, encoding)` ? I can live without resource_listdir(). Cheers, -Barry From barry at python.org Sat Jan 31 18:40:04 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 12:40:04 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <20150131124004.2c373863@marathon> On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: >resource_exists(package_or_requirement, resource_name) > Does the named resource exist? Return True or False accordingly. +1 >resource_stream(package_or_requirement, resource_name) > Return a readable file-like object for the specified resource; it may be > an actual file, a StringIO, or some similar object. The stream is in > ?binary mode?, in the sense that whatever bytes are in the resource will > be read as-is. See my previous follow up. I'd much rather have an open()-like API so I don't have to do the subsequent decoding. >resource_string(package_or_requirement, resource_name) > Return the specified resource as a string. The resource is read in binary > fashion, such that the returned string contains exactly the bytes that are > stored in the resource. Right, so resource_string() is the wrong name . 
In my Python 3 code I always do: from pkg_resources import resource_string as resource_bytes so at least the call sites more accurately reflect reality. :) >resource_isdir(package_or_requirement, resource_name) > Is the named resource a directory? Return True or False accordingly. > >resource_listdir(package_or_requirement, resource_name) > List the contents of the named resource directory, just like os.listdir > except that it works even if the resource is in a zipfile. I've used these, but rarely, so I don't care too much. >resource_filename(package_or_requirement, resource_name) [...] >Obviously the similar functions here are: > >* pkgutil.get_data is pkg_resources.resource_string >* pkgutil.get_data_filename is pkg_resources.resource_filename > >The major difference being that pkg_resource.resource_filename will extract >to a cache directory (controllable with an environment variable or >programatically) and won't clean up the extracted files. This means that they >are (by default) extracted once per user and reused between extractions. I >felt like it made more sense to just extract to a temporary location (even >though this is less performant) in the stdlib. Extracting to a temporary location is fine. These generally aren't performance critical sections (e.g. I use them predominately in tests) and if they are then I'd rather let the user define the caching policy. >That leaves: > >* resource_exists >* resource_stream >* resource_isdir >* resource_listdir > >Which can be done via pkg_resources but not via the standard library, I don't >have a major opinion on whether or not the standard library should do all of >them but I don't think it would hurt if it did. resource_stream() is useful, but see my previous response on that. >Another interesting question if we're going to add more methods is where they >should all live. As far as I know pkgutil.get_data predates the importlib >module. Perhaps deprecating pkgutil.get_data and adding a importlib.resources >module which supports functions like: > >* get_bytes(package, resource) >* get_stream(package, resource) >* get_filename(package, resource) >* exists(package, resource) >* isdir(package, resource) >* listdir(package, resource) Modulo bikeshedding on the names of the functions, importlib.resources seems like a nice place for it. >Changing the names (particular get_data -> get_bytes) could also provide the >mechanism for allowing relative files and deprecating the "you must pass in >a full file path to the Loader()" behavior since the get_data method could be >left alone and a new get_bytes method could be added. +1, but see also my previous suggestion about path restrictions. Cheers, -Barry From barry at python.org Sat Jan 31 18:44:51 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 12:44:51 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <20150131124451.582b5dc3@marathon> On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: >> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >> Related question - how would the temp files be cleaned up? At exit? > >My patch registers an atexit handler that cleans up the temporary files yea. Why not implement it as a context manager? I'm not a big fan of overloading the atexit handler because there are situations where it might not get called (e.g. 
the program crashes or is kill -9'd), but a context manager allows the resource to be cleaned up asap. Reviewing my own uses of pkg_resources.resource_filename() I think it would work just fine because I rarely need the path much longer than the immediate operation. If I did need to cache it more permanently, I could easily do: with resource_filename('my.package.path', 'foo.dat') as path: shutil.copy(path, some_more_permanent_location) Easy peasy. Cheers, -Barry From pje at telecommunity.com Sat Jan 31 19:00:02 2015 From: pje at telecommunity.com (PJ Eby) Date: Sat, 31 Jan 2015 13:00:02 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat, Jan 31, 2015 at 11:13 AM, Donald Stufft wrote: > To be clear, I think using __name__ is massively better than using __file__, > for one even though PEP 302 states that __file__ must be set, it actually > doesn?t have to be set and PEP 420 doesn?t set it. Even if it did set it > that pattern is only actually really usable for non namespace packages (of > any type). Indeed, pkg_resources does not support resource access from namespace packages, only from specific modules or non-namespace packages contained in a namespace package. In the face of ambiguity, the implementation should refuse to guess. Disallowing namespace-relative access avoids the possibility of ambiguity, and it's essentially a non-issue anyway since there's no real use case for "find me whichever copy of this file got installed first or got listed first on sys.path". From barry at python.org Sat Jan 31 19:03:21 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:03:21 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <20150131130321.31654688@marathon> On Jan 31, 2015, at 02:48 PM, Brett Cannon wrote: >> Sounds reasonable. It's a relatively rare, but useful use case. One >> possible issue, though, would people assume that if they get a >> filename it'd be writeable? For the filesystem loader it would be, but >> that would break subtly (writes work but would get discarded) for >> loaders that don't have a native get_data_filename. > >I don?t think you can assume it?s writeable since that?ll break in a lot >of common cases even with the filesystem loader since often times things >in the filesystem will be installed in the system and users won?t have >permissions to write to them anyways. That's okay. Just let the normal exceptions percolate up. But I do agree that at least in my own use cases, these are almost entirely read operations, so I'm okay with enforcing that. I think a user could pretty easily implement writable APIs on top if needed. Cheers, -Barry From pje at telecommunity.com Sat Jan 31 19:05:10 2015 From: pje at telecommunity.com (PJ Eby) Date: Sat, 31 Jan 2015 13:05:10 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131124451.582b5dc3@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: On Sat, Jan 31, 2015 at 12:44 PM, Barry Warsaw wrote: > On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: > >>> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >>> Related question - how would the temp files be cleaned up? At exit? 
>> >>My patch registers an atexit handler that cleans up the temporary files yea. > > Why not implement it as a context manager? Note that neither approach will work for one common use of extracted files: extension modules and shared libraries on Windows. Unlike *nixy operating systems, you can't delete an open file on Windows, and loaded .DLLs are open files IIUC. Unless you've got some way to unload the .pyd or .dll files, you won't be able to do a complete cleanup in that case. (This use case is actually why I took the caching approach rather than the tempfile approach in the first place.) From barry at python.org Sat Jan 31 19:06:13 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:06:13 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <20150131130613.6c4ae34e@marathon> On Jan 31, 2015, at 03:54 PM, Paul Moore wrote: >And I like the name get_bytes - much more explicit in these Python 3 >days of explicit str/bytes distinctions :-) +1 -Barry From donald at stufft.io Sat Jan 31 19:07:29 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 13:07:29 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131124451.582b5dc3@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: > On Jan 31, 2015, at 12:44 PM, Barry Warsaw wrote: > > On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: > >>> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >>> Related question - how would the temp files be cleaned up? At exit? >> >> My patch registers an atexit handler that cleans up the temporary files yea. > > Why not implement it as a context manager? > > I'm not a big fan of overloading the atexit handler because there are > situations where it might not get called (e.g. the program crashes or is kill > -9'd), but a context manager allows the resource to be cleaned up asap. > > Reviewing my own uses of pkg_resources.resource_filename() I think it would > work just fine because I rarely need the path much longer than the immediate > operation. If I did need to cache it more permanently, I could easily do: > > with resource_filename('my.package.path', 'foo.dat') as path: > shutil.copy(path, some_more_permanent_location) > > Easy peasy. The reasons for not wanting to use a context manager are sort of intertwined with each other. The competitor to this function is something like: import os.path import time LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") def print_logo_path(): print(LOGO_PATH) while True: print_logo_path() time.sleep(1) So when looking at an alternative that we want people to use we have to consider the cost of porting to that code from the old way. 
Using an atexit handler means that the above code can be switched to the new mechanism just by chaning a single line: LOGO_PATH = importlib.resources.get_filename(__name__, "logo.gif") Using a context manager would require something like: LOGO_MAKER = lambda: importlib.resources.get_filename(__name__, "logo.gif") def print_logo_path(): with LOGO_MAKER as filename: print(filename) Or: _LOGO_TMP = importlib.resources.get_filename(__name__, "logo.gif") atexit.register(_LOGO_TMP.cleanup) LOGO_PATH = _LOGO_TMP.name It makes it more akward to use anytime you need to use the file in multiple locations or multiple times and since each context manager instance (in the worst case) is going to need to get bytes, create a temp file, and write bytes for each use of the context manager. The other thing is that for the "common" case, where the resource is available on the file system already because we're just using a FileLoader, there is no need for an atexit handler or a temporary file at all. The context manager would only really exist for the uncommon case where we need to write the data to a temporary file. Using the atexit handler allows us to provide the best API for the common case, without too much problem for the uncommon case. Yes it does mean that in certain cases the temporary files may be left behind, particularly with kill -9 or segfaults or what have you. However that case already exists, the only thing the context manager does is narrow the window of case where a kill -9 or a segfault can leave temporary files behind. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From barry at python.org Sat Jan 31 19:08:42 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:08:42 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <20150131130842.1c218f20@marathon> On Jan 31, 2015, at 04:31 PM, Brett Cannon wrote: >One thing to consider is do we want to allow anything other than filenames >for the path part? IMHO, no. See my previous responses. Cheers, -Barry From donald at stufft.io Sat Jan 31 19:09:03 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 13:09:03 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: > On Jan 31, 2015, at 1:05 PM, PJ Eby wrote: > > On Sat, Jan 31, 2015 at 12:44 PM, Barry Warsaw wrote: >> On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: >> >>>> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >>>> Related question - how would the temp files be cleaned up? At exit? >>> >>> My patch registers an atexit handler that cleans up the temporary files yea. >> >> Why not implement it as a context manager? > > Note that neither approach will work for one common use of extracted > files: extension modules and shared libraries on Windows. Unlike > *nixy operating systems, you can't delete an open file on Windows, and > loaded .DLLs are open files IIUC. Unless you've got some way to > unload the .pyd or .dll files, you won't be able to do a complete > cleanup in that case. (This use case is actually why I took the > caching approach rather than the tempfile approach in the first > place.) I don?t think it?s important for this API to support extracting extension modules. 
If we want to support importing extension modules from inside of a zip file (or similar) I think that should get it?s own support inside the loader and not rely on the resource extraction for that. IOW I think that these should primarily exist for data files. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From barry at python.org Sat Jan 31 19:11:42 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:11:42 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <20150131131142.27f7c682@marathon> On Jan 31, 2015, at 11:43 AM, Donald Stufft wrote: >I think we do want to allow directories, it?s not unusual to have something >like: > >warehouse >??? __init__.py >??? templates >? ??? accounts >? ? ??? profile.html >? ??? hello.html >??? utils >? ??? mapper.py >??? wsgi.py > >Conceptually templates isn?t a package (even though with namespace packages >it kinda is) and I?d want to load profile.html by doing something like: > >importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) I understand there's a conceptual wart, but I have no problem dropping an empty __init__.py file in those subdirectories and then using: importlib.resources.get_bytes('warehouse.templates.accounts', 'profile.html') And given how much easier it makes life from an implementation and description standpoint, I think it's a fine compromise. Cheers, -Barry From pje at telecommunity.com Sat Jan 31 18:54:01 2015 From: pje at telecommunity.com (PJ Eby) Date: Sat, 31 Jan 2015 12:54:01 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On Sat, Jan 31, 2015 at 10:38 AM, Paul Moore wrote: > At the moment, pkg_resources fills in the gap, but that's not > integrated with the loader system. Actually, it is. There's basically a generic function that adapts loaders to "resource providers". In a trivial case, a loader can simply implement the resource provider interface directly, and register 'lambda self: self' as the adapter function. I suggest taking a look at the IResourceProvider class, and seeing whether you want to change anything in how the interface or implementation work. You could in fact create ABCs based on the pkg_resources implementation. The real question is whether there are any lessons to be learned from pkg_resources' usage history. I think the idea of temp files may be a good one, though there will still be no real cleanup possible in the case of e.g. C extensions. You'll have to rely on whatever system facility exists for temporary file cleanup. With historical hindsight, I'd say that I should've made it temp by default, with the option to set a persistent cache, because a common complaint is that processes running as special users often can't write to their home directory (e.g. web servers running as "nobody"). Apart from that, the implementations in pkg_resources can mostly be pulled for reuse, as well as the interfaces, and I'd suggest doing exactly that. There are a lot of non-obvious gotchas dealing with zipfiles, and the implementation is fairly battle-hardened at this point. 
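(To make the extraction behaviour being discussed concrete, a hedged sketch of the temp-file fallback: get_data_filename is the proposed optional loader method, the atexit cleanup mirrors what Donald's patch is described as doing, and none of this is an existing stdlib API:)

    import atexit
    import importlib
    import os
    import pkgutil
    import tempfile

    def get_filename(package, resource):
        # If the loader can already point at a real file, return that path;
        # the common on-disk FileLoader case then costs nothing extra.
        loader = importlib.import_module(package).__loader__
        get_path = getattr(loader, 'get_data_filename', None)  # proposed optional method
        if get_path is not None:
            filename = get_path(package, resource)
            if filename is not None:
                return filename
        # Otherwise read the bytes and extract them to a temporary file that
        # is removed when the interpreter exits.
        data = pkgutil.get_data(package, resource)
        if data is None:
            return None
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        atexit.register(os.remove, path)
        return path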
From donald at stufft.io Sat Jan 31 19:18:05 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 13:18:05 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131131142.27f7c682@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <20150131131142.27f7c682@marathon> Message-ID: <0BF27A19-2C53-4994-8455-CD19D9A05E5E@stufft.io> > On Jan 31, 2015, at 1:11 PM, Barry Warsaw wrote: > > On Jan 31, 2015, at 11:43 AM, Donald Stufft wrote: > >> I think we do want to allow directories, it?s not unusual to have something >> like: >> >> warehouse >> ??? __init__.py >> ??? templates >> ? ??? accounts >> ? ? ??? profile.html >> ? ??? hello.html >> ??? utils >> ? ??? mapper.py >> ??? wsgi.py >> >> Conceptually templates isn?t a package (even though with namespace packages >> it kinda is) and I?d want to load profile.html by doing something like: >> >> importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) > > I understand there's a conceptual wart, but I have no problem dropping an > empty __init__.py file in those subdirectories and then using: > > importlib.resources.get_bytes('warehouse.templates.accounts', 'profile.html') > > And given how much easier it makes life from an implementation and description > standpoint, I think it's a fine compromise. I think it actually makes things *harder* from an implementation and description standpoint. You?re thinking in terms of implementation for the FileLoader, but say for a PostgreSQLLoader now I have to create mock packages for warehouse.templates and warehouse.templates.accounts whereas if we treat the resource path not as a file path, but as a key for an object store where ?/? is slightly special then my PostgreSQL loader only need to have a ?warehouse? package, and then a table that essentially does something like: package | resource key | data -------------------------------------------------- warehouse | templates/accounts/profile.html | ? In the FileLoader we?d obviously treat the / as path separators and create directory entries, but in reality it?s just a key: value store. I already implemented one of these functions in a way that allows the / separator and I would have had to have gone out of my way to disallow it rather than allow it. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From barry at python.org Sat Jan 31 19:29:33 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:29:33 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: <20150131132933.28d485f5@marathon> On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: >The reasons for not wanting to use a context manager are sort of intertwined >with each other. > >The competitor to this function is something like: > > import os.path > import time > > LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") > > def print_logo_path(): > print(LOGO_PATH) > > > while True: > print_logo_path() > time.sleep(1) I'm just wondering if that's extracted from a real example or whether it's just a possible use case you'd want to support. It's not a use case I've ever needed. I reviewed a bunch of resource_filename() uses and in almost all cases it's 1. Crafting a path-y thing for some other API that only takes paths. 2. 
Constructing a path for essentially shutil.copy()'ing the file somewhere else (e.g. a test http server's file vending directory). There are one or two where it might be inconvenient to use a context manager, but the majority of cases would be fine. >It makes it more akward to use anytime you need to use the file in multiple >locations or multiple times and since each context manager instance (in the >worst case) is going to need to get bytes, create a temp file, and write bytes >for each use of the context manager. Perhaps it makes sense to either provide two APIs and/or implement a higher level API on top of a lower-level one? >The other thing is that for the "common" case, where the resource is available >on the file system already because we're just using a FileLoader, there is no >need for an atexit handler or a temporary file at all. The context manager >would only really exist for the uncommon case where we need to write the data >to a temporary file. Using the atexit handler allows us to provide the best >API for the common case, without too much problem for the uncommon case. A context manager could also conditionalize the delete just like your proposal conditionalizes adding to the atexit handler. >Yes it does mean that in certain cases the temporary files may be left behind, >particularly with kill -9 or segfaults or what have you. However that case >already exists, the only thing the context manager does is narrow the window >of case where a kill -9 or a segfault can leave temporary files behind. Sure, but it reduces the window for leakage, which will probably be enough. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From donald at stufft.io Sat Jan 31 20:42:06 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 14:42:06 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131132933.28d485f5@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> <20150131132933.28d485f5@marathon> Message-ID: <59512300-F929-4A51-A8DF-0E829F64D206@stufft.io> > On Jan 31, 2015, at 1:29 PM, Barry Warsaw wrote: > > On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: > >> The reasons for not wanting to use a context manager are sort of intertwined >> with each other. >> >> The competitor to this function is something like: >> >> import os.path >> import time >> >> LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") >> >> def print_logo_path(): >> print(LOGO_PATH) >> >> >> while True: >> print_logo_path() >> time.sleep(1) > > I'm just wondering if that's extracted from a real example or whether it's > just a possible use case you'd want to support. It's not a use case I've ever > needed. > > I reviewed a bunch of resource_filename() uses and in almost all cases it's > > 1. Crafting a path-y thing for some other API that only takes paths. > 2. Constructing a path for essentially shutil.copy()'ing the file somewhere > else (e.g. a test http server's file vending directory). > > There are one or two where it might be inconvenient to use a context manager, > but the majority of cases would be fine. Yea, requests/certifi which doesn?t currently use resource_filename at all but just constructs the path to the .pem file using __file__. Also ensurepip and virtualenv (both the existing and the rewrite). 
Almost every case where I had to access a resource file I end up needing to do it multiple places and it was easier to just construct the path once and reuse it. > >> It makes it more akward to use anytime you need to use the file in multiple >> locations or multiple times and since each context manager instance (in the >> worst case) is going to need to get bytes, create a temp file, and write bytes >> for each use of the context manager. > > Perhaps it makes sense to either provide two APIs and/or implement a higher > level API on top of a lower-level one? i thought about doing it this way too, I didn?t just because I couldn?t really imagine anyone really using the context manager when a simpler API was available and I thought that having one way to do it was better. However I?m perfectly happy to have two APIs if people think it?s important. > >> The other thing is that for the "common" case, where the resource is available >> on the file system already because we're just using a FileLoader, there is no >> need for an atexit handler or a temporary file at all. The context manager >> would only really exist for the uncommon case where we need to write the data >> to a temporary file. Using the atexit handler allows us to provide the best >> API for the common case, without too much problem for the uncommon case. > > A context manager could also conditionalize the delete just like your proposal > conditionalizes adding to the atexit handler. > >> Yes it does mean that in certain cases the temporary files may be left behind, >> particularly with kill -9 or segfaults or what have you. However that case >> already exists, the only thing the context manager does is narrow the window >> of case where a kill -9 or a segfault can leave temporary files behind. > > Sure, but it reduces the window for leakage, which will probably be enough. > > Cheers, > -Barry --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From brett at python.org Sat Jan 31 22:22:42 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 21:22:42 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft wrote: > On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote: > >> On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: >> >> >> >> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote: >> >>> On 31 January 2015 at 15:47, Donald Stufft wrote: >>> >> It's certainly possible to add a new API that loads resources based on >>> >> a relative name, but you'd have to specify relative to *what*. >>> >> get_data explicitly ducks out of making that decision. >>> > >>> > data = __loader__.get_bytes(__name__, ?logo.gif?) >>> >>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >>> prohibit sharing of loaders, etc, in the way Brett mentions. >> >> >> By specifying the package anchor point I don't think it does. >> >> >>> Also, the >>> fact that it needs __name__ in there feels wrong - a bit like the old >>> version of super() needing to be told which class it was being called >>> from. >> >> >> You can't avoid that. This is the entire reason why loader reuse is a >> pain; you **have** to specify what to work off of, else its ambiguous and a >> specific feature of a specific loader. 
>> >> But this is only an issue when you are trying to access a file relative >> to the package/module you're in. Otherwise you're going to be specifying a >> string constant like 'foo.bar'. >> >> >>> But in principle I don't object to finding a suitable form of >>> this. >>> >>> And I like the name get_bytes - much more explicit in these Python 3 >>> days of explicit str/bytes distinctions :-) >> >> >> One unfortunate side-effect from having a new method to return bytes from >> a data file is that it makes get_data() somewhat redundant. If we make it >> get_data_filename(package_name, path) then it can return an absolute path >> which can then be passed to get_data() to read the actual bytes. If we >> create importlib.resources as Donald has suggested then all of this can be >> hidden behind a function and users don't have to care about any of this, >> e.g. importlib.resources.read_data(module_anchor, path). >> >> >> I think we actually have to go the other way, because only some Loaders >> will be able to actually return a filename (returning a filename is >> basically an optimization to prevent needing to call get_data and write >> that out to a temporary directory) but pretty much any loader should >> theoretically be able to support get_data. >> > > Why can only some loaders return a filename? As I have said, loaders can > return an opaque string to simulate a path if necessary. > > > Because the idea behind get_data_filename() is that it returns a path that > can be used regularly by APIs that expect to be handed a file on the file > system. > In my head that expectation is not placed on the method. > Simulating a path with an opaque string isn?t good enough because, for > example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. > The idea here is that _if_ a regular file system path is available for a > particular resource file then Loader().get_data_filename() would return it, > otherwise it?d return None (or not exist at all). > > This means that pkgutil.get_data_filename (or > importlib.resources.get_filename) can attempt to call > Loader().get_data_filename() and just return that path if one exists on the > file system already, and if it doesn?t then it can create a temporary file > and call Loader.get_data() and write the data to that temporary file and > return the path to that. > See I'm not even attempting to guarantee there is any API that will return a reasonable file system path as the import API makes no such guarantees. If an API like OpenSSL requires a file on the filesystem then you will have to write to a temporary file and that's just life. That's the same as if everything was stored in a zip file anyway. > > > >> >> I think it is redundant but given that it?s a new API (passing module and >> a ?resource path?) I think it makes sense. The old get_data API can be >> deprecated but left in for compatibility reasons if we want (sort of like >> Loader().load_module() -> Loader().exec_module()). >> > > If we do that then there would have to be a way to specify how to read the > bytes for the module code itself since get_data() is used in the > implementation of import by coupling it with get_filename() (which is why > I'm trying not have to drop get_filename()/get_data() and instead come up > with some new approach to reading bytes since the current approach is very > composable). So get_bytes() would need a way to signal that you don't want > some data file but the bytes for the module. 
Maybe if the path section is > unspecified then that's a signal that the module's bytes is wanted and not > some data file? > > > Perhaps trying to read modules and resource files with the same method is > the wrong approach? > If we are going to do that then we might as well deprecate all the methods that try to expose reading data and paths as the PEP 302 APIs tried to expose it uniformly. > > Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 > That seems like a bit much, e.g. why do you need bytes **and** a file-like object() when you get the former from the latter? And why do you need the path argument when you can get the path off the file-like object if it's an actual file object? -Brett > > This means that we're not talking about 'data' files, but 'resource' > files. This also removes the idea that you can call Loader.set_data() on > those files (like I've seen in the implementation). > > >> >> >> One thing to consider is do we want to allow anything other than >> filenames for the path part? Thanks to namespace packages every directory >> is essentially a package, so we could say that the package anchor has to >> encapsulate the directory and the path bit can only be a filename. That >> gets us even farther away from having the concept of file paths being >> manipulated in relation to import-related APIs. >> >> >> I think we do want to allow directories, it's not unusual to have >> something like:
>>
>> warehouse
>> ├── __init__.py
>> ├── templates
>> │   ├── accounts
>> │   │   └── profile.html
>> │   └── hello.html
>> ├── utils
>> │   └── mapper.py
>> └── wsgi.py
>>
>> Conceptually templates isn't a package (even though with namespace >> packages it kinda is) and I'd want to load profile.html by doing something >> like: >> >> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >> > > Where I would be fine with get_bytes('warehouse.templates.accounts', > 'profile.html') =) > > >> >> In pkg_resources the second argument to that function is a 'resource >> path' which is defined as relative to the given module/package and it >> must use / to denote them. It explicitly says it's not a file system path >> but a resource path. It may translate to a file system path (as is the case >> with the FileLoader) but it also may not (as is the case with a theoretical >> S3Loader or PostgreSQLLoader). >> > > Yep, which is why I'm making sure if we have paths we minimize them as > they instantly make these alternative loader concepts a bigger pain to > implement. > > >> How you turn a warehouse + a resource path into some data (or whatever >> other function we support) is an implementation detail of the Loader. >> >> >> And just so I don't forget it, I keep wanting to pass an actual module in >> so the code can extract the name that way, but that prevents the __name__ >> trick as you would have to import yourself or grab the module from >> sys.modules. >> >> >> Is an actual module what gets passed into Loader().exec_module()? >> > > Yes. > > >> If so I think it's fine to pass that into the new Loader() functions and >> a new top level API in importlib.resources can do the things needed to turn >> a string into a module object. So instead of doing >> __loader__.get_bytes(__name__, 'logo.gif') you'd do >> importlib.resources.get_bytes(__name__, 'logo.gif'). 
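(As a point of comparison, the get_bytes(package, resource) convention being discussed can already be approximated on top of pkgutil.get_data; the helper below is only a sketch, since importlib.resources does not exist yet and its exact signature is what is being debated here.)

    import pkgutil

    def get_bytes(package, resource):
        # "resource" is a /-separated resource path relative to the package,
        # following the pkg_resources convention mentioned above.
        data = pkgutil.get_data(package, resource)
        if data is None:
            raise FileNotFoundError(resource)
        return data

    # e.g. get_bytes("warehouse", "templates/accounts/profile.html")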
>> > > If we go the route of importlib.resources then that seems like a > reasonable idea, although we will need to think through the ramifications > to exec_module() itself although I don't think there were be any issues. > > And if we do go with importlib.resources I will probably want to make it > available on PyPI with appropriate imp/pkgutil fallbacks to help people > transitioning from Python 2 to 3. > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 22:25:03 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 21:25:03 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> <20150131132933.28d485f5@marathon> Message-ID: On Sat Jan 31 2015 at 1:29:43 PM Barry Warsaw wrote: > On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: > > >The reasons for not wanting to use a context manager are sort of > intertwined > >with each other. > > > >The competitor to this function is something like: > > > > import os.path > > import time > > > > LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") > > > > def print_logo_path(): > > print(LOGO_PATH) > > > > > > while True: > > print_logo_path() > > time.sleep(1) > > I'm just wondering if that's extracted from a real example or whether it's > just a possible use case you'd want to support. It's not a use case I've > ever > needed. > > I reviewed a bunch of resource_filename() uses and in almost all cases it's > > 1. Crafting a path-y thing for some other API that only takes paths. > 2. Constructing a path for essentially shutil.copy()'ing the file somewhere > else (e.g. a test http server's file vending directory). > > There are one or two where it might be inconvenient to use a context > manager, > but the majority of cases would be fine. > > >It makes it more akward to use anytime you need to use the file in > multiple > >locations or multiple times and since each context manager instance (in > the > >worst case) is going to need to get bytes, create a temp file, and write > bytes > >for each use of the context manager. > > Perhaps it makes sense to either provide two APIs and/or implement a higher > level API on top of a lower-level one? > > >The other thing is that for the "common" case, where the resource is > available > >on the file system already because we're just using a FileLoader, there > is no > >need for an atexit handler or a temporary file at all. The context manager > >would only really exist for the uncommon case where we need to write the > data > >to a temporary file. Using the atexit handler allows us to provide the > best > >API for the common case, without too much problem for the uncommon case. > > A context manager could also conditionalize the delete just like your > proposal > conditionalizes adding to the atexit handler. > > >Yes it does mean that in certain cases the temporary files may be left > behind, > >particularly with kill -9 or segfaults or what have you. However that case > >already exists, the only thing the context manager does is narrow the > window > >of case where a kill -9 or a segfault can leave temporary files behind. > > Sure, but it reduces the window for leakage, which will probably be enough. > I'm with Barry not wanting to rely on atexit when a context manager is explicit and will clean up any state as necessary. 
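(To make the trade-off concrete, a minimal sketch of the context-manager style Barry and Brett are arguing for is shown below. The resource_path name and the optional get_data_filename() loader hook are proposals from this thread, not existing APIs, so treat this as illustrative only.)

    import contextlib
    import importlib
    import os
    import pkgutil
    import tempfile

    @contextlib.contextmanager
    def resource_path(package, resource):
        # Common case: the loader can already point at a real file on disk.
        loader = importlib.import_module(package).__loader__
        get_filename = getattr(loader, "get_data_filename", None)
        if get_filename is not None:
            try:
                filename = get_filename(package, resource)
            except FileNotFoundError:
                filename = None
            if filename is not None:
                yield filename
                return
        # Uncommon case: materialize a temporary copy and remove it when the
        # with-block exits, instead of waiting for an atexit handler.
        data = pkgutil.get_data(package, resource)
        if data is None:
            raise FileNotFoundError(resource)
        fd, tmp = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as fp:
                fp.write(data)
            yield tmp
        finally:
            os.remove(tmp)

    # Usage:
    #     with resource_path("mypkg", "cacert.pem") as path:
    #         do_something_with(path)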
-------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 22:43:47 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 16:43:47 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> > On Jan 31, 2015, at 4:22 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft > wrote: >> On Jan 31, 2015, at 12:00 PM, Brett Cannon > wrote: >> >> >> >> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft > wrote: >>> On Jan 31, 2015, at 11:31 AM, Brett Cannon > wrote: >>> >>> >>> >>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore > wrote: >>> On 31 January 2015 at 15:47, Donald Stufft > wrote: >>> >> It's certainly possible to add a new API that loads resources based on >>> >> a relative name, but you'd have to specify relative to *what*. >>> >> get_data explicitly ducks out of making that decision. >>> > >>> > data = __loader__.get_bytes(__name__, ?logo.gif?) >>> >>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >>> prohibit sharing of loaders, etc, in the way Brett mentions. >>> >>> By specifying the package anchor point I don't think it does. >>> >>> Also, the >>> fact that it needs __name__ in there feels wrong - a bit like the old >>> version of super() needing to be told which class it was being called >>> from. >>> >>> You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. >>> >>> But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. >>> >>> But in principle I don't object to finding a suitable form of >>> this. >>> >>> And I like the name get_bytes - much more explicit in these Python 3 >>> days of explicit str/bytes distinctions :-) >>> >>> One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). >> >> I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data. >> >> Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary. > > Because the idea behind get_data_filename() is that it returns a path that can be used regularly by APIs that expect to be handed a file on the file system. > > In my head that expectation is not placed on the method. > > Simulating a path with an opaque string isn?t good enough because, for example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. 
The idea here is that _if_ a regular file system path is available for a > particular resource file then Loader().get_data_filename() would return it, > otherwise it'd return None (or not exist at all). > > This means that pkgutil.get_data_filename (or > importlib.resources.get_filename) can attempt to call > Loader().get_data_filename() and just return that path if one exists on the > file system already, and if it doesn't then it can create a temporary file > and call Loader.get_data() and write the data to that temporary file and > return the path to that. > See I'm not even attempting to guarantee there is any API that will return a reasonable file system path as the import API makes no such guarantees. If an API like OpenSSL requires a file on the filesystem then you will have to write to a temporary file and that's just life. That's the same as if everything was stored in a zip file anyway.

The entire *point* of this thread is that sometimes you need a file path that is a valid path to a resource. The naive approach is to just make it do something like:

    # in pkgutil
    def get_data_filename(package, resource):
        data = get_data(package, resource)
        if data is not None:
            with open("/tmp/path", "wb") as fp:
                fp.write(data)
            return "/tmp/path"

However the problem with this is that it imposes a read() into memory and then creating a new file, and then writing that data back to a file even in cases where there is already a file available on the file system. The Loader().get_data_filename() exists for a Loader() to *optionally* say that "We already have a file path for this file, so you can just use this instead of copying to a temporary location".

Then the "optimized" but still naive approach becomes:

    # in pkgutil
    def get_data_filename(package, resource):
        mod = importlib.import_module(package)
        if hasattr(mod.__loader__, "get_data_filename"):
            try:
                filename = mod.__loader__.get_data_filename(package, resource)
            except FileNotFoundError:
                pass
            else:
                if filename is not None:
                    return filename

        data = get_data(package, resource)
        if data is not None:
            with open("/tmp/path", "wb") as fp:
                fp.write(data)
            return "/tmp/path"

This means there's basically no penalty for using this API to access resource files when you're accessing files from a FileLoader.

In my opinion anything that is harder to use than:

    MY_PATH = os.path.join(os.path.dirname(__file__), "my/file.txt")

Is highly unlikely to be used. People can already just write things to a temporary directory using get_data, but the point is they don't because it's a waste of time for the common case and it's easier not to do that. 
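(For the common case described above, a loader whose files already live on disk could satisfy the optional hook with very little code; the class below is an illustrative stand-in, not the actual importlib FileLoader.)

    import os

    class DiskLoaderSketch:
        """Minimal stand-in for a file-system-backed loader."""

        def __init__(self, fullname, path):
            self.name = fullname
            self.path = path  # absolute path of the module's source file

        def get_data_filename(self, package, resource):
            # Resolve the /-separated resource path next to the module and
            # return it as-is: no temporary file, no copying, no atexit.
            candidate = os.path.join(os.path.dirname(self.path),
                                     *resource.split("/"))
            if not os.path.exists(candidate):
                raise FileNotFoundError(candidate)
            return candidate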
> > >> >> I think it is redundant but given that it's a new API (passing module >> and a 'resource path') I think it makes sense. The old get_data API can be >> deprecated but left in for compatibility reasons if we want (sort of like >> Loader().load_module() -> Loader().exec_module()). >> > > If we do that then there would have to be a way to specify how to read the > bytes for the module code itself since get_data() is used in the > implementation of import by coupling it with get_filename() (which is why > I'm trying not to have to drop get_filename()/get_data() and instead come up > with some new approach to reading bytes since the current approach is very > composable). So get_bytes() would need a way to signal that you don't want > some data file but the bytes for the module. Maybe if the path section is > unspecified then that's a signal that the module's bytes is wanted and not > some data file? > > Perhaps trying to read modules and resource files with the same method is the wrong approach? > If we are going to do that then we might as well deprecate all the methods that try to expose reading data and paths as the PEP 302 APIs tried to expose it uniformly. I don't think it makes sense to expose it uniformly, code is semantically different than data files and people need the ability to do different things with them. It's unlikely you'll get a 2GB .py file, however a 2GB data file is completely within the realms of possibility. > > Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 > That seems like a bit much, e.g. why do you need bytes **and** a file-like object() when you get the former from the latter? And why do you need the path argument when you can get the path off the file-like object if it's an actual file object? I don't think it's a bit much at all. You get a stream method because sometimes things expect a file-like object or sometimes the file is big and the ability to access a stream that handles that for you is super important. However when using a stream you need to ensure you close the stream after you're done using it. You get a bytes method because sometimes you don't care about all of that and you just need/want the raw bytes; it's a nicer API for those people to be able to just get bytes without having to worry about reading a file or closing the file after they are done reading it. You get a filename method because the stream method may or may not return a file object that has a path at all, and if you just need to pass the path into another API having an open file handle just to get the filename is a waste of a file handle. > > -Brett > > > This means that we're not talking about 'data' files, but 'resource' files. This also removes the idea that you can call Loader.set_data() on those files (like I've seen in the implementation). > >> >> >>> >>> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. >> >> I think we do want to allow directories, it's not unusual to have something like:
>>
>> warehouse
>> ├── __init__.py
>> ├── templates
>> │   ├── accounts
>> │   │   └── profile.html
>> │   └── hello.html
>> ├── utils
>> │   └── mapper.py
>> └── wsgi.py
>>
>> Conceptually templates isn't a package (even though with namespace packages it kinda is) and I'd want to load profile.html by doing something like: >> >> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >> > > Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) > >> >> In pkg_resources the second argument to that function is a 'resource path' which is defined as relative to the given module/package and it must use / to denote them. It explicitly says it's not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). >> > > Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. 
>> >> How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. >> >>> >>> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. >> >> Is an actual module what gets passed into Loader().exec_module()? >> >> Yes. >> >> If so I think it?s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, ?logo.gif?) you?d do importlib.resources.get_bytes(__name__, ?logo.gif?). >> >> If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues. >> >> And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 22:46:38 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 16:46:38 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> <20150131132933.28d485f5@marathon> Message-ID: <9BC29885-16AA-4F6B-8AEF-00E2D3F9B61D@stufft.io> > On Jan 31, 2015, at 4:25 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 1:29:43 PM Barry Warsaw > wrote: > On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: > > >The reasons for not wanting to use a context manager are sort of intertwined > >with each other. > > > >The competitor to this function is something like: > > > > import os.path > > import time > > > > LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") > > > > def print_logo_path(): > > print(LOGO_PATH) > > > > > > while True: > > print_logo_path() > > time.sleep(1) > > I'm just wondering if that's extracted from a real example or whether it's > just a possible use case you'd want to support. It's not a use case I've ever > needed. > > I reviewed a bunch of resource_filename() uses and in almost all cases it's > > 1. Crafting a path-y thing for some other API that only takes paths. > 2. Constructing a path for essentially shutil.copy()'ing the file somewhere > else (e.g. a test http server's file vending directory). > > There are one or two where it might be inconvenient to use a context manager, > but the majority of cases would be fine. > > >It makes it more akward to use anytime you need to use the file in multiple > >locations or multiple times and since each context manager instance (in the > >worst case) is going to need to get bytes, create a temp file, and write bytes > >for each use of the context manager. > > Perhaps it makes sense to either provide two APIs and/or implement a higher > level API on top of a lower-level one? 
> > >The other thing is that for the "common" case, where the resource is available > >on the file system already because we're just using a FileLoader, there is no > >need for an atexit handler or a temporary file at all. The context manager > >would only really exist for the uncommon case where we need to write the data > >to a temporary file. Using the atexit handler allows us to provide the best > >API for the common case, without too much problem for the uncommon case. > > A context manager could also conditionalize the delete just like your proposal > conditionalizes adding to the atexit handler. > > >Yes it does mean that in certain cases the temporary files may be left behind, > >particularly with kill -9 or segfaults or what have you. However that case > >already exists, the only thing the context manager does is narrow the window > >of case where a kill -9 or a segfault can leave temporary files behind. > > Sure, but it reduces the window for leakage, which will probably be enough. > > I'm with Barry not wanting to rely on atexit when a context manager is explicit and will clean up any state as necessary. I think if we mandate a context manager people are going to be unlikely to actually use it because in a lot of cases it's going to be a pain in the ass to use a context manager with it and they'll just fall back to using os.path.join(os.path.dirname(__file__), 'my/file.txt'). I'm trying to make it so people *want* to use these APIs because they make their lives easier over the naive approach. Adding in stuff that makes it more awkward to use just means people won't use them and zip imports will continue to be barely supported in the wider ecosystem. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 23:27:06 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 22:27:06 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: On Sat Jan 31 2015 at 4:43:50 PM Donald Stufft wrote: > On Jan 31, 2015, at 4:22 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft wrote: > >> On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote: >> >> >> >> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote: >> >>> On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: >>> >>> >>> >>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore >>> wrote: >>> >>>> On 31 January 2015 at 15:47, Donald Stufft wrote: >>>> >> It's certainly possible to add a new API that loads resources based >>>> on >>>> >> a relative name, but you'd have to specify relative to *what*. >>>> >> get_data explicitly ducks out of making that decision. >>>> > >>>> > data = __loader__.get_bytes(__name__, 'logo.gif') >>>> >>>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >>>> prohibit sharing of loaders, etc, in the way Brett mentions. >>> >>> >>> By specifying the package anchor point I don't think it does. >>> >>> >>>> Also, the >>>> fact that it needs __name__ in there feels wrong - a bit like the old >>>> version of super() needing to be told which class it was being called >>>> from. >>> >>> >>> You can't avoid that. 
This is the entire reason why loader reuse is a >>> pain; you **have** to specify what to work off of, else it's ambiguous and a >>> specific feature of a specific loader. >>> >>> But this is only an issue when you are trying to access a file relative >>> to the package/module you're in. Otherwise you're going to be specifying a >>> string constant like 'foo.bar'. >>> >>> >>>> But in principle I don't object to finding a suitable form of >>>> this. >>>> >>>> And I like the name get_bytes - much more explicit in these Python 3 >>>> days of explicit str/bytes distinctions :-) >>> >>> >>> One unfortunate side-effect from having a new method to return bytes >>> from a data file is that it makes get_data() somewhat redundant. If we make >>> it get_data_filename(package_name, path) then it can return an absolute >>> path which can then be passed to get_data() to read the actual bytes. If we >>> create importlib.resources as Donald has suggested then all of this can be >>> hidden behind a function and users don't have to care about any of this, >>> e.g. importlib.resources.read_data(module_anchor, path). >>> >>> >>> I think we actually have to go the other way, because only some Loaders >>> will be able to actually return a filename (returning a filename is >>> basically an optimization to prevent needing to call get_data and write >>> that out to a temporary directory) but pretty much any loader should >>> theoretically be able to support get_data. >>> >> >> Why can only some loaders return a filename? As I have said, loaders can >> return an opaque string to simulate a path if necessary. >> >> >> Because the idea behind get_data_filename() is that it returns a path >> that can be used regularly by APIs that expect to be handed a file on the >> file system. >> > > In my head that expectation is not placed on the method. > > >> Simulating a path with an opaque string isn't good enough because, for >> example, OpenSSL doesn't know how to open /data/foo.zip/foobar/cacert.pem. >> The idea here is that _if_ a regular file system path is available for a >> particular resource file then Loader().get_data_filename() would return it, >> otherwise it'd return None (or not exist at all). >> >> This means that pkgutil.get_data_filename (or >> importlib.resources.get_filename) can attempt to call >> Loader().get_data_filename() and just return that path if one exists on the >> file system already, and if it doesn't then it can create a temporary file >> and call Loader.get_data() and write the data to that temporary file and >> return the path to that. >> > > See I'm not even attempting to guarantee there is any API that will return > a reasonable file system path as the import API makes no such guarantees. > If an API like OpenSSL requires a file on the filesystem then you will have > to write to a temporary file and that's just life. That's the same as if > everything was stored in a zip file anyway. > > > The entire *point* of this thread is that sometimes you need a file path > that is a valid path to a resource. > Right, but I also have to make sure the import API doesn't get too ridiculous because it took me years and several versions of Python to make it work with the APIs inherited from PEP 302 and to make sure it didn't grow into a huge mess. 
> > The naive approach is to just make it do something like: > > # in pkgutil > def get_data_filename(package, resource): > data = get_data(package, resource) > if data is not None: > with open("/tmp/path", "wb") as fp: > fp.write(data) > return "/tmp/path" > > However the problem with this is that it imposes a read() into memory and > then creating a new file, and then writing that data back to a file even in > cases where there is already a file available on the file system. The > Loader().get_data_filename() exists for a Loader() to *optionally* say that > ?We already have a file path for this file, so you can just use this > instead of copying to a temporary location?. > And that's fine, but my point is forcing it to only play that role seems unnecessary. If you want a 'real' parameter to say "only return a path if I can pass it to an API that requires it" then that's fine. > > Then the ?optimized? but still naive approach becomes: > > # in pkgutil > def get_data_filename(package, resource): > mod = importlib.import_module(package) > if hasattr(mod.__loader__, "get_data_filename"): > try: > filename = mod.__loader__.get_data_filename(package, resource) > except FileNotFoundError: > pass > else: > if filename is not None: > return filename > > data = get_data(package, resource) > if data is not None: > with open("/tmp/path", "wb") as fp: > fp.write(data) > return "/tmp/path" > > This means there?s basically no penalty for using this API to access > resources files when you?re accessing files from a FileLoader. > And leaking a temp file until shutdown which is why Barry and I prefer a context manager. =) > In my opinion anything that is harder to use than: > > MY_PATH = os.path.join(os.path.dirname(__file__), ?my/file.txt?) > > Is highly unlikely to be used. People can already just write things to a > temporary directory using get_data, but the point is they don?t because > it?s a waste of time for the common case and it?s easier not to do that. > That's fine, but I also feel like we are trying to design around bad API design where something is assuming all data is going to be on disk and thus it's okay to require a file path on the filesystem instead of taking the bytes directly or a file-like object. I realize you are trying to solve this specifically for OpenSSL since it has the nasty practice of wanting a file path, but from an import perspective I have to also worry about what makes sense for the API as a whole and from the perspective of import. > > > >> >> >> >>> >>> I think it is redundant but given that it?s a new API (passing module >>> and a ?resource path?) I think it makes sense. The old get_data API can be >>> deprecated but left in for compatibility reasons if we want (sort of like >>> Loader().load_module() -> Loader().exec_module()). >>> >> >> If we do that then there would have to be a way to specify how to read >> the bytes for the module code itself since get_data() is used in the >> implementation of import by coupling it with get_filename() (which is why >> I'm trying not have to drop get_filename()/get_data() and instead come up >> with some new approach to reading bytes since the current approach is very >> composable). So get_bytes() would need a way to signal that you don't want >> some data file but the bytes for the module. Maybe if the path section is >> unspecified then that's a signal that the module's bytes is wanted and not >> some data file? >> >> >> Perhaps trying to read modules and resource files with the same method is >> the wrong approach? 
>> > > If we are going to do that then we might as well deprecate all the methods > that try to expose reading data and paths as the PEP 302 APIs tried to > expose it uniformly. > > > I don?t think it makes sense to expose it uniformly, code is semantically > different than data files and people need the ability to do different > things with them. It?s unlikely you?ll get a 2GB.py file, however a 2GB > data file is completely within the realms of possibility. > > > >> >> Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 >> > > That seems like a bit much, e.g. why do you needs bytes **and** and a > file-like object() when you get the former from the latter? And why do you > need the path argument when you can get the path off the file-like object > if it's an actual file object? > > > I don?t think it?s a bit much at all. > > You get a stream method because sometimes things expect a file like object > or sometimes the file is big and the ability to access a stream that > handles that for you is super important. However when using a stream you > need to ensure you close the stream after you?re done using it. > With a context manager the closing requirement is negligible. And that only is an optimization if you're reading from something that allows for incremental reads, e.g. it's not an optimization for a SQL-backed loader (which is probably why PEP 302 has get_data() instead of get_file_object() or something). > > You get a bytes method because sometimes you don?t care about all of that > and you just need/want the raw bytes, it?s a nicer API for those people to > be able to just get bytes without having to worry about reading a file or > closing the file after they are done reading it. > That seems unnecessary if you want to provide the optimization of allowing a file-like object to be returned when reading all of the bytes takes two lines of code instead of one. People know how to read files so it isn't like it's a new paradigm. > > You get a filename method because the stream method may or may not return > a file object that has a path at all, and if you just need to pass the path > into another API having an open file handle just to get the filename is a > waste of a file handle. > As I said above, I partially feel like the desire for this support is to work around some API decisions that are somewhat poor. How about this: get_path(package, path, *, real=False) or get_path(package, filename, *, real=False) -- depending on whether Barry and me get our way about paths or you do, Donald -- where 'real' is a flag specifying whether the path has to work as a path argument to builtins.open() and thus fails accordingly (in instances where it won't work it can fail immediately and so loader implementers only have two lines of code to care about to manage it). Then loaders can keep their get_data() method without issue and the API for loaders only grew by 1 (or stays constant depending on whether we want/can have it subsume get_filename() long-term). As for importlib.resources, that can provide a higher-level API for a file-like object along with some way to say whether the file must be addressable on the filesystem to know if tempfile.NamedTemporaryFile() may be backing the file-like object or if io.BytesIO could provide the API. This gets me a clean API for loaders and importlib and gets you your real file paths as needed. -Brett > > > > -Brett > > >> >> This means that we?re not talking about ?data? files, but ?resource? >> files. 
This also removes the idea that you can call Loader.set_data() on >> those files (like I've seen in the implementation). >> >> >> >>> >>> One thing to consider is do we want to allow anything other than >>> filenames for the path part? Thanks to namespace packages every directory >>> is essentially a package, so we could say that the package anchor has to >>> encapsulate the directory and the path bit can only be a filename. That >>> gets us even farther away from having the concept of file paths being >>> manipulated in relation to import-related APIs. >>> >>> >>> I think we do want to allow directories, it's not unusual to have >>> something like:
>>>
>>> warehouse
>>> ├── __init__.py
>>> ├── templates
>>> │   ├── accounts
>>> │   │   └── profile.html
>>> │   └── hello.html
>>> ├── utils
>>> │   └── mapper.py
>>> └── wsgi.py
>>>
>>> Conceptually templates isn't a package (even though with namespace >>> packages it kinda is) and I'd want to load profile.html by doing something >>> like: >>> >>> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >>> >> >> Where I would be fine with get_bytes('warehouse.templates.accounts', >> 'profile.html') =) >> >> >>> >>> In pkg_resources the second argument to that function is a 'resource >>> path' which is defined as relative to the given module/package and it >>> must use / to denote them. It explicitly says it's not a file system path >>> but a resource path. It may translate to a file system path (as is the case >>> with the FileLoader) but it also may not (as is the case with a theoretical >>> S3Loader or PostgreSQLLoader). >>> >> >> Yep, which is why I'm making sure if we have paths we minimize them as >> they instantly make these alternative loader concepts a bigger pain to >> implement. >> >> >>> How you turn a warehouse + a resource path into some data (or whatever >>> other function we support) is an implementation detail of the Loader. >>> >>> >>> And just so I don't forget it, I keep wanting to pass an actual module >>> in so the code can extract the name that way, but that prevents the >>> __name__ trick as you would have to import yourself or grab the module from >>> sys.modules. >>> >>> >>> Is an actual module what gets passed into Loader().exec_module()? >>> >> >> Yes. >> >> >>> If so I think it's fine to pass that into the new Loader() functions and >>> a new top level API in importlib.resources can do the things needed to turn >>> a string into a module object. So instead of doing >>> __loader__.get_bytes(__name__, 'logo.gif') you'd do >>> importlib.resources.get_bytes(__name__, 'logo.gif'). >>> >> >> If we go the route of importlib.resources then that seems like a >> reasonable idea, although we will need to think through the ramifications >> to exec_module() itself although I don't think there will be any issues. >> >> And if we do go with importlib.resources I will probably want to make it >> available on PyPI with appropriate imp/pkgutil fallbacks to help people >> transitioning from Python 2 to 3. >> >> --- >> Donald Stufft >> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA >> > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.f.moore at gmail.com Sat Jan 31 23:41:36 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 22:41:36 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: On 31 January 2015 at 18:09, Donald Stufft wrote: > I don't think it's important for this API to support extracting extension > modules. If we want to support importing extension modules from inside > of a zip file (or similar) I think that should get its own support inside > the loader and not rely on the resource extraction for that. IOW I think > that these should primarily exist for data files. +1 Paul