From bcannon at gmail.com Fri Jan 30 20:28:40 2015 From: bcannon at gmail.com (Brett Cannon) Date: Fri, 30 Jan 2015 19:28:40 +0000 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? Message-ID: Something I have been thinking about is whether we should start embedding the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo (the O could also be lowercase if people preferred). It would save people from making the mistake of executing their code with a mixture of -O and -OO. It also avoids having to regenerate all of your .pyo whenever you want to tweak which optimization level you are running at. And finally, if we make importlib.cache_from_source() take an optional `optimization` argument then people could even start specifying their own optimizations and have them saved to their own .pyo files (with the caveat that some restrictions be placed on the value, such as it has pass str.isalnum()). As for importlib.cache_from_source() and it's debug_override parameter, I would say we should lean on bools being ints and simply use its argument as the optimization level (while it gets phased out). I would love to even go so far as to say that we drop the .pyo file extension and make what has normally been .pyc files be .O0.pyc and what has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is that it might break too much code in a transition and so .pyc stays as such and then .O1.pyo and .O2.pyo comes into existence from the stdlib. By doing this the last bit of runtime state that influences compiling and importing code will somehow be exposed in bytecode files. I don't think it should be embedded in the bytecode file header as this has nothing to do with the validity of the bytecode compared to the source, just whether it should be run with the current interpreter (much like the interpreter name). Thoughts? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Fri Jan 30 20:35:28 2015 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 30 Jan 2015 11:35:28 -0800 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: Message-ID: <54CBDD00.2060703@stoneleaf.us> From a user perspective that sounds like a good idea. -- ~Ethan~ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From barry at python.org Fri Jan 30 22:46:46 2015 From: barry at python.org (Barry Warsaw) Date: Fri, 30 Jan 2015 16:46:46 -0500 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: Message-ID: <20150130164646.5d1538ff@anarchist.wooz.org> On Jan 30, 2015, at 07:28 PM, Brett Cannon wrote: >Something I have been thinking about is whether we should start embedding >the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo +1 - we've had some trouble in the past in Debian with the name collisions on .pyo for the different optimization levels. >I would love to even go so far as to say that we drop the .pyo file >extension and make what has normally been .pyc files be .O0.pyc and what >has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is >that it might break too much code in a transition and so .pyc stays as such >and then .O1.pyo and .O2.pyo comes into existence from the stdlib. I actually *would* go so far. 
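To make the naming scheme under discussion concrete, the tagged files would look roughly like this (a sketch only: the exact tag spelling, and the `optimization` argument to importlib.util.cache_from_source(), are hypothetical at this point):

    foo.cpython-35.pyc       # no optimization (today's plain .pyc)
    foo.cpython-35.O1.pyo    # python -O  (asserts stripped)
    foo.cpython-35.O2.pyo    # python -OO (asserts and docstrings stripped)

    # Hypothetical keyword; the current signature only has debug_override.
    importlib.util.cache_from_source('foo.py', optimization=2)
    # -> '__pycache__/foo.cpython-35.O2.pyo'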
I thought about it during the PEP 3147 time frame but it was out-of-scope at the time. A transition period might be necessary (and/or a switch to choose) but I think it's a good end state. Cheers, -Barry

From donald at stufft.io Sat Jan 31 00:37:44 2015 From: donald at stufft.io (Donald Stufft) Date: Fri, 30 Jan 2015 18:37:44 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package Message-ID:

It's often useful to be able to load a resource from a Python module or package. Currently you can load the data into memory using pkgutil.get_data, however this doesn't help much if you need to pass that data into an API that only accepts a filepath. Code that needs to do this often does something like os.path.join(os.path.dirname(__file__), "myfile.txt"), however that doesn't work from within a zip file.

I think it would be a good idea to implement a pkgutil.get_data_filename function which would return a filename that can be accessed to get at that particular bit of package data. In addition I think it would be a good idea to add an optional get_data_filename method onto the Loader that can be used by a loader to indicate when a file *already* exists on the filesystem. Essentially this boils down to the pkgutil.get_data_filename(package, resource) function doing this:

1. Check if the loader for the package implements a get_data_filename method, and if it does and it returns a value that is not None, simply return that value. The FileLoader can then have a simple get_data_filename that just returns the on-disk filename.

2. If the loader doesn't have a get_data_filename method, or it returns None, then call pkgutil.get_data; if that returns None then return None ourselves. If it doesn't return None then save that data to a temporary file and return the path to that temporary file.

I've implemented this (without tests); you can see it here: https://bpaste.net/show/2e51b0588dcd

I have a few concerns, however. Currently Loader.get_data() requires you to pass the entire path of the file you want to open (like /usr/lib/python3.5/site-packages/foo/bar.txt or /data/foo.zip/bar.txt), whereas I've made Loader.get_data_filename() want a relative path (like bar.txt). I wonder if this difference is OK? If not, I wonder if we can make Loader.get_data accept a relative path as well. I think this is a generally more useful way of using the function because it doesn't restrict loaders to the file system only (which get_data currently is restricted to, I believe) and it lets the Loader encapsulate the logic about how to translate a relative path to a chunk of data instead of needing the caller to do that.

My other problem is that pkgutil.get_data doesn't currently work for PEP 420 namespace packages, and due to the above I'm not sure how to actually make it work in a reasonable way without allowing get_data to accept relative paths as well. Because my patch lets the Loader encapsulate turning a relative path into a file path, pkgutil.get_data_filename() and _NamespaceLoader.get_data_filename both work and support PEP 420 namespace packages.

A. What do people think about pkgutil.get_data_filename and Loader.get_data_filename?

B. What do people think about modifying Loader.get_data so it can support relative filenames instead of the calling code needing to handle that?
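A minimal sketch of the lookup order in points 1 and 2 above (hypothetical helper and loader method names; not the code behind the bpaste link):

    import atexit
    import importlib
    import os
    import pkgutil
    import tempfile

    def get_data_filename(package, resource):
        loader = importlib.import_module(package).__loader__
        # 1. Prefer a loader that already knows an on-disk filename.
        #    'get_data_filename' on the loader is the optional method proposed
        #    above; it is assumed to take a path relative to the package.
        get_filename = getattr(loader, 'get_data_filename', None)
        if get_filename is not None:
            filename = get_filename(resource)
            if filename is not None:
                return filename
        # 2. Otherwise fall back to get_data() and spill to a temporary file.
        data = pkgutil.get_data(package, resource)
        if data is None:
            return None
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        atexit.register(os.remove, path)  # clean the temp file up at exit
        return path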
--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From ericsnowcurrently at gmail.com Sat Jan 31 01:01:58 2015 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Fri, 30 Jan 2015 17:01:58 -0700 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: <20150130164646.5d1538ff@anarchist.wooz.org> References: <20150130164646.5d1538ff@anarchist.wooz.org> Message-ID: On Fri, Jan 30, 2015 at 2:46 PM, Barry Warsaw wrote: > On Jan 30, 2015, at 07:28 PM, Brett Cannon wrote: > >>Something I have been thinking about is whether we should start embedding >>the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo > > +1 - we've had some trouble in the past in Debian with the name collisions on > .pyo for the different optimization levels. > >>I would love to even go so far as to say that we drop the .pyo file >>extension and make what has normally been .pyc files be .O0.pyc and what >>has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is >>that it might break too much code in a transition and so .pyc stays as such >>and then .O1.pyo and .O2.pyo comes into existence from the stdlib. > > I actually *would* go so far. I thought about it during the PEP 3147 > time frame but it was out-of-scope at the time. A transition period might be > necessary (and/or a switch to choose) but I think it's a good end state. +1 to all of it. :) -eric From ericsnowcurrently at gmail.com Sat Jan 31 01:03:32 2015 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Fri, 30 Jan 2015 17:03:32 -0700 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: Message-ID: On Fri, Jan 30, 2015 at 12:28 PM, Brett Cannon wrote: > And finally, if we make > importlib.cache_from_source() take an optional `optimization` argument then > people could even start specifying their own optimizations and have them > saved to their own .pyo files (with the caveat that some restrictions be > placed on the value, such as it has pass str.isalnum()). I like that! It would make it much easier to work on new optimizations. -eric From p.f.moore at gmail.com Sat Jan 31 01:18:30 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 00:18:30 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: On 30 January 2015 at 23:37, Donald Stufft wrote: > A. What do people think about pkgutil.get_data_filename and > Loader.get_data_filename? Sounds reasonable. It's a relatively rare, but useful use case. One possible issue, though, would people assume that if they get a filename it'd be writeable? For the filesystem loader it would be, but that would break subtly (writes work but would get discarded) for loaders that don't have a native get_data_filename. Related question - how would the temp files be cleaned up? At exit? > B. What do people think about modifying Loader.get_data so it can support > relative filenames instead of the calling code needing to handle that? I'd have to think about that one, but in principle it seems reasonable. While we're extending the loaders, a far more commonly requested feature would be to list available data files. At the moment, code can only load data from known paths, which is not ideal. While it's unrelated to the original proposal, it makes sense if we're changing the spec of loaders to do it in one go, rather than having multiple iterations. 
Paul From donald at stufft.io Sat Jan 31 01:52:50 2015 From: donald at stufft.io (Donald Stufft) Date: Fri, 30 Jan 2015 19:52:50 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> > On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: > > On 30 January 2015 at 23:37, Donald Stufft wrote: >> A. What do people think about pkgutil.get_data_filename and >> Loader.get_data_filename? > > Sounds reasonable. It's a relatively rare, but useful use case. One > possible issue, though, would people assume that if they get a > filename it'd be writeable? For the filesystem loader it would be, but > that would break subtly (writes work but would get discarded) for > loaders that don't have a native get_data_filename. I don?t think you can assume it?s writeable since that?ll break in a lot of common cases even with the filesystem loader since often times things in the filesystem will be installed in the system and users won?t have permissions to write to them anyways. > > Related question - how would the temp files be cleaned up? At exit? My patch registers an atexit handler that cleans up the temporary files yea. > >> B. What do people think about modifying Loader.get_data so it can support >> relative filenames instead of the calling code needing to handle that? > > I'd have to think about that one, but in principle it seems reasonable. > > While we're extending the loaders, a far more commonly requested > feature would be to list available data files. At the moment, code can > only load data from known paths, which is not ideal. While it's > unrelated to the original proposal, it makes sense if we're changing > the spec of loaders to do it in one go, rather than having multiple > iterations. Well both pkgutil.get_data and pkgutil.get_data_filename have parallels in the pkg_resources library for similar reasons. If we want to extend this to more things it might make sense to take a look at what all exists there currently: resource_exists(package_or_requirement, resource_name) Does the named resource exist? Return True or False accordingly. resource_stream(package_or_requirement, resource_name) Return a readable file-like object for the specified resource; it may be an actual file, a StringIO, or some similar object. The stream is in ?binary mode?, in the sense that whatever bytes are in the resource will be read as-is. resource_string(package_or_requirement, resource_name) Return the specified resource as a string. The resource is read in binary fashion, such that the returned string contains exactly the bytes that are stored in the resource. resource_isdir(package_or_requirement, resource_name) Is the named resource a directory? Return True or False accordingly. resource_listdir(package_or_requirement, resource_name) List the contents of the named resource directory, just like os.listdir except that it works even if the resource is in a zipfile. resource_filename(package_or_requirement, resource_name) Sometimes, it is not sufficient to access a resource in string or stream form, and a true filesystem filename is needed. In such cases, you can use this method (or module-level function) to obtain a filename for a resource. If the resource is in an archive distribution (such as a zipped egg), it will be extracted to a cache directory, and the filename within the cache will be returned. 
If the named resource is a directory, then all resources within that directory (including subdirectories) are also extracted. If the named resource is a C extension or ?eager resource? (see the setuptools documentation for details), then all C extensions and eager resources are extracted at the same time. See https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access and https://pythonhosted.org/setuptools/pkg_resources.html#resource-extraction Obviously the similar functions here are: * pkgutil.get_data is pkg_resources.resource_string * pkgutil.get_data_filename is pkg_resources.resource_filename The major difference being that pkg_resource.resource_filename will extract to a cache directory (controllable with an environment variable or programatically) and won't clean up the extracted files. This means that they are (by default) extracted once per user and reused between extractions. I felt like it made more sense to just extract to a temporary location (even though this is less performant) in the stdlib. That leaves: * resource_exists * resource_stream * resource_isdir * resource_listdir Which can be done via pkg_resources but not via the standard library, I don't have a major opinion on whether or not the standard library should do all of them but I don't think it would hurt if it did. Another interesting question if we're going to add more methods is where they should all live. As far as I know pkgutil.get_data predates the importlib module. Perhaps deprecating pkgutil.get_data and adding a importlib.resources module which supports functions like: * get_bytes(package, resource) * get_stream(package, resource) * get_filename(package, resource) * exists(package, resource) * isdir(package, resource) * listdir(package, resource) Changing the names (particular get_data -> get_bytes) could also provide the mechanism for allowing relative files and deprecating the "you must pass in a full file path to the Loader()" behavior since the get_data method could be left alone and a new get_bytes method could be added. This would mean people can do things like: import importlib.resources import socket import ssl context = ssl.SSLContext(ssl.PROTOCOL_SSLv23) context.verify_mode = ssl.CERT_REQUIRED context.check_hostname = True context.load_verify_locations( cafile=importlib.resources.get_filename("certifi", "cacert.pem"), ) s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ssl_sock = context.wrap_socket(s, server_hostname='www.verisign.com') ssl_sock.connect(('www.verisign.com', 443)) --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From p.f.moore at gmail.com Sat Jan 31 10:34:45 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 09:34:45 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On 31 January 2015 at 00:52, Donald Stufft wrote: >> Sounds reasonable. It's a relatively rare, but useful use case. One >> possible issue, though, would people assume that if they get a >> filename it'd be writeable? For the filesystem loader it would be, but >> that would break subtly (writes work but would get discarded) for >> loaders that don't have a native get_data_filename. 
> > I don?t think you can assume it?s writeable since that?ll break in a lot > of common cases even with the filesystem loader since often times things > in the filesystem will be installed in the system and users won?t have > permissions to write to them anyways. Agreed, It's just that it could happen (either deliberately or by accident). One example I found was pytz, which downloads and builds the timezone data by doing dirname(__file__) in an "update the DB" API call - it'd be an "obvious" case for using resource data. (That was from a long time ago - checking the code now they seem to have tidied this up so it's no longer that way). But yes, documenting it as "don't do that" is probably fine. >> Related question - how would the temp files be cleaned up? At exit? > > My patch registers an atexit handler that cleans up the temporary files yea. Great. >>> B. What do people think about modifying Loader.get_data so it can support >>> relative filenames instead of the calling code needing to handle that? >> >> I'd have to think about that one, but in principle it seems reasonable. >> >> While we're extending the loaders, a far more commonly requested >> feature would be to list available data files. At the moment, code can >> only load data from known paths, which is not ideal. While it's >> unrelated to the original proposal, it makes sense if we're changing >> the spec of loaders to do it in one go, rather than having multiple >> iterations. > > Well both pkgutil.get_data and pkgutil.get_data_filename have parallels in the > pkg_resources library for similar reasons. If we want to extend this to more > things it might make sense to take a look at what all exists there currently: +1 on following pkg_resources. > resource_exists(package_or_requirement, resource_name) > Does the named resource exist? Return True or False accordingly. > > resource_stream(package_or_requirement, resource_name) > Return a readable file-like object for the specified resource; it may be an > actual file, a StringIO, or some similar object. The stream is in > ?binary mode?, in the sense that whatever bytes are in the resource will be > read as-is. > > resource_string(package_or_requirement, resource_name) > Return the specified resource as a string. The resource is read in binary > fashion, such that the returned string contains exactly the bytes that are > stored in the resource. > > resource_isdir(package_or_requirement, resource_name) > Is the named resource a directory? Return True or False accordingly. > > resource_listdir(package_or_requirement, resource_name) > List the contents of the named resource directory, just like os.listdir > except that it works even if the resource is in a zipfile. > > resource_filename(package_or_requirement, resource_name) > Sometimes, it is not sufficient to access a resource in string or stream > form, and a true filesystem filename is needed. In such cases, you can use > this method (or module-level function) to obtain a filename for a resource. > If the resource is in an archive distribution (such as a zipped egg), it > will be extracted to a cache directory, and the filename within the cache > will be returned. If the named resource is a directory, then all resources > within that directory (including subdirectories) are also extracted. If the > named resource is a C extension or ?eager resource? (see the setuptools > documentation for details), then all C extensions and eager resources are > extracted at the same time. 
> > See https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access > and https://pythonhosted.org/setuptools/pkg_resources.html#resource-extraction > > Obviously the similar functions here are: > > * pkgutil.get_data is pkg_resources.resource_string > * pkgutil.get_data_filename is pkg_resources.resource_filename > > The major difference being that pkg_resource.resource_filename will extract to > a cache directory (controllable with an environment variable or > programatically) and won't clean up the extracted files. This means that they > are (by default) extracted once per user and reused between extractions. I felt > like it made more sense to just extract to a temporary location (even though > this is less performant) in the stdlib. > > That leaves: > > * resource_exists > * resource_stream > * resource_isdir > * resource_listdir > > Which can be done via pkg_resources but not via the standard library, I don't > have a major opinion on whether or not the standard library should do all of > them but I don't think it would hurt if it did. > > Another interesting question if we're going to add more methods is where they > should all live. As far as I know pkgutil.get_data predates the importlib > module. Perhaps deprecating pkgutil.get_data and adding a importlib.resources > module which supports functions like: > > * get_bytes(package, resource) > * get_stream(package, resource) > * get_filename(package, resource) > * exists(package, resource) > * isdir(package, resource) > * listdir(package, resource) > > Changing the names (particular get_data -> get_bytes) could also provide the > mechanism for allowing relative files and deprecating the "you must pass in > a full file path to the Loader()" behavior since the get_data method could be > left alone and a new get_bytes method could be added. > > This would mean people can do things like: > > import importlib.resources > import socket > import ssl > > context = ssl.SSLContext(ssl.PROTOCOL_SSLv23) > context.verify_mode = ssl.CERT_REQUIRED > context.check_hostname = True > context.load_verify_locations( > cafile=importlib.resources.get_filename("certifi", "cacert.pem"), > ) > > s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) > ssl_sock = context.wrap_socket(s, server_hostname='www.verisign.com') > ssl_sock.connect(('www.verisign.com', 443)) +1 on all of the above. Obviously, a lot of the support methods in loaders would need to be optional, but that's fine - and the vast majority of use cases are the filesystem and zipfiles, both of which support these methods, and can be handled in the stdlib. Paul From brett at python.org Sat Jan 31 15:48:12 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 14:48:12 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On Fri, Jan 30, 2015, 19:52 Donald Stufft wrote: > On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: > > On 30 January 2015 at 23:37, Donald Stufft wrote: >> A. What do people think about pkgutil.get_data_filename and >> Loader.get_data_filename? > > Sounds reasonable. It's a relatively rare, but useful use case. One > possible issue, though, would people assume that if they get a > filename it'd be writeable? For the filesystem loader it would be, but > that would break subtly (writes work but would get discarded) for > loaders that don't have a native get_data_filename. 
I don?t think you can assume it?s writeable since that?ll break in a lot of common cases even with the filesystem loader since often times things in the filesystem will be installed in the system and users won?t have permissions to write to them anyways. > > Related question - how would the temp files be cleaned up? At exit? My patch registers an atexit handler that cleans up the temporary files yea. > >> B. What do people think about modifying Loader.get_data so it can support >> relative filenames instead of the calling code needing to handle that? > > I'd have to think about that one, but in principle it seems reasonable. > > While we're extending the loaders, a far more commonly requested > feature would be to list available data files. At the moment, code can > only load data from known paths, which is not ideal. While it's > unrelated to the original proposal, it makes sense if we're changing > the spec of loaders to do it in one go, rather than having multiple > iterations. Well both pkgutil.get_data and pkgutil.get_data_filename have parallels in the pkg_resources library for similar reasons. If we want to extend this to more things it might make sense to take a look at what all exists there currently: resource_exists(package_or_requirement, resource_name) Does the named resource exist? Return True or False accordingly. resource_stream(package_or_requirement, resource_name) Return a readable file-like object for the specified resource; it may be an actual file, a StringIO, or some similar object. The stream is in ?binary mode?, in the sense that whatever bytes are in the resource will be read as-is. resource_string(package_or_requirement, resource_name) Return the specified resource as a string. The resource is read in binary fashion, such that the returned string contains exactly the bytes that are stored in the resource. resource_isdir(package_or_requirement, resource_name) Is the named resource a directory? Return True or False accordingly. resource_listdir(package_or_requirement, resource_name) List the contents of the named resource directory, just like os.listdir except that it works even if the resource is in a zipfile. resource_filename(package_or_requirement, resource_name) Sometimes, it is not sufficient to access a resource in string or stream form, and a true filesystem filename is needed. In such cases, you can use this method (or module-level function) to obtain a filename for a resource. If the resource is in an archive distribution (such as a zipped egg), it will be extracted to a cache directory, and the filename within the cache will be returned. If the named resource is a directory, then all resources within that directory (including subdirectories) are also extracted. If the named resource is a C extension or ?eager resource? (see the setuptools documentation for details), then all C extensions and eager resources are extracted at the same time. See https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access and https://pythonhosted.org/setuptools/pkg_resources.html#resource-extraction Obviously the similar functions here are: * pkgutil.get_data is pkg_resources.resource_string * pkgutil.get_data_filename is pkg_resources.resource_filename The major difference being that pkg_resource.resource_filename will extract to a cache directory (controllable with an environment variable or programatically) and won't clean up the extracted files. This means that they are (by default) extracted once per user and reused between extractions. 
I felt like it made more sense to just extract to a temporary location (even though this is less performant) in the stdlib. That leaves: * resource_exists * resource_stream * resource_isdir * resource_listdir Which can be done via pkg_resources but not via the standard library, I don't have a major opinion on whether or not the standard library should do all of them but I don't think it would hurt if it did. Another interesting question if we're going to add more methods is where they should all live. As far as I know pkgutil.get_data predates the importlib module. It does, so you really have to think in terms of finders and loaders. Perhaps deprecating pkgutil.get_data and adding a importlib.resources module which supports functions like: * get_bytes(package, resource) * get_stream(package, resource) * get_filename(package, resource) * exists(package, resource) * isdir(package, resource) * listdir(package, resource) Changing the names (particular get_data -> get_bytes) could also provide the mechanism for allowing relative files and deprecating the "you must pass in a full file path to the Loader()" behavior since the get_data method could be left alone and a new get_bytes method could be added. The reason Loader.get_data() takes absolute paths is to do away with ambiguity. If you have a relative path and ask a loader to read that path, where should that relative path be anchored? Should it be the top-level package? What about the module that loader ewas returned to handle? But then what about if a finder caches loaders and reuses them across modules (nothing in PEP 302 says you can't do this and in actuality the frozen and built-in loaders are just static and class methods). The choice of dealing exclusively in absolute paths was a conscious choice on my part. Now having said that, there is nothing to say absolute paths require file system I based paths. What you should really do is think of these paths as opaque, non-ambiguous paths for the loader which claimed it knew what file path was needed to pass to get_data(). If you think that way then you realize you can use markers in the path as necessary, e.g. some/path/file.zip/pkg/sub/data.txt. As long as loader.get_data() can unambiguously read that path as returned by get_data_filename() or whatever the method is called then you have fully abstracted paths out while still being able to read data from a loader. Basically any API dealing with paths for loaders needs to abstract away the concept of files, file-like paths, etc. and rely on using the loader API on pretty much everything as a simple os.path of its own. This is why I have not tried to tackle the issue of the list_contents() or some such API to list modules and potentially data files as it needs to not really have a concrete concept of file paths (and it really should be on finders and not loaders which complicates discovery, selecting the right finder, etc.). This is also why APIs wanting a file path instead of taking a file-like object simply cannot play well with importlib and loaders which have alternative back end storage without simply being lucky that the loader they are working with uses filesystem paths (or writing out to a temp file). 
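As an illustration of that "opaque but unambiguous path" idea, here is a made-up zip-backed loader (not importlib or zipimport code) whose get_data() only understands paths that embed the archive name, e.g. some/path/file.zip/pkg/sub/data.txt:

    import zipfile

    class ZipLoader:
        def __init__(self, archive):
            self.archive = archive  # e.g. 'some/path/file.zip'

        def get_data(self, path):
            # The path is "absolute" for this loader: it must start with the
            # archive name; the remainder is resolved inside the zip file.
            # (Assumes '/' separators for simplicity.)
            prefix = self.archive + '/'
            if not path.startswith(prefix):
                raise OSError('path not handled by this loader: %r' % path)
            with zipfile.ZipFile(self.archive) as zf:
                return zf.read(path[len(prefix):])  # bytes, as get_data() promises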
-brett This would mean people can do things like: import importlib.resources import socket import ssl context = ssl.SSLContext(ssl.PROTOCOL_SSLv23) context.verify_mode = ssl.CERT_REQUIRED context.check_hostname = True context.load_verify_locations( cafile=importlib.resources.get_filename("certifi", "cacert.pem"), ) s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ssl_sock = context.wrap_socket(s, server_hostname='www.verisign.com') ssl_sock.connect(('www.verisign.com', 443)) --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA _______________________________________________ Import-SIG mailing list Import-SIG at python.org https://mail.python.org/mailman/listinfo/import-sig -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 16:34:41 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 10:34:41 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> > On Jan 31, 2015, at 9:48 AM, Brett Cannon wrote: > > The reason Loader.get_data() takes absolute paths is to do away with ambiguity. If you have a relative path and ask a loader to read that path, where should that relative path be anchored? Should it be the top-level package? What about the module that loader ewas returned to handle? But then what about if a finder caches loaders and reuses them across modules (nothing in PEP 302 says you can't do this and in actuality the frozen and built-in loaders are just static and class methods). The choice of dealing exclusively in absolute paths was a conscious choice on my part. > > Now having said that, there is nothing to say absolute paths require file system I based paths. What you should really do is think of these paths as opaque, non-ambiguous paths for the loader which claimed it knew what file path was needed to pass to get_data(). If you think that way then you realize you can use markers in the path as necessary, e.g. some/path/file.zip/pkg/sub/data.txt. As long as loader.get_data() can unambiguously read that path as returned by get_data_filename() or whatever the method is called then you have fully abstracted paths out while still being able to read data from a loader. > > Basically any API dealing with paths for loaders needs to abstract away the concept of files, file-like paths, etc. and rely on using the loader API on pretty much everything as a simple os.path of its own. This is why I have not tried to tackle the issue of the list_contents() or some such API to list modules and potentially data files as it needs to not really have a concrete concept of file paths (and it really should be on finders and not loaders which complicates discovery, selecting the right finder, etc.). This is also why APIs wanting a file path instead of taking a file-like object simply cannot play well with importlib and loaders which have alternative back end storage without simply being lucky that the loader they are working with uses filesystem paths (or writing out to a temp file). > I think that dealing in absolute file paths (whether they are ?real? paths or not) makes the APIs super hard to use in anything but the simple case. For instance what do you do in a namespace package (either PEP 420 or one that extends the module __path__). 
There you have multiple candidate file paths and no good way to figure out which one you need to use and It requires that your code couple itself with the implementation of the package and it will break if someone changes from a module to a namespace package. The way the PEP 302 Loaders work isn?t super obvious to me, so I?m looking at the implementation and making assumptions about it and I thought that it was one Loader per importable name. Looking closer it appears the way you ?import? a module from a Loader is using Loader().exec_module(?foo.bar?). So I?d say then that the Loader() APIs should be Loader().get_bytes(?foo.bar?, ?relative/to/foo.bar/file.txt?). That should resolve the case about not knowing what it should be relative to, since it should be relative to the name given. Then the Loader() can encapsulate the logic about how to turn ?foo.bar? + ?relative/to/foo.bar/file.txt? into an absolute path for to get some data (or something else). It seems obvious to me that requiring a full path like that is the wrong way to expect people to work with constructing full paths for resources. It would be similar to expecting people to do ``import /data/foo.zip/submodule``. The import system should be abstracting all of that away for them. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Sat Jan 31 16:38:43 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:38:43 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On 31 January 2015 at 14:48, Brett Cannon wrote: > Basically any API dealing with paths for loaders needs to abstract away the > concept of files, file-like paths, etc. and rely on using the loader API on > pretty much everything as a simple os.path of its own. This is why I have > not tried to tackle the issue of the list_contents() or some such API to > list modules and potentially data files as it needs to not really have a > concrete concept of file paths (and it really should be on finders and not > loaders which complicates discovery, selecting the right finder, etc.). This > is also why APIs wanting a file path instead of taking a file-like object > simply cannot play well with importlib and loaders which have alternative > back end storage without simply being lucky that the loader they are working > with uses filesystem paths (or writing out to a temp file). At the time we designed PEP 302, the principle was very strongly to limit the API to the bare minimum that we knew loaders would have to support (you have to be able to get the content of a file, because that's how you load a module). This was because non-filesystem modules were a new concept at the time, and if we'd asked what do people need, everyone (ourselves included) would have automatically assumed "everything a filesystem can do" and we'd have ended up just designing a virtual filesystem API and excluding a lot of possible flexibility (loaders for URLs, or databases, or whatever). Now we've had experience with PEP 302, it's clear that people aren't using the extra flexibility much - but they *do* miss filesystem-like APIs. At the moment, pkg_resources fills in the gap, but that's not integrated with the loader system. 
So I think it's probably about time to accept that these extensions are useful and *don't* limit flexibility in any practical way, and add them to the loader protocol. Paul From donald at stufft.io Sat Jan 31 16:45:16 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 10:45:16 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <7A29542D-8925-4C9A-9F89-999EB679C44D@stufft.io> > On Jan 31, 2015, at 10:38 AM, Paul Moore wrote: > > On 31 January 2015 at 14:48, Brett Cannon wrote: >> Basically any API dealing with paths for loaders needs to abstract away the >> concept of files, file-like paths, etc. and rely on using the loader API on >> pretty much everything as a simple os.path of its own. This is why I have >> not tried to tackle the issue of the list_contents() or some such API to >> list modules and potentially data files as it needs to not really have a >> concrete concept of file paths (and it really should be on finders and not >> loaders which complicates discovery, selecting the right finder, etc.). This >> is also why APIs wanting a file path instead of taking a file-like object >> simply cannot play well with importlib and loaders which have alternative >> back end storage without simply being lucky that the loader they are working >> with uses filesystem paths (or writing out to a temp file). > > At the time we designed PEP 302, the principle was very strongly to > limit the API to the bare minimum that we knew loaders would have to > support (you have to be able to get the content of a file, because > that's how you load a module). This was because non-filesystem modules > were a new concept at the time, and if we'd asked what do people need, > everyone (ourselves included) would have automatically assumed > "everything a filesystem can do" and we'd have ended up just designing > a virtual filesystem API and excluding a lot of possible flexibility > (loaders for URLs, or databases, or whatever). > > Now we've had experience with PEP 302, it's clear that people aren't > using the extra flexibility much - but they *do* miss filesystem-like > APIs. At the moment, pkg_resources fills in the gap, but that's not > integrated with the loader system. So I think it's probably about time > to accept that these extensions are useful and *don't* limit > flexibility in any practical way, and add them to the loader protocol. I don?t think we need to even limit things to file system like loaders. If we make the ?expanded? resource APIs optional then if your non file system loader can?t support something like listing all of the files at a sub resource then it just doesn?t implement that. It means that maybe every type of code won?t work with every type of loader but I think that?s a situation that isn?t able to be remedied. 
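In practice that just means the wrappers (pkgutil today, or an importlib.resources module if one is added) feature-detect the optional method and fail cleanly when it is missing; a rough sketch, with the method name invented for illustration:

    import importlib

    def resource_listdir(package, resource):
        loader = importlib.import_module(package).__loader__
        # 'resource_listdir' is a placeholder for whatever optional listing
        # API a loader might grow; loaders that cannot enumerate their
        # backing store simply do not define it.
        method = getattr(loader, 'resource_listdir', None)
        if method is None:
            raise NotImplementedError(
                '{!r} does not support listing resources'.format(loader))
        return method(package, resource)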
--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From p.f.moore at gmail.com Sat Jan 31 16:46:33 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:46:33 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On 31 January 2015 at 15:34, Donald Stufft wrote: > It seems obvious to me that requiring a full path like that is the wrong way > to expect people to work with constructing full paths for resources. It > would be similar to expecting people to do ``import > /data/foo.zip/submodule``. The import system should be abstracting all of > that away for them. Note the example in PEP 302: d = os.path.dirname(__file__) data = __loader__.get_data(os.path.join(d, "logo.gif")) The parallel is with the historical filesystem-only approach, d = os.path.dirname(__file__) with open(os.path.join(d, "logo.gif"), 'rb') as f: data = f.read() You *don't* want to use a relative pathname then in this case, so the loader protocol is designed to follow that usage. As Brett says, __file__ can have non-filesystem "token" elements (e.g., a zipfile name) if necessary. It's certainly possible to add a new API that loads resources based on a relative name, but you'd have to specify relative to *what*. get_data explicitly ducks out of making that decision. Paul From p.f.moore at gmail.com Sat Jan 31 16:47:13 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:47:13 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <7A29542D-8925-4C9A-9F89-999EB679C44D@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <7A29542D-8925-4C9A-9F89-999EB679C44D@stufft.io> Message-ID: On 31 January 2015 at 15:45, Donald Stufft wrote: > I don?t think we need to even limit things to file system like loaders. > If we make the ?expanded? resource APIs optional then if your non file > system loader can?t support something like listing all of the files at > a sub resource then it just doesn?t implement that. It means that maybe > every type of code won?t work with every type of loader but I think that?s > a situation that isn?t able to be remedied. Sorry, I didn't state that explicitly, but I certainly assumed that would be the case. Paul From donald at stufft.io Sat Jan 31 16:47:44 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 10:47:44 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 10:46 AM, Paul Moore wrote: > > On 31 January 2015 at 15:34, Donald Stufft wrote: >> It seems obvious to me that requiring a full path like that is the wrong way >> to expect people to work with constructing full paths for resources. It >> would be similar to expecting people to do ``import >> /data/foo.zip/submodule``. The import system should be abstracting all of >> that away for them. 
> > Note the example in PEP 302: > > d = os.path.dirname(__file__) > data = __loader__.get_data(os.path.join(d, "logo.gif")) > > The parallel is with the historical filesystem-only approach, > > d = os.path.dirname(__file__) > with open(os.path.join(d, "logo.gif"), 'rb') as f: > data = f.read() > > You *don't* want to use a relative pathname then in this case, so the > loader protocol is designed to follow that usage. As Brett says, > __file__ can have non-filesystem "token" elements (e.g., a zipfile > name) if necessary. > > It's certainly possible to add a new API that loads resources based on > a relative name, but you'd have to specify relative to *what*. > get_data explicitly ducks out of making that decision. data = __loader__.get_bytes(__name__, ?logo.gif?) --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From p.f.moore at gmail.com Sat Jan 31 16:54:13 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 15:54:13 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On 31 January 2015 at 15:47, Donald Stufft wrote: >> It's certainly possible to add a new API that loads resources based on >> a relative name, but you'd have to specify relative to *what*. >> get_data explicitly ducks out of making that decision. > > data = __loader__.get_bytes(__name__, ?logo.gif?) Quite possibly. It needs a bit of fleshing out to make sure it doesn't prohibit sharing of loaders, etc, in the way Brett mentions. Also, the fact that it needs __name__ in there feels wrong - a bit like the old version of super() needing to be told which class it was being called from. But in principle I don't object to finding a suitable form of this. And I like the name get_bytes - much more explicit in these Python 3 days of explicit str/bytes distinctions :-) Paul From donald at stufft.io Sat Jan 31 17:13:05 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 11:13:05 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 10:54 AM, Paul Moore wrote: > > On 31 January 2015 at 15:47, Donald Stufft wrote: >>> It's certainly possible to add a new API that loads resources based on >>> a relative name, but you'd have to specify relative to *what*. >>> get_data explicitly ducks out of making that decision. >> >> data = __loader__.get_bytes(__name__, ?logo.gif?) > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > prohibit sharing of loaders, etc, in the way Brett mentions. Also, the > fact that it needs __name__ in there feels wrong - a bit like the old > version of super() needing to be told which class it was being called > from. But in principle I don't object to finding a suitable form of > this. To be clear, I think using __name__ is massively better than using __file__, for one even though PEP 302 states that __file__ must be set, it actually doesn?t have to be set and PEP 420 doesn?t set it. Even if it did set it that pattern is only actually really usable for non namespace packages (of any type). 
The namespace package way of doing that is basically:

    for path in __path__:
        try:
            data = __loader__.get_data(os.path.join(path, 'logo.gif'))
        except FileNotFoundError:
            pass
        else:
            break
    else:
        raise Exception("Cannot find the file 'logo.gif'")

Either way, if a Loader isn't specific to a particular importable name and can be re-used between them, then you need a way to specify what module it's relative to, and it seems to me the *obvious* way to load a resource that is relative to a module is to tell Python you want to load a particular resource from a particular module, not to construct some (pseudo) file path that says all that information as well but requires you to know if the thing you're importing is a Python module, a Python package, or a namespace package. In order to make a function like pkgutil.get_data that actually works in all situations you'd have to do something like:

    def get_data(package, resource):
        mod = importlib.import_module(package)

        if hasattr(mod, '__path__'):
            for path in mod.__path__:
                try:
                    return mod.__loader__.get_data(os.path.join(path, resource))
                except FileNotFoundError:
                    pass

        if hasattr(mod, "__file__"):
            d = os.path.dirname(mod.__file__)
            try:
                return mod.__loader__.get_data(os.path.join(d, resource))
            except FileNotFoundError:
                pass

This is compared to the situation where the Loaders encapsulate that logic for you:

    def get_data(package, resource):
        mod = importlib.import_module(package)

        try:
            return mod.__loader__.get_bytes(package, resource)
        except FileNotFoundError:
            pass

Obviously the logic in the first function still exists, it's just moved away from the caller needing to handle it and instead the Loader handles it, just like the loader abstracts away the __file__ location for importing a particular module. Although looking closer at the Loader().exec_module implementation, it appears that it expects something other than a string to be passed to it. So if it makes sense, possibly Loader().get_bytes() etc should also expect something other than a string to identify the module as well (whatever it actually wants, I can't tell). Then the utility functions in pkgutil or importlib.resources or whatever will do the logic to translate from a string to whatever the Loader itself wants.

> > And I like the name get_bytes - much more explicit in these Python 3 > days of explicit str/bytes distinctions :-) > Paul

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

From brett at python.org Sat Jan 31 17:19:25 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:19:25 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 10:34:45 AM Donald Stufft wrote: > > On Jan 31, 2015, at 9:48 AM, Brett Cannon wrote: > > The reason Loader.get_data() takes absolute paths is to do away with > ambiguity. If you have a relative path and ask a loader to read that path, > where should that relative path be anchored? Should it be the top-level > package? What about the module that loader was returned to handle? But > then what about if a finder caches loaders and reuses them across modules > (nothing in PEP 302 says you can't do this and in actuality the frozen and > built-in loaders are just static and class methods). The choice of dealing > exclusively in absolute paths was a conscious choice on my part. > > Now having said that, there is nothing to say absolute paths require file > system I based paths.
What you should really do is think of these paths as > opaque, non-ambiguous paths for the loader which claimed it knew what file > path was needed to pass to get_data(). If you think that way then you > realize you can use markers in the path as necessary, e.g. > some/path/file.zip/pkg/sub/data.txt. As long as loader.get_data() can > unambiguously read that path as returned by get_data_filename() or whatever > the method is called then you have fully abstracted paths out while still > being able to read data from a loader. > > Basically any API dealing with paths for loaders needs to abstract away > the concept of files, file-like paths, etc. and rely on using the loader > API on pretty much everything as a simple os.path of its own. This is why I > have not tried to tackle the issue of the list_contents() or some such API > to list modules and potentially data files as it needs to not really have a > concrete concept of file paths (and it really should be on finders and not > loaders which complicates discovery, selecting the right finder, etc.). > This is also why APIs wanting a file path instead of taking a file-like > object simply cannot play well with importlib and loaders which have > alternative back end storage without simply being lucky that the loader > they are working with uses filesystem paths (or writing out to a temp file). > > > I think that dealing in absolute file paths (whether they are ?real? paths > or not) makes the APIs super hard to use in anything but the simple case. > I think we are talking about two different things when we say "relative"; I clarify later. > For instance what do you do in a namespace package (either PEP 420 or one > that extends the module __path__). > There you have multiple candidate file paths and no good way to figure out > which one you need to use and It requires that your code couple itself with > the implementation of the package and it will break if someone changes from > a module to a namespace package. > Yep, but that's just life. If you're reading data out of a package anyway then you are already coupled to its structure so this is no different. > > The way the PEP 302 Loaders work isn?t super obvious to me, so I?m looking > at the implementation and making assumptions about it and I thought that it > was one Loader per importable name. Looking closer it appears the way you > ?import? a module from a Loader is using Loader().exec_module(?foo.bar?). > So I?d say then that the Loader() APIs should be > Loader().get_bytes(?foo.bar?, ?relative/to/foo.bar/file.txt?). That should > resolve the case about not knowing what it should be relative to, since it > should be relative to the name given. Then the Loader() can encapsulate the > logic about how to turn ?foo.bar? + ?relative/to/foo.bar/file.txt? into an > absolute path for to get some data (or something else). > Yes, specifying the package anchor point does away with the ambiguity of relativity as it has an absolute position in a namespace. As long as we do **that** then there are no relative paths to speak of as all the information necessary to calculate an absolute path without ambiguity is provided. > > It seems obvious to me that requiring a full path like that is the wrong > way to expect people to work with constructing full paths for resources. It > would be similar to expecting people to do ``import > /data/foo.zip/submodule``. The import system should be abstracting all of > that away for them. 
> I think what you mean by "relative" and what I mean by "relative" are different. When I say "relative" I mean what you pass to loader.get_data(). What you mean by "relative" is I think the "file.txt" part of a call to get_bytes('some.module', "file.txt") which I don't consider relative as you specify everything for an absolute path. IOW I'm talking about the existing API and its semantics and you're talking in terms of your new API, so we are talking past each other. =) -Brett > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 17:31:41 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:31:41 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote: > On 31 January 2015 at 15:47, Donald Stufft wrote: > >> It's certainly possible to add a new API that loads resources based on > >> a relative name, but you'd have to specify relative to *what*. > >> get_data explicitly ducks out of making that decision. > > > > data = __loader__.get_bytes(__name__, ?logo.gif?) > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > prohibit sharing of loaders, etc, in the way Brett mentions. By specifying the package anchor point I don't think it does. > Also, the > fact that it needs __name__ in there feels wrong - a bit like the old > version of super() needing to be told which class it was being called > from. You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. > But in principle I don't object to finding a suitable form of > this. > > And I like the name get_bytes - much more explicit in these Python 3 > days of explicit str/bytes distinctions :-) One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From donald at stufft.io Sat Jan 31 17:43:52 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 11:43:52 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 10:54:22 AM Paul Moore > wrote: > On 31 January 2015 at 15:47, Donald Stufft > wrote: > >> It's certainly possible to add a new API that loads resources based on > >> a relative name, but you'd have to specify relative to *what*. > >> get_data explicitly ducks out of making that decision. > > > > data = __loader__.get_bytes(__name__, ?logo.gif?) > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > prohibit sharing of loaders, etc, in the way Brett mentions. > > By specifying the package anchor point I don't think it does. > > Also, the > fact that it needs __name__ in there feels wrong - a bit like the old > version of super() needing to be told which class it was being called > from. > > You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. > > But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. > > But in principle I don't object to finding a suitable form of > this. > > And I like the name get_bytes - much more explicit in these Python 3 > days of explicit str/bytes distinctions :-) > > One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data. I think it is redundant but given that it?s a new API (passing module and a ?resource path?) I think it makes sense. The old get_data API can be deprecated but left in for compatibility reasons if we want (sort of like Loader().load_module() -> Loader().exec_module()). > > One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. I think we do want to allow directories, it?s not unusual to have something like: warehouse ??? __init__.py ??? templates ? ??? accounts ? ? ??? profile.html ? ??? hello.html ??? utils ? ??? mapper.py ??? 
wsgi.py Conceptually templates isn?t a package (even though with namespace packages it kinda is) and I?d want to load profile.html by doing something like: importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) In pkg_resources the second argument to that function is a ?resource path? which is defined as a relative to the given module/package and it must use / to denote them. It explicitly says it?s not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. > > And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. Is an actual module what gets passed into Loader().exec_module()? If so I think it?s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, ?logo.gif?) you?d do importlib.resources.get_bytes(__name__, ?logo.gif?). --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 17:21:57 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:21:57 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 11:13:08 AM Donald Stufft wrote: > > > On Jan 31, 2015, at 10:54 AM, Paul Moore wrote: > > > > On 31 January 2015 at 15:47, Donald Stufft wrote: > >>> It's certainly possible to add a new API that loads resources based on > >>> a relative name, but you'd have to specify relative to *what*. > >>> get_data explicitly ducks out of making that decision. > >> > >> data = __loader__.get_bytes(__name__, ?logo.gif?) > > > > Quite possibly. It needs a bit of fleshing out to make sure it doesn't > > prohibit sharing of loaders, etc, in the way Brett mentions. Also, the > > fact that it needs __name__ in there feels wrong - a bit like the old > > version of super() needing to be told which class it was being called > > from. But in principle I don't object to finding a suitable form of > > this. > > To be clear, I think using __name__ is massively better than using > __file__, > for one even though PEP 302 states that __file__ must be set, it actually > doesn?t have to be set and PEP 420 doesn?t set it. Even if it did set it > that pattern is only actually really usable for non namespace packages (of > any type). > So you're starting to get into the murky corners of import. =) PEP 420 actually supercedes PEP 302, but that doesn't mean it negates it. For backwards-compatibility importlib still sets __file__, but you're right it isn't necessary as long as __spec__ is set. -Brett > > The namespace package way of doing that is basically: > > for path in __path__: > try: > data = __loader__.get_data(os.path.join(path, ?logo.gif?)) > except FileNotFoundError: > pass > else: > break > else: > raise Exception(?Cannot Find the file ?logo.gif??) 
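(A runnable restatement of the loop quoted above, assuming, as the snippet itself does, that the package's __loader__ exposes get_data(); 'logo.gif' is only an example name:)

    import os.path

    def find_logo(mod):
        # Try each directory that makes up the (possibly namespace) package
        # until one of them yields the data file.
        for path in mod.__path__:
            try:
                return mod.__loader__.get_data(os.path.join(path, 'logo.gif'))
            except FileNotFoundError:
                pass
        raise FileNotFoundError("Cannot find the file 'logo.gif'")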
> > > Either way if a Loader isn?t specific to a particular importable name and > can > be re-used between them then you need a way to specify what module it?s > relative > to and it seems to me the *obvious* way to load a resource that is > relative to > a module is to tell Python you want to load a particular resource from a > particular > module, not to construct some (pseudo) file path that says all that > information > as well but requires you to know if the thing you?re importing is a Python > module, a python package, or a namespace package. > > In order to make a function like pkgutil.get_data that actually works in > all > situations that you?d have to do something like: > > def get_data(package, resource): > mod = importlib.import_module(package) > if hasattr(mod, ?__path__?): > for path in __path__: > try: > return mod.__loader__.get_data(os.path.join(path, > resource)) > except FileNotFoundError: > pass > if hasattr(mod, "__file__"): > d = os.path.dirname(__file__) > try: > return mod.__loader__.get_data(os.path.join(d, resource)) > except FileNotFoundError: > pass > > This is compared to the situation where the Loaders encapsulate that logic > for you: > > def get_data(package, resource): > mod = importlib.import_module(package) > try: > mod.__loader__.get_bytes(package, resource) > except FileNotFoundError: > pass > > Obviously the logic in the first function still exists, it?s just moved > away > from the caller needing to handle it and instead the Loader handles it, > just > like the loader abstracts away the __file__ location for importing a > particular > module. > > Although looking closer at the Loader().exec_module implementation, It > appears > that it expects something other than a string to be passed to it. So if it > makes > sense possibly Loader().get_bytes() etc should also expect something other > than > a string to identify the module as well (whatever it actually wants, I > can?t tell). > Then the utility functions in pkgutil or importlib.resources or whatever > will do > the logic to translate from a string to whatever the Loader itself wants. > > > > > > And I like the name get_bytes - much more explicit in these Python 3 > > days of explicit str/bytes distinctions :-) > > Paul > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 17:48:36 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 16:48:36 +0000 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? References: <20150130164646.5d1538ff@anarchist.wooz.org> Message-ID: On Fri Jan 30 2015 at 7:02:04 PM Eric Snow wrote: > On Fri, Jan 30, 2015 at 2:46 PM, Barry Warsaw wrote: > > On Jan 30, 2015, at 07:28 PM, Brett Cannon wrote: > > > >>Something I have been thinking about is whether we should start embedding > >>the -O option into the bytecode file name, e.g., foo.cpython-35.O2.pyo > > > > +1 - we've had some trouble in the past in Debian with the name > collisions on > > .pyo for the different optimization levels. > > > >>I would love to even go so far as to say that we drop the .pyo file > >>extension and make what has normally been .pyc files be .O0.pyc and what > >>has usually been -O and -OO be .O1.pyc and .O2.pyc, but my suspicion is > >>that it might break too much code in a transition and so .pyc stays as > such > >>and then .O1.pyo and .O2.pyo comes into existence from the stdlib. > > > > I actually *would* go so far. 
I thought about it during the PEP 3147 > > time frame but it was out-of-scope at the time. A transition period > might be > > necessary (and/or a switch to choose) but I think it's a good end state. > Assuming no one flips out about writing a bunch of files we could write files using the new and old file paths (or symlink the old paths to the new, but that seems to be asking for trouble on some OS that doesn't support them but maybe I'm being paranoid). That way people who construct file paths manually can still read the old paths but those who use cache_from_source() will get the new paths automatically (although override_debug will be a little wonky but nothing horrible in the New World). And anyone who really doesn't want all of those files written can run with sys.dont_write_bytecode set to True after byte-compiling their code. This "multiple bytecode files for the same thing" approach might spike stat calls since we would have to check which path is newer in case someone edited the old path out-of-band, but it shouldn't be too bad (it will obviously startup time will have to be measured). > > +1 to all of it. :) > Since everyone seems to think it's a good idea I will write up a PEP with the end goal of going all the way with .pyc (probably on Friday). -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 18:00:52 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 17:00:52 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote: > On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote: > >> On 31 January 2015 at 15:47, Donald Stufft wrote: >> >> It's certainly possible to add a new API that loads resources based on >> >> a relative name, but you'd have to specify relative to *what*. >> >> get_data explicitly ducks out of making that decision. >> > >> > data = __loader__.get_bytes(__name__, ?logo.gif?) >> >> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >> prohibit sharing of loaders, etc, in the way Brett mentions. > > > By specifying the package anchor point I don't think it does. > > >> Also, the >> fact that it needs __name__ in there feels wrong - a bit like the old >> version of super() needing to be told which class it was being called >> from. > > > You can't avoid that. This is the entire reason why loader reuse is a > pain; you **have** to specify what to work off of, else its ambiguous and a > specific feature of a specific loader. > > But this is only an issue when you are trying to access a file relative to > the package/module you're in. Otherwise you're going to be specifying a > string constant like 'foo.bar'. > > >> But in principle I don't object to finding a suitable form of >> this. >> >> And I like the name get_bytes - much more explicit in these Python 3 >> days of explicit str/bytes distinctions :-) > > > One unfortunate side-effect from having a new method to return bytes from > a data file is that it makes get_data() somewhat redundant. If we make it > get_data_filename(package_name, path) then it can return an absolute path > which can then be passed to get_data() to read the actual bytes. 
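(Roughly, that composition, with get_data_filename treated as a hypothetical loader method:)

    import importlib

    def read_data(package, path):
        # The proposed get_data_filename() turns a package anchor plus a
        # relative name into an absolute name; the existing PEP 302
        # get_data() then reads the bytes stored behind that name.
        loader = importlib.import_module(package).__loader__
        full_name = loader.get_data_filename(package, path)  # proposed method
        return loader.get_data(full_name)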
If we > create importlib.resources as Donald has suggested then all of this can be > hidden behind a function and users don't have to care about any of this, > e.g. importlib.resources.read_data(module_anchor, path). > > > I think we actually have to go the other way, because only some Loaders > will be able to actually return a filename (returning a filename is > basically an optimization to prevent needing to call get_data and write > that out to a temporary directory) but pretty much any loader should > theoretically be able to support get_data. > Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary. > > I think it is redundant but given that it?s a new API (passing module and > a ?resource path?) I think it makes sense. The old get_data API can be > deprecated but left in for compatibility reasons if we want (sort of like > Loader().load_module() -> Loader().exec_module()). > If we do that then there would have to be a way to specify how to read the bytes for the module code itself since get_data() is used in the implementation of import by coupling it with get_filename() (which is why I'm trying not have to drop get_filename()/get_data() and instead come up with some new approach to reading bytes since the current approach is very composable). So get_bytes() would need a way to signal that you don't want some data file but the bytes for the module. Maybe if the path section is unspecified then that's a signal that the module's bytes is wanted and not some data file? > > > One thing to consider is do we want to allow anything other than filenames > for the path part? Thanks to namespace packages every directory is > essentially a package, so we could say that the package anchor has to > encapsulate the directory and the path bit can only be a filename. That > gets us even farther away from having the concept of file paths being > manipulated in relation to import-related APIs. > > > I think we do want to allow directories, it?s not unusual to have > something like: > > warehouse > ??? __init__.py > ??? templates > ? ??? accounts > ? ? ??? profile.html > ? ??? hello.html > ??? utils > ? ??? mapper.py > ??? wsgi.py > > Conceptually templates isn?t a package (even though with namespace > packages it kinda is) and I?d want to load profile.html by doing something > like: > > importlib.resources.get_bytes(?warehouse?, > ?templates/accounts/profile.html?) > Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) > > In pkg_resources the second argument to that function is a ?resource path? > which is defined as a relative to the given module/package and it must use > / to denote them. It explicitly says it?s not a file system path but a > resource path. It may translate to a file system path (as is the case with > the FileLoader) but it also may not (as is the case with a theoretical > S3Loader or PostgreSQLLoader). > Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. > How you turn a warehouse + a resource path into some data (or whatever > other function we support) is an implementation detail of the Loader. > > > And just so I don't forget it, I keep wanting to pass an actual module in > so the code can extract the name that way, but that prevents the __name__ > trick as you would have to import yourself or grab the module from > sys.modules. 
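(One hedged way to reconcile the two: let the top-level helper accept either a module object or a name such as __name__, and resolve strings itself -- a sketch only, with get_bytes standing in for whatever the loader method ends up being called:)

    import importlib
    import sys

    def get_bytes(module_or_name, resource):
        # Accept a module object or a dotted name like __name__; callers
        # never have to import themselves or dig through sys.modules.
        if isinstance(module_or_name, str):
            module = sys.modules.get(module_or_name)
            if module is None:
                module = importlib.import_module(module_or_name)
        else:
            module = module_or_name
        return module.__loader__.get_bytes(module.__name__, resource)  # proposed loader method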
> > > Is an actual module what gets passed into Loader().exec_module()? > Yes. > If so I think it?s fine to pass that into the new Loader() functions and a > new top level API in importlib.resources can do the things needed to turn a > string into a module object. So instead of doing > __loader__.get_bytes(__name__, ?logo.gif?) you?d do > importlib.resources.get_bytes(__name__, ?logo.gif?). > If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues. And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 18:28:04 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 12:28:04 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: > On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft > wrote: >> On Jan 31, 2015, at 11:31 AM, Brett Cannon > wrote: >> >> >> >> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore > wrote: >> On 31 January 2015 at 15:47, Donald Stufft > wrote: >> >> It's certainly possible to add a new API that loads resources based on >> >> a relative name, but you'd have to specify relative to *what*. >> >> get_data explicitly ducks out of making that decision. >> > >> > data = __loader__.get_bytes(__name__, ?logo.gif?) >> >> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >> prohibit sharing of loaders, etc, in the way Brett mentions. >> >> By specifying the package anchor point I don't think it does. >> >> Also, the >> fact that it needs __name__ in there feels wrong - a bit like the old >> version of super() needing to be told which class it was being called >> from. >> >> You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. >> >> But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. >> >> But in principle I don't object to finding a suitable form of >> this. >> >> And I like the name get_bytes - much more explicit in these Python 3 >> days of explicit str/bytes distinctions :-) >> >> One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). > > I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data. 
> > Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary. Because the idea behind get_data_filename() is that it returns a path that can be used regularly by APIs that expect to be handed a file on the file system. Simulating a path with an opaque string isn?t good enough because, for example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. The idea here is that _if_ a regular file system path is available for a particular resource file then Loader().get_data_filename() would return it, otherwise it?d return None (or not exist at all). This means that pkgutil.get_data_filename (or importlib.resources.get_filename) can attempt to call Loader().get_data_filename() and just return that path if one exists on the file system already, and if it doesn?t then it can create a temporary file and call Loader.get_data() and write the data to that temporary file and return the path to that. > > > I think it is redundant but given that it?s a new API (passing module and a ?resource path?) I think it makes sense. The old get_data API can be deprecated but left in for compatibility reasons if we want (sort of like Loader().load_module() -> Loader().exec_module()). > > If we do that then there would have to be a way to specify how to read the bytes for the module code itself since get_data() is used in the implementation of import by coupling it with get_filename() (which is why I'm trying not have to drop get_filename()/get_data() and instead come up with some new approach to reading bytes since the current approach is very composable). So get_bytes() would need a way to signal that you don't want some data file but the bytes for the module. Maybe if the path section is unspecified then that's a signal that the module's bytes is wanted and not some data file? Perhaps trying to read modules and resource files with the same method is the wrong approach? Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 This means that we?re not talking about ?data? files, but ?resource? files. This also removes the idea that you can call Loader.set_data() on those files (like i?ve seen in the implementation). > > >> >> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. > > I think we do want to allow directories, it?s not unusual to have something like: > > warehouse > ??? __init__.py > ??? templates > ? ??? accounts > ? ? ??? profile.html > ? ??? hello.html > ??? utils > ? ??? mapper.py > ??? wsgi.py > > Conceptually templates isn?t a package (even though with namespace packages it kinda is) and I?d want to load profile.html by doing something like: > > importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) > > Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) > > > In pkg_resources the second argument to that function is a ?resource path? which is defined as a relative to the given module/package and it must use / to denote them. It explicitly says it?s not a file system path but a resource path. 
It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). > > Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. > > How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. > >> >> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. > > Is an actual module what gets passed into Loader().exec_module()? > > Yes. > > If so I think it?s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, ?logo.gif?) you?d do importlib.resources.get_bytes(__name__, ?logo.gif?). > > If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues. > > And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Sat Jan 31 18:31:46 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 12:31:46 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: <20150131123146.6ad3f1a6@marathon> On Jan 30, 2015, at 06:37 PM, Donald Stufft wrote: >I think it would be a good idea to implement a pkgutil.get_data_filename >function which would return a filename that can be accessed to get at that >particular bit of package data. +1 Of the pkg_resource methods that I use all the time, resource_string() (which in Python 3 should really be called resource_bytes()) and resource_filename() are the overwhelming favorites. I do occasionally use resource_stream() and even more rarely, resource_listdir(). Given that pkgutil.get_data() is essentially resource_bytes(), adopting (and improving) equivalents for resource_filename() and resource_stream() would be really nice. >I have a few concerns however, currently Loader.get_data() requires you to >pass the entire path of the file you want to open (like >/usr/lib/python3.5/site-packages/foo/bar.txt or /data/foo.zip/bar.txt) >however I've made Loader.get_data_filename() want a relative path (like >bar.txt). > >I wonder if this difference is OK? Depends on who you ask :). Clearly, most users should never be confronted with the difference. The APIs they should use are the pkgutil ones and there, everything's relative to a package namespace path, which is (well, modulo perhaps some PEP 420 corners) unambiguous. I don't particularly like the "feature" of get_data() allowing resources paths with / in the name. I'd much rather the resource either be a dotted module path, or just not allowing subpaths. The difference is a requirement in the layout of the package, e.g. 
pkgutil.get_data('my.package.path', 'subpath/foo.dat') pkgutil.get_data('my.package.path.subpath', 'foo.dat') The latter requires that 'subpath' be a subpackage while the former does not. Personally, that seems like a fine restriction to me, but that's how I always lay out my in-package data anyway. Loader implementers OTOH, do care, but there's a lot fewer of them than users. >If not I wonder if we can make Loader.get_data accept a relative path as >well. I think this is a generally more useful way of using the function >because it doesn't restrict loaders to file system only (which get_data >currently is restricted to I believe) and it lets the Loader encaspulate the >logic about how to translate a relative path to a chunk of data instead of >needing the caller to do that. +1 >My other problem is that pkgutil.get_data doesn't currently work for the PEP >420 namespace packages and due to the above I'm not sure how to actually make >it work in a reasonable way without allowing get_data to accept relative >paths as well. Well, with the restriction on resource subpaths above, there's no problem, right? pkgutil.get_data('my.package.path.subpath', 'foo.dat') Assuming subpath is contained within a namespace portion, it should be unambiguous where it comes from. pkgutil.get_data('my.package.path', 'foo.dat') If 'my.package.path' is a namespace package then there *isn't* any portion containing foo.dat, so this should return None because the namespace loader won't have get_data() implemented on it. I understand that imposing this restriction is a backward compatibility break, so it may not be adoptable. There are ways to get around that (add a flag to the API, implement a new pkgutil API with the restriction and deprecate .get_data(), etc.). However, for PEP 420 packages, you could impose this restriction in .get_data() without the backward compatibility problem. And certainly in any new APIs, e.g. .get_package_filename() a.k.a. resource_filename() you can do impose this restriction. I also think resource_stream() should be implemented as well, but maybe it should be called `pkgutil.open(package, resource, mode, encoding)` ? I can live without resource_listdir(). Cheers, -Barry From barry at python.org Sat Jan 31 18:40:04 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 12:40:04 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <20150131124004.2c373863@marathon> On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: >resource_exists(package_or_requirement, resource_name) > Does the named resource exist? Return True or False accordingly. +1 >resource_stream(package_or_requirement, resource_name) > Return a readable file-like object for the specified resource; it may be > an actual file, a StringIO, or some similar object. The stream is in > ?binary mode?, in the sense that whatever bytes are in the resource will > be read as-is. See my previous follow up. I'd much rather have an open()-like API so I don't have to do the subsequent decoding. >resource_string(package_or_requirement, resource_name) > Return the specified resource as a string. The resource is read in binary > fashion, such that the returned string contains exactly the bytes that are > stored in the resource. Right, so resource_string() is the wrong name . 
In my Python 3 code I always do: from pkg_resources import resource_string as resource_bytes so at least the call sites more accurately reflect reality. :) >resource_isdir(package_or_requirement, resource_name) > Is the named resource a directory? Return True or False accordingly. > >resource_listdir(package_or_requirement, resource_name) > List the contents of the named resource directory, just like os.listdir > except that it works even if the resource is in a zipfile. I've used these, but rarely, so I don't care too much. >resource_filename(package_or_requirement, resource_name) [...] >Obviously the similar functions here are: > >* pkgutil.get_data is pkg_resources.resource_string >* pkgutil.get_data_filename is pkg_resources.resource_filename > >The major difference being that pkg_resource.resource_filename will extract >to a cache directory (controllable with an environment variable or >programatically) and won't clean up the extracted files. This means that they >are (by default) extracted once per user and reused between extractions. I >felt like it made more sense to just extract to a temporary location (even >though this is less performant) in the stdlib. Extracting to a temporary location is fine. These generally aren't performance critical sections (e.g. I use them predominately in tests) and if they are then I'd rather let the user define the caching policy. >That leaves: > >* resource_exists >* resource_stream >* resource_isdir >* resource_listdir > >Which can be done via pkg_resources but not via the standard library, I don't >have a major opinion on whether or not the standard library should do all of >them but I don't think it would hurt if it did. resource_stream() is useful, but see my previous response on that. >Another interesting question if we're going to add more methods is where they >should all live. As far as I know pkgutil.get_data predates the importlib >module. Perhaps deprecating pkgutil.get_data and adding a importlib.resources >module which supports functions like: > >* get_bytes(package, resource) >* get_stream(package, resource) >* get_filename(package, resource) >* exists(package, resource) >* isdir(package, resource) >* listdir(package, resource) Modulo bikeshedding on the names of the functions, importlib.resources seems like a nice place for it. >Changing the names (particular get_data -> get_bytes) could also provide the >mechanism for allowing relative files and deprecating the "you must pass in >a full file path to the Loader()" behavior since the get_data method could be >left alone and a new get_bytes method could be added. +1, but see also my previous suggestion about path restrictions. Cheers, -Barry From barry at python.org Sat Jan 31 18:44:51 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 12:44:51 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <20150131124451.582b5dc3@marathon> On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: >> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >> Related question - how would the temp files be cleaned up? At exit? > >My patch registers an atexit handler that cleans up the temporary files yea. Why not implement it as a context manager? I'm not a big fan of overloading the atexit handler because there are situations where it might not get called (e.g. 
the program crashes or is kill -9'd), but a context manager allows the resource to be cleaned up asap. Reviewing my own uses of pkg_resources.resource_filename() I think it would work just fine because I rarely need the path much longer than the immediate operation. If I did need to cache it more permanently, I could easily do: with resource_filename('my.package.path', 'foo.dat') as path: shutil.copy(path, some_more_permanent_location) Easy peasy. Cheers, -Barry From pje at telecommunity.com Sat Jan 31 19:00:02 2015 From: pje at telecommunity.com (PJ Eby) Date: Sat, 31 Jan 2015 13:00:02 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat, Jan 31, 2015 at 11:13 AM, Donald Stufft wrote: > To be clear, I think using __name__ is massively better than using __file__, > for one even though PEP 302 states that __file__ must be set, it actually > doesn?t have to be set and PEP 420 doesn?t set it. Even if it did set it > that pattern is only actually really usable for non namespace packages (of > any type). Indeed, pkg_resources does not support resource access from namespace packages, only from specific modules or non-namespace packages contained in a namespace package. In the face of ambiguity, the implementation should refuse to guess. Disallowing namespace-relative access avoids the possibility of ambiguity, and it's essentially a non-issue anyway since there's no real use case for "find me whichever copy of this file got installed first or got listed first on sys.path". From barry at python.org Sat Jan 31 19:03:21 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:03:21 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: <20150131130321.31654688@marathon> On Jan 31, 2015, at 02:48 PM, Brett Cannon wrote: >> Sounds reasonable. It's a relatively rare, but useful use case. One >> possible issue, though, would people assume that if they get a >> filename it'd be writeable? For the filesystem loader it would be, but >> that would break subtly (writes work but would get discarded) for >> loaders that don't have a native get_data_filename. > >I don?t think you can assume it?s writeable since that?ll break in a lot >of common cases even with the filesystem loader since often times things >in the filesystem will be installed in the system and users won?t have >permissions to write to them anyways. That's okay. Just let the normal exceptions percolate up. But I do agree that at least in my own use cases, these are almost entirely read operations, so I'm okay with enforcing that. I think a user could pretty easily implement writable APIs on top if needed. Cheers, -Barry From pje at telecommunity.com Sat Jan 31 19:05:10 2015 From: pje at telecommunity.com (PJ Eby) Date: Sat, 31 Jan 2015 13:05:10 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131124451.582b5dc3@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: On Sat, Jan 31, 2015 at 12:44 PM, Barry Warsaw wrote: > On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: > >>> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >>> Related question - how would the temp files be cleaned up? At exit? 
>> >>My patch registers an atexit handler that cleans up the temporary files yea. > > Why not implement it as a context manager? Note that neither approach will work for one common use of extracted files: extension modules and shared libraries on Windows. Unlike *nixy operating systems, you can't delete an open file on Windows, and loaded .DLLs are open files IIUC. Unless you've got some way to unload the .pyd or .dll files, you won't be able to do a complete cleanup in that case. (This use case is actually why I took the caching approach rather than the tempfile approach in the first place.) From barry at python.org Sat Jan 31 19:06:13 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:06:13 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <20150131130613.6c4ae34e@marathon> On Jan 31, 2015, at 03:54 PM, Paul Moore wrote: >And I like the name get_bytes - much more explicit in these Python 3 >days of explicit str/bytes distinctions :-) +1 -Barry From donald at stufft.io Sat Jan 31 19:07:29 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 13:07:29 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131124451.582b5dc3@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: > On Jan 31, 2015, at 12:44 PM, Barry Warsaw wrote: > > On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: > >>> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >>> Related question - how would the temp files be cleaned up? At exit? >> >> My patch registers an atexit handler that cleans up the temporary files yea. > > Why not implement it as a context manager? > > I'm not a big fan of overloading the atexit handler because there are > situations where it might not get called (e.g. the program crashes or is kill > -9'd), but a context manager allows the resource to be cleaned up asap. > > Reviewing my own uses of pkg_resources.resource_filename() I think it would > work just fine because I rarely need the path much longer than the immediate > operation. If I did need to cache it more permanently, I could easily do: > > with resource_filename('my.package.path', 'foo.dat') as path: > shutil.copy(path, some_more_permanent_location) > > Easy peasy. The reasons for not wanting to use a context manager are sort of intertwined with each other. The competitor to this function is something like: import os.path import time LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") def print_logo_path(): print(LOGO_PATH) while True: print_logo_path() time.sleep(1) So when looking at an alternative that we want people to use we have to consider the cost of porting to that code from the old way. 
Using an atexit handler means that the above code can be switched to the new mechanism just by chaning a single line: LOGO_PATH = importlib.resources.get_filename(__name__, "logo.gif") Using a context manager would require something like: LOGO_MAKER = lambda: importlib.resources.get_filename(__name__, "logo.gif") def print_logo_path(): with LOGO_MAKER as filename: print(filename) Or: _LOGO_TMP = importlib.resources.get_filename(__name__, "logo.gif") atexit.register(_LOGO_TMP.cleanup) LOGO_PATH = _LOGO_TMP.name It makes it more akward to use anytime you need to use the file in multiple locations or multiple times and since each context manager instance (in the worst case) is going to need to get bytes, create a temp file, and write bytes for each use of the context manager. The other thing is that for the "common" case, where the resource is available on the file system already because we're just using a FileLoader, there is no need for an atexit handler or a temporary file at all. The context manager would only really exist for the uncommon case where we need to write the data to a temporary file. Using the atexit handler allows us to provide the best API for the common case, without too much problem for the uncommon case. Yes it does mean that in certain cases the temporary files may be left behind, particularly with kill -9 or segfaults or what have you. However that case already exists, the only thing the context manager does is narrow the window of case where a kill -9 or a segfault can leave temporary files behind. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From barry at python.org Sat Jan 31 19:08:42 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:08:42 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <20150131130842.1c218f20@marathon> On Jan 31, 2015, at 04:31 PM, Brett Cannon wrote: >One thing to consider is do we want to allow anything other than filenames >for the path part? IMHO, no. See my previous responses. Cheers, -Barry From donald at stufft.io Sat Jan 31 19:09:03 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 13:09:03 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: > On Jan 31, 2015, at 1:05 PM, PJ Eby wrote: > > On Sat, Jan 31, 2015 at 12:44 PM, Barry Warsaw wrote: >> On Jan 30, 2015, at 07:52 PM, Donald Stufft wrote: >> >>>> On Jan 30, 2015, at 7:18 PM, Paul Moore wrote: >>>> Related question - how would the temp files be cleaned up? At exit? >>> >>> My patch registers an atexit handler that cleans up the temporary files yea. >> >> Why not implement it as a context manager? > > Note that neither approach will work for one common use of extracted > files: extension modules and shared libraries on Windows. Unlike > *nixy operating systems, you can't delete an open file on Windows, and > loaded .DLLs are open files IIUC. Unless you've got some way to > unload the .pyd or .dll files, you won't be able to do a complete > cleanup in that case. (This use case is actually why I took the > caching approach rather than the tempfile approach in the first > place.) I don?t think it?s important for this API to support extracting extension modules. 
If we want to support importing extension modules from inside of a zip file (or similar) I think that should get it?s own support inside the loader and not rely on the resource extraction for that. IOW I think that these should primarily exist for data files. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From barry at python.org Sat Jan 31 19:11:42 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:11:42 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <20150131131142.27f7c682@marathon> On Jan 31, 2015, at 11:43 AM, Donald Stufft wrote: >I think we do want to allow directories, it?s not unusual to have something >like: > >warehouse >??? __init__.py >??? templates >? ??? accounts >? ? ??? profile.html >? ??? hello.html >??? utils >? ??? mapper.py >??? wsgi.py > >Conceptually templates isn?t a package (even though with namespace packages >it kinda is) and I?d want to load profile.html by doing something like: > >importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) I understand there's a conceptual wart, but I have no problem dropping an empty __init__.py file in those subdirectories and then using: importlib.resources.get_bytes('warehouse.templates.accounts', 'profile.html') And given how much easier it makes life from an implementation and description standpoint, I think it's a fine compromise. Cheers, -Barry From pje at telecommunity.com Sat Jan 31 18:54:01 2015 From: pje at telecommunity.com (PJ Eby) Date: Sat, 31 Jan 2015 12:54:01 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: On Sat, Jan 31, 2015 at 10:38 AM, Paul Moore wrote: > At the moment, pkg_resources fills in the gap, but that's not > integrated with the loader system. Actually, it is. There's basically a generic function that adapts loaders to "resource providers". In a trivial case, a loader can simply implement the resource provider interface directly, and register 'lambda self: self' as the adapter function. I suggest taking a look at the IResourceProvider class, and seeing whether you want to change anything in how the interface or implementation work. You could in fact create ABCs based on the pkg_resources implementation. The real question is whether there are any lessons to be learned from pkg_resources' usage history. I think the idea of temp files may be a good one, though there will still be no real cleanup possible in the case of e.g. C extensions. You'll have to rely on whatever system facility exists for temporary file cleanup. With historical hindsight, I'd say that I should've made it temp by default, with the option to set a persistent cache, because a common complaint is that processes running as special users often can't write to their home directory (e.g. web servers running as "nobody"). Apart from that, the implementations in pkg_resources can mostly be pulled for reuse, as well as the interfaces, and I'd suggest doing exactly that. There are a lot of non-obvious gotchas dealing with zipfiles, and the implementation is fairly battle-hardened at this point. 
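(To make the extraction behaviour being discussed concrete, a hedged sketch of the temp-file fallback: get_data_filename is the proposed optional loader method, the atexit cleanup mirrors what Donald's patch is described as doing, and none of this is an existing stdlib API:)

    import atexit
    import importlib
    import os
    import pkgutil
    import tempfile

    def get_filename(package, resource):
        # If the loader can already point at a real file, return that path;
        # the common on-disk FileLoader case then costs nothing extra.
        loader = importlib.import_module(package).__loader__
        get_path = getattr(loader, 'get_data_filename', None)  # proposed optional method
        if get_path is not None:
            filename = get_path(package, resource)
            if filename is not None:
                return filename
        # Otherwise read the bytes and extract them to a temporary file that
        # is removed when the interpreter exits.
        data = pkgutil.get_data(package, resource)
        if data is None:
            return None
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        atexit.register(os.remove, path)
        return path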
From donald at stufft.io Sat Jan 31 19:18:05 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 13:18:05 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131131142.27f7c682@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <20150131131142.27f7c682@marathon> Message-ID: <0BF27A19-2C53-4994-8455-CD19D9A05E5E@stufft.io> > On Jan 31, 2015, at 1:11 PM, Barry Warsaw wrote: > > On Jan 31, 2015, at 11:43 AM, Donald Stufft wrote: > >> I think we do want to allow directories, it?s not unusual to have something >> like: >> >> warehouse >> ??? __init__.py >> ??? templates >> ? ??? accounts >> ? ? ??? profile.html >> ? ??? hello.html >> ??? utils >> ? ??? mapper.py >> ??? wsgi.py >> >> Conceptually templates isn?t a package (even though with namespace packages >> it kinda is) and I?d want to load profile.html by doing something like: >> >> importlib.resources.get_bytes(?warehouse?, ?templates/accounts/profile.html?) > > I understand there's a conceptual wart, but I have no problem dropping an > empty __init__.py file in those subdirectories and then using: > > importlib.resources.get_bytes('warehouse.templates.accounts', 'profile.html') > > And given how much easier it makes life from an implementation and description > standpoint, I think it's a fine compromise. I think it actually makes things *harder* from an implementation and description standpoint. You?re thinking in terms of implementation for the FileLoader, but say for a PostgreSQLLoader now I have to create mock packages for warehouse.templates and warehouse.templates.accounts whereas if we treat the resource path not as a file path, but as a key for an object store where ?/? is slightly special then my PostgreSQL loader only need to have a ?warehouse? package, and then a table that essentially does something like: package | resource key | data -------------------------------------------------- warehouse | templates/accounts/profile.html | ? In the FileLoader we?d obviously treat the / as path separators and create directory entries, but in reality it?s just a key: value store. I already implemented one of these functions in a way that allows the / separator and I would have had to have gone out of my way to disallow it rather than allow it. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From barry at python.org Sat Jan 31 19:29:33 2015 From: barry at python.org (Barry Warsaw) Date: Sat, 31 Jan 2015 13:29:33 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: <20150131132933.28d485f5@marathon> On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: >The reasons for not wanting to use a context manager are sort of intertwined >with each other. > >The competitor to this function is something like: > > import os.path > import time > > LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") > > def print_logo_path(): > print(LOGO_PATH) > > > while True: > print_logo_path() > time.sleep(1) I'm just wondering if that's extracted from a real example or whether it's just a possible use case you'd want to support. It's not a use case I've ever needed. I reviewed a bunch of resource_filename() uses and in almost all cases it's 1. Crafting a path-y thing for some other API that only takes paths. 2. 
Constructing a path for essentially shutil.copy()'ing the file somewhere else (e.g. a test http server's file vending directory). There are one or two where it might be inconvenient to use a context manager, but the majority of cases would be fine. >It makes it more akward to use anytime you need to use the file in multiple >locations or multiple times and since each context manager instance (in the >worst case) is going to need to get bytes, create a temp file, and write bytes >for each use of the context manager. Perhaps it makes sense to either provide two APIs and/or implement a higher level API on top of a lower-level one? >The other thing is that for the "common" case, where the resource is available >on the file system already because we're just using a FileLoader, there is no >need for an atexit handler or a temporary file at all. The context manager >would only really exist for the uncommon case where we need to write the data >to a temporary file. Using the atexit handler allows us to provide the best >API for the common case, without too much problem for the uncommon case. A context manager could also conditionalize the delete just like your proposal conditionalizes adding to the atexit handler. >Yes it does mean that in certain cases the temporary files may be left behind, >particularly with kill -9 or segfaults or what have you. However that case >already exists, the only thing the context manager does is narrow the window >of case where a kill -9 or a segfault can leave temporary files behind. Sure, but it reduces the window for leakage, which will probably be enough. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From donald at stufft.io Sat Jan 31 20:42:06 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 14:42:06 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150131132933.28d485f5@marathon> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> <20150131132933.28d485f5@marathon> Message-ID: <59512300-F929-4A51-A8DF-0E829F64D206@stufft.io> > On Jan 31, 2015, at 1:29 PM, Barry Warsaw wrote: > > On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: > >> The reasons for not wanting to use a context manager are sort of intertwined >> with each other. >> >> The competitor to this function is something like: >> >> import os.path >> import time >> >> LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") >> >> def print_logo_path(): >> print(LOGO_PATH) >> >> >> while True: >> print_logo_path() >> time.sleep(1) > > I'm just wondering if that's extracted from a real example or whether it's > just a possible use case you'd want to support. It's not a use case I've ever > needed. > > I reviewed a bunch of resource_filename() uses and in almost all cases it's > > 1. Crafting a path-y thing for some other API that only takes paths. > 2. Constructing a path for essentially shutil.copy()'ing the file somewhere > else (e.g. a test http server's file vending directory). > > There are one or two where it might be inconvenient to use a context manager, > but the majority of cases would be fine. Yea, requests/certifi which doesn?t currently use resource_filename at all but just constructs the path to the .pem file using __file__. Also ensurepip and virtualenv (both the existing and the rewrite). 
Almost every case where I had to access a resource file I end up needing to do it multiple places and it was easier to just construct the path once and reuse it. > >> It makes it more akward to use anytime you need to use the file in multiple >> locations or multiple times and since each context manager instance (in the >> worst case) is going to need to get bytes, create a temp file, and write bytes >> for each use of the context manager. > > Perhaps it makes sense to either provide two APIs and/or implement a higher > level API on top of a lower-level one? i thought about doing it this way too, I didn?t just because I couldn?t really imagine anyone really using the context manager when a simpler API was available and I thought that having one way to do it was better. However I?m perfectly happy to have two APIs if people think it?s important. > >> The other thing is that for the "common" case, where the resource is available >> on the file system already because we're just using a FileLoader, there is no >> need for an atexit handler or a temporary file at all. The context manager >> would only really exist for the uncommon case where we need to write the data >> to a temporary file. Using the atexit handler allows us to provide the best >> API for the common case, without too much problem for the uncommon case. > > A context manager could also conditionalize the delete just like your proposal > conditionalizes adding to the atexit handler. > >> Yes it does mean that in certain cases the temporary files may be left behind, >> particularly with kill -9 or segfaults or what have you. However that case >> already exists, the only thing the context manager does is narrow the window >> of case where a kill -9 or a segfault can leave temporary files behind. > > Sure, but it reduces the window for leakage, which will probably be enough. > > Cheers, > -Barry --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From brett at python.org Sat Jan 31 22:22:42 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 21:22:42 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft wrote: > On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote: > >> On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: >> >> >> >> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote: >> >>> On 31 January 2015 at 15:47, Donald Stufft wrote: >>> >> It's certainly possible to add a new API that loads resources based on >>> >> a relative name, but you'd have to specify relative to *what*. >>> >> get_data explicitly ducks out of making that decision. >>> > >>> > data = __loader__.get_bytes(__name__, ?logo.gif?) >>> >>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >>> prohibit sharing of loaders, etc, in the way Brett mentions. >> >> >> By specifying the package anchor point I don't think it does. >> >> >>> Also, the >>> fact that it needs __name__ in there feels wrong - a bit like the old >>> version of super() needing to be told which class it was being called >>> from. >> >> >> You can't avoid that. This is the entire reason why loader reuse is a >> pain; you **have** to specify what to work off of, else its ambiguous and a >> specific feature of a specific loader. 
>> >> But this is only an issue when you are trying to access a file relative >> to the package/module you're in. Otherwise you're going to be specifying a >> string constant like 'foo.bar'. >> >> >>> But in principle I don't object to finding a suitable form of >>> this. >>> >>> And I like the name get_bytes - much more explicit in these Python 3 >>> days of explicit str/bytes distinctions :-) >> >> >> One unfortunate side-effect from having a new method to return bytes from >> a data file is that it makes get_data() somewhat redundant. If we make it >> get_data_filename(package_name, path) then it can return an absolute path >> which can then be passed to get_data() to read the actual bytes. If we >> create importlib.resources as Donald has suggested then all of this can be >> hidden behind a function and users don't have to care about any of this, >> e.g. importlib.resources.read_data(module_anchor, path). >> >> >> I think we actually have to go the other way, because only some Loaders >> will be able to actually return a filename (returning a filename is >> basically an optimization to prevent needing to call get_data and write >> that out to a temporary directory) but pretty much any loader should >> theoretically be able to support get_data. >> > > Why can only some loaders return a filename? As I have said, loaders can > return an opaque string to simulate a path if necessary. > > > Because the idea behind get_data_filename() is that it returns a path that > can be used regularly by APIs that expect to be handed a file on the file > system. > In my head that expectation is not placed on the method. > Simulating a path with an opaque string isn?t good enough because, for > example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. > The idea here is that _if_ a regular file system path is available for a > particular resource file then Loader().get_data_filename() would return it, > otherwise it?d return None (or not exist at all). > > This means that pkgutil.get_data_filename (or > importlib.resources.get_filename) can attempt to call > Loader().get_data_filename() and just return that path if one exists on the > file system already, and if it doesn?t then it can create a temporary file > and call Loader.get_data() and write the data to that temporary file and > return the path to that. > See I'm not even attempting to guarantee there is any API that will return a reasonable file system path as the import API makes no such guarantees. If an API like OpenSSL requires a file on the filesystem then you will have to write to a temporary file and that's just life. That's the same as if everything was stored in a zip file anyway. > > > >> >> I think it is redundant but given that it?s a new API (passing module and >> a ?resource path?) I think it makes sense. The old get_data API can be >> deprecated but left in for compatibility reasons if we want (sort of like >> Loader().load_module() -> Loader().exec_module()). >> > > If we do that then there would have to be a way to specify how to read the > bytes for the module code itself since get_data() is used in the > implementation of import by coupling it with get_filename() (which is why > I'm trying not have to drop get_filename()/get_data() and instead come up > with some new approach to reading bytes since the current approach is very > composable). So get_bytes() would need a way to signal that you don't want > some data file but the bytes for the module. 
Maybe if the path section is > unspecified then that's a signal that the module's bytes is wanted and not > some data file? > > > Perhaps trying to read modules and resource files with the same method is > the wrong approach? > If we are going to do that then we might as well deprecate all the methods that try to expose reading data and paths as the PEP 302 APIs tried to expose it uniformly. > > Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 > That seems like a bit much, e.g. why do you need bytes **and** a file-like object() when you get the former from the latter? And why do you need the path argument when you can get the path off the file-like object if it's an actual file object? -Brett > > This means that we're not talking about 'data' files, but 'resource' > files. This also removes the idea that you can call Loader.set_data() on > those files (like I've seen in the implementation). > > >> >> >> One thing to consider is do we want to allow anything other than >> filenames for the path part? Thanks to namespace packages every directory >> is essentially a package, so we could say that the package anchor has to >> encapsulate the directory and the path bit can only be a filename. That >> gets us even farther away from having the concept of file paths being >> manipulated in relation to import-related APIs. >> >> >> I think we do want to allow directories, it's not unusual to have >> something like:
>>
>> warehouse
>> ├── __init__.py
>> ├── templates
>> │   ├── accounts
>> │   │   └── profile.html
>> │   └── hello.html
>> ├── utils
>> │   └── mapper.py
>> └── wsgi.py
>>
>> Conceptually templates isn't a package (even though with namespace >> packages it kinda is) and I'd want to load profile.html by doing something >> like: >> >> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >> > > Where I would be fine with get_bytes('warehouse.templates.accounts', > 'profile.html') =) > > >> >> In pkg_resources the second argument to that function is a 'resource >> path' which is defined as relative to the given module/package and it >> must use / to denote them. It explicitly says it's not a file system path >> but a resource path. It may translate to a file system path (as is the case >> with the FileLoader) but it also may not (as is the case with a theoretical >> S3Loader or PostgreSQLLoader). >> > > Yep, which is why I'm making sure if we have paths we minimize them as > they instantly make these alternative loader concepts a bigger pain to > implement. > > >> How you turn a warehouse + a resource path into some data (or whatever >> other function we support) is an implementation detail of the Loader. >> >> >> And just so I don't forget it, I keep wanting to pass an actual module in >> so the code can extract the name that way, but that prevents the __name__ >> trick as you would have to import yourself or grab the module from >> sys.modules. >> >> >> Is an actual module what gets passed into Loader().exec_module()? >> > > Yes. > > >> If so I think it's fine to pass that into the new Loader() functions and >> a new top level API in importlib.resources can do the things needed to turn >> a string into a module object. So instead of doing >> __loader__.get_bytes(__name__, 'logo.gif') you'd do >> importlib.resources.get_bytes(__name__, 'logo.gif'). 
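(As a point of comparison, the get_bytes(package, resource) convention being discussed can already be approximated on top of pkgutil.get_data; the helper below is only a sketch, since importlib.resources does not exist yet and its exact signature is what is being debated here.)

    import pkgutil

    def get_bytes(package, resource):
        # "resource" is a /-separated resource path relative to the package,
        # following the pkg_resources convention mentioned above.
        data = pkgutil.get_data(package, resource)
        if data is None:
            raise FileNotFoundError(resource)
        return data

    # e.g. get_bytes("warehouse", "templates/accounts/profile.html")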
>> > > If we go the route of importlib.resources then that seems like a > reasonable idea, although we will need to think through the ramifications > to exec_module() itself although I don't think there were be any issues. > > And if we do go with importlib.resources I will probably want to make it > available on PyPI with appropriate imp/pkgutil fallbacks to help people > transitioning from Python 2 to 3. > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 22:25:03 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 21:25:03 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> <20150131132933.28d485f5@marathon> Message-ID: On Sat Jan 31 2015 at 1:29:43 PM Barry Warsaw wrote: > On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: > > >The reasons for not wanting to use a context manager are sort of > intertwined > >with each other. > > > >The competitor to this function is something like: > > > > import os.path > > import time > > > > LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") > > > > def print_logo_path(): > > print(LOGO_PATH) > > > > > > while True: > > print_logo_path() > > time.sleep(1) > > I'm just wondering if that's extracted from a real example or whether it's > just a possible use case you'd want to support. It's not a use case I've > ever > needed. > > I reviewed a bunch of resource_filename() uses and in almost all cases it's > > 1. Crafting a path-y thing for some other API that only takes paths. > 2. Constructing a path for essentially shutil.copy()'ing the file somewhere > else (e.g. a test http server's file vending directory). > > There are one or two where it might be inconvenient to use a context > manager, > but the majority of cases would be fine. > > >It makes it more akward to use anytime you need to use the file in > multiple > >locations or multiple times and since each context manager instance (in > the > >worst case) is going to need to get bytes, create a temp file, and write > bytes > >for each use of the context manager. > > Perhaps it makes sense to either provide two APIs and/or implement a higher > level API on top of a lower-level one? > > >The other thing is that for the "common" case, where the resource is > available > >on the file system already because we're just using a FileLoader, there > is no > >need for an atexit handler or a temporary file at all. The context manager > >would only really exist for the uncommon case where we need to write the > data > >to a temporary file. Using the atexit handler allows us to provide the > best > >API for the common case, without too much problem for the uncommon case. > > A context manager could also conditionalize the delete just like your > proposal > conditionalizes adding to the atexit handler. > > >Yes it does mean that in certain cases the temporary files may be left > behind, > >particularly with kill -9 or segfaults or what have you. However that case > >already exists, the only thing the context manager does is narrow the > window > >of case where a kill -9 or a segfault can leave temporary files behind. > > Sure, but it reduces the window for leakage, which will probably be enough. > I'm with Barry not wanting to rely on atexit when a context manager is explicit and will clean up any state as necessary. 
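(To make the trade-off concrete, a minimal sketch of the context-manager style Barry and Brett are arguing for is shown below. The resource_path name and the optional get_data_filename() loader hook are proposals from this thread, not existing APIs, so treat this as illustrative only.)

    import contextlib
    import importlib
    import os
    import pkgutil
    import tempfile

    @contextlib.contextmanager
    def resource_path(package, resource):
        # Common case: the loader can already point at a real file on disk.
        loader = importlib.import_module(package).__loader__
        get_filename = getattr(loader, "get_data_filename", None)
        if get_filename is not None:
            try:
                filename = get_filename(package, resource)
            except FileNotFoundError:
                filename = None
            if filename is not None:
                yield filename
                return
        # Uncommon case: materialize a temporary copy and remove it when the
        # with-block exits, instead of waiting for an atexit handler.
        data = pkgutil.get_data(package, resource)
        if data is None:
            raise FileNotFoundError(resource)
        fd, tmp = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as fp:
                fp.write(data)
            yield tmp
        finally:
            os.remove(tmp)

    # Usage:
    #     with resource_path("mypkg", "cacert.pem") as path:
    #         do_something_with(path)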
-------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 22:43:47 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 16:43:47 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> Message-ID: <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> > On Jan 31, 2015, at 4:22 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft > wrote: >> On Jan 31, 2015, at 12:00 PM, Brett Cannon > wrote: >> >> >> >> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft > wrote: >>> On Jan 31, 2015, at 11:31 AM, Brett Cannon > wrote: >>> >>> >>> >>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore > wrote: >>> On 31 January 2015 at 15:47, Donald Stufft > wrote: >>> >> It's certainly possible to add a new API that loads resources based on >>> >> a relative name, but you'd have to specify relative to *what*. >>> >> get_data explicitly ducks out of making that decision. >>> > >>> > data = __loader__.get_bytes(__name__, ?logo.gif?) >>> >>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >>> prohibit sharing of loaders, etc, in the way Brett mentions. >>> >>> By specifying the package anchor point I don't think it does. >>> >>> Also, the >>> fact that it needs __name__ in there feels wrong - a bit like the old >>> version of super() needing to be told which class it was being called >>> from. >>> >>> You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else its ambiguous and a specific feature of a specific loader. >>> >>> But this is only an issue when you are trying to access a file relative to the package/module you're in. Otherwise you're going to be specifying a string constant like 'foo.bar'. >>> >>> But in principle I don't object to finding a suitable form of >>> this. >>> >>> And I like the name get_bytes - much more explicit in these Python 3 >>> days of explicit str/bytes distinctions :-) >>> >>> One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path). >> >> I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data. >> >> Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary. > > Because the idea behind get_data_filename() is that it returns a path that can be used regularly by APIs that expect to be handed a file on the file system. > > In my head that expectation is not placed on the method. > > Simulating a path with an opaque string isn?t good enough because, for example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. 
The idea here is that _if_ a regular file system path is available for a > particular resource file then Loader().get_data_filename() would return it, > otherwise it'd return None (or not exist at all). > > This means that pkgutil.get_data_filename (or > importlib.resources.get_filename) can attempt to call > Loader().get_data_filename() and just return that path if one exists on the > file system already, and if it doesn't then it can create a temporary file > and call Loader.get_data() and write the data to that temporary file and > return the path to that. > See I'm not even attempting to guarantee there is any API that will return a reasonable file system path as the import API makes no such guarantees. If an API like OpenSSL requires a file on the filesystem then you will have to write to a temporary file and that's just life. That's the same as if everything was stored in a zip file anyway.

The entire *point* of this thread is that sometimes you need a file path that is a valid path to a resource. The naive approach is to just make it do something like:

    # in pkgutil
    def get_data_filename(package, resource):
        data = get_data(package, resource)
        if data is not None:
            with open("/tmp/path", "wb") as fp:
                fp.write(data)
            return "/tmp/path"

However the problem with this is that it imposes a read() into memory and then creating a new file, and then writing that data back to a file even in cases where there is already a file available on the file system. The Loader().get_data_filename() exists for a Loader() to *optionally* say that "We already have a file path for this file, so you can just use this instead of copying to a temporary location".

Then the "optimized" but still naive approach becomes:

    # in pkgutil
    def get_data_filename(package, resource):
        mod = importlib.import_module(package)
        if hasattr(mod.__loader__, "get_data_filename"):
            try:
                filename = mod.__loader__.get_data_filename(package, resource)
            except FileNotFoundError:
                pass
            else:
                if filename is not None:
                    return filename

        data = get_data(package, resource)
        if data is not None:
            with open("/tmp/path", "wb") as fp:
                fp.write(data)
            return "/tmp/path"

This means there's basically no penalty for using this API to access resource files when you're accessing files from a FileLoader.

In my opinion anything that is harder to use than:

    MY_PATH = os.path.join(os.path.dirname(__file__), "my/file.txt")

Is highly unlikely to be used. People can already just write things to a temporary directory using get_data, but the point is they don't because it's a waste of time for the common case and it's easier not to do that. 
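(For the common case described above, a loader whose files already live on disk could satisfy the optional hook with very little code; the class below is an illustrative stand-in, not the actual importlib FileLoader.)

    import os

    class DiskLoaderSketch:
        """Minimal stand-in for a file-system-backed loader."""

        def __init__(self, fullname, path):
            self.name = fullname
            self.path = path  # absolute path of the module's source file

        def get_data_filename(self, package, resource):
            # Resolve the /-separated resource path next to the module and
            # return it as-is: no temporary file, no copying, no atexit.
            candidate = os.path.join(os.path.dirname(self.path),
                                     *resource.split("/"))
            if not os.path.exists(candidate):
                raise FileNotFoundError(candidate)
            return candidate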
> > >> >> I think it is redundant but given that it's a new API (passing module >> and a 'resource path') I think it makes sense. The old get_data API can be >> deprecated but left in for compatibility reasons if we want (sort of like >> Loader().load_module() -> Loader().exec_module()). >> > > If we do that then there would have to be a way to specify how to read the > bytes for the module code itself since get_data() is used in the > implementation of import by coupling it with get_filename() (which is why > I'm trying not to have to drop get_filename()/get_data() and instead come up > with some new approach to reading bytes since the current approach is very > composable). So get_bytes() would need a way to signal that you don't want > some data file but the bytes for the module. Maybe if the path section is > unspecified then that's a signal that the module's bytes is wanted and not > some data file? > > Perhaps trying to read modules and resource files with the same method is the wrong approach? > If we are going to do that then we might as well deprecate all the methods that try to expose reading data and paths as the PEP 302 APIs tried to expose it uniformly. I don't think it makes sense to expose it uniformly, code is semantically different than data files and people need the ability to do different things with them. It's unlikely you'll get a 2GB .py file, however a 2GB data file is completely within the realms of possibility. > > Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 > That seems like a bit much, e.g. why do you need bytes **and** a file-like object() when you get the former from the latter? And why do you need the path argument when you can get the path off the file-like object if it's an actual file object? I don't think it's a bit much at all. You get a stream method because sometimes things expect a file-like object or sometimes the file is big and the ability to access a stream that handles that for you is super important. However when using a stream you need to ensure you close the stream after you're done using it. You get a bytes method because sometimes you don't care about all of that and you just need/want the raw bytes; it's a nicer API for those people to be able to just get bytes without having to worry about reading a file or closing the file after they are done reading it. You get a filename method because the stream method may or may not return a file object that has a path at all, and if you just need to pass the path into another API having an open file handle just to get the filename is a waste of a file handle. > > -Brett > > > This means that we're not talking about 'data' files, but 'resource' files. This also removes the idea that you can call Loader.set_data() on those files (like I've seen in the implementation). > >> >> >>> >>> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. >> >> I think we do want to allow directories, it's not unusual to have something like:
>>
>> warehouse
>> ├── __init__.py
>> ├── templates
>> │   ├── accounts
>> │   │   └── profile.html
>> │   └── hello.html
>> ├── utils
>> │   └── mapper.py
>> └── wsgi.py
>>
>> Conceptually templates isn't a package (even though with namespace packages it kinda is) and I'd want to load profile.html by doing something like: >> >> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >> > > Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) > >> >> In pkg_resources the second argument to that function is a 'resource path' which is defined as relative to the given module/package and it must use / to denote them. It explicitly says it's not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). >> > > Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. 
>> >> How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. >> >>> >>> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. >> >> Is an actual module what gets passed into Loader().exec_module()? >> >> Yes. >> >> If so I think it?s fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, ?logo.gif?) you?d do importlib.resources.get_bytes(__name__, ?logo.gif?). >> >> If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself although I don't think there were be any issues. >> >> And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Sat Jan 31 22:46:38 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 16:46:38 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> <20150131132933.28d485f5@marathon> Message-ID: <9BC29885-16AA-4F6B-8AEF-00E2D3F9B61D@stufft.io> > On Jan 31, 2015, at 4:25 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 1:29:43 PM Barry Warsaw > wrote: > On Jan 31, 2015, at 01:07 PM, Donald Stufft wrote: > > >The reasons for not wanting to use a context manager are sort of intertwined > >with each other. > > > >The competitor to this function is something like: > > > > import os.path > > import time > > > > LOGO_PATH = os.path.join(os.path.dirname(__file__), "logo.gif") > > > > def print_logo_path(): > > print(LOGO_PATH) > > > > > > while True: > > print_logo_path() > > time.sleep(1) > > I'm just wondering if that's extracted from a real example or whether it's > just a possible use case you'd want to support. It's not a use case I've ever > needed. > > I reviewed a bunch of resource_filename() uses and in almost all cases it's > > 1. Crafting a path-y thing for some other API that only takes paths. > 2. Constructing a path for essentially shutil.copy()'ing the file somewhere > else (e.g. a test http server's file vending directory). > > There are one or two where it might be inconvenient to use a context manager, > but the majority of cases would be fine. > > >It makes it more akward to use anytime you need to use the file in multiple > >locations or multiple times and since each context manager instance (in the > >worst case) is going to need to get bytes, create a temp file, and write bytes > >for each use of the context manager. > > Perhaps it makes sense to either provide two APIs and/or implement a higher > level API on top of a lower-level one? 
> > >The other thing is that for the "common" case, where the resource is available > >on the file system already because we're just using a FileLoader, there is no > >need for an atexit handler or a temporary file at all. The context manager > >would only really exist for the uncommon case where we need to write the data > >to a temporary file. Using the atexit handler allows us to provide the best > >API for the common case, without too much problem for the uncommon case. > > A context manager could also conditionalize the delete just like your proposal > conditionalizes adding to the atexit handler. > > >Yes it does mean that in certain cases the temporary files may be left behind, > >particularly with kill -9 or segfaults or what have you. However that case > >already exists, the only thing the context manager does is narrow the window > >of case where a kill -9 or a segfault can leave temporary files behind. > > Sure, but it reduces the window for leakage, which will probably be enough. > > I'm with Barry not wanting to rely on atexit when a context manager is explicit and will clean up any state as necessary. I think if we mandate a context manager people are going to be unlikely to actually use it because in a lot of cases it's going to be a pain in the ass to use a context manager with it and they'll just fall back to using os.path.join(os.path.dirname(__file__), 'my/file.txt'). I'm trying to make it so people *want* to use these APIs because they make their lives easier over the naive approach. Adding in stuff that makes it more awkward to use just means people won't use them and zip imports will continue to be barely supported in the wider ecosystem. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Sat Jan 31 23:27:06 2015 From: brett at python.org (Brett Cannon) Date: Sat, 31 Jan 2015 22:27:06 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: On Sat Jan 31 2015 at 4:43:50 PM Donald Stufft wrote: > On Jan 31, 2015, at 4:22 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft wrote: > >> On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote: >> >> >> >> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote: >> >>> On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote: >>> >>> >>> >>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore >>> wrote: >>> >>>> On 31 January 2015 at 15:47, Donald Stufft wrote: >>>> >> It's certainly possible to add a new API that loads resources based >>>> on >>>> >> a relative name, but you'd have to specify relative to *what*. >>>> >> get_data explicitly ducks out of making that decision. >>>> > >>>> > data = __loader__.get_bytes(__name__, 'logo.gif') >>>> >>>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't >>>> prohibit sharing of loaders, etc, in the way Brett mentions. >>> >>> >>> By specifying the package anchor point I don't think it does. >>> >>> >>>> Also, the >>>> fact that it needs __name__ in there feels wrong - a bit like the old >>>> version of super() needing to be told which class it was being called >>>> from. >>> >>> >>> You can't avoid that. 
This is the entire reason why loader reuse is a >>> pain; you **have** to specify what to work off of, else it's ambiguous and a >>> specific feature of a specific loader. >>> >>> But this is only an issue when you are trying to access a file relative >>> to the package/module you're in. Otherwise you're going to be specifying a >>> string constant like 'foo.bar'. >>> >>> >>>> But in principle I don't object to finding a suitable form of >>>> this. >>>> >>>> And I like the name get_bytes - much more explicit in these Python 3 >>>> days of explicit str/bytes distinctions :-) >>> >>> >>> One unfortunate side-effect from having a new method to return bytes >>> from a data file is that it makes get_data() somewhat redundant. If we make >>> it get_data_filename(package_name, path) then it can return an absolute >>> path which can then be passed to get_data() to read the actual bytes. If we >>> create importlib.resources as Donald has suggested then all of this can be >>> hidden behind a function and users don't have to care about any of this, >>> e.g. importlib.resources.read_data(module_anchor, path). >>> >>> >>> I think we actually have to go the other way, because only some Loaders >>> will be able to actually return a filename (returning a filename is >>> basically an optimization to prevent needing to call get_data and write >>> that out to a temporary directory) but pretty much any loader should >>> theoretically be able to support get_data. >>> >> >> Why can only some loaders return a filename? As I have said, loaders can >> return an opaque string to simulate a path if necessary. >> >> >> Because the idea behind get_data_filename() is that it returns a path >> that can be used regularly by APIs that expect to be handed a file on the >> file system. >> > > In my head that expectation is not placed on the method. > > >> Simulating a path with an opaque string isn't good enough because, for >> example, OpenSSL doesn't know how to open /data/foo.zip/foobar/cacert.pem. >> The idea here is that _if_ a regular file system path is available for a >> particular resource file then Loader().get_data_filename() would return it, >> otherwise it'd return None (or not exist at all). >> >> This means that pkgutil.get_data_filename (or >> importlib.resources.get_filename) can attempt to call >> Loader().get_data_filename() and just return that path if one exists on the >> file system already, and if it doesn't then it can create a temporary file >> and call Loader.get_data() and write the data to that temporary file and >> return the path to that. >> > > See I'm not even attempting to guarantee there is any API that will return > a reasonable file system path as the import API makes no such guarantees. > If an API like OpenSSL requires a file on the filesystem then you will have > to write to a temporary file and that's just life. That's the same as if > everything was stored in a zip file anyway. > > > The entire *point* of this thread is that sometimes you need a file path > that is a valid path to a resource. > Right, but I also have to make sure the import API doesn't get too ridiculous because it took me years and several versions of Python to make it work with the APIs inherited from PEP 302 and to make sure it didn't grow into a huge mess. 
> > The naive approach is to just make it do something like: > > # in pkgutil > def get_data_filename(package, resource): > data = get_data(package, resource) > if data is not None: > with open("/tmp/path", "wb") as fp: > fp.write(data) > return "/tmp/path" > > However the problem with this is that it imposes a read() into memory and > then creating a new file, and then writing that data back to a file even in > cases where there is already a file available on the file system. The > Loader().get_data_filename() exists for a Loader() to *optionally* say that > ?We already have a file path for this file, so you can just use this > instead of copying to a temporary location?. > And that's fine, but my point is forcing it to only play that role seems unnecessary. If you want a 'real' parameter to say "only return a path if I can pass it to an API that requires it" then that's fine. > > Then the ?optimized? but still naive approach becomes: > > # in pkgutil > def get_data_filename(package, resource): > mod = importlib.import_module(package) > if hasattr(mod.__loader__, "get_data_filename"): > try: > filename = mod.__loader__.get_data_filename(package, resource) > except FileNotFoundError: > pass > else: > if filename is not None: > return filename > > data = get_data(package, resource) > if data is not None: > with open("/tmp/path", "wb") as fp: > fp.write(data) > return "/tmp/path" > > This means there?s basically no penalty for using this API to access > resources files when you?re accessing files from a FileLoader. > And leaking a temp file until shutdown which is why Barry and I prefer a context manager. =) > In my opinion anything that is harder to use than: > > MY_PATH = os.path.join(os.path.dirname(__file__), ?my/file.txt?) > > Is highly unlikely to be used. People can already just write things to a > temporary directory using get_data, but the point is they don?t because > it?s a waste of time for the common case and it?s easier not to do that. > That's fine, but I also feel like we are trying to design around bad API design where something is assuming all data is going to be on disk and thus it's okay to require a file path on the filesystem instead of taking the bytes directly or a file-like object. I realize you are trying to solve this specifically for OpenSSL since it has the nasty practice of wanting a file path, but from an import perspective I have to also worry about what makes sense for the API as a whole and from the perspective of import. > > > >> >> >> >>> >>> I think it is redundant but given that it?s a new API (passing module >>> and a ?resource path?) I think it makes sense. The old get_data API can be >>> deprecated but left in for compatibility reasons if we want (sort of like >>> Loader().load_module() -> Loader().exec_module()). >>> >> >> If we do that then there would have to be a way to specify how to read >> the bytes for the module code itself since get_data() is used in the >> implementation of import by coupling it with get_filename() (which is why >> I'm trying not have to drop get_filename()/get_data() and instead come up >> with some new approach to reading bytes since the current approach is very >> composable). So get_bytes() would need a way to signal that you don't want >> some data file but the bytes for the module. Maybe if the path section is >> unspecified then that's a signal that the module's bytes is wanted and not >> some data file? >> >> >> Perhaps trying to read modules and resource files with the same method is >> the wrong approach? 
>> > > If we are going to do that then we might as well deprecate all the methods > that try to expose reading data and paths as the PEP 302 APIs tried to > expose it uniformly. > > > I don?t think it makes sense to expose it uniformly, code is semantically > different than data files and people need the ability to do different > things with them. It?s unlikely you?ll get a 2GB.py file, however a 2GB > data file is completely within the realms of possibility. > > > >> >> Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 >> > > That seems like a bit much, e.g. why do you needs bytes **and** and a > file-like object() when you get the former from the latter? And why do you > need the path argument when you can get the path off the file-like object > if it's an actual file object? > > > I don?t think it?s a bit much at all. > > You get a stream method because sometimes things expect a file like object > or sometimes the file is big and the ability to access a stream that > handles that for you is super important. However when using a stream you > need to ensure you close the stream after you?re done using it. > With a context manager the closing requirement is negligible. And that only is an optimization if you're reading from something that allows for incremental reads, e.g. it's not an optimization for a SQL-backed loader (which is probably why PEP 302 has get_data() instead of get_file_object() or something). > > You get a bytes method because sometimes you don?t care about all of that > and you just need/want the raw bytes, it?s a nicer API for those people to > be able to just get bytes without having to worry about reading a file or > closing the file after they are done reading it. > That seems unnecessary if you want to provide the optimization of allowing a file-like object to be returned when reading all of the bytes takes two lines of code instead of one. People know how to read files so it isn't like it's a new paradigm. > > You get a filename method because the stream method may or may not return > a file object that has a path at all, and if you just need to pass the path > into another API having an open file handle just to get the filename is a > waste of a file handle. > As I said above, I partially feel like the desire for this support is to work around some API decisions that are somewhat poor. How about this: get_path(package, path, *, real=False) or get_path(package, filename, *, real=False) -- depending on whether Barry and me get our way about paths or you do, Donald -- where 'real' is a flag specifying whether the path has to work as a path argument to builtins.open() and thus fails accordingly (in instances where it won't work it can fail immediately and so loader implementers only have two lines of code to care about to manage it). Then loaders can keep their get_data() method without issue and the API for loaders only grew by 1 (or stays constant depending on whether we want/can have it subsume get_filename() long-term). As for importlib.resources, that can provide a higher-level API for a file-like object along with some way to say whether the file must be addressable on the filesystem to know if tempfile.NamedTemporaryFile() may be backing the file-like object or if io.BytesIO could provide the API. This gets me a clean API for loaders and importlib and gets you your real file paths as needed. -Brett > > > > -Brett > > >> >> This means that we?re not talking about ?data? files, but ?resource? >> files. 
This also removes the idea that you can call Loader.set_data() on >> those files (like I've seen in the implementation). >> >> >> >>> >>> One thing to consider is do we want to allow anything other than >>> filenames for the path part? Thanks to namespace packages every directory >>> is essentially a package, so we could say that the package anchor has to >>> encapsulate the directory and the path bit can only be a filename. That >>> gets us even farther away from having the concept of file paths being >>> manipulated in relation to import-related APIs. >>> >>> >>> I think we do want to allow directories, it's not unusual to have >>> something like:
>>>
>>> warehouse
>>> ├── __init__.py
>>> ├── templates
>>> │   ├── accounts
>>> │   │   └── profile.html
>>> │   └── hello.html
>>> ├── utils
>>> │   └── mapper.py
>>> └── wsgi.py
>>>
>>> Conceptually templates isn't a package (even though with namespace >>> packages it kinda is) and I'd want to load profile.html by doing something >>> like: >>> >>> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >>> >> >> Where I would be fine with get_bytes('warehouse.templates.accounts', >> 'profile.html') =) >> >> >>> >>> In pkg_resources the second argument to that function is a 'resource >>> path' which is defined as relative to the given module/package and it >>> must use / to denote them. It explicitly says it's not a file system path >>> but a resource path. It may translate to a file system path (as is the case >>> with the FileLoader) but it also may not (as is the case with a theoretical >>> S3Loader or PostgreSQLLoader). >>> >> >> Yep, which is why I'm making sure if we have paths we minimize them as >> they instantly make these alternative loader concepts a bigger pain to >> implement. >> >> >>> How you turn a warehouse + a resource path into some data (or whatever >>> other function we support) is an implementation detail of the Loader. >>> >>> >>> And just so I don't forget it, I keep wanting to pass an actual module >>> in so the code can extract the name that way, but that prevents the >>> __name__ trick as you would have to import yourself or grab the module from >>> sys.modules. >>> >>> >>> Is an actual module what gets passed into Loader().exec_module()? >>> >> >> Yes. >> >> >>> If so I think it's fine to pass that into the new Loader() functions and >>> a new top level API in importlib.resources can do the things needed to turn >>> a string into a module object. So instead of doing >>> __loader__.get_bytes(__name__, 'logo.gif') you'd do >>> importlib.resources.get_bytes(__name__, 'logo.gif'). >>> >> >> If we go the route of importlib.resources then that seems like a >> reasonable idea, although we will need to think through the ramifications >> to exec_module() itself although I don't think there will be any issues. >> >> And if we do go with importlib.resources I will probably want to make it >> available on PyPI with appropriate imp/pkgutil fallbacks to help people >> transitioning from Python 2 to 3. >> >> --- >> Donald Stufft >> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA >> > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.f.moore at gmail.com Sat Jan 31 23:41:36 2015 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 31 Jan 2015 22:41:36 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <20150131124451.582b5dc3@marathon> Message-ID: On 31 January 2015 at 18:09, Donald Stufft wrote: > I don't think it's important for this API to support extracting extension > modules. If we want to support importing extension modules from inside > of a zip file (or similar) I think that should get its own support inside > the loader and not rely on the resource extraction for that. IOW I think > that these should primarily exist for data files. +1 Paul