From donald at stufft.io Sun Feb 1 00:05:36 2015
From: donald at stufft.io (Donald Stufft)
Date: Sat, 31 Jan 2015 18:05:36 -0500
Subject: [Import-SIG] Loading Resources From a Python Module/Package
In-Reply-To:
References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io>
	<82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io>
	<75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io>
Message-ID: <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io>

> On Jan 31, 2015, at 5:27 PM, Brett Cannon wrote:
>
> On Sat Jan 31 2015 at 4:43:50 PM Donald Stufft wrote:
>> On Jan 31, 2015, at 4:22 PM, Brett Cannon wrote:
>>
>> On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft wrote:
>>> On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote:
>>>
>>> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote:
>>>> On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote:
>>>>
>>>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote:
>>>> On 31 January 2015 at 15:47, Donald Stufft wrote:
>>>> >> It's certainly possible to add a new API that loads resources based on
>>>> >> a relative name, but you'd have to specify relative to *what*.
>>>> >> get_data explicitly ducks out of making that decision.
>>>> >
>>>> > data = __loader__.get_bytes(__name__, "logo.gif")
>>>>
>>>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't
>>>> prohibit sharing of loaders, etc, in the way Brett mentions.
>>>>
>>>> By specifying the package anchor point I don't think it does.
>>>>
>>>> Also, the fact that it needs __name__ in there feels wrong - a bit like the old
>>>> version of super() needing to be told which class it was being called from.
>>>>
>>>> You can't avoid that. This is the entire reason why loader reuse is a pain; you **have** to specify what to work off of, else it's ambiguous and a specific feature of a specific loader.
>>>>
>>>> But this is only an issue when you are trying to access a file relative to the package/module you're in.
>>>> Otherwise you're going to be specifying a string constant like 'foo.bar'.
>>>>
>>>> But in principle I don't object to finding a suitable form of this.
>>>>
>>>> And I like the name get_bytes - much more explicit in these Python 3 days of explicit str/bytes distinctions :-)
>>>>
>>>> One unfortunate side-effect from having a new method to return bytes from a data file is that it makes get_data() somewhat redundant. If we make it get_data_filename(package_name, path) then it can return an absolute path which can then be passed to get_data() to read the actual bytes. If we create importlib.resources as Donald has suggested then all of this can be hidden behind a function and users don't have to care about any of this, e.g. importlib.resources.read_data(module_anchor, path).
>>>
>>> I think we actually have to go the other way, because only some Loaders will be able to actually return a filename (returning a filename is basically an optimization to prevent needing to call get_data and write that out to a temporary directory) but pretty much any loader should theoretically be able to support get_data.
>>>
>>> Why can only some loaders return a filename? As I have said, loaders can return an opaque string to simulate a path if necessary.
>>
>> Because the idea behind get_data_filename() is that it returns a path that can be used regularly by APIs that expect to be handed a file on the file system.
>>
>> In my head that expectation is not placed on the method.
>>
>> Simulating a path with an opaque string isn't good enough because, for example, OpenSSL doesn't know how to open /data/foo.zip/foobar/cacert.pem. The idea here is that _if_ a regular file system path is available for a particular resource file then Loader().get_data_filename() would return it, otherwise it'd return None (or not exist at all).
>>
>> This means that pkgutil.get_data_filename (or importlib.resources.get_filename) can attempt to call Loader().get_data_filename() and just return that path if one exists on the file system already, and if it doesn't then it can create a temporary file and call Loader.get_data() and write the data to that temporary file and return the path to that.
>>
>> See I'm not even attempting to guarantee there is any API that will return a reasonable file system path as the import API makes no such guarantees. If an API like OpenSSL requires a file on the filesystem then you will have to write to a temporary file and that's just life. That's the same as if everything was stored in a zip file anyway.
>
> The entire *point* of this thread is that sometimes you need a file path that is a valid path to a resource.
>
> Right, but I also have to make sure the import API doesn't get too ridiculous because it took me years and several versions of Python to make it work with the APIs inherited from PEP 302 and to make sure it didn't grow into a huge mess.
>
> The naive approach is to just make it do something like:
>
>     # in pkgutil
>     def get_data_filename(package, resource):
>         data = get_data(package, resource)
>         if data is not None:
>             with open("/tmp/path", "wb") as fp:
>                 fp.write(data)
>             return "/tmp/path"
>
> However the problem with this is that it imposes a read() into memory and then creating a new file, and then writing that data back to a file even in cases where there is already a file available on the file system. The Loader().get_data_filename() exists for a Loader() to *optionally* say that "We already have a file path for this file, so you can just use this instead of copying to a temporary location".
>
> And that's fine, but my point is forcing it to only play that role seems unnecessary. If you want a 'real' parameter to say "only return a path if I can pass it to an API that requires it" then that's fine.
>
> Then the "optimized"
> but still naive approach becomes:
>
>     # in pkgutil
>     def get_data_filename(package, resource):
>         mod = importlib.import_module(package)
>         if hasattr(mod.__loader__, "get_data_filename"):
>             try:
>                 filename = mod.__loader__.get_data_filename(package, resource)
>             except FileNotFoundError:
>                 pass
>             else:
>                 if filename is not None:
>                     return filename
>
>         data = get_data(package, resource)
>         if data is not None:
>             with open("/tmp/path", "wb") as fp:
>                 fp.write(data)
>             return "/tmp/path"
>
> This means there's basically no penalty for using this API to access resource files when you're accessing files from a FileLoader.
>
> And leaking a temp file until shutdown which is why Barry and I prefer a context manager. =)

So the top level API can have both and people can use whichever fits their situation best.

It's not really leaking a temp file, it's making it available to the process once the function has been called. This is a common use case: needing a data file for the entire life of the process.

> In my opinion anything that is harder to use than:
>
>     MY_PATH = os.path.join(os.path.dirname(__file__), "my/file.txt")
>
> is highly unlikely to be used. People can already just write things to a temporary directory using get_data, but the point is they don't because it's a waste of time for the common case and it's easier not to do that.
>
> That's fine, but I also feel like we are trying to design around bad API design where something is assuming all data is going to be on disk and thus it's okay to require a file path on the filesystem instead of taking the bytes directly or a file-like object.
>
> I realize you are trying to solve this specifically for OpenSSL since it has the nasty practice of wanting a file path, but from an import perspective I have to also worry about what makes sense for the API as a whole and from the perspective of import.
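To make the temp-file-versus-context-manager trade-off concrete, here is a minimal sketch of the context-manager variant being discussed: it yields a real filesystem path and deletes any temporary copy on exit. The optional `get_data_filename()` loader method is the hypothetical API from this thread (no real loader implements it); the fallback assumes a file-backed module so that `__file__` and `get_data()` are available.

```python
import contextlib
import importlib
import os
import tempfile

@contextlib.contextmanager
def resource_path(package, resource):
    """Yield a real filesystem path for *resource* inside *package*,
    cleaning up any temporary copy when the block exits.
    Sketch only: get_data_filename() is the hypothetical optional
    loader method discussed in this thread."""
    mod = importlib.import_module(package)
    loader = getattr(mod, "__loader__", None)
    get_fn = getattr(loader, "get_data_filename", None)
    if get_fn is not None:
        try:
            filename = get_fn(package, resource)
        except FileNotFoundError:
            filename = None
        if filename is not None:
            # A real file already exists; no copy needed.
            yield filename
            return
    # Fall back: read the bytes through get_data() and stage them in a
    # named temporary file that is removed when the context exits.
    native = os.path.join(os.path.dirname(mod.__file__), resource)
    data = loader.get_data(native)
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
        yield path
    finally:
        os.unlink(path)
```

The `ResourceFilename` variant proposed later in the thread differs only in that the non-context-manager form would keep the temporary file alive for the rest of the process instead of unlinking it.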
I'm not actually trying to solve this specifically for OpenSSL at all, I'm trying to solve it for any API that requires a file path where I don't control that API. My end goal is to make it so zip imports are useful enough that people can assume they are going to work, not a curiosity that mostly only works by accident for most projects. Right now you can take almost any project on PyPI that has a data file and there's an extremely high chance that it won't work with zip import.

My long term goals here are to make a "static" deployment format for Python that can wrap everything up into one file, so one step along this path is getting people to stop doing things that rely on having a real filesystem and instead only go through some abstraction. However I can't get them to actually do that if the API to do it is awkward and something they don't want to use. Purity is a great thing, but when there is a direct competitor to this API you have to weigh purity against actual usefulness in the common case for people.

So yeah, it's designing around a bad API, but it's also an API design that exists all over the place in the real world, and telling people "well just don't do that" means they won't use this API and their thing will continue to not be usable with zip import.

>
>>>
>>> I think it is redundant but given that it's a new API (passing module and a "resource path") I think it makes sense. The old get_data API can be deprecated but left in for compatibility reasons if we want (sort of like Loader().load_module() -> Loader().exec_module()).
>>>
>>> If we do that then there would have to be a way to specify how to read the bytes for the module code itself, since get_data() is used in the implementation of import by coupling it with get_filename() (which is why I'm trying not to have to drop get_filename()/get_data() and instead come up with some new approach to reading bytes, since the current approach is very composable).
>>> So get_bytes() would need a way to signal that you don't want some data file but the bytes for the module. Maybe if the path section is unspecified then that's a signal that the module's bytes are wanted and not some data file?
>>
>> Perhaps trying to read modules and resource files with the same method is the wrong approach?
>>
>> If we are going to do that then we might as well deprecate all the methods that try to expose reading data and paths, as the PEP 302 APIs tried to expose it uniformly.
>
> I don't think it makes sense to expose it uniformly; code is semantically different than data files and people need the ability to do different things with them. It's unlikely you'll get a 2GB .py file, however a 2GB data file is completely within the realms of possibility.
>
>> Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0
>>
>> That seems like a bit much, e.g. why do you need bytes **and** a file-like object when you get the former from the latter? And why do you need the path argument when you can get the path off the file-like object if it's an actual file object?
>
> I don't think it's a bit much at all.
>
> You get a stream method because sometimes things expect a file-like object, or sometimes the file is big and the ability to access a stream that handles that for you is super important. However when using a stream you need to ensure you close the stream after you're done using it.
>
> With a context manager the closing requirement is negligible. And that only is an optimization if you're reading from something that allows for incremental reads, e.g. it's not an optimization for a SQL-backed loader (which is probably why PEP 302 has get_data() instead of get_file_object() or something).

In almost all uses where I would personally use it a context manager is awkward, and I won't use it and I'll just continue to not be zip import compatible.
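The bytes-versus-stream point above can be illustrated with a toy loader. The method names `get_stream`/`get_bytes` follow the proposal being debated (the bpaste link) but are hypothetical; the dict-backed loader just stands in for whatever storage a real loader would use.

```python
import io

class DictLoader:
    """Toy loader holding resources in a dict, standing in for the
    hypothetical loader API under discussion."""

    def __init__(self, resources):
        self._resources = resources

    def get_stream(self, package, path):
        # Return a file-like object; a zip- or SQL-backed loader would
        # do its real lookup here instead.
        return io.BytesIO(self._resources[(package, path)])

    def get_bytes(self, package, path):
        # The "nicer API" for callers who just want the raw bytes:
        # a two-line convenience over get_stream() that also handles
        # closing the stream.
        with self.get_stream(package, path) as f:
            return f.read()

loader = DictLoader({("pkg", "logo.gif"): b"GIF89a..."})
```

This is Brett's counterpoint in miniature: `get_bytes` is only two lines on top of `get_stream`, while Donald's point is that most callers should not have to write even those two lines.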
>
> You get a bytes method because sometimes you don't care about all of that and you just need/want the raw bytes; it's a nicer API for those people to be able to just get bytes without having to worry about reading a file or closing the file after they are done reading it.
>
> That seems unnecessary if you want to provide the optimization of allowing a file-like object to be returned when reading all of the bytes takes two lines of code instead of one. People know how to read files so it isn't like it's a new paradigm.
>
> You get a filename method because the stream method may or may not return a file object that has a path at all, and if you just need to pass the path into another API, having an open file handle just to get the filename is a waste of a file handle.
>
> As I said above, I partially feel like the desire for this support is to work around some API decisions that are somewhat poor.
>
> How about this: get_path(package, path, *, real=False) or get_path(package, filename, *, real=False) -- depending on whether Barry and I get our way about paths or you do, Donald -- where 'real' is a flag specifying whether the path has to work as a path argument to builtins.open() and thus fails accordingly (in instances where it won't work it can fail immediately, and so loader implementers only have two lines of code to care about to manage it). Then loaders can keep their get_data() method without issue and the API for loaders only grows by 1 (or stays constant depending on whether we want/can have it subsume get_filename() long-term).
>
> As for importlib.resources, that can provide a higher-level API for a file-like object along with some way to say whether the file must be addressable on the filesystem, to know if tempfile.NamedTemporaryFile() may be backing the file-like object or if io.BytesIO could provide the API.
>
> This gets me a clean API for loaders and importlib and gets you your real file paths as needed.
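A rough sketch of how the `get_path(..., real=False)` proposal might look from the loader implementer's side. Everything here is invented for illustration: the class names, the exception name, and the opaque pseudo-path format are assumptions, not part of any actual importlib API.

```python
import os

class ResourceNotOnDisk(Exception):
    """Raised when real=True is requested but no real file exists
    (hypothetical exception name)."""

class FileSystemishLoader:
    """File-backed loaders can always satisfy real=True."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def get_path(self, package, path, *, real=False):
        # Every resource already lives on disk, so the flag is moot.
        return os.path.join(self.base_dir, *package.split("."), path)

class ZipishLoader:
    """Archive-backed loaders fail fast when real=True."""

    def __init__(self, archive):
        self.archive = archive

    def get_path(self, package, path, *, real=False):
        if real:
            # The "two lines of code" Brett mentions for implementers
            # that cannot hand back a builtins.open()-able path.
            raise ResourceNotOnDisk(package + "/" + path)
        # Opaque pseudo-path, only meaningful to this loader's get_data().
        return self.archive + "/" + package.replace(".", "/") + "/" + path
```

The higher-level importlib.resources function would catch the exception and fall back to copying bytes into a temporary file, which is where the two proposals reconnect.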
>
I honestly don't really care what API the loader has, because I don't think anybody but the importlib.resources functions are ever going to use it, so if you want to do something I hate with them knock yourself out, I don't care that much. I just think that requiring the use of __file__ and __path__ outside of the implementation of the loader itself is a pretty big code smell.

How about this, instead here's the top level APIs I want: https://bpaste.net/show/0c490aa07c07

How this is implemented in the Loader() API can be whatever folks want. The important thing is that these all solve actual use cases and solve them better and easier than the naive approach of using os.path functions directly.

Important Things:

* resource_filename and ResourceFilename() must return the real file system path if available and a temporary file otherwise.
* resource_filename *must* be available for the lifetime of the process once the function has been called.
* ResourceFilename *must* clean itself up at the end of the context manager.
* These functions/context managers *must* work in terms of package names and relative file paths.

> -Brett
>
>> -Brett
>>
>> This means that we're not talking about "data" files, but "resource" files. This also removes the idea that you can call Loader.set_data() on those files (like I've seen in the implementation).
>>
>>>> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs.
>>>
>>> I think we do want to allow directories, it's not unusual to have something like:
>>>
>>> warehouse
>>> ├── __init__.py
>>> ├── templates
>>> │   ├── accounts
>>> │   │   └── profile.html
>>> │   └── hello.html
>>> ├── utils
>>> │   └── mapper.py
>>> └── wsgi.py
>>>
>>> Conceptually templates isn't a package (even though with namespace packages it kinda is) and I'd want to load profile.html by doing something like:
>>>
>>>     importlib.resources.get_bytes("warehouse", "templates/accounts/profile.html")
>>>
>>> Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =)
>>>
>>> In pkg_resources the second argument to that function is a "resource path", which is defined as relative to the given module/package and which must use / as its separator. It explicitly says it's not a file system path but a resource path. It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader).
>>>
>>> Yep, which is why I'm making sure that if we have paths we minimize them, as they instantly make these alternative loader concepts a bigger pain to implement.
>>>
>>> How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader.
>>>>
>>>> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules.
>>>
>>> Is an actual module what gets passed into Loader().exec_module()?
>>>
>>> Yes.
>>>
>>> If so I think it's fine to pass that into the new Loader() functions, and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, "logo.gif") you'd do importlib.resources.get_bytes(__name__, "logo.gif").
>>>
>>> If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself, although I don't think there will be any issues.
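The "resource path is not a filesystem path" distinction above can be sketched as a hypothetical `importlib.resources.get_bytes()`: the second argument is always '/'-separated and relative to the package anchor, and only a file-backed loader translates it to a native path at the very edge. This implementation is an assumption for illustration, not the real importlib.resources API (which did not exist at the time of this thread).

```python
import importlib
import os
import posixpath

def get_bytes(package, resource_path):
    """Sketch of the proposed importlib.resources.get_bytes().

    *resource_path* is '/'-separated and relative to *package*; it is a
    resource path, never a native filesystem path. This sketch assumes a
    file-backed loader (so __file__ and get_data() are available); an
    S3Loader or PostgreSQLLoader would map the parts itself.
    """
    mod = importlib.import_module(package)
    parts = posixpath.normpath(resource_path).split("/")
    if parts[0] in ("..", ""):
        raise ValueError("resource path must stay inside the package")
    # Translate to a native path only at the edge, for this loader.
    native = os.path.join(os.path.dirname(mod.__file__), *parts)
    return mod.__loader__.get_data(native)
```

With the warehouse layout above, `get_bytes("warehouse", "templates/accounts/profile.html")` would return the template's bytes without warehouse/templates ever being treated as a package.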
>>>
>>> And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3.
>>
>> ---
>> Donald Stufft
>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>
> ---
> Donald Stufft
> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From brett at python.org Sun Feb 1 01:19:21 2015
From: brett at python.org (Brett Cannon)
Date: Sun, 01 Feb 2015 00:19:21 +0000
Subject: [Import-SIG] Loading Resources From a Python Module/Package
References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io>
	<82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io>
	<75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io>
	<117995FA-74A1-473D-903F-48F1A601E57B@stufft.io>
Message-ID:

On Sat Jan 31 2015 at 6:05:39 PM Donald Stufft wrote:

> On Jan 31, 2015, at 5:27 PM, Brett Cannon wrote:
>
> On Sat Jan 31 2015 at 4:43:50 PM Donald Stufft wrote:
>
>> On Jan 31, 2015, at 4:22 PM, Brett Cannon wrote:
>>
>> On Sat Jan 31 2015 at 12:28:07 PM Donald Stufft wrote:
>>
>>> On Jan 31, 2015, at 12:00 PM, Brett Cannon wrote:
>>>
>>> On Sat Jan 31 2015 at 11:43:55 AM Donald Stufft wrote:
>>>
>>>> On Jan 31, 2015, at 11:31 AM, Brett Cannon wrote:
>>>>
>>>> On Sat Jan 31 2015 at 10:54:22 AM Paul Moore wrote:
>>>>
>>>>> On 31 January 2015 at 15:47, Donald Stufft wrote:
>>>>> >> It's certainly possible to add a new API that loads resources based on
>>>>> >> a relative name, but you'd have to specify relative to *what*.
>>>>> >> get_data explicitly ducks out of making that decision.
>>>>> >
>>>>> > data = __loader__.get_bytes(__name__, "logo.gif")
>>>>>
>>>>> Quite possibly. It needs a bit of fleshing out to make sure it doesn't
>>>>> prohibit sharing of loaders, etc, in the way Brett mentions.
>>>> >>>> >>>> By specifying the package anchor point I don't think it does. >>>> >>>> >>>>> Also, the >>>>> fact that it needs __name__ in there feels wrong - a bit like the old >>>>> version of super() needing to be told which class it was being called >>>>> from. >>>> >>>> >>>> You can't avoid that. This is the entire reason why loader reuse is a >>>> pain; you **have** to specify what to work off of, else its ambiguous and a >>>> specific feature of a specific loader. >>>> >>>> But this is only an issue when you are trying to access a file relative >>>> to the package/module you're in. Otherwise you're going to be specifying a >>>> string constant like 'foo.bar'. >>>> >>>> >>>>> But in principle I don't object to finding a suitable form of >>>>> this. >>>>> >>>>> And I like the name get_bytes - much more explicit in these Python 3 >>>>> days of explicit str/bytes distinctions :-) >>>> >>>> >>>> One unfortunate side-effect from having a new method to return bytes >>>> from a data file is that it makes get_data() somewhat redundant. If we make >>>> it get_data_filename(package_name, path) then it can return an absolute >>>> path which can then be passed to get_data() to read the actual bytes. If we >>>> create importlib.resources as Donald has suggested then all of this can be >>>> hidden behind a function and users don't have to care about any of this, >>>> e.g. importlib.resources.read_data(module_anchor, path). >>>> >>>> >>>> I think we actually have to go the other way, because only some Loaders >>>> will be able to actually return a filename (returning a filename is >>>> basically an optimization to prevent needing to call get_data and write >>>> that out to a temporary directory) but pretty much any loader should >>>> theoretically be able to support get_data. >>>> >>> >>> Why can only some loaders return a filename? As I have said, loaders can >>> return an opaque string to simulate a path if necessary. 
>>> >>> >>> Because the idea behind get_data_filename() is that it returns a path >>> that can be used regularly by APIs that expect to be handed a file on the >>> file system. >>> >> >> In my head that expectation is not placed on the method. >> >> >>> Simulating a path with an opaque string isn?t good enough because, for >>> example, OpenSSL doesn?t know how to open /data/foo.zip/foobar/cacert.pem. >>> The idea here is that _if_ a regular file system path is available for a >>> particular resource file then Loader().get_data_filename() would return it, >>> otherwise it?d return None (or not exist at all). >>> >>> This means that pkgutil.get_data_filename (or >>> importlib.resources.get_filename) can attempt to call >>> Loader().get_data_filename() and just return that path if one exists on the >>> file system already, and if it doesn?t then it can create a temporary file >>> and call Loader.get_data() and write the data to that temporary file and >>> return the path to that. >>> >> >> See I'm not even attempting to guarantee there is any API that will >> return a reasonable file system path as the import API makes no such >> guarantees. If an API like OpenSSL requires a file on the filesystem then >> you will have to write to a temporary file and that's just life. That's the >> same as if everything was stored in a zip file anyway. >> >> >> The entire *point* is this thread is that sometimes you need a file path >> that is a valid path to a resource. >> > > Right, but I also have to make sure the import API doesn't get too > ridiculous because it took me years and several versions of Python to make > it work with the APIs inherited from PEP 302 and to make sure it grow into > a huge mess. 
> >
>> The naive approach is to just make it do something like:
>>
>>     # in pkgutil
>>     def get_data_filename(package, resource):
>>         data = get_data(package, resource)
>>         if data is not None:
>>             with open("/tmp/path", "wb") as fp:
>>                 fp.write(data)
>>             return "/tmp/path"
>>
>> However the problem with this is that it imposes a read() into memory and then creating a new file, and then writing that data back to a file even in cases where there is already a file available on the file system. The Loader().get_data_filename() exists for a Loader() to *optionally* say that "We already have a file path for this file, so you can just use this instead of copying to a temporary location".
>
> And that's fine, but my point is forcing it to only play that role seems unnecessary. If you want a 'real' parameter to say "only return a path if I can pass it to an API that requires it" then that's fine.
>
>> Then the "optimized" but still naive approach becomes:
>>
>>     # in pkgutil
>>     def get_data_filename(package, resource):
>>         mod = importlib.import_module(package)
>>         if hasattr(mod.__loader__, "get_data_filename"):
>>             try:
>>                 filename = mod.__loader__.get_data_filename(package, resource)
>>             except FileNotFoundError:
>>                 pass
>>             else:
>>                 if filename is not None:
>>                     return filename
>>
>>         data = get_data(package, resource)
>>         if data is not None:
>>             with open("/tmp/path", "wb") as fp:
>>                 fp.write(data)
>>             return "/tmp/path"
>>
>> This means there's basically no penalty for using this API to access resource files when you're accessing files from a FileLoader.
>
> And leaking a temp file until shutdown which is why Barry and I prefer a context manager. =)
>
> So the top level API can have both and people can use whichever fits their situation best.
>
> It's not really leaking a temp file, it's making it available to the
This is a common use case to > need a data file for the entire life of the process. > > > >> In my opinion anything that is harder to use than: >> >> MY_PATH = os.path.join(os.path.dirname(__file__), ?my/file.txt?) >> >> Is highly unlikely to be used. People can already just write things to a >> temporary directory using get_data, but the point is they don?t because >> it?s a waste of time for the common case and it?s easier not to do that. >> > > That's fine, but I also feel like we are trying to design around bad API > design where something is assuming all data is going to be on disk and thus > it's okay to require a file path on the filesystem instead of taking the > bytes directly or a file-like object. > > I realize you are trying to solve this specifically for OpenSSL since it > has the nasty practice of wanting a file path, but from an import > perspective I have to also worry about what makes sense for the API as a > whole and from the perspective of import. > > > I?m not actually trying to solve this specifically for OpenSSL at all, I?m > trying to solve it for any API that requires a file path where I don?t > control that API. My end goal is to make it so zip imports are useful > enough people can assume they are going to work, not a curiosity that > mostly only works by accident for most projects. Right now you take almost > any project on PyPI that has a data file and there?s an extremely high > chance that it won?t work with zip import. > > My long term goals here are to make a ?static? deployment format for > Python that can wrap everything up into one file, so one step along this > path is getting people to stop doing things that rely on having a real > filesystem and only go through some abstraction. However I can?t get them > to actually do that if the API to do that is awkward and something they > don?t want to actually use. 
Purity is a great thing, but when there is a > direct competitor to this API you have to weigh purity against actual > usefulness in the common case for people. > > So yea, it?s designing around a bad API, but it?s also an API design that > exists all over the place in the real world and telling people ?well just > don?t do that? means they won?t use this API and their thing will continue > to not be usable with zip import. > > > >> >> >> >>> >>> >>> >>>> >>>> I think it is redundant but given that it?s a new API (passing module >>>> and a ?resource path?) I think it makes sense. The old get_data API can be >>>> deprecated but left in for compatibility reasons if we want (sort of like >>>> Loader().load_module() -> Loader().exec_module()). >>>> >>> >>> If we do that then there would have to be a way to specify how to read >>> the bytes for the module code itself since get_data() is used in the >>> implementation of import by coupling it with get_filename() (which is why >>> I'm trying not have to drop get_filename()/get_data() and instead come up >>> with some new approach to reading bytes since the current approach is very >>> composable). So get_bytes() would need a way to signal that you don't want >>> some data file but the bytes for the module. Maybe if the path section is >>> unspecified then that's a signal that the module's bytes is wanted and not >>> some data file? >>> >>> >>> Perhaps trying to read modules and resource files with the same method >>> is the wrong approach? >>> >> >> If we are going to do that then we might as well deprecate all the >> methods that try to expose reading data and paths as the PEP 302 APIs tried >> to expose it uniformly. >> >> >> I don?t think it makes sense to expose it uniformly, code is semantically >> different than data files and people need the ability to do different >> things with them. It?s unlikely you?ll get a 2GB.py file, however a 2GB >> data file is completely within the realms of possibility. 
>> >> >> >>> >>> Maybe instead we should do: https://bpaste.net/show/b25b7e8dc8f0 >>> >> >> That seems like a bit much, e.g. why do you needs bytes **and** and a >> file-like object() when you get the former from the latter? And why do you >> need the path argument when you can get the path off the file-like object >> if it's an actual file object? >> >> >> I don?t think it?s a bit much at all. >> >> You get a stream method because sometimes things expect a file like >> object or sometimes the file is big and the ability to access a stream that >> handles that for you is super important. However when using a stream you >> need to ensure you close the stream after you?re done using it. >> > > With a context manager the closing requirement is negligible. And that > only is an optimization if you're reading from something that allows for > incremental reads, e.g. it's not an optimization for a SQL-backed loader > (which is probably why PEP 302 has get_data() instead of get_file_object() > or something). > > > In almost all uses where I would personally use it a context manager is > awkward and I won?t use it and I?ll just continue to not be zip import > compatible. > > > >> >> You get a bytes method because sometimes you don?t care about all of that >> and you just need/want the raw bytes, it?s a nicer API for those people to >> be able to just get bytes without having to worry about reading a file or >> closing the file after they are done reading it. >> > > That seems unnecessary if you want to provide the optimization of allowing > a file-like object to be returned when reading all of the bytes takes two > lines of code instead of one. People know how to read files so it isn't > like it's a new paradigm. 
>
>> You get a filename method because the stream method may or may not return a file object that has a path at all, and if you just need to pass the path into another API, having an open file handle just to get the filename is a waste of a file handle.
>
> As I said above, I partially feel like the desire for this support is to work around some API decisions that are somewhat poor.
>
> How about this: get_path(package, path, *, real=False) or get_path(package, filename, *, real=False) -- depending on whether Barry and I get our way about paths or you do, Donald -- where 'real' is a flag specifying whether the path has to work as a path argument to builtins.open() and thus fails accordingly (in instances where it won't work it can fail immediately, and so loader implementers only have two lines of code to care about to manage it). Then loaders can keep their get_data() method without issue and the API for loaders only grows by 1 (or stays constant depending on whether we want/can have it subsume get_filename() long-term).
>
> As for importlib.resources, that can provide a higher-level API for a file-like object along with some way to say whether the file must be addressable on the filesystem, to know if tempfile.NamedTemporaryFile() may be backing the file-like object or if io.BytesIO could provide the API.
>
> This gets me a clean API for loaders and importlib and gets you your real file paths as needed.
>
> I honestly don't really care what API the loader has because I don't think anybody but the importlib.resources functions are ever going to use it, so if you want to do something I hate with them knock yourself out, I don't care that much,

Well you have to care to an extent because that API will be what lets you do what you want to do at a higher API level.

> I just think that requiring the use of __file__ and __path__ outside of the implementation of the loader itself is a pretty big code smell.
> Who ever said anything about __file__ and __path__ outside of loaders? All I have proposed is something to allow you to do, e.g.::

    def resource_stream(package, path):
        """Return a file-like object for a file relative to a package."""
        loader = ...
        try:
            return open(loader.get_path(package, path, real=True))
        except NotARealFileThingy:
            loader_path = loader.get_path(package, path)
            return io.BytesIO(loader.get_data(loader_path))

Otherwise get_stream(package, path) and then, e.g.::

    def resource_filename(package, path):
        loader = ...
        stream = loader.get_stream()
        if hasattr(stream, 'path'):
            return stream.path
        path = make a tempfile path somehow ...
        with open(path, 'wb') as file:
            file.write(stream.read())
        return path

Please realize the kind of bind you're putting (at least) me in: trying to abstract away paths so they don't really exist except as opaque things to pass around for a loader is great and a goal I have been trying to meet, but then you want real file paths when available so you can open files directly and feed file paths to APIs that require them without creating a temporary file. So you're asking for a loader API that doesn't directly work with paths but that will spit them out when they are available and have a way to differentiate them, which directly contradicts the idea of having the loader API hide the concept of a file path away entirely (granted it is on the edge of the API in terms of what it emits and not what it directly takes in, but it does pierce the abstraction away from paths). This use-case you're after is not something I haven't thought about or purposefully ignored. I too want people to work with loaders somehow so that data carried with a project from PyPI can be loaded from any reasonable loader implementation. 
But it's bloody hard and it's going to require some patience and compromise on all sides if we are going to get something that doesn't make loaders explicitly path-aware and hard to implement while still allowing the common case you are after of avoiding unnecessary overhead or doing something like isinstance() checks for importlib.machinery.SourceFileLoader or something. > > How about this, instead here's the top level APIs I want: > https://bpaste.net/show/0c490aa07c07 > Other than seeing resource_bytes() as redundant and not wanting to give all of them a "resource_" prefix if they are going to live in importlib.resources I'm basically fine with what you're after. > > How this is implemented in the Loader() API can be whatever folks want. > The important thing is that these all solve actual use cases and solve them > better and easier than the naive approach of using os.path functions > directly. > > Important Things: >
> * resource_filename and ResourceFilename() must return the real file system path if available and a temporary file otherwise.
> * resource_filename *must* be available for the lifetime of the process once the function has been called.
> * ResourceFilename *must* clean itself up at the end of the context manager.
> * These functions/context managers *must* work in terms of package names and relative file paths.
> All seem reasonable to me. -Brett > > > -Brett > > >> >> >> >> -Brett >> >> >>> >>> This means that we're not talking about "data" files, but "resource" >>> files. This also removes the idea that you can call Loader.set_data() on >>> those files (like I've seen in the implementation). >>> >>> >>> >>>> >>>> >>>> One thing to consider is do we want to allow anything other than >>>> filenames for the path part? Thanks to namespace packages every directory >>>> is essentially a package, so we could say that the package anchor has to >>>> encapsulate the directory and the path bit can only be a filename. 
That >>>> gets us even farther away from having the concept of file paths being >>>> manipulated in relation to import-related APIs. >>>> >>>> I think we do want to allow directories, it's not unusual to have >>>> something like: >>>>
>>>> warehouse
>>>> ├── __init__.py
>>>> ├── templates
>>>> │   ├── accounts
>>>> │   │   └── profile.html
>>>> │   └── hello.html
>>>> ├── utils
>>>> │   └── mapper.py
>>>> └── wsgi.py
>>>> Conceptually templates isn't a package (even though with namespace >>>> packages it kinda is) and I'd want to load profile.html by doing something >>>> like: >>>> >>>> importlib.resources.get_bytes('warehouse', >>>> 'templates/accounts/profile.html') >>> >>> Where I would be fine with get_bytes('warehouse.templates.accounts', >>> 'profile.html') =) >>> >>> >>>> >>>> In pkg_resources the second argument to that function is a "resource >>>> path" which is defined as relative to the given module/package and it >>>> must use / to denote them. It explicitly says it's not a file system path >>>> but a resource path. It may translate to a file system path (as is the case >>>> with the FileLoader) but it also may not (as is the case with a theoretical >>>> S3Loader or PostgreSQLLoader). >>>> >>> >>> Yep, which is why I'm making sure if we have paths we minimize them as >>> they instantly make these alternative loader concepts a bigger pain to >>> implement. >>> >>> >>>> How you turn a warehouse + a resource path into some data (or whatever >>>> other function we support) is an implementation detail of the Loader. >>>> >>>> >>>> And just so I don't forget it, I keep wanting to pass an actual module >>>> in so the code can extract the name that way, but that prevents the >>>> __name__ trick as you would have to import yourself or grab the module from >>>> sys.modules. >>>> >>>> >>>> Is an actual module what gets passed into Loader().exec_module()? >>>> >>> >>> Yes. 
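[Editor's sketch] The "resource path is not a file system path" distinction above is easy to illustrate. A filesystem-backed loader might map a '/'-separated resource path onto a real file below the package directory like this; the function name is illustrative only, and other backends (an S3Loader or PostgreSQLLoader) would interpret the same string completely differently:

```python
import os


def fs_resource_path(package_dir, resource_path):
    # Resource paths always use '/' as the separator regardless of
    # platform, so split on '/' and rejoin with the local separator.
    parts = resource_path.split('/')
    return os.path.join(package_dir, *parts)
```

On POSIX, `fs_resource_path("warehouse", "templates/accounts/profile.html")` yields `warehouse/templates/accounts/profile.html`; a non-filesystem loader would instead treat the resource path as an opaque key.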
>>> >>> >>>> If so I think it's fine to pass that into the new Loader() functions >>>> and a new top level API in importlib.resources can do the things needed to >>>> turn a string into a module object. So instead of doing >>>> __loader__.get_bytes(__name__, 'logo.gif') you'd do >>>> importlib.resources.get_bytes(__name__, 'logo.gif'). >>>> >>> >>> If we go the route of importlib.resources then that seems like a >>> reasonable idea, although we will need to think through the ramifications >>> to exec_module() itself although I don't think there will be any issues. >>> >>> And if we do go with importlib.resources I will probably want to make it >>> available on PyPI with appropriate imp/pkgutil fallbacks to help people >>> transitioning from Python 2 to 3. >>> >>> --- >>> Donald Stufft >>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA >>> >> >> --- >> Donald Stufft >> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA >> > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From donald at stufft.io Sun Feb 1 01:59:27 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 19:59:27 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io> Message-ID: <64D4D521-0F11-4D0A-81A7-2C8A1CCE5501@stufft.io> > On Jan 31, 2015, at 7:19 PM, Brett Cannon wrote: > > > > On Sat Jan 31 2015 at 6:05:39 PM Donald Stufft > wrote: > > I honestly don't really care what API the loader has because I don't think anybody but the importlib.resources functions are ever going to use it, so if you want to do something I hate with them knock yourself out; I don't care that much, > > Well you have to care to an extent because that API will be what lets you do what you want to do at a higher API level. Well I mean I don't care enough to argue about what the API looks like as long as the high-level API is possible with them. > > I just think that requiring using __file__ and __path__ outside of the implementation of the loader itself is a pretty big code smell. > > Who ever said anything about __file__ and __path__ outside of loaders? All I have proposed is something to allow you to do, e.g.:: Sorry, I mean the current API for get_data requires a Loader() user to construct an absolute path using __file__ and/or __path__ (at least I think it does, and the PEP example I think shows it using that). I think that whatever solution that ends up with should not require someone using the Loader() to know about them.

> def resource_stream(package, path):
>     """Return a file-like object for a file relative to a package."""
>     loader = ...
>     try:
>         return open(loader.get_path(package, path, real=True))
>     except NotARealFileThingy:
>         loader_path = loader.get_path(package, path)
>         return io.BytesIO(loader.get_data(loader_path))
>
> Otherwise get_stream(package, path) and then, e.g.::
>
> def resource_filename(package, path):
>     loader = ...
>     stream = loader.get_stream()
>     if hasattr(stream, 'path'):
>         return stream.path
>     path = make a tempfile path somehow ...
>     with open(path, 'wb') as file:
>         file.write(stream.read())
>     return path

This is what I meant when I didn't care. I don't really like that loader API much but it (mostly) allows the top level concepts I want without having to sacrifice much. The only negative I can see here is that it's impossible to implement get_stream in an efficient way for large objects except on a file system. This might not matter much because in practice these files aren't often going to be very big nor are they often going to live anywhere but the file system. However, for instance it's impossible for a loader that downloads things from the internet to return a stream that won't load the entire response in memory inside of an io.BytesIO container. It might matter for zip files (I don't know if it's possible to open a handle to a particular file inside of a zip file and read() that without reading the whole file into memory?). It might make sense to deprecate Loader().get_data() and replace it with Loader().get_data_stream(), or just have them both (in the simple case Loader().get_data() could be a small wrapper around Loader().get_data_stream()). That would allow efficient access with resource_stream() even for non-file-system things without changing the API much from what you like. 
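[Editor's sketch] The "get_data() as a small wrapper around get_data_stream()" idea described above could look roughly like this. The mixin and method names are hypothetical (this was never standardized in this form), and the in-memory backend merely stands in for a zip or network loader:

```python
import io


class StreamingLoaderMixin:
    """Keep get_data() as a thin wrapper over a streaming primitive,
    so stream-capable backends avoid holding everything in memory."""

    def get_data_stream(self, path):
        # Each concrete loader supplies its own streaming access.
        raise NotImplementedError

    def get_data(self, path):
        # Generic fallback: drain the stream into a single bytes object.
        with self.get_data_stream(path) as stream:
            return stream.read()


class InMemoryLoader(StreamingLoaderMixin):
    """Toy backend standing in for a zip- or network-based loader."""

    def __init__(self, files):
        self._files = files

    def get_data_stream(self, path):
        return io.BytesIO(self._files[path])


loader = InMemoryLoader({"logo.gif": b"GIF89a..."})
assert loader.get_data("logo.gif") == b"GIF89a..."
```

Under this split, `resource_stream()` would call `get_data_stream()` directly and only loaders without a cheap streaming path would pay the full-read cost.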
> > Please realize the kind of bind you're putting (at least) me in: trying to abstract away paths so they don't really exist except as opaque things to pass around for a loader is great and a goal I have been trying to meet, but then you want real file paths when available so you can open files directly and feed file paths to APIs that require them without creating a temporary file. So you're asking for a loader API that doesn't directly work with paths but that will spit them out when they are available and have a way to differentiate them which directly contradicts the idea of having the loader API hide the concept of a file path away entirely (granted it is on the edge of the API in terms of what it emits and not what it directly takes in, but it does pierce the abstraction away from paths). > > This use-case you're after is not something I haven't thought about or purposefully ignored. I too want people to work with loaders somehow so that data carried with a project from PyPI can be loaded from any reasonable loader implementation. But it's bloody hard and it's going to require some patience and compromise on all sides if we are going to get something that doesn't make loaders explicitly path-aware and hard to implement while still allowing the common case you are after of avoiding unnecessary overhead or doing something like isinstance() checks for importlib.machinery.SourceFileLoader or something. Honestly I'm not trying to put anyone in a bind; I want the top level APIs that I pointed out and I want them to work in roughly the best way for each particular backend. For me the best way to do that is to give each top level function an optional hook into the Loaders so that the loader can do something better and less generic if possible, since it knows more about how it is implemented and can possibly make better choices about what is the best way to implement each particular thing. 
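[Editor's sketch] Donald's "optional hook" pattern is a familiar one; a rough version is shown below. All names here are hypothetical, and the generic fallback (get_data() plus a temporary file, with the lifetime rules discussed elsewhere in the thread) is elided:

```python
def resource_filename(loader, package, path):
    # Prefer a loader-specific fast path when the loader offers one;
    # otherwise fall back to a generic, less efficient strategy.
    hook = getattr(loader, "get_resource_filename", None)
    if hook is not None:
        return hook(package, path)
    # Generic strategy would extract via loader.get_data() into a
    # temporary file; elided in this sketch.
    raise NotImplementedError("generic fallback elided")


class FilesystemishLoader:
    """A loader whose resources already live on a real file system."""

    def __init__(self, base):
        self.base = base

    def get_resource_filename(self, package, path):
        # No extraction needed: hand back the real on-disk path.
        return self.base + "/" + path


name = resource_filename(FilesystemishLoader("/srv/app"), "warehouse", "logo.gif")
# name == "/srv/app/logo.gif"
```

The point of the hook is that a zip or database loader simply omits `get_resource_filename` and inherits the generic behaviour, while filesystem loaders skip the temporary-file dance entirely.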
I also don't like boolean arguments to functions that drastically change their behavior, which is why I wanted to add things like Loader().get_data_filename() which would only return a non-None value if a real path existed (aka real=True from above). Sans the stream issue from above, it looks like the implementation you defined would work; I just don't particularly like the API much because I don't like cramming source code and data files into the same API. That's OK though because I don't really need to work with the API hardly ever so it doesn't affect me much, I was just being opinionated because that's the way I am. > > > How about this, instead here's the top level APIs I want: https://bpaste.net/show/0c490aa07c07 > > Other than seeing resource_bytes() as redundant and not wanting to give all of them a "resource_" prefix if they are going to live in importlib.resources I'm basically fine with what you're after. It is somewhat redundant but I think it's also a useful wrapper around resource_stream() to have, especially since it can then replace pkgutil.get_data directly and that function can be deprecated for it. I don't really have a strong opinion on the names themselves; the resource_ names were taken from what pkg_resources called them and because I couldn't think of a good naming scheme for them that wasn't just prefixing with get_, which I didn't like as much as resource_. It might make sense to just put them in importlib.util with a resource_ prefix on them, or in importlib's top level. I don't have a strong opinion on which one of those options is "best". > > > How this is implemented in the Loader() API can be whatever folks want. The important thing is that these all solve actual use cases and solve them better and easier than the naive approach of using os.path functions directly. > > Important Things: >
> * resource_filename and ResourceFilename() must return the real file system path if available and a temporary file otherwise.
> * resource_filename *must* be available for the lifetime of the process once the function has been called.
> * ResourceFilename *must* clean itself up at the end of the context manager.
> * These functions/context managers *must* work in terms of package names and relative file paths.
> > All seem reasonable to me. > > -Brett > > > >> -Brett >> >> >> >>> >>> -Brett >>> >>> >>> This means that we're not talking about "data" files, but "resource" files. This also removes the idea that you can call Loader.set_data() on those files (like I've seen in the implementation). >>> >>>> >>>> >>>>> >>>>> One thing to consider is do we want to allow anything other than filenames for the path part? Thanks to namespace packages every directory is essentially a package, so we could say that the package anchor has to encapsulate the directory and the path bit can only be a filename. That gets us even farther away from having the concept of file paths being manipulated in relation to import-related APIs. >>>> >>>> I think we do want to allow directories, it's not unusual to have something like: >>>>
>>>> warehouse
>>>> ├── __init__.py
>>>> ├── templates
>>>> │   ├── accounts
>>>> │   │   └── profile.html
>>>> │   └── hello.html
>>>> ├── utils
>>>> │   └── mapper.py
>>>> └── wsgi.py
>>>> Conceptually templates isn't a package (even though with namespace packages it kinda is) and I'd want to load profile.html by doing something like: >>>> importlib.resources.get_bytes('warehouse', 'templates/accounts/profile.html') >>>> >>>> Where I would be fine with get_bytes('warehouse.templates.accounts', 'profile.html') =) >>>> >>>> In pkg_resources the second argument to that function is a "resource path" which is defined as relative to the given module/package and it must use / to denote them. It explicitly says it's not a file system path but a resource path. 
It may translate to a file system path (as is the case with the FileLoader) but it also may not (as is the case with a theoretical S3Loader or PostgreSQLLoader). >>>> >>>> Yep, which is why I'm making sure if we have paths we minimize them as they instantly make these alternative loader concepts a bigger pain to implement. >>>> >>>> How you turn a warehouse + a resource path into some data (or whatever other function we support) is an implementation detail of the Loader. >>>> >>>>> >>>>> And just so I don't forget it, I keep wanting to pass an actual module in so the code can extract the name that way, but that prevents the __name__ trick as you would have to import yourself or grab the module from sys.modules. >>>> >>>> Is an actual module what gets passed into Loader().exec_module()? >>>> >>>> Yes. >>>> >>>> If so I think it's fine to pass that into the new Loader() functions and a new top level API in importlib.resources can do the things needed to turn a string into a module object. So instead of doing __loader__.get_bytes(__name__, 'logo.gif') you'd do importlib.resources.get_bytes(__name__, 'logo.gif'). >>>> >>>> If we go the route of importlib.resources then that seems like a reasonable idea, although we will need to think through the ramifications to exec_module() itself, although I don't think there will be any issues. >>>> >>>> And if we do go with importlib.resources I will probably want to make it available on PyPI with appropriate imp/pkgutil fallbacks to help people transitioning from Python 2 to 3. >>> >>> --- >>> Donald Stufft >>> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA >> >> >> --- >> Donald Stufft >> PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From donald at stufft.io Sun Feb 1 02:08:54 2015 From: donald at stufft.io (Donald Stufft) Date: Sat, 31 Jan 2015 20:08:54 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <64D4D521-0F11-4D0A-81A7-2C8A1CCE5501@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io> <64D4D521-0F11-4D0A-81A7-2C8A1CCE5501@stufft.io> Message-ID: > On Jan 31, 2015, at 7:59 PM, Donald Stufft wrote: >> >> How this is implemented in the Loader() API can be whatever folks want. The important thing is that these all solve actual use cases and solve them better and easier than the naive approach of using os.path functions directly. >> >> Important Things: >> >> * resource_filename and ResourceFilename() must return the real file system path if available and a temporary file otherwise. >> * resource_filename *must* be available for the lifetime of the process once the function has been called. >> * ResourceFilename *must* clean itself up at the end of the context manager. >> * These functions/context managers *must* work in terms of package names and relative file paths. >> >> All seem reasonable to me. >> Oh, one additional thing that I think is important: they should work with modules (foo -> foo.py), packages (foo -> foo/__init__.py), old-style "namespace" packages via extending __path__ (foo -> multiple foo/__init__.py), and new-style namespace packages (foo -> multiple foo/). What file path to use is obvious in the first two cases because there is only one candidate file. For the other two I think it should just use the order of the __path__ and return the first one it finds (or None/Exception if it doesn't find one). --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... 
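[Editor's sketch] The first-match rule for __path__ that Donald describes could look something like the following. This is filesystem-only and the function name is hypothetical; a real version would consult each path entry's loader rather than assuming directories on disk:

```python
import os


def first_candidate(search_path, relative):
    """Walk a package's __path__ entries in order and return the first
    one that actually contains the requested resource, or None if no
    entry does."""
    for entry in search_path:
        candidate = os.path.join(entry, relative)
        if os.path.exists(candidate):
            return candidate
    return None
```

With a namespace package spread across two directories that both contain the file, the entry earlier in `__path__` wins, matching the ordering rule above.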
URL: From ncoghlan at gmail.com Sun Feb 1 04:57:46 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 1 Feb 2015 13:57:46 +1000 Subject: [Import-SIG] Optimization levels embedded in .pyo file names? In-Reply-To: References: <20150130164646.5d1538ff@anarchist.wooz.org> Message-ID: On 1 February 2015 at 02:48, Brett Cannon wrote: > Since everyone seems to think it's a good idea I will write up a PEP with > the end goal of going all the way with .pyc (probably on Friday). +1 to Barry and Eric's replies from me as well. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Feb 1 06:28:42 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 1 Feb 2015 15:28:42 +1000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: On 1 February 2015 at 08:27, Brett Cannon wrote: > As I said above, I partially feel like the desire for this support is to > work around some API decisions that are somewhat poor. > > How about this: get_path(package, path, *, real=False) or get_path(package, > filename, *, real=False) -- depending on whether Barry and me get our way > about paths or you do, Donald -- where 'real' is a flag specifying whether > the path has to work as a path argument to builtins.open() and thus fails > accordingly (in instances where it won't work it can fail immediately and so > loader implementers only have two lines of code to care about to manage it). > Then loaders can keep their get_data() method without issue and the API for > loaders only grew by 1 (or stays constant depending on whether we want/can > have it subsume get_filename() long-term). 
Jumping in here, since I specifically object to the "real=" API design concept (on the grounds that the presence of that kind of flag means you have two different methods trying to get out), this thread is already quite long and there are several different aspects I'd like to comment on :)

* I like the overall naming suggestion of referring to this as a "resources" API. That not only has precedent in pkg_resources, but is also the standard terminology for referring to this kind of thing in rich client applications. (see https://msdn.microsoft.com/en-us/library/windows/apps/hh465241.aspx for example)

* I think the PEP 302 approach of referring to resource anchors as "paths" is inherently confusing, especially when the most common anchor is __file__. As a result, I think we should refer to "resource anchors" and "relative paths", rather than the current approach of trying to create and pass around "absolute paths" (which then end up only working properly when packages are installed to a real filesystem).

* I think Donald's overview at https://bpaste.net/show/0c490aa07c07 is a good summary of the functionality we should aim to provide (naming bikesheds aside)

* I agree we should treat extraction and loading of C extension modules (and shared libraries in general) as out of scope for the resource API. They face several restrictions that don't apply to other pure data files

* I agree that the resource APIs should be for read-only access only. Images, localisation strings, application templates, those are the kinds of things this API is aimed at: they're an essential part of the application, and hence it's appropriate to bundle them with it in a way that still works for single-file zip archive applications, but they're not Python code.

* For the "must exist as a real shareable filesystem artefact, fail immediately if that isn't possible" API, I think we should support both implicit cleanup *and* explicit context managers for deterministic resource control. 
"Make this available until I'm done with it, regardless of where I use it" and "make this available for this defined region of code" are different use cases. Depending on how these objects are modelled in the API (more on that below), we could potentially drop the atexit handler in favour of suitable weakref.finalize() calls (which would then clean them up once the last reference to the resource was dropped, rather than always waiting until the end of the process - "keep this resource available until the process ends" would then be a matter of referencing it from the appropriate module globals or some other similarly long-lived data structure). Leaks due to process crashes would then be cleaned up by normal OS tempfile management processes.

* I don't think we should couple the concept of resource anchors directly to package names (as discussed, it doesn't work for namespace packages, for example). I think we *should* be able to *look up* resource anchors by package name, although this may fail in some cases (such as namespace packages), and that the top level API should do that lookup implicitly (allowing package names to be passed wherever an anchor is expected). A module object should also be usable as its own anchor. I believe we should disallow the use of filesystem paths as resource anchors, as that breaks the intended abstraction (looking resources up relative to the related modules), and the API behaviour is clearer if strings are always assumed to be referring to package/module names.

* I *don't* think it's a good idea to incorporate this idea directly onto the existing module Loader API. Better to create a new "ResourceLoader" abstraction, such that we can easily provide a default LocationResourceLoader. Reusing module Loader instances across modules would still be permitted, reusing ResourceLoader instances *would not*. This allows the resource anchor to be specified when creating the resource loader, rather than on every call. 
* As a consequence of the previous point, the ResourceLoader instance would be linked *from the module spec* (and perhaps from the module globals), rather than from the module loader instance. (This is how we would support using a module as its own anchor.) Having a resource loader defined in the spec would be optional, making it clear that namespace modules (for example) don't provide a resource access API - if you want to store resources inside a namespace package, you need to create a submodule or self-contained subpackage to serve as the resource anchor.

* As a consequence of making a suitably configured resource loader available through the module spec as part of the module finding process, it would become possible to access module-relative resources *without actually loading the module itself*.

* If the import system gets a module spec where "spec.has_location" is set and Loader.get_data is available, but the new "spec.resource_loader" attribute is set to None, then it will set it to "LocationResourceLoader(spec.origin)", which will rely solely on Loader.get_data() for content access

* We'd also provide an optimised FilesystemResourceLoader for use with actual installed packages where the resources already exist on disk and don't need to be copied to memory or a temporary directory to provide a suitable API.

* For abstract data access at the ResourceLoader API level, I like "get_anchor()" (returning a suitably descriptive string such that "os.path.join(anchor, )" will work with get_data() on the corresponding module Loader), "get_bytes()", "get_bytestream()" and "get_filesystem_path()". get_anchor() would be the minimum API, with default implementations of the other three based on Loader.get_data(), BytesIO and tempfile (this would involve suitable use of lazy or on-demand imports for the latter two, as we'd need access to these from importlib._bootstrap, but wouldn't want to load them on every interpreter startup). 
* For the top-level API, I similarly favour importlib.resources.get_bytes(), get_bytestream() and get_filesystem_path(). However, I would propose that the latter be an object implementing a to-be-defined subset of the pathlib Path API, rather than a string. Resource listing, etc, would then be handled through the existing Path abstraction, rather than defining a new one. In the standard library, because we'd just be using a temporary directory, we could use real Path objects (although we'd need to add weakref support to them to implement the weakref.finalize suggestion I make above)

> As for importlib.resources, that can provide a higher-level API for a > file-like object along with some way to say whether the file must be > addressable on the filesystem to know if tempfile.NamedTemporaryFile() may > be backing the file-like object or if io.BytesIO could provide the API. > > This gets me a clean API for loaders and importlib and gets you your real > file paths as needed.

Yep, as you can see above, I agree there are two APIs to be designed here - the high level user facing one, and the one between the import machinery and plugin authors. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Feb 1 06:40:21 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 1 Feb 2015 15:40:21 +1000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: On 1 February 2015 at 15:28, Nick Coghlan wrote: > * I don't think we should couple the concept of resource anchors > directly to package names (as discussed, it doesn't work for namespace > packages, for example). 
I think we *should* be able to *look up* > resource anchors by package name, although this may fail in some cases > (such as namespace packages), and that the top level API should do > that lookup implicitly (allowing package names to be passed wherever > an anchor is expected). A module object should also be usable as its > own anchor. I believe we should disallow the use of filesystem paths > as resource anchors, as that breaks the intended abstraction (looking > resources up relative to the related modules), and the API behaviour > is clearer if strings are always assumed to be referring to > package/module names.

Oops, just realised this is wrong, because I myself initially used the term "resource anchor" to refer to two different things and didn't go back to fix this point when I settled on only using it to refer to part of the API between the import machinery and custom resource loaders. To fix that mistake:

* I think the user facing API should be defined in terms of modules & packages (as in Donald's draft API), provided by name, spec or the object itself.

* I think the interface between the import machinery and resource loaders should use the concept of "resource anchors" as a new term to describe what we mean when __file__ gets set to something other than a real filesystem path. Filesystem paths are then a kind of resource anchor, as are the combinations of a zip archive name with a subpath within that archive.

Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From barry at python.org Sun Feb 1 22:43:05 2015 From: barry at python.org (Barry Warsaw) Date: Sun, 1 Feb 2015 16:43:05 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io> Message-ID: <20150201164305.72ea612d@limelight.wooz.org> On Jan 31, 2015, at 06:05 PM, Donald Stufft wrote: >How about this, instead here's the top level APIs I want: >https://bpaste.net/show/0c490aa07c07 I like almost all of this. It nicely handles the case where you want a longer-lived file resource (via resource_filename()) and don't care about its life cycle, and where you want to clean up the resource asap (via ResourceFilename). If I didn't skim over something critical, I think Nick's introduction of the term "resource anchor" is a useful one. Given that a resource anchor can be

1. a string containing the dotted module path to a package
2. an actual module object
3. a module spec

the term better describes the first argument in these APIs than "package". I still would like a generalization of resource_stream() that allows opening in text mode with a given encoding, e.g.

    # importlib.resources
    def open(resource_anchor, resource, encoding=None):
        """
        resource_anchor is 1) a str that represents a dot-separated import
        module; 2) a module object; 3) a module spec.

        resource is a str that represents a resource relative to the package.

        Return a file-like object (but it may be an io.BytesIO) where read()
        yields bytes. If encoding is given, read() yields strs. No `mode` is
        provided, as this is a read-only interface.
        """

I can build this out of the pieces already described:

    from importlib.resources import ResourceFilename

    with ResourceFilename('my.package', 'foo.cfg') as filename:
        with open(filename, 'r', encoding='utf-8') as fp:
            my_config = fp.read()

but it's certainly not as convenient as:

    with importlib.resources.open('my.package', 'foo.cfg', 'utf-8') as fp:
        my_config = fp.read()

(oh, and is anybody else tired of writing `open('file', 'r', encoding='utf-8')` literally *everywhere*, and wish it were just the default already? ;) Cheers, -Barry From ncoghlan at gmail.com Mon Feb 2 14:22:09 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 2 Feb 2015 23:22:09 +1000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <20150201164305.72ea612d@limelight.wooz.org> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io> <20150201164305.72ea612d@limelight.wooz.org> Message-ID: On 2 Feb 2015 07:43, "Barry Warsaw" wrote: > > On Jan 31, 2015, at 06:05 PM, Donald Stufft wrote: > > >How about this, instead here's the top level APIs I want: > >https://bpaste.net/show/0c490aa07c07 > > I like almost all of this. It nicely handles the case where you want a longer > lived file resource (via resource_filename()) and don't care about its life > cycle, and where you want to clean up the resource asap (via > ResourceFilename). > > If I didn't skim over something critical, I think Nick's introduction of the > term "resource anchor" is a useful one. Given that a resource anchor can be > > 1. a string containing the dotted module path to a package > 2. an actual module object > 3. a module spec > > the term better describes the first argument in these APIs than "package". 
For the related concept used to map this to the underlying import plugin APIs, we can use PEP 451's existing "location" term. > I still would like a generalization of resource_stream() that allows opening > in text mode with a given encoding, I'd prefer a helper function that can be used to easily pass a resource stream to the builtin open() via its opener argument, rather than duplicating that functionality. Regards, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Mon Feb 2 14:31:39 2015 From: donald at stufft.io (Donald Stufft) Date: Mon, 2 Feb 2015 08:31:39 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: > On Feb 1, 2015, at 12:28 AM, Nick Coghlan wrote: > > On 1 February 2015 at 08:27, Brett Cannon wrote: >> As I said above, I partially feel like the desire for this support is to >> work around some API decisions that are somewhat poor. >> >> How about this: get_path(package, path, *, real=False) or get_path(package, >> filename, *, real=False) -- depending on whether Barry and me get our way >> about paths or you do, Donald -- where 'real' is a flag specifying whether >> the path has to work as a path argument to builtins.open() and thus fails >> accordingly (in instances where it won't work it can fail immediately and so >> loader implementers only have two lines of code to care about to manage it). >> Then loaders can keep their get_data() method without issue and the API for >> loaders only grew by 1 (or stays constant depending on whether we want/can >> have it subsume get_filename() long-term). 
> > Jumping in here, since I specifically object to the "real=<boolean flag>" API design concept (on the grounds that the presence of that > kind of flag means you have two different methods trying to get out), > this thread is already quite long and there are several different > aspects I'd like to comment on :) > > * I like the overall naming suggestion of referring to this as a > "resources" API. That not only has precedent in pkg_resources, but is > also the standard terminology for referring to this kind of thing in > rich client applications. (see > https://msdn.microsoft.com/en-us/library/windows/apps/hh465241.aspx > for example) > > * I think the PEP 302 approach of referring to resource anchors as > "paths" is inherently confusing, especially when the most common > anchor is __file__. As a result, I think we should refer to "resource > anchors" and "relative paths", rather than the current approach of > trying to create and pass around "absolute paths" (which then end up > only working properly when packages are installed to a real > filesystem). > > * I think Donald's overview at https://bpaste.net/show/0c490aa07c07 is > a good summary of the functionality we should aim to provide (naming > bikesheds aside) > > * I agree we should treat extraction and loading of C extension > modules (and shared libraries in general) as out of scope for the > resource API. They face several restrictions that don't apply to other > pure data files > > * I agree that the resource APIs should be for read-only access only. > Images, localisation strings, application templates, those are the > kinds of things this API is aimed at: they're an essential part of the > application, and hence it's appropriate to bundle them with it in a > way that still works for single-file zip archive applications, but > they're not Python code. 
> > * For the "must exist as a real shareable filesystem artefact, fail > immediately if that isn't possible" API, I think we should support > both implicit cleanup *and* explicit context managers for > deterministic resource control. "Make this available until I'm done > with it, regardless of where I use it" and "make this available for > this defined region of code" are different use cases. Depending on how > these objects are modelled in the API (more on that below), we could > potentially drop the atexit handler in favour of suitable > weakref.finalize() calls (which would then clean them up once the last > reference to the resource was dropped, rather than always waiting > until the end of the process - "keep this resource available until the > process ends" would then be a matter of reference it from the > appropriate module globals or some other similarly long lived data > structure). Leaks due to process crashes would then be cleaned up by > normal OS tempfile management processes. > > * I don't think we should couple the concept of resource anchors > directly to package names (as discussed, it doesn't work for namespace > packages, for example). I think we *should* be able to *look up* > resource anchors by package name, although this may fail in some cases > (such as namespace packages), and that the top level API should do > that lookup implicitly (allowing package names to be passed wherever > an anchor is expected). A module object should also be usable as its > own anchor. I believe we should disallow the use of filesystem paths > as resource anchors, as that breaks the intended abstraction (looking > resources up relative to the related modules), and the API behaviour > is clearer if strings are always assumed to be referring to > package/module names. > > * I *don't* think it's a good idea to incorporate this idea directly > onto the existing module Loader API. 
Better to create a new > "ResourceLoader" abstraction, such that we can easily provide a > default LocationResourceLoader. Reusing module Loader instances across > modules would still be permitted, reusing ResourceLoader instances > *would not*. This allows the resource anchor to be specified when > creating the resource loader, rather than on every call. > > * As a consequence of the previous point, the ResourceLoader instance > would be linked *from the module spec* (and perhaps from the module > globals), rather than from the module loader instance. (This is how we > would support using a module as its own anchor). Having a resource > loader defined in the spec would be optional, making it clear that > namespace modules (for example), don't provide a resource access API - > if you want to store resources inside a namespace package, you need to > create a submodule or self-contained subpackage to serve as the > resource anchor. > > * As a consequence of making a suitably configured resource loader > available through the module spec as part of the module finding > process it would become possible to access module relative resources > *without actually loading the module itself*. > > * If the import system gets a module spec where "spec.has_location" is > set and Loader.get_data is available, but the new > "spec.resource_loader" attribute is set to None, then it will set it > to "LocationResourceLoader(spec.origin)", which will rely solely on > Loader.get_data() for content access > > * We'd also provide an optimised FilesystemResourceLoader for use with > actual installed packages where the resources already exist on disk > and don't need to be copied to memory or a temporary directory to > provide a suitable API. 
> > * For abstract data access at the ResourceLoader API level, I like > "get_anchor()" (returning a suitably descriptive string such that > "os.path.join(anchor, <relative path>)" will work with get_data() on > the corresponding module Loader), "get_bytes(<relative path>)", > "get_bytestream(<relative path>)" and "get_filesystem_path(<relative > path>)". get_anchor() would be the minimum API, with default > implementations of the other three based on Loader.get_data(), BytesIO > and tempfile (this would involve suitable use of lazy or on-demand > imports for the latter two, as we'd need access to these from > importlib._bootstrap, but wouldn't want to load them on every > interpreter startup). > > * For the top-level API, I similarly favour > importlib.resources.get_bytes(), get_bytestream() and > get_filesystem_path(). However, I would propose that the latter be an > object implementing a to-be-defined subset of the pathlib Path API, > rather than a string. Resource listing, etc, would then be handled > through the existing Path abstraction, rather than defining a new one. > In the standard library, because we'd just be using a temporary > directory, we could use real Path objects (although we'd need to add > weakref support to them to implement the weakref.finalize suggestion I > make above) > >> As for importlib.resources, that can provide a higher-level API for a >> file-like object along with some way to say whether the file must be >> addressable on the filesystem to know if tempfile.NamedTemporaryFile() may >> be backing the file-like object or if io.BytesIO could provide the API. >> >> This gets me a clean API for loaders and importlib and gets you your real >> file paths as needed. > > Yep, as you can see above, I agree there are two APIs to be designed > here - the high level user facing one, and the one between the import > machinery and plugin authors. > > Cheers, > Nick. 
> > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia 

This all sounds reasonable to me, except maybe the weakref bit. I can imagine getting into some trouble if you do something like: 

ctx = SSLContext()
ctx.load_verify_locations(cafile=str(importlib.resources.get_filesystem_path()))

Using a weakref isn't a horrible idea though and I wouldn't be completely opposed to it, it would just mean that they have to be sure to keep around a reference to the pathlib style thing even if they need the path as a string and are going to cast pathlib into a str. The error message might be confusing because it'll work in the common case just fine since if the file is already on the file system nothing is going to get cleaned up, but lead to errors that only happen if you're using a zip import or similar. That kind of transient error feels like somewhat of a footgun. 

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA 

From brett at python.org Mon Feb 2 15:18:05 2015 From: brett at python.org (Brett Cannon) Date: Mon, 02 Feb 2015 14:18:05 +0000 Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: On Sun Feb 01 2015 at 12:28:46 AM Nick Coghlan wrote: > On 1 February 2015 at 08:27, Brett Cannon wrote: > > As I said above, I partially feel like the desire for this support is to > > work around some API decisions that are somewhat poor. 
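The footgun Donald describes above can be reproduced in a few lines. `ResourcePath` here is a hypothetical stand-in for the proposed pathlib-style object, not a real importlib API; the point is that `str()` hands back a plain string that does not keep the object, and hence its backing temp file, alive:

```python
import os
import tempfile
import weakref

class ResourcePath:
    """Hypothetical path-like handle whose backing temp file is removed
    once the handle itself is garbage collected (Nick's weakref.finalize
    suggestion)."""
    def __init__(self, data):
        fd, self._name = tempfile.mkstemp()
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        # finalize() holds only a weak reference to self, so the unlink
        # runs as soon as the last strong reference is dropped
        weakref.finalize(self, os.unlink, self._name)

    def __str__(self):
        return self._name

path = ResourcePath(b"-----BEGIN CERTIFICATE-----\n...")
name = str(path)
assert os.path.exists(name)  # fine while `path` is still referenced
del path                     # last reference gone: on CPython the file
                             # is removed immediately, before e.g.
                             # OpenSSL ever opens `name`
```

On a zip import this is exactly the transient failure Donald worries about; with an installed package nothing is extracted, so nothing ever disappears and the bug stays hidden.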
> > > > How about this: get_path(package, path, *, real=False) or > get_path(package, > > filename, *, real=False) -- depending on whether Barry and me get our way > > about paths or you do, Donald -- where 'real' is a flag specifying > whether > > the path has to work as a path argument to builtins.open() and thus fails > > accordingly (in instances where it won't work it can fail immediately > and so > > loader implementers only have two lines of code to care about to manage > it). > > Then loaders can keep their get_data() method without issue and the API > for > > loaders only grew by 1 (or stays constant depending on whether we > want/can > > have it subsume get_filename() long-term). > > Jumping in here, since I specifically object to the "real=<boolean flag>" API design concept (on the grounds that the presence of that > kind of flag means you have two different methods trying to get out), > this thread is already quite long and there are several different > aspects I'd like to comment on :) > > * I like the overall naming suggestion of referring to this as a > "resources" API. That not only has precedent in pkg_resources, but is > also the standard terminology for referring to this kind of thing in > rich client applications. (see > https://msdn.microsoft.com/en-us/library/windows/apps/hh465241.aspx > for example) > > * I think the PEP 302 approach of referring to resource anchors as > "paths" is inherently confusing, especially when the most common > anchor is __file__. As a result, I think we should refer to "resource > anchors" and "relative paths", rather than the current approach of > trying to create and pass around "absolute paths" (which then end up > only working properly when packages are installed to a real > filesystem). > Yes, which is what has made this whole discussion "fun". 
=) > > * I think Donald's overview at https://bpaste.net/show/0c490aa07c07 is > a good summary of the functionality we should aim to provide (naming > bikesheds aside) > > * I agree we should treat extraction and loading of C extension > modules (and shared libraries in general) as out of scope for the > resource API. They face several restrictions that don't apply to other > pure data files > I'm not even willing to go there with that. You can talk to Thomas Wouters at PyCon if you want to hear how he had tried to deal with it at Google. > > * I agree that the resource APIs should be for read-only access only. > Images, localisation strings, application templates, those are the > kinds of things this API is aimed at: they're an essential part of the > application, and hence it's appropriate to bundle them with it in a > way that still works for single-file zip archive applications, but > they're not Python code. > > * For the "must exist as a real shareable filesystem artefact, fail > immediately if that isn't possible" API, I think we should support > both implicit cleanup *and* explicit context managers for > deterministic resource control. "Make this available until I'm done > with it, regardless of where I use it" and "make this available for > this defined region of code" are different use cases. Depending on how > these objects are modelled in the API (more on that below), we could > potentially drop the atexit handler in favour of suitable > weakref.finalize() calls (which would then clean them up once the last > reference to the resource was dropped, rather than always waiting > until the end of the process - "keep this resource available until the > process ends" would then be a matter of reference it from the > appropriate module globals or some other similarly long lived data > structure). Leaks due to process crashes would then be cleaned up by > normal OS tempfile management processes. 
> > * I don't think we should couple the concept of resource anchors > directly to package names (as discussed, it doesn't work for namespace > packages, for example). I think we *should* be able to *look up* > resource anchors by package name, although this may fail in some cases > (such as namespace packages), and that the top level API should do > that lookup implicitly (allowing package names to be passed wherever > an anchor is expected). A module object should also be usable as its > own anchor. I believe we should disallow the use of filesystem paths > as resource anchors, as that breaks the intended abstraction (looking > resources up relative to the related modules), and the API behaviour > is clearer if strings are always assumed to be referring to > package/module names. > Not quite following here. So are you saying we should define the location as ('foo.bar', 'baz/file.txt') or as ('foo.bar.baz', 'file.txt')? You say you "don't think we should couple the concept of resource anchors directly" but then say "we should disallow the use of filesystem paths". > > * I *don't* think it's a good idea to incorporate this idea directly > onto the existing module Loader API. Better to create a new > "ResourceLoader" abstraction, such that we can easily provide a > default LocationResourceLoader. Reusing module Loader instances across > modules would still be permitted, reusing ResourceLoader instances > *would not*. This allows the resource anchor to be specified when > creating the resource loader, rather than on every call. > You do realize that importlib.abc.ResourceLoader already exists, right? Otherwise I'm rather confused by the terminology. =) And are you saying that we should have special rules for LocationResourceLoader instances such that you can not have to specify the anchoring package and thus force loader creators to provide unique instances per package? Or are you talking about some new thing that is tied to specs? 
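The anchor lookup Nick describes, accepting a dotted name, a module object, or a module spec and normalising before use, could be a small helper along these lines (`resolve_anchor` is a hypothetical name for illustration):

```python
import importlib.util
import types

def resolve_anchor(anchor):
    """Hypothetical helper: normalise the three proposed anchor forms
    (dotted module name, module object, module spec) to a ModuleSpec.

    Strings are always treated as module names, never as filesystem
    paths, per Nick's point about preserving the abstraction."""
    if isinstance(anchor, str):
        spec = importlib.util.find_spec(anchor)
        if spec is None:
            raise ImportError('no module named %r' % anchor)
        return spec
    if isinstance(anchor, types.ModuleType):
        return anchor.__spec__   # a module is usable as its own anchor
    return anchor                # assume it is already a ModuleSpec
```

The lookup-by-name branch is also where namespace packages would fail, since their specs carry no usable location.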
> > * As a consequence of the previous point, the ResourceLoader instance > would be linked *from the module spec* (and perhaps from the module > globals), rather than from the module loader instance. (This is how we > would support using a module as its own anchor). Having a resource > loader defined in the spec would be optional, making it clear that > namespace modules (for example), don't provide a resource access API - > if you want to store resources inside a namespace package, you need to > create a submodule or self-contained subpackage to serve as the > resource anchor. > So are you suggesting we add a new attribute to specs which would store a certain ABC subclass which implements an API for loading resources? > > * As a consequence of making a suitably configured resource loader > available through the module spec as part of the module finding > process it would become possible to access module relative resources > *without actually loading the module itself*. > OK, you are suggesting adding a new object type and attribute to specs. Can we call them "resource readers" so we don't conflate the "loader" term? And doing it through specs also means that the overhead of requiring the file name not have any directory parts is not extra overhead. > > * If the import system gets a module spec where "spec.has_location" is > set and Loader.get_data is available, but the new > "spec.resource_loader" attribute is set to None, then it will set it > to "LocationResourceLoader(spec.origin)", which will rely solely on > Loader.get_data() for content access > This is a little finicky. Are we going to simply say that we assume spec.origin is some path that works with os.path functions? Will Windows be okay if someone decided to standardize on / as a path separator instead of \ ? I get this buys us support from older loader implementations but I just want to make sure that it will work 80% of the time before we add more implicit magic to importlib. 
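The implicit fallback being questioned here amounts to something like the following sketch. `LocationResourceLoader` is the thread's proposed name, not an existing importlib class, and whether `spec.origin` is used as-is or needs a `dirname()` first (when origin names the module file itself) is part of what makes it finicky:

```python
import os

class LocationResourceLoaderSketch:
    """Sketch of the proposed default: resolve a resource key against
    the spec's origin and delegate the read to the module loader's
    get_data().  Assumes origin is a real module file path, so we take
    its directory; a spec pointing at a zip entry would need more care."""
    def __init__(self, spec):
        self.location = os.path.dirname(spec.origin)
        self.loader = spec.loader

    def get_bytes(self, resource_key):
        # Relies solely on Loader.get_data() for content access, as in
        # Nick's bullet; fails wherever __file__-based access fails today
        return self.loader.get_data(os.path.join(self.location, resource_key))
```

For a plain source package this works because `SourceFileLoader.get_data()` accepts an absolute filesystem path; Brett's separator question (/ vs \ on Windows) is exactly the part this sketch leaves to `os.path.join`.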
> > * We'd also provide an optimised FilesystemResourceLoader for use with > actual installed packages where the resources already exist on disk > and don't need to be copied to memory or a temporary directory to > provide a suitable API. > > * For abstract data access at the ResourceLoader API level, I like > "get_anchor()" (returning a suitably descriptive string such that > "os.path.join(anchor, <relative path>)" will work with get_data() on > the corresponding module Loader), I would rather call it get_location() since get_anchor() using 'anchor' seems to conflate what an anchor is representing. > "get_bytes(<relative path>)", > "get_bytestream(<relative path>)" and "get_filesystem_path(<relative > path>)". get_anchor() would be the minimum API, with default > implementations of the other three based on Loader.get_data(), BytesIO > and tempfile (this would involve suitable use of lazy or on-demand > imports for the latter two, as we'd need access to these from > importlib._bootstrap, but wouldn't want to load them on every > interpreter startup). > > * For the top-level API, I similarly favour > importlib.resources.get_bytes(), get_bytestream() and > get_filesystem_path(). However, I would propose that the latter be an > object implementing a to-be-defined subset of the pathlib Path API, > rather than a string. Resource listing, etc, would then be handled > through the existing Path abstraction, rather than defining a new one. > In the standard library, because we'd just be using a temporary > directory, we could use real Path objects (although we'd need to add > weakref support to them to implement the weakref.finalize suggestion I > make above) > Seems reasonable to me to start getting Path objects into the stdlib more. 
-Brett > > > As for importlib.resources, that can provide a higher-level API for a > > file-like object along with some way to say whether the file must be > > addressable on the filesystem to know if tempfile.NamedTemporaryFile() > may > > be backing the file-like object or if io.BytesIO could provide the API. > > > > This gets me a clean API for loaders and importlib and gets you your real > > file paths as needed. > > Yep, as you can see above, I agree there are two APIs to be designed > here - the high level user facing one, and the one between the import > machinery and plugin authors. > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > -------------- next part -------------- An HTML attachment was scrubbed... URL: 

From barry at python.org Mon Feb 2 22:53:03 2015 From: barry at python.org (Barry Warsaw) Date: Mon, 2 Feb 2015 16:53:03 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: <0BF27A19-2C53-4994-8455-CD19D9A05E5E@stufft.io> References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <20150131131142.27f7c682@marathon> <0BF27A19-2C53-4994-8455-CD19D9A05E5E@stufft.io> Message-ID: <20150202165303.533a4845@anarchist.wooz.org> On Jan 31, 2015, at 01:18 PM, Donald Stufft wrote: >I think it actually makes things *harder* from an implementation and >description standpoint. You're thinking in terms of implementation for the >FileLoader, but say for a PostgreSQLLoader now I have to create mock packages >for warehouse.templates and warehouse.templates.accounts whereas if we treat >the resource path not as a file path, but as a key for an object store where >'/' is slightly special then my PostgreSQL loader only needs to have a >'warehouse' package, and then a table that essentially does something like: 

> package   | resource key                    | data
> --------------------------------------------------
> warehouse | templates/accounts/profile.html | ...
> >In the FileLoader we'd obviously treat the / as path separators and create >directory entries, but in reality it's just a key: value store. I already >implemented one of these functions in a way that allows the / separator and I >would have had to have gone out of my way to disallow it rather than allow >it. So that would mean the API is actually: resource_whatever(resource_anchor, resource_key) and loaders would be free to interpret resource_key however they want, including *not* supporting some resource_keys, e.g. throw an exception. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From barry at python.org Mon Feb 2 22:56:45 2015 From: barry at python.org (Barry Warsaw) Date: Mon, 2 Feb 2015 16:56:45 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> <117995FA-74A1-473D-903F-48F1A601E57B@stufft.io> <20150201164305.72ea612d@limelight.wooz.org> Message-ID: <20150202165645.3013aaff@anarchist.wooz.org> On Feb 02, 2015, at 11:22 PM, Nick Coghlan wrote: >I'd prefer a helper function that can be used to easily pass a resource >stream to the builtin open() via its opener argument, rather than >duplicating that functionality. I'm not sure how much more useful that would be than just .decode()'ing the bytes that get returned from resource_stream().read(), which is essentially what I do with pkg_resources today anyway. It's not horribly inconvenient, it could just be nicer (IMHO). Not a big deal. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From erik.m.bray at gmail.com Mon Feb 2 23:34:59 2015 From: erik.m.bray at gmail.com (Erik Bray) Date: Mon, 2 Feb 2015 17:34:59 -0500 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: Message-ID: Sorry for the late reply, but I'm still catching up on this... On Fri, Jan 30, 2015 at 6:37 PM, Donald Stufft wrote: > It's often times useful to be able to load a resource from a Python module or > package. Currently you can load the data into memory using pkgutil.get_data > however this doesn't help much if you need to pass that data into an API that > only accepts a filepath. Currently code that needs to do this often times does > something like os.path.join(os.path.dirname(__file__), "myfile.txt"), however > that doesn't work from within a zip file. ... > A. What do people think about pkgutil.get_data_filename and > Loader.get_data_filename? Big +1 for me. This deficiency has been an annoyance to me for some time--just, I guess, not enough of an annoyance to propose any general solution. Astropy has some pretty hideous workarounds [1] to get resource loading from zipfiles working, and even then it only works for zipfile loaders and not other arbitrary loaders. Fortunately that's the only case I know of that matters to me (for use with PyInstaller and other such software bundlers). So I would love to see such an interface, and although one could argue about the details I think the interface you proposed for Loader.get_data_filename is fairly obvious and would have been my first pass proposal as well. > B. What do people think about modifying Loader.get_data so it can support > relative filenames instead of the calling code needing to handle that? Yes, please. Makes much more sense. 
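Proposal B, resolving relative names inside the loader rather than making every caller rebuild absolute paths from __file__, might look like this on a filesystem loader. This is a sketch of the idea, not the actual importlib change:

```python
import os

class RelativeDataLoader:
    """Sketch of proposal B: a loader whose get_data() resolves a
    relative resource name against the directory of the module it
    loaded, instead of requiring callers to pass absolute paths."""
    def __init__(self, path):
        self.path = path  # filesystem path of the module, as in FileLoader

    def get_data(self, resource):
        if not os.path.isabs(resource):
            # resolve relative to the module's own directory
            resource = os.path.join(os.path.dirname(self.path), resource)
        with open(resource, 'rb') as f:
            return f.read()
```

A zipimporter-style loader would do the same resolution against its archive-internal prefix, which is exactly what makes the idiom survive zip imports.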
Erik [1] https://github.com/embray/astropy/blob/issue-960/astropy/utils/data.py#L739 From ncoghlan at gmail.com Tue Feb 3 12:04:25 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 3 Feb 2015 21:04:25 +1000 Subject: [Import-SIG] Loading Resources From a Python Module/Package In-Reply-To: References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> <82EA9F13-3FC6-4A99-83FC-DCE2C44FDAC9@stufft.io> <75CB86FF-A527-4564-BAB3-9289446E05DB@stufft.io> Message-ID: On 3 Feb 2015 00:18, "Brett Cannon" wrote: > > > > On Sun Feb 01 2015 at 12:28:46 AM Nick Coghlan wrote: >> >> * I don't think we should couple the concept of resource anchors >> directly to package names (as discussed, it doesn't work for namespace >> packages, for example). I think we *should* be able to *look up* >> resource anchors by package name, although this may fail in some cases >> (such as namespace packages), and that the top level API should do >> that lookup implicitly (allowing package names to be passed wherever >> an anchor is expected). A module object should also be usable as its >> own anchor. I believe we should disallow the use of filesystem paths >> as resource anchors, as that breaks the intended abstraction (looking >> resources up relative to the related modules), and the API behaviour >> is clearer if strings are always assumed to be referring to >> package/module names. > > > Not quite following here. So are you saying we should define the location as ('foo.bar', 'baz/file.txt') or as ('foo.bar.baz', 'file.txt')? You say you "don't think we should couple the concept of resource anchors directly" but then say "we should disallow the use of filesystem paths". I like Donald's resource anchor/resource key terminology suggestion here. The "no filesystem paths" comment refers only to specifying the anchor point in the module namespace, not to the key within that anchor. >> * I *don't* think it's a good idea to incorporate this idea directly >> onto the existing module Loader API. 
Better to create a new >> "ResourceLoader" abstraction, such that we can easily provide a >> default LocationResourceLoader. Reusing module Loader instances across >> modules would still be permitted, reusing ResourceLoader instances >> *would not*. This allows the resource anchor to be specified when >> creating the resource loader, rather than on every call. > > You do realize that importlib.abc.ResourceLoader already exists, right? Otherwise I'm rather confused by the terminology. =) *blinks* Apparently I missed that. OK, guess I need a different name :) > And are you saying that we should have special rules for LocationResourceLoader instances such that you can not have to specify the anchoring package and thus force loader creators to provide unique instances per package? Or are you talking about some new thing that is tied to specs? Yes, I'm proposing each module will need its own resource reader, they won't be shareable the way module loaders are. I don't think the memory savings from sharing are worth the extra complexity. >> * As a consequence of the previous point, the ResourceLoader instance >> would be linked *from the module spec* (and perhaps from the module >> globals), rather than from the module loader instance. (This is how we >> would support using a module as its own anchor). Having a resource >> loader defined in the spec would be optional, making it clear that >> namespace modules (for example), don't provide a resource access API - >> if you want to store resources inside a namespace package, you need to >> create a submodule or self-contained subpackage to serve as the >> resource anchor. > > > So are you suggesting we add a new attribute to specs which would store a certain ABC subclass which implements an API for loading resources? Correct (although it would be "reading resources" with your suggested terminology tweak). 
>> * As a consequence of making a suitably configured resource loader >> available through the module spec as part of the module finding >> process it would become possible to access module relative resources >> *without actually loading the module itself*. > > OK, you are suggesting adding a new object type and attribute to specs. Can we call them "resource readers" so we don't conflate the "loader" term? Yep, that sounds like a good improvement to me. > And doing it through specs also means that the overhead of requiring the file name not have any directory parts is not extra overhead. I don't follow this part. I'm OK with resource keys having path separators in them. >> * If the import system gets a module spec where "spec.has_location" is >> set and Loader.get_data is available, but the new >> "spec.resource_loader" attribute is set to None, then it will set it >> to "LocationResourceLoader(spec.origin)", which will rely solely on >> Loader.get_data() for content access > > > This is a little finicky. Are we going to simply say that we assume spec.origin is some path that works with os.path functions? Will Windows be okay if someone decided to standardize on / as a path separator instead of \ ? I get this buys us support from older loader implementations but I just want to make sure that it will work 80% of the time before we add more implicit magic to importlib. We'd only be assuming that loader.get_data(os.path.join(spec.origin, resource_key)) works. It will fail in the same cases where using __file__ currently fails, but with a potential way to fix it (i.e. providing a custom resource reader when populating the module spec) >> * We'd also provide an optimised FilesystemResourceLoader for use with >> actual installed packages where the resources already exist on disk >> and don't need to be copied to memory or a temporary directory to >> provide a suitable API. 
>> >> * For abstract data access at the ResourceLoader API level, I like >> "get_anchor()" (returning a suitably descriptive string such that >> "os.path.join(anchor, <resource name>)" will work with get_data() on >> the corresponding module Loader), > > > I would rather call it get_location() since get_anchor() using 'anchor' seems to conflate what an anchor is representing. Yeah, I confused myself while writing that. I like anchor for the user facing API, location for the plugin level. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From encukou at gmail.com Fri Feb 20 15:56:50 2015 From: encukou at gmail.com (Petr Viktorin) Date: Fri, 20 Feb 2015 15:56:50 +0100 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading Message-ID: Hello list, I have taken Nick's challenge of extension module loading. I've read some of the relevant discussions, and bounced my ideas off Nick to see if I missed anything important. The main idea I realized, which was not obvious from the discussion, was that in addition to playing well with PEP 451 (ModuleSpec) and supporting subinterpreters and multiple Py_Initialize/Py_Finalize cycles, Nick's Create/Exec proposal allows executing the module in a "foreign", externally created module object. The main use case for that would be runpy and __main__, but lazy-loading mechanisms were mentioned that would benefit as well. As I was writing this down, I realized that once pre-created modules are allowed, it makes no sense to insist that they actually are module instances -- PyModule_Type provides little functionality above a plain object subclass. I'm not sure there are any use cases for this, but I don't see a reason to limit things artificially. Any bugs caused by allowing non-ModuleType modules are unlikely to be subtle, unless the custom object passes the "asked for it" line. Comments appreciated. 
--- PEP: XXX Title: Redesigning extension module loading Version: $Revision$ Last-Modified: $Date$ Author: Petr Viktorin , Stefan Behnel , Nick Coghlan BDFL-Delegate: "???" Discussions-To: "???" Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 11-Aug-2013 Python-Version: 3.5 Post-History: 23-Aug-2013, 20-Feb-2015 Resolution: Abstract ======== This PEP proposes a redesign of the way in which extension modules interact with the import machinery. This was last revised for Python 3.0 in PEP 3121, but did not solve all problems at the time. The goal is to solve them by bringing extension modules closer to the way Python modules behave; specifically to hook into the ModuleSpec-based loading mechanism introduced in PEP 451. Two ways to initialize a module, depending on the desired functionality, are proposed. The preferred form allows extension modules to be executed in pre-defined namespaces, paving the way for extension modules being runnable with Python's ``-m`` switch. Other modules can use arbitrary custom types for their module implementation, and are no longer restricted to types.ModuleType. Both ways make it easy to support properties at the module level and to safely store arbitrary global state in the module that is covered by normal garbage collection and supports reloading and sub-interpreters. Extension authors are encouraged to take these issues into account when using the new API. Motivation ========== Python modules and extension modules are not being set up in the same way. For Python modules, the module is created and set up first, then the module code is being executed (PEP 302). A ModuleSpec object (PEP 451) is used to hold information about the module, and passed to the relevant hooks. For extensions, i.e. shared libraries, the module init function is executed straight away and does both the creation and initialisation. 
The initialisation function is not passed ModuleSpec information about the loaded module, such as the __file__ or fully-qualified name. This hinders relative imports and resource loading. This is specifically a problem for Cython generated modules, for which it's not uncommon that the module init code has the same level of complexity as that of any 'regular' Python module. Also, the lack of __file__ and __name__ information hinders the compilation of __init__.py modules, i.e. packages, especially when relative imports are being used at module init time. The other disadvantage of the discrepancy is that existing Python programmers learning C cannot effectively map concepts between the two domains. As long as extension modules are fundamentally different from pure Python ones in the way they're initialised, they are harder for people to pick up without relying on something like cffi, SWIG or Cython to handle the actual extension module creation. Currently, extension modules are also not added to sys.modules until they are fully initialized, which means that a (potentially transitive) re-import of the module will really try to reimport it and thus run into an infinite loop when it executes the module init function again. Without the fully qualified module name, it is not trivial to correctly add the module to sys.modules either. Furthermore, the majority of currently existing extension modules have problems with sub-interpreter support and/or reloading, and, while it is possible with the current infrastructure to support these features, it is neither easy nor efficient. Addressing these issues was the goal of PEP 3121, but many extensions took the least-effort approach to porting to Python 3, leaving many of these issues unresolved. This PEP keeps the backwards-compatible behavior, which should reduce pressure and give extension authors adequate time to consider these issues when porting. 
The current process =================== Currently, extension modules export an initialisation function named "PyInit_modulename", named after the file name of the shared library. This function is executed by the import machinery and must return either NULL in the case of an exception, or a fully initialised module object. The function receives no arguments, so it has no way of knowing about its import context. During its execution, the module init function creates a module object based on a PyModuleDef struct. It then continues to initialise it by adding attributes to the module dict, creating types, etc. Behind the scenes, the shared library loader keeps a note of the fully qualified module name of the last module that it loaded, and when a module gets created that has a matching name, this global variable is used to determine the fully qualified name of the module object. This is not entirely safe as it relies on the module init function creating its own module object first, but this assumption usually holds in practice. The proposal ============ The current extension module initialisation will be deprecated in favour of a new initialisation scheme. Since the current scheme will continue to be available, existing code will continue to work unchanged, including binary compatibility. Extension modules that support the new initialisation scheme must export one or both of the public symbols "PyModuleCreate_modulename" and "PyModuleExec_modulename", where "modulename" is the name of the shared library. This mimics the previous naming convention for the "PyInit_modulename" function. These symbols, if defined, must resolve to C functions with the following signatures, respectively:: PyObject* (*PyModuleCreateFunction)(PyObject* module_spec) int (*PyModuleExecFunction)(PyObject* module) The PyModuleCreate function --------------------------- This PyModuleCreate function is used to implement "loader.create_module" defined in PEP 451. 
By exporting the "PyModuleCreate_modulename" symbol, an extension module indicates that it uses a custom module object. This prevents loading the extension in a pre-created module, but gives greater flexibility in allowing a custom C-level layout of the module object. The "module_spec" argument receives a "ModuleSpec" instance, as defined in PEP 451. When called, this function must create and return a module object. If "PyModuleExec_modulename" is undefined, this function must also initialize the module; see PyModuleExec_modulename for details on initialization. There is no requirement for the returned object to be an instance of types.ModuleType. Any type can be used. This follows the current support for allowing arbitrary objects in sys.modules and makes it easier for extension modules to define a type that exactly matches their needs for holding module state. The PyModuleExec function ------------------------- This PyModuleExec function is used to implement "loader.exec_module" defined in PEP 451. It is called after ModuleSpec-related attributes such as ``__loader__``, ``__spec__`` and ``__name__`` are set on the module. (The full list is in PEP 451 [#pep-0451-attributes]_) The "PyModuleExec_modulename" function will be called to initialize a module. This happens in two situations: when the module is first initialized for a given (sub-)interpreter, and when the module is reloaded. The "module" argument receives the module object. If PyModuleCreate is defined, this will be the object returned by it. If PyModuleCreate is not defined, PyModuleExec is expected to operate on any Python object for which attributes can be added by PyObject_SetAttr* and retrieved by PyObject_GetAttr*. Specifically, as the module may not be a PyModule_Type subclass, PyModule_* functions should not be used on it, unless they explicitly support operating on all objects. 
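At the Python level, this Create/Exec split mirrors PEP 451's create_module()/exec_module() pair. The following toy model — pure Python, with invented names; the real hooks are the C symbols described above — shows the calling sequence the import machinery would follow:

```python
import types

def load(spec, create=None, exec_hook=None):
    """Toy model of two-phase loading: Create (optional), then Exec."""
    if create is not None:
        module = create(spec)  # PyModuleCreate_*: may return any object
    else:
        module = types.ModuleType(spec.name)  # machinery's default module
    # The import machinery sets import-related attributes before exec
    # (PEP 451): __name__, __spec__, __loader__, and so on.
    module.__name__ = spec.name
    module.__spec__ = spec
    if exec_hook is not None:
        exec_hook(module)  # PyModuleExec_*: populate via attribute access
    return module

def demo_exec(module):
    # Uses only setattr-style access, so a non-ModuleType object works too.
    module.answer = 42

mod = load(types.SimpleNamespace(name="demo"), exec_hook=demo_exec)
```

Note how the exec step touches the module only through attribute access, which is what lets it run against a pre-created or even non-ModuleType object.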
Helper functions ---------------- For two initialization tasks previously done by PyModule_Create, two functions are introduced:: int PyModule_SetDocString(PyObject *m, const char *doc) int PyModule_AddFunctions(PyObject *m, PyMethodDef *functions) These set the module docstring, and add the module functions, respectively. Both will work on any Python object that supports setting attributes. They return zero on success, and on failure, they set the exception and return -1. Other changes ------------- The following functions and macros will be modified to work on any object that supports attribute access: * PyModule_GetNameObject * PyModule_GetName * PyModule_GetFilenameObject * PyModule_GetFilename * PyModule_AddIntConstant * PyModule_AddStringConstant * PyModule_AddIntMacro * PyModule_AddStringMacro * PyModule_AddObject Usage ===== This PEP allows three new ways of creating modules, each with its advantages and disadvantages. Exec-only --------- The preferred way to create C extensions is to define "PyModuleExec_modulename" only. This brings the following advantages: * The extension can be loaded into a pre-created module, making it possible to run them as ``__main__``, participate in certain lazy-loading schemes [#lazy_import_concerns]_, or enable other creative uses. * The module can be reloaded in the same way as Python modules. As Exec-only extension modules do not have C-level storage, all module-local data must be stored in the module object's attributes, possibly using the PyCapsule mechanism. XXX: Provide an example? Create-only ----------- Extensions defining only the "PyModuleCreate_modulename" hook behave similarly to current extensions. This is the easiest way to create modules that require custom module objects, or substantial per-module state at the C level (using positive ``PyModuleDef.m_size``). When the PyModuleCreate function is called, the module has not yet been added to sys.modules. 
Attempts to load the module again (possibly transitively) will result in an infinite loop. If user code needs to be called in module initialization, module authors are advised to do so from the PyModuleExec function. Reloading a Create-only module does nothing, except re-setting ModuleSpec-related attributes described in PEP 451 [#pep-0451-attributes]_. XXX: Provide an example? (It would be similar to the one in PEP 3121) Exec and Create --------------- Extensions that need to create a custom module object, and either need to run user code during initialization or support reloading, should define both "PyModuleCreate_modulename" and "PyModuleExec_modulename". XXX: Provide an example? Legacy Init ----------- If neither PyModuleExec nor PyModuleCreate is defined, the module is initialized using the PyModuleInit hook, as described in PEP 3121. If PyModuleExec or PyModuleCreate is defined, PyModuleInit will be ignored. Modules requiring compatibility with previous versions of CPython may implement PyModuleInit in addition to the new hooks. Subinterpreters and Interpreter Reloading ----------------------------------------- Extensions using the new initialization scheme are expected to support subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly. The mechanism is designed to make this easy, but care is still required on the part of the extension author. No user-defined functions, methods, or instances may leak to different interpreters. To achieve this, all module-level state should be kept in either the module dict, or in the module object. A simple rule of thumb is: Do not define any static data, except built-in types with no mutable or user-settable class attributes. Module Reloading ---------------- Extensions that support reloading must define PyModuleExec, which is called in reload() to re-initialize the module in place. The same caveats apply to reloading an extension module as to reloading a Python module. 
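In Python terms, reload-in-place just means running the exec hook a second time on the same module object. A minimal sketch (the exec_hook name is illustrative; the real hook is the PyModuleExec_* C symbol):

```python
import types

def exec_hook(module):
    # Re-initialises module state in place; must be safe to run repeatedly.
    module.counter = 0

mod = types.ModuleType("demo")
exec_hook(mod)    # initial load
mod.counter = 99  # state mutated at runtime
exec_hook(mod)    # reload(): same object, exec hook runs again
```

This is why the exec hook has to be written to tolerate being called more than once per module object.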
Note that due to limitations in shared library loading (both dlopen on POSIX and LoadLibraryEx on Windows), it is not generally possible to load a modified library after it has changed on disk. Therefore, reloading extension modules is of limited use. Multiple modules in one library ------------------------------- To support multiple Python modules in one shared library, the library must export all appropriate PyModuleExec_ or PyModuleCreate_ hooks for each exported module. The modules are loaded using a ModuleSpec with origin set to the name of the library file, and name set to the module name. Note that this mechanism can only be used to *load* such modules, not to *find* them. XXX: Provide an example of how to load such modules Implementation ============== XXX - not started Open issues =========== Now that PEP 442 is implemented, it would be nice if module finalization did not set all attributes to None. In this scheme, it is not possible to create a module with C-level state, which would be able to exec itself in any externally provided module object, short of putting PyCapsules in the module dict. The proposal repurposes PyModule_SetDocString, PyModule_AddObject, PyModule_AddIntMacro et al. to work on any object. Would it be better to have these in the PyObject namespace? We should expose some kind of API in importlib.util (or a better place?) that can be used to check that a module works with reloading and subinterpreters. The runpy module will need to be modified to take advantage of PEP 451 and this PEP. This might be out of scope for this PEP. Previous Approaches =================== Stefan Behnel's initial proto-PEP [#stefans_protopep]_ had a "PyInit_modulename" hook that would create a module class, whose ``__init__`` would be then called to create the module. This proposal did not correspond to the (then nonexistent) PEP 451, where module creation and initialization is broken into distinct steps. 
It also did not support loading an extension into pre-existing module objects. Nick Coghlan proposed the Create and Exec hooks, and wrote a prototype implementation [#nicks-prototype]_. At this time PEP 451 was still not implemented, so the prototype does not use ModuleSpec. References ========== .. [#lazy_import_concerns] https://mail.python.org/pipermail/python-dev/2013-August/128129.html .. [#pep-0451-attributes] https://www.python.org/dev/peps/pep-0451/#attributes .. [#stefans_protopep] https://mail.python.org/pipermail/python-dev/2013-August/128087.html .. [#nicks-prototype] https://mail.python.org/pipermail/python-dev/2013-August/128101.html Copyright ========= This document has been placed in the public domain. From ncoghlan at gmail.com Sat Feb 21 13:19:55 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 21 Feb 2015 22:19:55 +1000 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading In-Reply-To: References: Message-ID: On 21 February 2015 at 00:56, Petr Viktorin wrote: > Hello list, > > I have taken Nick's challenge of extension module loading. Thanks for tackling this! > I've read some of the relevant discussions, and bounced my ideas off Nick > to see if I missed anything important. > > The main idea I realized, which was not obvious from the discussion, > was that in addition to playing well with PEP 451 (ModuleSpec) and supporting > subinterpreters and multiple Py_Initialize/Py_Finalize cycles, > Nick's Create/Exec proposal allows executing the module in a "foreign", > externally created module object. The main use case for that would be runpy and > __main__, but lazy-loading mechanisms were mentioned that would benefit as well. For everyone else's reference: this actually came up in Petr's earlier off-list discussions with me, when I realised I'd had the "running extension modules as __main__" use case in mind myself, but never actually written that notion down anywhere. 
It's the one capability of PyModuleExec_* that simply doesn't exist today. > As I was writing this down, I realized that once pre-created modules are > allowed, it makes no sense to insist that they actually are module > instances -- PyModule_Type provides little functionality above a plain object > subclass. I'm not sure there are any use cases for this, but I don't see a > reason to limit things artificially. Any bugs caused by allowing > non-ModuleType modules are unlikely to be subtle, unless the custom object > passes the "asked for it" line. > > Comments appreciated. This generally looks good to me. Some more specific feedback inline below. > PEP: XXX > Title: Redesigning extension module loading For the BDFL-Delegate question: Brett would you be happy tackling this one? > Motivation > ========== > > Python modules and extension modules are not being set up in the same way. > For Python modules, the module is created and set up first, then the module > code is being executed (PEP 302). > A ModuleSpec object (PEP 451) is used to hole information about the module, > and pased to the relevant hooks. s/hole/hold/ s/pased/passed/ > Furthermore, the majority of currently existing extension modules has > problems with sub-interpreter support and/or reloading, and, while it is > it possible with the current infrastructure to support these > features, is neither easy nor efficient. > Addressing these issues was the goal of PEP 3121, but many extensions > took the least-effort approach to porting to Python 3, leaving many of these > issues unresolved. It's probably worth noting that some of those "least-effort" porting approaches are in the standard library: this PEP is about solving our own problems in addition to other people's. > Thius PEP keeps the backwards-compatible behavior, which should reduce pressure > and give extension authors adequate time to consider these issues when porting. 
s/thius/this/ > The proposal > ============ > > The current extension module initialisation will be deprecated in favour of > a new initialisation scheme. Since the current scheme will continue to be > available, existing code will continue to work unchanged, including binary > compatibility. > > Extension modules that support the new initialisation scheme must export one > or both of the public symbols "PyModuleCreate_modulename" and > "PyModuleExec_modulename", where "modulename" is the > name of the shared library. This mimics the previous naming convention for > the "PyInit_modulename" function. > > This symbols, if defined, must resolve to C functions with the following > signatures, respectively:: > > PyObject* (*PyModuleCreateFunction)(PyObject* module_spec) > int (*PyModuleExecFunction)(PyObject* module) For the Python level, the model we ended up with for 3.5 is: 1. create_module must exist, but may return None 2. exec_module must exist, but may have no effect on the module state For the new C level API, it's probably worth drawing the more explicit parallel to __new__ and __init__ on classes, where you can implement both of them if you want, but in most cases, implementing only one or the other will be sufficient. The reason I suggest that is because I was going to ask if we should make providing both APIs, or at least PyModuleExec_*, compulsory (based on the Python Loader API requirements), but thinking of the __new__/__init__ analogy made me realise that your current design makes sense, since dealing with it is confined specifically to the extension module loader implementation. > The PyModuleCreate function > --------------------------- > When called, this function must create and return a module object. > > If "PyModuleExec_module" is undefined, this function must also initialize > the module; see PyModuleExec_module for details on initialization. 
This should be clarified to point out that, as per PEP 451, the import machinery will still take care of setting the import related attributes after the loader returns the module from create_module. > There is no requirement for the returned object to be an instance of > types.ModuleType. Any type can be used. The requirement for the returned object to support getting and setting attributes (as per https://www.python.org/dev/peps/pep-0451/#attributes) should be defined here. > This follows the current > support for allowing arbitrary objects in sys.modules and makes it easier > for extension modules to define a type that exactly matches their needs for > holding module state. +1 > The PyModuleExec function > ------------------------- > > This PyModuleExec function is used to implement "loader.exec_module" > defined in PEP 451. > It is called after ModuleSpec-related attributes such as ``__loader__``, > ``__spec__`` and ``__name__`` are set on the module. > (The full list is in PEP 451 [#pep-0451-attributes]_) > > The "PyModuleExec_modulename" function will be called to initialize a module. > This happens in two situations: when the module is first initialized for > a given (sub-)interpreter, and when the module is reloaded. > > The "module" argument receives the module object. > If PyModuleCreate is defined, this will be the the object returned by it. > If PyModuleCreate is not defined, PyModuleExec is epected to operate > on any Python object for which attributes can be added by PyObject_GetAttr* > and retreived by PyObject_SetAttr*. > Specifically, as the module may not be a PyModule_Type subclass, > PyModule_* functions should not be used on it, unless they explicitly support > operating on all objects. I think this is too permissive on the interpreter side of things, thus making things more complicated than we'd like them to be for extension module authors. 
If PyModuleCreate_* is defined, PyModuleExec_* will receive the object returned there, while if it isn't defined, the interpreter *will* provide a PyModule_Type instance, as per PEP 451. However, permitting module authors to make the PyModule_Type (or a subclass) assumption in their implementation does introduce a subtle requirement on the implementation of both the load_module method, and on custom PyModuleExec_* functions that are paired with a PyModuleCreate_* function. Firstly, we need to enforce the following constraint in load_module: if the underlying C module does *not* define a custom PyModuleCreate_* function, and we're passed a module execution environment which is *not* an instance of PyModule_Type, then we should throw TypeError. By contrast, in the presence of a custom PyModuleCreate_* function, the requirement for checking the type of the execution environment (and throwing TypeError if the module can't handle it) should be delegated to the PyModuleExec_* function, and that will need to be documented appropriately. That keeps things simple in the default case (extension module authors just using PyModuleExec_* can continue to assume the use of PyModule_Type or a subclass), while allowing more flexibility in the "power user" case of creating your own module object. > Usage > ===== > > This PEP allows three new ways of creating modules, each with its > advantages and disadvantages. > > > Exec-only > --------- > > The preferred way to create C extensions is to define "PyModuleExec_modulename" > only. This brings the following advantages: > > * The extension can be loaded into a pre-created module, making it possible > to run them as ``__main__``, participate in certain lazy-loading schemes > [#lazy_import_concerns]_, or enable other creative uses. > * The module can be reloaded in the same way as Python modules. 
> > As Exec-only extension modules do not have C-level storage, > all module-local data must be stored in the module object's attributes, > possibly using the PyCapsule mechanism. With my suggested change above, this approach will also let module authors assume PyModule_Type (or a subclass), and have the interpreter enforce that assumption on their behalf. > Create-only > ----------- > > Extensions defining only the "PyModuleCreate_modulename" hook behave similarly > to current extensions. > > This is the easiest way to create modules that require custom module objects, > or substantial per-module state at the C level (using positive > ``PyModuleDef.m_size``). > > When the PyModuleCreate function is called, the module has not yet been added > to sys.modules. > Attempts to load the module again (possibly transitively) will result in an > infinite loop. > If user code needs to me called in module initialization, > module authors are advised to do so from the PyModuleExec function. > > Reloading a Create-only module does nothing, except re-setting > ModuleSpec-related attributes described in PEP 0451 [#pep-0451-attributes]. Another advantage of this approach is that you don't need to worry about potentially being passed a module object of an arbitrary type. > Exec and Create > --------------- > > Extensions that need to create a custom module object, > and either need to run user code during initialization or support reloading, > should define both "PyModuleCreate_modulename" and "PyModuleExec_modulename". This approach will have the downside of needing to check the type of the passed in module against the module implementation's assumptions. > Subinterpreters and Interpreter Reloading > ----------------------------------------- > > Extensions using the new initialization scheme are expected to support > subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly. 
> The mechanism is designed to make this easy, but care is still required > on the part of the extension author. > No user-defined functions, methods, or instances may leak to different > interpreters. > To achieve this, all module-level state should be kept in either the module > dict, or in the module object. > A simple rule of thumb is: Do not define any static data, except built-in types > with no mutable or user-settable class attributes. Worth noting here that this is why we consider it desirable to provide a utility somewhere in the standard library to make it easy to do these kinds of checks. At the very least we need it in the test.support module to do our own tests, but it would be preferable to have it as a supported API somewhere in the standard library. This isn't the only area where this kind of question of making it easier for people to test whether or not they're implementing or emulating a protocol correctly has come up - it's applicable to testing things like total ordering support in custom objects, operand precedence handling, ABC compliance, code generation, exception traceback manipulation, etc. Perhaps we should propose a new unittest submodule for compatibility and compliance tests that are too esoteric for the module top level, but we also don't want to ask people to write for themselves? > Module Reloading > ---------------- > > Extensions that support reloading must define PyModuleExec, which is called > in reload() to re-initialize the module in place. > The same caveats apply to reloading an extension module as to reloading > a Python module. Assuming you go with my suggestion regarding the PyModule_Type assumption above, that would be worth reiterating here. > Multiple modules in one library > ------------------------------- > > To support multiple Python modules in one shared library, the library > must export all appropriate PyModuleExec_ or PyModuleCreate_ hooks > for each exported module. 
> The modules are loaded using a ModuleSpec with origin set to the name of the > library file, and name set to the module name. > Note that this mechanism can only be used to *load* such modules, > not to *find* them. If I recall correctly, Brett already updated the extension module finder to handle locating such modules. It's either that or there's an existing issue on the tracker for it. > Open issues > =========== > > Now that PEP 442 is implemented, it would be nice if module finalization > did not set all attributes to None, Antoine added that in 3.4: http://bugs.python.org/issue18214 However, it wasn't entirely effective, as several extension modules still need to be hit with a sledgehammer to get them to drop references properly. Asking "Why is that so?" is actually one of the things that got me started digging into this area a couple of years back. > In this scheme, it is not possible to create a module with C-level state, > which would be able to exec itself in any externally provided module object, > short of putting PyCapsules in the module dict. I suspect "PyCapsule in the module dict" may be the right answer here, in which case some suitable documentation and perhaps some convenience APIs could be a good way to go. Relying on PyCapsule also has the advantage of potentially supporting better collaboration between extension modules, without needing to link them with each other directly. > The proposal repurposes PyModule_SetDocString, PyModule_AddObject, > PyModule_AddIntMacro et.al. to work on any object. > Would it be better to have these in the PyObject namespace? With my proposal above to keep the PyModule_Type assumption in most cases, I think it may be better to leave them alone entirely. If folks decide to allow non module types, they can decide to handle the consequences. > We should expose some kind of API in importlib.util (or a better place?) that > can be used to check that a module works with reloading and subinterpreters. 
See comments above on that. > The runpy module will need to be modified to take advantage of PEP 451 > and this PEP. This might out of scope for this PEP. I think it's out of scope, but runpy *does* need an internal redesign to take full advantage of PEP 451. Currently it works by attempting to extract the code object directly in most situations, whereas PEP 451 should let it rely almost entirely on exec_code instead (with direct execution used only when it's actually given a path directly to a Python source or bytecode file). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From vinay_sajip at yahoo.co.uk Mon Feb 23 00:02:35 2015 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Sun, 22 Feb 2015 23:02:35 +0000 (UTC) Subject: [Import-SIG] Loading Resources From a Python Module/Package References: <3483790F-8627-4F56-AEC4-A30A82295721@stufft.io> Message-ID: PJ Eby telecommunity.com> writes: > Apart from that, the implementations in pkg_resources can mostly be > pulled for reuse, as well as the interfaces, and I'd suggest doing > exactly that. There are a lot of non-obvious gotchas dealing with > zipfiles, and the implementation is fairly battle-hardened at this > point. I'm a bit late to this thread, but never mind :-) Another possibility is to use distlib's resource implementation, which offers the same basic functionality as is being proposed / pkg_resources, along with a higher level API which I find easier to use. The API works uniformly with both filesystem resources and resources in zip files. Basic usage: import distlib.resources finder = distlib.resources.Finder('package.name') resource = finder.find('resource.bin') # could be 'nested/resource.bin' If there's no such resource, find() returns None. Otherwise, you can do a number of things with resource via its properties: is_container - Whether this instance is a container of other resources. bytes - All of the resource data as a byte string. 
size - The size of the resource data in bytes.

resources - The relative names of all the contents of this resource.

path - This attribute is set by the resource's finder. It is a textual representation of the path, such that if a PEP 302 loader's get_data() method is called with the path, the resource's bytes are returned by the loader. This attribute is analogous to the resource_filename API in setuptools. Note that for resources in zip files, the path will be a pointer to the resource in the zip file, and not directly usable as a filename. While setuptools deals with this by extracting zip entries to a cache and returning filenames from the cache, this does not seem an appropriate thing to do in this package, as a resource is already made available to callers either as a stream or a string of bytes.

file_path - This attribute is the same as the path for file-based resources. For resources in a .zip file, the relevant resource is extracted to a file in a cache in the file system, and the name of the cached file is returned. This is for use with APIs that need file names, or need to be able to access data through OS-level file handles.

The resource also has an as_stream() method, which returns a binary stream of the resource's data. This must be closed by the caller when it's finished with.

There's a little more than I've mentioned here. Tutorial documentation is available at [1] and reference documentation is available at [2]. 
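The loader-backed behaviour that path describes (hand the path to get_data() and get the bytes back) is essentially what the stdlib's pkgutil.get_data() already does for packages. A minimal runnable sketch; the "demopkg" package and its "resource.bin" file are invented here purely for illustration:

```python
import os
import sys
import tempfile
import pkgutil

# Build a throwaway package with one data file, just for the demo.
tmp = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp, "demopkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg_dir, "resource.bin"), "wb") as f:
    f.write(b"hello")

sys.path.insert(0, tmp)

# pkgutil.get_data() resolves the package, then delegates to the
# package loader's get_data() to read the bytes.
data = pkgutil.get_data("demopkg", "resource.bin")
print(data)  # b'hello'
```

The same call works whether the package lives on the filesystem or inside a zip archive, since it goes through the loader rather than open().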
Regards,

Vinay Sajip

[1] http://distlib.readthedocs.org/en/latest/tutorial.html#using-the-resource-api
[2] http://distlib.readthedocs.org/en/latest/reference.html#the-distlib-resources-package

From encukou at gmail.com Mon Feb 23 14:18:25 2015
From: encukou at gmail.com (Petr Viktorin)
Date: Mon, 23 Feb 2015 14:18:25 +0100
Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading
In-Reply-To:
References:
Message-ID:

On Sat, Feb 21, 2015 at 1:19 PM, Nick Coghlan wrote: > On 21 February 2015 at 00:56, Petr Viktorin wrote: >> Hello list, >> >> I have taken Nick's challenge of extension module loading. >> >> Comments appreciated. > > This generally looks good to me. Some more specific feedback inline below. Thanks! I'll reply to the points where we don't 100% agree. >> PEP: XXX >> Title: Redesigning extension module loading > > For the BDFL-Delegate question: Brett would you be happy tackling this one? > > >> The proposal >> ============ >> >> The current extension module initialisation will be deprecated in favour of >> a new initialisation scheme. Since the current scheme will continue to be >> available, existing code will continue to work unchanged, including binary >> compatibility. >> >> Extension modules that support the new initialisation scheme must export one >> or both of the public symbols "PyModuleCreate_modulename" and >> "PyModuleExec_modulename", where "modulename" is the >> name of the shared library. This mimics the previous naming convention for >> the "PyInit_modulename" function. >> >> This symbols, if defined, must resolve to C functions with the following >> signatures, respectively:: >> >> PyObject* (*PyModuleCreateFunction)(PyObject* module_spec) >> int (*PyModuleExecFunction)(PyObject* module) > > For the Python level, the model we ended up with for 3.5 is: > > 1. create_module must exist, but may return None > 2. 
exec_module must exist, but may have no effect on the module state. It would make sense that PyModuleCreate may return None, just to better mirror PEP 451. I'll also point out that exec_module can put another object in sys.modules, to replace the module being loaded. > >> The PyModuleExec function >> ------------------------- >> >> This PyModuleExec function is used to implement "loader.exec_module" >> defined in PEP 451. >> It is called after ModuleSpec-related attributes such as ``__loader__``, >> ``__spec__`` and ``__name__`` are set on the module. >> (The full list is in PEP 451 [#pep-0451-attributes]_) >> >> The "PyModuleExec_modulename" function will be called to initialize a module. >> This happens in two situations: when the module is first initialized for >> a given (sub-)interpreter, and when the module is reloaded. >> >> The "module" argument receives the module object. >> If PyModuleCreate is defined, this will be the the object returned by it. >> If PyModuleCreate is not defined, PyModuleExec is epected to operate >> on any Python object for which attributes can be added by PyObject_GetAttr* >> and retreived by PyObject_SetAttr*. >> Specifically, as the module may not be a PyModule_Type subclass, >> PyModule_* functions should not be used on it, unless they explicitly support >> operating on all objects. > > I think this is too permissive on the interpreter side of things, thus > making things more complicated than we'd like them to be for extension > module authors. What complications are you thinking about? I was worried about this too, but I don't see the complications. I don't think there is enough difference between PyModule_Type and any object with getattr/setattr, either on the C or Python level. After initialization, the differences are: - Modules have a __dict__. But, as the docs say, "It is recommended extensions use other PyModule_*() and PyObject_*() functions rather than directly manipulate a module's __dict__." 
This would become a requirement. - The finalization is special. There have been efforts to remove this difference. Any problems here are for the custom-module-object provider (e.g. the lazy-load library) to sort out, the extension author shouldn't have to do anything extra. - There's a PyModuleDef usable for registration. - There's a custom __repr__. Currently there is a bunch of convenience functions/macros that only work on modules but do little more than get/setattr. They can easily be made to work on any object. > If PyModuleCreate_* is defined, PyModuleExec_* will receive the object > returned there, while if it isn't defined, the interpreter *will* > provide a PyModule_Type instance, as per PEP 451. > > However, permitting module authors to make the PyModule_Type (or a > subclass) assumption in their implementation does introduce a subtle > requirement on the implementation of both the load_module method, and > on custom PyModuleExec_* functions that are paired with a > PyModuleCreate_* function. > > Firstly, we need to enforce the following constraint in load_module: > if the underlying C module does *not* define a custom PyModuleCreate_* > function, and we're passed a module execution environment which is > *not* an instance of PyModule_Type, then we should throw TypeError. > > By contrast, in the presence of a custom PyModuleCreate_* function, > the requirement for checking the type of the execution environment > (and throwing TypeError if the module can't handle it) should be > delegated to the PyModuleExec_* function, and that will need to be > documented appropriately. > > That keeps things simple in the default case (extension module authors > just using PyModuleExec_* can continue to assume the use of > PyModule_Type or a subclass), while allowing more flexibility in the > "power user" case of creating your own module object. I see a different kind of simplicity in my proposal: Modules are just objects with a custom __repr__. 
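The point about getattr/setattr being sufficient can be sketched at the Python level: an exec-style initialiser that sticks to plain attribute access works identically on a real module object and on any other attribute-bearing object. The names below are invented for illustration:

```python
import types

def exec_module_like(module):
    # Only generic attribute access is used here, so "module" need not
    # be a types.ModuleType instance -- any object with settable
    # attributes will do.
    module.answer = 42
    module.doubled = module.answer * 2

# A real module object...
real_module = types.ModuleType("demo")
exec_module_like(real_module)

# ...and a plain object standing in for one.
class Namespace:
    """Any attribute-bearing object can play the role of a module."""

fake_module = Namespace()
exec_module_like(fake_module)

print(real_module.doubled, fake_module.doubled)  # 84 84
```

An initialiser that instead reached into a module's __dict__ or relied on PyModule_Type struct layout would only accept the first of the two.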
> >> Subinterpreters and Interpreter Reloading >> ----------------------------------------- >> >> Extensions using the new initialization scheme are expected to support >> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly. >> The mechanism is designed to make this easy, but care is still required >> on the part of the extension author. >> No user-defined functions, methods, or instances may leak to different >> interpreters. >> To achieve this, all module-level state should be kept in either the module >> dict, or in the module object. >> A simple rule of thumb is: Do not define any static data, except built-in types >> with no mutable or user-settable class attributes. > > Worth noting here that this is why we consider it desirable to provide > a utility somewhere in the standard library to make it easy to do > these kinds of checks. > > At the very least we need it in the test.support module to do our own > tests, but it would be preferable to have it as a supported API > somewhere in the standard library. > > This isn't the only area where this kind of question of making it > easier for people to test whether or not they're implementing or > emulating a protocol correctly has come up - it's applicable to > testing things like total ordering support in custom objects, operand > precedence handling, ABC compliance, code generation, exception > traceback manipulation, etc. > > Perhaps we should propose a new unittest submodule for compatibility > and compliance tests that are too esoteric for the module top level, > but we also don't want to ask people to write for themselves? The unittest submodule is out of scope here, but something I'd like to get involved in later. For now I'm going to put tests in test.support. 
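As a rough illustration of the reloading half of such a check, the sketch below probes whether reloading a module re-executes its body in the same module object. The helper name and its eventual home (importlib.util vs test.support) are hypothetical; only stdlib importlib machinery is used, and the "reload_demo" module is created on the fly for the example:

```python
import importlib
import os
import sys
import tempfile

def check_reload_support(name):
    """Import, clobber, and reload a module; return True if reload()
    re-ran the module body and preserved the module object identity."""
    module = importlib.import_module(name)
    module.counter = 99  # clobber state that the module body sets
    reloaded = importlib.reload(module)
    # A well-behaved module gets its state re-initialised in place.
    return reloaded is module and module.counter == 0

# A throwaway module, created just for the demo.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "reload_demo.py"), "w") as f:
    f.write("counter = 0\n")
sys.path.insert(0, tmp)

ok = check_reload_support("reload_demo")
print(ok)  # True
```

A subinterpreter check would need more machinery (spawning a fresh interpreter and importing there), which is why test.support is a plausible first home for it.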
> >> Open issues >> =========== >> >> Now that PEP 442 is implemented, it would be nice if module finalization >> did not set all attributes to None, > > Antoine added that in 3.4: http://bugs.python.org/issue18214 > > However, it wasn't entirely effective, as several extension modules > still need to be hit with a sledgehammer to get them to drop > references properly. Asking "Why is that so?" is actually one of the > things that got me started digging into this area a couple of years > back. Ah. I had a note about this from reading the discussion, but now I see this is out of scope (aside from checking things don't break on this front). Issue closed :) >> In this scheme, it is not possible to create a module with C-level state, >> which would be able to exec itself in any externally provided module object, >> short of putting PyCapsules in the module dict. > > I suspect "PyCapsule in the module dict" may be the right answer here, > in which case some suitable documentation and perhaps some convenience > APIs could be a good way to go. Right, I'll put them in the next version. > Relying on PyCapsule also has the advantage of potentially supporting > better collaboration between extension modules, without needing to > link them with each other directly. Well, I'd argue that in most cases where you want two extensions to collaborate, a public Python API would be useful enough to justify its costs. Maintaining an ABI for capsule contents, on the other hand, might not be worth it. I'd rather promote cffi/Cython wrappers than this. But of course, it is possible if people want to go for it. >> The runpy module will need to be modified to take advantage of PEP 451 >> and this PEP. This might out of scope for this PEP. > > I think it's out of scope, but runpy *does* need an internal redesign > to take full advantage of PEP 451. 
Currently it works by attempting to > extract the code object directly in most situations, whereas PEP 451 > should let it rely almost entirely on exec_code instead (with direct > execution used only when it's actually given a path directly to a > Python source or bytecode file).

Yes. If it was simple I'd include it, but this effort wants its own issue, if not a PEP.

From ncoghlan at gmail.com Mon Feb 23 14:47:15 2015
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 23 Feb 2015 23:47:15 +1000
Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading
In-Reply-To:
References:
Message-ID:

On 23 February 2015 at 23:18, Petr Viktorin wrote: > On Sat, Feb 21, 2015 at 1:19 PM, Nick Coghlan wrote: >> On 21 February 2015 at 00:56, Petr Viktorin wrote: >>> The "module" argument receives the module object. >>> If PyModuleCreate is defined, this will be the the object returned by it. >>> If PyModuleCreate is not defined, PyModuleExec is epected to operate >>> on any Python object for which attributes can be added by PyObject_GetAttr* >>> and retreived by PyObject_SetAttr*. >>> Specifically, as the module may not be a PyModule_Type subclass, >>> PyModule_* functions should not be used on it, unless they explicitly support >>> operating on all objects. >> >> I think this is too permissive on the interpreter side of things, thus >> making things more complicated than we'd like them to be for extension >> module authors. > > What complications are you thinking about? I was worried about this > too, but I don't see the complications. I don't think there is enough > difference between PyModule_Type and any object with getattr/setattr, > either on the C or Python level. After initialization, the differences > are: > - Modules have a __dict__. But, as the docs say, "It is recommended > extensions use other PyModule_*() and PyObject_*() functions rather > than directly manipulate a module's __dict__." This would become a > requirement. > - The finalization is special. 
There have been efforts to remove this > difference. Any problems here are for the custom-module-object > provider (e.g. the lazy-load library) to sort out, the extension > author shouldn't have to do anything extra. > - There's a PyModuleDef usable for registration. > - There's a custom __repr__. > Currently there is a bunch of convenience functions/macros that only > work on modules but do little more than get/setattr. They can easily be > made to work on any object.

It occurs to me that we'd like folks to steer clear of relying on struct layout details anyway (to help promote use of the stable ABI), so yeah, I think you've persuaded me that the more general "expect an object that supports setting & getting attributes, but still check your error codes appropriately" directive for module authors using the new initialisation API is a good way to go.

For the other areas, I'll mostly wait until I see the next draft before commenting further. However, I will note that the difference I see between create_module becoming compulsory (but allowed to return None) and whether or not PyModuleCreate_* should also be optional (in addition to letting it return None) is that the latter would need to be added at a *per-module* level for everyone writing extension modules using the new API, while create_module only exists at a *per-loader* level. That changes the equation for who pays the cost of making the method optional.

For create_module:
* if it's mandatory, cost is borne by loader authors, but importlib provides a default impl that returns None
* if it's optional, cost is borne by the already complex import system and anyone else manipulating loaders directly

So making create_module mandatory is likely to reduce the net complexity of the overall system. 
For PyModuleCreate_*: * if it's mandatory, cost is borne by every extension module author as a bit of standard boilerplate they have to add * if it's optional, cost is borne in the create_module implementation for the updated extension module loader, and anyone writing their own custom extension module loader (which is even more unusual than interacting with loaders directly) Here, I think the relative frequency of the two activities (writing extension modules vs writing extension module loaders) favours making the C level module creation function entirely optional in addition to letting it return None. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From encukou at gmail.com Mon Feb 23 14:57:35 2015 From: encukou at gmail.com (Petr Viktorin) Date: Mon, 23 Feb 2015 14:57:35 +0100 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading In-Reply-To: References: Message-ID: On Mon, Feb 23, 2015 at 2:47 PM, Nick Coghlan wrote: > On 23 February 2015 at 23:18, Petr Viktorin wrote: >> On Sat, Feb 21, 2015 at 1:19 PM, Nick Coghlan wrote: >>> On 21 February 2015 at 00:56, Petr Viktorin wrote: >>>> The "module" argument receives the module object. >>>> If PyModuleCreate is defined, this will be the the object returned by it. >>>> If PyModuleCreate is not defined, PyModuleExec is epected to operate >>>> on any Python object for which attributes can be added by PyObject_GetAttr* >>>> and retreived by PyObject_SetAttr*. >>>> Specifically, as the module may not be a PyModule_Type subclass, >>>> PyModule_* functions should not be used on it, unless they explicitly support >>>> operating on all objects. >>> >>> I think this is too permissive on the interpreter side of things, thus >>> making things more complicated than we'd like them to be for extension >>> module authors. >> >> What complications are you thinking about? I was worried about this >> too, but I don't see the complications. 
I don't think there is enough >> difference between PyModule_Type and any object with getattr/setattr, >> either on the C or Python level. After initialization, the differences >> are: >> - Modules have a __dict__. But, as the docs say, "It is recommended >> extensions use other PyModule_*() and PyObject_*() functions rather >> than directly manipulate a module's __dict__." This would become a >> requirement. >> - The finalization is special. There have been efforts to remove this >> difference. Any problems here are for the custom-module-object >> provider (e.g. the lazy-load library) to sort out, the extension >> author shouldn't have to do anything extra. >> - There's a PyModuleDef usable for registration. >> - There's a custom __repr__. >> Currently there is a bunch of convenience functions/macros that only >> work on modules but do little more than get/setattr. They can easily be >> made to work on any object. > > It occurs to me that we'd like folks to steer clear of relying on > struct layout details anyway (to help promote use of the stable ABI), > so yeah, I think you've persuaded me that the more general "expect an > object that supports setting & getting attributes, but still check > your error codes appropriately" directive for module authors using the > new initialisation API is a good way to go. > > For the other areas, I'll mostly wait until I see the next draft > before commenting further. > > However, I will note that the difference I see between create_module > becoming compulsory (but allowed to return None) and whether or not > PyModuleCreate_* should also be optional (in addition to letting it > return None) is that the latter would need to be added at a > *per-module* level for everyone writing extension modules using the > new API, while create_module only exists at a *per-loader* level. 
That > changes the equation for who pays the cost of making the method > optional > > For create_module: > * if it's mandatory, cost is borne by loader authors, but importlib > provides a default impl that returns None > * if it's optional, cost is borne by the already complex import > system and anyone else manipulating loaders directly > > So making create_module mandatory is likely to reduce the net > complexity of the overall system. > > For PyModuleCreate_*: > > * if it's mandatory, cost is borne by every extension module author > as a bit of standard boilerplate they have to add > * if it's optional, cost is borne in the create_module > implementation for the updated extension module loader, and anyone > writing their own custom extension module loader (which is even more > unusual than interacting with loaders directly) > > Here, I think the relative frequency of the two activities (writing > extension modules vs writing extension module loaders) favours making > the C level module creation function entirely optional in addition to > letting it return None. > Right, we're on the same page here. Thanks for spelling it out. I'll update the draft later this week, when I have a block of time for it. From brett at python.org Mon Feb 23 16:16:12 2015 From: brett at python.org (Brett Cannon) Date: Mon, 23 Feb 2015 15:16:12 +0000 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading References: Message-ID: I mostly have grammar/typo comments and one suggestion to minimize the number of ways of initializing a module by not letting PyModuleCreate_* do that step on its own. Otherwise the approach seems to be on the right track for what we need for extension loading. Thanks for taking this on! On Fri Feb 20 2015 at 9:57:16 AM Petr Viktorin wrote: > Hello list, > > I have taken Nick's challenge of extension module loading. > I've read some of the relevant discussions, and bounced my ideas off Nick > to see if I missed anything important. 
> > The main idea I realized, which was not obvious from the discussion, > was that in addition to playing well with PEP 451 (ModuleSpec) and > supporting > subinterpreters and multiple Py_Initialize/Py_Finalize cycles, > Nick's Create/Exec proposal allows executing the module in a "foreign", > externally created module object. The main use case for that would be > runpy and > __main__, but lazy-loading mechanisms were mentioned that would benefit as > well. > > As I was writing this down, I realized that once pre-created modules are > allowed, it makes no sense to insist that they actually are module > instances -- PyModule_Type provides little functionality above a plain > object > subclass. I'm not sure there are any use cases for this, but I don't see a > reason to limit things artificially. Any bugs caused by allowing > non-ModuleType modules are unlikely to be subtle, unless the custom object > passes the "asked for it" line. > > Comments appreciated. > > > --- > > > PEP: XXX > Title: Redesigning extension module loading > Version: $Revision$ > Last-Modified: $Date$ > Author: Petr Viktorin , Stefan Behnel at behnel.de>, Nick Coghlan > BDFL-Delegate: "???" > Discussions-To: "???" > Status: Draft > Type: Standards Track > Content-Type: text/x-rst > Created: 11-Aug-2013 > Python-Version: 3.5 > Post-History: 23-Aug-2013, 20-Feb-2015 > Resolution: > > > Abstract > ======== > > This PEP proposes a redesign of the way in which extension modules interact > with the import machinery. This was last revised for Python 3.0 in PEP > 3121, but did not solve all problems at the time. The goal is to solve them > by bringing extension modules closer to the way Python modules behave; > specifically to hook into the ModuleSpec-based loading mechanism > introduced in PEP 451. > > Two ways to initialize a module, depending on the desired functionality, > are proposed. 
> > The preferred form allows extension modules to be executed in pre-defined > namespaces, paving the way for extension modules being runnable with > Python's > ``-m`` switch. > > Other modules can use arbitrary custom types for their module > implementation, > and are no longer restricted to types.ModuleType. > > Both ways make it easy to support properties at the module > level and to safely store arbitrary global state in the module that is > covered by normal garbage collection and supports reloading and > sub-interpreters. > Extension authors are encouraged to take these issues into account > when using the new API. > > > > Motivation > ========== > > Python modules and extension modules are not being set up in the same way. > For Python modules, the module is created and set up first, then the module > code is being executed (PEP 302). > A ModuleSpec object (PEP 451) is used to hole information about the module, > "hole" -> "hold" > and pased to the relevant hooks. > "pased" -> "passed" > For extensions, i.e. shared libraries, the module > init function is executed straight away and does both the creation and > initialisation. The initialisation function is not passed ModuleSpec > information about the loaded module, such as the __file__ or > fully-qualified > name > This hinders relative imports and resource loading. > > This is specifically a problem for Cython generated modules, for which it's > not uncommon that the module init code has the same level of complexity as > that of any 'regular' Python module. Also, the lack of __file__ and > __name__ > information hinders the compilation of __init__.py modules, i.e. packages, > especially when relative imports are being used at module init time. > > The other disadvantage of the discrepancy is that existing Python > programmers > learning C cannot effectively map concepts between the two domains. 
> As long as extension modules are fundamentally different from pure Python > ones > in the way they're initialised, they are harder for people to pick up > without > relying on something like cffi, SWIG or Cython to handle the actual > extension > module creation. > > Currently, extension modules are also not added to sys.modules until they > are > fully initialized, which means that a (potentially transitive) > re-import of the module will really try to reimport it and thus run into an > infinite loop when it executes the module init function again. > Without the fully qualified module name, it is not trivial to correctly add > the module to sys.modules either. > > Furthermore, the majority of currently existing extension modules has > problems with sub-interpreter support and/or reloading, and, while it is > it possible with the current infrastructure to support these > features, is neither easy nor efficient. > "is neither" -> "it is neither" > Addressing these issues was the goal of PEP 3121, but many extensions > took the least-effort approach to porting to Python 3, leaving many of > these > issues unresolved. > Thius PEP keeps the backwards-compatible behavior, which should reduce > pressure > "Thius" -> "Thus" > and give extension authors adequate time to consider these issues when > porting. > > > The current process > =================== > > Currently, extension modules export an initialisation function named > "PyInit_modulename", named after the file name of the shared library. This > function is executed by the import machinery and must return either NULL in > the case of an exception, or a fully initialised module object. The > function receives no arguments, so it has no way of knowing about its > import context. > > During its execution, the module init function creates a module object > based on a PyModuleDef struct. It then continues to initialise it by adding > attributes to the module dict, creating types, etc. 
> > In the back, the shared library loader keeps a note of the fully qualified > module name of the last module that it loaded, and when a module gets > created that has a matching name, this global variable is used to determine > the fully qualified name of the module object. This is not entirely safe > as it > relies on the module init function creating its own module object first, > but this assumption usually holds in practice. > > > The proposal > ============ > > The current extension module initialisation will be deprecated in favour of > a new initialisation scheme. Since the current scheme will continue to be > available, existing code will continue to work unchanged, including binary > compatibility. > > Extension modules that support the new initialisation scheme must export > one > or both of the public symbols "PyModuleCreate_modulename" and > "PyModuleExec_modulename", where "modulename" is the > name of the shared library. This mimics the previous naming convention for > the "PyInit_modulename" function. > > This symbols, if defined, must resolve to C functions with the following > "This" -> "These" > signatures, respectively:: > > PyObject* (*PyModuleCreateFunction)(PyObject* module_spec) > int (*PyModuleExecFunction)(PyObject* module) > > > The PyModuleCreate function > --------------------------- > > This PyModuleCreate function is used to implement "loader.create_module" > defined in PEP 451. > > By exporting the "PyModuleCreate_modulename" symbol, an extension module > indicates that it uses a custom module object. > > This prevents loading the extension in a pre-created module, > but gives greater flexibility in allowing a custom C-level layout > of the module object. > > The "module_spec" argument receives a "ModuleSpec" instance, as defined in > PEP 451. > > When called, this function must create and return a module object. 
> > If "PyModuleExec_module" is undefined, this function must also initialize > the module; see PyModuleExec_module for details on initialization. > Why conflate module creation with initialization? If one is going to have initialization code then it can't be difficult to factor out into a PyModuleExec_* function, so I don't see a good reason to support only defining PyModuleCreate_*. > > There is no requirement for the returned object to be an instance of > types.ModuleType. Any type can be used. This follows the current > support for allowing arbitrary objects in sys.modules and makes it easier > for extension modules to define a type that exactly matches their needs for > holding module state. > > > The PyModuleExec function > ------------------------- > > This PyModuleExec function is used to implement "loader.exec_module" > defined in PEP 451. > It is called after ModuleSpec-related attributes such as ``__loader__``, > ``__spec__`` and ``__name__`` are set on the module. > (The full list is in PEP 451 [#pep-0451-attributes]_) > > The "PyModuleExec_modulename" function will be called to initialize a > module. > This happens in two situations: when the module is first initialized for > a given (sub-)interpreter, and when the module is reloaded. > > The "module" argument receives the module object. > If PyModuleCreate is defined, this will be the the object returned by it. > If PyModuleCreate is not defined, PyModuleExec is epected to operate > "epected" -> "expected" > on any Python object for which attributes can be added by PyObject_GetAttr* > and retreived by PyObject_SetAttr*. > "retreived" -> "retrieved" > Specifically, as the module may not be a PyModule_Type subclass, > PyModule_* functions should not be used on it, unless they explicitly > support > operating on all objects. 
> > > Helper functions > ---------------- > > For two initialization tasks previously done by PyModule_Create, > two functions are introduced:: > > int PyModule_SetDocString(PyObject *m, const char *doc) > int PyModule_AddFunctions(PyObject *m, PyMethodDef *functions) > > These set the module docstring, and add the module functions, respectively. > Both will work on any Python object that supports setting attributes. > They return zero on success, and on failure, they set the exception > and return -1. > > > Other changes > ------------- > > The following functions and macros will be modified to work on any object > that supports attribute access: > > * PyModule_GetNameObject > * PyModule_GetName > * PyModule_GetFilenameObject > * PyModule_GetFilename > * PyModule_AddIntConstant > * PyModule_AddStringConstant > * PyModule_AddIntMacro > * PyModule_AddStringMacro > * PyModule_AddObject > > Usage > ===== > > This PEP allows three new ways of creating modules, each with its > advantages and disadvantages. > > Exec-only > --------- > > The preferred way to create C extensions is to define > "PyModuleExec_modulename" > only. This brings the following advantages: > > * The extension can be loaded into a pre-created module, making it possible > to run them as ``__main__``, participate in certain lazy-loading schemes > [#lazy_import_concerns]_, or enable other creative uses. > * The module can be reloaded in the same way as Python modules. > > As Exec-only extension modules do not have C-level storage, > all module-local data must be stored in the module object's attributes, > possibly using the PyCapsule mechanism. > > XXX: Provide an example? > > > Create-only > ----------- > > Extensions defining only the "PyModuleCreate_modulename" hook behave > similarly > to current extensions. > If we are going to bother with allowing module creation then I would rather either have people stay with the old way or completely move over to the new way and not switch over only partially. 
Supporting this create-and-initialize also breaks with the Python analog that the rest of this PEP promotes. > > This is the easiest way to create modules that require custom module > objects, > or substantial per-module state at the C level (using positive > ``PyModuleDef.m_size``). > > When the PyModuleCreate function is called, the module has not yet been > added > to sys.modules. > Attempts to load the module again (possibly transitively) will result in an > infinite loop. > If user code needs to me called in module initialization, > "me" -> "be" > module authors are advised to do so from the PyModuleExec function. > > Reloading a Create-only module does nothing, except re-setting > ModuleSpec-related attributes described in PEP 0451 [#pep-0451-attributes]. > > XXX: Provide an example? (It would be similar to the one in PEP 3121) > > > Exec and Create > --------------- > > Extensions that need to create a custom module object, > and either need to run user code during initialization or support > reloading, > should define both "PyModuleCreate_modulename" and > "PyModuleExec_modulename". > > XXX: Provide an example? > > If you drop the ability for PyModuleCreate_* to also initialize then you will really only have 1 way to import a module; it happens to have an optional module creation step. If you do drop it then the opening line for this section is misleading. > > Legacy Init > ----------- > > If neither PyModuleExec nor PyModuleCreate is defined, the module is > initialized using the PyModuleInit hook, as described in PEP 3121. > > If PyModuleExec or PyModuleCreate is defined, PyModuleInit will be ignored. > Modules requiring compatibility with previous versions of CPython may > implement > PyModuleInit in addition to the new hooks. 
> > > Subinterpreters and Interpreter Reloading > ----------------------------------------- > > Extensions using the new initialization scheme are expected to support > subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly. > The mechanism is designed to make this easy, but care is still required > on the part of the extension author. > No user-defined functions, methods, or instances may leak to different > interpreters. > To achieve this, all module-level state should be kept in either the module > dict, or in the module object. > A simple rule of thumb is: Do not define any static data, except built-in > types > with no mutable or user-settable class attributes. > > > Module Reloading > ---------------- > > Extensions that support reloading must define PyModuleExec, which is called > in reload() to re-initialize the module in place. > The same caveats apply to reloading an extension module as to reloading > a Python module. > > Note that due to limitations in shared library loading (both dlopen on > POSIX > and LoadLibraryEx on Windows), it is not generally possible to load a > modified > library after it has changed on disk. > Therefore, reloading extension modules is of limited use. > > > Multiple modules in one library > ------------------------------- > > To support multiple Python modules in one shared library, the library > must export all appropriate PyModuleExec_ or PyModuleCreate_ > hooks > for each exported module. > The modules are loaded using a ModuleSpec with origin set to the name of > the > library file, and name set to the module name. > Note that this mechanism can only be used to *load* such modules, > not to *find* them.
> > XXX: Provide an example of how to load such modules > > > Implementation > ============== > > XXX - not started > > > Open issues > =========== > > Now that PEP 442 is implemented, it would be nice if module finalization > did not set all attributes to None, > > In this scheme, it is not possible to create a module with C-level state, > which would be able to exec itself in any externally provided module > object, > short of putting PyCapsules in the module dict. > > The proposal repurposes PyModule_SetDocString, PyModule_AddObject, > PyModule_AddIntMacro et al. to work on any object. > Would it be better to have these in the PyObject namespace? > No. They are setting explicit attributes that are meant only for modules so it's more generalization than is necessary to rename them. > > We should expose some kind of API in importlib.util (or a better place?) > that > can be used to check that a module works with reloading and > subinterpreters. > What would such an API actually check to verify that a module could be reloaded? > > The runpy module will need to be modified to take advantage of PEP 451 > and this PEP. This might be out of scope for this PEP. > > > > Previous Approaches > =================== > > Stefan Behnel's initial proto-PEP [#stefans_protopep]_ > had a "PyInit_modulename" hook that would create a module class, > whose ``__init__`` would be then called to create the module. > This proposal did not correspond to the (then nonexistent) PEP 451, > where module creation and initialization is broken into distinct steps. > It also did not support loading an extension into pre-existing module > objects. > > Nick Coghlan proposed the Create and Exec hooks, and wrote a prototype > implementation [#nicks-prototype]_. > At this time PEP 451 was still not implemented, so the prototype > does not use ModuleSpec. > > > References > ========== > > .. [#lazy_import_concerns] > https://mail.python.org/pipermail/python-dev/2013-August/128129.html > > ..
[#pep-0451-attributes] > https://www.python.org/dev/peps/pep-0451/#attributes > > .. [#stefans_protopep] > https://mail.python.org/pipermail/python-dev/2013-August/128087.html > > .. [#nicks-prototype] > https://mail.python.org/pipermail/python-dev/2013-August/128101.html > > > Copyright > ========= > > This document has been placed in the public domain. > _______________________________________________ > Import-SIG mailing list > Import-SIG at python.org > https://mail.python.org/mailman/listinfo/import-sig > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Mon Feb 23 16:29:17 2015 From: brett at python.org (Brett Cannon) Date: Mon, 23 Feb 2015 15:29:17 +0000 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading References: Message-ID: On Sat Feb 21 2015 at 7:27:19 AM Nick Coghlan wrote: > On 21 February 2015 at 00:56, Petr Viktorin wrote: > > Hello list, > > > > I have taken Nick's challenge of extension module loading. > > Thanks for tackling this! > > > I've read some of the relevant discussions, and bounced my ideas off Nick > > to see if I missed anything important. > > > > The main idea I realized, which was not obvious from the discussion, > > was that in addition to playing well with PEP 451 (ModuleSpec) and > supporting > > subinterpreters and multiple Py_Initialize/Py_Finalize cycles, > > Nick's Create/Exec proposal allows executing the module in a "foreign", > > externally created module object. The main use case for that would be > runpy and > > __main__, but lazy-loading mechanisms were mentioned that would benefit > as well. > > For everyone else's reference: this actually came up in Petr's earlier > off-list discussions with me, when I realised I'd had the "running > extension modules as __main__" use case in mind myself, but never > actually written that notion down anywhere. > > It's the one capability of PyModuleExec_* that simply doesn't exist today. 
> > > As I was writing this down, I realized that once pre-created modules are > > allowed, it makes no sense to insist that they actually are module > > instances -- PyModule_Type provides little functionality above a plain > object > > subclass. I'm not sure there are any use cases for this, but I don't see > a > > reason to limit things artificially. Any bugs caused by allowing > > non-ModuleType modules are unlikely to be subtle, unless the custom > object > > passes the "asked for it" line. > > > > Comments appreciated. > > This generally looks good to me. Some more specific feedback inline below. > > > PEP: XXX > > Title: Redesigning extension module loading > > For the BDFL-Delegate question: Brett would you be happy tackling this one? > I don't know if "be happy tackling" is the right way to phrase it. =) Honestly I don't think I'm the best person for this PEP. My experience with the C API and extension modules is rather limited and so I don't think I will be able to properly think of the impact on more complex, sane extension module use cases. > > > Motivation > > ========== > > > > Python modules and extension modules are not being set up in the same > way. > > For Python modules, the module is created and set up first, then the > module > > code is being executed (PEP 302). > > A ModuleSpec object (PEP 451) is used to hole information about the > module, > > and pased to the relevant hooks. > > s/hole/hold/ > s/pased/passed/ > > > > > Furthermore, the majority of currently existing extension modules has > > problems with sub-interpreter support and/or reloading, and, while it is > > it possible with the current infrastructure to support these > > features, is neither easy nor efficient. > > Addressing these issues was the goal of PEP 3121, but many extensions > > took the least-effort approach to porting to Python 3, leaving many of > these > > issues unresolved. 
> > It's probably worth noting that some of those "least-effort" porting > approaches are in the standard library: this PEP is about solving our > own problems in addition to other people's. > > > Thius PEP keeps the backwards-compatible behavior, which should reduce > pressure > > and give extension authors adequate time to consider these issues when > porting. > > s/thius/this/ > > > The proposal > > ============ > > > > The current extension module initialisation will be deprecated in favour > of > > a new initialisation scheme. Since the current scheme will continue to be > > available, existing code will continue to work unchanged, including > binary > > compatibility. > > > > Extension modules that support the new initialisation scheme must export > one > > or both of the public symbols "PyModuleCreate_modulename" and > > "PyModuleExec_modulename", where "modulename" is the > > name of the shared library. This mimics the previous naming convention > for > > the "PyInit_modulename" function. > > > > This symbols, if defined, must resolve to C functions with the following > > signatures, respectively:: > > > > PyObject* (*PyModuleCreateFunction)(PyObject* module_spec) > > int (*PyModuleExecFunction)(PyObject* module) > > For the Python level, the model we ended up with for 3.5 is: > > 1. create_module must exist, but may return None > 2. exec_module must exist, but may have no effect on the module state > > For the new C level API, it's probably worth drawing the more explicit > parallel to __new__ and __init__ on classes, where you can implement > both of them if you want, but in most cases, implementing only one or > the other will be sufficient. 
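For reference, the Python-level model described above (create_module may return None, exec_module must exist and initializes in place) can be sketched with a minimal PEP 451 loader. This is an illustrative sketch only; the ``DemoLoader`` class and its ``answer`` attribute are invented for the example:

```python
import importlib.util

class DemoLoader:
    """Minimal PEP 451-style loader: create_module may return None
    (asking the import machinery for a default module object), while
    exec_module initializes the module in place."""

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        module.answer = 42  # all "initialization" happens here

spec = importlib.util.spec_from_loader("demo", DemoLoader())
module = importlib.util.module_from_spec(spec)  # invokes create_module()
spec.loader.exec_module(module)                 # invokes exec_module()
print(module.__name__, module.answer)  # demo 42
```

The same two-step shape is what the proposed PyModuleCreate/PyModuleExec hooks mirror at the C level.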
> > The reason I suggest that is because I was going to ask if we should > make providing both APIs, or at least PyModuleExec_*, compulsory > (based on the Python Loader API requirements), but thinking of the > __new__/__init__ analogy made me realise that your current design > makes sense, since dealing with it is confined specifically to the > extension module loader implementation. > See I don't like this fork from the PEP 451 API. Unless we want to change importlib to not require exec_module() and instead let create_module() partially fulfill the role load_module() had by doing everything then I say the C API should try to follow how the rest of the import machinery operates, especially if the separation is mostly a refactoring of what some combined PyModuleCreate_* would probably do anyway. > > > The PyModuleCreate function > > --------------------------- > > > > > When called, this function must create and return a module object. > > > > If "PyModuleExec_module" is undefined, this function must also initialize > > the module; see PyModuleExec_module for details on initialization. > > This should be clarified to point out that, as per PEP 451, the import > machinery will still take care of setting the import related > attributes after the loader returns the module from create_module. > > > There is no requirement for the returned object to be an instance of > > types.ModuleType. Any type can be used. > > The requirement for the returned object to support getting and setting > attributes (as per > https://www.python.org/dev/peps/pep-0451/#attributes) should be > defined here. > > > This follows the current > > support for allowing arbitrary objects in sys.modules and makes it easier > > for extension modules to define a type that exactly matches their needs > for > > holding module state. > > +1 > > > The PyModuleExec function > > ------------------------- > > > > This PyModuleExec function is used to implement "loader.exec_module" > > defined in PEP 451. 
> > It is called after ModuleSpec-related attributes such as ``__loader__``, > ``__spec__`` and ``__name__`` are set on the module. > (The full list is in PEP 451 [#pep-0451-attributes]_) > > The "PyModuleExec_modulename" function will be called to initialize a > module. > This happens in two situations: when the module is first initialized for > a given (sub-)interpreter, and when the module is reloaded. > > The "module" argument receives the module object. > If PyModuleCreate is defined, this will be the object returned by it. > If PyModuleCreate is not defined, PyModuleExec is expected to operate > on any Python object for which attributes can be added by > PyObject_SetAttr* > and retrieved by PyObject_GetAttr*. > Specifically, as the module may not be a PyModule_Type subclass, > PyModule_* functions should not be used on it, unless they explicitly > support > operating on all objects. > > I think this is too permissive on the interpreter side of things, thus > making things more complicated than we'd like them to be for extension > module authors. > > If PyModuleCreate_* is defined, PyModuleExec_* will receive the object > returned there, while if it isn't defined, the interpreter *will* > provide a PyModule_Type instance, as per PEP 451. > > However, permitting module authors to make the PyModule_Type (or a > subclass) assumption in their implementation does introduce a subtle > requirement on the implementation of both the load_module method, and > on custom PyModuleExec_* functions that are paired with a > PyModuleCreate_* function. > > Firstly, we need to enforce the following constraint in load_module: > if the underlying C module does *not* define a custom PyModuleCreate_* > function, and we're passed a module execution environment which is > *not* an instance of PyModule_Type, then we should throw TypeError.
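The constraint described above can be sketched at the Python level. Note that ``check_exec_environment`` is a hypothetical helper written for this sketch, not an actual importlib API:

```python
import types

def check_exec_environment(module, has_custom_create):
    # Hypothetical helper mirroring the constraint sketched above: an
    # extension without its own Create hook may assume it receives a real
    # module object, so anything else is rejected with TypeError.
    if not has_custom_create and not isinstance(module, types.ModuleType):
        raise TypeError("extension without a Create hook requires a "
                        "real module object")
    return module

# A genuine module object is always acceptable.
check_exec_environment(types.ModuleType("demo"), has_custom_create=False)

# An arbitrary "module-like" object is only acceptable when the
# extension's own Create hook produced it.
class CustomModule:
    pass

check_exec_environment(CustomModule(), has_custom_create=True)
```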
> > By contrast, in the presence of a custom PyModuleCreate_* function, > the requirement for checking the type of the execution environment > (and throwing TypeError if the module can't handle it) should be > delegated to the PyModuleExec_* function, and that will need to be > documented appropriately. > > That keeps things simple in the default case (extension module authors > just using PyModuleExec_* can continue to assume the use of > PyModule_Type or a subclass), while allowing more flexibility in the > "power user" case of creating your own module object. > > > Usage > > ===== > > > > This PEP allows three new ways of creating modules, each with its > > advantages and disadvantages. > > > > > > Exec-only > > --------- > > > > The preferred way to create C extensions is to define > "PyModuleExec_modulename" > > only. This brings the following advantages: > > > > * The extension can be loaded into a pre-created module, making it > possible > > to run them as ``__main__``, participate in certain lazy-loading > schemes > > [#lazy_import_concerns]_, or enable other creative uses. > > * The module can be reloaded in the same way as Python modules. > > > > As Exec-only extension modules do not have C-level storage, > > all module-local data must be stored in the module object's attributes, > > possibly using the PyCapsule mechanism. > > With my suggested change above, this approach will also let module > authors assume PyModule_Type (or a subclass), and have the interpreter > enforce that assumption on their behalf. > > > Create-only > > ----------- > > > > Extensions defining only the "PyModuleCreate_modulename" hook behave > similarly > > to current extensions. > > > > This is the easiest way to create modules that require custom module > objects, > > or substantial per-module state at the C level (using positive > > ``PyModuleDef.m_size``). > > > > When the PyModuleCreate function is called, the module has not yet been > added > > to sys.modules. 
> > Attempts to load the module again (possibly transitively) will result in > an > > infinite loop. > > If user code needs to me called in module initialization, > > module authors are advised to do so from the PyModuleExec function. > > > > Reloading a Create-only module does nothing, except re-setting > > ModuleSpec-related attributes described in PEP 0451 > [#pep-0451-attributes]. > > Another advantage of this approach is that you don't need to worry > about potentially being passed a module object of an arbitrary type. > > > Exec and Create > > --------------- > > > > Extensions that need to create a custom module object, > > and either need to run user code during initialization or support > reloading, > > should define both "PyModuleCreate_modulename" and > "PyModuleExec_modulename". > > This approach will have the downside of needing to check the type of > the passed in module against the module implementation's assumptions. > > > Subinterpreters and Interpreter Reloading > > ----------------------------------------- > > > > Extensions using the new initialization scheme are expected to support > > subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly. > > The mechanism is designed to make this easy, but care is still required > > on the part of the extension author. > > No user-defined functions, methods, or instances may leak to different > > interpreters. > > To achieve this, all module-level state should be kept in either the > module > > dict, or in the module object. > > A simple rule of thumb is: Do not define any static data, except > built-in types > > with no mutable or user-settable class attributes. > > Worth noting here that this is why we consider it desirable to provide > a utility somewhere in the standard library to make it easy to do > these kinds of checks. 
> > At the very least we need it in the test.support module to do our own > tests, but it would be preferable to have it as a supported API > somewhere in the standard library. > > This isn't the only area where this kind of question of making it > easier for people to test whether or not they're implementing or > emulating a protocol correctly has come up - it's applicable to > testing things like total ordering support in custom objects, operand > precedence handling, ABC compliance, code generation, exception > traceback manipulation, etc. > > Perhaps we should propose a new unittest submodule for compatibility > and compliance tests that are too esoteric for the module top level, > but we also don't want to ask people to write for themselves? > > > Module Reloading > > ---------------- > > > > Extensions that support reloading must define PyModuleExec, which is > called > > in reload() to re-initialize the module in place. > > The same caveats apply to reloading an extension module as to reloading > > a Python module. > > Assuming you go with my suggestion regarding the PyModule_Type > assumption above, that would be worth reiterating here. > > > Multiple modules in one library > > ------------------------------- > > > > To support multiple Python modules in one shared library, the library > > must export all appropriate PyModuleExec_ or PyModuleCreate_ > hooks > > for each exported module. > > The modules are loaded using a ModuleSpec with origin set to the name of > the > > library file, and name set to the module name. > > Note that this mechanism can only be used to *load* such modules, > > not to *find* them. > > If I recall correctly, Brett already updated the extension module > finder to handle locating such modules. It's either that or there's an > existing issue on the tracker for it. > Existing issue; extensions use FileFinder and do no caching or search of what initialization functions are exported by the module. 
-Brett > > > Open issues > > =========== > > > > Now that PEP 442 is implemented, it would be nice if module finalization > > did not set all attributes to None, > > Antoine added that in 3.4: http://bugs.python.org/issue18214 > > However, it wasn't entirely effective, as several extension modules > still need to be hit with a sledgehammer to get them to drop > references properly. Asking "Why is that so?" is actually one of the > things that got me started digging into this area a couple of years > back. > > > In this scheme, it is not possible to create a module with C-level state, > > which would be able to exec itself in any externally provided module > object, > > short of putting PyCapsules in the module dict. > > I suspect "PyCapsule in the module dict" may be the right answer here, > in which case some suitable documentation and perhaps some convenience > APIs could be a good way to go. > > Relying on PyCapsule also has the advantage of potentially supporting > better collaboration between extension modules, without needing to > link them with each other directly. > > > The proposal repurposes PyModule_SetDocString, PyModule_AddObject, > > PyModule_AddIntMacro et.al. to work on any object. > > Would it be better to have these in the PyObject namespace? > > With my proposal above to keep the PyModule_Type assumption in most > cases, I think it may be better to leave them alone entirely. If folks > decide to allow non module types, they can decide to handle the > consequences. > > > We should expose some kind of API in importlib.util (or a better place?) > that > > can be used to check that a module works with reloading and > subinterpreters. > > See comments above on that. > > > The runpy module will need to be modified to take advantage of PEP 451 > > and this PEP. This might out of scope for this PEP. > > I think it's out of scope, but runpy *does* need an internal redesign > to take full advantage of PEP 451. 
Currently it works by attempting to > extract the code object directly in most situations, whereas PEP 451 > should let it rely almost entirely on exec_code instead (with direct > execution used only when it's actually given a path directly to a > Python source or bytecode file. > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > _______________________________________________ > Import-SIG mailing list > Import-SIG at python.org > https://mail.python.org/mailman/listinfo/import-sig > -------------- next part -------------- An HTML attachment was scrubbed... URL: From encukou at gmail.com Tue Feb 24 22:49:33 2015 From: encukou at gmail.com (Petr Viktorin) Date: Tue, 24 Feb 2015 22:49:33 +0100 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading In-Reply-To: References: Message-ID: On Mon, Feb 23, 2015 at 4:16 PM, Brett Cannon wrote: > I mostly have grammar/typo comments and one suggestion to minimize the > number of ways of initializing a module by not letting PyModuleCreate_* do > that step on its own. Thanks for the corrections! Lesson learned, I'll use a spell checker next time. ... >> The PyModuleCreate function >> --------------------------- >> >> This PyModuleCreate function is used to implement "loader.create_module" >> defined in PEP 451. >> >> By exporting the "PyModuleCreate_modulename" symbol, an extension module >> indicates that it uses a custom module object. >> >> This prevents loading the extension in a pre-created module, >> but gives greater flexibility in allowing a custom C-level layout >> of the module object. >> >> The "module_spec" argument receives a "ModuleSpec" instance, as defined in >> PEP 451. >> >> When called, this function must create and return a module object. >> >> If "PyModuleExec_module" is undefined, this function must also initialize >> the module; see PyModuleExec_module for details on initialization. > > > Why conflate module creation with initialization? 
If one is going to have > initialization code then it can't be difficult to factor out into a > PyModuleExec_* function, so I don't see a good reason to support only > defining PyModuleCreate_*. Right. Originally, to me, Exec seemed to not be very useful when Create is specified, because reload support for extension modules isn't very useful (unless you're Cython and want to emulate Python modules as well as possible). But given the fact that you can't safely call user code from Create, it does make sense to always require Exec, so people aren't tempted to take shortcuts. It does stretch the __new__/__init__ parallel Nick mentioned. But while that parallel was a good stepping stone to get to this design, I don't think it is too useful for explaining how the design works. I feel that people who know what __new__ can do and why it is necessary should have no problem understanding a module creation hook without relating to classes. Classes make me think about inheritance, which doesn't apply here. Most __init__s don't register methods or class constants, but Exec should add functions and module globals. So I plan to drop the Create-only option, and to not mention the __new__/__init__ parallel. Nick, does that sound reasonable to you? ... >> In this scheme, it is not possible to create a module with C-level state, >> which would be able to exec itself in any externally provided module >> object, >> short of putting PyCapsules in the module dict. >> >> The proposal repurposes PyModule_SetDocString, PyModule_AddObject, >> PyModule_AddIntMacro et.al. to work on any object. >> Would it be better to have these in the PyObject namespace? > > > No. They are setting explicit attributes that are meant only for modules so > its more generalization than is necessary to rename them. OK. I will add PyModule_AddCapsule and PyModule_GetCapsule as simple helpers for C-level state. >> We should expose some kind of API in importlib.util (or a better place?)
>> that >> can be used to check that a module works with reloading and >> subinterpreters. > > > What would such an API actually check to verify that a module could be > reloaded? Obviously we can't check for static state or object leakage between subinterpreters. By using the new API, you promise that the extension does support reloading and subinterpreters. This will be prominently stated in the docs, and checked by this function. For the old API, PyModule_Create with m_size>=0 can be used to support subinterpreters. But I don't think the language in the docs is strong enough to say that m_size>=0 is a promise of such support. From ncoghlan at gmail.com Wed Feb 25 13:40:55 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 25 Feb 2015 22:40:55 +1000 Subject: [Import-SIG] Proto-PEP: Redesigning extension module loading In-Reply-To: References: Message-ID: On 25 February 2015 at 07:49, Petr Viktorin wrote: > On Mon, Feb 23, 2015 at 4:16 PM, Brett Cannon wrote: >> I mostly have grammar/typo comments and one suggestion to minimize the >> number of ways of initializing a module by not letting PyModuleCreate_* do >> that step on its own. > > Thanks for the corrections! Lesson learned, I'll use a spell checker next time. > > ... >>> The PyModuleCreate function >>> --------------------------- >>> >>> This PyModuleCreate function is used to implement "loader.create_module" >>> defined in PEP 451. >>> >>> By exporting the "PyModuleCreate_modulename" symbol, an extension module >>> indicates that it uses a custom module object. >>> >>> This prevents loading the extension in a pre-created module, >>> but gives greater flexibility in allowing a custom C-level layout >>> of the module object. >>> >>> The "module_spec" argument receives a "ModuleSpec" instance, as defined in >>> PEP 451. >>> >>> When called, this function must create and return a module object. 
>>> >>> If "PyModuleExec_module" is undefined, this function must also initialize >>> the module; see PyModuleExec_module for details on initialization. >> >> >> Why conflate module creation with initialization? If one is going to have >> initialization code then it can't be difficult to factor out into a >> PyModuleExec_* function, so I don't see a good reason to support only >> defining PyModuleCreate_*. > > Right. Originally, to me, Exec seemed to not be very useful when > Create is specified, because reload support for extension modules > isn't very useful (unless you're Cython and want to emulate Python > modules as well as possible). But given the fact that you can't safely > call user code from Create, it does make sense to always require Exec, > so people aren't tempted to take shortcuts. > > It does stretch the __new__/__init__ parallel Nick mentioned. But > while that parallel was is a good stepping stone to get to this > design, I don't think it is too useful for explaining how the design > works. > I feel that people who know what __new__ can do and why it is > necessary should have no problem understanding a module creation hook > without relating to classes. Classes make me think about inheritance, > which doesn't apply here. Most __init__s don't register methods or > class constants, but Exec should add functions and module globals. > > So I plan to drop the Create-only option, and to not mention the > __new__/__init__ parallel. > Nick, does that sound reasonable to you? Yes, it does, especially since in that case: 1. The boilerplate probably won't be boilerplate, since it lets you run code after the import system has finished initialising the module globals 2. Even if it *is* boilerplate, it's really trivial boilerplate (and you have a very weird extension module) >>> We should expose some kind of API in importlib.util (or a better place?) >>> that >>> can be used to check that a module works with reloading and >>> subinterpreters. 
>> >> >> What would such an API actually check to verify that a module could be >> reloaded? > > Obviously we can't check for static state or object leakage between > subinterpreters. > By using the new API, you promise that the extension does support > reloading and subinterpreters. This will be prominently stated in the > docs, and checked by this function. > For the old API, PyModule_Create with m_size>=0 can be used to support > subinterpreters. But I don't think the language in the docs is strong > enough to say that m_size>=0 is a promise of such support. Ah, I wasn't clear in terms of "check" or "test" when I mentioned this - I was literally referring to something that could be run in test suites to try these things and see if they worked or not, rather than to a runtime "can I reload this safely?" check. "Try it and see" is likely to be a better approach to take there. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From bcannon at gmail.com Fri Feb 27 18:06:59 2015 From: bcannon at gmail.com (Brett Cannon) Date: Fri, 27 Feb 2015 17:06:59 +0000 Subject: [Import-SIG] PEP for the removal of PYO files Message-ID: Here is my proposed PEP to drop .pyo files from Python. Thanks to Barry's work in PEP 3147 this really shouldn't have much impact on users' code (then again, bytecode files are basically an implementation detail so it should hardly impact anyone directly). One thing I would appreciate is if people have more motivation for this. While the maintainer of importlib in me wants to see this happen, the core developer in me thinks the arguments are a little weak. So if people can provide more reasons why this is a good thing that would be appreciated.
PEP: 488 Title: Elimination of PYO files Version: $Revision$ Last-Modified: $Date$ Author: Brett Cannon Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 20-Feb-2015 Post-History: Abstract ======== This PEP proposes eliminating the concept of PYO files from Python. To continue the support of the separation of bytecode files based on their optimization level, this PEP proposes extending the PYC file name to include the optimization level in the bytecode repository directory (i.e., the ``__pycache__`` directory). Rationale ========= As of today, bytecode files come in two flavours: PYC and PYO. A PYC file is the bytecode file generated and read from when no optimization level is specified at interpreter startup (i.e., ``-O`` is not specified). A PYO file represents the bytecode file that is read/written when **any** optimization level is specified (i.e., when ``-O`` is specified, including ``-OO``). This means that while PYC files clearly delineate the optimization level used when they were generated -- namely no optimizations beyond the peepholer -- the same is not true for PYO files. Put in terms of optimization levels and the file extension: - 0: ``.pyc`` - 1 (``-O``): ``.pyo`` - 2 (``-OO``): ``.pyo`` The reuse of the ``.pyo`` file extension for both level 1 and 2 optimizations means that there is no clear way to tell what optimization level was used to generate the bytecode file. In terms of reading PYO files, this can lead to an interpreter using a mixture of optimization levels with its code if the user was not careful to make sure all PYO files were generated using the same optimization level (typically done by blindly deleting all PYO files and then using the `compileall` module to compile all-new PYO files [1]_). This issue is only compounded when people optimize Python code beyond what the interpreter natively supports, e.g., using the astoptimizer project [2]_.
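The ambiguity described above can be seen in a toy sketch of the current mapping from optimization level to file extension (an illustration only, not actual importlib code):

```python
# Current (pre-PEP) mapping of interpreter optimization level to the
# bytecode file extension; levels 1 (-O) and 2 (-OO) collide on .pyo.
current_ext = {0: ".pyc", 1: ".pyo", 2: ".pyo"}

# Group the levels by extension to show the collision: a .pyo file on
# disk could have been produced by either level.
levels_for = {}
for level, ext in current_ext.items():
    levels_for.setdefault(ext, []).append(level)

print(levels_for)  # {'.pyc': [0], '.pyo': [1, 2]}
```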
In terms of writing PYO files, the need to delete all PYO files
every time one either changes the optimization level they want to use
or is unsure of what optimization level was used the last time PYO
files were generated leads to unnecessary file churn.

As for distributing bytecode-only modules, having to distribute both
``.pyc`` and ``.pyo`` files is unnecessary for the common use-cases
of code obfuscation and smaller file deployments.


Proposal
========

To eliminate the ambiguity that PYO files present, this PEP proposes
eliminating the concept of PYO files and their accompanying ``.pyo``
file extension. To allow the optimization level to be unambiguous as
well as to avoid having to regenerate optimized bytecode files
needlessly in the ``__pycache__`` directory, the optimization level
used to generate a PYC file will be incorporated into the bytecode
file name. Currently bytecode file names are created by
``importlib.util.cache_from_source()``, approximately using the
following expression defined by PEP 3147 [3]_, [4]_, [5]_::

    '{name}.{cache_tag}.pyc'.format(name=module_name,
                                    cache_tag=sys.implementation.cache_tag)

This PEP proposes to change the expression to::

    '{name}.{cache_tag}.opt-{optimization}.pyc'.format(
        name=module_name,
        cache_tag=sys.implementation.cache_tag,
        optimization=str(sys.flags.optimize))

The "opt-" prefix was chosen so as to provide a visual separator
from the cache tag. The placement of the optimization level after
the cache tag was chosen to preserve lexicographic sort order of
bytecode file names based on module name and cache tag, which will
not vary for a single interpreter. The "opt-" prefix was chosen over
"o" so as to be somewhat self-documenting. The "opt-" prefix was
chosen over "O" so as to not have any confusion with "0" while being
so close to the interpreter version number.
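Concretely, the two expressions above differ only in the new "opt-" segment; a small sketch (the module name and cache tag are hard-coded for illustration rather than taken from a live interpreter):

```python
def current_name(module_name, cache_tag):
    # PEP 3147 form: the optimization level is not encoded in the name.
    return '{name}.{cache_tag}.pyc'.format(
        name=module_name, cache_tag=cache_tag)

def proposed_name(module_name, cache_tag, optimization):
    # Proposed form: the level sits between the cache tag and the suffix.
    return '{name}.{cache_tag}.opt-{optimization}.pyc'.format(
        name=module_name, cache_tag=cache_tag,
        optimization=str(optimization))

print(current_name('importlib', 'cpython-35'))
# importlib.cpython-35.pyc
for level in (0, 1, 2):
    print(proposed_name('importlib', 'cpython-35', level))
# importlib.cpython-35.opt-0.pyc
# importlib.cpython-35.opt-1.pyc
# importlib.cpython-35.opt-2.pyc
```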
A period was chosen over a hyphen as a separator so as to distinguish
clearly that the optimization level is not part of the interpreter
version as specified by the cache tag. It also lends itself to the
existing use of the period in the file name to delineate semantically
different concepts.

For example, the bytecode file name of ``importlib.cpython-35.pyc``
would become ``importlib.cpython-35.opt-0.pyc``. If ``-OO`` had been
passed to the interpreter then instead of
``importlib.cpython-35.pyo`` the file name would be
``importlib.cpython-35.opt-2.pyc``.


Implementation
==============

importlib
---------

As ``importlib.util.cache_from_source()`` is the API that exposes
bytecode file paths as well as being directly used by importlib, it
requires the most critical change. As of Python 3.4, the function's
signature is::

    importlib.util.cache_from_source(path, debug_override=None)

This PEP proposes changing the signature in Python 3.5 to::

    importlib.util.cache_from_source(path, debug_override=None, *,
                                     optimization=None)

The introduced ``optimization`` keyword-only parameter will control
what optimization level is specified in the file name. If the
argument is ``None`` then the current optimization level of the
interpreter will be assumed. Any argument given for ``optimization``
will be passed to ``str()`` and must have ``str.isalnum()`` be true,
else ``ValueError`` will be raised (this prevents invalid characters
being used in the file name). It is expected that beyond Python's own
0-2 optimization levels, third-party code will use a hash of
optimization names to specify the optimization level, e.g.
``hashlib.sha256(','.join(['dead code elimination', 'constant
folding'])).hexdigest()``.

The ``debug_override`` parameter will be deprecated. As the parameter
expects a boolean, a true value will be treated as equivalent to an
``optimization`` of 0 and a false value as an ``optimization`` of 1
(a ``None`` argument will mean the same as for ``optimization``).
A deprecation warning will be raised when ``debug_override`` is given
a value other than ``None``, but there are no plans for the complete
removal of the parameter at this time (although removal will be no
later than Python 4).

The various module attributes of importlib.machinery which relate to
bytecode file suffixes will be updated [7]_. The
``DEBUG_BYTECODE_SUFFIXES`` and ``OPTIMIZED_BYTECODE_SUFFIXES``
attributes will both be documented as deprecated and set to the same
value as ``BYTECODE_SUFFIXES`` (removal of
``DEBUG_BYTECODE_SUFFIXES`` and ``OPTIMIZED_BYTECODE_SUFFIXES`` is
not currently planned, but will be no later than Python 4).

The various finders and loaders will also be updated as necessary,
but updating the previously mentioned parts of importlib should be
all that is required.


Rest of the standard library
----------------------------

The various functions exposed by the ``py_compile`` and
``compileall`` modules will be updated as necessary to make sure
they follow the new bytecode file name semantics [6]_, [1]_.


Compatibility Considerations
============================

Any code directly manipulating bytecode files from Python 3.2 on
will need to consider the impact of this change on their code (prior
to Python 3.2 -- including all of Python 2 -- there was no
``__pycache__``, which already necessitates bifurcating bytecode file
handling support). If code was setting the ``debug_override``
argument to ``importlib.util.cache_from_source()`` then care will be
needed if they want the path to a bytecode file with an optimization
level of 2. Otherwise only code **not** using
``importlib.util.cache_from_source()`` will need updating.

As for people who distribute bytecode-only modules, they will have
to choose which optimization level they want their bytecode files to
be at, since distributing a ``.pyo`` file alongside a ``.pyc`` file
will no longer be of any use.
Since people typically distribute
bytecode files only for code obfuscation purposes or for smaller
distribution size, having to distribute just a single ``.pyc`` file
should actually be beneficial to these use-cases.


Rejected Ideas
==============

N/A


Open Issues
===========

Formatting of the optimization level in the file name
-----------------------------------------------------

Using the "opt-" prefix and placing the optimization level between
the cache tag and file extension is not critical. Other options which
were considered are:

* ``importlib.cpython-35.o0.pyc``
* ``importlib.cpython-35.O0.pyc``
* ``importlib.cpython-35.0.pyc``
* ``importlib.cpython-35-O0.pyc``
* ``importlib.O0.cpython-35.pyc``
* ``importlib.o0.cpython-35.pyc``
* ``importlib.0.cpython-35.pyc``

These were initially rejected because they would change the sort
order of bytecode files, introduce possible ambiguity with the cache
tag, or were not self-documenting enough.


References
==========

.. [1] The compileall module
   (https://docs.python.org/3/library/compileall.html#module-compileall)

.. [2] The astoptimizer project
   (https://pypi.python.org/pypi/astoptimizer)

.. [3] ``importlib.util.cache_from_source()``
   (https://docs.python.org/3.5/library/importlib.html#importlib.util.cache_from_source)

.. [4] Implementation of ``importlib.util.cache_from_source()`` from
   CPython 3.4.3rc1
   (https://hg.python.org/cpython/file/038297948389/Lib/importlib/_bootstrap.py#l437)

.. [5] PEP 3147, PYC Repository Directories, Warsaw
   (http://www.python.org/dev/peps/pep-3147)

.. [6] The py_compile module
   (https://docs.python.org/3/library/py_compile.html#module-py_compile)

.. [7] The importlib.machinery module
   (https://docs.python.org/3/library/importlib.html#module-importlib.machinery)


Copyright
=========

This document has been placed in the public domain.

..
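As a worked illustration of the argument handling the PEP specifies for the new ``optimization`` parameter (``None`` meaning the interpreter's current level, stringification, and the ``str.isalnum()`` check), here is a minimal sketch; ``cache_name`` is a made-up helper mirroring the described semantics, not the real ``importlib`` code:

```python
import sys

def cache_name(module_name, cache_tag, optimization=None):
    # None means "use the interpreter's current optimization level".
    if optimization is None:
        optimization = sys.flags.optimize
    # Any other argument is stringified and must be alphanumeric so the
    # resulting file name contains no invalid characters.
    optimization = str(optimization)
    if not optimization.isalnum():
        raise ValueError('optimization level {!r} is not alphanumeric'
                         .format(optimization))
    return '{}.{}.opt-{}.pyc'.format(module_name, cache_tag, optimization)

print(cache_name('importlib', 'cpython-35', 2))
# importlib.cpython-35.opt-2.pyc
```

Note that a ``hashlib`` hexdigest, the suggested convention for third-party optimization levels, passes the ``isalnum()`` check unchanged.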
Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Feb 27 19:01:53 2015 From: guido at python.org (Guido van Rossum) Date: Fri, 27 Feb 2015 10:01:53 -0800 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: References: Message-ID: I'm in a good mood today and I think this is a great idea! That's not to say that I'm accepting it as-is (I haven't read it fully) but I expect that there are very few downsides and it won't break much. (There's of course always going to be someone who always uses -O and somehow depends on the existence of .pyo files, but they should have seen it coming with __pycache__ and the new version-specific extensions. :-) On Fri, Feb 27, 2015 at 9:06 AM, Brett Cannon wrote: > Here is my proposed PEP to drop .pyo files from Python. Thanks to Barry's > work in PEP 3147 this really shouldn't have much impact on user's code > (then again, bytecode files are basically an implementation detail so it > shouldn't impact hardly anyone directly). > > One thing I would appreciate is if people have more motivation for this. > While the maintainer of importlib in me wants to see this happen, the core > developer in me thinks the arguments are a little weak. So if people can > provide more reasons why this is a good thing that would be appreciated. > > > PEP: 487 > Title: Elimination of PYO files > Version: $Revision$ > Last-Modified: $Date$ > Author: Brett Cannon > Status: Draft > Type: Standards Track > Content-Type: text/x-rst > Created: 20-Feb-2015 > Post-History: > > Abstract > ======== > > This PEP proposes eliminating the concept of PYO files from Python. 
> [remainder of the PEP quoted in full; snipped, as it duplicates the
> original message above]
> Local Variables:
> mode: indented-text
> indent-tabs-mode: nil
> sentence-end-double-space: t
> fill-column: 70
> coding: utf-8
> End:
>
> _______________________________________________
> Import-SIG mailing list
> Import-SIG at python.org
> https://mail.python.org/mailman/listinfo/import-sig
>

--
--Guido van Rossum (python.org/~guido)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bcannon at gmail.com Fri Feb 27 19:06:33 2015
From: bcannon at gmail.com (Brett Cannon)
Date: Fri, 27 Feb 2015 18:06:33 +0000
Subject: [Import-SIG] PEP for the removal of PYO files
References:
Message-ID:

On Fri, Feb 27, 2015 at 1:02 PM Guido van Rossum wrote:

> I'm in a good mood today and I think this is a great idea!

Does that mean if you were in a bad mood this would be a bad idea? ;)

> That's not to say that I'm accepting it as-is (I haven't read it fully)
> but I expect that there are very few downsides and it won't break much.

There is a section in the PEP discussing backwards-compatibility.
Basically the potential breakage seems fairly minimal to me.

> (There's of course always going to be someone who always uses -O and
> somehow depends on the existence of .pyo files, but they should have seen
> it coming with __pycache__ and the new version-specific extensions. :-)

Yep! PEP 3147 makes this much easier to do without breaking the world.

-Brett

> On Fri, Feb 27, 2015 at 9:06 AM, Brett Cannon wrote:
>
>> Here is my proposed PEP to drop .pyo files from Python. Thanks to Barry's
>> work in PEP 3147 this really shouldn't have much impact on user's code
>> (then again, bytecode files are basically an implementation detail so it
>> shouldn't impact hardly anyone directly).
>>
>> One thing I would appreciate is if people have more motivation for this.
>> While the maintainer of importlib in me wants to see this happen, the core
>> developer in me thinks the arguments are a little weak.
>> So if people can provide more reasons why this is a good thing that
>> would be appreciated.
>>
>> [full text of the PEP quoted here; snipped, as it duplicates the
>> original message above]
>>
>> _______________________________________________
>> Import-SIG mailing list
>> Import-SIG at python.org
>> https://mail.python.org/mailman/listinfo/import-sig
>
>
> --
> --Guido van Rossum (python.org/~guido)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ethan at stoneleaf.us Fri Feb 27 19:12:47 2015
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 27 Feb 2015 10:12:47 -0800
Subject: [Import-SIG] PEP for the removal of PYO files
In-Reply-To:
References:
Message-ID: <54F0B39F.5060803@stoneleaf.us>

On 02/27/2015 09:06 AM, Brett Cannon wrote:
> PEP: 487

+1. Great idea.

--
~Ethan~

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: OpenPGP digital signature
URL:

From barry at python.org Fri Feb 27 19:28:48 2015
From: barry at python.org (Barry Warsaw)
Date: Fri, 27 Feb 2015 13:28:48 -0500
Subject: [Import-SIG] PEP for the removal of PYO files
In-Reply-To:
References:
Message-ID: <20150227132848.7a9ab390@limelight.wooz.org>

This looks great Brett, thanks for pushing it forward. I think it's a
perfectly natural and consistent extension to PEP 3147.

Some comments inlined.

On Feb 27, 2015, at 05:06 PM, Brett Cannon wrote:

>Rationale
>=========
>
> - 0: ``.pyc``
> - 1 (``-O``): ``.pyo``
> - 2 (``-OO``): ``.pyo``

This is all the rationale I need. :)

>The "opt-" prefix was chosen so as to provide a visual separator
>from the cache tag. The placement of the optimization level after
>the cache tag was chosen to preserve lexicographic sort order of
>bytecode file names based on module name and cache tag which will
>not vary for a single interpreter. The "opt-" prefix was chosen over
>"o" so as to be somewhat self-documenting. The "opt-" prefix was
>chosen over "O" so as to not have any confusion with "0" while being
>so close to the interpreter version number.

I get it, and the examples you include in the open questions are
helpful, but I still don't like "opt-". We'll no doubt bikeshed on
this until Guido decides, but looking at the examples below I'd be
okay with 'O'. Did you consider 'opt', e.g.
importlib.cpython-35.opt0.pyc ?
>Compatibility Considerations >============================ Just as PEP 3147 had to make backward compatibility concessions to .pyc files living outside __pycache__ (which I think is still supported, right?) I think you'll have to do the same for traditional .pyo files, at least for Python 3.5. You won't have to *write* such files, but if they exist and the corresponding optimization level pyc file isn't present in __pycache__, you'll have to load them. It might in fact make sense to add some language to this PEP saying that in Python 3.6, support for old-style .pyc and .pyo files will be removed. Cheers, -Barry From ethan at stoneleaf.us Fri Feb 27 19:36:20 2015 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 27 Feb 2015 10:36:20 -0800 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: <20150227132848.7a9ab390@limelight.wooz.org> References: <20150227132848.7a9ab390@limelight.wooz.org> Message-ID: <54F0B924.3020006@stoneleaf.us> On 02/27/2015 10:28 AM, Barry Warsaw wrote: > from the PEP: >> The "opt-" prefix was chosen so as to provide a visual separator >> from the cache tag. The placement of the optimization level after >> the cache tag was chosen to preserve lexicographic sort order of >> bytecode file names based on module name and cache tag which will >> not vary for a single interpreter. The "opt-" prefix was chosen over >> "o" so as to be somewhat self-documenting. The "opt-" prefix was >> chosen over "O" so as to not have any confusion with "0" while being >> so close to the interpreter version number. > > I get it, and the examples you include in the open questions is helpful, but I > still don't like "opt-". We'll no doubt bikeshed on this until Guido > decides but looking at the examples below I'd be okay with 'O'. Did > you consider 'opt', e.g. imporlib.cpython-35.opt0.pyc ? I can live with either '.opt-N.' or just '.optN.' but all the others I thought were horrid. 
>> Compatibility Considerations >> ============================ > > Just as PEP 3147 had to make backward compatibility concessions to .pyc files > living outside __pycache__ (which I think is still supported, right?) I think > you'll have to do the same for traditional .pyo files, at least for Python > 3.5. You won't have to *write* such files, but if they exist and the > corresponding optimization level pyc file isn't present in __pycache__, you'll > have to load them. > > It might in fact make sense to add some language to this PEP saying that in > Python 3.6, support for old-style .pyc and .pyo files will be removed. +1 -- ~Ethan~ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From brett at python.org Fri Feb 27 20:26:26 2015 From: brett at python.org (Brett Cannon) Date: Fri, 27 Feb 2015 19:26:26 +0000 Subject: [Import-SIG] PEP for the removal of PYO files References: <20150227132848.7a9ab390@limelight.wooz.org> Message-ID: On Fri, Feb 27, 2015 at 1:28 PM Barry Warsaw wrote: > This looks great Brett, thanks for pushing it forward. I think it's a > perfectly natural and consistent extension to PEP 3147. > > Some comments inlined. > > On Feb 27, 2015, at 05:06 PM, Brett Cannon wrote: > > >Rationale > >========= > > > > - 0: ``.pyc`` > > - 1 (``-O``): ``.pyo`` > > - 2 (``-OO``): ``.pyo`` > > This is all the rationale I need. :) > > >The "opt-" prefix was chosen so as to provide a visual separator > >from the cache tag. The placement of the optimization level after > >the cache tag was chosen to preserve lexicographic sort order of > >bytecode file names based on module name and cache tag which will > >not vary for a single interpreter. The "opt-" prefix was chosen over > >"o" so as to be somewhat self-documenting. 
The "opt-" prefix was > >chosen over "O" so as to not have any confusion with "0" while being > >so close to the interpreter version number. > > I get it, and the examples you include in the open questions is helpful, > but I > still don't like "opt-". We'll no doubt bikeshed on this until Guido > decides but looking at the examples below I'd be okay with 'O'. Did > you consider 'opt', e.g. imporlib.cpython-35.opt0.pyc ? > Nope, and I'll think about it and at least add it as a possibility. > > >Compatibility Considerations > >============================ > > Just as PEP 3147 had to make backward compatibility concessions to .pyc > files > living outside __pycache__ (which I think is still supported, right?) Unfortunately yes. > I think > you'll have to do the same for traditional .pyo files, at least for Python > 3.5. You won't have to *write* such files, but if they exist and the > corresponding optimization level pyc file isn't present in __pycache__, > you'll > have to load them. > > It might in fact make sense to add some language to this PEP saying that in > Python 3.6, support for old-style .pyc and .pyo files will be removed. > Ah, but you see the magic number changed in Python 3.5 for matrix multiplication, so pre-existing .pyo files won't even load, so they will have to be regenerated regardless. I will mention that in the PEP. -------------- next part -------------- An HTML attachment was scrubbed... URL: From donald at stufft.io Fri Feb 27 20:28:05 2015 From: donald at stufft.io (Donald Stufft) Date: Fri, 27 Feb 2015 14:28:05 -0500 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: References: <20150227132848.7a9ab390@limelight.wooz.org> Message-ID: > On Feb 27, 2015, at 2:26 PM, Brett Cannon wrote: > > > > On Fri, Feb 27, 2015 at 1:28 PM Barry Warsaw > wrote: > This looks great Brett, thanks for pushing it forward. I think it's a > perfectly natural and consistent extension to PEP 3147. > > Some comments inlined. 
> > On Feb 27, 2015, at 05:06 PM, Brett Cannon wrote: > > >Rationale > >========= > > > > - 0: ``.pyc`` > > - 1 (``-O``): ``.pyo`` > > - 2 (``-OO``): ``.pyo`` > > This is all the rationale I need. :) > > >The "opt-" prefix was chosen so as to provide a visual separator > >from the cache tag. The placement of the optimization level after > >the cache tag was chosen to preserve lexicographic sort order of > >bytecode file names based on module name and cache tag which will > >not vary for a single interpreter. The "opt-" prefix was chosen over > >"o" so as to be somewhat self-documenting. The "opt-" prefix was > >chosen over "O" so as to not have any confusion with "0" while being > >so close to the interpreter version number. > > I get it, and the examples you include in the open questions is helpful, but I > still don't like "opt-". We'll no doubt bikeshed on this until Guido > decides but looking at the examples below I'd be okay with 'O'. Did > you consider 'opt', e.g. imporlib.cpython-35.opt0.pyc ? > > Nope, and I'll think about it and at least add it as a possibility. > > > >Compatibility Considerations > >============================ > > Just as PEP 3147 had to make backward compatibility concessions to .pyc files > living outside __pycache__ (which I think is still supported, right?) > > Unfortunately yes. > > I think > you'll have to do the same for traditional .pyo files, at least for Python > 3.5. You won't have to *write* such files, but if they exist and the > corresponding optimization level pyc file isn't present in __pycache__, you'll > have to load them. > > It might in fact make sense to add some language to this PEP saying that in > Python 3.6, support for old-style .pyc and .pyo files will be removed. > > Ah, but you see the magic number changed in Python 3.5 for matrix multiplication, so pre-existing .pyo files won't even load, so they will have to be regenerated regardless. I will mention that in the PEP. 
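The regeneration Brett describes is driven by the four-byte magic number at the start of every bytecode file: a header written by a different release no longer matches, so the file is recompiled from source. A minimal sketch (the throwaway module exists only for the demonstration):

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    # py_compile writes the PEP 3147 cache file and returns its path.
    pyc = py_compile.compile(src)
    with open(pyc, "rb") as f:
        magic = f.read(4)

# The header matches the running interpreter; bytecode compiled by an
# older release would carry a different magic number and be ignored.
print(magic == importlib.util.MAGIC_NUMBER)  # True
```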
> _______________________________________________ > Import-SIG mailing list > Import-SIG at python.org > https://mail.python.org/mailman/listinfo/import-sig Some people ship .pyc only code, do people also ship .pyo only code? --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: Message signed with OpenPGP using GPGMail URL: From brett at python.org Fri Feb 27 20:40:08 2015 From: brett at python.org (Brett Cannon) Date: Fri, 27 Feb 2015 19:40:08 +0000 Subject: [Import-SIG] PEP for the removal of PYO files References: <20150227132848.7a9ab390@limelight.wooz.org> Message-ID: On Fri, Feb 27, 2015 at 2:28 PM Donald Stufft wrote: > On Feb 27, 2015, at 2:26 PM, Brett Cannon wrote: > > > > On Fri, Feb 27, 2015 at 1:28 PM Barry Warsaw wrote: > >> This looks great Brett, thanks for pushing it forward. I think it's a >> perfectly natural and consistent extension to PEP 3147. >> >> Some comments inlined. >> >> On Feb 27, 2015, at 05:06 PM, Brett Cannon wrote: >> >> >Rationale >> >========= >> > >> > - 0: ``.pyc`` >> > - 1 (``-O``): ``.pyo`` >> > - 2 (``-OO``): ``.pyo`` >> >> This is all the rationale I need. :) >> >> >The "opt-" prefix was chosen so as to provide a visual separator >> >from the cache tag. The placement of the optimization level after >> >the cache tag was chosen to preserve lexicographic sort order of >> >bytecode file names based on module name and cache tag which will >> >not vary for a single interpreter. The "opt-" prefix was chosen over >> >"o" so as to be somewhat self-documenting. The "opt-" prefix was >> >chosen over "O" so as to not have any confusion with "0" while being >> >so close to the interpreter version number. 
>> >> I get it, and the examples you include in the open questions is helpful, >> but I >> still don't like "opt-". We'll no doubt bikeshed on this until Guido >> decides but looking at the examples below I'd be okay with 'O'. >> Did >> you consider 'opt', e.g. imporlib.cpython-35.opt0.pyc ? >> > > Nope, and I'll think about it and at least add it as a possibility. > > >> >> >Compatibility Considerations >> >============================ >> >> Just as PEP 3147 had to make backward compatibility concessions to .pyc >> files >> living outside __pycache__ (which I think is still supported, right?) > > > Unfortunately yes. > > >> I think >> you'll have to do the same for traditional .pyo files, at least for Python >> 3.5. You won't have to *write* such files, but if they exist and the >> corresponding optimization level pyc file isn't present in __pycache__, >> you'll >> have to load them. >> >> It might in fact make sense to add some language to this PEP saying that >> in >> Python 3.6, support for old-style .pyc and .pyo files will be removed. >> > > Ah, but you see the magic number changed in Python 3.5 for matrix > multiplication, so pre-existing .pyo files won't even load, so they will > have to be regenerated regardless. I will mention that in the PEP. > > _______________________________________________ > Import-SIG mailing list > Import-SIG at python.org > https://mail.python.org/mailman/listinfo/import-sig > > > Some people ship .pyc only code, do people also ship .pyo only code? > Definitely possible as is shipping both .pyc and .pyo files. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ericsnowcurrently at gmail.com Sat Feb 28 01:30:10 2015 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Fri, 27 Feb 2015 17:30:10 -0700 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: References: <20150227132848.7a9ab390@limelight.wooz.org> Message-ID: On Fri, Feb 27, 2015 at 12:26 PM, Brett Cannon wrote: > On Fri, Feb 27, 2015 at 1:28 PM Barry Warsaw wrote: >> I get it, and the examples you include in the open questions are helpful, >> but I >> still don't like "opt-". We'll no doubt bikeshed on this until Guido >> decides but looking at the examples below I'd be okay with 'O'. >> Did >> you consider 'opt', e.g. importlib.cpython-35.opt0.pyc ? > > > Nope, and I'll think about it and at least add it as a possibility. Keep in mind that the optimization "level" isn't constrained to just digits: importlib.cpython-35.opt-b01603b27537a88c593d429923081a813f66eaef7360a5040507b90e85d285b0.pyc vs. importlib.cpython-35.optb01603b27537a88c593d429923081a813f66eaef7360a5040507b90e85d285b0.pyc I think the hyphen helps in that case. -eric From ericsnowcurrently at gmail.com Sat Feb 28 01:30:54 2015 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Fri, 27 Feb 2015 17:30:54 -0700 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: References: <20150227132848.7a9ab390@limelight.wooz.org> Message-ID: On Fri, Feb 27, 2015 at 5:30 PM, Eric Snow wrote: > importlib.cpython-35.opt-b01603b27537a88c593d429923081a813f66eaef7360a5040507b90e85d285b0.pyc > > vs. > > importlib.cpython-35.optb01603b27537a88c593d429923081a813f66eaef7360a5040507b90e85d285b0.pyc BTW, that hash comes from the hashlib example in the PEP. 
:) -eric From ncoghlan at gmail.com Sat Feb 28 17:50:39 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 1 Mar 2015 02:50:39 +1000 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: References: Message-ID: On 28 February 2015 at 03:06, Brett Cannon wrote: > Here is my proposed PEP to drop .pyo files from Python. Thanks to Barry's > work in PEP 3147 this really shouldn't have much impact on user's code (then > again, bytecode files are basically an implementation detail so it shouldn't > impact hardly anyone directly). Some specific technical questions/suggestions: * Can we make "opt-0" implied so normal pyc file names don't change at all? * I'd like to see a description of the impact on compileall (which may be "no impact", but I'd like the PEP to explicitly say that if so) > One thing I would appreciate is if people have more motivation for this. > While the maintainer of importlib in me wants to see this happen, the core > developer in me thinks the arguments are a little weak. So if people can > provide more reasons why this is a good thing that would be appreciated. For that aspect, I'd suggest pitching the PEP as aiming primarily at separating the two optimisation levels (so stripped PYO files don't overwrite normal ones) and then simply eliminating the pyo extension entirely as being redundant since the new mechanism will also make it possible to distinguish optimised files from unoptimised ones. The first is the user facing benefit of the change (e.g. it lets us precompile all three levels in distro packages), while the latter is just a nice import maintainer facing side-effect. This perspective would likely be further strengthened if the "opt-0" case were taken as the implied default rather than being explicit in the filename. Regards, Nick. 
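The implied-default variant Nick suggests would leave level-0 file names exactly as PEP 3147 has them today, with the tag appearing only for optimized bytecode. A sketch of that naming rule (illustrative only, not the actual importlib implementation):

```python
def bytecode_filename(module, cache_tag, optimization=0):
    # Level 0 keeps the plain PEP 3147 name; higher levels insert the
    # proposed "opt-" tag between the cache tag and the suffix.
    if optimization == 0:
        return "{}.{}.pyc".format(module, cache_tag)
    return "{}.{}.opt-{}.pyc".format(module, cache_tag, optimization)

print(bytecode_filename("importlib", "cpython-35"))     # importlib.cpython-35.pyc
print(bytecode_filename("importlib", "cpython-35", 2))  # importlib.cpython-35.opt-2.pyc
```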
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From solipsis at pitrou.net Sat Feb 28 17:57:08 2015 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 28 Feb 2015 17:57:08 +0100 Subject: [Import-SIG] PEP for the removal of PYO files References: Message-ID: <20150228175708.372d145d@fsol> On Fri, 27 Feb 2015 17:06:59 +0000 Brett Cannon wrote: > > A period was chosen over a hyphen as a separator so as to distinguish > clearly that the optimization level is not part of the interpreter > version as specified by the cache tag. It also lends to the use of > the period in the file name to delineate semantically different > concepts. Indeed but why would other implementations have to mimic CPython here? Perhaps the whole idea of differing "optimization" levels doesn't make sense for them. Regards Antoine. From ncoghlan at gmail.com Sat Feb 28 18:13:20 2015 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 1 Mar 2015 03:13:20 +1000 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: <20150228175708.372d145d@fsol> References: <20150228175708.372d145d@fsol> Message-ID: On 1 March 2015 at 02:57, Antoine Pitrou wrote: > On Fri, 27 Feb 2015 17:06:59 +0000 > Brett Cannon wrote: >> >> A period was chosen over a hyphen as a separator so as to distinguish >> clearly that the optimization level is not part of the interpreter >> version as specified by the cache tag. It also lends to the use of >> the period in the file name to delineate semantically different >> concepts. > > Indeed but why would other implementations have to mimic CPython here? > Perhaps the whole idea of differing "optimization" levels doesn't make > sense for them. > > Could Numba potentially use it for JIT priming? (I'd ask for PyPy as well, but I don't know if we have any PyPy devs on the import-sig list) Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From solipsis at pitrou.net Sat Feb 28 18:16:01 2015 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 28 Feb 2015 18:16:01 +0100 Subject: [Import-SIG] PEP for the removal of PYO files In-Reply-To: References: <20150228175708.372d145d@fsol> Message-ID: <20150228181601.262eaba1@fsol> On Sun, 1 Mar 2015 03:13:20 +1000 Nick Coghlan wrote: > On 1 March 2015 at 02:57, Antoine Pitrou wrote: > > On Fri, 27 Feb 2015 17:06:59 +0000 > > Brett Cannon wrote: > >> > >> A period was chosen over a hyphen as a separator so as to distinguish > >> clearly that the optimization level is not part of the interpreter > >> version as specified by the cache tag. It also lends to the use of > >> the period in the file name to delineate semantically different > >> concepts. > > > > Indeed but why would other implementations have to mimick CPython here? > > Perhaps the whole idea of differing "optimization" levels doesn't make > > sense for them. > > Could Numba potentially use it for JIT priming? We'll probably want something like that one day, but we wouldn't necessarily use the same file structure - Numba currently works at the function level, not at the module level. In other words, the PEP is entirely neutral for us. Regards Antoine. From bcannon at gmail.com Sat Feb 28 22:08:50 2015 From: bcannon at gmail.com (Brett Cannon) Date: Sat, 28 Feb 2015 21:08:50 +0000 Subject: [Import-SIG] PEP for the removal of PYO files Message-ID: On Sat, Feb 28, 2015 at 11:50 AM Nick Coghlan wrote: > On 28 February 2015 at 03:06, Brett Cannon wrote: > > Here is my proposed PEP to drop .pyo files from Python. Thanks to Barry's > > work in PEP 3147 this really shouldn't have much impact on user's code > (then > > again, bytecode files are basically an implementation detail so it > shouldn't > > impact hardly anyone directly). 
> > Some specific technical questions/suggestions: > > * Can we make "opt-0" implied so normal pyc file names don't change at all? > Sure, but why specifically? EIBTI makes me not want to have some optional bit in the file name just to make life a little easier for someone who didn't use cache_from_source(). > > * I'd like to see a description of the impact on compileall (which may > be "no impact", but I'd like the PEP to explicitly say that if so) > Are you talking about the command-line interface? If so then no, it makes no special difference beyond the fact that .pyo files won't be put in the legacy locations if you run the interpreter with -O and -OO. > > > One thing I would appreciate is if people have more motivation for this. > > While the maintainer of importlib in me wants to see this happen, the > core > > developer in me thinks the arguments are a little weak. So if people can > > provide more reasons why this is a good thing that would be appreciated. > > For that aspect, I'd suggest pitching the PEP as aiming primarily at > separating the two optimisation levels (so stripped PYO files don't > overwrite normal ones) and then simply eliminating the pyo extension > entirely as being redundant since the new mechanism will also make it > possible to distinguish optimised files from unoptimised ones. > > The first is the user facing benefit of the change (e.g. it lets us > precompile all three levels in distro packages), while the latter is > just a nice import maintainer facing side-effect. > I'll add a sentence mentioning it allows all optimization levels to be compiled and available at once. > > This perspective would likely be further strengthened if the "opt-0" > case were taken as the implied default rather than being explicit in > the filename. > Is that really so important? When was the last time you looked in a __pycache__ directory? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brett at python.org Sat Feb 28 22:16:30 2015 From: brett at python.org (Brett Cannon) Date: Sat, 28 Feb 2015 21:16:30 +0000 Subject: [Import-SIG] PEP for the removal of PYO files References: <20150228175708.372d145d@fsol> Message-ID: On Sat, Feb 28, 2015 at 11:57 AM Antoine Pitrou wrote: > On Fri, 27 Feb 2015 17:06:59 +0000 > Brett Cannon wrote: > > > > A period was chosen over a hyphen as a separator so as to distinguish > > clearly that the optimization level is not part of the interpreter > > version as specified by the cache tag. It also lends to the use of > > the period in the file name to delineate semantically different > > concepts. > > Indeed but why would other implementations have to mimick CPython here? > Perhaps the whole idea of differing "optimization" levels doesn't make > sense for them. > Directly it might not, but if they support the AST module along with passing AST nodes to compile() then they would implicitly support optimizations for bytecode through custom loaders. I also checked PyPy and IronPython 3 and they both support -O. But an implementation that chose to skip the ast module and not support -O is the best argument to support Nick's ask to not specify the optimization if it is 0 (although I'm not saying that's enough to sway me to change the PEP). -------------- next part -------------- An HTML attachment was scrubbed... URL:
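The implicit-support argument rests on compile() accepting AST objects: an implementation that exposes the ast module and an AST-accepting compile() lets a custom loader rewrite the tree before generating bytecode, independent of any -O flag. A minimal sketch (the untouched tree here is a stand-in for a real optimizer pass):

```python
import ast

# A custom loader could parse the source, hand the tree to an
# optimizer, and compile the result instead of the raw source.
tree = ast.parse("answer = 6 * 7")
# ... an AST optimizer would rewrite `tree` here ...
code = compile(tree, "<generated>", "exec")

ns = {}
exec(code, ns)
print(ns["answer"])  # 42
```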