From yahya-abou-imran at protonmail.com Mon Jan 1 11:45:21 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Mon, 01 Jan 2018 11:45:21 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: <5rKl4ov7H8c_D8fcUybvuLBUalDO5MHQ4-kwOzCCAAiIRVmvmyrGC2UMjJXehcUBALGYhgCpiSOg_jDdabfXhZo-A6Wd4MnxCW8SNLtb1ag=@protonmail.com> > A. Width restrictions suggest making the async branches a separate diagram. > > B. Be consistent on placement of inherited versus added methods. Always > list inherited first? Different fonts, as suggested, might be good. > > C. After discussion here, and revision, open a doc enhancement issue on >bugs.python.org. I managed to have a nice layout to have all the ABCs in this module on one diagram. I'm joining three versions to this mail. I'm thankfull to all the people that gave my their opinions. However, I insist that abstract methods should be shown first (in italic, witch is the standard in UML). And about parenthesis, it's a standard in UML to represent methods; so even if it does'nt add anything meaningfull in this situation, I prefer to stick to UML conventions as most as possible. -------------- next part -------------- A non-text attachment was scrubbed... Name: base.png Type: image/png Size: 71476 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: other_collections.png Type: image/png Size: 15399 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: full.png Type: image/png Size: 77903 bytes Desc: not available URL: From brett at python.org Mon Jan 1 15:39:29 2018 From: brett at python.org (Brett Cannon) Date: Mon, 01 Jan 2018 20:39:29 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: While I appreciate what you're trying to accomplish, Yahya, one thing I would like to say is if we were to accept the diagram into the docs I would prefer that there be a source file that isn't an image which we can update with easily available software (e.g. like a dot file). Otherwise updating the file will either be burdensome going forward or we will simply have to drop the image at the first instance of needing to update it because no one can or be willing to put in the effort (and I'm thinking in 5 years, not soon while we can count on you to help). On Sat, Dec 30, 2017, 08:12 Yahya Abou 'Imran via Python-ideas, < python-ideas at python.org> wrote: > We can find very usefull class diagramm to understand the hierarchy of the > builtin Collection abstract class and interface in java. > > Some examples: > http://www.falkhausen.de/Java-8/java.util/Collection-Hierarchy-simple.html > http://www.falkhausen.de/Java-8/java.util/Collection-List.html > > But when I search about python's ABC, The more detailed I can find are > those from the book of Luciano Ramalho Fluent Python: > https://goo.gl/images/8JGjvM > https://goo.gl/images/6xZqcA > > (I think they're done with pyreverse of pylint) > > They are fine, but I think we could provide some other more detailed in > this page: > https://docs.python.org/3/library/collections.abc.html > > The table could be difficult to understand, a diagram help visualize > things. > > I've began working on it with plantuml and pyreverse, I'm joining to this > mail what I've done so far so you can tell me what you think. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yahya-abou-imran at protonmail.com Mon Jan 1 16:31:47 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Mon, 01 Jan 2018 16:31:47 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: >While I appreciate what you're trying to accomplish, Yahya, one thing I would like to say is if we were to accept the diagram into the docs I would prefer that there be a source file that isn't an image which we can update with easily available software (e.g. like a dot file). Otherwise updating the file will either be burdensome going forward or we will simply have to drop the image at the first instance of needing to update it because no one can or be willing to put in the effort (and I'm thinking in 5 years, not soon while we can count on you to help). > Of course! "Tip 23 Always Use Source Code Control Always. Even if you are a single-person team on a one-week project. Even if it's a "throw-away" prototype. Even if the stuff you're working on isn't source code. Make sure that *everything* is under source control -- documentation, phone number lists, memos to vendors, makefiles, build and release procedure, that little shell script that burns the CD master -- everything. We routinely use source code control on just about everything (including the text of this bool)." The Pragmatic Programmer, Andrew Hunt & David Thomas. Here are the files! I used plantuml. -------------- next part -------------- A non-text attachment was scrubbed... Name: base.puml Type: application/octet-stream Size: 2858 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: full.puml Type: application/octet-stream Size: 3566 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: other_collections.puml Type: application/octet-stream Size: 832 bytes Desc: not available URL: From barry at barrys-emacs.org Mon Jan 1 16:43:31 2018 From: barry at barrys-emacs.org (Barry Scott) Date: Mon, 1 Jan 2018 21:43:31 +0000 Subject: [Python-ideas] a set of enum.Enum values rather than the construction of bit-sets as the "norm"? In-Reply-To: References: <20171227055639.GP4215@ando.pearwood.info> <9ec7fc9a-57ca-4fd5-ad21-8b1346349c2a@googlegroups.com> <20171229081816.GT4215@ando.pearwood.info> <20171229153821.GU4215@ando.pearwood.info> <20171230042531.GV4215@ando.pearwood.info> <980401a3-27c0-c5aa-24c9-41e3db533f69@mrabarnett.plus.com> Message-ID: I'm guessing that what this thread is about is coming up with an API rule that makes providing a set of boolean options available to a function or class in the least error prone way. Its the error prone nature of integer bit masks that is behind the enum suggestion I assume. From the C tradition we have the integer bit mask which is error prone as there is no type checking that the masks belong to the option flags. Some APIs use calls with lots of keyword args that you set true or false and even none to mean default. The suggestion for a set of enums from this thread. You would need a class to represent a set of a particular enum to get type safety. List of strings or enums. You could even use a class that represents the options and set up an instance and pass it in. Hard to get wrong. I can see that all these styles have their place and each designer will pick the style they think fits the API they are designing. Barry From guido at python.org Mon Jan 1 17:16:26 2018 From: guido at python.org (Guido van Rossum) Date: Mon, 1 Jan 2018 15:16:26 -0700 Subject: [Python-ideas] a set of enum.Enum values rather than the construction of bit-sets as the "norm"? In-Reply-To: References: <20171227055639.GP4215@ando.pearwood.info> <9ec7fc9a-57ca-4fd5-ad21-8b1346349c2a@googlegroups.com> <20171229081816.GT4215@ando.pearwood.info> <20171229153821.GU4215@ando.pearwood.info> <20171230042531.GV4215@ando.pearwood.info> <980401a3-27c0-c5aa-24c9-41e3db533f69@mrabarnett.plus.com> Message-ID: The enum.Flag type solves all this neatly. On Mon, Jan 1, 2018 at 2:43 PM, Barry Scott wrote: > I'm guessing that what this thread is about is coming up with an API rule > that makes > providing a set of boolean options available to a function or class in the > least error prone way. > > Its the error prone nature of integer bit masks that is behind the enum > suggestion I assume. > > From the C tradition we have the integer bit mask which is error prone as > there is no type checking that the masks belong to the option flags. > > Some APIs use calls with lots of keyword args that you set true or false > and even none to mean default. > > The suggestion for a set of enums from this thread. You would need a class > to represent a set of a particular enum to get type safety. > > List of strings or enums. > > You could even use a class that represents the options and set up an > instance and pass it in. Hard to get wrong. > > I can see that all these styles have their place and each designer will > pick the style they think fits the API they > are designing. > > Barry > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.stinner at gmail.com Mon Jan 1 17:32:52 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 1 Jan 2018 23:32:52 +0100 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: Hi, There is "blockdiag" which is Sphinx friendly: http://blockdiag.com/en/blockdiag/sphinxcontrib.html Look also at: * http://asciiflow.com/ * http://ditaa.sourceforge.net/ * http://asciidoctor.org/news/2014/02/18/plain-text-diagrams-in-asciidoctor/ * etc. I like ASCII Art since it doesn't require any specific tool to edit it (even if dedicated tools like asciiflow can make editing simpler). For example, I have no idea how to open a ".puml" file. What if the tool for this specific format becomes outdated or is not available on some platforms? Graphviz with "dot" files is another option. Victor 2018-01-01 21:39 GMT+01:00 Brett Cannon : > While I appreciate what you're trying to accomplish, Yahya, one thing I > would like to say is if we were to accept the diagram into the docs I would > prefer that there be a source file that isn't an image which we can update > with easily available software (e.g. like a dot file). Otherwise updating > the file will either be burdensome going forward or we will simply have to > drop the image at the first instance of needing to update it because no one > can or be willing to put in the effort (and I'm thinking in 5 years, not > soon while we can count on you to help). > > On Sat, Dec 30, 2017, 08:12 Yahya Abou 'Imran via Python-ideas, > wrote: >> >> We can find very usefull class diagramm to understand the hierarchy of the >> builtin Collection abstract class and interface in java. >> >> Some examples: >> http://www.falkhausen.de/Java-8/java.util/Collection-Hierarchy-simple.html >> http://www.falkhausen.de/Java-8/java.util/Collection-List.html >> >> But when I search about python's ABC, The more detailed I can find are >> those from the book of Luciano Ramalho Fluent Python: >> https://goo.gl/images/8JGjvM >> https://goo.gl/images/6xZqcA >> >> (I think they're done with pyreverse of pylint) >> >> They are fine, but I think we could provide some other more detailed in >> this page: >> https://docs.python.org/3/library/collections.abc.html >> >> The table could be difficult to understand, a diagram help visualize >> things. >> >> I've began working on it with plantuml and pyreverse, I'm joining to this >> mail what I've done so far so you can tell me what you think. >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > From tjreedy at udel.edu Mon Jan 1 17:50:16 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 1 Jan 2018 17:50:16 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: On 1/1/2018 3:39 PM, Brett Cannon wrote: > While I appreciate what you're trying to accomplish, Yahya, one thing I > would like to say is if we were to accept the diagram into the docs I > would prefer that there be a source file that isn't an image which we > can update with easily available software (e.g. like a dot file). 'dot file' was new to me. It is the input format for the dot tool of the open-source graphviz package, with binaries available for Windows, Mac, and various *nixes. After looking at http://www.ffnn.nl/pages/articles/media/uml-diagrams-using-graphviz-dot.php and seeing how easy to edit the examples are, I would require a text source file unless the result were somehow bad. Yahya, how did *you* produce your example? > Otherwise updating the file will either be burdensome going forward or > we will simply have to drop the image at the first instance of needing > to update it because no one can or be willing to put in the effort (and > I'm thinking in 5 years, not soon while we can count on you to help). The ABCs seem to change a bit with every version. -- Terry Jan Reedy From yahya-abou-imran at protonmail.com Mon Jan 1 17:58:42 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Mon, 01 Jan 2018 17:58:42 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: http://plantuml.com/ You just run it with the `plantuml` command, and you have .png It has a good integration with a lot of tools (iPython for example) : http://plantuml.com/running I will look at your suggestions though. -------- Message d'origine -------- On 1 janv. 2018 23:32, Victor Stinner a ?crit : > Hi, > > There is "blockdiag" which is Sphinx friendly: > http://blockdiag.com/en/blockdiag/sphinxcontrib.html > > Look also at: > > * http://asciiflow.com/ > * http://ditaa.sourceforge.net/ > * http://asciidoctor.org/news/2014/02/18/plain-text-diagrams-in-asciidoctor/ > * etc. > > I like ASCII Art since it doesn't require any specific tool to edit it > (even if dedicated tools like asciiflow can make editing simpler). > > For example, I have no idea how to open a ".puml" file. What if the > tool for this specific format becomes outdated or is not available on > some platforms? > > Graphviz with "dot" files is another option. > > Victor > > 2018-01-01 21:39 GMT+01:00 Brett Cannon : >> While I appreciate what you're trying to accomplish, Yahya, one thing I >> would like to say is if we were to accept the diagram into the docs I would >> prefer that there be a source file that isn't an image which we can update >> with easily available software (e.g. like a dot file). Otherwise updating >> the file will either be burdensome going forward or we will simply have to >> drop the image at the first instance of needing to update it because no one >> can or be willing to put in the effort (and I'm thinking in 5 years, not >> soon while we can count on you to help). >> >> On Sat, Dec 30, 2017, 08:12 Yahya Abou 'Imran via Python-ideas, >> wrote: >>> >>> We can find very usefull class diagramm to understand the hierarchy of the >>> builtin Collection abstract class and interface in java. >>> >>> Some examples: >>> http://www.falkhausen.de/Java-8/java.util/Collection-Hierarchy-simple.html >>> http://www.falkhausen.de/Java-8/java.util/Collection-List.html >>> >>> But when I search about python's ABC, The more detailed I can find are >>> those from the book of Luciano Ramalho Fluent Python: >>> https://goo.gl/images/8JGjvM >>> https://goo.gl/images/6xZqcA >>> >>> (I think they're done with pyreverse of pylint) >>> >>> They are fine, but I think we could provide some other more detailed in >>> this page: >>> https://docs.python.org/3/library/collections.abc.html >>> >>> The table could be difficult to understand, a diagram help visualize >>> things. >>> >>> I've began working on it with plantuml and pyreverse, I'm joining to this >>> mail what I've done so far so you can tell me what you think. >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> @python.org> @python.org> -------------- next part -------------- An HTML attachment was scrubbed... URL: From yahya-abou-imran at protonmail.com Mon Jan 1 18:51:10 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Mon, 01 Jan 2018 18:51:10 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: Plantuml can also generate ASCII, so playing with ditaa I managed to have intersting things... I opened a public repo on my GitLab account to put that all so you can have a visualization of it (source files and `png`s): https://gitlab.com/yahya-abou-imran/collections-abc-uml `dot` files seems also interesting by the way... -------------- next part -------------- An HTML attachment was scrubbed... URL: From wes.turner at gmail.com Mon Jan 1 20:50:09 2018 From: wes.turner at gmail.com (Wes Turner) Date: Mon, 1 Jan 2018 20:50:09 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: On Monday, January 1, 2018, Yahya Abou 'Imran via Python-ideas < python-ideas at python.org> wrote: > Plantuml can also generate ASCII, so playing with ditaa I managed to have > intersting things... > > I opened a public repo on my GitLab account to put that all so you can > have a visualization of it (source files and `png`s): > > https://gitlab.com/yahya-abou-imran/collections-abc-uml > > `dot` files seems also interesting by the way... > > There is a PlantUML Sphinx extension (which requires Java): https://github.com/sphinx-contrib/plantuml/ There is GraphViz Sphinx extension: http://www.sphinx-doc.org/en/stable/ext/graphviz.html It looks like pyreverse can generate UML diagrams as DOT files: https://github.com/PyCQA/pylint/tree/master/pylint/pyreverse IDK how much post-processing or tool customization is necessary to implement the requested UML diagram styles. This generates DOT files from Django model classes (without adding a Java dependency to the Sphinx docs build): https://github.com/django-extensions/django-extensions/blob/master/django_extensions/management/modelviz.py This generates PlantUML and DOT diagrams from SQLalchemy classes: https://bitbucket.org/estin/sadisplay -------------- next part -------------- An HTML attachment was scrubbed... URL: From wes.turner at gmail.com Mon Jan 1 20:51:22 2018 From: wes.turner at gmail.com (Wes Turner) Date: Mon, 1 Jan 2018 20:51:22 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: SVG is preferable to PNG because you can Ctrl-F SVG. On Monday, January 1, 2018, Wes Turner wrote: > > > On Monday, January 1, 2018, Yahya Abou 'Imran via Python-ideas < > python-ideas at python.org> wrote: > >> Plantuml can also generate ASCII, so playing with ditaa I managed to have >> intersting things... >> >> I opened a public repo on my GitLab account to put that all so you can >> have a visualization of it (source files and `png`s): >> >> https://gitlab.com/yahya-abou-imran/collections-abc-uml >> >> `dot` files seems also interesting by the way... >> >> > There is a PlantUML Sphinx extension (which requires Java): > https://github.com/sphinx-contrib/plantuml/ > > There is GraphViz Sphinx extension: > http://www.sphinx-doc.org/en/stable/ext/graphviz.html > > It looks like pyreverse can generate UML diagrams as DOT files: > https://github.com/PyCQA/pylint/tree/master/pylint/pyreverse > > IDK how much post-processing or tool customization is necessary to > implement the requested UML diagram styles. > > This generates DOT files from Django model classes (without adding a Java > dependency to the Sphinx docs build): > https://github.com/django-extensions/django-extensions/ > blob/master/django_extensions/management/modelviz.py > > This generates PlantUML and DOT diagrams from SQLalchemy classes: > https://bitbucket.org/estin/sadisplay > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yahya-abou-imran at protonmail.com Tue Jan 2 08:25:21 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Tue, 02 Jan 2018 08:25:21 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: At the end of the day, I found that plantuml is the most suitable tool for this. Graphviz dot is interesting, but it doesn't feel natural to make class diagram with it, or at least it's less handy... I could bring several arguments to support this, but it's not the topic. Everybody wanting to try itself is welcome, but, *I* can take the commitment to maintain it over the years. Here are the 3 svg files witch are my last proposals for the moment: https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/base.svg https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/other_collections.svg https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/full.svg -------------- next part -------------- An HTML attachment was scrubbed... URL: From wes.turner at gmail.com Tue Jan 2 09:34:58 2018 From: wes.turner at gmail.com (Wes Turner) Date: Tue, 2 Jan 2018 09:34:58 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: On Tuesday, January 2, 2018, Yahya Abou 'Imran < yahya-abou-imran at protonmail.com> wrote: > At the end of the day, I found that plantuml is the most suitable tool for > this. > Graphviz dot is interesting, but it doesn't feel natural to make class > diagram with it, or at least it's less handy... I could bring several > arguments to support this, but it's not the topic. > > Everybody wanting to try itself is welcome, but, *I* can take the > commitment to maintain it over the years. > https://readthedocs.org/projects/python/ https://github.com/python/cpython/blob/master/Doc/conf.py https://github.com/python/cpython/blob/master/Doc/library/collections.abc.rst https://devguide.python.org/#contributing https://devguide.python.org/docquality/ > > Here are the 3 svg files witch are my last proposals for the moment: > > https://gitlab.com/yahya-abou-imran/collections-abc-uml/ > blob/master/plantuml/base.svg > https://gitlab.com/yahya-abou-imran/collections-abc-uml/ > blob/master/plantuml/other_collections.svg > https://gitlab.com/yahya-abou-imran/collections-abc-uml/ > blob/master/plantuml/full.svg > Thanks! Wow, now I wish that I had said PNG email attachments and SVG with fallback to PNG in the Sphinx docs. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gadgetsteve at live.co.uk Tue Jan 2 05:01:47 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Tue, 2 Jan 2018 10:01:47 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: Can I suggest that rather than manually producing or tweaking, and later updating, the diagrams it might be better to spend a little time annotating the source code and possibly adding to extra configuration files so that a tools such as DoxyGen (http://www.stack.nl/~dimitri/doxygen/) with GraphViz can be used to extract the information from the source code, produce the dot files and then produce the diagrams in whatever the desired format is. Of course Sphinx with suitable plug-ins might be able to do the same - or will eventually be able to do so. While the diagrams produced might lack the elegance of manually produced ones they would be much more useful as they would always be up to date due to being produced, and updated, automatically. DoxyGen is an open source, cross platform, tool that can parse and diagram any of C, Objective-C, C#, PHP, Java, Python, IDL (Corba, Microsoft, and UNO/OpenOffice flavors), Fortran, VHDL, Tcl, and to some extent D. It is already in use in the wxPython Phoenix project to parse the wxWidgets C++ code so as to extract the interface details for both the documentation and implementation. It can also work with MSCGEN, DIA & PLANTUML. I am attaching the diagram produced for the full inheritance of collections.abc as produced by doxygen/graphviz but I am sure that there are some options that could make this more readable/useful. Steve On 01/01/2018 20:39, Brett Cannon wrote: > While I appreciate what you're trying to accomplish, Yahya, one thing I > would like to say is if we were to accept the diagram into the docs I > would prefer that there be a source file that isn't an image which we > can update with easily available software (e.g. like a dot file). > Otherwise updating the file will either be burdensome going forward or > we will simply have to drop the image at the first instance of needing > to update it because no one can or be willing to put in the effort (and > I'm thinking in 5 years, not soon while we can count on you to help). > > On Sat, Dec 30, 2017, 08:12 Yahya Abou 'Imran via Python-ideas, > > wrote: > > We can find very usefull class diagramm to understand the hierarchy > of the builtin Collection abstract class and interface in java. > > Some examples: > http://www.falkhausen.de/Java-8/java.util/Collection-Hierarchy-simple.html > > http://www.falkhausen.de/Java-8/java.util/Collection-List.html > > > But when I search about python's ABC, The more detailed I can find > are those from the book of Luciano Ramalho Fluent Python: > https://goo.gl/images/8JGjvM > > https://goo.gl/images/6xZqcA > > > (I think they're done with pyreverse of pylint) > > They are fine, but I think we could provide some other more detailed > in this page: > https://docs.python.org/3/library/collections.abc.html > > > The table could be difficult to understand, a diagram help visualize > things. > > I've began working on it with plantuml and pyreverse, I'm joining to > this mail what I've done so far so you can tell me what you think. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fpython-ideas&data=02%7C01%7C%7Cdcdf2c291d0c41499c5908d5515902ef%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636504365089997045&sdata=AIRvO1pUs69rWu%2FbV7fZ7ab%2FvrhNnlgFjucXz4JaXSU%3D&reserved=0 > Code of Conduct: https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpython.org%2Fpsf%2Fcodeofconduct%2F&data=02%7C01%7C%7Cdcdf2c291d0c41499c5908d5515902ef%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636504365089997045&sdata=jUd4UT9Usupz0chUJnItDXCd84PCZdCFkIkOQNmE6S8%3D&reserved=0 > -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. -------------- next part -------------- A non-text attachment was scrubbed... Name: class__collections__abc_1_1_collection__inherit__graph.svg Type: image/svg+xml Size: 105357 bytes Desc: class__collections__abc_1_1_collection__inherit__graph.svg URL: From wes.turner at gmail.com Tue Jan 2 09:41:07 2018 From: wes.turner at gmail.com (Wes Turner) Date: Tue, 2 Jan 2018 09:41:07 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: Is there a way to generate relative links to the classes in the SVG? This would be really convenient: https://docs.python.org/3/library/collections.abc.html#collections.abc.Hashable Hashable On Tuesday, January 2, 2018, Wes Turner wrote: > > > On Tuesday, January 2, 2018, Yahya Abou 'Imran < > yahya-abou-imran at protonmail.com> wrote: > >> At the end of the day, I found that plantuml is the most suitable tool >> for this. >> Graphviz dot is interesting, but it doesn't feel natural to make class >> diagram with it, or at least it's less handy... I could bring several >> arguments to support this, but it's not the topic. >> >> Everybody wanting to try itself is welcome, but, *I* can take the >> commitment to maintain it over the years. >> > > https://readthedocs.org/projects/python/ > > https://github.com/python/cpython/blob/master/Doc/conf.py > > https://github.com/python/cpython/blob/master/Doc/ > library/collections.abc.rst > > https://devguide.python.org/#contributing > > https://devguide.python.org/docquality/ > > > >> >> Here are the 3 svg files witch are my last proposals for the moment: >> >> https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob >> /master/plantuml/base.svg >> https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob >> /master/plantuml/other_collections.svg >> https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob >> /master/plantuml/full.svg >> > > Thanks! Wow, now I wish that I had said PNG email attachments and SVG with > fallback to PNG in the Sphinx docs. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yahya-abou-imran at protonmail.com Tue Jan 2 10:26:30 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Tue, 02 Jan 2018 10:26:30 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: >Can I suggest that rather than manually producing or tweaking, and later > updating, the diagrams it might be better to spend a little time > annotating the source code [...] > While the diagrams produced might lack the elegance of manually produced > ones they would be much more useful as they would always be up to date > due to being produced, and updated, automatically. I think it would be a lot of changes in the source code... I would like to precise, so everybody can know it, the syntax of a .puml is realy simple. For example: test.puml: @startuml hide members show methods abstract class Iterable { {abstract} __iter__() } abstract class Iterator { {abstract} __next__() .. __iter__() } Iterable <|-- Iterator @enduml Then: $ plantuml test.puml And you have test.png I joined. > [...] > I am attaching the diagram produced for the full inheritance of > collections.abc as produced by doxygen/graphviz but I am sure that there > are some options that could make this more readable/useful. > > Steve It's a blank file that I have... I've been struggling with those kind of tools these days, and realised that it's a lot more work (and pain) than a plain text file. -------------- next part -------------- A non-text attachment was scrubbed... Name: test.png Type: image/png Size: 5604 bytes Desc: not available URL: From brett at python.org Tue Jan 2 14:38:40 2018 From: brett at python.org (Brett Cannon) Date: Tue, 02 Jan 2018 19:38:40 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: On Tue, 2 Jan 2018 at 05:25 Yahya Abou 'Imran < yahya-abou-imran at protonmail.com> wrote: > At the end of the day, I found that plantuml is the most suitable tool for > this. > Right, but when I look at http://plantuml.com/ I don't see any open source code to guarantee it will be available in e.g. 5 years. (I really just see a lot of ads around a free Java app). > Graphviz dot is interesting, but it doesn't feel natural to make class > diagram with it, or at least it's less handy... I could bring several > arguments to support this, but it's not the topic. > It's somewhat the topic to me, though, since people seem to have found the diagrams helpful, which means defining how to sustainably maintain them so we are willing to accept them into the documentation is important as that will be the next step in accepting them into the documentation. > > Everybody wanting to try itself is welcome, but, *I* can take the > commitment to maintain it over the years. > I personally appreciate that offer, but I also don't know you well enough to be able to take that as a guarantee, hence why I'm trying to make sure the tooling that is used will last for a very long time (next month will be the 27th anniversary of Python's first public release, so anything we do may need to last a while :) . I know this is an annoying thing to be thinking about when you already have the diagram done in plantuml, but sustaining this work for a long time is part of maintaining open source. -Brett > > Here are the 3 svg files witch are my last proposals for the moment: > > > https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/base.svg > > https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/other_collections.svg > > https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/full.svg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yahya-abou-imran at protonmail.com Tue Jan 2 16:53:10 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Tue, 02 Jan 2018 16:53:10 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: <7D15F033-0334-42DF-AC12-39971500CF27@barrys-emacs.org> References: <7D15F033-0334-42DF-AC12-39971500CF27@barrys-emacs.org> Message-ID: >>Right, but when I look at http://plantuml.com/ I don't see any open source code to guarantee >>>it will be available in e.g. 5 years. (I really just see a lot of ads around a free Java app). >> >I see fedora packages plantuml and says is lgpl3 licensed. > >Barry On archlinux: $ pacman -Qi plantuml Name : plantuml Version : 1.2017.20-1 Description : Component that allows to quickly write uml diagrams Architecture : any URL : http://plantuml.com/ Licenses : GPL [...] Found also on debian stable in the main section (only free softwares), here is the licence: Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ Upstream-Name: plantuml Upstream-Contact: Arnaud Roques Source: http://plantuml.com/download.html Files: * Copyright: 2009-2014, Arnaud Roques License: Expat Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: . The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. . THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Files: debian/plantuml.* Copyright: 2010 Ilya Paramonov License: GPL-3+ This script is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. . This script distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. . The full text of the GPL is distributed in /usr/share/common-licenses/GPL-3 on Debian systems. From yahya-abou-imran at protonmail.com Tue Jan 2 16:56:53 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Tue, 02 Jan 2018 16:56:53 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: <7D15F033-0334-42DF-AC12-39971500CF27@barrys-emacs.org> Message-ID: <0kSc_1AppXX3oULi8GvkCGfT-nEsLfUFLhaCE4Ym42AB77jPHFvDUrjqPbI6UTetPIo62GTVLXvwtB80np8fbxD8YYQdK4GddesbwYUzu0s=@protonmail.com> And on the sourceforge page: https://sourceforge.net/projects/plantuml/ License GNU General Public License version 2.0 (GPLv2) From barry at barrys-emacs.org Tue Jan 2 16:29:16 2018 From: barry at barrys-emacs.org (Barry) Date: Tue, 2 Jan 2018 21:29:16 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: <7F8AC713-9F37-4B55-9AC1-02300BEF668A@barrys-emacs.org> After reading this thread I went off and read up about plantuml. According to its docs it produces dot files that it reandes with graphviz. So if graphviz will produce svg that part is solved. Can you creat the plantuml file automatically from the python code? Barry On 2 Jan 2018, at 15:26, Yahya Abou 'Imran via Python-ideas wrote: >> Can I suggest that rather than manually producing or tweaking, and later >> updating, the diagrams it might be better to spend a little time >> annotating the source code [...] >> While the diagrams produced might lack the elegance of manually produced >> ones they would be much more useful as they would always be up to date >> due to being produced, and updated, automatically. > > I think it would be a lot of changes in the source code... > > I would like to precise, so everybody can know it, the syntax of a .puml is realy simple. For example: > > test.puml: > > @startuml > > hide members > show methods > > abstract class Iterable { > {abstract} __iter__() > } > > abstract class Iterator { > {abstract} __next__() > .. > __iter__() > } > > Iterable <|-- Iterator > > @enduml > > Then: > > $ plantuml test.puml > > And you have test.png I joined. > > > >> [...] >> I am attaching the diagram produced for the full inheritance of >> collections.abc as produced by doxygen/graphviz but I am sure that there >> are some options that could make this more readable/useful. >> >> Steve > > It's a blank file that I have... > > I've been struggling with those kind of tools these days, and realised that it's a lot more work (and pain) than a plain text file. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From barry at barrys-emacs.org Tue Jan 2 16:42:32 2018 From: barry at barrys-emacs.org (Barry) Date: Tue, 2 Jan 2018 21:42:32 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: Message-ID: <7D15F033-0334-42DF-AC12-39971500CF27@barrys-emacs.org> > On 2 Jan 2018, at 19:38, Brett Cannon wrote: > > > >> On Tue, 2 Jan 2018 at 05:25 Yahya Abou 'Imran wrote: >> At the end of the day, I found that plantuml is the most suitable tool for this. > > Right, but when I look at http://plantuml.com/ I don't see any open source code to guarantee > it will be available in e.g. 5 years. (I really just see a lot of ads around a free Java app). I see fedora packages plantuml and says is lgpl3 licensed. Barry > >> Graphviz dot is interesting, but it doesn't feel natural to make class diagram with it, or at least it's less handy... I could bring several arguments to support this, but it's not the topic. > > It's somewhat the topic to me, though, since people seem to have found the diagrams helpful, which means defining how to sustainably maintain them so we are willing to accept them into the documentation is important as that will be the next step in accepting them into the documentation. > >> >> Everybody wanting to try itself is welcome, but, *I* can take the commitment to maintain it over the years. > > I personally appreciate that offer, but I also don't know you well enough to be able to take that as a guarantee, hence why I'm trying to make sure the tooling that is used will last for a very long time (next month will be the 27th anniversary of Python's first public release, so anything we do may need to last a while :) . > > I know this is an annoying thing to be thinking about when you already have the diagram done in plantuml, but sustaining this work for a long time is part of maintaining open source. > > -Brett > >> >> Here are the 3 svg files witch are my last proposals for the moment: >> >> https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/base.svg >> https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/other_collections.svg >> https://gitlab.com/yahya-abou-imran/collections-abc-uml/blob/master/plantuml/full.svg > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Tue Jan 2 19:59:16 2018 From: brett at python.org (Brett Cannon) Date: Wed, 03 Jan 2018 00:59:16 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: <0kSc_1AppXX3oULi8GvkCGfT-nEsLfUFLhaCE4Ym42AB77jPHFvDUrjqPbI6UTetPIo62GTVLXvwtB80np8fbxD8YYQdK4GddesbwYUzu0s=@protonmail.com> References: <7D15F033-0334-42DF-AC12-39971500CF27@barrys-emacs.org> <0kSc_1AppXX3oULi8GvkCGfT-nEsLfUFLhaCE4Ym42AB77jPHFvDUrjqPbI6UTetPIo62GTVLXvwtB80np8fbxD8YYQdK4GddesbwYUzu0s=@protonmail.com> Message-ID: That's great! They don't advertise this, unfortunately, at plantuml.com :/ On Tue, 2 Jan 2018 at 13:57 Yahya Abou 'Imran < yahya-abou-imran at protonmail.com> wrote: > And on the sourceforge page: > > https://sourceforge.net/projects/plantuml/ > > License > GNU General Public License version 2.0 (GPLv2) > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gadgetsteve at live.co.uk Wed Jan 3 02:40:45 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Wed, 3 Jan 2018 07:40:45 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: <7F8AC713-9F37-4B55-9AC1-02300BEF668A@barrys-emacs.org> References: <7F8AC713-9F37-4B55-9AC1-02300BEF668A@barrys-emacs.org> Message-ID: You can embed plantuml directives in rst, and possibly the code, and use sphinxcontrib-plantuml which at least keeps the diagrams for documentation close to the code. pyplantuml claims to be able to extract the infromation directly from the code, (https://github.com/cb109/pyplantuml), but having tried it it seems a little flaky (and over specific in its dependencies). pyreverse by default produces a .dot file but takes some time to come up with the correct flags. https://github.com/rtfd/readthedocs.org/issues/407 makes interesting reading as well and is also relevant to this thread. On 02/01/2018 21:29, Barry wrote: > After reading this thread I went off and read up about plantuml. > According to its docs it produces dot files that it reandes with graphviz. > So if graphviz will produce svg that part is solved. > > Can you creat the plantuml file automatically from the python code? > > Barry > > On 2 Jan 2018, at 15:26, Yahya Abou 'Imran via Python-ideas wrote: > >>> Can I suggest that rather than manually producing or tweaking, and later >>> updating, the diagrams it might be better to spend a little time >>> annotating the source code [...] >>> While the diagrams produced might lack the elegance of manually produced >>> ones they would be much more useful as they would always be up to date >>> due to being produced, and updated, automatically. >> >> I think it would be a lot of changes in the source code... >> >> I would like to precise, so everybody can know it, the syntax of a .puml is realy simple. For example: >> >> test.puml: >> >> @startuml >> >> hide members >> show methods >> >> abstract class Iterable { >> {abstract} __iter__() >> } >> >> abstract class Iterator { >> {abstract} __next__() >> .. >> __iter__() >> } >> >> Iterable <|-- Iterator >> >> @enduml >> >> Then: >> >> $ plantuml test.puml >> >> And you have test.png I joined. >> >> >> >>> [...] >>> I am attaching the diagram produced for the full inheritance of >>> collections.abc as produced by doxygen/graphviz but I am sure that there >>> are some options that could make this more readable/useful. >>> >>> Steve >> >> It's a blank file that I have... >> >> I've been struggling with those kind of tools these days, and realised that it's a lot more work (and pain) than a plain text file. >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fpython-ideas&data=02%7C01%7C%7C6d32c06f38af4d40145708d55227e234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636505253601685827&sdata=DyrMOxl32rsEHK%2ByWUW9QlvzP%2Fn%2FqHXsKlSNuT%2B0loU%3D&reserved=0 >> Code of Conduct: https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpython.org%2Fpsf%2Fcodeofconduct%2F&data=02%7C01%7C%7C6d32c06f38af4d40145708d55227e234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636505253601685827&sdata=PQTKP%2FYAkdZscYpy815VbJXBeE89K1aslWibFgGf1Ww%3D&reserved=0 > -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From paul at rudin.co.uk Wed Jan 3 04:06:47 2018 From: paul at rudin.co.uk (Paul Rudin) Date: Wed, 03 Jan 2018 09:06:47 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation References: Message-ID: <87po6ria54.fsf@rudin.co.uk> Brett Cannon writes: > On Tue, 2 Jan 2018 at 05:25 Yahya Abou 'Imran b1ySJe57IN+BqQ9rBEUg at public.gmane.org> wrote: > > At the end of the day, I found that plantuml is the most suitable tool for > this. > > Right, but when I look at http://plantuml.com/ I don't see any open source > code to guarantee it will be available in e.g. 5 years. (I really just see a > lot of ads around a free Java app). I know nothing of this software, but https://github.com/plantuml/plantuml appears to be the source code, GPL3 licensed. Still, is a java dependency to build the python docs a good idea? From random832 at fastmail.com Wed Jan 3 10:24:36 2018 From: random832 at fastmail.com (Random832) Date: Wed, 03 Jan 2018 10:24:36 -0500 Subject: [Python-ideas] a set of enum.Enum values rather than the construction of bit-sets as the "norm"? In-Reply-To: References: <20171227055639.GP4215@ando.pearwood.info> <9ec7fc9a-57ca-4fd5-ad21-8b1346349c2a@googlegroups.com> <20171229081816.GT4215@ando.pearwood.info> <20171229153821.GU4215@ando.pearwood.info> <20171230042531.GV4215@ando.pearwood.info> Message-ID: <1514993076.900792.1222974920.14A057DF@webmail.messagingengine.com> On Sun, Dec 31, 2017, at 00:33, Guido van Rossum wrote: > I'm not keen on this recommendation. An argument that takes a Set[Foo] > would mean that in order to specify: > - no flags: you'd have to pass set() -- you can't use {} since that's an > empty dict, not an empty set Optional[Set[Foo]]? > - one flag: you'd have to pass {Foo.BAR} rather than just Foo.BAR > - two flags: you'd have to pass {Foo.BAR, Foo.BAZ} rather than Foo.BAR | > Foo.BAZ Maybe the flags themselves should be of a type that you can do both with (i.e. each value is a set containing itself). From guido at python.org Wed Jan 3 10:41:02 2018 From: guido at python.org (Guido van Rossum) Date: Wed, 3 Jan 2018 08:41:02 -0700 Subject: [Python-ideas] a set of enum.Enum values rather than the construction of bit-sets as the "norm"? In-Reply-To: <1514993076.900792.1222974920.14A057DF@webmail.messagingengine.com> References: <20171227055639.GP4215@ando.pearwood.info> <9ec7fc9a-57ca-4fd5-ad21-8b1346349c2a@googlegroups.com> <20171229081816.GT4215@ando.pearwood.info> <20171229153821.GU4215@ando.pearwood.info> <20171230042531.GV4215@ando.pearwood.info> <1514993076.900792.1222974920.14A057DF@webmail.messagingengine.com> Message-ID: On Wed, Jan 3, 2018 at 8:24 AM, Random832 wrote: > On Sun, Dec 31, 2017, at 00:33, Guido van Rossum wrote: > > I'm not keen on this recommendation. An argument that takes a Set[Foo] > > would mean that in order to specify: > > - no flags: you'd have to pass set() -- you can't use {} since that's an > > empty dict, not an empty set > > Optional[Set[Foo]]? > > > - one flag: you'd have to pass {Foo.BAR} rather than just Foo.BAR > > - two flags: you'd have to pass {Foo.BAR, Foo.BAZ} rather than Foo.BAR | > > Foo.BAZ > > Maybe the flags themselves should be of a type that you can do both with > (i.e. each value is a set containing itself). > That's pretty much what the enum.Flag type does. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Wed Jan 3 11:52:06 2018 From: brett at python.org (Brett Cannon) Date: Wed, 03 Jan 2018 16:52:06 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: <87po6ria54.fsf@rudin.co.uk> References: <87po6ria54.fsf@rudin.co.uk> Message-ID: We would check in the resulting image, so any Java dependency would only be for when we update the image. On Wed, Jan 3, 2018, 01:33 Paul Rudin, wrote: > Brett Cannon writes: > > > On Tue, 2 Jan 2018 at 05:25 Yahya Abou 'Imran > b1ySJe57IN+BqQ9rBEUg at public.gmane.org> wrote: > > > > At the end of the day, I found that plantuml is the most suitable > tool for > > this. > > > > Right, but when I look at http://plantuml.com/ I don't see any open > source > > code to guarantee it will be available in e.g. 5 years. (I really just > see a > > lot of ads around a free Java app). > > I know nothing of this software, but > https://github.com/plantuml/plantuml appears to be the source code, GPL3 > licensed. > > Still, is a java dependency to build the python docs a good idea? > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mariocj89 at gmail.com Thu Jan 4 14:56:41 2018 From: mariocj89 at gmail.com (Mario Corchero) Date: Thu, 4 Jan 2018 19:56:41 +0000 Subject: [Python-ideas] pdb to support running modules Message-ID: Hello All, Since PEP 338 we can run python modules as a script via `python -m module_name` but there is no way to run pdb on those (AFAIK). The proposal is to add a new argument "-m" to the pdb module to allow users to run `python -m pdb -m my_module_name` This is especially useful when working on cli tools that use entrypoints in setup.py, as there is no other way to run them. I have a possible implementation here https://github.com/python/cpython/pull/4752 I quite often use pdb with scripts, being able to run modules would be useful as well. What do you think? Regards! Mario -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeanpierreda at gmail.com Thu Jan 4 15:29:24 2018 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Thu, 4 Jan 2018 14:29:24 -0600 Subject: [Python-ideas] pdb to support running modules In-Reply-To: References: Message-ID: On Thu, Jan 4, 2018 at 1:56 PM, Mario Corchero wrote: > Since PEP 338 we can run python modules as a script via `python -m > module_name` but there is no way to run pdb on those (AFAIK). > > The proposal is to add a new argument "-m" to the pdb module to allow users > to run `python -m pdb -m my_module_name` Mega +1, I always, always want this, for every command that itself can invoke other python scripts. There is prior art in that this feature was already added to cProfile as well: https://bugs.python.org/issue21862 -- Devin From lisaroach14 at gmail.com Thu Jan 4 18:50:06 2018 From: lisaroach14 at gmail.com (Lisa Roach) Date: Thu, 4 Jan 2018 15:50:06 -0800 Subject: [Python-ideas] pdb to support running modules In-Reply-To: References: Message-ID: I also +1 this idea, I don't see a reason why we couldn't run pdb on the modules. On Thu, Jan 4, 2018 at 12:29 PM, Devin Jeanpierre wrote: > On Thu, Jan 4, 2018 at 1:56 PM, Mario Corchero > wrote: > > Since PEP 338 we can run python modules as a script via `python -m > > module_name` but there is no way to run pdb on those (AFAIK). > > > > The proposal is to add a new argument "-m" to the pdb module to allow > users > > to run `python -m pdb -m my_module_name` > > Mega +1, I always, always want this, for every command that itself can > invoke other python scripts. There is prior art in that this feature > was already added to cProfile as well: > https://bugs.python.org/issue21862 > > -- Devin > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Thu Jan 4 20:15:54 2018 From: guido at python.org (Guido van Rossum) Date: Thu, 4 Jan 2018 18:15:54 -0700 Subject: [Python-ideas] pdb to support running modules In-Reply-To: References: Message-ID: Sounds uncontroversial, this can just be done via bugs.python.org. On Thu, Jan 4, 2018 at 4:50 PM, Lisa Roach wrote: > I also +1 this idea, I don't see a reason why we couldn't run pdb on the > modules. > > On Thu, Jan 4, 2018 at 12:29 PM, Devin Jeanpierre > wrote: > >> On Thu, Jan 4, 2018 at 1:56 PM, Mario Corchero >> wrote: >> > Since PEP 338 we can run python modules as a script via `python -m >> > module_name` but there is no way to run pdb on those (AFAIK). >> > >> > The proposal is to add a new argument "-m" to the pdb module to allow >> users >> > to run `python -m pdb -m my_module_name` >> >> Mega +1, I always, always want this, for every command that itself can >> invoke other python scripts. There is prior art in that this feature >> was already added to cProfile as well: >> https://bugs.python.org/issue21862 >> >> -- Devin >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From gadgetsteve at live.co.uk Fri Jan 5 01:28:36 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Fri, 5 Jan 2018 06:28:36 +0000 Subject: [Python-ideas] Syntax to import modules before running command from the command line Message-ID: Currently invoking `python -c "some;separated;set of commands;"` will, if you need to use any library functions, require one or more import somelib; sections in the execution string. This results in rather complex "one liners". On the other hand `python -m somelib` will load somelib and attempt to execute its `__main__()` or give an error if there isn't one. What I would like to suggest is a mechanism to pre-load libraries before evaluating the -c option as this would allow the use of code from libraries that don't have a `__main__` function, or those that do but it doesn't do what you want. Since -m for module is already taken I would suggest one of: -p for pre-load module -M for load module without attempting to execute `module.__main__()` and without defining "__main__" in the load context or -l for library with the last two having the advantage of appearing next to -m in the --help output. This would change, (for a trivial example): `python -c"import numpy;print(numpy.pi);"` to: `python -M numpy -c"print(numpy.pi);"` -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From wes.turner at gmail.com Fri Jan 5 02:31:28 2018 From: wes.turner at gmail.com (Wes Turner) Date: Fri, 5 Jan 2018 02:31:28 -0500 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: Could it just check if -c and -m are both set? That way there'd be no need for -p or -M. (I have an -m switch in pyline which does exactly this. It makes copying and pasting less convenient; but does save having to type 'import module;? for one liners) On Friday, January 5, 2018, Steve Barnes wrote: > Currently invoking `python -c "some;separated;set of commands;"` will, > if you need to use any library functions, require one or more import > somelib; sections in the execution string. This results in rather > complex "one liners". > > On the other hand `python -m somelib` will load somelib and attempt to > execute its `__main__()` or give an error if there isn't one. > > What I would like to suggest is a mechanism to pre-load libraries before > evaluating the -c option as this would allow the use of code from > libraries that don't have a `__main__` function, or those that do but it > doesn't do what you want. > > Since -m for module is already taken I would suggest one of: > -p for pre-load module > -M for load module without attempting to execute `module.__main__()` > and without defining "__main__" in the load context or > -l for library > with the last two having the advantage of appearing next to -m in the > --help output. > > This would change, (for a trivial example): > `python -c"import numpy;print(numpy.pi);"` > to: > `python -M numpy -c"print(numpy.pi);"` > > > -- > Steve (Gadget) Barnes > Any opinions in this message are my personal opinions and do not reflect > those of my employer. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wes.turner at gmail.com Fri Jan 5 02:33:12 2018 From: wes.turner at gmail.com (Wes Turner) Date: Fri, 5 Jan 2018 02:33:12 -0500 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: An implicit print() would be convenient, too. On Friday, January 5, 2018, Wes Turner wrote: > Could it just check if -c and -m are both set? > That way there'd be no need for -p or -M. > > (I have an -m switch in pyline which does exactly this. It makes > copying and pasting less convenient; but does save having to type 'import > module;? for one liners) > > On Friday, January 5, 2018, Steve Barnes wrote: > >> Currently invoking `python -c "some;separated;set of commands;"` will, >> if you need to use any library functions, require one or more import >> somelib; sections in the execution string. This results in rather >> complex "one liners". >> >> On the other hand `python -m somelib` will load somelib and attempt to >> execute its `__main__()` or give an error if there isn't one. >> >> What I would like to suggest is a mechanism to pre-load libraries before >> evaluating the -c option as this would allow the use of code from >> libraries that don't have a `__main__` function, or those that do but it >> doesn't do what you want. >> >> Since -m for module is already taken I would suggest one of: >> -p for pre-load module >> -M for load module without attempting to execute `module.__main__()` >> and without defining "__main__" in the load context or >> -l for library >> with the last two having the advantage of appearing next to -m in the >> --help output. >> >> This would change, (for a trivial example): >> `python -c"import numpy;print(numpy.pi);"` >> to: >> `python -M numpy -c"print(numpy.pi);"` >> >> >> -- >> Steve (Gadget) Barnes >> Any opinions in this message are my personal opinions and do not reflect >> those of my employer. >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Fri Jan 5 03:12:59 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 5 Jan 2018 18:12:59 +1000 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: On 5 January 2018 at 16:28, Steve Barnes wrote: > Currently invoking `python -c "some;separated;set of commands;"` will, > if you need to use any library functions, require one or more import > somelib; sections in the execution string. This results in rather > complex "one liners". > > On the other hand `python -m somelib` will load somelib and attempt to > execute its `__main__()` or give an error if there isn't one. That's not quite how the -m switch works, but that doesn't affect your core point: "-m somemodule" terminates the option list and defines what will be run as `__main__`, so you can't compose it with other options (like `-c`). > What I would like to suggest is a mechanism to pre-load libraries before > evaluating the -c option as this would allow the use of code from > libraries that don't have a `__main__` function, or those that do but it > doesn't do what you want. This could be an interesting alternative to the old "python -C" idea in https://bugs.python.org/issue14803 (which suggests adding an alternative to "-c" that runs the given code *before* running `__main__`, rather than running it *as* `__main__`). To summarise some of the use cases from that discussion: * setting `__main__.__requires__` to change how `import pkg_resources` selects default package versions * reducing the proliferation of `-X` options to enable modules like tracemalloc and faulthandler * providing a way to enable code coverage as early as possible on startup (and have the setting be inherited by subprocesses) The main downside I'd see compared to that `python -C` and PYTHONRUNFIRST idea is that it might encourage the definition of modules with side-effects in order to use it to reconfigure the interpreter at startup. That concern could potentially be mitigated if we did both: * -M/PYTHONIMPORTFIRST: implicit imports in __main__ on startup * -C/PYTHONRUNFIRST: code to run in __main__ on startup (after the implicit imports) However, the issue then is that "python -M numpy" would just be a less flexible alternative to a command like "python -C 'import numpy as np'". Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From p.f.moore at gmail.com Fri Jan 5 06:06:25 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 5 Jan 2018 11:06:25 +0000 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: On 5 January 2018 at 08:12, Nick Coghlan wrote: > However, the issue then is that "python -M numpy" would just be a less > flexible alternative to a command like "python -C 'import numpy as > np'". For quick one-liners don't underestimate the value of avoiding punctuation: # No punctuation at all python -M numpy # Needs quotes - if quoted itself, needs nested quotes python -C "import numpy" # Needs quotes and semicolon python -c "import numpy; ..." This may not be a big deal on Unix shells (IMO, it's still an issue there, just less critical) but on less-capable shells like Windows' CMD, avoiding unnecessary punctuation can be a big improvement. Paul From yahya-abou-imran at protonmail.com Fri Jan 5 11:45:03 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Fri, 05 Jan 2018 11:45:03 -0500 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: <87po6ria54.fsf@rudin.co.uk> References: <87po6ria54.fsf@rudin.co.uk> Message-ID: Hi everybody. I would like to make a recap: It seems that we all agree that it's something nice to have. Now, we are wondering about the tools or the methodology to use them. As I said, if somebody want to give a try, I'm not against it. But what are we doing now? I've already open an issue, I would like to submit a PR to place the puml and the svg in the doc and to make the diagram appear in the suitable place. Is there a real good reason to wait? I not trying to force myself on anybody here, I just think it's a shame not to use something we have know... If we don't manage to all agree with each other, I will just let the issue open until somebody comes with an undisputed solution. -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Fri Jan 5 14:25:08 2018 From: brett at python.org (Brett Cannon) Date: Fri, 05 Jan 2018 19:25:08 +0000 Subject: [Python-ideas] Add an UML class diagram to the collections.abc module documentation In-Reply-To: References: <87po6ria54.fsf@rudin.co.uk> Message-ID: At this point the conversation should shift to https://bugs.python.org/issue32471 . On Fri, 5 Jan 2018 at 08:51 Yahya Abou 'Imran via Python-ideas < python-ideas at python.org> wrote: > Hi everybody. > > I would like to make a recap: > > It seems that we all agree that it's something nice to have. Now, we are > wondering about the tools or the methodology to use them. > > As I said, if somebody want to give a try, I'm not against it. But what > are we doing now? > I've already open an issue, I would like to submit a PR to place the puml > and the svg in the doc and to make the diagram appear in the suitable place. > > Is there a real good reason to wait? > > I not trying to force myself on anybody here, I just think it's a shame > not to use something we have know... > If we don't manage to all agree with each other, I will just let the issue > open until somebody comes with an undisputed solution. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Sun Jan 7 12:18:06 2018 From: rosuav at gmail.com (Chris Angelico) Date: Mon, 8 Jan 2018 04:18:06 +1100 Subject: [Python-ideas] [Python-Dev] subprocess not escaping "^" on Windows In-Reply-To: <79eabfed-7e8a-b570-485c-fecbe5c94725@stackless.com> References: <79eabfed-7e8a-b570-485c-fecbe5c94725@stackless.com> Message-ID: Redirecting this part of the conversation to python-ideas. On Mon, Jan 8, 2018 at 3:17 AM, Christian Tismer wrote: > As a side note: In most cases where shell=True is found, people > seem to need evaluation of the PATH variable. To my understanding, > >>>> from subprocess import call >>>> call(("ls",)) > > works in Linux, but (with dir) not in Windows. But that is misleading > because "dir" is a builtin command but "ls" is not. The same holds for > "del" (Windows) and "rm" (Linux). That's exactly what shell=True is for - if you want a shell feature, you use the shell. What exactly would emulate_shell do? Would it simply do a $PATH or %PATH% search, but otherwise function as shell=False? Would it process redirection? Would it handle interpolations? I think not, from your description: > Perhaps it would be a good thing to emulate the builtin programs > in python by some shell=True replacement (emulate_shell=True?) > to match the normal user expectations without using the shell? but it becomes difficult to draw the line. For instance, with emulate_shell=True, what would you do with all the sh/bash built-ins: https://www.gnu.org/software/bash/manual/html_node/Bourne-Shell-Builtins.html https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html I'm thinking especially of the commands where bash has its own handling of something that could otherwise be found in $PATH, like pwd, time, and echo, but shells can do a lot of other things too. When do you actually want to execute a shell built-in from Python but without using the shell itself? You give the example of ls/dir, but if that ever comes up in real-world code, I'd toss it out and recommend a cross-platform os.listdir or equivalent. There are plenty of times I've wanted a really quick way to redirect a standard stream from Python, but that isn't part of what you're recommending. Can you give a real-world example that would be improved by this? I know this was just a side note in your original, but I'd like to hear more about what would make it useful. ChrisA From tismer at stackless.com Sun Jan 7 14:32:32 2018 From: tismer at stackless.com (Christian Tismer) Date: Sun, 7 Jan 2018 20:32:32 +0100 Subject: [Python-ideas] [Python-Dev] subprocess not escaping "^" on Windows In-Reply-To: References: <79eabfed-7e8a-b570-485c-fecbe5c94725@stackless.com> Message-ID: <504650b4-3217-8aa4-91a6-d3de9719b9b3@stackless.com> Hi Chris, On 07.01.18 18:18, Chris Angelico wrote: > Redirecting this part of the conversation to python-ideas. > > On Mon, Jan 8, 2018 at 3:17 AM, Christian Tismer wrote: >> As a side note: In most cases where shell=True is found, people >> seem to need evaluation of the PATH variable. To my understanding, >> >>>>> from subprocess import call >>>>> call(("ls",)) >> >> works in Linux, but (with dir) not in Windows. But that is misleading >> because "dir" is a builtin command but "ls" is not. The same holds for >> "del" (Windows) and "rm" (Linux). > > That's exactly what shell=True is for - if you want a shell feature, > you use the shell. What exactly would emulate_shell do? Would it > simply do a $PATH or %PATH% search, but otherwise function as > shell=False? Would it process redirection? Would it handle > interpolations? I think not, from your description: > >> Perhaps it would be a good thing to emulate the builtin programs >> in python by some shell=True replacement (emulate_shell=True?) >> to match the normal user expectations without using the shell? > > but it becomes difficult to draw the line. For instance, with > emulate_shell=True, what would you do with all the sh/bash built-ins: > > https://www.gnu.org/software/bash/manual/html_node/Bourne-Shell-Builtins.html > https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html > > I'm thinking especially of the commands where bash has its own > handling of something that could otherwise be found in $PATH, like > pwd, time, and echo, but shells can do a lot of other things too. > > When do you actually want to execute a shell built-in from Python but > without using the shell itself? You give the example of ls/dir, but if > that ever comes up in real-world code, I'd toss it out and recommend a > cross-platform os.listdir or equivalent. There are plenty of times > I've wanted a really quick way to redirect a standard stream from > Python, but that isn't part of what you're recommending. Can you give > a real-world example that would be improved by this? > > I know this was just a side note in your original, but I'd like to > hear more about what would make it useful. No, I cannot. I just thought of a way to keep users from using "shell=True". I *think* they do it after they experience that "del" for instance is not found. They conclude "ah, I need the shell", which is not true. So whatever you come up with, the effect should be that people no longer use the shell. THATs what I want, after bad experience with non-escaped "^" in a regex, that caused some really weird result. -- Christian Tismer-Sperling :^) tismer at stackless.com Software Consulting : http://www.stackless.com/ Karl-Liebknecht-Str. 121 : https://github.com/PySide 14482 Potsdam : GPG key -> 0xFB7BEE0E phone +49 173 24 18 776 fax +49 (30) 700143-0023 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 496 bytes Desc: OpenPGP digital signature URL: From steve.dower at python.org Sun Jan 7 16:11:04 2018 From: steve.dower at python.org (Steve Dower) Date: Mon, 8 Jan 2018 08:11:04 +1100 Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows In-Reply-To: References: Message-ID: It?s not a good idea. You end up with two different C runtimes in memory that cannot communicate, and many things will not work properly. If you compile your debug build extension with the non-debug CRT (/MD rather than /MDd) you will lose asserts, but otherwise it will work fine and the quoted code picks the release lib. Or if you like, when you install Python 3.5 or later there are advanced options to install debug symbols and binaries. You can use a proper debug build against the debug binaries (python_d.exe). Cheers, Steve Top-posted from my Windows phone From: Ivan Pozdeev via Python-ideas Sent: Saturday, December 30, 2017 13:01 To: python-ideas at python.org Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows The Windows version of pyconfig.h has the following construct: ??? if defined(_DEBUG) ?????????? pragma comment(lib,"python37_d.lib") ??? elif defined(Py_LIMITED_API) ?????????? pragma comment(lib,"python3.lib") ??? else ?????????? pragma comment(lib,"python37.lib") ??? endif /* _DEBUG */ which fails the compilation of a debug version of an extension. Making debugging it... difficult. Perhaps we could define some other constant? I'm not sure whether such compilation is a good idea in general, so asking here at first. -- Regards, Ivan _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From vano at mail.mipt.ru Sun Jan 7 17:25:36 2018 From: vano at mail.mipt.ru (Ivan Pozdeev) Date: Mon, 8 Jan 2018 01:25:36 +0300 Subject: [Python-ideas] [Python-Dev] subprocess not escaping "^" on Windows In-Reply-To: <504650b4-3217-8aa4-91a6-d3de9719b9b3@stackless.com> References: <79eabfed-7e8a-b570-485c-fecbe5c94725@stackless.com> <504650b4-3217-8aa4-91a6-d3de9719b9b3@stackless.com> Message-ID: <71fd61f3-4570-5951-4a46-4a57efb7794d@mail.mipt.ru> On 07.01.2018 22:32, Christian Tismer wrote: > Hi Chris, > > On 07.01.18 18:18, Chris Angelico wrote: >> Redirecting this part of the conversation to python-ideas. >> >> On Mon, Jan 8, 2018 at 3:17 AM, Christian Tismer wrote: >>> As a side note: In most cases where shell=True is found, people >>> seem to need evaluation of the PATH variable. To my understanding, >>> >>>>>> from subprocess import call >>>>>> call(("ls",)) >>> works in Linux, but (with dir) not in Windows. But that is misleading >>> because "dir" is a builtin command but "ls" is not. The same holds for >>> "del" (Windows) and "rm" (Linux). >> That's exactly what shell=True is for - if you want a shell feature, >> you use the shell. What exactly would emulate_shell do? Would it >> simply do a $PATH or %PATH% search, but otherwise function as >> shell=False? Would it process redirection? Would it handle >> interpolations? I think not, from your description: >> >>> Perhaps it would be a good thing to emulate the builtin programs >>> in python by some shell=True replacement (emulate_shell=True?) >>> to match the normal user expectations without using the shell? >> but it becomes difficult to draw the line. For instance, with >> emulate_shell=True, what would you do with all the sh/bash built-ins: >> >> https://www.gnu.org/software/bash/manual/html_node/Bourne-Shell-Builtins.html >> https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html >> >> I'm thinking especially of the commands where bash has its own >> handling of something that could otherwise be found in $PATH, like >> pwd, time, and echo, but shells can do a lot of other things too. >> >> When do you actually want to execute a shell built-in from Python but >> without using the shell itself? You give the example of ls/dir, but if >> that ever comes up in real-world code, I'd toss it out and recommend a >> cross-platform os.listdir or equivalent. There are plenty of times >> I've wanted a really quick way to redirect a standard stream from >> Python, but that isn't part of what you're recommending. Can you give >> a real-world example that would be improved by this? >> >> I know this was just a side note in your original, but I'd like to >> hear more about what would make it useful. > > No, I cannot. I just thought of a way to keep users from using > "shell=True". I *think* they do it after they experience that > "del" for instance is not found. They conclude "ah, I need the > shell", which is not true. Even putting aside the fact this is pure conjecture, the kind of people who make decisions like this will find a zillion more ways to shoot themselves in the foot. They don't need a cleaner syntax, they need to learn the basics of programming in a high-level language to understand how it's different from programming in the shell. In particular, why spawning a subprocess for something covered by a library function is a bad idea. > > So whatever you come up with, the effect should be that people > no longer use the shell. THATs what I want, after bad experience with > non-escaped "^" in a regex, that caused some really weird result. > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- Regards, Ivan -------------- next part -------------- An HTML attachment was scrubbed... URL: From gadgetsteve at live.co.uk Mon Jan 8 01:46:59 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Mon, 8 Jan 2018 06:46:59 +0000 Subject: [Python-ideas] [Python-Dev] subprocess not escaping "^" on Windows In-Reply-To: <71fd61f3-4570-5951-4a46-4a57efb7794d@mail.mipt.ru> References: <79eabfed-7e8a-b570-485c-fecbe5c94725@stackless.com> <504650b4-3217-8aa4-91a6-d3de9719b9b3@stackless.com> <71fd61f3-4570-5951-4a46-4a57efb7794d@mail.mipt.ru> Message-ID: Reacting to: >> >> No, I cannot. I just thought of a way to keep users from using >> "shell=True". I *think* they do it after they experience that >> "del" for instance is not found. They conclude "ah, I need the >> shell", which is not true. > Even putting aside the fact this is pure conjecture, the kind of people > who make decisions like this will find a zillion more ways to shoot > themselves in the foot. They don't need a cleaner syntax, they need to > learn the basics of programming in a high-level language to understand > how it's different from programming in the shell. In particular, why > spawning a subprocess for something covered by a library function is a > bad idea. >> >> So whatever you come up with, the effect should be that people >> no longer use the shell. THATs what I want, after bad experience with >> non-escaped "^" in a regex, that caused some really weird result. >> >> >> How about starting off with marking all use of "shell=True" as deprecated and then replacing the parameter with "risky_shell=True" or having no such parameter and adding "risky_" or "dangerous_" wrappers for all items that currently have the "shell=True" option. This would at least highlight that the developer is performing a risky operation, to me a part of the problem is that "shell=True" sounds innocuous so it is rarely picked up as a potential problem. I do quite like the idea of having a "with_path=True|False" option or maybe a "use_path=" that defaults to sys.path for all of the subprocess functions that would allow a little more control over the execution environment. -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From oscardssmith at gmail.com Mon Jan 8 11:17:30 2018 From: oscardssmith at gmail.com (Oscar Smith) Date: Mon, 8 Jan 2018 10:17:30 -0600 Subject: [Python-ideas] make Connections iterable Message-ID: I am currently working on a program where it would be really useful if a connection had a __next__ method, because then it would be much easier to iterate over. It would just be an alias to recv, but would allow you to do things like merging the results of connections using heapq.merge that currently are highly non-trivial to accomplish. Is there a reason this API isn't supported? Oscar Smith -------------- next part -------------- An HTML attachment was scrubbed... URL: From tjreedy at udel.edu Mon Jan 8 15:00:32 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 8 Jan 2018 15:00:32 -0500 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: Message-ID: On 1/8/2018 11:17 AM, Oscar Smith wrote: > I am currently working on a program where it would be really useful if a > connection had a __next__ method, because then it would be much easier > to iterate over. It would just be an alias to recv, but would allow you > to do things like merging the results of connections using heapq.merge > that currently are highly non-trivial to accomplish. The reference to recv says that you must be talking about multiprocessing.Connection rather than sqlite3.Connection. Since recv raises EORError when done, an alias does not work. Try the following generator adaptor. def connect_gen(connection): try: while True: yield connection.recv() except EOFError: pass You could make the above the .__iter__ method of a MyConnecton subclass. -- Terry Jan Reedy From steve at pearwood.info Mon Jan 8 20:25:27 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 9 Jan 2018 12:25:27 +1100 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: Message-ID: <20180109012527.GM6667@ando.pearwood.info> On Mon, Jan 08, 2018 at 10:17:30AM -0600, Oscar Smith wrote: > I am currently working on a program where it would be really useful if a > connection had a __next__ method, because then it would be much easier to > iterate over. What sort of connection are you referring to? > It would just be an alias to recv, but would allow you to do > things like merging the results of connections using heapq.merge that > currently are highly non-trivial to accomplish. This gives you an iterator which repeatedly calls connection.recv until it raises a FooException, then ends. def conn_iter(connection): try: while True: yield connection.recv() except FooException: # FIXME -- what does recv actually raise? return Doesn't seem "highly non-trivial" to me. Have I missed something? > Is there a reason this API isn't supported? You are asking the wrong question. Adding APIs isn't "default allow", where there has to be a reason to *not* support it otherwise it gets added. It is "default deny" -- there has to be a good reason to add it, otherwise it gets left out. YAGNI is an excellent design principle, as it is easier to add a useful API later, than to remove an unnecessary or poorly designed one. So the question needs to be: "Is this a good enough reason to support this API?" Maybe, maybe not. Not every trivial wrapper function needs to be a method. But perhaps this is an exception: perhaps iterability is such a common and useful API for connections that it should be added, for the same reason that files are iterable. Care to elaborate on why this would be useful and why the generator I showed above isn't satisfactory? -- Steve From oscardssmith at gmail.com Mon Jan 8 22:05:19 2018 From: oscardssmith at gmail.com (Oscar Smith) Date: Mon, 8 Jan 2018 21:05:19 -0600 Subject: [Python-ideas] make Connections iterable In-Reply-To: <20180109012527.GM6667@ando.pearwood.info> References: <20180109012527.GM6667@ando.pearwood.info> Message-ID: The arguments for including this API is that it allows easy iteration over the results of a connection allowing it to be used with any of the features of itertools or any other library accepting iterables. recv is only used in places where the iterable protocol could be used, so it makes sense for consistency to use the API shared by the rest of Python. Oscar Smith On Mon, Jan 8, 2018 at 7:25 PM, Steven D'Aprano wrote: > On Mon, Jan 08, 2018 at 10:17:30AM -0600, Oscar Smith wrote: > > I am currently working on a program where it would be really useful if a > > connection had a __next__ method, because then it would be much easier to > > iterate over. > > What sort of connection are you referring to? > > > > It would just be an alias to recv, but would allow you to do > > things like merging the results of connections using heapq.merge that > > currently are highly non-trivial to accomplish. > > This gives you an iterator which repeatedly calls connection.recv until > it raises a FooException, then ends. > > def conn_iter(connection): > try: > while True: > yield connection.recv() > except FooException: # FIXME -- what does recv actually raise? > return > > Doesn't seem "highly non-trivial" to me. Have I missed something? > > > > Is there a reason this API isn't supported? > > You are asking the wrong question. Adding APIs isn't "default allow", > where there has to be a reason to *not* support it otherwise it gets > added. It is "default deny" -- there has to be a good reason to add it, > otherwise it gets left out. YAGNI is an excellent design principle, as > it is easier to add a useful API later, than to remove an unnecessary or > poorly designed one. > > So the question needs to be: > > "Is this a good enough reason to support this API?" > > Maybe, maybe not. Not every trivial wrapper function needs to be a > method. > > But perhaps this is an exception: perhaps iterability is such a common > and useful API for connections that it should be added, for the same > reason that files are iterable. > > Care to elaborate on why this would be useful and why the generator I > showed above isn't satisfactory? > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amit.mixie at gmail.com Mon Jan 8 22:27:07 2018 From: amit.mixie at gmail.com (Amit Green) Date: Mon, 8 Jan 2018 22:27:07 -0500 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: <20180109012527.GM6667@ando.pearwood.info> Message-ID: An argument against this API, is that any caller of recv should be doing error handling (i.e.: catching exceptions from the socket). Changing into an iterator makes it less likely that error handling will be properly coded, and makes the error handling more obscure. Thus although the API would make the code more readable for the [wrong case] of not handling errors; the real issue is that it would make the code more obscure for the proper case of error handling. We should focus on the proper use case: using recv with error handling & thus not add this API. Virus-free. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> On Mon, Jan 8, 2018 at 10:05 PM, Oscar Smith wrote: > The arguments for including this API is that it allows easy iteration over > the results of a connection allowing it to be used with any of the features > of itertools or any other library accepting iterables. recv is only used in > places where the iterable protocol could be used, so it makes sense for > consistency to use the API shared by the rest of Python. > > Oscar Smith > > On Mon, Jan 8, 2018 at 7:25 PM, Steven D'Aprano > wrote: > >> On Mon, Jan 08, 2018 at 10:17:30AM -0600, Oscar Smith wrote: >> > I am currently working on a program where it would be really useful if a >> > connection had a __next__ method, because then it would be much easier >> to >> > iterate over. >> >> What sort of connection are you referring to? >> >> >> > It would just be an alias to recv, but would allow you to do >> > things like merging the results of connections using heapq.merge that >> > currently are highly non-trivial to accomplish. >> >> This gives you an iterator which repeatedly calls connection.recv until >> it raises a FooException, then ends. >> >> def conn_iter(connection): >> try: >> while True: >> yield connection.recv() >> except FooException: # FIXME -- what does recv actually raise? >> return >> >> Doesn't seem "highly non-trivial" to me. Have I missed something? >> >> >> > Is there a reason this API isn't supported? >> >> You are asking the wrong question. Adding APIs isn't "default allow", >> where there has to be a reason to *not* support it otherwise it gets >> added. It is "default deny" -- there has to be a good reason to add it, >> otherwise it gets left out. YAGNI is an excellent design principle, as >> it is easier to add a useful API later, than to remove an unnecessary or >> poorly designed one. >> >> So the question needs to be: >> >> "Is this a good enough reason to support this API?" >> >> Maybe, maybe not. Not every trivial wrapper function needs to be a >> method. >> >> But perhaps this is an exception: perhaps iterability is such a common >> and useful API for connections that it should be added, for the same >> reason that files are iterable. >> >> Care to elaborate on why this would be useful and why the generator I >> showed above isn't satisfactory? >> >> -- >> Steve >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Mon Jan 8 22:34:31 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 9 Jan 2018 13:34:31 +1000 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: <20180109012527.GM6667@ando.pearwood.info> Message-ID: On 9 January 2018 at 13:27, Amit Green wrote: > An argument against this API, is that any caller of recv should be doing > error handling (i.e.: catching exceptions from the socket). > > Changing into an iterator makes it less likely that error handling will be > properly coded, and makes the error handling more obscure. > > Thus although the API would make the code more readable for the [wrong > case] of not handling errors; the real issue is that it would make the code > more obscure for the proper case of error handling. > > We should focus on the proper use case: using recv with error handling & > thus not add this API. > It could be useful to include a recipe in the documentation that shows a generator with suitable error handling (taking the generic connection errors and adapting them to app specific ones) while also showing how to adapt the connection to the iterator protocol, though. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From amit.mixie at gmail.com Mon Jan 8 22:39:47 2018 From: amit.mixie at gmail.com (Amit Green) Date: Mon, 8 Jan 2018 22:39:47 -0500 Subject: [Python-ideas] make Connections iterable Message-ID: On Mon, Jan 8, 2018 at 10:34 PM, Nick Coghlan wrote > It could be useful to include a recipe in the documentation that shows a > generator with suitable error handling (taking the generic connection > errors and adapting them to app specific ones) while also showing how to > adapt the connection to the iterator protocol, though. > Agreed +1. This would be great for three reasons: - Error handling should be as close to the source of errors as possible (i.e.: in the iterator). - Converting error's and adapting them is a great idea; - Mostly: Any documentation that shows people how to do better error handling, is always welcome. Virus-free. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Jan 9 00:22:56 2018 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 8 Jan 2018 21:22:56 -0800 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: <20180109012527.GM6667@ando.pearwood.info> Message-ID: On Mon, Jan 8, 2018 at 7:27 PM, Amit Green wrote: > An argument against this API, is that any caller of recv should be doing > error handling (i.e.: catching exceptions from the socket). > It's still not entirely clear, but I'm pretty sure this thread is talking about multiprocessing.Connection objects, which don't have anything to do with sockets. (I think. They might use sockets internally on some platforms.) The only documented error from multiprocessing.Connection.recv is EOFError, which is basically equivalent to a StopIteration. I'm surprised that multiprocessing.Connection isn't iterable -- it seems like an obvious oversight. -n -- Nathaniel J. Smith -- https://vorpus.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Jan 9 05:07:34 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 9 Jan 2018 11:07:34 +0100 Subject: [Python-ideas] make Connections iterable References: <20180109012527.GM6667@ando.pearwood.info> Message-ID: <20180109110734.656d6c4b@fsol> On Mon, 8 Jan 2018 21:22:56 -0800 Nathaniel Smith wrote: > > The only documented error from multiprocessing.Connection.recv is EOFError, > which is basically equivalent to a StopIteration. Actually recv() can raise an OSError corresponding to any system-level error. > I'm surprised that multiprocessing.Connection isn't iterable -- it seems > like an obvious oversight. What is obvious about making a connection iterable? It's the first time I see someone requesting this. Regards Antoine. From ncoghlan at gmail.com Tue Jan 9 05:46:35 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 9 Jan 2018 20:46:35 +1000 Subject: [Python-ideas] make Connections iterable In-Reply-To: <20180109110734.656d6c4b@fsol> References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> Message-ID: On 9 January 2018 at 20:07, Antoine Pitrou wrote: > On Mon, 8 Jan 2018 21:22:56 -0800 > Nathaniel Smith wrote: >> I'm surprised that multiprocessing.Connection isn't iterable -- it seems >> like an obvious oversight. > > What is obvious about making a connection iterable? It's the first > time I see someone requesting this. If you view them as comparable to subprocess pipes, then it can be surprising that they're not iterable when using a line-oriented protocol. If you instead view them as comparable to socket connections, then the lack of iteration support seems equally reasonable. Hence my suggestion of providing a docs recipe showing an example of wrapping a connection in a generator in order to define a suitable way of getting from a raw bytestream to iterable chunks. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From solipsis at pitrou.net Tue Jan 9 06:02:23 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 9 Jan 2018 12:02:23 +0100 Subject: [Python-ideas] make Connections iterable References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> Message-ID: <20180109120223.05d6142e@fsol> On Tue, 9 Jan 2018 20:46:35 +1000 Nick Coghlan wrote: > On 9 January 2018 at 20:07, Antoine Pitrou wrote: > > On Mon, 8 Jan 2018 21:22:56 -0800 > > Nathaniel Smith wrote: > >> I'm surprised that multiprocessing.Connection isn't iterable -- it seems > >> like an obvious oversight. > > > > What is obvious about making a connection iterable? It's the first > > time I see someone requesting this. > > If you view them as comparable to subprocess pipes, then it can be > surprising that they're not iterable when using a line-oriented > protocol. > > If you instead view them as comparable to socket connections, then the > lack of iteration support seems equally reasonable. multiprocessing connections are actually message-oriented. So perhaps it could make sense for them to be iterable. But they are also quite low-level (often you wouldn't use them directly, but instead rely on multiprocessing.Queue). > Hence my suggestion of providing a docs recipe showing an example of > wrapping a connection in a generator in order to define a suitable way > of getting from a raw bytestream to iterable chunks. Well... if someone needs a doc recipe for this, they shouldn't use the lower-level functionality and instead stick to multiprocessing.Queue. (this begs the question: should multiprocessing.Queue be iterable? well, it's modeled on queue.Queue which isn't iterable) Regards Antoine. From njs at pobox.com Tue Jan 9 06:24:58 2018 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 9 Jan 2018 03:24:58 -0800 Subject: [Python-ideas] make Connections iterable In-Reply-To: <20180109110734.656d6c4b@fsol> References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> Message-ID: On Tue, Jan 9, 2018 at 2:07 AM, Antoine Pitrou wrote: > On Mon, 8 Jan 2018 21:22:56 -0800 > Nathaniel Smith wrote: >> >> The only documented error from multiprocessing.Connection.recv is EOFError, >> which is basically equivalent to a StopIteration. > > Actually recv() can raise an OSError corresponding to any system-level > error. > >> I'm surprised that multiprocessing.Connection isn't iterable -- it seems >> like an obvious oversight. > > What is obvious about making a connection iterable? It's the first > time I see someone requesting this. On the receive side, it's a stream of incoming objects that you fetch one at a time until you get to the end, probably processed with a loop like: while True: try: next_message = conn.recv() except EOFError: break ... Why wouldn't it be iterable? -n -- Nathaniel J. Smith -- https://vorpus.org From random832 at fastmail.com Tue Jan 9 07:12:25 2018 From: random832 at fastmail.com (Random832) Date: Tue, 09 Jan 2018 07:12:25 -0500 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> Message-ID: <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> On Tue, Jan 9, 2018, at 05:46, Nick Coghlan wrote: > If you view them as comparable to subprocess pipes, then it can be > surprising that they're not iterable when using a line-oriented > protocol. > > If you instead view them as comparable to socket connections, then the > lack of iteration support seems equally reasonable. Sockets are files - there's no fundamental reason a stream socket using a line-oriented protocol (which is a common enough case), or a datagram socket, shouldn't be iterable. Why aren't they? Making sockets iterable would be a separate discussion, but I don't think this is necessarily an argument. And saying "I think you should be handling errors in some particular way, so we'll make the API more difficult to encourage this" seems a non-sequitur. The whole point of exceptions is that the error handling code doesn't need to be directly at the point of use but can be, say, a try/catch wrapped around the inner loop. From storchaka at gmail.com Tue Jan 9 07:51:10 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 9 Jan 2018 14:51:10 +0200 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> Message-ID: 09.01.18 12:46, Nick Coghlan ????: > On 9 January 2018 at 20:07, Antoine Pitrou wrote: >> On Mon, 8 Jan 2018 21:22:56 -0800 >> Nathaniel Smith wrote: >>> I'm surprised that multiprocessing.Connection isn't iterable -- it seems >>> like an obvious oversight. >> >> What is obvious about making a connection iterable? It's the first >> time I see someone requesting this. > > If you view them as comparable to subprocess pipes, then it can be > surprising that they're not iterable when using a line-oriented > protocol. > > If you instead view them as comparable to socket connections, then the > lack of iteration support seems equally reasonable. > > Hence my suggestion of providing a docs recipe showing an example of > wrapping a connection in a generator in order to define a suitable way > of getting from a raw bytestream to iterable chunks. recv() can raise OSError, and it is more likely than raising OSError in file's readline(). The user code inside the loop also can perform writing and can raise OSError. This in the case of line-oriented files there is a single main source of OSError, but in the case of Connections there are two possible sources, and you need a way to distinguish errors raised by recv() and by writing. Currently you just use two different try-except statements. while True: try: data = conn.recv() except EOFError: break except OSError: # error on reading try: # user code except OSError: # error on writing It is very easy extend the case when don't handle OSError to the case when handle OSError. If Connections be iterable the code that don't handle errors looks simple: for data in conn: # user code But this simple code is not correct. When add error handling it will look like: while True: try: data = next(conn) except StopIteration: break except OSError: # error on reading try: # user code except OSError: # error on writing Not too different from the first example and very differen from the second example. This feature is not useful if properly handle errors, it makes the simpler only a quick code when you don't handle errors or handle them improperly. Yet one concern: the Connections object has the send() method, generator objects also have send() methods, but with the different semantic. This may be confusing. From rosuav at gmail.com Tue Jan 9 11:27:04 2018 From: rosuav at gmail.com (Chris Angelico) Date: Wed, 10 Jan 2018 03:27:04 +1100 Subject: [Python-ideas] make Connections iterable In-Reply-To: <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> Message-ID: On Tue, Jan 9, 2018 at 11:12 PM, Random832 wrote: > On Tue, Jan 9, 2018, at 05:46, Nick Coghlan wrote: >> If you view them as comparable to subprocess pipes, then it can be >> surprising that they're not iterable when using a line-oriented >> protocol. >> >> If you instead view them as comparable to socket connections, then the >> lack of iteration support seems equally reasonable. > > Sockets are files - there's no fundamental reason a stream socket using a line-oriented protocol (which is a common enough case), or a datagram socket, shouldn't be iterable. Why aren't they? Making sockets iterable would be a separate discussion, but I don't think this is necessarily an argument. > Only in POSIX. On other platforms, sockets are most definitely NOT files. And datagram sockets don't really make sense to iterate over. Part of the problem with even POSIX stream sockets (either TCP or Unix domain) is what you do when there's nothing to read. Do you block, waiting for a line? Do you raise StopIteration and then subsequently become un-finished again (which, according to Python semantics, is a broken iterator)? Do you yield a special value that says "no data yet but more maybe later"?? Blocking is the only one that makes sense, and that only if you run two threads, one for reading and one for writing. (Unless you're using a unidirectional socket, basically a TCP-enabled or filesystem-named pipe. Far from common.) ChrisA From njs at pobox.com Tue Jan 9 11:39:06 2018 From: njs at pobox.com (Nathaniel Smith) Date: Tue, 9 Jan 2018 08:39:06 -0800 Subject: [Python-ideas] make Connections iterable In-Reply-To: <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> Message-ID: On Jan 9, 2018 04:12, "Random832" wrote: On Tue, Jan 9, 2018, at 05:46, Nick Coghlan wrote: > If you view them as comparable to subprocess pipes, then it can be > surprising that they're not iterable when using a line-oriented > protocol. > > If you instead view them as comparable to socket connections, then the > lack of iteration support seems equally reasonable. Sockets are files - there's no fundamental reason a stream socket using a line-oriented protocol (which is a common enough case), or a datagram socket, shouldn't be iterable. Why aren't they? Supporting line iteration on sockets would require adding a whole buffering layer, which would be a huge change in semantics. Also, due to the way the BSD socket API works, stream and datagram sockets are the same Python type, so which one would socket.__next__ assume? (Plus datagrams are a bit messy anyway; you need to know the protocol's max size before you can call recv.) I know this was maybe a rhetorical question, but this particular case does have an answer beyond "we never did it that way before" :-). -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Tue Jan 9 11:40:43 2018 From: ethan at stoneleaf.us (Ethan Furman) Date: Tue, 09 Jan 2018 08:40:43 -0800 Subject: [Python-ideas] make Connections iterable In-Reply-To: References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> Message-ID: <5A54F08B.9090407@stoneleaf.us> On 01/09/2018 08:27 AM, Chris Angelico wrote: > On Tue, Jan 9, 2018 at 11:12 PM, Random832 wrote: >> On Tue, Jan 9, 2018, at 05:46, Nick Coghlan wrote: >>> If you view them as comparable to subprocess pipes, then it can be >>> surprising that they're not iterable when using a line-oriented >>> protocol. >>> >>> If you instead view them as comparable to socket connections, then the >>> lack of iteration support seems equally reasonable. >> >> Sockets are files - there's no fundamental reason a stream socket using a line-oriented protocol (which is a common enough case), or a datagram socket, shouldn't be iterable. Why aren't they? Making sockets iterable would be a separate discussion, but I don't think this is necessarily an argument. >> > > Only in POSIX. On other platforms, sockets are most definitely NOT > files. And datagram sockets don't really make sense to iterate over. > > Part of the problem with even POSIX stream sockets (either TCP or Unix > domain) is what you do when there's nothing to read. Do you block, > waiting for a line? Do you raise StopIteration and then subsequently > become un-finished again (which, according to Python semantics, is a > broken iterator)? This. Although it is technically broken, this is how reading from the console works. I believe that falls under practicality beats purity. ;) -- ~Ethan~ From solipsis at pitrou.net Tue Jan 9 11:49:23 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 9 Jan 2018 17:49:23 +0100 Subject: [Python-ideas] make Connections iterable References: <20180109012527.GM6667@ando.pearwood.info> <20180109110734.656d6c4b@fsol> <1515499945.184742.1229162312.4984D569@webmail.messagingengine.com> Message-ID: <20180109174923.7e08b407@fsol> On Tue, 9 Jan 2018 08:39:06 -0800 Nathaniel Smith wrote: > On Jan 9, 2018 04:12, "Random832" wrote: > > On Tue, Jan 9, 2018, at 05:46, Nick Coghlan wrote: > > If you view them as comparable to subprocess pipes, then it can be > > surprising that they're not iterable when using a line-oriented > > protocol. > > > > If you instead view them as comparable to socket connections, then the > > lack of iteration support seems equally reasonable. > > Sockets are files - there's no fundamental reason a stream socket using a > line-oriented protocol (which is a common enough case), or a datagram > socket, shouldn't be iterable. Why aren't they? > > > Supporting line iteration on sockets would require adding a whole buffering > layer, which would be a huge change in semantics. The buffering layer already exists. Just call socket.makefile() and you've got your iterable object :-) https://docs.python.org/3/library/socket.html#socket.socket.makefile Regards Antoine. From vano at mail.mipt.ru Tue Jan 9 13:35:26 2018 From: vano at mail.mipt.ru (Ivan Pozdeev) Date: Tue, 9 Jan 2018 21:35:26 +0300 Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows In-Reply-To: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> References: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> Message-ID: <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> On 08.01.2018 0:11, Steve Dower wrote: > > It?s not a good idea. You end up with two different C runtimes in > memory that cannot communicate, and many things will not work properly. > > If you compile your debug build extension with the non-debug CRT (/MD > rather than /MDd) you will lose asserts, but otherwise it will work > fine and the quoted code picks the release lib. > Distutils' designers seem to have thought differently. Whether the extension is linked against pythonxx_d.lib is governed by the --debug switch to the `build' command rather than the type of the running Python. Compiler optimization flags are inserted according to it, too. As a consequence, * I cannot install an extension into debug Python at all 'cuz `bdist_*' and `install' commands don't support --debug and invoke `debug' internally without it. * Neither can I compile an extension for release Python without optimizations. I'm at a loss here. Either I'm missing something, or with the current build system, it's impossible to debug extensions! > > Or if you like, when you install Python 3.5 or later there are > advanced options to install debug symbols and binaries. You can use a > proper debug build against the debug binaries (python_d.exe). > > Cheers, > > Steve > > Top-posted from my Windows phone > > *From: *Ivan Pozdeev via Python-ideas > *Sent: *Saturday, December 30, 2017 13:01 > *To: *python-ideas at python.org > *Subject: *[Python-ideas] Allow to compile debug extension against > releasePython in Windows > > The Windows version of pyconfig.h has the following construct: > > ??? if defined(_DEBUG) > > ?????????? pragma comment(lib,"python37_d.lib") > > ??? elif defined(Py_LIMITED_API) > > ?????????? pragma comment(lib,"python3.lib") > > ??? else > > ?????????? pragma comment(lib,"python37.lib") > > ??? endif /* _DEBUG */ > > which fails the compilation of a debug version of an extension. Making > > debugging it... difficult. > > Perhaps we could define some other constant? > > I'm not sure whether such compilation is a good idea in general, so > > asking here at first. > > -- > > Regards, > > Ivan > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Regards, Ivan -------------- next part -------------- An HTML attachment was scrubbed... URL: From vano at mail.mipt.ru Tue Jan 9 13:46:37 2018 From: vano at mail.mipt.ru (Ivan Pozdeev) Date: Tue, 9 Jan 2018 21:46:37 +0300 Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows In-Reply-To: <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> References: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> Message-ID: On 09.01.2018 21:35, Ivan Pozdeev via Python-ideas wrote: > On 08.01.2018 0:11, Steve Dower wrote: >> >> It?s not a good idea. You end up with two different C runtimes in >> memory that cannot communicate, and many things will not work properly. >> >> If you compile your debug build extension with the non-debug CRT (/MD >> rather than /MDd) you will lose asserts, but otherwise it will work >> fine and the quoted code picks the release lib. >> > Distutils' designers seem to have thought differently. > Whether the extension is linked against pythonxx_d.lib is governed by > the --debug switch to the `build' command rather than the type of the > running Python. Compiler optimization flags and /MD(d) are inserted > according to it, too. > > As a consequence, > * I cannot install an extension into debug Python at all 'cuz > `bdist_*' and `install' commands don't support --debug and invoke > `debug' internally without it. I meant "invoke `build' internally without it." , sorry. This kafkaesque "you cannot do this because you cannot do this" is taking its toll on me... > * Neither can I compile an extension for release Python without > optimizations. > > I'm at a loss here. Either I'm missing something, or with the current > build system, it's impossible to debug extensions! >> >> Or if you like, when you install Python 3.5 or later there are >> advanced options to install debug symbols and binaries. You can use a >> proper debug build against the debug binaries (python_d.exe). >> >> Cheers, >> >> Steve >> >> Top-posted from my Windows phone >> >> *From: *Ivan Pozdeev via Python-ideas >> *Sent: *Saturday, December 30, 2017 13:01 >> *To: *python-ideas at python.org >> *Subject: *[Python-ideas] Allow to compile debug extension against >> releasePython in Windows >> >> The Windows version of pyconfig.h has the following construct: >> >> ??? if defined(_DEBUG) >> >> ?????????? pragma comment(lib,"python37_d.lib") >> >> ??? elif defined(Py_LIMITED_API) >> >> ?????????? pragma comment(lib,"python3.lib") >> >> ??? else >> >> ?????????? pragma comment(lib,"python37.lib") >> >> ??? endif /* _DEBUG */ >> >> which fails the compilation of a debug version of an extension. Making >> >> debugging it... difficult. >> >> Perhaps we could define some other constant? >> >> I'm not sure whether such compilation is a good idea in general, so >> >> asking here at first. >> >> -- >> >> Regards, >> >> Ivan >> >> _______________________________________________ >> >> Python-ideas mailing list >> >> Python-ideas at python.org >> >> https://mail.python.org/mailman/listinfo/python-ideas >> >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > -- > Regards, > Ivan > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- Regards, Ivan -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexander.belopolsky at gmail.com Tue Jan 9 14:18:42 2018 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Tue, 9 Jan 2018 14:18:42 -0500 Subject: [Python-ideas] pdb to support running modules In-Reply-To: References: Message-ID: On Thu, Jan 4, 2018 at 8:15 PM, Guido van Rossum wrote: > Sounds uncontroversial, this can just be done via bugs.python.org. > .. and it has been proposed there over 7 years ago: . From vano at mail.mipt.ru Tue Jan 9 15:35:49 2018 From: vano at mail.mipt.ru (Ivan Pozdeev) Date: Tue, 9 Jan 2018 23:35:49 +0300 Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows In-Reply-To: References: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> Message-ID: <886953d1-3bb1-da51-b1a2-a12318879f51@mail.mipt.ru> On 09.01.2018 23:31, Barry Scott wrote: > I not a user of distutils or setuptools but some googling seems to say > that > the build command has a --debug to do what you want. If that does not > work it would seem like you could ask the setuptools maintainers how to > do the reason thing of a debug build. > I just wrote, in https://mail.python.org/pipermail/python-ideas/2018-January/048579.html , that --debug is not sufficient, and that the problematic logic is in distutils, not setuptools. -- Regards, Ivan From rspeer at luminoso.com Tue Jan 9 16:15:23 2018 From: rspeer at luminoso.com (Rob Speer) Date: Tue, 09 Jan 2018 21:15:23 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings Message-ID: Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings. There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it. You can see the character table for this encoding at: https://encoding.spec.whatwg.org/index-windows-1252.txt For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes that are undefined: >>> b'\x90'.decode('windows-1252') UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'. This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it: - It's compatible with windows-1252 - Any sequence of bytes can be round-tripped through it without losing information It's not just this one encoding. WHATWG's encoding standard ( https://encoding.spec.whatwg.org/) contains modified versions of windows-1250 through windows-1258 and windows-874. Support for these encodings matters to me, in part, because I maintain a Unicode data-cleaning library, "ftfy". One thing it does is to detect and undo encoding/decoding errors that cause mojibake, as long as they're detectible and reversible. Looking at real-world examples of text that has been damaged by mojibake, it's clear that lots of text is transferred through what I'm calling the "web-1252" encoding, in a way that's incompatible with Python's "windows-1252". In order to be able to work with and fix this kind of text, ftfy registers new codecs -- and I implemented this even before I knew that they were standardized in Web browsers. When ftfy is imported, you can decode text as "sloppy-windows-1252" (the name I chose for this encoding), for example. ftfy can tell people a sequence of steps that they can use in the future to fix text that's like the text they provided. Very often, these steps require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which means the steps only work with ftfy imported, even for people who are not using the features of ftfy. Support for these encodings also seems highly relevant to people who use Python for web scraping, as it would be desirable to maximize compatibility with what a Web browser would do. This really seems like it belongs in the standard library instead of being an incidental feature of my library. I know that code in the standard library has "one foot in the grave". I _want_ these legacy encodings to have one foot in the grave. But some of them are extremely common, and Python code should be able to deal with them. Adding these encodings to Python would be straightforward to implement. Does this require a PEP, a pull request, or further discussion? -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at barrys-emacs.org Tue Jan 9 15:31:22 2018 From: barry at barrys-emacs.org (Barry Scott) Date: Tue, 9 Jan 2018 20:31:22 +0000 Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows In-Reply-To: References: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> Message-ID: I not a user of distutils or setuptools but some googling seems to say that the build command has a --debug to do what you want. If that does not work it would seem like you could ask the setuptools maintainers how to do the reason thing of a debug build. Barry > On 9 Jan 2018, at 18:46, Ivan Pozdeev via Python-ideas wrote: > > On 09.01.2018 21:35, Ivan Pozdeev via Python-ideas wrote: > >> On 08.01.2018 0:11, Steve Dower wrote: >>> It?s not a good idea. You end up with two different C runtimes in memory that cannot communicate, and many things will not work properly. >>> >>> If you compile your debug build extension with the non-debug CRT (/MD rather than /MDd) you will lose asserts, but otherwise it will work fine and the quoted code picks the release lib. >> Distutils' designers seem to have thought differently. >> Whether the extension is linked against pythonxx_d.lib is governed by the --debug switch to the `build' command rather than the type of the running Python. Compiler optimization flags and /MD(d) are inserted according to it, too. >> >> As a consequence, >> * I cannot install an extension into debug Python at all 'cuz `bdist_*' and `install' commands don't support --debug and invoke `debug' internally without it. > I meant "invoke `build' internally without it." , sorry. > > This kafkaesque "you cannot do this because you cannot do this" is taking its toll on me... >> * Neither can I compile an extension for release Python without optimizations. >> >> I'm at a loss here. Either I'm missing something, or with the current build system, it's impossible to debug extensions! >>> >>> Or if you like, when you install Python 3.5 or later there are advanced options to install debug symbols and binaries. You can use a proper debug build against the debug binaries (python_d.exe). >>> >>> Cheers, >>> Steve >>> >>> Top-posted from my Windows phone >>> >>> From: Ivan Pozdeev via Python-ideas >>> Sent: Saturday, December 30, 2017 13:01 >>> To: python-ideas at python.org >>> Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows >>> >>> The Windows version of pyconfig.h has the following construct: >>> >>> if defined(_DEBUG) >>> pragma comment(lib,"python37_d.lib") >>> elif defined(Py_LIMITED_API) >>> pragma comment(lib,"python3.lib") >>> else >>> pragma comment(lib,"python37.lib") >>> endif /* _DEBUG */ >>> >>> which fails the compilation of a debug version of an extension. Making >>> debugging it... difficult. >>> >>> Perhaps we could define some other constant? >>> >>> I'm not sure whether such compilation is a good idea in general, so >>> asking here at first. >>> >>> -- >>> Regards, >>> Ivan >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> Code of Conduct: http://python.org/psf/codeofconduct/ >>> >> >> -- >> Regards, >> Ivan >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ > > -- > Regards, > Ivan > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at barrys-emacs.org Tue Jan 9 15:40:36 2018 From: barry at barrys-emacs.org (Barry Scott) Date: Tue, 9 Jan 2018 20:40:36 +0000 Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows In-Reply-To: <886953d1-3bb1-da51-b1a2-a12318879f51@mail.mipt.ru> References: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> <886953d1-3bb1-da51-b1a2-a12318879f51@mail.mipt.ru> Message-ID: <8FDFE482-D082-4C7A-AF52-4DAEEBCC0834@barrys-emacs.org> > On 9 Jan 2018, at 20:35, Ivan Pozdeev wrote: > > On 09.01.2018 23:31, Barry Scott wrote: >> I not a user of distutils or setuptools but some googling seems to say that >> the build command has a --debug to do what you want. If that does not >> work it would seem like you could ask the setuptools maintainers how to >> do the reason thing of a debug build. >> > I just wrote, in https://mail.python.org/pipermail/python-ideas/2018-January/048579.html , that --debug is not sufficient, and that the problematic logic is in distutils, not setuptools. Sorry, I mis-read that. I thought it was not a known option. It is certainly hard to find docs for. This does sound like something the setup tools people can answer for you. I think they hang out on the python distutils-sig mailing list. Barry From vano at mail.mipt.ru Tue Jan 9 16:51:55 2018 From: vano at mail.mipt.ru (Ivan Pozdeev) Date: Wed, 10 Jan 2018 00:51:55 +0300 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: First of all, many thanks for such a excellently writen letter. It was a real pleasure to read. On 10.01.2018 0:15, Rob Speer wrote: > Hi! I joined this list because I'm interested in filling a gap in > Python's standard library, relating to text encodings. > > There is an encoding with no name of its own. It's supported by every > current web browser and standardized by WHATWG. It's so prevalent that > if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you > will get this encoding _instead_. It is probably the second or third > most common text encoding in the world. And Python doesn't quite > support it. > > You can see the character table for this encoding at: > https://encoding.spec.whatwg.org/index-windows-1252.txt > > For the sake of discussion, let's call this encoding "web-1252". > WHATWG calls it "windows-1252", but notice that it's subtly different > from Python's "windows-1252" encoding. Python's windows-1252 has bytes > that are undefined: > > >>> b'\x90'.decode('windows-1252') > UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position > 0: character maps to > > In web-1252, the bytes that are undefined according to windows-1252 > map to the control characters in those positions in iso-8859-1 -- that > is, the Unicode codepoints with the same number as the byte. In > web-1252, b'\x90' would decode as '\u0090'. According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does the same: ??? "According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API |MultiByteToWideChar | maps these to the corresponding C1 control codes ." And in ISO-8859-1, the same handling is done for unused code points even by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : ??? "*ISO-8859-1* is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429 " And what would you think -- these "C1 control codes" are also the corresponding Unicode points! ( https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) Since Windows is pretty much the reference implementation for "windows-xxxx" encodings, it even makes sense to alter the existing encodings rather than add new ones. > > This may seem like a silly encoding that encourages doing horrible > things with text. That's pretty much the case. But there's a reason > every Web browser implements it: > > - It's compatible with windows-1252 > - Any sequence of bytes can be round-tripped through it without losing > information > > It's not just this one encoding. WHATWG's encoding standard > (https://encoding.spec.whatwg.org/) contains modified versions of > windows-1250 through windows-1258 and windows-874. > > Support for these encodings matters to me, in part, because I maintain > a Unicode data-cleaning library, "ftfy". One thing it does is to > detect and undo encoding/decoding errors that cause mojibake, as long > as they're detectible and reversible. Looking at real-world examples > of text that has been damaged by mojibake, it's clear that lots of > text is transferred through what I'm calling the "web-1252" encoding, > in a way that's incompatible with Python's "windows-1252". > > In order to be able to work with and fix this kind of text, ftfy > registers new codecs -- and I implemented this even before I knew that > they were standardized in Web browsers. When ftfy is imported, you can > decode text as "sloppy-windows-1252" (the name I chose for this > encoding), for example. > > ftfy can tell people a sequence of steps that they can use in the > future to fix text that's like the text they provided. Very often, > these steps require the sloppy-windows-1252 or sloppy-windows-1251 > encoding, which means the steps only work with ftfy imported, even for > people who are not using the features of ftfy. > > Support for these encodings also seems highly relevant to people who > use Python for web scraping, as it would be desirable to maximize > compatibility with what a Web browser would do. > > This really seems like it belongs in the standard library instead of > being an incidental feature of my library. I know that code in the > standard library has "one foot in the grave". I _want_ these legacy > encodings to have one foot in the grave. But some of them are > extremely common, and Python code should be able to deal with them. > > Adding these encodings to Python would be straightforward to > implement. Does this require a PEP, a pull request, or further discussion? > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- Regards, Ivan -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Tue Jan 9 16:54:52 2018 From: barry at python.org (Barry Warsaw) Date: Tue, 09 Jan 2018 13:54:52 -0800 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: Steve Barnes wrote: > Currently invoking `python -c "some;separated;set of commands;"` will, > if you need to use any library functions, require one or more import > somelib; sections in the execution string. This results in rather > complex "one liners". > > On the other hand `python -m somelib` will load somelib and attempt to > execute its `__main__()` or give an error if there isn't one. > > What I would like to suggest is a mechanism to pre-load libraries before > evaluating the -c option as this would allow the use of code from > libraries that don't have a `__main__` function, or those that do but it > doesn't do what you want. It would be really cool if you could somehow write a file with a bunch of commands in it, and then get Python to execute those commands. Then it could still be a one line invocation, but you could do much more complex things, including import a bunch of modules before executing some code. I'm not sure what such a file would look like but here's a strawman: ``` import os import sys import somelib path = somelib.get_path() parent = os.path.dirname(path) print(parent) sys.exit(0 if os.path.isdir(parent) else 1) ``` Then you could run it like so: $ python3 myscript.py That seems like a nice, compact, one line invocation, but cI don't know, it probably needs some fleshing out. It's just a crazy idea, and there's probably not enough time to implement this for Python 3.7. Maybe for Python 3.8. time-machine-winking-ly y'rs, -Barry From vano at mail.mipt.ru Tue Jan 9 16:56:56 2018 From: vano at mail.mipt.ru (Ivan Pozdeev) Date: Wed, 10 Jan 2018 00:56:56 +0300 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: On 10.01.2018 0:54, Barry Warsaw wrote: > Steve Barnes wrote: >> Currently invoking `python -c "some;separated;set of commands;"` will, >> if you need to use any library functions, require one or more import >> somelib; sections in the execution string. This results in rather >> complex "one liners". >> >> On the other hand `python -m somelib` will load somelib and attempt to >> execute its `__main__()` or give an error if there isn't one. >> >> What I would like to suggest is a mechanism to pre-load libraries before >> evaluating the -c option as this would allow the use of code from >> libraries that don't have a `__main__` function, or those that do but it >> doesn't do what you want. > It would be really cool if you could somehow write a file with a bunch > of commands in it, and then get Python to execute those commands. Then > it could still be a one line invocation, but you could do much more > complex things, including import a bunch of modules before executing > some code. I'm not sure what such a file would look like but here's a > strawman: IPython's `run -i' does that. > > ``` > import os > import sys > import somelib > > path = somelib.get_path() > parent = os.path.dirname(path) > print(parent) > > sys.exit(0 if os.path.isdir(parent) else 1) > ``` > > Then you could run it like so: > > $ python3 myscript.py > > That seems like a nice, compact, one line invocation, but cI don't know, > it probably needs some fleshing out. It's just a crazy idea, and > there's probably not enough time to implement this for Python 3.7. > Maybe for Python 3.8. > > time-machine-winking-ly y'rs, > -Barry > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- Regards, Ivan From rspeer at luminoso.com Tue Jan 9 18:56:01 2018 From: rspeer at luminoso.com (Rob Speer) Date: Tue, 09 Jan 2018 23:56:01 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: Oh that's interesting. So it seems to be Python that's the exception here. Would we really be able to add entries to character mappings that haven't changed since Python 2.0? On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < python-ideas at python.org> wrote: > First of all, many thanks for such a excellently writen letter. It was a > real pleasure to read. > On 10.01.2018 0:15, Rob Speer wrote: > > Hi! I joined this list because I'm interested in filling a gap in Python's > standard library, relating to text encodings. > > There is an encoding with no name of its own. It's supported by every > current web browser and standardized by WHATWG. It's so prevalent that if > you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will > get this encoding _instead_. It is probably the second or third most common > text encoding in the world. And Python doesn't quite support it. > > You can see the character table for this encoding at: > https://encoding.spec.whatwg.org/index-windows-1252.txt > > For the sake of discussion, let's call this encoding "web-1252". WHATWG > calls it "windows-1252", but notice that it's subtly different from > Python's "windows-1252" encoding. Python's windows-1252 has bytes that are > undefined: > > >>> b'\x90'.decode('windows-1252') > UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: > character maps to > > In web-1252, the bytes that are undefined according to windows-1252 map to > the control characters in those positions in iso-8859-1 -- that is, the > Unicode codepoints with the same number as the byte. In web-1252, b'\x90' > would decode as '\u0090'. > > According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does > the same: > > "According to the information on Microsoft's and the Unicode > Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; > however, the Windows API MultiByteToWideChar > > maps these to the corresponding C1 control codes > ." > And in ISO-8859-1, the same handling is done for unused code points even > by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : > > "*ISO-8859-1* is the IANA > > preferred name for this standard when supplemented with the C0 and C1 > control codes > from ISO/IEC 6429 " > And what would you think -- these "C1 control codes" are also the > corresponding Unicode points! ( > https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) > > Since Windows is pretty much the reference implementation for > "windows-xxxx" encodings, it even makes sense to alter the existing > encodings rather than add new ones. > > > This may seem like a silly encoding that encourages doing horrible things > with text. That's pretty much the case. But there's a reason every Web > browser implements it: > > - It's compatible with windows-1252 > - Any sequence of bytes can be round-tripped through it without losing > information > > It's not just this one encoding. WHATWG's encoding standard ( > https://encoding.spec.whatwg.org/) contains modified versions of > windows-1250 through windows-1258 and windows-874. > > Support for these encodings matters to me, in part, because I maintain a > Unicode data-cleaning library, "ftfy". One thing it does is to detect and > undo encoding/decoding errors that cause mojibake, as long as they're > detectible and reversible. Looking at real-world examples of text that has > been damaged by mojibake, it's clear that lots of text is transferred > through what I'm calling the "web-1252" encoding, in a way that's > incompatible with Python's "windows-1252". > > In order to be able to work with and fix this kind of text, ftfy registers > new codecs -- and I implemented this even before I knew that they were > standardized in Web browsers. When ftfy is imported, you can decode text as > "sloppy-windows-1252" (the name I chose for this encoding), for example. > > ftfy can tell people a sequence of steps that they can use in the future > to fix text that's like the text they provided. Very often, these steps > require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which > means the steps only work with ftfy imported, even for people who are not > using the features of ftfy. > > Support for these encodings also seems highly relevant to people who use > Python for web scraping, as it would be desirable to maximize compatibility > with what a Web browser would do. > > This really seems like it belongs in the standard library instead of being > an incidental feature of my library. I know that code in the standard > library has "one foot in the grave". I _want_ these legacy encodings to > have one foot in the grave. But some of them are extremely common, and > Python code should be able to deal with them. > > Adding these encodings to Python would be straightforward to implement. > Does this require a PEP, a pull request, or further discussion? > > > _______________________________________________ > Python-ideas mailing listPython-ideas at python.orghttps://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > > -- > Regards, > Ivan > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Jan 9 21:27:40 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 10 Jan 2018 12:27:40 +1000 Subject: [Python-ideas] pdb to support running modules In-Reply-To: References: Message-ID: On 10 January 2018 at 05:18, Alexander Belopolsky wrote: > On Thu, Jan 4, 2018 at 8:15 PM, Guido van Rossum wrote: >> Sounds uncontroversial, this can just be done via bugs.python.org. >> > > .. and it has been proposed there over 7 years ago: > . Aye, I linked Mario's patch up to that as a higher level tracking issue - the `pdb -m` support is available in 3.7.0a4 :) He's also now submitted comparable patches for a couple of other modules which I'll aim to get to before 3.7.0b1 Cheers, Nick. P.S. For anyone curious as to what the delay has been, the problem is that the *public* runpy API isn't quite flexible enough to support the way some of these other modules want to run code, and even if we had added such an API, they likely would have required refactoring in order to use it. Mario's patches have instead been taking advantage of the fact that these are stdlib modules we're updating, and hence we can get away with using *private* runpy APIs for now, and then based on these conversions, we can look at what features the public runpy API would need in order for us to migrate them away from using those private interfaces. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Tue Jan 9 21:39:31 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 10 Jan 2018 12:39:31 +1000 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: On 10 January 2018 at 07:54, Barry Warsaw wrote: > Steve Barnes wrote: >> Currently invoking `python -c "some;separated;set of commands;"` will, >> if you need to use any library functions, require one or more import >> somelib; sections in the execution string. This results in rather >> complex "one liners". >> >> On the other hand `python -m somelib` will load somelib and attempt to >> execute its `__main__()` or give an error if there isn't one. >> >> What I would like to suggest is a mechanism to pre-load libraries before >> evaluating the -c option as this would allow the use of code from >> libraries that don't have a `__main__` function, or those that do but it >> doesn't do what you want. > > It would be really cool if you could somehow write a file with a bunch > of commands in it, and then get Python to execute those commands. Then > it could still be a one line invocation, but you could do much more > complex things, including import a bunch of modules before executing > some code. You jest, but doing that and then going on to process the rest of the command line the same way the interpreter normally would is genuinely tricky. Even `python -m runpy [other args]` doesn't emulate it perfectly, and its responsible for implementing large chunks of the regular behaviour :) For the coverage.py use case, an environment-based solution is also genuinely helpful, since you typically can't modify subprocess invocations just because the software is being tested. At the moment, there are approaches that rely on using either `sitecustomize` or `*.pth` files, but being able to write `PYTHONRUNFIRST="import coverage; coverage.process_startup()"` would be a fair bit clearer about what was actually going on. That example also shows why I'm wary of offering an import-only version of this: I believe it would encourage folks to write modules that have side effects on import, which is something we try to avoid doing. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Tue Jan 9 21:46:01 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 10 Jan 2018 12:46:01 +1000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On 10 January 2018 at 09:56, Rob Speer wrote: > Oh that's interesting. So it seems to be Python that's the exception here. > > Would we really be able to add entries to character mappings that haven't > changed since Python 2.0? Changing things that used to cause an exception into operations that produce a useful result is generally OK - it's going the other way (dubious output -> exception) that's always problematic. So as long as the Windows specialists give it a +1, updating the existing codecs to match the MultiByteToWideChar behaviour seems like a better option to me than offering multiple versions of the codecs (and that could then be done as a tracker enhancement request along the lines of "Make the windows-* text encodings match MultiByteToWideChar"). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From rspeer at luminoso.com Tue Jan 9 22:56:12 2018 From: rspeer at luminoso.com (Rob Speer) Date: Wed, 10 Jan 2018 03:56:12 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: One other thing I've noticed that's related to the WHATWG encoding list: in Python, the encoding name "windows-874" seems to be missing. The _encoding_ is there, as "cp874", but "windows-874" doesn't work as an alias for it the way that "windows-1252" works as an alias for "cp1252". That alias should be added, right? On Tue, 9 Jan 2018 at 21:46 Nick Coghlan wrote: > On 10 January 2018 at 09:56, Rob Speer wrote: > > Oh that's interesting. So it seems to be Python that's the exception > here. > > > > Would we really be able to add entries to character mappings that haven't > > changed since Python 2.0? > > Changing things that used to cause an exception into operations that > produce a useful result is generally OK - it's going the other way > (dubious output -> exception) that's always problematic. > > So as long as the Windows specialists give it a +1, updating the > existing codecs to match the MultiByteToWideChar behaviour seems like > a better option to me than offering multiple versions of the codecs > (and that could then be done as a tracker enhancement request along > the lines of "Make the windows-* text encodings match > MultiByteToWideChar"). > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Jan 9 23:16:29 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 10 Jan 2018 14:16:29 +1000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On 10 January 2018 at 13:56, Rob Speer wrote: > One other thing I've noticed that's related to the WHATWG encoding list: in > Python, the encoding name "windows-874" seems to be missing. The _encoding_ > is there, as "cp874", but "windows-874" doesn't work as an alias for it the > way that "windows-1252" works as an alias for "cp1252". That alias should be > added, right? Aye, that would make sense. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From p.f.moore at gmail.com Wed Jan 10 03:30:01 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 10 Jan 2018 08:30:01 +0000 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: On 10 January 2018 at 02:39, Nick Coghlan wrote: > For the coverage.py use case, an environment-based solution is also > genuinely helpful, since you typically can't modify subprocess > invocations just because the software is being tested. At the moment, > there are approaches that rely on using either `sitecustomize` or > `*.pth` files, but being able to write `PYTHONRUNFIRST="import > coverage; coverage.process_startup()"` would be a fair bit clearer > about what was actually going on. It's worth remembering that Windows doesn't have the equivalent of the Unix "VAR=xxx prog arg arg" syntax for one-time setting of an environment variable, so environment variable based solutions are strictly less useful than command line arguments. That's one reason I prefer -C over PYTHONRUNFIRST. Paul From p.f.moore at gmail.com Wed Jan 10 03:31:41 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 10 Jan 2018 08:31:41 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On 10 January 2018 at 04:16, Nick Coghlan wrote: > On 10 January 2018 at 13:56, Rob Speer wrote: >> One other thing I've noticed that's related to the WHATWG encoding list: in >> Python, the encoding name "windows-874" seems to be missing. The _encoding_ >> is there, as "cp874", but "windows-874" doesn't work as an alias for it the >> way that "windows-1252" works as an alias for "cp1252". That alias should be >> added, right? > > Aye, that would make sense. Agreed - extending the encodings and adding the alias both sound like reasonable enhancements to me. Paul From mal at egenix.com Wed Jan 10 03:38:42 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 10 Jan 2018 09:38:42 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On 10.01.2018 00:56, Rob Speer wrote: > Oh that's interesting. So it seems to be Python that's the exception here. > > Would we really be able to add entries to character mappings that haven't > changed since Python 2.0? The Windows mappings in Python come directly from the Unicode Consortium mapping files. If the Consortium changes the mappings, we can update them. If not, then we have a problem, since consumers are not only the win32 APIs, but also other tools out there running on completely different platforms, e.g. Java tools or web servers providing downloads using the Windows code page encodings. Allowing such mappings in the existing codecs would then result failures when the "other" sides see the decoded Unicode version and try to encode back into the original encoding - you'd move the problem from the Python side to the "other" side of the integration. I had a look on the Unicode FTP site and they have since added a new directory with mapping files they call "best fit": http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt The WideCharToMultiByte() defaults to best fit, but also offers a mode where it operates in standards compliant mode: https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx See flag WC_NO_BEST_FIT_CHARS. Unicode TR#22 is also clear on this: https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned It allows such best fit mappings to make encodings round-trip safe, but requires to keep these separate from the original standard mappings: """ It is very important that systems be able to distinguish between the fallback mappings and regular mappings. Systems like XML require the use of hex escape sequences (NCRs) to preserve round-trip integrity; use of fallback characters in that case corrupts the data. """ If you read the above section in TR#22 you quickly get reminded of what the Unicode error handlers do (we basically implement the three modes it mentions... raise, ignore, replace). Now, for unmapped sequences an error handler can opt for using a fallback sequence instead. So in addition to adding best fit codecs, there's also the option to add an error handler for best fit resolution of unmapped sequences. Given the above, I don't think we ought to change the existing standards compliant mappings, but use one of two solutions: a) add "best fit" encodings (see the Unicode FTP site for a list) b) add an error handlers "bestfit" which implements the fallback modes for the encodings in question > On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < > python-ideas at python.org> wrote: > >> First of all, many thanks for such a excellently writen letter. It was a >> real pleasure to read. >> On 10.01.2018 0:15, Rob Speer wrote: >> >> Hi! I joined this list because I'm interested in filling a gap in Python's >> standard library, relating to text encodings. >> >> There is an encoding with no name of its own. It's supported by every >> current web browser and standardized by WHATWG. It's so prevalent that if >> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will >> get this encoding _instead_. It is probably the second or third most common >> text encoding in the world. And Python doesn't quite support it. >> >> You can see the character table for this encoding at: >> https://encoding.spec.whatwg.org/index-windows-1252.txt >> >> For the sake of discussion, let's call this encoding "web-1252". WHATWG >> calls it "windows-1252", but notice that it's subtly different from >> Python's "windows-1252" encoding. Python's windows-1252 has bytes that are >> undefined: >> >>>>> b'\x90'.decode('windows-1252') >> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: >> character maps to >> >> In web-1252, the bytes that are undefined according to windows-1252 map to >> the control characters in those positions in iso-8859-1 -- that is, the >> Unicode codepoints with the same number as the byte. In web-1252, b'\x90' >> would decode as '\u0090'. >> >> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does >> the same: >> >> "According to the information on Microsoft's and the Unicode >> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; >> however, the Windows API MultiByteToWideChar >> >> maps these to the corresponding C1 control codes >> ." >> And in ISO-8859-1, the same handling is done for unused code points even >> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : >> >> "*ISO-8859-1* is the IANA >> >> preferred name for this standard when supplemented with the C0 and C1 >> control codes >> from ISO/IEC 6429 " >> And what would you think -- these "C1 control codes" are also the >> corresponding Unicode points! ( >> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) >> >> Since Windows is pretty much the reference implementation for >> "windows-xxxx" encodings, it even makes sense to alter the existing >> encodings rather than add new ones. >> >> >> This may seem like a silly encoding that encourages doing horrible things >> with text. That's pretty much the case. But there's a reason every Web >> browser implements it: >> >> - It's compatible with windows-1252 >> - Any sequence of bytes can be round-tripped through it without losing >> information >> >> It's not just this one encoding. WHATWG's encoding standard ( >> https://encoding.spec.whatwg.org/) contains modified versions of >> windows-1250 through windows-1258 and windows-874. >> >> Support for these encodings matters to me, in part, because I maintain a >> Unicode data-cleaning library, "ftfy". One thing it does is to detect and >> undo encoding/decoding errors that cause mojibake, as long as they're >> detectible and reversible. Looking at real-world examples of text that has >> been damaged by mojibake, it's clear that lots of text is transferred >> through what I'm calling the "web-1252" encoding, in a way that's >> incompatible with Python's "windows-1252". >> >> In order to be able to work with and fix this kind of text, ftfy registers >> new codecs -- and I implemented this even before I knew that they were >> standardized in Web browsers. When ftfy is imported, you can decode text as >> "sloppy-windows-1252" (the name I chose for this encoding), for example. >> >> ftfy can tell people a sequence of steps that they can use in the future >> to fix text that's like the text they provided. Very often, these steps >> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which >> means the steps only work with ftfy imported, even for people who are not >> using the features of ftfy. >> >> Support for these encodings also seems highly relevant to people who use >> Python for web scraping, as it would be desirable to maximize compatibility >> with what a Web browser would do. >> >> This really seems like it belongs in the standard library instead of being >> an incidental feature of my library. I know that code in the standard >> library has "one foot in the grave". I _want_ these legacy encodings to >> have one foot in the grave. But some of them are extremely common, and >> Python code should be able to deal with them. >> >> Adding these encodings to Python would be straightforward to implement. >> Does this require a PEP, a pull request, or further discussion? >> >> >> _______________________________________________ >> Python-ideas mailing listPython-ideas at python.orghttps://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> >> -- >> Regards, >> Ivan >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 10 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From steve.dower at python.org Wed Jan 10 04:47:44 2018 From: steve.dower at python.org (Steve Dower) Date: Wed, 10 Jan 2018 20:47:44 +1100 Subject: [Python-ideas] Allow to compile debug extension againstreleasePython in Windows In-Reply-To: References: <20180107211111.B73001200A1@relay2.telecom.mipt.ru> <10ff9787-8656-d973-3edd-6b13d1b6e659@mail.mipt.ru> Message-ID: You don?t have to use distutils to build your extension, and if you want anything more complex than a release build you probably should build it yourself. All you need is the right include and lib directories when compiling/linking, set the output extension to pyd instead of dll, and use /MD when targeting python.exe and /MDd when targeting python_d.exe. You can use whatever /Ox options you like regardless of your target. There?s nothing really special about what the build tools do, other than being readily available on most Python distros. Since you're on Windows, why not have a look at Visual Studio 2017? The Python workload has a native development option that includes a template for building/debugging an extension module with these options already configured (https://docs.microsoft.com/en-us/visualstudio/python/cpp-and-python), and the debugger supports stepping through both Python and C code simultaneously (https://docs.microsoft.com/en-us/visualstudio/python/debugging-mixed-mode). It might make your life a little easier. Top-posted from my Windows phone From: Ivan Pozdeev Sent: Wednesday, January 10, 2018 5:47 To: Steve Dower; python-ideas at python.org Subject: Re: [Python-ideas] Allow to compile debug extension againstreleasePython in Windows On 09.01.2018 21:35, Ivan Pozdeev via Python-ideas wrote: On 08.01.2018 0:11, Steve Dower wrote: It?s not a good idea. You end up with two different C runtimes in memory that cannot communicate, and many things will not work properly. ? If you compile your debug build extension with the non-debug CRT (/MD rather than /MDd) you will lose asserts, but otherwise it will work fine and the quoted code picks the release lib. Distutils' designers seem to have thought differently. Whether the extension is linked against pythonxx_d.lib is governed by the --debug switch to the `build' command rather than the type of the running Python. Compiler optimization flags and /MD(d) are inserted according to it, too. As a consequence, * I cannot install an extension into debug Python at all 'cuz `bdist_*' and `install' commands don't support --debug and invoke `debug' internally without it. I meant "invoke `build' internally without it." , sorry. This kafkaesque "you cannot do this because you cannot do this" is taking its toll on me... * Neither can I compile an extension for release Python without optimizations. I'm at a loss here. Either I'm missing something, or with the current build system, it's impossible to debug extensions! ? Or if you like, when you install Python 3.5 or later there are advanced options to install debug symbols and binaries. You can use a proper debug build against the debug binaries (python_d.exe). ? Cheers, Steve ? Top-posted from my Windows phone ? From: Ivan Pozdeev via Python-ideas Sent: Saturday, December 30, 2017 13:01 To: python-ideas at python.org Subject: [Python-ideas] Allow to compile debug extension against releasePython in Windows ? The Windows version of pyconfig.h has the following construct: ? ??? if defined(_DEBUG) ?????????? pragma comment(lib,"python37_d.lib") ??? elif defined(Py_LIMITED_API) ?????????? pragma comment(lib,"python3.lib") ??? else ?????????? pragma comment(lib,"python37.lib") ??? endif /* _DEBUG */ ? which fails the compilation of a debug version of an extension. Making debugging it... difficult. ? Perhaps we could define some other constant? ? I'm not sure whether such compilation is a good idea in general, so asking here at first. ? -- Regards, Ivan ? _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ ? -- Regards, Ivan _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ -- Regards, Ivan -------------- next part -------------- An HTML attachment was scrubbed... URL: From rspeer at luminoso.com Wed Jan 10 13:36:29 2018 From: rspeer at luminoso.com (Rob Speer) Date: Wed, 10 Jan 2018 18:36:29 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: I'm looking at the documentation of "best fit" mappings, and that seems to be a different matter. It appears that best-fit mappings are designed to be many-to-one mappings used only for encoding. "Examples of best fit are converting fullwidth letters to their counterparts when converting to single byte code pages, and mapping the Infinity character to the number 8." (Mapping ? to 8? Seriously?!) It also does things such as mapping Cyrillic letters to Latin letters that look like them. This is not what I'm interested in implementing. I just want there to be encodings that match the WHATWG encodings exactly. If they have to be given a different name, that's fine with me. On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg wrote: > On 10.01.2018 00:56, Rob Speer wrote: > > Oh that's interesting. So it seems to be Python that's the exception > here. > > > > Would we really be able to add entries to character mappings that haven't > > changed since Python 2.0? > > The Windows mappings in Python come directly from the Unicode > Consortium mapping files. > > If the Consortium changes the mappings, we can update them. > > If not, then we have a problem, since consumers are not only > the win32 APIs, but also other tools out there running on > completely different platforms, e.g. Java tools or web servers > providing downloads using the Windows code page encodings. > > Allowing such mappings in the existing codecs would then result > failures when the "other" sides see the decoded Unicode version and > try to encode back into the original encoding - you'd move the > problem from the Python side to the "other" side of the > integration. > > I had a look on the Unicode FTP site and they have since added > a new directory with mapping files they call "best fit": > > > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt > > The WideCharToMultiByte() defaults to best fit, but also offers > a mode where it operates in standards compliant mode: > > > https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx > > See flag WC_NO_BEST_FIT_CHARS. > > Unicode TR#22 is also clear on this: > > https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned > > It allows such best fit mappings to make encodings round-trip > safe, but requires to keep these separate from the original > standard mappings: > > """ > It is very important that systems be able to distinguish between the > fallback mappings and regular mappings. Systems like XML require the use > of hex escape sequences (NCRs) to preserve round-trip integrity; use of > fallback characters in that case corrupts the data. > """ > > If you read the above section in TR#22 you quickly get reminded > of what the Unicode error handlers do (we basically implement > the three modes it mentions... raise, ignore, replace). > > Now, for unmapped sequences an error handler can opt for > using a fallback sequence instead. > > So in addition to adding best fit codecs, there's also the > option to add an error handler for best fit resolution of > unmapped sequences. > > Given the above, I don't think we ought to change the existing > standards compliant mappings, but use one of two solutions: > > a) add "best fit" encodings (see the Unicode FTP site for > a list) > > b) add an error handlers "bestfit" which implements the > fallback modes for the encodings in question > > > > On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < > > python-ideas at python.org> wrote: > > > >> First of all, many thanks for such a excellently writen letter. It was a > >> real pleasure to read. > >> On 10.01.2018 0:15, Rob Speer wrote: > >> > >> Hi! I joined this list because I'm interested in filling a gap in > Python's > >> standard library, relating to text encodings. > >> > >> There is an encoding with no name of its own. It's supported by every > >> current web browser and standardized by WHATWG. It's so prevalent that > if > >> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will > >> get this encoding _instead_. It is probably the second or third most > common > >> text encoding in the world. And Python doesn't quite support it. > >> > >> You can see the character table for this encoding at: > >> https://encoding.spec.whatwg.org/index-windows-1252.txt > >> > >> For the sake of discussion, let's call this encoding "web-1252". WHATWG > >> calls it "windows-1252", but notice that it's subtly different from > >> Python's "windows-1252" encoding. Python's windows-1252 has bytes that > are > >> undefined: > >> > >>>>> b'\x90'.decode('windows-1252') > >> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position > 0: > >> character maps to > >> > >> In web-1252, the bytes that are undefined according to windows-1252 map > to > >> the control characters in those positions in iso-8859-1 -- that is, the > >> Unicode codepoints with the same number as the byte. In web-1252, > b'\x90' > >> would decode as '\u0090'. > >> > >> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does > >> the same: > >> > >> "According to the information on Microsoft's and the Unicode > >> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; > >> however, the Windows API MultiByteToWideChar > >> < > http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx > > > >> maps these to the corresponding C1 control codes > >> ." > >> And in ISO-8859-1, the same handling is done for unused code points even > >> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : > >> > >> "*ISO-8859-1* is the IANA > >> > >> preferred name for this standard when supplemented with the C0 and C1 > >> control codes > >> from ISO/IEC 6429 " > >> And what would you think -- these "C1 control codes" are also the > >> corresponding Unicode points! ( > >> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) > >> > >> Since Windows is pretty much the reference implementation for > >> "windows-xxxx" encodings, it even makes sense to alter the existing > >> encodings rather than add new ones. > >> > >> > >> This may seem like a silly encoding that encourages doing horrible > things > >> with text. That's pretty much the case. But there's a reason every Web > >> browser implements it: > >> > >> - It's compatible with windows-1252 > >> - Any sequence of bytes can be round-tripped through it without losing > >> information > >> > >> It's not just this one encoding. WHATWG's encoding standard ( > >> https://encoding.spec.whatwg.org/) contains modified versions of > >> windows-1250 through windows-1258 and windows-874. > >> > >> Support for these encodings matters to me, in part, because I maintain a > >> Unicode data-cleaning library, "ftfy". One thing it does is to detect > and > >> undo encoding/decoding errors that cause mojibake, as long as they're > >> detectible and reversible. Looking at real-world examples of text that > has > >> been damaged by mojibake, it's clear that lots of text is transferred > >> through what I'm calling the "web-1252" encoding, in a way that's > >> incompatible with Python's "windows-1252". > >> > >> In order to be able to work with and fix this kind of text, ftfy > registers > >> new codecs -- and I implemented this even before I knew that they were > >> standardized in Web browsers. When ftfy is imported, you can decode > text as > >> "sloppy-windows-1252" (the name I chose for this encoding), for example. > >> > >> ftfy can tell people a sequence of steps that they can use in the future > >> to fix text that's like the text they provided. Very often, these steps > >> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which > >> means the steps only work with ftfy imported, even for people who are > not > >> using the features of ftfy. > >> > >> Support for these encodings also seems highly relevant to people who use > >> Python for web scraping, as it would be desirable to maximize > compatibility > >> with what a Web browser would do. > >> > >> This really seems like it belongs in the standard library instead of > being > >> an incidental feature of my library. I know that code in the standard > >> library has "one foot in the grave". I _want_ these legacy encodings to > >> have one foot in the grave. But some of them are extremely common, and > >> Python code should be able to deal with them. > >> > >> Adding these encodings to Python would be straightforward to implement. > >> Does this require a PEP, a pull request, or further discussion? > >> > >> > >> _______________________________________________ > >> Python-ideas mailing listPython-ideas at python.orghttps:// > mail.python.org/mailman/listinfo/python-ideas > >> Code of Conduct: http://python.org/psf/codeofconduct/ > >> > >> > >> -- > >> Regards, > >> Ivan > >> > >> _______________________________________________ > >> Python-ideas mailing list > >> Python-ideas at python.org > >> https://mail.python.org/mailman/listinfo/python-ideas > >> Code of Conduct: http://python.org/psf/codeofconduct/ > >> > > > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 10 2018) > >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ > >>> Python Database Interfaces ... http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Wed Jan 10 14:04:22 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 10 Jan 2018 20:04:22 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> On 10.01.2018 19:36, Rob Speer wrote: > I'm looking at the documentation of "best fit" mappings, and that seems to > be a different matter. It appears that best-fit mappings are designed to be > many-to-one mappings used only for encoding. "Best fit" is what the Windows API is implementing. I don't believe it's a good strategy to create the confusion that WHATWG is introducing by using the same names for non-standard encodings. Python uses the Unicode Consortium standard encodings or otherwise internationally standardized ones for the stdlib. If someone wants to use different encodings, it's easily possible to pip install these as necessary. For the stdlib, I think we should stick to standards and not go for spreading non-standard ones. So -1 on adding WHATWG encodings to the stdlib. We could add encodings from the Unicode Best Fit mappings and call them e.g. "bestfit1252" as is done by the Unicode Consortium. They may not be the same as what the WHATWG defines, but serve a very similar purpose and match what is implemented by the Windows API. Adding valid new aliases is a different matter. As long as the aliases do map to the same encodings, those are perfectly fine to add. Thanks. > "Examples of best fit are converting fullwidth letters to their > counterparts when converting to single byte code pages, and mapping the > Infinity character to the number 8." (Mapping ? to 8? Seriously?!) It also > does things such as mapping Cyrillic letters to Latin letters that look > like them. > > This is not what I'm interested in implementing. I just want there to be > encodings that match the WHATWG encodings exactly. If they have to be given > a different name, that's fine with me. > > On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg wrote: > >> On 10.01.2018 00:56, Rob Speer wrote: >>> Oh that's interesting. So it seems to be Python that's the exception >> here. >>> >>> Would we really be able to add entries to character mappings that haven't >>> changed since Python 2.0? >> >> The Windows mappings in Python come directly from the Unicode >> Consortium mapping files. >> >> If the Consortium changes the mappings, we can update them. >> >> If not, then we have a problem, since consumers are not only >> the win32 APIs, but also other tools out there running on >> completely different platforms, e.g. Java tools or web servers >> providing downloads using the Windows code page encodings. >> >> Allowing such mappings in the existing codecs would then result >> failures when the "other" sides see the decoded Unicode version and >> try to encode back into the original encoding - you'd move the >> problem from the Python side to the "other" side of the >> integration. >> >> I had a look on the Unicode FTP site and they have since added >> a new directory with mapping files they call "best fit": >> >> >> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt >> >> The WideCharToMultiByte() defaults to best fit, but also offers >> a mode where it operates in standards compliant mode: >> >> >> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx >> >> See flag WC_NO_BEST_FIT_CHARS. >> >> Unicode TR#22 is also clear on this: >> >> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned >> >> It allows such best fit mappings to make encodings round-trip >> safe, but requires to keep these separate from the original >> standard mappings: >> >> """ >> It is very important that systems be able to distinguish between the >> fallback mappings and regular mappings. Systems like XML require the use >> of hex escape sequences (NCRs) to preserve round-trip integrity; use of >> fallback characters in that case corrupts the data. >> """ >> >> If you read the above section in TR#22 you quickly get reminded >> of what the Unicode error handlers do (we basically implement >> the three modes it mentions... raise, ignore, replace). >> >> Now, for unmapped sequences an error handler can opt for >> using a fallback sequence instead. >> >> So in addition to adding best fit codecs, there's also the >> option to add an error handler for best fit resolution of >> unmapped sequences. >> >> Given the above, I don't think we ought to change the existing >> standards compliant mappings, but use one of two solutions: >> >> a) add "best fit" encodings (see the Unicode FTP site for >> a list) >> >> b) add an error handlers "bestfit" which implements the >> fallback modes for the encodings in question >> >> >>> On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < >>> python-ideas at python.org> wrote: >>> >>>> First of all, many thanks for such a excellently writen letter. It was a >>>> real pleasure to read. >>>> On 10.01.2018 0:15, Rob Speer wrote: >>>> >>>> Hi! I joined this list because I'm interested in filling a gap in >> Python's >>>> standard library, relating to text encodings. >>>> >>>> There is an encoding with no name of its own. It's supported by every >>>> current web browser and standardized by WHATWG. It's so prevalent that >> if >>>> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will >>>> get this encoding _instead_. It is probably the second or third most >> common >>>> text encoding in the world. And Python doesn't quite support it. >>>> >>>> You can see the character table for this encoding at: >>>> https://encoding.spec.whatwg.org/index-windows-1252.txt >>>> >>>> For the sake of discussion, let's call this encoding "web-1252". WHATWG >>>> calls it "windows-1252", but notice that it's subtly different from >>>> Python's "windows-1252" encoding. Python's windows-1252 has bytes that >> are >>>> undefined: >>>> >>>>>>> b'\x90'.decode('windows-1252') >>>> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position >> 0: >>>> character maps to >>>> >>>> In web-1252, the bytes that are undefined according to windows-1252 map >> to >>>> the control characters in those positions in iso-8859-1 -- that is, the >>>> Unicode codepoints with the same number as the byte. In web-1252, >> b'\x90' >>>> would decode as '\u0090'. >>>> >>>> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does >>>> the same: >>>> >>>> "According to the information on Microsoft's and the Unicode >>>> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; >>>> however, the Windows API MultiByteToWideChar >>>> < >> http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx >>> >>>> maps these to the corresponding C1 control codes >>>> ." >>>> And in ISO-8859-1, the same handling is done for unused code points even >>>> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : >>>> >>>> "*ISO-8859-1* is the IANA >>>> >>>> preferred name for this standard when supplemented with the C0 and C1 >>>> control codes >>>> from ISO/IEC 6429 " >>>> And what would you think -- these "C1 control codes" are also the >>>> corresponding Unicode points! ( >>>> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) >>>> >>>> Since Windows is pretty much the reference implementation for >>>> "windows-xxxx" encodings, it even makes sense to alter the existing >>>> encodings rather than add new ones. >>>> >>>> >>>> This may seem like a silly encoding that encourages doing horrible >> things >>>> with text. That's pretty much the case. But there's a reason every Web >>>> browser implements it: >>>> >>>> - It's compatible with windows-1252 >>>> - Any sequence of bytes can be round-tripped through it without losing >>>> information >>>> >>>> It's not just this one encoding. WHATWG's encoding standard ( >>>> https://encoding.spec.whatwg.org/) contains modified versions of >>>> windows-1250 through windows-1258 and windows-874. >>>> >>>> Support for these encodings matters to me, in part, because I maintain a >>>> Unicode data-cleaning library, "ftfy". One thing it does is to detect >> and >>>> undo encoding/decoding errors that cause mojibake, as long as they're >>>> detectible and reversible. Looking at real-world examples of text that >> has >>>> been damaged by mojibake, it's clear that lots of text is transferred >>>> through what I'm calling the "web-1252" encoding, in a way that's >>>> incompatible with Python's "windows-1252". >>>> >>>> In order to be able to work with and fix this kind of text, ftfy >> registers >>>> new codecs -- and I implemented this even before I knew that they were >>>> standardized in Web browsers. When ftfy is imported, you can decode >> text as >>>> "sloppy-windows-1252" (the name I chose for this encoding), for example. >>>> >>>> ftfy can tell people a sequence of steps that they can use in the future >>>> to fix text that's like the text they provided. Very often, these steps >>>> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which >>>> means the steps only work with ftfy imported, even for people who are >> not >>>> using the features of ftfy. >>>> >>>> Support for these encodings also seems highly relevant to people who use >>>> Python for web scraping, as it would be desirable to maximize >> compatibility >>>> with what a Web browser would do. >>>> >>>> This really seems like it belongs in the standard library instead of >> being >>>> an incidental feature of my library. I know that code in the standard >>>> library has "one foot in the grave". I _want_ these legacy encodings to >>>> have one foot in the grave. But some of them are extremely common, and >>>> Python code should be able to deal with them. >>>> >>>> Adding these encodings to Python would be straightforward to implement. >>>> Does this require a PEP, a pull request, or further discussion? >>>> >>>> >>>> _______________________________________________ >>>> Python-ideas mailing listPython-ideas at python.orghttps:// >> mail.python.org/mailman/listinfo/python-ideas >>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>> >>>> >>>> -- >>>> Regards, >>>> Ivan >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>> >>> >>> >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> Code of Conduct: http://python.org/psf/codeofconduct/ >>> >> >> -- >> Marc-Andre Lemburg >> eGenix.com >> >> Professional Python Services directly from the Experts (#1, Jan 10 2018) >>>>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>>>> Python Database Interfaces ... http://products.egenix.com/ >>>>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ >> ________________________________________________________________________ >> >> ::: We implement business ideas - efficiently in both time and costs ::: >> >> eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 >> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg >> Registered at Amtsgericht Duesseldorf: HRB 46611 >> http://www.egenix.com/company/contact/ >> http://www.malemburg.com/ >> > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 10 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From rspeer at luminoso.com Wed Jan 10 14:13:39 2018 From: rspeer at luminoso.com (Rob Speer) Date: Wed, 10 Jan 2018 19:13:39 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: I was originally proposing these encodings under different names, and that's what I think they should have. Indeed, that helps because a pip installable library can backport the new encodings to previous versions of Python. Having a pip installable library as the _only_ way to use these encodings is the status quo that I am very familiar with. It's awkward. To use a package that registers new codecs, you have to import something from that package, even if you never call anything from what you imported, and that makes flake8 complain. The idea that an encoding name may or may not be registered, based on what has been imported, breaks our intuition about reading Python code and is very hard to statically analyze. I disagree with calling the WHATWG encodings that are implemented in every Web browser "non-standard". WHATWG may not have a typical origin story as a standards organization, but it _is_ the standards organization for the Web. I'm really not interested in best-fit mappings that turn infinity into "8" and square roots into "v". Making weird mappings like that sounds like a job for the "unidecode" library, not the stdlib. On Wed, 10 Jan 2018 at 13:36 Rob Speer wrote: > I'm looking at the documentation of "best fit" mappings, and that seems to > be a different matter. It appears that best-fit mappings are designed to be > many-to-one mappings used only for encoding. > > "Examples of best fit are converting fullwidth letters to their > counterparts when converting to single byte code pages, and mapping the > Infinity character to the number 8." (Mapping ? to 8? Seriously?!) It also > does things such as mapping Cyrillic letters to Latin letters that look > like them. > > This is not what I'm interested in implementing. I just want there to be > encodings that match the WHATWG encodings exactly. If they have to be given > a different name, that's fine with me. > > On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg wrote: > >> On 10.01.2018 00:56, Rob Speer wrote: >> > Oh that's interesting. So it seems to be Python that's the exception >> here. >> > >> > Would we really be able to add entries to character mappings that >> haven't >> > changed since Python 2.0? >> >> The Windows mappings in Python come directly from the Unicode >> Consortium mapping files. >> >> If the Consortium changes the mappings, we can update them. >> >> If not, then we have a problem, since consumers are not only >> the win32 APIs, but also other tools out there running on >> completely different platforms, e.g. Java tools or web servers >> providing downloads using the Windows code page encodings. >> >> Allowing such mappings in the existing codecs would then result >> failures when the "other" sides see the decoded Unicode version and >> try to encode back into the original encoding - you'd move the >> problem from the Python side to the "other" side of the >> integration. >> >> I had a look on the Unicode FTP site and they have since added >> a new directory with mapping files they call "best fit": >> >> >> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt >> >> The WideCharToMultiByte() defaults to best fit, but also offers >> a mode where it operates in standards compliant mode: >> >> >> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx >> >> See flag WC_NO_BEST_FIT_CHARS. >> >> Unicode TR#22 is also clear on this: >> >> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned >> >> It allows such best fit mappings to make encodings round-trip >> safe, but requires to keep these separate from the original >> standard mappings: >> >> """ >> It is very important that systems be able to distinguish between the >> fallback mappings and regular mappings. Systems like XML require the use >> of hex escape sequences (NCRs) to preserve round-trip integrity; use of >> fallback characters in that case corrupts the data. >> """ >> >> If you read the above section in TR#22 you quickly get reminded >> of what the Unicode error handlers do (we basically implement >> the three modes it mentions... raise, ignore, replace). >> >> Now, for unmapped sequences an error handler can opt for >> using a fallback sequence instead. >> >> So in addition to adding best fit codecs, there's also the >> option to add an error handler for best fit resolution of >> unmapped sequences. >> >> Given the above, I don't think we ought to change the existing >> standards compliant mappings, but use one of two solutions: >> >> a) add "best fit" encodings (see the Unicode FTP site for >> a list) >> >> b) add an error handlers "bestfit" which implements the >> fallback modes for the encodings in question >> >> >> > On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < >> > python-ideas at python.org> wrote: >> > >> >> First of all, many thanks for such a excellently writen letter. It was >> a >> >> real pleasure to read. >> >> On 10.01.2018 0:15, Rob Speer wrote: >> >> >> >> Hi! I joined this list because I'm interested in filling a gap in >> Python's >> >> standard library, relating to text encodings. >> >> >> >> There is an encoding with no name of its own. It's supported by every >> >> current web browser and standardized by WHATWG. It's so prevalent that >> if >> >> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you >> will >> >> get this encoding _instead_. It is probably the second or third most >> common >> >> text encoding in the world. And Python doesn't quite support it. >> >> >> >> You can see the character table for this encoding at: >> >> https://encoding.spec.whatwg.org/index-windows-1252.txt >> >> >> >> For the sake of discussion, let's call this encoding "web-1252". WHATWG >> >> calls it "windows-1252", but notice that it's subtly different from >> >> Python's "windows-1252" encoding. Python's windows-1252 has bytes that >> are >> >> undefined: >> >> >> >>>>> b'\x90'.decode('windows-1252') >> >> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position >> 0: >> >> character maps to >> >> >> >> In web-1252, the bytes that are undefined according to windows-1252 >> map to >> >> the control characters in those positions in iso-8859-1 -- that is, the >> >> Unicode codepoints with the same number as the byte. In web-1252, >> b'\x90' >> >> would decode as '\u0090'. >> >> >> >> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does >> >> the same: >> >> >> >> "According to the information on Microsoft's and the Unicode >> >> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; >> >> however, the Windows API MultiByteToWideChar >> >> < >> http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx >> > >> >> maps these to the corresponding C1 control codes >> >> ." >> >> And in ISO-8859-1, the same handling is done for unused code points >> even >> >> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : >> >> >> >> "*ISO-8859-1* is the IANA >> >> >> >> preferred name for this standard when supplemented with the C0 and C1 >> >> control codes >> >> from ISO/IEC 6429 " >> >> And what would you think -- these "C1 control codes" are also the >> >> corresponding Unicode points! ( >> >> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) >> >> >> >> Since Windows is pretty much the reference implementation for >> >> "windows-xxxx" encodings, it even makes sense to alter the existing >> >> encodings rather than add new ones. >> >> >> >> >> >> This may seem like a silly encoding that encourages doing horrible >> things >> >> with text. That's pretty much the case. But there's a reason every Web >> >> browser implements it: >> >> >> >> - It's compatible with windows-1252 >> >> - Any sequence of bytes can be round-tripped through it without losing >> >> information >> >> >> >> It's not just this one encoding. WHATWG's encoding standard ( >> >> https://encoding.spec.whatwg.org/) contains modified versions of >> >> windows-1250 through windows-1258 and windows-874. >> >> >> >> Support for these encodings matters to me, in part, because I maintain >> a >> >> Unicode data-cleaning library, "ftfy". One thing it does is to detect >> and >> >> undo encoding/decoding errors that cause mojibake, as long as they're >> >> detectible and reversible. Looking at real-world examples of text that >> has >> >> been damaged by mojibake, it's clear that lots of text is transferred >> >> through what I'm calling the "web-1252" encoding, in a way that's >> >> incompatible with Python's "windows-1252". >> >> >> >> In order to be able to work with and fix this kind of text, ftfy >> registers >> >> new codecs -- and I implemented this even before I knew that they were >> >> standardized in Web browsers. When ftfy is imported, you can decode >> text as >> >> "sloppy-windows-1252" (the name I chose for this encoding), for >> example. >> >> >> >> ftfy can tell people a sequence of steps that they can use in the >> future >> >> to fix text that's like the text they provided. Very often, these steps >> >> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which >> >> means the steps only work with ftfy imported, even for people who are >> not >> >> using the features of ftfy. >> >> >> >> Support for these encodings also seems highly relevant to people who >> use >> >> Python for web scraping, as it would be desirable to maximize >> compatibility >> >> with what a Web browser would do. >> >> >> >> This really seems like it belongs in the standard library instead of >> being >> >> an incidental feature of my library. I know that code in the standard >> >> library has "one foot in the grave". I _want_ these legacy encodings to >> >> have one foot in the grave. But some of them are extremely common, and >> >> Python code should be able to deal with them. >> >> >> >> Adding these encodings to Python would be straightforward to implement. >> >> Does this require a PEP, a pull request, or further discussion? >> >> >> >> >> >> _______________________________________________ >> >> Python-ideas mailing listPython-ideas at python.orghttps:// >> mail.python.org/mailman/listinfo/python-ideas >> >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> >> >> >> >> -- >> >> Regards, >> >> Ivan >> >> >> >> _______________________________________________ >> >> Python-ideas mailing list >> >> Python-ideas at python.org >> >> https://mail.python.org/mailman/listinfo/python-ideas >> >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> >> > >> > >> > >> > _______________________________________________ >> > Python-ideas mailing list >> > Python-ideas at python.org >> > https://mail.python.org/mailman/listinfo/python-ideas >> > Code of Conduct: http://python.org/psf/codeofconduct/ >> > >> >> -- >> Marc-Andre Lemburg >> eGenix.com >> >> Professional Python Services directly from the Experts (#1, Jan 10 2018) >> >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >> >>> Python Database Interfaces ... http://products.egenix.com/ >> >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ >> ________________________________________________________________________ >> >> ::: We implement business ideas - efficiently in both time and costs ::: >> >> eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 >> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg >> Registered at Amtsgericht Duesseldorf: HRB 46611 >> http://www.egenix.com/company/contact/ >> http://www.malemburg.com/ >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Wed Jan 10 14:56:55 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 10 Jan 2018 20:56:55 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: <613bd6ed-8f6c-2cd8-7419-bc3f2a5fdb50@egenix.com> On 10.01.2018 20:13, Rob Speer wrote: > I was originally proposing these encodings under different names, and > that's what I think they should have. Indeed, that helps because a pip > installable library can backport the new encodings to previous versions of > Python. > > Having a pip installable library as the _only_ way to use these encodings > is the status quo that I am very familiar with. It's awkward. To use a > package that registers new codecs, you have to import something from that > package, even if you never call anything from what you imported, and that > makes flake8 complain. The idea that an encoding name may or may not be > registered, based on what has been imported, breaks our intuition about > reading Python code and is very hard to statically analyze. You can have a function in the package which registers the codecs. That way you do have a call into the library and intuition is restored :-) (and flake should be happy as well): import mycodecs mycodecs.register() > I disagree with calling the WHATWG encodings that are implemented in every > Web browser "non-standard". WHATWG may not have a typical origin story as a > standards organization, but it _is_ the standards organization for the Web. I don't really want to get into a discussion here. Browsers use these modified encodings to cope with mojibake or web content which isn't quite standard compliant. That's a valid use case, but promoting such wrong use by having work-around encodings in the stdlib and having Python produce non-standard output doesn't strike me as a good way forward. We do have error handlers for dealing with partially corrupted data. I think that's good enough. > I'm really not interested in best-fit mappings that turn infinity into "8" > and square roots into "v". Making weird mappings like that sounds like a > job for the "unidecode" library, not the stdlib. Well, one of your main arguments was that the Windows API follows these best fit encodings. I agree that best fit may not necessarily be best fit for everyone :-) > On Wed, 10 Jan 2018 at 13:36 Rob Speer wrote: > >> I'm looking at the documentation of "best fit" mappings, and that seems to >> be a different matter. It appears that best-fit mappings are designed to be >> many-to-one mappings used only for encoding. >> >> "Examples of best fit are converting fullwidth letters to their >> counterparts when converting to single byte code pages, and mapping the >> Infinity character to the number 8." (Mapping ? to 8? Seriously?!) It also >> does things such as mapping Cyrillic letters to Latin letters that look >> like them. >> >> This is not what I'm interested in implementing. I just want there to be >> encodings that match the WHATWG encodings exactly. If they have to be given >> a different name, that's fine with me. >> >> On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg wrote: >> >>> On 10.01.2018 00:56, Rob Speer wrote: >>>> Oh that's interesting. So it seems to be Python that's the exception >>> here. >>>> >>>> Would we really be able to add entries to character mappings that >>> haven't >>>> changed since Python 2.0? >>> >>> The Windows mappings in Python come directly from the Unicode >>> Consortium mapping files. >>> >>> If the Consortium changes the mappings, we can update them. >>> >>> If not, then we have a problem, since consumers are not only >>> the win32 APIs, but also other tools out there running on >>> completely different platforms, e.g. Java tools or web servers >>> providing downloads using the Windows code page encodings. >>> >>> Allowing such mappings in the existing codecs would then result >>> failures when the "other" sides see the decoded Unicode version and >>> try to encode back into the original encoding - you'd move the >>> problem from the Python side to the "other" side of the >>> integration. >>> >>> I had a look on the Unicode FTP site and they have since added >>> a new directory with mapping files they call "best fit": >>> >>> >>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt >>> >>> The WideCharToMultiByte() defaults to best fit, but also offers >>> a mode where it operates in standards compliant mode: >>> >>> >>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx >>> >>> See flag WC_NO_BEST_FIT_CHARS. >>> >>> Unicode TR#22 is also clear on this: >>> >>> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned >>> >>> It allows such best fit mappings to make encodings round-trip >>> safe, but requires to keep these separate from the original >>> standard mappings: >>> >>> """ >>> It is very important that systems be able to distinguish between the >>> fallback mappings and regular mappings. Systems like XML require the use >>> of hex escape sequences (NCRs) to preserve round-trip integrity; use of >>> fallback characters in that case corrupts the data. >>> """ >>> >>> If you read the above section in TR#22 you quickly get reminded >>> of what the Unicode error handlers do (we basically implement >>> the three modes it mentions... raise, ignore, replace). >>> >>> Now, for unmapped sequences an error handler can opt for >>> using a fallback sequence instead. >>> >>> So in addition to adding best fit codecs, there's also the >>> option to add an error handler for best fit resolution of >>> unmapped sequences. >>> >>> Given the above, I don't think we ought to change the existing >>> standards compliant mappings, but use one of two solutions: >>> >>> a) add "best fit" encodings (see the Unicode FTP site for >>> a list) >>> >>> b) add an error handlers "bestfit" which implements the >>> fallback modes for the encodings in question >>> >>> >>>> On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < >>>> python-ideas at python.org> wrote: >>>> >>>>> First of all, many thanks for such a excellently writen letter. It was >>> a >>>>> real pleasure to read. >>>>> On 10.01.2018 0:15, Rob Speer wrote: >>>>> >>>>> Hi! I joined this list because I'm interested in filling a gap in >>> Python's >>>>> standard library, relating to text encodings. >>>>> >>>>> There is an encoding with no name of its own. It's supported by every >>>>> current web browser and standardized by WHATWG. It's so prevalent that >>> if >>>>> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you >>> will >>>>> get this encoding _instead_. It is probably the second or third most >>> common >>>>> text encoding in the world. And Python doesn't quite support it. >>>>> >>>>> You can see the character table for this encoding at: >>>>> https://encoding.spec.whatwg.org/index-windows-1252.txt >>>>> >>>>> For the sake of discussion, let's call this encoding "web-1252". WHATWG >>>>> calls it "windows-1252", but notice that it's subtly different from >>>>> Python's "windows-1252" encoding. Python's windows-1252 has bytes that >>> are >>>>> undefined: >>>>> >>>>>>>> b'\x90'.decode('windows-1252') >>>>> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position >>> 0: >>>>> character maps to >>>>> >>>>> In web-1252, the bytes that are undefined according to windows-1252 >>> map to >>>>> the control characters in those positions in iso-8859-1 -- that is, the >>>>> Unicode codepoints with the same number as the byte. In web-1252, >>> b'\x90' >>>>> would decode as '\u0090'. >>>>> >>>>> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does >>>>> the same: >>>>> >>>>> "According to the information on Microsoft's and the Unicode >>>>> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; >>>>> however, the Windows API MultiByteToWideChar >>>>> < >>> http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx >>>> >>>>> maps these to the corresponding C1 control codes >>>>> ." >>>>> And in ISO-8859-1, the same handling is done for unused code points >>> even >>>>> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : >>>>> >>>>> "*ISO-8859-1* is the IANA >>>>> >>>>> preferred name for this standard when supplemented with the C0 and C1 >>>>> control codes >>>>> from ISO/IEC 6429 " >>>>> And what would you think -- these "C1 control codes" are also the >>>>> corresponding Unicode points! ( >>>>> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) >>>>> >>>>> Since Windows is pretty much the reference implementation for >>>>> "windows-xxxx" encodings, it even makes sense to alter the existing >>>>> encodings rather than add new ones. >>>>> >>>>> >>>>> This may seem like a silly encoding that encourages doing horrible >>> things >>>>> with text. That's pretty much the case. But there's a reason every Web >>>>> browser implements it: >>>>> >>>>> - It's compatible with windows-1252 >>>>> - Any sequence of bytes can be round-tripped through it without losing >>>>> information >>>>> >>>>> It's not just this one encoding. WHATWG's encoding standard ( >>>>> https://encoding.spec.whatwg.org/) contains modified versions of >>>>> windows-1250 through windows-1258 and windows-874. >>>>> >>>>> Support for these encodings matters to me, in part, because I maintain >>> a >>>>> Unicode data-cleaning library, "ftfy". One thing it does is to detect >>> and >>>>> undo encoding/decoding errors that cause mojibake, as long as they're >>>>> detectible and reversible. Looking at real-world examples of text that >>> has >>>>> been damaged by mojibake, it's clear that lots of text is transferred >>>>> through what I'm calling the "web-1252" encoding, in a way that's >>>>> incompatible with Python's "windows-1252". >>>>> >>>>> In order to be able to work with and fix this kind of text, ftfy >>> registers >>>>> new codecs -- and I implemented this even before I knew that they were >>>>> standardized in Web browsers. When ftfy is imported, you can decode >>> text as >>>>> "sloppy-windows-1252" (the name I chose for this encoding), for >>> example. >>>>> >>>>> ftfy can tell people a sequence of steps that they can use in the >>> future >>>>> to fix text that's like the text they provided. Very often, these steps >>>>> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which >>>>> means the steps only work with ftfy imported, even for people who are >>> not >>>>> using the features of ftfy. >>>>> >>>>> Support for these encodings also seems highly relevant to people who >>> use >>>>> Python for web scraping, as it would be desirable to maximize >>> compatibility >>>>> with what a Web browser would do. >>>>> >>>>> This really seems like it belongs in the standard library instead of >>> being >>>>> an incidental feature of my library. I know that code in the standard >>>>> library has "one foot in the grave". I _want_ these legacy encodings to >>>>> have one foot in the grave. But some of them are extremely common, and >>>>> Python code should be able to deal with them. >>>>> >>>>> Adding these encodings to Python would be straightforward to implement. >>>>> Does this require a PEP, a pull request, or further discussion? >>>>> >>>>> >>>>> _______________________________________________ >>>>> Python-ideas mailing listPython-ideas at python.orghttps:// >>> mail.python.org/mailman/listinfo/python-ideas >>>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>>> >>>>> >>>>> -- >>>>> Regards, >>>>> Ivan >>>>> >>>>> _______________________________________________ >>>>> Python-ideas mailing list >>>>> Python-ideas at python.org >>>>> https://mail.python.org/mailman/listinfo/python-ideas >>>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>> >>> >>> -- >>> Marc-Andre Lemburg >>> eGenix.com >>> >>> Professional Python Services directly from the Experts (#1, Jan 10 2018) >>>>>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>>>>> Python Database Interfaces ... http://products.egenix.com/ >>>>>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ >>> ________________________________________________________________________ >>> >>> ::: We implement business ideas - efficiently in both time and costs ::: >>> >>> eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 >>> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg >>> Registered at Amtsgericht Duesseldorf: HRB 46611 >>> http://www.egenix.com/company/contact/ >>> http://www.malemburg.com/ >>> >> > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 10 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From rspeer at luminoso.com Wed Jan 10 15:20:45 2018 From: rspeer at luminoso.com (Rob Speer) Date: Wed, 10 Jan 2018 20:20:45 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <613bd6ed-8f6c-2cd8-7419-bc3f2a5fdb50@egenix.com> References: <613bd6ed-8f6c-2cd8-7419-bc3f2a5fdb50@egenix.com> Message-ID: > Well, one of your main arguments was that the Windows API follows these best fit encodings. No, that wasn't me, that was Ivan. My argument has been based on compatibility with Web technologies; I wanted these encodings before I knew what Windows did (and now what Windows does kind of horrifies me). Calling a register() function makes flake8 happy, at the cost of convenience, but it still has no clear connection to the place where you use the registered encodings. On Wed, 10 Jan 2018 at 14:57 M.-A. Lemburg wrote: > On 10.01.2018 20:13, Rob Speer wrote: > > I was originally proposing these encodings under different names, and > > that's what I think they should have. Indeed, that helps because a pip > > installable library can backport the new encodings to previous versions > of > > Python. > > > > Having a pip installable library as the _only_ way to use these encodings > > is the status quo that I am very familiar with. It's awkward. To use a > > package that registers new codecs, you have to import something from that > > package, even if you never call anything from what you imported, and that > > makes flake8 complain. The idea that an encoding name may or may not be > > registered, based on what has been imported, breaks our intuition about > > reading Python code and is very hard to statically analyze. > > You can have a function in the package which registers the > codecs. That way you do have a call into the library and intuition > is restored :-) (and flake should be happy as well): > > import mycodecs > mycodecs.register() > > > I disagree with calling the WHATWG encodings that are implemented in > every > > Web browser "non-standard". WHATWG may not have a typical origin story > as a > > standards organization, but it _is_ the standards organization for the > Web. > > I don't really want to get into a discussion here. Browsers > use these modified encodings to cope with mojibake or web content > which isn't quite standard compliant. That's a valid use case, > but promoting such wrong use by having work-around encodings in > the stdlib and having Python produce non-standard output > doesn't strike me as a good way forward. We do have error handlers > for dealing with partially corrupted data. I think that's good > enough. > > > I'm really not interested in best-fit mappings that turn infinity into > "8" > > and square roots into "v". Making weird mappings like that sounds like a > > job for the "unidecode" library, not the stdlib. > > Well, one of your main arguments was that the Windows API follows > these best fit encodings. > > I agree that best fit may not necessarily be best fit for > everyone :-) > > > > On Wed, 10 Jan 2018 at 13:36 Rob Speer wrote: > > > >> I'm looking at the documentation of "best fit" mappings, and that seems > to > >> be a different matter. It appears that best-fit mappings are designed > to be > >> many-to-one mappings used only for encoding. > >> > >> "Examples of best fit are converting fullwidth letters to their > >> counterparts when converting to single byte code pages, and mapping the > >> Infinity character to the number 8." (Mapping ? to 8? Seriously?!) It > also > >> does things such as mapping Cyrillic letters to Latin letters that look > >> like them. > >> > >> This is not what I'm interested in implementing. I just want there to be > >> encodings that match the WHATWG encodings exactly. If they have to be > given > >> a different name, that's fine with me. > >> > >> On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg wrote: > >> > >>> On 10.01.2018 00:56, Rob Speer wrote: > >>>> Oh that's interesting. So it seems to be Python that's the exception > >>> here. > >>>> > >>>> Would we really be able to add entries to character mappings that > >>> haven't > >>>> changed since Python 2.0? > >>> > >>> The Windows mappings in Python come directly from the Unicode > >>> Consortium mapping files. > >>> > >>> If the Consortium changes the mappings, we can update them. > >>> > >>> If not, then we have a problem, since consumers are not only > >>> the win32 APIs, but also other tools out there running on > >>> completely different platforms, e.g. Java tools or web servers > >>> providing downloads using the Windows code page encodings. > >>> > >>> Allowing such mappings in the existing codecs would then result > >>> failures when the "other" sides see the decoded Unicode version and > >>> try to encode back into the original encoding - you'd move the > >>> problem from the Python side to the "other" side of the > >>> integration. > >>> > >>> I had a look on the Unicode FTP site and they have since added > >>> a new directory with mapping files they call "best fit": > >>> > >>> > >>> > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt > >>> > >>> The WideCharToMultiByte() defaults to best fit, but also offers > >>> a mode where it operates in standards compliant mode: > >>> > >>> > >>> > https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx > >>> > >>> See flag WC_NO_BEST_FIT_CHARS. > >>> > >>> Unicode TR#22 is also clear on this: > >>> > >>> > https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned > >>> > >>> It allows such best fit mappings to make encodings round-trip > >>> safe, but requires to keep these separate from the original > >>> standard mappings: > >>> > >>> """ > >>> It is very important that systems be able to distinguish between the > >>> fallback mappings and regular mappings. Systems like XML require the > use > >>> of hex escape sequences (NCRs) to preserve round-trip integrity; use of > >>> fallback characters in that case corrupts the data. > >>> """ > >>> > >>> If you read the above section in TR#22 you quickly get reminded > >>> of what the Unicode error handlers do (we basically implement > >>> the three modes it mentions... raise, ignore, replace). > >>> > >>> Now, for unmapped sequences an error handler can opt for > >>> using a fallback sequence instead. > >>> > >>> So in addition to adding best fit codecs, there's also the > >>> option to add an error handler for best fit resolution of > >>> unmapped sequences. > >>> > >>> Given the above, I don't think we ought to change the existing > >>> standards compliant mappings, but use one of two solutions: > >>> > >>> a) add "best fit" encodings (see the Unicode FTP site for > >>> a list) > >>> > >>> b) add an error handlers "bestfit" which implements the > >>> fallback modes for the encodings in question > >>> > >>> > >>>> On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < > >>>> python-ideas at python.org> wrote: > >>>> > >>>>> First of all, many thanks for such a excellently writen letter. It > was > >>> a > >>>>> real pleasure to read. > >>>>> On 10.01.2018 0:15, Rob Speer wrote: > >>>>> > >>>>> Hi! I joined this list because I'm interested in filling a gap in > >>> Python's > >>>>> standard library, relating to text encodings. > >>>>> > >>>>> There is an encoding with no name of its own. It's supported by every > >>>>> current web browser and standardized by WHATWG. It's so prevalent > that > >>> if > >>>>> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you > >>> will > >>>>> get this encoding _instead_. It is probably the second or third most > >>> common > >>>>> text encoding in the world. And Python doesn't quite support it. > >>>>> > >>>>> You can see the character table for this encoding at: > >>>>> https://encoding.spec.whatwg.org/index-windows-1252.txt > >>>>> > >>>>> For the sake of discussion, let's call this encoding "web-1252". > WHATWG > >>>>> calls it "windows-1252", but notice that it's subtly different from > >>>>> Python's "windows-1252" encoding. Python's windows-1252 has bytes > that > >>> are > >>>>> undefined: > >>>>> > >>>>>>>> b'\x90'.decode('windows-1252') > >>>>> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in > position > >>> 0: > >>>>> character maps to > >>>>> > >>>>> In web-1252, the bytes that are undefined according to windows-1252 > >>> map to > >>>>> the control characters in those positions in iso-8859-1 -- that is, > the > >>>>> Unicode codepoints with the same number as the byte. In web-1252, > >>> b'\x90' > >>>>> would decode as '\u0090'. > >>>>> > >>>>> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows > does > >>>>> the same: > >>>>> > >>>>> "According to the information on Microsoft's and the Unicode > >>>>> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; > >>>>> however, the Windows API MultiByteToWideChar > >>>>> < > >>> > http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx > >>>> > >>>>> maps these to the corresponding C1 control codes > >>>>> ." > >>>>> And in ISO-8859-1, the same handling is done for unused code points > >>> even > >>>>> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) : > >>>>> > >>>>> "*ISO-8859-1* is the IANA > >>>>> > >>>>> preferred name for this standard when supplemented with the C0 and C1 > >>>>> control codes > > >>>>> from ISO/IEC 6429 " > >>>>> And what would you think -- these "C1 control codes" are also the > >>>>> corresponding Unicode points! ( > >>>>> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) ) > >>>>> > >>>>> Since Windows is pretty much the reference implementation for > >>>>> "windows-xxxx" encodings, it even makes sense to alter the existing > >>>>> encodings rather than add new ones. > >>>>> > >>>>> > >>>>> This may seem like a silly encoding that encourages doing horrible > >>> things > >>>>> with text. That's pretty much the case. But there's a reason every > Web > >>>>> browser implements it: > >>>>> > >>>>> - It's compatible with windows-1252 > >>>>> - Any sequence of bytes can be round-tripped through it without > losing > >>>>> information > >>>>> > >>>>> It's not just this one encoding. WHATWG's encoding standard ( > >>>>> https://encoding.spec.whatwg.org/) contains modified versions of > >>>>> windows-1250 through windows-1258 and windows-874. > >>>>> > >>>>> Support for these encodings matters to me, in part, because I > maintain > >>> a > >>>>> Unicode data-cleaning library, "ftfy". One thing it does is to detect > >>> and > >>>>> undo encoding/decoding errors that cause mojibake, as long as they're > >>>>> detectible and reversible. Looking at real-world examples of text > that > >>> has > >>>>> been damaged by mojibake, it's clear that lots of text is transferred > >>>>> through what I'm calling the "web-1252" encoding, in a way that's > >>>>> incompatible with Python's "windows-1252". > >>>>> > >>>>> In order to be able to work with and fix this kind of text, ftfy > >>> registers > >>>>> new codecs -- and I implemented this even before I knew that they > were > >>>>> standardized in Web browsers. When ftfy is imported, you can decode > >>> text as > >>>>> "sloppy-windows-1252" (the name I chose for this encoding), for > >>> example. > >>>>> > >>>>> ftfy can tell people a sequence of steps that they can use in the > >>> future > >>>>> to fix text that's like the text they provided. Very often, these > steps > >>>>> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, > which > >>>>> means the steps only work with ftfy imported, even for people who are > >>> not > >>>>> using the features of ftfy. > >>>>> > >>>>> Support for these encodings also seems highly relevant to people who > >>> use > >>>>> Python for web scraping, as it would be desirable to maximize > >>> compatibility > >>>>> with what a Web browser would do. > >>>>> > >>>>> This really seems like it belongs in the standard library instead of > >>> being > >>>>> an incidental feature of my library. I know that code in the standard > >>>>> library has "one foot in the grave". I _want_ these legacy encodings > to > >>>>> have one foot in the grave. But some of them are extremely common, > and > >>>>> Python code should be able to deal with them. > >>>>> > >>>>> Adding these encodings to Python would be straightforward to > implement. > >>>>> Does this require a PEP, a pull request, or further discussion? > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Python-ideas mailing listPython-ideas at python.orghttps:// > >>> mail.python.org/mailman/listinfo/python-ideas > >>>>> Code of Conduct: http://python.org/psf/codeofconduct/ > >>>>> > >>>>> > >>>>> -- > >>>>> Regards, > >>>>> Ivan > >>>>> > >>>>> _______________________________________________ > >>>>> Python-ideas mailing list > >>>>> Python-ideas at python.org > >>>>> https://mail.python.org/mailman/listinfo/python-ideas > >>>>> Code of Conduct: http://python.org/psf/codeofconduct/ > >>>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Python-ideas mailing list > >>>> Python-ideas at python.org > >>>> https://mail.python.org/mailman/listinfo/python-ideas > >>>> Code of Conduct: http://python.org/psf/codeofconduct/ > >>>> > >>> > >>> -- > >>> Marc-Andre Lemburg > >>> eGenix.com > >>> > >>> Professional Python Services directly from the Experts (#1, Jan 10 > 2018) > >>>>>> Python Projects, Coaching and Consulting ... > http://www.egenix.com/ > >>>>>> Python Database Interfaces ... > http://products.egenix.com/ > >>>>>> Plone/Zope Database Interfaces ... > http://zope.egenix.com/ > >>> > ________________________________________________________________________ > >>> > >>> ::: We implement business ideas - efficiently in both time and costs > ::: > >>> > >>> eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > >>> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > >>> Registered at Amtsgericht Duesseldorf: HRB 46611 > >>> http://www.egenix.com/company/contact/ > >>> http://www.malemburg.com/ > >>> > >> > > > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 10 2018) > >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ > >>> Python Database Interfaces ... http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Wed Jan 10 19:10:08 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 11 Jan 2018 10:10:08 +1000 Subject: [Python-ideas] Syntax to import modules before running command from the command line In-Reply-To: References: Message-ID: On 10 January 2018 at 18:30, Paul Moore wrote: > On 10 January 2018 at 02:39, Nick Coghlan wrote: >> For the coverage.py use case, an environment-based solution is also >> genuinely helpful, since you typically can't modify subprocess >> invocations just because the software is being tested. At the moment, >> there are approaches that rely on using either `sitecustomize` or >> `*.pth` files, but being able to write `PYTHONRUNFIRST="import >> coverage; coverage.process_startup()"` would be a fair bit clearer >> about what was actually going on. > > It's worth remembering that Windows doesn't have the equivalent of the > Unix "VAR=xxx prog arg arg" syntax for one-time setting of an > environment variable, so environment variable based solutions are > strictly less useful than command line arguments. That's one reason I > prefer -C over PYTHONRUNFIRST. The proposal is to offer both, not an either/or, but the idea isn't that folks would need to set the environment variable directly - it's that coverage.py itself (or a test runner) would set it so that subprocesses were captured in addition to the directly executed module. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Wed Jan 10 19:22:58 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 11 Jan 2018 10:22:58 +1000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> Message-ID: On 11 January 2018 at 05:04, M.-A. Lemburg wrote: > For the stdlib, I think we should stick to standards and > not go for spreading non-standard ones. > > So -1 on adding WHATWG encodings to the stdlib. We already support HTML5 in the standard library, and saying "We'll accept WHATWG's definition of HTML, but not their associated text encodings" seems like a strange place to draw a line when it comes to standards support. I do think your observation constitutes a compelling reason to leave the existing codecs alone though, and treat the web codecs as a distinct set of mappings. Given that, I think Rob's original suggestion of using "web-1252" et al is a good one. We can also separate them out in the documentation, such that we have three tables: * https://docs.python.org/3/library/codecs.html#standard-encodings (Unicode Consortium) * https://docs.python.org/3/library/codecs.html#python-specific-encodings (python-dev/PSF) * a new table for WHATWG encodings Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From chris.barker at noaa.gov Wed Jan 10 19:24:33 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 10 Jan 2018 16:24:33 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> Message-ID: On Wed, Jan 10, 2018 at 11:04 AM, M.-A. Lemburg wrote: > I don't believe it's a good strategy to create the confusion that > WHATWG is introducing by using the same names for non-standard > encodings. > agreed. > Python uses the Unicode Consortium standard encodings or > otherwise internationally standardized ones for the stdlib. > > If someone wants to use different encodings, it's easily > possible to pip install these as necessary. > > For the stdlib, I think we should stick to standards and > not go for spreading non-standard ones. > > So -1 on adding WHATWG encodings to the stdlib. > If the OP is right that it is one of the most widely used encodings in the world, it's kinda hard to call it "non-standard" I think practicality beats purity here -- if the WHATWG encoding(s) are clearly defined, widely used, and the names don't conflict with other standard encodings then it seems like a very good addition to the stdlib. So +1 -- provided that the proposed encoding(s) is "clearly defined, widely used, and the name doesn't conflict with other standard encodings" -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From gadgetsteve at live.co.uk Wed Jan 10 14:44:49 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Wed, 10 Jan 2018 19:44:49 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On 10/01/2018 19:13, Rob Speer wrote: > I was originally proposing these encodings under different names, and > that's what I think they should have. Indeed, that helps because a pip > installable library can backport the new encodings to previous versions > of Python. > > Having a pip installable library as the _only_ way to use these > encodings is the status quo that I am very familiar with. It's awkward. > To use a package that registers new codecs, you have to import something > from that package, even if you never call anything from what you > imported, and that makes flake8 complain. The idea that an encoding name > may or may not be registered, based on what has been imported, breaks > our intuition about reading Python code and is very hard to statically > analyze. > > I disagree with calling the WHATWG encodings that are implemented in > every Web browser "non-standard". WHATWG may not have a typical origin > story as a standards organization, but it _is_ the standards > organization for the Web. Please note that the WHATWG standard describes Windows-1252 as a "Legacy Single Byte Encoding" and to me the name suggests expects it to be implemented on Windows platforms and for Windows Specific Web Pages. THE Encoding - i.e. the standard that all browsers, and other web applications, are expected to adhere to is UTF-8. I am somewhat confused because according to https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the original examples) is undefined as the table only runs to 127 i.e. 0x7F. > > I'm really not interested in best-fit mappings that turn infinity into > "8" and square roots into "v". Making weird mappings like that sounds > like a job for the "unidecode" library, not the stdlib. > > On Wed, 10 Jan 2018 at 13:36 Rob Speer > wrote: > > I'm looking at the documentation of "best fit" mappings, and that > seems to be a different matter. It appears that best-fit mappings > are designed to be many-to-one mappings used only for encoding. > > "Examples of best fit are converting fullwidth letters to their > counterparts when converting to single byte code pages, and mapping > the Infinity character to the number 8." (Mapping ? to 8? > Seriously?!) It also does things such as mapping Cyrillic letters to > Latin letters that look like them. > > This is not what I'm interested in implementing. I just want there > to be encodings that match the WHATWG encodings exactly. If they > have to be given a different name, that's fine with me. > > On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg > wrote: > > On 10.01.2018 00:56, Rob Speer wrote: > > Oh that's interesting. So it seems to be Python that's the > exception here. > > > > Would we really be able to add entries to character mappings > that haven't > > changed since Python 2.0? > > The Windows mappings in Python come directly from the Unicode > Consortium mapping files. > > If the Consortium changes the mappings, we can update them. > > If not, then we have a problem, since consumers are not only > the win32 APIs, but also other tools out there running on > completely different platforms, e.g. Java tools or web servers > providing downloads using the Windows code page encodings. > > Allowing such mappings in the existing codecs would then result > failures when the "other" sides see the decoded Unicode version and > try to encode back into the original encoding - you'd move the > problem from the Python side to the "other" side of the > integration. > > I had a look on the Unicode FTP site and they have since added > a new directory with mapping files they call "best fit": > > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt > > > The WideCharToMultiByte() defaults to best fit, but also offers > a mode where it operates in standards compliant mode: > > https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx > > > See flag WC_NO_BEST_FIT_CHARS. > > Unicode TR#22 is also clear on this: > > https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned > > > It allows such best fit mappings to make encodings round-trip > safe, but requires to keep these separate from the original > standard mappings: > > """ > It is very important that systems be able to distinguish between the > fallback mappings and regular mappings. Systems like XML require > the use > of hex escape sequences (NCRs) to preserve round-trip integrity; > use of > fallback characters in that case corrupts the data. > """ > > If you read the above section in TR#22 you quickly get reminded > of what the Unicode error handlers do (we basically implement > the three modes it mentions... raise, ignore, replace). > > Now, for unmapped sequences an error handler can opt for > using a fallback sequence instead. > > So in addition to adding best fit codecs, there's also the > option to add an error handler for best fit resolution of > unmapped sequences. > > Given the above, I don't think we ought to change the existing > standards compliant mappings, but use one of two solutions: > > a) add "best fit" encodings (see the Unicode FTP site for > ? ?a list) > > b) add an error handlers "bestfit" which implements the > ? ?fallback modes for the encodings in question > > > > On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < > > python-ideas at python.org > wrote: > > > >> First of all, many thanks for such a excellently writen > letter. It was a > >> real pleasure to read. > >> On 10.01.2018 0:15, Rob Speer wrote: > >> > >> Hi! I joined this list because I'm interested in filling a > gap in Python's > >> standard library, relating to text encodings. > >> > >> There is an encoding with no name of its own. It's supported > by every > >> current web browser and standardized by WHATWG. It's so > prevalent that if > >> you ask a Web browser to decode "iso-8859-1" or > "windows-1252", you will > >> get this encoding _instead_. It is probably the second or > third most common > >> text encoding in the world. And Python doesn't quite support it. > >> > >> You can see the character table for this encoding at: > >> https://encoding.spec.whatwg.org/index-windows-1252.txt > > >> > >> For the sake of discussion, let's call this encoding > "web-1252". WHATWG > >> calls it "windows-1252", but notice that it's subtly > different from > >> Python's "windows-1252" encoding. Python's windows-1252 has > bytes that are > >> undefined: > >> > >>>>> b'\x90'.decode('windows-1252') > >> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 > in position 0: > >> character maps to > >> > >> In web-1252, the bytes that are undefined according to > windows-1252 map to > >> the control characters in those positions in iso-8859-1 -- > that is, the > >> Unicode codepoints with the same number as the byte. In > web-1252, b'\x90' > >> would decode as '\u0090'. > >> > >> According to https://en.wikipedia.org/wiki/Windows-1252 > > , Windows does > >> the same: > >> > >>? ? ?"According to the information on Microsoft's and the Unicode > >> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are > unused; > >> however, the Windows API MultiByteToWideChar > >> > > > >> maps these to the corresponding C1 control codes > >> >." > >> And in ISO-8859-1, the same handling is done for unused code > points even > >> by the standard ( > https://en.wikipedia.org/wiki/ISO/IEC_8859-1 > > ) : > >> > >>? ? ?"*ISO-8859-1* is the IANA > >> > > > >> preferred name for this standard when supplemented with the > C0 and C1 > >> control codes > > > >> from ISO/IEC 6429 > >" > >> And what would you think -- these "C1 control codes" are > also the > >> corresponding Unicode points! ( > >> > https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) > > ) > >> > >> Since Windows is pretty much the reference implementation for > >> "windows-xxxx" encodings, it even makes sense to alter the > existing > >> encodings rather than add new ones. > >> > >> > >> This may seem like a silly encoding that encourages doing > horrible things > >> with text. That's pretty much the case. But there's a reason > every Web > >> browser implements it: > >> > >> - It's compatible with windows-1252 > >> - Any sequence of bytes can be round-tripped through it > without losing > >> information > >> > >> It's not just this one encoding. WHATWG's encoding standard ( > >> https://encoding.spec.whatwg.org/ > ) > contains modified versions of > >> windows-1250 through windows-1258 and windows-874. > >> > >> Support for these encodings matters to me, in part, because > I maintain a > >> Unicode data-cleaning library, "ftfy". One thing it does is > to detect and > >> undo encoding/decoding errors that cause mojibake, as long > as they're > >> detectible and reversible. Looking at real-world examples of > text that has > >> been damaged by mojibake, it's clear that lots of text is > transferred > >> through what I'm calling the "web-1252" encoding, in a way > that's > >> incompatible with Python's "windows-1252". > >> > >> In order to be able to work with and fix this kind of text, > ftfy registers > >> new codecs -- and I implemented this even before I knew that > they were > >> standardized in Web browsers. When ftfy is imported, you can > decode text as > >> "sloppy-windows-1252" (the name I chose for this encoding), > for example. > >> > >> ftfy can tell people a sequence of steps that they can use > in the future > >> to fix text that's like the text they provided. Very often, > these steps > >> require the sloppy-windows-1252 or sloppy-windows-1251 > encoding, which > >> means the steps only work with ftfy imported, even for > people who are not > >> using the features of ftfy. > >> > >> Support for these encodings also seems highly relevant to > people who use > >> Python for web scraping, as it would be desirable to > maximize compatibility > >> with what a Web browser would do. > >> > >> This really seems like it belongs in the standard library > instead of being > >> an incidental feature of my library. I know that code in the > standard > >> library has "one foot in the grave". I _want_ these legacy > encodings to > >> have one foot in the grave. But some of them are extremely > common, and > >> Python code should be able to deal with them. > >> > >> Adding these encodings to Python would be straightforward to > implement. > >> Does this require a PEP, a pull request, or further discussion? > >> > >> > >> _______________________________________________ > >> Python-ideas mailing > listPython-ideas at python.orghttps://mail.python.org/mailman/listinfo/python-ideas > > >> Code of Conduct: http://python.org/psf/codeofconduct/ > > >> > >> > >> -- > >> Regards, > >> Ivan > >> > >> _______________________________________________ > >> Python-ideas mailing list > >> Python-ideas at python.org > >> https://mail.python.org/mailman/listinfo/python-ideas > > >> Code of Conduct: http://python.org/psf/codeofconduct/ > > >> > > > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan > 10 2018) > >>> Python Projects, Coaching and Consulting ... > http://www.egenix.com/ > > >>> Python Database Interfaces ... http://products.egenix.com/ > > >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and > costs ::: > > ? ?eGenix.com Software, Skills and Services GmbH > Pastor-Loeh-Str.48 > ? ? D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > ? ? ? ? ? ?Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > > http://www.malemburg.com/ > > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fpython-ideas&data=02%7C01%7C%7Cb2d01d06a38b43192b4308d5585e5542%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636512084531969851&sdata=Ruay67LA%2Fyv3Ki5jevX7qbBtaw1PfG6I5c00kFZzxNY%3D&reserved=0 > Code of Conduct: https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpython.org%2Fpsf%2Fcodeofconduct%2F&data=02%7C01%7C%7Cb2d01d06a38b43192b4308d5585e5542%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636512084531969851&sdata=Dfsa4ryYvzqKJFN9FtbuQPJ9T6mlArpkL0Z%2BwzAGGTg%3D&reserved=0 > -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From rosuav at gmail.com Wed Jan 10 21:22:30 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 11 Jan 2018 13:22:30 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On Thu, Jan 11, 2018 at 6:44 AM, Steve Barnes wrote: > > I am somewhat confused because according to > https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the > original examples) is undefined as the table only runs to 127 i.e. 0x7F. AIUI the table in that file assumes that the first 128 bytes are interpreted as per ASCII. So you're looking at the *next* 128 bytes, and line 16 is the one that handles byte 0x90. ChrisA From random832 at fastmail.com Thu Jan 11 01:11:51 2018 From: random832 at fastmail.com (Random832) Date: Thu, 11 Jan 2018 01:11:51 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: <1515651111.2496384.1231521992.2FA79113@webmail.messagingengine.com> On Wed, Jan 10, 2018, at 14:44, Steve Barnes wrote: > I am somewhat confused because according to > https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the > original examples) is undefined as the table only runs to 127 i.e. 0x7F. The spec referenced in the comments says "Let code point be the index code point for byte ? 0x80 in index single-byte." From mal at egenix.com Thu Jan 11 03:58:35 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Thu, 11 Jan 2018 09:58:35 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> Message-ID: <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> On 11.01.2018 01:22, Nick Coghlan wrote: > On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >> For the stdlib, I think we should stick to standards and >> not go for spreading non-standard ones. >> >> So -1 on adding WHATWG encodings to the stdlib. > > We already support HTML5 in the standard library, and saying "We'll > accept WHATWG's definition of HTML, but not their associated text > encodings" seems like a strange place to draw a line when it comes to > standards support. There's a problem with these encodings: they are mostly meant for decoding (broken) data, but as soon as we have them in the stdlib, people will also start using them for encoding data, producing more corrupted data. Do you really things it's a good idea to support this natively in Python ? The other problem is that WHATWG considers its documents "living standards", i.e. they are subject to change and don't come with a version number (apart from a date). This makes sense when you look at their mostly decoding-only nature, but, again for encoding, creates an interoperability problem. > I do think your observation constitutes a compelling reason to leave > the existing codecs alone though, and treat the web codecs as a > distinct set of mappings. Given that, I think Rob's original > suggestion of using "web-1252" et al is a good one. > > We can also separate them out in the documentation, such that we have > three tables: > > * https://docs.python.org/3/library/codecs.html#standard-encodings > (Unicode Consortium) > * https://docs.python.org/3/library/codecs.html#python-specific-encodings > (python-dev/PSF) > * a new table for WHATWG encodings > > Cheers, > Nick. > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 11 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From rosuav at gmail.com Thu Jan 11 04:01:04 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 11 Jan 2018 20:01:04 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> Message-ID: On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg wrote: > On 11.01.2018 01:22, Nick Coghlan wrote: >> On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >>> For the stdlib, I think we should stick to standards and >>> not go for spreading non-standard ones. >>> >>> So -1 on adding WHATWG encodings to the stdlib. >> >> We already support HTML5 in the standard library, and saying "We'll >> accept WHATWG's definition of HTML, but not their associated text >> encodings" seems like a strange place to draw a line when it comes to >> standards support. > > There's a problem with these encodings: they are mostly meant > for decoding (broken) data, but as soon as we have them in the stdlib, > people will also start using them for encoding data, producing more > corrupted data. > > Do you really things it's a good idea to support this natively > in Python ? > > The other problem is that WHATWG considers its documents "living > standards", i.e. they are subject to change and don't come with > a version number (apart from a date). > > This makes sense when you look at their mostly decoding-only > nature, but, again for encoding, creates an interoperability problem. Would it be viable to have them in the stdlib for decoding only? To have them simply not work for encoding? ChrisA From mal at egenix.com Thu Jan 11 04:14:59 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Thu, 11 Jan 2018 10:14:59 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> Message-ID: On 11.01.2018 10:01, Chris Angelico wrote: > On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg wrote: >> On 11.01.2018 01:22, Nick Coghlan wrote: >>> On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >>>> For the stdlib, I think we should stick to standards and >>>> not go for spreading non-standard ones. >>>> >>>> So -1 on adding WHATWG encodings to the stdlib. >>> >>> We already support HTML5 in the standard library, and saying "We'll >>> accept WHATWG's definition of HTML, but not their associated text >>> encodings" seems like a strange place to draw a line when it comes to >>> standards support. >> >> There's a problem with these encodings: they are mostly meant >> for decoding (broken) data, but as soon as we have them in the stdlib, >> people will also start using them for encoding data, producing more >> corrupted data. >> >> Do you really things it's a good idea to support this natively >> in Python ? >> >> The other problem is that WHATWG considers its documents "living >> standards", i.e. they are subject to change and don't come with >> a version number (apart from a date). >> >> This makes sense when you look at their mostly decoding-only >> nature, but, again for encoding, creates an interoperability problem. > > Would it be viable to have them in the stdlib for decoding only? To > have them simply not work for encoding? That would be possible and resolve the above issues I have with the encodings. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 11 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From storchaka at gmail.com Thu Jan 11 04:55:38 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 11 Jan 2018 11:55:38 +0200 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: 09.01.18 23:15, Rob Speer ????: > There is an encoding with no name of its own. It's supported by every > current web browser and standardized by WHATWG. It's so prevalent that > if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you > will get this encoding _instead_. It is probably the second or third > most common text encoding in the world. And Python doesn't quite support it. > > You can see the character table for this encoding at: > https://encoding.spec.whatwg.org/index-windows-1252.txt > > For the sake of discussion, let's call this encoding "web-1252". WHATWG > calls it "windows-1252", but notice that it's subtly different from > Python's "windows-1252" encoding.. Python's windows-1252 has bytes that > are undefined: > > >>> b'\x90'.decode('windows-1252') > UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position > 0: character maps to > > In web-1252, the bytes that are undefined according to windows-1252 map > to the control characters in those positions in iso-8859-1 -- that is, > the Unicode codepoints with the same number as the byte. In web-1252, > b'\x90' would decode as '\u0090'. > > This may seem like a silly encoding that encourages doing horrible > things with text. That's pretty much the case. But there's a reason > every Web browser implements it: > > - It's compatible with windows-1252 > - Any sequence of bytes can be round-tripped through it without losing > information > > It's not just this one encoding. WHATWG's encoding standard > (https://encoding.spec.whatwg.org/ ) > contains modified versions of windows-1250 through windows-1258 and > windows-874. The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character by character processing of the decoded text, and encode the result back with the same encoding. >>> b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape') '\udc90???' >>> '\udc90???'.encode('windows-1252', 'surrogateescape') b'\x90\x91\x92\x93' If you want to map unassigned bytes to other characters, you should just create a new error handler. There are caveats, since such characters are not distinguished from correctly decoded characters. The same problem with the UTF-8 encoding. WHATWG allows encoding and decoding surrogate characters in the range U+d800-U+dcff. This is contrary to the Unicode Standard and raises an error by default in Python. But you can allow encoding and decoding of surrogate characters by explicitly specifying the "surrogatepass" error handler. From stephanh42 at gmail.com Thu Jan 11 05:49:33 2018 From: stephanh42 at gmail.com (Stephan Houben) Date: Thu, 11 Jan 2018 11:49:33 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: Op 11 jan. 2018 10:56 schreef "Serhiy Storchaka" : 09.01.18 23:15, Rob Speer ????: > > > For the sake of discussion, let's call this encoding "web-1252". WHATWG > calls it "windows-1252", I'd suggest to name it then "whatwg-windows-152". and in general "whatwg-" + whatgwgs_name_of_encoding Stephan but notice that it's subtly different from Python's "windows-1252" encoding.. > Python's windows-1252 has bytes that are undefined: > > > >>> b'\x90'.decode('windows-1252') > UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: > character maps to > > In web-1252, the bytes that are undefined according to windows-1252 map to > the control characters in those positions in iso-8859-1 -- that is, the > Unicode codepoints with the same number as the byte. In web-1252, b'\x90' > would decode as '\u0090'. > > This may seem like a silly encoding that encourages doing horrible things > with text. That's pretty much the case. But there's a reason every Web > browser implements it: > > - It's compatible with windows-1252 > - Any sequence of bytes can be round-tripped through it without losing > information > > It's not just this one encoding. WHATWG's encoding standard ( > https://encoding.spec.whatwg.org/ ) > contains modified versions of windows-1250 through windows-1258 and > windows-874. > The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character by character processing of the decoded text, and encode the result back with the same encoding. >>> b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape') '\udc90???' >>> '\udc90???'.encode('windows-1252', 'surrogateescape') b'\x90\x91\x92\x93' If you want to map unassigned bytes to other characters, you should just create a new error handler. There are caveats, since such characters are not distinguished from correctly decoded characters. The same problem with the UTF-8 encoding. WHATWG allows encoding and decoding surrogate characters in the range U+d800-U+dcff. This is contrary to the Unicode Standard and raises an error by default in Python. But you can allow encoding and decoding of surrogate characters by explicitly specifying the "surrogatepass" error handler. _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Thu Jan 11 07:04:41 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 11 Jan 2018 13:04:41 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> Message-ID: <20180111130441.550bba61@fsol> On Wed, 10 Jan 2018 16:24:33 -0800 Chris Barker wrote: > On Wed, Jan 10, 2018 at 11:04 AM, M.-A. Lemburg wrote: > > > I don't believe it's a good strategy to create the confusion that > > WHATWG is introducing by using the same names for non-standard > > encodings. > > > > agreed. > > > > Python uses the Unicode Consortium standard encodings or > > otherwise internationally standardized ones for the stdlib. > > > > If someone wants to use different encodings, it's easily > > possible to pip install these as necessary. > > > > For the stdlib, I think we should stick to standards and > > not go for spreading non-standard ones. > > > > So -1 on adding WHATWG encodings to the stdlib. > > > > If the OP is right that it is one of the most widely used encodings in the > world, it's kinda hard to call it "non-standard" Define "widely used". If web-XXX is a superset of windows-XXX, then perhaps web-XXX is "used" in the sense of "used to decode valid windows-XXX data" (but windows-XXX could be used just as well to decode the same data). The question is rather: how often does web-XXX mojibake happen? We're well in the 2010s now and you'd hope that mojibake doesn't happen as often as it used to in, e.g., 1998. Regards Antoine. From njs at pobox.com Thu Jan 11 08:18:43 2018 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 11 Jan 2018 05:18:43 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <20180111130441.550bba61@fsol> Message-ID: On Jan 11, 2018 4:05 AM, "Antoine Pitrou" wrote: Define "widely used". If web-XXX is a superset of windows-XXX, then perhaps web-XXX is "used" in the sense of "used to decode valid windows-XXX data" (but windows-XXX could be used just as well to decode the same data). The question is rather: how often does web-XXX mojibake happen? We're well in the 2010s now and you'd hope that mojibake doesn't happen as often as it used to in, e.g., 1998. I'm not an expert here or anything, but from what we've been hearing it sounds like it must be used by all standard-compliant HTML parsers. I don't *like* the standard much, but I don't think that the stdlib should refuse to handle standard-compliant HTML, or help users handle standard-compliant HTML correctly, just because the HTML standard has unfortunate things in it. We're not going to convince them to change the standard or anything. And this whole thread started with someone said that their mojibake fixing library is having trouble because of this, so clearly mojibake does still exist. Does it help if we reframe it as not that whatwg is "wrong" about windows-1252, but rather that there is this encoding web-1252, and thanks to an interesting quirk of history, in HTML documents the byte sequence b'' indicates a file using this encoding? In fact the mapping between byte sequences and character sets here is so arbitrary that in standards-compliant HTML, the byte sequences b'', b'', and b'' *also* indicate that the file is encoded using web-1252. (See: https://encoding.spec.whatwg.org/#names-and-labels) -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Thu Jan 11 08:42:24 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 11 Jan 2018 14:42:24 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <20180111130441.550bba61@fsol> Message-ID: <20180111144224.43d6df1a@fsol> On Thu, 11 Jan 2018 05:18:43 -0800 Nathaniel Smith wrote: > I'm not an expert here or anything, but from what we've been hearing it > sounds like it must be used by all standard-compliant HTML parsers. I don't > *like* the standard much, but I don't think that the stdlib should refuse > to handle standard-compliant HTML, or help users handle standard-compliant > HTML correctly, just because the HTML standard has unfortunate things in > it. We're not going to convince them to change the standard or anything. > And this whole thread started with someone said that their mojibake fixing > library is having trouble because of this, so clearly mojibake does still > exist. This is true. The other question is what the bar is for admitting new encodings in the standard library. I don't know much about the history of past practices there, so I will happily leave the decision to other people such as Marc-Andr?. Regards Antoine. From random832 at fastmail.com Thu Jan 11 11:42:43 2018 From: random832 at fastmail.com (Random832) Date: Thu, 11 Jan 2018 11:42:43 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: > The way of solving this issue in Python is using an error handler. The > "surrogateescape" error handler is specially designed for lossless > reversible decoding. It maps every unassigned byte in the range > 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows > you to distinguish correctly decoded characters from the escaped bytes, > perform character by character processing of the decoded text, and > encode the result back with the same encoding. Maybe we need a new error handler that maps unassigned bytes in the range 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the encodings being discussed have behavior other than the "normal" version of the encoding plus what I just described? From random832 at fastmail.com Thu Jan 11 12:19:26 2018 From: random832 at fastmail.com (Random832) Date: Thu, 11 Jan 2018 12:19:26 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> Message-ID: <1515691166.461248.1232095104.7FC7DEC2@webmail.messagingengine.com> On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote: > There's a problem with these encodings: they are mostly meant > for decoding (broken) data, but as soon as we have them in the stdlib, > people will also start using them for encoding data, producing more > corrupted data. Is it really corrupted? > Do you really things it's a good idea to support this natively > in Python ? The problem is, that's ignoring the very real fact that this is, and has always been* the behavior of the native encodings built in to Windows. My opinion is that Microsoft, for whatever reason, misrepresented their encodings when they submitted them to Unicode. The native APIs for text conversion have mechanisms for error reporting, and these supposedly undefined characters do not trigger them as they do for e.g. CP932 0xA0. Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private use), a best fit mapping, and cp1252 0x81 maps to U+0081 (one of the mappings being discussed here) If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns an error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still returns U+0081. As far as the actual encoding implemented in windows is concerned, CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented), and not in any way a fallback or a "best fit" or an invalid character. *except for the addition of the Euro sign to each encoding at typically 0x80 in circa 1998. **It's worth mentioning that our cp932 returns U+F8F0, even with errors='strict', despite this not being present in the unicode published mapping. It has done this at least since the CJKCodecs change in 2004. I can't determine where (or if) it was implemented at all before that. From rspeer at luminoso.com Thu Jan 11 14:42:45 2018 From: rspeer at luminoso.com (Rob Speer) Date: Thu, 11 Jan 2018 19:42:45 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <1515691166.461248.1232095104.7FC7DEC2@webmail.messagingengine.com> References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> <1515691166.461248.1232095104.7FC7DEC2@webmail.messagingengine.com> Message-ID: > The question is rather: how often does web-XXX mojibake happen? Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-8 and decoded as codepage 1252. In Python's official windows-1252, this would at best be "???", using the 'replace' error handler. In web-1252, this would be "??\x9d". The web-1252 version is more common. Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "???" is your code crashing. On Thu, 11 Jan 2018 at 12:20 Random832 wrote: > On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote: > > There's a problem with these encodings: they are mostly meant > > for decoding (broken) data, but as soon as we have them in the stdlib, > > people will also start using them for encoding data, producing more > > corrupted data. > > Is it really corrupted? > > > Do you really things it's a good idea to support this natively > > in Python ? > > The problem is, that's ignoring the very real fact that this is, and has > always been* the behavior of the native encodings built in to Windows. My > opinion is that Microsoft, for whatever reason, misrepresented their > encodings when they submitted them to Unicode. The native APIs for text > conversion have mechanisms for error reporting, and these supposedly > undefined characters do not trigger them as they do for e.g. CP932 0xA0. > > Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private > use), a best fit mapping, and cp1252 0x81 maps to U+0081 (one of the > mappings being discussed here) > If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns > an error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still > returns U+0081. > > As far as the actual encoding implemented in windows is concerned, > CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented), > and not in any way a fallback or a "best fit" or an invalid character. > > *except for the addition of the Euro sign to each encoding at typically > 0x80 in circa 1998. > **It's worth mentioning that our cp932 returns U+F8F0, even with > errors='strict', despite this not being present in the unicode published > mapping. It has done this at least since the CJKCodecs change in 2004. I > can't determine where (or if) it was implemented at all before that. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rspeer at luminoso.com Thu Jan 11 14:55:07 2018 From: rspeer at luminoso.com (Rob Speer) Date: Thu, 11 Jan 2018 19:55:07 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> Message-ID: On Thu, 11 Jan 2018 at 11:43 Random832 wrote: > Maybe we need a new error handler that maps unassigned bytes in the range > 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the > encodings being discussed have behavior other than the "normal" version of > the encoding plus what I just described? > (accidentally replied individually instead of replaying all) There is one more difference I have found between Python's encodings and WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down what the Unicode Consortium has to say about this. Other than that, all the differences are adding the fall-throughs in the range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte b'\xff' is undefined, and it remains undefined in WHATWG's mapping. -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Thu Jan 11 14:56:50 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 11 Jan 2018 21:56:50 +0200 Subject: [Python-ideas] Make functions, methods and descriptor types living in the types module Message-ID: Currently the classes of functions (implemented in Python and builtin), methods, and different type of descriptors, generators, etc have the __module__ attribute equal to "builtins" and the name that can't be used for accessing the class. >>> def f(): pass ... >>> type(f) >>> type(f).__module__ 'builtins' >>> type(f).__name__ 'function' >>> type(f).__qualname__ 'function' >>> import builtins >>> builtins.function Traceback (most recent call last): File "", line 1, in AttributeError: module 'builtins' has no attribute 'function' But most of this classes (if not all) are exposed in the types module. I suggest to rename them. Make the __module__ attribute equal to "builtins" and the __name__ and the __qualname__ attributes equal to the name used for accessing the class in the types module. This would allow to pickle references to these types. Currently this isn't possible. >>> pickle.dumps(types.FunctionType) Traceback (most recent call last): File "", line 1, in _pickle.PicklingError: Can't pickle : attribute lookup function on builtins failed And this will help to implement the pickle support of dynamic functions etc. Currently the third-party library that implements this needs to use a special purposed factory function (not compatible with other similar libraries) since types.FunctionType isn't pickleable. From random832 at fastmail.com Thu Jan 11 15:15:34 2018 From: random832 at fastmail.com (Random832) Date: Thu, 11 Jan 2018 15:15:34 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> Message-ID: <1515701734.1570983.1232333824.2727F08F@webmail.messagingengine.com> On Thu, Jan 11, 2018, at 14:55, Rob Speer wrote: > There is one more difference I have found between Python's encodings and > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down > what the Unicode Consortium has to say about this. It appears in the best fit mapping (with a comment suggesting it unclear what vowel point it is actually meant to be) but not the normal mapping. > Other than that, all the differences are adding the fall-throughs in the > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte > b'\xff' is undefined, and it remains undefined in WHATWG's mapping. This is, for the record, also consistent with the results of my test program - 0xCA is treated as a perfectly ordinary mapping that goes to U+05BA, whereas 0xFF returns an error. In permissive mode it maps to U+F896. 0xCA U+05BA appears (with no glyph, though) in the code chart Microsoft published with https://www.microsoft.com/typography/unicode/cscp.htm, but not in the corresponding mapping list. It also does not appear in https://msdn.microsoft.com/en-us/library/cc195057.aspx. From python at mrabarnett.plus.com Thu Jan 11 16:09:22 2018 From: python at mrabarnett.plus.com (MRAB) Date: Thu, 11 Jan 2018 21:09:22 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <773ff078-aa95-fa2c-a225-bce8d921827c@egenix.com> <1515691166.461248.1232095104.7FC7DEC2@webmail.messagingengine.com> Message-ID: <43980031-842f-bc4b-9cb4-b7a197986b35@mrabarnett.plus.com> On 2018-01-11 19:42, Rob Speer wrote: > > The question is rather: how often does web-XXX mojibake happen? > > Very often. Particularly web-1252 mixed up with UTF-8. > > My ftfy library is tested on data from Twitter and the Common Crawl, > both prime sources of mojibake. One common mojibake sequence is when a > right curly quote is encoded as UTF-8 and decoded as codepage 1252. In > Python's official windows-1252, this would at best be "???", using the > 'replace' error handler. In web-1252, this would be "??\x9d". The > web-1252 version is more common. > > Of course, since Python itself is widespread, there is some survivorship > bias here. Another thing you could get instead of "???" is your code > crashing. > FWIW, I've occasionally seen that kind of mojibake on the news ticker of the BBC News channel. :-( [snip] From victor.stinner at gmail.com Thu Jan 11 17:41:30 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 11 Jan 2018 23:41:30 +0100 Subject: [Python-ideas] Make functions, methods and descriptor types living in the types module In-Reply-To: References: Message-ID: I like the idea of having a fully qualified name that "works" (can be resolved). I don't think that repr() should change, right? Can this change break the backward compatibility somehow? Victor Le 11 janv. 2018 21:00, "Serhiy Storchaka" a ?crit : > Currently the classes of functions (implemented in Python and builtin), > methods, and different type of descriptors, generators, etc have the > __module__ attribute equal to "builtins" and the name that can't be used > for accessing the class. > > >>> def f(): pass > ... > >>> type(f) > > >>> type(f).__module__ > 'builtins' > >>> type(f).__name__ > 'function' > >>> type(f).__qualname__ > 'function' > >>> import builtins > >>> builtins.function > Traceback (most recent call last): > File "", line 1, in > AttributeError: module 'builtins' has no attribute 'function' > > But most of this classes (if not all) are exposed in the types module. > > I suggest to rename them. Make the __module__ attribute equal to > "builtins" and the __name__ and the __qualname__ attributes equal to the > name used for accessing the class in the types module. > > This would allow to pickle references to these types. Currently this isn't > possible. > > >>> pickle.dumps(types.FunctionType) > Traceback (most recent call last): > File "", line 1, in > _pickle.PicklingError: Can't pickle : attribute lookup > function on builtins failed > > And this will help to implement the pickle support of dynamic functions > etc. Currently the third-party library that implements this needs to use a > special purposed factory function (not compatible with other similar > libraries) since types.FunctionType isn't pickleable. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve.dower at python.org Thu Jan 11 23:45:40 2018 From: steve.dower at python.org (Steve Dower) Date: Fri, 12 Jan 2018 15:45:40 +1100 Subject: [Python-ideas] Make functions, methods and descriptor types living in the types module In-Reply-To: References: Message-ID: <2cc1ae0c-3dc3-6003-08f7-1fd9fe5b4c23@python.org> I certainly have code that joins __module__ with __name__ to create a fully-qualified name (with special handling for those builtins that are not in builtins), and IIUC __qualname__ doesn't normally include the module name either (it's intended for nested types/functions). Can we make it visible when you import the builtins module, but not in the builtins namespace? Cheers, Steve On 12Jan2018 0941, Victor Stinner wrote: > I like the idea of having a fully qualified name that "works" (can be > resolved). > > I don't think that repr() should change, right? > > Can this change break the backward compatibility somehow? > > Victor > > Le 11 janv. 2018 21:00, "Serhiy Storchaka" > a ?crit : > > Currently the classes of functions (implemented in Python and > builtin), methods, and different type of descriptors, generators, > etc have the __module__ attribute equal to "builtins" and the name > that can't be used for accessing the class. > > >>> def f(): pass > ... > >>> type(f) > > >>> type(f).__module__ > 'builtins' > >>> type(f).__name__ > 'function' > >>> type(f).__qualname__ > 'function' > >>> import builtins > >>> builtins.function > Traceback (most recent call last): > File "", line 1, in > AttributeError: module 'builtins' has no attribute 'function' > > But most of this classes (if not all) are exposed in the types module. > > I suggest to rename them. Make the __module__ attribute equal to > "builtins" and the __name__ and the __qualname__ attributes equal to > the name used for accessing the class in the types module. > > This would allow to pickle references to these types. Currently this > isn't possible. > > >>> pickle.dumps(types.FunctionType) > Traceback (most recent call last): > File "", line 1, in > _pickle.PicklingError: Can't pickle : attribute > lookup function on builtins failed > > And this will help to implement the pickle support of dynamic > functions etc. Currently the third-party library that implements > this needs to use a special purposed factory function (not > compatible with other similar libraries) since types.FunctionType > isn't pickleable. From steve.dower at python.org Thu Jan 11 23:55:01 2018 From: steve.dower at python.org (Steve Dower) Date: Fri, 12 Jan 2018 15:55:01 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> Message-ID: On 12Jan2018 0342, Random832 wrote: > On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: >> The way of solving this issue in Python is using an error handler. The >> "surrogateescape" error handler is specially designed for lossless >> reversible decoding. It maps every unassigned byte in the range >> 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows >> you to distinguish correctly decoded characters from the escaped bytes, >> perform character by character processing of the decoded text, and >> encode the result back with the same encoding. > > Maybe we need a new error handler that maps unassigned bytes in the range 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the encodings being discussed have behavior other than the "normal" version of the encoding plus what I just described? +1 on this being an error handler (if possible). I suspect the semantics will be more complex than suggested above, but as this seems to be able handling normally un[en/de]codable characters, using an error handler to return something more sensible best represents what is going on. Call it something like 'web' or 'relaxed' or 'whatwg'. I don't know if error handlers have enough context for this though. If not, we should ensure they can have it. I'd much rather explain one new error handler to most people (and a more complex API for implementing them to the few people who do it) than explain a whole suite of new encodings. Cheers, Steve From turnbull.stephen.fw at u.tsukuba.ac.jp Fri Jan 12 02:05:08 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Fri, 12 Jan 2018 16:05:08 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <15ed2e9d-a1f5-9a76-5bb1-f038929d73d0@egenix.com> <20180111130441.550bba61@fsol> Message-ID: <23128.24100.781234.775655@turnbull.sk.tsukuba.ac.jp> Executive summary: we already do. Nathaniel suggests we should conform to the WHAT-WG standard. But AFAGCT[1], there is no such thing as "WHATWG versions of legacy encodings". The document at https://encoding.spec.whatwg.org/ has the following normative specifications (capitalized words are presumed to have RFC 2119 semantics, but the document doesn't say): 1. New content MUST be encoded in UTF-8, which may or may not be tagged with "charset=utf-8" according to context. 2. Error-handling is draconian. On decoding, errors MUST be fatal or (default) replacement of *all* uninterpretable text segments with U+FFFD REPLACEMENT CHARACTER. On encoding (i.e., of form input), errors MUST be (default) fatal or 'html' (ie, 💩-encoding). Developers SHOULD construct processes to negotiate for UTF-8 instead of using 'html'. 3. "Legacy" (ie, IANA registered) encoding names MUST be interpreted via a specified map to actual encodings. (I believe Subject: refers to a garbled interpretation of this requirement.) Note that "WHATWG codecs" don't help with this at all! There won't be labels for them in documents! You see charset="us-ascii" or charset="shift_jis", which correspond to existing Python codecs. What a Python process needs to do to conform: 1. Specify 'utf-8' as the codec. 2. Use 'strict', 'replace', or 'xmlcharrefreplace' in error handling, conforming to the restrictions for decoding and encoding. 3. Use https://pypi.python.org/pypi/webencodings for the mapping of "charset" or "encoding" labels to codecs. What we might want to do in the stdlib to make conformance easier: 1. Nothing. 2. Nothing. (Caveat: I have not checked that Python error handlers are 100% equivalent to the algorithms in the WHAT-WG encoding standard. I believe they are, or very close.) 3. Add the webencodings module to the stdlib. Add codecs if any are missing. (I haven't correlated the lists of codecs, but Python's is quite complete.) I think adding webencodings is a very plausible step. Maintenance and future development should be minimal since it's a very well- specified, complete, and self-contained standard. If somebody wants to do hacks not in the WHAT-WG encoding standard to improve "readability" of decoded broken HTML, I think they're on their own.[2] I'm -1 on adding hacks for unassigned code points or worse breakage to the stdlib. Such hacks belong in frameworks etc., or as standalone modules on PyPI. Footnotes: [1] OK, my Google-fu may be lacking today. [2] Although 💩 decoding is part of HTML (and has been for a long time), I'm sure that Python HTML-processing modules already handle that. It definitely doesn't belong in the codecs or error handlers, which handle the decoding of encoded text to "unencoded" text (or "internal encoding" if you prefer). "💩" as characters means PILE OF POO regardless of whether the text is encoded in ASCII or EBCDIC or written in graphite and carried on a physical RFC 1149 network -- it's a higher-level construct. From ncoghlan at gmail.com Fri Jan 12 02:48:48 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 12 Jan 2018 17:48:48 +1000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> Message-ID: On 12 January 2018 at 14:55, Steve Dower wrote: > On 12Jan2018 0342, Random832 wrote: >> >> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: >>> >>> The way of solving this issue in Python is using an error handler. The >>> "surrogateescape" error handler is specially designed for lossless >>> reversible decoding. It maps every unassigned byte in the range >>> 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows >>> you to distinguish correctly decoded characters from the escaped bytes, >>> perform character by character processing of the decoded text, and >>> encode the result back with the same encoding. >> >> Maybe we need a new error handler that maps unassigned bytes in the range >> 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the >> encodings being discussed have behavior other than the "normal" version of >> the encoding plus what I just described? > > > +1 on this being an error handler (if possible). I suspect the semantics > will be more complex than suggested above, but as this seems to be able > handling normally un[en/de]codable characters, using an error handler to > return something more sensible best represents what is going on. Call it > something like 'web' or 'relaxed' or 'whatwg'. > > I don't know if error handlers have enough context for this though. If not, > we should ensure they can have it. I'd much rather explain one new error > handler to most people (and a more complex API for implementing them to the > few people who do it) than explain a whole suite of new encodings. +1 from me, which shifts my position to be: 1. If we can make a decoding-only error handler that does the desired thing in combination with our existing codecs, lets do that (perhaps using a name like "controlpass", since the intent is to pass through otherwise unassigned latin-1 control characters, similar to the way "surrogatepass" allows lone surrogates) 2. Only if 1 fails for some reason would we look at adding the extra decode-only codec variants. Given the power of errors handlers, though, I expect the surrogatepass-style error handler approach will work (see https://docs.python.org/3/library/codecs.html#codecs.register_error and https://docs.python.org/3/library/exceptions.html#UnicodeError for an overview of the information they're given and what they can do about it). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From turnbull.stephen.fw at u.tsukuba.ac.jp Fri Jan 12 03:10:29 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Fri, 12 Jan 2018 17:10:29 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> Message-ID: <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> Rob Speer writes: > There is one more difference I have found between Python's encodings and > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down > what the Unicode Consortium has to say about this. In the past Microsoft has changed windows-125x coded character sets in Windows without updating the IANA registry. It's not clear to me how to deal with these nonstandards. I suspect that Microsoft will follow WHAT-WG in this in the end. Given that in practice Windows encodings are nonstandards not even followed by their defining authority, it seems reasonable to me that Python could update to following WHAT-WG, as long as it's a superset of the current codec (in a 3.x release, not a 3.x.y release); at least the way the encoding standard is presented they're pretty good at this, and likely more reliable going forward than Microsoft itself is on the legacy encodings. > Other than that, all the differences are adding the fall-throughs in the > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte > b'\xff' is undefined, and it remains undefined in WHATWG's mapping. I really do not want those fall-throughs to control characters in the stdlib, since they have no textual interpretation in any standard encoding. My interpretation is "you're under attack, shutter the windows and call the cops". If people want to use codecs incorporating them, they should have to import them separately in the context of a defensive framework that deals with them at a higher level. Probably there's no harm in a browser that does visual presentation, but in other contexts where there is text mixed with control codes we cannot predict what will happen since there is no standard interpretation in common (cross-platform) use AFAIK. And even in visual representation, out-of-channel codes can be problematic. I once crashed a Prime minicomputer by forwarding some ASCII art tuned for a VT-220 back to its author, who had stolen the very nice Prime console terminal and was using it for email. Hilarity ensued (for me, all my deadlines were weeks off). Programs are generally more robust today, but in most cases it would a lot safer to use xmlcharrefreplace or backslashreplace, or surrogateescape to ensure that paranoid Unicode processes would reject it. Especially since there are real hostiles out there. From fakedme+py at gmail.com Fri Jan 12 05:01:03 2018 From: fakedme+py at gmail.com (Soni L.) Date: Fri, 12 Jan 2018 08:01:03 -0200 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> Message-ID: <8c3ef8ba-1638-3505-2bf3-dbc057335813@gmail.com> On 2018-01-12 06:10 AM, Stephen J. Turnbull wrote: > Rob Speer writes: > > > There is one more difference I have found between Python's encodings and > > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down > > what the Unicode Consortium has to say about this. > > In the past Microsoft has changed windows-125x coded character sets in > Windows without updating the IANA registry. It's not clear to me how > to deal with these nonstandards. I suspect that Microsoft will follow > WHAT-WG in this in the end. > > Given that in practice Windows encodings are nonstandards not even > followed by their defining authority, it seems reasonable to me that > Python could update to following WHAT-WG, as long as it's a superset > of the current codec (in a 3.x release, not a 3.x.y release); at least > the way the encoding standard is presented they're pretty good at > this, and likely more reliable going forward than Microsoft itself is > on the legacy encodings. > > > Other than that, all the differences are adding the fall-throughs in the > > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte > > b'\xff' is undefined, and it remains undefined in WHATWG's mapping. > > I really do not want those fall-throughs to control characters in the > stdlib, since they have no textual interpretation in any standard > encoding. My interpretation is "you're under attack, shutter the > windows and call the cops". If people want to use codecs > incorporating them, they should have to import them separately in the > context of a defensive framework that deals with them at a higher > level. This is surprising to me because I always took those encodings to have those fallbacks. It's pretty wild to think someone wouldn't want them. > > Probably there's no harm in a browser that does visual presentation, > but in other contexts where there is text mixed with control codes we > cannot predict what will happen since there is no standard > interpretation in common (cross-platform) use AFAIK. And even in > visual representation, out-of-channel codes can be problematic. I > once crashed a Prime minicomputer by forwarding some ASCII art tuned > for a VT-220 back to its author, who had stolen the very nice Prime > console terminal and was using it for email. Hilarity ensued (for > me, all my deadlines were weeks off). Programs are generally more > robust today, but in most cases it would a lot safer to use > xmlcharrefreplace or backslashreplace, or surrogateescape to ensure > that paranoid Unicode processes would reject it. Especially since > there are real hostiles out there. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From random832 at fastmail.com Fri Jan 12 10:23:01 2018 From: random832 at fastmail.com (Random832) Date: Fri, 12 Jan 2018 10:23:01 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> Message-ID: <1515770581.2892134.1233236400.2694A704@webmail.messagingengine.com> On Fri, Jan 12, 2018, at 03:10, Stephen J. Turnbull wrote: > > Other than that, all the differences are adding the fall-throughs in the > > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte > > b'\xff' is undefined, and it remains undefined in WHATWG's mapping. > > I really do not want those fall-throughs to control characters in the > stdlib, since they have no textual interpretation in any standard > encoding. My interpretation is "you're under attack, shutter the > windows and call the cops". If people want to use codecs > incorporating them, they should have to import them separately in the > context of a defensive framework that deals with them at a higher > level. There are plenty of standard encodings that do have actual representations of the control characters. It's not clear why you consider it more dangerous for the "windows-1252" encoding to be able to return '\x81' for b'\x81' than for "latin-1" to do the same, or for "utf-8" to return it for b'\xc2\x81'. These characters exist. Supporting them in encodings that contain them in the real world, regardless what was submitted to the Unicode consortium, doesn't add any new attack surface. From random832 at fastmail.com Fri Jan 12 20:37:17 2018 From: random832 at fastmail.com (Random832) Date: Fri, 12 Jan 2018 20:37:17 -0500 Subject: [Python-ideas] Make functions, methods and descriptor types living in the types module In-Reply-To: References: Message-ID: <1515807437.1605578.1233819808.38A39C00@webmail.messagingengine.com> On Thu, Jan 11, 2018, at 17:41, Victor Stinner wrote: > I like the idea of having a fully qualified name that "works" (can be > resolved). > > I don't think that repr() should change, right? What if we made these types available under their current name in the types module? e.g. types.module, types.function, etc. From guido at python.org Sat Jan 13 00:19:47 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 12 Jan 2018 21:19:47 -0800 Subject: [Python-ideas] Make functions, methods and descriptor types living in the types module In-Reply-To: <2cc1ae0c-3dc3-6003-08f7-1fd9fe5b4c23@python.org> References: <2cc1ae0c-3dc3-6003-08f7-1fd9fe5b4c23@python.org> Message-ID: On Thu, Jan 11, 2018 at 8:45 PM, Steve Dower wrote: > I certainly have code that joins __module__ with __name__ to create a > fully-qualified name (with special handling for those builtins that are not > in builtins), and IIUC __qualname__ doesn't normally include the module > name either (it's intended for nested types/functions). > In fact __qualname__ should never include the module. It should however include the containing class(es). E.g. for # __main__.py: class Outer: class Inner: def f(self): pass print(Outer.Inner.f.__name__) # 'f' print(Outer.Inner.f.__qualname__) # 'Outer.Inner.f' print(Outer.Inner.f.__module__) # '__main__' IMO the current __module__ for these objects is just wrong (since they aren't in builtins -- even though they are "built in"). So in principle it should be okay to set it to 'types'. (Except I wish we didn't have the types module at all, but that's water under the bridge.) In practice I expect there to be some failing tests. Maybe investigate this for Python 3.8? > Can we make it visible when you import the builtins module, but not in the > builtins namespace? > That would violate the very definition of the builtins module. PS. There are probably many more of these. E.g. NoneType, dict_keys, etc. --Guido On 12Jan2018 0941, Victor Stinner wrote: > >> I like the idea of having a fully qualified name that "works" (can be >> resolved). >> >> I don't think that repr() should change, right? >> >> Can this change break the backward compatibility somehow? >> >> Victor >> >> Le 11 janv. 2018 21:00, "Serhiy Storchaka" > > a ?crit : >> >> >> Currently the classes of functions (implemented in Python and >> builtin), methods, and different type of descriptors, generators, >> etc have the __module__ attribute equal to "builtins" and the name >> that can't be used for accessing the class. >> >> >>> def f(): pass >> ... >> >>> type(f) >> >> >>> type(f).__module__ >> 'builtins' >> >>> type(f).__name__ >> 'function' >> >>> type(f).__qualname__ >> 'function' >> >>> import builtins >> >>> builtins.function >> Traceback (most recent call last): >> File "", line 1, in >> AttributeError: module 'builtins' has no attribute 'function' >> >> But most of this classes (if not all) are exposed in the types module. >> >> I suggest to rename them. Make the __module__ attribute equal to >> "builtins" and the __name__ and the __qualname__ attributes equal to >> the name used for accessing the class in the types module. >> >> This would allow to pickle references to these types. Currently this >> isn't possible. >> >> >>> pickle.dumps(types.FunctionType) >> Traceback (most recent call last): >> File "", line 1, in >> _pickle.PicklingError: Can't pickle : attribute >> lookup function on builtins failed >> >> And this will help to implement the pickle support of dynamic >> functions etc. Currently the third-party library that implements >> this needs to use a special purposed factory function (not >> compatible with other similar libraries) since types.FunctionType >> isn't pickleable. >> > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From sylvain.marie at schneider-electric.com Tue Jan 16 04:38:25 2018 From: sylvain.marie at schneider-electric.com (smarie) Date: Tue, 16 Jan 2018 01:38:25 -0800 (PST) Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: Message-ID: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> (for some reason google groups has accepted the message but the mailing list rejected it. Re-posting it, sorry for the inconvenience) Le mardi 28 novembre 2017 04:22:13 UTC+1, Nathan Schneider a ?crit : > > > I think it would be interesting to investigate how assert statements are > used in the wild. I can think of three kinds of uses: > > 1) Nonredundant checking: > 2) Redundant checking: > 3) Temporary debugging: > > Hello there I am very much interested by this topic, as I spent a couple months trying to come up with a solution with the obvious constraint of not changing the language. My goal was type checking and value validation for applications wishing to get these features (so obviously, with the possibly to not disable it at runtime even if the rest of the application is optimized). For type checking I discovered PEP484 and many type-checkers (enforce, pytypes...) out there, but for value validation I found nothing really satisfying to me except many sources saying not to use assert. I personally align with the idea already mentioned in this thread that assert is not a tool for "end-user value validation", but a tool for other kind of purposes - that are perfectly valid but are just not the same that "end-user value validation" (if the name is not perfect, feel free to propose another). What I think we're missing is ?get consistent and customizable validation outcome, whatever the inner validation means?. Typically ?assert isfinite(x)? is today a good example of what is wrong: 1. it can be disabled globally by end-users even if the lib developer does not want it, 2. if x is None the exception is different from the exception you get if x is not finite, 3. you can not customize the exception type for easy error codes internationalization, you can only customize the message. My proposal would be to rather define another statement for example 'validate ' with specific constraints on the type of exception to ensure consistency (for example only subclasses of a ValidationError defined in the stdlib) You can find my attempt to do that in the valid8 project https://smarie.github.io/python-valid8, with the 'assert_valid(...)' function. With the current language limitations I could not define something as simple as 'validate ', but I created a mini lambda library to at least keep some level of simplicity for the . The result is for example : class InvalidSurface(ValidationError): help_msg = 'Surface should be a positive number' assert_valid('surface', surf, x > 0, error_type=InvalidSurface) Oh, by the way, in valid8 I have the notion of variable name everywhere (again, because thats interesting to the application) but I'm not sure whether it would make sense to keep it in a 'validate' statement. Let me know what you think Kind regards Sylvain -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Tue Jan 16 05:01:18 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 16 Jan 2018 12:01:18 +0200 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> Message-ID: 16.01.18 11:38, smarie ????: > You can find my attempt to do that in the valid8 project > https://smarie.github.io/python-valid8 > , with the 'assert_valid(...)' > function. With the current language limitations I could not define > something as simple as 'validate or exception_message>', but I created a mini lambda library to at least > keep some level of simplicity for the . The result is for > example : > > class InvalidSurface(ValidationError): > ??? help_msg = 'Surface should be a positive number' > > assert_valid('surface', surf, x > 0, error_type=InvalidSurface) What is the advantage over the simple if not x > 0: raise InvalidSurface('surface') ? From steve at pearwood.info Tue Jan 16 05:23:27 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 16 Jan 2018 21:23:27 +1100 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> Message-ID: <20180116102320.GF1982@ando.pearwood.info> On Tue, Jan 16, 2018 at 01:38:25AM -0800, smarie wrote: > Typically ?assert isfinite(x)? is today a good example of what is wrong: > > 1. it can be disabled globally by end-users even if the lib developer > does not want it, That's not a bug, or even a problem ("wrong"), it is the very purpose of assert. Assertions are intended to allow the end user to disable the checks. If you, the developer, don't want a check to be disabled, then you shouldn't call it an assertion and use assert. I sometimes wish that Python included a richer set of assertions rather than just a single `assert` keyword. Something like Eiffel's concept of pre-conditions, post-conditions and invariants, where each can be enabled or disabled independently. But even with Python's sparse assertion API, the reason for assert is for debugging checks which can optionally be disabled. Checks which should never be disabled don't belong as assertions. > 2. if x is None the exception is different from the exception you get if > x is not finite This is a little untidy, but I don't see why this should be considered "wrong". That's just the way Python operates: however you write a check, it is going to rely on the check succeeding, and if it raises an exception, you will see that exception. That's a good thing, not "wrong" -- if my assertion isfinite(x) fails because x is None, I'd much rather see a TypeError than AssertionError: x is not finite > 3. you can not customize the exception type for easy error codes > internationalization, you can only customize the message. assert is not intended to be used to present user-friendly error messages to the end-user. The end-user should never see an AssertionError, and if they do, that's a bug, not a recoverable error or an avoidable environmental/input error. I have come to the conclusion that a good way to think about assertions is that they are often "checked comments". Comments like: # when we get here, x is a finite number quickly get out of sync with code. To quote Michael Foord: "At Resolver we've found it useful to short-circuit any doubt and just refer to comments in code as 'lies'. " -- Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22 Instead, turn it into an assertion, if you can: assert isfinite(x) which is both inline developer documentation ("x is a finite number") and a debugging aid (it will fail during development is x is not a finite number). And, if the cost of those assertions becomes excessive, the end-user can disable them. > My proposal would be to rather define another statement for example > > 'validate exception_message>' The problem with a statement called "validate" is that it will break a huge number of programs that already include functions and methods using that name. But apart from the use of a keyword, we already have a way to do almost exactly what you want: if not expression: raise ValidationError(message) after defining some appropriate ValidationError class. And it is only a few key presses longer than the proposed: validate expression, ValidationError, message -- Steve From sylvain.marie at schneider-electric.com Tue Jan 16 04:43:45 2018 From: sylvain.marie at schneider-electric.com (smarie) Date: Tue, 16 Jan 2018 01:43:45 -0800 (PST) Subject: [Python-ideas] PEP 557 Dataclasses evolution: supporting implicit field creation with __init__ signature introspection Message-ID: <298a039c-405a-4ba8-924d-15ad3ff22a01@googlegroups.com> (my first post seems to have been accepted by google groups but rejected by the mailing list. The first repost attempt was completely destructured by outlook > Reposting a 2d time? VERY sorry for the inconvenience, I?m not familiar with google groups) Hello there I recently found out that while I was developing autoclass in complete, na?ve ignorance of the python evolution process, there was a PEP (557, here ) growing, being proposed by Eric V. Smith; and now accepted since december. While this PEP is a GREAT step forward developer-friendly classes in python, I do not feel that it covers all of the use cases I wanted to cover with autoclass. I discussed with Eric about the particular one below and he suggested to post it here: In the case users want to benefit from the code generation while keeping custom __init__ methods the current proposal leads to double declaration as shown in the pep. The choice I made with autoclass is opposite/complementary: rather than using class descriptor to define fields, the developer includes all of them in the __init__ method signature, and an optional include/exclude parameter allows him/her to further customize the list if need be. See here . I think that both ways to declare fields actually make sense, depending on the use case. Compact pure-data classes will be more compact and readable with fields defined as class descriptors. Hybrid data+logic classes with custom constructors will be more compact if the fields used in the custom constructor signature are not defined twice (also in a descriptor). Supporting both ways to declare the fields, possibly both styles in the same class definition (but does it make sense? not sure), would avoid duplication when users want custom constructors. The example in the pep would become: @dataclass() class ArgHolder: def __init__(self, *args: Any, **kwargs: Any): pass Does that make sense ? Kind regards Sylvain -------------- next part -------------- An HTML attachment was scrubbed... URL: From sylvain.marie at schneider-electric.com Tue Jan 16 10:41:10 2018 From: sylvain.marie at schneider-electric.com (smarie) Date: Tue, 16 Jan 2018 07:41:10 -0800 (PST) Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: <9b6d2c4c-785d-4922-bfc8-a360cb544154@googlegroups.com> Le mardi 16 janvier 2018 16:37:29 UTC+1, smarie a ?crit : > > > validate is_foo_compliant(x) or is_bar_compliant(x) > ValidationError(message) > This was a typo in this case since we use the base ValidationError it even would simplify to validate is_foo_compliant(x) or is_bar_compliant(x), message -------------- next part -------------- An HTML attachment was scrubbed... URL: From sylvain.marie at schneider-electric.com Tue Jan 16 10:37:29 2018 From: sylvain.marie at schneider-electric.com (smarie) Date: Tue, 16 Jan 2018 07:37:29 -0800 (PST) Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <20180116102320.GF1982@ando.pearwood.info> References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: Le mardi 16 janvier 2018 11:24:34 UTC+1, Steven D'Aprano a ?crit : > > That's not a bug, or even a problem ("wrong"), it is the very purpose of > assert. Assertions are intended to allow the end user to disable the > checks. > > If you, the developer, don't want a check to be disabled, then you > shouldn't call it an assertion and use assert. > That is exactly what I'm saying. It seems that we both agree that applicative value validation is different from asserts, and that assert should not be used for applicative value validation. For this reason, I do not suggest to go in the direction the OP is mentioning but rather to explicitly separate the 2 concepts by creating a new statement for value validation. > The problem with a statement called "validate" is that it will break a > huge number of programs that already include functions and methods using > that name. > You definitely make a point here. But that would be the case for absolutely *any* language evolution as soon as the proposed statements are plain old english words. Should it be a show-stopper ? I dont think so. > But apart from the use of a keyword, we already have a way to do almost > exactly what you want: > > if not expression: raise ValidationError(message) > > after defining some appropriate ValidationError class. And it is only a > few key presses longer than the proposed: > > validate expression, ValidationError, message > This is precisely what is not good in my opinion: here you do not separate from . Of course if is just a "x > 0" statement, it works, but now what if you rely on a 3d-party provided validation function (or even yours) such as e.g. "is_foo_compliant" ? if not is_foo_compliant(x): raise ValidationError(message) What if this third part method raises an exception instead of returning False in some cases ? try: if not is_foo_compliant(x): raise ValidationError(message) except: raise MyValidationError(message) What if you want to compose this third party function with *another* one that returns False but not exceptions ? Say, with an OR ? (one or the other should work). Yields: try: if not is_foo_compliant(x): raise ValidationError(message) except: if not is_bar_compliant(x): raise MyValidationError(message) It starts to be quite ugly, messy... while the applicative intent is clear and could be expressed directly as: validate is_foo_compliant(x) or is_bar_compliant(x) ValidationError(message) The goal is really to let developers express their applicative intent =(what should be checked and what is the outcome if anything goes wrong), and give them confidence that the statement will always fail the same way, whatever the failure modes /behaviour of the checkers used in the statement. This is what valid8 proposes today with assert_valid, but obviously again having it built-in in the language would be much more concise. Note that the class and function decorators (@validate_arg, @validate_field...) would remain useful anyway. -- Sylvain -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Tue Jan 16 10:55:59 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 16 Jan 2018 15:55:59 +0000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: Grr, Google Groups gateway messes up reply-to. Apologies to anyone who gets a double-post, please can posters ensure that reply-to is set to the list, and *not* to the Google Groups gateway? Thanks. Paul On 16 January 2018 at 15:54, Paul Moore wrote: > On 16 January 2018 at 15:37, smarie > wrote: > >>> If you, the developer, don't want a check to be disabled, then you >>> shouldn't call it an assertion and use assert. >> >> >> That is exactly what I'm saying. It seems that we both agree that >> applicative value validation is different from asserts, and that assert >> should not be used for applicative value validation. >> For this reason, I do not suggest to go in the direction the OP is >> mentioning but rather to explicitly separate the 2 concepts by creating a >> new statement for value validation. >> >>> >>> The problem with a statement called "validate" is that it will break a >>> huge number of programs that already include functions and methods using >>> that name. >> >> You definitely make a point here. But that would be the case for absolutely >> *any* language evolution as soon as the proposed statements are plain old >> english words. Should it be a show-stopper ? I dont think so. > > Why does this need to be a statement at all? Unlike assert, it's > always executed, so it can be defined as a simple function: > > def validate(test, message): > if not test: > raise ValidartionError(message) > >>> But apart from the use of a keyword, we already have a way to do almost >>> exactly what you want: >>> >>> if not expression: raise ValidationError(message) >>> >>> after defining some appropriate ValidationError class. And it is only a >>> few key presses longer than the proposed: >>> >>> validate expression, ValidationError, message >> >> >> This is precisely what is not good in my opinion: here you do not separate >> from . Of course if >> is just a "x > 0" statement, it works, but now what if you rely on a >> 3d-party provided validation function (or even yours) such as e.g. >> "is_foo_compliant" ? >> >> if not is_foo_compliant(x): raise ValidationError(message) >> >> What if this third part method raises an exception instead of returning >> False in some cases ? >> >> try: >> if not is_foo_compliant(x): raise ValidationError(message) >> except: >> raise MyValidationError(message) >> >> What if you want to compose this third party function with *another* one >> that returns False but not exceptions ? Say, with an OR ? (one or the other >> should work). Yields: >> >> try: >> if not is_foo_compliant(x): raise ValidationError(message) >> except: >> if not is_bar_compliant(x): >> raise MyValidationError(message) >> >> It starts to be quite ugly, messy... while the applicative intent is clear >> and could be expressed directly as: >> >> validate is_foo_compliant(x) or is_bar_compliant(x) >> ValidationError(message) > > I don't see how a validate statement avoids having to deal with all of > the complexity you mention here. And it's *far* easier to handle this > as a standalone function - if you find a new requirement like the ones > you suggest above, you simply modify the function (and release an > updated version of your package, if you choose to release your code on > PyPI) and you're done. With a new statement, you'd need to raise a > Python feature request, wait for at least the next Python release to > see the modification, and *still* have to support people on older > versions of Python with the unfixed version. > > Also, a validate() function will wor on older versions of Python, all > the way back to Python 2.7 if you want. > > Paul From sylvain.marie at schneider-electric.com Tue Jan 16 11:25:52 2018 From: sylvain.marie at schneider-electric.com (smarie) Date: Tue, 16 Jan 2018 08:25:52 -0800 (PST) Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: Thanks Paul Le mardi 16 janvier 2018 16:56:57 UTC+1, Paul Moore a ?crit : > > > Why does this need to be a statement at all? Unlike assert, it's > > always executed, so it can be defined as a simple function > Very good point. Actually that's what I already provide in valid8 with the assert_valid function. See all examples here . Let's consider this example where users want to define on-the-fly one of the validation functions, and combine it with another with a 'or': assert_valid('surface', surf, or_(lambda x: (x >= 0) & (x < 10000), is_foo_compliant), help_msg="surface should be 0== 0) & (x < 10000), is_foo_compliant), help_msg="surface should be between 0 and 10000 or foo compliant") or even (if you pre-convert Is_foo_compliant to mini_lambda) assert_valid('surface', surf, ((x >= 0) & (x < 10000)) | Is_foo_compliant), help_msg="surface should be between 0 and 10000 or foo compliant") There are three reasons why having a 'validate' statement would improve this: * no more parenthesis: more elegant and readable * inline use of python (1): no more use of lambda or mini_lambda, no performance overhead * inline use of python (2): composition would not require custom function composition operators such as 'or_' (above) or mini-lambda composition anymore, it could be built-in in any language element used after resulting in validate (surf >= 0) & (surf < 10000) or is_foo_compliant(surf), "surface should be 0= I don't see how a validate statement avoids having to deal with all of > > the complexity you mention here. It would obviously need to be quite smart :) but it is possible, since I can do it with assert_valid today. -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Tue Jan 16 12:01:01 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 16 Jan 2018 17:01:01 +0000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: I fixed the reply-to this time, looks like you're still getting messed up by Google Groups. On 16 January 2018 at 16:25, smarie wrote: > Let's consider this example where users want to define on-the-fly one of the > validation functions, and combine it with another with a 'or': > > assert_valid('surface', surf, or_(lambda x: (x >= 0) & (x < 10000), > is_foo_compliant), help_msg="surface should be 0= > How ugly for something so simple ! I tried to make it slightly more compact > by developping a mini lambda syntax but it obviously makes it slower. Why do you do this? What's the requirement for delaying evaluation of the condition? A validate statement in Python wouldn't be any better able to do that, so it'd be just as ugly with a statement. There's no reason I can see why I'd ever need delayed evaluation, so what's wrong with just assert_valid(0 <= surf < 10000 and is_foo_compliant(surf), help_msg="surface should be 0= There are three reasons why having a 'validate' statement would improve > this: > > * no more parenthesis: more elegant and readable > * inline use of python (1): no more use of lambda or mini_lambda, no > performance overhead > * inline use of python (2): composition would not require custom function > composition operators such as 'or_' (above) or mini-lambda composition > anymore, it could be built-in in any language element used after > > resulting in > > validate (surf >= 0) & (surf < 10000) or is_foo_compliant(surf), > "surface should be 0= > (I removed the variable name alias 'surface' since I don't know if it should > remain or not) > > Elegant, isn't it ? No more so than my function version, but yes far more so than yours... Paul From p.f.moore at gmail.com Tue Jan 16 13:35:17 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 16 Jan 2018 18:35:17 +0000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: On 16 January 2018 at 17:36, Sylvain MARIE wrote: > (trying with direct reply this time) > >> Why do you do this? What's the requirement for delaying evaluation of the condition? > > Thanks for challenging my poorly chosen examples :) > > The primary requirement is about *catching* unwanted/uncontrolled/heterogenous exceptions happening in the underlying functions that are combined together to provide the validation means, so as to provide a uniform/consistent outcome however diverse the underlying functions are (they can return booleans or raise exceptions, or both). > > In your proposal, if 'is_foo_compliant' raises an exception, it will not be caught by 'assert_valid', therefore the ValidationError will not be raised. So this is not what I want as an application developer. Ah, OK. But nothing in your proposal for a new statement suggests you wanted that, and assert doesn't work like that, so I hadn't realised that's what you were after. You could of course simply do: def assert_valid(expr, help_msg): # Catch exceptions in expr() as you see fit if not expr(): raise ValidationError(help_msg) assert_valid(lambda: 0 <= surf < 10000 and is_foo_compliant(surf), help_msg="surface should be 0= References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: Perhaps the OP can look into Python macro libraries to get the wanted syntax? https://github.com/lihaoyi/macropy On Tue, Jan 16, 2018 at 2:35 PM, Paul Moore wrote: > On 16 January 2018 at 17:36, Sylvain MARIE > wrote: > > (trying with direct reply this time) > > > >> Why do you do this? What's the requirement for delaying evaluation of > the condition? > > > > Thanks for challenging my poorly chosen examples :) > > > > The primary requirement is about *catching* unwanted/uncontrolled/heterogenous > exceptions happening in the underlying functions that are combined together > to provide the validation means, so as to provide a uniform/consistent > outcome however diverse the underlying functions are (they can return > booleans or raise exceptions, or both). > > > > In your proposal, if 'is_foo_compliant' raises an exception, it will not > be caught by 'assert_valid', therefore the ValidationError will not be > raised. So this is not what I want as an application developer. > > Ah, OK. But nothing in your proposal for a new statement suggests you > wanted that, and assert doesn't work like that, so I hadn't realised > that's what you were after. > > You could of course simply do: > > def assert_valid(expr, help_msg): > # Catch exceptions in expr() as you see fit > if not expr(): > raise ValidationError(help_msg) > > assert_valid(lambda: 0 <= surf < 10000 and is_foo_compliant(surf), > help_msg="surface should be 0= > No need for a whole expression language :-) > > Paul > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Juancarlo *A?ez* -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Tue Jan 16 20:42:47 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 17 Jan 2018 12:42:47 +1100 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: <20180117014247.GH1982@ando.pearwood.info> On Tue, Jan 16, 2018 at 07:37:29AM -0800, smarie wrote: [...] > > The problem with a statement called "validate" is that it will break a > > huge number of programs that already include functions and methods using > > that name. > > > > You definitely make a point here. But that would be the case for absolutely > *any* language evolution as soon as the proposed statements are plain old > english words. Should it be a show-stopper ? I dont think so. It is not a show-stopper, but it is a very, very larger barrier to adding new keywords. If there is a solution to the problem that doesn't require a new keyword, that is almost always preferred over breaking people's code when they upgrade. > > But apart from the use of a keyword, we already have a way to do almost > > exactly what you want: > > > > if not expression: raise ValidationError(message) > > > > after defining some appropriate ValidationError class. And it is only a > > few key presses longer than the proposed: > > > > validate expression, ValidationError, message > > > > This is precisely what is not good in my opinion: here you do not separate > from . Of course if means> is just a "x > 0" statement, it works, but now what if you rely on a > 3d-party provided validation function (or even yours) such as e.g. > "is_foo_compliant" ? There's no need to invent a third-party validation function. It might be my own validation function, or it might be a simple statement like: if variable is None: ... which can fail with NameError if "variable" is not defined. Or "x > 0" can fail if x is not a number. Regardless of what the validation check does, there are two ways it can not pass: - the check can fail; - or the check can raise an exception. The second generally means that the check code itself is buggy or incomplete, which is why unittest reports these categories separately. That is a good thing, not a problem to be fixed. For example: # if x < 0: raise ValueError('x must not be negative') validate x >= 0, ValueError, 'x must not be negative' Somehow my code passes a string to this as x. Wouldn't you, the developer, want to know that there is a code path that somehow results in x being a string? I know I would. Maybe that will become obvious later on, but it is best to determine errors as close to their source as we can. With your proposed validate keyword, the interpreter lies to me: it says that the check x >= 0 *fails* rather than raises, which implies that x is a negative number. Now I waste my time trying to debug how x could possibly be a negative number when the symptom is actually very different (x is a string). Hiding the exception is normally a bad thing, but if I really want to do that, I can write a helper function: def is_larger_or_equal(x, y): try: return x >= y except: return False If I find myself writing lots of such helper functions, that's probably a hint that I am hiding too much information. Bare excepts have been called the most diabolic Python anti-pattern: https://realpython.com/blog/python/the-most-diabolical-python-antipattern/ so hiding exceptions *by default* (as your proposed validate statement would do) is probably not a great idea. The bottom line is, if my check raises an exception instead of passing or failing, I want to know that it raised. I don't want the error to be hidden as a failed check. > if not is_foo_compliant(x): raise ValidationError(message) > > What if this third part method raises an exception instead of returning > False in some cases ? Great! I would hope it did raise an exception if it were passed something that it wasn't expecting and can't deal with. There may be some cases where I want a validation function to ignore all errors, but if so, I will handle them individually with a wrapper function, which let's me decide how to handle individual errors: def my_foo_compliant(x): try: return is_foo_compliant(x) except SpamError, EggsError: return True except CheeseError: return False except: raise But I can count the number of times I've done that in practice on the fingers of one hand. [...] > The goal is really to let developers express their applicative intent > =(what should be checked and what is the outcome if anything goes wrong), > and give them confidence that the statement will always fail the same way, > whatever the failure modes /behaviour of the checkers used in the > statement. I don't agree that this is a useful goal for the Python interpreter to support as a keyword or built-in function. If you want to create your own library to do this, I wish you good luck, but I would not use it and I honestly think that it is a trap: something that seems to be convenient and useful but actually makes maintaining code harder by hiding unexpected, unhandled cases as if they were expected failures. -- Steve From turnbull.stephen.fw at u.tsukuba.ac.jp Wed Jan 17 00:30:40 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 17 Jan 2018 14:30:40 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <8c3ef8ba-1638-3505-2bf3-dbc057335813@gmail.com> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> <8c3ef8ba-1638-3505-2bf3-dbc057335813@gmail.com> Message-ID: <23134.57216.230565.397549@turnbull.sk.tsukuba.ac.jp> Soni L. writes: > This is surprising to me because I always took those encodings to > have those fallbacks [to raw control characters]. ISO-8859-1 implementations do, for historical reasons AFAICT. And they frequently produce mojibake and occasionally wilder behavior. Most legacy encodings don't, and their standards documents frequently leave the behavior undefined for control character codes (which means you can error on them) and define use of unassigned codes as an error. > It's pretty wild to think someone wouldn't want them. In what context? WHAT-WG's encoding standard is *all about browsers*. If a codec is feeding text into a process that renders them all as glyphs for a human to look at, that's one thing. The codec doesn't want to fatal there, and the likely fallback glyph is something from the control glyphs block if even windows-125x doesn't have a glyph there. I guess it sort of makes sense. If you're feeding a program (as with JSON data, which I believe is "supposed" to be UTF-8, but many developers use the legacy charsets they're used to and which are often embedded in the underlying databases etc, ditto XML), the codec has no idea when or how that's going to get interpreted. In one application I've maintained, an editor, it has to deal with whatever characters are sent to it, but we preferred to take charset designations seriously because users were able to flexibly change those if they wanted to, so the error handler is some form of replacement with a human-readable representation (not pass-through), except for the usual HT, CR, LF, FF, and DEL (and ESC in encodings using ISO 2022 extensions). Mostly users would use the editor to remove or replace invalid codes, although of course they could just leave them in (and they would be converted from display form to the original codes on output). In another, a mailing list manager, codes outside the defined repertoires were a recurring nightmare that crashed server processes and blocked queues. It took a decade before we sealed the last known "leak" and I am not confident there are no leaks left. So I don't actually have experience of a use case for control character pass-through, and I wouldn't even automate the superset substitutions if I could avoid it. (In the editor case, I would provide a dialog saying "This is supposed to be iso-8859-1, but I'm seeing C1 control codes. Would you like me to try windows-1252, which uses those codes for graphic characters?") So to my mind, the use case here is relatively restricted (writing user display interfaces) and does not need to be in the stdlib, and would constitute an attractive nuisance there (developers would say "these users will stop complaining about inability to process their dirty data if I use a WHAT-WG version of a codec, then they don't have to clean up"). I don't have an objection to supporting even that use case, but I don't see why that support needs to be available in the stdlib. From turnbull.stephen.fw at u.tsukuba.ac.jp Wed Jan 17 00:37:18 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 17 Jan 2018 14:37:18 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <1515770581.2892134.1233236400.2694A704@webmail.messagingengine.com> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> <1515770581.2892134.1233236400.2694A704@webmail.messagingengine.com> Message-ID: <23134.57614.704002.685527@turnbull.sk.tsukuba.ac.jp> Random832 writes: > There are plenty of standard encodings that do have actual > representations of the control characters. My complaint was not about coded character sets that don't conform to ISO 2022's conventions about control vs. graphic blocks, especially in the C1 block. It was about promoting *unassigned* codes to the Unicode scalars with the same integer values. These codes don't correspond to characters. They are undefined as far as codecs are concerned. In the case of windows-125x charsets, even though they are IANA registered, Microsoft reserves the right to change and even ignore the published repertoire without updating it. There I think it's reasonable to use WHAT-WG graphic character repertoires even in Python's stdlib codecs, and I wouldn't be surprised if Microsoft was willing to delegate definition of those repertoires to the WG in the end. From fakedme+py at gmail.com Wed Jan 17 06:52:29 2018 From: fakedme+py at gmail.com (Soni L.) Date: Wed, 17 Jan 2018 09:52:29 -0200 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <23134.57216.230565.397549@turnbull.sk.tsukuba.ac.jp> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> <8c3ef8ba-1638-3505-2bf3-dbc057335813@gmail.com> <23134.57216.230565.397549@turnbull.sk.tsukuba.ac.jp> Message-ID: On 2018-01-17 03:30 AM, Stephen J. Turnbull wrote: > Soni L. writes: > > > This is surprising to me because I always took those encodings to > > have those fallbacks [to raw control characters]. > > ISO-8859-1 implementations do, for historical reasons AFAICT. And > they frequently produce mojibake and occasionally wilder behavior. > Most legacy encodings don't, and their standards documents frequently > leave the behavior undefined for control character codes (which means > you can error on them) and define use of unassigned codes as an error. > > > It's pretty wild to think someone wouldn't want them. > > In what context? WHAT-WG's encoding standard is *all about browsers*. > If a codec is feeding text into a process that renders them all as > glyphs for a human to look at, that's one thing. The codec doesn't > want to fatal there, and the likely fallback glyph is something from > the control glyphs block if even windows-125x doesn't have a glyph > there. I guess it sort of makes sense. > > If you're feeding a program (as with JSON data, which I believe is > "supposed" to be UTF-8, but many developers use the legacy charsets > they're used to and which are often embedded in the underlying > databases etc, ditto XML), the codec has no idea when or how that's > going to get interpreted. In one application I've maintained, an > editor, it has to deal with whatever characters are sent to it, but we > preferred to take charset designations seriously because users were > able to flexibly change those if they wanted to, so the error handler > is some form of replacement with a human-readable representation (not > pass-through), except for the usual HT, CR, LF, FF, and DEL (and ESC > in encodings using ISO 2022 extensions). Mostly users would use the > editor to remove or replace invalid codes, although of course they > could just leave them in (and they would be converted from display > form to the original codes on output). > > In another, a mailing list manager, codes outside the defined > repertoires were a recurring nightmare that crashed server processes > and blocked queues. It took a decade before we sealed the last known > "leak" and I am not confident there are no leaks left. > > So I don't actually have experience of a use case for control > character pass-through, and I wouldn't even automate the superset > substitutions if I could avoid it. (In the editor case, I would > provide a dialog saying "This is supposed to be iso-8859-1, but I'm > seeing C1 control codes. Would you like me to try windows-1252, which > uses those codes for graphic characters?") > > So to my mind, the use case here is relatively restricted (writing > user display interfaces) and does not need to be in the stdlib, and > would constitute an attractive nuisance there (developers would say > "these users will stop complaining about inability to process their > dirty data if I use a WHAT-WG version of a codec, then they don't have > to clean up"). I don't have an objection to supporting even that use > case, but I don't see why that support needs to be available in the > stdlib. > We use control characters as formatting/control characters on IRC all the time. ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, IIRC. Windows codepages implicitly define control characters in that range, but they're still technically defined. It's a de-facto standard for those encodings. I think python should follow the (de-facto) standard. This is it. From sylvain.marie at schneider-electric.com Tue Jan 16 12:36:39 2018 From: sylvain.marie at schneider-electric.com (Sylvain MARIE) Date: Tue, 16 Jan 2018 17:36:39 +0000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: (trying with direct reply this time) > Why do you do this? What's the requirement for delaying evaluation of the condition? Thanks for challenging my poorly chosen examples :) The primary requirement is about *catching* unwanted/uncontrolled/heterogenous exceptions happening in the underlying functions that are combined together to provide the validation means, so as to provide a uniform/consistent outcome however diverse the underlying functions are (they can return booleans or raise exceptions, or both). In your proposal, if 'is_foo_compliant' raises an exception, it will not be caught by 'assert_valid', therefore the ValidationError will not be raised. So this is not what I want as an application developer. -- Sylvain -----Message d'origine----- De?: Paul Moore [mailto:p.f.moore at gmail.com] Envoy??: mardi 16 janvier 2018 18:01 ??: Sylvain MARIE Cc?: Python-Ideas Objet?: Re: [Python-ideas] Repurpose `assert' into a general-purpose check I fixed the reply-to this time, looks like you're still getting messed up by Google Groups. On 16 January 2018 at 16:25, smarie wrote: > Let's consider this example where users want to define on-the-fly one > of the validation functions, and combine it with another with a 'or': > > assert_valid('surface', surf, or_(lambda x: (x >= 0) & (x < > 10000), is_foo_compliant), help_msg="surface should be 0= foo compliant") > > How ugly for something so simple ! I tried to make it slightly more > compact by developping a mini lambda syntax but it obviously makes it slower. Why do you do this? What's the requirement for delaying evaluation of the condition? A validate statement in Python wouldn't be any better able to do that, so it'd be just as ugly with a statement. There's no reason I can see why I'd ever need delayed evaluation, so what's wrong with just assert_valid(0 <= surf < 10000 and is_foo_compliant(surf), help_msg="surface should be 0= There are three reasons why having a 'validate' statement would > improve > this: > > * no more parenthesis: more elegant and readable > * inline use of python (1): no more use of lambda or mini_lambda, no > performance overhead > * inline use of python (2): composition would not require custom > function composition operators such as 'or_' (above) or mini-lambda > composition anymore, it could be built-in in any language element used > after > > resulting in > > validate (surf >= 0) & (surf < 10000) or > is_foo_compliant(surf), "surface should be 0= > (I removed the variable name alias 'surface' since I don't know if it > should remain or not) > > Elegant, isn't it ? No more so than my function version, but yes far more so than yours... Paul ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. ______________________________________________________________________ From nikolasrvanderhoof at gmail.com Wed Jan 17 12:19:51 2018 From: nikolasrvanderhoof at gmail.com (Nikolas Vanderhoof) Date: Wed, 17 Jan 2018 12:19:51 -0500 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: I think having a means for such validations separate from assertions would be helpful. However, I agree with Steven that 'validate' would be a bad keyword choice. Besides breaking compatibility with programs that use 'validate', it would break wsgiref.validate in the standard library. ? On Tue, Jan 16, 2018 at 2:22 PM, Juancarlo A?ez wrote: > Perhaps the OP can look into Python macro libraries to get the wanted > syntax? > > https://github.com/lihaoyi/macropy > > On Tue, Jan 16, 2018 at 2:35 PM, Paul Moore wrote: > >> On 16 January 2018 at 17:36, Sylvain MARIE >> wrote: >> > (trying with direct reply this time) >> > >> >> Why do you do this? What's the requirement for delaying evaluation of >> the condition? >> > >> > Thanks for challenging my poorly chosen examples :) >> > >> > The primary requirement is about *catching* >> unwanted/uncontrolled/heterogenous exceptions happening in the >> underlying functions that are combined together to provide the validation >> means, so as to provide a uniform/consistent outcome however diverse the >> underlying functions are (they can return booleans or raise exceptions, or >> both). >> > >> > In your proposal, if 'is_foo_compliant' raises an exception, it will >> not be caught by 'assert_valid', therefore the ValidationError will not be >> raised. So this is not what I want as an application developer. >> >> Ah, OK. But nothing in your proposal for a new statement suggests you >> wanted that, and assert doesn't work like that, so I hadn't realised >> that's what you were after. >> >> You could of course simply do: >> >> def assert_valid(expr, help_msg): >> # Catch exceptions in expr() as you see fit >> if not expr(): >> raise ValidationError(help_msg) >> >> assert_valid(lambda: 0 <= surf < 10000 and is_foo_compliant(surf), >> help_msg="surface should be 0=> >> No need for a whole expression language :-) >> >> Paul >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > > -- > Juancarlo *A?ez* > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Jan 17 12:58:41 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 17 Jan 2018 09:58:41 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings Message-ID: On Tue, Jan 16, 2018 at 9:30 PM, Stephen J. Turnbull < turnbull.stephen.fw at u.tsukuba.ac.jp> wrote: > In what context? WHAT-WG's encoding standard is *all about browsers*. > If a codec is feeding text into a process that renders them all as > glyphs for a human to look at, that's one thing. The codec doesn't > want to fatal there, and the likely fallback glyph is something from > the control glyphs block if even windows-125x doesn't have a glyph > there. I guess it sort of makes sense. > sure it does -- and python is not a browser, and python itself has nothigni visual -- but we sure want to be abel to write code that produces visual representations of maybe messy text... if you're feeding a program ... > the codec has no idea when or how that's > going to get interpreted. sure -- which is why others have suggested that if WATWG is supported, then it *should* only be used for encoding, not encoding. But we are supposed to be consenting adults here -- I see no reason to prevent encoding -- maybe it would be useful for testing??? (as with JSON data, which I believe is > "supposed" to be UTF-8, but many developers use the legacy charsets > they're used to and which are often embedded in the underlying > databases etc, ditto XML), OK -- if developers do the wrong thing, then they do the wrong thing -- we can't prevent that! And Python's lovely "text is unicode" model actually makes that hard to do wong. But we do need a way to decode messy text, and then send it off to JSON or whatever properly encoded. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From rspeer at luminoso.com Wed Jan 17 13:13:55 2018 From: rspeer at luminoso.com (Rob Speer) Date: Wed, 17 Jan 2018 18:13:55 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: I'm going to push back on the idea that this should only be used for decoding, not encoding. The use case I started with -- showing people how to fix mojibake using Python -- would *only* use these codecs in the encoding direction. To fix the most common case of mojibake, you encode it as web-1252 and decode it as UTF-8 (because you got the data from someone who did the opposite). I have implemented some decode-only codecs (such as CESU-8), for exactly the reason of "why would you want more text in this encoding", but the situation is different here. On Wed, 17 Jan 2018 at 13:00 Chris Barker wrote: > On Tue, Jan 16, 2018 at 9:30 PM, Stephen J. Turnbull < > turnbull.stephen.fw at u.tsukuba.ac.jp> wrote: > >> In what context? WHAT-WG's encoding standard is *all about browsers*. >> If a codec is feeding text into a process that renders them all as >> glyphs for a human to look at, that's one thing. The codec doesn't >> want to fatal there, and the likely fallback glyph is something from >> the control glyphs block if even windows-125x doesn't have a glyph >> there. I guess it sort of makes sense. >> > > sure it does -- and python is not a browser, and python itself has > nothigni visual -- but we sure want to be abel to write code that produces > visual representations of maybe messy text... > > if you're feeding a program > > ... > >> the codec has no idea when or how that's >> going to get interpreted. > > > sure -- which is why others have suggested that if WATWG is supported, > then it *should* only be used for encoding, not encoding. But we are > supposed to be consenting adults here -- I see no reason to prevent > encoding -- maybe it would be useful for testing??? > > (as with JSON data, which I believe is >> "supposed" to be UTF-8, but many developers use the legacy charsets >> they're used to and which are often embedded in the underlying >> databases etc, ditto XML), > > > OK -- if developers do the wrong thing, then they do the wrong thing -- we > can't prevent that! > > And Python's lovely "text is unicode" model actually makes that hard to do > wong. But we do need a way to decode messy text, and then send it off to > JSON or whatever properly encoded. > > -CHB > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Wed Jan 17 16:46:06 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 18 Jan 2018 08:46:06 +1100 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <20180116102320.GF1982@ando.pearwood.info> Message-ID: <20180117214605.GB22500@ando.pearwood.info> On Wed, Jan 17, 2018 at 12:19:51PM -0500, Nikolas Vanderhoof wrote: > I think having a means for such validations separate from assertions would > be helpful. What semantics would this "validate" statement have, and how would it be different from what we can write now? if not condition: raise SomeException(message) validate condition, SomeException, message # or some other name Unless it does something better than a simple "if ... raise", there's not much point in adding a keyword just to save a few keystrokes. To justify a keyword, it needs to do something special that a built-in function can't do, like delayed evaluation (without wrapping the expression in a function). -- Steve From njs at pobox.com Wed Jan 17 18:30:41 2018 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 17 Jan 2018 15:30:41 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: On Wed, Jan 17, 2018 at 10:13 AM, Rob Speer wrote: > I'm going to push back on the idea that this should only be used for > decoding, not encoding. > > The use case I started with -- showing people how to fix mojibake using > Python -- would *only* use these codecs in the encoding direction. To fix > the most common case of mojibake, you encode it as web-1252 and decode it as > UTF-8 (because you got the data from someone who did the opposite). It's also nice to be able to parse some HTML data, make a few changes in memory, and then serialize it back to HTML. Having this crash on random documents is rather irritating, esp. if these documents are standards-compliant HTML as in this case. -n -- Nathaniel J. Smith -- https://vorpus.org From ncoghlan at gmail.com Thu Jan 18 00:41:17 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 18 Jan 2018 15:41:17 +1000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <20180117214605.GB22500@ando.pearwood.info> References: <20180116102320.GF1982@ando.pearwood.info> <20180117214605.GB22500@ando.pearwood.info> Message-ID: On 18 January 2018 at 07:46, Steven D'Aprano wrote: > To justify a keyword, it needs to do something special that a built-in > function can't do, like delayed evaluation (without wrapping the > expression in a function). My reaction to these threads for a while has been "We should just add a function for unconditional assertions in expression form", and I finally got around to posting that to the issue tracker rather than leaving it solely in mailing list posts: https://bugs.python.org/issue32590 The gist of the idea is to add a new ensure() builtin along the lines of: class ValidationError(AssertionError): pass _MISSING = object() def ensure(condition, msg=_MISSING, exc_type=ValidationError): if not condition: if msg is _MISSING: msg = condition raise exc_type(msg) There's no need to involve the compiler if you're never going to optimise the code out, and code-rewriters like the one in pytest can be taught to recognise "ensure(condition)" as being comparable to an assert statement. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From gadgetsteve at live.co.uk Thu Jan 18 00:21:04 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Thu, 18 Jan 2018 05:21:04 +0000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: On 17/01/2018 17:19, Nikolas Vanderhoof wrote: > I think having a means for such validations separate from assertions > would be helpful. > However, I agree with Steven that 'validate' would be a bad keyword choice. > Besides breaking compatibility with programs that use 'validate', it > would break > wsgiref.validate > > in the standard library. > > ? To me it looks like this discussion has basically split into two separate use cases: 1. Using assert in a way that it will not (ever) get turned off. 2. The specific case of ensuring that a variable/parameter is an instance of a specific type. and I would like to suggest two separate possibly syntaxes that might make sense. 1. For asserts that should not be disabled we could have an always qualifier optionally added to assert, either as "assert condition exception always" or "assert always condition exception", that disables the optimisation for that specific exception. This would make it clearer that the developer needs this specific check always. Alternatively, we could consider a scoped flag, say keep_asserts, that sets the same. 2. For the specific, and to me more desirable, use case of ensuring type compliance how about an ensure keyword, (or possibly function), with a syntax of "ensure var type" or "ensure(var, type)" which goes a little further by attempting to convert the type of var to type and only if var cannot be converted raises a type exception. This second syntax could, of course, be implemented as a library function rather than a change to python itself. Either option could have an optional exception to raise, with the default being a type error. -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From rosuav at gmail.com Thu Jan 18 01:22:06 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 18 Jan 2018 17:22:06 +1100 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <6bde8d06-08cc-4724-95d9-e00547692d96@googlegroups.com> <20180116102320.GF1982@ando.pearwood.info> Message-ID: On Thu, Jan 18, 2018 at 4:21 PM, Steve Barnes wrote: > 1. For asserts that should not be disabled we could have an always > qualifier optionally added to assert, either as "assert condition > exception always" or "assert always condition exception", that disables > the optimisation for that specific exception. This would make it clearer > that the developer needs this specific check always. Alternatively, we > could consider a scoped flag, say keep_asserts, that sets the same. But if they're never to be compiled out, why do they need special syntax? assert always x >= 0, "x must be positive" can become if x < 0: raise ValueError("x must be positive") I haven't yet seen any justification for syntax here. The nearest I've seen is that this "ensure" action is more like: try: cond = x >= 0 except BaseException: raise AssertionError("x must be positive") else: if not cond: raise AssertionError("x must be positive") Which, IMO, is a bad idea, and I'm not sure anyone was actually advocating it anyway. ChrisA From steve at pearwood.info Thu Jan 18 01:59:03 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 18 Jan 2018 17:59:03 +1100 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: Message-ID: <20180118065903.GD22500@ando.pearwood.info> On Thu, Jan 18, 2018 at 05:22:06PM +1100, Chris Angelico wrote: > I haven't yet seen any justification for syntax here. The nearest I've > seen is that this "ensure" action is more like: > > try: > cond = x >= 0 > except BaseException: > raise AssertionError("x must be positive") > else: > if not cond: > raise AssertionError("x must be positive") > > Which, IMO, is a bad idea, and I'm not sure anyone was actually > advocating it anyway. My understanding is that Sylvain was advocating for that. -- Steve From ethan at stoneleaf.us Thu Jan 18 02:12:53 2018 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 17 Jan 2018 23:12:53 -0800 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <20180118065903.GD22500@ando.pearwood.info> References: <20180118065903.GD22500@ando.pearwood.info> Message-ID: <5A6048F5.9050207@stoneleaf.us> On 01/17/2018 10:59 PM, Steven D'Aprano wrote: > On Thu, Jan 18, 2018 at 05:22:06PM +1100, Chris Angelico wrote: > >> I haven't yet seen any justification for syntax here. The nearest I've >> seen is that this "ensure" action is more like: >> >> try: >> cond = x >= 0 >> except BaseException: >> raise AssertionError("x must be positive") >> else: >> if not cond: >> raise AssertionError("x must be positive") >> >> Which, IMO, is a bad idea, and I'm not sure anyone was actually >> advocating it anyway. > > My understanding is that Sylvain was advocating for that. Agreed. Which, as has been pointed out, is an incredibly bad idea. -- ~Ethan~ From turnbull.stephen.fw at u.tsukuba.ac.jp Thu Jan 18 11:04:41 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Fri, 19 Jan 2018 01:04:41 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: <23136.50585.860684.619627@turnbull.sk.tsukuba.ac.jp> Nathaniel Smith writes: > It's also nice to be able to parse some HTML data, make a few changes > in memory, and then serialize it back to HTML. Having this crash on > random documents is rather irritating, esp. if these documents are > standards-compliant HTML as in this case. This example doesn't make sense to me. Why would *conformant* HTML crash the codec? Unless you're saying the source is non-conformant and *lied* about the encoding? Then errors=surrogateescape should do what you want here, no? If not, new codecs won't help you---the "crash" is somewhere else. Similarly, Soni's use case of control characters for formatting in an IRC client. If they're C0, then AFAICT all of the ASCII-compatible codecs do pass all of those through.[1] If they're C1, then you've got big trouble because the multibyte encodings will either error due to a malformed character or produce an unintended character (except for UTF-8, where you can encode the character in UTF-8). The windows-* encodings are quite inconsistent about the graphics they put in C1 space as well as where they leave holes, so this is not just application-specific, it's even encoding-specific behavior. The more examples of claimed use cases I see, the more I think most of them are already addressed more safely by Python's existing mechanisms, and the less I see a real need for this in the stdlib, with the single exception that WHAT-WG may be a better authority to follow than Microsoft for windows-* codecs. Footnotes: [1] I don't like that much, I'd rather restrict to the ones that have universally accepted semantics including CR, LF, HT, ESC, BEL, and FF. But passthrough is traditional there, a few more are in somewhat common use, and I'm not crazy enough to break backward compatibility. From random832 at fastmail.com Thu Jan 18 12:32:42 2018 From: random832 at fastmail.com (Random832) Date: Thu, 18 Jan 2018 12:32:42 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <23136.50585.860684.619627@turnbull.sk.tsukuba.ac.jp> References: <23136.50585.860684.619627@turnbull.sk.tsukuba.ac.jp> Message-ID: <1516296762.2008115.1240027520.58132A15@webmail.messagingengine.com> On Thu, Jan 18, 2018, at 11:04, Stephen J. Turnbull wrote: > Nathaniel Smith writes: > > > It's also nice to be able to parse some HTML data, make a few changes > > in memory, and then serialize it back to HTML. Having this crash on > > random documents is rather irritating, esp. if these documents are > > standards-compliant HTML as in this case. > > This example doesn't make sense to me. Why would *conformant* HTML > crash the codec? Unless you're saying the source is non-conformant > and *lied* about the encoding? I think his point is that the WHATWG standard is the one that governs HTML and therefore HTML that uses these encodings (including the C1 characters) are conformant to *that* standard, regardless of their status with regards to anything published by Unicode, and that the new encodings (whatever they are called), including the round-trip for b'\x81' as \u0081, are the ones identified by a statement in an HTML document that it uses windows-1252, and therefore such a statement is not a lie. From turnbull.stephen.fw at u.tsukuba.ac.jp Thu Jan 18 13:12:17 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Fri, 19 Jan 2018 03:12:17 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> <8c3ef8ba-1638-3505-2bf3-dbc057335813@gmail.com> <23134.57216.230565.397549@turnbull.sk.tsukuba.ac.jp> Message-ID: <23136.58241.165838.361907@turnbull.sk.tsukuba.ac.jp> Soni L. writes: > ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, > IIRC. You recall incorrectly. You're probably thinking of RFC 1345. But I've never seen that cited except in the IANA registry. All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429 primary and supplementary control sets as good choices. (Unicode goes so far as to use ISO 6429's names for the supplementary set for C1 code points while explicitly denying them *any* semantics.) But none specifies a default, and as far as I know there is no widespread agreement on what control codes are good for, except for a handful of "whitespace" characters in C0, and a couple of C1 controls that are used by (and reserved to) ISO 2022. In fact, Python ISO-8859 codecs do pass them through (both C0 and C1), and the UTF-8 codec passes through C0 and allows encoding and decoding of C1 code points. On the other hand, the ISO standards forbid use of unassigned graphic code points as characters (graphic or control), and codecs quite reasonably treat unassigned graphic code points as errors. In Python, that practice is extended to the windows-* sets, which seems reasonable to me. But the windows-* encodings do not support C1 controls. Instead the entire right half of the code page is graphic (per Microsoft's IANA registrations), and that, I suppose, is why Python does not allow fallthrough of unassigned code points 0x80-0x9F in windows-* codecs. > I think python should follow the (de-facto) standard. This is it. WHAT-WG encoding isn't a "de facto" standard, it's a published standard by a recognized (though forked) standards body. However, different standards are designed for different contexts, and WHAT-WG's encoding standard is clearly specifically aimed at browsers. It also may be useful for more specialized UI applications such as your IRC client, although IMO that's asking for trouble. Note also that the WHAT-WG standard is in a peculiar limbo between informative and normative. The standard encoding is UTF-8, end-of-story. What we're talking about here is best practices for UIs that are faced with non-conformant "legacy" documents, and want to display something anyway. But Python is a general-purpose programming language, and should cleave to the most generally-accepted, well-defined standards, which are the ISO standards themselves in the case of ISO-defined coded character sets. Aliasing the ISO character sets (and ASCII! oh, my aching RFC 822 header!) to the corresponding windows-* as a *general* practice is pretty abominable, though it makes some sense in the case of browsers. For windows-* character sets, ISTM that the WHAT-WG repertoires of graphic characters are improvements of Microsoft's (assuming that WHAT-WG version their standards). Applications can do what they want, of course, and I'm all for a PyPI package to make it easier to do that, whether by providing additional codecs, additional error handlers, or by post-processing surrogate- escaped bytes. I still don't think the WHAT-WG approach is a good fit for most use cases, nor should it be included in the stdlib. Most of the use cases I've seen proposed so far are well-served by existing Python features like errors='surrogateescape'. Steve From fakedme+py at gmail.com Thu Jan 18 18:21:48 2018 From: fakedme+py at gmail.com (Soni L.) Date: Thu, 18 Jan 2018 21:21:48 -0200 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <23136.58241.165838.361907@turnbull.sk.tsukuba.ac.jp> References: <1515688963.3680880.1232088936.35516731@webmail.messagingengine.com> <23128.28021.311864.446571@turnbull.sk.tsukuba.ac.jp> <8c3ef8ba-1638-3505-2bf3-dbc057335813@gmail.com> <23134.57216.230565.397549@turnbull.sk.tsukuba.ac.jp> <23136.58241.165838.361907@turnbull.sk.tsukuba.ac.jp> Message-ID: <75ab97a9-bdd4-7174-b133-001cf1add9d8@gmail.com> On 2018-01-18 04:12 PM, Stephen J. Turnbull wrote: > Soni L. writes: > > > ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, > > IIRC. > > You recall incorrectly. You're probably thinking of RFC 1345. But > I've never seen that cited except in the IANA registry. > > All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429 > primary and supplementary control sets as good choices. (Unicode goes > so far as to use ISO 6429's names for the supplementary set for C1 > code points while explicitly denying them *any* semantics.) But > none specifies a default, and as far as I know there is no widespread > agreement on what control codes are good for, except for a handful of > "whitespace" characters in C0, and a couple of C1 controls that are > used by (and reserved to) ISO 2022. In fact, Python ISO-8859 codecs > do pass them through (both C0 and C1), and the UTF-8 codec passes > through C0 and allows encoding and decoding of C1 code points. > > On the other hand, the ISO standards forbid use of unassigned graphic > code points as characters (graphic or control), and codecs quite > reasonably treat unassigned graphic code points as errors. In Python, > that practice is extended to the windows-* sets, which seems > reasonable to me. But the windows-* encodings do not support C1 > controls. Instead the entire right half of the code page is graphic > (per Microsoft's IANA registrations), and that, I suppose, is why > Python does not allow fallthrough of unassigned code points 0x80-0x9F > in windows-* codecs. > > > I think python should follow the (de-facto) standard. This is it. > > WHAT-WG encoding isn't a "de facto" standard, it's a published > standard by a recognized (though forked) standards body. However, > different standards are designed for different contexts, and WHAT-WG's > encoding standard is clearly specifically aimed at browsers. It also > may be useful for more specialized UI applications such as your IRC > client, although IMO that's asking for trouble. Note also that the > WHAT-WG standard is in a peculiar limbo between informative and > normative. The standard encoding is UTF-8, end-of-story. What we're > talking about here is best practices for UIs that are faced with > non-conformant "legacy" documents, and want to display something > anyway. > > But Python is a general-purpose programming language, and should > cleave to the most generally-accepted, well-defined standards, which > are the ISO standards themselves in the case of ISO-defined coded > character sets. Aliasing the ISO character sets (and ASCII! oh, my > aching RFC 822 header!) to the corresponding windows-* as a *general* > practice is pretty abominable, though it makes some sense in the case > of browsers. For windows-* character sets, ISTM that the WHAT-WG > repertoires of graphic characters are improvements of Microsoft's > (assuming that WHAT-WG version their standards). > > Applications can do what they want, of course, and I'm all for a PyPI > package to make it easier to do that, whether by providing additional > codecs, additional error handlers, or by post-processing surrogate- > escaped bytes. I still don't think the WHAT-WG approach is a good fit > for most use cases, nor should it be included in the stdlib. Most of > the use cases I've seen proposed so far are well-served by existing > Python features like errors='surrogateescape'. I'm just glad I *always* use bytestrings when dealing with network protocols, I guess. It's the only reasonable option. > > Steve > > From steve at pearwood.info Thu Jan 18 22:39:07 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 19 Jan 2018 14:39:07 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: Message-ID: <20180119033907.GH22500@ando.pearwood.info> On Wed, Jan 10, 2018 at 07:13:39PM +0000, Rob Speer wrote: [...] > Having a pip installable library as the _only_ way to use these encodings > is the status quo that I am very familiar with. It's awkward. To use a > package that registers new codecs, you have to import something from that > package, even if you never call anything from what you imported, and that > makes flake8 complain. The idea that an encoding name may or may not be > registered, based on what has been imported, breaks our intuition about > reading Python code and is very hard to statically analyze. Breaks whose intuition? You don't speak for me on that matter -- while I don't like modules which operate by side-effect on import, I know that they are possible. In the stdlib, we have rlcompleter which operates like that. Whether such a design is good or bad (I think bad), nevertheless registering codecs by side-effect at import time should be an obvious possibility to any reasonably experienced developer. But regardless, I don't think that "the existing codec library has a poor API, and flake8 complains about it" is a good reason for adding the codecs to the stdlib. We don't necessarily add functionality to the stdlib just because existing third-party solutions are awkward to use. Having said that, I'm not actually against adding this, although I lean slightly towards "add". I think the case for adding is unclear, and needs a PEP to discuss the issues fully. I think we've come to a consensus on the following question: - Should we change the behaviour of the existing codecs to match the WHATWG encodings? No. but there are others that do not have a consensus: - Are existing stdlib solutions satisfactory to meet the WHATWG standard? - If not, should the WHATWG encodings be added to the stdlib? - If so, should they be built-in codecs, or should we import a library to register them? - Or use the error handler mechanism? - If codecs, should we offer both encode and decode support, or just decoding? - What about the Unicode best-fit encodings? Regarding that first undecided question, I'm particularly interested to see your response to Stephen Turnbull's statements here: https://mail.python.org/pipermail/python-ideas/2018-January/048628.html > I disagree with calling the WHATWG encodings that are implemented in every > Web browser "non-standard". WHATWG may not have a typical origin story as a > standards organization, but it _is_ the standards organization for the Web. I wonder what the W3C would say about that last statement. > I'm really not interested in best-fit mappings that turn infinity into "8" > and square roots into "v". Making weird mappings like that sounds like a > job for the "unidecode" library, not the stdlib. Frankly, the idea that browsers should ignore the HTML's declared encoding in favour of some other hybrid encoding which never existed outside of broken web pages in order to be called "standards compliant" seems weird if not broken to me. Possibly even more weird than mapping ? to 8 and ? to v. (I really wish the Unicode Consortium would do a better job of explaining the reasoning behind some of their more unintuitive or flat out strange-seeming decisions. But that's a rant for another day.) I know that web browsers aren't quite the same as programming languages, and "Practicality beats purity", but still, "In the face of ambiguity, resist the temptation to guess". The WHATWG standard strikes me as "Do What You Guess I Mean". -- Steve From nikolasrvanderhoof at gmail.com Thu Jan 18 22:51:31 2018 From: nikolasrvanderhoof at gmail.com (Nikolas Vanderhoof) Date: Thu, 18 Jan 2018 22:51:31 -0500 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <5A6048F5.9050207@stoneleaf.us> References: <20180118065903.GD22500@ando.pearwood.info> <5A6048F5.9050207@stoneleaf.us> Message-ID: > > I sometimes wish that Python included a richer set of assertions rather > than just a single `assert` keyword. Something like Eiffel's concept of > pre-conditions, post-conditions and invariants, where each can be > enabled or disabled independently. Has something like this been proposed for Python before? This seems to align more with the intended use of assert that's been pointed out in this thread. In what case though would one want to disable some but not all of these pre, post, or invariant assertions? On Thu, Jan 18, 2018 at 2:12 AM, Ethan Furman wrote: > On 01/17/2018 10:59 PM, Steven D'Aprano wrote: > >> On Thu, Jan 18, 2018 at 05:22:06PM +1100, Chris Angelico wrote: >> >> I haven't yet seen any justification for syntax here. The nearest I've >>> seen is that this "ensure" action is more like: >>> >>> try: >>> cond = x >= 0 >>> except BaseException: >>> raise AssertionError("x must be positive") >>> else: >>> if not cond: >>> raise AssertionError("x must be positive") >>> >>> Which, IMO, is a bad idea, and I'm not sure anyone was actually >>> advocating it anyway. >>> >> >> My understanding is that Sylvain was advocating for that. >> > > Agreed. Which, as has been pointed out, is an incredibly bad idea. > > -- > ~Ethan~ > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Thu Jan 18 22:51:13 2018 From: guido at python.org (Guido van Rossum) Date: Thu, 18 Jan 2018 19:51:13 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <20180119033907.GH22500@ando.pearwood.info> References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: Can someone explain to me why this is such a controversial issue? It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying existing encodings seems wrong -- did the feature request somehow transmogrify into that?) -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Thu Jan 18 23:00:01 2018 From: guido at python.org (Guido van Rossum) Date: Thu, 18 Jan 2018 20:00:01 -0800 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <20180118065903.GD22500@ando.pearwood.info> <5A6048F5.9050207@stoneleaf.us> Message-ID: On Thu, Jan 18, 2018 at 7:51 PM, Nikolas Vanderhoof < nikolasrvanderhoof at gmail.com> wrote: > I sometimes wish that Python included a richer set of assertions rather >> than just a single `assert` keyword. Something like Eiffel's concept of >> pre-conditions, post-conditions and invariants, where each can be >> enabled or disabled independently. > > > Has something like this been proposed for Python before? > This seems to align more with the intended use of assert that's been > pointed out in this thread. > In what case though would one want to disable some but not all of these > pre, post, or invariant assertions? > Oh, many times, starting in the late '90s IIRC (Paul Dubois was a big fan). The problems are twofold: (a) it would require a lot of new keywords or ugly syntax; and (b) there would have to be a way to enable each form separately *per module or package*. Eiffel solves that (AFAIC) through separate compilation -- e.g. a stable version of a library might disable invariants and post-conditions but keep pre-conditions, since those could be violated by less mature application code; or a mature application could disable all checks and link with optimized library binaries that also have disabled all checks. I'm sure other scenarios are also viable. But that solution isn't available in Python, where command line flags apply to *all* modules being imported. (Note: even if you have a solution for (b), getting past (a) isn't so easy. So don't get nerd-sniped by the solution for (b) alone.) -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Thu Jan 18 23:38:03 2018 From: njs at pobox.com (Nathaniel Smith) Date: Thu, 18 Jan 2018 20:38:03 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum wrote: > Can someone explain to me why this is such a controversial issue? I guess practicality versus purity is always controversial :-) > It seems reasonable to me to add new encodings to the stdlib that do the > roundtripping requested in the first message of the thread. As long as they > have new names that seems to fall under "practicality beats purity". > (Modifying existing encodings seems wrong -- did the feature request somehow > transmogrify into that?) Someone did discover that Microsoft's current implementations of the windows-* encodings matches the WHAT-WG spec, rather than the Unicode spec that Microsoft originally wrote. So there is some argument that the Python's existing encodings are simply out of date, and changing them would be a bugfix. (And standards aside, it is surely going to be somewhat error-prone if Python's windows-1252 doesn't match everyone else's implementations of windows-1252.) But yeah, AFAICT the original requesters would be happy either way; they just want it available under some name. -n -- Nathaniel J. Smith -- https://vorpus.org From nikolasrvanderhoof at gmail.com Thu Jan 18 23:50:04 2018 From: nikolasrvanderhoof at gmail.com (Nikolas Vanderhoof) Date: Thu, 18 Jan 2018 23:50:04 -0500 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <20180118065903.GD22500@ando.pearwood.info> <5A6048F5.9050207@stoneleaf.us> Message-ID: Thank you for your explanation! ? On Thu, Jan 18, 2018 at 11:00 PM, Guido van Rossum wrote: > On Thu, Jan 18, 2018 at 7:51 PM, Nikolas Vanderhoof < > nikolasrvanderhoof at gmail.com> wrote: > >> I sometimes wish that Python included a richer set of assertions rather >>> than just a single `assert` keyword. Something like Eiffel's concept of >>> pre-conditions, post-conditions and invariants, where each can be >>> enabled or disabled independently. >> >> >> Has something like this been proposed for Python before? >> This seems to align more with the intended use of assert that's been >> pointed out in this thread. >> In what case though would one want to disable some but not all of these >> pre, post, or invariant assertions? >> > > Oh, many times, starting in the late '90s IIRC (Paul Dubois was a big fan). > > The problems are twofold: (a) it would require a lot of new keywords or > ugly syntax; and (b) there would have to be a way to enable each form > separately *per module or package*. Eiffel solves that (AFAIC) through > separate compilation -- e.g. a stable version of a library might disable > invariants and post-conditions but keep pre-conditions, since those could > be violated by less mature application code; or a mature application could > disable all checks and link with optimized library binaries that also have > disabled all checks. I'm sure other scenarios are also viable. > > But that solution isn't available in Python, where command line flags > apply to *all* modules being imported. > > (Note: even if you have a solution for (b), getting past (a) isn't so > easy. So don't get nerd-sniped by the solution for (b) alone.) > > -- > --Guido van Rossum (python.org/~guido) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Thu Jan 18 23:51:23 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 19 Jan 2018 15:51:23 +1100 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: References: <20180118065903.GD22500@ando.pearwood.info> <5A6048F5.9050207@stoneleaf.us> Message-ID: <20180119045123.GJ22500@ando.pearwood.info> On Thu, Jan 18, 2018 at 08:00:01PM -0800, Guido van Rossum wrote: > On Thu, Jan 18, 2018 at 7:51 PM, Nikolas Vanderhoof < > nikolasrvanderhoof at gmail.com> wrote: > > > I sometimes wish that Python included a richer set of assertions rather > >> than just a single `assert` keyword. Something like Eiffel's concept of > >> pre-conditions, post-conditions and invariants, where each can be > >> enabled or disabled independently. > > > > > > Has something like this been proposed for Python before? > > This seems to align more with the intended use of assert that's been > > pointed out in this thread. > > In what case though would one want to disable some but not all of these > > pre, post, or invariant assertions? > > > > Oh, many times, starting in the late '90s IIRC (Paul Dubois was a big fan). > > The problems are twofold: (a) it would require a lot of new keywords or > ugly syntax; and (b) there would have to be a way to enable each form > separately *per module or package*. Eiffel solves that (AFAIC) through > separate compilation -- e.g. a stable version of a library might disable > invariants and post-conditions but keep pre-conditions, since those could > be violated by less mature application code; or a mature application could > disable all checks and link with optimized library binaries that also have > disabled all checks. I'm sure other scenarios are also viable. > > But that solution isn't available in Python, where command line flags apply > to *all* modules being imported. > > (Note: even if you have a solution for (b), getting past (a) isn't so easy. > So don't get nerd-sniped by the solution for (b) alone.) Indeed. I fear that Python's design will never be a good match for Eiffel's Design By Contract. Nevertheless I still have hope that there could be things we can learn from it. After all, DBC is as much a state of mind as it is syntax. Here's a blast from the past: https://www.python.org/doc/essays/metaclasses/ http://legacy.python.org/doc/essays/metaclasses/Eiffel.py This was my first introduction to the idea of software contracts! -- Steve From me at lorenamesa.com Fri Jan 19 00:18:40 2018 From: me at lorenamesa.com (Lorena Mesa) Date: Fri, 19 Jan 2018 05:18:40 +0000 Subject: [Python-ideas] I Message-ID: -- __________________________________________________________________ *Lorena Mesa* Co-Organizer, PyLadies Chicago Director, Python Software Foundation www.lorenamesa.com @loooorenanicole Pronouns: she/her/hers Say what? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Fri Jan 19 08:30:31 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 19 Jan 2018 14:30:31 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> On 19.01.2018 05:38, Nathaniel Smith wrote: > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum wrote: >> Can someone explain to me why this is such a controversial issue? > > I guess practicality versus purity is always controversial :-) > >> It seems reasonable to me to add new encodings to the stdlib that do the >> roundtripping requested in the first message of the thread. As long as they >> have new names that seems to fall under "practicality beats purity". There are a few issues here: * WHATWG encodings are mostly for decoding content in order to show it in the browser, accepting broken encoding data. Python already has support for this by using one of the available error handlers, or adding new ones to suit the needs. If we'd add the encodings, people will start creating more broken data, since this is what the WHATWG codecs output when encoding Unicode. As discussed, this could be addressed by making the WHATWG codecs decode-only. * The use case seems limited to implementing browsers or headless implementations working like browsers. That's not really general enough to warrant adding lots of new codecs to the stdlib. A PyPI package is better suited for this. * The WHATWG codecs do not only cover simple mapping codecs, but also many multi-byte ones for e.g. Asian languages. I doubt that we'd want to maintain such codecs in the stdlib, since this will increase the download sizes of the installers and also require people knowledgeable about these variants to work on them and fix any issues. Overall, I think either pointing people to error handlers or perhaps adding a new one specifically for the case of dealing with control character mappings would provide a better maintenance / usefulness ratio than adding lots of new legacy codecs to the stdlib. BTW: WHATWG pushes for always using UTF-8 as far as I can tell from their website. >> (Modifying existing encodings seems wrong -- did the feature request somehow >> transmogrify into that?) > > Someone did discover that Microsoft's current implementations of the > windows-* encodings matches the WHAT-WG spec, rather than the Unicode > spec that Microsoft originally wrote. No, MS implements somethings called "best fit encodings" and these are different than what WHATWG uses. Unlike the WHATWG encodings, these are documented as vendor encodings on the Unicode site, which is what we normally use as reference for out stdlib codecs. However, whether these are actually a good idea, is open to discussion as well, since they sometimes go a bit far with "best fit", e.g. mapping the infinity symbol to 8. Again, using the error handles we have for dealing with situations which require non-standard encoding behavior are the better approach: https://docs.python.org/3.7/library/codecs.html#error-handlers Adding new ones is possible as well. > So there is some argument that > the Python's existing encodings are simply out of date, and changing > them would be a bugfix. (And standards aside, it is surely going to be > somewhat error-prone if Python's windows-1252 doesn't match everyone > else's implementations of windows-1252.) But yeah, AFAICT the original > requesters would be happy either way; they just want it available > under some name. The encodings are not out of date. I don't know where you got that impression from. The Windows API WideCharToMultiByte which was quoted in the discussion: https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx unfortunately uses the above mentioned best fit encodings, but this can and should be switched off by specifying the WC_NO_BEST_FIT_CHARS for anything that requires validation or needs to be interoperable: """ For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "?" (infinity) maps to 8 (eight) in some code pages. """ -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From stefan at bytereef.org Fri Jan 19 09:27:53 2018 From: stefan at bytereef.org (Stefan Krah) Date: Fri, 19 Jan 2018 15:27:53 +0100 Subject: [Python-ideas] Official site-packages/test directory Message-ID: <20180119142753.GA4754@bytereef.org> Hello, I wonder if we could get an official site-packages/test directory. Currently it seems to be problematic to distribute tests if they are outside the package directory. Here is a nice overview of the two main layout possibilities: http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html I like the outside-the-package approach, mostly for reasons described very eloquently here: http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it would make sense to have site-packages/foo.py and site-packages/test/test_foo.py. For me, this is the natural layout. Stefan Krah From guido at python.org Fri Jan 19 11:10:57 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 19 Jan 2018 08:10:57 -0800 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: <20180119142753.GA4754@bytereef.org> References: <20180119142753.GA4754@bytereef.org> Message-ID: IIUC another common layout is to have folders named test or tests inside each package. This would avoid requiring any changes to the site-packages layout. On Fri, Jan 19, 2018 at 6:27 AM, Stefan Krah wrote: > > Hello, > > I wonder if we could get an official site-packages/test directory. > Currently > it seems to be problematic to distribute tests if they are outside the > package > directory. Here is a nice overview of the two main layout possibilities: > > http://pytest.readthedocs.io/en/reorganize-docs/new-docs/ > user/directory_structure.html > > > I like the outside-the-package approach, mostly for reasons described very > eloquently here: > > http://python-notes.curiousefficiency.org/en/ > latest/python_concepts/import_traps.html > > > CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so > it > would make sense to have site-packages/foo.py and > site-packages/test/test_foo.py. > > For me, this is the natural layout. > > > > Stefan Krah > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Jan 19 11:20:26 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 19 Jan 2018 08:20:26 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> Message-ID: On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg wrote: > On 19.01.2018 05:38, Nathaniel Smith wrote: > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > wrote: > >> Can someone explain to me why this is such a controversial issue? > > > > I guess practicality versus purity is always controversial :-) > > > >> It seems reasonable to me to add new encodings to the stdlib that do the > >> roundtripping requested in the first message of the thread. As long as > they > >> have new names that seems to fall under "practicality beats purity". > > There are a few issues here: > > * WHATWG encodings are mostly for decoding content in order to > show it in the browser, accepting broken encoding data. > And sometimes Python apps that pull data from the web. > Python already has support for this by using one of the available > error handlers, or adding new ones to suit the needs. > This seems cumbersome though. > If we'd add the encodings, people will start creating more > broken data, since this is what the WHATWG codecs output > when encoding Unicode. > That's FUD. Only apps that specifically use the new WHATWG encodings would be able to consume that data. And surely the practice of web browsers will have a much bigger effect than Python's choice. > As discussed, this could be addressed by making the WHATWG > codecs decode-only. > But that would defeat the point of roundtripping, right? > * The use case seems limited to implementing browsers or headless > implementations working like browsers. > > That's not really general enough to warrant adding lots of > new codecs to the stdlib. A PyPI package is better suited > for this. > Perhaps, but such a package already exists and its author (who surely has read a lot of bug reports from its users) says that this is cumbersome. > * The WHATWG codecs do not only cover simple mapping codecs, > but also many multi-byte ones for e.g. Asian languages. > > I doubt that we'd want to maintain such codecs in the stdlib, > since this will increase the download sizes of the installers > and also require people knowledgeable about these variants > to work on them and fix any issues. > Really? Why is adding a bunch of codecs so much effort? Surely the translation tables contain data that compresses well? And surely we don't need a separate dedicated piece of C code for each new codec? > Overall, I think either pointing people to error handlers > or perhaps adding a new one specifically for the case of > dealing with control character mappings would provide a better > maintenance / usefulness ratio than adding lots of new > legacy codecs to the stdlib. > Wouldn't error handlers be much slower? And to me it seems a new error handler is a much *bigger* deal than some new encodings -- error handlers must work for *all* encodings. > BTW: WHATWG pushes for always using UTF-8 as far as I can tell > from their website. > As does Python. But apparently it will take decades more to get there. > >> (Modifying existing encodings seems wrong -- did the feature request > somehow > >> transmogrify into that?) > > > > Someone did discover that Microsoft's current implementations of the > > windows-* encodings matches the WHAT-WG spec, rather than the Unicode > > spec that Microsoft originally wrote. > > No, MS implements somethings called "best fit encodings" > and these are different than what WHATWG uses. > > Unlike the WHATWG encodings, these are documented as vendor encodings > on the Unicode site, which is what we normally use as reference > for out stdlib codecs. > > However, whether these are actually a good idea, is open to discussion > as well, since they sometimes go a bit far with "best fit", e.g. > mapping the infinity symbol to 8. > > Again, using the error handles we have for dealing with > situations which require non-standard encoding behavior are > the better approach: > > https://docs.python.org/3.7/library/codecs.html#error-handlers > > Adding new ones is possible as well. > > > So there is some argument that > > the Python's existing encodings are simply out of date, and changing > > them would be a bugfix. (And standards aside, it is surely going to be > > somewhat error-prone if Python's windows-1252 doesn't match everyone > > else's implementations of windows-1252.) But yeah, AFAICT the original > > requesters would be happy either way; they just want it available > > under some name. > > The encodings are not out of date. I don't know where you got > that impression from. > > The Windows API WideCharToMultiByte which was quoted in the discussion: > > https://msdn.microsoft.com/en-us/library/windows/desktop/ > dd374130%28v=vs.85%29.aspx > > unfortunately uses the above mentioned best fit encodings, > but this can and should be switched off by specifying the > WC_NO_BEST_FIT_CHARS for anything that requires validation > or needs to be interoperable: > > """ > For strings that require validation, such as file, resource, and user > names, the application should always use the WC_NO_BEST_FIT_CHARS flag. > This flag prevents the function from mapping characters to characters > that appear similar but have very different semantics. In some cases, > the semantic change can be extreme. For example, the symbol for "?" > (infinity) maps to 8 (eight) in some code pages. > """ > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 19 2018) > >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ > >>> Python Database Interfaces ... http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Fri Jan 19 11:23:23 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 19 Jan 2018 16:23:23 +0000 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: Another common approach is to not ship tests as part of your (runtime) package at all - they are in the sdist but not the wheels nor are they deployed with "setup.py install". In my experience, this is the usual approach projects take if they don't have the tests in the package directory. (I don't think I've *ever* seen a project try to install tests except by including them in the package directory...) Paul On 19 January 2018 at 16:10, Guido van Rossum wrote: > IIUC another common layout is to have folders named test or tests inside > each package. This would avoid requiring any changes to the site-packages > layout. > > On Fri, Jan 19, 2018 at 6:27 AM, Stefan Krah wrote: >> >> >> Hello, >> >> I wonder if we could get an official site-packages/test directory. >> Currently >> it seems to be problematic to distribute tests if they are outside the >> package >> directory. Here is a nice overview of the two main layout possibilities: >> >> >> http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html >> >> >> I like the outside-the-package approach, mostly for reasons described very >> eloquently here: >> >> >> http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html >> >> >> CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so >> it >> would make sense to have site-packages/foo.py and >> site-packages/test/test_foo.py. >> >> For me, this is the natural layout. From random832 at fastmail.com Fri Jan 19 11:24:48 2018 From: random832 at fastmail.com (Random832) Date: Fri, 19 Jan 2018 11:24:48 -0500 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> Message-ID: <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote: > > Someone did discover that Microsoft's current implementations of the > > windows-* encodings matches the WHAT-WG spec, rather than the Unicode > > spec that Microsoft originally wrote. > > No, MS implements somethings called "best fit encodings" > and these are different than what WHATWG uses. NO. I made this absolutely clear in my previous message, best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags (the mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction - but is treated as a regular mapping for encoding purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit mapping or in any way different from the rest of the mappings. We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode. > https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx > > unfortunately uses the above mentioned best fit encodings, > but this can and should be switched off by specifying the > WC_NO_BEST_FIT_CHARS for anything that requires validation > or needs to be interoperable: Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing. From wolfgang.maier at biologie.uni-freiburg.de Fri Jan 19 11:30:56 2018 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Fri, 19 Jan 2018 17:30:56 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: <20180119142753.GA4754@bytereef.org> References: <20180119142753.GA4754@bytereef.org> Message-ID: On 01/19/2018 03:27 PM, Stefan Krah wrote: > > Hello, > > I wonder if we could get an official site-packages/test directory. Currently > it seems to be problematic to distribute tests if they are outside the package > directory. Here is a nice overview of the two main layout possibilities: > > http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html > > > I like the outside-the-package approach, mostly for reasons described very > eloquently here: > > http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html > > > CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it > would make sense to have site-packages/foo.py and site-packages/test/test_foo.py. > > For me, this is the natural layout. > I think that's a really nice idea. With an official site-packages/test directory there could be pip support for optionally installing tests alongside a package if its layout allows it. So end users could just install things without tests, but developers could do: pip install --with-tests or something to get everything? Wolfgang From guido at python.org Fri Jan 19 11:48:00 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 19 Jan 2018 08:48:00 -0800 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: On Fri, Jan 19, 2018 at 8:30 AM, Wolfgang Maier < wolfgang.maier at biologie.uni-freiburg.de> wrote: > > I think that's a really nice idea. > With an official site-packages/test directory there could be pip support > for optionally installing tests alongside a package if its layout allows > it. So end users could just install things without tests, but developers > could do: pip install --with-tests or something to get everything? Oh, I just realized there's another problem here. The existing 'test' package (which is not a namespace package) would hide the site-packages/test directory. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Fri Jan 19 11:54:20 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 19 Jan 2018 17:54:20 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> Message-ID: On 19.01.2018 17:20, Guido van Rossum wrote: > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > wrote: > > On 19.01.2018 05:38, Nathaniel Smith wrote: > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > wrote: > >> Can someone explain to me why this is such a controversial issue? > > > > I guess practicality versus purity is always controversial :-) > > > >> It seems reasonable to me to add new encodings to the stdlib that do the > >> roundtripping requested in the first message of the thread. As long as they > >> have new names that seems to fall under "practicality beats purity". > > There are a few issues here: > > * WHATWG encodings are mostly for decoding content in order to > ? show it in the browser, accepting broken encoding data. > > > And sometimes Python apps that pull data from the web. > ? > > ? Python already has support for this by using one of the available > ? error handlers, or adding new ones to suit the needs. > > > This seems cumbersome though. ? Why is that ? Python 3 uses such error handlers for most of the I/O that's done with the OS already and for very similar reasons: dealing with broken data or broken configurations. > ? If we'd add the encodings, people will start creating more > ? broken data, since this is what the WHATWG codecs output > ? when encoding Unicode. > > > That's FUD. Only apps that specifically use the new WHATWG encodings > would be able to consume that data. And surely the practice of web > browsers will have a much bigger effect than Python's choice. ? It's not FUD. I don't think we ought to encourage having Python create more broken data. The purpose of the WHATWG encodings is to help browsers deal with decoding broken data in a uniform way. It's not to generate more such data. That may be regarded as purists view, but also has a very practical meaning. The output of the codecs will only readable by browsers implementing the WHATWG encodings. Other tools receiving the data will run into the same decoding problems. Once you have Unicode, it's better to stay there and use UTF-8 for encoding to avoid any such issues. > ? As discussed, this could be addressed by making the WHATWG > ? codecs decode-only. > > > But that would defeat the point of roundtripping, right? Yes, intentionally. Once you have Unicode, the data should be encoded correctly back into UTF-8 or whatever legacy encoding is needed, fixing any issues while in Unicode. As always, it's better to explicitly address such problems than to simply punt on them and write back broken data. > * The use case seems limited to implementing browsers or headless > ? implementations working like browsers. > > ? That's not really general enough to warrant adding lots of > ? new codecs to the stdlib. A PyPI package is better suited > ? for this. > > > Perhaps, but such a package already exists and its author (who surely > has read a lot of bug reports from its users) says that this is cumbersome. ? The only critique I read was that registering the codecs is not explicit enough, but that's really only a nit, since you can easily have the codec package expose a register function which you then call explicitly in the code using the codecs. > * The WHATWG codecs do not only cover simple mapping codecs, > ? but also many multi-byte ones for e.g. Asian languages. > > ? I doubt that we'd want to maintain such codecs in the stdlib, > ? since this will increase the download sizes of the installers > ? and also require people knowledgeable about these variants > ? to work on them and fix any issues. > > > Really? Why is adding a bunch of codecs so much effort? Surely the > translation tables contain data that compresses well? And surely we > don't need a separate dedicated piece of C code for each new codec? ? For the simple charmap style codecs that's true. Not so for the Asian ones and the latter also do require dedicated C code (see Modules/cjkcodecs). > Overall, I think either pointing people to error handlers > or perhaps adding a new one specifically for the case of > dealing with control character mappings would provide a better > maintenance / usefulness ratio than adding lots of new > legacy codecs to the stdlib. > > > Wouldn't error handlers be much slower? And to me it seems a new error > handler is a much *bigger* deal than some new encodings -- error > handlers must work for *all* encodings. ? Error handlers have a standard interface and so they will work for all codecs. Some codecs limits the number of handlers that can be used, but most accept all registered handlers. If a handler is too slow in Python, it can be coded in C for speed. > BTW: WHATWG pushes for always using UTF-8 as far as I can tell > from their website. > > > As does Python. But apparently it will take decades more to get there. Yes indeed, so let's not add even more confusion by adding more variants of the legacy encodings. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From stefan at bytereef.org Fri Jan 19 12:08:50 2018 From: stefan at bytereef.org (Stefan Krah) Date: Fri, 19 Jan 2018 18:08:50 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: <20180119170850.GA13584@bytereef.org> On Fri, Jan 19, 2018 at 04:23:23PM +0000, Paul Moore wrote: > Another common approach is to not ship tests as part of your (runtime) > package at all - they are in the sdist but not the wheels nor are they > deployed with "setup.py install". In my experience, this is the usual > approach projects take if they don't have the tests in the package > directory. (I don't think I've *ever* seen a project try to install > tests except by including them in the package directory...) Yes, given the current situation not shipping is definitely the best approach in that case. I just thought that if we did have something like site-packages/stest (Guido correctly noted that "test" wouldn't work), people might use it. But it is all very speculative and I'm not really sure myself. Stefan Krah From rspeer at luminoso.com Fri Jan 19 12:12:14 2018 From: rspeer at luminoso.com (Rob Speer) Date: Fri, 19 Jan 2018 17:12:14 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> Message-ID: Error handlers are quite orthogonal to this problem. If you try to solve this problem with an error handler, you will have a different problem. Suppose you made "c1-control-passthrough" or whatever into an error handler, similar to "replace" or "ignore", and then you encounter an unassigned character that's *not* in the range 0x80 to 0x9f. (Many encodings have these.) Do you replace it? Do you ignore it? You don't know because you just replaced the error handler with something that's not about error handling. I will also repeat that having these encodings (in both directions) will provide more ways for Python to *reduce* the amount of mojibake that exists. If acknowledging that mojibake exists offends your sense of purity, and you'd rather just destroy all mojibake at the source... that's great, and please get back to me after you've fixed Microsoft Excel. I hope to make a pull request shortly that implements these mappings as new encodings that work just like the other ones. On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg wrote: > On 19.01.2018 17:20, Guido van Rossum wrote: > > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > > wrote: > > > > On 19.01.2018 05:38, Nathaniel Smith wrote: > > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum < > guido at python.org > wrote: > > >> Can someone explain to me why this is such a controversial issue? > > > > > > I guess practicality versus purity is always controversial :-) > > > > > >> It seems reasonable to me to add new encodings to the stdlib that > do the > > >> roundtripping requested in the first message of the thread. As > long as they > > >> have new names that seems to fall under "practicality beats > purity". > > > > There are a few issues here: > > > > * WHATWG encodings are mostly for decoding content in order to > > show it in the browser, accepting broken encoding data. > > > > > > And sometimes Python apps that pull data from the web. > > > > > > Python already has support for this by using one of the available > > error handlers, or adding new ones to suit the needs. > > > > > > This seems cumbersome though. > > Why is that ? > > Python 3 uses such error handlers for most of the I/O that's done > with the OS already and for very similar reasons: dealing with > broken data or broken configurations. > > > If we'd add the encodings, people will start creating more > > broken data, since this is what the WHATWG codecs output > > when encoding Unicode. > > > > > > That's FUD. Only apps that specifically use the new WHATWG encodings > > would be able to consume that data. And surely the practice of web > > browsers will have a much bigger effect than Python's choice. > > It's not FUD. I don't think we ought to encourage having > Python create more broken data. The purpose of the WHATWG > encodings is to help browsers deal with decoding broken > data in a uniform way. It's not to generate more such data. > > That may be regarded as purists view, but also has a very > practical meaning. The output of the codecs will only readable > by browsers implementing the WHATWG encodings. Other tools > receiving the data will run into the same decoding problems. > > Once you have Unicode, it's better to stay there and use > UTF-8 for encoding to avoid any such issues. > > > As discussed, this could be addressed by making the WHATWG > > codecs decode-only. > > > > > > But that would defeat the point of roundtripping, right? > > Yes, intentionally. Once you have Unicode, the data should > be encoded correctly back into UTF-8 or whatever legacy encoding > is needed, fixing any issues while in Unicode. > > As always, it's better to explicitly address such problems than > to simply punt on them and write back broken data. > > > * The use case seems limited to implementing browsers or headless > > implementations working like browsers. > > > > That's not really general enough to warrant adding lots of > > new codecs to the stdlib. A PyPI package is better suited > > for this. > > > > > > Perhaps, but such a package already exists and its author (who surely > > has read a lot of bug reports from its users) says that this is > cumbersome. > > The only critique I read was that registering the codecs > is not explicit enough, but that's really only a nit, since > you can easily have the codec package expose a register > function which you then call explicitly in the code using > the codecs. > > > * The WHATWG codecs do not only cover simple mapping codecs, > > but also many multi-byte ones for e.g. Asian languages. > > > > I doubt that we'd want to maintain such codecs in the stdlib, > > since this will increase the download sizes of the installers > > and also require people knowledgeable about these variants > > to work on them and fix any issues. > > > > > > Really? Why is adding a bunch of codecs so much effort? Surely the > > translation tables contain data that compresses well? And surely we > > don't need a separate dedicated piece of C code for each new codec? > > For the simple charmap style codecs that's true. Not so for the > Asian ones and the latter also do require dedicated C code (see > Modules/cjkcodecs). > > > Overall, I think either pointing people to error handlers > > or perhaps adding a new one specifically for the case of > > dealing with control character mappings would provide a better > > maintenance / usefulness ratio than adding lots of new > > legacy codecs to the stdlib. > > > > > > Wouldn't error handlers be much slower? And to me it seems a new error > > handler is a much *bigger* deal than some new encodings -- error > > handlers must work for *all* encodings. > > Error handlers have a standard interface and so they will work > for all codecs. Some codecs limits the number of handlers that > can be used, but most accept all registered handlers. > > If a handler is too slow in Python, it can be coded in C for > speed. > > > BTW: WHATWG pushes for always using UTF-8 as far as I can tell > > from their website. > > > > > > As does Python. But apparently it will take decades more to get there. > > Yes indeed, so let's not add even more confusion by adding more > variants of the legacy encodings. > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 19 2018) > >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ > >>> Python Database Interfaces ... http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wolfgang.maier at biologie.uni-freiburg.de Fri Jan 19 11:55:25 2018 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Fri, 19 Jan 2018 17:55:25 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: <84f1288d-a3e7-c269-be8f-4eaaa13d401e@biologie.uni-freiburg.de> On 01/19/2018 05:48 PM, Guido van Rossum wrote: > On Fri, Jan 19, 2018 at 8:30 AM, Wolfgang Maier > > wrote: > > > I think that's a really nice idea. > With an official site-packages/test directory there could be pip > support for optionally installing tests alongside a package if its > layout allows it. So end users could just install things without > tests, but developers could do: pip install --with-tests > or something to get everything? > > > Oh, I just realized there's another problem here. The existing 'test' > package (which is not a namespace package) would hide the > site-packages/test directory. > Well, that shouldn't be a big obstacle since one could just as well choose another name ( __tests__ for example?). Alternatively, package-specific test directories could exist *inside* site-packages. So much like today's .dist-info directories there could be .test dirs? From mal at egenix.com Fri Jan 19 12:17:48 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 19 Jan 2018 18:17:48 +0100 Subject: [Python-ideas] Windows Best Fit Encodings (was: Support WHATWG versions of legacy encodings) In-Reply-To: <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> Message-ID: <7cda4c01-2252-7555-54e6-ccdeb8f07bb0@egenix.com> On 19.01.2018 17:24, Random832 wrote: > On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote: >>> Someone did discover that Microsoft's current implementations of the >>> windows-* encodings matches the WHAT-WG spec, rather than the Unicode >>> spec that Microsoft originally wrote. >> >> No, MS implements somethings called "best fit encodings" >> and these are different than what WHATWG uses. > > NO. I made this absolutely clear in my previous message, best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags (the mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction - but is treated as a regular mapping for encoding purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit mapping or in any way different from the rest of the mappings. > > We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode. I only know the best fit encoding maps that are available on the Unicode site. If I read your comment correctly, you are saying that MS has moved away from the standard code pages towards something else - perhaps even something other than the best fit encodings listed on the Unicode site ? Do you have some references for this ? Note that the Windows code page codecs implemented in Python are all based on the Unicode mapping files and those were created by MS. >> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx >> >> unfortunately uses the above mentioned best fit encodings, >> but this can and should be switched off by specifying the >> WC_NO_BEST_FIT_CHARS for anything that requires validation >> or needs to be interoperable: > > Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing. Interesting. The CP1252 mapping clearly defines 0x80 to map to undefined, whereas the bestfit1252 maps it to 0x0081: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt Same for the example you gave for CP932: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt So at least following the documentation you'd expect the function to implement the regular mappings. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From guido at python.org Fri Jan 19 12:24:27 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 19 Jan 2018 09:24:27 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> Message-ID: OK, I will tune out this conversation. It is clearly not going anywhere. On Fri, Jan 19, 2018 at 9:12 AM, Rob Speer wrote: > Error handlers are quite orthogonal to this problem. If you try to solve > this problem with an error handler, you will have a different problem. > > Suppose you made "c1-control-passthrough" or whatever into an error > handler, similar to "replace" or "ignore", and then you encounter an > unassigned character that's *not* in the range 0x80 to 0x9f. (Many > encodings have these.) Do you replace it? Do you ignore it? You don't know > because you just replaced the error handler with something that's not about > error handling. > > I will also repeat that having these encodings (in both directions) will > provide more ways for Python to *reduce* the amount of mojibake that > exists. If acknowledging that mojibake exists offends your sense of purity, > and you'd rather just destroy all mojibake at the source... that's great, > and please get back to me after you've fixed Microsoft Excel. > > I hope to make a pull request shortly that implements these mappings as > new encodings that work just like the other ones. > > On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg wrote: > >> On 19.01.2018 17:20, Guido van Rossum wrote: >> > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > > > wrote: >> > >> > On 19.01.2018 05:38, Nathaniel Smith wrote: >> > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum < >> guido at python.org > wrote: >> > >> Can someone explain to me why this is such a controversial issue? >> > > >> > > I guess practicality versus purity is always controversial :-) >> > > >> > >> It seems reasonable to me to add new encodings to the stdlib >> that do the >> > >> roundtripping requested in the first message of the thread. As >> long as they >> > >> have new names that seems to fall under "practicality beats >> purity". >> > >> > There are a few issues here: >> > >> > * WHATWG encodings are mostly for decoding content in order to >> > show it in the browser, accepting broken encoding data. >> > >> > >> > And sometimes Python apps that pull data from the web. >> > >> > >> > Python already has support for this by using one of the available >> > error handlers, or adding new ones to suit the needs. >> > >> > >> > This seems cumbersome though. >> >> Why is that ? >> >> Python 3 uses such error handlers for most of the I/O that's done >> with the OS already and for very similar reasons: dealing with >> broken data or broken configurations. >> >> > If we'd add the encodings, people will start creating more >> > broken data, since this is what the WHATWG codecs output >> > when encoding Unicode. >> > >> > >> > That's FUD. Only apps that specifically use the new WHATWG encodings >> > would be able to consume that data. And surely the practice of web >> > browsers will have a much bigger effect than Python's choice. >> >> It's not FUD. I don't think we ought to encourage having >> Python create more broken data. The purpose of the WHATWG >> encodings is to help browsers deal with decoding broken >> data in a uniform way. It's not to generate more such data. >> >> That may be regarded as purists view, but also has a very >> practical meaning. The output of the codecs will only readable >> by browsers implementing the WHATWG encodings. Other tools >> receiving the data will run into the same decoding problems. >> >> Once you have Unicode, it's better to stay there and use >> UTF-8 for encoding to avoid any such issues. >> >> > As discussed, this could be addressed by making the WHATWG >> > codecs decode-only. >> > >> > >> > But that would defeat the point of roundtripping, right? >> >> Yes, intentionally. Once you have Unicode, the data should >> be encoded correctly back into UTF-8 or whatever legacy encoding >> is needed, fixing any issues while in Unicode. >> >> As always, it's better to explicitly address such problems than >> to simply punt on them and write back broken data. >> >> > * The use case seems limited to implementing browsers or headless >> > implementations working like browsers. >> > >> > That's not really general enough to warrant adding lots of >> > new codecs to the stdlib. A PyPI package is better suited >> > for this. >> > >> > >> > Perhaps, but such a package already exists and its author (who surely >> > has read a lot of bug reports from its users) says that this is >> cumbersome. >> >> The only critique I read was that registering the codecs >> is not explicit enough, but that's really only a nit, since >> you can easily have the codec package expose a register >> function which you then call explicitly in the code using >> the codecs. >> >> > * The WHATWG codecs do not only cover simple mapping codecs, >> > but also many multi-byte ones for e.g. Asian languages. >> > >> > I doubt that we'd want to maintain such codecs in the stdlib, >> > since this will increase the download sizes of the installers >> > and also require people knowledgeable about these variants >> > to work on them and fix any issues. >> > >> > >> > Really? Why is adding a bunch of codecs so much effort? Surely the >> > translation tables contain data that compresses well? And surely we >> > don't need a separate dedicated piece of C code for each new codec? >> >> For the simple charmap style codecs that's true. Not so for the >> Asian ones and the latter also do require dedicated C code (see >> Modules/cjkcodecs). >> >> > Overall, I think either pointing people to error handlers >> > or perhaps adding a new one specifically for the case of >> > dealing with control character mappings would provide a better >> > maintenance / usefulness ratio than adding lots of new >> > legacy codecs to the stdlib. >> > >> > >> > Wouldn't error handlers be much slower? And to me it seems a new error >> > handler is a much *bigger* deal than some new encodings -- error >> > handlers must work for *all* encodings. >> >> Error handlers have a standard interface and so they will work >> for all codecs. Some codecs limits the number of handlers that >> can be used, but most accept all registered handlers. >> >> If a handler is too slow in Python, it can be coded in C for >> speed. >> >> > BTW: WHATWG pushes for always using UTF-8 as far as I can tell >> > from their website. >> > >> > >> > As does Python. But apparently it will take decades more to get there. >> >> Yes indeed, so let's not add even more confusion by adding more >> variants of the legacy encodings. >> >> -- >> Marc-Andre Lemburg >> eGenix.com >> >> Professional Python Services directly from the Experts (#1, Jan 19 2018) >> >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >> >>> Python Database Interfaces ... http://products.egenix.com/ >> >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ >> ________________________________________________________________________ >> >> ::: We implement business ideas - efficiently in both time and costs ::: >> >> eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 >> >> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg >> Registered at Amtsgericht Duesseldorf: HRB 46611 >> http://www.egenix.com/company/contact/ >> http://www.malemburg.com/ >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Fri Jan 19 12:30:43 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 19 Jan 2018 17:30:43 +0000 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: <20180119170850.GA13584@bytereef.org> References: <20180119142753.GA4754@bytereef.org> <20180119170850.GA13584@bytereef.org> Message-ID: On 19 January 2018 at 17:08, Stefan Krah wrote: > On Fri, Jan 19, 2018 at 04:23:23PM +0000, Paul Moore wrote: >> Another common approach is to not ship tests as part of your (runtime) >> package at all - they are in the sdist but not the wheels nor are they >> deployed with "setup.py install". In my experience, this is the usual >> approach projects take if they don't have the tests in the package >> directory. (I don't think I've *ever* seen a project try to install >> tests except by including them in the package directory...) > > Yes, given the current situation not shipping is definitely the best > approach in that case. > > I just thought that if we did have something like site-packages/stest > (Guido correctly noted that "test" wouldn't work), people might use it. > > > But it is all very speculative and I'm not really sure myself. To be usable, tools like pip, wheel, setuptools, flit, etc, would all need to be updated to take into account this option, as well as the relevant standards (the wheel spec for one). Add to that the changes needed to places like the sysconfig package to allow introspecting the location of the new test directory. Would there be a test directory in user-site as well? What about in virtual environments? (If only in site-packages, then it'll likely be read-only in a lot of environments). Also, would we need to reserve the directory name chosen to prohibit 3rd party packages using it? As we've seen the stdlib test package clashes with the original proposal, who's to say there's nothing on PyPI that uses stest? The idea isn't a bad one in principle - there's a proposal from some time back on distutils-sig that Python packaging support more "target locations" matching the POSIX style locations - for docs, config, etc. A test directory would fit in with this idea. But it's a pretty big change in practice, and no-one has yet done much beyond talk about it. And the proposal would likely have put the test directory *outside* site-packages, which avoids the name clash problem. I'd think that the idea of a site-packages/stest directory would need a much more compelling use case to justify it. Paul PS There's nothing stopping a (distribution) package FOO from installing (Python) packages foo and foo-tests. It's not common, and probably violates people's expectations, but it's not *illegal* (the setuptools distribution installs pkg_resources as well as setuptools, for a well-known example). So in theory, if people wanted this enough, they could have implemented it right now, without needing any change to Python or the packaging ecosystem. From encukou at gmail.com Fri Jan 19 12:34:15 2018 From: encukou at gmail.com (Petr Viktorin) Date: Fri, 19 Jan 2018 18:34:15 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: FWIW, I've had very good experience with putting tests for package `foo` in a directory/package called `test_foo`. This combines the best of both worlds -- it can be easily separated for distribution (like `tests`), and it doesn't cause name conflicts (like `foo.tests`). On 01/19/2018 05:23 PM, Paul Moore wrote: > Another common approach is to not ship tests as part of your (runtime) > package at all - they are in the sdist but not the wheels nor are they > deployed with "setup.py install". In my experience, this is the usual > approach projects take if they don't have the tests in the package > directory. (I don't think I've *ever* seen a project try to install > tests except by including them in the package directory...) > > Paul > > On 19 January 2018 at 16:10, Guido van Rossum wrote: >> IIUC another common layout is to have folders named test or tests inside >> each package. This would avoid requiring any changes to the site-packages >> layout. >> >> On Fri, Jan 19, 2018 at 6:27 AM, Stefan Krah wrote: >>> >>> >>> Hello, >>> >>> I wonder if we could get an official site-packages/test directory. >>> Currently >>> it seems to be problematic to distribute tests if they are outside the >>> package >>> directory. Here is a nice overview of the two main layout possibilities: >>> >>> >>> http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html >>> >>> >>> I like the outside-the-package approach, mostly for reasons described very >>> eloquently here: >>> >>> >>> http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html >>> >>> >>> CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so >>> it >>> would make sense to have site-packages/foo.py and >>> site-packages/test/test_foo.py. >>> >>> For me, this is the natural layout. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > From mal at egenix.com Fri Jan 19 13:13:56 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 19 Jan 2018 19:13:56 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> Message-ID: <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> On 19.01.2018 18:12, Rob Speer wrote: > Error handlers are quite orthogonal to this problem. If you try to solve > this problem with an error handler, you will have a different problem. > > Suppose you made "c1-control-passthrough" or whatever into an error > handler, similar to "replace" or "ignore", and then you encounter an > unassigned character that's *not* in the range 0x80 to 0x9f. (Many > encodings have these.) Do you replace it? Do you ignore it? You don't > know because you just replaced the error handler with something that's > not about error handling. It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach. Based on the context it may also make sense to escape the input data using a different syntax, e.g. XML escapes, backslash notations, HTML numeric entities, etc. You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is. The Unicode error handlers are pretty flexible when it comes to providing a solution: https://www.python.org/dev/peps/pep-0293/ You can even have the handler work "patch" an encoding, since it also gets the encoding name as input. You could probably create an error handler which implements most of their workarounds into a single "whatwg" handler. > I will also repeat that having these encodings (in both directions) will > provide more ways for Python to *reduce* the amount of mojibake that > exists. If acknowledging that mojibake exists offends your sense of > purity, and you'd rather just destroy all mojibake at the source... > that's great, and please get back to me after you've fixed Microsoft Excel. I acknowledge that we have different views on this :-) Note that I'm not saying that the encodings are bad idea, or should not be used. I just don't want to have people start using "web-1252" as encoding simply because they they are writing out text for a web application - they should use "utf-8" instead. The extra hurdle to pip-install a package for this feels like the right way to turn this into a more conscious decision and who knows... perhaps it'll even help fix Excel once they have decided on including Python as scripting language: https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/suggestions/10549005-python-as-an-excel-scripting-language > I hope to make a pull request shortly that implements these mappings as > new encodings that work just like the other ones. > > On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg > wrote: > > On 19.01.2018 17:20, Guido van Rossum wrote: > > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > > >> wrote: > > > >? ? ?On 19.01.2018 05:38, Nathaniel Smith wrote: > >? ? ?> On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > >> wrote: > >? ? ?>> Can someone explain to me why this is such a controversial > issue? > >? ? ?> > >? ? ?> I guess practicality versus purity is always controversial :-) > >? ? ?> > >? ? ?>> It seems reasonable to me to add new encodings to the > stdlib that do the > >? ? ?>> roundtripping requested in the first message of the thread. > As long as they > >? ? ?>> have new names that seems to fall under "practicality beats > purity". > > > >? ? ?There are a few issues here: > > > >? ? ?* WHATWG encodings are mostly for decoding content in order to > >? ? ?? show it in the browser, accepting broken encoding data. > > > > > > And sometimes Python apps that pull data from the web. > > ? > > > >? ? ?? Python already has support for this by using one of the > available > >? ? ?? error handlers, or adding new ones to suit the needs. > > > > > > This seems cumbersome though. > ?? > Why is that ? > > Python 3 uses such error handlers for most of the I/O that's done > with the OS already and for very similar reasons: dealing with > broken data or broken configurations. > > >? ? ?? If we'd add the encodings, people will start creating more > >? ? ?? broken data, since this is what the WHATWG codecs output > >? ? ?? when encoding Unicode. > > > > > > That's FUD. Only apps that specifically use the new WHATWG encodings > > would be able to consume that data. And surely the practice of web > > browsers will have a much bigger effect than Python's choice. > ?? > It's not FUD. I don't think we ought to encourage having > Python create more broken data. The purpose of the WHATWG > encodings is to help browsers deal with decoding broken > data in a uniform way. It's not to generate more such data. > > That may be regarded as purists view, but also has a very > practical meaning. The output of the codecs will only readable > by browsers implementing the WHATWG encodings. Other tools > receiving the data will run into the same decoding problems. > > Once you have Unicode, it's better to stay there and use > UTF-8 for encoding to avoid any such issues. > > >? ? ?? As discussed, this could be addressed by making the WHATWG > >? ? ?? codecs decode-only. > > > > > > But that would defeat the point of roundtripping, right? > > Yes, intentionally. Once you have Unicode, the data should > be encoded correctly back into UTF-8 or whatever legacy encoding > is needed, fixing any issues while in Unicode. > > As always, it's better to explicitly address such problems than > to simply punt on them and write back broken data. > > >? ? ?* The use case seems limited to implementing browsers or headless > >? ? ?? implementations working like browsers. > > > >? ? ?? That's not really general enough to warrant adding lots of > >? ? ?? new codecs to the stdlib. A PyPI package is better suited > >? ? ?? for this. > > > > > > Perhaps, but such a package already exists and its author (who surely > > has read a lot of bug reports from its users) says that this is > cumbersome. > ?? > The only critique I read was that registering the codecs > is not explicit enough, but that's really only a nit, since > you can easily have the codec package expose a register > function which you then call explicitly in the code using > the codecs. > > >? ? ?* The WHATWG codecs do not only cover simple mapping codecs, > >? ? ?? but also many multi-byte ones for e.g. Asian languages. > > > >? ? ?? I doubt that we'd want to maintain such codecs in the stdlib, > >? ? ?? since this will increase the download sizes of the installers > >? ? ?? and also require people knowledgeable about these variants > >? ? ?? to work on them and fix any issues. > > > > > > Really? Why is adding a bunch of codecs so much effort? Surely the > > translation tables contain data that compresses well? And surely we > > don't need a separate dedicated piece of C code for each new codec? > ?? > For the simple charmap style codecs that's true. Not so for the > Asian ones and the latter also do require dedicated C code (see > Modules/cjkcodecs). > > >? ? ?Overall, I think either pointing people to error handlers > >? ? ?or perhaps adding a new one specifically for the case of > >? ? ?dealing with control character mappings would provide a better > >? ? ?maintenance / usefulness ratio than adding lots of new > >? ? ?legacy codecs to the stdlib. > > > > > > Wouldn't error handlers be much slower? And to me it seems a new error > > handler is a much *bigger* deal than some new encodings -- error > > handlers must work for *all* encodings. > ?? > Error handlers have a standard interface and so they will work > for all codecs. Some codecs limits the number of handlers that > can be used, but most accept all registered handlers. > > If a handler is too slow in Python, it can be coded in C for > speed. > > >? ? ?BTW: WHATWG pushes for always using UTF-8 as far as I can tell > >? ? ?from their website. > > > > > > As does Python. But apparently it will take decades more to get there. > > Yes indeed, so let's not add even more confusion by adding more > variants of the legacy encodings. > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 19 2018) > >>> Python Projects, Coaching and Consulting ...? http://www.egenix.com/ > >>> Python Database Interfaces ...? ? ? ? ? ?http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ...? ? ? ? ? ?http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > ? ?eGenix.com Software, Skills and Services GmbH? Pastor-Loeh-Str.48 > ? ? D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > ? ? ? ? ? ?Registered at Amtsgericht Duesseldorf: HRB 46611 > ? ? ? ? ? ? ? ?http://www.egenix.com/company/contact/ > ? ? ? ? ? ? ? ? ? ? ? http://www.malemburg.com/ > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From mal at egenix.com Fri Jan 19 13:18:06 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 19 Jan 2018 19:18:06 +0100 Subject: [Python-ideas] Windows Best Fit Encodings In-Reply-To: <7cda4c01-2252-7555-54e6-ccdeb8f07bb0@egenix.com> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> <7cda4c01-2252-7555-54e6-ccdeb8f07bb0@egenix.com> Message-ID: Hi Steve, do you know of a definite resource for Windows code pages on MSDN or another official MS website ? I tried to find some links, but only got these ancient ones: https://msdn.microsoft.com/en-us/library/cc195054.aspx (this version of cp1252 doesn't even have the euro sign yet) Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ On 19.01.2018 18:17, M.-A. Lemburg wrote: > On 19.01.2018 17:24, Random832 wrote: >> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote: >>>> Someone did discover that Microsoft's current implementations of the >>>> windows-* encodings matches the WHAT-WG spec, rather than the Unicode >>>> spec that Microsoft originally wrote. >>> >>> No, MS implements somethings called "best fit encodings" >>> and these are different than what WHATWG uses. >> >> NO. I made this absolutely clear in my previous message, best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags (the mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction - but is treated as a regular mapping for encoding purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit mapping or in any way different from the rest of the mappings. >> >> We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode. > > I only know the best fit encoding maps that are available > on the Unicode site. > > If I read your comment correctly, you are saying that MS has > moved away from the standard code pages towards something > else - perhaps even something other than the best fit encodings > listed on the Unicode site ? > > Do you have some references for this ? > > Note that the Windows code page codecs implemented in Python > are all based on the Unicode mapping files and those were > created by MS. > >>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx >>> >>> unfortunately uses the above mentioned best fit encodings, >>> but this can and should be switched off by specifying the >>> WC_NO_BEST_FIT_CHARS for anything that requires validation >>> or needs to be interoperable: >> >> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing. > > Interesting. The CP1252 mapping clearly defines 0x80 to map > to undefined, whereas the bestfit1252 maps it to 0x0081: > > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT > http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt > > Same for the example you gave for CP932: > > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT > http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt > > So at least following the documentation you'd expect the function > to implement the regular mappings. > From stefan at bytereef.org Fri Jan 19 13:19:56 2018 From: stefan at bytereef.org (Stefan Krah) Date: Fri, 19 Jan 2018 19:19:56 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> <20180119170850.GA13584@bytereef.org> Message-ID: <20180119181956.GA14866@bytereef.org> On Fri, Jan 19, 2018 at 05:30:43PM +0000, Paul Moore wrote: [cut] > I'd think that the idea of a site-packages/stest directory would need > a much more compelling use case to justify it. Thanks for the detailed explanation! It sounds that there's much more work involved than I thought, so it's probably better to drop this proposal. > PS There's nothing stopping a (distribution) package FOO from > installing (Python) packages foo and foo-tests. It's not common, and > probably violates people's expectations, but it's not *illegal* (the > setuptools distribution installs pkg_resources as well as setuptools, > for a well-known example). So in theory, if people wanted this enough, > they could have implemented it right now, without needing any change > to Python or the packaging ecosystem. If people don't come with pitchforks, that's a good solution. I suspected that people would complain both if foo-tests were installed automatically like pkg_resources but also if foo-tests were a separate optional package (too much hassle). Stefan Krah From rspeer at luminoso.com Fri Jan 19 13:35:30 2018 From: rspeer at luminoso.com (Rob Speer) Date: Fri, 19 Jan 2018 18:35:30 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> Message-ID: > It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach. And the way to express that is with errors='replace', errors='surrogateescape', or whatever, which Python already does. We do not need an explosion of error handlers. This problem can be very straightforwardly solved with encodings, and error handlers can keep doing their usual job on top of encodings. > You could also add a "latin1replace" error handler which simply passes through everything that's undefined as-is. Nobody asked for this. > I just don't want to have people start using "web-1252" as encoding simply because they they are writing out text for a web application - they should use "utf-8" instead. I did ask for input on the name. If the problem is that you think my working name for the encoding is misleading, you could help with that instead of constantly trying to replace the proposal with something different. Guido had some very sensible feedback just a moment ago. I am wondering now if we lost Guido because I broke python-ideas etiquette (is a pull request not the next step, for example? I never got a good answer on the process), or because this thread is just constantly being derailed. On Fri, 19 Jan 2018 at 13:14 M.-A. Lemburg wrote: > On 19.01.2018 18:12, Rob Speer wrote: > > Error handlers are quite orthogonal to this problem. If you try to solve > > this problem with an error handler, you will have a different problem. > > > > Suppose you made "c1-control-passthrough" or whatever into an error > > handler, similar to "replace" or "ignore", and then you encounter an > > unassigned character that's *not* in the range 0x80 to 0x9f. (Many > > encodings have these.) Do you replace it? Do you ignore it? You don't > > know because you just replaced the error handler with something that's > > not about error handling. > > It depends on what you want to achieve. You may want to fail, > assign a code point from a private area or use a surrogate > escape approach. Based on the context it may also make sense > to escape the input data using a different syntax, e.g. > XML escapes, backslash notations, HTML numeric entities, etc. > > You could also add a "latin1replace" error handler which > simply passes through everything that's undefined as-is. > > The Unicode error handlers are pretty flexible when it comes > to providing a solution: > > https://www.python.org/dev/peps/pep-0293/ > > You can even have the handler work "patch" an encoding, since > it also gets the encoding name as input. > > You could probably create an error handler which implements > most of their workarounds into a single "whatwg" handler. > > > I will also repeat that having these encodings (in both directions) will > > provide more ways for Python to *reduce* the amount of mojibake that > > exists. If acknowledging that mojibake exists offends your sense of > > purity, and you'd rather just destroy all mojibake at the source... > > that's great, and please get back to me after you've fixed Microsoft > Excel. > > I acknowledge that we have different views on this :-) > > Note that I'm not saying that the encodings are bad idea, > or should not be used. > > I just don't want to have people start using "web-1252" as > encoding simply because they they are writing out text for > a web application - they should use "utf-8" instead. > > The extra hurdle to pip-install a package for this feels > like the right way to turn this into a more conscious > decision and who knows... perhaps it'll even help fix Excel > once they have decided on including Python as scripting > language: > > > https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/suggestions/10549005-python-as-an-excel-scripting-language > > > I hope to make a pull request shortly that implements these mappings as > > new encodings that work just like the other ones. > > > > On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg > > wrote: > > > > On 19.01.2018 17:20, Guido van Rossum wrote: > > > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > > > > >> wrote: > > > > > > On 19.01.2018 05:38, Nathaniel Smith wrote: > > > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > > > >> wrote: > > > >> Can someone explain to me why this is such a controversial > > issue? > > > > > > > > I guess practicality versus purity is always controversial > :-) > > > > > > > >> It seems reasonable to me to add new encodings to the > > stdlib that do the > > > >> roundtripping requested in the first message of the thread. > > As long as they > > > >> have new names that seems to fall under "practicality beats > > purity". > > > > > > There are a few issues here: > > > > > > * WHATWG encodings are mostly for decoding content in order to > > > show it in the browser, accepting broken encoding data. > > > > > > > > > And sometimes Python apps that pull data from the web. > > > > > > > > > Python already has support for this by using one of the > > available > > > error handlers, or adding new ones to suit the needs. > > > > > > > > > This seems cumbersome though. > > > > Why is that ? > > > > Python 3 uses such error handlers for most of the I/O that's done > > with the OS already and for very similar reasons: dealing with > > broken data or broken configurations. > > > > > If we'd add the encodings, people will start creating more > > > broken data, since this is what the WHATWG codecs output > > > when encoding Unicode. > > > > > > > > > That's FUD. Only apps that specifically use the new WHATWG > encodings > > > would be able to consume that data. And surely the practice of web > > > browsers will have a much bigger effect than Python's choice. > > > > It's not FUD. I don't think we ought to encourage having > > Python create more broken data. The purpose of the WHATWG > > encodings is to help browsers deal with decoding broken > > data in a uniform way. It's not to generate more such data. > > > > That may be regarded as purists view, but also has a very > > practical meaning. The output of the codecs will only readable > > by browsers implementing the WHATWG encodings. Other tools > > receiving the data will run into the same decoding problems. > > > > Once you have Unicode, it's better to stay there and use > > UTF-8 for encoding to avoid any such issues. > > > > > As discussed, this could be addressed by making the WHATWG > > > codecs decode-only. > > > > > > > > > But that would defeat the point of roundtripping, right? > > > > Yes, intentionally. Once you have Unicode, the data should > > be encoded correctly back into UTF-8 or whatever legacy encoding > > is needed, fixing any issues while in Unicode. > > > > As always, it's better to explicitly address such problems than > > to simply punt on them and write back broken data. > > > > > * The use case seems limited to implementing browsers or > headless > > > implementations working like browsers. > > > > > > That's not really general enough to warrant adding lots of > > > new codecs to the stdlib. A PyPI package is better suited > > > for this. > > > > > > > > > Perhaps, but such a package already exists and its author (who > surely > > > has read a lot of bug reports from its users) says that this is > > cumbersome. > > > > The only critique I read was that registering the codecs > > is not explicit enough, but that's really only a nit, since > > you can easily have the codec package expose a register > > function which you then call explicitly in the code using > > the codecs. > > > > > * The WHATWG codecs do not only cover simple mapping codecs, > > > but also many multi-byte ones for e.g. Asian languages. > > > > > > I doubt that we'd want to maintain such codecs in the stdlib, > > > since this will increase the download sizes of the installers > > > and also require people knowledgeable about these variants > > > to work on them and fix any issues. > > > > > > > > > Really? Why is adding a bunch of codecs so much effort? Surely the > > > translation tables contain data that compresses well? And surely we > > > don't need a separate dedicated piece of C code for each new codec? > > > > For the simple charmap style codecs that's true. Not so for the > > Asian ones and the latter also do require dedicated C code (see > > Modules/cjkcodecs). > > > > > Overall, I think either pointing people to error handlers > > > or perhaps adding a new one specifically for the case of > > > dealing with control character mappings would provide a better > > > maintenance / usefulness ratio than adding lots of new > > > legacy codecs to the stdlib. > > > > > > > > > Wouldn't error handlers be much slower? And to me it seems a new > error > > > handler is a much *bigger* deal than some new encodings -- error > > > handlers must work for *all* encodings. > > > > Error handlers have a standard interface and so they will work > > for all codecs. Some codecs limits the number of handlers that > > can be used, but most accept all registered handlers. > > > > If a handler is too slow in Python, it can be coded in C for > > speed. > > > > > BTW: WHATWG pushes for always using UTF-8 as far as > > I can tell > > > from their website. > > > > > > > > > As does Python. But apparently it will take decades more to get > there. > > > > Yes indeed, so let's not add even more confusion by adding more > > variants of the legacy encodings. > > > > -- > > Marc-Andre Lemburg > > eGenix.com > > > > Professional Python Services directly from the Experts (#1, Jan 19 > 2018) > > >>> Python Projects, Coaching and Consulting ... > http://www.egenix.com/ > > >>> Python Database Interfaces ... > http://products.egenix.com/ > > >>> Plone/Zope Database Interfaces ... > http://zope.egenix.com/ > > > ________________________________________________________________________ > > > > ::: We implement business ideas - efficiently in both time and costs > ::: > > > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > > Registered at Amtsgericht Duesseldorf: HRB 46611 > > http://www.egenix.com/company/contact/ > > http://www.malemburg.com/ > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 19 2018) > >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ > >>> Python Database Interfaces ... http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Fri Jan 19 13:43:04 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 19 Jan 2018 18:43:04 +0000 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: <20180119181956.GA14866@bytereef.org> References: <20180119142753.GA4754@bytereef.org> <20180119170850.GA13584@bytereef.org> <20180119181956.GA14866@bytereef.org> Message-ID: On 19 January 2018 at 18:19, Stefan Krah wrote: > On Fri, Jan 19, 2018 at 05:30:43PM +0000, Paul Moore wrote: > [cut] >> I'd think that the idea of a site-packages/stest directory would need >> a much more compelling use case to justify it. > > Thanks for the detailed explanation! It sounds that there's much more work > involved than I thought, so it's probably better to drop this proposal. > > >> PS There's nothing stopping a (distribution) package FOO from >> installing (Python) packages foo and foo-tests. It's not common, and >> probably violates people's expectations, but it's not *illegal* (the >> setuptools distribution installs pkg_resources as well as setuptools, >> for a well-known example). So in theory, if people wanted this enough, >> they could have implemented it right now, without needing any change >> to Python or the packaging ecosystem. > > If people don't come with pitchforks, that's a good solution. I suspected > that people would complain both if foo-tests were installed automatically > like pkg_resources but also if foo-tests were a separate optional package > (too much hassle). Personally, I prefer packages that don't install their tests (I'm just about willing to tolerate the tests-inside-the package-approach) so I actually dislike this option myself - I was just saying it's possible. Paul From chris.barker at noaa.gov Fri Jan 19 13:39:12 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 19 Jan 2018 10:39:12 -0800 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: <20180119181956.GA14866@bytereef.org> References: <20180119142753.GA4754@bytereef.org> <20180119170850.GA13584@bytereef.org> <20180119181956.GA14866@bytereef.org> Message-ID: hmm, I've struggled for ages with this problem -- I have some packages with REALLY big test suites. so I don't put the tests in the package. But their are also numerous issues with building and installing the package (C code, lots of dependencies, etc), so it would be really nice to have a way to test the actual installed package after the fact (and, be able to properly test conda packages as well -- easy if the tests are in the package, hard if not...) So I like the idea of having a standard way / place to install tests. However, somehow I never thought to make a my_package_tests package -- d'uh! seems the obvious way to handle teh "optionally install the tests" problem. I still like the idea of separate location, but Paul is right that it's a change that would have to filter out through a lot of infrastructure, so maybe not practical. So maybe the way to go is to come up with recommendations for a standard way to do it -- maybe published by PyPa? -CHB On Fri, Jan 19, 2018 at 10:19 AM, Stefan Krah wrote: > On Fri, Jan 19, 2018 at 05:30:43PM +0000, Paul Moore wrote: > [cut] > > I'd think that the idea of a site-packages/stest directory would need > > a much more compelling use case to justify it. > > Thanks for the detailed explanation! It sounds that there's much more work > involved than I thought, so it's probably better to drop this proposal. > > > > PS There's nothing stopping a (distribution) package FOO from > > installing (Python) packages foo and foo-tests. It's not common, and > > probably violates people's expectations, but it's not *illegal* (the > > setuptools distribution installs pkg_resources as well as setuptools, > > for a well-known example). So in theory, if people wanted this enough, > > they could have implemented it right now, without needing any change > > to Python or the packaging ecosystem. > > If people don't come with pitchforks, that's a good solution. I suspected > that people would complain both if foo-tests were installed automatically > like pkg_resources but also if foo-tests were a separate optional package > (too much hassle). > > > > Stefan Krah > > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.rodola at gmail.com Fri Jan 19 14:21:51 2018 From: g.rodola at gmail.com (Giampaolo Rodola') Date: Fri, 19 Jan 2018 20:21:51 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: On Fri, Jan 19, 2018 at 5:23 PM, Paul Moore wrote: > Another common approach is to not ship tests as part of your (runtime) > package at all - they are in the sdist but not the wheels nor are they > deployed with "setup.py install". In my experience, this is the usual > approach projects take if they don't have the tests in the package > directory. (I don't think I've *ever* seen a project try to install > tests except by including them in the package directory...) I personally include them in psutil distribution so that users can test the installation with "python -m psutil.test". I have even this documented as I think it's an added value. -- Giampaolo - http://grodola.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Fri Jan 19 14:38:26 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 19 Jan 2018 20:38:26 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> Message-ID: <154c4ce3-f1a6-4df0-f992-1a94022f0e05@egenix.com> Rob: I think I was very clear very early in the thread that I'm opposed to adding a complete set of new encodings to the stdlib which only slightly alter many existing ones. Ever since I've been trying to give you suggestions on how we can solve the issue you're trying to address with the encodings in different ways which achieve much of the same but with the existing code base. I've also tried to understand the issue with WideCharToMultiByte() et al. apparently using different encodings than the ones which MS itself published to the Unicode Consortium, to see whether there's an issue we may need to resolve. That's a different topic, which is why I changed the subject line. If you call that derailing, I cannot help it, but won't engage any further in this discussion. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 19 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ On 19.01.2018 19:35, Rob Speer wrote: >> It depends on what you want to achieve. You may want to fail, assign a > code point from a private area or use a surrogate escape approach. > > And the way to express that is with errors='replace', > errors='surrogateescape', or whatever, which Python already does. We do > not need an explosion of error handlers. This problem can be very > straightforwardly solved with encodings, and error handlers can keep > doing their usual job on top of encodings. > >> You could also add a "latin1replace" error handler which simply passes > through everything that's undefined as-is. > > Nobody asked for this. > >> I just don't want to have people start using "web-1252" as encoding > simply because they they are writing out text for a web application - > they should use "utf-8" instead. > > I did ask for input on the name. If the problem is that you think my > working name for the encoding is misleading, you could help with that > instead of constantly trying to replace the proposal with something > different. > > Guido had some very sensible feedback just a moment ago. I am wondering > now if we lost Guido because I broke python-ideas etiquette (is a pull > request not the next step, for example? I never got a good answer on the > process), or because this thread is just constantly being derailed. > > > > On Fri, 19 Jan 2018 at 13:14 M.-A. Lemburg > wrote: > > On 19.01.2018 18:12, Rob Speer wrote: > > Error handlers are quite orthogonal to this problem. If you try to > solve > > this problem with an error handler, you will have a different problem. > > > > Suppose you made "c1-control-passthrough" or whatever into an error > > handler, similar to "replace" or "ignore", and then you encounter an > > unassigned character that's *not* in the range 0x80 to 0x9f. (Many > > encodings have these.) Do you replace it? Do you ignore it? You don't > > know because you just replaced the error handler with something that's > > not about error handling. > > It depends on what you want to achieve. You may want to fail, > assign a code point from a private area or use a surrogate > escape approach. Based on the context it may also make sense > to escape the input data using a different syntax, e.g. > XML escapes, backslash notations, HTML numeric entities, etc. > > You could also add a "latin1replace" error handler which > simply passes through everything that's undefined as-is. > > The Unicode error handlers are pretty flexible when it comes > to providing a solution: > > https://www.python.org/dev/peps/pep-0293/ > > You can even have the handler work "patch" an encoding, since > it also gets the encoding name as input. > > You could probably create an error handler which implements > most of their workarounds into a single "whatwg" handler. > > > I will also repeat that having these encodings (in both > directions) will > > provide more ways for Python to *reduce* the amount of mojibake that > > exists. If acknowledging that mojibake exists offends your sense of > > purity, and you'd rather just destroy all mojibake at the source... > > that's great, and please get back to me after you've fixed > Microsoft Excel. > > I acknowledge that we have different views on this :-) > > Note that I'm not saying that the encodings are bad idea, > or should not be used. > > I just don't want to have people start using "web-1252" as > encoding simply because they they are writing out text for > a web application - they should use "utf-8" instead. > > The extra hurdle to pip-install a package for this feels > like the right way to turn this into a more conscious > decision and who knows... perhaps it'll even help fix Excel > once they have decided on including Python as scripting > language: > > https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/suggestions/10549005-python-as-an-excel-scripting-language > > > I hope to make a pull request shortly that implements these > mappings as > > new encodings that work just like the other ones. > > > > On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg > > >> wrote: > > > >? ? ?On 19.01.2018 17:20, Guido van Rossum wrote: > >? ? ?> On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > > >? ? ?> > >? ? ?> > >>> wrote: > >? ? ?> > >? ? ?>? ? ?On 19.01.2018 05:38, Nathaniel Smith wrote: > >? ? ?>? ? ?> On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > >? ? ? > > > > >? ? ?>>> wrote: > >? ? ?>? ? ?>> Can someone explain to me why this is such a > controversial > >? ? ?issue? > >? ? ?>? ? ?> > >? ? ?>? ? ?> I guess practicality versus purity is always > controversial :-) > >? ? ?>? ? ?> > >? ? ?>? ? ?>> It seems reasonable to me to add new encodings to the > >? ? ?stdlib that do the > >? ? ?>? ? ?>> roundtripping requested in the first message of the > thread. > >? ? ?As long as they > >? ? ?>? ? ?>> have new names that seems to fall under "practicality > beats > >? ? ?purity". > >? ? ?> > >? ? ?>? ? ?There are a few issues here: > >? ? ?> > >? ? ?>? ? ?* WHATWG encodings are mostly for decoding content in > order to > >? ? ?>? ? ?? show it in the browser, accepting broken encoding data. > >? ? ?> > >? ? ?> > >? ? ?> And sometimes Python apps that pull data from the web. > >? ? ?> ? > >? ? ?> > >? ? ?>? ? ?? Python already has support for this by using one of the > >? ? ?available > >? ? ?>? ? ?? error handlers, or adding new ones to suit the needs. > >? ? ?> > >? ? ?> > >? ? ?> This seems cumbersome though. > >? ? ??? > >? ? ?Why is that ? > > > >? ? ?Python 3 uses such error handlers for most of the I/O that's done > >? ? ?with the OS already and for very similar reasons: dealing with > >? ? ?broken data or broken configurations. > > > >? ? ?>? ? ?? If we'd add the encodings, people will start creating more > >? ? ?>? ? ?? broken data, since this is what the WHATWG codecs output > >? ? ?>? ? ?? when encoding Unicode. > >? ? ?> > >? ? ?> > >? ? ?> That's FUD. Only apps that specifically use the new WHATWG > encodings > >? ? ?> would be able to consume that data. And surely the practice > of web > >? ? ?> browsers will have a much bigger effect than Python's choice. > >? ? ??? > >? ? ?It's not FUD. I don't think we ought to encourage having > >? ? ?Python create more broken data. The purpose of the WHATWG > >? ? ?encodings is to help browsers deal with decoding broken > >? ? ?data in a uniform way. It's not to generate more such data. > > > >? ? ?That may be regarded as purists view, but also has a very > >? ? ?practical meaning. The output of the codecs will only readable > >? ? ?by browsers implementing the WHATWG encodings. Other tools > >? ? ?receiving the data will run into the same decoding problems. > > > >? ? ?Once you have Unicode, it's better to stay there and use > >? ? ?UTF-8 for encoding to avoid any such issues. > > > >? ? ?>? ? ?? As discussed, this could be addressed by making the WHATWG > >? ? ?>? ? ?? codecs decode-only. > >? ? ?> > >? ? ?> > >? ? ?> But that would defeat the point of roundtripping, right? > > > >? ? ?Yes, intentionally. Once you have Unicode, the data should > >? ? ?be encoded correctly back into UTF-8 or whatever legacy encoding > >? ? ?is needed, fixing any issues while in Unicode. > > > >? ? ?As always, it's better to explicitly address such problems than > >? ? ?to simply punt on them and write back broken data. > > > >? ? ?>? ? ?* The use case seems limited to implementing browsers or > headless > >? ? ?>? ? ?? implementations working like browsers. > >? ? ?> > >? ? ?>? ? ?? That's not really general enough to warrant adding lots of > >? ? ?>? ? ?? new codecs to the stdlib. A PyPI package is better suited > >? ? ?>? ? ?? for this. > >? ? ?> > >? ? ?> > >? ? ?> Perhaps, but such a package already exists and its author > (who surely > >? ? ?> has read a lot of bug reports from its users) says that this is > >? ? ?cumbersome. > >? ? ??? > >? ? ?The only critique I read was that registering the codecs > >? ? ?is not explicit enough, but that's really only a nit, since > >? ? ?you can easily have the codec package expose a register > >? ? ?function which you then call explicitly in the code using > >? ? ?the codecs. > > > >? ? ?>? ? ?* The WHATWG codecs do not only cover simple mapping codecs, > >? ? ?>? ? ?? but also many multi-byte ones for e.g. Asian languages. > >? ? ?> > >? ? ?>? ? ?? I doubt that we'd want to maintain such codecs in the > stdlib, > >? ? ?>? ? ?? since this will increase the download sizes of the > installers > >? ? ?>? ? ?? and also require people knowledgeable about these variants > >? ? ?>? ? ?? to work on them and fix any issues. > >? ? ?> > >? ? ?> > >? ? ?> Really? Why is adding a bunch of codecs so much effort? > Surely the > >? ? ?> translation tables contain data that compresses well? And > surely we > >? ? ?> don't need a separate dedicated piece of C code for each new > codec? > >? ? ??? > >? ? ?For the simple charmap style codecs that's true. Not so for the > >? ? ?Asian ones and the latter also do require dedicated C code (see > >? ? ?Modules/cjkcodecs). > > > >? ? ?>? ? ?Overall, I think either pointing people to error handlers > >? ? ?>? ? ?or perhaps adding a new one specifically for the case of > >? ? ?>? ? ?dealing with control character mappings would provide a > better > >? ? ?>? ? ?maintenance / usefulness ratio than adding lots of new > >? ? ?>? ? ?legacy codecs to the stdlib. > >? ? ?> > >? ? ?> > >? ? ?> Wouldn't error handlers be much slower? And to me it seems a > new error > >? ? ?> handler is a much *bigger* deal than some new encodings -- error > >? ? ?> handlers must work for *all* encodings. > >? ? ??? > >? ? ?Error handlers have a standard interface and so they will work > >? ? ?for all codecs. Some codecs limits the number of handlers that > >? ? ?can be used, but most accept all registered handlers. > > > >? ? ?If a handler is too slow in Python, it can be coded in C for > >? ? ?speed. > > > >? ? ?>? ? ?BTW: WHATWG pushes for always using UTF-8 as far as > > I can tell > >? ? ?>? ? ?from their website. > >? ? ?> > >? ? ?> > >? ? ?> As does Python. But apparently it will take decades more to > get there. > > > >? ? ?Yes indeed, so let's not add even more confusion by adding more > >? ? ?variants of the legacy encodings. > > > >? ? ?-- > >? ? ?Marc-Andre Lemburg > >? ? ?eGenix.com > > > >? ? ?Professional Python Services directly from the Experts (#1, > Jan 19 2018) > >? ? ?>>> Python Projects, Coaching and Consulting ...? > http://www.egenix.com/ > >? ? ?>>> Python Database Interfaces ...? ? ? ? ? > ?http://products.egenix.com/ > >? ? ?>>> Plone/Zope Database Interfaces ...? ? ? ? ? > ?http://zope.egenix.com/ > >? ? > ?________________________________________________________________________ > > > >? ? ?::: We implement business ideas - efficiently in both time and > costs ::: > > > >? ? ?? ?eGenix.com Software, Skills and Services GmbH? > Pastor-Loeh-Str.48 > >? ? ?? ? D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre > Lemburg > >? ? ?? ? ? ? ? ?Registered at Amtsgericht Duesseldorf: HRB 46611 > >? ? ?? ? ? ? ? ? ? ?http://www.egenix.com/company/contact/ > >? ? ?? ? ? ? ? ? ? ? ? ? ? http://www.malemburg.com/ > > > >? ? ?_______________________________________________ > >? ? ?Python-ideas mailing list > >? ? ?Python-ideas at python.org > > > >? ? ?https://mail.python.org/mailman/listinfo/python-ideas > >? ? ?Code of Conduct: http://python.org/psf/codeofconduct/ > > > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 19 2018) > >>> Python Projects, Coaching and Consulting ...? http://www.egenix.com/ > >>> Python Database Interfaces ...? ? ? ? ? ?http://products.egenix.com/ > >>> Plone/Zope Database Interfaces ...? ? ? ? ? ?http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > ? ?eGenix.com Software, Skills and Services GmbH? Pastor-Loeh-Str.48 > ? ? D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > ? ? ? ? ? ?Registered at Amtsgericht Duesseldorf: HRB 46611 > ? ? ? ? ? ? ? ?http://www.egenix.com/company/contact/ > ? ? ? ? ? ? ? ? ? ? ? http://www.malemburg.com/ > From sylvain.marie at schneider-electric.com Fri Jan 19 13:09:44 2018 From: sylvain.marie at schneider-electric.com (Sylvain MARIE) Date: Fri, 19 Jan 2018 18:09:44 +0000 Subject: [Python-ideas] Repurpose `assert' into a general-purpose check In-Reply-To: <5A6048F5.9050207@stoneleaf.us> References: <20180118065903.GD22500@ando.pearwood.info> <5A6048F5.9050207@stoneleaf.us> Message-ID: > I haven't yet seen any justification for syntax here. The nearest I've seen is that this "ensure" action is more like: > > try: > cond = x >= 0 > except BaseException: > raise AssertionError("x must be positive") > else: > if not cond: > raise AssertionError("x must be positive") > > Which, IMO, is a bad idea, and I'm not sure anyone was actually advocating it anyway. > > ChrisA Indeed, I was the one advocating for it :) Based on all the feedback I received from this discussion, I realized that my implementation was completely flawed by the fact that I had done the class and functions decorators first, and wanted to apply the same pattern to the inline validator, resulting in this assert_valid with overkill delayed evaluation. Resulting in me saying that the only way out would be a new python language element. I tried my best to update valid8 and reached a new stable point with version 3.0.0, providing 2 main utilities for inline validation: - the simple but not so powerful `quick_valid` function - the more verbose (2 lines) but much more generic `wrap_valid` context manager (that's the best I can do today !) The more capable but delayed-evaluation based `assert_valid` is not recommended anymore, or just a tool to replicate what is done in the function and class validation decorators. Like the decorators, it adds the ability to blend two styles of base functions (boolean testers and failure raisers) with boolean operators seamlessly. But the complexity is not worth it for inline validation (it seems to be worth it for decorators). See https://smarie.github.io/python-valid8 for the new updated documentation. I also updated the problem description page at https://smarie.github.io/python-valid8/why_validation/ so as to keep a reference of the problem description and "wishlist" (whether it is implemented by this library or by new language elements in the future). Do not hesitate to contribute or send me your edits (off-list). I would love to get feedback from anyone concerning this library, whether you consider it's useless or "interesting but...". We should probably take this offline though, so as not to pollute the initial thread. Thanks again, a great weekend to all (end of the day here in france ;) ) Kind regards Sylvain From fakedme+py at gmail.com Fri Jan 19 17:08:29 2018 From: fakedme+py at gmail.com (Soni L.) Date: Fri, 19 Jan 2018 20:08:29 -0200 Subject: [Python-ideas] Chaining coders Message-ID: <2c5facca-f0ad-4413-eb84-f4deee7e30de@gmail.com> windows-1252 is based on iso-8859-1. Thus, I'd like to be able to chain coders as follows: bytes.decode("windows-1252-ext", else=lambda r: r.decode("iso-8859-1")) What this "else" does is that it's a lambda, and it gets passed an object with a decode method identical to the bytes decode method, except that it doesn't affect already-decoded characters. In this case, "windows-1252-ext" only includes things in the \x80-\x9F range, leaving it up to "iso-8859-1" to handle the rest. A similar process would happen for encoding: encode with "windows-1252-ext", else = "iso-8859-1". (Technically, "windows-1252-ext" isn't needed - you can use the existing "windows-1252" and combine it with the "iso-8859-1" to get "windows-1252-c1".) This would be a novel way to think of encodings as not just flat translation tables but highly composable translation tables. I have a thing for composition. From chris.barker at noaa.gov Fri Jan 19 17:50:35 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 19 Jan 2018 14:50:35 -0800 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: On Fri, Jan 19, 2018 at 11:21 AM, Giampaolo Rodola' wrote: > > I personally include them in psutil distribution so that users can test > the installation with "python -m psutil.test". I have even this documented > as I think it's an added value. > or: pytest --pyargs pkg_name It Is really handy, and sometimes required to test the distribution / installation itself. So I do that most of the time these days -- but it gets ugly if the tests get really huge. -CHB > > > > -- > Giampaolo - http://grodola.blogspot.com > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From rspeer at luminoso.com Fri Jan 19 18:05:57 2018 From: rspeer at luminoso.com (Rob Speer) Date: Fri, 19 Jan 2018 23:05:57 +0000 Subject: [Python-ideas] Chaining coders In-Reply-To: <2c5facca-f0ad-4413-eb84-f4deee7e30de@gmail.com> References: <2c5facca-f0ad-4413-eb84-f4deee7e30de@gmail.com> Message-ID: I see how this is another way to get what I was asking for: a way to decode some unfortunately common text encodings, ones that Web browsers use, in Python without having to import additional modules. I appreciate other ideas about how to solve this problem, but the generality here seems pretty unnecessary. The world isn't making any _novel_ legacy encodings. There are 8 legacy encodings that Python has missed, and there's no reason to expect there to be any more of them. It's worrisome to support arbitrary compositions of encodings. Most of these possible hybrid encodings haven't been used before, and using them would be a bad idea because there would be no reason to expect any other software in existence to be compatible with them. Some of these legacy encodings (like the webbish version of windows-1255) are not the composition of two encodings that already exist in Python. So you'd have to define new encodings anyway. On Fri, 19 Jan 2018 at 17:09 Soni L. wrote: > windows-1252 is based on iso-8859-1. Thus, I'd like to be able to chain > coders as follows: > > bytes.decode("windows-1252-ext", else=lambda r: r.decode("iso-8859-1")) > > What this "else" does is that it's a lambda, and it gets passed an > object with a decode method identical to the bytes decode method, except > that it doesn't affect already-decoded characters. In this case, > "windows-1252-ext" only includes things in the \x80-\x9F range, leaving > it up to "iso-8859-1" to handle the rest. > > A similar process would happen for encoding: encode with > "windows-1252-ext", else = "iso-8859-1". > > (Technically, "windows-1252-ext" isn't needed - you can use the existing > "windows-1252" and combine it with the "iso-8859-1" to get > "windows-1252-c1".) > > This would be a novel way to think of encodings as not just flat > translation tables but highly composable translation tables. I have a > thing for composition. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve.dower at python.org Sat Jan 20 02:01:34 2018 From: steve.dower at python.org (Steve Dower) Date: Sat, 20 Jan 2018 18:01:34 +1100 Subject: [Python-ideas] Windows Best Fit Encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> <7cda4c01-2252-7555-54e6-ccdeb8f07bb0@egenix.com> Message-ID: On 20Jan2018 0518, M.-A. Lemburg wrote: > do you know of a definite resource for Windows code pages > on MSDN or another official MS website ? I don't know of anything sorry, and my quick search didn't turn up anything public. But I can at least confirm that the internal table for cp1252 has the same undefined characters as on unicode.org, so presumably if MultiByteToWideChar is mapping those to "best fit" characters it's only because the flag has been passed. As far as I can tell, Microsoft has not been secretly redefining any encodings. Cheers, Steve From random832 at fastmail.com Sat Jan 20 04:21:07 2018 From: random832 at fastmail.com (Random832) Date: Sat, 20 Jan 2018 04:21:07 -0500 Subject: [Python-ideas] Windows Best Fit Encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> <7cda4c01-2252-7555-54e6-ccdeb8f07bb0@egenix.com> Message-ID: <1516440067.2034280.1241892608.4F07D92F@webmail.messagingengine.com> On Sat, Jan 20, 2018, at 02:01, Steve Dower wrote: > On 20Jan2018 0518, M.-A. Lemburg wrote: > > do you know of a definite resource for Windows code pages > > on MSDN or another official MS website ? I don't know what happened to this page, but I was able to find better-looking codepage tables at http://web.archive.org/web/20160314211032/https://msdn.microsoft.com/en-us/goglobal/bb964654 Older versions at: web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.asp web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.mspx See also, still live: https://www.microsoft.com/typography/unicode/cscp.htm (this has 0xCA in the graphical table for cp1255, the other does not) > > I don't know of anything sorry, and my quick search didn't turn up > anything public. But I can at least confirm that the internal table for > cp1252 has the same undefined characters as on unicode.org >, so > presumably if MultiByteToWideChar is mapping those to "best fit" > characters it's only because the flag has been passed. I'm passing MB_ERR_INVALID_CHARS. And is this just as true for cp1255 0xCA as for the control characters? MultiByteToWideChar doesn't even *have* a flag for "best fit". I was not able to identify any combination of flags that can be passed to either function on Windows 7 that would cause e.g. 0x81 in cp1252 to be treated any differently from any other character. The C_1252.NLS file appears to consist of: 28 bytes of header 512 bytes WCHAR[256] of mappings e.g. 0000010c: 7800 7900 7a00 7b00 7c00 7d00 7e00 7f00 x.y.z.{.|.}.~... 0000011c: ac20 8100 1a20 9201 1e20 2620 2020 2120 . ... ... & ! 0000012c: c602 3020 6001 3920 5201 8d00 7d01 8f00 ..0 `.9 R...}... 0000013c: 9000 1820 1920 1c20 1d20 2220 1320 1420 ... . . . " . . 0000014c: dc02 2221 6101 3a20 5301 9d00 7e01 7801 .."!a.: S...~.x. 0000015c: a000 a100 a200 a300 a400 a500 a600 a700 ................ Six zero bytes BYTE[65536] apparently of the best fit mappings, e.g. 000002a2: 3f81 3f3f 3f3f 3f3f 3f3f 3f3f 3f8d 3f8f ?.???????????.?. 000002b2: 903f 3f3f 3f3f 3f3f 3f3f 3f3f 3f9d 3f3f .????????????.?? 00000312: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff ................ 00000322: 4161 4161 4161 4363 4363 4363 4363 4464 AaAaAaCcCcCcCcDd I don't see where the file format even has room to identify characters as invalid (or how WideCharToMultiByte disables the best fit mappings, unless it's by checking the result against the WCHAR[256] table), though CP1253 and CP1255 seem to manage it. The ones in those codepages that do return an error are mapped (if the flag is not passed in, and in the NLS file tables) to private use characters U+F8xx. > As far as I can > tell, Microsoft has not been secretly redefining any encodings. Not so much redefining as holding back these characters from the published definition. I was being a bit overly dramatic with the 'for some unknown reason' bit, it seems obvious the reason is they wanted to reserve the ability to add new characters in the future, as they did for the Euro sign. And there's nothing wrong with that, per se, though it's unfortunate that their own conversion functions can't treat these bytes as errors. Looking at the actual files, it looks like the ones in the "best fit" directory are in a format used internally by Microsoft (at a glance, they seem to contain enough information to generate the .NLS files, including stuff like the question marks in the header and the structure of DBCS tables), and the ones in the other mappings directory are sanitized and converted to more or less the same format as the other mappings. (As for 1255 0xCA, the comment in the best fit file suggests that it was unclear what hebrew vowel point it was meant to be) From steve at pearwood.info Sun Jan 21 05:43:44 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 21 Jan 2018 21:43:44 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> Message-ID: <20180121104343.GQ22500@ando.pearwood.info> On Fri, Jan 19, 2018 at 06:35:30PM +0000, Rob Speer wrote: > > It depends on what you want to achieve. You may want to fail, assign a > code point from a private area or use a surrogate escape approach. > > And the way to express that is with errors='replace', > errors='surrogateescape', or whatever, which Python already does. We do not > need an explosion of error handlers. This problem can be very > straightforwardly solved with encodings, and error handlers can keep doing > their usual job on top of encodings. > > > You could also add a "latin1replace" error handler which simply passes > through everything that's undefined as-is. > > Nobody asked for this. Actually, Soni L. seems to have suggested a similar idea in the thread titled "Chaining coders" (codecs). But what does it matter whether someone asked for it? Until this thread, nobody had asked for support for WHATWG encodings either. The question to my mind is whether or not this "latin1replace" handler, in conjunction with existing codecs, will do the same thing as the WHATWG codecs. If I have understood you correctly, I think it will. Have I missed something? > > I just don't want to have people start using "web-1252" as encoding > simply because they they are writing out text for a web application - they > should use "utf-8" instead. > > I did ask for input on the name. If the problem is that you think my > working name for the encoding is misleading, you could help with that > instead of constantly trying to replace the proposal with something > different. Rob, you've come here with a proposal based on an actual problem (web pages with mojibake and broken encodings), an existing solution (a third party library) you dislike, and a suggested new solution you will like (move the encodings into the std lib). That's great, and we need more suggestions like this: concrete use-cases and concrete solutions. But you cannot expect that we're going to automatically agree that: - the problem is something that Python the language has to solve (it seems to be a *browser* problem, not a general programming problem); - the existing solution is not sufficient; and - your proposal is the right solution. All of these things need to be justified, and counter-proposals are part of that. When we make a non-trivial proposal on Python-Ideas, it is very rare that they are so clearly the right solution for the right problem that they get instant approval and you can go straight to the PR. Often there are legitimate questions about all three steps. That's why I suggested earlier that (in my opinion) there needs to be a PEP to summarise the issue, justify the proposal, and counter the arguments against it. (Even if the proposal is agreed upon by everyone, if it is sufficiently non-trivial, we sometimes require a PEP summarising the issue for future reference.) As the author of one PEP myself, I know how frustrating this process can seem when you think that this is a bloody obvious proposal with no downside that all right-thinking people ought to instantly recognise as a great idea *wink* but nevertheless, in *my opinion* (I don't speak for anyone else) I think a PEP would be a good idea. > Guido had some very sensible feedback just a moment ago. I am wondering now > if we lost Guido because I broke python-ideas etiquette (is a pull request > not the next step, for example? I never got a good answer on the process), > or because this thread is just constantly being derailed. I don't speak for Guido, but it might simply be he isn't invested enough in *this specific issue* to spend the time wading through a long thread. (That's another reason why a PEP is sometimes valuable.) Perhaps he's still on holiday and only has limited time to spend on this. If I were in your position, my next step would be to write a new post summarising the thread so far: - a brief summary of the nature of the problem; - why you think a solution (whatever that solution turns out to be) should be in the stdlib rather than a third-party library; - what you think the solution should be; - and give a fair critique of the alternatives suggested so far and why you thik that they aren't suitable. That's the same sort of information given in a PEP, but without having to go through the formal PEP process. That might be enough to gain consensus on what happens next -- and maybe even agreement that a formal and more detailed PEP is not needed. Oh, and in case you're thinking this is all a great PITA, it might help if you read these to get an understanding of why things are as they are: https://www.curiousefficiency.org/posts/2011/02/status-quo-wins-stalemate.html https://www.curiousefficiency.org/posts/2011/04/musings-on-culture-of-python-dev.html Good luck! -- Steve From mal at egenix.com Sun Jan 21 06:35:05 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Sun, 21 Jan 2018 12:35:05 +0100 Subject: [Python-ideas] Windows Best Fit Encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <1516379088.3407852.1241225176.1425E49E@webmail.messagingengine.com> <7cda4c01-2252-7555-54e6-ccdeb8f07bb0@egenix.com> Message-ID: <980fd980-2f6e-1d3e-f35f-e1653762e2ad@egenix.com> On 20.01.2018 08:01, Steve Dower wrote: > On 20Jan2018 0518, M.-A. Lemburg wrote: >> do you know of a definite resource for Windows code pages >> on MSDN or another official MS website ? > > I don't know of anything sorry, and my quick search didn't turn up > anything public. But I can at least confirm that the internal table for > cp1252 has the same undefined characters as on unicode.org, so > presumably if MultiByteToWideChar is mapping those to "best fit" > characters it's only because the flag has been passed. As far as I can > tell, Microsoft has not been secretly redefining any encodings. Thanks for confirming, Steve. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 21 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From rspeer at luminoso.com Sun Jan 21 11:36:58 2018 From: rspeer at luminoso.com (Rob Speer) Date: Sun, 21 Jan 2018 16:36:58 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <20180121104343.GQ22500@ando.pearwood.info> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> <20180121104343.GQ22500@ando.pearwood.info> Message-ID: > The question to my mind is whether or not this "latin1replace" handler, > in conjunction with existing codecs, will do the same thing as the > WHATWG codecs. If I have understood you correctly, I think it will. Have > I missed something? It won't do the same thing, and neither will the "chaining coders" proposal. It's easy to miss details like this in all the counterproposals. The difference between WHATWG encodings and the ones in Python is, in all but one case, *only* in the C1 control character range (0x80 to 0x9F), a range of Unicode characters that has historically evaded standardization because they never had a clear purpose even before Unicode. Filling in all the gaps with Latin-1 would do the right thing for, I think, 3 of the encodings, and the wrong thing in the other 5 cases. (In the anomalous case of Windows-1255, it would do a more explicitly wrong thing.) Let's take Windows-1253 (Greek) as an example. Windows-1253 has a bunch of gaps in the 0x80 to 0x9F range, like most of the others. It also has gaps for 0xAA, 0xD2, and 0xFF. WHATWG does _not_ recommend decoding these as the letters "?", "?", and "?", the characters in the equivalent positions in Latin-1. They are simply unassigned. Other software sometimes maps them to the Private Use Area, but this is not standardized at all, and it seems clear that Python should handle them with its usual error handler for unassigned bytes. (Which is one of the reasons not to replace the error handler with something different: we still need the error handler.) Of course, you could define an encoding that's Windows-1253 plus the letters "?", "?", and "?", filling in all the gaps with Latin-1. It would be weird and new (who ever heard of an encoding that has a mapping for "?" but not "?"?). One point I hope to have agreement on is that we do not want to create _new_ legacy encodings that are not used anywhere else. The reason I was proposing to move ahead with a PR was not that I thought it would be automatically accepted -- it was to have a point of reference for exactly what I'm proposing, so we can discuss exactly what the functional difference is between this and counterproposals without getting lost. But I can see how writing the point of reference in PEP form instead of PR form can be the right way to focus discussion. Thanks for the recommendation there, and I'd like a little extra information -- I don't know _mechanically_ how to write a PEP. (Where do I submit it to, for example?) -- Rob Speer On Sun, 21 Jan 2018 at 05:44 Steven D'Aprano wrote: > On Fri, Jan 19, 2018 at 06:35:30PM +0000, Rob Speer wrote: > > > It depends on what you want to achieve. You may want to fail, assign a > > code point from a private area or use a surrogate escape approach. > > > > And the way to express that is with errors='replace', > > errors='surrogateescape', or whatever, which Python already does. We do > not > > need an explosion of error handlers. This problem can be very > > straightforwardly solved with encodings, and error handlers can keep > doing > > their usual job on top of encodings. > > > > > You could also add a "latin1replace" error handler which simply passes > > through everything that's undefined as-is. > > > > Nobody asked for this. > > Actually, Soni L. seems to have suggested a similar idea in the thread > titled "Chaining coders" (codecs). > > But what does it matter whether someone asked for it? Until this thread, > nobody had asked for support for WHATWG encodings either. > > The question to my mind is whether or not this "latin1replace" handler, > in conjunction with existing codecs, will do the same thing as the > WHATWG codecs. If I have understood you correctly, I think it will. Have > I missed something? > > > > > I just don't want to have people start using "web-1252" as encoding > > simply because they they are writing out text for a web application - > they > > should use "utf-8" instead. > > > > I did ask for input on the name. If the problem is that you think my > > working name for the encoding is misleading, you could help with that > > instead of constantly trying to replace the proposal with something > > different. > > Rob, you've come here with a proposal based on an actual problem (web > pages with mojibake and broken encodings), an existing solution (a third > party library) you dislike, and a suggested new solution you will like > (move the encodings into the std lib). That's great, and we need more > suggestions like this: concrete use-cases and concrete solutions. > > But you cannot expect that we're going to automatically agree that: > > - the problem is something that Python the language has to solve > (it seems to be a *browser* problem, not a general programming > problem); > > - the existing solution is not sufficient; and > > - your proposal is the right solution. > > > All of these things need to be justified, and counter-proposals are part > of that. > > When we make a non-trivial proposal on Python-Ideas, it is very rare > that they are so clearly the right solution for the right problem that > they get instant approval and you can go straight to the PR. Often there > are legitimate questions about all three steps. That's why I suggested > earlier that (in my opinion) there needs to be a PEP to summarise the > issue, justify the proposal, and counter the arguments against it. > > (Even if the proposal is agreed upon by everyone, if it is sufficiently > non-trivial, we sometimes require a PEP summarising the issue for future > reference.) > > As the author of one PEP myself, I know how frustrating this process can > seem when you think that this is a bloody obvious proposal with no > downside that all right-thinking people ought to instantly recognise as > a great idea *wink* but nevertheless, in *my opinion* (I don't speak for > anyone else) I think a PEP would be a good idea. > > > > Guido had some very sensible feedback just a moment ago. I am wondering > now > > if we lost Guido because I broke python-ideas etiquette (is a pull > request > > not the next step, for example? I never got a good answer on the > process), > > or because this thread is just constantly being derailed. > > I don't speak for Guido, but it might simply be he isn't invested > enough in *this specific issue* to spend the time wading through a > long thread. (That's another reason why a PEP is sometimes valuable.) > Perhaps he's still on holiday and only has limited time to spend on > this. > > If I were in your position, my next step would be to write a new > post summarising the thread so far: > > - a brief summary of the nature of the problem; > - why you think a solution (whatever that solution turns out to be) > should be in the stdlib rather than a third-party library; > - what you think the solution should be; > - and give a fair critique of the alternatives suggested so far and > why you thik that they aren't suitable. > > That's the same sort of information given in a PEP, but without having > to go through the formal PEP process. That might be enough to gain > consensus on what happens next -- and maybe even agreement that a formal > and more detailed PEP is not needed. > > Oh, and in case you're thinking this is all a great PITA, it might help > if you read these to get an understanding of why things are as they are: > > > https://www.curiousefficiency.org/posts/2011/02/status-quo-wins-stalemate.html > > > https://www.curiousefficiency.org/posts/2011/04/musings-on-culture-of-python-dev.html > > > Good luck! > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Sun Jan 21 11:47:19 2018 From: rosuav at gmail.com (Chris Angelico) Date: Mon, 22 Jan 2018 03:47:19 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> <20180121104343.GQ22500@ando.pearwood.info> Message-ID: On Mon, Jan 22, 2018 at 3:36 AM, Rob Speer wrote: > Thanks for the recommendation there, and I'd like a little extra information > -- I don't know _mechanically_ how to write a PEP. (Where do I submit it to, > for example?) I can help you with that side of things. Start by checking out PEP 1: https://www.python.org/dev/peps/pep-0001/ Feel free to ping me off-list if you have difficulties, or if you need a hand getting the formatting tidy. ChrisA From guido at python.org Sun Jan 21 15:52:53 2018 From: guido at python.org (Guido van Rossum) Date: Sun, 21 Jan 2018 12:52:53 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <20180121104343.GQ22500@ando.pearwood.info> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> <20180121104343.GQ22500@ando.pearwood.info> Message-ID: On Sun, Jan 21, 2018 at 2:43 AM, Steven D'Aprano wrote: > On Fri, Jan 19, 2018 at 06:35:30PM +0000, Rob Speer wrote: > > Guido had some very sensible feedback just a moment ago. I am wondering > now > > if we lost Guido because I broke python-ideas etiquette (is a pull > request > > not the next step, for example? I never got a good answer on the > process), > > or because this thread is just constantly being derailed. > > I don't speak for Guido, but it might simply be he isn't invested > enough in *this specific issue* to spend the time wading through a > long thread. (That's another reason why a PEP is sometimes valuable.) > Perhaps he's still on holiday and only has limited time to spend on > this. > Actually my reason to withdraw is that the sides seem to be about as well dug in as the sides during WW1. There's not much I can do in such case (except point out that the status quo wins). -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From turnbull.stephen.fw at u.tsukuba.ac.jp Mon Jan 22 01:43:37 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Mon, 22 Jan 2018 15:43:37 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> <20180121104343.GQ22500@ando.pearwood.info> Message-ID: <23141.34841.479927.670393@turnbull.sk.tsukuba.ac.jp> I don't expect to change your mind about the "right" way to deal with this, but this is a more explicit description of what those of us who advocate error handlers are thinking about. It may be useful in writing your PEP (PEPs describe rejected counterproposals and amendments along with adopted proposals and rationale in either case). Rob Speer writes: > > The question to my mind is whether or not this "latin1replace" handler, > > in conjunction with existing codecs, will do the same thing as the > > WHATWG codecs. If I have understood you correctly, I think it will. Have > > I missed something? > > It won't do the same thing, and neither will the "chaining coders" > proposal. The "chaining coders" proposal isn't well-enough specified to be sure. However, for practical purposes you may think of a Python *codec* as a "whole array" decoder/encoder, and an *error handler* as a "token-by- token" decoder/encoder. The distinction in type is for efficiency, of course. Codecs can't be "chained" (I think, but I didn't think very hard), but handlers can, in the sense that each handler can handle some input values and delegate anything it can't deal with to the next handler in the chain (under the hood handler implementationss are just Python functions with a particular signature, so this is just "loop until non-None"). > It's easy to miss details like this in all the counterproposals. I see no reason why a 'whatwgreplace' error handler with the logic # I am assuming decoding, and single-byte encodings. Encoding # with 'html' error mode would insert format("&#%d;", ord(unicode)). # Multibyte is a little harder. # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS # and Big5. assert the_byte >= 0x80 # Handle C1 control characters. if the_byte < 0xA0: append_to_output(chr(the_byte)) # Handle extended repertoire with a dict. # This condition will depend on the particular codec. elif the_byte in additional_code_points: append_to_output(additional_code_points[the_byte]) # Implement WHATWG error modes. elif whatwg_error_mode is replacement: append_to_output("\uFFFD") else: raise doesn't have the effect you want. This can be done in pure Python. (Note: The actions in the pseudocode are not accurate. IIRC real handlers take a UnicodeError as argument, and return a tuple of the text to append to output and number of input tokens to skip, or return None to indicate an unhandled error, rather than doing the appending and raising themselves.) The main objection to doing it this way would be efficiency. To be honest, I personally don't think that's an important objection since this handler is frequently invoked only if the source text is badly broken. (Remember, you'll already be greatly expanding the repertoire of at least ASCII and ISO 8859/1 by promoting to windows-1252.) And it would surely be "fast enough" if written in C. Caveat: I'm not sure I agree with MAL about windows-1255. I think it's arguable that the WHAT-WG index is a better approximation to reality, and I'd like to hear Hebrew speakers argue about that (I'm not one). > The difference between WHATWG encodings and the ones in Python is, > in all but one case, *only* in the C1 control character range (0x80 > to 0x9F), Also in Japanese, where "corporate characters" have been added (frequently twice, preventing round-tripping ... yuck) to the JIS standard. I haven't checked the Chinese and Korean tables for similar damage, but they're not quite as wacky about this stuff as the JISC is, so they're probably OK (and of course Big5 was "corporate" from the get-go). > a range of Unicode characters that has historically evaded > standardization because they never had a clear purpose even before > Unicode. Filling in all the gaps with Latin-1 That's wrong, as you explain: > [Eg, in Greek, some code points] are simply unassigned. Other > software sometimes maps them to the Private Use Area, but this is > not standardized at all, and it seems clear that Python should > handle them with its usual error handler for unassigned > bytes. (Which is one of the reasons not to replace the error > handler with something different: we still need the error handler.) The logic above handles all this. As mentioned, a stdlib error handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG conformance, or 'surrogatereplace' for the Pythonic equivalent of mapping to the private area) could be chained if desired, and the defaults could be changed and the names aliased to the WHAT-WG terms. This could be automated with a factory function that takes a list of predefined handlers and composes them, although that would add another layer of inefficiency (the composition would presumably be done in a loop, and possibly using try although I think the error handler convention is to return the text to insert if handled, and None if the error can't be handled). Steve From turnbull.stephen.fw at u.tsukuba.ac.jp Mon Jan 22 02:39:18 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Mon, 22 Jan 2018 16:39:18 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <1516296762.2008115.1240027520.58132A15@webmail.messagingengine.com> References: <23136.50585.860684.619627@turnbull.sk.tsukuba.ac.jp> <1516296762.2008115.1240027520.58132A15@webmail.messagingengine.com> Message-ID: <23141.38182.48661.542852@turnbull.sk.tsukuba.ac.jp> Random832 writes: > I think his point is that the WHATWG standard is the one that > governs HTML and therefore HTML that uses these encodings > (including the C1 characters) are conformant to *that* standard, I don't think that is a tenable interpretation of this standard. The WHAT-WG standard encoding for HTML is UTF-8. This is what https://encoding.spec.whatwg.org/#names-and-labels says: Authors must use the UTF-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it. New protocols and formats, as well as existing formats deployed in new contexts[1], must use the UTF-8 encoding exclusively. If these protocols and formats need to expose the encoding?s name or label, they must expose it as "utf-8". Non-UTF-8 *documents* do not conform. There's nothing anywhere that says you may use other encodings, with the single exception of implied permission when encoding form input to send to the server (and that's not even HTML!) Even there you're encouraged to use UTF-8. The rest of the standard provides for how *processes* should handle encodings in purported HTML documents that fail the requirement to encode in UTF-8. That doesn't mean such documents conform; it simply *gives permission* to a conformant process to try to deal with them, and rules for doing that. Yes, it's true that WHAT-WG processing probably would have saved Nathaniel some aggravation with his manipulations of HTML. It's equally likely that errors='surrogateescape' would do so, and a better job on encodings like Hebrew that leave code points in graphic regions undefined. Footnotes: [1] I take this to mean that when I take an EUC-JP HTML document and move it from my legacy document tree to my new Django static resource collection, I *must* transcode it to UTF-8. From yahya-abou-imran at protonmail.com Mon Jan 22 09:20:16 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Mon, 22 Jan 2018 09:20:16 -0500 Subject: [Python-ideas] __vars__ special method Message-ID: On top of this old proposition: https://bugs.python.org/issue13290 We could have a __vars__ method that would be called by vars() if defined. The use cases: 1. let vars() work with instances without __dict__; 2. hide some attributes from the public API. Example for 1: class C: __slots__ = 'eggs', 'spam' def __vars__(self): d = {} for attr in self.__slots__: if hasattr(self, attr): d[attr] = getattr(self, attr) return d Exemple for 2: class C: def __vars__(self): return {attr: value for attr, value in self.__dict__.items() if not attr.startswith('_')} From steve at pearwood.info Mon Jan 22 12:41:45 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 23 Jan 2018 04:41:45 +1100 Subject: [Python-ideas] __vars__ special method In-Reply-To: References: Message-ID: <20180122174144.GT22500@ando.pearwood.info> On Mon, Jan 22, 2018 at 09:20:16AM -0500, Yahya Abou 'Imran via Python-ideas wrote: > On top of this old proposition: > https://bugs.python.org/issue13290 > > We could have a __vars__ method that would be called by vars() if defined. > The use cases: > > 1. let vars() work with instances without __dict__; > 2. hide some attributes from the public API. I think you may have misunderstood the purpose of vars(). It isn't to be a slightly different version of dir(), instead vars() should return the object's namespace. Not a copy of the namespace, but the actual namespace used by the object. This is how vars() currently works: py> class X: ... pass ... py> obj = X() py> ns = vars(obj) py> ns['spam'] = 999 py> obj.spam 999 If vars() can return a modified copy of the namespace, that will break this functionality. -- Steve From rspeer at luminoso.com Mon Jan 22 14:32:04 2018 From: rspeer at luminoso.com (Rob Speer) Date: Mon, 22 Jan 2018 19:32:04 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: <23141.34841.479927.670393@turnbull.sk.tsukuba.ac.jp> References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> <20180121104343.GQ22500@ando.pearwood.info> <23141.34841.479927.670393@turnbull.sk.tsukuba.ac.jp> Message-ID: I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying. You quoted the part where I said "Filling in all the gaps with Latin-1", cut out the part where I said "is wrong", and replied with "that's wrong". I guess I'm glad we're in agreement, but this has been a strange bit of discourse. In this pseudocode that implements a "whatwg_error_mode", can you describe what the Python code to call it would look like? Does every call to .encode and .decode now have a "whatwg_error_mode" parameter, in addition to the "errors" parameter? Or are there twice as many possible strings you could pass as the "errors" parameter, so you can have "replace", "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc? My objection here isn't efficiency, it's adding confusing extra options to .encode() and .decode() that aren't relevant in most cases. I'd like to limit this proposal to single-byte encodings, addressing the discrepancies in the C1 characters and possibly that Hebrew vowel point. If there are differences in the JIS encodings, that is a can of worms I'd like to not open at the moment. -- Rob Speer On Mon, 22 Jan 2018 at 01:43 Stephen J. Turnbull < turnbull.stephen.fw at u.tsukuba.ac.jp> wrote: > I don't expect to change your mind about the "right" way to deal with > this, but this is a more explicit description of what those of us who > advocate error handlers are thinking about. It may be useful in > writing your PEP (PEPs describe rejected counterproposals and > amendments along with adopted proposals and rationale in either case). > > Rob Speer writes: > > > > The question to my mind is whether or not this "latin1replace" > handler, > > > in conjunction with existing codecs, will do the same thing as the > > > WHATWG codecs. If I have understood you correctly, I think it will. > Have > > > I missed something? > > > > It won't do the same thing, and neither will the "chaining coders" > > proposal. > > The "chaining coders" proposal isn't well-enough specified to be sure. > > However, for practical purposes you may think of a Python *codec* as a > "whole array" decoder/encoder, and an *error handler* as a "token-by- > token" decoder/encoder. The distinction in type is for efficiency, of > course. Codecs can't be "chained" (I think, but I didn't think very > hard), but handlers can, in the sense that each handler can handle > some input values and delegate anything it can't deal with to the next > handler in the chain (under the hood handler implementationss are just > Python functions with a particular signature, so this is just "loop > until non-None"). > > > It's easy to miss details like this in all the counterproposals. > > I see no reason why a 'whatwgreplace' error handler with the logic > > # I am assuming decoding, and single-byte encodings. Encoding > # with 'html' error mode would insert format("&#%d;", ord(unicode)). > # Multibyte is a little harder. > > # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS > # and Big5. > assert the_byte >= 0x80 > # Handle C1 control characters. > if the_byte < 0xA0: > append_to_output(chr(the_byte)) > # Handle extended repertoire with a dict. > # This condition will depend on the particular codec. > elif the_byte in additional_code_points: > append_to_output(additional_code_points[the_byte]) > # Implement WHATWG error modes. > elif whatwg_error_mode is replacement: > append_to_output("\uFFFD") > else: > raise > > doesn't have the effect you want. This can be done in pure Python. > (Note: The actions in the pseudocode are not accurate. IIRC real > handlers take a UnicodeError as argument, and return a tuple of the > text to append to output and number of input tokens to skip, or > return None to indicate an unhandled error, rather than doing the > appending and raising themselves.) > > The main objection to doing it this way would be efficiency. To be > honest, I personally don't think that's an important objection since > this handler is frequently invoked only if the source text is badly > broken. (Remember, you'll already be greatly expanding the repertoire > of at least ASCII and ISO 8859/1 by promoting to windows-1252.) And > it would surely be "fast enough" if written in C. > > Caveat: I'm not sure I agree with MAL about windows-1255. I think > it's arguable that the WHAT-WG index is a better approximation to > reality, and I'd like to hear Hebrew speakers argue about that (I'm > not one). > > > The difference between WHATWG encodings and the ones in Python is, > > in all but one case, *only* in the C1 control character range (0x80 > > to 0x9F), > > Also in Japanese, where "corporate characters" have been added > (frequently twice, preventing round-tripping ... yuck) to the JIS > standard. I haven't checked the Chinese and Korean tables for similar > damage, but they're not quite as wacky about this stuff as the JISC > is, so they're probably OK (and of course Big5 was "corporate" from > the get-go). > > > a range of Unicode characters that has historically evaded > > standardization because they never had a clear purpose even before > > Unicode. Filling in all the gaps with Latin-1 > > That's wrong, as you explain: > > > [Eg, in Greek, some code points] are simply unassigned. Other > > software sometimes maps them to the Private Use Area, but this is > > not standardized at all, and it seems clear that Python should > > handle them with its usual error handler for unassigned > > bytes. (Which is one of the reasons not to replace the error > > handler with something different: we still need the error handler.) > > The logic above handles all this. As mentioned, a stdlib error > handler ('strict', 'replace', or 'xmlcharrefreplace' for WHAT-WG > conformance, or 'surrogatereplace' for the Pythonic equivalent of > mapping to the private area) could be chained if desired, and the > defaults could be changed and the names aliased to the WHAT-WG terms. > > This could be automated with a factory function that takes a list of > predefined handlers and composes them, although that would add another > layer of inefficiency (the composition would presumably be done in a > loop, and possibly using try although I think the error handler > convention is to return the text to insert if handled, and None if the > error can't be handled). > > Steve > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Mon Jan 22 21:54:32 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 23 Jan 2018 12:54:32 +1000 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> <20180119170850.GA13584@bytereef.org> <20180119181956.GA14866@bytereef.org> Message-ID: On 20 January 2018 at 04:39, Chris Barker wrote: > So maybe the way to go is to come up with recommendations for a standard way > to do it -- maybe published by PyPa? I don't think the trade-offs here are clear enough for us to add an opinionated guide to packaging.python.org, but it could be an appropriate topic for a discussion page (e.g. "Publishing Test Suites") that frames the problem, and lays out some of the options for handling it: - tests published as part of the package (generally not ideal, but can make post-install testing easier) - tests in the sdist only (most suitable for unit tests) - tests published as a separate package (often suitable for integration tests) One of the things that makes the 3rd option a bit awkward currently is that a lot of tools assume "1 repo -> 1 sdist", so splitting your test suite out to its own sdist can be fairly annoying in practice. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From waksman at gmail.com Mon Jan 22 22:33:46 2018 From: waksman at gmail.com (George Leslie-Waksman) Date: Tue, 23 Jan 2018 03:33:46 +0000 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance Message-ID: The proposed implementation of dataclasses prevents defining fields with defaults before fields without defaults. This can create limitations on logical grouping of fields and on inheritance. Take, for example, the case: @dataclass class Foo: some_default: dict = field(default_factory=dict) @dataclass class Bar(Foo): other_field: int this results in the error: 5 @dataclass ----> 6 class Bar(Foo): 7 other_field: int 8 ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py in dataclass(_cls, init, repr, eq, order, hash, frozen) 751 752 # We're called as @dataclass, with a class. --> 753 return wrap(_cls) 754 755 ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py in wrap(cls) 743 744 def wrap(cls): --> 745 return _process_class(cls, repr, eq, order, hash, init, frozen) 746 747 # See if we're being called as @dataclass or @dataclass(). ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py in _process_class(cls, repr, eq, order, hash, init, frozen) 675 # in __init__. Use "self" if possible. 676 '__dataclass_self__' if 'self' in fields --> 677 else 'self', 678 )) 679 if repr: ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py in _init_fn(fields, frozen, has_post_init, self_name) 422 seen_default = True 423 elif seen_default: --> 424 raise TypeError(f'non-default argument {f.name!r} ' 425 'follows default argument') 426 TypeError: non-default argument 'other_field' follows default argument I understand that this is a limitation of positional arguments because the effective __init__ signature is: def __init__(self, some_default: dict = , other_field: int): However, keyword only arguments allow an entirely reasonable solution to this problem: def __init__(self, *, some_default: dict = , other_field: int): And have the added benefit of making the fields in the __init__ call entirely explicit. So, I propose the addition of a keyword_only flag to the @dataclass decorator that renders the __init__ method using keyword only arguments: @dataclass(keyword_only=True) class Bar(Foo): other_field: int --George Leslie-Waksman -------------- next part -------------- An HTML attachment was scrubbed... URL: From guettliml at thomas-guettler.de Wed Jan 24 11:46:29 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Wed, 24 Jan 2018 17:46:29 +0100 Subject: [Python-ideas] Preemptive multitasking and asyncio Message-ID: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> I found a question and answer at Stackoverflow[1] which says that asyncio/await is like cooperative multitasking. My whish is to have preemptive multitasking: The interpreter does the yielding. The software developer does not need to insert async/await keywords into its source code any more. AFAIK the erlang interpreter does something like this. I guess it is impossible to implement this, but it was somehow important for me to speak out my which. What do you think? Regards, Thomas G?ttler [1] https://stackoverflow.com/questions/38865050/is-await-in-python3-cooperative-multitasking -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From steve at pearwood.info Wed Jan 24 11:59:01 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 25 Jan 2018 03:59:01 +1100 Subject: [Python-ideas] Preemptive multitasking and asyncio In-Reply-To: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> References: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> Message-ID: <20180124165900.GV22500@ando.pearwood.info> On Wed, Jan 24, 2018 at 05:46:29PM +0100, Thomas G?ttler wrote: > I found a question and answer at Stackoverflow[1] which says > that asyncio/await is like cooperative multitasking. > > My whish is to have preemptive multitasking: The interpreter > does the yielding. Isn't that what threading and multiprocessing do? -- Steve From rosuav at gmail.com Wed Jan 24 11:59:07 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 25 Jan 2018 03:59:07 +1100 Subject: [Python-ideas] Preemptive multitasking and asyncio In-Reply-To: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> References: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> Message-ID: On Thu, Jan 25, 2018 at 3:46 AM, Thomas G?ttler wrote: > I found a question and answer at Stackoverflow[1] which says > that asyncio/await is like cooperative multitasking. "Like"? It *is* a form of co-operative multitasking. > My whish is to have preemptive multitasking: The interpreter > does the yielding. The software developer does not need to > insert async/await keywords into its source code any more. The time machine strikes again! Sounds like you want threads. Check out the threading module, but be aware that many Python interpreters, including CPython (the most commonly-used Python), have restrictions on when threads can switch contexts. Chances are it'll work for you, but if it can't, you may want to consider the multiprocessing module instead. ChrisA From prometheus235 at gmail.com Wed Jan 24 12:01:10 2018 From: prometheus235 at gmail.com (Nick Timkovich) Date: Wed, 24 Jan 2018 11:01:10 -0600 Subject: [Python-ideas] Preemptive multitasking and asyncio In-Reply-To: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> References: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> Message-ID: If I'm understanding correctly, the interpreter already does this with threads. About every 15 milliseconds the interpreter will stop a thread and see if there are any others to work on, see "Grok the GIL," blog: https://emptysqua.re/blog/grok-the-gil-fast-thread-safe-python/ or the PyCon talk: https://www.youtube.com/watch?time_continue=150&v=7SSYhuk5hmc On Wed, Jan 24, 2018 at 10:46 AM, Thomas G?ttler < guettliml at thomas-guettler.de> wrote: > I found a question and answer at Stackoverflow[1] which says > that asyncio/await is like cooperative multitasking. > > My whish is to have preemptive multitasking: The interpreter > does the yielding. The software developer does not need to > insert async/await keywords into its source code any more. > > AFAIK the erlang interpreter does something like this. > > I guess it is impossible to implement this, but it was > somehow important for me to speak out my which. > > What do you think? > > Regards, > Thomas G?ttler > > > [1] https://stackoverflow.com/questions/38865050/is-await-in- > python3-cooperative-multitasking > > > -- > Thomas Guettler http://www.thomas-guettler.de/ > I am looking for feedback: https://github.com/guettli/pro > gramming-guidelines > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From levkivskyi at gmail.com Wed Jan 24 18:22:06 2018 From: levkivskyi at gmail.com (Ivan Levkivskyi) Date: Wed, 24 Jan 2018 23:22:06 +0000 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: Message-ID: It is possible to pass init=False to the decorator on the subclass (and supply your own custom __init__, if necessary): @dataclass class Foo: some_default: dict = field(default_factory=dict) @dataclass(init=False) # This works class Bar(Foo): other_field: int -- Ivan On 23 January 2018 at 03:33, George Leslie-Waksman wrote: > The proposed implementation of dataclasses prevents defining fields with > defaults before fields without defaults. This can create limitations on > logical grouping of fields and on inheritance. > > Take, for example, the case: > > @dataclass > class Foo: > some_default: dict = field(default_factory=dict) > > @dataclass > class Bar(Foo): > other_field: int > > this results in the error: > > 5 @dataclass > ----> 6 class Bar(Foo): > 7 other_field: int > 8 > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in dataclass(_cls, init, repr, eq, order, hash, frozen) > 751 > 752 # We're called as @dataclass, with a class. > --> 753 return wrap(_cls) > 754 > 755 > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in wrap(cls) > 743 > 744 def wrap(cls): > --> 745 return _process_class(cls, repr, eq, order, hash, init, > frozen) > 746 > 747 # See if we're being called as @dataclass or @dataclass(). > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in _process_class(cls, repr, eq, order, hash, init, frozen) > 675 # in __init__. Use "self" if > possible. > 676 '__dataclass_self__' if 'self' in > fields > --> 677 else 'self', > 678 )) > 679 if repr: > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in _init_fn(fields, frozen, has_post_init, self_name) > 422 seen_default = True > 423 elif seen_default: > --> 424 raise TypeError(f'non-default argument {f.name!r} > ' > 425 'follows default argument') > 426 > > TypeError: non-default argument 'other_field' follows default argument > > I understand that this is a limitation of positional arguments because the > effective __init__ signature is: > > def __init__(self, some_default: dict = , other_field: int): > > However, keyword only arguments allow an entirely reasonable solution to > this problem: > > def __init__(self, *, some_default: dict = , other_field: int): > > And have the added benefit of making the fields in the __init__ call > entirely explicit. > > So, I propose the addition of a keyword_only flag to the @dataclass > decorator that renders the __init__ method using keyword only arguments: > > @dataclass(keyword_only=True) > class Bar(Foo): > other_field: int > > --George Leslie-Waksman > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lsy at pobox.com Wed Jan 24 18:25:09 2018 From: lsy at pobox.com (Larry Yaeger) Date: Wed, 24 Jan 2018 15:25:09 -0800 Subject: [Python-ideas] Non-intrusive debug logging Message-ID: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> Everyone uses logging during code development to help in debugging. Whether using a logging module or plain old print statements, this usually requires introducing one or (many) more lines of code into the model being worked on, making the existing, functional code more difficult to read. It is also easy to leave logging code in place accidentally when cleaning up. I have a possibly odd suggestion for dramatically improving debug logging, in an almost automated fashion, without any of the usual problems. I'm willing to jump through PEP hoops, but won't waste anyone's time arguing for the idea if others don't embrace it. In brief, make an extremely succinct comment, such as "#l" or "#dbgl", semantically meaningful. If appended to a line of code, such as: x = 1 Y = 2 Z = 3 x = y + z #l When executed the line would be logged (presumably to stdout) as something like: x=1 -> y=2 + z=3 -> 5 In one log line you see the variables and their values, the operations being performed with them, and the result of the evaluated expression. Placing "#l" on a def Function() line would apply "#l" logging to every line in that function or method. That's it. Adding these log requests would be easy and not clutter the code. Removing them reliably would be a trivial global operation in any text editor. The resulting debug information would be fully diagnostic. Of course there are lots of details and edge cases to consider. There always are. For example, what if x had not been previously defined? A reasonable answer: x=. -> y=2 + z=3 -> 5 In every case the solution should minimize volume of output while maximizing logged information. One can debate the precise formatting of the output. The text used for the semantically meaningful comment string is open to debate. The output pipe could be made controllable. There are natural extensions to allow periodic and conditional logging. I have ideas for reducing (already unlikely) collisions with existing comments. But I want to keep this simple for now, and I think this captures the core of the idea. For your consideration. - larryy From gadgetsteve at live.co.uk Thu Jan 25 00:54:04 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Thu, 25 Jan 2018 05:54:04 +0000 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> Message-ID: On 24/01/2018 23:25, Larry Yaeger wrote: > Everyone uses logging during code development to help in debugging. Whether using a logging module or plain old print statements, this usually requires introducing one or (many) more lines of code into the model being worked on, making the existing, functional code more difficult to read. It is also easy to leave logging code in place accidentally when cleaning up. I have a possibly odd suggestion for dramatically improving debug logging, in an almost automated fashion, without any of the usual problems. I'm willing to jump through PEP hoops, but won't waste anyone's time arguing for the idea if others don't embrace it. > > In brief, make an extremely succinct comment, such as "#l" or "#dbgl", semantically meaningful. If appended to a line of code, such as: > > x = 1 > Y = 2 > Z = 3 > x = y + z #l > > When executed the line would be logged (presumably to stdout) as something like: > > x=1 -> y=2 + z=3 -> 5 > > In one log line you see the variables and their values, the operations being performed with them, and the result of the evaluated expression. > > Placing "#l" on a def Function() line would apply "#l" logging to every line in that function or method. > > That's it. Adding these log requests would be easy and not clutter the code. Removing them reliably would be a trivial global operation in any text editor. The resulting debug information would be fully diagnostic. > > Of course there are lots of details and edge cases to consider. There always are. For example, what if x had not been previously defined? A reasonable answer: > > x=. -> y=2 + z=3 -> 5 > > In every case the solution should minimize volume of output while maximizing logged information. > > One can debate the precise formatting of the output. The text used for the semantically meaningful comment string is open to debate. The output pipe could be made controllable. There are natural extensions to allow periodic and conditional logging. I have ideas for reducing (already unlikely) collisions with existing comments. But I want to keep this simple for now, and I think this captures the core of the idea. > > For your consideration. > > - larryy > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fpython-ideas&data=02%7C01%7C%7C1833282b2db342f6476308d563825af0%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636524333870912999&sdata=K5Qd3iat2NG4h%2F1rgIf4C9FOmv8%2Fa%2FUN23I3jOiW0PU%3D&reserved=0 > Code of Conduct: https://eur02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpython.org%2Fpsf%2Fcodeofconduct%2F&data=02%7C01%7C%7C1833282b2db342f6476308d563825af0%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636524333870912999&sdata=TcZ2QLkZ0rDntMEQvouNFFnzn7HBBBT64cb%2ByrdIDj8%3D&reserved=0 > I thing that this idea has some merit, obviously with some details to be worked out as well. I would suggest, however, that if this feature is introduced it be controlled via a run-time switch &/or environment variable which defaults to off. Then rather than the developer having to do global replaces they simply turn off the switch (or reduce it to zero). It may also be preferable to use decorators rather than syntactically significant comments, (personally I don't like the latter having had too many bad experiences with them). -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From waksman at gmail.com Thu Jan 25 01:38:54 2018 From: waksman at gmail.com (George Leslie-Waksman) Date: Thu, 25 Jan 2018 06:38:54 +0000 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: Message-ID: It may be possible but it makes for pretty leaky abstractions and it's unclear what that custom __init__ should look like. How am I supposed to know what the replacement for default_factory is? Moreover, suppose I want one base class with an optional argument and a half dozen subclasses each with their own required argument. At that point, I have to write the same __init__ function a half dozen times. It feels rather burdensome for the user when an additional flag (say "kw_only=True") and a modification to: https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294 that inserted `['*']` after `[self_name]` if the flag is specified could ameliorate this entire issue. On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi wrote: > It is possible to pass init=False to the decorator on the subclass (and > supply your own custom __init__, if necessary): > > @dataclass > class Foo: > some_default: dict = field(default_factory=dict) > > @dataclass(init=False) # This works > class Bar(Foo): > other_field: int > > -- > Ivan > > > > On 23 January 2018 at 03:33, George Leslie-Waksman > wrote: > >> The proposed implementation of dataclasses prevents defining fields with >> defaults before fields without defaults. This can create limitations on >> logical grouping of fields and on inheritance. >> >> Take, for example, the case: >> >> @dataclass >> class Foo: >> some_default: dict = field(default_factory=dict) >> >> @dataclass >> class Bar(Foo): >> other_field: int >> >> this results in the error: >> >> 5 @dataclass >> ----> 6 class Bar(Foo): >> 7 other_field: int >> 8 >> >> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >> in dataclass(_cls, init, repr, eq, order, hash, frozen) >> 751 >> 752 # We're called as @dataclass, with a class. >> --> 753 return wrap(_cls) >> 754 >> 755 >> >> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >> in wrap(cls) >> 743 >> 744 def wrap(cls): >> --> 745 return _process_class(cls, repr, eq, order, hash, init, >> frozen) >> 746 >> 747 # See if we're being called as @dataclass or @dataclass(). >> >> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >> in _process_class(cls, repr, eq, order, hash, init, frozen) >> 675 # in __init__. Use "self" if >> possible. >> 676 '__dataclass_self__' if 'self' in >> fields >> --> 677 else 'self', >> 678 )) >> 679 if repr: >> >> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >> in _init_fn(fields, frozen, has_post_init, self_name) >> 422 seen_default = True >> 423 elif seen_default: >> --> 424 raise TypeError(f'non-default argument {f.name!r} >> ' >> 425 'follows default argument') >> 426 >> >> TypeError: non-default argument 'other_field' follows default argument >> >> I understand that this is a limitation of positional arguments because >> the effective __init__ signature is: >> >> def __init__(self, some_default: dict = , other_field: int): >> >> However, keyword only arguments allow an entirely reasonable solution to >> this problem: >> >> def __init__(self, *, some_default: dict = , other_field: int): >> >> And have the added benefit of making the fields in the __init__ call >> entirely explicit. >> >> So, I propose the addition of a keyword_only flag to the @dataclass >> decorator that renders the __init__ method using keyword only arguments: >> >> @dataclass(keyword_only=True) >> class Bar(Foo): >> other_field: int >> >> --George Leslie-Waksman >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guettliml at thomas-guettler.de Thu Jan 25 02:37:18 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Thu, 25 Jan 2018 08:37:18 +0100 Subject: [Python-ideas] Preemptive multitasking and asyncio In-Reply-To: <20180124165900.GV22500@ando.pearwood.info> References: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> <20180124165900.GV22500@ando.pearwood.info> Message-ID: <9a747bde-59e8-72ba-9029-ebf8516d68c7@thomas-guettler.de> Am 24.01.2018 um 17:59 schrieb Steven D'Aprano: > On Wed, Jan 24, 2018 at 05:46:29PM +0100, Thomas G?ttler wrote: >> I found a question and answer at Stackoverflow[1] which says >> that asyncio/await is like cooperative multitasking. >> >> My whish is to have preemptive multitasking: The interpreter >> does the yielding. > > Isn't that what threading and multiprocessing do? Hmmm, yes, you are right. I guess I have not understood something up to now. If async/await is the answer. What was the question? AFAIK it can help to solve the c10k problem. If I don't have the c10k problem, then what is the benefit of async/await? If it is only about IO, there are python bindings for libraries like libuv. I guess this whole topic is too ...(I am missing the right term here)... for average developers like me. Regards, Thomas G?ttler -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From sf at fermigier.com Thu Jan 25 05:44:55 2018 From: sf at fermigier.com (=?UTF-8?Q?St=C3=A9fane_Fermigier?=) Date: Thu, 25 Jan 2018 11:44:55 +0100 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> Message-ID: Some thoughts: 1. I too dislikes the idea of using comments as semantically significant annotations. I think it's quite OK for annotation that are aimed at external tools (e.g. '# nocover' or '# noqa') but not for runtime behavior. 2. It's probably possible do do interesting things using decorators at the function / class level. The "q" project (https://pypi.python.org/pypi/q) already does some useful things in that direction. Still, when using q (or another decorator-based approach), you need to first 'import q', which means that you can easily end up with spurious 'import q' in your code, after you're done debugging. Or even worse, commit them but forget to add 'q' in your setup.py / requirements.txt / Pipfile, and break CI or production. 3. Doing things unintrusively at the statement level seems much harder. S. On Thu, Jan 25, 2018 at 6:54 AM, Steve Barnes wrote: > > > On 24/01/2018 23:25, Larry Yaeger wrote: > > Everyone uses logging during code development to help in debugging. > Whether using a logging module or plain old print statements, this usually > requires introducing one or (many) more lines of code into the model being > worked on, making the existing, functional code more difficult to read. It > is also easy to leave logging code in place accidentally when cleaning up. > I have a possibly odd suggestion for dramatically improving debug logging, > in an almost automated fashion, without any of the usual problems. I'm > willing to jump through PEP hoops, but won't waste anyone's time arguing > for the idea if others don't embrace it. > > > > In brief, make an extremely succinct comment, such as "#l" or "#dbgl", > semantically meaningful. If appended to a line of code, such as: > > > > x = 1 > > Y = 2 > > Z = 3 > > x = y + z #l > > > > When executed the line would be logged (presumably to stdout) as > something like: > > > > x=1 -> y=2 + z=3 -> 5 > > > > In one log line you see the variables and their values, the operations > being performed with them, and the result of the evaluated expression. > > > > Placing "#l" on a def Function() line would apply "#l" logging to every > line in that function or method. > > > > That's it. Adding these log requests would be easy and not clutter the > code. Removing them reliably would be a trivial global operation in any > text editor. The resulting debug information would be fully diagnostic. > > > > Of course there are lots of details and edge cases to consider. There > always are. For example, what if x had not been previously defined? A > reasonable answer: > > > > x=. -> y=2 + z=3 -> 5 > > > > In every case the solution should minimize volume of output while > maximizing logged information. > > > > One can debate the precise formatting of the output. The text used for > the semantically meaningful comment string is open to debate. The output > pipe could be made controllable. There are natural extensions to allow > periodic and conditional logging. I have ideas for reducing (already > unlikely) collisions with existing comments. But I want to keep this > simple for now, and I think this captures the core of the idea. > > > > For your consideration. > > > > - larryy > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://eur02.safelinks.protection.outlook.com/?url= > https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fpython- > ideas&data=02%7C01%7C%7C1833282b2db342f6476308d563825af0% > 7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636524333870912999&sdata= > K5Qd3iat2NG4h%2F1rgIf4C9FOmv8%2Fa%2FUN23I3jOiW0PU%3D&reserved=0 > > Code of Conduct: https://eur02.safelinks.protection.outlook.com/?url= > http%3A%2F%2Fpython.org%2Fpsf%2Fcodeofconduct%2F&data=02%7C01%7C% > 7C1833282b2db342f6476308d563825af0%7C84df9e7fe9f640afb435aaaaaaaa > aaaa%7C1%7C0%7C636524333870912999&sdata=TcZ2QLkZ0rDntMEQvouNFFnzn7HBBB > T64cb%2ByrdIDj8%3D&reserved=0 > > > > I thing that this idea has some merit, obviously with some details to be > worked out as well. > > I would suggest, however, that if this feature is introduced it be > controlled via a run-time switch &/or environment variable which > defaults to off. Then rather than the developer having to do global > replaces they simply turn off the switch (or reduce it to zero). > > It may also be preferable to use decorators rather than syntactically > significant comments, (personally I don't like the latter having had too > many bad experiences with them). > > -- > Steve (Gadget) Barnes > Any opinions in this message are my personal opinions and do not reflect > those of my employer. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Stefane Fermigier - http://fermigier.com/ - http://twitter.com/sfermigier - http://linkedin.com/in/sfermigier Founder & CEO, Abilian - Enterprise Social Software - http://www.abilian.com/ Chairman, Free&OSS Group / Systematic Cluster - http://www.gt-logiciel-libre.org/ Co-Chairman, National Council for Free & Open Source Software (CNLL) - http://cnll.fr/ Founder & Organiser, PyData Paris - http://pydata.fr/ --- ?You never change things by ?ghting the existing reality. To change something, build a new model that makes the existing model obsolete.? ? R. Buckminster Fuller -------------- next part -------------- An HTML attachment was scrubbed... URL: From yahya-abou-imran at protonmail.com Thu Jan 25 06:11:24 2018 From: yahya-abou-imran at protonmail.com (Yahya Abou 'Imran) Date: Thu, 25 Jan 2018 06:11:24 -0500 Subject: [Python-ideas] __vars__ special method In-Reply-To: <20180122174144.GT22500@ando.pearwood.info> References: <20180122174144.GT22500@ando.pearwood.info> Message-ID: > I think you may have misunderstood the purpose of vars(). It isn't to be > a slightly different version of dir(), instead vars() should return the > object's namespace. Not a copy of the namespace, but the actual > namespace used by the object. > > This is how vars() currently works: > > py> class X: >... pass > ... > py> obj = X() >py> ns = vars(obj) >py> ns['spam'] = 999 >py> obj.spam >999 > > > If vars() can return a modified copy of the namespace, that will break > this functionality. > This is not always true, e.g. for classes vars() returns a mappingproxy. >From Doc: "Objects such as modules and instances have an updateable __dict__ attribute; however, other objects may have write restrictions on their __dict__ attributes (for example, classes use a types.MappingProxyType to prevent direct dictionary updates)." https://docs.python.org/3.6/library/functions.html#vars But you're right: it's misleading to return a RW mapping wich is a fake namespace... In the above examples you could just return a mappingproxy. If you want to support this feature you could use composition: class C: def __init__(self): self.publicns = {} # or: self.proxyattr = MyProxyClass() def __vars__(self): return self.publicns # or: return self.proxyattr.__dict__ This namespace will be updateable, and it let you distinguish between the namespace you want to expose to your clients without compromosing the real one. Of course, the real one could always be accessible via __dict__ (if present). From eric at trueblade.com Thu Jan 25 08:12:38 2018 From: eric at trueblade.com (Eric V. Smith) Date: Thu, 25 Jan 2018 08:12:38 -0500 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: Message-ID: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> I'm not completely opposed to this feature. But there are some cases to consider. Here's the first one that occurs to me: note that due to the way dataclasses work, it would need to be used everywhere down an inheritance hierarchy. That is, if an intermediate base class required it, all class derived from that intermediate base would need to specify it, too. That's because each class just makes decisions based on its fields and its base classes' fields, and not on any flags attached to the base class. As it's currently implemented, a class doesn't remember any of the decorator's arguments, so there's no way to look for this information, anyway. I think there are enough issues here that it's not going to make it in to 3.7. It would require getting a firm proposal together, selling the idea on python-dev, and completing the implementation before Monday. But if you want to try, I'd participate in the discussion. Taking Ivan's suggestion one step further, a way to do this currently is to pass init=False and then write another decorator that adds the kw-only __init__. So the usage would be: @dataclass class Foo: some_default: dict = field(default_factory=dict) @kw_only_init @dataclass(init=False) class Bar(Foo): other_field: int kw_only_init(cls) would look at fields(cls) and construct the __init__. It would be a hassle to re-implement dataclasses's _init_fn function, but it could be made to work (in reality, of course, you'd just copy it and hack it up to do what you want). You'd also need to use some private knowledge of InitVars if you wanted to support them (the stock fields(cls) doesn't return them). For 3.8 we can consider changing dataclasses's APIs if we want to add this. Eric. On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: > It may be possible but it makes for pretty leaky abstractions and it's > unclear what that custom __init__ should look like. How am I supposed to > know what the replacement for default_factory is? > > Moreover, suppose I want one base class with an optional argument and a > half dozen subclasses each with their own required argument. At that > point, I have to write the same __init__ function a half dozen times. > > It feels rather burdensome for the user when an additional flag (say > "kw_only=True") and a modification to: > https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294?that > inserted `['*']` after `[self_name]` if the flag is specified could > ameliorate this entire issue. > > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi > wrote: > > It is possible to pass init=False to the decorator on the subclass > (and supply your own custom __init__, if necessary): > > @dataclass > class Foo: > ??? some_default: dict = field(default_factory=dict) > > @dataclass(init=False) # This works > class Bar(Foo): > ??? other_field: int > > -- > Ivan > > > > On 23 January 2018 at 03:33, George Leslie-Waksman > > wrote: > > The proposed implementation of dataclasses prevents defining > fields with defaults before fields without defaults. This can > create limitations on logical grouping of fields and on inheritance. > > Take, for example, the case: > > @dataclass > class Foo: > ? ? some_default: dict = field(default_factory=dict) > > @dataclass > class Bar(Foo): > ? ? other_field: int > > this results in the error: > > ? ? ? 5 @dataclass > ----> 6 class Bar(Foo): > ? ? ? 7? ? ?other_field: int > ? ? ? 8 > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in dataclass(_cls, init, repr, eq, order, hash, frozen) > ? ? 751 > ? ? 752? ? ?# We're called as @dataclass, with a class. > --> 753? ? ?return wrap(_cls) > ? ? 754 > ? ? 755 > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in wrap(cls) > ? ? 743 > ? ? 744? ? ?def wrap(cls): > --> 745? ? ? ? ?return _process_class(cls, repr, eq, order, > hash, init, frozen) > ? ? 746 > ? ? 747? ? ?# See if we're being called as @dataclass or > @dataclass(). > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in _process_class(cls, repr, eq, order, hash, init, frozen) > ? ? 675? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?#? in __init__.? Use > "self" if possible. > ? ? 676? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?'__dataclass_self__' if > 'self' in fields > --> 677? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?else 'self', > ? ? 678? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?)) > ? ? 679? ? ?if repr: > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > in _init_fn(fields, frozen, has_post_init, self_name) > ? ? 422? ? ? ? ? ? ? ? ?seen_default = True > ? ? 423? ? ? ? ? ? ?elif seen_default: > --> 424? ? ? ? ? ? ? ? ?raise TypeError(f'non-default argument > {f.name !r} ' > ? ? 425? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?'follows default argument') > ? ? 426 > > TypeError: non-default argument 'other_field' follows default > argument > > I understand that this is a limitation of positional arguments > because the effective __init__ signature is: > > def __init__(self, some_default: dict = , > other_field: int): > > However, keyword only arguments allow an entirely reasonable > solution to this problem: > > def __init__(self, *, some_default: dict = , > other_field: int): > > And have the added benefit of making the fields in the __init__ > call entirely explicit. > > So, I propose the addition of a keyword_only flag to the > @dataclass decorator that renders the __init__ method using > keyword only arguments: > > @dataclass(keyword_only=True) > class Bar(Foo): > ? ? other_field: int > > --George Leslie-Waksman > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > From mehaase at gmail.com Thu Jan 25 09:37:12 2018 From: mehaase at gmail.com (Mark E. Haase) Date: Thu, 25 Jan 2018 09:37:12 -0500 Subject: [Python-ideas] Preemptive multitasking and asyncio In-Reply-To: <9a747bde-59e8-72ba-9029-ebf8516d68c7@thomas-guettler.de> References: <802ca617-060e-a797-8b15-9c5c27509e68@thomas-guettler.de> <20180124165900.GV22500@ando.pearwood.info> <9a747bde-59e8-72ba-9029-ebf8516d68c7@thomas-guettler.de> Message-ID: Explicit yield points makes reasoning about multitasking much easier. The author of Twisted wrote an excellent post about this.[1] I agree that the Python documentation doesn't do justice to the question, "when should I use asyncio instead of threads?" [1] https://glyph.twistedmatrix.com/2014/02/unyielding.html On Thu, Jan 25, 2018 at 2:37 AM, Thomas G?ttler < guettliml at thomas-guettler.de> wrote: > > > Am 24.01.2018 um 17:59 schrieb Steven D'Aprano: > >> On Wed, Jan 24, 2018 at 05:46:29PM +0100, Thomas G?ttler wrote: >> >>> I found a question and answer at Stackoverflow[1] which says >>> that asyncio/await is like cooperative multitasking. >>> >>> My whish is to have preemptive multitasking: The interpreter >>> does the yielding. >>> >> >> Isn't that what threading and multiprocessing do? >> > > Hmmm, yes, you are right. > > I guess I have not understood something up to now. > If async/await is the answer. What was the question? > AFAIK it can help to solve the c10k problem. > > If I don't have the c10k problem, then what is the benefit of async/await? > > If it is only about IO, there are python bindings for libraries like libuv. > > I guess this whole topic is too ...(I am missing the right term here)... > for > average developers like me. > > Regards, > Thomas G?ttler > > > -- > Thomas Guettler http://www.thomas-guettler.de/ > I am looking for feedback: https://github.com/guettli/pro > gramming-guidelines > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Jan 25 11:13:17 2018 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Thu, 25 Jan 2018 08:13:17 -0800 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> Message-ID: This strikes me as something a debugger should do, rather than the regular interpreter. And using comment-based syntax means that they would get ignored by the regular interpreter? which is exactly what you want. As for a decoration approach? that wouldn?t let you do anything on a line by line basis. -CHB Sent from my iPhone On Jan 25, 2018, at 2:44 AM, St?fane Fermigier wrote: Some thoughts: 1. I too dislikes the idea of using comments as semantically significant annotations. I think it's quite OK for annotation that are aimed at external tools (e.g. '# nocover' or '# noqa') but not for runtime behavior. 2. It's probably possible do do interesting things using decorators at the function / class level. The "q" project (https://pypi.python.org/pypi/q) already does some useful things in that direction. Still, when using q (or another decorator-based approach), you need to first 'import q', which means that you can easily end up with spurious 'import q' in your code, after you're done debugging. Or even worse, commit them but forget to add 'q' in your setup.py / requirements.txt / Pipfile, and break CI or production. 3. Doing things unintrusively at the statement level seems much harder. S. On Thu, Jan 25, 2018 at 6:54 AM, Steve Barnes wrote: > > > On 24/01/2018 23:25, Larry Yaeger wrote: > > Everyone uses logging during code development to help in debugging. > Whether using a logging module or plain old print statements, this usually > requires introducing one or (many) more lines of code into the model being > worked on, making the existing, functional code more difficult to read. It > is also easy to leave logging code in place accidentally when cleaning up. > I have a possibly odd suggestion for dramatically improving debug logging, > in an almost automated fashion, without any of the usual problems. I'm > willing to jump through PEP hoops, but won't waste anyone's time arguing > for the idea if others don't embrace it. > > > > In brief, make an extremely succinct comment, such as "#l" or "#dbgl", > semantically meaningful. If appended to a line of code, such as: > > > > x = 1 > > Y = 2 > > Z = 3 > > x = y + z #l > > > > When executed the line would be logged (presumably to stdout) as > something like: > > > > x=1 -> y=2 + z=3 -> 5 > > > > In one log line you see the variables and their values, the operations > being performed with them, and the result of the evaluated expression. > > > > Placing "#l" on a def Function() line would apply "#l" logging to every > line in that function or method. > > > > That's it. Adding these log requests would be easy and not clutter the > code. Removing them reliably would be a trivial global operation in any > text editor. The resulting debug information would be fully diagnostic. > > > > Of course there are lots of details and edge cases to consider. There > always are. For example, what if x had not been previously defined? A > reasonable answer: > > > > x=. -> y=2 + z=3 -> 5 > > > > In every case the solution should minimize volume of output while > maximizing logged information. > > > > One can debate the precise formatting of the output. The text used for > the semantically meaningful comment string is open to debate. The output > pipe could be made controllable. There are natural extensions to allow > periodic and conditional logging. I have ideas for reducing (already > unlikely) collisions with existing comments. But I want to keep this > simple for now, and I think this captures the core of the idea. > > > > For your consideration. > > > > - larryy > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://eur02.safelinks.protection.outlook.com/?url= > https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fpython- > ideas&data=02%7C01%7C%7C1833282b2db342f6476308d563825af0% > 7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636524333870912999&sdata= > K5Qd3iat2NG4h%2F1rgIf4C9FOmv8%2Fa%2FUN23I3jOiW0PU%3D&reserved=0 > > Code of Conduct: https://eur02.safelinks.protection.outlook.com/?url= > http%3A%2F%2Fpython.org%2Fpsf%2Fcodeofconduct%2F&data=02%7C01%7C% > 7C1833282b2db342f6476308d563825af0%7C84df9e7fe9f640afb435aaaaaaaa > aaaa%7C1%7C0%7C636524333870912999&sdata=TcZ2QLkZ0rDntMEQvouNFFnzn7HBBB > T64cb%2ByrdIDj8%3D&reserved=0 > > > > I thing that this idea has some merit, obviously with some details to be > worked out as well. > > I would suggest, however, that if this feature is introduced it be > controlled via a run-time switch &/or environment variable which > defaults to off. Then rather than the developer having to do global > replaces they simply turn off the switch (or reduce it to zero). > > It may also be preferable to use decorators rather than syntactically > significant comments, (personally I don't like the latter having had too > many bad experiences with them). > > -- > Steve (Gadget) Barnes > Any opinions in this message are my personal opinions and do not reflect > those of my employer. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Stefane Fermigier - http://fermigier.com/ - http://twitter.com/sfermigier - http://linkedin.com/in/sfermigier Founder & CEO, Abilian - Enterprise Social Software - http://www.abilian.com/ Chairman, Free&OSS Group / Systematic Cluster - http://www.gt-logiciel-libre.org/ Co-Chairman, National Council for Free & Open Source Software (CNLL) - http://cnll.fr/ Founder & Organiser, PyData Paris - http://pydata.fr/ --- ?You never change things by ?ghting the existing reality. To change something, build a new model that makes the existing model obsolete.? ? R. Buckminster Fuller _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From phd at phdru.name Thu Jan 25 11:28:23 2018 From: phd at phdru.name (Oleg Broytman) Date: Thu, 25 Jan 2018 17:28:23 +0100 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> Message-ID: <20180125162823.GA31213@phdru.name> On Thu, Jan 25, 2018 at 11:44:55AM +0100, St??fane Fermigier wrote: > 1. I too dislikes the idea of using comments as semantically significant > annotations. > > I think it's quite OK for annotation that are aimed at external tools (e.g. > '# nocover' or '# noqa') but not for runtime behavior. That is, you don't like ``# coding:`` directive? ;-) Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From greg.ewing at canterbury.ac.nz Thu Jan 25 16:03:26 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 26 Jan 2018 10:03:26 +1300 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> Message-ID: <5A6A461E.8060703@canterbury.ac.nz> Steve Barnes wrote: > I would suggest, however, that if this feature is introduced it be > controlled via a run-time switch &/or environment variable which > defaults to off. I disagreew with defaulting it to off. That would encourage lazy developers to distribute library code full of #l lines, so that when you turn it on to debug something of your own, you get swamped with someone else's debugging messages. -- Greg From joejev at gmail.com Thu Jan 25 17:41:20 2018 From: joejev at gmail.com (Joseph Jevnik) Date: Thu, 25 Jan 2018 17:41:20 -0500 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: <5A6A461E.8060703@canterbury.ac.nz> References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> <5A6A461E.8060703@canterbury.ac.nz> Message-ID: This can be accomplished as a decorator. Jim Crist wrote a version of this using the codetransformer library. The usage is pretty simple: @trace() def sum_word_lengths(words): total = 0 for w in words: word_length = len(w) total += word_length return total >>> sum_word_lengths(['apple', 'banana', 'pear', 'orange']) total = 0 w = 'apple' word_length = 5 total = 5 w = 'banana' word_length = 6 total = 11 w = 'pear' word_length = 4 total = 15 w = 'orange' word_length = 6 total = 21 21 The source for the trace decorator is in this available here: https://gist.github.com/jcrist/2b97c9bcc0b95caa73ce (I can't figure out how to link to a section, it is under "Tracing"). On Thu, Jan 25, 2018 at 4:03 PM, Greg Ewing wrote: > Steve Barnes wrote: >> >> I would suggest, however, that if this feature is introduced it be >> controlled via a run-time switch &/or environment variable which defaults to >> off. > > > I disagreew with defaulting it to off. That would encourage > lazy developers to distribute library code full of #l lines, > so that when you turn it on to debug something of your own, > you get swamped with someone else's debugging messages. > > -- > Greg > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From prometheus235 at gmail.com Thu Jan 25 18:09:14 2018 From: prometheus235 at gmail.com (Nick Timkovich) Date: Thu, 25 Jan 2018 17:09:14 -0600 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: <5A6A461E.8060703@canterbury.ac.nz> References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> <5A6A461E.8060703@canterbury.ac.nz> Message-ID: I think part of the reason that logging appears complicated is because logging actually is complicated. In the myriad different contexts a Python program runs (daemon, command line tool, interactively), the logging output should be going to all sorts of different places. Thus was born handlers. If you want "all the logs", do you really want all the logs from some library and the library it calls? Thus was born filters. For better or worse, the "line cost" of a logging call encourages them to be useful. That said, I'd maybe like a plugin for my editor that could hide all logging statements for some "mature" projects so I could try to see the control flow a bit better. On Thu, Jan 25, 2018 at 3:03 PM, Greg Ewing wrote: > Steve Barnes wrote: > >> I would suggest, however, that if this feature is introduced it be >> controlled via a run-time switch &/or environment variable which defaults >> to off. >> > > I disagreew with defaulting it to off. That would encourage > lazy developers to distribute library code full of #l lines, > so that when you turn it on to debug something of your own, > you get swamped with someone else's debugging messages. > > -- > Greg > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c at anthonyrisinger.com Thu Jan 25 20:06:18 2018 From: c at anthonyrisinger.com (C Anthony Risinger) Date: Thu, 25 Jan 2018 19:06:18 -0600 Subject: [Python-ideas] Non-intrusive debug logging In-Reply-To: References: <9FDAEAB0-CB6B-4FE4-B2A3-7A13A4C20902@pobox.com> <5A6A461E.8060703@canterbury.ac.nz> Message-ID: On Thu, Jan 25, 2018 at 5:09 PM, Nick Timkovich wrote: > I think part of the reason that logging appears complicated is because > logging actually is complicated. In the myriad different contexts a Python > program runs (daemon, command line tool, interactively), the logging output > should be going to all sorts of different places. Thus was born handlers. > If you want "all the logs", do you really want all the logs from some > library and the library it calls? Thus was born filters. > > For better or worse, the "line cost" of a logging call encourages them to > be useful. > I think that last bit is the OP's primary ask. Truly great and useful logs are genuinely hard to write. They want a cross-cutting, differentially updated context that no single section of code a) cares about or b) willingly wants to incur the costs of... especially when unused. In my mind, the most significant barrier to fantastic logging -- DEBUG in particular -- is you must 100% understand, ahead-of-time, which data and in which situations will yield solutions to unknown future problems, and then must limit that extra data to relevant inputs only (eg. only DEBUG logging a specific user or condition), ideally for a defined capture window, so you avoid writing 100MBps to /dev/null all day long. While this proposal does limit the line noise I don't think it makes logging any more accessible, or useful. The desire to tap into a running program and dynamically inspect useful data at the time it's needed is what led to this: Dynamic logging after the fact (injects "logging call sites" into a target function's __code__) https://pypi.python.org/pypi/retrospect/0.1.4 It never went beyond a basic POC and it's not meant to be a plug, only another point of interest. Instead of explicitly calling out to some logging library every 0.1 lines of code, I'd rather attach "something" to a function, as needed, and tell it what I am interested in (function args, function return, symbols, etc). This makes logging more like typing, where you could even move logging information to a stub file of sorts and bind it with the application, using it... or not! Sort of like a per-function sys.settrace. I believe this approach (ie. with something like __code__.settrace) could fulfill the OP's original ask with additional possibilities. -- C Anthony -------------- next part -------------- An HTML attachment was scrubbed... URL: From dancollins34 at gmail.com Thu Jan 25 23:38:16 2018 From: dancollins34 at gmail.com (Daniel Collins) Date: Thu, 25 Jan 2018 23:38:16 -0500 Subject: [Python-ideas] .then execution of actions following a future's completion Message-ID: Hello all, So, first time posting here. I?ve been bothered for a while about the lack of the ability to chain futures in python, such that the next future will execute upon the first?s completion. So I submitted a pr to do this. This would add the .then(self, fn) method to concurrent.futures.Future. Thoughts? -dancollins34 Github PR #5335 bugs.python.org issue #32672 -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Jan 26 01:07:09 2018 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Jan 2018 22:07:09 -0800 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: Message-ID: I really don't want to distract Yury with this. Let's consider this (or something that addresses the same need) for 3.8. To be clear this is meant as a feature for concurrent.futures.Future, not for asyncio.Future. (It's a bit confusing since you also change asyncio.) Also to be honest I don't understand the use case *or* the semantics very well. You have some explaining to do... (Also, full links: https://bugs.python.org/issue32672; https://github.com/python/cpython/pull/5335) On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins wrote: > Hello all, > > So, first time posting here. I?ve been bothered for a while about the lack > of the ability to chain futures in python, such that the next future will > execute upon the first?s completion. So I submitted a pr to do this. This > would add the .then(self, fn) method to concurrent.futures.Future. > Thoughts? > > -dancollins34 > > Github PR #5335 > bugs.python.org issue #32672 > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From bzvi7919 at gmail.com Fri Jan 26 03:16:26 2018 From: bzvi7919 at gmail.com (Bar Harel) Date: Fri, 26 Jan 2018 08:16:26 +0000 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: Message-ID: I have a simple way to solve this I believe. Why not just expose "_chain_future()" from asyncio/futures.py to the public, instead of copying and pasting parts of it? It already works, being used everywhere in the stdlib, it supports both asyncio and concurrent.futures, it's an easily testable external function (follows the design of asyncio for the better part), it's threadsafe right out of the box and it wouldn't require anything but removing a single underscore and adding documentation. (I always wondered why was it private anyway) It's like the function was meant to be public :-P -- Bar On Fri, Jan 26, 2018, 8:07 AM Guido van Rossum wrote: > I really don't want to distract Yury with this. Let's consider this (or > something that addresses the same need) for 3.8. > > To be clear this is meant as a feature for concurrent.futures.Future, not > for asyncio.Future. (It's a bit confusing since you also change asyncio.) > > Also to be honest I don't understand the use case *or* the semantics very > well. You have some explaining to do... > > (Also, full links: https://bugs.python.org/issue32672; > https://github.com/python/cpython/pull/5335) > > On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins > wrote: > >> Hello all, >> >> So, first time posting here. I?ve been bothered for a while about the >> lack of the ability to chain futures in python, such that the next future >> will execute upon the first?s completion. So I submitted a pr to do this. >> This would add the .then(self, fn) method to concurrent.futures.Future. >> Thoughts? >> >> -dancollins34 >> >> Github PR #5335 >> bugs.python.org issue #32672 >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> > > > -- > --Guido van Rossum (python.org/~guido) > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From songofacandy at gmail.com Fri Jan 26 03:42:31 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Fri, 26 Jan 2018 17:42:31 +0900 Subject: [Python-ideas] Adding str.isascii() ? Message-ID: Hi. Currently, int(), str.isdigit(), str.isalnum(), etc... accepts non-ASCII strings. >>> s = ???" >>> s '???' >>> s.isdigit() True >>> print(ascii(s)) '\uff11\uff12\uff13' >>> int(s) 123 But sometimes, we want to accept only ascii string. For example, ipaddress module uses: _DECIMAL_DIGITS = frozenset('0123456789') ... if _DECIMAL_DIGITS.issuperset(str): ref: https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Lib/ipaddress.py#L491-L494 If str has str.isascii() method, it can be simpler: `if s.isascii() and s.isdigit():` I want to add it in Python 3.7 if there are no opposite opinions. Regrads, -- INADA Naoki From rosuav at gmail.com Fri Jan 26 03:53:41 2018 From: rosuav at gmail.com (Chris Angelico) Date: Fri, 26 Jan 2018 19:53:41 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: On Fri, Jan 26, 2018 at 7:42 PM, INADA Naoki wrote: > Hi. > > Currently, int(), str.isdigit(), str.isalnum(), etc... accepts > non-ASCII strings. > >>>> s = ???" >>>> s > '???' >>>> s.isdigit() > True >>>> print(ascii(s)) > '\uff11\uff12\uff13' >>>> int(s) > 123 > > But sometimes, we want to accept only ascii string. For example, > ipaddress module uses: > > _DECIMAL_DIGITS = frozenset('0123456789') > ... > if _DECIMAL_DIGITS.issuperset(str): > > ref: https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Lib/ipaddress.py#L491-L494 > > If str has str.isascii() method, it can be simpler: > > `if s.isascii() and s.isdigit():` > > I want to add it in Python 3.7 if there are no opposite opinions. > I'm not sure that the decimal-digit check is actually improved by this, but nonetheless, I am in favour of this feature. In CPython, this method can simply look at the object headers to see if it has the 'ascii' flag set; otherwise, it'd be effectively equivalent to: def isascii(self): return ord(max(self)) < 128 Would be handy when working with semi-textual protocols, where ASCII text is trivially encoded, but non-ASCII text may require negotiation or a protocol header. ChrisA From mal at egenix.com Fri Jan 26 04:16:41 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 10:16:41 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> On 26.01.2018 09:53, Chris Angelico wrote: > On Fri, Jan 26, 2018 at 7:42 PM, INADA Naoki wrote: >> Hi. >> >> Currently, int(), str.isdigit(), str.isalnum(), etc... accepts >> non-ASCII strings. >> >>>>> s = ???" >>>>> s >> '???' >>>>> s.isdigit() >> True >>>>> print(ascii(s)) >> '\uff11\uff12\uff13' >>>>> int(s) >> 123 >> >> But sometimes, we want to accept only ascii string. For example, >> ipaddress module uses: >> >> _DECIMAL_DIGITS = frozenset('0123456789') >> ... >> if _DECIMAL_DIGITS.issuperset(str): >> >> ref: https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Lib/ipaddress.py#L491-L494 >> >> If str has str.isascii() method, it can be simpler: >> >> `if s.isascii() and s.isdigit():` >> >> I want to add it in Python 3.7 if there are no opposite opinions. >> > > I'm not sure that the decimal-digit check is actually improved by > this, but nonetheless, I am in favour of this feature. In CPython, > this method can simply look at the object headers to see if it has the > 'ascii' flag set; otherwise, it'd be effectively equivalent to: > > def isascii(self): > return ord(max(self)) < 128 > > Would be handy when working with semi-textual protocols, where ASCII > text is trivially encoded, but non-ASCII text may require negotiation > or a protocol header. +1 Just a note: checking the header in CPython will only give a hint, since strings created using higher order kinds can still be 100% ASCII. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From songofacandy at gmail.com Fri Jan 26 04:44:31 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Fri, 26 Jan 2018 18:44:31 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: > +1 > > Just a note: checking the header in CPython will only give a hint, > since strings created using higher order kinds can still be 100% > ASCII. > Oh, really? I think checking header is enough for all ready unicode. For example, this is _PyUnicode_EqualToASCIIString implementation: if (PyUnicode_READY(unicode) == -1) { /* Memory error or bad data */ PyErr_Clear(); return non_ready_unicode_equal_to_ascii_string(unicode, str); } if (!PyUnicode_IS_ASCII(unicode)) return 0; And I think str.isascii() can be implemented as: if (PyUnicode_READY(unicode) == -1) { return NULL; } if (PyUnicode_IS_ASCII(unicode)) { Py_RETURN_TRUE; } else { Py_RETURN_FALSE; } From mal at egenix.com Fri Jan 26 05:12:42 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 11:12:42 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: On 26.01.2018 10:44, INADA Naoki wrote: >> +1 >> >> Just a note: checking the header in CPython will only give a hint, >> since strings created using higher order kinds can still be 100% >> ASCII. >> > > Oh, really? > I think checking header is enough for all ready unicode. No, because you can pass in maxchar to PyUnicode_New() and the implementation will take this as hint to the max code point used in the string. There is no check done whether maxchar is indeed the minimum upper bound to the code point ordinals. The reason for doing this is simple: you don't want to have to scan the string every time you create a Unicode object. CPython itself often does do such a scan before calling PyUnicode_New(), so in many cases, the header will be set to ASCII, but not always. > For example, this is _PyUnicode_EqualToASCIIString implementation: > > if (PyUnicode_READY(unicode) == -1) { > /* Memory error or bad data */ > PyErr_Clear(); > return non_ready_unicode_equal_to_ascii_string(unicode, str); > } > if (!PyUnicode_IS_ASCII(unicode)) > return 0; > > And I think str.isascii() can be implemented as: > > if (PyUnicode_READY(unicode) == -1) { > return NULL; > } > if (PyUnicode_IS_ASCII(unicode)) { > Py_RETURN_TRUE; > } > else { > Py_RETURN_FALSE; > } > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From solipsis at pitrou.net Fri Jan 26 05:22:07 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 26 Jan 2018 11:22:07 +0100 Subject: [Python-ideas] Adding str.isascii() ? References: Message-ID: <20180126112207.1fb80a02@fsol> On Fri, 26 Jan 2018 17:42:31 +0900 INADA Naoki wrote: > > If str has str.isascii() method, it can be simpler: > > `if s.isascii() and s.isdigit():` > > I want to add it in Python 3.7 if there are no opposite opinions. +1 from me. Regards Antoine. From victor.stinner at gmail.com Fri Jan 26 06:10:17 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 26 Jan 2018 12:10:17 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180126112207.1fb80a02@fsol> References: <20180126112207.1fb80a02@fsol> Message-ID: +1 The idea is not new and I like it. Naoki created https://bugs.python.org/issue32677 Victor 2018-01-26 11:22 GMT+01:00 Antoine Pitrou : > On Fri, 26 Jan 2018 17:42:31 +0900 > INADA Naoki > wrote: >> >> If str has str.isascii() method, it can be simpler: >> >> `if s.isascii() and s.isdigit():` >> >> I want to add it in Python 3.7 if there are no opposite opinions. > > +1 from me. > > Regards > > Antoine. > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From songofacandy at gmail.com Fri Jan 26 06:17:23 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Fri, 26 Jan 2018 20:17:23 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: > No, because you can pass in maxchar to PyUnicode_New() and > the implementation will take this as hint to the max code point > used in the string. There is no check done whether maxchar > is indeed the minimum upper bound to the code point ordinals. API doc says: """ maxchar should be the true maximum code point to be placed in the string. As an approximation, it can be rounded up to the nearest value in the sequence 127, 255, 65535, 1114111. """ https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New Since doc says *should*, strings created with wrong maxchar are considered invalid object. We already ignores string with wrong maxchars in some places. Even "a" == "a" may fail for such invalid string object. So I don't think str.iascii() should consider about it. Regards, From steve at pearwood.info Fri Jan 26 07:39:54 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 26 Jan 2018 23:39:54 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: <20180126123953.GA22500@ando.pearwood.info> On Fri, Jan 26, 2018 at 05:42:31PM +0900, INADA Naoki wrote: > If str has str.isascii() method, it can be simpler: > > `if s.isascii() and s.isdigit():` > > I want to add it in Python 3.7 if there are no opposite opinions. I have no objection to isascii, but I don't think it goes far enough. Sometimes I want to know whether a string is compatible with Latin-1 or UCS-2 as well as ASCII. For that, I used a function that exposes the size of code points in bits: @property def size(self): # This can be implemented much more efficiently in CPython. c = ord(max(self)) if self else 0 if c <= 0x7F: return 7 elif c <= 0xFF: return 8 elif c <= 0xFFFF: return 16 else: assert c <= 0x10FFFF return 21 A quick test for ASCII will be: string.size == 7 and to test that it is entirely within the BMP (Basic Multilingual Plane): string.size <= 16 -- Steve From mal at egenix.com Fri Jan 26 08:02:46 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 14:02:46 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: On 26.01.2018 12:17, INADA Naoki wrote: >> No, because you can pass in maxchar to PyUnicode_New() and >> the implementation will take this as hint to the max code point >> used in the string. There is no check done whether maxchar >> is indeed the minimum upper bound to the code point ordinals. > > API doc says: > > """ > maxchar should be the true maximum code point to be placed in the string. > As an approximation, it can be rounded up to the nearest value in the > sequence 127, 255, 65535, 1114111. > """ > https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New > > Since doc says *should*, strings created with wrong maxchar > are considered invalid object. Not really: "should" means should, not must :-) Objects created with PyUnicode_New() are valid and ready (this only has a meaning for legacy strings). You can set maxchar to 64k and still just use ASCII as content. In some cases, you may want the internal string representation to be wchar_t compatible or work with Py_UCS2/4, so both 64k and sys.maxunicode are reasonable and valid values. Overall, I'm starting to believe that a str.maxchar() function would be a better choice than to only go for ASCII. This could have an optional parameter "exact" to force scanning the string and returning the actual max code point ordinal when set to True (default), or return the approximation based on the used kind if not set (which is many cases, will give you a good hint). For checking ASCII, you'd then write: def isascii(s): if s.maxchar(exact=False) < 128: return True if s.maxchar() < 128: return True return False -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From songofacandy at gmail.com Fri Jan 26 08:11:31 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Fri, 26 Jan 2018 22:11:31 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: Do you mean we should fix *all* of CPython unicode handling, not only str.isascii()? At least, equality test doesn't care wrong kind. https://github.com/python/cpython/blob/master/Objects/stringlib/eq.h https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Objects/unicodeobject.c#L10871-L10873 https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Objects/unicodeobject.c#L10998-L10999 There may be many others, but I'm not sure. On Fri, Jan 26, 2018 at 10:02 PM, M.-A. Lemburg wrote: > On 26.01.2018 12:17, INADA Naoki wrote: >>> No, because you can pass in maxchar to PyUnicode_New() and >>> the implementation will take this as hint to the max code point >>> used in the string. There is no check done whether maxchar >>> is indeed the minimum upper bound to the code point ordinals. >> >> API doc says: >> >> """ >> maxchar should be the true maximum code point to be placed in the string. >> As an approximation, it can be rounded up to the nearest value in the >> sequence 127, 255, 65535, 1114111. >> """ >> https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New >> >> Since doc says *should*, strings created with wrong maxchar >> are considered invalid object. > > Not really: "should" means should, not must :-) Objects created > with PyUnicode_New() are valid and ready (this only has a meaning > for legacy strings). > > You can set maxchar to 64k and still just use ASCII as content. > In some cases, you may want the internal string representation > to be wchar_t compatible or work with Py_UCS2/4, so both 64k > and sys.maxunicode are reasonable and valid values. > > Overall, I'm starting to believe that a str.maxchar() function > would be a better choice than to only go for ASCII. > > This could have an optional parameter "exact" to force scanning > the string and returning the actual max code point ordinal > when set to True (default), or return the approximation based > on the used kind if not set (which is many cases, will give > you a good hint). > > For checking ASCII, you'd then write: > > def isascii(s): > if s.maxchar(exact=False) < 128: > return True > if s.maxchar() < 128: > return True > return False > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 26 2018) >>>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>>> Python Database Interfaces ... http://products.egenix.com/ >>>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > -- INADA Naoki From rosuav at gmail.com Fri Jan 26 08:15:24 2018 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 27 Jan 2018 00:15:24 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: On Fri, Jan 26, 2018 at 10:17 PM, INADA Naoki wrote: >> No, because you can pass in maxchar to PyUnicode_New() and >> the implementation will take this as hint to the max code point >> used in the string. There is no check done whether maxchar >> is indeed the minimum upper bound to the code point ordinals. > > API doc says: > > """ > maxchar should be the true maximum code point to be placed in the string. > As an approximation, it can be rounded up to the nearest value in the > sequence 127, 255, 65535, 1114111. > """ > https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New > > Since doc says *should*, strings created with wrong maxchar > are considered invalid object. > > We already ignores string with wrong maxchars in some places. > Even "a" == "a" may fail for such invalid string object. Can you create a simple test-case that proves this? If so, I would say that this is a bug in the docs, and recommend rewording it somewhat thus: maxchar is either the actual maximum code point to be placed in the string, or (as an approximation) rounded up to the nearest value in the sequence 127, 255, 65535, 1114111. Failing a basic operation like equality checking would be considered a total failure. ChrisA From victor.stinner at gmail.com Fri Jan 26 08:31:33 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 26 Jan 2018 14:31:33 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: 2018-01-26 12:17 GMT+01:00 INADA Naoki : >> No, because you can pass in maxchar to PyUnicode_New() and >> the implementation will take this as hint to the max code point >> used in the string. There is no check done whether maxchar >> is indeed the minimum upper bound to the code point ordinals. > > API doc says: > > """ > maxchar should be the true maximum code point to be placed in the string. > As an approximation, it can be rounded up to the nearest value in the > sequence 127, 255, 65535, 1114111. > """ > https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New > > Since doc says *should*, strings created with wrong maxchar > are considered invalid object. PyUnicode objects must always use the most efficient storage. It's a very strong requirement of the PEP 393. As Naoki wrote, many functions rely on this assumption to implement fast-path. The assumption is even implemented in the debug check _PyUnicode_CheckConsistency(): https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Objects/unicodeobject.c#L453-L485 Victor From songofacandy at gmail.com Fri Jan 26 08:33:36 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Fri, 26 Jan 2018 22:33:36 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: > > Can you create a simple test-case that proves this? Sure. $ git diff diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c index 2ad4322eca..475d5219e1 100644 --- a/Modules/_testcapimodule.c +++ b/Modules/_testcapimodule.c @@ -5307,6 +5307,12 @@ PyInit__testcapi(void) Py_INCREF(&PyInstanceMethod_Type); PyModule_AddObject(m, "instancemethod", (PyObject *)&PyInstanceMethod_Type); + PyObject *wrong_unicode = PyUnicode_New(1, 65535); + PyUnicode_WRITE(PyUnicode_2BYTE_KIND, + PyUnicode_DATA(wrong_unicode), + 0, 'a'); + PyModule_AddObject(m, "wrong_unicode", wrong_unicode); + PyModule_AddIntConstant(m, "the_number_three", 3); #ifdef WITH_PYMALLOC PyModule_AddObject(m, "WITH_PYMALLOC", Py_True); $ ./python Python 3.7.0a4+ (heads/master-dirty:e76daebc0c, Jan 26 2018, 22:31:18) [GCC 7.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import _testcapi >>> _testcapi.wrong_unicode 'a' >>> len(_testcapi.wrong_unicode) 1 >>> ord(_testcapi.wrong_unicode) 97 >>> _testcapi.wrong_unicode == 'a' False >>> From victor.stinner at gmail.com Fri Jan 26 08:37:14 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 26 Jan 2018 14:37:14 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180126123953.GA22500@ando.pearwood.info> References: <20180126123953.GA22500@ando.pearwood.info> Message-ID: 2018-01-26 13:39 GMT+01:00 Steven D'Aprano : > I have no objection to isascii, but I don't think it goes far enough. > Sometimes I want to know whether a string is compatible with Latin-1 or > UCS-2 as well as ASCII. For that, I used a function that exposes the > size of code points in bits: Really? I never required such check in practice. Would you mind to elaborate your use case? ASCII is very very common and hardcoded in many file formats and protocols. Other character sets are more rare. > @property > def size(self): > # This can be implemented much more efficiently in CPython. > c = ord(max(self)) if self else 0 > if c <= 0x7F: > return 7 > elif c <= 0xFF: > return 8 > elif c <= 0xFFFF: > return 16 > else: > assert c <= 0x10FFFF > return 21 An efficient, O(1) complexity, implementation can be annoying to implement. I don't think that it's worth it. Python doesn't have this method, and I never see any user requesting this feature. IMHO this size() idea comes from the PEP 393 design, but not from a real use case. In CPython, str.isascii() would be a O(1) operation since the result is "cached" by design in the implementation of PyUnicode. PEP 393 is an implementation detail. PyPy is now using utf8 internally, not PEP 393 (UCS1, UCS2 or UCS4). PyPy might want to use a bit to cache if the string is ASCII or not, but I'm not sure that it's worth it to check the maximum character or the size() result. Victor From mal at egenix.com Fri Jan 26 08:43:25 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 14:43:25 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: On 26.01.2018 14:31, Victor Stinner wrote: > 2018-01-26 12:17 GMT+01:00 INADA Naoki : >>> No, because you can pass in maxchar to PyUnicode_New() and >>> the implementation will take this as hint to the max code point >>> used in the string. There is no check done whether maxchar >>> is indeed the minimum upper bound to the code point ordinals. >> >> API doc says: >> >> """ >> maxchar should be the true maximum code point to be placed in the string. >> As an approximation, it can be rounded up to the nearest value in the >> sequence 127, 255, 65535, 1114111. >> """ >> https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_New >> >> Since doc says *should*, strings created with wrong maxchar >> are considered invalid object. > > PyUnicode objects must always use the most efficient storage. It's a > very strong requirement of the PEP 393. As Naoki wrote, many functions > rely on this assumption to implement fast-path. > > The assumption is even implemented in the debug check > _PyUnicode_CheckConsistency(): > > https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Objects/unicodeobject.c#L453-L485 If that's indeed being used as assumption, the docs must be fixed and PyUnicode_New() should verify this assumption as well - not only in debug builds using C asserts() :-) Going through the code, I saw a lot of calls to find_maxchar_surrogates() before calling PyUnicode_New(). This call would then have to be moved inside PyUnicode_New() instead. C extensions can easily create strings using PyUnicode_New() which do not adhere to such a requirement and then write arbitrary content using PyUnicode_WRITE(). In some cases, this may even be necessary, say in case the extension doesn't know what data is being written, reading it from some external source. I'm not too familiar with the new Unicode code, but it seems that this requirement is not checked everywhere, e.g. the resize code doesn't seem to have such checks either (only in debug versions). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From victor.stinner at gmail.com Fri Jan 26 08:55:40 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 26 Jan 2018 14:55:40 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: 2018-01-26 14:43 GMT+01:00 M.-A. Lemburg : > If that's indeed being used as assumption, the docs must be > fixed and PyUnicode_New() should verify this assumption as > well - not only in debug builds using C asserts() :-) As PyUnicode_FromStringAndSize(NULL, size), PyUnicode_New(size, maxchar) only allocates memory with uninitialized characters. I don't see how PyUnicode_New() could check the string content since the content is unknow yet... The new public C API added by PEP 393 is hard to use correctly, but they are the most efficient. Functions like PyUnicode_FromString() are simple to use and very hard to misuse :-) PyPy developers asked me to simply drop all these new public C API, make them private. At least, deprecate them. But I never looked in depth at the new API. I don't know if Cython uses it for example. Some APIs are still private like _PyUnicodeWriter which allows to create a string in multiple steps with a smart strategy to reduce or even avoid realloc() and conversions from the different storage types (UCS1, UCS2, UCS4). This API is very efficient, but also hard to use. > C extensions can easily create strings using PyUnicode_New() > which do not adhere to such a requirement and then write > arbitrary content using PyUnicode_WRITE(). In some cases, > this may even be necessary, say in case the extension doesn't > know what data is being written, reading it from some external > source. It would be a bug in the C extension. > I'm not too familiar with the new Unicode code, but it seems > that this requirement is not checked everywhere, e.g. the > resize code doesn't seem to have such checks either (only in > debug versions). It must be checked everywhere. If it's not the case, it's an obvious bug in CPython. If you spotted a bug, please report a bug ;-) Victor From mal at egenix.com Fri Jan 26 09:18:13 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 15:18:13 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: <6947d39b-b15c-d8ed-fc0e-6137e239e890@egenix.com> On 26.01.2018 14:55, Victor Stinner wrote: > 2018-01-26 14:43 GMT+01:00 M.-A. Lemburg : >> If that's indeed being used as assumption, the docs must be >> fixed and PyUnicode_New() should verify this assumption as >> well - not only in debug builds using C asserts() :-) > > As PyUnicode_FromStringAndSize(NULL, size), PyUnicode_New(size, > maxchar) only allocates memory with uninitialized characters. > > I don't see how PyUnicode_New() could check the string content since > the content is unknow yet... You do have a point there ;-) I guess making the assumption very clear in the docs would be a good first step - as Chris suggested. > The new public C API added by PEP 393 is hard to use correctly, but > they are the most efficient. Functions like PyUnicode_FromString() are > simple to use and very hard to misuse :-) PyPy developers asked me to > simply drop all these new public C API, make them private. At least, > deprecate them. But I never looked in depth at the new API. I don't > know if Cython uses it for example. Dropping them would most likely seriously limit the usefulness of the Unicode API. If you always have to copy strings to create objects, this would make text intense work very slow. The usual approach is to have a three step process: 1. create a container object of sufficient size 2. write data 3. resize container to actual size I guess marking objects returned by PyUnicode_New() as "not ready" would help resolve the issue. Whenever the maxchar check is applied, the ready flag could then be set. The resize operations would then have to apply the maxchar check as well. Unfortunately, many of he readiness checks are only available in debug builds, but at least it's a way forward to make the API more robust. > Some APIs are still private like _PyUnicodeWriter which allows to > create a string in multiple steps with a smart strategy to reduce or > even avoid realloc() and conversions from the different storage types > (UCS1, UCS2, UCS4). This API is very efficient, but also hard to use. > >> C extensions can easily create strings using PyUnicode_New() >> which do not adhere to such a requirement and then write >> arbitrary content using PyUnicode_WRITE(). In some cases, >> this may even be necessary, say in case the extension doesn't >> know what data is being written, reading it from some external >> source. > > It would be a bug in the C extension. Is there a way to call an API which fixes the setting (a public version of unicode_adjust_maxchar()) ? Without this, how would an extension be able to provide a correct value upfront without knowing the content ? >> I'm not too familiar with the new Unicode code, but it seems >> that this requirement is not checked everywhere, e.g. the >> resize code doesn't seem to have such checks either (only in >> debug versions). > > It must be checked everywhere. If it's not the case, it's an obvious > bug in CPython. > > If you spotted a bug, please report a bug ;-) Yes, will do. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From solipsis at pitrou.net Fri Jan 26 09:58:14 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 26 Jan 2018 15:58:14 +0100 Subject: [Python-ideas] Adding str.isascii() ? References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> Message-ID: <20180126155814.727e3b03@fsol> On Fri, 26 Jan 2018 22:33:36 +0900 INADA Naoki wrote: > > > > Can you create a simple test-case that proves this? > > Sure. I think the question assumed "without writing custom C or ctypes code that deliberately builds a non-conformant unicode object" ;-) Regards Antoine. From barry at python.org Fri Jan 26 10:02:24 2018 From: barry at python.org (Barry Warsaw) Date: Fri, 26 Jan 2018 10:02:24 -0500 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: Guido van Rossum wrote: > IIUC another common layout is to have folders named test or tests inside > each package. This would avoid requiring any changes to the site-packages > layout. That's what I do for all my personal code. Yes, it means the test directories are shipped with the sdist, but really who cares? I don't think I've had a single complaint about it, even with large-ish projects like Mailman. I can see you wanting to do something different if your project has truly gargantuan test suites, but even with 100% coverage (or nearly so), I think size just isn't usually a big deal. In another message, Giampaolo describes being able to run tests with -m psutil.test. That's a neat idea which I haven't tried. But I do think including the tests can be instructive, and I know that on more than one occasion, I've cracked open a project's test suite to get a better sense of the semantics and usage of a particular API. Finally, I'll disagree with pytest's recommendation to not put __init__.py files in your test directories. Although I'm not a heavy pytest user (we use it exclusive at work, but I don't use it much with my own stuff), having __init__.py files can be useful, especially if you also have test data you want to access through pkg_resources, or now, importlib_resources (importlib.resources in Python 3.7). Cheers, -Barry From mal at egenix.com Fri Jan 26 10:04:32 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 16:04:32 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180126155814.727e3b03@fsol> References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> <20180126155814.727e3b03@fsol> Message-ID: <3cd5c351-8a05-0e52-983f-a748da8e96c1@egenix.com> On 26.01.2018 15:58, Antoine Pitrou wrote: > On Fri, 26 Jan 2018 22:33:36 +0900 > INADA Naoki > wrote: >>> >>> Can you create a simple test-case that proves this? >> >> Sure. > > I think the question assumed "without writing custom C or ctypes code > that deliberately builds a non-conformant unicode object" ;-) I think his example is spot on, since this is how you'd expect to use the APIs. Even more so, if you don't know the maximum code point used in the data you write to the object upfront. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From songofacandy at gmail.com Fri Jan 26 10:12:53 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Sat, 27 Jan 2018 00:12:53 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180126155814.727e3b03@fsol> References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> <20180126155814.727e3b03@fsol> Message-ID: No. See this mail. https://mail.python.org/pipermail/python-ideas/2018-January/048748.html Point is should we support invalid Unicode created by C API. And I assume no. 2018/01/26 ??11:58 "Antoine Pitrou" : On Fri, 26 Jan 2018 22:33:36 +0900 INADA Naoki wrote: > > > > Can you create a simple test-case that proves this? > > Sure. I think the question assumed "without writing custom C or ctypes code that deliberately builds a non-conformant unicode object" ;-) Regards Antoine. _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From random832 at fastmail.com Fri Jan 26 10:16:23 2018 From: random832 at fastmail.com (Random832) Date: Fri, 26 Jan 2018 10:16:23 -0500 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <6947d39b-b15c-d8ed-fc0e-6137e239e890@egenix.com> References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> <6947d39b-b15c-d8ed-fc0e-6137e239e890@egenix.com> Message-ID: <1516979783.4063003.1249158496.5DD0FEFE@webmail.messagingengine.com> On Fri, Jan 26, 2018, at 09:18, M.-A. Lemburg wrote: > Is there a way to call an API which fixes the setting > (a public version of unicode_adjust_maxchar()) ? > > Without this, how would an extension be able to provide a > correct value upfront without knowing the content ? It obviously has to know the content before it can finally return the string (or pass it to any other function, etc), because strings are immutable. Why not then do all the intermediate work in an array of int32's (or perhaps a UCS-4 PyUnicode to be returned only if needed), then afterward scan and build the string? From mal at egenix.com Fri Jan 26 10:36:04 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Jan 2018 16:36:04 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <1516979783.4063003.1249158496.5DD0FEFE@webmail.messagingengine.com> References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> <6947d39b-b15c-d8ed-fc0e-6137e239e890@egenix.com> <1516979783.4063003.1249158496.5DD0FEFE@webmail.messagingengine.com> Message-ID: On 26.01.2018 16:16, Random832 wrote: > On Fri, Jan 26, 2018, at 09:18, M.-A. Lemburg wrote: >> Is there a way to call an API which fixes the setting >> (a public version of unicode_adjust_maxchar()) ? >> >> Without this, how would an extension be able to provide a >> correct value upfront without knowing the content ? > > It obviously has to know the content before it can finally return the string (or pass it to any other function, etc), because strings are immutable. Why not then do all the intermediate work in an array of int32's (or perhaps a UCS-4 PyUnicode to be returned only if needed), then afterward scan and build the string? The create, write data, resize approach is a standard way to build (longer) Pythhon string objects in the Python C API, since it avoids temporary copies. E.g. you don't want to first build a buffer to hold 100MB XML, then scan it for the max code point being used, create a python string from it (which copies the data into a second 100MB buffer) and then deallocate the first buffer again. Instead you create an uninitialized Python Unicode object and use PyUnicde_WRITE() to write the data directly into the object, avoiding the 100MB temp buffer. PS: Strings are immutable in Python, but they are not in C. You can manipulate string objects provided you own the only reference. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 26 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From stephane at wirtel.be Fri Jan 26 10:45:34 2018 From: stephane at wirtel.be (Stephane Wirtel) Date: Fri, 26 Jan 2018 16:45:34 +0100 Subject: [Python-ideas] Official site-packages/test directory In-Reply-To: References: <20180119142753.GA4754@bytereef.org> Message-ID: <20180126154534.GA3215@xps> Hi Barry, Sometimes, I need to read the tests of a package because I don't understand the usage of a function/method/class and unfortunately, there is no documentation. In this case, and only in this case, I will try to find the tests and in the worst case, download the source and try to understand with the 'tests' directory. my 0.000002 ? ;-) Stephane On 01/26, Barry Warsaw wrote: >Guido van Rossum wrote: >> IIUC another common layout is to have folders named test or tests inside >> each package. This would avoid requiring any changes to the site-packages >> layout. > >That's what I do for all my personal code. Yes, it means the test >directories are shipped with the sdist, but really who cares? I don't >think I've had a single complaint about it, even with large-ish projects >like Mailman. I can see you wanting to do something different if your >project has truly gargantuan test suites, but even with 100% coverage >(or nearly so), I think size just isn't usually a big deal. > >In another message, Giampaolo describes being able to run tests with -m >psutil.test. That's a neat idea which I haven't tried. But I do think >including the tests can be instructive, and I know that on more than one >occasion, I've cracked open a project's test suite to get a better sense >of the semantics and usage of a particular API. > >Finally, I'll disagree with pytest's recommendation to not put >__init__.py files in your test directories. Although I'm not a heavy >pytest user (we use it exclusive at work, but I don't use it much with >my own stuff), having __init__.py files can be useful, especially if you >also have test data you want to access through pkg_resources, or now, >importlib_resources (importlib.resources in Python 3.7). > >Cheers, >-Barry -- St?phane Wirtel - http://wirtel.be - @matrixise From songofacandy at gmail.com Fri Jan 26 10:54:16 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Sat, 27 Jan 2018 00:54:16 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <9658103a-89b4-fd4e-c837-03a80eab140e@egenix.com> <6947d39b-b15c-d8ed-fc0e-6137e239e890@egenix.com> <1516979783.4063003.1249158496.5DD0FEFE@webmail.messagingengine.com> Message-ID: We have _PyUnicodeWriter for such use cases. We may able to expose it as public API, but please start another thread for it. Unicode created by wrong maxchar is not supported from Python 3.3. == and hash() doesn't work properly for such unicode object. So str.isascii() has not to support it too. > > The create, write data, resize approach is a standard way to build > (longer) Pythhon string objects in the Python C API, since it > avoids temporary copies. > > E.g. you don't want to first build a buffer to hold 100MB XML, > then scan it for the max code point being used, create a python > string from it (which copies the data into a second 100MB > buffer) and then deallocate the first buffer again. > > Instead you create an uninitialized Python Unicode object > and use PyUnicde_WRITE() to write the data directly into > the object, avoiding the 100MB temp buffer. > > PS: Strings are immutable in Python, but they are not in C. > You can manipulate string objects provided you own the only > reference. > > -- > Marc-Andre Lemburg > eGenix.com > > Professional Python Services directly from the Experts (#1, Jan 26 2018) >>>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>>> Python Database Interfaces ... http://products.egenix.com/ >>>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ > ________________________________________________________________________ > > ::: We implement business ideas - efficiently in both time and costs ::: > > eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 > D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg > Registered at Amtsgericht Duesseldorf: HRB 46611 > http://www.egenix.com/company/contact/ > http://www.malemburg.com/ > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- INADA Naoki From guido at python.org Fri Jan 26 11:20:49 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 08:20:49 -0800 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: On Fri, Jan 26, 2018 at 12:42 AM, INADA Naoki wrote: > Currently, int(), str.isdigit(), str.isalnum(), etc... accepts > non-ASCII strings. > > >>> s = ???" > >>> s > '???' > >>> s.isdigit() > True > >>> print(ascii(s)) > '\uff11\uff12\uff13' > >>> int(s) > 123 > > But sometimes, we want to accept only ascii string. For example, > ipaddress module uses: > > _DECIMAL_DIGITS = frozenset('0123456789') > ... > if _DECIMAL_DIGITS.issuperset(str): > > ref: https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537 > f756e805de/Lib/ipaddress.py#L491-L494 > > If str has str.isascii() method, it can be simpler: > > `if s.isascii() and s.isdigit():` > > I want to add it in Python 3.7 if there are no opposite opinions. > That's fine with me. Please also add it to bytes and bytearray objects. It's okay if the implementation has to scan the string -- so do isdigit() etc. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dancollins34 at gmail.com Fri Jan 26 11:54:43 2018 From: dancollins34 at gmail.com (Daniel Collins) Date: Fri, 26 Jan 2018 11:54:43 -0500 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: Message-ID: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> So, just going point by point: Yes, absolutely put this off for 3.8. I didn?t know the freeze was so close or I would have put the 3.8 tag on originally. Yes, absolutely it is only meant for concurrent.futures futures, it only changes async where async uses concurrent.futures futures. Here?s a more fleshed out description of the use case: Assume you have two functions. Function a(x: str)->AResult fetches an AResult object from a web resource, function b(y: AResult) performs some computationally heavy work on AResult. Assume you?re calling a 10 times with a threadpoolexecutor with 2 worker theads. If you were to schedule a as future using submit, and b as a callback, the executions would look like this: ExecutorThread: b*10 Worker1: a*5 Worker2: a*5 This only gets worse as more work (b) is scheduled as a callback for the result from a. Now you could resolve this by, instead of submitting b as a callback, submitting the following lambda: lambda x: executor.submit(b, x) But then you wouldn?t have easy access to this new future. You would have to build a lot of boilerplate code to collect that future into some external collection, and this would only get worse the deeper the nesting goes. With this syntax on the other hand, if you run a 10 times using submit, but then run a_fut.then(b) for each future, execution instead looks like this: ExecutorThread: Worker1: a*5 b*5 Worker2: a*5 b*5 You can also do additional depth easily. Suppose you want to run 3 c operations (processes the output of b) for each b operation. Then you could call this like b_fut = a_fut.then(b) for i in range(3): b_fut.then(c) And the execution would look like this: ExecutorThread: Worker1: a*5 b*5 c*15 Worker2: a*5 b*5 c*15 Which would be very difficult to do otherwise, and distributes the load across the workers, while having direct access to the outputs of the calls to c. -dancollins34 Sent from my iPhone > On Jan 26, 2018, at 1:07 AM, Guido van Rossum wrote: > > I really don't want to distract Yury with this. Let's consider this (or something that addresses the same need) for 3.8. > > To be clear this is meant as a feature for concurrent.futures.Future, not for asyncio.Future. (It's a bit confusing since you also change asyncio.) > > Also to be honest I don't understand the use case *or* the semantics very well. You have some explaining to do... > > (Also, full links: https://bugs.python.org/issue32672; https://github.com/python/cpython/pull/5335) > >> On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins wrote: >> Hello all, >> >> So, first time posting here. I?ve been bothered for a while about the lack of the ability to chain futures in python, such that the next future will execute upon the first?s completion. So I submitted a pr to do this. This would add the .then(self, fn) method to concurrent.futures.Future. Thoughts? >> >> -dancollins34 >> >> Github PR #5335 >> bugs.python.org issue #32672 >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > > -- > --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dancollins34 at gmail.com Fri Jan 26 11:58:01 2018 From: dancollins34 at gmail.com (Daniel Collins) Date: Fri, 26 Jan 2018 11:58:01 -0500 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: Message-ID: Yeah, it would be better to use the chain_future call from async directly, but the problems are 1) it would make concurrent dependent on async and 2) if it were public, it would require users to instantiate futures, which they?re not supposed to do. -dancollins34 Sent from my iPhone > On Jan 26, 2018, at 3:16 AM, Bar Harel wrote: > > I have a simple way to solve this I believe. > > Why not just expose "_chain_future()" from asyncio/futures.py to the public, instead of copying and pasting parts of it? > > It already works, being used everywhere in the stdlib, it supports both asyncio and concurrent.futures, it's an easily testable external function (follows the design of asyncio for the better part), it's threadsafe right out of the box and it wouldn't require anything but removing a single underscore and adding documentation. (I always wondered why was it private anyway) > > It's like the function was meant to be public :-P > > -- Bar > > >> On Fri, Jan 26, 2018, 8:07 AM Guido van Rossum wrote: >> I really don't want to distract Yury with this. Let's consider this (or something that addresses the same need) for 3.8. >> >> To be clear this is meant as a feature for concurrent.futures.Future, not for asyncio.Future. (It's a bit confusing since you also change asyncio.) >> >> Also to be honest I don't understand the use case *or* the semantics very well. You have some explaining to do... >> >> (Also, full links: https://bugs.python.org/issue32672; https://github.com/python/cpython/pull/5335) >> >>> On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins wrote: >>> Hello all, >>> >>> So, first time posting here. I?ve been bothered for a while about the lack of the ability to chain futures in python, such that the next future will execute upon the first?s completion. So I submitted a pr to do this. This would add the .then(self, fn) method to concurrent.futures.Future. Thoughts? >>> >>> -dancollins34 >>> >>> Github PR #5335 >>> bugs.python.org issue #32672 >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> Code of Conduct: http://python.org/psf/codeofconduct/ >>> >> >> >> >> -- >> --Guido van Rossum (python.org/~guido) >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Jan 26 11:59:22 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 08:59:22 -0800 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> References: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> Message-ID: @Bar: I don't know about exposing _chain_future(). Certainly it's overkill for what the OP wants -- their PR only cares about chaining concurrent.future.Future. @Daniel: I present the following simpler solution -- it requires you to explicitly pass the executor, but since 'fn' is being submitted to an executor, EIBTI. def then(executor, future, fn): newf = concurrent.futures.Future() def callback(fut): f = executor.submit(fn, fut) try: newf.set_result(f.result()) except CancelledError: newf.cancel() except Exception as err: newf.set_exception(err) return executor.submit(callback) I can't quite follow your reasoning about worker threads (and did you realize that because of the GIL, Python doesn't actually use multiple cores?). But I suppose it doesn't matter whether I understand that -- your point is that you want the 'fn' function submitted to the executor, not run as a "done callback". And that's reasonable. But modifying so much code just so the Future can know which to executor it belongs so you can make then() a method seems overkill. On Fri, Jan 26, 2018 at 8:54 AM, Daniel Collins wrote: > So, just going point by point: > > Yes, absolutely put this off for 3.8. I didn?t know the freeze was so > close or I would have put the 3.8 tag on originally. > > Yes, absolutely it is only meant for concurrent.futures futures, it only > changes async where async uses concurrent.futures futures. > > Here?s a more fleshed out description of the use case: > > Assume you have two functions. Function a(x: str)->AResult fetches an > AResult object from a web resource, function b(y: AResult) performs some > computationally heavy work on AResult. > > Assume you?re calling a 10 times with a threadpoolexecutor with 2 worker > theads. If you were to schedule a as future using submit, and b as a > callback, the executions would look like this: > > ExecutorThread: b*10 > Worker1: a*5 > Worker2: a*5 > > This only gets worse as more work (b) is scheduled as a callback for the > result from a. > > Now you could resolve this by, instead of submitting b as a callback, > submitting the following lambda: > > lambda x: executor.submit(b, x) > > But then you wouldn?t have easy access to this new future. You would have > to build a lot of boilerplate code to collect that future into some > external collection, and this would only get worse the deeper the nesting > goes. > > With this syntax on the other hand, if you run a 10 times using submit, > but then run a_fut.then(b) for each future, execution instead looks like > this: > > ExecutorThread: > Worker1: a*5 b*5 > Worker2: a*5 b*5 > > You can also do additional depth easily. Suppose you want to run 3 c > operations (processes the output of b) for each b operation. Then you could > call this like > > b_fut = a_fut.then(b) > > for i in range(3): > b_fut.then(c) > > And the execution would look like this: > > ExecutorThread: > Worker1: a*5 b*5 c*15 > Worker2: a*5 b*5 c*15 > > Which would be very difficult to do otherwise, and distributes the load > across the workers, while having direct access to the outputs of the calls > to c. > > -dancollins34 > > Sent from my iPhone > > On Jan 26, 2018, at 1:07 AM, Guido van Rossum wrote: > > I really don't want to distract Yury with this. Let's consider this (or > something that addresses the same need) for 3.8. > > To be clear this is meant as a feature for concurrent.futures.Future, not > for asyncio.Future. (It's a bit confusing since you also change asyncio.) > > Also to be honest I don't understand the use case *or* the semantics very > well. You have some explaining to do... > > (Also, full links: https://bugs.python.org/issue32672; > https://github.com/python/cpython/pull/5335) > > On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins > wrote: > >> Hello all, >> >> So, first time posting here. I?ve been bothered for a while about the >> lack of the ability to chain futures in python, such that the next future >> will execute upon the first?s completion. So I submitted a pr to do this. >> This would add the .then(self, fn) method to concurrent.futures.Future. >> Thoughts? >> >> -dancollins34 >> >> Github PR #5335 >> bugs.python.org issue #32672 >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> > > > -- > --Guido van Rossum (python.org/~guido) > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dancollins34 at gmail.com Fri Jan 26 12:10:27 2018 From: dancollins34 at gmail.com (Daniel Collins) Date: Fri, 26 Jan 2018 12:10:27 -0500 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> Message-ID: @Guido: I agree, that?s a much cleaner solution to pass the executor. However, I think the last line should be future.add_done_callback(callback) return newf not executor.submit. I?ll rewrite it like this and resubmit tonight for discussion. Sent from my iPhone > On Jan 26, 2018, at 11:59 AM, Guido van Rossum wrote: > > @Bar: I don't know about exposing _chain_future(). Certainly it's overkill for what the OP wants -- their PR only cares about chaining concurrent.future.Future. > > @Daniel: I present the following simpler solution -- it requires you to explicitly pass the executor, but since 'fn' is being submitted to an executor, EIBTI. > > def then(executor, future, fn): > newf = concurrent.futures.Future() > def callback(fut): > f = executor.submit(fn, fut) > try: > newf.set_result(f.result()) > except CancelledError: > newf.cancel() > except Exception as err: > newf.set_exception(err) > return executor.submit(callback) > > I can't quite follow your reasoning about worker threads (and did you realize that because of the GIL, Python doesn't actually use multiple cores?). But I suppose it doesn't matter whether I understand that -- your point is that you want the 'fn' function submitted to the executor, not run as a "done callback". And that's reasonable. But modifying so much code just so the Future can know which to executor it belongs so you can make then() a method seems overkill. > >> On Fri, Jan 26, 2018 at 8:54 AM, Daniel Collins wrote: >> So, just going point by point: >> >> Yes, absolutely put this off for 3.8. I didn?t know the freeze was so close or I would have put the 3.8 tag on originally. >> >> Yes, absolutely it is only meant for concurrent.futures futures, it only changes async where async uses concurrent.futures futures. >> >> Here?s a more fleshed out description of the use case: >> >> Assume you have two functions. Function a(x: str)->AResult fetches an AResult object from a web resource, function b(y: AResult) performs some computationally heavy work on AResult. >> >> Assume you?re calling a 10 times with a threadpoolexecutor with 2 worker theads. If you were to schedule a as future using submit, and b as a callback, the executions would look like this: >> >> ExecutorThread: b*10 >> Worker1: a*5 >> Worker2: a*5 >> >> This only gets worse as more work (b) is scheduled as a callback for the result from a. >> >> Now you could resolve this by, instead of submitting b as a callback, submitting the following lambda: >> >> lambda x: executor.submit(b, x) >> >> But then you wouldn?t have easy access to this new future. You would have to build a lot of boilerplate code to collect that future into some external collection, and this would only get worse the deeper the nesting goes. >> >> With this syntax on the other hand, if you run a 10 times using submit, but then run a_fut.then(b) for each future, execution instead looks like this: >> >> ExecutorThread: >> Worker1: a*5 b*5 >> Worker2: a*5 b*5 >> >> You can also do additional depth easily. Suppose you want to run 3 c operations (processes the output of b) for each b operation. Then you could call this like >> >> b_fut = a_fut.then(b) >> >> for i in range(3): >> b_fut.then(c) >> >> And the execution would look like this: >> >> ExecutorThread: >> Worker1: a*5 b*5 c*15 >> Worker2: a*5 b*5 c*15 >> >> Which would be very difficult to do otherwise, and distributes the load across the workers, while having direct access to the outputs of the calls to c. >> >> -dancollins34 >> >> Sent from my iPhone >> >>> On Jan 26, 2018, at 1:07 AM, Guido van Rossum wrote: >>> >>> I really don't want to distract Yury with this. Let's consider this (or something that addresses the same need) for 3.8. >>> >>> To be clear this is meant as a feature for concurrent.futures.Future, not for asyncio.Future. (It's a bit confusing since you also change asyncio.) >>> >>> Also to be honest I don't understand the use case *or* the semantics very well. You have some explaining to do... >>> >>> (Also, full links: https://bugs.python.org/issue32672; https://github.com/python/cpython/pull/5335) >>> >>>> On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins wrote: >>>> Hello all, >>>> >>>> So, first time posting here. I?ve been bothered for a while about the lack of the ability to chain futures in python, such that the next future will execute upon the first?s completion. So I submitted a pr to do this. This would add the .then(self, fn) method to concurrent.futures.Future. Thoughts? >>>> >>>> -dancollins34 >>>> >>>> Github PR #5335 >>>> bugs.python.org issue #32672 >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>> >>> >>> >>> >>> -- >>> --Guido van Rossum (python.org/~guido) > > > > -- > --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dancollins34 at gmail.com Fri Jan 26 12:20:17 2018 From: dancollins34 at gmail.com (Daniel Collins) Date: Fri, 26 Jan 2018 12:20:17 -0500 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> Message-ID: <10C6409B-2D98-461B-BC3A-A7209C7B669B@gmail.com> @Guido As an aside, my understanding was libraries that fall back to c (Numpy as an example) release the GIL for load heavy operations. But I believe the explanation would hold in the general case if you replace thread with process using a ProcessPoolExecutor, that it would be good to be able to submit a callback function back to the executor. Sent from my iPhone > On Jan 26, 2018, at 12:10 PM, Daniel Collins wrote: > > @Guido: I agree, that?s a much cleaner solution to pass the executor. However, I think the last line should be future.add_done_callback(callback) > return newf > > not executor.submit. > > I?ll rewrite it like this and resubmit tonight for discussion. > > Sent from my iPhone > >> On Jan 26, 2018, at 11:59 AM, Guido van Rossum wrote: >> >> @Bar: I don't know about exposing _chain_future(). Certainly it's overkill for what the OP wants -- their PR only cares about chaining concurrent.future.Future. >> >> @Daniel: I present the following simpler solution -- it requires you to explicitly pass the executor, but since 'fn' is being submitted to an executor, EIBTI. >> >> def then(executor, future, fn): >> newf = concurrent.futures.Future() >> def callback(fut): >> f = executor.submit(fn, fut) >> try: >> newf.set_result(f.result()) >> except CancelledError: >> newf.cancel() >> except Exception as err: >> newf.set_exception(err) >> return executor.submit(callback) >> >> I can't quite follow your reasoning about worker threads (and did you realize that because of the GIL, Python doesn't actually use multiple cores?). But I suppose it doesn't matter whether I understand that -- your point is that you want the 'fn' function submitted to the executor, not run as a "done callback". And that's reasonable. But modifying so much code just so the Future can know which to executor it belongs so you can make then() a method seems overkill. >> >>> On Fri, Jan 26, 2018 at 8:54 AM, Daniel Collins wrote: >>> So, just going point by point: >>> >>> Yes, absolutely put this off for 3.8. I didn?t know the freeze was so close or I would have put the 3.8 tag on originally. >>> >>> Yes, absolutely it is only meant for concurrent.futures futures, it only changes async where async uses concurrent.futures futures. >>> >>> Here?s a more fleshed out description of the use case: >>> >>> Assume you have two functions. Function a(x: str)->AResult fetches an AResult object from a web resource, function b(y: AResult) performs some computationally heavy work on AResult. >>> >>> Assume you?re calling a 10 times with a threadpoolexecutor with 2 worker theads. If you were to schedule a as future using submit, and b as a callback, the executions would look like this: >>> >>> ExecutorThread: b*10 >>> Worker1: a*5 >>> Worker2: a*5 >>> >>> This only gets worse as more work (b) is scheduled as a callback for the result from a. >>> >>> Now you could resolve this by, instead of submitting b as a callback, submitting the following lambda: >>> >>> lambda x: executor.submit(b, x) >>> >>> But then you wouldn?t have easy access to this new future. You would have to build a lot of boilerplate code to collect that future into some external collection, and this would only get worse the deeper the nesting goes. >>> >>> With this syntax on the other hand, if you run a 10 times using submit, but then run a_fut.then(b) for each future, execution instead looks like this: >>> >>> ExecutorThread: >>> Worker1: a*5 b*5 >>> Worker2: a*5 b*5 >>> >>> You can also do additional depth easily. Suppose you want to run 3 c operations (processes the output of b) for each b operation. Then you could call this like >>> >>> b_fut = a_fut.then(b) >>> >>> for i in range(3): >>> b_fut.then(c) >>> >>> And the execution would look like this: >>> >>> ExecutorThread: >>> Worker1: a*5 b*5 c*15 >>> Worker2: a*5 b*5 c*15 >>> >>> Which would be very difficult to do otherwise, and distributes the load across the workers, while having direct access to the outputs of the calls to c. >>> >>> -dancollins34 >>> >>> Sent from my iPhone >>> >>>> On Jan 26, 2018, at 1:07 AM, Guido van Rossum wrote: >>>> >>>> I really don't want to distract Yury with this. Let's consider this (or something that addresses the same need) for 3.8. >>>> >>>> To be clear this is meant as a feature for concurrent.futures.Future, not for asyncio.Future. (It's a bit confusing since you also change asyncio.) >>>> >>>> Also to be honest I don't understand the use case *or* the semantics very well. You have some explaining to do... >>>> >>>> (Also, full links: https://bugs.python.org/issue32672; https://github.com/python/cpython/pull/5335) >>>> >>>>> On Thu, Jan 25, 2018 at 8:38 PM, Daniel Collins wrote: >>>>> Hello all, >>>>> >>>>> So, first time posting here. I?ve been bothered for a while about the lack of the ability to chain futures in python, such that the next future will execute upon the first?s completion. So I submitted a pr to do this. This would add the .then(self, fn) method to concurrent.futures.Future. Thoughts? >>>>> >>>>> -dancollins34 >>>>> >>>>> Github PR #5335 >>>>> bugs.python.org issue #32672 >>>>> >>>>> _______________________________________________ >>>>> Python-ideas mailing list >>>>> Python-ideas at python.org >>>>> https://mail.python.org/mailman/listinfo/python-ideas >>>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>>> >>>> >>>> >>>> >>>> -- >>>> --Guido van Rossum (python.org/~guido) >> >> >> >> -- >> --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Jan 26 13:05:56 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 10:05:56 -0800 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: <10C6409B-2D98-461B-BC3A-A7209C7B669B@gmail.com> References: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> <10C6409B-2D98-461B-BC3A-A7209C7B669B@gmail.com> Message-ID: On Fri, Jan 26, 2018 at 9:20 AM, Daniel Collins wrote: > @Guido As an aside, my understanding was libraries that fall back to c > (Numpy as an example) release the GIL for load heavy operations. But I > believe the explanation would hold in the general case if you replace > thread with process using a ProcessPoolExecutor, that it would be good to > be able to submit a callback function back to the executor. > Sure, but your explanation didn't mention any of that. And yes, good catch on the last line of my example. :-) Given that the solution is only a few lines -- perhaps it's enough to just add it as an example to the docs, rather than to add it as a new function to concurrent.futures? A doc change can be added to 3.7! -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From waksman at gmail.com Fri Jan 26 14:11:28 2018 From: waksman at gmail.com (George Leslie-Waksman) Date: Fri, 26 Jan 2018 19:11:28 +0000 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> References: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> Message-ID: Even if we could inherit the setting, I would think that we would still want to require the code be explicit. It seems worse to implicitly require keyword only arguments for a class without giving any indication in the code. As it stands, the current implementation does not allow a later subclass to be declared without `keyword_only=True` so we could handle this case by adding a note to the `TypeError` message about considering the keyword_only flag. How do I got about putting together a proposal to get this into 3.8? --George On Thu, Jan 25, 2018 at 5:12 AM Eric V. Smith wrote: > I'm not completely opposed to this feature. But there are some cases to > consider. Here's the first one that occurs to me: note that due to the > way dataclasses work, it would need to be used everywhere down an > inheritance hierarchy. That is, if an intermediate base class required > it, all class derived from that intermediate base would need to specify > it, too. That's because each class just makes decisions based on its > fields and its base classes' fields, and not on any flags attached to > the base class. As it's currently implemented, a class doesn't remember > any of the decorator's arguments, so there's no way to look for this > information, anyway. > > I think there are enough issues here that it's not going to make it in > to 3.7. It would require getting a firm proposal together, selling the > idea on python-dev, and completing the implementation before Monday. But > if you want to try, I'd participate in the discussion. > > Taking Ivan's suggestion one step further, a way to do this currently is > to pass init=False and then write another decorator that adds the > kw-only __init__. So the usage would be: > > @dataclass > class Foo: > some_default: dict = field(default_factory=dict) > > @kw_only_init > @dataclass(init=False) > class Bar(Foo): > other_field: int > > kw_only_init(cls) would look at fields(cls) and construct the __init__. > It would be a hassle to re-implement dataclasses's _init_fn function, > but it could be made to work (in reality, of course, you'd just copy it > and hack it up to do what you want). You'd also need to use some private > knowledge of InitVars if you wanted to support them (the stock > fields(cls) doesn't return them). > > For 3.8 we can consider changing dataclasses's APIs if we want to add this. > > Eric. > > On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: > > It may be possible but it makes for pretty leaky abstractions and it's > > unclear what that custom __init__ should look like. How am I supposed to > > know what the replacement for default_factory is? > > > > Moreover, suppose I want one base class with an optional argument and a > > half dozen subclasses each with their own required argument. At that > > point, I have to write the same __init__ function a half dozen times. > > > > It feels rather burdensome for the user when an additional flag (say > > "kw_only=True") and a modification to: > > https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294 > that > > inserted `['*']` after `[self_name]` if the flag is specified could > > ameliorate this entire issue. > > > > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi > > wrote: > > > > It is possible to pass init=False to the decorator on the subclass > > (and supply your own custom __init__, if necessary): > > > > @dataclass > > class Foo: > > some_default: dict = field(default_factory=dict) > > > > @dataclass(init=False) # This works > > class Bar(Foo): > > other_field: int > > > > -- > > Ivan > > > > > > > > On 23 January 2018 at 03:33, George Leslie-Waksman > > > wrote: > > > > The proposed implementation of dataclasses prevents defining > > fields with defaults before fields without defaults. This can > > create limitations on logical grouping of fields and on > inheritance. > > > > Take, for example, the case: > > > > @dataclass > > class Foo: > > some_default: dict = field(default_factory=dict) > > > > @dataclass > > class Bar(Foo): > > other_field: int > > > > this results in the error: > > > > 5 @dataclass > > ----> 6 class Bar(Foo): > > 7 other_field: int > > 8 > > > > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > > in dataclass(_cls, init, repr, eq, order, hash, frozen) > > 751 > > 752 # We're called as @dataclass, with a class. > > --> 753 return wrap(_cls) > > 754 > > 755 > > > > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > > in wrap(cls) > > 743 > > 744 def wrap(cls): > > --> 745 return _process_class(cls, repr, eq, order, > > hash, init, frozen) > > 746 > > 747 # See if we're being called as @dataclass or > > @dataclass(). > > > > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > > in _process_class(cls, repr, eq, order, hash, init, frozen) > > 675 # in __init__. Use > > "self" if possible. > > 676 '__dataclass_self__' if > > 'self' in fields > > --> 677 else 'self', > > 678 )) > > 679 if repr: > > > > > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py > > in _init_fn(fields, frozen, has_post_init, self_name) > > 422 seen_default = True > > 423 elif seen_default: > > --> 424 raise TypeError(f'non-default argument > > {f.name !r} ' > > 425 'follows default > argument') > > 426 > > > > TypeError: non-default argument 'other_field' follows default > > argument > > > > I understand that this is a limitation of positional arguments > > because the effective __init__ signature is: > > > > def __init__(self, some_default: dict = , > > other_field: int): > > > > However, keyword only arguments allow an entirely reasonable > > solution to this problem: > > > > def __init__(self, *, some_default: dict = , > > other_field: int): > > > > And have the added benefit of making the fields in the __init__ > > call entirely explicit. > > > > So, I propose the addition of a keyword_only flag to the > > @dataclass decorator that renders the __init__ method using > > keyword only arguments: > > > > @dataclass(keyword_only=True) > > class Bar(Foo): > > other_field: int > > > > --George Leslie-Waksman > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dancollins34 at gmail.com Fri Jan 26 14:23:42 2018 From: dancollins34 at gmail.com (Daniel Collins) Date: Fri, 26 Jan 2018 14:23:42 -0500 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: References: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> <10C6409B-2D98-461B-BC3A-A7209C7B669B@gmail.com> Message-ID: <169871CD-3D04-4292-AC52-F4AFFC36B2A8@gmail.com> That?s very true. I?ll try to keep my terminology more in line with the implementation in the future. The only problem with that, is that the function utilizes methods that are marked in the documentation as exclusively to be called by the executor (set_result, instantiation of future objects) and it would be confusing if a few lines later, a ?but you can use them for this? example was provided. -dancollins34 Sent from my iPhone > On Jan 26, 2018, at 1:05 PM, Guido van Rossum wrote: > >> On Fri, Jan 26, 2018 at 9:20 AM, Daniel Collins wrote: >> @Guido As an aside, my understanding was libraries that fall back to c (Numpy as an example) release the GIL for load heavy operations. But I believe the explanation would hold in the general case if you replace thread with process using a ProcessPoolExecutor, that it would be good to be able to submit a callback function back to the executor. > > Sure, but your explanation didn't mention any of that. > > And yes, good catch on the last line of my example. :-) > > Given that the solution is only a few lines -- perhaps it's enough to just add it as an example to the docs, rather than to add it as a new function to concurrent.futures? A doc change can be added to 3.7! > > -- > --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From liam.marsh.home at gmail.com Fri Jan 26 15:46:51 2018 From: liam.marsh.home at gmail.com (liam marsh) Date: Fri, 26 Jan 2018 21:46:51 +0100 Subject: [Python-ideas] Logging: a more perseverent version of the StreamHandler? Message-ID: Hello, Some time ago, I set up some logging using stdout in a program with the `stdout_redirected()` context manager, which had to close and reopen stdout to work. Unsurprisingly, the StreamHandler didn't take it well. So I made a Handler class which is able to reload the stream (AKA get the new sys.stdout) whenever the old one wasn't writable. But there might be some more legit use cases for stubborn StreamHandlers like that (that are not ugly-looking, hopefully temporary patches). The way I see it for now is a StreamHandler subclass which, instead of having a `stream` argument, would have |`getStream`|,|`reloadStream`| and |`location`|, and would be used this way: On initialisation, it would load the stream object, then act as a regular StreamHandler, but checking that the stream is writable at each `handler.emit()` call. If it is not, then reload it. If given, the |`getStream`| (a callable object which returns a ready-to-use stream) argument is used to load/reload the underlying stream, else it would be fetched at the location described by `location`, and, if it is still not writable, call |`reloadStream()` |(which should put a usable stream object at |`location`|) then try to fetch it again. Here is the current implementation I have: |||```| from .config import _resolve as resolve? # will (uglily) be used later |class ReloadingHandler(StreamHandler):| |??? """| |??? A stream handler which reloads the stream object from one place if an error occurs| |??? """| |??? def __init__(self, getStream=None, reloadStream=None, location=None):| |??????? """| |??????? Initialize the handler.| || |??????? If stream is not specified, sys.stderr is used.| |??????? """ ??? ??? self.getStream = getStream ??? ??? self.stream = None? # to be overwritten later ??? ??? if getStream is None:| ||??? ??? ??? if location is None: ??? ??? ??? ??? self.location = 'sys.stderr'? # note the lack of 'ext://' ??? ??? ??? ??? self.reloadStream = None ??? ??? ??? else: |??? ??? ??? ??? self.reloadStream = reloadStream ??? ??? ??? ??? self.location = location ??? ??? stream = self.reload()? # gets the stream ??? ??? StreamHandler.__init__(self, stream||) |||| ??? def reload(self): ??? ??? if self.getStream is not None: ??? ??? ??? stream = self.getStream() ??? ??? else: ??? ??? ??? try: ??? ??? ??? ??? stream = resolve(self.location) ??? ??? ??? ??? exc = None ??? ??? ??? except Exception as err: ??? ??? ??? ??? exc = err? # is this really needed? ??? ??? ??? ??? stream = None? # just retry for now ??? ??? ??? if stream is None or not stream.writable(): ??? ??? ??? ??? if self.reloadStream is None: ??? ??? ??? ??? ??? if exc: ??? ??? ??? ??? ??? ??? raise exc ??? ??? ??? ??? ??? else: ??? ??? ??? ??? ??? ??? raise ValueError("ReloadingHandler couldn't reload a valid stream") ??? ??? ??? ??? stream = resolve(self.location)? # if it fails this time, do not catch exception here ??? ??? return stream ||| |??? def emit(self, record):| |??????? """| |??????? Emit a record.| || |??????? If a formatter is specified, it is used to format the record.| |??????? The record is then written to the stream with a trailing newline.? If| |??????? exception information is present, it is formatted using| |??????? traceback.print_exception and appended to the stream.? If the stream| |??????? has an 'encoding' attribute, it is used to determine how to do the| |??????? output to the stream.| |??????? """| |??????? if not self.stream.writable():| |??????????? self.reload()| |??????? StreamHandler.emit(self, record)| |||``` | What do you think? (about the idea, the implementation, and the way I wrote this email)| | --- L'absence de virus dans ce courrier ?lectronique a ?t? v?rifi?e par le logiciel antivirus Avast. https://www.avast.com/antivirus -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Jan 26 16:13:43 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 13:13:43 -0800 Subject: [Python-ideas] .then execution of actions following a future's completion In-Reply-To: <169871CD-3D04-4292-AC52-F4AFFC36B2A8@gmail.com> References: <677D5E69-077C-44EB-B440-62DB2C5011F5@gmail.com> <10C6409B-2D98-461B-BC3A-A7209C7B669B@gmail.com> <169871CD-3D04-4292-AC52-F4AFFC36B2A8@gmail.com> Message-ID: Hm. Good point. (Though I'm not sure why that ban exists, since it's not enforceable.) Well, feel free to propose a new API for Python 3.8. On Fri, Jan 26, 2018 at 11:23 AM, Daniel Collins wrote: > That?s very true. I?ll try to keep my terminology more in line with the > implementation in the future. > > The only problem with that, is that the function utilizes methods that are > marked in the documentation as exclusively to be called by the executor > (set_result, instantiation of future objects) and it would be confusing if > a few lines later, a ?but you can use them for this? example was provided. > > -dancollins34 > > Sent from my iPhone > > On Jan 26, 2018, at 1:05 PM, Guido van Rossum wrote: > > On Fri, Jan 26, 2018 at 9:20 AM, Daniel Collins > wrote: > >> @Guido As an aside, my understanding was libraries that fall back to c >> (Numpy as an example) release the GIL for load heavy operations. But I >> believe the explanation would hold in the general case if you replace >> thread with process using a ProcessPoolExecutor, that it would be good to >> be able to submit a callback function back to the executor. >> > > Sure, but your explanation didn't mention any of that. > > And yes, good catch on the last line of my example. :-) > > Given that the solution is only a few lines -- perhaps it's enough to just > add it as an example to the docs, rather than to add it as a new function > to concurrent.futures? A doc change can be added to 3.7! > > -- > --Guido van Rossum (python.org/~guido) > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From pfreixes at gmail.com Fri Jan 26 16:35:46 2018 From: pfreixes at gmail.com (Pau Freixes) Date: Fri, 26 Jan 2018 22:35:46 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? Message-ID: Hi, This mail is the consequence of a true story, a story where CPython got defeated by Javascript, Java, C# and Go. One of the teams of the company where Im working had a kind of benchmark to compare the different languages on top of their respective "official" web servers such as Node.js, Aiohttp, Dropwizard and so on. The test by itself was pretty simple and tried to test the happy path of the logic, a piece of code that fetches N rules from another system and then apply them to X whatevers also fetched from another system, something like that def filter(rule, whatever): if rule.x in whatever.x: return True rules = get_rules() whatevers = get_whatevers() for rule in rules: for whatever in whatevers: if filter(rule, whatever): cnt = cnt + 1 return cnt The performance of Python compared with the other languages was almost x10 times slower. It's true that they didn't optimize the code, but they did not for any language having for all of them the same cost in terms of iterations. Once I saw the code I proposed a pair of changes, remove the call to the filter function making it "inline" and caching the rule's attributes, something like that for rule in rules: x = rule.x for whatever in whatevers: if x in whatever.x: cnt += 1 The performance of the CPython boosted x3/x4 just doing these "silly" things. The case of the rule cache IMHO is very striking, we have plenty examples in many repositories where the caching of none local variables is a widely used pattern, why hasn't been considered a way to do it implicitly and by default? The case of the slowness to call functions in CPython is quite recurrent and looks like its an unsolved problem at all. Sure I'm missing many things, and I do not have all of the information. This mail wants to get all of this information that might help me to understand why we are here - CPython - regarding this two slow patterns. This could be considered an unimportant thing, but its more relevant than someone could expect, at least IMHO. If the default code that you can write in a language is by default slow and exists an alternative to make it faster, this language is doing something wrong. BTW: pypy looks like is immunized [1] [1] https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 -- --pau From guido at python.org Fri Jan 26 16:44:02 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 13:44:02 -0800 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> Message-ID: What does attrs' solution for this problem look like? On Fri, Jan 26, 2018 at 11:11 AM, George Leslie-Waksman wrote: > Even if we could inherit the setting, I would think that we would still > want to require the code be explicit. It seems worse to implicitly require > keyword only arguments for a class without giving any indication in the > code. > > As it stands, the current implementation does not allow a later subclass > to be declared without `keyword_only=True` so we could handle this case by > adding a note to the `TypeError` message about considering the keyword_only > flag. > > How do I got about putting together a proposal to get this into 3.8? > > --George > > > On Thu, Jan 25, 2018 at 5:12 AM Eric V. Smith wrote: > >> I'm not completely opposed to this feature. But there are some cases to >> consider. Here's the first one that occurs to me: note that due to the >> way dataclasses work, it would need to be used everywhere down an >> inheritance hierarchy. That is, if an intermediate base class required >> it, all class derived from that intermediate base would need to specify >> it, too. That's because each class just makes decisions based on its >> fields and its base classes' fields, and not on any flags attached to >> the base class. As it's currently implemented, a class doesn't remember >> any of the decorator's arguments, so there's no way to look for this >> information, anyway. >> >> I think there are enough issues here that it's not going to make it in >> to 3.7. It would require getting a firm proposal together, selling the >> idea on python-dev, and completing the implementation before Monday. But >> if you want to try, I'd participate in the discussion. >> >> Taking Ivan's suggestion one step further, a way to do this currently is >> to pass init=False and then write another decorator that adds the >> kw-only __init__. So the usage would be: >> >> @dataclass >> class Foo: >> some_default: dict = field(default_factory=dict) >> >> @kw_only_init >> @dataclass(init=False) >> class Bar(Foo): >> other_field: int >> >> kw_only_init(cls) would look at fields(cls) and construct the __init__. >> It would be a hassle to re-implement dataclasses's _init_fn function, >> but it could be made to work (in reality, of course, you'd just copy it >> and hack it up to do what you want). You'd also need to use some private >> knowledge of InitVars if you wanted to support them (the stock >> fields(cls) doesn't return them). >> >> For 3.8 we can consider changing dataclasses's APIs if we want to add >> this. >> >> Eric. >> >> On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: >> > It may be possible but it makes for pretty leaky abstractions and it's >> > unclear what that custom __init__ should look like. How am I supposed to >> > know what the replacement for default_factory is? >> > >> > Moreover, suppose I want one base class with an optional argument and a >> > half dozen subclasses each with their own required argument. At that >> > point, I have to write the same __init__ function a half dozen times. >> > >> > It feels rather burdensome for the user when an additional flag (say >> > "kw_only=True") and a modification to: >> > https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294 >> that >> > inserted `['*']` after `[self_name]` if the flag is specified could >> > ameliorate this entire issue. >> > >> > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi > > > wrote: >> > >> > It is possible to pass init=False to the decorator on the subclass >> > (and supply your own custom __init__, if necessary): >> > >> > @dataclass >> > class Foo: >> > some_default: dict = field(default_factory=dict) >> > >> > @dataclass(init=False) # This works >> > class Bar(Foo): >> > other_field: int >> > >> > -- >> > Ivan >> > >> > >> > >> > On 23 January 2018 at 03:33, George Leslie-Waksman >> > > wrote: >> > >> > The proposed implementation of dataclasses prevents defining >> > fields with defaults before fields without defaults. This can >> > create limitations on logical grouping of fields and on >> inheritance. >> > >> > Take, for example, the case: >> > >> > @dataclass >> > class Foo: >> > some_default: dict = field(default_factory=dict) >> > >> > @dataclass >> > class Bar(Foo): >> > other_field: int >> > >> > this results in the error: >> > >> > 5 @dataclass >> > ----> 6 class Bar(Foo): >> > 7 other_field: int >> > 8 >> > >> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >> site-packages/dataclasses.py >> > in dataclass(_cls, init, repr, eq, order, hash, frozen) >> > 751 >> > 752 # We're called as @dataclass, with a class. >> > --> 753 return wrap(_cls) >> > 754 >> > 755 >> > >> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >> site-packages/dataclasses.py >> > in wrap(cls) >> > 743 >> > 744 def wrap(cls): >> > --> 745 return _process_class(cls, repr, eq, order, >> > hash, init, frozen) >> > 746 >> > 747 # See if we're being called as @dataclass or >> > @dataclass(). >> > >> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >> site-packages/dataclasses.py >> > in _process_class(cls, repr, eq, order, hash, init, frozen) >> > 675 # in __init__. Use >> > "self" if possible. >> > 676 '__dataclass_self__' if >> > 'self' in fields >> > --> 677 else 'self', >> > 678 )) >> > 679 if repr: >> > >> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >> site-packages/dataclasses.py >> > in _init_fn(fields, frozen, has_post_init, self_name) >> > 422 seen_default = True >> > 423 elif seen_default: >> > --> 424 raise TypeError(f'non-default argument >> > {f.name !r} ' >> > 425 'follows default >> argument') >> > 426 >> > >> > TypeError: non-default argument 'other_field' follows default >> > argument >> > >> > I understand that this is a limitation of positional arguments >> > because the effective __init__ signature is: >> > >> > def __init__(self, some_default: dict = , >> > other_field: int): >> > >> > However, keyword only arguments allow an entirely reasonable >> > solution to this problem: >> > >> > def __init__(self, *, some_default: dict = , >> > other_field: int): >> > >> > And have the added benefit of making the fields in the __init__ >> > call entirely explicit. >> > >> > So, I propose the addition of a keyword_only flag to the >> > @dataclass decorator that renders the __init__ method using >> > keyword only arguments: >> > >> > @dataclass(keyword_only=True) >> > class Bar(Foo): >> > other_field: int >> > >> > --George Leslie-Waksman >> > >> > _______________________________________________ >> > Python-ideas mailing list >> > Python-ideas at python.org >> > https://mail.python.org/mailman/listinfo/python-ideas >> > Code of Conduct: http://python.org/psf/codeofconduct/ >> > >> > >> > >> > >> > _______________________________________________ >> > Python-ideas mailing list >> > Python-ideas at python.org >> > https://mail.python.org/mailman/listinfo/python-ideas >> > Code of Conduct: http://python.org/psf/codeofconduct/ >> > >> >> > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From egregius313 at gmail.com Fri Jan 26 17:03:31 2018 From: egregius313 at gmail.com (Edward Minnix) Date: Fri, 26 Jan 2018 17:03:31 -0500 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: <29101586-1cf2-4faf-bd5c-1b3f7aa79912@Spark> There are several reasons for the issues you are mentioning. 1. Attribute look up is much more complicated than you would think. ?(If you have the time, watch?https://www.youtube.com/watch?v=kZtC_4Ecq1Y?that will explain things better than I can) ?The series of operations that happen with every `obj.attr` occurrence can be complicated. It goes something like: ?def get_attr(obj, attr): ?? ? if attr in obj.__dict__: ? ? ?? ? value = obj.__dict__[attr] ? ? ? ? ?if is_descriptor(value): ? ? ? ? ? ? ?return value(obj) ? ? ? ? ?else: ? ? ? ? ? ? ?return value ? ?? else: ? ? ? ? ?for cls in type(obj).mro(): ? ? ? ? ? ? ?if attr in cls.__dict__: ? ? ? ? ? ? ? ? ?value = cls.__dict__[attr] ? ? ? ? ? ? ? ? ?if is_descriptor(value): ? ? ? ? ? ? ? ? ? ? ?return value(obj) ? ? ? ? ? ? ? ? ?else: ? ? ? ? ? ? ? ? ? ? ?return value ? ? ? ? ?else: ? ? ? ? ? ? ?raise AttributeError('Attribute %s not found' % attr) ?Therefore, the caching means this operation is only done once instead of n times (where n = len(whatevers)) 2. Function calls 3. Dynamic code makes things harder to optimize ?Python?s object model allows for constructs that are very hard to optimize without knowing about the structure of the data ahead of time. ?For instance, if an attribute is defined by a property, there are no guarantees of obj.attr will return the same thing. ?So in simple terms, the power Python gives you over the language makes it harder to optimize the language. 4. CPython?s compiler makes (as a rule) no optimizations ?CPython?s compiler is a fairly direct source-to-bytecode compiler, not an actual optimizing compiler. So anything beyond constant-folding and ?deletion of some types of debug code, the language isn?t going to worry about optimizing things for you. So in simple terms, of the languages you mentioned, JavaScript?s object model is substantially less powerful than Python?s, but it also is more straightforward in terms of what obj.attr means, and the other 3 you mentioned all have statically-typed, optimizing compilers, with a straight-forward method resolution order. The things you see as flaws end up being the way Pythonistas can add more dynamic systems into their APIs (and since we don?t have macros, most of our dynamic operations must be done at run-time). - Ed On Jan 26, 2018, 16:36 -0500, Pau Freixes , wrote: > Hi, > > This mail is the consequence of a true story, a story where CPython > got defeated by Javascript, Java, C# and Go. > > One of the teams of the company where Im working had a kind of > benchmark to compare the different languages on top of their > respective "official" web servers such as Node.js, Aiohttp, Dropwizard > and so on. The test by itself was pretty simple and tried to test the > happy path of the logic, a piece of code that fetches N rules from > another system and then apply them to X whatevers also fetched from > another system, something like that > > def filter(rule, whatever): > if rule.x in whatever.x: > return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > for whatever in whatevers: > if filter(rule, whatever): > cnt = cnt + 1 > > return cnt > > > The performance of Python compared with the other languages was almost > x10 times slower. It's true that they didn't optimize the code, but > they did not for any language having for all of them the same cost in > terms of iterations. > > Once I saw the code I proposed a pair of changes, remove the call to > the filter function making it "inline" and caching the rule's > attributes, something like that > > for rule in rules: > x = rule.x > for whatever in whatevers: > if x in whatever.x: > cnt += 1 > > The performance of the CPython boosted x3/x4 just doing these "silly" things. > > The case of the rule cache IMHO is very striking, we have plenty > examples in many repositories where the caching of none local > variables is a widely used pattern, why hasn't been considered a way > to do it implicitly and by default? > > The case of the slowness to call functions in CPython is quite > recurrent and looks like its an unsolved problem at all. > > Sure I'm missing many things, and I do not have all of the > information. This mail wants to get all of this information that might > help me to understand why we are here - CPython - regarding this two > slow patterns. > > This could be considered an unimportant thing, but its more relevant > than someone could expect, at least IMHO. If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. > > BTW: pypy looks like is immunized [1] > > [1] https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 > -- > --pau > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Fri Jan 26 17:12:29 2018 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 27 Jan 2018 09:12:29 +1100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On Sat, Jan 27, 2018 at 8:35 AM, Pau Freixes wrote: > def filter(rule, whatever): > if rule.x in whatever.x: > return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > for whatever in whatevers: > if filter(rule, whatever): > cnt = cnt + 1 > > return cnt > > > The performance of Python compared with the other languages was almost > x10 times slower. It's true that they didn't optimize the code, but > they did not for any language having for all of them the same cost in > terms of iterations. Did you consider using a set instead of a list for your inclusion checks? I don't have the full details of what the code is doing, but the "in" check on a large set can be incredibly fast compared to the equivalent on a list/array. > This could be considered an unimportant thing, but its more relevant > than someone could expect, at least IMHO. If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. Are you sure it's the language's fault? Failing to use a better data type simply because some other language doesn't have it is a great way to make a test that's "fair" in the same way that Balance and Armageddon are "fair" in Magic: The Gathering. They reset everyone to the baseline, and the baseline's equal for everyone right? Except that that's unfair to a language that prefers to work somewhere above the baseline, and isn't optimized for naive code. ChrisA From chris.barker at noaa.gov Fri Jan 26 17:18:53 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 26 Jan 2018 14:18:53 -0800 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: If there are robust and simple optimizations that can be added to CPython, great, but: This mail is the consequence of a true story, a story where CPython > got defeated by Javascript, Java, C# and Go. > at least those last three are statically compiled languages -- they are going to be faster than Python for this sort of thing -- particularly for code written in a non-pythonic style... def filter(rule, whatever): > if rule.x in whatever.x: > return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > for whatever in whatevers: > if filter(rule, whatever): > cnt = cnt + 1 > > return cnt > > It's true that they didn't optimize the code, but > they did not for any language having for all of them the same cost in > terms of iterations. > sure, but I would argue that you do need to write code in a clean style appropriate for the language at hand. For instance, the above creates a function that is a simple one-liner -- there is no reason to do that, and the fact that function calls to have significant overhead in Python is going to bite you. for rule in rules: > x = rule.x > for whatever in whatevers: > if x in whatever.x: > cnt += 1 > > The performance of the CPython boosted x3/x4 just doing these "silly" > things. > "inlining" the filter call is making the code more pythonic and readable -- a no brainer. I wouldn't call that a optimization. making rule.x local is an optimization -- that is, the only reason you'd do it to to make the code go faster. how much difference did that really make? I also don't know what type your "whatevers" are, but "x in something" can be order (n) if they re sequences, and using a dict or set would be a much better performance. and perhaps collections.Counter would help here, too. In short, it is a non-goal to get python to run as fast as static langues for simple nested loop code like this :-) The case of the rule cache IMHO is very striking, we have plenty > examples in many repositories where the caching of none local > variables is a widely used pattern, why hasn't been considered a way > to do it implicitly and by default? > you can bet it's been considered -- the Python core devs are a pretty smart bunch :-) The fundamental reason is that rule.x could change inside that loop -- so you can't cache it unless you know for sure it won't. -- Again, dynamic language. The case of the slowness to call functions in CPython is quite > recurrent and looks like its an unsolved problem at all. > dynamic language again ... If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. > yes, that's true -- but your example shouldn't be the default code you write in Python. BTW: pypy looks like is immunized [1] > > [1] https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 PyPy uses a JIT -- which is the way to make a dynamic language run faster -- That's kind of why it exists.... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From fakedme+py at gmail.com Fri Jan 26 17:59:19 2018 From: fakedme+py at gmail.com (Soni L.) Date: Fri, 26 Jan 2018 20:59:19 -0200 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: <704a5473-13e1-6d12-56fb-1b66282abfef@gmail.com> On 2018-01-26 08:18 PM, Chris Barker wrote: > If there are robust and simple optimizations that can be added to > CPython, great, but: > > This mail is the consequence of a true story, a story where CPython > got defeated by Javascript, Java, C# and Go. > > > at least those last three are statically compiled languages -- they > are going to be faster than Python for this sort of thing -- > particularly for code written in a non-pythonic style... Java and C#? Statically compiled? Haha. No. Java has a bytecode. While yes, Java doesn't need to compile your code before running it, the compilation time in CPython is usually minimal, unless you're using eval. You can precompile your python into bytecode but it's usually not worth it. Java can also load bytecode at runtime and do bytecode manipulation stuff. The only "real" benefit of Java is that object layout is pretty much static. (This can be simulated with __slots__ I think? idk.) See also, for example: http://luajit.org/ext_ffi.html#cdata (The same goes for C#. Idk about Go.) (Ofc, their JITs do also help. But even with the JIT off, it's still pretty good.) > > def filter(rule, whatever): > ? ? if rule.x in whatever.x: > ? ? ? ? return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > ? ? for whatever in whatevers: > ? ? ? ? if filter(rule, whatever): > ? ? ? ? ? ? cnt = cnt + 1 > > return cnt > > ?It's true that they didn't optimize the code, but > they did not for any language having for all of them the same cost in > terms of iterations. > > > sure, but I would argue that you do need to write code in a clean > style appropriate for the language at hand. > For instance, the above creates a function that is a simple one-liner > -- there is no reason to do that, and the fact that function calls to > have significant overhead in Python is going to bite you. > > for rule in rules: > ? ? x = rule.x > ? ? for whatever in whatevers: > ? ? ? ? if x in whatever.x: > ? ? ? ? ? ? cnt += 1 > > The performance of the CPython boosted x3/x4 just doing these > "silly" things. > > > "inlining" the filter call is making the code more pythonic and > readable -- a no brainer. I wouldn't call that a optimization. > > making rule.x local is an optimization -- that is, the only reason > you'd do it to to make the code go faster. how much difference did > that really make? > > I also don't know what type your "whatevers" are, but "x in something" > can be order (n) if they re sequences, and using a dict or set would > be a much better performance. > > and perhaps collections.Counter would help here, too. > > In short, it is a non-goal to get python to run as fast as static > langues for simple nested loop code like this :-) > > The case of the rule cache IMHO is very striking, we have plenty > examples in many repositories where the caching of none local > variables is a widely used pattern, why hasn't been considered a way > to do it implicitly and by default? > > > you can bet it's been considered -- the Python core devs are a pretty > smart bunch :-) > > The fundamental reason is that rule.x could change inside that loop -- > so you can't cache it unless you know for sure it won't. -- Again, > dynamic language. > > The case of the slowness to call functions in CPython is quite > recurrent and looks like its an unsolved problem at all. > > > dynamic language again ... > > ?If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. > > > yes, that's true -- but your example shouldn't be the default code you > write in Python. > > BTW: pypy looks like is immunized [1] > > [1] > https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 > > > > PyPy uses a JIT -- which is the way to make a dynamic language run > faster -- That's kind of why it exists.... > > -CHB > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R ? ? ? ? ? ?(206) 526-6959?? voice > 7600 Sand Point Way NE ??(206) 526-6329?? fax > Seattle, WA ?98115 ? ? ??(206) 526-6317?? main reception > > Chris.Barker at noaa.gov > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Fri Jan 26 18:07:53 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Jan 2018 10:07:53 +1100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: <20180126230748.GC22500@ando.pearwood.info> On Sat, Jan 27, 2018 at 09:12:29AM +1100, Chris Angelico wrote: > Are you sure it's the language's fault? Failing to use a better data > type simply because some other language doesn't have it is a great way > to make a test that's "fair" in the same way that Balance and > Armageddon are "fair" in Magic: The Gathering. They reset everyone to > the baseline, and the baseline's equal for everyone right? I'm afraid I have no idea what that analogy means :-) -- Steve From victor.stinner at gmail.com Fri Jan 26 18:28:12 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Sat, 27 Jan 2018 00:28:12 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: Hi, Well, I wrote https://faster-cpython.readthedocs.io/ website to answer to such question. See for example https://faster-cpython.readthedocs.io/mutable.html "Everything in Python is mutable". Victor 2018-01-26 22:35 GMT+01:00 Pau Freixes : > Hi, > > This mail is the consequence of a true story, a story where CPython > got defeated by Javascript, Java, C# and Go. > > One of the teams of the company where Im working had a kind of > benchmark to compare the different languages on top of their > respective "official" web servers such as Node.js, Aiohttp, Dropwizard > and so on. The test by itself was pretty simple and tried to test the > happy path of the logic, a piece of code that fetches N rules from > another system and then apply them to X whatevers also fetched from > another system, something like that > > def filter(rule, whatever): > if rule.x in whatever.x: > return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > for whatever in whatevers: > if filter(rule, whatever): > cnt = cnt + 1 > > return cnt > > > The performance of Python compared with the other languages was almost > x10 times slower. It's true that they didn't optimize the code, but > they did not for any language having for all of them the same cost in > terms of iterations. > > Once I saw the code I proposed a pair of changes, remove the call to > the filter function making it "inline" and caching the rule's > attributes, something like that > > for rule in rules: > x = rule.x > for whatever in whatevers: > if x in whatever.x: > cnt += 1 > > The performance of the CPython boosted x3/x4 just doing these "silly" things. > > The case of the rule cache IMHO is very striking, we have plenty > examples in many repositories where the caching of none local > variables is a widely used pattern, why hasn't been considered a way > to do it implicitly and by default? > > The case of the slowness to call functions in CPython is quite > recurrent and looks like its an unsolved problem at all. > > Sure I'm missing many things, and I do not have all of the > information. This mail wants to get all of this information that might > help me to understand why we are here - CPython - regarding this two > slow patterns. > > This could be considered an unimportant thing, but its more relevant > than someone could expect, at least IMHO. If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. > > BTW: pypy looks like is immunized [1] > > [1] https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 > -- > --pau > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From steve at pearwood.info Fri Jan 26 19:25:35 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Jan 2018 11:25:35 +1100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: <20180127002535.GD22500@ando.pearwood.info> On Fri, Jan 26, 2018 at 02:18:53PM -0800, Chris Barker wrote: [...] > sure, but I would argue that you do need to write code in a clean style > appropriate for the language at hand. Indeed. If you write Java-esque code in Python with lots of deep chains obj.attr.spam.eggs.cheese.foo.bar.baz expecting that the compiler will resolve them at compile-time, your code will be slow. No language is immune from this: it is possible to write bad code in any language, and if you write Pythonesque highly dynamic code using lots of runtime dispatching in Java, your Java benchmarks will be slow too. But having agreed with your general principle, I'm afraid I have to disagree with your specific: > For instance, the above creates a function that is a simple one-liner -- > there is no reason to do that, and the fact that function calls to have > significant overhead in Python is going to bite you. I disagree that there is no reason to write simple "one-liners". As soon as you are calling that one-liner from more than two, or at most three, places, the DRY principle strongly suggests you move it into a function. Even if you're only calling the one-liner from the one place, there can still be reasons to refactor it out into a separate function, such as for testing and maintainability. Function call overhead is a genuine pain-point for Python code which needs to be fast. I'm fortunate that I rarely run into this in practice: most of the time either my code doesn't need to be fast (if it takes 3 ms instead of 0.3 ms, I'm never going to notice the difference) or the function call overhead is trivial compared to the rest of the computation. But it has bit me once or twice, in the intersection of: - code that needs to be as fast as possible; - code that needs to be factored into subroutines; - code where the cost of the function calls is a significant fraction of the overall cost. When all three happen at the same time, it is painful and there's no good solution. > "inlining" the filter call is making the code more pythonic and readable -- > a no brainer. I wouldn't call that a optimization. In this specific case of "if rule.x in whatever.x", I might agree with you, but if the code is a bit more complex but still a one-liner: if rules[key].matcher.lower() in data[key].person.history: I would much prefer to see it factored out into a function or method. So we have to judge each case on its merits: it isn't a no-brainer that inline code is always more Pythonic and readable. > making rule.x local is an optimization -- that is, the only reason you'd do > it to to make the code go faster. how much difference did that really make? I assumed that rule.x could be a stand-in for a longer, Java-esque chain of attribute accesses. > I also don't know what type your "whatevers" are, but "x in something" can > be order (n) if they re sequences, and using a dict or set would be a much > better performance. Indeed. That's a good point. -- Steve From songofacandy at gmail.com Fri Jan 26 19:52:39 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Sat, 27 Jan 2018 09:52:39 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: > > That's fine with me. Please also add it to bytes and bytearray objects. It's > okay if the implementation has to scan the string -- so do isdigit() etc. > > -- > --Guido van Rossum (python.org/~guido) Thanks for your pronouncement! I'll do it in this weekend. Regards, -- INADA Naoki From steve at pearwood.info Fri Jan 26 20:27:18 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Jan 2018 12:27:18 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <20180126123953.GA22500@ando.pearwood.info> Message-ID: <20180127012717.GE22500@ando.pearwood.info> On Fri, Jan 26, 2018 at 02:37:14PM +0100, Victor Stinner wrote: > 2018-01-26 13:39 GMT+01:00 Steven D'Aprano : > > I have no objection to isascii, but I don't think it goes far enough. > > Sometimes I want to know whether a string is compatible with Latin-1 or > > UCS-2 as well as ASCII. For that, I used a function that exposes the > > size of code points in bits: > > Really? I never required such check in practice. Would you mind to > elaborate your use case? tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. Dealing with the Supplementary Unicode Planes have the same problems that older "narrow" builds of Python sufferred from: single code points were counted as len(2) instead of len(1), slicing could be wrong, etc. There are still many applications which assume Latin-1 data. For instance, I use a media player which displays mojibake when passed anything outside of Latin-1. Sometimes it is useful to know in advance when text you pass to another application is going to run into problems because of the other application's limitations. -- Steve From rosuav at gmail.com Fri Jan 26 21:29:38 2018 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 27 Jan 2018 13:29:38 +1100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: <20180126230748.GC22500@ando.pearwood.info> References: <20180126230748.GC22500@ando.pearwood.info> Message-ID: On Sat, Jan 27, 2018 at 10:07 AM, Steven D'Aprano wrote: > On Sat, Jan 27, 2018 at 09:12:29AM +1100, Chris Angelico wrote: > >> Are you sure it's the language's fault? Failing to use a better data >> type simply because some other language doesn't have it is a great way >> to make a test that's "fair" in the same way that Balance and >> Armageddon are "fair" in Magic: The Gathering. They reset everyone to >> the baseline, and the baseline's equal for everyone right? > > I'm afraid I have no idea what that analogy means :-) > When you push everyone to an identical low level, you're not truly being fair. Let's say you try to benchmark a bunch of programming languages against each other by having them use no more than four local variables, all integers, one static global array for shared storage, and no control flow other than conditional GOTOs. (After all, that's all you get in some machine languages!) It's perfectly fair, all languages have to compete on the same grounds. But it's also completely UNfair on high level languages, because you're implementing things in terribly bad ways. "Fair" is a tricky concept, and coding in a non-Pythonic way is not truly "fair" to Python. ChrisA From guido at python.org Fri Jan 26 21:38:42 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 18:38:42 -0800 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180127012717.GE22500@ando.pearwood.info> References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: IMO the special status for isascii() matches the special status of ASCII as encoding (yeah, I know, it's not the default encoding anywhere, but it still comes up regularly in standards and as common subset of other encodings). Should you wish to check for compatibility with other ranges IMO some expression involving max() should cut it. (FWIW there should be a special place in hell for those people who say "ASCII" when they mean "Latin-1".) On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano wrote: > On Fri, Jan 26, 2018 at 02:37:14PM +0100, Victor Stinner wrote: > > 2018-01-26 13:39 GMT+01:00 Steven D'Aprano : > > > I have no objection to isascii, but I don't think it goes far enough. > > > Sometimes I want to know whether a string is compatible with Latin-1 or > > > UCS-2 as well as ASCII. For that, I used a function that exposes the > > > size of code points in bits: > > > > Really? I never required such check in practice. Would you mind to > > elaborate your use case? > > tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. > Dealing with the Supplementary Unicode Planes have the same problems > that older "narrow" builds of Python sufferred from: single code points > were counted as len(2) instead of len(1), slicing could be wrong, etc. > > There are still many applications which assume Latin-1 data. For > instance, I use a media player which displays mojibake when passed > anything outside of Latin-1. > > Sometimes it is useful to know in advance when text you pass to another > application is going to run into problems because of the other > application's limitations. > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From cs at cskk.id.au Fri Jan 26 19:20:23 2018 From: cs at cskk.id.au (Cameron Simpson) Date: Sat, 27 Jan 2018 11:20:23 +1100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: <704a5473-13e1-6d12-56fb-1b66282abfef@gmail.com> References: <704a5473-13e1-6d12-56fb-1b66282abfef@gmail.com> Message-ID: <20180127002023.GA16300@cskk.homeip.net> On 26Jan2018 20:59, Soni L. wrote: >On 2018-01-26 08:18 PM, Chris Barker wrote: >>If there are robust and simple optimizations that can be added to >>CPython, great, but: >> >> This mail is the consequence of a true story, a story where CPython >> got defeated by Javascript, Java, C# and Go. >> >>at least those last three are statically compiled languages -- they >>are going to be faster than Python for this sort of thing -- >>particularly for code written in a non-pythonic style... > >Java and C#? Statically compiled? Haha. >No. > >Java has a bytecode. While yes, Java doesn't need to compile your code >before running it, the compilation time in CPython is usually minimal, >unless you're using eval. You can precompile your python into bytecode >but it's usually not worth it. Java can also load bytecode at runtime >and do bytecode manipulation stuff. > >The only "real" benefit of Java is that object layout is pretty much >static. (This can be simulated with __slots__ I think? idk.) See also, >for example: >http://luajit.org/ext_ffi.html#cdata > >(The same goes for C#. Idk about Go.) However, both Java and Go are staticly typed; I think C# is too, but don't know. The compiler has full knowledge of the types of almost every symbol, and can write machine optimal code for operations (even though the initial machine is the JVM for Java - I gather the JVM bytecode is also type annotated, so JITs can in turn do a far better job at making machine optimal machine code when used). This isn't really an option for "pure" Python. Cheers, Cameron Simpson (formerly cs at zip.com.au) From tjreedy at udel.edu Fri Jan 26 23:22:33 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 26 Jan 2018 23:22:33 -0500 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180127012717.GE22500@ando.pearwood.info> References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: On 1/26/2018 8:27 PM, Steven D'Aprano wrote: > On Fri, Jan 26, 2018 at 02:37:14PM +0100, Victor Stinner wrote: >> Really? I never required such check in practice. Would you mind to >> elaborate your use case? > > tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. Since IDLE is a tkinter application, it would be helpful if there were an isbmp function that exposed the existing flag in the string object. -- Terry Jan Reedy From ncoghlan at gmail.com Sat Jan 27 01:42:18 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 27 Jan 2018 16:42:18 +1000 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On 27 January 2018 at 07:35, Pau Freixes wrote: > This could be considered an unimportant thing, but its more relevant > than someone could expect, at least IMHO. If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. > Not really, as we've seen with the relatively slow adoption of PyPy over the past several years. CPython, as an implementation, emphasises C/C++ compatibility, and internal interpreter simplicity. That comes at a definite cost in runtime performance (especially where attribute access and function calls are concerned), but has also enabled an enormous orchestration ecosystem, originally around C/C++/FORTRAN components, but now increasingly around Rust components within the same process, as well as out-of-process Java, C#, and JavaScript components. In this usage model, if Python code becomes the throughput bottleneck, it's only because something has gone wrong at the system architecture level. PyPy, by contrast, emphasises raw speed, sacrificing various aspects of CPython's C/C++ interoperability in order to attain it. It's absolutely the implementation you want to be using if your main concern is the performance of your Python code in general, and there aren't any obvious hotspots that could be more selectively accelerated. To date, the CPython model of "Use (C)Python to figure out what kind of problem you have, then rewrite your performance bottlenecks in a language more specifically tailored to that problem space" has proven relatively popular. There's likely still more we can do within CPython to make typical code faster without increasing the interpreter complexity too much (e.g. Yury's idea of introducing an implicit per-opcode result cache into the eval loop), but opt-in solutions that explicit give up some of Python's language level dynamism are always going to be able to do less work at runtime than typical Python code does. Cheers, Nick. P.S. You may find https://www.curiousefficiency.org/posts/2015/10/languages-to-improve-your-python.html#broadening-our-horizons interesting in the context of considering some of the many factors other than raw speed that may influence people's choice of programming language. Similarly, https://www.curiousefficiency.org/posts/2017/10/considering-pythons-target-audience.html provides some additional info on the scope of Python's use cases (for the vast majority of which, "How many requests per second can I serve in a naive loop in a CPU bound process?" isn't a particularly relevant characteristic) -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Sat Jan 27 02:01:10 2018 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Jan 2018 23:01:10 -0800 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: On Fri, Jan 26, 2018 at 8:22 PM, Terry Reedy wrote: > On 1/26/2018 8:27 PM, Steven D'Aprano wrote: > >> On Fri, Jan 26, 2018 at 02:37:14PM +0100, Victor Stinner wrote: >> > > Really? I never required such check in practice. Would you mind to >>> elaborate your use case? >>> >> >> tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. >> > > Since IDLE is a tkinter application, it would be helpful if there were an > isbmp function that exposed the existing flag in the string object. > Sorry, no. The existing internal flag is an implementation detail that should not show in the API. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From tjreedy at udel.edu Sat Jan 27 02:30:16 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 27 Jan 2018 02:30:16 -0500 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: On 1/27/2018 2:01 AM, Guido van Rossum wrote: > On Fri, Jan 26, 2018 at 8:22 PM, Terry Reedy > > wrote: > > On 1/26/2018 8:27 PM, Steven D'Aprano wrote: > > On Fri, Jan 26, 2018 at 02:37:14PM +0100, Victor Stinner wrote: > > > Really? I never required such check in practice. Would you > mind to > elaborate your use case? > > > tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. > > > Since IDLE is a tkinter application, it would be helpful if there > were an isbmp function that exposed the existing flag in the string > object. > > > Sorry, no. The existing internal flag is an implementation detail that > should not show in the API. It occurred to me that this might be an issue. Rather than define a LBYL scanner in Python, I think I will try wrapping inserts of user-supplied strings into widgets with with try: insert; except: . -- Terry Jan Reedy From stephanh42 at gmail.com Sat Jan 27 05:33:38 2018 From: stephanh42 at gmail.com (Stephan Houben) Date: Sat, 27 Jan 2018 11:33:38 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: Hi all, I would like to remark that, in my opinion, the question of CPython's performance cannot be decoupled from the extremely wide selection of packages which provide optimized code for almost any imaginable task. For example: Javascript may be faster than (C)Python on simple benchmarks, but as soon as the task is somewhat amenable to scypi, and I can use scipy in Python, the resulting performance will completely cream Javascript in a way that isn't funny anymore. And scipy is just an example, there are tons of such libraries for all kinds of tasks I am not aware of any language ecosystem with a similar wide scope of packages; at least Java and Node both fall short. (Node may have more packages by number but the quality is definitely less and there is tons of overlap). Stephan 2018-01-27 7:42 GMT+01:00 Nick Coghlan : > On 27 January 2018 at 07:35, Pau Freixes wrote: > >> This could be considered an unimportant thing, but its more relevant >> than someone could expect, at least IMHO. If the default code that you >> can write in a language is by default slow and exists an alternative >> to make it faster, this language is doing something wrong. >> > > Not really, as we've seen with the relatively slow adoption of PyPy over > the past several years. > > CPython, as an implementation, emphasises C/C++ compatibility, and > internal interpreter simplicity. That comes at a definite cost in runtime > performance (especially where attribute access and function calls are > concerned), but has also enabled an enormous orchestration ecosystem, > originally around C/C++/FORTRAN components, but now increasingly around > Rust components within the same process, as well as out-of-process Java, > C#, and JavaScript components. In this usage model, if Python code becomes > the throughput bottleneck, it's only because something has gone wrong at > the system architecture level. > > PyPy, by contrast, emphasises raw speed, sacrificing various aspects of > CPython's C/C++ interoperability in order to attain it. It's absolutely the > implementation you want to be using if your main concern is the performance > of your Python code in general, and there aren't any obvious hotspots that > could be more selectively accelerated. > > To date, the CPython model of "Use (C)Python to figure out what kind of > problem you have, then rewrite your performance bottlenecks in a language > more specifically tailored to that problem space" has proven relatively > popular. There's likely still more we can do within CPython to make typical > code faster without increasing the interpreter complexity too much (e.g. > Yury's idea of introducing an implicit per-opcode result cache into the > eval loop), but opt-in solutions that explicit give up some of Python's > language level dynamism are always going to be able to do less work at > runtime than typical Python code does. > > Cheers, > Nick. > > P.S. You may find https://www.curiousefficiency. > org/posts/2015/10/languages-to-improve-your-python.html# > broadening-our-horizons interesting in the context of considering some of > the many factors other than raw speed that may influence people's choice of > programming language. Similarly, https://www.curiousefficiency. > org/posts/2017/10/considering-pythons-target-audience.html provides some > additional info on the scope of Python's use cases (for the vast majority > of which, "How many requests per second can I serve in a naive loop in a > CPU bound process?" isn't a particularly relevant characteristic) > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pfreixes at gmail.com Sat Jan 27 16:18:08 2018 From: pfreixes at gmail.com (Pau Freixes) Date: Sat, 27 Jan 2018 22:18:08 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: Hi, Thanks to all of you for your responses, the points of view and the information that you shared to back up your rationales, I had some time to visit few of them but sure I will try to suit the proper time to review all of them. It's hard to try to keep the discussion organized responding at each response, if you don't mind I would do it with just this email. If you believe that I'm missing something important shoot it. First of all, my fault starting the discussion with the language battle side, this didn't help it to focus the conversation to the point that I wanted to discuss. So, the intention was to raise two use cases which both have a performance cost that could be explicitly circumvented by the developer, taking into account that both are, let's say, well known by the community. Correct me if I'm wrong, but most of you argue that the proper Zen of Python - can we say it mutability [1]? as Victor pointed out - that allow the user have the freedom to mutate objects in runtime goes in the opposite direction of allowing the *compiler* to make code with some optimizations. Or, more specifically for the ceval - *interpreter*? - apply some hacks that would help to reduce the footprint of some operations. Im wondering if a solution might pass for having something like that [2] but for generic attributes, should it be possible? has been discussed before ? is there any red-flag that you might thing that will make to much complicated a well-balanced solution? Regarding the cost of calling a function, that I can guess is not related with the previous stuff, what is an impediment right now to make it faster ? [1] https://faster-cpython.readthedocs.io/mutable.html [2] https://bugs.python.org/issue28158 On Fri, Jan 26, 2018 at 10:35 PM, Pau Freixes wrote: > Hi, > > This mail is the consequence of a true story, a story where CPython > got defeated by Javascript, Java, C# and Go. > > One of the teams of the company where Im working had a kind of > benchmark to compare the different languages on top of their > respective "official" web servers such as Node.js, Aiohttp, Dropwizard > and so on. The test by itself was pretty simple and tried to test the > happy path of the logic, a piece of code that fetches N rules from > another system and then apply them to X whatevers also fetched from > another system, something like that > > def filter(rule, whatever): > if rule.x in whatever.x: > return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > for whatever in whatevers: > if filter(rule, whatever): > cnt = cnt + 1 > > return cnt > > > The performance of Python compared with the other languages was almost > x10 times slower. It's true that they didn't optimize the code, but > they did not for any language having for all of them the same cost in > terms of iterations. > > Once I saw the code I proposed a pair of changes, remove the call to > the filter function making it "inline" and caching the rule's > attributes, something like that > > for rule in rules: > x = rule.x > for whatever in whatevers: > if x in whatever.x: > cnt += 1 > > The performance of the CPython boosted x3/x4 just doing these "silly" things. > > The case of the rule cache IMHO is very striking, we have plenty > examples in many repositories where the caching of none local > variables is a widely used pattern, why hasn't been considered a way > to do it implicitly and by default? > > The case of the slowness to call functions in CPython is quite > recurrent and looks like its an unsolved problem at all. > > Sure I'm missing many things, and I do not have all of the > information. This mail wants to get all of this information that might > help me to understand why we are here - CPython - regarding this two > slow patterns. > > This could be considered an unimportant thing, but its more relevant > than someone could expect, at least IMHO. If the default code that you > can write in a language is by default slow and exists an alternative > to make it faster, this language is doing something wrong. > > BTW: pypy looks like is immunized [1] > > [1] https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 > -- > --pau -- --pau From mertz at gnosis.cx Sat Jan 27 17:53:53 2018 From: mertz at gnosis.cx (David Mertz) Date: Sat, 27 Jan 2018 14:53:53 -0800 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: > > def filter(rule, whatever): > if rule.x in whatever.x: > return True > > rules = get_rules() > whatevers = get_whatevers() > for rule in rules: > for whatever in whatevers: > if filter(rule, whatever): > cnt = cnt + 1 > > return cnt > This code seems almost certainly broken as written. Not just suboptimal, but just plain wrong. As a start, it has a return statement outside a function or method body, so it's not even syntactical (also, `cnt` is never initialized). But assuming we just print out `cnt` it still gives an answer we probably don't want. For example, I wrote get_rules() and get_whatevers() functions that produce a list of "things" that have an x attribute. Each thing holds a (distinct) word from a Project Gutenberg book (Title: Animal Locomotion: Or walking, swimming, and flying, with a dissertation on a?ronautics; Author: J. Bell Pettigrew). Whatevers omit a few of the words, since the code suggests there should be more rules. In particular: In [1]: len(get_rules()), len(get_whatevers()) Out [1]: (12306, 12301) Running the probably wrong code is indeed slow: In [2]: %%time def filter(rule, whatever): if rule.x in whatever.x: return True rules = get_rules() whatevers = get_whatevers() cnt = 0 for rule in rules: for whatever in whatevers: if filter(rule, whatever): cnt = cnt + 1 print(cnt) Out [2]: 110134 CPU times: user 53.1 s, sys: 190 ms, total: 53.3 s Wall time: 53.6 s It's hard for me to imagine why this is the question one would want answered. It seems much more likely you'd want to know: In [3]: %%time len({thing.x for thing in get_rules()} - {thing.x for thing in get_whatevers()}) Out [3]: CPU times: user 104 ms, sys: 4.89 ms, total: 109 ms Wall time: 112 ms 5 So that's 500 times faster, more Pythonic, and seems to actually answer the question one would want answered. However, let's suppose there really is a reason to answer the question in the original code. Using more sensible basic datatypes, we only get about a 3x speedup: In [4]: %%time rules = {thing.x for thing in get_rules()} whatevers = {thing.x for thing in get_whatevers()} cnt = 0 for rule in rules: for whatever in whatevers: if rule in whatever: cnt += 1 print(cnt) Out [4]: 110134 CPU times: user 18.3 s, sys: 96.9 ms, total: 18.4 s Wall time: 18.5 s I'm sure there is room for speedup if the actual problem being solved was better described. Maybe something involving itertools.product(), but it's hard to say without knowing what the correct behavior actually is. Overall this is similar to saying you could implement bogosort in Python, and it would be much slower than calling Timsort with `sorted(my_stuff)` -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Jan 28 01:14:54 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Jan 2018 16:14:54 +1000 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On 28 January 2018 at 07:18, Pau Freixes wrote: > Regarding the cost of calling a function, that I can guess is not > related with the previous stuff, what is an impediment right now to > make it faster ? At a technical level, the biggest problems relate to the way we manipulate frame objects at runtime, including the fact that we expose those frames programmatically for the benefit of debuggers and other tools. More broadly, the current lack of perceived commercial incentives for large corporations to invest millions in offering a faster default Python runtime, the way they have for the other languages you mentioned in your initial post :) Cheers, Nick. P.S. Fortunately for Python users in general, those incentives are in the process of changing, as we see the rise of platforms like AWS Lambda (where vendors and platforms charging by the RAM-second gives a clear financial incentive to investing in software performance improvements). -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From mertz at gnosis.cx Sun Jan 28 01:25:09 2018 From: mertz at gnosis.cx (David Mertz) Date: Sat, 27 Jan 2018 22:25:09 -0800 Subject: [Python-ideas] Format mini-language for lakh and crore Message-ID: In South Asia, a different style of digit delimiters for large numbers is used than in Europe, North America, Australia, etc. With some minor spelling differences, the term lakh is used for a hundred-thousand, and it is generally written as '1,00,000'. In turn, a crore is 100 lakh, and is written as '1,00,00,000'. Extending this pattern, larger numbers continue to use two digits in groups (other than the smallest grouping of three digits. So, e.g. 1e12 is written as 10,00,00,00,00,000. It's nice that we now have the optional underscore in numeric literals. So we could write a number as either `12_34_56_78_00_000` or `1_234_567_800_000` depending on what region of the world and which convention was more familiar. However, in *formatting* those numbers, the format mini-language only allows the European convention. So e.g. In [1]: x = 12_34_56_78_00_000 In [2]: "{:,d}".format(x) Out[2]: '1,234,567,800,000' In [3]: f"{x:,d}" Out[3]: '1,234,567,800,000' In order to get Indian number delimiters, you'd have to write a custom formatting function, notwithstanding that something like 1.5 billion people use the three-then-two delimiting convention. I propose that Python should have an additional grouping option, or some other way to specify this grouping convention. Oddly, the '_' grouping symbol is available, even though no one actually uses that grouper outside of programming languages like Python, e.g.: In [4]: f"{x:_d}" Out[4]: '1_234_567_800_000' I guess this is nice for something like round-tripping numbers used in code, but it's not a symbol anyone uses "natively" (I understand why comma or period cannot be used in numeric literals since they mean something else in Python already). I'm not sure what symbol or combination I would recommend, but finding something suitable shouldn't be so hard. Perhaps now that backtick no longer has any other meaning in Python, it could be used since it looks similar to a comma. E.g. in Python 3.8 we might have: >>> f"{x:`d}" '12,34,56,78,00,000' (actually, this probably isn't any parser issue even in Python 2 since it's already inside quotes; but the issue is moot). Or maybe a two character version like: >>> f"{x:2,d}" '12,34,56,78,00,000' Or: >>> f"{x:,,d}" '12,34,56,78,00,000' Even if `2,` was used, that wouldn't preclude giving an additional length descriptor after it. Now we can have: >>> f"{x:,.2f}" '1,234,567,800,000.00' Perhaps in the future this would work: >>> f"{x:2,.2f}" '12,34,56,78,00,000.00' -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pfreixes at gmail.com Sun Jan 28 02:35:24 2018 From: pfreixes at gmail.com (Pau Freixes) Date: Sun, 28 Jan 2018 08:35:24 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: > At a technical level, the biggest problems relate to the way we > manipulate frame objects at runtime, including the fact that we expose > those frames programmatically for the benefit of debuggers and other > tools. Shoudnt be something that could be tackled with the introduction of a kind of "-g" flag ? Asking the user to make explicit that is willing on having all of this extra information that in normal situations won't be there. > > More broadly, the current lack of perceived commercial incentives for > large corporations to invest millions in offering a faster default > Python runtime, the way they have for the other languages you > mentioned in your initial post :) Agree, at least from my understanding, Google has had a lot of initiatives to improve the JS runtime. But at the same moment, these last years and with the irruption of Asyncio many companies such as Facebook are implementing their systems on top of CPython meaning that they are indirectly inverting on it. -- --pau From stephanh42 at gmail.com Sun Jan 28 04:30:35 2018 From: stephanh42 at gmail.com (Stephan Houben) Date: Sun, 28 Jan 2018 10:30:35 +0100 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: Message-ID: Hi David, Perhaps the "n" locale-dependent number formatting specifier should accept a , to have locale-appropriate formatting of thousand separators? f"{x:,n}" would Do The Right Thing(TM) depending on the locale. Today it is an error. Stephan 2018-01-28 7:25 GMT+01:00 David Mertz : > In South Asia, a different style of digit delimiters for large numbers is > used than in Europe, North America, Australia, etc. With some minor > spelling differences, the term lakh is used for a hundred-thousand, and it > is generally written as '1,00,000'. > > In turn, a crore is 100 lakh, and is written as '1,00,00,000'. Extending > this pattern, larger numbers continue to use two digits in groups (other > than the smallest grouping of three digits. So, e.g. 1e12 is written > as 10,00,00,00,00,000. > > It's nice that we now have the optional underscore in numeric literals. > So we could write a number as either `12_34_56_78_00_000` or > `1_234_567_800_000` depending on what region of the world and which > convention was more familiar. > > However, in *formatting* those numbers, the format mini-language only > allows the European convention. So e.g. > > In [1]: x = 12_34_56_78_00_000 > In [2]: "{:,d}".format(x) > Out[2]: '1,234,567,800,000' > In [3]: f"{x:,d}" > Out[3]: '1,234,567,800,000' > > > In order to get Indian number delimiters, you'd have to write a custom > formatting function, notwithstanding that something like 1.5 billion people > use the three-then-two delimiting convention. > > I propose that Python should have an additional grouping option, or some > other way to specify this grouping convention. Oddly, the '_' grouping > symbol is available, even though no one actually uses that grouper outside > of programming languages like Python, e.g.: > > In [4]: f"{x:_d}" > Out[4]: '1_234_567_800_000' > > > I guess this is nice for something like round-tripping numbers used in > code, but it's not a symbol anyone uses "natively" (I understand why comma > or period cannot be used in numeric literals since they mean something else > in Python already). > > I'm not sure what symbol or combination I would recommend, but finding > something suitable shouldn't be so hard. Perhaps now that backtick no > longer has any other meaning in Python, it could be used since it looks > similar to a comma. E.g. in Python 3.8 we might have: > > >>> f"{x:`d}" > '12,34,56,78,00,000' > > (actually, this probably isn't any parser issue even in Python 2 since > it's already inside quotes; but the issue is moot). > > Or maybe a two character version like: > > >>> f"{x:2,d}" > '12,34,56,78,00,000' > > > Or: > > >>> f"{x:,,d}" > '12,34,56,78,00,000' > > > Even if `2,` was used, that wouldn't preclude giving an additional length > descriptor after it. Now we can have: > > >>> f"{x:,.2f}" > > '1,234,567,800,000.00' > > Perhaps in the future this would work: > > >>> f"{x:2,.2f}" > '12,34,56,78,00,000.00' > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tritium-list at sdamon.com Sun Jan 28 05:57:07 2018 From: tritium-list at sdamon.com (Alex Walters) Date: Sun, 28 Jan 2018 05:57:07 -0500 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: Message-ID: <0c0f01d39826$bdbd8610$39389230$@sdamon.com> It?s my opinion that instead of adding syntax, we should instead encourage using number formatting library functions. * You can replace the function or have the function dispatch differently depending on locale * It means that syntax doesn?t need to be extended for every use case ? its easier to replace a function than change syntax. From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com at python.org] On Behalf Of David Mertz Sent: Sunday, January 28, 2018 1:25 AM To: python-ideas Subject: [Python-ideas] Format mini-language for lakh and crore In South Asia, a different style of digit delimiters for large numbers is used than in Europe, North America, Australia, etc. With some minor spelling differences, the term lakh is used for a hundred-thousand, and it is generally written as '1,00,000'. In turn, a crore is 100 lakh, and is written as '1,00,00,000'. Extending this pattern, larger numbers continue to use two digits in groups (other than the smallest grouping of three digits. So, e.g. 1e12 is written as 10,00,00,00,00,000. It's nice that we now have the optional underscore in numeric literals. So we could write a number as either `12_34_56_78_00_000` or `1_234_567_800_000` depending on what region of the world and which convention was more familiar. However, in *formatting* those numbers, the format mini-language only allows the European convention. So e.g. In [1]: x = 12_34_56_78_00_000 In [2]: "{:,d}".format(x) Out[2]: '1,234,567,800,000' In [3]: f"{x:,d}" Out[3]: '1,234,567,800,000' In order to get Indian number delimiters, you'd have to write a custom formatting function, notwithstanding that something like 1.5 billion people use the three-then-two delimiting convention. I propose that Python should have an additional grouping option, or some other way to specify this grouping convention. Oddly, the '_' grouping symbol is available, even though no one actually uses that grouper outside of programming languages like Python, e.g.: In [4]: f"{x:_d}" Out[4]: '1_234_567_800_000' I guess this is nice for something like round-tripping numbers used in code, but it's not a symbol anyone uses "natively" (I understand why comma or period cannot be used in numeric literals since they mean something else in Python already). I'm not sure what symbol or combination I would recommend, but finding something suitable shouldn't be so hard. Perhaps now that backtick no longer has any other meaning in Python, it could be used since it looks similar to a comma. E.g. in Python 3.8 we might have: >>> f"{x:`d}" '12,34,56,78,00,000' (actually, this probably isn't any parser issue even in Python 2 since it's already inside quotes; but the issue is moot). Or maybe a two character version like: >>> f"{x:2,d}" '12,34,56,78,00,000' Or: >>> f"{x:,,d}" '12,34,56,78,00,000' Even if `2,` was used, that wouldn't preclude giving an additional length descriptor after it. Now we can have: >>> f"{x:,.2f}" '1,234,567,800,000.00' Perhaps in the future this would work: >>> f"{x:2,.2f}" '12,34,56,78,00,000.00' -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Jan 28 06:09:07 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Jan 2018 21:09:07 +1000 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On 28 January 2018 at 17:35, Pau Freixes wrote: >> At a technical level, the biggest problems relate to the way we >> manipulate frame objects at runtime, including the fact that we expose >> those frames programmatically for the benefit of debuggers and other >> tools. > > Shoudnt be something that could be tackled with the introduction of a > kind of "-g" flag ? Asking the user to make explicit that is willing > on having all of this extra information that in normal situations > won't be there. This is exactly what some other Python runtimes do, although some of them are also able to be clever about it and detect at runtime if you're doing something that relies on access to frame objects (e.g. PyPy does that). That's one of the biggest advantages of making folks opt-in to code acceleration measures, whether it's using a different interpreter implementation (like PyPy), or using some form of accelerator in combination with CPython (like Cython or Numba): because those tools are opt-in, they don't necessarily need to execute 100% of the software that runs on CPython, they only need to execute the more speed sensitive software that folks actually try to run on them. And because they're not directly integrated into CPython, they don't need to abide by our design and implementation constraints either. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Jan 28 06:51:05 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Jan 2018 21:51:05 +1000 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: Message-ID: On 28 January 2018 at 19:30, Stephan Houben wrote: > Hi David, > > Perhaps the "n" locale-dependent number formatting specifier > should accept a , to have locale-appropriate formatting of thousand > separators? > > f"{x:,n}" > > would Do The Right Thing(TM) depending on the locale. Checking https://www.python.org/dev/peps/pep-0378/, we did suggest using the locale module for cases where the engineering style groups-of-three structure wasn't appropriate, with the parallel being drawn to the fact that you also need to use locale dependent formatting to get a decimal separator other than ".". One nice aspect of this suggestion is that supplying the comma would map directly to the "grouping" parameter in https://docs.python.org/3/library/locale.html#locale.format: >>> import locale >>> locale.setlocale(locale.LC_ALL, "en_IN.utf8") 'en_IN.utf8' >>> locale.format("%d", 10e9, grouping=True) '10,00,00,00,000' Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From eric at trueblade.com Sun Jan 28 08:46:10 2018 From: eric at trueblade.com (Eric V. Smith) Date: Sun, 28 Jan 2018 08:46:10 -0500 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: Message-ID: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> On 1/28/2018 6:51 AM, Nick Coghlan wrote: > On 28 January 2018 at 19:30, Stephan Houben wrote: >> Hi David, >> >> Perhaps the "n" locale-dependent number formatting specifier >> should accept a , to have locale-appropriate formatting of thousand >> separators? >> >> f"{x:,n}" >> >> would Do The Right Thing(TM) depending on the locale. > > Checking https://www.python.org/dev/peps/pep-0378/, we did suggest > using the locale module for cases where the engineering style > groups-of-three structure wasn't appropriate, with the parallel being > drawn to the fact that you also need to use locale dependent > formatting to get a decimal separator other than ".". If I recall correctly, we discussed this at the time, and the problem with locale is that it's not thread safe. I agree that if it were, it would be nice to be able to use it, either with 'n', or in some other mode just for grouping. The underlying C setlocale()/localeconv() just isn't very friendly to this use case. Eric. From steve at pearwood.info Sun Jan 28 09:05:39 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 29 Jan 2018 01:05:39 +1100 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: Message-ID: <20180128140538.GG22500@ando.pearwood.info> On Sun, Jan 28, 2018 at 09:51:05PM +1000, Nick Coghlan wrote: > Checking https://www.python.org/dev/peps/pep-0378/, we did suggest > using the locale module for cases where the engineering style > groups-of-three structure wasn't appropriate, with the parallel being > drawn to the fact that you also need to use locale dependent > formatting to get a decimal separator other than ".". Or you could use string replacement: py> format(123456789.25, "0,.3f").replace(',', '_').replace('.', '?') '123_456_789?250' which may not be quite as convenient or efficient, and it doesn't work where groups-of-three aren't appropriate. But on the other hand using locale has a number of disadvantages too: - quoting PEP 378: "Finance users and non-professional programmers find the locale approach to be frustrating, arcane and non-obvious". https://www.python.org/dev/peps/pep-0378/ - the available locales and their spellings are OS dependent; - its not cheap, thread-safe or local to your library/function. The documentation for locale warns: "It is generally a bad idea to call setlocale() in some library routine, since as a side effect it affects the entire program. Saving and restoring it is almost as bad: it is expensive and affects other threads that happen to run before the settings have been restored." -- Steve From solipsis at pitrou.net Sun Jan 28 10:47:11 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 28 Jan 2018 16:47:11 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? References: Message-ID: <20180128164711.00ba064d@fsol> On Sat, 27 Jan 2018 22:18:08 +0100 Pau Freixes wrote: > > Correct me if I'm wrong, but most of you argue that the proper Zen of > Python - can we say it mutability [1]? as Victor pointed out - that > allow the user have the freedom to mutate objects in runtime goes in > the opposite direction of allowing the *compiler* to make code with > some optimizations. Or, more specifically for the ceval - > *interpreter*? - apply some hacks that would help to reduce the > footprint of some operations. Allow me to disagree. It's true that the extremely dynamic and flexible nature of Python makes it much harder to optimize Python code than, say, PHP code (I'm not entirely sure about this, but still). Still, I do think it's a collective failure that we've (*) made little progress in interpreter optimization in the last 10-15 years, compared to other languages. (*) ("we" == the CPython team here; PyPy is another question) Regards Antoine. From njs at pobox.com Sun Jan 28 19:27:24 2018 From: njs at pobox.com (Nathaniel Smith) Date: Sun, 28 Jan 2018 16:27:24 -0800 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: On Sun, Jan 28, 2018 at 5:46 AM, Eric V. Smith wrote: > If I recall correctly, we discussed this at the time, and the problem with > locale is that it's not thread safe. I agree that if it were, it would be > nice to be able to use it, either with 'n', or in some other mode just for > grouping. > > The underlying C setlocale()/localeconv() just isn't very friendly to this > use case. POSIX.1-2008 added thread-local locales (say that 3x fast); see uselocale(3). This appears to be supported on Linux (since glibc 2.3, which is older than all supported enterprise distros), MacOS, and the BSDs, but not Windows. OTOH Windows, MacOS, and the BSDs all seem to provide the non-standard sprintf_l, which takes an explicit locale to use. So it looks like all mainstream OSes actually make it possible to use a specific locale to do arbitrary formatting in a thread-safe way. -n -- Nathaniel J. Smith -- https://vorpus.org From steve at pearwood.info Sun Jan 28 19:41:48 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 29 Jan 2018 11:41:48 +1100 Subject: [Python-ideas] Logging: a more perseverent version of the StreamHandler? In-Reply-To: References: Message-ID: <20180129004144.GH22500@ando.pearwood.info> On Fri, Jan 26, 2018 at 09:46:51PM +0100, liam marsh wrote: > Hello, > Some time ago, I set up some logging using stdout in a program with the > `stdout_redirected()` context manager, which had to close and reopen > stdout to work. > Unsurprisingly, the StreamHandler didn't take it well. > > So I made a Handler class which is able to reload the stream [...] > What do you think? (about the idea, the implementation, and the way I > wrote this email)| I'll admit that I'm not actually very interested in the problem you're solving, I've never hit that problem so I'll leave it for those who have to comment on the idea and implementation. But I'll comment on the email: for some reason, the implementation you give has extraneous pipes | at the start and end of each line, e.g.: |class ReloadingHandler(StreamHandler):| |??? """| |??? A stream handler which reloads the stream object from one place if an error occurs| |??? """| Between the extraneous pipes and the lines being line-wrapped, the code has ended up severely mangled. (Code being emailed probably needs to stay below 78 characters per line to avoid being wrapped at the ends of lines.) -- Steve From mertz at gnosis.cx Sun Jan 28 20:31:36 2018 From: mertz at gnosis.cx (David Mertz) Date: Sun, 28 Jan 2018 17:31:36 -0800 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. But it's still much less general than having the option in the f-string/.format() mini-language. This is really about the formatted string, not necessarily about the locale. So, e.g. I'd like to be able to write: >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:`.2f}") I don't want the format necessarily to be some pseudo-global setting, even if it can get stored in thread-locals. That said, having a locale-aware symbol for delimiting numbers in the format mini-language would also not be a bad thing. On Sun, Jan 28, 2018 at 4:27 PM, Nathaniel Smith wrote: > On Sun, Jan 28, 2018 at 5:46 AM, Eric V. Smith wrote: > > If I recall correctly, we discussed this at the time, and the problem > with > > locale is that it's not thread safe. I agree that if it were, it would be > > nice to be able to use it, either with 'n', or in some other mode just > for > > grouping. > > > > The underlying C setlocale()/localeconv() just isn't very friendly to > this > > use case. > > POSIX.1-2008 added thread-local locales (say that 3x fast); see > uselocale(3). This appears to be supported on Linux (since glibc 2.3, > which is older than all supported enterprise distros), MacOS, and the > BSDs, but not Windows. OTOH Windows, MacOS, and the BSDs all seem to > provide the non-standard sprintf_l, which takes an explicit locale to > use. > > So it looks like all mainstream OSes actually make it possible to use > a specific locale to do arbitrary formatting in a thread-safe way. > > -n > > -- > Nathaniel J. Smith -- https://vorpus.org > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sun Jan 28 20:48:52 2018 From: njs at pobox.com (Nathaniel Smith) Date: Sun, 28 Jan 2018 17:48:52 -0800 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: On Sun, Jan 28, 2018 at 5:31 PM, David Mertz wrote: > I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. > But it's still much less general than having the option in the > f-string/.format() mini-language. This is really about the formatted > string, not necessarily about the locale. So, e.g. I'd like to be able to > write: > >>>> print(f"In European format x is {x:,.2f}, in Indian format it is >>>> {x:`.2f}") > > I don't want the format necessarily to be some pseudo-global setting, even > if it can get stored in thread-locals. That said, having a locale-aware > symbol for delimiting numbers in the format mini-language would also not be > a bad thing. I don't understand the format mini-language well enough to know what would fit in, but maybe some way to (a) request localified formatting, (b) some way to explicitly say which locale you want to use? Like if "h" means "human friendly", it might be something like: f"In the current locale x is {x:h.2f}, in Indian format it is {x:h(en_IN).2f}" -n -- Nathaniel J. Smith -- https://vorpus.org From turnbull.stephen.fw at u.tsukuba.ac.jp Mon Jan 29 00:13:50 2018 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Mon, 29 Jan 2018 14:13:50 +0900 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> <21deae87-ac0a-f13d-ec08-bfe263d3e97b@egenix.com> <0c47dff0-5075-3da9-b2f9-36362c6ac7e6@egenix.com> <20180121104343.GQ22500@ando.pearwood.info> <23141.34841.479927.670393@turnbull.sk.tsukuba.ac.jp> Message-ID: <23150.44430.296338.99356@turnbull.sk.tsukuba.ac.jp> Sorry for the long delay. I had a lot on my plate at work, and was spending 14 hours a day sleeping because of the flu. "It got better." Rob Speer writes: > I don't really understand what you're doing when you take a > fragment of my sentence where I explain a wrong understanding of > WHATWG encodings, and say "that's wrong, as you explain". I know > it's wrong. That's what I was saying. Sure, but you're not my entire audience: the part I care most about is the committers. I've seen proposals to "fill in" seriously made in other contexts, I wanted to agree that's wrong for Python. > In this pseudocode that implements a "whatwg_error_mode", can you describe > what the Python code to call it would look like? There isn't any Python code that calls it. It's an error handler, like 'strict' or 'surrogateescape', and all the functions that call it are in C. > Does every call to .encode and .decode now have a > "whatwg_error_mode" parameter, in addition to the "errors" > parameter? Or are there twice as many possible strings you could > pass as the "errors" parameter, so you can have "replace", > "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc? It would be the latter. I haven't thought about it carefully, but what I would likely do is define a factory function taking an encoding name (str), an error handler name, and a bytes-str mapping for the exceptional cases like windows-1255 where WHAT-WG enhances the graphic repertoire, and returns a name like "whatwg-windows-1255-fatal". Internally it would 1. Check if the error handler name is 'fatal' or 'strict', or 'html' or 'xmlcharrefreplace' ('strict' and 'xmlcharrefreplace' would be used internally to the factory function, the registered name would be 'fatal' or 'html'). 'replace' has the same semantics in Python and in WHAT-WG, and other error handlers 'backslashreplace', 'ignore', and 'surrogateescape' would be up to the programmer to use or avoid. They'd go by their Python names. Alternatively we could follow the strict WHAT-WG standard and not allow those, or provide another argument to allow "lax" checking of the handler argument. 2. Check if the name is already registered. If so, return it. 3. Otherwise, def a function that takes an Unicode error and a mapping that defaults to the one passed to the factory, and a. passes C0 and C1 control characters through, else b. returns the mapped value if present, else c. passes the Unicode error to the named error handler and returns what that returns 4. Register the new handler with that name, and return the name. You would use it like handler = factory('windows-1255', 'html', [(b'0x00', '\Udeadbeef')]) b'deadbeef'.decode('windows-1255', errors=handler) The mapping would default to [], and the remaining question would be what the default for the error handler should be. I guess that would 'strict' (the problem is that the WHAT-WG defaults differ for decoding and encoding). (The choice of a list of tuples for the mapping is due to JIS, where the map is not 1-1, and a specific reverse mapping is defined.) > My objection here isn't efficiency, it's adding confusing extra > options to .encode() and .decode() that aren't relevant in most > cases. There wouldn't be extra *arguments*, but there would be additional handler names to use as values. We'd want three standard handlers for everything but windows-1255 and JIS (AFAIK). One would be mainly for validating XML, and the name would be 'whatwg-any-fatal'. (Note that the name of the encoding is actually only used in the name of the handler, and that only to identify auxiliary mappings, such as that for windows-1255.) The others would be for everyday HTML (and maybe for XHTML form input?). They would be named 'whatwg-any-replace' and 'whatwg-any-html'. I'm not sure whether to have a separate suite for windows-1255, or let the programmer take care of that. Also, since 'replace' is a pretty simplistic handler, I suspect a lot of programmers would like to use surrogateescape, but since WHAT-WG explicitly restricts error modes to fatal, replace, and html, that's on the programmer to define, at least until it's clear there's overwhelming demand for it. > I'd like to limit this proposal to single-byte encodings, > addressing the discrepancies in the C1 characters and possibly that > Hebrew vowel point. I wonder what Microsoft's representatives to Unicode and WHAT-WG would say about that. I think it should definitely be handled somehow. I find adding it to the stdlib 1255 codec attractive, and I think the chance that Microsoft would sign off on that is nonzero. If they didn't, it would go into 1255-specific handlers. > If there are differences in the JIS encodings, that is a can of > worms I'd like to not open at the moment. Addressed by the factory function, which is needed anyway as discussed above. Footnotes: [1] I had this wrong. It's not the number of tokens to skip, it's the position to restart reading the input. [2] The actual handlers are all in C, and return 0 if they don't know what to do. I haven't had time to figure out what actually happens here (None is an actual object and I'm sure it doesn't live at 0x0). I'm guessing that a pure Python handler would return None, but perhaps it should reraise. That doesn't affect the ability to construct a chaining handler, only what such a handler would do if it "knows" the input is *bad* and decides to stop rather than delegate. From ncoghlan at gmail.com Mon Jan 29 02:13:29 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 29 Jan 2018 17:13:29 +1000 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: On 29 January 2018 at 11:48, Nathaniel Smith wrote: > On Sun, Jan 28, 2018 at 5:31 PM, David Mertz wrote: >> I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. >> But it's still much less general than having the option in the >> f-string/.format() mini-language. This is really about the formatted >> string, not necessarily about the locale. So, e.g. I'd like to be able to >> write: >> >>>>> print(f"In European format x is {x:,.2f}, in Indian format it is >>>>> {x:`.2f}") >> >> I don't want the format necessarily to be some pseudo-global setting, even >> if it can get stored in thread-locals. That said, having a locale-aware >> symbol for delimiting numbers in the format mini-language would also not be >> a bad thing. > > I don't understand the format mini-language well enough to know what > would fit in, but maybe some way to (a) request localified formatting, Given the example, I think a more useful approach would be to allow an optional digit grouping specifier after the comma separator, and allow the separator to be repeated to indicate non-uniform groupings in the lower order digits. If we did that, then David's example could become: >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}") The core elements of interpreting that would then be: - digit group size specifiers are permited for both "," (decimal display only) and "_" (all display bases) - if no digit group size specifier is given, it defaults to 3 for decimal and 4 for binary, octal, and hexadecimal - if multiple digit group specifiers are given, then the last one given is applied starting from the least significant integer digit so "{x:,2,3.2f}" means: - an arbitrary number of leading 2-digit groups - 1 group of 3 digits - 2 decimal places It would then be reasonably straightforward to use this as a lower level primitive to implement locale dependent formatting, as follows: - format in English using the locale's grouping rules [1] (either LC_NUMERIC.grouping or LC_MONETARY.mon_grouping, as appropriate) - use str.translate() [2] to replace "," and "." with the locale's thousands_sep & decimal_point or mon_thousands_sep & mon_decimal_point [1] https://docs.python.org/3/library/locale.html#locale.localeconv [2] https://docs.python.org/3/library/stdtypes.html#str.translate Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From steve.dower at python.org Mon Jan 29 05:54:44 2018 From: steve.dower at python.org (Steve Dower) Date: Mon, 29 Jan 2018 21:54:44 +1100 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: Someone would have to check, but presumably the CRT on Windows is converting the natively thread-local locale into a process-wide locale for POSIX compatibility, which means it can probably be easily bypassed without having to use specific overloads. Top-posted from my Windows phone From: Nathaniel Smith Sent: Monday, January 29, 2018 11:29 To: Eric V. Smith Cc: python-ideas Subject: Re: [Python-ideas] Format mini-language for lakh and crore On Sun, Jan 28, 2018 at 5:46 AM, Eric V. Smith wrote: > If I recall correctly, we discussed this at the time, and the problem with > locale is that it's not thread safe. I agree that if it were, it would be > nice to be able to use it, either with 'n', or in some other mode just for > grouping. > > The underlying C setlocale()/localeconv() just isn't very friendly to this > use case. POSIX.1-2008 added thread-local locales (say that 3x fast); see uselocale(3). This appears to be supported on Linux (since glibc 2.3, which is older than all supported enterprise distros), MacOS, and the BSDs, but not Windows. OTOH Windows, MacOS, and the BSDs all seem to provide the non-standard sprintf_l, which takes an explicit locale to use. So it looks like all mainstream OSes actually make it possible to use a specific locale to do arbitrary formatting in a thread-safe way. -n -- Nathaniel J. Smith -- https://vorpus.org _______________________________________________ Python-ideas mailing list Python-ideas at python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mertz at gnosis.cx Mon Jan 29 10:43:14 2018 From: mertz at gnosis.cx (David Mertz) Date: Mon, 29 Jan 2018 10:43:14 -0500 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: Nick suggests: >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}") This looks very good and general. I only know of the "European" and South Asian conventions in widespread use, but we could give other grouping conventions using that little syntax and it definitely covers the ones I know about. There's not an issue about this giving the parser for the format mini-language hiccups over width specifier in there, is there? On Mon, Jan 29, 2018 at 2:13 AM, Nick Coghlan wrote: > On 29 January 2018 at 11:48, Nathaniel Smith wrote: > > On Sun, Jan 28, 2018 at 5:31 PM, David Mertz wrote: > >> I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. > >> But it's still much less general than having the option in the > >> f-string/.format() mini-language. This is really about the formatted > >> string, not necessarily about the locale. So, e.g. I'd like to be able > to > >> write: > >> > >>>>> print(f"In European format x is {x:,.2f}, in Indian format it is > >>>>> {x:`.2f}") > >> > >> I don't want the format necessarily to be some pseudo-global setting, > even > >> if it can get stored in thread-locals. That said, having a locale-aware > >> symbol for delimiting numbers in the format mini-language would also > not be > >> a bad thing. > > > > I don't understand the format mini-language well enough to know what > > would fit in, but maybe some way to (a) request localified formatting, > > Given the example, I think a more useful approach would be to allow an > optional digit grouping specifier after the comma separator, and allow > the separator to be repeated to indicate non-uniform groupings in the > lower order digits. > > If we did that, then David's example could become: > > >>> print(f"In European format x is {x:,.2f}, in Indian format it > is {x:,2,3.2f}") > > The core elements of interpreting that would then be: > > - digit group size specifiers are permited for both "," (decimal > display only) and "_" (all display bases) > - if no digit group size specifier is given, it defaults to 3 for > decimal and 4 for binary, octal, and hexadecimal > - if multiple digit group specifiers are given, then the last one > given is applied starting from the least significant integer digit > > so "{x:,2,3.2f}" means: > > - an arbitrary number of leading 2-digit groups > - 1 group of 3 digits > - 2 decimal places > > It would then be reasonably straightforward to use this as a lower > level primitive to implement locale dependent formatting, as follows: > > - format in English using the locale's grouping rules [1] (either > LC_NUMERIC.grouping or LC_MONETARY.mon_grouping, as appropriate) > - use str.translate() [2] to replace "," and "." with the locale's > thousands_sep & decimal_point or mon_thousands_sep & mon_decimal_point > > [1] https://docs.python.org/3/library/locale.html#locale.localeconv > [2] https://docs.python.org/3/library/stdtypes.html#str.translate > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Mon Jan 29 12:10:39 2018 From: brett at python.org (Brett Cannon) Date: Mon, 29 Jan 2018 17:10:39 +0000 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On Sat, Jan 27, 2018, 23:36 Pau Freixes, wrote: > > At a technical level, the biggest problems relate to the way we > > manipulate frame objects at runtime, including the fact that we expose > > those frames programmatically for the benefit of debuggers and other > > tools. > > Shoudnt be something that could be tackled with the introduction of a > kind of "-g" flag ? Asking the user to make explicit that is willing > on having all of this extra information that in normal situations > won't be there. > > > > > More broadly, the current lack of perceived commercial incentives for > > large corporations to invest millions in offering a faster default > > Python runtime, the way they have for the other languages you > > mentioned in your initial post :) > > Agree, at least from my understanding, Google has had a lot of > initiatives to improve the JS runtime. But at the same moment, these > last years and with the irruption of Asyncio many companies such as > Facebook are implementing their systems on top of CPython meaning that > they are indirectly inverting on it. > I find that's a red herring. There are plenty of massive companies that have relied on Python for performance-critical workloads in timespans measuring in decades and they have not funded core Python development or the PSF in a way even approaching the other languages Python was compared against in the original email. It might be the feeling of community ownership that keeps companies from making major investments in Python, but regardless it's important to simply remember the core devs are volunteers so the question of "why hasn't this been solved" usually comes down to "lack of volunteer time". -Brett > -- > --pau > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From waksman at gmail.com Mon Jan 29 13:51:09 2018 From: waksman at gmail.com (George Leslie-Waksman) Date: Mon, 29 Jan 2018 18:51:09 +0000 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> Message-ID: attrs' seems to also not allow mandatory attributes to follow optional one: In [14]: @attr.s ...: class Baz: ...: a = attr.ib(default=attr.Factory(list)) ...: b = attr.ib() ...: --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () ----> 1 @attr.s 2 class Baz: 3 a = attr.ib(default=attr.Factory(list)) 4 b = attr.ib() 5 /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py in attrs(maybe_cls, these, repr_ns, repr, cmp, hash, init, slots, frozen, str, auto_attribs) 700 return wrap 701 else: --> 702 return wrap(maybe_cls) 703 704 /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py in wrap(cls) 669 raise TypeError("attrs only works with new-style classes.") 670 --> 671 builder = _ClassBuilder(cls, these, slots, frozen, auto_attribs) 672 673 if repr is True: /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py in __init__(self, cls, these, slots, frozen, auto_attribs) 369 370 def __init__(self, cls, these, slots, frozen, auto_attribs): --> 371 attrs, super_attrs = _transform_attrs(cls, these, auto_attribs) 372 373 self._cls = cls /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py in _transform_attrs(cls, these, auto_attribs) 335 "No mandatory attributes allowed after an attribute with a " 336 "default value or factory. Attribute in question: {a!r}" --> 337 .format(a=a) 338 ) 339 elif had_default is False and \ ValueError: No mandatory attributes allowed after an attribute with a default value or factory. Attribute in question: Attribute(name='b', default=NOTHING, validator=None, repr=True, cmp=True, hash=None, init=True, metadata=mappingproxy({}), type=None, converter=None) On Fri, Jan 26, 2018 at 1:44 PM Guido van Rossum wrote: > What does attrs' solution for this problem look like? > > On Fri, Jan 26, 2018 at 11:11 AM, George Leslie-Waksman > wrote: > >> Even if we could inherit the setting, I would think that we would still >> want to require the code be explicit. It seems worse to implicitly require >> keyword only arguments for a class without giving any indication in the >> code. >> >> As it stands, the current implementation does not allow a later subclass >> to be declared without `keyword_only=True` so we could handle this case by >> adding a note to the `TypeError` message about considering the keyword_only >> flag. >> >> How do I got about putting together a proposal to get this into 3.8? >> >> --George >> >> >> On Thu, Jan 25, 2018 at 5:12 AM Eric V. Smith wrote: >> >>> I'm not completely opposed to this feature. But there are some cases to >>> consider. Here's the first one that occurs to me: note that due to the >>> way dataclasses work, it would need to be used everywhere down an >>> inheritance hierarchy. That is, if an intermediate base class required >>> it, all class derived from that intermediate base would need to specify >>> it, too. That's because each class just makes decisions based on its >>> fields and its base classes' fields, and not on any flags attached to >>> the base class. As it's currently implemented, a class doesn't remember >>> any of the decorator's arguments, so there's no way to look for this >>> information, anyway. >>> >>> I think there are enough issues here that it's not going to make it in >>> to 3.7. It would require getting a firm proposal together, selling the >>> idea on python-dev, and completing the implementation before Monday. But >>> if you want to try, I'd participate in the discussion. >>> >>> Taking Ivan's suggestion one step further, a way to do this currently is >>> to pass init=False and then write another decorator that adds the >>> kw-only __init__. So the usage would be: >>> >>> @dataclass >>> class Foo: >>> some_default: dict = field(default_factory=dict) >>> >>> @kw_only_init >>> @dataclass(init=False) >>> class Bar(Foo): >>> other_field: int >>> >>> kw_only_init(cls) would look at fields(cls) and construct the __init__. >>> It would be a hassle to re-implement dataclasses's _init_fn function, >>> but it could be made to work (in reality, of course, you'd just copy it >>> and hack it up to do what you want). You'd also need to use some private >>> knowledge of InitVars if you wanted to support them (the stock >>> fields(cls) doesn't return them). >>> >>> For 3.8 we can consider changing dataclasses's APIs if we want to add >>> this. >>> >>> Eric. >>> >>> On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: >>> > It may be possible but it makes for pretty leaky abstractions and it's >>> > unclear what that custom __init__ should look like. How am I supposed >>> to >>> > know what the replacement for default_factory is? >>> > >>> > Moreover, suppose I want one base class with an optional argument and a >>> > half dozen subclasses each with their own required argument. At that >>> > point, I have to write the same __init__ function a half dozen times. >>> > >>> > It feels rather burdensome for the user when an additional flag (say >>> > "kw_only=True") and a modification to: >>> > https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294 >>> that >>> > inserted `['*']` after `[self_name]` if the flag is specified could >>> > ameliorate this entire issue. >>> > >>> > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi >> > > wrote: >>> > >>> > It is possible to pass init=False to the decorator on the subclass >>> > (and supply your own custom __init__, if necessary): >>> > >>> > @dataclass >>> > class Foo: >>> > some_default: dict = field(default_factory=dict) >>> > >>> > @dataclass(init=False) # This works >>> > class Bar(Foo): >>> > other_field: int >>> > >>> > -- >>> > Ivan >>> > >>> > >>> > >>> > On 23 January 2018 at 03:33, George Leslie-Waksman >>> > > wrote: >>> > >>> > The proposed implementation of dataclasses prevents defining >>> > fields with defaults before fields without defaults. This can >>> > create limitations on logical grouping of fields and on >>> inheritance. >>> > >>> > Take, for example, the case: >>> > >>> > @dataclass >>> > class Foo: >>> > some_default: dict = field(default_factory=dict) >>> > >>> > @dataclass >>> > class Bar(Foo): >>> > other_field: int >>> > >>> > this results in the error: >>> > >>> > 5 @dataclass >>> > ----> 6 class Bar(Foo): >>> > 7 other_field: int >>> > 8 >>> > >>> > >>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>> > in dataclass(_cls, init, repr, eq, order, hash, frozen) >>> > 751 >>> > 752 # We're called as @dataclass, with a class. >>> > --> 753 return wrap(_cls) >>> > 754 >>> > 755 >>> > >>> > >>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>> > in wrap(cls) >>> > 743 >>> > 744 def wrap(cls): >>> > --> 745 return _process_class(cls, repr, eq, order, >>> > hash, init, frozen) >>> > 746 >>> > 747 # See if we're being called as @dataclass or >>> > @dataclass(). >>> > >>> > >>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>> > in _process_class(cls, repr, eq, order, hash, init, frozen) >>> > 675 # in __init__. Use >>> > "self" if possible. >>> > 676 '__dataclass_self__' >>> if >>> > 'self' in fields >>> > --> 677 else 'self', >>> > 678 )) >>> > 679 if repr: >>> > >>> > >>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>> > in _init_fn(fields, frozen, has_post_init, self_name) >>> > 422 seen_default = True >>> > 423 elif seen_default: >>> > --> 424 raise TypeError(f'non-default argument >>> > {f.name !r} ' >>> > 425 'follows default >>> argument') >>> > 426 >>> > >>> > TypeError: non-default argument 'other_field' follows default >>> > argument >>> > >>> > I understand that this is a limitation of positional arguments >>> > because the effective __init__ signature is: >>> > >>> > def __init__(self, some_default: dict = , >>> > other_field: int): >>> > >>> > However, keyword only arguments allow an entirely reasonable >>> > solution to this problem: >>> > >>> > def __init__(self, *, some_default: dict = , >>> > other_field: int): >>> > >>> > And have the added benefit of making the fields in the __init__ >>> > call entirely explicit. >>> > >>> > So, I propose the addition of a keyword_only flag to the >>> > @dataclass decorator that renders the __init__ method using >>> > keyword only arguments: >>> > >>> > @dataclass(keyword_only=True) >>> > class Bar(Foo): >>> > other_field: int >>> > >>> > --George Leslie-Waksman >>> > >>> > _______________________________________________ >>> > Python-ideas mailing list >>> > Python-ideas at python.org >>> > https://mail.python.org/mailman/listinfo/python-ideas >>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>> > >>> > >>> > >>> > >>> > _______________________________________________ >>> > Python-ideas mailing list >>> > Python-ideas at python.org >>> > https://mail.python.org/mailman/listinfo/python-ideas >>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>> > >>> >>> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >> > > > -- > --Guido van Rossum (python.org/~guido) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Mon Jan 29 14:05:03 2018 From: guido at python.org (Guido van Rossum) Date: Mon, 29 Jan 2018 11:05:03 -0800 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> Message-ID: I think that settles it -- there's no reason to try to implement this. On Mon, Jan 29, 2018 at 10:51 AM, George Leslie-Waksman wrote: > attrs' seems to also not allow mandatory attributes to follow optional one: > > In [14]: @attr.s > ...: class Baz: > ...: a = attr.ib(default=attr.Factory(list)) > ...: b = attr.ib() > ...: > ------------------------------------------------------------ > --------------- > ValueError Traceback (most recent call last) > in () > ----> 1 @attr.s > 2 class Baz: > 3 a = attr.ib(default=attr.Factory(list)) > 4 b = attr.ib() > 5 > > /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ > python3.6/site-packages/attr/_make.py in attrs(maybe_cls, these, repr_ns, > repr, cmp, hash, init, slots, frozen, str, auto_attribs) > 700 return wrap > 701 else: > --> 702 return wrap(maybe_cls) > 703 > 704 > > /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ > python3.6/site-packages/attr/_make.py in wrap(cls) > 669 raise TypeError("attrs only works with new-style > classes.") > 670 > --> 671 builder = _ClassBuilder(cls, these, slots, frozen, > auto_attribs) > 672 > 673 if repr is True: > > /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ > python3.6/site-packages/attr/_make.py in __init__(self, cls, these, > slots, frozen, auto_attribs) > 369 > 370 def __init__(self, cls, these, slots, frozen, auto_attribs): > --> 371 attrs, super_attrs = _transform_attrs(cls, these, > auto_attribs) > 372 > 373 self._cls = cls > > /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ > python3.6/site-packages/attr/_make.py in _transform_attrs(cls, these, > auto_attribs) > 335 "No mandatory attributes allowed after an > attribute with a " > 336 "default value or factory. Attribute in question: > {a!r}" > --> 337 .format(a=a) > 338 ) > 339 elif had_default is False and \ > > ValueError: No mandatory attributes allowed after an attribute with a > default value or factory. Attribute in question: Attribute(name='b', > default=NOTHING, validator=None, repr=True, cmp=True, hash=None, init=True, > metadata=mappingproxy({}), type=None, converter=None) > > > On Fri, Jan 26, 2018 at 1:44 PM Guido van Rossum wrote: > >> What does attrs' solution for this problem look like? >> >> On Fri, Jan 26, 2018 at 11:11 AM, George Leslie-Waksman < >> waksman at gmail.com> wrote: >> >>> Even if we could inherit the setting, I would think that we would still >>> want to require the code be explicit. It seems worse to implicitly require >>> keyword only arguments for a class without giving any indication in the >>> code. >>> >>> As it stands, the current implementation does not allow a later subclass >>> to be declared without `keyword_only=True` so we could handle this case by >>> adding a note to the `TypeError` message about considering the keyword_only >>> flag. >>> >>> How do I got about putting together a proposal to get this into 3.8? >>> >>> --George >>> >>> >>> On Thu, Jan 25, 2018 at 5:12 AM Eric V. Smith >>> wrote: >>> >>>> I'm not completely opposed to this feature. But there are some cases to >>>> consider. Here's the first one that occurs to me: note that due to the >>>> way dataclasses work, it would need to be used everywhere down an >>>> inheritance hierarchy. That is, if an intermediate base class required >>>> it, all class derived from that intermediate base would need to specify >>>> it, too. That's because each class just makes decisions based on its >>>> fields and its base classes' fields, and not on any flags attached to >>>> the base class. As it's currently implemented, a class doesn't remember >>>> any of the decorator's arguments, so there's no way to look for this >>>> information, anyway. >>>> >>>> I think there are enough issues here that it's not going to make it in >>>> to 3.7. It would require getting a firm proposal together, selling the >>>> idea on python-dev, and completing the implementation before Monday. But >>>> if you want to try, I'd participate in the discussion. >>>> >>>> Taking Ivan's suggestion one step further, a way to do this currently is >>>> to pass init=False and then write another decorator that adds the >>>> kw-only __init__. So the usage would be: >>>> >>>> @dataclass >>>> class Foo: >>>> some_default: dict = field(default_factory=dict) >>>> >>>> @kw_only_init >>>> @dataclass(init=False) >>>> class Bar(Foo): >>>> other_field: int >>>> >>>> kw_only_init(cls) would look at fields(cls) and construct the __init__. >>>> It would be a hassle to re-implement dataclasses's _init_fn function, >>>> but it could be made to work (in reality, of course, you'd just copy it >>>> and hack it up to do what you want). You'd also need to use some private >>>> knowledge of InitVars if you wanted to support them (the stock >>>> fields(cls) doesn't return them). >>>> >>>> For 3.8 we can consider changing dataclasses's APIs if we want to add >>>> this. >>>> >>>> Eric. >>>> >>>> On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: >>>> > It may be possible but it makes for pretty leaky abstractions and it's >>>> > unclear what that custom __init__ should look like. How am I supposed >>>> to >>>> > know what the replacement for default_factory is? >>>> > >>>> > Moreover, suppose I want one base class with an optional argument and >>>> a >>>> > half dozen subclasses each with their own required argument. At that >>>> > point, I have to write the same __init__ function a half dozen times. >>>> > >>>> > It feels rather burdensome for the user when an additional flag (say >>>> > "kw_only=True") and a modification to: >>>> > https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294 >>>> that >>>> > inserted `['*']` after `[self_name]` if the flag is specified could >>>> > ameliorate this entire issue. >>>> > >>>> > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi >>> > > wrote: >>>> > >>>> > It is possible to pass init=False to the decorator on the subclass >>>> > (and supply your own custom __init__, if necessary): >>>> > >>>> > @dataclass >>>> > class Foo: >>>> > some_default: dict = field(default_factory=dict) >>>> > >>>> > @dataclass(init=False) # This works >>>> > class Bar(Foo): >>>> > other_field: int >>>> > >>>> > -- >>>> > Ivan >>>> > >>>> > >>>> > >>>> > On 23 January 2018 at 03:33, George Leslie-Waksman >>>> > > wrote: >>>> > >>>> > The proposed implementation of dataclasses prevents defining >>>> > fields with defaults before fields without defaults. This can >>>> > create limitations on logical grouping of fields and on >>>> inheritance. >>>> > >>>> > Take, for example, the case: >>>> > >>>> > @dataclass >>>> > class Foo: >>>> > some_default: dict = field(default_factory=dict) >>>> > >>>> > @dataclass >>>> > class Bar(Foo): >>>> > other_field: int >>>> > >>>> > this results in the error: >>>> > >>>> > 5 @dataclass >>>> > ----> 6 class Bar(Foo): >>>> > 7 other_field: int >>>> > 8 >>>> > >>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>> site-packages/dataclasses.py >>>> > in dataclass(_cls, init, repr, eq, order, hash, frozen) >>>> > 751 >>>> > 752 # We're called as @dataclass, with a class. >>>> > --> 753 return wrap(_cls) >>>> > 754 >>>> > 755 >>>> > >>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>> site-packages/dataclasses.py >>>> > in wrap(cls) >>>> > 743 >>>> > 744 def wrap(cls): >>>> > --> 745 return _process_class(cls, repr, eq, order, >>>> > hash, init, frozen) >>>> > 746 >>>> > 747 # See if we're being called as @dataclass or >>>> > @dataclass(). >>>> > >>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>> site-packages/dataclasses.py >>>> > in _process_class(cls, repr, eq, order, hash, init, frozen) >>>> > 675 # in __init__. Use >>>> > "self" if possible. >>>> > 676 '__dataclass_self__' >>>> if >>>> > 'self' in fields >>>> > --> 677 else 'self', >>>> > 678 )) >>>> > 679 if repr: >>>> > >>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>> site-packages/dataclasses.py >>>> > in _init_fn(fields, frozen, has_post_init, self_name) >>>> > 422 seen_default = True >>>> > 423 elif seen_default: >>>> > --> 424 raise TypeError(f'non-default argument >>>> > {f.name !r} ' >>>> > 425 'follows default >>>> argument') >>>> > 426 >>>> > >>>> > TypeError: non-default argument 'other_field' follows default >>>> > argument >>>> > >>>> > I understand that this is a limitation of positional arguments >>>> > because the effective __init__ signature is: >>>> > >>>> > def __init__(self, some_default: dict = , >>>> > other_field: int): >>>> > >>>> > However, keyword only arguments allow an entirely reasonable >>>> > solution to this problem: >>>> > >>>> > def __init__(self, *, some_default: dict = , >>>> > other_field: int): >>>> > >>>> > And have the added benefit of making the fields in the >>>> __init__ >>>> > call entirely explicit. >>>> > >>>> > So, I propose the addition of a keyword_only flag to the >>>> > @dataclass decorator that renders the __init__ method using >>>> > keyword only arguments: >>>> > >>>> > @dataclass(keyword_only=True) >>>> > class Bar(Foo): >>>> > other_field: int >>>> > >>>> > --George Leslie-Waksman >>>> > >>>> > _______________________________________________ >>>> > Python-ideas mailing list >>>> > Python-ideas at python.org >>>> > https://mail.python.org/mailman/listinfo/python-ideas >>>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>>> > >>>> > >>>> > >>>> > >>>> > _______________________________________________ >>>> > Python-ideas mailing list >>>> > Python-ideas at python.org >>>> > https://mail.python.org/mailman/listinfo/python-ideas >>>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>>> > >>>> >>>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> Code of Conduct: http://python.org/psf/codeofconduct/ >>> >>> >> >> >> -- >> --Guido van Rossum (python.org/~guido) >> > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From waksman at gmail.com Mon Jan 29 14:38:20 2018 From: waksman at gmail.com (George Leslie-Waksman) Date: Mon, 29 Jan 2018 19:38:20 +0000 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> Message-ID: Given I started this thread from a perspective of this is a feature that I would like because I need it, it feels a little dismissive to take attrs not having the feature to mean "there's no reason to try to implement this." On Mon, Jan 29, 2018 at 11:05 AM Guido van Rossum wrote: > I think that settles it -- there's no reason to try to implement this. > > On Mon, Jan 29, 2018 at 10:51 AM, George Leslie-Waksman > wrote: > >> attrs' seems to also not allow mandatory attributes to follow optional >> one: >> >> In [14]: @attr.s >> ...: class Baz: >> ...: a = attr.ib(default=attr.Factory(list)) >> ...: b = attr.ib() >> ...: >> >> --------------------------------------------------------------------------- >> ValueError Traceback (most recent call >> last) >> in () >> ----> 1 @attr.s >> 2 class Baz: >> 3 a = attr.ib(default=attr.Factory(list)) >> 4 b = attr.ib() >> 5 >> >> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py >> in attrs(maybe_cls, these, repr_ns, repr, cmp, hash, init, slots, frozen, >> str, auto_attribs) >> 700 return wrap >> 701 else: >> --> 702 return wrap(maybe_cls) >> 703 >> 704 >> >> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py >> in wrap(cls) >> 669 raise TypeError("attrs only works with new-style >> classes.") >> 670 >> --> 671 builder = _ClassBuilder(cls, these, slots, frozen, >> auto_attribs) >> 672 >> 673 if repr is True: >> >> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py >> in __init__(self, cls, these, slots, frozen, auto_attribs) >> 369 >> 370 def __init__(self, cls, these, slots, frozen, auto_attribs): >> --> 371 attrs, super_attrs = _transform_attrs(cls, these, >> auto_attribs) >> 372 >> 373 self._cls = cls >> >> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/python3.6/site-packages/attr/_make.py >> in _transform_attrs(cls, these, auto_attribs) >> 335 "No mandatory attributes allowed after an >> attribute with a " >> 336 "default value or factory. Attribute in >> question: {a!r}" >> --> 337 .format(a=a) >> 338 ) >> 339 elif had_default is False and \ >> >> ValueError: No mandatory attributes allowed after an attribute with a >> default value or factory. Attribute in question: Attribute(name='b', >> default=NOTHING, validator=None, repr=True, cmp=True, hash=None, init=True, >> metadata=mappingproxy({}), type=None, converter=None) >> >> >> On Fri, Jan 26, 2018 at 1:44 PM Guido van Rossum >> wrote: >> >>> What does attrs' solution for this problem look like? >>> >>> On Fri, Jan 26, 2018 at 11:11 AM, George Leslie-Waksman < >>> waksman at gmail.com> wrote: >>> >>>> Even if we could inherit the setting, I would think that we would still >>>> want to require the code be explicit. It seems worse to implicitly require >>>> keyword only arguments for a class without giving any indication in the >>>> code. >>>> >>>> As it stands, the current implementation does not allow a later >>>> subclass to be declared without `keyword_only=True` so we could handle this >>>> case by adding a note to the `TypeError` message about considering the >>>> keyword_only flag. >>>> >>>> How do I got about putting together a proposal to get this into 3.8? >>>> >>>> --George >>>> >>>> >>>> On Thu, Jan 25, 2018 at 5:12 AM Eric V. Smith >>>> wrote: >>>> >>>>> I'm not completely opposed to this feature. But there are some cases to >>>>> consider. Here's the first one that occurs to me: note that due to the >>>>> way dataclasses work, it would need to be used everywhere down an >>>>> inheritance hierarchy. That is, if an intermediate base class required >>>>> it, all class derived from that intermediate base would need to specify >>>>> it, too. That's because each class just makes decisions based on its >>>>> fields and its base classes' fields, and not on any flags attached to >>>>> the base class. As it's currently implemented, a class doesn't remember >>>>> any of the decorator's arguments, so there's no way to look for this >>>>> information, anyway. >>>>> >>>>> I think there are enough issues here that it's not going to make it in >>>>> to 3.7. It would require getting a firm proposal together, selling the >>>>> idea on python-dev, and completing the implementation before Monday. >>>>> But >>>>> if you want to try, I'd participate in the discussion. >>>>> >>>>> Taking Ivan's suggestion one step further, a way to do this currently >>>>> is >>>>> to pass init=False and then write another decorator that adds the >>>>> kw-only __init__. So the usage would be: >>>>> >>>>> @dataclass >>>>> class Foo: >>>>> some_default: dict = field(default_factory=dict) >>>>> >>>>> @kw_only_init >>>>> @dataclass(init=False) >>>>> class Bar(Foo): >>>>> other_field: int >>>>> >>>>> kw_only_init(cls) would look at fields(cls) and construct the __init__. >>>>> It would be a hassle to re-implement dataclasses's _init_fn function, >>>>> but it could be made to work (in reality, of course, you'd just copy it >>>>> and hack it up to do what you want). You'd also need to use some >>>>> private >>>>> knowledge of InitVars if you wanted to support them (the stock >>>>> fields(cls) doesn't return them). >>>>> >>>>> For 3.8 we can consider changing dataclasses's APIs if we want to add >>>>> this. >>>>> >>>>> Eric. >>>>> >>>>> On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: >>>>> > It may be possible but it makes for pretty leaky abstractions and >>>>> it's >>>>> > unclear what that custom __init__ should look like. How am I >>>>> supposed to >>>>> > know what the replacement for default_factory is? >>>>> > >>>>> > Moreover, suppose I want one base class with an optional argument >>>>> and a >>>>> > half dozen subclasses each with their own required argument. At that >>>>> > point, I have to write the same __init__ function a half dozen times. >>>>> > >>>>> > It feels rather burdensome for the user when an additional flag (say >>>>> > "kw_only=True") and a modification to: >>>>> > >>>>> https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L294 >>>>> that >>>>> > inserted `['*']` after `[self_name]` if the flag is specified could >>>>> > ameliorate this entire issue. >>>>> > >>>>> > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi < >>>>> levkivskyi at gmail.com >>>>> > > wrote: >>>>> > >>>>> > It is possible to pass init=False to the decorator on the >>>>> subclass >>>>> > (and supply your own custom __init__, if necessary): >>>>> > >>>>> > @dataclass >>>>> > class Foo: >>>>> > some_default: dict = field(default_factory=dict) >>>>> > >>>>> > @dataclass(init=False) # This works >>>>> > class Bar(Foo): >>>>> > other_field: int >>>>> > >>>>> > -- >>>>> > Ivan >>>>> > >>>>> > >>>>> > >>>>> > On 23 January 2018 at 03:33, George Leslie-Waksman >>>>> > > wrote: >>>>> > >>>>> > The proposed implementation of dataclasses prevents defining >>>>> > fields with defaults before fields without defaults. This can >>>>> > create limitations on logical grouping of fields and on >>>>> inheritance. >>>>> > >>>>> > Take, for example, the case: >>>>> > >>>>> > @dataclass >>>>> > class Foo: >>>>> > some_default: dict = field(default_factory=dict) >>>>> > >>>>> > @dataclass >>>>> > class Bar(Foo): >>>>> > other_field: int >>>>> > >>>>> > this results in the error: >>>>> > >>>>> > 5 @dataclass >>>>> > ----> 6 class Bar(Foo): >>>>> > 7 other_field: int >>>>> > 8 >>>>> > >>>>> > >>>>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>>>> > in dataclass(_cls, init, repr, eq, order, hash, frozen) >>>>> > 751 >>>>> > 752 # We're called as @dataclass, with a class. >>>>> > --> 753 return wrap(_cls) >>>>> > 754 >>>>> > 755 >>>>> > >>>>> > >>>>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>>>> > in wrap(cls) >>>>> > 743 >>>>> > 744 def wrap(cls): >>>>> > --> 745 return _process_class(cls, repr, eq, order, >>>>> > hash, init, frozen) >>>>> > 746 >>>>> > 747 # See if we're being called as @dataclass or >>>>> > @dataclass(). >>>>> > >>>>> > >>>>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>>>> > in _process_class(cls, repr, eq, order, hash, init, frozen) >>>>> > 675 # in __init__. Use >>>>> > "self" if possible. >>>>> > 676 >>>>> '__dataclass_self__' if >>>>> > 'self' in fields >>>>> > --> 677 else 'self', >>>>> > 678 )) >>>>> > 679 if repr: >>>>> > >>>>> > >>>>> ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/site-packages/dataclasses.py >>>>> > in _init_fn(fields, frozen, has_post_init, self_name) >>>>> > 422 seen_default = True >>>>> > 423 elif seen_default: >>>>> > --> 424 raise TypeError(f'non-default >>>>> argument >>>>> > {f.name !r} ' >>>>> > 425 'follows default >>>>> argument') >>>>> > 426 >>>>> > >>>>> > TypeError: non-default argument 'other_field' follows default >>>>> > argument >>>>> > >>>>> > I understand that this is a limitation of positional >>>>> arguments >>>>> > because the effective __init__ signature is: >>>>> > >>>>> > def __init__(self, some_default: dict = , >>>>> > other_field: int): >>>>> > >>>>> > However, keyword only arguments allow an entirely reasonable >>>>> > solution to this problem: >>>>> > >>>>> > def __init__(self, *, some_default: dict = , >>>>> > other_field: int): >>>>> > >>>>> > And have the added benefit of making the fields in the >>>>> __init__ >>>>> > call entirely explicit. >>>>> > >>>>> > So, I propose the addition of a keyword_only flag to the >>>>> > @dataclass decorator that renders the __init__ method using >>>>> > keyword only arguments: >>>>> > >>>>> > @dataclass(keyword_only=True) >>>>> > class Bar(Foo): >>>>> > other_field: int >>>>> > >>>>> > --George Leslie-Waksman >>>>> > >>>>> > _______________________________________________ >>>>> > Python-ideas mailing list >>>>> > Python-ideas at python.org >>>>> > https://mail.python.org/mailman/listinfo/python-ideas >>>>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > _______________________________________________ >>>>> > Python-ideas mailing list >>>>> > Python-ideas at python.org >>>>> > https://mail.python.org/mailman/listinfo/python-ideas >>>>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>>>> > >>>>> >>>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>> >>>> >>> >>> >>> -- >>> --Guido van Rossum (python.org/~guido) >>> >> > > > -- > --Guido van Rossum (python.org/~guido) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Mon Jan 29 14:43:47 2018 From: guido at python.org (Guido van Rossum) Date: Mon, 29 Jan 2018 11:43:47 -0800 Subject: [Python-ideas] Dataclasses, keyword args, and inheritance In-Reply-To: References: <2a660b18-3977-2393-ef3c-02e368934c8e@trueblade.com> Message-ID: That's fair. Let me then qualify my statement with "in the initial release". The initial release has enough functionality to deal with without considering your rather esoteric use case. (And I consider it esoteric because attrs has apparently never seen the need to solve it either.) We can reconsider for 3.8. On Mon, Jan 29, 2018 at 11:38 AM, George Leslie-Waksman wrote: > Given I started this thread from a perspective of this is a feature that I > would like because I need it, it feels a little dismissive to take attrs > not having the feature to mean "there's no reason to try to implement this." > > On Mon, Jan 29, 2018 at 11:05 AM Guido van Rossum > wrote: > >> I think that settles it -- there's no reason to try to implement this. >> >> On Mon, Jan 29, 2018 at 10:51 AM, George Leslie-Waksman < >> waksman at gmail.com> wrote: >> >>> attrs' seems to also not allow mandatory attributes to follow optional >>> one: >>> >>> In [14]: @attr.s >>> ...: class Baz: >>> ...: a = attr.ib(default=attr.Factory(list)) >>> ...: b = attr.ib() >>> ...: >>> ------------------------------------------------------------ >>> --------------- >>> ValueError Traceback (most recent call >>> last) >>> in () >>> ----> 1 @attr.s >>> 2 class Baz: >>> 3 a = attr.ib(default=attr.Factory(list)) >>> 4 b = attr.ib() >>> 5 >>> >>> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ >>> python3.6/site-packages/attr/_make.py in attrs(maybe_cls, these, >>> repr_ns, repr, cmp, hash, init, slots, frozen, str, auto_attribs) >>> 700 return wrap >>> 701 else: >>> --> 702 return wrap(maybe_cls) >>> 703 >>> 704 >>> >>> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ >>> python3.6/site-packages/attr/_make.py in wrap(cls) >>> 669 raise TypeError("attrs only works with new-style >>> classes.") >>> 670 >>> --> 671 builder = _ClassBuilder(cls, these, slots, frozen, >>> auto_attribs) >>> 672 >>> 673 if repr is True: >>> >>> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ >>> python3.6/site-packages/attr/_make.py in __init__(self, cls, these, >>> slots, frozen, auto_attribs) >>> 369 >>> 370 def __init__(self, cls, these, slots, frozen, auto_attribs): >>> --> 371 attrs, super_attrs = _transform_attrs(cls, these, >>> auto_attribs) >>> 372 >>> 373 self._cls = cls >>> >>> /Users/waksman/.pyenv/versions/3.6.1/envs/temp/lib/ >>> python3.6/site-packages/attr/_make.py in _transform_attrs(cls, these, >>> auto_attribs) >>> 335 "No mandatory attributes allowed after an >>> attribute with a " >>> 336 "default value or factory. Attribute in >>> question: {a!r}" >>> --> 337 .format(a=a) >>> 338 ) >>> 339 elif had_default is False and \ >>> >>> ValueError: No mandatory attributes allowed after an attribute with a >>> default value or factory. Attribute in question: Attribute(name='b', >>> default=NOTHING, validator=None, repr=True, cmp=True, hash=None, init=True, >>> metadata=mappingproxy({}), type=None, converter=None) >>> >>> >>> On Fri, Jan 26, 2018 at 1:44 PM Guido van Rossum >>> wrote: >>> >>>> What does attrs' solution for this problem look like? >>>> >>>> On Fri, Jan 26, 2018 at 11:11 AM, George Leslie-Waksman < >>>> waksman at gmail.com> wrote: >>>> >>>>> Even if we could inherit the setting, I would think that we would >>>>> still want to require the code be explicit. It seems worse to implicitly >>>>> require keyword only arguments for a class without giving any indication in >>>>> the code. >>>>> >>>>> As it stands, the current implementation does not allow a later >>>>> subclass to be declared without `keyword_only=True` so we could handle this >>>>> case by adding a note to the `TypeError` message about considering the >>>>> keyword_only flag. >>>>> >>>>> How do I got about putting together a proposal to get this into 3.8? >>>>> >>>>> --George >>>>> >>>>> >>>>> On Thu, Jan 25, 2018 at 5:12 AM Eric V. Smith >>>>> wrote: >>>>> >>>>>> I'm not completely opposed to this feature. But there are some cases >>>>>> to >>>>>> consider. Here's the first one that occurs to me: note that due to the >>>>>> way dataclasses work, it would need to be used everywhere down an >>>>>> inheritance hierarchy. That is, if an intermediate base class required >>>>>> it, all class derived from that intermediate base would need to >>>>>> specify >>>>>> it, too. That's because each class just makes decisions based on its >>>>>> fields and its base classes' fields, and not on any flags attached to >>>>>> the base class. As it's currently implemented, a class doesn't >>>>>> remember >>>>>> any of the decorator's arguments, so there's no way to look for this >>>>>> information, anyway. >>>>>> >>>>>> I think there are enough issues here that it's not going to make it in >>>>>> to 3.7. It would require getting a firm proposal together, selling the >>>>>> idea on python-dev, and completing the implementation before Monday. >>>>>> But >>>>>> if you want to try, I'd participate in the discussion. >>>>>> >>>>>> Taking Ivan's suggestion one step further, a way to do this currently >>>>>> is >>>>>> to pass init=False and then write another decorator that adds the >>>>>> kw-only __init__. So the usage would be: >>>>>> >>>>>> @dataclass >>>>>> class Foo: >>>>>> some_default: dict = field(default_factory=dict) >>>>>> >>>>>> @kw_only_init >>>>>> @dataclass(init=False) >>>>>> class Bar(Foo): >>>>>> other_field: int >>>>>> >>>>>> kw_only_init(cls) would look at fields(cls) and construct the >>>>>> __init__. >>>>>> It would be a hassle to re-implement dataclasses's _init_fn function, >>>>>> but it could be made to work (in reality, of course, you'd just copy >>>>>> it >>>>>> and hack it up to do what you want). You'd also need to use some >>>>>> private >>>>>> knowledge of InitVars if you wanted to support them (the stock >>>>>> fields(cls) doesn't return them). >>>>>> >>>>>> For 3.8 we can consider changing dataclasses's APIs if we want to add >>>>>> this. >>>>>> >>>>>> Eric. >>>>>> >>>>>> On 1/25/2018 1:38 AM, George Leslie-Waksman wrote: >>>>>> > It may be possible but it makes for pretty leaky abstractions and >>>>>> it's >>>>>> > unclear what that custom __init__ should look like. How am I >>>>>> supposed to >>>>>> > know what the replacement for default_factory is? >>>>>> > >>>>>> > Moreover, suppose I want one base class with an optional argument >>>>>> and a >>>>>> > half dozen subclasses each with their own required argument. At that >>>>>> > point, I have to write the same __init__ function a half dozen >>>>>> times. >>>>>> > >>>>>> > It feels rather burdensome for the user when an additional flag (say >>>>>> > "kw_only=True") and a modification to: >>>>>> > https://github.com/python/cpython/blob/master/Lib/ >>>>>> dataclasses.py#L294 that >>>>>> > inserted `['*']` after `[self_name]` if the flag is specified could >>>>>> > ameliorate this entire issue. >>>>>> > >>>>>> > On Wed, Jan 24, 2018 at 3:22 PM Ivan Levkivskyi < >>>>>> levkivskyi at gmail.com >>>>>> > > wrote: >>>>>> > >>>>>> > It is possible to pass init=False to the decorator on the >>>>>> subclass >>>>>> > (and supply your own custom __init__, if necessary): >>>>>> > >>>>>> > @dataclass >>>>>> > class Foo: >>>>>> > some_default: dict = field(default_factory=dict) >>>>>> > >>>>>> > @dataclass(init=False) # This works >>>>>> > class Bar(Foo): >>>>>> > other_field: int >>>>>> > >>>>>> > -- >>>>>> > Ivan >>>>>> > >>>>>> > >>>>>> > >>>>>> > On 23 January 2018 at 03:33, George Leslie-Waksman >>>>>> > > wrote: >>>>>> > >>>>>> > The proposed implementation of dataclasses prevents defining >>>>>> > fields with defaults before fields without defaults. This >>>>>> can >>>>>> > create limitations on logical grouping of fields and on >>>>>> inheritance. >>>>>> > >>>>>> > Take, for example, the case: >>>>>> > >>>>>> > @dataclass >>>>>> > class Foo: >>>>>> > some_default: dict = field(default_factory=dict) >>>>>> > >>>>>> > @dataclass >>>>>> > class Bar(Foo): >>>>>> > other_field: int >>>>>> > >>>>>> > this results in the error: >>>>>> > >>>>>> > 5 @dataclass >>>>>> > ----> 6 class Bar(Foo): >>>>>> > 7 other_field: int >>>>>> > 8 >>>>>> > >>>>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>>>> site-packages/dataclasses.py >>>>>> > in dataclass(_cls, init, repr, eq, order, hash, frozen) >>>>>> > 751 >>>>>> > 752 # We're called as @dataclass, with a class. >>>>>> > --> 753 return wrap(_cls) >>>>>> > 754 >>>>>> > 755 >>>>>> > >>>>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>>>> site-packages/dataclasses.py >>>>>> > in wrap(cls) >>>>>> > 743 >>>>>> > 744 def wrap(cls): >>>>>> > --> 745 return _process_class(cls, repr, eq, order, >>>>>> > hash, init, frozen) >>>>>> > 746 >>>>>> > 747 # See if we're being called as @dataclass or >>>>>> > @dataclass(). >>>>>> > >>>>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>>>> site-packages/dataclasses.py >>>>>> > in _process_class(cls, repr, eq, order, hash, init, frozen) >>>>>> > 675 # in __init__. >>>>>> Use >>>>>> > "self" if possible. >>>>>> > 676 >>>>>> '__dataclass_self__' if >>>>>> > 'self' in fields >>>>>> > --> 677 else 'self', >>>>>> > 678 )) >>>>>> > 679 if repr: >>>>>> > >>>>>> > ~/.pyenv/versions/3.6.2/envs/clover_pipeline/lib/python3.6/ >>>>>> site-packages/dataclasses.py >>>>>> > in _init_fn(fields, frozen, has_post_init, self_name) >>>>>> > 422 seen_default = True >>>>>> > 423 elif seen_default: >>>>>> > --> 424 raise TypeError(f'non-default >>>>>> argument >>>>>> > {f.name !r} ' >>>>>> > 425 'follows default >>>>>> argument') >>>>>> > 426 >>>>>> > >>>>>> > TypeError: non-default argument 'other_field' follows >>>>>> default >>>>>> > argument >>>>>> > >>>>>> > I understand that this is a limitation of positional >>>>>> arguments >>>>>> > because the effective __init__ signature is: >>>>>> > >>>>>> > def __init__(self, some_default: dict = , >>>>>> > other_field: int): >>>>>> > >>>>>> > However, keyword only arguments allow an entirely reasonable >>>>>> > solution to this problem: >>>>>> > >>>>>> > def __init__(self, *, some_default: dict = , >>>>>> > other_field: int): >>>>>> > >>>>>> > And have the added benefit of making the fields in the >>>>>> __init__ >>>>>> > call entirely explicit. >>>>>> > >>>>>> > So, I propose the addition of a keyword_only flag to the >>>>>> > @dataclass decorator that renders the __init__ method using >>>>>> > keyword only arguments: >>>>>> > >>>>>> > @dataclass(keyword_only=True) >>>>>> > class Bar(Foo): >>>>>> > other_field: int >>>>>> > >>>>>> > --George Leslie-Waksman >>>>>> > >>>>>> > _______________________________________________ >>>>>> > Python-ideas mailing list >>>>>> > Python-ideas at python.org >>>>>> > https://mail.python.org/mailman/listinfo/python-ideas >>>>>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > _______________________________________________ >>>>>> > Python-ideas mailing list >>>>>> > Python-ideas at python.org >>>>>> > https://mail.python.org/mailman/listinfo/python-ideas >>>>>> > Code of Conduct: http://python.org/psf/codeofconduct/ >>>>>> > >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Python-ideas mailing list >>>>> Python-ideas at python.org >>>>> https://mail.python.org/mailman/listinfo/python-ideas >>>>> Code of Conduct: http://python.org/psf/codeofconduct/ >>>>> >>>>> >>>> >>>> >>>> -- >>>> --Guido van Rossum (python.org/~guido) >>>> >>> >> >> >> -- >> --Guido van Rossum (python.org/~guido) >> > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Mon Jan 29 15:54:41 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 29 Jan 2018 12:54:41 -0800 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180127012717.GE22500@ando.pearwood.info> References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano wrote: > tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. > Dealing with the Supplementary Unicode Planes have the same problems > that older "narrow" builds of Python sufferred from: single code points > were counted as len(2) instead of len(1), slicing could be wrong, etc. > > There are still many applications which assume Latin-1 data. For > instance, I use a media player which displays mojibake when passed > anything outside of Latin-1. > > Sometimes it is useful to know in advance when text you pass to another > application is going to run into problems because of the other > application's limitations. I'm confused -- isn't the way to do this to encode your text into the encoding the other application accepts ? if you really want to know in advance, it is so hard to run it through a encode/decode sandwich? Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it not there? Shouldn't it be? If only for this reason? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Mon Jan 29 23:24:43 2018 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Jan 2018 23:24:43 -0500 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: Just to add another perspective, I find many "performance" problems in the real world can often be attributed to factors other than the raw speed of the CPython interpreter. Yes, I'd love it if the interpreter were faster, but in my experience a lot of other things dominate. At least they do provide low hanging fruit to attack first. This can be anything from poorly written algorithms, a lack of understanding about the way Python works, use of incorrect or inefficient data structures, doing network accesses or other unpredictable work at import time, etc. The bottom line I think is that you have to measure what you've got in production, and attack the hotspots. For example, I love and can't wait to use Python 3.7's `-X importtime` flag to measure regressions in CLI start up times due to unfortunate things appearing in module globals. But there's something else that's very important to consider, which rarely comes up in these discussions, and that's the developer's productivity and programming experience. One of the things that makes Python so popular and effective I think, is that it scales well in the human dimension, meaning that it's a great language for one person, a small team, and scales all the way up to very large organizations. I've become convinced that things like type annotations helps immensely at those upper human scales; a well annotated code base can help ramp up developer productivity very quickly, and tools and IDEs are available that help quite a bit with that. This is often undervalued, but shouldn't be! Moore's Law doesn't apply to humans, and you can't effectively or cost efficiently scale up by throwing more bodies at a project. Python is one of the best languages (and ecosystems!) that make the development experience fun, high quality, and very efficient. Cheers, -Barry From ncoghlan at gmail.com Mon Jan 29 23:53:51 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Jan 2018 14:53:51 +1000 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: On 30 January 2018 at 01:43, David Mertz wrote: > Nick suggests: > >>> print(f"In European format x is {x:,.2f}, in Indian format it > is {x:,2,3.2f}") > > This looks very good and general. I only know of the "European" and South > Asian conventions in widespread use, but we could give other grouping > conventions using that little syntax and it definitely covers the ones I > know about. There's not an issue about this giving the parser for the > format mini-language hiccups over width specifier in there, is there? That's the part I haven't explicitly checked in the code, but I think it would be feasible based on https://docs.python.org/3/library/string.html#format-specification-mini-language My proposal is essentially to replace the current: grouping_option ::= "_" | "," with: grouping_option ::= underscore_grouping | comma_grouping underscore_grouping ::= "_" [group_width ("_" group_width)*] comma_grouping ::= "," [group_width ("," group_width)*] group_width ::= digit+ That's unambiguous, since the grouping field still always starts with "_" or ",", and the next field must be either the precision (which always starts with "."), the type (which is always a letter, and never a number or symbol), or the closing brace for the field specifier. Cheers, Nick. P.S. While writing this I noticed that the current format mini-language docs are incorrect and say "integer" where they should be saying "digit+": https://bugs.python.org/issue32720 -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Tue Jan 30 00:12:52 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Jan 2018 15:12:52 +1000 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: On 30 January 2018 at 06:54, Chris Barker wrote: > On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano > wrote: >> >> tcl/tk and Javascript only support UCS-2 (16 bit) Unicode strings. >> Dealing with the Supplementary Unicode Planes have the same problems >> that older "narrow" builds of Python sufferred from: single code points >> were counted as len(2) instead of len(1), slicing could be wrong, etc. >> >> There are still many applications which assume Latin-1 data. For >> instance, I use a media player which displays mojibake when passed >> anything outside of Latin-1. >> >> Sometimes it is useful to know in advance when text you pass to another >> application is going to run into problems because of the other >> application's limitations. > > > I'm confused -- isn't the way to do this to encode your text into the > encoding the other application accepts ? > > if you really want to know in advance, it is so hard to run it through a > encode/decode sandwich? > > Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it > not there? Shouldn't it be? If only for this reason? If you're wanting to check whether or not something lies entirely within the BMP, check for: 2*len(text) == len(text.encode("utf-16")) # True iff text is UCS-2 If there's an astral code point in there, then the encoded version will need more than 2 bytes for at least one element, so the result will end up being longer than it would for UCS-2 data. You can also check for pure ASCII in much the same way: len(text) == len(text.encode("utf-8")) # True iff text is 7-bit ASCII So this is partly an optimisation question: - folks want to avoid allocating a bytes object just to throw it away - folks want to avoid running the equivalent of "max(map(ord, text))" - folks know that CPython (at least) tracks this kind of info internally to manage its own storage allocations But it's also a readability question: "is_ascii()" and "is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2 (or the basic multilingual plane) *are*, whereas the current ways of checking for them require knowing how they *behave*. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Tue Jan 30 00:45:47 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Jan 2018 15:45:47 +1000 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On 30 January 2018 at 14:24, Barry Warsaw wrote: > This is often undervalued, but shouldn't be! Moore's Law doesn't apply > to humans, and you can't effectively or cost efficiently scale up by > throwing more bodies at a project. Python is one of the best languages > (and ecosystems!) that make the development experience fun, high > quality, and very efficient. I'll also note that one of the things we (and others) *have* been putting quite a bit of time into is the question of "Why do people avoid using extension modules for code acceleration?". While there are definitely still opportunities to speed up the CPython interpreter itself, they're never going to compete for raw speed with the notion of "Let's just opt out of Python's relatively expensive runtime bookkeeping for lower level code units, and use native machine types instead". (Before folks say "But what about PyPy/Numba/etc?": this is what those tools do as well, they're just able to analyse your running code and do it on the fly, rather than having an outer feedback loop of humans doing it explicitly in version control based on benchmarks, flame graphs, and other performance profiling tools) And on that front, things have progressed quite a bit in the past few years: * wheel files & the conda ecosystem mean that precompiled binaries are often readily available for Windows/Mac OS X/Linux x86_64 * tools like pysip and milksnake are working to reduce the combinatorial explosion of potential wheel build targets * tools like Cython work to lower barriers to entry between working with dynamic compilation and working with precompiled systems (this is also where Grumpy fits in, but targeting Go rather than C/C++) * at the CPython interpreter level, we continue to work to reduce the differences between what extension modules can do and what regular source and bytecode modules can do There are still lots of other things it would be nice to have (such as transplanting the notion of JavaScript source maps so that debuggers can more readily map Python pyc files and extension modules back to the corresponding lines of source code), but the idea of "What about precompiling an extension module?" is already markedly less painful than it used to be. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From steve at pearwood.info Tue Jan 30 03:00:23 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 30 Jan 2018 19:00:23 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: <20180130080022.GF26553@ando.pearwood.info> On Tue, Jan 30, 2018 at 03:12:52PM +1000, Nick Coghlan wrote: [...] > So this is partly an optimisation question: > > - folks want to avoid allocating a bytes object just to throw it away > - folks want to avoid running the equivalent of "max(map(ord, text))" > - folks know that CPython (at least) tracks this kind of info > internally to manage its own storage allocations > > But it's also a readability question: "is_ascii()" and > "is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2 > (or the basic multilingual plane) *are*, whereas the current ways of > checking for them require knowing how they *behave*. Agreed with all of those. However, given how niche the varieties other than is_ascii() are, I'm not going to push for them. I use them rarely enough, or on small enough strings, that doing an O(N) max(string) is not that great a burden. I can continue using a helper function. -- Steve From steve at pearwood.info Tue Jan 30 03:08:53 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 30 Jan 2018 19:08:53 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> Message-ID: <20180130080853.GG26553@ando.pearwood.info> On Mon, Jan 29, 2018 at 12:54:41PM -0800, Chris Barker wrote: > I'm confused -- isn't the way to do this to encode your text into the > encoding the other application accepts ? Its more about warning the user of *my* application that the data they're exporting could generate mojibake, or even fail, in the other application. > if you really want to know in advance, it is so hard to run it through a > encode/decode sandwich? See Nick's answer. > Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it > not there? Shouldn't it be? If only for this reason? Strictly speaking, UCS-2 is an obsolute standard more or less equivalent to UTF-16, except it doesn't support "astral characters" encoded by a pair of supplementary code points. However, in practice, some languages' nominal UTF-16 handling is less than 100% conformant, in that they treat a surrogate pair as two undefined characters of one code point each, instead of a single defined character of two code points. So I guess I'm using UCS-2 in an informal sense of "like UTF-16, without the astral characters". I'm not asking for an explicit UCS-2 codec. -- Steve From gadgetsteve at live.co.uk Tue Jan 30 01:28:45 2018 From: gadgetsteve at live.co.uk (Steve Barnes) Date: Tue, 30 Jan 2018 06:28:45 +0000 Subject: [Python-ideas] Suggest adding struct iter_unpack allowing mixed formats & add iter_pack Message-ID: In my work I make a lot of use of struct pack & unpack but sometimes find the fact that I either have to supply the whole format string or add position tracking mechanisms to my code explicitly so as to use pack_to or unpack_from if the structures that I am dealing with are not simple arrays of a single type. In my particular use case of de/encoding messages from a remote device that is sometimes forwarding on collected information to & from the devices that it controls the endianness is not always consistent between message elements in a single message. It would be very nice if: a) iter_unpack.next produced an iterator that optionally took an additional parameter of a format string, (with endianness permitted), to replace the current format string in use. b) there was a matching iter_pack function that the next method took either more data to pack into the buffer extending it or a new format string, either as a named parameter or with an exposed set_fmt method on the packer. This would allow my specific use case and the ones where a buffer or stream is constructed of blocks consisting of a byte or word that specifies the data type(s) to follow (and sometimes a count) followed by the actual data. -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer. From chris.barker at noaa.gov Tue Jan 30 12:14:06 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 30 Jan 2018 09:14:06 -0800 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On Sat, Jan 27, 2018 at 10:14 PM, Nick Coghlan wrote: > More broadly, the current lack of perceived commercial incentives for > large corporations to invest millions in offering a faster default > Python runtime, the way they have for the other languages you > mentioned in your initial post :) > sure, though there have been a few high profile (failed?) efforts, by Google and DropBox, yes? Unladen Swallow was one -- not sure of the code name for the other. Turns out it's really hard :-) And, of course, PyPy is one such successful effort :-) And to someone's point, IIUC, PyPy has been putting a lot of effort into C-API compatibility, to the point where it can run numpy: https://pypy.org/compat.html Who knows -- maybe we will all be running PyPy some day :-) -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Jan 30 12:21:46 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 30 Jan 2018 09:21:46 -0800 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On Mon, Jan 29, 2018 at 9:45 PM, Nick Coghlan wrote: > > I'll also note that one of the things we (and others) *have* been > putting quite a bit of time into is the question of "Why do people > avoid using extension modules for code acceleration?". > well, the scientific computing community does do that a lot -- with f2py, Cyton, and more recently numba. But the current state of the art makes it fairly easy and practical for number crunching (and to a somewhat less extent basic text crunching), but not so much for manipulating higher order data structures. For example running the OPs code through Cython would likely buy you very little performance. I don't think numba would do much for you either (though I don't have real experience with that) PyPy is the only one I know of that is targeting general "Python" code per se. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Jan 30 12:25:49 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Jan 2018 18:25:49 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? References: Message-ID: <20180130182549.39aaff6b@fsol> On Tue, 30 Jan 2018 09:14:06 -0800 Chris Barker wrote: > On Sat, Jan 27, 2018 at 10:14 PM, Nick Coghlan wrote: > > > More broadly, the current lack of perceived commercial incentives for > > large corporations to invest millions in offering a faster default > > Python runtime, the way they have for the other languages you > > mentioned in your initial post :) > > > > sure, though there have been a few high profile (failed?) efforts, by > Google and DropBox, yes? > > Unladen Swallow was one -- not sure of the code name for the other. You're probably thinking about Pyston. Regards Antoine. From chris.barker at noaa.gov Tue Jan 30 12:28:00 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 30 Jan 2018 09:28:00 -0800 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: <20180130080022.GF26553@ando.pearwood.info> References: <20180126123953.GA22500@ando.pearwood.info> <20180127012717.GE22500@ando.pearwood.info> <20180130080022.GF26553@ando.pearwood.info> Message-ID: On Tue, Jan 30, 2018 at 12:00 AM, Steven D'Aprano wrote: > > But it's also a readability question: "is_ascii()" and > > "is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2 > > (or the basic multilingual plane) *are*, whereas the current ways of > > checking for them require knowing how they *behave*. > This is important. Agreed with all of those. > > However, given how niche the varieties other than is_ascii() are, I'm > not going to push for them. I use them rarely enough, or on small enough > strings, that doing an O(N) max(string) is not that great a burden. sure, but adding is_ascii() and is_bmp() are pretty small additions as well. I"d say for the newbiew among us, it would be a nice feature: +1 As for is_bmp() -- yes, UCS-2 is "deprecated", but there are plenty of systems that don't handle UTF-16 well, so it's nice to know, and not hard to write. I also think a UCS-2 encoding would be handy -- but I won't personally use it, so I'll wait for someone that has a use case to ask for it. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From brenbarn at brenbarn.net Tue Jan 30 00:58:10 2018 From: brenbarn at brenbarn.net (Brendan Barnwell) Date: Mon, 29 Jan 2018 21:58:10 -0800 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: <5A700972.2040504@brenbarn.net> On 2018-01-29 20:24, Barry Warsaw wrote: > But there's something else that's very important to consider, which > rarely comes up in these discussions, and that's the developer's > productivity and programming experience. One of the things that makes > Python so popular and effective I think, is that it scales well in the > human dimension, meaning that it's a great language for one person, a > small team, and scales all the way up to very large organizations. I've > become convinced that things like type annotations helps immensely at > those upper human scales; a well annotated code base can help ramp up > developer productivity very quickly, and tools and IDEs are available > that help quite a bit with that. > > This is often undervalued, but shouldn't be! Moore's Law doesn't apply > to humans, and you can't effectively or cost efficiently scale up by > throwing more bodies at a project. Python is one of the best languages > (and ecosystems!) that make the development experience fun, high > quality, and very efficient. You are quite right. I think, however, that that is precisely why it's important to improve the speed of Python. It's easier to make a good language fast than it is to make a fast language good. It's easier to hack a compiler or an interpreter to run slow code faster than it is to hack the human brain to understand confusing code more easily. So I think the smart move is to take the languages that have intrinsically good design from cognitive/semantic perspective (such as Python) and support that good design with performant implementations. To be clear, this is just me philosophizing since I don't have the ability to do that myself. And I imagine many people on this list already think that anyone who is spending time making JavaScript faster would do better to make Python faster instead! :-) But I think it's an important corollary to the above. Python's excellence in developer-time "speed" is a sort of latent force multiplier that makes execution-time improvements all the more powerful. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown From solipsis at pitrou.net Tue Jan 30 13:10:53 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Jan 2018 19:10:53 +0100 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? References: Message-ID: <20180130191053.63e80966@fsol> On Mon, 29 Jan 2018 23:24:43 -0500 Barry Warsaw wrote: > > This is often undervalued, but shouldn't be! Moore's Law doesn't apply > to humans, and you can't effectively or cost efficiently scale up by > throwing more bodies at a project. Moore's Law doesn't really apply to semiconductors anymore either, and transistor size scaling is increasingly looking like it's reaching its end. Regards Antoine. From liam.marsh.home at gmail.com Tue Jan 30 14:43:25 2018 From: liam.marsh.home at gmail.com (liam marsh) Date: Tue, 30 Jan 2018 20:43:25 +0100 Subject: [Python-ideas] Logging: a more perseverent version of the StreamHandler? In-Reply-To: <20180129004144.GH22500@ando.pearwood.info> References: <20180129004144.GH22500@ando.pearwood.info> Message-ID: Hello, and sorry for the late answer, Le 29/01/2018 ? 01:41, Steven D'Aprano a ?crit?: > [...] > I'll comment on the email: for some reason, the implementation you > give has extraneous pipes | at the start and end of each line, e.g.: This is... unexpected, and doesn't even show on my version of the mail. My guess is some formatting happening behind the scenes > (Code being emailed probably needs to stay below 78 characters per line > to avoid being wrapped at the ends of lines.) I will try to remember this. Oh, and something I forgot in my last e-mail: credit for the |`stdout_redirected()`| contextmanager: https://stackoverflow.com/questions/4675728/redirect-stdout-to-a-file-in-python. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Tue Jan 30 15:02:05 2018 From: rosuav at gmail.com (Chris Angelico) Date: Wed, 31 Jan 2018 07:02:05 +1100 Subject: [Python-ideas] Logging: a more perseverent version of the StreamHandler? In-Reply-To: References: <20180129004144.GH22500@ando.pearwood.info> Message-ID: On Wed, Jan 31, 2018 at 6:43 AM, liam marsh wrote: > Hello, and sorry for the late answer, > > > Le 29/01/2018 ? 01:41, Steven D'Aprano a ?crit : > > [...] > I'll comment on the email: for some reason, the implementation you > give has extraneous pipes | at the start and end of each line, e.g.: > > This is... unexpected, and doesn't even show on my version of the mail. > My guess is some formatting happening behind the scenes > > (Code being emailed probably needs to stay below 78 characters per line > to avoid being wrapped at the ends of lines.) > > I will try to remember this. > > Oh, and something I forgot in my last e-mail: > credit for the `stdout_redirected()` contextmanager: > https://stackoverflow.com/questions/4675728/redirect-stdout-to-a-file-in-python. > When I reply to your email, it comes out looking like the above. There _is_ some non-text formatting happening here. (HTML, to be specific.) To be sure of how your emails will be read, I recommend disabling HTML (sometimes called "rich text", because someone saw some of the eyesores that get sent that way and said "You call that text?? That's rich!!"), so you see it the exact same way everyone else will. ChrisA From barry at python.org Tue Jan 30 15:19:03 2018 From: barry at python.org (Barry Warsaw) Date: Tue, 30 Jan 2018 15:19:03 -0500 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: <20180130191053.63e80966@fsol> References: <20180130191053.63e80966@fsol> Message-ID: Antoine Pitrou wrote: > Moore's Law doesn't really apply to semiconductors anymore either, and > transistor size scaling is increasingly looking like it's reaching its > end. You forget about the quantum computing AI blockchain in the cloud. OTOH, I still haven't perfected my clone army yet. -Barry From greg.ewing at canterbury.ac.nz Tue Jan 30 16:18:59 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 31 Jan 2018 10:18:59 +1300 Subject: [Python-ideas] Suggest adding struct iter_unpack allowing mixed formats & add iter_pack In-Reply-To: References: Message-ID: <5A70E143.80005@canterbury.ac.nz> Steve Barnes wrote: > It would be very nice if: > a) iter_unpack.next produced an iterator that optionally took an > additional parameter of a format string, (with endianness permitted), to > replace the current format string in use. Hmmm, rather than screw around with the iterator protocol, it might be better to design a separate api for this. Maybe a pair of functions: struct.read(stream, format) struct.write(stream, format, data) -- Greg From chris.barker at noaa.gov Tue Jan 30 18:34:37 2018 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 30 Jan 2018 15:34:37 -0800 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: <5A700972.2040504@brenbarn.net> References: <5A700972.2040504@brenbarn.net> Message-ID: > > It's easier to make a good language fast than it is to make a fast language good. It's easier to hack a compiler or an interpreter to run slow code faster than it is to hack the human brain to understand confusing code more easily. So I think the smart move is to take the languages that have intrinsically good design from cognitive/semantic perspective (such as Python) and support that good design with performant implementations. A lot of smart people have worked on this ? and this is where we are. It turns out that keeping Python?s fully dynamic nature while making it run faster if hard! Maybe if more time/money/brains were thrown at cPython, it could be made to run much faster ? but it doesn?t look good. -CHB From wes.turner at gmail.com Tue Jan 30 19:47:31 2018 From: wes.turner at gmail.com (Wes Turner) Date: Tue, 30 Jan 2018 19:47:31 -0500 Subject: [Python-ideas] Why CPython is still behind in performance for some widely used patterns ? In-Reply-To: References: Message-ID: On Friday, January 26, 2018, Victor Stinner wrote: > Hi, > > Well, I wrote https://faster-cpython.readthedocs.io/ website to answer > to such question. > > See for example https://faster-cpython.readthedocs.io/mutable.html > "Everything in Python is mutable". Thanks! """ Contents: - Projects to optimize CPython 3.7 - Projects to optimize CPython 3.6 - Notes on Python and CPython performance, 2017 - FAT Python - Everything in Python is mutable - Optimizations - Python bytecode - Python C API - AST Optimizers - Register-based Virtual Machine for Python - Read-only Python - History of Python optimizations - Misc - Kill the GIL? - Implementations of Python - Benchmarks - Random notes about PyPy - Talks - Links """ Pandas & Cython https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html "Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." https://github.com/maartenbreddels/vaex > Victor > > 2018-01-26 22:35 GMT+01:00 Pau Freixes : > > Hi, > > > > This mail is the consequence of a true story, a story where CPython > > got defeated by Javascript, Java, C# and Go. > > > > One of the teams of the company where Im working had a kind of > > benchmark to compare the different languages on top of their > > respective "official" web servers such as Node.js, Aiohttp, Dropwizard > > and so on. The test by itself was pretty simple and tried to test the > > happy path of the logic, a piece of code that fetches N rules from > > another system and then apply them to X whatevers also fetched from > > another system, something like that > > > > def filter(rule, whatever): > > if rule.x in whatever.x: > > return True > > > > rules = get_rules() > > whatevers = get_whatevers() > > for rule in rules: > > for whatever in whatevers: > > if filter(rule, whatever): > > cnt = cnt + 1 > > > > return cnt > > > > > > The performance of Python compared with the other languages was almost > > x10 times slower. It's true that they didn't optimize the code, but > > they did not for any language having for all of them the same cost in > > terms of iterations. > > > > Once I saw the code I proposed a pair of changes, remove the call to > > the filter function making it "inline" and caching the rule's > > attributes, something like that > > > > for rule in rules: > > x = rule.x > > for whatever in whatevers: > > if x in whatever.x: > > cnt += 1 > > > > The performance of the CPython boosted x3/x4 just doing these "silly" > things. > > > > The case of the rule cache IMHO is very striking, we have plenty > > examples in many repositories where the caching of none local > > variables is a widely used pattern, why hasn't been considered a way > > to do it implicitly and by default? > > > > The case of the slowness to call functions in CPython is quite > > recurrent and looks like its an unsolved problem at all. > > > > Sure I'm missing many things, and I do not have all of the > > information. This mail wants to get all of this information that might > > help me to understand why we are here - CPython - regarding this two > > slow patterns. > > > > This could be considered an unimportant thing, but its more relevant > > than someone could expect, at least IMHO. If the default code that you > > can write in a language is by default slow and exists an alternative > > to make it faster, this language is doing something wrong. > > > > BTW: pypy looks like is immunized [1] > > > > [1] https://gist.github.com/pfreixes/d60d00761093c3bdaf29da025a004582 > > -- > > --pau > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Wed Jan 31 05:49:42 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 31 Jan 2018 12:49:42 +0200 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: 26.01.18 10:42, INADA Naoki ????: > Currently, int(), str.isdigit(), str.isalnum(), etc... accepts > non-ASCII strings. > >>>> s = ???" >>>> s > '???' >>>> s.isdigit() > True >>>> print(ascii(s)) > '\uff11\uff12\uff13' >>>> int(s) > 123 > > But sometimes, we want to accept only ascii string. For example, > ipaddress module uses: > > _DECIMAL_DIGITS = frozenset('0123456789') > ... > if _DECIMAL_DIGITS.issuperset(str): > > ref: https://github.com/python/cpython/blob/e76daebc0c8afa3981a4c5a8b54537f756e805de/Lib/ipaddress.py#L491-L494 > > If str has str.isascii() method, it can be simpler: > > `if s.isascii() and s.isdigit():` > > I want to add it in Python 3.7 if there are no opposite opinions. There were discussions about this. See for example https://bugs.python.org/issue18814. In short, there are two considerations that prevented adding this feature: 1. This function can have the constant computation complexity in CPython (just check a single bit), but other implementations may provide only the linear computation complexity. 2. In many cases just after taking the answer to this question we encode the string to bytes (or decode bytes to string). Thus the most natural way to determining if the string is ASCII-only is trying to encode it to ASCII. And adding a new method to the basic type has a high bar. The code in ipaddress if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str): cls._report_invalid_netmask(prefixlen_str) try: prefixlen = int(prefixlen_str) except ValueError: cls._report_invalid_netmask(prefixlen_str) if not (0 <= prefixlen <= cls._max_prefixlen): cls._report_invalid_netmask(prefixlen_str) return prefixlen can be rewritten as: if not prefixlen_str.isdigit(): cls._report_invalid_netmask(prefixlen_str) try: prefixlen = int(prefixlen_str.encode('ascii')) except UnicodeEncodeError: cls._report_invalid_netmask(prefixlen_str) except ValueError: cls._report_invalid_netmask(prefixlen_str) if not (0 <= prefixlen <= cls._max_prefixlen): cls._report_invalid_netmask(prefixlen_str) return prefixlen Other possibility -- adding support of the boolean argument in str.isdigit() and similar predicates that switch them to the ASCII-only mode. Such option will be very useful for the str.strip(), str.split() and str.splilines() methods. Currently they split using all Unicode whitespaces and line separators, but there is a need to split only on ASCII whitespaces and line separators CR, LF and CRLF. In case of str.strip() and str.split() you can just pass the string of whitespace characters, but there is no such option for str.splilines(). From storchaka at gmail.com Wed Jan 31 06:03:20 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 31 Jan 2018 13:03:20 +0200 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: 19.01.18 05:51, Guido van Rossum ????: > Can someone explain to me why this is such a controversial issue? > > It seems reasonable to me to add new encodings to the stdlib that do the > roundtripping requested in the first message of the thread. As long as > they have new names that seems to fall under "practicality beats > purity". (Modifying existing encodings seems wrong -- did the feature > request somehow transmogrify into that?) In any case you need to change your code. If add new error handler -- you need to change the decoding code to use this error handler: text = data.decode(encoding, 'whatwgreplace') If add new encodings -- you need to support an alias table that maps standard encoding names to corresponding names of WHATWG encoding: aliases = {'windows_1252': 'windows-1252-whatwg', 'windows_1251': 'windows-1251-whatwg', 'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass ... } ... text = data.decode(aliases.get(normalize_encoding(encoding), encoding)) I don't see an advantage of the second approach for the end user. And of course it is more costly for maintainers, because we will need to implement around 20 new encodings, and adds a cognitive burden for new Python users, which now have more tables of encodings in the documentation. From songofacandy at gmail.com Wed Jan 31 06:18:42 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Wed, 31 Jan 2018 20:18:42 +0900 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: Hm, it seems I was too hurry to implement it... > > There were discussions about this. See for example > https://bugs.python.org/issue18814. > > In short, there are two considerations that prevented adding this feature: > > 1. This function can have the constant computation complexity in CPython > (just check a single bit), but other implementations may provide only the > linear computation complexity. > Yes. There are no O(1) guarantee about .isascii(). But I expect UTF-8 based string implementation PyPy will have can achieve O(1); just test len(s) == __internal_utf8_len(s) I think if *some* of implementations can achieve O(1), it's beneficial to implement. > 2. In many cases just after taking the answer to this question we encode the > string to bytes (or decode bytes to string). Thus the most natural way to > determining if the string is ASCII-only is trying to encode it to ASCII. > Yes. But ASCII is so special. Someone may want to check ASCII before passing string to int(), float(), decimal.Decimal(), etc... But I don't think there is real use case for encodings other than ASCII. > And adding a new method to the basic type has a high bar. > Agree. > The code in ipaddress > > if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str): > cls._report_invalid_netmask(prefixlen_str) > try: > prefixlen = int(prefixlen_str) > except ValueError: > cls._report_invalid_netmask(prefixlen_str) > if not (0 <= prefixlen <= cls._max_prefixlen): > cls._report_invalid_netmask(prefixlen_str) > return prefixlen > > can be rewritten as: > > if not prefixlen_str.isdigit(): > cls._report_invalid_netmask(prefixlen_str) > try: > prefixlen = int(prefixlen_str.encode('ascii')) > except UnicodeEncodeError: > cls._report_invalid_netmask(prefixlen_str) > except ValueError: > cls._report_invalid_netmask(prefixlen_str) > if not (0 <= prefixlen <= cls._max_prefixlen): > cls._report_invalid_netmask(prefixlen_str) > return prefixlen > Yes. But .isascii() will be match faster than try ... .encode('ascii') ... except UnicodeEncodeError on most Python implementations. > Other possibility -- adding support of the boolean argument in str.isdigit() > and similar predicates that switch them to the ASCII-only mode. Such option > will be very useful for the str.strip(), str.split() and str.splilines() > methods. Currently they split using all Unicode whitespaces and line > separators, but there is a need to split only on ASCII whitespaces and line > separators CR, LF and CRLF. In case of str.strip() and str.split() you can > just pass the string of whitespace characters, but there is no such option > for str.splilines(). > It sounds good idea. Maybe, keyword only argument `ascii=False`? But if revert adding str.isascii() from Python 3.7, same keyword-only argument should be added to int(), float(), decimal.Decimal(), fractions.Fraction(), etc... It's bit hard. So I think adding .isascii() is beneficial even if all str.is***() methods have `ascii=False` flag. From storchaka at gmail.com Wed Jan 31 07:17:03 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 31 Jan 2018 14:17:03 +0200 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: 31.01.18 13:18, INADA Naoki ????: > Yes. But .isascii() will be match faster than try ... > .encode('ascii') ... except UnicodeEncodeError > on most Python implementations. In this case this doesn't matter since this is an exceptional case, and in any case an exception is raised for non-ascii string. But you are true that str.isascii() can be faster than str.encode(), and encoding is not required for converting to int. >> Other possibility -- adding support of the boolean argument in str.isdigit() >> and similar predicates that switch them to the ASCII-only mode. Such option >> will be very useful for the str.strip(), str.split() and str.splilines() >> methods. Currently they split using all Unicode whitespaces and line >> separators, but there is a need to split only on ASCII whitespaces and line >> separators CR, LF and CRLF. In case of str.strip() and str.split() you can >> just pass the string of whitespace characters, but there is no such option >> for str.splilines(). >> > > It sounds good idea. Maybe, keyword only argument `ascii=False`? There is an issue for str.splilines() (don't remember the number). The main problem was that I was not sure about an obvious argument name. > But if revert adding str.isascii() from Python 3.7, same keyword-only > argument should be > added to int(), float(), decimal.Decimal(), fractions.Fraction(), > etc... It's bit hard. Ah, it is already committed. Then I think it is too later to revert this. I had doubts about this feature and were -0 for adding it (until we discuss it more), but since it is added I don't see much benefit from removing it. From victor.stinner at gmail.com Wed Jan 31 08:44:02 2018 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 31 Jan 2018 14:44:02 +0100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: I like the idea of str.isdigit(ascii=True): would behave as str.isdigit() and str.isascii(). It's easy to implement and likely to be very efficient. I'm just not sure that it's so commonly required? At least, I guess that some users can be surprised that str.isdigit() is "Unicode aware", accept non-ASCII digits, as int(str). Victor 2018-01-31 12:18 GMT+01:00 INADA Naoki : > Hm, it seems I was too hurry to implement it... > >> >> There were discussions about this. See for example >> https://bugs.python.org/issue18814. >> >> In short, there are two considerations that prevented adding this feature: >> >> 1. This function can have the constant computation complexity in CPython >> (just check a single bit), but other implementations may provide only the >> linear computation complexity. >> > > Yes. There are no O(1) guarantee about .isascii(). > But I expect UTF-8 based string implementation PyPy will have can achieve > O(1); just test len(s) == __internal_utf8_len(s) > > I think if *some* of implementations can achieve O(1), it's beneficial > to implement. > > >> 2. In many cases just after taking the answer to this question we encode the >> string to bytes (or decode bytes to string). Thus the most natural way to >> determining if the string is ASCII-only is trying to encode it to ASCII. >> > > Yes. But ASCII is so special. > Someone may want to check ASCII before passing string to int(), > float(), decimal.Decimal(), etc... > But I don't think there is real use case for encodings other than ASCII. > >> And adding a new method to the basic type has a high bar. >> > > Agree. > >> The code in ipaddress >> >> if not _BaseV4._DECIMAL_DIGITS.issuperset(prefixlen_str): >> cls._report_invalid_netmask(prefixlen_str) >> try: >> prefixlen = int(prefixlen_str) >> except ValueError: >> cls._report_invalid_netmask(prefixlen_str) >> if not (0 <= prefixlen <= cls._max_prefixlen): >> cls._report_invalid_netmask(prefixlen_str) >> return prefixlen >> >> can be rewritten as: >> >> if not prefixlen_str.isdigit(): >> cls._report_invalid_netmask(prefixlen_str) >> try: >> prefixlen = int(prefixlen_str.encode('ascii')) >> except UnicodeEncodeError: >> cls._report_invalid_netmask(prefixlen_str) >> except ValueError: >> cls._report_invalid_netmask(prefixlen_str) >> if not (0 <= prefixlen <= cls._max_prefixlen): >> cls._report_invalid_netmask(prefixlen_str) >> return prefixlen >> > > Yes. But .isascii() will be match faster than try ... > .encode('ascii') ... except UnicodeEncodeError > on most Python implementations. > > >> Other possibility -- adding support of the boolean argument in str.isdigit() >> and similar predicates that switch them to the ASCII-only mode. Such option >> will be very useful for the str.strip(), str.split() and str.splilines() >> methods. Currently they split using all Unicode whitespaces and line >> separators, but there is a need to split only on ASCII whitespaces and line >> separators CR, LF and CRLF. In case of str.strip() and str.split() you can >> just pass the string of whitespace characters, but there is no such option >> for str.splilines(). >> > > It sounds good idea. Maybe, keyword only argument `ascii=False`? > > But if revert adding str.isascii() from Python 3.7, same keyword-only > argument should be > added to int(), float(), decimal.Decimal(), fractions.Fraction(), > etc... It's bit hard. > > So I think adding .isascii() is beneficial even if all str.is***() > methods have `ascii=False` flag. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From rosuav at gmail.com Wed Jan 31 09:46:56 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 1 Feb 2018 01:46:56 +1100 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: On Thu, Feb 1, 2018 at 12:44 AM, Victor Stinner wrote: > I like the idea of str.isdigit(ascii=True): would behave as > str.isdigit() and str.isascii(). It's easy to implement and likely to > be very efficient. I'm just not sure that it's so commonly required? > > At least, I guess that some users can be surprised that str.isdigit() > is "Unicode aware", accept non-ASCII digits, as int(str). I disagree; the whole point of "isdigit" is that it catches every digit. If you want to check against a specific (and small) set of characters, just use set operations. ChrisA From guido at python.org Wed Jan 31 11:28:12 2018 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Jan 2018 08:28:12 -0800 Subject: [Python-ideas] Adding str.isascii() ? In-Reply-To: References: Message-ID: For the record, I encouraged this. I see no reason to amend my decision. Only .isascii(), for str, bytes, bytearray. No ascii= flags to other functions. On Wed, Jan 31, 2018 at 4:17 AM, Serhiy Storchaka wrote: > 31.01.18 13:18, INADA Naoki ????: > >> Yes. But .isascii() will be match faster than try ... >> .encode('ascii') ... except UnicodeEncodeError >> on most Python implementations. >> > > In this case this doesn't matter since this is an exceptional case, and in > any case an exception is raised for non-ascii string. > > But you are true that str.isascii() can be faster than str.encode(), and > encoding is not required for converting to int. > > Other possibility -- adding support of the boolean argument in >>> str.isdigit() >>> and similar predicates that switch them to the ASCII-only mode. Such >>> option >>> will be very useful for the str.strip(), str.split() and str.splilines() >>> methods. Currently they split using all Unicode whitespaces and line >>> separators, but there is a need to split only on ASCII whitespaces and >>> line >>> separators CR, LF and CRLF. In case of str.strip() and str.split() you >>> can >>> just pass the string of whitespace characters, but there is no such >>> option >>> for str.splilines(). >>> >>> >> It sounds good idea. Maybe, keyword only argument `ascii=False`? >> > > There is an issue for str.splilines() (don't remember the number). The > main problem was that I was not sure about an obvious argument name. > > But if revert adding str.isascii() from Python 3.7, same keyword-only >> argument should be >> added to int(), float(), decimal.Decimal(), fractions.Fraction(), >> etc... It's bit hard. >> > > Ah, it is already committed. Then I think it is too later to revert this. > I had doubts about this feature and were -0 for adding it (until we discuss > it more), but since it is added I don't see much benefit from removing it. > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Wed Jan 31 11:36:14 2018 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Jan 2018 08:36:14 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka wrote: > 19.01.18 05:51, Guido van Rossum ????: > >> Can someone explain to me why this is such a controversial issue? >> >> It seems reasonable to me to add new encodings to the stdlib that do the >> roundtripping requested in the first message of the thread. As long as they >> have new names that seems to fall under "practicality beats purity". >> (Modifying existing encodings seems wrong -- did the feature request >> somehow transmogrify into that?) >> > > In any case you need to change your code. If add new error handler -- you > need to change the decoding code to use this error handler: > > text = data.decode(encoding, 'whatwgreplace') > > If add new encodings -- you need to support an alias table that maps > standard encoding names to corresponding names of WHATWG encoding: > > aliases = {'windows_1252': 'windows-1252-whatwg', > 'windows_1251': 'windows-1251-whatwg', > 'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass > ... > } > ... > text = data.decode(aliases.get(normalize_encoding(encoding), > encoding)) > > I don't see an advantage of the second approach for the end user. And of > course it is more costly for maintainers, because we will need to > implement around 20 new encodings, and adds a cognitive burden for new > Python users, which now have more tables of encodings in the documentation. > Hm. As a user, unless I run into problems with a specific encoding, I never care about how many encodings we have, so I don't see how adding extra encodings bothers those users who have no need for them. There's a reason to prefer new encoding names (maybe augmented with alias table) over a new error handler: there are lots of places where encodings are passed around via text files, Internet protocols, RPC calls, layers and layers of function calls. Many of these treat the encoding as a string, not as a (string, errorhandler) pair. So there may be situations where there is no way in a given API to preserve the need for using a special error handler, while the API would not have a problem preserving just the encoding name. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Wed Jan 31 12:41:04 2018 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 31 Jan 2018 18:41:04 +0100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: <48931137-9ca3-4c84-ed06-2ac3ac7e8fcf@egenix.com> On 31.01.2018 17:36, Guido van Rossum wrote: > On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka > wrote: > > 19.01.18 05:51, Guido van Rossum ????: > > Can someone explain to me why this is such a controversial issue? > > It seems reasonable to me to add new encodings to the stdlib > that do the roundtripping requested in the first message of the > thread. As long as they have new names that seems to fall under > "practicality beats purity". (Modifying existing encodings seems > wrong -- did the feature request somehow transmogrify into that?) > > > In any case you need to change your code. If add new error handler > -- you need to change the decoding code to use this error handler: > > ? ? text = data.decode(encoding, 'whatwgreplace') > > If add new encodings -- you need to support an alias table that maps > standard encoding names to corresponding names of WHATWG encoding: > > ? ? aliases = {'windows_1252': 'windows-1252-whatwg', > ? ? ? ? ? ? ? ?'windows_1251': 'windows-1251-whatwg', > ? ? ? ? ? ? ? ?'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass > ? ? ? ? ? ? ? ?... > ? ? ? ? ? ? ? } > ? ? ... > ? ? text = data.decode(aliases.get(normalize_encoding(encoding), > encoding)) > > I don't see an advantage of the second approach for the end user. > And of course it is more costly for maintainers, because we will > need? to implement around 20 new encodings, and adds a cognitive > burden for new Python users, which now have more tables of encodings > in the documentation. > > > Hm. As a user, unless I run into problems with a specific encoding, I > never care about how many encodings we have, so I don't see how adding > extra encodings bothers those users who have no need for them. > > There's a reason to prefer new encoding names (maybe augmented with > alias table) over a new error handler: there are lots of places where > encodings are passed around via text files, Internet protocols, RPC > calls, layers and layers of function calls. Many of these treat the > encoding as a string, not as a (string, errorhandler) pair. So there may > be situations where there is no way in a given API to preserve the need > for using a special error handler, while the API would not have a > problem preserving just the encoding name. I already mentioned several reasons why I don't believe it's a good idea to add these encodings to the stdlib as opposed to keeping them on PyPI for those who need them, so won't repeat. One detail I did not mention is that these encodings do not have standard names. WHATWG uses the same names as the original encodings from which they derive - which makes sense for their intended purpose to interpret data coming from web servers, essentially in a decoding only way, but cannot be used for Python since our encodings follow the Unicode standard and don't generate mojibake when encoding. Whatever name would be used in the stdlib would neither be compatible to WHATWG nor to IANA. No other tool outside Python would be able to interpret the encoded data using those names. Given all those issues, I don't see what the benefit would be to add these encodings to the stdlib over leaving them on PyPI for the special use case of reading broken web server data. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jan 31 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From storchaka at gmail.com Wed Jan 31 12:48:29 2018 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 31 Jan 2018 19:48:29 +0200 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: 31.01.18 18:36, Guido van Rossum ????: > On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka > > wrote: > > 19.01.18 05:51, Guido van Rossum ????: > > Can someone explain to me why this is such a controversial issue? > > It seems reasonable to me to add new encodings to the stdlib > that do the roundtripping requested in the first message of the > thread. As long as they have new names that seems to fall under > "practicality beats purity". (Modifying existing encodings seems > wrong -- did the feature request somehow transmogrify into that?) > > > In any case you need to change your code. If add new error handler > -- you need to change the decoding code to use this error handler: > > ? ? text = data.decode(encoding, 'whatwgreplace') > > If add new encodings -- you need to support an alias table that maps > standard encoding names to corresponding names of WHATWG encoding: > > ? ? aliases = {'windows_1252': 'windows-1252-whatwg', > ? ? ? ? ? ? ? ?'windows_1251': 'windows-1251-whatwg', > ? ? ? ? ? ? ? ?'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass > ? ? ? ? ? ? ? ?... > ? ? ? ? ? ? ? } > ? ? ... > ? ? text = data.decode(aliases.get(normalize_encoding(encoding), > encoding)) > > I don't see an advantage of the second approach for the end user. > And of course it is more costly for maintainers, because we will > need? to implement around 20 new encodings, and adds a cognitive > burden for new Python users, which now have more tables of encodings > in the documentation. > > > Hm. As a user, unless I run into problems with a specific encoding, I > never care about how many encodings we have, so I don't see how adding > extra encodings bothers those users who have no need for them. The codecs module documentation contains several tables of encodings: standard encodings, Python-specific text encodings, binary transforms and text transforms (a single one). This will add yet one large table. The user that learn Python will need to learn the difference of these encodings from others encodings and how to use them correctly. The new user doesn't know what is important for he, and what he can ignore until he will need it (and how to know that he needs it). > There's a reason to prefer new encoding names (maybe augmented with > alias table) over a new error handler: there are lots of places where > encodings are passed around via text files, Internet protocols, RPC > calls, layers and layers of function calls. Many of these treat the > encoding as a string, not as a (string, errorhandler) pair. So there may > be situations where there is no way in a given API to preserve the need > for using a special error handler, while the API would not have a > problem preserving just the encoding name. The passed encoding differs from the name of new Python encoding. It is just 'windows-1252', not 'windows-1252-whatwg'. If just change the existing encoding, this can break other code that expects the standard 'windows-1252'. Thus every time when you need 'windows-1252-whatwg' instead of 'windows-1252' passed with the text, you need to map encoding names. How this differs from using a special error handler? Yet one problem, is that actually we need two error handlers. WHATWG specifies two behaviors for unmapped codes outside of C0-C1 range: replacing with a special character or error. This corresponds standard Python handlers 'replace' and 'strict'. Thus we need either add two new error handlers 'whatwgreplace' and 'whatwgstrict', or add *two* sets of new encodings (more than 70 encodings totally!). From guido at python.org Wed Jan 31 14:23:42 2018 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Jan 2018 11:23:42 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: OK, I am no longer interested in this topic. If you can't reach agreement, so be it, and then the status quo prevails. I am going to mute this thread. There's no need to explain to me why I am wrong. On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka wrote: > 31.01.18 18:36, Guido van Rossum ????: > > On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka > > wrote: >> >> 19.01.18 05:51, Guido van Rossum ????: >> >> Can someone explain to me why this is such a controversial issue? >> >> It seems reasonable to me to add new encodings to the stdlib >> that do the roundtripping requested in the first message of the >> thread. As long as they have new names that seems to fall under >> "practicality beats purity". (Modifying existing encodings seems >> wrong -- did the feature request somehow transmogrify into that?) >> >> >> In any case you need to change your code. If add new error handler >> -- you need to change the decoding code to use this error handler: >> >> text = data.decode(encoding, 'whatwgreplace') >> >> If add new encodings -- you need to support an alias table that maps >> standard encoding names to corresponding names of WHATWG encoding: >> >> aliases = {'windows_1252': 'windows-1252-whatwg', >> 'windows_1251': 'windows-1251-whatwg', >> 'utf_8': 'utf-8-whatwg', # utf-8 + surrogatepass >> ... >> } >> ... >> text = data.decode(aliases.get(normalize_encoding(encoding), >> encoding)) >> >> I don't see an advantage of the second approach for the end user. >> And of course it is more costly for maintainers, because we will >> need to implement around 20 new encodings, and adds a cognitive >> burden for new Python users, which now have more tables of encodings >> in the documentation. >> >> >> Hm. As a user, unless I run into problems with a specific encoding, I >> never care about how many encodings we have, so I don't see how adding >> extra encodings bothers those users who have no need for them. >> > > The codecs module documentation contains several tables of encodings: > standard encodings, Python-specific text encodings, binary transforms and > text transforms (a single one). This will add yet one large table. The user > that learn Python will need to learn the difference of these encodings from > others encodings and how to use them correctly. The new user doesn't know > what is important for he, and what he can ignore until he will need it (and > how to know that he needs it). > > There's a reason to prefer new encoding names (maybe augmented with alias >> table) over a new error handler: there are lots of places where encodings >> are passed around via text files, Internet protocols, RPC calls, layers and >> layers of function calls. Many of these treat the encoding as a string, not >> as a (string, errorhandler) pair. So there may be situations where there is >> no way in a given API to preserve the need for using a special error >> handler, while the API would not have a problem preserving just the >> encoding name. >> > > The passed encoding differs from the name of new Python encoding. It is > just 'windows-1252', not 'windows-1252-whatwg'. If just change the existing > encoding, this can break other code that expects the standard > 'windows-1252'. Thus every time when you need 'windows-1252-whatwg' instead > of 'windows-1252' passed with the text, you need to map encoding names. How > this differs from using a special error handler? > > Yet one problem, is that actually we need two error handlers. WHATWG > specifies two behaviors for unmapped codes outside of C0-C1 range: > replacing with a special character or error. This corresponds standard > Python handlers 'replace' and 'strict'. Thus we need either add two new > error handlers 'whatwgreplace' and 'whatwgstrict', or add *two* sets of new > encodings (more than 70 encodings totally!). > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From rspeer at luminoso.com Wed Jan 31 15:51:42 2018 From: rspeer at luminoso.com (Rob Speer) Date: Wed, 31 Jan 2018 20:51:42 +0000 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: On Wed, 31 Jan 2018 at 12:50 Serhiy Storchaka wrote: > The passed encoding differs from the name of new Python encoding. It is > just 'windows-1252', not 'windows-1252-whatwg'. If just change the > existing encoding, this can break other code that expects the standard > 'windows-1252'. Thus every time when you need 'windows-1252-whatwg' > instead of 'windows-1252' passed with the text, you need to map encoding > names. How this differs from using a special error handler? > How is that the *same* as using a special error handler? This is not at all what error handlers are for. Mapping Python encoding names to the WHATWG standard (which, incidentally, is now also the W3C standard) is currently addressed by the "webencodings" package. That package currently doesn't return the correct encodings (because they don't exist), but it does at least return windows-1252 when a Web page says it's in "iso-8859-1", because that's what the Web standard says to do. ? Yet one problem, is that actually we need two error handlers. WHATWG > specifies two behaviors for unmapped codes outside of C0-C1 range: > replacing with a special character or error. This corresponds standard > Python handlers 'replace' and 'strict'. Thus we need either add two new > error handlers 'whatwgreplace' and 'whatwgstrict', or add *two* sets of > new encodings (more than 70 encodings totally!). > What?! This is going way off the rails. There are 8 new encodings. Not 70. Those 8 encodings would use the error handlers that already exist in Python. Why are you even talking about the C0 range? The C0 range is in ASCII. The ridiculous complexity of some of these counter-proposals has largely come from trying to use an error handler to do an encoding's job; now you're proposing to also use more encodings to do the error handler's job. I don't think it's a serious proposal, it's just so you could say "now you need 70 encodings lol". Maybe you just like to torpedo things? The "whatwg error handler" thing will not happen. It is a terrible design, a misunderstanding of what error handlers are for, and it attempts to be an overly-general solution to a _problem that does not generalize_. Even if this task could be sensibly implemented with error handlers, there are no other instances where these error handlers would ever be useful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric at trueblade.com Wed Jan 31 17:14:07 2018 From: eric at trueblade.com (Eric V. Smith) Date: Wed, 31 Jan 2018 17:14:07 -0500 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> Message-ID: <7679410d-8261-33f4-54ca-a2e4d8edb09a@trueblade.com> On 1/29/2018 2:13 AM, Nick Coghlan wrote: > On 29 January 2018 at 11:48, Nathaniel Smith wrote: >> On Sun, Jan 28, 2018 at 5:31 PM, David Mertz wrote: >>> I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. >>> But it's still much less general than having the option in the >>> f-string/.format() mini-language. This is really about the formatted >>> string, not necessarily about the locale. So, e.g. I'd like to be able to >>> write: >>> >>>>>> print(f"In European format x is {x:,.2f}, in Indian format it is >>>>>> {x:`.2f}") >>> >>> I don't want the format necessarily to be some pseudo-global setting, even >>> if it can get stored in thread-locals. That said, having a locale-aware >>> symbol for delimiting numbers in the format mini-language would also not be >>> a bad thing. >> >> I don't understand the format mini-language well enough to know what >> would fit in, but maybe some way to (a) request localified formatting, > > Given the example, I think a more useful approach would be to allow an > optional digit grouping specifier after the comma separator, and allow > the separator to be repeated to indicate non-uniform groupings in the > lower order digits. > > If we did that, then David's example could become: > > >>> print(f"In European format x is {x:,.2f}, in Indian format it > is {x:,2,3.2f}") This just seems too complicated to me, and is overgeneralizing. How many of these different formats would ever really be used? Can you really expect someone to remember what that means by looking at it? If you are going to generalize it, at least go all the way and support the struct lconv "CHAR_MAX" behavior, too. However, I suggest just pick another character to use instead of ",", and have it mean the 2,3 format. With no evidence (and willing to be wrong), it seems like it's the next-most needed variety of this. Maybe use ";"? Eric > > The core elements of interpreting that would then be: > > - digit group size specifiers are permited for both "," (decimal > display only) and "_" (all display bases) > - if no digit group size specifier is given, it defaults to 3 for > decimal and 4 for binary, octal, and hexadecimal > - if multiple digit group specifiers are given, then the last one > given is applied starting from the least significant integer digit > > so "{x:,2,3.2f}" means: > > - an arbitrary number of leading 2-digit groups > - 1 group of 3 digits > - 2 decimal places > > It would then be reasonably straightforward to use this as a lower > level primitive to implement locale dependent formatting, as follows: > > - format in English using the locale's grouping rules [1] (either > LC_NUMERIC.grouping or LC_MONETARY.mon_grouping, as appropriate) > - use str.translate() [2] to replace "," and "." with the locale's > thousands_sep & decimal_point or mon_thousands_sep & mon_decimal_point > > [1] https://docs.python.org/3/library/locale.html#locale.localeconv > [2] https://docs.python.org/3/library/stdtypes.html#str.translate > > Cheers, > Nick. > From chris.barker at noaa.gov Wed Jan 31 18:15:08 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 31 Jan 2018 15:15:08 -0800 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka wrote: > Hm. As a user, unless I run into problems with a specific encoding, I >> never care about how many encodings we have, so I don't see how adding >> extra encodings bothers those users who have no need for them. >> > > The codecs module documentation contains several tables of encodings: > standard encodings, Python-specific text encodings, binary transforms and > text transforms (a single one). This will add yet one large table. The user > that learn Python will need to learn the difference of these encodings from > others encodings and how to use them correctly. The new user doesn't know > what is important for he, and what he can ignore until he will need it (and > how to know that he needs it). no new user to Python is going ot study the entire set of built-in encoding in Python to decide what is useful to them -- no one! New (and experienced) users take the opposite approach -- they need an encoding for one reason or another (they are provided data to a service that requires a particular encoding, or they are reading data in another particular encoding). They then look at the built-in encoding to see if the one they want is there. A slightly larger set to look through is a very small burden, particularly if it's properly documented and has all the common synonyms. I still have no ide4a why there is such resistance to this -- yes, it's a fairly small benefit over a package no PyPi, but there is also virtually no downside. (I'm assuming the OP (or someone) will do all the actual work of coding and updating docs....) Practicality Beats Purity -- and this is a practical solution. sigh. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Wed Jan 31 18:40:41 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 1 Feb 2018 10:40:41 +1100 Subject: [Python-ideas] Support WHATWG versions of legacy encodings In-Reply-To: References: <20180119033907.GH22500@ando.pearwood.info> Message-ID: On Thu, Feb 1, 2018 at 10:15 AM, Chris Barker wrote: > I still have no ide4a why there is such resistance to this -- yes, it's a > fairly small benefit over a package no PyPi, but there is also virtually no > downside. I don't understand it either. Aside from maybe bikeshedding the *name* of the encoding, this seems like a pretty straight-forward addition. ChrisA From ncoghlan at gmail.com Wed Jan 31 23:11:25 2018 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 1 Feb 2018 14:11:25 +1000 Subject: [Python-ideas] Format mini-language for lakh and crore In-Reply-To: <7679410d-8261-33f4-54ca-a2e4d8edb09a@trueblade.com> References: <1f1f8906-8cf7-ff3c-087e-8f75b6399df5@trueblade.com> <7679410d-8261-33f4-54ca-a2e4d8edb09a@trueblade.com> Message-ID: On 1 February 2018 at 08:14, Eric V. Smith wrote: > On 1/29/2018 2:13 AM, Nick Coghlan wrote: >> Given the example, I think a more useful approach would be to allow an >> optional digit grouping specifier after the comma separator, and allow >> the separator to be repeated to indicate non-uniform groupings in the >> lower order digits. >> >> If we did that, then David's example could become: >> >> >>> print(f"In European format x is {x:,.2f}, in Indian format it >> is {x:,2,3.2f}") > > > This just seems too complicated to me, and is overgeneralizing. How many of > these different formats would ever really be used? Can you really expect > someone to remember what that means by looking at it? Sure - "," and "_" both mean "digit grouping", the numbers tell you how large the groups are from left to right (with the leftmost group size repeated as needed), and a single "," means the same thing as ",3," for decimal digits, and the same thing as ",4," for binary, octal, and hexadecimal digits. Another advantage of this approach is that we'd be able to control the grouping of binary and hexadecimal numbers printed with "_", rather than the current approach where we're restricted to half-byte grouping for binary and 16-bit word grouping for hex. > If you are going to generalize it, at least go all the way and support the > struct lconv "CHAR_MAX" behavior, too. I'm not sure how common that convention is, but if we wanted to support it then I'd spell it by repeating the group separator without an intervening group size specifier: {x:,,2,3.2f} # Ungrouped leading digits, single group of 2, single group of 3 > However, I suggest just pick another character to use instead of ",", and > have it mean the 2,3 format. With no evidence (and willing to be wrong), it > seems like it's the next-most needed variety of this. Maybe use ";"? That's even more arbitrary and hard to interpret than listing out the grouping spec, though. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia