From sjoerdjob at sjoerdjob.com Thu Dec 1 03:19:34 2016 From: sjoerdjob at sjoerdjob.com (Sjoerd Job Postmus) Date: Thu, 1 Dec 2016 09:19:34 +0100 Subject: [Python-ideas] Allow random.choice, random.sample to work on iterators In-Reply-To: References: <1480533959.3740598.804127281.6028420C@webmail.messagingengine.com> <1406db4f-8b71-bbd2-de81-4b8328f4b143@gmail.com> <73afd24e-a5c1-5646-2431-04a79a4937b9@gmail.com> Message-ID: <20161201081934.GF683@sjoerdjob.com> On Wed, Nov 30, 2016 at 02:32:54PM -0600, Nick Timkovich wrote: > a generator with known length that's not indexable (a rare beast?). Not as rare as you might think: >>> k = set(range(10)) >>> len(k) 10 >>> k[3] Traceback (most recent call last): File "", line 1, in TypeError: 'set' object does not support indexing From jelle.zijlstra at gmail.com Fri Dec 2 01:14:29 2016 From: jelle.zijlstra at gmail.com (Jelle Zijlstra) Date: Thu, 1 Dec 2016 22:14:29 -0800 Subject: [Python-ideas] Add optional defaults to namedtuple In-Reply-To: References: <583EEBBC.2050206@stoneleaf.us> Message-ID: 2016-11-30 8:11 GMT-08:00 Guido van Rossum : > On Wed, Nov 30, 2016 at 7:09 AM, Ethan Furman wrote: > >> On 11/30/2016 02:32 AM, Jelte Fennema wrote: >> >> It would be nice to have a supported way to add defaults to namedtuple, >>> so the slightly hacky solution here does not have to be used: >>> http://stackoverflow.com/a/18348004/2570866 >>> >> >> Actually, the solution right below it is better [1]: >> >> --> from collections import namedtuple >> --> class Node(namedtuple('Node', ['value', 'left', 'right'])): >> --> __slots__ = () >> --> def __new__(cls, value, left=None, right=None): >> --> return super(Node, cls).__new__(cls, value, left, right) >> >> But even more readable than that is using the NamedTuple class from my >> aenum [3] library (and on SO as [3]): >> >> --> from aenum import NamedTuple >> --> class Node(NamedTuple): >> --> val = 0 >> --> left = 1, 'previous Node', None >> --> right = 2, 'next Node', None >> >> shamelessly-plugging-my-own-solutions'ly yrs, >> > > Ditto: with PEP 526 and the latest typing.py (in 3.6) you will be able to > do this: > > class Employee(NamedTuple): > name: str > id: int > > We should make it so that the initial value in the class is used as the > default value, too. (Sorry, this syntax still has no room for a docstring > per attribute.) > > Implemented this in https://github.com/python/typing/pull/338 > -- > --Guido van Rossum (python.org/~guido ) > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From torsava at redhat.com Fri Dec 2 11:56:29 2016 From: torsava at redhat.com (Tomas Orsava) Date: Fri, 2 Dec 2016 17:56:29 +0100 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> Message-ID: On 11/30/2016 03:56 AM, Nick Coghlan wrote: > Really, I think the ideal solution from a distro perspective would be > to enable something closer to what bash and other shells support for > failed CLI calls: > > $ blender > bash: blender: command not found... > Install package 'blender' to provide command 'blender'? 
[N/y] n
>
> This would allow redistributors to point folks towards platform
> packages (via apt/yum/dnf/PyPM/conda/Canopy/etc) for the components
> they provide, and towards pip/PyPI for everything else (and while we
> don't have a dist-lookup-by-module-name service for PyPI *today*, it's
> something I hope we'll find a way to provide sometime in the next few
> years).
>
> I didn't suggest that during the Fedora-level discussions of this PEP
> because it didn't occur to me - the elegant simplicity of the new
> import suffix as a tactical solution to the immediate "splitting the
> standard library" problem [1] meant I missed that it was really a
> special case of the general "provide guidance on obtaining missing
> modules from the system package manager" concept.
>
> The problem with that idea however is that while it provides the best
> possible interactive user experience, it's potentially really slow,
> and hence too expensive to do for every import error - we would
> instead need to find a way to run with Wolfgang Maier's suggestion of
> only doing this for *unhandled* import errors.
>
> Fortunately, we do have the appropriate mechanisms in place to support
> that approach:
>
> 1. For interactive use, we have sys.excepthook
> 2. For non-interactive use, we have the atexit module
>
> As a simple example of the former:
>
> >>> def module_missing(modname):
> ...     return f"Module not found: {modname}"
> >>> def my_except_hook(exc_type, exc_value, exc_tb):
> ...     if isinstance(exc_value, ModuleNotFoundError):
> ...         print(module_missing(exc_value.name))
> ...
> >>> sys.excepthook = my_except_hook
> >>> import foo
> Module not found: foo
> >>> import foo.bar
> Module not found: foo
> >>> import sys.bar
> Module not found: sys.bar
>
> For the atexit handler, that could be installed by the `site` module,
> so the existing mechanisms for disabling site module processing would
> also disable any default exception reporting hooks. Folks could also
> register their own handlers via either `sitecustomize.py` or
> `usercustomize.py`.

Is there some reason not to use sys.excepthook for both interactive and
non-interactive use? From the docs:

"When an exception is raised and uncaught, the interpreter calls
sys.excepthook with three arguments, the exception class, exception
instance, and a traceback object. In an interactive session this
happens just before control is returned to the prompt; in a Python
program this happens just before the program exits. The handling of
such top-level exceptions can be customized by assigning another
three-argument function to sys.excepthook."

Though I believe the default sys.excepthook function is currently
written in C, so it wouldn't be very easy for distributors to
customize it. Maybe it could be made to read module=error_message
pairs from some external file, which would be easier to modify?

Yours aye,
Tomas

> And at that point the problem starts looking less like "Customise the
> handling of missing modules" and more like "Customise the rendering
> and reporting of particular types of unhandled exceptions". For
> example, a custom handler for subprocess.CalledProcessError could
> introspect the original command and use `shutil.which` to see if the
> requested command was even visible from the current process (and, in a
> redistributor provided Python, indicate which system packages to
> install to obtain the requested command).
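As a concrete sketch of the CalledProcessError handler idea in the
quoted paragraph above (illustrative only - the command-to-package
map here is invented, and a real hook would be supplied by the
redistributor):

    import shutil
    import subprocess
    import sys
    import traceback

    # Hypothetical mapping from command names to distro packages.
    PACKAGE_MAP = {"blender": "blender"}

    def redistributor_excepthook(exc_type, exc_value, exc_tb):
        # Show the normal traceback first, then add the hint.
        traceback.print_exception(exc_type, exc_value, exc_tb)
        if isinstance(exc_value, subprocess.CalledProcessError):
            cmd = exc_value.cmd
            if isinstance(cmd, (list, tuple)):
                cmd = cmd[0]
            # Only offer a hint if the command really isn't visible.
            if shutil.which(cmd) is None and cmd in PACKAGE_MAP:
                print("Install package {!r} to provide command {!r}".format(
                    PACKAGE_MAP[cmd], cmd), file=sys.stderr)

    sys.excepthook = redistributor_excepthook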
>
>> My personal vote is a callback called at
>> https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap.py#L948
>> with a default implementation that raises ModuleNotFoundError just like the
>> current line does.
>
> Ethan's observation about try/except import chains has got me thinking
> that limiting this to handling errors within the context of a single
> import statement will be problematic, especially given that folks can
> already write their own metapath hook for that case if they really
> want to.
>
> Cheers,
> Nick.
>
> [1] For folks wondering "This problem has existed for years, why
> suddenly worry about it now?", Fedora's in the process of splitting
> out an even more restricted subset of the standard library for system
> tools to use: https://fedoraproject.org/wiki/Changes/System_Python
>
> That means "You're relying on a missing stdlib module" is going to
> come up more often for system tools developers trying to stick within
> that restricted subset.

From tjreedy at udel.edu  Fri Dec 2 20:58:21 2016
From: tjreedy at udel.edu (Terry Reedy)
Date: Fri, 2 Dec 2016 20:58:21 -0500
Subject: [Python-ideas] Allow random.choice, random.sample to work on
 iterators
In-Reply-To: <20161201081934.GF683@sjoerdjob.com>
References: <1480533959.3740598.804127281.6028420C@webmail.messagingengine.com>
 <1406db4f-8b71-bbd2-de81-4b8328f4b143@gmail.com>
 <73afd24e-a5c1-5646-2431-04a79a4937b9@gmail.com>
 <20161201081934.GF683@sjoerdjob.com>
Message-ID:

On 12/1/2016 3:19 AM, Sjoerd Job Postmus wrote:
> On Wed, Nov 30, 2016 at 02:32:54PM -0600, Nick Timkovich wrote:
>> a generator with known length that's not indexable (a rare beast?).

I don't believe a generator is ever indexable.

> Not as rare as you might think:
>
>>>> k = set(range(10))
>>>> len(k)
> 10
>>>> k[3]
> Traceback (most recent call last):
>   File "", line 1, in
> TypeError: 'set' object does not support indexing

It is also not a generator. (It is an iterable.) If an *arbitrary*
choice (without replacement) from a set is sufficient, set.pop() works.
Otherwise, make a list. If we wanted selection from sets to be easy,
without making a list, we should add a method that accesses the
internal indexable array.

--
Terry Jan Reedy

From ncoghlan at gmail.com  Fri Dec 2 23:08:35 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 3 Dec 2016 14:08:35 +1000
Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library
In-Reply-To:
References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com>
Message-ID:

On 3 December 2016 at 02:56, Tomas Orsava wrote:
> Is there some reason not to use sys.excepthook for both interactive and
> non-interactive use? From the docs:
>
> "When an exception is raised and uncaught, the interpreter calls
> sys.excepthook with three arguments, the exception class, exception
> instance, and a traceback object. In an interactive session this happens
> just before control is returned to the prompt; in a Python program this
> happens just before the program exits. The handling of such top-level
> exceptions can be customized by assigning another three-argument function to
> sys.excepthook."

No, that was just me forgetting that sys.excepthook was also called
for unhandled exceptions in non-interactive mode. It further
strengthens the argument for seeing how far we can get with just the
flexibility CPython already provides, though.
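To make that concrete, the kind of external-file lookup Tomas suggests
in the quote that follows could be implemented along these lines (a
sketch only - the hint file's location and its "module=message" format
are invented for illustration, not an agreed interface):

    import sys
    import traceback

    HINT_FILE = "/etc/python3/missing-module-hints"  # hypothetical path

    def load_hints():
        # One "module_name=error message" pair per line; blank lines
        # and "#" comments are ignored.
        hints = {}
        try:
            with open(HINT_FILE) as f:
                for line in f:
                    line = line.strip()
                    if line and not line.startswith("#"):
                        name, sep, message = line.partition("=")
                        if sep:
                            hints[name.strip()] = message.strip()
        except OSError:
            pass
        return hints

    def hinting_excepthook(exc_type, exc_value, exc_tb):
        traceback.print_exception(exc_type, exc_value, exc_tb)
        if isinstance(exc_value, ModuleNotFoundError):
            hint = load_hints().get(exc_value.name)
            if hint:
                print(hint, file=sys.stderr)

    sys.excepthook = hinting_excepthook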
> Though I believe the default sys.excepthook function is currently written in
> C, so it wouldn't be very easy for distributors to customize it. Maybe it
> could be made to read module=error_message pairs from some external file,
> which would be easier to modify?

The default implementation is written in C, but distributors could
patch site.py to replace it with a custom one written in Python. For
example, publish a "fedora-hooks" module to PyPI (so non-system Python
installations or applications regularly run without the site module
can readily use the same hooks if they choose to do so), and then
patch site.py in the system Python to do:

    import fedora_hooks
    fedora_hooks.install_excepthook()

The nice thing about that approach is it wouldn't need a new switch to
turn it off - it would get turned off with all the other site-specific
customisations when -S or -I is used. It would also better open things
up to redistributor experimentation in existing releases (2.7, 3.5,
etc) before we commit to a specific approach in the reference
interpreter (such as adding an optional 'platform.hooks' submodule
that vendors may provide, and relevant stdlib APIs will then call
automatically to override the default upstream provided processing).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

From askvictor at gmail.com  Sun Dec 4 18:15:04 2016
From: askvictor at gmail.com (victor rajewski)
Date: Sun, 04 Dec 2016 23:15:04 +0000
Subject: [Python-ideas] Better error messages [was: (no subject)]
In-Reply-To:
References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp>
Message-ID:

Thanks for all of the thoughtful replies (and for moving to a more
useful subject line).

There is currently a big push towards teaching coding and computational
thinking to school students, but a lack of skilled teachers to actually
be able to support this, and I don't see any initiatives that will
address this in a long-term, large-scale fashion (I'm speaking
primarily from an Australian perspective, and might be misreading the
situation in other countries). It's worth considering a classroom where
the teacher has minimal experience in programming, and a portion of the
students have low confidence in computing matters. Anything that will
empower either the teacher or the students to get past a block will be
useful here; and error messages are, in my experience as a teacher, one
of the more threatening parts of Python for the beginner.

A few clarifications and thoughts arising from the discussion:

- I personally find the current error messages quite useful, and they
have the advantage of being machine-parseable, so that IDEs such as
PyCharm can add value to them. However, the audience of this idea is
not me, and probably not you. It is students who are learning Python,
and probably haven't done any programming at all. But it might also be
casual programmers who never really look at error messages as they are
too computer-y.
- Learning how to parse an error message is a very valuable skill for a
programmer to learn. However, I believe that should come later on in
their journey. A technical error message when a student is starting out
can be a bit overwhelming to some learners, who are already taking in a
lot of information.
- I'm not suggesting this should become part of the normal operation of
Python, particularly if that breaks compatibility or impacts
performance. A switch, or a separate executable would probably work.
I'd lean against the idea of tying this to a particular IDE/environment, but if that's the way this can progress, then let's do that to get it moving. However, it has to be dead simple to get it running. - I think this is necessary for scripts as well as the REPL (also other envs like Jupyter notebooks). - It will be almost impossible to deal with all cases, but that isn't the point here. The trick would be to find the most common errors that a beginning programmer will make, find the most common fixes, and provide them as hints, or suggestions. - The examples listed in my original email are simply ideas, without much thought about how feasible (or useful) they are to implement. Going forward, we would identify common errors that beginners make, and what would help them fix these errors. -- Victor Rajewski -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Sun Dec 4 18:40:21 2016 From: rosuav at gmail.com (Chris Angelico) Date: Mon, 5 Dec 2016 10:40:21 +1100 Subject: [Python-ideas] Better error messages [was: (no subject)] In-Reply-To: References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp> Message-ID: On Mon, Dec 5, 2016 at 10:15 AM, victor rajewski wrote: > There is currently a big push towards teaching coding and computational > thinking to school students, but a lack of skilled teachers to actually be > able to support this, and I don't see any initiatives that will address this > in a long-term, large-scale fashion (I'm speaking primarily from an > Australian perspective, and might be misreading the situation in other > countries). It's worth considering a classroom where the teacher has minimal > experience in programming, and a portion of the students have low confidence > in computing matters. Anything that will empower either the teacher or the > students to get past a block will be useful here; and error messages are, in > my experience as a teacher, one of more threatening parts of Python for the > beginner. While I fully support enhancements to error messages (and the possibility of a "programming student" mode that assumes a novice and tweaks the messages accordingly), I don't think it's right to aim at a classroom where *the teacher* doesn't have sufficient programming skills. Would you build a pocket calculator so it can be used in a classroom where even the teacher doesn't know about division by zero? Would you design a violin so a non-musician can teach its use? IMO the right way to teach computer programming is for it to be the day job for people who do all their programming in open source and/or personal projects. There are plenty of people competent enough to teach programming and would benefit from a day job. Design the error messages to minimize the load on the room's sole expert, but assume that there'll always be someone around who can deal with the edge cases. In other words, aim for the 90% or 95%, rather than trying to explain 100% of situations. ChrisA From turnbull.stephen.fw at u.tsukuba.ac.jp Sun Dec 4 18:57:28 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. 
Turnbull) Date: Mon, 5 Dec 2016 08:57:28 +0900 Subject: [Python-ideas] Better error messages [was: (no subject)] In-Reply-To: References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp> Message-ID: <22596.44392.901811.945311@turnbull.sk.tsukuba.ac.jp> victor rajewski writes: > - I personally find the current error messages quite useful, and > they have the advantage of being machine-parseable, so that IDEs > such as PyCharm can add value to them. However, the audience of > this idea is not me, and probably not you. It is students who > are learning Python, and probably haven't done any programming > at all. But it might also be casual programmers who never really > look at error message as they are too computer-y. That's a misconception. You have not yet given up on a change to the Python interpreter, so the audience is *every* user of the Python interpreter (including other programs), and that's why you're getting pushback. The Python interpreter's main job is to execute code. A secondary job is provide *accurate* diagnostics of errors in execution. Interpreting those diagnostics is somebody else's job, typically the programmer's. For experienced programmers, that's usually what you want, because (1) the interpretation is frequently data-dependent and (2) the "obvious" suggestion may be wrong. FYI, a *lot* of effort has gone into making error messages more precise, more accurate, and more informative, eg, by improving stack traces. OTOH, if the diagnostics are accurate and machine-parsable, then the amount of annoying detail that needs to be dealt with in providing a "tutorial" front-end for those messages is small. That suggests to me that the problem really is that interpreting errors, even in "student" programs, is *hard* and rules of thumb are frequently mistaken. That's an excellent tradeoff if there's a teacher looking over the (student) programmer's shoulder. Not a good idea for the interpreter. > - I'm not suggesting this should become part of the normal > operation of Python, particularly if that breaks compatibility > or impacts performance. A switch, or a seperate executable would > probably work. I'd lean against the idea of tying this to a > particular IDE/environment, but if that's the way this can > progress, then let's do that to get it moving. It really should be a separate executable. There are multiple implementations of Python, and even restricted to CPython, with even a small amount of uptake this project will move a *lot* faster than CPython does. Every tiny change to the "better living through better errors" database makes a difference to all the students out there, so its release cycle should probably be really fast. > - The examples listed in my original email are simply ideas, > without much thought about how feasible (or useful) they are to > implement. Going forward, we would identify common errors that > beginners make, and what would help them fix these errors. In other words, you envision a long-term project with an ongoing level of effort. I think that it's worth doing. But I also think it's quite feasible to put it in a separate project, with cooperation from Python-Dev in the matter of ensuring that diagnostics are machine- parseable. Eg, this means that Python-Dev should not randomly change messages that are necessary to interpret an Exception, and in some cases it may be useful to add Exception/Error subtypes to make interpretation more precise (though this will often get pushback). 
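A rough sketch of the shape such a separate front-end executable could
take (the hint texts here are invented placeholders, not proposals for
actual wording, and a real tool would key on far more than the
exception type):

    import runpy
    import sys
    import traceback

    # Invented example entries; a real project would maintain a much
    # richer database keyed on exception type and message text.
    HINTS = [
        (SyntaxError, "Check the line shown above for a missing colon, "
                      "bracket or quote."),
        (NameError, "Check the spelling of the name, and make sure it "
                    "is assigned a value before this line runs."),
        (TypeError, "Check that the values being combined or called "
                    "here have the types you expect."),
    ]

    def main():
        try:
            runpy.run_path(sys.argv[1], run_name="__main__")
        except Exception as exc:
            traceback.print_exc()
            for exc_type, hint in HINTS:
                if isinstance(exc, exc_type):
                    print("Hint:", hint, file=sys.stderr)
                    break

    if __name__ == "__main__":
        main()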
From turnbull.stephen.fw at u.tsukuba.ac.jp Sun Dec 4 20:40:47 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Mon, 5 Dec 2016 10:40:47 +0900 Subject: [Python-ideas] Better error messages [was: (no subject)] In-Reply-To: References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp> Message-ID: <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp> Chris Angelico writes: > On Mon, Dec 5, 2016 at 10:15 AM, victor rajewski wrote: > > There is currently a big push towards teaching coding and > > computational thinking to school students, but a lack of skilled > > teachers to actually be able to support this, and I don't see any > > initiatives that will address this in a long-term, large-scale > > fashion (I'm speaking primarily from an Australian perspective, > > and might be misreading the situation in other countries). It's > > worth considering a classroom where the teacher has minimal > > experience in programming, and a portion of the students have low > > confidence in computing matters. Anything that will empower > > either the teacher or the students to get past a block will be > > useful here; and error messages are, in my experience as a > > teacher, one of more threatening parts of Python for the > > beginner. > > While I fully support enhancements to error messages (and the > possibility of a "programming student" mode that assumes a novice and > tweaks the messages accordingly), I don't think it's right to aim at a > classroom where *the teacher* doesn't have sufficient programming > skills. That's not exactly what he said. High school teachers are likely to be the product of education schools, and may be highly skilled in building PowerPoint presentations, and have some experience in programming, but not as a professional. So I can easily imagine a teacher responsible for several classes of 40 students for 2 hour-long sessions a week per class, and not being able to "interpret at a glance" many error messages produced by the Python interpreter. This is basically the "aim for 90%" approach you describe, and he admits that's the best we can do. > IMO the right way to teach computer programming is for it to be the > day job for people who do all their programming in open source and/or > personal projects. There are plenty of people competent enough to > teach programming and would benefit from a day job. I don't know where you live, but in both of my countries there is a teacher's union to ensure that nobody without an Ed degree gets near a classroom. More precisely, volunteers under the supervision of somebody with professional teaching credentials, yes, day job, not in this century. And "teaching credentials" == degree from a state- certified 4-year Ed program, not something you can get at a community college in an adult ed program. > Design the error messages to minimize the load on the room's sole > expert, but assume that there'll always be someone around who can > deal with the edge cases. In other words, aim for the 90% or 95%, > rather than trying to explain 100% of situations. I think we all agree on that. From rosuav at gmail.com Sun Dec 4 21:35:10 2016 From: rosuav at gmail.com (Chris Angelico) Date: Mon, 5 Dec 2016 13:35:10 +1100 Subject: [Python-ideas] Better error messages [was: (no subject)] In-Reply-To: <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp> References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp> <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp> Message-ID: On Mon, Dec 5, 2016 at 12:40 PM, Stephen J. 
Turnbull wrote: > That's not exactly what he said. High school teachers are likely to > be the product of education schools, and may be highly skilled in > building PowerPoint presentations, and have some experience in > programming, but not as a professional. So I can easily imagine a > teacher responsible for several classes of 40 students for 2 hour-long > sessions a week per class, and not being able to "interpret at a > glance" many error messages produced by the Python interpreter. This > is basically the "aim for 90%" approach you describe, and he admits > that's the best we can do. Okay, then I misinterpreted. Seems we are indeed in agreement. Sounds good! > > IMO the right way to teach computer programming is for it to be the > > day job for people who do all their programming in open source and/or > > personal projects. There are plenty of people competent enough to > > teach programming and would benefit from a day job. > > I don't know where you live, but in both of my countries there is a > teacher's union to ensure that nobody without an Ed degree gets near a > classroom. More precisely, volunteers under the supervision of > somebody with professional teaching credentials, yes, day job, not in > this century. And "teaching credentials" == degree from a state- > certified 4-year Ed program, not something you can get at a community > college in an adult ed program. Sadly, that's probably true here in Australia too, but I don't know for sure. I have no specific qualifications, but I teach online; it's high time the unions got broken IMO... but that's outside the scope of this. If it takes a credentialed teacher to get a job in a school, so be it - but at least make sure it's someone who knows how to interpret the error messages, so that any student who runs into trouble can ask the prof. ChrisA From ncoghlan at gmail.com Sun Dec 4 21:40:14 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 5 Dec 2016 12:40:14 +1000 Subject: [Python-ideas] Better error messages [was: (no subject)] In-Reply-To: References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp> Message-ID: On 5 December 2016 at 09:15, victor rajewski wrote: > > There is currently a big push towards teaching coding and computational > thinking to school students, but a lack of skilled teachers to actually be > able to support this, and I don't see any initiatives that will address this > in a long-term, large-scale fashion (I'm speaking primarily from an > Australian perspective, and might be misreading the situation in other > countries). It's worth considering a classroom where the teacher has minimal > experience in programming, and a portion of the students have low confidence > in computing matters. Anything that will empower either the teacher or the > students to get past a block will be useful here; and error messages are, in > my experience as a teacher, one of more threatening parts of Python for the > beginner. > Hi Victor, I'm one of the co-coordinators of the PyCon Australia Education Seminar, and agree entirely with what you say here. However, it isn't a problem that *python-dev* is well-positioned to tackle. Rather, it requires ongoing attention from vendors, volunteers and non-profit organisations that are specifically focused on meeting the needs of the educational sector. So your goal is valid, it's only your current choice of audience that is slightly mistargeted. 
Within Australia specifically, the two main drivers of the improvements in Python's suitability for teachers are Grok Learning (who provide a subscription-based online learning environment directly to schools based on a service originally developed for the annual National Computer Science School) and Code Club Australia (the Australian arm of a UK-based non-profit aimed at providing support for after-school code clubs around Australia, as well as professional development opportunities for teachers needing to cope with the incoming Digital Technologies curriculum). > I'm not suggesting this should become part of the normal operation of > Python, particularly if that breaks compatibility or impacts performance. A > switch, or a seperate executable would probably work. I'd lean against the > idea of tying this to a particular IDE/environment, but if that's the way > this can progress, then let's do that to get it moving. However, it has to > be dead simple to get it running. The model adopted by Grok Learning and many other education focused service providers (codesters.com, etc) is to provide the learning environment entirely through the browser, as that copes with entirely locked down client devices, and only requires whitelisting of the vendor's site in the school's firewall settings. The only context where it doesn't work is when the school doesn't have reliable internet connectivity at all, in which case the cheap-dedicated-device model driven by the UK's Raspberry Pi Foundation may be a more suitable option. > It will be almost impossible to deal with all cases, but that isn't the > point here. The trick would be to find the most common errors that a > beginning programmer will make, find the most common fixes, and provide them > as hints, or suggestions. > The examples listed in my original email are simply ideas, without much > thought about how feasible (or useful) they are to implement. Going forward, > we would identify common errors that beginners make, and what would help > them fix these errors. Right, and the folks best positioned to identify those errors empirically, and also to make data-driven improvements based on the typical number of iterations needed for beginners to fix their own mistakes, are the educational service providers. Some of the more sophisticated providers (like Knewton in the US) are even able to adapt their curricula on the fly, offer learners additional problems in areas they seem to be struggling with. Don't get me wrong, there are definitely lots of areas where we can make the default error messages more beginner friendly just by providing relevant information that the interpreter has available, and this is important for helping out the teachers that *don't* have institutional mandates backing them up. But for cases like the Australian Digital Curriculum, it makes sense for schools to look into the local service providers rather than asking teachers to make do with what they can download from the internet (while the latter option is viable in some cases, it really does require a high level of technical skill on the teacher's part) Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Dec 4 22:08:56 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 5 Dec 2016 13:08:56 +1000 Subject: [Python-ideas] Better error messages [was: (no subject)] In-Reply-To: References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp> <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp> Message-ID: On 5 December 2016 at 12:35, Chris Angelico wrote: > On Mon, Dec 5, 2016 at 12:40 PM, Stephen J. Turnbull > wrote: >> I don't know where you live, but in both of my countries there is a >> teacher's union to ensure that nobody without an Ed degree gets near a >> classroom. More precisely, volunteers under the supervision of >> somebody with professional teaching credentials, yes, day job, not in >> this century. And "teaching credentials" == degree from a state- >> certified 4-year Ed program, not something you can get at a community >> college in an adult ed program. > > Sadly, that's probably true here in Australia too, but I don't know > for sure. I have no specific qualifications, but I teach online; it's > high time the unions got broken IMO... but that's outside the scope of > this. If it takes a credentialed teacher to get a job in a school, so > be it - but at least make sure it's someone who knows how to interpret > the error messages, so that any student who runs into trouble can ask > the prof. Graduate diplomas in Education in Australia are one- or two-year certificate programs, and some state level industry-to-education programs aim to get folks into the classroom early by offering pre-approvals for teaching subjects specifically related to their area of expertise. However, the main problem isn't the credentials, and it's definitely not unions, it's the fact that professional software developers have a lot of options open to them both locally and globally, and "empower the next generation to be the managers of digital systems rather than their servants" has a lot of downsides compared to the alternatives (most notably: you'll get paid a lot more in industry than you will as a teacher, so opting for teaching as a change in career direction here will necessarily be a lifestyle choice based on the non-monetary factors. That's not going to change as long as people assume that teaching is easy and/or not important). That means that we're not at a point in history where we can assume that teachers are going to be more computationally literate than their students - instead, we need to assume that many of the teachers involved will themselves be new to the concepts being taught and work on empowering them *anyway*. I just don't personally think that's feasible on a volunteer basis - you need professional service providers that are familiar not only with the specific concepts and technologies being taught, but also with the bureaucratic context that the particular schools and teachers they serve have to work within. Regards, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From torsava at redhat.com Mon Dec 5 04:56:58 2016 From: torsava at redhat.com (Tomas Orsava) Date: Mon, 5 Dec 2016 10:56:58 +0100 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> Message-ID: <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> On 12/03/2016 05:08 AM, Nick Coghlan wrote: >> Though I believe the default sys.excepthook function is currently written in >> C, so it wouldn't be very easy for distributors to customize it. Maybe it >> could be made to read module=error_message pairs from some external file, >> which would be easier to modify? > The default implementation is written in C, but distributors could > patch site.py to replace it with a custom one written in Python. For > example, publish a "fedora-hooks" module to PyPI (so non-system Python > installations or applications regularly run without the site module > can readily use the same hooks if they choose to do so), and then > patch site.py in the system Python to do: > > import fedora_hooks > fedora_hooks.install_excepthook() > > The nice thing about that approach is it wouldn't need a new switch to > turn it off - it would get turned off with all the other site-specific > customisations when -S or -I is used. It would also better open things > up to redistributor experimentation in existing releases (2.7, 3.5, > etc) before we commit to a specific approach in the reference > interpreter (such as adding an optional 'platform.hooks' submodule > that vendors may provide, and relevant stdlib APIs will then call > automatically to override the default upstream provided processing). Ah, but of course! That leaves us with only one part of the PEP unresolved: When the build process is unable to compile some modules when building Python from source (such as _sqlite3 due to missing sqlite headers), it would be great to provide a custom message when one then tries to import such module when using the compiled Python. Do you see a 'pretty' solution for that within this framework? Yours aye, Tomas From torsava at redhat.com Mon Dec 5 07:53:02 2016 From: torsava at redhat.com (Tomas Orsava) Date: Mon, 5 Dec 2016 13:53:02 +0100 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> Message-ID: <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> On 12/05/2016 01:42 PM, Nick Coghlan wrote: > On 5 December 2016 at 19:56, Tomas Orsava wrote: >> On 12/03/2016 05:08 AM, Nick Coghlan wrote: >>>> Though I believe the default sys.excepthook function is currently written >>>> in >>>> C, so it wouldn't be very easy for distributors to customize it. Maybe it >>>> could be made to read module=error_message pairs from some external file, >>>> which would be easier to modify? >>> The default implementation is written in C, but distributors could >>> patch site.py to replace it with a custom one written in Python. 
For >>> example, publish a "fedora-hooks" module to PyPI (so non-system Python >>> installations or applications regularly run without the site module >>> can readily use the same hooks if they choose to do so), and then >>> patch site.py in the system Python to do: >>> >>> import fedora_hooks >>> fedora_hooks.install_excepthook() >>> >>> The nice thing about that approach is it wouldn't need a new switch to >>> turn it off - it would get turned off with all the other site-specific >>> customisations when -S or -I is used. It would also better open things >>> up to redistributor experimentation in existing releases (2.7, 3.5, >>> etc) before we commit to a specific approach in the reference >>> interpreter (such as adding an optional 'platform.hooks' submodule >>> that vendors may provide, and relevant stdlib APIs will then call >>> automatically to override the default upstream provided processing). >> Ah, but of course! That leaves us with only one part of the PEP unresolved: >> When the build process is unable to compile some modules when building >> Python from source (such as _sqlite3 due to missing sqlite headers), it >> would be great to provide a custom message when one then tries to import >> such module when using the compiled Python. >> >> Do you see a 'pretty' solution for that within this framework? > I'm not sure it qualifies as 'pretty', but one approach would be to > have a './Modules/missing/' directory that gets pre-populated with > checked in ".py" files for extension modules that aren't always > built. When getpath.c detects it's running from a development > checkout, it would add that directory to sys.path (just before > site-packages), while 'make install' and 'make altinstall' would only > copy files from that directory into the installation target if the > corresponding extension modules were missing. > > Essentially, that would be the "name.missing.py" part of the draft > proposal for optional standard library modules, just with a regular > "name.py" module name and a tweak to getpath.c. To my eye that looks like a complicated mechanism necessitating changes to several parts of the codebase. Have you considered modifying the default sys.excepthook implementation to read a list of modules and error messages from a file that was generated during the build process? To me that seems simpler, and the implementation will be only in one place. In addition, distributors could just populate that file with their data, thus we would have one mechanism for both use cases. Tomas From ncoghlan at gmail.com Mon Dec 5 07:42:04 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 5 Dec 2016 22:42:04 +1000 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> Message-ID: On 5 December 2016 at 19:56, Tomas Orsava wrote: > On 12/03/2016 05:08 AM, Nick Coghlan wrote: >>> >>> Though I believe the default sys.excepthook function is currently written >>> in >>> C, so it wouldn't be very easy for distributors to customize it. Maybe it >>> could be made to read module=error_message pairs from some external file, >>> which would be easier to modify? >> >> The default implementation is written in C, but distributors could >> patch site.py to replace it with a custom one written in Python. 
For >> example, publish a "fedora-hooks" module to PyPI (so non-system Python >> installations or applications regularly run without the site module >> can readily use the same hooks if they choose to do so), and then >> patch site.py in the system Python to do: >> >> import fedora_hooks >> fedora_hooks.install_excepthook() >> >> The nice thing about that approach is it wouldn't need a new switch to >> turn it off - it would get turned off with all the other site-specific >> customisations when -S or -I is used. It would also better open things >> up to redistributor experimentation in existing releases (2.7, 3.5, >> etc) before we commit to a specific approach in the reference >> interpreter (such as adding an optional 'platform.hooks' submodule >> that vendors may provide, and relevant stdlib APIs will then call >> automatically to override the default upstream provided processing). > > Ah, but of course! That leaves us with only one part of the PEP unresolved: > When the build process is unable to compile some modules when building > Python from source (such as _sqlite3 due to missing sqlite headers), it > would be great to provide a custom message when one then tries to import > such module when using the compiled Python. > > Do you see a 'pretty' solution for that within this framework? I'm not sure it qualifies as 'pretty', but one approach would be to have a './Modules/missing/' directory that gets pre-populated with checked in ".py" files for extension modules that aren't always built. When getpath.c detects it's running from a development checkout, it would add that directory to sys.path (just before site-packages), while 'make install' and 'make altinstall' would only copy files from that directory into the installation target if the corresponding extension modules were missing. Essentially, that would be the "name.missing.py" part of the draft proposal for optional standard library modules, just with a regular "name.py" module name and a tweak to getpath.c. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Mon Dec 5 21:27:51 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 6 Dec 2016 12:27:51 +1000 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> Message-ID: On 5 December 2016 at 22:53, Tomas Orsava wrote: > On 12/05/2016 01:42 PM, Nick Coghlan wrote: >> Essentially, that would be the "name.missing.py" part of the draft >> proposal for optional standard library modules, just with a regular >> "name.py" module name and a tweak to getpath.c. > > To my eye that looks like a complicated mechanism necessitating changes to > several parts of the codebase. Have you considered modifying the default > sys.excepthook implementation to read a list of modules and error messages > from a file that was generated during the build process? To me that seems > simpler, and the implementation will be only in one place. > > In addition, distributors could just populate that file with their data, > thus we would have one mechanism for both use cases. That's certainly another possibility, and one that initially appears to confine most of the complexity to sys.excepthook(). 
However, the problem you run into in that case is that CPython, by default, doesn't have any configuration files other than site.py, sitecustomize.py, usercustomize.py and whatever PYTHONSTARTUP points to for interactive use. The only non-executable one that is currently defined is the recommendation to redistributors in PEP 493 for file-based configuration of HTTPS-verification-by-default backports to earlier 2.7.x versions. Probably the closest analogy I can think of is the way we currently generate _sysconfigdata-.py in order to capture the build time settings such that sysconfig.get_config_vars() can report them at runtime. So using _sysconfigdata as inspiration, it would likely be possible to provide a "sysconfig.get_missing_modules()" API that the default sys.excepthook() could use to report that a particular import didn't work because an optional standard library module hadn't been built. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From torsava at redhat.com Tue Dec 6 11:50:56 2016 From: torsava at redhat.com (Tomas Orsava) Date: Tue, 6 Dec 2016 17:50:56 +0100 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> Message-ID: <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> On 12/06/2016 03:27 AM, Nick Coghlan wrote: > On 5 December 2016 at 22:53, Tomas Orsava wrote: >> On 12/05/2016 01:42 PM, Nick Coghlan wrote: >>> Essentially, that would be the "name.missing.py" part of the draft >>> proposal for optional standard library modules, just with a regular >>> "name.py" module name and a tweak to getpath.c. >> To my eye that looks like a complicated mechanism necessitating >> changes to >> several parts of the codebase. Have you considered modifying the default >> sys.excepthook implementation to read a list of modules and error >> messages >> from a file that was generated during the build process? To me that >> seems >> simpler, and the implementation will be only in one place. >> >> In addition, distributors could just populate that file with their data, >> thus we would have one mechanism for both use cases. > That's certainly another possibility, and one that initially appears > to confine most of the complexity to sys.excepthook(). However, the > problem you run into in that case is that CPython, by default, doesn't > have any configuration files other than site.py, sitecustomize.py, > usercustomize.py and whatever PYTHONSTARTUP points to for interactive > use. The only non-executable one that is currently defined is the > recommendation to redistributors in PEP 493 for file-based > configuration of HTTPS-verification-by-default backports to earlier > 2.7.x versions. > > Probably the closest analogy I can think of is the way we currently > generate _sysconfigdata-.py in order to > capture the build time settings such that sysconfig.get_config_vars() > can report them at runtime. > > So using _sysconfigdata as inspiration, it would likely be possible to > provide a "sysconfig.get_missing_modules()" API that the default > sys.excepthook() could use to report that a particular import didn't > work because an optional standard library module hadn't been built. Quite interesting. 
And sysconfig.get_missing_modules() wouldn't even have to be generated during the build process, because it would be called only when the import has failed, at which point it is obvious Python was built without said component (like _sqlite3). So do you see that as an acceptable solution? Do you prefer the one you suggested previously? Alternatively, can the contents of site.py be generated during the build process? Because if some modules couldn't be built, a custom implementation of sys.excepthook might be generated there with the data for the modules that failed to be built. Regards, Tom -------------- next part -------------- An HTML attachment was scrubbed... URL: From random832 at fastmail.com Tue Dec 6 16:01:24 2016 From: random832 at fastmail.com (Random832) Date: Tue, 06 Dec 2016 16:01:24 -0500 Subject: [Python-ideas] Proposal: Tuple of str with w'list of words' In-Reply-To: <20161112180556.GP3365@ando.pearwood.info> References: <20161112180556.GP3365@ando.pearwood.info> Message-ID: <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com> On Sat, Nov 12, 2016, at 13:05, Steven D'Aprano wrote: > I'm rather luke-warm on this proposal, although I might be convinced to > support it if: > > - w'...' unconditionally split on any whitespace (possibly > excluding NBSP); > > - and normal escapes worked. Is there any particular objection to allowing the backslash-space escape (and for escapes that mean whitespace characters, such as \t, \x20, to not split, if you meant to imply that they do)? That would provide the extra push to this being beneficial over split(). I also have an alternate idea: sl{word1 word2 'string 3' "string 4"} From turnbull.stephen.fw at u.tsukuba.ac.jp Tue Dec 6 19:51:25 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 7 Dec 2016 09:51:25 +0900 Subject: [Python-ideas] Proposal: Tuple of str with w'list of words' In-Reply-To: <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com> References: <20161112180556.GP3365@ando.pearwood.info> <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com> Message-ID: <22599.23821.553471.816507@turnbull.sk.tsukuba.ac.jp> Random832 writes: > Is there any particular objection to allowing the backslash-space escape > (and for escapes that mean whitespace characters, such as \t, \x20, to > not split, if you meant to imply that they do)? That would provide the > extra push to this being beneficial over split(). You're suggesting that (1) most escapes would be processed after splitting while (2) backslash-space (what about backslash-tab?) would be treated as an escape during splitting? > I also have an alternate idea: sl{word1 word2 'string 3' "string 4"} word1 and word2 are what perl would term "barewords"? Ie treated as strings? -1 to w"", -1 to inconsistent interpretation of escapes, and -1 to a completely new syntax. " ", "\x20", "\u0020", and "\U00000020" currently are different representations of the same string, so it would be confusing if the same notations meant different things in this context. Another syntax plus overloading standard string notation with yet another semantics (strings, rawstrings) doesn't seem like a win to me. As I accept the usual Pythonic aversion to mere abbreviations, I don't see any benefit to these notations, except for the case where a list just won't do, so you can avoid a call to tuple. 
We already have three good ways to do this:

    wordlist = ["word1", "word2", "string 3", "string 4"]
    wordlist = "word1,word2,string 3,string 4".split(",")
    wordlist = open(word_per_line_file).readlines()

and for maximum Unicode-conforming generality with compact notation:

    wordlist = "word1\uffffword2\uffffstring 3\uffffstring 4".split("\uffff")

More seriously, in most use cases there will be ASCII control
characters that you could use, which most editors can enter (though
they might be visually unattractive in many editors, eg, \x0C).

Steve

From steve at pearwood.info  Tue Dec 6 20:03:46 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Wed, 7 Dec 2016 12:03:46 +1100
Subject: [Python-ideas] Proposal: Tuple of str with w'list of words'
In-Reply-To: <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com>
References: <20161112180556.GP3365@ando.pearwood.info>
 <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com>
Message-ID: <20161207010345.GW3365@ando.pearwood.info>

On Tue, Dec 06, 2016 at 04:01:24PM -0500, Random832 wrote:
> On Sat, Nov 12, 2016, at 13:05, Steven D'Aprano wrote:
> > I'm rather luke-warm on this proposal, although I might be convinced to
> > support it if:
> >
> > - w'...' unconditionally split on any whitespace (possibly
> >   excluding NBSP);
> >
> > - and normal escapes worked.
>
> Is there any particular objection to allowing the backslash-space escape
> (and for escapes that mean whitespace characters, such as \t, \x20, to
> not split, if you meant to imply that they do)?

I hadn't actually considered the question of whether w-strings should
split before, or after, applying the escapes. (Or if I had, it was so
long ago that I forgot what I decided.)

I suppose there's no good reason for them to apply before splitting. I
cannot think of any reason why you would write:

    w"Nobody expects the Spanish\x20Inquisition!"

expecting to split "Spanish" and "Inquisition!". It's easier to just
press the spacebar. So let's suppose that escapes are processed after
the string is split, so that the w-string above becomes:

    ['Nobody', 'expects', 'the', 'Spanish Inquisition!']

Do we still need a new "\ " escape for a literal space? We clearly
don't *need* it, since the user can write \x20 or \040 or even
'\N{SPACE}'. I'm *moderately* against it, since it's hard to spot
escaped spaces in a forest of unescaped ones, or vice versa:

    # example from the OP
    songs = w'My\ Bloody\ Valentine Blue\ Suede\ Shoes'

I think that escaping spaces like that will be an attractive nuisance.
I had to read the OP's example three times before I noticed that the
space between Valentine and Blue was not escaped.

What about ordinary strings? What is 'spam\ eggs'? It could be:

- allow the escape and return 'spam eggs', even though it is pointless;

- disallow the escape, and raise an exception, even though that's
  inconsistent with w-strings.

I'm not really happy with either of those solutions (although I'm
slightly less unhappy with the first).

So in order of preference, from worst to best:

strong opposition -1 to the original proposal of w-strings with no
escapes except for \space;

weak opposition -0.25 for w-strings where \space behaves differently
(raises an exception) in regular strings;

mildly negative indifference -0 for w-strings with \space allowed in
regular strings as well;

mildly positive approval +0 for w-strings without bothering to allow
\space at all (the user can use \x20 or equivalent).

For the avoidance of doubt, by \space I mean a backslash followed by a
literal space character.
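For comparison, note that the stdlib's shlex module already provides
quote- and escape-aware splitting at run time, close to both of the
proposals being discussed:

>>> import shlex
>>> shlex.split('My\\ Bloody\\ Valentine Blue\\ Suede\\ Shoes')
['My Bloody Valentine', 'Blue Suede Shoes']
>>> shlex.split('''word1 word2 'string 3' "string 4"''')
['word1', 'word2', 'string 3', 'string 4']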
> That would provide the extra push to this being beneficial over split(). True, but it's not a lot of extra value over split(). If Python had this feature, I'd probably use it, but since it doesn't, I cannot in fairness ask somebody else to do the work on the basis that it is needed. I still think the existing solutions are Good Enough: - use split when you don't have space in any term: "fe fi fo fum".split() - use a list of manually split terms when you care about spaces: ['spam and eggs', 'cheese', 'tomato'] > I also have an alternate idea: sl{word1 word2 'string 3' "string 4"} Why "sl"? That looks like a set or a dict. Its bad enough that w-strings return a list, but to have "sl-sets" return a list is just weird :-) -- Steve From ncoghlan at gmail.com Wed Dec 7 00:24:20 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 7 Dec 2016 15:24:20 +1000 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> Message-ID: On 7 December 2016 at 02:50, Tomas Orsava wrote: > So using _sysconfigdata as inspiration, it would likely be possible to > provide a "sysconfig.get_missing_modules()" API that the default > sys.excepthook() could use to report that a particular import didn't > work because an optional standard library module hadn't been built. > > Quite interesting. And sysconfig.get_missing_modules() wouldn't even have to > be generated during the build process, because it would be called only when > the import has failed, at which point it is obvious Python was built without > said component (like _sqlite3). So do you see that as an acceptable > solution? Oh, I'd missed that - yes, the sysconfig API could potentially be something like `sysconfig.get_stdlib_modules()` and `sysconfig.get_optional_modules()` instead of specifically reporting which ones were missed by the build process. There'd still be some work around generating the manifests backing those APIs at build time (including getting them right for Windows as well), but it would make some other questions that are currently annoying to answer relatively straightforward (see http://stackoverflow.com/questions/6463918/how-can-i-get-a-list-of-all-the-python-standard-library-modules for more on that) > Do you prefer the one you suggested previously? The only strong preference I have around how this is implemented is that I don't want to add complex single-purpose runtime infrastructure for the task. For all of the other specifics, I think it makes sense to err on the side of "What will be easiest to maintain over time?" > Alternatively, can the contents of site.py be generated during the build > process? Because if some modules couldn't be built, a custom implementation > of sys.excepthook might be generated there with the data for the modules > that failed to be built. We don't really want site.py itself to be auto-generated (although it could be updated to use Argument Clinic selectively if we deemed that to be an appropriate thing to do), but there's no problem with generating either data modules or normal importable modules that get accessed from site.py. Cheers, Nick. 
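P.S. To make the sysconfig idea above concrete, here is a sketch of a helper that a reporting hook could call. Note that get_stdlib_modules() and get_optional_modules() are the hypothetical names floated in this thread -- they don't exist yet:

    import sysconfig  # real module; the two get_* calls below are not real APIs

    def explain_failed_import(name):
        # Only called once 'import name' has already failed.
        if name in sysconfig.get_optional_modules():
            return f"{name} is an optional stdlib module this build left out"
        if name in sysconfig.get_stdlib_modules():
            return f"{name} belongs to the stdlib; this installation is incomplete"
        return f"{name} is not part of the stdlib; try your package manager or PyPI"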
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From random832 at fastmail.com Wed Dec 7 01:34:06 2016 From: random832 at fastmail.com (Random832) Date: Wed, 07 Dec 2016 01:34:06 -0500 Subject: [Python-ideas] Proposal: Tuple of str with w'list of words' In-Reply-To: <22599.23821.553471.816507@turnbull.sk.tsukuba.ac.jp> References: <20161112180556.GP3365@ando.pearwood.info> <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com> <22599.23821.553471.816507@turnbull.sk.tsukuba.ac.jp> Message-ID: <1481092446.3828676.811009777.3EAA483D@webmail.messagingengine.com> On Tue, Dec 6, 2016, at 19:51, Stephen J. Turnbull wrote: > Random832 writes: > > > Is there any particular objection to allowing the backslash-space escape > > (and for escapes that mean whitespace characters, such as \t, \x20, to > > not split, if you meant to imply that they do)? That would provide the > > extra push to this being beneficial over split(). > > You're suggesting that (1) most escapes would be processed after > splitting while (2) backslash-space (what about backslash-tab?) would > be treated as an escape during splitting? I don't understand what this "after splitting" you're talking about is. It would be a single pass through the characters of the token, with space alone meaning "eat all whitespace, next string" and space in backslash state meaning "next character of current string is space", just as "t" alone means "next character of current string is letter t" and t in backslash state means "next character of current string is space". I mean, even the idea that there would be a separate "splitting step" at all makes no sense to me, this implies building an "un-split string" as if the w weren't present, processing escapes as part of that, and then parsing the resulting string in a second pass, which is something we don't do for r"..." and *shouldn't* do for f"..." If you insist on consistency, backslash-space can mean space *everywhere* [once we've gotten through the deprecation cycle of backslash-unknown inserting a literal backslash], just like "\'" works fine despite double quotes not requiring it. As for backslash-tab, we already have \t. Maybe you'd like \s better for space. > > I also have an alternate idea: sl{word1 word2 'string 3' "string 4"} > > word1 and word2 are what perl would term "barewords"? Ie treated as > strings? The name "sl" was meant to evoke shlex (the syntax itself was also inspired by perl's qw{...} though perl doesn't provide any way of escaping whitespace). And I also meant this as a launching-off point for a general suggestion of word{ ... } as a readable syntax that doesn't collide with any currently valid constructs, for new kinds of literals (e.g. frozenset{a, b, c} and so on) So the result would be, more or less, the sequence that shlex.split('''word1 word2 'string 3' "string 4"''') gives. > -1 to w"", -1 to inconsistent interpretation of escapes, and -1 to a > completely new syntax. > > " ", "\x20", "\u0020", and "\U00000020" currently are different > representations of the same string, so it would be confusing if the > same notations meant different things in this context. "'" and "\x39" (etc) are representations of the same string, but '...\x39 doesn't act as an end quote. Unescaped whitespace within a w"" literal would be *syntax*, not *content*. (Whereas in a regular literal backslash is syntax but in a r'...' 
literal it's content) > Another syntax > plus overloading standard string notation with yet another semantics > (strings, rawstrings) doesn't seem like a win to me. > > As I accept the usual Pythonic aversion to mere abbreviations, I don't > see any benefit to these notations, except for the case where a list > just won't do, so you can avoid a call to tuple. We already have > three good ways to do this: > > wordlist = ["word1", "word2", "string 3", "string 4"] > wordlist = "word1,word2,string 3,string 4".split(",") > wordlist = open(word_per_line_file).readlines() > > and for maximum Unicode-conforming generality with compact notation: > > wordlist = "word1\UFFFFword2\UFFFFstring 3\UFFFFstring > 4".split("\UFFFF") You and I have very different definitions of the word "compact". In fact, this is *so obviously* non-compact that I find it hard to believe that you're being serious, but I don't think the joke's very funny if it's intended as one. > More seriously, in most use cases there will be ASCII control > characters that you could use, which most editors can enter (though > they might be visually unattractive in many editors, eg, \x0C). The point of using space is readability. (The point of returning a tuple is to avoid the disadvantage that the list returned by split must be built at runtime and can't be loaded as a constant, or perhaps turned into a frozenset constant by the optimizer in cases like "if x in w'foo bar baz':". From random832 at fastmail.com Wed Dec 7 01:44:29 2016 From: random832 at fastmail.com (Random832) Date: Wed, 07 Dec 2016 01:44:29 -0500 Subject: [Python-ideas] Proposal: Tuple of str with w'list of words' In-Reply-To: <20161207010345.GW3365@ando.pearwood.info> References: <20161112180556.GP3365@ando.pearwood.info> <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com> <20161207010345.GW3365@ando.pearwood.info> Message-ID: <1481093069.3830650.811027345.6DCB6489@webmail.messagingengine.com> On Tue, Dec 6, 2016, at 20:03, Steven D'Aprano wrote: > > I also have an alternate idea: sl{word1 word2 'string 3' "string 4"} > > Why "sl"? Well, shlex was one of the inspirations. > That looks like a set or a dict. Its bad enough that w-strings return a > list, but to have "sl-sets" return a list is just weird :-) My idea was to have word{...} as a grand unifying solution for "we want a new kind of literal but can't think of a syntax for it that doesn't either look like grit on the screen or already means something", with this as one of the first examples. I think it's better than using word"..." for things that aren't strings. From turnbull.stephen.fw at u.tsukuba.ac.jp Wed Dec 7 02:49:27 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 7 Dec 2016 16:49:27 +0900 Subject: [Python-ideas] Proposal: Tuple of str with w'list of words' In-Reply-To: <1481092446.3828676.811009777.3EAA483D@webmail.messagingengine.com> References: <20161112180556.GP3365@ando.pearwood.info> <1481058084.3493918.810576489.342344B4@webmail.messagingengine.com> <22599.23821.553471.816507@turnbull.sk.tsukuba.ac.jp> <1481092446.3828676.811009777.3EAA483D@webmail.messagingengine.com> Message-ID: <22599.48903.87287.318504@turnbull.sk.tsukuba.ac.jp> Random832 writes: > I don't understand what this "after splitting" you're talking about > is. It would be a single pass through the characters of the token, Which may as well be thought of as a string (not a str). 
Although you can implement this process in one pass, you can also think of it in terms of two passes that give the same result. I suspect many people will think in terms of two passes, and I certainly do. Steven d'Aprano appears to, as well (he also used the "before splitting" terminology). Of course, he may find "the implementation will be single pass" persuasive, even though I don't. > You and I have very different definitions of the word "compact". In > fact, this is *so obviously* non-compact I used \u notation to ensure that people would understand that the separator is a non-character. (Emacs allows me to enter it, and with my current font it displays an empty box. I could fiddle with my PYTHONIOENCODING to use some sort of escape error handler to make it convenient, but I won't use w"" anyway so the point is sort of moot.) > (The point of returning a tuple is to avoid the disadvantage that > the list returned by split must be built at runtime and can't be > loaded as a constant, or perhaps turned into a frozenset constant > by the optimizer in cases like "if x in w'foo bar baz':". That's true, but where's the use case where that optimization matters? From mal at egenix.com Wed Dec 7 03:33:00 2016 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 7 Dec 2016 09:33:00 +0100 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> Message-ID: <5f1eea8d-dd17-9972-1865-f5c6d71d944a@egenix.com> I know that you started this thread focusing on the stdlib, but for the purpose of distributors, the scope goes far beyond just the stdlib. Basically any Python module or package which the distribution can provide should be usable as basis for a nice error message pointing to the package to install. Now, it's the distribution which knows which modules/packages are available, so we don't need a list of stdlib modules in Python to help with this. The helper function (whether called via sys.excepthook() or perhaps a new sys.importerrorhook()) would then check the imported module name against this list and write out the message pointing the user to the missing package. A list of stdlib modules may still be useful, but it comes with it's own set of problems, which should be irrelevant for this use case: some stdlib modules are optional and only available if the system provides (and Python can find) certain libs (or header files during compilation). For a distribution there are no optional stdlib modules, since the distributor will know the complete list of available modules in the distribution, including their external dependencies. In other words: Python already provides all the necessary logic to enable implementing the suggested use case. On 07.12.2016 06:24, Nick Coghlan wrote: > On 7 December 2016 at 02:50, Tomas Orsava wrote: >> So using _sysconfigdata as inspiration, it would likely be possible to >> provide a "sysconfig.get_missing_modules()" API that the default >> sys.excepthook() could use to report that a particular import didn't >> work because an optional standard library module hadn't been built. >> >> Quite interesting. 
And sysconfig.get_missing_modules() wouldn't even have to >> be generated during the build process, because it would be called only when >> the import has failed, at which point it is obvious Python was built without >> said component (like _sqlite3). So do you see that as an acceptable >> solution? > > Oh, I'd missed that - yes, the sysconfig API could potentially be > something like `sysconfig.get_stdlib_modules()` and > `sysconfig.get_optional_modules()` instead of specifically reporting > which ones were missed by the build process. There'd still be some > work around generating the manifests backing those APIs at build time > (including getting them right for Windows as well), but it would make > some other questions that are currently annoying to answer relatively > straightforward (see > http://stackoverflow.com/questions/6463918/how-can-i-get-a-list-of-all-the-python-standard-library-modules > for more on that) > >> Do you prefer the one you suggested previously? > > The only strong preference I have around how this is implemented is > that I don't want to add complex single-purpose runtime infrastructure > for the task. For all of the other specifics, I think it makes sense > to err on the side of "What will be easiest to maintain over time?" > >> Alternatively, can the contents of site.py be generated during the build >> process? Because if some modules couldn't be built, a custom implementation >> of sys.excepthook might be generated there with the data for the modules >> that failed to be built. > > We don't really want site.py itself to be auto-generated (although it > could be updated to use Argument Clinic selectively if we deemed that > to be an appropriate thing to do), but there's no problem with > generating either data modules or normal importable modules that get > accessed from site.py. > > Cheers, > Nick. > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Dec 07 2016) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From ncoghlan at gmail.com Wed Dec 7 07:57:41 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 7 Dec 2016 22:57:41 +1000 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: <5f1eea8d-dd17-9972-1865-f5c6d71d944a@egenix.com> References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> <5f1eea8d-dd17-9972-1865-f5c6d71d944a@egenix.com> Message-ID: On 7 December 2016 at 18:33, M.-A. Lemburg wrote: > I know that you started this thread focusing on the stdlib, > but for the purpose of distributors, the scope goes far > beyond just the stdlib. > > Basically any Python module or package which the distribution can > provide should be usable as basis for a nice error message pointing to > the package to install. 
The PEP draft covered two questions: - experienced redistributors breaking the standard library up into pieces - optional modules for folks building their own Python (even if they're new to that) > Now, it's the distribution which knows which modules/packages > are available, so we don't need a list of stdlib modules > in Python to help with this. Right, that's the case that we realised can be covered entirely by the suggestion "patch site.py to install a different default sys.excepthook()" > A list of stdlib modules may still be useful, but it comes > with it's own set of problems, which should be irrelevant > for this use case: some stdlib modules are optional and > only available if the system provides (and Python can find) > certain libs (or header files during compilation). While upstream changes turned out not to be necessary for the "distributor breaking up the standard library" use case, they may still prove worthwhile in making import errors more informative in the case of "I just built my own Python from upstream sources and didn't notice (or didn't read) the build message indicating that some modules weren't built". Given the precedent of the sysconfig metadata generation, providing some form of machine-readable build-time-generated module manifest should be pretty feasible if someone was motivated to implement it, and we already have the logic to track which optional modules weren't built in order to generate the message at the end of the build process. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From turnbull.stephen.fw at u.tsukuba.ac.jp Wed Dec 7 13:22:08 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Thu, 8 Dec 2016 03:22:08 +0900 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <6e27a05d-6a02-44f0-fa3f-4c14b9e1befc@redhat.com> <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> <5f1eea8d-dd17-9972-1865-f5c6d71d944a@egenix.com> Message-ID: <22600.21328.501800.953514@turnbull.sk.tsukuba.ac.jp> Nick Coghlan writes: > While upstream changes turned out not to be necessary for the > "distributor breaking up the standard library" use case, they may > still prove worthwhile in making import errors more informative in the > case of "I just built my own Python from upstream sources and didn't > notice (or didn't read) the build message indicating that some modules > weren't built". This case-by-case line of argument gives me a really bad feeling. Do we have to play whack-a-mole with every obscure message that pops up that somebody might not be reading? OK, this is a pretty common and confusing case, but surely there's something more systematic (and flexible vs. turning every error message into a complete usage manual ... which tl;dr) we can do. One way to play would be an interactive checklist-based diagnostic module (ie, a "rule-based expert system") that could be plugged into IDEs or even into sys.excepthook. Given Python's excellent introspective facilities, with a little care the rule interpreter could be designed with access to namespaces to provide additional detail or tweak rule priority. We could even build in a learning engine to give priority to users' habitual bugs (including typical mistaken diagnoses). That said, I don't have time to work on it :-(, so feel free to ignore me. 
And I grant that since AFAIK we have zero existing code for the engine and rule database, it might be a good idea to do something for some particular obscure errors in the 3.7 timeframe. From mal at egenix.com Wed Dec 7 15:04:15 2016 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 7 Dec 2016 21:04:15 +0100 Subject: [Python-ideas] PEP: Distributing a Subset of the Standard Library In-Reply-To: References: <176c5504-78a8-401d-9631-6b7126ac5af9@redhat.com> <6062460d-2cbe-63cf-8937-a2051cfbfa8a@redhat.com> <53acb4a9-052e-fad8-888e-897cac0d0356@redhat.com> <5f1eea8d-dd17-9972-1865-f5c6d71d944a@egenix.com> Message-ID: <2a1043f9-7c21-09e2-0990-0109a806f8d7@egenix.com> On 07.12.2016 13:57, Nick Coghlan wrote: > On 7 December 2016 at 18:33, M.-A. Lemburg wrote: >> I know that you started this thread focusing on the stdlib, >> but for the purpose of distributors, the scope goes far >> beyond just the stdlib. >> >> Basically any Python module or package which the distribution can >> provide should be usable as basis for a nice error message pointing to >> the package to install. > > The PEP draft covered two questions: > > - experienced redistributors breaking the standard library up into pieces > - optional modules for folks building their own Python (even if > they're new to that) > >> Now, it's the distribution which knows which modules/packages >> are available, so we don't need a list of stdlib modules >> in Python to help with this. > > Right, that's the case that we realised can be covered entirely by the > suggestion "patch site.py to install a different default > sys.excepthook()" > >> A list of stdlib modules may still be useful, but it comes >> with it's own set of problems, which should be irrelevant >> for this use case: some stdlib modules are optional and >> only available if the system provides (and Python can find) >> certain libs (or header files during compilation). > > While upstream changes turned out not to be necessary for the > "distributor breaking up the standard library" use case, they may > still prove worthwhile in making import errors more informative in the > case of "I just built my own Python from upstream sources and didn't > notice (or didn't read) the build message indicating that some modules > weren't built". > > Given the precedent of the sysconfig metadata generation, providing > some form of machine-readable build-time-generated module manifest > should be pretty feasible if someone was motivated to implement it, > and we already have the logic to track which optional modules weren't > built in order to generate the message at the end of the build > process. True, but the build process only covers C extensions. Writing the information somewhere for Python to pick up would be easy, though (just dump the .failed* lists somewhere). For pure Python modules, I suppose the install process could record all installed modules. Put all this info into a generated "_sysconfigstdlib" module, import this into sysconfig and you're set. Still, in all the years I've been using Python I never ran into a situation where I was interested in such information. For cases where a module is optional, you usually write a try...except and handle this on a case-by-case basis. That's safer than relying on some build time generated list, since the Python binary may well have been built on a different machine than the one the application is currently running on and so, even if an optional module is listed as built successfully, it may still fail to import. 
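For example, the usual guard is just a few lines (a minimal sketch; sqlite3 is picked only because it came up earlier in this thread):

    try:
        import sqlite3
    except ImportError:
        sqlite3 = None  # optional feature; callers check for None at the point of use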
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Dec 07 2016)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...           http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
                      http://www.malemburg.com/

From mikhailwas at gmail.com  Wed Dec  7 18:52:56 2016
From: mikhailwas at gmail.com (Mikhail V)
Date: Thu, 8 Dec 2016 00:52:56 +0100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
	Proposal for default character representation)
Message-ID: 

In a past discussion about inputting and printing characters, I proposed
decimal notation instead of hex. Since that discussion got lost in
off-topic talk, I'll try to summarise my idea better here.

I use ASCII only for code input (there are good reasons for that).
Here I'll use Python 3.6 on Windows 7, so I can use print() with Unicode
directly and it now works in the system console.

Suppose I am only starting to program and want to do some character
manipulation. The very first thing I would probably start with is simple
output of the Latin and Cyrillic capital letters:

caps_lat = ""
for o in range(65, 91):
    caps_lat = caps_lat + chr(o)
print(caps_lat)

caps_cyr = ""
for o in range(1040, 1072):
    caps_cyr = caps_cyr + chr(o)
print(caps_cyr)

Which prints:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

Say I now want to input something directly in code:

s = "first cyrillic letters: " + chr(1040) + chr(1041) + chr(1042)

This works fine and has a clean look. However, it is not very convenient
because of all the typing, and if I generate such strings it adds a bit
more complexity. But in general it is fine, and it is the method I use
currently.

=========
Proposal: I would like the possibility to input characters *by decimals*:

s = "first cyrillic letters: \{1040}\{1041}\{1042}"
or:
s = "first cyrillic letters: \(1040)\(1041)\(1042)"
=========

This is more compact and does not seem to contradict the current escape
characters in Python string literals much: a backslash already starts
some kind of escape in most cases.

Most important for me is that this way I would avoid any hex numbers in
strings, which I find very good for readability, and it is very
convenient for me since I use decimals for processing everywhere (and
encourage everyone to do so).

So this is my proposal; any comments on it are appreciated.

PS:

Currently Python 3 supports these in addition to \x:
(from https://docs.python.org/3/howto/unicode.html)
"""
If you can't enter a particular character in your editor or want to keep
the source code ASCII-only for some reason, you can also use escape
sequences in string literals.

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
>>> "\u0394"                          # Using a 16-bit hex value
>>> "\U00000394"                      # Using a 32-bit hex value
"""
So I have many possibilities, and all of them strangely contradict my
idea of intuitive and readable. Using the character name is readable,
but seriously it is not much of a practical solution for input; it
could, however, be very useful for printing a description of a character.
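For comparison, the closest spellings that already work today (the proposed \{...} form itself is of course not valid Python):

    s = "first cyrillic letters: \u0410\u0411\u0412"
    s = "first cyrillic letters: " + "".join(chr(n) for n in (1040, 1041, 1042))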
Mikhail From prometheus235 at gmail.com Wed Dec 7 19:13:31 2016 From: prometheus235 at gmail.com (Nick Timkovich) Date: Wed, 7 Dec 2016 18:13:31 -0600 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: Out of curiosity, why do you prefer decimal values to refer to Unicode code points? Most references, http://unicode.org/charts/PDF/U0400.pdf (official) or https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF , prefer to refer to them by hexadecimal as the planes and ranges are broken up by hex values. On Wed, Dec 7, 2016 at 5:52 PM, Mikhail V wrote: > In past discussion about inputing and printing characters, > I was proposing decimal notation instead of hex. > Since the discussion was lost in off-topic talks, I'll try to > summarise my idea better. > > I use ASCII only for code input (there are good reasons for that). > Here I'll use Python 3.6, and Windows 7, so I can use print() with unicode > directly and it works now in system console. > > Suppose I only start programming and want to do some character > manipulation. > The vey first thing I would probably start with is a simple output for > latin and cyrillic capital letters: > > caps_lat = "" > for o in range(65, 91): > caps_lat = caps_lat + chr(o) > print (caps_lat) > > caps_cyr = "" > for o in range(1040, 1072): > caps_cyr = caps_cyr + chr(o) > print (caps_cyr) > > > Which prints: > ABCDEFGHIJKLMNOPQRSTUVWXYZ > ???????????????????????????????? > > > Say, I want now to input something direct in code: > > s = "first cyrillic letters: " + chr(1040) + chr(1041) + chr(1042) > > Which works fine and has clean look. However it is not very convinient > because of much typing and also, if I generate such strings, > adds a bit more complexity. But in general it is fine, and I use this > method currently. > > ========= > Proposal: I would want to have a possibility to input it *by decimals*: > > s = "first cyrillic letters: \{1040}\{1041}\{1042}" > or: > s = "first cyrillic letters: \(1040)\(1041)\(1042)" > > ========= > > This is more compact and seems not very contradictive with > current Python escape characters in string literals. > So backslash is a start of some escaping in most cases. > > For me most important is that in such way I would avoid > any presence of hex numbers in strings, which I find very good > for readability and for me it is very convinient since I use decimals > for processing everywhere (and encourage everyone to do so). > > So this is my proposal, any comments on this are appreciated. > > > PS: > > Currently Python 3 supports these in addition to \x: > (from https://docs.python.org/3/howto/unicode.html) > """ > If you can?t enter a particular character in your editor or want to keep > the source code ASCII-only for some reason, you can also use escape > sequences in string literals. > > >>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name > >>> "\u0394" # Using a 16-bit hex value > >>> "\U00000394" # Using a 32-bit hex value > > """ > So I have many possibilities and all of them strangely contradicts with > my image of intuitive and readable. Well, using charater name is readable, > but seriously not much of a practical solution for input, but could be > very useful > for printing description of a character. 
> > > Mikhail > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mikhailwas at gmail.com Wed Dec 7 19:22:59 2016 From: mikhailwas at gmail.com (Mikhail V) Date: Thu, 8 Dec 2016 01:22:59 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 8 December 2016 at 01:13, Nick Timkovich wrote: > Out of curiosity, why do you prefer decimal values to refer to Unicode code > points? Most references, http://unicode.org/charts/PDF/U0400.pdf (official) > or https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF , > prefer to refer to them by hexadecimal as the planes and ranges are broken > up by hex values. Well, there was a huge discussion in October, see the subject name. Just didnt want it to go again in that direction. So in short hex notation not so readable and anyway decimal is kind of standard way to represent numbers and I treat string as a number array when I am processing it, so hex simply is redundant and not needed for me. Mikhail From ethan at stoneleaf.us Wed Dec 7 19:25:08 2016 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 07 Dec 2016 16:25:08 -0800 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: <5848A864.6020005@stoneleaf.us> On 12/07/2016 03:52 PM, Mikhail V wrote: > In past discussion about inputing and printing characters, > I was proposing decimal notation instead of hex. > Since the discussion was lost in off-topic talks, I'll try to > summarise my idea better. While the discussion did range far and wide, one thing that was fairly constant is that the benefit of adding one more way to represent unicode characters is not worth the work involved to make it happen; and that using hexadecimal to reference unicode characters is nearly universal. To sum up: even if you wrote all the code yourself, it would not be accepted. -- ~Ethan~ From python at mrabarnett.plus.com Wed Dec 7 19:52:25 2016 From: python at mrabarnett.plus.com (MRAB) Date: Thu, 8 Dec 2016 00:52:25 +0000 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 2016-12-07 23:52, Mikhail V wrote: > In past discussion about inputing and printing characters, > I was proposing decimal notation instead of hex. > Since the discussion was lost in off-topic talks, I'll try to > summarise my idea better. > > I use ASCII only for code input (there are good reasons for that). > Here I'll use Python 3.6, and Windows 7, so I can use print() with unicode > directly and it works now in system console. > > Suppose I only start programming and want to do some character manipulation. > The vey first thing I would probably start with is a simple output for > latin and cyrillic capital letters: > > caps_lat = "" > for o in range(65, 91): > caps_lat = caps_lat + chr(o) > print (caps_lat) > > caps_cyr = "" > for o in range(1040, 1072): > caps_cyr = caps_cyr + chr(o) > print (caps_cyr) > > > Which prints: > ABCDEFGHIJKLMNOPQRSTUVWXYZ > ???????????????????????????????? 
> > > Say, I want now to input something direct in code: > > s = "first cyrillic letters: " + chr(1040) + chr(1041) + chr(1042) > > Which works fine and has clean look. However it is not very convinient > because of much typing and also, if I generate such strings, > adds a bit more complexity. But in general it is fine, and I use this > method currently. > > ========= > Proposal: I would want to have a possibility to input it *by decimals*: > > s = "first cyrillic letters: \{1040}\{1041}\{1042}" > or: > s = "first cyrillic letters: \(1040)\(1041)\(1042)" > > ========= > It's usually the case that escapes are \ followed by an ASCII-range letter or digit; \ followed by anything else makes it a literal, even if it's a metacharacter, e.g. " terminates a string that starts with ", but \" is a literal ", so I don't like \{...}. Perl doesn't have \u... or \U..., it has \x{...} instead, and Python already has \N{...}, so: s = "first cyrillic letters: \d{1040}\d{1041}\d{1042}" might be better, but I'm still -1 because hex is usual when referring to Unicode codepoints. From tjreedy at udel.edu Wed Dec 7 19:53:52 2016 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 7 Dec 2016 19:53:52 -0500 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 12/7/2016 7:22 PM, Mikhail V wrote: > On 8 December 2016 at 01:13, Nick Timkovich wrote: >> Out of curiosity, why do you prefer decimal values to refer to Unicode code >> points? Most references, http://unicode.org/charts/PDF/U0400.pdf (official) >> or https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF , >> prefer to refer to them by hexadecimal as the planes and ranges are broken >> up by hex values. > > Well, there was a huge discussion in October, see the subject name. > Just didnt want it to go again in that direction. > So in short hex notation not so readable and anyway decimal is > kind of standard way to represent numbers and I treat string as a number array > when I am processing it, so hex simply is redundant and not needed for me. I sympathize with your preference, but ... Perhap the hex numbers would bother you less if you thought of them as 'serial numbers'. It is standard for 'serial numbers' to include letters. It is also common for digit-letter serial numbers to have meaningful fields, as as do the hex versions of unicode serial numbers. The decimal versions are meaningless except as strict sequencers. -- Terry Jan Reedy From prometheus235 at gmail.com Wed Dec 7 19:57:50 2016 From: prometheus235 at gmail.com (Nick Timkovich) Date: Wed, 7 Dec 2016 18:57:50 -0600 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: > > hex notation not so readable and anyway decimal is kind of standard way to > represent numbers Can you cite some examples of Unicode reference tables I can look up a decimal number in? They seem rare; perhaps in a list as a secondary column, but they're not organized/grouped decimally. Readability counts, and introducing a competing syntax will make it harder for others to read. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mikhailwas at gmail.com Wed Dec 7 21:07:54 2016 From: mikhailwas at gmail.com (Mikhail V) Date: Thu, 8 Dec 2016 03:07:54 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 8 December 2016 at 01:57, Nick Timkovich wrote: >> hex notation not so readable and anyway decimal is kind of standard way to >> represent numbers > > > Can you cite some examples of Unicode reference tables I can look up a > decimal number in? They seem rare; perhaps in a list as a secondary column, > but they're not organized/grouped decimally. Readability counts, and > introducing a competing syntax will make it harder for others to read. There were links to such table in previos discussion. Googling "unicode table decimal" and first link will it be. I think most online tables include decimals as well, usually as tuples of 8-bit decimals. Also earlier the decimal code was the first column in most tables, but it somehow settled in peoples' minds that hex reference should be preferred, for no solid reason IMO. One reason I think due to HTML standards which started to use it in html files long ago and had much influence later, but one should understand, that is just for brevity in most cases. Other reason is, file viewers show hex by default, but that is just misfortune, nothin besides brevity and 4-bit word alignment gives the hex notation unfortunatly, at least in its current typeface. This was discussed actually in that thread. Many people also think they are cool hackers if they make everything in hex :) In some cases it is worth it, but not this case IMO. Mainly for bitwise stuff, but then one should look into binary/trinary/quaternary representation depending on nature of operations and hardware. Yes there is unicode table pagination correspondence in hex reference, but that hardly plays any positive role for real applications, most of the time I need to look in my code and also perform number operations on *specific* ranges and codes, but not on whole pages of the table. This could only play role if I do low-level filtering of large files and want to filter out data after character's page, but that is the only positive thing I can think of, and I don't think it is directly for Python. Imagine some cryptography exercise - you take 27 units, you just give them numbers (0..26) and you do calculations, yes you can view results as hex numbers, but I don't do it and most people don't and should not, since why? It is ugly and not readable. From mikhailwas at gmail.com Wed Dec 7 21:15:06 2016 From: mikhailwas at gmail.com (Mikhail V) Date: Thu, 8 Dec 2016 03:15:06 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 8 December 2016 at 01:52, MRAB wrote: > On 2016-12-07 23:52, Mikhail V wrote: ... >> ========= >> Proposal: I would want to have a possibility to input it *by decimals*: >> >> s = "first cyrillic letters: \{1040}\{1041}\{1042}" >> or: >> s = "first cyrillic letters: \(1040)\(1041)\(1042)" >> >> ========= >> > It's usually the case that escapes are \ followed by an ASCII-range letter > or digit; \ followed by anything else makes it a literal, even if it's a > metacharacter, e.g. " terminates a string that starts with ", but \" is a > literal ", so I don't like \{...}. > > Perl doesn't have \u... 
or \U..., it has \x{...} instead, and Python already
> has \N{...}, so:
>
> s = "first cyrillic letters: \d{1040}\d{1041}\d{1042}"
>
> might be better,

I like this, and I agree it corresponds to the current style better.

> but I'm still -1 because hex is usual when referring to
> Unicode codepoints.

:-(

From boekewurm at gmail.com  Wed Dec  7 21:32:20 2016
From: boekewurm at gmail.com (Matthias welp)
Date: Thu, 8 Dec 2016 03:32:20 +0100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
	Proposal for default character representation)
In-Reply-To: 
References: 
Message-ID: 

Dear Mikhail,

With Python 3.6 you can use format strings to get very close to your
desired behaviour:

    f"{48:c}" == "0"
    f"{<number>:c}" == chr(<number>)

It works with variables too:

    charvalue = 48
    f"{charvalue:c}" == chr(charvalue)  # == "0"

This is only 1 character overhead + 1 character extra per char formatted
compared to your example. And as an extra you can use hex strings
(f"{0x30:c}" == "0") and any other integer literal you might want.

I don't see the added value of making character escapes in a non-default
way only (chars escaped + 1) bytes shorter, with the added maintenance
and development cost. I think that you can do a lot with f-strings, and
using the built-in formatting options you can already get the behaviour
you want in Python 3.6, months earlier than the next opportunity
(Python 3.7).

Check out the formatting options for integers and other built-in types
here:
https://docs.python.org/3.6/library/string.html#format-specification-mini-language

I hope this helps solve your apparent usability problem.

-Matthias

On 8 December 2016 at 03:07, Mikhail V wrote:
> On 8 December 2016 at 01:57, Nick Timkovich wrote:
>>> hex notation not so readable and anyway decimal is kind of standard way to
>>> represent numbers
>>
>> Can you cite some examples of Unicode reference tables I can look up a
>> decimal number in? They seem rare; perhaps in a list as a secondary column,
>> but they're not organized/grouped decimally. Readability counts, and
>> introducing a competing syntax will make it harder for others to read.
>
> There were links to such table in previos discussion. Googling
> "unicode table decimal" and
> first link will it be.
> I think most online tables include decimals as well, usually as tuples
> of 8-bit decimals.
> Also earlier the decimal code was the first column in most tables, but
> it somehow settled in
> peoples' minds that hex reference should be preferred, for no solid reason IMO.
> One reason I think due to HTML standards which started to use it in html files
> long ago and had much influence later, but one should understand,
> that is just for brevity in most cases. Other reason is, file viewers
> show hex by
> default, but that is just misfortune, nothin besides brevity and 4-bit
> word alignment
> gives the hex notation unfortunatly, at least in its current typeface.
> This was discussed actually in that thread.
> Many people also think they are cool hackers if they make everything in hex :)
> In some cases it is worth it, but not this case IMO. Mainly for
> bitwise stuff, but
> then one should look into binary/trinary/quaternary representation
> depending on nature
> of operations and hardware.
>
> Yes there is unicode table pagination correspondence in hex reference,
> but that hardly plays
> any positive role for real applications, most of the time I need to
> look in my code
> and also perform number operations on *specific* ranges and codes, but not
> on whole pages of the table.
This could only play role if I do > low-level filtering of large files > and want to filter out data after character's page, but that is the > only positive thing > I can think of, and I don't think it is directly for Python. > > Imagine some cryptography exercise - you take 27 units, you just give > them numbers (0..26) > and you do calculations, yes you can view results as hex numbers, but > I don't do it and most people > don't and should not, since why? It is ugly and not readable. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From alexander.belopolsky at gmail.com Wed Dec 7 21:36:48 2016 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Wed, 7 Dec 2016 21:36:48 -0500 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On Wed, Dec 7, 2016 at 9:07 PM, Mikhail V wrote: > > it somehow settled in > peoples' minds that hex reference should be preferred, for no solid reason IMO. I may be showing my age, but all the facts that I remember about ASCII codes are in hex: 1. SPACE is 0x20 followed by punctuation symbols. 2. Decimal digits start at 0x30 with '0' = 0x30, '1' = 0x31, ... 3. @ is 0x40 followed by upper-case letter: 'A' = 0x41, 'B' = 0x42, ... 4. Lower-case letters are offset by 0x20 from the uppercase ones: 'a' = 0x61, 'b' = 0x62, ... Unicode is also organized around hexadecimal codes with various scripts positioned in sections that start at round hexadecimal numbers. For example Cyrillic is at 0x0400 through 0x4FF < http://unicode.org/charts/PDF/U0400.pdf>. The only decimal fact I remember about Unicode is that the largest code-point is 1114111 - a palindrome! -------------- next part -------------- An HTML attachment was scrubbed... URL: From mikhailwas at gmail.com Wed Dec 7 22:06:06 2016 From: mikhailwas at gmail.com (Mikhail V) Date: Thu, 8 Dec 2016 04:06:06 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 8 December 2016 at 03:36, Alexander Belopolsky wrote: > > On Wed, Dec 7, 2016 at 9:07 PM, Mikhail V wrote: >> >> it somehow settled in >> peoples' minds that hex reference should be preferred, for no solid reason >> IMO. > > I may be showing my age, but all the facts that I remember about ASCII codes > are in hex: > > 1. SPACE is 0x20 followed by punctuation symbols. > 2. Decimal digits start at 0x30 with '0' = 0x30, '1' = 0x31, ... > 3. @ is 0x40 followed by upper-case letter: 'A' = 0x41, 'B' = 0x42, ... > 4. Lower-case letters are offset by 0x20 from the uppercase ones: 'a' = > 0x61, 'b' = 0x62, ... > > Unicode is also organized around hexadecimal codes with various scripts > positioned in sections that start at round hexadecimal numbers. For example > Cyrillic is at 0x0400 through 0x4FF > . > > The only decimal fact I remember about Unicode is that the largest > code-point is 1114111 - a palindrome! As an aside, I've just noticed that in my example: s = "first cyrillic letters: \{1040}\{1041}\{1042}" s = "first cyrillic letters: \u0410\u0411\u0412" the hex and decimal codes are made up of same digits, such a peculiar coincidence... 
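It is easy to check in the REPL (nothing used here beyond the builtin hex()):

>>> [(n, hex(n)) for n in (1040, 1041, 1042)]
[(1040, '0x410'), (1041, '0x411'), (1042, '0x412')]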
So you were catched up from the beginning with hex, as I see ;) I on the contrary in dark times of learning programming (that was C) always oriented myself on decimal codes and don't regret it now. From mikhailwas at gmail.com Wed Dec 7 22:45:51 2016 From: mikhailwas at gmail.com (Mikhail V) Date: Thu, 8 Dec 2016 04:45:51 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 8 December 2016 at 03:32, Matthias welp wrote: > Dear Mikhail, > > With python3.6 you can use format strings to get very close to your > desired behaviour: > > f"{48:c}" == "0" > f"{:c}" == chr() > > It works with variables too: > > charvalue = 48 > f"{charcvalue:c}" == chr(charvalue) # == "0" > Waaa! This works! > > I hope this helps solve your apparent usability problem. Big big thanks, I didn't now this feature, but I have googled alot about "input characters as decimals" , so it is just added? Another evidence that Python rules! I'll rewrite some code, hope it'll have no side issues. Mikhail From jcgoble3 at gmail.com Wed Dec 7 22:57:52 2016 From: jcgoble3 at gmail.com (Jonathan Goble) Date: Wed, 7 Dec 2016 22:57:52 -0500 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On Wed, Dec 7, 2016 at 10:45 PM, Mikhail V wrote: > Big big thanks, I didn't now this feature, but I have googled alot > about "input characters as decimals" , so it is just added? > Another evidence that Python rules! Yes, f-strings are a new feature in Python 3.6, which is currently in the release candidate stage. The final release of 3.6.0 (and thus the first stable release with this feature) is scheduled for December 16. From random832 at fastmail.com Wed Dec 7 23:39:42 2016 From: random832 at fastmail.com (Random832) Date: Wed, 07 Dec 2016 23:39:42 -0500 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: <1481171982.1720302.812217169.039D7550@webmail.messagingengine.com> On Wed, Dec 7, 2016, at 22:06, Mikhail V wrote: > So you were catched up from the beginning with hex, as I see ;) > I on the contrary in dark times of learning programming > (that was C) always oriented myself on decimal codes > and don't regret it now. C doesn't support decimal in string literals either, only octal and hex (incidentally octal seems to have been much more common in the environments where C was first invented). I can think of one context where decimal is used for characters, actually, now that I think about it. ANSI/ISO standards for 8-bit character sets often use a 'split' decimal format (i.e. DEL = 7/15 rather than 0x7F or 127.) From mikhailwas at gmail.com Thu Dec 8 00:06:38 2016 From: mikhailwas at gmail.com (Mikhail V) Date: Thu, 8 Dec 2016 06:06:38 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: <1481171982.1720302.812217169.039D7550@webmail.messagingengine.com> References: <1481171982.1720302.812217169.039D7550@webmail.messagingengine.com> Message-ID: On 8 December 2016 at 05:39, Random832 wrote: > On Wed, Dec 7, 2016, at 22:06, Mikhail V wrote: >> So you were catched up from the beginning with hex, as I see ;) >> I on the contrary in dark times of learning programming >> (that was C) always oriented myself on decimal codes >> and don't regret it now. 
> > C doesn't support decimal in string literals either, only octal and hex > (incidentally octal seems to have been much more common in the > environments where C was first invented). I can think of one context > where decimal is used for characters, actually, now that I think about > it. ANSI/ISO standards for 8-bit character sets often use a 'split' > decimal format (i.e. DEL = 7/15 rather than 0x7F or 127.) That is true, it does not support decimals in string literals, but I don't remember (it was more than 10 years ago) that I used anything but decimals for text processing in C. So normally load a file in memory, iterate over bytes, compare the value, and so on. And somewhat very foggy in my memory, but at that time most ASCII tables included decimals and they stood normally in the first column, but I can be wrong now, got to google some original tables. Jeez, how positive came this thread out, first Ethan said it will be never implemented, and it turns out it has already been implemented. Christmas magic. From greg.ewing at canterbury.ac.nz Thu Dec 8 00:52:21 2016 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 08 Dec 2016 18:52:21 +1300 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: <1481171982.1720302.812217169.039D7550@webmail.messagingengine.com> Message-ID: <5848F515.2080104@canterbury.ac.nz> Mikhail V wrote: > first Ethan said > it will be never implemented, and it turns out it has already > been implemented. Only by accident -- I don't think anyone anticipated that f-strings would be used that way! -- Greg From p.f.moore at gmail.com Thu Dec 8 04:00:55 2016 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 8 Dec 2016 09:00:55 +0000 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: On 7 December 2016 at 23:52, Mikhail V wrote: > Proposal: I would want to have a possibility to input it *by decimals*: > > s = "first cyrillic letters: \{1040}\{1041}\{1042}" > or: > s = "first cyrillic letters: \(1040)\(1041)\(1042)" > > ========= > > This is more compact and seems not very contradictive with > current Python escape characters in string literals. > So backslash is a start of some escaping in most cases. > > For me most important is that in such way I would avoid > any presence of hex numbers in strings, which I find very good > for readability and for me it is very convinient since I use decimals > for processing everywhere (and encourage everyone to do so). > > So this is my proposal, any comments on this are appreciated. -1. We already have plenty of ways to specify characters in strings[1], we don't need another. If readability is what matters to you, and you (unlike many others) consider hex to be unreadable, use the \N{...} form. Paul [1] Including (ab)using f-strings to hide the use of chr(). From victor.stinner at gmail.com Thu Dec 8 05:27:48 2016 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 8 Dec 2016 11:27:48 +0100 Subject: [Python-ideas] Input characters in strings by decimals (Was: Proposal for default character representation) In-Reply-To: References: Message-ID: FYI you can also get a character by its name: >>> import unicodedata >>> unicodedata.name(chr(1040)) 'CYRILLIC CAPITAL LETTER A' >>> "\N{CYRILLIC CAPITAL LETTER A}" '?' 
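And unicodedata.lookup() goes the other way, from the name back to the character:

>>> import unicodedata
>>> unicodedata.lookup('CYRILLIC CAPITAL LETTER A')
'А'
>>> unicodedata.lookup('CYRILLIC CAPITAL LETTER A') == chr(1040)
True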
Victor 2016-12-08 0:52 GMT+01:00 Mikhail V : > In past discussion about inputing and printing characters, > I was proposing decimal notation instead of hex. > Since the discussion was lost in off-topic talks, I'll try to > summarise my idea better. > > I use ASCII only for code input (there are good reasons for that). > Here I'll use Python 3.6, and Windows 7, so I can use print() with unicode > directly and it works now in system console. > > Suppose I only start programming and want to do some character manipulation. > The vey first thing I would probably start with is a simple output for > latin and cyrillic capital letters: > > caps_lat = "" > for o in range(65, 91): > caps_lat = caps_lat + chr(o) > print (caps_lat) > > caps_cyr = "" > for o in range(1040, 1072): > caps_cyr = caps_cyr + chr(o) > print (caps_cyr) > > > Which prints: > ABCDEFGHIJKLMNOPQRSTUVWXYZ > ???????????????????????????????? > > > Say, I want now to input something direct in code: > > s = "first cyrillic letters: " + chr(1040) + chr(1041) + chr(1042) > > Which works fine and has clean look. However it is not very convinient > because of much typing and also, if I generate such strings, > adds a bit more complexity. But in general it is fine, and I use this > method currently. > > ========= > Proposal: I would want to have a possibility to input it *by decimals*: > > s = "first cyrillic letters: \{1040}\{1041}\{1042}" > or: > s = "first cyrillic letters: \(1040)\(1041)\(1042)" > > ========= > > This is more compact and seems not very contradictive with > current Python escape characters in string literals. > So backslash is a start of some escaping in most cases. > > For me most important is that in such way I would avoid > any presence of hex numbers in strings, which I find very good > for readability and for me it is very convinient since I use decimals > for processing everywhere (and encourage everyone to do so). > > So this is my proposal, any comments on this are appreciated. > > > PS: > > Currently Python 3 supports these in addition to \x: > (from https://docs.python.org/3/howto/unicode.html) > """ > If you can?t enter a particular character in your editor or want to keep > the source code ASCII-only for some reason, you can also use escape > sequences in string literals. > >>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name >>>> "\u0394" # Using a 16-bit hex value >>>> "\U00000394" # Using a 32-bit hex value > > """ > So I have many possibilities and all of them strangely contradicts with > my image of intuitive and readable. Well, using charater name is readable, > but seriously not much of a practical solution for input, but could be > very useful > for printing description of a character. 
>
>
> Mikhail
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/

From abrault at mapgears.com  Thu Dec  8 09:46:36 2016
From: abrault at mapgears.com (Alexandre Brault)
Date: Thu, 8 Dec 2016 09:46:36 -0500
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: 
Message-ID: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>

On 2016-12-07 09:07 PM, Mikhail V wrote:
> On 8 December 2016 at 01:57, Nick Timkovich wrote:
>>> hex notation not so readable and anyway decimal is kind of standard way to
>>> represent numbers
>>
>> Can you cite some examples of Unicode reference tables I can look up a
>> decimal number in? They seem rare; perhaps in a list as a secondary column,
>> but they're not organized/grouped decimally. Readability counts, and
>> introducing a competing syntax will make it harder for others to read.
> There were links to such a table in the previous discussion. Googling
> "unicode table decimal" and
> the first link will be it.
> I think most online tables include decimals as well, usually as tuples
> of 8-bit decimals.
The fact that you need to specify "unicode table *decimal*" in your
search, and that even then around half of the top results give the table
in hex, to me illustrates quite well how much of a minority opinion
"writing unicode characters in decimal is more logical" is

From mikhailwas at gmail.com  Thu Dec  8 11:06:39 2016
From: mikhailwas at gmail.com (Mikhail V)
Date: Thu, 8 Dec 2016 17:06:39 +0100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: 

On 8 December 2016 at 15:46, Alexandre Brault wrote:
>>> Can you cite some examples of Unicode reference tables I can look up a
>>> decimal number in? They seem rare; perhaps in a list as a secondary column,
>>> but they're not organized/grouped decimally. Readability counts, and
>>> introducing a competing syntax will make it harder for others to read.
>> There were links to such a table in the previous discussion. Googling
>> "unicode table decimal" and
>> the first link will be it.
>> I think most online tables include decimals as well, usually as tuples
>> of 8-bit decimals.
> The fact that you need to specify "unicode table *decimal*" in your
> search, and that even then around half of the top results give the table
> in hex, to me illustrates quite well how much of a minority opinion
> "writing unicode characters in decimal is more logical" is

No, I don't need to specify "unicode table *decimal*".

Results for "unicode table" in google:

Top Result # 2:
www.utf8-chartable.de/

Top Result # 4:
http://www.tamasoft.co.jp/en/general-info/index.html

Some sites do not provide any code conversion, but everybody can
do it easily; also, I don't have problems generating a table programmatically.
And I hope it is clear why most people stick to hex (I never argued that BTW),
but it is mostly historical, nothing to do with "logical". There is
just a tendency
to repeat what the majority does, and that is not always good; this case
would be an example.
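For example, a rough sketch of the kind of table generation I mean
(untested, just to illustrate; the helper name and column widths are
arbitrary):

def print_decimal_table(start, stop, per_row=20):
    for row_start in range(start, stop, per_row):
        row = range(row_start, min(row_start + per_row, stop))
        print("  ".join("%6d %s" % (o, chr(o)) for o in row))

print_decimal_table(1040, 1104)   # Cyrillic letters with decimal codes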
From vgr255 at live.ca  Thu Dec  8 11:32:02 2016
From: vgr255 at live.ca (Emanuel Barry)
Date: Thu, 8 Dec 2016 16:32:02 +0000
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: 

> From: Mikhail V
> Sent: Thursday, December 08, 2016 11:07 AM
> Subject: Re: [Python-ideas] Input characters in strings by decimals (Was:
> Proposal for default character representation)

> No, I don't need to specify "unicode table *decimal*".
>
> Results for "unicode table" in google:
>
> Top Result # 2:
> www.utf8-chartable.de/
>
> Top Result # 4:
> http://www.tamasoft.co.jp/en/general-info/index.html

Except that both of these websites show you hexadecimal notation.

> And I hope it is clear why most people stick to hex (I never argued that BTW),
> but it is mostly historical, nothing to do with "logical".

That's not true. Characters are sorted by ranges. For example, I know that
everything below 0x20 is control code, uppercase ASCII letters start at
0x41 (0x40 is '@') and lowercase ASCII letters start at 0x61 (where 0x60 is
'`') - trivial to remember. I also know that ASCII goes as high as half the
byte range, or 0x7f (half of 0x100). For instance, the first letter of my
name is 0xc9, and anyone can know, at a glance and without knowing my name
or what the letter is, that it's not ASCII.

Also, as far as I know, lowercase letters (ASCII or not) begin some
multiple of 0x10 after the beginning of the uppercase letters (0x20 for
ASCII or latin-1). As such, since I know that 'É' is 0xc9, I can know,
without even looking, that 0xe9 is 'é'. That would be a lot trickier in
decimal to remember and get right.

As an aside, and I don't know this by heart, various sets of characters
begin at fixed points, and knowing those points (when you need to work with
specific sets of characters) can be very useful. If you look at a website
(https://unicode-table.com/ seems good), you can even select ranges of
characters, which conveniently end up being multiples of 0x10 (or 16 in
decimal). If your point is "it's easier to work with numbers ending with
0", then you'll be pleased to know that character sets are actually
designed so that, using hexadecimal notation, you're dealing with numbers
ending with 0! Doing this using decimal notation is clunky at best.

Yours,
\xc9manuel

From rosuav at gmail.com  Thu Dec  8 11:52:23 2016
From: rosuav at gmail.com (Chris Angelico)
Date: Fri, 9 Dec 2016 03:52:23 +1100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: 

On Fri, Dec 9, 2016 at 3:06 AM, Mikhail V wrote:
> Results for "unicode table" in google:
>
> Top Result # 2:
> www.utf8-chartable.de/
>
> Top Result # 4:
> http://www.tamasoft.co.jp/en/general-info/index.html

Both of those show hex first, and decimal as an additional feature.

> Some sites do not provide any code conversion, but everybody can
> do it easily; also, I don't have problems generating a table programmatically.
> And I hope it is clear why most people stick to hex (I never argued that BTW),
> but it is mostly historical, nothing to do with "logical". There is
> just a tendency
> to repeat what the majority does, and that is not always good; this case
> would be an example.

In the first place, many people have pointed out to you that Unicode
*is* laid out best in hexadecimal.
(Another example: umop apisdn ?! are ¿¡, which are ?! with one high
bit set.) But in the second place, "what the majority does" actually
IS a strong argument. It's called consistency. Why is "\r" a carriage
return? Wouldn't it be more logical to use "\c" for that? Except that
EVERYONE uses \r for it. And the one time in my life that I found
"\123" to mean "{" rather than "S", it was a great frustration for me:

http://rosuav.blogspot.com.au/2012/12/i-want-my-octal.html

And that's the choice between decimal and *octal*, which is a far
less well known base than hex is. I would still prefer octal, because
it's consistent.

So because of consistency, Python needs to support "\u0303" to mean
COMBINING TILDE, and any competing notation has to be in addition to
that. Can you justify the confusion of sometimes working with hex and
sometimes decimal? It's a pretty high bar to attain. You have to show
that decimal isn't just marginally better than hex; you have to show
that there are situations where the value of decimal character
literals is so great that it's worth forcing everyone to learn two
systems. And I'm not convinced you've even hit the first point.

ChrisA

From random832 at fastmail.com  Thu Dec  8 12:29:23 2016
From: random832 at fastmail.com (Random832)
Date: Thu, 08 Dec 2016 12:29:23 -0500
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: <1481218163.888360.812826041.294747B2@webmail.messagingengine.com>

On Thu, Dec 8, 2016, at 11:06, Mikhail V wrote:
> Some sites do not provide any code conversion, but everybody can
> do it easily; also, I don't have problems generating a table
> programmatically.
> And I hope it is clear why most people stick to hex (I never argued that
> BTW), but it is mostly historical, nothing to do with "logical".

The problem is that there's a logic associated with how the character
sets are designed. The character table works a lot better with rows of
16 than with rows of 10 or 20. In many blocks you get the uppercase
letters lined up above the lowercase letters, for example. And if your
rows are 16 (or 32, though that doesn't work as well for unicode
because e.g. the Cyrillic basic set А-Я/а-я starts from 0x410), then
your row and column labels work better in hex because you've lined up
0x40 above 0x50 and 0x60, which share the last digit, unlike 64/80/96,
and the whole row (or half the row for 32) shares all but the last
digit. And those values are also only off by one bit, too.

Even if we were to arrange the characters themselves in rows of 10/20,
so you've got 30 or 40 characters in an "alphabet row", then you'd have
to add or subtract to change the case, whereas many early character
sets were designed to be able to do this by changing a bit, for
bit-paired keyboards.

What looks better?

Hex:
АБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдежзийклмноп
рстуфхцчшщъыьэюя

Decimal:
АБВГДЕЖЗИЙКЛМНОПРСТУ
ФХЦЧШЩЪЫЬЭЮЯабвгдежз
ийклмнопрстуфхцчшщъы
ьэюя

And it's only luck that the uppercase Russian alphabet starts at the
beginning of a line. The ASCII section with the English alphabet looks
like this in decimal:

<=>?@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_`abc
defghijklmnopqrstuvw
xyz

compared to this in hex:

@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz

> There is just a tendency
> to repeat what the majority does, and that is not always good; this case
> would be an example.
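(For anyone who wants to experiment with other row widths, the layouts
above can be regenerated with a few lines; a quick sketch:)

def rows(start, stop, width):
    for r in range(start, stop, width):
        print("".join(chr(o) for o in range(r, min(r + width, stop))))

rows(0x410, 0x450, 16)   # the hex-friendly layout, rows of 16
rows(1040, 1104, 20)     # the decimal-friendly layout, rows of 20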
From mertz at gnosis.cx  Thu Dec  8 12:35:32 2016
From: mertz at gnosis.cx (David Mertz)
Date: Thu, 8 Dec 2016 09:35:32 -0800
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: 
Message-ID: 

The Unicode Consortium reference entirely lacks decimal values in all their
tables. EVERYTHING is given solely in hex. I'm sure someone somewhere had
created a table with decimal values, but it's very rare.

We should not change Python syntax because exactly one user prefers decimal
representations. At most there can be an external library to cover strings
in whatever manner he wants.

Why is octal being neglected for us old fogeys?! ?

On Dec 7, 2016 6:11 PM, "Mikhail V" wrote:

> On 8 December 2016 at 01:57, Nick Timkovich
> wrote:
> >> hex notation not so readable and anyway decimal is kind of standard way
> to
> >> represent numbers
> >
> >
> > Can you cite some examples of Unicode reference tables I can look up a
> > decimal number in? They seem rare; perhaps in a list as a secondary
> column,
> > but they're not organized/grouped decimally. Readability counts, and
> > introducing a competing syntax will make it harder for others to read.
>
> There were links to such a table in the previous discussion. Googling
> "unicode table decimal" and
> the first link will be it.
> I think most online tables include decimals as well, usually as tuples
> of 8-bit decimals.
> Also earlier the decimal code was the first column in most tables, but
> it somehow settled in
> people's minds that hex reference should be preferred, for no solid reason
> IMO.
> One reason I think is due to HTML standards which started to use it in
> html files
> long ago and had much influence later, but one should understand,
> that is just for brevity in most cases. Another reason is, file viewers
> show hex by
> default, but that is just misfortune; nothing besides brevity and 4-bit
> word alignment
> gives the hex notation unfortunately, at least in its current typeface.
> This was discussed actually in that thread.
> Many people also think they are cool hackers if they make everything in
> hex :)
> In some cases it is worth it, but not this case IMO. Mainly for
> bitwise stuff, but
> then one should look into binary/trinary/quaternary representation
> depending on nature
> of operations and hardware.
>
> Yes there is unicode table pagination correspondence in hex reference,
> but that hardly plays
> any positive role for real applications, most of the time I need to
> look in my code
> and also perform number operations on *specific* ranges and codes, but not
> on whole pages of the table. This could only play role if I do
> low-level filtering of large files
> and want to filter out data after character's page, but that is the
> only positive thing
> I can think of, and I don't think it is directly for Python.
>
> Imagine some cryptography exercise - you take 27 units, you just give
> them numbers (0..26)
> and you do calculations, yes you can view results as hex numbers, but
> I don't do it and most people
> don't and should not, since why? It is ugly and not readable.
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mikhailwas at gmail.com  Thu Dec  8 13:37:12 2016
From: mikhailwas at gmail.com (Mikhail V)
Date: Thu, 8 Dec 2016 19:37:12 +0100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: 

On 8 December 2016 at 17:52, Chris Angelico wrote:

> In the first place, many people have pointed out to you that Unicode
> *is* laid out best in hexadecimal.

Ok, if it is aligned intentionally on a binary grid, obviously hex numbers
will show some patterns, but who argues? And to be fair, from my examples
for Cyrillic:

Range start points in hex vs decimal:
capitals:  U+0410  #1040
lowercase: U+0430  #1072

So I need one number, 1040, to remember; then if I know that there are 32
letters (except Ё) I just sum 1040 + 32 and get 1072, and this will be
the beginning of the lowercase range. There are of course people who can
efficiently add and subtract in their head with hex, but I am not one of
them (guess who is in the minority here), and there is no need to do it
in this case. So if I know the distances between ranges I can do it all
much more easily in my head. Not a strong argument?

To be more pedantic: if you know the fact that the Russian alphabet has
exactly 33 letters, and not 32 as one could suggest from the unicode
table, you could have noticed also that:

letter Ё is U+0401, and ё is U+0451

This means they are torn away from the other letters and do not even lie
in the range. In practice, this means that if I want to filter against
code ranges, I need to additionally check the values U+0451 and U+0401.
Is it not because someone decided to align the alphabet in such a way?
Alignment is not a bad idea, but it should not contradict common sense.

> You have to show
> that decimal isn't just marginally better than hex; you have to show
> that there are situations where the value of decimal character
> literals is so great that it's worth forcing everyone to learn two
> systems. And I'm not convinced you've even hit the first point.

Frankly I don't fully understand your point here. Everyone knows decimal,
the address of an element in a table is a number, and in most cases I
don't need to learn it by heart, since it is already known and written
in some table on your PC. Also, inputting characters by decimal is a very
common thing: alternate key combos (Alt+0192) are very well established,
and many people *do* learn decimal code points by heart, including me.
So now it is you who wants me to learn two numbering systems for no
reason.

And even with all that said, it is not the strongest argument. Most
important is that hex notation is an ugly circumstance, and in this case
there is too little reason to introduce it in an algorithm which just
checks ranges and specific values. And for *specific single* values it is
absolutely irrelevant which alignment you have. You just choose what is
better readable and/or common for abstract numbers. But that is another
big question, and the current hex notation does not fall into the
category "better readable" anyway.
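A sketch of the kind of range filtering I mean, with the two stray code
points handled explicitly (untested; the helper name is just for
illustration):

def is_russian_letter(ch):
    o = ord(ch)
    # А..я is one contiguous decimal range; Ё and ё sit outside it.
    return 1040 <= o <= 1103 or o in (1025, 1105)

print([c for c in "Ёлка" if is_russian_letter(c)])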
Mikhail

From rosuav at gmail.com  Thu Dec  8 13:45:37 2016
From: rosuav at gmail.com (Chris Angelico)
Date: Fri, 9 Dec 2016 05:45:37 +1100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: 

On Fri, Dec 9, 2016 at 5:37 AM, Mikhail V wrote:
>> You have to show
>> that decimal isn't just marginally better than hex; you have to show
>> that there are situations where the value of decimal character
>> literals is so great that it's worth forcing everyone to learn two
>> systems. And I'm not convinced you've even hit the first point.
>
> Frankly I don't fully understand your point here.

Let me clarify. When you construct a string, you can already use
escapes to represent characters:

"n\u0303" --> n followed by combining tilde

In order to be consistent with other languages, Python *has* to
support hexadecimal. Plus, Python has _already_ supported hex for some
time. To establish decimal as an alternative, you have to demonstrate
that it is worth having ANOTHER way to do this.

With completely green-field topics, you can debate the merits of one
notation against another, and the overall best one will win. But when
there's a well-established existing notation, you have to justify the
proliferation of notations. You have to show that your new format is
*so much* better than the existing one that it's worth adding it in
parallel. That's quite a high bar - not impossible, obviously, but you
need some very strong justification. At the moment, you're showing
minor advantages to decimal, and other people are showing minor
advantages to hex; but IMO nothing yet has been strong enough to
justify the implementation of a completely new way to do things -
remember, people have to understand *both* in order to read code.

ChrisA

From mikhailwas at gmail.com  Thu Dec  8 14:50:49 2016
From: mikhailwas at gmail.com (Mikhail V)
Date: Thu, 8 Dec 2016 20:50:49 +0100
Subject: [Python-ideas] Input characters in strings by decimals (Was:
 Proposal for default character representation)
In-Reply-To: 
References: <2d86531d-9297-a55a-f24f-cb111a153bf6@mapgears.com>
Message-ID: 

On 8 December 2016 at 19:45, Chris Angelico wrote:

> At the moment, you're showing
> minor advantages to decimal, and other people are showing minor
> advantages to hex; but IMO nothing yet has been strong enough to
> justify the implementation of a completely new way to do things -
> remember, people have to understand *both* in order to read code.

If the arguments in the last post are not strong enough, I think it will
be too hard to make them any stronger. In my eyes the benefits in this
case clearly outweigh the downsides. And anyway, since I can use an
f-string now to input it, probably one can just relax now. And this:

f"{65:c}{66:c}{67:c}" ,

actually looks significantly better than:

"\d{65}\d{66}\d{67}",

And it covers the cases I was addressing with the proposal.
I am happy. +1000 to developers, even if this is an "accidental" feature.
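(A one-line check that the f-string spelling agrees with the other
forms, for anyone curious:)

>>> f"{65:c}{66:c}{67:c}" == "ABC" == "\x41\x42\x43" == chr(65) + chr(66) + chr(67)
True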
From chris.barker at noaa.gov  Sun Dec 11 02:22:42 2016
From: chris.barker at noaa.gov (Chris Barker)
Date: Sat, 10 Dec 2016 23:22:42 -0800
Subject: [Python-ideas] Better error messages [was: (no subject)]
In-Reply-To: 
References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp>
 <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp>
Message-ID: 

On Sun, Dec 4, 2016 at 6:35 PM, Chris Angelico wrote:

> I have no specific qualifications, but I teach online;

nor do I, and I teach for a continuing ed program -- not in high school.

But anyway, regardless of official qualifications, good programmers are
not necessarily good teachers of programming. At all.

If it takes a credentialed teacher to get a job in a school, so
> be it - but at least make sure it's someone who knows how to interpret
> the error messages, so that any student who runs into trouble can ask
> the prof.
>

Exactly -- you can't be credentialed to teach Biology, or French, or....
without knowing the subject. That may not yet be true for computer science,
as it is still "new" in high school curricula, but it's still not Python's
job to overcome that.

All that being said -- I don't think we should try to tailor error messages
specifically for newbies in the core interpreter, and the error messages
have gotten a lot better with py3, but they could still use some
improvement -- I would say that suggestions are welcome.

And if they can be made (more) machine readable, so that a beginner's IDE
could enhance them, that would be great.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From skreft at gmail.com  Sun Dec 11 04:09:36 2016
From: skreft at gmail.com (Sebastian Kreft)
Date: Sun, 11 Dec 2016 20:09:36 +1100
Subject: [Python-ideas] Better error messages [was: (no subject)]
In-Reply-To: 
References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp>
 <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp>
Message-ID: 

Note that there is a draft pep https://www.python.org/dev/peps/pep-0473/
that aims at adding structured data to builtin exceptions.

I've tried implementing some of those but had a couple of test failures
that weren't obvious to me how to solve.

On Dec 11, 2016 13:11, "Chris Barker" wrote:

> On Sun, Dec 4, 2016 at 6:35 PM, Chris Angelico wrote:
>
>> I have no specific qualifications, but I teach online;
>
> nor do I, and I teach for a continuing ed program -- not in high school.
>
> But anyway, regardless of official qualifications, good programmers are
> not necessarily good teachers of programming. At all.
>
> If it takes a credentialed teacher to get a job in a school, so
>> be it - but at least make sure it's someone who knows how to interpret
>> the error messages, so that any student who runs into trouble can ask
>> the prof.
>>
>
> Exactly -- you can't be credentialed to teach Biology, or French, or....
> without knowing the subject. That may not yet be true for computer science,
> as it is still "new" in high school curricula, but it's still not Python's
> job to overcome that.
>
> All that being said -- I don't think we should try to tailor error messages
> specifically for newbies in the core interpreter, and the error messages
> have gotten a lot better with py3, but they could still use some
> improvement -- I would say that suggestions are welcome.
>
> And if they can be made (more) machine readable, so that a beginner's IDE
> could enhance them, that would be great.
>
> -CHB
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ncoghlan at gmail.com  Sun Dec 11 08:48:04 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 11 Dec 2016 23:48:04 +1000
Subject: [Python-ideas] Better error messages [was: (no subject)]
In-Reply-To: 
References: <22590.13856.162202.818428@turnbull.sk.tsukuba.ac.jp>
 <22596.50591.129903.980234@turnbull.sk.tsukuba.ac.jp>
Message-ID: 

On 11 December 2016 at 19:09, Sebastian Kreft wrote:
> Note that there is a draft pep https://www.python.org/dev/peps/pep-0473/
> that aims at adding structured data to builtin exceptions.
>
> I've tried implementing some of those but had a couple of test failures that
> weren't obvious to me how to solve.

If you haven't already, note that it's OK to post proposed patches to
the tracker even when they're still causing test failures - just note
that you know the patch is incomplete, and explain the errors that
you're seeing.

Core developers will often be able to spot relevant problems through
code review, and we're also pretty practiced at interpreting the
sometimes cryptic failures that the test suite can emit.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From steve at pearwood.info  Mon Dec 12 18:45:03 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Tue, 13 Dec 2016 10:45:03 +1100
Subject: [Python-ideas] Enhancing vars()
Message-ID: <20161212234502.GA3365@ando.pearwood.info>

In general, directly accessing dunders is a bit of a code smell. (I
exclude writing dunder methods in your classes, of course.) There's
usually a built-in or similar to do the job for you, e.g. instead of
iterator.__next__() we should use next(iterator).

One of the lesser-known ones is vars(obj), which should be used in place
of obj.__dict__.

Unfortunately, vars() is less useful than it might be, since not all
objects have a __dict__. Some objects have __slots__ instead, or even
both. That is considered an implementation detail of the object.

Proposal: enhance vars() to return a proxy to the object namespace,
regardless of whether said namespace is __dict__ itself, or a number of
__slots__, or both. Here is a woefully incomplete and untested prototype:

class VarsProxy(object):
    def __init__(self, obj):
        if not (hasattr(obj, '__dict__') or hasattr(obj, '__slots__')):
            raise TypeError('object has no namespace')
        self._obj = obj

    def __getitem__(self, key):
        slots = getattr(type(self._obj), '__slots__', None)
        # see inspect.getattr_static for a more correct implementation
        if slots is not None and key in slots:
            # return the content of the slot, without any inheritance.
            return getattr(self._obj, key)
        else:
            return self._obj.__dict__[key]

    def __setitem__(self, key, value): ...
    def __delitem__(self, key): ...

One complication: it is possible for the slot and the __dict__ to
both contain the key. In 3.5 that ambiguity is resolved in favour of the
slot:

py> class X:
...     __slots__ = ['spam', '__dict__']
...     def __init__(self):
...         self.spam = 'slot'
...         self.__dict__['spam'] = 'dict'
...
py> x = X()
py> x.spam
'slot'

Although __slots__ are uncommon, this would clearly distinguish
vars(obj) from obj.__dict__ and strongly encourage the use of vars()
over direct access to the dunder attribute.

Thoughts?

-- 
Steve

From tjreedy at udel.edu  Mon Dec 12 21:21:02 2016
From: tjreedy at udel.edu (Terry Reedy)
Date: Mon, 12 Dec 2016 21:21:02 -0500
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: <20161212234502.GA3365@ando.pearwood.info>
References: <20161212234502.GA3365@ando.pearwood.info>
Message-ID: 

On 12/12/2016 6:45 PM, Steven D'Aprano wrote:
> In general, directly accessing dunders is a bit of a code smell. (I
> exclude writing dunder methods in your classes, of course.) There's
> usually a built-in or similar to do the job for you, e.g. instead of
> iterator.__next__() we should use next(iterator).
>
> One of the lesser-known ones is vars(obj), which should be used in place
> of obj.__dict__.
>
> Unfortunately, vars() is less useful than it might be, since not all
> objects have a __dict__. Some objects have __slots__ instead, or even
> both. That is considered an implementation detail of the object.
>
> Proposal: enhance vars() to return a proxy to the object namespace,
> regardless of whether said namespace is __dict__ itself, or a number of
> __slots__, or both. Here is a woefully incomplete and untested prototype:

+1 I believe this was mentioned as a possibility on some issue, but I
cannot find it. Does vars currently work for things with dict proxies
instead of dicts?

> class VarsProxy(object):
>     def __init__(self, obj):
>         if not (hasattr(obj, '__dict__') or hasattr(obj, '__slots__')):
>             raise TypeError('object has no namespace')
>         self._obj = obj
>
>     def __getitem__(self, key):
>         slots = getattr(type(self._obj), '__slots__', None)
>         # see inspect.getattr_static for a more correct implementation
>         if slots is not None and key in slots:
>             # return the content of the slot, without any inheritance.
>             return getattr(self._obj, key)
>         else:
>             return self._obj.__dict__[key]
>
>     def __setitem__(self, key, value): ...
>     def __delitem__(self, key): ...
>
>
> One complication: it is possible for the slot and the __dict__ to
> both contain the key. In 3.5 that ambiguity is resolved in favour of the
> slot:
>
> py> class X:
> ...     __slots__ = ['spam', '__dict__']
> ...     def __init__(self):
> ...         self.spam = 'slot'
> ...         self.__dict__['spam'] = 'dict'
> ...
> py> x = X()
> py> x.spam
> 'slot'
>
> Although __slots__ are uncommon, this would clearly distinguish
> vars(obj) from obj.__dict__ and strongly encourage the use of vars()
> over direct access to the dunder attribute.
>
> Thoughts?

-- 
Terry Jan Reedy

From ethan at stoneleaf.us  Mon Dec 12 22:35:11 2016
From: ethan at stoneleaf.us (Ethan Furman)
Date: Mon, 12 Dec 2016 19:35:11 -0800
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: <20161212234502.GA3365@ando.pearwood.info>
References: <20161212234502.GA3365@ando.pearwood.info>
Message-ID: <584F6C6F.9000201@stoneleaf.us>

On 12/12/2016 03:45 PM, Steven D'Aprano wrote:

> Proposal: enhance vars() to return a proxy to the object namespace,
> regardless of whether said namespace is __dict__ itself, or a number of
> __slots__, or both.
+1 -- ~Ethan~ From alexander.belopolsky at gmail.com Mon Dec 12 22:45:39 2016 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Mon, 12 Dec 2016 22:45:39 -0500 Subject: [Python-ideas] Enhancing vars() In-Reply-To: <20161212234502.GA3365@ando.pearwood.info> References: <20161212234502.GA3365@ando.pearwood.info> Message-ID: On Mon, Dec 12, 2016 at 6:45 PM, Steven D'Aprano wrote: > Proposal: enhance vars() to return a proxy to the object namespace, > regardless of whether said namespace is __dict__ itself, or a number of > __slots__, or both. > How do you propose dealing with classes defined in C? Their objects don't have __slots__. One possibility is to use __dir__ or dir(), but those can return anything and in the past developers were encouraged to put only "useful" attributes in __dir__. -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve.dower at python.org Mon Dec 12 23:02:28 2016 From: steve.dower at python.org (Steve Dower) Date: Mon, 12 Dec 2016 20:02:28 -0800 Subject: [Python-ideas] Enhancing vars() In-Reply-To: References: <20161212234502.GA3365@ando.pearwood.info> Message-ID: I'm +1. This bites me far too often. > in the past developers were encouraged to put only "useful" attributes in __dir__. Good. If I'm getting vars() I really only want the useful ones. If I need interesting/secret ones then I'll getattr for them. Cheers, Steve Top-posted from my Windows Phone -----Original Message----- From: "Alexander Belopolsky" Sent: ?12/?12/?2016 19:47 To: "Steven D'Aprano" Cc: "python-ideas" Subject: Re: [Python-ideas] Enhancing vars() On Mon, Dec 12, 2016 at 6:45 PM, Steven D'Aprano wrote: Proposal: enhance vars() to return a proxy to the object namespace, regardless of whether said namespace is __dict__ itself, or a number of __slots__, or both. How do you propose dealing with classes defined in C? Their objects don't have __slots__. One possibility is to use __dir__ or dir(), but those can return anything and in the past developers were encouraged to put only "useful" attributes in __dir__. -------------- next part -------------- An HTML attachment was scrubbed... URL: From marco.buttu at gmail.com Tue Dec 13 04:29:38 2016 From: marco.buttu at gmail.com (Marco Buttu) Date: Tue, 13 Dec 2016 10:29:38 +0100 Subject: [Python-ideas] Enhancing vars() In-Reply-To: <20161212234502.GA3365@ando.pearwood.info> References: <20161212234502.GA3365@ando.pearwood.info> Message-ID: <584FBF82.4080906@oa-cagliari.inaf.it> On 13/12/2016 00:45, Steven D'Aprano wrote: > In general, directly accessing dunders is a bit of a code smell. (I > exclude writing dunder methods in your classes, of course.) There's > usually a built-in or similar to do the job for you, e.g. instead of > iterator.__next__() we should use next(iterator). > > One of the lesser-known ones is vars(obj), which should be used in place > of obj.__dict__. [...] > Proposal: enhance vars() to return a proxy to the object namespace, > regardless of whether said namespace is __dict__ itself, or a number of > __slots__, or both. +1. Would it be possible in the future (Py4?) to change the name `vars` to a more meaningful name? Maybe `namespace`, or something more appropriate. -- Marco Buttu INAF-Osservatorio Astronomico di Cagliari Via della Scienza n. 
5, 09047 Selargius (CA)
Phone: 070 711 80 217
Email: mbuttu at oa-cagliari.inaf.it

From steve at pearwood.info  Tue Dec 13 05:02:23 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Tue, 13 Dec 2016 21:02:23 +1100
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: <584FBF82.4080906@oa-cagliari.inaf.it>
References: <20161212234502.GA3365@ando.pearwood.info>
 <584FBF82.4080906@oa-cagliari.inaf.it>
Message-ID: <20161213100223.GC3365@ando.pearwood.info>

On Tue, Dec 13, 2016 at 10:29:38AM +0100, Marco Buttu wrote:

> +1. Would it be possible in the future (Py4?) to change the name `vars`
> to a more meaningful name? Maybe `namespace`, or something more appropriate.

I'm not keen on the name vars() either, but it does make a certain
sense: short for "variables", where "variable" here refers to attributes
of an instance rather than local or global variables.

I'm not sure that namespace is a better name: namespace, it seems to
me, is likely to be used as the name of the target:

    namespace = vars(obj)

But if there is a lot of popular demand for a name change, then I
suppose it could happen. Ask again around Python 3.9 :-)

-- 
Steve

From ncoghlan at gmail.com  Tue Dec 13 06:28:03 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 13 Dec 2016 21:28:03 +1000
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: <20161213100223.GC3365@ando.pearwood.info>
References: <20161212234502.GA3365@ando.pearwood.info>
 <584FBF82.4080906@oa-cagliari.inaf.it>
 <20161213100223.GC3365@ando.pearwood.info>
Message-ID: 

On 13 December 2016 at 20:02, Steven D'Aprano wrote:
> On Tue, Dec 13, 2016 at 10:29:38AM +0100, Marco Buttu wrote:
>
>> +1. Would it be possible in the future (Py4?) to change the name `vars`
>> to a more meaningful name? Maybe `namespace`, or something more appropriate.
>
> I'm not keen on the name vars() either, but it does make a certain
> sense: short for "variables", where "variable" here refers to attributes
> of an instance rather than local or global variables.

It also refers to local and global variables, as vars() is effectively
an alias for locals() if you don't pass an argument, and locals() is
effectively an alias for globals() at module level:

    >>> locals() is globals()
    True
    >>> vars() is globals()
    True
    >>> def f(): return vars() is locals()
    ...
    >>> f()
    True

To be honest, rather than an enhanced vars(), I'd prefer to see a
couple more alternate dict constructors:

    dict.fromattrs(obj, attributes)
    dict.fromitems(obj, keys)

(With the lack of an underscore being due to the precedent set by
dict.fromkeys())

Armed with those, the "give me all the attributes from __dir__"
command would be:

    attrs = dict.from_attrs(obj, dir(obj))

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From p.f.moore at gmail.com  Tue Dec 13 06:53:44 2016
From: p.f.moore at gmail.com (Paul Moore)
Date: Tue, 13 Dec 2016 11:53:44 +0000
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: 
References: <20161212234502.GA3365@ando.pearwood.info>
 <584FBF82.4080906@oa-cagliari.inaf.it>
 <20161213100223.GC3365@ando.pearwood.info>
Message-ID: 

On 13 December 2016 at 11:28, Nick Coghlan wrote:
> Armed with those, the "give me all the attributes from __dir__"
> command would be:
>
>     attrs = dict.from_attrs(obj, dir(obj))

Which of course can already be spelled as

    attrs = { attr: getattr(obj, attr) for attr in dir(obj) }

There's obviously a speed-up from avoiding repeated getattr calls, but
is speed the key here?
The advantage of an "enhanced vars" is more likely to be ease of discoverability, and I'm not sure dict.fromattrs gives us that benefit. Also, the dict constructor gives a *copy* of the namespace, where the proposal was for the proxy returned by vars() to provide update capability (if I understand the proposal correctly). Paul From turnbull.stephen.fw at u.tsukuba.ac.jp Tue Dec 13 06:56:26 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Tue, 13 Dec 2016 20:56:26 +0900 Subject: [Python-ideas] Enhancing vars() In-Reply-To: References: <20161212234502.GA3365@ando.pearwood.info> <584FBF82.4080906@oa-cagliari.inaf.it> <20161213100223.GC3365@ando.pearwood.info> Message-ID: <22607.57834.626099.299780@turnbull.sk.tsukuba.ac.jp> Nick Coghlan writes: > (With the lack of an underscore being due to the precedent set by > dict.fromkeys()) > > Armed with those, the "give me all the attributes from __dir__" > command would be: > > attrs = dict.from_attrs(obj, dir(obj)) A Urk --------------------+ You sure you want to follow precedent? My fingers really like that typo, too! From matt at getpattern.com Tue Dec 13 13:33:01 2016 From: matt at getpattern.com (Matt Gilson) Date: Tue, 13 Dec 2016 10:33:01 -0800 Subject: [Python-ideas] Enhancing vars() In-Reply-To: References: <20161212234502.GA3365@ando.pearwood.info> <584FBF82.4080906@oa-cagliari.inaf.it> <20161213100223.GC3365@ando.pearwood.info> Message-ID: > It also refers to local and global variables, as vars() is effectively > an alias for locals() if you don't pass an argument, and locals() is > effectively an alias for globals() at module level: > > to sign up! -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Tue Dec 13 17:58:16 2016 From: storchaka at gmail.com (Serhiy Storchaka) Date: Wed, 14 Dec 2016 00:58:16 +0200 Subject: [Python-ideas] Enhancing vars() In-Reply-To: <20161212234502.GA3365@ando.pearwood.info> References: <20161212234502.GA3365@ando.pearwood.info> Message-ID: On 13.12.16 01:45, Steven D'Aprano wrote: > One of the lesser-known ones is vars(obj), which should be used in place > of obj.__dict__. > > Unfortunately, vars() is less useful than it might be, since not all > objects have a __dict__. Some objects have __slots__ instead, or even > both. That is considered an implementation detail of the object. http://bugs.python.org/issue13290 http://mail.python.org/pipermail/python-dev/2012-October/122011.html From steve at pearwood.info Tue Dec 13 18:49:14 2016 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 14 Dec 2016 10:49:14 +1100 Subject: [Python-ideas] Enhancing vars() In-Reply-To: References: <20161212234502.GA3365@ando.pearwood.info> Message-ID: <20161213234913.GD3365@ando.pearwood.info> On Wed, Dec 14, 2016 at 12:58:16AM +0200, Serhiy Storchaka wrote: > On 13.12.16 01:45, Steven D'Aprano wrote: > >One of the lesser-known ones is vars(obj), which should be used in place > >of obj.__dict__. > > > >Unfortunately, vars() is less useful than it might be, since not all > >objects have a __dict__. Some objects have __slots__ instead, or even > >both. That is considered an implementation detail of the object. > > http://bugs.python.org/issue13290 > http://mail.python.org/pipermail/python-dev/2012-October/122011.html Thanks Serhiy! Glad to see I'm not the only one with this idea. I think: - the behaviour of locals() (and vars() when given no argument, where it returns locals()) is anomalous and should not be copied unless we really need to. 
- Other Python implementations don't always emulate the weird behaviour
of locals(), for example I think IronPython locals() is writeable, and
the local variables do change.

steve at orac:~$ ipy
IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> def test():
...     a = 1
...     locals()['a'] = 99
...     print a
...
>>> test()
99

CPython will print 1 instead. So CPython locals() is an implementation
detail and we shouldn't feel the need to copy its weird behaviour.

When given an object, vars(obj) should return a dict-like object which
is a read/write proxy to the object's namespace. If the object has a
__dict__ but no __slots__, then there's no need to change anything: it
can keep the current behaviour and just return the dict itself:

    assert vars(obj) is obj.__dict__

But if the object has __slots__, with or without a __dict__, then vars
should return a proxy which directs reads and writes to the correct slot
or dict.

It might be helpful to have a slotsproxy object which provides a
dict-like interface to an object with __slots__ but no __dict__, and
build support for both __slots__ and a __dict__ on top of that.

If the object has *neither* __slots__ nor __dict__, vars can probably
raise a TypeError.

-- 
Steve

From steve at pearwood.info  Tue Dec 13 18:54:10 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Wed, 14 Dec 2016 10:54:10 +1100
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: 
References: <20161212234502.GA3365@ando.pearwood.info>
Message-ID: <20161213235410.GE3365@ando.pearwood.info>

On Mon, Dec 12, 2016 at 10:45:39PM -0500, Alexander Belopolsky wrote:
> On Mon, Dec 12, 2016 at 6:45 PM, Steven D'Aprano
> wrote:
>
> > Proposal: enhance vars() to return a proxy to the object namespace,
> > regardless of whether said namespace is __dict__ itself, or a number of
> > __slots__, or both.
>
> How do you propose dealing with classes defined in C? Their objects don't
> have __slots__.

I don't see any clean way to do so. Maybe we should have a convention
that such objects provide a __slots__ attribute listing public
attributes, but I'm not too concerned. Let vars(weird_c_object) raise
TypeError, just as it does now.

> One possibility is to use __dir__ or dir(), but those can return anything
> and in the past developers
> were encouraged to put only "useful" attributes in __dir__.

Indeed.

-- 
Steve

From steve at pearwood.info  Tue Dec 13 18:56:46 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Wed, 14 Dec 2016 10:56:46 +1100
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: 
References: <20161212234502.GA3365@ando.pearwood.info>
 <584FBF82.4080906@oa-cagliari.inaf.it>
 <20161213100223.GC3365@ando.pearwood.info>
Message-ID: <20161213235646.GF3365@ando.pearwood.info>

On Tue, Dec 13, 2016 at 11:53:44AM +0000, Paul Moore wrote:
[...]
> There's obviously a speed-up from avoiding repeated getattr calls, but
> is speed the key here?

Not for me.

> The advantage of an "enhanced vars" is more likely to be ease of
> discoverability, and I'm not sure dict.fromattrs gives us that
> benefit. Also, the dict constructor gives a *copy* of the namespace,
> where the proposal was for the proxy returned by vars() to provide
> update capability (if I understand the proposal correctly).

Correct.
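To make the idea concrete, here is a rough, untested sketch of the kind
of read/write proxy I have in mind (the class name is illustrative
only, and it deliberately glosses over the slot-versus-dict precedence
subtleties discussed above):

from collections.abc import MutableMapping

class NamespaceProxy(MutableMapping):
    """Dict-like read/write view of an object's own namespace (sketch)."""
    def __init__(self, obj):
        self._obj = obj
    def _names(self):
        cls = type(self._obj)
        # Slots count only once they have actually been assigned to.
        slots = [s for s in getattr(cls, '__slots__', ())
                 if s != '__dict__' and hasattr(self._obj, s)]
        return list(getattr(self._obj, '__dict__', {})) + slots
    def __len__(self):
        return len(self._names())
    def __iter__(self):
        return iter(self._names())
    def __getitem__(self, name):
        if name not in self._names():
            raise KeyError(name)
        return getattr(self._obj, name)
    def __setitem__(self, name, value):
        setattr(self._obj, name, value)
    def __delitem__(self, name):
        if name not in self._names():
            raise KeyError(name)
        delattr(self._obj, name)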
-- 
Steve

From vgr255 at live.ca  Tue Dec 13 19:12:39 2016
From: vgr255 at live.ca (Emanuel Barry)
Date: Wed, 14 Dec 2016 00:12:39 +0000
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: <20161213234913.GD3365@ando.pearwood.info>
References: <20161212234502.GA3365@ando.pearwood.info>
 <20161213234913.GD3365@ando.pearwood.info>
Message-ID: 

> From Steven D'Aprano
> Sent: Tuesday, December 13, 2016 6:49 PM
> To: python-ideas at python.org
> Subject: Re: [Python-ideas] Enhancing vars()
>
> But if the object has __slots__, with or without a __dict__, then vars
> should return a proxy which directs reads and writes to the correct slot
> or dict.
>
> It might be helpful to have a slotsproxy object which provides a
> dict-like interface to an object with __slots__ but no __dict__, and
> build support for both __slots__ and a __dict__ on top of that.

That might be a bit tricky, for example, it's possible that a class has a
`foo` slot *and* a `foo` instance attribute (by virtue of subclasses). What
would you do in that case? Or what if there's a slot that doesn't have any
value (i.e. raises AttributeError on access, but exists on the class
nonetheless), but an instance attribute with the same name exists? And so
on.

> If the object has *neither* __slots__ nor __dict__, vars can probably
> raise a TypeError.

Is that even possible in pure Python? The only object I know that can do
this is `object`, but some other C objects might do that too.

-Emanuel

From steve at pearwood.info  Tue Dec 13 19:40:43 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Wed, 14 Dec 2016 11:40:43 +1100
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: 
References: <20161212234502.GA3365@ando.pearwood.info>
 <20161213234913.GD3365@ando.pearwood.info>
Message-ID: <20161214004043.GG3365@ando.pearwood.info>

On Wed, Dec 14, 2016 at 12:12:39AM +0000, Emanuel Barry wrote:
> > From Steven D'Aprano
> > Sent: Tuesday, December 13, 2016 6:49 PM
> > To: python-ideas at python.org
> > Subject: Re: [Python-ideas] Enhancing vars()
> >
> > But if the object has __slots__, with or without a __dict__, then vars
> > should return a proxy which directs reads and writes to the correct slot
> > or dict.
> >
> > It might be helpful to have a slotsproxy object which provides a
> > dict-like interface to an object with __slots__ but no __dict__, and
> > build support for both __slots__ and a __dict__ on top of that.
>
> That might be a bit tricky, for example, it's possible that a class has a
> `foo` slot *and* a `foo` instance attribute (by virtue of subclasses). What
> would you do in that case?

vars() shouldn't need to care about inheritance: it only cares about the
object's own individual namespace, not attributes inherited from the
class or superclasses. That's how vars() works now:

py> class C:
...     cheese = 1
...
py> obj = C()
py> ns = vars(obj)
py> 'cheese' in ns
False

The only difference here is that if the direct parent class has
__slots__, the instance will use them instead of (or in addition to) a
__dict__. We don't need to care about superclass __slots__, because they
aren't inherited.

> Or what if there's a slot that doesn't have any
> value (i.e. raises AttributeError on access, but exists on the class
> nonetheless), but an instance attribute with the same name exists? And so
> on.

Only the *list of slot names* exists on the class. The slots themselves
are part of the instance. Nevertheless, you are right: a slot can be
defined, but not assigned to. That has to be treated as if the slot
didn't exist:

py> class D:
...     __slots__ = ['spam']
...
py> d = D()
py> hasattr(d, 'spam')
False

So I would expect that 'spam' in vars(d) should likewise return False,
until such time that d.spam is assigned to. The same applies even if the
object has a __dict__. The slot always takes precedence, even if the
slot isn't filled in.

py> class E:
...     __slots__ = ['spam', '__dict__']
...
py> e = E()
py> e.__dict__['spam'] = 1
py> hasattr(e, 'spam')
False

> > If the object has *neither* __slots__ nor __dict__, vars can probably
> > raise a TypeError.
>
> Is that even possible in pure Python? The only object I know that can do
> this is `object`, but some other C objects might do that too.

I don't think pure Python classes can do this, at least not without some
metaclass trickery, but certainly `object` itself lacks both __slots__
and instance __dict__, and C objects can do the same (so I'm told).

-- 
Steve

From ncoghlan at gmail.com  Wed Dec 14 03:01:00 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 14 Dec 2016 18:01:00 +1000
Subject: [Python-ideas] Enhancing vars()
In-Reply-To: <20161214004043.GG3365@ando.pearwood.info>
References: <20161212234502.GA3365@ando.pearwood.info>
 <20161213234913.GD3365@ando.pearwood.info>
 <20161214004043.GG3365@ando.pearwood.info>
Message-ID: 

On 14 December 2016 at 10:40, Steven D'Aprano wrote:
> On Wed, Dec 14, 2016 at 12:12:39AM +0000, Emanuel Barry wrote:
>> > From Steven D'Aprano
>> > Sent: Tuesday, December 13, 2016 6:49 PM
>> > To: python-ideas at python.org
>> > Subject: Re: [Python-ideas] Enhancing vars()
>> >
>> > But if the object has __slots__, with or without a __dict__, then vars
>> > should return a proxy which directs reads and writes to the correct slot
>> > or dict.
>> >
>> > It might be helpful to have a slotsproxy object which provides a
>> > dict-like interface to an object with __slots__ but no __dict__, and
>> > build support for both __slots__ and a __dict__ on top of that.
>>
>> That might be a bit tricky, for example, it's possible that a class has a
>> `foo` slot *and* a `foo` instance attribute (by virtue of subclasses). What
>> would you do in that case?
>
> vars() shouldn't need to care about inheritance: it only cares about the
> object's own individual namespace, not attributes inherited from the
> class or superclasses. That's how vars() works now:
>
> py> class C:
> ...     cheese = 1
> ...
> py> obj = C()
> py> ns = vars(obj)
> py> 'cheese' in ns
> False
>
> The only difference here is that if the direct parent class has
> __slots__, the instance will use them instead of (or in addition to) a
> __dict__. We don't need to care about superclass __slots__, because they
> aren't inherited.

If folks genuinely want an attrproxy that provides a dict-like view
over an instance, that's essentially:

from collections import MutableMapping

class AttrProxy(MutableMapping):
    def __init__(self, obj):
        self._obj = obj
    def __len__(self):
        return len(dir(self._obj))
    def __iter__(self):
        for attr in dir(self._obj):
            yield attr
    def __contains__(self, attr):
        return hasattr(self._obj, attr)
    def __getitem__(self, attr):
        return getattr(self._obj, attr)
    def __setitem__(self, attr, value):
        setattr(self._obj, attr, value)
    def __delitem__(self, attr):
        delattr(self._obj, attr)

>>> class C:
...     a = 1
...     b = 2
...     c = 3
...
>>> ns = AttrProxy(C)
>>> list(ns.keys())
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
'__format__', '__ge__', '__getattribute__', '__gt__', '__hash__',
'__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
'__str__', '__subclasshook__', '__weakref__', 'a', 'b', 'c']
>>> ns["d"] = 4
>>> C.d
4
>>> C.c
3
>>> del ns["c"]
>>> C.c
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'C' has no attribute 'c'
>>> ns["a"] = 5
>>> C.a
5

Replacing the calls to `dir(self._obj)` with a method that filters out
dunder-methods and updating the other methods to reject them as keys is
also relatively straightforward if people want that behaviour.

Indirecting through dir(), hasattr(), getattr(), setattr() and delattr()
this way means you don't have to worry about the vagaries of the
descriptor protocol or inheritance or instance attributes vs class
attributes or anything else like that, while inheriting from
MutableMapping automatically gives you view-based keys(), values() and
items() implementations.

I wouldn't have any real objection to providing an API that behaves
like this (in simple cases it's functionally equivalent to manipulating
__dict__ directly, while in more complex cases, the attrproxy approach
is likely to just work, whereas __dict__ manipulation may fail).

(Re-using vars() likely wouldn't be appropriate in that case though,
due to the change in the way inheritance is handled)

I *would* object to a new proxy type that duplicated descriptor logic
that's already programmatically accessible in other builtins, or only
selectively supported certain descriptors (like those created for
__slots__) while ignoring others. Pseudo-lookups like
inspect.getattr_static() exist to help out IDEs, debuggers and other
code analysers, rather than as something we want people to be doing as
part of their normal application execution.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From erik.m.bray at gmail.com  Fri Dec 16 07:07:46 2016
From: erik.m.bray at gmail.com (Erik Bray)
Date: Fri, 16 Dec 2016 13:07:46 +0100
Subject: [Python-ideas] New PyThread_tss_ C-API for CPython
Message-ID: 

Greetings all,

I wanted to bring attention to an issue that's been languishing on the
bug tracker since last year, which I think would best be addressed by
changes to CPython's C-API.  The original issue is at
http://bugs.python.org/issue25658, but I have made an effort below in
a sort of proto-PEP to summarize the problem and the proposed
solution.

I haven't written this up in the proper PEP format because I want to
see if the idea has some broader support first, and it's also not
clear to me whether C-API changes (especially to undocumented APIs)
even require their own PEP.

Abstract
========

The proposal is to add a new Thread Local Storage (TLS) API to CPython
which would supersede use of the existing TLS API within the CPython
interpreter, while deprecating the existing API.

Because the existing TLS API is only used internally (it is not
mentioned in the documentation, and the header that defines it,
pythread.h, is not included in Python.h either directly or indirectly),
this proposal probably only affects CPython, but might also affect
other interpreter implementations (PyPy?) that implement parts of the
CPython API.
Specification
=============

The current API for TLS used inside the CPython interpreter consists
of 5 functions:

    PyAPI_FUNC(int) PyThread_create_key(void)
    PyAPI_FUNC(void) PyThread_delete_key(int key)
    PyAPI_FUNC(int) PyThread_set_key_value(int key, void *value)
    PyAPI_FUNC(void *) PyThread_get_key_value(int key)
    PyAPI_FUNC(void) PyThread_delete_key_value(int key)

These would be superseded with a new set of analogous functions:

    PyAPI_FUNC(int) PyThread_tss_create(Py_tss_t *key)
    PyAPI_FUNC(void) PyThread_tss_delete(Py_tss_t key)
    PyAPI_FUNC(int) PyThread_tss_set(Py_tss_t key, void *value)
    PyAPI_FUNC(void *) PyThread_tss_get(Py_tss_t key)
    PyAPI_FUNC(void) PyThread_tss_delete_value(Py_tss_t key)

and includes the definition of a new type Py_tss_t--an opaque type the
specification of which is not given here, and may depend on the
underlying TLS implementation.

The new PyThread_tss_ functions are almost exactly analogous to their
original counterparts with a minor difference: Whereas
PyThread_create_key takes no arguments and returns a TLS key as an
int, PyThread_tss_create takes a Py_tss_t* as an argument, and returns
a Py_tss_t by pointer--the int return value is a status, returning
zero on success and non-zero on failure.

Further, the old PyThread_*_key* functions will be marked as
deprecated.

Additionally, the pthread implementations of the old PyThread_*_key*
functions will either fail or be no-ops on platforms where
sizeof(pthread_key_t) != sizeof(int).

Motivation
==========

The primary problem at issue here is the type of the keys (int) used
for TLS values, as defined by the original PyThread TLS API.

The original TLS API was added to Python by GvR back in 1997, and at
the time the key used to represent a TLS value was an int, and so it
has been to this day. This used CPython's own TLS implementation, the
current generation of which can still be found, largely unchanged, in
Python/thread.c. Support for implementation of the API on top of
native thread implementations (NT and pthreads) was added much later,
and the built-in implementation may still be used on other platforms.

The problem with the choice of int to represent a TLS key is that,
while it was fine for CPython's internal TLS implementation, and
happens to be fine for NT (which uses DWORD), it is not compatible
with the POSIX standard for the pthreads API, which defines
pthread_key_t as an opaque type not further defined by the standard
(as with Py_tss_t described above). This leaves it up to the
underlying implementation how a pthread_key_t value is used to look up
thread-specific data.

This has not generally been a problem for Python's API, as it just
happens that on Linux pthread_key_t is defined as an unsigned int, and
so is fully compatible with Python's TLS API--pthread_key_t's created
by pthread_key_create can be freely cast to ints and back (well, not
really, even this has issues as pointed out by issue #22206).

However, as issue #25658 points out, there are at least some platforms
(namely Cygwin, CloudABI, but likely others as well) which have
otherwise modern and POSIX-compliant pthreads implementations, but are
not compatible with Python's API because their pthread_key_t is
defined in a way that cannot be safely cast to int. In fact, the
possibility of running into this problem was raised by MvL at the time
pthreads TLS was added [1].
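To make the failure mode concrete, here is an illustrative sketch of
the pattern that breaks (assuming a platform where pthread_key_t is
wider than int, or not an integer type at all, in which case the casts
below lose information or do not even compile):

    #include <pthread.h>

    static pthread_key_t real_key;

    int legacy_create_key(void)
    {
        pthread_key_create(&real_key, NULL);
        /* The existing API must funnel the key through an int; on the
         * affected platforms this cast is lossy or invalid. */
        return (int) real_key;
    }

    void *legacy_get(int key)
    {
        /* Casting back does not necessarily recover the original key. */
        return pthread_getspecific((pthread_key_t) key);
    }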
It could be argued that PEP-11 makes specific requirements for supporting a new, not otherwise officially-supported platform (such as CloudABI), and that the status of Cygwin support is currently dubious. However, this places a very high barrier to supporting platforms that are otherwise Linux- and/or POSIX-compatible and where CPython might otherwise "just work" except for this one hurdle which Python itself imposes by way of an API that is not compatible with POSIX (and in fact makes invalid assumptions about pthreads).

Rationale for Proposed Solution
===============================

The use of an opaque type (Py_tss_t) to key TLS values allows the API to be compatible, at least in this regard, with CPython's internal TLS implementation, as well as all present (NT and POSIX) and future (C11?) native TLS implementations supported by CPython, as it allows the definition of Py_tss_t to depend on the underlying implementation.

A new API must be introduced, rather than changing the function signatures of the current API, in order to maintain backwards compatibility. The new API also more clearly groups together these related functions under a single name prefix, "PyThread_tss_". The "tss" in the name stands for "thread-specific storage", and was influenced by the naming and design of the "tss" API that is part of the C11 threads API. However, this is in no way meant to imply compatibility with or support for the C11 threads API, or signal any future intention of supporting C11--it's just the influence for the naming and design.

Changing PyThread_create_key to immediately return a failure status on systems using pthreads where sizeof(int) != sizeof(pthread_key_t) is intended as a sanity check: Currently, PyThread_create_key will report initial success on such systems, but attempts to use the returned key are likely to fail. Although in practice this failure occurs quickly during interpreter startup, it's better to fail immediately at the source of failure (PyThread_create_key) rather than sometime later when use of an invalid key is attempted.

Rejected Ideas
==============

* Do nothing: The status quo is fine because it works on Linux, and platforms wishing to be supported by CPython should follow the requirements of PEP-11. As explained above, while this would be a fair argument if CPython were being asked to make changes to support particular quirks of a specific platform, in this case the platforms in question are only asking to fix a quirk of CPython that prevents it from being used to its full potential on those platforms. The fact that the current implementation happens to work on Linux is a happy accident, and there's no guarantee that this will never change.

* Affected platforms should just configure Python --without-threads: This is a possible temporary workaround to the issue, but only that. Python should not be hobbled on affected platforms despite them being otherwise perfectly capable of running multi-threaded Python.

* Affected platforms should not define Py_HAVE_NATIVE_TLS: This is a more acceptable alternative to the previous idea, and in fact there is a patch to do just that [2]. However, CPython's internal TLS implementation being "slower and clunkier" in general than native implementations still needlessly hobbles performance on affected platforms. At least one other module (tracemalloc) is also broken if Python is built without Py_HAVE_NATIVE_TLS.

* Keep the existing API, but work around the issue by providing a mapping from pthread_key_t values to ints.
A couple of attempts were made at this [3] [4], but this only injects needless complexity and overhead into performance-critical code on platforms that are not currently affected by this issue (such as Linux). Even if use of this workaround were made conditional on platform compatibility, it introduces platform-specific code to maintain, and still has the problem of the previous rejected ideas of needlessly hobbling performance on affected platforms.

Implementation
==============

An initial version of a patch [5] is available on the bug tracker for this issue. The patch was proposed and written by Masayuki Yamamoto, who should be considered a co-author of this proto-PEP, though I have not consulted directly with him in writing this. If he's reading, he should chime in in case I've misrepresented anything.

If you've made it this far, thanks for reading and thank you for your consideration,

Erik

[1] https://bugs.python.org/msg116292
[2] http://bugs.python.org/file45548/configure-pthread_key_t.patch
[3] http://bugs.python.org/file44269/issue25658-1.patch
[4] http://bugs.python.org/file44303/key-constant-time.diff
[5] http://bugs.python.org/file45763/pythread-tss.patch

From zachary.ware+pyideas at gmail.com Fri Dec 16 12:17:31 2016 From: zachary.ware+pyideas at gmail.com (Zachary Ware) Date: Fri, 16 Dec 2016 11:17:31 -0600 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: Message-ID:

On Fri, Dec 16, 2016 at 6:07 AM, Erik Bray wrote: > Greetings all, > > I wanted to bring attention to an issue that's been languishing on the > bug tracker since last year, which I think would best be addressed by > changes to CPython's C-API. The original issue is at > http://bugs.python.org/issue25658, but I have made an effort below in > a sort of proto-PEP to summarize the problem and the proposed > solution.

I am not familiar enough with the threading implementation to be anything more than moral support, but I am in favor of making some change here. This is a significant blocker to Cygwin support, which is actually fairly close to being supportable.

-- Zach

From solipsis at pitrou.net Fri Dec 16 12:51:02 2016 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 16 Dec 2016 18:51:02 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython References: Message-ID: <20161216185102.1e8396d4@fsol>

On Fri, 16 Dec 2016 13:07:46 +0100 Erik Bray wrote: > Greetings all, > > I wanted to bring attention to an issue that's been languishing on the > bug tracker since last year, which I think would best be addressed by > changes to CPython's C-API. The original issue is at > http://bugs.python.org/issue25658, but I have made an effort below in > a sort of proto-PEP to summarize the problem and the proposed > solution. > > I haven't written this up in the proper PEP format because I want to > see if the idea has some broader support first, and it's also not > clear to me whether C-API changes (especially to undocumented APIs) > even require their own PEP.

This is a nice detailed write-up and I'm in favour of the proposal.

Regards

Antoine.
From rosuav at gmail.com Fri Dec 16 15:04:25 2016 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 17 Dec 2016 07:04:25 +1100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: Message-ID:

On Fri, Dec 16, 2016 at 11:07 PM, Erik Bray wrote: > I haven't written this up in the proper PEP format because I want to > see if the idea has some broader support first, and it's also not > clear to me whether C-API changes (especially to undocumented APIs) > even require their own PEP. >

You're pretty close to proper PEP format. Like others, I don't have enough knowledge of threading internals to speak to the technical side of it, but this is a well-written proposal and I agree in principle with tightening this up. The need for a PEP basically comes down to whether or not it's going to be controversial; a PEP allows you to hash out the details and then present a coherent proposal to Guido (or his delegate) for final approval.

ChrisA

From ma3yuki.8mamo10 at gmail.com Fri Dec 16 15:14:40 2016 From: ma3yuki.8mamo10 at gmail.com (Masayuki YAMAMOTO) Date: Sat, 17 Dec 2016 05:14:40 +0900 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: Message-ID:

Hi, I'm the patch author, so I don't have anything to add to Erik's draft. I'm delighted at how clearly it explains everything, especially the history of the API and the rationale behind the PEP. Thanks for the great job, Erik!

Cheers, Masayuki

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From turnbull.stephen.fw at u.tsukuba.ac.jp Sat Dec 17 02:21:17 2016 From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull) Date: Sat, 17 Dec 2016 16:21:17 +0900 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: Message-ID: <22612.59245.346572.695579@turnbull.sk.tsukuba.ac.jp>

Erik Bray writes: > Abstract > ======== > > The proposal is to add a new Thread Local Storage (TLS) API to CPython > which would supersede use of the existing TLS API within the CPython > interpreter, while deprecating the existing API.

Thank you for the analysis! Question:

> Further, the old PyThread_*_key* functions will be marked as > deprecated.

Of course, but:

> Additionally, the pthread implementations of the old > PyThread_*_key* functions will either fail or be no-ops on > platforms where sizeof(pythead_t) != sizeof(int).

Typo "pythead_t" in last line.

I don't understand this. I assume that there are no such platforms supported at present. I would think that when such a platform becomes supported, code supporting "key" functions becomes unsupportable without #ifdefs on that platform, at least directly. So you should either (1) raise UnimplementedError, or (2) provide the API as a wrapper over the new API by making the integer keys indexes into a table of TSS'es, or some such device. I don't understand how (3) "make it a no-op" can be implemented for PyThread_create_key -- return 0 or -1? That would only work if there's a failure return status like 0 or -1, and it seems really dangerous to me since in general a lot of code doesn't check status even though it should. Even for code checking the status, the error message will be suboptimal ("creation failed" vs. "unimplemented").

I gather from references to casting pthread_key_t to unsigned int and back that there's probably code that does this in ways making (2) too dangerous to support. If true, perhaps that should be mentioned here.
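For concreteness, a compatibility shim along the lines of (2) might look something like the following sketch. All names and the fixed table size here are hypothetical, and locking around the table is ignored:

    /* Hypothetical shim: keep the old int-keyed API alive by making
       the ints indexes into a table of the proposed Py_tss_t keys. */
    #define MAX_COMPAT_KEYS 128

    static Py_tss_t compat_keys[MAX_COMPAT_KEYS];
    static int ncompat_keys = 0;

    int
    PyThread_create_key(void)
    {
        if (ncompat_keys >= MAX_COMPAT_KEYS)
            return -1;
        if (PyThread_tss_create(&compat_keys[ncompat_keys]) != 0)
            return -1;
        return ncompat_keys++;  /* the int key is just a table index */
    }

    void *
    PyThread_get_key_value(int key)
    {
        if (key < 0 || key >= ncompat_keys)
            return NULL;
        return PyThread_tss_get(compat_keys[key]);
    }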
From ma3yuki.8mamo10 at gmail.com Sat Dec 17 18:10:21 2016 From: ma3yuki.8mamo10 at gmail.com (Masayuki YAMAMOTO) Date: Sun, 18 Dec 2016 08:10:21 +0900 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython Message-ID:

2016-12-17 18:35 GMT+09:00 Stephen J. Turnbull : > I don't understand this. I assume that there are no such platforms > supported at present. I would think that when such a platform becomes > supported, code supporting "key" functions becomes unsupportable > without #ifdefs on that platform, at least directly. So you should > either (1) raise UnimplementedError, or (2) provide the API as a > wrapper over the new API by making the integer keys indexes into a > table of TSS'es, or some such device. I don't understand how (3) > "make it a no-op" can be implemented for PyThread_create_key -- return > 0 or -1? That would only work if there's a failure return status like > 0 or -1, and it seems really dangerous to me since in general a lot of > code doesn't check status even though it should. Even for code > checking the status, the error message will be suboptimal ("creation > failed" vs. "unimplemented").

PyThread_create_key has always required the user to check the return value, since when key creation fails it returns -1 instead of a valid key value. Therefore, my patch changes PyThread_create_key to always return -1 on platforms where the key cannot safely be cast to int, so that the current API never hands out a valid key value on those platforms. The advantage is that the function specifications don't change, and there is no effect on the currently supported platforms. That is the reason the API doesn't raise any exception.
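In other words, the intended behaviour is roughly the following sketch. This is simplified: the macro name is hypothetical and stands in for the configure-time check in my patch:

    #include <pthread.h>

    int
    PyThread_create_key(void)
    {
    #if defined(PY_PTHREAD_KEY_T_NOT_INT_SAFE)  /* hypothetical macro */
        /* Fail immediately: no int can hold a valid key here. */
        return -1;
    #else
        pthread_key_t key;
        if (pthread_key_create(&key, NULL) != 0)
            return -1;
        return (int)key;
    #endif
    }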
Idea (2) could keep the current API usable on those specific platforms. If it were simple, I would have liked to choose it. However, implementing the current API on top of native TLS on those platforms requires a duplicate implementation for managing keys, and it's ugly (for the same reason given in the last item of the Rejected Ideas in Erik's draft). Thus, I gave up on keeping the feature and decided to implement the "no-op" behaviour, delegating error handling to API users.

Kind regards, Masayuki

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From erik.m.bray at gmail.com Mon Dec 19 05:50:23 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Mon, 19 Dec 2016 11:50:23 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: Message-ID:

On Sun, Dec 18, 2016 at 12:10 AM, Masayuki YAMAMOTO wrote: > 2016-12-17 18:35 GMT+09:00 Stephen J. Turnbull > : >> >> I don't understand this. I assume that there are no such platforms >> supported at present. I would think that when such a platform becomes >> supported, code supporting "key" functions becomes unsupportable >> without #ifdefs on that platform, at least directly. So you should >> either (1) raise UnimplementedError, or (2) provide the API as a >> wrapper over the new API by making the integer keys indexes into a >> table of TSS'es, or some such device. I don't understand how (3) >> "make it a no-op" can be implemented for PyThread_create_key -- return >> 0 or -1? That would only work if there's a failure return status like >> 0 or -1, and it seems really dangerous to me since in general a lot of >> code doesn't check status even though it should. Even for code >> checking the status, the error message will be suboptimal ("creation >> failed" vs. "unimplemented"). > > > PyThread_create_key has required user to check the return value since when > key creation fails, returns -1 instead of valid key value. Therefore, my > patch changes PyThread_create_key that always return -1 on platforms that > cannot cast key to int safely and current API never return valid key value > to these platforms. Its advantage to not change function specifications and > no effect on supported platforms. Hence, this is reason that doesn't raise > any exception on the API. > > (2) of ideas can enable current API on specific-platforms. If it's simple, > I'd have liked to select it. However, work that brings current API using > native TLS to specific-platforms brings duplication implementation that > manages keys, and it's ugly (same reason for Erik's draft, the last item of > Rejected Ideas). Thus, I gave up to keep feature and decided to implement > "no-op", delegate error handling to API users.

Yep--I think it speaks to the sensibleness of that decision that I pretty much read your mind :)

From erik.m.bray at gmail.com Mon Dec 19 05:48:41 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Mon, 19 Dec 2016 11:48:41 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: <22612.59245.346572.695579@turnbull.sk.tsukuba.ac.jp> References: <22612.59245.346572.695579@turnbull.sk.tsukuba.ac.jp> Message-ID:

On Sat, Dec 17, 2016 at 8:21 AM, Stephen J. Turnbull wrote: > Erik Bray writes: > > > Abstract > > ======== > > > > The proposal is to add a new Thread Local Storage (TLS) API to CPython > > which would supersede use of the existing TLS API within the CPython > > interpreter, while deprecating the existing API. > > Thank you for the analysis!

And thank *you* for the feedback!
> Question: > > > Further, the old PyThread_*_key* functions will be marked as > > deprecated. > > Of course, but: > > > Additionally, the pthread implementations of the old > > PyThread_*_key* functions will either fail or be no-ops on > > platforms where sizeof(pythead_t) != sizeof(int). > > Typo "pythead_t" in last line.

Thanks, yes, that was supposed to be pthread_key_t of course. I think I had a few other typos too.

> I don't understand this. I assume that there are no such platforms > supported at present. I would think that when such a platform becomes > supported, code supporting "key" functions becomes unsupportable > without #ifdefs on that platform, at least directly. So you should > either (1) raise UnimplementedError, or (2) provide the API as a > wrapper over the new API by making the integer keys indexes into a > table of TSS'es, or some such device. I don't understand how (3) > "make it a no-op" can be implemented for PyThread_create_key -- return > 0 or -1? That would only work if there's a failure return status like > 0 or -1, and it seems really dangerous to me since in general a lot of > code doesn't check status even though it should. Even for code > checking the status, the error message will be suboptimal ("creation > failed" vs. "unimplemented").

Masayuki already explained this downthread I think, but I could have probably made that section more precise. The point was that PyThread_create_key should immediately return -1 in this case. This is just a subtle difference from the current situation, which is that PyThread_create_key succeeds, but the key is corrupted by being cast to an int, so that later calls to PyThread_set_key_value and the like fail unexpectedly.

The point is that PyThread_create_key (and we're only talking about the pthread implementation thereof, to be clear) must fail immediately if it can't work correctly. #ifdefs on the platform would not be necessary--instead, Masayuki's patch adds a feature check in configure.ac for sizeof(int) == sizeof(pthread_key_t). It should be noted that even this check is not 100% perfect, as on Linux pthread_key_t is an unsigned int, and so technically can cause Python's signed int key to overflow, but there's already an explicit check for that (which would be kept), and it's also a very unlikely scenario.

> I gather from references to casting pthread_key_t to unsigned int and > back that there's probably code that does this in ways making (2) too > dangerous to support. If true, perhaps that should be mentioned here.

It's not necessarily too dangerous, so much as not worth the trouble, IMO. Simpler to just provide, and immediately use, the new API and make the old one deprecated and explicitly not supported on those platforms where it can't work.

Thanks, Erik

From ncoghlan at gmail.com Mon Dec 19 07:11:07 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 19 Dec 2016 22:11:07 +1000 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: <20161216185102.1e8396d4@fsol> References: <20161216185102.1e8396d4@fsol> Message-ID:

On 17 December 2016 at 03:51, Antoine Pitrou wrote: > On Fri, 16 Dec 2016 13:07:46 +0100 > Erik Bray wrote: > > Greetings all, > > > > I wanted to bring attention to an issue that's been languishing on the > > bug tracker since last year, which I think would best be addressed by > > changes to CPython's C-API.
The original issue is at > > http://bugs.python.org/issue25658, but I have made an effort below in > > a sort of proto-PEP to summarize the problem and the proposed > > solution. > > > > I haven't written this up in the proper PEP format because I want to > > see if the idea has some broader support first, and it's also not > > clear to me whether C-API changes (especially to undocumented APIs) > > even require their own PEP. > > This is a nice detailed write-up and I'm in favour of the proposal. > Likewise - we know the status quo isn't right, and the proposed change addresses that. In reviewing the patch on the tracker, the one downside I've found is that due to "pthread_key_t" being an opaque type with no defined sentinel, the consuming code in _tracemalloc.c and pystate.c needed to add separate boolean flag variables to track whether or not the key had been created. (The pthread examples at http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html use pthread_once for a similar effect) I don't see any obvious way around that either, as even using a small struct for native pthread TLS keys would still face the problem of how to initialise the pthread_key_t field. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From erik.m.bray at gmail.com Mon Dec 19 09:45:50 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Mon, 19 Dec 2016 15:45:50 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On Mon, Dec 19, 2016 at 1:11 PM, Nick Coghlan wrote: > On 17 December 2016 at 03:51, Antoine Pitrou wrote: >> >> On Fri, 16 Dec 2016 13:07:46 +0100 >> Erik Bray wrote: >> > Greetings all, >> > >> > I wanted to bring attention to an issue that's been languishing on the >> > bug tracker since last year, which I think would best be addressed by >> > changes to CPython's C-API. The original issue is at >> > http://bugs.python.org/issue25658, but I have made an effort below in >> > a sort of proto-PEP to summarize the problem and the proposed >> > solution. >> > >> > I haven't written this up in the proper PEP format because I want to >> > see if the idea has some broader support first, and it's also not >> > clear to me whether C-API changes (especially to undocumented APIs) >> > even require their own PEP. >> >> This is a nice detailed write-up and I'm in favour of the proposal. > > > Likewise - we know the status quo isn't right, and the proposed change > addresses that. In reviewing the patch on the tracker, the one downside I've > found is that due to "pthread_key_t" being an opaque type with no defined > sentinel, the consuming code in _tracemalloc.c and pystate.c needed to add > separate boolean flag variables to track whether or not the key had been > created. (The pthread examples at > http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html > use pthread_once for a similar effect) > > I don't see any obvious way around that either, as even using a small struct > for native pthread TLS keys would still face the problem of how to > initialise the pthread_key_t field. Hmm...fair point that it's not pretty. 
One way around it, albeit requiring more work/complexity, would be to extend this proposal to add a new function analogous to pthread_once--say--PyThread_call_once, and an associated Py_once_flag_t From erik.m.bray at gmail.com Mon Dec 19 09:53:42 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Mon, 19 Dec 2016 15:53:42 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On Mon, Dec 19, 2016 at 3:45 PM, Erik Bray wrote: > On Mon, Dec 19, 2016 at 1:11 PM, Nick Coghlan wrote: >> On 17 December 2016 at 03:51, Antoine Pitrou wrote: >>> >>> On Fri, 16 Dec 2016 13:07:46 +0100 >>> Erik Bray wrote: >>> > Greetings all, >>> > >>> > I wanted to bring attention to an issue that's been languishing on the >>> > bug tracker since last year, which I think would best be addressed by >>> > changes to CPython's C-API. The original issue is at >>> > http://bugs.python.org/issue25658, but I have made an effort below in >>> > a sort of proto-PEP to summarize the problem and the proposed >>> > solution. >>> > >>> > I haven't written this up in the proper PEP format because I want to >>> > see if the idea has some broader support first, and it's also not >>> > clear to me whether C-API changes (especially to undocumented APIs) >>> > even require their own PEP. >>> >>> This is a nice detailed write-up and I'm in favour of the proposal. >> >> >> Likewise - we know the status quo isn't right, and the proposed change >> addresses that. In reviewing the patch on the tracker, the one downside I've >> found is that due to "pthread_key_t" being an opaque type with no defined >> sentinel, the consuming code in _tracemalloc.c and pystate.c needed to add >> separate boolean flag variables to track whether or not the key had been >> created. (The pthread examples at >> http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html >> use pthread_once for a similar effect) >> >> I don't see any obvious way around that either, as even using a small struct >> for native pthread TLS keys would still face the problem of how to >> initialise the pthread_key_t field. > > Hmm...fair point that it's not pretty. One way around it, albeit > requiring more work/complexity, would be to extend this proposal to > add a new function analogous to pthread_once--say--PyThread_call_once, > and an associated Py_once_flag_t Oops--fat-fingered a 'send' command before I finished. So workaround would be to add a PyThread_call_once function, analogous to pthread_once. Yet another interface one needs to implement for a native thread implementation, but not too hard either. For pthreads there's already an obvious analogue that can be wrapped directly. For other platforms that don't have a direct analogue a (naive) implementation is still fairly simple: All you need in Py_once_flag_t is a boolean flag with an associated mutex, and a sentinel value analogous to PTHREAD_ONCE_INIT. Best, Erik From ncoghlan at gmail.com Tue Dec 20 03:26:24 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 20 Dec 2016 18:26:24 +1000 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On 20 December 2016 at 00:53, Erik Bray wrote: > On Mon, Dec 19, 2016 at 3:45 PM, Erik Bray wrote: > >> Likewise - we know the status quo isn't right, and the proposed change > >> addresses that. 
In reviewing the patch on the tracker, the one downside > I've > >> found is that due to "pthread_key_t" being an opaque type with no > defined > >> sentinel, the consuming code in _tracemalloc.c and pystate.c needed to > add > >> separate boolean flag variables to track whether or not the key had been > >> created. (The pthread examples at > >> http://pubs.opengroup.org/onlinepubs/009695399/ > functions/pthread_key_create.html > >> use pthread_once for a similar effect) > >> > >> I don't see any obvious way around that either, as even using a small > struct > >> for native pthread TLS keys would still face the problem of how to > >> initialise the pthread_key_t field. > > > > Hmm...fair point that it's not pretty. One way around it, albeit > > requiring more work/complexity, would be to extend this proposal to > > add a new function analogous to pthread_once--say--PyThread_call_once, > > and an associated Py_once_flag_t > > Oops--fat-fingered a 'send' command before I finished. > > So workaround would be to add a PyThread_call_once function, > analogous to pthread_once. Yet another interface one needs to > implement for a native thread implementation, but not too hard either. > For pthreads there's already an obvious analogue that can be wrapped > directly. For other platforms that don't have a direct analogue a > (naive) implementation is still fairly simple: All you need in > Py_once_flag_t is a boolean flag with an associated mutex, and a > sentinel value analogous to PTHREAD_ONCE_INIT. > Yeah, I think I'd prefer that - it aligns nicely with the way pthreads are defined, and means we can be more prescriptive about how to use the new API correctly for key declarations (we're currently a bit vague about exactly how to handle that in the current TLS API). With that addition, I think it will be worth turning your initial post here into a PR to the peps repo, though - not to resolve any particular controversy, but rather as an easier to find reference for the design rationale than a mailing list thread or a tracker issue. (I'd also be happy to volunteer as BDFL-Delegate, since I'm already reviewing the patch on the tracker) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From erik.m.bray at gmail.com Tue Dec 20 08:30:27 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Tue, 20 Dec 2016 14:30:27 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On Tue, Dec 20, 2016 at 9:26 AM, Nick Coghlan wrote: > On 20 December 2016 at 00:53, Erik Bray wrote: >> >> On Mon, Dec 19, 2016 at 3:45 PM, Erik Bray wrote: >> >> Likewise - we know the status quo isn't right, and the proposed change >> >> addresses that. In reviewing the patch on the tracker, the one downside >> >> I've >> >> found is that due to "pthread_key_t" being an opaque type with no >> >> defined >> >> sentinel, the consuming code in _tracemalloc.c and pystate.c needed to >> >> add >> >> separate boolean flag variables to track whether or not the key had >> >> been >> >> created. (The pthread examples at >> >> >> >> http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html >> >> use pthread_once for a similar effect) >> >> >> >> I don't see any obvious way around that either, as even using a small >> >> struct >> >> for native pthread TLS keys would still face the problem of how to >> >> initialise the pthread_key_t field. 
>> > >> > Hmm...fair point that it's not pretty. One way around it, albeit >> > requiring more work/complexity, would be to extend this proposal to >> > add a new function analogous to pthread_once--say--PyThread_call_once, >> > and an associated Py_once_flag_t >> >> Oops--fat-fingered a 'send' command before I finished. >> >> So workaround would be to add a PyThread_call_once function, >> analogous to pthread_once. Yet another interface one needs to >> implement for a native thread implementation, but not too hard either. >> For pthreads there's already an obvious analogue that can be wrapped >> directly. For other platforms that don't have a direct analogue a >> (naive) implementation is still fairly simple: All you need in >> Py_once_flag_t is a boolean flag with an associated mutex, and a >> sentinel value analogous to PTHREAD_ONCE_INIT. > > Yeah, I think I'd prefer that - it aligns nicely with the way pthreads are > defined, and means we can be more prescriptive about how to use the new API > correctly for key declarations (we're currently a bit vague about exactly > how to handle that in the current TLS API). > > With that addition, I think it will be worth turning your initial post here > into a PR to the peps repo, though - not to resolve any particular > controversy, but rather as an easier to find reference for the design > rationale than a mailing list thread or a tracker issue. > > (I'd also be happy to volunteer as BDFL-Delegate, since I'm already > reviewing the patch on the tracker)

Okay, thanks. I will work on a PR to the PEPs repo, and update the proposal to add the PyThread_call_once idea, with some prescription for how it should be used. Of course, an updated patch will have to follow as well.

This is probably an implementation detail, but ISTM that even with PyThread_call_once, it will be necessary to reset any used once_flags manually in PyOS_AfterFork, essentially for the same reason the autoTLSkey is reset there currently...

Erik

From ma3yuki.8mamo10 at gmail.com Tue Dec 20 10:35:13 2016 From: ma3yuki.8mamo10 at gmail.com (Masayuki YAMAMOTO) Date: Wed, 21 Dec 2016 00:35:13 +0900 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID:

2016-12-20 22:30 GMT+09:00 Erik Bray : > This is probably an implementation detail, but ISTM that even with > PyThread_call_once, it will be necessary to reset any used once_flags > manually in PyOS_AfterFork, essentially for the same reason the > autoTLSkey is reset there currently... >

Deleting the thread keys is done in the *_Fini functions, but the Py_FinalizeEx function that calls the *_Fini functions doesn't terminate the CPython interpreter. Furthermore, a source comment and the documentation describe reinitialization after calling Py_FinalizeEx. [1] [2] That is to say, there is an implicit possibility of reinitialization at the process level, contrary to the name "call_once". Therefore, if the CPython interpreter continues to allow reinitialization, I'd suggest renaming the call_once API to avoid misleading semantics (for example, safe_init or check_init).

Best regards, Masayuki

[1] https://hg.python.org/cpython/file/default/Python/pylifecycle.c#l170 [2] https://docs.python.org/dev/c-api/init.html#c.Py_FinalizeEx

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From thane.brimhall at gmail.com Tue Dec 20 19:50:57 2016 From: thane.brimhall at gmail.com (Thane Brimhall) Date: Tue, 20 Dec 2016 17:50:57 -0700 Subject: [Python-ideas] api suggestions for the cProfile module Message-ID:

I use cProfile a lot, and would like to suggest three backwards-compatible improvements to the API.

1: When using cProfile on a specific piece of code I often use the enable() and disable() methods. It occurred to me that this would be an obvious place to use a context manager.

2: Enhance the `print_stats` method on Profile to accept more options currently available only through the pstats.Stats class. For example, strip_dirs could be a boolean argument, and limit could accept an int. This would reduce the number of cases you'd need to use the more complex API.

3: I often forget which string keys are available for sorting. It would be nice to add an enum for these so a user could have their linter and IDE check that value pre-runtime. Since it would subclass `str` and `Enum` it would still work with all currently existing code.

The current documentation contains the following code:

    import cProfile, pstats, io
    pr = cProfile.Profile()
    pr.enable()
    # ... do something ...
    pr.disable()
    s = io.StringIO()
    sortby = 'cumulative'
    ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
    ps.print_stats()
    print(s.getvalue())

While the code below doesn't exactly match the functionality above (e.g. not using StringIO), I envision the context manager working like this, along with some adjustments on how to get the stats from the profiler:

    import cProfile, pstats
    with cProfile.Profile() as pr:
        # ... do something ...
        pr.print_stats(sort=pstats.Sort.cumulative, limit=10, strip_dirs=True)

As you can see, the code is shorter and somewhat more self-documenting. The best thing about these suggestions is that as far as I can tell they would be backwards-compatible API additions.

What do you think? Thank you in advance for your time!

/Thane

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From ncoghlan at gmail.com Tue Dec 20 20:10:33 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 21 Dec 2016 11:10:33 +1000 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID:

On 21 December 2016 at 01:35, Masayuki YAMAMOTO wrote: > 2016-12-20 22:30 GMT+09:00 Erik Bray : > >> This is probably an implementation detail, but ISTM that even with >> PyThread_call_once, it will be necessary to reset any used once_flags >> manually in PyOS_AfterFork, essentially for the same reason the >> autoTLSkey is reset there currently... >> > > Deleting threads key is executed on *_Fini functions, but Py_FinalizeEx > function that calls *_Fini functions doesn't terminate CPython interpreter. > Furthermore, source comment and document have said description about > reinitialization after calling Py_FinalizeEx. [1] [2] That is to say there > is an implicit possible that is reinitialization contrary to name > "call_once" on a process level. Therefore, if CPython interpreter continues > to allow reinitialization, I'd suggest to rename the call_once API to avoid > misreading semantics. (for example, safe_init, check_init) >

Ouch, I'd missed that, and I agree it's not a negligible implementation detail - there are definitely applications embedding CPython out there that rely on being able to run multiple Initialize/Finalize cycles in the same process and have everything "just work". It also means using the "PyThread_*" prefix for the initialisation tracking aspect would be misleading, since the life cycle details are:

1. Create the key for the first time if it has never been previously set in the process
2. Destroy and reinit if Py_Finalize gets called
It also means using the "PyThread_*" prefix for the initialisation tracking aspect would be misleading, since the life cycle details are: 1. Create the key for the first time if it has never been previously set in the process 2. Destroy and reinit if Py_Finalize gets called 3. Destroy and reinit if a new subprocess is forked It also means we can't use pthread_once even in the pthread TLS implementation, since it doesn't provide those semantics. So I see two main alternatives here. Option 1: Modify the proposed PyThread_tss_create and PyThread_tss_delete APIs to accept a "bool *init_flag" pointer in addition to their current arguments. If *init_flag is true, then PyThread_tss_create is a no-op, otherwise it sets the flag to true after creating the key. If *init_flag is false, then PyThread_tss_delete is a no-op, otherwise it sets the flag to false after deleting the key. Option 2: Similar to option 1, but using a custom type alias, rather than using a C99 bool directly The closest API we have to these semantics at the moment would be PyGILState_Ensure, so the following API naming might work for option 2: Py_ensure_t Py_ENSURE_NEEDS_INIT Py_ENSURE_INITIALIZED Respectively, these would just be aliases for bool, false, and true. And then modify the proposed PyThread_tss_create and PyThread_tss_delete APIs to accept a "Py_ensure_t *init_flag" in addition to their current arguments. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From erik.m.bray at gmail.com Wed Dec 21 05:01:13 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Wed, 21 Dec 2016 11:01:13 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On Wed, Dec 21, 2016 at 2:10 AM, Nick Coghlan wrote: > On 21 December 2016 at 01:35, Masayuki YAMAMOTO > wrote: >> >> 2016-12-20 22:30 GMT+09:00 Erik Bray : >>> >>> This is probably an implementation detail, but ISTM that even with >>> PyThread_call_once, it will be necessary to reset any used once_flags >>> manually in PyOS_AfterFork, essentially for the same reason the >>> autoTLSkey is reset there currently... >> >> >> Deleting threads key is executed on *_Fini functions, but Py_FinalizeEx >> function that calls *_Fini functions doesn't terminate CPython interpreter. >> Furthermore, source comment and document have said description about >> reinitialization after calling Py_FinalizeEx. [1] [2] That is to say there >> is an implicit possible that is reinitialization contrary to name >> "call_once" on a process level. Therefore, if CPython interpreter continues >> to allow reinitialization, I'd suggest to rename the call_once API to avoid >> misreading semantics. (for example, safe_init, check_init) > > > Ouch, I'd missed that, and I agree it's not a negligible implementation > detail - there are definitely applications embedding CPython out there that > rely on being able to run multiple Initialize/Finalize cycles in the same > process and have everything "just work". It also means using the > "PyThread_*" prefix for the initialisation tracking aspect would be > misleading, since the life cycle details are: > > 1. Create the key for the first time if it has never been previously set in > the process > 2. Destroy and reinit if Py_Finalize gets called > 3. 
Destroy and reinit if a new subprocess is forked > > It also means we can't use pthread_once even in the pthread TLS > implementation, since it doesn't provide those semantics. > > So I see two main alternatives here. > > Option 1: Modify the proposed PyThread_tss_create and PyThread_tss_delete > APIs to accept a "bool *init_flag" pointer in addition to their current > arguments. > > If *init_flag is true, then PyThread_tss_create is a no-op, otherwise it > sets the flag to true after creating the key. > If *init_flag is false, then PyThread_tss_delete is a no-op, otherwise it > sets the flag to false after deleting the key. > > Option 2: Similar to option 1, but using a custom type alias, rather than > using a C99 bool directly > > The closest API we have to these semantics at the moment would be > PyGILState_Ensure, so the following API naming might work for option 2: > > Py_ensure_t > Py_ENSURE_NEEDS_INIT > Py_ENSURE_INITIALIZED > > Respectively, these would just be aliases for bool, false, and true. > > And then modify the proposed PyThread_tss_create and PyThread_tss_delete > APIs to accept a "Py_ensure_t *init_flag" in addition to their current > arguments. That all sounds good--between the two option 2 looks a bit more explicit. Though what about this? Rather than adding another type, the original proposal could be changed slightly so that Py_tss_t *is* partially defined as a struct consisting of a bool, with whatever the native TLS key is. E.g. typedef struct { bool init_flag; #if defined(_POSIX_THREADS) pthreat_key_t key; #elif defined (NT_THREADS) DWORD key; /* etc... */ } Py_tss_t; Then it's just taking Masayuki's original patch, with the global bool variables, and formalizing that by combining the initialized flag with the key, and requiring the semantics you described above for PyThread_tss_create/delete. For Python's purposes it seems like this might be good enough, with the more general purpose pthread_once-like functionality not required. Best, Erik From erik.m.bray at gmail.com Wed Dec 21 05:04:46 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Wed, 21 Dec 2016 11:04:46 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On Wed, Dec 21, 2016 at 11:01 AM, Erik Bray wrote: > That all sounds good--between the two option 2 looks a bit more explicit. > > Though what about this? Rather than adding another type, the original > proposal could be changed slightly so that Py_tss_t *is* partially > defined as a struct consisting of a bool, with whatever the native TLS > key is. E.g. > > typedef struct { > bool init_flag; > #if defined(_POSIX_THREADS) > pthreat_key_t key; *pthread_key_t* of course, though I wonder if that was a Freudian slip :) > #elif defined (NT_THREADS) > DWORD key; > /* etc... */ > } Py_tss_t; > > Then it's just taking Masayuki's original patch, with the global bool > variables, and formalizing that by combining the initialized flag with > the key, and requiring the semantics you described above for > PyThread_tss_create/delete. > > For Python's purposes it seems like this might be good enough, with > the more general purpose pthread_once-like functionality not required. Of course, that's not to say it might not be useful for some other purpose, but then it's outside the scope of this discussion as long as it isn't needed for TLS key initialization. 
From ncoghlan at gmail.com Wed Dec 21 11:07:07 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 22 Dec 2016 02:07:07 +1000 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID:

On 21 December 2016 at 20:01, Erik Bray wrote: > On Wed, Dec 21, 2016 at 2:10 AM, Nick Coghlan wrote: > > Option 2: Similar to option 1, but using a custom type alias, rather than > > using a C99 bool directly > > > > The closest API we have to these semantics at the moment would be > > PyGILState_Ensure, so the following API naming might work for option 2: > > > > Py_ensure_t > > Py_ENSURE_NEEDS_INIT > > Py_ENSURE_INITIALIZED > > > > Respectively, these would just be aliases for bool, false, and true. > > > > And then modify the proposed PyThread_tss_create and PyThread_tss_delete > > APIs to accept a "Py_ensure_t *init_flag" in addition to their current > > arguments. > > That all sounds good--between the two option 2 looks a bit more explicit. > > Though what about this? Rather than adding another type, the original > proposal could be changed slightly so that Py_tss_t *is* partially > defined as a struct consisting of a bool, with whatever the native TLS > key is. E.g. > > typedef struct { > bool init_flag; > #if defined(_POSIX_THREADS) > pthreat_key_t key; > #elif defined (NT_THREADS) > DWORD key; > /* etc... */ > } Py_tss_t; > > Then it's just taking Masayuki's original patch, with the global bool > variables, and formalizing that by combining the initialized flag with > the key, and requiring the semantics you described above for > PyThread_tss_create/delete. > > For Python's purposes it seems like this might be good enough, with > the more general purpose pthread_once-like functionality not required. >

Aye, I also thought of that approach, but talked myself out of it since there's no definable default value for pthread_key_t. However, C99 partial initialisation may deal with that for us (by zeroing the memory without actually assigning a typed value to it), and if it does, I agree it would be better to handle the initialisation flag automatically rather than requiring callers to do it.

Cheers, Nick.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From flying-sheep at web.de Fri Dec 23 11:03:15 2016 From: flying-sheep at web.de (Philipp A.) Date: Fri, 23 Dec 2016 16:03:15 +0000 Subject: [Python-ideas] PEP 536 - Call for help and improvement Message-ID:

Hi Python Ideas,

And merry Christmas!

Once upon a time -- in August this year -- I started a (somewhat badly titled) thread about improving the f-string grammar: https://mail.python.org/pipermail/python-ideas/2016-August/041727.html

Luckily it resulted in an interim grammar change that invalidated a misleading property of the original grammar: To the rejoicing of syntax highlighters and humans everywhere, it's no longer possible to escape syntactically relevant characters such as the f-string braces: f'\x7bvariable}'

Now I created a PEP that makes f-strings work just like every other languages'
string interpolation, enabling arbitrary nesting of Python expressions in the expression parts of f-strings: https://github.com/python/peps/blob/master/pep-0536.txt

All I want for Christmas is your help: Please tell me how to improve wording, structure, or clarity of my PEP's message (ideally via PR to https://github.com/flying-sheep/peps)

I fear going forward I will also need guidance for the implementation part, as my only close-to-the-metal experiences are dabbling in C++, and the higher-level language Rust.

Thank you and happy holidays!

Philipp

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From ma3yuki.8mamo10 at gmail.com Fri Dec 23 19:33:00 2016 From: ma3yuki.8mamo10 at gmail.com (Masayuki YAMAMOTO) Date: Sat, 24 Dec 2016 09:33:00 +0900 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython Message-ID:

2016-12-21 19:01 GMT+09:00 Erik Bray : > On Wed, Dec 21, 2016 at 2:10 AM, Nick Coghlan wrote: > > Ouch, I'd missed that, and I agree it's not a negligible implementation > > detail - there are definitely applications embedding CPython out there > that > > rely on being able to run multiple Initialize/Finalize cycles in the same > > process and have everything "just work". It also means using the > > "PyThread_*" prefix for the initialisation tracking aspect would be > > misleading, since the life cycle details are: > > > > 1. Create the key for the first time if it has never been previously set > in > > the process > > 2. Destroy and reinit if Py_Finalize gets called > > 3. Destroy and reinit if a new subprocess is forked > > > > It also means we can't use pthread_once even in the pthread TLS > > implementation, since it doesn't provide those semantics. > > > > So I see two main alternatives here. > > > > Option 1: Modify the proposed PyThread_tss_create and PyThread_tss_delete > > APIs to accept a "bool *init_flag" pointer in addition to their current > > arguments. > > > > If *init_flag is true, then PyThread_tss_create is a no-op, otherwise it > > sets the flag to true after creating the key. > > If *init_flag is false, then PyThread_tss_delete is a no-op, otherwise it > > sets the flag to false after deleting the key. > > > > Option 2: Similar to option 1, but using a custom type alias, rather than > > using a C99 bool directly > > > > The closest API we have to these semantics at the moment would be > > PyGILState_Ensure, so the following API naming might work for option 2: > > > > Py_ensure_t > > Py_ENSURE_NEEDS_INIT > > Py_ENSURE_INITIALIZED > > > > Respectively, these would just be aliases for bool, false, and true. > > > > And then modify the proposed PyThread_tss_create and PyThread_tss_delete > > APIs to accept a "Py_ensure_t *init_flag" in addition to their current > > arguments. > > That all sounds good--between the two option 2 looks a bit more explicit. > > Though what about this? Rather than adding another type, the original > proposal could be changed slightly so that Py_tss_t *is* partially > defined as a struct consisting of a bool, with whatever the native TLS > key is. E.g. > > typedef struct { > bool init_flag; > #if defined(_POSIX_THREADS) > pthreat_key_t key; > #elif defined (NT_THREADS) > DWORD key; > /* etc... */ > } Py_tss_t; > > Then it's just taking Masayuki's original patch, with the global bool > variables, and formalizing that by combining the initialized flag with > the key, and requiring the semantics you described above for > PyThread_tss_create/delete.
> > For Python's purposes it seems like this might be good enough, with > the more general purpose pthread_once-like functionality not required. > > Best, > Erik

As mentioned above, in the current TLS API the thread key uses -1 as its defined invalid value. If the new TLS API inherits the requirement that the key have a defined invalid value, putting the key and the flag into one structure seems semantically correct. In that case, I think the TLS API should supply the defined invalid value (like PTHREAD_ONCE_INIT) to API users. Moreover, the structure gives us an opportunity to assert, via the field name, that the key type is opaque. I think the suggestion would improve the understandability of the API, because a good field name can signal that reading and writing the key directly would be incorrect (even if API users don't read the precautionary note).

Have a nice holiday! Masayuki

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From steve.dower at python.org Sat Dec 24 10:59:51 2016 From: steve.dower at python.org (Steve Dower) Date: Sat, 24 Dec 2016 07:59:51 -0800 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: Message-ID:

Right. Platforms that have a defined invalid value don't need the struct, and so they can define the type differently. It just means we also need to provide a macro for testing whether it's been created or not, and users should genuinely treat the value as opaque.
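Something like this rough sketch, say for the Windows branch (the macro names are illustrative only, not an actual proposal):

    #include <windows.h>

    /* On Windows, TlsAlloc() never returns TLS_OUT_OF_INDEXES for a
       valid index, so that value can double as the "not yet created"
       sentinel--no struct or separate flag needed. */
    typedef DWORD Py_tss_t;

    #define Py_tss_NEEDS_INIT TLS_OUT_OF_INDEXES
    #define PyThread_tss_is_created(key) ((key) != TLS_OUT_OF_INDEXES)

    int
    PyThread_tss_create(Py_tss_t *key)
    {
        if (PyThread_tss_is_created(*key))
            return 0;                   /* already created: no-op */
        *key = TlsAlloc();
        return PyThread_tss_is_created(*key) ? 0 : -1;
    }

Static declarations would then look like "static Py_tss_t my_key = Py_tss_NEEDS_INIT;".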
Cheers, Steve

Top-posted from my Windows Phone

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From mistersheik at gmail.com Sat Dec 24 14:42:46 2016 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 24 Dec 2016 11:42:46 -0800 (PST) Subject: [Python-ideas] (no subject) In-Reply-To: References: Message-ID:

On Tuesday, November 29, 2016 at 4:08:19 AM UTC-5, Victor Stinner wrote: > > Hi, > > Python is optimized for performance. Formatting an error message has a > cost on performances. > >

Usually, when an exception is hit that will (probably) crash the program, no one cares about less than a microsecond of performance.

> I suggest you to teach your student to use the REPL and use a custom > exception handler: sys.excepthook: > https://docs.python.org/2/library/sys.html#sys.excepthook > > Using a custom exception handler, you can run expensive functions, > like the feature: "suggest len when length is used". > > The problem is then when students have to use a Python without the > custom exception handler. > > Victor > _______________________________________________ > Python-ideas mailing list > Python... at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ >

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From tomuxiong at gmx.com Sat Dec 24 14:57:06 2016 From: tomuxiong at gmx.com (Thomas Nyberg) Date: Sat, 24 Dec 2016 11:57:06 -0800 Subject: [Python-ideas] (no subject) In-Reply-To: References: Message-ID:

On 12/24/2016 11:42 AM, Neil Girdhar wrote: > Usually, when an exception is hit that will (probably) crash the > program, no one cares about less than a microsecond of performance.

I would probably agree with you in the SyntaxError example, but not for the others. Programming with exceptions is totally standard in Python and they are often used in tight loops.
From zaharid at gmail.com  Sun Dec 25 13:24:44 2016
From: zaharid at gmail.com (Zahari Dim)
Date: Sun, 25 Dec 2016 19:24:44 +0100
Subject: [Python-ideas] AtributeError inside __get__
Message-ID:

Hi,

The other day I came across a particularly ugly bug. A simplified case
goes like:

class X:
    @property
    def y(self):
        return self.nonexisting

hasattr(X(),'y')

This returns False because hasattr calls the property, which in turn
raises an AttributeError, which is then used to decide that the property
doesn't exist, even though it does. This is arguably unexpected and
surprising and can be very difficult to understand if it happens
within a large codebase. Given the precedent with generator_stop,
which solves a similar problem for StopIteration, I was wondering if
it would be possible to have the __get__ method convert the
AttributeErrors raised inside it to RuntimeErrors.

The situation with this is a little more complicated because there
could be a (possibly strange) case where one might want to raise an
AttributeError inside __get__. But maybe the specification can be
changed so either `raise ForceAttributeError()` or `return
NotImplemented` achieves the same effect.


Merry Christmas!
Zahari.

From prometheus235 at gmail.com  Sun Dec 25 16:03:23 2016
From: prometheus235 at gmail.com (Nick Timkovich)
Date: Sun, 25 Dec 2016 16:03:23 -0500
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To:
References:
Message-ID:

Are you saying that hasattr returning False was hiding a bug or is a bug?
The former could be annoying to track down, though hasattr(X, 'y') ==
True. For the latter, having hasattr return False if an AttributeError is
raised would allow the property decorator to retain identical
functionality if it is used to replace a (sometimes) existing attribute.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
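A minimal runnable sketch of the pitfall being described, using the same
toy class from the report (only the print calls are added here):

class X:
    @property
    def y(self):
        return self.nonexisting  # bug: this attribute doesn't exist

obj = X()
print(hasattr(obj, 'y'))  # False, even though X clearly defines y
try:
    obj.y  # accessing it directly surfaces the real error instead
except AttributeError as e:
    print(e)  # 'X' object has no attribute 'nonexisting'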
From victor.stinner at gmail.com  Sun Dec 25 16:51:07 2016
From: victor.stinner at gmail.com (Victor Stinner)
Date: Sun, 25 Dec 2016 22:51:07 +0100
Subject: [Python-ideas] (no subject)
In-Reply-To:
References:
Message-ID:

On 24 Dec 2016 8:42 PM, "Neil Girdhar" wrote:
> Usually, when an exception is hit that will (probably) crash the
> program, no one cares about less than a microsecond of performance.

Just one example. By design, hasattr(obj, name) raises an exception to
return False.

So it has the cost of building the exception + raise exc + catch it.

Victor
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rosuav at gmail.com  Sun Dec 25 16:59:51 2016
From: rosuav at gmail.com (Chris Angelico)
Date: Mon, 26 Dec 2016 08:59:51 +1100
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To:
References:
Message-ID:

On Mon, Dec 26, 2016 at 8:03 AM, Nick Timkovich wrote:
> Are you saying that hasattr returning False was hiding a bug or is a bug?
> The former could be annoying to track down, though hasattr(X, 'y') == True.
> For the latter, having hasattr return False if an AttributeError is raised
> would allow the property decorator to retain identical functionality if it
> is used to replace a (sometimes) existing attribute.

This was touched on during the StopIteration discussions, but left
aside (it's not really connected, other than that exceptions are used
as a signal). It's more that a property function raising AttributeError
makes it look like it doesn't exist. Worth noting, though: The
confusion only really comes up with hasattr. If you simply try to
access the property, you get an exception that identifies the exact
fault:

>>> X().y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in y
AttributeError: 'X' object has no attribute 'nonexisting'

Interestingly, the exception doesn't seem to have very useful arguments:

>>> ee.args
("'X' object has no attribute 'nonexisting'",)

So here's a two-part proposal that would solve Zahari's problem:

1) Enhance AttributeError to include arguments for the parts in
quotes, for i18n independence.
2) Provide, in the docs, a hasattr replacement that checks the exception's
args.

The new hasattr would look like this:

def hasattr(obj, name):
    try:
        getattr(obj, name)
        return True
    except AttributeError as e:
        if e.args[1] == obj.__class__.__name__ and e.args[2] == name:
            return False
        raise

Since it's just a recipe in the docs, you could also have a version
that works on current Pythons, but it'd need to do string manipulation
to compare - something like:

def hasattr(obj, name):
    try:
        getattr(obj, name)
        return True
    except AttributeError as e:
        if e.args[0] == "%r object has no attribute %r" % (
                obj.__class__.__name__, name):
            return False
        raise

I can't guarantee that this doesn't get some edge cases wrong, eg if
you have weird characters in your name. But it'll deal with the normal
cases, and it doesn't need any language changes - just paste that at
the top of your file.

Zahari, would this solve your problem?

ChrisA
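To make the behaviour of that recipe concrete, here is how the
current-Python, string-matching version plays out against the buggy
property from the start of the thread (hasattr2 is a hypothetical name
used here only to avoid shadowing the builtin):

def hasattr2(obj, name):
    try:
        getattr(obj, name)
        return True
    except AttributeError as e:
        if e.args[0] == "%r object has no attribute %r" % (
                obj.__class__.__name__, name):
            return False
        raise

class X:
    @property
    def y(self):
        return self.nonexisting

hasattr2(X(), 'y')  # raises AttributeError about 'nonexisting'
                    # instead of silently returning False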
From rosuav at gmail.com  Sun Dec 25 17:01:34 2016
From: rosuav at gmail.com (Chris Angelico)
Date: Mon, 26 Dec 2016 09:01:34 +1100
Subject: [Python-ideas] (no subject)
In-Reply-To:
References:
Message-ID:

On Mon, Dec 26, 2016 at 8:51 AM, Victor Stinner wrote:
> On 24 Dec 2016 8:42 PM, "Neil Girdhar" wrote:
>> Usually, when an exception is hit that will (probably) crash the program,
>> no one cares about less than a microsecond of performance.
>
> Just one example. By design, hasattr(obj, name) raises an exception to
> return False.
>
> So it has the cost of building the exception + raise exc + catch it.

Printing an exception to the console can afford to be expensive,
though. So if the work can be pushed into __str__, it won't hurt
anything that try/excepts around it.

ChrisA

From ncoghlan at gmail.com  Sun Dec 25 21:04:58 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 26 Dec 2016 12:04:58 +1000
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To:
References:
Message-ID:

On 26 December 2016 at 04:24, Zahari Dim wrote:
> Hi,
>
> The other day I came across a particularly ugly bug. A simplified case
> goes like:
>
> class X:
>     @property
>     def y(self):
>         return self.nonexisting
>
> hasattr(X(),'y')
>
> This returns False because hasattr calls the property which in turn
> raises an AttributeError which is used to determine that the property
> doesn't exist, even if it does. This is arguably unexpected and
> surprising and can be very difficult to understand if it happens
> within a large codebase. Given the precedent with generator_stop,
> which solves a similar problem for StopIteration, I was wondering if
> it would be possible to have the __get__ method convert the
> AttributeErrors raised inside it to RuntimeErrors.
>
> The situation with this is a little more complicated because there
> could be a (possibly strange) case where one might want to raise an
> AttributeError inside __get__.

There are a lot of entirely valid properties that look something like
this:

    @property
    def attr(self):
        try:
            return data_store[lookup_key]
        except KeyError:
            raise AttributeError("attr")

And unlike StopIteration (where either "return" or "raise
StopIteration" could be used), that *is* the way for a property method
to indicate "attribute not actually present".

This is one of the many cases where IDEs with some form of static
structural checking really do make development easier - the
"self.nonexisting" would be flagged as non-existent directly in the
editor, even before you attempted to run the code.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From zaharid at gmail.com  Mon Dec 26 06:23:18 2016
From: zaharid at gmail.com (Zahari Dim)
Date: Mon, 26 Dec 2016 12:23:18 +0100
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To:
References:
Message-ID:

> There are a lot of entirely valid properties that look something like this:
>
>
>     @property
>     def attr(self):
>         try:
>             return data_store[lookup_key]
>         except KeyError:
>             raise AttributeError("attr")

But wouldn't something like this be implemented more commonly with
__getattr__ instead (likely there is more than one such property in a
real example)? Even though __getattr__ has a similar problem (a bad
AttributeError inside can cause many bugs), I'd agree it would probably
be too difficult to change that without breaking a lot of code. For
__get__, the errors are arguably more confusing (e.g. when used with
@property) and the legitimate use case, while existing, seems more
infrequent to me: I did a github search and there was a small number of
cases, but most were for code written in Python 2 anyway.
Here are a couple of valid ones:

https://github.com/dimavitvickiy/server/blob/a9a6ea2a155b56b84d20a199b5948418d0dbf169/orm/decorators.py
https://github.com/dropbox/pyston/blob/75562e57a8ec2f6f7bd0cf52012d49c0dc3d2155/test/tests/static_class_methods.py

Cheers,
Zahari

>
> This is one of the many cases where IDEs with some form of static structural
> checking really do make development easier - the "self.nonexisting" would be
> flagged as non-existent directly in the editor, even before you attempted to
> run the code.

In my particular case, the class had a __getattr__ that generated
properties dynamically. Therefore an IDE was unlikely to be helpful.

>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From bzvi7919 at gmail.com  Mon Dec 26 08:40:19 2016
From: bzvi7919 at gmail.com (Bar Harel)
Date: Mon, 26 Dec 2016 13:40:19 +0000
Subject: [Python-ideas] singledispatch for instance methods
In-Reply-To:
References:
Message-ID:

Any updates on singledispatch for methods?

On Tue, Sep 20, 2016, 5:49 PM Bar Harel wrote:

> At last! Haven't used single dispatch exactly because of that. Thank you
> savior!
> +1
>
> On Tue, Sep 20, 2016, 6:03 AM Tim Mitchell
> wrote:
>
>> Hi All,
>>
>> We have a modified version of singledispatch at work which works for
>> methods as well as functions. We have open-sourced it as methoddispatch
>> (pypi: https://pypi.python.org/pypi/methoddispatch).
>>
>> IMHO I thought it would make a nice addition to python stdlib.
>>
>> What does everyone else think?
>>
>>
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From zaharid at gmail.com  Tue Dec 27 05:11:24 2016
From: zaharid at gmail.com (Zahari)
Date: Tue, 27 Dec 2016 02:11:24 -0800 (PST)
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To:
References:
Message-ID: <2618d38e-74a0-41f5-8821-de88259136a5@googlegroups.com>

> So here's a two-part proposal that would solve Zahari's problem:
>
> 1) Enhance AttributeError to include arguments for the parts in
> quotes, for i18n independence.
> 2) Provide, in the docs, a hasattr replacement that checks the exception's
> args.
>
> The new hasattr would look like this:
>
> def hasattr(obj, name):
>     try:
>         getattr(obj, name)
>         return True
>     except AttributeError as e:
>         if e.args[1] == obj.__class__.__name__ and e.args[2] == name:
>             return False
>         raise
>
> Since it's just a recipe in the docs, you could also have a version
> that works on current Pythons, but it'd need to do string manipulation
> to compare - something like:
>
> def hasattr(obj, name):
>     try:
>         getattr(obj, name)
>         return True
>     except AttributeError as e:
>         if e.args[0] == "%r object has no attribute %r" % (
>                 obj.__class__.__name__, name):
>             return False
>         raise
>
> I can't guarantee that this doesn't get some edge cases wrong, eg if
> you have weird characters in your name. But it'll deal with the normal
> cases, and it doesn't need any language changes - just paste that at
> the top of your file.
>
> Zahari, would this solve your problem?
>

This looks like a good idea. Note that there is also getattr(X(), 'y',
'default') that would have to behave like this.

Cheers,
Zahari

>
> ChrisA
> _______________________________________________
> Python-ideas mailing list
> Python... at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rosuav at gmail.com  Tue Dec 27 06:58:44 2016
From: rosuav at gmail.com (Chris Angelico)
Date: Tue, 27 Dec 2016 22:58:44 +1100
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To: <2618d38e-74a0-41f5-8821-de88259136a5@googlegroups.com>
References: <2618d38e-74a0-41f5-8821-de88259136a5@googlegroups.com>
Message-ID:

On Tue, Dec 27, 2016 at 9:11 PM, Zahari wrote:
> This looks like a good idea. Note that there is also getattr(X(), 'y',
> 'default') that would have to behave like this.
>

Forgot about that. Feel free to enhance the hasattr replacement. I
still think the parameterization of AttributeError would be worth
doing, but the two are independent.

ChrisA

From ammar at ammaraskar.com  Tue Dec 27 12:25:48 2016
From: ammar at ammaraskar.com (Ammar Askar)
Date: Tue, 27 Dec 2016 22:25:48 +0500
Subject: [Python-ideas] Function arguments in tracebacks
Message-ID:

Consider the following similar C and Python code and their
tracebacks:

C
-------
int divide(int x, int y, char* some_string) {
    return x / y;
}
int main(...) {
    divide(2, 0, "Hello World");
}
-------
Program received signal SIGFPE, Arithmetic exception.
(gdb) bt
#0  0x00000000004004c4 in divide (x=2, y=0, some_string=0x4005a8
"Hello World") at test.c:2
#1  0x00000000004004e7 in main (argc=1, argv=0x7fffffffe328) at test.c:6

Python
-------
def divide(x, y, some_string):
    return x / y

divide(2, 0, "Hello World")
-------
Traceback (most recent call last):
  File "test.py", line 4, in <module>
  File "test.py", line 2, in divide
ZeroDivisionError: division by zero


By including the function arguments within the traceback, we
can get more information at a glance than we could with just
the names of methods.

This would be pretty cool and stop the occasional "printf"
debugging without cluttering up the traceback too much.

There will definitely need to be some reasonable line length
limit because the repr() of parameters could be really long.
In similar situations gdb replaces the value in the traceback
with an ellipsis, and I believe that's a good solution for python
as well.

Obviously this isn't a great example since the error is immediately
obvious but I think this could be potentially useful in a bunch
of situations.

I've made a quick toy implementation in traceback.c, this is what
it looks like for the script above.

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    divide(2, 0, "Hello World")
  File "test.py", line 2, in divide (x=2, y=0, some_string='Hello World')
    return x / y
ZeroDivisionError: division by zero


== Potential Downsides ==

There's probably a lot more than these, but I could only think of
these so far.

* Private data might be leaked, imagine a

  def login(username, password):
      ...

  method. While function names, source files, and source code can also
  be private, variables can potentially contain all kinds of sensitive
  data.

* A variable that takes a long time to return a string representation may
  significantly slow down the time it takes to generate a traceback.

* We can really only return the state of the variables when the
  traceback is printed, this might result in some slightly un-intuitive
  behavior. (Easier to explain with an example)

  def f(x):
      x = 2
      raise Exception()

  f(1)

  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "<stdin>", line 3, in f (x=2)

  The fact that x is mutated within the function body means that the
  value printed in the traceback is the changed value, which might be
  slightly misleading.


I'd love to hear your thoughts on the idea.
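For readers who want to experiment without a patched traceback.c,
something in the same spirit can be approximated in pure Python with a
custom sys.excepthook. This is only a hedged sketch of the idea, not
Ammar's C implementation; reprlib is used to cap long reprs, echoing
the ellipsis suggestion above:

import sys
import inspect
import reprlib

def excepthook(exc_type, exc, tb):
    # walk the traceback and print each frame with its argument values
    short = reprlib.Repr()
    print('Traceback (most recent call last):')
    while tb is not None:
        frame = tb.tb_frame
        arginfo = inspect.getargvalues(frame)
        args = ', '.join('%s=%s' % (name, short.repr(arginfo.locals[name]))
                         for name in arginfo.args)
        print('  File "%s", line %d, in %s (%s)' % (
            frame.f_code.co_filename, tb.tb_lineno,
            frame.f_code.co_name, args))
        tb = tb.tb_next
    print('%s: %s' % (exc_type.__name__, exc))

sys.excepthook = excepthook

Note that, exactly as described above, this shows the values the
arguments have when the traceback is printed, not the values the
function was called with.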
From jab at math.brown.edu  Tue Dec 27 22:13:59 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Tue, 27 Dec 2016 22:13:59 -0500
Subject: [Python-ideas] incremental hashing in __hash__
Message-ID:

Suppose you have implemented an immutable Position type to represent
the state of a game played on an MxN board, where the board size can
grow quite large.

Or suppose you have implemented an immutable, ordered collection type.
For example, the collections-extended package provides a
frozensetlist[1]. One of my own packages provides a frozen, ordered
bidirectional mapping type.[2]

These types should be hashable so that they can be inserted into sets
and mappings. The order-sensitivity of the contents prevents them from
using the built-in collections.Set._hash() helper in their __hash__
implementations, to keep from unnecessarily causing hash collisions for
objects that compare unequal due only to having a different ordering of
the same set of contained items.

According to
https://docs.python.org/3/reference/datamodel.html#object.__hash__ :

"""
it is advised to mix together the hash values of the components of the
object that also play a part in comparison of objects by packing them
into a tuple and hashing the tuple. Example:

def __hash__(self):
    return hash((self.name, self.nick, self.color))

"""

Applying this advice to the use cases above would require creating an
arbitrarily large tuple in memory before passing it to hash(), which
is then just thrown away. It would be preferable if there were a way
to pass multiple values to hash() in a streaming fashion, such that
the overall hash were computed incrementally, without building up a
large object in memory first.

Should there be better support for this use case? Perhaps hash() could
support an alternative signature, allowing it to accept a stream of
values whose combined hash would be computed incrementally in
*constant* space and linear time, e.g. "hash(items=iter(self))".

In the meantime, what is the best way to incrementally compute a good
hash value for such objects using built-in Python routines? (As a
library author, it would be preferable to use a routine with explicit
support for computing a hash incrementally, rather than having to worry
about how to correctly combine results from multiple calls to
hash(contained_item) in library code. (Simply XORing such results
together would not be order-sensitive, and so wouldn't work.) Using a
routine with explicit support for incremental hashing would allow
libraries to focus on doing one thing well.[3,4,5])

I know that hashlib provides algorithms that support incremental
hashing, but those use at least 128 bits. Since hash() throws out
anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in
hashlib seems like overkill. Am I right in thinking that's the wrong
tool for the job?

On the other hand, would binascii.crc32 be suitable, at least for
32-bit systems? (And is there some 64-bit incremental hash algorithm
available for 64-bit systems? It seems Python has no support for crc64
built in.)
For example:

import binascii, struct

class FrozenOrderedCollection:
    def __hash__(self):
        if hasattr(self, '_hashval'):  # Computed lazily.
            return self._hashval
        hv = binascii.crc32(b'FrozenOrderedCollection')
        for i in self:
            hv = binascii.crc32(struct.pack('@l', hash(i)), hv)
        hv &= 0xffffffff
        self._hashval = hv
        return hv

Note that this example illustrates two other common requirements of
these use cases:

(i) lazily computing the hash value on first use, and then caching it
for future use

(ii) priming the overall hash value with some class-specific initial
value, so that if an instance of a different type of collection, which
comprised the same items but which compared unequal, were to compute
its hash value out of the same constituent items, we make sure our
hash value differs.

(On that note, should the documentation in
https://docs.python.org/3/reference/datamodel.html#object.__hash__
quoted above be updated to add this advice? The current advice to
"return hash((self.name, self.nick, self.color))" would cause a hash
collision with a tuple of the same values, even though the tuple should
presumably compare unequal with this object.)

To summarize these questions:

1. Should hash() add support for incremental hashing?

2. In the meantime, what is the best way to compute a hash of a
combination of many values incrementally (in constant space and linear
time), using only what's available in the standard library? Ideally
there is some routine available that uses exactly
hash_info.hash_bits number of bits, and that does the combining of
incremental results for you.

3. Should the
https://docs.python.org/3/reference/datamodel.html#object.__hash__
documentation be updated to include suitable advice for these use
cases, in particular, that the overall hash value should be computed
lazily, incrementally, and should be primed with a class-unique value?

Thanks in advance for a helpful discussion, and best wishes.

Josh

References:

[1] http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist
[2] https://bidict.readthedocs.io/en/dev/api.html#bidict.frozenorderedbidict
[3] http://stackoverflow.com/questions/2909106/python-whats-a-correct-and-good-way-to-implement-hash#comment28193015_19073010
[4] http://stackoverflow.com/a/2909572/161642
[5] http://stackoverflow.com/a/27952689/161642

From rymg19 at gmail.com  Tue Dec 27 22:28:04 2016
From: rymg19 at gmail.com (Ryan Gonzalez)
Date: Tue, 27 Dec 2016 21:28:04 -0600
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References:
Message-ID:

You could always try to make a Python version of the C tuple hashing
function[1] (requires the total # of elements) or PyPy's[2] (seems like
it would allow true incremental hashing). API idea:


hasher = IncrementalHasher()
hasher.add(one_item_to_hash)  # updates hasher.hash property with result
# repeat
return hasher.hash


[1]: https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331
[2]: https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562

--
Ryan (????)
Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
http://kirbyfan64.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
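Ryan's IncrementalHasher idea can be sketched in a few lines of pure
Python. This hedged version just chains hash() over 2-tuples rather
than porting the C or RPython internals he links to, so it only
illustrates the proposed API shape, not the CPython algorithm:

class IncrementalHasher:
    def __init__(self):
        self.hash = hash(())  # seed: the hash of an empty tuple

    def add(self, one_item_to_hash):
        # mix the running hash with the next item's hash
        self.hash = hash((self.hash, one_item_to_hash))

hasher = IncrementalHasher()
for item in ('name', 'nick', 'color'):
    hasher.add(item)
print(hasher.hash)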
From jab at math.brown.edu  Wed Dec 28 11:00:51 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Wed, 28 Dec 2016 11:00:51 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References:
Message-ID:

I actually have been poking around that code already. I also found
https://github.com/vperron/python-superfasthash/blob/master/superfasthash.py
in case of interest.

But it still seems like library authors with this use case should keep
their library code free of implementation details like this, and
instead use a higher-level API provided by Python.

Thanks,
Josh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ned at nedbatchelder.com Wed Dec 28 11:48:10 2016 From: ned at nedbatchelder.com (Ned Batchelder) Date: Wed, 28 Dec 2016 11:48:10 -0500 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: Message-ID: On 12/27/16 10:13 PM, jab at math.brown.edu wrote: > Applying this advice to the use cases above would require creating an > arbitrarily large tuple in memory before passing it to hash(), which > is then just thrown away. It would be preferable if there were a way > to pass multiple values to hash() in a streaming fashion, such that > the overall hash were computed incrementally, without building up a > large object in memory first. > > Should there be better support for this use case? Perhaps hash() could > support an alternative signature, allowing it to accept a stream of > values whose combined hash would be computed incrementally in > *constant* space and linear time, e.g. "hash(items=iter(self))". You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time: def hash_stream(them): val = 0 for it in them: val = hash((val, it)) return val Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating. Or maybe I am misunderstanding the requirements? --Ned. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Wed Dec 28 12:10:59 2016 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 28 Dec 2016 09:10:59 -0800 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: Message-ID: <5863F223.3040906@stoneleaf.us> On 12/27/2016 07:13 PM, jab at math.brown.edu wrote: > According to the docs [6]: > > """ > it is advised to mix together the hash values of the components of the > object that also play a part in comparison of objects by packing them > into a tuple and hashing the tuple. Example: > > def __hash__(self): > return hash((self.name, self.nick, self.color)) > > """ > > > Applying this advice to the use cases above would require creating an > arbitrarily large tuple in memory before passing it to hash(), which > is then just thrown away. It would be preferable if there were a way > to pass multiple values to hash() in a streaming fashion, such that > the overall hash were computed incrementally, without building up a > large object in memory first. Part of the reason for creating __hash__ like above is that: - it's simple - it's reliable However, it's not the only way to have a hash algorithm that works; in fact, the beginning of the sentence you quoted says: > The only required property is that objects which compare equal have > the same hash value; In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers). > (ii) priming the overall hash value with some class-specific initial > value, so that if an instance of a different type of collection, which > comprised the same items but which compared unequal, were to compute > its hash value out of the same constituent items, we make sure our > hash value differs. This is unnecessary: hashes are compared first as a way to weed out impossible matches, but when the hashes are the same an actual __eq__ test is still done [7]. 
--
~Ethan~

[6] https://docs.python.org/3/reference/datamodel.html#object.__hash__

[7] some test code to prove above points:

--- 8< ------------------------------------------------------------

from unittest import main, TestCase

class Eggs(object):
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.value == other.value
        return NotImplemented

class Spam(object):
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.value == other.value
        return NotImplemented

e10 = Eggs(1)
e20 = Eggs(2)
e11 = Eggs(1)
e21 = Eggs(2)

s10 = Spam(1)
s20 = Spam(2)
s11 = Spam(1)
s21 = Spam(2)

bag = {}
bag[e10] = 1
bag[s10] = 2
bag[e20] = 3
bag[s20] = 4

class TestEqualityAndHashing(TestCase):

    def test_equal(self):
        # same class, same value --> equal
        self.assertEqual(e10, e11)
        self.assertEqual(e20, e21)
        self.assertEqual(s10, s11)
        self.assertEqual(s20, s21)

    def test_not_equal(self):
        # different class, same value --> not equal
        self.assertEqual(e10.value, s10.value)
        self.assertNotEqual(e10, s10)
        self.assertEqual(e20.value, s20.value)
        self.assertNotEqual(e20, s20)

    def test_same_hash(self):
        # same class, same value, same hash
        self.assertEqual(hash(e10), hash(e11))
        self.assertEqual(hash(e20), hash(e21))
        self.assertEqual(hash(s10), hash(s11))
        self.assertEqual(hash(s20), hash(s21))

        # different class, same value, same hash
        self.assertEqual(hash(e10), hash(s10))
        self.assertEqual(hash(e11), hash(s11))
        self.assertEqual(hash(e20), hash(s20))
        self.assertEqual(hash(e21), hash(s21))

    def test_as_key(self):
        # different objects from different classes with same hash should
        # still be distinct
        self.assertEqual(len(bag), 4)
        self.assertEqual(bag[e10], 1)
        self.assertEqual(bag[s10], 2)
        self.assertEqual(bag[e20], 3)
        self.assertEqual(bag[s20], 4)

        # different objects from same classes with same hash should not
        # be distinct
        self.assertEqual(bag[e11], 1)
        self.assertEqual(bag[s11], 2)
        self.assertEqual(bag[e21], 3)
        self.assertEqual(bag[s21], 4)

main()

--- 8< ------------------------------------------------------------

From jab at math.brown.edu  Wed Dec 28 12:27:59 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Wed, 28 Dec 2016 12:27:59 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References:
Message-ID:

On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder wrote:

> You can write a simple function to use hash iteratively to hash
> the entire stream in constant space and linear time:
>
>     def hash_stream(them):
>         val = 0
>         for it in them:
>             val = hash((val, it))
>         return val
>
> Although this creates N 2-tuples, they come and go, so the memory
> use won't grow. Adjust the code as needed to achieve
> canonicalization before iterating.
>
> Or maybe I am misunderstanding the requirements?
>

This is better than solutions like
http://stackoverflow.com/a/27952689/161642 in the sense that it's a
little higher level (no bit shifting or magic numbers).

But it's not clear that it's any better in the sense that you're still
rolling your own incremental hash algorithm out of a lower-level
primitive that doesn't document support for this, and therefore taking
responsibility yourself for how well it distributes values into
buckets.

Are you confident this results in good hash performance? Is this
better than a solution built on top of a hash function with an
explicit API for calculating a hash incrementally, such as the crc32
example I included? (And again, this would ideally be a
sys.hash_info.hash_bits-bit algorithm.)

Don't we still probably want either:

1) Python to provide some such hash_stream() function as a built-in,
or failing that,

2) the https://docs.python.org/3/reference/datamodel.html#object.__hash__
documentation to bless this as the recommended solution to this
problem, thereby providing assurance of its performance?

If that makes sense, I'd be happy to file an issue, and include the
start of a patch providing either 1 or 2.

Thanks very much for the helpful response.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
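A quick sanity check of the property jab is after, run against Ned's
recipe as quoted above: the chained construction is order-sensitive,
whereas simply XOR-ing the item hashes together is not (the example
values are arbitrary):

from functools import reduce
from operator import xor

def hash_stream(them):
    val = 0
    for it in them:
        val = hash((val, it))
    return val

# chaining distinguishes orderings (a collision is possible but unlikely)
print(hash_stream([1, 2, 3]) == hash_stream([3, 2, 1]))  # False

# XOR folding does not: any reordering gives the same value
print(reduce(xor, map(hash, [1, 2, 3])) ==
      reduce(xor, map(hash, [3, 2, 1])))  # True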
From jab at math.brown.edu  Wed Dec 28 12:44:55 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Wed, 28 Dec 2016 12:44:55 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <5863F223.3040906@stoneleaf.us>
References: <5863F223.3040906@stoneleaf.us>
Message-ID:

On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman wrote:

> In other words, objects that do not compare equal can also have the same
> hash value (although too much of that will reduce the efficiency of
> Python's containers).
>

Yes, I realize that unequal objects can return the same hash value with
only performance, and not correctness, suffering. It's the performance
I'm concerned about. That's what I meant by "...to keep from
unnecessarily causing hash collisions..." in my original message, but
sorry this wasn't clearer. We should be able to do this in a way that
doesn't increase hash collisions unnecessarily.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From brett at python.org  Wed Dec 28 14:40:01 2016
From: brett at python.org (Brett Cannon)
Date: Wed, 28 Dec 2016 19:40:01 +0000
Subject: [Python-ideas] Function arguments in tracebacks
In-Reply-To:
References:
Message-ID:

My quick on-vacation response is that attaching more objects to
exceptions is typically viewed as dangerous as it can lead to those
objects being kept alive longer than expected (see the discussions
about richer error messages to see that worry come out for something as
simple as attaching the type to a TypeError).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
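The keep-alive effect Brett describes can be seen even without any new
exception fields, because a stored exception already retains every frame
(and its locals) through __traceback__. A small, hedged demonstration
with invented names:

def leaky():
    big = list(range(10**6))  # stays reachable via the traceback
    raise ValueError('oops')

saved = None
try:
    leaky()
except ValueError as exc:
    saved = exc  # keeping the exception keeps its frames alive

frame = saved.__traceback__.tb_next.tb_frame  # leaky()'s frame
print('big' in frame.f_locals)  # True: the list is still alive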
From emanuel.landeholm at gmail.com  Wed Dec 28 16:01:55 2016
From: emanuel.landeholm at gmail.com (Emanuel Landeholm)
Date: Wed, 28 Dec 2016 22:01:55 +0100
Subject: [Python-ideas] Function arguments in tracebacks
In-Reply-To:
References:
Message-ID:

I think an argument could be made for including the str() of parameters
of primitive types and with small values (for some value of "primitive"
and "small", can of worms here...). I'm thinking numbers and short
strings. Maybe a flag to control this behaviour? My gut feeling is that
this would be a hack with lots of corner cases and surprises so it would
probably not be very helpful in the general case.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From python at mrabarnett.plus.com  Wed Dec 28 16:23:57 2016
From: python at mrabarnett.plus.com (MRAB)
Date: Wed, 28 Dec 2016 21:23:57 +0000
Subject: [Python-ideas] Function arguments in tracebacks
In-Reply-To:
References:
Message-ID:

On 2016-12-28 21:01, Emanuel Landeholm wrote:
> I think an argument could be made for including the str() of parameters
> of primitive types and with small values (for some value of "primitive"
> and "small", can of worms here...). I'm thinking numbers and short
> strings. Maybe a flag to control this behaviour? My gut feeling is that
> this would be a hack with lots of corner cases and surprises so it would
> probably not be very helpful in the general case.
> Don't you mean the repr or ascii because you'll want 'foo' to print as:
'foo' and not as: foo

From ned at nedbatchelder.com  Wed Dec 28 16:27:07 2016
From: ned at nedbatchelder.com (Ned Batchelder)
Date: Wed, 28 Dec 2016 16:27:07 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References:
Message-ID: <72243c5a-dc20-6899-9042-fa8a3a45a847@nedbatchelder.com>

On 12/28/16 12:27 PM, jab at math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder wrote:
>
>     You can write a simple function to use hash iteratively to hash
>     the entire stream in constant space and linear time:
>
>         def hash_stream(them):
>             val = 0
>             for it in them:
>                 val = hash((val, it))
>             return val
>
>     Although this creates N 2-tuples, they come and go, so the memory
>     use won't grow. Adjust the code as needed to achieve
>     canonicalization before iterating.
>
>     Or maybe I am misunderstanding the requirements?
>
> This is better than solutions like
> http://stackoverflow.com/a/27952689/161642 in the sense that it's a
> little higher level (no bit shifting or magic numbers).
>
> But it's not clear that it's any better in the sense that you're still
> rolling your own incremental hash algorithm out of a lower-level
> primitive that doesn't document support for this, and therefore taking
> responsibility yourself for how well it distributes values into buckets.
>
> Are you confident this results in good hash performance? Is this
> better than a solution built on top of a hash function with an
> explicit API for calculating a hash incrementally, such as the crc32
> example I included? (And again, this would ideally be
> a sys.hash_info.hash_bits-bit algorithm.)

I don't have the theoretical background to defend this function. But it
seems to me that if we believe that hash((int, thing)) distributes well,
then how could this function not distribute well?

--Ned.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From njs at pobox.com  Wed Dec 28 17:13:50 2016
From: njs at pobox.com (Nathaniel Smith)
Date: Wed, 28 Dec 2016 14:13:50 -0800
Subject: [Python-ideas] Function arguments in tracebacks
In-Reply-To:
References:
Message-ID:

On Dec 28, 2016 12:44, "Brett Cannon" wrote:

My quick on-vacation response is that attaching more objects to
exceptions is typically viewed as dangerous as it can lead to those
objects being kept alive longer than expected (see the discussions about
richer error messages to see that worry come out for something as simple
as attaching the type to a TypeError).


This isn't an issue for printing arguments or other locals in tracebacks,
though.
The traceback printing code can access anything in the frame stack.

-n
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mahmoud at hatnote.com  Wed Dec 28 17:22:11 2016
From: mahmoud at hatnote.com (Mahmoud Hashemi)
Date: Wed, 28 Dec 2016 14:22:11 -0800
Subject: [Python-ideas] Function arguments in tracebacks
In-Reply-To:
References:
Message-ID:

On Wed, Dec 28, 2016 at 2:13 PM, Nathaniel Smith wrote:

> On Dec 28, 2016 12:44, "Brett Cannon" wrote:
>
> My quick on-vacation response is that attaching more objects to exceptions
> is typically viewed as dangerous as it can lead to those objects being kept
> alive longer than expected (see the discussions about richer error messages
> to see that worry come out for something as simple as attaching the type to
> a TypeError).
>
>
> This isn't an issue for printing arguments or other locals in tracebacks,
> though. The traceback printing code can access anything in the frame stack.
>
> -n
>

Right. I'd actually be more worried about security leaks than memory
leaks. Imagine you're calling a password-checking function that got bytes
instead of text: what amounts to a type check could leak the plaintext
password.

One rarely sees a C traceback, let alone a textual one, except during
development, whereas Python tracebacks are seen during development and
after deployment.

Mahmoud
https://github.com/mahmoud
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From josephhackman at gmail.com  Wed Dec 28 18:00:06 2016
From: josephhackman at gmail.com (Joseph Hackman)
Date: Wed, 28 Dec 2016 18:00:06 -0500
Subject: [Python-ideas] VT100 style escape codes in Windows
Message-ID:

Hey All!

I propose that Windows CPython flip the bit for VT100 support (colors and
whatnot) for the stdout/stderr streams at startup time.

I believe this behavior is worthwhile because ANSI escape codes are
standard across most of Python's install base, and the alternative for
Windows (using ctypes/win32 to alter the colors) is non-intuitive and
well beyond the scope of most users.

Under Linux/Mac, the terminal always supports what it can, and it's up to
the application to verify escape codes are supported. Under Windows,
applications (Python) must specifically request that escape codes be
enabled. The flag lasts for the duration of the application, and must be
flipped on every launch. It seems many of the built-in windows commands
now operate in this mode.

This change would not impede tools that use the win32 APIs for the
console (such as colorama), and is supported in windows 2000 and up.

The only good alternative I can see is adding colorized/special output as
a proper python feature that actually checks using the terminal
information in *nix and win32.

For more info, please see the issue: http://bugs.python.org/issue29059

Cheers,
Joseph
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From p.f.moore at gmail.com  Wed Dec 28 18:06:22 2016
From: p.f.moore at gmail.com (Paul Moore)
Date: Wed, 28 Dec 2016 23:06:22 +0000
Subject: [Python-ideas] VT100 style escape codes in Windows
In-Reply-To:
References:
Message-ID:

Would this only apply to recent versions of Windows? (IIRC, the VT100
support is Win10 only). If so, I'd be concerned about scripts that
worked on *some* Windows versions but not others. And in particular,
about scripts written on Unix using raw VT codes rather than using a
portable solution like colorama.

At the point where we can comfortably assume the majority of users are
using a version of Windows that supports VT codes, I'd be OK with it
being the default, but until then I'd prefer it were an opt-in option.
Paul
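For anyone wanting to try this by hand, the ctypes/win32 incantation
Joseph refers to looks roughly like the following hedged sketch (the
constants come from the Windows console API; on non-Windows platforms
the escape codes are assumed to work already):

import sys
import ctypes

ENABLE_VIRTUAL_TERMINAL_PROCESSING = 0x0004
STD_OUTPUT_HANDLE = -11

def enable_vt100():
    if sys.platform != 'win32':
        return True  # assume a VT100-capable terminal elsewhere
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    mode = ctypes.c_uint32()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False
    return bool(kernel32.SetConsoleMode(
        handle, mode.value | ENABLE_VIRTUAL_TERMINAL_PROCESSING))

if enable_vt100():
    print('\x1b[32mthis should be green\x1b[0m')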
From josephhackman at gmail.com  Wed Dec 28 18:33:20 2016
From: josephhackman at gmail.com (Joseph Hackman)
Date: Wed, 28 Dec 2016 18:33:20 -0500
Subject: [Python-ideas] VT100 style escape codes in Windows
In-Reply-To:
References:
Message-ID:

The quick answer is that the MSDN doc indicates support from windows 2000
onward, with no notes for partial compatibility:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686033(v=vs.85).aspx

I'll build a Windows 7 VM to test.

I believe Python 3.6 is only supported on Vista+ and 3.7 would be Windows
7+ only?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From josephhackman at gmail.com  Wed Dec 28 19:13:48 2016
From: josephhackman at gmail.com (Joseph Hackman)
Date: Wed, 28 Dec 2016 19:13:48 -0500
Subject: [Python-ideas] VT100 style escape codes in Windows
In-Reply-To:
References:
Message-ID:

Welp! You're definitely correct. Ah well.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From random832 at fastmail.com Wed Dec 28 20:30:15 2016 From: random832 at fastmail.com (Random832) Date: Wed, 28 Dec 2016 20:30:15 -0500 Subject: [Python-ideas] VT100 style escape codes in Windows In-Reply-To: References: Message-ID: <1482975015.2963954.831782761.5A3E0FFC@webmail.messagingengine.com> On Wed, Dec 28, 2016, at 18:33, Joseph Hackman wrote: > The quick answer is that the MSDN doc indicates support from windows 2000 > onward, with no notes for partial compatability: > https://msdn.microsoft.com/en-us/library/windows/desktop/ms686033(v=vs.85).aspx That's the function itself (and 2000 is just as far back as the website goes, it's actually existed, with the other modes, since NT 3.1 and Windows 95. The separate code sample page mentions that they are new features since Windows 10 Anniversary Edition. From abedillon at gmail.com Wed Dec 28 22:05:53 2016 From: abedillon at gmail.com (Abe Dillon) Date: Wed, 28 Dec 2016 21:05:53 -0600 Subject: [Python-ideas] Importing public symbols and simultainiously privatizing them, is too noisy In-Reply-To: References: <73a65a22-6440-4fde-ba99-fcd864a652d0@googlegroups.com> <20160317003545.GD8022@ando.pearwood.info> <007b7330-9b57-42a6-b1df-5af2029db1fb@googlegroups.com> Message-ID: > > I avoid __all__ like the plague. Too easy for it to get out of sync with > the API when i forget to add a new symbol. Your API should be one of the most stable parts of your code, no? On Fri, Mar 18, 2016 at 4:29 PM, Chris Barker wrote: > On Wed, Mar 16, 2016 at 6:52 PM, Rick Johnson < > rantingrickjohnson at gmail.com> wrote: > >> > Besides, why is "import x as _x" so special to require special syntax? >> > > It's not :-) I know I do, for instance, > > from matplotlib import pylot as plt > > But have NEVER done the leading underscore thing... > > >> from module import Foo as _Foo, bar as _bar, BAZ as _BAZ, spam as _spam, >> eggs as _eggs >> > > if you are mirroring an entire namespace, or a god fraction of one then > use a module name! > > import module as _mod > > then use _mod.Foo, etc..... > > Now, that may seem like a contrived example, but i've >> witnessed much longer "run-on import lines" than that. >> > > I have too, but I think it's bad style -- if you are importing a LOT of > names from one module, just import the darn module -- giving it a shorter > name if you like. This has become a really standard practice, like: > > import numpy as np > > for instance. > > The intended purpose is to: "automate the privatization of >> public symbols during the import process". >> > > I'm really confused about the use case for "privatization of public > symbols" at all, but again, if you need a lot of them, use the module name > to prefix them. Heck give it a one character name, and then it's hardly > more typing than the underscore... > > -CHB > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From steve at pearwood.info  Thu Dec 29 03:20:00 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Thu, 29 Dec 2016 19:20:00 +1100
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References: <5863F223.3040906@stoneleaf.us>
Message-ID: <20161229081959.GA3887@ando.pearwood.info>

On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab at math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman wrote:
>
> > In other words, objects that do not compare equal can also have the same
> > hash value (although too much of that will reduce the efficiency of
> > Python's containers).
>
> Yes, I realize that unequal objects can return the same hash value with
> only performance, and not correctness, suffering. It's the performance I'm
> concerned about. That's what I meant by "...to keep from unnecessarily
> causing hash collisions..." in my original message, but sorry this wasn't
> clearer. We should be able to do this in a way that doesn't increase hash
> collisions unnecessarily.

With respect Josh, I feel that this thread is based on premature
optimization. It seems to me that you're *assuming* that anything less
than some theoretically ideal O(1) space O(N) time hash function is
clearly and obviously unsuitable.

Of course I might be completely wrong. Perhaps you have implemented your
own __hash__ methods as suggested by the docs, as well as Ned's version,
and profiled your code and discovered that __hash__ is a significant
bottleneck. In which case, I'll apologise for doubting you, but in my
defence I'll say that the language you have used in this thread so far
gives no hint that you've actually profiled your code and found the
bottleneck.

As I see it, this thread includes a few questions:

(1) What is a good way to generate a hash one item at a time?

I think Ned's answer is "good enough", subject to evidence to the
contrary. If somebody cares to spend the time to analyse it, that's
excellent, but we're volunteers and it's the holiday period and most
people have probably got better things to do. But we shouldn't let the
perfect be the enemy of the good.

But for what it's worth, I've done an *extremely* quick and dirty test
to see whether the incremental hash function gives a good spread of
values, by comparing it to the standard hash() function.

py> import statistics
py> def incrhash(iterable):
...     h = hash(())
...     for x in iterable:
...         h = hash((h, x))
...     return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
...     it = range(i, i+100)
...     data1.append(hash(tuple(it)))
...     data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)

Neither the built-in hash() nor the incremental hash gives any
collisions over this (admittedly small) data set, and both have very
similar spreads of values as measured by either the standard deviation
or the statistical range. The means are quite different:

py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)

but I don't think that matters. So that's good enough for me.

(2) Should Ned's incremental hash, or some alternative with better
properties, go into the standard library?

I'm not convinced that your examples are common enough that the stdlib
should be burdened with supporting it.
On the other hand, I don't think
it is an especially *large* burden, so perhaps it could be justified.
Count me as sitting on the fence on this one.

Perhaps a reasonable compromise would be to include it as a recipe in
the docs.

(3) If it does go in the stdlib, where should it go?

I'm suspicious of functions that change their behaviour depending on how
they are called, so I'm not keen on your suggestion of adding a second
API to the hash built-in:

    hash(obj)  # return hash of obj

    hash(iterable=obj)  # return incrementally calculated hash of obj

That feels wrong to me. I'd rather add a generator to the itertools
module:

    itertools.iterhash(iterable)  # yield incremental hashes

or, copying the API of itertools.chain, add a method to hash:

    hash.from_iterable(iterable)  # return hash calculated incrementally

--
Steve

From rosuav at gmail.com  Thu Dec 29 03:35:04 2016
From: rosuav at gmail.com (Chris Angelico)
Date: Thu, 29 Dec 2016 19:35:04 +1100
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <20161229081959.GA3887@ando.pearwood.info>
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID:

On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano wrote:
> I'd rather add a generator to the itertools
> module:
>
>     itertools.iterhash(iterable)  # yield incremental hashes
>
> or, copying the API of itertools.chain, add a method to hash:
>
>     hash.from_iterable(iterable)  # return hash calculated incrementally

The itertools module is mainly designed to be consumed lazily. The
hash has to be calculated eagerly, so it's not really a good fit for
itertools. The only real advantage of this "hash from iterable" over
hash(tuple(it)) is avoiding the intermediate tuple, so I'd want to see
evidence that that's actually significant.

ChrisA

From erik.m.bray at gmail.com  Thu Dec 29 07:12:58 2016
From: erik.m.bray at gmail.com (Erik Bray)
Date: Thu, 29 Dec 2016 13:12:58 +0100
Subject: [Python-ideas] New PyThread_tss_ C-API for CPython
In-Reply-To:
References: <20161216185102.1e8396d4@fsol>
Message-ID:

On Wed, Dec 21, 2016 at 5:07 PM, Nick Coghlan wrote:
> On 21 December 2016 at 20:01, Erik Bray wrote:
>>
>> On Wed, Dec 21, 2016 at 2:10 AM, Nick Coghlan wrote:
>> > Option 2: Similar to option 1, but using a custom type alias, rather
>> > than using a C99 bool directly
>> >
>> > The closest API we have to these semantics at the moment would be
>> > PyGILState_Ensure, so the following API naming might work for option 2:
>> >
>> >     Py_ensure_t
>> >     Py_ENSURE_NEEDS_INIT
>> >     Py_ENSURE_INITIALIZED
>> >
>> > Respectively, these would just be aliases for bool, false, and true.
>> >
>> > And then modify the proposed PyThread_tss_create and PyThread_tss_delete
>> > APIs to accept a "Py_ensure_t *init_flag" in addition to their current
>> > arguments.
>>
>> That all sounds good--between the two option 2 looks a bit more explicit.
>>
>> Though what about this?  Rather than adding another type, the original
>> proposal could be changed slightly so that Py_tss_t *is* partially
>> defined as a struct consisting of a bool, with whatever the native TLS
>> key is.  E.g.
>>
>>     typedef struct {
>>         bool init_flag;
>>     #if defined(_POSIX_THREADS)
>>         pthread_key_t key;
>>     #elif defined(NT_THREADS)
>>         DWORD key;
>>     /* etc... */
>>     #endif
>>     } Py_tss_t;
>>
>> Then it's just taking Masayuki's original patch, with the global bool
>> variables, and formalizing that by combining the initialized flag with
>> the key, and requiring the semantics you described above for
>> PyThread_tss_create/delete.
>>
>> For Python's purposes it seems like this might be good enough, with
>> the more general purpose pthread_once-like functionality not required.
>
> Aye, I also thought of that approach, but talked myself out of it since
> there's no definable default value for pthread_key_t. However, C99 partial
> initialisation may deal with that for us (by zeroing the memory without
> actually assigning a typed value to it), and if it does, I agree it would be
> better to handle the initialisation flag automatically rather than requiring
> callers to do it.

I think I understand what you're saying here... To be clear, let me
enumerate the three currently supported cases and how they're affected:

1) CPython's TLS: Defines -1 as an uninitialized key (by fact of the
implementation--that the keys are integers starting from zero)
2) pthreads: Does not define an uninitialized default value for
keys, for reasons described at [1] under "Non-Idempotent Data Key
Creation".  I understand their reasoning, though I can't claim to know
specifically what they mean when they say that some implementations
would require the mutual exclusion to be performed on
pthread_getspecific() as well.  I don't know that it applies here.
3) windows: The return value of TlsAlloc() is a DWORD (unsigned int)
and [2] states that its value should be opaque.

So in principle we can cover all cases with an opaque struct that
contains, as its first member, an is_initialized flag.  The tricky
part is how to initialize the rest of the struct (containing the
underlying implementation-specific key).  For 1) and 3) it doesn't
matter--it can just be zero.  For 2) it's trickier because there's no
defined constant value to initialize a pthread_key_t to.

Per Nick's suggestion this can be worked around by relying on C99's
initialization semantics.  Per [3] section 6.7.8, clause 21:

"""
If there are fewer initializers in a brace-enclosed list than there
are elements or members of an aggregate, or fewer characters in a
string literal used to initialize an array of known size than there
are elements in the array, the remainder of the aggregate shall be
initialized implicitly the same as objects that have static storage
duration.
"""

How objects with static storage are initialized is described in the
previous page under clause 10, but in practice it boils down to what
you would expect: Everything is initialized to zero, including nested
structs and arrays.

So as long as we can use this feature of C99 then I think that's the
best approach.

[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/ms686801(v=vs.85).aspx
[3] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

From jab at math.brown.edu  Thu Dec 29 15:24:10 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Thu, 29 Dec 2016 15:24:10 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <20161229081959.GA3887@ando.pearwood.info>
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID:

Thanks for the thoughtful discussion, it's been very interesting.

Hash algorithms seem particularly sensitive and tricky to get right,
with a great deal of research going into choices of constants, etc. and
lots of gotchas. So it seemed worth asking about.
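For reference, the accumulation approach under discussion is essentially
the following (a sketch along the lines of Ned's recipe; the helper name
matches the one used in my gist below):

    from functools import reduce

    def hash_incremental(iterable):
        # Fold each item into a running hash via a two-element tuple,
        # rather than materializing one big tuple up front.
        return reduce(lambda h, item: hash((h, item)), iterable, hash(()))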
If I had to bet on whether repeatedly accumulating pairwise hash() results would maintain the same desired properties that hash(tuple(items)) guarantees, I'd want to get confirmation from someone with expertise in this first, hence my starting this thread. But as you showed, it's certainly possible to do some exploration in the meantime. Prompted by your helpful comparison, I just put together https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further compare hash_tuple vs. hash_incremental. Based on that, hash_incremental seems to have a comparable distribution to hash_tuple. I'm not sure if the methodology there is sound, as I'm new to analysis like this. So I'd still welcome confirmation from someone with actual expertise in Python's internal hash algorithms. But so far this sure seems good enough for the use cases I described. Given sufficiently good distribution, I'd expect there to be unanimous agreement that an immutable collection, which could contain arbitrarily many items, should strongly prefer hash_incremental(self) over hash(tuple(self)), for the same reason we use generator comprehensions instead of list comprehensions when appropriate. Please correct me if I'm wrong. +1 for the "hash.from_iterable" API you proposed, if some additional support for this is added to Python. Otherwise +1 for including Ned's recipe in the docs. Again, happy to submit a patch for either of these if it would be helpful. And to be clear, I really appreciate the time that contributors have put into this thread, and into Python in general. Thoughtful responses are always appreciated, and never expected. I'm just interested in learning and in helping improve Python when I might have an opportunity. My Python open source work has been done on a voluntary basis too, and I haven't even gotten to use Python for paid/closed source work in several years, alas. Thanks again, Josh On Thu, Dec 29, 2016 at 3:20 AM, Steven D'Aprano wrote: > On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab at math.brown.edu wrote: > > On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman > wrote: > > > > > In other words, objects that do not compare equal can also have the > same > > > hash value (although too much of that will reduce the efficiency of > > > Python's containers). > > > > > > > Yes, I realize that unequal objects can return the same hash value with > > only performance, and not correctness, suffering. It's the performance > I'm > > concerned about. That's what I meant by "...to keep from unnecessarily > > causing hash collisions..." in my original message, but sorry this wasn't > > clearer. We should be able to do this in a way that doesn't increase hash > > collisions unnecessarily. > > With respect Josh, I feel that this thread is based on premature > optimization. It seems to me that you're *assuming* that anything less > than some theoretically ideal O(1) space O(N) time hash function is > clearly and obviously unsuitable. > > Of course I might be completely wrong. Perhaps you have implemented your > own __hash__ methods as suggested by the docs, as well as Ned's version, > and profiled your code and discovered that __hash__ is a significant > bottleneck. In which case, I'll apologise for doubting you, but in my > defence I'll say that the language you have used in this thread so far > gives no hint that you've actually profiled your code and found the > bottleneck. > > As I see it, this thread includes a few questions: > > (1) What is a good way to generate a hash one item at a time? 
> > I think Ned's answer is "good enough", subject to evidence to the > contrary. If somebody cares to spend the time to analyse it, that's > excellent, but we're volunteers and its the holiday period and most > people have probably got better things to do. But we shouldn't let the > perfect be the enemy of the good. > > But for what it's worth, I've done an *extremely* quick and dirty test > to see whether the incremental hash function gives a good spread of > values, by comparing it to the standard hash() function. > > > py> import statistics > py> def incrhash(iterable): > ... h = hash(()) > ... for x in iterable: > ... h = hash((h, x)) > ... return h > ... > py> > py> data1 = [] > py> data2 = [] > py> for i in range(1000): > ... it = range(i, i+100) > ... data1.append(hash(tuple(it))) > ... data2.append(incrhash(it)) > ... > py> # Are there any collisions? > ... len(set(data1)), len(set(data2)) > (1000, 1000) > py> # compare spread of values > ... statistics.stdev(data1), statistics.stdev(data2) > (1231914201.0980585, 1227850884.443638) > py> max(data1)-min(data1), max(data2)-min(data2) > (4287398438, 4287569008) > > > Neither the built-in hash() nor the incremental hash gives any > collisions over this (admittedly small) data set, and both have very > similar spreads of values as measured by either the standard deviation > or the statistical range. The means are quite different: > > py> statistics.mean(data1), statistics.mean(data2) > (-8577110.944, 2854438.568) > > but I don't think that matters. So that's good enough for me. > > > (2) Should Ned's incremental hash, or some alternative with better > properties, go into the standard library? > > I'm not convinced that your examples are common enough that the stdlib > should be burdened with supporting it. On the other hand, I don't think > it is an especially *large* burden, so perhaps it could be justified. > Count me as sitting on the fence on this one. > > Perhaps a reasonable compromise would be to include it as a recipe in > the docs. > > > (3) If it does go in the stdlib, where should it go? > > I'm suspicious of functions that change their behaviour depending on how > they are called, so I'm not keen on your suggestion of adding a second > API to the hash built-in: > > hash(obj) # return hash of obj > > hash(iterable=obj) # return incrementally calculated hash of obj > > That feels wrong to me. I'd rather add a generator to the itertools > module: > > itertools.iterhash(iterable) # yield incremental hashes > > or, copying the API of itertools.chain, add a method to hash: > > hash.from_iterable(iterable) # return hash calculated incrementally > > > > -- > Steve -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothy.c.delaney at gmail.com Thu Dec 29 20:59:43 2016 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Fri, 30 Dec 2016 12:59:43 +1100 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: <20161229081959.GA3887@ando.pearwood.info> References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: On 29 December 2016 at 19:20, Steven D'Aprano wrote: > > With respect Josh, I feel that this thread is based on premature > optimization. It seems to me that you're *assuming* that anything less > than some theoretically ideal O(1) space O(N) time hash function is > clearly and obviously unsuitable. > > Of course I might be completely wrong. 
Perhaps you have implemented your > own __hash__ methods as suggested by the docs, as well as Ned's version, > and profiled your code and discovered that __hash__ is a significant > bottleneck. In which case, I'll apologise for doubting you, but in my > defence I'll say that the language you have used in this thread so far > gives no hint that you've actually profiled your code and found the > bottleneck. > In Josh's defence, the initial use case he put forward is exactly the kind of scenario where it's worthwhile optimising ahead of time. Quite often a poorly implemented hash function doesn't manifest as a problem until you scale up massively - and a developer may not have the capability to scale up to a suitable level in-house, resulting in performance issues at customer sites. I had one particular case (fortunately discovered before going to customers) where a field was included in the equality check, but wasn't part of the hash. Unfortunately, the lack of this one field resulted in objects only being allocated to a few buckets (in a Java HashMap), resulting in every access having to walk a potentially very long chain doing equality comparisons - O(N) behaviour from an amortised O(1) data structure. Unit tests - no worries. Small-scale tests - everything looked fine. Once we started our load tests though everything slowed to a crawl. 100% CPU, throughput at a standstill ... it didn't look good. Adding that one field to the hash resulted in the ability to scale up to hundreds of thousands of objects with minimal CPU. I can't remember if it was millions we tested to (it was around 10 years ago ...). Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Fri Dec 30 09:55:32 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 31 Dec 2016 00:55:32 +1000 Subject: [Python-ideas] AtributeError inside __get__ In-Reply-To: References: Message-ID: On 26 December 2016 at 21:23, Zahari Dim wrote: > > There are a lot of entirely valid properties that look something like > this: > > > > > > @property > > def attr(self): > > try: > > return data_store[lookup_key] > > except KeyError: > > raise AttributeError("attr") > > But wouldn't something like this be implemented more commonly with > __getattr__ instead (likely there is more than one such property in a > real example)? Even though __getattr__ has a similar problem (a bad > AttributeError inside can cause many bugs), I'd agree it would > probably be too difficult to change that without breaking a lot of > code. For __get__, the errors are arguably more confusing (e.g. when > used with @property) and the legitimate use case, while existing, > seems more infrequent to me: I did a github search and there was a > small number of cases, but most were for code written in python 2 > anyway. Aye, I agree this pattern is far more common in __getattr__ than it is in descriptor __get__ implementations or in property getter implementations. Rather than changing the descriptor protocol in general, I'd personally be more amenable to the idea of *property* catching AttributeError from the functions it calls and turning it into RuntimeError (after a suitable deprecation period). That way folks that really wanted the old behaviour could define their own descriptor that works the same way property does today, whereas if the descriptor protocol itself were to change, there's very little people could do to work around it if it wasn't what they wanted. 
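A rough sketch of what such an opt-in variant could look like (the
subclass name is hypothetical, and this is illustrative rather than how
the builtin would actually be implemented):

    class strict_property(property):
        # Hypothetical property subclass: an AttributeError escaping
        # the getter is converted to RuntimeError, so it can't
        # masquerade as a missing attribute.
        def __get__(self, obj, objtype=None):
            try:
                return super().__get__(obj, objtype)
            except AttributeError as exc:
                raise RuntimeError(
                    "AttributeError inside property getter") from exc

With something like that available, a buggy getter that leaks an
AttributeError would surface as a loud RuntimeError rather than silently
looking like a missing attribute.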
Exploring that possible approach a bit further: - after a deprecation period, the "property" builtin would change to convert any AttributeError raised by the methods it calls into RuntimeError - the current "property" could be renamed "optionalproperty": the methods may raise AttributeError to indicate the attribute isn't *really* present, even though the property is defined at the class level - the deprecation warning would indicate that the affected properties should switch to using optionalproperty instead Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Fri Dec 30 10:10:03 2016 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 31 Dec 2016 02:10:03 +1100 Subject: [Python-ideas] AtributeError inside __get__ In-Reply-To: References: Message-ID: On Sat, Dec 31, 2016 at 1:55 AM, Nick Coghlan wrote: > Rather than changing the descriptor protocol in general, I'd personally be > more amenable to the idea of *property* catching AttributeError from the > functions it calls and turning it into RuntimeError (after a suitable > deprecation period). That way folks that really wanted the old behaviour > could define their own descriptor that works the same way property does > today, whereas if the descriptor protocol itself were to change, there's > very little people could do to work around it if it wasn't what they wanted. > Actually, that makes a lot of sense. And since "property" isn't magic syntax, you could take it sooner: from somewhere import property and toy with it that way. What module would be appropriate, though? ChrisA From ncoghlan at gmail.com Fri Dec 30 10:28:47 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 31 Dec 2016 01:28:47 +1000 Subject: [Python-ideas] Function arguments in tracebacks In-Reply-To: References: Message-ID: On 29 December 2016 at 08:13, Nathaniel Smith wrote: > On Dec 28, 2016 12:44, "Brett Cannon" wrote: > > My quick on-vacation response is that attaching more objects to exceptions > is typically viewed as dangerous as it can lead to those objects being kept > alive longer than expected (see the discussions about richer error messages > to see that worry come out for something as simple as attaching the type to > a TypeError). > > > This isn't an issue for printing arguments or other locals in tracebacks, > though. The traceback printing code can access anything in the frame stack. > Right, the reasons for the discrepancy here are purely pragmatic ones: - the default traceback printing machinery in CPython is written in C, and we don't currently have readily available tools at that layer to print a nice structured argument list the way gdb does for C functions (and there are good reasons for us to want the interpreter to be able to print tracebacks even if it's in a sufficiently unhealthy state that the "traceback" module won't run, so delegating the problem to Python level tooling isn't an answer for CPython) - displaying local variables in runtime tracebacks (as opposed to in interactive debuggers like gdb) is a known security risk that we don't currently provide good tools for handling in the standard library (e.g. we don't offer str and bytes subclasses with opaque representations that don't reveal their contents). Even if we did offer them, they'd still be opt-in for reasons of usability when working with data that *isn't* security sensitive. 
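For illustration, such an opaque type could be as simple as the
following sketch (the name is hypothetical, and a real design would need
to think about the str operations that return new plain str instances):

    class OpaqueStr(str):
        # Hypothetical str subclass whose repr hides its contents, so
        # tracebacks that display local variables don't leak secrets.
        def __repr__(self):
            return '<OpaqueStr: contents hidden>'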
However, neither of those arguments applies to the "where" command in pdb, and that doesn't currently display this kind of information either: >>> def f(x, y, message): ... return x/y, message ... >>> f(2, 0, "Hello world") Traceback (most recent call last): File "", line 1, in File "", line 2, in f ZeroDivisionError: division by zero >>> import pdb; pdb.pm() > (2)f() (Pdb) w (1)()->None > (2)f() (Pdb) pdb already knows what the arguments are, as it can print them if you ask for them explicitly: (Pdb) args x = 2 y = 0 message = 'Hello world' So I think this kind of change may make a lot of sense as an RFE for pdb's "where" command (with the added bonus that projects like pdbpp could make it available to earlier Python versions as well). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Fri Dec 30 11:05:36 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 31 Dec 2016 02:05:36 +1000 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On 29 December 2016 at 22:12, Erik Bray wrote: > 1) CPython's TLS: Defines -1 as an uninitialized key (by fact of the > implementation--that the keys are integers starting from zero) > 2) pthreads: Does not definite an uninitialized default value for > keys, for reasons described at [1] under "Non-Idempotent Data Key > Creation". I understand their reasoning, though I can't claim to know > specifically what they mean when they say that some implementations > would require the mutual-exclusion to be performed on > pthread_getspecific() as well. I don't know that it applies here. > That section is a little weird, as they describe two requests (one for a known-NULL default value, the other for implicit synchronisation of key creation to prevent race conditions), and only provide the justification for rejecting one of them (the second one). If I've understood correctly, the situation they're worried about there is that pthread_key_create() has to be called at least once-per-process, but must be called before *any* call to pthread_getspecific or pthread_setspecific for a given key. If you do "implicit init" rather than requiring the use of an explicit mechanism like pthread_once (or our own Py_Initialize and module import locks), then you may take a small performance hit as either *every* thread then has to call pthread_key_create() to ensure the key exists before using it, or else pthread_getspecific() and pthread_setspecific() have to become potentially blocking calls. Neither of those is desirable, so it makes sense to leave that part of the problem to the API client. In our case, we don't want the implicit synchronisation, we just want the known-NULL default value so the "Is it already set?" check can be moved inside the library function. > 3) windows: The return value of TlsAlloc() is a DWORD (unsigned int) > and [2] states that its value should be opaque. > > So in principle we can cover all cases with an opaque struct that > contains, as its first member, an is_initialized flag. The tricky > part is how to initialize the rest of the struct (containing the > underlying implementation-specific key). For 1) and 3) it doesn't > matter--it can just be zero. For 2) it's trickier because there's no > defined constant value to initialize a pthread_key_t to. 
> > Per Nick's suggestion this can be worked around by relying on C99's > initialization semantics. Per [3] section 6.7.8, clause 21: > > """ > If there are fewer initializers in a brace-enclosed list than there > are elements or members of an aggregate, or fewer characters in a > string literal used to initialize an array of known size than there > are elements in the array, the remainder of the aggregate shall be > initialized implicitly the same as objects that have static storage > duration. > """ > > How objects with static storage are initialized is described in the > previous page under clause 10, but in practice it boils down to what > you would expect: Everything is initialized to zero, including nested > structs and arrays. > > So as long as we can use this feature of C99 then I think that's the > best approach. > I checked PEP 7 to see exactly which features we've added to the approved C dialect, and designated initialisers are already on the list: https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html So I believe that would allow the initializer to be declared as something like: #define Py_tss_NEEDS_INIT {.is_initialized = false} Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Fri Dec 30 11:24:46 2016 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 31 Dec 2016 02:24:46 +1000 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: On 29 December 2016 at 18:35, Chris Angelico wrote: > On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano > wrote: > > I'd rather add a generator to the itertools > > module: > > > > itertools.iterhash(iterable) # yield incremental hashes > > > > or, copying the API of itertools.chain, add a method to hash: > > > > hash.from_iterable(iterable) # return hash calculated incrementally > > The itertools module is mainly designed to be consumed lazily. The > hash has to be calculated eagerly, so it's not really a good fit for > itertools. I understood the "iterhash" suggestion to be akin to itertools.accumulate: >>> for value, tally in enumerate(accumulate(range(10))): print(value, tally) ... 0 0 1 1 2 3 3 6 4 10 5 15 6 21 7 28 8 36 9 45 However, I think including Ned's recipe (or something like it) in https://docs.python.org/3/reference/datamodel.html#object.__hash__ as a tool for avoiding large temporary tuple allocations may be a better way to start off as: 1. It's applicable to all currently released versions of Python, not just to 3.7+ 2. It provides more scope for people to experiment with their own variants of the idea before committing to a *particular* version somewhere in the standard library Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From jab at math.brown.edu Fri Dec 30 12:29:55 2016 From: jab at math.brown.edu (jab at math.brown.edu) Date: Fri, 30 Dec 2016 12:29:55 -0500 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: Updating the docs sounds like the more important change for now, given 3.7+. 
But before the docs make an official recommendation for that recipe, were the analyses that Steve and I did sufficient to confirm that its hash distribution and performance is good enough at scale, or is more rigorous analysis necessary? I've been trying to find a reasonably detailed and up-to-date reference on Python hash() result requirements and analysis methodology, with instructions on how to confirm if they're met, but am still looking. Would find that an interesting read if it's out there. But I'd take just an authoritative thumbs up here too. Just haven't heard one yet. And regarding any built-in support that might get added, I just want to make sure Ryan Gonzalez's proposal (the first reply on this thread) didn't get buried: hasher = IncrementalHasher() hasher.add(one_item_to_hash) # updates hasher.hash property with result # repeat return hasher.hash I think this is the only proposal so far that actually adds an explicit API for performing an incremental update. (i.e. The other "hash_stream(iterable)" -style proposals are all-or-nothing.) This would bring Python's built-in hash() algorithm's support up to parity with the other algorithms in the standard library (hashlib, crc32). Maybe that's valuable? -------------- next part -------------- An HTML attachment was scrubbed... URL: From erik.m.bray at gmail.com Fri Dec 30 12:38:05 2016 From: erik.m.bray at gmail.com (Erik Bray) Date: Fri, 30 Dec 2016 18:38:05 +0100 Subject: [Python-ideas] New PyThread_tss_ C-API for CPython In-Reply-To: References: <20161216185102.1e8396d4@fsol> Message-ID: On Fri, Dec 30, 2016 at 5:05 PM, Nick Coghlan wrote: > On 29 December 2016 at 22:12, Erik Bray wrote: >> >> 1) CPython's TLS: Defines -1 as an uninitialized key (by fact of the >> implementation--that the keys are integers starting from zero) >> 2) pthreads: Does not definite an uninitialized default value for >> keys, for reasons described at [1] under "Non-Idempotent Data Key >> Creation". I understand their reasoning, though I can't claim to know >> specifically what they mean when they say that some implementations >> would require the mutual-exclusion to be performed on >> pthread_getspecific() as well. I don't know that it applies here. > > > That section is a little weird, as they describe two requests (one for a > known-NULL default value, the other for implicit synchronisation of key > creation to prevent race conditions), and only provide the justification for > rejecting one of them (the second one). Right, that is confusing to me as well. I'm guessing the reason for rejecting the first is in part a way to force us to recognize the second issue. > If I've understood correctly, the situation they're worried about there is > that pthread_key_create() has to be called at least once-per-process, but > must be called before *any* call to pthread_getspecific or > pthread_setspecific for a given key. If you do "implicit init" rather than > requiring the use of an explicit mechanism like pthread_once (or our own > Py_Initialize and module import locks), then you may take a small > performance hit as either *every* thread then has to call > pthread_key_create() to ensure the key exists before using it, or else > pthread_getspecific() and pthread_setspecific() have to become potentially > blocking calls. Neither of those is desirable, so it makes sense to leave > that part of the problem to the API client. > > In our case, we don't want the implicit synchronisation, we just want the > known-NULL default value so the "Is it already set?" 
check can be moved
> inside the library function.

Okay, we're on the same page here then.  I just wanted to make sure
there wasn't anything else I was missing in Python's case.

>> 3) windows: The return value of TlsAlloc() is a DWORD (unsigned int)
>> and [2] states that its value should be opaque.
>>
>> So in principle we can cover all cases with an opaque struct that
>> contains, as its first member, an is_initialized flag.  The tricky
>> part is how to initialize the rest of the struct (containing the
>> underlying implementation-specific key).  For 1) and 3) it doesn't
>> matter--it can just be zero.  For 2) it's trickier because there's no
>> defined constant value to initialize a pthread_key_t to.
>>
>> Per Nick's suggestion this can be worked around by relying on C99's
>> initialization semantics.  Per [3] section 6.7.8, clause 21:
>>
>> """
>> If there are fewer initializers in a brace-enclosed list than there
>> are elements or members of an aggregate, or fewer characters in a
>> string literal used to initialize an array of known size than there
>> are elements in the array, the remainder of the aggregate shall be
>> initialized implicitly the same as objects that have static storage
>> duration.
>> """
>>
>> How objects with static storage are initialized is described in the
>> previous page under clause 10, but in practice it boils down to what
>> you would expect: Everything is initialized to zero, including nested
>> structs and arrays.
>>
>> So as long as we can use this feature of C99 then I think that's the
>> best approach.
>
> I checked PEP 7 to see exactly which features we've added to the approved C
> dialect, and designated initialisers are already on the list:
> https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html
>
> So I believe that would allow the initializer to be declared as something
> like:
>
>     #define Py_tss_NEEDS_INIT {.is_initialized = false}

Great!  One could argue about whether or not the designated initializer
syntax also incorporates omitted fields, but it would seem strange to
insist that it doesn't.

Have a happy new year,

Erik

From turnbull.stephen.fw at u.tsukuba.ac.jp  Fri Dec 30 12:39:53 2016
From: turnbull.stephen.fw at u.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Sat, 31 Dec 2016 02:39:53 +0900
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID: <22630.39913.941657.111305@turnbull.sk.tsukuba.ac.jp>

jab at math.brown.edu writes:

 > But as you showed, it's certainly possible to do some exploration in the
 > meantime. Prompted by your helpful comparison, I just put together
 > https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
 > compare hash_tuple vs. hash_incremental.

It's been a while :-) since I read Knuth[1] (and that was when Knuth
was authoritative on this subject 8^O), but neither methodology is
particularly helpful here.  The ideal is effectively a uniform
distribution on a circle, which has no mean.  Therefore standard
deviation is also uninteresting, since its definition uses the mean.
The right test is something like a χ² (chi-squared) test against a
uniform distribution.  The scatter plots (of hash against test data)
simply show the same thing, without the precision of a statistical
test.  (Note: do not deprecate the computer visualization + human eye
+ human brain system.  It is the best known machine for detecting
significant patterns and anomalies, though YMMV.)
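To make the χ² suggestion concrete, a crude version of such a
uniformity check might look like the following sketch (the bucket count
and the simple modulus reduction are arbitrary illustrative choices,
not a prescribed methodology):

    from collections import Counter

    def chi_squared_stat(hashes, nbuckets=256):
        # Compare observed bucket occupancy against the uniform expectation.
        counts = Counter(h % nbuckets for h in hashes)
        expected = len(hashes) / nbuckets
        return sum((counts[b] - expected) ** 2 / expected
                   for b in range(nbuckets))

The resulting statistic would then be compared against the χ²
distribution with nbuckets - 1 degrees of freedom; a very large value
suggests the hashes are far from uniform.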
The eye-and-brain system is not very good at detecting high-dimensional
patterns, though.

However, a uniform distribution on random data is nowhere near good
enough for a hash function.  It actually needs to be "chaotic" in the
sense that it also spreads "nearby" data out, where "nearby" here would
probably mean that, viewed as (say) 4000-bit strings, less than m% of
the bits (for small m) differ, since in real data you usually expect a
certain amount of structure (continuity in time, clustering around a
mean, line noise in a communication channel), so that you'd be likely
to get lots of "clusters" of very similar data.  You don't want them
pounding on a small subset of buckets.

 > I'm not sure if the methodology there is sound, as I'm new to
 > analysis like this.

Even decades later, starting with Knuth[1] can't hurt. :-)

 > Given sufficiently good distribution, I'd expect there to be unanimous
 > agreement that an immutable collection, which could contain arbitrarily
 > many items, should strongly prefer hash_incremental(self) over
 > hash(tuple(self)), for the same reason we use generator comprehensions
 > instead of list comprehensions when appropriate. Please correct me if I'm
 > wrong.

I think that's a misleading analogy.  For very large collections where
we already have the data, duplicating the data is very expensive.
Furthermore, since the purpose of hashing is making equality
comparisons cheap, this is likely to happen in a loop.  On the other
hand, if there are going to be a *lot* of "large" collections being
stored, then they're actually not all that large compared to the
system, and you might not actually care that much about the cost of the
ephemeral tuples, because the real cost is in an n^2 set of
comparisons.  From the point of view of NumPy, this is an "are you
kidding?" argument because large datasets are its stock in trade, but
for core Python, this may be sufficiently esoteric that it should be
delegated to a third-party package.  On balance, the arguments that
Steven d'Aprano advanced for having a statistics module in the stdlib
vs. importing pandas apply here.

In particular, I think there are a huge number of options for an
iterative hash.  For example, Ned chained 2-tuples.  But perhaps for
efficient time in bounded space you want to use bounded but larger
tuples.  I don't know -- and that's the point.  If we have a TOOWTDI
API like hash.from_iterable then smart people (and people with time on
their hands to do exhaustive experiments!) can tune that over time, as
has been done with the sort API.

Another option is yielding partials, as Steven says:

 > > itertools.iterhash(iterable)  # yield incremental hashes

That's a very interesting idea, though I suspect it rarely would make a
big performance improvement.

I'm not sure I like the "hash.from_iterable" name for this API.  The
problem is that if it's not a concrete collection[3], then you throw
away the data.  If the hash algorithm happens to suck for certain data,
you'll get a lot of collisions, and conclude that your data is much
more concentrated than it actually is.  I find it hard to imagine a use
case where you have large data where you only care about whether two
data points are different (cf. equality comparisons for floats).  You
want to know how they're different.  So I think I would prefer to name
it "hash.from_collection" or similar.  Of course whether the
implementation actually raises on a generator or other non-concrete
iterable is a fine point.

Footnotes:
[1]  Of course I mean The Art of Computer Programming, Ch. 3.

[2]  Including the newly ordered dicts, maybe?
Somebody tweet @dabeaz! What evil can he do with this? [3] Yeah, I know, "re-iterables". But we don't have a definition, let alone an API for identifying, those Things. From ethan at stoneleaf.us Fri Dec 30 14:52:47 2016 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 30 Dec 2016 11:52:47 -0800 Subject: [Python-ideas] AtributeError inside __get__ In-Reply-To: References: Message-ID: <5866BB0F.9090301@stoneleaf.us> On 12/30/2016 06:55 AM, Nick Coghlan wrote: > Rather than changing the descriptor protocol in general, I'd personally be > more amenable to the idea of *property* catching AttributeError from the > functions it calls and turning it into RuntimeError (after a suitable > deprecation period). That way folks that really wanted the old behaviour > could define their own descriptor that works the same way property does > today, whereas if the descriptor protocol itself were to change, there's > very little people could do to work around it if it wasn't what they wanted. Sounds good to me. :) -- ~Ethan~ From ethan at stoneleaf.us Fri Dec 30 14:53:33 2016 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 30 Dec 2016 11:53:33 -0800 Subject: [Python-ideas] AtributeError inside __get__ In-Reply-To: References: Message-ID: <5866BB3D.6010102@stoneleaf.us> On 12/30/2016 07:10 AM, Chris Angelico wrote: > Actually, that makes a lot of sense. And since "property" isn't magic > syntax, you could take it sooner: > > from somewhere import property > > and toy with it that way. > > What module would be appropriate, though? Well, DynamicClassAttribute is kept in the types module, so that's probably the place to put optionalproperty as well. -- ~Ethan~ From bussonniermatthias at gmail.com Fri Dec 30 14:59:33 2016 From: bussonniermatthias at gmail.com (Matthias Bussonnier) Date: Fri, 30 Dec 2016 20:59:33 +0100 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: On Fri, Dec 30, 2016 at 5:24 PM, Nick Coghlan wrote: > > I understood the "iterhash" suggestion to be akin to itertools.accumulate: > > >>> for value, tally in enumerate(accumulate(range(10))): print(value, ... It reminds me of hmac[1]/hashlib[2], with the API : h.update(...) before a .digest(). It is slightly lower level than a `from_iterable`, but would be a bit more flexible. If the API were kept similar things would be easier to remember. -- M [1]: https://docs.python.org/3/library/hmac.html#hmac.HMAC.update [2]: https://docs.python.org/3/library/hashlib-blake2.html#module-hashlib From christian at python.org Fri Dec 30 15:54:59 2016 From: christian at python.org (Christian Heimes) Date: Fri, 30 Dec 2016 21:54:59 +0100 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: On 2016-12-30 20:59, Matthias Bussonnier wrote: > On Fri, Dec 30, 2016 at 5:24 PM, Nick Coghlan wrote: >> >> I understood the "iterhash" suggestion to be akin to itertools.accumulate: >> >> >>> for value, tally in enumerate(accumulate(range(10))): print(value, ... > > It reminds me of hmac[1]/hashlib[2], with the API : h.update(...) > before a .digest(). > It is slightly lower level than a `from_iterable`, but would be a bit > more flexible. > If the API were kept similar things would be easier to remember. Hi, I'm the author of PEP 456 (SipHash24) and one of the maintainers of the hashlib module. 
Before we come up with a new API or recipe, I would like to understand
the problem first. Why does the OP consider hash(large_tuple) a
performance issue? If you have an object with lots of members that
affect both __hash__ and __eq__, then __hash__ is really the least of
your concerns. The hash has to be computed just once and then will stay
the same over the lifetime of the object. Once computed, the hash can
be cached.

On the other hand, __eq__ is called at least once for every successful
hash lookup. In the worst case it is called n-1 times for a dict of
size n, for a match *and* a hashmap miss. Every __eq__ call has to
compare between 1 and m member attributes. For a dict of 1,000 elements
with 1,000 members each, that's just 1,000 hash computations with
roughly 8 kB of memory allocation, but almost a million comparisons in
the worst case.

A hasher object adds further overhead, e.g. object allocation, creation
of a bound method for each call, etc. It's also less CPU cache friendly
than the linear data structure of a tuple.

Christian

From ma3yuki.8mamo10 at gmail.com  Fri Dec 30 17:24:05 2016
From: ma3yuki.8mamo10 at gmail.com (Masayuki YAMAMOTO)
Date: Sat, 31 Dec 2016 07:24:05 +0900
Subject: [Python-ideas] New PyThread_tss_ C-API for CPython
Message-ID:

I have read the discussion, and I agree with using a structure as
Py_tss_t instead of a platform-specific data type. Just as Steve said,
Py_tss_t should be genuinely treated as an opaque type, and the key
state checking should be provided via macros or inline functions with a
name like PyThread_tss_is_created.

Well, I'd like to pin down the specification a bit more :) If
PyThread_tss_create is called with an already-created key, it is a
no-op, but should the function report success or failure? In my
opinion, it is better to return failure, because calling
PyThread_tss_create multiple times for one key most likely indicates
incorrect code. Under this rule, PyThread_tss_is_created should return
a value as follows:

(A) False from after defining with Py_tss_NEEDS_INIT until
PyThread_tss_create is called
(B) True after a call to PyThread_tss_create succeeds
(C) Unchanged before and after a call to PyThread_tss_create fails
(D) False after calling PyThread_tss_delete, regardless of timing
(E) For other functions, the return value of PyThread_tss_is_created
does not change before and after the call

I also think it is better to write a test for these state transitions
of Py_tss_t.

Kind regards,
Masayuki

2016-12-31 2:38 GMT+09:00 Erik Bray:

> On Fri, Dec 30, 2016 at 5:05 PM, Nick Coghlan wrote:
> > On 29 December 2016 at 22:12, Erik Bray wrote:
> >>
> >> 1) CPython's TLS: Defines -1 as an uninitialized key (by fact of the
> >> implementation--that the keys are integers starting from zero)
> >> 2) pthreads: Does not define an uninitialized default value for
> >> keys, for reasons described at [1] under "Non-Idempotent Data Key
> >> Creation". I understand their reasoning, though I can't claim to know
> >> specifically what they mean when they say that some implementations
> >> would require the mutual exclusion to be performed on
> >> pthread_getspecific() as well. I don't know that it applies here.
> >
> > That section is a little weird, as they describe two requests (one for a
> > known-NULL default value, the other for implicit synchronisation of key
> > creation to prevent race conditions), and only provide the justification
> > for rejecting one of them (the second one).
>
> Right, that is confusing to me as well.
I'm guessing the reason for > rejecting the first is in part a way to force us to recognize the > second issue. > > > If I've understood correctly, the situation they're worried about there > is > > that pthread_key_create() has to be called at least once-per-process, but > > must be called before *any* call to pthread_getspecific or > > pthread_setspecific for a given key. If you do "implicit init" rather > than > > requiring the use of an explicit mechanism like pthread_once (or our own > > Py_Initialize and module import locks), then you may take a small > > performance hit as either *every* thread then has to call > > pthread_key_create() to ensure the key exists before using it, or else > > pthread_getspecific() and pthread_setspecific() have to become > potentially > > blocking calls. Neither of those is desirable, so it makes sense to leave > > that part of the problem to the API client. > > > > In our case, we don't want the implicit synchronisation, we just want the > > known-NULL default value so the "Is it already set?" check can be moved > > inside the library function. > > Okay, we're on the same page here then. I just wanted to make sure > there wasn't anything else I was missing in Python's case. > > >> 3) windows: The return value of TlsAlloc() is a DWORD (unsigned int) > >> and [2] states that its value should be opaque. > >> > >> So in principle we can cover all cases with an opaque struct that > >> contains, as its first member, an is_initialized flag. The tricky > >> part is how to initialize the rest of the struct (containing the > >> underlying implementation-specific key). For 1) and 3) it doesn't > >> matter--it can just be zero. For 2) it's trickier because there's no > >> defined constant value to initialize a pthread_key_t to. > >> > >> Per Nick's suggestion this can be worked around by relying on C99's > >> initialization semantics. Per [3] section 6.7.8, clause 21: > >> > >> """ > >> If there are fewer initializers in a brace-enclosed list than there > >> are elements or members of an aggregate, or fewer characters in a > >> string literal used to initialize an array of known size than there > >> are elements in the array, the remainder of the aggregate shall be > >> initialized implicitly the same as objects that have static storage > >> duration. > >> """ > >> > >> How objects with static storage are initialized is described in the > >> previous page under clause 10, but in practice it boils down to what > >> you would expect: Everything is initialized to zero, including nested > >> structs and arrays. > >> > >> So as long as we can use this feature of C99 then I think that's the > >> best approach. > > > > > > > > I checked PEP 7 to see exactly which features we've added to the > approved C > > dialect, and designated initialisers are already on the list: > > https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html > > > > So I believe that would allow the initializer to be declared as something > > like: > > > > #define Py_tss_NEEDS_INIT {.is_initialized = false} > > Great! One could argue about whether or not the designated > initializer syntax also incorporates omitted fields, but it would seem > strange to insist that it doesn't. > > Have a happy new year, > > Erik > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jab at math.brown.edu  Fri Dec 30 18:36:37 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Fri, 30 Dec 2016 18:36:37 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To:
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID:

On Fri, Dec 30, 2016 at 3:54 PM, Christian Heimes wrote:

> Hi,
>
> I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
> hashlib module.
>
> Before we come up with a new API or recipe, I would like to understand
> the problem first. Why does the OP consider hash(large_tuple) a
> performance issue? If you have an object with lots of members that
> affect both __hash__ and __eq__, then __hash__ is really the least of
> your concerns. The hash has to be computed just once and then will stay
> the same over the lifetime of the object. Once computed, the hash can
> be cached.
>
> On the other hand, __eq__ is called at least once for every successful
> hash lookup. In the worst case it is called n-1 times for a dict of
> size n, for a match *and* a hashmap miss. Every __eq__ call has to
> compare between 1 and m member attributes. For a dict of 1,000 elements
> with 1,000 members each, that's just 1,000 hash computations with
> roughly 8 kB of memory allocation, but almost a million comparisons in
> the worst case.
>
> A hasher object adds further overhead, e.g. object allocation, creation
> of a bound method for each call, etc. It's also less CPU cache friendly
> than the linear data structure of a tuple.

Thanks for joining the discussion, great to have your input.

In the use cases I described, the objects' members are ordered. So in
the unlikely event that two objects hash to the same value but are
unequal, the __eq__ call should be cheap, because they probably differ
in length or on their first member, and can skip further comparison.
After a successful hash lookup of an object that's already in a set or
dict, a successful identity check avoids an expensive equality check.
Perhaps I misunderstood?

Here is some example code:

    from functools import reduce

    class FrozenOrderedCollection:

        ...

        def __eq__(self, other):
            if not isinstance(other, FrozenOrderedCollection):
                return NotImplemented
            if len(self) != len(other):
                return False
            return all(i == j for (i, j) in zip(self, other))

        def __hash__(self):
            # Compute the hash lazily, on first use, and cache it.
            if hasattr(self, '_hashval'):
                return self._hashval
            hash_initial = hash(self.__class__)
            self._hashval = reduce(lambda h, i: hash((h, i)), self, hash_initial)
            return self._hashval

Is it the case that internally, the code is mostly there to compute the
hash of such an object in incremental fashion, and it's just a matter
of figuring out the right API to expose it?

If creating an arbitrarily large tuple, passing that to hash(), and
then throwing it away really is the best practice, then perhaps that
should be explicitly documented, since I think many would find that
counterintuitive?

@Stephen J. Turnbull, thank you for your input -- still digesting, but
very interesting.

Thanks again to everyone for the helpful discussion.
-------------- next part --------------
An HTML attachment was scrubbed...
From ethan at stoneleaf.us Fri Dec 30 19:20:43 2016
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 30 Dec 2016 16:20:43 -0800
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID: <5866F9DB.9020207@stoneleaf.us>

On 12/30/2016 03:36 PM, jab at math.brown.edu wrote:

> In the use cases I described, the objects' members are ordered. So in
> the unlikely event that two objects hash to the same value but are
> unequal, the __eq__ call should be cheap, because they probably differ
> in length or on their first member, and can skip further comparison.
> After a successful hash lookup of an object that's already in a set or
> dict, a successful identity check avoids an expensive equality check.
> Perhaps I misunderstood?

If you are relying on an identity check for equality then no two
FrozenOrderedCollection instances can be equal. Was that your intention?
If it was, then just hash the instance's id() and you're done.

If that wasn't your intention then, while there can certainly be a few
quick checks to rule out equality (such as length), if those match, the
expensive full equality check will have to happen.

--
~Ethan~

From jab at math.brown.edu Fri Dec 30 19:31:39 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Fri, 30 Dec 2016 19:31:39 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <5866F9DB.9020207@stoneleaf.us>
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
Message-ID: 

On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:

> On 12/30/2016 03:36 PM, jab at math.brown.edu wrote:
>
>> In the use cases I described, the objects' members are ordered. So in the
>> unlikely event that two objects hash to the same value but are unequal, the
>> __eq__ call should be cheap, because they probably differ in length or on
>> their first member, and can skip further comparison. After a successful
>> hash lookup of an object that's already in a set or dict, a successful
>> identity check avoids an expensive equality check. Perhaps I misunderstood?
>
> If you are relying on an identity check for equality then no two
> FrozenOrderedCollection instances can be equal. Was that your intention?
> If it was, then just hash the instance's id() and you're done.

No, I was talking about the identity check done by a set or dict when
doing a lookup to check if the object in a hash bucket is identical to
the object being looked up. In that case, there is no need for the set
or dict to even call __eq__. Right?

The FrozenOrderedCollection.__eq__ implementation I sketched out was as
intended -- no identity check there.

> If that wasn't your intention then, while there can certainly be a few
> quick checks to rule out equality (such as length) if those match, the
> expensive full equality check will have to happen.

I think we're misunderstanding each other, but at the risk of saying the
same thing again: Because it's an ordered collection, the equality check
for two unequal instances with the same hash value will very likely be
able to bail after comparing lengths or the first items. With a good
hash function, the odds of two unequal instances both hashing to the
same value and having their first N items equal should be vanishingly
small, no?
-------------- next part --------------
An HTML attachment was scrubbed...
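URL: 

The identity shortcut described here is easy to observe on CPython (the
counting class below is ours, written just for the demonstration):

    class EqCounter:
        """Counts __eq__ calls to show the identity fast path."""
        calls = 0

        def __hash__(self):
            return 42

        def __eq__(self, other):
            EqCounter.calls += 1
            return self is other

    a = EqCounter()
    s = {a}
    print(a in s)           # True
    print(EqCounter.calls)  # 0 on CPython: the hash matched and then the
                            # identity check matched, so __eq__ never ran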
From ethan at stoneleaf.us Fri Dec 30 20:04:58 2016
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 30 Dec 2016 17:04:58 -0800
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
Message-ID: <5867043A.5040302@stoneleaf.us>

On 12/30/2016 04:31 PM, jab at math.brown.edu wrote:
> On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:
>> If you are relying on an identity check for equality then no two
>> FrozenOrderedCollection instances can be equal. Was that your
>> intention? If it was, then just hash the instance's
>> id() and you're done.
>
> No, I was talking about the identity check done by a set or dict
> when doing a lookup to check if the object in a hash bucket is
> identical to the object being looked up. In that case, there is
> no need for the set or dict to even call __eq__. Right?

No. It is possible to have two keys be equal but different -- an easy
example is 1 and 1.0; they both hash the same, equal the same, but are
not identical. dict has to check equality when two different objects
hash the same but have non-matching identities.

--
~Ethan~

From jab at math.brown.edu Fri Dec 30 20:10:14 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Fri, 30 Dec 2016 20:10:14 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <5867043A.5040302@stoneleaf.us>
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
Message-ID: 

On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman wrote:

> No. It is possible to have two keys be equal but different -- an easy
> example is 1 and 1.0; they both hash the same, equal the same, but are not
> identical. dict has to check equality when two different objects hash the
> same but have non-matching identities.

Python 3.6.0 (default, Dec 24 2016, 00:01:50)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d = {1: 'int', 1.0: 'float'}
>>> d
{1: 'float'}

IPython 5.1.0 -- An enhanced Interactive Python.

In [1]: class Foo:
   ...:     def __eq__(self, other):
   ...:         return True
   ...:     def __init__(self, val):
   ...:         self.val = val
   ...:     def __repr__(self):
   ...:         return '<Foo %s>' % self.val
   ...:     def __hash__(self):
   ...:         return 42
   ...:

In [2]: f1 = Foo(1)

In [3]: f2 = Foo(2)

In [4]: x = {f1: 1, f2: 2}

In [5]: x
Out[5]: {<Foo 1>: 2}

I'm having trouble showing that two equal but nonidentical objects can
both be in the same dict.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jab at math.brown.edu Fri Dec 30 21:12:02 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Fri, 30 Dec 2016 21:12:02 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
Message-ID: 

On Fri, Dec 30, 2016 at 8:10 PM, wrote:
> On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman wrote:
>> No. It is possible to have two keys be equal but different -- an easy
>> example is 1 and 1.0; they both hash the same, equal the same, but are not
>> identical. dict has to check equality when two different objects hash the
>> same but have non-matching identities.
>>
> ...
> I'm having trouble showing that two equal but nonidentical objects can > both be in the same dict. > (Because they can't, though) your point stands that Python had to call __eq__ in these cases. But with instances of large, immutable, ordered collections, an application could either: 1. accept that it might create a duplicate, equivalent instance of an existing instance with sufficient infrequency that it's okay taking the performance hit, or 2. try to avoid creating duplicate instances in the first place, using the existing, equivalent instance it created as a singleton. Then a set or dict lookup could use the identity check, and avoid the expensive __eq__ call. I think it's much more important to focus on what happens with unequal instances here, since there are usually many more of them. And with them, the performance hit of the __eq__ calls definitely does not necessarily dominate that of __hash__. Right? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Fri Dec 30 21:21:36 2016 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 30 Dec 2016 18:21:36 -0800 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> <5866F9DB.9020207@stoneleaf.us> <5867043A.5040302@stoneleaf.us> Message-ID: <58671630.6020203@stoneleaf.us> On 12/30/2016 06:12 PM, jab at math.brown.edu wrote: > ... your point stands that Python had to call __eq__ in these cases. > > But with instances of large, immutable, ordered collections, an > application could either: > > 1. accept that it might create a duplicate, equivalent instance of > an existing instance with sufficient infrequency that it's okay > taking the performance hit, or > > 2. try to avoid creating duplicate instances in the first place, > using the existing, equivalent instance it created as a singleton. > Then a set or dict lookup could use the identity check, and avoid > the expensive __eq__ call. > I think it's much more important to focus on what happens with > unequal instances here, since there are usually many more of them. > And with them, the performance hit of the __eq__ calls definitely > does not necessarily dominate that of __hash__. Right? I don't think so. As someone else said, a hash can be calculated once and then cached, but __eq__ has to be called every time. Depending on the clustering of your data that could be quick... or not. -- ~Ethan~ From jab at math.brown.edu Fri Dec 30 21:47:54 2016 From: jab at math.brown.edu (jab at math.brown.edu) Date: Fri, 30 Dec 2016 21:47:54 -0500 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: <58671630.6020203@stoneleaf.us> References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> <5866F9DB.9020207@stoneleaf.us> <5867043A.5040302@stoneleaf.us> <58671630.6020203@stoneleaf.us> Message-ID: On Fri, Dec 30, 2016 at 9:21 PM, Ethan Furman wrote: > I don't think so. As someone else said, a hash can be calculated once and > then cached, but __eq__ has to be called every time. Depending on the > clustering of your data that could be quick... or not. > __eq__ only has to be called when a hash bucket is non-empty. In that case, it may be O(n) in pathological cases, but it could also be O(1) every time. On the other hand, __hash__ has to be called on every lookup, is O(n) on the first call, and ideally O(1) after. 
I'd expect that __eq__ may often not dominate, and avoiding an
unnecessary large tuple allocation on the first __hash__ call could be
helpful. If a __hash__ implementation can avoid creating an arbitrarily
large tuple unnecessarily and still perform well, why not save the
space? If the resulting hash distribution is worse, that's another
story, but it seems worth documenting one way or the other, since the
current docs give memory-conscious Python programmers pause for this
use case.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ethan at stoneleaf.us Fri Dec 30 22:08:27 2016
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 30 Dec 2016 19:08:27 -0800
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
 <58671630.6020203@stoneleaf.us>
Message-ID: <5867212B.2050304@stoneleaf.us>

On 12/30/2016 06:47 PM, jab at math.brown.edu wrote:

> __eq__ only has to be called when a hash bucket is non-empty. In that
> case, it may be O(n) in pathological cases, but it could also be O(1)
> every time. On the other hand, __hash__ has to be called on every
> lookup, is O(n) on the first call, and ideally O(1) after. I'd expect
> that __eq__ may often not dominate, and avoiding an unnecessary large
> tuple allocation on the first __hash__ call could be helpful. If a
> __hash__ implementation can avoid creating an arbitrarily large tuple
> unnecessarily and still perform well, why not save the space? If the
> resulting hash distribution is worse, that's another story, but it
> seems worth documenting one way or the other, since the current docs
> give memory-conscious Python programmers pause for this use case.

A quick list of what we have agreed on:

- __hash__ is called once for every dict/set lookup
- __hash__ is calculated once because we are caching the value
- __eq__ is being called an unknown number of times, but can be quite
  expensive in terms of time
- items with the same hash are probably cheap in terms of __eq__ (?)

So maybe this will work?

    def __hash__(self):
        return hash(self.name) * hash(self.nick) * hash(self.color)

In other words, don't create a new tuple, just use the ones you already
have and toss in a couple maths operations. (and have somebody vet
that ;)

--
~Ethan~

From jab at math.brown.edu Fri Dec 30 22:24:08 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Fri, 30 Dec 2016 22:24:08 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <5867212B.2050304@stoneleaf.us>
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
 <58671630.6020203@stoneleaf.us>
 <5867212B.2050304@stoneleaf.us>
Message-ID: 

On Fri, Dec 30, 2016 at 10:08 PM, Ethan Furman wrote:

> So maybe this will work?
>
>     def __hash__(self):
>         return hash(self.name) * hash(self.nick) * hash(self.color)
>
> In other words, don't create a new tuple, just use the ones you already
> have and toss in a couple maths operations. (and have somebody vet that ;)

See the "Simply XORing such results together would not be
order-sensitive, and so wouldn't work" from my original post. (Like XOR,
multiplication is also commutative.)

e.g. Since FrozenOrderedCollection([1, 2]) !=
FrozenOrderedCollection([2, 1]), we should try to avoid making their
hashes equal, or else we increase collisions unnecessarily.
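A quick interactive check makes the commutativity problem concrete (the
helper below is ours, just for illustration):

    # Any commutative mixing step (multiplication, XOR) makes every
    # reordering of the same items collide, unlike the tuple hash.
    def mul_hash(items):
        h = 1
        for item in items:
            h *= hash(item)
        return h

    print(mul_hash([1, 2, 3]) == mul_hash([3, 2, 1]))  # True: guaranteed collision
    print(hash((1, 2, 3)) == hash((3, 2, 1)))          # False: position is mixed in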
More generally, I think the trick is to get an even, chaotic distribution into sys.hash_info.hash_bits out of this hash algorithm, and I think simply multiplying hash values together like this wouldn't do that. -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Dec 30 22:30:58 2016 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 30 Dec 2016 19:30:58 -0800 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: On Fri, Dec 30, 2016 at 9:29 AM, wrote: > Updating the docs sounds like the more important change for now, given 3.7+. > But before the docs make an official recommendation for that recipe, were > the analyses that Steve and I did sufficient to confirm that its hash > distribution and performance is good enough at scale, or is more rigorous > analysis necessary? > > I've been trying to find a reasonably detailed and up-to-date reference on > Python hash() result requirements and analysis methodology, with > instructions on how to confirm if they're met, but am still looking. Would > find that an interesting read if it's out there. But I'd take just an > authoritative thumbs up here too. Just haven't heard one yet. I'm not an expert on hash table implementation subtleties, but I can quote Objects/dictobject.c: "Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't ..." https://github.com/python/cpython/blob/d0a2f68a5f972e4474f5ecce182752f6f7a22178/Objects/dictobject.c#L133 (Basically IIUC the idea is that Python dicts use a relatively sophisticated probe sequence to resolve collisions which allows the use of relatively "poor" hashes. There are interesting trade-offs here...) -n -- Nathaniel J. Smith -- https://vorpus.org From rosuav at gmail.com Fri Dec 30 22:35:29 2016 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 31 Dec 2016 14:35:29 +1100 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> <5866F9DB.9020207@stoneleaf.us> <5867043A.5040302@stoneleaf.us> <58671630.6020203@stoneleaf.us> <5867212B.2050304@stoneleaf.us> Message-ID: On Sat, Dec 31, 2016 at 2:24 PM, wrote: > See the "Simply XORing such results together would not be order-sensitive, > and so wouldn't work" from my original post. (Like XOR, multiplication is > also commutative.) > > e.g. Since FrozenOrderedCollection([1, 2]) != FrozenOrderedCollection([2, > 1]), we should try to avoid making their hashes equal, or else we increase > collisions unnecessarily. How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter. 
ChrisA

From jab at math.brown.edu Sat Dec 31 00:13:07 2016
From: jab at math.brown.edu (jab at math.brown.edu)
Date: Sat, 31 Dec 2016 00:13:07 -0500
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID: 

On Fri, Dec 30, 2016 at 10:30 PM, Nathaniel Smith wrote:

> ... "Most hash schemes depend on having a "good" hash function, in the
> sense of simulating randomness. Python doesn't ..."
> https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133
> ...

Thanks for that link, fascinating to read the rest of that comment!!

Btw, the part you quoted seemed more like a defense of what followed,
i.e. the choice to make hash(some_int) == some_int. I'm not sure how
much the part you quoted applies generally. e.g. The
https://docs.python.org/3/reference/datamodel.html#object.__hash__ docs
don't say, "Don't worry about your __hash__ implementation, dict's
collision resolution strategy is so good it probably doesn't matter
anyway." But they also don't have any discussion of the tradeoffs you
mentioned that might be worth considering.

What should you do when there are arbitrarily many "components of the
object that play a part in comparison of objects"? The
"hash((self._name, self._nick, self._color))" code snippet is the only
example the docs give. This leaves people who have use cases like mine
wondering whether it's still advised to scale this up to the arbitrarily
many items that instances of their class can contain. If not, then what
is advised? Passing a tuple of fewer items to a single hash() call, e.g.
hash(tuple(islice(self, CUTOFF)))? Ned's recipe of pairwise-accumulating
hash() results over all the items? Or only pairwise-accumulating up to a
certain cutoff? Stephen J. Turnbull's idea to use fewer accumulation
steps and larger-than-2-tuples? Passing all the items into some other
cheap, built-in hash algorithm that actually has an incremental update
API (crc32?)?

Still hoping someone can give some authoritative advice, and hope it's
still reasonable to be asking. If not, I'll cut it out.

On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico wrote:

> How likely is it that you'll have this form of collision, rather than
> some other? Remember, collisions *will* happen, so don't try to
> eliminate them all; just try to minimize the chances of *too many*
> collisions. So if you're going to be frequently dealing with (1,2,3)
> and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about
> order; but otherwise, one possible cause of a collision is no worse
> than any other. Keep your algorithm simple, and don't sweat the
> details that you aren't sure matter.

In the context of writing a collections library, and not an application,
it should work well for a diversity of workloads that your users could
throw at it. In that context, it's hard to know what to do with advice
like this. "Heck, just hash the first three items and call it a day!"
-------------- next part --------------
An HTML attachment was scrubbed...
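URL: 

Of the alternatives listed above, crc32 does have a genuinely
incremental interface in the stdlib. A sketch of streaming item hashes
through it (illustrative only -- CRC-32 is just 32 bits wide and was not
designed as a general-purpose hash, so collisions would be far more
frequent than with hash()):

    import zlib

    def crc32_of_items(iterable):
        """Fold each item's hash() into a running CRC-32, one at a time."""
        crc = 0
        for item in iterable:
            # Feed the 8-byte little-endian encoding of the item's hash.
            crc = zlib.crc32(hash(item).to_bytes(8, 'little', signed=True), crc)
        return crc

    print(crc32_of_items([1, 2, 3]) == crc32_of_items([3, 2, 1]))  # False: order-sensitive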
From steve at pearwood.info Sat Dec 31 01:57:05 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Sat, 31 Dec 2016 17:57:05 +1100
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
 <58671630.6020203@stoneleaf.us>
Message-ID: <20161231065705.GF3887@ando.pearwood.info>

On Fri, Dec 30, 2016 at 09:47:54PM -0500, jab at math.brown.edu wrote:

> __eq__ only has to be called when a hash bucket is non-empty. In that case,
> it may be O(n) in pathological cases, but it could also be O(1) every time.
> On the other hand, __hash__ has to be called on every lookup, is O(n) on
> the first call, and ideally O(1) after. I'd expect that __eq__ may often
> not dominate, and avoiding an unnecessary large tuple allocation on the
> first __hash__ call could be helpful.

Sorry to be the broken record repeating himself, but this sounds
*exactly* like premature optimization here. My suggestion is that you
are overthinking things, or at least worrying about issues before you've
got any evidence that they are going to be real issues.

I expect that the amount of time and effort you've spent in analysing
the theoretical problems here and writing to this list is more than the
time it would have taken for you to write the simplest __hash__ method
that could work (using the advice in the docs to make a tuple). You
could have implemented that in five minutes, and be running code by now.

Of course I understand that performance issues may not be visible when
you have 100 or 100 thousand items but may be horrific when you have 100
million. I get that. But you're aware of the possibility, so you can
write a performance test that generates 100 million objects and tests
performance. *If* you find an actual problem, then you can look into
changing your __hash__ method. You could come back here and talk about
actual performance issues instead of hypothetical issues, or you could
hire an expert to tune your hash function. (And if you do pay for an
expert to solve this, please consider giving back to the community.)

Remember that the specific details of __hash__ should be an internal
implementation detail for your class. You shouldn't fear changing the
hash algorithm as often as you need to, including in bug fix releases.
You don't have to wait for the perfect hash function, just a "good
enough for now" one to get started.

I'm not trying to be dismissive of your concerns. They may be real
issues that you have to solve. I'm just saying, you should check your
facts first rather than solve hypothetical problems. I have seen, and
even written, far too much pessimized code in the name of "this must be
better, it stands to reason" to give much credence to theoretical
arguments about performance.

--
Steve

From steve at pearwood.info Sat Dec 31 02:00:21 2016
From: steve at pearwood.info (Steven D'Aprano)
Date: Sat, 31 Dec 2016 18:00:21 +1100
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <5867212B.2050304@stoneleaf.us>
References: <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
 <58671630.6020203@stoneleaf.us>
 <5867212B.2050304@stoneleaf.us>
Message-ID: <20161231070021.GG3887@ando.pearwood.info>

On Fri, Dec 30, 2016 at 07:08:27PM -0800, Ethan Furman wrote:

> So maybe this will work?
>
>     def __hash__(self):
>         return hash(self.name) * hash(self.nick) * hash(self.color)

I don't like the multiplications. If any of the three hashes return
zero, the overall hash will be zero.
I think you need better mixing than that. Look at tuple:

py> hash((0, 1, 2))
-421559672
py> hash(0) * hash(1) * hash(2)
0

--
Steve

From ncoghlan at gmail.com Sat Dec 31 02:33:52 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 31 Dec 2016 17:33:52 +1000
Subject: [Python-ideas] AtributeError inside __get__
In-Reply-To: <5866BB3D.6010102@stoneleaf.us>
References: <5866BB3D.6010102@stoneleaf.us>
Message-ID: 

On 31 December 2016 at 05:53, Ethan Furman wrote:

> On 12/30/2016 07:10 AM, Chris Angelico wrote:
>
>> Actually, that makes a lot of sense. And since "property" isn't magic
>> syntax, you could take it sooner:
>>
>> from somewhere import property
>>
>> and toy with it that way.
>>
>> What module would be appropriate, though?
>
> Well, DynamicClassAttribute is kept in the types module, so that's
> probably the place to put optionalproperty as well.

I'd also be OK with just leaving it as a builtin.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ncoghlan at gmail.com Sat Dec 31 02:42:14 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 31 Dec 2016 17:42:14 +1000
Subject: [Python-ideas] New PyThread_tss_ C-API for CPython
In-Reply-To: 
References: 
Message-ID: 

On 31 December 2016 at 08:24, Masayuki YAMAMOTO wrote:

> I have read the discussion, and I agree that a structure should be used
> for Py_tss_t instead of a platform-specific data type. Just as Steve
> said, Py_tss_t should be genuinely treated as an opaque type, and the
> key-state checking should be provided via macros or inline functions
> with a name like PyThread_tss_is_created. Well, I'd resolve the
> specification a bit more :)
>
> If PyThread_tss_create is called with an already-created key, it is a
> no-op, but should the function succeed or fail? In my opinion, it is
> better to return a failure, because there is a high possibility that
> code calling PyThread_tss_create multiple times for one key is
> incorrect.

That's not what we currently do for the EnsureGIL autoTLS key and the
tracemalloc key though - the reentrant key creation is part of
"create-if-needed" flows where the key creation is silently skipped if
the key already exists. Changing that would require some further
research into how we ended up with the current approach, while carrying
it over into the new API design would be the default option.

> In this view, PyThread_tss_is_created should return a value as follows:
>
> (A) False from after defining the key with Py_tss_NEEDS_INIT until
> PyThread_tss_create is called
> (B) True after a call to PyThread_tss_create succeeds
> (C) Unchanged before and after a call to PyThread_tss_create that fails
> (D) False after calling PyThread_tss_delete, regardless of timing
> (E) For the other functions, the return value of PyThread_tss_is_created
> does not change before and after the call
>
> I think it would be better to write a test for the state of the
> Py_tss_t.

I agree it would be good to add more test cases for this scenario to the
test suite.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ncoghlan at gmail.com Sat Dec 31 07:17:28 2016
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 31 Dec 2016 22:17:28 +1000
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: 
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
Message-ID: 

On 31 December 2016 at 15:13, wrote:

> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico wrote:
>
>> How likely is it that you'll have this form of collision, rather than
>> some other? Remember, collisions *will* happen, so don't try to eliminate
>> them all; just try to minimize the chances of *too many* collisions. So if
>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>> possible cause of a collision is no worse than any other. Keep your
>> algorithm simple, and don't sweat the details that you aren't sure matter.
>
> In the context of writing a collections library, and not an application,
> it should work well for a diversity of workloads that your users could
> throw at it. In that context, it's hard to know what to do with advice like
> this. "Heck, just hash the first three items and call it a day!"

Yes, this is essentially what we're suggesting you do - start with a
"good enough" hash that may have scalability problems (e.g. due to
memory copying) or mathematical distribution problems (e.g. due to a
naive mathematical combination of values), and then improve it over time
based on real world usage of the library.

Alternatively, you could take the existing tuple hashing algorithm and
reimplement that in pure Python:
https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336

Based on microbenchmarks, you could then find the size breakpoint where
it makes sense to switch between "hash(tuple(self))" (with memory
copying, but a more optimised implementation of the algorithm) and a
pure Python "tuple_hash(self)". In either case, caching the result on
the instance would minimise the computational cost over the lifetime of
the object.

Cheers,
Nick.

P.S. Having realised that the tuple hash *algorithm* can be re-used
without actually creating a tuple, I'm more amenable to the idea of
exposing a "hash.from_iterable" callable that produces the same result
as "hash(tuple(iterable))" without actually creating the intermediate
tuple.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From brenbarn at brenbarn.net Fri Dec 30 20:13:02 2016
From: brenbarn at brenbarn.net (Brendan Barnwell)
Date: Fri, 30 Dec 2016 17:13:02 -0800
Subject: [Python-ideas] incremental hashing in __hash__
In-Reply-To: <5867043A.5040302@stoneleaf.us>
References: <5863F223.3040906@stoneleaf.us>
 <20161229081959.GA3887@ando.pearwood.info>
 <5866F9DB.9020207@stoneleaf.us>
 <5867043A.5040302@stoneleaf.us>
Message-ID: <5867061E.8070102@brenbarn.net>

On 2016-12-30 17:04, Ethan Furman wrote:
> On 12/30/2016 04:31 PM, jab at math.brown.edu wrote:
>>> On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:
>>>>> If you are relying on an identity check for equality then no
>>>>> two FrozenOrderedCollection instances can be equal. Was that
>>>>> your intention? If it was, then just hash the instance's
>>>>> id() and you're done.
>>>
>>> No, I was talking about the identity check done by a set or dict
>>> when doing a lookup to check if the object in a hash bucket is
>>> identical to the object being looked up.
In that case, there is >>> no need for the set or dict to even call __eq__. Right? > No. It is possible to have two keys be equal but different -- an > easy example is 1 and 1.0; they both hash the same, equal the same, > but are not identical. dict has to check equality when two different > objects hash the same but have non-matching identities. I think that is the same as what he said. The point is that if they *are* the same object, you *don't* need to check equality. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown From ericsnowcurrently at gmail.com Sat Dec 31 17:38:53 2016 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Sat, 31 Dec 2016 15:38:53 -0700 Subject: [Python-ideas] AtributeError inside __get__ In-Reply-To: References: <5866BB3D.6010102@stoneleaf.us> Message-ID: On Sat, Dec 31, 2016 at 12:33 AM, Nick Coghlan wrote: > On 31 December 2016 at 05:53, Ethan Furman wrote: >> On 12/30/2016 07:10 AM, Chris Angelico wrote: >>> What module would be appropriate, though? >> >> Well, DynamicClassAttribute is kept in the types module, so that's >> probably the place to put optionalproperty as well. > > I'd also be OK with just leaving it as a builtin. FWIW, I've felt for a while that the "types" module is becoming a catchall for stuff that would be more appropriate in a new "classtools" module (a la functools). I suppose that's what "types" has become, but I personally prefer the separate modules that make the distinction and would rather that "types" looked more like it does in 2.7. Perhaps this would be a good time to get that ball rolling or maybe it's just too late. I'd like to think it's the former, especially since I consider "classtools" a module that has room to grow. -eric From jab at math.brown.edu Sat Dec 31 17:39:41 2016 From: jab at math.brown.edu (jab at math.brown.edu) Date: Sat, 31 Dec 2016 17:39:41 -0500 Subject: [Python-ideas] incremental hashing in __hash__ In-Reply-To: References: <5863F223.3040906@stoneleaf.us> <20161229081959.GA3887@ando.pearwood.info> Message-ID: On Sat, Dec 31, 2016 at 7:17 AM, Nick Coghlan wrote: > On 31 December 2016 at 15:13, wrote: > >> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico >> wrote: >> >>> How likely is it that you'll have this form of collision, rather than >>> some other? Remember, collisions *will* happen, so don't try to eliminate >>> them all; just try to minimize the chances of *too many* collisions. So if >>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) >>> and (3,1,2), then sure, you need to care about order; but otherwise, one >>> possible cause of a collision is no worse than any other. Keep your >>> algorithm simple, and don't sweat the details that you aren't sure matter. >> >> >> In the context of writing a collections library, and not an application, >> it should work well for a diversity of workloads that your users could >> throw at it. In that context, it's hard to know what to do with advice like >> this. "Heck, just hash the first three items and call it a day!" >> > > Yes, this is essentially what we're suggesting you do - start with a "good > enough" hash that may have scalability problems (e.g. due to memory > copying) or mathematical distribution problems (e.g. due to a naive > mathematical combination of values), and then improve it over time based on > real world usage of the library. 
> Alternatively, you could take the existing tuple hashing algorithm and
> reimplement that in pure Python:
> https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
>
> Based on microbenchmarks, you could then find the size breakpoint where
> it makes sense to switch between "hash(tuple(self))" (with memory
> copying, but a more optimised implementation of the algorithm) and a
> pure Python "tuple_hash(self)". In either case, caching the result on
> the instance would minimise the computational cost over the lifetime of
> the object.
>
> Cheers,
> Nick.
>
> P.S. Having realised that the tuple hash *algorithm* can be re-used
> without actually creating a tuple, I'm more amenable to the idea of
> exposing a "hash.from_iterable" callable that produces the same result
> as "hash(tuple(iterable))" without actually creating the intermediate
> tuple.

Nice! I just realized, similar to tupleobject.c's "tuplehash" routine, I
think the frozenset_hash algorithm (implemented in setobject.c) can't be
reused without actually creating a frozenset either. As mentioned, a set
hashing algorithm is exposed as collections.Set._hash() in
_collections_abc.py, which can be passed an iterable, but that routine
is implemented in Python. So here again it seems like you have to choose
between either creating an ephemeral copy of your data so you can use
the fast C routine, or streaming your data to a slower Python
implementation. At least in this case the Python implementation is
built-in though.

Given my current shortage of information, for now I'm thinking of
handling this problem in my library by exposing a parameter that users
can tune if needed. See bidict/_frozen.py in
https://github.com/jab/bidict/commit/485bf98#diff-215302d205b9f3650d58ee0337f77297,
and check out the _HASH_NITEMS_MAX attribute. I have to run for now, but
thanks again everyone, and happy new year!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
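As a concrete coda to Nick's suggestion quoted above, a pure-Python port
of the 3.6-era tuplehash algorithm from Objects/tupleobject.c might look
like the sketch below. It accepts any sized iterable, so no intermediate
tuple is allocated; the masking emulates C's unsigned wraparound
arithmetic. (This is an unofficial sketch: anyone relying on it should
verify it against hash(tuple(...)) on their own build.)

    import sys

    _WIDTH = sys.hash_info.width      # bits in a hash value (64 on most builds)
    _MASK = (1 << _WIDTH) - 1

    def tuple_hash(collection):
        """Compute hash(tuple(collection)) without building the tuple."""
        n = len(collection)
        x = 0x345678
        mult = 1000003                # _PyHASH_MULTIPLIER
        for item in collection:
            n -= 1
            y = hash(item) & _MASK    # view the signed hash as unsigned
            x = ((x ^ y) * mult) & _MASK
            mult = (mult + 82520 + n + n) & _MASK
        x = (x + 97531) & _MASK
        if x & (1 << (_WIDTH - 1)):   # reinterpret the result as signed
            x -= 1 << _WIDTH
        return -2 if x == -1 else x   # a hash of -1 is reserved for errors

    # Expected to agree with the built-in tuple hash, e.g.:
    # tuple_hash([1, 2, 3]) == hash((1, 2, 3))

As with the earlier snippets, caching the computed value on the instance
keeps the amortized cost of repeated lookups at O(1).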