From rosuav at gmail.com Tue Jul 1 00:05:19 2014 From: rosuav at gmail.com (Chris Angelico) Date: Tue, 1 Jul 2014 08:05:19 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> Message-ID: On Tue, Jul 1, 2014 at 3:18 AM, wrote: > On Sat, Jun 28, 2014, at 01:28, Chris Angelico wrote: >> empty_set_literal = >> type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm > > If you're embedding the entire compiler (in fact, a modified one) in > your tool, why not just output a .pyc? I'm not, I'm calling on the normal compiler. Also, I'm not familiar with the pyc format, nor with any of the potential pit-falls of that approach. But if someone wants to make an "alternative front end that makes a .pyc file" kind of thing, they're most welcome to. ChrisA From abarnert at yahoo.com Tue Jul 1 01:48:14 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 30 Jun 2014 16:48:14 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> Message-ID: <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> First, two quick side notes: It might be nice if the compiler were as easy to hook as the importer.?Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial. I doubt either of those would be useful often enough that anyone wants to put in the work.?But then I doubt the empty-set literal would be either, so anyone who seriously wants to work on this might want to work on the inline assembly and/or hookable compiler first. Anyway: On Monday, June 30, 2014 3:12 PM, Chris Angelico wrote: >On Tue, Jul 1, 2014 at 3:18 AM,? wrote: >> On Sat, Jun 28, 2014, at 01:28, Chris Angelico wrote: >>> empty_set_literal = >>> type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm I think it makes more sense to use types.FunctionType and types.CodeType here than to generate two extra functions for each function, even if that means you have to put an import types at the top of every munged source file. >> If you're embedding the entire compiler (in fact, a modified one) in >> your tool, why not just output a .pyc? > >I'm not, I'm calling on the normal compiler. Also, I'm not familiar >with the pyc format, nor with any of the potential pit-falls of that >approach. But if someone wants to make an "alternative front end that >makes a .pyc file" kind of thing, they're most welcome to. The tricky bit with making a .pyc file is generating the header information?last I checked (quite a while ago, and not that deeply?) that wasn't documented, and there were no helpers exported to Python. 
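(For anyone who wants to try anyway: as best I can tell, the 3.3/3.4 header is nothing more than the magic number, an mtime, and the source size, followed by the marshalled code object, so a hand-rolled writer is only a few lines. That layout is an assumption on my part, isn't documented as stable, and changes between versions - which is exactly the problem:)

    import importlib.util
    import marshal
    import struct
    import time

    def write_pyc(source, filename, pyc_path):
        # Assumed CPython 3.3/3.4 layout: 4-byte magic, 4-byte mtime,
        # 4-byte source size, then the marshalled code object.
        # len(source) stands in for the original file's byte size.
        code = compile(source, filename, "exec")
        with open(pyc_path, "wb") as f:
            f.write(importlib.util.MAGIC_NUMBER)
            f.write(struct.pack("<II", int(time.time()), len(source)))
            f.write(marshal.dumps(code))
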
But I think what he was suggesting is something like this: Let?py_compile.compile generate the .pyc file as normal, then munge the bytecode in that file, instead of compiling each function, munging its bytecode, and emitting source that creates the munged functions. Besides being a lot less work, his version works for ? at top level, in class definitions, in lambda expressions, etc., not just for def statements. And it doesn't require finding and identifying all of the things to munge in a source file (which I assume you'd do bottom-up based on the ast.parse tree or something). But either way, this still doesn't solve the big problem. Compiling a function by hand and then tweaking the bytecode is easy; doing it programmatically is more painful.?You obviously need the function to compile, so you have to replace the ? with something else whose bytecode you can search-and-replace. But what? That something else has to be valid in an expression context (so it compiles), has to compile to a 3-byte opcode (otherwise, replacing it will screw up any jump targets that point after it), can't add any globals/constants/etc. to the list (otherwise, removing it will screw up any LOAD_FOO statements that refer to a higher-numbered foo), and can't appear anywhere in the code being compiled. The only thing I can think of off the top of my head is to replace it with whichever of [], (), or {} doesn't appear anywhere in the code being compiled, then you can search-replace BUILD_LIST/TUPLE/MAP 0 with BUILD_SET 0. But what if all three appear in the code? Raise a SyntaxError('Cannot use all 4 kinds of empty literals in the same scope')? One more thing that I'm sure you thought of, but may not have thought through all the way: To make this generally useful, you can't just hardcode creating a zero-arg top-level function; you need to copy all of the code and function constructor arguments from the compiled function.? So, if the function is a closure, how do you do that? You need to pass a list of closure cell objects that bind to the appropriate co_cellvars from the?current frame, and I don't think there's a way to do that from Python.?So, you need to do that by bytecode-hacking the outer function in the same way, just so it can build the inner function.?And, even if you could build closure cells, once you've replaced the inner function definition with a function constructor from bytecode, when the resulting code gets compiled, it won't have any cellvars anymore. And going back to the top, all of these problems are why I think random's solution would be a lot easier than yours, but why my solution (first build compiler hooks or inline assembly, then use that to implement the empty set trivially) would be no harder than either (and a lot more generally useful), and also why I think this really isn't worth doing. From ron3200 at gmail.com Tue Jul 1 02:16:38 2014 From: ron3200 at gmail.com (Ron Adam) Date: Mon, 30 Jun 2014 19:16:38 -0500 Subject: [Python-ideas] Special keyword denoting an infinite loop In-Reply-To: <20140630182030.GR13014@ando> References: <20140628091112.GI13014@ando> <1404149077.20890.136187005.57F7B11E@webmail.messagingengine.com> <20140630182030.GR13014@ando> Message-ID: On 06/30/2014 01:20 PM, Steven D'Aprano wrote: > On Mon, Jun 30, 2014 at 01:24:37PM -0400, random832 at fastmail.us wrote: >> On Sat, Jun 28, 2014, at 06:05, Stefan Behnel wrote: >>> Adding a new keyword needs very serious reasoning, and that's a good >>> thing. > [...] >> What about _just_ "while:" or "for:"? > > Why bother? 
Is there anything you can do with a bare "while:" that you > can't do with "while True:"? If not, what's the point? It looks like (in python3) "while 1:", "while True:", and while with a string, generates the same byte code. Just a bare SETUP_LOOP. Which would be the exact same as "while:" would. So no, it wouldn't make a bit of difference other than saving a few key strokes in the source code. >>> def L(): ... while True: ... break ... >>> L() >>> dis(L) 2 0 SETUP_LOOP 4 (to 7) 3 >> 3 BREAK_LOOP 4 JUMP_ABSOLUTE 3 >> 7 LOAD_CONST 0 (None) 10 RETURN_VALUE >>> def LL(): ... while 1: ... break ... >>> dis(LL) 2 0 SETUP_LOOP 4 (to 7) 3 >> 3 BREAK_LOOP 4 JUMP_ABSOLUTE 3 >> 7 LOAD_CONST 0 (None) 10 RETURN_VALUE >>> def LLL(): ... while "looping": ... break ... >>> dis(LLL) 2 0 SETUP_LOOP 4 (to 7) 3 >> 3 BREAK_LOOP 4 JUMP_ABSOLUTE 3 >> 7 LOAD_CONST 0 (None) 10 RETURN_VALUE Cheers, Ron From abarnert at yahoo.com Tue Jul 1 02:27:55 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 30 Jun 2014 17:27:55 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <20140701001814.GA27480@leliel.pault.ag> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <20140701001814.GA27480@leliel.pault.ag> Message-ID: <1404174475.669.YahooMailNeo@web181001.mail.ne1.yahoo.com> On Monday, June 30, 2014 5:18 PM, Paul Tagliamonte wrote: [snip] >Right, so, this was brought up before, but with Hylang >(https://github.com/hylang/hy), we abuse the PEP 302 new import hooks to >search sys.path for .hy files rather then .py files. > >You could do the same for your .pyu files (again, *without* the blessing >of the core team, as this is insane), and do the mangling before passing >it to the normal internals to turn it into bytecode / AST. > >Doing it this way means you won't have to futz with the compiler, >and you can remain happy. The reason for needing to futz with the compiler is to generate source code that actually compiles to the bytecode to build an empty set directly, instead of the bytecode to load and call the "set" global. I agree with both you and Guido that the whole thing is silly, and set() is fine. I also agree with your implied assumption that, even if you needed an empty set literal, having it compile to the same thing as set() would be fine. But those who disagree with both, and really want an empty set literal that compiles to "BUILD_SET 0", cannot have it without futzing with the compiler. So, I'd encourage them to work on making the compiler more futzable (which surely more people would have a use for than the number of people for whom set() is intolerably slow, or unusable because they want to redefine the global). From rosuav at gmail.com Tue Jul 1 02:39:14 2014 From: rosuav at gmail.com (Chris Angelico) Date: Tue, 1 Jul 2014 10:39:14 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: On Tue, Jul 1, 2014 at 9:48 AM, Andrew Barnert wrote: > First, two quick side notes: > > It might be nice if the compiler were as easy to hook as the importer. 
Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial. > That would be interesting, but it raises the possibility of mucking up the stack. (Imagine if you put BUILD_SET 1 in there instead. What's it going to make a set of? What's going to happen to the rest of the stack? Do you REALLY want to debug that?) Back when I did a lot of C and C++ programming, I used to make good use of a "drop to assembly" feature. There were two broad areas where I'd use it: either to access a CPU feature that the compiler and library didn't offer me (like CPUID, in its early days), or to hand-optimize some code. Then compilers got better and better, and the first set of cases got replaced with library functions... and the second lot ended up being no better than the compiler's output, and potentially a lot worse - particularly because they're non-portable. Allowing a "drop to bytecode" in CPython would have the exact same effects, I think. Some people would use it to create an empty set, others would use it to replace variable swapping with a marginally faster and *almost* identical stack-based swap: x,y = y,x LOAD_GLOBAL y LOAD_GLOBAL x ROT_TWO STORE_GLOBAL x STORE_GLOBAL y becomes LOAD_GLOBAL x LOAD_GLOBAL y STORE_GLOBAL x STORE_GLOBAL y Seems fine, right? But it's a subtle change to semantics (evaluation order), and not much benefit anyway. Plus, if it's decided that this semantic change is safe (if it's provably not going to have any significance), a future version of CPython would be able to make the exact same optimization, while leaving the code readable, and portable to other Python implementations. So while an inline bytecode assembler might have some uses, I suspect it'd be an attractive nuisance more than anything else. > On Monday, June 30, 2014 3:12 PM, Chris Angelico wrote: >>On Tue, Jul 1, 2014 at 3:18 AM, wrote: > >>> On Sat, Jun 28, 2014, at 01:28, Chris Angelico wrote: >>>> empty_set_literal = >>>> type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm > > I think it makes more sense to use types.FunctionType and types.CodeType here than to generate two extra functions for each function, even if that means you have to put an import types at the top of every munged source file. Sure. This is just a proof-of-concept anyway, and it's not meant to be good code. Either way works, I just tried to minimize name usage (and potential name collisions). > But I think what he was suggesting is something like this: Let py_compile.compile generate the .pyc file as normal, then munge the bytecode in that file, instead of compiling each function, munging its bytecode, and emitting source that creates the munged functions. > > > Besides being a lot less work, his version works for ? at top level, in class definitions, in lambda expressions, etc., not just for def statements. And it doesn't require finding and identifying all of the things to munge in a source file (which I assume you'd do bottom-up based on the ast.parse tree or something). > Sure. But all I was doing was responding to the implied statement that it's not possible to write a .py file that makes a function with BUILD_SET 0 in it. Translating a .pyu directly into a .pyc is still possible, but was not the proposal. 
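(To spell out the baseline being talked about: no ordinary source construct compiles to BUILD_SET 0, which is easy to confirm with dis. A quick sketch - the commented output is roughly what CPython 3.4 prints, and offsets and argument annotations differ on other versions:)

    import dis

    dis.dis(compile("set()", "<demo>", "eval"))
    #   1           0 LOAD_NAME                0 (set)
    #               3 CALL_FUNCTION            0
    #               6 RETURN_VALUE

    dis.dis(compile("{1}", "<demo>", "eval"))
    #   1           0 LOAD_CONST               0 (1)
    #               3 BUILD_SET                1
    #               6 RETURN_VALUE
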
> But either way, this still doesn't solve the big problem. Compiling a function by hand and then tweaking the bytecode is easy; doing it programmatically is more painful. You obviously need the function to compile, so you have to replace the ? with something else whose bytecode you can search-and-replace. But what? That something else has to be valid in an expression context (so it compiles), has to compile to a 3-byte opcode (otherwise, replacing it will screw up any jump targets that point after it), can't add any globals/constants/etc. to the list (otherwise, removing it will screw up any LOAD_FOO statements that refer to a higher-numbered foo), and can't appear anywhere in the code being compiled. > What I did was put in a literal string. https://github.com/Rosuav/shed/blob/master/empty_set.py It uses "? is set()" as a marker, depending on that string not existing in the source. (I could compile the function twice, once with that string, and then a second time with another string; the first compilation would show what consts it uses, and the program could then generate an arbitrary constant which doesn't exist.) The opcode is the right length (assuming it doesn't go for EXTENDED_ARG, which I've never heard of; it seems to be necessary if you have more than 64K consts/globals/locals in a function???), and the resulting function has an unnecessary const in it. It wouldn't be hard to drop it (the code already parses through everything; it could just go "if it's LOAD_CONST, three options - if it's the marker, switch in a BUILD_SET, if it's less than the marker, no change, if it's more than the marker, decrement"), but it doesn't seem to be a problem to have an extra const in there. > One more thing that I'm sure you thought of, but may not have thought through all the way: To make this generally useful, you can't just hardcode creating a zero-arg top-level function; you need to copy all of the code and function constructor arguments from the compiled function. > It handles arguments and stuff. All the attributes of the original function object get passed through unchanged to the resulting function, with the exception of the bytecode, obviously. > So, if the function is a closure, how do you do that? You need to pass a list of closure cell objects that bind to the appropriate co_cellvars from the current frame, and I don't think there's a way to do that from Python. So, you need to do that by bytecode-hacking the outer function in the same way, just so it can build the inner function. And, even if you could build closure cells, once you've replaced the inner function definition with a function constructor from bytecode, when the resulting code gets compiled, it won't have any cellvars anymore. > Ah, that part I've no idea about. But it wouldn't be impossible for someone to develop that a bit further. > And going back to the top, all of these problems are why I think random's solution would be a lot easier than yours, but why my solution (first build compiler hooks or inline assembly, then use that to implement the empty set trivially) would be no harder than either (and a lot more generally useful), and also why I think this really isn't worth doing. > Right. I absolutely agree with your conclusion (not worth doing), and always have had that view. This is proof that it's kinda possible, but still a bad idea. 
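(For concreteness, here's a rough sketch of the marker trick - not the actual empty_set.py code, just an illustration. It assumes the pre-3.6 bytecode format, where an opcode with an argument is three bytes, it ignores EXTENDED_ARG, and it leaves the now-unused marker constant in co_consts, as discussed above:)

    import dis
    import types

    MARKER = "\u2205 is set()"   # placeholder string the source translator would insert

    def replace_marker(func):
        # Rewrite every LOAD_CONST of MARKER into BUILD_SET 0 and rebuild
        # the function around the patched code object.
        code = func.__code__
        marker_index = code.co_consts.index(MARKER)
        bc = bytearray(code.co_code)
        i = 0
        while i < len(bc):
            op = bc[i]
            if op < dis.HAVE_ARGUMENT:
                i += 1
                continue
            arg = bc[i + 1] | (bc[i + 2] << 8)
            if op == dis.opmap['LOAD_CONST'] and arg == marker_index:
                bc[i] = dis.opmap['BUILD_SET']
                bc[i + 1] = bc[i + 2] = 0
            i += 3
        patched = types.CodeType(
            code.co_argcount, code.co_kwonlyargcount, code.co_nlocals,
            code.co_stacksize, code.co_flags, bytes(bc), code.co_consts,
            code.co_names, code.co_varnames, code.co_filename, code.co_name,
            code.co_firstlineno, code.co_lnotab, code.co_freevars,
            code.co_cellvars)
        return types.FunctionType(patched, func.__globals__, func.__name__,
                                  func.__defaults__, func.__closure__)

    def f():
        return "\u2205 is set()"   # what the translator would emit for a bare marker

    f = replace_marker(f)
    print(f())        # -> set()
    dis.dis(f)        # shows BUILD_SET 0 where the LOAD_CONST used to be
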
Now, if someone comes up with a really compelling use-case for an empty set literal, then maybe it'd be more important; but if that happens, CPython will probably grow an empty set literal in ASCII somehow, and then the .pyu translation can just turn ? into that. ChrisA From paultag at gmail.com Tue Jul 1 02:18:14 2014 From: paultag at gmail.com (Paul Tagliamonte) Date: Mon, 30 Jun 2014 20:18:14 -0400 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: <20140701001814.GA27480@leliel.pault.ag> On Mon, Jun 30, 2014 at 04:48:14PM -0700, Andrew Barnert wrote: > First, two quick side notes: > > It might be nice if the compiler were as easy to hook as the importer.?Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial. > > I doubt either of those would be useful often enough that anyone wants to put in the work.?But then I doubt the empty-set literal would be either, so anyone who seriously wants to work on this might want to work on the inline assembly and/or hookable compiler first. Again, to be absolutely clear here, I hate this idea. `set()` is perfectly clear. Sorry. Had to be said before any of this. Right, so, this was brought up before, but with Hylang (https://github.com/hylang/hy), we abuse the PEP 302 new import hooks to search sys.path for .hy files rather then .py files. You could do the same for your .pyu files (again, *without* the blessing of the core team, as this is insane), and do the mangling before passing it to the normal internals to turn it into bytecode / AST. Doing it this way means you won't have to futz with the compiler, and you can remain happy. And we like being happy. More info: https://github.com/hylang/hy/blob/master/hy/importer.py http://slides.pault.ag/hy.html#/15 https://www.youtube.com/watch?v=AmMaN1AokTI https://www.youtube.com/watch?v=ulekCWvDFVI http://legacy.python.org/dev/peps/pep-0302/ Again, this approach can be a bit flaky, and this particular issue might very well cause problems for us as a community - seeing as how the syntax is almost exactly identical. Hylang (for what it's worth) is just a nice way for us Lisp nerds to stop complaining as much. Godspeed, Paul -- #define sizeof(x) rand() :wq -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: Digital signature URL: From ncoghlan at gmail.com Tue Jul 1 04:15:20 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 1 Jul 2014 12:15:20 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: On 30 Jun 2014 16:51, "Andrew Barnert" wrote: > > First, two quick side notes: > > It might be nice if the compiler were as easy to hook as the importer. Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial. Eugene Toder & Dave Malcolm have some interesting patches on the tracker to help enhance the compiler (in particular, Dave's allowed compiler plugins to be written in Python). Neither set of patches made it to being merge ready, though, and they'll be rather stale at this point. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Tue Jul 1 04:18:44 2014 From: guido at python.org (Guido van Rossum) Date: Mon, 30 Jun 2014 19:18:44 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers). On Mon, Jun 30, 2014 at 7:15 PM, Nick Coghlan wrote: > > On 30 Jun 2014 16:51, "Andrew Barnert" > wrote: > > > > First, two quick side notes: > > > > It might be nice if the compiler were as easy to hook as the > importer. Alternatively, it might be nice if there were a way to do "inline > bytecode assembly" in CPython, similar to the way you do inline assembly in > many C compilers, so the answer to random's question is just "asm > [('BUILD_SET', 0)]" or something similar. Either of those would make this > problem trivial. > > Eugene Toder & Dave Malcolm have some interesting patches on the tracker > to help enhance the compiler (in particular, Dave's allowed compiler > plugins to be written in Python). Neither set of patches made it to being > merge ready, though, and they'll be rather stale at this point. > > Cheers, > Nick. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abarnert at yahoo.com Tue Jul 1 10:04:37 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 01:04:37 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> Before I get to the reply, because I couldn't find a 3.x-compatible bytecode assembler, I slapped one together at https://github.com/abarnert/cpyasm. I think it would be reasonably possible to use this to add inline assembly to a preprocessor, but I haven't tried, because I don't have a preprocessor I actually want, and this was the fun part. :) > On Monday, June 30, 2014 5:39 PM, Chris Angelico wrote: > > On Tue, Jul 1, 2014 at 9:48 AM, Andrew Barnert wrote: >> First, two quick side notes: >> >> It might be nice if the compiler were as easy to hook as the importer. > Alternatively, it might be nice if there were a way to do "inline bytecode > assembly" in CPython, similar to the way you do inline assembly in many C > compilers, so the answer to random's question is just "asm > [('BUILD_SET', 0)]" or something similar. Either of those would > make this problem trivial. >> > > That would be interesting, but it raises the possibility of mucking up > the stack. (Imagine if you put BUILD_SET 1 in there instead. What's it > going to make a set of? What's going to happen to the rest of the > stack? Do you REALLY want to debug that?) The same thing that happens if you use bad inline assembly in C, or a bad C extension module in Python?bad things that you can't debug at source level. And yet, inline assembly in C and C extension modules in Python are still quite useful. Of course the difference is that you can drop from the source level to the machine level pretty easily in gdb, lldb, Microsoft's debugger, etc., while you can't as easily drop from the source level to the bytecode level in pdb. (I'm not sure that wouldn't be an interesting feature to add in itself, but that's getting even farther off topic, so forget it for now.)? > Back when I did a lot of C and C++ programming, I used to make good > use of a "drop to assembly" feature. There were two broad areas where > I'd use it: either to access a CPU feature that the compiler and > library didn't offer me (like CPUID, in its early days), or to > hand-optimize some code. Then compilers got better and better, and the > first set of cases got replaced with library functions... and the > second lot ended up being no better than the compiler's output, and > potentially a lot worse - particularly because they're non-portable. > Allowing a "drop to bytecode" in CPython would have the exact same > effects, I think. I'll ignore the second case for the moment, because I think it's rarely if ever appropriate to Python, and just focus on the first. Those cases did not go away because CPUID got replaced with library functions. Those library functions?which are compiled with the same compiler you use for your code?have inline assembly in them. (Or, if you're on linux, those library functions read from a device file, but the device driver, which is compiled with the same compiler you use, has inline assembly in it.) So, the compiler still needs to be able to compile it. There are cases where that isn't true. 
For example, most modern compilers that care about x86 have extended the C language in some way to make it unnecessary for you to write LOCK CMPXCHG all over the place if you want to do lockfree refcounting (and, even better, they've done so in a way that also does the right thing on ARM 9 or SPARC or whatever else you care about). Or, in some cases, they've done something halfway in between, adding "compiler intrinsic functions" that look like functions, but are compiled directly into inline asm. But either way, that didn't happen until a lot of people were publishing code that used that inline assembly. Otherwise, the compiler vendors have no reason to believe it's necessary to add a new feature. Plus, people still needed to keep distributing code that uses the inline asm for years, until the oldest compiler and library on every platform they support incorporated the change they needed. And, just as you say, I think it would have the exact same effects in CPython. If we added inline bytecode asm to 3.5, and there were actually something useful to do with it, people would start doing it, and that's how we'd know that something useful was worth adding to the language, and when we added that something useful in 3.7, eventually people could start using that, and then it would be years before all of the projects that need that feature either die or require 3.7. But that's not a problem; that's inline asm working exactly as it should. There is one good reason to reject the inline asm idea: If it's unlikely that there will be anything worth using it for (or if it might plausibly be useful, but not enough so that anyone's worth doing the work). Which I think is at least plausible, and maybe likely. > Some people would use it to create an empty set, > others would use it to replace variable swapping with a marginally > faster and *almost* identical stack-based swap: Do you really think anyone would do the latter? Seriously, what kind of code can you imagine that's too slow in CPython, not worth rewriting in C or running in PyPy or whatever, but fast enough with the rot opcode removed? And if someone really _did_ need that, I doubt they'd care much that Python 3.8 makes it unnecessary; they obviously have a specific deployment platform that has to work and that needed that last 3% speedup under 3.6.2, and they're going to need that to keep working for years. The former, maybe. Not just to allow ?, but maybe someone would want to write a Unicode-math-ified Python dialect as an import-hook preprocessor that used inline asm among other tools. In which case? so what? That's not going to be something that people just randomly drop into their code, there will be a single project with however many users, which will be no worse for the Python community than Hylang. If their demonstration is just so cool that everyone decides we need Unicode symbols in Python core, then great. If not, and they still want to keep using it, well, a simpler preprocessor will be easier for the rest of us to understand than a ridiculously complicated one that does bytecode hackery, or than a hacked-up CPython compiler. > So while an inline bytecode assembler might have some uses, I suspect > it'd be an attractive nuisance more than anything else. I honestly don't see it becoming an attractive nuisance.? I can easily see it just not getting used for anything at all, beyond people playing with the interpreter. And now, on to your other replies: >> On Monday, June 30, 2014 3:12 PM, Chris Angelico ? 
> wrote: >>> On Tue, Jul 1, 2014 at 3:18 AM,? wrote: >> >>>> On Sat, Jun 28, 2014, at 01:28, Chris Angelico wrote: >>>>> empty_set_literal = >>>>> > type(lambda:0)(type((lambda:0).__code__)(0,0,0,3,67,b't\x00\x00d\x01\x00h\x00\x00\x83\x02\x00\x01d\x00\x00S',(None,"I'm >> >> I think it makes more sense to use types.FunctionType and types.CodeType > here than to generate two extra functions for each function, even if that means > you have to put an import types at the top of every munged source file. > > Sure. This is just a proof-of-concept anyway, and it's not meant to be > good code. Either way works, I just tried to minimize name usage (and > potential name collisions). > >> But I think what he was suggesting is something like this: Let > py_compile.compile generate the .pyc file as normal, then munge the bytecode in > that file, instead of compiling each function, munging its bytecode, and > emitting source that creates the munged functions. >> >> >> Besides being a lot less work, his version works for ? at top level, in > class definitions, in lambda expressions, etc., not just for def statements. And > it doesn't require finding and identifying all of the things to munge in a > source file (which I assume you'd do bottom-up based on the ast.parse tree > or something). >> > > Sure. But all I was doing was responding to the implied statement that > it's not possible to write a .py file that makes a function with > BUILD_SET 0 in it. Translating a .pyu directly into a .pyc is still > possible, but was not the proposal. Agreed, I just think it's an _easier_ proposal than yours, not a harder one (assuming you want to actually build the real thing, not just a proof of concept), which I think is why Random suggested it. Also, again, I don't think a real project that allowed ? in a def but not in a lambda, class, or top-level code would be acceptable to anyone, and I don't see how your solution can be easily adapted to those cases (well, except lambda). [snip, and everything below here condensed] > What I did was put in a literal string?? > It uses "? is set()" as a marker ? and the resulting function > has an unnecessary const in it.? I assumed that leaving the unnecessary const behind was unacceptable. After all, we're talking about (hypothetical?) people who find the cost of LOAD_GLOBAL set; CALL_FUNCTION 0 to be unacceptable? But you're right that fixing up all the other LOAD_CONST bytecodes' args is a feasible way to solve that. >> So, if the function is a closure, how do you do that? > Ah, that part I've no idea about. But it wouldn't be impossible for > someone to develop that a bit further. Not impossible, but very hard, much harder than what you've done so far. Ultimately, I think that just backs up your larger point: This is doable, but it's going to be a lot of work, and the benefit isn't even nearly worth the cost. My point is that there are other ways to do it that would be less work and/or that would have more side benefits? but the benefit still isn't even nearly worth the cost, so who cares? 
:) From abarnert at yahoo.com Tue Jul 1 10:27:00 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 01:27:00 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> (Replies to both Guido's top-post and Nick's reply-post below.) On Monday, June 30, 2014 7:19 PM, Guido van Rossum wrote: >Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers). But CPython does expose bytecode via the dis module, parts of inspect, etc. For that matter, it exposes some of the compiler's workings (especially if you consider everything up to AST generation part of the compiler, since every step up to there is exposed, including doing the whole thing in one whack with PyCF_ONLY_AST). So, I don't see how exposing the AST-to-bytecode transformation part (or, while we're at it, the .pyc generation part) is any more unportable than what's already there. That being said, I can appreciate that it would almost certainly take a lot more work, and a lot riskier work to do that, so the same tradeoff could easily go the other way in this case. (Not to mention that the dis module and so on are already there, while the patches Nick was talking about, much less something brand new, are obviously not.) >On Mon, Jun 30, 2014 at 7:15 PM, Nick Coghlan wrote: > >>On 30 Jun 2014 16:51, "Andrew Barnert" wrote: >>> >>> First, two quick side notes: >>> >>> It might be nice if the compiler were as easy to hook as the importer.?Alternatively, it might be nice if there were a way to do "inline bytecode assembly" in CPython, similar to the way you do inline assembly in many C compilers, so the answer to random's question is just "asm [('BUILD_SET', 0)]" or something similar. Either of those would make this problem trivial. >>Eugene Toder & Dave Malcolm have some interesting patches on the tracker to help enhance the compiler (in particular, Dave's allowed compiler plugins to be written in Python). Neither set of patches made it to being merge ready, though, and they'll be rather stale at this point. Thanks! Are you referring to Dave Malcolm's patch to adding a hook for an AST optimizer (in Python) right before compiling the AST to code (http://bugs.python.org/issue10399 and related)?? If so, I don't think that would actually help here. Unless it's possible to say "BUILD_SET 0" in AST, but in that case, we don't need any new compiler hooks; just use an import hook the same way MacroPy does. (Doing it without import hooks would be a little nicer, but it's not essential.) The only patch I could find by Eugene Toder is one to reenable constant folding on -0, which I think was already committed in 3.3, and doesn't seem relevant anyway. Is there something else I should be searching for? 
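(Concretely, the pieces that are already exposed chain together like this - nothing below is new API, it's all stock 3.4:)

    import ast
    import dis

    source = "s = set()"
    tree = compile(source, "<demo>", "exec", flags=ast.PyCF_ONLY_AST)  # source -> AST
    print(ast.dump(tree))                    # the AST can be inspected or rewritten here
    code = compile(tree, "<demo>", "exec")   # AST -> code object
    dis.dis(code)                            # and the resulting bytecode is visible via dis
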
From rosuav at gmail.com Tue Jul 1 10:38:37 2014 From: rosuav at gmail.com (Chris Angelico) Date: Tue, 1 Jul 2014 18:38:37 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: On Tue, Jul 1, 2014 at 6:04 PM, Andrew Barnert wrote: >> On Monday, June 30, 2014 5:39 PM, Chris Angelico wrote: > >> That would be interesting, but it raises the possibility of mucking up >> the stack. (Imagine if you put BUILD_SET 1 in there instead. What's it >> going to make a set of? What's going to happen to the rest of the >> stack? Do you REALLY want to debug that?) > > The same thing that happens if you use bad inline assembly in C, or a bad C extension module in Python?bad things that you can't debug at source level. And yet, inline assembly in C and C extension modules in Python are still quite useful. Right, useful but it adds another set of problems. (Just out of curiosity, what protection _is_ there for a smashed stack? I just tried fiddling with it and didn't manage to crash stuff.) > I'll ignore the second case for the moment, because I think it's rarely if ever appropriate to Python, and just focus on the first. Those cases did not go away because CPUID got replaced with library functions. Those library functions?which are compiled with the same compiler you use for your code?have inline assembly in them. (Or, if you're on linux, those library functions read from a device file, but the device driver, which is compiled with the same compiler you use, has inline assembly in it.) So, the compiler still needs to be able to compile it. > Or those library functions are written in assembly language directly. It's entirely possible to write something that uses CPUID and doesn't use inline assembly in a C source file. The equivalent here, I suppose, would be hand-rolling a .pyc file. >> Some people would use it to create an empty set, >> others would use it to replace variable swapping with a marginally >> faster and *almost* identical stack-based swap: > > Do you really think anyone would do the latter? Seriously, what kind of code can you imagine that's too slow in CPython, not worth rewriting in C or running in PyPy or whatever, but fast enough with the rot opcode removed? And if someone really _did_ need that, I doubt they'd care much that Python 3.8 makes it unnecessary; they obviously have a specific deployment platform that has to work and that needed that last 3% speedup under 3.6.2, and they're going to need that to keep working for years. > Hang on, you're asking two different questions there. I'll split it out: 1) Do I really think anyone *should* do this? Your subsequent comments support this question, and the answer is resoundingly NO. CPython is not the sort of platform on which that kind of thing is ever worth doing. You'll get far more performance by using Cython for parts, or in some other way improving your code, than you will by hand-tweaking the Python bytecode. 2) Do I think anyone would, if given the ability to tweak the bytecode, go "Ah ha!" and proudly improve on what the compiler has done, and then brag about the performance improvement? Definitely. Someone will. 
It'll make some marginal difference to a microbenchmark, and if you don't believe that would cause people to warp their code into utter unreadability, you clearly don't hang out on python-list enough :) >> So while an inline bytecode assembler might have some uses, I suspect >> it'd be an attractive nuisance more than anything else. > > I honestly don't see it becoming an attractive nuisance. > > I can easily see it just not getting used for anything at all, beyond people playing with the interpreter. The "attractive nuisance" part is with microbenchmarking. Code won't materially improve, and it'll be markedly worse in readability/maintainability and portability (although the latter probably doesn't matter all that much; a lot of people's code will be suboptimal on Pythons other than CPython, if only for lack of 'with' statements around files and such), with the addition of such a feature. >> What I did was put in a literal string? >> It uses "? is set()" as a marker ? and the resulting function >> has an unnecessary const in it. > > I assumed that leaving the unnecessary const behind was unacceptable. After all, we're talking about (hypothetical?) people who find the cost of LOAD_GLOBAL set; CALL_FUNCTION 0 to be unacceptable? But you're right that fixing up all the other LOAD_CONST bytecodes' args is a feasible way to solve that. I'm not sure whether the problem is the cost of LOAD_GLOBAL followed by CALL_FUNCTION (and, by the way, one unnecessary constant in the function won't have anything like that cost - a bit of wasted RAM, but not a function call), or the fact that such a style is vulnerable to shadowing of the name 'set', which admittedly is a very useful name. But in any case, it's quite solvable. >>> So, if the function is a closure, how do you do that? >> Ah, that part I've no idea about. But it wouldn't be impossible for >> someone to develop that a bit further. > > Not impossible, but very hard, much harder than what you've done so far. > > Ultimately, I think that just backs up your larger point: This is doable, but it's going to be a lot of work, and the benefit isn't even nearly worth the cost. My point is that there are other ways to do it that would be less work and/or that would have more side benefits? but the benefit still isn't even nearly worth the cost, so who cares? :) Yep. Maybe someone (great, that probably means me) should write this up into a PEP for immediate rejection or withdrawal, just to be a document to point to - if you want an empty set literal, answer these objections. ChrisA From abarnert at yahoo.com Tue Jul 1 11:00:29 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 02:00:29 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> Message-ID: <1404205229.86750.YahooMailNeo@web181001.mail.ne1.yahoo.com> On , Andrew Barnert wrote: > On Mon, Jun 30, 2014 at 7:15 PM, Nick Coghlan > wrote: >> Eugene Toder & Dave Malcolm have some interesting patches on the >> tracker to help enhance the compiler [snip] ? 
> Are you referring to Dave Malcolm's patch to adding a hook for an AST > optimizer (in Python) right before compiling the AST to code > (http://bugs.python.org/issue10399 and related)?? > > If so, I don't think that would actually help here. Unless it's possible > to say "BUILD_SET 0" in AST, but in that case, we don't need any > new compiler hooks; just use an import hook the same way MacroPy does. I should have just tested it before saying anything: >>> e = ast.Expression(body=ast.Set(elts=[], ctx=ast.Load(), ... ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lineno=1, col_offset=0)) >>> c = compile(e2, '<>', 'eval') >>> dis.dis(c) ? 1 ? ? ? ? ? 0 BUILD_SET ? ? ? ? ? 0 ? ? ? ? ? ? ? 3 RETURN_VALUE So? it is possible to say "BUILD_SET 0" in AST. Which means the easy way to do this is to wrap an import hook around this: class FixEmptySet(ast.NodeTransformer): ? ? def visit_Name(self, node): ? ? ? ? if node.id == '_EMPTY_SET_LITERAL': ? ? ? ? ? ? return ast.copy_location( ? ? ? ? ? ? ? ? ast.Set(elts=[], ctx=ast.Load()), ? ? ? ? ? ? ? ? node) ? ? ? ? return node def ecompile(src, fname): ? ? src = src.replace('?', '_EMPTY_SET_LITERAL') ? ? tree = compile(src, fname, 'exec', flags=ast.PyCF_ONLY_AST) ? ? tree = FixEmptySet().visit(tree) ? ? return compile(tree, fname, 'exec') code = ecompile('def f(): return ?', '<>') exec(code) f() That returns set(). And if you dis.dis(f), it's just BUILD_SET 0 and RETURN_VALUE. From rosuav at gmail.com Tue Jul 1 11:07:28 2014 From: rosuav at gmail.com (Chris Angelico) Date: Tue, 1 Jul 2014 19:07:28 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404205229.86750.YahooMailNeo@web181001.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> <1404205229.86750.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: On Tue, Jul 1, 2014 at 7:00 PM, Andrew Barnert wrote: > src = src.replace('?', '_EMPTY_SET_LITERAL') Note that this suffers from a flaw that my POC script also suffers from: it replaces this character *anywhere*, rather than only when it's being used as a symbol on its own. Even inside a literal string. It might be necessary to replace it back the other way afterward, somehow, but I'm not sure if that would work. ChrisA From abarnert at yahoo.com Tue Jul 1 12:23:19 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 03:23:19 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> <1404205229.86750.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <1404210199.47734.YahooMailNeo@web181004.mail.ne1.yahoo.com> > On Tuesday, July 1, 2014 2:08 AM, Chris Angelico wrote: > > On Tue, Jul 1, 2014 at 7:00 PM, Andrew Barnert > > wrote: >> ? ? src = src.replace('?', '_EMPTY_SET_LITERAL') > > Note that this suffers from a flaw that my POC script also suffers > from: it replaces this character *anywhere*, rather than only when > it's being used as a symbol on its own. Even inside a literal string. 
> It might be necessary to replace it back the other way afterward, > somehow, but I'm not sure if that would work. Yes, that's easy. Also, _EMPTY_SET_LITERAL_ itself could exist in your source (after all, it exists in my source fragment right above, right?), but that's easy too. See https://github.com/abarnert/emptyset for a slapped-together implementation that solves both those problems (except for bytes literals, but it explains how to do that). If it prints out "set() is the empty set ?", then it worked; it successfully replaced the ? in your source with an empty set literal, and left the ? in your format string as ?. From abarnert at yahoo.com Tue Jul 1 12:51:18 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 03:51:18 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: <1404211878.45311.YahooMailNeo@web181005.mail.ne1.yahoo.com> > On Tuesday, July 1, 2014 1:39 AM, Chris Angelico wrote: > > On Tue, Jul 1, 2014 at 6:04 PM, Andrew Barnert wrote: >>> On Monday, June 30, 2014 5:39 PM, Chris Angelico > wrote: >> >>> That would be interesting, but it raises the possibility of mucking up >>> the stack. (Imagine if you put BUILD_SET 1 in there instead. What's > it >>> going to make a set of? What's going to happen to the rest of the >>> stack? Do you REALLY want to debug that?) >> >> The same thing that happens if you use bad inline assembly in C, or a bad C > extension module in Python?bad things that you can't debug at source level. > And yet, inline assembly in C and C extension modules in Python are still quite > useful. > > Right, useful but it adds another set of problems. (Just out of > curiosity, what protection _is_ there for a smashed stack? I just > tried fiddling with it and didn't manage to crash stuff.) I believe there are cases where the interpreter can detect that you've gone below 0 and raise an exception, but in general there's no protection, or at least nothing you can count on. For example, assemble this code as a complete function: ? ? CALL_FUNCTION 1 ? ? RETURN_VALUE In 3.4.1, on my Mac, I get a bus error. But, even when you don't manage to crash the interpreter, when you just confuse it at the bytecode level, there's still no way to debug that except by dropping to gdb/lldb/etc. >> I'll ignore the second case for the moment, because I think it's > rarely if ever appropriate to Python, and just focus on the first. Those cases > did not go away because CPUID got replaced with library functions. Those library > functions?which are compiled with the same compiler you use for your code?have > inline assembly in them. (Or, if you're on linux, those library functions > read from a device file, but the device driver, which is compiled with the same > compiler you use, has inline assembly in it.) So, the compiler still needs to be > able to compile it. >> > > Or those library functions are written in assembly language directly. > It's entirely possible to write something that uses CPUID and doesn't > use inline assembly in a C source file. The equivalent here, I > suppose, would be hand-rolling a .pyc file. 
Yeah, that's entirely possible, but that's not how the linux device driver or the FreeBSD libc wrapper do it; they use inline assembly. Why? Well, for one thing, you get the function prolog and epilog code appropriate for your compiler automatically, instead of having to write it yourself. Also, you can do nice things like cast the result to a struct that you defined in C (which could be done with, e.g., a C macro wrapping the assembly source, but that's just making things more complicated for no benefit). And you don't need to know how to configure and run an assembler alongside the C compiler to build the device. And so on. Basically, the C versions of the exact same reasons you wouldn't want to hand-roll a .pyc file in Python? > 2) Do I think anyone would, if given the ability to tweak the > bytecode, go "Ah ha!" and proudly improve on what the compiler has > done, and then brag about the performance improvement? Definitely. > Someone will. It'll make some marginal difference to a microbenchmark, > and if you don't believe that would cause people to warp their code > into utter unreadability, you clearly don't hang out on python-list > enough :) Using ctypes to load Python.so to swap the pointers under the covers is already significantly faster, and would still be significantly faster than your optimized bytecode, and yes, people have suggested it on at least two StackOverflow questions. For that matter, you can already do exactly your optimization with a relatively simple bytecode hack, which would look a lot worse than the inline asm and have the same effect. Also, that bytecode hack could be factored out into a function, without any performance cost except a constant cost at .pyc time, while the inline asm obviously can't, another reason the inline asm (which would have to be written inline, and edited to fit the variables in question, each time) would be less of an attractive nuisance than what's already there. Sure, there may be a few people who are looking for horrible micro-optimizations like this, would know enough to figure out how to do this with inline asm, would not know how to do it with bytecode hacks, would not know any of the better (as in much worse, to anyone but them) alternatives, etc., but I think that number is vanishingly small. >>> What I did was put in a literal string? >>> It uses "? is set()" as a marker ? and the resulting function >>> has an unnecessary const in it. >> >> I assumed that leaving the unnecessary const behind was unacceptable. After > all, we're talking about (hypothetical?) people who find the cost of > LOAD_GLOBAL set; CALL_FUNCTION 0 to be unacceptable? But you're right that > fixing up all the other LOAD_CONST bytecodes' args is a feasible way to > solve that. > > I'm not sure whether the problem is the cost of LOAD_GLOBAL followed > by CALL_FUNCTION (and, by the way, one unnecessary constant in the > function won't have anything like that cost - a bit of wasted RAM, but > not a function call), or the fact that such a style is vulnerable to > shadowing of the name 'set', which admittedly is a very useful name. > But in any case, it's quite solvable. I realize the cost of an extra LOAD_GLOBAL is much smaller than an extra CALL_FUNCTION, it's just that I think in 99.9999% of real cases neither will make a difference, and anyone who's objecting to the latter on principle will probably also object to the former on principle? >>>> ? So, if the function is a closure, how do you do that? >>> Ah, that part I've no idea about. 
But it wouldn't be impossible > for >>> someone to develop that a bit further. >> >> Not impossible, but very hard, much harder than what you've done so > far. >> >> Ultimately, I think that just backs up your larger point: This is doable, > but it's going to be a lot of work, and the benefit isn't even nearly > worth the cost. My point is that there are other ways to do it that would be > less work and/or that would have more side benefits? but the benefit still > isn't even nearly worth the cost, so who cares? :) > > Yep. Maybe someone (great, that probably means me) should write this > up into a PEP for immediate rejection or withdrawal, just to be a > document to point to - if you want an empty set literal, answer these > objections. I think Terry Reedy actually had a better answer: just tell people to implement it, polish it up, put it on PyPI, and come back to us when they're ready to show off their tons of users who can't live without it. Random objected that wasn't possible, in which case Terry's idea is more of a dismissal than a helpful suggestion, but I think https://github.com/abarnert/emptyset proves that it is possible, and even pretty easy. From kn0m0n3 at gmail.com Tue Jul 1 16:23:16 2014 From: kn0m0n3 at gmail.com (www.leap.cc) Date: Tue, 01 Jul 2014 09:23:16 -0500 Subject: [Python-ideas] Enigmail on cell phone? Message-ID: Anyone done this successfully? How can I use python for hidden Markov mods in machine learning from closed circuit came for public transportation to make better schedualing ask local universities computer science competition... cheers, d.j.? Andrew Barnert wrote: >> On Tuesday, July 1, 2014 1:39 AM, Chris Angelico wrote: > >> > On Tue, Jul 1, 2014 at 6:04 PM, Andrew Barnert wrote: >>>> On Monday, June 30, 2014 5:39 PM, Chris Angelico >> wrote: >>> >>>> That would be interesting, but it raises the possibility of mucking up >>>> the stack. (Imagine if you put BUILD_SET 1 in there instead. What's >> it >>>> going to make a set of? What's going to happen to the rest of the >>>> stack? Do you REALLY want to debug that?) >>> >>> The same thing that happens if you use bad inline assembly in C, or a bad C >> extension module in Python?bad things that you can't debug at source level. >> And yet, inline assembly in C and C extension modules in Python are still quite >> useful. >> >> Right, useful but it adds another set of problems. (Just out of >> curiosity, what protection _is_ there for a smashed stack? I just >> tried fiddling with it and didn't manage to crash stuff.) > >I believe there are cases where the interpreter can detect that you've gone below 0 and raise an exception, but in general there's no protection, or at least nothing you can count on. > >For example, assemble this code as a complete function: > >? ? CALL_FUNCTION 1 >? ? RETURN_VALUE > >In 3.4.1, on my Mac, I get a bus error. > >But, even when you don't manage to crash the interpreter, when you just confuse it at the bytecode level, there's still no way to debug that except by dropping to gdb/lldb/etc. > >>> I'll ignore the second case for the moment, because I think it's >> rarely if ever appropriate to Python, and just focus on the first. Those cases >> did not go away because CPUID got replaced with library functions. Those library >> functions?which are compiled with the same compiler you use for your code?have >> inline assembly in them. 
(Or, if you're on linux, those library functions >> read from a device file, but the device driver, which is compiled with the same >> compiler you use, has inline assembly in it.) So, the compiler still needs to be >> able to compile it. >>> >> >> Or those library functions are written in assembly language directly. >> It's entirely possible to write something that uses CPUID and doesn't >> use inline assembly in a C source file. The equivalent here, I >> suppose, would be hand-rolling a .pyc file. > >Yeah, that's entirely possible, but that's not how the linux device driver or the FreeBSD libc wrapper do it; they use inline assembly. Why? Well, for one thing, you get the function prolog and epilog code appropriate for your compiler automatically, instead of having to write it yourself. Also, you can do nice things like cast the result to a struct that you defined in C (which could be done with, e.g., a C macro wrapping the assembly source, but that's just making things more complicated for no benefit). And you don't need to know how to configure and run an assembler alongside the C compiler to build the device. And so on. Basically, the C versions of the exact same reasons you wouldn't want to hand-roll a .pyc file in Python? > >> 2) Do I think anyone would, if given the ability to tweak the >> bytecode, go "Ah ha!" and proudly improve on what the compiler has >> done, and then brag about the performance improvement? Definitely. >> Someone will. It'll make some marginal difference to a microbenchmark, >> and if you don't believe that would cause people to warp their code >> into utter unreadability, you clearly don't hang out on python-list >> enough :) > > >Using ctypes to load Python.so to swap the pointers under the covers is already significantly faster, and would still be significantly faster than your optimized bytecode, and yes, people have suggested it on at least two StackOverflow questions. For that matter, you can already do exactly your optimization with a relatively simple bytecode hack, which would look a lot worse than the inline asm and have the same effect. Also, that bytecode hack could be factored out into a function, without any performance cost except a constant cost at .pyc time, while the inline asm obviously can't, another reason the inline asm (which would have to be written inline, and edited to fit the variables in question, each time) would be less of an attractive nuisance than what's already there. Sure, there may be a few people who are looking for horrible micro-optimizations like this, would know enough to figure out how to do this with inline asm, would not know how to do it > with bytecode hacks, would not know any of the better (as in much worse, to anyone but them) alternatives, etc., but I think that number is vanishingly small. > >>>> What I did was put in a literal string? > >>>> It uses "? is set()" as a marker ? and the resulting function >>>> has an unnecessary const in it. >>> >>> I assumed that leaving the unnecessary const behind was unacceptable. After >> all, we're talking about (hypothetical?) people who find the cost of >> LOAD_GLOBAL set; CALL_FUNCTION 0 to be unacceptable? But you're right that >> fixing up all the other LOAD_CONST bytecodes' args is a feasible way to >> solve that. 
>> >> I'm not sure whether the problem is the cost of LOAD_GLOBAL followed >> by CALL_FUNCTION (and, by the way, one unnecessary constant in the >> function won't have anything like that cost - a bit of wasted RAM, but >> not a function call), or the fact that such a style is vulnerable to >> shadowing of the name 'set', which admittedly is a very useful name. >> But in any case, it's quite solvable. > >I realize the cost of an extra LOAD_GLOBAL is much smaller than an extra CALL_FUNCTION, it's just that I think in 99.9999% of real cases neither will make a difference, and anyone who's objecting to the latter on principle will probably also object to the former on principle? > >>>>> ? So, if the function is a closure, how do you do that? >>>> Ah, that part I've no idea about. But it wouldn't be impossible >> for >>>> someone to develop that a bit further. >>> >>> Not impossible, but very hard, much harder than what you've done so >> far. >>> >>> Ultimately, I think that just backs up your larger point: This is doable, >> but it's going to be a lot of work, and the benefit isn't even nearly >> worth the cost. My point is that there are other ways to do it that would be >> less work and/or that would have more side benefits? but the benefit still >> isn't even nearly worth the cost, so who cares? :) >> >> Yep. Maybe someone (great, that probably means me) should write this >> up into a PEP for immediate rejection or withdrawal, just to be a >> document to point to - if you want an empty set literal, answer these >> objections. > > >I think Terry Reedy actually had a better answer: just tell people to implement it, polish it up, put it on PyPI, and come back to us when they're ready to show off their tons of users who can't live without it. Random objected that wasn't possible, in which case Terry's idea is more of a dismissal than a helpful suggestion, but I think https://github.com/abarnert/emptyset proves that it is possible, and even pretty easy. >_______________________________________________ >Python-ideas mailing list >Python-ideas at python.org >https://mail.python.org/mailman/listinfo/python-ideas >Code of Conduct: http://python.org/psf/codeofconduct/ From liam.marsh.home at gmail.com Tue Jul 1 18:31:22 2014 From: liam.marsh.home at gmail.com (Liam Marsh) Date: Tue, 1 Jul 2014 18:31:22 +0200 Subject: [Python-ideas] error in os.popen result Message-ID: hello, for a small server program, I wanted to know which ports were occuped. with the dos command 'netstat' so I tried this: *>>>a=os.popen('netstat')* *>>>bytes(a.read())* but this occured in the second step: *Traceback (most recent call last):* * File "", line 1, in * * bytes(a.read())* * File "C:\Apps\Programmation\Python3.2\lib\encodings\cp1252.py", line 23, in decode* * return codecs.charmap_decode(input,self.errors,decoding_table)[0]* *UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 79: character maps to * *how can I avoid it and why does the windows cmd does return an undecodable character?* *thank you.* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ncoghlan at gmail.com Tue Jul 1 18:58:37 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 1 Jul 2014 09:58:37 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> Message-ID: On 1 July 2014 01:27, Andrew Barnert wrote: > (Replies to both Guido's top-post and Nick's reply-post below.) > > On Monday, June 30, 2014 7:19 PM, Guido van Rossum wrote: > >>Like bytecode, the compiler's workings are not part of the language spec, and are likely to change incompatibly between versions and not work for anything besides CPython. I don't really want to go there (cool though it sounds for wannabe compiler hackers). > > > But CPython does expose bytecode via the dis module, parts of inspect, etc. For that matter, it exposes some of the compiler's workings (especially if you consider everything up to AST generation part of the compiler, since every step up to there is exposed, including doing the whole thing in one whack with PyCF_ONLY_AST). So, I don't see how exposing the AST-to-bytecode transformation part (or, while we're at it, the .pyc generation part) is any more unportable than what's already there. > Note that the dis module has a "CPython implementation detail" disclaimer, and the AST structure is deliberately exempted from the usual backwards compatibility guarantees. As far as hooking compilation goes, https://docs.python.org/3/library/importlib.html#importlib.abc.InspectLoader.source_to_code was added in 3.4 specifically to make it easier to define custom loaders that make use of most of the existing import machinery (including bytecode cache files), but do something different for the source -> bytecode transformation step. Cheers, Nick. From steve at pearwood.info Tue Jul 1 19:04:22 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 2 Jul 2014 03:04:22 +1000 Subject: [Python-ideas] error in os.popen result In-Reply-To: References: Message-ID: <20140701170422.GS13014@ando> Hello Liam, This is a mailing list for discussing possible future ideas for the next version of Python, not for general support. I recommend that you use the python-list at python.org mailing list, also available via Usenet on comp.lang.python. Good luck. On Tue, Jul 01, 2014 at 06:31:22PM +0200, Liam Marsh wrote: > hello, for a small server program, I wanted to know which ports were > occuped. 
> with the dos command 'netstat' > so I tried this: > > *>>>a=os.popen('netstat')* > *>>>bytes(a.read())* > > but this occured in the second step: > > *Traceback (most recent call last):* > * File "", line 1, in * > * bytes(a.read())* > * File "C:\Apps\Programmation\Python3.2\lib\encodings\cp1252.py", line 23, > in decode* > * return codecs.charmap_decode(input,self.errors,decoding_table)[0]* > *UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 79: > character maps to * > > *how can I avoid it and why does the windows cmd does return an > undecodable character?* > > *thank you.* > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From steve at pearwood.info Tue Jul 1 19:15:29 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 2 Jul 2014 03:15:29 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404203220.53463.YahooMailNeo@web181002.mail.ne1.yahoo.com> Message-ID: <20140701171529.GT13014@ando> On Tue, Jul 01, 2014 at 09:58:37AM -0700, Nick Coghlan wrote: > On 1 July 2014 01:27, Andrew Barnert wrote: > > But CPython does expose bytecode via the dis module, parts of > > inspect, etc. [...] > > Note that the dis module has a "CPython implementation detail" > disclaimer, and the AST structure is deliberately exempted from the > usual backwards compatibility guarantees. Further to what Nick says, the *output* of dis is not expected to remain backwards compatible from version to version, only the dis API itself. There's a big difference between saying "we guarantee that the dis module will correctly and accurately disassemble valid bytecode", and saying "we guarantee that this specific chunk of bytecode will do these things". In order to use a hypothetical asm function, you need to know what pseudo-assembly to write, say `asm [SPAM, EGGS]`. That means that SPAM and EGGS must be stable and part of the language definition. (Or at least part of the CPython API.) That's a big step from the current situation. -- Steven From steve at pearwood.info Tue Jul 1 19:33:07 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 2 Jul 2014 03:33:07 +1000 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: <20140701173307.GU13014@ando> On Tue, Jul 01, 2014 at 06:38:37PM +1000, Chris Angelico wrote: [...] > 1) Do I really think anyone *should* do this? Your subsequent comments > support this question, and the answer is resoundingly NO. CPython is "This" being trying to micro-optimize code by bytecode-hacking. > not the sort of platform on which that kind of thing is ever worth > doing. You'll get far more performance by using Cython for parts, or > in some other way improving your code, than you will by hand-tweaking > the Python bytecode. I think that micro-optimization is probably the wrong reason to hack bytecodes. 
What I'm more interested in is exploring potential new features, or adding functionality, for example:

Adding the ability to trace individual expressions, not just lines:
http://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.html

Exploring dynamic scoping:
http://www.voidspace.org.uk/python/articles/code_blocks.shtml

A proposal from Python 2.3 days for a brand-new decorator syntax:
http://code.activestate.com/recipes/286147

A (serious!) defence of GOTO in Python:
http://www.dr-josiah.com/2012/04/python-bytecode-hacks-gotos-revisited.html

(although even Josiah doesn't suggest using COMEFROM :-)

I don't know that such bytecode manipulations should be provided in the standard library, and certainly not as a built-in "asm" command. But I think that we ought to acknowledge that bytecode hacking has a role to play in the wider Python ecosystem.

I'm led to understand that in the Java community, bytecode hacking is, perhaps not common, but accepted as something that power users do when all else fails:
https://weblogs.java.net/blog/simonis/archive/2009/02/we_need_a_dirty.html

[Aside: does Python do any sort of verification of the bytecode before executing it, as Java does?]

-- Steven

From ncoghlan at gmail.com Tue Jul 1 20:07:35 2014
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 1 Jul 2014 11:07:35 -0700
Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict)
In-Reply-To: <20140701173307.GU13014@ando>
References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <20140701173307.GU13014@ando>
Message-ID:

On 1 July 2014 10:33, Steven D'Aprano wrote:
> On Tue, Jul 01, 2014 at 06:38:37PM +1000, Chris Angelico wrote:
> [...]
>> 1) Do I really think anyone *should* do this? Your subsequent comments
>> support this question, and the answer is resoundingly NO. CPython is
>
> "This" being trying to micro-optimize code by bytecode-hacking.
>
>> not the sort of platform on which that kind of thing is ever worth
>> doing. You'll get far more performance by using Cython for parts, or
>> in some other way improving your code, than you will by hand-tweaking
>> the Python bytecode.
>
> I think that micro-optimization is probably the wrong reason to hack
> bytecodes. What I'm more interested in is exploring potential new
> features, or adding functionality

https://pypi.python.org/pypi/withhacks and https://pypi.python.org/pypi/byteplay may also be of interest to anyone wishing to seriously tinker with what the CPython VM (as opposed to Python-the-language) already supports.

I also highly advise working with Python 3.4, since we made some substantial improvements to the dis module API (adding the yield from tests for 3.3 highlighted how limited the previous API was for testing purposes, so we fixed it in a way that made bytecode easier to work with in general).

Cheers, Nick.
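A minimal sketch of the Instruction-based dis API referred to above (the example function is invented for illustration; the opcode names in the comments are what CPython 3.4 emits and may differ in other versions):

import dis

def example():
    s = set()    # on CPython 3.4: LOAD_GLOBAL (set); CALL_FUNCTION 0
    d = {}       # on CPython 3.4: BUILD_MAP 0
    return s, d

for ins in dis.get_instructions(example):
    print(ins.offset, ins.opname, ins.argrepr)

Each item is a named Instruction tuple, so checks that previously meant parsing the text output of dis.dis() can work on structured data instead.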
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Tue Jul 1 20:16:29 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 1 Jul 2014 11:16:29 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <20140701173307.GU13014@ando> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <20140701173307.GU13014@ando> Message-ID: On 1 July 2014 10:33, Steven D'Aprano wrote: > [Aside: does Python do any sort of verification of the bytecode before > executing it, as Java does?] Nope, it will happily attempt to execute invalid bytecode. That's actually one of the reasons executing untrusted bytecode is even less safe than executing untrusted source code - it's likely to be possible to trigger segfaults that way. There's an initial attempt at a bytecode verifier on PyPI (https://pypi.python.org/pypi/Python-Bytecode-Verifier/), and I have a vague recollection that Google have a bytecode verifier kicking around somewhere, but there's nothing built in to the CPython runtime. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From python at mrabarnett.plus.com Tue Jul 1 20:59:21 2014 From: python at mrabarnett.plus.com (MRAB) Date: Tue, 01 Jul 2014 19:59:21 +0100 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <20140701173307.GU13014@ando> Message-ID: <53B30509.4090808@mrabarnett.plus.com> On 2014-07-01 19:16, Nick Coghlan wrote: > On 1 July 2014 10:33, Steven D'Aprano wrote: > >> [Aside: does Python do any sort of verification of the bytecode >> before executing it, as Java does?] > > Nope, it will happily attempt to execute invalid bytecode. That's > actually one of the reasons executing untrusted bytecode is even less > safe than executing untrusted source code - it's likely to be > possible to trigger segfaults that way. > > There's an initial attempt at a bytecode verifier on PyPI > (https://pypi.python.org/pypi/Python-Bytecode-Verifier/), and I have > a vague recollection that Google have a bytecode verifier kicking > around somewhere, but there's nothing built in to the CPython > runtime. > The re module also uses a kind of bytecode that's generated by the Python front end and verified by the C back end. The bytecode contains things like offsets; for example, the bytecode that starts a repeated sequence has an offset to the corresponding bytecode that ends it, and vice versa. The problem with that is that the structure (i.e. the nesting) is no longer explicit, so it's more difficult to spot misnested structures. For the regex module, I decided that it would be easier to verify if I kept the structure explicit by using bytecodes to indicate the start and end of the structures. For example, a repeated sequence could be indicated by having a structure like GREEDY_REPEAT min_count max_count ... END. The C back end could then build the internal representation that's actually interpreted. 
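The explicit start/end markers MRAB describes make structural verification a simple stack walk. A toy sketch (this is not the actual regex module code; GREEDY_REPEAT and END are the names mentioned above, the other opcode names are illustrative, and operands are ignored):

OPENERS = {"GREEDY_REPEAT", "LAZY_REPEAT", "GROUP"}
CLOSER = "END"

def check_nesting(ops):
    # ops is a flat sequence of opcode names
    depth = 0
    for op in ops:
        if op in OPENERS:
            depth += 1
        elif op == CLOSER:
            if depth == 0:
                return False   # END with no matching opener
            depth -= 1
    return depth == 0          # every opener must be closed

print(check_nesting(["GROUP", "GREEDY_REPEAT", "LITERAL", "END", "END"]))  # True
print(check_nesting(["GREEDY_REPEAT", "LITERAL"]))                         # False

An offset-based encoding would force the verifier to chase jump targets instead, which is exactly where misnested structures become hard to spot.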
From tjreedy at udel.edu Tue Jul 1 22:04:20 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 01 Jul 2014 16:04:20 -0400 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <1404211878.45311.YahooMailNeo@web181005.mail.ne1.yahoo.com> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404211878.45311.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: On 7/1/2014 6:51 AM, Andrew Barnert wrote: > I think Terry Reedy actually had a better answer: just tell people to > implement it, polish it up, put it on PyPI, and come back to us when > they're ready to show off their tons of users who can't live without > it. Random objected that wasn't possible, 'Random' said something quite different. He only noted that if '?' were translated to 'set()', then the resulting CPython-specific bytecode would continue to be "LOAD_GLOBAL (set), CALL_FUNCTION 0" rather than the 'optimized' "BUILD_SET 0". He also noted (objected) that there is no python code that CPython currently compiles as "BUILD_SET 0" Well, its unfortunate that {} is not available. If it were, there would be no issue, to me anyway, of using '?'. However, optimizing CPython bytecode, and compiler hooks, are completely different issues from translating unisym python to standard python that could run on any implementation of Python. If we thought the bytecode difference was important (which most do not), we could have a peephole optimizer to 'fix' it, completely independently of the existence of '?' or any idea of using it in python code. > in which case Terry's idea is more of a dismissal than a helpful suggestion, My post was a dismissal of the idea of changing python itself *and* a suggestion of how to proceed without involving pydev. > https://github.com/abarnert/emptyset proves that it is possible, and > even pretty easy. I consider producing (or at least being able to produce) a standard .py file that can be published outside the specialized group run on and debugged on standard interpreters to be essential to any sensible idea for augmented Python code (whether with unicode symbols or anything else, such as native-language keywords). However, as I said before, off topic here for unicode symbols, though not on python-list. -- Terry Jan Reedy From mertz at gnosis.cx Tue Jul 1 22:25:25 2014 From: mertz at gnosis.cx (David Mertz) Date: Tue, 1 Jul 2014 13:25:25 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404211878.45311.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: Somewhere in this thread, someone mentioned https://github.com/ehamberg/vim-cute-python (and something similar for emacs, but I'm a vim user). I'm not sure if this mention was a joke or not, but I thought it looked cool and started using it. 
I can't decide if I actually find it useful or distracting, but in truth it seems to answer the *entire* concern of anyone wanting to see an empty-set symbol (but not to save one bytecode instruction, I admit), and also various other math symbols that name concepts spelled in ASCII in Python. While some hypothetical .pyu translation tool or import hook could do the same thing, this really *does* seem like something to just do at the editor level since there's nothing *semantic* about the new symbols, just a way for them to look. On Tue, Jul 1, 2014 at 1:04 PM, Terry Reedy wrote: > On 7/1/2014 6:51 AM, Andrew Barnert wrote: > > I think Terry Reedy actually had a better answer: just tell people to >> implement it, polish it up, put it on PyPI, and come back to us when >> they're ready to show off their tons of users who can't live without >> it. Random objected that wasn't possible, >> > > 'Random' said something quite different. He only noted that if '?' were > translated to 'set()', then the resulting CPython-specific bytecode would > continue to be "LOAD_GLOBAL (set), CALL_FUNCTION 0" rather than the > 'optimized' "BUILD_SET 0". He also noted (objected) that there is no python > code that CPython currently compiles as "BUILD_SET 0" Well, its unfortunate > that {} is not available. If it were, there would be no issue, to me > anyway, of using '?'. However, optimizing CPython bytecode, and compiler > hooks, are completely different issues from translating unisym python to > standard python that could run on any implementation of Python. If we > thought the bytecode difference was important (which most do not), we could > have a peephole optimizer to 'fix' it, completely independently of the > existence of '?' or any idea of using it in python code. > > > in which case Terry's idea is more of a dismissal than a helpful >> suggestion, >> > > My post was a dismissal of the idea of changing python itself *and* a > suggestion of how to proceed without involving pydev. > > > https://github.com/abarnert/emptyset proves that it is possible, and >> even pretty easy. >> > > I consider producing (or at least being able to produce) a standard .py > file that can be published outside the specialized group run on and > debugged on standard interpreters to be essential to any sensible idea for > augmented Python code (whether with unicode symbols or anything else, such > as native-language keywords). However, as I said before, off topic here > for unicode symbols, though not on python-list. > > -- > Terry Jan Reedy > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abarnert at yahoo.com Tue Jul 1 23:33:02 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 14:33:02 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: <20140701173307.GU13014@ando> References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <20140701173307.GU13014@ando> Message-ID: <1404250382.2679.YahooMailNeo@web181005.mail.ne1.yahoo.com> > On Tuesday, July 1, 2014 10:35 AM, Steven D'Aprano wrote: > I think that micro-optimization is probably the wrong reason to hack > bytecodes. What I'm more interested in is exploring potential new > features, or to add functionality, for example: > > Adding the ability to trace individual expressions, not just lines: > http://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.html > > Exploring dynamic scoping: > http://www.voidspace.org.uk/python/articles/code_blocks.shtml > > A proposal from Python 2.3 days for a brand-new decorator syntax: > http://code.activestate.com/recipes/286147 > > A (serious!) defence of GOTO in Python: > http://www.dr-josiah.com/2012/04/python-bytecode-hacks-gotos-revisited.html > > (although even Josiah doesn't suggest using COMEFROM :-) > > > I don't know that such bytecode manipulations should be provided in the > standard library, and certainly not as a built-in "asm" command. But, > I > think that we ought to acknowledge that bytecode hacking has a role to > play in the wider Python ecosystem. I think CPython provides just about the right level of support here. The documentation, the APIs, and the helper tools for dealing with bytecode are all superb, and get better with each release. It's all more than sufficient to figure out what you're doing, and how to do it. It might be nice if there were an assembler in the stdlib, but the format is simple enough, and the documentation complete enough, that you can write one in a couple hours (as I did). And, honestly, I suspect a stdlib assembler wouldn't be updated fast enough?e.g., when support for Instruction objects was added to CPython's dis module in 3.4, I doubt an existing assembler would have been modified to take advantage of that, but a new one that you slap together can do so easily. Documenting that bytecode is only supported on CPython, and can change between CPython versions, isn't a problem for anyone who's just looking to experiment with and explore ideas, rather than write production code. As your examples show, you can usually even publish your explorations for others to experiment with, granting those limitations, and maintain them for years without much headache. (Bytecode has traditionally been much more conservative than what the documentation allows; it's generally only when your hacks rely on knowing exactly what bytecode will be generated for a given Python expression that they break. But even there, with a sufficient test suite, it's usually pretty simple to adapt.) > I'm lead to understand that in the Java community, bytecode hacking is,? > perhaps not common, but accepted as something that powerusers do when > all else fails: > > https://weblogs.java.net/blog/simonis/archive/2009/02/we_need_a_dirty.html Here, it sounds like you _are_ suggesting that bytecode hacking may need to be used for production code, not just for exploration. 
But there are some pretty big differences between Java and Python that I think are relevant here: ?* Java is designed for one specific VM, on which many other languages run; Python is designed to run on a variety of VMs, and nothing else runs on the CPython VM. ?* Java is designed to be secure first, fast second, and flexible a distant third; Python is designed to be simple and transparent first, flexible and dynamic second, and everything else a distant third. So most of what you'd want to do (including solving problems like the one in the blog) can be done with simple monkey-patching and related techniques?and you can go a lot deeper than that without getting beyond the supported, portable reflection techniques. ?* Java's VM is designed to be debuggable and optimizable; CPython's is designed to be the simplest thing that could support CPython. So, anything that's too hard to do with runtime structures is often easier at the VM level in Java, while the reverse is true in CPython. ?* Java code is often distributed and always deployed as binary files; Python almost always as source. Besides being the cause of problems like the one in this article, it also means that if you have to go below the runtime level, you don't have the intermediate steps of source and AST hacking, you have no choice but to go to the bytecode. From abarnert at yahoo.com Wed Jul 2 00:03:17 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 1 Jul 2014 15:03:17 -0700 Subject: [Python-ideas] .pyu nicode syntax symbols (was Re: Empty set, Empty dict) In-Reply-To: References: <1403931602.14407.135458493.44CF193B@webmail.messagingengine.com> <1404148690.18766.136186337.62A26E9E@webmail.messagingengine.com> <1404172094.55099.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404201877.75807.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1404211878.45311.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: <1404252197.45327.YahooMailNeo@web181001.mail.ne1.yahoo.com> > On Tuesday, July 1, 2014 1:05 PM, Terry Reedy wrote: > > On 7/1/2014 6:51 AM, Andrew Barnert wrote: > >> I think Terry Reedy actually had a better answer: just tell people to >> implement it,? polish it up, put it on PyPI, and come back to us when >> they're ready to show off their tons of users who can't live > without >> it. Random objected that wasn't possible, > > 'Random' said something quite different. He only noted that if > '?' were > translated to 'set()', then the resulting CPython-specific bytecode > would continue to be "LOAD_GLOBAL (set), CALL_FUNCTION 0" rather than > the 'optimized' "BUILD_SET 0". He also noted (objected) that > there is no > python code that CPython currently compiles as "BUILD_SET 0"? You're reading a lot into a 2-line message, but your take is that he interpreted the problem as needing to compile "BUILD_SET 0", and pointed out that there is no way to do that with a source preprocessor. You can insist that they're two separate problems to be solved (or, maybe, not solved), and I think you're right. You just have to make that point?as you, I, and half a dozen others have done since his original post. But meanwhile, Chris Angelico offered a solution to the problem that answers his complaint, and I offered another solution that doesn't even require bytecode hacking. That shows that even if you accept the objection, it still doesn't block anyone. > Well, its > unfortunate that {} is not available. If it were, there would be no > issue, to me anyway, of using '?'.? 
However, optimizing CPython > bytecode, and compiler hooks, are completely different issues from > translating unisym python to standard python that could run on any > implementation of Python. First, as others have pointed out, it's not just, or even primarily, an optimization, it's also a semantic difference. > If we thought the bytecode difference was? > important (which most do not), we could have a peephole optimizer to > 'fix' it, completely independently of the existence of '?' or > any idea > of using it in python code. But you can't make semantic changes in a peephole optimizer. You'd have to first change the language to document that set() may (or may not!) return an empty set even if the name set resolves to something different. While this isn't entirely unique in Python history (e.g., back when you could redefine False through various kinds of trickery, the compiler was still allowed to optimize out if False: code), but it's very unusual. And nobody's going to do that for a minor optimization (if False:, besides being a potentially huge optimization, also _fixes_ a semantic problem, rather than causing one, since False was supposed to be un-redefinable, but wasn't because of various holes). >> in which case Terry's idea is more of a dismissal than a helpful > suggestion, > > My post was a dismissal of the idea of changing python itself *and* a > suggestion of how to proceed without involving pydev. My point is that _if_ you take Random's objection as being critical, _then_ your post dismisses the idea, even though it wasn't intended to. You can follow up in two ways: challenge his objection, or answer his objection; there were replies doing both, and if either of the two succeeds, the idea is still alive for people to take further if they want. >> https://github.com/abarnert/emptyset proves that it is possible, and >> even pretty easy. > > I consider producing (or at least being able to produce) a standard .py > file that can be published outside the specialized group run on and > debugged on standard interpreters to be essential to any sensible idea? My approach is made up of nothing but standard .py files. Those files can be published outside a specialized group, and run and debugged on CPython 3.4+. They can also be edited by people outside that specialized group, without needing a specialized build process involving a preprocessor, just a standard Python module that they already have. Sure, it only works on CPython, but Python 3.4, scipy, etc. also currently only work on CPython, and that doesn't prevent a large community of users from making using of them, publishing code outside a specialized group, and?most importantly for the topic at hand?coming up with suggestions that are germane to Python as a whole and taken seriously. For example, nobody suggested that PEP 465 wasn't a sensible idea because all of the sample code presented only runs on CPython; the idea itself is clearly portable, the community using such code is gigantic and mature, and that's all that matters. Finally, I don't think anyone actually needs this feature, but I was able to whip up a proof of concept in an hour that provides it. Anyone who seriously wants to pursue it doesn't have to use my approach, much less my code; it still serves as an existence proof that what they want to do can be done, meaning they should go do it. 
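To make the semantic point above concrete, a minimal sketch (an editorial aside, not from the thread): a set() call resolves the name at run time, so silently compiling it to BUILD_SET 0 would change behaviour whenever the name is shadowed, which is why it cannot be treated as a pure peephole optimization without a language-level guarantee.

def with_call():
    return set()           # looks up the name "set" when the function runs

def shadowed():
    set = lambda: "not a set at all"   # perfectly legal shadowing
    return set()                       # BUILD_SET 0 here would change the result

print(with_call())    # set()
print(shadowed())     # not a set at all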
From stefano.borini at ferrara.linux.it Wed Jul 2 00:36:48 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Wed, 02 Jul 2014 00:36:48 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments Message-ID: <53B33800.1030300@ferrara.linux.it> Dear all, after the first mailing list feedback, and further private discussion with Joseph Martinot-Lagarde, I drafted a first iteration of a PEP for keyword arguments in indexing. The document is available here. https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt The document is not in final form when it comes to specifications. In fact, it requires additional discussion about the best strategy to achieve the desired result. Particular attention has been devoted to present alternative implementation strategies, their pros and cons. I will examine all feedback tomorrow morning European time (in approx 10 hrs), and apply any pull requests or comments you may have. When the specification is finalized, or this community suggests that the PEP is in a form suitable for official submission despite potential open issues, I will submit it to the editor panel for further discussion, and deploy an actual implementation according to the agreed specification for a working test run. I apologize for potential mistakes in the PEP drafting and submission process, as this is my first PEP. Kind Regards, Stefano Borini From rosuav at gmail.com Wed Jul 2 03:06:24 2014 From: rosuav at gmail.com (Chris Angelico) Date: Wed, 2 Jul 2014 11:06:24 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> Message-ID: On Wed, Jul 2, 2014 at 8:36 AM, Stefano Borini wrote: > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt A good start! """ C0: a[1] -> idx = 1 # integer a[1,2] -> idx = (1,2) # tuple C1: a[Z=3] -> idx = {"Z": 3} # dictionary with single key C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} # dictionary/ordereddict [*] or idx = ({"Z": 3}, {"R": 4}) # tuple of two single-key dict [**] ... C5. a[1, 2, Z=3] -> idx = (1, 2, {"Z": 3}) """ Another possibility for the keyword arguments is a two-item tuple, which would mean that C1 comes up as ("Z", 3) (or maybe (("Z", 3),) - keyword arguments forcing a tuple of all args for consistency/clarity), C2 as (("Z", 3), ("R", 4)), and C5 as (1, 2, ("Z", 3)). This would be lighter and easier to use than the tuple of dicts, and still preserves order (unlike the regular dict); however, it doesn't let you easily fetch up the one keyword you're interested in, which is normally something you'd want to support for a **kwargs-like feature: def __getitem__(self, item, **kwargs): # either that, or kwargs is part of item in some way ret = self.base[item] if "precis" in kwargs: ret.round(kwargs["precis"]) return ret To implement that with a tuple of tuples, or a tuple of dicts, you'd have to iterate over it and check each one - much less clean code. I would be inclined to simply state, in the PEP, that keyword arguments in indexing are equivalent to kwargs in function calls, and equally unordered (that is to say: if a proposal to make function call kwargs ordered is accepted, the same consideration can be applied to this, but otherwise they have no order). 
This does mean that it doesn't fit the original use-case, but it seems very odd to start out by saying "here, let's give indexing the option to carry keyword args, just like with function calls", and then come back and say "oh, but unlike function calls, they're inherently ordered and carried very differently". For the OP's use-case, though, it would actually be possible to abuse slice notation. I don't remember this being mentioned, but it does preserve order; the cost is that all the "keywords" have to be defined as objects. class kw: pass # because object() doesn't have attributes def make_kw(names): for i in names.split(): globals()[i] = obj = kw() obj.keyword_arg = i make_kw("Z R X") # Now you can use them in indexing some_obj[5, Z:3] some_obj[7, Z:3, R:4] The parameters will arrive in the item tuple as slice objects, where the start is a signature object and the stop is its value. >>> some_obj[5, Z:3] getitem: (5, slice(<__main__.kw object at 0x016C5E10>, 3, None)) Yes, it uses a colon rather than an equals sign, but on the flip side, it already works :) ChrisA From rob.cliffe at btinternet.com Wed Jul 2 04:36:23 2014 From: rob.cliffe at btinternet.com (Rob Cliffe) Date: Wed, 02 Jul 2014 03:36:23 +0100 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> Message-ID: <53B37027.9030808@btinternet.com> On 01/07/2014 23:36, Stefano Borini wrote: > Dear all, > > after the first mailing list feedback, and further private discussion > with Joseph Martinot-Lagarde, I drafted a first iteration of a PEP for > keyword arguments in indexing. The document is available here. > > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt A small bit of uninformed feedback (no charge :-) ): 1) Ahem, doesn't a[3] (usually) return the *fourth* element of a ? 2) """ Compare e.g. a[1:3, Z=2] with a.get(slice(1,3,None), Z=2). """ I think this is slightly unfair as the second form can be abbreviated to a.get(slice(1,3), Z=2), just as the first is an abbreviation for a[1:3:None, Z=2]. 3) You may not consider this relevant. But as an (I believe) intelligent reader, but one unfamiliar with the material, I cannot understand what your first example """ low_accuracy_energy = computeEnergy(molecule, BasisSet[Z=3]) """ is about, and whether it is really (conceptually) related to indexing, or just a slick hack. I guess it could be anything, depending on the implementation of __getitem__. Best wishes, Rob Cliffe From anthony at xtfx.me Wed Jul 2 04:58:44 2014 From: anthony at xtfx.me (C Anthony Risinger) Date: Tue, 1 Jul 2014 21:58:44 -0500 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> Message-ID: On Jul 1, 2014 8:06 PM, "Chris Angelico" wrote: > > [...] > > For the OP's use-case, though, it would actually be possible to abuse > slice notation. I don't remember this being mentioned, but it does > preserve order; the cost is that all the "keywords" have to be defined > as objects. > > class kw: pass # because object() doesn't have attributes > def make_kw(names): > for i in names.split(): > globals()[i] = obj = kw() > obj.keyword_arg = i > make_kw("Z R X") > > # Now you can use them in indexing > some_obj[5, Z:3] > some_obj[7, Z:3, R:4] > > The parameters will arrive in the item tuple as slice objects, where > the start is a signature object and the stop is its value. 
> > >>> some_obj[5, Z:3] > getitem: (5, slice(<__main__.kw object at 0x016C5E10>, 3, None)) > > Yes, it uses a colon rather than an equals sign, but on the flip side, > it already works :) This works great, IIRC you can pretty much pass *anything*: dict[{}:] dict[AType:lambda x: x] dict[::] dict[:] ...don't forget extended slice possibilities :) I've dabbled with this in custom dict implementations and it usefully excludes all normal dicts, which quickly reject slice objects. -- C Anthony [mobile] -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Wed Jul 2 09:06:53 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 2 Jul 2014 00:06:53 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> Message-ID: On 1 July 2014 15:36, Stefano Borini wrote: > Dear all, > > after the first mailing list feedback, and further private discussion with > Joseph Martinot-Lagarde, I drafted a first iteration of a PEP for keyword > arguments in indexing. The document is available here. > > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt > > The document is not in final form when it comes to specifications. In fact, > it requires additional discussion about the best strategy to achieve the > desired result. Particular attention has been devoted to present alternative > implementation strategies, their pros and cons. I will examine all feedback > tomorrow morning European time (in approx 10 hrs), and apply any pull > requests or comments you may have. It's a well written PEP, but the "just use call notation instead" argument is going to be a challenging one to overcome. Given that part of the rationale given is that "slice(start, stop, step)" is uglier than the "start:stop:step" permitted in an indexing operation, the option of allowing "[start:]", "[:stop]","[start:stop:step]", etc as dedicated slice syntax should also be explicitly considered. Compare: a.get(slice(1,3), Z=2) # today a.get([1:3], Z=2) # slice sytax a[1:3, Z=2] # PEP Introducing a more general slice notation would make indexing *less* special (reducing the current "allows slice notation" special case to "allows slice notation with the surrounding square brackets implied". The reduction of special casing could be taken further, by allowing the surrounding square brackets to be omitted in tuple and list displays, just as they are in indexing operations. I'm not saying such a proposal would necessarily be accepted - I just see a proposal that takes an existing special case and proposes to make it *less* special as more appealing than one that proposes to make it even *more* special. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From nicholas.cole at gmail.com Wed Jul 2 09:45:47 2014 From: nicholas.cole at gmail.com (Nicholas Cole) Date: Wed, 2 Jul 2014 08:45:47 +0100 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> Message-ID: > It's a well written PEP, but the "just use call notation instead" > argument is going to be a challenging one to overcome. > +1 The advantages the PEP suggests are very subjective ones to do with readability. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefano.borini at ferrara.linux.it Wed Jul 2 09:59:54 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Wed, 2 Jul 2014 09:59:54 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> Message-ID: <20140702075954.GA26500@ferrara.linux.it> On Wed, Jul 02, 2014 at 08:45:47AM +0100, Nicholas Cole wrote: > > It's a well written PEP, but the "just use call notation instead" > > argument is going to be a challenging one to overcome. > > > > +1 > > The advantages the PEP suggests are very subjective ones to do with > readability. I want to be honest, I agree with this point of view myself. it's not _needed_. it would be a nice additional feature but maybe only rarely used and in very specialized cases, and again, there are always workarounds. Even if rejected on the long run, it rationalizes and analyzes motivations and alternatives, and enshrines them formally on why it's a "not worth it" scenario. Thank you for all the feedback. I am including all the raised points in the PEP and I'll follow up with a revised version ASAP. Stefano Borini From stefano.borini at ferrara.linux.it Wed Jul 2 12:08:25 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Wed, 2 Jul 2014 12:08:25 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B37027.9030808@btinternet.com> References: <53B33800.1030300@ferrara.linux.it> <53B37027.9030808@btinternet.com> Message-ID: <20140702100825.GA29589@ferrara.linux.it> On Wed, Jul 02, 2014 at 03:36:23AM +0100, Rob Cliffe wrote: > 1) Ahem, doesn't a[3] (usually) return the *fourth* element of a ? Yes. I changed the indexes many times for consistency and that slipped through. It used to be a[2] > > low_accuracy_energy = computeEnergy(molecule, BasisSet[Z=3]) """ > > is about, and whether it is really (conceptually) related to indexing, > or just a slick hack. I guess it could be anything, depending on the > implementation of __getitem__. The reason behind an indexing is that the BasisSet object could be internally represented as a numeric table, where rows are associated to individual elements (e.g. row 0:5 to element 1, row 5:8 to element 2) and each column is associated to a given degree of accuracy (e.g. first column is low accuracy, second column is medium accuracy etc). You could say that users are not concerned with the internal representation, but if they are eventually allowed to create these basis sets in this tabular form, it makes a nice conceptual model to keep the association column <-> accuracy and keep it explicit in the interface. From xavier.combelle at gmail.com Wed Jul 2 13:47:03 2014 From: xavier.combelle at gmail.com (Xavier Combelle) Date: Wed, 2 Jul 2014 13:47:03 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140702100825.GA29589@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <53B37027.9030808@btinternet.com> <20140702100825.GA29589@ferrara.linux.it> Message-ID: in this case: C1: a[Z=3] -> idx = {"Z": 3} # P1/P2 dictionary with single key as we can index with any object, I wonder how one could differency between the calls, a[z=3] and the actual a[{"Z":3}]. Do they should be return the same? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefano.borini at ferrara.linux.it Wed Jul 2 14:20:03 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Wed, 2 Jul 2014 14:20:03 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <53B37027.9030808@btinternet.com> <20140702100825.GA29589@ferrara.linux.it> Message-ID: <20140702122003.GA2183@ferrara.linux.it> On Wed, Jul 02, 2014 at 01:47:03PM +0200, Xavier Combelle wrote: > in this case: > > C1: a[Z=3] -> idx = {"Z": 3} # P1/P2 > dictionary with single key > > > as we can index with any object, I wonder how one could differency between > the calls, a[z=3] > and the actual a[{"Z":3}]. Do they should be return the same? indeed you can't, and if I recall correctly I wrote it somewhere. The point is eventually if such distinction is worth considering or if, instead, the two cases should be handled as degenerate (equivalent) notations. IMHO, they should be kept distinct, and this disqualifies that implementation strategy. Too much magic would happen otherwise. -- ------------------------------------------------------------ -----BEGIN GEEK CODE BLOCK----- Version: 3.12 GCS d- s+:--- a? C++++ UL++++ P+ L++++ E--- W- N+ o K- w--- O+ M- V- PS+ PE+ Y PGP++ t+++ 5 X- R* tv+ b DI-- D+ G e h++ r+ y* ------------------------------------------------------------ From 4kir4.1i at gmail.com Wed Jul 2 17:14:47 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Wed, 02 Jul 2014 19:14:47 +0400 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments References: <53B33800.1030300@ferrara.linux.it> Message-ID: <878uobn66w.fsf@gmail.com> Stefano Borini writes: > Dear all, > > after the first mailing list feedback, and further private discussion > with Joseph Martinot-Lagarde, I drafted a first iteration of a PEP for > keyword arguments in indexing. The document is available here. > > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt > > The document is not in final form when it comes to specifications. In > fact, it requires additional discussion about the best strategy to > achieve the desired result. Particular attention has been devoted to > present alternative implementation strategies, their pros and cons. I > will examine all feedback tomorrow morning European time (in approx 10 > hrs), and apply any pull requests or comments you may have. > > When the specification is finalized, or this community suggests that > the PEP is in a form suitable for official submission despite > potential open issues, I will submit it to the editor panel for > further discussion, and deploy an actual implementation according to > the agreed specification for a working test run. > > I apologize for potential mistakes in the PEP drafting and submission > process, as this is my first PEP. > Strategy 3b: builtin named tuple C0. a[2] -> idx = 2; # scalar a[2,3] -> idx = (2, 3) # tuple idx[0] == 2 idx[1] == 3 C1. a[Z=3] -> idx = (Z=3) # builtin named tuple (pickable, etc) idx[0] == idx.Z == 3 C2. a[Z=3, R=2] -> idx = (Z=3, R=2) idx[0] == idx.Z == 3 idx[1] == idx.R == 2 C3. a[1, Z=3] -> idx = (1, Z=3) idx[0] == 1 idx[1] == idx.Z == 3 C4. a[1, Z=3, R=2] -> idx = (1, Z=3, R=2) idx[0] == 1 idx[1] == idx.Z == 3 idx[2] == idx.R == 2 C5. a[1, 2, Z=3] -> idx = (1, 2, Z=3) C6. a[1, 2, Z=3, R=4] -> (1, 2, Z=3, R=4) C7. 
a[1, Z=3, 2, R=4] -> SyntaxError: non-keyword arg after keyword arg Pros: - looks nice - easy to explain: a[1,b=2] is equivalent to a[(1,b=2)] like a[1,2] is equivalent to a[(1,2)] - it makes `__getitem__` *less special* if Python supports a builtin named tuple and/or ordered keyword args (the call syntax) Cons: - Python currently has no builtin named tuple (an ordered collection of named (optionally) values) - Python currently doesn't support ordered keyword args (it might have made the implementation trivial) Note: `idx = (Z=3)` is a SyntaxError so it is safe to produce a named tuple instead of a scalar. -- Akira From drekin at gmail.com Wed Jul 2 18:40:43 2014 From: drekin at gmail.com (drekin at gmail.com) Date: Wed, 02 Jul 2014 09:40:43 -0700 (PDT) Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> Message-ID: <53b4360b.ebd3b40a.33ca.5890@mx.google.com> Hello, just some remarks: Ad degeneracy of notation: The case of a[Z=3] and a[{"Z": 3}] is similar to current a[1, 2] and a[(1, 2)]. Even though one may argue that the parentheses are actually not part of tuple notation but are just needed because of syntax, it may look as degeneracy of notation when compared to function call: f(1, 2) is not the same thing as f((1, 2)). Ad making dict.get() obsolete: There is still often used a_dict.get(key) which has to be spelled a_dict[key, default=None] with index notation. The _n keys used in strategy 3 may be indexed from zero like list indices. Regards, Drekin From joseph.martinot-lagarde at m4x.org Wed Jul 2 21:17:15 2014 From: joseph.martinot-lagarde at m4x.org (Joseph Martinot-Lagarde) Date: Wed, 02 Jul 2014 21:17:15 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> Message-ID: <53B45ABB.30802@m4x.org> Le 02/07/2014 09:45, Nicholas Cole a ?crit : > > It's a well written PEP, but the "just use call notation instead" > argument is going to be a challenging one to overcome. > > > +1 > > The advantages the PEP suggests are very subjective ones to do with > readability. Well, "Readability counts" is in the zen of python ! Having recently translated a Matlab program to python, I can assure you that the notation difference between call and indexing is really useful. .get() does not looks like indexing. From timothy.c.delaney at gmail.com Wed Jul 2 22:12:30 2014 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Thu, 3 Jul 2014 06:12:30 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> Message-ID: On 2 July 2014 08:36, Stefano Borini wrote: > Dear all, > > after the first mailing list feedback, and further private discussion with > Joseph Martinot-Lagarde, I drafted a first iteration of a PEP for keyword > arguments in indexing. The document is available here. > > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt > > The document is not in final form when it comes to specifications. In > fact, it requires additional discussion about the best strategy to achieve > the desired result. Particular attention has been devoted to present > alternative implementation strategies, their pros and cons. I will examine > all feedback tomorrow morning European time (in approx 10 hrs), and apply > any pull requests or comments you may have. 
> > When the specification is finalized, or this community suggests that the > PEP is in a form suitable for official submission despite potential open > issues, I will submit it to the editor panel for further discussion, and > deploy an actual implementation according to the agreed specification for a > working test run. > > I apologize for potential mistakes in the PEP drafting and submission > process, as this is my first PEP. > One option I don't see is to have a[b=1, c=2] be translated to a.__getitem__((slice('b', 1, None), slice['c', 2, None)) automatically. That completely takes care of backwards compatibility in __getitem__ (no change at all), and also deals with your issue with abusing slice objects: a[K=1:10:2] -> a.__getitem__(slice('K', slice(1, 10, 2))) And using that we can have an ordered dict "literal" class OrderedDictLiteral(object): def __getitem__(self, t): try: i = iter(t) except TypeError: i = (t,) return collections.OrderedDict((s.start, s.stop) for s in i) odict = OrderedDictLiteral() o = odict[a=1, b='c'] print(o) # prints OrderedDict([('a', 1), ('b', 'c')]) On a related note, if we combined this with the idea that kwargs should be constructed using the type of the passed dict (i.e. if you pass an OrderedDict as **kwargs you get a new OrderedDict in the function) we could do: kw = OrderedDictLiteral() def f(**kw): print(kw) f('a', 'b', **kw[c='d', e=2]) always resulting in: {'c': 'd', 'e': 2} Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothy.c.delaney at gmail.com Wed Jul 2 22:14:00 2014 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Thu, 3 Jul 2014 06:14:00 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> Message-ID: On 3 July 2014 06:12, Tim Delaney wrote: > > a[K=1:10:2] -> a.__getitem__(slice('K', slice(1, 10, 2))) > Of course, that should have been: a[K=1:10:2] -> a.__getitem__(slice('K', slice(1, 10, 2), None)) Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.borini at ferrara.linux.it Wed Jul 2 23:29:53 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Wed, 2 Jul 2014 23:29:53 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> Message-ID: <20140702212953.GA16637@ferrara.linux.it> On Thu, Jul 03, 2014 at 06:12:30AM +1000, Tim Delaney wrote: > One option I don't see is to have a[b=1, c=2] be translated to > a.__getitem__((slice('b', 1, None), slice['c', 2, None)) automatically. it would be weird, since it's not technically a slice, but it would work. I personally think that piggybacking on the slice would appear hackish. One could eventually think to have a keyword() object similar to slice(), but then it's basically a single item dictionary (Strategy 1) with a fancy name. -- ------------------------------------------------------------ -----BEGIN GEEK CODE BLOCK----- Version: 3.12 GCS d- s+:--- a? 
C++++ UL++++ P+ L++++ E--- W- N+ o K- w--- O+ M- V- PS+ PE+ Y PGP++ t+++ 5 X- R* tv+ b DI-- D+ G e h++ r+ y* ------------------------------------------------------------ From timothy.c.delaney at gmail.com Thu Jul 3 01:10:18 2014 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Thu, 3 Jul 2014 09:10:18 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140702212953.GA16637@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> Message-ID: On 3 July 2014 07:29, Stefano Borini wrote: > On Thu, Jul 03, 2014 at 06:12:30AM +1000, Tim Delaney wrote: > > One option I don't see is to have a[b=1, c=2] be translated to > > a.__getitem__((slice('b', 1, None), slice['c', 2, None)) automatically. > > it would be weird, since it's not technically a slice, but it would work. > I personally think that piggybacking on the slice would appear hackish. > One could eventually think to have a keyword() object similar to slice(), > but then it's basically a single item dictionary (Strategy 1) with a fancy > name. I really do think that a[b=c, d=e] should just be syntax sugar for a['b':c, 'd':e]. It's simple to explain, and gives the greatest backwards compatibility. In particular, libraries that already abused slices in this way will just continue to work with the new syntax. I'd maybe thought a subclass of slice, with .key (= .start) and and .value (= .stop) variables would work, but slice isn't subclassable so it would be a bit more difficult. That would also be backwards-compatible with existing __getitem__ that used slice, but would preclude people calling that __getitem__ with slice syntax, which I personally don't think is desireable. Instead, maybe recommend something like: ordereddict = OrderedDictLiteral() # using the definition from previous email class GetItemByName(object): def __getitem__(self, t): # convert the parameters to a dictionary d = ordereddict[t] return d['name'] Hmm - here's an anonymous named tuple "literal" as another example: class AnonymousNamedTuple(object): def __getitem__(self, t): d = ordereddict[t] t = collections.namedtuple('_', d) return t(*d.values()) namedtuple = AnonymousNamedTuple() print(namedtuple[a='b', c=1]) # _(a='b', c=1) As you can see, I'm in favour of keeping the order of the keyword arguments to the index - losing it would prevent things like the above. Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Thu Jul 3 01:40:39 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 02 Jul 2014 16:40:39 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> Message-ID: <53B49877.6090304@stoneleaf.us> On 07/02/2014 04:10 PM, Tim Delaney wrote: > > I really do think that a[b=c, d=e] should just be syntax sugar for a['b':c, 'd':e]. It's simple to explain, and gives > the greatest backwards compatibility. In particular, libraries that already abused slices in this way will just continue > to work with the new syntax. 
+0.5 for keywords in __getitem__ +1 for this version of it ~Ethan~ From bruce at leapyear.org Thu Jul 3 09:37:45 2014 From: bruce at leapyear.org (Bruce Leban) Date: Thu, 3 Jul 2014 00:37:45 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B49877.6090304@stoneleaf.us> References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> Message-ID: On Wed, Jul 2, 2014 at 4:40 PM, Ethan Furman wrote: > On 07/02/2014 04:10 PM, Tim Delaney wrote: > >> >> I really do think that a[b=c, d=e] should just be syntax sugar for >> a['b':c, 'd':e]. It's simple to explain, and gives >> the greatest backwards compatibility. In particular, libraries that >> already abused slices in this way will just continue >> to work with the new syntax. >> > > +0.5 for keywords in __getitem__ > > +1 for this version of it If there weren't already abuse of slices for this purpose, would this be the first choice? I think not. This kind of abuse makes it more likely that there will be mysterious failures when someone tries to use keyword indexing for objects that don't support it. In contrast, using kwargs means you'll get an immediate meaningful exception. Tangentially, I think the PEP can reasonably reserve the keyword argument name 'default' for default values specifying that while __getitem__ methods do not need to support default, they should not use that keyword for any other purpose. Also, the draft does not explain why you would not allow defining __getitem__(self, idx, x=1, y=2) rather than only supporting the kwargs form. I don't know if I think it should or shouldn't at this point, but it definitely think it need to be discussed and justified one way or the other. --- Bruce Learn how hackers think: http://j.mp/gruyere-security https://www.linkedin.com/in/bruceleban -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.borini at ferrara.linux.it Thu Jul 3 19:00:36 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Thu, 3 Jul 2014 19:00:36 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> Message-ID: <20140703170036.GA13843@ferrara.linux.it> On Thu, Jul 03, 2014 at 09:10:18AM +1000, Tim Delaney wrote: > I really do think that a[b=c, d=e] should just be syntax sugar for a['b':c, > 'd':e]. It's simple to explain, and gives the greatest backwards > compatibility This is indeed a point, as the initialization for a dictionary looks very, very similar, however, it would definitely collide with the slice object. At the very least, it would be confusing. > In particular, libraries that already abused slices in this > way will just continue to work with the new syntax. Are there any actual examples in the wild of this behavior? From stefano.borini at ferrara.linux.it Thu Jul 3 19:15:09 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Thu, 3 Jul 2014 19:15:09 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> Message-ID: <20140703171509.GB13843@ferrara.linux.it> On Wed, Jul 02, 2014 at 12:36:48AM +0200, Stefano Borini wrote: > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt I committed and pushed the most recent changes and they are now available. 
Some points have been clarified and expanded. Also, there's a new section about C interface compatibility. Please check the diffs for tracking the changes. Tonight I will comb the document and the thread again, further distilling the current hot spots. From shoyer at gmail.com Thu Jul 3 19:57:48 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 3 Jul 2014 10:57:48 -0700 (PDT) Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> Message-ID: <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> don't have strong opinions about the implementation, but I am strongly supportive of this PEP for the second case it lists -- the ability to index an a multi-dimensional array by axis name or label instead of position. Why? Suppose you're working with high dimensional data, where arrays may have any number of axes such as time, x, y and z. I work with this sort of data every day, as do many scientists. It is awkward and error prone to use the existing __getitem__ and __setitem__ syntax, because it's difficult to reliably keep track of axis order with this many indices: a[:, :, 0:10] vs. a[y=0:10] Keyword getitem syntax should be encouraged for the same reasons that keyword arguments are often preferable to positional arguments: it is both explicit (no implicit reliance on axis order), and more flexible (the same code will work on arrays with transposed or altered axes). This is particularly important because it is typical to be working with arrays that use some but not all the same axes. A method does allow for an explicit (if verbose) alternative to __getitem__ syntax: a.getitem(y=slice(0, 10)) But it's worse for __setitem__: a.setitem(dict(y=slice(0, 10)), 0) vs. a[y=0:10] = 0 ------------ Another issue: The PEP should address whether expressions with slice abbreviations like the following should be valid syntax: a[x=:, y=:5, z=::-1] These look pretty strange (=: looks like a form of assign), but the functionality would certainly be nice to support in some way. Surrounding the indices with [] might help: a[x=[:], y=[:5], z=[::-1]] ------------- On Thursday, July 3, 2014 12:39:09 AM UTC-7, Bruce Leban wrote: > > Tangentially, I think the PEP can reasonably reserve the keyword argument > name 'default' for default values specifying that while __getitem__ methods > do not need to support default, they should not use that keyword for any > other purpose. > -1 from me. The existing get method handles this case pretty well, with fewer keystrokes than the keyword only "default" index (as I think has already been pointed out). In my opinion, case 1 (labeled indices for a physics DSL) and case 2 (labeled indices to removed ambiguity) are basically the same, and the only use-cases that should be encouraged. Labeling tensor indices with names in mathematical notation is standard for precisely the same reasons that it's a good idea for Python. Best, Stephan (note: apologies for any redundant messages, I tried sending this message from the google groups mirror before I signed up, which didn't go out to the main listing list) > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefano.borini at ferrara.linux.it Thu Jul 3 20:30:59 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Thu, 3 Jul 2014 20:30:59 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140703171509.GB13843@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140703171509.GB13843@ferrara.linux.it> Message-ID: <20140703183059.GC13843@ferrara.linux.it> On Thu, Jul 03, 2014 at 07:15:09PM +0200, Stefano Borini wrote: > On Wed, Jul 02, 2014 at 12:36:48AM +0200, Stefano Borini wrote: > > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt > > I committed and pushed the most recent changes and they are now available. > Some points have been clarified and expanded. Also, there's a new section about > C interface compatibility. Please check the diffs for tracking the changes. Forgot: I also added a possibility P4 for the first strategy: keyword (alternative name "keyindex") which was proposed in the thread. This solution would look rather neat >>> a[3] 3 >>> a[3:1] slice(3, 1, None) >>> a[slice(3,1,None)] # <- Note how this notation is a long and equivalent form of the slice(3, 1, None) # syntactic sugar above >>> a[z=4] # <- Again, note how this notation would be a syntactic sugar keyindex("z", 4) # for a[keyindex("z", 4)] >>> a[z=1:5:2] # <- Supports slices too. keyindex("z", slice(1,5,2)) # No ambiguity with dictionaries, and C compatibility is # straightforward >>> keyindex("z", 4).key "z" Another thing I observed is that the point of indexing operation is indexing, and a keyed _index_ is not the same thing as a keyed _option_ during an indexing operation. This has been stated during the thread but it's worth to point out explicitly in the PEPi (it isn't). Using it for options such as default would technically be a misuse, but an acceptable one for... broad definitions of indexing. The keyindex object could be made to implement the same interface as its value through forwarding, so it can behave just as its value if your logic cares only about position, and not key >>> keyindex("z", 4) + 1 5 Another rationalization: current indexing has only one degree of freedom, that is: positioning. Add keywords and now there are two degrees of freedom: position and key. How are these two degrees of freedom supposed to interact? From stefano.borini at ferrara.linux.it Thu Jul 3 21:33:56 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Thu, 3 Jul 2014 21:33:56 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> Message-ID: <20140703193356.GD13843@ferrara.linux.it> On Thu, Jul 03, 2014 at 10:57:48AM -0700, Stephan Hoyer wrote: > don't have strong opinions about the implementation, but I am strongly > supportive of this PEP for the second case it lists -- the ability to index > an a multi-dimensional array by axis name or label instead of position. thinking aloud. The biggest problem is that there's no way of specifying which labels the object supports, and therefore no way of binding a specified keyword, unless the __getitem__ signature is deeply altered. 
> It is awkward and error prone to use the existing __getitem__ and > __setitem__ syntax, because it's difficult to reliably keep track of axis > order with this many indices: > > a[:, :, 0:10] > > vs. > > a[y=0:10] This is indeed an important use case. I should probably stress it more in the PEP. > Another issue: The PEP should address whether expressions with slice > abbreviations like the following should be valid syntax: > > a[x=:, y=:5, z=::-1] looks ugly indeed > Surrounding the indices with [] might help: > > a[x=[:], y=[:5], z=[::-1]] better, but unusual > -1 from me. The existing get method handles this case pretty well, with > fewer keystrokes than the keyword only "default" index (as I think has > already been pointed out). > > In my opinion, case 1 (labeled indices for a physics DSL) and case 2 > (labeled indices to removed ambiguity) are basically the same, and the only > use-cases that should be encouraged. Labeling tensor indices with names in > mathematical notation is standard for precisely the same reasons that it's > a good idea for Python. Meaning dropping the use of keyword indexing for "options" use cases. From shoyer at gmail.com Thu Jul 3 21:43:59 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 3 Jul 2014 12:43:59 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140703193356.GD13843@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> Message-ID: On Thu, Jul 3, 2014 at 12:33 PM, Stefano Borini < stefano.borini at ferrara.linux.it> wrote: > > thinking aloud. > The biggest problem is that there's no way of specifying which labels the > object supports, and therefore no way of binding a specified keyword, > unless > the __getitem__ signature is deeply altered. I don't I follow you here. The object itself handles the __getitem__ logic in whatever way it sees fit, and it would be up to it to raise KeyError when an invalid label is supplied, much like the current situation with invalid keys. Stephan -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.borini at ferrara.linux.it Thu Jul 3 21:59:11 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Thu, 3 Jul 2014 21:59:11 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> Message-ID: <20140703195911.GE13843@ferrara.linux.it> On Thu, Jul 03, 2014 at 12:43:59PM -0700, Stephan Hoyer wrote: > On Thu, Jul 3, 2014 at 12:33 PM, Stefano Borini < > stefano.borini at ferrara.linux.it> wrote: > > > > thinking aloud. > > The biggest problem is that there's no way of specifying which labels the > > object supports, and therefore no way of binding a specified keyword, > > unless > > the __getitem__ signature is deeply altered. > > > I don't I follow you here. The object itself handles the __getitem__ logic > in whatever way it sees fit, and it would be up to it to raise KeyError > when an invalid label is supplied, much like the current situation with > invalid keys. NB: Still thinking aloud here... 
True, but the problem is that in a function def foo(x,y,z): pass calling the following will give the exact same result foo(1,2,3) foo(x=1, y=2, z=3) foo(z=3, x=1, y=2) this happens because at function definition you can specify the argument names. with __getitem__ you can't explain this binding. its current form precludes it __getitem__(self, idx) if you use a[1,2,3], you have no way of saying that "the first index is called x", so you have no way for these two to be equivalent in a similar way a function does a[1,2,3] a[z=3, x=1, y=2] unless you allow getitem in the form __getitem__(self, x, y, z) which I feel it would be a wasps' nest in terms of backward compatibility, both at the python and C level. I doubt this would fly. So if you want to keep __getitem__ signature unchanged, you will have to map labels to positions manuallyi inside __getitem__, a potentially complex task. Not even strategy 3 (namedtuple) would solve this issue. From shoyer at gmail.com Thu Jul 3 22:20:45 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 3 Jul 2014 13:20:45 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140703195911.GE13843@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> <20140703195911.GE13843@ferrara.linux.it> Message-ID: On Thu, Jul 3, 2014 at 12:59 PM, Stefano Borini < stefano.borini at ferrara.linux.it> wrote: > So if you want to keep __getitem__ signature unchanged, you will have to > map labels > to positions manuallyi inside __getitem__, a potentially complex task. Not > even > strategy 3 (namedtuple) would solve this issue. > Yes, this is true. However, in practice many implementations of labeled arrays would have generic labeled axes, so they would need to use their own logic to do the mapping in __getitem__ anyways. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sturla.molden at gmail.com Thu Jul 3 22:48:06 2014 From: sturla.molden at gmail.com (Sturla Molden) Date: Thu, 3 Jul 2014 20:48:06 +0000 (UTC) Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments References: <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> <20140703195911.GE13843@ferrara.linux.it> Message-ID: <110367949426113052.835097sturla.molden-gmail.com@news.gmane.org> Stephan Hoyer wrote: > Yes, this is true. However, in practice many implementations of labeled > arrays would have generic labeled axes, so they would need to use their own > logic to do the mapping in __getitem__ anyways. If you are thiniking about Pandas, then each keyword should be allowed to take a slice as well. 
dataframe[apples=1:3, oranges=2:6] Sturla From shoyer at gmail.com Thu Jul 3 23:00:20 2014 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 3 Jul 2014 14:00:20 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <110367949426113052.835097sturla.molden-gmail.com@news.gmane.org> References: <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> <20140703195911.GE13843@ferrara.linux.it> <110367949426113052.835097sturla.molden-gmail.com@news.gmane.org> Message-ID: On Thu, Jul 3, 2014 at 1:48 PM, Sturla Molden wrote: > Stephan Hoyer wrote: > > > Yes, this is true. However, in practice many implementations of labeled > > arrays would have generic labeled axes, so they would need to use their > own > > logic to do the mapping in __getitem__ anyways. > > If you are thiniking about Pandas, then each keyword should be allowed to > take a slice as well. > > dataframe[apples=1:3, oranges=2:6] > Yes, I am indeed thinking about pandas and other similar libraries. Supporting slices with keywords would be essential. Stephan -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Thu Jul 3 23:48:10 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 3 Jul 2014 14:48:10 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> <20140703195911.GE13843@ferrara.linux.it> <110367949426113052.835097sturla.molden-gmail.com@news.gmane.org> Message-ID: On 3 July 2014 14:00, Stephan Hoyer wrote: > On Thu, Jul 3, 2014 at 1:48 PM, Sturla Molden > wrote: >> >> Stephan Hoyer wrote: >> >> > Yes, this is true. However, in practice many implementations of labeled >> > arrays would have generic labeled axes, so they would need to use their >> > own >> > logic to do the mapping in __getitem__ anyways. >> >> If you are thiniking about Pandas, then each keyword should be allowed to >> take a slice as well. >> >> dataframe[apples=1:3, oranges=2:6] > > > Yes, I am indeed thinking about pandas and other similar libraries. > Supporting slices with keywords would be essential. Some more concrete pandas-based examples could definitely help make a more compelling case. I genuinely think the hard part here is to make the case for offering the feature *at all*, so adding a "here is current real world pandas based code" and "here is how this PEP could make that code more readable" example could be worthwhile. Cheers, Nick. 
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From stefano.borini at ferrara.linux.it Fri Jul 4 08:25:13 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 04 Jul 2014 08:25:13 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <20140702212953.GA16637@ferrara.linux.it> <53B49877.6090304@stoneleaf.us> <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> <20140703195911.GE13843@ferrara.linux.it> <110367949426113052.835097sturla.molden-gmail.com@news.gmane.org> Message-ID: <53B648C9.1090907@ferrara.linux.it> On 7/3/14 11:48 PM, Nick Coghlan wrote: > Some more concrete pandas-based examples could definitely help make a > more compelling case. I genuinely think the hard part here is to make > the case for offering the feature *at all*, so adding a "here is > current real world pandas based code" and "here is how this PEP could > make that code more readable" example could be worthwhile. I agree. I will examine pandas this evening for more context. From j.wielicki at sotecware.net Fri Jul 4 10:21:53 2014 From: j.wielicki at sotecware.net (Jonas Wielicki) Date: Fri, 04 Jul 2014 10:21:53 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140703183059.GC13843@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140703171509.GB13843@ferrara.linux.it> <20140703183059.GC13843@ferrara.linux.it> Message-ID: <53B66421.80902@sotecware.net> On 03.07.2014 20:30, Stefano Borini wrote: > The keyindex object could be made to implement the same interface as its value > through forwarding, so it can behave just as its value if your logic cares only about > position, and not key > >>>> keyindex("z", 4) + 1 > 5 > What about a value which has a .key attribute? regards, jwi From stefano.borini at ferrara.linux.it Fri Jul 4 11:20:50 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 4 Jul 2014 11:20:50 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B66421.80902@sotecware.net> References: <53B33800.1030300@ferrara.linux.it> <20140703171509.GB13843@ferrara.linux.it> <20140703183059.GC13843@ferrara.linux.it> <53B66421.80902@sotecware.net> Message-ID: <20140704092050.GA8507@ferrara.linux.it> On Fri, Jul 04, 2014 at 10:21:53AM +0200, Jonas Wielicki wrote: > On 03.07.2014 20:30, Stefano Borini wrote: > > The keyindex object could be made to implement the same interface as its value > > through forwarding, so it can behave just as its value if your logic cares only about > > position, and not key > > > >>>> keyindex("z", 4) + 1 > > 5 > > > > What about a value which has a .key attribute? that would have to be added, and unless you copy the passed index it would be a side effect of getitem on the passed entity, which would not be nice. From drekin at gmail.com Fri Jul 4 11:29:34 2014 From: drekin at gmail.com (drekin at gmail.com) Date: Fri, 04 Jul 2014 02:29:34 -0700 (PDT) Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> Message-ID: <53b673fe.475fc20a.2d71.5367@mx.google.com> Just some ideas, not claiming they are good: ??? As already stated in the thread and also in the PEP, there are two different classes of uses cases of indexing with keyword arguments: as a named index, and as an option contextual to the indexing. 
I think that the cases ask for different signatures. Even if I have a complex indexing scheme, the signature is (assuming Strategy 1 or 3): def __getitem__(self, idx): However, if I now want to add support for a default value, I would do it like: _Empty = object() def __getitem__(self, idx, *, default=_Empty): That leads to the following strategies. Just for the sake of completeness, maybe the easiest and also most powerful strategy would be just copying the behaviour of a function call, with the arguments going to __getitem__ instead of __call__ and allowing the syntax sugar for slices (which would raise the question whether to allow slice literals also in function calls or even in every expression). This strategy has two serious problems: 1. It is not backwards compatible with the current mechanism of automatic packing of positional arguments. 2. It is not clear how to incorporate the additional parameter of __setitem__. This takes me to the following hybrid strategy. Both strategies 1 and 3 pack everything into one idx object whereas strategy 2 leaves key indices in a separate kwargs parameter. The hybrid strategy takes as much as possible from the function call strategy and generalizes strategies 1, 2, 3 at the same time. The general signature looks like this: def __getitem__(self, idx, *, key1, key2=default, **kwargs): During the call, every provided keyword argument with a corresponding parameter present is put into that parameter. If there is a **kwargs parameter then the remaining keyword arguments are put into kwargs, and if not then they are somehow (strategy 1 or 3) packed into the idx parameter. Also, the additional __setitem__ argument is just added as a positional argument: def __setitem__(self, idx, value, *, key1, key2=default, **kwargs): Regards, Drekin From stefano.borini at ferrara.linux.it Fri Jul 4 17:44:30 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 4 Jul 2014 17:44:30 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B648C9.1090907@ferrara.linux.it> References: <44dd6995-1335-41aa-8ab0-8e1ae79f8818@googlegroups.com> <20140703193356.GD13843@ferrara.linux.it> <20140703195911.GE13843@ferrara.linux.it> <110367949426113052.835097sturla.molden-gmail.com@news.gmane.org> <53B648C9.1090907@ferrara.linux.it> Message-ID: <20140704154430.GA18583@ferrara.linux.it> On Fri, Jul 04, 2014 at 08:25:13AM +0200, Stefano Borini wrote: > On 7/3/14 11:48 PM, Nick Coghlan wrote: >> Some more concrete pandas-based examples could definitely help make a >> more compelling case. I genuinely think the hard part here is to make >> the case for offering the feature *at all*, so adding a "here is >> current real world pandas based code" and "here is how this PEP could >> make that code more readable" example could be worthwhile. > > I agree. I will examine pandas this evening for more context. Ok, I examined pandas, and I think it solves a completely different problem. In [27]: df.loc[:,['A','B']] Out[27]: A B 2013-01-01 0.469112 -0.282863 2013-01-02 1.212112 -0.173215 Pandas is naming the columns. With keyword arguments you would be naming the _axes_. 
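To make the distinction concrete, here is a minimal sketch of a container whose *axes* (not its labels) are named, assuming the Strategy 1 desugaring in which a[x=1, y=2] reaches __getitem__ as a plain dict. The AxisNamedData class and the axis names are invented for illustration; they are not part of the PEP and not how pandas works.

    class AxisNamedData:
        """Toy container with named axes, backed by nested lists."""
        def __init__(self, data, axes):
            self._data = data        # nested lists, one nesting level per axis
            self._axes = list(axes)  # e.g. ["x", "y"]

        def __getitem__(self, idx):
            # Accept a positional tuple, a single index, or a dict keyed by
            # axis name (what a[x=..., y=...] would desugar to under Strategy 1).
            if isinstance(idx, dict):
                idx = tuple(idx[name] for name in self._axes)
            elif not isinstance(idx, tuple):
                idx = (idx,)
            out = self._data
            for i in idx:
                out = out[i]
            return out

    data = AxisNamedData([[10, 12, 15], [20, 22, 25]], axes=["x", "y"])
    print(data[1, 2])              # positional: 25
    print(data[{"y": 2, "x": 1}])  # by axis name, order irrelevant: 25

Under the proposed syntax the last line would read data[x=1, y=2].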
From stefano.borini at ferrara.linux.it Fri Jul 4 20:10:51 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 4 Jul 2014 20:10:51 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B33800.1030300@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> Message-ID: <20140704181051.GB18583@ferrara.linux.it> On Wed, Jul 02, 2014 at 12:36:48AM +0200, Stefano Borini wrote: > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt I just added a new strategy. This one cuts the problem down. Strategy 4: Strict dictionary ----------------------------- This strategy accepts that __getitem__ is special in accepting only one object, and the nature of that object must be non-ambiguous in its specification of the axes: it can be either by order, or by name. As a result of this assumption, in presence of keyword arguments, the passed entity is a dictionary and all labels must be specified. C0. a[1]; a[1,2] -> idx = 1; idx=(1, 2) C1. a[Z=3] -> idx = {"Z": 3} C2. a[Z=3, R=4] -> idx = {"Z"=3, "R"=4} C3. a[1, Z=3] -> raise SyntaxError C4. a[1, Z=3, R=4] -> raise SyntaxError C5. a[1, 2, Z=3] -> raise SyntaxError C6. a[1, 2, Z=3, R=4] -> raise SyntaxError C7. a[1, Z=3, 2, R=4] -> raise SyntaxError Pros: - strong conceptual similarity between the tuple case and the dictionary case. In the first case, we are specifying a tuple, so we are naturally defining a plain set of values separated by commas. In the second, we are specifying a dictionary, so we are specifying a homogeneous set of key/value pairs, as in dict(Z=3, R=4) - simple and easy to parse on the __getitem__ side: if it gets a tuple, determine the axes using positioning. If it gets a dictionary, use the keywords. - C interface does not need changes. Cons: - degeneracy of a[{"Z": 3, "R": 4}] with a[Z=3, R=4], but the same degeneracy exists for a[(2,3)] and a[2,3]. - very strict. - destroys the use case a[1, 2, default=5] i From phd at phdru.name Fri Jul 4 20:20:18 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 4 Jul 2014 20:20:18 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704181051.GB18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> Message-ID: <20140704182018.GA30712@phdru.name> On Fri, Jul 04, 2014 at 08:10:51PM +0200, Stefano Borini wrote: > C1. a[Z=3] -> idx = {"Z": 3} > C2. a[Z=3, R=4] -> idx = {"Z"=3, "R"=4} Huh? Shouldn't it be C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} ??? > Cons: > - degeneracy of a[{"Z": 3, "R": 4}] with a[Z=3, R=4], but the same degeneracy exists > for a[(2,3)] and a[2,3]. There is no degeneration in the second case. Tuples are created by commas, not parentheses (except for an empty tuple), hence (2,3) and 2,3 are simply the same thing. While Z=3, R=4 is far from being the same as {"Z": 3, "R": 4}. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. 
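A rough sketch of the dispatch Strategy 4 implies on the receiving side, assuming the interpreter delivers the keyword form as a plain dict and leaves the positional form unchanged; the Grid class and its axis names are invented for illustration and are not part of the PEP.

    class Grid:
        axes = ("Z", "R")

        def __getitem__(self, idx):
            if isinstance(idx, dict):
                # keyword form: a[Z=3, R=4] would arrive as {"Z": 3, "R": 4},
                # and Strategy 4 requires every axis to be named
                if set(idx) != set(self.axes):
                    raise KeyError("all axis labels must be given: %r" % (self.axes,))
                pos = tuple(idx[name] for name in self.axes)
            else:
                # positional form: a[3, 4] arrives as (3, 4), a[3] as 3
                pos = idx if isinstance(idx, tuple) else (idx,)
            return pos  # a real container would index its storage here

    g = Grid()
    print(g[3, 4])              # (3, 4)
    print(g[{"R": 4, "Z": 3}])  # (3, 4) -- what a[Z=3, R=4] would desugar to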
From stefano.borini at ferrara.linux.it Fri Jul 4 20:34:24 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 4 Jul 2014 20:34:24 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704183322.GC18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> Message-ID: <20140704183424.GD18583@ferrara.linux.it> On Fri, Jul 04, 2014 at 08:20:18PM +0200, Oleg Broytman wrote: > On Fri, Jul 04, 2014 at 08:10:51PM +0200, Stefano Borini wrote: > > C1. a[Z=3] -> idx = {"Z": 3} > > C2. a[Z=3, R=4] -> idx = {"Z"=3, "R"=4} > > Huh? Shouldn't it be > C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} yes. typo. already fixed in the PEP > > Cons: > > - degeneracy of a[{"Z": 3, "R": 4}] with a[Z=3, R=4], but the same degeneracy exists > > for a[(2,3)] and a[2,3]. > > There is no degeneration in the second case. Tuples are created by > commas, not parentheses (except for an empty tuple), hence (2,3) and 2,3 > are simply the same thing. We discussed this point above in the thread, and you are of course right in saying so, yet it stresses the fact that no matter what you pass inside those square brackets, they always end up funneled inside a single object, which happens to be a tuple that you just created > While Z=3, R=4 is far from being the same as > {"Z": 3, "R": 4}. but dict(Z=3, R=4) is the same as {"Z": 3, "R": 4}. this is exactly like tuple((2,3)) is the same as (2,3) See the similarity? the square brackets "call a constructor" on its content. This constructor is tuple if entries are not key=values (except for the single index case, of course), and dict if entries are key=values. From phd at phdru.name Fri Jul 4 20:39:15 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 4 Jul 2014 20:39:15 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704183424.GD18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> Message-ID: <20140704183915.GA31861@phdru.name> On Fri, Jul 04, 2014 at 08:34:24PM +0200, Stefano Borini wrote: > On Fri, Jul 04, 2014 at 08:20:18PM +0200, Oleg Broytman wrote: > > Z=3, R=4 is far from being the same as > > {"Z": 3, "R": 4}. > > but dict(Z=3, R=4) is the same as {"Z": 3, "R": 4}. > this is exactly like tuple((2,3)) is the same as (2,3) > See the similarity? the square brackets "call a constructor" > on its content. This constructor is tuple if entries are not > key=values (except for the single index case, of course), > and dict if entries are key=values. I didn't like the idea from the beginning and I am still against it. d = dict a[d(Z=3, R=4)] looks good enough for me without adding any magic to the language. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. 
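For comparison, the spellings that already work today without any language change, shown against a toy __getitem__ that simply reports what it receives (the Probe class is invented for illustration):

    class Probe:
        def __getitem__(self, idx):
            return idx

    a = Probe()
    d = dict
    print(a[d(Z=3, R=4)])   # {'Z': 3, 'R': 4} (key order may vary) -- explicit dict spelling
    print(a["Z":3, "R":4])  # (slice('Z', 3, None), slice('R', 4, None)) -- slice abuse
    print(a[1, 2])          # (1, 2) -- plain positional indexing, packed into a tuple as usual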
From stefano.borini at ferrara.linux.it Fri Jul 4 20:40:56 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 4 Jul 2014 20:40:56 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704183424.GD18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> Message-ID: <20140704184056.GA24625@ferrara.linux.it> On Fri, Jul 04, 2014 at 08:34:24PM +0200, Stefano Borini wrote: > but dict(Z=3, R=4) is the same as {"Z": 3, "R": 4}. > this is exactly like tuple((2,3)) is the same as (2,3) > See the similarity? the square brackets "call a constructor" > on its content. This constructor is tuple if entries are not > key=values (except for the single index case, of course), > and dict if entries are key=values. On this regard, one can of course do idx=(2,3) print(a[idx]) idx={"x":2, "y":3} print(a[idx]) the above syntax is already legal today, and calls back to a comment from a previous post. keywords would just be a shorthand for it. From alexander.belopolsky at gmail.com Fri Jul 4 21:00:56 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Fri, 4 Jul 2014 15:00:56 -0400 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704181051.GB18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> Message-ID: On Fri, Jul 4, 2014 at 2:10 PM, Stefano Borini < stefano.borini at ferrara.linux.it> wrote: > I just added a new strategy. This one cuts the problem down. > > Strategy 4: Strict dictionary > Did anyone consider treating = inside [] in a similar way as : is treated now. One can even (re/ab)use the slice object: a[1, 2, 5:7, Z=42] -> a.__getitem__((1, 2, slice(5, 7, None), slice('Z', '=', 42))) This strategy would also offer a semi-readable back-porting solution: >>> class C: ... def __getitem__(self, key): ... print(key) ... >>> c = C() >>> c['Z':'=':42] slice('Z', '=', 42) -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothy.c.delaney at gmail.com Fri Jul 4 22:10:15 2014 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Sat, 5 Jul 2014 06:10:15 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704184056.GA24625@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> Message-ID: 1. I think you absolutely *must* address the option of purely syntactic sugar in the PEP. It will come up on python-dev, so address it now. a[b, c=f, e=f:g:h] -> a[b, 'c':d, 'e':slice(f, g, h)] The rationale is readability and being both backwards and forwards compatible - existing __getitem__ designed to abuse slices will continue to work, and __getitem__ designed to work with the new syntax will work by abusing slices in older versions of Python. Pandas could be cited as an example of an existing library that could potentially benefit. It would be good if there were precise examples of Pandas syntax that would benefit immediately, but I don't know it beyond a cursory glance over the docs. 
My gut feeling from that is that if the syntax were available Pandas might be able to use it effectively. 2. I think you're at the point that you need to pick a single option as your preferred option, and everything else needs to be in the alternatives. FWIW, I would vote: +1 for syntax-sugar only (zero backwards-compatibility concerns). If I were starting from scratch this would not be my preferred option, but I think compatibility is important. +0 for a keyword(key, value) parameter object i.e. a[b, c=d, e=f:g:h] -> a[b, keyword('c', d), keyword('e', slice(f, g, h))] My objection is that either __getitem__ will be more complicated if you want to support earlier versions of Python (abuse slices for earlier versions, use keyword object for current) or imposes an additional burden on the caller in earlier versions (need to create a keyword-equivalent object to call with). If we were starting from scratch this would be one of my preferred options. -1 to any option that loses the order of the parameters (I'm strongly in favour of bringing order to keyword arguments - let's not take a backwards step here). -0 to any option that doesn't allow arbitrary ordering of positional and keyword arguments i.e. any option where the following is not legal: a[b, c=d, e] This is something we can do now (albeit in a fairly verbose way at times) and I think restricting this is likely to remove options for DSLs, etc. -0 for namedtuple (BTW you might want to mention that collections.namedtuple() already has precedent for _X positional parameter names) My objection is that it's not possible to determine definitively in __getitem__ if the the call was: a[b, c] or a[_0=b, _1=c] which might be important in some use cases. The same objection would apply to passing an OrderedDict (but that's got additional compatibility issues). Cheers, Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Jul 4 22:39:00 2014 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 4 Jul 2014 21:39:00 +0100 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> Message-ID: On Fri, Jul 4, 2014 at 9:10 PM, Tim Delaney wrote: > 1. I think you absolutely *must* address the option of purely syntactic > sugar in the PEP. It will come up on python-dev, so address it now. > > a[b, c=f, e=f:g:h] > -> a[b, 'c':d, 'e':slice(f, g, h)] > > The rationale is readability and being both backwards and forwards > compatible - existing __getitem__ designed to abuse slices will continue to > work, and __getitem__ designed to work with the new syntax will work by > abusing slices in older versions of Python. I don't know of any existing code that abuses slices in this way (so worrying about compatibility with it seems odd?). > Pandas could be cited as an example of an existing library that could > potentially benefit. It would be good if there were precise examples of > Pandas syntax that would benefit immediately, but I don't know it beyond a > cursory glance over the docs. My gut feeling from that is that if the syntax > were available Pandas might be able to use it effectively. Your hack (aside from being pointlessly ugly) would actually prevent pandas from using this feature. 
In pandas, slices like foo["a":"b"] already have a meaning (i.e., take all items from the one labeled "a" to the one labeled "b"). -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org From ethan at stoneleaf.us Fri Jul 4 22:19:02 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 04 Jul 2014 13:19:02 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> Message-ID: <53B70C36.3000202@stoneleaf.us> On 07/04/2014 01:10 PM, Tim Delaney wrote: > > 1. I think you absolutely *must* address the option of purely syntactic > sugar in the PEP. It will come up on python-dev, so address it now. > > a[b, c=f, e=f:g:h] > -> a[b, 'c':d, 'e':slice(f, g, h)] > +1 for syntax-sugar only (zero backwards-compatibility concerns). Also +1 for this approach. -- ~Ethan~ From timothy.c.delaney at gmail.com Fri Jul 4 22:46:58 2014 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Sat, 5 Jul 2014 06:46:58 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> Message-ID: On 5 July 2014 06:39, Nathaniel Smith wrote: > On Fri, Jul 4, 2014 at 9:10 PM, Tim Delaney > wrote: > > 1. I think you absolutely *must* address the option of purely syntactic > > sugar in the PEP. It will come up on python-dev, so address it now. > > > > a[b, c=f, e=f:g:h] > > -> a[b, 'c':d, 'e':slice(f, g, h)] > > > > The rationale is readability and being both backwards and forwards > > compatible - existing __getitem__ designed to abuse slices will continue > to > > work, and __getitem__ designed to work with the new syntax will work by > > abusing slices in older versions of Python. > > pandas from using this feature. In pandas, slices like foo["a":"b"] > already have a meaning (i.e., take all items from the one labeled "a" > to the one labeled "b"). > If that's the case then it should be listed as a reason in the PEP for a change larger than syntax sugar, otherwise this important information will be lost. One of the first suggestions when this PEP came up was to just (ab)use slices - people will use the syntax they have available to them. Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Fri Jul 4 23:07:41 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 04 Jul 2014 14:07:41 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> Message-ID: <53B7179D.8050101@stoneleaf.us> On 07/04/2014 01:39 PM, Nathaniel Smith wrote: > On Fri, Jul 4, 2014 at 9:10 PM, Tim Delaney wrote: > > Your hack (aside from being pointlessly ugly) would actually prevent > pandas from using this feature. 
In pandas, slices like foo["a":"b"] > already have a meaning (i.e., take all items from the one labeled "a" > to the one labeled "b"). Isn't that the standard way slices are supposed to be used though? Instead of integers Panda is allowing strings. How would Pandas use the new feature? -- ~Ethan~ From timothy.c.delaney at gmail.com Fri Jul 4 23:40:23 2014 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Sat, 5 Jul 2014 07:40:23 +1000 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B7179D.8050101@stoneleaf.us> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> <53B7179D.8050101@stoneleaf.us> Message-ID: On 5 July 2014 07:07, Ethan Furman wrote: > On 07/04/2014 01:39 PM, Nathaniel Smith wrote: > > On Fri, Jul 4, 2014 at 9:10 PM, Tim Delaney wrote: >> >> Your hack (aside from being pointlessly ugly) would actually prevent >> pandas from using this feature. In pandas, slices like foo["a":"b"] >> already have a meaning (i.e., take all items from the one labeled "a" >> to the one labeled "b"). >> > > Isn't that the standard way slices are supposed to be used though? > Instead of integers Panda is allowing strings. How would Pandas use the > new feature? > I think Nathaniel is saying that pandas is already using string slices in an appropriate way (rather than abusing them), and so if this was just syntax sugar they wouldn't be able to use the new syntax for new functionality (since you couldn't distinguish the two). It would be possible to make both approaches "work" by having an object that had all of .start, .stop, .step, .key and .value (and trying .key/.value first), but IMO that's going too far - I'd rather have a separate object with just .key and .value to test for. Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.borini at ferrara.linux.it Fri Jul 4 23:41:44 2014 From: stefano.borini at ferrara.linux.it (Stefano Borini) Date: Fri, 04 Jul 2014 23:41:44 +0200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <53B7179D.8050101@stoneleaf.us> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> <20140704182018.GA30712@phdru.name> <20140704183322.GC18583@ferrara.linux.it> <20140704183424.GD18583@ferrara.linux.it> <20140704184056.GA24625@ferrara.linux.it> <53B7179D.8050101@stoneleaf.us> Message-ID: <53B71F98.6010309@ferrara.linux.it> On 7/4/14 11:07 PM, Ethan Furman wrote: > Isn't that the standard way slices are supposed to be used though? > Instead of integers Panda is allowing strings. How would Pandas use the > new feature? It would not. Pandas is using it to use labels as indexes. adding keywords would allow to name the axes. These are two completely different use cases. For example, one could have a table containing the temperature with the city on one axis and the time on the other axis. So one could have temperature["London", 12] Pandas would have text indexes for "London", "New York", "Chicago" and so on. One could say temperature["London":"Chicago", 12] to get the temperature of the cities between "London" and "Chicago" at noon. 
The PEP would allow instead to name the axes in the query temperature[city="London":"Chicago", hour=12] From greg.ewing at canterbury.ac.nz Sat Jul 5 01:05:22 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 05 Jul 2014 11:05:22 +1200 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704181051.GB18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> Message-ID: <53B73332.8020505@canterbury.ac.nz> Stefano Borini wrote: > Strategy 4: Strict dictionary > ----------------------------- > > in presence of keyword arguments, the passed entity is a dictionary and all > labels must be specified. This wouldn't solve the OP's problem, because he apparently needs to preserve the order of the keywords. I don't really understand what he's trying to do, but labelling the axes doesn't seem to be it, or at least not just that. -- Greg From paultag at gmail.com Sat Jul 5 02:59:30 2014 From: paultag at gmail.com (Paul Tagliamonte) Date: Fri, 4 Jul 2014 20:59:30 -0400 Subject: [Python-ideas] lazy tuple unpacking Message-ID: <20140705005930.GA7612@leliel.pault.ag> Given: >>> def g_range(n): ... for y in range(n): ... yield y ... I notice that: >>> a, b, c, *others = g_range(100) Works great. Super useful stuff there. Looks good. I also notice that this causes *others to consume the generator in a greedy way. >>> type(others) And this makes me sad. >>> a, b, c, *others = g_range(10000000000) # will also make your machine very sad. Eventually resulting # (ok, unless you've got a really fancy bit of kit) in: Killed Really, the behavior (I think) should be more similar to: >>> _x = g_range(1000000000000) >>> a = next(_x) >>> b = next(_x) >>> c = next(_x) >>> others = _x >>> Of course, this leads to all sorts of fun errors, like the fact you couldn't iterate over it twice. This might not be expected. However, it might be nice to have this behavior when you're unpacking a generator. Thoughts? Paul -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: Digital signature URL: From pyideas at rebertia.com Sat Jul 5 03:10:44 2014 From: pyideas at rebertia.com (Chris Rebert) Date: Fri, 4 Jul 2014 18:10:44 -0700 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: <20140705005930.GA7612@leliel.pault.ag> References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: On Fri, Jul 4, 2014 at 5:59 PM, Paul Tagliamonte wrote: > I notice that: > > >>> a, b, c, *others = g_range(100) > > Works great. Super useful stuff there. Looks good. > > I also notice that this causes *others to consume the generator > in a greedy way. > > >>> type(others) > > > And this makes me sad. > > >>> a, b, c, *others = g_range(10000000000) > # will also make your machine very sad. Eventually resulting > # (ok, unless you've got a really fancy bit of kit) in: > Killed > > Really, the behavior (I think) should be more similar to: > > >>> _x = g_range(1000000000000) > >>> a = next(_x) > >>> b = next(_x) > >>> c = next(_x) > >>> others = _x > >>> > > > Of course, this leads to all sorts of fun errors, like the fact you > couldn't iterate over it twice. This might not be expected. However, it > might be nice to have this behavior when you're unpacking a generator. > > Thoughts? 
It would mean an (IMHO undesirable) loss of consistency/symmetry, type-wise, with other unpackings where this generator optimization isn't possible: Python 3.4.1 (default, May 19 2014, 13:10:29) >>> x = [1,2,3,4,5,6,7] >>> a,b,*c,d = x >>> c [3, 4, 5, 6] >>> *e,f,g = x >>> e [1, 2, 3, 4, 5] Cheers, Chris From paultag at gmail.com Sat Jul 5 03:12:49 2014 From: paultag at gmail.com (Paul Tagliamonte) Date: Fri, 4 Jul 2014 21:12:49 -0400 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: <20140705011249.GA9902@leliel.pault.ag> On Fri, Jul 04, 2014 at 06:10:44PM -0700, Chris Rebert wrote: > It would mean an (IMHO undesirable) loss of consistency/symmetry, > type-wise, with other unpackings where this generator optimization > isn't possible: > > Python 3.4.1 (default, May 19 2014, 13:10:29) > >>> x = [1,2,3,4,5,6,7] > >>> a,b,*c,d = x > >>> c > [3, 4, 5, 6] > >>> *e,f,g = x > >>> e > [1, 2, 3, 4, 5] > > Cheers, > Chris Euch, good point. This feature might just be DOA. Thanks! Paul -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: Digital signature URL: From graffatcolmingov at gmail.com Sat Jul 5 03:19:01 2014 From: graffatcolmingov at gmail.com (Ian Cordasco) Date: Fri, 4 Jul 2014 20:19:01 -0500 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: <20140705005930.GA7612@leliel.pault.ag> References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: On Fri, Jul 4, 2014 at 7:59 PM, Paul Tagliamonte wrote: > Given: > > >>> def g_range(n): > ... for y in range(n): > ... yield y > ... > > I notice that: > > >>> a, b, c, *others = g_range(100) > > Works great. Super useful stuff there. Looks good. > > I also notice that this causes *others to consume the generator > in a greedy way. > > >>> type(others) > > > And this makes me sad. > > >>> a, b, c, *others = g_range(10000000000) > # will also make your machine very sad. Eventually resulting > # (ok, unless you've got a really fancy bit of kit) in: > Killed > > Really, the behavior (I think) should be more similar to: > > >>> _x = g_range(1000000000000) > >>> a = next(_x) > >>> b = next(_x) > >>> c = next(_x) > >>> others = _x > >>> > > > Of course, this leads to all sorts of fun errors, like the fact you > couldn't iterate over it twice. This might not be expected. However, it > might be nice to have this behavior when you're unpacking a generator. > > Thoughts? I agree that the behaviour is suboptimal, but as Chris already pointed out it would introduce a significant inconsistency in the API of unpacking. I'm struggling to see a *good* way of doing this. My first instinct was that we could make something like this do what you expect: >>> a, b, c, others = g_range(some_really_big_number) >>> others But this doesn't work like this currently because Python currently raises a ValueError because there were too many values to unpack. I'm also against introducing some new syntax to add the behaviour. 
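For reference, the non-greedy behaviour discussed in this thread can already be spelled with itertools.islice (the basis of the itertools "take" recipe); a small sketch reusing the g_range generator from the original post:

    from itertools import islice

    def g_range(n):
        for y in range(n):
            yield y

    gen = g_range(10**12)
    a, b, c = islice(gen, 3)  # consume only the first three values
    others = gen              # the rest of the generator, untouched and still lazy
    print(a, b, c, next(others))  # 0 1 2 3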
From bruce at leapyear.org Sat Jul 5 05:16:59 2014 From: bruce at leapyear.org (Bruce Leban) Date: Fri, 4 Jul 2014 20:16:59 -0700 Subject: [Python-ideas] PEP pre-draft: Support for indexing with keyword arguments In-Reply-To: <20140704181051.GB18583@ferrara.linux.it> References: <53B33800.1030300@ferrara.linux.it> <20140704181051.GB18583@ferrara.linux.it> Message-ID: On Fri, Jul 4, 2014 at 11:10 AM, Stefano Borini < stefano.borini at ferrara.linux.it> wrote: > On Wed, Jul 02, 2014 at 12:36:48AM +0200, Stefano Borini wrote: > > https://github.com/stefanoborini/pep-keyword/blob/master/PEP-XXX.txt > > Strategy 4: Strict dictionary > ----------------------------- > > This strategy accepts that __getitem__ is special in accepting only one > object, > and the nature of that object must be non-ambiguous in its specification > of the > axes: it can be either by order, or by name. As a result of this > assumption, > in presence of keyword arguments, the passed entity is a dictionary and all > labels must be specified. > The result that "all labels must be specified" does not follow from that assumption that the object must be unambiguous. Numbers are not valid keyword names but are perfectly useful as index values. See below. Note that I am not advocating for/against strategy 4, just commenting on it. > > C0. a[1]; a[1,2] -> idx = 1; idx=(1, 2) > C1. a[Z=3] -> idx = {"Z": 3} > C2. a[Z=3, R=4] -> idx = {"Z"=3, "R"=4} > C3. a[1, Z=3] -> {0: 1, "Z": 3} > C4. a[1, Z=3, R=4] -> {0: 1, "Z": 3, "R": 4} C5. a[1, 2, Z=3] -> {0: 1, 1: 2, "Z": 3} > C6. a[1, 2, Z=3, R=4] -> {0: 1, 1: 2, "Z": 3, "R": 4} C7. a[1, Z=3, 2, R=4] -> raise SyntaxError > > Note that idx[0] would have the same value it would have in the normal __getitem__ call while in all cases above idx[3] would raise an exception. It would not be the case that a[1,2] and a[x=1,y=2] would be interchangeable as they would for function calls. That would still have to be handled by the __getitem__ function itself. But it's fairly easy to write a function that does that: def extract_indexes(idx, args): # args is a list of tuples either (key, default) or (key,) if no default result = [] for i, arg in zip(itertools.count(), args): if i in idx and arg[0] in idx: raise IndexError result.append(idx[i] if i in idx else idx[arg[0]] if arg[0] in idx else arg[1]) return result This raises IndexError if a key value is specified both positionally and by name or if a missing key value does not have a default. It should also (but does not) raise IndexError when idx contains extra keys not listed in args. It also doesn't support unnamed (positional only) indexes. Neither of those is difficult to add. --- Bruce Learn how hackers think: http://j.mp/gruyere-security https://www.linkedin.com/in/bruceleban -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Sat Jul 5 13:26:26 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sat, 5 Jul 2014 04:26:26 -0700 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: <1404559586.21453.YahooMailNeo@web181003.mail.ne1.yahoo.com> On Friday, July 4, 2014 6:11 PM, Chris Rebert wrote: > On Fri, Jul 4, 2014 at 5:59 PM, Paul Tagliamonte > wrote: > >> I notice that: >> >> ? ? >>> a, b, c, *others = g_range(100) >> >> Works great. Super useful stuff there. Looks good. >> >> I also notice that this causes *others to consume the generator >> in a greedy way. >> >> ? ? >>> type(others) >> ? ? 
>> >> And this makes me sad. >> >> ? ? >>> a, b, c, *others = g_range(10000000000) >> ? ? # will also make your machine very sad. Eventually resulting >> ? ? # (ok, unless you've got a really fancy bit of kit) in: >> ? ? Killed >> >> Really, the behavior (I think) should be more similar to: >> >> ? ? >>> _x = g_range(1000000000000) >> ? ? >>> a = next(_x) >> ? ? >>> b = next(_x) >> ? ? >>> c = next(_x) >> ? ? >>> others = _x >> ? ? >>> >> >> >> Of course, this leads to all sorts of fun errors, like the fact you >> couldn't iterate over it twice. This might not be expected. However, it >> might be nice to have this behavior when you're unpacking a generator. >> >> Thoughts? > > It would mean an (IMHO undesirable) loss of consistency/symmetry, > type-wise, with other unpackings where this generator optimization > isn't possible: > > Python 3.4.1 (default, May 19 2014, 13:10:29) >>>> x = [1,2,3,4,5,6,7] >>>> a,b,*c,d = x >>>> c > [3, 4, 5, 6] >>>> *e,f,g = x >>>> e > [1, 2, 3, 4, 5] When I was experimenting with adding lazy lists (sequences that wrap an iterator and request each value on first request), I played around with PyPy (Python 2.x, not 3.x), making unpacking (as well as map, filter, etc.) return them instead of iterators. (I also tried to make generator functions and expressions return them, but I couldn't get that to work in a quick hack?) It worked nicely, and I think it would work with the expanded unpacking in 3.x. In Paul's original example, others is a lazy list of 9999999997 elements, which will only evaluate the ones you actually ask for. In Chris's examples, c and e are fully-evaluated lazy lists of 4 or 3 elements, respectively. But, even if this weren't a ridiculously radical change to the language, I don't think it's what you'd want. First, an iterator over 9999999997 elements is a lot more useful than a lazy list of 9999999997 elements, because you can iterate the whole thing without running out of memory. Second, it wouldn't help in a case like this: ? ? a, b, *c, d, e = range(10000000000) To make that work, you'd need something smarter than just using a lazy list instead of an iterator. One way to solve it is to try to keep the original type, so unpacking maps to something like this: ? ? try: ? ? ? ? a, b, c, d, e = i[0], i[1], i[2:-2], i[-2], i[-1] ? ? except TypeError: ? ? ? ? # current behavior Then c ends up as range(2, 9999999998), which is the best possible thing you could get there. You could take this even further by adding either a notion of bidirectional and forward-only sequences, or a notion of reversible iterables, but that's getting much farther into left field; if anyone's interested, see http://stupidpythonideas.blogspot.com/2014/07/lazy-tuple-unpacking.html for details. From toddrjen at gmail.com Tue Jul 8 10:30:16 2014 From: toddrjen at gmail.com (Todd) Date: Tue, 8 Jul 2014 10:30:16 +0200 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: <20140705005930.GA7612@leliel.pault.ag> References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: On Sat, Jul 5, 2014 at 2:59 AM, Paul Tagliamonte wrote: > Given: > > >>> def g_range(n): > ... for y in range(n): > ... yield y > ... > > I notice that: > > >>> a, b, c, *others = g_range(100) > > Works great. Super useful stuff there. Looks good. > > I also notice that this causes *others to consume the generator > in a greedy way. > > >>> type(others) > > > And this makes me sad. > > >>> a, b, c, *others = g_range(10000000000) > # will also make your machine very sad. 
Eventually resulting > # (ok, unless you've got a really fancy bit of kit) in: > Killed > > Really, the behavior (I think) should be more similar to: > > >>> _x = g_range(1000000000000) > >>> a = next(_x) > >>> b = next(_x) > >>> c = next(_x) > >>> others = _x > >>> > > > Of course, this leads to all sorts of fun errors, like the fact you > couldn't iterate over it twice. This might not be expected. However, it > might be nice to have this behavior when you're unpacking a generator. > > Thoughts? > Paul > Besides the issues others have discussed, another issue I see here is that you are basically copying the iterator. In the case of this, where "gen_a" is a generator : >>> a, b, c, *others = gen_a "others" should be the same as "gen_a" in the end (i.e. "others is gen_a == True"). This seems redundant, especially when we have the itertools "take" recipe which can be used to retrieve the first "n" values of an iterator, which can then be unpacked in whatever way you want. However, there might be an alternative. You could have something where, if you are unpacking an iterable to N variables, you can tell it to just unpack the first N values, and the iterable then remains at position N (similar to "take", but integrated more deeply). For the case of something like a list or tuple, it will just unpack those variables and skip the rest. Maybe either a method like this: >>> a, b, c = gen_a.unpack() Or some sort of syntax to say that the remaining values should be skipped (although I don't really know what syntax would be good here, the syntax I am using here is probably not good): >>> a, b, c, [] = gen_a Of course with "take" so simple to implement, this is probably way overkill. I also don't know if it is even possible for the right side of the expression to know how the layout of the left in that way. -------------- next part -------------- An HTML attachment was scrubbed... URL: From paultag at gmail.com Tue Jul 8 18:17:37 2014 From: paultag at gmail.com (Paul Tagliamonte) Date: Tue, 8 Jul 2014 12:17:37 -0400 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: <20140708161737.GA13805@helios.pault.ag> On Tue, Jul 08, 2014 at 10:30:16AM +0200, Todd wrote: > Besides the issues others have discussed, another issue I see here is that > you are basically copying the iterator.? In the case of this, where > "gen_a" is a generator : > > >>> a, b, c, *others = gen_a > > "others" should be the same as "gen_a" in the end (i.e. "others is gen_a > == True").? This seems redundant, especially when we have the itertools > "take" recipe which can be used to retrieve the first "n" values of an > iterator, which can then be unpacked in whatever way you want. > > However, there might be an alternative.? You could have something where, > if you are unpacking an iterable to N variables, you can tell it to just > unpack the first N values, and the iterable then remains at position N > (similar to "take", but integrated more deeply).? For the case of > something like a list or tuple, it will just unpack those variables and > skip the rest.? Maybe either a method like this: > > >>> a, b, c = gen_a.unpack() > > Or some sort of syntax to say that the remaining values should be skipped > (although I don't really know what syntax would be good here, the syntax I > am using here is probably not good): > > >>> a, b, c, [] = gen_a > > Of course with "take" so simple to implement, this is probably way > overkill.? 
I also don't know if it is even possible for the right side of > the expression to know how the layout of the left in that way. Yeah, I think all the productive ideas (thanks, Andrew and Todd) to make this happen are mostly starting to converge on full-blown lazy lists, which is to say, generators which are indexable, sliceable, and work from both ends (which is to say: more than the current iterator protocol). I totally like the idea, not sure how keen everyone will be about it. I'm not sure I have the energy or drive to try and convince everyone on python-dev this is a good idea, but I'd totally love to play around with this. Anyone else? Cheers, Paul -- #define sizeof(x) rand() :wq -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: Digital signature URL: From abarnert at yahoo.com Tue Jul 8 19:41:09 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 8 Jul 2014 10:41:09 -0700 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: References: <20140705005930.GA7612@leliel.pault.ag> Message-ID: <1404841269.59215.YahooMailNeo@web181002.mail.ne1.yahoo.com> On Tuesday, July 8, 2014 1:31 AM, Todd wrote: >Besides the issues others have discussed, another issue I see here is that you are basically copying the iterator.? In the case of this, where "gen_a" is a generator : > >>>> a, b, c, *others = gen_a > >"others" should be the same as "gen_a" in the end (i.e. "others is gen_a == True").? This seems redundant, especially when we have the itertools "take" recipe which can be used to retrieve the first "n" values of an iterator, which can then be unpacked in whatever way you want. I don't think the issue he's trying to solve is that others is gen_a, but just that gen_a is not exhausted and copied into a list. If others were a wrapper around gen_a instead, I think that would solve all of the interesting use cases.?(But I don't want to put words in Paul Tagliamonte's mouth here, so make sure he confirms it before replying in depth?) >However, there might be an alternative.? You could have something where, if you are unpacking an iterable to N variables, you can tell it to just unpack the first N values, and the iterable then remains at position N (similar to "take", but integrated more deeply).? For the case of something like a list or tuple, it will just unpack those variables and skip the rest.? Maybe either a method like this: > >>>> a, b, c = gen_a.unpack() > >Or some sort of syntax to say that the remaining values should be skipped (although I don't really know what syntax would be good here, the syntax I am using here is probably not good): > > >>>> a, b, c, [] = gen_a I think the obvious way to write this is a bare *: >>> a, b, *, c, d = range(10) >>> a, b, c, d (0, 1, 8, 9) >Of course with "take" so simple to implement, this is probably way overkill.? I also don't know if it is even possible for the right side of the expression to know how the layout of the left in that way. There was a thread a few months back about enhancing the unpacking protocol by asking the iterable itself to do the unpacking (possibly feeding it more information about what needs to be unpacked, something like an __unpack__(self, prestar_count, star_flag, poststar_count)), which would allow the flexibility you're looking for. I don't want to repeat someone else's use cases and arguments out of my own faulty memory; if you're interested in following up, search the python-ideas archive. 
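To make the shape of that idea a bit more concrete, a hypothetical __unpack__(self, prestar_count, star_flag, poststar_count) can be mimicked today as a plain helper function; the name and signature below are only the ones floated in that older thread, nothing that actually exists:

    from itertools import islice

    def unpack(iterable, prestar_count, star_flag, poststar_count=0):
        # Plain-function stand-in for the hypothetical protocol: take the
        # leading values eagerly, leave the starred remainder lazy.
        it = iter(iterable)
        head = list(islice(it, prestar_count))
        if len(head) < prestar_count:
            raise ValueError("not enough values to unpack")
        if not star_flag:
            return head
        # Handling poststar_count values *after* the star lazily is exactly
        # the hard part, so this sketch simply doesn't support it.
        return head + [it]

    a, b, c, others = unpack(range(10**12), 3, True)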
Anyway, your explicit method version could be written today. Obviously it wouldn't work on generators or other arbitrary iterators, but you could write a simple wrapper that takes an iterator and returns an iterator with an unpack method, which I think would be enough for experimenting with the feature. Meanwhile, in the case of a non-iterator iterable, what would you want to happen here? Should others end up as what's left over from the iterator created by iter(iterable)? In other words: >>> a, b, *others = [1, 2, 3, 4, 5] >>> others >>> next(others) 3 From abarnert at yahoo.com Tue Jul 8 20:25:36 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 8 Jul 2014 11:25:36 -0700 Subject: [Python-ideas] lazy tuple unpacking In-Reply-To: <20140708161737.GA13805@helios.pault.ag> References: <20140705005930.GA7612@leliel.pault.ag> <20140708161737.GA13805@helios.pault.ag> Message-ID: <1404843936.17303.YahooMailNeo@web181003.mail.ne1.yahoo.com> > On Tuesday, July 8, 2014 9:18 AM, Paul Tagliamonte wrote: > Yeah, I think all the productive ideas (thanks, Andrew and Todd) to make > this happen are mostly starting to converge on full-blown lazy lists, > which is to say, generators which are indexable, sliceable, and work from both > ends (which is to say: more than the current iterator protocol). I don't think that's necessarily true. First, I think the idea of just trying to index and slice the iterable and falling back to the current behavior is at least worth thinking about. It wouldn't solve the problem for iterators, but it would for your example (range), or any other kind of sequence that knows how to slice itself in a better way than making a list. And to take that farther, you don't necessarily need to replace iterators with lazy lists, just with some kind of sequence. A view is just as indexable, sliceable, and reusable as a lazy list, and better in many ways, and Python already has views like dict_keys built in, and NumPy already uses a similar idea for slicing. I believe either Guido or Nick has a writeup somewhere on why list slices being new lists rather than views is a good thing, not just a historical accident we're stuck with. But that doesn't mean that having views for many of the cases we use iterators for today (including a view comprehension, a viewtools library, etc.) would necessarily be a bad idea. And, as I mentioned, expanding the notion of sequence to include weaker notions of bidirectional-only sequence and forward-only sequence eliminates many of the other need for iterators (but not all?a general generator function obviously can't return a reusable forward-only sequence). If you're interest in more on this, see http://stupidpythonideas.blogspot.com/2014/07/lazy-tuple-unpacking.html and http://stupidpythonideas.blogspot.com/2014/07/swift-style-map-and-filter-views.html for some ideas. > I totally like the idea, not sure how keen everyone will be about it. > > I'm not sure I have the energy or drive to try and convince everyone on > python-dev this is a good idea, but I'd totally love to play around > with this. Anyone else? Before trying to convince anyone this is a good idea, first you want to build a lazy-list library: a lazy list type, lazy-list versions of map/filter/dropwhile, etc. It's actually pretty simple. A lazy list basically looks like this: ? ? class LazyList(collections.abc.Sequence): ? ? ? ? def __init__(self, iterable): ? ? ? ? ? ? self.lst, self.it = [], iter(iterable) ? ? ? ? def __getitem__(self, index): ? ? ? ? ? ? while index >= len(self.lst): ? 
? ? ? ? ? ? ? self.lst.append(next(self.it)) ? ? ? ? ? ? return self.lst[index] You have to add slice support, set/del, etc., but it's all pretty simply. The only tricky question is what to do about slicing, because you have a choice there. You could just loop over the slice and get/set/del each index, or you could return a new LazyList around islice(self), or you could do the latter if stop is None else the former. And then all the lazy functions just call the iterator function and wrap the result in a LazyList. The big problem with lazy lists is that once a value is instantiated, it stays around as long as the list does. So, if you use a lazy list as an iterable, you're basically building the whole list in memory. Iterators obviously don't have that problem. It's worth looking at Haskell and other lazy functional languages to see why they don't have that problem. Their lists are conses (singly-linked lists with tail sharing). So, making it lazy automatically means that if you just iterate L without keeping a reference, only one cons is around at a time, while if you keep a reference to L, the whole list is available in memory. That won't work for array-based lists like Python's, and I'm not sure how you'd solve that without completely changing the way iteration works in Python. (Of course you can easily implement cons lists in Python, and then make them lazy, but then they're not sequences?in particular, they're not indexable, and generally won't work with any typical Python algorithms that aren't already happy with iterators.) From abarnert at yahoo.com Thu Jul 17 21:53:05 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 12:53:05 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier Message-ID: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> tl;dr: readline and friends should take an optional sep parameter (which also means adding an iterlines method). Recently, I was trying to add -0 support to a command-line tool, which means that it reads filenames out of stdin and/or a text file with \0 separators instead of \n. This means that my code that looked like this: ? ? with open(path, encoding=sys.getfilesystemencoding()) as f: ? ? ? ? for filename in f: ? ? ? ? ? ? do_stuff(filename) ? turned into this (from memory, not the exact code): ? ? def resplit(chunks, sep): ? ? ? ? buf = b'' ? ? ? ? for chunk in chunks: ? ? ? ? ? ? parts = (buf+chunk).split(sep) ? ? ? ? ? ? yield from parts[:-1] ? ? ? ? ? ? buf = parts[-1] ? ? ? ? if buf: ? ? ? ? ? ? yield buf ? ? with open(path, 'rb') as f: ? ? ? ? chunks = iter(lambda: f.read(4096), b'') ? ? ? ? for line in resplit(chunks, b'\0'): ? ? ? ? ? ? filename = line.decode(sys.getfilesystemencoding()) ? ? ? ? ? ? do_stuff(filename) Besides being a lot more code (and involving things that a novice might have problems reading like that two-argument iter), this also means that the file pointer is way ahead of the line that's just been iterated, I'm inefficiently buffering everything twice, etc. The problem is that readline is hardcoded to look for b'\n' for binary files, smart-universal-newline-thingy for text files, there's no way to reuse its machinery if you want to look for something different, and there's no way to access the internals that it uses if you want to reimplement it. 
While it might be possible to fix the latter problems in some generic and flexible way, that doesn't seem all that useful; really, other than changing the way readline splits, I don't think anyone wants to hook anything else about file objects. (On the other hand, people might want to hook it in more complex ways?e.g., pass a separator function instead of a separator string? I'm probably reaching there?) If I'm right, all that's needed is an extra sep=None keyword-only parameter to readline and friends (where None means the existing newline behavior), along with an iterlines method that's identical to __iter__ except that it has room for that new parameter. One minor side problem: Sometimes you don't actually have a file, but some kind of file-like object. I realize that as 3.1 or so, this is supposed to mean it actually is an io.BufferedIOBase or etc., but there are still plenty of third-party modules that just demand and/or provide "something with read(size)" or the like. In fact, that's the case with the problem I ran into above; another feature uses a third-party module to provide file-like objects for members of all kinds of uncommon archive types, and unlike zipfile, that module wasn't changed to provide io subclasses when it was ported to 3.x. So, it might be worth having adapters that make it easier (or just possible?) to wrap such a thing in the actual io interfaces. (The existing wrappers aren't adapters?BufferedReader demands readinto(buf), not read(size); TextIOWrapper can only wrap a BufferedIOBase.) But that's really a separate issue (and the answer to that one may just be to hold firm with the "file-like object means IOBase" and eventually every library you care about will work that way, even if you occasionally have to fix it yourself). From guido at python.org Thu Jul 17 22:48:28 2014 From: guido at python.org (Guido van Rossum) Date: Thu, 17 Jul 2014 13:48:28 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?) I don't think it is reasonable to add a new parameter to readline(), because streams are widely implemented using duck typing -- every implementation would have to be updated to support this. On Thu, Jul 17, 2014 at 12:53 PM, Andrew Barnert < abarnert at yahoo.com.dmarc.invalid> wrote: > tl;dr: readline and friends should take an optional sep parameter (which > also means adding an iterlines method). > > Recently, I was trying to add -0 support to a command-line tool, which > means that it reads filenames out of stdin and/or a text file with \0 > separators instead of \n. > > This means that my code that looked like this: > > with open(path, encoding=sys.getfilesystemencoding()) as f: > for filename in f: > do_stuff(filename) > > ? 
turned into this (from memory, not the exact code): > > def resplit(chunks, sep): > buf = b'' > for chunk in chunks: > parts = (buf+chunk).split(sep) > > yield from parts[:-1] > buf = parts[-1] > if buf: > yield buf > > with open(path, 'rb') as f: > chunks = iter(lambda: f.read(4096), b'') > for line in resplit(chunks, b'\0'): > filename = line.decode(sys.getfilesystemencoding()) > do_stuff(filename) > > Besides being a lot more code (and involving things that a novice might > have problems reading like that two-argument iter), this also means that > the file pointer is way ahead of the line that's just been iterated, I'm > inefficiently buffering everything twice, etc. > > The problem is that readline is hardcoded to look for b'\n' for binary > files, smart-universal-newline-thingy for text files, there's no way to > reuse its machinery if you want to look for something different, and > there's no way to access the internals that it uses if you want to > reimplement it. > > While it might be possible to fix the latter problems in some generic and > flexible way, that doesn't seem all that useful; really, other than > changing the way readline splits, I don't think anyone wants to hook > anything else about file objects. (On the other hand, people might want to > hook it in more complex ways?e.g., pass a separator function instead of a > separator string? I'm probably reaching there?) > > If I'm right, all that's needed is an extra sep=None keyword-only > parameter to readline and friends (where None means the existing newline > behavior), along with an iterlines method that's identical to __iter__ > except that it has room for that new parameter. > > One minor side problem: Sometimes you don't actually have a file, but some > kind of file-like object. I realize that as 3.1 or so, this is supposed to > mean it actually is an io.BufferedIOBase or etc., but there are still > plenty of third-party modules that just demand and/or provide "something > with read(size)" or the like. In fact, that's the case with the problem I > ran into above; another feature uses a third-party module to provide > file-like objects for members of all kinds of uncommon archive types, and > unlike zipfile, that module wasn't changed to provide io subclasses when it > was ported to 3.x. So, it might be worth having adapters that make it > easier (or just possible?) to wrap such a thing in the actual io > interfaces. (The existing wrappers aren't adapters?BufferedReader demands > readinto(buf), not read(size); TextIOWrapper can only wrap a > BufferedIOBase.) But that's really a separate issue (and the answer to that > one may just be to hold firm > with the "file-like object means IOBase" and eventually every library you > care about will work that way, even if you occasionally have to fix it > yourself). > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From python at 2sn.net Thu Jul 17 23:39:42 2014 From: python at 2sn.net (Alexander Heger) Date: Fri, 18 Jul 2014 07:39:42 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: > I don't think it is reasonable to add a new parameter to readline(), because > streams are widely implemented using duck typing -- every implementation > would have to be updated to support this. Could the "split" (or splitline) keyword-only parameter instead be passed to the open function (and the __init__ of IOBase and be stored there)? From abarnert at yahoo.com Thu Jul 17 23:59:29 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 14:59:29 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <1405634369.86826.YahooMailNeo@web181006.mail.ne1.yahoo.com> On Thursday, July 17, 2014 1:48 PM, Guido van Rossum wrote: >I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?) Good question about the where. The resplit function seems like it could be of more general use than just this case, but I'm not sure where it belongs. Maybe itertools? The?iter(lambda: f.read(bufsize), b'') part seems too trivial to put anywhere, even just as an example in the docs?but given that it probably looks like a magic incantation to anyone who's a Python novice (even if they're a C or JS or whatever expert), maybe it is worth putting somewhere. Maybe io.iterchunks(f, 4096)? If so, the combination of the two into something like iterlines(f, b'\0') seems like it should go right alongside iterchunks. However? >I don't think it is reasonable to add a new parameter to readline() The problem is that my code has significant problems for many use cases, and I don't think they can be solved. Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc. Maybe if we had more powerful adapters or wrappers so I could just say "here's a pre-existing buffer plus a text-file-like object, now wrap that up as a real TextIOBase for me" it would be possible to write something that worked from outside without these problems, but as things stand, I don't see an answer. Maybe put resplit in the stdlib, then just give iterlines as a 2-liner example (in the itertools recipes, or the file-I/O section of the tutorial?) where all these problems can be raised and not answered? 
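Spelled out, those helpers might look roughly like this; iterchunks and iterlines are just the names floated above, not existing stdlib functions:

    def iterchunks(f, bufsize=4096):
        # Yield successive read()s of a binary file until EOF.
        return iter(lambda: f.read(bufsize), b'')

    def resplit(chunks, sep):
        # Re-split an iterable of byte chunks on an arbitrary separator.
        buf = b''
        for chunk in chunks:
            parts = (buf + chunk).split(sep)
            yield from parts[:-1]
            buf = parts[-1]
        if buf:
            yield buf

    def iterlines(f, sep, bufsize=4096):
        # The "2-liner": chunk the file, then re-split on sep.
        return resplit(iterchunks(f, bufsize), sep)

With these, the original example becomes a single loop over iterlines(f, b'\0'), with all the caveats about the file pointer and double buffering still applying.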
From abarnert at yahoo.com Fri Jul 18 00:21:25 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 15:21:25 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> > On Thursday, July 17, 2014 2:40 PM, Alexander Heger wrote: > >> I don't think it is reasonable to add a new parameter to readline(), > because >> streams are widely implemented using duck typing -- every implementation >> would have to be updated to support this. > > Could the "split" (or splitline) keyword-only parameter instead be > passed to the open function (and the __init__ of IOBase and be stored > there)? Good idea. It's less powerful/flexible, but probably good enough for almost all use cases. (I can't think of any file where I'd need to split part of it on \0 and the rest on \n?) Also, it means you can stick with the normal __iter__ instead of needing a separate iterlines method. And, since open/__init__/etc. isn't part of the protocol, it's perfectly fine for the builtin open, etc., to be an example or template that's generally worth following if there's no good reason not to do so, rather than a requirement that must be followed. So, if I'm getting file-like objects handed to me by some third-party library or plugin API or whatever, and I need them to be \0-separated, in many cases the problems with resplit won't be an issue so I can just use it as a workaround, and in the remaining cases, I can request that the library/app/whatever add the sep parameter to the next iteration of the API. So, I retract my original suggestion in favor of this one. And, separately, Guido's idea of adding the helpers (or at least resplit, plus documentation on how to write the other stuff) to the stdlib somewhere. Thanks. From guido at python.org Fri Jul 18 00:37:58 2014 From: guido at python.org (Guido van Rossum) Date: Thu, 17 Jul 2014 15:37:58 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405634369.86826.YahooMailNeo@web181006.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405634369.86826.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert < abarnert at yahoo.com.dmarc.invalid> wrote: > On Thursday, July 17, 2014 1:48 PM, Guido van Rossum > wrote: > > > >I think it's fine to add something to stdlib that encapsulates your > example. (TBD: where?) > > Good question about the where. > > The resplit function seems like it could be of more general use than just > this case, but I'm not sure where it belongs. Maybe itertools? > > The iter(lambda: f.read(bufsize), b'') part seems too trivial to put > anywhere, even just as an example in the docs?but given that it probably > looks like a magic incantation to anyone who's a Python novice (even if > they're a C or JS or whatever expert), maybe it is worth putting somewhere. > Maybe io.iterchunks(f, 4096)? > > If so, the combination of the two into something like iterlines(f, b'\0') > seems like it should go right alongside iterchunks. > > > However? > > > >I don't think it is reasonable to add a new parameter to readline() > > The problem is that my code has significant problems for many use cases, > and I don't think they can be solved. 
> > Calling readline (or iterating the file) uses the underlying buffer (and > stream decoder, for text files), keeps the file pointer in the same place, > etc. My code doesn't, and no external code can. So, besides being less > efficient, it leaves the file pointer in the wrong place (imagine using it > to parse an RFC822 header then read() the body), doesn't properly decode > files where the separator can be ambiguous with other bytes (try separating > on '\0' in a UTF-16 file), etc. > You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object. This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation). Maybe if we had more powerful adapters or wrappers so I could just say > "here's a pre-existing buffer plus a text-file-like object, now wrap that > up as a real TextIOBase for me" it would be possible to write something > that worked from outside without these problems, but as things stand, I > don't see an answer. > You probably have to do a separate wrapper for text streams, the types and buffering implementation are just too different. > Maybe put resplit in the stdlib, then just give iterlines as a 2-liner > example (in the itertools recipes, or the file-I/O section of the > tutorial?) where all these problems can be raised and not answered? > (Sorry, in a hurry / terribly distracted.) -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Fri Jul 18 02:04:00 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 17:04:00 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert wrote: >? On Thursday, July 17, 2014 2:40 PM, Alexander Heger wrote: >>? Could the "split" (or splitline) keyword-only >> parameter instead be?passed to the open function? >> (and the __init__ of IOBase and be stored?there)? > > Good idea. It's less powerful/flexible, but probably > good enough for almost all use cases. (I can't think > of any file where I'd need to split part of it on \0 > and the rest on \n?) Also, it means you can stick with > the normal __iter__ instead of needing a separate > iterlines method. It turns out to be even simpler than I expected. I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO. For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it. (Of course you'd also want to add it to all of the stdlib cases like zipfile.ZipFile.open/zipfile.ExtZipFile.__init__, but there aren't too many of those.) 
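As a point of comparison, Guido's wrapper suggestion can be sketched for the binary side using nothing but the public peek()/read() interface of a BufferedReader; the class below is an illustration only, and it assumes a one-byte separator to stay short:

    import io

    class ReadUntilReader(io.BufferedIOBase):
        # Wrap an existing buffered reader and add readuntil(); any further
        # reads should go through this wrapper too.
        def __init__(self, buffered):
            self._buffered = buffered

        def readable(self):
            return True

        def read(self, size=-1):
            return self._buffered.read(size)

        def readuntil(self, sep=b'\n'):
            chunks = []
            while True:
                peeked = self._buffered.peek(1)
                if not peeked:
                    break                      # EOF: return what we have
                i = peeked.find(sep)
                if i >= 0:
                    chunks.append(self._buffered.read(i + len(sep)))
                    break
                chunks.append(self._buffered.read(len(peeked)))
            return b''.join(chunks)

None of that changes the patch described above, where the text and binary layers each handle the newline argument separately.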
This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline. I think that's a good thing ('\r\n' and '\r' would need exceptions for backward compatibility; '\0'.encode('utf-16-le') isn't a very useful thing to split on; etc.), but doing it the other way is almost as easy, and very little code will never care. From steve at pearwood.info Fri Jul 18 05:21:00 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 18 Jul 2014 13:21:00 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <20140718032100.GH9112@ando> On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote: > It turns out to be even simpler than I expected. > > I reused the "newline" parameter of open and TextIOWrapper.__init__, > adding a param of the same name to the constructors for > BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and > FileIO. > > For text files, just remove the check for newline being one of the > standard values and it all works. For binary files, remove the check > for truthy, make open pass each Buffered* constructor newline=(newline > if binary else None), make each Buffered* class store it, and change > two lines in RawIOBase.readline to use it. And that's it. All the words are in English, but I have no idea what you're actually saying... :-) You seem to be talking about the implementation of the change, but what is the interface? Having made all these changes, how does it effect Python code? You have a use-case of splitting on something other than the standard newlines, so how does one do that? E.g. suppose I have a file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line character. How would I iterate over lines in this file? > This means that the buffer underlying a text file with a non-standard > newline doesn't automatically have a matching newline. I don't understand what you mean by this. -- Steven From rosuav at gmail.com Fri Jul 18 05:36:17 2014 From: rosuav at gmail.com (Chris Angelico) Date: Fri, 18 Jul 2014 13:36:17 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <20140718032100.GH9112@ando> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140718032100.GH9112@ando> Message-ID: On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano wrote: > On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote: > >> It turns out to be even simpler than I expected. >> >> I reused the "newline" parameter of open and TextIOWrapper.__init__, >> adding a param of the same name to the constructors for >> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and >> FileIO. >> >> For text files, just remove the check for newline being one of the >> standard values and it all works. For binary files, remove the check >> for truthy, make open pass each Buffered* constructor newline=(newline >> if binary else None), make each Buffered* class store it, and change >> two lines in RawIOBase.readline to use it. And that's it. 
> > All the words are in English, but I have no idea what you're actually > saying... :-) > > You seem to be talking about the implementation of the change, but what > is the interface? Having made all these changes, how does it effect > Python code? You have a use-case of splitting on something other than > the standard newlines, so how does one do that? E.g. suppose I have a > file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line > character. How would I iterate over lines in this file? The way I understand it is this: for line in open("spam.txt", newline="\u0085"): process(line) If that's the case, I would be strongly in favour of this. Nice and clean, and should break nothing; there'll be special cases for newline=None and newline='', and the only change is that, instead of a small number of permitted values ('\n', '\r', '\r\n'), any string (or maybe any one-character string plus '\r\n'?) would be permitted. Effectively, it's not "iterate over this file, divided by \0 instead of newlines", but it's "this file uses the unusual encoding of newline=\0, now iterate over lines in the file". Seems a smart way to do it IMO. ChrisA From abarnert at yahoo.com Fri Jul 18 06:18:08 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 21:18:08 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <20140718032100.GH9112@ando> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140718032100.GH9112@ando> Message-ID: <6C155609-776E-482D-954C-DF5F1D2AD962@yahoo.com> On Jul 17, 2014, at 20:21, Steven D'Aprano wrote: > On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote: > >> It turns out to be even simpler than I expected. >> >> I reused the "newline" parameter of open and TextIOWrapper.__init__, >> adding a param of the same name to the constructors for >> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and >> FileIO. >> >> For text files, just remove the check for newline being one of the >> standard values and it all works. For binary files, remove the check >> for truthy, make open pass each Buffered* constructor newline=(newline >> if binary else None), make each Buffered* class store it, and change >> two lines in RawIOBase.readline to use it. And that's it. > > All the words are in English, but I have no idea what you're actually > saying... :-) > > You seem to be talking about the implementation of the change, but what > is the interface? "I reused the newline parameter." My mistake was assuming that was so simple, nothing else needed to be said. But that only works if everyone went back and completely read the previous suggestions, which I realize nobody had any good reason to do. Basically, the only change to the API is that it's no longer an error to pass arbitrary strings (or bytes, for binary mode) for newlines. The rules for how "\0" are handled are identical to the rules for "\r". There's almost nothing else to explain, but not quite--so, like an idiot, I dove into the minor nits in detail, skipping over the main point. > Having made all these changes, how does it effect > Python code? Existing legal code does not change at all. Some code that used to be an error now does something useful (see below). > You have a use-case of splitting on something other than > the standard newlines, so how does one do that? E.g. 
suppose I have a > file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line > character. How would I iterate over lines in this file? with open("spam.txt", newline="\u0085") as f: for line in f: process(line) >> This means that the buffer underlying a text file with a non-standard >> newline doesn't automatically have a matching newline. > > I don't understand what you mean by this. If you write this: with open("spam.txt", newline="\u0085") as f: for line in f.buffer: The bytes you get back will be split on b"\n", not on "\u0085".encode(locale.getdefaultencoding()). The newlines applies only to the text file, not its underlying binary buffer. (This is exactly the same as the current behavior--if you open a file with newline='\r' in 3.4 then iterate f.buffer, it's still going to split on b'\n', not b'\r'.) From abarnert at yahoo.com Fri Jul 18 06:23:05 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 21:23:05 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140718032100.GH9112@ando> Message-ID: <9202C31C-8D17-47CB-AB24-EB6DA7CA4553@yahoo.com> On Jul 17, 2014, at 20:36, Chris Angelico wrote: > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano wrote: >> You seem to be talking about the implementation of the change, but what >> is the interface? Having made all these changes, how does it effect >> Python code? You have a use-case of splitting on something other than >> the standard newlines, so how does one do that? E.g. suppose I have a >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line >> character. How would I iterate over lines in this file? > > The way I understand it is this: > > for line in open("spam.txt", newline="\u0085"): > process(line) > > If that's the case, I would be strongly in favour of this. Nice and > clean, and should break nothing; there'll be special cases for > newline=None and newline='', and the only change is that, instead of a > small number of permitted values ('\n', '\r', '\r\n'), any string (or > maybe any one-character string plus '\r\n'?) would be permitted. > > Effectively, it's not "iterate over this file, divided by \0 instead > of newlines", but it's "this file uses the unusual encoding of > newline=\0, now iterate over lines in the file". Seems a smart way to > do it IMO. Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea. (Apologies for overestimating the obviousness of that.) From abarnert at yahoo.com Fri Jul 18 06:40:11 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 21:40:11 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405634369.86826.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <03610AED-BF17-42BB-B62D-6D8E007B48EB@yahoo.com> On Jul 17, 2014, at 15:37, Guido van Rossum wrote: > On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert wrote: >> >I don't think it is reasonable to add a new parameter to readline() >> >> The problem is that my code has significant problems for many use cases, and I don't think they can be solved. 
>> >> Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc. > > You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object. > > This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation). [snip] > You probably have to do a separate wrapper for text streams, the types and buffering implementation are just too different. The problem isn't needing two separate wrappers, it's that the text wrapper if effectively impossible. For binary files, MyBufferedReader.readuntil is a slightly modified version of _pyio.RawIOBase.readline, which only needs to access the public interface of io.BufferedReader (peek and read). For text files, however, it needs to access private information from TextIOWrapper that isn't exposed from C to Python. And, unlike BufferedReader, TextIOWrapper has no way to peek ahead, or push data back onto the buffer, or anything else usable as a workaround, so even if you wanted to try to take care of the decoding state problems manually, you can't, except by reading one character at a time. There are also some minor problems even for binary files (e.g., MyBufferedReader(f.raw) has a different file position from f, so if you switch between them you'll end up skipping part of the file), but these won't affect most use cases; the text file problem is the big one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Fri Jul 18 06:47:06 2014 From: guido at python.org (Guido van Rossum) Date: Thu, 17 Jul 2014 21:47:06 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <9202C31C-8D17-47CB-AB24-EB6DA7CA4553@yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140718032100.GH9112@ando> <9202C31C-8D17-47CB-AB24-EB6DA7CA4553@yahoo.com> Message-ID: Well, I had to look up the newline option for open(), even though I probably invented it. :-) Would it still apply only to text files? On Thursday, July 17, 2014, Andrew Barnert wrote: > On Jul 17, 2014, at 20:36, Chris Angelico > > wrote: > > > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano > wrote: > >> You seem to be talking about the implementation of the change, but what > >> is the interface? Having made all these changes, how does it effect > >> Python code? You have a use-case of splitting on something other than > >> the standard newlines, so how does one do that? E.g. suppose I have a > >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line > >> character. How would I iterate over lines in this file? 
> > > > The way I understand it is this: > > > > for line in open("spam.txt", newline="\u0085"): > > process(line) > > > > If that's the case, I would be strongly in favour of this. Nice and > > clean, and should break nothing; there'll be special cases for > > newline=None and newline='', and the only change is that, instead of a > > small number of permitted values ('\n', '\r', '\r\n'), any string (or > > maybe any one-character string plus '\r\n'?) would be permitted. > > > > Effectively, it's not "iterate over this file, divided by \0 instead > > of newlines", but it's "this file uses the unusual encoding of > > newline=\0, now iterate over lines in the file". Seems a smart way to > > do it IMO. > > Exactly. As soon as Alexander suggested it, I immediately knew it was much > better than my original idea. > > (Apologies for overestimating the obviousness of that.) > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (on iPad) -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Fri Jul 18 08:26:28 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Thu, 17 Jul 2014 23:26:28 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140718032100.GH9112@ando> <9202C31C-8D17-47CB-AB24-EB6DA7CA4553@yahoo.com> Message-ID: <92919CA1-2052-4B07-97D9-1D8A0757F117@yahoo.com> On Jul 17, 2014, at 21:47, Guido van Rossum wrote: > Well, I had to look up the newline option for open(), even though I probably invented it. :-) While we're at it, I think most places in the documentation and docstrings that refer to the parameter, except open itself, call it newlines (e.g., io.IOBase.readline), and as far as I can tell it's been like that from day one, which shows just how much people pay attention to the current feature. :) > Would it still apply only to text files? I think it makes sense to apply to binary files as well. Splitting binary files on \0 (or, for that matter, \r\n...) is probably at least as common a use case as text files. Obviously the special treatment for "" (as a universal-newline-behavior flag) wouldn't carry over to b"" (which might as well just be an error, although I suppose it could also mean to split on every byte, as with bytes.split?). Also, I'm not sure if the write behavior (replace terminal "\n" with newline) should carry over from text to binary, or just ignore newline on write. Binary files don't need the special-casing for b"" (with text files, that's more a universal-newlines flag than a newline value), and I'm not sure if they need the write behavior or only the read behavior. > On Thursday, July 17, 2014, Andrew Barnert wrote: >> On Jul 17, 2014, at 20:36, Chris Angelico wrote: >> >> > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano wrote: >> >> You seem to be talking about the implementation of the change, but what >> >> is the interface? Having made all these changes, how does it effect >> >> Python code? You have a use-case of splitting on something other than >> >> the standard newlines, so how does one do that? E.g. 
suppose I have a >> >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line >> >> character. How would I iterate over lines in this file? >> > >> > The way I understand it is this: >> > >> > for line in open("spam.txt", newline="\u0085"): >> > process(line) >> > >> > If that's the case, I would be strongly in favour of this. Nice and >> > clean, and should break nothing; there'll be special cases for >> > newline=None and newline='', and the only change is that, instead of a >> > small number of permitted values ('\n', '\r', '\r\n'), any string (or >> > maybe any one-character string plus '\r\n'?) would be permitted. >> > >> > Effectively, it's not "iterate over this file, divided by \0 instead >> > of newlines", but it's "this file uses the unusual encoding of >> > newline=\0, now iterate over lines in the file". Seems a smart way to >> > do it IMO. >> >> Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea. >> >> (Apologies for overestimating the obviousness of that.) >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ > > > -- > --Guido van Rossum (on iPad) > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From wolfgang.maier at biologie.uni-freiburg.de Fri Jul 18 13:53:48 2014 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Fri, 18 Jul 2014 13:53:48 +0200 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On 07/18/2014 02:04 AM, Andrew Barnert wrote: > On Thursday, July 17, 2014 3:21 PM, Andrew Barnert wrote: > > > >> On Thursday, July 17, 2014 2:40 PM, Alexander Heger wrote: > >>> Could the "split" (or splitline) keyword-only >>> parameter instead be passed to the open function >>> (and the __init__ of IOBase and be stored there)? >> >> Good idea. It's less powerful/flexible, but probably >> good enough for almost all use cases. (I can't think >> of any file where I'd need to split part of it on \0 >> and the rest on \n?) Also, it means you can stick with >> the normal __iter__ instead of needing a separate >> iterlines method. > > It turns out to be even simpler than I expected. > > I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO. > > For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it. > You are not the first one to come up with this idea and suggesting solutions. 
This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade: http://bugs.python.org/issue1152248 Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan: http://bugs.python.org/issue1152248#msg109117 Not that I wouldn't like to see this feature to be shipping with Python, but it may help to read through all aspects of the problem that have been discussed before. Best, Wolfgang From abarnert at yahoo.com Fri Jul 18 18:43:26 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 18 Jul 2014 09:43:26 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those? Pushing back is easier to implement (since it's already there as a private method), but a bit funky, and peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it. While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.) On Jul 18, 2014, at 4:53, Wolfgang Maier wrote: > On 07/18/2014 02:04 AM, Andrew Barnert wrote: >> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert wrote: >> >> >> >>> On Thursday, July 17, 2014 2:40 PM, Alexander Heger wrote: >> >>>> Could the "split" (or splitline) keyword-only >>>> parameter instead be passed to the open function >>>> (and the __init__ of IOBase and be stored there)? >>> >>> Good idea. It's less powerful/flexible, but probably >>> good enough for almost all use cases. (I can't think >>> of any file where I'd need to split part of it on \0 >>> and the rest on \n?) Also, it means you can stick with >>> the normal __iter__ instead of needing a separate >>> iterlines method. >> >> It turns out to be even simpler than I expected. >> >> I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO. >> >> For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it. > > You are not the first one to come up with this idea and suggesting solutions. 
This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade: > > http://bugs.python.org/issue1152248 > > Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan: > > http://bugs.python.org/issue1152248#msg109117 Thanks. Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is the basically the same as the text half of my patch. The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc.--all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code, which reinforces my belief that Alexander's idea of putting the separator value it in the file constructors was right, and my initially putting it in readline or a new readuntil method was wrong. > Not that I wouldn't like to see this feature to be shipping with Python, but it may help to read through all aspects of the problem that have been discussed before. > > Best, > Wolfgang > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From ncoghlan at gmail.com Sat Jul 19 09:10:58 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 19 Jul 2014 03:10:58 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On 18 July 2014 12:43, Andrew Barnert wrote: > Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those? > > Pushing back is easier to implement (since it's already there as a private method), but a bit funky, and peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it. > > While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.) Slight tangent, but this rewrapping question also arises in the context of changing encodings on an already open stream. See http://bugs.python.org/issue15216 for (the gory) details. > On Jul 18, 2014, at 4:53, Wolfgang Maier wrote: >> You are not the first one to come up with this idea and suggesting solutions. 
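For anyone who wants to see what that peek()-based loop looks like in practice, here is a rough sketch (it assumes a single-byte separator, works on a binary BufferedReader, and the read_record name is only an illustration, not an existing method):

    def read_record(buffered, sep=b'\0'):
        # Gather one sep-terminated record without consuming anything past
        # the separator, using peek() to inspect the current buffer contents.
        chunks = []
        while True:
            waiting = buffered.peek(1)      # may return more than one byte
            if not waiting:                 # empty peek() means EOF
                break
            i = waiting.find(sep)
            if i >= 0:
                chunks.append(buffered.read(i + len(sep)))
                break
            chunks.append(buffered.read(len(waiting)))  # consume, then refill
        return b''.join(chunks)

    with open('spam.dat', 'rb') as f:
        first = read_record(f, b'\0')

A multi-byte separator would need extra care here, since it could straddle two peeks.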
This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade: >> >> http://bugs.python.org/issue1152248 >> >> Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan: >> >> http://bugs.python.org/issue1152248#msg109117 > > Thanks. > > Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is the basically the same as the text half of my patch. > > The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc.--all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code, which reinforces my belief that Alexander's idea of putting the separator value it in the file constructors was right, and my initially putting it in readline or a new readuntil method was wrong. I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing, and I don't think we'd be doing anybody any favours by conflating them (whether we're confusing them at the method level or at the constructor argument level). While, as an implementation artifact, it may be possible to get this "easily" by abusing the existing newline parameter, that's likely to break a lot of assumptions in *other* code, that specifically expects newlines to refer to actual line endings. A new separate method cleanly isolates the feature to code that wants to use it, preventing potentially adverse and hard to debug impacts on unrelated code that happens to receive a file object with a custom record separator configured. With this kind of proposal, it isn't the "what happens when it works?" cases that worry me - it's the cases where it *fails* and someone is stuck with figuring out what has gone wrong. A new method fails cleanly, but changing the semantics of *existing* arguments, attributes and methods? That doesn't fail cleanly at all, and can also have far reaching impacts on the correctness of all sorts of documentation. Attempting to wedge this functionality into *existing* constructs means *changing* a lot of expectations that are now well established in a Python context. By contrast, adding a *new* construct, specifically for this purpose, means nothing needs to change with existing constructs, we don't inadvertently introduce even more obscure corner cases in newline handling, and there's a solid terminology hook to hang the documentation one (iteration by line vs iteration by record - and we can also be clear that "line buffered" really does correspond to iteration by line, and may not be available for arbitrary record separators). Providing this feature as a separate method also makes it possible for the IO ABC's to provide a default implementation (along the lines of your resplit function), that concrete implementations can optionally override with something more optimised. Pure ducktyped cases (not inheriting from the ABCs) will fail with a fairly obvious error ("AttributeError: 'MyCustomFileType' object has no attribute 'readrecords'" rather than something related to unknown parameter names or illegal argument values), while those that do inherit from the ABCs will "just work". Regards, Nick. 
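For readers who have not seen it, the sort of default implementation being alluded to is roughly this (a sketch only - the resplit name comes from earlier in the thread, and the chunk size and other details are assumptions, not a settled design):

    def resplit(chunks, sep):
        # Re-split an iterable of text or bytes chunks on an arbitrary
        # separator, yielding one record (separator included) at a time.
        pending = sep[:0]                # '' or b'', matching the data type
        for chunk in chunks:
            pending += chunk
            pieces = pending.split(sep)
            pending = pieces.pop()       # possibly incomplete final record
            for piece in pieces:
                yield piece + sep
        if pending:
            yield pending                # trailing record with no separator

    def records(f, sep=b'\0', chunk_size=8192):
        # Stand-in for the proposed readrecords(); works on any object with
        # a read() method, which is all a generic ABC default could assume.
        return resplit(iter(lambda: f.read(chunk_size), sep[:0]), sep)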
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From rosuav at gmail.com Sat Jul 19 09:32:53 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 19 Jul 2014 17:32:53 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan wrote: > I still favour my proposal there to add a separate "readrecords()" > method, rather than reusing the line based iteration methods - lines > and arbitrary records *aren't* the same thing But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0. ChrisA From ncoghlan at gmail.com Sat Jul 19 10:18:35 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 19 Jul 2014 04:18:35 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On 19 July 2014 03:32, Chris Angelico wrote: > On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan wrote: >> I still favour my proposal there to add a separate "readrecords()" >> method, rather than reusing the line based iteration methods - lines >> and arbitrary records *aren't* the same thing > > But they might well be the same thing. Look at all the Unix commands > that usually separate output with \n, but can be told to separate with > \0 instead. If you're reading from something like that, it should be > just as easy to split on \n as on \0. Python isn't Unix, and Python has never supported \0 as a "line ending". Changing the meaning of existing constructs is fraught with complexity, and should only be done when there is absolutely no alternative. In this case, there's an alternative: a new method, specifically for reading arbitrary records. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From steve at pearwood.info Sat Jul 19 11:01:59 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 19 Jul 2014 19:01:59 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <20140719090159.GJ9112@ando> On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote: > On 19 July 2014 03:32, Chris Angelico wrote: > > On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan wrote: > >> I still favour my proposal there to add a separate "readrecords()" > >> method, rather than reusing the line based iteration methods - lines > >> and arbitrary records *aren't* the same thing > > > > But they might well be the same thing. Look at all the Unix commands > > that usually separate output with \n, but can be told to separate with > > \0 instead. If you're reading from something like that, it should be > > just as easy to split on \n as on \0. > > Python isn't Unix, and Python has never supported \0 as a "line > ending". 
Changing the meaning of existing constructs is fraught with > complexity, and should only be done when there is absolutely no > alternative. In this case, there's an alternative: a new method, > specifically for reading arbitrary records. I don't have an opinion one way or the other, but I don't quite see why you're worried about allowing the newline parameter to be set to some arbitrary separator. The best I can come up with is a scenario something like this: I open a file with some record-separator fp = open(filename, newline="\0") then pass it to a function: spam(fp) which assumes that each chunk ends with a linefeed: assert next(fp).endswith('\n') But in a case like that, the function is already buggy. I can see at least two problems with such an assumption: - what if universal newlines has been turned off and you're reading a file created under (e.g.) classic Mac OS or RISC OS? - what if the file contains a single line which does not end with an end of line character at all? open('/tmp/junk', 'wb').write("hello world!") next(open('/tmp/junk', 'r')) Have I missed something? Although I'm don't mind whether files grow a readrecords() method, or re-use the readlines() method, I'm not convinced that API decisions should be driven solely by the needs of programs which are already buggy. -- Steven From stephen at xemacs.org Sat Jul 19 11:06:59 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 19 Jul 2014 18:06:59 +0900 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <87mwc5k98s.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Angelico writes: > But they might well be the same thing. Look at all the Unix commands > that usually separate output with \n, but can be told to separate with > \0 instead. If you're reading from something like that, it should be > just as easy to split on \n as on \0. Nick's point is more general, I think, but as a special case consider a "multiline" record. What's the right behavior on output from the application if the newline convention of this particular multiline differs from that of the rest of the output stream? IMO this goes beyond "consenting adults" (YMMV, of course). Steve From ncoghlan at gmail.com Sat Jul 19 11:27:49 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 19 Jul 2014 05:27:49 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <20140719090159.GJ9112@ando> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> Message-ID: On 19 July 2014 05:01, Steven D'Aprano wrote: > On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote: > But in a case like that, the function is already buggy. I can see at > least two problems with such an assumption: > > - what if universal newlines has been turned off and you're reading > a file created under (e.g.) classic Mac OS or RISC OS? That's exactly the point though - people *do* assume "\n", and we've gone to great lengths to make that assumption *more correct* (even though it's still wrong sometimes). We can't reverse course on that, and expect the outcome to make sense to *people*. 
When making use of a configurable line endings feature breaks (and it will), they're going to be confused, and the docs likely aren't going to help much. > - what if the file contains a single line which does not end with an > end of line character at all? > > open('/tmp/junk', 'wb').write("hello world!") > next(open('/tmp/junk', 'r')) > > Have I missed something? > > > Although I'm don't mind whether files grow a readrecords() method, or > re-use the readlines() method, I'm not convinced that API decisions > should be driven solely by the needs of programs which are already > buggy. It's not being driven by the needs of programs that are already buggy - my preferences are driven by the fact that line endings and record separators are *not the same thing*. Thinking that they are is a matter of confusing the conceptual data model with the implementation of the framing at the serialisation layer. If we *do* try to treat them as the same thing, then we have to go find *every single reference* to line endings in the documentation and add a caveat about it being configurable at file object creation time, so it might actually be based on something completely arbitrary. Line endings are *already* confusing enough that the "universal newlines" mechanism was added to make it so that Python level code could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and just assume "\n" everywhere. This is why I'm a fan of keeping things comparatively simple, and just adding a new method (if we only add an iterator version) or two (if we add a list version as well) specifically for this use case. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From p.f.moore at gmail.com Sat Jul 19 11:30:38 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 19 Jul 2014 10:30:38 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <20140719090159.GJ9112@ando> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> Message-ID: On 19 July 2014 10:01, Steven D'Aprano wrote: > I open a file with some record-separator > > fp = open(filename, newline="\0") > > then pass it to a function: > > spam(fp) > > which assumes that each chunk ends with a linefeed: > > assert next(fp).endswith('\n') I will often do for line in fp: line = line.strip() to remove the line ending ("record separator"). This fails if you have an arbitrary separator. And for that matter, how would you remove an arbitrary separator? Maybe line = line[:-1] works, but what if at some point people ask for multi-character separators ("\n\n" for "paragraph separated", for example - ignoring the universal newline complexities in that). A splitrecord method still needs a means for code to to remove the record separator, of course, but the above demonstrates how reusing line separation could break the assumptions of *current* code. 
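Concretely, the best a caller can do under such a scheme is probably something like this (a sketch, assuming the code knows which separator the file was opened with):

    def strip_record_sep(record, sep):
        # Remove exactly one trailing separator and nothing else;
        # record.rstrip(sep) would over-strip for separators like '\n\n',
        # and record[:-1] only works for single-character separators.
        return record[:-len(sep)] if record.endswith(sep) else record

    strip_record_sep('spam\0', '\0')            # -> 'spam'
    strip_record_sep('paragraph\n\n', '\n\n')   # -> 'paragraph'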
Paul From apalala at gmail.com Sat Jul 19 13:49:58 2014 From: apalala at gmail.com (=?UTF-8?Q?Juancarlo_A=C3=B1ez?=) Date: Sat, 19 Jul 2014 07:19:58 -0430 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On Sat, Jul 19, 2014 at 3:48 AM, Nick Coghlan wrote: > Python isn't Unix, and Python has never supported \0 as a "line > ending". Changing the meaning of existing constructs is fraught with > complexity, and should only be done when there is absolutely no > alternative. In this case, there's an alternative: a new method, > specifically for reading arbitrary records. > "practicality beats purity." http://legacy.python.org/dev/peps/pep-0020/ -- Juancarlo *A?ez* -------------- next part -------------- An HTML attachment was scrubbed... URL: From antoine at python.org Sat Jul 19 16:55:43 2014 From: antoine at python.org (Antoine Pitrou) Date: Sat, 19 Jul 2014 10:55:43 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <20140719090159.GJ9112@ando> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> Message-ID: Le 19/07/2014 05:01, Steven D'Aprano a ?crit : > > I open a file with some record-separator > > fp = open(filename, newline="\0") Hmm... newline="\0" already *looks* wrong. To me, it's a hint that you're abusing the API. The main advantage of it, though, is that you can use iteration in addition to the regular readline() (or readrecord()) method. Regards Antoine. From python at mrabarnett.plus.com Sat Jul 19 18:21:33 2014 From: python at mrabarnett.plus.com (MRAB) Date: Sat, 19 Jul 2014 17:21:33 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <20140719090159.GJ9112@ando> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> Message-ID: <53CA9B0D.1070405@mrabarnett.plus.com> On 2014-07-19 10:01, Steven D'Aprano wrote: [snip] > - what if universal newlines has been turned off and you're reading > a file created under (e.g.) classic Mac OS or RISC OS? > [snip] FTR, the line ending in RISC OS is '\n'. From guido at python.org Sat Jul 19 22:05:32 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 19 Jul 2014 13:05:32 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <53CA9B0D.1070405@mrabarnett.plus.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CA9B0D.1070405@mrabarnett.plus.com> Message-ID: I don't have time for this thread. I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me). I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way). I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-). 
I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character. I value the equivalence of __next__() and readline(). I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data). Once a suitable wrapper class has been implemented as a 3rd party module and is in common use you may petition to have it added to the standard library, as a separate module/class/function. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Sun Jul 20 01:28:55 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sat, 19 Jul 2014 16:28:55 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> (replies to multiple messages here) On Saturday, July 19, 2014 1:19 AM, Nick Coghlan wrote: >On 19 July 2014 03:32, Chris Angelico wrote: >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan wrote: >>> I still favour my proposal there to add a separate "readrecords()" >>> method, rather than reusing the line based iteration methods - lines >>> and arbitrary records *aren't* the same thing >> >> But they might well be the same thing. Look at all the Unix commands >> that usually separate output with \n, but can be told to separate with >> \0 instead. If you're reading from something like that, it should be >> just as easy to split on \n as on \0. > >Python isn't Unix, and Python has never supported \0 as a "line >ending". Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools. For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script. In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing - it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem. > Changing the meaning of existing constructs is fraught with >complexity, and should only be done when there is absolutely no >alternative. In this case, there's an alternative: a new method, >specifically for reading arbitrary records. This was basically my original suggestion, so obviously I don't think it's a terrible idea. But I don't think it's as good. First, which of these is more readable, easier for novices to figure out how to write, etc.:

    with open(path, newline='\0') as f:
        for line in f:
            handle(line.rstrip('\0'))

    with open(path) as f:
        for line in iter(lambda: f.readrecord('\0'), ''):
            handle(line.rstrip('\0'))

Second, as Guido mentioned at the start of this thread, existing file-like object types (whether they implement BufferedIOBase or TextIOBase, or just duck-type the interfaces) are not going to have the new functionality.
Construction has never been part of the interface of the file-like object API; opening a real file has always looked different from opening a member file in a zip archive or making a file-like wrapper around a socket transport or whatever. But using the resulting object has always been the same. Adding a readrecord method or changing the interface readline means that's no longer true. There might be a good argument for making the change more visible?that is, using a different parameter on the open call instead of reusing the existing newline. (And that's what Alexander originally suggested as an alternative to my readrecord idea.) That way, it's much more obvious that spam.open or eggs.makefile or whatever doesn't support alternate line endings, without having to read its documentation on what newline means. But either way, I think it should go in the open function, not the file-object API. On Saturday, July 19, 2014 2:28 AM, Nick Coghlan wrote: > - my preferences are driven by the fact that line endings and record > separators are *not the same thing*.? Thinking that they are is a > matter of confusing the conceptual data model with the implementation > of the framing at the serialisation layer.? Yes, using lines implicitly as records can lead to confusion?but people actually do that all the time; this isn't a new problem, and it's exactly the same problem with \r\n, or even \n, as with \0. When you open up TextEdit and write a grocery list with one item on each line, those newlines are not part of the items. When you pipe the output of find to a script, the newlines are not part of the filenames. When you pipe the output of find -0 to a script, the \0 terminators are not part of the filenames. > Line endings are *already* confusing enough that the "universal > newlines" mechanism was added to make it so that Python level code > could mostly ignore the whole "\n" vs "\r" vs? > "\r\n" distinction, and > just assume "\n" everywhere. I understand the point here. There are cases where universal newlines let you successfully ignore the confusion rather than dealing with it, and newline='\0' will not be useful in those cases. But then newline='\r' is also never useful in those cases. The new behavior will be useful in exactly the cases where '\r' already is?no more, but no less. > This is why I'm a fan of keeping things comparatively simple, and just > adding a new method (if we only add an iterator version) or two (if we > add a list version as well) specifically for this use case. Actually, the obvious new method is neither the iterator version nor the list version, but a single-record version, readrecord. Sometimes you need readline/readrecord, and it's conceptually simpler for the user. And of course the implementation is a lot simpler; you don't need to build a new iterator object that references the file for readrecord the way you do for iterrecords. And finally, if you only have one of the two,?as bad as iter(lambda: f.readrecord('\0'), '') may look to novices, next(f.iterrecords('\0')) would probably be even more confusing. But we could also add an iterrecords, for two methods. And as for the list-based version? well, I don't even understand why readlines still exists in 3.x (much less why the tutorial suggests it), so I'd be fine not having a readrecords, but I don't have any real objection. On Saturday, July 19, 2014 1:06 PM, Guido van Rossum wrote: >I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me). 
I get the feeling either there's a much simpler way to wrap a file object that I'm missing, or that you think there is. In order to do the equivalent of readrecord, you have to do one of three things: 1. Read character by character, which can be incredibly slow. 2. Peek or push back on the buffer, as the io classes' readline methods do. 3. Put another buffer in front of the file, which means you have two objects both sharing the same file but with effective file pointers out of sync. And you have to reproduce all of the file-like-object API methods for your new buffered object (a lot more work, and a lot more to get wrong?effectively, it means you have to write all of BufferedReader or TextIOWrapper, but modified to wrap another buffered file instead of wrapping the lower-level thing).?And no matter how you do it, it's obviously going to be less efficient. If there's a lighter version of #3 that makes sense, I'm not seeing it. Which is probably a problem with my lack of insight, but I'd appreciate a pointer in the right direction. >I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way). Maybe using a different argument is a better answer. (That's what Alexander suggested originally.) The reason both I and people on the bug thread suggested using newline instead is because the behavior you want from sep='\0' happens to be identical to the behavior you get from newline='\r', except with '\0' instead of '\r'. And that's the best argument I have for reusing newline: someone has already worked out and documented all the implications of newline, and people have already learned them, so if we really want the same functionality, it makes sense to reuse it.? But I realize that argument only goes so far. It wasn't obvious, until I looked into it, that I wanted the exact same functionality. >I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-). Sure, it would have been a lot better for find and friends to grow a --escape parameter instead of -0, but I think that ship has sailed. >I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character. > >I value the equivalence of __next__() and readline(). > >I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data). Again, I don't see any way to do this sensibly that wouldn't be a whole lot more work than just forking the io package. But maybe that's the answer: I can write _io2 as a fork of _io with my changes, the same for _pyio2 (for PyPy), and then the only thing left to write is a __main__ for the package that wraps up _io2/_pyio2 in the io ABCs (and re-exports those ABCs). 
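For comparison, the lightest version of #3 that comes to mind is something like the following (a sketch only: every name here is invented, it only handles reading, and it has exactly the out-of-sync file position problem described above, because it reads ahead into its own buffer):

    class RecordReader:
        # Wraps any binary file-like object, adds record iteration, and
        # delegates everything else to the wrapped object.
        def __init__(self, fileobj, sep=b'\0', chunk_size=8192):
            self._file = fileobj
            self._sep = sep
            self._chunk_size = chunk_size
            self._pending = b''

        def readrecord(self):
            while self._sep not in self._pending:
                chunk = self._file.read(self._chunk_size)
                if not chunk:                           # EOF
                    record, self._pending = self._pending, b''
                    return record
                self._pending += chunk
            record, self._pending = self._pending.split(self._sep, 1)
            return record + self._sep

        def __iter__(self):
            return iter(self.readrecord, b'')

        def __getattr__(self, name):
            return getattr(self._file, name)            # delegate the rest

    with open('find_output.bin', 'rb') as f:
        for record in RecordReader(f, b'\0'):
            print(record.rstrip(b'\0'))

It is short, but it only pretends to be a file: mixing reads on the wrapper with reads on the underlying object will silently skip data, which is the cost being argued about here.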
From ncoghlan at gmail.com Sun Jul 20 01:49:38 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 19 Jul 2014 19:49:38 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: On 20 Jul 2014 09:28, "Andrew Barnert" wrote: > > (replies to multiple messages here) > > On Saturday, July 19, 2014 1:19 AM, Nick Coghlan wrote: > > > >On 19 July 2014 03:32, Chris Angelico wrote: > >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan wrote: > >>> I still favour my proposal there to add a separate "readrecords()" > >>> method, rather than reusing the line based iteration methods - lines > >>> and arbitrary records *aren't* the same thing > >> > >> But they might well be the same thing. Look at all the Unix commands > >> that usually separate output with \n, but can be told to separate with > >> \0 instead. If you're reading from something like that, it should be > >> just as easy to split on \n as on \0. > > > >Python isn't Unix, and Python has never supported \0 as a "line > >ending". > > Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools. > > For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script. > > In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing?it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem. I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences. Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rosuav at gmail.com Sun Jul 20 01:51:26 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sun, 20 Jul 2014 09:51:26 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: On Sun, Jul 20, 2014 at 9:49 AM, Nick Coghlan wrote: > Adding a single possible newline character is a much simpler change, and one > likely to have far fewer odd consequences. This is especially so if > specifying NULL as the line separator is only permitted for files opened in > binary mode. U+0000 is a valid Unicode character, so I'd have no objection to, for instance, splitting a UTF-8 encoded text file on \0. 
ChrisA From ncoghlan at gmail.com Sun Jul 20 01:56:18 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 19 Jul 2014 19:56:18 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: On 20 Jul 2014 09:49, "Nick Coghlan" wrote: > > > On 20 Jul 2014 09:28, "Andrew Barnert" wrote: > > > > (replies to multiple messages here) > > > > On Saturday, July 19, 2014 1:19 AM, Nick Coghlan wrote: > > > > > > >On 19 July 2014 03:32, Chris Angelico wrote: > > >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan wrote: > > >>> I still favour my proposal there to add a separate "readrecords()" > > >>> method, rather than reusing the line based iteration methods - lines > > >>> and arbitrary records *aren't* the same thing > > >> > > >> But they might well be the same thing. Look at all the Unix commands > > >> that usually separate output with \n, but can be told to separate with > > >> \0 instead. If you're reading from something like that, it should be > > >> just as easy to split on \n as on \0. > > > > > >Python isn't Unix, and Python has never supported \0 as a "line > > >ending". > > > > Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools. > > > > For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script. > > > > In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing?it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem. > > I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences. > > Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode. Also, the interoperability argument is a good one, as is the analogy with '\r'. Since this does end up touching the open() builtin and the core IO abstractions, it will need a PEP. As far as implementation goes, I suspect a RecordIOWrapper layered IO model inspired by the approach used for TextIOWrapper may make sense. Cheers, Nick. > > Cheers, > Nick. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abarnert at yahoo.com Sun Jul 20 02:57:14 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sat, 19 Jul 2014 17:57:14 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> On Saturday, July 19, 2014 4:49 PM, Nick Coghlan wrote: >On 20 Jul 2014 09:28, "Andrew Barnert" wrote: >> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing?it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem. >I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences. >Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode. But newline is only permitted for text mode.?Are you suggesting that we add newline to binary mode, but the only allowed values are NULL (current behavior) and \0, while on text files the list of allowed values stays the same as today? Also, would you want the same semantics for newline='\0' on binary files that newline='\r' has on text files (including newline remapping on write)? And I'm still not sure why you think this shouldn't be allowed in text mode in the first place (especially given that you suggested the same thing for text files _only_ a few years ago). The output of file is a list of newline-separated or \0-separated filenames, in the filesystem's encoding. Why should I be able to handle the first as a text file, but have to handle the second as a binary file and then manually decode each line? You could argue that file -0 isn't really separating Unicode filenames with U+0000, but separating UTF-8 or Latin-1 or whatever filenames with \x00, and it's just a coincidence that they happen to match up. But it really isn't just a coincidence; it was an intentional design decision for Unicode (and UTF-8, and Latin-1) that the ASCII control characters map in the obvious way, and one that many tools and scripts take advantage of, so why shouldn't tools and scripts written in Python be able to take advantage of it? From ncoghlan at gmail.com Sun Jul 20 03:23:56 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 20 Jul 2014 11:23:56 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: On 20 July 2014 10:57, Andrew Barnert wrote: > On Saturday, July 19, 2014 4:49 PM, Nick Coghlan wrote: > >>On 20 Jul 2014 09:28, "Andrew Barnert" wrote: > > >>> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing?it means I don't have to use Perl. 
But as soon as -0 comes into the mix, that's no longer true. And that's a problem. > >>I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences. > > >>Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode. > > > But newline is only permitted for text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are NULL (current behavior) and \0, while on text files the list of allowed values stays the same as today? Actually, I temporarily forgot that newline was only handled at the TextIOWrapper layer. All the more reason for a PEP that clearly lays out the status quo (both Python's own newline handling and the "-0" option for various UNIX utilities, and the way that is handled in other scripting langauges), and discusses the various options for dealing with it (new RecordIOWrapper class with a new "open" parameter, new methods on IO clases, new semantics on the existing TextIOWrapper class). If the description of the use cases is clear enough, then the "right answer" amongst the presented alternatives (which includes "don't change anything") may be obvious. At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From rosuav at gmail.com Sun Jul 20 03:31:10 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sun, 20 Jul 2014 11:31:10 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan wrote: > At present, I'm genuinely unclear on > why someone would ever want to pass the "-0" option to the other UNIX > utilities, which then makes it very difficult to have a sensible > discussion on how we should address that use case in Python. That one's easy. What happens if you use 'find' to list files, and those files might have \n in their names? You need another sep. 
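For example, the canonical pairing is find . -print0 on one side and a consumer that splits stdin on NUL on the other. In today's Python that consumer ends up looking something like this (a sketch; slurping the whole stream is fine for a list of filenames, but note there is no incremental, line-at-a-time equivalent):

    # used as:  find . -print0 | python3 consumer.py
    import sys

    data = sys.stdin.buffer.read()        # NUL-separated names; may contain '\n'
    for raw in data.split(b'\0'):
        if not raw:
            continue                      # the trailing NUL leaves one empty entry
        name = raw.decode(sys.getfilesystemencoding(), 'surrogateescape')
        print(repr(name))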
ChrisA From ncoghlan at gmail.com Sun Jul 20 03:40:25 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 20 Jul 2014 11:40:25 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: On 20 July 2014 11:31, Chris Angelico wrote: > On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan wrote: >> At present, I'm genuinely unclear on >> why someone would ever want to pass the "-0" option to the other UNIX >> utilities, which then makes it very difficult to have a sensible >> discussion on how we should address that use case in Python. > > That one's easy. What happens if you use 'find' to list files, and > those files might have \n in their names? You need another sep. Yes, but having a newline in a filename is sufficiently weird that I find it hard to imagine a scenario where "fix the filenames" isn't a better answer. Hence why I think the PEP needs to explain why the UNIX utilities considered this use case sufficiently non-obscure to add explicit support for it, rather than just assuming that the obviousness of the use case can be taken for granted. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From abarnert at yahoo.com Sun Jul 20 05:58:58 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sat, 19 Jul 2014 20:58:58 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> On Saturday, July 19, 2014 6:42 PM, Nick Coghlan wrote: > On 20 July 2014 11:31, Chris Angelico wrote: >>??On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan > wrote: >>>??At present, I'm genuinely unclear on >>>??why someone would ever want to pass the "-0" option to the >>> other UNIX >>>??utilities, which then makes it very difficult to have a sensible >>>??discussion on how we should address that use case in Python. >> >>??That one's easy. What happens if you use 'find' to list files, >> and >> ?those files might have \n in their names? You need another sep. > > Yes, but having a newline in a filename is sufficiently weird that I > find it hard to imagine a scenario where "fix the filenames" isn't > a > better answer. Hence why I think the PEP needs to explain why the UNIX > utilities considered this use case sufficiently non-obscure to add > explicit support for it, rather than just assuming that the > obviousness of the use case can be taken for granted. First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with,?not because there's something inherently nonsensical about the idea, any more than filenames with spaces or non-ASCII characters or >255 length. 
Second, "fix the filenames" is almost _never_ a better answer. If you're publishing a program for other people to use, you want to document that it won't work on some perfectly good files, and close their bugs as "Not a bug, rename your files if you want to use my software"??If the files are on a read-only filesystem or a slow tape backup, you really want to copy the entire filesystem over just so you can run a script on it? Also, even if "fix the filenames" were the right answer, you need to write a tool to do that, and why shouldn't it be possible to use Python for that tool? (In fact, one of the scripts I wanted this feature for is a replacement for the traditional rename tool (http://plasmasturm.org/code/rename/). I mainly wanted to let people use regular expressions without letting them run arbitrary Perl code, as rename -e does, but also, I couldn't figure out how to rename "foo" to "Foo" on a case-preserving-but-insensitive filesystem in Perl, and I know how to do it in Python.) At any rate, there are decades of tradition behind using -print0, and that's not going to change just because Python isn't as good as other languages at dealing with it.?The GNU find documentation (http://linux.die.net/man/1/find) explicitly recommends, in multiple places, using -print0 instead of -print whenever possible. (For example, in the summary near the top, "If no expression is given, the expression -print is used (but you should probably consider using -print0 instead, anyway).") And part of the reason for that is that many other tools, like xargs, split on any whitespace, not on newlines, if not given the -0 argument. Fortunately, all of those tools know how to handle backslash escapes, but unfortunately, find doesn't know how to emit them. (Actually,?frustratingly, both BSD and SysV find have the code to do it, but not in a way you can use here.)?So, if you're writing a script that uses find and might get piped to anything that handles input like xargs, you have to use -print0. And that means, if you're writing a tool that might get find piped to it, you have to handle -print0, even if you're pretty sure nobody will ever have newlines for you to deal with, because they're probably going to want to use -print0 anyway, rather than figure out how your tool deals with other whitespace. From greg.ewing at canterbury.ac.nz Sun Jul 20 06:16:54 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 20 Jul 2014 16:16:54 +1200 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <53CB42B6.9050100@canterbury.ac.nz> Nick Coghlan wrote: > having a newline in a filename is sufficiently weird that I > find it hard to imagine a scenario where "fix the filenames" isn't a > better answer. In Classic MacOS, the way you gave a folder an icon was to put it in a hidden file called "Icon\r". 
-- Greg From greg.ewing at canterbury.ac.nz Sun Jul 20 06:03:20 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 20 Jul 2014 16:03:20 +1200 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> Message-ID: <53CB3F88.9090709@canterbury.ac.nz> Paul Moore wrote: > And for that matter, how would you remove an > arbitrary separator? Maybe line = line[:-1] works, but what if at some > point people ask for multi-character separators If the newline mechanism is re-used, it would convert whatever separator is used into '\n'. -- Greg From ncoghlan at gmail.com Sun Jul 20 07:00:15 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 20 Jul 2014 15:00:15 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: On 20 July 2014 13:58, Andrew Barnert wrote: > First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than filenames with spaces or non-ASCII characters or >255 length. You answered your own question: because DOS/Windows make them illegal, and the Unix shell isn't fond of them either. I was a DOS/Windows user for more than a decade before switching to Linux for personal use, and in a decade of using Linux (and even going to work for a Linux vendor), I've never encountered a filename with a newline in it. Thus the idea that anyone *would* do such a thing, and that it would be prevalent enough for UNIX tools to include a workaround in programs that normally produce newline separated output is an entirely novel concept for me. Any such file I encountered *would* be an outlier, and I'd likely be in a position to get the offending filename fixed rather than changing any data processing pipelines (whether written in Python or not) to tolerate newlines in filenames (since the cost differential between fixing one filename vs updating the data processing pipelines would be enormous). However, note that my attitude changed significantly once you clarified the use case - it's clear that there *is* a use case, it's just one that's outside my own personal experience. That's one of the things the PEP process is for - to explain such use cases to folks that haven't personally encountered them, and then explain why the proposed solution addresses the use case in a way that makes sense for the domains where the use case arises. The recent matrix multiplication PEP was an exemplary example of the breed. That's what I'm asking for here: a PEP that makes sense to someone like me for whom the idea of putting a newline in a filename is completely alien. 
Yes, it's technically permitted by the underlying operating system APIs on POSIX systems, but all the affordances at both the console and GUI level suggest "no newlines allowed". If you're coming from a DOS/Windows background (as I did), then the idea that a newline is technically a permitted filename character may never even occur to you (it certainly hadn't to me, and I'd never previously come across anything to challenge that assumption). Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From abarnert at yahoo.com Sun Jul 20 07:02:03 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sat, 19 Jul 2014 22:02:03 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <53CB3F88.9090709@canterbury.ac.nz> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CB3F88.9090709@canterbury.ac.nz> Message-ID: <1405832523.20539.YahooMailNeo@web181004.mail.ne1.yahoo.com> On Saturday, July 19, 2014 9:42 PM, Greg Ewing wrote: > > Paul Moore wrote: >> And for that matter, how would you remove an >> arbitrary separator? Maybe line = line[:-1] works, but what if at some >> point people ask for multi-character separators You already can't use line[:-1] today, because '\r\n' is already a valid value, and always has been. And however people deal with newline='\r\n' will work for any crazy separator you can think of. Maybe line[:-len(nl)]. Maybe line.rstrip(nl) if it's appropriate (it isn't always, either for \r\n or for some arbitrary separator). > If the newline mechanism is re-used, it would > convert whatever separator is used into '\n'. No it wouldn't. https://docs.python.org/3/library/io.html#io.TextIOWrapper > When reading input from the stream, if newline is None, universal newlines mode is enabled? If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated. So, making '\0' a legal value just means the '\0' line endings will be returned to the caller untranslated. Also, remember that binary files don't do universal newline translation ever, so just letting you change the separator there wouldn't add translation. Of course both of those could be changed as well (although with what interface, I'm not sure?), but I don't think they should be. From guido at python.org Sun Jul 20 07:45:04 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 19 Jul 2014 22:45:04 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405832523.20539.YahooMailNeo@web181004.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CB3F88.9090709@canterbury.ac.nz> <1405832523.20539.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: If and when something is decided in this thread, can someone summarize it to me? I don't have time to read all the lengthy arguments but I do care about the outcome. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mertz at gnosis.cx Sun Jul 20 07:58:53 2014 From: mertz at gnosis.cx (David Mertz) Date: Sat, 19 Jul 2014 22:58:53 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: The pattern I use, by far, most often with the -0 option is: find $path -print0 | xargs -0 some_command Embedding a '\n' in a filename might be weird, but having whitespace in general (i.e. spaces) really isn't uncommon. However, in this case it doesn't really seem to matter if some_command is some_command.py. But I still think the null byte special delimiter is plausible for similar pipelines. On Sat, Jul 19, 2014 at 6:40 PM, Nick Coghlan wrote: > On 20 July 2014 11:31, Chris Angelico wrote: > > On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan > wrote: > >> At present, I'm genuinely unclear on > >> why someone would ever want to pass the "-0" option to the other UNIX > >> utilities, which then makes it very difficult to have a sensible > >> discussion on how we should address that use case in Python. > > > > That one's easy. What happens if you use 'find' to list files, and > > those files might have \n in their names? You need another sep. > > Yes, but having a newline in a filename is sufficiently weird that I > find it hard to imagine a scenario where "fix the filenames" isn't a > better answer. Hence why I think the PEP needs to explain why the UNIX > utilities considered this use case sufficiently non-obscure to add > explicit support for it, rather than just assuming that the > obviousness of the use case can be taken for granted. > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wichert at wiggy.net Sun Jul 20 09:50:10 2014 From: wichert at wiggy.net (Wichert Akkerman) Date: Sun, 20 Jul 2014 09:50:10 +0200 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CA9B0D.1070405@mrabarnett.plus.com> Message-ID: > On 19 Jul 2014, at 22:05, Guido van Rossum wrote: > > I don't have time for this thread. > > I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me). > > I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way). 
I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful. One of them was even mentioned in this discussion: processing the output of find -0. Wichert. From wichert at wiggy.net Sun Jul 20 09:58:44 2014 From: wichert at wiggy.net (Wichert Akkerman) Date: Sun, 20 Jul 2014 09:58:44 +0200 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <29BD9830-94CE-4C46-8BC9-7AB83A9DFBDD@wiggy.net> > On 20 Jul 2014, at 03:40, Nick Coghlan wrote: > > On 20 July 2014 11:31, Chris Angelico wrote: >> On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan wrote: >>> At present, I'm genuinely unclear on >>> why someone would ever want to pass the "-0" option to the other UNIX >>> utilities, which then makes it very difficult to have a sensible >>> discussion on how we should address that use case in Python. >> >> That one's easy. What happens if you use 'find' to list files, and >> those files might have \n in their names? You need another sep. > > Yes, but having a newline in a filename is sufficiently weird that I > find it hard to imagine a scenario where "fix the filenames" isn't a > better answer. Because you are likely to have no control at all over what people do with filenames. Since, on POSIX at least, filenames are allowed to contain all characters other than NUL and /, you must be able to deal with that. Similar to how you must also be able to deal with a mixture of filenames using different encodings or even pure binary names. Wichert. From wolfgang.maier at biologie.uni-freiburg.de Sun Jul 20 12:41:29 2014 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Sun, 20 Jul 2014 12:41:29 +0200 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <53CB9CD9.4060404@biologie.uni-freiburg.de> On 19.07.2014 09:10, Nick Coghlan wrote: > > I still favour my proposal there to add a separate "readrecords()" > method, rather than reusing the line based iteration methods - lines > and arbitrary records *aren't* the same thing, and I don't think we'd > be doing anybody any favours by conflating them (whether we're > confusing them at the method level or at the constructor argument > level). > Thinking about possible use-cases for my own work made me realize one thing: At least for text files, the distinction between records and lines, in practical terms, is that records may have *internal structure based on newline characters*, while lines are just lines.
If a future readrecords() method would return the record as a StringIO or BytesIO object, this would allow nested reading of files as lines (with full newline processing) within records: for record in infile.readrecords(): for line in record: do_something() For me, that sort of feature is a more common requirement than being able to retrieve single lines terminated by something else than newline characters. Maybe though, it's possible to have both, a readrecords method like the one above and an extended set of "newline" tokens that can be passed to open (at least allowing "\0" seems to make sense). Best, Wolfgang From abarnert at yahoo.com Sun Jul 20 13:53:01 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sun, 20 Jul 2014 04:53:01 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CA9B0D.1070405@mrabarnett.plus.com> Message-ID: On Jul 20, 2014, at 0:50, Wichert Akkerman wrote: > >> On 19 Jul 2014, at 22:05, Guido van Rossum wrote: >> >> I don't have time for this thread. >> >> I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me). >> >> I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way). > > I see another problem with doing this by modifying the open() call: it does not work for filehandles creates using other methods such as pipe() or socket(), either used directly or via subprocess. There are have real-world examples of situations where that is very useful. A socket() is not a python file object, doesn't have a similar API, and doesn't have a readline method. The result of calling socket.makefile, on the other hand, is a file object--and it's created by calling open.* And I'm pretty sure socket.makefile already takes a newline argument and just passes it along, in which case it will magically work with no changes at all.** IIRC, os.pipe() just returns a pair of fds (integers), not a file object at all. It's up to you to wrap that in a file object if you want to--which you do by passing it to the open function. So, neither of your objections works. There are some better examples you could have raised, however. For example, a bz2.BzipFile is created with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode. So, it would have to be changed to get the benefit. However, given that there's no way to magically make every file-like object anyone has ever written automatically grow this new functionality, having the API change on the constructors, which are not part of any API and not consistent, is better than having it on the readline method. Think about where you'd get the error in each case: before even writing your code, when you look up how BzipFile instances are created and see there's no way to pass a newline argument, or deep in your code when you're using a file object that came from who knows where and it's readline method doesn't like the standard, documented newline argument? * Or maybe it's created by constructing a BufferedReader, BufferedWriter, BufferedRandom, or TextIOWrapper directly. I don't remember off hand. 
But it doesn't matter, because the suggestion is to put the new parameter in those constructors, and make open forward to them, so whether makefile calls them directly or via open, it gets the same effect. ** Unless it validates the arguments before passing them along. I looked over a few stdlib classes, and there was at least one that unnecessarily does the same validation open is going to do anyway, so obviously that needs to be removed before the class magically benefits. In some cases (like tempfile.NamedTemporaryFile), even that isn't necessary, because the implementation just passes through all **kwargs that it doesn't want to handle to the open or constructor call. From abarnert at yahoo.com Sun Jul 20 13:56:28 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sun, 20 Jul 2014 04:56:28 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CB3F88.9090709@canterbury.ac.nz> <1405832523.20539.YahooMailNeo@web181004.mail.ne1.yahoo.com> Message-ID: Per Nick's suggestion, I will write up a draft PEP, and link it to issue #1152248, which should be a lot easier to follow. If you want to wait until the first round of discussion and the corresponding update to the PEP before checking in, I'll make sure it's obvious when that's happened. Sent from a random iPhone On Jul 19, 2014, at 22:45, Guido van Rossum wrote: > If and when something is decided in this thread, can someone summarize it to me? I don't have time to read all the lengthy arguments but I do care about the outcome. > > -- > --Guido van Rossum (python.org/~guido) > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Sun Jul 20 15:42:20 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 20 Jul 2014 14:42:20 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CA9B0D.1070405@mrabarnett.plus.com> Message-ID: On 20 July 2014 12:53, Andrew Barnert wrote: > There are some better examples you could have raised, however. For example, a bz2.BzipFile is created > with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost > certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode. > So, it would have to be changed to get the benefit. The most significant example is one which has been mentioned, but you may have missed. The motivation for this proposal is to interoperate with the -0 flag on things like the unix find command. But that is typically used in a pipe, which means your Python program will likely receive \0 terminated records via sys.stdin. And sys.stdin is already opened for you - you do not have the option to specify a newline argument. 
In actual fact, I can't think of a good example (either from my own experience, or mentioned in this thread) where I'd expect to be reading \0-terminated records from anything *except* sys.stdin. Paul From clint.hepner at gmail.com Sun Jul 20 17:11:25 2014 From: clint.hepner at gmail.com (Clint Hepner) Date: Sun, 20 Jul 2014 11:11:25 -0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <20140719090159.GJ9112@ando> <53CA9B0D.1070405@mrabarnett.plus.com> Message-ID: <2592A06E-DFFD-420A-AD13-5755B8B5BE61@gmail.com> -- Clint > On Jul 20, 2014, at 9:42 AM, Paul Moore wrote: > > In actual fact, I can't think of a good example (either from my own > experience, or mentioned in this thread) where I'd expect to be > reading \0-terminated records from anything *except* sys.stdin. Named pipes and whatever is used to implement process substitution ( < <(find ... -0) ) come to mind. From ram.rachum at gmail.com Mon Jul 21 00:06:33 2014 From: ram.rachum at gmail.com (Ram Rachum) Date: Sun, 20 Jul 2014 15:06:33 -0700 (PDT) Subject: [Python-ideas] Changing `Sequence.__contains__` Message-ID: Why does the default `Sequence.__contains__` iterate through the items rather than use `.index`, which may sometimes be more efficient? I suggest an implementation like this: def __contains__(self, i): try: self.index(i) except ValueError: return False else: return True What do you think? -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Mon Jul 21 02:41:32 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sun, 20 Jul 2014 17:41:32 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> On Saturday, July 19, 2014 10:00 PM, Nick Coghlan wrote: > That's one of the > things the PEP process is for - to explain such use cases to folks > that haven't personally encountered them, and then explain why the > proposed solution addresses the use case in a way that makes sense for > the domains where the use case arises. OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt It's probably a lot more detailed than necessary in many areas, but I figured it was better to include too much than to leave things ambiguous; after I know which parts are not contentious, I can strip it down in the next revision. Meanwhile, while writing it, and re-reading Guido's replies in this thread, I decided to come back to the alternative idea of exposing text files' buffers just like binary files' buffers. If done properly, that would make it much easier (still not trivial, but much easier) for users to just implement the readrecord functionality on their own, or for someone to package it up on PyPI. 
And?I don't think the idea is as radical as it sounded at first, so I don't want it to be dismissed out of hand. So, also see?http://bugs.python.org/file36009/pep-peek.txt Finally, writing this up made me recognize a couple of minor problems with the patch I'd been writing, and I don't think I have time to clean it up and write relevant tests now, so I might not be able to upload a useful patch until next weekend. Hopefully people can still discuss the PEP without a patch to play with. From steve at pearwood.info Mon Jul 21 03:41:43 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 21 Jul 2014 11:41:43 +1000 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: References: Message-ID: <20140721014142.GL9112@ando> On Sun, Jul 20, 2014 at 03:06:33PM -0700, Ram Rachum wrote: > Why does the default `Sequence.__contains__` iterate through the items > rather than use `.index`, which may sometimes be more efficient? Because having an index() method is not a requirement to be a sequence. It is optional. The implementation for Sequence.__contains__ which makes the least assumptions about the class is to iterate over the items. > I suggest an implementation like this: > > def __contains__(self, i): > try: self.index(i) > except ValueError: return False > else: return True > > What do you think? That now means that sequence types will have to define an index method in order to be a sequence. Not only that, but the index method has to follow a standard API, which not all sequence types may do. This would be marginally better: def __contains__(self, obj): try: index = type(self).index except AttributeError: for o in self: if o is obj or o == obj: return True return False else: try: index(obj) except ValueError: return False else: return True but it has at two problems I can see: - it's not backwards compatible with sequence types which may already define an index attribute which does something different, e.g.: class Book: def index(self): # return the index of the book def __getitem__(self, n): # return page n - for a default implementation, it's too complicated. If your sequence class has an efficient index method (or an efficient find method, or __getitem__ method, or any other efficient way of testing whether something exists in the sequence quickly) it's not much more work to define a custom __contains__ to take advantage of that. There's no need for the default Sequence fallback to try to guess what time-saving methods you might provide. For a historical view, you should be aware that until recently, tuples had no index method: [steve at ando ~]$ python2.5 Python 2.5.4 (r254:67916, Nov 25 2009, 18:45:43) [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2 Type "help", "copyright", "credits" or "license" for more information. py> (1, 2).index Traceback (most recent call last): File "", line 1, in AttributeError: 'tuple' object has no attribute 'index' There's no reason to expect that all sequences will have an index method, and certainly no reason to demand that they do. 
-- Steven From abarnert at yahoo.com Mon Jul 21 03:52:58 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sun, 20 Jul 2014 18:52:58 -0700 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: <20140721014142.GL9112@ando> References: <20140721014142.GL9112@ando> Message-ID: <1405907578.10294.YahooMailNeo@web181006.mail.ne1.yahoo.com> On Sunday, July 20, 2014 6:42 PM, Steven D'Aprano wrote: > > On Sun, Jul 20, 2014 at 03:06:33PM -0700, Ram Rachum wrote: > >> Why does the default `Sequence.__contains__` iterate through the items >> rather than use `.index`, which may sometimes be more efficient? > > Because having an index() method is not a requirement to be a sequence. > It is optional.? Sequence.__contains__ certainly can assume that your class will have an index method, because it provides one if you don't. See https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes (and you can look all the way back to 2.6 and 3.0 to verify that it's always been there). The default implementation looks like this: ? ?for i, v in enumerate(self): ? ? ? ? if v == value: ? ? ? ? ? ? return i ? ?raise ValueError > but it has at two problems I can see: > > - it's not backwards compatible with sequence types which may already > define an index attribute which does something different, e.g.: > > ? ? class Book: > ? ? ? ? def index(self): > ? ? ? ? ? ? # return the index of the book This isn't a Sequence. You didn't inherit from collections.abc.Sequence, or even register with it. So, Sequence.__contains__ can't get called on your class in the first place. If you _do_ make it a Sequence, then you're violating the protocol you're claiming to support, and it's your own fault if that doesn't work. You can also write a __getitem__ that requires four arguments and call yourself a Sequence, but you're going to get exceptions all over the place trying to use it. > For a historical view, you should be aware that until recently, tuples? > had no index method: That was true up until the collections ABCs were aded in 2.6 and 3.0. Prior to that, yes, the "sequence protocol" was a vague thing, and you couldn't be sure that something had an index method just because it looked like a sequence?but, by the same token, prior to that, there was no Sequence ABC mixin, so the problem wasn't relevant in the first place. From breamoreboy at yahoo.co.uk Mon Jul 21 04:06:59 2014 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Mon, 21 Jul 2014 03:06:59 +0100 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: References: Message-ID: On 20/07/2014 23:06, Ram Rachum wrote: > Why does the default `Sequence.__contains__` iterate through the items > rather than use `.index`, which may sometimes be more efficient? > > I suggest an implementation like this: > > def __contains__(self, i): > try: self.index(i) > except ValueError: return False > else: return True > What do you think? > I don't see how that can be more efficient than the naive def __contains__(self, i): for elem in self: if elem == i: return True return False What am I missing? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email is free from viruses and malware because avast! Antivirus protection is active. 
http://www.avast.com From rosuav at gmail.com Mon Jul 21 04:09:27 2014 From: rosuav at gmail.com (Chris Angelico) Date: Mon, 21 Jul 2014 12:09:27 +1000 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: References: Message-ID: On Mon, Jul 21, 2014 at 12:06 PM, Mark Lawrence wrote: > On 20/07/2014 23:06, Ram Rachum wrote: >> >> Why does the default `Sequence.__contains__` iterate through the items >> rather than use `.index`, which may sometimes be more efficient? >> >> I suggest an implementation like this: >> >> def __contains__(self, i): >> try: self.index(i) >> except ValueError: return False >> else: return True >> What do you think? >> > > I don't see how that can be more efficient than the naive > > def __contains__(self, i): > for elem in self: > if elem == i: > return True > return False > > What am I missing? If your sequence provides a more efficient index(), then __contains__ can take advantage of it. If not, it's a bit more indirection and the same result. ChrisA From abarnert at yahoo.com Mon Jul 21 04:18:44 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sun, 20 Jul 2014 19:18:44 -0700 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: References: Message-ID: <1405909124.91690.YahooMailNeo@web181001.mail.ne1.yahoo.com> On Sunday, July 20, 2014 7:06 PM, Mark Lawrence wrote: > > On 20/07/2014 23:06, Ram Rachum wrote: >> Why does the default `Sequence.__contains__` iterate through the items >> rather than use `.index`, which may sometimes be more efficient? >> >> I suggest an implementation like this: >> >> ? ? ? def __contains__(self, i): >> ? ? ? ? ? try: self.index(i) >> ? ? ? ? ? except ValueError: return False >> ? ? ? ? ? else: return True >> What do you think? >> > > I don't see how that can be more efficient than the naive > > def __contains__(self, i): > ? ? for elem in self: > ? ? ? ? if elem == i: > ? ? ? ? ? ? return True > ? ? return False > > What am I missing? Consider a blist.sortedlist (http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data structure built on a tree, skip list, etc. The index method is O(log N), so Ram's __contains__ is also O(log N). But naively iterating is obviously O(N). (In fact, it could be worse?if you don't implement a custom __iter__, and your indexing is O(log N), then the naive __contains__ will be O(N log N)?) Needless to say, blist.sortedlist implements a custom O(log N) __contains__, and so does (hopefully) every other such library on PyPI. But Ram's proposal would mean they no longer have to do so; they'll get O(log N) __contains__ for free just by implementing index. Of course that only removes one method. For example, they still have to implement a custom count method or they'll get O(N) performance from the default version. If you look at the code for any of these types, __contains__ is a tiny percentage of the implementation. So, it's not a huge win. But it's a small one. From breamoreboy at yahoo.co.uk Mon Jul 21 04:36:52 2014 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Mon, 21 Jul 2014 03:36:52 +0100 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: References: Message-ID: On 21/07/2014 03:09, Chris Angelico wrote: > On Mon, Jul 21, 2014 at 12:06 PM, Mark Lawrence wrote: >> On 20/07/2014 23:06, Ram Rachum wrote: >>> >>> Why does the default `Sequence.__contains__` iterate through the items >>> rather than use `.index`, which may sometimes be more efficient? 
>>> >>> I suggest an implementation like this: >>> >>> def __contains__(self, i): >>> try: self.index(i) >>> except ValueError: return False >>> else: return True >>> What do you think? >>> >> >> I don't see how that can be more efficient than the naive >> >> def __contains__(self, i): >> for elem in self: >> if elem == i: >> return True >> return False >> >> What am I missing? > > If your sequence provides a more efficient index(), then __contains__ > can take advantage of it. If not, it's a bit more indirection and the > same result. > > ChrisA > The question was about the default sequence.__contains__, not mine or any other sequence which may or may not provide a more efficient index(). -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email is free from viruses and malware because avast! Antivirus protection is active. http://www.avast.com From breamoreboy at yahoo.co.uk Mon Jul 21 04:40:49 2014 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Mon, 21 Jul 2014 03:40:49 +0100 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: <1405909124.91690.YahooMailNeo@web181001.mail.ne1.yahoo.com> References: <1405909124.91690.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: On 21/07/2014 03:18, Andrew Barnert wrote: > On Sunday, July 20, 2014 7:06 PM, Mark Lawrence wrote: > >>> On 20/07/2014 23:06, Ram Rachum wrote: >>> Why does the default `Sequence.__contains__` iterate through the items >>> rather than use `.index`, which may sometimes be more efficient? >>> >>> I suggest an implementation like this: >>> >>> def __contains__(self, i): >>> try: self.index(i) >>> except ValueError: return False >>> else: return True >>> What do you think? >>> >> >> I don't see how that can be more efficient than the naive >> >> def __contains__(self, i): >> for elem in self: >> if elem == i: >> return True >> return False >> >> What am I missing? > > > Consider a blist.sortedlist (http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data structure built on a tree, skip list, etc. > > The index method is O(log N), so Ram's __contains__ is also O(log N). But naively iterating is obviously O(N). (In fact, it could be worse?if you don't implement a custom __iter__, and your indexing is O(log N), then the naive __contains__ will be O(N log N)?) > > Needless to say, blist.sortedlist implements a custom O(log N) __contains__, and so does (hopefully) every other such library on PyPI. But Ram's proposal would mean they no longer have to do so; they'll get O(log N) __contains__ for free just by implementing index. > > Of course that only removes one method. For example, they still have to implement a custom count method or they'll get O(N) performance from the default version. If you look at the code for any of these types, __contains__ is a tiny percentage of the implementation. So, it's not a huge win. But it's a small one. > What has blist.sortedlist, which IIRC is one of the data structures that has been rejected as forming part of the standard library, got to do with the default sequence.__contains__ ? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email is free from viruses and malware because avast! Antivirus protection is active. 
http://www.avast.com From abarnert at yahoo.com Mon Jul 21 04:59:26 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Sun, 20 Jul 2014 19:59:26 -0700 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: References: <1405909124.91690.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <1405911566.63080.YahooMailNeo@web181006.mail.ne1.yahoo.com> On Sunday, July 20, 2014 7:40 PM, Mark Lawrence wrote: > > On 21/07/2014 03:18, Andrew Barnert wrote: >> On Sunday, July 20, 2014 7:06 PM, Mark Lawrence > wrote: >> >>>> On 20/07/2014 23:06, Ram Rachum wrote: >>>> ? Why does the default `Sequence.__contains__` iterate through the > items >>>> ? rather than use `.index`, which may sometimes be more efficient? >>>> >>>> ? I suggest an implementation like this: >>>> >>>> ? ? ? ? def __contains__(self, i): >>>> ? ? ? ? ? ? try: self.index(i) >>>> ? ? ? ? ? ? except ValueError: return False >>>> ? ? ? ? ? ? else: return True >>>> ? What do you think? >>>> >>> >>> I don't see how that can be more efficient than the naive >>> >>> def __contains__(self, i): >>> ? ? ? for elem in self: >>> ? ? ? ? ? if elem == i: >>> ? ? ? ? ? ? ? return True >>> ? ? ? return False >>> >>> What am I missing? >> >> >> Consider a blist.sortedlist > (http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data > structure built on a tree, skip list, etc. >> >> The index method is O(log N), so Ram's __contains__ is also O(log N). > But naively iterating is obviously O(N). (In fact, it could be worse?if you > don't implement a custom __iter__, and your indexing is O(log N), then the > naive __contains__ will be O(N log N)?) >> >> Needless to say, blist.sortedlist implements a custom O(log N) > __contains__, and so does (hopefully) every other such library on PyPI. But > Ram's proposal would mean they no longer have to do so; they'll get > O(log N) __contains__ for free just by implementing index. >> >> Of course that only removes one method. For example, they still have to > implement a custom count method or they'll get O(N) performance from the > default version. If you look at the code for any of these types, __contains__ is > a tiny percentage of the implementation. So, it's not a huge win. But > it's a small one. >> > > What has blist.sortedlist, which IIRC is one of the data structures that > has been rejected as forming part of the standard library, got to do > with the default sequence.__contains__ ? I think you're missing the whole point here. Sequence is an ABC?an Abstract Base Class?that's used (either by inheritance, or registration) by a wide variety of sequence classes?built-in, stdlib, or third-party. Like most of the other ABCs in the Python stdlib, it's also usable as a mixin, providing default implementations for methods that you don't want to provide in terms of those that you do. Among the mixin methods it provides is __contains__, as documented at https://docs.python.org/dev/library/collections.abc.html#collections-abstract-base-classes and implemented at http://hg.python.org/cpython/file/default/Lib/_collections_abc.py#l629 I suspect the problem is that you parsed "the default Sequence.__contains__" wrong; Ram was referring to "the default implementation of __contains__ provided as Sequence.__contains__", but you thought he was referring to "the implementation of __contains__ in the default sequence", and whatever "the default sequence means" it obviously can't be a class from a third-party module, right? 
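To make the trade-off under discussion concrete, here is a minimal sketch (not taken from the thread) of the kind of class involved: it satisfies the Sequence ABC with only __getitem__ and __len__, supplies an O(log N) index() by exploiting a sorted invariant, and today still inherits the linear-scan __contains__ discussed above, which is exactly what Ram's proposal would change.

import bisect
from collections.abc import Sequence   # plain "collections" on 3.2 and earlier

class SortedSeq(Sequence):
    """Toy sorted sequence; __getitem__ and __len__ are the only abstract methods."""
    def __init__(self, iterable=()):
        self._items = sorted(iterable)
    def __getitem__(self, index):
        return self._items[index]
    def __len__(self):
        return len(self._items)
    def index(self, value):
        # O(log N), because the underlying list is kept sorted.
        i = bisect.bisect_left(self._items, value)
        if i < len(self._items) and self._items[i] == value:
            return i
        raise ValueError(value)

s = SortedSeq(range(10**6))
# Today the inherited Sequence.__contains__ scans linearly; with the
# proposed delegation to index(), "in" would become O(log N) for free.
print(500000 in s)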
From p.f.moore at gmail.com Mon Jul 21 09:04:32 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 21 Jul 2014 08:04:32 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On 21 July 2014 01:41, Andrew Barnert wrote: > OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help. Here's the sort of thing I mean, written for newline-separated files: import sys def process(filename): """Trivial example""" return filename.lower() if __name__ == '__main__': for filename in sys.stdin: filename = process(filename) print(filename) This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me. Paul From breamoreboy at yahoo.co.uk Mon Jul 21 19:26:47 2014 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Mon, 21 Jul 2014 18:26:47 +0100 Subject: [Python-ideas] Changing `Sequence.__contains__` In-Reply-To: <1405911566.63080.YahooMailNeo@web181006.mail.ne1.yahoo.com> References: <1405909124.91690.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405911566.63080.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: On 21/07/2014 03:59, Andrew Barnert wrote: > On Sunday, July 20, 2014 7:40 PM, Mark Lawrence wrote: > >>> On 21/07/2014 03:18, Andrew Barnert wrote: >>> On Sunday, July 20, 2014 7:06 PM, Mark Lawrence >> wrote: >>> >>>>> On 20/07/2014 23:06, Ram Rachum wrote: >>>>> Why does the default `Sequence.__contains__` iterate through the >> items >>>>> rather than use `.index`, which may sometimes be more efficient? >>>>> >>>>> I suggest an implementation like this: >>>>> >>>>> def __contains__(self, i): >>>>> try: self.index(i) >>>>> except ValueError: return False >>>>> else: return True >>>>> What do you think? >>>>> >>>> >>>> I don't see how that can be more efficient than the naive >>>> >>>> def __contains__(self, i): >>>> for elem in self: >>>> if elem == i: >>>> return True >>>> return False >>>> >>>> What am I missing? >>> >>> >>> Consider a blist.sortedlist >> (http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data >> structure built on a tree, skip list, etc. >>> >>> The index method is O(log N), so Ram's __contains__ is also O(log N). >> But naively iterating is obviously O(N). (In fact, it could be worse?if you >> don't implement a custom __iter__, and your indexing is O(log N), then the >> naive __contains__ will be O(N log N)?) 
>>> >>> Needless to say, blist.sortedlist implements a custom O(log N) >> __contains__, and so does (hopefully) every other such library on PyPI. But >> Ram's proposal would mean they no longer have to do so; they'll get >> O(log N) __contains__ for free just by implementing index. >>> >>> Of course that only removes one method. For example, they still have to >> implement a custom count method or they'll get O(N) performance from the >> default version. If you look at the code for any of these types, __contains__ is >> a tiny percentage of the implementation. So, it's not a huge win. But >> it's a small one. >>> >> >> What has blist.sortedlist, which IIRC is one of the data structures that >> has been rejected as forming part of the standard library, got to do >> with the default sequence.__contains__ ? > > I think you're missing the whole point here. > > Sequence is an ABC?an Abstract Base Class?that's used (either by inheritance, or registration) by a wide variety of sequence classes?built-in, stdlib, or third-party. > > > Like most of the other ABCs in the Python stdlib, it's also usable as a mixin, providing default implementations for methods that you don't want to provide in terms of those that you do. Among the mixin methods it provides is __contains__, as documented at https://docs.python.org/dev/library/collections.abc.html#collections-abstract-base-classes and implemented at http://hg.python.org/cpython/file/default/Lib/_collections_abc.py#l629 > > I suspect the problem is that you parsed "the default Sequence.__contains__" wrong; Ram was referring to "the default implementation of __contains__ provided as Sequence.__contains__", but you thought he was referring to "the implementation of __contains__ in the default sequence", and whatever "the default sequence means" it obviously can't be a class from a third-party module, right? > Thanks for the explanation and yes I did parse it incorrectly. Strangely everything seems much clearer at 6PM rather than 3AM :) -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence --- This email is free from viruses and malware because avast! Antivirus protection is active. http://www.avast.com From greg.ewing at canterbury.ac.nz Mon Jul 21 23:12:41 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 22 Jul 2014 09:12:41 +1200 Subject: [Python-ideas] Struct format with multiple endianness markers Message-ID: <53CD8249.20302@canterbury.ac.nz> I'd like to propose a small enhancement to the struct module: Allow the endianness characters to occur more than once in the format string, rather than just as the first character. My use case is reading ESRI shapefile headers, which mix big and little endian data. This means I can't use a single struct.unpack call to read what is logically a single structure, but have to split it up and use multiple calls. If I could switch endianness part way through the format, I could unpack the whole structure with a single call. -- Greg From guido at python.org Tue Jul 22 02:54:55 2014 From: guido at python.org (Guido van Rossum) Date: Mon, 21 Jul 2014 17:54:55 -0700 Subject: [Python-ideas] Struct format with multiple endianness markers In-Reply-To: <53CD8249.20302@canterbury.ac.nz> References: <53CD8249.20302@canterbury.ac.nz> Message-ID: Simple and elegant. Can you submit a patch? One suggestion: disallow endianness marker if there isn't one at the start (i.e. default). 
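To make Greg's use case concrete, here is a rough sketch (the field layout is only indicative, not the exact ESRI header): today a header that mixes byte orders needs one unpack call per byte order, while under the proposal a single format string could switch order part-way through.

import struct

header = bytes(16)  # stand-in for the first 16 bytes of a mixed-endian header

# Status quo: two calls, one per byte order.
magic, length = struct.unpack_from('>2i', header, 0)         # big-endian fields
version, shape_type = struct.unpack_from('<2i', header, 8)   # little-endian fields

# With the proposal, something like this single call could replace them
# (hypothetical format string, rejected by struct today):
# magic, length, version, shape_type = struct.unpack('>2i<2i', header)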
On Jul 21, 2014 5:44 PM, "Greg Ewing" wrote: > I'd like to propose a small enhancement to the > struct module: Allow the endianness characters to > occur more than once in the format string, > rather than just as the first character. > > My use case is reading ESRI shapefile headers, which > mix big and little endian data. This means I can't > use a single struct.unpack call to read what is > logically a single structure, but have to split it > up and use multiple calls. > > If I could switch endianness part way through > the format, I could unpack the whole structure > with a single call. > > -- > Greg > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From 4kir4.1i at gmail.com Tue Jul 22 18:05:42 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Tue, 22 Jul 2014 20:05:42 +0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <87bnshnzu1.fsf@gmail.com> Paul Moore writes: > On 21 July 2014 01:41, Andrew Barnert > wrote: >> OK, I wrote up a draft PEP, and attached it to the bug (if that's >> not a good thing to do, apologies); you can find it at >> http://bugs.python.org/file36008/pep-newline.txt > > As a suggestion, how about adding an example of a simple nul-separated > filename filter - the sort of thing that could go in a find -print0 | > xxx | xargs -0 pipeline? If I understand it, that's one of the key > motivating examples for this change, so seeing how it's done would be > a great help. > > Here's the sort of thing I mean, written for newline-separated files: > > import sys > > def process(filename): > """Trivial example""" > return filename.lower() > > if __name__ == '__main__': > > for filename in sys.stdin: > filename = process(filename) > print(filename) > > This is also an example of why I'm struggling to understand how an > open() parameter "solves all the cases". There's no explicit open() > call here, so how do you specify the record separator? Seeing how you > propose this would work would be really helpful to me. > `find -print0 | ./tr-filename -0 | xargs -0` example implies that you can replace `sys.std*` streams without worrying about preserving `sys.__std*__` streams: #!/usr/bin/env python import io import re import sys from pathlib import Path def transform_filename(filename: str) -> str: # example """Normalize whitespace in basename.""" path = Path(filename) new_path = path.with_name(re.sub(r'\s+', ' ', path.name)) path.replace(new_path) # rename on disk if necessary return str(new_path) def SystemTextStream(bytes_stream, **kwargs): encoding = sys.getfilesystemencoding() return io.TextIOWrapper(bytes_stream, encoding=encoding, errors='surrogateescape' if encoding != 'mbcs' else 'strict', **kwargs) nl = '\0' if '-0' in sys.argv else None sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) for line in SystemTextStream(sys.stdin.detach(), newline=nl): print(transform_filename(line.rstrip(nl)), end=nl) io.TextIOWrapper() plays the role of open() in this case. 
The code assumes that `newline` parameter accepts '\0'. The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding(). Note: - `end=nl` is necessary because `print()` prints '\n' by default -- it does not use `file.newline` - `-0` option is required in the current implementation if filenames may have a trailing whitespace. It can be improved - SystemTextStream() handles undecodable in the current locale filenames i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C) - undecodable filenames are not supported on Windows. It is not clear how to pass an undecodable filename via a pipe on Windows -- perhaps `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It assumes that the short path exists and it is always encodable using mbcs. If we can control all parts of the pipeline *and* Windows API uses proper utf-16 (not ucs-2) then utf-8 can be used to pass filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be tried e.g., https://github.com/Drekin/win-unicode-console -- Akira From p.f.moore at gmail.com Tue Jul 22 19:35:58 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 22 Jul 2014 18:35:58 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87bnshnzu1.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> Message-ID: On 22 July 2014 17:05, Akira Li <4kir4.1i at gmail.com> wrote: > The example function handles Unicode whitespace to demonstrate why > opaque bytes-based cookies can't be used to represent filenames in this > case even on POSIX, though which characters are recognized depends on > sys.getfilesystemencoding(). Thanks. That's how you'd do it now. A question for the OP: how would the proposed change improve this code? Paul From 4kir4.1i at gmail.com Wed Jul 23 01:48:06 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Wed, 23 Jul 2014 03:48:06 +0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> Message-ID: <87wqb5lzux.fsf@gmail.com> Paul Moore writes: > On 22 July 2014 17:05, Akira Li > <4kir4.1i at gmail.com> wrote: >> The example function handles Unicode whitespace to demonstrate why >> opaque bytes-based cookies can't be used to represent filenames in this >> case even on POSIX, though which characters are recognized depends on >> sys.getfilesystemencoding(). > > Thanks. That's how you'd do it now. You've cut too much e.g. I wrote in [1]: >> io.TextIOWrapper() plays the role of open() in this case. The code >> assumes that `newline` parameter accepts '\0'. [1] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html > A question for the OP: how would the proposed change improve this code? 
> Paul I'm not sure who is OP in this context but I can answer: the proposed change might allow TextIOWrapper(.., newline='\0') and the code in [1] doesn't support the `-0` command-line parameter without it. -- Akira From abarnert at yahoo.com Wed Jul 23 06:24:12 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 22 Jul 2014 21:24:12 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> Message-ID: <01067774-6B85-436D-B240-83E14CBDA315@yahoo.com> On Jul 21, 2014, at 0:04, Paul Moore wrote: > On 21 July 2014 01:41, Andrew Barnert wrote: >> OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt > > As a suggestion, how about adding an example of a simple nul-separated > filename filter - the sort of thing that could go in a find -print0 | > xxx | xargs -0 pipeline? If I understand it, that's one of the key > motivating examples for this change, so seeing how it's done would be > a great help. > > Here's the sort of thing I mean, written for newline-separated files: > > import sys > > def process(filename): > """Trivial example""" > return filename.lower() > > if __name__ == '__main__': > > for filename in sys.stdin: > filename = process(filename) > print(filename) for filename in io.TextIOWrapper(sys.stdin.buffer, encoding=sys.stdin.encoding, errors=sys.stdin.errors, newline='\0'): filename = process(filename.rstrip('\0')) print(filename) I assume you wanted an rstrip('\n') in the original, so I did the equivalent here. If you want to pipe the result to another -0 tool, you also need to add end='\0' to the print, of course. If we had Nick Coghlan's separate idea of adding rewrap methods to the stream classes (not part of this proposal, but I would be happy to have it), it would be even simpler: for filename in sys.stdin.rewrap(newline='\0'): filename = process(filename.rstrip('\0')) print(filename) Anyway, this isn't perfect if, e.g., you might have illegal-as-UTF8 Latin-1 filenames hiding in your UTF8 filesystem, but neither is your code; in fact, this does exactly the same thing, except that it takes \0 terminators (so it can handle filenames with embedded newlines, or pipelines that use -print0 just because they can't be sure which tools in the chain can handle spaces). It's obviously a little more complicated than your code, but that's to be expected; it's a lot simpler than anything we can write today. (And it runs at the same speed as your code instead of 2x slower or worse.) > This is also an example of why I'm struggling to understand how an > open() parameter "solves all the cases". There's no explicit open() > call here, so how do you specify the record separator? Seeing how you > propose this would work would be really helpful to me. The open function is just a shortcut to constructing a stack of io classes; you can always construct them manually. It would be nice if some cases of that were made a little easier (again, see Nick's proposal above), but it's easy enough to live with.
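For comparison, here is roughly what the same filter has to look like today, without newline='\0' support: a hand-rolled splitter over the existing text stream. This is only a sketch (it buffers partial records across read() calls and makes no attempt at the surrogateescape handling discussed elsewhere in the thread), but it shows the boilerplate the proposal, or a rewrap-style helper, would remove:

import sys

def read_records(stream, sep='\0', chunk_size=8192):
    # Yield sep-terminated records from a text stream using plain read() calls.
    pending = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        *records, pending = pending.split(sep)
        yield from records
    if pending:
        yield pending   # trailing record without a terminator

for filename in read_records(sys.stdin):
    print(filename.lower(), end='\0')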
From abarnert at yahoo.com Wed Jul 23 06:40:54 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 22 Jul 2014 21:40:54 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87bnshnzu1.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> Message-ID: On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i at gmail.com> wrote: > Paul Moore writes: > >> On 21 July 2014 01:41, Andrew Barnert >> wrote: >>> OK, I wrote up a draft PEP, and attached it to the bug (if that's >>> not a good thing to do, apologies); you can find it at >>> http://bugs.python.org/file36008/pep-newline.txt >> >> As a suggestion, how about adding an example of a simple nul-separated >> filename filter - the sort of thing that could go in a find -print0 | >> xxx | xargs -0 pipeline? If I understand it, that's one of the key >> motivating examples for this change, so seeing how it's done would be >> a great help. >> >> Here's the sort of thing I mean, written for newline-separated files: >> >> import sys >> >> def process(filename): >> """Trivial example""" >> return filename.lower() >> >> if __name__ == '__main__': >> >> for filename in sys.stdin: >> filename = process(filename) >> print(filename) >> >> This is also an example of why I'm struggling to understand how an >> open() parameter "solves all the cases". There's no explicit open() >> call here, so how do you specify the record separator? Seeing how you >> propose this would work would be really helpful to me. > > `find -print0 | ./tr-filename -0 | xargs -0` example implies that you > can replace `sys.std*` streams without worrying about preserving > `sys.__std*__` streams: > > #!/usr/bin/env python > import io > import re > import sys > from pathlib import Path > > def transform_filename(filename: str) -> str: # example > """Normalize whitespace in basename.""" > path = Path(filename) > new_path = path.with_name(re.sub(r'\s+', ' ', path.name)) > path.replace(new_path) # rename on disk if necessary > return str(new_path) > > def SystemTextStream(bytes_stream, **kwargs): > encoding = sys.getfilesystemencoding() > return io.TextIOWrapper(bytes_stream, > encoding=encoding, > errors='surrogateescape' if encoding != 'mbcs' else 'strict', > **kwargs) > > nl = '\0' if '-0' in sys.argv else None > sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) > for line in SystemTextStream(sys.stdin.detach(), newline=nl): > print(transform_filename(line.rstrip(nl)), end=nl) Nice, much more complete example than mine. I just tried to handle as many edge cases as the original he asked about, but you handle everything. > io.TextIOWrapper() plays the role of open() in this case. The code > assumes that `newline` parameter accepts '\0'. > > The example function handles Unicode whitespace to demonstrate why > opaque bytes-based cookies can't be used to represent filenames in this > case even on POSIX, though which characters are recognized depends on > sys.getfilesystemencoding(). > > Note: > > - `end=nl` is necessary because `print()` prints '\n' by default -- it > does not use `file.newline` Actually, yes it does. 
Or, rather, print pastes on a '\n', but sys.stdout.write translates any '\n' characters to sys.stdout.writenl (a private variable that's initialized from the newline argument at construction time if it's anything other than None or ''). But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...) But it uses sys.stdout.newline, not sys.stdin.newline. > - `-0` option is required in the current implementation if filenames may > have a trailing whitespace. It can be improved > - SystemTextStream() handles undecodable in the current locale filenames > i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C) > - undecodable filenames are not supported on Windows. It is not clear > how to pass an undecodable filename via a pipe on Windows -- perhaps > `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It > assumes that the short path exists and it is always encodable using > mbcs. If we can control all parts of the pipeline *and* Windows API > uses proper utf-16 (not ucs-2) then utf-8 can be used to pass > filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be > tried e.g., https://github.com/Drekin/win-unicode-console First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of it guarantee that you can never get such unencodable filenames (sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name)? Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them? On Unix, of course, it's a real problem. From p.f.moore at gmail.com Wed Jul 23 10:11:23 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 23 Jul 2014 09:11:23 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87wqb5lzux.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87wqb5lzux.fsf@gmail.com> Message-ID: On 23 July 2014 00:48, Akira Li <4kir4.1i at gmail.com> wrote: > I'm not sure who is OP in this context but I can answer: the proposed > change might allow TextIOWrapper(.., newline='\0') and the code in [1] > doesn't support `-0` command-line parameter without it. I see. My apologies, I read that part but didn't spot what you meant. Thanks for clarifying. 
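On the Unix side, the errors='surrogateescape' setting in the SystemTextStream() helper quoted above is what keeps such undecodable names usable: bytes that don't decode are smuggled through as lone surrogates and restored on encoding. A tiny sketch of that round trip (the filename is an invented example):

raw = b'caf\xe9.txt'   # a Latin-1 style name that is not valid UTF-8
name = raw.decode('utf-8', errors='surrogateescape')
print(ascii(name))     # 'caf\udce9.txt' -- the bad byte survives as a lone surrogate
print(name.encode('utf-8', errors='surrogateescape') == raw)   # True: lossless round trip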
From p.f.moore at gmail.com Wed Jul 23 10:14:31 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 23 Jul 2014 09:14:31 +0100 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <01067774-6B85-436D-B240-83E14CBDA315@yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405635685.60281.YahooMailNeo@web181004.mail.ne1.yahoo.com> <1405641840.13158.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <01067774-6B85-436D-B240-83E14CBDA315@yahoo.com> Message-ID: On 23 July 2014 05:24, Andrew Barnert wrote: >> This is also an example of why I'm struggling to understand how an >> open() parameter "solves all the cases". There's no explicit open() >> call here, so how do you specify the record separator? Seeing how you >> propose this would work would be really helpful to me. > > The open function is just a shortcut to constructing a stack of io classes; Ah, yes, I get what you're saying now. I was reading your proposal too literally as being about "open", and forgetting you can use the underlying classes to rewrap existing streams. Thanks for your patience. Paul From 4kir4.1i at gmail.com Wed Jul 23 14:13:06 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Wed, 23 Jul 2014 16:13:06 +0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: (Andrew Barnert's message of "Tue, 22 Jul 2014 21:40:54 -0700") References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> Message-ID: <87lhrkmeos.fsf@gmail.com> Andrew Barnert writes: > On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i at gmail.com> wrote: > >> Paul Moore writes: >> >>> On 21 July 2014 01:41, Andrew Barnert >>> wrote: >>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's >>>> not a good thing to do, apologies); you can find it at >>>> http://bugs.python.org/file36008/pep-newline.txt >>> >>> As a suggestion, how about adding an example of a simple nul-separated >>> filename filter - the sort of thing that could go in a find -print0 | >>> xxx | xargs -0 pipeline? If I understand it, that's one of the key >>> motivating examples for this change, so seeing how it's done would be >>> a great help. >>> >>> Here's the sort of thing I mean, written for newline-separated files: >>> >>> import sys >>> >>> def process(filename): >>> """Trivial example""" >>> return filename.lower() >>> >>> if __name__ == '__main__': >>> >>> for filename in sys.stdin: >>> filename = process(filename) >>> print(filename) >>> >>> This is also an example of why I'm struggling to understand how an >>> open() parameter "solves all the cases". There's no explicit open() >>> call here, so how do you specify the record separator? Seeing how you >>> propose this would work would be really helpful to me. 
>> >> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you >> can replace `sys.std*` streams without worrying about preserving >> `sys.__std*__` streams: >> >> #!/usr/bin/env python >> import io >> import re >> import sys >> from pathlib import Path >> >> def transform_filename(filename: str) -> str: # example >> """Normalize whitespace in basename.""" >> path = Path(filename) >> new_path = path.with_name(re.sub(r'\s+', ' ', path.name)) >> path.replace(new_path) # rename on disk if necessary >> return str(new_path) >> >> def SystemTextStream(bytes_stream, **kwargs): >> encoding = sys.getfilesystemencoding() >> return io.TextIOWrapper(bytes_stream, >> encoding=encoding, >> errors='surrogateescape' if encoding != 'mbcs' else 'strict', >> **kwargs) >> >> nl = '\0' if '-0' in sys.argv else None >> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) >> for line in SystemTextStream(sys.stdin.detach(), newline=nl): >> print(transform_filename(line.rstrip(nl)), end=nl) > > Nice, much more complete example than mine. I just tried to handle as > many edge cases as the original he asked about, but you handle > everything. > >> io.TextIOWrapper() plays the role of open() in this case. The code >> assumes that `newline` parameter accepts '\0'. >> >> The example function handles Unicode whitespace to demonstrate why >> opaque bytes-based cookies can't be used to represent filenames in this >> case even on POSIX, though which characters are recognized depends on >> sys.getfilesystemencoding(). >> >> Note: >> >> - `end=nl` is necessary because `print()` prints '\n' by default -- it >> does not use `file.newline` > > Actually, yes it does. Or, rather, print pastes on a '\n', but > sys.stdout.write translates any '\n' characters to sys.stdout.writenl > (a private variable that's initialized from the newline argument at > construction time if it's anything other than None or ''). You are right. I've stopped reading the source for print() function at `PyFile_WriteString("\n", file);` line assuming that "\n" is not translated if newline="\0". But the current behaviour if "\0" were in "the other legal values" category (like "\r") would be to translate "\n" [1]: When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string. [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper Example: $ ./python -c 'import sys, io; sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n"); sys.stdout.write("\n\r\r\n")'| xxd 0000000: 0d0a 0d0d 0d0a ...... "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r"). In order to newline="\0" case to work, it should behave similar to newline='' or newline='\n' case instead i.e., no translation should take place, to avoid corrupting embed "\n\r" characters. My original code works as is in this case i.e., *end=nl is still necessary*. > But of course that's the newline argument to sys.stdout, and you only > changed sys.stdin, so you do need end=nl anyway. (And you wouldn't > want output translation here anyway, because that could also translate > \n' characters in the middle of a line, re-creating the same problem > we're trying to avoid...) > > But it uses sys.stdout.newline, not sys.stdin.newline. The code affects *both* sys.stdout/sys.stdin. 
Look [2]: >> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) >> for line in SystemTextStream(sys.stdin.detach(), newline=nl): >> print(transform_filename(line.rstrip(nl)), end=nl) [2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html >> - SystemTextStream() handles undecodable in the current locale filenames >> i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C) >> - undecodable filenames are not supported on Windows. It is not clear >> how to pass an undecodable filename via a pipe on Windows -- perhaps >> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It >> assumes that the short path exists and it is always encodable using >> mbcs. If we can control all parts of the pipeline *and* Windows API >> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass >> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be >> tried e.g., https://github.com/Drekin/win-unicode-console > > First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on > top of it guarantee that you can never get such unencodable filenames > (sometimes by just pretending the file doesn't exist, but if possible > by having the filesystem map it to something valid, unique, and > persistent for this session, usually the short name)? > Second, trying to solve this implies that you have some other native > (as opposed to Cygwin) tool that passes or accepts such filenames over > simple pipes (as opposed to PowerShell typed ones). Are there any? > What does, say, mingw's find do with invalid filenames if it finds > them? In short: I don't know :) To be clear, I'm talking about native Windows applications (not find/xargs on Cygwin). The goal is to process robustly *arbitrary* filenames on Windows via a pipe (SystemTextStream()) or network (bytes interface). I know that (A)nsi API (and therefore "POSIX-ish layer" that uses narrow strings such main(), fopen(), fstream is broken e.g., Thai filenames on Greek computer [3]. Unicode (W) API should enforce utf-16 in principle since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many places due to bad programming practices (based on the common wrong assumption that Unicode == UTF-16 == UCS-2) and/or bugs that are not fixed due to MS' backwards compatibility policies in the past [5]. 
[3] http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmers/ [4] http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_environments [5] http://blogs.msdn.com/b/oldnewthing/archive/2003/10/15/55296.aspx -- Akira From abarnert at yahoo.com Wed Jul 23 17:49:19 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Wed, 23 Jul 2014 08:49:19 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87oawgmfxp.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> Message-ID: <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i at gmail.com> wrote: > Andrew Barnert writes: > >> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i at gmail.com> wrote: >> >>> Paul Moore writes: >>> >>>> On 21 July 2014 01:41, Andrew Barnert >>>> wrote: >>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's >>>>> not a good thing to do, apologies); you can find it at >>>>> http://bugs.python.org/file36008/pep-newline.txt >>>> >>>> As a suggestion, how about adding an example of a simple nul-separated >>>> filename filter - the sort of thing that could go in a find -print0 | >>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key >>>> motivating examples for this change, so seeing how it's done would be >>>> a great help. >>>> >>>> Here's the sort of thing I mean, written for newline-separated files: >>>> >>>> import sys >>>> >>>> def process(filename): >>>> """Trivial example""" >>>> return filename.lower() >>>> >>>> if __name__ == '__main__': >>>> >>>> for filename in sys.stdin: >>>> filename = process(filename) >>>> print(filename) >>>> >>>> This is also an example of why I'm struggling to understand how an >>>> open() parameter "solves all the cases". There's no explicit open() >>>> call here, so how do you specify the record separator? Seeing how you >>>> propose this would work would be really helpful to me. >>> >>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you >>> can replace `sys.std*` streams without worrying about preserving >>> `sys.__std*__` streams: >>> >>> #!/usr/bin/env python >>> import io >>> import re >>> import sys >>> from pathlib import Path >>> >>> def transform_filename(filename: str) -> str: # example >>> """Normalize whitespace in basename.""" >>> path = Path(filename) >>> new_path = path.with_name(re.sub(r'\s+', ' ', path.name)) >>> path.replace(new_path) # rename on disk if necessary >>> return str(new_path) >>> >>> def SystemTextStream(bytes_stream, **kwargs): >>> encoding = sys.getfilesystemencoding() >>> return io.TextIOWrapper(bytes_stream, >>> encoding=encoding, >>> errors='surrogateescape' if encoding != 'mbcs' else 'strict', >>> **kwargs) >>> >>> nl = '\0' if '-0' in sys.argv else None >>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) >>> for line in SystemTextStream(sys.stdin.detach(), newline=nl): >>> print(transform_filename(line.rstrip(nl)), end=nl) >> >> Nice, much more complete example than mine. I just tried to handle as >> many edge cases as the original he asked about, but you handle >> everything. >> >>> io.TextIOWrapper() plays the role of open() in this case. 
The code >>> assumes that `newline` parameter accepts '\0'. >>> >>> The example function handles Unicode whitespace to demonstrate why >>> opaque bytes-based cookies can't be used to represent filenames in this >>> case even on POSIX, though which characters are recognized depends on >>> sys.getfilesystemencoding(). >>> >>> Note: >>> >>> - `end=nl` is necessary because `print()` prints '\n' by default -- it >>> does not use `file.newline` >> >> Actually, yes it does. Or, rather, print pastes on a '\n', but >> sys.stdout.write translates any '\n' characters to sys.stdout.writenl >> (a private variable that's initialized from the newline argument at >> construction time if it's anything other than None or ''). > > You are right. I've stopped reading the source for print() function at > `PyFile_WriteString("\n", file);` line assuming that "\n" is not > translated if newline="\0". But the current behaviour if "\0" were in > "the other legal values" category (like "\r") would be to translate "\n" > [1]: > > When writing output to the stream, if newline is None, any '\n' > characters written are translated to the system default line > separator, os.linesep. If newline is '' or '\n', no translation takes > place. If newline is any of the other legal values, any '\n' > characters written are translated to the given string. > > [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper > > Example: > > $ ./python -c 'import sys, io; > sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n"); > sys.stdout.write("\n\r\r\n")'| xxd > 0000000: 0d0a 0d0d 0d0a ...... > > "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r"). > > In order to newline="\0" case to work, it should behave similar to > newline='' or newline='\n' case instead i.e., no translation should take > place, to avoid corrupting embed "\n\r" characters. The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n. For the your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it? Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement? Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it? > My original code > works as is in this case i.e., *end=nl is still necessary*. >> But of course that's the newline argument to sys.stdout, and you only >> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't >> want output translation here anyway, because that could also translate >> \n' characters in the middle of a line, re-creating the same problem >> we're trying to avoid...) >> >> But it uses sys.stdout.newline, not sys.stdin.newline. > > The code affects *both* sys.stdout/sys.stdin. Look [2]: I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it. As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal. 
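To illustrate that side note with today's io module (a sketch using only
currently legal newline values, not the proposal): the reader and the
writer are usually separate TextIOWrapper objects, so they can already be
given different newline policies through the single existing parameter.

import io
import sys

# Rewrap the standard streams independently.
reader = io.TextIOWrapper(sys.stdin.detach(), newline='')       # no translation on read
writer = io.TextIOWrapper(sys.stdout.detach(), newline='\r\n')  # '\n' -> '\r\n' on write

for line in reader:                            # endings arrive untranslated
    writer.write(line.rstrip('\r\n') + '\n')   # normalized to CRLF on output
writer.flush()
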
It also might be nice to have a full set of PYTHONIOFOO env variables rather than just PYTHONIOENCODING, but again, I don't think that needs to be part of this proposal. And likewise for Nick Coghlan's rewrap method proposal on TextIOWrapper and maybe BufferedFoo. >>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) >>> for line in SystemTextStream(sys.stdin.detach(), newline=nl): >>> print(transform_filename(line.rstrip(nl)), end=nl) > > [2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html > >>> - SystemTextStream() handles undecodable in the current locale filenames >>> i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C) >>> - undecodable filenames are not supported on Windows. It is not clear >>> how to pass an undecodable filename via a pipe on Windows -- perhaps >>> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It >>> assumes that the short path exists and it is always encodable using >>> mbcs. If we can control all parts of the pipeline *and* Windows API >>> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass >>> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be >>> tried e.g., https://github.com/Drekin/win-unicode-console >> >> First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on >> top of it guarantee that you can never get such unencodable filenames >> (sometimes by just pretending the file doesn't exist, but if possible >> by having the filesystem map it to something valid, unique, and >> persistent for this session, usually the short name)? >> Second, trying to solve this implies that you have some other native >> (as opposed to Cygwin) tool that passes or accepts such filenames over >> simple pipes (as opposed to PowerShell typed ones). Are there any? >> What does, say, mingw's find do with invalid filenames if it finds >> them? > > In short: I don't know :) > > To be clear, I'm talking about native Windows applications (not > find/xargs on Cygwin). The goal is to process robustly *arbitrary* > filenames on Windows via a pipe (SystemTextStream()) or network (bytes > interface). Yes, I assumed that, I just wanted to make that clear. My point is that if there isn't already an ecosystem of tools that do so on Windows, or a recommended answer from Microsoft, we don't need to fit into existing practices here. (Actually, there _is_ a recommended answer from Microsoft, but it's "don't send encoded filenames over a binary stream, send them as an array of UTF-16 strings over PowerShell cmdlet typed pipes"--and, more generally, "don't use any ANSI interfaces except for backward compatibility reasons".) At any rate, if the filenames-over-pipes encoding problem exists on Windows, and if it's solvable, it's still outside the scope of this proposal, unless you think the documentation needs a completely worked example that shows how to interact with some Windows tool, alongside one for interacting with find -print0 on Unix. (And I don't think it does. If we want a Windows example, resource compiler string input files, which are \0-terminated UTF-16, probably serve better.) > I know that (A)nsi API (and therefore "POSIX-ish layer" that uses narrow > strings such main(), fopen(), fstream is broken e.g., Thai filenames on > Greek computer [3]. Yes, and broken in a way that people cannot easily work around except by using the UTF-16 interfaces. 
That's been Microsoft's recommended answer to the problem since NT 3.5, Win 95, and MSVCRT 3: if you want to handle all filenames, use _wmain, _wfopen, etc.--or, better, use CreateFileW instead of fopen. They never really addressed the issue of passing filenames between command-line tools at all, until PowerShell, where you pass them as a list of UTF-16 strings rather than a stream of newline-separated encoded bytes. (As a side note, I have no idea how well Python works for writing PowerShell cmdlets, but I don't think that's relevant to the current proposal.) > Unicode (W) API should enforce utf-16 in principle > since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many > places due to bad programming practices (based on the common wrong > assumption that Unicode == UTF-16 == UCS-2) and/or bugs that are not > fixed due to MS' backwards compatibility policies in the past [5]. Yes, I've run into such bugs in the past. It's even more fun when you're dealing with unterminated string with separate length interfaces. Fortunately, as far as I know, no such bugs affect reading and writing binary files, pipes, and sockets, so they don't affect us here. From 4kir4.1i at gmail.com Thu Jul 24 11:07:59 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Thu, 24 Jul 2014 13:07:59 +0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> (Andrew Barnert's message of "Wed, 23 Jul 2014 08:49:19 -0700") References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> Message-ID: <87egxbm8eo.fsf@gmail.com> Andrew Barnert writes: > On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i at gmail.com> wrote: >> Andrew Barnert writes: >>> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i at gmail.com> wrote: >>>> Paul Moore writes: >>>>> On 21 July 2014 01:41, Andrew Barnert >>>>> wrote: >>>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's >>>>>> not a good thing to do, apologies); you can find it at >>>>>> http://bugs.python.org/file36008/pep-newline.txt >>>>> >>>>> As a suggestion, how about adding an example of a simple nul-separated >>>>> filename filter - the sort of thing that could go in a find -print0 | >>>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key >>>>> motivating examples for this change, so seeing how it's done would be >>>>> a great help. 
>>>> >>>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you >>>> can replace `sys.std*` streams without worrying about preserving >>>> `sys.__std*__` streams: >>>> >>>> #!/usr/bin/env python >>>> import io >>>> import re >>>> import sys >>>> from pathlib import Path >>>> >>>> def transform_filename(filename: str) -> str: # example >>>> """Normalize whitespace in basename.""" >>>> path = Path(filename) >>>> new_path = path.with_name(re.sub(r'\s+', ' ', path.name)) >>>> path.replace(new_path) # rename on disk if necessary >>>> return str(new_path) >>>> >>>> def SystemTextStream(bytes_stream, **kwargs): >>>> encoding = sys.getfilesystemencoding() >>>> return io.TextIOWrapper(bytes_stream, >>>> encoding=encoding, >>>> errors='surrogateescape' if encoding != 'mbcs' else 'strict', >>>> **kwargs) >>>> >>>> nl = '\0' if '-0' in sys.argv else None >>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) >>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl): >>>> print(transform_filename(line.rstrip(nl)), end=nl) >>> >>> Nice, much more complete example than mine. I just tried to handle as >>> many edge cases as the original he asked about, but you handle >>> everything. >>>> >>>> io.TextIOWrapper() plays the role of open() in this case. The code >>>> assumes that `newline` parameter accepts '\0'. >>>> >>>> The example function handles Unicode whitespace to demonstrate why >>>> opaque bytes-based cookies can't be used to represent filenames in this >>>> case even on POSIX, though which characters are recognized depends on >>>> sys.getfilesystemencoding(). >>>> >>>> Note: >>>> >>>> - `end=nl` is necessary because `print()` prints '\n' by default -- it >>>> does not use `file.newline` >>> >>> Actually, yes it does. Or, rather, print pastes on a '\n', but >>> sys.stdout.write translates any '\n' characters to sys.stdout.writenl >>> (a private variable that's initialized from the newline argument at >>> construction time if it's anything other than None or ''). >> >> You are right. I've stopped reading the source for print() function at >> `PyFile_WriteString("\n", file);` line assuming that "\n" is not >> translated if newline="\0". But the current behaviour if "\0" were in >> "the other legal values" category (like "\r") would be to translate "\n" >> [1]: >> >> When writing output to the stream, if newline is None, any '\n' >> characters written are translated to the system default line >> separator, os.linesep. If newline is '' or '\n', no translation takes >> place. If newline is any of the other legal values, any '\n' >> characters written are translated to the given string. >> >> [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper >> >> Example: >> >> $ ./python -c 'import sys, io; >> sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n"); >> sys.stdout.write("\n\r\r\n")'| xxd >> 0000000: 0d0a 0d0d 0d0a ...... >> >> "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r"). >> >> In order to newline="\0" case to work, it should behave similar to >> newline='' or newline='\n' case instead i.e., no translation should take >> place, to avoid corrupting embed "\n\r" characters. > > The draft PEP discusses this. I think it would be more consistent to > translate for \0, just like \r and \r\n. I read the [draft]. No translation is a better choice here. Otherwise (at the very least) it breaks `find -print0` use case. 
[draft] http://bugs.python.org/file36008/pep-newline.txt Simple things should be simple (i.e., no translation unless special case): - binary file -- a stream of bytes: no structure, no translation on read/write - text file -- a stream of Unicode codepoints - file with fixed-length chunks: for chunk in iter(partial(file.read, chunksize), EOF): pass - file with variable-length records (aka lines) which end with a separator or EOF: no translation, no escaping (no embed separators): for line in file: pass or line = file.readline() # next(file) newline in {None, '', '\r', '\r\n'} is a (very important) special case that represents the complicated legacy behavior for text files. newline='\0' (like '\n') should be a *much simpler* case: no translation on read/write, no escaping (no embed '\0', each '\0' in the stream is a separator). newline='\0' is simple to explain: readline/next return everything until the next '\0' (including it) or EOF. It is simple to implement - no translation is required. readline(keep_end=True) keyword-only parameter and/or chomp()-like method could be added to simplify removing a trailing newline. newline in {"\N{NEL}", "\n\n", "\r\r", "\n\r"} behave like newline="\n" i.e., no translation. New *docs for writing text files*: When writing output to the stream: - if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep - if newline is '\r' or '\r\n', any '\n' characters written are translated to the given string - no translation takes place for any other newline value. The docs for binary files are simpler: No translation takes place for any newline value. The line terminator is newline parameter (default is b'\n'). The new *docs for reading text files*: When reading input from the stream: - if newline is None, universal newlines mode is enabled: lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller - if newline is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated - if newline is any other value, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated. The new behavior being more powerful is no more complex than the old one https://docs.python.org/3.4/library/io.html#io.TextIOWrapper Backwards compatibility is preserved except that newline parameter accepts more values. > For the your script, there is no reason to pass newline=nl to the > stdout replacement. The only effect that has on output is \n > replacement, which you don't want. And if we removed that effect from > the proposal, it would have no effect at all on output, so why pass > it? Keep in mind, I expect that newline='\0' does *not* translate '\n' to '\0'. If you remove newline=nl then embed \n might be corrupted i.e., it breaks `find -print0` use-case. Both newline=nl for stdout and end=nl are required here. Though (optionally) it would be nice to change `print()` so that it would use `end=file.newline or '\n'` by default instead. There is also line_buffering parameter. From the docs: If line_buffering is True, flush() is implied when a call to write contains a newline character. i.e., you might also need newline=nl to flush() the stream in time. For example, the absense of the flush() call on newline may lead to a deadlock if subprocess module is used to implement pexpect-like behavior. 
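A small sketch of that line_buffering point (current behaviour, nothing
from the proposal; the echoing child process here is made up for the
example): with line_buffering=True, a write that contains the newline
character is flushed immediately, so the child sees the line right away
instead of it sitting in the parent's buffer.

import io
import subprocess
import sys

# Hypothetical child that reads one line and echoes it back, unbuffered.
proc = subprocess.Popen(
    [sys.executable, '-u', '-c',
     'import sys; sys.stdout.write(sys.stdin.readline())'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

to_child = io.TextIOWrapper(proc.stdin, line_buffering=True)
from_child = io.TextIOWrapper(proc.stdout)

to_child.write('ping\n')   # implies flush(); without it, both sides could block
print(from_child.readline().rstrip())
to_child.close()
proc.wait()
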
There are corresponding Python issues: - text mode http://bugs.python.org/issue21332 : add line_buffering=True if bufsize=1, to avoid a deadlock (regression from Python 2 behavior) - binary mode http://bugs.python.org/issue21471 : implement line_buffering=True behavior for binary files when bufsize=1 > Do you have a use case where you need to pass a non-standard newline > to a text file/stream, but don't want newline replacement? `find -print0` use case that my code implements above. > Or is it just a matter of avoiding confusion if people accidentally > pass it for stdout when they didn't want it? See the explanation above that starts with "Simple things should be simple." >> My original code >> works as is in this case i.e., *end=nl is still necessary*. > >>> But of course that's the newline argument to sys.stdout, and you only >>> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't >>> want output translation here anyway, because that could also translate >>> \n' characters in the middle of a line, re-creating the same problem >>> we're trying to avoid...) >>> >>> But it uses sys.stdout.newline, not sys.stdin.newline. >> >> The code affects *both* sys.stdout/sys.stdin. Look [2]: > > I didn't notice that you passed it for stdout as well--as I explained > above, you don't need it, and shouldn't do it. Both newline=nl and end=nl are needed because I assume that there is no newline translation in newline='\0' case. See the explanation above. Here's the same code for context: sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) for line in SystemTextStream(sys.stdin.detach(), newline=nl): print(transform_filename(line.rstrip(nl)), end=nl) [2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html > As a side note, I think it might have been a better design to have > separate arguments for input newline, output newline, and universal > newlines mode, instead of cramming them all into one argument; for > some simple cases the current design makes things a little less > verbose, but it gets in the way for more complex cases, even today > with \r or \r\n. However, I don't think that needs to be changed as > part of this proposal. Usually different objects are used for input and output i.e., a single newline parameter allows input newlines to be different from output newlines. The newline behavior for reading and writing is different but it is closely related. Having two parameters wouldn't make the documentation simpler. Separate parameters might be useful if the same file object is used for reading and writing *and* input/output newlines are different from each other. But I don't think it is worth it to complicate the common case (separate objects). -- Akira From wolfgang.maier at biologie.uni-freiburg.de Thu Jul 24 15:45:53 2014 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Thu, 24 Jul 2014 15:45:53 +0200 Subject: [Python-ideas] os.path.argparse - optional startdir argument Message-ID: Dear all, currently, os.path.abspath(somepath) is, essentially, equivalent to os.path.normpath(os.path.join(os.getcwd(),path)). However, I'd find it useful, occasionally, to be able to specify a starting directory other than the current working directory. One such situation is when reading a config file of an application: if you encounter a relative link in such a file, you'll typically want to transform it into an absolute path using that application's working directory as opposed to your own one. 
My suggestion would be to add an optional startdir argument to abspath,
which, when provided, would be used instead of os.getcwd(). If startdir
itself is not an absolute path either, it would be turned into one
through recursion.

Currently, you have to write:

os.path.normpath(os.path.join(startdir, path))
or even
os.path.normpath(os.path.join(os.path.abspath(startdir), path))

instead of the proposed:

os.path.abspath(path, startdir)

Before posting I checked the bug tracker and found that this idea has
been brought up years ago (http://bugs.python.org/issue9882), but not
pursued further.
The patch suggested there is a bit of an oversimplification, but I have
my own one, which I could provide if someone's interested.
For issue9882 it was suggested to bring it up on python-ideas, but to
the best of my knowledge that was never done, so I'm doing it now.

Thoughts ?
Wolfgang

From wolfgang.maier at biologie.uni-freiburg.de  Thu Jul 24 16:43:59 2014
From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier)
Date: Thu, 24 Jul 2014 16:43:59 +0200
Subject: [Python-ideas] os.path.abspath - what was I thinking (was Re:
 os.path.argparse - optional startdir argument)
In-Reply-To: 
References: 
Message-ID: 

Just realized my typo: I meant os.path.abspath in the title - don't
know what I was thinking about when I typed that

From apalala at gmail.com  Thu Jul 24 17:21:15 2014
From: apalala at gmail.com (=?UTF-8?Q?Juancarlo_A=C3=B1ez?=)
Date: Thu, 24 Jul 2014 10:51:15 -0430
Subject: [Python-ideas] os.path.argparse - optional startdir argument
In-Reply-To: 
References: 
Message-ID: 

On Thu, Jul 24, 2014 at 9:15 AM, Wolfgang Maier <
wolfgang.maier at biologie.uni-freiburg.de> wrote:

> os.path.normpath(os.path.join(os.getcwd(),path)).
>
> However, I'd find it useful, occasionally, to be able to specify a
> starting directory other than the current working directory.
>

os.path.normpath(os.path.join(config_dir, path))

Better yet, use the pathlib module.

Cheers,

-- 
Juancarlo *Añez*

From wolfgang.maier at biologie.uni-freiburg.de  Thu Jul 24 17:30:58 2014
From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier)
Date: Thu, 24 Jul 2014 17:30:58 +0200
Subject: [Python-ideas] os.path.abspath - optional startdir argument
In-Reply-To: 
References: 
Message-ID: 

On 24.07.2014 17:21, Juancarlo Añez wrote:
>
> On Thu, Jul 24, 2014 at 9:15 AM, Wolfgang Maier
> <wolfgang.maier at biologie.uni-freiburg.de> wrote:
>
>     os.path.normpath(os.path.join(os.getcwd(),path)).
>
>     However, I'd find it useful, occasionally, to be able to specify a
>     starting directory other than the current working directory.
>
>
> os.path.normpath(os.path.join(config_dir, path))
>

As I said, I'm aware of this, but it's ugly and even uglier if you have
to turn config_dir into an absolute path itself.

> Better yet, use the pathlib module.
>

As it stands, the pathlib module is only provisional plus, IMO, kind of
overkill for a simple task like that.

> Cheers,
> Juancarlo *Añez*

From techtonik at gmail.com  Thu Jul 24 18:51:08 2014
From: techtonik at gmail.com (anatoly techtonik)
Date: Thu, 24 Jul 2014 19:51:08 +0300
Subject: [Python-ideas] os.path.cansymlink(path)
Message-ID: 

This is a live code from current virtualenv.py:

if hasattr(os, 'symlink'):
    logger.info('Symlinking Python bootstrap modules')

This code is wrong, because OS support for
symlinks doesn't guarantee that mounted filesystem
can do this, resulting in OSError at runtime. So, the
proper check would be to check if specific path
supports symlinking.
The idea is: os.path.cansymlink(path) - Return True if filesystem of specified path can be symlinked. Yes/No/Opinions? -- anatoly t. From apalala at gmail.com Thu Jul 24 18:53:00 2014 From: apalala at gmail.com (=?UTF-8?Q?Juancarlo_A=C3=B1ez?=) Date: Thu, 24 Jul 2014 12:23:00 -0430 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: On Thu, Jul 24, 2014 at 11:00 AM, Wolfgang Maier < wolfgang.maier at biologie.uni-freiburg.de> wrote: > As it stands, the pathlib module is only provisional plus, IMO, kind of > overkill for a simple task like that. https://docs.python.org/3/library/pathlib.html The pathlib module is "*New in version 3.4"*. There's an implementation for previous versions of Python in PyPi. https://pypi.python.org/pypi/pathlib The pathlib module is not overkill, as it provides the same functionality as os.path, but in a more OO and syntactically simpler form: (configpath / filepath).resolve() Cheers, -- Juancarlo *A?ez* -------------- next part -------------- An HTML attachment was scrubbed... URL: From geoffspear at gmail.com Thu Jul 24 19:33:28 2014 From: geoffspear at gmail.com (Geoffrey Spear) Date: Thu, 24 Jul 2014 13:33:28 -0400 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: Message-ID: On Thu, Jul 24, 2014 at 12:51 PM, anatoly techtonik wrote: > This is a live code from current virtualenv.py: > > if hasattr(os, 'symlink'): > logger.info('Symlinking Python bootstrap modules') > > This code is wrong, because OS support for > symlinks doesn't guarantee that mounted filesystem > can do this, resulting in OSError at runtime. So, the > proper check would be to check if specific path > supports symlinking. > > The idea is: > > os.path.cansymlink(path) - Return True if filesystem > of specified path can be symlinked. > > Yes/No/Opinions? Surely the third-party module you found that wrong code in has their own communication channels? From steve at pearwood.info Thu Jul 24 19:43:41 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 25 Jul 2014 03:43:41 +1000 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: Message-ID: <20140724174341.GU9112@ando> On Thu, Jul 24, 2014 at 07:51:08PM +0300, anatoly techtonik wrote: > This is a live code from current virtualenv.py: > > if hasattr(os, 'symlink'): > logger.info('Symlinking Python bootstrap modules') > > This code is wrong, because OS support for > symlinks doesn't guarantee that mounted filesystem > can do this, resulting in OSError at runtime. So, the > proper check would be to check if specific path > supports symlinking. > > The idea is: > > os.path.cansymlink(path) - Return True if filesystem > of specified path can be symlinked. > > Yes/No/Opinions? No. Even if the file system supports symlinks, doesn't mean that you can create one. You may not have privileges to create the symlink, or some other runtime error may occur. Like most other file system operations, you should guard them with a try...except, not "Look Before You Leap". -- Steven From steve at pearwood.info Thu Jul 24 19:45:31 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 25 Jul 2014 03:45:31 +1000 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: Message-ID: <20140724174531.GV9112@ando> On Thu, Jul 24, 2014 at 01:33:28PM -0400, Geoffrey Spear wrote: > On Thu, Jul 24, 2014 at 12:51 PM, anatoly techtonik wrote: > > This is a live code from current virtualenv.py: [...] 
> Surely the third-party module you found that wrong code in has their > own communication channels? Anatoly is not asking for a fix for the (possibly) buggy code in virtualenv, but suggesting an enhancement for the Python standard library. That makes this the right place to ask the question. -- Steven From dw+python-ideas at hmmz.org Thu Jul 24 19:53:16 2014 From: dw+python-ideas at hmmz.org (dw+python-ideas at hmmz.org) Date: Thu, 24 Jul 2014 17:53:16 +0000 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: Message-ID: <20140724175316.GA14260@k2> On Thu, Jul 24, 2014 at 01:33:28PM -0400, Geoffrey Spear wrote: > > This code is wrong, because OS support for symlinks doesn't > > guarantee that mounted filesystem can do this, resulting in OSError > > at runtime. So, the proper check would be to check if specific path > > supports symlinking. > > > > The idea is: > > > > os.path.cansymlink(path) - Return True if filesystem > > of specified path can be symlinked. > > > > Yes/No/Opinions? -1, since there is no sane way to guarantee a FS operation will succeed without trying it in most cases. Even if a filesystem (driver) supports the operation, the filesystem (data) might be exhausted, e.g. inode count, max directory entries, ... And if not that, then e.g. in the case of NFS or CIFS, while the protocol might support the operation, there is no mechanism for a particular server implementation to communicate that it does not support it. Even if none of this were true, it also introduces a race between a program testing the state of the filesystem, and that state changing, e.g. due to USB disconnect, or a lazy unmount succeeding, or.. David > > Surely the third-party module you found that wrong code in has their > own communication channels? > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From storchaka at gmail.com Thu Jul 24 21:24:40 2014 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 24 Jul 2014 22:24:40 +0300 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: 24.07.14 16:45, Wolfgang Maier ???????(??): > currently, os.path.abspath(somepath) is, essentially, equivalent to > > os.path.normpath(os.path.join(os.getcwd(),path)). Actually currently posixpath.abspath() is more complicated and ntpath.abspath() has totally different implementation. > Currently, you have to write: > > os.path.normpath(os.path.join(startdir, path)) > or even > os.path.normpath(os.path.join(os.path.abspath(startdir), path)) Yes, it is natural and straightforward way. You can define your own function if you need this often. > Before posting I checked the bug tracker and found that this idea has > been brought up years ago (http://bugs.python.org/issue9882), but not > pursued further. > The patch suggested there is a bit of an oversimplification, but I have > my own one, which I could provide if someone's interested. > For issue9882 it was suggested to bring it up on python-ideas, but to > the best of my knowledge that was never done, so I'm doing it now. > > Thoughts ? This will add to abspath() a feature which is unrelated to the purpose of abspath(). This will complicate the API without significant benefit. I'm -1. 
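For reference, the "define your own function" route mentioned above is a
one-liner plus boilerplate; a sketch (the name abspath_from is made up
here, and the behaviour shown is only the obvious normpath/join
composition, not a reviewed implementation):

import os

def abspath_from(path, start=None):
    """Like os.path.abspath(), but resolve relative to *start* instead of
    the current working directory."""
    if start is None:
        start = os.getcwd()
    return os.path.normpath(os.path.join(os.path.abspath(start), path))

# Example (POSIX-style result):
# abspath_from('app.conf', '/etc/myapp')  ->  '/etc/myapp/app.conf'
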
From wolfgang.maier at biologie.uni-freiburg.de Thu Jul 24 23:32:01 2014 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Thu, 24 Jul 2014 23:32:01 +0200 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: On 24.07.2014 21:24, Serhiy Storchaka wrote: > 24.07.14 16:45, Wolfgang Maier ???????(??): >> currently, os.path.abspath(somepath) is, essentially, equivalent to >> >> os.path.normpath(os.path.join(os.getcwd(),path)). > > Actually currently posixpath.abspath() is more complicated and > ntpath.abspath() has totally different implementation. I know, that's why I wrote "essentially" and "equivalent" instead of "is implemented as". It's still easy to patch both the posixpath and the ntpath version though. > >> Currently, you have to write: >> >> os.path.normpath(os.path.join(startdir, path)) >> or even >> os.path.normpath(os.path.join(os.path.abspath(startdir), path)) > > Yes, it is natural and straightforward way. You can define your own > function if you need this often. > I'm not saying, this is a must-have in Python. I don't have a problem with sticking to normpath, just thought it's a tiny change giving some benefit in readability. >> Before posting I checked the bug tracker and found that this idea has >> been brought up years ago (http://bugs.python.org/issue9882), but not >> pursued further. >> The patch suggested there is a bit of an oversimplification, but I have >> my own one, which I could provide if someone's interested. >> For issue9882 it was suggested to bring it up on python-ideas, but to >> the best of my knowledge that was never done, so I'm doing it now. >> >> Thoughts ? > > This will add to abspath() a feature which is unrelated to the purpose > of abspath(). This will complicate the API without significant benefit. > It would not complicate the API all that much. If you don't want to use the argument, just ignore it, it would be optional. As pointed out in the bug tracker issue, it is also not without precedence, os.path.relpath has a start argument already. From tjreedy at udel.edu Thu Jul 24 23:49:51 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 24 Jul 2014 17:49:51 -0400 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: On 7/24/2014 11:30 AM, Wolfgang Maier wrote: > On 24.07.2014 17:21, Juancarlo A?ez wrote: >> Better yet, use the pathlib module. Thank for the reminder. I took a better look at it. > As it stands, the pathlib module is only provisional plus, 'Provisional' means that there *could* be a few api changes that would break code. The module is not going away. > IMO, kind of overkill for a simple task like that. Overkill? import pathlib as path import os.path as path are equally easy The 'simple task' combines joining, normalizing, and 'absoluting'. pathlib.Path joins, Path.resolve normalizes and 'absolutes'. Together they combine the functions of os.path.join, os.path.abspath and os.path.normpath, with a nicer syntax, and with OS awareness. >>> path.Path('../../../Python27/lib', 'ast.py').resolve() WindowsPath('C:/Programs/Python27/Lib/ast.py') If one starts with a Path object, as would be typical, one can use '/' to join, as JuanCarlo mentioned. 
>>> base = path.Path('.') >>> (base / '../../../Python27/lib' / 'ast.py').resolve() WindowsPath('C:/Programs/Python27/Lib/ast.py') -- Terry Jan Reedy From apalala at gmail.com Fri Jul 25 00:29:44 2014 From: apalala at gmail.com (=?UTF-8?Q?Juancarlo_A=C3=B1ez?=) Date: Thu, 24 Jul 2014 17:59:44 -0430 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: On a related topic... What's missing in Python 3.4, is that most modules with functions or methods that take file names or file paths as parameters are not pathlib-aware, so a mandatory str(mypahtlibpath) is required. For example, you cannot do: f = open(Path(_file__) / 'app.conf') It will fail. But pathlib as part of the standard lib is new, so it's OK. It will take time how know where in the module dependency hierarchy it should belong. Cheers, On Thu, Jul 24, 2014 at 5:19 PM, Terry Reedy wrote: > On 7/24/2014 11:30 AM, Wolfgang Maier wrote: > >> On 24.07.2014 17:21, Juancarlo A?ez wrote: >> > > Better yet, use the pathlib module. >>> >> > Thank for the reminder. I took a better look at it. > > > As it stands, the pathlib module is only provisional plus, >> > > 'Provisional' means that there *could* be a few api changes that would > break code. The module is not going away. > > > IMO, kind of overkill for a simple task like that. >> > > Overkill? > > import pathlib as path > import os.path as path > > are equally easy > > The 'simple task' combines joining, normalizing, and 'absoluting'. > pathlib.Path joins, Path.resolve normalizes and 'absolutes'. Together they > combine the functions of os.path.join, os.path.abspath and > os.path.normpath, with a nicer syntax, and with OS awareness. > > >>> path.Path('../../../Python27/lib', 'ast.py').resolve() > WindowsPath('C:/Programs/Python27/Lib/ast.py') > > If one starts with a Path object, as would be typical, one can use '/' to > join, as JuanCarlo mentioned. > > >>> base = path.Path('.') > >>> (base / '../../../Python27/lib' / 'ast.py').resolve() > WindowsPath('C:/Programs/Python27/Lib/ast.py') > > -- > Terry Jan Reedy > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Juancarlo *A?ez* -------------- next part -------------- An HTML attachment was scrubbed... URL: From rymg19 at gmail.com Fri Jul 25 03:54:00 2014 From: rymg19 at gmail.com (Ryan) Date: Thu, 24 Jul 2014 20:54:00 -0500 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: Instead, though, you'd do: f = (Path(_file__) / 'app.conf').open() https://docs.python.org/3/library/pathlib.html#pathlib.Path.open "Juancarlo A?ez" wrote: >On a related topic... > >What's missing in Python 3.4, is that most modules with functions or >methods that take file names or file paths as parameters are not >pathlib-aware, so a mandatory str(mypahtlibpath) is required. > >For example, you cannot do: > >f = open(Path(_file__) / 'app.conf') > >It will fail. > >But pathlib as part of the standard lib is new, so it's OK. > >It will take time how know where in the module dependency hierarchy it >should belong. > >Cheers, > > > >On Thu, Jul 24, 2014 at 5:19 PM, Terry Reedy wrote: > >> On 7/24/2014 11:30 AM, Wolfgang Maier wrote: >> >>> On 24.07.2014 17:21, Juancarlo A?ez wrote: >>> >> >> Better yet, use the pathlib module. >>>> >>> >> Thank for the reminder. 
I took a better look at it. >> >> >> As it stands, the pathlib module is only provisional plus, >>> >> >> 'Provisional' means that there *could* be a few api changes that >would >> break code. The module is not going away. >> >> >> IMO, kind of overkill for a simple task like that. >>> >> >> Overkill? >> >> import pathlib as path >> import os.path as path >> >> are equally easy >> >> The 'simple task' combines joining, normalizing, and 'absoluting'. >> pathlib.Path joins, Path.resolve normalizes and 'absolutes'. Together >they >> combine the functions of os.path.join, os.path.abspath and >> os.path.normpath, with a nicer syntax, and with OS awareness. >> >> >>> path.Path('../../../Python27/lib', 'ast.py').resolve() >> WindowsPath('C:/Programs/Python27/Lib/ast.py') >> >> If one starts with a Path object, as would be typical, one can use >'/' to >> join, as JuanCarlo mentioned. >> >> >>> base = path.Path('.') >> >>> (base / '../../../Python27/lib' / 'ast.py').resolve() >> WindowsPath('C:/Programs/Python27/Lib/ast.py') >> >> -- >> Terry Jan Reedy >> >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > >-- >Juancarlo *A?ez* > > >------------------------------------------------------------------------ > >_______________________________________________ >Python-ideas mailing list >Python-ideas at python.org >https://mail.python.org/mailman/listinfo/python-ideas >Code of Conduct: http://python.org/psf/codeofconduct/ -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From apalala at gmail.com Fri Jul 25 04:43:13 2014 From: apalala at gmail.com (=?UTF-8?Q?Juancarlo_A=C3=B1ez?=) Date: Thu, 24 Jul 2014 22:13:13 -0430 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: On Thu, Jul 24, 2014 at 9:24 PM, Ryan wrote: > Instead, though, you'd do: > > f = (Path(_file__) / 'app.conf').open() > Indeed, that solves the "right place in the module dependency hierarchy" thing, and it even has an "econding=" kwarg! I hadn't paid attention to it. Sorry. Problem solved! Thanks! -- Juancarlo *A?ez* -------------- next part -------------- An HTML attachment was scrubbed... URL: From wolfgang.maier at biologie.uni-freiburg.de Fri Jul 25 09:40:32 2014 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Fri, 25 Jul 2014 09:40:32 +0200 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: <53D209F0.8040302@biologie.uni-freiburg.de> On 24.07.2014 23:49, Terry Reedy wrote: > On 7/24/2014 11:30 AM, Wolfgang Maier wrote: >> On 24.07.2014 17:21, Juancarlo A?ez wrote: > >>> Better yet, use the pathlib module. > > Thank for the reminder. I took a better look at it. > >> As it stands, the pathlib module is only provisional plus, > > 'Provisional' means that there *could* be a few api changes that would > break code. The module is not going away. > The 3.4 docs explicitly mention the possibility: Note: This module has been included in the standard library on a provisional basis. Backwards incompatible changes (up to and including removal of the package) may occur if deemed necessary by the core developers. >> IMO, kind of overkill for a simple task like that. > > Overkill? 
> > import pathlib as path > import os.path as path > > are equally easy > > The 'simple task' combines joining, normalizing, and 'absoluting'. > pathlib.Path joins, Path.resolve normalizes and 'absolutes'. Together > they combine the functions of os.path.join, os.path.abspath and > os.path.normpath, with a nicer syntax, and with OS awareness. > Yes, the syntax is nicer *now*, but with my proposed change to os.path.abspath things would look quite similar: pathlib version now: > >>> path.Path('../../../Python27/lib', 'ast.py').resolve() os.path as proposed: os.path.abspath('ast.py', '../../../Python27/lib') So I would see this as an argument for the proposal rather than against it. Even if the pathlib module will stay, I am not sure whether that should exclude enhancements in overlapping parts of os.path. Anyway, that whole thing is not that important to me, so if nobody finds it useful, then let's stick to the status quo. From ncoghlan at gmail.com Fri Jul 25 10:18:08 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 25 Jul 2014 18:18:08 +1000 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: On 25 Jul 2014 08:33, "Juancarlo A?ez" wrote: > > On a related topic... > > What's missing in Python 3.4, is that most modules with functions or methods that take file names or file paths as parameters are not pathlib-aware, so a mandatory str(mypahtlibpath) is required. > > For example, you cannot do: > > f = open(Path(_file__) / 'app.conf') > > It will fail. Just like ipaddress, this is a deliberate design choice that avoids coupling low level APIs to a high level convenience library. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tjreedy at udel.edu Fri Jul 25 10:26:15 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 25 Jul 2014 04:26:15 -0400 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: <53D209F0.8040302@biologie.uni-freiburg.de> References: <53D209F0.8040302@biologie.uni-freiburg.de> Message-ID: On 7/25/2014 3:40 AM, Wolfgang Maier wrote: > Yes, the syntax is nicer *now*, but with my proposed change to > os.path.abspath things would look quite similar: > > pathlib version now: >> >>> path.Path('../../../Python27/lib', 'ast.py').resolve() > > os.path as proposed: > os.path.abspath('ast.py', '../../../Python27/lib') > > So I would see this as an argument for the proposal rather than against it. > > Even if the pathlib module will stay, I am not sure whether that should > exclude enhancements in overlapping parts of os.path. I understand your reasoning. But it leaves out the following. When a feature is added, use of the feature makes code incompatible with previous versions. So we generally like new features to add more than this one would. If you look hard enough, I am sure that you can find an addition that by this criteria should not have been added. If you do, I will probably agree that it should not have been. > Anyway, that whole thing is not that important to me, so if nobody finds > it useful, then let's stick to the status quo. It is a matter of useful enough to justify the cost. 
-- 
Terry Jan Reedy

From me+python at ixokai.io  Fri Jul 25 09:54:59 2014
From: me+python at ixokai.io (Stephen Hansen)
Date: Fri, 25 Jul 2014 00:54:59 -0700
Subject: [Python-ideas] os.path.abspath - optional startdir argument
In-Reply-To: <53D209F0.8040302@biologie.uni-freiburg.de>
References: <53D209F0.8040302@biologie.uni-freiburg.de>
Message-ID: 

Warning: Lurker...

On Fri, Jul 25, 2014 at 12:40 AM, Wolfgang Maier <
wolfgang.maier at biologie.uni-freiburg.de> wrote:

> Yes, the syntax is nicer *now*, but with my proposed change to
> os.path.abspath things would look quite similar:
>
> pathlib version now:
>
> >>> path.Path('../../../Python27/lib', 'ast.py').resolve()
>
> os.path as proposed:
> os.path.abspath('ast.py', '../../../Python27/lib')
>
> So I would see this as an argument for the proposal rather than against it.
>

Am I the only one who sees this as completely crazy-talk and an argument
against? The idea that os.path.xxx(y,z) could be interpreted as z+y then
resolved is a completely horrible API.

The pathlib version keeps the parts of the path in order, and then
resolves them, and where things are, well, they're clear.

The proposed os.path modification reads, to me, as nonsense. Half of me
wants to say it is asking to find the absolute path of ast.py and find
this additional component in relation to that absolute path, the other
half of me just shuts down.

"os.path.abspath('ast.py', '../../../Python27/lib')" speaks in no way to
me of absoluteness. There's two relative paths in its arguments and no
sensible way of interpreting that comes forth, to me.

It may make sense if you were adding a keyword-only argument, maybe,
(maaaybe), but as an example of how they are similar it is IMHO a stark
sign against why it's ever so not similar and in fact, bad.

The pathlib version conveys a fairly clear idea of where the files it's
talking about are located. The proposal is just weird.

/relurk.

From g.rodola at gmail.com  Fri Jul 25 11:54:07 2014
From: g.rodola at gmail.com (Giampaolo Rodola')
Date: Fri, 25 Jul 2014 11:54:07 +0200
Subject: [Python-ideas] os.path.cansymlink(path)
In-Reply-To: <20140724175316.GA14260@k2>
References: <20140724175316.GA14260@k2>
Message-ID: 

-1 for me as well given the reasons mentioned above.

On 24 Jul 2014 20:01, wrote:

> On Thu, Jul 24, 2014 at 01:33:28PM -0400, Geoffrey Spear wrote:
>
> > > This code is wrong, because OS support for symlinks doesn't
> > > guarantee that mounted filesystem can do this, resulting in OSError
> > > at runtime. So, the proper check would be to check if specific path
> > > supports symlinking.
> > >
> > > The idea is:
> > >
> > > os.path.cansymlink(path) - Return True if filesystem
> > > of specified path can be symlinked.
> > >
> > > Yes/No/Opinions?
>
> -1, since there is no sane way to guarantee a FS operation will succeed
> without trying it in most cases. Even if a filesystem (driver) supports
> the operation, the filesystem (data) might be exhausted, e.g. inode
> count, max directory entries, ... And if not that, then e.g. in the case
> of NFS or CIFS, while the protocol might support the operation, there is
> no mechanism for a particular server implementation to communicate that
> it does not support it.
>
> Even if none of this were true, it also introduces a race between a
> program testing the state of the filesystem, and that state changing,
> e.g. due to USB disconnect, or a lazy unmount succeeding, or..
> > > David > > > > > Surely the third-party module you found that wrong code in has their > > own communication channels? > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From techtonik at gmail.com Fri Jul 25 12:08:23 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Fri, 25 Jul 2014 13:08:23 +0300 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: Message-ID: On Thu, Jul 24, 2014 at 8:33 PM, Geoffrey Spear wrote: > On Thu, Jul 24, 2014 at 12:51 PM, anatoly techtonik wrote: >> This is a live code from current virtualenv.py: >> >> if hasattr(os, 'symlink'): >> logger.info('Symlinking Python bootstrap modules') >> >> This code is wrong, because OS support for >> symlinks doesn't guarantee that mounted filesystem >> can do this, resulting in OSError at runtime. So, the >> proper check would be to check if specific path >> supports symlinking. >> >> The idea is: >> >> os.path.cansymlink(path) - Return True if filesystem >> of specified path can be symlinked. >> >> Yes/No/Opinions? > > Surely the third-party module you found that wrong code in has their > own communication channels? I can't get how does this comment contributes to the idea. Care to explain? -- anatoly t. From storchaka at gmail.com Fri Jul 25 12:12:07 2014 From: storchaka at gmail.com (Serhiy Storchaka) Date: Fri, 25 Jul 2014 13:12:07 +0300 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: 25.07.14 00:32, Wolfgang Maier ???????(??): > On 24.07.2014 21:24, Serhiy Storchaka wrote: >> 24.07.14 16:45, Wolfgang Maier ???????(??): > I'm not saying, this is a must-have in Python. I don't have a problem > with sticking to normpath, just thought it's a tiny change giving some > benefit in readability. To me explicit well known join() and normpath() are more readable then unexpected second argument to abspath(). >> This will add to abspath() a feature which is unrelated to the purpose >> of abspath(). This will complicate the API without significant benefit. > It would not complicate the API all that much. If you don't want to use > the argument, just ignore it, it would be optional. As pointed out in > the bug tracker issue, it is also not without precedence, > os.path.relpath has a start argument already. The inverse of two-argument relpath() is join(), not abspath(). Two-argument relpath() is only the way to compute relative patch between two patches. This is essential functionality, there is no redundancy. But two-argument abspath() will be redundant. Not every one-line function should be added to the stdlib. And I found only 10 usages of normpath(join()) combination in Python source three (including 4 in tests and 4 in PC build script, therefore only 2 in the stdlib itself) against 205 usages of abspath(). 
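[For reference, the two-argument abspath() being discussed can already be written as a tiny helper over the existing os.path functions. The helper below is only a sketch of the proposed semantics (the optional start parameter is the proposal, not an existing API):]

import os

def abspath(path, start=None):
    # Sketch of the proposed two-argument form: interpret *path* relative
    # to *start* (defaulting to the current directory), then normalize.
    # This is just normpath(join(start, path)) with *start* itself made
    # absolute first.
    if start is None:
        start = os.getcwd()
    return os.path.normpath(os.path.join(os.path.abspath(start), path))

# e.g. abspath('ast.py', '../../../Python27/lib') gives the same result as
# os.path.normpath(os.path.join(os.path.abspath('../../../Python27/lib'), 'ast.py'))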
From techtonik at gmail.com Fri Jul 25 12:17:13 2014 From: techtonik at gmail.com (anatoly techtonik) Date: Fri, 25 Jul 2014 13:17:13 +0300 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: <20140724175316.GA14260@k2> References: <20140724175316.GA14260@k2> Message-ID: On Thu, Jul 24, 2014 at 8:53 PM, wrote: > On Thu, Jul 24, 2014 at 01:33:28PM -0400, Geoffrey Spear wrote: > >> > This code is wrong, because OS support for symlinks doesn't >> > guarantee that mounted filesystem can do this, resulting in OSError >> > at runtime. So, the proper check would be to check if specific path >> > supports symlinking. >> > >> > The idea is: >> > >> > os.path.cansymlink(path) - Return True if filesystem >> > of specified path can be symlinked. >> > >> > Yes/No/Opinions? > > -1, since there is no sane way to guarantee a FS operation will succeed > without trying it in most cases. > > Even if a filesystem (driver) supports > the operation, the filesystem (data) might be exhausted, e.g. inode > count, max directory entries, ... And if not that, then e.g. in the case > of NFS or CIFS, while the protocol might support the operation, there is > no mechanism for a particular server implementation to communicate that > it does not support it. You do realize that high level program logic changes depending on the fact that FS supports symlinks or not. It is not "an exceptional" case as you've presented it. This is not a replacement for os.symlink(), but doc link to this function will help people avoid this trap and runtime errors in future. -- anatoly t. From phd at phdru.name Fri Jul 25 12:23:40 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 25 Jul 2014 12:23:40 +0200 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: Message-ID: <20140725102340.GA4015@phdru.name> Hi! On Thu, Jul 24, 2014 at 07:51:08PM +0300, anatoly techtonik wrote: > This is a live code from current virtualenv.py: > > if hasattr(os, 'symlink'): > logger.info('Symlinking Python bootstrap modules') > > This code is wrong, because OS support for > symlinks doesn't guarantee that mounted filesystem > can do this, resulting in OSError at runtime. So, the > proper check would be to check if specific path > supports symlinking. > > The idea is: > > os.path.cansymlink(path) - Return True if filesystem > of specified path can be symlinked. > > Yes/No/Opinions? Such function (if it would be a function) should return one of three answer, not two. Something like: None - I don't know if the OS/fs support symlinks because another OSError occurred during test (perhaps not enough rights to write to the path); False- the path clearly doesn't support symlinks; True - the path positively supports symlinks. Implement the function in a module and publish the module at PyPI. Warn users (in accompanying docs) that even if a path supports (or doesn't support) symlinks this says nothing about any subpath of the path because a subpath can be a mount of a different fs. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From rosuav at gmail.com Fri Jul 25 12:53:02 2014 From: rosuav at gmail.com (Chris Angelico) Date: Fri, 25 Jul 2014 20:53:02 +1000 Subject: [Python-ideas] os.path.cansymlink(path) In-Reply-To: References: <20140724175316.GA14260@k2> Message-ID: On Fri, Jul 25, 2014 at 8:17 PM, anatoly techtonik wrote: > You do realize that high level program logic changes depending on the fact > that FS supports symlinks or not. 
It is not "an exceptional" case as you've > presented it. There are plenty of other non-exceptional cases that are signalled with exceptions. It's part of the EAFP model and its reliability. If high level logic changes, it surely can be like this: try: os.symlink(whatever) except OSError: alternate_logic() How would asking the path if it's symlinkable (by the way, do you ask the source or destination?) improve that? ChrisA From wolfgang.maier at biologie.uni-freiburg.de Fri Jul 25 13:31:27 2014 From: wolfgang.maier at biologie.uni-freiburg.de (Wolfgang Maier) Date: Fri, 25 Jul 2014 13:31:27 +0200 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: <53D2400F.2080709@biologie.uni-freiburg.de> On 25.07.2014 12:12, Serhiy Storchaka wrote: > 25.07.14 00:32, Wolfgang Maier ???????(??): >> On 24.07.2014 21:24, Serhiy Storchaka wrote: >>> 24.07.14 16:45, Wolfgang Maier ???????(??): >> I'm not saying, this is a must-have in Python. I don't have a problem >> with sticking to normpath, just thought it's a tiny change giving some >> benefit in readability. > > To me explicit well known join() and normpath() are more readable then > unexpected second argument to abspath(). > Ok, I just seem to think differently than all of you. whenever I need this functionality (and just like for the stdlib, it's less often than regular abspath), I think: oh, this must be addressable with abspath, then after a moment I realize there is no start option like in relpath. Then I consult the docs where I find this for abspath: "Return a normalized absolutized version of the pathname path. On most platforms, this is equivalent to calling the function normpath() as follows: normpath(join(os.getcwd(), path))." From which the solution is apparent. Never have I thought first, ah, that's a job for normpath. Maybe that's because I can't remember a single case where I used normpath for anything else in my code, so I'm kind of thinking about normpath as a low-level function needed a lot in os.path, but typically not needed much outside of it because there are higher-level functions like abspath that do the normalization in the background. It's interesting to learn that I seem to be quite alone with this view, but that's ok, I'm sure it will help me remember normpath next time :) > Not every one-line function should be added to the stdlib. And I found > only 10 usages of normpath(join()) combination in Python source three > (including 4 in tests and 4 in PC build script, therefore only 2 in the > stdlib itself) against 205 usages of abspath(). > From antoine at python.org Fri Jul 25 15:41:50 2014 From: antoine at python.org (Antoine Pitrou) Date: Fri, 25 Jul 2014 09:41:50 -0400 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: References: Message-ID: <53D25E9E.6060209@python.org> Le 25/07/2014 04:18, Nick Coghlan a ?crit : > > For example, you cannot do: > > > > f = open(Path(_file__) / 'app.conf') > > > > It will fail. > > Just like ipaddress, this is a deliberate design choice that avoids > coupling low level APIs to a high level convenience library. Note the gap could be crossed without coupling by introducing a __path__ protocol (or something similar for IP addresses). Regards Antoine. 
From ncoghlan at gmail.com Fri Jul 25 16:01:50 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 26 Jul 2014 00:01:50 +1000 Subject: [Python-ideas] os.path.abspath - optional startdir argument In-Reply-To: <53D25E9E.6060209@python.org> References: <53D25E9E.6060209@python.org> Message-ID: On 25 July 2014 23:41, Antoine Pitrou wrote: > Le 25/07/2014 04:18, Nick Coghlan a ?crit : > >> > For example, you cannot do: >> > >> > f = open(Path(_file__) / 'app.conf') >> > >> > It will fail. >> >> Just like ipaddress, this is a deliberate design choice that avoids >> coupling low level APIs to a high level convenience library. > > > Note the gap could be crossed without coupling by introducing a __path__ > protocol (or something similar for IP addresses). My main concern with that approach is the sheer number of places we'd need to touch. I'm not implacably opposed to the idea, I just strongly suspect it wouldn't be worth the hassle to save the str() calls, as: - explicit str() calls would still be needed for anyone still supporting older versions of Python - explicit str() calls would still be needed when dealing with third party libraries that don't support the new protocol yet - we wouldn't get to simplify any of the low level APIs, since they'd still need to support str objects - the new protocol would be strictly additive. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From abarnert at yahoo.com Fri Jul 25 20:29:11 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 25 Jul 2014 11:29:11 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87egxbm8eo.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> <87egxbm8eo.fsf@gmail.com> Message-ID: <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i at gmail.com> wrote: > > Andrew Barnert writes: > >> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i at gmail.com> wrote: >>> In order to newline="\0" case to work, it should behave? >>> similar to >>> newline='' or newline='\n' case instead i.e., no >>> translation should take >>> place, to avoid corrupting embed "\n\r" characters. >> >> The draft PEP discusses this. I think it would be more consistent to >> translate for \0, just like \r and \r\n. > > I read the [draft]. No translation is a better choice here. Otherwise >> (at the very least) it breaks `find -print0` use case. No it doesn't. The only reason it breaks your code is that you add newline='\0' to your stdout wrapper as well as your stdin wrapper. If you just passed '', it would not do anything. And this is exactly parallel with the existing case with, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate. 
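(For context, a `find -print0` consumer has to do something like the following today, without any newline='\0' support; this is only an illustration of the status quo, buffering the whole input and splitting on NUL by hand, which is exactly the streaming limitation being argued about:)

import sys

def names_from_print0(stream=sys.stdin):
    # Yield the NUL-separated file names produced by `find -print0`.
    # readline() cannot stop at '\0' today, so the whole input is read
    # and split by hand instead of being streamed line by line.
    for name in stream.read().split('\0'):
        if name:
            yield name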
As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that. (It's less of an issue for binary files, because binary files can't take a newline parameter at all today, and because "no output translation" has been part of the definition of what "binary file" means all the way back to Python 1.x.) > Backwards compatibility is preserved except that newline parameter > accepts more values. The same is true with the draft proposal. You've basically copied the exact same thing, except for what happens on output for newlines other than None, '', '\n', '\r', and '\r\n' in text files. Since that case cannot arise today, there are no backward compatibility issues. Your version is only a small change to the documentation and a small change to the code, but my version is an even smaller change to the documentation and no change to the code, so you can't argue this from a conservative point of view. > >> For the your script, there is no reason to pass newline=nl to the >> stdout replacement. The only effect that has on output is \n >> replacement, which you don't want. And if we removed that effect from >> the proposal, it would have no effect at all on output, so why pass >> it? > > Keep in mind, I expect that newline='\0' does *not* translate > '\n' to > '\0'. If you remove newline=nl then embed \n might be corrupted? No, it's only corrupted if you _pass_ newline=nl. If you instead passed, e.g., newline='', nothing could possibly corrupted. > i.e., it > breaks `find -print0` use-case. Both newline=nl for stdout and end=nl > are required here. Though (optionally) it would be nice to change > `print()` so that it would use `end=file.newline or '\n'` by default > instead. That might be a nice change; I'll mention it in the next draft. But I think it's better to keep the changes as small and conservative as possible, so unless there's an upswell of support for it, I think anything that isn't actually necessary to solving the problem should be left out. > There is also line_buffering parameter. From the docs: > > ? If line_buffering is True, flush() is implied when a call to write > ? contains a newline character. The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken. But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three... I'll need to think this through (and reread the code) this weekend; thanks for bringing it up. >> Do you have a use case where you need to pass a non-standard newline >> to a text file/stream, but don't want newline replacement? > > `find -print0` use case that my code implements above. > >> Or is it just a matter of avoiding confusion if people accidentally >> pass it for stdout when they didn't want it? > > See the explanation above that starts with "Simple things should be > simple." 
I still don't understand your point here, and just repeating it isn't helping. You're making simple things _less_ simple than they are in the draft, requiring slightly more change to the documentation and to the code and slightly more for people to understand just to allow them to pass an unnecessary parameter. That doesn't sound like an argument from simplicity to me. But line_buffering definitely might be a good argument, in which case it doesn't matter how good this one is. From abarnert at yahoo.com Fri Jul 25 22:46:29 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 25 Jul 2014 13:46:29 -0700 Subject: [Python-ideas] Expose __abstractmethods__/__isabstractmethod__ in abc Message-ID: <1406321189.64180.YahooMailNeo@web181003.mail.ne1.yahoo.com> An ABC built with abc.ABCMeta has a member?__abstractmethods__, which is an iterable of all of the abstract methods defined in that ABC that need to be overridden.?A method decorated with @abstractmethod gets a member __isabstractmethod__=True, which is how the ABC (or, rather, the interpreter) checks whether each of its abstract methods have been overridden. However, they're part of a private protocol used by the CPython implementation of the abc module, which means any third-party code that uses them isn't portable or future-proof. Which is a shame, because there are all kinds of things you can build easily on top of abc with them, but would have to duplicate most of the module (and the special interpreter support for it) without them. The simplest change is just to document these two members as part of the module interface. Alternatively, there could be functions abc.isabstractmethod(method) and abc.abstractmethods(cls), which would allow for implementations that didn't use the same protocol internally but supported the same interface. Examples where this could be useful: *?Write a runtime check-and-register function. * Explicitly test that a set of classes have implemented their ABC(s) without having to know how to correctly instantiate them. * Write a generic @autoabc decorator or similar for creating ABCs that are automatically virtual base classes of any type with the right methods, instead of doing it manually in each class (as collections.abc does today), like Go interfaces, C++ auto concepts, traditional ObjC checked informal protocols, etc.). * Build a signature-checking (rather than just name-checking) ABC (like https://github.com/apieum/ducktype). * Build a simplified version of PyProtocols-like adapters on top of abc instead of PyProtocols. Some of these might belong in the stdlib (in fact, it looks like http://bugs.python.org/issue9731 basically covers the first two), in which case they don't need to be implementable from outside? but that's certainly not true for all of them. (Without a precise algorithm for "compatible signature" or a standardized notion of adaptation, stdlib inclusion isn't even sensible for the last two, much less a good idea.) So, outside libraries should be able to implement them. 
From ncoghlan at gmail.com Sat Jul 26 01:28:23 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 26 Jul 2014 09:28:23 +1000 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> <87egxbm8eo.fsf@gmail.com> <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: On 26 Jul 2014 04:33, "Andrew Barnert" wrote: > As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that. It's potentially still worth spelling out that idea as a Rejected Alternative in the PEP. A draft design that separates them may help clarify the concepts being conflated more effectively than simply describing them, even if your own pragmatic assessment is "too much pain for not enough gain". Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sat Jul 26 01:34:28 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 26 Jul 2014 09:34:28 +1000 Subject: [Python-ideas] Expose __abstractmethods__/__isabstractmethod__ in abc In-Reply-To: <1406321189.64180.YahooMailNeo@web181003.mail.ne1.yahoo.com> References: <1406321189.64180.YahooMailNeo@web181003.mail.ne1.yahoo.com> Message-ID: The additional module level functions sound like a good idea to me. I see it as similar to the functools.singledispatch driven addition to expose a way to obtain a cache validity token for the virtual object graph. I thought "__isabstractmethod__" was already documented though, since we rely on it to control the pass through behaviour of property and other decorators like classmethod and staticmethod. If it isn't, that's really a bug rather than an RFE. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From 4kir4.1i at gmail.com Sat Jul 26 04:13:24 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Sat, 26 Jul 2014 06:13:24 +0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> <87egxbm8eo.fsf@gmail.com> <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <87vbqklvej.fsf@gmail.com> I've added a patch that demonstrates "no translation" for alternative newlines behavior http://bugs.python.org/issue1152248#msg224016 Andrew Barnert writes: > On Thursday, July 24, 2014 2:08 AM, Akira Li > <4kir4.1i at gmail.com> wrote: > >> > Andrew Barnert writes: >> >>> On Jul 23, 2014, at 5:13, Akira Li >>> <4kir4.1i at gmail.com> wrote: >>>> In order to newline="\0" case to work, it should behave? > >>>> similar to >>>> newline='' or newline='\n' case instead i.e., no >>>> translation should take >>>> place, to avoid corrupting embed "\n\r" characters. >>> >>> The draft PEP discusses this. I think it would be more consistent to >>> translate for \0, just like \r and \r\n. >> >> I read the [draft]. No translation is a better choice here. Otherwise >>> (at the very least) it breaks `find -print0` use case. > > No it doesn't. The only reason it breaks your code is that you add > newline='\0' to your stdout wrapper as well as your stdin wrapper. If > you just passed '', it would not do anything. And this is exactly > parallel with the existing case with, e.g., trying to pass through a > classic-Mac file full of '\r'-delimited strings that might contain > embedded '\n' characters that you don't want to translate. I won't repeat it several times but as you've already found out newline='\0' for stdout (at the very least) can be useful for line_buffering=True behavior. ... >> There is also line_buffering parameter. From the docs: >> >> ? If line_buffering is True, flush() is implied when a call to write >> ? contains a newline character. > > The way this is actually defined seems broken to me; IIRC (I'll check > the code later) it flushes on any '\r', and on any translated > \n'. So, it's doing the wrong thing with '\r' in most modes, and with > \n' in '' mode on non-Unix systems. So my thought was, just leave it > broken. Yes. I've found at least one issue http://bugs.python.org/issue22069 > But now that I think about it, the existing code can only flush > excessively, never insufficiently, and that's probably a property > worth preserving. So maybe there _is_ a reason to pass newline for > output without translation after all. In other words, the parameter > may actually conflate _four_ things, not just three... > > I'll need to think this through (and reread the code) this weekend; > thanks for bringing it up. -- Akira From abarnert at yahoo.com Sat Jul 26 04:22:26 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 25 Jul 2014 19:22:26 -0700 Subject: [Python-ideas] Expose __abstractmethods__/__isabstractmethod__ in abc In-Reply-To: References: <1406321189.64180.YahooMailNeo@web181003.mail.ne1.yahoo.com> Message-ID: On Jul 25, 2014, at 16:34, Nick Coghlan wrote: > The additional module level functions sound like a good idea to me. 
I see it as similar to the functools.singledispatch driven addition to expose a way to obtain a cache validity token for the virtual object graph. > Another reason the function seems better than the attribute. The only advantage to the attribute is that we could document that it already existed in 3.4 and earlier, instead of just documenting a new function. And if that was desirable we could always add that as a note to the documentation of the function. > I thought "__isabstractmethod__" was already documented though, since we rely on it to control the pass through behaviour of property and other decorators like classmethod and staticmethod. If it isn't, that's really a bug rather than an RFE. > You're right; I was looking for it in the wrong place. It doesn't document that abstract methods created by @abstractmethod have that attribute, but it does document that if you want to create an abstract method manually you have to set it, and shows how @property both uses and exposes the attribute, which is more than enough. So, never mind that part. -------------- next part -------------- An HTML attachment was scrubbed... URL: From 4kir4.1i at gmail.com Sat Jul 26 04:24:16 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Sat, 26 Jul 2014 06:24:16 +0400 Subject: [Python-ideas] Iterating non-newline-separated files should be easier References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> <87egxbm8eo.fsf@gmail.com> <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> Message-ID: <87tx64luwf.fsf@gmail.com> Nick Coghlan writes: > On 26 Jul 2014 04:33, "Andrew Barnert" > > wrote: >> As I've said before, I don't really like the design for '\r' and '\r\n', > or the fact that three separate notions (universal-newlines flag, line > ending for readline, and output translation for write) are all conflated > into one idea and crammed into one parameter, but I think it's probably too > late and too radical to change that. > > It's potentially still worth spelling out that idea as a Rejected > Alternative in the PEP. A draft design that separates them may help clarify > the concepts being conflated more effectively than simply describing them, > even if your own pragmatic assessment is "too much pain for not enough > gain". > It can't be in the rejected ideas because it is the current behavior for io.TextIOWrapper(newline=..) and it will never change (in Python 3) due to backward compatibility. As I understand Andrew doesn't like that *newline* parameter does too much: - *newline* parameter turns on/off universal newline mode - it may specify the line separator e.g., newline='\r' - it specifies whether newline translation happens e.g., newline='' turns it off - together with *line_buffering*, it may enable flush() if newline is written It is unrelated to my proposal [1] that shouldn't change the old behavior if newline in {None, '', '\n', '\r', '\r\n'}. 
[1] http://bugs.python.org/issue1152248#msg224016 -- Akira From abarnert at yahoo.com Sat Jul 26 06:03:30 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 25 Jul 2014 21:03:30 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87tx64luwf.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> <87egxbm8eo.fsf@gmail.com> <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> <87tx64luwf.fsf@gmail.com> Message-ID: <3D379B63-8016-4130-87F1-7242E11CBF59@yahoo.com> On Jul 25, 2014, at 19:24, Akira Li <4kir4.1i at gmail.com> wrote: > Nick Coghlan writes: > >> On 26 Jul 2014 04:33, "Andrew Barnert" >> >> wrote: >>> As I've said before, I don't really like the design for '\r' and '\r\n', >> or the fact that three separate notions (universal-newlines flag, line >> ending for readline, and output translation for write) are all conflated >> into one idea and crammed into one parameter, but I think it's probably too >> late and too radical to change that. >> >> It's potentially still worth spelling out that idea as a Rejected >> Alternative in the PEP. A draft design that separates them may help clarify >> the concepts being conflated more effectively than simply describing them, >> even if your own pragmatic assessment is "too much pain for not enough >> gain". > > It can't be in the rejected ideas because it is the current behavior for > io.TextIOWrapper(newline=..) and it will never change (in Python 3) due > to backward compatibility. That's exactly why changing it would be a "rejected idea". It certainly doesn't hurt to document the fact that we thought about it and decided not to change it for backward compatibility reasons. > As I understand Andrew doesn't like that *newline* parameter does too > much: > > - *newline* parameter turns on/off universal newline mode > - it may specify the line separator e.g., newline='\r' > - it specifies whether newline translation happens e.g., newline='' > turns it off > - together with *line_buffering*, it may enable flush() if newline is > written Exactly. And the fourth one only indirectly; "newline" flushing doesn't exactly mean _either_ of "\n" or the newline argument. And the related-but-definitely-not-the-same newlines attribute makes it even more confusing. (I've found bug reports with both Guido and Nick confused into thinking that newline was available as an attribute after construction; what hope do the rest of us have?) But the reality is, it rarely affects real-life programs, so it's definitely not worth breaking compatibility over. And it's still a whole lot cleaner than the 2.x design despite having a lot more details to deal with. 
From abarnert at yahoo.com Sat Jul 26 06:09:41 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 25 Jul 2014 21:09:41 -0700 Subject: [Python-ideas] Iterating non-newline-separated files should be easier In-Reply-To: <87vbqklvej.fsf@gmail.com> References: <1405626785.14773.YahooMailNeo@web181006.mail.ne1.yahoo.com> <1405812535.88058.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405817834.46270.YahooMailNeo@web181001.mail.ne1.yahoo.com> <1405828738.93713.YahooMailNeo@web181005.mail.ne1.yahoo.com> <1405903292.28722.YahooMailNeo@web181006.mail.ne1.yahoo.com> <87bnshnzu1.fsf@gmail.com> <87oawgmfxp.fsf@gmail.com> <3E31BD23-A903-4B48-82E5-6DDA4AA2E15C@yahoo.com> <87egxbm8eo.fsf@gmail.com> <1406312951.60505.YahooMailNeo@web181001.mail.ne1.yahoo.com> <87vbqklvej.fsf@gmail.com> Message-ID: On Jul 25, 2014, at 19:13, Akira Li <4kir4.1i at gmail.com> wrote: > I've added a patch that demonstrates "no translation" for alternative > newlines behavior http://bugs.python.org/issue1152248#msg224016 Having taken a better look at the line buffering code, I now agree with you that this is necessary; otherwise we'd have to make a much bigger change to the implementation (which I don't think we want). When I update the draft PEP I'll change that and add a rationale (this also makes the rationale for "no translation for binary files" and for "only readnl is exposed, not writenl" a lot simpler). I'll also change it in my C patch (which I hope to be able to clean up and upload this weekend). > Andrew Barnert > writes: > >> On Thursday, July 24, 2014 2:08 AM, Akira Li >> <4kir4.1i at gmail.com> wrote: >> >>>> Andrew Barnert writes: >>> >>>> On Jul 23, 2014, at 5:13, Akira Li >>>> <4kir4.1i at gmail.com> wrote: >>>>> In order to newline="\0" case to work, it should behave >> >>>>> similar to >>>>> newline='' or newline='\n' case instead i.e., no >>>>> translation should take >>>>> place, to avoid corrupting embed "\n\r" characters. >>>> >>>> The draft PEP discusses this. I think it would be more consistent to >>>> translate for \0, just like \r and \r\n. >>> >>> I read the [draft]. No translation is a better choice here. Otherwise >>>> (at the very least) it breaks `find -print0` use case. >> >> No it doesn't. The only reason it breaks your code is that you add >> newline='\0' to your stdout wrapper as well as your stdin wrapper. If >> you just passed '', it would not do anything. And this is exactly >> parallel with the existing case with, e.g., trying to pass through a >> classic-Mac file full of '\r'-delimited strings that might contain >> embedded '\n' characters that you don't want to translate. > > I won't repeat it several times but as you've already found out newline='\0' > for stdout (at the very least) can be useful for line_buffering=True > behavior. > > ... >>> There is also line_buffering parameter. From the docs: >>> >>> If line_buffering is True, flush() is implied when a call to write >>> contains a newline character. >> >> The way this is actually defined seems broken to me; IIRC (I'll check >> the code later) it flushes on any '\r', and on any translated >> \n'. So, it's doing the wrong thing with '\r' in most modes, and with >> \n' in '' mode on non-Unix systems. So my thought was, just leave it >> broken. > > Yes. I've found at least one issue http://bugs.python.org/issue22069 > >> But now that I think about it, the existing code can only flush >> excessively, never insufficiently, and that's probably a property >> worth preserving. 
So maybe there _is_ a reason to pass newline for >> output without translation after all. In other words, the parameter >> may actually conflate _four_ things, not just three... >> >> I'll need to think this through (and reread the code) this weekend; >> thanks for bringing it up. > > > -- > Akira > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From ronaldoussoren at mac.com Sat Jul 26 10:03:13 2014 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Sat, 26 Jul 2014 10:03:13 +0200 Subject: [Python-ideas] PEP 447 revisited Message-ID: <5BB87CC4-F31B-4213-AAAC-0C0CE738460C@mac.com> Hi, After a long hiatus I?ve done some updates to PEP 447 which proposes a new metaclass method that?s used in attribute resolution for normal and super instances. There have been two updates, the first one is trivial, the proposed method has a new name (__getdescriptor__). The second change to the PEP is to add a Python pseudo implementation of object.__getattribute__ and super.__getattribute__ to make it easier to reason about the impact of the proposal. I?d like to move forward with this PEP, either to rejection or (preferable) to acceptance of the feature in some form. That said, I?m not too attached to the exact proposal, it just seems to be the minmal clean change that can be used to implement my use case for this. My use case is fairly obscure, but hopefully it is not too obscure :-). The problem I have at the moment is basically that it is not possible to hook into the attribute resolution algorithm used by super.__getattribute__ and this PEP would solve that. My use case for this PEP is PyObjC, the PEP would make it possible to remove a custom ?super? class used in that project. I?ll try to sketch what PyObjC does and why the current super is a problem in the paragraphs below. PyObjC is a bridge between Python and Objective-C. The bit that?s important for this discussion is that every Objective-C object and class can be proxied into Python code. That?s done completely dynamically: the PyObjC bridge reads information from the Objective-C runtime (using a public API for that) to determine which classes are present there and which methods those classes have. Accessing the information on methods is done on demand, the bridge only looks for a method when Python code tries to access it. There are two reasons for that, the first one is performance: extracting method information eagerly is too expensive because there are a lot of them and Python code typically uses only a fraction of them. The second reason is more important than that: Objective-C classes are almost as dynamic Python classes and it is possible to add new methods at runtime either by loading add-on bundles (?Categories?) or by interacting with the Objective-C runtime. Both are actually used by Apple?s frameworks. There are no hooks that can be used to detect there modification, the only option I?ve found that can be used to keep the Python representation of a class in sync with the Objective-C representation is to eagerly scan classes every time they might be accessed, for example in the __getattribute__ of the proxies for Objective-C classes and instances. 
That?s terribly expensive, and still leaves a race condition when using super, in code like the code below the superclass might grow a new method between the call to the python method and using the superclass method: def myMethod(self): self.objectiveCMethod() super().otherMethod() Because of this the current PyObjC release doesn?t even try to keep the Python representation in sync, but always lazily looks for methods (but with a cache for all found methods to avoid the overhead of looking for them when methods are used multiple times). As that definitely will break builtin.super PyObjC also includes a custom super implementation that must be used. That works, but can lead to confusing errors when users forget to add ?from objc import super? to modules that use super in subclasses from Objective-C classes. The performance impact on CPython seemed to be minimal according to the testing I performed last year, but I have no idea what the impact would be on other implementation (in particular PyPy?s JIT). A link to the PEP: http://legacy.python.org/dev/peps/pep-0447/ I?d really appreciate further feedback on this PEP. Regards, Ronald From ncoghlan at gmail.com Sat Jul 26 13:59:35 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 26 Jul 2014 21:59:35 +1000 Subject: [Python-ideas] PEP 447 revisited In-Reply-To: <5BB87CC4-F31B-4213-AAAC-0C0CE738460C@mac.com> References: <5BB87CC4-F31B-4213-AAAC-0C0CE738460C@mac.com> Message-ID: On 26 July 2014 18:03, Ronald Oussoren wrote: > Hi, > > After a long hiatus I?ve done some updates to PEP 447 which proposes a new metaclass method that?s used in attribute resolution for normal and super instances. There have been two updates, the first one is trivial, the proposed method has a new name (__getdescriptor__). The second change to the PEP is to add a Python pseudo implementation of object.__getattribute__ and super.__getattribute__ to make it easier to reason about the impact of the proposal. > > I?d like to move forward with this PEP, either to rejection or (preferable) to acceptance of the feature in some form. That said, I?m not too attached to the exact proposal, it just seems to be the minmal clean change that can be used to implement my use case for this. > > My use case is fairly obscure, but hopefully it is not too obscure :-). The problem I have at the moment is basically that it is not possible to hook into the attribute resolution algorithm used by super.__getattribute__ and this PEP would solve that. The use case seems reasonable to me, and the new slot name seems much easier to document and explain than the previous iteration. I'd like to see the PEP look into the inspect module and consider the consequences for the functions there (e.g. another way for getattr_static to miss methods), as well as any possible implications for dir(). We had a few issues there with the enum changes for 3.4 (and some more serious ones with Argument Clinic) - it's not a blocker, it's just nice going in to have some idea of the impact going in :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From castironpi at gmail.com Sat Jul 26 22:51:16 2014 From: castironpi at gmail.com (Aaron Brady) Date: Sat, 26 Jul 2014 13:51:16 -0700 (PDT) Subject: [Python-ideas] Mutating while iterating Message-ID: Hi, I asked about the inconsistency of the "RuntimeError" being raised when mutating a container while iterating over it here [1], "set and dict iteration" on Aug 16, 2012. 
[1] http://www.gossamer-threads.com/lists/python/python/1004659 Continuing new from the bugs issue page [2]: [2] http://bugs.python.org/issue22084 Other prior discussion [3] [4]: [3] http://bugs.python.org/issue19332 [4] http://bugs.python.org/issue6017 Thanks Mr. Storchaka for your comments. The new documentation didn't help. The current behavior is still a rare but inconsistent silent error. An implementation sketch in pseudocode might simplify the endeavor [5]: [5] http://home.comcast.net/~castironpi-misc/irc-0168%20mutating%20while%20iterating%20markup.html I gather we wouldn't want to pursue the "custom" data container, option "2e": we would still need both "malloc/free" and a reference count. -------------- next part -------------- An HTML attachment was scrubbed... URL: From oreilldf at gmail.com Sat Jul 26 22:59:16 2014 From: oreilldf at gmail.com (Dan O'Reilly) Date: Sat, 26 Jul 2014 16:59:16 -0400 Subject: [Python-ideas] Better integration of multiprocessing with asyncio Message-ID: I think it would be helpful for folks using the asyncio module to be able to make non-blocking calls to objects in the multiprocessing module more easily. While some use-cases for using multiprocessing can be replaced with ProcessPoolExecutor/run_in_executor, there are others that cannot; more advanced usages of multiprocessing.Pool aren't supported by ProcessPoolExecutor (initializer/initargs, contexts, etc.), and other multiprocessing classes like Lock and Queue have blocking methods that could be made into coroutines. Consider this (extremely contrived, but use your imagination) example of a asyncio-friendly Queue: import asyncio import time def do_proc_work(q, val, val2): time.sleep(3) # Imagine this is some expensive CPU work. ok = val + val2 print("Passing {} to parent".format(ok)) q.put(ok) # The Queue can be used with the normal blocking API, too. item = q.get() print("got {} back from parent".format(item)) def do_some_async_io_task(): # Imagine there's some kind of asynchronous I/O # going on here that utilizes asyncio. asyncio.sleep(5) @asyncio.coroutine def do_work(q): loop.run_in_executor(ProcessPoolExecutor(), do_proc_work, q, 1, 2) do_some_async_io_task() item = yield from q.coro_get() # Non-blocking get that won't affect our io_task print("Got {} from worker".format(item)) item = item + 25 yield from q.coro_put(item) if __name__ == "__main__": q = AsyncProcessQueue() # This is our new asyncio-friendly version of multiprocessing.Queue loop = asyncio.get_event_loop() loop.run_until_complete(do_work(q)) I have seen some rumblings about a desire to do this kind of integration on the bug tracker (http://bugs.python.org/issue10037#msg162497 and http://bugs.python.org/issue9248#msg221963) though that discussion is specifically tied to merging the enhancements from the Billiard library into multiprocessing.Pool. Are there still plans to do that? If so, should asyncio integration with multiprocessing be rolled into those plans, or does it make sense to pursue it separately? Even more generally, do people think this kind of integration is a good idea to begin with? I know using asyncio is primarily about *avoiding* the headaches of concurrent threads/processes, but there are always going to be cases where CPU-intensive work is going to be required in a primarily I/O-bound application. The easier it is to for developers to handle those use-cases, the better, IMO. 
Note that the same sort of integration could be done with the threading module, though I think there's a fairly limited use-case for that; most times you'd want to use threads over processes, you could probably just use non-blocking I/O instead. Thanks, Dan -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at 2sn.net Sun Jul 27 01:34:16 2014 From: python at 2sn.net (Alexander Heger) Date: Sun, 27 Jul 2014 09:34:16 +1000 Subject: [Python-ideas] adding dictionaries Message-ID: Is there a good reason for not implementing the "+" operator for dict.update()? A = dict(a=1, b=1) B = dict(a=2, c=2) B += A B dict(a=1, b=1, c=2) That is B += A should be equivalent to B.update(A) It would be even better if there was also a regular "addition" operator that is equivalent to creating a shallow copy and then calling update(): C = A + B should equal to C = dict(A) C.update(B) (obviously not the same as C = B + A, but the "+" operator is not commutative for most operations) class NewDict(dict): def __add__(self, other): x = dict(self) x.update(other) return x def __iadd__(self, other): self.update(other) My apologies if this has been posted before but with a quick google search I could not see it; if it was, could you please point me to the thread? I assume this must be a design decision that has been made a long time ago, but it is not obvious to me why. From steve at pearwood.info Sun Jul 27 03:17:39 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 27 Jul 2014 11:17:39 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: <20140727011739.GC9112@ando> On Sun, Jul 27, 2014 at 09:34:16AM +1000, Alexander Heger wrote: > Is there a good reason for not implementing the "+" operator for dict.update()? [...] > That is > > B += A > > should be equivalent to > > B.update(A) You're asking the wrong question. The burden is not on people to justify *not* adding new features, the burden is on somebody to justify adding them. Is there a good reason for implementing the + operator as dict.update? We can already write B.update(A), under what circumstances would you spell it B += A instead, and why? > It would be even better if there was also a regular "addition" > operator that is equivalent to creating a shallow copy and then > calling update(): > > C = A + B > > should equal to > > C = dict(A) > C.update(B) That would be spelled C = dict(A, **B). I'd be more inclined to enhance the dict constructor and update methods so you can provide multiple arguments: dict(A, B, C, D) # Rather than A + B + C + D D.update(A, B, C) # Rather than D += A + B + C > My apologies if this has been posted before but with a quick google > search I could not see it; if it was, could you please point me to the > thread? I assume this must be a design decision that has been made a > long time ago, but it is not obvious to me why. I'm not sure it's so much a deliberate decision not to implement dictionary addition, as uncertainty as to what dictionary addition ought to mean. 
Given two dicts: A = {'a': 1, 'b': 1} B = {'a': 2, 'c': 2} I can think of at least four things that C = A + B could do: # add values, defaulting to 0 for missing keys C = {'a': 3, 'b': 1, 'c': 2} # add values, raising KeyError if there are missing keys # shallow copy of A, update with B C = {'a': 2, 'b': 1, 'c': 2} # shallow copy of A, insert keys from B only if not already in A C = {'a': 1, 'b': 1, 'c': 2} Except for the second one, I've come across people suggesting that each of the other three is the one and only obvious thing for A+B to do. -- Steven From tjreedy at udel.edu Sun Jul 27 03:27:04 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 26 Jul 2014 21:27:04 -0400 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: On 7/26/2014 7:34 PM, Alexander Heger wrote: > Is there a good reason for not implementing the "+" operator for dict.update()? As you immediate noticed, this is an incoherent request as stated. A op B should be a new object. > A = dict(a=1, b=1) > B = dict(a=2, c=2) > B += A Since "B op= A" is *defined* as resulting in B having the value of "B op A", with the operations possibly being done in-place if B is mutable, we would first have to define addition on dicts. > B > dict(a=1, b=1, c=2) > > That is > > B += A > > should be equivalent to > > B.update(A) > > It would be even better if there was also a regular "addition" > operator that is equivalent to creating a shallow copy and then > calling update(): You have this backwards. Dict addition would have to come first, and there are multiple possible and contextually useful definitions. The idea of choosing anyone of them as '+' has been rejected. As indicated, augmented dict addition would follow from the choice of dict addition. It would not necessarily be equivalent to .update. The addition needed to make this true would be asymmetric, like catenation. But unlike sequence catenation, information is erased in that items in the updated dict get subtracted. Conceptually, update is replacement rather than just addition. > My apologies if this has been posted Multiple dict additions have been proposed and discussed here on python-ideas and probably on python-list. -- Terry Jan Reedy From python at 2sn.net Sun Jul 27 04:18:48 2014 From: python at 2sn.net (Alexander Heger) Date: Sun, 27 Jul 2014 12:18:48 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: Dear Terry, > As you immediate noticed, this is an incoherent request as stated. A op B > should be a new object. > [...] > You have this backwards. Dict addition would have to come first, and there > are multiple possible and contextually useful definitions. The idea of > choosing anyone of them as '+' has been rejected. I had set out wanting to have a short form for dict.update(), hence the apparently reversed order. The proposed full addition does the same after first making a shallow copy; the operator interface does define both __iadd__ and __add__. > As indicated, augmented dict addition would follow from the choice of dict > addition. It would not necessarily be equivalent to .update. The addition > needed to make this true would be asymmetric, like catenation. yes. As I note, most uses of the "+" operator in Python are not symmetric (commutative). > But unlike sequence catenation, information is erased in that items in the > updated dict get subtracted. Conceptually, update is replacement rather than > just addition. 
Yes., not being able to have multiple identical keys is the nature of dictionaries. This does not mean that things should not be done in the best way they can be done. I was considering the set union operator "|" but that is also symmetric and may cause more confusion. Another consideration suggested was the element-wise addition in some form. This is the natural way of doing things for structures of fixed length like arrays, including numpy arrays. And this is being accepted. In contrast, for data structures with variable length, like lists and strings, "addition" is concatenation, and what I would see the most natural extension for dictionaries hence is to add the keys (not the key values or values to each other), with the common behavior to overwrite existing keys. You do have the choice in which order you write the operation. It would be funny if addition of strings would add their ASCII, char, or unicode values and return the resulting string. Sorry for bringing up, again, the old discussion of how to add dictionaries as part of this. -Alexander On 27 July 2014 11:27, Terry Reedy wrote: > On 7/26/2014 7:34 PM, Alexander Heger wrote: >> >> Is there a good reason for not implementing the "+" operator for >> dict.update()? > > > As you immediate noticed, this is an incoherent request as stated. A op B > should be a new object. > > >> A = dict(a=1, b=1) >> B = dict(a=2, c=2) >> B += A > > > Since "B op= A" is *defined* as resulting in B having the value of "B op A", > with the operations possibly being done in-place if B is mutable, we would > first have to define addition on dicts. > > >> B >> dict(a=1, b=1, c=2) >> >> That is >> >> B += A >> >> should be equivalent to >> >> B.update(A) >> >> It would be even better if there was also a regular "addition" >> operator that is equivalent to creating a shallow copy and then >> calling update(): > > > You have this backwards. Dict addition would have to come first, and there > are multiple possible and contextually useful definitions. The idea of > choosing anyone of them as '+' has been rejected. > > As indicated, augmented dict addition would follow from the choice of dict > addition. It would not necessarily be equivalent to .update. The addition > needed to make this true would be asymmetric, like catenation. > > But unlike sequence catenation, information is erased in that items in the > updated dict get subtracted. Conceptually, update is replacement rather than > just addition. > > >> My apologies if this has been posted > > > Multiple dict additions have been proposed and discussed here on > python-ideas and probably on python-list. > > -- > Terry Jan Reedy > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From guido at python.org Sun Jul 27 04:39:23 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 26 Jul 2014 19:39:23 -0700 Subject: [Python-ideas] Better integration of multiprocessing with asyncio In-Reply-To: References: Message-ID: I actually know very little about multiprocessing (have never used it) but I imagine the way you normally interact with multiprocessing is using a synchronous calls that talk to the subprocesses and their work queues and so on, right? In the asyncio world you would put that work in a thread and then use run_in_executor() with a thread executor -- the thread would then be managing the subprocesses and talking to them. 
While you are waiting for that thread to complete your other coroutines will still work. Unless you want to rewrite the communication and process management as coroutines, but that sounds like a lot of work. On Sat, Jul 26, 2014 at 1:59 PM, Dan O'Reilly wrote: > I think it would be helpful for folks using the asyncio module to be able > to make non-blocking calls to objects in the multiprocessing module more > easily. While some use-cases for using multiprocessing can be replaced with > ProcessPoolExecutor/run_in_executor, there are others that cannot; more > advanced usages of multiprocessing.Pool aren't supported by > ProcessPoolExecutor (initializer/initargs, contexts, etc.), and other > multiprocessing classes like Lock and Queue have blocking methods that > could be made into coroutines. > > Consider this (extremely contrived, but use your imagination) example of a > asyncio-friendly Queue: > > import asyncio > import time > > def do_proc_work(q, val, val2): > time.sleep(3) # Imagine this is some expensive CPU work. > ok = val + val2 > print("Passing {} to parent".format(ok)) > q.put(ok) # The Queue can be used with the normal blocking API, too. > item = q.get() > print("got {} back from parent".format(item)) > > def do_some_async_io_task(): > # Imagine there's some kind of asynchronous I/O > # going on here that utilizes asyncio. > asyncio.sleep(5) > > @asyncio.coroutine > def do_work(q): > loop.run_in_executor(ProcessPoolExecutor(), > do_proc_work, q, 1, 2) > do_some_async_io_task() > item = yield from q.coro_get() # Non-blocking get that won't affect > our io_task > print("Got {} from worker".format(item)) > item = item + 25 > yield from q.coro_put(item) > > > if __name__ == "__main__": > q = AsyncProcessQueue() # This is our new asyncio-friendly version of > multiprocessing.Queue > loop = asyncio.get_event_loop() > loop.run_until_complete(do_work(q)) > > I have seen some rumblings about a desire to do this kind of integration > on the bug tracker (http://bugs.python.org/issue10037#msg162497 and > http://bugs.python.org/issue9248#msg221963) though that discussion is > specifically tied to merging the enhancements from the Billiard library > into multiprocessing.Pool. Are there still plans to do that? If so, should > asyncio integration with multiprocessing be rolled into those plans, or > does it make sense to pursue it separately? > > Even more generally, do people think this kind of integration is a good > idea to begin with? I know using asyncio is primarily about *avoiding* the > headaches of concurrent threads/processes, but there are always going to be > cases where CPU-intensive work is going to be required in a primarily > I/O-bound application. The easier it is to for developers to handle those > use-cases, the better, IMO. > > Note that the same sort of integration could be done with the threading > module, though I think there's a fairly limited use-case for that; most > times you'd want to use threads over processes, you could probably just use > non-blocking I/O instead. > > Thanks, > Dan > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oreilldf at gmail.com Sun Jul 27 05:34:29 2014 From: oreilldf at gmail.com (Dan O'Reilly) Date: Sat, 26 Jul 2014 23:34:29 -0400 Subject: [Python-ideas] Better integration of multiprocessing with asyncio In-Reply-To: References: Message-ID: Right, this is the same approach I've used myself. For example, the AsyncProcessQueue in my example above was implemented like this: def AsyncProcessQueue(maxsize=0): m = Manager() q = m.Queue(maxsize=maxsize) return _ProcQueue(q) class _ProcQueue(object): def __init__(self, q): self._queue = q self._executor = self._get_executor() self._cancelled_join = False def __getstate__(self): self_dict = self.__dict__ self_dict['_executor'] = None return self_dict def _get_executor(self): return ThreadPoolExecutor(max_workers=cpu_count()) def __setstate__(self, self_dict): self_dict['_executor'] = self._get_executor() self.__dict__.update(self_dict) def __getattr__(self, name): if name in ['qsize', 'empty', 'full', 'put', 'put_nowait', 'get', 'get_nowait', 'close']: return getattr(self._queue, name) else: raise AttributeError("'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) @asyncio.coroutine def coro_put(self, item): loop = asyncio.get_event_loop() return (yield from loop.run_in_executor(self._executor, self.put, item)) @asyncio.coroutine def coro_get(self): loop = asyncio.get_event_loop() return (yield from loop.run_in_executor(self._executor, self.get)) def cancel_join_thread(self): self._cancelled_join = True self._queue.cancel_join_thread() def join_thread(self): self._queue.join_thread() if self._executor and not self._cancelled_join: self._executor.shutdown() I'm wondering if a complete library providing this kind of behavior for all or some subset of multiprocessing is worth adding to the the asyncio module, or if you prefer users to deal with this on their own (or perhaps just distribute something that provides this behavior as a stand-alone library). I suppose adding asyncio-friendly methods to the existing objects in multiprocessing is also an option, but I doubt its desirable to add asyncio-specific code to modules other than asyncio. It also sort of sounds like some of the work that's gone on in Billiard would make the alternative, more complicated approach you mentioned a realistic possibility, at least going by this comment by Ask Solem (from http://bugs.python.org/issue9248#msg221963): > we have a version of multiprocessing.Pool using async IO and one pipe per process that drastically improves performance and also avoids the threads+forking issues (well, not the initial fork), but I have not yet adapted it to use the new asyncio module in 3.4. I don't know the details there, though. Hopefully someone more familiar with Billiard/multiprocessing than I am can provide some additional information. On Sat, Jul 26, 2014 at 10:39 PM, Guido van Rossum wrote: > I actually know very little about multiprocessing (have never used it) but > I imagine the way you normally interact with multiprocessing is using a > synchronous calls that talk to the subprocesses and their work queues and > so on, right? > > In the asyncio world you would put that work in a thread and then use > run_in_executor() with a thread executor -- the thread would then be > managing the subprocesses and talking to them. While you are waiting for > that thread to complete your other coroutines will still work. > > Unless you want to rewrite the communication and process management as > coroutines, but that sounds like a lot of work. 
> > > On Sat, Jul 26, 2014 at 1:59 PM, Dan O'Reilly wrote: > >> I think it would be helpful for folks using the asyncio module to be able >> to make non-blocking calls to objects in the multiprocessing module more >> easily. While some use-cases for using multiprocessing can be replaced with >> ProcessPoolExecutor/run_in_executor, there are others that cannot; more >> advanced usages of multiprocessing.Pool aren't supported by >> ProcessPoolExecutor (initializer/initargs, contexts, etc.), and other >> multiprocessing classes like Lock and Queue have blocking methods that >> could be made into coroutines. >> >> Consider this (extremely contrived, but use your imagination) example of >> a asyncio-friendly Queue: >> >> import asyncio >> import time >> >> def do_proc_work(q, val, val2): >> time.sleep(3) # Imagine this is some expensive CPU work. >> ok = val + val2 >> print("Passing {} to parent".format(ok)) >> q.put(ok) # The Queue can be used with the normal blocking API, too. >> item = q.get() >> print("got {} back from parent".format(item)) >> >> def do_some_async_io_task(): >> # Imagine there's some kind of asynchronous I/O >> # going on here that utilizes asyncio. >> asyncio.sleep(5) >> >> @asyncio.coroutine >> def do_work(q): >> loop.run_in_executor(ProcessPoolExecutor(), >> do_proc_work, q, 1, 2) >> do_some_async_io_task() >> item = yield from q.coro_get() # Non-blocking get that won't affect >> our io_task >> print("Got {} from worker".format(item)) >> item = item + 25 >> yield from q.coro_put(item) >> >> >> if __name__ == "__main__": >> q = AsyncProcessQueue() # This is our new asyncio-friendly version >> of multiprocessing.Queue >> loop = asyncio.get_event_loop() >> loop.run_until_complete(do_work(q)) >> >> I have seen some rumblings about a desire to do this kind of integration >> on the bug tracker (http://bugs.python.org/issue10037#msg162497 and >> http://bugs.python.org/issue9248#msg221963) though that discussion is >> specifically tied to merging the enhancements from the Billiard library >> into multiprocessing.Pool. Are there still plans to do that? If so, should >> asyncio integration with multiprocessing be rolled into those plans, or >> does it make sense to pursue it separately? >> >> Even more generally, do people think this kind of integration is a good >> idea to begin with? I know using asyncio is primarily about *avoiding* the >> headaches of concurrent threads/processes, but there are always going to be >> cases where CPU-intensive work is going to be required in a primarily >> I/O-bound application. The easier it is to for developers to handle those >> use-cases, the better, IMO. >> >> Note that the same sort of integration could be done with the threading >> module, though I think there's a fairly limited use-case for that; most >> times you'd want to use threads over processes, you could probably just use >> non-blocking I/O instead. >> >> Thanks, >> Dan >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > > -- > --Guido van Rossum (python.org/~guido) > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ncoghlan at gmail.com Sun Jul 27 05:39:59 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 27 Jul 2014 13:39:59 +1000 Subject: [Python-ideas] Mutating while iterating In-Reply-To: References: Message-ID: On 27 July 2014 06:51, Aaron Brady wrote: > Hi, I asked about the inconsistency of the "RuntimeError" being raised when > mutating a container while iterating over it here [1], "set and dict > iteration" on Aug 16, 2012. Hi, This is clearly an issue of grave concern to you, but as Raymond pointed out previously, you appear to have misunderstood the purpose of those exceptions. They're there to prevent catastrophic failure of the interpreter itself (i.e. segmentation faults), not to help find bugs in user code. If users want to mutate containers while they're iterating over them, they're generally free to do so. The only time we'll actively disallow it is when such mutation will outright *break* the iterator, rather than merely producing potentially surprising results. I have closed the new issue and added a longer reply (with examples) that will hopefully better explain why we have no intention of changing this behaviour: http://bugs.python.org/issue22084#msg224100 Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From guido at python.org Sun Jul 27 05:43:07 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 26 Jul 2014 20:43:07 -0700 Subject: [Python-ideas] Better integration of multiprocessing with asyncio In-Reply-To: References: Message-ID: I'm going to go out on a limb here and say that it feels too early to me. First someone has to actually solve this problem well as a 3rd party package before we can talk about adding it to the asyncio package. It doesn't actually sound like Billiards has adapted to asyncio yet (not that I have any idea what Billiards is -- it sounds like a fork of multiprocessing actually?). On Sat, Jul 26, 2014 at 8:34 PM, Dan O'Reilly wrote: > Right, this is the same approach I've used myself. 
For example, the > AsyncProcessQueue in my example above was implemented like this: > > def AsyncProcessQueue(maxsize=0): > m = Manager() > q = m.Queue(maxsize=maxsize) > return _ProcQueue(q) > > class _ProcQueue(object): > def __init__(self, q): > self._queue = q > self._executor = self._get_executor() > self._cancelled_join = False > > def __getstate__(self): > self_dict = self.__dict__ > self_dict['_executor'] = None > return self_dict > > def _get_executor(self): > return ThreadPoolExecutor(max_workers=cpu_count()) > > def __setstate__(self, self_dict): > self_dict['_executor'] = self._get_executor() > self.__dict__.update(self_dict) > > def __getattr__(self, name): > if name in ['qsize', 'empty', 'full', 'put', 'put_nowait', > 'get', 'get_nowait', 'close']: > return getattr(self._queue, name) > else: > raise AttributeError("'%s' object has no attribute '%s'" % > (self.__class__.__name__, name)) > > @asyncio.coroutine > def coro_put(self, item): > loop = asyncio.get_event_loop() > return (yield from loop.run_in_executor(self._executor, self.put, > item)) > > @asyncio.coroutine > def coro_get(self): > loop = asyncio.get_event_loop() > return (yield from loop.run_in_executor(self._executor, self.get)) > > def cancel_join_thread(self): > self._cancelled_join = True > self._queue.cancel_join_thread() > > def join_thread(self): > self._queue.join_thread() > if self._executor and not self._cancelled_join: > self._executor.shutdown() > > I'm wondering if a complete library providing this kind of behavior for > all or some subset of multiprocessing is worth adding to the the asyncio > module, or if you prefer users to deal with this on their own (or perhaps > just distribute something that provides this behavior as a stand-alone > library). I suppose adding asyncio-friendly methods to the existing objects > in multiprocessing is also an option, but I doubt its desirable to add > asyncio-specific code to modules other than asyncio. > > It also sort of sounds like some of the work that's gone on in Billiard > would make the alternative, more complicated approach you mentioned a > realistic possibility, at least going by this comment by Ask Solem (from > http://bugs.python.org/issue9248#msg221963): > > > we have a version of multiprocessing.Pool using async IO and one pipe per process that drastically improves performance and also avoids the threads+forking issues (well, not the initial fork), but I have not yet adapted it to use the new asyncio module in 3.4. > > I don't know the details there, though. Hopefully someone more familiar with Billiard/multiprocessing than I am can provide some additional information. > > > > > > On Sat, Jul 26, 2014 at 10:39 PM, Guido van Rossum > wrote: > >> I actually know very little about multiprocessing (have never used it) >> but I imagine the way you normally interact with multiprocessing is using a >> synchronous calls that talk to the subprocesses and their work queues and >> so on, right? >> >> In the asyncio world you would put that work in a thread and then use >> run_in_executor() with a thread executor -- the thread would then be >> managing the subprocesses and talking to them. While you are waiting for >> that thread to complete your other coroutines will still work. >> >> Unless you want to rewrite the communication and process management as >> coroutines, but that sounds like a lot of work. 
>> >> >> On Sat, Jul 26, 2014 at 1:59 PM, Dan O'Reilly wrote: >> >>> I think it would be helpful for folks using the asyncio module to be >>> able to make non-blocking calls to objects in the multiprocessing module >>> more easily. While some use-cases for using multiprocessing can be replaced >>> with ProcessPoolExecutor/run_in_executor, there are others that cannot; >>> more advanced usages of multiprocessing.Pool aren't supported by >>> ProcessPoolExecutor (initializer/initargs, contexts, etc.), and other >>> multiprocessing classes like Lock and Queue have blocking methods that >>> could be made into coroutines. >>> >>> Consider this (extremely contrived, but use your imagination) example of >>> a asyncio-friendly Queue: >>> >>> import asyncio >>> import time >>> >>> def do_proc_work(q, val, val2): >>> time.sleep(3) # Imagine this is some expensive CPU work. >>> ok = val + val2 >>> print("Passing {} to parent".format(ok)) >>> q.put(ok) # The Queue can be used with the normal blocking API, too. >>> item = q.get() >>> print("got {} back from parent".format(item)) >>> >>> def do_some_async_io_task(): >>> # Imagine there's some kind of asynchronous I/O >>> # going on here that utilizes asyncio. >>> asyncio.sleep(5) >>> >>> @asyncio.coroutine >>> def do_work(q): >>> loop.run_in_executor(ProcessPoolExecutor(), >>> do_proc_work, q, 1, 2) >>> do_some_async_io_task() >>> item = yield from q.coro_get() # Non-blocking get that won't affect >>> our io_task >>> print("Got {} from worker".format(item)) >>> item = item + 25 >>> yield from q.coro_put(item) >>> >>> >>> if __name__ == "__main__": >>> q = AsyncProcessQueue() # This is our new asyncio-friendly version >>> of multiprocessing.Queue >>> loop = asyncio.get_event_loop() >>> loop.run_until_complete(do_work(q)) >>> >>> I have seen some rumblings about a desire to do this kind of integration >>> on the bug tracker (http://bugs.python.org/issue10037#msg162497 and >>> http://bugs.python.org/issue9248#msg221963) though that discussion is >>> specifically tied to merging the enhancements from the Billiard library >>> into multiprocessing.Pool. Are there still plans to do that? If so, should >>> asyncio integration with multiprocessing be rolled into those plans, or >>> does it make sense to pursue it separately? >>> >>> Even more generally, do people think this kind of integration is a good >>> idea to begin with? I know using asyncio is primarily about *avoiding* the >>> headaches of concurrent threads/processes, but there are always going to be >>> cases where CPU-intensive work is going to be required in a primarily >>> I/O-bound application. The easier it is to for developers to handle those >>> use-cases, the better, IMO. >>> >>> Note that the same sort of integration could be done with the threading >>> module, though I think there's a fairly limited use-case for that; most >>> times you'd want to use threads over processes, you could probably just use >>> non-blocking I/O instead. >>> >>> Thanks, >>> Dan >>> >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> Code of Conduct: http://python.org/psf/codeofconduct/ >>> >> >> >> >> -- >> --Guido van Rossum (python.org/~guido) >> > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ncoghlan at gmail.com Sun Jul 27 05:47:49 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 27 Jul 2014 13:47:49 +1000 Subject: [Python-ideas] Better integration of multiprocessing with asyncio In-Reply-To: References: Message-ID: On 27 July 2014 13:34, Dan O'Reilly wrote: > > I'm wondering if a complete library providing this kind of behavior for all > or some subset of multiprocessing is worth adding to the the asyncio module, > or if you prefer users to deal with this on their own (or perhaps just > distribute something that provides this behavior as a stand-alone library). > I suppose adding asyncio-friendly methods to the existing objects in > multiprocessing is also an option, but I doubt its desirable to add > asyncio-specific code to modules other than asyncio. Actually, having asyncio act as a "nexus" for asynchronous IO backends is one of the reasons for its existence. The asyncio event loop is pluggable, so making multiprocessing asyncio friendly (whether directly, or as an addon library that bridges the two) *also* has the effect of making it compatible with all the other asynchronous event loops that can be plugged into the asyncio framework. I'm inclined to agree with Guido, though - while I think making asyncio and multiprocessing play well together is a good idea in principle, I think we're still in the "third party exploration phase" of that integration. Once folks figure out good ways to do it, *then* we can start talking about making that integration a default part of Python 3.5 or 3.6+. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ryan at ryanhiebert.com Sun Jul 27 05:48:02 2014 From: ryan at ryanhiebert.com (Ryan Hiebert) Date: Sat, 26 Jul 2014 22:48:02 -0500 Subject: [Python-ideas] Better integration of multiprocessing with asyncio In-Reply-To: References: Message-ID: <6D00A7DF-A35D-4608-8AEA-5C21376F909B@ryanhiebert.com> > On Jul 26, 2014, at 10:43 PM, Guido van Rossum wrote: > > I'm going to go out on a limb here and say that it feels too early to me. First someone has to actually solve this problem well as a 3rd party package before we can talk about adding it to the asyncio package. It doesn't actually sound like Billiards has adapted to asyncio yet (not that I have any idea what Billiards is -- it sounds like a fork of multiprocessing actually?). Yep, Billiard is a fork of multiprocessing: https://pypi.python.org/pypi/billiard -------------- next part -------------- An HTML attachment was scrubbed... URL: From ronaldoussoren at mac.com Sun Jul 27 09:42:02 2014 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Sun, 27 Jul 2014 09:42:02 +0200 Subject: [Python-ideas] PEP 447 revisited In-Reply-To: References: <5BB87CC4-F31B-4213-AAAC-0C0CE738460C@mac.com> Message-ID: <06ED1B99-850E-49C1-950C-B311FEC340C8@mac.com> On 26 Jul 2014, at 13:59, Nick Coghlan wrote: > On 26 July 2014 18:03, Ronald Oussoren wrote: >> Hi, >> >> After a long hiatus I?ve done some updates to PEP 447 which proposes a new metaclass method that?s used in attribute resolution for normal and super instances. There have been two updates, the first one is trivial, the proposed method has a new name (__getdescriptor__). The second change to the PEP is to add a Python pseudo implementation of object.__getattribute__ and super.__getattribute__ to make it easier to reason about the impact of the proposal. >> >> I?d like to move forward with this PEP, either to rejection or (preferable) to acceptance of the feature in some form. 
That said, I?m not too attached to the exact proposal, it just seems to be the minmal clean change that can be used to implement my use case for this. >> >> My use case is fairly obscure, but hopefully it is not too obscure :-). The problem I have at the moment is basically that it is not possible to hook into the attribute resolution algorithm used by super.__getattribute__ and this PEP would solve that. > > The use case seems reasonable to me, and the new slot name seems much > easier to document and explain than the previous iteration. Some Australian guy you may know suggested the name the last time I posted the PEP for review, and I liked the name. Naming is hard... > > I'd like to see the PEP look into the inspect module and consider the > consequences for the functions there (e.g. another way for > getattr_static to miss methods), as well as any possible implications > for dir(). We had a few issues there with the enum changes for 3.4 > (and some more serious ones with Argument Clinic) - it's not a > blocker, it's just nice going in to have some idea of the impact going > in :) I agree that it is useful to explain those consequences. The consequences for dir() should be similar to those for __getattribute__ itself: if you override the default implementation you should implement __dir__ to match, or live with the inconsistency. There should be little or no impact on inspect, other then that getattr_static may not work as expected when using a custom implemention of __getdescriptor__ because the class __dict__ may not contain the values you need. There?s nothing that can be done about that, the entire point of getattr_static is to avoid triggering custom attribute lookup code. inspect.getmembers and inspect.get_class_attrs, look directly at the class __dict__, and hence might not show everything that?s available through the class when using a custom __getdescriptor__ method. I have to think about the consequences and possible mitigation of those consequences a bit, not just for this PEP but for the current PyObjC implementation as well. Anyways, I?ll add a section about introspection to the PEP that describes these issues and their consequences. Ronald > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From joshua at landau.ws Mon Jul 28 07:26:13 2014 From: joshua at landau.ws (Joshua Landau) Date: Mon, 28 Jul 2014 06:26:13 +0100 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140727011739.GC9112@ando> References: <20140727011739.GC9112@ando> Message-ID: On 27 July 2014 02:17, Steven D'Aprano wrote: > On Sun, Jul 27, 2014 at 09:34:16AM +1000, Alexander Heger wrote: > >> Is there a good reason for not implementing the "+" operator for dict.update()? > [...] >> That is >> >> B += A >> >> should be equivalent to >> >> B.update(A) > > You're asking the wrong question. The burden is not on people to justify > *not* adding new features, the burden is on somebody to justify adding > them. Is there a good reason for implementing the + operator as > dict.update? One good reason is that people are still convinced "dict(A, **B)" makes some kind of sense. But really, we have collections.ChainMap, dict addition is confusing and there's already a PEP (python.org/dev/peps/pep-0448) that has a solution I prefer ({**A, **B}). 
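For what it's worth, a tiny sketch of the ChainMap spelling (stdlib collections.ChainMap, available since 3.3; the dict contents are invented, and the {**A, **B} form above is only the PEP 448 proposal at this point):

from collections import ChainMap

A = {'x': 1, 'y': 2}
B = {'y': 20, 'z': 30}

merged = ChainMap(B, A)    # lookups search B first, then A, so B's 'y' wins
print(merged['y'])         # 20
print(merged['x'])         # 1
print(sorted(merged))      # ['x', 'y', 'z'], iteration covers each key once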
From steve at pearwood.info Mon Jul 28 16:59:51 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 29 Jul 2014 00:59:51 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> Message-ID: <20140728145951.GH9112@ando> On Mon, Jul 28, 2014 at 06:26:13AM +0100, Joshua Landau wrote: > On 27 July 2014 02:17, Steven D'Aprano wrote: [...] > > Is there a good reason for implementing the + operator as > > dict.update? > > One good reason is that people are still convinced "dict(A, **B)" > makes some kind of sense. Explain please. dict(A, **B) makes perfect sense to me, and it works perfectly too. It's a normal constructor call, using the same syntax as any other function or method call. Are you suggesting that it does not make sense? -- Steven From dw+python-ideas at hmmz.org Mon Jul 28 17:33:06 2014 From: dw+python-ideas at hmmz.org (dw+python-ideas at hmmz.org) Date: Mon, 28 Jul 2014 15:33:06 +0000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140728145951.GH9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> Message-ID: <20140728153306.GA5756@k2> On Tue, Jul 29, 2014 at 12:59:51AM +1000, Steven D'Aprano wrote: > > One good reason is that people are still convinced "dict(A, **B)" > > makes some kind of sense. > > Explain please. dict(A, **B) makes perfect sense to me, and it works > perfectly too. It's a normal constructor call, using the same syntax > as any other function or method call. Are you suggesting that it does > not make sense? It worked in Python 2, but Python 3 added code to explicitly prevent the kwargs mechanism from being abused by passing non-string keys. Effectively, the only reason it worked was due to a Python 2.x kwargs implementation detail. It took me a while to come to terms with this one too, it was really quite a nice hack. But that's all it ever was. The domain of valid keys accepted by **kwargs should never have exceeded the range supported by the language syntax for declaring keyword arguments. David From guido at python.org Mon Jul 28 17:40:17 2014 From: guido at python.org (Guido van Rossum) Date: Mon, 28 Jul 2014 08:40:17 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140728145951.GH9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> Message-ID: I'll regret jumping in here, but while dict(A, **B) as a way to merge two dicts A and B makes some sense, it has two drawbacks: (1) slow (creates an extra copy of B as it creates the keyword args structure for dict()) and (2) not general enough (doesn't support key types other than str). On Mon, Jul 28, 2014 at 7:59 AM, Steven D'Aprano wrote: > On Mon, Jul 28, 2014 at 06:26:13AM +0100, Joshua Landau wrote: > > On 27 July 2014 02:17, Steven D'Aprano wrote: > [...] > > > Is there a good reason for implementing the + operator as > > > dict.update? > > > > One good reason is that people are still convinced "dict(A, **B)" > > makes some kind of sense. > > Explain please. dict(A, **B) makes perfect sense to me, and it works > perfectly too. It's a normal constructor call, using the same syntax as > any other function or method call. Are you suggesting that it does not > make sense? 
> > > -- > Steven > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Mon Jul 28 18:04:50 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 29 Jul 2014 02:04:50 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140728153306.GA5756@k2> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> Message-ID: <20140728160450.GI9112@ando> On Mon, Jul 28, 2014 at 03:33:06PM +0000, dw+python-ideas at hmmz.org wrote: > On Tue, Jul 29, 2014 at 12:59:51AM +1000, Steven D'Aprano wrote: > > > > One good reason is that people are still convinced "dict(A, **B)" > > > makes some kind of sense. > > > > Explain please. dict(A, **B) makes perfect sense to me, and it works > > perfectly too. It's a normal constructor call, using the same syntax > > as any other function or method call. Are you suggesting that it does > > not make sense? > > It worked in Python 2, but Python 3 added code to explicitly prevent the > kwargs mechanism from being abused by passing non-string keys. /face-palm Ah of course! You're right, using dict(A, **B) isn't general enough. I'm still inclined to prefer allowing update() to accept multiple arguments: a.update(b, c, d) rather than a += b + c + d which suggests that maybe there ought to be an updated() built-in, Let the bike-shedding begin: should such a thing be spelled ? new_dict = a + b + c + d Pros: + is short to type; subclasses can control the type of new_dict. Cons: dict addition isn't obvious. new_dict = updated(a, b, c, d) Pros: analogous to sort/sorted, reverse/reversed. Cons: another built-in; isn't very general, only applies to Mappings new_dict = a.updated(b, c, d) Pros: only applies to mappings, so it should be a method; subclasses can control the type of the new dict returned. Cons: easily confused with dict.update -- Steven From guido at python.org Mon Jul 28 18:08:49 2014 From: guido at python.org (Guido van Rossum) Date: Mon, 28 Jul 2014 09:08:49 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140728153306.GA5756@k2> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> Message-ID: In addition, dict(A, **B) is not something you easily stumble upon when your goal is "merge two dicts"; nor is it even clear that that's what it is when you read it for the first time. All signs of too-clever hacks in my book. On Mon, Jul 28, 2014 at 8:33 AM, wrote: > On Tue, Jul 29, 2014 at 12:59:51AM +1000, Steven D'Aprano wrote: > > > > One good reason is that people are still convinced "dict(A, **B)" > > > makes some kind of sense. > > > > Explain please. dict(A, **B) makes perfect sense to me, and it works > > perfectly too. It's a normal constructor call, using the same syntax > > as any other function or method call. Are you suggesting that it does > > not make sense? > > It worked in Python 2, but Python 3 added code to explicitly prevent the > kwargs mechanism from being abused by passing non-string keys. > Effectively, the only reason it worked was due to a Python 2.x kwargs > implementation detail. > > It took me a while to come to terms with this one too, it was really > quite a nice hack. But that's all it ever was. 
The domain of valid keys > accepted by **kwargs should never have exceeded the range supported by > the language syntax for declaring keyword arguments. > > > David > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ron3200 at gmail.com Mon Jul 28 19:17:10 2014 From: ron3200 at gmail.com (Ron Adam) Date: Mon, 28 Jul 2014 12:17:10 -0500 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140728160450.GI9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> Message-ID: On 07/28/2014 11:04 AM, Steven D'Aprano wrote: > On Mon, Jul 28, 2014 at 03:33:06PM +0000,dw+python-ideas at hmmz.org wrote: >> >On Tue, Jul 29, 2014 at 12:59:51AM +1000, Steven D'Aprano wrote: >> > >>>> > > >One good reason is that people are still convinced "dict(A, **B)" >>>> > > >makes some kind of sense. >>> > > >>> > >Explain please. dict(A, **B) makes perfect sense to me, and it works >>> > >perfectly too. It's a normal constructor call, using the same syntax >>> > >as any other function or method call. Are you suggesting that it does >>> > >not make sense? >> > >> >It worked in Python 2, but Python 3 added code to explicitly prevent the >> >kwargs mechanism from being abused by passing non-string keys. > /face-palm > > Ah of course! You're right, using dict(A, **B) isn't general enough. and make the language easier to write and use > I'm still inclined to prefer allowing update() to accept multiple > arguments: > > a.update(b, c, d) To me, the constructor and update method should be as near alike as possible. So I think if it's done in the update method, it should also work in the constructor. And other type constructors, such as list, should work in similar ways as well. I'm not sure that going in this direction would be good in the long term. > rather than a += b + c + d > > which suggests that maybe there ought to be an updated() built-in, Let > the bike-shedding begin: should such a thing be spelled ? > > new_dict = a + b + c + d > > Pros: + is short to type; subclasses can control the type of new_dict. > Cons: dict addition isn't obvious. I think it's more obvious. It only needs __add__ and __iadd__ methods to make it consistent with the list type. The cons is that somewhere someone could be catching TypeError to differentiate dict from other types while adding. But it's just as likely they are doing so in order to add them after a TypeError occurs. I think this added consistency between lists and dicts would be useful. But, Putting __add__ and __iadd__ methods on dicts seems like something that was probably discussed in length before, and I wonder what reasons where given for not doing it then. 
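For concreteness, a minimal sketch of what that pair of methods might look like on a subclass (AddableDict is an invented name), assuming the usual update() rule that the right-hand operand wins on key collisions:

class AddableDict(dict):
    def __add__(self, other):
        new = type(self)(self)   # shallow copy, preserving the subclass
        new.update(other)        # right operand wins on key collisions
        return new

    def __iadd__(self, other):
        self.update(other)       # in place, mirroring list's += behaviour
        return self

a = AddableDict(x=1, y=2)
b = {'y': 20, 'z': 30}
print(sorted((a + b).items()))   # [('x', 1), ('y', 20), ('z', 30)]
a += b
print(sorted(a.items()))         # same result, but updated in place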
Cheers, Ron From antoine at python.org Mon Jul 28 19:29:00 2014 From: antoine at python.org (Antoine Pitrou) Date: Mon, 28 Jul 2014 13:29:00 -0400 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> Message-ID: Le 28/07/2014 12:08, Guido van Rossum a ?crit : > In addition, dict(A, **B) is not something you easily stumble upon when > your goal is "merge two dicts"; nor is it even clear that that's what it > is when you read it for the first time. > > All signs of too-clever hacks in my book. Agreed with Guido (!). Regards Antoine. From ryan at ryanhiebert.com Mon Jul 28 20:37:03 2014 From: ryan at ryanhiebert.com (Ryan Hiebert) Date: Mon, 28 Jul 2014 13:37:03 -0500 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140728160450.GI9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> Message-ID: <6122DCE6-D84A-4B05-AB02-C1FD3CED82A4@ryanhiebert.com> > On Jul 28, 2014, at 11:04 AM, Steven D'Aprano wrote: > > I'm still inclined to prefer allowing update() to accept multiple > arguments: > > a.update(b, c, d) > > rather than a += b + c + d > > which suggests that maybe there ought to be an updated() built-in, Let > the bike-shedding begin: should such a thing be spelled ? > > new_dict = a + b + c + d > or, to match set new_dict = a | b | c | d From nathan at cmu.edu Mon Jul 28 20:58:21 2014 From: nathan at cmu.edu (Nathan Schneider) Date: Mon, 28 Jul 2014 14:58:21 -0400 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: On Sat, Jul 26, 2014 at 7:34 PM, Alexander Heger wrote: > > My apologies if this has been posted before but with a quick google > search I could not see it; if it was, could you please point me to the > thread? > Here are two threads that had some discussion of this: https://mail.python.org/pipermail/python-ideas/2011-December/013227.html and https://mail.python.org/pipermail/python-ideas/2013-June/021140.html. Seems like a useful feature if there could be a clean way to spell it. Cheers, Nathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Mon Jul 28 21:21:54 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 28 Jul 2014 20:21:54 +0100 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: On 28 July 2014 19:58, Nathan Schneider wrote: > Here are two threads that had some discussion of this: > https://mail.python.org/pipermail/python-ideas/2011-December/013227.html This doesn't seem to have a use case, other than "it would be nice". > https://mail.python.org/pipermail/python-ideas/2013-June/021140.html. This can be handled using ChainMap, if I understand the proposal. > Seems like a useful feature if there could be a clean way to spell it. I've yet to see any real-world situation when I've wanted "dictionary addition" (with any of the various semantics proposed here) and I've never encountered a situation where using d1.update(d2) was sufficiently awkward that having an operator seemed reasonable. In all honesty, I'd suggest that code which looks bad enough to warrant even considering this feature is probably badly in need of refactoring, at which point the problem will likely go away. 
Paul From abarnert at yahoo.com Mon Jul 28 22:20:20 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 28 Jul 2014 13:20:20 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> On Jul 28, 2014, at 12:21, Paul Moore wrote: > On 28 July 2014 19:58, Nathan Schneider wrote: >> Here are two threads that had some discussion of this: >> https://mail.python.org/pipermail/python-ideas/2011-December/013227.html > > This doesn't seem to have a use case, other than "it would be nice". > >> https://mail.python.org/pipermail/python-ideas/2013-June/021140.html. > > This can be handled using ChainMap, if I understand the proposal. When the underlying dicts and desired combined dict are all going to be used immutably, ChainMap is the perfect answer. (Better than an "updated" function for performance if nothing else.) And usually, when you're looking for a non-mutating combine-dicts operation, that will be what you want. But usually isn't always. If you want a snapshot of the combination of mutable dicts, ChainMap is wrong. If you want to be able to mutate the result, ChainMap is wrong. All that being said, I'm not sure these use cases are sufficiently common to warrant adding an operator--especially since there are other just-as-(un)common use cases it wouldn't solve. (For example, what I often want is a mutable "overlay" ChainMap, which doesn't need to copy the entire potentially-gigantic source dicts. I wouldn't expect an operator for that, even though I need it far more often than I need a mutable snapshot copy.) And of course, as you say, real-life use cases would be a lot more compelling than theoretical/abstract ones. >> Seems like a useful feature if there could be a clean way to spell it. > > I've yet to see any real-world situation when I've wanted "dictionary > addition" (with any of the various semantics proposed here) and I've > never encountered a situation where using d1.update(d2) was > sufficiently awkward that having an operator seemed reasonable. > > In all honesty, I'd suggest that code which looks bad enough to > warrant even considering this feature is probably badly in need of > refactoring, at which point the problem will likely go away. > > Paul > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From encukou at gmail.com Mon Jul 28 22:53:43 2014 From: encukou at gmail.com (Petr Viktorin) Date: Mon, 28 Jul 2014 22:53:43 +0200 Subject: [Python-ideas] adding dictionaries In-Reply-To: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> References: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> Message-ID: On Mon, Jul 28, 2014 at 10:20 PM, Andrew Barnert wrote: > When the underlying dicts and desired combined dict are all going to be used immutably, ChainMap is the perfect answer. (Better than an "updated" function for performance if nothing else.) And usually, when you're looking for a non-mutating combine-dicts operation, that will be what you want. > > But usually isn't always. If you want a snapshot of the combination of mutable dicts, ChainMap is wrong. If you want to be able to mutate the result, ChainMap is wrong. In those cases, do dict(ChainMap(...)). 
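To make that difference concrete, a small sketch with invented contents: the ChainMap stays a live view of the underlying dicts, while dict(ChainMap(...)) takes an independent, mutable snapshot.

from collections import ChainMap

a = {'k': 1}
b = {'k': 2, 'j': 3}

view = ChainMap(a, b)    # live view; 'k' currently comes from a
snap = dict(view)        # independent copy taken now

a['k'] = 100             # mutate one of the underlying dicts
print(view['k'])         # 100, the view reads through to a
print(snap['k'])         # 1, the snapshot is unaffected

snap['j'] = 42           # and the snapshot can be mutated freely
print(b['j'])            # 3, the originals are untouched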
> > All that being said, I'm not sure these use cases are sufficiently common to warrant adding an operator--especially since there are other just-as-(un)common use cases it wouldn't solve. (For example, what I often want is a mutable "overlay" ChainMap, which doesn't need to copy the entire potentially-gigantic source dicts. I wouldn't expect an operator for that, even though I need it far more often than I need a mutable snapshot copy.) > > And of course, as you say, real-life use cases would be a lot more compelling than theoretical/abstract ones. From python at 2sn.net Mon Jul 28 22:59:29 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 06:59:29 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> Message-ID: On 29 July 2014 02:08, Guido van Rossum wrote: > In addition, dict(A, **B) is not something you easily stumble upon when your > goal is "merge two dicts"; nor is it even clear that that's what it is when > you read it for the first time. > > All signs of too-clever hacks in my book. I try to convince students to learn and *use* python. If I tell students to merge 2 dictionaries they have to do dict(A, **B} or {**A, **B} that seem less clear (not something you "stumble across" as Guidon says) than A + B; then we still have to tell them the rules of the operation, as usual for any operation. It does not have to be "+", could be the "union" operator "|" that is used for sets where s.update(t) is the same as s |= t ... and accordingly D = A | B | C Maybe this operator is better as this equivalence is already being used (for sets). Accordingly "union(A,B)" could do a merge operation and return the new dict(). (this then still allows people who want "+" to add the values be made happy in the long run) -Alexander From python at 2sn.net Tue Jul 29 00:15:49 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 08:15:49 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: > In all honesty, I'd suggest that code which looks bad enough to > warrant even considering this feature is probably badly in need of > refactoring, at which point the problem will likely go away. I often want to call functions with added (or removed, replaced) keywords from the call. args0 = dict(...) args1 = dict(...) def f(**kwargs): g(**(arg0 | kwargs | args1)) currently I have to write args = dict(...) def f(**kwargs): temp_args = dict(dic0) temp_args.update(kwargs) temp_args.update(dic1) g(**temp_args) It would also make the proposed feature to allow multiple kw args expansions in Python 3.5 easy to write by having f(**a, **b, **c) be equivalent to f(**(a | b | c)) -Alexander From abarnert at yahoo.com Tue Jul 29 00:17:22 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 28 Jul 2014 15:17:22 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> Message-ID: <999F9DAF-E27A-46FE-A444-C2713A18BBB6@yahoo.com> On Jul 28, 2014, at 13:59, Alexander Heger wrote: > On 29 July 2014 02:08, Guido van Rossum wrote: >> In addition, dict(A, **B) is not something you easily stumble upon when your >> goal is "merge two dicts"; nor is it even clear that that's what it is when >> you read it for the first time. >> >> All signs of too-clever hacks in my book. > > I try to convince students to learn and *use* python. 
> > If I tell students to merge 2 dictionaries they have to do dict(A, > **B} or {**A, **B} that seem less clear (not something you "stumble > across" as Guidon says) than A + B; then we still have to tell them > the rules of the operation, as usual for any operation. > > It does not have to be "+", could be the "union" operator "|" that is > used for sets where > s.update(t) > is the same as > s |= t The difference is that with sets, it (at least conceptually) doesn't matter whether you keep elements from s or t when they collide, because by definition they only collide if they're equal, but with dicts, it very much matters whether you keep items from s or t when their keys collide, because the corresponding values are generally _not_ equal. So this is a false analogy; the same problem raised in the first three replies on this thread still needs to be answered: Is it obvious that the values from b should overwrite the values from a (assuming that's the rule you're suggesting, since you didn't specify; translate to the appropriate question if you want a different rule) in all real-life use cases? If not, is this so useful that the benefits in some uses outweigh the almost certain confusion in others? Without a compelling "yes" to one of those two questions, we're still at square one here; switching from + to | and making an analogy with sets doesn't help. > ... and accordingly > > D = A | B | C > > Maybe this operator is better as this equivalence is already being > used (for sets). Accordingly "union(A,B)" could do a merge operation > and return the new dict(). Wouldn't you expect a top-level union function to take any two iterables and return the union of them as a set (especially given that set.union accepts any iterable for its non-self argument)? A.union(B) seems a lot better than union(A, B). Then again, A.updated(B) or updated?A, B) might be even better, as someone suggested, because the parallel between update and updated (and between e.g. sort and sorted) is not at all problematic. > (this then still allows people who want "+" to add the values be made > happy in the long run) > > -Alexander > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From abarnert at yahoo.com Tue Jul 29 00:19:22 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 28 Jul 2014 15:19:22 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: On Jul 28, 2014, at 15:15, Alexander Heger wrote: >> In all honesty, I'd suggest that code which looks bad enough to >> warrant even considering this feature is probably badly in need of >> refactoring, at which point the problem will likely go away. > > I often want to call functions with added (or removed, replaced) > keywords from the call. > > args0 = dict(...) > args1 = dict(...) > > def f(**kwargs): > g(**(arg0 | kwargs | args1)) > > currently I have to write > > args = dict(...) > def f(**kwargs): > temp_args = dict(dic0) > temp_args.update(kwargs) > temp_args.update(dic1) > g(**temp_args) No, you just have to write a one-liner with ChainMap, except in the (very rare) case where you're expecting g to hold onto and later modify its kwargs. 
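Spelling that one-liner out, reusing the names from the quoted example and assuming args1 should take precedence over the caller's kwargs, and those over args0 (mirroring the update order above):

from collections import ChainMap

args0 = {'a': 0, 'b': 0}
args1 = {'c': 99}

def g(**kwargs):
    print(sorted(kwargs.items()))

def f(**kwargs):
    # the earliest mapping wins, so args1 overrides kwargs, which overrides args0
    g(**ChainMap(args1, kwargs, args0))

f(a=1, b=2)    # [('a', 1), ('b', 2), ('c', 99)]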
> > It would also make the proposed feature to allow multiple kw args > expansions in Python 3.5 easy to write by having > > f(**a, **b, **c) > be equivalent to > f(**(a | b | c)) > > -Alexander > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From ncoghlan at gmail.com Tue Jul 29 00:20:53 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 29 Jul 2014 08:20:53 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <6122DCE6-D84A-4B05-AB02-C1FD3CED82A4@ryanhiebert.com> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <6122DCE6-D84A-4B05-AB02-C1FD3CED82A4@ryanhiebert.com> Message-ID: On 29 Jul 2014 04:40, "Ryan Hiebert" wrote: > > > > On Jul 28, 2014, at 11:04 AM, Steven D'Aprano wrote: > > > > I'm still inclined to prefer allowing update() to accept multiple > > arguments: > > > > a.update(b, c, d) > > > > rather than a += b + c + d Note that if update() was changed to accept multiple args, the dict() constructor could similarly be updated. Then: x = dict(a) x.update(b) x.update(c) x.update(d) Would become: x = dict(a, b, c, d) Aside from the general "What's the use case that wouldn't be better served by a larger scale refactoring?" concern, my main issue with that approach would be the asymmetry it would introduce with the set constructor (which disallows multiple arguments to avoid ambiguity in the single argument case). But really, I'm not seeing a compelling argument for why this needs to be a builtin. If someone is merging dicts often enough to care, they can already write a function to do the dict copy-and-update as a single operation. What makes this more special than the multitude of other three line functions in the world? Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at 2sn.net Tue Jul 29 00:20:53 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 08:20:53 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: > https://mail.python.org/pipermail/python-ideas/2013-June/021140.html. I see, this is a very extended thread google did not show me when I started this one, and many good points were made there. So, my apologies I restarted this w/o reference; this discussion does seem to resurface, however. It seems it would be valuable to parallel the behaviour of operators already in place for collections. Counter: A + B adds values (calls __add__ or __iadd__ function of values, likely __iadd__ for values of A) A |= B does A.update(B) etc. -Alexander On 29 July 2014 05:21, Paul Moore wrote: > On 28 July 2014 19:58, Nathan Schneider wrote: >> Here are two threads that had some discussion of this: >> https://mail.python.org/pipermail/python-ideas/2011-December/013227.html > > This doesn't seem to have a use case, other than "it would be nice". > >> https://mail.python.org/pipermail/python-ideas/2013-June/021140.html. > > This can be handled using ChainMap, if I understand the proposal. > >> Seems like a useful feature if there could be a clean way to spell it. 
> > I've yet to see any real-world situation when I've wanted "dictionary > addition" (with any of the various semantics proposed here) and I've > never encountered a situation where using d1.update(d2) was > sufficiently awkward that having an operator seemed reasonable. > > In all honesty, I'd suggest that code which looks bad enough to > warrant even considering this feature is probably badly in need of > refactoring, at which point the problem will likely go away. > > Paul From python at 2sn.net Tue Jul 29 00:21:09 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 08:21:09 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> References: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> Message-ID: > When the underlying dicts and desired combined dict are all going to be used immutably, ChainMap is the perfect answer. (Better than an "updated" function for performance if nothing else.) And usually, when you're looking for a non-mutating combine-dicts operation, that will be what you want. > > But usually isn't always. If you want a snapshot of the combination of mutable dicts, ChainMap is wrong. If you want to be able to mutate the result, ChainMap is wrong. > > All that being said, I'm not sure these use cases are sufficiently common to warrant adding an operator--especially since there are other just-as-(un)common use cases it wouldn't solve. (For example, what I often want is a mutable "overlay" ChainMap, which doesn't need to copy the entire potentially-gigantic source dicts. I wouldn't expect an operator for that, even though I need it far more often than I need a mutable snapshot copy.) > > And of course, as you say, real-life use cases would be a lot more compelling than theoretical/abstract ones. For many applications you may not care one way or the other, only for some you do, and only then you need to know the details of operation. My point is to make the dict() data structure more easy to use for most users and use cases. Especially novices. This is what adds power to the language. Not that you can do things (Turing machines can) but that you can do them easily and naturally. From ncoghlan at gmail.com Tue Jul 29 00:27:06 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 29 Jul 2014 08:27:06 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: On 29 Jul 2014 08:16, "Alexander Heger" wrote: > > > In all honesty, I'd suggest that code which looks bad enough to > > warrant even considering this feature is probably badly in need of > > refactoring, at which point the problem will likely go away. > > I often want to call functions with added (or removed, replaced) > keywords from the call. > > args0 = dict(...) > args1 = dict(...) > > def f(**kwargs): > g(**(arg0 | kwargs | args1)) > > currently I have to write > > args = dict(...) > def f(**kwargs): > temp_args = dict(dic0) > temp_args.update(kwargs) > temp_args.update(dic1) > g(**temp_args) The first part of this one of the use cases for functools.partial(), so it isn't a compelling argument for easy dict merging. The above is largely an awkward way of spelling: import functools f = functools.partial(g, **...) The one difference is to also silently *override* some of the explicitly passed arguments, but that part's downright user hostile and shouldn't be encouraged. Regards, Nick. 
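For illustration, a self-contained sketch of that spelling; g and the defaults dict are invented stand-ins (the "..." above is left as written), and keywords passed at call time extend and override the ones stored in the partial object:

import functools

def g(**kwargs):
    print(sorted(kwargs.items()))

defaults = {'retries': 3, 'timeout': 10}   # hypothetical stand-in for the "..."
f = functools.partial(g, **defaults)

f(user='alice')              # [('retries', 3), ('timeout', 10), ('user', 'alice')]
f(user='bob', timeout=60)    # the explicit timeout=60 wins over the stored default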
> > It would also make the proposed feature to allow multiple kw args > expansions in Python 3.5 easy to write by having > > f(**a, **b, **c) > be equivalent to > f(**(a | b | c)) > > -Alexander > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Jul 29 00:40:02 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 29 Jul 2014 08:40:02 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> Message-ID: On 29 Jul 2014 08:22, "Alexander Heger" wrote: > > My point is to make the dict() data structure more easy to use for > most users and use cases. Especially novices. > This is what adds power to the language. Not that you can do things > (Turing machines can) but that you can do them easily and naturally. But why is dict merging into a *new* dict something that needs to be done as a single expression? What's the problem with spelling out "to merge two dicts into a new, first make a dict, then merge in the other one": x = dict(a) x.update(b) That's the real competitor here, not the more cryptic "x = dict(a, **b)" You can even use it as an example of factoring out a helper function: def copy_and_update(a, *args): x = dict(a) for arg in args: x.update(arg) return x My personal experience suggests that's a rare enough use case that it's fine to leave it as a trivial helper function that people can write if they need it. The teaching example isn't compelling, since in the teaching case, spelling out the steps is going to be necessary anyway to explain what the function or method call is actually doing. Cheers, Nick. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at 2sn.net Tue Jul 29 00:35:55 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 08:35:55 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <999F9DAF-E27A-46FE-A444-C2713A18BBB6@yahoo.com> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <999F9DAF-E27A-46FE-A444-C2713A18BBB6@yahoo.com> Message-ID: > The difference is that with sets, it (at least conceptually) doesn't matter whether you keep elements from s or t when they collide, because by definition they only collide if they're equal, but with dicts, it very much matters whether you keep items from s or t when their keys collide, because the corresponding values are generally _not_ equal. So this is a false analogy; the same problem raised in the first three replies on this thread still needs to be answered: Is it obvious that the values from b should overwrite the values from a (assuming that's the rule you're suggesting, since you didn't specify; translate to the appropriate question if you want a different rule) in all real-life use cases? If not, is this so useful that the benefits in some uses outweigh the almost certain confusion in others? Without a compelling "yes" to one of those two questions, we're still at square one here; switching from + to | and making an analogy with sets doesn't help. 
> >> ... and accordingly >> >> D = A | B | C >> >> Maybe this operator is better as this equivalence is already being >> used (for sets). Accordingly "union(A,B)" could do a merge operation >> and return the new dict(). > > Wouldn't you expect a top-level union function to take any two iterables and return the union of them as a set (especially given that set.union accepts any iterable for its non-self argument)? A.union(B) seems a lot better than union(A, B). > > Then again, A.updated(B) or updated?A, B) might be even better, as someone suggested, because the parallel between update and updated (and between e.g. sort and sorted) is not at all problematic. yes, one does have to deal with collisions and spell out a clear rule: same behaviour as update(). I was less uneasy about the | operator 1) it is already used the same way for collections.Counter [this is a quite strong constraint] 2) in shells it is used as "pipe" implying directionality - order matters yes, you are wondering whether the order should be this or that; you just *define* what it is, same as you do for subtraction. Another way of looking at it is to say that even in sets you take the second, but because they are identical it does not matter ;-) -Alexander From python at 2sn.net Tue Jul 29 00:48:49 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 08:48:49 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <6122DCE6-D84A-4B05-AB02-C1FD3CED82A4@ryanhiebert.com> Message-ID: > But really, I'm not seeing a compelling argument for why this needs to be a > builtin. If someone is merging dicts often enough to care, they can already > write a function to do the dict copy-and-update as a single operation. What > makes this more special than the multitude of other three line functions in > the world? We all have too many of those. This would not add too much complexity to the language and overcome some awkward constructs needed otherwise. Currently dictionaries are not really as easy to use as your everyday data type as it should be lacking such operators. -Alexander From python at 2sn.net Tue Jul 29 01:04:42 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 09:04:42 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: >> args = dict(...) >> def f(**kwargs): >> temp_args = dict(dic0) >> temp_args.update(kwargs) >> temp_args.update(dic1) >> g(**temp_args) > > No, you just have to write a one-liner with ChainMap, except in the (very rare) case where you're expecting g to hold onto and later modify its kwargs. yes, this (modify) is what I do. In any case, it would still be g(**collections.ChainMap(dict1, kwargs, dic0)) In either case a new dict is created and passed to g as kwargs. It's not pretty, but it does work. Thanks. so the general case D = A | B | C becomes D = dict(collections.ChainMap(C, B, A)) (someone may suggest dict could have a "chain" constructor class method D = dict.chain(C, B, A)) From python at 2sn.net Tue Jul 29 01:18:37 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 09:18:37 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: >> args0 = dict(...) >> args1 = dict(...) >> >> def f(**kwargs): >> g(**(arg0 | kwargs | args1)) >> >> currently I have to write >> >> args = dict(...) 
>> def f(**kwargs): >> temp_args = dict(dic0) >> temp_args.update(kwargs) >> temp_args.update(dic1) >> g(**temp_args) > > The first part of this one of the use cases for functools.partial(), so it > isn't a compelling argument for easy dict merging. The above is largely an > awkward way of spelling: > > import functools > f = functools.partial(g, **...) > > The one difference is to also silently *override* some of the explicitly > passed arguments, but that part's downright user hostile and shouldn't be > encouraged. yes, poor example due to briefly. ;-) In my case f would actually do something with the values of kwargs before calling g, and args1 many not be static outside f. (hence partial is not a solution for the full application) def f(**kwargs): # do something with kwrags, create dict0 and dict1 using kwargs temp_args = dict(dict0) temp_args.update(kwargs) temp_args.update(dict1) g(**temp_args) # more uses of dict0 which could be def f(**kwargs): # do something with kwargs, create dict0 and dict1 using kwargs g(**collections.ChainMap(dict1, kwargs, dict0)) # more uses of dict0 Maybe good enough for that case, like with + or |, one still need to know/learn the lookup order for key replacement, and it is sort of bulky. -Alexander From python at 2sn.net Tue Jul 29 01:45:06 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 09:45:06 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> Message-ID: > But why is dict merging into a *new* dict something that needs to be done as > a single expression? What's the problem with spelling out "to merge two > dicts into a new, first make a dict, then merge in the other one": > > x = dict(a) > x.update(b) > > That's the real competitor here, not the more cryptic "x = dict(a, **b)" > > You can even use it as an example of factoring out a helper function: > > def copy_and_update(a, *args): > x = dict(a) > for arg in args: > x.update(arg) > return x > > My personal experience suggests that's a rare enough use case that it's fine > to leave it as a trivial helper function that people can write if they need > it. The teaching example isn't compelling, since in the teaching case, > spelling out the steps is going to be necessary anyway to explain what the > function or method call is actually doing. it is more about having easy operations for people who learn Python for the sake of using it (besides, I teach science students not computer science students). The point is that it could be done in one operation. It seems like asking people to write a = 2 + 3 as a = int(2) a.add(3) Turing machine vs modern programming language. It does already work for Counters. The discussion seems to go such that because people can't agree whether the first or second occurrence of keys takes precedence, or what operator to use (already decided by the design of Counter) it is not done at all. To be fair, I am not a core Python programmer and am asking others to implement this - or maybe even agree it would be useful -, maybe pushing too much where just an idea should be floated. -Alexander From stephen at xemacs.org Tue Jul 29 02:16:08 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 29 Jul 2014 09:16:08 +0900 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> Alexander Heger writes: > It seems it would be valuable to parallel the behaviour of operators > already in place for collections. 
Mappings aren't collections. In set theory, of course, they are represented as *appropriately restricted* collections, but the meaning of "+" as applied to mappings in mathematics varies. For functions on the same domain, there's usually an element-wise meaning that's applied. For functions on different domains, I've seen it used to mean "apply the appropriate function on the disjoint union of the domains". I don't think there's an obvious winner in the competition among the various meanings. From python at 2sn.net Tue Jul 29 02:38:38 2014 From: python at 2sn.net (Alexander Heger) Date: Tue, 29 Jul 2014 10:38:38 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: > > It seems it would be valuable to parallel the behaviour of operators > > already in place for collections. > > Mappings aren't collections. In set theory, of course, they are > represented as *appropriately restricted* collections, but the meaning > of "+" as applied to mappings in mathematics varies. For functions on > the same domain, there's usually an element-wise meaning that's > applied. For functions on different domains, I've seen it used to > mean "apply the appropriate function on the disjoint union of the > domains". > > I don't think there's an obvious winner in the competition among the > various meanings. I mistyped. It should have read " ... the behaviour in place for collections.Counter" It does define "+" and "|" operations. -Alexander From tjreedy at udel.edu Tue Jul 29 03:39:28 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 28 Jul 2014 21:39:28 -0400 Subject: [Python-ideas] adding dictionaries In-Reply-To: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 7/28/2014 8:16 PM, Stephen J. Turnbull wrote: > Alexander Heger writes: > > > It seems it would be valuable to parallel the behaviour of operators > > already in place for collections. > > Mappings aren't collections. In set theory, of course, they are > represented as *appropriately restricted* collections, but the meaning > of "+" as applied to mappings in mathematics varies. For functions on > the same domain, there's usually an element-wise meaning that's > applied. This assumes the same range set (of addable items) also. If Python were to add d1 + d2 and d1 += d2, I think we should use this existing and most common definition and add values. The use cases are keyed collections of things that can be added, which are pretty common. Then dict addition would have the properties of the value addition. Example: Let sales be a mapping from salesperson to total sales (since whenever). Let sales_today be a mapping from saleperson to today's sales. Then sales = sales + sales_today, or sales += sales_today. I could, of course, do this today with class Sales(dict): with __add__, __iadd__, and probably other app-specific methods. The issue is that there are two ways to update a mapping with an update mapping: replace values and combine values. Addition combines, so to me, dict addition, if defined, should combine. > For functions on different domains, I've seen it used to > mean "apply the appropriate function on the disjoint union of the > domains". According to https://en.wikipedia.org/wiki/Disjoint_union, d_u has at least two meaning. 
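As a rough sketch of that value-adding reading, using the sales example
above (the class name AddingDict and the numbers are made up for
illustration, not a proposed implementation):

    class AddingDict(dict):
        # d1 + d2: values are added for keys present in both operands
        def __add__(self, other):
            if not isinstance(other, dict):
                return NotImplemented
            result = AddingDict(self)
            for key, value in other.items():
                if key in result:
                    result[key] = result[key] + value
                else:
                    result[key] = value
            return result

    sales = AddingDict(alice=1000, bob=250)
    sales_today = AddingDict(alice=50, carol=75)
    sales = sales + sales_today
    # {'alice': 1050, 'bob': 250, 'carol': 75}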
-- Terry Jan Reedy From abarnert at yahoo.com Tue Jul 29 04:09:31 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 28 Jul 2014 19:09:31 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <60A434E0-C0DA-467B-B13A-BA6986C5B7B1@yahoo.com> Message-ID: On Jul 28, 2014, at 16:45, Alexander Heger wrote: > The discussion seems to go such that because people can't agree > whether the first or second occurrence of keys takes precedence, or > what operator to use (already decided by the design of Counter) it is > not done at all. Well, yeah, that happens a lot. An good idea that can't be turned into a concrete design that fits the language and makes everyone happy doesn't get added, unless it's so ridiculously compelling that nobody can imagine living without it. But that's not necessarily a bad thing--it's why Python is a relatively small and highly consistent language, which I think is a big part of why Python is so readable and teachable. Anyway, I think you're on to something with your idea of adding an updated or union or whatever function/method whose semantics are obvious, and then mapping the operators to that method and update. I can definitely buy that a.updated(b) or union(a, b) favors values from b for exactly the same reason a.update(b) does (although as I mentioned I have other problems with a union function). Meanwhile, if you have use cases for which ChainMap is not appropriate, you might want to write a dict subclass that you can use in your code or in teaching students or whatever, so you can amass some concrete use cases and show how much cleaner it is than the existing alternatives. > To be fair, I am not a core Python programmer and am > asking others to implement this - or maybe even agree it would be > useful -, maybe pushing too much where just an idea should be floated. If it helps, if you can get everyone to agree on this, except that none of the core devs wants to do the work, I'll volunteer to write the C code (after I finish my io patch and my abc patch...), so you only have to add the test cases (which are easy Python code; the only hard part is deciding what to test) and the docs. From jeanpierreda at gmail.com Tue Jul 29 04:46:14 2014 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Mon, 28 Jul 2014 19:46:14 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Mon, Jul 28, 2014 at 5:16 PM, Stephen J. Turnbull wrote: > Alexander Heger writes: > > > It seems it would be valuable to parallel the behaviour of operators > > already in place for collections. > > Mappings aren't collections. In set theory, of course, they are > represented as *appropriately restricted* collections, but the meaning > of "+" as applied to mappings in mathematics varies. For functions on > the same domain, there's usually an element-wise meaning that's > applied. For functions on different domains, I've seen it used to > mean "apply the appropriate function on the disjoint union of the > domains". > > I don't think there's an obvious winner in the competition among the > various meanings. The former meaning requires that the member types support addition, so it's the obvious loser -- dicts can contain any kind of value, not just addable ones. Adding a method that only works if the values satisfy certain extra optional constraints is rare in Python, and needs justification over the alternatives. 
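A small illustration (with made-up values) of why the element-wise
reading cannot be the general one:

    handlers = {'on_open': open, 'on_close': print}
    overrides = {'on_close': repr}

    # "Adding" two callables is meaningless, but a merge (pick one value
    # per key) always works, whatever the values are:
    merged = dict(handlers)
    merged.update(overrides)    # {'on_open': open, 'on_close': repr}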
The second suggestion works just fine, you just need to figure out what to do with the intersection since we won't have disjoint domains. The obvious suggestion is to pick an ordering, just like the update method does. For another angle: the algorithms course I took in university introduced dictionaries as sets where the members of the set are tagged with values. This makes set-like operators obvious in meaning, with the only question being, again, what to do with the tags during collisions. (FWIW, the meaning of + as applied to sets is generally union -- but Python's set type uses | instead, presumably for analogy with ints when they are treated as a set of small integers). That said, the only reason I can think of to support this new stuff is to stop dict(x, **y) from being such an attractive nuisance. -- Devin From stephen at xemacs.org Tue Jul 29 05:13:15 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 29 Jul 2014 12:13:15 +0900 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87silkn9h0.fsf@uwakimon.sk.tsukuba.ac.jp> Alexander Heger writes: > I mistyped. It should have read " ... the behaviour in place for > collections.Counter" But there *is* a *the* (ie, unique) "additive" behavior for Counter. (At least, I find it reasonable to think so.) What you're missing is that there is no such agreement on what it means to add dictionaries. True, you can "just pick one". Python doesn't much like to do that, though. The problem is that on discovering that dictionaries can be added, *everybody* is going to think that their personal application is the obvious one to implement as "+" and/or "+=". Some of them are going to be wrong and write buggy code as a consequence. From steve at pearwood.info Tue Jul 29 05:34:12 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 29 Jul 2014 13:34:12 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> Message-ID: <20140729033411.GJ9112@ando> On Mon, Jul 28, 2014 at 12:17:10PM -0500, Ron Adam wrote: > > On 07/28/2014 11:04 AM, Steven D'Aprano wrote: [...] > >new_dict = a + b + c + d > > > >Pros: + is short to type; subclasses can control the type of new_dict. > >Cons: dict addition isn't obvious. > > I think it's more obvious. It only needs __add__ and __iadd__ methods to > make it consistent with the list type. What I meant was that it wasn't obvious what dict1 + dict2 should do, not whether or not the __add__ method exists. > I think this added consistency between lists and dicts would be useful. Lists and dicts aren't the same kind of object. I'm not sure it is helpful to force them to be consistent. Should list grow an update() method to make it consistent with dicts? How about setdefault()? As for being useful, useful for what? Useful how often? I'm sure that one could take any piece of code, no matter how obscure, and say it is useful *somewhere* :-) but the question is whether it is useful enough to be part of the language. I was wrong to earlier dismiss the OP's usecase for dict addition by suggestion dict(a, **b). Such a thing only works if all the keys of b are valid identifiers. But that doesn't mean that just because my shoot-from-the-hip response missed the target that we should conclude that dict addition solves an important problem or that + is the correct way to spell it. 
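For the record, a quick illustration of that restriction (the example
values are made up):

    a = {'size': 10}
    b = {'size': 20, 'colour': 'blue'}
    dict(a, **b)    # works only because every key of b is a string

    c = {(1, 2): 'point'}
    dict(a, **c)    # TypeError: ** expansion requires string keys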
I'm still dubious that it's needed, but if it were, this is what I would prefer to see: * should be a Mapping method, not a top-level function; * should accept anything the dict constructor accepts, mappings or lists of (key,value) pairs as well as **kwargs; * my prefered name for this is now "merged" rather than "updated"; * it should return a new mapping, not modify in-place; * when called from a class, it should behave like a class method: MyMapping.merged(a, b, c) should return an instance of MyMapping; * but when called from an instance, it should behave like an instance method, with self included in the chain of mappings to merge: a.merged(b, c) rather than a.merged(a, b, c). I have a descriptor type which implements the behaviour from the last two bullet points, so from a technical standpoint it's not hard to implement this. But I can imagine a lot of push-back from the more conservative developers about adding a *fourth* method type (even if it is private) to the Python builtins, so it would take a really compelling use-case to justify adding a new method type and a new dict method. (Personally, I think this hybrid class/instance method type is far more useful than staticmethod, since I've actually used it in production code, but staticmethod isn't going away.) -- Steven From stephen at xemacs.org Tue Jul 29 07:15:44 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 29 Jul 2014 14:15:44 +0900 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87ppgon3sv.fsf@uwakimon.sk.tsukuba.ac.jp> Terry Reedy writes: > This assumes the same range set (of addable items) also. If Python were > to add d1 + d2 and d1 += d2, I think we should use this existing and > most common definition and add values. IMHO[1] that's way too special for the generic mapping types. If one wants such operations, she should define NumericValuedMapping and StringValuedMapping etc classes for each additive set of values. > > For functions on different domains, I've seen it used to > > mean "apply the appropriate function on the disjoint union of the > > domains". > > According to https://en.wikipedia.org/wiki/Disjoint_union, d_u has at > least two meaning. Either meaning will do here, with the distinction that the set- theoretic meaning (which I intended) applies to any two functions, while the alternate meaning imposes a restriction on the functions that can be added (and therefore is inappropriate for this discussion IMHO). Footnotes: [1] I mean the "H", I'm no authority. From abarnert at yahoo.com Tue Jul 29 08:15:44 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 28 Jul 2014 23:15:44 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140729033411.GJ9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <20140729033411.GJ9112@ando> Message-ID: <1406614544.48360.YahooMailNeo@web181002.mail.ne1.yahoo.com> On Monday, July 28, 2014 8:34 PM, Steven D'Aprano wrote: [snip] > * when called from a class, it should behave like a class method: > ? MyMapping.merged(a, b, c) should return an instance of MyMapping; > > * but when called from an instance, it should behave like an instance > ? method, with self included in the chain of mappings to merge: > ? a.merged(b, c) rather than a.merged(a, b, c). 
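One rough way to get that hybrid behaviour (the name hybridmethod is
assumed here purely for illustration):

    import functools

    class hybridmethod:
        # Pass the class when accessed on the class,
        # the instance when accessed on an instance.
        def __init__(self, func):
            self.func = func
        def __get__(self, obj, objtype=None):
            first = objtype if obj is None else obj
            return functools.partial(self.func, first)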
> > > I have a descriptor type which implements the behaviour from the last > two bullet points, so from a technical standpoint it's not hard to > implement this. But I can imagine a lot of push-back from the more > conservative developers about adding a *fourth* method type (even if it > is private) to the Python builtins, so it would take a really compelling > use-case to justify adding a new method type and a new dict method. > > (Personally, I think this hybrid class/instance method type is far more > useful than staticmethod, since I've actually used it in production > code, but staticmethod isn't going away.) How is this different from a plain-old (builtin or normal) method? >>> class Spam: ... ? ? def eggs(self, a): ... ? ? ? ? print(self, a) >>> spam = Spam() >>> Spam.eggs(spam, 2) <__main__.Spam object at 0x106377080> 2 >>> spam.eggs(2) <__main__.Spam object at 0x106377080> 2 >>> Spam.eggs >>> spam.eggs > >>> s = {1, 2, 3} >>> set.union(s, [4]) {1, 2, 3, 4} >>> s.union([4]) {1, 2, 3, 4} >>> set.union >>> s.union This is the way methods have always worked (although the details of how they worked under the covers changed in 3.0, and before that when descriptors and new-style classes were added). From p.f.moore at gmail.com Tue Jul 29 08:22:34 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 29 Jul 2014 07:22:34 +0100 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: On 29 July 2014 00:04, Alexander Heger wrote: > D = A | B | C > > becomes > > D = dict(collections.ChainMap(C, B, A)) This immediately explains the key problem with this proposal. It never even *occurred* to me that anyone would expect C to take priority over A in the operator form. But the ChainMap form makes it immediately clear to me that this is the intent. An operator form will be nothing but a maintenance nightmare and a source of bugs. Thanks for making this obvious :-) -1. Paul From ncoghlan at gmail.com Tue Jul 29 09:46:56 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 29 Jul 2014 17:46:56 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <87silkn9h0.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87wqaxm33r.fsf@uwakimon.sk.tsukuba.ac.jp> <87silkn9h0.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 29 July 2014 13:13, Stephen J. Turnbull wrote: > Alexander Heger writes: > > > I mistyped. It should have read " ... the behaviour in place for > > collections.Counter" > > But there *is* a *the* (ie, unique) "additive" behavior for Counter. > (At least, I find it reasonable to think so.) What you're missing is > that there is no such agreement on what it means to add dictionaries. > > True, you can "just pick one". Python doesn't much like to do that, > though. The problem is that on discovering that dictionaries can be > added, *everybody* is going to think that their personal application > is the obvious one to implement as "+" and/or "+=". Some of them are > going to be wrong and write buggy code as a consequence. In fact, the existence of collections.Counter.__add__ is an argument *against* introducing dict.__add__ with different semantics: >>> issubclass(collections.Counter, dict) True So, if someone *wants* a dict with "addable" semantics, they can already use collections.Counter. While some of its methods really only work with integers, the addition part is actually usable with arbitrary addable types. If set-like semantics were added to dict, it would conflict with the existing element-wise semantics of Counter. Cheers, Nick. 
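For reference, a quick demonstration of the element-wise behaviour that
a set-like dict.__add__ would sit alongside (and conflict with):

    from collections import Counter

    Counter({'a': 2, 'b': 1}) + Counter({'a': 3})
    # Counter({'a': 5, 'b': 1}): values are added, not replaced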
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From toddrjen at gmail.com Tue Jul 29 10:05:10 2014 From: toddrjen at gmail.com (Todd) Date: Tue, 29 Jul 2014 10:05:10 +0200 Subject: [Python-ideas] Accept list in os.path.join Message-ID: Currently, os.path.join joins strings specified in its arguments, with one string per argument. On its own, that is not a problem. However, it is inconsistent with str.join, which accepts only a list of strings. This inconsistency can lead to some confusion, since these operations that have similar names and carry out similar tasks have fundamentally different syntax. My suggestion is to allow os.path.join to accept a list of strings in addition to existing one string per argument. This would allow it to be used in a manner consistent with str.join, while still allowing existing code to run as expected. Currently, when os.path.join is given a single list, it returns that list exactly. This is undocumented behavior (I am surprised it is not an exception). It would mean, however, this change would break code that wants a list if given a list but wants to join if given multiple strings. This is conceivable, but outside of catching the sorts of errors this change would prevent, I would be surprised if it is a common use-case. In the case where multiple arguments are used and one or more of those arguments are a list, I think the best solution would be to raise an exception, since this would avoid corner cases and be less likely to silently propagate bugs. However, I am not set on that, so if others prefer it join all the strings in all the lists that would be okay too. So the syntax would be like this (on POSIX as an example): >>> os.path.join('test1', 'test2', 'test3') # current syntax 'test1/test2/test3' >>> os.path.join(['test1', 'test2', 'test3']) # new syntax 'test1/test2/test3' >>> os.path.join(['test1', 'test2'], 'test3') Exception >>> os.path.join(['test1'], 'test2', 'test3') Exception >>> os.path.join(['test1', 'test2'], ['test3']) Exception -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Tue Jul 29 11:12:28 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 29 Jul 2014 02:12:28 -0700 Subject: [Python-ideas] Accept list in os.path.join In-Reply-To: References: Message-ID: <1406625148.37122.YahooMailNeo@web181004.mail.ne1.yahoo.com> On Tuesday, July 29, 2014 1:14 AM, Todd wrote: >Currently, os.path.join joins strings specified in its arguments, with one string per argument. ? > >On its own, that is not a problem.? However, it is inconsistent with str.join, which accepts only a list of strings. No, str.join accepts any iterable of strings?including a string, which is an iterable of single-character strings.? Not that you often intentionally pass a string to it, but you do very often pass a generator expression or other iterator, so treating lists specially for os.path.join to make it work more like str.join would just increase confusion, not reduce it. Also, I don't know of anything else in Python that has special treatment for lists vs. other iterables. There are a few cases that have special treatment for _tuples_ (notably str.__mod__), but I don't think anyone wants to expand those, and I don't think it would make you happy here, either. 
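A quick illustration of the existing spellings (sample paths made up,
output shown for a POSIX-style os.path):

    import os.path

    parts = ['usr', 'local', 'bin']

    '/'.join(parts)          # str.join accepts any iterable of strings
    os.path.join(*parts)     # 'usr/local/bin'; unpacking covers the list case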
From tjreedy at udel.edu Tue Jul 29 11:30:25 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 29 Jul 2014 05:30:25 -0400 Subject: [Python-ideas] Accept list in os.path.join In-Reply-To: References: Message-ID: On 7/29/2014 4:05 AM, Todd wrote: > Currently, os.path.join joins strings specified in its arguments, with > one string per argument. One typically has 2 or possibly 3 path segements, never 1000. > On its own, that is not a problem. However, it is inconsistent with > str.join, which accepts only a list of strings. This inconsistency can > lead to some confusion, since these operations that have similar names > and carry out similar tasks have fundamentally different syntax. I partly agree, but think about the actually use cases. > My suggestion is to allow os.path.join to accept a list of strings in > addition to existing one string per argument. os.path.join(*iterable) -- Terry Jan Reedy From j.wielicki at sotecware.net Tue Jul 29 13:37:54 2014 From: j.wielicki at sotecware.net (Jonas Wielicki) Date: Tue, 29 Jul 2014 13:37:54 +0200 Subject: [Python-ideas] Accept list in os.path.join In-Reply-To: References: Message-ID: <53D78792.9080401@sotecware.net> On 29.07.2014 10:05, Todd wrote: > In the case where multiple arguments are used and one or more of those > arguments are a list, I think the best solution would be to raise an > exception, since this would avoid corner cases and be less likely to > silently propagate bugs. However, I am not set on that, so if others > prefer it join all the strings in all the lists that would be okay too. >From the implementation point of view, I have yet to see a duck-typing way to distinguish a list (or any other iterable) of strings from a string. regards, jwi From j.wielicki at sotecware.net Tue Jul 29 13:56:57 2014 From: j.wielicki at sotecware.net (Jonas Wielicki) Date: Tue, 29 Jul 2014 13:56:57 +0200 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: <53D78C09.20406@sotecware.net> On 29.07.2014 08:22, Paul Moore wrote: > On 29 July 2014 00:04, Alexander Heger wrote: >> D = A | B | C >> >> becomes >> >> D = dict(collections.ChainMap(C, B, A)) > > This immediately explains the key problem with this proposal. It never > even *occurred* to me that anyone would expect C to take priority over > A in the operator form. But the ChainMap form makes it immediately > clear to me that this is the intent. FWIW, one could use an operator which inherently shows a direction: << and >>, for both directions respectively. A = B >> C lets B take precedence, and A = B << C lets C take precedence. regards, jwi p.s.: I?m not entirely sure what to think about my suggestion---I?d like to hear opinions. > > An operator form will be nothing but a maintenance nightmare and a > source of bugs. Thanks for making this obvious :-) > > -1. > > Paul > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > From p.f.moore at gmail.com Tue Jul 29 14:29:52 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 29 Jul 2014 13:29:52 +0100 Subject: [Python-ideas] adding dictionaries In-Reply-To: <53D78C09.20406@sotecware.net> References: <53D78C09.20406@sotecware.net> Message-ID: On 29 July 2014 12:56, Jonas Wielicki wrote: > FWIW, one could use an operator which inherently shows a direction: << > and >>, for both directions respectively. 
> > A = B >> C lets B take precedence, and A = B << C lets C take precedence. > > regards, > jwi > > p.s.: I?m not entirely sure what to think about my suggestion---I?d like > to hear opinions. Personally, I don't like it much more than the symmetric-looking operators. I get your point, but it feels like you're just patching over a relatively small aspect of a fundamentally bad idea. But then again as I've already said, I see no need for any of this, the existing functionality seems fine to me. Paul From steve at pearwood.info Tue Jul 29 15:35:56 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 29 Jul 2014 23:35:56 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <1406614544.48360.YahooMailNeo@web181002.mail.ne1.yahoo.com> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <20140729033411.GJ9112@ando> <1406614544.48360.YahooMailNeo@web181002.mail.ne1.yahoo.com> Message-ID: <20140729133556.GK9112@ando> On Mon, Jul 28, 2014 at 11:15:44PM -0700, Andrew Barnert wrote: > On Monday, July 28, 2014 8:34 PM, Steven D'Aprano wrote: > [snip] > > * when called from a class, it should behave like a class method: > > ? MyMapping.merged(a, b, c) should return an instance of MyMapping; > > > > * but when called from an instance, it should behave like an instance > > ? method, with self included in the chain of mappings to merge: > > ? a.merged(b, c) rather than a.merged(a, b, c). > > > > > > I have a descriptor type which implements the behaviour from the last > > two bullet points, so from a technical standpoint it's not hard to > > implement this. But I can imagine a lot of push-back from the more > > conservative developers about adding a *fourth* method type (even if it > > is private) to the Python builtins, so it would take a really compelling > > use-case to justify adding a new method type and a new dict method. > > > > (Personally, I think this hybrid class/instance method type is far more > > useful than staticmethod, since I've actually used it in production > > code, but staticmethod isn't going away.) > > > How is this different from a plain-old (builtin or normal) method? I see I failed to explain clearly, sorry about that. With class methods, the method always receives the class as the first argument. Regardless of whether you write dict.fromkeys or {1:'a'}.fromkeys, the first argument is the class, dict. With instance methods, the method receives the instance. If you call it from a class, the method is "unbound" and you are responsible for providing the "self" argument. To me, this hypothetical merged() method sometimes feels like an alternative constructor, like fromkeys, and therefore best written as a class method, but sometimes like a regular method. Since it feels like a hybrid to me, I think a hybrid descriptor approach is best, but as I already said I can completely understand if conservative developers reject this idea. In the hybrid form I'm referring to, the first argument provided is the class when called from the class, and the instance when called from an instance. Imagine it written in pure Python like this: class dict: @hybridmethod def merged(this, *args, **kwargs): if isinstance(this, type): # Called from the class new = this() else: # Called from an instance. 
new = this.copy() for arg in args: new.update(arg) new.update(kwargs) return new If merged is a class method, we can avoid having to worry about the case where your "a" mapping happens to be a list of (key,item) pairs: a.merged(b, c, d) # Fails if a = [(key, item), ...] dict.merged(a, b, c, d) # Always succeeds. It also allows us to easily specify a different mapping type for the result: MyMapping.merged(a, b, c, d) although some would argue this is just as clear: MyMapping().merged(a, b, c, d) albeit perhaps not quite as efficient if MyMapping is expensive to instantiate. (You create an empty instance, only to throw it away again.) On the other hand, there are use-cases where merged() best communicates the intent if it is a regular instance method. Consider: settings = application_defaults.merged( global_settings, user_settings, commandline_settings) seems more clear to me than: settings = dict.merged( application_defaults, global_settings, user_settings, commandline_settings) especially in the case that application_defaults is a dict literal. tl;dr It's not often that I can't decide whether a method ought to be a class method or an instance method, the decision is usually easy, but this is one of those times. -- Steven From j.wielicki at sotecware.net Tue Jul 29 16:03:09 2014 From: j.wielicki at sotecware.net (Jonas Wielicki) Date: Tue, 29 Jul 2014 16:03:09 +0200 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140729133556.GK9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <20140729033411.GJ9112@ando> <1406614544.48360.YahooMailNeo@web181002.mail.ne1.yahoo.com> <20140729133556.GK9112@ando> Message-ID: <53D7A99D.6000006@sotecware.net> On 29.07.2014 15:35, Steven D'Aprano wrote: > On Mon, Jul 28, 2014 at 11:15:44PM -0700, Andrew Barnert wrote: >> On Monday, July 28, 2014 8:34 PM, Steven D'Aprano wrote: >> [snip] >>> * when called from a class, it should behave like a class method: >>> MyMapping.merged(a, b, c) should return an instance of MyMapping; >>> >>> * but when called from an instance, it should behave like an instance >>> method, with self included in the chain of mappings to merge: >>> a.merged(b, c) rather than a.merged(a, b, c). >>> >>> >>> I have a descriptor type which implements the behaviour from the last >>> two bullet points, so from a technical standpoint it's not hard to >>> implement this. But I can imagine a lot of push-back from the more >>> conservative developers about adding a *fourth* method type (even if it >>> is private) to the Python builtins, so it would take a really compelling >>> use-case to justify adding a new method type and a new dict method. >>> >>> (Personally, I think this hybrid class/instance method type is far more >>> useful than staticmethod, since I've actually used it in production >>> code, but staticmethod isn't going away.) >> >> >> How is this different from a plain-old (builtin or normal) method? > [snip] > In the hybrid form I'm referring to, the first argument provided is the > class when called from the class, and the instance when called from an > instance. Imagine it written in pure Python like this: > > class dict: > @hybridmethod > def merged(this, *args, **kwargs): > if isinstance(this, type): > # Called from the class > new = this() > else: > # Called from an instance. > new = this.copy() > for arg in args: > new.update(arg) > new.update(kwargs) > return new [snip] I really like the semantics of that. 
This allows for concise, and in my opinion, clearly readable code. Although I think maybe one should have two separate methods: the class method being called ``merged`` and the instance method called ``merged_with``. I find result = somedict.merged(b, c) somewhat less clear than result = somedict.merged_with(b, c) regards, jwi From steve at pearwood.info Tue Jul 29 16:36:05 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 30 Jul 2014 00:36:05 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: Message-ID: <20140729143605.GL9112@ando> On Tue, Jul 29, 2014 at 07:22:34AM +0100, Paul Moore wrote: > On 29 July 2014 00:04, Alexander Heger wrote: > > D = A | B | C > > > > becomes > > > > D = dict(collections.ChainMap(C, B, A)) > > This immediately explains the key problem with this proposal. It never > even *occurred* to me that anyone would expect C to take priority over > A in the operator form. But the ChainMap form makes it immediately > clear to me that this is the intent. Hmmm. Funny you say that, because to me that is a major disadvantage of the ChainMap form: you have to write the arguments in reverse order. Suppose that we want to start with a, then override it with b, then override that with c. Since a is the start (the root, the base), we start with a, something like this: d = {} d.update(a) d.update(b) d.update(c) If update was chainable as it would be in Ruby: d.update(a).update(b).update(c) or even: d.update(a, b, c) This nicely leads us to d = a+b+c (assuming we agree that + meaning merge is the spelling we want). The ChainMap, on the other hand, works backwards from this perspective: the last dict to be merged has to be given first: ChainMap(c, b, a) -- Steven From nathan at cmu.edu Tue Jul 29 16:50:02 2014 From: nathan at cmu.edu (Nathan Schneider) Date: Tue, 29 Jul 2014 10:50:02 -0400 Subject: [Python-ideas] adding dictionaries In-Reply-To: <53D78C09.20406@sotecware.net> References: <53D78C09.20406@sotecware.net> Message-ID: On Tue, Jul 29, 2014 at 7:56 AM, Jonas Wielicki wrote: > > FWIW, one could use an operator which inherently shows a direction: << > and >>, for both directions respectively. > > A = B >> C lets B take precedence, and A = B << C lets C take precedence. > If there is to be an operator devoted specifically to this, I like << and >> as unambiguous choices. Proof: https://mail.python.org/pipermail/python-ideas/2011-December/013232.html :) I am also partial to the {**A, **B} proposal in http://legacy.python.org/dev/peps/pep-0448/. Cheers, Nathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Tue Jul 29 21:29:29 2014 From: abarnert at yahoo.com (Andrew Barnert) Date: Tue, 29 Jul 2014 12:29:29 -0700 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140729143605.GL9112@ando> References: <20140729143605.GL9112@ando> Message-ID: <1406662169.52281.YahooMailNeo@web181005.mail.ne1.yahoo.com> On Tuesday, July 29, 2014 7:36 AM, Steven D'Aprano wrote: >On Tue, Jul 29, 2014 at 07:22:34AM +0100, Paul Moore wrote: >> On 29 July 2014 00:04, Alexander Heger wrote: >> > D = A | B | C >> > >> > becomes >> > >> > D = dict(collections.ChainMap(C, B, A)) >> >> This immediately explains the key problem with this proposal. It never >> even *occurred* to me that anyone would expect C to take priority over >> A in the operator form. But the ChainMap form makes it immediately >> clear to me that this is the intent. > >Hmmm. 
Funny you say that, because to me that is a major disadvantage of >the ChainMap form: you have to write the arguments in reverse order. I think that's pretty much exactly his point: To him, it's obvious that + should be in the order of ChainMap, and he can't even conceive of the possibility that you'd want it "backward". To you, it's obvious that + should be the other way around, and you find it annoying that ChainMap is "backward". Which seems to imply that any attempt at setting an order is going to not only seem backward, but possibly surprisingly so, to a subset of Python's users. And this is the kind of thing can lead to subtle bugs. If a and b _almost never_ have duplicate keys, but very rarely do, you won't catch the problem until you think to test for it. And if one order or the other is so obvious to you that you didn't even imagine anyone would ever implement the opposite order, you probably won't think to write the test until you have a bug in the field? From ron3200 at gmail.com Wed Jul 30 01:12:16 2014 From: ron3200 at gmail.com (Ron Adam) Date: Tue, 29 Jul 2014 18:12:16 -0500 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140729033411.GJ9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <20140729033411.GJ9112@ando> Message-ID: On 07/28/2014 10:34 PM, Steven D'Aprano wrote: > On Mon, Jul 28, 2014 at 12:17:10PM -0500, Ron Adam wrote: >> > >> >On 07/28/2014 11:04 AM, Steven D'Aprano wrote: > [...] > >>> > >new_dict = a + b + c + d >>> > > >>> > >Pros: + is short to type; subclasses can control the type of new_dict. >>> > >Cons: dict addition isn't obvious. >> > >> >I think it's more obvious. It only needs __add__ and __iadd__ methods to >> >make it consistent with the list type. > What I meant was that it wasn't obvious what dict1 + dict2 should do, > not whether or not the __add__ method exists. What else could it do besides return a new copy of dict1 updated with dict2 contents? It's an unordered container, so it wouldn't append, and the duplicate keys would be resolved based on the order of evaluation. I don't see any problem with that. I also don't know of any other obvious way to combine two dictionaries. The argument against it, may simply be that it's a feature by design, to have dictionaries unique enough so that code which handles them is clearly specific to them. I'm not sure how strong that logic is though. >> >I think this added consistency between lists and dicts would be useful. > Lists and dicts aren't the same kind of object. I'm not sure it is > helpful to force them to be consistent. Should list grow an update() > method to make it consistent with dicts? How about setdefault()? Well, here is how they currently compare. 
>>> set(dir(dict)).intersection(set(dir(list))) {'copy', '__hash__', '__format__', '__sizeof__', '__ge__', '__delitem__', '__getitem__', '__dir__', 'pop', '__gt__', '__repr__', '__init__', '__subclasshook__', '__eq__', 'clear', '__len__', '__str__', '__le__', '__new__', '__reduce_ex__', '__doc__', '__getattribute__', '__ne__', '__reduce__', '__contains__', '__delattr__', '__class__', '__lt__', '__setattr__', '__setitem__', '__iter__'} >>> set(dir(dict)).difference(set(dir(list))) {'popitem', 'update', 'setdefault', 'items', 'values', 'fromkeys', 'get', 'keys'} >>> set(dir(list)).difference(set(dir(dict))) {'sort', '__mul__', 'remove', '__iadd__', '__reversed__', 'insert', 'extend', 'append', 'count', '__add__', '__rmul__', 'index', '__imul__', 'reverse'} They do have quite a lot in common already. The usefulness of different types having the same methods is that external code can be less specific to the objects they handle. Of course, if those like methods act too differently they can be surprising as well. That may be the case if '+' and '+=' are used to update dictionaries, but then again, maybe not. (?) > As for being useful, useful for what? Useful how often? I'm sure that > one could take any piece of code, no matter how obscure, and say it is > useful*somewhere* :-) but the question is whether it is useful enough > to be part of the language. That's where examples will have an advantage over an initial personal opinion. Not that initial opinions aren't useful at first to express support or non-support. I could have just used +1. ;-) Cheers, Ron From steve at pearwood.info Wed Jul 30 02:17:26 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 30 Jul 2014 10:17:26 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <20140729033411.GJ9112@ando> Message-ID: <20140730001726.GM9112@ando> On Tue, Jul 29, 2014 at 06:12:16PM -0500, Ron Adam wrote on the similarity of lists and dicts: [...] > Well, here is how they currently compare. > > >>> set(dir(dict)).intersection(set(dir(list))) > {'copy', '__hash__', '__format__', '__sizeof__', '__ge__', '__delitem__', > '__getitem__', '__dir__', 'pop', '__gt__', '__repr__', '__init__', > '__subclasshook__', '__eq__', 'clear', '__len__', '__str__', '__le__', > '__new__', '__reduce_ex__', '__doc__', '__getattribute__', '__ne__', > '__reduce__', '__contains__', '__delattr__', '__class__', '__lt__', > '__setattr__', '__setitem__', '__iter__'} Now strip out the methods which are common to pretty much all objects, in other words just look at the ones which are common to mapping and sequence APIs but not to objects in general: {'copy', '__ge__', '__delitem__', '__getitem__', 'pop', '__gt__', 'clear', '__len__', '__le__', '__contains__', '__lt__', '__setitem__', '__iter__'} And now look a little more closely: - although dicts and lists both support order comparisons like > and <, you cannot compare a dict to a list in Python 3; - although dicts and lists both support a pop method, their signatures are different; x.pop() will fail if x is a dict, and x.pop(k, d) will fail if x is a list; - although both support membership testing "a in x", what is being tested is rather different; if x is a dict, then a must be a key, but the analog of keys for lists is the index, not the value. 
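Concretely, a couple of quick illustrations of those differences:

    [10, 20, 30].pop()     # no argument needed; removes the last item -> 30
    {'k': 'v'}.pop('k')    # a key is required (a default is optional) -> 'v'

    2 in [1, 2, 3]         # membership looks at a list's values -> True
    2 in {1: 'a', 3: 'b'}  # ...but at a dict's keys -> False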
So the similarities between list and dict are: * both have a length * both are iterable * both support subscripting operations x[i] * although dicts don't support slicing x[i:j:k] * both support a copy() method * both support a clear() method That's not a really big set of operations in common, and they're rather general. The real test is, under what practical circumstances would you expect to freely substitute a list for a dict or visa versa, and what could you do with that object when you received it? For me, the only answer that comes readily to mind is that the dict constructor accepts either another dict or a list of (key,item) pairs. [...] > They do have quite a lot in common already. The usefulness of different > types having the same methods is that external code can be less specific to > the objects they handle. I don't think that it is reasonable to treat dicts and lists as having a lot in common. They have a little in common, by virtue of both being containers, but then a string bag and a 40ft steel shipping container are both containers too, so that doesn't imply much similarity :-) It seems to me that outside of utterly generic operations like iteration, conversion to string and so on, lists do not quack like dicts, and dicts do not swim like lists, in any significant sense. -- Steven From greg.ewing at canterbury.ac.nz Wed Jul 30 00:46:46 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 30 Jul 2014 10:46:46 +1200 Subject: [Python-ideas] adding dictionaries In-Reply-To: <53D78C09.20406@sotecware.net> References: <53D78C09.20406@sotecware.net> Message-ID: <53D82456.3060102@canterbury.ac.nz> Jonas Wielicki wrote: > FWIW, one could use an operator which inherently shows a direction: << > and >>, for both directions respectively. > > A = B >> C lets B take precedence, and A = B << C lets C take precedence. While it succeeds in indicating a direction, it fails to suggest any kind of addition or union. -- Greg From j.wielicki at sotecware.net Wed Jul 30 10:37:23 2014 From: j.wielicki at sotecware.net (Jonas Wielicki) Date: Wed, 30 Jul 2014 10:37:23 +0200 Subject: [Python-ideas] adding dictionaries In-Reply-To: <53D82456.3060102@canterbury.ac.nz> References: <53D78C09.20406@sotecware.net> <53D82456.3060102@canterbury.ac.nz> Message-ID: <53D8AEC3.6080602@sotecware.net> On 30.07.2014 00:46, Greg Ewing wrote: > Jonas Wielicki wrote: >> FWIW, one could use an operator which inherently shows a direction: << >> and >>, for both directions respectively. >> >> A = B >> C lets B take precedence, and A = B << C lets C take precedence. > > While it succeeds in indicating a direction, it > fails to suggest any kind of addition or union. > As already noted elsewhere (to continue playing devils advocate), its not an addition or union anyways. It?s not a union because it is lossy and not commutative it?s not something I?d call addition either. While one can certainly see it as shifting the elements from dict A over dict B. 
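For anyone who wants to experiment with that reading, a subclass is
enough of a sketch (the class name and exact semantics below are
assumptions following the direction described above, not an agreed
design):

    class DirectedDict(dict):
        def __rshift__(self, other):    # B >> C: B takes precedence
            new = DirectedDict(other)
            new.update(self)
            return new
        def __lshift__(self, other):    # B << C: C takes precedence
            new = DirectedDict(self)
            new.update(other)
            return new

    B = DirectedDict(colour='red', size=10)
    C = DirectedDict(colour='blue')
    B >> C    # {'colour': 'red', 'size': 10}
    B << C    # {'colour': 'blue', 'size': 10}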
regards, jwi From ncoghlan at gmail.com Wed Jul 30 13:52:54 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 30 Jul 2014 21:52:54 +1000 Subject: [Python-ideas] adding dictionaries In-Reply-To: <1406662169.52281.YahooMailNeo@web181005.mail.ne1.yahoo.com> References: <20140729143605.GL9112@ando> <1406662169.52281.YahooMailNeo@web181005.mail.ne1.yahoo.com> Message-ID: On 30 July 2014 05:29, Andrew Barnert wrote: > On Tuesday, July 29, 2014 7:36 AM, Steven D'Aprano wrote: > >>On Tue, Jul 29, 2014 at 07:22:34AM +0100, Paul Moore wrote: >>> On 29 July 2014 00:04, Alexander Heger wrote: >>> > D = A | B | C >>> > >>> > becomes >>> > >>> > D = dict(collections.ChainMap(C, B, A)) >>> >>> This immediately explains the key problem with this proposal. It never >>> even *occurred* to me that anyone would expect C to take priority over >>> A in the operator form. But the ChainMap form makes it immediately >>> clear to me that this is the intent. >> >>Hmmm. Funny you say that, because to me that is a major disadvantage of >>the ChainMap form: you have to write the arguments in reverse order. > > > I think that's pretty much exactly his point: > > To him, it's obvious that + should be in the order of ChainMap, and he can't even conceive of the possibility that you'd want it "backward". > > To you, it's obvious that + should be the other way around, and you find it annoying that ChainMap is "backward". > > Which seems to imply that any attempt at setting an order is going to not only seem backward, but possibly surprisingly so, to a subset of Python's users. > > And this is the kind of thing can lead to subtle bugs. If a and b _almost never_ have duplicate keys, but very rarely do, you won't catch the problem until you think to test for it. And if one order or the other is so obvious to you that you didn't even imagine anyone would ever implement the opposite order, you probably won't think to write the test until you have a bug in the field? I think this is a nice way of explaining the concern. I'll also note that, given we turned a whole pile of similarly subtle data driven bugs into structural type errors in the Python 3 transition, I'm not exactly enamoured of the idea of adding more :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From drekin at gmail.com Wed Jul 30 15:58:02 2014 From: drekin at gmail.com (drekin) Date: Wed, 30 Jul 2014 15:58:02 +0200 Subject: [Python-ideas] Redesign of Python stdio backend Message-ID: I would expect that all standard IO in Python goes through sys.stdin, sys.stdout and sys.stderr or the underlying buffer or raw objects. The only exception should be error messages before the sys.std* objects are initialized. I was surprised that this is actually not the case ? reading input in the interactive loop actually doesn't use sys.stdin (see http://bugs.python.org/issue17620). However it uses its encoding, which doesn't make sense. My knowledge of the actual implementation is rather poor, but I got impression that the codepath of getting input from user in interactive loop is complicated. I would think that it consits just of wrapping an underlying system call (or GNU readline or anything) in sys.stdin.buffer.raw.readinto or something. With current implementation, fixing issues may be complicated ? for example handling SIGINT produced by Ctrl-C on Windows issues. There is a closed issue http://bugs.python.org/issue17619 but also an open issue http://bugs.python.org/issue18597. 
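As a rough sketch of what installing a replacement stream looks like
(the class name is made up; this only illustrates the mechanism):

    import io
    import sys

    class LoggingStdin(io.TextIOWrapper):
        def readline(self, *args):
            line = super().readline(*args)
            sys.stderr.write('stdin gave %r\n' % line)
            return line

    sys.stdin = LoggingStdin(sys.stdin.buffer, encoding=sys.stdin.encoding)
    # (the original object is still reachable as sys.__stdin__)

    # Explicit reads such as sys.stdin.readline() now go through the
    # wrapper, but text typed at the interactive >>> prompt is still read
    # by the interpreter's own machinery, which only borrows sys.stdin's
    # encoding; that is the inconsistency described above.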
There is also a seven years old issue http://bugs.python.org/issue1602 regarding Unicode support on Windows console. Even if the issue isn't fixed, anyone could just write their own sys.std* objects a install them in the running interpreter. This doesn't work now because of the problem described. I just wanted to bring up the idea of redesign the stdio backend which also results in fixing http://bugs.python.org/issue17620 and helping fixing the others. Regards, Drekin -------------- next part -------------- An HTML attachment was scrubbed... URL: From ron3200 at gmail.com Wed Jul 30 16:27:00 2014 From: ron3200 at gmail.com (Ron Adam) Date: Wed, 30 Jul 2014 09:27:00 -0500 Subject: [Python-ideas] adding dictionaries In-Reply-To: <20140730001726.GM9112@ando> References: <20140727011739.GC9112@ando> <20140728145951.GH9112@ando> <20140728153306.GA5756@k2> <20140728160450.GI9112@ando> <20140729033411.GJ9112@ando> <20140730001726.GM9112@ando> Message-ID: On 07/29/2014 07:17 PM, Steven D'Aprano wrote: > On Tue, Jul 29, 2014 at 06:12:16PM -0500, Ron Adam wrote on the > similarity of lists and dicts: > > [...] >> >Well, here is how they currently compare. >> > >>>>> > >>>set(dir(dict)).intersection(set(dir(list))) >> >{'copy', '__hash__', '__format__', '__sizeof__', '__ge__', '__delitem__', >> >'__getitem__', '__dir__', 'pop', '__gt__', '__repr__', '__init__', >> >'__subclasshook__', '__eq__', 'clear', '__len__', '__str__', '__le__', >> >'__new__', '__reduce_ex__', '__doc__', '__getattribute__', '__ne__', >> >'__reduce__', '__contains__', '__delattr__', '__class__', '__lt__', >> >'__setattr__', '__setitem__', '__iter__'} > Now strip out the methods which are common to pretty much all objects, > in other words just look at the ones which are common to mapping and > sequence APIs but not to objects in general: > > {'copy', '__ge__', '__delitem__', '__getitem__', 'pop', '__gt__', > 'clear', '__len__', '__le__', '__contains__', '__lt__', '__setitem__', > '__iter__'} > > And now look a little more closely: > > - although dicts and lists both support order comparisons like > and <, > you cannot compare a dict to a list in Python 3; I think this would be the case we are describing with + and +=. You would not be able to add a dict and some other incompatible type. Cheers, Ron