From mail.yogi841 at gmail.com Sat Dec 1 03:58:21 2018 From: mail.yogi841 at gmail.com (Adam Johnson) Date: Sat, 1 Dec 2018 08:58:21 +0000 Subject: [Python-ideas] __len__() for map() In-Reply-To: <20181201011734.GN4319@ando.pearwood.info> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> Message-ID: On Sat, 1 Dec 2018 at 01:17, Steven D'Aprano wrote: > > In principle, we could make this work, by turning the output of map() > into a view like dict.keys() etc, or a lazy sequence type like range(). > wrapping the underlying sequence. That might be worth exploring. I can't > think of any obvious problems with a view-like interface, but that > doesn't mean there aren't any. I've spent like 30 seconds thinking about > it, so the fact that I can't see any problems with it means little. Something to consider that, so far, seems to have been overlooked is that the total length of the resulting map isn't only dependent upon the iterable, but also the mapped function. It is a pretty pathological case, but there is no guarantee that the function is a pure function, free from side effects. If the iterable is mutable and the mapped function has a reference to it (either from scoping or the iterable (in)directly containing a reference to itself), there is nothing to prevent the function modifying the iterable as the map is evaluated. For example, map can be used as a filter: it = iter((0, 16, 1, 4, 8, 29, 2, 13, 42)) def filter_odd(x): while x % 2 == 0: x = next(it) return x tuple(map(filter_odd, it)) # (1, 29, 13) The above also illustrates the second way the total length of the map could differ from the length input iterable, even if is immutable. If StopIteration is raised within the mapped function, map finishes early, so can be used in a manner similar to takewhile: def takewhile_lessthan4(x): if x < 4: return x raise StopIteration tuple(map(takewhile_lessthan4, range(9))) # (0, 1, 2, 3) I really don't understand why this is true, under 'normal' usage, map shouldn't have any reason to silently swallow a StopIteration raised _within_ the mapped function. As I opened with, I wouldn't consider using map in either of these ways to be a good idea, and anyone doing so should probably be persuaded to find better alternatives, but it might be something to bear in mind. AJ From greg.ewing at canterbury.ac.nz Sat Dec 1 05:44:07 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 01 Dec 2018 23:44:07 +1300 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> Message-ID: <5C0265F7.3070303@canterbury.ac.nz> Adam Johnson wrote: > def takewhile_lessthan4(x): > if x < 4: > return x > raise StopIteration > > tuple(map(takewhile_lessthan4, range(9))) > # (0, 1, 2, 3) > > I really don't understand why this is true, under 'normal' usage, map > shouldn't have any reason to silently swallow a StopIteration raised > _within_ the mapped function. It's not -- the StopIteration isn't terminating the map, it's terminating the iteration being performed by tuple(). 
It's easy to show that map() is not swallowing the StopIteration: >>> m = map(takewhile_lessthan4, range(9)) >>> next(m) 0 >>> next(m) 1 >>> next(m) 2 >>> next(m) 3 >>> next(m) Traceback (most recent call last): File "", line 1, in File "", line 4, in takewhile_lessthan4 StopIteration -- Greg From mail.yogi841 at gmail.com Sat Dec 1 07:45:08 2018 From: mail.yogi841 at gmail.com (Adam Johnson) Date: Sat, 1 Dec 2018 12:45:08 +0000 Subject: [Python-ideas] __len__() for map() In-Reply-To: <5C0265F7.3070303@canterbury.ac.nz> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <5C0265F7.3070303@canterbury.ac.nz> Message-ID: On Sat, 1 Dec 2018 at 10:44, Greg Ewing wrote: > It's not -- the StopIteration isn't terminating the map, > it's terminating the iteration being performed by tuple(). That was a poor choice of wording on my part, it's rather that map doesn't do anything special in that regard. To whatever is iterating over the map, any unexpected StopIteration from the function isn't distinguishable from the expected one from the iterable(s) being exhausted. This issue was dealt with in generators by PEP-479 (by replacing the StopIteration with a RuntimeError). Whilst map, filter, and others may not be generators, I would expect them to be consistent with that PEP when handling the same issue. From paul-python at svensson.org Sat Dec 1 11:07:53 2018 From: paul-python at svensson.org (Paul Svensson) Date: Sat, 1 Dec 2018 11:07:53 -0500 (EST) Subject: [Python-ideas] __len__() for map() In-Reply-To: <20181201011734.GN4319@ando.pearwood.info> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> Message-ID: On Sat, 1 Dec 2018, Steven D'Aprano wrote: > On Thu, Nov 29, 2018 at 08:13:12PM -0500, Paul Svensson wrote: > >> What's being proposed is simple, either: >> * len(map(f, x)) == len(x), or >> * both raise TypeError > > Simple, obvious, and problematic. > > Here's a map object I prepared earlier: > > from itertools import islice > mo = map(lambda x: x, "aardvark") > list(islice(mo, 3)) > > If I now pass you the map object, mo, what should len(mo) return? Five > or eight? mo = "aardvark" list(islice(mo, 3)) By what magic would the length change? Per the proposal, it can only be eight. Of course, that means mo can't, in this case, be an iterator. That's what the proposal would change. /Paul From mertz at gnosis.cx Sat Dec 1 11:27:31 2018 From: mertz at gnosis.cx (David Mertz) Date: Sat, 1 Dec 2018 11:27:31 -0500 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> Message-ID: A proposal to make map() not return an iterator seems like a non-starter. Yes, Python 2 worked that way, but that was a long time ago and we know better now. In the simple example it doesn't matter much: mo = map(lambda x: x, "aardvark") But map() is more useful for the non-toy case: mo = map(expensive_db_lookup, list_of_keys) list_of_keys can be a concrete list, but I'm using map() mainly specifically to get lazy iterator behavior. On Sat, Dec 1, 2018, 11:10 AM Paul Svensson On Sat, 1 Dec 2018, Steven D'Aprano wrote: > > > On Thu, Nov 29, 2018 at 08:13:12PM -0500, Paul Svensson wrote: > > > >> What's being proposed is simple, either: > >> * len(map(f, x)) == len(x), or > >> * both raise TypeError > > > > Simple, obvious, and problematic. 
> > > > Here's a map object I prepared earlier: > > > > from itertools import islice > > mo = map(lambda x: x, "aardvark") > > list(islice(mo, 3)) > > > > If I now pass you the map object, mo, what should len(mo) return? Five > > or eight? > > mo = "aardvark" > list(islice(mo, 3)) > > By what magic would the length change? > Per the proposal, it can only be eight. > Of course, that means mo can't, in this case, be an iterator. > That's what the proposal would change. > > /Paul > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Dec 1 11:53:20 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 2 Dec 2018 03:53:20 +1100 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> Message-ID: <20181201165320.GQ4319@ando.pearwood.info> On Sat, Dec 01, 2018 at 11:07:53AM -0500, Paul Svensson wrote: [...] > >Here's a map object I prepared earlier: > > > >from itertools import islice > >mo = map(lambda x: x, "aardvark") > >list(islice(mo, 3)) > > > >If I now pass you the map object, mo, what should len(mo) return? Five > >or eight? > > mo = "aardvark" > list(islice(mo, 3)) > > By what magic would the length change? > Per the proposal, it can only be eight. > Of course, that means mo can't, in this case, be an iterator. > That's what the proposal would change. I already discussed that: map is not currently a sequence, and just giving it a __len__ is not going to make it one. Making it a sequence, or a view of a sequence, is a bigger change, but worth considering, as I already said in part of my post you deleted. However, it is also a backwards incompatible change. In case its not obvious from my example above, I'll be explicit: # current behaviour mo = map(lambda x: x, "aardvark") list(islice(mo, 3)) # discard the first three items assert ''.join(mo) == 'dvark' => passes # future behaviour, with your proposal mo = map(lambda x: x, "aardvark") list(islice(mo, 3)) # discard the first three items assert ''.join(mo) == 'dvark' => fails with AssertionError Given the certainty that this change will break code (I know it will break *my* code, as I often rely on map() being an iterator not a sequence) it might be better to introduce a new "mapview" type rather than change the behaviour of map() itself. On the other hand, since the fix is simple enough: mo = iter(mo) perhaps all we need is a depreciation period of at least one full release before changing the behaviour. Either way, this isn't a simple or obvious change, and will probably need a PEP to nut out all the fine details. 
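To spell that fix out with the same example (a sketch only -- and note that iter(mo) is already a no-op on today's map objects, so this exact code passes both now and under the hypothetical new behaviour):

# future behaviour, with the one-line fix applied
mo = map(lambda x: x, "aardvark")
mo = iter(mo)  # explicitly opt back in to one-shot iteration
list(islice(mo, 3))  # discard the first three items
assert ''.join(mo) == 'dvark'
=> passes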
-- Steve From mertz at gnosis.cx Sat Dec 1 12:06:23 2018 From: mertz at gnosis.cx (David Mertz) Date: Sat, 1 Dec 2018 12:06:23 -0500 Subject: [Python-ideas] __len__() for map() In-Reply-To: <20181201165320.GQ4319@ando.pearwood.info> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> Message-ID: On Sat, Dec 1, 2018, 11:54 AM Steven D'Aprano # current behaviour > mo = map(lambda x: x, "aardvark") > list(islice(mo, 3)) # discard the first three items > assert ''.join(mo) == 'dvark' > => passes > > # future behaviour, with your proposal > assert ''.join(mo) == 'dvark' > => fails with AssertionError > > Given the certainty that this change will break code (I know it will > break *my* code, as I often rely on map() being an iterator not a > sequence) it might be better to introduce a new "mapview" type rather than > change the behaviour of map() itself. On the other hand, since the fix is > simple enough: > > mo = iter(mo) > Given that the anti-fix is just as simple and currently available, I don't see why we'd want a change: # map->sequence mo = list(mo) FWIW, I actually do write exactly that code fairly often, it's not hard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Dec 1 12:10:18 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 2 Dec 2018 04:10:18 +1100 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> Message-ID: <20181201171018.GR4319@ando.pearwood.info> On Sat, Dec 01, 2018 at 11:27:31AM -0500, David Mertz wrote: > A proposal to make map() not return an iterator seems like a non-starter. > Yes, Python 2 worked that way, but that was a long time ago and we know > better now. Paul is certainly not suggesting reverting the behaviour to the Python2 map, at the very least map(func, iterator) will continue to return an iterator. What Paul is *precisely* proposing isn't clear to me, except that map(func, sequence) will be "loosely" a sequence. What that means is not obvious. What is especially unclear is what his map() will do when passed multiple iterable arguments. [...] > list_of_keys can be a concrete list, but I'm using map() mainly > specifically to get lazy iterator behavior. Indeed. That's often why I use it too. But there is a good use-case for having map(), or a map-like function, provide either a lazy sequence like range() or a view. But the devil is in the details. Terry was right to encourage people to experiment with their own map-like function (a subclass?) to identify any tricky corners in the proposal. -- Steve From steve at pearwood.info Sat Dec 1 12:23:07 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 2 Dec 2018 04:23:07 +1100 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> Message-ID: <20181201172307.GS4319@ando.pearwood.info> On Sat, Dec 01, 2018 at 12:06:23PM -0500, David Mertz wrote: > Given that the anti-fix is just as simple and currently available, I don't > see why we'd want a change: > > # map->sequence > mo = list(mo) > > FWIW, I actually do write exactly that code fairly often, it's not hard. Sure, but that makes a copy of the original data and means you lose the benefit of map being lazy. 
Naturally we will always have the ability to call list and eagerly convert to a sequence, but these proposals are for a way of getting the advantages of sequence-like behaviour while still keeping the advantages of laziness. With iterators, the only way to get that advantage of laziness is to give up the ability to query length, random access to items, etc even when the underlying data is a sequence and that information would have been readily available. We can, at least sometimes, have the best of both worlds. Maybe. -- Steve From mertz at gnosis.cx Sat Dec 1 12:28:16 2018 From: mertz at gnosis.cx (David Mertz) Date: Sat, 1 Dec 2018 12:28:16 -0500 Subject: [Python-ideas] __len__() for map() In-Reply-To: <20181201172307.GS4319@ando.pearwood.info> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> Message-ID: Other than being able to ask len(), are there any advantages to a slightly less opaque map()? Getting the actual result of applying the function to the element is necessarily either eager or lazy, you can't have both. On Sat, Dec 1, 2018, 12:24 PM Steven D'Aprano On Sat, Dec 01, 2018 at 12:06:23PM -0500, David Mertz wrote: > > > Given that the anti-fix is just as simple and currently available, I > don't > > see why we'd want a change: > > > > # map->sequence > > mo = list(mo) > > > > FWIW, I actually do write exactly that code fairly often, it's not hard. > > Sure, but that makes a copy of the original data and means you lose the > benefit of map being lazy. > > Naturally we will always have the ability to call list and eagerly > convert to a sequence, but these proposals are for a way of getting the > advantages of sequence-like behaviour while still keeping the advantages > of laziness. > > With iterators, the only way to get that advantage of laziness is > to give up the ability to query length, random access to items, etc even > when the underlying data is a sequence and that information would have > been readily available. We can, at least sometimes, have the best of > both worlds. Maybe. > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Dec 1 14:08:03 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 2 Dec 2018 06:08:03 +1100 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> Message-ID: <20181201190803.GT4319@ando.pearwood.info> On Sat, Dec 01, 2018 at 12:28:16PM -0500, David Mertz wrote: > Other than being able to ask len(), are there any advantages to a slightly > less opaque map()? Getting the actual result of applying the function to > the element is necessarily either eager or lazy, you can't have both. I don't understand the point you think you are making here. There's no fundamental need to make a copy of a sequence just to apply a map function to it, especially if the function is cheap. (If it is expensive, you might want to add a cache.) 
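For instance (just a sketch, assuming the mapped function takes hashable arguments -- expensive_db_lookup and list_of_keys being David's hypothetical names from earlier in the thread):

from functools import lru_cache
cached_lookup = lru_cache(maxsize=None)(expensive_db_lookup)  # memoise the expensive call
mo = map(cached_lookup, list_of_keys)

so that computing the same item twice only does the expensive work once.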
This proof of concept wrapper class could have been written any time since Python 1.5 or earlier: class lazymap: def __init__(self, function, sequence): self.function = function self.wrapped = sequence def __len__(self): return len(self.wrapped) def __getitem__(self, item): return self.function(self.wrapped[item]) It is fully iterable using the sequence protocol, even in Python 3: py> x = lazymap(str.upper, 'aardvark') py> list(x) ['A', 'A', 'R', 'D', 'V', 'A', 'R', 'K'] Mapped items are computed on demand, not up front. It doesn't make a copy of the underlying sequence, it can be iterated over and over again, it has a length and random access. And if you want an iterator, you can just pass it to the iter() function. There are probably bells and whistles that can be added (a nicer repr? any other sequence methods? a cache?) and I haven't tested it fully. For backwards compatibilty reasons, we can't just make map() work like this, because that's a change in behaviour. There may be tricky corner cases I haven't considered, but as a proof of concept I think it shows that the basic premise is sound and worth pursuing. -- Steve From mertz at gnosis.cx Sat Dec 1 14:26:41 2018 From: mertz at gnosis.cx (David Mertz) Date: Sat, 1 Dec 2018 14:26:41 -0500 Subject: [Python-ideas] __len__() for map() In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> Message-ID: To illustrate the distinction that someone (I think Steven D'Aprano) makes, I think these two (modestly tested, but could have flaws) implementations are both sensible for some purposes. Both are equally "obvious," yet they are different: >>> import sys >>> from itertools import count >>> class map1(object): ... def __init__(self, fn, *seqs): ... try: # See if there is a length ... self._len = min(map(len, seqs)) ... except: # Fallback isn't in any sense accurate, just "large" ... self._len = sys.maxsize ... self._fn = fn ... self._seqs = seqs ... self._iters = [iter(seq) for seq in seqs] ... def __iter__(self): ... return self ... def __next__(self): ... args = [next(it) for it in self._iters] ... return self._fn(*args) ... def __len__(self): ... return self._len ... >>> class map2(map1): ... def __init__(self, fn, *seqs): ... super().__init__(fn, *seqs) ... def __next__(self): ... self._len -= 1 ... return super().__next__() ... >>> m1 = map1(add, [1,2,3,4], (5,6,7)) >>> len(m1) 3 >>> next(m1) 6 >>> len(m1) 3 >>> m2 = map2(add, [1,2,3,4], (5,6,7)) >>> len(m2) 3 >>> next(m2) 6 >>> len(m2) 2 >>> m1_inf = map1(lambda x: x, count()) >>> len(m1_inf) 9223372036854775807 >>> next(m1_inf) 0 >>> next(m1_inf) 1 I wasn't sure what to set self._len to where it doesn't make sense. I thought of None which makes len(mo) raise one exception, or -1 which makes len(mo) raise a different exception. I just choose an arbitrary "big" value in the above implementation. mo.__length_hint__() is a possibility, but that is specialized, not a way of providing a response to len(mo). I don't have to, but I do keep around mo._seqs as a handle to the underlying sequences. In concept those could be re-inspected for other properties as the user of the classes desired. On Sat, Dec 1, 2018 at 12:28 PM David Mertz wrote: > Other than being able to ask len(), are there any advantages to a slightly > less opaque map()? 
Getting the actual result of applying the function to > the element is necessarily either eager or lazy, you can't have both. > > On Sat, Dec 1, 2018, 12:24 PM Steven D'Aprano >> On Sat, Dec 01, 2018 at 12:06:23PM -0500, David Mertz wrote: >> >> > Given that the anti-fix is just as simple and currently available, I >> don't >> > see why we'd want a change: >> > >> > # map->sequence >> > mo = list(mo) >> > >> > FWIW, I actually do write exactly that code fairly often, it's not hard. >> >> Sure, but that makes a copy of the original data and means you lose the >> benefit of map being lazy. >> >> Naturally we will always have the ability to call list and eagerly >> convert to a sequence, but these proposals are for a way of getting the >> advantages of sequence-like behaviour while still keeping the advantages >> of laziness. >> >> With iterators, the only way to get that advantage of laziness is >> to give up the ability to query length, random access to items, etc even >> when the underlying data is a sequence and that information would have >> been readily available. We can, at least sometimes, have the best of >> both worlds. Maybe. >> >> >> -- >> Steve >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg.ewing at canterbury.ac.nz Sat Dec 1 20:07:16 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 02 Dec 2018 14:07:16 +1300 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <20181201190803.GT4319@ando.pearwood.info> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> Message-ID: <5C033044.9080907@canterbury.ac.nz> Steven D'Aprano wrote: > For backwards compatibilty reasons, we can't just make map() work like > this, because that's a change in behaviour. Actually, I think it's possible to get the best of both worlds. 
Consider this: from operator import itemgetter class MapView: def __init__(self, func, *args): self.func = func self.args = args self.iterator = None def __len__(self): return min(map(len, self.args)) def __getitem__(self, i): return self.func(*list(map(itemgetter(i), self.args))) def __iter__(self): return self def __next__(self): if not self.iterator: self.iterator = map(self.func, *self.args) return next(self.iterator) If you give it sequences, it behaves like a sequence: >>> a = [1, 2, 3, 4, 5] >>> b = [2, 3, 5] >>> from math import pow >>> m = MapView(pow, a, b) >>> print(list(m)) [1.0, 8.0, 243.0] >>> print(list(m)) [1.0, 8.0, 243.0] >>> print(len(m)) 3 >>> print(m[1]) 8.0 If you give it iterators, it behaves like an iterator: >>> m = MapView(pow, iter(a), iter(b)) >>> print(next(m)) 1.0 >>> print(list(m)) [8.0, 243.0] >>> print(list(m)) [] >>> print(len(m)) Traceback (most recent call last): File "", line 1, in File "/Users/greg/foo/mapview/mapview.py", line 14, in __len__ return min(map(len, self.args)) TypeError: object of type 'list_iterator' has no len() If you use it as an iterator after giving it sequences, it also behaves like an iterator: >>> m = MapView(pow, a, b) >>> print(next(m)) 1.0 >>> print(next(m)) 8.0 What do people think? Could we drop something like this in as a replacement for map() without disturbing anything too much? -- Greg From rosuav at gmail.com Sat Dec 1 20:24:19 2018 From: rosuav at gmail.com (Chris Angelico) Date: Sun, 2 Dec 2018 12:24:19 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C033044.9080907@canterbury.ac.nz> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> Message-ID: On Sun, Dec 2, 2018 at 12:08 PM Greg Ewing wrote: > class MapView: > def __len__(self): > return min(map(len, self.args)) > > def __iter__(self): > return self > > def __next__(self): > if not self.iterator: > self.iterator = map(self.func, *self.args) > return next(self.iterator) I can't help thinking that it will be extremely surprising to have the length remain the same while the items get consumed. After you take a couple of elements off, the length of the map is exactly the same, yet the length of a list constructed from that map won't be. Are there any other non-pathological examples where len(x) != len(list(x))? ChrisA From greg.ewing at canterbury.ac.nz Sun Dec 2 08:04:31 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 03 Dec 2018 02:04:31 +1300 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> Message-ID: <5C03D85F.2040702@canterbury.ac.nz> Chris Angelico wrote: > I can't help thinking that it will be extremely surprising to have the > length remain the same while the items get consumed. That can be fixed. The following version raises an exception if you try to find the length after having used it as an iterator. (I also fixed a bug -- I had screwed up the sequence case, and it wasn't re-iterating properly.) 
class MapView: def __init__(self, func, *args): self.func = func self.args = args self.iterator = None def __len__(self): return min(map(len, self.args)) def __getitem__(self, i): return self.func(*list(map(itemgetter(i), self.args))) def __iter__(self): return map(self.func, *self.args) def __next__(self): if not self.iterator: self.iterator = iter(self) return next(self.iterator) >>> a = [1, 2, 3, 4, 5] >>> b = [2, 3, 5] >>> m = MapView(pow, a, b) >>> print(next(m)) 1 >>> print(len(m)) Traceback (most recent call last): File "", line 1, in File "/Users/greg/foo/mapview/mapview.py", line 12, in __len__ raise TypeError("Mapping iterator has no len()") TypeError: Mapping iterator has no len() It will still report a length if you use len() *before* starting to use it as an iterator, but the length it returns is correct at that point, so I don't think that's a problem. > Are there any > other non-pathological examples where len(x) != len(list(x))? No longer a problem: >>> m = MapView(pow, a, b) >>> len(m) == len(list(m)) True -- Greg From steve at pearwood.info Sun Dec 2 08:43:24 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 3 Dec 2018 00:43:24 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C03D85F.2040702@canterbury.ac.nz> References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> Message-ID: <20181202134324.GV4319@ando.pearwood.info> On Mon, Dec 03, 2018 at 02:04:31AM +1300, Greg Ewing wrote: > Chris Angelico wrote: > >I can't help thinking that it will be extremely surprising to have the > >length remain the same while the items get consumed. > > That can be fixed. The following version raises an exception if > you try to find the length after having used it as an iterator. That's not really a "fix" as such, more of a violation of the principle of least astonishment. Perhaps more like the principle of most astonishment: the object changes from sized to unsized even if you don't modify its value or its type, but merely if you look at it the wrong way: # This is okay, doesn't change the nature of the object. for i in range(sys.maxint): try: print(mapview[i]) except IndexError: break # But this unexpectedly changes it from sized to unsized. for x in mapview: break That makes this object a fragile thing that can unexpectedly change from sized to unsized. Neither fish nor fowl with a confusing API that is not quite a sequence, not quite an iterator, not quite sized, but just enough of each to lead people into error. Or... 
at least that's what the code is supposed to do, the code you give doesn't actually work that way: > class MapView: > def __init__(self, func, *args): > self.func = func > self.args = args > self.iterator = None > def __len__(self): > return min(map(len, self.args)) > def __getitem__(self, i): > return self.func(*list(map(itemgetter(i), self.args))) > def __iter__(self): > return map(self.func, *self.args) > def __next__(self): > if not self.iterator: > self.iterator = iter(self) > return next(self.iterator) > > >>> a = [1, 2, 3, 4, 5] > >>> b = [2, 3, 5] > >>> m = MapView(pow, a, b) > >>> print(next(m)) > 1 > >>> print(len(m)) > Traceback (most recent call last): > File "", line 1, in > File "/Users/greg/foo/mapview/mapview.py", line 12, in __len__ > raise TypeError("Mapping iterator has no len()") > TypeError: Mapping iterator has no len() I can't reproduce that behaviour with the code you give above. When I try it, it returns the length 3, even after the iterator has been completely consumed. I daresay you could jerry-rig something to "fix" this bug, but I think this is a poor API that tries to make a single type act like two conceptually different things at the same time. -- Steve From greg.ewing at canterbury.ac.nz Sun Dec 2 17:52:07 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 03 Dec 2018 11:52:07 +1300 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <20181202134324.GV4319@ando.pearwood.info> References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> Message-ID: <5C046217.7010805@canterbury.ac.nz> Steven D'Aprano wrote: > Perhaps more like the principle of most > astonishment: the object changes from sized to unsized even if you don't > modify its value or its type, but merely if you look at it the wrong > way: Yes, but keep in mind the purpose of the whole thing is to provide a sequence interface while not breaking old code that expects an iterator interface. Code that was written to work with the existing map() will not be calling len() on it at all, because that would never have worked. > Neither fish nor fowl with a confusing API that is not > quite a sequence, not quite an iterator, not quite sized, but just > enough of each to lead people into error. Yes, it's a compromise in the interests of backwards compatibility. But there are no surprises as long as you stick to one interface or the other. Weird things happen if you mix them up, but sane code won't be doing that. > I can't reproduce that behaviour with the code you give above. When I > try it, it returns the length 3, even after the iterator has been > completely consumed. It sounds like you were still using the old version with a broken __iter__() method. 
This is my current complete code together with test cases: #----------------------------------------------------------- from operator import itemgetter class MapView: def __init__(self, func, *args): self.func = func self.args = args self.iterator = None def __len__(self): if self.iterator: raise TypeError("Mapping iterator has no len()") return min(map(len, self.args)) def __getitem__(self, i): return self.func(*list(map(itemgetter(i), self.args))) def __iter__(self): return map(self.func, *self.args) def __next__(self): if not self.iterator: self.iterator = iter(self) return next(self.iterator) if __name__ == "__main__": a = [1, 2, 3, 4, 5] b = [2, 3, 5] print("As a sequence:") m = MapView(pow, a, b) print(list(m)) print(list(m)) print(len(m)) print(m[1]) print() print("As an iterator:") m = MapView(pow, iter(a), iter(b)) print(next(m)) print(list(m)) print(list(m)) try: print(len(m)) except Exception as e: print("***", e) print() print("As an iterator over sequences:") m = MapView(pow, a, b) print(next(m)) print(next(m)) try: print(len(m)) except Exception as e: print("***", e) #----------------------------------------------------------- This is the output I get: As a sequence: [1, 8, 243] [1, 8, 243] 3 8 As an iterator: 1 [8, 243] [] *** Mapping iterator has no len() As an iterator over sequences: 1 8 *** Mapping iterator has no len() -- Greg From abedillon at gmail.com Wed Dec 5 21:43:44 2018 From: abedillon at gmail.com (Abe Dillon) Date: Wed, 5 Dec 2018 20:43:44 -0600 Subject: [Python-ideas] [Brainstorm] Testing with Documented ABCs In-Reply-To: References: Message-ID: [Marko Ristin-Kaufmann] > > What we do need at this moment, IMO, is a broad practical experience of > using contracts in Python. Once you make a change to the language, it's > impossible to undo. In contrast to what has been suggested in the previous > discussions (including my own voiced opinions), I actually now don't think > that introducing a language change would be beneficial *at this precise > moment*. I agree. That's why I prefaced this topic with [Brainstorm]. I want to explore the solution space to this problem and discuss some of the pros and cons of different ideas, *not* proceed straight to action. I also wanted to bring three thoughts to the table: 1. Fuzz testing and stateful testing like that provided by hypothesis might work together with contracts in an interesting way. 2. Tying tests/contracts to the bits of documentation that they validate is a great way to keep documentation in sync with code, but doctest does it a bit "backwards". Like in icontract-sphinx (or even this) it's better to construct documentation (partially) from test code than to write test code within documentation. In general, I find the relationship between documentation, testing, and type-checking interesting. The problems they each address seem to overlap quite a bit. 3. There seems like a lot of opportunity for the re-use of contracts, so maybe we should consider a mechanism to facilitate that. [Marko Ristin-Kaufmann] > I'd prefer to hear from people who actually use contracts in their > professional Python programming -- apart from the noisy syntax, how was the > experience? Did it help you catch bugs (and how many)? Were there big > problems with maintainability? Could you easily refactor? What were the > limits of the contracts you encountered? What kind of snapshot mechanism do > we need? How did you deal with multi-threading? And so on. That's a good point. 
I would argue that the concept of contracts isn't new, so there should be at least a few cases that we can draw on where others have tread before us (which you've obviously done to a large degree). That's not to belittle the work you've done on icontracts. It's a great tool for the reasons you describe. [Marko Ristin-Kaufmann] > *Multiple predicates per decorator. *The problem is that you can not deal > with toggling/describing individual contracts easily. While you can hack > your way through it (considering the arguments in the sequence, for > example), we found it clearer to have separate decorators. Moreover, > tracebacks are much easier to read, which is important when you debug a > program. I suppose it may be difficult to implement a clean, *backwards-compatible* solution, but yes; going through the arguments in a sequence would be my naive solution. Each entry has an optional description, a callable, and an optional tag or level to enable toggling (I would follow a simple model such as logging levels) *in that order*. It makes sense that the text description come first because that's the most relevant to a reader (like a doc-string), then the corresponding code, then the toggling flag which will often be an optimization detail which generally fall behind code correctness in priority. It may be less straight-forward to parse, but I wouldn't call it a "hack". I guess I'm not sure what to say about tracebacks being hard to read. [Marko Ristin-Kaufmann] > *Practicality of decorators. *We have retrospective meetings at the > company and I frequently survey the opinions related to the contracts > (explicitly asking about the readability and maintainability) -- so far > nobody had any difficulties and nobody was bothered by the noisy syntax. That's fair enough. I think the implementation you've come up with is pretty close to optimally concise given the tools at your disposal. I think something like Eiffel is a good goal for Python to eventually shoot for, but without new syntax; each step between icontracts and an Eiffel-esque platonic ideal would require significant hackery with diminishing returns on investment. On Thu, Nov 29, 2018 at 1:05 AM Marko Ristin-Kaufmann < marko.ristin at gmail.com> wrote: > Hi Abe, > Thanks for your suggestions! We actually already considered the two > alternatives you propose. > > *Multiple predicates per decorator. *The problem is that you can not deal > with toggling/describing individual contracts easily. While you can hack > your way through it (considering the arguments in the sequence, for > example), we found it clearer to have separate decorators. Moreover, > tracebacks are much easier to read, which is important when you debug a > program. > > *AST magic. *The problem with any approach based on parsing (be it > parsing the code or the description) is that parsing is slow so you end up > spending a lot of cycles on contracts which might not be enabled (many > contracts are applied only in the testing environment, not int he > production). Hence you must have an approach that offers practically zero > overhead cost to importing a module when its contracts are turned off. > > Decoding byte-code does not work as current decoding libraries can not > keep up with the changes in the language and the compiler hence they are > always lagging behind. > > *Practicality of decorators. 
*We have retrospective meetings at the > company and I frequently survey the opinions related to the contracts > (explicitly asking about the readability and maintainability) -- so far > nobody had any difficulties and nobody was bothered by the noisy syntax. > The decorator syntax is simply not beautiful, no discussion about that. But > when it comes to maintenance, there's a linter included ( > https://github.com/Parquery/pyicontract-lint), and if you want contracts > rendered in an appealing way, there's a documentation tool for sphinx ( > https://github.com/Parquery/sphinx-icontract). The linter facilitates the > maintainability a lot and sphinx tool gives you nice documentation for a > library so that you don't even have to look into the source code that often > if you don't want to. > > We need to be careful not to mistake issues of aesthetics for practical > issues. Something might not be beautiful, but can be useful unless it's > unreadable. > > *Conclusion. *What we do need at this moment, IMO, is a broad practical > experience of using contracts in Python. Once you make a change to the > language, it's impossible to undo. In contrast to what has been suggested > in the previous discussions (including my own voiced opinions), I actually > now don't think that introducing a language change would be beneficial *at > this precise moment*. We don't know what the use cases are, and there is > no practical experience to base the language change on. > > I'd prefer to hear from people who actually use contracts in their > professional Python programming -- apart from the noisy syntax, how was the > experience? Did it help you catch bugs (and how many)? Were there big > problems with maintainability? Could you easily refactor? What were the > limits of the contracts you encountered? What kind of snapshot mechanism do > we need? How did you deal with multi-threading? And so on. > > icontract library is already practically usable and, if you don't use > inheritance, dpcontracts is usable as well. I would encourage everybody to > try out programming with contracts using an existing library and just hold > their nose when writing the noisy syntax. Once we unearthed deeper problems > related to contracts, I think it will be much easier and much more > convincing to write a proposal for introducing contracts in the core > language. If I had to write a proposal right now, it would be only based on > the experience of writing a humble 100K code base by a team of 5-10 people. > Not very convincing. > > > Cheers, > Marko > > On Thu, 29 Nov 2018 at 02:26, Abe Dillon wrote: > >> Marko, I have a few thoughts that might improve icontract. 
>> First, multiple clauses per decorator: >> >> @pre( >> *lambda* x: x >= 0, >> *lambda* y: y >= 0, >> *lambda* width: width >= 0, >> *lambda* height: height >= 0, >> *lambda* x, width, img: x + width <= width_of(img), >> *lambda* y, height, img: y + height <= height_of(img)) >> @post( >> *lambda* self: (self.x, self.y) in self, >> *lambda* self: (self.x+self.width-1, self.y+self.height-1) in self, >> *lambda* self: (self.x+self.width, self.y+self.height) not in self) >> *def* __init__(self, img: np.ndarray, x: int, y: int, width: int, >> height: int) -> None: >> self.img = img[y : y+height, x : x+width].copy() >> self.x = x >> self.y = y >> self.width = width >> self.height = height >> >> *def* __contains__(self, pt: Tuple[int, int]) -> bool: >> x, y = pt >> return (self.x <= x < self.x + self.width) and (self.y <= y < self.y + >> self.height) >> >> >> You might be able to get away with some magic by decorating a method just >> to flag it as using contracts: >> >> >> @contract # <- does byte-code and/or AST voodoo >> *def* __init__(self, img: np.ndarray, x: int, y: int, width: int, >> height: int) -> None: >> pre(x >= 0, >> y >= 0, >> width >= 0, >> height >= 0, >> x + width <= width_of(img), >> y + height <= height_of(img)) >> >> # this would probably be declared at the class level >> inv(*lambda* self: (self.x, self.y) in self, >> *lambda* self: (self.x+self.width-1, self.y+self.height-1) in >> self, >> *lambda* self: (self.x+self.width, self.y+self.height) not in >> self) >> >> self.img = img[y : y+height, x : x+width].copy() >> self.x = x >> self.y = y >> self.width = width >> self.height = height >> >> That might be super tricky to implement, but it saves you some lambda >> noise. Also, I saw a forked thread in which you were considering some sort >> of transpiler with similar syntax to the above example. That also works. >> Another thing to consider is that the role of descriptors >> >> overlaps some with the role of invariants. I don't know what to do with >> that knowledge, but it seems like it might be useful. >> >> Anyway, I hope those half-baked thoughts have *some* value... >> >> On Wed, Nov 28, 2018 at 1:12 AM Marko Ristin-Kaufmann < >> marko.ristin at gmail.com> wrote: >> >>> Hi Abe, >>> >>> I've been pulling a lot of ideas from the recent discussion on design by >>>> contract (DBC), the elegance and drawbacks >>>> of doctests >>>> , and the amazing talk >>>> given by Hillel Wayne at >>>> this year's PyCon entitled "Beyond Unit Tests: Taking your Tests to the >>>> Next Level". >>>> >>> >>> Have you looked at the recent discussions regarding design-by-contract >>> on this list ( >>> https://groups.google.com/forum/m/#!topic/python-ideas/JtMgpSyODTU >>> and the following forked threads)? >>> >>> You might want to have a look at static checking techniques such as >>> abstract interpretation. I hope to be able to work on such a tool for >>> Python in some two years from now. We can stay in touch if you are >>> interested. >>> >>> Re decorators: to my own surprise, using decorators in a larger code >>> base is completely practical including the readability and maintenance of >>> the code. It's neither that ugly nor problematic as it might seem at first >>> look. >>> >>> We use our https://github.com/Parquery/icontract at the company. Most >>> of the design choices come from practical issues we faced -- so you might >>> want to read the doc even if you don't plant to use the library. 
>>> Some of the aspects we still haven't figured out are: how to approach >>> multi-threading (locking around the whole function with an additional >>> decorator?) and granularity of contract switches (right now we use >>> always/optimized, production/non-optimized and testing/slow, but it seems >>> that a larger system requires finer categories). >>> >>> Cheers Marko >>> >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From marko.ristin at gmail.com Fri Dec 7 02:15:39 2018 From: marko.ristin at gmail.com (Marko Ristin-Kaufmann) Date: Fri, 7 Dec 2018 08:15:39 +0100 Subject: [Python-ideas] [Brainstorm] Testing with Documented ABCs In-Reply-To: References: Message-ID: Hi Abe, I agree. That's why I prefaced this topic with [Brainstorm]. I want to > explore the solution space to this problem and discuss some of the pros and > cons of different ideas, *not* proceed straight to action. You are right. My apologies -- I was so primed by the discussions we had in October 2018 that I didn't pay enough attention to "Brainstorm" in the subject. Fuzz testing and stateful testing like that provided by hypothesis might > work together with contracts in an interesting way. > You might want to look at the literature on automatic test generation. A possible entry point could be: https://www.research-collection.ethz.ch/handle/20.500.11850/69581 If I had time available, I would start with a tool that analyses a given module and automatically generates code for the Hypothesis test cases. The tool needs to select functions which accept primitive data types and for each one of them translate their contracts into Hypothesis code. If contracts are not trivially translatable to Hypothesis, the function is ignored. For readability and speed of development (of the code under test, not of the tool), I would prefer this tool *not* to be dynamic, so that the developer herself needs to re-run it if the function signatures change. The ingredients for such a tool are all there with icontract (similar to sphinx-icontract, you import the module and analyze its functions; you can copy/paste parts of the sphinx-icontract implementation for parsing and listing the AST of the contracts). (If you'd like to continue discussing this topic, let's create an issue on the icontract GitHub page or switch to private correspondence in order not to spam this mailing list). There seems like a lot of opportunity for the re-use of contracts, so maybe > we should consider a mechanism to facilitate that. > This was the case for the requests library. @James Lu was looking into it -- a lot of functions had very similar contracts. However, in our code base at the company (including the open-sourced libraries), there was not a single case where we thought that contract re-use would be beneficial. Either it would have hurt the readability and introduced unnecessary couplings (when the contracts were trivial) or it made sense to encapsulate more complex contracts in a separate function. >> *Multiple predicates per decorator. * >> > I suppose it may be difficult to implement a clean, *backwards-compatible* > solution, but yes; going through the arguments in a sequence would be my > naive solution. Each entry has an optional description, a callable, and an > optional tag or level to enable toggling (I would follow a simple model > such as logging levels) *in that order*. > I found that to be too error-prone in a larger code base, but that is my very subjective opinion. Maybe you could give an example?
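(To make sure we are talking about the same thing, here is roughly what I picture -- purely hypothetical syntax, *not* something icontract supports today, with width_of borrowed from your earlier example:

@pre(
    ("x must be non-negative", lambda x: x >= 0, "prod"),
    ("x + width must fit into img", lambda x, width, img: x + width <= width_of(img), "test"),
)
def crop(img, x, width):
    ...

i.e. the description first, then the callable, then the toggling level.)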
but without new syntax; each step between icontracts and an Eiffel-esque > platonic ideal would require significant hackery with diminishing returns > on investment. > I agree. There are also issues with core python interpreter which I expect to remain open for a long time (see the issues related to retrieving code text of lambda functions and decorators and tweaking dynamically the behavior of help(.) for functions). Cheers, Marko > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mhroncok at redhat.com Fri Dec 7 03:53:04 2018 From: mhroncok at redhat.com (=?UTF-8?Q?Miro_Hron=c4=8dok?=) Date: Fri, 7 Dec 2018 09:53:04 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads Message-ID: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> Hi, I see md5 checksums at a release download page such as [1]. My idea is to switch to sha512 for a more reliable outcome. I'm no security expert, but AFAK md5 is generally believed to be unsafe, as it was repeatedly proven it can be vulnerable [2]. [1] https://www.python.org/downloads/release/python-371/ [2] https://en.wikipedia.org/wiki/MD5#Security -- Miro Hron?ok -- Phone: +420777974800 IRC: mhroncok From solipsis at pitrou.net Fri Dec 7 04:39:30 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 7 Dec 2018 10:39:30 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> Message-ID: <20181207103930.565ce442@fsol> On Fri, 7 Dec 2018 09:53:04 +0100 Miro Hron?ok wrote: > Hi, > > I see md5 checksums at a release download page such as [1]. > > My idea is to switch to sha512 for a more reliable outcome. > > I'm no security expert, but AFAK md5 is generally believed to be unsafe, > as it was repeatedly proven it can be vulnerable [2]. md5 is only used for a quick integrity check here (think of it as a sophisticated checksum). For security you need to verify the corresponding GPG signature. Regards Antoine. From jeanpierreda at gmail.com Fri Dec 7 09:49:59 2018 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Fri, 7 Dec 2018 06:49:59 -0800 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <20181207103930.565ce442@fsol> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> Message-ID: On Fri, Dec 7, 2018 at 1:40 AM Antoine Pitrou wrote: > md5 is only used for a quick integrity check here (think of it as a > sophisticated checksum). For security you need to verify the > corresponding GPG signature. > More to the point: you're getting the hash from the same place as the binary. If one is vulnerable to modifications by attackers, both are. So it doesn't matter. The real defense most people are relying on is TLS. -- Devin -------------- next part -------------- An HTML attachment was scrubbed... URL: From prometheus235 at gmail.com Fri Dec 7 10:56:22 2018 From: prometheus235 at gmail.com (Nick Timkovich) Date: Fri, 7 Dec 2018 09:56:22 -0600 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> Message-ID: Devils advocate: it might complicate things for someone that needs to use FIPS, where MD5 can be a pain to deal with. 
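For instance (a rough sketch -- details vary by distribution, and the file name is only illustrative): on some FIPS-enabled builds hashlib refuses to construct an md5 object at all, so a script that checks the published digest has to special-case it:

import hashlib

with open("Python-3.7.1.tgz", "rb") as f:  # illustrative file name
    data = f.read()
try:
    digest = hashlib.md5(data).hexdigest()  # may raise ValueError in FIPS mode
except ValueError:
    digest = None  # cannot check the published md5 on this machine

Publishing a sha256/sha512 digest would sidestep that wrinkle.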
On Fri, Dec 7, 2018 at 8:50 AM Devin Jeanpierre wrote: > On Fri, Dec 7, 2018 at 1:40 AM Antoine Pitrou wrote: > >> md5 is only used for a quick integrity check here (think of it as a >> sophisticated checksum). For security you need to verify the >> corresponding GPG signature. >> > > More to the point: you're getting the hash from the same place as the > binary. If one is vulnerable to modifications by attackers, both are. So it > doesn't matter. The real defense most people are relying on is TLS. > > -- Devin > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Fri Dec 7 13:47:24 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 7 Dec 2018 19:47:24 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> Message-ID: <20181207194724.7cb16fb5@fsol> On Fri, 7 Dec 2018 06:49:59 -0800 Devin Jeanpierre wrote: > On Fri, Dec 7, 2018 at 1:40 AM Antoine Pitrou wrote: > > > md5 is only used for a quick integrity check here (think of it as a > > sophisticated checksum). For security you need to verify the > > corresponding GPG signature. > > > > More to the point: you're getting the hash from the same place as the > binary. If one is vulnerable to modifications by attackers, both are. So it > doesn't matter. The real defense most people are relying on is TLS. If the site is vulnerable to modifications, then TLS doesn't help. Again: you must verify the GPG signatures (since they are produced by the release manager's private key, which is *not* stored on the python.org Web site). Regards Antoine. From bernardo at bernardosulzbach.com Fri Dec 7 14:46:29 2018 From: bernardo at bernardosulzbach.com (Bernardo Sulzbach) Date: Fri, 7 Dec 2018 17:46:29 -0200 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> Message-ID: Would this change actually help people who need to use FIPS? Other than that this change would only decrease the already very small probability of a corrupted download hashing the same, which isn't a bad thing. If it could make some users' jobs easier, even if it by no means helps guaranteeing the authenticity of the downloaded file, it might be worth considering. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeanpierreda at gmail.com Fri Dec 7 14:54:59 2018 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Fri, 7 Dec 2018 11:54:59 -0800 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <20181207194724.7cb16fb5@fsol> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> Message-ID: On Fri, Dec 7, 2018 at 10:48 AM Antoine Pitrou wrote: > If the site is vulnerable to modifications, then TLS doesn't help. > Again: you must verify the GPG signatures (since they are produced by > the release manager's private key, which is *not* stored on the > python.org Web site). > This is missing the point. They were asking why not to use SHA512. The answer is that the hash does not provide any extra security. 
GPG is separate: even if there was no GPG signature, SHA512 would still not provide any extra security. That's why I said "more to the point". :P Nobody "must" verify the GPG signatures. TLS doesn't protect against everything, but neither does GPG. A naive user might just download a public GPG key from a compromised python.org and use it to verify the compromised release, see everything is "OK", and still be hosed. -- Devin -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Fri Dec 7 16:25:19 2018 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 7 Dec 2018 13:25:19 -0800 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> Message-ID: For this specific purpose, md5 is just as good as a proper hash. But all else being equal, it would still be better to use a proper hash, just so people don't have to go through the whole security analysis to check that. Of course all else isn't equal: switching from md5 to sha-whatever would require someone do the work. Is anyone volunteering? On Fri, Dec 7, 2018, 11:56 Devin Jeanpierre On Fri, Dec 7, 2018 at 10:48 AM Antoine Pitrou > wrote: > >> If the site is vulnerable to modifications, then TLS doesn't help. >> Again: you must verify the GPG signatures (since they are produced by >> the release manager's private key, which is *not* stored on the >> python.org Web site). >> > > This is missing the point. They were asking why not to use SHA512. The > answer is that the hash does not provide any extra security. GPG is > separate: even if there was no GPG signature, SHA512 would still not > provide any extra security. That's why I said "more to the point". :P > > Nobody "must" verify the GPG signatures. TLS doesn't protect against > everything, but neither does GPG. A naive user might just download a public > GPG key from a compromised python.org and use it to verify the > compromised release, see everything is "OK", and still be hosed. > > -- Devin > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Fri Dec 7 18:38:06 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 8 Dec 2018 10:38:06 +1100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> Message-ID: <20181207233805.GE13061@ando.pearwood.info> On Fri, Dec 07, 2018 at 01:25:19PM -0800, Nathaniel Smith wrote: > For this specific purpose, md5 is just as good as a proper hash. But all > else being equal, it would still be better to use a proper hash, just so > people don't have to go through the whole security analysis to check that. I don't understand what you are trying to say here about "the whole security analysis" to check "that". What security analysis, and what is "that"? It seems to me that moving to a cryptographically-secure hash would give many people a false sense of security, that just because the hash matched, the download was not only not corrupted, but not compromised as well. 
For those two purposes: - testing for accidental corruption; - testing for deliberate compromise; md5 and sha512 are precisely equivalent: both are sufficient for the first, and useless for the second. But a crypto-hash can give a false sense of security. The original post in this thread is evidence of that. As such, I don't think we should move to anything stronger than md5. -- Steve From njs at pobox.com Fri Dec 7 19:35:56 2018 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 7 Dec 2018 16:35:56 -0800 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <20181207233805.GE13061@ando.pearwood.info> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: On Fri, Dec 7, 2018 at 3:38 PM Steven D'Aprano wrote: > On Fri, Dec 07, 2018 at 01:25:19PM -0800, Nathaniel Smith wrote: > > > For this specific purpose, md5 is just as good as a proper hash. But all > > else being equal, it would still be better to use a proper hash, just so > > people don't have to go through the whole security analysis to check > that. > > I don't understand what you are trying to say here about "the whole > security analysis" to check "that". What security analysis, and > what is "that"? > The analysis that people posted in this thread, demonstrating that for the particular purpose at hand, md5 and sha-whatever are equally useful. > It seems to me that moving to a cryptographically-secure hash would give > many people a false sense of security, that just because the hash > matched, the download was not only not corrupted, but not compromised as > well. For those two purposes: > > - testing for accidental corruption; > - testing for deliberate compromise; > > md5 and sha512 are precisely equivalent: both are sufficient for the > first, and useless for the second. But a crypto-hash can give a false > sense of security. The original post in this thread is evidence of that. > If you're worried about giving people a false sense of security, I think it would be more effective to post a prominent notice or link describing how people should interpret the hashes. Maybe some people see md5 and think "ah-hah, this is their way of warning me that the hash is suitable for defending against accidental corruption but not malicious actors", but it must be a small minority :-). (That's certainly not what the OP thought.) Most people will just think we're fools who don't realize or care md5 is broken. Statistically, that's a pretty reasonable guess when you see someone using md5. -n -- Nathaniel J. Smith -- https://vorpus.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From songofacandy at gmail.com Fri Dec 7 21:05:43 2018 From: songofacandy at gmail.com (INADA Naoki) Date: Sat, 8 Dec 2018 11:05:43 +0900 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <20181207233805.GE13061@ando.pearwood.info> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: > > It seems to me that moving to a cryptographically-secure hash would give > many people a false sense of security, that just because the hash > matched, the download was not only not corrupted, but not compromised as > well. 
For those two purposes: > > - testing for accidental corruption; > - testing for deliberate compromise; > > md5 and sha512 are precisely equivalent: both are sufficient for the > first, and useless for the second. But a crypto-hash can give a false > sense of security. The original post in this thread is evidence of that. > > As such, I don't think we should move to anything stronger than md5. > We already use SHA256 on PyPI. Many project in the world moving from md5 to SHA256. And at some time, SHA256 can be better than md5. When hash is delivered through other route than content, it's difficult to attack / easy to detect we're under attack. For example, sha256 is written in requirements.txt or Homebrew formula. When hash mismatch is happened, we can detect something go wrong. So it's worth to write stronger hash in such files. And if we use sha256 on download site, it's easy to check hash equality between formula and download site. If it's different, Homebrew or download site is under attack. So I think it's worth enough to moving to stronger and more used hash. (And by this reason, I prefer sha256 to sha512 for now.) -- INADA Naoki From steve at pearwood.info Fri Dec 7 23:09:26 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 8 Dec 2018 15:09:26 +1100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: <20181208040925.GH13061@ando.pearwood.info> On Fri, Dec 07, 2018 at 04:35:56PM -0800, Nathaniel Smith wrote: > On Fri, Dec 7, 2018 at 3:38 PM Steven D'Aprano wrote: > > > On Fri, Dec 07, 2018 at 01:25:19PM -0800, Nathaniel Smith wrote: > > > > > For this specific purpose, md5 is just as good as a proper hash. But all > > > else being equal, it would still be better to use a proper hash, just so > > > people don't have to go through the whole security analysis to check > > > that. > > > > I don't understand what you are trying to say here about "the whole > > security analysis" to check "that". What security analysis, and > > what is "that"? > > > > The analysis that people posted in this thread, demonstrating that for the > particular purpose at hand, md5 and sha-whatever are equally useful. Okay, so your position is that even though there's no actual increase in security from using sha512, we ought to use it so that people who don't know any better won't complain that we're using a "less secure" hash. Is that accurate? As security theatre goes, I guess its less harmful than most :-) [...] > If you're worried about giving people a false sense of security, I think it > would be more effective to post a prominent notice or link describing how > people should interpret the hashes. I want to avoid encouraging a false sense of security. I'm not sure that we ought to extend that further to actively taking on the responsibility of teaching users about this. On the other hand, perhaps threads like this suggest that this is inevitable... on the gripping hand, many users won't read the notice regardless of what we do... How often does this issue come up? I'm not sure it is common enough to bother fixing, but others' judgement on that may differ. > Maybe some people see md5 and think > "ah-hah, this is their way of warning me that the hash is suitable for > defending against accidental corruption but not malicious actors", but it > must be a small minority :-). 
(That's certainly not what the OP thought.) I didn't think they would. > Most people will just think we're fools who don't realize or care md5 is > broken. Statistically, that's a pretty reasonable guess when you see > someone using md5. I don't think there's any way to know for sure, but I'd be shocked if "most people" even thought about the issue, or checked the hash, regardless of whether it is sha512, md5 or a CRC checksum. In my experience, browsers and downloaders like wget either download the data correctly, or they make it damn obvious that the download failed. YMMV. As for those who "think we're fools", that's not a reasonable guess by any means. Since we're not fools, and for the purposes we're using the hash there is no difference between md5 and sha512, such a guess would be a classic example of "a little knowledge is dangerous" and "not as clever or well-informed as you think you are" (that's a generic "you", not you personally). If they don't think we're fools for using md5, they'll probably think we're fools for some other reason. -- Steve From steve at pearwood.info Fri Dec 7 23:14:18 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 8 Dec 2018 15:14:18 +1100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: <20181208041417.GI13061@ando.pearwood.info> On Sat, Dec 08, 2018 at 11:05:43AM +0900, INADA Naoki wrote: > We already use SHA256 on PyPI. > Many project in the world moving from md5 to SHA256. [...] How easy is it to use sha256 on the major platforms, compared to md5? On Linux, it is just as easy: [steve at ando ~]$ md5sum x.py 7008dcaa07fd35917474835425c6151a x.py [steve at ando ~]$ sha256sum x.py 6730dbf2b5ea5c874e789a39532b0e544af18fbea3c680880b01c81b773eabe2 x.py but how about Windows and Mac users? Do those platforms provide a sha256 checksum utility? (Maybe we should provide both hashes.) -- Steve From greg at krypto.org Fri Dec 7 23:55:53 2018 From: greg at krypto.org (Gregory P. Smith) Date: Fri, 7 Dec 2018 20:55:53 -0800 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <20181207233805.GE13061@ando.pearwood.info> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: On Fri, Dec 7, 2018 at 3:38 PM Steven D'Aprano wrote: > On Fri, Dec 07, 2018 at 01:25:19PM -0800, Nathaniel Smith wrote: > > > For this specific purpose, md5 is just as good as a proper hash. But all > > else being equal, it would still be better to use a proper hash, just so > > people don't have to go through the whole security analysis to check > that. > > I don't understand what you are trying to say here about "the whole > security analysis" to check "that". What security analysis, and > what is "that"? > > It seems to me that moving to a cryptographically-secure hash would give > many people a false sense of security, that just because the hash > matched, the download was not only not corrupted, but not compromised as > well. For those two purposes: > > - testing for accidental corruption; > - testing for deliberate compromise; > > md5 and sha512 are precisely equivalent: both are sufficient for the > first, and useless for the second. But a crypto-hash can give a false > sense of security. 
The original post in this thread is evidence of that. > > As such, I don't think we should move to anything stronger than md5. > If we switched to sha2+ or listed 8 different hashes at once in the announcement text so that nobody can find the actual link content, we'd stop having people pipe up and complain that we used md5 for something. Less mailing list threads like this one seems like a benefit. :P Debian provides all of the popular FIPS hashes, in side files, so people can use whatever floats their boat for a content integrity check: https://cdimage.debian.org/debian-cd/current/ppc64el/iso-cd/ >From a semi-security perspective without verifying gpg signatures, listing a currently collision-resistant hash (sha2 onwards today) in widely disseminated release announcement that goes on mailing lists and gets forwarded and reposted in many places is still useful. Being not hosted in a single central place, if the downloads and hashes on the main servers change *after* their computation, publishing, and announcement - it serves as a widely distributed question mark. A pointless one, as the gpg signature also exists, but it is one none the less. As to windows and mac providing hashing functions on the command line, nope. assume nothing is provided. On linux my fingers would use "openssl hashname" rather than *sum commands. But none of those are ever required to be installed by anything. The only people who ever check hashes are those who already know what tools to use and how. Some could ironically install the downloaded python and use it to check its own hash. None of that is our problem. -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: From phd at phdru.name Sat Dec 8 07:01:57 2018 From: phd at phdru.name (Oleg Broytman) Date: Sat, 8 Dec 2018 13:01:57 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: <20181208120157.x7rb3sh2rc37hvit@phdru.name> On Fri, Dec 07, 2018 at 08:55:53PM -0800, "Gregory P. Smith" wrote: > Debian provides all of the popular FIPS hashes... [skip] > https://cdimage.debian.org/debian-cd/current/ppc64el/iso-cd/ And they protect the hash files by signing them instead of signing CDs/DVDs. > -gps Oleg. -- Oleg Broytman https://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From christian at python.org Sat Dec 8 10:06:51 2018 From: christian at python.org (Christian Heimes) Date: Sat, 8 Dec 2018 16:06:51 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> Message-ID: On 08/12/2018 05.55, Gregory P. Smith wrote: > > On Fri, Dec 7, 2018 at 3:38 PM Steven D'Aprano > > wrote: > > On Fri, Dec 07, 2018 at 01:25:19PM -0800, Nathaniel Smith wrote: > > > For this specific purpose, md5 is just as good as a proper hash. > But all > > else being equal, it would still be better to use a proper hash, > just so > > people don't have to go through the whole security analysis to > check that. > > I don't understand what you are trying to say here about "the whole > security analysis" to check "that". What security analysis, and > what is "that"? 
> > It seems to me that moving to a cryptographically-secure hash would > give > many people a false sense of security, that just because the hash > matched, the download was not only not corrupted, but not > compromised as > well. For those two purposes: > > - testing for accidental corruption; > - testing for deliberate compromise; > > md5 and sha512 are precisely equivalent: both are sufficient for the > first, and useless for the second. But a crypto-hash can give a false > sense of security. The original post in this thread is evidence of that. > > As such, I don't think we should move to anything stronger than md5. > > > If we switched to sha2+ or listed 8 different hashes at once in the > announcement text so that nobody can find the actual link content, we'd > stop having people pipe up and complain that we used md5 for something. > Less mailing list threads like this one seems like a benefit. :P > > Debian provides all of the popular FIPS hashes, in side files, so people > can use whatever floats their boat for a content integrity check: > https://cdimage.debian.org/debian-cd/current/ppc64el/iso-cd/ By the way, it's a common misunderstanding that FIPS forbids MD5 in general. FIPS is more complicated than black and white lists of algorithms. FIPS also takes into account how an algorithm is used. For example, and if I recall correctly, AES-GCM is only allowed in network communication protocols but not for persistent storage. Simply speaking: In FIPS mode, MD5 is still allowed in **non-security contexts**. You cannot use MD5 to make any security claims like file integrity. However you are still allowed to use MD5 as a non-secure hash function to detect file corruption. The design and documentation must clearly state that you are only guarding against accidental file corruption caused by a network or hardware issue, but not as protection against a malicious attacker. Christian From solipsis at pitrou.net Sat Dec 8 11:54:31 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 8 Dec 2018 17:54:31 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> Message-ID: <20181208175431.3ad22bfb@fsol> On Fri, 7 Dec 2018 11:54:59 -0800 Devin Jeanpierre wrote: > On Fri, Dec 7, 2018 at 10:48 AM Antoine Pitrou wrote: > > > If the site is vulnerable to modifications, then TLS doesn't help. > > Again: you must verify the GPG signatures (since they are produced by > > the release manager's private key, which is *not* stored on the > > python.org Web site). > > This is missing the point. Why do you think I missed anything here? Regards Antoine. From jamtlu at gmail.com Sat Dec 8 16:27:13 2018 From: jamtlu at gmail.com (James Lu) Date: Sat, 8 Dec 2018 16:27:13 -0500 Subject: [Python-ideas] [Brainstorm] Testing with Documented ABCs In-Reply-To: References: Message-ID: > Interesting. In the thread you linked on DBC, it seemed like Steve D'Aprano and David Mertz (and possibly others) were put off by the verbosity and noisiness of the decorator-based solution you provided with icontract (though I think there are ways to streamline that solution). It seems like syntactic support could offer a more concise and less noisy implementation. Btw, it would be relatively easy to create a parser for Python. Python doesn't have any crazy grammar constructs like the lexer hack AFAIK. I'm imagining using Bison: 1.
convert python's grammar ( https://github.com/python/cpython/blob/master/Lib/lib2to3/Grammar.txt) to Bison format. 2. write a lexer to parse tokens and convert indentation to indent/dedent tokens. 3. extend the grammar however you want it. Call these custom AST nodes "contract nodes." 4. create a simple AST, really an annotated parse tree. I think we can use a simple one that's a bunch of nested lists: ["for_stmt", "for i in range(10):", [ ["exprlist", "i", [ ... ]], ["testlist", "range(10)", [ ... ]] ]] # ["node_type", "", ] The AST can be made more detailed on an as-needed basis. 5. traverse the AST, and "rewrite" the the AST by pasting traditional python AST nodes where contract nodes are. This example from the Babel handbook may help if you have trouble understanding what this step means. https://github.com/jamiebuilds/babel-handbook/blob/master/translations/en/plugin-handbook.md#toc-writing-your-first-babel-plugin 6. turn the AST back into python source. Since we're storing the source code from the beginning, this should be fairly easy. (Bison lets your lexer tell the parser the line and column numbers of each token.) --- I made a joke language with Bison once, it's really flexible and well-suited for this kind of task. This 6-step p Tip: I found Bison's C++ mode too complicated, so I used it in C mode with the C++ Standard Library and C++ references enabled. --- I'm interested, what contract-related functionality do you think Python's existing syntax is inadequate for? You could look into using with statements and a python program that takes the AST and snips contract-related with statements to produce optimized code, though I suppose that's one step below the custom-parser method. On Wed, Nov 28, 2018 at 3:29 PM Abe Dillon wrote: > [Marko Ristin-Kaufmann] >> >> Have you looked at the recent discussions regarding design-by-contract on >> this list > > > I tried to read through them all before posting, but I may have missed > some of the forks. There was a lot of good discussion! > > [Marko Ristin-Kaufmann] > >> You might want to have a look at static checking techniques such as >> abstract interpretation. I hope to be able to work on such a tool for >> Python in some two years from now. We can stay in touch if you are >> interested. > > > I'll look into that! I'm very interested! > > [Marko Ristin-Kaufmann] > >> Re decorators: to my own surprise, using decorators in a larger code base >> is completely practical including the readability and maintenance of the >> code. It's neither that ugly nor problematic as it might seem at first look. > > > Interesting. In the thread you linked on DBC, it seemed like Steve > D'Aprano and David Mertz (and possibly others) were put off by the > verbosity and noisiness of the decorator-based solution you provided with > icontract (though I think there are ways to streamline that solution). It > seems like syntactic support could offer a more concise and less noisy > implementation. > > One thing that I can get on a soap-box about is the benefit putting the > most relevant information to the reader in the order of top to bottom and > left to right whenever possible. I've written many posts about this. I > think a lot of Python syntax gets this right. It would have been easy to > follow the same order as for-loops when designing comprehensions, but > expressions allow you some freedom to order things differently, so now > comprehensions read: > > squares = ... > # squares is > > squares = [... > # squares is a list > > squares = [number*number... 
> # squares is a list of num squared > > squares = [number*number for num in numbers] > # squares is a list of num squared 'from' numbers > > I think decorators sort-of break this rule because they can put a lot of > less important information (like, that a function is logged or timed) > before more important information (like the function's name, signature, > doc-string, etc...). It's not a huge deal because they tend to be > de-emphasized by my IDE and there typically aren't dozens of them on each > function, but I definitely prefer Eiffel's syntax > over > decorators for that reason. > > I understand that syntax changes have an very high bar for very good > reasons. Hillel Wayne's PyCon talk got me thinking that we might be close > enough to a really great solution to a wide variety of testing problems > that it might justify some new syntax or perhaps someone has an idea that > wouldn't require new syntax that I didn't think of. > > [Marko Ristin-Kaufmann] > >> Some of the aspects we still haven't figured out are: how to approach >> multi-threading (locking around the whole function with an additional >> decorator?) and granularity of contract switches (right now we use >> always/optimized, production/non-optimized and teating/slow, but it seems >> that a larger system requires finer categories). > > > Yeah... I don't know anything about testing concurrent or parallel code. > > On Wed, Nov 28, 2018 at 1:12 AM Marko Ristin-Kaufmann < > marko.ristin at gmail.com> wrote: > >> Hi Abe, >> >> I've been pulling a lot of ideas from the recent discussion on design by >>> contract (DBC), the elegance and drawbacks >>> of doctests >>> , and the amazing talk >>> given by Hillel Wayne at >>> this year's PyCon entitled "Beyond Unit Tests: Taking your Tests to the >>> Next Level". >>> >> >> Have you looked at the recent discussions regarding design-by-contract on >> this list ( >> https://groups.google.com/forum/m/#!topic/python-ideas/JtMgpSyODTU >> and the following forked threads)? >> >> You might want to have a look at static checking techniques such as >> abstract interpretation. I hope to be able to work on such a tool for >> Python in some two years from now. We can stay in touch if you are >> interested. >> >> Re decorators: to my own surprise, using decorators in a larger code base >> is completely practical including the readability and maintenance of the >> code. It's neither that ugly nor problematic as it might seem at first look. >> >> We use our https://github.com/Parquery/icontract at the company. Most of >> the design choices come from practical issues we faced -- so you might want >> to read the doc even if you don't plant to use the library. >> >> Some of the aspects we still haven't figured out are: how to approach >> multi-threading (locking around the whole function with an additional >> decorator?) and granularity of contract switches (right now we use >> always/optimized, production/non-optimized and teating/slow, but it seems >> that a larger system requires finer categories). >> >> Cheers Marko >> >> >> >> _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ronaldoussoren at mac.com Sun Dec 9 02:26:06 2018 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Sun, 9 Dec 2018 08:26:06 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <20181208041417.GI13061@ando.pearwood.info> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> <20181208041417.GI13061@ando.pearwood.info> Message-ID: <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> > On 8 Dec 2018, at 05:14, Steven D'Aprano wrote: > > On Sat, Dec 08, 2018 at 11:05:43AM +0900, INADA Naoki wrote: > >> We already use SHA256 on PyPI. >> Many project in the world moving from md5 to SHA256. > [...] > > > How easy is it to use sha256 on the major platforms, compared to md5? > > On Linux, it is just as easy: > > [steve at ando ~]$ md5sum x.py > 7008dcaa07fd35917474835425c6151a x.py > [steve at ando ~]$ sha256sum x.py > 6730dbf2b5ea5c874e789a39532b0e544af18fbea3c680880b01c81b773eabe2 x.py > > but how about Windows and Mac users? Do those platforms provide a sha256 > checksum utility? > > (Maybe we should provide both hashes.) macOS has a shasum tool that does the same thing: $ shasum -a 256 __init__.py 8db2fe0b21deec50d134895a6d5cfbb5300b23922bf2d9bb5b4b63ac40c6a22e __init__.py There?s also python itself that can be used to calculate the checksum :-) Ronald -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at barrys-emacs.org Sun Dec 9 09:54:44 2018 From: barry at barrys-emacs.org (Barry Scott) Date: Sun, 9 Dec 2018 14:54:44 +0000 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> <20181208041417.GI13061@ando.pearwood.info> <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> Message-ID: <1B3C5E4A-4B1A-4E1E-972E-FC59D736834A@barrys-emacs.org> On Windows 10 this works: c:Downloads> certutil -hashfile python-3.7.1-amd64.exe sha512 SHA512 hash of python-3.7.1-amd64.exe: 7dec6362c402b38a9c29b85b204398d7d3fd19509f05279bf713a92abe5b485d4c0c4b175c4edb47f81fd800a599bc2283642a8f0c666edd9e971b5cedf18041 CertUtil: -hashfile command completed successfully. Barry From p.f.moore at gmail.com Sun Dec 9 12:31:22 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 9 Dec 2018 17:31:22 +0000 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <1B3C5E4A-4B1A-4E1E-972E-FC59D736834A@barrys-emacs.org> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> <20181208041417.GI13061@ando.pearwood.info> <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> <1B3C5E4A-4B1A-4E1E-972E-FC59D736834A@barrys-emacs.org> Message-ID: On Sun, 9 Dec 2018 at 15:13, Barry Scott wrote: > > On Windows 10 this works: > > c:Downloads> certutil -hashfile python-3.7.1-amd64.exe sha512 > SHA512 hash of python-3.7.1-amd64.exe: > 7dec6362c402b38a9c29b85b204398d7d3fd19509f05279bf713a92abe5b485d4c0c4b175c4edb47f81fd800a599bc2283642a8f0c666edd9e971b5cedf18041 > CertUtil: -hashfile command completed successfully. In Powershell, there's Get-FileHash python-3.7.1-amd64.exe -Algorithm sha512. The default algorithm is SHA256. 
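(As Ronald notes, Python itself can compute the digest on any platform. A minimal hashlib sketch, for illustration only -- the file name is just the example used earlier in this thread:

    import hashlib

    def file_digest(path, algorithm="sha256", chunk_size=8192):
        # Read in chunks so a large installer never has to fit in memory.
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    print(file_digest("python-3.7.1-amd64.exe"))            # sha256
    print(file_digest("python-3.7.1-amd64.exe", "sha512"))  # sha512

The same few lines work unchanged on Windows, macOS and Linux.)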
On Windows, it's surprisingly often the case that things which traditionally fell under "Windows users probably don't have a tool to do that" are available in Powershell. None of which is that relevant, the fact still remains that no matter what algorithm is used, the hash only has limited value as a security measure. Paul From ronaldoussoren at mac.com Mon Dec 10 01:31:44 2018 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Mon, 10 Dec 2018 07:31:44 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> <20181208041417.GI13061@ando.pearwood.info> <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> <1B3C5E4A-4B1A-4E1E-972E-FC59D736834A@barrys-emacs.org> Message-ID: <9156E866-0904-4804-B3DE-52F6745B6D44@mac.com> > On 9 Dec 2018, at 18:31, Paul Moore wrote: > > None of which is that relevant, the fact still remains that no matter > what algorithm is used, the hash only has limited value as a security > measure. That's true, but it does show that switching from MD5 to SHA2 doesn't make it harder to validate the checksum on major platforms. I don't have a strong opinion either way, I'm slightly in favour of switching to the same algorithm as used on PyPI to be consistent within these PSF properties. BTW. I wonder how many actually verify these checksums, I personally generally assume that HTTPS downloads are reliable enough and don't verify checksums unless I do the download in an automation pipeline. Ronald From mhroncok at redhat.com Mon Dec 10 05:11:21 2018 From: mhroncok at redhat.com (=?UTF-8?Q?Miro_Hron=c4=8dok?=) Date: Mon, 10 Dec 2018 11:11:21 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> Message-ID: <6220c108-701c-ce04-656a-4a7d2210bfcc@redhat.com> On 07. 12. 18 at 15:49, Devin Jeanpierre wrote: > On Fri, Dec 7, 2018 at 1:40 AM Antoine Pitrou > wrote: > > md5 is only used for a quick integrity check here (think of it as a > sophisticated checksum). For security you need to verify the > corresponding GPG signature. > > > More to the point: you're getting the hash from the same place as the > binary. If one is vulnerable to modifications by attackers, both are. So > it doesn't matter. The real defense most people are relying on is TLS. Yes, I rely on TLS, but no, I'm not necessarily getting the archive from python.org. I might get it from a 3rd party that claims it's genuine. Such a party might be a Linux distro or another package manager (e.g. homebrew). I can of course use GPG to verify it, but for a quick check a sha512 sum works for me, while md5 not so much. In Fedora, we use sha512 checksums [1]. In homebrew they use sha256 [2].
[1] https://src.fedoraproject.org/rpms/python3/blob/master/f/sources [2] https://github.com/Homebrew/homebrew-core/blob/master/Formula/python.rb -- Miro Hrončok -- Phone: +420777974800 IRC: mhroncok From solipsis at pitrou.net Mon Dec 10 05:26:15 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 10 Dec 2018 11:26:15 +0100 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> <20181208041417.GI13061@ando.pearwood.info> <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> <1B3C5E4A-4B1A-4E1E-972E-FC59D736834A@barrys-emacs.org> <9156E866-0904-4804-B3DE-52F6745B6D44@mac.com> Message-ID: <20181210112615.51d37cfb@fsol> On Mon, 10 Dec 2018 07:31:44 +0100 Ronald Oussoren via Python-ideas wrote: > > That's true, but it does show that switching from MD5 to SHA2 doesn't make it harder to validate the checksum on major platforms. > > I don't have a strong opinion either way, I'm slightly in favour of switching to the same algorithm as used on PyPI to be consistent within these PSF properties. > > BTW. I wonder how many actually verify these checksums, I personally generally assume that HTTPS downloads are reliable enough and don't verify checksums unless I do the download in an automation pipeline. Ah, the automation use case is a good point in favor of stronger hashes. You may have checked the initial download hash and then use it in a script to make sure later downloads haven't been tampered with. Regards Antoine. From erik.m.bray at gmail.com Mon Dec 10 08:22:22 2018 From: erik.m.bray at gmail.com (E. Madison Bray) Date: Mon, 10 Dec 2018 14:22:22 +0100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C046217.7010805@canterbury.ac.nz> References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On Sun, Dec 2, 2018 at 11:52 PM Greg Ewing wrote: > > Steven D'Aprano wrote: > > Perhaps more like the principle of most > > astonishment: the object changes from sized to unsized even if you don't > > modify its value or its type, but merely if you look at it the wrong > > way: > > Yes, but keep in mind the purpose of the whole thing is to > provide a sequence interface while not breaking old code > that expects an iterator interface. Code that was written > to work with the existing map() will not be calling len() > on it at all, because that would never have worked. > > > Neither fish nor fowl with a confusing API that is not > > quite a sequence, not quite an iterator, not quite sized, but just > > enough of each to lead people into error. > > Yes, it's a compromise in the interests of backwards > compatibility. But there are no surprises as long as you > stick to one interface or the other. Weird things happen > if you mix them up, but sane code won't be doing that. Indeed; I believe it is very useful to have a map-like object that is effectively an augmented list/sequence.
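To make "augmented list/sequence" concrete, this is the kind of behaviour meant -- hypothetical here, since the current built-in map() supports none of it:

    lst = [1, 2, 3]
    m = map(str, lst)           # imagine this returned an augmented sequence
    assert len(m) == len(lst)   # 3
    assert m[1] == str(lst[1])  # '2', computed on demand
    assert list(m) == ['1', '2', '3']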
From bernardo at bernardosulzbach.com Mon Dec 10 09:44:43 2018 From: bernardo at bernardosulzbach.com (Bernardo Sulzbach) Date: Mon, 10 Dec 2018 12:44:43 -0200 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <6220c108-701c-ce04-656a-4a7d2210bfcc@redhat.com> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <6220c108-701c-ce04-656a-4a7d2210bfcc@redhat.com> Message-ID: If the discussion gets to which SHA-2 should be used, I would like to point out that SHA-512 is not only twice the width of SHA-256 but also faster to compute (anecdotally) on most 64-bit platforms. -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcos.eliziario at gmail.com Mon Dec 10 10:05:49 2018 From: marcos.eliziario at gmail.com (Marcos Eliziario) Date: Mon, 10 Dec 2018 13:05:49 -0200 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <6220c108-701c-ce04-656a-4a7d2210bfcc@redhat.com> Message-ID: My two cents. Automation tools should check the PGP signature. The public keys should be obtained once via https from an odd number of different trustworthy sources from a set of well-known domains that use DNSSEC. Users should be advised to check the certificate chain from those domains the first time those keys are downloaded and explicitly agree. This is a more secure scheme than simply relying on a checksum that you've got from the same site you've used to download the code. Moving from MD5 to SHA obscures this, by making people believe that this hash should be used for anything more than checking for file corruption. On Mon, 10 Dec 2018 at 12:45, Bernardo Sulzbach < bernardo at bernardosulzbach.com> wrote: > If the discussion gets to which SHA-2 should be used, I would like to > point out that SHA-512 is not only twice the width of SHA-256 but also > faster to compute (anecdotally) on most 64-bit platforms. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Marcos Eliziário Santos mobile/whatsapp/telegram: +55(21) 9-8027-0156 skype: marcos.eliziario at gmail.com linked-in : https://www.linkedin.com/in/eliziario/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcos.eliziario at gmail.com Mon Dec 10 10:28:31 2018 From: marcos.eliziario at gmail.com (Marcos Eliziario) Date: Mon, 10 Dec 2018 13:28:31 -0200 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <6220c108-701c-ce04-656a-4a7d2210bfcc@redhat.com> Message-ID: A hash is surely useful in the context of locking versions of software packages in Pipfile.lock, because it tells us that the code we are downloading has not changed since the first time we saw this particular version of the package, but only a signature scheme tells us with a reasonable degree of certainty (though, not absolute) that this particular version of the code came from who it claims to have come from.
If an attacker is able to hijack the github repository of a project and its website, especially on low-activity projects, nothing would prevent them from releasing a rogue version, with people downloading it and using it for some time until the rightful maintainers of said project are able to take back control of it. Signing of course is as secure as the ability of said project maintainers to keep their private keys safe. But while we know that nothing can be made 100% secure, a culture that relies on signatures is inherently more secure than relying only on hashes, no matter how cryptographically strong they may be. Hashes tell us that the code we've downloaded is the same as some other blob of code stored somewhere that, for whatever reason, we trust. PGP tells us that there is a high probability, assuming the private keys haven't been compromised, and that a lot of people agree that the public key we have came from the right person or organization, that this blob of code came from who it says it came from. On Mon, 10 Dec 2018 at 13:05, Marcos Eliziario < marcos.eliziario at gmail.com> wrote: > My two cents. > Automation tools should check the PGP signature. The public keys should be > obtained once via https from an odd number of different trustworthy sources > from a set of well-known domains that use DNSSEC. Users should be advised to > check the certificate chain from those domains the first time those keys > are downloaded and explicitly agree. This is a more secure scheme than > simply relying on a checksum that you've got from the same site you've used > to download the code. > Moving from MD5 to SHA obscures this, by making people believe that this > hash should be used for anything more than checking for file corruption. > > On Mon, 10 Dec 2018 at 12:45, Bernardo Sulzbach < > bernardo at bernardosulzbach.com> wrote: > >> If the discussion gets to which SHA-2 should be used, I would like to >> point out that SHA-512 is not only twice the width of SHA-256 but also >> faster to compute (anecdotally) on most 64-bit platforms. >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > > > -- > Marcos Eliziário Santos > mobile/whatsapp/telegram: +55(21) 9-8027-0156 > skype: marcos.eliziario at gmail.com > linked-in : https://www.linkedin.com/in/eliziario/ > > -- Marcos Eliziário Santos mobile/whatsapp/telegram: +55(21) 9-8027-0156 skype: marcos.eliziario at gmail.com linked-in : https://www.linkedin.com/in/eliziario/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.franke at campus.tu-berlin.de Mon Dec 10 16:47:07 2018 From: m.franke at campus.tu-berlin.de (Franke, Maximilian Julian Shawn) Date: Mon, 10 Dec 2018 21:47:07 +0000 Subject: [Python-ideas] TAPS Implementation Message-ID: <8fe47843f248469cb7c221bee1053408@EX-MBX-02.tubit.win.tu-berlin.de> Hi, I am a student worker with the Internet Networks Architecture department at TU Berlin and I am working with APIs for network protocols. We are currently looking into implementing TAPS, a novel way to offer transport layer services to the application layer. The idea is to offer an API on top of multiple different transport protocols, such as TCP and QUIC. Instead of explicitly choosing a transport protocol, the application only provides abstract requirements, e.g., reliability.
The TAPS system maps these properties to transport protocols, potentially trying out multiple protocols in parallel. Furthermore, TAPS can select between multiple local interfaces and remote IP addresses. A short talk (~25 minutes) from the All systems go! conference about it is available here: https://media.ccc.de/v/ASG2018-188-the_future_of_networking_apis. TAPS is currently being standardized in the IETF (https://datatracker.ietf.org/wg/taps/about/). Here you can find the proposed architecture: https://datatracker.ietf.org/doc/draft-ietf-taps-arch/, interface: https://datatracker.ietf.org/doc/draft-ietf-taps-interface/ and an informal draft on implementation considerations: https://datatracker.ietf.org/doc/draft-ietf-taps-impl/. One of the implementations currently in the works is done by Apple in the form of their Network.framework API (https://developer.apple.com/documentation/network). While this implementation is relatively advanced, it is so far only available for macOS, iOS and its derivatives. As such, it would be favorable to have a platform-agnostic and open-source implementation as well. From what we can tell, asyncio seems to offer a lot of the groundwork necessary to implement it efficiently, so here are some questions we have before beginning with the implementation: - Is something like this in scope to become part of the standard Python library, or something that would be done in an external library? If it is in scope, what would the requirements for it to become part of the standard library be? - Are there currently any other active efforts to implement new network functionality in the standard library? - Are there currently any considerations to expand the standard transports offered by asyncio (TCP, UDP and SSL) with additional ones like SCTP, or more importantly QUIC? Any comments or further pointers to sources that could be helpful with this would be greatly appreciated. Best regards Max -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Mon Dec 10 17:31:36 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 11 Dec 2018 09:31:36 +1100 Subject: [Python-ideas] TAPS Implementation In-Reply-To: <8fe47843f248469cb7c221bee1053408@EX-MBX-02.tubit.win.tu-berlin.de> References: <8fe47843f248469cb7c221bee1053408@EX-MBX-02.tubit.win.tu-berlin.de> Message-ID: <20181210223135.GB13061@ando.pearwood.info> Hi Max, and welcome! On Mon, Dec 10, 2018 at 09:47:07PM +0000, Franke, Maximilian Julian Shawn wrote: [...] > We are currently looking into implementing TAPS, a novel way to offer > transport layer services to the application layer. [...] > TAPS is currently being standardized ... > Here you can find the proposed architecture ... These are factors which strongly go against TAPS being implemented in the standard library: it is novel and the usage of it is unproven, and it hasn't been standardized yet. Generally speaking, the Python standard library only provides proven, standardized protocols. A few reasons for this: - We don't have the resources of Apple, we can't support everything, so we have to choose those which are most likely to be useful; that means those with a proven track-record, not experimental or novel protocols. - We take backwards-compatibility seriously, so with a few exceptions, any API we offer would have to be stable. (There are ways around this, but we don't use them lightly.)
- The Python release cycle is relatively sedate and slow, and experimental libraries usually need a much faster release cycle. This is not to absolutely rule out a std lib implementation. If the networking experts among the core developers think this is a good idea, it could happen, regardless of how novel it is. But in the meantime, I recommend that you consider writing a library and offering it on PyPI as a third-party library: https://pypi.org/ If you are still keen to push for a standard library implementation, you will probably need to write a PEP: https://www.python.org/dev/peps/ At the very least, reading over some successful PEPs will suggest what sort of arguments you should make in order to get TAPS approved. -- Steve From solipsis at pitrou.net Mon Dec 10 17:50:33 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 10 Dec 2018 23:50:33 +0100 Subject: [Python-ideas] TAPS Implementation References: <8fe47843f248469cb7c221bee1053408@EX-MBX-02.tubit.win.tu-berlin.de> <20181210223135.GB13061@ando.pearwood.info> Message-ID: <20181210235033.47381eea@fsol> On Tue, 11 Dec 2018 09:31:36 +1100 Steven D'Aprano wrote: > Hi Max, and welcome! > > On Mon, Dec 10, 2018 at 09:47:07PM +0000, Franke, Maximilian Julian Shawn wrote: > [...] > > We are currently looking into implementing TAPS, a novel way to offer > > transport layer services to the application layer. > [...] > > TAPS is currently being standardized ... > > Here you can find the proposed architecture ... > > These are factors which strongly go against TAPS being implemented in > the standard library: it is novel and the usage of it is unproven, and > it hasn't been standardized yet. I agree that TAPS doesn't look proven at all (I would also add that I'm a bit skeptical it will achieve the stated goals -- but we'll see). IMO it's not a good candidate for standard library inclusion. Regards Antoine. From chris.barker at noaa.gov Mon Dec 10 20:15:36 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 10 Dec 2018 17:15:36 -0800 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On Mon, Dec 10, 2018 at 5:23 AM E. Madison Bray wrote: > Indeed; I believe it is very useful to have a map-like object that is > effectively an augmented list/sequence. but what IS a "map-like object" -- I'm trying to imagine what that actually means. "map" takes a function and maps it onto a interable, returning a new iterable. So a map object is an iterable -- what's under the hood being used to create it is (and should remain) opaque. Back in the day, Python was "all about sequences" -- so map() took a sequence and returned a sequence (an actual list, but that's not the point here). And that's pretty classic "map". With py3, there was a big shift toward iterables, rather than sequences as the core type to work with. There are a few other benefits, but the main one is that often sequences were made, simply so that they could be immediately iterated over, and that was a waste of resources. for i, item in enumerate(a_sequence): ... for x, y in zip(seq1, seq2): ... 
These two are pretty obvious, but the same approach was taken over much of python: dict.keys(), map(), range(), .... So now in Python, you need to decide, when writing code, what your API is -- does your function take a sequence? or does it take an iterable? Of course, a sequence is an iterable, but an iterable is not (necessarily) a sequence. -- so back in the day, you didn't really need to make the decision. So in the case of the Sage example -- I wonder what the real problem is -- if you have an API that requires a sequence, on Py2, folks may well have been passing it the result of a map() call. -- note that they weren't passing a "map object" that is now somehow different than it used to be -- they were passing a list plain and simple. And there are all sorts of places, when converting from py2 to py3, where you will now get an iterable that isn't a proper sequence, and if the API you are using requires a sequence, you need to wrap a list() or tuple() or some such around it to make the sequence. Note that you can write your code to work under either 2 or 3, but it's really hard to write a library so that your users can run it under either 2 or 3 without any change in their code! But note: the fact that it's a map object is just one special case. I suppose one could write an API now that actually expects a map object (rather than a generic sequence or iterable) but it was literally impossible in py2 -- there was no such object. I'm still confused -- what's so wrong with: list(map(func, some_iterable)) if you need a sequence? You can, of course, make lazy-evaluated sequences (like range), and so you could make a map-like function that required a sequence as input, and would lazily evaluate that sequence. This could be useful if you weren't going to work with the entire collection, but really wanted to only index out a few items, but I'm trying to imagine a use case for that, and I haven't. And I don't think that's the use case that started this thread... -CHB > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Mon Dec 10 20:23:20 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 10 Dec 2018 17:23:20 -0800 Subject: [Python-ideas] Using sha512 instead of md5 on python.org/downloads In-Reply-To: <9156E866-0904-4804-B3DE-52F6745B6D44@mac.com> References: <775682f6-16f0-e7a7-dd17-7e3ccfb7e772@redhat.com> <20181207103930.565ce442@fsol> <20181207194724.7cb16fb5@fsol> <20181207233805.GE13061@ando.pearwood.info> <20181208041417.GI13061@ando.pearwood.info> <19C76C80-0055-4980-953B-E94288CEC06D@mac.com> <1B3C5E4A-4B1A-4E1E-972E-FC59D736834A@barrys-emacs.org> <9156E866-0904-4804-B3DE-52F6745B6D44@mac.com> Message-ID: On Sun, Dec 9, 2018 at 10:32 PM Ronald Oussoren via Python-ideas < python-ideas at python.org> wrote: > BTW.
I wonder how many actually verify these checksums, > Hardly anyone -- most of us verify the download by trying to use it :-) Which doesn't mean that we shouldn't have it -- but it will indeed make very little difference to the vast majority of users -- and those that do check it are generally pretty sophisticated -- shouldn't be hard to use a different hash algorithm. Though people would have to update their workflows, which could be annoying. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From erik.m.bray at gmail.com Tue Dec 11 05:37:25 2018 From: erik.m.bray at gmail.com (E. Madison Bray) Date: Tue, 11 Dec 2018 11:37:25 +0100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On Tue, Dec 11, 2018 at 2:16 AM Chris Barker wrote: > On Mon, Dec 10, 2018 at 5:23 AM E. Madison Bray wrote: >> >> Indeed; I believe it is very useful to have a map-like object that is >> effectively an augmented list/sequence. > > > but what IS a "map-like object" -- I'm trying to imagine what that actually means. > > "map" takes a function and maps it onto a interable, returning a new iterable. So a map object is an iterable -- what's under the hood being used to create it is (and should remain) opaque. I don't understand why this is confusing. Greg gave an example of what this *might* mean up thread. It's not the only possible approach but it is one that makes a lot of sense to me. The way you're defining "map" is arbitrary and post-hoc. It's a definition that makes sense for "map" that's restricted to iterating over arbitrary iterators. It's how it happens to be defined in Python 3 for various reasons that you took time to explain at great length, which I regret to inform you was time wasted explaining things I already know. For something like a fixed sequence a "map" could just as easily be defined as a pair (, ) that applies , which I'm claiming is a pure function, to every element returned by the . This transformation can be applied lazily on a per-element basis whether I'm iterating over it, or performing random access (since is known for all N). Python has no formal notion of a pure function, but I'm an adult and can accept responsibility if I try to use this "map-like" object in a way that is not logically consistent. The stuff about Sage is beside the point. I'm not even talking about that anymore. 
From p.f.moore at gmail.com Tue Dec 11 06:13:12 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 11 Dec 2018 11:13:12 +0000 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On Tue, 11 Dec 2018 at 10:38, E. Madison Bray wrote: > I don't understand why this is confusing. [...] > For something like a fixed sequence a "map" could just as easily be > defined as a pair (, ) that applies , > which I'm claiming is a pure function, to every element returned by > the . This transformation can be applied lazily on a > per-element basis whether I'm iterating over it, or performing random > access (since is known for all N). What's confusing to *me*, at least, is what's actually being suggested here. There's a lot of theoretical discussion, but I've lost track of how it's grounded in reality: 1. If we're saying that "it would be nice if there were a function that acted like map but kept references to its arguments", that's easy to do as a module on PyPI. Go for it - no-one will have any problem with that. 2. If we're saying "the builtin map needs to behave like that", then 2a. *Why*? What is so special about this situation that the builtin has to be changed? 2b. Compatibility questions need to be addressed. Is this important enough to code that "needs" it that such code is OK with being Python 3.8+ only? If not, why aren't the workarounds needed for Python 3.7 good enough? (Long term improvement and simplification of the code *is* a sufficient reason here, it's just something that should be explicit, as it means that the benefits are long-term rather than immediate). 2c. Weird corner case questions, while still being rare, *do* need to be addressed - once a certain behaviour is in the stdlib, changing it is a major pain, so we have a responsibility to get even the corner cases right. 2d. It's not actually clear to me how critical that need actually is. Nice to have, sure (you only need a couple of people who would use a feature for it to be "nice to have") but beyond that I haven't seen a huge number of people offering examples of code that would benefit (you mentioned Sage, but that example rapidly degenerated into debates about Sage's design, and while that's a very good reason for not wanting to continue using that as a use case, it does leave us with few actual use cases, and none that I'm aware of that are in production code...) 3. If we're saying something else (your comment "map could just as easily be defined as..." suggests that you might be) then I'm not clear what it is. Can you describe your proposal as pseudo-code, or a Python implementation of the "map" replacement you're proposing? Paul From erik.m.bray at gmail.com Tue Dec 11 06:48:10 2018 From: erik.m.bray at gmail.com (E. 
Madison Bray) Date: Tue, 11 Dec 2018 12:48:10 +0100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On Tue, Dec 11, 2018 at 12:13 PM Paul Moore wrote: > > On Tue, 11 Dec 2018 at 10:38, E. Madison Bray wrote: > > I don't understand why this is confusing. > [...] > > For something like a fixed sequence a "map" could just as easily be > > defined as a pair (<function>, <sequence>) that applies <function>, > > which I'm claiming is a pure function, to every element returned by > > the <sequence>. This transformation can be applied lazily on a > > per-element basis whether I'm iterating over it, or performing random > > access (since <sequence>[N] is known for all N). > > What's confusing to *me*, at least, is what's actually being suggested > here. There's a lot of theoretical discussion, but I've lost track of > how it's grounded in reality: It's true, this has been a wide-ranging discussion and it's confusing. Right now I'm specifically responding to the sub-thread that Greg started "Suggested MapView object", so I'm considering this a mostly clean slate from the previous thread "__len__() for map()". Different ideas have been tossed around and the discussion has me thinking about broader possibilities. I responded to this thread because I liked Greg's proposal and the direction he's suggesting. I think that the motivation underlying much of this discussion, for both the OP who started the original thread, as well as myself, and others is that before Python 3 changed the implementation of map() there were certain assumptions one could make about map() called on a list* which, under normal circumstances were quite reasonable and sane (e.g. len(map(func, lst)) == len(lst), or map(func, lst)[N] == func(lst[N])). Python 3 broke all of these assumptions, for reasons that I personally have no disagreement with, in terms of motivation. However, in retrospect, it might have been nice if more consideration were given to backwards compatibility for some "obvious" simple cases. This isn't a Python 2 vs Python 3 whine though: I'm just trying to think about how I might expect map() to work on different types of arguments, and I see no problem--so long as it's properly documented--with making its behavior somewhat polymorphic on the types of arguments. The idea would be to now enhance the existing built-ins to restore at least some previously lost assumptions, at least in the relevant cases. To give an analogy, Python 3.0 replaced range() with (effectively) xrange(). This broke a lot of assumptions that the object returned by range(N) would work much like a list, and Python 3.2 restored some of that list-like functionality by adding support for slicing and negative indexing on range(N). I believe it's worth considering such enhancements for filter() and map() as well, though these are obviously a bit trickier. * or other fixed-length sequence, but let's just use list as a shorthand, and assume for the sake of simplicity a single list as well. > 1. If we're saying that "it would be nice if there were a function > that acted like map but kept references to its arguments", that's easy > to do as a module on PyPI. Go for it - no-one will have any problem > with that.
Sure, though since this is about the behavior of global built-ins that are commonly used by users at all experience levels the problem is a bit hairier. Anybody can implement anything they want and put it in a third-party module. That doesn't mean anyone will use it. I still have to write code that handles map objects. In retrospect I think Guido might have had the right idea of wanting to move map() and filter() into functools along with reduce(). There's a surprisingly lot more at stake in terms of backwards compatibility and least-astonishment when it comes to built-ins. I think that's in part why the new Python 3 definitions of map() and filter() were kept so simple: although they were not backwards compatible I do think they were well designed to minimize astonishment. That's why I don't necessarily disagree with the choices made (but still would like to think about how we can make enhancements going forward). > 2. If we're saying "the builtin map needs to behave like that", then > 2a. *Why*? What is so special about this situation that the builtin > has to be changed? Same question could apply to last time it was changed. I think now we're trying to find some middle-ground. > 2b. Compatibility questions need to be addressed. Is this important > enough to code that "needs" it that such code is OK with being Python > 3.8+ only? If not, why aren't the workarounds needed for Python 3.7 > good enough? (Long term improvement and simplification of the code > *is* a sufficient reason here, it's just something that should be > explicit, as it means that the benefits are long-term rather than > immediate). That's a good point: I think the same arguments as for enhancing range() apply here, but this is worth further consideration (though having a more concrete proposal in the first place should come first). > 2c. Weird corner case questions, while still being rare, *do* need > to be addressed - once a certain behaviour is in the stdlib, changing > it is a major pain, so we have a responsibility to get even the corner > cases right. It depends on what you mean by getting them "right". It's definitely worth going over as one can think of. Not all corner cases have a satisfying resolution (and may be highly context-dependent). In those cases getting it "right" is probably no more than documenting that corner case and perhaps warning against it. > 2d. It's not actually clear to me how critical that need actually > is. Nice to have, sure (you only need a couple of people who would use > a feature for it to be "nice to have") but beyond that I haven't seen > a huge number of people offering examples of code that would benefit > (you mentioned Sage, but that example rapidly degenerated into debates > about Sage's design, and while that's a very good reason for not > wanting to continue using that as a use case, it does leave us with > few actual use cases, and none that I'm aware of that are in > production code...) That's a fair point worthy of further consideration. To me, at least, map on a list working as an augmented list is obvious, clear, useful, at solves most of the use-cases where having map.__len__ might be desirable, among others. > 3. If we're saying something else (your comment "map could just as > easily be defined as..." suggests that you might be) then I'm not > clear what it is. Can you describe your proposal as pseudo-code, or a > Python implementation of the "map" replacement you're proposing? Again, I'm mostly responding to Greg's proposal which I like. 
To extend it, I'm suggesting that a call to map() where all the arguments are sequences** might return something like his MapView. If even that idea is crazy or impractical though, I can accept that. But I think it's quite analogous to how map on arbitrary iterables went from immediate evaluation to lazy evaluation while iterating: in the same way map on some sequence(s) can be evaluated lazily on random access. ** I have a separate complaint that there's no great way, at the Python level, to define a class that is explicitly a "sequence" as opposed to a more general "mapping", but that's a topic for another thread... From p.f.moore at gmail.com Tue Dec 11 07:53:38 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 11 Dec 2018 12:53:38 +0000 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On Tue, 11 Dec 2018 at 11:49, E. Madison Bray wrote: > The idea would be to now enhance the existing built-ins to restore at > least some previously lost assumptions, at least in the relevant > cases. To give an analogy, Python 3.0 replaced range() with > (effectively) xrange(). This broken a lot of assumptions that the > object returned by range(N) would work much like a list, and Python > 3.2 restored some of that list-like functionality by adding support > for slicing and negative indexing on range(N). I believe it's worth > considering such enhancements for filter() and map() as well, though > these are obviously a bit trickier. Thanks. That clarifies the situation for me very well. I agree with most of the comments you made, although I don't have any good answers. I think you're probably right that Guido's original idea to move map and filter to functools might have been better, forcing users to explicitly choose between a genexp and a list comprehension. On the other hand, it might have meant people used more lists than they needed to, as a result. Paul From steve at pearwood.info Tue Dec 11 09:47:30 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 12 Dec 2018 01:47:30 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: <20181211144726.GE13061@ando.pearwood.info> On Mon, Dec 10, 2018 at 05:15:36PM -0800, Chris Barker via Python-ideas wrote: [...] > I'm still confused -- what's so wrong with: > > list(map(func, some_iterable)) > > if you need a sequence? You might need a sequence. Why do you think that has to be an *eager* sequence? I can think of two obvious problems with eager sequences: space and time. They can use too much memory, and they can take too much time to generate them up-front and too much time to reap when they become garbage. And if you have an eager sequence, and all you want is the first item, you still have to generate all of them even though they aren't needed. 
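A small illustration of that cost difference -- expensive() here is just a hypothetical stand-in for a slow lookup or a large allocation:

    calls = 0

    def expensive(x):
        global calls
        calls += 1              # count how many times the real work is done
        return x * 2

    lazy = map(expensive, range(1000))
    first = next(lazy)          # calls == 1: only the item we asked for was computed

    calls = 0
    eager = list(map(expensive, range(1000)))
    first = eager[0]            # calls == 1000: everything was computed up front

A lazy sequence view would keep the first property (work done only on demand) while also keeping len() and indexing.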
We can afford to be profligate with memory when the data is small, but eventually you run into cases where having two copies of the data is one copy too many. > You can, of course mike lazy-evaluated sequences (like range), and so you > could make a map-like function that required a sequence as input, and would > lazy evaluate that sequence. This could be useful if you weren't going to > work with the entire collection, Or even if you *are* going to work with the entire collection, but you don't need them all at once. I once knew a guy whose fondest dream was to try the native cuisine of every nation of the world ... but not all in one meal. This is a classic time/space tradeoff: for the cost of calling the mapping function anew each time we index the sequence, we can avoid allocating a potentially huge list and calling a potentially expensive function up front for items we're never going to use. Instead, we call it only on demand. These are the same principles that justify (x)range and dict views. Why eagerly generate a list up front, if you only need the values one at a time on demand? Why make a copy of the dict keys, if you don't need a copy? These are not rhetorical questions. This is about avoiding the need to make unnecessary copies for those times we *don't* need an eager sequence generated up front, keeping the laziness of iterators and the random-access of sequences. map(func, sequence) is a great candidate for this approach. It has to hold onto a reference to the sequence even as an iterator. The function is typically side-effect free (a pure function), and if it isn't, "consenting adults" applies. We've already been told there's at least one major Python project, Sage, where this would have been useful. There's a major functional language, Haskell, where nearly all sequence processing follows this approach. I suggest we provide a separate mapview() type that offers only the lazy sequence API, without trying to be an iterator at the same time. If you want an eager sequence, or an iterator, they're only a single function call away: list(mapview_instance) iter(mapview_instance) # or just stick to map() Rather than trying to guess whether people want to treat their map objects as sequences or iterators, we let them choose which they want and be explicit about it. Consider the history of dict.keys(), values() and items() in Python 2. Originally they returned eager lists. Did we try to retrofit view-like and iterator-like behaviour onto the existing dict.keys() method, returning a cunning object which somehow turned from a list to a view to an iterator as needed? Hell no! We introduced *six new methods* on dicts: - dict.iterkeys() - dict.viewkeys() and similar for items() and values(). Compared to that, adding a single variant on map() that expects a sequence and returns a view on the sequence seems rather timid. -- Steve From steve at pearwood.info Tue Dec 11 11:26:27 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 12 Dec 2018 03:26:27 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: <20181211162627.GF13061@ando.pearwood.info> On Tue, Dec 11, 2018 at 12:48:10PM +0100, E. 
Madison Bray wrote: > Right now I'm specifically responding to the sub-thread that Greg > started "Suggested MapView object", so I'm considering this a mostly > clean slate from the previous thread "__len__() for map()". Different > ideas have been tossed around and the discussion has me thinking about > broader possibilities. I responded to this thread because I liked > Greg's proposal and the direction he's suggesting. Greg's code can be found here: https://mail.python.org/pipermail/python-ideas/2018-December/054659.html His MapView tries to be both an iterator and a sequence at the same time, but it is neither. The iterator protocol is that iterators must: - have a __next__ method; - have an __iter__ method which returns self; and the test for an iterator is: obj is iter(obj) https://docs.python.org/3/library/stdtypes.html#iterator-types Greg's MapView object is an *iterable* with a __next__ method, which makes it neither a sequence nor a iterator, but a hybrid that will surprise people who expect it to act considently as either. This is how iterators work: py> x = iter("abcdef") # An actual iterator. py> next(x) 'a' py> next(x) 'b' py> next(iter(x)) 'c' Greg's hybrid violates that expected behaviour: py> x = MapView(str.upper, "abcdef") # An imposter. py> next(x) 'A' py> next(x) 'B' py> next(iter(x)) 'A' As an iterator, it is officially "broken", continuing to yield values even after it is exhausted: py> x = MapView(str.upper, 'a') py> next(x) 'A' py> next(x) Traceback (most recent call last): File "", line 1, in File "/home/steve/gregmapview.py", line 24, in __next__ return next(self.iterator) StopIteration py> list(x) # But wait! There's more! ['A'] py> list(x) # And even more! ['A'] This hybrid is fragile: whether operations succeed or not depend on the order that you call them: py> x = MapView(str.upper, "abcdef") py> len(x)*next(x) # Safe. But only ONCE. 'AAAAAA' py> y = MapView(str.upper, "uvwxyz") py> next(y)*len(y) # Looks safe. But isn't. Traceback (most recent call last): File "", line 1, in File "/home/steve/gregmapview.py", line 12, in __len__ raise TypeError("Mapping iterator has no len()") TypeError: Mapping iterator has no len() (For brevity, from this point on I shall trim the tracebacks and show only the final error message.) Things that work once, don't work a second time. py> len(x)*next(x) # Worked a moment ago, but now it is broken. TypeError: Mapping iterator has no len() If you pass your MapView object to another function, it can accidentally sabotage your code: py> def innocent_looking_function(obj): ... next(obj) ... py> x = MapView(str.upper, "abcdef") py> len(x) 6 py> innocent_looking_function(x) py> len(x) TypeError: Mapping iterator has no len() I presume this is just an oversight, but indexing continues to work even when len() has been broken. Greg seems to want to blame the unwitting coder who runs into these boobytraps: "But there are no surprises as long as you stick to one interface or the other. Weird things happen if you mix them up, but sane code won't be doing that." (URL as above). This MapView class offers a hybrid "sequence plus iterator, together at last!" double-headed API, and even its creator says that sane code shouldn't use that API. Unfortunately, you can't use the iterator API, because its broken as an iterator, and you can't use it as a sequence, because any function you pass it to might use it as an iterator and pull the rug out from under your feet. 
Greg's code is, apart from the addition of the __next__ method, almost identical to the version of mapview I came up with in my own testing. Except Greg's is even better, since I didn't bother handling the multiple-sequences case and his does. It's the __next__ method which ruins it, by trying to graft on almost- but-not-really iterator behaviour onto something which otherwise is a sequence. I don't think there's any way around that: I think that any attempt to make a single MapView object work as either a sequence with a length and indexing AND an iterator with next() and no length and no indexing is doomed to the same problems. Far from minimizing surprise, it will maximise it. Look at how many violations of the Principle Of Least Surprise Greg's MapView has: - If an object has a __len__ method, calling len() on it shouldn't raise TypeError; - If you called len() before, and it succeeded, calling it again should also succeed; - if an object has a __next__ method, it should be an iterator, and that means iter(obj) is obj; - if it isn't an iterator, you shouldn't be able to call next() on it; - if it is an iterator, once it is exhausted, it should stay exhausted; - iterating over an object (calling next() or iter() on it) shouldn't change it from a sequence to a non-sequence; - passing a sequence to another function, shouldn't result in that sequence no longer supporting len() or indexing; - if an object has a length, then it should still have a length even after iterating over it. I may have missed some. -- Steve From chris.barker at noaa.gov Tue Dec 11 12:01:27 2018 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Tue, 11 Dec 2018 09:01:27 -0800 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <20181211144726.GE13061@ando.pearwood.info> References: <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211144726.GE13061@ando.pearwood.info> Message-ID: Perhaps I got confused by the early part of this discussion. My point was that there is no "map-like" object at the Python level. (That is, no Map abc). Py2's map produced a sequence. Py3's map produced an iterable. So any API that was expecting a sequence could accept the result of a py2 map, but not a py3 map. There is absolutely nothing special about map here. The example of range has been brought up, but I don't think it's analogous -- py2 range returns a list, py3 range returns an immutable sequence. Because that's as close as we can get to a sequence while preserving the lazy evaluation that is wanted. I _think_ someone may be advocating that map() could return an iterable if it is passed an iterable, and a sequence if it is passed a sequence. Yes, it could, but that seems like a bad idea to me. But folks are proposing a "map" that would produce a lazy-evaluated sequence. Sure -- as Paul said, put it up on pypi and see if folks find it useful. Personally, I'm still finding it hard to imagine a use case where you need the sequence features, but also lazy evaluation is important. Sure: range() has that, but it came at almost zero cost, and I'm not sure the sequence features are used much. Note: the one use-case I can think of for a lazy evaluated sequence instead of an iterable is so that I can pick a random element with random.choice(). (Try to pick a random item from a dict), but that doesn't apply here -- pick a random item from the source sequence instead. But this is a specific example of a general use case: you need to access only a subset of the mapped sequence (or access it out of order) so using the iterable version won't work, and it may be large enough that making a new sequence is too resource intensive. Seems rare to me, and in many cases, you could do the subsetting before applying the function, so I think it's a pretty rare use case. But go ahead and make it -- I've been wrong before :-) -CHB Sent from my iPhone > On Dec 11, 2018, at 6:47 AM, Steven D'Aprano wrote: > >> On Mon, Dec 10, 2018 at 05:15:36PM -0800, Chris Barker via Python-ideas wrote: >> [...] >> I'm still confused -- what's so wrong with: >> >> list(map(func, some_iterable)) >> >> if you need a sequence? > > You might need a sequence. Why do you think that has to be an *eager* > sequence? > > I can think of two obvious problems with eager sequences: space and > time. They can use too much memory, and they can take too much time to > generate them up-front and too much time to reap when they become > garbage. And if you have an eager sequence, and all you want is the > first item, you still have to generate all of them even though they > aren't needed. > > We can afford to be profligate with memory when the data is small, but > eventually you run into cases where having two copies of the data is one > copy too many. > > >> You can, of course make lazy-evaluated sequences (like range), and so you >> could make a map-like function that required a sequence as input, and would >> lazy evaluate that sequence. This could be useful if you weren't going to >> work with the entire collection, > > Or even if you *are* going to work with the entire collection, but you > don't need them all at once. I once knew a guy whose fondest dream was > to try the native cuisine of every nation of the world ... but not all > in one meal. > > This is a classic time/space tradeoff: for the cost of calling the > mapping function anew each time we index the sequence, we can avoid > allocating a potentially huge list and calling a potentially expensive > function up front for items we're never going to use. Instead, we call > it only on demand. > > These are the same principles that justify (x)range and dict views. Why > eagerly generate a list up front, if you only need the values one at a > time on demand? Why make a copy of the dict keys, if you don't need a > copy? These are not rhetorical questions. > > This is about avoiding the need to make unnecessary copies for those > times we *don't* need an eager sequence generated up front, keeping the > laziness of iterators and the random-access of sequences. > > map(func, sequence) is a great candidate for this approach. It has to > hold onto a reference to the sequence even as an iterator. The function > is typically side-effect free (a pure function), and if it isn't, > "consenting adults" applies. We've already been told there's at least > one major Python project, Sage, where this would have been useful. > > There's a major functional language, Haskell, where nearly all sequence > processing follows this approach. > > I suggest we provide a separate mapview() type that offers only the lazy > sequence API, without trying to be an iterator at the same time.
If you > want an eager sequence, or an iterator, they're only a single function > call away: > > list(mapview_instance) > iter(mapview_instance) # or just stick to map() > > Rather than trying to guess whether people want to treat their map > objects as sequences or iterators, we let them choose which they want > and be explicit about it. > > Consider the history of dict.keys(), values() and items() in Python 2. > Originally they returned eager lists. Did we try to retrofit view-like > and iterator-like behaviour onto the existing dict.keys() method, > returning a cunning object which somehow turned from a list to a view to > an iterator as needed? Hell no! We introduced *six new methods* on > dicts: > > - dict.iterkeys() > - dict.viewkeys() > > and similar for items() and values(). > > Compared to that, adding a single variant on map() that expects a > sequence and returns a view on the sequence seems rather timid. > > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From tjreedy at udel.edu Tue Dec 11 12:51:17 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 11 Dec 2018 12:51:17 -0500 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C033044.9080907@canterbury.ac.nz> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> Message-ID: On 12/1/2018 8:07 PM, Greg Ewing wrote: > Steven D'Aprano wrote: After defining a separate iterable mapview sequence class >> For backwards compatibility reasons, we can't just make map() work like >> this, because that's a change in behaviour. > > Actually, I think it's possible to get the best of both worlds. I presume you mean the '(iterable) sequence' 'iterator' worlds. I don't think they should be mixed. A sequence is reiterable, an iterator is once through and done. > Consider this: > > from operator import itemgetter > > class MapView: > > def __init__(self, func, *args): > self.func = func > self.args = args > self.iterator = None > > def __len__(self): > return min(map(len, self.args)) > > def __getitem__(self, i): > return self.func(*list(map(itemgetter(i), self.args))) > > def __iter__(self): > return self > > def __next__(self): > if not self.iterator: > self.iterator = map(self.func, *self.args) > return next(self.iterator) The last two (unnecessarily) restrict this to being a once through iterator.
I think much better would be def __iter__(self): return map(self.func, *self.args) -- Terry Jan Reedy From tjreedy at udel.edu Tue Dec 11 13:06:47 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 11 Dec 2018 13:06:47 -0500 Subject: [Python-ideas] __len__() for map() In-Reply-To: <20181201190803.GT4319@ando.pearwood.info> References: <3e46b3e5-09e0-b53e-16f3-a1605c88df3f@thekunderts.net> <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> Message-ID: On 12/1/2018 2:08 PM, Steven D'Aprano wrote: > This proof of concept wrapper class could have been written any time > since Python 1.5 or earlier: > > class lazymap: > def __init__(self, function, sequence): One could now add at the top of the file from collections.abc import Sequence and here if not isinstance(sequence, Sequence): raise TypeError(f'{sequence} is not a sequence') > self.function = function > self.wrapped = sequence > def __len__(self): > return len(self.wrapped) > def __getitem__(self, item): > return self.function(self.wrapped[item]) For 3.x, I would add def __iter__(self): return map(self.function, self.wrapped) but your point that iteration is possible even without, with the old protocol, is well made. > It is fully iterable using the sequence protocol, even in Python 3: > > py> x = lazymap(str.upper, 'aardvark') > py> list(x) > ['A', 'A', 'R', 'D', 'V', 'A', 'R', 'K'] > > > Mapped items are computed on demand, not up front. It doesn't make a > copy of the underlying sequence, it can be iterated over and over again, > it has a length and random access. And if you want an iterator, you can > just pass it to the iter() function. > > There are probably bells and whistles that can be added (a nicer repr? > any other sequence methods? a cache?) and I haven't tested it fully. > > For backwards compatibility reasons, we can't just make map() work like > this, because that's a change in behaviour. There may be tricky corner > cases I haven't considered, but as a proof of concept I think it shows > that the basic premise is sound and worth pursuing. -- Terry Jan Reedy From tjreedy at udel.edu Tue Dec 11 13:41:32 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 11 Dec 2018 13:41:32 -0500 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201011734.GN4319@ando.pearwood.info> <20181201165320.GQ4319@ando.pearwood.info> <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> Message-ID: On 12/11/2018 6:48 AM, E. Madison Bray wrote: > The idea would be to now enhance the existing built-ins to restore at > least some previously lost assumptions, at least in the relevant > cases. To give an analogy, Python 3.0 replaced range() with > (effectively) xrange(). This broke a lot of assumptions that the > object returned by range(N) would work much like a list, A range represents an arithmetic sequence. Any usage of range that could be replaced by xrange, which is nearly all uses, made no assumption broken by xrange. The basic assumption was and is that a range/xrange could be repeatedly iterated. That this assumption was met in the first case by returning a list was somewhat of an implementation detail.
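To make the re-iteration point concrete (expected results shown in the comments):

    r = range(3)
    list(r), list(r)        # ([0, 1, 2], [0, 1, 2]) -- a range can be iterated again and again
    m = map(str, range(3))
    list(m), list(m)        # (['0', '1', '2'], []) -- a 3.x map is spent after one pass

A sequence-style wrapper such as the lazymap sketch above would behave like range here, not like map.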
In terms of mutability, a tuple would have been better, as range objects should not be mutable. (If [2,4,6] is mutated to [2,3,7], it is no longer a range (arithmetic sequence).) > and Python 3.2 restored some of that list-like functionality As I see it, xranges were unfinished as sequence objects and 3.2 finished the job. This included having the min() and max() builtins calculate the min and max efficiently, as a human would, as the first or last of the sequence, rather than uselessly iterating and comparing all the items in the sequence. A proper analogy to range would be a re-iterable mapview (or 'mapseq') like what Steven D'Aprano proposes. > ** I have a separate complaint that there's no great way, at the > Python level, to define a class that is explicitly a "sequence" as > opposed to a more general "mapping", You mean like this? >>> from collections.abc import Sequence as S >>> isinstance((), S) True >>> isinstance([], S) True >>> isinstance(range(5), S) True >>> isinstance({}, S) False >>> isinstance(set(), S) False >>> class NItems(S): def __init__(self, n, item): self.len = n self.item = item def __getitem__(self, i): # missing index check return self.item def __len__(self): return self.len >>> isinstance(NItems(2, 3), S) True -- Terry Jan Reedy From tjreedy at udel.edu Tue Dec 11 14:08:47 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 11 Dec 2018 14:08:47 -0500 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211144726.GE13061@ando.pearwood.info> Message-ID: On 12/11/2018 12:01 PM, Chris Barker - NOAA Federal via Python-ideas wrote: > Perhaps I got confused by the early part of this discussion. > > My point was that there is no "map-like" object at the Python level. > (That is, no Map abc). > > Py2's map produced a sequence. Py3's map produced an iterable. > > So any API that was expecting a sequence could accept the result of a > py2 map, but not a py3 map. There is absolutely nothing special about > map here. > > The example of range has been brought up, but I don't think it's > analogous -- py2 range returns a list, py3 range returns an immutable > sequence. Because that's as close as we can get to a sequence while > preserving the lazy evaluation that is wanted. > > I _think_ someone may be advocating that map() could return an > iterable if it is passed an iterable, I believe you mean 'iterator' rather than 'iterable' here and below as a sequence is an iterable. > and a sequence if it is passed a sequence. > Yes, it could, but that seems like a bad idea to me. > > But folks are proposing a "map" that would produce a lazy-evaluated > sequence. Sure -- as Paul said, put it up on pypi and see if folks find > it useful. > > Personally, I'm still finding it hard to imagine a use case where you > need the sequence features, but also lazy evaluation is important. > > Sure: range() has that, but it came at almost zero cost, and I'm not > sure the sequence features are used much. > > Note: the one use-case I can think of for a lazy evaluated sequence > instead of an iterable is so that I can pick a random element with > random.choice(). (Try to pick a random item from a dict), but that > doesn't apply here -- pick a random item from the source sequence > instead.
> > But this is a specific example of a general use case: you need to access > only a subset of the mapped sequence (or access it out of order) so > using the iterable version won't work, and it may be large enough that > making a new sequence is too resource intensive. > > Seems rare to me, and in many cases, you could do the subsetting > before applying the function, so I think it's a pretty rare use case. > > But go ahead and make it -- I've been wrong before :-) From greg.ewing at canterbury.ac.nz Tue Dec 11 17:31:03 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Dec 2018 11:31:03 +1300 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <20181211144726.GE13061@ando.pearwood.info> References: <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211144726.GE13061@ando.pearwood.info> Message-ID: <5C103AA7.8020909@canterbury.ac.nz> Steven D'Aprano wrote: > I suggest we provide a separate mapview() type that offers only the lazy > sequence API, without trying to be an iterator at the same time. Then we would be back to the bad old days of having two functions that do almost exactly the same thing. My suggestion was made in the interests of moving the language in the direction of having fewer warts, rather than adding more or moving the existing ones around. I acknowledge that the dual interface is itself a bit wartish, but it's purely for backwards compatibility, so it could be deprecated and eventually removed if desired. -- Greg From chris.barker at noaa.gov Tue Dec 11 18:46:20 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 11 Dec 2018 15:46:20 -0800 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <20181201172307.GS4319@ando.pearwood.info> <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211144726.GE13061@ando.pearwood.info> Message-ID: On Tue, Dec 11, 2018 at 11:10 AM Terry Reedy wrote: > > I _think_ someone may be advocating that map() could return an > > iterable if it is passed an iterable, > > I believe you mean 'iterator' rather than 'iterable' here and below as a > sequence is an iterable. > well, the iterator / iterable distinction is important in this thread in many places, so I should have been more careful about that -- but not for this reason. Yes, a sequence is an iterable, but what I meant was an "iterable-that-is-not-a-sequence". -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed...
URL: From greg.ewing at canterbury.ac.nz Tue Dec 11 18:50:41 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 12 Dec 2018 12:50:41 +1300 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <20181211162627.GF13061@ando.pearwood.info> References: <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> Message-ID: <5C104D51.7090704@canterbury.ac.nz> Steven D'Aprano wrote: > The iterator protocol is that iterators must: > > - have a __next__ method; > - have an __iter__ method which returns self; > > and the test for an iterator is: > > obj is iter(obj) By that test, it identifies as a sequence, as does testing it for the presence of __len__: >>> m is iter(m) False >>> hasattr(m, '__len__') True So, code that doesn't know whether it has a sequence or iterator and tries to find out, will conclude that it has a sequence. Presumably it will then proceed to treat it as a sequence, which will work fine. > py> x = MapView(str.upper, "abcdef") # An imposter. > py> next(x) > 'A' > py> next(x) > 'B' > py> next(iter(x)) > 'A' That's a valid point, but it can be fixed: def __iter__(self): return self.iterator or map(self.func, *self.args) Now it gives >>> next(x) 'A' >>> list(x) [] There is still one case that will behave differently from the current map(), i.e. using list() first and then expecting it to behave like an exhausted iterator. I'm finding it hard to imagine real code that would depend on that behaviour, though. > whether operations succeed or not depend on the > order that you call them: > > py> x = MapView(str.upper, "abcdef") > py> len(x)*next(x) # Safe. But only ONCE. But what sane code is going to do that? Remember, the iterator interface is only there for backwards compatibility. That would fail under both Python 2 and the current Python 3. > py> def innocent_looking_function(obj): > ... next(obj) > ... > py> x = MapView(str.upper, "abcdef") > py> len(x) > 6 > py> innocent_looking_function(x) > py> len(x) > TypeError: Mapping iterator has no len() If you're using len(), you clearly expect to have a sequence, not an iterator, so why are you calling a function that blindly expects an iterator? Again, this cannot be and could never have been working code. > I presume this is just an oversight, but indexing continues to work even > when len() has been broken. That could be fixed. > This MapView class offers a hybrid "sequence plus iterator, together at > last!" double-headed API, and even its creator says that sane code > shouldn't use that API. No. I would document it like this: It provides a sequence API. It also, *for backwards compatibility*, implements some parts of the iterator API, but new code should not rely on that, nor should any code expect to be able to use both interfaces on the same object. The backwards compatibility would not be perfect, but I think it would work in the vast majority of cases. I also envisage that the backwards compatibility provisions would not be kept forever, and that it would eventually become a pure sequence object. I'm not necessarily saying this *should* be done, just pointing out that it's a possible strategy for migrating map() from an iterator to a view, if we want to do that. 
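For reference, this is roughly the MapView class quoted earlier in the thread with that __iter__ change folded in -- still only a sketch carrying all the caveats discussed above, not a finished proposal:

    from operator import itemgetter

    class MapView:
        def __init__(self, func, *args):
            self.func = func
            self.args = args
            self.iterator = None            # only created if next() is ever used

        def __len__(self):
            return min(map(len, self.args))

        def __getitem__(self, i):
            return self.func(*list(map(itemgetter(i), self.args)))

        def __iter__(self):
            # hand back the live iterator if next() has already been used,
            # otherwise a fresh map, so iter() after next() does not restart
            return self.iterator or map(self.func, *self.args)

        def __next__(self):
            if not self.iterator:
                self.iterator = map(self.func, *self.args)
            return next(self.iterator)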
-- Greg From steve at pearwood.info Tue Dec 11 20:24:27 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 12 Dec 2018 12:24:27 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C103AA7.8020909@canterbury.ac.nz> References: <20181201190803.GT4319@ando.pearwood.info> <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211144726.GE13061@ando.pearwood.info> <5C103AA7.8020909@canterbury.ac.nz> Message-ID: <20181212012425.GH13061@ando.pearwood.info> On Wed, Dec 12, 2018 at 11:31:03AM +1300, Greg Ewing wrote: > Steven D'Aprano wrote: > >I suggest we provide a separate mapview() type that offers only the lazy > >sequence API, without trying to be an iterator at the same time. > > Then we would be back to the bad old days of having two functions > that do almost exactly the same thing. They aren't "almost exactly the same thing". One is a sequence, which is a rich API that includes random access to items and a length; the other is an iterator, which is an intentionally simple API which fails to meet the needs of some users. > My suggestion was made in > the interests of moving the language in the direction of having > less warts, rather than adding more or moving the existing ones > around. > > I acknowledge that the dual interface is itself a bit wartish, It's a "bit wartish" in the same way that the sun is "a bit warmish". > but it's purely for backwards compatibility And it fails at that too. x = map(str.upper, "abcd") x is iter(x) returns True with the current map, an actual iterator, and False with your hybrid. Current map() is a proper, non-broken iterator; your hybrid is a broken iterator. (That's not me being derogative: its the official term for iterators which don't stay exhausted.) I'd be more charitable if I thought the flaws were mere bugs that could be fixed. But I don't think there is any way to combine two incompatible interfaces, the sequence and iterator APIs, into one object without these sorts of breakages. Take the __next__ method out of your object, and it is a better version of what I proposed earlier. With the __next__ method, its just broken. -- Steve From tjreedy at udel.edu Tue Dec 11 22:36:24 2018 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 11 Dec 2018 22:36:24 -0500 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C104D51.7090704@canterbury.ac.nz> References: <5C033044.9080907@canterbury.ac.nz> <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> Message-ID: On 12/11/2018 6:50 PM, Greg Ewing wrote: > I'm not necessarily saying this *should* be done, just pointing > out that it's a possible strategy for migrating map() from > an iterator to a view, if we want to do that. Python has list and list_iterator, tuple and tuple_iterator, set and set_iterator, dict and dict_iterator, range and range_iterator. In 3.0, we could have turned map into a finite sequence analogous to range, and add a new map_iterator. To be completely lazy, such a map would have to restrict input to Sequences. To be compatible with 2.0 map, it would have to use list(iterable) to turn other finite iterables into concrete lists, making it only semi-lazy. 
Since I am too lazy to write the multi-iterable version, here is the one-iterable version to show the idea. def __init__(self, func, iterable): self.func = func self.seq = iterable if isinstance(iterable, Sequence) else list(iterable) Given the apparent little need for the extra complication, and the possibility of keeping a reference to sequences and explicitly applying list otherwise, it was decided to rebind 'map' to the fully lazy and general itertools.imap. -- Terry Jan Reedy From steve at pearwood.info Wed Dec 12 03:12:50 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 12 Dec 2018 19:12:50 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C104D51.7090704@canterbury.ac.nz> References: <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> Message-ID: <20181212081249.GI13061@ando.pearwood.info> On Wed, Dec 12, 2018 at 12:50:41PM +1300, Greg Ewing wrote: > Steven D'Aprano wrote: > >The iterator protocol is that iterators must: > > > >- have a __next__ method; > >- have an __iter__ method which returns self; > > > >and the test for an iterator is: > > > > obj is iter(obj) > > By that test, it identifies as a sequence, as does testing it > for the presence of __len__: Since existing map objects are iterators, that breaks backwards compatibility. For code that does something like this: if obj is iter(obj): process_iterator() else: n = len(obj) process_sequence() it will change behaviour, shifting map objects from the iterator branch to the sequence branch. That's a definite change in behaviour, which alone could change the meaning of the code. E.g. if the two process_* functions use different algorithms. Or it could break the code outright, because your MapView objects can raise TypeError when you call len() on them. I know that any object with a __len__ could in principle raise TypeError. But for anything else, we are justified in calling it a bug in the __len__ implementation. You're trying to sell it as a feature. > >>> m is iter(m) > False > >>> hasattr(m, '__len__') > True > > So, code that doesn't know whether it has a sequence or iterator > and tries to find out, will conclude that it has a sequence. > Presumably it will then proceed to treat it as a sequence, which > will work fine. It will work fine, unless something has called __next__, which will cause len() to blow up in their face by raising TypeError. I call these sorts of designs "landmines". They're absolutely fine, right up to the point where you hit the right combination of actions and step on the landmine. For anything else, this sort of thing would be a bug. You're calling it a feature. > >py> x = MapView(str.upper, "abcdef") # An imposter. > >py> next(x) > >'A' > >py> next(x) > >'B' > >py> next(iter(x)) > >'A' > > That's a valid point, but it can be fixed: > > def __iter__(self): > return self.iterator or map(self.func, *self.args) > > Now it gives > > >>> next(x) > 'A' > >>> list(x) > [] > > There is still one case that will behave differently from the > current map(), i.e. using list() first and then expecting it > to behave like an exhausted iterator. I'm finding it hard to > imagine real code that would depend on that behaviour, though. That's not the only breakage. This is a pattern which I sometimes use: def test(iterator): # Process items up to some symbol one way, # and items after that symbol another way.
for a in iterator: print(1, a) if a == 'C': break # This relies on iterator NOT resetting to the beginning, # but continuing from where we left off # i.e. not being broken for b in iterator: print(2, b) Being an iterator, right now I can pass map() objects directly to that code, and it works as expected: py> test(map(str.upper, 'abcde')) 1 A 1 B 1 C 2 D 2 E Your MapView does not: py> test(MapView(str.upper, 'abcde')) 1 A 1 B 1 C 2 A 2 B 2 C 2 D 2 E This is why such iterators are deemed to be "broken". > > whether operations succeed or not depend on the > >order that you call them: > > > >py> x = MapView(str.upper, "abcdef") > >py> len(x)*next(x) # Safe. But only ONCE. > > But what sane code is going to do that? You have an object that supports len() and next(). Why shouldn't people use both len() and next() on it when both are supported methods? They don't have to be in a single expression: x = MapView(blah blah blah) a = some_function_that_calls_len(x) b = some_function_that_calls_next(x) That works. But reverse the order, and you step on a landmine: b = some_function_that_calls_next(x) a = some_function_that_calls_len(x) The caller may not even know that the functions call next() or len(), they could be implementation details buried deep inside some library function they didn't even know they were calling. Do you still think that it is the caller's code that is insane? > Remember, the iterator > interface is only there for backwards compatibility. Famous last words. > That would fail under both Python 2 and the current Python 3. Honestly Greg, you've been around long enough that you ought to recognise *minimal examples* for what they are. They're not meant to be real-world production code. They're the simplest, most minimal example that demonstates the existence of a problem. The fact that they are *simple* is to make it easy to see the underlying problem, not to give you an excuse to dismiss it. You're supposed to imagine that in real-life code, the call to next() could be buried deep, deep, deep in a chain of 15 function calls in some function in some third party library that I don't even know is being called, and it took me a week to debug why len(obj) would sometimes fail mysteriously. The problem is not the caller, or even the library code, but that your class magically and implictly swaps from a sequence to a pseudo-iterator whether I want it to or not. A perfect example of why DWIM code is so hated: http://www.catb.org/jargon/html/D/DWIM.html > >py> def innocent_looking_function(obj): > >... next(obj) > >... > >py> x = MapView(str.upper, "abcdef") > >py> len(x) > >6 > >py> innocent_looking_function(x) > >py> len(x) > >TypeError: Mapping iterator has no len() > > If you're using len(), you clearly expect to have a sequence, > not an iterator, so why are you calling a function that blindly > expects an iterator? *Minimal example* again. You ought to be able to imagine the actual function is fleshed out, without expecting me to draw you a picture: if hasattr(obj, '__next__'): first = next(obj, sentinel) Or if you prefer: try: first = next(obj) except TypeError: # fall back on sequence algorithm except StopIteration: # empty iterator None of this boilerplate adds any insight at all to the discussion. There's a reason bug reports ask for minimal examples. The point is, I'm calling some innocent looking function, and it breaks my sequence: len(obj) worked before I called the function, and afterwards, it raises TypeError. 
I wouldn't have to care about the implementation if your MapView object didn't magically flip from sequence to iterator behind my back. -- Steve From chris.barker at noaa.gov Wed Dec 12 23:06:17 2018 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Wed, 12 Dec 2018 20:06:17 -0800 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <20181212081249.GI13061@ando.pearwood.info> References: <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> Message-ID: >>> and the test for an iterator is: >>> >>> obj is iter(obj) Is that a hard and fast rule? I know it?s the vast majority of cases, but I imagine you could make an object that behaved exactly like an iterator, but returned some proxy object rather that itself. Not sure why one would do that, but it should be possible. - CHB From rosuav at gmail.com Wed Dec 12 23:45:09 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 13 Dec 2018 15:45:09 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> Message-ID: On Thu, Dec 13, 2018 at 3:07 PM Chris Barker - NOAA Federal via Python-ideas wrote: > > >>> and the test for an iterator is: > >>> > >>> obj is iter(obj) > > Is that a hard and fast rule? I know it?s the vast majority of cases, > but I imagine you could make an object that behaved exactly like an > iterator, but returned some proxy object rather that itself. > > Not sure why one would do that, but it should be possible. Yes, it is. https://docs.python.org/3/library/stdtypes.html#iterator-types For an iterable, __iter__ needs to return an appropriate iterator. For an iterator, __iter__ needs to return self (which is, by definition, the "appropriate iterator"). Note also that the behaviour around StopIteration is laid out there, including that an iterator whose __next__ has raised SI but then subsequently doesn't continue to raise SI is broken. (Though it *is* legit to raise StopIteration with a value the first time, and then raise a vanilla SI subsequently. Generators do this, rather than retain the return value indefinitely.) ChrisA From steve at pearwood.info Thu Dec 13 00:11:22 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 13 Dec 2018 16:11:22 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> Message-ID: <20181213051111.GS13061@ando.pearwood.info> On Wed, Dec 12, 2018 at 08:06:17PM -0800, Chris Barker - NOAA Federal wrote: > >>> and the test for an iterator is: > >>> > >>> obj is iter(obj) > > Is that a hard and fast rule? Yes, that's the rule for the iterator protocol. Any object can have an __iter__ method which returns anything you want. (It doesn't even have to be iterable, this is Python, and if you want to shoot yourself in the foot, you can.) But to be an iterator, the rule is that obj.__iter__() must return obj itself. Otherwise we say that obj is an iterable, not an iterator. 
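A minimal way to see that rule at the interactive prompt:

    nums = [1, 2, 3]        # an iterable, but not an iterator
    it = iter(nums)
    iter(nums) is nums      # False -- a list hands out a separate iterator object
    iter(it) is it          # True  -- an iterator's __iter__ returns itself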
https://docs.python.org/3/library/stdtypes.html#iterator.__iter__ -- Steve From greg.ewing at canterbury.ac.nz Thu Dec 13 00:53:54 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 13 Dec 2018 18:53:54 +1300 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: References: <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> Message-ID: <5C11F3F2.7020106@canterbury.ac.nz> Chris Angelico wrote: > On Thu, Dec 13, 2018 at 3:07 PM Chris Barker - NOAA Federal via > Python-ideas wrote: > >>>>> obj is iter(obj) >> >>Is that a hard and fast rule? > Yes, it is. > > https://docs.python.org/3/library/stdtypes.html#iterator-types The docs aren't very clear on this point. They claim this is necessary so that the iterator can be used in a for-loop, but that's obviously not strictly true, since a proxy object could also be used. They also make no mention about whether one should be able to rely on this as a definitive test of iterator-ness. In any case, I don't claim that my MapView implements the full iterator protocol, only enough of it to pass for an iterator in most likely scenarios that assume one. -- Greg From rosuav at gmail.com Thu Dec 13 01:16:34 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 13 Dec 2018 17:16:34 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C11F3F2.7020106@canterbury.ac.nz> References: <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> <5C11F3F2.7020106@canterbury.ac.nz> Message-ID: On Thu, Dec 13, 2018 at 4:54 PM Greg Ewing wrote: > > Chris Angelico wrote: > > On Thu, Dec 13, 2018 at 3:07 PM Chris Barker - NOAA Federal via > > Python-ideas wrote: > > > >>>>> obj is iter(obj) > >> > >>Is that a hard and fast rule? > > Yes, it is. > > > > https://docs.python.org/3/library/stdtypes.html#iterator-types > > The docs aren't very clear on this point. They claim this is necessary > so that the iterator can be used in a for-loop, but that's obviously > not strictly true, since a proxy object could also be used. > iterator.__iter__() Return the iterator object itself. I do believe "the iterator object itself" means that "iterator.__iter__() is iterator" should always be true. But maybe there's some other way to return "the object itself" other than actually returning "the object itself"? ChrisA From p.f.moore at gmail.com Thu Dec 13 04:25:26 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 13 Dec 2018 09:25:26 +0000 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C11F3F2.7020106@canterbury.ac.nz> References: <5C03D85F.2040702@canterbury.ac.nz> <20181202134324.GV4319@ando.pearwood.info> <5C046217.7010805@canterbury.ac.nz> <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> <5C11F3F2.7020106@canterbury.ac.nz> Message-ID: On Thu, 13 Dec 2018 at 05:55, Greg Ewing wrote: > > Chris Angelico wrote: > > On Thu, Dec 13, 2018 at 3:07 PM Chris Barker - NOAA Federal via > > Python-ideas wrote: > > > >>>>> obj is iter(obj) > >> > >>Is that a hard and fast rule? > > Yes, it is. 
> > > > https://docs.python.org/3/library/stdtypes.html#iterator-types > > The docs aren't very clear on this point. They claim this is necessary > so that the iterator can be used in a for-loop, but that's obviously > not strictly true, since a proxy object could also be used. See also https://docs.python.org/3.7/glossary.html#term-iterator, which reiterates the point that "Iterators are required to have an __iter__() method that returns the iterator object itself". By that point, I'd say the docs are pretty clear... > They also make no mention about whether one should be able to rely > on this as a definitive test of iterator-ness. That glossary entry is linked from https://docs.python.org/3.7/library/collections.abc.html#collections.abc.Iterator, so it would be pretty hard to argue that it's not part of the "definitive test of iterator-ness". > In any case, I don't claim that my MapView implements the full > iterator protocol, only enough of it to pass for an iterator in > most likely scenarios that assume one. But not enough that it's legitimate to describe it as an "iterator". It may well be a useful class, and returning it from a map-like function may be a practical and effective thing to do, but describing it as an "iterator" does nothing apart from leading to distracting debates on how it doesn't work the same as an iterator. Better to just accept that it's *not* an iterator, and focus on whether it's useful... IMO, it sounds like it's useful, but it's not backward compatible (because it's not an iterator ;-)). Whether it's *sufficiently* useful to justify breaking backward compatibility is a different discussion (all I can say on that question is that I've never personally had a case where the current Python 3 behaviour of map is a problem). Paul From steve at pearwood.info Thu Dec 13 06:08:19 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 13 Dec 2018 22:08:19 +1100 Subject: [Python-ideas] Suggested MapView object (Re: __len__() for map()) In-Reply-To: <5C11F3F2.7020106@canterbury.ac.nz> References: <20181211162627.GF13061@ando.pearwood.info> <5C104D51.7090704@canterbury.ac.nz> <20181212081249.GI13061@ando.pearwood.info> <5C11F3F2.7020106@canterbury.ac.nz> Message-ID: <20181213110819.GT13061@ando.pearwood.info> On Thu, Dec 13, 2018 at 06:53:54PM +1300, Greg Ewing wrote: > In any case, I don't claim that my MapView implements the full > iterator protocol, only enough of it to pass for an iterator in > most likely scenarios that assume one. Whether your hybrid sequence+iterator is close enough to an iterator or not isn't the critical point here. If we really wanted to, we could break backwards compatibility, with or without a future import or a deprecation period, and simply declare that this is how map() will work in the future. Doing that, or not, becomes a question of whether the gain is worth the breakages. The critical question here is whether a builtin ought to include the landmines your hybrid class does. *By design*, your class will blow up in people's faces if they try to use the full API offered. It violates at least two expected properties: - As an iterator, it is officially "broken" because in at least two reasonable scenarios, it automatically resets after being exhausted. (Although presumably we could fix that with an "is_exhausted" flag.) 
- As a sequence, it violates the expectation that if an object is Sized (it has a __len__ method) calling len() on it should not raise TypeError; As a sequence, it is fragile and easily breakable, changing from a sequence to a (pseudo-)iterator whether the caller wants it to or not. Third-party code could easily flip the switch, leading to obscure errors. That second one is critical to your "Do What I Mean" design; the whole point of your class is for the object to automagically swap from behaving like a sequence to behaving like an iterator according to how it is used. Rather than expecting the user to make an explicit choice of which behaviour they want: - use map() to get current iterator behaviour; - use mapview() to get lazy-sequence behaviour; your class tries to do both, and then guesses what the user wants depending on how the map object happens to get used. -- Steve From jcrmatos at gmail.com Thu Dec 13 07:23:45 2018 From: jcrmatos at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Matos?=) Date: Thu, 13 Dec 2018 04:23:45 -0800 (PST) Subject: [Python-ideas] It would be great if the json module would allow and clear existing comments. Message-ID: <25ac1595-2f9c-40f8-a096-d743591154ae@googlegroups.com> Hello, Comments in JSON files are a great way to document a configuration file for example. Even JSON's Douglas Crockford agrees that it is a helpful thing and it suggests using JSMin before handing it to the JSON parser in here https://plus.google.com/+DouglasCrockfordEsq/posts/RK8qyGVaGSr So, I would like to suggest adding that feature do the json module. A simple boolean argument, clear_comments, with the default False to keep previous compatibility. With the clear_comments=True and following JSMin "rules", comments in the // form should be replaced with linefeeds and comments in the /* */ form with spaces. Best regards, JM -------------- next part -------------- An HTML attachment was scrubbed... URL: From chbailly at gmail.com Sun Dec 16 03:21:14 2018 From: chbailly at gmail.com (Christophe Bailly) Date: Sun, 16 Dec 2018 09:21:14 +0100 Subject: [Python-ideas] [asyncio] Suggestion for a major PEP Message-ID: Hello, I copy paste the main idea from an article I have written: contextual async " Imagine you have some code written for monothread. And you want to include your code in a multithread environment. Do you need to adapt all your code which is what you do when you want to migrate to async code ? The answer is no. Functionnally these constraints are not justified neither technically Do we have the tools to do this ? Yes because thanks to boost::context we can switch context between tasks. When a task suspends, it just calls a function (the event loop or reactor) to potentially switch to another task. Just like threads switch contexts? Async/Await logic has introduced a symetric relation wich introduces unnecessary contraints. We should just the same logic as thread logic. " Read the examples in the article I have developped a prototype in C++ and everything works perfectly. My opinion is that sooner or later, it will have to switch to this logic because chaining async/aswait is a huge contraints and does not make sense in my opinion. Maybe I am missing something, Feel free to give me your feedback. Regards, Chris -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From njs at pobox.com Sun Dec 16 04:15:57 2018 From: njs at pobox.com (Nathaniel Smith) Date: Sun, 16 Dec 2018 01:15:57 -0800 Subject: [Python-ideas] [asyncio] Suggestion for a major PEP In-Reply-To: References: Message-ID: If you want this style of concurrency, you don't need to write a PEP, just 'pip install gevent' :-) But unfortunately you're years too late to argue for making asyncio work this way. This was discussed extensively at the time, and the decision to use special syntax was made intentionally, and after studying existing systems like gevent that made the other choice. This section of the trio docs explain why explicit async/await syntax makes life easier for developers: https://trio.readthedocs.io/en/latest/reference-core.html#checkpoints It's also awkward but very doable to support both sync and async mode with a single code base: https://github.com/python-trio/unasync/ In fact, when doing this, the async/await syntax isn't really the hard part ? the hard part is that different libraries have very different networking APIs. E.g., the stdlib socket API and the stdlib asyncio API are totally different. -n On Sun, Dec 16, 2018 at 12:21 AM Christophe Bailly wrote: > > Hello, > > I copy paste the main idea from an article I have written: > contextual async > > " > > Imagine you have some code written for monothread. And you want to include your code in a multithread environment. Do you need to adapt all your code which is what you do when you want to migrate to async code ? The answer is no. > > Functionnally these constraints are not justified neither technically > > Do we have the tools to do this ? Yes because thanks to boost::context we can switch context between tasks. When a task suspends, it just calls a function (the event loop or reactor) to potentially switch to another task. Just like threads switch contexts? > > Async/Await logic has introduced a symetric relation wich introduces unnecessary contraints. We should just the same logic as thread logic. > > " > > Read the examples in the article I have developped a prototype in C++ and everything works perfectly. > > My opinion is that sooner or later, it will have to switch to this logic because chaining async/aswait is a huge contraints and does not make sense in my opinion. > > Maybe I am missing something, > > Feel free to give me your feedback. > > Regards, > > > Chris > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ -- Nathaniel J. Smith -- https://vorpus.org From steve at pearwood.info Sun Dec 16 04:44:34 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 16 Dec 2018 20:44:34 +1100 Subject: [Python-ideas] [asyncio] Suggestion for a major PEP In-Reply-To: References: Message-ID: <20181216094432.GZ13061@ando.pearwood.info> On Sun, Dec 16, 2018 at 09:21:14AM +0100, Christophe Bailly wrote: > Async/Await logic has introduced a symetric relation wich introduces > unnecessary contraints. We should just the same logic as thread logic. I'm not an expert on async, but whenever I hear people saying "we should use (something just like) threads" I'm reminded of something that Jamie Zawinski could have (but didn't) say: Some people, when confronted with a problem, think "I know, I'll use threads." Nothwey htwo pavroble ems. I know, that's a sound-bite, not a reasoned argument. 
But if your intention is to make async code just like threads, how do you avoid the well-known perils of threading? The point of async code is to make context switches explicit, rather than implicit like threading. So at first glance, it seems like you are suggesting we take the major benefit of async (explicitness) and replace it with the major disadvantage of other concurrency models (implicitness). -- Steve From chbailly at gmail.com Sun Dec 16 04:47:01 2018 From: chbailly at gmail.com (Christophe Bailly) Date: Sun, 16 Dec 2018 10:47:01 +0100 Subject: [Python-ideas] [asyncio] Suggestion for a major PEP In-Reply-To: References: Message-ID: Hello, Thanks for your answer. The advantage of this method is that you still follow the logic of async/await. In fact you make it even easier,. It is just a different implementation but with fewer constraints. So my suggestion is to keep this logic because it is a very good logic !!!. , But we should remove this unjusfied chaining of async/await methods. To be clear, this is an async /await logic, except you have an async on one end and an await at the other end and you remove everything in between ! I could have written my examples with async/await, I have used the future syntax but it is the same. I can rewrite my examples if you prefer, I will just use other keywords. That is my opinon, we differ on this but I think there is something really wrong when you add unjustified syntax. You suggest gevent but where do you see async await in gevent ? >From my experience, this is really a pain to mix async code with sync code with the current implementation. Of course you can create threads instead but it is better and simpler to remain async if possible I think there is a flaw in this logic, again this is my opinion, my main intention is to share ideas. I perfectly understand that it is late to implement this, but we could take also into account the real limitations that are difficult to overcome with asyncio. I think I do not need to post links, many will undertand the constraints I am talking about. Regards, Chris On Sun, 16 Dec 2018 at 10:16, Nathaniel Smith wrote: > If you want this style of concurrency, you don't need to write a PEP, > just 'pip install gevent' :-) > > But unfortunately you're years too late to argue for making asyncio > work this way. This was discussed extensively at the time, and the > decision to use special syntax was made intentionally, and after > studying existing systems like gevent that made the other choice. > > This section of the trio docs explain why explicit async/await syntax > makes life easier for developers: > https://trio.readthedocs.io/en/latest/reference-core.html#checkpoints > > It's also awkward but very doable to support both sync and async mode > with a single code base: https://github.com/python-trio/unasync/ > > In fact, when doing this, the async/await syntax isn't really the hard > part ? the hard part is that different libraries have very different > networking APIs. E.g., the stdlib socket API and the stdlib asyncio > API are totally different. > > -n > On Sun, Dec 16, 2018 at 12:21 AM Christophe Bailly > wrote: > > > > Hello, > > > > I copy paste the main idea from an article I have written: > > contextual async > > > > " > > > > Imagine you have some code written for monothread. And you want to > include your code in a multithread environment. Do you need to adapt all > your code which is what you do when you want to migrate to async code ? The > answer is no. 
> > > > Functionnally these constraints are not justified neither technically > > > > Do we have the tools to do this ? Yes because thanks to boost::context > we can switch context between tasks. When a task suspends, it just calls a > function (the event loop or reactor) to potentially switch to another task. > Just like threads switch contexts? > > > > Async/Await logic has introduced a symetric relation wich introduces > unnecessary contraints. We should just the same logic as thread logic. > > > > " > > > > Read the examples in the article I have developped a prototype in C++ > and everything works perfectly. > > > > My opinion is that sooner or later, it will have to switch to this logic > because chaining async/aswait is a huge contraints and does not make sense > in my opinion. > > > > Maybe I am missing something, > > > > Feel free to give me your feedback. > > > > Regards, > > > > > > Chris > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > -- > Nathaniel J. Smith -- https://vorpus.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg.ewing at canterbury.ac.nz Sun Dec 16 08:03:44 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 17 Dec 2018 02:03:44 +1300 Subject: [Python-ideas] [asyncio] Suggestion for a major PEP In-Reply-To: References: Message-ID: <5C164D30.9090506@canterbury.ac.nz> Christophe Bailly wrote: > I copy paste the main idea from an article I have written: > contextual async > All of your examples there are C++. It's not clear how any of this relates to Python. > Do we have the tools to do this ? Yes because thanks to boost::context > we can switch context between tasks. How does it work? Is it pulling some kind of stack switching trick? If so, I would be skeptical about the portability and reliability. Also, if it requires the Python interpreter to become C++ or depend on a C++ runtime, it's not going to be accepted. -- Greg From chbailly at gmail.com Sun Dec 16 13:18:45 2018 From: chbailly at gmail.com (Christophe Bailly) Date: Sun, 16 Dec 2018 19:18:45 +0100 Subject: [Python-ideas] Fwd: [asyncio] Suggestion for a major PEP In-Reply-To: References: Message-ID: Hello everybody, I thought I had sent this mail to everybody, so here was my answer to Nathaniel, Regards, Chris ---------- Forwarded message --------- From: Christophe Bailly Date: Sun, 16 Dec 2018 at 13:56 Subject: Re: [Python-ideas] [asyncio] Suggestion for a major PEP To: Nathaniel Smith Hello Nathaniel, After reading many papers, I understand that it is not as simple as I could imagine in my first thought. Sorry about that, I have learned something. Explicit vs implicit asynchronism is a complex topic. Regards, Chris On Sun, 16 Dec 2018 at 10:16, Nathaniel Smith wrote: > If you want this style of concurrency, you don't need to write a PEP, > just 'pip install gevent' :-) > > But unfortunately you're years too late to argue for making asyncio > work this way. This was discussed extensively at the time, and the > decision to use special syntax was made intentionally, and after > studying existing systems like gevent that made the other choice. 
> > This section of the trio docs explain why explicit async/await syntax > makes life easier for developers: > https://trio.readthedocs.io/en/latest/reference-core.html#checkpoints > > It's also awkward but very doable to support both sync and async mode > with a single code base: https://github.com/python-trio/unasync/ > > In fact, when doing this, the async/await syntax isn't really the hard > part ? the hard part is that different libraries have very different > networking APIs. E.g., the stdlib socket API and the stdlib asyncio > API are totally different. > > -n > On Sun, Dec 16, 2018 at 12:21 AM Christophe Bailly > wrote: > > > > Hello, > > > > I copy paste the main idea from an article I have written: > > contextual async > > > > " > > > > Imagine you have some code written for monothread. And you want to > include your code in a multithread environment. Do you need to adapt all > your code which is what you do when you want to migrate to async code ? The > answer is no. > > > > Functionnally these constraints are not justified neither technically > > > > Do we have the tools to do this ? Yes because thanks to boost::context > we can switch context between tasks. When a task suspends, it just calls a > function (the event loop or reactor) to potentially switch to another task. > Just like threads switch contexts? > > > > Async/Await logic has introduced a symetric relation wich introduces > unnecessary contraints. We should just the same logic as thread logic. > > > > " > > > > Read the examples in the article I have developped a prototype in C++ > and everything works perfectly. > > > > My opinion is that sooner or later, it will have to switch to this logic > because chaining async/aswait is a huge contraints and does not make sense > in my opinion. > > > > Maybe I am missing something, > > > > Feel free to give me your feedback. > > > > Regards, > > > > > > Chris > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > Code of Conduct: http://python.org/psf/codeofconduct/ > > > > -- > Nathaniel J. Smith -- https://vorpus.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From s-ball at laposte.net Tue Dec 18 04:10:51 2018 From: s-ball at laposte.net (s-ball at laposte.net) Date: Tue, 18 Dec 2018 10:10:51 +0100 (CET) Subject: [Python-ideas] Use lazy loading with hashtable in python gettext module In-Reply-To: <1652737848.3435845.1545123047394.JavaMail.zimbra@laposte.net> Message-ID: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel ( http://babel.pocoo.org/en/latest/ ), msgfmt.py from Python tools or the original msgfmt from GNU gettext. I could find that only the original msgfmt was able to generate a hashtable, and that anyway the Python gettext module loaded everything in memory and did not use it. But I also find a TODO note saying # TODO: # - Lazy loading of .mo files. Currently the entire catalog is loaded into # memory, but that's probably bad for large translated programs. Instead, # the lexical sort of original strings in GNU .mo files should be exploited # to do binary searches and lazy initializations. 
Or you might want to use # the undocumented double-hash algorithm for .mo files with hash tables, but # you'll need to study the GNU gettext code to do this. I have studied GNU gettext code and found that implemententing the hashing algorithm in Python would not be that hard. The undocumented features required for implementation are: - the version number can safely stay to 0 when processing Python code - the size of the hash table is the first odd prime greater than or equal to 4 * n / 3 where n is the number of strings - the first hashing function uses a variant of PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~ high is replaced with h = h ^ high, and using 32 bits integers. The index in the table in the result of the function modulus the size of the hash table - when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used: - let h be the result of the PJW variant hash function and size be the size of the hash table, an increment value is set to 1 +( h % (size -2)) - that increment is repeatedly added to the index in the hash table (modulus the table size) until an empty slot is found (or the correct original string is found) For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and use the hashtable to access the strings. Remaining problems: - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it? - the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better - I should either rely on the current way (load everything in memory) or implement a binary search algo for the case where the hash table is not present (it is of course optional) - it would be an important change, and I think that options should be allow to choose between an eager or lazy access Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end. -------------- next part -------------- An HTML attachment was scrubbed... URL: From phd at phdru.name Tue Dec 18 08:14:11 2018 From: phd at phdru.name (Oleg Broytman) Date: Tue, 18 Dec 2018 14:14:11 +0100 Subject: [Python-ideas] Use lazy loading with hashtable in python gettext module In-Reply-To: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> References: <1652737848.3435845.1545123047394.JavaMail.zimbra@laposte.net> <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> Message-ID: <20181218131411.a2jmjhef7a3rpqs6@phdru.name> Hi! On Tue, Dec 18, 2018 at 10:10:51AM +0100, Serge Ballesta via Python-ideas wrote: > In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel ( http://babel.pocoo.org/en/latest/ ), msgfmt.py from Python tools or the original msgfmt from GNU gettext. I use gettext quite extensively. I use Python's msgfmt to generate .mo files. I also use Django's compilemessage; I don't know what it uses internally, it could be an independent implementation or Python's msgfmt. 
> I could find that only the original msgfmt was able to generate a hashtable, and that anyway the Python gettext module loaded everything in memory and did not use it. But I also find a TODO note saying > > # TODO: > # - Lazy loading of .mo files. Currently the entire catalog is loaded into > # memory, but that's probably bad for large translated programs. Instead, > # the lexical sort of original strings in GNU .mo files should be exploited > # to do binary searches and lazy initializations. Or you might want to use > # the undocumented double-hash algorithm for .mo files with hash tables, but > # you'll need to study the GNU gettext code to do this. > > I have studied GNU gettext code and found that implemententing the hashing algorithm in Python would not be that hard. That's interesting! > The undocumented features required for implementation are: > - the version number can safely stay to 0 when processing Python code > - the size of the hash table is the first odd prime greater than or equal to 4 * n / 3 where n is the number of strings > - the first hashing function uses a variant of PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~ high is replaced with h = h ^ high, and using 32 bits integers. The index in the table in the result of the function modulus the size of the hash table > - when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used: > - let h be the result of the PJW variant hash function and size be the size of the hash table, an increment value is set to 1 +( h % (size -2)) > - that increment is repeatedly added to the index in the hash table (modulus the table size) until an empty slot is found (or the correct original string is found) > > For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and use the hashtable to access the strings. > > Remaining problems: > - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it? You should ask a lawyer and I am not. But my understanding is that you can borrow ideas from a GPL-protected code without contaminating your code with GPL. You cannot copy code -- that makes your code GPL'd. > - the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better There is the third option -- open and close the file. I'd prefer the option as file descriptors are precious resources limited in supply. There is a twist though. The file could be replaced while closed so you have to find a way to verify the was replaced and reread the has table from it. Perhaps checking timestamp of the file (date/time of the last modification) is enough. > - I should either rely on the current way (load everything in memory) or implement a binary search algo for the case where the hash table is not present (it is of course optional) > - it would be an important change, and I think that options should be allow to choose between an eager or lazy access > > Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end. Well, I mus admit my .po/.mo aren't that big. The biggest .po is 60k, its corresponding .mo is only 30k bytes. 
I don't know if using the hash table gives me improvement. Oleg. -- Oleg Broytman https://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From s-ball at laposte.net Tue Dec 18 13:20:23 2018 From: s-ball at laposte.net (Serge Ballesta) Date: Tue, 18 Dec 2018 19:20:23 +0100 Subject: [Python-ideas] Use lazy loading with hashtable in python gettext module In-Reply-To: <20181218131411.a2jmjhef7a3rpqs6@phdru.name> References: <1652737848.3435845.1545123047394.JavaMail.zimbra@laposte.net> <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> <20181218131411.a2jmjhef7a3rpqs6@phdru.name> Message-ID: <09c7c055-2842-d9d3-9b74-e17151a9e17e@laposte.net> Hi! >... > I use gettext quite extensively. I use Python's msgfmt to generate > .mo files. I also use Django's compilemessage; I don't know what it uses > internally, it could be an independent implementation or Python's msgfmt. > Never used Django's implementation and I do not know its features. I'll try to have a look to have a more exhaustive context. >> ... >> Remaining problems: >> - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it? > > You should ask a lawyer and I am not. But my understanding is that > you can borrow ideas from a GPL-protected code without contaminating > your code with GPL. You cannot copy code -- that makes your code GPL'd > That is one of the reasons I have described here what I have done. I believe that it is correct, but I would be glad to have more experienced people's advice. >> - the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better > > There is the third option -- open and close the file. I'd prefer the > option as file descriptors are precious resources limited in supply. > There is a twist though. The file could be replaced while closed so > you have to find a way to verify the was replaced and reread the has > table from it. Perhaps checking timestamp of the file (date/time of the > last modification) is enough. > Yeah, the problem is there: file descriptors are a scarce resource, but opening a file is a costly operation. Here again that's why I considere an option to let the library users choose according to their own use case. Serge From barry at barrys-emacs.org Tue Dec 18 17:09:30 2018 From: barry at barrys-emacs.org (Barry Scott) Date: Tue, 18 Dec 2018 22:09:30 +0000 Subject: [Python-ideas] Use lazy loading with hashtable in python gettext module In-Reply-To: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> References: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> Message-ID: > On 18 Dec 2018, at 09:10, Serge Ballesta via Python-ideas wrote: > > In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel (http://babel.pocoo.org/en/latest/ ), msgfmt.py from Python tools or the original msgfmt from GNU gettext. snip > Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end I think about it this way. 
Based on the largest project I have worked on that was internationalised into 14 languages the British English text translated to American English (en-US) created a 350KiB MO file. The largest mo file was for Thai (th-TH) at 680KiB. Is it worth the complexity of the hash code to save that memory? Will the hash code improve the load time? We never noticed the load time and we reloaded the MO on ever web page access. As for FDs it uses 1 and on my linux system I have 1.6M to play with. Barry -------------- next part -------------- An HTML attachment was scrubbed... URL: From s-ball at laposte.net Tue Dec 18 17:58:56 2018 From: s-ball at laposte.net (Serge Ballesta) Date: Tue, 18 Dec 2018 23:58:56 +0100 Subject: [Python-ideas] Use lazy loading with hashtable in python gettext module In-Reply-To: References: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> Message-ID: <5a07f8ce-b7d5-07d8-ad30-25d1665436a9@laposte.net> Le 18/12/2018 ? 23:09, Barry Scott a ?crit?: > > >> On 18 Dec 2018, at 09:10, Serge Ballesta via Python-ideas >> > wrote: >> >> In a project of mine, I have used the gettext module from Python >> Standard Library. I have found that several tools could be used to >> generate the Machine Object (mo) file from the source Portable Object >> (one): pybabel (http://babel.pocoo.org/en/latest/), msgfmt.py from >> Python tools or the original msgfmt from GNU gettext. > > snip > >> Before going further, I would like to know whether implementing lazy >> access through the hash table that way seems to be a interesting >> improvement or a dead end > > I think about it this way. > > Based on the largest project I have worked on that was internationalised > into > 14 languages the British English text translated to American English > (en-US) created a 350KiB MO file. > > The largest mo file was for Thai (th-TH) at 680KiB. > > Is it worth the complexity of the hash code to save that memory? > The hash code is not that complex. The main problem was that it is not documented except in the source code. > Will the hash code improve the load time? > We never noticed the load time and we reloaded the MO on ever web page > access. > > As for FDs it uses 1 and on my linux system I have 1.6M to play with. > > Barry > What make me think that it deserves a try is that it is the way it is implemented in original GNU gettext, and that a TODO note said it should be considered. But the documentation also explains that the hash table is optional... Serge From s-ball at laposte.net Sun Dec 23 12:06:48 2018 From: s-ball at laposte.net (Serge Ballesta) Date: Sun, 23 Dec 2018 18:06:48 +0100 Subject: [Python-ideas] Use lazy loading with hashtable in python gettext module In-Reply-To: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> References: <650517947.3648989.1545124251040.JavaMail.zimbra@laposte.net> Message-ID: <875b596b-4e81-d163-eda0-493ab16de220@laposte.net> Hi all, The feed back on my initial mail convinced me that it was important to allow the current behaviour of eagerly loading the whole catalog, and that keeping the files opened should also be optional. All that lead to this proposal: Features: ======== The gettext module should be allowed to load lazily the catalogs from mo file. This lazy load should be optional and make use of the hash tables from mo files when they are present or revert to a binary search. The translation strings should be cached for better performances. 
API changes: ============ 3 functions from the gettext module will have 2 new optional parameter named caching, and keepopen: gettext.bindtextdomain(domain, localedir=None) would become gettext.bindtextdomain(domain, localedir=None, caching=None, keepopen=False) gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None) would become gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None, caching=None, keepopen=False) gettext.install(domain, localedir=None, codeset=None, names=None) would become gettext.install(domain, localedir=None, codeset=None, names=None, caching=None, keepopen=False) The new caching parameter could receive the following values: caching=None: revert to the previour eager loading of the full catalog. It will be the default to allow previous application to see no change caching=1: lazy loading with unlimited cache caching=n where n is a positive (>=0) integer value: lazy loading with a LRU cache limited to n strings The keepopen parameter would be a boolean: keepopen=False (default): the mo file is only opened before loading a translation string and closed immediately after - it is also opened once when the GNUTranslation class is initialized to load the file description keepopen=True: the mo file is kept open during the lifetime of the GNUTranslation object. This parameter is ignored and not used if caching is None Implementation: ============== The current GNUTranslation class loads the content of the mo file to build a dictionnary where the original strings are the keys and the translated keys the values. Plural forms use a special processing: the key is a 2 tuple (singular original string, order), and the value is the corresponding translated string - order=0 is normally for the singular translated string. The proposed implementation would simply replace this dictionary with a special mapping subclass when caching is not None. That subclass would use same keys as the original directory and would: - first search in its cache - if not found in cache and if the hashtable has not a zero size search the original string by hash - if not found in cache and if the hashtable has a zero size, search the original string with a binary search algorithm. - if a string is found, it should feed the LRU cache, eventually throwing away the oldest entry (entries) That should allow to implement the new feature with minimal refactoring for the gettext module. Le 18/12/2018 ? 10:10, Serge Ballesta via Python-ideas a ?crit?: > In a project of mine, I have used the gettext module from Python > Standard Library. I have found that several tools could be used to > generate the Machine Object (mo) file from the source Portable Object > (one): pybabel (http://babel.pocoo.org/en/latest/), msgfmt.py from > Python tools or the original msgfmt from GNU gettext. > > I could find that only the original msgfmt was able to generate a > hashtable, and that anyway the Python gettext module loaded everything > in memory and did not use it. But I also find a TODO note saying > > # TODO: > # - Lazy loading of .mo files.? Currently the entire catalog is loaded into > #?? memory, but that's probably bad for large translated programs.? Instead, > #?? the lexical sort of original strings in GNU .mo files should be > exploited > #?? to do binary searches and lazy initializations.? Or you might want > to use > #?? the undocumented double-hash algorithm for .mo files with hash > tables, but > #?? 
you'll need to study the GNU gettext code to do this. > > I have studied GNU gettext code and found that implemententing the > hashing algorithm in Python would not be that hard. > > The undocumented features required for implementation are: > - the version number can safely stay to 0 when processing Python code > - the size of the hash table is the first odd prime greater than or > equal to 4 * n / 3 where n is the number of strings > - the first hashing function uses a variant of PJW hash function > described in https://en.wikipedia.org/wiki/PJW_hash_function, where the > line h = h & ~high is replaced with h = h ^ high, and using 32 bits > integers. The index in the table in the result of the function modulus > the size of the hash table > - when there is a conflict (the slot given by the first hashing function > is already used by another string) the following is used: > ? - let h be the result of the PJW variant hash function and size be > the size of the hash table, an increment value is set to 1 +( h % (size -2)) > ? - that increment is repeatedly added to the index in the hash table > (modulus the table size) until an empty slot is found (or the correct > original string is found) > > For now, my (alpha) code is able to generate in pure Python the same mo > file that GNU msgfmt generates, and use the hashtable to access the strings. > > Remaining problems: > - I had to read GPL copyrighted code to find the undocumented features. > I have of course wrote my own code from scratch, but may I use an Apache > Free License 2.1 on it? > - the current code for gettext loads everything from the mo file and > immediately closes it. My own code keeps the file opened to be able to > access it with the mmap module. There could be use case where first > option is better > - I should either rely on the current way (load everything in memory) or > implement a binary search algo for the case where the hash table is not > present (it is of course optional) > - it would be an important change, and I think that options should be > allow to choose between an eager or lazy access > > Before going further, I would like to know whether implementing lazy > access through the hash table that way seems to be a interesting > improvement or a dead end. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > From phylimo at 163.com Mon Dec 24 05:21:31 2018 From: phylimo at 163.com (=?UTF-8?B?5p2O6buY?=) Date: Mon, 24 Dec 2018 18:21:31 +0800 (GMT+08:00) Subject: [Python-ideas] About the passing the function arguments in Keyword form. Message-ID: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> I am having an idea on loosing the argument validity check when passing the function arguments in keyword way. For example: ------------------------------- deff(x, y): print(x, y) defcall_f(): f(x=7, y=9, z=9) call_f() ------------------------------ In the current of python, the extra pass of 'z' would let the interpreter raise an exception and stop work. My idea is that the interpreter need not stop because all the needed args are completely provided. Of course for this toy example, 'f' can be define as f(x, y, **kwargs) to achieve the same goal. However, essentially it is reasonably to keep interpreter going as long as enough args are passed. And this modification can bring more freedom of programming. 
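(The example above lost its formatting in transit; spelled out, it is the following.)

def f(x, y):
    print(x, y)

def call_f():
    f(x=7, y=9, z=9)

call_f()
# current behaviour: TypeError: f() got an unexpected keyword argument 'z'
# under this idea the call would instead print "7 9" and ignore z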
Think about the following situations: situation 1) there are many 'f's written by other people, and their args are very similar and your job is to run each of them to get some results. --------------------- ##########code by others: def f0(): ... def f1(x): ... def f2(x, y): ... def f3(x, y, z): ... #if passing extra args are valid, you can run all the functions in the following way, which is very compact and easy to read. def test_universal_call(): funcs = [f0, f1, f2, f3] args = {'x':1, 'y':5, 'z':8} for f in funcs: f(**args) ------------------ situation 2) there are several steps for make one product, each step is in an individual function and needs different args. ------------------ def make_oil(oil): ... def make_water( water): ... def make_powder(powder): ... ## if passing extra args are valid, you can run all the functions in the following way, which is very compact and easy to read. def dish(): procedures = [make_oil, make_water, make_powder] args = {'oil' : 1, 'water': 10, 'powder': 4} for f in procedures: f(**args) --------------- This idea is different from **kwargs. **kwargs are used when user wants to record all the keywords passed. This idea is that even if the user doesn?t want to record the arguments, that extra pass of keyword arguments wont?t cause an exception. Sorry for bothering you guys if this is a stupid idea. Happy to hear your suggestions. Li Mo -------------- next part -------------- An HTML attachment was scrubbed... URL: From dwarwick96 at gmail.com Mon Dec 24 06:11:01 2018 From: dwarwick96 at gmail.com (Drew Warwick) Date: Mon, 24 Dec 2018 06:11:01 -0500 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: The struct unpack API is inconvenient to use with files. I must do: struct.unpack(fmt, file.read(struct.calcsize(fmt)) every time I want to read a struct from the file. I ended up having to create a utility function for this due to how frequently I was using struct.unpack with files: def unpackStruct(fmt, frm): if isinstance(frm, io.IOBase): return struct.unpack(fmt, frm.read(struct.calcsize(fmt))) else: return struct.unpack(fmt, frm) This seems like something that should be built into the default implementation -- struct.unpack already has all the information it needs with just the struct format and open binary file. Current behavior is an error since struct.unpack only supports bytes-like objects, so this should be backwards compatible except in the case where a developer is relying on that to error in a try block instead of verifying the buffer type beforehand. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Mon Dec 24 06:24:32 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 24 Dec 2018 22:24:32 +1100 Subject: [Python-ideas] About the passing the function arguments in Keyword form. In-Reply-To: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> References: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> Message-ID: <20181224112431.GO13061@ando.pearwood.info> On Mon, Dec 24, 2018 at 06:21:31PM +0800, ?? wrote: > I am having an idea on loosing the argument validity check when passing the function arguments in keyword way. > For example: > ------------------------------- > deff(x, y): > print(x, y) > defcall_f(): > f(x=7, y=9, z=9) > > > call_f() > ------------------------------ > In the current of python, the extra pass of 'z' would let the > interpreter raise an exception and stop work. Correct. 
As the Zen of Python says: Errors should never pass silently. Passing an unexpected argument "z" is an error, regardless of whether you pass it by keyword or as a positional argument. It should raise an exception. Don't think about toy examples like your f above with single character names. Think about code with proper names: def download(url, output_file=None, overwrite=True): if output_file is None: output_file = generate_filename(url) ... # Oops, a typo, which silently deletes data. download(url, override=False) > My idea is that the > interpreter need not stop because all the needed args are completely > provided. Of course for this toy example, 'f' can be define as f(x, > y, **kwargs) to achieve the same goal. However, essentially it is > reasonably to keep interpreter going as long as enough args are > passed. I don't agree that it is reasonable. To quote Chris Smith: "I find it amusing when novice programmers believe their main job is preventing programs from crashing. ... More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn?t crash is a horrible nightmare." Functions which silently ignore unexpected arguments instead of telling us that we have made a mistake ("got an unexpected keyword argument z") just *hides* the error, instead of reporting it so we can fix it. > And this modification can bring more freedom of programming. Freedom to have more hard to diagnose bugs in our code. -- Steve From boxed at killingar.net Mon Dec 24 07:05:17 2018 From: boxed at killingar.net (=?utf-8?Q?Anders_Hovm=C3=B6ller?=) Date: Mon, 24 Dec 2018 13:05:17 +0100 Subject: [Python-ideas] About the passing the function arguments in Keyword form. In-Reply-To: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> References: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> Message-ID: <8805015A-3A0D-49B3-BA99-2C5145F83B79@killingar.net> > On 24 Dec 2018, at 11:21, ?? wrote: > > I am having an idea on loosing the argument validity check when passing the function arguments in keyword way. > For example: > ------------------------------- > def f(x, y): > print(x, y) > def call_f(): > f(x=7, y=9, z=9) > > call_f() > ------------------------------ > In the current of python, the extra pass of 'z' would let the interpreter raise an exception and stop work. My idea is that the interpreter need not stop because all the needed args are completely provided. Of course for this toy example, 'f' can be define as f(x, y, **kwargs) to achieve the same goal. However, essentially it is reasonably to keep interpreter going as long as enough args are passed. And this modification can bring more freedom of programming. Similar features exists in JavaScript (where you can also do the same thing with positional arguments), and Clojure to make two. I personally think this is extremely bad. This type of behavior can make error in your code slip by undetected for a very long time. Let's take a concrete example! We have a function: def foo(*, a, b=3, c): .... People call it like so: foo(a=7, b=1, c=11) Now what happens if we rename argument b to q? The above code still runs! It just now passes 3 (the default value) to foo instead of the intended 1. I hope this example is enough to convince you of the danger of such a feature. It's certainly the reason why I think JavaScript and Clojure are terrible when it comes to passing arguments :) Best regards Anders -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrew.svetlov at gmail.com Mon Dec 24 08:01:07 2018 From: andrew.svetlov at gmail.com (Andrew Svetlov) Date: Mon, 24 Dec 2018 15:01:07 +0200 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: Handling files overcomplicates both implementation and mental space for API saving. Files can be opened in text mode, what to do in this case? What exception should be raised? How to handle OS errors? On Mon, Dec 24, 2018 at 1:11 PM Drew Warwick wrote: > The struct unpack API is inconvenient to use with files. I must do: > > struct.unpack(fmt, file.read(struct.calcsize(fmt)) > > every time I want to read a struct from the file. I ended up having to > create a utility function for this due to how frequently I was using > struct.unpack with files: > > def unpackStruct(fmt, frm): > if isinstance(frm, io.IOBase): > return struct.unpack(fmt, frm.read(struct.calcsize(fmt))) > else: > return struct.unpack(fmt, frm) > > This seems like something that should be built into the default > implementation -- struct.unpack already has all the information it needs > with just the struct format and open binary file. Current behavior is an > error since struct.unpack only supports bytes-like objects, so this should > be backwards compatible except in the case where a developer is relying on > that to error in a try block instead of verifying the buffer type > beforehand. > >> _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Thanks, Andrew Svetlov -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Mon Dec 24 08:33:14 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 25 Dec 2018 00:33:14 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: <20181224133313.GP13061@ando.pearwood.info> On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote: > Handling files overcomplicates both implementation and mental space for API > saving. Perhaps. Although the implementation doesn't seem that complicated, and the mental space for the API not that much more difficult: unpack from bytes, or read from a file; versus unpack from bytes, which you might read from a file Seems about the same to me, except that with the proposal you don't have to calculate the size of the struct before reading. I haven't thought about this very deeply, but at first glance, I like Drew's idea of being able to just pass an open file to unpack and have it read from the file. > Files can be opened in text mode, what to do in this case? What > exception should be raised? That is easy to answer: the same exception you get if you pass text to unpack() when it is expecting bytes: py> struct.unpack(fmt, "a") Traceback (most recent call last): File "", line 1, in TypeError: a bytes-like object is required, not 'str' There should be no difference whether the text comes from a literal, a variable, or is read from a file. > How to handle OS errors? unpack() shouldn't try to handle them. If an OS error occurs, raise an exception, exactly the same way file.read() would raise an exception. 
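Roughly this, as a sketch of those semantics only (the helper name and the error messages are invented here, it is not a concrete implementation proposal):

import struct

def unpack_from_file(fmt, f):
    size = struct.calcsize(fmt)
    data = f.read(size)     # any OSError simply propagates, as it would for f.read()
    if not isinstance(data, bytes):
        # e.g. a file opened in text mode: same complaint as unpacking a str
        raise TypeError('a bytes-like object is required, not %r'
                        % type(data).__name__)
    if len(data) != size:
        raise struct.error('expected %d bytes, got %d' % (size, len(data)))
    return struct.unpack(fmt, data)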
-- Steve From 2QdxY4RzWzUUiLuE at potatochowder.com Mon Dec 24 09:17:20 2018 From: 2QdxY4RzWzUUiLuE at potatochowder.com (Dan Sommers) Date: Mon, 24 Dec 2018 08:17:20 -0600 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181224133313.GP13061@ando.pearwood.info> References: <20181224133313.GP13061@ando.pearwood.info> Message-ID: <85edba54-4995-e20f-ba85-a8d2ac6b1883@potatochowder.com> On 12/24/18 7:33 AM, Steven D'Aprano wrote: > On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote: >> Handling files overcomplicates both implementation and mental space >> for API saving. > I haven't thought about this very deeply, but at first glance, I like > Drew's idea of being able to just pass an open file to unpack and have > it read from the file. The json module has load for files, and loads for bytes and strings, That said, JSON is usually read and decoded all at once, but I can see lots of use cases for ingesting "unpackable" data in little chunks. Similarly (but not really), print takes an optional destination that overrides the default destination of stdout. Ironically, StringIO adapts strings so that they can be used in places that expect open files. What about something like gzip.GzipFile (call it struct.StructFile?), which is basically a specialized file-like class that packs data on writes and unpacks data on reads? Dan From jheiv at jheiv.com Mon Dec 24 10:19:35 2018 From: jheiv at jheiv.com (James Edwards) Date: Mon, 24 Dec 2018 10:19:35 -0500 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <85edba54-4995-e20f-ba85-a8d2ac6b1883@potatochowder.com> References: <20181224133313.GP13061@ando.pearwood.info> <85edba54-4995-e20f-ba85-a8d2ac6b1883@potatochowder.com> Message-ID: Here's a snippet of semi-production code we use: def read_and_unpack(handle, fmt): size = struct.calcsize(fmt) data = handle.read(size) if len(data) < size: return None return struct.unpack(fmt, data) which was originally something like: def read_and_unpack(handle, fmt, offset=None): if offset is not None: handle.seek(*offset) size = struct.calcsize(fmt) data = handle.read(size) if len(data) < size: return None return struct.unpack(fmt, data) until we pulled file seeking up out of the function. Having struct.unpack and struct.unpack_from support files would seem straightforward and be a nice quality of life change, imo. On Mon, Dec 24, 2018 at 9:36 AM Dan Sommers < 2QdxY4RzWzUUiLuE at potatochowder.com> wrote: > On 12/24/18 7:33 AM, Steven D'Aprano wrote: > > On Mon, Dec 24, 2018 at 03:01:07PM +0200, Andrew Svetlov wrote: > > >> Handling files overcomplicates both implementation and mental space > >> for API saving. > > > I haven't thought about this very deeply, but at first glance, I like > > Drew's idea of being able to just pass an open file to unpack and have > > it read from the file. > > The json module has load for files, and loads for bytes and strings, > That said, JSON is usually read and decoded all at once, but I can see > lots of use cases for ingesting "unpackable" data in little chunks. > > Similarly (but not really), print takes an optional destination that > overrides the default destination of stdout. > > Ironically, StringIO adapts strings so that they can be used in places > that expect open files. > > What about something like gzip.GzipFile (call it struct.StructFile?), > which is basically a specialized file-like class that packs data on > writes and unpacks data on reads? 
> > Dan > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Mon Dec 24 10:36:07 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 24 Dec 2018 15:36:07 +0000 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181224133313.GP13061@ando.pearwood.info> References: <20181224133313.GP13061@ando.pearwood.info> Message-ID: On Mon, 24 Dec 2018 at 13:39, Steven D'Aprano wrote: > > > Files can be opened in text mode, what to do in this case? What > > exception should be raised? > > That is easy to answer: the same exception you get if you pass text to > unpack() when it is expecting bytes: > > py> struct.unpack(fmt, "a") > Traceback (most recent call last): > File "", line 1, in > TypeError: a bytes-like object is required, not 'str' > > There should be no difference whether the text comes from a literal, a > variable, or is read from a file. One difference is that with a file, it's (as far as I can see) impossible to determine whether or not you're going to get bytes or text without reading some data (and so potentially affecting the state of the file object). This might be considered irrelevant (personally, I don't see a problem with a function definition that says "parameter fd must be an object that has a read(length) method that returns bytes" - that's basically what duck typing is all about) but it *is* a distinguishing feature of files over in-memory data. There is also the fact that read() is only defined to return *at most* the requested number of bytes. Non-blocking reads and objects like pipes that can return additional data over time add extra complexity. Again, not insoluble, and potentially simple enough to handle with "read N bytes, if you got something other than bytes or fewer than N of them, raise an error", but still enough that the special cases start to accumulate. The suggestion is a nice convenience method, and probably a useful addition for the majority of cases where it would do exactly what was needed, but still not completely trivial to actually implement and document (if I were doing it, I'd go with the naive approach, and just raise a ValueError when read(N) returns anything other than N bytes, for what it's worth). Paul From eric at trueblade.com Mon Dec 24 11:30:59 2018 From: eric at trueblade.com (Eric V. Smith) Date: Mon, 24 Dec 2018 11:30:59 -0500 Subject: [Python-ideas] About the passing the function arguments in Keyword form. In-Reply-To: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> References: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> Message-ID: <227e4e6f-5c39-fad9-654d-89da3ae411f7@trueblade.com> On 12/24/2018 5:21 AM, ?? wrote: > I am having an idea on loosing the argument validity check when passing > the function arguments in keyword way. > For example: > ------------------------------- > deff(x, y): > > print(x, y) def call_f(): f(x=7, y=9, z=9) > > call_f() > > ------------------------------ > > In the current of python, the extra pass of 'z' would let the > interpreter raise an exception and stop work. My idea is that the > interpreter need not stop because all the needed args are completely > provided. Of course for this toy example, 'f' can be define as f(x, y, > **kwargs) to achieve the same goal. 
> However, essentially it is reasonable to keep the interpreter going as long as enough args are passed. And this modification can bring more freedom of programming.
>
> Think about the following situations:
>
> situation 1) there are many 'f's written by other people, and their args are very similar and your job is to run each of them to get some results.
>
> ---------------------
>
> ##########code by others:
>
> def f0():
>     ...
> def f1(x):
>     ...
> def f2(x, y):
>     ...
> def f3(x, y, z):
>     ...
>
> #if passing extra args are valid, you can run all the functions in the following way, which is very compact and easy to read.
>
> def test_universal_call():
>     funcs = [f0, f1, f2, f3]
>     args = {'x':1, 'y':5, 'z':8}
>     for f in funcs:
>         f(**args)
>
> ------------------
>
> situation 2) there are several steps to make one product; each step is in an individual function and needs different args.
>
> ------------------
>
> def make_oil(oil):
>     ...
>
> def make_water(water):
>     ...
>
> def make_powder(powder):
>     ...
>
> ## if passing extra args are valid, you can run all the functions in the following way, which is very compact and easy to read.
>
> def dish():
>     procedures = [make_oil, make_water, make_powder]
>     args = {'oil': 1, 'water': 10, 'powder': 4}
>     for f in procedures:
>         f(**args)
>
> ---------------
>
> This idea is different from **kwargs. **kwargs are used when the user wants to record all the keywords passed. This idea is that even if the user doesn't want to record the arguments, that extra passing of keyword arguments won't cause an exception.

I agree with other posters that we definitely do not want this as the default behavior in Python. However, it's also sometimes a useful pattern. I use it when I have a large plugin architecture that can take dozens or hundreds of possible parameters, but any given plugin is likely to only use a few parameters. I've written calllib (https://pypi.org/project/calllib/) to support this. It might achieve your goals. This code:

-------------------
from calllib import apply

def f0():
    print('f0')

def f1(x):
    print(f'f1 {x!r}')

def f2(x, y):
    print(f'f2 {x!r} {y!r}')

def f3(x, y, z):
    print(f'f3 {x!r} {y!r} {z!r}')

def test_universal_call():
    funcs = [f0, f1, f2, f3]
    args = {'x':1, 'y':5, 'z':8}
    for f in funcs:
        apply(f, args)

test_universal_call()
-------------------

produces:

f0
f1 1
f2 1 5
f3 1 5 8

Eric From boxed at killingar.net Mon Dec 24 13:07:46 2018 From: boxed at killingar.net (=?utf-8?Q?Anders_Hovm=C3=B6ller?=) Date: Mon, 24 Dec 2018 19:07:46 +0100 Subject: [Python-ideas] About the passing the function arguments in Keyword form. In-Reply-To: <227e4e6f-5c39-fad9-654d-89da3ae411f7@trueblade.com> References: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> <227e4e6f-5c39-fad9-654d-89da3ae411f7@trueblade.com> Message-ID: <85301CD0-BB63-407D-A8FF-39413D0E9FB5@killingar.net> > I agree with other posters that we definitely do not want this as the default behavior in Python. However, it's also sometimes a useful pattern. I use it when I have a large plugin architecture that can take dozens or hundreds of possible parameters, but any given plugin is likely to only use a few parameters. I've written calllib (https://pypi.org/project/calllib/) to support this. It might achieve your goals. We do the same for various libs (tri.table for example) and our solution is just to say that you need to include **_ in your arguments for such functions. Simpler and more obvious than a simple DI system imo.
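For anyone curious what that pattern looks like in practice, filtering the argument dict against each callee's signature is enough. The sketch below shows only the general idea (it is not how calllib or tri.table are actually implemented, and the helper name is made up):

    import inspect

    def call_with_accepted_args(func, args):
        """Call func with only the keyword arguments it can accept."""
        params = inspect.signature(func).parameters
        # If func already takes **kwargs, pass everything through.
        if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
            return func(**args)
        return func(**{k: v for k, v in args.items() if k in params})

    def f2(x, y):
        return x + y

    print(call_with_accepted_args(f2, {'x': 1, 'y': 5, 'z': 8}))   # 6

The point of doing this in a helper or decorator, rather than in the language, is that the normal "unexpected keyword argument" TypeError stays intact for every other call site.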
/ Anders From steve at pearwood.info Mon Dec 24 16:17:33 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 25 Dec 2018 08:17:33 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> Message-ID: <20181224211733.GQ13061@ando.pearwood.info> On Mon, Dec 24, 2018 at 03:36:07PM +0000, Paul Moore wrote: > > There should be no difference whether the text comes from a literal, a > > variable, or is read from a file. > > One difference is that with a file, it's (as far as I can see) > impossible to determine whether or not you're going to get bytes or > text without reading some data (and so potentially affecting the state > of the file object). Here are two ways: look at the type of the file object, or look at the mode of the file object: py> f = open('/tmp/spam.binary', 'wb') py> g = open('/tmp/spam.text', 'w') py> type(f), type(g) (, ) py> f.mode, g.mode ('wb', 'w') > This might be considered irrelevant Indeed :-) > (personally, > I don't see a problem with a function definition that says "parameter > fd must be an object that has a read(length) method that returns > bytes" - that's basically what duck typing is all about) but it *is* a > distinguishing feature of files over in-memory data. But it's not a distinguishing feature between the proposal, and writing: unpack(fmt, f.read(size)) which will also read from the file and affect the file state before failing. So its a difference that makes no difference. > There is also the fact that read() is only defined to return *at most* > the requested number of bytes. Non-blocking reads and objects like > pipes that can return additional data over time add extra complexity. How do they add extra complexity? According to the proposal, unpack() attempts the read. If it returns the correct number of bytes, the unpacking succeeds. If it doesn't, you get an exception, precisely the same way you would get an exception if you manually did the read and passed it to unpack(). Its the caller's responsibility to provide a valid file object. If your struct needs 10 bytes, and you provide a file that returns 6 bytes, you get an exception. There's no promise made that unpack() should repeat the read over and over again, hoping that its a pipe and more data becomes available. It either works with a single read, or it fails. Just like similar APIs as those provided by pickle, json etc which provide load() and loads() functions. In hindsight, the precedent set by pickle, json, etc suggests that we ought to have an unpack() function that reads from files and an unpacks() function that takes a string, but that ship has sailed. > Again, not insoluble, and potentially simple enough to handle with > "read N bytes, if you got something other than bytes or fewer than N > of them, raise an error", but still enough that the special cases > start to accumulate. I can understand the argument that the benefit of this is trivial over unpack(fmt, f.read(calcsize(fmt)) Unlike reading from a pickle or json record, its pretty easy to know how much to read, so there is an argument that this convenience method doesn't gain us much convenience. 
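For concreteness, the semantics being defended here fit in a few lines. The name unpack_from_file is invented for illustration; the behaviour shown (one read, short reads fail loudly) is just the proposal as described above:

    import io, struct

    def unpack_from_file(fmt, f):
        # exactly one read of calcsize(fmt) bytes; no retry loop
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))

    good = io.BytesIO(struct.pack('<ii', 1, 2))
    print(unpack_from_file('<ii', good))       # (1, 2)

    short = io.BytesIO(b'\x01\x00\x00\x00')    # only 4 of the 8 bytes needed
    unpack_from_file('<ii', short)             # raises struct.error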
But I'm just not seeing where all the extra complexity and special case handing is supposed to be, except by having unpack make promises that the OP didn't request: - read partial structs from non-blocking files without failing - deal with file system errors without failing - support reading from text files when bytes are required without failing - if an exception occurs, the state of the file shouldn't change Those promises *would* add enormous amounts of complexity, but I don't think we need to make those promises. I don't think the OP wants them, I don't want them, and I don't think they are reasonable promises to make. > The suggestion is a nice convenience method, and probably a useful > addition for the majority of cases where it would do exactly what was > needed, but still not completely trivial to actually implement and > document (if I were doing it, I'd go with the naive approach, and just > raise a ValueError when read(N) returns anything other than N bytes, > for what it's worth). Indeed. Except that we should raise precisely the same exception type that struct.unpack() currently raises in the same circumstances: py> struct.unpack("ddd", b"a") Traceback (most recent call last): File "", line 1, in struct.error: unpack requires a bytes object of length 24 rather than ValueError. -- Steve From andrew.svetlov at gmail.com Mon Dec 24 18:28:02 2018 From: andrew.svetlov at gmail.com (Andrew Svetlov) Date: Tue, 25 Dec 2018 01:28:02 +0200 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181224211733.GQ13061@ando.pearwood.info> References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> Message-ID: The proposal can generate cryptic messages like `a bytes-like object is required, not 'NoneType'` To produce more informative exception text all mentioned cases should be handled: > - read partial structs from non-blocking files without failing > - deal with file system errors without failing > - support reading from text files when bytes are required without failing > - if an exception occurs, the state of the file shouldn't change I can add a couple of cases but the list is long enough for demonstration purposes. When a user calls unpack(fmt, f.read(calcsize(fmt)) the user is responsible for handling all edge cases (or ignore them most likely). If it is a part of a library -- robustness is the library responsibility. On Mon, Dec 24, 2018 at 11:23 PM Steven D'Aprano wrote: > On Mon, Dec 24, 2018 at 03:36:07PM +0000, Paul Moore wrote: > > > > There should be no difference whether the text comes from a literal, a > > > variable, or is read from a file. > > > > One difference is that with a file, it's (as far as I can see) > > impossible to determine whether or not you're going to get bytes or > > text without reading some data (and so potentially affecting the state > > of the file object). > > Here are two ways: look at the type of the file object, or look at the > mode of the file object: > > py> f = open('/tmp/spam.binary', 'wb') > py> g = open('/tmp/spam.text', 'w') > py> type(f), type(g) > (, ) > > py> f.mode, g.mode > ('wb', 'w') > > > > This might be considered irrelevant > > Indeed :-) > > > > (personally, > > I don't see a problem with a function definition that says "parameter > > fd must be an object that has a read(length) method that returns > > bytes" - that's basically what duck typing is all about) but it *is* a > > distinguishing feature of files over in-memory data. 
> > But it's not a distinguishing feature between the proposal, and writing: > > unpack(fmt, f.read(size)) > > which will also read from the file and affect the file state before > failing. So its a difference that makes no difference. > > > > There is also the fact that read() is only defined to return *at most* > > the requested number of bytes. Non-blocking reads and objects like > > pipes that can return additional data over time add extra complexity. > > How do they add extra complexity? > > According to the proposal, unpack() attempts the read. If it returns the > correct number of bytes, the unpacking succeeds. If it doesn't, you get > an exception, precisely the same way you would get an exception if you > manually did the read and passed it to unpack(). > > Its the caller's responsibility to provide a valid file object. If your > struct needs 10 bytes, and you provide a file that returns 6 bytes, you > get an exception. There's no promise made that unpack() should repeat > the read over and over again, hoping that its a pipe and more data > becomes available. It either works with a single read, or it fails. > > Just like similar APIs as those provided by pickle, json etc which > provide load() and loads() functions. > > In hindsight, the precedent set by pickle, json, etc suggests that we > ought to have an unpack() function that reads from files and an > unpacks() function that takes a string, but that ship has sailed. > > > > Again, not insoluble, and potentially simple enough to handle with > > "read N bytes, if you got something other than bytes or fewer than N > > of them, raise an error", but still enough that the special cases > > start to accumulate. > > I can understand the argument that the benefit of this is trivial over > > unpack(fmt, f.read(calcsize(fmt)) > > Unlike reading from a pickle or json record, its pretty easy to know how > much to read, so there is an argument that this convenience method > doesn't gain us much convenience. > > But I'm just not seeing where all the extra complexity and special case > handing is supposed to be, except by having unpack make promises that > the OP didn't request: > > - read partial structs from non-blocking files without failing > - deal with file system errors without failing > - support reading from text files when bytes are required without failing > - if an exception occurs, the state of the file shouldn't change > > Those promises *would* add enormous amounts of complexity, but I don't > think we need to make those promises. I don't think the OP wants them, > I don't want them, and I don't think they are reasonable promises to > make. > > > > The suggestion is a nice convenience method, and probably a useful > > addition for the majority of cases where it would do exactly what was > > needed, but still not completely trivial to actually implement and > > document (if I were doing it, I'd go with the naive approach, and just > > raise a ValueError when read(N) returns anything other than N bytes, > > for what it's worth). > > Indeed. Except that we should raise precisely the same exception type > that struct.unpack() currently raises in the same circumstances: > > py> struct.unpack("ddd", b"a") > Traceback (most recent call last): > File "", line 1, in > struct.error: unpack requires a bytes object of length 24 > > rather than ValueError. 
> > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Thanks, Andrew Svetlov -------------- next part -------------- An HTML attachment was scrubbed... URL: From robertve92 at gmail.com Tue Dec 25 05:03:36 2018 From: robertve92 at gmail.com (Robert Vanden Eynde) Date: Tue, 25 Dec 2018 11:03:36 +0100 Subject: [Python-ideas] About the passing the function arguments in Keyword form. In-Reply-To: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> References: <2cf41470.9511.167dfbbe20e.Coremail.phylimo@163.com> Message-ID: It's very important that f(z=5) Raises an exception if z is not an argument. For your case, I'd do a wrapper, instead lf calling f(z=5) you can call UniversalCall(f, x=1, y=2, z=5) if you want to specify it on the caller side. Or else, you can create a decorator : @universal_callable def f(x, y): ... f(x=1, y=2, z=5) # works ! On Mon, 24 Dec 2018, 11:21 ?? I am having an idea on loosing the argument validity check when passing > the function arguments in keyword way. > For example: > ------------------------------- > def f(x, y): > > print(x, y)def call_f(): > f(x=7, y=9, z=9) > > call_f() > > ------------------------------ > > In the current of python, the extra pass of 'z' would let the interpreter raise an exception and stop work. My idea is that the interpreter need not stop because all the needed args are completely provided. Of course for this toy example, 'f' can be define as f(x, y, **kwargs) to achieve the same goal. However, essentially it is reasonably to keep interpreter going as long as enough args are passed. And this modification can bring more freedom of programming. > > > Think about the following situations: > > situation 1) there are many 'f's written by other people, and their args are very similar and your job is to run each of them to get some results. > > --------------------- > > ##########code by others: > > def f0(): > ... > def f1(x): > ... > def f2(x, y): > ... > def f3(x, y, z): > ... > > #if passing extra args are valid, you can run all the functions in the following way, which is very compact and easy to read. > > def test_universal_call(): > > funcs = [f0, f1, f2, f3] > args = {'x':1, 'y':5, 'z':8} > for f in funcs: > f(**args) > > ------------------ > > > situation 2) there are several steps for make one product, each step is in an individual function and needs different args. > > ------------------ > > def make_oil(oil): > ... > > def make_water( water): > ... > > def make_powder(powder): > ... > > ## if passing extra args are valid, you can run all the functions in the following way, which is very compact and easy to read. > > def dish(): > procedures = [make_oil, make_water, make_powder] > > args = {'oil' : 1, 'water': 10, 'powder': 4} > for f in procedures: > f(**args) > > > --------------- > > > This idea is different from **kwargs. **kwargs are used when user wants to record all the keywords passed. This idea is that even if the user doesn?t want to record the arguments, that extra pass of keyword arguments wont?t cause an exception. > > > > Sorry for bothering you guys if this is a stupid idea. > > Happy to hear your suggestions. 
> > > Li Mo > > > > > > > > > > > > > > > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eryksun at gmail.com Tue Dec 25 17:51:18 2018 From: eryksun at gmail.com (eryk sun) Date: Tue, 25 Dec 2018 16:51:18 -0600 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: On 12/24/18, Drew Warwick wrote: > The struct unpack API is inconvenient to use with files. I must do: > > struct.unpack(fmt, file.read(struct.calcsize(fmt)) Alternatively, we can memory-map the file via mmap. An important difference is that the mmap buffer interface is low-level (e.g. no file pointer and the offset has to be page aligned), so we have to slice out bytes for the given offset and size. We can avoid copying via memoryview slices. We can also use ctypes instead of memoryview/struct. From steve at pearwood.info Tue Dec 25 20:16:32 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 26 Dec 2018 12:16:32 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: <20181226011629.GS13061@ando.pearwood.info> On Tue, Dec 25, 2018 at 04:51:18PM -0600, eryk sun wrote: > On 12/24/18, Drew Warwick wrote: > > The struct unpack API is inconvenient to use with files. I must do: > > > > struct.unpack(fmt, file.read(struct.calcsize(fmt)) > > Alternatively, we can memory-map the file via mmap. An important > difference is that the mmap buffer interface is low-level (e.g. no > file pointer and the offset has to be page aligned), so we have to > slice out bytes for the given offset and size. We can avoid copying > via memoryview slices. Seems awfully complicated. How do we do all these things, and what advantage does it give? > We can also use ctypes instead of > memoryview/struct. Only if you want non-portable code. What advantage over struct is ctypes? -- Steve From cs at cskk.id.au Tue Dec 25 21:05:51 2018 From: cs at cskk.id.au (Cameron Simpson) Date: Wed, 26 Dec 2018 13:05:51 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: <20181226020551.GA52216@cskk.homeip.net> On 24Dec2018 10:19, James Edwards wrote: >Here's a snippet of semi-production code we use: > > def read_and_unpack(handle, fmt): > size = struct.calcsize(fmt) > data = handle.read(size) > if len(data) < size: return None > return struct.unpack(fmt, data) > >which was originally something like: > > def read_and_unpack(handle, fmt, offset=None): > if offset is not None: > handle.seek(*offset) > size = struct.calcsize(fmt) > data = handle.read(size) > if len(data) < size: return None > return struct.unpack(fmt, data) > >until we pulled file seeking up out of the function. > >Having struct.unpack and struct.unpack_from support files would seem >straightforward and be a nice quality of life change, imo. These days I go the other way. I make it easy to get bytes from what I'm working with and _expect_ to parse from a stream of bytes. I have a pair of modules cs.buffer (for getting bytes from things) and cs.binary (for parsing structures from binary data). (See PyPI.) cs.buffer primarily offers a CornuCopyBuffer which manages access to any iterable of bytes objects. It has a suite of factories to make these from binary files, bytes, bytes[], a mmap, etc. 
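The underlying idea -- buffer an iterable of bytes chunks and hand back exactly n bytes at a time -- can be sketched independently of cs.buffer. The toy class below is not Cameron's implementation, just an illustration of the concept:

    class ChunkBuffer:
        """Toy 'take exactly n bytes' buffer over an iterable of bytes chunks."""

        def __init__(self, chunks):
            self._chunks = iter(chunks)
            self._pending = b''

        def take(self, n):
            parts = [self._pending]
            have = len(self._pending)
            while have < n:
                try:
                    chunk = next(self._chunks)
                except StopIteration:
                    raise EOFError('wanted %d bytes, only %d available' % (n, have))
                parts.append(chunk)
                have += len(chunk)
            data = b''.join(parts)
            self._pending = data[n:]       # keep any excess for the next take()
            return data[:n]

    import struct
    buf = ChunkBuffer([b'\x01\x00', b'\x00\x00\x02\x00\x00\x00'])
    print(struct.unpack('<i', buf.take(4)))    # (1,)
    print(struct.unpack('<i', buf.take(4)))    # (2,)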
Once you've got one of these you have access to a suite of convenient methods. Particularly for grabbing structs, these's a .take() method which obtains a precise number of bytes. (Think that looks like a file read? Yes, and it offers a basic file-like suite of methods too.) Anyway, cs.binary is based of a PacketField base class oriented around pulling a binary structure from a CornuCopyBuffer. Obviously, structs are very common, and cs.binary has a factory: def structtuple(class_name, struct_format, subvalue_names): which gets you a PacketField subclass whose parse methods read a struct and return it to you in a nice namedtuple. Also, PacketFields self transcribe: you can construct one from its values and have it write out the binary form. Once you've got these the tendency is just to make a PacketField instances from that function for the structs you need and then to just grab things from a CornuCopyBuffer providing the data. And you no longer have to waste effort on different code for bytes or files. Example from cs.iso14496: PDInfo = structtuple('PDInfo', '>LL', 'rate initial_delay') Then you can just use PDInfo.from_buffer() or PDInfo.from_bytes() to parse out your structures from then on. I used to have tedious duplicated code for bytes and files in various placed; I'm ripping it out and replacing with this as I encounter it. Far more reliable, not to mention smaller and easier. Cheers, Cameron Simpson From eryksun at gmail.com Tue Dec 25 22:52:44 2018 From: eryksun at gmail.com (eryk sun) Date: Tue, 25 Dec 2018 21:52:44 -0600 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181226011629.GS13061@ando.pearwood.info> References: <20181226011629.GS13061@ando.pearwood.info> Message-ID: On 12/25/18, Steven D'Aprano wrote: > On Tue, Dec 25, 2018 at 04:51:18PM -0600, eryk sun wrote: >> >> Alternatively, we can memory-map the file via mmap. An important >> difference is that the mmap buffer interface is low-level (e.g. no >> file pointer and the offset has to be page aligned), so we have to >> slice out bytes for the given offset and size. We can avoid copying >> via memoryview slices. > > Seems awfully complicated. How do we do all these things, and what > advantage does it give? Refer to the mmap and memoryview docs. It is more complex, not significantly, but not something I'd suggest to a novice. Anyway, another disadvantage is that this requires a real OS file, not just a file-like interface. One possible advantage is that we can work naively and rely on the OS to move pages of the file to and from memory on demand. However, making this really convenient requires the ability to access memory directly with on-demand conversion, as is possible with ctypes (records & arrays) or numpy (arrays). Out of the box, multiprocessing works like this for shared-memory access. For example: import ctypes import multiprocessing class Record(ctypes.LittleEndianStructure): _pack_ = 1 _fields_ = (('a', ctypes.c_int), ('b', ctypes.c_char * 4)) a = multiprocessing.Array(Record, 2) a[0].a = 1 a[0].b = b'spam' a[1].a = 2 a[1].b = b'eggs' >>> a._obj Shared values and arrays are accessed out of a heap that uses arenas backed by mmap instances: >>> a._obj._wrapper._state ((, 0, 16), 16) >>> a._obj._wrapper._state[0][0].buffer The two records are stored in this shared memory: >>> a._obj._wrapper._state[0][0].buffer[:16] b'\x01\x00\x00\x00spam\x02\x00\x00\x00eggs' >> We can also use ctypes instead of >> memoryview/struct. > > Only if you want non-portable code. 
ctypes has good support for at least Linux and Windows, but it's an optional package in CPython's standard library and not necessarily available with other implementations. > What advantage over struct is ctypes? If it's available, I find that ctypes is often more convenient than the manual pack/unpack approach of struct. If we're writing to the file, ctypes lets us directly assign data to arrays and the fields of records on disk (the ctypes instance knows the address and its data descriptors handle converting values implicitly). The tradeoff is that defining structures in ctypes can be tedious (_pack_, _fields_) compared to the simple format strings of the struct module. With ctypes it helps to already be fluent in C. From steve at pearwood.info Wed Dec 26 00:11:35 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 26 Dec 2018 16:11:35 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> Message-ID: <20181226051134.GV13061@ando.pearwood.info> On Tue, Dec 25, 2018 at 01:28:02AM +0200, Andrew Svetlov wrote: > The proposal can generate cryptic messages like > `a bytes-like object is required, not 'NoneType'` How will it generate such a message? That's not obvious to me. The message doesn't seem cryptic to me. It seems perfectly clear: a bytes-like object is required, but you provided None instead. The only thing which is sub-optimal is the use of "NoneType" (the name of the class) instead of None. > To produce more informative exception text all mentioned cases should be > handled: Why should they? How are the standard exceptions not good enough? The standard library is full of implementations which use ducktyping, and if you pass a chicken instead of a duck you get errors like AttributeError: 'Chicken' object has no attribute 'bill' Why isn't that good enough for this function too? We already have a proof-of-concept implementation, given by the OP. Here is it again: import io, struct def unpackStruct(fmt, frm): if isinstance(frm, io.IOBase): return struct.unpack(fmt, frm.read(struct.calcsize(fmt))) else: return struct.unpack(fmt, frm) Here's the sort of exceptions it generates. For brevity, I have cut the tracebacks down to only the final line: py> unpackStruct("ddd", open("/tmp/spam", "w")) io.UnsupportedOperation: not readable Is that not clear enough? (This is not a rhetorical question.) In what way do you think that exception needs enhancing? It seems perfectly fine to me. Here's another exception that may be fine as given. If the given file doesn't contain enough bytes to fill the struct, you get this: py> __ = open("/tmp/spam", "wb").write(b"\x10") py> unpackStruct("ddd", open("/tmp/spam", "rb")) struct.error: unpack requires a bytes object of length 24 It might be *nice*, but hardly *necessary*, to re-word the error message to make it more obvious that we're reading from a file, but honestly that should be obvious from context. There are certainly worse error messages in Python. Here is one exception which should be reworded: py> unpackStruct("ddd", open("/tmp/spam", "r")) Traceback (most recent call last): File "", line 1, in File "", line 3, in unpackStruct TypeError: a bytes-like object is required, not 'str' For production use, that should report that the file needs to be opened in binary mode, not text mode. Likewise similar type errors should report "bytes-like or file-like" object. 
These are minor enhancements to exception reporting, and aren't what I consider to be adding complexity in any meaningful sense. Of course we should expect that library-quality functions will have more error checking and better error reporting than a simple utility function for you own use. The OP's simple implementation is a five line function. Adding more appropriate error messages might, what? Triple it? That surely is an argument for *doing it right, once* in the standard library, rather than having people re-invent the wheel over and over. def unpackStruct(fmt, frm): if isinstance(frm, io.IOBase): if isinstance(frm, io.TextIOBase): raise TypeError('file must be opened in binary mode, not text') n = struct.calcsize(fmt) value = frm.read(n) assert isinstance(value, bytes) if len(value) != n: raise ValueError( 'expected %d bytes but only got %d' % (n, len(value)) ) return struct.unpack(fmt, value) else: return struct.unpack(fmt, frm) I think this is a useful enhancement to unpack(). If we were designing the struct module from scratch today, we'd surely want unpack() to read from files and unpacks() to read from a byte-string, mirroring the API of json, pickle, and similar. But given the requirement for backwards compatibility, we can't change the fact that unpack() works with byte-strings. So we can either add a new function, unpack_from_file() or simply make unpack() a generic function that accepts either a byte-like interface or a file-like interface. I vote for the generic function approach. (Or do nothing, of course.) So far, I'm not seeing any substantial arguments for why this isn't useful, or too difficult to implement. If anything, the biggest argument against it is that it is too simple to bother with (but that argument would apply equally to the pickle and json APIs). "Not every ~~one~~ fifteen line function needs to be in the standard library." -- Steve From andrew.svetlov at gmail.com Wed Dec 26 02:48:15 2018 From: andrew.svetlov at gmail.com (Andrew Svetlov) Date: Wed, 26 Dec 2018 09:48:15 +0200 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181226051134.GV13061@ando.pearwood.info> References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> Message-ID: On Wed, Dec 26, 2018 at 7:12 AM Steven D'Aprano wrote: > On Tue, Dec 25, 2018 at 01:28:02AM +0200, Andrew Svetlov wrote: > > > The proposal can generate cryptic messages like > > `a bytes-like object is required, not 'NoneType'` > > How will it generate such a message? That's not obvious to me. > > The message doesn't seem cryptic to me. It seems perfectly clear: a > bytes-like object is required, but you provided None instead. > > The only thing which is sub-optimal is the use of "NoneType" (the name > of the class) instead of None. > > The perfect demonstration of io objects complexity. `stream.read(N)` can return None by spec if the file is non-blocking and have no ready data. Confusing but still possible and documented behavior. > > > To produce more informative exception text all mentioned cases should be > > handled: > > Why should they? How are the standard exceptions not good enough? The > standard library is full of implementations which use ducktyping, and if > you pass a chicken instead of a duck you get errors like > > AttributeError: 'Chicken' object has no attribute 'bill' > > Why isn't that good enough for this function too? > > We already have a proof-of-concept implementation, given by the OP. 
> Here is it again: > > > import io, struct > def unpackStruct(fmt, frm): > if isinstance(frm, io.IOBase): > return struct.unpack(fmt, frm.read(struct.calcsize(fmt))) > else: > return struct.unpack(fmt, frm) > > > Here's the sort of exceptions it generates. For brevity, I have cut the > tracebacks down to only the final line: > > > py> unpackStruct("ddd", open("/tmp/spam", "w")) > io.UnsupportedOperation: not readable > > > Is that not clear enough? (This is not a rhetorical question.) In what > way do you think that exception needs enhancing? It seems perfectly fine > to me. > > Here's another exception that may be fine as given. If the given file > doesn't contain enough bytes to fill the struct, you get this: > > > py> __ = open("/tmp/spam", "wb").write(b"\x10") > py> unpackStruct("ddd", open("/tmp/spam", "rb")) > struct.error: unpack requires a bytes object of length 24 > > > It might be *nice*, but hardly *necessary*, to re-word the error message > to make it more obvious that we're reading from a file, but honestly > that should be obvious from context. There are certainly worse error > messages in Python. > > Here is one exception which should be reworded: > > py> unpackStruct("ddd", open("/tmp/spam", "r")) > Traceback (most recent call last): > File "", line 1, in > File "", line 3, in unpackStruct > TypeError: a bytes-like object is required, not 'str' > > For production use, that should report that the file needs to be opened > in binary mode, not text mode. > > Likewise similar type errors should report "bytes-like or file-like" > object. > > These are minor enhancements to exception reporting, and aren't what I > consider to be adding complexity in any meaningful sense. Of course we > should expect that library-quality functions will have more error > checking and better error reporting than a simple utility function for > you own use. > > > The OP's simple implementation is a five line function. Adding more > appropriate error messages might, what? Triple it? That surely is an > argument for *doing it right, once* in the standard library, rather than > having people re-invent the wheel over and over. > > > def unpackStruct(fmt, frm): > if isinstance(frm, io.IOBase): > if isinstance(frm, io.TextIOBase): > raise TypeError('file must be opened in binary mode, not text') > n = struct.calcsize(fmt) > value = frm.read(n) > assert isinstance(value, bytes) > if len(value) != n: > raise ValueError( > 'expected %d bytes but only got %d' > % (n, len(value)) > ) > return struct.unpack(fmt, value) > else: > return struct.unpack(fmt, frm) > > You need to repeat reads until collecting the value of enough size. `.read(N)` can return less bytes by definition, that's true starting from very low-level read(2) syscall. Otherwise a (low) change of broken code with very non-obvious error message exists. > > I think this is a useful enhancement to unpack(). If we were designing > the struct module from scratch today, we'd surely want unpack() to read > from files and unpacks() to read from a byte-string, mirroring the API > of json, pickle, and similar. > > But given the requirement for backwards compatibility, we can't change > the fact that unpack() works with byte-strings. So we can either add a > new function, unpack_from_file() or simply make unpack() a generic > function that accepts either a byte-like interface or a file-like > interface. I vote for the generic function approach. > > (Or do nothing, of course.) 
> > So far, I'm not seeing any substantial arguments for why this isn't > useful, or too difficult to implement. > > If anything, the biggest argument against it is that it is too simple to > bother with (but that argument would apply equally to the pickle and > json APIs). "Not every ~~one~~ fifteen line function needs to be in > the standard library." > > > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Thanks, Andrew Svetlov -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Wed Dec 26 04:25:19 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 26 Dec 2018 20:25:19 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> Message-ID: <20181226092518.GX13061@ando.pearwood.info> On Wed, Dec 26, 2018 at 09:48:15AM +0200, Andrew Svetlov wrote: > The perfect demonstration of io objects complexity. > `stream.read(N)` can return None by spec if the file is non-blocking > and have no ready data. > > Confusing but still possible and documented behavior. https://docs.python.org/3/library/io.html#io.RawIOBase.read Regardless, my point doesn't change. That has nothing to do with the behaviour of unpack. If you pass a non-blocking file-like object which returns None, you get exactly the same exception as if you wrote unpack(fmt, f.read(size)) and the call to f.read returned None. Why is it unpack's responsibility to educate the caller that f.read can return None? Let's see what other functions with similar APIs do. py> class FakeFile: ... def read(self, n=-1): ... return None ... def readline(self): ... return None ... py> pickle.load(FakeFile()) Traceback (most recent call last): File "", line 1, in TypeError: a bytes-like object is required, not 'NoneType' py> json.load(FakeFile()) Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python3.5/json/__init__.py", line 268, in load parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw) File "/usr/local/lib/python3.5/json/__init__.py", line 312, in loads s.__class__.__name__)) TypeError: the JSON object must be str, not 'NoneType' If it is good enough for pickle and json load() functions to report a TypeError like this, it is good enough for unpack(). Not every exception needs a custom error message. > You need to repeat reads until collecting the value of enough size. That's not what the OP has asked for, it isn't what the OP's code does, and its not what I've suggested. Do pickle and json block and repeat the read until they have a complete object? I'm pretty sure they don't -- the source for json.load() that I have says: return loads(fp.read(), ... ) so it definitely doesn't repeat the read. I think it is so unlikely that pickle blocks waiting for extra input that I haven't even bothered to look. Looping and repeating the read is a clear case of YAGNI. Don't over-engineer the function, and then complain that the over- engineered function is too complex. There is no need for unpack() to handle streaming input which can output anything less than a complete struct per read. > `.read(N)` can return less bytes by definition, Yes, we know that. 
And if it returns fewer bytes, then you get a nice, clear exception. -- Steve From andrew.svetlov at gmail.com Wed Dec 26 05:18:23 2018 From: andrew.svetlov at gmail.com (Andrew Svetlov) Date: Wed, 26 Dec 2018 12:18:23 +0200 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181226092518.GX13061@ando.pearwood.info> References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> <20181226092518.GX13061@ando.pearwood.info> Message-ID: On Wed, Dec 26, 2018 at 11:26 AM Steven D'Aprano wrote: > On Wed, Dec 26, 2018 at 09:48:15AM +0200, Andrew Svetlov wrote: > > > The perfect demonstration of io objects complexity. > > `stream.read(N)` can return None by spec if the file is non-blocking > > and have no ready data. > > > > Confusing but still possible and documented behavior. > > https://docs.python.org/3/library/io.html#io.RawIOBase.read > > Regardless, my point doesn't change. That has nothing to do with the > behaviour of unpack. If you pass a non-blocking file-like object which > returns None, you get exactly the same exception as if you wrote > > unpack(fmt, f.read(size)) > > and the call to f.read returned None. Why is it unpack's responsibility > to educate the caller that f.read can return None? > > Let's see what other functions with similar APIs do. > > > py> class FakeFile: > ... def read(self, n=-1): > ... return None > ... def readline(self): > ... return None > ... > py> pickle.load(FakeFile()) > Traceback (most recent call last): > File "", line 1, in > TypeError: a bytes-like object is required, not 'NoneType' > py> json.load(FakeFile()) > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/python3.5/json/__init__.py", line 268, in load > parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, > **kw) > File "/usr/local/lib/python3.5/json/__init__.py", line 312, in loads > s.__class__.__name__)) > TypeError: the JSON object must be str, not 'NoneType' > > > If it is good enough for pickle and json load() functions to report a > TypeError like this, it is good enough for unpack(). > > Not every exception needs a custom error message. > > > > > You need to repeat reads until collecting the value of enough size. > > That's not what the OP has asked for, it isn't what the OP's code does, > and its not what I've suggested. > > Do pickle and json block and repeat the read until they have a complete > object? I'm pretty sure they don't -- the source for json.load() that I > have says: > > return loads(fp.read(), ... ) > > so it definitely doesn't repeat the read. I think it is so unlikely that > pickle blocks waiting for extra input that I haven't even bothered to > look. Looping and repeating the read is a clear case of YAGNI. > > json is correct: if `read()` is called without argument it reads the whole content until EOF. But with size argument the is different for interactive and non-interactive streams. RawIOBase and BufferedIOBase also have slightly different behavior for `.read()`. Restriction fp to BufferedIOBase looks viable though, but it is not a file-like object. Also I'm thinking about type annotations in typeshed. Now the type is Union[array[int], bytes, bytearray, memoryview] Should it be Union[io.BinaryIO, array[int], bytes, bytearray, memoryview] ? What is behavior of unpack_from(fp, offset=120)? Should iter_unpack() read the whole buffer from file into a memory before emitting a first value? 
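For reference, the "repeat the read until you have enough bytes" behaviour being discussed would look roughly like the sketch below. The helper names are invented, and whether unpack() should ever do this is exactly the point in dispute:

    import struct

    def read_exact(f, n):
        """Keep calling f.read() until n bytes are collected, or give up."""
        parts = []
        remaining = n
        while remaining > 0:
            chunk = f.read(remaining)
            if not chunk:                  # b'' (EOF) or None (no data ready)
                raise EOFError('expected %d bytes, got %d' % (n, n - remaining))
            parts.append(chunk)
            remaining -= len(chunk)
        return b''.join(parts)

    def unpack_exact(fmt, f):
        return struct.unpack(fmt, read_exact(f, struct.calcsize(fmt)))

Note that even this loop punts on genuinely non-blocking files: a read() that returns None is treated the same as EOF rather than retried.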
> Don't over-engineer the function, and then complain that the over- > engineered function is too complex. There is no need for unpack() to > handle streaming input which can output anything less than a complete > struct per read. > > > > > `.read(N)` can return less bytes by definition, > > Yes, we know that. And if it returns fewer bytes, then you get a nice, > clear exception. > > > > -- > Steve > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Thanks, Andrew Svetlov -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Wed Dec 26 06:10:05 2018 From: njs at pobox.com (Nathaniel Smith) Date: Wed, 26 Dec 2018 03:10:05 -0800 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> <20181226092518.GX13061@ando.pearwood.info> Message-ID: On Wed, Dec 26, 2018, 02:19 Andrew Svetlov > Also I'm thinking about type annotations in typeshed. > Now the type is Union[array[int], bytes, bytearray, memoryview] > Should it be Union[io.BinaryIO, array[int], bytes, bytearray, memoryview] ? > Yeah, trying to support both buffers and file-like objects in the same function seems like a clearly bad idea. If we do this at all it should be by adding new convenience functions/methods that take file-like objects exclusively, like the ones several people posted on the thread. I don't really have an opinion on whether this is worth doing at all. I guess I can think of some arguments against: Packing/unpacking multiple structs to the same file-like object may be less efficient than using a single buffer + a single call to read/write. And it's unfortunate that the obvious pack_into/unpack_from names are already taken. And it's only 2 lines of code to write your own helpers. But none of these are particularly strong arguments either, and clearly some people would find them handy. -n -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Wed Dec 26 06:25:06 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 26 Dec 2018 22:25:06 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> <20181226092518.GX13061@ando.pearwood.info> Message-ID: <20181226112506.GY13061@ando.pearwood.info> On Wed, Dec 26, 2018 at 03:10:05AM -0800, Nathaniel Smith wrote: > On Wed, Dec 26, 2018, 02:19 Andrew Svetlov > > > > Also I'm thinking about type annotations in typeshed. > > Now the type is Union[array[int], bytes, bytearray, memoryview] > > Should it be Union[io.BinaryIO, array[int], bytes, bytearray, memoryview] ? > > > > Yeah, trying to support both buffers and file-like objects in the same > function seems like a clearly bad idea. It might be clear to you, but it's not clear to me. Why is it a bad idea? The OP has a function which does precisely that, and it works well for him. 
-- Steve From steve at pearwood.info Wed Dec 26 06:42:30 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 26 Dec 2018 22:42:30 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> <20181226092518.GX13061@ando.pearwood.info> Message-ID: <20181226114229.GZ13061@ando.pearwood.info> On Wed, Dec 26, 2018 at 12:18:23PM +0200, Andrew Svetlov wrote: [...] > > json is correct: if `read()` is called without argument it reads the whole > content until EOF. > But with size argument the is different for interactive and non-interactive > streams. > RawIOBase and BufferedIOBase also have slightly different behavior for > `.read()`. This is complexity that isn't the unpack() function's responsibility to care about. All it wants is to call read(N) and get back N bytes. If it gets back anything else, that's an error. > Restriction fp to BufferedIOBase looks viable though, but it is not a > file-like object. There is no need to restrict it to BufferedIOBase. In hindsight, I am not even sure we should do an isinstance check at all. Surely all we care about is that the object has a read() method which takes a single argument, and returns that number of bytes? Here's another proof-of-concept implementation which doesn't require any isinstance checks on the argument. The only type checking it does is to verify that the read returns bytes, and even that is only a convenience so it can provide a friendly error message. def unpackStruct(fmt, frm): try: read = frm.read except AttributeError: return struct.unpack(fmt, frm) n = struct.calcsize(fmt) value = read(n) if not isinstance(value, bytes): raise TypeError('read method must return bytes') if len(value) != n: raise ValueError('expected %d bytes but only got %d' % (n, len(value))) return struct.unpack(fmt, value) [...] > What is behavior of unpack_from(fp, offset=120)? I don't know. What does the "offset" parameter do, and who requested it? I didn't, and neither did the OP Drew Warwick. James Edwards wrote that he too uses a similar function in production, one which originally did support file seeking, but they took it out. If you are suggesting an offset parameter to the unpack() function, it is up to you to propose what meaning it will have and justify why it should be part of unpack's API. Until then, YAGNI. > Should iter_unpack() read the whole buffer from file into a memory before > emitting a first value? Nobody has requested any changes to iter_unpack(). -- Steve From p.f.moore at gmail.com Wed Dec 26 08:32:38 2018 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 26 Dec 2018 13:32:38 +0000 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181226092518.GX13061@ando.pearwood.info> References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> <20181226092518.GX13061@ando.pearwood.info> Message-ID: On Wed, 26 Dec 2018 at 09:26, Steven D'Aprano wrote: > Regardless, my point doesn't change. That has nothing to do with the > behaviour of unpack. If you pass a non-blocking file-like object which > returns None, you get exactly the same exception as if you wrote > > unpack(fmt, f.read(size)) > > and the call to f.read returned None. Why is it unpack's responsibility > to educate the caller that f.read can return None? 
Abstraction, basically - once the unpack function takes responsibility for doing the read, and hiding the fact that there's a read going on behind an API unpack(fmt, f), it *also* takes on responsibility for managing all of the administration of that read call. It's perfectly at liberty to do so by saying "we do a read() behind the scenes, so you get the same behaviour as if you did that read() yourself", but that's a pretty thin layer of abstraction (and people often expect something less transparent). As I say, you *can* define the behaviour as you say, but it shouldn't be surprising if people expect a bit more (even if, as you've said a few times, "no-one has asked for that"). Designing an API that meets people's (often unstated) expectations isn't always as easy as just writing a convenience function. Paul PS I remain neutral on whether the OP's proposal is worth adding, but the conversation has drifted more into abstract questions about what "needs" to be in this API, so take the above on that basis. From cs at cskk.id.au Wed Dec 26 18:02:09 2018 From: cs at cskk.id.au (Cameron Simpson) Date: Thu, 27 Dec 2018 10:02:09 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: Message-ID: <20181226230209.GA68021@cskk.homeip.net> On 26Dec2018 12:18, Andrew Svetlov wrote: >On Wed, Dec 26, 2018 at 11:26 AM Steven D'Aprano >wrote: > >> On Wed, Dec 26, 2018 at 09:48:15AM +0200, Andrew Svetlov wrote: >> > The perfect demonstration of io objects complexity. >> > `stream.read(N)` can return None by spec if the file is non-blocking >> > and have no ready data. >> > >> > Confusing but still possible and documented behavior. >> >> https://docs.python.org/3/library/io.html#io.RawIOBase.read >> >> Regardless, my point doesn't change. That has nothing to do with the >> behaviour of unpack. If you pass a non-blocking file-like object which >> returns None, you get exactly the same exception as if you wrote >> >> unpack(fmt, f.read(size)) >> >> and the call to f.read returned None. Why is it unpack's responsibility >> to educate the caller that f.read can return None? [...] >> > You need to repeat reads until collecting the value of enough size. >> >> That's not what the OP has asked for, it isn't what the OP's code does, >> and its not what I've suggested. >> >> Do pickle and json block and repeat the read until they have a complete >> object? I'm pretty sure they don't [...] >> json is correct: if `read()` is called without argument it reads the >> whole >content until EOF. >But with size argument the is different for interactive and non-interactive >streams. Oh, it is better than that. At the low level, even blocking streams can return short reads - particularly serial streams like ttys and TCP connections. >RawIOBase and BufferedIOBase also have slightly different behavior for >`.read()`. > >Restriction fp to BufferedIOBase looks viable though, but it is not a >file-like object. > >Also I'm thinking about type annotations in typeshed. >Now the type is Union[array[int], bytes, bytearray, memoryview] >Should it be Union[io.BinaryIO, array[int], bytes, bytearray, >memoryview] ? And this is why I, personally, think augumenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms. And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread). Its entire purpose is to wrap an iterable of bytes-like objects and do all that work via convenient methods. 
And which has factory methods to make these from files or other common things. Given a CornuCopyBuffer `bfr`: S = struct('spec-here...') sbuf = bfr.take(S.size) result = S.unpack(sbuf) Under the covers `bfr` take care of short "reads" (iteraion values) etc in the underlying iterable. The return from .take is typically a memoryview from `bfr`'s internal buffer - it is _always_ exactly `size` bytes long if you don't pass short_ok=True, or it raises an exception. And so on. The point here is: make a class to get what you actually need, and _don't_ stuff variable and hard to agree on extra semantics inside multiple basic utility classes like struct. For myself, the CornuCopyBuffer is now my universal interface to byte streams (binary files, TCP connections, whatever) which need binary parsing, and it has the methods and internal logic to provide that, including presenting a simple read only file-like interface with read and seek-forward, should I need to pass it to a file-expecting object. Do it _once_, and don't megacomplicatise all the existing utility classes. Cheers, Cameron Simpson From steve at pearwood.info Wed Dec 26 19:42:30 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 27 Dec 2018 11:42:30 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: References: <20181224133313.GP13061@ando.pearwood.info> <20181224211733.GQ13061@ando.pearwood.info> <20181226051134.GV13061@ando.pearwood.info> <20181226092518.GX13061@ando.pearwood.info> Message-ID: <20181227004230.GC10079@ando.pearwood.info> On Wed, Dec 26, 2018 at 01:32:38PM +0000, Paul Moore wrote: > On Wed, 26 Dec 2018 at 09:26, Steven D'Aprano wrote: > > Regardless, my point doesn't change. That has nothing to do with the > > behaviour of unpack. If you pass a non-blocking file-like object which > > returns None, you get exactly the same exception as if you wrote > > > > unpack(fmt, f.read(size)) > > > > and the call to f.read returned None. Why is it unpack's responsibility > > to educate the caller that f.read can return None? > > Abstraction, basically - once the unpack function takes responsibility > for doing the read, and hiding the fact that there's a read going on > behind an API unpack(fmt, f), it *also* takes on responsibility for > managing all of the administration of that read call. As I keep pointing out, the json.load and pickle.load functions don't take on all that added administration. Neither does marshal, or zipfile, and I daresay there are others. Why does "abstraction" apply to this proposal but not the others? If you pass a file-like object to marshal.load that returns less than a full record, it simply raises an exception. There's no attempt to handle non-blocking streams and re-read until it has a full record: py> class MyFile: ... def read(self, n=-1): ... print("reading") ... return marshal.dumps([1, "a"])[:5] ... py> marshal.load(MyFile()) reading Traceback (most recent call last): File "", line 1, in EOFError: EOF read where object expected The use-case for marshall.load is to read a valid, complete marshall record from a file on disk. Likewise for json.load and pickle.load. There's no need to complicate the implementation by handling streams from ttys and other exotic file-like objects. Likewise there's zipfile, which also doesn't take on this extra responsibility. It doesn't try to support non-blocking streams which return None, for example. It assumes the input file is seekable, and doesn't raise a dedicated error for the case that it isn't. 
Nor does it support non-blocking streams by looping until it has read the data it expects. The use-case for unpack with a file object argument is the same. Why should we demand that it alone take on this unnecessary, unwanted, unused extra responsibility? It seems to me that only people insisting that unpack() take on this extra responsibility are those who are opposed to the proposal. We're asking for a battery, and they're insisting that we actually need a nuclear reactor, and rejecting the proposal because nuclear reactors are too complex. Here are some of the features that have been piled on to the proposal: - you need to deal with non-blocking streams that return None; - if you read an incomplete struct, you need to block and read in a loop until the struct is complete; - you need to deal with OS errors in some unspecified way, apart from just letting them bubble up to the caller. The response to all of these are: No we don't need to do these things, they are all out of scope for the proposal and other similar functions in the standard library don't do them. These are examples of over-engineering and YAGNI. *If* (a very big if!) somebody requests these features in the future, then they'll be considered as enhancement requests. The effort required versus the benefit will be weighed up, and if the benefit exceeds the costs, then the function may be enhanced to support streams which return partial records. The benefit will need to be more than just "abstraction". If there are objective, rational reasons for unpack() taking on these extra responsibilities, when other stdlib code doesn't, then I wish people would explain what those reasons are. Why does "abstraction" apply to struct.unpack() but not json.load()? I'm willing to be persuaded, I can change my mind. When Andrew suggested that unpack would need extra code to generate better error messages, I tested a few likely exceptions, and ended up agreeing that at least one and possibly two such enhancements were genuinely necessary. Those better error messages ended up in my subsequent proof-of-concept implementations, tripling the size from five lines to fifteen. (A second implementation reduced it to twelve.) But it irks me when people unnecessarily demand that new proposals are written to standards far beyond what the rest of the stdlib is written to. (I'm not talking about some of the venerable old, crufty parts of the stdlib dating back to Python 1.4, I'm talking about actively maintained, modern parts like json.) Especially when they seem unwilling or unable to explain *why* we need to apply such a high standard. What's so specially about unpack() that it has to handle these additional use-cases? If an objection to a proposal equally applies to parts of the stdlib that are in widepread use without actually being a problem in practice, then the objection is probably invalid. Remember the Zen: Now is better than never. Although never is often better than *right* now. Even if we do need to deal with rare, exotic or unusual input, we don't need to deal with them *right now*. When somebody submits an enhancement request "support non-blocking streams", we can deal with it then. Probably by rejecting it. 
-- Steve From boxed at killingar.net Wed Dec 26 20:53:44 2018 From: boxed at killingar.net (=?utf-8?Q?Anders_Hovm=C3=B6ller?=) Date: Thu, 27 Dec 2018 02:53:44 +0100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181226230209.GA68021@cskk.homeip.net> References: <20181226230209.GA68021@cskk.homeip.net> Message-ID: <49CC98BB-A877-41F1-A772-4A5D14B16461@killingar.net> > And this is why I, personally, think augumenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms. > > And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread). Seems like that should be in the standard library then! / Anders From steve at pearwood.info Wed Dec 26 20:59:39 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 27 Dec 2018 12:59:39 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181226230209.GA68021@cskk.homeip.net> References: <20181226230209.GA68021@cskk.homeip.net> Message-ID: <20181227015939.GA13061@ando.pearwood.info> On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote: [...] > >Also I'm thinking about type annotations in typeshed. > >Now the type is Union[array[int], bytes, bytearray, memoryview] > >Should it be Union[io.BinaryIO, array[int], bytes, bytearray, > >memoryview] ? > > And this is why I, personally, think augumenting struct.unpack and > json.read and a myriad of other arbitrary methods to accept both > file-like things and bytes is an open ended can of worms. I presume you mean json.load(), not read, except that it already reads from files. Nobody is talking about augmenting "a myriad of other arbitrary methods" except for you. We're talking about enhancing *one* function to be a simple generic function. I assume you have no objection to the existence of json.load() and json.loads() functions. (If you do think they're a bad idea, I don't know what to say.) Have they lead to "an open ended can of worms"? If we wrote a simple wrapper: def load(obj, *args, **kwargs): if isinstance(obj, str): return json.loads(obj, *args, **kwargs) else: return json.load(obj, *args, **kwargs) would that lead to "an open ended can of worms"? These aren't rhetoricial questions. I'd like to understand your objection. You have dismissed what seems to be a simple enhancement with a vague statement about hypothetical problems. Please explain in concrete terms what these figurative worms are. Let's come back to unpack. Would you object to having two separate functions that matched (apart from the difference in name) the API used by json, pickle, marshal etc? - unpack() reads from files - unpacks() reads from strings Obviously this breaks backwards compatibility, but if we were designing struct from scratch today, would this API open a can of worms? (Again, this is not a rhetorical question.) Let's save backwards compatibility: - unpack() reads from strings - unpackf() reads from files Does this open a can of worms? Or we could use a generic function. There is plenty of precedent for generic files in the stdlib. For example, zipfile accepts either a file name, or an open file object. def unpack(fmt, frm): if hasattr(frm, "read"): return _unpack_file(fmt, frm) else: return _unpack_bytes(fmt, frm) Does that generic function wrapper create "an open ended can of worms"? If so, in what way? 
I'm trying to understand where the problem lies, between the existing APIs used by json etc (presumably they are fine) and the objections to using what seems to be a very similar API for unpack, offerring the same functionality but differing only in spelling (a single generic function instead of two similarly-named functions). > And it is why I wrote myself my CornuCopyBuffer class (see my other post > in this thread). [...] > The return from .take is typically a > memoryview from `bfr`'s internal buffer - it is _always_ exactly `size` > bytes long if you don't pass short_ok=True, or it raises an exception. That's exactly the proposed semantics for unpack, except there's no "short_ok" parameter. If the read is short, you get an exception. > And so on. > > The point here is: make a class to get what you actually need Do you know better than the OP (Drew Warwick) and James Edwards what they "actually need"? How would you react if I told you that your CornuCopyBuffer class, is an over-engineered, over-complicated, over-complex class that you don't need? You'd probably be pretty pissed off at my arrogance in telling you what you do or don't need for your own use-cases. (Especially since I don't know your use-cases.) Now consider that you are telling Drew and James that they don't know their own use-cases, despite the fact that they've been working successfully with this simple enhancement for years. I'm happy for you that CornuCopyBuffer solves real problems for you, and if you want to propose it for the stdlib I'd be really interested to learn more about it. But this is actually irrelevant to the current proposal. Even if we had a CornuCopyBuffer in the std lib, how does that help? We will still need to call struct.calcsize(format) by hand, still need to call read(size) by hand. Your CornuCopyBuffer does nothing to avoid that. The point of this proposal is to avoid that tedious make-work, not increase it by having to wrap our simple disk files in a CornuCopyBuffer before doing precisely the same make-work we didn't want to do in the first case. Drew has asked for a better hammer, and you're telling him he really wants a space shuttle. -- Steve From rosuav at gmail.com Wed Dec 26 21:15:46 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 27 Dec 2018 13:15:46 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181227015939.GA13061@ando.pearwood.info> References: <20181226230209.GA68021@cskk.homeip.net> <20181227015939.GA13061@ando.pearwood.info> Message-ID: I'm quoting Steve's post here but am responding more broadly to the whole thread too. On Thu, Dec 27, 2018 at 1:00 PM Steven D'Aprano wrote: > I assume you have no objection to the existence of json.load() and > json.loads() functions. (If you do think they're a bad idea, I don't > know what to say.) Have they lead to "an open ended can of worms"? Personally, I'd actually be -0 on json.load if it didn't already exist. It's just a thin wrapper around json.loads() - it doesn't actually add anything. This proposal is _notably better_ in that it will (attempt to) read the correct number of bytes. The only real reason to have json.load/json.loads is to match pickle etc. (Though pickle does things the other way around, at least in the Python source code I have handy - loads is implemented using BytesIO, so it's the file-based API that's fundamental, as opposed to JSON where the string-based API is fundamental. I guess maybe that's a valid reason? 
To allow either one to be implemented in terms of the other?) But reading a struct *and then leaving the rest behind* is, IMO, a more valuable feature. > Let's save backwards compatibility: > > - unpack() reads from strings > - unpackf() reads from files > > Does this open a can of worms? Not in my opinion, but I also don't think it gains you anything much. It isn't consistent with other stdlib modules, and it isn't very advantageous over the OP's idea of just having the same function able to cope with files as well as strings. The only advantage that I can see is that unpackf() might be made able to accept a pathlike, which it will open, read from, and close. (Since a pathlike could be a string, the single function would technically be ambiguous.) And I'd drop that idea in the YAGNI basket. > Or we could use a generic function. There is plenty of precedent for > generic files in the stdlib. For example, zipfile accepts either > a file name, or an open file object. > > def unpack(fmt, frm): > if hasattr(frm, "read"): > return _unpack_file(fmt, frm) > else: > return _unpack_bytes(fmt, frm) FTR, I am +0.9 on this kind of proposal - basically "just make it work" within the existing API. It's a small amount of additional complexity to support a quite reasonable use-case. > Drew has asked for a better hammer, and you're telling him he really > wants a space shuttle. But but.... a space shuttle is very effective at knocking nails into wood... also, I just want my own space shuttle. Plz? Thx. Bye! :) ChrisA From cs at cskk.id.au Thu Dec 27 00:06:46 2018 From: cs at cskk.id.au (Cameron Simpson) Date: Thu, 27 Dec 2018 16:06:46 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <49CC98BB-A877-41F1-A772-4A5D14B16461@killingar.net> References: <49CC98BB-A877-41F1-A772-4A5D14B16461@killingar.net> Message-ID: <20181227050646.GA29362@cskk.homeip.net> On 27Dec2018 02:53, Anders Hovm?ller wrote: > >> And this is why I, personally, think augumenting struct.unpack and json.read and a myriad of other arbitrary methods to accept both file-like things and bytes is an open ended can of worms. >> >> And it is why I wrote myself my CornuCopyBuffer class (see my other post in this thread). > >Seems like that should be in the standard library then! It is insufficiently used at present. The idea seems sound - a flexible adapter of bytes sources providing easy methods to aid parsing - based on how useful it has been to me. But it has rough edges and one needs to convince others of its utility before entry into the stdlib. So it is on PyPI for easy use. If you're in the binary I/O/parsing space, pip install it (and cs.binary, which utilises it) and see how they work for you. Complain to me about poor semantics or bugs. And then we'll see how general purpose it really is. The PyPI package pages for each have doco derived from the module docstrings. Cheers, Cameron Simpson From cs at cskk.id.au Thu Dec 27 01:14:35 2018 From: cs at cskk.id.au (Cameron Simpson) Date: Thu, 27 Dec 2018 17:14:35 +1100 Subject: [Python-ideas] struct.unpack should support open files In-Reply-To: <20181227015939.GA13061@ando.pearwood.info> References: <20181227015939.GA13061@ando.pearwood.info> Message-ID: <20181227061435.GA41503@cskk.homeip.net> On 27Dec2018 12:59, Steven D'Aprano wrote: >On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote: >[...] >> >Also I'm thinking about type annotations in typeshed. 
>> >Now the type is Union[array[int], bytes, bytearray, memoryview] >> >Should it be Union[io.BinaryIO, array[int], bytes, bytearray, >> >memoryview] ? >> >> And this is why I, personally, think augumenting struct.unpack and >> json.read and a myriad of other arbitrary methods to accept both >> file-like things and bytes is an open ended can of worms. > >I presume you mean json.load(), not read, except that it already reads >from files. Likely. Though the json module is string oriented (though if one has UTF-8 data, turning binary into that is easy). >Nobody is talking about augmenting "a myriad of other arbitrary methods" >except for you. We're talking about enhancing *one* function to be a >simple generic function. Yes, but that is how the rot sets in. Some here want to enhance json.load/loads. The OP wants to enhance struct.unpack. Yay. Now let's also do csv.reader. Etc. I think my point is twofold: once you start down this road you (a) start doing it to every parser in the stdlib and (b) we all start bikeshedding about semantics. There are at least two roads to such enhancement: make the functions polymorphic, coping with files or bytes/strs (depending), or make a parallel suite of functions like json.load/loads. The latter is basicly API bloat to little advantage. The former is rather slippery - I've a few functions myself with accept-str-or-file call modes, and _normally_ the "str" flavour is taken as a filename. But... if the function is a string parser, maybe it should parse the string itself? Already the choices are messy. And both approaches have much bikeshedding. Some of us would like something like struct.unpack to pull enough data from the file even if the file returns short reads. You, I gather, generally like the shim to be very shallow and have a short read cause an exception through insufficient data. Should the file version support an optional seek/offset argument? The example from James suggests that such a thing would benefit him. And so on. And this argument has to play out for _every_ parser interface you want to adapt for both files and direct bytes/str (again, depending). >I assume you have no objection to the existence of json.load() and >json.loads() functions. (If you do think they're a bad idea, I don't >know what to say.) Have they lead to "an open ended can of worms"? On their own, no. The isolated example never starts that way. But really consistency argues that the entire stdlib should have file and str/bytes parallel functions across all parsers. And _that_ is a can of worms. >If we wrote a simple wrapper: > >def load(obj, *args, **kwargs): > if isinstance(obj, str): > return json.loads(obj, *args, **kwargs) > else: > return json.load(obj, *args, **kwargs) > >would that lead to "an open ended can of worms"? Less so. I've a decorator of my own called @strable, which wraps other functions; it intercepts the first positional argument if it is a str and replaces it with something derived from it. The default mode is an open file, with the str as the filename, but it is slightly pluggable. Such a decorator could reside in a utility stdlib module and become heavily used in places like json.load if desired. >These aren't rhetoricial questions. I'd like to understand your >objection. You have dismissed what seems to be a simple enhancement with >a vague statement about hypothetical problems. Please explain in >concrete terms what these figurative worms are. 
I'm hoping my discussion above shows where I think the open ended side of the issue arises: once we do it to one function we sort of want to do it to all similar functions, and there are multiple defensible ways to do it. >Let's come back to unpack. Would you object to having two separate >functions that matched (apart from the difference in name) the API used >by json, pickle, marshal etc? > >- unpack() reads from files >- unpacks() reads from strings Well, yeah. (Presuming you mean bytes rather than strings above in the Python 3 domain.) API bloat. They are essentially identical functions in terms of utility. >Obviously this breaks backwards compatibility, but if we were designing >struct from scratch today, would this API open a can of worms? >(Again, this is not a rhetorical question.) Only in that it opens the door to doing the same for every other similar function in the stdlib. And wouldn't it be nice to have a third form to take a filename and open it? >Let's save backwards compatibility: Some degree of objection: API bloat requiring repeated bloat elsewhere. Let's set backwards compatibility aside: it halves the discussion and examples. >Or we could use a generic function. There is plenty of precedent for >generic files in the stdlib. For example, zipfile accepts either >a file name, or an open file object. Indeed, and here we are with flavour #3: the string isn't a byte sequence to parse, it is now a filename. In Python 3 we can disambiguate if we parse bytes and treat str as a filename. But what if we're parsing str, as JSON does? Now we don't know and must make a policy decision. >def unpack(fmt, frm): > if hasattr(frm, "read"): > return _unpack_file(fmt, frm) > else: > return _unpack_bytes(fmt, frm) > >Does that generic function wrapper create "an open ended can of worms"? >If so, in what way? If you were to rewrite the above in the form of my @strable decorator, provide it in a utility library, and _use_ it in unpack, I'd be +1, because the _same_ utility can be reused elsewhere by anyone for any API. Embedding it directly in unpack complicates unpack's semantics for what is essentially a shim. Here's my @strable, minus its docstring: @decorator def strable(func, open_func=None): if open_func is None: open_func = open def accepts_str(arg, *a, **kw): if isinstance(arg, str): with Pfx(arg): with open_func(arg) as opened: return func(opened, *a, **kw) return func(arg, *a, **kw) return accepts_str and an example library function: @strable def count_lines(f): count = 0 for line in f: count += 1 return count and there's a function taking an open file or a filename. But suppose we want to supply a string whose lines need counting, not a filename. We could _either_ change our policy decision from "accepts a filename" to "accepts an input string", _or_ we can start adding a third mode on top of the existing two modes. All three modes are reasonable. >I'm trying to understand where the problem lies, between the existing >APIs used by json etc (presumably they are fine) They're historic. I think I'm -0 on having 2 functions. But only because it is so easy to hand file contents to loads. >and the objections to >using what seems to be a very similar API for unpack, offerring the same >functionality but differing only in spelling (a single generic function >instead of two similarly-named functions).
I hope I've made it more clear above that my objection is to either approach (polymorphic or parallel functions) because one can write a general purpose shim and use it with almost anything, and then we can make things like json or struct accept _only_ str or bytes respectively, with _no_ complication extra semantics. Because once we do it for these 2 we _should_ do it for every parser for consistency. Yes, yes, stripping json _back_ to just loads would break backwards compatibility; I'm not proposing that for real. I'm proposing resisting extra semantic bloat in favour of a help class or decorator. Consider: from shimutils import bytes_from_file from struct import unpack unpackf = bytes_from_file(unpack) Make a bunch of shims for the common use cases and the burden on users of the various _other_ modules becomes very small, and we don't have to go to every parser API and bloat it out. Especially since we've seen the bikeshedding on semantics even on this small suggestion ("accept a file"). >> And it is why I wrote myself my CornuCopyBuffer class (see my other >> post in this thread). >[...] >> The return from .take is typically a >> memoryview from `bfr`'s internal buffer - it is _always_ exactly `size` >> bytes long if you don't pass short_ok=True, or it raises an exception. > >That's exactly the proposed semantics for unpack, except there's no >"short_ok" parameter. If the read is short, you get an exception. And here we are. Bikeshedding already! My CCB.take (for short) raises an exception on _insufficient_ data, not a short read. It does enough reads to get the data demanded. If I _want_ to know that a read was short I can pass short_ok=True and examine the result before use. Its whole point is to give the right data to the caller. Let me give you some examples: I run som binary protocols over TCP streams. They're not network packets; the logical packets can span IP packets, and of course conversely several small protocol packets may fit in a single network packet because they're assembled in a buffer at the sending end (via plain old file.write). Via a CCB the receiver _doesn't care_. Ask for the required data, the CCB gathers enough and hands it over. I parse MP4 files. The ISO14496 packet structure has plenty of structures of almost arbitrary size, particularly the media data packet (MDAT) which can be gigabytes in size. You're _going_ to get a short read there. I'd be annoyed by an exception. >> And so on. >> >> The point here is: make a class to get what you actually need > >Do you know better than the OP (Drew Warwick) and James Edwards what >they "actually need"? No, but I know what _I_ need. A flexible controller with several knobs to treat input in various common ways. >How would you react if I told you that your CornuCopyBuffer class, is an >over-engineered, over-complicated, over-complex class that you don't >need? You'd probably be pretty pissed off at my arrogance in telling you >what you do or don't need for your own use-cases. (Especially since I >don't know your use-cases.) Some examples above. There's a _little_ over engineering, but it actually solves a _lot_ of problems, making everything else MUCH MUCH simpler. >Now consider that you are telling Drew and James that they don't know >their own use-cases, despite the fact that they've been working >successfully with this simple enhancement for years. I'm not. 
I'm _suggesting_ that _instead_ of embedding extra semantics, which we can't even all agree on, into parser libraries, it is often better to make it easy to give the parser what its _current_ API accepts. And that the tool to do that should be _outside_ those parser modules, not inside, because it can be generally applicable. >I'm happy for you that CornuCopyBuffer solves real problems for you, >and if you want to propose it for the stdlib I'd be really interested >to learn more about it. Not yet. Slightly rough and the user audience is basically me right now. But feel free to pip install cs.buffer and cs.binary and have a look. >But this is actually irrelevant to the current proposal. Even if we had >a CornuCopyBuffer in the std lib, how does that help? We will still need >to call struct.calcsize(format) by hand, still need to call read(size) >by hand. Your CornuCopyBuffer does nothing to avoid that. No, but its partner cs.binary _does_. As described in my first post to this thread. Have a quick reread, particularly near the "PDInfo" example. >The point of this proposal is to avoid that tedious make-work, not >increase it by having to wrap our simple disk files in a CornuCopyBuffer >before doing precisely the same make-work we didn't want to do in the >first case. > >Drew has asked for a better hammer, and you're telling him he really >wants a space shuttle. To my eye he asked to make unpack into a multitool (bytes and files), and I'm suggesting maybe he should get a screwdriver to go with his hammer (to use as a chisel, of course). Anyway, I'm making 2 arguments: - don't bloat the stdlib APIs to accommodate things much beyond their core - offer a tool to make the things beyond the core _easily_ available for use in the core way The latter can then _also_ be used with other APIs not yet extended. Cheers, Cameron Simpson From malincns at 163.com Thu Dec 27 06:48:40 2018 From: malincns at 163.com (Ma Lin) Date: Thu, 27 Dec 2018 19:48:40 +0800 Subject: [Python-ideas] Add regex pattern literal p"" Message-ID: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> We can use this literal to represent a compiled pattern, for example: >>> p"(?i)[a-z]".findall("a1B2c3") ['a', 'B', 'c'] >>> compiled = p"(?<=abc)def" >>> m = compiled.search('abcdef') >>> m.group(0) 'def' >>> rp'\W+'.split('Words, words, words.') ['Words', 'words', 'words', ''] This allows the peephole optimizer to store the compiled pattern in the .pyc file; we can get a performance optimization like replacing a constant set by a frozenset in the .pyc file. Then such issue [1] can be solved perfectly. [1] Optimize base64.b16decode to use compiled regex [1] https://bugs.python.org/issue35559 Two shortcomings: 1, Elevating a class in a module (re.Pattern) to language level, this sounds not very natural. This makes Python look like Perl. 2, We can't use regex module as a drop-in replacement: import regex as re IMHO, I would like to see regex module be adopted into stdlib after cutting off its "full case-folding" and "fuzzy matching" features. Related links: [2] Chris Angelico conceived of "compiled regexes be stored in .pyc file" in March 2013. [2] https://mail.python.org/pipermail/python-ideas/2013-March/020043.html [3] Ken Hilton conceived of "Give regex operations more sugar" in June 2018.
[3] https://mail.python.org/pipermail/python-ideas/2018-June/051395.html From rosuav at gmail.com Thu Dec 27 07:11:29 2018 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 27 Dec 2018 23:11:29 +1100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: On Thu, Dec 27, 2018 at 10:49 PM Ma Lin wrote: > > We can use this literal to represent a compiled pattern, for example: > > >>> p"(?i)[a-z]".findall("a1B2c3") > ['a', 'B', 'c'] > > >>> compiled = p"(?<=abc)def" > >>> m = compiled.search('abcdef') > >>> m.group(0) > 'def' > > >>> rp'\W+'.split('Words, words, words.') > ['Words', 'words', 'words', ''] > > This allows peephole optimizer to store compiled pattern in .pyc file, > we can get performance optimization like replacing constant set by > frozenset in .pyc file. Before discussing something specific like regex literal syntax, I would love to see a way to measure that sort of performance difference. Does anyone here have MacroPy experience or something and could mock something up that would precompile and save a regex? In theory, it would be possible to tag ANY value as "constant once evaluated" and have it saved in the pyc. It'd be good to know just how much benefit this precompilation actually grants. > [2] Chris Angelico conceived of "compiled regexes be stored in .pyc > file" in March 2013. > [2] https://mail.python.org/pipermail/python-ideas/2013-March/020043.html Wow that's an old post of mine :) ChrisA From malincns at 163.com Thu Dec 27 08:15:10 2018 From: malincns at 163.com (Ma Lin) Date: Thu, 27 Dec 2018 21:15:10 +0800 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: <473b160f-a7a4-45de-7343-e18567941817@163.com> > It'd be good to know just how much benefit this precompilation actually grants. As far as I know, Pattern objects in regex module can be pickled, don't know if it's useful. >>> import pickle >>> import regex >>> p = regex.compile('[a-z]') >>> b = pickle.dumps(p) >>> p = pickle.loads(b) > Wow that's an old post of mine I searched on Google before post this, hope there is no omission. From boxed at killingar.net Thu Dec 27 09:01:02 2018 From: boxed at killingar.net (=?utf-8?Q?Anders_Hovm=C3=B6ller?=) Date: Thu, 27 Dec 2018 15:01:02 +0100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: > We can use this literal to represent a compiled pattern, for example: > > >>> p"(?i)[a-z]".findall("a1B2c3") > ['a', 'B', 'c'] There are some other advantages to this. For me the most interesting is that we can know from code easier that something is a regex. For my mutation tester mutmut I have an experimental regex mutation system but it just feels wrong to write hacky heuristics to guess if a string is a regex. And it's complicated to look at too much context (although I'm working on ways to make that type of thing radically nicer to do). It would be much nicer if I could just know based on the AST node type. I guess the same goes for static analyzers. 
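To make that concrete, the kind of heuristic I mean today is roughly this (a simplified sketch, not mutmut's actual code; the function name is made up):

import ast

def guess_regex_literals(source):
    # Heuristic: collect string literals passed as the first argument to re.compile().
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "compile"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.args
                and isinstance(node.args[0], ast.Str)):
            found.append(node.args[0].s)
    return found

A pattern assigned to a variable first, or passed through a helper, is invisible to this, which is exactly why a dedicated AST node (or literal) would be nicer to work with.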
/ Anders From stefan_ml at behnel.de Thu Dec 27 09:42:17 2018 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 27 Dec 2018 15:42:17 +0100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <473b160f-a7a4-45de-7343-e18567941817@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <473b160f-a7a4-45de-7343-e18567941817@163.com> Message-ID: Ma Lin schrieb am 27.12.18 um 14:15: >> It'd be good to know just how much benefit this precompilation actually > grants. > > As far as I know, Pattern objects in regex module can be pickled, don't > know if it's useful. > >>>> import pickle >>>> import regex That's from the external regex package, not the stdlib re module. >>>> p = regex.compile('[a-z]') >>>> b = pickle.dumps(p) >>>> p = pickle.loads(b) Look a little closer: >>> import pickle, re >>> p = re.compile("[abc]") >>> pickle.dumps(p) b'\x80\x03cre\n_compile\nq\x00X\x05\x00\x00\x00[abc]q\x01K \x86q\x02Rq\x03.' What this does, essentially, is to make the pickle loader pass the original regex pattern string into re.compile() to "unpickle" it. Meaning, it compiles the regex on the way in. Thus, there isn't much to gain from using (the current form of) regex pickling here. I'm not saying that this can't be changed, but personally, this is exactly what I would do if I was asked to make a compiled regex picklable. Everything else would probably get you into portability hell. Stefan From rosuav at gmail.com Thu Dec 27 12:27:46 2018 From: rosuav at gmail.com (Chris Angelico) Date: Fri, 28 Dec 2018 04:27:46 +1100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <473b160f-a7a4-45de-7343-e18567941817@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <473b160f-a7a4-45de-7343-e18567941817@163.com> Message-ID: On Fri, Dec 28, 2018 at 12:15 AM Ma Lin wrote: > > > It'd be good to know just how much benefit this precompilation > actually grants. > > As far as I know, Pattern objects in regex module can be pickled, don't > know if it's useful. > > >>> import pickle > >>> import regex > >>> p = regex.compile('[a-z]') > >>> b = pickle.dumps(p) > >>> p = pickle.loads(b) What Stefan pointed out regarding the stdlib's "re" module is also true of the third party "regex" - unpickling just compiles from the original string. Regarding pyc files, though, pickle is less significant than marshal. And both re.compile() and regex.compile() return unmarshallable objects. Fortunately, marshal doesn't need to produce cross-compatible files, so the portability issues don't apply. So, let's suppose that marshalling a compiled regex became possible. It would need to be (a) absolutely guaranteed to have the same effect as compiling the original text string, and (b) faster than compiling the original text string, otherwise it's useless. This is where testing would be needed: can it actually save any significant amount of time? > > Wow that's an old post of mine > I searched on Google before post this, hope there is no omission. You're absolutely fine :) I was amused to find that a post of mine from nearly six years ago should be the most notable on the subject, is all. Good work digging it up. 
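As a first cut, something along these lines would at least bound the possible win (a rough sketch, not a rigorous benchmark; the pattern and the resulting numbers are obviously arbitrary):

import re, timeit

pattern = r"(?i)[a-z]+\d{2,4}"

# Full recompilation every time: re.purge() empties re's internal cache.
full_compile = timeit.timeit("re.purge(); re.compile(pattern)",
                             globals=globals(), number=10_000)
# The normal path today: re.match() finds the compiled pattern in the cache.
cached = timeit.timeit("re.match(pattern, 'a1B2c3')",
                       globals=globals(), number=100_000)
# Matching on an already-compiled pattern: the ceiling for any literal.
compiled = re.compile(pattern)
precompiled = timeit.timeit("compiled.match('a1B2c3')",
                            globals=globals(), number=100_000)

print(full_compile, cached, precompiled)

The gap between the last two numbers is roughly the per-call cost of the cache lookup, and the first number bounds what a stored, pre-compiled pattern could save at import time.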
ChrisA From python at mrabarnett.plus.com Thu Dec 27 12:47:46 2018 From: python at mrabarnett.plus.com (MRAB) Date: Thu, 27 Dec 2018 17:47:46 +0000 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: <4c7d51ba-9022-8e8e-b847-61ce4f575bf4@mrabarnett.plus.com> On 2018-12-27 11:48, Ma Lin wrote: [snip] > 2, We can't use regex module as a drop-in replacement: import regex as re > IMHO, I would like to see regex module be adopted into stdlib after > cutting off its "full case-folding" and "fuzzy matching" features. > I think that omitting full casefolding would be a bad idea; after all, strings (in Python 3) have a .casefold method. From steve at pearwood.info Thu Dec 27 17:00:43 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 28 Dec 2018 09:00:43 +1100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <4c7d51ba-9022-8e8e-b847-61ce4f575bf4@mrabarnett.plus.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <4c7d51ba-9022-8e8e-b847-61ce4f575bf4@mrabarnett.plus.com> Message-ID: <20181227220043.GB13061@ando.pearwood.info> On Thu, Dec 27, 2018 at 05:47:46PM +0000, MRAB wrote: > On 2018-12-27 11:48, Ma Lin wrote: > [snip] > >2, We can't use regex module as a drop-in replacement: import regex as re > >IMHO, I would like to see regex module be adopted into stdlib after > >cutting off its "full case-folding" and "fuzzy matching" features. > > > I think that omitting full casefolding would be a bad idea; after all, > strings (in Python 3) have a .casefold method. And I don't understand why omitting fuzzy matching is a good idea. If you don't want fuzzy matching, don't use it in your code. But why remove it? -- Steve From malincns at 163.com Fri Dec 28 04:54:48 2018 From: malincns at 163.com (Ma Lin) Date: Fri, 28 Dec 2018 17:54:48 +0800 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <473b160f-a7a4-45de-7343-e18567941817@163.com> Message-ID: Reply to Stefan Behnel and Chris Angelico. On 18-12-27 22:42, Stefan Behnel wrote: > >>> import pickle, re > >>> p = re.compile("[abc]") > >>> pickle.dumps(p) > b'\x80\x03cre\n_compile\nq\x00X\x05\x00\x00\x00[abc]q\x01K \x86q\x02Rq\x03.' > > What this does, essentially, is to make the pickle loader pass the original regex pattern string into re.compile() to "unpickle" it. Meaning, it compiles the regex on the way in. Thus, there isn't much to gain from using (the current form of) regex pickling here. Yes, the re module only pickles the pattern string and flags, so it's safe for cross-version pickle/unpickle. re module's pickle code:

def _pickle(p):
    return _compile, (p.pattern, p.flags)
copyreg.pickle(Pattern, _pickle, _compile)

On 18-12-28 1:27, Chris Angelico wrote: > What Stefan pointed out regarding the stdlib's "re" module is also > true of the third party "regex" - unpickling just compiles from the > original string. I have followed the regex module for a year; it does pickle the compiled data. This is its code:

def _pickle(pattern):
    return _regex.compile, pattern._pickled_data
_copy_reg.pickle(Pattern, _pickle)

// in _regex.c file
self->pickled_data = Py_BuildValue("OnOOOOOnOnn", pattern, flags,
    code_list, groupindex, indexgroup, named_lists, named_list_indexes,
    req_offset, required_chars, req_flags, public_group_count);
if (!self->pickled_data) {
    Py_DECREF(self);
    return NULL;
}

From malincns at 163.com Fri Dec 28 04:55:46 2018 From: malincns at 163.com (Ma Lin) Date: Fri, 28 Dec 2018 17:55:46 +0800 Subject: [Python-ideas] In fact, I'm a bit worry about this literal p"" In-Reply-To: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: <2841cffb-c448-9e9d-5d67-7437cf0c6f57@163.com> Maybe this literal will encourage people to finish tasks using regex, or even lead to abusing regex; will this change Python's style? What's worse is, people using mixed manners in the same project: one_line.split(',') ... p','.split(one_line) Maybe it will break Python's style, reduce code readability, is this worth it? From jsbueno at python.org.br Fri Dec 28 09:54:00 2018 From: jsbueno at python.org.br (Joao S. O. Bueno) Date: Fri, 28 Dec 2018 12:54:00 -0200 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: I am a full -1 on this idea - > Two shortcomings: > > 1, Elevating a class in a module (re.Pattern) to language level, this > sounds not very natural. > This makes Python look like Perl. > > 2, We can't use regex module as a drop-in replacement: import regex as re > IMHO, I would like to see regex module be adopted into stdlib after > cutting off its "full case-folding" and "fuzzy matching" features. > Sorry for sounding over-reactive, but yes, this could make Python look like Perl. I think one full advantage of Python is exactly that regexps are treated fairly, with no special syntax. You call a function, or build an instance, and have the regex power, and that is it. And you can just plug any third-party regex module, and it will work just like the one that is built into the language. This proposal at least keeps the ' " ' quotes - so we don't end up like Javascript which has a "squeashy regexy" thing that can sneak in code and you are never sure when it is run, or even if it can be assigned to a variable at all. I am quite sure that if the matter is performance, a way to pickle, or somehow store pre-compiled regexes can be found without requiring special syntax. And a 3rd shortcoming - flags can't be passed as parameters, and have to be built into the regexp themselves, further complicating the readability even for very simple regular expressions. Other than that it would not be much different from the ' f" ' strings thing, indeed. On Thu, 27 Dec 2018 at 09:49, Ma Lin wrote: > Related links: > > [2] Chris Angelico conceived of "compiled regexes be stored in .pyc > file" in March 2013. > [2] https://mail.python.org/pipermail/python-ideas/2013-March/020043.html > > [3] Ken Hilton conceived of "Give regex operations more sugar" in June 2018. > [3] https://mail.python.org/pipermail/python-ideas/2018-June/051395.html > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ From malincns at 163.com Fri Dec 28 21:42:13 2018 From: malincns at 163.com (Ma Lin) Date: Sat, 29 Dec 2018 10:42:13 +0800 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> On 18-12-28 22:54, Joao S. O. Bueno wrote: > Sorry for sounding over-reactive, but yes, this could make Python look > like Perl.
Yes, this may introduce Perl's style irreversibly; we need to be cautious about this. I'm thinking: if people ask these questions in their minds when reading a piece of Python code: 1, "Is this Python code?" 2, "What's the purpose of this code?" 3, "How can I modify it if I want to ... ?" then maybe Python is on a doubtful path. There is an interesting question: Will a literal p"" ruin the style of Python (or of other dynamic languages like Ruby)? Why would this happen? > And a 3rd shortcoming - flags can't be passed as parameters, and have > to be built into the regexp themselves, further complicating the readability even > for very simple regular expressions. IMO this is an advantage, it's hard to omit flags when reading/copying a regex pattern. From python at 2sn.net Sat Dec 29 00:29:32 2018 From: python at 2sn.net (Alexander Heger) Date: Sat, 29 Dec 2018 16:29:32 +1100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> Message-ID: for regular strings one can write "aaa" + "bbb" which also works for f-strings, r-strings, etc.; in regular expressions, there is, e.g., parameter counting and references to numbered matches. How would that be dealt with in a compound p-string? Either it would have to be re-compiled or not, either way could lead to unexpected results p"(\d)\1" + p"(\s)\1" or p"^(\w)" + p"^(\d)" regular strings can be added, but the results of p-strings could not - well, they are not strings. This brings me to the point that the key difference is that f- and r- strings actually return strings, whereas p- string would return a different kind of object. That would seem certainly very confusing to novices - and also for the language standard as a whole. -Alexander -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Dec 29 01:52:44 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 29 Dec 2018 17:52:44 +1100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> Message-ID: <20181229065243.GH13616@ando.pearwood.info> On Sat, Dec 29, 2018 at 04:29:32PM +1100, Alexander Heger wrote: > for regular strings one can write > > "aaa" + "bbb" > > which also works for f-strings, r-strings, etc.; in regular expressions, > there is, e.g., parameter counting and references to numbered matches. How > would that be dealt with in a compound p-string? Either it would have to > be re-compiled or not, either way could lead to unexpected results What does Perl do? > p"(\d)\1" + p"(\s)\1" Since + is used for concatenation, then that would obviously be the same as: p"(\d)\1(\s)\1" Whether it gets done at compile-time or run-time depends on how smart the keyhole optimiser is. If it is smart enough to recognise regex literals, it could fold the two strings together and regex-compile them at python-compile time, otherwise it could be equivalent to: _t1 = re.compile(r"(\d)\1") # compile-time _t2 = re.compile(r"(\s)\1") # compile-time re.compile(_t1.pattern + _t2.pattern) # run-time Obviously that defeats the purpose of using a p"" pre-compiled regex object, but the answer to that is either: 1. Don't do that then; or 2. We better make sure the keyhole optimizer is smarter. Or we just ban concatenation. "P-strings" aren't strings, even though they look like them.
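(For what it's worth, banning concatenation would also match the status quo for compiled patterns today. A quick interactive check, output reproduced from memory so the exact wording may vary by version:

>>> import re
>>> re.compile(r"(\d)\1") + re.compile(r"(\s)\1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 're.Pattern' and 're.Pattern'

so a p"" object that refused + would at least be consistent with re.compile().)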
> This brings me to the point that > the key difference is that f- and r- strings actually return strings, To be precise, f-"strings" are actually code that returns a string when executed at runtime; r-strings are literal syntax for strings. > whereas p- string would return a different kind of object. > That would seem certainly very confusing to novices - and also for the > language standard as a whole. Indeed. Perhaps something like \\regex\\ would be better, *if* this feature is desired. -- Steve From neatnate at gmail.com Sat Dec 29 01:56:19 2018 From: neatnate at gmail.com (Nathan Schneider) Date: Sat, 29 Dec 2018 01:56:19 -0500 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> Message-ID: On Sat, Dec 29, 2018 at 12:30 AM Alexander Heger wrote: > for regular strings one can write > > "aaa" + "bbb" > > which also works for f-strings, r-strings, etc.; in regular expressions, > there is, e.g., parameter counting and references to numbered matches. How > would that be dealt with in a compound p-string? Either it would have to > re-compiled or not, either way could lead to unexpected results > > p"(\d)\1" + p"(\s)\1" > > or > > p"^(\w)" + p"^(\d)" > > regular strings can be added, bu the results of p-string could not - well, > their are not strings. > Isn't this a feature, not a bug, of encouraging literals to be specified as patterns: addition of patterns would raise an error (as is currently the case for addition of compiled patterns in the re and regex modules)? Currently, I find it easiest to use r-strings for patterns and call re.search() etc. without precompiling them, which means that I could accidentally concatenate two patterns together that would silently produce an unmatchable pattern. Using p-literals for most patterns would mean I have to be explicit in the exceptional case where I do want to assemble a pattern from multiple parts: FIRSTNAME = p"[A-Z][-A-Za-z']+" LASTNAME = p"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?" FULLNAME = FIRSTNAME + p' ' + LASTNAME # error FIRSTNAME = r"[A-Z][-A-Za-z']+" LASTNAME = r"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?" FULLNAME = re.compile(FIRSTNAME + ' ' + LASTNAME) # success Another potential advantage is that an ill-formed p-literal (such as a mismatched parenthesis) would be caught immediately, rather than when it is first used. This could pay off, for example, if I am defining a data structure with a bunch of regexes that would get used for different input. (But there may be performance tradeoffs here.) > This brings me to the point that > the key difference is that f- and r- strings actually return strings, > whereas p- string would return a different kind of object. > That would seem certainly very confusing to novices - and also for the > language standard as a whole. > > The b prefix produces a bytes literal. Is a bytes object a kind of string, more so than a regex pattern is? I could see an argument that bytes is a particular encoding of sequential character data, whereas a regex pattern represents a string *language*, i.e. an abstraction over string data. But...this distinction starts to feel very theoretical rather than practical. If novices are expected to read code with regular expressions in it, why would they have trouble understanding that the "p" prefix means "pattern"? 
As someone who works with text a lot, I think there's a decent practicality-beats-purity argument in favor of p-literals, which would make regex operations more easily accessible and prevent patterns from being mixed up with string data. A potential downside, though, is that it will be tempting to introduce flags as prefixes, too. Do we want to go down the road of pui"my Unicode-compatible case-insensitive pattern"? Nathan > -Alexander > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From malincns at 163.com Sat Dec 29 06:49:41 2018 From: malincns at 163.com (Ma Lin) Date: Sat, 29 Dec 2018 19:49:41 +0800 Subject: [Python-ideas] Use p"" to represent `pattern_str` -- a subclass of `str` In-Reply-To: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: I have a compromise idea, here is some points: 1, Create a built-in class `pattern_str` which is a subclass of `str`, it's dedicated to regex pattern string. 2, Use p"" to represent `pattern_str`. Some advantages: 1, Since it's a subclass of `str`, we can use it as normal `str`. 2, IDE/linter/compiler can identify it as an regex pattern, something like type hint in language level. 3, We can still store compiled pattern in .pyc file *quietly*. 4, Won't introduce Perl style into Python, to avoid abusing regex in some degree. We still using regex in the old way: import re re.search(p"(?i)[a-z]", s) But if re.search() find the pattern is a `pattern_str`, it load compiled pattern from .pyc file directly. From greg.ewing at canterbury.ac.nz Sun Dec 30 17:55:34 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 31 Dec 2018 11:55:34 +1300 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <20181229065243.GH13616@ando.pearwood.info> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> <20181229065243.GH13616@ando.pearwood.info> Message-ID: <5C294CE6.5090403@canterbury.ac.nz> Steven D'Aprano wrote: > _t1 = re.compile(r"(\d)\1") # compile-time > _t2 = re.compile(r"(\s)\1") # compile-time > re.compile(_t1.pattern + _t2.pattern) # run-time It would be weird if p"(\d)\1" + p"(\s)\1" worked but re.compile(r"(\d)\1") + re.compile(r"(\s)\1") didn't. -- Greg From greg.ewing at canterbury.ac.nz Sun Dec 30 17:44:00 2018 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 31 Dec 2018 11:44:00 +1300 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> Message-ID: <5C294A30.6000102@canterbury.ac.nz> I don't see a justification for baking REs into the syntax of Python. In the Python world, REs are just one tool in a toolbox containing a great many tools. What's more, it's a tool that should be used with considerable reluctance, because REs are essentially unreadable, so every time you use one you're creating a maintenance headache. This quality is quite the opposite of what one would expect from a core language feature. 
-- Greg From python at 2sn.net Sun Dec 30 18:35:49 2018 From: python at 2sn.net (Alexander Heger) Date: Mon, 31 Dec 2018 10:35:49 +1100 Subject: [Python-ideas] Add regex pattern literal p"" In-Reply-To: <5C294A30.6000102@canterbury.ac.nz> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <95793ea0-2c9a-c76b-6b0a-f3c9f97787ed@163.com> <5C294A30.6000102@canterbury.ac.nz> Message-ID: > What's more, it's a tool that should be used > with considerable reluctance, because REs are essentially unreadable, > so every time you use one you're creating a maintenance headache. Well, it requires some experience to read REs; I have written many, and I still need to test even many basic ones thoroughly to make sure they really do what they are supposed to do. And then there is the issue that there are many different implementations; what you have to escape, etc., varies between python (raw and regular strings), emacs, grep, overleaf, ... Never mind, my main point is that they return an object that is qualitatively different from a string, for example, in terms of concatenation. I also think it is too specialised, and time-critical constant REs can be stored in the module body, etc., if need be. I do that. But since this is the ideas mailing list, and taking this thread on an excursion, maybe an "addition" operator could be defined for REs, such that re.compile(s1 + s2) == re.compile(s1) + re.compile(s2) with the restriction that s1 and s2 are each strings that are valid REs. Even that would leave questions about how to deal with compile flags; they probably should be treated the same as if they were embedded at the beginning of each string. -Alexander -------------- next part -------------- An HTML attachment was scrubbed... URL: From ubershmekel at gmail.com Mon Dec 31 03:48:56 2018 From: ubershmekel at gmail.com (Yuval Greenfield) Date: Mon, 31 Dec 2018 00:48:56 -0800 Subject: [Python-ideas] In fact, I'm a bit worry about this literal p"" In-Reply-To: <2841cffb-c448-9e9d-5d67-7437cf0c6f57@163.com> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <2841cffb-c448-9e9d-5d67-7437cf0c6f57@163.com> Message-ID: On Fri, Dec 28, 2018 at 1:56 AM Ma Lin wrote: > Maybe this literal will encourage people to finish tasks using regex, > or even lead to abusing regex; will this change Python's style? > > What's worse is, people using mixed manners in the same project: > > one_line.split(',') > ... > p','.split(one_line) > > Maybe it will break Python's style, reduce code readability, is this > worth it? > > The bar for introducing a new type of literal should be very high. Do performance numbers show this change would have a large impact for a large number of libraries and programs? In my opinion, only if this change would make 50% of programs run 50% faster then it might be worth discussing. The damage to readability, the burden of changing syntax and the burden of yet another language feature for newcomers to learn are too high. Cheers, Yuval -------------- next part -------------- An HTML attachment was scrubbed...
URL: From steve at pearwood.info Mon Dec 31 05:54:37 2018 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 31 Dec 2018 21:54:37 +1100 Subject: [Python-ideas] In fact, I'm a bit worry about this literal p"" In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <2841cffb-c448-9e9d-5d67-7437cf0c6f57@163.com> Message-ID: <20181231105437.GM13616@ando.pearwood.info> On Mon, Dec 31, 2018 at 12:48:56AM -0800, Yuval Greenfield wrote: > In my opinion, only if this change would make 50% of programs run 50% > faster then it might be worth discussing. What if it were 100% of programs 25% faster? *wink* Generally speaking, we don't introduce new syntax as a speed optimization. The main reasons to introduce syntax is for convenience and to improve the expressiveness of code. That's why we usually prefer to use operators like + and == instead of functions add() and equal(). There's nothing a list comprehension can do that a for-loop can't, but list comps are often more expressive. And the class statement is just syntactic sugar for type(name, bases, dict), but much more convenient. In this specific case, I don't think that regex literals will add much expressiveness: regex = re.compile(r"...") regex = p("...") is not that much different. -- Steve From solipsis at pitrou.net Mon Dec 31 06:23:16 2018 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 31 Dec 2018 12:23:16 +0100 Subject: [Python-ideas] No need to add a regex pattern literal References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> Message-ID: <20181231122316.28d49afc@fsol> On Thu, 27 Dec 2018 19:48:40 +0800 Ma Lin wrote: > We can use this literal to represent a compiled pattern, for example: > > >>> p"(?i)[a-z]".findall("a1B2c3") > ['a', 'B', 'c'] > > >>> compiled = p"(?<=abc)def" > >>> m = compiled.search('abcdef') > >>> m.group(0) > 'def' > > >>> rp'\W+'.split('Words, words, words.') > ['Words', 'words', 'words', ''] > > This allows peephole optimizer to store compiled pattern in .pyc file, > we can get performance optimization like replacing constant set by > frozenset in .pyc file. > > Then such issue [1] can be solved perfectly. > [1] Optimize base64.b16decode to use compiled regex > [1] https://bugs.python.org/issue35559 The simple solution to the perceived performance problem (not sure how much of a problem it is in real life) is to have a stdlib function that lazily-compiles a regex (*). Just like "re.compile", but lazy: you don't bear the cost of compiling when simply importing the module, but once the pattern is compiled, there is no overhead for looking up a global cache dict. No need for a dedicated literal. (*) Let's call it "re.pattern", for example. Regards Antoine. From antoine at python.org Mon Dec 31 06:47:10 2018 From: antoine at python.org (Antoine Pitrou) Date: Mon, 31 Dec 2018 12:47:10 +0100 Subject: [Python-ideas] No need to add a regex pattern literal In-Reply-To: References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <20181231122316.28d49afc@fsol> Message-ID: <264c35f0-ccfd-7326-931c-ce5aa098709c@python.org> Le 31/12/2018 ? 12:31, M.-A. Lemburg a ?crit?: > > We already have re.search() and re.match() which deal with compilation > on-the-fly and caching. Perhaps the documentation should hint at this > more explicitly... The complaint is that the global cache is still too costly. See measurements in https://bugs.python.org/issue35559 Regards Antoine. From mal at egenix.com Mon Dec 31 06:31:06 2018 From: mal at egenix.com (M.-A. 
Lemburg) Date: Mon, 31 Dec 2018 12:31:06 +0100 Subject: [Python-ideas] No need to add a regex pattern literal In-Reply-To: <20181231122316.28d49afc@fsol> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <20181231122316.28d49afc@fsol> Message-ID: On 31.12.2018 12:23, Antoine Pitrou wrote: > On Thu, 27 Dec 2018 19:48:40 +0800 > Ma Lin wrote: >> We can use this literal to represent a compiled pattern, for example: >> >> >>> p"(?i)[a-z]".findall("a1B2c3") >> ['a', 'B', 'c'] >> >> >>> compiled = p"(?<=abc)def" >> >>> m = compiled.search('abcdef') >> >>> m.group(0) >> 'def' >> >> >>> rp'\W+'.split('Words, words, words.') >> ['Words', 'words', 'words', ''] >> >> This allows peephole optimizer to store compiled pattern in .pyc file, >> we can get performance optimization like replacing constant set by >> frozenset in .pyc file. >> >> Then such issue [1] can be solved perfectly. >> [1] Optimize base64.b16decode to use compiled regex >> [1] https://bugs.python.org/issue35559 > > The simple solution to the perceived performance problem (not sure how > much of a problem it is in real life) is to have a stdlib function that > lazily-compiles a regex (*). Just like "re.compile", but lazy: you don't > bear the cost of compiling when simply importing the module, but once > the pattern is compiled, there is no overhead for looking up a global > cache dict. > > No need for a dedicated literal. > > (*) Let's call it "re.pattern", for example. No need for a new function :-) We already have re.search() and re.match() which deal with compilation on-the-fly and caching. Perhaps the documentation should hint at this more explicitly... https://docs.python.org/3.7/library/re.html -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Dec 31 2018) >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/ >>> Python Database Interfaces ... http://products.egenix.com/ >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/ ________________________________________________________________________ ::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/ From boxed at killingar.net Mon Dec 31 07:07:56 2018 From: boxed at killingar.net (=?utf-8?Q?Anders_Hovm=C3=B6ller?=) Date: Mon, 31 Dec 2018 13:07:56 +0100 Subject: [Python-ideas] In fact, I'm a bit worry about this literal p"" In-Reply-To: <20181231105437.GM13616@ando.pearwood.info> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <2841cffb-c448-9e9d-5d67-7437cf0c6f57@163.com> <20181231105437.GM13616@ando.pearwood.info> Message-ID: <1A069A45-DF5F-4B12-98D4-44805A79DAB3@killingar.net> > regex = re.compile(r"...") > regex = p("...") > > is not that much different. True, but when the literal is put somewhere far from the compile() call it becomes a problem for static analysis. Conceptually a regex is not a string but an embedded foreign language. That's why I think this discussion is worth having. It would be nice with a way to mark up foreign languages in a way that had some other advantages so people would be incentivised to do it, but just a way to mark it with comments would be fine too I think if it's standardized. Maybe the discussion should be expanded to cover the general case of embedded foreign languages? 
SQL, HTML, CSS and (obviously) regex come to mind. One could also think of C for stuff like CFFI. / Anders From malincns at 163.com Mon Dec 31 08:02:56 2018 From: malincns at 163.com (Ma Lin) Date: Mon, 31 Dec 2018 21:02:56 +0800 Subject: [Python-ideas] No need to add a regex pattern literal In-Reply-To: <264c35f0-ccfd-7326-931c-ce5aa098709c@python.org> References: <20f68a19-dd5d-b5cf-dbd0-3ec1a6181138@163.com> <20181231122316.28d49afc@fsol> <264c35f0-ccfd-7326-931c-ce5aa098709c@python.org> Message-ID: On 18-12-31 19:47, Antoine Pitrou wrote: > The complaint is that the global cache is still too costly. > See measurements in https://bugs.python.org/issue35559 In this issue, using a global variable `_has_non_base16_digits` [1] gives roughly a 30% speedup. Is the re module's internal cache [2] really so bad? If we rewrite the re module's cache in C and use a custom data structure, maybe we will get a small speedup. [1] `_has_non_base16_digits` in PR11287 [1] https://github.com/python/cpython/pull/11287/files [2] re module's internal cache code: [2] https://github.com/python/cpython/blob/master/Lib/re.py#L268-L295

_cache = {}  # ordered!

_MAXCACHE = 512

def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    ...
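For comparison, the lazy-compile approach Antoine suggested can already be sketched in pure Python, with no new syntax (illustrative only: `LazyPattern` is a made-up name, not an existing API, and the base16 check merely mimics what base64.b16decode does with its regex):

import re

class LazyPattern:
    # Compile on first use, then delegate attribute access to the compiled pattern.
    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None

    def __getattr__(self, name):
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return getattr(self._compiled, name)

_non_base16 = LazyPattern(rb'[^0-9A-F]')   # module level, no compile cost at import

def b16check(data):
    return _non_base16.search(data) is None

After the first call the pattern object is used directly, so there is no per-call lookup in re's global cache, which is the overhead measured in the issue linked above.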