[Python-ideas] Float range class

Fri Jan 9 21:58:43 CET 2015

On Jan 9, 2015, at 9:21, Chris Barker <chris.barker at noaa.gov> wrote:

> On Thu, Jan 8, 2015 at 7:33 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On 9 Jan 2015 03:02, "Neil Girdhar" <mistersheik at gmail.com> wrote:
>> >
>> > I agree with everyone above.  At first I was +1 on this proposal, but why not suggest to the numpy people that arange and linspace should return Sequence objects rather than numpy arrays?
>> 
> numpy arrays are a python sequence -- what do you mean here? Did you mean iterator?
> 
> This whole conversation made me think a bit about numpy and iterators -- python has been moving toward more use of iterators, maybe numpy should do the same?
> 
> But as I think about it, it's actually a totally different model -- py3 made range() an iterator, because it is most often used that way -- i.e. in a for loop or list comprehension or generator expression. When you really want the sequence, you wrap it in a list().

No, py3 made range a lazy sequence-like view, not an iterator. It has indexing, __contains__, etc. People make the same mistake about dictionary views. dict.keys and friends return sequence-like or set-like views, not iterators. (That was backported to 2.7 as dict.viewkeys and friends, but people using 2.x still don't seem to know it exists.)
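
To make the distinction concrete, here is a small sketch using only builtins (the variable names are just for illustration):

```python
# range is a lazy sequence: indexing, len() and membership work without
# materialising the values, and it can be iterated any number of times.
r = range(0, 1_000_000, 3)
tenth = r[10]                 # computed from start/step, no list is built
has_value = 999_999 in r      # arithmetic test for ints, not a scan
first_pass = list(r[:3])      # slicing a range returns another range
second_pass = list(r[:3])     # still there: nothing was consumed

# An iterator, by contrast, is one-shot and has no indexing.
it = iter(r)
try:
    it[10]
    iterator_is_indexable = True
except TypeError:
    iterator_is_indexable = False

# dict.keys() is a live, set-like view, not an iterator or a copy.
d = {"a": 1}
keys = d.keys()
d["b"] = 2                    # the view reflects the later insertion
common = keys & {"b", "z"}    # set operations work directly on the view
```

Note that the dict views are set-like rather than indexable; the point is only that they are reusable views and not one-shot iterators.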

There's no reason numpy _couldn't_ add various kinds of lazy views. It already has arrays that borrow storage from slices of other arrays, and things like scipy.sparse that generate the values on the fly. You could easily write a range class that generates values on the fly; you could even create lazy arrays by arithmetic (an a**2 that didn't store anything, but referenced a and generated each value on demand by squaring the value from a).
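
A rough pure-Python sketch of that idea, assuming nothing about how numpy would actually do it. LazyFloatRange and LazySquare are hypothetical names invented for this example, not existing numpy or stdlib types:

```python
from collections.abc import Sequence


class LazyFloatRange(Sequence):
    """Hypothetical lazy sequence: item i is start + i*step, computed on demand."""

    def __init__(self, start, step, count):
        self.start, self.step, self.count = start, step, count

    def __len__(self):
        return self.count

    def __getitem__(self, index):
        if isinstance(index, slice):
            raise TypeError("slicing is left out of this sketch")
        if index < 0:
            index += self.count
        if not 0 <= index < self.count:
            raise IndexError(index)
        return self.start + index * self.step


class LazySquare(Sequence):
    """Hypothetical lazy a**2: stores only a reference to the base sequence."""

    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        # Squares the base value on demand; IndexError propagates from base.
        return self.base[index] ** 2


a = LazyFloatRange(0.0, 0.5, 5)
squares = LazySquare(a)       # no values are stored anywhere
```

Here squares[2] is computed only when asked for, and the Sequence mixin supplies iteration and `in` on top of `__len__` and `__getitem__`. Every access goes through Python-level calls, though, which is exactly the cost a native numpy version would have to avoid.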

The reason numpy doesn't do this is that it's usually significantly faster to precompute everything, keeping each inner loop as small as possible, and keeping each function dealing only with one kind of thing (so there's no need for costly branching anywhere), despite the waste of storage (and potential cache and VM effects). Not _always_ faster, so there might be some benefits from adding laziness all over the place, but presumably often enough that it hasn't been worth anyone's effort to add it and see.

Also, the waste of storage usually comes down to peak memory being at worst 2x what you'd want (e.g., in newarr = arr * x, presumably each of the three has the same size and dtype, so having x be lazy would only reduce peak memory by a third); it's not like in Python iterator code, where you might have no in-memory sequences at all, and adding one for x might increase your storage 500000x (and, more to the point, make it linear instead of constant).

> But numpy is, at its core, about arrays, and the iteration happens INSIDE the array object. e.g. to multiply all the elements of an array by a number you do:
> 
> new_arr = arr * x
> 
> then the actual looping (iteration) happens inside numpy, with C data types, at C speed.
> 
> turning that into:
> 
> new_arr = [i*x for i in arr]
> 
> would push all the work back out into python, killing the point of numpy.

Right, but there's no reason numpy couldn't implement an arr * x for a special lazy view type x. Using Python iterators would almost certainly not work, because accessing each value is done by calling a Python function, and they can only be accessed in order, and so on. But if x were a custom type whose values could be generated random-access by a native numpy expression, that wouldn't be a problem. 

Again, the problem is that someone would have to write the code to implement that type, to multiply that type with itself or with an array, etc. Unless there's a clear win, nobody's going to write all that. 

> In fact, killing BOTH points of numpy:
> 
> 1) performance 
> 
> 2) clean readable array expressions -- that is:
> 
>     c = np.sqrt(a**2 + b**2)
> 
>     rather than:
> 
>     c = [ math.sqrt(x) for x in (x+y for x, y in zip( (x**2 for x in a),  (x**2 for x in b) ) ) ]
> 
>     OK, there may be a more readable way to write that...
> 
> Anyway, numpy is about arrays, so linspace, arange, etc create arrays.
> 
> There may well be a good reason to make a numpy iterator version of these, for when you HAVE to loop, but that wouldn't help folks that aren't using numpy anyway.
> 
> but sequences (iterators) of ranges of floating point numbers (and other "numeric-like" objects) is a generally useful thing for all users of python, and not entirely trivial to do right -- hence this conversation.

Again, this is misleading, and misses an important point:

arange is actually very easy to get right, but hard to _use_ properly. A half-open range of values from 0 to .9 by .3 is going to include a number just under 0.9 if properly implemented. However you slice it, .3*3<.9, .3+.3+.3<.9, etc., so that number belongs in the range. The problem is on the user's side, not the implementation's--using arange is effectively the same mistake as comparing floats with ==. You can use arange safely if you think through the FP issues properly (e.g., if you know that epsilon<<step, you can pass stop-epsilon instead of stop), but it's up to the user to do that thinking.
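
A minimal sketch of what that looks like in practice. frange here is a hypothetical helper written for this example (positive step only), not numpy's arange:

```python
def frange(start, stop, step):
    """Hypothetical half-open float range: yields start + n*step while < stop."""
    n = 0
    while True:
        value = start + n * step   # multiply rather than accumulate
        if value >= stop:
            break
        yield value
        n += 1


# Correctly implemented, yet surprising: 0.3*3 is slightly below 0.9 in
# binary floating point, so it belongs in the half-open range.
surprising = list(frange(0.0, 0.9, 0.3))

# The user-side fix: shave an epsilon much smaller than step off stop.
epsilon = 1e-9
intended = list(frange(0.0, 0.9 - epsilon, 0.3))
```

The first call yields four values, the last one just under 0.9; the second yields the three values the caller most likely had in mind. The implementation did nothing wrong in either case.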

linspace is harder to get right, but very easy to use properly. There's no room for rounding issues to affect things on the user side; the worst that happens is that all your values are off by up to the input error e, which is generally acceptable (as opposed to having the wrong number of values, which rarely is). And here, the implementation matters; a naive implementation will return values off by more than e.
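
For comparison, a rough sketch of the linspace side. This is an illustrative helper, not numpy's implementation and not a proposal for the exact stdlib one; it assumes the endpoint is included and num >= 2:

```python
def linspace(start, stop, num):
    """Illustrative only: num evenly spaced floats from start to stop inclusive."""
    if num < 2:
        raise ValueError("num must be at least 2")
    intervals = num - 1
    # Compute each value from its index instead of adding a step repeatedly,
    # so rounding error does not accumulate across the sequence.
    values = [start + (stop - start) * i / intervals for i in range(num)]
    values[-1] = stop              # pin the last value to the requested stop
    return values


points = linspace(0.0, 0.9, 4)
```

The caller asks for a count, so the number of values can never be off by one, and the endpoints are exactly what was passed in. Whether the interior values are as accurate as they could be is where the implementation effort goes; this sketch makes no claim to be optimal there.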

So, having linspace in the stdlib sounds good--not just to avoid people implementing it wrong, but to avoid people implementing and/or using arange when they should be using linspace.

In fact, maybe even make range's TypeError message reference linspace when given floats, as sum does for str.join?
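
For reference, a quick way to see the two existing messages side by side. The exact wording is a CPython detail and may differ between versions; error_message is just a helper for this example:

```python
def error_message(call):
    """Return the TypeError message raised by call(), or None if none is raised."""
    try:
        call()
    except TypeError as exc:
        return str(exc)
    return None


# sum() refuses a str start value and points the user at str.join.
sum_hint = error_message(lambda: sum(["a", "b"], ""))

# range() refuses floats, but its message offers no alternative.
range_message = error_message(lambda: range(0.0, 0.9, 0.3))
```

The suggestion above amounts to making the second message as helpful as the first.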

In the less common cases where arange really was what you wanted, and you know how to use it, you almost certainly can implement it yourself. 