Easier way to do this?

Wed Oct 4 18:00:32 EDT 2017

On 04/10/17 22:47, Fabien wrote:
> On 10/04/2017 10:11 PM, Thomas Jollans wrote:
>> Be warned, pandas is part of the scientific python stack, which is
>> immensely powerful and popular, but it does have a distinctive style
>> that may appear cryptic if you're used to the way the rest of the world
>> writes Python.
>
> Can you elaborate on this one? As a scientist, I am curious ;-)

Sure.

Python is GREAT at iterating. Generators are everywhere. Everyone loves
for loops. List comprehensions and generator expressions are star
features. filter and map are builtins. reduce used to be a builtin, even
though almost nobody really understood what it did.
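
(For anyone in the "never really understood reduce" camp, a minimal
sketch of what it does. In Python 3 it lives in functools; the variable
names here are just made up for illustration.)

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4]

# reduce folds the sequence from the left: ((1 + 2) + 3) + 4
total = reduce(operator.add, numbers)

# The same fold written as the explicit loop most people would reach for
running = numbers[0]
for n in numbers[1:]:
    running = running + n
```

Both spellings compute the same sum, which is probably why the loop (or
plain sum()) usually wins on readability.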

In [1]: import numpy as np

In the world of numpy (and the greater scientific stack), you don't
iterate. You don't write for loops. You have a million floats in memory
that you want to do math on - you don't want to wait for ten million
calls to __class__.__dict__['__getattr__']('__add__').__call__() or
whatever to run. In numpy land, numpy writes your loops for you. In
FORTRAN. (well ... probably C)
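
To make that concrete, here is a rough sketch of the same elementwise
addition written both ways. The arrays and names (xs, ys, looped,
vectorised) are invented for the illustration, not taken from anyone's
real code.

```python
import numpy as np

xs = np.arange(5, dtype=np.float64)   # 0.0, 1.0, 2.0, 3.0, 4.0
ys = np.full(5, 10.0)                 # five copies of 10.0

# "Traditional" Python: you write the loop, the interpreter runs it
# one element at a time
looped = [x + y for x, y in zip(xs, ys)]

# numpy: one expression, and the loop happens inside compiled code
vectorised = xs + ys
```

The results are the same; the difference is who runs the loop, and that
only starts to matter once the arrays get big.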

As I see it the main cultural difference between "traditional" Python
and numpy-Python is that numpy implicitly iterates over arrays all the
time. Python never implicitly iterates. Python is not MATLAB.

In [2]: np.array([1, 2, 3]) + np.array([-3, -2, -1])
Out[2]: array([-2,  0,  2])

In [3]: [1, 2, 3] + [-3, -2, -1]
Out[3]: [1, 2, 3, -3, -2, -1]

In numpy, operators don't mean what you think they mean.

In [4]: a = (np.random.rand(30) * 10).astype(np.int64)

In [5]: a
Out[5]:
array([6, 1, 6, 9, 1, 0, 3, 5, 8, 5, 2, 6, 1, 1, 2, 2, 4, 2, 4, 2, 5, 3, 7,
       8, 2, 5, 8, 1, 0, 8])

In [6]: a > 5
Out[6]:
array([ True, False,  True,  True, False, False, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False, False,  True,
       False, False,  True], dtype=bool)

In [7]: list(a) > 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-0c10c9961870> in <module>()
----> 1 list(a) > 5

TypeError: unorderable types: list() > int()

Suddenly, you can even compare sequences and scalars! And > no longer
gives you a bool! Madness!
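
One practical consequence, sketched with a small made-up array: since
the comparison gives you an array rather than a bool, you can't drop it
straight into an if statement. You have to say which reduction you mean.

```python
import numpy as np

a = np.array([6, 1, 6, 9])
mask = a > 5            # an array of bools, not a single bool

# Using the mask where Python expects one bool is ambiguous,
# so numpy refuses with a ValueError
try:
    if mask:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# Spell out the reduction instead
any_big = mask.any()    # is at least one element > 5?
all_big = mask.all()    # are all elements > 5?
```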

Now, none of this, so far, has been ALL THAT cryptic as far as I can
tell. It's when you do more complicated things, and start combining
different parts of the numpy toolbox, that it becomes clear that
numpy-Python is kind of a different language.

In [8]: a[(np.sqrt(a).astype(int)**2 == a) & (a < 5)]
Out[8]: array([1, 1, 0, 1, 1, 4, 4, 1, 0])

In [9]: import math

In [10]: [i for i in a if int(math.sqrt(i))**2 == i and i < 5]
Out[10]: [1, 1, 0, 1, 1, 4, 4, 1, 0]
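
If In [8] looks like line noise, it may help to pull it apart. This is
only a sketch on a shorter, hand-picked array, and the intermediate
names (is_square, is_small, both) are mine:

```python
import numpy as np

a = np.array([6, 1, 6, 9, 1, 0, 3, 5, 8, 4])

# Elementwise: is each value a perfect square?
is_square = np.sqrt(a).astype(int) ** 2 == a
# Elementwise: is each value below 5?
is_small = a < 5

# & combines the two masks element by element. `and` would try to
# turn each array into a single bool and fail. In the one-liner the
# parentheses are needed because & binds tighter than == and <.
both = is_square & is_small

# Indexing with a boolean array keeps the elements where it is True
selected = a[both]
```

Here selected ends up holding 1, 1, 0 and 4, the same filtering that the
list comprehension in In [10] does one element at a time.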

Look at my pandas example from my previous post. If you're a
Python-using scientist, even if you're not very familiar with pandas,
you'll probably be able to see more or less how it works. I imagine that
there are plenty of experienced Pythonistas on this list who never need
to deal with large amounts of numeric data and who are completely
nonplussed by it, and I wouldn't blame them. The style and the
idiosyncrasies of array-heavy scientific Python and stream- or
iterator-heavy scripting and networking Python are just sometimes rather
different.

Cheers

Thomas