[Numpy-discussion] Numpy Array of dtype=object with strings and floats question

Keith Goodman kwgoodman at gmail.com
Tue Nov 10 14:34:52 EST 2009


On Tue, Nov 10, 2009 at 11:28 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Tue, Nov 10, 2009 at 11:14 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Tue, Nov 10, 2009 at 10:53 AM, Darryl Wallace
>> <darryl.wallace at prosensus.ca> wrote:
>>> I currently do as you suggested.  But when the dataset size becomes large,
>>> it gets to be quite slow due to the overhead of python looping.
>>
>> Are you using a for loop? If so, something like this might be faster:
>>
>>>> x = [1, 2, '', 3, 4, 'String']
>>>> from numpy import nan
>>>> [(z, nan)[type(z) is str] for z in x]
>>   [1, 2, nan, 3, 4, nan]
>>
>> I use something similar in my code, so I'm interested to see if anyone
>> can speed things up using python or numpy, or both. I run it on each
>> row of the file replacing '' with None. Here's the benchmark code:
>>
>>>> x = [1, 2, '', 4, 5, '', 7, 8, 9, 10]
>>>> timeit [(z, None)[z is ''] for z in x]
>> 100000 loops, best of 3: 2.32 µs per loop
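For what it's worth, here is one way to skip the object dtype entirely: do the string-to-NaN replacement in a single pass and build a float array directly. This is a sketch (the helper name `to_float_array` is just for illustration), and it assumes every non-string entry is numeric.

```python
import numpy as np

def to_float_array(x):
    # Replace any string entry (including '') with NaN, then build a
    # plain float array -- no dtype=object intermediate needed.
    # Assumes all non-string entries are numeric.
    return np.array([np.nan if isinstance(z, str) else z for z in x],
                    dtype=float)

x = [1, 2, '', 3, 4, 'String']
a = to_float_array(x)  # float64 array with NaN where the strings were
```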
>
> If there are few missing values (my use case), this seems to be faster:
>
> def myfunc(x):
>    while '' in x:
>        x[x.index('')] = None
>    return x
>
>>> timeit myfunc(x)
> 1000000 loops, best of 3: 697 ns per loop
>
> Note that it works inplace.

Oops. It's hard to time functions that change the input. Making a copy
of x at the top of the functions takes away the speed advantage. Is
there a way to get timeit to restore the input on each cycle? OK, I'll
stop spamming the list.
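One way around the mutation problem is to time the function on a fresh copy each call and then subtract the cost of the copy itself. A rough sketch (using the `globals=` argument, available in Python 3.5+; numbers will vary by machine):

```python
import timeit

x = [1, 2, '', 4, 5, '', 7, 8, 9, 10]

def myfunc(x):
    # Replace '' with None in place.
    while '' in x:
        x[x.index('')] = None
    return x

n = 100000
# Time the function on a fresh copy each call, so the in-place
# mutation never skews later iterations...
total = timeit.timeit('myfunc(list(x))', globals=globals(), number=n)
# ...then subtract the cost of the copy itself.
copy_cost = timeit.timeit('list(x)', globals=globals(), number=n)
per_call = (total - copy_cost) / n  # seconds per call, copy overhead removed
```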
