[Numpy-discussion] Trim a numpy array in numpy.

Tue Aug 16 18:43:50 EDT 2011

On 16 Aug 2011, at 23:51, Hongchun Jin wrote:

> Thanks Derek for  the quick reply. But I am sorry, I did not make it clear in my last email.  Assume I have an array like 
> ['CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-01T00-37-48ZD.hdf'
>  'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-01T00-37-48ZD.hdf'
>  'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-01T00-37-48ZD.hdf' ...,
>  'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-31T23-56-35ZD.hdf'
>  'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-31T23-56-35ZD.hdf'
>  'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-31T23-56-35ZD.hdf']
> 
> I need to get the sub-string for date and time, for example, '2008-01-31T23-56-35ZD', in the middle of each element. In more general cases, the sub-string could be any part of the string in such an array. I hope to assign the start and stop of the sub-string when I am subsetting it.
> 
Well, maybe I was a bit too quick in my reply - see the documentation for np.char for some vectorized array operations that might be of use. Unfortunately, operations like 'lstrip' and 'rstrip' don't do exactly what you might expect them to (they strip a set of characters rather than a fixed prefix or suffix), but you could use, for example,
np.char.split(x,'.') 
to create an array of lists of substrings and then deal with them; something like removing the '.hdf' suffix would already require a somewhat lengthy nested expression:

np.char.rstrip(np.char.rstrip(np.char.rstrip(np.char.rstrip(x, 'f'), 'd'), 'h'), '.')

Also removing the leading substring in your case would clearly lead to a very clumsy expression...
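
For what it's worth, here is a minimal sketch of the two np.char approaches above on a two-element sample. It assumes a current NumPy with Python 3 str arrays (the timings in this message used byte strings), and the short name 'abcd.hdf' is made up purely to show the rstrip pitfall:

```python
import numpy as np

# Two sample names in the same format as the array quoted above.
x = np.array([
    'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-01T00-37-48ZD.hdf',
    'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-31T23-56-35ZD.hdf',
])

# np.char.split returns an object array holding one list of pieces per element.
parts = np.char.split(x, '.')
# Piece 1 sits between the first and second '.', which here is the timestamp.
stamps = np.array([p[1] for p in parts])

# rstrip treats its second argument as a set of characters, not as a suffix.
# With a made-up name ending in 'd' before the suffix, it strips too much.
pitfall = np.char.rstrip(np.array(['abcd.hdf']), '.hdf')

# Hence the nesting: strip one character at a time, from the right.
no_suffix = np.char.rstrip(
    np.char.rstrip(np.char.rstrip(np.char.rstrip(x, 'f'), 'd'), 'h'), '.')

print(stamps)
print(pitfall)
print(no_suffix)
```

Note that the nested form is still not a true suffix removal; it merely happens to be safe for names like these.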

It turns out, however, that something like the above, for a similar test case with a length-100000 array, takes about 3 times longer than the np.char.split() approach; and even that is slower than a direct loop over the string methods:

In [6]: %timeit -n 10 y = np.char.split(x, '.')
10 loops, best of 3: 188 ms per loop

In [7]: %timeit -n 10 y = np.char.split(x, '.'); z = np.fromiter( (l[1] for l in y), dtype='|S3', count=x.shape[0])
10 loops, best of 3: 218 ms per loop

In [8]: %timeit -n 10 z = np.fromiter( (l.split('.')[1] for l in x), dtype='|S3', count=x.shape[0])
10 loops, best of 3: 143 ms per loop

So it seems all of the vectorization in np.char is not that great after all (and the direct loop might still be acceptable for 1.e6 elements...)!
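
For the start/stop subsetting you asked about, the direct loop can simply slice each element. A minimal sketch, assuming a current NumPy with Python 3 str arrays; the helper name substrings is hypothetical, not part of NumPy, and the start/stop values are derived from the sample names rather than hard-coded:

```python
import numpy as np

def substrings(arr, start, stop):
    """Return element[start:stop] for every element (hypothetical helper)."""
    # Plain Python slicing per element; NumPy chooses the result string width.
    return np.array([s[start:stop] for s in arr])

x = np.array([
    'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-01T00-37-48ZD.hdf',
    'CAL_LID_L2_05kmCLay-Prov-V3-01.2008-01-31T23-56-35ZD.hdf',
])

# The timestamp starts right after the first '.'; its length is taken
# from the example substring quoted above.
start = str(x[0]).find('.') + 1
stop = start + len('2008-01-31T23-56-35ZD')

stamps = substrings(x, start, stop)
print(stamps)
```

This only works with fixed offsets when every element has the same layout, as in your file names; otherwise splitting on a delimiter is the safer route.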

Cheers,
								Derek