[Numpy-discussion] named ndarray axes
Wes McKinney
wesmckinn at gmail.com
Wed Jul 13 23:26:21 EDT 2011
On Wed, Jul 13, 2011 at 7:15 PM, Craig Yoshioka <craigyk at me.com> wrote:
> Yup exactly. To enable this sort of tracking I needed to explicitly reverse-engineer the effects of indexing on axes. I figure overriding indexing catches most cases that modify axes, but other holes need to be plugged as well... ie: tranpose, swapaxes. This probably means most C functions that change array axes (np.mean(axis=), etc.) need to be covered as well.... that sucks.
>
> BTW, it sounds like you're trying to track very similar data. I am trying to load structural biology data formats, and I try to preserve as much of the metadata as possible, ie: I am encoding unit cell length/angle information as vectors, etc.
>
> Here is my implementation:
>
> def __getitem__(self,idxs):
> idxs,trans,keep = idx_axes(self,idxs)
> array = self.view(np.ndarray).transpose(trans)
> array = array[idxs]
> if isinstance(array,ndarray):
> array = array.view(self.__class__)
> array.axes = self.axes.transpose(keep)
> return array
>
>
> def idx_axes(array,idxs):
>
> # explicitly expand ellipsis
> expanded_idxs = idx_expanded(array.ndim,idxs)
>
> # determine how the axes will be rearranged as a result of axes-based indexing
> # and the creation of newaxes
> remapped_axes = idx_axes_remapped(array.ndim,array.axes,expanded_idxs)
>
> # determine numpy compatible transpose, before newaxes are created
> transposed_axes = idx_axes_transposed(remapped_axes)
>
> # determine numpy compatible indexes with axes-based indexing removed
> filtered_idxs = idx_filtered(expanded_idxs)
>
> # determine which axes will be kept after numpy indexing
> kept_axes = idx_axes_kept(remapped_axes,filtered_idxs)
>
> return filtered_idxs,transposed_axes,kept_axes
>
>
> def idx_expanded(ndim,idxs):
> '''
> explicitly expands ellipsis taking into account newaxes
> '''
>
> if not isinstance(idxs,tuple):
> return idx_expanded(ndim,(idxs,))
>
> # how many dimensions we will end up having
> ndim = ndim + idxs.count(newaxis)
>
> filler = slice(None)
>
> def skip_ellipsis(idxs):
> return tuple([filler if isinstance(x,type(Ellipsis)) else x for x in idxs])
>
> def fill_ellipsis(ndim,l,r):
> return (filler,)*(ndim-len(l)-len(r))
>
> # expand first ellipsis, treat all other ellipsis as slices
> if Ellipsis in idxs:
> idx = idxs.index(Ellipsis)
> llist = idxs[:idx]
> rlist = skip_ellipsis(idxs[idx+1:])
> cfill = fill_ellipsis(ndim,llist,rlist)
> idxs = llist + cfill + rlist
>
> return idxs
>
>
> def idx_filtered(idxs):
> '''
> replace indexes on axes with normal numpy indexes
> '''
> def axisAsIdx(idx):
> if isinstance(idx,str):
> return slice(None)
> elif isinstance(idx,slice):
> if isinstance(idx.start,str):
> return idx.stop
> return idx
>
> return tuple([axisAsIdx(x) for x in idxs])
>
>
> def idx_axes_remapped(ndim,axes,idxs):
> '''
> if given a set of array indexes that contain labeled axes,
> return a tuple that maps axes in the source array to the axes
> as they will end up in the destination array. Must take into
> account the spaces created by newaxis indexes.
> '''
>
> # how many dims are we expecting?
> ndim = ndim + idxs.count(newaxis)
>
> # new unique object for marking unassigned locations in mapping
> unassigned = object()
>
> # by default all locations are unsassigned
> mapping = [unassigned] * ndim
>
> # find labels in indexes and set the dims for those locations
> for dim,label in enumerate(idxs):
> if label == newaxis:
> mapping[dim] = label
> elif isinstance(label,str):
> mapping[dim] = axes.dimForLabel(label)
> elif isinstance(label,slice):
> if isinstance(label.start,str):
> mapping[dim] = axes.dimForLabel(label.start)
>
> # find unassigned dims, in order
> unmapped = [d for d in range(ndim) if d not in set(mapping)]
>
> # fill in remaining unassigned locations with dims
> for dst,src in enumerate(mapping):
> if src == unassigned:
> mapping[dst] = unmapped.pop(0)
>
> return tuple(mapping)
>
>
> def idx_axes_transposed(mapping):
> '''
> stripping out newaxes in mapping yields a tuple compatible with transpose
> '''
> return tuple([x for x in mapping if x != newaxis])
>
>
> def idx_axes_kept(mapping,idxs):
> '''
> remove axes from mapping that will not survive the indexing (ie: ints)
> '''
> kept = []
> first_list = True
> for dst,src in enumerate(mapping):
> if dst < len(idxs):
> idx = idxs[dst]
> if isinstance(idx,int):
> continue
> elif isinstance(idx,list):
> if not first_list:
> continue
> first_list = False
> kept += [src]
> return tuple(kept)
>
>
>
>
> On Jul 13, 2011, at 3:13 PM, Sam Quinan wrote:
>
>> I'm currently working on interfacing ndarrays with a custom C-representation
>> for n-dimensional arrays. My custom C code provides additional per-axis
>> information (labeling, spacing between samples / range of sample positions
>> along the axis, axis direction, cell vs.node centering, etc.) Subclassing
>> ndarray to hold onto this info is fairly simple, but getting numpy's methods
>> to intelligently modify that information when the array is sliced is
>> something that I'm still trying to figure out.
>>
>> A robust way to attach per-axis info to a given ndarray, whether it just be
>> a label or some more complex structure, would definitely be something I (and
>> likely others) would find useful...
>>
>> That said, I'd love to know more about how the idx_axes() structure in your
>> workaround works...
>>
>> - Sam
>>
>>
>>
>> On 7/13/11 12:00 PM, "numpy-discussion-request at scipy.org"
>> <numpy-discussion-request at scipy.org> wrote:
>>
>>> Date: Tue, 12 Jul 2011 16:39:47 -0700
>>> From: Craig Yoshioka <craigyk at me.com>
>>> Subject: [Numpy-discussion] named ndarray axes
>>> To: NumPy-Discussion at scipy.org
>>> Message-ID: <0FC8B43E-26CD-40ED-A6FA-59DD8D641998 at me.com>
>>> Content-Type: text/plain; CHARSET=US-ASCII
>>>
>>> I brought up a while ago about how it would be nice if numpy arrays could have
>>> their axes 'labeled'. = I got an implementation that works pretty well for
>>> me and in the process learned quite a few things, and was hoping to foster
>>> some more discussion on this topic, as I think I have found a simple/flexible
>>> solution to support this at the numpy level.
>>>
>>> Here are *some* examples code snippets from my unit tests on 'Array':
>>>
>>> a = Array((4,5,6))
>>>
>>> # you can assign data to all axes by attribute:
>>> a.Axes.Label = (0:'z',1:'y',2:'x'}
>>>
>>> # or add metadata to each individually:
>>> a.Axes[1].Vector = [0,1,0]
>>> a.Axes[2].Vector = [0,0,1]
>>>
>>> # simple case int indexing
>>> b = a[0]
>>> assert b.shape == (5,6)
>>> assert b.Axes.Label == {0:'y',1:'x'}
>>> assert b.Axes.Vector == {0:[0,1,0],1:[0,0,1]}
>>>
>>> # indexing with slices
>>> b = a[:,0,:]
>>> assert b.shape == (4,6)
>>> assert b.Axes.Label == {0:'z',1:'x'}
>>> assert b.Axes.Vector == {1:[0,0,1]}
>>>
>>> # indexing with ellipsis
>>> b = a[...,0]
>>> assert b.shape == (4,5)
>>> assert b.Axes.Label == {0:'z',1:'y'}
>>>
>>> # indexing with ellipsis, newaxis, etc.
>>> b = a[newaxis,...,2,newaxis]
>>> assert b.shape == (1,4,5,1)
>>> assert b.Axes.Label == {1:'z',2:'y'}
>>>
>>> # indexing with lists
>>> b = a[[1,2],:,[1,2]]
>>> assert b.shape == (2,5)
>>> assert b.Axes.Label == {0:'z',1:'y'}
>>>
>>> # most interesting examples, indexing with axes labels----------------
>>> # I was a bit confused about how to handle indexing with mixed
>>> axes/non-axes indexes
>>> # IE: what does a['x',2:4] mean? on what axis is the 2:4 slice being
>>> applied, the first? the first after 'x'?
>>> # One option is to disallow mixing (simpler to implement,
>>> understand?)
>>> # Instead I chose to treat the axis indexing as a forced assignment
>>> of an axis to a position.
>>>
>>> # axis indexing that transposes the first two dimensions, but otherwise
>>> does nothing
>>> b = a['y','z']
>>> assert b.shape == (5,4,6)
>>> assert b.Axes.Label == {0:'y',1:'z',2:'x'}
>>>
>>> # abusing slices to allow specifying indexes for axes
>>> b = a['y':0,'z']
>>> assert b.shape == (4,6)
>>> assert b.Axes.Label == {0:'z',1:'x'}
>>>
>>> # unfortunately that means a slice index on an axis must be written like
>>> so:
>>> b = a['y':slice(0,2),'x','z']
>>> assert b.shape == (2,6,4)
>>> assert b.Axes.Label == {0:'y',1:'x',2:'z'}
>>>
>>> b = a['y':[1,2,3],'x','z':slice(0:1)]
>>> # or due to the forced transposition, this is the same as:
>>> c = a['y','x','z'][[1,2,3],:,0:1]
>>>
>>> assert b.shape == (3,6,1)
>>> assert b.Axes.Label == {0:'y',1:'x',2:'z'}
>>> assert b.shape == c.shape
>>> assert b.Axes == c.Axes
>>>
>>>
>>> #-----------------------------------------------------------------------------
>>> -----------
>>>
>>>
>>> To do all this I essentially had to recreate the effects of numpy indexing on
>>> axes.... This is not ideal, but so far I seem to have addressed most of the
>>> indexing I've used, at least. Here is what __getitem__ looks like:
>>>
>>> def __getitem__(self,idxs):
>>> filtered_idxs,transposed_axes,kept_axes = self.idx_axes(idxs)
>>> array = self.view(ndarray).transpose(transposed_axes)
>>> array = array[filtered_idxs]
>>> if isinstance(array,ndarray):
>>> array = array.view(Array)
>>> array.Axes = self.Axes.keep(kept_axes)
>>> return array
>>>
>>> As you can see idx_axes() essentially recreates a lot of ndarray indexing
>>> behavior, so that its effects can be explicitly handled.
>>>
>>> Having done all this, I think the best way for numpy to support 'labeled' axes
>>> in the future is by having numpy itself keep track of a very simple tuple
>>> attribute, like shape, and leave more complex axis naming/labeling to
>>> subclasses on the python side. As an example, upon creating a new dimension
>>> in an array, numpy assigns that dimension a semi-unique id, and this tuple
>>> could be used in __array_finalize__.
>>>
>>> For example my __array_finalize__ could look like:
>>>
>>> def __array_finalize__(self,obj):
>>> if hasattr(obj,'axesdata'):
>>> for axesid in self.axes:
>>> if axesid in obj.axes:
>>> self.axesdata[axesid] = obj.axesdata[axesid]
>>>
>>>
>>> This would cover a lot more situations and lead to much simpler code since the
>>> work required on the C side would be minimal, but still allow robust and
>>> custom tracking and propagation of axes information.
>>> Subclasses that tap into this data would react to the result of numpy
>>> operations vs. having to predict/anticipate.
>>>
>>> For example, my __getitem__, relying on the __array_finalize__ above, could
>>> look like:
>>>
>>> def __getitem__(self,idxs):
>>> filtered_idxs,transposed_axes= self.idx_axes(idxs)
>>> array = self.transpose(transposed_axes)
>>> return array[filtered_idxs]
>>>
>>> Not shown is how much simpler and robust the code for idx_axes would then be.
>>> I estimate it would go from 130 loc to < 20 loc.
>>>
>>> Sorry for the extra long e-mail,
>>> -Craig
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
Have you guys been following the DataArray discussions or project at
all? I think it provides a nearly complete implementation of what
you've been describing (named axes):
https://github.com/fperez/datarray
(no joke: 21 forks!)
some links:
http://inscight.org/2011/05/18/episode_13/
https://convore.com/python-scientific-computing/data-array-in-numpy/
I'm excited that many more people seem to be excited about making this
kind of functionality available in the scientific Python stack so
understanding every perspective and set of requirements makes a big
difference =)
best,
Wes
More information about the NumPy-Discussion
mailing list