[Numpy-discussion] named ndarray axes

Wed Jul 13 23:26:21 EDT 2011

On Wed, Jul 13, 2011 at 7:15 PM, Craig Yoshioka <craigyk at me.com> wrote:
> Yup exactly.  To enable this sort of tracking I needed to explicitly reverse-engineer the effects of indexing on axes.  I figure overriding indexing catches most cases that modify axes, but other holes need to be plugged as well... ie: tranpose, swapaxes.  This probably means most C functions that change array axes (np.mean(axis=), etc.) need to be covered as well.... that sucks.
>
> BTW, it sounds like you're trying to track very similar data.  I am trying to load structural biology data formats, and I try to preserve as much of the metadata as possible, ie: I am encoding unit cell length/angle information as vectors, etc.
>
> Here is my implementation:
>
>    def __getitem__(self,idxs):
>        idxs,trans,keep = idx_axes(self,idxs)
>        array = self.view(np.ndarray).transpose(trans)
>        array = array[idxs]
>        if isinstance(array,ndarray):
>            array = array.view(self.__class__)
>            array.axes = self.axes.transpose(keep)
>        return array
>
>
> def idx_axes(array,idxs):
>
>    # explicitly expand ellipsis
>    expanded_idxs = idx_expanded(array.ndim,idxs)
>
>    # determine how the axes will be rearranged as a result of axes-based indexing
>    # and the creation of newaxes
>    remapped_axes = idx_axes_remapped(array.ndim,array.axes,expanded_idxs)
>
>    # determine numpy compatible transpose, before newaxes are created
>    transposed_axes = idx_axes_transposed(remapped_axes)
>
>    # determine numpy compatible indexes with axes-based indexing removed
>    filtered_idxs = idx_filtered(expanded_idxs)
>
>    # determine which axes will be kept after numpy indexing
>    kept_axes = idx_axes_kept(remapped_axes,filtered_idxs)
>
>    return filtered_idxs,transposed_axes,kept_axes
>
>
> def idx_expanded(ndim,idxs):
>    '''
>    explicitly expands ellipsis taking into account newaxes
>    '''
>
>    if not isinstance(idxs,tuple):
>        return idx_expanded(ndim,(idxs,))
>
>    # how many dimensions we will end up having
>    ndim = ndim + idxs.count(newaxis)
>
>    filler = slice(None)
>
>    def skip_ellipsis(idxs):
>        return tuple([filler if isinstance(x,type(Ellipsis)) else x for x in idxs])
>
>    def fill_ellipsis(ndim,l,r):
>        return (filler,)*(ndim-len(l)-len(r))
>
>    # expand first ellipsis, treat all other ellipsis as slices
>    if Ellipsis in idxs:
>        idx = idxs.index(Ellipsis)
>        llist = idxs[:idx]
>        rlist = skip_ellipsis(idxs[idx+1:])
>        cfill = fill_ellipsis(ndim,llist,rlist)
>        idxs = llist + cfill + rlist
>
>    return idxs
>
>
> def idx_filtered(idxs):
>    '''
>    replace indexes on axes with normal numpy indexes
>    '''
>    def axisAsIdx(idx):
>        if isinstance(idx,str):
>            return slice(None)
>        elif isinstance(idx,slice):
>            if isinstance(idx.start,str):
>                return idx.stop
>        return idx
>
>    return tuple([axisAsIdx(x) for x in idxs])
>
>
> def idx_axes_remapped(ndim,axes,idxs):
>    '''
>    if given a set of array indexes that contain labeled axes,
>    return a tuple that maps axes in the source array to the axes
>    as they will end up in the destination array.  Must take into
>    account the spaces created by newaxis indexes.
>    '''
>
>    # how many dims are we expecting?
>    ndim = ndim + idxs.count(newaxis)
>
>    # new unique object for marking unassigned locations in mapping
>    unassigned = object()
>
>    # by default all locations are unsassigned
>    mapping = [unassigned] * ndim
>
>    # find labels in indexes and set the dims for those locations
>    for dim,label in enumerate(idxs):
>        if label == newaxis:
>            mapping[dim] = label
>        elif isinstance(label,str):
>            mapping[dim] = axes.dimForLabel(label)
>        elif isinstance(label,slice):
>            if isinstance(label.start,str):
>                mapping[dim] = axes.dimForLabel(label.start)
>
>    # find unassigned dims, in order
>    unmapped = [d for d in range(ndim) if d not in set(mapping)]
>
>    # fill in remaining unassigned locations with dims
>    for dst,src in enumerate(mapping):
>        if src == unassigned:
>            mapping[dst] = unmapped.pop(0)
>
>    return tuple(mapping)
>
>
> def idx_axes_transposed(mapping):
>    '''
>    stripping out newaxes in mapping yields a tuple compatible with transpose
>    '''
>    return tuple([x for x in mapping if x != newaxis])
>
>
> def idx_axes_kept(mapping,idxs):
>    '''
>    remove axes from mapping that will not survive the indexing (ie: ints)
>    '''
>    kept = []
>    first_list = True
>    for dst,src in enumerate(mapping):
>        if dst < len(idxs):
>            idx = idxs[dst]
>            if isinstance(idx,int):
>                continue
>            elif isinstance(idx,list):
>                if not first_list:
>                    continue
>                first_list = False
>        kept += [src]
>    return tuple(kept)
>
>
>
>
> On Jul 13, 2011, at 3:13 PM, Sam Quinan wrote:
>
>> I'm currently working on interfacing ndarrays with a custom C-representation
>> for n-dimensional arrays. My custom C code provides additional per-axis
>> information (labeling, spacing between samples / range of sample positions
>> along the axis, axis direction, cell vs.node centering, etc.) Subclassing
>> ndarray to hold onto this info is fairly simple, but getting numpy's methods
>> to intelligently modify that information when the array is sliced is
>> something that I'm still trying to figure out.
>>
>> A robust way to attach per-axis info to a given ndarray, whether it just be
>> a label or some more complex structure, would definitely be something I (and
>> likely others) would find useful...
>>
>> That said, I'd love to know more about how the idx_axes() structure in your
>> workaround works...
>>
>> - Sam
>>
>>
>>
>> On 7/13/11 12:00 PM, "numpy-discussion-request at scipy.org"
>> <numpy-discussion-request at scipy.org> wrote:
>>
>>> Date: Tue, 12 Jul 2011 16:39:47 -0700
>>> From: Craig Yoshioka <craigyk at me.com>
>>> Subject: [Numpy-discussion] named ndarray axes
>>> To: NumPy-Discussion at scipy.org
>>> Message-ID: <0FC8B43E-26CD-40ED-A6FA-59DD8D641998 at me.com>
>>> Content-Type: text/plain; CHARSET=US-ASCII
>>>
>>> I brought up a while ago about how it would be nice if numpy arrays could have
>>> their axes 'labeled'.    = I got an implementation that works pretty well for
>>> me and in the process learned quite a few things, and was hoping to foster
>>> some more discussion on this topic, as I think I have found a simple/flexible
>>> solution to support this at the numpy level.
>>>
>>> Here are *some* examples code snippets from my unit tests on 'Array':
>>>
>>>    a = Array((4,5,6))
>>>
>>>    # you can assign data to all axes by attribute:
>>>    a.Axes.Label = (0:'z',1:'y',2:'x'}
>>>
>>>    # or add metadata to each individually:
>>>    a.Axes[1].Vector = [0,1,0]
>>>    a.Axes[2].Vector = [0,0,1]
>>>
>>>    # simple case int indexing
>>>    b = a[0]
>>>    assert b.shape == (5,6)
>>>    assert b.Axes.Label == {0:'y',1:'x'}
>>>    assert b.Axes.Vector == {0:[0,1,0],1:[0,0,1]}
>>>
>>>    # indexing with slices
>>>    b = a[:,0,:]
>>>    assert b.shape == (4,6)
>>>    assert b.Axes.Label == {0:'z',1:'x'}
>>>    assert b.Axes.Vector == {1:[0,0,1]}
>>>
>>>    # indexing with ellipsis
>>>    b = a[...,0]
>>>    assert b.shape == (4,5)
>>>    assert b.Axes.Label == {0:'z',1:'y'}
>>>
>>>    # indexing with ellipsis, newaxis, etc.
>>>    b = a[newaxis,...,2,newaxis]
>>>    assert b.shape == (1,4,5,1)
>>>    assert b.Axes.Label == {1:'z',2:'y'}
>>>
>>>    # indexing with lists
>>>    b = a[[1,2],:,[1,2]]
>>>    assert b.shape == (2,5)
>>>    assert b.Axes.Label == {0:'z',1:'y'}
>>>
>>>    # most interesting examples, indexing with axes labels----------------
>>>    # I was a bit confused about how to handle indexing with mixed
>>> axes/non-axes indexes
>>>    # IE: what does a['x',2:4]  mean?  on what axis is the 2:4 slice being
>>> applied, the first? the first after 'x'?
>>>    #       One option is to disallow mixing (simpler to implement,
>>> understand?)
>>>    #       Instead I chose to treat the axis indexing as a forced assignment
>>> of an axis to a position.
>>>
>>>    # axis indexing that transposes the first two dimensions, but otherwise
>>> does nothing
>>>    b = a['y','z']
>>>    assert b.shape == (5,4,6)
>>>    assert b.Axes.Label == {0:'y',1:'z',2:'x'}
>>>
>>>    # abusing slices to allow specifying indexes for axes
>>>    b = a['y':0,'z']
>>>    assert b.shape == (4,6)
>>>    assert b.Axes.Label == {0:'z',1:'x'}
>>>
>>>    # unfortunately that means a slice index on an axis must be written like
>>> so:
>>>    b = a['y':slice(0,2),'x','z']
>>>    assert b.shape == (2,6,4)
>>>    assert b.Axes.Label == {0:'y',1:'x',2:'z'}
>>>
>>>    b = a['y':[1,2,3],'x','z':slice(0:1)]
>>>    # or due to the forced transposition, this is the same as:
>>>    c = a['y','x','z'][[1,2,3],:,0:1]
>>>
>>>    assert b.shape == (3,6,1)
>>>    assert b.Axes.Label == {0:'y',1:'x',2:'z'}
>>>    assert b.shape == c.shape
>>>    assert b.Axes == c.Axes
>>>
>>>
>>> #-----------------------------------------------------------------------------
>>> -----------
>>>
>>>
>>> To do all this I essentially had to recreate the effects of numpy indexing on
>>> axes....  This is not ideal, but so far I seem to have addressed most of the
>>> indexing I've used, at least. Here is what __getitem__ looks like:
>>>
>>>    def __getitem__(self,idxs):
>>>        filtered_idxs,transposed_axes,kept_axes = self.idx_axes(idxs)
>>>        array = self.view(ndarray).transpose(transposed_axes)
>>>        array = array[filtered_idxs]
>>>        if isinstance(array,ndarray):
>>>            array = array.view(Array)
>>>            array.Axes = self.Axes.keep(kept_axes)
>>>        return array
>>>
>>> As you can see idx_axes() essentially recreates a lot of ndarray indexing
>>> behavior, so that its effects can be explicitly handled.
>>>
>>> Having done all this, I think the best way for numpy to support 'labeled' axes
>>> in the future is by having numpy itself keep track of a very simple tuple
>>> attribute, like shape, and leave more complex axis naming/labeling to
>>> subclasses on the python side.  As an example, upon creating a new dimension
>>> in an array, numpy assigns that dimension a semi-unique id, and this tuple
>>> could be used in __array_finalize__.
>>>
>>> For example my __array_finalize__ could look like:
>>>
>>> def __array_finalize__(self,obj):
>>>    if hasattr(obj,'axesdata'):
>>>         for axesid in self.axes:
>>>              if axesid in obj.axes:
>>>                   self.axesdata[axesid] = obj.axesdata[axesid]
>>>
>>>
>>> This would cover a lot more situations and lead to much simpler code since the
>>> work required on the C side would be minimal, but still allow robust and
>>> custom tracking and propagation of axes information.
>>> Subclasses that tap into this data would react to the result of numpy
>>> operations vs. having to predict/anticipate.
>>>
>>> For example, my __getitem__, relying on the __array_finalize__ above, could
>>> look like:
>>>
>>>    def __getitem__(self,idxs):
>>>        filtered_idxs,transposed_axes= self.idx_axes(idxs)
>>>        array = self.transpose(transposed_axes)
>>>        return array[filtered_idxs]
>>>
>>> Not shown is how much simpler and robust the code for idx_axes would then be.
>>> I estimate it would go from 130 loc to < 20 loc.
>>>
>>> Sorry for the extra long e-mail,
>>> -Craig
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

Have you guys been following the DataArray discussions or project at
all? I think it provides a nearly complete implementation of what
you've been describing (named axes):

https://github.com/fperez/datarray

(no joke: 21 forks!)

some links:
http://inscight.org/2011/05/18/episode_13/
https://convore.com/python-scientific-computing/data-array-in-numpy/

I'm excited that many more people seem to be excited about making this
kind of functionality available in the scientific Python stack so
understanding every perspective and set of requirements makes a big
difference =)

best,
Wes