[Numpy-discussion] Introduction

Thu Apr 11 14:57:14 EDT 2002

> [mailto:numpy-discussion-admin at lists.sourceforge.net]On Behalf Of Scott
> Gilbert
> Subject: [Numpy-discussion] Introduction
> 
> 
> Hello All.
> 
> I'm interested in this project, and am curious as to what level you are
> willing to accept outside contribution.  I just tried to subscribe to
> the developers list, but I didn't realize that required admin approval.
>  Hopefully it doesn't look like I was shaking the door without knocking
> first.
> 
> Is this list active?  Is this the correct place to talk about Numarray?

Sure.

> 
> Following your design for the Array stuff, I've been able to implement
> a pretty usable array class that supports the bazillion array types I
> need (Bit, Complex Integer, etc...).  This gets me past my core
> requirements without polluting your world, but unfortunately my new
> XArray type doesn't play so well with your UFuncs.  I think my users
> will definitely want to use your UFuncs when the time comes, so I want
> to remedy this situation.
> 
> The first change I would like to make is to rework your code that
> verifies that an object is a "usable" array.  I think NumArray should
> only check for the interface required, not the actual type hierarchy. 
> By this I mean that the minimum required to be a supported array type
> is that it support the correct attributes, not that it actually inherit
> from NDArray:
> 
>    (quoting from your paper) something like:
> 
>        _data
>        _shape
>        _strides
>        _byteoffset
>        _aligned
>        _contiguous
>        _type
>        _byteswap
> 
> Most of these are just integer fields, or tuples of integers.  Ignoring
> _type for the moment, it appears that the interface required to be a
> NumArray is much less strict than actually requiring it to derive from
> NumArray.  If you allow me to change a few functions (inputarray() in
> numarray.py is one small example), I could use my independent XArray
> class almost as is, and moreover I can implement new array objects
> (possibly as extension types) for crazy things like working with page
> aligned memory, memory mapping etc...
> 
I guess we are not sure we understand what you mean by interface.
In particular, we don't understand why sharing the same object
attributes (the private ones you list above) is a benefit to the
code you are writing if you aren't also using the low level
implementation. The above attributes are private and nothing 
external to the Class should depend on or even know about them.
Could you elaborate on what you mean by interface and the relationship
between your arrays and numarrays?
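
For concreteness, here is a minimal sketch of the kind of attribute-based
check being proposed, as opposed to an isinstance() test against NDArray.
The attribute names are the ones listed above; looks_like_ndarray() and
the toy XArray class are hypothetical and are not part of numarray.

```python
# Attribute names quoted from the design paper; the helper and the class
# below are illustrative assumptions, not numarray code.
REQUIRED_ATTRS = (
    "_data", "_shape", "_strides", "_byteoffset",
    "_aligned", "_contiguous", "_type", "_byteswap",
)


def looks_like_ndarray(obj):
    """Return True if obj exposes every attribute in REQUIRED_ATTRS."""
    return all(hasattr(obj, name) for name in REQUIRED_ATTRS)


class XArray:
    """Toy array that does not inherit from any numarray class."""

    def __init__(self):
        self._data = bytearray(24)   # 3 items of 8 bytes each
        self._shape = (3,)
        self._strides = (8,)
        self._byteoffset = 0
        self._aligned = 1
        self._contiguous = 1
        self._type = "Float64"       # placeholder type tag
        self._byteswap = 0


accepted = looks_like_ndarray(XArray())    # has all eight attributes
rejected = looks_like_ndarray(object())    # has none of them
```

A check like this only establishes that the names exist. It says nothing
about whether the values behind them mean the same thing, which is the
part of the question we are asking about.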

> 
> Well, that's almost enough.  The _type field poses a small problem of
> sorts.  It looks like you don't require a _type to be derived from
> NumericType, and this is a good thing since it allows me (and others)
> to implement NumArray compatible arrays without actually requiring
> NumArray to be present.
>
What do you mean by NumArray compatible?

[some issues snipped since we need to understand the interface issue
first]

> I don't know if you're trying to get all of NumArray into the Python
> distribution or not, but I suspect a good interim step would be to have
> a PEP that specifies what it means to be a NumArray or NDArray in
> minimal terms.  Perhaps supplying an Array only module in Python that
> implements this interface.  Again, I'd be willing to help with all of
> this.
>
We are hoping to get numarray into the distribution [it won't be the
end of the world for us if it doesn't happen]. I'll warn you that the
PEP is out of date. We are likely to update it only after we feel
we are close to having the implementation ready for consideration
for inclusion in the standard distribution. I would refer to the
actual implementation and the design notes for the time being.
> 
> -------------------------
> 
> Ok, other suggestions...
> 
> Here is the list of things that your design document indicates are
> required to be a NumArray:
> 
>        _data
>        _shape
>        _strides
>        _byteoffset
>        _aligned
>        _contiguous
>        _type
>        _byteswap
> 
> I believe that one could calculate the values for _aligned and
> _contiguous from the other fields.  So they shouldn't really be part of
> the interface required.  I suspect it is useful for the C
> implementation of UFuncs to have this information in the NDInfo struct
> though, so while I would drop them from attribute interface, I would
> delegate the task of calculating these values to getNDInfo() and/or
> getNumInfo().
> 
> I also notice that you chose _byteswap to indicate byteswapping is
> needed.  I think a better choice would be to specify the endian-ness of
> the data (with an _endian attr), and have getNDInfo() and getNumInfo()
> calculate the _byteswap value for the NDInfo struct.
> 
> In my implementation, I came up with a slightly different list:
> 
>             self._endian
>             self._offset
>             self._shape
>             self._stride
>             self._itemtype
>             self._itemsize
>             self._itemformat
>             self._buffer
> 
Some of the name changes are worth considering (like replacing ._byteswap
with an endian indicator, though I find _endian completely opaque as to
what it would mean--1 means what? little or big?). (BTW, we already have
_itemsize). _contiguous and _aligned are things we have been considering
changing, but I would have to think about it carefully to determine if
they really are redundant.
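
As a rough illustration of the redundancy question, the sketch below
derives a contiguity flag from shape, strides and item size, and a
byteswap flag from an endian indicator. It assumes C (row-major) ordering
and assumes the endian indicator is spelled 'little' or 'big'; both
function names are hypothetical. Alignment is left out because it would
also depend on the address of the underlying buffer.

```python
import sys


def is_contiguous(shape, strides, itemsize):
    """Return True if the strides describe a packed, row-major layout."""
    expected = itemsize
    # Walk from the fastest-varying (last) dimension to the slowest.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        # The stride of a length-1 dimension is never used, so skip it.
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


def needs_byteswap(endian):
    """Return True if data stored as `endian` differs from the host order.

    `endian` is assumed to be 'little' or 'big', matching sys.byteorder.
    """
    return endian != sys.byteorder


packed = is_contiguous((3, 4), (32, 8), 8)       # plain 3x4 array of 8-byte items
transposed = is_contiguous((4, 3), (8, 32), 8)   # transposed view of the same data
native_swap = needs_byteswap(sys.byteorder)      # native data needs no swap
```

Spelling the indicator as a string rather than 0 or 1 is one way around
the "1 means what?" objection.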

> The only minimal differences are that _itemsize allows me to work with
> arrays of bytes without having any clue what the underlying type is (in
> some cases, _itemtype is "Unknown".)  Secondly, I implemented a
> "Struct" _itemtype, and _itemformat is useful for this case.  (It's
> the same format string that the struct module in Python uses.)
> 
It looks like you are trying to deal with records with these "structs". 
We deal with records (efficiently) in a completely different way. Take
a look at the recarray module.
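
For reference, this is roughly how a struct-module format string can
describe one item of such an array. The sketch uses only the standard
struct module and says nothing about how recarray does it; the format
"<hd" and the get_item() helper are made up for the example.

```python
import struct

# Hypothetical item layout: a little-endian int16 followed by a float64.
# The '<' prefix selects standard sizes with no padding between fields.
item_format = "<hd"
item_size = struct.calcsize(item_format)   # 2 + 8 bytes

# Two items packed back to back, standing in for an array's buffer.
buffer = struct.pack(item_format, 1, 2.5) + struct.pack(item_format, 2, 5.0)


def get_item(buf, index):
    """Unpack the item at position `index` from a packed buffer."""
    return struct.unpack_from(item_format, buf, index * item_size)


second = get_item(buffer, 1)
```

Here the item size is derived from the format string, so in this
arrangement a separate size attribute would only be needed for items the
format cannot describe.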

> Also, I specified 0 for _itemsize when the actual items aren't byte
> addressable.  In my module, this only occurred with the Bit type.  I
> figured specifying 0 like this could keep a UFunc that isn't Bit aware
> from stepping on memory that it isn't allowed to.
> 
Again, we aren't sure how this works with numarray.

> -------------------------
> 
> Next thought:  Memory Mapping
> 
> I really like the idea of having Python objects that map huge files a
> piece at a time without using all of available memory.  I've seen this in
> NumArray's charter as part of the reason for breaking away from
> Numeric, and I'm curious how you intend to address it.
> 
> Right now, the only requirement for _data seems to be that it implement
> the PyBufferProcs.  For memory mapping something else is needed...
> 
> I haven't implemented this, so take it as just my rambling thoughts:
> 
> With the addition of 3 new, optional, attributes to the NumArray object
> interface, I think this could be efficiently accomplished:
> 
>      _mapproc
>      _mapmin
>      _mapmax
> 
> If _mapproc is present and not None, then it points to a function whose
> responsibility it is to set _mapmin and _mapmax appropriately. 
> _mapproc takes one argument which is the desired byte offset into the
> virtual array.  This is probably easier to describe with code:
> 
>      def _mapproc(self, offset):
>          unmap_the_old_range()
>          mmap_a_new_range_that_includes_byteoffset()
>          self._mapmin = minimum_of_new_range()
>          self._mapmax = maximum_of_new_range()
> 
> In this way, when the delta between _mapmin and _mapmax is large
> enough, the UFuncs could act over a large contiguous portion of the
> _data array at a time before another remapping is necessary.  If the
> byteoffset that a UFunc needs to work with is outside of _mapmin and
> _mapmax, it must call _mapproc to remedy the situation.
> 
> This puts a lot of work into UFuncs that choose to support this.  I
> suppose that is tough to avoid though.
> 
We deal with memory mapping in a completely different way. It's a bit late
for me to go into it in great detail, but we wrap the standard library
mmap module with a module that lets us manage memory mapped files.
This module basically memory maps an entire file and then in effect
mallocs segments of that file as buffer objects. This allocation of
subsets is needed to ensure that overlapping memory-mapped buffers
don't happen. One can basically reserve part of the memory mapped file
as a buffer. Once that is done, nothing else can use that part of the
file for another buffer. We do not intend to handle memory maps as a
way of sequentially mapping parts of the file to provide windowed views
as your code segment above suggests. If you want a buffer that is the
whole (large) file, you just get a mapped buffer to the whole thing.
(Why wouldn't you?)

The above scheme is needed for our purposes because many of our data files
contain multiple data arrays and we need a means of creating a numarray
object for each one. Most of this machinery has already been implemented,
but we haven't released it since our I/O package (for astronomical FITS
files) is not yet at the point of being able to use it.
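
Since that module is not released, here is only a rough sketch of the
general idea using the standard mmap module: map the whole file once,
then hand out non-overlapping pieces of it as buffers. The MappedFile
class and its reserve() method are hypothetical names for illustration
and should not be taken as our actual interface.

```python
import mmap
import os
import tempfile


class MappedFile:
    """Map a whole file once and hand out non-overlapping slices of it."""

    def __init__(self, path):
        self._file = open(path, "r+b")
        # A length of 0 maps the entire file.
        self._map = mmap.mmap(self._file.fileno(), 0)
        self._reserved = []   # (start, stop) byte ranges already handed out

    def reserve(self, start, length):
        """Return a writable buffer over [start, start + length)."""
        stop = start + length
        if start < 0 or stop > len(self._map):
            raise ValueError("range lies outside the file")
        for r_start, r_stop in self._reserved:
            if start < r_stop and r_start < stop:
                raise ValueError("range overlaps an existing buffer")
        self._reserved.append((start, stop))
        return memoryview(self._map)[start:stop]

    def close(self):
        # Every buffer handed out must be released before this is called.
        self._map.close()
        self._file.close()


# Build a small scratch file to stand in for a multi-array data file.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(64))
os.close(fd)

mapped = MappedFile(path)
first = mapped.reserve(0, 16)     # e.g. the bytes of one array
second = mapped.reserve(16, 32)   # a second array further into the file
first[0] = 255                    # writes go through to the mapping
second_len = len(second)

try:
    mapped.reserve(8, 16)         # overlaps `first`
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

first.release()
second.release()
mapped.close()
os.remove(path)
```

Each reserved buffer could then serve as the data buffer of a separate
array object.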

> Also, there are threading issues to think about here.  I don't know if
> UFuncs are going to release the Global Interpreter Lock, but if they do
> it's possible that multiple threads could have the same PyObject and
> try to _mapproc different offsets at different times.
> 
To tell you the truth, we haven't dealt with the threading issue much. We
think about it occasionally, but have deferred dealing with it until 
we have finished other aspects first. We do want to make it thread safe
though.
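
To make the concern concrete, the sketch below shows one conventional way
a windowed scheme like the quoted _mapproc could be guarded: hold a lock
for as long as the current window is in use, so that no other thread can
move it. This is a toy that uses a bytes object in place of a real
mapping, and the WindowedBuffer class is hypothetical. It is not
something we have implemented.

```python
import threading


class WindowedBuffer:
    """Toy array whose data is visible one fixed-size window at a time."""

    def __init__(self, data, window):
        self._data = data         # stands in for the file on disk
        self._window = window     # bytes visible per mapping
        self._mapmin = 0
        self._mapmax = 0          # empty window: first access forces a remap
        self._lock = threading.Lock()

    def _mapproc(self, offset):
        # The caller must already hold self._lock.
        self._mapmin = (offset // self._window) * self._window
        self._mapmax = min(self._mapmin + self._window, len(self._data))

    def read_byte(self, offset):
        with self._lock:
            if not (self._mapmin <= offset < self._mapmax):
                self._mapproc(offset)
            # The window cannot move until the lock is released.
            view = self._data[self._mapmin:self._mapmax]
            return view[offset - self._mapmin]


buf = WindowedBuffer(bytes(range(256)), 16)
results = {}


def worker(offset):
    results[offset] = buf.read_byte(offset)


threads = [threading.Thread(target=worker, args=(o,)) for o in (3, 100, 250)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The obvious cost is that threads reading from different windows serialize
on the one lock.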

Perry Greenfield