[PYTHON MATRIX-SIG] More on Less Copying

Tue, 11 Mar 1997 10:12:34 -0700

> > (a) One of my needs is to read large arrays of arbitrary rank from
> > binary files. The size of the array is known in advance but need not
> > be equal to the remaining size of the file.
> 
> You are addressing an important topic that has largely been neglected
> so far: file I/O, both in binary and in text form, suitable for
> interchanging data with non-Python programs 

I'm not going to touch text files, but if someone else wants to they
are welcome to join in.

> (note that pickle works
> fine as long as the format doesn't matter).

Yes and no. There is nothing wrong with the interface to pickle, but
the current implementation has an additional copy implied by
tostring().

> > Two ways to do this might be:
> > 
> >     a = Numeric.new(shape, typecode)    # create uninitialized new array
> >     a.read(file)                        # fill array; copy once
> > 
> > and:
> > 
> >     a = Numeric.read(file, shape, typecode)
> 
> The first version is more general, since it allows overwriting an
> existing array by reading from a file.

> What I don't like about this
> approach is that there is no control over the file format. You must
> have a file that uses the same memory layout as NumPy arrays, which
> in principle is not documented.  In practice, you would know the file
> format and want NumPy to do whatever necessary, including perhaps
> number format conversion. You might also want to be able to read/write
> a subarray.

Although I cut my teeth on an IBM mainframe, I'm going to show my
UNIX-centricity.

The read() method could treat the data in the file as a byte-stream
defining a sequence of equal-sized elements without gaps or blocking.
It could map between the array and the sequence in the same way that
you map between an array and its ravel(). That way your file does not
need to know the layout of an array.
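
A minimal sketch of that byte-stream idea, written against modern NumPy
rather than the Numeric module discussed in this thread. The name
read_array is hypothetical, and the file is assumed to hold native-format
elements with no header:

```python
import io
import numpy as np

def read_array(f, shape, typecode):
    """Create an array of the given shape and fill it from the next bytes of f."""
    a = np.empty(shape, dtype=typecode)      # uninitialized, like the proposed new()
    flat = a.reshape(-1)                     # view of a in the same order as ravel()
    nbytes = f.readinto(memoryview(flat).cast("B"))  # read straight into a's buffer
    if nbytes != a.nbytes:
        raise EOFError("file ended before the array was filled")
    return a

# The array need not account for the rest of the file.
buf = io.BytesIO(np.arange(6, dtype="d").tobytes() + b"trailing bytes")
a = read_array(buf, (2, 3), "d")
rest = buf.read()                            # bytes left over after the array
```

Only the element order (that of ravel()) and the element format matter
here; the file carries no information about the array's shape.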

The problem is the format of the elements. You could survive on most
modern machines by assuming that the elements were either in native
format or byte-swapped native format. You would have a problem
transporting data back and forth between Cray, IBM, VAX, and IEEE
floating point formats, but a number of tools exist for this already
and the problem gets smaller each year. (The only way I interact with a
Cray vector machine nowadays is as a front-end to one of their
Alpha-based MPP machines.) I note that the current version of pickle
does not do the conversion.

To my mind, the sticky interface issue is whether to have read() do
the byteswapping or not. If it does it in C, we can avoid another
copy. I'd suggest something along the lines of:

    a = Numeric.new(shape, typecode)
    a.read(file, order=...)
    a.read("file", order=...)

where value=None is implied in the new() function and order is None,
Numeric.LittleEndian, or Numeric.BigEndian in the read() method. None
implies native order. Ditto for write.
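
As a rough illustration of the order argument, here is a sketch using
modern NumPy in place of Numeric. The function read_into and the
constants LittleEndian and BigEndian are hypothetical stand-ins for the
proposed method and Numeric names; the swap is done in place after the
read, so no second array is allocated:

```python
import io
import sys
import numpy as np

LittleEndian = "little"   # hypothetical stand-in for Numeric.LittleEndian
BigEndian = "big"         # hypothetical stand-in for Numeric.BigEndian

def read_into(a, f, order=None):
    """Fill the contiguous array a from f; order=None means native byte order."""
    if not a.flags.c_contiguous:
        raise ValueError("this sketch only handles contiguous arrays")
    flat = a.reshape(-1)                          # view, no copy
    nbytes = f.readinto(memoryview(flat).cast("B"))
    if nbytes != a.nbytes:
        raise EOFError("file ended before the array was filled")
    if order is not None and order != sys.byteorder:
        a.byteswap(inplace=True)                  # swap each element where it sits
    return a

# Big-endian data on disk ends up as native integers in memory.
big_file = io.BytesIO(np.arange(4, dtype=">i4").tobytes())
a = read_into(np.empty(4, dtype="i4"), big_file, order=BigEndian)
```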

I presume you would read and write subarrays using:

    a[1,:].read(file)
    a[1,:].write(file)
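
A sketch of how subarray I/O might behave, again with modern NumPy
standing in for Numeric and with hypothetical helper names read_into and
write_from. A contiguous slice such as a[1,:] can be filled directly; a
strided one such as a[:,2] is assumed here to need one temporary:

```python
import io
import numpy as np

def read_into(a, f):
    """Fill a (possibly a strided view) from the next a.nbytes bytes of f."""
    # Read directly into contiguous views; otherwise go through a temporary.
    target = a if a.flags.c_contiguous else np.empty(a.shape, dtype=a.dtype)
    nbytes = f.readinto(memoryview(target.reshape(-1)).cast("B"))
    if nbytes != target.nbytes:
        raise EOFError("file ended before the subarray was filled")
    if target is not a:
        a[...] = target                      # scatter into the strided view

def write_from(a, f):
    """Write the elements of a to f in ravel() order."""
    flat = np.ascontiguousarray(a).reshape(-1)   # copies only if a is strided
    f.write(memoryview(flat).cast("B"))

a = np.zeros((3, 4), dtype="d")
src = io.BytesIO(np.arange(4, dtype="d").tobytes()
                 + np.arange(10, 13, dtype="d").tobytes())
read_into(a[1, :], src)                      # contiguous row, filled in place
read_into(a[:, 2], src)                      # strided column, via a temporary
out = io.BytesIO()
write_from(a[1, :], out)
```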

Questions: 

    Whose files am I leaving out in the cold?

    What facilities for reading and writing arrays as binary and text are
    there in BASIS and Yorick? Are they successful?

> Try
>   Numeric.add(a, b, a)
> This will add a and b and put the result into a without creating a
> copy.

Great. Thanks. These functions look to be exactly what I need.
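
For illustration, the same three-argument call in modern NumPy (used
here as a stand-in for Numeric), with a check that the result really
lands in the existing array's memory:

```python
import numpy as np

a = np.arange(5, dtype="d")
b = np.ones(5, dtype="d")

addr_before = a.__array_interface__["data"][0]   # address of a's data buffer
result = np.add(a, b, a)                         # third argument is the output array

same_object = result is a                        # the ufunc hands back a itself
same_buffer = a.__array_interface__["data"][0] == addr_before
```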

> I doubt this will happen. The assumption that numbers are immutable is
> pretty basic in Python, and a lot of code would break if it were
> violated. At best, operators like += would be added for mutable types
> (e.g. lists) and classes. And even that doesn't look like a
> high-priority item for future versions.

Operators like += need not break the immutable types; they could act
as a shorthand and might save looking up the same name twice.

Alan Watson

_______________
MATRIX-SIG  - SIG on Matrix Math for Python

send messages to: matrix-sig@python.org
administrivia to: matrix-sig-request@python.org
_______________