large arrays in python (scientific)

Terry Reedy tjreedy at home.com
Tue Jan 8 11:50:07 EST 2002


(Mark Grant) wrote
> > I have a data set that contains values for positions in 3D space.
> > There are about 2 million data points
> > (128x128x128).

Do you actually have one value for each cell? Or do you just mean the
number of possible cells? Are the values measured (a bit hard to
believe for 2 million)? Or generated by a function such as the one you
want to fit?

> > I'm trying to fit a function to the data. I want to use the
> > LeastSquaresFit procedure in ScientificPython,
> > which takes an array of elements of the format:
> >
> > [[(xposition1, yposition1, zposition1), value1],
> >  [(xposition2, yposition2, zposition2), value2],
> >  ...,
> >  ...,
> > ]

Depending on what your data actually are, this may or may not be the
most compact representation.  Need more info to say more.
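
For concreteness, building that structure directly looks something like
the sketch below (the names 'values' and 'grid_size' are my own, not
Mark's; 'values' is assumed to be indexable as values[i][j][k]).  Note
that every one of the ~2 million entries is a separate Python list
holding a separate tuple, which is where the memory goes:

grid_size = 128
data = []
for i in range(grid_size):
    for j in range(grid_size):
        for k in range(grid_size):
            # one small list + one tuple per cell, ~2 million of each
            data.append([(i, j, k), values[i][j][k]])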

> > When I try to create this array, I create about a million of the
> > elements, and then the script slows down and
> > eventually stops.  I'm not sure why this is happening.

What hardware/memory do you have? One solution might be to reduce the
spatial resolution to 64x64x64 or below.
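
If the values can be read into a Numeric array (a sketch, assuming a
128x128x128 array called 'values', and assuming I remember Numeric's
strided slicing correctly), taking every second cell along each axis
gives the coarser grid without building any per-point Python objects:

import Numeric

# 'values' is assumed to be a 128x128x128 Numeric array
coarse = values[::2, ::2, ::2]   # 64x64x64: keep every second cell
# Averaging each 2x2x2 block instead of discarding cells would be
# better statistically, but this shows the idea.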

"Bill Tate" <tatebll at aol.com> answered
> Mark,
> Are you sure you want to use a least squares fit to begin with??? The
> ~2 million data points define a bounding volume, and any prediction
> of points within that volume based on a least squares fit is likely to
> have very large error terms - your RMSE is likely to be very high, and
> I suspect your R2 value for the least squares fit is probably going to
> be very low.

He is trying to predict a value (temp, pressure, or whatever) as a
function of position, not position itself.
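
The model passed to the fit should therefore be a function of the fit
parameters and a position, something along these lines (the exact
calling convention of ScientificPython's LeastSquaresFit should be
checked against its documentation; this is only meant to illustrate
"value = f(position)", and the linear form is an arbitrary placeholder):

def model(params, position):
    a, b, c, d = params
    x, y, z = position
    return a*x + b*y + c*z + d    # value predicted from position

# each data entry pairs a position with its measured value:
# data = [((x1, y1, z1), value1), ((x2, y2, z2), value2), ...]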

> I don't know what the nature of your problem is, e.g., whether you are
> working with an irregular surface (like, say, topography data),
> scattered 3-D points, or points that define something akin to a
> contiguous or piece-wise contiguous surface, so it's difficult to
> suggest a practical alternative.  In any event, I think working with
> the full 2 million data points at one time is probably not practical.
> If you can provide more details about the nature of the data you are
> working with, I imagine you'll get more feedback in terms of useful
> alternatives.  Depending on the kind of data you are working with,
> there may be a solution available that doesn't require much more work
> than what is needed to perform a LSF.

I completely agree that we need more info to be more helpful.

Terry J. Reedy (statistical consultant)






