determining available space for Float32, for instance

Robert Kern rkern at enthought.com
Tue May 23 19:30:24 EDT 2006


David Socha wrote:
> I am looking for a way to determine the maximum array size I can allocate
> for arrays of Float32 values (or Int32, or Int8, ...) at an arbitrary
> point in the program's execution.  This is needed because Python cannot
> allocate enough memory for all of the data we need to process, so we
> need to "chunk" the processing, as described below.
> 
> Python's memory management process makes this more complicated, since
> once memory is allocated for Float32, it cannot be used for any other
> data type, such as Int32.

Just for clarification, you're talking about Numeric arrays here (judging from
the names, you still haven't upgraded to numpy), not general Python. Python
itself has no notion of Float32 or Int32 or allocating chunks of memory for
those two datatypes.

> I'd like a solution that includes either
> memory that is not yet allocated, or memory that used to be allocated
> for that type, but is no longer used.
> 
> We do not want a solution that requires recompiling Python, since we
> cannot expect our end users to do that.  

OTOH, *you* could recompile Python and distribute your Python with your
application. We do that at Enthought although for different reasons. However, I
don't think it will come to that.

> Does anyone know how to do this?

With numpy, it's easy enough to change the datatype of an array on-the-fly as
long as the sizes match up.

In [8]: from numpy import *

In [9]: a = ones(10, dtype=float32)

In [10]: a
Out[10]: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)

In [11]: a.dtype = int32

In [12]: a
Out[12]:
array([1065353216, 1065353216, 1065353216, 1065353216, 1065353216,
       1065353216, 1065353216, 1065353216, 1065353216, 1065353216], dtype=int32)

However, keeping track of the sizes of your arrays and the size of your
datatypes may be a bit much to ask.
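If you do go that route, the bookkeeping itself is only a couple of lines. A rough sketch, with an entirely hypothetical memory budget (the 800K gridcell count is from your description; the budget figure is made up for illustration):

```python
import numpy as np

# Hypothetical per-attribute memory budget, in bytes. In a real run you
# would get this number from your own accounting, not hard-code it.
budget = 256 * 1024 * 1024   # 256 MB -- an assumed figure

n_choices = 800_000                       # gridcells, from the post
itemsize = np.dtype(np.float32).itemsize  # 4 bytes per float32 element

# Largest number of agent rows whose (rows x n_choices) float32 array
# still fits inside the budget -- i.e. the chunk size along the agent axis.
max_rows = budget // (n_choices * itemsize)
print(max_rows)
```

You would then iterate over the 1-2 million households in slices of `max_rows` at a time, allocating one `(max_rows, n_choices)` scratch array per attribute and reusing it for every chunk.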

> The following describes our application context in more detail.
>  
> Our application is UrbanSim (www.urbansim.org), a micro-simulation
> application for urban planning.  It uses "datasets," where each dataset
> may have millions of entities (e.g. households), and each entity (e.g.
> household) may have dozens of attributes (e.g. number_of_cars, income,
> etc.).  Attributes can be any of the standard Python "base" types,
> though most attributes are Float32 or Int32 values.  Our models often
> create a set of 2D arrays with one dimension being agents, and the
> second dimension being choices from another dataset.  For instance, the
> agents may be households that choose a new gridcell to live in.  For our
> Puget Sound application, there are 1 to 2 million households, and 800K
> gridcells.  Each attribute of a dataset has such a 2D array.  Given that
> we may have dozens of attributes, they can eat up a lot of memory,
> quickly.

[snip]

> First, we need to know how many attributes of each type (Float32, Int32,
> etc.) will be used by this model.  We can do that.  
>  
> Second, we need to know how much space is available for an array of a
> particular type of values, e.g. for Float32 values.  Is there a way to
> get this information for Python?  

numpy arrays (not sure about Numeric) have an .itemsize attribute that tells you
how many bytes each element has:

In [13]: a.size * a.itemsize
Out[13]: 40
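For a whole array, numpy also exposes that product directly as the .nbytes attribute, so you don't have to do the multiplication yourself:

```python
import numpy as np

a = np.ones(10, dtype=np.float32)

# .nbytes is the total buffer size in bytes, i.e. size * itemsize.
print(a.nbytes)   # 10 elements, 4 bytes each
```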

numpy (definitely not Numeric) does have a feature called record arrays which
will allow you to deal with your agents much more conveniently:

  http://www.scipy.org/RecordArrays
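A minimal sketch of what that looks like for your households (the field names here are invented for illustration, borrowed from your attribute examples):

```python
import numpy as np

# One record per household; each field is a named, typed column.
# Field names and values are hypothetical.
households = np.zeros(3, dtype=[('number_of_cars', np.int32),
                                ('income', np.float32)])

households['income'] = [42000.0, 58000.0, 31000.0]
households['number_of_cars'][0] = 2

print(households[0])            # one household's full record
print(households['income'])     # the income column as a float32 array
```

The payoff is that all attributes of one entity live in a single array, and each column comes back as an ordinary typed array you can do arithmetic on.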

Also, you will certainly want to look at using PyTables to store and access your
data. With PyTables you can leave all of your data on disk and access arbitrary
parts of it in a relatively clean fashion without doing the fiddly work of
swapping chunks of memory from disk and back again:

  http://www.pytables.org/moin

Doing the memory management yourself is tricky and probably not worthwhile given
PyTables' existence.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



