[Numpy-discussion] Memory usage of numpy-arrays

Sebastian Haase seb.haase at gmail.com
Thu Jul 8 12:01:57 EDT 2010


On Thu, Jul 8, 2010 at 4:46 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 07/08/2010 08:52 AM, Wes McKinney wrote:
>> On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider
>> <hannes.bretschneider at wiwi.hu-berlin.de>  wrote:
>>
>>> Dear NumPy developers,
>>>
>>> I have to process some big data files with high-frequency
>>> financial data. I am trying to load a delimited text file having
>>> ~700 MB with ~ 10 million lines using numpy.genfromtxt(). The
>>> machine is a Debian Lenny server 32bit with 3GB of memory.  Since
>>> the file is just 700MB I am naively assuming that it should fit
>>> into memory in whole. However, when I attempt to load it, python
>>> fills the entire available memory and then fails with
>>>
>>>
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in<module>
>>>   File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
>>>     errmsg = "\n".join(errmsg)
>>> MemoryError
>>>
>>>
>>> Is there a way to load this file without crashing?
>>>
>>> Thanks, Hannes
>>>
>>>
>> From my experience I might suggest using PyTables (HDF5) as
>> intermediate storage for the data, which can be populated iteratively
>> (you'll have to parse the data yourself; marking missing data could
>> be a problem). This of course requires that you know the column
>> schema ahead of time, which is one thing that np.genfromtxt will
>> handle automatically. Particularly if you have a large, static data
>> set this can be worthwhile, as reading the data out of HDF5 will be
>> many times faster than parsing the text file.
>>
>> I believe you can also append rows to the PyTables Table structure in
>> chunks which would be faster than appending one row at a time.
>>
>> hth,
>> Wes
>>
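To illustrate the chunked PyTables approach described above, here is a
minimal sketch. The two-column schema, the column names and the file
names are made up for illustration, and the calls use the PyTables 2.x
API (openFile/createTable):

import tables

class Tick(tables.IsDescription):
    price  = tables.Float32Col(pos=0)
    volume = tables.Int32Col(pos=1)

h5file = tables.openFile("ticks.h5", mode="w")
table = h5file.createTable("/", "ticks", Tick)

chunk = []          # buffer of parsed rows
chunksize = 100000  # append this many rows at a time
for line in open("myTextFile.txt"):
    fields = line.split()
    chunk.append((float(fields[0]), int(fields[1])))
    if len(chunk) == chunksize:
        table.append(chunk)   # one append per chunk, not per row
        chunk = []
if chunk:
    table.append(chunk)
table.flush()
h5file.close()

Reading the table back later (e.g. h5file.root.ticks[:]) gives a numpy
record array without re-parsing the text.
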
> There have been past discussions on this. Numpy needs contiguous
> memory, so you are running out of memory: holding the original text
> data and building the numpy array at the same time exhausts the
> contiguous memory you have available. Note that a file of ~700 MB does
> not translate into ~700 MB of memory, since that depends on the
> dtypes. Also, a system with 3GB of memory probably has only about
> 1.5GB of free memory available (you might get closer to 2GB on a very
> lean system).
>
> If you know your data, then you have to do all the hard work yourself
> to minimize memory usage, or use something like hdf5 or PyTables.
>
> Bruce
>
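As a rough illustration of the dtype point (the shape here is
hypothetical, just 10 million rows by 6 columns):

import numpy as np

rows, cols = 10000000, 6
print("%.0f MB as float64" % (rows * cols * np.dtype(np.float64).itemsize / 1e6))
print("%.0f MB as float32" % (rows * cols * np.dtype(np.float32).itemsize / 1e6))

That prints 480 MB versus 240 MB for the same table, before counting
any temporary lists built while parsing.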

I would expect a 700MB text file to translate into less than 200MB of
data, assuming you are talking about decimal numbers (maybe 10 digits
each plus spaces) saved as float32 binary: each value costs a dozen or
so characters of text but only 4 bytes in the array. So the problem is
"only" reading through all the lines of text from start to end without
choking. This might be better done "by hand", i.e. in standard
(non-numpy) Python:

nums = []
for line in open("myTextFile.txt"):
    fields = line.split()
    nums.extend(map(float, fields))

The last line converts to Python floats, which are float64, and the
list itself adds extra bytes of overhead behind the scenes. So one
would have to read the file in blocks and convert each block to a
float32 numpy array, as sketched below.
There is not much more to say unless we know more about the format of
the text file.
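
For example, a minimal sketch of such block-wise loading (the block
size and the flat, whitespace-delimited layout are assumptions):

import numpy as np

def load_float32_blocks(path, block_lines=100000):
    blocks = []   # small float32 arrays, one per block
    nums = []     # Python floats for the current block only
    for i, line in enumerate(open(path), 1):
        nums.extend(map(float, line.split()))
        if i % block_lines == 0:
            blocks.append(np.array(nums, dtype=np.float32))
            nums = []
    if nums:
        blocks.append(np.array(nums, dtype=np.float32))
    return np.concatenate(blocks)

This keeps only one block's worth of Python floats alive at a time, so
the peak memory overhead is bounded by block_lines rather than by the
whole file.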

Regards,
Sebastian Haase


