[Numpy-discussion] Reading a big netcdf file

Jeff Whitaker jswhit at fastmail.fm
Thu Aug 4 11:53:03 EDT 2011


On 8/4/11 4:46 AM, Kiko wrote:
> Hi, all.
>
> Thank you very much for your replies.
>
> I am running into some issues. If I use the netcdf4-python or 
> scipy.io.netcdf libraries:
>
> In [4]: import netCDF4 as n4
> In [5]: from scipy.io import netcdf as nS
> In [6]: import numpy as np
> In [7]: gebco4 = n4.Dataset('GridOne.grd', 'r')
> In [8]: gebcoS = nS.netcdf_file('GridOne.grd', 'r')
>
> Now, if I do:
>
> In [9]: z4 = gebco4.variables['z']
>
> I got no problems and I have:
>
> In [14]: type(z4); z4.shape; z4.size
> Out[14]: <type 'netCDF4.Variable'>
> Out[14]: (233312401,)
> Out[14]: 233312401
>
> But if I do:
>
> In [15]: z4 = gebco4.variables['z'][:]
> ------------------------------------------------------------
> Traceback (most recent call last):
>   File "<ipython console>", line 1, in <module>
>   File "netCDF4.pyx", line 2466, in netCDF4.Variable.__getitem__ 
> (netCDF4.c:22943)
>   File "C:\Python26\lib\site-packages\netCDF4_utils.py", line 278, in 
> _StartCountStride
>     n = len(range(beg,end,inc))
> MemoryError
>
> I got a memory error. 


Kiko:  I think the difference may be that when you read the data with 
netcdf4-python, it tries to unpack the short integers to a float32 
array, thereby using much more memory (more than you have available).  
scipy.io.netcdf is just returning you a numpy array of short integers.  
I bet if you do

gebco4.variables['z'].set_auto_maskandscale(False)

before reading the data from the gebco4 variable, it will work, since 
this turns off the auto conversion to float32.
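
For a rough sense of the sizes involved, here is a back-of-the-envelope 
sketch. The element count comes from the ncdump -h output quoted further 
down; the helper name size_mib is made up for illustration. It only 
estimates the array payloads and ignores any temporary buffers the 
libraries may allocate.

```python
# Rough memory estimate for the 'z' variable (xysize taken from ncdump -h).
N_VALUES = 233312401

# Bytes per element for the types discussed in this thread.
ITEMSIZE = {"int16 (short)": 2, "float32": 4, "float64": 8}


def size_mib(n_values, itemsize):
    """Return the size in MiB of n_values elements of itemsize bytes each."""
    return n_values * itemsize / 2**20


for name, itemsize in ITEMSIZE.items():
    print(f"{name:>14}: {size_mib(N_VALUES, itemsize):8.1f} MiB")
```

The short array comes to roughly 445 MiB and a float32 copy to roughly 
890 MiB, and during a conversion both may need to exist at the same time.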

You'll have to do the conversion manually then, at which point you 
may run out of memory anyway.
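
One way to sidestep that is to never hold the whole converted array at 
once. What follows is a minimal sketch of the idea, not something from 
netcdf4-python itself: iter_chunks and scaled_min_max are hypothetical 
helper names, and the default chunk size of 10,000,000 is an arbitrary 
assumption. It relies only on the object supporting var[start:stop].

```python
def iter_chunks(var, n_values, chunk_size):
    """Yield consecutive slices of `var`, at most `chunk_size` elements each.

    `var` can be anything that supports var[start:stop]; for a file-backed
    variable each slice triggers one read of just that range.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_values, chunk_size):
        stop = min(start + chunk_size, n_values)
        yield var[start:stop]


def scaled_min_max(var, n_values, scale_factor=1.0, add_offset=0.0,
                   chunk_size=10_000_000):
    """Return (min, max) of var * scale_factor + add_offset, chunk by chunk."""
    if n_values <= 0:
        raise ValueError("n_values must be positive")
    raw_min = raw_max = None
    for block in iter_chunks(var, n_values, chunk_size):
        # Builtin min()/max() work on lists and NumPy arrays alike.
        bmin, bmax = min(block), max(block)
        raw_min = bmin if raw_min is None else min(raw_min, bmin)
        raw_max = bmax if raw_max is None else max(raw_max, bmax)
    # Apply the scaling once at the end; sorting handles a negative factor.
    lo, hi = sorted((raw_min * scale_factor + add_offset,
                     raw_max * scale_factor + add_offset))
    return lo, hi


# Stand-in data so the sketch runs without the GEBCO file.
demo = list(range(-5, 6))
print(scaled_min_max(demo, len(demo), scale_factor=2.0, add_offset=1.0,
                     chunk_size=4))
```

With the file from this thread it might be called as 
scaled_min_max(z4, z4.shape[0]) after z4 = gebco4.variables['z']. With 
NumPy chunks, block.min() and block.max() would be much faster than the 
builtins used here.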

> But if I select a smaller array I get:
>
> In [16]: z4 = gebco4.variables['z'][:10000000]
> In [17]: type(z4); z4.shape; z4.size
> Out[17]: <type 'numpy.ndarray'>
> Out[17]: (10000000,)
> Out[17]: 10000000
>
> What's the difference between z4 as a netCDF4.Variable and as a 
> numpy.ndarray?

The netcdf variable object just refers to the data in the file. Only 
when you slice the object is the data read in and converted to a numpy 
array.
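
To illustrate the idea only, here is a toy stand-in written with the 
standard library. It is not how netcdf4-python is implemented: the class 
name LazyInt16Variable and the raw binary file layout are assumptions 
made up for this sketch. The point is that creating the object reads no 
values, while slicing reads just the requested range.

```python
import array
import os
import tempfile


class LazyInt16Variable:
    """Toy file-backed variable: values are read only when it is sliced."""

    ITEMSIZE = array.array("h").itemsize  # bytes per stored short integer

    def __init__(self, path):
        self.path = path
        # Only metadata (the length) is determined here; no values are read.
        self.shape = (os.path.getsize(path) // self.ITEMSIZE,)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = key.indices(self.shape[0])
        if step != 1:
            raise ValueError("this sketch only supports a step of 1")
        count = max(0, stop - start)
        values = array.array("h")
        with open(self.path, "rb") as f:
            # Jump to the requested range and read only that many items.
            f.seek(start * self.ITEMSIZE)
            values.fromfile(f, count)
        return values


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "z.bin")
    with open(path, "wb") as f:
        array.array("h", range(1000)).tofile(f)

    z = LazyInt16Variable(path)       # nothing read yet
    print(type(z).__name__, z.shape)
    first = z[:5]                     # reads five values from the file
    print(type(first).__name__, list(first))
```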

-Jeff
>
> Now, if I use scipy.io.netcdf:
>
> In [18]: zS = gebcoS.variables['z']
> In [20]: type(zS); zS.shape
> Out[20]: <class 'scipy.io.netcdf.netcdf_variable'>
> Out[20]: (233312401,)
>
> In [21]: zS = gebcoS.variables['z'][:]
> In [22]: type(zS); zS.shape
> Out[22]: <type 'numpy.ndarray'>
> Out[22]: (233312401,)
>
> What's the difference between zS as a scipy.io.netcdf.netcdf_variable 
> and as a numpy.ndarray?
> Why with scipy.io.netcdf I do not have a MemoryError?
>
> Finally, if I do the following (maybe it's a silly thing to do) 
> using Eric's suggestions to clear the cache:
>
> In [32]: zS = gebcoS.variables['z']
> In [38]: timeit -n1 -r1 zSS = np.array(zS[:100000000]) # 100.000.000 
> out of 233.312.401 because I've got a MemoryError
> 1 loops, best of 1: 73.1 s per loop
>
> (If I use a copy, timeit -n1 -r1 zSS = np.array(zS[:100000000], 
> copy=True), I get a MemoryError and I have to set the size to 
> 50.000.000 but it's quite fast).
>
> Thank you very much for your replies and excuse me if some questions 
> are very basic.
>
> Best regards.
>
> ***********************************************************************
> The results of ncdump -h
> netcdf GridOne {
> dimensions:
>         side = 2 ;
>         xysize = 233312401 ;
> variables:
>         double x_range(side) ;
>                 x_range:units = "user_x_unit" ;
>         double y_range(side) ;
>                 y_range:units = "user_y_unit" ;
>         short z_range(side) ;
>                 z_range:units = "user_z_unit" ;
>         double spacing(side) ;
>         short dimension(side) ;
>         short z(xysize) ;
>                 z:scale_factor = 1. ;
>                 z:add_offset = 0. ;
>                 z:node_offset = 0 ;
>
> // global attributes:
>                 :title = "GEBCO One Minute Grid" ;
>                 :source = "1.02" ;
> }
>
> The file is publicly available from: 
> http://www.gebco.net/data_and_products/gridded_bathymetry_data/
