[Numpy-discussion] How to read data from text files fast?

Thu Jul 1 13:28:01 EDT 2004

Chris Barker wrote:
> Hi all,
> 
> I'm looking for a way to read data from ascii text files quickly. I've 
> found that using the standard python idioms like:
> 
> data = []
> for i in range(M):
>      data.append(map(float,file.readline().split()))
> data = array(data,Float)
> 
> Can be pretty slow. What I'd like is something like Matlab's fscanf:
> 
> data = fscanf(file, "%g", [M,N] )
> 
> I may have the syntax a little wrong, but the gist is there. What Matlab 
> does is keep recycling the format string until the desired number of 
> elements has been read.
> 
> It is quite flexible, and ends up being pretty fast.
> 
> Has anyone written something like this for Numeric (or numarray, but I'd 
> prefer Numeric at this point) ?
> 
> I was surprised not to find something like this in SciPy, maybe I didn't 
> look hard enough.

scipy.io.read_array?

I haven't timed it, because it's been 'fast enough' for my needs.
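
If read_array isn't an option, the "keep reading numbers until the desired
count is reached" behaviour described above can be sketched in plain Python.
This is only an illustration, not anything from SciPy or Numeric: the name
read_floats and its arguments are made up here, it assumes Python 3 and
whitespace-separated values, and it fills the result row by row (I have not
checked how that compares with the order Matlab's fscanf uses).

```python
import io

def read_floats(stream, shape):
    """Read rows*cols whitespace-separated floats from a text stream.

    Hypothetical helper: line breaks are ignored, so values may be
    spread over any number of lines. Returns a list of row lists.
    """
    rows, cols = shape
    needed = rows * cols
    values = []
    for line in stream:
        # Convert every token on the line in one go.
        values.extend(float(tok) for tok in line.split())
        if len(values) >= needed:
            break  # stop as soon as enough numbers have been read
    if len(values) < needed:
        raise ValueError("expected %d values, found %d" % (needed, len(values)))
    values = values[:needed]
    # Cut the flat list into rows of `cols` elements each.
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

sample = io.StringIO("1 2 3\n4 5\n6 7 8 9\n")
print(read_floats(sample, (2, 3)))
```

The nested list can then be handed to the array constructor of whichever
array package is in use. Whether this is any faster than the readline loop
would have to be timed.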

For reading binary data files, I have this little utility which is basically a 
wrapper around Numeric.fromstring (N below is Numeric imported 'as N').  Note 
that it can read binary .gz files directly, a _huge_ gain for very sparse 
files representing 3d arrays (I can read a 400k gz file which blows up to 
~60MB when unzipped in no time at all, while reading the unzipped file is very 
slow):

import gzip
import Numeric as N

def read_bin(fname,dims,typecode,recast_type=None,offset=0,verbose=0):
     """Read in a binary data file.

     Does NOT check for endianness issues.

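     Inputs:
     fname - name of the file to read; can be .gz
     dims - shape of the array (nx1,nx2,...,nxd)
     typecode - typecode of the data as stored in the file
     recast_type - if given, typecode to convert the array to after reading
     offset=0: # of bytes to skip in file *from the beginning* before data starts
     verbose=0: if true, print the number of points read and the shape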
     """
     # config parameters
     item_size = N.zeros(1,typecode).itemsize()  # size in bytes
     data_size = N.product(N.array(dims))*item_size
     # read in data
     if fname.endswith('.gz'):
         data_file = gzip.open(fname)
     else:
         data_file = file(fname,'rb')  # binary mode, needed on Windows
     data_file.seek(offset)
     data = N.fromstring(data_file.read(data_size),typecode)
     data_file.close()
     data.shape = dims
     if verbose:
         #print 'Read',data_size/item_size,'data points. Shape:',dims
         print 'Read',N.size(data),'data points. Shape:',dims
     if recast_type is not None:
         data = data.astype(recast_type)
     return data
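
For anyone without Numeric at hand, the same idea (seek past a header, read
exactly the bytes needed, interpret them as numbers) can be sketched with
only the standard library. This is a rough Python 3 illustration under a few
assumptions, not a drop-in replacement: read_bin_flat is a made-up name, the
typecodes are those of the array module rather than Numeric's, the result is
a flat array.array with no shape attached, and like read_bin it does not
check endianness.

```python
import gzip
import os
import tempfile
from array import array

def read_bin_flat(fname, dims, typecode, offset=0):
    """Read a flat binary file (optionally .gz) into a flat array.array.

    Hypothetical stand-in for read_bin: `typecode` is an array-module
    typecode such as 'd' or 'i'. No endianness handling.
    """
    count = 1
    for n in dims:
        count *= n
    data = array(typecode)
    nbytes = count * data.itemsize
    # gzip.open and open both return file objects supporting seek/read.
    opener = gzip.open if fname.endswith('.gz') else open
    with opener(fname, 'rb') as data_file:
        data_file.seek(offset)
        raw = data_file.read(nbytes)
    if len(raw) != nbytes:
        raise ValueError("file too short for dims %r" % (dims,))
    data.frombytes(raw)
    return data

# Round trip: a 4-byte header followed by six doubles, gzip-compressed.
path = os.path.join(tempfile.mkdtemp(), 'demo.bin.gz')
with gzip.open(path, 'wb') as f:
    f.write(b'HDR!' + array('d', [0.0, 1.5, 2.5, 3.5, 4.5, 5.5]).tobytes())
values = read_bin_flat(path, (2, 3), 'd', offset=4)
print(list(values))
```

The shape still has to be applied by the caller, which is what the
data.shape = dims line does in the Numeric version.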

HTH,

f