[Numpy-discussion] record data previous to Numpy use

Derek Homeier derek at astro.physik.uni-goettingen.de
Wed Jul 5 17:53:31 EDT 2017


Hi Paul,

> ascii file is an input format (and the only one I can deal with)
> 
> HDF5 one might be an export one (it's one of the options) in order to speed up the post-processing stage
> 
> Paul
> 
> Le 2017-07-05 20:19, Thomas Caswell a écrit :
> 
>> Are you tied to ASCII files?   HDF5 (via h5py or pytables) might be a better storage format for what you are describing.
>>  
>> Tom
>> 
>> On Wed, Jul 5, 2017 at 8:42 AM <paul.carrico at free.fr> wrote:
>> Dear all
>> 
>> 
>> 
>> I'm sorry if my question is too basic (not fully related to Numpy – though it is about building matrices and working with Numpy afterwards), but I'm spending a lot of time and effort to find a way to record data from an ascii file and reassign it into a matrix/array ... unsuccessfully!
>> 
>> 
>> 
>> The only way I found is to use the 'append()' instruction, which involves dynamic memory allocation. :-(
>> 
>> 
>> 
>> From my current experience under Scilab (a Matlab-like scientific solver), the approach is well known:
>> 
>> 	• Step 1: initialize the matrix, e.g. 'np.zeros((n, n))'
>> 	• Step 2: read the data
>> 	• Step 3: write it into the matrix
>> 
>> 
>> I'm obviously influenced by my current experience, but I'm interested in moving to Python and its packages
>> 
>> 
>> 
>> For huge ascii files (involving dozens of millions of lines), my strategy is to work by 'blocks', as follows:
>> 
>> 	• Find the line indices of the beginning and the end of one block (this implies that the file is read once)
>> 	• Read the block
>> 	• (process repeated on the different other blocks)
>> 
>> 
>> I tried different pieces of code such as the one below, but each time Python tells me I cannot mix iteration and the record method.
>> 

If you are indeed tied to using ASCII input data, you will of course have to deal with significant
performance handicaps, but there are at least some gains to be had by using an input parser
that does not do all the conversions at the Python level, but with a compiled (C) reader - either
pandas, as Tom already mentioned, or astropy - see e.g.
https://github.com/dhomeier/astropy-notebooks/blob/master/io/ascii/ascii_read_bench.ipynb
for the almost one order of magnitude speed gain you may get.
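To illustrate the difference between the parsers (on a tiny made-up dataset here, just to show the calls are interchangeable - the speed gap only shows up on large files):

```python
import io
import numpy as np
import pandas as pd

# Small in-memory stand-in for an ASCII data file (illustrative only).
text = "\n".join(" ".join(str(i * 10 + j) for j in range(3))
                 for i in range(5))

# Pure-Python parsing: flexible, but slow on files with millions of lines.
arr_np = np.genfromtxt(io.StringIO(text))

# Compiled (C) parser: typically several times faster on big inputs.
arr_pd = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None).to_numpy()

# Both readers produce the same (5, 3) array of numbers.
assert np.array_equal(arr_np, arr_pd)
```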

In your example it is not clear which “record” method you were trying to use that raised the errors
you mention - we would certainly need the full traceback of the error to find out more.

In principle your approach of allocating the numpy matrix first and reading the data in chunks
makes sense, as it will avoid the much larger temporary lists created during read-in.
But it might be more convenient to just read in the block into a list of lines and pass that to a
higher-level reader like np.genfromtxt or the faster astropy.io.ascii.read or pandas.read_csv
to speed up the parsing of the numbers themselves.
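That approach could be sketched roughly like this (the block layout below is entirely hypothetical - in your real file you would locate block boundaries from the line indices found in the first pass):

```python
import io
import numpy as np

# Hypothetical layout: 4 blocks of 3 rows x 2 columns each,
# standing in for blocks located in a large ASCII file.
n_blocks, rows_per_block, n_cols = 4, 3, 2
lines = [" ".join(str(b * 100 + r * 10 + c) for c in range(n_cols))
         for b in range(n_blocks) for r in range(rows_per_block)]

# Step 1: preallocate the full matrix once (no growing lists, no append()).
data = np.zeros((n_blocks * rows_per_block, n_cols))

# Steps 2-3: read each block of lines and let a vectorized parser
# fill the corresponding slice of the preallocated array.
for b in range(n_blocks):
    lo, hi = b * rows_per_block, (b + 1) * rows_per_block
    block = "\n".join(lines[lo:hi])
    data[lo:hi] = np.genfromtxt(io.StringIO(block))
```

The same slice assignment works with astropy.io.ascii.read or pandas.read_csv in place of np.genfromtxt; only the parsing of the block changes, not the preallocation.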
That said, on most systems these readers should still be able to handle files of up to a few 10^8
items (expect ~25-55 bytes of memory per input number, allocated for temporary lists),
so if saving memory is not an absolute priority, directly reading the entire file might still be the
best choice (and would also save the first reading pass).

Cheers,
					Derek


