[Numpy-discussion] Efficient way to load a 1Gb file?

Tue Sep 6 11:04:36 EDT 2011

On 02.09.2011, at 1:47AM, Russell E. Owen wrote:

>> I've made a pull request 
>> https://github.com/numpy/numpy/pull/144
>> implementing that option as a switch 'prescan'; could you review it in 
>> particular regarding the following:
>> 
>> Is the option reasonably named and documented?
>> 
>> In the case the allocated array does not match the input data (which 
>> really should never happen), right now just a warning is issued, 
>> filling any excess buffer with zeros or discarding remaining input data - 
>> should this rather raise an IndexError?
>> 
>> No prediction if/when I might be able to provide this for genfromtxt, sorry!
>> 
>> Cheers,
>>                                                        Derek
> 
> This looks like a great improvement to me! I think the name is well 
> chosen and the help is very clear.
> 
Thanks for your feedback; just a few quick comments:

> A few comments:
> - Might you rename the variable "l"? It is easily confused with the 
> digit 1.
> - I don't understand the l < n_valid test, so this may be off base, but 
> I'm surprised that you first massage the data and then raise an 
> exception. Is the massaged data any use after the exception is raised? 
> Naively I would expect you to issue a warning instead of raising an 
> exception if you are going to handle the error by massaging the data.
> 
The exception is currently caught right after the loop, which might seem a bit 
illogical, but the idea was to handle both cases in one place: if l > n_valid, 
trying to assign to X[l] will also raise an IndexError, meaning there are data 
left in the input that could not be stored. So the present version indeed 
just issues a warning in both cases, but that could easily be changed.
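
To make the control flow described above concrete, here is a minimal sketch. 
It is not the code from the pull request: count_rows and fill_preallocated 
are hypothetical names, the parsing is reduced to whitespace-separated floats 
with '#' comments, and "row" stands in for the variable "l".

```python
import warnings

import numpy as np


def count_rows(lines):
    """Prescan pass: count lines that still hold data after stripping comments."""
    return sum(1 for line in lines if line.split("#", 1)[0].strip())


def fill_preallocated(lines, n_valid):
    """Second pass: fill an array preallocated for n_valid rows.

    Both mismatch cases end up in the same except clause and only warn.
    """
    X = None
    row = 0
    try:
        for line in lines:
            fields = line.split("#", 1)[0].split()
            if not fields:
                continue
            if X is None:
                # Allocate once, using the row count from the prescan.
                X = np.zeros((n_valid, len(fields)))
            # Raises IndexError when the input holds more rows than allocated.
            X[row] = [float(f) for f in fields]
            row += 1
        if row < n_valid:
            # Fewer rows than allocated: raise so both cases share one handler.
            raise IndexError("fewer rows than allocated")
    except IndexError:
        warnings.warn("allocated array does not match input data; "
                      "excess rows stay zero or remaining input is discarded")
    if X is None:
        X = np.zeros((n_valid, 0))
    return X
```

With matching counts, fill_preallocated(lines, count_rows(lines)) returns 
silently; turning the warning into a hard error would only require 
re-raising in the except clause.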

> (It is a pity that your patch duplicates so much parsing code, but I 
> don't see a better way to do it. Putting conditionals in the parsing 
> loop to decide how to handle each line based on prescan would presumably 
> slow things down too much.)

That was my reasoning as well. Otherwise I would also have considered moving 
it out into separate functions, but as long as the entire code more or less 
fits into one editor window, this did not seem an obstacle to me.

The main update on the issue, however, is that all this is currently on hold 
because concerns have been raised that it does not use dynamic resizing 
instead (the extra reading pass would break streamed input), and we have 
been discussing the best way to do this in another thread related to pull request 
https://github.com/numpy/numpy/pull/143 
(which implements similar functionality, plus a lot more, for a genfromtxt-like 
function). So don't be surprised if the loadtxt patch comes back later, in a 
completely revised form…
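
For comparison, a rough sketch of what dynamic resizing could look like in a 
single pass. This is only an illustration, not the approach of either pull 
request: load_growing, initial_rows, and the capacity-doubling strategy are 
all assumptions, and the parsing is again reduced to whitespace-separated 
floats with '#' comments.

```python
import numpy as np


def load_growing(lines, initial_rows=4):
    """Single-pass reader that grows its buffer as rows arrive.

    Works on any iterable of lines, including non-seekable streams,
    because the input is never read a second time.
    """
    X = None
    row = 0
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        if X is None:
            X = np.empty((initial_rows, len(fields)))
        elif row == X.shape[0]:
            # Buffer is full: double the capacity and copy the filled rows.
            bigger = np.empty((2 * X.shape[0], X.shape[1]))
            bigger[:row] = X
            X = bigger
        X[row] = [float(f) for f in fields]
        row += 1
    if X is None:
        return np.empty((0, 0))
    # Trim the unused tail of the buffer.
    return X[:row].copy()
```

The trade-off is some temporary over-allocation and copying in exchange for 
not needing a seekable input.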

Cheers,
					Derek
--
----------------------------------------------------------------
Derek Homeier          Centre de Recherche Astrophysique de Lyon
ENS Lyon                                      46, Allée d'Italie
69364 Lyon Cedex 07, France                  +33 1133 47272-8894
----------------------------------------------------------------