[Numpy-discussion] load from text files Pull Request Review

Christopher Jordan-Squire cjordan1 at uw.edu
Fri Sep 2 12:16:08 EDT 2011


Sorry I'm only now getting around to thinking more about this. Been
side-tracked by stats stuff.

On Fri, Sep 2, 2011 at 10:50 AM, Chris.Barker <Chris.Barker at noaa.gov> wrote:
> On 9/2/11 8:22 AM, Derek Homeier wrote:
>> I agree it would make a very nice addition, and could complement my
>> pre-allocation option for loadtxt - however there I've also been made
>> aware that this approach breaks streamed input etc., so the buffer.resize(…)
>> methods in accumulator would be the better way to go.
>

I'll read more about this soon. I haven't thought about it, and I
didn't realize it was breaking anything.
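For concreteness, here's a toy sketch of what I understand the
resize-based accumulator idea to be -- this is not the code from the
pull request, and the class/method names are made up:

import numpy as np

class Accumulator(object):
    """Toy sketch of a resize-based growable record array (not the PR's code).

    Rows go into a pre-allocated buffer; when it fills up, ndarray.resize()
    grows it in place, so streamed input never needs a second pass.
    """
    def __init__(self, dtype, initial_rows=1024):
        self._buf = np.empty(initial_rows, dtype=dtype)
        self._n = 0

    def append(self, row):
        if self._n == self._buf.shape[0]:
            # Grow geometrically so appends stay amortized O(1).
            self._buf.resize(2 * self._buf.shape[0], refcheck=False)
        self._buf[self._n] = row
        self._n += 1

    def result(self):
        # Trim the unused tail and hand back only the filled rows.
        self._buf.resize(self._n, refcheck=False)
        return self._buf

Used roughly like this ('data.txt' is a made-up file name):

acc = Accumulator(dtype=[('x', float), ('y', int)])
for line in open('data.txt'):
    x, y = line.split()
    acc.append((float(x), int(y)))
data = acc.result()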

> Good point, that would be nice.
>
>> For loadtable this is not quite as straightforward, though, because the type
>> auto-detection, strictly done, requires scanning the entire input: a
>> column full of ints could still produce a float in the last row…
>
> hmmm -- it seems you could just as well be building the array as you go,
> and if you hit a change in the input, re-set and start again.
>

I hadn't thought of that. Interesting idea. I'm surprised that
completely resetting the array could be faster.
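To make sure I follow the suggestion, here's a rough sketch of the
restart-on-type-change control flow, for a single column only -- the
names are made up, and it shows just the restart logic, not the
pre-allocation or the real loadtable machinery:

import numpy as np

PROMOTION = {int: float, float: str}   # int -> float -> str

def load_column(path, converter=int):
    while True:
        try:
            with open(path) as f:
                values = [converter(line.strip()) for line in f]
            return np.array(values)
        except ValueError:
            if converter not in PROMOTION:
                raise
            # A value didn't parse under the current guess (e.g. a float
            # turned up in a column guessed as int), so promote the
            # converter and re-read the file from the beginning.
            converter = PROMOTION[converter]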

> In my tests, I'm pretty sure that the time spent on file I/O and string
> parsing swamps the time it takes to allocate memory and set the values.

In my tests, at least for a medium-sized csv file (about 3000 rows by
30 columns), about 10% of the time was spent determining the types in
the first read-through and 90% was spent sticking the data in the
array.

However, that particular test took longer to read in because the data
was quoted (converting '"3.25"' to a float took between 1.5x and 2x as
long as '3.25') and because the datetime conversion is costly.
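For a rough sense of that overhead, something like the following --
the exact ratio will of course vary with the machine and the converter
used:

import timeit

plain  = timeit.timeit("float('3.25')", number=1000000)
quoted = timeit.timeit("float('\"3.25\"'.strip('\"'))", number=1000000)
print(plain, quoted, quoted / plain)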

Regardless, that suggests making the data loading faster is more
important than avoiding reading through the file twice. I guess that
intuition probably breaks down if the data doesn't fit in memory,
though. But I haven't worked with extremely large data files before,
so I'd appreciate refutation/confirmation of my priors.

>
> So there is little cost, and for the common use case, it would be faster
> and cleaner.
>
> There is a chance, of course, that you might have to re-wind and start
> over more than once, but I suspect that that is the rare case.
>

Perhaps. I know that in the 'really annoying dataset whose quick and
easy loading should be your use case' example I was given, one of the
columns got its first observation about halfway through the data. (It
was time-series data where one of the columns wasn't observed until
halfway through the observation period.) So I'm not sure it'd be as
rare as we'd like.
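To make that concrete, the shape of the problem was roughly like this
(made-up values, not the actual dataset) -- any type guessed for column
b from the early rows can only be confirmed or contradicted halfway
through the file:

date,a,b
2001-01-01,1,
2001-01-02,2,
...
2005-06-15,3,0.5
2005-06-16,4,0.7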

>> For better consistency with what people have likely got used to from npyio,
>> I'd recommend some minor changes:
>>
>> make spaces the default delimiter
>
> +1
>

Sure.

>> enable automatic decompression (given the modularity, could you simply
>> use np.lib._datasource.open() like genfromtxt?)
>
> I _think_ this would benefit from a one-pass solution as well -- so you
> don't need to decompress twice.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
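(On the decompression point above: delegating to np.lib._datasource,
the way genfromtxt does, would look roughly like the sketch below. The
open_input name and surrounding code are placeholders, not loadtable's
actual API.)

import numpy as np

def open_input(fname):
    # np.lib._datasource.open handles plain, .gz and .bz2 files (and URLs)
    # transparently, so the parser only ever sees a file-like object.
    if isinstance(fname, str):
        return np.lib._datasource.open(fname, 'r')
    return fname  # already a file-like object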


