[Numpy-discussion] memory-efficient loadtxt

Sun Sep 30 10:16:41 EDT 2012

For convenience and clarity, this is the diff in question:
https://github.com/Dynetrekk/numpy-1/commit/5bde67531a2005ef80a2690a75c65cebf97c9e00

And this is my numpy fork:
https://github.com/Dynetrekk/numpy-1/

Paul


On Sun, Sep 30, 2012 at 4:14 PM, Paul Anton Letnes
<paul.anton.letnes at gmail.com> wrote:
> Hello everyone,
>
> I've modified loadtxt to make it (potentially) more memory efficient.
> The idea is that if a user passes a seekable file, (s)he can also pass
> the 'seekable=True' kwarg. Then, loadtxt will count the number of
> lines (containing data) and allocate an array of exactly the right
> size to hold the loaded data. The downside is that the line counting
> more than doubles the runtime, as it loops over the file twice, and
> there's a sort-of unnecessary np.array function call in the loop. The
> branch is called faster-loadtxt, which is silly due to the runtime
> doubling, but I'm hoping that the false advertising is acceptable :)
> (I naively expected a speedup by removing some needless list
> manipulation.)
>
> I'm pretty sure that the function can be micro-optimized quite a bit
> here and there, and in particular, the main for loop is a bit
> duplicated right now. However, I got the impression that someone was
> working on a More Advanced (TM) C-based file reader, which will
> replace loadtxt; this patch is intended as a useful thing to have
> while we're waiting for that to appear.
>
> The patch passes all tests in the test suite, and documentation for
> the kwarg has been added. I've modified all tests to include the
> seekable kwarg, but that was mostly to check that all tests also pass
> with this kwarg. I guess it's a bit too late for 1.7.0 though?
>
> Should I make a pull request? I'm happy to take any and all
> suggestions before I do.
>
> Cheers
> Paul
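
For readers of the thread, the two-pass idea described in the quoted
message can be sketched roughly as below. This is not the code from the
linked commit: the function name loadtxt_two_pass, its parameters, and
the comment/blank-line handling are assumptions made for illustration.
It only shows the general technique of counting data lines first and
then filling a preallocated array.

```python
import io
import numpy as np


def loadtxt_two_pass(fh, comments="#", delimiter=None):
    """Hypothetical sketch: count data lines, then fill a preallocated array."""
    if not fh.seekable():
        raise ValueError("two-pass loading needs a seekable file object")

    def data_fields(line):
        # Drop anything after the comment marker; blank lines yield no fields.
        text = line.split(comments, 1)[0].strip()
        return text.split(delimiter) if text else []

    start = fh.tell()

    # Pass 1: count the lines that contain data and note the column count.
    n_rows = 0
    n_cols = 0
    for line in fh:
        fields = data_fields(line)
        if fields:
            if n_rows == 0:
                n_cols = len(fields)
            n_rows += 1

    # Allocate exactly the right amount of memory, then rewind the file.
    out = np.empty((n_rows, n_cols), dtype=float)
    fh.seek(start)

    # Pass 2: parse each data line straight into its row of the array.
    row = 0
    for line in fh:
        fields = data_fields(line)
        if fields:
            out[row] = [float(f) for f in fields]
            row += 1
    return out


if __name__ == "__main__":
    sample = io.StringIO("# header\n1 2 3\n\n4 5 6  # trailing comment\n")
    print(loadtxt_two_pass(sample))
```

The trade-off is the one mentioned in the message: the file is read
twice, so the runtime goes up, but no intermediate list of rows is
built, and peak memory stays close to the size of the final array.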