[Numpy-discussion] loadtxt ndmin option
Derek Homeier
derek at astro.physik.uni-goettingen.de
Thu May 5 16:53:04 EDT 2011
On 5 May 2011, at 22:31, Ralf Gommers wrote:
>
> On Thu, May 5, 2011 at 9:46 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers <ralf.gommers at googlemail.com
> > wrote:
>
>
> On Thu, May 5, 2011 at 9:18 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes <paul.anton.letnes at gmail.com
> > wrote:
>
> On 5. mai 2011, at 08.49, Benjamin Root wrote:
>
> >
> >
> > On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes <paul.anton.letnes at gmail.com
> > wrote:
> >
> > On 4. mai 2011, at 20.33, Benjamin Root wrote:
> >
> > > On Wed, May 4, 2011 at 7:54 PM, Derek Homeier <derek at astro.physik.uni-goettingen.de
> > wrote:
> > > On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
> > >
> > > > But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions
> written for this? Shouldn't we reuse them? Perhaps it's overkill,
> and perhaps it will reintroduce the 'transposed' problem?
> > >
> > > Yes, good point, one could replace the
> > > X.shape = (X.size, ) with X = np.atleast_1d(X),
> > > but for the ndmin=2 case, we'd need to replace
> > > X.shape = (X.size, 1) with X = np.atleast_2d(X).T -
> > > not sure which solution is more efficient in terms of memory
> access etc...
> > >
> > > Cheers,
> > > Derek
> > >
> > >
> > > I can confirm that the current behavior is not sufficient for
> all of the original corner cases that ndmin was supposed to
> address. Keep in mind that np.loadtxt takes a one-column data file
> and a one-row data file down to the same shape. I don't see how the
> current code is able to produce the correct array shape when
> ndmin=2. Do we have some sort of counter in loadtxt for counting
> the number of rows and columns read? Could we use those to help
> guide the ndmin=2 case?
> > >
> > > I think that using atleast_1d(X) might be a bit overkill, but it
> would be very clear as to the code's intent. I don't think we have
> to worry about memory usage if we limit its use to only situations
> where ndmin is greater than the number of dimensions of the array.
> In those cases, the array is either an empty result, a scalar value
> (in which memory access is trivial), or 1-d (in which a transpose is
> cheap).
> >
> > What if one does things the other way around - avoid calling
> squeeze until _after_ doing the atleast_Nd() magic? That way the row/
> column information should be conserved, right? Also, we avoid
> transposing, memory use, ...
> >
> > Oh, and someone could conceivably have a _looong_ 1D file, but
> would want it read as a 2D array.
> >
> > Paul
> >
> >
> >
> > @Derek, good catch with noticing the error in the tests. We do
> still need to handle the case I mentioned, however. I have attached
> an example script to demonstrate the issue. In this script, I would
> expect the second-to-last array to be a shape of (1, 5). I believe
> that the single-row, multi-column case would actually be the more
> common type of edge-case encountered by users than the others.
> Therefore, I believe that this ndmin fix is not adequate until this
> is addressed.
> >
> > @Paul, we can't call squeeze after doing the atleast_Nd() magic.
> That would just undo whatever we had just done. Also, wrt the
> transpose, a (1, 100000) array looks the same in memory as a
> (100000, 1) array, right?
> Agree. I thought more along the lines of (pseudocode-ish)
> if ndmin == 0:
> squeeze()
> if ndmin == 1:
> atleast_1D()
> elif ndmin == 2:
> atleast_2D()
> else:
> I don't rightly know what would go here, maybe raise
> ValueError?
>
> That would avoid the squeeze call before the atleast_Nd magic. But
> the code was changed, so I think my comment doesn't make sense
> anymore. It's probably fine the way it is!
>
> Paul
>
>
> I have thought of that too, but the problem with that approach is
> that after reading the file, X will have 2 or 3 dimensions,
> regardless of how many singleton dims were in the file. A squeeze
> will always be needed. Also, the purpose of squeeze is opposite
> that of the atleast_*d() functions: squeeze reduces dimensions,
> while atleast_*d will add dimensions.
>
> Therefore, I re-iterate... the patch by Derek gets the job done. I
> have tested it for a wide variety of inputs for both regular arrays
> and record arrays. Is there room for improvements? Yes, but I
> think that can wait for later. Derek's patch however fixes an
> important bug in the ndmin implementation and should be included for
> the release.
>
Thanks for the additional tests and discussion, Paul and Ben! Yes, I
think the
main issue is that multi-column input first comes back as 3-
dimensional, with
a "singleton" dimension in front, so I was getting rid of that first -
maybe could
add a comment on that. Single-column, multi-row apparently already come
back as 1-dimensional, so we only need the expansion there.
> Two questions: can you point me to the patch/ticket, and is this a
> regression?
>
> Thanks,
> Ralf
>
This was ticket #1562, had been closed as "resolved" in the meantime,
though.
>
> I don't know if he did a pull-request or not, but here is the link
> he provided earlier in the thread.
>
> https://github.com/dhomeier/numpy/compare/master...ndmin-cols
>
> Technically, this is not a "regression" as the ndmin feature is new
> in this release.
>
> Yes right, I forgot this was a recent change.
>
> However, the problem that ndmin is supposed to address is not fixed
> by the current implementation for the rc. Essentially, a single-
> row, multi-column file with ndmin=2 comes out as a Nx1 array which
> is the same result for a multi-row, single-column file. My feeling
> is that if we let the current implementation stand as is, and
> developers use it in their code, then fixing it in a later release
> would introduce more problems (maybe the devels would transpose the
> result themselves or something). Better to fix it now in rc with
> the two lines of code (and the correction to the tests), then to
> introduce a buggy feature that will be hard to fix in future
> releases, IMHO.
>
> Looks okay, and I agree that it's better to fix it now. The timing
> is a bit unfortunate though, just after RC2. I'll have closer look
> tomorrow and if it can go in, probably tag RC3.
>
> If in the meantime a few more people could test this, that would be
> helpful.
>
> Ralf
I agree, wish I had time to push this before rc2. I could add the
explanatory comments
mentioned above and switch to use the atleast_[12]d() solution, test
that and push it
in a couple of minutes, or should I better leave it as is now for
testing?
Cheers,
Derek
More information about the NumPy-Discussion
mailing list