[Numpy-discussion] loadtxt ndmin option

Thu May 5 16:53:04 EDT 2011

On 5 May 2011, at 22:31, Ralf Gommers wrote:

>
> On Thu, May 5, 2011 at 9:46 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers <ralf.gommers at googlemail.com 
> > wrote:
>
>
> On Thu, May 5, 2011 at 9:18 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes <paul.anton.letnes at gmail.com 
> > wrote:
>
> On 5. mai 2011, at 08.49, Benjamin Root wrote:
>
> >
> >
> > On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes <paul.anton.letnes at gmail.com 
> > wrote:
> >
> > On 4. mai 2011, at 20.33, Benjamin Root wrote:
> >
> > > On Wed, May 4, 2011 at 7:54 PM, Derek Homeier <derek at astro.physik.uni-goettingen.de 
> > wrote:
> > > On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
> > >
> > > > But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions  
> written for this? Shouldn't we reuse them? Perhaps it's overkill,  
> and perhaps it will reintroduce the 'transposed' problem?
> > >
> > > Yes, good point, one could replace the
> > > X.shape = (X.size, ) with X = np.atleast_1d(X),
> > > but for the ndmin=2 case, we'd need to replace
> > > X.shape = (X.size, 1) with X = np.atleast_2d(X).T -
> > > not sure which solution is more efficient in terms of memory  
> access etc...
> > >
> > > Cheers,
> > >                                                Derek
> > >
> > >
> > > I can confirm that the current behavior is not sufficient for  
> all of the original corner cases that ndmin was supposed to  
> address.  Keep in mind that np.loadtxt takes a one-column data file  
> and a one-row data file down to the same shape.  I don't see how the  
> current code is able to produce the correct array shape when  
> ndmin=2.  Do we have some sort of counter in loadtxt for counting  
> the number of rows and columns read?  Could we use those to help  
> guide the ndmin=2 case?
> > >
> > > I think that using atleast_1d(X) might be a bit overkill, but it  
> would be very clear as to the code's intent.  I don't think we have  
> to worry about memory usage if we limit its use to only situations  
> where ndmin is greater than the number of dimensions of the array.   
> In those cases, the array is either an empty result, a scalar value  
> (in which memory access is trivial), or 1-d (in which a transpose is  
> cheap).
> >
> > What if one does things the other way around - avoid calling  
> squeeze until _after_ doing the atleast_Nd() magic? That way the row/ 
> column information should be conserved, right? Also, we avoid  
> transposing, memory use, ...
> >
> > Oh, and someone could conceivably have a _looong_ 1D file, but  
> would want it read as a 2D array.
> >
> > Paul
> >
> >
> >
> > @Derek, good catch with noticing the error in the tests. We do  
> still need to handle the case I mentioned, however.  I have attached  
> an example script to demonstrate the issue.  In this script, I would  
> expect the second-to-last array to be a shape of (1, 5).  I believe  
> that the single-row, multi-column case would actually be the more  
> common type of edge-case encountered by users than the others.   
> Therefore, I believe that this ndmin fix is not adequate until this  
> is addressed.
> >
> > @Paul, we can't call squeeze after doing the atleast_Nd() magic.   
> That would just undo whatever we had just done.  Also, wrt the  
> transpose, a (1, 100000) array looks the same in memory as a  
> (100000, 1) array, right?
> Agree. I thought more along the lines of (pseudocode-ish)
> if ndmin == 0:
>        squeeze()
> if ndmin == 1:
>        atleast_1D()
> elif ndmin == 2:
>        atleast_2D()
> else:
>        I don't rightly know what would go here, maybe raise  
> ValueError?
>
> That would avoid the squeeze call before the atleast_Nd magic. But  
> the code was changed, so I think my comment doesn't make sense  
> anymore. It's probably fine the way it is!
>
> Paul
>
>
> I have thought of that too, but the problem with that approach is  
> that after reading the file, X will have 2 or 3 dimensions,  
> regardless of how many singleton dims were in the file.  A squeeze  
> will always be needed.  Also, the purpose of squeeze is opposite  
> that of the atleast_*d() functions:  squeeze reduces dimensions,  
> while atleast_*d will add dimensions.
>
> Therefore, I re-iterate... the patch by Derek gets the job done.  I  
> have tested it for a wide variety of inputs for both regular arrays  
> and record arrays.  Is there room for improvements?  Yes, but I  
> think that can wait for later.  Derek's patch however fixes an  
> important bug in the ndmin implementation and should be included for  
> the release.
>
Thanks for the additional tests and discussion, Paul and Ben! Yes, I  
think the
main issue is that multi-column input first comes back as 3- 
dimensional, with
a "singleton" dimension in front, so I was getting rid of that first -  
maybe could
add a comment on that. Single-column, multi-row apparently already come
back as 1-dimensional, so we only need the expansion there.

> Two questions: can you point me to the patch/ticket, and is this a  
> regression?
>
> Thanks,
> Ralf
>
This was ticket #1562, had been closed as "resolved" in the meantime,  
though.
>
> I don't know if he did a pull-request or not, but here is the link  
> he provided earlier in the thread.
>
> https://github.com/dhomeier/numpy/compare/master...ndmin-cols
>
> Technically, this is not a "regression" as the ndmin feature is new  
> in this release.
>
> Yes right, I forgot this was a recent change.
>
> However, the problem that ndmin is supposed to address is not fixed  
> by the current implementation for the rc.  Essentially, a single- 
> row, multi-column file with ndmin=2 comes out as a Nx1 array which  
> is the same result for a multi-row, single-column file.  My feeling  
> is that if we let the current implementation stand as is, and  
> developers use it in their code, then fixing it in a later release  
> would introduce more problems (maybe the devels would transpose the  
> result themselves or something).  Better to fix it now in rc with  
> the two lines of code (and the correction to the tests), then to  
> introduce a buggy feature that will be hard to fix in future  
> releases, IMHO.
>
> Looks okay, and I agree that it's better to fix it now. The timing  
> is a bit unfortunate though, just after RC2. I'll have closer look  
> tomorrow and if it can go in, probably tag RC3.
>
> If in the meantime a few more people could test this, that would be  
> helpful.
>
> Ralf

I agree, wish I had time to push this before rc2. I could add the  
explanatory comments
mentioned above and switch to use the atleast_[12]d() solution, test  
that and push it
in a couple of minutes, or should I better leave it as is now for  
testing?

Cheers,
						Derek