[Numpy-discussion] loadtxt ndmin option

Benjamin Root ben.root at ou.edu
Thu May 5 15:46:17 EDT 2011


On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers <ralf.gommers at googlemail.com> wrote:

>
>
> On Thu, May 5, 2011 at 9:18 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>>
>>
>> On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes <
>> paul.anton.letnes at gmail.com> wrote:
>>
>>>
>>> On 5 May 2011, at 08:49, Benjamin Root wrote:
>>>
>>> >
>>> >
>>> > On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes <
>>> paul.anton.letnes at gmail.com> wrote:
>>> >
>>> > On 4 May 2011, at 20:33, Benjamin Root wrote:
>>> >
>>> > > On Wed, May 4, 2011 at 7:54 PM, Derek Homeier <
>>> derek at astro.physik.uni-goettingen.de> wrote:
>>> > > On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
>>> > >
>>> > > > But: Aren't the numpy.atleast_2d and numpy.atleast_1d functions
>>> written for this? Shouldn't we reuse them? Perhaps it's overkill, and
>>> perhaps it will reintroduce the 'transposed' problem?
>>> > >
>>> > > Yes, good point, one could replace the
>>> > > X.shape = (X.size, ) with X = np.atleast_1d(X),
>>> > > but for the ndmin=2 case, we'd need to replace
>>> > > X.shape = (X.size, 1) with X = np.atleast_2d(X).T -
>>> > > not sure which solution is more efficient in terms of memory access
>>> etc...
>>> > >
>>> > > Cheers,
>>> > >                                                Derek
>>> > >
>>> > >
>>> > > I can confirm that the current behavior is not sufficient for all of
>>> the original corner cases that ndmin was supposed to address.  Keep in mind
>>> that np.loadtxt takes a one-column data file and a one-row data file down to
>>> the same shape.  I don't see how the current code is able to produce the
>>> correct array shape when ndmin=2.  Do we have some sort of counter in
>>> loadtxt for counting the number of rows and columns read?  Could we use
>>> those to help guide the ndmin=2 case?
>>> > >
>>> > > I think that using atleast_1d(X) might be a bit overkill, but it
>>> would make the code's intent very clear.  I don't think we have to worry
>>> about memory usage if we limit its use to only situations where ndmin is
>>> greater than the number of dimensions of the array.  In those cases, the
>>> array is either an empty result, a scalar value (in which memory access is
>>> trivial), or 1-d (in which a transpose is cheap).
>>> >
>>> > What if one does things the other way around - avoid calling squeeze
>>> until _after_ doing the atleast_Nd() magic? That way the row/column
>>> information should be conserved, right? Also, we avoid transposing, memory
>>> use, ...
>>> >
>>> > Oh, and someone could conceivably have a _looong_ 1D file, but would
>>> want it read as a 2D array.
>>> >
>>> > Paul
>>> >
>>> >
>>> >
>>> > @Derek, good catch noticing the error in the tests. We do still
>>> need to handle the case I mentioned, however.  I have attached an example
>>> script to demonstrate the issue.  In this script, I would expect the
>>> second-to-last array to have a shape of (1, 5).  I believe that the
>>> single-row, multi-column case is actually a more common edge case for
>>> users than the others.  Therefore, I believe that this ndmin fix is not
>>> adequate until this is addressed.
>>> >
>>> > @Paul, we can't call squeeze after doing the atleast_Nd() magic.  That
>>> would just undo whatever we had just done.  Also, wrt the transpose, a (1,
>>> 100000) array looks the same in memory as a (100000, 1) array, right?
>>>
>>> Agree. I thought more along the lines of (pseudocode-ish):
>>> if ndmin == 0:
>>>     X = np.squeeze(X)
>>> elif ndmin == 1:
>>>     X = np.atleast_1d(X)
>>> elif ndmin == 2:
>>>     X = np.atleast_2d(X)
>>> else:
>>>     # not sure what would go here, maybe raise ValueError?
>>>     raise ValueError("ndmin must be 0, 1 or 2")
>>>
>>> That would avoid the squeeze call before the atleast_Nd magic. But the
>>> code was changed, so I think my comment doesn't make sense anymore. It's
>>> probably fine the way it is!
>>>
>>> Paul
>>>
>>>
>> I have thought of that too, but the problem with that approach is that
>> after reading the file, X will have 2 or 3 dimensions, regardless of how
>> many singleton dims were in the file.  A squeeze will always be needed.
>> Also, the purpose of squeeze is opposite that of the atleast_*d()
>> functions:  squeeze reduces dimensions, while atleast_*d will add
>> dimensions.
>>
>> Therefore, I reiterate... the patch by Derek gets the job done.  I have
>> tested it for a wide variety of inputs, for both regular arrays and record
>> arrays.  Is there room for improvement?  Yes, but I think that can wait
>> until later.  Derek's patch, however, fixes an important bug in the ndmin
>> implementation and should be included in the release.
>>
> Two questions: can you point me to the patch/ticket, and is this a
> regression?
>
> Thanks,
> Ralf
>
>
>
I don't know if he did a pull-request or not, but here is the link he
provided earlier in the thread.

https://github.com/dhomeier/numpy/compare/master...ndmin-cols

Technically, this is not a "regression", since the ndmin feature is new in
this release.  However, the problem that ndmin is supposed to address is not
fixed by the current implementation in the rc.  Essentially, a single-row,
multi-column file read with ndmin=2 comes out as an Nx1 array, which is the
same result as for a multi-row, single-column file.  My feeling is that if we
let the current implementation stand as is, and developers start using it in
their code, then fixing it in a later release would introduce more problems
(maybe developers would transpose the result themselves, or something
similar).  Better to fix it now in the rc with the two lines of code (and the
correction to the tests) than to introduce a buggy feature that will be hard
to fix in future releases, IMHO.
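
For reference, here is a quick sketch of the shapes in question (hypothetical
file contents typed inline; this is not the attached test script or Derek's
patch):

import numpy as np
from io import StringIO

one_row = StringIO(u"1 2 3 4 5")        # single-row, multi-column file
one_col = StringIO(u"1\n2\n3\n4\n5")    # multi-row, single-column file

a = np.loadtxt(one_row, ndmin=2)
b = np.loadtxt(one_col, ndmin=2)

# A correct ndmin=2 implementation should give a.shape == (1, 5) and
# b.shape == (5, 1); the rc as it stands gives an Nx1 result for both.
print(a.shape, b.shape)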

Ben Root