[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Pierre GM pgmdevlist at gmail.com
Mon Jul 16 16:15:53 EDT 2012


Tom, I agree that the documentation should be updated (both the docstring and the relevant parts of the user manual), and specific unit-tests added. Paul, that's a direct nudge ;) (I'm sure you don't mind).

I was also considering the weird case
>>> first_line = "# A B C #1 #2 #3"
How many columns in that case? 6? 3?
So, instead of using a `split`, maybe we should just check
>>> index = first_line.index(comments)
and take `first_line[:index]` (or `first_line[index+1:]`, depending on the case).

But then again, it's a weird case.  
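
To make the 3-versus-6 ambiguity concrete, here is a minimal sketch of the two strategies. It is not the genfromtxt implementation: the helper names `names_by_split` and `names_by_index` are hypothetical, and plain whitespace splitting stands in for the real line splitter.

```python
def names_by_split(first_line, comments="#"):
    """Split on the comment marker and keep the first non-empty chunk."""
    parsed = [m for m in first_line.split(comments) if m.strip()]
    return parsed[0].split() if parsed else []


def names_by_index(first_line, comments="#"):
    """Cut the line once, at the first comment marker only."""
    stripped = first_line.strip()
    index = stripped.find(comments)
    if index < 0:
        # No comment marker at all: the whole line holds the names.
        return stripped.split()
    if index == 0:
        # Commented header: the names follow the marker.
        return stripped[index + len(comments):].split()
    # Names first, trailing comment afterwards.
    return stripped[:index].split()


first_line = "# A B C #1 #2 #3"
print(names_by_split(first_line))  # ['A', 'B', 'C']
print(names_by_index(first_line))  # ['A', 'B', 'C', '#1', '#2', '#3']
```

The split-based version yields 3 names and the index-based version yields 6, so the choice between them decides how many columns the weird case produces.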



--  
Pierre GM


On Monday, July 16, 2012 at 22:00 , Tom Aldcroft wrote:

> On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto
> <mail at paul.kishimoto.name> wrote:
> > I've implemented this feature with skip_header=-1 as suggested by
> > Pierre, and in doing so removed the regression. TravisBot seems to like
> > it: https://github.com/numpy/numpy/pull/351
> >  
> > On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
> > > To be ultra clear (since I want to code this), you are
> > > suggesting that
> > > 'first_commented_line' be a *new* accepted value for the kwarg
> > > 'names', to invoke the behaviour you suggest?
> > >  
> > >  
> > >  
> > > Nope, I was just referring to some hypothetical variable name. I meant
> > > that:
> > >  
> > > first_values = None
> > > try:
> > >     while not first_values:
> > >         first_line = fhd.next()
> > >         if names is True:
> > >             parsed = [m for m in first_line.split(comments)
> > >                       if m.strip()]
> > >             if parsed:
> > >                 first_values = split_line(parsed[0])
> > >         else:
> > >             ...
> > >  
> > > (it's not tested, I'm writing it as it comes. And I didn't even use
> > > the `first_commented_line` name, sorry)
> > >  
> > >  
> > > If this IS what you mean, I'd counter-propose something in the
> > > same spirit, but a bit simpler…we let the kwarg 'skip_header'
> > > take some additional value, say int(0), int(-1), str('auto'),
> > > or True.
> > >  
> > >  
> > >  
> > >  
> > > In this case, instead of skipping a fixed number of lines, it
> > > will skip any number of consecutive empty OR commented lines;
> > >  
> > >  
> > >  
> > >  
> > > I really like the idea of having `skip_header=-1` skip all the empty
> > > or commented lines (that is, lines whose first non-space character is
> > > the `comments` character). That'd be rather convenient.
> > >  
> > >  
> > >  
> > >  
> > > The semantics of this are more intuitive, because this is what
> > > I am
> > > really after: to *skip* a commented *header* of arbitrary
> > > length. So my four examples below could be parsed with:
> > >  
> > > 1. genfromtxt(..., names=True)
> > > 2. genfromtxt(..., names=True, skip_header=True)
> > > 3. genfromtxt(..., names=True)
> > > 4. genfromtxt(..., names=True, skip_header=True)
> > >  
> > > …crucially #1 avoids the regression.
> > >  
> > >  
> > > Does this seem good to everyone?
> > >  
> > >  
> > >  
> > >  
> > > Sounds good w/ `skip_header=-1`
> > >  
> > >  
> > > But if this is NOT what you mean, then what you say does not
> > > actually work with the simple use-case of my Example #2 below.
> > > The first commented line is "# here is a..." with # as the
> > > first non-space character, so the part after becomes the names
> > > 'here', 'is', 'a' etc.
> > >  
> > >  
> > >  
> > >  
> > > In that case, you could always use `skip_header=2`
> > >  
> > > In short, the code can't resolve the ambiguity without some
> > > extra
> > > information from the user.
> > >  
> > >  
> > >  
> > >  
> > > It's always best not to let the code guess too much anyway...
> > >  
> > > Well, no regression, and you have a nice plan. I'm for it.
> > > Anybody else?
> > >  
> > >  
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at scipy.org
> > > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> > >  
> >  
> >  
> > --
> > Paul Natsuo Kishimoto
> >  
> > SM candidate, Technology & Policy Program (2012)
> > Research assistant, http://globalchange.mit.edu
> > https://paul.kishimoto.name +1 617 302 6105
> >  
> >  
>  
>  
> I think that the proposed solution is OK, but it does make it even
> trickier for the average user to predict the behavior of genfromtxt()
> for different situations. Perhaps as part of this pull request Paul
> should also update the documentation to include a section describing
> this behavior and usage with examples 1 to 4.
>  
> - Tom
>  
>  
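
For reference, the `skip_header=-1` behaviour discussed in the quoted thread (skip any number of consecutive empty or commented lines at the top of the file) could be sketched roughly as below. This is an illustration under assumptions, not the code from the pull request: `is_blank_or_comment` and `skip_commented_header` are hypothetical helper names, and a "commented" line is taken to be one whose first non-space character starts the `comments` string, as described above.

```python
from itertools import dropwhile


def is_blank_or_comment(line, comments="#"):
    """True for empty lines and lines whose first non-space text is the comment marker."""
    stripped = line.strip()
    return not stripped or stripped.startswith(comments)


def skip_commented_header(lines, comments="#"):
    """Drop the leading run of empty or commented lines; keep everything after it."""
    return list(dropwhile(lambda line: is_blank_or_comment(line, comments), lines))


lines = [
    "# here is a header",
    "",
    "   # more comments",
    "a b c",
    "1 2 3",
    "# a comment further down is left alone",
]
print(skip_commented_header(lines))
# ['a b c', '1 2 3', '# a comment further down is left alone']
```

Only the leading block is skipped; commented lines after the first data line are left for the normal comment handling.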

