[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Fri Jul 13 13:29:31 EDT 2012

On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
> On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
> <mail at paul.kishimoto.name> wrote:
> > Hello everyone,
> >
> >         I am a longtime NumPy user, and I just filed my first contribution to
> > the code as a pull request to fix what I felt was a bug in the behaviour
> > of genfromtxt() https://github.com/numpy/numpy/pull/351
> > It turns out this alters existing behaviour that some people may depend
> > on, so I was encouraged to raise the issue on this list to see what the
> > consensus was.
> >
> > This behaviour happens in the specific situation where:
> >       * Comments are used in the file (the default comment character is
> >         '#', which I'll use here), AND
> >       * The kwarg names=True is given. In this case, genfromtxt() is
> >         supposed to read an initial row containing the names of the
> >         columns and return an array with a structured dtype.
> >
> > Currently, these options work with a file like (Example #1):
> >
> >         # gender age weight
> >         M   21 72.100000
> >         F   35  58.330000
> >         M   33  21.99
> >
> > …but NOT with a file like (Example #2):
> >
> >         # here is a general file comment
> >         # it is spread over multiple lines
> >         gender age weight
> >         M   21 72.100000
> >         F   35  58.330000
> >         M   33  21.99
> >
> > …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
> > thinks all of the columns are strings because 'gender', 'age' and
> > 'weight' are not numbers.
> >
> >         This is because genfromtxt() (after skipping a number of lines as
> > specified in the optional kwarg skip_header) will use the *first* line
> > it encounters to produce column names. If that line contains a comment
> > character, genfromtxt() discards everything *up to and including* the
> > comment character, and tries to use the content *after* the comment
> > character as headers (Example 3):
> >
> >         gender age weight # wrong column names
> >         M   21  72.100000
> >         F   35  58.330000
> >         M   33  21.99
> >
> > …the resulting column names are 'wrong', 'column' and 'names'.
> >
> > My proposed change was that, if the first (or any subsequent) line
> > contains a comment character, it should be treated as an *actual
> > comment*, and discarded along with anything that follows it on the line.
> >
> >         In Example 2, the result would be that the first two lines appear empty
> > (no text before '#'), and the third line ("gender age weight") is used
> > for column names.
> >
> >         In Example 3, the result would be that "gender age weight" is used for
> > column names while "# wrong column names" is ignored.
> >
> > BUT!
> >
> >         In Example 1, the result would be that the first line appears empty,
> > and the first data row ("M   21 72.100000") is used for column names.
> >
> > In other words, this change would do away with the previous behaviour
> > where the very first commented line was (magically?) treated not as a
> > comment but instead as column headers. This might break some existing
> > code. On the positive side, it would allow the user to be more liberal
> > with the format of input files (Example 4):
> >
> >         # here is a general file comment
> >         # the columns in this table are
> >         gender age weight # here is a comment on the header line
> >         # following this line are the data
> >         M   21  72.100000
> >         F   35  58.330000 # here is a comment on a data line
> >         M   33  21.99
> >
> > I feel that this is a better/more flexible behaviour for genfromtxt(),
> > but—as stated—I am interested in your thoughts.
> >
> > Cheers,
> > --
> > Paul Natsuo Kishimoto
> >
> > SM candidate, Technology & Policy Program (2012)
> > Research assistant,  http://globalchange.mit.edu
> > https://paul.kishimoto.name      +1 617 302 6105
> >
> 
> Hi Paul,
> 
> At least in astronomy, tabular files with the column definitions in the
> first commented line are reasonably common.  This is driven in part by
> wide use of legacy packages like supermongo etc. that don't have
> intelligent table readers, so users document the column names as a
> comment line.  I think making this break might be unfortunate for
> users in astronomy.
> 
> Dealing with commented header definitions is annoying.  Not that it
> matters specifically for your genfromtxt() proposal, but in the
> asciitable reader this case is handled with a particular reader class
> that expects the first comment line to contain the column definitions:
> 
>  http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
> 
> Cheers,
> Tom

Tom,

Thanks for this information. In thinking about how people would work
around this, I figured it would be fairly easy to discard a comment
character that occurred as the very first character in a file, e.g.:

        from io import StringIO
        import numpy
        raw = StringIO(open('example.txt').read()[1:])  # drop the leading '#'
        data = numpy.genfromtxt(raw, comments='#', names=True)

…but I realize that making this change in many places would still be an
annoyance.
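
A slightly more careful variant of the workaround above, as a minimal sketch only: it removes the comment character solely when the file really begins with one, so the same call could be used on files with or without a commented header. The helper name strip_leading_comment is hypothetical (it is not part of NumPy), and the sample text is made up for illustration.

```python
from io import StringIO


def strip_leading_comment(text, comment='#'):
    """Return a file-like object with a leading comment marker removed.

    Hypothetical helper, not a NumPy function. Only the very start of
    the text is examined; comment markers anywhere else are left alone.
    """
    if text.startswith(comment):
        text = text[len(comment):]
    return StringIO(text)


# Made-up sample in the style of Example 1 (commented header line).
sample = "# gender age weight\nM 21 72.1\nF 35 58.33\n"
raw = strip_leading_comment(sample)

# The first line no longer carries the comment marker.
header = raw.getvalue().splitlines()[0].split()
print(header)
```

The returned object could then be passed to numpy.genfromtxt(raw, comments='#', names=True) in place of a filename. A file whose first line is already an uncommented header passes through unchanged.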

	I should perhaps also add that my view of 'proper' table formats is
partly influenced by another plotting package, namely pgfplots for LaTeX
(http://pgfplots.sourceforge.net/ ,
http://pgfplots.sourceforge.net/gallery.html) which uses uncommented
headers. To the extent NumPy users are also LaTeX users, similar
semantics could be more friendly.

Looking forward to more input from other users,
-- 
Paul Natsuo Kishimoto

SM candidate, Technology & Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
https://paul.kishimoto.name      +1 617 302 6105