[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour
Paul Natsuo Kishimoto
mail at paul.kishimoto.name
Fri Jul 13 13:29:31 EDT 2012
On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
> On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
> <mail at paul.kishimoto.name> wrote:
> > Hello everyone,
> >
> > I am a longtime NumPy user, and I just filed my first contribution to
> > the code as a pull request to fix what I felt was a bug in the behaviour
> > of genfromtxt(): https://github.com/numpy/numpy/pull/351
> > It turns out this alters existing behaviour that some people may depend
> > on, so I was encouraged to raise the issue on this list to see what the
> > consensus was.
> >
> > This behaviour happens in the specific situation where:
> > * Comments are used in the file (the default comment character is
> > '#', which I'll use here), AND
> > * The kwarg names=True is given. In this case, genfromtxt() is
> > supposed to read an initial row containing the names of the
> > columns and return an array with a structured dtype.
> >
> > Currently, these options work with a file like (Example 1):
> >
> > # gender age weight
> > M 21 72.100000
> > F 35 58.330000
> > M 33 21.99
> >
> > …but NOT with a file like (Example 2):
> >
> > # here is a general file comment
> > # it is spread over multiple lines
> > gender age weight
> > M 21 72.100000
> > F 35 58.330000
> > M 33 21.99
> >
> > …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
> > thinks all of the columns are strings because 'gender', 'age' and
> > 'weight' are not numbers.
> >
> > This is because genfromtxt() (after skipping a number of lines as
> > specified in the optional kwarg skip_header) will use the *first* line
> > it encounters to produce column names. If that line contains a comment
> > character, genfromtxt() discards everything *up to and including* the
> > comment character, and tries to use the content *after* the comment
> > character as headers (Example 3):
> >
> > gender age weight # wrong column names
> > M 21 72.100000
> > F 35 58.330000
> > M 33 21.99
> >
> > …the resulting column names are 'wrong', 'column' and 'names'.
> >
> > My proposed change was that, if the first (or any subsequent) line
> > contains a comment character, it should be treated as an *actual
> > comment*, and discarded along with anything that follows it on the line.
> >
> > In Example 2, the result would be that the first two lines appear empty
> > (no text before '#'), and the third line ("gender age weight") is used
> > for column names.
> >
> > In Example 3, the result would be that "gender age weight" is used for
> > column names while "# wrong column names" is ignored.
> >
> > BUT!
> >
> > In Example 1, the result would be that the first line appears empty,
> > and "M 21 72.100000" is used for column names.
> >
> > In other words, this change would do away with the previous behaviour
> > where the very first commented line was (magically?) treated not as a
> > comment but instead as column headers. This might break some existing
> > code. On the positive side, it would allow the user to be more liberal
> > with the format of input files (Example 4):
> >
> > # here is a general file comment
> > # the columns in this table are
> > gender age weight # here is a comment on the header line
> > # following this line are the data
> > M 21 72.100000
> > F 35 58.330000 # here is a comment on a data line
> > M 33 21.99
> >
> > I feel that this is a better/more flexible behaviour for genfromtxt(),
> > but—as stated—I am interested in your thoughts.
> >
> > Cheers,
> > --
> > Paul Natsuo Kishimoto
> >
> > SM candidate, Technology & Policy Program (2012)
> > Research assistant, http://globalchange.mit.edu
> > https://paul.kishimoto.name +1 617 302 6105
> >
>
> Hi Paul,
>
> At least in astronomy, tabular files with the column definitions in the
> first commented line are reasonably common. This is driven in part by
> wide use of legacy packages like supermongo etc. that don't have
> intelligent table readers, so users document the column names as a
> comment line. I think making this break might be unfortunate for
> users in astronomy.
>
> Dealing with commented header definitions is annoying. Not that it
> matters specifically for your genfromtxt() proposal, but in the
> asciitable reader this case is handled with a particular reader class
> that expects the first comment line to contain the column definitions:
>
> http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
>
> Cheers,
> Tom
Tom,

Thanks for this information. In thinking about how people would work
around this, I figured it would be fairly easy to discard a comment
character that occurred as the very first character in a file, e.g.:

    from io import StringIO
    import numpy
    raw = StringIO(open('example.txt').read()[1:])
    data = numpy.genfromtxt(raw, comments='#', names=True)

…but I realize that making this change in many places would still be an
annoyance.
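
A slightly more defensive variant of this workaround might look like the sketch below. The helper name uncomment_header is hypothetical (it is not part of NumPy), and the sketch assumes the whole file fits in memory as a string. It removes the comment marker only when the text actually starts with one, so files with uncommented headers pass through unchanged.

```python
from io import StringIO


def uncomment_header(text, comments='#'):
    """Return a StringIO of `text` with a leading comment marker removed.

    Hypothetical helper: only the very start of the text is touched, and
    only when it begins with `comments`; everything else is left as-is.
    """
    if text.startswith(comments):
        text = text[len(comments):]
    return StringIO(text)


example = "# gender age weight\nM 21 72.100000\nF 35 58.330000\n"
buffer = uncomment_header(example)
header = buffer.readline().split()  # peek at the column names
buffer.seek(0)  # rewind so a later reader sees the header line again
```

The rewound buffer could then be handed to numpy.genfromtxt(buffer, names=True) in place of the file name.
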
I should perhaps also add that my view of 'proper' table formats is
partly influenced by another plotting package, namely pgfplots for LaTeX
(http://pgfplots.sourceforge.net/,
http://pgfplots.sourceforge.net/gallery.html), which uses uncommented
headers. To the extent NumPy users are also LaTeX users, similar
semantics could be more friendly.
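
To make the semantics from my original message concrete, here is a rough pure-Python sketch of the comment handling I have in mind. It is not the code in the pull request and not how genfromtxt() is implemented; the function name split_header_and_rows is made up for illustration, and it assumes whitespace-delimited columns. Every line is cut at the comment character, lines left empty are skipped, and the first line that still has content supplies the column names.

```python
def split_header_and_rows(lines, comments='#'):
    """Illustrative only: treat `comments` as a true comment marker."""
    header = None
    rows = []
    for line in lines:
        # Discard the comment character and everything after it.
        content = line.split(comments, 1)[0].strip()
        if not content:
            continue  # blank or comment-only line
        fields = content.split()
        if header is None:
            header = fields  # first non-empty line names the columns
        else:
            rows.append(fields)
    return header, rows


example_4 = """\
# here is a general file comment
# the columns in this table are
gender age weight # here is a comment on the header line
# following this line are the data
M 21 72.100000
F 35 58.330000 # here is a comment on a data line
M 33 21.99
"""
header, rows = split_header_and_rows(example_4.splitlines())
```

With Example 4 as input, header comes out as the three intended column names. With Example 1, the same rule picks up "M 21 72.100000" as the header, which is exactly the backwards-compatibility problem under discussion.
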
Looking forward to more input from other users,
--
Paul Natsuo Kishimoto
SM candidate, Technology & Policy Program (2012)
Research assistant, http://globalchange.mit.edu
https://paul.kishimoto.name +1 617 302 6105