[Numpy-discussion] Add `nrows` to `genfromtxt`
Warren Weckesser
warren.weckesser at gmail.com
Sun Nov 2 17:24:38 EST 2014
On 11/2/14, Alexander Belopolsky <ndarray at mac.com> wrote:
> On Sun, Nov 2, 2014 at 2:32 PM, Warren Weckesser
> <warren.weckesser at gmail.com> wrote:
>
>>
>>> Still, the case of dtype=None, name=None is problematic. Suppose I want
>>> genfromtxt() to detect the column names from the 1st row and data types
>>> from the 3rd. How would you do that?
>>>
>>>
>>
>> This may sound like a cop out, but at some point, I stop trying to make
>> genfromtxt() handle every possible case, and instead I would write a custom
>> header reader to handle this.
>>
>
> In the abstract, I would agree with you. It is often the case that 2-3
> lines of clear Python code is better than a terse function call with half a
> dozen non-obvious options. Specifically, I would be against the proposed
> slice_rows because it is either equivalent to genfromtxt(islice(..), ..)
> or hard to specify.
I don't have much more to add to the API discussion at the moment, but
I want to make sure one aspect is clear. (Sorry for the noise if the
following is obvious.)
In an earlier email, I gave my interpretation of the semantics of
`slice_rows` (and `max_rows`), which is that `genfromtxt(f, ...,
slice_rows=arg)` produces the same result as `genfromtxt(f,
...)[arg]`. (The difference is that it only consumes items from the
input iterator `f` as required by `arg`.) This isn't the same as
`genfromtxt(islice(f, <slice args>), ...)`, because `genfromtxt` skips
comments and blank lines. (It also skips invalid lines if the
argument `invalid_raise=False` is used.) So if the input file was
-----
1 10
# A comment.
2 20

3 30
4 40
5 50
-----
Then `genfromtxt(f, dtype=int, slice_rows=slice(4))` would produce
`array([[1, 10], [2, 20], [3, 30], [4, 40]])`, while
`genfromtxt(islice(f, 4), dtype=int)` would produce `array([[1, 10],
[2, 20]])`.
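
To make the difference concrete, here is a minimal sketch. `slice_rows`
is only a proposal, so it is emulated below by indexing the full result,
which is the equivalence described above (without the early stop). The
sample text assumes a blank line after `2 20`; some blank line among the
first four input lines is needed to get the two-row `islice` result.

```python
import io
from itertools import islice

import numpy as np

# Sample input: one comment line and one blank line (the blank line's
# position is an assumption) mixed in with five data rows.
TEXT = "1 10\n# A comment.\n2 20\n\n3 30\n4 40\n5 50\n"

# Slicing the input: islice counts raw lines, so the comment and the
# blank line use up two of the four lines that genfromtxt gets to see.
sliced_input = np.genfromtxt(islice(io.StringIO(TEXT), 4), dtype=int)

# Slicing the output: the proposed meaning of slice_rows=slice(4),
# emulated by reading everything and indexing afterwards.
sliced_output = np.genfromtxt(io.StringIO(TEXT), dtype=int)[slice(4)]

print(sliced_input.tolist())   # [[1, 10], [2, 20]]
print(sliced_output.tolist())  # [[1, 10], [2, 20], [3, 30], [4, 40]]
```

The first call counts lines of the file; the second counts rows of the
resulting array.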
That's my interpretation of how `max_rows` or `slice_rows` should
work. If that is not what other folks expect, then that should also
be part of the discussion.
Warren
>
> On the other hand, skip_rows is different for two reasons:
>
> 1. It is not a new option. It is currently a deprecated alias to
> skip_header, so a change is expected - either removal or redefinition.
> 2. The intended use-case - inferring column names and type information from
> a file where data is separated from the column names - is hard to code
> explicitly. (Try it!)
>
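
For the simplest version of that use-case, a custom header reader might
look like the sketch below. It is only an illustration under stated
assumptions: the file layout (names on line 1, a units line on line 2,
data from line 3), the sample text, and the helper name
`read_with_header` are all hypothetical. It does not handle comments or
blank lines ahead of the header, which is part of what makes the general
case hard.

```python
import io

import numpy as np

# Hypothetical layout: column names on line 1, a units line on line 2,
# and data starting on line 3.
TEXT = "id value label\n- meters -\n1 2.5 a\n2 3.5 b\n"


def read_with_header(f):
    """Read names from the first line, skip the second, infer types from the rest."""
    names = next(f).split()  # column names from the 1st row
    next(f)                  # discard the line between names and data
    # genfromtxt continues from the current position of the open file.
    # encoding=None asks for str (not bytes) in string columns.
    return np.genfromtxt(f, dtype=None, names=names, encoding=None)


table = read_with_header(io.StringIO(TEXT))
print(table.dtype.names)
print(table["value"])
```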