[SciPy-User] Suggestion for numpy.genfromtxt documentation

Wed Oct 7 16:22:18 EDT 2009

On Wed, Oct 7, 2009 at 3:20 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 10/07/2009 10:52 AM, Skipper Seabold wrote:
>> On Wed, Oct 7, 2009 at 11:25 AM, Dharhas Pothina
>> <Dharhas.Pothina at twdb.state.tx.us>  wrote:
>>
>>> Hi,
>>>
>>> It took me a while and a lot of trial and error to work out why this didn't work as expected.
>>>
>>> data = np.genfromtxt(fname,usecols=(2,3,4),names='x,y,z')
>>>
>>> This command runs without any warnings or errors, but returns a plain numpy array with no field names. If you use:
>>>
>>> data = np.genfromtxt(fname,usecols=(2,3,4),dtype=None,names='x,y,z')
>>>
>>> then the command does what I expect and returns a structured numpy array with field names. So essentially, the 'names' argument does not work unless you also specify the 'dtype' argument.
>>>
> What did you actually expect?
> It would be very informative if you could provide a simple example of
> this for testing.
>
> There are many combinations of arguments so not all have been tested and
> it is not always clear what the expected behavior should be.
>
>>> I think it would be less confusing to new users either to have this explicitly mentioned in the documentation string for the genfromtxt 'names' argument, or to have the function default to 'dtype=None' if the 'names' argument is specified without the 'dtype' argument.
>>>
>>> - dharhas
>>>
>> I came across this behavior recently and agree with you.  There is a
>> patch in the works for this.
>>
>> See this thread: http://thread.gmane.org/gmane.comp.python.numeric.general/33479
>>
>> And this ticket: http://projects.scipy.org/numpy/ticket/1252
>>
>> Cheers,
>>
>> Skipper
>>
>
>  From the numpy help, there is this example:
> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
> ('mystring','S5')], delimiter=",")
>

These examples were added recently, so they may not be in your version of
numpy if you haven't updated.  You can see them here:
http://docs.scipy.org/numpy/docs/numpy.lib.io.genfromtxt/

> It does not help that the dtype of structured arrays also includes the
> actual name. So I do not think we can use the dtype argument without using
> the combination of dtype and name. Perhaps dtype could be split into names
> and formats so that dtype=('name', 'format').
>

In the first example above, since float is the default for dtype, it's
really dtype=float and names=[...].  The names don't get used and it
returns a plain ndarray.  All it would take is zipping float with
each of the names so that it's a valid dtype.  Right now, you could do
dtype="f, f, f" or whatever and names=['var1', 'var2', 'var3'].  In the
second example, dtype=None determines the actual format of the data
from the data itself and constructs the dtype.
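
For what it's worth, here is a minimal sketch of the two spellings that
keep the names.  The inline sample text, the comma delimiter and the
"f8" formats are my own assumptions standing in for fname; the call
without dtype is left out because what it returns depends on the numpy
version.

```python
import io
import numpy as np

# Made-up sample data standing in for `fname`: five comma-separated columns.
text = "0,10,1.5,2.5,3.5\n1,11,4.5,5.5,6.5\n"

# Workaround 1: give one format per selected column, plus the names.
explicit = np.genfromtxt(io.StringIO(text), delimiter=",",
                         usecols=(2, 3, 4),
                         dtype="f8, f8, f8", names=["x", "y", "z"])

# Workaround 2: dtype=None lets genfromtxt infer each column's type.
inferred = np.genfromtxt(io.StringIO(text), delimiter=",",
                         usecols=(2, 3, 4),
                         dtype=None, names="x,y,z")

print(explicit.dtype.names)
print(inferred["x"])
```

Both calls should hand back a structured array whose fields can be
indexed by name.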

> In some sense you are suggesting that we should have something like:
>
> Ignore the use of None and True for dtype and names arguments:

I don't think I (at least) am suggesting to ignore anything from the user.

> i) If only dtype is specified then use the specified dtype and add
> default names such as col1, col2,... if necessary
>

This is what happens right now, but with f0, f1, ... instead of col1, col2, ...
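
A quick way to check the default names, as a sketch with made-up
inline data (the sample text and the "i8, f8" formats are assumptions
for illustration only):

```python
import io
import numpy as np

# Two columns, formats given but no names.
text = "1,2.5\n3,4.5\n"
data = np.genfromtxt(io.StringIO(text), delimiter=",", dtype="i8, f8")

# The fields pick up numpy's default names.
print(data.dtype.names)  # ('f0', 'f1')
```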

> ii) If only names is specified then construct the dtype as ('name',
> 'default format')

Or whatever is passed to dtype.  See above.

> iii) If only formats is specified then construct the dtype as ('default
> name', 'format')

What is formats?  Isn't this the same case as i)?  Are you suggesting
adding a formats keyword?  I suggested `type` to distinguish between a
real dtype and this non-standard behavior that's being proposed now,
but Pierre doesn't seem to think it's necessary, and I guess I agree
as long as new users don't get too confused by this and it's
documented as non-standard.

> iv) If only names and formats are specified then construct the
> dtype as ('name', 'format')
>
> v) If none of dtype, names and formats is specified then construct the
> dtype as ('default name', 'default format')
>
> vi) If dtype and names or formats are specified then use dtype if it is
> of the form ('name', 'format') or use one of the previous cases.
>
> When dtype is None this implies format is None so the format is obtained
> from the data. If names is not True then the names are either from the
> argument or default values.
>
> If names argument is True then the names should be read from the data
> and one of the previous cases apply.
>

I think I agree with this, except I don't think the `format` keyword
is totally necessary.

Basically, I want to leave the behavior as is, but if names is True or
a sequence, then they're never ignored and the dtype is constructed
for the user as "expected".
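
Roughly, the "zip the names with the format" idea could look like the
sketch below.  dtype_from_names is a hypothetical helper I made up for
illustration; it is not part of numpy and not necessarily how the
patch does it.

```python
import numpy as np

def dtype_from_names(names, fmt=float):
    """Hypothetical helper: pair every name with one scalar format."""
    if isinstance(names, str):
        # Accept the 'x,y,z' spelling as well as a sequence of names.
        names = [name.strip() for name in names.split(",")]
    return np.dtype([(name, fmt) for name in names])

dt = dtype_from_names("x,y,z")       # float is the assumed default format
empty = np.zeros(2, dtype=dt)        # structured array with fields x, y, z
print(dt.names)
```

Passing a different fmt would cover the "or whatever is passed to
dtype" case, as long as it is a single scalar type.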

Skipper