[SciPy-User] Suggestion for numpy.genfromtxt documentation

Mon Oct 12 11:41:51 EDT 2009

On 10/12/2009 09:13 AM, Dharhas Pothina wrote:
> Hi All,
>
> Before I start I wanted to let all of you know that I really appreciate the work that has gone into genfromtxt. It is a hugely useful function that has become indispensable in my work. A lot of the problems I have in general come from the fact that I am a fairly new python/numpy user and don't always understand some of the intricacies involved.
>
> Just a disclaimer. I am not familiar enough with the way genfromtxt works to have understood the entire discussion that followed my posting, so I'm going to answer the questions I can answer.
>
>    
>>>> Bruce Southey<bsouthey at gmail.com>  10/7/2009 2:20 PM>>>
>>>> What did you actually expect?
>>>> It would be very informative if you could provide a simple example of
>>>> this for testing.
>>>>          
> Coming from a Matlab background, the first thing I would have expected when given an option to read in (or otherwise define) column variables is a structure that tells me the name of each column. In Matlab this would be a variable, say 'a', such that a.header is a list of header names and a.data holds the data in a 2D array, where column 'n' has the data associated with a.header[n].
>
> Now since I've become fairly used to the way python does things, my modified expectation is if I read a file with the data below:
>
> 10.0 20.1 30.7
> 10.0 30.2 40.3
> 20.1 21.3 67.5
> ...
>
> with the command: a = np.genfromtxt(fname,usecols=(0,1,2),names='x,y,z')
>
> I should get a structured array
>
> such that a['x'] = np.array([10.0,10.0,20.1,...])
>
> etc.
>    
See Pierre's comments: genfromtxt outputs either a plain array or a 
structured array, which is what I overlooked. By default genfromtxt 
returns a plain array, which has no named columns, so names has no 
effect. If you want named columns then you have to get genfromtxt to 
return a structured array - see Skipper's examples of when that 
happens.
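
To make the plain versus structured distinction concrete, here is a minimal 
sketch using the three sample rows quoted above. The in-memory buffer and the 
variable names `plain` and `named` are assumptions for illustration only. It 
passes dtype=None explicitly, as discussed in this thread; the NumPy version 
discussed here reportedly ignores names when dtype is left at its default, 
and other releases may behave differently.

```python
import io

import numpy as np

# The three sample rows from the post, held in memory instead of a file.
text = "10.0 20.1 30.7\n10.0 30.2 40.3\n20.1 21.3 67.5\n"

# No names, default dtype: a plain 2D float array with no field names.
plain = np.genfromtxt(io.StringIO(text))

# dtype=None asks genfromtxt to infer a type per column; together with
# names this yields a structured array whose fields are x, y and z.
named = np.genfromtxt(
    io.StringIO(text), usecols=(0, 1, 2), dtype=None, names="x,y,z"
)

print(plain.shape, plain.dtype.names)
print(named.dtype.names)
print(named["x"])
```

With the structured result, each column is reached by name (named['x']), 
whereas the plain array is indexed by position (plain[:, 0]).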


> If you would like a sample data file I can provide one.
>    
Small self-contained examples are always very useful, especially when 
something does not behave as expected.

>>>> There are many combinations of arguments so not all have been tested and
>>>> it is not always clear what the expected behavior should be.
>>>>          
> I think for me the confusion comes from an initial lack of understanding of how dtypes work. If I type help np.genfromtxt in IPython I get:
>
> names : {None, True, string, sequence}, optional
>      If `names` is True, the field names are read from the first valid line
>      after the first `skiprows` lines.
>      If `names` is a sequence or a single-string of comma-separated names,
>      the names will be used to define the field names in a flexible dtype.
>      If `names` is None, the names of the dtype fields will be used, if any.
>
> My understanding of this was that the names argument would be used to define the field names. What I didn't realize is that if the dtype is not explicitly set (or set equal to None), then since all the data in the file are floats, the dtype for the entire array is float rather than each column having its own dtype. So there are no column-specific dtypes whose field names can be set to the values I specified, and the field names I set are ignored (at least that's what I think is happening).
>
> To me the reason for having the 'names' argument is so that there is a mechanism to show what the names of each column are. The fact that it fails silently when the dtype is not specified is what was problematic. So my suggestion was to do one of the following:
>
> 1) add something in the docstring to note that dtype needs to be specified for the names argument to work
> 2) change the way genfromtxt works to default to dtype=None when the 'names' argument is invoked without a dtype being specified
> 3) issue some sort of warning/error
>
>    
>>>>  From the numpy help, there is this example:
>>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
>>>> ('mystring','S5')], delimiter=",")
>>>>
>>>> It does not help that the dtype of structured arrays also includes the
>>>> actual name. So I do not think we can use dtype argument without using
>>>> the combination of dtype and name. Perhaps if dtype is split into names
>>>> and formats so that dtype=('name', 'format').
>>>>          
> I think when I was reading the help, I was immediately drawn to the 'names' argument as the part of the function that would do what I needed it to. It was only a while later that I read through things more completely and worked out the connection to 'dtype', and also the fact that I could specify the field names through the 'dtype' argument as well. To me the combination of dtype=None & names='x,y,z' is more useful because I can give each column a name but let numpy figure out the format automatically without having to specify each column manually.
>
> - dharhas
>
>
>    
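
The two routes to named fields mentioned above (spelling everything out in 
dtype, or combining dtype=None with names) can be compared side by side. 
This is only a sketch under assumptions: the two-column sample data and the 
field names `count` and `value` are made up for illustration and do not come 
from the thread.

```python
import io

import numpy as np

# Hypothetical sample data: an integer column and a float column.
text = "1 2.5\n3 4.5\n"

# Route 1: names and formats are both given through the dtype argument.
explicit = np.genfromtxt(
    io.StringIO(text), dtype=[("count", "i8"), ("value", "f8")]
)

# Route 2: only the names are given; dtype=None lets genfromtxt work out
# the format of each column from the data.
inferred = np.genfromtxt(io.StringIO(text), dtype=None, names="count,value")

print(explicit.dtype)
print(inferred.dtype)
```

Both calls give a structured array with the same field names. The second 
avoids writing a format for every column, at the cost of leaving the exact 
integer and float widths to whatever genfromtxt infers.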
I do agree that the documentation lags well behind the functionality of 
genfromtxt and so gets confusing. But both Skipper's and Pierre's 
comments have cleared up many of these points. The documentation 
needs work, but we also need people to test it and report when things 
do not behave as they expected. If the problem is in the documentation, 
then we can address it there, probably with a special help page on 
using genfromtxt that covers all the different cases Skipper provided.

Bruce