[Numpy-discussion] More loadtxt() changes
Pierre GM
pgmdevlist at gmail.com
Tue Nov 25 12:14:46 EST 2008
Ryan,
FYI, over the last couple of weeks I've been coding an extension of
loadtxt with better support for masked data, including an option to
read column names from a header line. Please find an example attached
(I also have unit tests). Most of the work is inspired by matplotlib's
mlab.csv2rec; it might be worth not duplicating efforts.
Cheers,
P.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: _preview.py
Type: text/x-python-script
Size: 16095 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20081125/5db2cde6/attachment.bin>
-------------- next part --------------
On Nov 25, 2008, at 9:46 AM, Ryan May wrote:
> Hi,
>
> I have a couple more changes to loadtxt() that I'd like to code up
> in time for 1.3, but I thought I should run them by the list before
> doing too much work. These are already implemented in some fashion
> in matplotlib.mlab.csv2rec(), but the code bases are different
> enough, that pretty much only the idea can be lifted. All of these
> changes would be done in a manner that is backwards compatible with
> the current API.
>
> 1) Support for setting the names of fields in the returned
> structured array without using dtype. The names could either be
> passed in as a list or read from the first line of the file. Many
> files have a header line that gives a name for each column. Adding
> this would obviously make loadtxt much more general and allow for
> more generic code, IMO. My current thinking is to add a *names*
> keyword parameter that defaults to None (no name handling). Setting
> it to True would tell loadtxt() to read the names from the first
> line (after skiprows); the other option would be to set *names* to
> a list of strings.
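A minimal sketch of how names=True might behave (illustrative only: the helper below is hypothetical, not the proposed loadtxt signature, and it assumes all data columns are floats):

```python
import io
import numpy as np

def load_with_names(fobj, delimiter=','):
    # Hypothetical helper: read field names from the first line, then
    # build a structured array from the remaining rows (all floats here).
    names = [n.strip() for n in fobj.readline().split(delimiter)]
    rows = [tuple(line.strip().split(delimiter))
            for line in fobj if line.strip()]
    dtype = np.dtype([(n, float) for n in names])
    return np.array(rows, dtype=dtype)

data = load_with_names(io.StringIO("Homework1,Homework2\n85,90\n65,99\n"))
# data['Homework1'] is then array([85., 65.])
```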
>
> 2) Support for automatic dtype inference. Instead of assuming all
> values are floats, this would try a list of options until one
> worked. For strings, this would keep track of the longest string
> within a given field before setting the dtype. This would allow
> reading of files containing a mixture of types much more easily,
> without having to go to the trouble of constructing a full dtype by
> hand. This would work alongside any custom converters one passes
> in. My current thinking on the API is simply to add the option of
> passing the string 'auto' as the dtype parameter.
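The inference step described here could look roughly like the following (a sketch, not an actual loadtxt internal): try int, then float, and fall back to a string dtype sized to the longest value in the column.

```python
import numpy as np

def infer_column_dtype(values):
    # Try progressively more general types; fall back to a string
    # dtype wide enough for the longest value in the column.
    for typ in (int, float):
        try:
            for v in values:
                typ(v)
            return np.dtype(typ)
        except ValueError:
            continue
    return np.dtype('U%d' % max(len(v) for v in values))

infer_column_dtype(['1', '2'])    # an integer dtype
infer_column_dtype(['1.5', '2'])  # float64
infer_column_dtype(['a', 'bc'])   # <U2 (longest string is 2 chars)
```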
>
> 3) Better support for missing values. The docstring mentions a way
> of handling missing values by passing in a converter. The problem
> with this is that you have to pass in a converter for *every column*
> that will contain missing values. If you have a text file with 50
> columns, writing this dictionary of converters seems like ugly and
> needless boilerplate. I'm unsure of the best way to pass in both
> which values indicate missing data and which values to fill in
> their place. I'd love suggestions.
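One possible shape for this (the `missing` and `filling` parameter names below are hypothetical, purely for illustration): a single sentinel string and a single fill value applied across all columns, rather than one converter per column.

```python
import numpy as np

def fill_missing(fields, missing='', filling=np.nan):
    # Hypothetical behavior: replace every occurrence of the `missing`
    # sentinel with `filling`, converting everything else to float.
    return [filling if f == missing else float(f) for f in fields]

fill_missing(['85', '90', '', '76', ''])  # nan in the empty slots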
>
> Here's an example of my use case (without 50 columns):
>
> ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
> 1234,Joe,Smith,85,90,,76,
> 5678,Jane,Doe,65,99,,78,
> 9123,Joe,Plumber,45,90,,92,
>
> Currently, reading in this file requires a bit of boilerplate
> (declaring dtypes, converters). While it's nothing I can't write,
> it would still be easier to write it once within loadtxt and have
> it available for everyone.
>
> Any support for *any* of these ideas? Any suggestions on how the
> user should pass in the information?
>
> Thanks,
>
> Ryan
>
> --
> Ryan May
> Graduate Research Assistant
> School of Meteorology
> University of Oklahoma
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion