[Numpy-discussion] More loadtxt() changes
Pierre GM
pgmdevlist at gmail.com
Tue Nov 25 12:14:46 EST 2008
Ryan,
FYI, over the last couple of weeks I've been coding an extension of
loadtxt with better support for masked data, including an option to
read column names from a header line. Please find an example attached
(I also have unit tests). Most of the work is inspired by matplotlib's
mlab.csv2rec; it might be worth not duplicating efforts.
Cheers,
P.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: _preview.py
Type: text/x-python-script
Size: 16095 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20081125/5db2cde6/attachment.bin>
-------------- next part --------------
On Nov 25, 2008, at 9:46 AM, Ryan May wrote:
> Hi,
>
> I have a couple more changes to loadtxt() that I'd like to code up
> in time for 1.3, but I thought I should run them by the list before
> doing too much work. These are already implemented in some fashion
> in matplotlib.mlab.csv2rec(), but the code bases are different
> enough, that pretty much only the idea can be lifted. All of these
> changes would be done in a manner that is backwards compatible with
> the current API.
>
> 1) Support for setting the names of fields in the returned
> structured array without using dtype. The names could either be
> passed in as a list or read from the first line of the file. Many
> files have a header line that gives a name for each column. Adding
> this would obviously make loadtxt much more general and allow for
> more generic code, IMO. My current thinking is to add a *names*
> keyword parameter that defaults to None (no name handling). Setting
> it to True would tell loadtxt() to read the names from the first
> line (after skiprows); the other option would be to set *names* to
> a list of strings.
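A minimal sketch of how names=True might behave (illustrative only: the helper below is hypothetical, not the proposed loadtxt signature, and it assumes all data columns are floats):

```python
import io
import numpy as np

def load_with_names(fobj, delimiter=','):
    # Hypothetical helper: read field names from the first line, then
    # build a structured array from the remaining rows (all floats here).
    names = [n.strip() for n in fobj.readline().split(delimiter)]
    rows = [tuple(line.strip().split(delimiter))
            for line in fobj if line.strip()]
    dtype = np.dtype([(n, float) for n in names])
    return np.array(rows, dtype=dtype)

data = load_with_names(io.StringIO("Homework1,Homework2\n85,90\n65,99\n"))
# data['Homework1'] is then array([85., 65.])
```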
>
> 2) Support for automatic dtype inference. Instead of assuming all
> values are floats, this would try a list of options until one
> worked. For strings, this would keep track of the longest string
> within a given field before setting the dtype. This would allow
> reading of files containing a mixture of types much more easily,
> without having to go to the trouble of constructing a full dtype by
> hand. This would work alongside any custom converters one passes
> in. My current thinking on the API is simply to add the option of
> passing the string 'auto' as the dtype parameter.
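The inference step described here could look roughly like the following (a sketch, not an actual loadtxt internal): try int, then float, and fall back to a string dtype sized to the longest value in the column.

```python
import numpy as np

def infer_column_dtype(values):
    # Try progressively more general types; fall back to a string
    # dtype wide enough for the longest value in the column.
    for typ in (int, float):
        try:
            for v in values:
                typ(v)
            return np.dtype(typ)
        except ValueError:
            continue
    return np.dtype('U%d' % max(len(v) for v in values))

infer_column_dtype(['1', '2'])    # an integer dtype
infer_column_dtype(['1.5', '2'])  # float64
infer_column_dtype(['a', 'bc'])   # <U2 (longest string is 2 chars)
```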
>
> 3) Better support for missing values. The docstring mentions a way
> of handling missing values by passing in a converter. The problem
> with this is that you have to pass in a converter for *every column*
> that will contain missing values. If you have a text file with 50
> columns, writing this dictionary of converters seems like ugly and
> needless boilerplate. I'm unsure of the best way to pass in both
> which values indicate missing data and which values to fill in
> their place. I'd love suggestions.
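One possible shape for this (the `missing` and `filling` parameter names below are hypothetical, purely for illustration): a single sentinel string and a single fill value applied across all columns, rather than one converter per column.

```python
import numpy as np

def fill_missing(fields, missing='', filling=np.nan):
    # Hypothetical behavior: replace every occurrence of the `missing`
    # sentinel with `filling`, converting everything else to float.
    return [filling if f == missing else float(f) for f in fields]

fill_missing(['85', '90', '', '76', ''])  # nan in the empty slots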
>
> Here's an example of my use case (without 50 columns):
>
> ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
> 1234,Joe,Smith,85,90,,76,
> 5678,Jane,Doe,65,99,,78,
> 9123,Joe,Plumber,45,90,,92,
>
> Currently, reading in this file requires a bit of boilerplate
> (declaring dtypes, converters). While it's nothing I can't write,
> it would still be easier to write it once within loadtxt and have
> it available for everyone.
>
> Any support for *any* of these ideas? Any suggestions on how the
> user should pass in the information?
>
> Thanks,
>
> Ryan
>
> --
> Ryan May
> Graduate Research Assistant
> School of Meteorology
> University of Oklahoma
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion