[Numpy-discussion] More loadtxt() changes

Wed Nov 26 09:02:32 EST 2008

Ryan May wrote:
> Hi,
> 
> I have a couple more changes to loadtxt() that I'd like to code up in time
> for 1.3, but I thought I should run them by the list before doing too much
> work.  These are already implemented in some fashion in
> matplotlib.mlab.csv2rec(), but the code bases are different enough that
> pretty much only the idea can be lifted.  All of these changes would be done
> in a manner that is backwards compatible with the current API.
> 
> 1) Support for setting the names of fields in the returned structured array
> without using dtype.  The names can come from a passed-in list or be read
> from the first line of the file.  Many files have a header
> line that gives a name for each column.  Adding this would obviously make
> loadtxt much more general and allow for more generic code, IMO. My current
> thinking is to add a *names* keyword parameter that defaults to None, for no
> support for reading names.  Setting it to True would tell loadtxt() to read
> the names from the first line (after skiprows).  The other option would be
> to set names to a list of strings.
> 
> 2) Support for automatic dtype inference.  Instead of assuming all values
> are floats, this would try a list of options until one worked.  For strings,
> this would keep track of the longest string within a given field before
> setting the dtype.  This would allow reading of files containing a mixture
> of types much more easily, without having to go to the trouble of
> constructing a full dtype by hand.  This would work alongside any custom
> converters one passes in.  My current thinking of API would just be to add
> the option of passing the string 'auto' as the dtype parameter.
> 
> 3) Better support for missing values.  The docstring mentions a way of
> handling missing values by passing in a converter.  The problem with this is
> that you have to pass in a converter for *every column* that will contain
> missing values.  If you have a text file with 50 columns, writing this
> dictionary of converters seems like ugly and needless boilerplate.  I'm
> unsure of how best to pass in both what values indicate missing values and
> what values to fill in their place.  I'd love suggestions.
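
To make item 2 concrete, here is a minimal sketch of how per-column type
inference might work, assuming a try-int, then-float, then-string order and
assuming the string width is taken from the longest value in the column. The
helper name infer_column_dtype, the field names, and the sample rows are all
made up for illustration; none of this is loadtxt()'s actual code.

```python
import numpy as np

def infer_column_dtype(values):
    """Hypothetical helper: pick int, then float, then a fixed-width string."""
    for caster, dtype in ((int, np.dtype(int)), (float, np.dtype(float))):
        try:
            for v in values:
                caster(v)          # raises ValueError if the text does not parse
            return dtype
        except ValueError:
            continue
    # Fall back to a string wide enough for the longest entry in the column.
    width = max(len(v) for v in values)
    return np.dtype("U%d" % width)

# Made-up sample data: every field starts out as text.
names = ["id", "name", "score"]
rows = [["1234", "Joe", "85.5"], ["5678", "Plumber", "90"]]
columns = list(zip(*rows))

# Build the structured dtype from the inferred per-column types.
dtype = np.dtype([(n, infer_column_dtype(c)) for n, c in zip(names, columns)])

# Fill the structured array one field at a time, converting from text.
table = np.empty(len(rows), dtype=dtype)
for n, c in zip(names, columns):
    table[n] = np.array(c).astype(dtype[n])
```

A real implementation would presumably also have to decide how custom
converters and missing values interact with the inference order, which this
sketch ignores.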

Hi Ryan,
   this would be a great feature to have!

One question: I have a datafile in ASCII format that uses a fixed width 
for each column. If no data is present, the space is left empty (see 
second row). What is the default behavior of the StringConverter class 
in this case? Does it ignore the empty entry by default? If so, what is 
the value in the array in this case? Is it nan?

Example file:

   1| 123.4| -123.4| 00.0
   2|      |  234.7| 12.2
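
To show the behavior I am hoping for (this is only an illustration, not a
claim about what StringConverter actually does): a blank field would come
back as nan. The sketch below assumes that policy and uses a hypothetical
helper, to_float_or_nan, on the two rows above.

```python
import numpy as np

def to_float_or_nan(field):
    """Hypothetical converter: a blank field becomes nan, anything else a float."""
    field = field.strip()
    return float(field) if field else np.nan

lines = [
    "   1| 123.4| -123.4| 00.0",
    "   2|      |  234.7| 12.2",
]

# Split each row on the column separator and convert field by field.
data = np.array([[to_float_or_nan(f) for f in line.split("|")] for line in lines])
print(data)
```

With this policy the empty entry in the second row ends up as nan, and every
other entry is parsed as an ordinary float.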

Manuel

> Here's an example of my use case (without 50 columns):
> 
> ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
> 1234,Joe,Smith,85,90,,76,
> 5678,Jane,Doe,65,99,,78,
> 9123,Joe,Plumber,45,90,,92,
> 
> Currently reading in this file requires a bit of boilerplate (declaring
> dtypes, converters).  While it's nothing I can't write, it still would be
> easier to write it once within loadtxt and have it for everyone.
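
A rough sketch of what "write it once" could look like for this file,
assuming one marker for missing values (the empty string) and one fill value
(nan) applied to every score column, with the column names taken from the
header line. The function read_scores and its parameters score_start, missing,
and fill are hypothetical names chosen for this example, not a proposed
loadtxt() signature.

```python
import csv
import io
import numpy as np

SAMPLE = """ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
1234,Joe,Smith,85,90,,76,
5678,Jane,Doe,65,99,,78,
9123,Joe,Plumber,45,90,,92,
"""

def read_scores(text, score_start=3, missing="", fill=np.nan):
    """Hypothetical reader: header names plus a float array with filled gaps."""
    reader = csv.reader(io.StringIO(text))
    names = next(reader)                      # first line holds the column names
    rows = [row for row in reader if row]     # skip completely blank lines
    scores = np.array([
        [fill if f.strip() == missing else float(f) for f in row[score_start:]]
        for row in rows
    ])
    return names[score_start:], scores

score_names, scores = read_scores(SAMPLE)
```

The point of the sketch is that the missing-value rule is stated once for the
whole file rather than once per column in a dictionary of converters.
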
> 
> Any support for *any* of these ideas?  Any suggestions on how the user
> should pass in the information?
> 
> Thanks,
> 
> Ryan